### **"dl4td_keras_tuner_1"**
-----

#### **Libraries Used:**
- pandas
- scikit-learn
- TensorFlow (`keras` and `keras.layers` imported from `tensorflow`)
- Keras Tuner (`RandomSearch`)

#### **Data Preparation:**
- Loaded dataset from "../data/drinking_water_potability.csv".
- Imported custom module "data_cleaning" from "../modules".
- Cleaned data using function "clean_data" from "data_cleaning" module.
- Checked for missing values (none remain after cleaning).
- Split data into features (all columns except the last) and labels (the last column, Potability).
- Split data into train and test sets (test_size=0.2, random_state=42).
- Standardized features using StandardScaler, fitted on the training set only and then applied to the test set.

#### **Model Building:**
- Defined function "build_model" to construct the model architecture from Keras Tuner's hyperparameters.
- Built a Sequential model with a variable number of dense layers (2 to 20), each with 32 to 512 units in steps of 32.
- Used ReLU activation function for hidden layers and sigmoid activation function for the output layer.
- Compiled the model with binary crossentropy loss function and Adam optimizer with varying learning rates (1e-2, 1e-3, 1e-4).

#### **Hyperparameter Tuning:**
- Utilized RandomSearch tuner to search for the best model based on validation accuracy.
- Specified maximum trials (5) and executions per trial (3).
- Saved results in the directory 'my_dir' under the project name 'keras_tuner_1'.

#### **Model Training:**
- Conducted a search for the best model using the training data (X_train, y_train).
- Trained each model for 5 epochs and evaluated it on (X_test, y_test), which is passed to the tuner as validation data.

#### **Results Summary:**
- To view a summary of the best results, use the `results_summary()` method of the tuner object.

-----------

## Load libraries

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras
from tensorflow.keras import layers
from kerastuner.tuners import RandomSearch


In [13]:
df = pd.read_csv("../data/drinking_water_potability.csv")
df

| Unnamed: 0 | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.080795 | 204.890456 | 20791.31898 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.990970 | 2.963135 | 0 |
| 1 | 3.716080 | 129.422921 | 18630.05786 | 6.635246 | 333.775777 | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.54173 | 9.275884 | 333.775777 | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.41744 | 8.059332 | 356.886136 | 363.266516 | 18.436525 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.98634 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3271 | 4.668102 | 193.681736 | 47580.99160 | 7.166639 | 359.948574 | 526.424171 | 13.894419 | 66.687695 | 4.435821 | 1 |
| 3272 | 7.808856 | 193.553212 | 17329.80216 | 8.061362 | 333.775777 | 392.449580 | 19.903225 | 66.396293 | 2.798243 | 1 |
| 3273 | 9.419510 | 175.762646 | 33155.57822 | 7.350233 | 333.775777 | 432.044783 | 11.039070 | 69.845400 | 3.298875 | 1 |
| 3274 | 5.126763 | 230.603758 | 11983.86938 | 6.303357 | 333.775777 | 402.883113 | 11.168946 | 77.488213 | 4.708658 | 1 |


## Data Cleaning

In [16]:
import sys
sys.path.append("../modules")  # Make the custom modules directory importable

# Import the custom cleaning module
import data_cleaning

# Clean the raw DataFrame
df_cleang = data_cleaning.clean_data(df)


In [17]:
# Check for missing values
df_cleang.isnull().sum()

ph                 0
Hardness           0
Solids             0
Chloramines        0
Sulfate            0
Conductivity       0
Organic_carbon     0
Trihalomethanes    0
Turbidity          0
Potability         0
dtype: int64

## Model implementation

In [18]:
# Define features and labels
features = df_cleang.columns[:-1]
labels = df_cleang.columns[-1]

In [19]:
# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df_cleang[features], df_cleang[labels], test_size=0.2, random_state=42)

In [26]:
# Initialize StandardScaler
scaler = StandardScaler()

# Scale features
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
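
As a rough illustration of what the scaling step computes, the sketch below standardizes a single feature column in plain Python. It assumes `StandardScaler`'s default behavior (subtract the training mean, divide by the population standard deviation). The helper names `fit_standardizer` and `standardize` and the sample values are made up for this example.

```python
import statistics

def fit_standardizer(train_column):
    """Return (mean, std) estimated from the training column only."""
    mean = statistics.fmean(train_column)
    # Population standard deviation (divide by n), assumed to match StandardScaler.
    std = statistics.pstdev(train_column)
    return mean, std

def standardize(column, mean, std):
    """Apply z-scoring with statistics estimated from the training data."""
    return [(x - mean) / std for x in column]

# Hypothetical values for a single feature (not taken from the dataset).
train_ph = [6.0, 7.0, 8.0]
test_ph = [7.0, 9.0]

mean, std = fit_standardizer(train_ph)
train_scaled = standardize(train_ph, mean, std)
test_scaled = standardize(test_ph, mean, std)  # reuses the training statistics
```

Estimating the statistics on the training split alone keeps information about the test split out of preprocessing, which is why the notebook calls `fit_transform` on `X_train` but only `transform` on `X_test`.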


In [27]:
# Build the model; depth, layer widths, and learning rate are tunable hyperparameters
def build_model(hp):
    model = keras.Sequential()
    for i in range(hp.Int('num_layers', 2, 20)):
        model.add(layers.Dense(units=hp.Int(f'units_{i}',
                                            min_value=32,
                                            max_value=512,
                                            step=32),
                               activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(
        optimizer=keras.optimizers.Adam(
            hp.Choice('learning_rate', [1e-2, 1e-3, 1e-4])),
        loss='binary_crossentropy',
        metrics=['accuracy'])
    return model
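
To make the search space concrete, the following sketch draws one random configuration from the same ranges that `build_model` declares, using only the standard library. It is a conceptual illustration, not Keras Tuner's actual sampling code; `sample_config` and the fixed seed are assumptions for the example.

```python
import random

UNIT_CHOICES = list(range(32, 513, 32))   # 32, 64, ..., 512
LEARNING_RATES = [1e-2, 1e-3, 1e-4]

def sample_config(rng):
    """Draw one configuration from the ranges used in build_model."""
    num_layers = rng.randint(2, 20)  # inclusive on both ends
    return {
        "num_layers": num_layers,
        # One width per hidden layer that the model would actually contain.
        "units": [rng.choice(UNIT_CHOICES) for _ in range(num_layers)],
        "learning_rate": rng.choice(LEARNING_RATES),
    }

rng = random.Random(0)  # fixed seed so the draw is repeatable
config = sample_config(rng)
print(config["num_layers"], config["units"], config["learning_rate"])
```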


In [28]:
# Configure a random-search tuner that maximizes validation accuracy
tuner = RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=5,
    executions_per_trial=3,
    directory='my_dir',
    project_name='keras_tuner_1')
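
With `max_trials=5` and `executions_per_trial=3`, the search trains up to 15 models in total. The sketch below shows one plausible way a trial's score could be derived from its repeated runs: take the best validation accuracy of each execution and average them. This aggregation rule is an assumption about Keras Tuner's behavior rather than something stated in the notebook, and the accuracy values are invented.

```python
from statistics import fmean

MAX_TRIALS = 5
EXECUTIONS_PER_TRIAL = 3

def trial_score(val_accuracy_histories):
    """Average the best per-epoch validation accuracy over all executions."""
    best_per_execution = [max(history) for history in val_accuracy_histories]
    return fmean(best_per_execution)

# Invented per-epoch validation accuracies for the three executions of one trial.
histories = [
    [0.60, 0.64, 0.66],
    [0.61, 0.67, 0.65],
    [0.59, 0.63, 0.68],
]

total_training_runs = MAX_TRIALS * EXECUTIONS_PER_TRIAL
score = trial_score(histories)
```

Repeating each trial reduces the influence of random weight initialization on the comparison between configurations, at the cost of three times the training time.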

In [30]:
# Summarize the hyperparameter search space
tuner.search_space_summary()

Search space summary
Default search space size: 4
num_layers (Int)
{'default': None, 'conditions': [], 'min_value': 2, 'max_value': 20, 'step': 1, 'sampling': 'linear'}
units_0 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 512, 'step': 32, 'sampling': 'linear'}
units_1 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 512, 'step': 32, 'sampling': 'linear'}
learning_rate (Choice)
{'default': 0.01, 'conditions': [], 'values': [0.01, 0.001, 0.0001], 'ordered': True}
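
The summary lists only four hyperparameters. A likely explanation is that the space is discovered by building the model once with default values: `num_layers` then takes its minimum of 2, so only `units_0` and `units_1` are registered at this point. The sketch below reproduces that count and, for scale, computes the number of distinct configurations in the full space. The function names are illustrative and not part of Keras Tuner.

```python
UNIT_CHOICES = range(32, 513, 32)      # 16 possible widths per layer
LEARNING_RATES = (1e-2, 1e-3, 1e-4)

def registered_names(num_layers):
    """Hyperparameter names that build_model touches for a given depth."""
    names = ["num_layers"]
    names += [f"units_{i}" for i in range(num_layers)]
    names.append("learning_rate")
    return names

def total_configurations(min_layers=2, max_layers=20):
    """Count distinct (depth, widths, learning rate) combinations."""
    per_depth = (len(UNIT_CHOICES) ** n for n in range(min_layers, max_layers + 1))
    return len(LEARNING_RATES) * sum(per_depth)

initial_space = registered_names(num_layers=2)  # depth used for the first build
print(len(initial_space))          # 4
print(total_configurations())
```

Five trials therefore cover only a tiny fraction of the possible configurations.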


In [31]:
# Run the search: each model trains for 5 epochs and is scored on (X_test, y_test)
tuner.search(X_train, y_train,
             epochs=5,
             validation_data=(X_test, y_test))

Trial 5 Complete [00h 00m 16s]
val_accuracy: 0.6661101778348287

Best val_accuracy So Far: 0.6805787483851115
Total elapsed time: 00h 01m 35s


## Results

In [32]:
# Show summary of results 
tuner.results_summary()

Results summary
Results in my_dir\keras_tuner_1
Showing 10 best trials
Objective(name="val_accuracy", direction="max")

Trial 1 summary
Hyperparameters:
num_layers: 4
units_0: 448
units_1: 352
learning_rate: 0.001
units_2: 160
units_3: 416
units_4: 192
units_5: 480
units_6: 128
units_7: 32
units_8: 384
units_9: 448
units_10: 64
units_11: 288
units_12: 384
units_13: 416
units_14: 96
units_15: 480
units_16: 512
units_17: 128
units_18: 192
units_19: 64
Score: 0.6805787483851115
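
Trial 1 reports `num_layers: 4` yet lists widths up to `units_19`. Because `build_model` only reads `units_0` through `units_3` when `num_layers` is 4, the remaining entries do not affect the model that was trained. A small helper (hypothetical, not part of Keras Tuner) can filter a trial's values down to the ones actually used:

```python
def active_hyperparameters(values):
    """Keep only the entries that build_model reads for this trial."""
    num_layers = values["num_layers"]
    active = {"num_layers": num_layers, "learning_rate": values["learning_rate"]}
    for i in range(num_layers):
        active[f"units_{i}"] = values[f"units_{i}"]
    return active

# Subset of the Trial 1 values reported above.
trial_1 = {
    "num_layers": 4,
    "learning_rate": 0.001,
    "units_0": 448,
    "units_1": 352,
    "units_2": 160,
    "units_3": 416,
    "units_4": 192,  # sampled but never read, because num_layers is 4
    "units_5": 480,
}

print(active_hyperparameters(trial_1))
```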

Trial 2 summary
Hyperparameters:
num_layers: 10
units_0: 128
units_1: 96
learning_rate: 0.001
units_2: 480
units_3: 480
units_4: 160
units_5: 128
units_6: 320
units_7: 384
units_8: 224
units_9: 256
units_10: 320
units_11: 224
units_12: 256
units_13: 160
units_14: 128
units_15: 128
units_16: 288
units_17: 96
units_18: 320
units_19: 320
Score: 0.6688926021258036

Trial 4 summary
Hyperparameters:
num_layers: 7
units_0: 480
units_1: 192
learning_rate: 0.0001
units_2: 192
units_3: 32
units_4: 384
units_5: 512
units_6: 