# FORMATIVE ASSIGNMENT II: WATER QUALITY MODEL

## 1. Introduction
**Assignment**: Building a Classification Model Using Neural Networks

**Objective:**
Develop a neural network-based classification model using a provided dataset, incorporating multiple optimization techniques and ensuring equitable group contribution.

In this notebook, we take the cleaned and imputed dataset and use it to train, validate, and test a deep learning model.

The key steps we'll cover are:
1. Loading the preprocessed (imputed) dataset.
2. Separating features and the target variable.
3. Splitting the dataset into three distinct portions: training, validation, and testing sets. This is crucial for robust model development and evaluation.
4. Applying feature scaling (StandardScaler) correctly after the split to prevent data leakage.
5. Building, compiling, and training the neural network with regularization, dropout, and early stopping.

**Note:** The data cleaning and imputation steps were performed in a previous notebook. 
If you'd like to review that process, please refer to: [Data Preprocessing Notebook](data_preprocessing.ipynb).

**Model Details**


| Engineer Name | Regularizer | Optimizer | Early Stopping | Dropout Rate | Learning Rate |
| ----------------------- | ------------------------------------- | --------- | ---------------------------------------------------- | ------------ | ------------- |
| Christian Iradukunda B. | L2 kernel (0.03) + L1 activity (0.03) | Nadam | Monitors `val_loss`, patience 15, best weights restored | 0.15 | 0.06 |



## 2. Feature Scaling
### 2.1 Declaring Utility Functions

In [1]:
# Import necessary libraries
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l1, l2
from tensorflow.keras.optimizers import Nadam
from tensorflow.keras.callbacks import EarlyStopping



In [2]:
def load_data(file_path: str) -> pd.DataFrame:
    """
    Loads a CSV file into a pandas DataFrame.

    Args:
        file_path (str): The path to the CSV file.

    Returns:
        pd.DataFrame: The loaded DataFrame.

    Raises:
        TypeError: If file_path is not a string.
        FileNotFoundError: If the file specified by file_path is not found.
        Exception: For other pandas-related read errors.
    """
    if not isinstance(file_path, str):
        raise TypeError("file_path must be a string.")
    try:
        df = pd.read_csv(file_path)
        print(f"Successfully loaded dataset from: {file_path}")
        return df
    except FileNotFoundError as e:
        # Chain the original exception so the full traceback is preserved
        raise FileNotFoundError(f"Error: The file '{file_path}' was not found.") from e
    except Exception as e:
        raise Exception(f"Error reading CSV file '{file_path}': {e}") from e

In [3]:
# Display the first few rows of the dataset 
# We verify the columns, data types, and ensure that our imputation worked 
# All columns should have non-null counts matching the total number of entries.
def display_initial_info(df: pd.DataFrame):
    """
    Displays the first few rows and basic info of a DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame to inspect.

    Raises:
        TypeError: If df is not a pandas DataFrame.
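        ValueError: If df is None.
    """
    # Validate the input so the function raises what its docstring promises
    if df is None:
        raise ValueError("df cannot be None.")
    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"df must be a pandas DataFrame. Got {type(df)}.")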
    print("===============First 5 rows:===============")
    print(df.head())
    print("\n===============Information:===============")
    df.info()

In [4]:
def separate_features_target(df: pd.DataFrame, target_column: str) -> tuple[pd.DataFrame, pd.Series]:
    """
    Separates features and the target variable from a DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame to separate.
        target_column (str): The name of the target column.

    Returns:
        tuple[pd.DataFrame, pd.Series]: A tuple containing features (X) and target (y).

    Raises:
        TypeError: If df is not a DataFrame or target_column is not a string.
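        ValueError: If df is None or target_column is not found in df.columns.
    """
    # Validate the inputs so the function raises what its docstring promises
    if df is None:
        raise ValueError("df cannot be None.")
    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"df must be a pandas DataFrame. Got {type(df)}.")
    if not isinstance(target_column, str):
        raise TypeError("target_column must be a string.")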
    if target_column not in df.columns:
        raise ValueError(f"Target column '{target_column}' not found in DataFrame columns: {df.columns.tolist()}")

    X = df.drop(columns=[target_column])
    y = df[target_column]
    print(f"\nFeatures (X) and target (y, column: '{target_column}') have been separated.")
    print(f"Shape of features (X): {X.shape}")
    print(f"Shape of target (y): {y.shape}")
    return X, y


In [5]:
def split_data(X: pd.DataFrame, y: pd.Series,
               test_size: float = 0.15,
               val_relative_to_train_val_size: float = 0.15 / 0.85,
               random_state: int = 42,
               stratify_data: bool = True) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.Series, pd.Series, pd.Series]:
    """
    Splits feature and target data into training, validation, and test sets.

    Args:
        X (pd.DataFrame): Features.
        y (pd.Series): Target variable.
        test_size (float): Proportion of the dataset to allocate to the test set.
        val_relative_to_train_val_size (float): Proportion of the (train+validation) set to allocate to validation.
        random_state (int): Seed for random number generator for reproducibility.
        stratify_data (bool): Whether to stratify the split based on the target variable y.

    Returns:
        tuple: X_train, X_val, X_test, y_train, y_val, y_test

    Raises:
        TypeError: If X is not a DataFrame or y is not a Series.
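        ValueError: If X or y is None, or if shapes are incompatible.
    """
    # Validate the inputs so the function raises what its docstring promises
    if X is None or y is None:
        raise ValueError("X and y cannot be None.")
    if not isinstance(X, pd.DataFrame):
        raise TypeError(f"X must be a pandas DataFrame. Got {type(X)}.")
    if not isinstance(y, pd.Series):
        raise TypeError(f"y must be a pandas Series. Got {type(y)}.")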
    if len(X) != len(y):
        raise ValueError(f"Shape mismatch: X has {len(X)} samples, y has {len(y)} samples.")

    stratify_option_main = y if stratify_data else None
    X_train_val, X_test, y_train_val, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=stratify_option_main
    )

    stratify_option_tv = y_train_val if stratify_data else None
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_val, y_train_val, test_size=val_relative_to_train_val_size,
        random_state=random_state, stratify=stratify_option_tv
    )

    print("\nData splitting completed.")
    return X_train, X_val, X_test, y_train, y_val, y_test

In [6]:
def scale_features(X_train: pd.DataFrame, X_val: pd.DataFrame, X_test: pd.DataFrame) -> tuple[StandardScaler, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Fits StandardScaler on X_train and transforms X_train, X_val, X_test.

    Args:
        X_train (pd.DataFrame): Training features.
        X_val (pd.DataFrame): Validation features.
        X_test (pd.DataFrame): Test features.

    Returns:
        tuple: The fitted scaler object, and the scaled DataFrames (X_train_scaled, X_val_scaled, X_test_scaled).

    Raises:
        TypeError: If any input is not a pandas DataFrame.
        ValueError: If any input DataFrame is None.
    """
    for df_name, df_obj in [("X_train", X_train), ("X_val", X_val), ("X_test", X_test)]:
        # Check for None first; otherwise the isinstance check would mask it with a TypeError
        if df_obj is None:
            raise ValueError(f"Input DataFrame '{df_name}' cannot be None.")
        if not isinstance(df_obj, pd.DataFrame):
            raise TypeError(f"Input '{df_name}' must be a pandas DataFrame. Got {type(df_obj)}.")

    scaler = StandardScaler()

    print("\nFitting StandardScaler on X_train...")
    # Ensure X_train has data (not empty) before fitting
    if X_train.empty:
        raise ValueError("X_train is empty, cannot fit StandardScaler.")
    scaler.fit(X_train) # Fit ONLY on training data
    print("Scaler fitted.")

    print("Transforming X_train, X_val, and X_test...")
    X_train_scaled_array = scaler.transform(X_train)
    X_val_scaled_array = scaler.transform(X_val)
    X_test_scaled_array = scaler.transform(X_test)
    print("Transformation complete.")

    # Convert back to DataFrames
    X_train_scaled = pd.DataFrame(X_train_scaled_array, columns=X_train.columns, index=X_train.index)
    X_val_scaled = pd.DataFrame(X_val_scaled_array, columns=X_val.columns, index=X_val.index)
    X_test_scaled = pd.DataFrame(X_test_scaled_array, columns=X_test.columns, index=X_test.index)
    print("Scaled data converted back to DataFrames.")

    return scaler, X_train_scaled, X_val_scaled, X_test_scaled
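
To make the leakage point concrete, here is a minimal standard-library sketch of what "fit on the training split only" means. It mirrors the z-score formula that `StandardScaler` applies, z = (x - mean) / std with the population standard deviation. The helper names `fit_standardizer` and `standardize` and all numbers are made up for illustration and are not part of this notebook.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Return (mean, std) computed from the training values only."""
    mean = fmean(train_values)
    std = pstdev(train_values)  # population std (ddof=0)
    if std == 0:
        std = 1.0  # guard against division by zero for constant features
    return mean, std


def standardize(values, mean, std):
    """Apply z = (x - mean) / std using statistics fitted elsewhere."""
    return [(v - mean) / std for v in values]


# Made-up values standing in for one feature column
train_ph = [6.0, 7.0, 8.0, 9.0]
val_ph = [7.5, 10.0]

# Statistics come from the training split and are reused for validation
mean, std = fit_standardizer(train_ph)
train_scaled = standardize(train_ph, mean, std)
val_scaled = standardize(val_ph, mean, std)

print("train mean/std:", round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
print("val mean:", round(fmean(val_scaled), 6))
```

The scaled training values have mean 0 and standard deviation 1, while the validation values generally do not. That is expected: the validation and test sets are scaled with training statistics, so no information from them reaches the scaler.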



### 2.2 Feature Scaling Implementation

In [7]:
# Load the dataset that was cleaned and imputed in the previous notebook.
# Using this version ensures that no missing values remain.
# Path to the imputed dataset
FILE_PATH_IMPUTED = "../data/imputed_water_potability_data.csv"

# Load the dataset
df = load_data(FILE_PATH_IMPUTED)

# Display the overview of the dataset
display_initial_info(df)


Successfully loaded dataset from: ../data/imputed_water_potability_data.csv
         ph    Hardness        Solids  Chloramines     Sulfate  Conductivity  \
0  7.036752  204.890455  20791.318981     7.300212  368.516441    564.308654   
1  3.716080  129.422921  18630.057858     6.635246  333.073546    592.885359   
2  8.099124  224.236259  19909.541732     9.275884  333.073546    418.606213   
3  8.316766  214.373394  22018.417441     8.059332  356.886136    363.266516   
4  9.092223  181.101509  17978.986339     6.546600  310.135738    398.410813   

   Organic_carbon  Trihalomethanes  Turbidity  Potability  
0       10.379783        86.990970   2.963135           0  
1       15.180013        56.329076   4.500656           0  
2       16.868637        66.420093   3.055934           0  
3       18.436524       100.341674   4.628771           0  
4       11.558279        31.997993   4.075075           0  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data col

In [8]:
# Separate features and target variable
X, y = separate_features_target(df, "Potability")


Features (X) and target (y, column: 'Potability') have been separated.
Shape of features (X): (3276, 9)
Shape of target (y): (3276,)


In [9]:
# Initialize split variables to None
X_train, X_val, X_test, y_train, y_val, y_test = [None] * 6

if X is not None and y is not None:
    try:
        X_train, X_val, X_test, y_train, y_val, y_test = split_data(X, y, stratify_data=True)

        # Display proportions
        print(f"Shape of X_train: {X_train.shape}, y_train: {y_train.shape}")
        print(f"Shape of X_val: {X_val.shape}, y_val: {y_val.shape}")
        print(f"Shape of X_test: {X_test.shape}, y_test: {y_test.shape}")

        total_samples = len(X)
        print("\nProportion of samples in each set (approximate):")
        print(f"Training set: {len(X_train)/total_samples*100:.2f}%")
        print(f"Validation set: {len(X_val)/total_samples*100:.2f}%")
        print(f"Test set: {len(X_test)/total_samples*100:.2f}%")

    except TypeError as e:
        # Re-raise with the same exception type so callers can tell the two cases apart
        raise TypeError(f"TypeError during data splitting: {e}") from e
    except ValueError as e:
        raise ValueError(f"ValueError during data splitting: {e}") from e
else:
    raise ValueError("X or y is None.")


Data splitting completed.
Shape of X_train: (2292, 9), y_train: (2292,)
Shape of X_val: (492, 9), y_val: (492,)
Shape of X_test: (492, 9), y_test: (492,)

Proportion of samples in each set (approximate):
Training set: 69.96%
Validation set: 15.02%
Test set: 15.02%


In [10]:
scaler_object, X_train_scaled, X_val_scaled, X_test_scaled = [None] * 4

if X_train is not None and X_val is not None and X_test is not None:
    try:
        scaler_object, X_train_scaled, X_val_scaled, X_test_scaled = scale_features(X_train, X_val, X_test)

        print("\n--- Verification of Scaled Training Data ---")
        display_initial_info(X_train_scaled)

        print("\nDescriptive statistics of X_train_scaled (mean ~0, std ~1):")
        print(X_train_scaled.describe().round(2))

        print("\nWorkflow complete. Data is prepared for model training.")
        print("Prepared data sets:")
        print(f"X_train_scaled: {X_train_scaled.shape}, y_train: {y_train.shape if y_train is not None else 'N/A'}")
        print(f"X_val_scaled: {X_val_scaled.shape}, y_val: {y_val.shape if y_val is not None else 'N/A'}")
        print(f"X_test_scaled: {X_test_scaled.shape}, y_test: {y_test.shape if y_test is not None else 'N/A'}")

    except (TypeError, ValueError) as e:
        print(f"Error during feature scaling: {e}")
else:
    print("Skipping feature scaling as data splits are not available.")


Fitting StandardScaler on X_train...
Scaler fitted.
Transforming X_train, X_val, and X_test...
Transformation complete.
Scaled data converted back to DataFrames.

--- Verification of Scaled Training Data ---
            ph  Hardness    Solids  Chloramines   Sulfate  Conductivity  \
2766  0.292120  0.027566  0.522327     0.038376 -1.781081     -0.634797   
2505 -0.104021  0.816164 -0.481655     0.540023 -0.954435      0.033630   
163  -0.647064 -1.665957 -1.420733     0.308547 -1.555323     -0.140706   
43    1.961394  0.193955 -1.379953    -0.160524  0.108557     -1.163968   
2040 -0.022284  1.805507 -0.796542     0.846080  0.798238      0.565539   

      Organic_carbon  Trihalomethanes  Turbidity  
2766       -0.341967         0.209553  -0.180315  
2505       -0.619904         0.161566  -1.631684  
163         0.672902        -0.993148   0.866453  
43          2.973166         0.352298   0.939613  
2040        0.099535        -1.655482  -0.941722  

<class 'pandas.core.frame.DataFra

## 3. Building and Training a Model

### 3.1. Model Definition

In [34]:
# Define a function to create a Keras Sequential model with specified hyperparameters
def create_model(input_shape: tuple,
                 dropout_rate: float = 0.15,
                 l1_reg: float = 0.006,
                 l2_reg: float = 0.006) -> Sequential:
    """
    Creates and returns a Keras Sequential model with the specified hyperparameters.

    Args:
        input_shape (tuple): The shape of the input data (number of features,).
        dropout_rate (float): The rate for the Dropout layer.
        l1_reg (float): The factor for the L1 activity regularizer.
        l2_reg (float): The factor for the L2 kernel regularizer.

    Returns:
        Sequential: The uncompiled Keras model (compilation happens in train_model).
    """
    # Initialize the regularizers
    kernel_regularizer = l2(l2_reg)
    activity_regularizer = l1(l1_reg)

    # Define the model architecture
    model = Sequential([
        # Input Layer and First Hidden Layer
        Dense(64, activation='relu', input_shape=input_shape, name='hidden_layer_1'),

        # Second Hidden Layer with specified Regularizers
        Dense(128, activation='relu',
              kernel_regularizer=kernel_regularizer,
              activity_regularizer=activity_regularizer,
              name='hidden_layer_2_with_regularizers'),
        
        # Dropout Layer to prevent overfitting
        Dropout(dropout_rate, name='dropout_layer'),

        # Third Hidden Layer
        Dense(64, activation='relu', name='hidden_layer_3'),

        # Output Layer
        Dense(1, activation='sigmoid', name='output_layer')
    ])

    print("\nModel architecture created successfully.")
    model.summary() # Print a summary of the model
    return model
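
To show what the two regularizers in `hidden_layer_2_with_regularizers` add to the loss, here is a small pure-Python sketch of the penalty terms. The L2 kernel regularizer contributes `l2 * sum(w^2)` over the layer's weights, and the L1 activity regularizer contributes `l1 * sum(|a|)` over the layer's outputs, which Keras divides by the batch size according to its documentation. The helper names and toy numbers are illustrative assumptions, not values taken from the trained model.

```python
def l2_kernel_penalty(weights, factor):
    """factor * sum of squared weights, as an L2 kernel regularizer computes."""
    return factor * sum(w * w for row in weights for w in row)


def l1_activity_penalty(activations, factor):
    """factor * sum of absolute activations, divided by the batch size."""
    batch_size = len(activations)
    return factor * sum(abs(a) for row in activations for a in row) / batch_size


toy_weights = [[1.0, -2.0], [0.5, 0.0]]      # made-up 2x2 kernel
toy_activations = [[1.0, 0.0], [2.0, 3.0]]   # made-up layer outputs for a batch of 2

kernel_term = l2_kernel_penalty(toy_weights, factor=0.03)
activity_term = l1_activity_penalty(toy_activations, factor=0.03)
print(f"L2 kernel penalty:   {kernel_term:.4f}")
print(f"L1 activity penalty: {activity_term:.4f}")
print(f"Added to the loss:   {kernel_term + activity_term:.4f}")
```

Because these penalties are added to the reported `loss`, the training loss can sit above the plain binary cross-entropy, particularly in early epochs when weights and activations are still large.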

In [37]:
# Define a function to compile and train the model
def train_model(model: Sequential,
                X_train: pd.DataFrame,
                y_train: pd.Series,
                X_val: pd.DataFrame,
                y_val: pd.Series,
                learning_rate: float = 0.006,
                patience: int = 8,
                epochs: int = 150,
                batch_size: int = 32) -> tf.keras.callbacks.History:
    """
    Compiles and trains the Keras model.

    Args:
        model (Sequential): The Keras model to train.
        X_train, y_train: Training data and labels.
        X_val, y_val: Validation data and labels.
        learning_rate (float): The learning rate for the Nadam optimizer.
        patience (int): Number of epochs with no improvement after which training will be stopped.
        epochs (int): The maximum number of epochs to train for.
        batch_size (int): The number of samples per gradient update.

    Returns:
        tf.keras.callbacks.History: Object containing the history of the training process.
    """
    # 1. Set up the EarlyStopping callback
    # We monitor 'val_loss' and stop if it doesn't improve for 'patience' epochs.
    # restore_best_weights=True ensures the model weights are reset to the best ones found.
    early_stopping_callback = EarlyStopping(
        monitor='val_loss',
        patience=patience,
        verbose=1,
        restore_best_weights=True
    )

    # 2. Compile the model
    # We use the Nadam optimizer with the specified learning rate.
    model.compile(
        optimizer=Nadam(learning_rate=learning_rate),
        loss='binary_crossentropy',
        metrics=['accuracy', tf.keras.metrics.AUC(name='auc')] # Tracking accuracy and AUC
    )
    print("\nModel compiled successfully.")

    # 3. Train (fit) the model
    print("Starting model training...")
    history = model.fit(
        X_train,
        y_train,
        validation_data=(X_val, y_val),
        epochs=epochs,
        batch_size=batch_size,
        callbacks=[early_stopping_callback],
        verbose=1 # Set to 1 to see progress bar, 2 for one line per epoch, 0 for silent
    )
    print("Model training complete.")
    return history
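
The patience logic can be sketched in a few lines of plain Python. This is a simplified model of the `EarlyStopping` behaviour described in the comments above (monitoring `val_loss`, stopping after `patience` epochs without improvement). It ignores options such as `min_delta`, and the function name `simulate_early_stopping` and the loss values are made up for illustration.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch); stop_epoch is None if training runs to the end."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Made-up validation losses: the best value occurs at epoch 4
losses = [0.70, 0.68, 0.69, 0.67, 0.68, 0.69, 0.70]
stop_epoch, best_epoch = simulate_early_stopping(losses, patience=3)
print(f"Stopped at epoch {stop_epoch}; best epoch was {best_epoch}")
```

In this toy run, training stops at epoch 7 after three epochs without improvement. With `restore_best_weights=True`, the model would keep the weights from epoch 4 rather than those from the final epoch.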



### 3.2. Create and Train the Model

In [40]:
# Check if the necessary data from preprocessing steps is available
if 'X_train_scaled' in locals() and X_train_scaled is not None:
    try:
        # Define Hyperparameters
        L1_VALUE = 0.03
        L2_VALUE = 0.03
        DROPOUT_RATE = 0.15
        LEARNING_RATE = 0.06
        PATIENCE = 15
        EPOCHS = 150 # Max epochs
        BATCH_SIZE = 32

        # Create the Model
        input_shape = (X_train_scaled.shape[1],)
        model = create_model(
            input_shape=input_shape,
            dropout_rate=DROPOUT_RATE,
            l1_reg=L1_VALUE,
            l2_reg=L2_VALUE
        )

        # Train the Model
        history = train_model(
            model=model,
            X_train=X_train_scaled,
            y_train=y_train,
            X_val=X_val_scaled,
            y_val=y_val,
            learning_rate=LEARNING_RATE,
            patience=PATIENCE,
            epochs=EPOCHS,
            batch_size=BATCH_SIZE
        )
        
        print("\nModel training workflow has been executed.")

    except NameError as e:
        print(f"A required variable is not defined: {e}")
        print("Please ensure the data preprocessing steps have been run successfully.")
    except Exception as e:
        print(f"An unexpected error occurred during model training: {e}")
else:
    print("Scaled training data ('X_train_scaled') not found. Skipping model building and training.")
    print("Please ensure the previous notebook steps (loading, splitting, scaling) have been run.")


Model architecture created successfully.


Model compiled successfully.
Starting model training...
Epoch 1/150
72/72 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.6148 - auc: 0.4992 - loss: 2.9252 - val_accuracy: 0.6098 - val_auc: 0.5000 - val_loss: 0.6719
Epoch 2/150
72/72 ━━━━━━━━━━━━━━━━━━━━ 0s 835us/step - accuracy: 0.5898 - auc: 0.4930 - loss: 0.6799 - val_accuracy: 0.6098 - val_auc: 0.5000 - val_loss: 0.6712
Epoch 3/150
72/72 ━━━━━━━━━━━━━━━━━━━━ 0s 830us/step - accuracy: 0.6071 - auc: 0.4594 - loss: 0.6734 - val_accuracy: 0.6098 - val_auc: 0.5000 - val_loss: 0.6699
Epoch 4/150
72/72 ━━━━━━━━━━━━━━━━━━━━ 0s 825us/step - accuracy: 0.6086 - auc: 0.4855 - loss: 0.6705 - val_accuracy: 0.6098 - val_auc: 0.5000 - val_loss: 0.6694
Epoch 5/150
72/72 ━━━━━━━━━━━━━━━━━━━━ 0s 813us/step - accuracy: 0.6205 - auc: 0.4815 - loss: 0.6652 - val_accuracy: 0.6098 - val_a