## Machine Learning, Artificial Neural Networks and Deep Learning - Part 2
### Exam Solution

**Student:** Pablo Rimoldi
**ID Number:** 535345

# 0 - Introduction

This section introduces the dataset loading process, utilizing the requests library to download the necessary data from the source.

In [None]:
import requests
import zipfile
import pickle as pk

url = "https://frasca.di.unimi.it/MLDNN/input_data.zip"

response = requests.get(url)
with open("data.zip", "wb") as f:
    f.write(response.content)

with zipfile.ZipFile("data.zip", 'r') as zip_ref:
     zip_ref.extractall("unzipped_data")

with open("unzipped_data/input_data.pkl", "rb") as f:
     df = pk.load(f)


In [None]:
%pip install tensorflow pydot graphviz lime

import os
import random
import string
import itertools
import unicodedata
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import plot_model, to_categorical

import keras
from keras.regularizers import l1, l2
from keras.layers import (
    Input, Dense, BatchNormalization, Dropout, Embedding,
    Bidirectional, LSTM, Concatenate, Flatten
)
from keras.models import Model

import nltk
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import (
    accuracy_score, f1_score, mean_squared_error, confusion_matrix, classification_report
)

from IPython.display import Image, display

warnings.filterwarnings("ignore")
np.random.seed(42)
tf.random.set_seed(42)

scaler = MinMaxScaler()


I suppress warnings to keep the output clean; they mainly relate to known library deprecations that do not affect the results.

Sanity check: confirm that the data loaded correctly.

In [None]:
df.head()



---



# 2 - Input

This section of the exam was divided into two subparts:
* How to (if) preprocess input data and which data would you retain/use;
* Which is the input of the model, and how is it represented;

To avoid any kind of data leakage, I first split the data into training, validation, and test sets (80/10/10).

In [None]:
train_df, temp_df = train_test_split(
    df, test_size=0.20, random_state=42, shuffle=True
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.50, random_state=42, shuffle=True
)
print(len(train_df), len(val_df), len(test_df))
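
The reason for splitting first can be illustrated with a minimal, self-contained sketch in plain Python. It uses hypothetical toy values, not the exam dataset, and the helper names `fit_min_max` and `apply_min_max` are illustrative only. Any statistic used for preprocessing (here the minimum and maximum used for scaling) is computed on the training split and then reused unchanged on the other splits.

```python
def fit_min_max(values):
    """Learn scaling statistics from the training split only."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Scale values with statistics learned elsewhere (no refitting)."""
    span = (hi - lo) or 1.0  # guard against a constant column
    return [(v - lo) / span for v in values]

# Hypothetical toy ages, not taken from the dataset
train_age = [19, 35, 61]
val_age = [25, 64]

lo, hi = fit_min_max(train_age)               # statistics come from train only
train_scaled = apply_min_max(train_age, lo, hi)
val_scaled = apply_min_max(val_age, lo, hi)   # reused, never refitted
```

A validation value above the training maximum maps to a number slightly greater than 1. This is expected when the validation data plays no part in fitting the scaler.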

## 2.1 Preprocessing

### 2.1.1 - Search Queries
 

As I said during the exam, these vectors, once one-hot encoded, do not need any further preprocessing.

In [None]:
def extract_search_queries(df):
    return np.stack(df['Search Queries'].values)

train_hot_tensor = extract_search_queries(train_df)
val_hot_tensor = extract_search_queries(val_df)
test_hot_tensor = extract_search_queries(test_df)

### 2.1.3 - Timestamp 

As stated in the exam, I am going to:

1. Convert minute, hour, day, and month using a cyclic sine/cosine encoding.
2. Convert the year into its distance from the first instance (the earliest year in the training set).
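
As a small illustration of why the cyclic encoding helps, the sketch below (plain Python, with hypothetical helper names `cyclic_encode` and `cyclic_distance`) maps a value onto the unit circle and compares distances. With a raw integer hour, 23 and 0 are 23 units apart; in the sine/cosine space they are as close as any two consecutive hours.

```python
import math

def cyclic_encode(value, period):
    """Map a periodic value to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * value / period
    return (math.sin(angle), math.cos(angle))

def cyclic_distance(a, b, period):
    """Euclidean distance between two values after cyclic encoding."""
    (sa, ca), (sb, cb) = cyclic_encode(a, period), cyclic_encode(b, period)
    return math.hypot(sa - sb, ca - cb)

d_wrap = cyclic_distance(23, 0, 24)      # across midnight
d_adjacent = cyclic_distance(10, 11, 24) # two consecutive hours
d_far = cyclic_distance(0, 12, 24)       # opposite sides of the day
```

Here `d_wrap` equals `d_adjacent`, while `d_far` is the largest possible distance (the diameter of the unit circle).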


In [None]:
def transform_dates(df, min_year=None):
    df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%m/%d/%Y %H:%M') # Parse the timestamp with both date and time

    # Extract components
    df['Day'] = df['Timestamp'].dt.day
    df['Month'] = df['Timestamp'].dt.month
    df['Hour'] = df['Timestamp'].dt.hour
    df['Minute'] = df['Timestamp'].dt.minute
    df['Year'] = df['Timestamp'].dt.year

    # Cyclic encoding for day, month, hour, minute
    df['Day_sin'] = np.sin(2 * np.pi * df['Day'] / 31)
    df['Day_cos'] = np.cos(2 * np.pi * df['Day'] / 31)
    df['Month_sin'] = np.sin(2 * np.pi * df['Month'] / 12)
    df['Month_cos'] = np.cos(2 * np.pi * df['Month'] / 12)
    df['Hour_sin'] = np.sin(2 * np.pi * df['Hour'] / 24)
    df['Hour_cos'] = np.cos(2 * np.pi * df['Hour'] / 24)
    df['Minute_sin'] = np.sin(2 * np.pi * df['Minute'] / 60)
    df['Minute_cos'] = np.cos(2 * np.pi * df['Minute'] / 60)
    if min_year is None:
        min_year = df['Year'].min()

    #distance of years   
    df['Year_Since'] = df['Year'] - min_year
    df.drop(columns=['Timestamp', 'Day', 'Month', 'Hour', 'Minute', 'Year'], inplace=True)
    return df, min_year

train_df, min_year = transform_dates(train_df)
val_df, _ = transform_dates(val_df, min_year)
test_df, _ = transform_dates(test_df, min_year)

### 2.1.4 Ad Topic Line

I clean the ad text by removing punctuation, stop words, and non-ASCII characters, and converting everything to lowercase

In [None]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
def preprocess_ad(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')    
    text = text.replace('--', ' ')    
    words = text.split()    
    table = str.maketrans('', '', string.punctuation)
    words = [w.translate(table) for w in words]    
    words = [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]
    return ' '.join(words)

To better understand the input data, I analyze the distribution of the number of words in each ad title after preprocessing. I compute basic statistics and plot a histogram to visualize how ad title lengths are spread.

In [None]:
for df_ in [train_df, val_df, test_df]:
    df_["clean_text"] = df_["Ad Topic Line"].apply(preprocess_ad)
    df_["ad_length"] = df_["clean_text"].apply(lambda x: len(x.split()))
    df_["tokens"] = df_["clean_text"].str.split()

In [None]:
ad_length_stats = train_df["ad_length"].describe()
print(ad_length_stats)
print("\n")
quantiles = [0.50, 0.75, 0.90, 0.95, 0.98, 0.99, 1.00]
for q in quantiles:
    length_at_quantile = train_df["ad_length"].quantile(q)
    print(f"Quantile {q*100:.0f}%: Ads have length <= {length_at_quantile:.0f} words.")

plt.figure(figsize=(14, 7))
sns.histplot(data=train_df, x="ad_length", bins=50, kde=True,
             label="Length distribution", stat="density")
plt.axvline(train_df["ad_length"].quantile(0.75), color='green', linestyle='--',
            label=f'75th Quantile ({train_df["ad_length"].quantile(0.75):.0f} words)')
plt.axvline(train_df["ad_length"].quantile(0.90), color='orange', linestyle='--',
            label=f'90th Quantile ({train_df["ad_length"].quantile(0.90):.0f} words)')
plt.axvline(train_df["ad_length"].quantile(0.95), color='red', linestyle='--',
            label=f'95th Quantile ({train_df["ad_length"].quantile(0.95):.0f} words)')
plt.title('Distribution of Ad Topic Line with Highlighted Quantiles', fontsize=16)
plt.xlabel('Number of Words per Ad Topic Line', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.legend()
plt.xlim(0, train_df["ad_length"].quantile(0.99))
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


Based on the text preprocessing, an analysis of the ad topic line lengths was conducted. The distribution and highlighted quantiles reveal several key insights:

*   The ad topic lines are predominantly short and concise, with the most frequent length being **3 words**.
*   The distribution is highly concentrated, with **75% of ad topic lines having 3 words or fewer**.
*   The distribution has a very short tail, as **95% of all ad topic lines contain 4 words or fewer**.

Based on this analysis, a maximum sequence length (`AD_MAX_LEN`) of **5** was chosen. This value sits just above the 95th percentile, striking an effective balance: it captures the complete context for the vast majority of samples while minimizing the computational overhead from excessive padding.

Consequently, sequences shorter than 5 words will be post-padded to reach this length. To ensure these padding tokens do not influence the model's learning process, the `Embedding` layer will be configured with `mask_zero=True`. This instructs subsequent layers, such as the LSTM, to disregard these padded positions during training, effectively isolating the model's focus on the actual content of the sequence.
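
The indexing and padding convention described above can be sketched in plain Python. This is a toy illustration with made-up tokens; the helpers `build_index` and `encode_and_pad` are hypothetical and only mimic the behaviour of post-padding and post-truncation. Index 0 is reserved for padding (the value that `mask_zero=True` treats as "ignore") and index 1 for out-of-vocabulary words.

```python
PAD, OOV = 0, 1  # 0 = padding (masked), 1 = out-of-vocabulary

def build_index(token_lists):
    """Assign indices starting at 2 to the sorted training vocabulary."""
    vocab = sorted({w for tokens in token_lists for w in tokens})
    return {w: i + 2 for i, w in enumerate(vocab)}

def encode_and_pad(tokens, index, max_len):
    """Map tokens to indices, truncate at the end, then pad at the end."""
    seq = [index.get(w, OOV) for w in tokens][:max_len]
    return seq + [PAD] * (max_len - len(seq))

# Made-up tokens, used only for illustration
train_tokens = [["robust", "logistic", "utilization"],
                ["open", "source", "budget"]]
index = build_index(train_tokens)

padded = encode_and_pad(["robust", "unseen", "budget"], index, max_len=5)
truncated = encode_and_pad(["open"] * 7, index, max_len=5)
```

The word `unseen` is absent from the training vocabulary, so it maps to the OOV index, and the two trailing zeros are the padded positions that the masked layers skip.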

In [None]:
ad_all_train_words = [word for tokens in train_df['tokens'] for word in tokens]
ad_unique_words = sorted(list(set(ad_all_train_words)))

ad_word_index = {word: i + 2 for i, word in enumerate(ad_unique_words)}
ad_word_index["<OOV>"] = 1
AD_MAX_LEN = 5
AD_VOCAB_SIZE = len(ad_unique_words) + 2


def ad_text_to_sequence(tokens, word_index_dict):
    seq = [word_index_dict.get(word, word_index_dict["<OOV>"]) for word in tokens]
    return seq

ad_train_sequences = [ad_text_to_sequence(tokens, ad_word_index) for tokens in train_df['tokens']]
ad_val_sequences = [ad_text_to_sequence(tokens, ad_word_index) for tokens in val_df['tokens']]
ad_test_sequences = [ad_text_to_sequence(tokens, ad_word_index) for tokens in test_df['tokens']]

train_ad_seqs = pad_sequences(ad_train_sequences, maxlen=AD_MAX_LEN, padding='post', truncating='post')
val_ad_seqs = pad_sequences(ad_val_sequences, maxlen=AD_MAX_LEN, padding='post', truncating='post')
test_ad_seqs = pad_sequences(ad_test_sequences, maxlen=AD_MAX_LEN, padding='post', truncating='post')

print("Shape of final training sequences tensor:", train_ad_seqs.shape)
print("Example of a processed sequence:", train_ad_seqs[29])

### 2.1.5 Country

In [None]:
def preprocess_location(country):
    # Normalize unicode
    country = unicodedata.normalize('NFKD', country).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    country = country.replace('--', ' ')
    # Remove punctuation except spaces
    table = str.maketrans('', '', string.punctuation.replace(' ', ''))
    country = country.translate(table)
    country = country.lower()
    # Remove extra spaces
    country = ' '.join(country.split())
    # Join multi-word country names with no separator (single word)
    country = country.replace(' ', '')
    # Only allow alphabetic (no numbers or other chars)
    if country.isalpha() and country:
        return country
    else:
        return ''

In [None]:
for df_ in [train_df, val_df, test_df]:
    df_["clean_country"] = df_["Country"].apply(preprocess_location)

In [None]:
def create_word_index(words):
    """
    Create a word-to-index dictionary for a list of words.
    Indexing starts at 2, with 1 reserved for <OOV>.
    """
    unique_words = sorted(set(words))
    word_index = {word: i + 2 for i, word in enumerate(unique_words)}
    word_index["<OOV>"] = 1
    return word_index

In [None]:


# Create word index for clean_country using only train_df to avoid data leakage
country_word_index = create_word_index(train_df["clean_country"])
COUNTRY_VOCAB_SIZE = len(country_word_index) + 1

def country_to_index(country):
    return country_word_index.get(country, country_word_index["<OOV>"])

train_country_tensor = train_df["clean_country"].apply(country_to_index).values
val_country_tensor = val_df["clean_country"].apply(country_to_index).values
test_country_tensor = test_df["clean_country"].apply(country_to_index).values

### 2.1.6 City

In [None]:
for df_ in [train_df, val_df, test_df]:
    df_["clean_city"] = df_["City"].apply(preprocess_location)

In [None]:
# Create word index for clean_city using only train_df to avoid data leakage
city_word_index = create_word_index(train_df["clean_city"])
CITY_VOCAB_SIZE = len(city_word_index) + 1 

def city_to_index(city):
    return city_word_index.get(city, city_word_index["<OOV>"])

train_city_tensor = train_df["clean_city"].apply(city_to_index).values
val_city_tensor = val_df["clean_city"].apply(city_to_index).values
test_city_tensor = test_df["clean_city"].apply(city_to_index).values

### 2.1.7 - Numerical group of features: Age, Area Income, Daily Internet Usage, Male, Daily Time Spent on Site, converted timestamp

In [None]:
numeric_cols = [
    'Age', 'Area Income', 'Daily Internet Usage', 'Male', 'Daily Time Spent on Site',
    'Day_sin', 'Day_cos', 'Month_sin', 'Month_cos',
    'Hour_sin', 'Hour_cos', 'Minute_sin', 'Minute_cos', 'Year_Since'
]

train_nums_tensor = scaler.fit_transform(train_df[numeric_cols])
val_nums_tensor   = scaler.transform(val_df[numeric_cols])
test_nums_tensor  = scaler.transform(test_df[numeric_cols])



---



## 2.2. Input of the model 

The model architecture is designed to accept five distinct and parallel inputs, each representing a different type of information extracted from the original data. This multi-input architecture allows the model to learn specific representations for each data type before merging them for the final prediction.

The five input tensors are summarized below, showing the shape and a brief description of their content. They cover the textual data (ad topic line, country, city), the categorical data (one-hot encoded search queries), and the numerical data.

As outlined in the exam, the textual inputs are processed by embedding layers, and the ad topic line is additionally processed by an LSTM layer. The other two inputs (categorical and numerical), which are kept separate here just for clarity, are concatenated with the output representations of the embedding layers and the LSTM layer. This combined vector then serves as the input to the subsequent fully connected deep neural network.
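
To make the concatenation concrete, the following sketch computes the width of the merged vector and the parameter count of the first dense layer. It is plain arithmetic, not Keras code. The values 16, 8, and 8 are the defaults of `create_model`, and 14 is the length of `numeric_cols`; `query_onehot_dim=10` is a hypothetical value chosen only for illustration, since the real one depends on the dataset.

```python
def concat_width(ad_lstm_units, country_embedding_dim, city_embedding_dim,
                 query_onehot_dim, numeric_features):
    """Width of the vector obtained by concatenating the five branches."""
    return (ad_lstm_units + country_embedding_dim + city_embedding_dim
            + query_onehot_dim + numeric_features)

def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

# query_onehot_dim=10 is an assumption made for this example
width = concat_width(16, 8, 8, query_onehot_dim=10, numeric_features=14)
first_dense = dense_params(width, 256)
```

Under these assumptions the merged vector has 56 components and the first hidden layer of 256 units has 14,592 trainable parameters.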

In [None]:

print("\n1. TEXTUAL INPUT -Ad Topic Line (Padded Token Sequences)")
print(f"   - Training tensor shape: {train_ad_seqs.shape}")
print(f"   - Validation tensor shape: {val_ad_seqs.shape}")
print(f"   - Test tensor shape: {test_ad_seqs.shape}")

print("\n2. TEXTUAL INPUT -Country (integer index)")
print(f"   - Training tensor shape: {train_country_tensor.shape}")
print(f"   - Validation tensor shape: {val_country_tensor.shape}")
print(f"   - Test tensor shape: {test_country_tensor.shape}")

print("\n3. TEXTUAL INPUT -City (integer index)")
print(f"   - Training tensor shape: {train_city_tensor.shape}")
print(f"   - Validation tensor shape: {val_city_tensor.shape}")
print(f"   - Test tensor shape: {test_city_tensor.shape}")

print("\n4. CATEGORICAL INPUT (One-Hot Encoded)")
print(f"   - Training tensor shape: {train_hot_tensor.shape}")
print(f"   - Validation tensor shape: {val_hot_tensor.shape}")
print(f"   - Test tensor shape: {test_hot_tensor.shape}")

print("\n5. NUMERICAL INPUT (Normalized Features)")
print(f"   - Training tensor shape: {train_nums_tensor.shape}")
print(f"   - Validation tensor shape: {val_nums_tensor.shape}")
print(f"   - Test tensor shape: {test_nums_tensor.shape}")
print(f"   - Included columns: {numeric_cols}")

# 3 - 4 (OUTPUT - LOSS - MODEL CONFIGURATION)

This section covers the following two exam parts:

3. OUTPUT/LOSS: How would you design the output layer and why; Which loss function would you use to train your model and why;
4. MODEL CONFIGURATION
  * Model overall composition/pipeline,
  * model optimization;

## 3.1 Output Layer, Loss Function, and Label Preprocessing

- **Label Preprocessing**: The target labels (`Clicked on Ad`) are already encoded as integers (0 and 1). I will keep them in this integer format rather than manually one-hot encoding them. This is because I will be using the `sparse_categorical_crossentropy` loss function, which is designed to work directly with integer labels. It's important to note that, under the hood, this is equivalent to comparing the model's output probabilities with a one-hot encoded ground truth.

- **Output Layer**: As explained during the exam, the model's final layer will be a `Dense` layer with **two units and a softmax activation function**. A two-unit softmax output is crucial because frameworks like LIME require the model to output a full probability distribution over all classes (in this case, `P(Did Not Click)` and `P(Clicked)`) for its analysis. A single sigmoid unit would only provide the probability for the positive class.

- **Loss Function**: Given that the model outputs a probability distribution, the ideal loss function is `sparse_categorical_crossentropy`. Its purpose is to minimize the divergence (specifically, the cross-entropy) between the model's predicted probability distribution and the true distribution (represented by the integer labels). This effectively trains the model to produce predictions that accurately match the ground truth probabilities.
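
The equivalence mentioned above can be checked numerically with a short sketch in plain Python (hypothetical logits, illustrative helper names). For a single sample, the sparse loss is the negative log of the probability assigned to the true class, which is exactly the cross-entropy against the one-hot vector. The last lines also show that a two-unit softmax carries the same information as a sigmoid applied to the difference of the two logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_ce(probs, label):
    """Cross-entropy with an integer label: -log p[label]."""
    return -math.log(probs[label])

def onehot_ce(probs, onehot):
    """Cross-entropy with a one-hot target vector."""
    return -sum(t * math.log(p) for t, p in zip(onehot, probs))

logits = [0.2, 1.4]                  # hypothetical output-layer logits
probs = softmax(logits)              # [P(Did Not Click), P(Clicked)]
loss_sparse = sparse_ce(probs, 1)
loss_onehot = onehot_ce(probs, [0, 1])

# Sigmoid of the logit difference gives the same P(Clicked)
p_sigmoid = 1.0 / (1.0 + math.exp(-(logits[1] - logits[0])))
```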

In [None]:
train_labels = train_df["Clicked on Ad"]
val_labels = val_df["Clicked on Ad"]
test_labels = test_df["Clicked on Ad"]

## 3.2  MODEL CONFIGURATION

In [None]:

def create_model(
    country_vocab_size,
    city_vocab_size,
    ad_vocab_size,
    query_onehot_dim,
    numeric_features,
    ad_max_len=5,
    country_max_len=1,
    city_max_len=1,
    ad_embedding_dim=32,
    ad_lstm_units=16,
    country_embedding_dim=8,
    city_embedding_dim=8,
    hidden_layer_sizes=[256, 128, 64, 32],
    dropout_rate=0.1,
    weight_reg=l2(0.01)
):
    country_input = Input(shape=(country_max_len,), name='country_input')
    city_input = Input(shape=(city_max_len,), name='city_input')
    ad_input = Input(shape=(ad_max_len,), name='ad_input')
    query_input = Input(shape=(query_onehot_dim,), name='query_input')
    numeric_input = Input(shape=(numeric_features,), name='numeric_input')


    x_country_text = Embedding(input_dim=country_vocab_size,
                       output_dim=country_embedding_dim,
                       input_length=country_max_len)(country_input)
    x_country_text = Flatten()(x_country_text)


    
    x_city_text = Embedding(input_dim=city_vocab_size,
                       output_dim=city_embedding_dim,
                       input_length=city_max_len)(city_input)
    x_city_text = Flatten()(x_city_text)


    x_ad_text = Embedding(input_dim=ad_vocab_size,
                       output_dim=ad_embedding_dim,
                       input_length=ad_max_len,
                       mask_zero=True)(ad_input)
    x_ad_text = LSTM(ad_lstm_units,
                                activation='tanh',
                                recurrent_activation='sigmoid',
                                dropout=dropout_rate,
                                recurrent_dropout=dropout_rate,
                                kernel_initializer='glorot_uniform', 
                                recurrent_initializer='glorot_uniform'
                               )(x_ad_text)

    x = Concatenate()([x_ad_text, x_country_text, x_city_text, query_input, numeric_input])


    for i, size in enumerate(hidden_layer_sizes, start=1):
        x = Dense(size,
                  activation='relu',
                  kernel_initializer='he_uniform',
                  kernel_regularizer=weight_reg,
                  name=f'dense_{i}')(x)
        x = BatchNormalization(name=f'bn_{i}')(x)
        x = Dropout(dropout_rate, name=f'dropout_{i}')(x)

    output = Dense(2, activation="softmax")(x)

    return Model(inputs=[country_input, city_input, ad_input, query_input, numeric_input],
                 outputs=[output])


### 5 - Hyperparameter optimization

In [None]:
param_grid = {
    'learning_rate':     [2e-3, 5e-3, 1e-2],
    'batch_size':        [64, 128, 256],
    'epochs':            [5, 10, 20, 25],
    'dropout_rate':      [0.2, 0.3, 0.4, 0.5],
    'weight_reg':        [l2(1e-4), l2(1e-3), l2(1e-2)],
    'ad_embedding_dim':  [16, 32, 64],
    'ad_lstm_units':     [8, 16, 32],
    'country_embedding_dim': [4, 8, 16],
    'city_embedding_dim':    [4, 8, 16],
    'optimizer':         ['adam'],
    'hidden_layer_sizes': [
        [256, 128, 64, 32],
        [128, 64, 32],
        [64, 32]
    ]
}
num_samples = 5

- We are tuning parameters like `learning_rate`, `dropout_rate`, and `weight_reg`, which have a significant influence on model convergence and generalization without drastically increasing training time per trial. More computationally expensive parameters, such as the number of LSTM layers or a much wider range for the embedding dimensions, have been intentionally kept fixed or limited. This focused approach allows for an efficient yet meaningful optimization process within the available computational budget.

- Due to time constraints, the following random search is limited to num_samples = 5. This serves as a proof of concept for the tuning methodology rather than an exhaustive search for an optimal model. For the final evaluation, a pre-vetted, well-performing configuration is used to ensure a meaningful analysis.
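
As a minimal sketch of the sampling idea, the snippet below draws a few distinct configurations from a grid by picking one value per parameter at random. This is an alternative illustration rather than the notebook's own procedure: it avoids building the full Cartesian product, and the helper `sample_configs` and the reduced `toy_grid` are hypothetical.

```python
import random

def sample_configs(grid, n, seed=42):
    """Draw up to n distinct configurations from a parameter grid."""
    rng = random.Random(seed)
    total = 1
    for values in grid.values():
        total *= len(values)
    n = min(n, total)  # cannot draw more distinct configs than exist

    seen, configs = set(), []
    while len(configs) < n:
        # One random position per parameter identifies a configuration
        choice = tuple(rng.randrange(len(v)) for v in grid.values())
        if choice in seen:
            continue  # skip duplicates
        seen.add(choice)
        configs.append({k: v[i] for (k, v), i in zip(grid.items(), choice)})
    return configs

# Reduced grid, for illustration only
toy_grid = {
    "learning_rate": [2e-3, 5e-3, 1e-2],
    "batch_size": [64, 128, 256],
    "dropout_rate": [0.2, 0.3],
}
configs = sample_configs(toy_grid, n=5)
```

Fixing the seed makes the sampled configurations repeatable across runs, which is useful when comparing trials.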

In [None]:
def random_search(param_grid, samples=5):
    """
    Performs a random hyperparameter search for the classification model.
    """
    combos = list(itertools.product(*param_grid.values()))
    num_to_sample = min(samples, len(combos))
    sampled_combos = random.sample(combos, num_to_sample)
    configs = [dict(zip(param_grid.keys(), c)) for c in sampled_combos]

    best_model = {
        'f1_score': (0, None),
        'accuracy': (0, None)
    }

    # Prepare input dictionaries using the provided variable names
    train_inputs = {
        'country_input': train_country_tensor,
        'city_input': train_city_tensor,
        'ad_input': train_ad_seqs,
        'query_input': train_hot_tensor,
        'numeric_input': train_nums_tensor
    }
    val_inputs = {
        'country_input': val_country_tensor,
        'city_input': val_city_tensor,
        'ad_input': val_ad_seqs,
        'query_input': val_hot_tensor,
        'numeric_input': val_nums_tensor
    }

    for idx, cfg in enumerate(configs):
        K.clear_session()
        print(f"\n{'='*10} Training config {idx+1}/{len(configs)} {'='*10}")
        print("Config:", cfg)
        
        # 1. Model Creation
        # The dimensions are obtained dynamically from the shapes of the input tensors
        model = create_model(
            country_vocab_size=COUNTRY_VOCAB_SIZE,
            city_vocab_size=CITY_VOCAB_SIZE,
            ad_vocab_size=AD_VOCAB_SIZE,
            query_onehot_dim=train_hot_tensor.shape[1],
            numeric_features=train_nums_tensor.shape[1],
            ad_max_len=train_ad_seqs.shape[1],
            country_max_len=1, #train_country_tensor.shape[1],
            city_max_len=1, #train_city_tensor.shape[1],
            ad_embedding_dim=cfg['ad_embedding_dim'],
            ad_lstm_units=cfg['ad_lstm_units'],
            country_embedding_dim=cfg['country_embedding_dim'],
            city_embedding_dim=cfg['city_embedding_dim'],
            hidden_layer_sizes=cfg['hidden_layer_sizes'],
            dropout_rate=cfg['dropout_rate'],
            weight_reg=cfg['weight_reg']
        )

        # 2. Model Compilation
        Optimizer = keras.optimizers.Adam if cfg['optimizer'] == 'adam' else keras.optimizers.RMSprop
        model.compile(
            loss='sparse_categorical_crossentropy',
            optimizer=Optimizer(learning_rate=cfg['learning_rate']),
            metrics=['accuracy']
        )

        # 3. Model Training
        history = model.fit(
            train_inputs,
            train_labels,
            epochs=cfg['epochs'],
            batch_size=cfg['batch_size'],
            validation_data=(val_inputs, val_labels), 
            verbose=1,
            callbacks=[
                keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
            ]
        )
        
        # 4. Evaluation
        pred_probs = model.predict(val_inputs, verbose=0)
        pred_labels = np.argmax(pred_probs, axis=1) 

        true_labels = val_labels
        acc = accuracy_score(true_labels, pred_labels)
        f1 = f1_score(true_labels, pred_labels, average='weighted')
        
        print("\nEvaluation metrics:")
        print(f"  Accuracy = {acc:.4f}, F1-Score = {f1:.4f}\n")
        # zero_division=0 to avoid warnings if a class has no predictions
        print(classification_report(true_labels, pred_labels, zero_division=0))
        # 5. Update best results
        if f1 > best_model['f1_score'][0]:
            best_model['f1_score'] = (f1, cfg)
        if acc > best_model['accuracy'][0]:
            best_model['accuracy'] = (acc, cfg)

    print(f"\n{'='*15} Best Results Summary {'='*15}")
    print(f"Best model (by F1-Score): {best_model['f1_score'][0]:.4f} -> Config: {best_model['f1_score'][1]}")
    print(f"Best model (by Accuracy): {best_model['accuracy'][0]:.4f} -> Config: {best_model['accuracy'][1]}")
    
    return best_model

In [None]:
best_results = random_search(param_grid)

The following configuration was identified as a well-performing candidate during more extensive, offline experiments. For reproducibility within this notebook, we will use this specific configuration for final training and evaluation.

In [None]:
best_learning_rate = 0.005
best_batch_size = 64
best_epochs = 20
best_dropout_rate = 0.4
best_weight_reg = keras.regularizers.l2(1e-2)
best_ad_embedding_dim = 16
best_ad_lstm_units = 16
best_country_embedding_dim = 16
best_city_embedding_dim = 4
best_optimizer = 'adam'
best_hidden_layer_sizes = [256, 128, 64, 32]

In [None]:
# Set initial random state for reproducibility
RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

K.clear_session()

best_model = create_model(
    country_vocab_size=COUNTRY_VOCAB_SIZE,
    city_vocab_size=CITY_VOCAB_SIZE,
    ad_vocab_size=AD_VOCAB_SIZE,
    query_onehot_dim=train_hot_tensor.shape[1],
    numeric_features=train_nums_tensor.shape[1],
    ad_max_len=train_ad_seqs.shape[1],
    country_max_len=1,  # train_country_tensor.shape[1],
    city_max_len=1,     # train_city_tensor.shape[1],
    ad_embedding_dim=best_ad_embedding_dim,
    ad_lstm_units=best_ad_lstm_units,
    country_embedding_dim=best_country_embedding_dim,
    city_embedding_dim=best_city_embedding_dim,
    hidden_layer_sizes=best_hidden_layer_sizes,
    dropout_rate=best_dropout_rate,
    weight_reg=best_weight_reg
)

Optimizer = keras.optimizers.Adam if best_optimizer == 'adam' else keras.optimizers.RMSprop

best_model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=Optimizer(learning_rate=best_learning_rate),
    metrics=['accuracy']
)

train_inputs = {
    'country_input': train_country_tensor,
    'city_input': train_city_tensor,
    'ad_input': train_ad_seqs,
    'query_input': train_hot_tensor,
    'numeric_input': train_nums_tensor
}
val_inputs = {
    'country_input': val_country_tensor,
    'city_input': val_city_tensor,
    'ad_input': val_ad_seqs,
    'query_input': val_hot_tensor,
    'numeric_input': val_nums_tensor
}

history = best_model.fit(
            train_inputs,
            train_labels, 
            epochs=best_epochs,
            batch_size=best_batch_size,
            validation_data=(val_inputs, val_labels),
            verbose=1,
            callbacks=[keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)])

In [None]:

import os 
print("="*60)
print("MODEL COMPOSITION REPORT")
print("="*60)
best_model.summary()


print("\n\n" + "="*60)
print("MODEL ARCHITECTURE PLOT")
print("="*60)


plot_model(
    best_model,
    to_file='model_architecture.png',
    show_shapes=True,
    show_layer_names=True,
    show_dtype=False,
    show_layer_activations=True
)


if os.path.exists("model_architecture.png"):
    display(Image('model_architecture.png'))


The plots below illustrate the training and validation performance of the classification model over successive epochs. We analyze the learning curves for both the loss function and key performance metrics, primarily accuracy. This analysis is essential for evaluating model convergence, identifying signs of overfitting by observing the divergence between training and validation scores, and ultimately assessing the efficacy of the chosen training configuration.

In [None]:
import matplotlib.pyplot as plt

def plot_training_history(history):
    train_loss = history.history.get('loss', [])
    val_loss = history.history.get('val_loss', [])
    train_acc = history.history.get('accuracy', [])
    val_acc = history.history.get('val_accuracy', [])

    epochs = range(1, len(train_loss) + 1)

    fig, axs = plt.subplots(1, 2, figsize=(14, 6))
    fig.suptitle('Training and Validation History', fontsize=20)

    axs[0].plot(epochs, train_loss, 'bo-', label='Training Loss')
    axs[0].plot(epochs, val_loss, 'ro-', label='Validation Loss')
    axs[0].set_title('Loss')
    axs[0].set_xlabel('Epochs')
    axs[0].set_ylabel('Loss')
    axs[0].legend()
    axs[0].grid(True)

    if train_acc and val_acc:
        axs[1].plot(epochs, train_acc, 'bo-', label='Training Accuracy')
        axs[1].plot(epochs, val_acc, 'ro-', label='Validation Accuracy')
        axs[1].set_title('Accuracy')
        axs[1].set_xlabel('Epochs')
        axs[1].set_ylabel('Accuracy')
        axs[1].legend()
        axs[1].grid(True)
    else:
        axs[1].set_visible(False)

    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()

plot_training_history(history)
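
When reading these curves, it helps to keep in mind how the early-stopping rule used during training (`monitor='val_loss'`, `patience=3`, `restore_best_weights=True`) decides where to stop. The sketch below is a simplified pure-Python model of that rule, assuming a strict improvement criterion and ignoring options such as a minimum delta. The validation losses are hypothetical, and `early_stopping_epoch` is an illustrative helper, not a Keras function.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs; the weights from best_epoch
    are the ones that would be restored.
    """
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# Hypothetical validation losses, one per epoch
val_losses = [0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.39]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
```

In this toy run the best epoch is 3 and training stops at epoch 6, so the lower loss at epoch 7 is never reached. This is the trade-off a small patience value implies.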



---



# 5 - MODEL EVALUATION

This section is dedicated to the final evaluation of the best-performing model, identified through the hyperparameter optimization process.

The evaluation is conducted on the **unseen test set** to provide an unbiased estimate of the model's generalization capabilities. The process follows the plan outlined in the exam:
1.  Evaluate the classification task.
2.  Visualize the results to gain deeper insights (e.g., Confusion Matrix and performance plots).

In [None]:

from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, classification_report
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Predict on test set
test_pred_probs = best_model.predict(
    {
        'country_input': test_country_tensor,
        'city_input': test_city_tensor,
        'ad_input': test_ad_seqs,
        'query_input': test_hot_tensor,
        'numeric_input': test_nums_tensor
    },
    verbose=0
)
test_pred_labels = np.argmax(test_pred_probs, axis=1)
y_true = test_labels

# Confusion Matrix
cm = confusion_matrix(y_true, test_pred_labels)
# Plot Confusion Matrix
plt.figure(figsize=(8,6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()


# Classification Report
report = classification_report(y_true, test_pred_labels)
print("Classification Report:\n", report)


# Accuracy Score
acc = accuracy_score(y_true, test_pred_labels)
print("Accuracy Score:", acc)

# F1 Score
f1 = f1_score(y_true, test_pred_labels, average='weighted')
print("F1 Score (weighted):", f1)
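
To clarify what the weighted F1 score reports, the sketch below recomputes accuracy and weighted F1 directly from a confusion matrix in plain Python. The matrix is hypothetical (rows are true labels, columns are predicted labels, following the scikit-learn convention), and `weighted_f1` is an illustrative helper.

```python
def weighted_f1(cm):
    """Support-weighted mean of per-class F1 scores.

    cm[i][j] counts samples with true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    score = 0.0
    for k in range(n_classes):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                               # true k, predicted other
        fp = sum(cm[r][k] for r in range(n_classes)) - tp  # predicted k, true other
        denom = 2 * tp + fp + fn
        f1_k = 2 * tp / denom if denom else 0.0
        support = sum(cm[k])                               # number of true samples of class k
        score += f1_k * support / total
    return score

# Hypothetical confusion matrix, not a result from the notebook
cm = [[45, 5],
      [3, 47]]
accuracy = (cm[0][0] + cm[1][1]) / sum(sum(row) for row in cm)
f1_weighted = weighted_f1(cm)
```

With balanced classes, as in this toy matrix, the weighted F1 is simply the mean of the two per-class F1 scores and stays close to the accuracy.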


# 6 - MODEL INTERPRETATION (LIME)


As outlined during the exam, we will use the LIME framework to generate local explanations for the model's predictions. However, our model's multi-modal architecture, which accepts both tabular and textual inputs, presents a technical challenge. LIME is structured with distinct explainers for different data types, namely `LimeTabularExplainer`, `LimeTextExplainer`, and `LimeImageExplainer`.

**Method 1: Explanation with a Subset of Features**
> Following the idea discussed in the exam (*"We can decide to use all the features or some of them to train the linear model"*), our first approach generates an explanation using only the numerical features. This allows us to use the standard `LimeTabularExplainer` directly on the numerical data. The primary limitation is significant: the approach cannot account for the impact of the textual features, and therefore provides an incomplete view of the model's reasoning. We implement this method first to demonstrate its functionality and its inherent drawbacks.

**Method 2: A Holistic Explanation with all the features**
> To overcome the limitations of the first method and obtain a complete explanation, a more sophisticated strategy is required. This approach involves treating the entire multi-input data as a single, unified tabular array for LIME. The key to this method is a custom 'wrapper' function that serves as an intermediary. This wrapper receives the perturbed data from LIME, correctly reshapes it into the multiple distinct inputs our Keras model expects, and returns the prediction. This allows us to leverage the power of `LimeTabularExplainer` across all features simultaneously, ensuring a complete and faithful interpretation of our black-box model.
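
The core of such a wrapper is bookkeeping: knowing which columns of the flat array belong to which model input. The sketch below shows that step in plain Python under stated assumptions. The column widths are hypothetical (in particular the width 3 for the query vector and 2 for the numerics), and `build_layout` and `split_row` are illustrative helpers, not LIME or Keras functions. Because perturbed values may be non-integer, the sketch rounds the columns that feed embedding layers back to integers.

```python
def build_layout(widths):
    """Map each input name to its (start, stop) column range."""
    layout, start = {}, 0
    for name, width in widths.items():
        layout[name] = (start, start + width)
        start += width
    return layout

def split_row(flat_row, layout, integer_inputs=()):
    """Split one flat row into the named inputs the model expects."""
    parts = {}
    for name, (start, stop) in layout.items():
        chunk = list(flat_row[start:stop])
        if name in integer_inputs:
            # Embedding layers need integer indices
            chunk = [int(round(v)) for v in chunk]
        parts[name] = chunk
    return parts

# Hypothetical widths; the real ones come from the tensor shapes
widths = {"country_input": 1, "city_input": 1, "ad_input": 5,
          "query_input": 3, "numeric_input": 2}
layout = build_layout(widths)

flat = [4.0, 17.0, 5.0, 1.0, 2.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.25, 0.8]
parts = split_row(flat, layout,
                  integer_inputs=("country_input", "city_input", "ad_input"))
```

In an actual wrapper, the same slicing would be applied column-wise to the whole perturbed matrix before calling the model's `predict` method.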

## 6.1 Explanation with a Subset of Features

In [None]:
import lime
import lime.lime_tabular
import numpy as np
from IPython.display import display, HTML, Markdown

In [None]:
# 6.1.1 - Extract the numerical features only
X_train_for_lime_explainer = train_nums_tensor
feature_names_for_lime_explainer = numeric_cols
print(f"Shape of the data provided to the LIME explainer: {X_train_for_lime_explainer.shape}")
print(f"Number of feature names provided: {len(feature_names_for_lime_explainer)}")

In [None]:
# This wrapper is the bridge between LIME's simplified world (only numerics) and our complex model's world (all 5 inputs)
def prediction_wrapper_numerics_only(perturbed_numerical_data):
    num_samples = perturbed_numerical_data.shape[0]
    # Create fixed, neutral placeholders for all inputs that LIME is NOT perturbing.
    placeholder_country = np.zeros((num_samples, 1), dtype=int)
    placeholder_city = np.zeros((num_samples, 1), dtype=int)
    placeholder_ad = np.zeros((num_samples, AD_MAX_LEN), dtype=int)
    placeholder_query = np.zeros((num_samples, train_hot_tensor.shape[1]), dtype=int)
    # Assemble the full input for the original `best_model`.
    model_inputs = {
        'country_input': placeholder_country,
        'city_input': placeholder_city,
        'ad_input': placeholder_ad,
        'query_input': placeholder_query,
        'numeric_input': perturbed_numerical_data  # only part that varies.
    }
    # Get a prediction from the original, fully-trained black-box model.
    return best_model.predict(model_inputs, verbose=0)
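
The zero placeholders above mean that LIME probes the model around an instance whose country, city, ad text, and query are all blank, not around the actual test row. One possible alternative, sketched here under assumptions, keeps the instance's real non-numeric inputs fixed and repeats them to match the batch LIME sends. The names `make_numeric_wrapper`, `fixed_inputs`, and `dummy_predict` are hypothetical and introduced only for illustration; in the notebook, `predict_fn` would be something like `lambda d: best_model.predict(d, verbose=0)`.

```python
import numpy as np

def make_numeric_wrapper(predict_fn, fixed_inputs, numeric_key="numeric_input"):
    """Return a LIME-compatible function that varies only the numeric block.

    fixed_inputs maps input names to 1-D arrays for ONE instance
    (hypothetical structure, mirroring the model's input names).
    """
    def wrapper(perturbed_numeric):
        n = perturbed_numeric.shape[0]
        # Repeat each fixed input once per perturbed sample.
        batch = {name: np.repeat(np.asarray(value).reshape(1, -1), n, axis=0)
                 for name, value in fixed_inputs.items()}
        batch[numeric_key] = perturbed_numeric
        return predict_fn(batch)
    return wrapper

# Stand-in for the real model: returns two-class probabilities and
# records the shapes it received so the tiling can be inspected.
seen_shapes = {}

def dummy_predict(batch):
    seen_shapes.update({k: v.shape for k, v in batch.items()})
    n = batch["numeric_input"].shape[0]
    p = np.full((n, 1), 0.5)
    return np.hstack([1 - p, p])

fixed = {"country_input": np.array([7]), "ad_input": np.array([3, 9, 0, 0])}
wrapper = make_numeric_wrapper(dummy_predict, fixed)
probs = wrapper(np.random.rand(5, 6))
```

With this variant the explanation would still cover only the numeric features, but the probabilities it is fitted on would correspond to the instance actually being explained.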

In [None]:
explainer_numerics_only = lime.lime_tabular.LimeTabularExplainer(
    training_data=X_train_for_lime_explainer,
    feature_names=feature_names_for_lime_explainer,
    class_names=['Did Not Click', 'Clicked'],
    mode='classification',
    random_state=42
)

In [None]:
instance_idx = 33 # instance from the test set to explain.

instance_to_explain_numerics_only = test_nums_tensor[instance_idx]
# Get the model's prediction for this instance's numerics
prediction_probs_numerics = prediction_wrapper_numerics_only(instance_to_explain_numerics_only.reshape(1, -1))
predicted_label_numerics = np.argmax(prediction_probs_numerics)

print("="*60)
print(f"NUMERICS-ONLY EXPLANATION FOR INSTANCE #{instance_idx}")
print("="*60)
print(f"Original True Label: {'Clicked' if test_labels.iloc[instance_idx] == 1 else 'Did Not Click'}")
print(f"Model Prediction (based on its numerical inputs only): {'Clicked' if predicted_label_numerics == 1 else 'Did Not Click'}")
print(f"Prediction Probabilities: {prediction_probs_numerics[0]}\n")

In [None]:
explanation_numerics_only = explainer_numerics_only.explain_instance(
    instance_to_explain_numerics_only,
    prediction_wrapper_numerics_only,
    num_features=10,
    top_labels=1
)
html_explanation_numerics = explanation_numerics_only.as_html(show_table=True, show_all=False)
display(HTML(html_explanation_numerics))

## 6.2 Explanation with all the features

In [None]:

# 1. Convert one-hot encoded queries to a single categorical index.
train_query_indices = np.argmax(train_hot_tensor, axis=1)
test_query_indices = np.argmax(test_hot_tensor, axis=1) # We'll need this for the instance to explain
NUM_QUERY_CATEGORIES = train_hot_tensor.shape[1]

# 2. Create inverse mapping dictionaries (index -> word/name).
def invert_word_index(word_index):
    return {idx: word for word, idx in word_index.items()}

index_to_country = invert_word_index(country_word_index)
index_to_city = invert_word_index(city_word_index)
index_to_ad_word = invert_word_index(ad_word_index)
index_to_query_name = {i: f"Query_{i}" for i in range(NUM_QUERY_CATEGORIES)}

# 3. Add mappings for padding/OOV tokens.
index_to_country[0] = '<N/A>'
index_to_city[0] = '<N/A>'
index_to_ad_word[0] = '<PAD>'

# 4. Assemble the unified training data matrix for LIME.
train_country_tensor_2d = train_country_tensor.reshape(-1, 1)
train_city_tensor_2d = train_city_tensor.reshape(-1, 1)
train_query_indices_2d = train_query_indices.reshape(-1, 1)

X_train_unified = np.hstack([
    train_country_tensor_2d,
    train_city_tensor_2d,
    train_ad_seqs,
    train_query_indices_2d,
    train_nums_tensor
])

print(f"Shape of the unified training matrix for LIME: {X_train_unified.shape}")

In [None]:
# 1. Define the complete list of feature names in the correct order.
country_feature_names = ['Country']
city_feature_names = ['City']
ad_feature_names = [f'Ad_Word_{i+1}' for i in range(AD_MAX_LEN)]
query_feature_names = ['Search_Query']
feature_names = (
    country_feature_names + city_feature_names + ad_feature_names +
    query_feature_names + numeric_cols
)

# 2. Identify the indices of all categorical features.
num_categorical_features = (
    len(country_feature_names) + len(city_feature_names) +
    len(ad_feature_names) + len(query_feature_names)
)
categorical_features_indices = list(range(num_categorical_features))

# 3. Build the dictionary that maps feature indices to their translation maps.
categorical_names = {}
categorical_names[feature_names.index('Country')] = index_to_country
categorical_names[feature_names.index('City')] = index_to_city
for i in range(AD_MAX_LEN):
    categorical_names[feature_names.index(f'Ad_Word_{i+1}')] = index_to_ad_word
categorical_names[feature_names.index('Search_Query')] = index_to_query_name
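
Because LIME matches names to columns purely by position, a mismatch between `feature_names` and the column order of the unified matrix would silently mislabel the explanation. A minimal layout check could look like the sketch below; `AD_MAX_LEN = 4` and the three numeric column names are made-up values chosen only to keep the example runnable.

```python
# Hypothetical sizes and names, chosen only to make the sketch runnable.
AD_MAX_LEN = 4
numeric_cols = ["num_a", "num_b", "num_c"]

feature_names = (
    ["Country", "City"]
    + [f"Ad_Word_{i + 1}" for i in range(AD_MAX_LEN)]
    + ["Search_Query"]
    + numeric_cols
)

# Categorical columns come first: country, city, ad tokens, query index.
num_categorical = 2 + AD_MAX_LEN + 1
categorical_idx = list(range(num_categorical))

# Every categorical index must point at a non-numeric name, and the
# remaining columns must be exactly the numeric ones, in order.
assert all(feature_names[i] not in numeric_cols for i in categorical_idx)
assert feature_names[num_categorical:] == numeric_cols
assert len(set(feature_names)) == len(feature_names)  # no duplicate names

# In the notebook this would be compared with X_train_unified.shape[1].
expected_width = num_categorical + len(numeric_cols)
print(expected_width)
```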

*Handling One-Hot Encoded Features for LIME*

The one-hot encoding of the `Search Query` feature required specific handling to produce meaningful explanations. `LimeTabularExplainer` interprets each column of a one-hot vector as an independent binary feature, which violates the mutually exclusive nature of the original categorical data and leads to nonsensical perturbations. Leaving `Search Query` in one-hot form would also produce low-level, confusing output such as "`Query_164=0` contributed positively," which fails to capture the high-level concept of the user's search. This naive version is kept in section **6.3 Appendix** to show how the one-hot query columns take over LIME's explanation and make it practically useless.


To address this, we implemented a conversion process. For the explainer, the one-hot vectors were collapsed into a single categorical feature representing the query's index. This forces LIME to perturb the entire feature at a conceptual level (i.e., changing the query). Subsequently, the `prediction_wrapper` function is responsible for re-expanding this index back into the one-hot format required by the Keras model's `query_input` layer. This round-trip conversion ensures both a conceptually sound explanation from LIME and valid communication with the trained black-box model.
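
The round trip described above can be illustrated on a toy array. This is a minimal sketch, not the notebook's code: `NUM_CLASSES = 5` is a hypothetical size, and `np.eye` is used in place of Keras's `to_categorical` only to keep the example free of extra dependencies.

```python
import numpy as np

NUM_CLASSES = 5  # hypothetical; the notebook uses train_hot_tensor.shape[1]

# Three toy one-hot rows standing in for the query tensor.
one_hot = np.eye(NUM_CLASSES, dtype=int)[[2, 0, 4]]

# Collapse: one integer column that LIME can treat as a single category.
indices = np.argmax(one_hot, axis=1)

# Re-expand: rebuild the one-hot rows the model expects.
restored = np.eye(NUM_CLASSES, dtype=int)[indices]

# The round trip is lossless only for rows with exactly one active column.
all_zero = np.zeros((1, NUM_CLASSES), dtype=int)
collapsed_zero = np.argmax(all_zero, axis=1)  # argmax of all zeros is 0
```

An all-zero row collapses to index 0 and would be re-expanded as category 0, so it is worth confirming that every row of the query tensor has exactly one active column before relying on this conversion.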



In [None]:
# This wrapper function translates LIME's unified tabular data into the
# multi-input format required by our trained Keras model.
def prediction_wrapper(data):
    # Define slicing points based on the unified data structure
    country_end = 1
    city_end = country_end + 1
    ad_end = city_end + AD_MAX_LEN
    query_end = ad_end + 1
    
    # Slice the data into its constituent parts
    country_input = data[:, :country_end].astype(int)
    city_input = data[:, country_end:city_end].astype(int)
    ad_input = data[:, city_end:ad_end].astype(int)
    query_indices = data[:, ad_end:query_end].astype(int)
    numeric_input = data[:, query_end:]
    
    # Revert the query index back to one-hot format for the model
    query_input_onehot = to_categorical(query_indices, num_classes=NUM_QUERY_CATEGORIES)
    
    # Assemble the input dictionary for the model
    model_inputs = {
        'country_input': country_input,
        'city_input': city_input,
        'ad_input': ad_input,
        'query_input': query_input_onehot,
        'numeric_input': numeric_input
    }
    
    # Return the model's prediction
    return best_model.predict(model_inputs, verbose=0)

In [None]:
# Create the holistic LIME explainer with our custom settings.
explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=X_train_unified,
    feature_names=feature_names,
    class_names=['Did Not Click', 'Clicked'],
    categorical_features=categorical_features_indices,
    categorical_names=categorical_names,
    mode='classification',
    random_state=42
)

In [None]:
# Select a specific instance from the test set to explain.
instance_idx = 33
original_instance_data = test_df.iloc[instance_idx]

# Assemble the unified data vector for this specific instance.
instance_to_explain = np.hstack([
    test_country_tensor[instance_idx].reshape(1, -1),
    test_city_tensor[instance_idx].reshape(1, -1),
    test_ad_seqs[instance_idx].reshape(1, -1),
    test_query_indices[instance_idx].reshape(1, -1),
    test_nums_tensor[instance_idx].reshape(1, -1)
]).flatten()

# Get the true label and the model's prediction for this instance.
true_label = test_labels.iloc[instance_idx]
prediction_probs = prediction_wrapper(instance_to_explain.reshape(1, -1))
predicted_label = np.argmax(prediction_probs)

# Print a summary header.
print("="*60)
print(f"DETAILED EXPLANATION FOR INSTANCE #{instance_idx} FROM THE TEST SET")
print("="*60)
print(f"Original Ad Topic Line: '{original_instance_data['Ad Topic Line']}'")
print(f"Original Country: '{original_instance_data['Country']}'")
print(f"Age: {original_instance_data['Age']:.0f}, Area Income: ${original_instance_data['Area Income']:,.2f}")
print("-"*60)
print(f"True Label: {'Clicked' if true_label == 1 else 'Did Not Click'}")
print(f"Model Prediction: {'Clicked' if predicted_label == 1 else 'Did Not Click'}")
print(f"Predicted Probabilities: [P(Did Not Click)={prediction_probs[0][0]:.4f}, P(Clicked)={prediction_probs[0][1]:.4f}]\n")

In [None]:
# Generate the explanation for the selected instance.
explanation = explainer.explain_instance(
    instance_to_explain,
    prediction_wrapper,
    num_features=20,  # Use a large number of features for a detailed view
    top_labels=1
)

html_explanation = explanation.as_html(show_table=True, show_all=False)
display(HTML(html_explanation))

## 6.3 Appendix 

In [None]:

# Unlike Method 6.2, here we use the one-hot tensor directly.
# LIME will see ~300 binary columns instead of a single categorical feature.
X_train_naive_onehot = np.hstack([
    train_country_tensor_2d,
    train_city_tensor_2d,
    train_ad_seqs,
    train_hot_tensor, # The key difference: we use the one-hot tensor
    train_nums_tensor
])
print(f"Shape of the naive one-hot training matrix for LIME: {X_train_naive_onehot.shape}")

query_feature_names_naive = [f'Query_Feature_{i}' for i in range(train_hot_tensor.shape[1])]
feature_names_naive = (
    country_feature_names + city_feature_names + ad_feature_names +
    query_feature_names_naive + numeric_cols
)

# We identify all categorical features, including the ~300 for the query.
categorical_features_indices_naive = list(range(
    len(country_feature_names) + len(city_feature_names) +
    len(ad_feature_names) + len(query_feature_names_naive)
))

categorical_names_naive = {}
categorical_names_naive[feature_names_naive.index('Country')] = index_to_country
categorical_names_naive[feature_names_naive.index('City')] = index_to_city
for i in range(AD_MAX_LEN):
    categorical_names_naive[feature_names_naive.index(f'Ad_Word_{i+1}')] = index_to_ad_word


def prediction_wrapper_naive_onehot(data):
    country_end = 1
    city_end = country_end + 1
    ad_end = city_end + AD_MAX_LEN
    query_end = ad_end + train_hot_tensor.shape[1]
    
    model_inputs = {
        'country_input': data[:, :country_end].astype(int),
        'city_input': data[:, country_end:city_end].astype(int),
        'ad_input': data[:, city_end:ad_end].astype(int),
        'query_input': data[:, ad_end:query_end], # Pass the one-hot data directly
        'numeric_input': data[:, query_end:]
    }
    return best_model.predict(model_inputs, verbose=0)

explainer_naive = lime.lime_tabular.LimeTabularExplainer(
    training_data=X_train_naive_onehot,
    feature_names=feature_names_naive,
    class_names=['Did Not Click', 'Clicked'],
    categorical_features=categorical_features_indices_naive,
    categorical_names=categorical_names_naive,
    mode='classification',
    random_state=42
)

instance_to_explain_naive = np.hstack([
    test_country_tensor[instance_idx].reshape(1, -1),
    test_city_tensor[instance_idx].reshape(1, -1),
    test_ad_seqs[instance_idx].reshape(1, -1),
    test_hot_tensor[instance_idx].reshape(1, -1),  # We use the one-hot vector of the instance
    test_nums_tensor[instance_idx].reshape(1, -1)
]).flatten()

print("="*60)
print(f"GENERATING A NAIVE EXPLANATION FOR INSTANCE #{instance_idx}")
print("="*60)
print("This explanation treats each one-hot column as an independent feature.")

explanation_naive = explainer_naive.explain_instance(
    instance_to_explain_naive,
    prediction_wrapper_naive_onehot,
    num_features=20,
    top_labels=1
)

html_explanation_naive = explanation_naive.as_html(show_table=True, show_all=False)
display(HTML(html_explanation_naive))


As the explanation above demonstrates, treating the one-hot encoded `Search Query` feature naively leads to a significantly less useful result, even within a holistic framework.

**Key Problems Observed:**
1.  **Low-Level Information:** The explanation is cluttered with statements like `Query_Feature_164=0`, `Query_Feature_282=0`, etc. Each one reports the impact of a specific query *not being present*, which is far less intuitive than the impact of the query that *was* present. The single most important piece of information, namely which query was actually active, is often drowned out.
2.  **Potential for Unrealistic Scenarios:** LIME perturbs each one-hot column independently, so it may have created invalid states (e.g., a sample with no query active, or multiple queries active) to train its local linear model, potentially compromising the reliability of the explanation itself.
3.  **Flooded Feature Table:** The "Feature | Value" table in the explanation is dominated by low-level, uninformative entries such as `Query_Feature_89=0`, `Query_Feature_0=0`, and `Query_Feature_84=0`.
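
The second problem can be made concrete with a small simulation. The sketch below is a simplified stand-in for per-column categorical sampling, not LIME's actual code: each one-hot column is drawn independently with its training frequency, and the sizes (`n_classes`, `n_train`, `n_samples`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n_classes, n_train, n_samples = 300, 1000, 5000  # hypothetical sizes

# Toy one-hot training matrix: each row has exactly one active column.
train_idx = rng.integers(0, n_classes, size=n_train)
train_one_hot = np.eye(n_classes, dtype=int)[train_idx]

# Simplified stand-in for per-column sampling: each column is drawn
# independently as a Bernoulli variable with its training frequency.
col_freq = train_one_hot.mean(axis=0)
perturbed = (rng.random((n_samples, n_classes)) < col_freq).astype(int)

active = perturbed.sum(axis=1)
frac_valid = np.mean(active == 1)  # exactly one query active
frac_none = np.mean(active == 0)   # no query active
frac_multi = np.mean(active > 1)   # several queries active
print(f"valid={frac_valid:.2f} none={frac_none:.2f} multi={frac_multi:.2f}")
```

Since the column frequencies sum to one, the number of active columns per row is approximately Poisson with mean 1, so under this simplified model only a bit more than a third of the perturbed rows would be valid one-hot vectors.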
