<a href="https://colab.research.google.com/github/Rushil-K/Deep-Learning/blob/main/ANN/nmrk2627_ANN_DLM_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Learning Project 1 : Artificial Neural Networks**

Contributors:
- Rushil Kohli
- Navneet Mittal


# **Executive Summary: ANN Model for Customer Conversion Rate Prediction**

## **Project Overview**

The primary objective of this project is to predict customer conversion rates based on a dataset containing **1 million records**. The term "conversion" refers to a customer completing a desired action, such as making a purchase or subscribing to a service. The goal is to build an **Artificial Neural Network (ANN)** model that can effectively predict whether a customer will convert based on various behavioral and demographic features.

Once trained, the model was deployed using **Streamlit**, providing an interactive dashboard for real-time predictions and performance insights. The dashboard allows users to experiment with different model configurations, visualize key evaluation metrics, and gain a deeper understanding of customer behavior.

This report provides a comprehensive breakdown of the dataset, preprocessing steps, model architecture, evaluation techniques, and key insights derived from the model’s predictions.

---

## **Dataset Description**

The dataset consists of **1 million entries** and **8 features**, each representing different customer attributes. Below is a detailed description of each feature:

| Feature       | Description  |
|--------------|----------------------------------------------------------|
| **CustomerID** | Unique identifier assigned to each customer. |
| **Age** | Age of the customer (numerical). |
| **Gender** | Gender of the customer (categorical: 0 for Male, 1 for Female). |
| **Income** | Annual income of the customer (continuous variable). |
| **Purchases** | Number of purchases made by the customer. |
| **Clicks** | Number of times the customer clicked on advertisements or product links. |
| **Spent** | Total amount of money spent by the customer. |
| **Converted** | Target variable (1 = converted, 0 = not converted). |

The **target variable**, "Converted," represents whether a customer completed the desired action. The dataset exhibits **class imbalance**, where the majority of customers did not convert.

---

## **Data Preprocessing**

Data preprocessing plays a crucial role in preparing the dataset for model training. The following steps were performed:

### **1. Handling Missing Values**
- Checked for missing values and handled them using imputation techniques where necessary.

### **2. Encoding Categorical Variables**
- The **Gender** column was encoded using **Ordinal Encoding** (0 for Male, 1 for Female) to make it suitable for the ANN model.

### **3. Handling Class Imbalance**
- The dataset exhibited class imbalance (i.e., significantly more non-converted customers than converted ones). To mitigate this, **SMOTE (Synthetic Minority Over-sampling Technique)** was applied to oversample the minority class.

### **4. Train-Test Split**
- The dataset was split into **80% training data** and **20% test data** to assess the model’s performance on unseen data.

### **5. Class Weights Computation**
- Since the target variable was imbalanced, **compute_class_weight** was used to assign appropriate weights, ensuring the model did not favor the majority class.

---

## **Model Development and Training**

### **1. Model Architecture**
A **Sequential Artificial Neural Network (ANN)** was built using **TensorFlow** and **Keras**, with the following architecture:

- **Input Layer**: Accepts input features (Age, Gender, Income, Purchases, Clicks, Spent).
- **Hidden Layers**: Multiple fully connected (Dense) layers with **ReLU activation**.
- **Dropout Layers**: Used to prevent overfitting by randomly deactivating neurons.
- **Output Layer**: A single neuron with **sigmoid activation** to classify customers into **Converted (1) or Not Converted (0)**.

### **2. Hyperparameter Tuning**
Users can select the following hyperparameters via the **Streamlit Dashboard**:
- **Optimizer**: Adam, SGD, RMSProp
- **Learning Rate**: 0.01, 0.001, 0.0001
- **Dropout Rate**: 0.1 to 0.5
- **Number of Dense Layers**: 2 to 5
- **Batch Size**: 128
- **Epochs**: User-defined

### **3. Training Process**
- The model was trained with the preprocessed dataset using **binary cross-entropy loss** and evaluated using **accuracy and loss metrics**.
- **Validation split** was set to **20%** to track generalization performance during training.

---

## **Model Evaluation**

### **1. Performance Metrics**
After training, the model was evaluated using the following metrics:

| Metric | Value |
|--------------|--------------|
| **Test Accuracy** | 69.91% |
| **Test Loss** | Binary cross-entropy loss |
| **Precision (Class 0)** | 0.70 |
| **Precision (Class 1)** | 0.00 |
| **Recall (Class 0)** | 1.00 |
| **Recall (Class 1)** | 0.00 |
| **F1-score (Class 0)** | 0.82 |
| **F1-score (Class 1)** | 0.00 |
| **Macro Avg F1-score** | 0.41 |
| **Weighted Avg F1-score** | 0.58 |



### **2. Training Performance Visualization**
- **Loss and Accuracy Plots**: Displayed over epochs to detect overfitting or underfitting.

### **3. Classification Report**
- Displays Precision, Recall, and F1-score for each class.

---

## **Model Interpretation**

### **1. SHAP (Shapley Additive Explanations)**
To interpret model predictions, **SHAP values** were computed:
- **Feature Importance Ranking**: Identified key factors influencing conversion.
- **Summary Plot**: Displayed which features contributed most to predictions.

---

## **Deployment with Streamlit**

The trained model was deployed using **Streamlit**, enabling users to:
- **Train the Model** with adjustable hyperparameters.
- **Visualize Training Performance** (accuracy, loss curves).
- **Analyze Model Performance** (confusion matrix, classification report).
- **View Feature Importance** using SHAP values.

---

## **Tech Stack**

- **Google Colab**: Used for data preprocessing and model training.
- **TensorFlow & Keras**: Built the ANN model.
- **Streamlit**: Created an interactive dashboard.
- **Scikit-learn**: Used for data preprocessing and evaluation.
- **SHAP**: Interpreted model predictions.
- **SMOTE**: Handled class imbalance.
- **gdown**: Downloaded the dataset from Google Drive.

---

## **Conclusion**

This project successfully demonstrates how an **Artificial Neural Network (ANN)** can be used to predict customer conversion rates. Key takeaways:

- **Addressing Class Imbalance**: Using SMOTE improved the model’s ability to learn from minority-class samples.
- **Feature Scaling & Encoding**: Standardization enhanced training efficiency and model stability.
- **Model Performance**: Achieved a test accuracy of **69.91%**, though performance on minority class requires improvement.
- **Deployment**: The **Streamlit dashboard** enables non-technical users to interact with the model and explore its results visually.


### Analysis

In [1]:
# Import necessary libraries
import os
import requests
import io
from io import StringIO
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

In [2]:
# Replace with your actual file ID
file_id = '1OPmMFUQmeZuaiYb0FQhwOMZfEbVrWKEK'

# Construct the URL for direct download (using export)
url = f'https://drive.google.com/uc?export=download&id={file_id}'

# Fetch the data using requests

response = requests.get(url)
response.raise_for_status()  # Raise an exception for bad responses

# Read the data into a pandas DataFrame using StringIO
# Specify encoding if needed, e.g., encoding='latin1' or encoding='utf-8'
nmrk2627_df = pd.read_csv(StringIO(response.text), encoding='utf-8')

# Display the head of the dataframe to verify data loading.
display(nmrk2627_df.head())

Unnamed: 0,CustomerID,Age,Gender,Income,Purchases,Clicks,Spent,Converted
0,1,41,Female,52618.0,26,67,2434.0,0
1,2,43,Male,53114.0,3,14,2937.0,0
2,3,43,Female,96145.0,4,78,2076.0,0
3,4,35,Female,92590.0,10,13,1437.0,1
4,5,23,Female,69262.0,14,62,1675.0,1


In [3]:
nmrk2627_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column      Non-Null Count    Dtype  
---  ------      --------------    -----  
 0   CustomerID  1000000 non-null  int64  
 1   Age         1000000 non-null  int64  
 2   Gender      1000000 non-null  object 
 3   Income      1000000 non-null  float64
 4   Purchases   1000000 non-null  int64  
 5   Clicks      1000000 non-null  int64  
 6   Spent       1000000 non-null  float64
 7   Converted   1000000 non-null  int64  
dtypes: float64(2), int64(5), object(1)
memory usage: 61.0+ MB


## **ANALYSIS**

In [12]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from imblearn.over_sampling import SMOTE
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 1. Data Preprocessing and Feature Engineering:

# Assuming 'nmrk2627_df' is your DataFrame
X = nmrk2627_df.drop('Converted', axis=1)
y = nmrk2627_df['Converted']

# One-hot encode 'Gender'
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # Create OneHotEncoder
encoded_gender = encoder.fit_transform(X[['Gender']])  # Fit and transform Gender column
gender_df = pd.DataFrame(encoded_gender, columns=encoder.get_feature_names_out(['Gender']))  # Create DataFrame
X = X.drop('Gender', axis=1)  # Drop original Gender column
X = pd.concat([X, gender_df], axis=1)  # Concatenate encoded Gender

# Split data into train, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=552627)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=552627)

# Standardize numerical features for training set
numerical_features = ['Age', 'Income', 'Purchases', 'Clicks', 'Spent']
scaler = StandardScaler()
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])

# Standardize validation and test sets using training set's statistics
X_val[numerical_features] = scaler.transform(X_val[numerical_features])
X_test[numerical_features] = scaler.transform(X_test[numerical_features])

# Handle class imbalance using SMOTE only on training set
smote = SMOTE(random_state=552627)  # Using consistent random state
X_train, y_train = smote.fit_resample(X_train, y_train)

# Calculate class weights
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = {i: weight for i, weight in enumerate(class_weights)}

# 2. Model Building and Training:

model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(X_train.shape[1],)))
model.add(Dropout(0.3))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Use Early Stopping to prevent overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Save the best model during training
model_checkpoint = ModelCheckpoint('best_model.h5', monitor='val_accuracy', save_best_only=True)

# Train the model and store the history
history = model.fit(X_train, y_train, epochs=10, batch_size=256,
                    validation_data=(X_val, y_val),
                    callbacks=[early_stopping, model_checkpoint],
                    class_weight=class_weights_dict)  # Add class_weight


# 3. Prediction and Evaluation:

y_pred = model.predict(X_test)
y_pred_classes = (y_pred > 0.5).astype(int)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred_classes)
print(f"Accuracy: {accuracy}")

# Generate classification report
print(classification_report(y_test, y_pred_classes))

# Generate confusion matrix
print(confusion_matrix(y_test, y_pred_classes))

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Epoch 1/10
[1m3276/3276[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.4996 - loss: 961.8517



[1m3276/3276[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m17s[0m 5ms/step - accuracy: 0.4996 - loss: 961.6132 - val_accuracy: 0.7001 - val_loss: 0.6915
Epoch 2/10
[1m3276/3276[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m15s[0m 3ms/step - accuracy: 0.4986 - loss: 0.8123 - val_accuracy: 0.2999 - val_loss: 0.6956
Epoch 3/10
[1m3276/3276[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 3ms/step - accuracy: 0.5000 - loss: 0.7450 - val_accuracy: 0.7001 - val_loss: 0.6926
Epoch 4/10
[1m3276/3276[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 3ms/step - accuracy: 0.4999 - loss: 0.7053 - val_accuracy: 0.2999 - val_loss: 0.6934
Epoch 5/10
[1m3276/3276[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 3ms/step - accuracy: 0.5000 - loss: 0.6989 - val_accuracy: 0.7001 - val_loss: 0.6923
Epoch 6/10
[1m3276/3276[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 3ms/step - accuracy: 0.498

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
