# Classification Model Tuning: Predicting Purchase Behavior  

## Project Overview  
This project aims to identify the **optimal classification model** for predicting whether users will purchase a product based on demographic and financial features from the `Social_Network_Ads.csv` dataset.  

### Dataset Features  
| Feature            | Type       | Description                          |  
|---------------------|------------|--------------------------------------|  
| `User ID`           | Identifier | Unique user identifier (removed)     |  
| `Gender`            | Categorical| User's gender (Male/Female)          |  
| `Age`               | Numerical  | User's age                           |  
| `EstimatedSalary`   | Numerical  | User's estimated annual salary      |  
| `Purchased`         | Binary     | Target label (0=Not Purchased, 1=Purchased)|  

**Objective**:  
1. Compare performance of **Logistic Regression**, **KNN**, and **Naïve Bayes** models.  
2. Optimize models via **hyperparameter tuning**.  
3. Evaluate using accuracy, precision, recall, and F1-score.  

---

## Models Explored  
### 1. Logistic Regression  
- **Algorithm**: Predicts probabilities using a logistic function.  
- **Tuned Parameters**:  
  - `C` (Inverse regularization strength): [0.01, 0.1, 1, 10, 100]  
  - `penalty`: L1/L2 regularization  
  - `solver`: Optimization algorithms (`saga`, `liblinear`)  
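
To make the "logistic function" concrete, here is a minimal sketch of how a fitted logistic regression turns a linear score into a probability. The weights, bias, and sample values are hypothetical and chosen only for illustration; they are not taken from the fitted model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias (illustrative only, not fitted values).
weights = np.array([0.8, 0.5])   # one weight per standardized feature
bias = -0.2

x = np.array([1.0, -0.4])        # a single standardized sample (assumed values)
score = weights @ x + bias       # linear score: 0.8 - 0.2 - 0.2 = 0.4
probability = sigmoid(score)     # estimated P(Purchased = 1 | x)
prediction = int(probability >= 0.5)

print(f"score={score:.2f}, probability={probability:.3f}, prediction={prediction}")
```

A smaller `C` shrinks the weights more strongly, which pulls the predicted probabilities towards 0.5.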

### 2. K-Nearest Neighbors (KNN)  
- **Algorithm**: Classifies based on majority vote of k-nearest neighbors.  
- **Tuned Parameters**:  
  - `n_neighbors`: integers from 2 to 9  
  - `weights`: `uniform` (equal weighting) vs `distance` (weight by inverse distance)  
  - `metric`: Distance metrics (`euclidean`, `manhattan`)  
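
The effect of `weights` and `metric` can be seen in a small from-scratch sketch of the voting step. This is a simplified illustration on made-up points, not scikit-learn's implementation (which handles ties and zero distances in its own way); the function name `knn_predict` is hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, weights="uniform", metric="euclidean"):
    """Predict the class of one point x by voting among its k nearest neighbours."""
    diff = X_train - x
    if metric == "manhattan":
        dist = np.abs(diff).sum(axis=1)
    else:  # euclidean
        dist = np.sqrt((diff ** 2).sum(axis=1))

    nearest = np.argsort(dist)[:k]            # indices of the k closest points
    if weights == "distance":
        # Inverse-distance weights; the epsilon is a simplification to avoid division by zero.
        vote_w = 1.0 / (dist[nearest] + 1e-12)
    else:
        vote_w = np.ones(k)                   # every neighbour counts equally

    classes = np.unique(y_train)
    totals = np.array([vote_w[y_train[nearest] == c].sum() for c in classes])
    return classes[np.argmax(totals)]

# Made-up points: one very close neighbour of class 1, two farther neighbours of class 0.
X_toy = np.array([[0.1, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y_toy = np.array([1, 0, 0, 1])
query = np.array([0.0, 0.0])

uniform_vote = knn_predict(X_toy, y_toy, query, k=3, weights="uniform")
distance_vote = knn_predict(X_toy, y_toy, query, k=3, weights="distance")
print(uniform_vote, distance_vote)
```

With uniform weights the two class-0 neighbours outvote the single class-1 neighbour; with distance weights the much closer class-1 point wins.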

### 3. Naïve Bayes (Gaussian)  
- **Algorithm**: Probabilistic classifier using Bayes' theorem with feature independence assumption.  
- **No hyperparameter tuning** applied (used as baseline).  
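
As a rough illustration of the independence assumption, the sketch below implements the core of Gaussian Naïve Bayes with NumPy: each class gets a prior and a per-feature mean and variance, and the log-likelihoods of the features are simply summed. The helper names and the toy data are hypothetical; the notebook itself uses `GaussianNB` from scikit-learn.

```python
import numpy as np

def gaussian_nb_fit(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # A small constant keeps the variance strictly positive.
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return params

def gaussian_nb_predict(params, x):
    """Return the class with the largest log posterior (up to a shared constant)."""
    def log_posterior(prior, mean, var):
        # Independence assumption: per-feature Gaussian log-densities are summed.
        log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        return np.log(prior) + log_likelihood
    return max(params, key=lambda c: log_posterior(*params[c]))

# Made-up, well-separated toy data with two features.
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [4.0, 5.0], [4.2, 4.8], [3.8, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

nb_params = gaussian_nb_fit(X_toy, y_toy)
pred_low = gaussian_nb_predict(nb_params, np.array([1.1, 2.1]))
pred_high = gaussian_nb_predict(nb_params, np.array([4.1, 5.1]))
print(pred_low, pred_high)
```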

---

#### Let's begin with importing all the necessary libraries

In [68]:
import numpy as np
import pandas as pd

# Scikit-Learn Imports
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

In [69]:
data = pd.read_csv('Social_Network_Ads.csv')

In [70]:
data.head()

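|   | User ID  | Gender | Age | EstimatedSalary | Purchased |
|---|----------|--------|-----|-----------------|-----------|
| 0 | 15624510 | Male   | 19  | 19000           | 0         |
| 1 | 15810944 | Male   | 35  | 20000           | 0         |
| 2 | 15668575 | Female | 26  | 43000           | 0         |
| 3 | 15603246 | Female | 27  | 57000           | 0         |
| 4 | 15804002 | Male   | 19  | 76000           | 0         |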


## Data Preprocessing & Feature Engineering

- Dropped `User ID` (non-predictive).
- Encoded `Gender` using **one-hot encoding** (`Gender_Female`, `Gender_Male`).
- Feature Scaling
  - Applied **StandardScaler**, fitted on the training split only, to standardize the features to zero mean and unit variance.
- Data Splitting
  - Used an 80/20 split with stratified sampling, so the train and test sets keep the same class proportions as the full dataset.
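
The steps above can also be written with a `ColumnTransformer`, which makes explicit which columns are scaled and which are encoded. The sketch below is one possible variant, not the notebook's own code: it scales only `Age` and `EstimatedSalary`, leaves the one-hot gender columns as 0/1, and runs on a tiny made-up frame whose values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame with the dataset's column names (values are illustrative only).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male", "Male", "Female"],
    "Age": [19, 35, 26, 27, 45, 52, 31, 48],
    "EstimatedSalary": [19000, 20000, 43000, 57000, 76000, 90000, 61000, 33000],
    "Purchased": [0, 0, 0, 0, 1, 1, 0, 1],
})
X = toy.drop(columns=["Purchased"])
y = toy["Purchased"]

# Stratified split: both splits keep roughly the same share of purchases.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["Age", "EstimatedSalary"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training split only, then reuse the fitted statistics on the test split.
X_tr_prepared = preprocess.fit_transform(X_tr)
X_te_prepared = preprocess.transform(X_te)
print(X_tr_prepared.shape, X_te_prepared.shape)
```

Fitting the transformer on the training split only avoids leaking test-set statistics into the scaling.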

In [71]:
print(data.isnull().sum())

User ID            0
Gender             0
Age                0
EstimatedSalary    0
Purchased          0
dtype: int64


The dataset has no missing values, so no imputation is needed.

If values were missing, we would fill them as follows:
- `Gender` -> most frequent value: `data['Gender'] = data['Gender'].fillna(data['Gender'].value_counts().index[0])`
- `Age` -> median value: `data['Age'] = data['Age'].fillna(data['Age'].median())`
- `EstimatedSalary` -> mean value: `data['EstimatedSalary'] = data['EstimatedSalary'].fillna(data['EstimatedSalary'].mean())`
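
The fill rules listed above can be sketched on a small made-up frame (the values are illustrative only). Note that `fillna` is called on each column, not on the whole DataFrame, and that this sketch uses the mean for the salary column.

```python
import numpy as np
import pandas as pd

# Made-up rows with one gap per column (illustrative values only).
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Age": [19.0, 35.0, np.nan, 27.0],
    "EstimatedSalary": [19000.0, np.nan, 43000.0, 58000.0],
})

# Categorical column: fill with the most frequent value.
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].value_counts().index[0])
# Numerical columns: median for age, mean for salary.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["EstimatedSalary"] = toy["EstimatedSalary"].fillna(toy["EstimatedSalary"].mean())

print(toy)
```

In a real workflow, the fill values would ideally be computed on the training split only and then reused for the test split.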

In [72]:
data.drop(columns = ['User ID'], inplace = True)

encoder = OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')
encoded_arr = encoder.fit_transform(data[['Gender']])
feature_names = encoder.get_feature_names_out()
encoded_df = pd.DataFrame(encoded_arr, columns = feature_names)
data = pd.concat([data.drop(columns=['Gender']), encoded_df], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=['Purchased']),
    data['Purchased'],
    test_size=0.2,
    random_state = 42,
    stratify = data['Purchased']  # keep the same proportion of purchases in train and test
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

---
## Model Training & Evaluation

In [73]:
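# Logistic Regression
# Note: all three models in this cell are fitted and evaluated on X_train / X_test;
# X_train_scaled and X_test_scaled from the previous cell are not used here.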
parameters_grid = {
    'C' : [0.01, 0.1, 1.0, 10.0, 100.0],
    'solver' : ['saga', 'liblinear'],
    'penalty': ['l1', 'l2'],
}
reg = LogisticRegression(max_iter = 1000)
grid_search = GridSearchCV(reg, parameters_grid, cv = 8, n_jobs = -1)
grid_search.fit(X_train, y_train)
best_logi_reg = grid_search.best_estimator_
y_pred_logi_reg = best_logi_reg.predict(X_test)

#K Nearest Neighbours
parameters_grid = {
    'n_neighbors': np.arange(2, 10, 1),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan'], 
}
knn = KNeighborsClassifier()
grid_search_knn = GridSearchCV(knn, parameters_grid, cv = 8, n_jobs = -1)
grid_search_knn.fit(X_train, y_train)
best_knn = grid_search_knn.best_estimator_
y_pred_knn = best_knn.predict(X_test)


#Naive Gaussian Bayes
gauss_nb = GaussianNB()
gauss_nb.fit(X_train, y_train)
y_pred_gauss = gauss_nb.predict(X_test)

metrics = {
    "Logistic Regression": {
        "accuracy": accuracy_score(y_test, y_pred_logi_reg),
        "precision": precision_score(y_test, y_pred_logi_reg, average="weighted"),
        "recall": recall_score(y_test, y_pred_logi_reg, average="weighted"),
        "f1_score": f1_score(y_test, y_pred_logi_reg, average="weighted"),
    },
    "KNN": {
        "accuracy": accuracy_score(y_test, y_pred_knn),
        "precision": precision_score(y_test, y_pred_knn, average="weighted"),
        "recall": recall_score(y_test, y_pred_knn, average="weighted"),
        "f1_score": f1_score(y_test, y_pred_knn, average="weighted"),
    },
    "Naive Bayes": {
        "accuracy": accuracy_score(y_test, y_pred_gauss),
        "precision": precision_score(y_test, y_pred_gauss, average="weighted"),
        "recall": recall_score(y_test, y_pred_gauss, average="weighted"),
        "f1_score": f1_score(y_test, y_pred_gauss, average="weighted"),
    }
}

In [74]:
print(metrics)

{'Logistic Regression': {'accuracy': 0.775, 'precision': 0.7842436974789916, 'recall': 0.775, 'f1_score': 0.7574942791762014}, 'KNN': {'accuracy': 0.7375, 'precision': 0.7347894265232975, 'recall': 0.7375, 'f1_score': 0.7195584635661835}, 'Naive Bayes': {'accuracy': 0.875, 'precision': 0.8741264849755416, 'recall': 0.875, 'f1_score': 0.8739697802197803}}


---
## Model Performance Comparison

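Below is a comparison of classification metrics (weighted averages, rounded to two decimals) across the three models. Logistic Regression and KNN use the best estimator found by grid search; Naïve Bayes is untuned: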

| Model               | Accuracy | Precision | Recall | F1-Score |  
|---------------------|----------|-----------|--------|----------|  
| Logistic Regression | 77.50%   | 78.42%    | 77.50% | 75.75%   |  
| K-Nearest Neighbors | 73.75%   | 73.48%    | 73.75% | 71.96%   |  
| **Naïve Bayes**     | **87.50%** | **87.41%** | **87.50%** | **87.40%** |  

### Key Observations:
1. **Naïve Bayes Dominance**  
   - Achieved the highest scores across all metrics (87.5% accuracy)
   - Strong performance despite no hyperparameter tuning
   - Suggests Gaussian assumptions align well with data distribution

2. **Logistic Regression Tradeoffs**  
   - Moderate performance (77.5% accuracy)
   - The 0.9-point gap between weighted precision (78.42%) and weighted recall (77.50%) suggests that performance differs somewhat between the two classes

3. **KNN Limitations**  
   - Lowest performance (73.75% accuracy)  
   - The models above were fitted on the unscaled `X_train`, so KNN distances are likely dominated by `EstimatedSalary`; noisy observations may also contribute

*All metrics calculated on a held-out test set (20% of original data).*
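
The Accuracy and Recall columns in the table are identical because the metrics use `average="weighted"`: recall averaged over classes with support weights always equals accuracy. The sketch below checks this on made-up labels (not the notebook's predictions) and also computes the recall of the positive class alone, which is often the more informative number for a purchase prediction task.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels for illustration: 6 negatives and 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average="weighted")
weighted_precision = precision_score(y_true, y_pred, average="weighted")

# Recall of the positive class only (the default, average="binary").
positive_recall = recall_score(y_true, y_pred)

print(f"accuracy={accuracy:.3f}, weighted recall={weighted_recall:.3f}")
print(f"weighted precision={weighted_precision:.3f}, positive-class recall={positive_recall:.3f}")
```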

---
## More complex models  
### 1. Random Forest
- **Algorithm**: Classifies by majority vote over `n` decision trees, each trained on a bootstrap sample of the data and considering a random subset of features at each split.  
- **Tuned Parameters**:  
  - `n_estimators`: number of decision trees.
  - `criterion`: `gini`, `entropy`, `log_loss`

In [75]:
from sklearn.ensemble import RandomForestClassifier
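
A grid search over the two parameters listed above could look like the following sketch. It runs on synthetic data from `make_classification` rather than on `Social_Network_Ads.csv`, and the grid values, fold count, and variable names are assumptions made for illustration. `log_loss` is left out of the grid because scikit-learn treats it as equivalent to `entropy`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (three features, binary target); not the real dataset.
X, y = make_classification(
    n_samples=200, n_features=3, n_informative=2, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hypothetical grid values, kept small so the search runs quickly.
param_grid = {
    "n_estimators": [25, 50],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=4)
search.fit(X_tr, y_tr)

best_rf = search.best_estimator_
test_accuracy = best_rf.score(X_te, y_te)
print(search.best_params_, round(test_accuracy, 3))
```

Tree-based models split on one feature at a time, so they do not require feature scaling.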
