# Classification Model Tuning Project  

## Project Overview  
This project focuses on **tuning the best model for classification** based on a given dataset. The dataset consists of the following features:  

- **Salary** (Numerical)  
- **Age** (Numerical)  
- **Sex** (Categorical; stored in the `Gender` column)  

The target variable (**Label**) is:  
- **Purchased** (Binary: `0` - Not Purchased, `1` - Purchased)  

The primary goal is to **research different classification models**, apply **hyperparameter tuning**, and evaluate their performance to determine the most effective approach.  

## Models Explored  
To achieve the objective, the following classification models will be analyzed:  
- **Logistic Regression**  
- **K-Nearest Neighbors (KNN)**  
- **Naïve Bayes**  

Each model will undergo **hyperparameter tuning** to enhance its predictive accuracy and efficiency.  

## Key Objectives  
- Explore and preprocess the dataset  
- Train different classification models  
- Tune hyperparameters for each model  
- Compare model performance based on key metrics  

## Tools & Libraries  
- Python  
- Jupyter Notebook  
- Scikit-learn (`sklearn`)  
- Pandas  
- NumPy  

## Learning Outcome  
This project serves as a hands-on approach to **reinforcing theoretical knowledge** by practically implementing machine learning techniques, tuning models, and analyzing classification performance.  


---

#### Let's begin by importing all the necessary libraries

In [2]:
import numpy as np
import pandas as pd

# Scikit-Learn Imports
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

In [36]:
data = pd.read_csv('Social_Network_Ads.csv')

In [37]:
data.head()

|   | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |


Next we preprocess the data: check for missing values, drop the `User ID` column (an identifier with no predictive value), one-hot encode the `Gender` column, and scale the `Age` and `EstimatedSalary` columns.

In [38]:
print(data.isnull().sum())


# The dataset is fine: it has no missing values.
# If it did, we could fill the gaps column by column (fillna is called on the column, not on the whole frame):
# Gender -> most frequent value: data['Gender'] = data['Gender'].fillna(data['Gender'].value_counts().index[0])
# Age -> median value: data['Age'] = data['Age'].fillna(data['Age'].median())
# EstimatedSalary -> mean value: data['EstimatedSalary'] = data['EstimatedSalary'].fillna(data['EstimatedSalary'].mean())



User ID            0
Gender             0
Age                0
EstimatedSalary    0
Purchased          0
dtype: int64
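
The fallback imputation described in the comments above can be tried on a small frame. This is only a sketch: the `toy` frame and its values are made up for illustration, since the real dataset has no gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column (not taken from the real dataset).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Female"],
    "Age": [19.0, np.nan, 26.0, 35.0],
    "EstimatedSalary": [19000.0, 20000.0, np.nan, 57000.0],
})

# Fill each column from its own statistics; fillna is called on the column itself.
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].mode()[0])  # most frequent value
toy["Age"] = toy["Age"].fillna(toy["Age"].median())            # median
toy["EstimatedSalary"] = toy["EstimatedSalary"].fillna(toy["EstimatedSalary"].mean())  # mean

print(toy.isnull().sum())
```

In a real workflow the fill values would be computed on the training split only and then reused on the test split, so that no test information leaks into preprocessing.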


In [39]:
data.drop(columns = ['User ID'], inplace = True)

In [40]:
encoder = OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')
encoded_arr = encoder.fit_transform(data[['Gender']])
feature_names = encoder.get_feature_names_out()
encoded_df = pd.DataFrame(encoded_arr, columns = feature_names)
data = pd.concat([data.drop(columns=['Gender']), encoded_df], axis=1)

In [44]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=['Purchased']),
    data['Purchased'],
    test_size=0.2,
    random_state = 42,
    stratify = data['Purchased'] # keep the share of purchases the same in the train and test sets
)
    
# Apply feature scaling; the scaler is fit on the training set only, then reused on the test set
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)
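
One way to make sure the scaled features actually reach the model is to put the scaler and the classifier into a single pipeline. The sketch below is not the notebook's method: the synthetic `X_demo` / `y_demo` data, the step names `prep` and `clf`, and the small `C` grid are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (hypothetical values).
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "Age": rng.integers(18, 60, size=n),
    "EstimatedSalary": rng.integers(15000, 150000, size=n),
    "Gender_Male": rng.integers(0, 2, size=n),
})
y_demo = (X_demo["Age"] > 40).astype(int)  # toy label driven by age only

# Scale only the numeric columns; pass the 0/1 indicator through unchanged.
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), ["Age", "EstimatedSalary"])],
    remainder="passthrough",
)
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# The scaler is refit inside every CV fold, so validation folds never influence it.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=4)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

With this layout the same `search.predict(...)` call scales new data automatically, so there is no separate scaled frame to forget.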

Let's now consider different models

In [45]:
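# Note: every model in this cell is fit on the unscaled X_train and evaluated on the unscaled X_test;
# the X_train_scaled / X_test_scaled frames built above are not used, and the metrics below reflect that.

# Logistic Regression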
parameters_grid = {
    'C' : [0.01, 0.1, 1.0, 10.0, 100.0],
    'solver' : ['saga', 'liblinear'],
    'penalty': ['l1', 'l2'],
}
reg = LogisticRegression(max_iter = 1000)
grid_search = GridSearchCV(reg, parameters_grid, cv = 8, n_jobs = -1)
grid_search.fit(X_train, y_train)
best_logi_reg = grid_search.best_estimator_
y_pred_logi_reg = best_logi_reg.predict(X_test)

#K Nearest Neighbours
parameters_grid = {
    'n_neighbors': np.arange(2, 10, 1),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan'], 
}
knn = KNeighborsClassifier()
grid_search_knn = GridSearchCV(knn, parameters_grid, cv = 8, n_jobs = -1)
grid_search_knn.fit(X_train, y_train)
best_knn = grid_search_knn.best_estimator_
y_pred_knn = best_knn.predict(X_test)


# Gaussian Naive Bayes (fit with default settings, no grid search)
gauss_nb = GaussianNB()
gauss_nb.fit(X_train, y_train)
y_pred_gauss = gauss_nb.predict(X_test)

metrics = {
    "Logistic Regression": {
        "accuracy": accuracy_score(y_test, y_pred_logi_reg),
        "precision": precision_score(y_test, y_pred_logi_reg, average="weighted"),
        "recall": recall_score(y_test, y_pred_logi_reg, average="weighted"),
        "f1_score": f1_score(y_test, y_pred_logi_reg, average="weighted"),
    },
    "KNN": {
        "accuracy": accuracy_score(y_test, y_pred_knn),
        "precision": precision_score(y_test, y_pred_knn, average="weighted"),
        "recall": recall_score(y_test, y_pred_knn, average="weighted"),
        "f1_score": f1_score(y_test, y_pred_knn, average="weighted"),
    },
    "Naive Bayes": {
        "accuracy": accuracy_score(y_test, y_pred_gauss),
        "precision": precision_score(y_test, y_pred_gauss, average="weighted"),
        "recall": recall_score(y_test, y_pred_gauss, average="weighted"),
        "f1_score": f1_score(y_test, y_pred_gauss, average="weighted"),
    }
}

In [46]:
metrics

{'Logistic Regression': {'accuracy': 0.775,
  'precision': 0.7842436974789916,
  'recall': 0.775,
  'f1_score': 0.7574942791762014},
 'KNN': {'accuracy': 0.7375,
  'precision': 0.7347894265232975,
  'recall': 0.7375,
  'f1_score': 0.7195584635661835},
 'Naive Bayes': {'accuracy': 0.875,
  'precision': 0.8741264849755416,
  'recall': 0.875,
  'f1_score': 0.8739697802197803}}
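
The nested dictionary is easier to compare as a table. A minimal sketch, with the scores copied from the output above and rounded to four decimals:

```python
import pandas as pd

# Scores copied from the metrics output above, rounded to 4 decimals.
metrics = {
    "Logistic Regression": {"accuracy": 0.7750, "precision": 0.7842, "recall": 0.7750, "f1_score": 0.7575},
    "KNN": {"accuracy": 0.7375, "precision": 0.7348, "recall": 0.7375, "f1_score": 0.7196},
    "Naive Bayes": {"accuracy": 0.8750, "precision": 0.8741, "recall": 0.8750, "f1_score": 0.8740},
}

# One row per model, best weighted F1 first.
summary = pd.DataFrame(metrics).T.sort_values("f1_score", ascending=False)
best_model = summary.index[0]

print(summary)
print("Best by F1:", best_model)
```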

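So we can conclude that Naive Bayes is the best-performing model for this task on the held-out test set: it has the highest accuracy (0.875) and weighted F1 score (about 0.874), ahead of Logistic Regression (accuracy 0.775) and K nearest neighbours (accuracy 0.7375).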