# Model Training and Evaluation
You should build a machine learning pipeline with a complete model training and evaluation step. In particular, you should do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Conduct data exploration, data preprocessing, and feature engineering if necessary.
- Choose a few machine learning algorithms, such as [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), and [gradient boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html).
- Define a grid of hyperparameters for every selected model.
- Conduct [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) or [random search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) using k-fold cross-validation on the training set to find out the best model (i.e., the best algorithm and its hyperparameters).
- Train the best model on the whole training set.
- Test the trained model on the test set and report various [evaluation metrics](https://scikit-learn.org/0.15/modules/model_evaluation.html).  
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.

# ***Importing Libraries***

In [60]:
import pandas as pd
from IPython.display import display
import os

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [61]:
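# Enables ANSI escape sequences in the Windows console; it has no useful effect on other platforms
os.system('color')

# ANSI escape codes used to colour the printed labels
RED = '\033[31m'
BLUE = '\033[34m'
RESET = '\033[0m'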

In [18]:
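df = pd.read_csv("https://raw.githubusercontent.com/m-mahdavi/teaching/refs/heads/main/datasets/mnist.csv")

# First five rows
print(BLUE + "df.head()" + RESET)
display(df.head())

# Column types and memory usage; df.info() prints its report directly and returns None
print(BLUE + "df.info()" + RESET)
df.info()

# Number of missing values per column
print(BLUE + "df.isnull().sum()" + RESET)
display(df.isnull().sum())

df.head()

|  | id | class | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 | pixel784 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 31953 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 34452 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 60897 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 36953 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1981 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Columns: 786 entries, id to pixel784
dtypes: int64(786)
memory usage: 24.0 MB

df.isnull().sum()

| Column | Missing values |
| --- | --- |
| id | 0 |
| class | 0 |
| pixel1 | 0 |
| pixel2 | 0 |
| pixel3 | 0 |
| ... | ... |
| pixel780 | 0 |
| pixel781 | 0 |
| pixel782 | 0 |
| pixel783 | 0 |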
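
Beyond types and missing values, a check that is often useful for a classification task is the class balance. The sketch below uses a small hypothetical frame (`toy_df`, not the MNIST file) to show how `value_counts` could be used for this, and how constant pixel columns could be spotted. The column names mirror the dataset, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the MNIST frame: a label column plus two pixel columns.
toy_df = pd.DataFrame({
    "class": [0, 1, 1, 2, 2, 2],
    "pixel1": [0, 0, 0, 0, 0, 0],
    "pixel2": [0, 12, 255, 7, 0, 3],
})

# Absolute and relative frequency of each label.
class_counts = toy_df["class"].value_counts().sort_index()
class_share = toy_df["class"].value_counts(normalize=True).sort_index()
print(class_counts)
print(class_share.round(2))

# Pixel columns with a single unique value carry no information for a classifier.
pixel_cols = [c for c in toy_df.columns if c.startswith("pixel")]
constant_cols = [c for c in pixel_cols if toy_df[c].nunique() == 1]
print(constant_cols)
```

On the real data the same two calls would be applied to `df["class"]` and to the `pixel` columns of `df`.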


# Splitting the data into train and test sets

In [16]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
print(f"{RED}Shape of original data:- {BLUE}{df.shape}")
print(f"{RED}Shape of train data:- {BLUE}{df_train.shape}")
print(f"{RED}Shape of test data:- {BLUE}{df_test.shape}")

Shape of original data:- (4000, 786)
Shape of train data:- (3200, 786)
Shape of test data:- (800, 786)

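The split above is purely random. A possible variant, shown here as a sketch rather than as what the notebook does, passes `stratify` so that every digit keeps roughly the same share in both subsets. The example assumes scikit-learn's bundled 8x8 digits (`load_digits`) as a stand-in for the MNIST CSV, and the variable names are illustrative.

```python
from collections import Counter

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Assumption: the bundled digits dataset stands in for the MNIST CSV.
X, y = load_digits(return_X_y=True)

# stratify=y keeps each digit's share roughly equal in the train and test subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Relative frequency of each digit in both subsets.
train_share = {k: v / len(y_tr) for k, v in sorted(Counter(y_tr).items())}
test_share = {k: v / len(y_te) for k, v in sorted(Counter(y_te).items())}
for digit in train_share:
    print(digit, round(train_share[digit], 3), round(test_share[digit], 3))
```

With the notebook's frame, the equivalent call would add `stratify=df["class"]` to `train_test_split`.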


1.   Drop the `id` column, which is an identifier rather than a feature.
2.   Drop the `class` column from `df_train` and `df_test` to get `x_train` and `x_test`.
3.   Use the `class` column of `df_train` and `df_test` as the labels `y_train` and `y_test`.

In [17]:
x_train = df_train.drop(["class", "id"], axis=1)
y_train = df_train["class"]

x_test = df_test.drop(["class","id"], axis=1)
y_test = df_test["class"]

print(f"{RED}Shape of x_train:- {BLUE}{x_train.shape}")
print(f"{RED}Shape of x_test:- {BLUE}{x_test.shape}")
print(f"{RED}Shape of y_train:- {BLUE}{y_train.shape}")
print(f"{RED}Shape of y_test:- {BLUE}{y_test.shape}")

Shape of x_train:- (3200, 784)
Shape of x_test:- (800, 784)
Shape of y_train:- (3200,)
Shape of y_test:- (800,)

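The features are used here as raw pixel intensities. If preprocessing is wanted, one option is to put a scaler and the classifier into a single pipeline, so that during cross-validation the scaler is fitted on the training folds only. The sketch below is an assumption-laden illustration: it uses the bundled `load_digits` data instead of the MNIST CSV, and `MinMaxScaler` is just one possible choice. Whether scaling improves accuracy is an empirical question.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Assumption: the bundled digits dataset stands in for the MNIST CSV.
X, y = load_digits(return_X_y=True)

# Scaler and classifier are fitted together, fold by fold, so no test-fold
# statistics leak into the scaling step.
scaled_knn = make_pipeline(
    MinMaxScaler(),
    KNeighborsClassifier(n_neighbors=3, weights="distance"),
)
scores = cross_val_score(scaled_knn, X, y, cv=5, scoring="accuracy")
print(scores.round(3), round(scores.mean(), 3))
```

When such a pipeline is tuned with `GridSearchCV`, the hyperparameter names take the step name as a prefix, for example `kneighborsclassifier__n_neighbors`.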

# Defining hyperparameter grids
(Gradient boosting is left out of the grid search because tuning it takes too much time; it is trained with default hyperparameters further below.)

In [37]:
param_grid_knn = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance']
}

param_grid_dt = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [9, 10, 20],
    'min_samples_leaf': [1, 3, 5],
    'splitter': ['best', 'random'],
    'min_impurity_decrease': [0.0, 0.01],
    'min_weight_fraction_leaf': [0.0, 0.1],
    'max_features': [None, 'sqrt', 'log2'],
}
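
Gradient boosting is excluded from the grids because of its training time. One cheaper alternative to an exhaustive grid is `RandomizedSearchCV`, which evaluates only `n_iter` sampled combinations. The following is a minimal sketch under stated assumptions: it runs on the first 400 samples of the bundled `load_digits` data rather than on MNIST, and the candidate values in `param_dist_gb` are illustrative, not tuned recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Assumption: a small slice of the bundled digits data keeps the example fast.
X, y = load_digits(return_X_y=True)
X_small, y_small = X[:400], y[:400]

# Hypothetical candidate values for a few gradient boosting hyperparameters.
param_dist_gb = {
    "n_estimators": [10, 20],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3],
    "subsample": [0.8, 1.0],
}

# Only n_iter=3 of the 24 possible combinations are evaluated, each with 3-fold CV.
random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_dist_gb,
    n_iter=3,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_small, y_small)
print(random_search.best_params_)
print(round(random_search.best_score_, 3))
```

The cost of the search is controlled by `n_iter` and `cv`, independently of how many values each hyperparameter lists.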

# Initializing models

In [38]:
models = {
    "KNN": (KNeighborsClassifier(), param_grid_knn),
    "Decision Tree": (DecisionTreeClassifier(random_state=42), param_grid_dt),
}

# Performing Grid Search

In [39]:
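best_models = {}

for name, (model, param_grid) in models.items():
    # Exhaustive search over the grid with 5-fold cross-validation on all CPU cores
    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
    grid_search.fit(x_train, y_train)
    # best_estimator_ is refit on the whole training set (refit=True by default)
    best_models[name] = grid_search.best_estimator_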


print(best_models)

{'KNN': KNeighborsClassifier(n_neighbors=3, weights='distance'), 'Decision Tree': DecisionTreeClassifier(criterion='entropy', max_depth=20, random_state=42)}
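
The loop above keeps only `best_estimator_`. A fitted `GridSearchCV` also exposes `best_params_` and `best_score_`, which allow the algorithms to be compared by their cross-validated accuracy before the test set is touched. The sketch below illustrates this on the bundled `load_digits` data with deliberately tiny grids. The names `candidates` and `searches` are hypothetical and not part of the notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled digits dataset stands in for the MNIST CSV.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Deliberately small grids so the example runs quickly.
candidates = {
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5]}),
    "Decision Tree": (DecisionTreeClassifier(random_state=42), {"max_depth": [5, 10]}),
}

# Keep the whole search object, not just the best estimator.
searches = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    searches[name] = search.fit(X_tr, y_tr)
    print(name, search.best_params_, round(search.best_score_, 3))

# Choose the algorithm with the highest mean cross-validated accuracy.
best_name = max(searches, key=lambda n: searches[n].best_score_)
best_model = searches[best_name].best_estimator_  # already refit on all of X_tr
print(best_name, round(best_model.score(X_te, y_te), 3))
```

Because `refit=True` is the default, `best_estimator_` can be evaluated on the test set directly without a separate training step.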


# Doing it Manually (KNN)

In [49]:
# Hyperparameters selected by the grid search above
knn_model = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn_model = knn_model.fit(x_train, y_train)

In [56]:
y_pred_knn = knn_model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred_knn)
report = classification_report(y_test, y_pred_knn)
conf_matrix = confusion_matrix(y_test, y_pred_knn)

print(f"{RED}accuracy using KNN model:- {BLUE}{accuracy*100}%")
print(f"{RED}classification report:- {BLUE}")
print(report)
print(f"{RED}confusion matrix:- {BLUE}")
print(conf_matrix)

accuracy using KNN model:- 92.375%
classification report:-
              precision    recall  f1-score   support

           0       0.96      1.00      0.98        70
           1       0.91      0.99      0.95       100
           2       0.98      0.89      0.94        73
           3       0.92      0.92      0.92        86
           4       0.88      0.89      0.88        80
           5       0.86      0.94      0.90        64
           6       0.99      0.94      0.97        90
           7       0.91      0.96      0.93        67
           8       0.95      0.86      0.91        94
           9       0.88      0.86      0.87        76

    accuracy                           0.92       800
   macro avg       0.92      0.92      0.92       800
weighted avg       0.93      0.92      0.92       800

confusion matrix:-
[[70  0  0  0  0  0  0  0  0  0]
 [ 0 99  0  0  1  0  0  0  0  0]
 [ 1  3 65  2  0  0  0  0  1  1]
 [ 1  0  0 79  1  3  0  2  0  0]
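
The classification report and the confusion matrix carry the same information in different forms. As a sketch of how they relate, the code below derives per-class precision, recall, and overall accuracy from a small hypothetical 3-class matrix (`cm`, invented for illustration and not taken from the output above).

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows are true labels, columns are predictions.
cm = np.array([
    [8, 1, 1],
    [0, 9, 1],
    [2, 0, 8],
])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # column sums = number of predictions per class
recall = true_positives / cm.sum(axis=1)     # row sums = number of true samples per class (support)
accuracy = true_positives.sum() / cm.sum()

print(precision.round(2), recall.round(2), round(accuracy, 3))
```

The same reading applies to the KNN output: the first row has 70 on the diagonal out of a support of 70, which matches the recall of 1.00 reported for digit 0, and the second row has 99 out of 100, matching the recall of 0.99 for digit 1.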

# Doing it Manually (Decision Tree)

In [53]:
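# Note: these hyperparameters are set by hand and differ from the grid-search
# result printed earlier (criterion='entropy', max_depth=20).
dt_model = DecisionTreeClassifier(random_state=42, criterion='gini', max_depth=9,
                                  min_samples_leaf=3, splitter='best',
                                  min_impurity_decrease=0.0, min_weight_fraction_leaf=0.0,
                                  max_features=None)
dt_model = dt_model.fit(x_train, y_train)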

In [59]:
y_pred_dt = dt_model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred_dt)
report = classification_report(y_test, y_pred_dt)
conf_matrix = confusion_matrix(y_test, y_pred_dt)

print(f"{RED}accuracy using DT model:- {BLUE}{accuracy*100}%")
print(f"{RED}classification report:- {BLUE}")
print(report)
print(f"{RED}confusion matrix:- {BLUE}")
print(conf_matrix)

accuracy using DT model:- 75.375%
classification report:-
              precision    recall  f1-score   support

           0       0.87      0.86      0.86        70
           1       0.86      0.87      0.87       100
           2       0.69      0.63      0.66        73
           3       0.81      0.76      0.78        86
           4       0.72      0.75      0.74        80
           5       0.59      0.75      0.66        64
           6       0.77      0.77      0.77        90
           7       0.84      0.85      0.84        67
           8       0.73      0.56      0.63        94
           9       0.67      0.76      0.71        76

    accuracy                           0.75       800
   macro avg       0.75      0.76      0.75       800
weighted avg       0.76      0.75      0.75       800

confusion matrix:-
[[60  0  0  1  0  4  3  0  1  1]
 [ 0 87  0  1  3  2  0  1  6  0]
 [ 1  7 46  4  2  3  3  1  5  1]
 [ 1  3  2 65  1 10  0  3  0  1]
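
A fitted decision tree also exposes attributes and methods that help explain its behaviour, such as `get_depth()`, `get_n_leaves()`, and `feature_importances_`. The sketch below shows them on the bundled `load_digits` data, which is an assumption standing in for the MNIST CSV. The hyperparameters are copied from the manual decision tree cell, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled digits dataset stands in for the MNIST CSV.
X, y = load_digits(return_X_y=True)
tree = DecisionTreeClassifier(random_state=42, max_depth=9, min_samples_leaf=3).fit(X, y)

# Size of the fitted tree; the depth can be smaller than max_depth.
print(tree.get_depth(), tree.get_n_leaves())

# feature_importances_ sums to 1; list the five pixels the tree relies on most.
top_pixels = np.argsort(tree.feature_importances_)[::-1][:5]
print(top_pixels, tree.feature_importances_[top_pixels].round(3))
```

For the notebook's model, the same calls would be made on `dt_model`, with the indices mapped back to column names through `x_train.columns`.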


# Doing it Manually (Gradient Boost)

In [29]:
# Default hyperparameters; gradient boosting was not tuned because of its training time
gb_model = GradientBoostingClassifier(random_state=42)
gb_model = gb_model.fit(x_train, y_train)
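
Since training time is the reason gradient boosting was not tuned, early stopping is one way to limit it. `GradientBoostingClassifier` accepts `n_iter_no_change` and `validation_fraction`, and reports the number of boosting stages actually fitted in `n_estimators_`. The sketch below is an illustration on the bundled `load_digits` data, not the configuration used in this notebook, and the chosen values are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Assumption: the bundled digits dataset stands in for the MNIST CSV.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

gb_early = GradientBoostingClassifier(
    n_estimators=60,          # upper bound on the number of boosting stages
    learning_rate=0.1,
    subsample=0.8,            # each stage is fitted on a random 80% of the training rows
    validation_fraction=0.1,  # held out internally to monitor the validation loss
    n_iter_no_change=5,       # stop after 5 stages without improvement
    random_state=42,
)
gb_early.fit(X_tr, y_tr)

# n_estimators_ is the number of stages actually fitted.
print(gb_early.n_estimators_, round(gb_early.score(X_te, y_te), 3))
```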

In [58]:
y_pred_gb = gb_model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred_gb)
report = classification_report(y_test, y_pred_gb)
conf_matrix = confusion_matrix(y_test, y_pred_gb)

print(f"{RED}accuracy using GB model:- {BLUE}{accuracy*100}%")
print(f"{RED}classification report:- {BLUE}")
print(report)
print(f"{RED}confusion matrix:- {BLUE}")
print(conf_matrix)

accuracy using GB model:- 91.0%
classification report:-
              precision    recall  f1-score   support

           0       0.99      0.94      0.96        70
           1       0.94      0.95      0.95       100
           2       0.88      0.89      0.88        73
           3       0.94      0.88      0.91        86
           4       0.92      0.96      0.94        80
           5       0.80      0.92      0.86        64
           6       0.94      0.88      0.91        90
           7       0.98      0.91      0.95        67
           8       0.89      0.87      0.88        94
           9       0.84      0.89      0.87        76

    accuracy                           0.91       800
   macro avg       0.91      0.91      0.91       800
weighted avg       0.91      0.91      0.91       800

confusion matrix:-
[[66  0  1  0  0  2  0  0  0  1]
 [ 0 95  0  1  1  0  1  0  2  0]
 [ 0  2 65  1  1  0  1  0  2  1]
 [ 0  0  1 76  1  4  2  1  0  1]
 [