# Stochastic Gradient Descent
You should build an end-to-end machine learning pipeline using a stochastic gradient descent model. In particular, you should do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Build an end-to-end machine learning pipeline, including a [stochastic gradient descent](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) model.
- Optimize your pipeline by validating your design decisions. 
- Test the best pipeline on the test set and report various [evaluation metrics](https://scikit-learn.org/0.15/modules/model_evaluation.html).  
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as pt
import seaborn as sns
%matplotlib inline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/m-mahdavi/teaching/main/datasets/mnist.csv')
df.head(5)

Unnamed: 0,id,class,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
0,31953,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,34452,8,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,60897,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,36953,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1981,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
df.isna().sum()

id          0
class       0
pixel1      0
pixel2      0
pixel3      0
           ..
pixel780    0
pixel781    0
pixel782    0
pixel783    0
pixel784    0
Length: 786, dtype: int64

In [5]:
df[df.duplicated()]

Unnamed: 0,id,class,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784


In [6]:
X = df.drop('class',axis=1)
X.sample(5)

Unnamed: 0,id,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
2289,65768,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2980,27962,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1816,54288,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
252,52019,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3744,25135,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
y = df['class']
y.sample(5)

2000    7
3304    1
3563    7
1440    5
3292    5
Name: class, dtype: int64

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42)

In [9]:
print(f'The X_train size is : {X_train.shape}')
print(f'The X_test size is  : {X_test.shape}')
print(f'The y_train size is : {y_train.shape}')
print(f'The y_test size is  : {y_test.shape}')

The X_train size is : (3000, 785)
The X_test size is  : (1000, 785)
The y_train size is : (3000,)
The y_test size is  : (1000,)


In [10]:
categorical_attributes = X_train.select_dtypes(['object']).columns
numerical_attributes = X_train.select_dtypes(['int64','float64']).columns

In [11]:
ct = ColumnTransformer ([
    ('OneHotEncoder',OneHotEncoder(handle_unknown='ignore'),categorical_attributes),
    ('StandardScaler',StandardScaler(),numerical_attributes)
     ])

x_train = ct.fit_transform(X_train)
x_test = ct.fit_transform(X_test)

In [12]:
sgd_clf = SGDClassifier()

In [13]:
param_grid = {
    'loss' : ['hinge','log_loss'],
    'penalty' : ['l2', 'l1'],
    'alpha' : [0.0001,0.001,0.01]
}

In [14]:
grid_search = GridSearchCV(sgd_clf, param_grid, cv=2, scoring='accuracy', n_jobs=1)

In [15]:
grid_search.fit(x_train,y_train)

In [34]:
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f'Best model : {best_model}')
print(f'Best params : {best_params}')
print(f'Best score  : {best_score}')

Best model : SGDClassifier()
Best params : {'alpha': 0.0001, 'loss': 'hinge', 'penalty': 'l2'}
Best score  : 0.8733333333333333


In [36]:
y_pred = best_model.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

print('Classification Report:')
print(classification_report(y_test,y_pred))

Accuracy: 0.881
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.98      0.96        90
           1       0.96      0.97      0.96       122
           2       0.85      0.82      0.83        92
           3       0.85      0.85      0.85       103
           4       0.94      0.91      0.93       103
           5       0.80      0.81      0.81        86
           6       0.91      0.92      0.91       107
           7       0.92      0.88      0.90        89
           8       0.83      0.79      0.81       114
           9       0.80      0.87      0.83        94

    accuracy                           0.88      1000
   macro avg       0.88      0.88      0.88      1000
weighted avg       0.88      0.88      0.88      1000

