# Logistic Regression
You should build a machine learning pipeline using a logistic regression model. In particular, you should do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). 
- Train and test a logistic regression model using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.
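
As a rough orientation for the last point, the sketch below touches a few commonly used hyperparameters (`C`, `max_iter`), fitted attributes (`classes_`, `coef_`, `intercept_`, `n_iter_`), and methods (`predict`, `predict_proba`, `score`). It assumes Scikit-Learn's bundled `load_digits` data as a stand-in, not the course's `mnist` file, and the chosen values are illustrative rather than recommended settings.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): scikit-learn's bundled 8x8 digits, not mnist.csv.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hyperparameters: C is the inverse regularization strength,
# max_iter caps the number of solver iterations.
model = LogisticRegression(C=1.0, max_iter=5000)
model.fit(X_train, y_train)

# Attributes learned during fit (trailing underscore by convention).
print("classes_:", model.classes_)                  # labels seen during training
print("coef_ shape:", model.coef_.shape)            # (n_classes, n_features)
print("intercept_ shape:", model.intercept_.shape)  # (n_classes,)
print("n_iter_:", model.n_iter_)                    # iterations the solver used

# Methods: predict, predict_proba, score.
proba = model.predict_proba(X_test[:3])  # one row of class probabilities per sample
print("predict:", model.predict(X_test[:3]))
print("probability rows sum to:", proba.sum(axis=1))
score = model.score(X_test, y_test)      # mean accuracy on the held-out part
print("score:", score)
```

`score` on a classifier reports mean accuracy, so it should agree with calling `accuracy_score` on the same predictions.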

In [34]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [9]:
df = pd.read_csv("/Users/amulya/Documents/GISMA Academics/Data Science/teaching-main/datasets/mnist.csv")
df

,id,class,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
0,31953,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,34452,8,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,60897,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,36953,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1981,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3995,25268,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3996,6473,6,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3997,5821,7,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3998,1751,9,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
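# Note: set_index returns a new DataFrame; the result is only displayed here,
# so df itself keeps "id" as a regular column
df.set_index("id")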

id,class,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
31953,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
34452,8,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60897,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36953,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1981,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25268,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6473,6,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5821,7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1751,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
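
The cell above only displays the re-indexed table. The small sketch below illustrates that `set_index` returns a new DataFrame and leaves the original untouched unless the result is assigned. The three-column frame is a hypothetical miniature, with ids and classes borrowed from the first two rows shown above.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical miniature of the real table).
df = pd.DataFrame({"id": [31953, 34452], "class": [5, 8], "pixel1": [0, 0]})

df.set_index("id")            # returns a new frame; df itself is unchanged
print(list(df.columns))       # "id" is still a regular column

indexed = df.set_index("id")  # keep the result to make the change stick
print(list(indexed.columns))  # "id" has moved into the index
print(indexed.index.name)
```

Whether `id` should be moved out of the columns before training is a modelling decision; the sketch only shows the mechanics.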


In [13]:
# Features: every column except the target "class"
# (axis=1 drops a column; axis=0 would drop a row)
# Note: X still contains the "id" column, because the result of
# df.set_index("id") above was not assigned back to df
X = df.drop(['class'], axis=1)
# Target values
y = df["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

print("df : ",df.shape)
print("X_train : ",X_train.shape)
print("X_test : ",X_test.shape)
print("y_train : ",y_train.shape)
print("y_test : ",y_test.shape)

df :  (4000, 786)
X_train :  (3000, 785)
X_test :  (1000, 785)
y_train :  (3000,)
y_test :  (1000,)
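
The split above is reshuffled on every run because no `random_state` is given. As a hedged sketch, the example below shows how `random_state` makes the split repeatable and how `stratify` keeps class proportions equal in both parts. The data is synthetic (an assumption: 4000 rows, 10 balanced classes, 5 random features) and the seed 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 4000 rows, 10 balanced classes, 5 features.
rng = np.random.default_rng(0)
X = rng.random((4000, 5))
y = np.repeat(np.arange(10), 400)

# random_state fixes the shuffle; stratify=y preserves class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train))  # per-class counts in the training part
print(np.bincount(y_test))   # per-class counts in the test part
```

Fixing the seed mainly helps when comparing models, since every run then sees the same training and test rows.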


In [32]:
log_reg = LogisticRegression(max_iter=10000, random_state=0).fit(X_train, y_train)

Warning: if `max_iter` is too small, the solver stops before it has converged and Scikit-Learn emits a `ConvergenceWarning`. Increase `max_iter` (or scale the features) so that the optimisation can finish.
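
To make the warning concrete, the sketch below provokes it with a deliberately tiny `max_iter` and then fits the same model behind a `StandardScaler`, which usually helps the solver converge. It assumes Scikit-Learn's bundled `load_digits` data as a stand-in for the course dataset; the values 5 and 1000 are illustrative only.

```python
import warnings

from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled digits set.
X, y = load_digits(return_X_y=True)

# Deliberately tiny iteration budget to provoke the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    LogisticRegression(max_iter=5).fit(X, y)
hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("ConvergenceWarning with max_iter=5:", hit_limit)

# Standardizing the features usually lets the solver converge sooner.
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled.fit(X, y)
print("iterations used after scaling:", scaled[-1].n_iter_)
```

Comparing `n_iter_` with `max_iter` is a quick way to check whether the solver stopped on its own or ran into the limit.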

In [30]:
y_predict = log_reg.predict(X_test)

In [37]:
accuracy = accuracy_score(y_test, y_predict)
accuracy

0.881
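
An accuracy of 0.881 on 1000 test rows means 881 predictions matched the true label. The sketch below, using made-up labels rather than the notebook's predictions, shows that `accuracy_score` is simply the fraction of matches, and adds a confusion matrix to see which classes get mixed up.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels (hypothetical), not the notebook's predictions.
y_true = np.array([0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 2, 2, 2, 1])

# Accuracy is the fraction of predictions that match the true label.
acc = accuracy_score(y_true, y_pred)
by_hand = np.mean(y_true == y_pred)
print(acc, by_hand)  # 4 of 6 predictions are correct

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Off-diagonal entries of the matrix count the misclassifications, which a single accuracy number hides.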