# Day 08. Exercise 03
# Overfitting

## 0. Imports

In [1]:
import joblib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.linear_model import LogisticRegression as LR
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, GridSearchCV

## 1. Preprocessing

1. Read the file `dayofweek.csv` into a dataframe.
2. Using `train_test_split` with parameters `test_size=0.2` and `random_state=21`, get `X_train`, `y_train`, `X_test`, `y_test`.
3. Use, for example, `value_counts()` to check whether the distribution of classes is similar in train and test.
4. Add the parameter `stratify=` and check the distribution again. It should now be roughly the same in both datasets.
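
The effect of `stratify=` is easiest to see on class proportions rather than raw counts. The sketch below is only an illustration under assumptions: it uses a small synthetic, imbalanced dataset (the `demo` frame, the `feature` column, and the `class_shares` helper are hypothetical, not part of the exercise data) and compares train/test class shares with and without stratification.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (hypothetical columns).
rng = np.random.default_rng(21)
demo = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "dayofweek": rng.choice([0, 1, 2, 3], size=1000, p=[0.55, 0.25, 0.15, 0.05]),
})
X_demo = demo.drop("dayofweek", axis=1)
y_demo = demo["dayofweek"]


def class_shares(X, y, **split_kwargs):
    """Return the class proportions of the train and test targets side by side."""
    _, _, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=21, **split_kwargs
    )
    return pd.DataFrame({
        "train": y_tr.value_counts(normalize=True),
        "test": y_te.value_counts(normalize=True),
    }).sort_index()


plain = class_shares(X_demo, y_demo)
stratified = class_shares(X_demo, y_demo, stratify=y_demo)
print(plain, stratified, sep="\n\n")

# Largest train/test difference in class share for each kind of split.
plain_gap = (plain["train"] - plain["test"]).abs().max()
stratified_gap = (stratified["train"] - stratified["test"]).abs().max()
print(f"max gap without stratify: {plain_gap:.4f}")
print(f"max gap with stratify:    {stratified_gap:.4f}")
```

With stratification the per-class shares in train and test can differ only by rounding, so the second gap should be close to zero.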

In [None]:
df = pd.read_csv('../data/dayofweek.csv')
df

In [None]:
x = df.drop('dayofweek', axis=1)
y = df['dayofweek']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.2,
                                                    random_state=21)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, 
                                                    test_size=0.2, 
                                                    random_state=21, 
                                                    stratify=y)

In [None]:
y_train.value_counts()

In [None]:
y_test.value_counts()

## 2. Baseline models

1. Train exactly the same baseline models as in the previous exercise and calculate their accuracies on the test dataset obtained with stratification.
2. Did all the models show similar values of the metric? Which one shows the largest difference between the current exercise and the previous one? Put the answer in a markdown cell at the end of the section.

### a. Logreg

In [None]:
lr = LR(random_state=21, fit_intercept=False, max_iter=3000).fit(X_train, y_train)
lr_pred = lr.predict(X_test)

In [None]:
print("Accuracy:", 
      metrics.accuracy_score(y_test, lr_pred))

### b. SVM

In [None]:
svc = SVC(kernel='linear', probability=True, random_state=21).fit(X_train, y_train)
svc_pred = svc.predict(X_test)

In [None]:
print("Accuracy:", 
      metrics.accuracy_score(y_test, svc_pred))

### c. Decision tree

In [None]:
dtc = DTC(max_depth=4, random_state=21).fit(X_train, y_train)
dtc_pred = dtc.predict(X_test)

In [None]:
print("Accuracy:", 
      metrics.accuracy_score(y_test, dtc_pred))

### d. Random forest

In [None]:
rfc = RFC(max_depth=4, random_state=21).fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)

In [None]:
print("Accuracy:", 
      metrics.accuracy_score(y_test, rfc_pred))

## 3. Cross-validation

We could tune the parameters of the model to achieve better accuracy on the test dataset, but that is bad practice: it leads to overfitting again, this time to the test dataset. The test dataset is only for checking the quality of the final model.

But there is another way to solve the problem: cross-validation. It does not use the test dataset; instead it creates one more split of the train dataset. There are different ways of doing it, but what they have in common is a validation dataset that is used for hyperparameter optimization.
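
To make the idea of "one more split of the train dataset" concrete, here is a minimal sketch of k-fold cross-validation written as an explicit loop. It is an illustration under assumptions, not the exercise solution: the data is synthetic (`X_demo`, `y_demo` stand in for the train dataset), and `StratifiedKFold` with 5 folds is just one of the possible splitting schemes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the train dataset (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=21
)

# Each fold serves once as the validation part; the rest is used for fitting.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=21)
fold_scores = []
for fit_idx, val_idx in folds.split(X_demo, y_demo):
    model = DecisionTreeClassifier(max_depth=4, random_state=21)
    model.fit(X_demo[fit_idx], y_demo[fit_idx])
    # score() returns accuracy for classifiers.
    fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))

fold_scores = np.array(fold_scores)
print("accuracy per fold:", np.round(fold_scores, 3))
print(f"mean: {fold_scores.mean():.3f}, std: {fold_scores.std():.3f}")
```

The test dataset never appears in this loop, which is why the resulting scores can be used for choosing hyperparameters.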

1. Using `cross_val_score` with `cv=10`, calculate the mean accuracy and standard deviation for every model that you used before (logreg with `solver='liblinear'`, SVC, decision tree, random forest).
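
One way to organize this step is to loop over the models and collect the mean and standard deviation for each. The sketch below assumes a synthetic two-class dataset in place of the real train data (`X_demo`, `y_demo`, and the `models` and `results` dictionaries are hypothetical names), so the numbers it prints say nothing about the exercise data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class stand-in for the train dataset (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=21
)

models = {
    "logreg": LogisticRegression(solver="liblinear", random_state=21),
    "svc": SVC(kernel="linear"),
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=21),
    "random forest": RandomForestClassifier(max_depth=4, random_state=21),
}

results = {}
for name, model in models.items():
    # One accuracy value per fold, 10 folds in total.
    scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=10)
    results[name] = (scores.mean(), scores.std())
    print(f"{name:>14}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A large standard deviation across folds is a hint that a model's accuracy depends strongly on which rows it happened to see.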

### a. Logreg

In [None]:
lr = LR(solver='liblinear', random_state=21)
# Cross-validate on the train split only, so the test split stays untouched
scores = cross_val_score(lr, X_train, y_train, scoring='accuracy', cv=10)

In [None]:
np.mean(scores)

In [None]:
np.std(scores)

### b. SVM

### c. Decision tree

### d. Random forest

## 4. Optimization

1. Choose the best model and experiment with its parameters using cross-validation to find a good enough parameter or combination of parameters.
2. Calculate the accuracy for the final model on the test dataset.
3. Draw a plot that displays the top-10 most important features for that model.
4. Save the model using `joblib`.
5. Load the model, make predictions for the test dataset and calculate the accuracy.
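
The five steps above can be sketched end to end as follows. This is a hedged illustration, not the expected solution: it assumes the random forest is the chosen model, uses synthetic data with hypothetical feature names `f0`..`f14`, searches a small illustrative grid with `GridSearchCV`, and saves to a temporary path (`best_model.joblib` is a made-up file name).

```python
import os
import tempfile

import joblib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with hypothetical feature names.
X_arr, y_demo = make_classification(
    n_samples=500, n_features=15, n_informative=6, random_state=21
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(15)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21, stratify=y_demo
)

# 1. Tune on cross-validation only; the test split is not used here.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8]}  # illustrative grid
search = GridSearchCV(
    RandomForestClassifier(random_state=21), param_grid, scoring="accuracy", cv=5
)
search.fit(X_tr, y_tr)
best_model = search.best_estimator_  # refitted on the whole train split

# 2. A single final check on the test split.
test_accuracy = accuracy_score(y_te, best_model.predict(X_te))

# 3. Top-10 features by impurity-based importance.
importances = pd.Series(best_model.feature_importances_, index=X_tr.columns)
top10 = importances.nlargest(10)
fig, ax = plt.subplots(figsize=(8, 5))
top10.sort_values().plot.barh(ax=ax)
ax.set_xlabel("Feature importance")
ax.set_title("Top-10 most important features")

# 4-5. Save the model, load it back, and repeat the test prediction.
model_path = os.path.join(tempfile.mkdtemp(), "best_model.joblib")
joblib.dump(best_model, model_path)
loaded_model = joblib.load(model_path)
loaded_accuracy = accuracy_score(y_te, loaded_model.predict(X_te))

print("best params:", search.best_params_)
print(f"test accuracy: {test_accuracy:.3f}, after reload: {loaded_accuracy:.3f}")
```

The reloaded model should reproduce the test accuracy exactly, which is a quick sanity check that the saved file holds the model you evaluated.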