# Assignment

In this assignment, we want to implement cross-validation for logistic regression. Cross-validation is a powerful technique for model selection (such as when choosing the right hyper-parameters), especially when the data size is not very large. The goal of this assignment is to first implement cross-validation and compare it to a baseline model (with no cross-validation).
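As a rough illustration of what cross-validation does when it is used for model selection, the sketch below scores a few values of the regularization parameter `C` by hand. The synthetic data from `make_classification`, the helper `cv_accuracy`, and the grid `candidate_Cs` are all assumptions made for this example; they are not part of the lab code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced stand-in data (an assumption; not the Boston data).
X, y = make_classification(n_samples=400, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

def cv_accuracy(C, X, y, n_splits=5):
    """Mean validation accuracy of LogisticRegression(C=C) over stratified folds."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, val_idx in folds.split(X, y):
        model = LogisticRegression(C=C, max_iter=5000)
        model.fit(X[train_idx], y[train_idx])                 # fit on k-1 folds
        scores.append(model.score(X[val_idx], y[val_idx]))    # score on the held-out fold
    return float(np.mean(scores))

candidate_Cs = [0.01, 0.1, 1.0, 10.0]   # hypothetical grid of hyper-parameter values
mean_scores = {C: cv_accuracy(C, X, y) for C in candidate_Cs}
best_C = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best C:", best_C)
```

Each candidate is scored only on data it was not trained on, so the comparison between candidates does not need to touch the test set.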

1. Refactor the code from the lab and train and evaluate the `LogisticRegression` classifier just like we did in the lab. <span style="color:red" float:right>[2 point]</span>

In [56]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
import pandas as pd
import seaborn as sns
import numpy as np
boston = load_boston()
df_boston = pd.DataFrame(boston['data'], columns = boston['feature_names'])

In [57]:
# Creates the binary target: True when the median home value (recorded in $1000s) is at least 40
df_boston['is_above_40k'] = boston['target'] >= 40
# Creates training and test data from Boston dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df_boston.drop(columns = 'is_above_40k'), 
                                                    df_boston['is_above_40k'], 
                                                    test_size = 0.20, 
                                                    random_state = 0)

In [58]:
from sklearn.linear_model import LogisticRegression
# Runs a basic Logistic regression
logit = LogisticRegression(max_iter=5000)

logit.fit(x_train, y_train)
y_test_pred = logit.predict(x_test)

In [59]:
# Prints stats
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

       False       0.96      1.00      0.98        95
        True       1.00      0.43      0.60         7

    accuracy                           0.96       102
   macro avg       0.98      0.71      0.79       102
weighted avg       0.96      0.96      0.95       102



2. The `LogisticRegression` classifier has an argument called `class_weight`. Read the documentation to see what it does, then train a new model this time by providing the class weights. <span style="color:red" float:right>[2 point]</span>

In [60]:
# Runs a logistic regression with balanced class weights to account for uneven categories
logit_balanced = LogisticRegression(max_iter = 5000, class_weight = 'balanced')
logit_balanced.fit(x_train, y_train)
y_test_pred_balanced = logit_balanced.predict(x_test)
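To make `class_weight = 'balanced'` more concrete, the sketch below reproduces the heuristic scikit-learn documents for it, `n_samples / (n_classes * np.bincount(y))`, and checks it against `compute_class_weight`. The label array `y_example` is an assumption: it reuses the 95/7 split from the test report purely as an illustration, since the class counts of the training split are not shown here.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels only: 95 negatives (0 = False) and 7 positives (1 = True).
y_example = np.array([0] * 95 + [1] * 7)
classes = np.unique(y_example)

# The 'balanced' heuristic: n_samples / (n_classes * count_per_class)
manual = len(y_example) / (len(classes) * np.bincount(y_example))

# The same weights as computed by scikit-learn's helper
from_sklearn = compute_class_weight(class_weight="balanced", classes=classes, y=y_example)

for label, weight in zip(classes, manual):
    print(f"class {label}: weight {weight:.3f}")
```

The rare class receives a weight of about 7.29 and the common class about 0.54, so each misclassified minority sample costs the model roughly 13.6 times as much during training.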

3. Does it change any of the results? In what way? <span style="color:red" float:right>[2 point]</span>

In [61]:
# Prints stats
print('Balanced:')
print(classification_report(y_test, y_test_pred_balanced))
print('Basic:')
print(classification_report(y_test, y_test_pred))

Balanced:
              precision    recall  f1-score   support

       False       0.97      0.97      0.97        95
        True       0.57      0.57      0.57         7

    accuracy                           0.94       102
   macro avg       0.77      0.77      0.77       102
weighted avg       0.94      0.94      0.94       102

Basic:
              precision    recall  f1-score   support

       False       0.96      1.00      0.98        95
        True       1.00      0.43      0.60         7

    accuracy                           0.96       102
   macro avg       0.98      0.71      0.79       102
weighted avg       0.96      0.96      0.95       102



This follows the problem discussed in class: the model with the highest accuracy is not necessarily the best model for future data when the classes are imbalanced. For example, 95 of the 102 test samples (about 93%) are `False` and only 7 are `True`, so a model can reach high accuracy by leaning heavily toward `False` and paying little attention to the traits of the minority class. Adding class weights lowers accuracy (0.94 vs. 0.96) and precision on `True` (0.57 vs. 1.00), but raises recall on `True` (0.57 vs. 0.43), so the model identifies more of the minority class instead of defaulting to the majority class.
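The sketch below illustrates the point with the test-set supports from the reports (95 `False`, 7 `True`). The prediction arrays are assumptions: `always_false` is a hypothetical classifier that ignores the features, and `basic_like` and `balanced_like` are rebuilt from the rounded precision and recall values, so they are an inference and not the notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 95 + [1] * 7)  # 0 = False, 1 = True

# Hypothetical predictions (see the caveat above).
always_false = np.zeros(102, dtype=int)                    # never predicts True
basic_like = np.array([0] * 95 + [1] * 3 + [0] * 4)        # TN=95, FP=0, TP=3, FN=4
balanced_like = np.array([0] * 92 + [1] * 3                # TN=92, FP=3
                         + [1] * 4 + [0] * 3)              # TP=4,  FN=3

results = {}
for name, y_pred in [("always False", always_false),
                     ("basic-like", basic_like),
                     ("balanced-like", balanced_like)]:
    acc = accuracy_score(y_true, y_pred)
    bal = balanced_accuracy_score(y_true, y_pred)   # mean of per-class recall
    results[name] = (acc, bal)
    print(f"{name:14s} accuracy={acc:.2f}  balanced accuracy={bal:.2f}")
```

Under these assumptions the always-`False` classifier scores 0.93 accuracy but only 0.50 balanced accuracy, while the weighted model trades a little accuracy (0.94 vs. 0.96) for a higher balanced accuracy (0.77 vs. 0.71).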

4. Return to the training step but use `LogisticRegressionCV` this time (the CV stands for cross-validation). <span style="color:red" float:right>[1 point]</span>

In [62]:
from sklearn.linear_model import LogisticRegressionCV

# Runs a logistic regression with 5-fold cross-validation and 10,000 max iterations
logitCV = LogisticRegressionCV(cv=5, max_iter = 10000)
logitCV.fit(x_train, y_train)
y_test_pred_CV = logitCV.predict(x_test)

5. Does cross-validation seem to make a difference in the results we get? <span style="color:red" float:right>[2 point]</span>

In [63]:
# Prints stats
print('Basic:')
print(classification_report(y_test, y_test_pred))
print('Cross Validation:')
print(classification_report(y_test, y_test_pred_CV))

Basic:
              precision    recall  f1-score   support

       False       0.96      1.00      0.98        95
        True       1.00      0.43      0.60         7

    accuracy                           0.96       102
   macro avg       0.98      0.71      0.79       102
weighted avg       0.96      0.96      0.95       102

Cross Validation:
              precision    recall  f1-score   support

       False       0.96      1.00      0.98        95
        True       1.00      0.43      0.60         7

    accuracy                           0.96       102
   macro avg       0.98      0.71      0.79       102
weighted avg       0.96      0.96      0.95       102



No. Cross-validation does not appear to make a difference, at least not at the two-decimal precision shown in these reports.
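One possible explanation, which was not checked in this notebook, is that the cross-validated search settles on a regularization strength that yields the same test predictions as the default `C=1.0`. The sketch below shows how that could be inspected through the fitted `Cs_` and `C_` attributes. It runs on synthetic data from `make_classification` (an assumption), so its numbers say nothing about the Boston results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

# Synthetic, imbalanced stand-in data (an assumption; not the Boston data).
X, y = make_classification(n_samples=400, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

plain = LogisticRegression(max_iter=5000).fit(X, y)               # fixed C=1.0 (the default)
searched = LogisticRegressionCV(cv=5, max_iter=5000).fit(X, y)    # picks C by cross-validation

# C_ may be an array (one entry per class) or a scalar; ravel handles both.
chosen_C = float(np.ravel(searched.C_)[0])
same_predictions = bool(np.array_equal(plain.predict(X), searched.predict(X)))

print("candidate Cs:", searched.Cs_)
print("chosen C:", chosen_C)
print("identical predictions:", same_predictions)
```

If the chosen `C` is close to 1.0, or the predictions are identical anyway, matching classification reports are to be expected.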

6. Change the number of folds from 5 to 10 and train the CV model again. Do you notice any difference in performance? Note that *performance* here refers to the model's overall accuracy, based on your choice of metric; it does NOT refer to run-time. <span style="color:red" float:right>[3 point]</span>

In [64]:
# Runs a logistic regression with 10-fold cross-validation and 10,000 max iterations
logitCV10 = LogisticRegressionCV(cv=10, max_iter = 10000)
logitCV10.fit(x_train, y_train)
y_test_pred_CV10 = logitCV10.predict(x_test)

In [65]:
print('Cross Validation 5:')
print(classification_report(y_test, y_test_pred_CV))
print('Cross Validation 10:')
print(classification_report(y_test, y_test_pred_CV10))

Cross Validation 5:
              precision    recall  f1-score   support

       False       0.96      1.00      0.98        95
        True       1.00      0.43      0.60         7

    accuracy                           0.96       102
   macro avg       0.98      0.71      0.79       102
weighted avg       0.96      0.96      0.95       102

Cross Validation 10:
              precision    recall  f1-score   support

       False       0.96      1.00      0.98        95
        True       1.00      0.43      0.60         7

    accuracy                           0.96       102
   macro avg       0.98      0.71      0.79       102
weighted avg       0.96      0.96      0.95       102
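To see what changing `cv` from 5 to 10 actually alters, the sketch below lists the fold sizes for a training set of 404 rows (the 506 Boston rows minus the 102 test rows). With an integer `cv`, `LogisticRegressionCV` uses stratified folds, so `StratifiedKFold` is used here. The positive count `n_positive` is a placeholder assumption, because the class counts of the training split are not shown.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

n_train = 404        # 506 Boston rows minus the 102 test rows
n_positive = 24      # placeholder; the real count in the training split is not shown
y_placeholder = np.array([1] * n_positive + [0] * (n_train - n_positive))
X_placeholder = np.zeros((n_train, 1))   # feature values do not affect the split

def validation_sizes(n_splits):
    """Number of held-out rows in each stratified fold."""
    folds = StratifiedKFold(n_splits=n_splits)
    return [len(val_idx) for _, val_idx in folds.split(X_placeholder, y_placeholder)]

for k in (5, 10):
    sizes = validation_sizes(k)
    print(f"cv={k}: {k} fits per candidate C, "
          f"validation folds of {min(sizes)} to {max(sizes)} rows")
```

More folds means each candidate is fitted more often on slightly more data and validated on smaller held-out sets. That affects how `C` is selected, but the final model is still refit on the full training set, so identical test reports are plausible.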



7. What was the cost of increasing the number of folds in terms of run-time? <span style="color:red" float:right>[2 point]</span>

In [67]:
%%time 
# Sets a timer for the cell

# same as above
logitCV = LogisticRegressionCV(cv=5, max_iter = 10000)
logitCV.fit(x_train, y_train)
y_test_pred_CV = logitCV.predict(x_test)

CPU times: user 9.48 s, sys: 47.2 ms, total: 9.52 s
Wall time: 9.46 s


In [68]:
%%time 
# Sets a timer for the cell

# same as above
logitCV10 = LogisticRegressionCV(cv=10, max_iter = 10000)
logitCV10.fit(x_train, y_train)
y_test_pred_CV10 = logitCV10.predict(x_test)

CPU times: user 16.5 s, sys: 98.5 ms, total: 16.6 s
Wall time: 16.5 s


In [69]:
# Ratio of the wall times reported above (16.5 s for 10 folds, 9.46 s for 5 folds)
print("going from 5 to 10 folds takes approx.", round(16.5/9.46, 2), "times as long")

going from 5 to 10 folds takes approx. 1.74 times as long
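Timings typed in by hand can drift out of sync with the cell output, so an alternative is to measure both fits in code and compute the ratio from the measurements. The sketch below does this with `time.perf_counter`. The helper `time_fit` and the synthetic data are assumptions for illustration, and the ratio it prints will vary by machine and dataset.

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in data (an assumption; not the Boston data).
X, y = make_classification(n_samples=400, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

def time_fit(n_folds):
    """Wall-clock seconds needed to fit LogisticRegressionCV with n_folds folds."""
    start = time.perf_counter()
    LogisticRegressionCV(cv=n_folds, max_iter=10000).fit(X, y)
    return time.perf_counter() - start

t5, t10 = time_fit(5), time_fit(10)
print(f"5 folds: {t5:.2f} s, 10 folds: {t10:.2f} s")
print(f"going from 5 to 10 folds takes approx. {t10 / t5:.2f} times as long")
```

A single measurement is noisy; repeating each fit a few times and averaging would give a steadier ratio.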


# End of assignment