## Logistic Regression: Fit and evaluate a model

This notebook uses the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

In this section, I will fit and evaluate a simple Logistic Regression model.

![CV](../img/CV.png)
![Cross-Val](../img/Cross-Val.png)

I use grid search within K-fold cross-validation to find the hyperparameter settings for logistic regression that produce the best model on my data.
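To make the idea concrete, here is a minimal sketch of what grid search within K-fold cross-validation does by hand. It assumes a synthetic dataset from `make_classification` as a stand-in for the Titanic features, and the names `candidate_C` and `mean_scores` are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (an assumption, not the Titanic set).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

candidate_C = [0.01, 1, 100]
mean_scores = {}
for C in candidate_C:
    model = LogisticRegression(C=C, max_iter=1000)
    # One accuracy score per fold; each fold is held out exactly once.
    fold_scores = cross_val_score(model, X, y, cv=5)
    mean_scores[C] = fold_scores.mean()

# The "best" setting is the one with the highest mean accuracy across folds.
best_C = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_C)
```

`GridSearchCV` wraps this loop, keeps the per-fold statistics, and by default refits the winning setting on the full training set.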

### Read in Data

In [4]:
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

tr_features = pd.read_csv('../data/train_features.csv')
tr_labels = pd.read_csv('../data/train_labels.csv')

### Hyperparameter tuning

![C](../img/c.png)

In [5]:
# GridSearchCV returns a lot of information, so this helper prints a cleaner summary.

# For every hyperparameter combination, it prints the average accuracy score across
# the 5 folds of my 5-fold cross-validation.

# It also prints twice the standard deviation of those 5 accuracy scores as a rough spread.

# That gives me all the information I need to select the best hyperparameter settings.
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))
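As an alternative view of the same information, `cv_results_` can be loaded into a pandas DataFrame and sorted by rank. This is a sketch on synthetic data (an assumption, not the Titanic features), and the names `search` and `summary` are only illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (an assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

search = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.01, 1, 100]}, cv=5)
search.fit(X, y)

# Keep only the columns that print_results reports, plus the rank, best first.
summary = (
    pd.DataFrame(search.cv_results_)
    [['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)
print(summary)
```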

In [6]:
# Define the parameter dictionary, as I do every time I use GridSearchCV.
# The key in this dictionary, C, must match the name of the hyperparameter
# in LogisticRegression.

# The value in the dictionary is the list of settings I want to explore for
# that hyperparameter.
lr = LogisticRegression()
parameters = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

# Call GridSearchCV, passing in the model object, my parameter dictionary and the
# number of folds I want for cross-validation.
cv = GridSearchCV(lr, parameters, cv=5)

# Use .fit to fit the model, passing in my features and labels.

# One note here: my training labels are stored as a single-column DataFrame, but scikit-learn
# expects a 1-D array. I convert the column to a flat array using values.ravel().
cv.fit(tr_features, tr_labels.values.ravel())

# Print the results by calling the function above.
print_results(cv)

# The results below show that when C is very low, which means strong regularization, the model
# does not perform well, likely because the heavy regularization causes it to underfit.

# Performance levels off from C = 1 upward: 0.8 at C = 1, then 0.794 to 0.796 for 10, 100 and 1000.

# When C is high, which means weak regularization, accuracy dips slightly. That would be
# consistent with the model fitting too closely to the training data, but the gap is much
# smaller than the spread across folds, so it is only weak evidence of overfitting.

# C equal to 1 gives the best mean accuracy.

BEST PARAMS: {'C': 1}

0.67 (+/-0.077) for {'C': 0.001}
0.708 (+/-0.098) for {'C': 0.01}
0.777 (+/-0.134) for {'C': 0.1}
0.8 (+/-0.118) for {'C': 1}
0.794 (+/-0.116) for {'C': 10}
0.796 (+/-0.111) for {'C': 100}
0.794 (+/-0.116) for {'C': 1000}
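The link between C and regularization strength described in the comments above can be checked directly by looking at the size of the fitted coefficients. The sketch below assumes synthetic data and the default L2 penalty; `coef_norms` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data (an assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

coef_norms = {}
for C in [0.001, 1, 1000]:
    # Small C means a strong penalty, which shrinks the coefficients toward zero.
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))

print(coef_norms)
```

The coefficient norm at C = 0.001 should come out much smaller than at C = 1000, which is the shrinkage that leads to underfitting at the low end of the grid.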


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
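The warning above suggests either raising `max_iter` or scaling the data. One way to try both, sketched here on synthetic data rather than the Titanic features, is to put a `StandardScaler` in front of the model inside a pipeline so that the scaling is fit on the training folds only. The `max_iter=1000` value and the exaggerated feature scale are my assumptions for illustration, and the scores in this notebook were not produced this way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (an assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X[:, 0] *= 1000  # exaggerate one feature's scale to mimic unscaled columns

# make_pipeline names each step after its lowercased class name.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Pipeline hyperparameters are addressed as <step name>__<parameter name>.
param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```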

In [7]:
# best_estimator_ is the GridSearchCV attribute that holds the model refit with the best hyperparameter setting
cv.best_estimator_
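A short sketch of how `best_estimator_` relates to the other attributes of a fitted `GridSearchCV`, assuming synthetic data and the default `refit=True`. The names `search`, `best_model` and `best_mean` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (an assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

search = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.01, 1, 100]}, cv=5)
search.fit(X, y)

# The winning model, refit on all of X and y with the best C.
best_model = search.best_estimator_

# best_score_ is the mean cross-validated score of the winning setting.
best_mean = search.cv_results_['mean_test_score'][search.best_index_]

# Predicting through the search object delegates to best_estimator_.
same_predictions = bool((search.predict(X) == best_model.predict(X)).all())
print(best_model.C, search.best_score_, same_predictions)
```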

### Write out pickled model
So that I can compare it to the best models from some of the other algorithms in the final notebook.

In [8]:
joblib.dump(cv.best_estimator_, '../data/LR_model.pkl')

['../data/LR_model.pkl']
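To read the model back in a later notebook, `joblib.load` is the counterpart of `joblib.dump`. The sketch below round-trips a model through a temporary directory so that it runs on its own; the synthetic data and the temporary path are assumptions, and in practice the path would be the `../data/LR_model.pkl` file written above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data (an assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'LR_model.pkl')
    joblib.dump(model, path)      # write the fitted model to disk
    restored = joblib.load(path)  # read it back as a new object

# The restored model should make the same predictions as the original.
predictions_match = np.array_equal(model.predict(X), restored.predict(X))
print(predictions_match)
```

Pickled models are tied to the library versions that wrote them, so it is safest to load them with the same scikit-learn version used for training.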