# Logistic Regression, Scaling & Hyperparameter Tuning

In this notebook you will see a short example of how to select a model, scale your data, and tune the hyperparameters of your models using grid or random search. 

We will use the titanic dataset.  
Since you've already worked your way through the steps of exploring and cleaning the data as well as selecting proper features for modelling in another notebook, we will skip this part here and use the **preprocessed data** from the logistic regression notebook. 

In [None]:
# Import packages 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from timeit import default_timer as timer

# eye candy plots
plt.style.use('https://github.com/dhaitz/matplotlib-stylesheets/raw/master/pitayasmoothie-light.mplstyle')
# source https://github.com/dhaitz/matplotlib-stylesheets

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

import warnings
warnings.filterwarnings("ignore")
RSEED = 10

In [None]:
# Import data 
df = pd.read_csv('data/titanic_preprocessed.csv')
df.head(2)

In [None]:
#... or we could reload the stored titanic_dmy from the other notebook
%store -r titanic_dmy
titanic_dmy.head(2)

But in this notebook, we will use the preprocessed data from the titanic_preprocessed.csv!

In [None]:
# Check for missing data
df.isnull().sum()

## Train-Test-Split

`train_test_split` splits arrays or matrices (here, our DataFrame) into random train and test subsets.  
The main idea of holding out a test set is to detect overfitting, i.e. a model that becomes really good at classifying the samples in the training set but cannot generalize and make accurate classifications on data it has not seen before.  

We will define the target and predictors and split our dataset into a train and test set.
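
As a minimal sketch of what a stratified split does, the following uses hypothetical toy arrays (`X_toy`, `y_toy` are not part of the Titanic data) and checks that the class proportion is preserved in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 30% of them in the positive class
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy keeps the share of each class equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train size:", len(X_tr), "test size:", len(X_te))
print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())
```

Without `stratify`, the positive share in a small test set can drift away from the share in the full dataset purely by chance.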

In [None]:
# Define predictors and target
y = df.Survived
X = df.drop('Survived', axis=1)

In [None]:
df.head(2)

In [None]:
# Check Y
y.head()

In [None]:
# Check X
X.head(2)

In [None]:
# Train-test-split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

## Logistic Regression in scikit-learn

In [None]:
#import Logistic Regression classifier from sklearn
from sklearn.linear_model import LogisticRegression

So, how easy is it to make some predictions now?

In [None]:
classifier = LogisticRegression() # instantiate a sklearn logistic regression class
classifier.fit(X_train, y_train) # fit the classifier/model on our train data 
y_prediction = classifier.predict(X_test) # use the fitted model to predict on our test data 

# have a look at the predictions
y_prediction[:10]

## Model performance metrics
example 1: confusion matrix

In [None]:
cm = confusion_matrix(y_test, y_prediction, labels=[0,1]) # assign a confusion matrix that compares test data and predictions 
cm

With `labels=[0,1]`, the rows of the matrix are the "actual" labels and the columns are the "predicted" labels,  
so the first row contains the True Negatives and False Positives, and the second row the False Negatives and True Positives.  
The confusion matrix tells us that 97 and 43 are the numbers of correct predictions, while 13 and 25 are the numbers of incorrect predictions.
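
To make the layout concrete, here is a small sketch on hypothetical toy labels (`y_true_toy` and `y_pred_toy` are made up for illustration) that unpacks the four cells of a binary confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, only for illustration
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_toy = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are actual labels, columns are predicted labels
cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])

# For a binary problem, ravel() flattens the matrix row by row
tn, fp, fn, tp = cm_toy.ravel()
print(cm_toy)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
print("accuracy:", (tn + tp) / cm_toy.sum())
```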

sklearn classification report

In [None]:
print(classification_report(y_test, y_prediction))

__Precision__ is the accuracy of positive predictions:  
Precision = TP/(TP + FP)  
  
__Recall__ tells you what percentage of the positive cases you actually caught,  
i.e. the fraction of positives that were correctly identified:  
Recall = TP/(TP+FN)
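
The two formulas can be checked against sklearn's own metric functions. The sketch below uses hypothetical toy labels, not the Titanic predictions:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels, only for illustration
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Count the cells needed for precision and recall
pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)

print("precision:", precision_manual, precision_score(y_true_toy, y_pred_toy))
print("recall:", recall_manual, recall_score(y_true_toy, y_pred_toy))
```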

**Let us learn about Classification Report: https://muthu.co/understanding-the-classification-report-in-sklearn/**

## Model selection

1. import all the classifiers you want to evaluate

In [None]:
#import all the classifiers you want to evaluate
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

2. append them to a list

In [None]:
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
models.append(('SGD', SGDClassifier(random_state=RSEED)))

3. iterate over the list and get a performance metric for every model  
(here, we chose "accuracy", the ratio of correctly classified cases to the total number of cases evaluated, but any other performance metric could be used)

In [None]:

# evaluate each model in turn
results = []
names = []
for name, model in models:
	kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
	cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
	results.append(cv_results)
	names.append(name)
	print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

Based on these results, we move on with the Logistic Regression classifier.

---
## Feature Scaling

Often the input features of your model have different units which means that the variables also have different scales. While some model types (e.g. tree-based models like decision tree or random forest) are unaffected by the scale of numerical input variables, many machine learning algorithms including for example algorithms using distance measures (e.g. KNN, SVM) perform better when the input features are scaled to a specific range. 

The most popular techniques for scaling are **normalization** and **standardization**. 

Check the [link](https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/) for further info. 

![scaling](images/normalization_vs_standardization.png)
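
To see why scale matters for distance-based models, the following sketch compares Euclidean distances before and after rescaling. The four rows are hypothetical passengers with made-up `[Age, Fare]` values, and the rescaling is a plain column-wise standardization written with NumPy:

```python
import numpy as np

# Hypothetical passengers with columns [Age, Fare]
toy = np.array([
    [22.0, 7.25],
    [60.0, 8.05],
    [24.0, 263.0],
    [35.0, 53.10],
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return np.linalg.norm(a - b)

# Raw units: the Fare column has the larger range and dominates the distance
d_raw_01 = dist(toy[0], toy[1])  # 38 years apart, almost the same fare
d_raw_02 = dist(toy[0], toy[2])  # almost the same age, very different fare

# Rescale each column: subtract its mean, divide by its standard deviation
toy_scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0)
d_scaled_01 = dist(toy_scaled[0], toy_scaled[1])
d_scaled_02 = dist(toy_scaled[0], toy_scaled[2])

print("raw:   ", round(d_raw_01, 2), round(d_raw_02, 2))
print("scaled:", round(d_scaled_01, 2), round(d_scaled_02, 2))
```

In raw units the second distance is several times larger only because fares span a wider numeric range than ages. After rescaling, both features contribute on a comparable scale.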

In [None]:
# Before we have a look at the different methods, 
# we have to define which columns we want to scale.
display(df.describe().round(2))
col_scale = ['Age', 'SibSp', 'Parch', 'Fare']

### Data Standardization 

To standardize a dataset, we rescale the distribution of values so that the mean of the observed values is 0 and the standard deviation is 1. You can think of it as centering the data by subtracting the mean and then dividing by the standard deviation. 
Sklearn provides the [Standard scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) for this.

A value is standardized as follows: 


$ x_{scaled} = \frac{x - \mu}{\sigma}  $, where 

$ \mu = \frac{\sum{x}}{m} $ is the mean, where m is the number of observations

$ \sigma = \sqrt{ \frac{\sum{ (x - \mu)^2 }}{m}} $ is the standard deviation
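
As a quick check of the formula, this sketch standardizes a hypothetical toy column by hand and compares the result with `StandardScaler`. Note that the standard deviation divides by m (NumPy's default `ddof=0`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, only for illustration
x_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

mu = x_toy.mean(axis=0)      # mean of the column
sigma = x_toy.std(axis=0)    # population standard deviation (divides by m)
manual = (x_toy - mu) / sigma

sk = StandardScaler().fit_transform(x_toy)

print("manual: ", manual.ravel().round(3))
print("sklearn:", sk.ravel().round(3))
print("identical:", np.allclose(manual, sk))
```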



In [None]:
# Scaling with StandardScaler
# fit computes the mean and standard deviation of each feature on the training data
# fit_transform fits and then transforms the training data using those values
# transform applies the values learned from the training data to new data (they are not re-estimated on the new data)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train[col_scale])
X_test_scaled = scaler.transform(X_test[col_scale])

# Concatenating scaled and dummy columns 
X_train_preprocessed = np.concatenate([X_train_scaled, X_train.drop(col_scale, axis=1)], axis=1)
X_test_preprocessed = np.concatenate([X_test_scaled, X_test.drop(col_scale, axis=1)], axis=1)

In [None]:
X_train_preprocessed

### Data normalization 

Normalizing the data means rescaling it from the original range so that all values lie within the new range of 0 to 1.
We can easily do this by using the [Min-Max-Scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) from sklearn. This scaler transforms the feature(s) by scaling it(them) to a given range (default range is 0 to 1). 

A value is normalized as follows: 

$ x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}} $

(Here $x_{min}$ and $x_{max}$ are the minimum and maximum of the feature being normalized, computed on your **train** dataset.)
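
Because the minimum and maximum come from the train data only, transformed test values are not guaranteed to stay inside the 0 to 1 range. A minimal sketch with hypothetical toy values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data, only for illustration
train_toy = np.array([[10.0], [20.0], [30.0]])
test_toy = np.array([[15.0], [40.0]])

# min (10) and max (30) are learned from the train data only
mm = MinMaxScaler().fit(train_toy)

train_scaled = mm.transform(train_toy)
test_scaled = mm.transform(test_toy)   # 40 is larger than the train max

print("train:", train_scaled.ravel())
print("test: ", test_scaled.ravel())
```

The test value 40 maps to 1.5, which is expected behaviour: refitting the scaler on the test data would leak information about the test set into preprocessing.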

In [None]:
# Scaling with MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Define predictors and target
y2 = df.Survived
X2 = df.drop('Survived', axis=1)

In [None]:
# Train-test-split
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=42, stratify=y2)

In [None]:
# Before we have a look at the different methods, 
# we have to define which columns we want to scale.
col_scale = ['Age', 'SibSp', 'Parch', 'Fare']

In [None]:
# Scaling with minmax scaler
mmscaler = MinMaxScaler()
X2_train_scaled = mmscaler.fit_transform(X2_train[col_scale])
X2_test_scaled = mmscaler.transform(X2_test[col_scale])

In [None]:
# Concatenating scaled and dummy columns
X2_train_preprocessed = np.concatenate([X2_train_scaled, X2_train.drop(col_scale, axis=1)], axis=1)
X2_test_preprocessed = np.concatenate([X2_test_scaled, X2_test.drop(col_scale, axis=1)], axis=1)

In [None]:
print("train", pd.DataFrame(X2_train_preprocessed).describe())
print("---")
print("test", pd.DataFrame(X2_test_preprocessed).describe())

---
## Predictive Modelling

We will evaluate our model performance in a quick and more reliable way using sklearn's [cross_val_score()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html), which implements K-fold cross validation. When training a model based on a single train-test split, we only run one experiment. Can we really trust one experiment? 

Think of [K-fold cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) as doing K experiments and then taking the average error. It is still not perfect but better than 1 experiment which can randomly turn out to be really good. 

Whenever we have K, the question of which value to choose comes up. Common values are between 5 and 10, and you need to take into account the technical limitations: dataset size, compute power, available memory, and time. CV takes time on large datasets.


![cv](images/cross_validation.png)
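
The "K experiments" idea can be written out as an explicit loop. The sketch below uses hypothetical synthetic data from `make_classification` and compares the manual loop with `cross_val_score`. It assumes the default behaviour that an integer `cv` uses stratified folds without shuffling for classifiers:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical synthetic data, only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression()

# K experiments: train on K-1 folds, score on the held-out fold
kfold = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in kfold.split(X_toy, y_toy):
    fold_model = clone(model).fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_model.score(X_toy[val_idx], y_toy[val_idx]))
manual_scores = np.array(manual_scores)

# cross_val_score runs the same loop for us
cv_scores = cross_val_score(model, X_toy, y_toy, cv=5)

print("manual mean accuracy:", manual_scores.mean())
print("cross_val_score mean:", cv_scores.mean())
```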

### LogisticRegression Classifier - unscaled data

In [None]:
# Fit and evaluate model without hyperparameter tuning using cross validation and unscaled data 
logreg_classifier = LogisticRegression()
scores = cross_val_score(logreg_classifier, X_train, y_train, cv=5, n_jobs=-1)

# Evaluation 
print('Score (unscaled):', round(scores.mean(), 4))

In [None]:
# plotting the scores and average score
plt.axhline(y=scores.mean(), color='y', linestyle='-')
sns.barplot(x=[1,2,3,4,5],y=scores).set_title('Scores of the K-Folds Models - unscaled data');

### LogisticRegression Classifier - standardized scaled data

In [None]:
# Fit and evaluate model using cross validation and scaled data 
logreg_scaled = LogisticRegression()
scores_scaled_std = cross_val_score(logreg_scaled, X_train_preprocessed, y_train, cv=5, n_jobs=-1)

# Evaluation
print('Score (scaled):', round(scores_scaled_std.mean(), 4))

In [None]:
plt.axhline(y=scores_scaled_std.mean(), color='y', linestyle='-')
sns.barplot(x=[1,2,3,4,5],y=scores_scaled_std).set_title('Scores of the K-Folds Models - standardized data');

The model errors on standardized features have a slightly bigger standard deviation than on non-scaled features.

### LogisticRegression Classifier - normalized scaled data

In [None]:
# Fit and evaluate model using cross validation and scaled data 
log_reg_scaled = LogisticRegression()
scores_scaled_norm = cross_val_score(log_reg_scaled, X2_train_preprocessed, y_train, cv=5, n_jobs=-1, scoring='accuracy')
# If "scoring"=None, the estimator’s default scorer (if available) is used.

# Evaluation
print('Score (scaled):', round(scores_scaled_norm.mean(), 4))

plt.axhline(y=scores_scaled_norm.mean(), color='y', linestyle='-')
sns.barplot(x=[1,2,3,4,5],y=scores_scaled_norm).set_title('Scores of the K-Folds Models - normalized data');

In [None]:
print('Score (unscaled):', round(scores.mean(), 4))
print('Score (scaled, standardized):', round(scores_scaled_std.mean(), 4))
print('Score (scaled, normalized):', round(scores_scaled_norm.mean(), 4))


Based on these first results, we'd go for normalized data!  
But can we improve even further?


---
## Hyperparameter Tuning

Most models have many hyperparameters, and the values that work well on one dataset may not work well on another. The same goes for the regularization parameters, which, as we learned, are selected by trial and error. So how do we select the parameter values that work best for our data?

#### GridSearchCV

[Grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) is a tuning technique that attempts to compute the optimum values of hyperparameters. It performs an exhaustive search over a predefined parameter space using cross-validation (hence the **CV** suffix). That means it will evaluate all of the possible parameter combinations of the search space in order to find and return the best combination. 


This task, however, becomes very time-consuming if there are many hyperparameters and the search space is huge. As you can see, for k=5 and 2 parameters with 2 and 3 values respectively (thus 6 combinations), GridSearchCV runs 30 modeling steps just to come up with the best values for the two parameters.

![grid search](images/grid_search_cv.png)  
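
The number of model fits can be counted before running a search. This sketch uses sklearn's `ParameterGrid` on a hypothetical two-parameter grid (`param_a` and `param_b` are placeholder names) and on a solver/penalty/intercept grid like the one used for logistic regression in this notebook:

```python
from sklearn.model_selection import ParameterGrid

k = 5  # number of cross-validation folds

# Hypothetical grid: 2 values x 3 values = 6 combinations
toy_grid = {"param_a": [1, 2], "param_b": ["x", "y", "z"]}
toy_combos = list(ParameterGrid(toy_grid))
toy_fits = len(toy_combos) * k
print(len(toy_combos), "combinations ->", toy_fits, "fits")

# Grid over solver, penalty and fit_intercept: 5 x 3 x 2 = 30 combinations
logreg_grid = {
    "solver": ["lbfgs", "liblinear", "newton-cg", "sag", "saga"],
    "penalty": ["l2", "l1", "elasticnet"],
    "fit_intercept": [True, False],
}
logreg_fits = len(ParameterGrid(logreg_grid)) * k
print(len(ParameterGrid(logreg_grid)), "combinations ->", logreg_fits, "fits")
```

Not every solver supports every penalty, so some of these combinations fail during fitting and are scored as `nan` by the search instead of contributing a valid result.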

In [None]:
#what parameters does sklearn.linear_model.LogisticRegression() have?
logreg_classifier.get_params()#.keys()

In [None]:
# Defining parameter grid (as dictionary)
param_grid = {"solver" : ["lbfgs", "liblinear", "newton-cg", "sag", "saga"],
              "penalty" : ["l2", "l1", "elasticnet"],
              "fit_intercept" : [True, False]
             }

# Instantiate gridsearch and define the metric to optimize 
gs = GridSearchCV(logreg_classifier, param_grid, scoring='accuracy',
                  cv=5, verbose=1)

# Fit gridsearch object to data. Also let's see how long it takes.
start = timer()
gs.fit(X2_train_preprocessed, y_train)
end = timer()
gs_time = end-start

In [None]:
# Best score
print('Best score:', round(gs.best_score_, 3))
print('Score (scaled, normalized):', round(scores_scaled_norm.mean(), 3))

# Best parameters
print('Best parameters:', gs.best_params_)

In [None]:
# we will do this at least twice.. according to DRY we should write a function
def print_pretty_summary(name, model, y_test, y_pred_test):
    print(name)
    print('=======================')
    print('solver: {}'.format(model.solver))
    print('fit_intercept: {}'.format(model.fit_intercept))
    print('penalty: {}'.format(model.penalty))
    accuracy = accuracy_score(y_test, y_pred_test)
    print('Test accuracy: {:.2f}'.format(accuracy))
    return accuracy

In [None]:
# Assigning the fitted LogRegClassifier model with best parameter combination to a new variable logreg_best
logreg_best = gs.best_estimator_

# Making predictions on the test set
y_pred_test_gs = logreg_best.predict(X2_test_preprocessed)
# Let us print out the performance of our model on the test set.
gs_accuracy = print_pretty_summary('LogReg Classifier model', logreg_best, y_test, y_pred_test_gs)

In [None]:
#Have a look at the confusion matrix below
confusion_matrix(y_test, y_pred_test_gs)

#### [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

As an alternative to grid search we can use sklearn's RandomizedSearchCV(). Random search will not try every possible combination of our search space but will randomly pick and evaluate parameter combinations. 
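
To illustrate what "randomly pick" means, the sketch below uses sklearn's `ParameterSampler` to draw 6 combinations from a solver/penalty/intercept grid. The seed value is an arbitrary choice for reproducibility. When every parameter is given as a list, the combinations are drawn without replacement:

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

grid = {
    "solver": ["lbfgs", "liblinear", "newton-cg", "sag", "saga"],
    "penalty": ["l2", "l1", "elasticnet"],
    "fit_intercept": [True, False],
}

# Draw 6 of the 30 possible combinations (seed chosen arbitrarily)
sampled = list(ParameterSampler(grid, n_iter=6, random_state=10))

print("grid size:", len(ParameterGrid(grid)))
print("sampled combinations:", len(sampled))
for params in sampled:
    print(params)
```

With `cv=5`, a search over these 6 candidates needs 30 fits instead of the 150 that the full grid would require.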

In [None]:
# Define parameter grid for randomized search
param_grid = {"solver" : ["lbfgs", "liblinear", "newton-cg", "sag", "saga"],
              "penalty" : ["l2", "l1", "elasticnet"],
              "fit_intercept" : [True, False]
             }

# Instantiate random search and define the metric to optimize 
rs = RandomizedSearchCV(logreg_classifier, param_grid, scoring='accuracy',
                  cv=5, verbose=1, n_jobs=-1, n_iter=6, random_state=RSEED)

# Fit randomized search object to data
start = timer()
rs.fit(X2_train_preprocessed, y_train)
end = timer()
rgs_time = end-start

In [None]:
# Best score
print('Best score:', round(rs.best_score_, 3))

# Best parameters
print('Best parameters:', rs.best_params_)

In [None]:
# Assigning the fitted LogisticRegression model with best parameter combination to a new variable logreg_best_rs
logreg_best_rs = rs.best_estimator_

# Making predictions on the test set
y_pred_test_rs = logreg_best_rs.predict(X2_test_preprocessed)


# Let us print out the performance of our model on the test set.
rs_accuracy = print_pretty_summary('LogReg Classifier model (randomizedGSCV)', logreg_best_rs, y_test, y_pred_test_rs)

In [None]:
print(f"Grid search took {gs_time} seconds to run with accuracy: {gs_accuracy:f}")
print(f"Randomized Grid search took {rgs_time} seconds to run with accuracy: {rs_accuracy:f}")

In [None]:
#confusion matrix for the grid search
confusion_matrix(y_test, y_pred_test_gs)

In [None]:
#confusion matrix for the randomized grid search
confusion_matrix(y_test, y_pred_test_rs)