# Week08 Example 2 - Logistic Regression with tuning

In [16]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix
from sklearn.model_selection import GridSearchCV

## Load and Encode Data

Load the data from a CSV file on the server.

In [17]:
df = pd.read_csv("https://raw.githubusercontent.com/timcsmith/MIS536-Public/master/Data/logit.csv")
df

Unnamed: 0,X,Y
0,0.5,Blue
1,1.1,Blue
2,1.5,Blue
3,2.0,Blue
4,3.3,Blue
5,4.7,Blue
6,5.3,Blue
7,7.0,Blue
8,6.5,Blue
9,7.5,Blue


Note that we need to encode our Y variable as numbers: Green becomes 1 (the positive class) and Blue becomes 0.

In [18]:
df.Y = df.Y.replace('Green', 1, regex=True) 
df.Y = df.Y.replace('Blue', 0, regex=True) 


df

Unnamed: 0,X,Y
0,0.5,0
1,1.1,0
2,1.5,0
3,2.0,0
4,3.3,0
5,4.7,0
6,5.3,0
7,7.0,0
8,6.5,0
9,7.5,0


## Train Test Split

In [19]:
X_train, X_test, y_train, y_test = train_test_split(df.X, df.Y, test_size=0.3, random_state=1)

In [20]:
# Because there is only one feature in X, we need to convert it to a dataframe (or an array of arrays, but a dataframe is easier here)
X_train = pd.DataFrame(X_train) 
X_test = pd.DataFrame(X_test)

In [21]:
logClassifier = LogisticRegression(random_state=1) # Create the model
_ = logClassifier.fit(X_train, y_train) # Fit the model to the training data. NOTE: the underscore is a dummy variable that is used to suppress output

## Measure performance of model on test data

NOTE: This is a demonstration. No specific scoring metric is preferred here.

In [22]:
THRESHOLD = 0.25
y_pred_dthreshold25 = np.where(logClassifier.predict_proba(X_test)[:,1] > THRESHOLD, 1, 0)
pd.DataFrame({"predicted":y_pred_dthreshold25,"actual":y_test})

Unnamed: 0,predicted,actual
20,1,1
17,1,1
3,0,0
13,1,0
19,1,1
16,1,1
10,0,1


In [23]:
y_pred = logClassifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("***********************")
print(f"{'Recall Score:':18}{recall_score(y_test, y_pred):.3f}")
print("***********************")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred):.3f}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred):.3f}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred):.3f}")
print("***********************")

[[1 1]
 [1 4]]
***********************
Recall Score:     0.800
***********************
Accuracy Score:   0.714
Precision Score:  0.800
F1 Score:         0.800
***********************
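
As a check on the scores above, they can be recomputed by hand from the confusion matrix. This sketch assumes scikit-learn's layout for labels 0 and 1, `[[TN, FP], [FN, TP]]`, with rows as actual values and columns as predicted values.

```python
# Counts read from the confusion matrix printed above:
# [[TN, FP],
#  [FN, TP]]
tn, fp = 1, 1
fn, tp = 1, 4

recall = tp / (tp + fn)                      # share of actual positives that were found
precision = tp / (tp + fp)                   # share of predicted positives that were right
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that were right
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"{'Recall Score:':18}{recall:.3f}")
print(f"{'Accuracy Score: ':18}{accuracy:.3f}")
print(f"{'Precision Score: ':18}{precision:.3f}")
print(f"{'F1 Score: ':18}{f1:.3f}")
```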


## Hyperparameter Tuning for LogisticRegression

Logistic regression has very few parameters that are generally used for tuning.
  - See official documentation on sklearn logistic regression parameters [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

- You can see useful differences in performance or convergence with different solvers (solver).
    - Recall gradient descent from class: it is a process for finding the minimum of a loss function. The solver parameter sets the optimization method that logistic regression uses to minimize that loss.
    - solver in [‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’], default=‘lbfgs’

- Regularization (penalty) can sometimes be helpful.
    - Regularization can be used to avoid overfitting. 
      - Penalized logistic regression adds a penalty on the size of the coefficients to the loss function. This shrinks the coefficients of the less contributive variables toward zero. 
    - penalty in [‘none’, ‘l1’, ‘l2’, ‘elasticnet’]
      - Note: not all solvers support all regularization terms.
        - ‘newton-cg’ - [‘l2’, ‘none’]
        - ‘lbfgs’ - [‘l2’, ‘none’]
        - ‘liblinear’ - [‘l1’, ‘l2’]
        - ‘sag’ - [‘l2’, ‘none’]
        - ‘saga’ - [‘elasticnet’, ‘l1’, ‘l2’, ‘none’]

- The C parameter controls the penalty strength, which can also be effective to tune. C is the inverse of the regularization strength, so smaller values mean a stronger penalty.
    - C in [100, 10, 1.0, 0.1, 0.01]
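
To make the ideas of minimizing a loss and of penalty strength concrete, here is a minimal sketch of plain gradient descent on a one-feature logistic model with an L2 penalty. It is not how scikit-learn's solvers work internally (they use more sophisticated methods); the function `fit_logistic_gd`, the toy data, the learning rate, and the exact scaling of the penalty term are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(x, y, C=1.0, learning_rate=0.05, n_steps=5000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient descent on
    mean log loss + (1 / (2*C*n)) * w**2 (L2 penalty on the slope only)."""
    n = len(x)
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        p = sigmoid(w * x + b)
        grad_w = np.mean((p - y) * x) + w / (C * n)  # loss gradient + penalty gradient
        grad_b = np.mean(p - y)                      # intercept is not penalized
        w -= learning_rate * grad_w                  # step downhill
        b -= learning_rate * grad_b
    return w, b

# Hypothetical, overlapping toy data (not the course dataset)
x = np.array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1])

w_weak, _ = fit_logistic_gd(x, y, C=100)     # large C: weak penalty
w_strong, _ = fit_logistic_gd(x, y, C=0.01)  # small C: strong penalty

print(f"slope with C=100:  {w_weak:.3f}")
print(f"slope with C=0.01: {w_strong:.3f}")
```

With the small C, the fitted slope is pulled much closer to zero, which is the shrinking effect described in the regularization bullet.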


In [24]:
param_grid = {'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
              'penalty': ['l1', 'l2', 'none'], # NOTE: 'elasticnet' is only supported by the 'saga' solver
              'C': [100, 10, 1.0, 0.1, 0.01],
              'max_iter': [500000] # maximum number of iterations allowed to converge (sometimes the default is not enough, and sometimes it will never converge)
             }

> NOTE: sklearn's implementation of the solver uses its own method to obtain the step size (learning rate), so you cannot change the learning rate (unless you want to change the source code). This [paper](https://hal.inria.fr/hal-00860051/document) defines this method. (source: [here](https://datascience.stackexchange.com/questions/16751/learning-rate-in-logistic-regression-with-sklearn))

In [25]:
best_logClassifer = GridSearchCV(estimator=LogisticRegression(random_state=1),
                                    scoring='recall', param_grid=param_grid, 
                                    cv=2, verbose=0,  n_jobs = -1)
best_logClassifer = best_logClassifer.fit(X_train, y_train)

60 fits failed out of a total of 150.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/timsmith/opt/miniconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/timsmith/opt/miniconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1091, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/Users/timsmith/opt/miniconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 61, in _check_solver
    raise ValueError(
ValueError: Solver newton-cg supports only 'l2' or 'none' pena
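
The traceback above shows a fit failing because the grid paired a solver with a penalty it does not support. One way to avoid such failures is to give `GridSearchCV` a list of dictionaries, one sub-grid per solver, so that only supported pairs are tried. The sketch below builds such a list from the solver/penalty table earlier in this notebook. The names `SUPPORTED_PENALTIES` and `valid_param_grid` are hypothetical, and the no-penalty and `'elasticnet'` options are left out as a simplifying assumption (the spelling of the no-penalty option depends on the scikit-learn version, and `'elasticnet'` needs an extra `l1_ratio` setting).

```python
# Solver -> penalties it supports, taken from the list earlier in this notebook
# (no-penalty and 'elasticnet' are intentionally omitted in this sketch)
SUPPORTED_PENALTIES = {
    "newton-cg": ["l2"],
    "lbfgs": ["l2"],
    "liblinear": ["l1", "l2"],
    "sag": ["l2"],
    "saga": ["l1", "l2"],
}
C_VALUES = [100, 10, 1.0, 0.1, 0.01]

# GridSearchCV accepts a list of dicts; each dict is searched as its own sub-grid
valid_param_grid = [
    {"solver": [solver], "penalty": penalties, "C": C_VALUES, "max_iter": [500000]}
    for solver, penalties in SUPPORTED_PENALTIES.items()
]

# Number of parameter combinations the search would try
n_candidates = sum(len(g["penalty"]) * len(g["C"]) for g in valid_param_grid)
print(n_candidates)
```

Passing `valid_param_grid` as `param_grid` should then skip the unsupported solver/penalty pairs instead of recording them as failed fits.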

In [26]:
y_pred = best_logClassifer.predict(X_test)
print("***********************")
print(f"{'Recall Score:':18}{recall_score(y_test, y_pred):.3f}")
print("***********************")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred):.3f}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred):.3f}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred):.3f}")
print("***********************")


***********************
Recall Score:     1.000
***********************
Accuracy Score:   0.857
Precision Score:  0.833
F1 Score:         0.909
***********************


## Deploy our model by using it on new data

Create a dataframe containing new input data.

In [27]:
df_new = pd.DataFrame({"X":[1,2,3,4,5,6,7,8,9,10]}) # Create a new data frame with the values we want to predict
df_new # display new input data

Unnamed: 0,X
0,1
1,2
2,3
3,4
4,5
5,6
6,7
7,8
8,9
9,10


Predict y from new X data

In [28]:
df_new['Y_pred'] = best_logClassifer.predict(df_new[['X']]) # predict the values for the new data frame
df_new


Unnamed: 0,X,Y_pred
0,1,0
1,2,0
2,3,0
3,4,0
4,5,0
5,6,0
6,7,1
7,8,1
8,9,1
9,10,1


In [29]:
df_new.Y_pred = df_new.Y_pred.replace(1, 'Green') 
df_new.Y_pred = df_new.Y_pred.replace(0, 'Blue') 
df_new

Unnamed: 0,X,Y_pred
0,1,Blue
1,2,Blue
2,3,Blue
3,4,Blue
4,5,Blue
5,6,Blue
6,7,Green
7,8,Green
8,9,Green
9,10,Green


Predict the probability of Blue (first column) and the probability of Green (second column) for each value of X

In [30]:
best_logClassifer.predict_proba(df_new[['X']]).round(3)

array([[0.566, 0.434],
       [0.555, 0.445],
       [0.544, 0.456],
       [0.533, 0.467],
       [0.522, 0.478],
       [0.511, 0.489],
       [0.5  , 0.5  ],
       [0.489, 0.511],
       [0.478, 0.522],
       [0.467, 0.533]])
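
These probabilities follow the logistic (sigmoid) curve: the model computes a linear score from X and squashes it into the range 0 to 1. The sketch below illustrates this with a slope and decision boundary picked by hand to roughly match the rounded output above. `SLOPE` and `BOUNDARY` are assumptions for illustration, not coefficients read from the fitted model.

```python
import math

# Hypothetical values chosen by hand to roughly match the rounded output above;
# they are NOT read from the fitted model
SLOPE = 0.044
BOUNDARY = 7.0  # X value where Blue and Green are equally likely

def prob_green(x):
    """Probability of Green: sigmoid of a linear score in x."""
    score = SLOPE * (x - BOUNDARY)
    return 1.0 / (1.0 + math.exp(-score))

for x in range(1, 11):
    p = prob_green(x)
    # Columns: X, P(Blue), P(Green)
    print(x, round(1 - p, 3), round(p, 3))
```

Because the slope is small, the probabilities change only slightly across the range of X, which is why every row above stays close to 0.5.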