Exercises - Logistic Regression

Logistic Regression
Fit the logistic regression classifier to your training sample and transform, i.e. make predictions on the training sample

In [91]:
import graphviz
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

import matplotlib.pyplot as plt
import seaborn as sns
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

from acquire import get_titanic_data
from prepare import prep_titanic_data
from acquire import get_iris_data
from prepare import prep_iris_data

In [92]:
df = prep_titanic_data(get_titanic_data())

In [93]:
X = df[['pclass','age','fare','sibsp','parch']]
y = df[['survived']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123)
X_train.head()


     pclass       age      fare  sibsp  parch
605       3  0.447097  0.030352      1      0
197       3  0.522493  0.016404      0      1
56        2  0.258608  0.020495      0      0
645       1  0.597889  0.149765      1      0
356       1  0.271174  0.107353      0      1


In [94]:
y_train.columns

Index(['survived'], dtype='object')

In [95]:
# LogisticRegression solver options: 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'
# (the default was 'liblinear' before scikit-learn 0.22 and is 'lbfgs' from 0.22 on)
# logit = LogisticRegression(C=1, class_weight={1:2}, random_state = 123, solver='saga')
logit_fit = LogisticRegression(C=1, class_weight={1:2}, random_state = 123, solver='liblinear')

In [96]:
logit_fit.fit(X_train, y_train)

LogisticRegression(C=1, class_weight={1: 2}, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=123, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [97]:
y_pred = logit_fit.predict(X_train)

Evaluate your in-sample results using the model score, confusion matrix, and classification report.
Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.
Look in the scikit-learn documentation to research the solver parameter. What is your best option(s) for the particular problem you are trying to solve and the data to be used?
Run through steps 2-4 using another solver (from question 5)
Which performs better on your in-sample data?
Save the best model in logit_fit

The liblinear solver gives a slightly higher accuracy score, and the documentation says it is a good choice for small datasets.
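
The solver comparison described above could be scripted along these lines. This is only a sketch: it uses a synthetic dataset from make_classification as a stand-in for the prepared Titanic features (an assumption, so the scores will not match the notebook's), and the names X_demo, y_demo, solver_scores and best_solver are illustrative. max_iter is raised so that the iterative solvers have room to converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the five numeric Titanic features (assumption).
X_demo, y_demo = make_classification(n_samples=700, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.30, random_state=123)

# Fit the same model with each solver and record the in-sample accuracy.
solver_scores = {}
for solver in ['liblinear', 'lbfgs', 'saga']:
    model = LogisticRegression(C=1, class_weight={1: 2}, random_state=123,
                               solver=solver, max_iter=1000)
    model.fit(X_tr, y_tr)
    solver_scores[solver] = model.score(X_tr, y_tr)

for solver, score in solver_scores.items():
    print('{:>10}: in-sample accuracy {:.5f}'.format(solver, score))

# The solver with the highest in-sample accuracy.
best_solver = max(solver_scores, key=solver_scores.get)
print('Best solver in-sample:', best_solver)
```

On a dataset this small the solvers usually land very close to each other, so the choice often comes down to which penalties each solver supports and how quickly it converges.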

In [98]:
print('Accuracy of Logistic Regression classifier on training set: {:.5f}'
     .format(logit_fit.score(X_train, y_train)))

Accuracy of Logistic Regression classifier on training set: 0.69880


In [99]:
confusion_matrix(y_train, y_pred)

array([[200,  99],
       [ 51, 148]])

In [86]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
# In binary classification, the count of true negatives is C[0, 0],
# false negatives is C[1, 0], true positives is C[1, 1], and false positives is C[0, 1].

cm = pd.DataFrame(confusion_matrix(y_train, y_pred),
             columns=['Pred -Survived', 'Pred +Survived'], index=['Actual -Survived', 'Actual +Survived'])

cm

                  Pred -Survived  Pred +Survived
Actual -Survived             200              99
Actual +Survived              51             148


In [10]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.67      0.73       299
           1       0.60      0.74      0.66       199

   micro avg       0.70      0.70      0.70       498
   macro avg       0.70      0.71      0.70       498
weighted avg       0.72      0.70      0.70       498
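
The exercise asks for the individual rates to be printed and labeled. One way to get them, sketched here, is to unpack the four cells of the binary confusion matrix. The counts below are copied from the logistic regression confusion matrix output above; the variable names are illustrative.

```python
import numpy as np

# Counts copied from confusion_matrix(y_train, y_pred) for the logistic model.
# Rows are actual classes, columns are predicted classes.
cm_logit = np.array([[200, 99],
                     [51, 148]])
tn, fp, fn, tp = cm_logit.ravel()

accuracy = (tp + tn) / cm_logit.sum()
tpr = tp / (tp + fn)        # true positive rate (recall for class 1)
fpr = fp / (fp + tn)        # false positive rate
tnr = tn / (tn + fp)        # true negative rate
fnr = fn / (fn + tp)        # false negative rate
precision = tp / (tp + fp)  # precision for class 1
f1 = 2 * precision * tpr / (precision + tpr)
support = tp + fn           # number of actual positives

print('Accuracy:            {:.5f}'.format(accuracy))
print('True positive rate:  {:.5f}'.format(tpr))
print('False positive rate: {:.5f}'.format(fpr))
print('True negative rate:  {:.5f}'.format(tnr))
print('False negative rate: {:.5f}'.format(fnr))
print('Precision:           {:.5f}'.format(precision))
print('Recall:              {:.5f}'.format(tpr))
print('F1-score:            {:.5f}'.format(f1))
print('Support (class 1):   {}'.format(support))
```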



Decision Tree
Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)
Evaluate your in-sample results using the model score, confusion matrix, and classification report.
Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.
Run through steps 2-4 using entropy as your measure of impurity.
Which performs better on your in-sample data?
Save the best model in tree_fit

In [51]:
df = prep_iris_data(get_iris_data())

Split
Create the Decision Tree Object

In [52]:
X = df.drop(['species','measurement_id'],axis=1)
y = df[['species']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123)

X_train.head()

     sepal_length  sepal_width  petal_length  petal_width
114           5.8          2.8           5.1          2.4
136           6.3          3.4           5.6          2.4
53            5.5          2.3           4.0          1.3
19            5.1          3.8           1.5          0.3
38            4.4          3.0           1.3          0.2


In [53]:
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=123)


Fit the model to the training data

In [54]:
clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=123,
            splitter='best')

In [55]:
y_pred = clf.predict(X_train)
y_pred[0:5]

array(['virginica', 'virginica', 'versicolor', 'setosa', 'setosa'],
      dtype=object)

Estimate the probability of a species

In [56]:
y_pred_proba = clf.predict_proba(X_train)
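
predict_proba returns one column per class, in the order given by the fitted model's classes_ attribute. A small sketch of how to label those columns, assuming scikit-learn's bundled load_iris data in place of get_iris_data (the names demo_clf and proba are illustrative):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Bundled iris data, used here as a stand-in for prep_iris_data(get_iris_data()).
iris = load_iris()
species = iris.target_names[iris.target]  # string labels instead of 0/1/2

demo_clf = DecisionTreeClassifier(criterion='entropy', max_depth=3,
                                  random_state=123)
demo_clf.fit(iris.data, species)

# Each row sums to 1; the columns follow demo_clf.classes_.
proba = pd.DataFrame(demo_clf.predict_proba(iris.data),
                     columns=demo_clf.classes_)
print(proba.head())
```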

In [57]:
print('Accuracy of Decision Tree classifier on training set: {:.6f}'
     .format(clf.score(X_train, y_train)))

Accuracy of Decision Tree classifier on training set: 0.980952


To build a confusion matrix, call confusion_matrix(y_true, y_pred). The first argument (the actual labels, here y_train) gives the rows of the matrix, and the second argument (the predictions, here y_pred) gives the columns.

The row and column labels are the unique values of y (think value_counts() of species).
scikit-learn sorts these unique values, so string labels appear alphabetically (row 1, row 2, row 3, and likewise for the columns). The order of the resulting matrix is therefore "setosa", "versicolor", and then "virginica".
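
A quick way to check the orientation and label order is a tiny hand-made example. In scikit-learn, rows correspond to the first argument (actual) and columns to the second (predicted). The labels below are made up for illustration; passing labels= makes the order explicit instead of relying on the default sort.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels: one virginica is misclassified as versicolor.
actual = ['setosa', 'versicolor', 'virginica', 'virginica', 'virginica']
predicted = ['setosa', 'versicolor', 'versicolor', 'virginica', 'virginica']
labels = ['setosa', 'versicolor', 'virginica']

cm_demo = pd.DataFrame(confusion_matrix(actual, predicted, labels=labels),
                       index=['Actual ' + name for name in labels],
                       columns=['Pred ' + name for name in labels])
print(cm_demo)
```

The single error shows up in the "Actual virginica" row under the "Pred versicolor" column, which confirms that rows are the actual classes.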


In [58]:
confusion_matrix(y_train, y_pred)


array([[32,  0,  0],
       [ 0, 40,  0],
       [ 0,  2, 31]])

Now put row and column labels on the matrix so you can tell what's going on.

In [59]:
# Labels must follow scikit-learn's sorted class order: setosa, versicolor, virginica
cm = pd.DataFrame(confusion_matrix(y_train, y_pred),
                  columns=['Pred Setosa', 'Pred Versicolor', 'Pred Virginica'],
                  index=['Actual Setosa', 'Actual Versicolor', 'Actual Virginica'])
cm

                   Pred Setosa  Pred Versicolor  Pred Virginica
Actual Setosa               32                0               0
Actual Versicolor            0               40               0
Actual Virginica             0                2              31
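
The exercise also asks for a comparison of impurity measures and for the better model to be saved in tree_fit. A minimal sketch of that comparison follows, assuming scikit-learn's bundled load_iris data in place of get_iris_data, so the exact scores may differ from the notebook's. The names criterion_scores and best_criterion are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Bundled iris data as a stand-in for prep_iris_data(get_iris_data()).
iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target,
                                          test_size=0.30, random_state=123)

# Fit one tree per impurity measure and record the in-sample accuracy.
criterion_scores = {}
for criterion in ['gini', 'entropy']:
    candidate = DecisionTreeClassifier(criterion=criterion, max_depth=3,
                                       random_state=123)
    candidate.fit(X_tr, y_tr)
    criterion_scores[criterion] = candidate.score(X_tr, y_tr)
    print('{:>8}: in-sample accuracy {:.6f}'.format(
        criterion, criterion_scores[criterion]))

# Refit with the better criterion (ties go to the first one, gini).
best_criterion = max(criterion_scores, key=criterion_scores.get)
tree_fit = DecisionTreeClassifier(criterion=best_criterion, max_depth=3,
                                  random_state=123).fit(X_tr, y_tr)
```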


Random Forest (Titanic data)

In [76]:
df = prep_titanic_data(get_titanic_data())

Reduce the columns to the features to test.

     TODO: write a function that returns a list of the numeric columns.
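
One possible shape for such a helper, sketched with pandas' select_dtypes. The function name numeric_columns and the small demo frame are hypothetical, not part of the prepare module.

```python
import pandas as pd


def numeric_columns(frame):
    """Return the names of the numeric columns of a DataFrame as a list."""
    return frame.select_dtypes(include='number').columns.tolist()


# Small made-up frame: two numeric columns and two string columns.
demo = pd.DataFrame({'pclass': [3, 1],
                     'fare': [7.25, 71.28],
                     'sex': ['male', 'female'],
                     'embarked': ['S', 'C']})
print(numeric_columns(demo))
```

Note that include='number' does not pick up boolean columns, so a flag stored as True/False would need to be cast to an integer first or added to the list by hand.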

In [77]:
X = df[['pclass','age','fare','sibsp','parch','alone','embarked_encode','sex_encode']]
y = df.survived

In [78]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123)


In [79]:
rf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', min_samples_leaf=1, n_estimators=100,
                            max_depth=20, 
                            random_state=123)


In [80]:
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=20, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=123, verbose=0, warm_start=False)

In [81]:
y_pred = rf.predict(X_train)

In [82]:
y_pred_proba = rf.predict_proba(X_train)

In [83]:
rf.score(X_train, y_train)

0.9879518072289156

y_train (actual) gives the rows; y_pred (predicted) gives the columns.

In [84]:
confusion_matrix(y_train, y_pred)

array([[298,   1],
       [  5, 194]])

In [74]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       299
           1       0.99      0.97      0.98       199

   micro avg       0.99      0.99      0.99       498
   macro avg       0.99      0.99      0.99       498
weighted avg       0.99      0.99      0.99       498

