# Data Modeling
Do your work for these exercises in either a notebook or a Python script named `model`.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import MinMaxScaler

import matplotlib.pyplot as plt
%matplotlib inline

from acquire import get_titanic_data
from prepare import prep_titanic_data

df = get_titanic_data()
df = prep_titanic_data(df)
df.sample(5)

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,embark_town,alone,embarked_encode
821,821,1,3,male,27.0,0,0,8.6625,S,Third,Southampton,1,2
422,422,0,3,male,29.0,0,0,7.875,S,Third,Southampton,1,2
848,848,0,2,male,28.0,0,1,33.0,S,Second,Southampton,0,2
484,484,1,1,male,25.0,1,0,91.0792,C,First,Cherbourg,0,0
609,609,1,1,female,40.0,0,0,153.4625,S,First,Southampton,1,2


## Logistic Regression
1. Fit the logistic regression classifier to your training sample and transform, i.e. make predictions on the training sample

In [2]:
# Handle missing values in the `age` column.
df.dropna(inplace=True)

In [3]:
X = df[['pclass','age','fare','sibsp','parch']]
y = df[['survived']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123)

X_train.head()

Unnamed: 0,pclass,age,fare,sibsp,parch
60,3,22.0,7.2292,0,0
348,3,3.0,15.9,1,1
606,3,30.0,7.8958,0,0
195,1,58.0,146.5208,0,0
56,2,21.0,10.5,0,0


In [4]:
# 1. make the thing
scaler = MinMaxScaler()

# 2. fit the thing
scaler.fit(X_train[['age', 'fare']])

# 3. use the thing
X_train[['age', 'fare']] = scaler.transform(X_train[['age', 'fare']])
X_test[['age', 'fare']] = scaler.transform(X_test[['age', 'fare']])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
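The warning above most likely arises because the frames returned by `train_test_split` are treated by pandas as slices of `X`. One common way to avoid it, sketched here on made-up data rather than the Titanic set, is to take explicit copies of the splits before assigning the scaled columns. The toy values and the `cols` name are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the Titanic features (values are made up for illustration).
X = pd.DataFrame({
    'pclass': [3, 1, 2, 3, 1, 2, 3, 1],
    'age':    [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0],
    'fare':   [7.25, 71.28, 7.92, 8.05, 51.86, 21.07, 11.13, 30.07],
})
y = pd.Series([0, 1, 1, 0, 0, 0, 1, 1], name='survived')

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123)

# Take explicit copies so pandas treats the splits as independent frames.
X_train = X_train.copy()
X_test = X_test.copy()

cols = ['age', 'fare']
scaler = MinMaxScaler().fit(X_train[cols])       # fit on the training split only
X_train[cols] = scaler.transform(X_train[cols])
X_test[cols] = scaler.transform(X_test[cols])    # reuse the training min/max
```

Fitting the scaler on the training split alone keeps information from the test split out of the preprocessing step.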

### Train Model
#### Create the logistic regression object

In [5]:
# from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(C=1, class_weight={1:2}, random_state = 123, solver='saga')

#### Fit the model to the training data

In [6]:
# Pass the target as a 1-D Series; a one-column DataFrame triggers a
# DataConversionWarning from column_or_1d.
logit.fit(X_train, y_train.survived)


LogisticRegression(C=1, class_weight={1: 2}, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=123, solver='saga',
          tol=0.0001, verbose=0, warm_start=False)

#### Print the coefficients and intercept of the model

In [7]:
print('Coefficient: \n', logit.coef_)
print()
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-1.06836414 -1.9727291   0.79910148 -0.27300495  0.40904858]]

Intercept: 
 [3.3184823]
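One way to read these coefficients is as odds ratios. The sketch below exponentiates the values printed above; the feature order is assumed to match the column order used when building `X`. Because `age` and `fare` were min-max scaled, a "one-unit increase" for those two features spans the full range seen in the training data.

```python
import math

# Feature order and fitted values copied from the output above.
features = ['pclass', 'age', 'fare', 'sibsp', 'parch']
coefficients = [-1.06836414, -1.9727291, 0.79910148, -0.27300495, 0.40904858]

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = {name: math.exp(coef) for name, coef in zip(features, coefficients)}

for name, ratio in odds_ratios.items():
    print('{:>6}: odds ratio = {:.3f}'.format(name, ratio))
```

A ratio below 1 (as for `pclass`) means the odds of survival fall as the feature grows; a ratio above 1 (as for `fare`) means they rise.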


2. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

#### Estimate whether or not a passenger would survive, using the training data

In [8]:
y_pred = logit.predict(X_train)

#### Estimate the probability of a passenger surviving, using the training data

In [9]:
y_pred_proba = logit.predict_proba(X_train)
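To make the link between `predict_proba` and `predict` concrete, here is a minimal sketch that reproduces the calculation by hand for one passenger. The intercept and coefficients are the values printed earlier; the passenger itself is hypothetical, with `age` and `fare` assumed to be on the scaled 0-1 range.

```python
import math

# Fitted values copied from the coefficient and intercept output earlier.
intercept = 3.3184823
coefficients = {'pclass': -1.06836414, 'age': -1.9727291, 'fare': 0.79910148,
                'sibsp': -0.27300495, 'parch': 0.40904858}

# Hypothetical passenger; age and fare are on the min-max scaled 0-1 range.
passenger = {'pclass': 3, 'age': 0.30, 'fare': 0.02, 'sibsp': 0, 'parch': 0}

# Linear score, then the logistic (sigmoid) function to get P(survived = 1).
z = intercept + sum(coefficients[k] * passenger[k] for k in coefficients)
p_survived = 1 / (1 + math.exp(-z))

# For a binary model, predicting class 1 corresponds to a probability above 0.5.
prediction = int(p_survived > 0.5)
print('P(survived) = {:.3f}, predicted class = {}'.format(p_survived, prediction))
```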

In [10]:
X_train['prediction'] = logit.predict(X_train[['pclass','age','fare','sibsp','parch']])



In [11]:
(y_train.survived == X_train.prediction).sum() / y_train.shape[0]

0.6933867735470942

In [12]:
logit.score(X_train[['pclass','age','fare','sibsp','parch']], y_train.survived)

0.6933867735470942

### Evaluate Model
#### Compute the accuracy

In [13]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train.drop(columns='prediction'), y_train)))

Accuracy of Logistic Regression classifier on training set: 0.69


#### Create a confusion matrix

In [14]:
print(confusion_matrix(y_train, y_pred))

[[190 103]
 [ 50 156]]


In [15]:
df = pd.DataFrame(confusion_matrix(y_train.survived, X_train.prediction),
             columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])

df

Unnamed: 0,Pred -,Pred +
Actual -,190,103
Actual +,50,156


#### Compute Precision, Recall, F1-score, and Support

In [16]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.65      0.71       293
           1       0.60      0.76      0.67       206

   micro avg       0.69      0.69      0.69       499
   macro avg       0.70      0.70      0.69       499
weighted avg       0.71      0.69      0.70       499



3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

### Test Model
#### Compute the accuracy of the model when run on the test data

In [17]:
print('Accuracy of Logistic Regression classifier on test set: {:.2f}'
     .format(logit.score(X_test, y_test)))

Accuracy of Logistic Regression classifier on test set: 0.67


In [18]:
df = pd.DataFrame(confusion_matrix(y_train.survived, X_train.prediction),
             columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])

df

Unnamed: 0,Pred -,Pred +
Actual -,190,103
Actual +,50,156


In [19]:
# Use .iloc for positional access; the index holds string labels.
TN = df['Pred -'].iloc[0]  # 190
FP = df['Pred +'].iloc[0]  # 103
FN = df['Pred -'].iloc[1]  # 50
TP = df['Pred +'].iloc[1]  # 156
total = TN + FP + FN + TP

print('True Negative = ', TN)
print('False Positive = ', FP)
print('False Negative = ', FN)
print('True Positive = ', TP)
print('Total = ', total)

True Negative =  190
False Positive =  103
False Negative =  50
True Positive =  156
Total =  499
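The four counts can also be unpacked directly from the array that `confusion_matrix` returns, without building a DataFrame first. The sketch below uses made-up labels (`y_true` and `y_hat` are illustrative names, not variables from this notebook).

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 0 = died, 1 = survived.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 0, 1, 1, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]];
# ravel() flattens it row by row.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print('TN =', tn, 'FP =', fp, 'FN =', fn, 'TP =', tp)
```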


In [20]:
# Accuracy = # correct / total 
#          = (true positive + true negative) / total
accuracy = (TP + TN) / total
print('Accuracy = ', accuracy)

Accuracy =  0.6933867735470942


In [21]:
# Recall = Sensitivity
#      = true positive rate 
#      = true positive / (true positive + false negative) 
recall = TP / (TP + FN)
print('Recall = ', recall)

Recall =  0.7572815533980582


In [22]:
# False positive rate = 1 - specificity
#      = false positive / (false positive + true negative)
falsepos = FP / (FP + TN)
print('False Positive Rate = ', falsepos)

False Positive Rate =  0.3515358361774744


In [23]:
# true negative rate = specificity = true negative / (true negative + false positive)
trueneg = TN / (TN + FP)
print('True Negative Rate = ', trueneg)

True Negative Rate =  0.6484641638225256


In [24]:
# false negative rate = false negative / (false negative + true positive)
falseneg = FN / (FN + TP)
print('False Negative Rate = ', falseneg)

False Negative Rate =  0.24271844660194175


In [25]:
# precision = true positive / (true positive + false positive)
precision = TP / (TP + FP)
print('Precision = ', precision)

Precision =  0.6023166023166023


In [26]:
# f1-score = harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print('f1-score is {:.2f}'.format(f1))

f1-score is 0.67


In [27]:
died = TN + FP
lived = TP + FN
print(died, 'people died and', lived, 'people lived.')

293 people died and 206 people lived.
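The individual cells above can be gathered into one helper that prints every metric exercise 3 asks for. This is a sketch only: `classification_metrics` is a hypothetical name, and the F1 score is computed as the harmonic mean of precision and recall, which agrees with the 0.67 shown for class 1 in the classification report.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return common binary-classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate, sensitivity
    return {
        'accuracy': (tp + tn) / total,
        'true positive rate': recall,
        'false positive rate': fp / (fp + tn),
        'true negative rate': tn / (tn + fp),   # specificity
        'false negative rate': fn / (fn + tp),
        'precision': precision,
        'recall': recall,
        'f1-score': 2 * precision * recall / (precision + recall),
        'support (negative)': tn + fp,
        'support (positive)': tp + fn,
    }

# Counts taken from the training-set confusion matrix above.
metrics = classification_metrics(tn=190, fp=103, fn=50, tp=156)
for name, value in metrics.items():
    print('{:>20}: {:.4g}'.format(name, value))
```

Note that the true negative rate and the false positive rate always sum to 1, as do the true positive rate and the false negative rate.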


4. Look in the scikit-learn documentation to research the solver parameter. What are the best options for the problem you are trying to solve and the data you are using?

class sklearn.linear_model.LogisticRegression(
    penalty='l2', 
    dual=False, 
    tol=0.0001, 
    C=1.0, 
    fit_intercept=True, 
    intercept_scaling=1, 
    class_weight=None, 
    random_state=None, 
    solver='warn', 
    max_iter=100, 
    multi_class='warn', 
    verbose=0, 
    warm_start=False, 
    n_jobs=None)
    
solver : str, {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default: ‘liblinear’.
Algorithm to use in the optimization problem.

- For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.
- For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.

‘newton-cg’, ‘lbfgs’ and ‘sag’ only handle L2 penalty, whereas ‘liblinear’ and ‘saga’ handle L1 penalty.
Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.

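For a small, binary-class dataset like this one, `liblinear` (the default in this scikit-learn version) is a reasonable choice. Because `age` and `fare` are scaled, `sag` and `saga` are also viable, but their speed advantage only matters on large datasets.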
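A compact way to compare solvers is to loop over them with everything else held fixed. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features, and raises `max_iter` to 1000 (an assumption, not a setting used elsewhere in this notebook) so that the stochastic solvers have room to converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): 500 rows, 5 features, binary target.
X, y = make_classification(n_samples=500, n_features=5, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123)

# Scale using statistics from the training split only; sag and saga converge
# faster when features share a similar scale.
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

scores = {}
for solver in ['liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga']:
    model = LogisticRegression(C=1, class_weight={1: 2}, random_state=123,
                               solver=solver, max_iter=1000)
    model.fit(X_train, y_train)
    scores[solver] = model.score(X_train, y_train)  # in-sample accuracy

for solver, score in scores.items():
    print('{:>10}: {:.3f}'.format(solver, score))
```

Keeping the split, scaling, and hyperparameters identical across iterations means any difference in the scores comes from the solver alone.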

5. Run through steps 2-4 using another solver (from question 4)

In [28]:
# for saga solver:
X_train = []
df = get_titanic_data()
df = prep_titanic_data(df)
# Handle missing values in the `age` column.
df.dropna(inplace=True)
X = df[['pclass','age','fare','sibsp','parch']]
y = df[['survived']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123)

# 1. make the thing
scaler = MinMaxScaler()

# 2. fit the thing
scaler.fit(X_train[['age', 'fare']])

# 3. use the thing
X_train[['age', 'fare']] = scaler.transform(X_train[['age', 'fare']])
X_test[['age', 'fare']] = scaler.transform(X_test[['age', 'fare']])


# from sklearn.linear_model import LogisticRegression
logit = LogisticRegression(C=1, class_weight={1:2}, random_state = 123, solver='saga')
logit.fit(X_train, y_train)
y_pred = logit.predict(X_train)
y_pred_proba = logit.predict_proba(X_train)
X_train['prediction'] = logit.predict(X_train[['pclass','age','fare','sibsp','parch']])
# (y_train.survived == X_train.prediction).sum() / y_train.shape[0]
# logit.score(X_train[['pclass','age','fare','sibsp','parch']], y_train.survived)
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train.drop(columns='prediction'), y_train)))
print(confusion_matrix(y_train, y_pred))
df = pd.DataFrame(confusion_matrix(y_train.survived, X_train.prediction),
             columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])
print(classification_report(y_train, y_pred))
TN = df['Pred -'].iloc[0]  # 190
FP = df['Pred +'].iloc[0]  # 103
FN = df['Pred -'].iloc[1]  # 50
TP = df['Pred +'].iloc[1]  # 156
total = TN + FP + FN + TP

print('True Negative = ', TN)
print('False Positive = ', FP)
print('False Negative = ', FN)
print('True Positive = ', TP)
print('Total = ', total)

# Accuracy = # correct / total 
#          = (true positive + true negative) / total
accuracy = (TP + TN) / total
print('Accuracy = ', accuracy)

# Recall = Sensitivity
#      = true positive rate 
#      = true positive / (true positive + false negative) 
recall = TP / (TP + FN)
print('Recall = ', recall)

# False positive rate = 1 - specificity
#      = false positive / (false positive + true negative)
falsepos = FP / (FP + TN)
print('False Positive Rate = ', falsepos)

# true negative rate = specificity = true negative / (true negative + false positive)
trueneg = TN / (TN + FP)
print('True Negative Rate = ', trueneg)

# false negative rate = false negative / (false negative + true positive)
falseneg = FN / (FN + TP)
print('False Negative Rate = ', falseneg)

# precision = true positive / (true positive + false positive)
precision = TP / (TP + FP)
print('Precision = ', precision)

# f1-score = harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print('f1-score is {:.2f}'.format(f1))

died = TN + FP
lived = TP + FN
print(died, 'people died and', lived, 'people lived.')

Accuracy of Logistic Regression classifier on training set: 0.69
[[190 103]
 [ 50 156]]
              precision    recall  f1-score   support

           0       0.79      0.65      0.71       293
           1       0.60      0.76      0.67       206

   micro avg       0.69      0.69      0.69       499
   macro avg       0.70      0.70      0.69       499
weighted avg       0.71      0.69      0.70       499

True Negative =  190
False Positive =  103
False Negative =  50
True Positive =  156
Total =  499
Accuracy =  0.6933867735470942
Recall =  0.7572815533980582
False Positive Rate =  0.3515358361774744
True Negative Rate =  0.6484641638225256
False Negative Rate =  0.24271844660194175
Precision =  0.6023166023166023
f1-score is 0.67
293 people died and 206 people lived.




In [29]:
# liblinear solver
X_train = []
df = get_titanic_data()
df = prep_titanic_data(df)
# Handle missing values in the `age` column.
df.dropna(inplace=True)
X = df[['pclass','age','fare','sibsp','parch']]
y = df[['survived']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123)

# 1. make the thing
scaler = MinMaxScaler()

# 2. fit the thing
scaler.fit(X_train[['age', 'fare']])

# 3. use the thing
X_train[['age', 'fare']] = scaler.transform(X_train[['age', 'fare']])
X_test[['age', 'fare']] = scaler.transform(X_test[['age', 'fare']])



# from sklearn.linear_model import LogisticRegression
logit = LogisticRegression(C=1, class_weight={1:2}, random_state = 123, solver='liblinear')
logit.fit(X_train, y_train)
y_pred = logit.predict(X_train)
y_pred_proba = logit.predict_proba(X_train)
X_train['prediction'] = logit.predict(X_train[['pclass','age','fare','sibsp','parch']])
# (y_train.survived == X_train.prediction).sum() / y_train.shape[0]
# logit.score(X_train[['pclass','age','fare','sibsp','parch']], y_train.survived)
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train.drop(columns='prediction'), y_train)))
print(confusion_matrix(y_train, y_pred))
df = pd.DataFrame(confusion_matrix(y_train.survived, X_train.prediction),
             columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])
print(classification_report(y_train, y_pred))
TN = df['Pred -'].iloc[0]  # 187
FP = df['Pred +'].iloc[0]  # 106
FN = df['Pred -'].iloc[1]  # 51
TP = df['Pred +'].iloc[1]  # 155
total = TN + FP + FN + TP

print('True Negative = ', TN)
print('False Positive = ', FP)
print('False Negative = ', FN)
print('True Positive = ', TP)
print('Total = ', total)

# Accuracy = # correct / total 
#          = (true positive + true negative) / total
accuracy = (TP + TN) / total
print('Accuracy = ', accuracy)

# Recall = Sensitivity
#      = true positive rate 
#      = true positive / (true positive + false negative) 
recall = TP / (TP + FN)
print('Recall = ', recall)

# False positive rate = 1 - specificity
#      = false positive / (false positive + true negative)
falsepos = FP / (FP + TN)
print('False Positive Rate = ', falsepos)

# true negative rate = specificity = true negative / (true negative + false positive)
trueneg = TN / (TN + FP)
print('True Negative Rate = ', trueneg)

# false negative rate = false negative / (false negative + true positive)
falseneg = FN / (FN + TP)
print('False Negative Rate = ', falseneg)

# precision = true positive / (true positive + false positive)
precision = TP / (TP + FP)
print('Precision = ', precision)

# f1-score = harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print('f1-score is {:.2f}'.format(f1))

died = TN + FP
lived = TP + FN
print(died, 'people died and', lived, 'people lived.')

Accuracy of Logistic Regression classifier on training set: 0.69
[[187 106]
 [ 51 155]]
              precision    recall  f1-score   support

           0       0.79      0.64      0.70       293
           1       0.59      0.75      0.66       206

   micro avg       0.69      0.69      0.69       499
   macro avg       0.69      0.70      0.68       499
weighted avg       0.71      0.69      0.69       499

True Negative =  187
False Positive =  106
False Negative =  51
True Positive =  155
Total =  499
Accuracy =  0.685370741482966
Recall =  0.7524271844660194
False Positive Rate =  0.36177474402730375
True Negative Rate =  0.6382252559726962
False Negative Rate =  0.24757281553398058
Precision =  0.5938697318007663
f1-score is 0.66
293 people died and 206 people lived.


In [30]:
# newton-cg solver
X_train = []
df = get_titanic_data()
df = prep_titanic_data(df)
# Handle missing values in the `age` column.
df.dropna(inplace=True)
X = df[['pclass','age','fare','sibsp','parch']]
y = df[['survived']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123)

# 1. make the thing
scaler = MinMaxScaler()

# 2. fit the thing
scaler.fit(X_train[['age', 'fare']])

# 3. use the thing
X_train[['age', 'fare']] = scaler.transform(X_train[['age', 'fare']])
X_test[['age', 'fare']] = scaler.transform(X_test[['age', 'fare']])



# from sklearn.linear_model import LogisticRegression
logit = LogisticRegression(C=1, class_weight={1:2}, random_state = 123, solver='newton-cg')
logit.fit(X_train, y_train)
y_pred = logit.predict(X_train)
y_pred_proba = logit.predict_proba(X_train)
X_train['prediction'] = logit.predict(X_train[['pclass','age','fare','sibsp','parch']])
# (y_train.survived == X_train.prediction).sum() / y_train.shape[0]
# logit.score(X_train[['pclass','age','fare','sibsp','parch']], y_train.survived)
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train.drop(columns='prediction'), y_train)))
print(confusion_matrix(y_train, y_pred))
df = pd.DataFrame(confusion_matrix(y_train.survived, X_train.prediction),
             columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])
print(classification_report(y_train, y_pred))
TN = df['Pred -'].iloc[0]  # 190
FP = df['Pred +'].iloc[0]  # 103
FN = df['Pred -'].iloc[1]  # 50
TP = df['Pred +'].iloc[1]  # 156
total = TN + FP + FN + TP

print('True Negative = ', TN)
print('False Positive = ', FP)
print('False Negative = ', FN)
print('True Positive = ', TP)
print('Total = ', total)

# Accuracy = # correct / total 
#          = (true positive + true negative) / total
accuracy = (TP + TN) / total
print('Accuracy = ', accuracy)

# Recall = Sensitivity
#      = true positive rate 
#      = true positive / (true positive + false negative) 
recall = TP / (TP + FN)
print('Recall = ', recall)

# False positive rate = 1 - specificity
#      = false positive / (false positive + true negative)
falsepos = FP / (FP + TN)
print('False Positive Rate = ', falsepos)

# true negative rate = specificity = true negative / (true negative + false positive)
trueneg = TN / (TN + FP)
print('True Negative Rate = ', trueneg)

# false negative rate = false negative / (false negative + true positive)
falseneg = FN / (FN + TP)
print('False Negative Rate = ', falseneg)

# precision = true positive / (true positive + false positive)
precision = TP / (TP + FP)
print('Precision = ', precision)

# f1-score = harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print('f1-score is {:.2f}'.format(f1))

died = TN + FP
lived = TP + FN
print(died, 'people died and', lived, 'people lived.')

Accuracy of Logistic Regression classifier on training set: 0.69
[[190 103]
 [ 50 156]]
              precision    recall  f1-score   support

           0       0.79      0.65      0.71       293
           1       0.60      0.76      0.67       206

   micro avg       0.69      0.69      0.69       499
   macro avg       0.70      0.70      0.69       499
weighted avg       0.71      0.69      0.70       499

True Negative =  190
False Positive =  103
False Negative =  50
True Positive =  156
Total =  499
Accuracy =  0.6933867735470942
Recall =  0.7572815533980582
False Positive Rate =  0.3515358361774744
True Negative Rate =  0.6484641638225256
False Negative Rate =  0.24271844660194175
Precision =  0.6023166023166023
f1-score is 0.67
293 people died and 206 people lived.



In [31]:
# sag solver
X_train = []
df = get_titanic_data()
df = prep_titanic_data(df)
# Handle missing values in the `age` column.
df.dropna(inplace=True)
X = df[['pclass','age','fare','sibsp','parch']]
y = df[['survived']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123)

# 1. make the thing
scaler = MinMaxScaler()

# 2. fit the thing
scaler.fit(X_train[['age', 'fare']])

# 3. use the thing
X_train[['age', 'fare']] = scaler.transform(X_train[['age', 'fare']])
X_test[['age', 'fare']] = scaler.transform(X_test[['age', 'fare']])



# from sklearn.linear_model import LogisticRegression
logit = LogisticRegression(C=1, class_weight={1:2}, random_state = 123, solver='sag')
logit.fit(X_train, y_train)
y_pred = logit.predict(X_train)
y_pred_proba = logit.predict_proba(X_train)
X_train['prediction'] = logit.predict(X_train[['pclass','age','fare','sibsp','parch']])
# (y_train.survived == X_train.prediction).sum() / y_train.shape[0]
# logit.score(X_train[['pclass','age','fare','sibsp','parch']], y_train.survived)
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train.drop(columns='prediction'), y_train)))
print(confusion_matrix(y_train, y_pred))
df = pd.DataFrame(confusion_matrix(y_train.survived, X_train.prediction),
             columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])
print(classification_report(y_train, y_pred))
TN = df['Pred -'].iloc[0]  # 190
FP = df['Pred +'].iloc[0]  # 103
FN = df['Pred -'].iloc[1]  # 50
TP = df['Pred +'].iloc[1]  # 156
total = TN + FP + FN + TP

print('True Negative = ', TN)
print('False Positive = ', FP)
print('False Negative = ', FN)
print('True Positive = ', TP)
print('Total = ', total)

# Accuracy = # correct / total 
#          = (true positive + true negative) / total
accuracy = (TP + TN) / total
print('Accuracy = ', accuracy)

# Recall = Sensitivity
#      = true positive rate 
#      = true positive / (true positive + false negative) 
recall = TP / (TP + FN)
print('Recall = ', recall)

# False positive rate = 1 - specificity
#      = false positive / (false positive + true negative)
falsepos = FP / (FP + TN)
print('False Positive Rate = ', falsepos)

# true negative rate = specificity = true negative / (true negative + false positive)
trueneg = TN / (TN + FP)
print('True Negative Rate = ', trueneg)

# false negative rate = false negative / (false negative + true positive)
falseneg = FN / (FN + TP)
print('False Negative Rate = ', falseneg)

# precision = true positive / (true positive + false positive)
precision = TP / (TP + FP)
print('Precision = ', precision)

# f1-score = harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print('f1-score is {:.2f}'.format(f1))

died = TN + FP
lived = TP + FN
print(died, 'people died and', lived, 'people lived.')

Accuracy of Logistic Regression classifier on training set: 0.69
[[190 103]
 [ 50 156]]
              precision    recall  f1-score   support

           0       0.79      0.65      0.71       293
           1       0.60      0.76      0.67       206

   micro avg       0.69      0.69      0.69       499
   macro avg       0.70      0.70      0.69       499
weighted avg       0.71      0.69      0.70       499

True Negative =  190
False Positive =  103
False Negative =  50
True Positive =  156
Total =  499
Accuracy =  0.6933867735470942
Recall =  0.7572815533980582
False Positive Rate =  0.3515358361774744
True Negative Rate =  0.6484641638225256
False Negative Rate =  0.24271844660194175
Precision =  0.6023166023166023
f1-score is 0.67
293 people died and 206 people lived.




In [32]:
# lbfgs solver
X_train = []
df = get_titanic_data()
df = prep_titanic_data(df)
# Handle missing values in the `age` column.
df.dropna(inplace=True)
X = df[['pclass','age','fare','sibsp','parch']]
y = df[['survived']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123)

# 1. make the thing
scaler = MinMaxScaler()

# 2. fit the thing
scaler.fit(X_train[['age', 'fare']])

# 3. use the thing
X_train[['age', 'fare']] = scaler.transform(X_train[['age', 'fare']])
X_test[['age', 'fare']] = scaler.transform(X_test[['age', 'fare']])



# from sklearn.linear_model import LogisticRegression
logit = LogisticRegression(C=1, class_weight={1:2}, random_state = 123, solver='lbfgs')
logit.fit(X_train, y_train)
y_pred = logit.predict(X_train)
y_pred_proba = logit.predict_proba(X_train)
X_train['prediction'] = logit.predict(X_train[['pclass','age','fare','sibsp','parch']])
# (y_train.survived == X_train.prediction).sum() / y_train.shape[0]
# logit.score(X_train[['pclass','age','fare','sibsp','parch']], y_train.survived)
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train.drop(columns='prediction'), y_train)))
print(confusion_matrix(y_train, y_pred))
df = pd.DataFrame(confusion_matrix(y_train.survived, X_train.prediction),
             columns=['Pred -', 'Pred +'], index=['Actual -', 'Actual +'])
print(classification_report(y_train, y_pred))
TN = df['Pred -'].iloc[0]  # 190
FP = df['Pred +'].iloc[0]  # 103
FN = df['Pred -'].iloc[1]  # 50
TP = df['Pred +'].iloc[1]  # 156
total = TN + FP + FN + TP

print('True Negative = ', TN)
print('False Positive = ', FP)
print('False Negative = ', FN)
print('True Positive = ', TP)
print('Total = ', total)

# Accuracy = # correct / total 
#          = (true positive + true negative) / total
accuracy = (TP + TN) / total
print('Accuracy = ', accuracy)

# Recall = Sensitivity
#      = true positive rate 
#      = true positive / (true positive + false negative) 
recall = TP / (TP + FN)
print('Recall = ', recall)

# Specificity = false positive rate
#      = false positive / (false positive + true negative)
specificity = FP / (FP + TN)
print('Specificity = ', specificity)

# true negative rate = true negative / (true negative + false positive)
trueneg = TN / (TN + FP)
print('True Negative Rate = ', trueneg)

# false negative rate = false negative / (false negaitve + true positive)
falseneg = FN / (FN + TP)
print('False Negative Rate = ', falseneg)

# precision = true positive / (true positive + false positive)
precision = TP / (TP + FP)
print('Precision = ', precision)

f1 = (precision + recall) / 2
print('f1-score is ', f1)

died = TN + FP
lived = TP + FN
print(died, 'people died and', lived, 'people lived.')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-

Accuracy of Logistic Regression classifier on training set: 0.69
[[190 103]
 [ 50 156]]
              precision    recall  f1-score   support

           0       0.79      0.65      0.71       293
           1       0.60      0.76      0.67       206

   micro avg       0.69      0.69      0.69       499
   macro avg       0.70      0.70      0.69       499
weighted avg       0.71      0.69      0.70       499

True Negative =  190
False Positive =  103
False Negative =  50
True Positive =  156
Total =  499
Accuracy =  0.6933867735470942
Recall =  0.7572815533980582
False Positive Rate =  0.3515358361774744
True Negative Rate =  0.6484641638225256
False Negative Rate =  0.24271844660194175
Precision =  0.6023166023166023
f1-score is  0.6797990778573303
293 people died and 206 people lived.
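
As a cross-check on the rates printed above, here is a minimal sketch that derives them from the four confusion-matrix counts. The helper name `binary_rates` is my own and is not part of scikit-learn; F1 is computed as the harmonic mean of precision and recall.

```python
def binary_rates(tn, fp, fn, tp):
    """Derive common rates from binary confusion-matrix counts.

    `binary_rates` is a hypothetical helper for illustration only.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                     # true positive rate / sensitivity
    return {
        'accuracy': (tp + tn) / (tn + fp + fn + tp),
        'true_positive_rate': recall,
        'false_positive_rate': fp / (fp + tn),
        'true_negative_rate': tn / (tn + fp),   # specificity
        'false_negative_rate': fn / (fn + tp),
        'precision': precision,
        # F1 is the harmonic mean of precision and recall, not their average.
        'f1': 2 * precision * recall / (precision + recall),
    }


# Counts taken from the confusion matrix printed above: [[190 103] [50 156]].
rates = binary_rates(tn=190, fp=103, fn=50, tp=156)
for name, value in rates.items():
    print('{} = {:.2f}'.format(name, value))
```

With these counts F1 rounds to 0.67, in line with the classification report, and the false positive rate and true negative rate sum to 1.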


6. Which performs better on your in-sample data?

I got the same results for all of the solvers.

7. Save the best model in logit_fit

In [33]:
logit_fit = logit
logit_fit

LogisticRegression(C=1, class_weight={1: 2}, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=123, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

## Decision Tree
1. Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)

In [34]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

from pydataset import data

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn import tree

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

df = data('iris')

df.columns = [col.lower().replace('.', '_') for col in df]

X = df.drop(['species'],axis=1)
y = df[['species']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123)

# for classification you can set the split criterion to gini or entropy
# (information gain).  Default is gini.
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=123)

clf.fit(X_train, y_train)

print("for features", X_train.columns)
print(clf.feature_importances_)
print()

y_pred = clf.predict(X_train)
#print(y_pred[0:5]) # ['virginica' 'virginica' 'versicolor' 'setosa' 'setosa']

y_pred_proba = clf.predict_proba(X_train)
print(y_pred_proba)

for features Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], dtype='object')
[0.         0.         0.57027606 0.42972394]

[[0.    0.    1.   ]
 [0.    0.    1.   ]
 [0.    0.975 0.025]
 [1.    0.    0.   ]
 [1.    0.    0.   ]
 [0.    0.    1.   ]
 [1.    0.    0.   ]
 [1.    0.    0.   ]
 [0.    0.975 0.025]
 [0.    0.975 0.025]
 [0.    0.975 0.025]
 [0.    0.975 0.025]
 [0.    0.    1.   ]
 [0.    0.975 0.025]
 [0.    0.    1.   ]
 [1.    0.    0.   ]
 [0.    0.    1.   ]
 [0.    0.975 0.025]
 [1.    0.    0.   ]
 [1.    0.    0.   ]
 [0.    0.    1.   ]
 [0.    0.975 0.025]
 [0.    0.    1.   ]
 [0.    0.    1.   ]
 [1.    0.    0.   ]
 [0.    0.975 0.025]
 [0.    0.975 0.025]
 [0.    0.    1.   ]
 [1.    0.    0.   ]
 [0.    0.    1.   ]
 [0.    0.975 0.025]
 [0.    0.975 0.025]
 [1.    0.    0.   ]
 [0.    0.    1.   ]
 [0.    0.    1.   ]
 [1.    0.    0.   ]
 [1.    0.    0.   ]
 [0.    0.975 0.025]
 [0.    0.975 0.025]
 [0.    0.    1.   ]
 [1.    0.    

2. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

In [35]:
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
cm = confusion_matrix(y_train, y_pred)
cm

Accuracy of Decision Tree classifier on training set: 0.98


array([[32,  0,  0],
       [ 0, 40,  0],
       [ 0,  2, 31]])

In [36]:
sorted(y_train.species.unique())
y_train.species.value_counts()

labels = sorted(y_train.species.unique())

pd.DataFrame(confusion_matrix(y_train, y_pred), index=labels, columns=labels)

Unnamed: 0,setosa,versicolor,virginica
setosa,32,0,0
versicolor,0,40,0
virginica,0,2,31


3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [37]:
print(classification_report(y_train, y_pred))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        32
  versicolor       0.95      1.00      0.98        40
   virginica       1.00      0.94      0.97        33

   micro avg       0.98      0.98      0.98       105
   macro avg       0.98      0.98      0.98       105
weighted avg       0.98      0.98      0.98       105

Accuracy of Decision Tree classifier on test set: 0.93
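
The classification report covers precision, recall, f1-score, and support, but not the false positive, true negative, or false negative rates. For a three-class problem these can be computed one class at a time, treating that class as "positive" and the other two as "negative". The sketch below does this for the training confusion matrix shown above; `one_vs_rest_rates` is a hypothetical helper name, not a scikit-learn function.

```python
labels = ['setosa', 'versicolor', 'virginica']
# Rows are actual classes, columns are predicted classes (training set above).
cm = [[32, 0, 0],
      [0, 40, 0],
      [0, 2, 31]]


def one_vs_rest_rates(cm, labels):
    """Return per-class rates, treating each class in turn as 'positive'.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    total = sum(sum(row) for row in cm)
    rates = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                   # actual class i, predicted as another
        fp = sum(row[i] for row in cm) - tp    # predicted class i, actually another
        tn = total - tp - fn - fp
        rates[label] = {
            'true_positive_rate': tp / (tp + fn),
            'false_positive_rate': fp / (fp + tn),
            'true_negative_rate': tn / (tn + fp),
            'false_negative_rate': fn / (fn + tp),
            'precision': tp / (tp + fp),
            'support': tp + fn,
        }
    return rates


for label, values in one_vs_rest_rates(cm, labels).items():
    print(label, {k: round(v, 2) for k, v in values.items()})
```

The true positive rate is the same quantity the report calls recall, so the virginica value (31 / 33) should agree with the 0.94 shown above.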


In [38]:
## need to install graphviz to anaconda
## example: 

from sklearn.datasets import load_iris

# iris = load_iris()
# clf = tree.DecisionTreeClassifier()
# clf = clf.fit(iris.data, iris.target)

import graphviz

from graphviz import Graph

dot_data = tree.export_graphviz(clf, out_file=None) 
graph = graphviz.Source(dot_data) 

graph.render('iris_decision_tree2', view=True)

'iris_decision_tree2.pdf'

4. Run through steps 2-4 using entropy as your measure of impurity.

In [39]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

from pydataset import data

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn import tree

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

df = data('iris')

df.columns = [col.lower().replace('.', '_') for col in df]

X = df.drop(['species'],axis=1)
y = df[['species']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123)

# for classification you can set the split criterion to gini or entropy
# (information gain).  Default is gini.
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=123)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_train)
#print(y_pred[0:5]) # ['virginica' 'virginica' 'versicolor' 'setosa' 'setosa']

y_pred_proba = clf.predict_proba(X_train)

print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
cm = confusion_matrix(y_train, y_pred)

sorted(y_train.species.unique())
y_train.species.value_counts()

labels = sorted(y_train.species.unique())

pd.DataFrame(confusion_matrix(y_train, y_pred), index=labels, columns=labels)

print(classification_report(y_train, y_pred))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

cm = pd.DataFrame(confusion_matrix(y_train, y_pred), index=labels, columns=labels)
print(classification_report(y_train, y_pred))
cm

Accuracy of Decision Tree classifier on training set: 0.98
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        32
  versicolor       0.95      1.00      0.98        40
   virginica       1.00      0.94      0.97        33

   micro avg       0.98      0.98      0.98       105
   macro avg       0.98      0.98      0.98       105
weighted avg       0.98      0.98      0.98       105

Accuracy of Decision Tree classifier on test set: 0.93
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        32
  versicolor       0.95      1.00      0.98        40
   virginica       1.00      0.94      0.97        33

   micro avg       0.98      0.98      0.98       105
   macro avg       0.98      0.98      0.98       105
weighted avg       0.98      0.98      0.98       105



Unnamed: 0,setosa,versicolor,virginica
setosa,32,0,0
versicolor,0,40,0
virginica,0,2,31


In [40]:
## need to install graphviz to anaconda
## example: 

from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

import graphviz

from graphviz import Graph

dot_data = tree.export_graphviz(clf, out_file=None) 
graph = graphviz.Source(dot_data) 

graph.render('iris_decision_tree2', view=True)

'iris_decision_tree2.pdf'

5. Which performs better on your in-sample data?

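They perform the same: both criteria give 98% accuracy on the training set, 93% on the test set, and identical confusion matrices.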
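
To see why the two criteria often agree, it can help to compute both impurity measures by hand. The sketch below is a plain-Python illustration, not scikit-learn's internal code; the root counts are the training-set supports reported above.

```python
from math import log2


def gini(counts):
    """Gini impurity: 1 - sum(p_i ** 2) over the class proportions p_i."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy in bits: sum(p_i * log2(1 / p_i)), skipping empty classes."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)


# Class counts in the training set above (setosa, versicolor, virginica).
root = [32, 40, 33]
# A pure node: every sample belongs to one class.
pure = [32, 0, 0]

print('root  gini = {:.3f}, entropy = {:.3f}'.format(gini(root), entropy(root)))
print('pure  gini = {:.3f}, entropy = {:.3f}'.format(gini(pure), entropy(pure)))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, so they tend to prefer the same splits on a dataset as cleanly separable as iris.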

6. Save the best model in tree_fit

In [41]:
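# NOTE: `clf` was reassigned in the previous cell to a default
# DecisionTreeClassifier fit on the full iris data, so that is the model
# saved here (see the repr below), not the max_depth=3 tree evaluated above.
tree_fit = clf
tree_fit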

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

## KNN
1. Fit the K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample)

In [42]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

from acquire import get_titanic_data
from prepare import prep_titanic_data

df = prep_titanic_data(get_titanic_data())

df.dropna(inplace=True) # handle missing age values

X = df[['pclass','age','fare','sibsp','parch']]
y = df[['survived']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123)

# weights can be 'uniform' or 'distance'
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')
# classify by majority vote among the 5 nearest neighbors;
# weights='distance' would give closer neighbors more influence

knn.fit(X_train, y_train)

y_pred = knn.predict(X_train)

y_pred_proba = knn.predict_proba(X_train)

2. Evaluate your results using the model score, confusion matrix, and classification report.

In [43]:
print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))

Accuracy of KNN classifier on training set: 0.76
[[239  54]
 [ 65 141]]
              precision    recall  f1-score   support

           0       0.79      0.82      0.80       293
           1       0.72      0.68      0.70       206

   micro avg       0.76      0.76      0.76       499
   macro avg       0.75      0.75      0.75       499
weighted avg       0.76      0.76      0.76       499



3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [44]:
print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print('Accuracy of KNN classifier on test set: {:.2f}'
     .format(knn.score(X_test, y_test)))
# print(confusion_matrix(y_test, knn.predict(X_test)))
# print(classification_report(y_test, knn.predict(X_test)))

Accuracy of KNN classifier on training set: 0.76
[[239  54]
 [ 65 141]]
              precision    recall  f1-score   support

           0       0.79      0.82      0.80       293
           1       0.72      0.68      0.70       206

   micro avg       0.76      0.76      0.76       499
   macro avg       0.75      0.75      0.75       499
weighted avg       0.76      0.76      0.76       499

Accuracy of KNN classifier on test set: 0.67


4. Run through steps 1-3 setting k to 10

In [45]:
# weights can be 'uniform' or 'distance'
knn = KNeighborsClassifier(n_neighbors=10, weights='uniform')
# classify by majority vote among the 10 nearest neighbors;
# weights='distance' would give closer neighbors more influence

knn.fit(X_train, y_train)

y_pred = knn.predict(X_train)

y_pred_proba = knn.predict_proba(X_train)

print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))

print('Accuracy of KNN classifier on test set: {:.2f}'
     .format(knn.score(X_test, y_test)))

Accuracy of KNN classifier on training set: 0.71
[[252  41]
 [103 103]]
              precision    recall  f1-score   support

           0       0.71      0.86      0.78       293
           1       0.72      0.50      0.59       206

   micro avg       0.71      0.71      0.71       499
   macro avg       0.71      0.68      0.68       499
weighted avg       0.71      0.71      0.70       499

Accuracy of KNN classifier on test set: 0.70


5. Run through steps 1-3 setting k to 20

In [46]:
# weights can be 'uniform' or 'distance'
knn = KNeighborsClassifier(n_neighbors=20, weights='uniform')
# classify by majority vote among the 20 nearest neighbors;
# weights='distance' would give closer neighbors more influence

knn.fit(X_train, y_train)

y_pred = knn.predict(X_train)

y_pred_proba = knn.predict_proba(X_train)

print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))

print('Accuracy of KNN classifier on test set: {:.2f}'
     .format(knn.score(X_test, y_test)))

Accuracy of KNN classifier on training set: 0.71
[[249  44]
 [100 106]]
              precision    recall  f1-score   support

           0       0.71      0.85      0.78       293
           1       0.71      0.51      0.60       206

   micro avg       0.71      0.71      0.71       499
   macro avg       0.71      0.68      0.69       499
weighted avg       0.71      0.71      0.70       499

Accuracy of KNN classifier on test set: 0.72


6. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

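The K-Nearest Neighbors model with k = 5 was the best fit on my in-sample data with 76% accuracy, compared with 71% for both k = 10 and k = 20, likely because a smaller k follows the training points more closely. On the test data it was the weakest, though: 67% versus 70% (k = 10) and 72% (k = 20). The metric that moves most is recall for survivors, which falls from 0.68 at k = 5 to about 0.50 at the larger k values, while precision for survivors stays near 0.72. Since the question is about in-sample performance, I'll save k = 5.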
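
The effect of k can be seen in a stripped-down version of the algorithm. This is a sketch on made-up toy data, not the Titanic features, and `knn_predict` is a hypothetical helper that mimics uniform-weight majority voting with Euclidean distance.

```python
from collections import Counter
from math import dist  # available in Python 3.8+


def knn_predict(train_X, train_y, point, k):
    """Predict the label of `point` by majority vote among its k nearest
    training points (Euclidean distance, uniform weights)."""
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: dist(pair[0], point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up toy data: two features, binary label.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6), (2, 2)]
train_y = [0, 0, 0, 1, 1, 1, 1]   # the last point is a class-1 outlier

# With k=1 the nearby outlier decides alone; with k=5 its neighbors outvote it.
print(knn_predict(train_X, train_y, (2.1, 2.1), k=1))
print(knn_predict(train_X, train_y, (2.1, 2.1), k=5))
```

A small k lets individual training points, including noisy ones, drive the prediction, which raises training accuracy but can hurt accuracy on new data.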

7. Save the best model in knn_fit

In [47]:
# `knn` currently holds the k = 20 model, so rebuild and fit the k = 5 model
knn_fit = KNeighborsClassifier(n_neighbors=5, weights='uniform').fit(X_train, y_train)

## Random Forest
1. Fit the Random Forest classifier to your training sample and transform (i.e. make predictions on the training sample) setting the random_state accordingly and setting min_samples_leaf = 1 and max_depth = 20.

In [48]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from acquire import get_titanic_data
from prepare import prep_titanic_data

df = prep_titanic_data(get_titanic_data())

# Handle missing age values
df.dropna(inplace=True)
print('number of nulls = ')
print(df.isnull().sum())
print()

X = df[['pclass','age','fare','sibsp','parch']]
y = df.survived

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123)

# setting the random_state accordingly and 
# setting min_samples_leaf = 1 and max_depth = 20.
rf = RandomForestClassifier(bootstrap=True, 
                            class_weight=None, 
                            criterion='gini',
                            min_samples_leaf=1,
                            n_estimators=100,
                            max_depth=20, 
                            random_state=123)
# min_samples_leaf is set to 1 and max_depth to 20, as the exercise asks

rf.fit(X_train, y_train)

print('Gini-based importance of each feature, in the order listed:')
print("for features ['pclass','age','fare','sibsp','parch']:")
print(rf.feature_importances_)
# for this model, age (0.39) and fare (0.38) are the strongest indicators of survival


y_pred = rf.predict(X_train)

y_pred_proba = rf.predict_proba(X_train)

number of nulls = 
passenger_id       0
survived           0
pclass             0
sex                0
age                0
sibsp              0
parch              0
fare               0
embarked           0
class              0
embark_town        0
alone              0
embarked_encode    0
dtype: int64

Gini-based importance of each feature, in the order listed:
for features ['pclass','age','fare','sibsp','parch']:
[0.10439494 0.39013442 0.38148822 0.06701136 0.05697105]


2. Evaluate your results using the model score, confusion matrix, and classification report.

In [49]:
print('Accuracy of random forest classifier on training set: {:.2f}'
     .format(rf.score(X_train, y_train)))
print()
print(confusion_matrix(y_train, y_pred))
# y_train is rows
# y_pred is columns

# 291 - pred died, died       | 2 - pred to survive, died
# 6 - pred died, survived     | 200 - pred to survive, survived

# accuracy = (291 + 200) / (291 + 2 + 6 + 200)
# recall of surviving = sensitivity = 200 / (6 + 200)
# recall of not surviving = specificity = 291 / (291 + 2)
# precision of surviving = 200 / (2 + 200)
# precision of not surviving = 291 / (291 + 6)
# false negative rate = 6 / (6 + 200)

print()
print(classification_report(y_train, y_pred))

Accuracy of random forest classifier on training set: 0.98

[[291   2]
 [  6 200]]

              precision    recall  f1-score   support

           0       0.98      0.99      0.99       293
           1       0.99      0.97      0.98       206

   micro avg       0.98      0.98      0.98       499
   macro avg       0.98      0.98      0.98       499
weighted avg       0.98      0.98      0.98       499



3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [50]:
print('Accuracy of random forest classifier on TRAIN data: {:.2f}'
     .format(rf.score(X_train, y_train)))
print()
print(confusion_matrix(y_train, y_pred))
print()
print(classification_report(y_train, y_pred))

print('Accuracy of output when model is run on TEST data:')
print(rf.score(X_test, y_test))
# print()
# print(confusion_matrix(y_test, rf.predict(X_test)))
# print()
# print(classification_report(y_test, rf.predict(X_test)))

Accuracy of random forest classifier on TRAIN data: 0.98

[[291   2]
 [  6 200]]

              precision    recall  f1-score   support

           0       0.98      0.99      0.99       293
           1       0.99      0.97      0.98       206

   micro avg       0.98      0.98      0.98       499
   macro avg       0.98      0.98      0.98       499
weighted avg       0.98      0.98      0.98       499

Accuracy of output when model is run on TEST data:
0.7116279069767442


4. Run through steps increasing your min_samples_leaf to 5 and decreasing your max_depth to 3.

In [51]:
# setting the random_state accordingly and 
# setting min_samples_leaf = 5 and max_depth = 3.
rf = RandomForestClassifier(bootstrap=True, 
                            class_weight=None, 
                            criterion='gini',
                            min_samples_leaf=5,
                            n_estimators=100,
                            max_depth=3, 
                            random_state=123)
# min_samples_leaf is raised to 5 and max_depth lowered to 3, as the exercise asks

rf.fit(X_train, y_train)

print('this shows gini-index, shows you the importance of each feature in order')
print('shows that fare is biggest indicator of survival')
print("for features ['pclass','age','fare','sibsp','parch']:")
print(rf.feature_importances_)
print()

y_pred = rf.predict(X_train)

y_pred_proba = rf.predict_proba(X_train)

# y_train is rows
# y_pred is columns

# 247 - pred died, died       | 46 - pred to survive, died
# 79 - pred died, survived    | 127 - pred to survive, survived

# accuracy = (247 + 127) / (247 + 46 + 79 + 127)
# recall of surviving = sensitivity = 127 / (79 + 127)
# recall of not surviving = specificity = 247 / (247 + 46)
# precision of surviving = 127 / (46 + 127)
# precision of not surviving = 247 / (247 + 79)
# false negative rate = 79 / (79 + 127)

print('Accuracy of random forest classifier on TRAIN data: {:.2f}'
     .format(rf.score(X_train, y_train)))
print()
print(confusion_matrix(y_train, y_pred))
print()
print(classification_report(y_train, y_pred))

print('Accuracy of output when model is run on TEST data:')
print(rf.score(X_test, y_test))
# print()
# print(confusion_matrix(y_test, rf.predict(X_test)))
# print()
# print(classification_report(y_test, rf.predict(X_test)))

this shows gini-index, shows you the importance of each feature in order
shows that fare is biggest indicator of survival
for features ['pclass','age','fare','sibsp','parch']:
[0.31756957 0.13479889 0.39019831 0.07086815 0.08656508]

Accuracy of random forest classifier on TRAIN data: 0.75

[[247  46]
 [ 79 127]]

              precision    recall  f1-score   support

           0       0.76      0.84      0.80       293
           1       0.73      0.62      0.67       206

   micro avg       0.75      0.75      0.75       499
   macro avg       0.75      0.73      0.73       499
weighted avg       0.75      0.75      0.75       499

Accuracy of output when model is run on TEST data:
0.7441860465116279


5. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

The first random forest classifier (min_samples_leaf = 1, max_depth = 20) gave much better in-sample results, with 98% accuracy, but its accuracy on the test data was only 71%, which suggests the model overfit the training data. The second classifier (min_samples_leaf = 5, max_depth = 3) scored 75% in-sample and 74% on the test data, so I'll go with the second one.
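
The overfitting argument comes down to the gap between training and test accuracy. A small sketch of that comparison, using the accuracies copied from the outputs above (the dictionary keys and the `generalization_gap` helper are my own naming):

```python
# Accuracies copied from the outputs above: (training, test).
results = {
    'min_samples_leaf=1, max_depth=20': (0.98, 0.7116279069767442),
    'min_samples_leaf=5, max_depth=3': (0.75, 0.7441860465116279),
}


def generalization_gap(train_acc, test_acc):
    """Training accuracy minus test accuracy; a large gap hints at overfitting."""
    return train_acc - test_acc


for name, (train_acc, test_acc) in results.items():
    print('{}: train={:.2f} test={:.2f} gap={:.2f}'.format(
        name, train_acc, test_acc, generalization_gap(train_acc, test_acc)))

# Pick the configuration with the highest test accuracy.
best = max(results, key=lambda name: results[name][1])
print('best on test data:', best)
```

Strictly speaking, choosing between models by their test accuracy uses the test set for tuning; a separate validation split would be a cleaner way to make this choice.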

6. Save the best model in forest_fit

In [52]:
# build the chosen configuration and fit it so the saved model can predict
forest_fit = RandomForestClassifier(bootstrap=True, 
                            class_weight=None, 
                            criterion='gini',
                            min_samples_leaf=5,
                            n_estimators=100,
                            max_depth=3, 
                            random_state=123).fit(X_train, y_train)

K-Nearest Neighbor:

Accuracy of KNN classifier on training set: 0.76
[[239  54]
 [ 65 141]]
              precision    recall  f1-score   support

           0       0.79      0.82      0.80       293
           1       0.72      0.68      0.70       206

   micro avg       0.76      0.76      0.76       499
   macro avg       0.75      0.75      0.75       499
weighted avg       0.76      0.76      0.76       499

Accuracy of KNN classifier on test set: 0.67

Random Forest Classifier:

for features ['pclass','age','fare','sibsp','parch']:
[0.31756957 0.13479889 0.39019831 0.07086815 0.08656508]

Accuracy of random forest classifier on TRAIN data: 0.75

[[247  46]
 [ 79 127]]

              precision    recall  f1-score   support

           0       0.76      0.84      0.80       293
           1       0.73      0.62      0.67       206

   micro avg       0.75      0.75      0.75       499
   macro avg       0.75      0.73      0.73       499
weighted avg       0.75      0.75      0.75       499

Accuracy of output when model is run on TEST data:
0.7441860465116279

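Going with the K-Nearest Neighbors model with k = 5. It has the slightly higher in-sample accuracy (0.76 versus 0.75), although the random forest scored higher on the test set (0.74 versus 0.67).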

## Test
Once you have determined which algorithm (with metaparameters) performs the best, try reducing the number of features to the top 4 features in terms of information gained for each feature individually. That is, how close do we get to predicting accurately the survival with each feature?

1. Compute the information gained.

In [54]:
df = data('iris')

df.columns = [col.lower().replace('.', '_') for col in df]
df.sample(3)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
108,7.3,2.9,6.3,1.8,virginica
135,6.1,2.6,5.6,1.4,virginica
146,6.7,3.0,5.2,2.3,virginica


In [56]:
X = df.drop(['species', 'sepal_length', 'sepal_width'],axis=1)
y = df[['species']]
X

Unnamed: 0,petal_length,petal_width
1,1.4,0.2
2,1.4,0.2
3,1.3,0.2
4,1.5,0.2
5,1.4,0.2
6,1.7,0.4
7,1.4,0.3
8,1.5,0.2
9,1.4,0.2
10,1.5,0.1


In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123)

# for classification you can set the split criterion to gini or entropy
# (information gain).  Default is gini.
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=123)

clf.fit(X_train, y_train)

print("for features", X_train.columns)
print(clf.feature_importances_)
print()

y_pred = clf.predict(X_train)
#print(y_pred[0:5]) # ['virginica' 'virginica' 'versicolor' 'setosa' 'setosa']

y_pred_proba = clf.predict_proba(X_train)
print(y_pred_proba)

print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
cm = confusion_matrix(y_train, y_pred)
print(cm)

sorted(y_train.species.unique())
y_train.species.value_counts()

labels = sorted(y_train.species.unique())

pd.DataFrame(confusion_matrix(y_train, y_pred), index=labels, columns=labels)

print(classification_report(y_train, y_pred))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

import graphviz

from graphviz import Graph

dot_data = tree.export_graphviz(clf, out_file=None) 
graph = graphviz.Source(dot_data) 

graph.render('iris_decision_tree3', view=True)

for features Index(['petal_length', 'petal_width'], dtype='object')
[0.57027606 0.42972394]

[[0.    0.    1.   ]
 [0.    0.    1.   ]
 [0.    0.975 0.025]
 [1.    0.    0.   ]
 [1.    0.    0.   ]
 [0.    0.    1.   ]
 [1.    0.    0.   ]
 [1.    0.    0.   ]
 [0.    0.975 0.025]
 [0.    0.975 0.025]
 [0.    0.975 0.025]
 [0.    0.975 0.025]
 [0.    0.    1.   ]
 [0.    0.975 0.025]
 [0.    0.    1.   ]
 [1.    0.    0.   ]
 [0.    0.    1.   ]
 [0.    0.975 0.025]
 [1.    0.    0.   ]
 [1.    0.    0.   ]
 [0.    0.    1.   ]
 [0.    0.975 0.025]
 [0.    0.    1.   ]
 [0.    0.    1.   ]
 [1.    0.    0.   ]
 [0.    0.975 0.025]
 [0.    0.975 0.025]
 [0.    0.    1.   ]
 [1.    0.    0.   ]
 [0.    0.    1.   ]
 [0.    0.975 0.025]
 [0.    0.975 0.025]
 [1.    0.    0.   ]
 [0.    0.    1.   ]
 [0.    0.    1.   ]
 [1.    0.    0.   ]
 [1.    0.    0.   ]
 [0.    0.975 0.025]
 [0.    0.975 0.025]
 [0.    0.    1.   ]
 [1.    0.    0.   ]
 [1.    0.    0.   ]
 [0.    0.975 0.025]
 [1.

'iris_decision_tree3.pdf'
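
Note that `feature_importances_` reports impurity-based importances from the fitted tree, which is related to but not the same as the information gained by each feature on its own. As a rough sketch of the latter, the information gain of a single threshold split is the parent entropy minus the weighted entropy of the two children. The data below is made up for illustration, and `information_gain` is a hypothetical helper, not a scikit-learn function.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return sum((c / total) * log2(total / c) for c in counts.values())


def information_gain(values, labels, threshold):
    """Entropy of the parent minus the weighted entropy of the two children
    produced by splitting on `value <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    weighted = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Made-up petal lengths and species for illustration (not rows from the data).
petal_length = [1.4, 1.5, 1.3, 4.7, 4.5, 4.9, 6.0, 5.8]
species = ['setosa'] * 3 + ['versicolor'] * 3 + ['virginica'] * 2

print(round(information_gain(petal_length, species, threshold=2.5), 3))
```

Repeating this over candidate thresholds for each feature, and keeping the best gain per feature, would give one way to rank features individually.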

2. Create a new dataframe with top 4 features (train_df_reduced).

In [57]:
train_df_reduced = df.drop(['species'],axis=1)

3. Use the top performing algorithm with the metaparameters used in that model. Create the object, fit, transform on in-sample data, and evaluate the results. Compare your evaluation metrics with those from the original model (with all the features). Select the best model.

4. Run your final model on your out-of-sample dataframe (test_df). Evaluate the results.

# Feature Engineering
- Titanic Data
    - Create a feature named who, this should be either man, woman, or child. How does including this feature affect your model's performance?
    - Create a feature named adult_male that is either a 1 or a 0. How does this affect your model's predictions?

- Iris Data
    - Create features named petal_area and sepal_area.
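
One possible starting point for these features is sketched below as plain functions. The age cutoff for "child" is an assumption (the exercise does not define it), and treating area as length times width is a rectangular approximation; all function names are hypothetical.

```python
CHILD_AGE_CUTOFF = 16  # assumption: the exercise does not define "child"


def who(sex, age):
    """Return 'child', 'man', or 'woman' for one passenger."""
    if age < CHILD_AGE_CUTOFF:
        return 'child'
    return 'man' if sex == 'male' else 'woman'


def adult_male(sex, age):
    """Return 1 for an adult male passenger, otherwise 0."""
    return int(who(sex, age) == 'man')


def area(length, width):
    """Rectangular approximation of area: length times width."""
    return length * width


# (sex, age) pairs shaped like the Titanic columns; values are illustrative.
for sex, age in [('male', 27.0), ('female', 40.0), ('male', 3.0)]:
    print(sex, age, who(sex, age), adult_male(sex, age))

# Petal measurements shaped like the iris columns.
print(area(1.4, 0.2))
```

On the dataframes used above, the same logic could be applied column-wise, for example `df['who'] = [who(s, a) for s, a in zip(df.sex, df.age)]` for Titanic and `df['petal_area'] = df.petal_length * df.petal_width` for iris.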