# Decision Tree
## Titanic Data - Decision Tree
- Determine baseline and baseline accuracy.
- Split data, fit decision tree classifier to data.
- Make predictions.
- Get model score for training dataset.
- Print confusion matrix and classification report for predictions.
- Fit a new model and run the same analysis with a different max_depth.
- Determine which model performs better on in-sample data.
- Determine which model performs better on out-of-sample data.

In [1]:
### Imports ###
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

from env import get_db_url
import explore

In [2]:
### Pull titanic_db data ###
url = get_db_url(db_name='titanic_db')
df = pd.read_sql('SELECT * FROM passengers', url)
df.head(3)

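| | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | | Southampton | 0 |
| 1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | C | Cherbourg | 0 |
| 2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | | Southampton | 1 |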


In [3]:
### Save a copy of dataframe before manipulations ###
df_original = df.copy()

In [4]:
### Split data ###
train_validate, test = train_test_split(df, test_size=.2, 
                                            random_state=123, 
                                            stratify=df.survived)
train, validate = train_test_split(train_validate, test_size=.3,
                                                   random_state=123,
                                                   stratify=train_validate.survived)
train.shape, validate.shape, test.shape

((498, 13), (214, 13), (179, 13))
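The two-stage split above leaves roughly 56% of rows for train, 24% for validate, and 20% for test, because the 30% validate share is taken from the 80% that remains after the test split. A minimal sketch of that arithmetic follows. The helper name `split_sizes` is hypothetical, and rounding the held-out share up is an assumption that happens to reproduce the shapes printed in this notebook.

```python
import math

def split_sizes(n_rows, test_size=0.2, validate_size=0.3):
    """Approximate (train, validate, test) row counts for a two-stage split.

    Assumes the held-out share is rounded up at each stage.
    """
    n_test = math.ceil(n_rows * test_size)
    n_train_validate = n_rows - n_test
    n_validate = math.ceil(n_train_validate * validate_size)
    n_train = n_train_validate - n_validate
    return n_train, n_validate, n_test

print(split_sizes(891))   # Titanic row count used in this notebook
```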

In [5]:
### Baseline and Baseline Prediction ###
# train.survived.value_counts()
(train.survived == 0).mean()

0.6164658634538153
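The baseline is the accuracy of always predicting the most common class. A minimal sketch of that calculation, using a small made-up target series (the values are illustrative, not the Titanic data):

```python
import pandas as pd

# Hypothetical target column: 1 = survived, 0 = did not survive
y = pd.Series([0, 0, 0, 1, 1, 0, 1, 0])

# The baseline prediction is the most frequent class
baseline_class = y.mode()[0]

# Baseline accuracy is the share of rows that match that class
baseline_accuracy = (y == baseline_class).mean()

print(baseline_class, baseline_accuracy)
```

Looking up the mode instead of hard-coding `0` keeps the calculation correct if the majority class differs in another dataset.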

Data is tidy, continuing...

In [6]:
### Bivariate Exploration setup ###
cat_vars = ['pclass', 'sex', 'deck', 'embark_town', 'alone']
quant_vars = ['age', 'sibsp', 'parch', 'fare']
target = 'survived'

In [7]:
### Bivariate Exploration ###
# explore.explore_bivariate(train, target, cat_vars, quant_vars)

Candidates for model:
- Sex
- Fare
- Passenger Class
- Alone

Candidates contain no nulls, continuing...

In [8]:
### Set original subsets to enable rerun of following cells without notebook restart ###
train_original = train.copy()
validate_original = validate.copy()
test_original = test.copy()

In [9]:
### Set features, prepare for model ###
train, validate, test = train_original.copy(), validate_original.copy(), test_original.copy()

train = train[['survived','sex','pclass','alone','fare']]
validate = validate[['survived','sex','pclass','alone','fare']]
test = test[['survived','sex','pclass','alone','fare']]

map1 = {'male':0, 'female':1}
train['sex'] = train.sex.map(map1)
validate['sex'] = validate.sex.map(map1)
test['sex'] = test.sex.map(map1)

train.head(3)

| | survived | sex | pclass | alone | fare |
|---|---|---|---|---|---|
| 583 | 0 | 0 | 1 | 1 | 40.125 |
| 165 | 1 | 0 | 3 | 0 | 20.525 |
| 50 | 0 | 0 | 3 | 0 | 39.6875 |


In [10]:
### Set target ###
X_train, y_train = train.drop(columns='survived'), train.survived
X_validate, y_validate = validate.drop(columns='survived'), validate.survived
X_test, y_test = test.drop(columns='survived'), test.survived

In [11]:
### Create Decision Tree Model ###
clf = DecisionTreeClassifier(max_depth=3, random_state=123)
clf = clf.fit(X_train, y_train)

In [12]:
### Show score, make predictions ###
print("Score:", clf.score(X_train, y_train))
y_pred = clf.predict(X_train)

Score: 0.8232931726907631


In [13]:
### Print confusion matrix and classification report ###
print("Confusion Matrix")
print(confusion_matrix(y_train, y_pred), "\n")

print("Classification Report")
report = pd.DataFrame(classification_report(y_train, y_pred, output_dict=True))
print(report)

Confusion Matrix
[[276  31]
 [ 57 134]] 

Classification Report
                    0           1  accuracy   macro avg  weighted avg
precision    0.828829    0.812121  0.823293    0.820475      0.822421
recall       0.899023    0.701571  0.823293    0.800297      0.823293
f1-score     0.862500    0.752809  0.823293    0.807654      0.820430
support    307.000000  191.000000  0.823293  498.000000    498.000000
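The figures in the classification report can be recomputed directly from the confusion matrix. A short sketch using the matrix printed above, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes) and treating class 1 (survived) as the positive class:

```python
import numpy as np

# Confusion matrix from the max_depth=3 tree on the training data
cm = np.array([[276, 31],
               [57, 134]])

# Unpack in row-major order: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)   # of predicted survivors, share that survived
recall_1 = tp / (tp + fn)      # of actual survivors, share that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 6), round(precision_1, 6), round(recall_1, 6), round(f1_1, 6))
```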


In [14]:
### Try a model with different max_depth ###
clf_alt = DecisionTreeClassifier(max_depth=1, random_state=123)
clf_alt = clf_alt.fit(X_train, y_train)

print("Score:", clf_alt.score(X_train, y_train), "\n")
y_pred_alt = clf_alt.predict(X_train)

print("Confusion Matrix")
print(confusion_matrix(y_train, y_pred_alt), "\n")

print("Classification Report")
report = pd.DataFrame(classification_report(y_train, y_pred_alt, output_dict=True))
print(report)

Score: 0.7991967871485943 

Confusion Matrix
[[265  42]
 [ 58 133]] 

Classification Report
                    0           1  accuracy   macro avg  weighted avg
precision    0.820433    0.760000  0.799197    0.790217      0.797255
recall       0.863192    0.696335  0.799197    0.779764      0.799197
f1-score     0.841270    0.726776  0.799197    0.784023      0.797358
support    307.000000  191.000000  0.799197  498.000000    498.000000


Model 1 score (max_depth=3): 82.3% ----- Model 2 score (max_depth=1): 79.9%

In [15]:
### Run models against out-of-sample data ###
print("Model 1 score:", clf.score(X_validate, y_validate))
print("Model 2 score:", clf_alt.score(X_validate, y_validate))

Model 1 score: 0.7850467289719626
Model 2 score: 0.7616822429906542


## Telco Data - Decision Tree
Repeat the same steps as above with this dataset.

In [16]:
### Pull in Telco dataset ###
telco = pd.read_csv('Cust_Churn_Telco.csv')
print(telco.columns)
print(telco.dtypes)
telco.head(3)

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')
customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object


| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |


In [17]:
### Save a copy before manipulations ###
telco_original = telco.copy()

In [18]:
### Convert Yes and No to 1 and 0, respectively ###
telco = telco_original.copy()
map1 = {"No":0, "Yes":1}
telco['Churn'] = telco.Churn.map(map1)

In [19]:
### Null Handling ###
# telco.isna().sum()
telco['TotalCharges'] = telco.TotalCharges.fillna(value=(telco.tenure * telco.MonthlyCharges))
telco.isna().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [20]:
### Split data with target = churn ###
train_validate, test = train_test_split(telco, test_size=.2, 
                                            random_state=123, 
                                            stratify=telco.Churn)
train, validate = train_test_split(train_validate, test_size=.3,
                                                   random_state=123,
                                                   stratify=train_validate.Churn)
train.shape, validate.shape, test.shape

((3943, 21), (1691, 21), (1409, 21))

In [21]:
### Baseline and Baseline Prediction ###
# train.Churn.value_counts()
(train.Churn == 0).mean()

0.7347197565305605

In [22]:
### Bivariate Exploration setup ###
cat_vars = ['Contract','gender','SeniorCitizen','Partner','Dependents','PhoneService','MultipleLines','InternetService',
           'OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies',
           'PaperlessBilling','PaymentMethod']
quant_vars = ['tenure','MonthlyCharges','TotalCharges']
target = 'Churn'

In [23]:
### Explore! ###
# explore.explore_bivariate(train, target, cat_vars, quant_vars)

Candidates (Requires further analysis):
- Partner: No = 35% churn, Yes = 18% churn, with roughly even populations
- Dependents: No = 32% churn, Yes = 15% churn, with a roughly 2:1 population split
- InternetService: DSL = 20% churn, Fiber = 42% churn, None = 8% churn; Fiber is nearly half of the total population
- OnlineBackup: No = 40% churn, other values 20% and 7%; No is nearly half of the total population
- DeviceProtection: nearly the same populations and proportions as OnlineBackup (possibly redundant, may not include)
- TechSupport: No = 41% churn, other values 18% and 7%; No is nearly half of the total population
- Contract: month-to-month = 42% churn, other values 10% and 3%; month-to-month is more than half of the total population
- PaperlessBilling: Yes = 32% churn, other value 18%; Yes is 2/5 of the total population
- PaymentMethod: electronic check = 42% churn, the other three methods about 18% each; electronic check is 1/3 of the total population

Not Candidates (and why):
- Gender: No significant difference in churn rate
- PhoneService: No significant difference in churn rate
- MultipleLines: Not a very significant difference in churn rate
- Senior Citizen: Small population compared to total

In [24]:
### In-Common Populations ###

# Goal: For the features whose high-churning value is nearly 50% of total population,
#       determine if any two have too many values in common to be added to the model.

# Features to investigate: Partner(Single), InternetService(Fiber), OnlineBackup(Nobackup), 
#                          DeviceProtection(Unprotected), Contract(Month-to-month)

fs = (train[train.InternetService == 'Fiber optic'].Partner == 'No').mean()
fn = (train[train.InternetService == 'Fiber optic'].OnlineBackup == 'No').mean()
fu = (train[train.InternetService == 'Fiber optic'].DeviceProtection == 'No').mean()
fm = (train[train.InternetService == 'Fiber optic'].Contract == 'Month-to-month').mean()

sn = (train[train.Partner == 'No'].OnlineBackup == 'No').mean()
su = (train[train.Partner == 'No'].DeviceProtection == 'No').mean()
sm = (train[train.Partner == 'No'].Contract == 'Month-to-month').mean()

nu = (train[train.OnlineBackup == 'No'].DeviceProtection == 'No').mean()
nm = (train[train.OnlineBackup == 'No'].Contract == 'Month-to-month').mean()

mu = (train[train.Contract == 'Month-to-month'].DeviceProtection == 'No').mean()

# Print percent-in-common for each two-feature combination
print(fs, fn, fu, fm, sn, su, sm, nu, nm, mu)

0.5282685512367491 0.5753828032979976 0.571849234393404 0.6955241460541813 0.5053501945525292 0.5214007782101168 0.6960116731517509 0.6493135011441648 0.744279176201373 0.6281818181818182


Populations:
- 50s% Share: Single-Nobackup (50.5%), Single-Unprotected (52.1%), Fiber-Single (52.8%), Fiber-Unprotected (57.2%), Fiber-Nobackup (57.5%)
- 60s% Share: Monthly-Unprotected (62.8%), Nobackup-Unprotected (64.9%), Fiber-Monthly (69.6%), Single-Monthly (69.6%)
- 70s% Share: Nobackup-Monthly (74.4%)

Takeaways: 
- **Will not include Contract (Month-to-month) in the model because it is redundant with other features.**
- **Will not include DeviceProtection (Unprotected) in the model because it shares a large proportion of its population with OnlineBackup (Nobackup).**

Model Target and Features:
- Target: Churn
- Features: Partner, InternetService, OnlineBackup
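The pairwise overlap figures above all follow one pattern: filter to the rows with one feature value, then take the share of those rows that also have the other value. A minimal sketch of that pattern as a reusable helper. The function name `share_in_common` and the toy rows are hypothetical and are not part of the notebook.

```python
import pandas as pd

def share_in_common(df, given, other):
    """Share of rows matching `given` that also match `other`.

    Each argument is a (column, value) pair.
    """
    given_col, given_val = given
    other_col, other_val = other
    subset = df[df[given_col] == given_val]
    return (subset[other_col] == other_val).mean()

# Toy rows for illustration only
toy = pd.DataFrame({
    "InternetService": ["Fiber optic", "Fiber optic", "DSL", "Fiber optic"],
    "Partner": ["No", "Yes", "No", "No"],
})

fiber_single = share_in_common(toy, ("InternetService", "Fiber optic"), ("Partner", "No"))
print(fiber_single)
```

The measure is conditional on the first pair, so it is not symmetric: the share of Fiber customers who are single can differ from the share of single customers who have Fiber.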

In [25]:
### Set features and target ###
# train = train[['Churn','Contract','InternetService','Partner','OnlineBackup']]
# validate = validate[['Churn','Contract','InternetService','Partner','OnlineBackup']]
# test = test[['Churn','Contract','InternetService','Partner','OnlineBackup']]

train = train[['Churn','InternetService','Partner','OnlineBackup']]
validate = validate[['Churn','InternetService','Partner','OnlineBackup']]
test = test[['Churn','InternetService','Partner','OnlineBackup']]

train.shape, validate.shape, test.shape

((3943, 4), (1691, 4), (1409, 4))

In [26]:
### Encode data subsets for model ###
# map0 = {'Month-to-month':0, 'One year':1, 'Two year':2}
map1 = {'DSL':0, 'Fiber optic':1, 'No':2}
map2 = {'No':0, 'Yes': 1}
map3 = {'No':0, 'Yes':1, 'No internet service':2}

for dataset in [train, validate, test]:
#    dataset['Contract'] = dataset.Contract.map(map0)
    dataset['InternetService'] = dataset.InternetService.map(map1)
    dataset['Partner'] = dataset.Partner.map(map2)
    dataset['OnlineBackup'] = dataset.OnlineBackup.map(map3)

train.head(3)

| | Churn | InternetService | Partner | OnlineBackup |
|---|---|---|---|---|
| 5262 | 0 | 0 | 0 | 0 |
| 3734 | 0 | 1 | 0 | 1 |
| 4340 | 0 | 0 | 0 | 0 |


In [27]:
### Split out target ###
X_train, y_train = train.drop(columns='Churn'), train.Churn
X_validate, y_validate = validate.drop(columns='Churn'), validate.Churn
X_test, y_test = test.drop(columns='Churn'), test.Churn

In [28]:
### Create Decision Tree Model ###
clf = DecisionTreeClassifier(max_depth=3, random_state=123)
clf = clf.fit(X_train, y_train)

In [29]:
### Show score, make predictions ###
print("Score:", clf.score(X_train, y_train))
y_pred = clf.predict(X_train)

Score: 0.7565305604869389


In [30]:
### Print confusion matrix and classification report ###
print("Confusion Matrix")
print(confusion_matrix(y_train, y_pred), "\n")

print("Classification Report")
report = pd.DataFrame(classification_report(y_train, y_pred, output_dict=True))
print(report)

Confusion Matrix
[[2649  248]
 [ 712  334]] 

Classification Report
                     0            1  accuracy    macro avg  weighted avg
precision     0.788158     0.573883  0.756531     0.681021      0.731315
recall        0.914394     0.319312  0.756531     0.616853      0.756531
f1-score      0.846596     0.410319  0.756531     0.628458      0.730861
support    2897.000000  1046.000000  0.756531  3943.000000   3943.000000


In [31]:
### Try a model with different max_depth ###
clf_alt = DecisionTreeClassifier(max_depth=1, random_state=123)
clf_alt = clf_alt.fit(X_train, y_train)

print("Score:", clf_alt.score(X_train, y_train), "\n")
y_pred_alt = clf_alt.predict(X_train)

print("Confusion Matrix")
print(confusion_matrix(y_train, y_pred_alt), "\n")

print("Classification Report")
report = pd.DataFrame(classification_report(y_train, y_pred_alt, output_dict=True))
print(report)

Score: 0.7347197565305605 

Confusion Matrix
[[2897    0]
 [1046    0]] 

Classification Report
                     0       1  accuracy    macro avg  weighted avg
precision     0.734720     0.0   0.73472     0.367360      0.539813
recall        1.000000     0.0   0.73472     0.500000      0.734720
f1-score      0.847076     0.0   0.73472     0.423538      0.622363
support    2897.000000  1046.0   0.73472  3943.000000   3943.000000
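The max_depth=1 tree predicts class 0 for every row, so its score equals the baseline and its precision and recall for class 1 are zero. A short sketch reproduces that score with scikit-learn's `DummyClassifier`, using only the class counts from the support row above. The single all-zero feature column is a placeholder, since a most-frequent dummy ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts taken from the support row: 2897 non-churn, 1046 churn
y = np.array([0] * 2897 + [1] * 1046)
X = np.zeros((len(y), 1))  # placeholder feature column, ignored by the dummy

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
dummy_score = dummy.score(X, y)

print(round(dummy_score, 6))
```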


In [32]:
### Run models against out-of-sample data ###
print("Model 1 in-sample score:", round(clf.score(X_train, y_train),4))
print("Model 1 out-of-sample score:", round(clf.score(X_validate, y_validate),4))
print("Model 1 test score:", round(clf.score(X_test, y_test),4), "\n")

print("Model 2 in-sample score:", round(clf_alt.score(X_train, y_train),4))
print("Model 2 out-of-sample score:", round(clf_alt.score(X_validate, y_validate),4))
print("Model 2 test score:", round(clf_alt.score(X_test, y_test),4))

Model 1 in-sample score: 0.7565
Model 1 out-of-sample score: 0.7504
Model 1 test score: 0.7473 

Model 2 in-sample score: 0.7347
Model 2 out-of-sample score: 0.7345
Model 2 test score: 0.7346


### Model 1 Performance

**Includes Contract**
- Model 1 in-sample score: 0.7695
- Model 1 out-of-sample score: 0.7564
- Model 1 test score: 0.7502 

**Doesn't Include Contract**
- Model 1 in-sample score: 0.7565
- Model 1 out-of-sample score: 0.7504
- Model 1 test score: 0.7473 



### Model 2 Performance

**Includes Contract**
- Model 2 in-sample score: 0.7347
- Model 2 out-of-sample score: 0.7345
- Model 2 test score: 0.7346

**Doesn't Include Contract**
- Model 2 in-sample score: 0.7347
- Model 2 out-of-sample score: 0.7345
- Model 2 test score: 0.7346

## Titanic Data - Random Forest
- Fit the Random Forest classifier to your training sample using min_samples_leaf = 1, max_depth=10, and random_state.
- Make predictions.
- Evaluate results using model score, confusion matrix, and classification report.
- Create new model(s) with increased min_samples_leaf and decreased max_depth.
- What are the differences in the evaluation metrics? 
- Which performs better on your in-sample data? Why?
- After making a few models, which one has the best performance (or closest metrics) on both train and validate?

In [33]:
### Checking dataframe ###
df.head(3)

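| | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | | Southampton | 0 |
| 1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | C | Cherbourg | 0 |
| 2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | | Southampton | 1 |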


In [34]:
### Split data ###
train_validate, test = train_test_split(df, test_size=.2, 
                                            random_state=123, 
                                            stratify=df.survived)
train, validate = train_test_split(train_validate, test_size=.3,
                                                   random_state=123,
                                                   stratify=train_validate.survived)
train.shape, validate.shape, test.shape

((498, 13), (214, 13), (179, 13))

In [35]:
### Set features, prepare for model ###
train, validate, test = train_original.copy(), validate_original.copy(), test_original.copy()

train = train[['survived','sex','pclass','alone','fare']]
validate = validate[['survived','sex','pclass','alone','fare']]
test = test[['survived','sex','pclass','alone','fare']]

map1 = {'male':0, 'female':1}
train['sex'] = train.sex.map(map1)
validate['sex'] = validate.sex.map(map1)
test['sex'] = test.sex.map(map1)

train.head(3)

| | survived | sex | pclass | alone | fare |
|---|---|---|---|---|---|
| 583 | 0 | 0 | 1 | 1 | 40.125 |
| 165 | 1 | 0 | 3 | 0 | 20.525 |
| 50 | 0 | 0 | 3 | 0 | 39.6875 |


In [36]:
### Set target ###
X_train, y_train = train.drop(columns='survived'), train.survived
X_validate, y_validate = validate.drop(columns='survived'), validate.survived
X_test, y_test = test.drop(columns='survived'), test.survived

In [37]:
### Create Random Forest Model ###
rf = RandomForestClassifier(min_samples_leaf=1, max_depth=10, random_state=123)
rf = rf.fit(X_train, y_train)

In [38]:
### Show score, make predictions ###
print("Score:", rf.score(X_train, y_train))
y_pred = rf.predict(X_train)

Score: 0.9337349397590361


In [39]:
### Print confusion matrix and classification report ###
print("Confusion Matrix")
print(confusion_matrix(y_train, y_pred), "\n")

print("Classification Report")
report = pd.DataFrame(classification_report(y_train, y_pred, output_dict=True))
print(report)

Confusion Matrix
[[301   6]
 [ 27 164]] 

Classification Report
                    0           1  accuracy   macro avg  weighted avg
precision    0.917683    0.964706  0.933735    0.941194      0.935718
recall       0.980456    0.858639  0.933735    0.919547      0.933735
f1-score     0.948031    0.908587  0.933735    0.928309      0.932903
support    307.000000  191.000000  0.933735  498.000000    498.000000


In [40]:
### Increase min_samples_leaf, decrease max_depth in new model ###
rf_alt = RandomForestClassifier(min_samples_leaf=5, max_depth=5)
rf_alt = rf_alt.fit(X_train, y_train)

In [41]:
### Show score, make predictions ###
print("Score:", rf_alt.score(X_train, y_train))
y_pred_alt = rf_alt.predict(X_train)

Score: 0.8514056224899599


In [42]:
### Print confusion matrix and classification report ###
print("Confusion Matrix")
print(confusion_matrix(y_train, y_pred_alt), "\n")

print("Classification Report")
report = pd.DataFrame(classification_report(y_train, y_pred_alt, output_dict=True))
print(report)

Confusion Matrix
[[291  16]
 [ 58 133]] 

Classification Report
                    0           1  accuracy   macro avg  weighted avg
precision    0.833811    0.892617  0.851406    0.863214      0.856365
recall       0.947883    0.696335  0.851406    0.822109      0.851406
f1-score     0.887195    0.782353  0.851406    0.834774      0.846985
support    307.000000  191.000000  0.851406  498.000000    498.000000


Raising min_samples_leaf and lowering max_depth reduce in-sample accuracy. A lower max_depth makes every tree in the forest shallower, and a higher min_samples_leaf forces every leaf to cover more samples, so the trees cannot fit the training data as closely.

In [43]:
### Run models against out-of-sample data ###
print("Model 1 in-sample score:", rf.score(X_train, y_train))
print("Model 1 out-of-sample score:", rf.score(X_validate, y_validate), "\n")

print("Model 2 in-sample score:", rf_alt.score(X_train, y_train))
print("Model 2 out-of-sample score:", rf_alt.score(X_validate, y_validate))

Model 1 in-sample score: 0.9337349397590361
Model 1 out-of-sample score: 0.794392523364486 

Model 2 in-sample score: 0.8514056224899599
Model 2 out-of-sample score: 0.780373831775701


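Although the Random Forest with min_samples_leaf=1 and max_depth=10 performed better on the in-sample data, it was much more overfit than the shallower, more restricted Random Forest with min_samples_leaf=5 and max_depth=5: its accuracy dropped about 14 percentage points from train to validate, versus about 7 points for the second model. Its validate score was still slightly higher (0.794 vs 0.780).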
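The degree of overfitting can be put into numbers by subtracting each model's validate score from its train score. A small sketch using the four scores printed above (the dictionary labels are only descriptive):

```python
# (train score, validate score) copied from the output above
scores = {
    "rf (min_samples_leaf=1, max_depth=10)": (0.9337349397590361, 0.794392523364486),
    "rf_alt (min_samples_leaf=5, max_depth=5)": (0.8514056224899599, 0.780373831775701),
}

# Train-minus-validate gap: a larger gap suggests more overfitting
gaps = {name: train - validate for name, (train, validate) in scores.items()}

for name, gap in gaps.items():
    print(name, round(gap, 4))
```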

## Titanic Data - K-Nearest Neighbors

- Fit a K-Nearest Neighbors classifier to your training sample.
- Make predictions.
- Evaluate your results using the model score, confusion matrix, and classification report.
- Make new model with k=10.
- Make another new model with k=20.
- What are the differences in the evaluation metrics?
- Which performs better on your in-sample data? Why?
- Which model performs best on our out-of-sample data from validate?

In [44]:
### Fit KNN to training data ###
knn = KNeighborsClassifier()
knn = knn.fit(X_train, y_train)

In [45]:
### Make predictions ###
y_pred = knn.predict(X_train)

In [46]:
### Score, confusion matrix, classification report ###
print("Score:", knn.score(X_train, y_train))
print("Confusion Matrix\n", confusion_matrix(y_train, y_pred))
print(pd.DataFrame(classification_report(y_train, y_pred, output_dict=True)))

Score: 0.8112449799196787
Confusion Matrix
 [[262  45]
 [ 49 142]]
                    0           1  accuracy   macro avg  weighted avg
precision    0.842444    0.759358  0.811245    0.800901      0.810578
recall       0.853420    0.743455  0.811245    0.798438      0.811245
f1-score     0.847896    0.751323  0.811245    0.799610      0.810857
support    307.000000  191.000000  0.811245  498.000000    498.000000


In [47]:
### Make new model with k=10, make predictions, print metrics ###
knn_10 = KNeighborsClassifier(n_neighbors=10)
knn_10 = knn_10.fit(X_train, y_train)
y_pred10 = knn_10.predict(X_train)

print("Score:", knn_10.score(X_train, y_train))
print("Confusion Matrix\n", confusion_matrix(y_train, y_pred10))
print(pd.DataFrame(classification_report(y_train, y_pred10, output_dict=True)))

Score: 0.785140562248996
Confusion Matrix
 [[262  45]
 [ 62 129]]
                    0           1  accuracy   macro avg  weighted avg
precision    0.808642    0.741379  0.785141    0.775011      0.782844
recall       0.853420    0.675393  0.785141    0.764406      0.785141
f1-score     0.830428    0.706849  0.785141    0.768639      0.783031
support    307.000000  191.000000  0.785141  498.000000    498.000000


In [48]:
### Make new model with k=20, make predictions, print metrics ###
knn_20 = KNeighborsClassifier(n_neighbors=20)
knn_20 = knn_20.fit(X_train, y_train)
y_pred20 = knn_20.predict(X_train)

print("Score:", knn_20.score(X_train, y_train))
print("Confusion Matrix\n", confusion_matrix(y_train, y_pred20))
print(pd.DataFrame(classification_report(y_train, y_pred20, output_dict=True)))

Score: 0.7469879518072289
Confusion Matrix
 [[257  50]
 [ 76 115]]
                    0           1  accuracy   macro avg  weighted avg
precision    0.771772    0.696970  0.746988    0.734371      0.743083
recall       0.837134    0.602094  0.746988    0.719614      0.746988
f1-score     0.803125    0.646067  0.746988    0.724596      0.742888
support    307.000000  191.000000  0.746988  498.000000    498.000000


In [49]:
### Run models against out-of-sample data ###
print("K=5 in-sample score:", knn.score(X_train, y_train))
print("K=5 out-of-sample score:", knn.score(X_validate, y_validate), "\n")

print("K=10 in-sample score:", knn_10.score(X_train, y_train))
print("K=10 out-of-sample score:", knn_10.score(X_validate, y_validate), "\n")

print("K=20 in-sample score:", knn_20.score(X_train, y_train))
print("K=20 out-of-sample score:", knn_20.score(X_validate, y_validate))

K=5 in-sample score: 0.8112449799196787
K=5 out-of-sample score: 0.7663551401869159 

K=10 in-sample score: 0.785140562248996
K=10 out-of-sample score: 0.7383177570093458 

K=20 in-sample score: 0.7469879518072289
K=20 out-of-sample score: 0.6915887850467289


As n_neighbors increased, both the in-sample and out-of-sample scores decreased. Additionally, each model scored roughly 5 percentage points higher on in-sample data than on out-of-sample data (0.811 vs 0.766, 0.785 vs 0.738, 0.747 vs 0.692), which suggests the training features may need revisiting (features may need to be changed or added to improve the fit).
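Part of why smaller k scores higher in-sample: when a KNN model is scored on its own training rows, each row counts among its own nearest neighbors, so a small k leans heavily on the row's own label. A minimal sketch of the extreme case k=1, on synthetic data from `make_classification` (an assumption standing in for the Titanic features, which may contain duplicate rows that behave differently):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with distinct continuous feature values
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=123)

train_scores = {}
for k in (1, 5, 10, 20):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # Scoring on the same rows the model was fit on (in-sample)
    train_scores[k] = knn.score(X, y)

print(train_scores)
```

With k=1 every training row is its own nearest neighbor, so the in-sample score is perfect here, which says nothing about how the model performs on validate.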

## Titanic Data - Logistic Regression
1. Calculate baseline for Titanic data, set threshold for future model accuracy.
2. Create a model for age, fare, and pclass, then compare to the baseline.
3. Create a new model that also includes sex.
4. Create multiple models with varying features, keep the best 3 for train and validate data.
5. Use the best model of the three for the test data, compare performance to train/validate.
6. BONUS1: How do different strategies for handling the missing values in the age column affect model performance?
7. BONUS2: How do different strategies for encoding sex affect model performance?
8. BONUS3:
- scikit-learn's LogisticRegression classifier is actually applying a regularization penalty to the coefficients by default. This penalty causes the magnitude of the coefficients in the resulting model to be smaller than they otherwise would be. This value can be modified with the C hyper parameter. Small values of C correspond to a larger penalty, and large values of C correspond to a smaller penalty.
- Try out the following values for C and note how the coefficients and the model's performance on both the dataset it was trained on and on the validate split are affected.
<code>C=.01,.1,1,10,100,1000</code>
9. BONUS4: How does scaling the data interact with your choice of C?
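The effect of C described in BONUS3 can be seen by fitting one model per value and measuring the size of the coefficient vector. The sketch below uses synthetic data from `make_classification` rather than the Titanic features, and `max_iter=1000` is an added assumption to give the solver room to converge at large C.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the Titanic dataset)
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, random_state=123)

coef_norms = {}
for C in [0.01, 0.1, 1, 10, 100, 1000]:
    logit = LogisticRegression(C=C, max_iter=1000, random_state=123).fit(X, y)
    # L2 norm of the coefficient vector: smaller C means a stronger penalty
    coef_norms[C] = float(np.linalg.norm(logit.coef_))

for C, norm in coef_norms.items():
    print(C, round(norm, 4))
```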

In [50]:
### Calculate baseline ###
df = df_original.copy()
baseline = df.groupby('survived').survived.count().idxmax()  # most common class
print('Baseline Performance:', round((df.survived == baseline).mean(),4))

Baseline Performance: 0.6162


In [51]:
### Create model for age, fare, and pclass, then compare to the baseline ###
# Handle nulls
df['age'] = df.age.fillna(value=int(df.age.mean()))
most_freq_embarked = df.groupby('embark_town').embark_town.count().idxmax()
df['embark_town'] = df.embark_town.fillna(value=most_freq_embarked)
df = df.drop(columns=['deck','embarked','class','passenger_id'])

# Encode values
map0 = {'male':0, 'female':1}
map1 = {'Cherbourg':0, 'Southampton':1, 'Queenstown':2}
df['sex'] = df.sex.map(map0)
df['embark_town'] = df.embark_town.map(map1)

# Split
train_validate, test = train_test_split(df, test_size=.2, 
                                            random_state=123, 
                                            stratify=df.survived)
train, validate = train_test_split(train_validate, test_size=.3,
                                                   random_state=123,
                                                   stratify=train_validate.survived)

# Limit columns
train_1 = train[['survived','age','fare','pclass']]
validate_1 = validate[['survived','age','fare','pclass']]
test_1 = test[['survived','age','fare','pclass']]

# Isolate target
X_train, y_train = train_1.drop(columns='survived'), train_1.survived
X_validate, y_validate = validate_1.drop(columns='survived'), validate_1.survived
X_test, y_test = test_1.drop(columns='survived'), test_1.survived

# Create, fit model
logit = LogisticRegression(random_state=123)
logit = logit.fit(X_train, y_train)

# Score model, compare baseline
print('Baseline Performance:', round((df.survived == 0).mean(),4))
print('Model Performance:', round(logit.score(X_train, y_train),4))

Baseline Performance: 0.6162
Model Performance: 0.7028


In [52]:
### Create new model including sex ###
# Set columns of already-split data
train_2 = train[['survived','age','fare','pclass','sex']]
validate_2 = validate[['survived','age','fare','pclass','sex']]
test_2 = test[['survived','age','fare','pclass','sex']]

# Isolate target
X_train, y_train = train_2.drop(columns='survived'), train_2.survived
X_validate, y_validate = validate_2.drop(columns='survived'), validate_2.survived
X_test, y_test = test_2.drop(columns='survived'), test_2.survived

# Create, fit model
logit = LogisticRegression(random_state=123)
logit = logit.fit(X_train, y_train)

# Score model, compare baseline
print('Baseline Performance:', round((df.survived == 0).mean(),4))
print('Model Performance:', round(logit.score(X_train, y_train),4))

Baseline Performance: 0.6162
Model Performance: 0.8153


In [53]:
### Create multiple models with various features, keep best 3 ###
# Use full df, isolate target
X_train, y_train = train.drop(columns='survived'), train.survived
X_validate, y_validate = validate.drop(columns='survived'), validate.survived
X_test, y_test = test.drop(columns='survived'), test.survived

# Create Random Forest model, fit, and print feature importances
rf = RandomForestClassifier(max_depth=10, random_state=123)
rf = rf.fit(X_train, y_train)
features = pd.DataFrame(index=X_train.columns)
features['importances'] = rf.feature_importances_
# print(features.T)

# Use top 5 features based on importances
X_train = X_train[['sex','fare','age','pclass','sibsp']]
X_validate = X_validate[['sex','fare','age','pclass','sibsp']]
X_test = X_test[['sex','fare','age','pclass','sibsp']]

# Create and score multiple Logistic Regression models
metrics = []
models = []
for i in np.arange(0.1, 1, 0.1):
    C_value = i
    logit = LogisticRegression(C=C_value, random_state=123)
    logit = logit.fit(X_train, y_train)
    in_sample_score = logit.score(X_train, y_train)
    out_of_sample_score = logit.score(X_validate, y_validate)
    output = {
        "C_value": i,
        "train_score": in_sample_score,
        "validate_score": out_of_sample_score
    } 
    metrics.append(output)
    models.append(logit)

# Print scores
metrics = pd.DataFrame(metrics)
# print(metrics)

# Using C=0.2, C=0.3, C=0.4 as top-3
logit1, logit2, logit3 = models[1], models[2], models[3]
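Picking the top three models can also be done by sorting the metrics table instead of reading it by eye. A minimal sketch, assuming a metrics table with the same column names as above. The scores here are made up for illustration and are not the notebook's results, and breaking ties on `train_score` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical scores with the same columns as the metrics table above
metrics = pd.DataFrame([
    {"C_value": 0.1, "train_score": 0.79, "validate_score": 0.74},
    {"C_value": 0.2, "train_score": 0.80, "validate_score": 0.76},
    {"C_value": 0.3, "train_score": 0.80, "validate_score": 0.77},
    {"C_value": 0.4, "train_score": 0.81, "validate_score": 0.76},
    {"C_value": 0.5, "train_score": 0.81, "validate_score": 0.75},
])

# Rank by validate score first, then by train score to break ties
top3 = metrics.sort_values(["validate_score", "train_score"], ascending=False).head(3)

print(top3)
```

The index labels of `top3` match the positions in the `models` list, so `top3.index` can be used to look up the fitted models.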

In [54]:
### Use best model for train/validate on test data ###
# Top performing model is C=0.3
print('Train score:', round(logit2.score(X_train, y_train),4))
print('Validate score:', round(logit2.score(X_validate, y_validate),4))
print('Test score:', round(logit2.score(X_test, y_test),4))

Train score: 0.8012
Validate score: 0.7617
Test score: 0.7877
