![example](images/pexels-pixabay-40568.png)

# Phase 3 Project

**Author:** Freddy Abrahamson<br>
**Date created:** 3-27-2022<br>
**Discipline:** Data Science

## Overview
For this project, I will use multiple linear regression modeling to analyze house sales in King County, in Washington state.

## Business Problem

The goal of this project is to provide advice to homeowners about how home renovations can increase the value of their homes, and by what amount. The data describes the characteristics of over 20,000 homes in King County, which is located in Washington State. I will use this information to gain a better understanding of how different remodels, or renovations to the homes listed, impact their price.

## Data Understanding

The data comes from the King County House Sales dataset, in the form of a CSV file. The file will be converted into a pandas dataframe. It contains information about the different characteristics of the homes in the King County area, including the number of bedrooms, building grades, square footage, and price. King County is located in Washington State, and has a size of approximately 2,300 square miles, per the U.S. Census Bureau.

kc_house_data.csv


I will briefly review the dataframe's characteristics, with a view toward using its columns as variables in a regression model. These include:

* dataframe shape: the number of rows and columns in the dataframe
* any missing/null values
* continuous variables
* categorical variables
* binary variables
* zero inflated variables
* outliers

Since the goal is to gain insight into how much a particular upgrade or remodel can impact the price of a house, as opposed to predicting home prices, I will place an emphasis on choosing features with the least explanatory overlap. To that end, for instance, I would favor a feature such as the number of bedrooms or bathrooms over square footage.

In [1]:
import pandas as pd
import numpy as np
import warnings
#warnings.filterwarnings('ignore')
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
import shap
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [2]:
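def best_score_metric(df1, df2):
    """Pick the grid-search candidate that best balances a high mean test
    score against a small train-test gap.

    df1 needs 'mean_test_score' and 'score_dif' columns; df2 needs 'params'.
    Each candidate is scored by the area under the three-point curve
    (0, 0), (score_dif, mean_test_score), (1, 1).
    Returns (mean_test_score, score_dif, auc, params, row index) of the winner.
    """
    from sklearn.metrics import auc
    
    rows = len(df1)
    # column 0 stays at 0 and column 2 is set to 1, so each row is [0, value, 1]
    test_scores = np.zeros((rows, 3))
    score_diffs = np.zeros((rows, 3))
    auc_scores = []
    
    for row in range(rows):
        test_scores[row][1] = df1['mean_test_score'][row]
        test_scores[row][2] = 1
        score_diffs[row][1] = df1['score_dif'][row]
        score_diffs[row][2] = 1
    
    # x values are the score differences, y values are the test scores
    for row in range(rows):
        auc_score = auc(score_diffs[row], test_scores[row])
        auc_scores.append(auc_score)
        
    # the candidate with the largest area wins (first one in case of ties)
    best_auc_score = max(auc_scores)
    best_score_index = auc_scores.index(best_auc_score)
    return (df1['mean_test_score'][best_score_index], df1['score_dif'][best_score_index], best_auc_score,
            df2['params'][best_score_index],best_score_index)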
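
To make the selection rule in `best_score_metric` easier to reason about, here is a minimal sketch in plain Python. It rebuilds the same three-point curve and applies the trapezoidal rule, which is what `sklearn.metrics.auc` uses. The helper names and the two candidate scores are hypothetical, chosen only for illustration. For these three points the area simplifies to `(1 + mean_test_score - score_dif) / 2`, so ranking by this area is equivalent to ranking by `mean_test_score - score_dif`.

```python
def trapezoid_area(xs, ys):
    """Area under the piecewise-linear curve through the points (xs[i], ys[i])."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(xs) - 1))


def three_point_auc(mean_test_score, score_dif):
    # Same three points the notebook function builds:
    # (0, 0), (score_dif, mean_test_score), (1, 1)
    return trapezoid_area([0.0, score_dif, 1.0], [0.0, mean_test_score, 1.0])


# Hypothetical (mean_test_score, score_dif) pairs, not results from this notebook
candidates = {"A": (0.84, 0.10), "B": (0.83, 0.01)}

areas = {name: three_point_auc(s, d) for name, (s, d) in candidates.items()}
closed_form = {name: (1 + s - d) / 2 for name, (s, d) in candidates.items()}
best = max(areas, key=areas.get)

print(areas)
print(best)
```

In this made-up case candidate B wins even though A has the higher test score, because B's train-test gap is much smaller.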


In [3]:
#importing dataset
df = pd.read_csv('H1N1_Flu_Vaccines.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 38 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo

In [5]:
df.head()

Unnamed: 0,respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,...,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation,h1n1_vaccine,seasonal_vaccine
0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,Own,Not in Labor Force,oxchjgsf,Non-MSA,0.0,0.0,,,0,0
1,1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,Rent,Employed,bhuqouqj,"MSA, Not Principle City",0.0,0.0,pxcmvdjn,xgwztkwe,0,1
2,2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,Own,Employed,qufhixun,"MSA, Not Principle City",2.0,0.0,rucpziij,xtkaffoo,0,0
3,3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,Rent,Not in Labor Force,lrircsnp,"MSA, Principle City",0.0,0.0,,,0,1
4,4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,...,Own,Employed,qufhixun,"MSA, Not Principle City",1.0,0.0,wxleyezf,emcorrxb,0,0


In [6]:
print("Raw Counts")
print(df["h1n1_vaccine"].value_counts())
print()
print("Percentages")
print(df["h1n1_vaccine"].value_counts(normalize=True))

Raw Counts
0    21033
1     5674
Name: h1n1_vaccine, dtype: int64

Percentages
0    0.787546
1    0.212454
Name: h1n1_vaccine, dtype: float64


<b>A baseline model that always chose the majority class would have an accuracy of over 78%.</b>
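
As a quick check of that baseline figure, the sketch below recomputes the majority-class accuracy from the raw counts printed above. The variable names are illustrative only.

```python
# Raw class counts copied from the value_counts() output above
counts = {0: 21033, 1: 5674}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# A model that always predicts the majority class is right
# exactly as often as that class occurs.
baseline_accuracy = counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
```

Any classifier below should be judged against this number rather than against 50%.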

# Preprocessing the Data:

### Dropping Features, Train-test-split, and Dealing with Missing Values: 

In [7]:
# I will drop:
# 'respondent_id' - since it is a unique identifier
# 'employment_industry', 'employment_occupation', 'health_insurance' - about 50% or more records missing
# Note: 'seasonal_vaccine' is not modeled as a target here, but it is not dropped either;
# it stays in the data as a feature
df_II = df.drop(['respondent_id','employment_industry','employment_occupation','health_insurance'], axis=1)


In [8]:
# Split df into X and y
X = df_II.drop("h1n1_vaccine", axis=1)
y = df_II["h1n1_vaccine"]

# Perform train-test split with random_state=42 and stratify=y
# stratify y to maintain uniform ratios of dependent variable y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [9]:
#impute values based on most common value in each column:
X_train = X_train.apply(lambda x:x.fillna(x.value_counts().index[0]))
X_test = X_test.apply(lambda x:x.fillna(x.value_counts().index[0]))

<b>There is now no missing data in the training dataset.</b>
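
The cell above fills each frame from its own value counts, so the test set is imputed with test-set modes. A common alternative is to learn the most frequent values on the training data only and reuse them for the test data. The sketch below shows that variant on two small made-up frames; the frames and their values are assumptions for illustration, not rows from the survey.

```python
import pandas as pd

# Toy stand-ins for X_train / X_test (values are made up)
train = pd.DataFrame({
    "h1n1_concern": [1.0, 2.0, 2.0, None],
    "rent_or_own": ["Own", "Own", None, "Rent"],
})
test = pd.DataFrame({
    "h1n1_concern": [None, 3.0, 3.0],
    "rent_or_own": [None, "Rent", "Rent"],
})

# Most frequent value per column, learned from the training data only
train_modes = train.mode().iloc[0]

# Apply the same fill values to both frames
train_filled = train.fillna(train_modes)
test_filled = test.fillna(train_modes)

print(test_filled)
```

Here the missing test values become `2.0` and `"Own"` (the training modes), not `3.0` and `"Rent"` (the test modes). scikit-learn's `SimpleImputer(strategy="most_frequent")` follows the same fit-on-train, transform-both pattern.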

### Pre-processing training data:

In [10]:
# splitting the dataframe into ordinal columns, columns to one-hot encode,
# and the two household count columns, which are passed through unchanged
X_train_ord =  X_train.iloc[:,np.r_[0:2,14:22,24]]
X_train_nom = X_train.iloc[:,30:32]
ord_cols = X_train_ord.columns
nom_cols = X_train_nom.columns
cols_to_drop = ord_cols.append(nom_cols)
X_train_cat = X_train.drop(cols_to_drop, axis=1)
X_train_ord_index = X_train_ord.index
X_train_cat_index = X_train_cat.index

# I will convert all the columns in the dataset to string type, so I can then encode them:
X_train_ord = X_train_ord.astype(str)
X_train_cat = X_train_cat.astype(str)

# creating the encoder objects:
enc = OrdinalEncoder()
# sparse_output requires scikit-learn >= 1.2 (older releases call this argument `sparse`)
ohe = OneHotEncoder(categories="auto", handle_unknown="ignore", sparse_output=False)

# fitting the encoders to the training data and transforming it:
X_train_enc = enc.fit_transform(X_train_ord)
X_train_ohe = ohe.fit_transform(X_train_cat)

# creating an array with enc and ohe column names:
enc_col_names = X_train_ord.columns
ohe_col_names = ohe.get_feature_names_out(X_train_cat.columns)

# Setting arrays back to dataframes
X_train_enc_df = pd.DataFrame(X_train_enc, columns=enc_col_names,index=X_train_ord_index)
X_train_ohe_df = pd.DataFrame(X_train_ohe, columns=ohe_col_names,index=X_train_cat_index)

# putting the dataframe back together:
X_train_II_encoded = pd.concat([X_train_enc_df,X_train_ohe_df,X_train_nom],axis=1)
X_train_II_encoded.head()

Unnamed: 0,h1n1_concern,h1n1_knowledge,opinion_h1n1_vacc_effective,opinion_h1n1_risk,opinion_h1n1_sick_from_vacc,opinion_seas_vacc_effective,opinion_seas_risk,opinion_seas_sick_from_vacc,age_group,education,...,hhs_geo_region_mlyzmhmf,hhs_geo_region_oxchjgsf,hhs_geo_region_qufhixun,"census_msa_MSA, Not Principle City","census_msa_MSA, Principle City",census_msa_Non-MSA,seasonal_vaccine_0,seasonal_vaccine_1,household_adults,household_children
11075,2.0,2.0,3.0,1.0,1.0,3.0,1.0,3.0,1.0,3.0,...,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,2.0
7807,3.0,2.0,3.0,1.0,4.0,3.0,1.0,3.0,1.0,2.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
3014,2.0,1.0,4.0,0.0,0.0,4.0,0.0,0.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
1671,2.0,1.0,2.0,4.0,3.0,3.0,4.0,3.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
16691,3.0,2.0,4.0,1.0,3.0,4.0,1.0,3.0,4.0,2.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
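
One detail of the ordinal encoding step is worth illustrating. With default settings, `OrdinalEncoder` assigns codes in sorted order of the category values, and for strings that means lexicographic order, which may not match the natural order of the levels. The sketch below uses an education-style column; only `'< 12 Years'` appears in this notebook, and the other three labels are assumed for illustration. Passing `categories` explicitly is one way to pin the intended order.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Education-style labels; all but '< 12 Years' are assumed for illustration
natural_order = ["< 12 Years", "12 Years", "Some College", "College Graduate"]
education = np.array([[level] for level in natural_order], dtype=object)

# Default behaviour: categories are sorted as strings
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(education).ravel()

# Explicit behaviour: codes follow the order given in `categories`
explicit_enc = OrdinalEncoder(categories=[natural_order])
explicit_codes = explicit_enc.fit_transform(education).ravel()

print("sorted categories:", list(default_enc.categories_[0]))
print("default codes:    ", default_codes)
print("explicit codes:   ", explicit_codes)
```

Tree-based models are fairly tolerant of a scrambled order, but distance-based models such as KNN treat the codes as magnitudes, so the ordering can matter there.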


### Pre-processing test data:

In [11]:
# splitting the dataframe into ordinal columns, columns to one-hot encode,
# and the two household count columns, which are passed through unchanged
X_test_ord =  X_test.iloc[:,np.r_[0:2,14:22,24]]
X_test_nom = X_test.iloc[:,30:32]
ord_cols = X_test_ord.columns
nom_cols = X_test_nom.columns
cols_to_drop = ord_cols.append(nom_cols)
X_test_cat = X_test.drop(cols_to_drop, axis=1)

# create index arrays to use when I recreate the dataframe
X_test_ord_index = X_test_ord.index
X_test_cat_index = X_test_cat.index

# I will convert all the columns in the dataset to string type, so I can then encode them:
X_test_ord = X_test_ord.astype(str)
X_test_cat = X_test_cat.astype(str)

# Setting arrays back to dataframes
X_test_enc_df = pd.DataFrame(enc.transform(X_test_ord), columns=enc_col_names,index=X_test_ord_index)
X_test_ohe_df = pd.DataFrame(ohe.transform(X_test_cat), columns=ohe_col_names,index=X_test_cat_index)

# putting the dataframe back together:
X_test_II_encoded = pd.concat([X_test_enc_df,X_test_ohe_df,X_test_nom],axis=1)
X_test_II_encoded.head()

Unnamed: 0,h1n1_concern,h1n1_knowledge,opinion_h1n1_vacc_effective,opinion_h1n1_risk,opinion_h1n1_sick_from_vacc,opinion_seas_vacc_effective,opinion_seas_risk,opinion_seas_sick_from_vacc,age_group,education,...,hhs_geo_region_mlyzmhmf,hhs_geo_region_oxchjgsf,hhs_geo_region_qufhixun,"census_msa_MSA, Not Principle City","census_msa_MSA, Principle City",census_msa_Non-MSA,seasonal_vaccine_0,seasonal_vaccine_1,household_adults,household_children
12369,3.0,1.0,3.0,0.0,3.0,3.0,0.0,1.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
17593,3.0,2.0,4.0,1.0,0.0,4.0,1.0,0.0,1.0,2.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,2.0
2698,3.0,1.0,3.0,0.0,1.0,4.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,2.0
13754,2.0,2.0,3.0,1.0,1.0,3.0,1.0,1.0,3.0,2.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
7106,2.0,1.0,2.0,3.0,1.0,0.0,3.0,1.0,0.0,3.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,3.0,3.0


# Classification models:

## K Nearest Neighbors (KNN)

### K Nearest Neighbors Baseline Model:

In [None]:
# Creating K nearest neighbor classifier object 
knn = KNeighborsClassifier(n_jobs = -1)

# using 2-split cross-validation to score the classification:
knn_cv_score = cross_val_score(knn, X_train_II_encoded, y_train, cv=2)

# return the mean of the 2 accuracy scores:
mean_knn_cv_score = np.mean(knn_cv_score)
print(f"Mean Cross Validation Score: {mean_knn_cv_score :.2%}")

### Using GridSearchCV to create additional KNN models:

In [None]:
# Define the parameter grid:

knn_param_grid = {
    'n_neighbors': [5,12,20],
    'metric'     : ['minkowski'],
    'p'          : [1,2,3,4]
}
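
For readers unfamiliar with the `p` parameter in the grid above: with `metric='minkowski'`, `p=1` gives the Manhattan distance and `p=2` the Euclidean distance, while larger values put more weight on the largest coordinate difference. The sketch below computes the distance by hand for two made-up points; the function name and the points are illustrative only.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


# Two made-up points, chosen so that the p=1 and p=2 results are easy to check
a, b = (0.0, 0.0), (3.0, 4.0)

distances = {p: minkowski_distance(a, b, p) for p in (1, 2, 3, 4)}

for p, d in distances.items():
    print(f"p={p}: {d:.4f}")
```

For these points the distance is 7 at `p=1` and 5 at `p=2`, and it keeps shrinking toward the largest single coordinate difference (4) as `p` grows.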

In [None]:
# Instantiate GridSearchCV object:
knn_grid_search = GridSearchCV(knn, knn_param_grid, cv=2, return_train_score=True, n_jobs = -1)

# Fit to the data
knn_grid_search.fit(X_train_II_encoded, y_train)

In [None]:
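# Mean training score, averaged over all parameter combinations
knn_gs_training_score = np.mean(knn_grid_search.cv_results_['mean_train_score'])

# Mean cross-validation score (validation folds), averaged over all parameter combinations
knn_gs_testing_score = np.mean(knn_grid_search.cv_results_['mean_test_score'])

# Accuracy of the refit best estimator on the holdout test set
knn_holdout_score = knn_grid_search.score(X_test_II_encoded, y_test)

print(f"Mean Training Score: {knn_gs_training_score :.2%}")
print(f"Mean Cross-Validation Score: {knn_gs_testing_score :.2%}")
print(f"Holdout Test Score (best estimator): {knn_holdout_score :.2%}")
print("Best Parameter Combination (highest mean cross-validation score):")
knn_grid_search.best_params_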

In [None]:
# Creates a dataframe from knn_grid_search.cv_results_ dictionary:
knn_cv_grid_df = pd.DataFrame(knn_grid_search.cv_results_)

# adding new column 
knn_cv_grid_df['score_dif'] = abs(knn_cv_grid_df['mean_train_score'] - knn_cv_grid_df['mean_test_score'])

# creates new dataframe with only 'train' and 'test' scores
knn_scores = knn_cv_grid_df.loc[:,['mean_train_score','mean_test_score','score_dif']]
knn_scores.describe()

In [None]:
# compute the result once and reuse it
knn_best = best_score_metric(knn_scores, knn_cv_grid_df)
print('best score: ', knn_best[0])
print('best score difference: ', knn_best[1])
print('best train-test combination score(auc): ', knn_best[2])
print('best dataframe row: ', knn_best[4])
print('best parameters: ', knn_best[3])

In [None]:

X_II_sample = shap.sample(X_train_II_encoded, nsamples=100, random_state=42)
X_II_sample

In [None]:
# explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn, transformers, Spark, etc.)
knn_II = KNeighborsClassifier(n_jobs = -1)
knn_II.fit(X_train_II_encoded, y_train)


# visualize the first prediction's explanation
#shap.plots.waterfall(shap_values[0])

# Get the model explainer object, using the 100-row sample as background data
explainer = shap.KernelExplainer(knn_II.predict_proba, X_II_sample)

shap.initjs()


In [None]:
# Get shap values for the first observation in the test set
shap_values = explainer.shap_values(X_test_II_encoded.iloc[0,:])

# Generate a force plot for this first observation using the derived shap values
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test_II_encoded.iloc[0,:])

In [None]:
# use the fitted classifier (knn_II); knn itself is never fitted directly
f = lambda x: knn_II.predict_proba(x)[:,1]
med = X_train_II_encoded.median().values.reshape((1,X_train_II_encoded.shape[1]))
explainer = shap.KernelExplainer(f, med)
shap_values_single = explainer.shap_values(X_train_II_encoded.iloc[0,:], nsamples=1000)
shap.force_plot(explainer.expected_value, shap_values_single)

## Decision Trees

### Decision Tree Baseline Model:

In [None]:
# Creating decision tree classifier object
dec_tree = DecisionTreeClassifier(random_state=42)

# using 2-split cross-validation to score the classification:
dec_tree_cv_score = cross_val_score(dec_tree, X_train_II_encoded, y_train, cv=2)

# return the mean of the 2 accuracy scores:
mean_dec_tree_cv_score = np.mean(dec_tree_cv_score)
print(f"Mean Cross Validation Score: {mean_dec_tree_cv_score :.2%}")

### Using GridSearchCV to create additional Decision Tree models:

In [None]:
# Define the parameter grid:

dec_tree_param_grid = {
    'criterion'        : ['gini', 'entropy'],
    'max_depth'        : [None,5,6, 7, 8],
    'min_samples_split': [2,3,5],
    'min_samples_leaf' : [1, 2, 3, 4, 5, 6],
    'class_weight'     : [None, 'balanced']
}
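
To get a feel for the cost of this search, the sketch below counts the candidates in the grid above. `GridSearchCV` fits every combination once per fold, so with `cv=2` the number of fits is twice the number of combinations (plus one final refit of the best candidate, with the default `refit=True`). The grid is copied from the cell above; the counting code is only an illustration.

```python
from itertools import product

# Same grid as defined above
dec_tree_param_grid = {
    'criterion'        : ['gini', 'entropy'],
    'max_depth'        : [None, 5, 6, 7, 8],
    'min_samples_split': [2, 3, 5],
    'min_samples_leaf' : [1, 2, 3, 4, 5, 6],
    'class_weight'     : [None, 'balanced'],
}

# Every combination of one value per parameter is a candidate
n_candidates = len(list(product(*dec_tree_param_grid.values())))

cv_folds = 2
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

`sklearn.model_selection.ParameterGrid` yields the same set of candidates, if you prefer to inspect them as dictionaries.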

In [None]:
# Instantiate GridSearchCV object:
dec_tree_grid_search = GridSearchCV(dec_tree, dec_tree_param_grid, cv=2, return_train_score=True, 
                                    n_jobs = -1)

# Fit to the data
dec_tree_grid_search.fit(X_train_II_encoded, y_train)

In [None]:
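# Mean training score, averaged over all parameter combinations
dec_tree_gs_training_score = np.mean(dec_tree_grid_search.cv_results_['mean_train_score'])

# Mean cross-validation score (validation folds), averaged over all parameter combinations
dec_tree_gs_testing_score = np.mean(dec_tree_grid_search.cv_results_['mean_test_score'])

# Accuracy of the refit best estimator on the holdout test set
dec_tree_holdout_score = dec_tree_grid_search.score(X_test_II_encoded, y_test)

# Print Results
print(f"Mean Training Score: {dec_tree_gs_training_score :.2%}")
print(f"Mean Cross-Validation Score: {dec_tree_gs_testing_score :.2%}")
print(f"Holdout Test Score (best estimator): {dec_tree_holdout_score :.2%}")
print("Best Parameter Combination (highest mean cross-validation score):")
dec_tree_grid_search.best_params_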

In [None]:
# Creates a dataframe from dec_tree_grid_search.cv_results_ dictionary:
dec_tree_gs_df = pd.DataFrame(dec_tree_grid_search.cv_results_)

# adding new column:
dec_tree_gs_df['score_dif'] = abs(dec_tree_gs_df['mean_train_score'] -  dec_tree_gs_df['mean_test_score'])

# creates new dataframe with only 'train','test' scores, and their difference:
dec_tree_scores = dec_tree_gs_df.loc[:,['mean_train_score','mean_test_score','score_dif']]
dec_tree_scores.describe()

In [None]:
# compute the result once and reuse it
dec_tree_best = best_score_metric(dec_tree_scores, dec_tree_gs_df)
print('best score: ', dec_tree_best[0])
print('best score difference: ', dec_tree_best[1])
print('best train-test combination score(auc): ', dec_tree_best[2])
print('best dataframe row: ', dec_tree_best[4])
print('best parameters: ', dec_tree_best[3])

## Random Forests

### Random Forest Baseline Model:

In [None]:
# Creating random forest classifier object
forest = RandomForestClassifier(n_jobs = -1,random_state=42)

# using 2-split cross-validation to score the classification:
forest_cv_score = cross_val_score(forest, X_train_II_encoded, y_train, cv=2)

# return the mean of the 2 accuracy scores:
mean_forest_cv_score = np.mean(forest_cv_score)
print(f"Mean Cross Validation Score: {mean_forest_cv_score :.2%}")

In [12]:
forest_II = RandomForestClassifier(n_jobs = -1,random_state=42)
forest_II.fit( X_train_II_encoded, y_train)

RandomForestClassifier(n_jobs=-1, random_state=42)

In [None]:
# Explain the fitted random forest with TreeExplainer and plot the mean |SHAP value| per feature
shap_values = shap.TreeExplainer(forest_II).shap_values(X_train_II_encoded)
shap.summary_plot(shap_values, X_train_II_encoded, plot_type="bar")

### Using GridSearchCV to create additional Random Forests:

In [None]:
# Define the parameter grid:

forest_param_grid = {
    'criterion'        : ['gini', 'entropy'],
    'max_depth'        : [None, 4, 5, 6, 8],
    'min_samples_split': [2, 3, 4, 6],
    # 'sqrt' is what 'auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3
    'max_features'     : [15, 20, 32, 'sqrt'],
    'class_weight'     : [None, 'balanced'],
    'n_estimators'     : [100, 150]
}

In [None]:
# Instantiate GridSearchCV object:
forest_grid_search = GridSearchCV(forest, forest_param_grid, cv=2, return_train_score=True,
                                  n_jobs = -1)

# Fit to the data
forest_grid_search.fit(X_train_II_encoded, y_train)

In [None]:
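# Mean training score, averaged over all parameter combinations
forest_gs_training_score = np.mean(forest_grid_search.cv_results_['mean_train_score'])

# Mean cross-validation score (validation folds), averaged over all parameter combinations
forest_gs_testing_score = np.mean(forest_grid_search.cv_results_['mean_test_score'])

# Accuracy of the refit best estimator on the holdout test set
forest_holdout_score = forest_grid_search.score(X_test_II_encoded, y_test)

print(f"Mean Training Score: {forest_gs_training_score :.2%}")
print(f"Mean Cross-Validation Score: {forest_gs_testing_score :.2%}")
print(f"Holdout Test Score (best estimator): {forest_holdout_score :.2%}")
print("Best Parameter Combination (highest mean cross-validation score):")
forest_grid_search.best_params_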

In [None]:
# Creates a dataframe from forest_grid_search.cv_results_ dictionary:
forest_cv_grid_df = pd.DataFrame(forest_grid_search.cv_results_)

# adding new column:
forest_cv_grid_df['score_dif'] = abs(forest_cv_grid_df['mean_train_score'] - forest_cv_grid_df['mean_test_score'])

# creates new dataframe with only 'train','test' scores, and their difference:
forest_scores = forest_cv_grid_df.loc[:,['mean_train_score','mean_test_score','score_dif']]
forest_scores.describe()

In [None]:
# compute the result once and reuse it
forest_best = best_score_metric(forest_scores, forest_cv_grid_df)
print('best score: ', forest_best[0])
print('best score difference: ', forest_best[1])
print('best train-test combination score(auc): ', forest_best[2])
print('best dataframe row: ', forest_best[4])
print('best parameters: ', forest_best[3])

## XGBoost

In [None]:
# Creating new dataframes with 2 column names modified so they work with XGBoost:

X_train_III_encoded = X_train_II_encoded.rename(columns={'education_< 12 Years': 'education less than 12 Years', 
                                                         'income_poverty_<= $75,000, Above Poverty':
                                                         'income_poverty less than or = to $75000_Above Poverty'})
X_test_III_encoded = X_test_II_encoded.rename(columns={'education_< 12 Years': 'education less than 12 Years', 
                                                         'income_poverty_<= $75,000, Above Poverty':
                                                         'income_poverty less than or = to $75000_Above Poverty'})
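
The two renames above are needed because XGBoost has historically rejected feature names containing `[`, `]`, or `<`. As an alternative to renaming columns one by one, the sketch below shows a small helper that cleans any column name. The helper name and the replacement tokens (`lte`, `lt`, `_`) are arbitrary choices for illustration, not part of XGBoost.

```python
import re


def sanitize_feature_name(name):
    """Replace the characters XGBoost has rejected in feature names."""
    # Spell out comparison operators first so the meaning is kept
    name = name.replace("<=", "lte").replace("<", "lt")
    # Replace any remaining brackets with underscores
    return re.sub(r"[\[\]]", "_", name)


# Two column names from this notebook plus one that needs no change
columns = [
    "education_< 12 Years",
    "income_poverty_<= $75,000, Above Poverty",
    "age_group",
]

cleaned = [sanitize_feature_name(c) for c in columns]
print(cleaned)
```

With a dataframe this could be applied as `df.rename(columns=sanitize_feature_name)`, using the same function for the training and test frames so their columns stay aligned.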

### XGBoost Baseline Model:

In [None]:
# Creating XGBoost classifier object
xgboost_clf = XGBClassifier(random_state=42, n_jobs = -1)

# using 2-split cross-validation to score the classification:
xgboost_clf_cv_score = cross_val_score(xgboost_clf, X_train_III_encoded, y_train, cv=2)

# return the mean of the 2 accuracy scores:
mean_xgboost_clf_cv_score = np.mean(xgboost_clf_cv_score)
print(f"Mean Cross Validation Score: {mean_xgboost_clf_cv_score :.2%}")

### Using GridSearchCV to create additional XGBoost Classifiers:

In [None]:
# Define the parameter grid:

xgboost_param_grid = {
    'learning_rate': [None, .08, .1],
    'max_depth': [None, 4, 5, 6 ],
    'min_child_weight': [1, 2, 3],
    'subsample': [0.65, 1],
    'min_split_loss' : [0, .5],
    'n_estimators' : [100, 160],
    'reg_alpha':[None, .5,]
}

In [None]:
# Instantiate GridSearchCV object:
xgboost_clf_grid_search = GridSearchCV(xgboost_clf, xgboost_param_grid, cv=2, return_train_score=True,
                                  n_jobs = -1)

# Fit to the data
xgboost_clf_grid_search.fit(X_train_III_encoded, y_train)

In [None]:
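# Mean training score, averaged over all parameter combinations
xgboost_clf_gs_training_score = np.mean(xgboost_clf_grid_search.cv_results_['mean_train_score'])

# Mean cross-validation score (validation folds), averaged over all parameter combinations
xgboost_clf_gs_testing_score = np.mean(xgboost_clf_grid_search.cv_results_['mean_test_score'])

# Accuracy of the refit best estimator on the holdout test set
xgboost_clf_holdout_score = xgboost_clf_grid_search.score(X_test_III_encoded, y_test)

print(f"Mean Training Score: {xgboost_clf_gs_training_score :.2%}")
print(f"Mean Cross-Validation Score: {xgboost_clf_gs_testing_score :.2%}")
print(f"Holdout Test Score (best estimator): {xgboost_clf_holdout_score :.2%}")
print("Best Parameter Combination (highest mean cross-validation score):")
xgboost_clf_grid_search.best_params_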

In [None]:
# Creates a dataframe from xgboost_clf_grid_search.cv_results_ dictionary:
xgboost_clf_grid_df = pd.DataFrame(xgboost_clf_grid_search.cv_results_)

# adding new column:
xgboost_clf_grid_df['score_dif'] = abs(xgboost_clf_grid_df['mean_train_score'] - 
                                       xgboost_clf_grid_df['mean_test_score'])

# creates new dataframe with only 'train','test' scores, and their difference:
xgboost_scores = xgboost_clf_grid_df.loc[:,['mean_train_score','mean_test_score','score_dif']]

In [None]:
# compute the result once and reuse it
xgboost_best = best_score_metric(xgboost_scores, xgboost_clf_grid_df)
print('best score: ', xgboost_best[0])
print('best score difference: ', xgboost_best[1])
print('best train-test combination score(auc): ', xgboost_best[2])
print('best dataframe row: ', xgboost_best[4])
print('best parameters: ', xgboost_best[3])
