![example](images/pexels-pixabay-40568.png)

# Phase 3 Project

**Author:** Freddy Abrahamson<br>
**Date created:** 3-27-2022<br>
**Discipline:** Data Science

## Overview
For this project, I will use multiple linear regression modeling to analyze house sales in King County, in Washington state.

## Business Problem

The goal of this project is to provide advice to homeowners about how home renovations can increase the value of their homes, and by what amount. The data describes the characteristics of over 20,000 homes in King County, which is located in Washington State. I will use this information to gain a better understanding of how different remodels or renovations impact the price of the homes listed.

## Data Understanding

The data comes from the King County House Sales dataset, in the form of a CSV file. The file will be converted into a pandas dataframe. It contains information about the different characteristics of the homes in the King County area, including the number of bedrooms, building grades, square footage, and price. King County is located in Washington State, and has a size of approximately 2300 square miles, per the U.S. Census Bureau.

kc_house_data.csv


I will briefly review the characteristics of this dataframe, with a view toward using its columns as variables in a regression model. These include:

* dataframe shape: the number of rows and columns in the dataframe
* any missing/null values
* continuous variables
* categorical variables
* binary variables
* zero inflated variables
* outliers

Since the goal is to gain insight into how much a particular upgrade or remodel can impact the price of the house, as opposed to predicting home prices, I will place an emphasis on choosing features with the least explanatory overlap. To that end, for instance, I would favor a feature such as the number of bedrooms or bathrooms over square footage.

In [1]:
import pandas as pd
import numpy as np
import warnings
#warnings.filterwarnings('ignore')
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [2]:
#importing dataset
df = pd.read_csv('H1N1_Flu_Vaccines.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 38 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo

In [4]:
df.head()

Unnamed: 0,respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,...,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation,h1n1_vaccine,seasonal_vaccine
0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,Own,Not in Labor Force,oxchjgsf,Non-MSA,0.0,0.0,,,0,0
1,1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,Rent,Employed,bhuqouqj,"MSA, Not Principle City",0.0,0.0,pxcmvdjn,xgwztkwe,0,1
2,2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,Own,Employed,qufhixun,"MSA, Not Principle City",2.0,0.0,rucpziij,xtkaffoo,0,0
3,3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,Rent,Not in Labor Force,lrircsnp,"MSA, Principle City",0.0,0.0,,,0,1
4,4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,...,Own,Employed,qufhixun,"MSA, Not Principle City",1.0,0.0,wxleyezf,emcorrxb,0,0


In [5]:
print("Raw Counts")
print(df["h1n1_vaccine"].value_counts())
print()
print("Percentages")
print(df["h1n1_vaccine"].value_counts(normalize=True))

Raw Counts
0    21033
1     5674
Name: h1n1_vaccine, dtype: int64

Percentages
0    0.787546
1    0.212454
Name: h1n1_vaccine, dtype: float64


<b>A baseline model that always chose the majority class would have an accuracy of over 78%.</b>
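The majority-class figure can be reproduced with a short sketch. It assumes scikit-learn's `DummyClassifier` as the stand-in baseline and rebuilds the labels from the counts printed above rather than from the original file; `y_demo` and `X_demo` are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels that match the class counts reported above
# (21033 unvaccinated, 5674 vaccinated). This is not the original data.
y_demo = np.array([0] * 21033 + [1] * 5674)

# The "most_frequent" strategy ignores the features, so a placeholder column is enough.
X_demo = np.zeros((len(y_demo), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

# Accuracy of always predicting the majority class (0)
baseline_accuracy = baseline.score(X_demo, y_demo)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

Any classifier below is only useful to the extent that it beats this number.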

# Preprocessing the Data:

### Dropping Features, Train-test-split, and Dealing with Missing Values: 

In [6]:
#I will drop:
# 'respondent_id' - since it is a unique identifier
# 'employment_industry','employment_occupation','health_insurance' - about 50% or more records missing 
# 'seasonal_vaccine' - we will not account for this classification
df_II = df.drop(['respondent_id','seasonal_vaccine','employment_industry','employment_occupation','health_insurance' ], axis=1)


In [7]:
# Split df into X and y
X = df_II.drop("h1n1_vaccine", axis=1)
y = df_II["h1n1_vaccine"]

# Perform train-test split with random_state=42 and stratify=y
# stratify y to maintain uniform ratios of dependent variable y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [8]:
#impute values based on most common value in each column:
X_train = X_train.apply(lambda x:x.fillna(x.value_counts().index[0]))
X_test = X_test.apply(lambda x:x.fillna(x.value_counts().index[0]))

<b>There is now no missing data in the training dataset.</b>
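The cell above fills each split with its own most common values. A common alternative, sketched here on toy data, is to learn the modes from the training split only and reuse them for the test split, so that nothing about the test data influences preprocessing. The frames `train_demo` and `test_demo` are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_train and X_test (hypothetical values)
train_demo = pd.DataFrame({
    "concern": [1.0, 1.0, 2.0, np.nan],
    "region": ["a", "b", "b", None],
})
test_demo = pd.DataFrame({
    "concern": [np.nan, 3.0],
    "region": [None, "a"],
})

# Most common value of each column, computed on the training split only
train_modes = train_demo.mode().iloc[0]

# Fill both splits with the training modes
train_filled = train_demo.fillna(train_modes)
test_filled = test_demo.fillna(train_modes)

print(test_filled)
```

With this approach the test rows are imputed exactly as new, unseen records would be.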

### Pre-processing training data:

In [9]:
# I will convert all the columns in the dataset to string type, so I can then one-hot encode them:
X_train_II = X_train.astype(str)

# creating a OneHotEncoder object:
ohe = OneHotEncoder(categories="auto", handle_unknown="ignore", sparse=False)

# fitting the OneHotEncoder object to the training data:
ohe.fit(X_train_II)

# creating an array with ohe column names:
col_names = ohe.get_feature_names(X_train_II.columns)

# Create transformed dataframe (the encoder is already fitted, so transform is enough):
X_train_II_encoded = pd.DataFrame(ohe.transform(X_train_II), columns=col_names)
X_train_II_encoded.head()

Unnamed: 0,h1n1_concern_0.0,h1n1_concern_1.0,h1n1_concern_2.0,h1n1_concern_3.0,h1n1_knowledge_0.0,h1n1_knowledge_1.0,h1n1_knowledge_2.0,behavioral_antiviral_meds_0.0,behavioral_antiviral_meds_1.0,behavioral_avoidance_0.0,...,"census_msa_MSA, Principle City",census_msa_Non-MSA,household_adults_0.0,household_adults_1.0,household_adults_2.0,household_adults_3.0,household_children_0.0,household_children_1.0,household_children_2.0,household_children_3.0
0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0


### Pre-processing test data:

In [10]:
# I will convert all the columns in the dataset to string type, so I can then one-hot encode them:
X_test_II = X_test.astype(str)

# Create transformed dataframe. No need to fit. Use same column names array:
X_test_II_encoded = pd.DataFrame(ohe.transform(X_test_II), columns=col_names)
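Reusing the fitted encoder on the test data relies on `handle_unknown="ignore"`. The following sketch, on made-up category values, shows what that setting does when the test split contains a category the encoder never saw. It keeps the encoder's default sparse output and converts it with `.toarray()`, which is an illustrative choice and not what the cells above do.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column frames; "Unseen Region" appears only in the test frame
ohe_train_demo = pd.DataFrame({"census_msa": ["Non-MSA", "MSA, Principle City", "Non-MSA"]})
ohe_test_demo = pd.DataFrame({"census_msa": ["Non-MSA", "Unseen Region"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(ohe_train_demo)

# Columns follow the sorted training categories:
# "MSA, Principle City" first, then "Non-MSA"
encoded_test = encoder.transform(ohe_test_demo).toarray()
print(encoded_test)
```

The unseen category becomes a row of zeros instead of raising an error, and the column count stays identical to the training matrix.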

# Classification models:

## K Nearest Neighbors (KNN)

### K Nearest Neighbors Baseline Model:

In [11]:
# Creating K nearest neighbor classifier object 
knn = KNeighborsClassifier(n_jobs = -1)

# using 2-split cross-validation to score the classification:
knn_cv_score = cross_val_score(knn, X_train_II_encoded, y_train, cv=2)

# return the mean of the 2 accuracy scores:
mean_knn_cv_score = np.mean(knn_cv_score)
print(f"Mean Cross Validation Score: {mean_knn_cv_score :.2%}")

Mean Cross Validation Score: 80.55%


### Using GridSearchCV to create additional KNN models:

In [12]:
# Define the parameter grid:

knn_param_grid = {
    'n_neighbors': [5,12,20],
    'metric'     : ['minkowski'],
    'p'          : [1,2,3,4]
}

In [13]:
# Instantiate GridSearchCV object:
knn_grid_search = GridSearchCV(knn, knn_param_grid, cv=2, return_train_score=True, n_jobs = -1)

# Fit to the data
knn_grid_search.fit(X_train_II_encoded, y_train)

GridSearchCV(cv=2, estimator=KNeighborsClassifier(n_jobs=-1), n_jobs=-1,
             param_grid={'metric': ['minkowski'], 'n_neighbors': [5, 12, 20],
                         'p': [1, 2, 3, 4]},
             return_train_score=True)

In [14]:
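# Mean training score, averaged over every parameter combination in the grid
knn_gs_training_score = np.mean(knn_grid_search.cv_results_['mean_train_score'])

# Accuracy of the refit best estimator on the holdout test set (computed here, but not stored or printed)
knn_grid_search.score(X_test_II_encoded, y_test)

# Mean cross-validation score, averaged over every parameter combination in the grid
knn_gs_testing_score = np.mean(knn_grid_search.cv_results_['mean_test_score'])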

print(f"Mean Training Score: {knn_gs_training_score :.2%}")
print(f"Mean Test Score: {knn_gs_testing_score :.2%}")
print("Best Parameter Combination (to return highest score for the holdout data):")
knn_grid_search.best_params_

Mean Training Score: 83.93%
Mean Test Score: 81.37%
Best Parameter Combination (to return highest score for the holdout data):


{'metric': 'minkowski', 'n_neighbors': 20, 'p': 1}
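The figures printed above are averages over all parameter combinations in the grid. A fitted `GridSearchCV` also exposes the score of the single best combination (`best_score_`) and can score the refit best estimator on held-out data. The sketch below shows the three quantities side by side on synthetic data from `make_classification`; the data and the small grid are assumptions made for illustration, not the project's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the survey data)
X_syn, y_syn = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42, stratify=y_syn)

demo_search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [5, 12, 20], "p": [1, 2]},
    cv=2,
    return_train_score=True,
)
demo_search.fit(X_tr, y_tr)

# 1. Average cross-validation accuracy across every candidate in the grid
grid_average_cv = np.mean(demo_search.cv_results_["mean_test_score"])

# 2. Cross-validation accuracy of the best candidate only
best_cv = demo_search.best_score_

# 3. Accuracy of the refit best estimator on data the search never saw
holdout_accuracy = demo_search.score(X_te, y_te)

print(f"Grid-average CV accuracy: {grid_average_cv:.2%}")
print(f"Best-candidate CV accuracy: {best_cv:.2%}")
print(f"Holdout accuracy: {holdout_accuracy:.2%}")
```

The second and third numbers describe the model that would actually be used, while the first describes the grid as a whole.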

In [15]:
# Creates a dataframe from knn_grid_search.cv_results_ dictionary:
knn_cv_grid_df = pd.DataFrame(knn_grid_search.cv_results_)

# adding new column 
knn_cv_grid_df['score_dif'] = abs(knn_cv_grid_df['mean_train_score'] - knn_cv_grid_df['mean_test_score'])

# creates new dataframe with only 'train' and 'test' scores
knn_scores = knn_cv_grid_df.loc[:,['mean_train_score','mean_test_score','score_dif']]
knn_scores.describe()

| | mean_train_score | mean_test_score | score_dif |
|---|---|---|---|
| count | 12.0 | 12.0 | 12.0 |
| mean | 0.839308 | 0.813663 | 0.025645 |
| std | 0.011144 | 0.006086 | 0.017222 |
| min | 0.829406 | 0.805492 | 0.010734 |
| 25% | 0.829406 | 0.805492 | 0.010734 |
| 50% | 0.834398 | 0.816825 | 0.017574 |
| 75% | 0.854119 | 0.818672 | 0.048627 |
| max | 0.854119 | 0.818672 | 0.048627 |


## Decision Trees

### Decision Tree Baseline Model:

In [16]:
# Creating decision tree classifier object
dec_tree = DecisionTreeClassifier(random_state=42)

# using 2-split cross-validation to score the classification:
dec_tree_cv_score = cross_val_score(dec_tree, X_train_II_encoded, y_train, cv=2)

# return the mean of the 2 accuracy scores:
mean_dec_tree_cv_score = np.mean(dec_tree_cv_score)
print(f"Mean Cross Validation Score: {mean_dec_tree_cv_score :.2%}")

Mean Cross Validation Score: 75.02%


### Using GridSearchCV to create additional Decision Tree models:

In [17]:
path = dec_tree.cost_complexity_pruning_path(X_train_II_encoded, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
ccp_alphas.mean()

0.00014796956808722217
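The pruning path returns a whole sequence of effective alphas, so instead of a single mean value the candidates themselves can feed a grid search. The sketch below does this on synthetic data, using `0.0` (the documented default, meaning no pruning) rather than `None` as the unpruned option. The data, the thinning step, and the names `alpha_grid` and `alpha_search` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the survey data)
X_prune, y_prune = make_classification(n_samples=300, n_features=8, random_state=42)

demo_tree = DecisionTreeClassifier(random_state=42)
demo_path = demo_tree.cost_complexity_pruning_path(X_prune, y_prune)

# Drop the last alpha (it prunes the tree down to its root) and clip
# any tiny negative values caused by floating-point error.
candidate_alphas = np.unique(np.clip(demo_path.ccp_alphas[:-1], 0.0, None))

# Keep roughly five positive candidates so the search stays small
step = max(1, len(candidate_alphas) // 5)
alpha_grid = [0.0] + [float(a) for a in candidate_alphas[::step] if a > 0]

alpha_search = GridSearchCV(demo_tree, {"ccp_alpha": alpha_grid}, cv=2)
alpha_search.fit(X_prune, y_prune)

best_alpha = alpha_search.best_params_["ccp_alpha"]
print(f"Candidates tried: {len(alpha_grid)}, best ccp_alpha: {best_alpha:.6f}")
```

Larger alphas prune more aggressively, so the selected value indicates how much simplification cross-validation favors.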

In [18]:
# Define the parameter grid:

dec_tree_param_grid = {
    'criterion'        : ['gini', 'entropy'],
    'max_depth'        : [None, 7, 8],
    'min_samples_split': [2,3,5],
    'min_samples_leaf' : [1, 2, 3, 4, 5, 6],
    'class_weight'     : [None, 'balanced']
    #'ccp_alpha'        : [None, 0.00014796956808722217]
}

In [19]:
# Instantiate GridSearchCV object:
dec_tree_grid_search = GridSearchCV(dec_tree, dec_tree_param_grid, cv=2, return_train_score=True, 
                                    n_jobs = -1)

# Fit to the data
dec_tree_grid_search.fit(X_train_II_encoded, y_train)

GridSearchCV(cv=2, estimator=DecisionTreeClassifier(random_state=42), n_jobs=-1,
             param_grid={'class_weight': [None, 'balanced'],
                         'criterion': ['gini', 'entropy'],
                         'max_depth': [None, 7, 8],
                         'min_samples_leaf': [1, 2, 3, 4, 5, 6],
                         'min_samples_split': [2, 3, 5]},
             return_train_score=True)

In [20]:
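# Mean training score, averaged over every parameter combination in the grid
dec_tree_gs_training_score = np.mean(dec_tree_grid_search.cv_results_['mean_train_score'])

# Accuracy of the refit best estimator on the holdout test set (computed here, but not stored or printed)
dec_tree_grid_search.score(X_test_II_encoded, y_test)

# Mean cross-validation score, averaged over every parameter combination in the grid
dec_tree_gs_testing_score = np.mean(dec_tree_grid_search.cv_results_['mean_test_score'])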

# Print Results
print(f"Mean Training Score: {dec_tree_gs_training_score :.2%}")
print(f"Mean Test Score: {dec_tree_gs_testing_score :.2%}")
print("Best Parameter Combination (to return highest score for the holdout data):")
dec_tree_grid_search.best_params_

Mean Training Score: 85.61%
Mean Test Score: 78.14%
Best Parameter Combination (to return highest score for the holdout data):


{'class_weight': None,
 'criterion': 'entropy',
 'max_depth': 7,
 'min_samples_leaf': 5,
 'min_samples_split': 2}

In [21]:
# Creates a dataframe from dec_tree_grid_search.cv_results_ dictionary:
dec_tree_gs_df = pd.DataFrame(dec_tree_grid_search.cv_results_)

# adding new column:
dec_tree_gs_df['score_dif'] = abs(dec_tree_gs_df['mean_train_score'] -  dec_tree_gs_df['mean_test_score'])

# creates new dataframe with only 'train','test' scores, and their difference:
dec_tree_scores = dec_tree_gs_df.loc[:,['mean_train_score','mean_test_score','score_dif']]
dec_tree_scores.describe()

| | mean_train_score | mean_test_score | score_dif |
|---|---|---|---|
| count | 216.0 | 216.0 | 216.0 |
| mean | 0.856119 | 0.781356 | 0.074763 |
| std | 0.056444 | 0.032498 | 0.069446 |
| min | 0.789166 | 0.719321 | 0.019571 |
| 25% | 0.802197 | 0.763854 | 0.02671 |
| 50% | 0.850924 | 0.770944 | 0.033849 |
| 75% | 0.885721 | 0.818884 | 0.127172 |
| max | 1.0 | 0.825761 | 0.249825 |


## Random Forests

### Random Forest Baseline Model:

In [22]:
# Creating random forest classifier object
forest = RandomForestClassifier(n_jobs = -1,random_state=42)

# using 2-split cross-validation to score the classification:
forest_cv_score = cross_val_score(forest, X_train_II_encoded, y_train, cv=2)

# return the mean of the 2 accuracy scores:
mean_forest_cv_score = np.mean(forest_cv_score)
print(f"Mean Cross Validation Score: {mean_forest_cv_score :.2%}")

Mean Cross Validation Score: 83.31%


### Using GridSearchCV to create additional Random Forests:

In [23]:
# Define the parameter grid:

forest_param_grid = {
              'criterion'        : ['gini', 'entropy'],
              'max_depth'        : [None, 4, 6,8],
              'min_samples_split': [2,3,4,6],
              'max_features'     : [15, 54,'auto'],
             'class_weight'      : [None, 'balanced'],
              'n_estimators'     : [100, 140]
         
}

In [24]:
# Instantiate GridSearchCV object:
forest_grid_search = GridSearchCV(forest, forest_param_grid, cv=2, return_train_score=True,
                                  n_jobs = -1)

# Fit to the data
forest_grid_search.fit(X_train_II_encoded, y_train)

GridSearchCV(cv=2, estimator=RandomForestClassifier(n_jobs=-1, random_state=42),
             n_jobs=-1,
             param_grid={'class_weight': [None, 'balanced'],
                         'criterion': ['gini', 'entropy'],
                         'max_depth': [None, 4, 6, 8],
                         'max_features': [15, 54, 'auto'],
                         'min_samples_split': [2, 3, 4, 6],
                         'n_estimators': [100, 140]},
             return_train_score=True)

In [25]:
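# Mean training score, averaged over every parameter combination in the grid
forest_gs_training_score = np.mean(forest_grid_search.cv_results_['mean_train_score'])

# Accuracy of the refit best estimator on the holdout test set (computed here, but not stored or printed)
forest_grid_search.score(X_test_II_encoded, y_test)

# Mean cross-validation score, averaged over every parameter combination in the grid
forest_gs_testing_score = np.mean(forest_grid_search.cv_results_['mean_test_score'])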

print(f"Mean Training Score: {forest_gs_training_score :.2%}")
print(f"Mean Test Score: {forest_gs_testing_score :.2%}")
print("Best Parameter Combination (to return highest score for the holdout data):")
forest_grid_search.best_params_

Mean Training Score: 86.48%
Mean Test Score: 81.32%
Best Parameter Combination (to return highest score for the holdout data):


{'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'min_samples_split': 4,
 'n_estimators': 100}

In [26]:
# Creates a dataframe from forest_grid_search.cv_results_ dictionary:
forest_cv_grid_df = pd.DataFrame(forest_grid_search.cv_results_)

# adding new column:
forest_cv_grid_df['score_dif'] = abs(forest_cv_grid_df['mean_train_score'] - forest_cv_grid_df['mean_test_score'])

# creates new dataframe with only 'train','test' scores, and their difference:
forest_scores = forest_cv_grid_df.loc[:,['mean_train_score','mean_test_score','score_dif']]
forest_scores.describe()

| | mean_train_score | mean_test_score | score_dif |
|---|---|---|---|
| count | 384.0 | 384.0 | 384.0 |
| mean | 0.864814 | 0.813245 | 0.051569 |
| std | 0.07737 | 0.021527 | 0.063759 |
| min | 0.767149 | 0.763804 | 0.002896 |
| 25% | 0.813692 | 0.794284 | 0.00805 |
| 50% | 0.834199 | 0.825487 | 0.017773 |
| 75% | 0.894121 | 0.831503 | 0.061533 |
| max | 1.0 | 0.835247 | 0.171842 |


## XGBoost

In [27]:
# XGBoost does not accept feature names containing '[', ']' or '<',
# so the two one-hot columns with '<' in their names are renamed:
renamed_cols = {'education_< 12 Years': 'education less than 12 Years',
                'income_poverty_<= $75,000, Above Poverty':
                'income_poverty less than or = to $75000_Above Poverty'}
X_train_III_encoded = X_train_II_encoded.rename(columns=renamed_cols)
X_test_III_encoded = X_test_II_encoded.rename(columns=renamed_cols)
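The renaming can also be done generically, so that any column containing a character XGBoost rejects is cleaned without listing it by hand. This is a minimal sketch: the helper `sanitize_feature_names` and its replacement tokens (`lte`, `lt`, `_`) are hypothetical choices, not part of XGBoost or pandas.

```python
import re
import pandas as pd


def sanitize_feature_names(columns):
    """Return column names with '<', '[' and ']' replaced by safe text."""
    cleaned = []
    for name in columns:
        # Replace "<=" before "<" so the two cases stay distinguishable
        name = name.replace("<=", "lte").replace("<", "lt")
        # Square brackets become underscores
        name = re.sub(r"[\[\]]", "_", name)
        cleaned.append(name)
    return cleaned


# Two column names taken from the encoded data, plus one that needs no change
names_demo = pd.DataFrame(columns=[
    "education_< 12 Years",
    "income_poverty_<= $75,000, Above Poverty",
    "household_adults_1.0",
])
names_demo.columns = sanitize_feature_names(names_demo.columns)
print(list(names_demo.columns))
```

Applying the same helper to both the training and test frames keeps their column names aligned.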

### XGBoost Baseline Model:

In [28]:
# Creating XGBoost classifier object
xgboost_clf = XGBClassifier(random_state=42, n_jobs = -1)

# using 2-split cross-validation to score the classification:
xgboost_clf_cv_score = cross_val_score(xgboost_clf, X_train_III_encoded, y_train, cv=2)

# return the mean of the 2 accuracy scores:
mean_xgboost_clf_cv_score = np.mean(xgboost_clf_cv_score)
print(f"Mean Cross Validation Score: {mean_xgboost_clf_cv_score :.2%}")

Mean Cross Validation Score: 82.27%


### Using GridSearchCV to create additional XGBoost Classifiers:

In [29]:
# Define the parameter grid:

xgboost_param_grid = {
    'learning_rate': [None, .08, 0.09],
    'max_depth': [None, 4, 5],
    'min_child_weight': [1, 2, 3],
    'subsample': [0.65, 1],
    'min_split_loss' : [0, .5],
    'n_estimators' : [100, 160],
    'reg_alpha':[None, .5,]
}

In [30]:
# Instantiate GridSearchCV object:
xgboost_clf_grid_search = GridSearchCV(xgboost_clf, xgboost_param_grid, cv=2, return_train_score=True,
                                  n_jobs = -1)

# Fit to the data
xgboost_clf_grid_search.fit(X_train_III_encoded, y_train)

GridSearchCV(cv=2,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None, gamma=None,
                                     gpu_id=None, importance_type='gain',
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estimators=100, n_jobs...
                                     reg_alpha=None, reg_lambda=None,
                                     scale_pos_weight=None, subsample=None,
                                     tree_method=None, validate_parameters=None,
                                     verbosity=None),
  

In [31]:
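# Mean training score, averaged over every parameter combination in the grid
xgboost_clf_gs_training_score = np.mean(xgboost_clf_grid_search.cv_results_['mean_train_score'])

# Accuracy of the refit best estimator on the holdout test set (computed here, but not stored or printed)
xgboost_clf_grid_search.score(X_test_III_encoded, y_test)

# Mean cross-validation score, averaged over every parameter combination in the grid
xgboost_clf_gs_testing_score = np.mean(xgboost_clf_grid_search.cv_results_['mean_test_score'])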

print(f"Mean Training Score: {xgboost_clf_gs_training_score :.2%}")
print(f"Mean Test Score: {xgboost_clf_gs_testing_score :.2%}")
print("Best Parameter Combination (to return highest score for the holdout data):")
xgboost_clf_grid_search.best_params_

Mean Training Score: 89.39%
Mean Test Score: 83.25%
Best Parameter Combination (to return highest score for the holdout data):


{'learning_rate': 0.08,
 'max_depth': 4,
 'min_child_weight': 2,
 'min_split_loss': 0,
 'n_estimators': 160,
 'reg_alpha': 0.5,
 'subsample': 1}

In [32]:
# Creates a dataframe from xgboost_clf_grid_search.cv_results_ dictionary:
xgboost_clf_grid_df = pd.DataFrame(xgboost_clf_grid_search.cv_results_)

# adding new column:
xgboost_clf_grid_df['score_dif'] = abs(xgboost_clf_grid_df['mean_train_score'] - 
                                       xgboost_clf_grid_df['mean_test_score'])

# creates new dataframe with only 'train','test' scores, and their difference:
xgboost_scores = xgboost_clf_grid_df.loc[:,['mean_train_score','mean_test_score','score_dif']]
xgboost_scores.describe()

| | mean_train_score | mean_test_score | score_dif |
|---|---|---|---|
| count | 432.0 | 432.0 | 432.0 |
| mean | 0.893937 | 0.832498 | 0.061439 |
| std | 0.03313 | 0.006507 | 0.039118 |
| min | 0.853669 | 0.812831 | 0.015726 |
| 25% | 0.867212 | 0.827758 | 0.029331 |
| 50% | 0.884548 | 0.835597 | 0.050275 |
| 75% | 0.90815 | 0.837444 | 0.077634 |
| max | 0.983475 | 0.839491 | 0.168797 |
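To compare the four model families at a glance, the scores already reported in this notebook can be collected into one table. The sketch below hard-codes those numbers: the baseline column holds the printed mean cross-validation scores, and the grid column holds the `max` of `mean_test_score` from each `describe()` summary. The layout and column names are my own choice for illustration.

```python
import pandas as pd

# Scores copied from the outputs above (2-fold cross-validation accuracy)
summary = pd.DataFrame({
    "model": ["KNN", "Decision Tree", "Random Forest", "XGBoost"],
    "baseline_cv_accuracy": [0.8055, 0.7502, 0.8331, 0.8227],
    "best_grid_cv_accuracy": [0.818672, 0.825761, 0.835247, 0.839491],
})

# Improvement of the best grid candidate over the untuned model
summary["gain_over_baseline"] = (
    summary["best_grid_cv_accuracy"] - summary["baseline_cv_accuracy"]
)

# Rank the models by their best cross-validation accuracy
summary = summary.sort_values("best_grid_cv_accuracy", ascending=False).reset_index(drop=True)
print(summary.to_string(index=False))
```

All four tuned models sit above the 78.75% majority-class accuracy, and these remain cross-validation figures, not holdout test scores.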
