# Implement Your Machine Learning Project Plan

In this lab you will implement the machine learning project plan you created in the assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Create features and a label, and prepare your data for your model.
3. Fit your model to the training data and evaluate your model. 
4. Show how you've improved upon your baseline model.

### Import Packages

Before you get started, import a few packages.

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

<b>Task:</b> In the code cell below, import the additional packages that you will need for this task (only import packages that you have used in this course).

In [2]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import plot_roc_curve, accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_val_score

## Part 1: Load the Data Set


You have chosen to work with one of three data sets. The data sets are located in a folder named "data." The file names of the three data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultDataFull.csv`
* The airbnb NYC "listings" data set is located in file  `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is location in file `WHR2018Chapter2OnlineData.csv`

<b>Task:</b> In the code cell below, use the same method you have been using to load your data using `pd.read_csv()` and save it to DataFrame `df`.

In [3]:
# Filenames of the three data sets
filename = os.path.join(os.getcwd(), "data", "adultDataFull.csv")
df = pd.read_csv(filename)

df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income_binary
0,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Non-Female,2174,0,40.0,United-States,<=50K
1,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,13.0,United-States,<=50K
2,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Non-Female,0,0,40.0,United-States,<=50K
3,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Non-Female,0,0,40.0,United-States,<=50K
4,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40.0,Cuba,<=50K


## Part 2: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.

In [4]:
y = df["marital-status"]
myList = ["age","education","workclass","marital-status"]
df = df[myList]


In [5]:
df.isnull().sum()

age                162
education            0
workclass         1836
marital-status       0
dtype: int64

In [6]:
rows = df.loc[df["workclass"].isnull()].index

df.drop(rows, axis = 0, inplace=True)

rows = df.loc[df["age"].isnull()].index

df.drop(rows, axis = 0, inplace=True)
df.isnull().sum()

age               0
education         0
workclass         0
marital-status    0
dtype: int64

In [7]:
df["workclass"].unique()

array(['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov',
       'Local-gov', 'Self-emp-inc', 'Without-pay', 'Never-worked'],
      dtype=object)

In [8]:
encoder = OneHotEncoder(handle_unknown ="error", sparse = False)
to_encode = ["workclass", "education"]

df_enc = pd.DataFrame(encoder.fit_transform(df[to_encode]))

df_enc.columns = encoder.get_feature_names(to_encode)

df = df_enc.join(df) #order matters

df.drop(to_encode, axis = 1, inplace=True)
df.isnull().sum()
rows = df.loc[df["age"].isnull()].index

df.drop(rows, axis = 0, inplace=True)

rows = df.loc[df["marital-status"].isnull()].index

df.drop(rows, axis = 0, inplace=True)
df.isnull().sum()

workclass_Federal-gov         0
workclass_Local-gov           0
workclass_Never-worked        0
workclass_Private             0
workclass_Self-emp-inc        0
workclass_Self-emp-not-inc    0
workclass_State-gov           0
workclass_Without-pay         0
education_10th                0
education_11th                0
education_12th                0
education_1st-4th             0
education_5th-6th             0
education_7th-8th             0
education_9th                 0
education_Assoc-acdm          0
education_Assoc-voc           0
education_Bachelors           0
education_Doctorate           0
education_HS-grad             0
education_Masters             0
education_Preschool           0
education_Prof-school         0
education_Some-college        0
age                           0
marital-status                0
dtype: int64

In [9]:
y = df["marital-status"]
X = df.drop("marital-status",axis = 1)
X.head()

Unnamed: 0,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,education_10th,education_11th,...,education_Assoc-acdm,education_Assoc-voc,education_Bachelors,education_Doctorate,education_HS-grad,education_Masters,education_Preschool,education_Prof-school,education_Some-college,age
0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,39.0
1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,50.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,38.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,53.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,28.0


In [10]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = .2, random_state = 1234)

In [12]:
hyperparams = [2**n for n in range(2,7)]
hyperparams
print('Running k-fold Cross-Validation...')

accuracy_scores = []

for md in hyperparams:
    
    model = DecisionTreeClassifier(max_depth = md, min_samples_leaf= 1)
    
    acc_score = cross_val_score(model,X_train,y_train)
    
    acc_mean = acc_score.mean()
    
    accuracy_scores.append(acc_mean)
    
print('Done\n')

for s in range(len(accuracy_scores)):
    print('Accuracy score for max_depth {0}: {1}'.format(hyperparams[s], accuracy_scores[s]))

Running k-fold Cross-Validation...
Done

Accuracy score for max_depth 4: 0.6177914181892568
Accuracy score for max_depth 8: 0.6116549471509029
Accuracy score for max_depth 16: 0.5905903510740759
Accuracy score for max_depth 32: 0.5778385519508378
Accuracy score for max_depth 64: 0.5776209240944721


In [15]:
model = DecisionTreeClassifier(max_depth = 4, min_samples_leaf = 1)
model.fit(X_train,y_train)
class_predictions = model.predict(X_test)
acc_score = accuracy_score(y_test,class_predictions)
print(acc_score)


0.6259355961705831
