# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind, formulate a project plan, and implement that plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use `pd.read_csv()`, as you have been doing, to load the data and save it to the DataFrame `df`.

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")

#CREATE THE DATA FRAME
df = pd.read_csv(adultDataSet_filename)
df.head(50)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income_binary
0,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Non-Female,2174,0,40.0,United-States,<=50K
1,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,13.0,United-States,<=50K
2,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Non-Female,0,0,40.0,United-States,<=50K
3,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Non-Female,0,0,40.0,United-States,<=50K
4,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40.0,Cuba,<=50K
5,37.0,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40.0,United-States,<=50K
6,49.0,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16.0,Jamaica,<=50K
7,52.0,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,45.0,United-States,>50K
8,31.0,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50.0,United-States,>50K
9,42.0,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,5178,0,40.0,United-States,>50K


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

For this ML problem, I have chosen the 1994 US Census data set. The label is each person's income. Because the label is binary (at most 50K a year, or more than 50K a year), this is a supervised binary classification problem. For now, I have chosen as features the columns that I expect to be correlated with income: `workclass`, `education-num`, and `native-country`.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. Which machine learning model (or models) would you like to use that suits your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
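
As a minimal sketch of these inspection steps, the cell below checks column types, missing values, and class balance on a small hand-made DataFrame. The `toy` DataFrame and its values are invented for illustration; with the real data you would run the same calls on `df`.

```python
import pandas as pd

# Toy stand-in for the census DataFrame (hypothetical values, not real data)
toy = pd.DataFrame({
    "age": [39.0, 50.0, None, 53.0, 28.0],
    "education-num": [13, 13, 9, 7, 13],
    "income_binary": ["<=50K", "<=50K", "<=50K", ">50K", "<=50K"],
})

# Data type of each column
print(toy.dtypes)

# Number of missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Share of each class in the label; a heavily skewed split signals class imbalance
class_share = toy["income_binary"].value_counts(normalize=True)
print(class_share)
```

If one class dominates the label, plain accuracy becomes harder to interpret, which is worth keeping in mind when choosing evaluation metrics.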

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [3]:
#print(df.groupby('native-country')['education-num'].sum().sort_values(ascending = True).head)
print(df.describe())
# Handle missing values
# Fill the missing native-country values with 'Unknown'.
# The result is stored in a separate Series; df itself is not modified here.
countries = df['native-country'].fillna('Unknown')


                age        fnlwgt  education-num  capital-gain  capital-loss  \
count  32399.000000  3.256100e+04   32561.000000  32561.000000  32561.000000   
mean      38.589216  1.897784e+05      10.080679    615.907773     87.303830   
std       13.647862  1.055500e+05       2.572720   2420.191974    402.960219   
min       17.000000  1.228500e+04       1.000000      0.000000      0.000000   
25%       28.000000  1.178270e+05       9.000000      0.000000      0.000000   
50%       37.000000  1.783560e+05      10.000000      0.000000      0.000000   
75%       48.000000  2.370510e+05      12.000000      0.000000      0.000000   
max       90.000000  1.484705e+06      16.000000  14084.000000   4356.000000   

       hours-per-week  
count    32236.000000  
mean        40.450428  
std         12.353748  
min          1.000000  
25%         40.000000  
50%         40.000000  
75%         45.000000  
max         99.000000  


In [4]:
# Count the missing values in each column
nan_count = df.isnull().sum()
# age, hours-per-week and native-country have missing values that we handle below
print(nan_count)
# The country values are kept in the variable countries, so native-country is dropped from df later


age                162
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex_selfID           0
capital-gain         0
capital-loss         0
hours-per-week     325
native-country     583
income_binary        0
dtype: int64


In [5]:
print(np.unique(countries))
# Encode native-country as a single binary indicator rather than one column per country:
# 1 if the person was born in the United States, 0 for every other origin.
# Note: rows with a missing native-country also get 0, so they are grouped with non-US origins.
df['US_national'] = (df['native-country'] == 'United-States').astype(int)

# Encode the label as 0/1 as well, since it is currently stored as strings
df['Greater_than-50k_a_year'] = (df['income_binary'] == '>50K').astype(int)

# Drop the original columns
df.drop(columns=['native-country', 'income_binary'], inplace=True)

df.head()

['Cambodia' 'Canada' 'China' 'Columbia' 'Cuba' 'Dominican-Republic'
 'Ecuador' 'El-Salvador' 'England' 'France' 'Germany' 'Greece' 'Guatemala'
 'Haiti' 'Holand-Netherlands' 'Honduras' 'Hong' 'Hungary' 'India' 'Iran'
 'Ireland' 'Italy' 'Jamaica' 'Japan' 'Laos' 'Mexico' 'Nicaragua'
 'Outlying-US(Guam-USVI-etc)' 'Peru' 'Philippines' 'Poland' 'Portugal'
 'Puerto-Rico' 'Scotland' 'South' 'Taiwan' 'Thailand' 'Trinadad&Tobago'
 'United-States' 'Unknown' 'Vietnam' 'Yugoslavia']


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,US_national,Greater_than-50k_a_year
0,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Non-Female,2174,0,40.0,1,0
1,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,13.0,1,0
2,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Non-Female,0,0,40.0,1,0
3,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Non-Female,0,0,40.0,1,0
4,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40.0,0,0


In [6]:
# Fill the missing values in the age and hours-per-week columns with each column's mean.
# fillna() operates on the whole column at once, so no loop is needed.
df['age'] = df['age'].fillna(df['age'].mean())
df['hours-per-week'] = df['hours-per-week'].fillna(df['hours-per-week'].mean())

print("number of empty rows in age column: " + str(np.sum(df['age'].isnull(), axis = 0)))
print("number of empty rows in hours_per_week column: " + str(np.sum(df['hours-per-week'].isnull(), axis = 0)))


number of empty rows in age column: 0
number of empty rows in hours_per_week column: 0
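
The cell above computes each mean over the full data set, so rows that later land in the test set contribute to the fill value. One common alternative, sketched here on invented data, is to split first and learn the means from the training rows only, using scikit-learn's `SimpleImputer`. The `toy` DataFrame, the split size, and the variable names are assumptions for illustration, not part of the lab's required workflow.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical data with gaps in both numeric columns
toy = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0, 50.0, np.nan, 60.0, 70.0],
    "hours-per-week": [40.0, np.nan, 35.0, 45.0, 40.0, 50.0, np.nan, 38.0],
})

# Split before imputing so the test rows do not influence the fill values
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

# Learn the column means from the training rows, then apply them to both sets
imputer = SimpleImputer(strategy="mean")
train_filled = pd.DataFrame(
    imputer.fit_transform(toy_train), columns=toy.columns, index=toy_train.index
)
test_filled = pd.DataFrame(
    imputer.transform(toy_test), columns=toy.columns, index=toy_test.index
)

print(train_filled.isnull().sum())
print(test_filled.isnull().sum())
```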


In [7]:
nan_count = np.sum(df.isnull(), axis = 0)
print(nan_count) 
# With the gaps in age and hours-per-week filled, keep only the columns we will use as features

cols_to_erase = ['workclass', 'occupation']
# Although workclass and occupation are relevant to a person's annual income,
# each is missing in more than 1,800 rows, so using them could misrepresent the population
df.drop(columns=cols_to_erase, inplace = True)
print('columns after transformation')
df.columns

age                           0
workclass                  1836
fnlwgt                        0
education                     0
education-num                 0
marital-status                0
occupation                 1843
relationship                  0
race                          0
sex_selfID                    0
capital-gain                  0
capital-loss                  0
hours-per-week                0
US_national                   0
Greater_than-50k_a_year       0
dtype: int64
columns after transformation


Index(['age', 'fnlwgt', 'education', 'education-num', 'marital-status',
       'relationship', 'race', 'sex_selfID', 'capital-gain', 'capital-loss',
       'hours-per-week', 'US_national', 'Greater_than-50k_a_year'],
      dtype='object')

In [8]:
# One-hot encode the race, sex_selfID and marital-status columns.
# Each loop iterates over value_counts().index to get the category names;
# iterating over the Series itself would yield the counts instead.

# One-hot encoding of race
races = df["race"].value_counts()
for i in races.index:
    race_column = str(i)
    df[race_column] = np.where(df["race"]==i,1,0)

# One-hot encoding of sex_selfID
sexID = df["sex_selfID"].value_counts()
for i in sexID.index:
    sex_self_ID = str(i)
    df[sex_self_ID] = np.where(df["sex_selfID"]==i,1,0)

# One-hot encoding of marital-status
mar_status = df["marital-status"].value_counts()
for i in mar_status.index:
    mar_stat_col = str(i)
    df[mar_stat_col] = np.where(df["marital-status"]==i,1,0)



# Drop relationship because its information largely repeats the marital-status column.
# Drop education because education-num already gives us that information.
# Drop race, marital-status and sex_selfID now that they are one-hot encoded.
df.drop(columns=['relationship', 'race', 'marital-status', 'sex_selfID', 'education'], inplace=True)

df.columns


Index(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
       'hours-per-week', 'US_national', 'Greater_than-50k_a_year', 'White',
       'Black', 'Asian-Pac-Islander', 'Amer-Indian-Inuit', 'Other',
       'Non-Female', 'Female', 'Married-civ-spouse', 'Never-married',
       'Divorced', 'Separated', 'Widowed', 'Married-spouse-absent',
       'Married-AF-spouse'],
      dtype='object')
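
The loops above build one 0/1 column per category by hand. A shorter route to a similar result is `pd.get_dummies`, sketched below on invented data. Note that it prefixes each new column with the source column name (for example `race_White` rather than `White`), so the resulting column names differ from the ones in this notebook. The `toy` DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical rows using a few of the category values seen in the census data
toy = pd.DataFrame({
    "race": ["White", "Black", "White", "Other"],
    "sex_selfID": ["Non-Female", "Female", "Female", "Non-Female"],
    "age": [39.0, 50.0, 38.0, 53.0],
})

# Replace each listed column with one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=["race", "sex_selfID"], dtype=int)

print(encoded.columns.tolist())
```

The prefixes also guard against name clashes when two different columns happen to share a category value such as `Other`.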

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

The data has been transformed by handling missing values and applying one-hot encoding. Since this data has many categorical features, we will use a decision tree classifier.

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [9]:
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find the best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [10]:
y = df['Greater_than-50k_a_year']
X = df.drop('Greater_than-50k_a_year', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=123)

print(X_train.shape)
print(X_test.shape) # the two sets add up to all 32,561 rows (70/30 split), with 21 features each

(22792, 21)
(9769, 21)


In [21]:
model = DecisionTreeClassifier(criterion = 'entropy', max_depth = 10, min_samples_leaf = 16)
model.fit(X_train, y_train)
class_label_predictions = model.predict(X_test)
acc_score = accuracy_score(y_test, class_label_predictions)
acc_score

0.8556658818712253
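
An accuracy score is easier to judge next to a baseline and a breakdown of the errors. The sketch below compares a decision tree with a classifier that always predicts the majority class, then prints the confusion matrix. It runs on synthetic data from `make_classification`; the 75/25 class split, the fixed `random_state`, and all variable names are assumptions for illustration, not measurements of the census data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data as a stand-in for the prepared features and label
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.75, 0.25], random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=123, stratify=y_demo
)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

# Decision tree with the same hyperparameters used in this notebook
tree = DecisionTreeClassifier(
    criterion="entropy", max_depth=10, min_samples_leaf=16, random_state=123
).fit(X_tr, y_tr)
tree_pred = tree.predict(X_te)
tree_acc = accuracy_score(y_te, tree_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, tree_pred, labels=[0, 1])

print("baseline accuracy:", baseline_acc)
print("tree accuracy:", tree_acc)
print(cm)
```

The gap between the two accuracy values, rather than the tree's accuracy alone, indicates how much the model has learned beyond the class distribution.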

In [None]:
# After adjusting the hyperparameters a few times, the model reaches an accuracy of about 0.856 on the test set, above the 70% threshold we treat as acceptable
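
Manual hyperparameter adjustment can be replaced by a systematic search. The sketch below uses scikit-learn's `GridSearchCV` with 5-fold cross-validation over `max_depth` and `min_samples_leaf`, so the test set is touched only once at the end. The synthetic data, the grid values, and the fixed `random_state` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data as a stand-in for the prepared census features and label
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=123
)

# Hypothetical grid of candidate hyperparameter values
param_grid = {
    "max_depth": [4, 8, 16],
    "min_samples_leaf": [1, 16, 64],
}

# Evaluate every combination with 5-fold cross-validation on the training data only
search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=123),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)

# Score the refitted best model once on the held-out test set
test_acc = search.score(X_te, y_te)
print("test accuracy:", test_acc)
```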
