# Assigment 8: Choose Your ML Problem and Data

In this unit's lab, you will implement a model to solve a machine learning problem of your choosing. First, you will have to make some decisions, such as which model to choose and which data preparation techniques may be necessary, and formulate a project plan accordingly. 

In this assignment, you will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate your  project plan. You will create this project plan in the written assignment that follows.


### Import Packages

Before you get started, import a few packages. You can import additional packages that you have used in this course that you may need for this task.

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import validation_curve


## Step 1: Choose Your Data Set and Load the Data

You will have the option to choose one of four data sets that you have worked with in this program:

* The "adult" data set that contains Census information from 1994: `adultData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load the Data Set

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "adultData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(adultDataSet_filename, header=0)

df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income_binary
0,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Non-Female,2174,0,40.0,United-States,<=50K
1,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,13.0,United-States,<=50K
2,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Non-Female,0,0,40.0,United-States,<=50K
3,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Non-Female,0,0,40.0,United-States,<=50K
4,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40.0,Cuba,<=50K


## Step 2: Choose Your Predictive Problem and Label 

Now that you have chosen your data set, you can: 

1. Choose what you would like to predict (i.e. the label) 
2. Identify your problem type: is it a classification or regression problem?

<b>Task:</b> In the markdown cell below, state what you are predicting (the label) and whether this is a classification or regression problem.

I want to predict if the income exceed 50k or not. So this is a binary classification problem. 

## Step 3: Inspect Your Data

In the code cell below, use some of the techniques you have learned in this course to take a look at your data. As you are investigating your data, consider the following to help you formulate your project plan:

1. What are my features?
5. Which model (or models) should I select that is appropriate for my machine learning problem and data?
6. Which data preparation techniques may be needed for my model (e.g. perform one-hot encoding)?
7. Which techniques should I use to evaluate my model's performance and improve my model?

Note: You will use this notebook to take a glimpse at your data to help you start making some considerations. In the written assignment you will outline your project plan, and in the lab assignment you will perform a deeper exploratory analysis of the data before implementing data preparation and feature engineering techniques.

<b>Task</b>: Use the techniques you have learned in this course to inspect your data.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-drown menu.


In [3]:
# obtain columns from the data set to create labeled examples
y = df['income_binary']
X = df.drop(columns = 'income_binary', axis=1)
X.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country
0,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Non-Female,2174,0,40.0,United-States
1,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,13.0,United-States
2,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Non-Female,0,0,40.0,United-States
3,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Non-Female,0,0,40.0,United-States
4,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40.0,Cuba


In [4]:
#check if class imbalance existed
class_distribution = y.value_counts()
print('Class Distribution: ' + str(class_distribution))

#calculate class ratios
class_ratios = class_distribution / len(df)
print('\nClass Ratios: ' + str(class_ratios))

Class Distribution: <=50K    24720
>50K      7841
Name: income_binary, dtype: int64

Class Ratios: <=50K    0.75919
>50K     0.24081
Name: income_binary, dtype: float64


In [5]:
# use the train_test_split() function to create training and test sets
# out of the labeled examples.

#handling missing values
X.fillna(X.mean(), inplace=True) 

#one-hot encoding to convert categorical variables into binary vectors
X_encoded = pd.get_dummies(X) 

#scale the numerical features
scaler = StandardScaler() 
X_scaled = scaler.fit_transform(X_encoded)

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled,y,test_size=0.20, 
                                                    random_state=1234)

In [11]:
# Fit a KNN model to train the data and calculate the accuracy
for k in range (1,5):
    KNN_model = KNeighborsClassifier(n_neighbors=k) #create the model
    KNN_model.fit(X_train, y_train) #fit the model
    
    KNN_prediction = KNN_model.predict(X_test) # make predictions on the test data
    KNN_acc_score = accuracy_score(y_test, KNN_prediction)
    #print('k='+str(k)+', Accuracy score of the KNN model: '+str(KNN_acc_score))
    
    cross_valid_scores = cross_val_score(KNN_model, X_train, y_train, cv=5 )
    KNN_mean_val_score = np.mean(cross_valid_scores)
    print('k='+str(k)+', Accuracy score of the KNN model: '+str(KNN_acc_score)
           + '\n\tCross validation score of the KNN model: ' + str(KNN_mean_val_score))
    
    print()
    
print('Done')

k=1, Accuracy score of the KNN model: 0.7956394902502687
	Cross validation score of the KNN model: 0.800061682699624

k=2, Accuracy score of the KNN model: 0.8113004759711346
	Cross validation score of the KNN model: 0.819948045037951

k=3, Accuracy score of the KNN model: 0.820205742361431
	Cross validation score of the KNN model: 0.822404858857529

k=4, Accuracy score of the KNN model: 0.8261937663135268
	Cross validation score of the KNN model: 0.8273956893594395

Done


In [12]:
# Fit a Decision Tree model to train the data and calculate the accuracy
max_depth_range = [2**i for i in range(5)]
for mx in max_depth_range:
    DT_model = DecisionTreeClassifier(criterion='entropy', max_depth=mx, min_samples_leaf=1)
    DT_model.fit(X_train, y_train)
    DT_class_label_predictions = DT_model.predict(X_test)
    DT_acc_score = accuracy_score(y_test, DT_class_label_predictions)
    
    train_scores, val_scores = validation_curve(DT_model, X_train, y_train, 
                                           param_name='max_depth',
                                           param_range=max_depth_range,
                                           cv=5)
    # Get the validation scores for the specific max_depth value(mx)
    validation_scores = val_scores[:, max_depth_range.index(mx)]
    # Get the mean of validation scores for the specific max_depth value(mx)
    mean_val_score = np.mean(validation_scores)
    
    print('Max depth='+str(mx)+', Accuracy score of the DT model: '+str(DT_acc_score)
          + '\n\t\tValidation score of the DT model: ' + str(mean_val_score))
    print()
    
print('Done')
    

Max depth=1, Accuracy score of the DT model: 0.7653922923384001
		Validation score of the DT model: 0.8165451055662188

Max depth=2, Accuracy score of the DT model: 0.8275756179947796
		Validation score of the DT model: 0.8314395393474088

Max depth=4, Accuracy score of the DT model: 0.8424689083371718
		Validation score of the DT model: 0.8212284069097888

Max depth=8, Accuracy score of the DT model: 0.8522954091816367
		Validation score of the DT model: 0.8279900172777884

Max depth=16, Accuracy score of the DT model: 0.8386304314448026
		Validation score of the DT model: 0.8238817431368783

Done


In [8]:
# Fit a Logistic regression model to train the data and calculate the accuracy
LR_model = LogisticRegression()
LR_model.fit(X_train, y_train)

# Probability predictions for log loss
LR_probability_predictions = LR_model.predict_proba(X_test)
LR_l_loss = log_loss(y_test, LR_probability_predictions)
print('Log loss of the Logictic Regression model: ' + str(LR_l_loss))

# Class label predictions for accuracy score
LR_class_label_predictions = LR_model.predict(X_test)
LR_acc_score = accuracy_score(y_test, LR_class_label_predictions)
print('Accuracy score of the Logistic Regression model: ' + str(LR_acc_score))

# Calculate ROC-AUC
LR_roc_auc = roc_auc_score(y_test, LR_probability_predictions[:,1])
print('ROC-AUC score of the Logictic Regression model: ' + str(LR_roc_auc))

Log loss of the Logictic Regression model: 0.3276934859110692
Accuracy score of the Logistic Regression model: 0.8490710885920467
ROC-AUC score of the Logictic Regression model: 0.8972373928066923
