# Lab 8: Implement Your Machine Learning Project Plan

In this lab assignment, you will implement the machine learning project plan you created in the written assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Perform exploratory data analysis on your data to determine which feature engineering and data preparation techniques you will use.
3. Prepare your data for your model and create features and a label.
4. Fit your model to the training data and evaluate your model.
5. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.

### Import Packages

Before you get started, import a few packages.

In [187]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split 
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score


<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need for this task.

In [188]:
# YOUR CODE HERE

## Part 1: Load the Data Set


You have chosen to work with one of four data sets. The data sets are located in a folder named "data." The file names of the four data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultData.csv`
* The Airbnb NYC "listings" data set is located in file `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`
* The book review data set is located in file `bookReviewsData.csv`



<b>Task:</b> In the code cell below, load your data using `pd.read_csv()`, as you have done throughout this course, and save it to DataFrame `df`.

In [189]:
# YOUR CODE HERE
adultDataSet_filename = os.path.join(os.getcwd(), "data", "adultData.csv")
df = pd.read_csv(adultDataSet_filename)


## Part 2: Exploratory Data Analysis

The next step is to inspect and analyze your data set with your machine learning problem and project plan in mind. 

This step will help you determine data preparation and feature engineering techniques you will need to apply to your data to build a balanced modeling data set for your problem and model. These data preparation techniques may include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI
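
As one illustration of the techniques listed above, winsorization can be sketched by clipping a column to chosen percentiles. This is a minimal sketch on a hypothetical toy column (the name `hours` and the 5th/95th percentile cutoffs are assumptions for illustration, not part of the lab data or plan).

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one extreme value; not taken from the lab data.
hours = pd.Series([35, 40, 40, 45, 38, 42, 99], name="hours")

# Compute the 5th and 95th percentiles, then clip values that fall outside them.
lower, upper = np.percentile(hours, [5, 95])
hours_winsorized = hours.clip(lower=lower, upper=upper)

print(hours_winsorized)
```

The cutoffs are a judgment call; whether winsorization is appropriate at all depends on what the exploratory analysis shows.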


Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
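
The inspection steps described above might look roughly like the following. This is a sketch on a hypothetical toy DataFrame (the column names and values are placeholders, not the lab's data); the same calls can be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data; column names and values are placeholders.
toy = pd.DataFrame({
    "age": [25, 38, 52, 41, 29, 60],
    "workclass": ["Private", "Private", "State-gov", None, "Private", "Self-emp"],
    "label": ["no", "no", "yes", "no", "no", "no"],
})

print(toy.dtypes)        # data type of each column
print(toy.describe())    # summary statistics for the numeric columns
print(toy.isna().sum())  # number of missing values per column

# Share of each class in the label; a lopsided split suggests class imbalance.
class_shares = toy["label"].value_counts(normalize=True)
print(class_shares)
```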


<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. 

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [190]:
# YOUR CODE HERE
# Drop the columns that could introduce bias into the model (race, sex_selfID, and native-country)
# and the label (income_binary). The fnlwgt and education-num columns are dropped as well.
features_list = df.drop(columns=['race', 'sex_selfID', 'native-country', 'income_binary', 'fnlwgt', 'education-num'])

In [191]:
#inspect the number of unique values for each column
features_list.nunique()

age                73
workclass           8
education          16
marital-status      7
occupation         14
relationship        6
capital-gain      106
capital-loss       92
hours-per-week     94
dtype: int64

In [192]:
#find any null values
features_list.isna().any()

age                True
workclass          True
education         False
marital-status    False
occupation         True
relationship      False
capital-gain      False
capital-loss      False
hours-per-week     True
dtype: bool

In [193]:
# Replace the missing values in the age column with the column mean.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection.
mean_ages = features_list['age'].mean()
features_list['age'] = features_list['age'].fillna(mean_ages)

In [194]:

# Replace the missing values in the hours-per-week column with the column mean.
mean_hours_per_week = features_list['hours-per-week'].mean()
features_list['hours-per-week'] = features_list['hours-per-week'].fillna(mean_hours_per_week)

In [195]:
# One-hot encode each categorical column using its 10 most frequent values.
# Rows with a missing or less frequent value get 0 in every indicator column.
top_10_workclass = features_list['workclass'].value_counts().head(10).index
for value in top_10_workclass: 
    features_list['workclass'+ value ] = np.where(features_list['workclass']== value, 1,0)
features_list.drop(columns = 'workclass', inplace = True)

In [196]:
top_10_education = features_list['education'].value_counts().head(10).index
for value in top_10_education:
    features_list['education'+ value] = np.where(features_list['education']== value, 1,0)
features_list.drop(columns = 'education', inplace = True)

In [197]:
top_10_maritalstatus = features_list['marital-status'].value_counts().head(10).index
for value in top_10_maritalstatus:
    features_list['marital-status' + value] = np.where(features_list['marital-status'] == value, 1, 0)
features_list.drop(columns = 'marital-status', inplace = True)

In [198]:
top_10_occupation = features_list['occupation'].value_counts().head(10).index
for value in top_10_occupation:
    features_list['occupation' + value] = np.where(features_list ['occupation']== value, 1, 0)
features_list.drop(columns = 'occupation', inplace = True)

In [199]:
top_10_relationship = features_list['relationship'].value_counts().head(10).index
for value in top_10_relationship:
    features_list['relationship'+ value] = np.where(features_list['relationship']== value, 1, 0)
features_list.drop(columns = 'relationship', inplace = True)
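
The loops above could also be written more compactly with `pd.get_dummies()`. The following is a sketch on a hypothetical toy column, keeping the top 2 values instead of the top 10 so the output stays small. Note that `get_dummies` inserts an underscore between the prefix and the value, so the resulting column names differ from those produced by the loops.

```python
import pandas as pd

# Hypothetical toy column; the values are placeholders.
toy = pd.DataFrame(
    {"workclass": ["Private", "State-gov", "Private", None, "Self-emp", "State-gov"]}
)

# Keep only the most frequent categories (top 2 here, to keep the toy example small).
top_values = toy["workclass"].value_counts().head(2).index
kept = toy["workclass"].where(toy["workclass"].isin(top_values))

# One indicator column per kept category; all other rows get zeros everywhere.
encoded = pd.get_dummies(kept, prefix="workclass", dtype=int)
print(encoded)
```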



## Part 3: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. You will:

1. Prepare your data for your model and create features and a label.
2. Fit your model to the training data and evaluate your model.
3. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.


Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.

In [200]:
# YOUR CODE HERE
#Now take the encoded data and apply a logistic regression model to predict the income binary
#assign the label to y, which is income binary 
#assign the features to X, which is the features_list
y = df['income_binary']
X = features_list

In [201]:
# create the training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .20, random_state = 1234)

In [202]:
# fit a logistic regression model to the data
def train_test_LR(X_train, y_train, X_test, y_test, c = 1):
    model = LogisticRegression(C = c, max_iter = 1000)
    model.fit(X_train, y_train)
    probability_predictions = model.predict_proba(X_test)
    l_loss = log_loss(y_test, probability_predictions)
    class_label_predictions = model.predict(X_test)
    acc_score = accuracy_score(y_test, class_label_predictions)
    
    return l_loss, acc_score

In [203]:
#analyze the results of the logistic regression model
loss, acc = train_test_LR(X_train, y_train, X_test, y_test)
print('Log loss: ' + str(loss))
print('Accuracy: ' + str(acc))

Log loss: 0.3376896296375336
Accuracy: 0.8433901427913404


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
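
The warning above suggests scaling the data. One possible way to do that is to place a `StandardScaler` in front of the logistic regression in a scikit-learn pipeline. This is a sketch on synthetic data (the arrays `X_toy` and `y_toy` are made up for illustration); whether it removes the warning on the lab data has not been verified here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with two features on very different scales (illustration only).
rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.normal(40, 12, 200), rng.normal(5000, 20000, 200)])
y_toy = (X_toy[:, 0] + X_toy[:, 1] / 2000 > 42).astype(int)

# Standardize each feature to zero mean and unit variance, then fit the model.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_toy, y_toy)

print(scaled_model.score(X_toy, y_toy))
```

When a pipeline like this is fit on `X_train`, the scaler's statistics come from the training data only and are then reused to transform `X_test`.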


In [None]:
# Try a range of values for the regularization hyperparameter C, from 10^-10 to 10^9
cs = [10 ** i for i in range(-10, 10)]


In [None]:
#find the log loss and accuracy score for every model
ll_cs = []
acc_cs = []
for c in cs:
    l_loss, acc_score = train_test_LR(X_train, y_train, X_test, y_test, c)
    ll_cs.append(l_loss)
    acc_cs.append(acc_score)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
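
Once the loop has filled `ll_cs` and `acc_cs`, a natural next step is to pick the value of C with the lowest log loss. The sketch below uses hypothetical stand-in lists (`cs_demo` and `ll_demo` are made-up values, not results from this notebook).

```python
import numpy as np

# Hypothetical values standing in for cs and ll_cs; not results from this notebook.
cs_demo = [0.01, 0.1, 1, 10]
ll_demo = [0.52, 0.41, 0.34, 0.35]

# Find the index of the smallest log loss, then look up the C value at that index.
best_index = int(np.argmin(ll_demo))
best_c = cs_demo[best_index]

print(best_c)
```

Because the scores in this notebook are computed on the test set, a value of C chosen this way may look better than it would on new data; a separate validation set or cross-validation would give a less optimistic estimate.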
