## CSC4815 Machine Learning
Homework 4 due on 4/22 at midnight

__Noah Foilb__

<div class="alert alert-block alert-danger">

Rules:
- All the problems must be solved programmatically. No manual solution will be accepted. Do not hard-code constants unless they are given as parts of a problem.
- Solve each problem only in the given cell and show the final result. Do not leave debugging information.  __Do not add cells__.
- You may not import any external modules other than numpy, pandas, sklearn, and matplotlib. Take advantage of the code examples accompanying each sklearn API on their website.
- Always try to make the output informative and intuitive. That's what your client will care about in the end. Add description/label to the output values.
- <font color='red'>A solution ending with a syntax or runtime error will get zero points no matter how much you worked on it. It will be much better to submit an error-free, partial solution than a solution which never runs.</font>

</div>

Execute the following cell to ensure you are using Python 3 or above.

In [1]:
!python --version

Python 3.8.3


In [2]:
import time
start_time = time.time()

# Problem 1

[9 points] Your task is to perform classification on the attached dataset using the Decision Tree, Random Forest, and AdaBoost algorithms. This task involves the end-to-end machine learning workflow. Use the following information and guidelines:

- The attribute names of the dataset, in the given order: age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, captial-loss, hours-per-week, native-country, and income. income is the class label.
- In theory, Decision Trees can handle categorical attributes. However, the sklearn implementation accepts numerical attributes only.
- When the dataset is preprocessed, use the 70% hold-out method with random_state=42 before the first training begins.
- Train each algorithm using the same train set (X_train, y_train). Use the default configuration for the algorithms except setting the random_state to 42.
- Evaluate the performance of each trained classifier in terms of precision, recall and f-1 on the test set. Additionally, for each trained classifier, compare its accuracy on the training set (X_train, y_train), test set (X_test, y_test) and entire set (X, y).
- The minimum requirement for partial points is an accuracy better than 80% from each algorithm on each of the training, test and entire set. Additional requirements are given below.
- Each output from your code must be annotated informatively and meaningfully. (In other words, do not display numbers without a proper context.)
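
The preprocessing implied by the guidelines above can be sketched on a toy example: one-hot encode the categorical attributes so that sklearn's tree-based models receive numbers only, then apply a 70/30 hold-out split. The small DataFrame and its values are made up for illustration; only the column names come from the attribute list, and the variable names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical values); the real dataset has many more rows and columns.
toy = pd.DataFrame({
    "age": [25, 38, 52, 41, 33, 47, 29, 60, 36, 44],
    "workclass": ["Private", "Private", "State-gov", "Private", "Self-emp",
                  "Private", "State-gov", "Self-emp", "Private", "Private"],
    "sex": ["Male", "Female", "Male", "Male", "Female",
            "Female", "Male", "Male", "Female", "Male"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K",
               ">50K", "<=50K", ">50K", "<=50K", ">50K"],
})

# Binary class label: 1 for '>50K', 0 for '<=50K'.
y_toy = (toy["income"] == ">50K").astype(int)

# One-hot encode every categorical attribute in a single call.
X_toy = pd.get_dummies(toy.drop(columns="income"), columns=["workclass", "sex"])

# 70% of the rows for training, 30% for testing, reproducible via random_state.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

print(f"Encoded columns: {list(X_toy.columns)}")
print(f"Train rows: {len(X_tr)}, test rows: {len(X_te)}")
```

Passing `columns=` to `pd.get_dummies` encodes several attributes at once and drops the original string columns, which avoids a separate `drop` step for them.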

In [3]:
# All imports go in this cell only.
# You may import numpy, pandas and sklearn modules only.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

In [4]:
# Load data
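# Note: by default read_csv treats the file's first line as the header row;
# if the file has no header row, pass header=None so that first record is kept.
df = pd.read_csv("data",delimiter=',')         # Read the comma-separated file into a DataFrame
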
df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',    # Rename Columns
              'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'captial-loss',
              'hours-per-week', 'native-country','income']

print("The Schema of the data is:","\n")
df.head(5)


The Schema of the data is: 



| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | captial-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 1 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 2 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 4 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |


In [5]:
# Prepare data
#df.isna().sum()               #There is no na values!
#df['income'].unique()         # Income consists of <=50k and >50k

# Change categorical variables to binary columns via one-hot encoding

# Workclass
Df = pd.get_dummies(df['workclass'], prefix='workclass')        # Create the one-hot encoding
df = pd.concat([df,Df],axis = 1)                                # '?' entries are kept as their own category; '?' is not the same as NA

# Native-Country
Df = pd.get_dummies(df['native-country'], prefix='native-country')        # Create the one hot encoding
df = pd.concat([df,Df],axis = 1)  

# Marital Status
Df = pd.get_dummies(df['marital-status'], prefix='marital-status')        # Create the one hot encoding
df = pd.concat([df,Df],axis = 1)  

# Relationship
Df = pd.get_dummies(df['relationship'], prefix='relationship')        # Create the one hot encoding
df = pd.concat([df,Df],axis = 1)  

# Race
Df = pd.get_dummies(df['race'], prefix='race')        # Create the one hot encoding
df = pd.concat([df,Df],axis = 1)  

# Income
df['income'] = df['income'].str.replace('<=50K','0')    # Replace <=50K with 0 and >50K with 1
df['income'] = df['income'].str.replace('>50K','1')     # .str applies the replacement to every value in the column
df['income'] = df['income'].astype('int')

# Sex
df['sex'] = df['sex'].str.replace('Male','0')
df['sex'] = df['sex'].str.replace('Female','1')
df['sex'] = df['sex'].astype('int')

# Drop columns
df = df.drop(columns = ['fnlwgt','education','occupation','workclass','native-country','marital-status','relationship','race']) # drop the original categorical columns that were encoded above, plus fnlwgt, education and occupation, which are not used

# Prepare train and test sets:
X = df.drop(columns = 'income')                  # Original X, y
y = df['income']        

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3,random_state = 42)    # Test/train datasets


# Uncomment to show the shapes of the entire, training and test sets.
# Do not modify variable names.

print(f'X Shape {X.shape}, y Shape {y.shape}')
print(f'X_train Shape {X_train.shape}, y_train Shape {y_train.shape}')
print(f'X_test Shape {X_test.shape}, y_test Shape {y_test.shape}')


X Shape (32560, 75), y Shape (32560,)
X_train Shape (22792, 75), y_train Shape (22792,)
X_test Shape (9768, 75), y_test Shape (9768,)


In [6]:
# Train and evaluate Decision Tree, Random Forest, and AdaBoost here.
# Algorithms must be trained on the train set only.
# No data processing or preparation may take place in this step.
# Generate a confusion matrix, precision, recall and f-1 score
# from each trained classifier on the test set.
# In addition, compare the accuracy of each trained algorithm
# on each of the train, test and entire sets.
# Label each output clearly and informatively.

# CLASSIFIER
# 1 Train model
# 2 Confusion matrix, precision, recall and f-1 score on test set
# 3 Now compute accuracy on the train, test and entire dataset. All have to be >80%



# DECISION TREE
# 1 Train model
Dec_Tree = DecisionTreeClassifier(random_state=42)
Dec_Tree.fit(X_train,y_train)

# 2 Confusion matrix, precision, recall and f-1 score on test set
y_pred = Dec_Tree.predict(X_test)
Dec_c = confusion_matrix(y_test,y_pred)
Dec_p = precision_score(y_test,y_pred)
Dec_r = recall_score(y_test,y_pred)
Dec_f = f1_score(y_test,y_pred)

# 3 Now compute accuracy on the train, test and full dataset. All have to be >80%
# Test set
y_pred = Dec_Tree.predict(X_test)
Dec_te = accuracy_score(y_test,y_pred)

# Train set
y_pred = Dec_Tree.predict(X_train)
Dec_tr = accuracy_score(y_train,y_pred)

# Entire set
y_pred = Dec_Tree.predict(X)
Dec_y = accuracy_score(y,y_pred)



# RANDOM FOREST
# 1 Train model
Ran_For = RandomForestClassifier(random_state=42)
Ran_For.fit(X_train,y_train)

# 2 Confusion matrix, precision, recall and f-1 score on test set
y_pred = Ran_For.predict(X_test)
Ran_c = confusion_matrix(y_test,y_pred)
Ran_p = precision_score(y_test,y_pred)
Ran_r = recall_score(y_test,y_pred)
Ran_f = f1_score(y_test,y_pred)

# 3 Now compute accuracy on the train, test and full dataset. All have to be >80%
# Test set
y_pred = Ran_For.predict(X_test)
Ran_te = accuracy_score(y_test,y_pred)

# Train set
y_pred = Ran_For.predict(X_train)
Ran_tr = accuracy_score(y_train,y_pred)

# Entire set
y_pred = Ran_For.predict(X)
Ran_y = accuracy_score(y,y_pred)



# ADABOOST
# 1 Train model
Ada_Boo = AdaBoostClassifier(random_state=42)
Ada_Boo.fit(X_train,y_train)

# 2 Confusion matrix, precision, recall and f-1 score on test set
y_pred = Ada_Boo.predict(X_test)
Ada_c = confusion_matrix(y_test,y_pred)
Ada_p = precision_score(y_test,y_pred)
Ada_r = recall_score(y_test,y_pred)
Ada_f = f1_score(y_test,y_pred)

# 3 Now compute accuracy on the train, test and full dataset. All have to be >80%
# Test set
y_pred = Ada_Boo.predict(X_test)
Ada_te = accuracy_score(y_test,y_pred)

# Train set
y_pred = Ada_Boo.predict(X_train)
Ada_tr = accuracy_score(y_train,y_pred)

# Entire set
y_pred = Ada_Boo.predict(X)
Ada_y = accuracy_score(y,y_pred)



# Display results
# Show confusion matrix, precision, recall, and f-1 score. Then compare each model's accuracy on the test, train and entire dataset.
# Show up to 5 decimal digits
print("For each Model I will show the Confusion Matrix, Accuracy, Precision, Recall, and f-1 score. Then show the", "\n"+
      "accuracy of the model against the test, train and entire dataset","\n"*2,
     "\t","Decision Tree","\n",
     "Confusion Matrix:","\n",Dec_c,"\n",
     "Precision:",Dec_p.round(5),"\n",
     "Recall:", Dec_r.round(5), "\n",
     "F-1 Score:", Dec_f.round(5),"\n",
     "Accuracy","\n","\t"+
     "Test dataset:", Dec_te.round(5),"\n","\t"+
     "Train dataset:",Dec_tr.round(5),"\n","\t"+
     "Entire dataset:", Dec_y.round(5),"\n"*3,
     "\t","Random Forest","\n",
     "Confusion Matrix:","\n",Ran_c,"\n",
     "Precision:",Ran_p.round(5),"\n",
     "Recall:", Ran_r.round(5), "\n",
     "F-1 Score:", Ran_f.round(5),"\n",
     "Accuracy","\n","\t"+
     "Test dataset:", Ran_te.round(5),"\n","\t"+
     "Train dataset:",Ran_tr.round(5),"\n","\t"+
     "Entire dataset:", Ran_y.round(5),"\n"*3,
     "\t","AdaBoost","\n",
     "Confusion Matrix:","\n",Ada_c,"\n",
     "Precision:",Ada_p.round(5),"\n",
     "Recall:", Ada_r.round(5), "\n",
     "F-1 Score:", Ada_f.round(5),"\n",
     "Accuracy","\n","\t"+
     "Test dataset:", Ada_te.round(5),"\n","\t"+
     "Train dataset:",Ada_tr.round(5),"\n","\t"+
     "Entire dataset:", Ada_y.round(5),"\n"*3,)



For each Model I will show the Confusion Matrix, Accuracy, Precision, Recall, and f-1 score. Then show the 
accuracy of the model against the test, train and entire dataset 

 	 Decision Tree 
 Confusion Matrix: 
 [[6590  805]
 [ 954 1419]] 
 Precision: 0.63804 
 Recall: 0.59798 
 F-1 Score: 0.61736 
 Accuracy 
 	Test dataset: 0.81992 
 	Train dataset: 0.95691 
 	Entire dataset: 0.91582 


 	 Random Forest 
 Confusion Matrix: 
 [[6796  599]
 [ 917 1456]] 
 Precision: 0.70852 
 Recall: 0.61357 
 F-1 Score: 0.65763 
 Accuracy 
 	Test dataset: 0.8448 
 	Train dataset: 0.95691 
 	Entire dataset: 0.92328 


 	 AdaBoost 
 Confusion Matrix: 
 [[6971  424]
 [ 977 1396]] 
 Precision: 0.76703 
 Recall: 0.58828 
 F-1 Score: 0.66587 
 Accuracy 
 	Test dataset: 0.85657 
 	Train dataset: 0.85899 
 	Entire dataset: 0.85826 
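
The three evaluation blocks above repeat the same steps for each classifier. As a minimal sketch, assuming a hypothetical helper named `evaluate` and synthetic data from `make_classification` (not the homework dataset), the repeated logic could be collected in one function:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, splits):
    """Return test-set precision/recall/F1 and the accuracy on every named split.

    `splits` maps a name to an (X, y) pair and must contain a 'test' entry.
    """
    X_te, y_te = splits["test"]
    y_hat = model.predict(X_te)
    report = {
        "precision": precision_score(y_te, y_hat),
        "recall": recall_score(y_te, y_hat),
        "f1": f1_score(y_te, y_hat),
    }
    for name, (X_part, y_part) in splits.items():
        report[f"accuracy_{name}"] = accuracy_score(y_part, model.predict(X_part))
    return report


# Synthetic stand-in for the prepared data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
demo_report = evaluate(
    tree,
    {"train": (X_tr, y_tr), "test": (X_te, y_te), "entire": (X_demo, y_demo)},
)

for metric, value in demo_report.items():
    print(f"{metric}: {value:.5f}")
```

The same call would work unchanged for a Random Forest or AdaBoost model, since it relies only on `predict`.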





In this cell, explain the meaning of each number found in the confusion matrix from AdaBoost with respect to class labels. Do not simply label them as 'TP', 'FP', etc.

In [7]:
print("Meaning of each number in the AdaBoost confusion matrix (rows = actual income class, columns = predicted income class):\n"
      f"Top left ({Ada_c[0, 0]}): people who earn <=50K and were correctly predicted as <=50K\n"
      f"Top right ({Ada_c[0, 1]}): people who earn <=50K but were wrongly predicted as >50K\n"
      f"Bottom left ({Ada_c[1, 0]}): people who earn >50K but were wrongly predicted as <=50K\n"
      f"Bottom right ({Ada_c[1, 1]}): people who earn >50K and were correctly predicted as >50K")

Meaning of each number in the AdaBoost confusion matrix (rows = actual income class, columns = predicted income class):
Top left (6971): people who earn <=50K and were correctly predicted as <=50K
Top right (424): people who earn <=50K but were wrongly predicted as >50K
Bottom left (977): people who earn >50K but were wrongly predicted as <=50K
Bottom right (1396): people who earn >50K and were correctly predicted as >50K
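
The layout of sklearn's confusion matrix can be checked on a toy example: rows are the actual classes and columns are the predicted classes, both sorted by label, so with labels 0 and 1 the top-left cell counts actual-0 samples predicted as 0. The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 0 stands for '<=50K', 1 stands for '>50K'.
actual    = [0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(actual, predicted)

# ravel() walks the 2x2 matrix row by row: [0,0], [0,1], [1,0], [1,1].
actual0_pred0, actual0_pred1, actual1_pred0, actual1_pred1 = cm.ravel()

print("Confusion matrix (rows = actual, columns = predicted):")
print(cm)
print(f"Actual <=50K predicted <=50K: {actual0_pred0}")
print(f"Actual <=50K predicted >50K:  {actual0_pred1}")
print(f"Actual >50K predicted <=50K:  {actual1_pred0}")
print(f"Actual >50K predicted >50K:   {actual1_pred1}")

# Precision and recall for class 1 follow directly from the cells.
manual_precision = actual1_pred1 / (actual1_pred1 + actual0_pred1)
manual_recall = actual1_pred1 / (actual1_pred1 + actual1_pred0)
print(f"Precision from cells: {manual_precision:.5f}")
print(f"Recall from cells:    {manual_recall:.5f}")
```

Applying the same formulas to the AdaBoost matrix above gives 1396 / (1396 + 424) for precision and 1396 / (1396 + 977) for recall, which matches the reported 0.76703 and 0.58828.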


# Problem 2

(1 point) Improve the accuracy of AdaBoost to better than 90% when it is trained on the same train set and evaluated on the entire set. You may configure AdaBoost differently from the default configuration. However, you may not make modifications to the train and entire sets prepared in Problem 1.

In [8]:
# Train and evaluate AdaBoost only
# ADABOOST
# 1 Train model
Ada_Boo =  AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=5), # base estimator: deeper than the default depth-1 stump
    n_estimators=200, # The maximum number of estimators
    algorithm="SAMME.R",
    learning_rate=0.5, # shrinks the contribution of each classifier by learning_rate
    random_state=42)
Ada_Boo.fit(X_train,y_train)

# 2 Compute accuracy on the entire dataset. Has to be >90%
y_pred = Ada_Boo.predict(X)
Acc = accuracy_score(y,y_pred)


print("The accuracy of the improved AdaBoost is:", Acc.round(5)*100, "percent")

The accuracy of the improved AdaBoost is: 90.289 percent
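
To illustrate why a deeper base estimator raises the entire-set accuracy, here is a minimal sketch on synthetic data (an assumption; it is not the homework dataset) comparing default stumps with depth-5 trees. Because the entire set contains the training rows, a higher-capacity ensemble that fits the training data more closely also scores higher there, which is not the same as generalizing better.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Default configuration: boosted depth-1 decision stumps.
stump_boost = AdaBoostClassifier(random_state=42).fit(X_tr, y_tr)

# Deeper base estimator, passed positionally because the keyword name
# differs between sklearn releases (base_estimator vs. estimator).
deep_boost = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=5, random_state=42),
    n_estimators=50,
    learning_rate=0.5,
    random_state=42,
).fit(X_tr, y_tr)

results = {}
for label, model in [("stumps", stump_boost), ("depth-5 trees", deep_boost)]:
    acc_entire = accuracy_score(y_demo, model.predict(X_demo))
    acc_test = accuracy_score(y_te, model.predict(X_te))
    results[label] = (acc_entire, acc_test)
    print(f"AdaBoost with {label}: entire-set accuracy {acc_entire:.5f}, "
          f"test-set accuracy {acc_test:.5f}")
```

Comparing the entire-set and test-set figures side by side shows how much of any gain comes from the training rows alone.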


In [9]:
print(f"--- {time.time() - start_time} seconds ---")

--- 21.76927638053894 seconds ---
