# Assignment 1
Consider the dataset Assignment01_Lasagna_Triers.xlsx
File location: https://drive.google.com/drive/folders/1Jl8iDu7nGmrqCECbrLqmVafgwE5PYfiU

The file contains details of people in an area, along with whether each of them has tried Lasagna at an Italian restaurant chain.
Train a decision tree classifier on the given data to predict whether someone has tried Lasagna or not.
Use an 80/20 split for train/test.
    
    1) What is the train and test accuracy score?
    2) Which features come out to be important?
    3) Does grouping 'age' and 'income' into 5 categories each improve the prediction score?

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_excel("Assignment01_Lasagna_Triers.xlsx", sheet_name="Data")
print(df.shape)
df.head()

(856, 13)


Unnamed: 0,Person,Age,Weight,Income,Pay Type,Car Value,CC Debt,Gender,Live Alone,Dwell Type,Mall Trips,Nbhd,Have Tried
0,1,48,175,65500,Hourly,2190,3510,Male,No,Home,7,East,No
1,2,33,202,29100,Hourly,2110,740,Female,No,Condo,4,East,Yes
2,3,51,188,32200,Salaried,5140,910,Male,No,Condo,1,East,No
3,4,56,244,19000,Hourly,700,1620,Female,No,Home,3,West,No
4,5,28,218,81400,Salaried,26620,600,Male,No,Apt,3,West,Yes


In [3]:
#df.info() # very clean data set
#df.columns

# Question 1 and 2

In [4]:
cat_var_list = ['Pay Type', 'Gender', 'Live Alone', 'Dwell Type', 'Nbhd']
num_var_list = ['Age', 'Weight', 'Income', 'Car Value', 'CC Debt', 'Mall Trips']
df = pd.get_dummies(df, columns=cat_var_list)
df['target'] = np.where(df['Have Tried'] == "Yes", 1, 0)
df.head(2)

Unnamed: 0,Person,Age,Weight,Income,Car Value,CC Debt,Mall Trips,Have Tried,Pay Type_Hourly,Pay Type_Salaried,...,Gender_Male,Live Alone_No,Live Alone_Yes,Dwell Type_Apt,Dwell Type_Condo,Dwell Type_Home,Nbhd_East,Nbhd_South,Nbhd_West,target
0,1,48,175,65500,2190,3510,7,No,1,0,...,1,1,0,0,0,1,1,0,0,0
1,2,33,202,29100,2110,740,4,Yes,1,0,...,0,1,0,0,1,0,1,0,0,1


In [5]:
x_var = ['Pay Type_Hourly', 'Pay Type_Salaried', 'Gender_Female', 'Gender_Male', 'Live Alone_No', 
         'Live Alone_Yes', 'Dwell Type_Apt', 'Dwell Type_Condo', 'Dwell Type_Home', 'Nbhd_East', 'Nbhd_South', 'Nbhd_West'
        ] + num_var_list
y_var = 'target'

In [6]:
# verify encoding
pd.crosstab(df['Have Tried'], df['target'])

target,0,1
Have Tried,Unnamed: 1_level_1,Unnamed: 2_level_1
No,361,0
Yes,0,495


In [7]:
# perform 80:20 split for training and validation
x_train, x_test, y_train, y_test = train_test_split(df[x_var], df[y_var], test_size=0.2, random_state=0, stratify=df[y_var])

In [8]:
print(x_train.shape)
print(y_train.shape)

(684, 18)
(684,)


In [9]:
# Get all tunable parameters for decision tree
#DecisionTreeClassifier().get_params()

In [10]:
# perform a 5 fold cross-validation to get best parameters for model training based on accuracy
# note: scikit-learn requires an integer min_samples_split to be at least 2, so the range starts at 2
tune_parm_space = {'min_samples_split':range(2, 20),
                   'max_depth':range(1, 20),
                   'criterion':['gini', 'entropy'],
                   'min_samples_leaf':range(1, 20)
                  }

clf = GridSearchCV(DecisionTreeClassifier(), tune_parm_space, cv=5)
clf.fit(x_train, y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': range(1, 20),
                         'min_samples_leaf': range(1, 20),
                         'min_samples_split': range(2, 20)})

In [11]:
print(clf.best_estimator_)
print(clf.best_params_)
print(clf.best_score_)

DecisionTreeClassifier(criterion='entropy', max_depth=4, min_samples_leaf=17)
{'criterion': 'entropy', 'max_depth': 4, 'min_samples_leaf': 17, 'min_samples_split': 2}
0.8084585659081152
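
The `best_score_` printed above is the mean accuracy over the held-out folds for the winning parameter combination, so it need not match the accuracy on the full training set. A minimal sketch of that relationship, assuming synthetic data from `make_classification` and a much smaller, hypothetical parameter grid than the one used in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Lasagna dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Small illustrative grid; random_state pins the tree's tie-breaking
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 10]},
    cv=5,
)
search.fit(X, y)

# best_score_ is the mean held-out fold accuracy of the best candidate
mean_cv_of_best = search.cv_results_["mean_test_score"][search.best_index_]
# Accuracy of the refitted best model on all of the training data
train_accuracy = search.best_estimator_.score(X, y)

print(f"mean CV accuracy of best candidate: {mean_cv_of_best:.4f}")
print(f"best_score_:                        {search.best_score_:.4f}")
print(f"accuracy on full training data:     {train_accuracy:.4f}")
```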


In [12]:
train_accuracy = clf.best_estimator_.score(x_train, y_train)
test_accuracy = clf.best_estimator_.score(x_test, y_test)
without_model_accuracy = y_train.value_counts()[1] / (np.sum(y_train.value_counts())) # baseline: accuracy of always predicting 1 (the majority class)

print(f"The train accuracy is     {np.round(train_accuracy * 100, 2)}%")
print(f"The test accuracy is      {np.round(test_accuracy * 100, 2)}%")
print(f"Without model accuracy is {np.round(without_model_accuracy * 100, 2)}%")

The train accuracy is     82.02%
The test accuracy is      77.33%
Without model accuracy is 57.89%
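
The "without model" figure is the accuracy of always predicting the majority class. As a rough cross-check, the same baseline can be obtained from scikit-learn's `DummyClassifier`. The sketch below assumes 396 positives among the 684 training rows, a count inferred from the 57.89% printed above rather than read from the data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the assumed training class balance: 396 ones, 288 zeros
y = np.array([1] * 396 + [0] * 288)
X = np.zeros((len(y), 1))  # placeholder features; the dummy model ignores them

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)

print(f"Baseline accuracy: {baseline_accuracy * 100:.2f}%")
```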


In [13]:
feature_importance = pd.Series(clf.best_estimator_.feature_importances_)
feature_importance.index = x_var
feature_importance.sort_values(ascending=False)

Mall Trips           0.620779
Nbhd_West            0.157280
Pay Type_Salaried    0.092745
Age                  0.077253
Nbhd_East            0.039684
Income               0.012258
Live Alone_Yes       0.000000
Dwell Type_Apt       0.000000
Dwell Type_Condo     0.000000
Dwell Type_Home      0.000000
Live Alone_No        0.000000
Nbhd_South           0.000000
Gender_Male          0.000000
Gender_Female        0.000000
Weight               0.000000
Car Value            0.000000
CC Debt              0.000000
Pay Type_Hourly      0.000000
dtype: float64
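
Each two-level categorical variable is encoded here as a pair of complementary dummy columns (for example `Pay Type_Hourly` and `Pay Type_Salaried`), so a split on either column is equally good. scikit-learn permutes the features at each split, which means such ties can be resolved differently from run to run unless `random_state` is fixed, and the importance may land on either column. A minimal sketch of one way to reduce this ambiguity, using a made-up `toy` frame (hypothetical data, not the assignment file), `drop_first=True`, and a fixed seed:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the target follows Pay Type exactly
toy = pd.DataFrame({
    "Pay Type": ["Hourly", "Salaried", "Hourly", "Salaried", "Hourly", "Salaried"],
    "Mall Trips": [7, 4, 1, 3, 3, 8],
    "target": [0, 1, 0, 1, 0, 1],
})

# drop_first=True keeps one dummy per two-level variable, removing the redundant twin
encoded = pd.get_dummies(toy, columns=["Pay Type"], drop_first=True)
features = [c for c in encoded.columns if c != "target"]

# random_state fixes the feature permutation used to break ties between splits
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(encoded[features], encoded["target"])

importance = pd.Series(tree.feature_importances_, index=features)
print(importance.sort_values(ascending=False))
```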

# Question 3
Does grouping `age` and `income` into 5 categories each improve the prediction score?

In [14]:
df = pd.read_excel("Assignment01_Lasagna_Triers.xlsx", sheet_name="Data")

# divide age and income into 5 equal-population buckets (quintiles), just as deciles divide into 10 equal-population buckets
df['Age_bucket'] = pd.qcut(df['Age'], q=5)
df['Income_bucket'] = pd.qcut(df['Income'], q=5)

# Add dummies for categorical variables
cat_var_list = ['Pay Type', 'Gender', 'Live Alone', 'Dwell Type', 'Nbhd', 'Age_bucket', 'Income_bucket']
num_var_list = ['Weight', 'Car Value', 'CC Debt', 'Mall Trips']
df = pd.get_dummies(df, columns=cat_var_list)
df['target'] = np.where(df['Have Tried'] == "Yes", 1, 0)

df.head(2)

Unnamed: 0,Person,Age,Weight,Income,Car Value,CC Debt,Mall Trips,Have Tried,Pay Type_Hourly,Pay Type_Salaried,...,"Age_bucket_(30.0, 34.0]","Age_bucket_(34.0, 40.0]","Age_bucket_(40.0, 48.0]","Age_bucket_(48.0, 64.0]","Income_bucket_(2599.999, 21700.0]","Income_bucket_(21700.0, 33700.0]","Income_bucket_(33700.0, 46900.0]","Income_bucket_(46900.0, 64300.0]","Income_bucket_(64300.0, 190500.0]",target
0,1,48,175,65500,2190,3510,7,No,1,0,...,0,0,1,0,0,0,0,0,1,0
1,2,33,202,29100,2110,740,4,Yes,1,0,...,1,0,0,0,0,1,0,0,0,1
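
The bucket edges above come from `pd.qcut`, which cuts at quantiles so that each bucket holds roughly the same number of people. For contrast, `pd.cut` with an integer `bins` argument produces equal-width intervals instead. A small sketch of the difference, assuming synthetic right-skewed incomes (the `income` series below is generated, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed incomes (an assumption for illustration only)
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10.5, sigma=0.6, size=500), name="Income")

quantile_bins = pd.qcut(income, q=5)  # edges at the 20th, 40th, 60th, 80th percentiles
width_bins = pd.cut(income, bins=5)   # five intervals of equal width

quantile_counts = quantile_bins.value_counts().sort_index()
width_counts = width_bins.value_counts().sort_index()

print("Equal-population buckets (qcut):")
print(quantile_counts)
print("Equal-width buckets (cut):")
print(width_counts)
```

With skewed data, the equal-width version tends to put most rows in the lowest bucket, which is one reason quantile buckets are a reasonable choice for a variable like income.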


In [15]:
x_var = ['Pay Type_Hourly', 'Pay Type_Salaried', 'Gender_Female', 'Gender_Male', 'Live Alone_No', 'Live Alone_Yes',
         'Dwell Type_Apt', 'Dwell Type_Condo', 'Dwell Type_Home', 'Nbhd_East', 'Nbhd_South', 'Nbhd_West', 
         'Age_bucket_(21.999, 30.0]', 'Age_bucket_(30.0, 34.0]', 'Age_bucket_(34.0, 40.0]', 'Age_bucket_(40.0, 48.0]', 
         'Age_bucket_(48.0, 64.0]', 'Income_bucket_(2599.999, 21700.0]', 'Income_bucket_(21700.0, 33700.0]', 
         'Income_bucket_(33700.0, 46900.0]', 'Income_bucket_(46900.0, 64300.0]', 'Income_bucket_(64300.0, 190500.0]'
        ] + num_var_list
y_var = 'target'
#df.columns

x_train, x_test, y_train, y_test = train_test_split(df[x_var], df[y_var], test_size=0.2, random_state=0, stratify=df[y_var])

In [16]:
# perform a 5 fold cross-validation to get best parameters for model training based on accuracy
# note: scikit-learn requires an integer min_samples_split to be at least 2, so the range starts at 2
tune_parm_space = {'min_samples_split':range(2, 20),
                   'max_depth':range(1, 20),
                   'criterion':['gini', 'entropy'],
                   'min_samples_leaf':range(1, 20)
                  }

clf = GridSearchCV(DecisionTreeClassifier(), tune_parm_space, cv=5)
clf.fit(x_train, y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': range(1, 20),
                         'min_samples_leaf': range(1, 20),
                         'min_samples_split': range(2, 20)})

In [17]:
print(clf.best_estimator_)
print(clf.best_params_)
print(clf.best_score_)

DecisionTreeClassifier(max_depth=3, min_samples_leaf=19)
{'criterion': 'gini', 'max_depth': 3, 'min_samples_leaf': 19, 'min_samples_split': 2}
0.8084907685702019


In [18]:
train_accuracy = clf.best_estimator_.score(x_train, y_train)
test_accuracy = clf.best_estimator_.score(x_test, y_test)
without_model_accuracy = y_train.value_counts()[1] / (np.sum(y_train.value_counts())) # baseline: accuracy of always predicting 1 (the majority class)

print(f"The train accuracy is     {np.round(train_accuracy * 100, 2)}%")
print(f"The test accuracy is      {np.round(test_accuracy * 100, 2)}%")
print(f"Without model accuracy is {np.round(without_model_accuracy * 100, 2)}%")

The train accuracy is     80.85%
The test accuracy is      83.14%
Without model accuracy is 57.89%


In [19]:
feature_importance = pd.Series(clf.best_estimator_.feature_importances_)
feature_importance.index = x_var
feature_importance.sort_values(ascending=False)

Mall Trips                           0.687594
Nbhd_West                            0.198767
Pay Type_Hourly                      0.097365
Nbhd_East                            0.016274
Gender_Male                          0.000000
Age_bucket_(40.0, 48.0]              0.000000
CC Debt                              0.000000
Car Value                            0.000000
Weight                               0.000000
Income_bucket_(64300.0, 190500.0]    0.000000
Income_bucket_(46900.0, 64300.0]     0.000000
Income_bucket_(33700.0, 46900.0]     0.000000
Income_bucket_(21700.0, 33700.0]     0.000000
Income_bucket_(2599.999, 21700.0]    0.000000
Age_bucket_(48.0, 64.0]              0.000000
Age_bucket_(34.0, 40.0]              0.000000
Live Alone_No                        0.000000
Pay Type_Salaried                    0.000000
Age_bucket_(21.999, 30.0]            0.000000
Gender_Female                        0.000000
Nbhd_South                           0.000000
Dwell Type_Home                   

# Answer

    1. What is the train and test accuracy score?
        The train accuracy is 82.02%, and the test accuracy is 77.33%.
        The accuracy without a model (always predicting the majority class) is 57.89%, so the model clearly beats
        this baseline and can be used.
        The cross-validation accuracy of the best model found by the grid search is 80.8%. It differs from the train
        accuracy because it is the mean accuracy over the five held-out folds.
    2. The feature importances are listed in the table below:
    
|feature_name        | importance  |
|------------------- | ----------- |
|Mall Trips          | 0.620779    |
|Nbhd_West           | 0.157280    |
|Pay Type_Salaried   | 0.092745    |
|Age                 | 0.077253    |
|Nbhd_East           | 0.039684    |
|Income              | 0.012258    |
         
         All other features have zero importance.

    3. Does grouping 'age' and 'income' into 5 categories each improve the prediction score?
    
|feature_name        | importance  |
|------------------- | ----------- |
|Mall Trips          | 0.687594    |
|Nbhd_West           | 0.198767    |
|Pay Type_Hourly     | 0.097365    |
|Nbhd_East           | 0.016274    |

        - The train accuracy is     80.85%
        - The test accuracy is      83.14%
        - Without model accuracy is 57.89%
        
        - In the feature importance, every feature derived from age and income has zero importance. Test accuracy
          rises from 77.33% to 83.14%, while train accuracy (80.85% vs. 82.02%) and the cross-validation score
          (about 80.8% in both cases) are essentially unchanged. The gain on the test set therefore probably comes
          from the change in the fitted tree (max_depth 3 instead of 4) rather than from the buckets themselves:
          before grouping, `age` and `income` had importances of 7.7% and 1.2% respectively, but after grouping
          neither appears among the important variables.
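
To judge whether the difference between the two test accuracies (77.33% and 83.14%) is large, it can help to estimate how noisy an accuracy measured on this test set is. The sketch below uses the binomial standard error, sqrt(p(1-p)/n), with n taken as 856 - 684 = 172 test rows. It is a rough guide only: it treats each accuracy in isolation, whereas both models are scored on the same rows.

```python
import math

n_test = 856 - 684       # rows held out for testing, from the printed shapes
acc_raw = 0.7733         # test accuracy with raw Age and Income
acc_bucketed = 0.8314    # test accuracy with bucketed Age and Income


def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy estimated from n samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


se_raw = accuracy_standard_error(acc_raw, n_test)
se_bucketed = accuracy_standard_error(acc_bucketed, n_test)
difference = acc_bucketed - acc_raw

print(f"Test rows: {n_test}")
print(f"Raw features:      {acc_raw:.4f} +/- {se_raw:.4f}")
print(f"Bucketed features: {acc_bucketed:.4f} +/- {se_bucketed:.4f}")
print(f"Difference:        {difference:.4f}")
```

Under these assumptions each accuracy carries an uncertainty of roughly three percentage points, so a gap of about six points on a single split is suggestive rather than conclusive.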
        
    