# First Cut on Modeling

In this activity we'll explore creating and evaluating a decision tree classifier.

Review these resources for more details on decision trees:
- https://towardsdatascience.com/understanding-decision-trees-for-classification-python-9663d683c952
- https://scikit-learn.org/stable/modules/tree.html#tree

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read dataset
crime_df = pd.read_csv('./pb_compas.csv')

crime_df.head()

Unnamed: 0,id,name,first,last,sex,dob,age,age_cat,race,juv_fel_count,...,vr_charge_desc,type_of_assessment,decile_score.1,score_text,screening_date,v_type_of_assessment,v_decile_score,v_score_text,priors_count.1,event
0,1.0,miguel hernandez,miguel,hernandez,Male,18/04/1947,69,Greater than 45,Other,0,...,,Risk of Recidivism,1,Low,14/08/2013,Risk of Violence,1,Low,0,0
1,2.0,miguel hernandez,miguel,hernandez,Male,18/04/1947,69,Greater than 45,Other,0,...,,Risk of Recidivism,1,Low,14/08/2013,Risk of Violence,1,Low,0,0
2,3.0,michael ryan,michael,ryan,Male,6/2/85,31,25 - 45,Caucasian,0,...,,Risk of Recidivism,5,Medium,31/12/2014,Risk of Violence,2,Low,0,0
3,4.0,kevon dixon,kevon,dixon,Male,22/01/1982,34,25 - 45,African-American,0,...,Felony Battery (Dom Strang),Risk of Recidivism,3,Low,27/01/2013,Risk of Violence,1,Low,0,1
4,5.0,ed philo,ed,philo,Male,14/05/1991,24,Less than 25,African-American,0,...,,Risk of Recidivism,4,Low,14/04/2013,Risk of Violence,3,Low,4,0


In [3]:
#####
crime_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18316 entries, 0 to 18315
Data columns (total 40 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       11001 non-null  float64
 1   name                     18316 non-null  object 
 2   first                    18316 non-null  object 
 3   last                     18316 non-null  object 
 4   sex                      18316 non-null  object 
 5   dob                      18316 non-null  object 
 6   age                      18316 non-null  int64  
 7   age_cat                  18316 non-null  object 
 8   race                     18316 non-null  object 
 9   juv_fel_count            18316 non-null  int64  
 10  decile_score             18316 non-null  int64  
 11  juv_misd_count           18316 non-null  int64  
 12  juv_other_count          18316 non-null  int64  
 13  priors_count             18316 non-null  int64  
 14  days_b_screening_arres

In [5]:
# clean up
# drop duplicate rows
crime_df.drop_duplicates(subset=crime_df.columns.difference(['id']), inplace=True)

# select columns of interest (copy so the in-place edits below act on an independent frame)
crime_df = crime_df[['age', 'c_charge_degree', 'race', 'sex', 'priors_count','score_text']].copy()
print(crime_df.head())
crime_df.dropna(inplace=True)
crime_df.reset_index(drop=True, inplace=True)
print(crime_df.shape)

   age c_charge_degree              race   sex  priors_count score_text
0   69            (F3)             Other  Male             0        Low
2   31             NaN         Caucasian  Male             0     Medium
3   34            (F3)  African-American  Male             0        Low
4   24            (F3)  African-American  Male             4        Low
9   23            (F3)  African-American  Male             1       High
(10595, 6)


In [6]:
# create training and testing sets - 75/25
from sklearn.model_selection import train_test_split
crime_train, crime_test = train_test_split(crime_df, test_size =0.25, random_state = 1)
print(crime_test.shape)
print(crime_train.shape)

(2649, 6)
(7946, 6)


In [7]:
# for learning our tree we want to use age, c_charge_degree, race, sex, priors_count
x = crime_train[['age', 'c_charge_degree', 'race', 'sex', 'priors_count']]

In [8]:
# we want to predict the score text let's make that our y
y = crime_train['score_text']

In [9]:
# converting categorical features using one-hot encoding (i.e., dummy features)
# textbook approach using statsmodels categorical function is deprecated
# using pandas get_dummies function instead
x = pd.get_dummies(x, columns=['race','sex','c_charge_degree' ], prefix = ['dummy','dummy','dummy'])
# quick look at our updated data
x.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7946 entries, 6555 to 235
Data columns (total 23 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   age                     7946 non-null   int64
 1   priors_count            7946 non-null   int64
 2   dummy_African-American  7946 non-null   bool 
 3   dummy_Asian             7946 non-null   bool 
 4   dummy_Caucasian         7946 non-null   bool 
 5   dummy_Hispanic          7946 non-null   bool 
 6   dummy_Native American   7946 non-null   bool 
 7   dummy_Other             7946 non-null   bool 
 8   dummy_Female            7946 non-null   bool 
 9   dummy_Male              7946 non-null   bool 
 10  dummy_(CO3)             7946 non-null   bool 
 11  dummy_(F1)              7946 non-null   bool 
 12  dummy_(F2)              7946 non-null   bool 
 13  dummy_(F3)              7946 non-null   bool 
 14  dummy_(F5)              7946 non-null   bool 
 15  dummy_(F6)              

### Model Training

The first model we will try is a decision tree classifier. 

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

In [10]:
from sklearn.tree import DecisionTreeClassifier

If you look at the documentation for this classifier you will see numerous parameters that can be tuned. To start, we will experiment with a small set and keep the default settings for most parameters.

The criterion parameter specifies how the impurity of a split is measured. The options used here are “gini” (the default) and “entropy”. Both the Gini index and cross-entropy measure node purity: the purer the node, the closer the value is to zero. The decision tree algorithm keeps splitting nodes as long as impurity decreases, until it reaches zero or another parameter stops the growth. One such parameter is max_leaf_nodes, which grows the tree in best-first fashion, choosing at each step the split with the largest impurity decrease, until the limit is reached. Parameter values are typically tuned through iteration and evaluation.
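To make the two impurity measures concrete, here is a minimal sketch that computes them by hand for a few class distributions. The helper names and the example proportions are hypothetical and are not taken from the dataset; scikit-learn computes these values internally.

```python
import math

def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Shannon entropy in bits; classes with zero proportion contribute nothing."""
    return sum(p * math.log2(1 / p) for p in proportions if p > 0)

# Hypothetical class proportions (Low, Medium, High) in three example nodes
pure_node = [1.0, 0.0, 0.0]
mixed_node = [0.5, 0.3, 0.2]
uniform_node = [1 / 3, 1 / 3, 1 / 3]

for name, node in [("pure", pure_node), ("mixed", mixed_node), ("uniform", uniform_node)]:
    print(name, "gini:", round(gini(node), 3), "entropy:", round(entropy(node), 3))
```

Both measures are zero for the pure node and largest for the uniform node, which is why a split that lowers them is considered an improvement.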

We will create three tree-based models. The first is a decision tree based on the CART algorithm.

In [11]:
cart01 = DecisionTreeClassifier(max_leaf_nodes=5).fit(x,y) # using default criterion 'gini' 

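The second decision tree uses the entropy criterion, the kind of information-based measure used by C5.0. Note that scikit-learn still builds the tree with its optimized CART implementation, so this model only approximates C5.0.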

In [12]:
c50_01 = DecisionTreeClassifier(criterion="entropy",max_leaf_nodes=5).fit(x,y)

The last model is based on the [Random Forest algorithm](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

A Random Forest consists of a large number of individual decision trees that operate as an ensemble. Each tree in the forest produces a class prediction, and the class with the most votes becomes the model’s prediction. The n_estimators parameter sets the number of decision trees to use as part of the ensemble.
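The voting idea can be sketched in a few lines. This is an illustration only: the function name and the per-tree predictions are hypothetical, and scikit-learn's RandomForestClassifier, per its documentation, averages the trees' predicted class probabilities rather than counting hard votes.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common label per sample across all trees.

    tree_predictions: one list of predicted labels per tree, all the same length.
    Ties go to the label that appears first among the votes.
    """
    votes_per_sample = zip(*tree_predictions)
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]

# Hypothetical predictions from three trees for four samples
tree_predictions = [
    ["Low", "High", "Medium", "Low"],
    ["Low", "Medium", "Medium", "High"],
    ["High", "High", "Medium", "Low"],
]
ensemble_prediction = majority_vote(tree_predictions)
print(ensemble_prediction)
```

A single tree can be wrong on a sample and still be outvoted by the others, which is the intuition behind using an ensemble.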


In [13]:
from sklearn.ensemble import RandomForestClassifier

In [14]:
rf01 = RandomForestClassifier(n_estimators = 10,criterion="gini").fit(x,y)

### Applying the Model

Now that we have trained our models, we can use them to predict our target feature.

In [15]:
prediction_cart01 = cart01.predict(x)

In [16]:
prediction_c50_01 = c50_01.predict(x)

In [17]:
prediction_rf01 = rf01.predict(x)

We now have an array for each model that contains the prediction values for our target feature.

### Evaluating the Model

An important question after creating a model is: how well does it perform the task we want it to do?

In the last activity we practiced establishing a baseline that can be used as a benchmark for model evaluation. Let's start by doing that.

### To Do: Using the training set establish a baseline model accuracy for our target feature.

In [18]:
from sklearn.metrics import accuracy_score

In [19]:
print("Decision Tree (Gini) accuracy:", accuracy_score(y, prediction_cart01))
print("Decision Tree (Entropy) accuracy:", accuracy_score(y, prediction_c50_01))
print("Random Forest accuracy:", accuracy_score(y, prediction_rf01))

Decision Tree (Gini) accuracy: 0.6215706015605336
Decision Tree (Entropy) accuracy: 0.6215706015605336
Random Forest accuracy: 0.8094638811980871
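The cell above reports the accuracy of the trained models. A common baseline for a classification target is the accuracy obtained by always predicting the most frequent class. A minimal sketch of that idea follows; the function name and the toy labels are hypothetical stand-ins for y.

```python
from collections import Counter

def majority_class_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# Hypothetical labels standing in for the training target
toy_labels = ["Low", "Low", "Medium", "High", "Low", "Medium", "Low", "High"]
baseline_label, baseline_accuracy = majority_class_baseline(toy_labels)
print(baseline_label, baseline_accuracy)
```

On the notebook's own data, `y.value_counts(normalize=True).max()` should give the same kind of baseline figure; a model is only useful if it beats it.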


#### Now back to our new models. 

In our training set we had the actual values for the target feature we want to predict. A simple way to tell how our model performs is to go back and compare the predicted results with the actual values. 

Let's first see how well our models are performing on the set we used for training.

In [20]:
# create a function that we can use to check how our predictions compare to the actual values
def eval_prediction(pred, actual):
    # count the positions where the predicted label equals the actual label
    return int(sum(p == a for p, a in zip(pred, actual)))

In [21]:
#Using the evaluation function to see how many we got correct for each model 
print("CART:", eval_prediction(prediction_cart01, y))
print("C5.0:", eval_prediction(prediction_c50_01, y))
print("Random Forest:", eval_prediction(prediction_rf01, y))

CART: 4939
C5.0: 4939
Random Forest: 6432


Accuracy rates:

In [22]:
print("CART:", '{0:.2f}'.format((eval_prediction(prediction_cart01, y)/len(x))*100),"%")
print("C5.0:", '{0:.2f}'.format((eval_prediction(prediction_c50_01, y)/len(x))*100),"%")
print("Random Forest:", '{0:.2f}'.format((eval_prediction(prediction_rf01, y)/len(x))*100),"%")

CART: 62.16 %
C5.0: 62.16 %
Random Forest: 80.95 %


Note that our classifiers have a method, score(...), that does what our evaluation function does and returns the mean accuracy on the given data and labels.

In [23]:
print(cart01.score(x,y))
print(c50_01.score(x,y))
print(rf01.score(x,y))

0.6215706015605336
0.6215706015605336
0.8094638811980871


### To Do: What do you think of these results? Are they in line with what you expected them to be?

In [21]:
## This is to be expected: a single decision tree often underperforms a Random Forest.
### A gap of roughly 19 percentage points (62% vs. 81%) is large. It may indicate that the
### individual trees, limited to 5 leaf nodes, are too simple to capture all patterns in the data.
### Because this is training accuracy, the unconstrained forest may also be overfitting.

Now that we have seen how well our models perform on data they had already seen, it is more interesting to see how well they perform on data they haven't seen before.

In [24]:
# Now we want to see how well our models perform on the testing set
x_test = crime_test[['age', 'c_charge_degree', 'race', 'sex', 'priors_count']]
y_test = crime_test['score_text']

### To Do: Using the three models we trained predict the score_text values for the testing set.

In [26]:
x_test = pd.get_dummies(x_test, columns=['race', 'sex', 'c_charge_degree'], prefix=['dummy', 'dummy', 'dummy'])
x_test = x_test.reindex(columns=x.columns, fill_value=0)
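The reindex step matters because one-hot encoding the training and testing sets separately can produce different columns when a category appears in only one of them. The sketch below illustrates this with small hypothetical frames (not the COMPAS data).

```python
import pandas as pd

# Hypothetical toy frames: "Caucasian" appears only in test, "Asian" and "Hispanic" only in train
train = pd.DataFrame({"age": [25, 40, 33], "race": ["Asian", "Other", "Hispanic"]})
test = pd.DataFrame({"age": [51, 19], "race": ["Other", "Caucasian"]})

train_encoded = pd.get_dummies(train, columns=["race"], prefix="dummy")
test_encoded = pd.get_dummies(test, columns=["race"], prefix="dummy")

# Before alignment the two column sets differ
print(list(train_encoded.columns))
print(list(test_encoded.columns))

# Keep exactly the training columns: missing ones are filled with 0,
# and columns unseen during training are dropped
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(list(test_aligned.columns))
```

A fitted model expects the same features in the same order as it saw during training, so the aligned frame is the one to pass to predict. Rows whose category was never seen in training end up with all dummy columns set to 0.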

In [27]:
# Make predictions on the test set using the trained models
prediction_cart_test = cart01.predict(x_test)
prediction_c50_test = c50_01.predict(x_test)
prediction_rf_test = rf01.predict(x_test)

In [28]:
print("CART Test Accuracy:", '{0:.2f}'.format((eval_prediction(prediction_cart_test, y_test) / len(x_test)) * 100), "%")
print("C5.0 Test Accuracy:", '{0:.2f}'.format((eval_prediction(prediction_c50_test, y_test) / len(x_test)) * 100), "%")
print("Random Forest Test Accuracy:", '{0:.2f}'.format((eval_prediction(prediction_rf_test, y_test) / len(x_test)) * 100), "%")

print("CART Test Accuracy (using score method):", cart01.score(x_test, y_test))
print("C5.0 Test Accuracy (using score method):", c50_01.score(x_test, y_test))
print("Random Forest Test Accuracy (using score method):", rf01.score(x_test, y_test))

CART Test Accuracy: 60.82 %
C5.0 Test Accuracy: 60.82 %
Random Forest Test Accuracy: 59.08 %
CART Test Accuracy (using score method): 0.608154020385051
C5.0 Test Accuracy (using score method): 0.608154020385051
Random Forest Test Accuracy (using score method): 0.5907889769724425


### To Do: How well did the models perform on the test set?

In [22]:
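## The models performed just okay: all three score around 60% on the test set.
## The Random Forest dropped from about 81% on the training set to about 59% here, which suggests it overfit the training data.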

### To Do: Now repeat the model fitting part trying different parameter values (e.g., max_leaf_nodes, n_estimators)

Refer to the [Decision tree classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) documentation for details on the different parameters that can be used with this classifier.



In [29]:
# Tuning the CART model with max_leaf_nodes=10
cart_tuned = DecisionTreeClassifier(max_leaf_nodes=10, random_state=1).fit(x, y)
print("Tuned CART Test Accuracy:", cart_tuned.score(x_test, y_test))

# Tuning the C5.0 model with max_leaf_nodes=10
c50_tuned = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=10, random_state=1).fit(x, y)
print("Tuned C5.0 Test Accuracy:", c50_tuned.score(x_test, y_test))

Tuned CART Test Accuracy: 0.6198565496413742
Tuned C5.0 Test Accuracy: 0.6198565496413742


In [30]:
# Tuning the Random Forest model with n_estimators=50
rf_tuned = RandomForestClassifier(n_estimators=50, random_state=1).fit(x, y)
print("Tuned Random Forest Test Accuracy:", rf_tuned.score(x_test, y_test))

# Tuning the Random Forest model with n_estimators=100
rf_tuned_100 = RandomForestClassifier(n_estimators=100, random_state=1).fit(x, y)
print("Tuned Random Forest (100 trees) Test Accuracy:", rf_tuned_100.score(x_test, y_test))

Tuned Random Forest Test Accuracy: 0.6024915062287656
Tuned Random Forest (100 trees) Test Accuracy: 0.6036240090600227
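Rather than fitting one model per cell, the same experiment can be written as a loop over candidate values. The sketch below assumes synthetic data from make_classification as a stand-in for the notebook's x and y. The variable names and the list of candidate values are hypothetical, and no particular scores are implied.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class stand-in data (hypothetical; not the COMPAS dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

# Fit one tree per candidate value and record training and test accuracy
results = []
for n_leaves in [5, 10, 20, 50]:
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=1).fit(X_tr, y_tr)
    results.append((n_leaves, tree.score(X_tr, y_tr), tree.score(X_te, y_te)))

for n_leaves, train_acc, test_acc in results:
    print(f"max_leaf_nodes={n_leaves}: train={train_acc:.3f}, test={test_acc:.3f}")
```

Printing training and test accuracy side by side makes it easier to spot the point where a larger tree keeps improving on the training set but stops improving on unseen data. In practice, choosing parameters against a separate validation set or with cross-validation avoids tuning to the test set.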


### To Do: How well did the updated models perform? How do they compare to our first set of models?

In [22]:
### They performed slightly better than the original models (CART and C5.0: 60.8% -> 62.0%; Random Forest: 59.1% -> 60.2% and 60.4%), but all remain around 60%
