# COMP 7150 Assignment 6 SOLUTION

---

**How each problem is graded**

This is a general guideline.  There might be some variant for a specific problem.

* Answering the question in English: 25%
    + Thorough answering the question, explaning the findings, discussing the comparison: 10-25%
    + Not answering the question in English: 0%.
    + Note: An answer is written in English to address a specific question. The answer must come first.  
* Code: 50%
    + Code is correct: 50%
    + Code is correct, but missing some minor elements: 25-40%
    + Incorrect solution, but make an effort: 10%-25%
    + Empty: 0% 
    + Note: code is not an answer. Code shows how you get an answer.  Code comes after an answer.
* Explanation of code: 10%
    + Concise English explanation of solution/code: 10%
    + No explanation: 0%
* Testing: 15%
    + Providing adequate testing of code: 15%
    + Code operational but does not show how the answer(s) were generated: 5%
    + Code not runnable or no testing: 0%


**How to turn in your assignment**

+ Export your notebook to an HTML file.
+ Upload it to the appropriate folder in Assignments on Canvas.

#### Dataset

The dataset you'll used for this assignment is diabetes.csv

The target variable is Outcome. The other 8 variables are features.

Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

* Pregnancies: Number of times pregnant
* Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test
* BloodPressure: Diastolic blood pressure (mm Hg)
* SkinThickness: Triceps skin fold thickness (mm)
* Insulin: 2-Hour serum insulin (mu U/ml)
* BMI: Body mass index (weight in kg/(height in m)^2)
* DiabetesPedigreeFunction: Diabetes pedigree function
* Age: Age (years)
* Outcome: Class variable (0 or 1)


---

Each problem is worth 20 points.

---

---

**Problem 1**

Some of the features have missing values.  Unfortunately, in this dataset, missing values are not indicated as "nan".  Therefore, if you use pandas' "dropna", it won't work.

However, if you understand the meanings of the features, you can guess which ones have missing values.  For example, blood pressure should not be less than 20.

Find the features that you think have missing values in the dataset.  For each feature, explain why you think they have missing values.



ANSWER:

In [3]:
import pandas

diabetes = pandas.read_csv('../Datasets/diabetes.csv')
diabetes.sample(1)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
120,0,162,76,56,100,53.2,0.759,25,1


Certain features (Glucose, BloodPressure, SkinThickness, Insulin, BMI, Age) should not have 0 values.

In [9]:
valid_bloodpressure = diabetes['BloodPressure'] > 0
valid_glucose = diabetes['Glucose'] > 0
valid_bmi = diabetes['BMI'] > 0
valid_insulin = diabetes['Insulin'] > 0
valid_skin = diabetes['SkinThickness'] > 0
valid_age = diabetes['Age'] > 0


print('Data set size:', len(diabetes))
print('Valid data for BloodPressure', len(diabetes[valid_bloodpressure]))
print('Valid data for Glucose', len(diabetes[valid_glucose]))
print('Valid data for BMI', len(diabetes[valid_bmi]))
print('Valid data for Insulin', len(diabetes[valid_insulin]))
print('Valid data for SkinThickness', len(diabetes[valid_skin]))
print()
Q = valid_bloodpressure & valid_glucose & valid_bmi & valid_insulin & valid_skin & valid_age
print('Valid data:', len(diabetes[Q]))

Data set size: 768
Valid data for BloodPressure 733
Valid data for Glucose 763
Valid data for BMI 757
Valid data for Insulin 394
Valid data for SkinThickness 541

Valid data: 392


---

**Problem 2**

Remove the missing values from the diabetes dataset.

Find 3 decision models with the best F1 scores, based on these characteristics:
* The first model: find the best max_depth.
* The second model: find the best min_samples_leaf
* The third model: find the best combination of max_depth and min_samples_leaf

Evaluate your models using f1_score and ShuffleSplit with 100 splits.

You can read about these two parameters here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier


ANSWER:

In [16]:
valid_bloodpressure = diabetes['BloodPressure'] > 0
valid_glucose = diabetes['Glucose'] > 0
valid_bmi = diabetes['BMI'] > 0
valid_insulin = diabetes['Insulin'] > 0
valid_skin = diabetes['SkinThickness'] > 0
valid_age = diabetes['Age'] > 0

Q = valid_bloodpressure & valid_glucose & valid_bmi & valid_insulin & valid_skin & valid_age

data = diabetes[Q]

print('Valid data:', len(data), len(data)/len(diabetes))

Valid data: 392 0.5104166666666666


We lose about 50% of the data after removing missing values.

There are ways of guesstimating missing values, but we won't be discuss them here.

**Finding the best max_depth for a decision tree model**

In [17]:
from sklearn.model_selection import cross_validate, ShuffleSplit
from sklearn.tree import DecisionTreeClassifier

X, y = data.drop(columns=['Outcome']), data['Outcome']
cv = ShuffleSplit(n_splits=100)

for d in [2, 3, 4, 5, 6, 7, 8, 10, 14, 16, 18, 20, 22]:
    dt = DecisionTreeClassifier(max_depth=d)
    r = cross_validate(dt, X, y, cv=cv, scoring='f1')
    print(d, r['test_score'].mean().round(2))
    

2 0.54
3 0.67
4 0.61
5 0.61
6 0.6
7 0.58
8 0.57
10 0.59
14 0.56
16 0.55
18 0.58
20 0.57
22 0.56


In [19]:
# the best max_depth seems to be 3.
model1 = DecisionTreeClassifier(max_depth=3)

**Finding the best min_samples_leaf for a decision tree model**

In [20]:
X, y = data.drop(columns=['Outcome']), data['Outcome']
cv = ShuffleSplit(n_splits=100)

for d in range(2,25):
    dt = DecisionTreeClassifier(min_samples_leaf=d)
    r = cross_validate(dt, X, y, cv=cv, scoring='f1')
    print(d, r['test_score'].mean().round(2))
    


2 0.58
3 0.59
4 0.6
5 0.63
6 0.6
7 0.6
8 0.61
9 0.61
10 0.61
11 0.61
12 0.6
13 0.62
14 0.61
15 0.58
16 0.59
17 0.6
18 0.59
19 0.6
20 0.58
21 0.61
22 0.61
23 0.59
24 0.6


In [34]:
# there are multiple values
model2 = DecisionTreeClassifier(min_samples_leaf=9)

#### Identifying the best combination

In [21]:
cv = ShuffleSplit(n_splits=100)
        
for l in range(2,7):
    for d in range(5,15):
        dt = DecisionTreeClassifier(max_depth=d, min_samples_leaf=l)
        r = cross_validate(dt, X, y, cv=cv, scoring='f1')
        print(l, d, r['test_score'].mean().round(2))


2 5 0.63
2 6 0.61
2 7 0.58
2 8 0.58
2 9 0.56
2 10 0.59
2 11 0.59
2 12 0.58
2 13 0.56
2 14 0.58
3 5 0.62
3 6 0.6
3 7 0.59
3 8 0.59
3 9 0.59
3 10 0.6
3 11 0.57
3 12 0.6
3 13 0.61
3 14 0.61
4 5 0.64
4 6 0.61
4 7 0.59
4 8 0.6
4 9 0.6
4 10 0.6
4 11 0.58
4 12 0.61
4 13 0.6
4 14 0.61
5 5 0.65
5 6 0.62
5 7 0.63
5 8 0.62
5 9 0.63
5 10 0.62
5 11 0.61
5 12 0.6
5 13 0.62
5 14 0.63
6 5 0.62
6 6 0.62
6 7 0.6
6 8 0.61
6 9 0.61
6 10 0.62
6 11 0.62
6 12 0.58
6 13 0.6
6 14 0.6


In [22]:
model3 = DecisionTreeClassifier(min_samples_leaf=4, max_depth=5)

---
**Problem 3**

Compare the performance of the best decision tree classifier and a random forest (with similar max_depth and min_samples_leaf).


ANSWER:

In [23]:
from sklearn.model_selection import cross_validate, ShuffleSplit, KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = data.drop(columns=['Outcome']), data['Outcome']
cv = ShuffleSplit(n_splits=100)
dt = DecisionTreeClassifier(min_samples_leaf=4, max_depth=5)
rf = RandomForestClassifier(min_samples_leaf=4, max_depth=5)

In [24]:
cv = ShuffleSplit(n_splits=100)
result_dt = cross_validate(dt, X, y, cv=cv, scoring='f1')
result_rf = cross_validate(rf, X, y, cv=cv, scoring='f1')

print(result_dt['test_score'].mean().round(2))
print(result_rf['test_score'].mean().round(2))

0.61
0.63


In [25]:
cv = KFold(n_splits=10, shuffle=True)
result_dt = cross_validate(dt, X, y, cv=cv, scoring='f1')
result_rf = cross_validate(rf, X, y, cv=cv, scoring='f1')

print(result_dt['test_score'].mean().round(2))
print(result_rf['test_score'].mean().round(2))

0.64
0.67


The result depends on a specific cross validator. Overall, random forest seems to perform better.

---
**Problem 4**

By default, a random forest classifier uses 100 random trees (n_estimators). The larger the number of random trees, the longer it takes to train and predict.

What is the smallest number random trees in a random forest do you need for a random forest classifer to outperform your best decision tree classifier?


ANSWER:

In [28]:
from sklearn.model_selection import cross_validate, KFold
from sklearn.ensemble import RandomForestClassifier
cv = KFold(n_splits=10, shuffle=True)
# cv = ShuffleSplit(n_splits=100)
for n in range(10,200,10):
    model = RandomForestClassifier(n_estimators=n, min_samples_leaf=9, max_depth=4)
    result_rf = cross_validate(rf, X, y, cv=cv, scoring='f1')
    print(n, result_rf['test_score'].mean().round(2))    

10 0.65
20 0.62
30 0.65
40 0.63
50 0.64
60 0.62
70 0.64
80 0.65
90 0.67
100 0.64
110 0.62
120 0.62
130 0.64
140 0.63
150 0.63
160 0.64
170 0.64
180 0.64
190 0.62


With 10-Fold cv, n ~ 90 gives good results.

---
**Problem 5**

Compare the performance of the best decision tree classifier and logistic regression classifier.

If your logistic regression classifer doesn't convert, you can increase max_iter.


ANSWER:

In [33]:
from sklearn.model_selection import cross_validate, ShuffleSplit, KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = data.drop(columns=['Outcome']), data['Outcome']
dt = DecisionTreeClassifier(min_samples_leaf=9, max_depth=4)
logit = LogisticRegression(max_iter=10000)

In [34]:
cv = ShuffleSplit(n_splits=100)
result_dt = cross_validate(dt, X, y, cv=cv, scoring='f1')
result_logit = cross_validate(logit, X, y, cv=cv, scoring='f1')

print(result_dt['test_score'].mean().round(2))
print(result_logit['test_score'].mean().round(2))

0.62
0.6


In [32]:
cv = KFold(n_splits=10)
result_dt = cross_validate(dt, X, y, cv=cv, scoring='f1')
result_logit = cross_validate(logit, X, y, cv=cv, scoring='f1')

print(result_dt['test_score'].mean().round(2))
print(result_logit['test_score'].mean().round(2))

0.62
0.64


The performance is similar.

Logistic regression is to be slightly better with 10-Fold, and decision tree is better with Shuffle Split.



#### feature importance

In [36]:
dt.fit(X,y)
logit.fit(X,y)

In [39]:
dt.feature_importances_.round(2)

array([0.  , 0.62, 0.01, 0.02, 0.06, 0.  , 0.07, 0.22])

In [38]:
X.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age'],
      dtype='object')

In [41]:
logit.coef_.round(2)

array([[ 0.08,  0.04, -0.  ,  0.01, -0.  ,  0.07,  0.97,  0.03]])

In [42]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(min_samples_leaf=9, max_depth=4)
rf.fit(X,y)

In [44]:
rf.feature_importances_.round(2)

array([0.06, 0.35, 0.03, 0.05, 0.15, 0.09, 0.07, 0.19])