In [1]:
import pandas as pd
from sklearn import tree
from sklearn.metrics import confusion_matrix, classification_report, plot_confusion_matrix
import pydotplus
from IPython.display import Image

diabetes_df = pd.read_csv("../week-14-repository/diabetes.csv")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


 Write simple (straightforward) definitions for the following parameters for RandomForestClassifier
(https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClass
ifier.html) and indicate how they correlate with the precision and recall for the basic
diabetes model we built in class. You will need to rerun the model multiple times to do
so.

Basic definitions:

n_estimators = number of trees in the foreset

max_features = max number of features considered for splitting a node

max_depth = max number of levels in each decision tree

min_samples_split = min number of data points placed in a node before the node is split

min_samples_leaf = min number of data points allowed in a leaf node

bootstrap = method for sampling data points (with or without replacement)


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

#Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

## Estimators

In [4]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=1, random_state =42)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.6623376623376623

In [5]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.73      0.77      0.75       100
           1       0.52      0.46      0.49        54

    accuracy                           0.66       154
   macro avg       0.62      0.62      0.62       154
weighted avg       0.65      0.66      0.66       154



In [6]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500, random_state =42)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7532467532467533

In [7]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.79      0.84      0.82       100
           1       0.67      0.59      0.63        54

    accuracy                           0.75       154
   macro avg       0.73      0.72      0.72       154
weighted avg       0.75      0.75      0.75       154



If we have more estimators the precision,recall and accuracy are more . But choosing a large number of estimators in a random forest model is not the best idea. Although it will not degrade the model, it can save you the computational complexity and prevent the use of a fire extinguisher on your CPU!

Recall and precision seems to have a positive correaltion with n_estimators .As num of estimators are increasing recall and precision seems to increase.

## Max depth

In [8]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state =42,max_depth=200)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7532467532467533

In [9]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.79      0.85      0.82       100
           1       0.67      0.57      0.62        54

    accuracy                           0.75       154
   macro avg       0.73      0.71      0.72       154
weighted avg       0.75      0.75      0.75       154



In [10]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state =42,max_depth=5)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7272727272727273

In [11]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.76      0.85      0.80       100
           1       0.64      0.50      0.56        54

    accuracy                           0.73       154
   macro avg       0.70      0.68      0.68       154
weighted avg       0.72      0.73      0.72       154



Max depth of the decision tree increases, the performance of the model over the training set increases continuously. On the other hand as the max_depth value increases, the performance over the test set increases initially but after a certain point, it starts to decrease rapidly.

I see a positive correlation of max_depth with recall and precision

## min_samples_split

In [166]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(min_samples_split=500, random_state =42)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.6493506493506493

In [160]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.79      0.86      0.82       100
           1       0.69      0.57      0.63        54

    accuracy                           0.76       154
   macro avg       0.74      0.72      0.72       154
weighted avg       0.75      0.76      0.75       154



In [167]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(min_samples_split=4, random_state =42)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7922077922077922

In [168]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.83      0.86      0.84       100
           1       0.72      0.67      0.69        54

    accuracy                           0.79       154
   macro avg       0.77      0.76      0.77       154
weighted avg       0.79      0.79      0.79       154



If I'm  using a small value for min samples split my accuracy,precision and recall are higher 
whereas if i'm using a large value for min samples split my accuracy,precision and recall is going down considerably

## min_samples_leaf

min_samples_leaf:  It specifies the minimum amount of samples that a node must hold after getting split. It also helps to reduce overfitting when we have ample amount of parameters. Less number of parameters can lead to overfitting also, we should keep in mind that increasing the value to a large number can lead to less number of parameters and in this case model can underfit also. The default value is set to 1.

In [178]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(min_samples_leaf=4, random_state =42)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7532467532467533

In [179]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.78      0.86      0.82       100
           1       0.68      0.56      0.61        54

    accuracy                           0.75       154
   macro avg       0.73      0.71      0.72       154
weighted avg       0.75      0.75      0.75       154



In [180]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(min_samples_leaf=200, random_state =42)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)


0.6493506493506493

In [181]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.65      1.00      0.79       100
           1       0.00      0.00      0.00        54

    accuracy                           0.65       154
   macro avg       0.32      0.50      0.39       154
weighted avg       0.42      0.65      0.51       154



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


If I'm using a small value for min_samples_leaf my accuracy,precision and recall are higher whereas if i'm using a large value for min_samples_leaf my accuracy,precision and recall is going down considerably

## min_weight_fraction_leaf

In [184]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(min_weight_fraction_leaf=0, random_state =42)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7532467532467533

In [183]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.65      1.00      0.79       100
           1       0.00      0.00      0.00        54

    accuracy                           0.65       154
   macro avg       0.32      0.50      0.39       154
weighted avg       0.42      0.65      0.51       154



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [186]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(min_weight_fraction_leaf=0.5, random_state =42)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.6493506493506493

In [187]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.65      1.00      0.79       100
           1       0.00      0.00      0.00        54

    accuracy                           0.65       154
   macro avg       0.32      0.50      0.39       154
weighted avg       0.42      0.65      0.51       154



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


It has high precision, accuracy and recall when  min_weight_fraction_leaf=0

## max_leaf_nodes

In [107]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_leaf_nodes=2, random_state =42)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7012987012987013

In [105]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.76      0.86      0.81       100
           1       0.66      0.50      0.57        54

    accuracy                           0.73       154
   macro avg       0.71      0.68      0.69       154
weighted avg       0.73      0.73      0.72       154



In [188]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_leaf_nodes=1000, random_state =42)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7402597402597403

In [190]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.78      0.84      0.81       100
           1       0.65      0.56      0.60        54

    accuracy                           0.74       154
   macro avg       0.71      0.70      0.70       154
weighted avg       0.73      0.74      0.73       154



In [None]:
high precision, accuracy and recall with higher value for max_leaf_nodes

## min_impurity_decrease

In [111]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(min_impurity_decrease=0, random_state =42)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7532467532467533

In [113]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.79      0.85      0.82       100
           1       0.67      0.57      0.62        54

    accuracy                           0.75       154
   macro avg       0.73      0.71      0.72       154
weighted avg       0.75      0.75      0.75       154



In [130]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(min_impurity_decrease=0.01, random_state =42)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7337662337662337

high precision and recall with min_impurity_decrease =0

In [132]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.75      0.88      0.81       100
           1       0.68      0.46      0.55        54

    accuracy                           0.73       154
   macro avg       0.71      0.67      0.68       154
weighted avg       0.73      0.73      0.72       154



## min_impurity_split

In [139]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(min_impurity_split=0, random_state =42)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)





0.7532467532467533

In [140]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.79      0.85      0.82       100
           1       0.67      0.57      0.62        54

    accuracy                           0.75       154
   macro avg       0.73      0.71      0.72       154
weighted avg       0.75      0.75      0.75       154



|Parameter||Correlation with Precision||Correlation with Recall|
|--------| |----------| |----------|
|estimators||Precision increases with increase in number of estimators|| Recall increases with increase in number of estimators| 
|max_depth||Precision increases with increase in max_depth|| Recall increases with increase in max_depth|
|min_samples_split|| high precision with small value for min_samples_split| |high recall with small value for min_samples_split|
|min_samples_leaf|| high precision with low value of min_samples_leaf | |high recall with low value of min_samples_leaf|
|min_weight_fraction_leaf||with min_weight_fraction_leaf=0 high precision | | with min_weight_fraction_leaf=0 high recall |
|max_leaf_nodes||high precision with higher value for max_leaf_nodes||high recall with higher value for max_leaf_nodes|
|min_impurity_decrease||high precision with min_impurity_decrease=0 | | high recall with min_impurity_decrease=0|
|min_impurity_split|| high precision with min_impurity_split=0 | |high recall with min_impurity_split=0 |
 

2) How does setting bootstrap=False influence the model performance? Note: the default is bootstrap=True. Explain why your results might be so.

In [141]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(bootstrap=False ,random_state =42)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7467532467532467

In [142]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.79      0.83      0.81       100
           1       0.65      0.59      0.62        54

    accuracy                           0.75       154
   macro avg       0.72      0.71      0.72       154
weighted avg       0.74      0.75      0.74       154



After Setting bootstrap value=False, the model performance and recall went down by  1%.But precision values remained the same.

In [143]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(bootstrap=True ,random_state =42)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7532467532467533

In [144]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.79      0.85      0.82       100
           1       0.67      0.57      0.62        54

    accuracy                           0.75       154
   macro avg       0.73      0.71      0.72       154
weighted avg       0.75      0.75      0.75       154

