### 1. Write simple (straightforward) definitions for the following parameters for RandomForestClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and indicate how they correlate with the precision and recall for the basic diabetes model we built in class. You will need to rerun the model multiple times to do so.
#### Parameter                       Correlation with Precision               Correlation with Recall
#### estimators
#### max_depth 
#### min_samples_split 
#### min_samples_leaf 
#### min_weight_fraction_leaf
#### max_leaf_nodes
#### min_impurity_decrease
#### min_impurity_split
## Answers in markdowns below

Importing dataset

In [3]:
import pandas as pd
from sklearn import tree
from sklearn.metrics import confusion_matrix, classification_report, plot_confusion_matrix
import pydotplus
from IPython.display import Image

diabetes_df = pd.read_csv(r'C:\Users\watso\Documents\Data_Science_Bootcamp\homework_13/diabetes.csv')
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Created initial model

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

#Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

Class example

In [5]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, random_state =42)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7662337662337663

In [7]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.79      0.83      0.81       100
           1       0.65      0.59      0.62        54

    accuracy                           0.75       154
   macro avg       0.72      0.71      0.72       154
weighted avg       0.74      0.75      0.74       154



estimators - the number of decision trees run.  The default is 100. The in class model we ran is 200. Moving from 200 to 100 precision improved slightly, recall stayed the same.  Overall score decreased.

In [48]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state =42)
#what is an estimator?  models

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7532467532467533

In [49]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.79      0.85      0.82       100
           1       0.67      0.57      0.62        54

    accuracy                           0.75       154
   macro avg       0.73      0.71      0.72       154
weighted avg       0.75      0.75      0.75       154



max_depth - the maximum depth of the tree. Default is none, so the tree can continue until it reaches a natural conclusion. We went from none to the best score which was 11. Both precision and recall improved by three points each.

In [59]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, max_depth=11, random_state =42)
#what is an estimator?  models
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7792207792207793

In [60]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.81      0.87      0.84       100
           1       0.72      0.61      0.66        54

    accuracy                           0.78       154
   macro avg       0.76      0.74      0.75       154
weighted avg       0.77      0.78      0.77       154



min_samples_split - the minimum number of samples required to split a node.  Default is 2. The best answer I got was 7.  At that level both precision and recall improved.  At every other number it was the same or decreased.

In [68]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, min_samples_split=7, random_state =42)
#what is an estimator?  models
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7727272727272727

In [69]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.80      0.86      0.83       100
           1       0.70      0.61      0.65        54

    accuracy                           0.77       154
   macro avg       0.75      0.74      0.74       154
weighted avg       0.77      0.77      0.77       154



min_samples_leaf - the number of samples required to be a node.  The default is 1. Peak performance was at 3.  This improved both precision and recall.

In [75]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, min_samples_leaf=3, random_state =42)
#what is an estimator?  models
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7727272727272727

In [76]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.80      0.87      0.83       100
           1       0.71      0.59      0.65        54

    accuracy                           0.77       154
   macro avg       0.75      0.73      0.74       154
weighted avg       0.77      0.77      0.77       154



min_weight_fraction_leaf - the minimum weighted fraction of all samples to be in a leaf node.  Must be a float between 0.0 and 0.5.  Default is equal weight. Changing this number to .3 increased precision but decreased recall.

In [83]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, min_weight_fraction_leaf=.3, random_state =42)
#what is an estimator?  models
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7337662337662337

In [84]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.72      0.97      0.83       100
           1       0.84      0.30      0.44        54

    accuracy                           0.73       154
   macro avg       0.78      0.63      0.63       154
weighted avg       0.76      0.73      0.69       154



max_leaf_nodes - this restricts the number of leaf nodes but grows the best first, as defined by impurity.  The default is no restriction. Any artificial restriction here seems to decrease overall accuracy and at best neither increase or decrease precion and recall.

In [97]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, max_leaf_nodes=40, random_state =42)
#what is an estimator?  models
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7467532467532467

In [98]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.78      0.85      0.81       100
           1       0.67      0.56      0.61        54

    accuracy                           0.75       154
   macro avg       0.72      0.70      0.71       154
weighted avg       0.74      0.75      0.74       154



min_impurity_decrease - a node splits if the split induces a decrease in impurity greater than or equal to this vale.  Default is 0.0. Placing any value here seriously decreases the overall accuracy as well as precision and recall.   

In [109]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, min_impurity_decrease=.5, random_state =42)
#what is an estimator?  models
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.6493506493506493

In [110]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.65      1.00      0.79       100
           1       0.00      0.00      0.00        54

    accuracy                           0.65       154
   macro avg       0.32      0.50      0.39       154
weighted avg       0.42      0.65      0.51       154



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


min_impurity_split - a node splits if it's impurity is above the value, otherwise it becomes a leaf.  Default is none.  Implementing a value here decreases overall accuracy as well as precision and recall

In [40]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, min_impurity_split=.2, random_state =42)
#what is an estimator?  models
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)









0.7532467532467533

In [41]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.77      0.88      0.82       100
           1       0.70      0.52      0.60        54

    accuracy                           0.75       154
   macro avg       0.74      0.70      0.71       154
weighted avg       0.75      0.75      0.74       154



### 2. How does setting bootstrap=False influence the model performance? Note: the default is bootstrap=True. Explain why your results might be so

Bootstrapping=False decreases model performance.  I believe this is because it removes one method for fine tuning the model.

In [42]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, bootstrap=False, random_state =42)
#what is an estimator?  models
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7532467532467533

In [43]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.79      0.84      0.82       100
           1       0.67      0.59      0.63        54

    accuracy                           0.75       154
   macro avg       0.73      0.72      0.72       154
weighted avg       0.75      0.75      0.75       154



Just for fun running the model with the parameters identified above as improvements to see their collective effects.  Adding all of the changes together did not improve precision and recall beyond improvements that were acheived when implementing only one of these changes.

In [111]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_depth=11, min_samples_split=7, min_samples_leaf=3, random_state =42)
#what is an estimator?  models
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7727272727272727

In [112]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.80      0.87      0.83       100
           1       0.71      0.59      0.65        54

    accuracy                           0.77       154
   macro avg       0.75      0.73      0.74       154
weighted avg       0.77      0.77      0.77       154

