**1. Write simple (straightforward) definitions for the following parameters for RandomForestClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and indicate how they correlate with the precision and recall for the basic diabetes model we built in class. You will need to rerun the model multiple times to do so.**

  
Parameter, Correlation with Precision, Correlation with Recall

**n_estimators:** The number of trees in the forest; each tree votes on the final result. Larger is generally better, but it increases computation time. With fewer trees, recall and precision were lower; with more trees, both were higher.

**max_depth:** The maximum depth of each tree, i.e. how many levels of nodes it may have. More depth improved precision and recall, but at some point the limit no longer binds (you are effectively running an unrestrained tree) and either it won't matter or the model will overfit.

**min_samples_split:** The minimum number of samples (data points) a node must contain before it can be split. Tested from 2 to 100. Increasing it slightly from 2 (4, 6, 10, etc.) increased precision and recall, and then for a while both stayed fairly similar. At large values (around 50-65 and up, approaching 100), precision and recall decreased, as did the model score.

**min_samples_leaf:** The minimum number of samples required at a leaf node (a terminal node with no further splits). The model improved when the minimum went from 1 to 2, started dropping off at 3 to 4, and dropped off more sharply by 20.

**min_weight_fraction_leaf:** The minimum weighted fraction of the total samples required at a leaf node, i.e. a fraction of all samples instead of a count. 1% gave better precision and recall than 5%, 10%, or more. By 40% (ridiculous), model performance dropped drastically.

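**max_leaf_nodes:** The maximum number of leaf nodes each tree may have. Trees are grown best-first, so the splits that reduce impurity the most are kept until the limit is reached. 3 is the sweet spot for this model. 2 wasn't horrible, but more than 4 wasn't helpful.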

**min_impurity_decrease:** A node will be split only if the split reduces impurity by at least this value. In other words, it is the minimum improvement in class purity a split must deliver to be worth making. Values at or near zero are better for recall and precision. 0.1 broke the model. :)

**min_impurity_split:** From the documentation: Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf. *min_impurity_split has been deprecated in favor of min_impurity_decrease; use min_impurity_decrease instead.* The parameter was removed entirely in scikit-learn 1.0.
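To make the min_impurity_decrease definition more concrete, here is a minimal sketch of the weighted impurity decrease for a single candidate split, using Gini impurity and the formula given in the scikit-learn documentation. The node counts are hypothetical and are not taken from the diabetes model; the helper names `gini` and `weighted_impurity_decrease` are made up for illustration.

```python
def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease of one candidate split.

    Formula from the scikit-learn documentation:
    N_t / N * (impurity - N_t_R / N_t * right_impurity
                        - N_t_L / N_t * left_impurity)
    """
    n_t, n_l, n_r = sum(parent), sum(left), sum(right)
    return (n_t / n_total) * (
        gini(parent) - (n_r / n_t) * gini(right) - (n_l / n_t) * gini(left)
    )


# Hypothetical node: 40 negative and 40 positive samples, in a tree
# trained on 100 samples in total.
parent = [40, 40]
left, right = [30, 10], [10, 30]  # a fairly informative split

decrease = weighted_impurity_decrease(parent, left, right, n_total=100)
print(round(decrease, 3))  # 0.1
```

Even this fairly clean split only just reaches a decrease of 0.1, which may be why a threshold of 0.1 leaves the trees with almost nothing to split on.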


In [1]:
import pandas as pd
from sklearn import tree
from sklearn.metrics import confusion_matrix, classification_report  # plot_confusion_matrix is unused here and no longer exists in scikit-learn 1.2+
import pydotplus
from IPython.display import Image

diabetes_df = pd.read_csv("diabetes.csv")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

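# Standardize: fit the scaler on the training set only, then reuse it on the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  # transform only: reuse the training mean and standard deviation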

In [3]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500, random_state=42, bootstrap=True)
# n_estimators is the number of estimators, i.e. the individual tree models in the forest

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7532467532467533

In [4]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.79      0.84      0.82       100
           1       0.67      0.59      0.63        54

    accuracy                           0.75       154
   macro avg       0.73      0.72      0.72       154
weighted avg       0.75      0.75      0.75       154
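Rerunning the model by hand for every parameter value gets tedious, so a small loop can help. The following is a sketch only: it uses a synthetic dataset from `make_classification` as a stand-in for diabetes.csv (an assumption, so the numbers will not match the report above), and the helper name `sweep` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data (assumed shape: 768 rows, 8 features).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)


def sweep(param_name, values, n_estimators=100):
    """Refit the forest once per value and record precision/recall for class 1."""
    rows = []
    for value in values:
        # If param_name is "n_estimators", the swept value overrides the default.
        params = {"n_estimators": n_estimators, param_name: value}
        rf = RandomForestClassifier(random_state=42, **params)
        rf.fit(X_train, y_train)
        predictions = rf.predict(X_test)
        rows.append((
            value,
            precision_score(y_test, predictions, zero_division=0),
            recall_score(y_test, predictions, zero_division=0),
        ))
    return rows


results = sweep("min_samples_leaf", [1, 2, 4, 20])
for value, precision, recall in results:
    print(f"min_samples_leaf={value:>3}  precision={precision:.2f}  recall={recall:.2f}")
```

Swapping in the real `X_train`, `X_test`, `y_train`, and `y_test` from the cells above, and changing the parameter name and value list, would reproduce the experiments described in question 1.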



---

**2. How does setting bootstrap=False influence the model performance? Note: the default is bootstrap=True. Explain why your results might be so.**

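If False, the whole training set is used to build each tree, instead of a bootstrap sample (rows drawn at random with replacement). I expected this to make the model perform worse, since any outliers in the data are then used in every tree built. But I don't see a big difference between the two.

Model score fell from 75.3% to 74.7%, which is a difference of a single test sample (116 vs. 115 correct out of 154).

There are still random differences between the trees, because only a random subset of features is considered at each split, but the same (whole) dataset is used for every tree when bootstrapping is false. That makes the trees more alike, so averaging their votes cancels out less of their individual error.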
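To see what bootstrapping changes about the data each tree sees, the short sketch below draws one bootstrap sample of row indices using only the standard library. The training-set size and the seed are arbitrary choices for illustration, not values from the model.

```python
import random

rng = random.Random(42)  # fixed seed, chosen arbitrarily
n_rows = 10_000          # hypothetical training-set size

# A bootstrap sample draws n_rows row indices with replacement,
# so some rows repeat and others are left out.
bootstrap_sample = [rng.randrange(n_rows) for _ in range(n_rows)]
unique_fraction = len(set(bootstrap_sample)) / n_rows

# With bootstrap=False, every tree would instead see each row exactly once.
print(f"distinct rows in one bootstrap sample: {unique_fraction:.1%}")
```

On average a bootstrap sample contains about 63% of the distinct rows (1 - 1/e), so with bootstrap=True each tree trains on a noticeably different slice of the data.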


In [5]:
rf = RandomForestClassifier(n_estimators=500, random_state=42, bootstrap=False)
# bootstrap=False: every tree is trained on the full training set

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7467532467532467

In [6]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.79      0.83      0.81       100
           1       0.65      0.59      0.62        54

    accuracy                           0.75       154
   macro avg       0.72      0.71      0.72       154
weighted avg       0.74      0.75      0.74       154
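The two classification reports can be compared more directly by working back to the underlying counts. The counts below are inferred from the rounded precision, recall, and support values in the two reports; they are an assumption, since the notebook never prints a confusion matrix. The helper name `report_from_counts` is hypothetical.

```python
def report_from_counts(tn, fp, fn, tp):
    """Precision and recall for class 1, plus overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision, recall, accuracy


# Inferred counts (assumption): 100 true class-0 and 54 true class-1 samples.
bootstrap_true = report_from_counts(tn=84, fp=16, fn=22, tp=32)
bootstrap_false = report_from_counts(tn=83, fp=17, fn=22, tp=32)

for name, (precision, recall, accuracy) in [
    ("bootstrap=True ", bootstrap_true),
    ("bootstrap=False", bootstrap_false),
]:
    print(f"{name}  precision={precision:.2f}  recall={recall:.2f}  accuracy={accuracy:.4f}")
```

Under these inferred counts, the two models differ by one class-0 sample that becomes a false positive when bootstrap=False, which lowers class-1 precision while leaving class-1 recall unchanged.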

