# 1. Write simple (straightforward) definitions for the following parameters for RandomForestClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClass ifier.html) and indicate how they correlate with the precision and recall for the basic diabetes model we built in class. You will need to rerun the model multiple times to do so

In [114]:
import pandas as pd
from sklearn import tree
from sklearn.metrics import confusion_matrix, classification_report, plot_confusion_matrix

diabetes_df = pd.read_csv("../week_13/diabetes.csv")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [115]:
#https://stackoverflow.com/questions/46480457/difference-between-min-samples-split-and-min-samples-leaf-in-sklearn-decisiontre

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=y)

#Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

In [213]:
from sklearn.ensemble import RandomForestClassifier
#estimator = model
rf = RandomForestClassifier(min_impurity_decrease = 0.03, random_state=42)

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7402597402597403

In [214]:
predictions = rf.predict(X_test)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.72      0.97      0.83       150
           1       0.84      0.32      0.46        81

    accuracy                           0.74       231
   macro avg       0.78      0.64      0.65       231
weighted avg       0.76      0.74      0.70       231



**Parameter Definitions**

n_estimators - the number of trees you want to build before taking the max. averages or predictions; higher number of trees give a better performance but slows down your code

max_depth - the depth  every tree in random forest grows. It's the longest path between the root node and leaf node

min_samples_split - tells the decision tree the minimum number of samples or observations any node needs in order to split further into subnodes

min_samples_leaf - leaf is the end node of a decision tree. A smaller leaf makes the model more prone to capturing noise in train data. The minimum number of samples that should be present in the leaf node after splitting a node. Helps prevent overfitting as the parameter value increases

min_weight_fraction_leaf - the fraction of the input samples required to be at a leaf node where weights are determined by sample_weight. This is a way to deal with class imbalance

max_leaf_nodes - sets a condition on the splitting of the nodes in the tree and hence restricts the growth of the tree. 

min_impurity_decrease - controls how deep the tree grows based on the impurity. If the final impurity decrease is less than the minimum impurity decrease parameter, then the split will not be performed



| Parameter | Correlation with Precision | Correlation with Recall |
| --- | --- | --- |
| n_estimator | positive: when set to 200, precision increased by 1% and 2% | negative: when set to 200, recall decreased by about 30% in correctly identifying those who have diabetes |
| max_depth | positive: set to 9 precision increased by 1% and 2% when changed from its default | positive: increased by 1% for those who are non-diabetic and 3% for those who are diabetic
| min_samples_split | positive when set to 6 it increases by 1% for those who are diabetic and 2% for non-diabetic | positive, increases by 1% and 3% in correctly identifying those who are diabetic
| min_samples_leaf | positive: set to 4 precision increases by 1% and 2% | positive: increases by 1% and 3%
| min_weight_fraction_leaf | positive or negative | positive or negative
| max_leaf_nodes | positive when set to 20 | positive when set to 20
| min_impurity_decrease | positive when set to 0.3, but anything lower creases the correlation | negative correlation indentifying diabetics when set to 0.3


#  2. How does setting bootstrap=False influence the model performance? Note: the default is bootstrap=True. Explain why your results might be so.

**Setting bootstrap = False has a slight positive influence on the model performance. The accuracy improves by going from 75% to 76%. There is a slight decrease in recall when identifying those who are non-diabetic and precision increases when identifying those who are non-diabetic. Bootstrap set to false means that samples are drawn without replacement, so each data point or person in the dataset has only one chance to be selected in the sample.**

In [172]:
rf1 = RandomForestClassifier(bootstrap = True, random_state=42)

rf1 = rf1.fit(X_train, y_train)
rf1.score(X_test, y_test)

0.7489177489177489

In [171]:
predictions = rf1.predict(X_test)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.79      0.85      0.82       150
           1       0.68      0.58      0.63        81

    accuracy                           0.76       231
   macro avg       0.74      0.72      0.72       231
weighted avg       0.75      0.76      0.75       231



## Helpful links to review regarding parameters
https://www.analyticsvidhya.com/blog/2020/03/beginners-guide-random-forest-hyperparameter-tuning/
https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680
https://stackoverflow.com/questions/54812230/sklearn-min-impurity-decrease-explanation#:~:text=A%20node%20will%20be%20split,or%20equal%20to%20this%20value.&text=Does%20this%20mean%20to%20prune,will%20become%20less%20than%200.1%20%3F
https://stackoverflow.com/questions/40131893/random-forest-with-bootstrap-false-in-scikit-learn-python