# <b> Random Forest Classifier (Attempt 1)</b>
Behold! My first attempt at the Random Forest Classifier for the Diabetes Health Indicator Dataset. 

<b>* Note: only uncomment the lines containing pickle if you store the files locally. The pickled models are too large for GitHub and will therefore cause an error when pushing to the main branch.</b>

In [44]:
# All necessary imports
import numpy as np
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

### <b>Loading the data</b> 
First, the data is loaded and split into the data (features) and the labels. The features are stored in a DataFrame and the labels in a pandas Series.

In [26]:
# Load training data and split labels and features (data)
data = pd.read_csv("./diabetes/training_data(no_pre-diabetes).csv")
labels = data["Diabetes_012"]
del data["Diabetes_012"] # deletes the labels from the data dataframe

### <b> Split data into training and validation set </b>
Below, the data is split again into training and validation sets with an 80/20 ratio.

In [27]:
# Split data in training and validation set, with for each a set of data and a set of corresponding labels
training_data, validation_data, training_labels, validation_labels = train_test_split(data, labels, test_size=0.2, random_state=33)
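
As a quick sanity check of an 80/20 split, the sketch below runs `train_test_split` on a small synthetic DataFrame. The column names, sizes, and class balance are made up for illustration and are not taken from the dataset. The `stratify=` argument is an optional extra that the cell above does not use; it keeps the class proportions equal in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 3 hypothetical features
rng = np.random.default_rng(0)
demo_data = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])

# Imbalanced binary labels: 850 zeros and 150 ones (assumed balance, for illustration)
demo_labels = pd.Series([0] * 850 + [1] * 150)

# Same ratio and seed as in the notebook, plus stratification on the labels
X_train, X_val, y_train, y_val = train_test_split(
    demo_data, demo_labels, test_size=0.2, random_state=33, stratify=demo_labels
)

print(X_train.shape, X_val.shape)  # (800, 3) (200, 3)
print(y_train.mean(), y_val.mean())  # share of positive labels in each set
```

Without `stratify`, the share of positive labels can differ slightly between the two sets, which matters more when one class is rare.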

### <b>Fit model without feature selection or hyperparameter tuning</b>
Here a Random Forest Classifier is trained on just the training data, using the default parameters of RandomForestClassifier().

In [29]:
base_model = RandomForestClassifier().fit(training_data, training_labels)

Now the model is evaluated based on the validation set with the help of sklearn's accuracy_score function:

In [34]:
pred_base_model = base_model.predict(validation_data)
print("Accuracy base model: ", accuracy_score(validation_labels, pred_base_model))

Accuracy base model:  0.8616845953315555
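
An accuracy value is easier to interpret next to a trivial baseline, because on imbalanced labels a model that always predicts the majority class can already score high. The sketch below shows one way to make that comparison with sklearn's `DummyClassifier`. It runs on synthetic data with an assumed 85/15 class balance, so the printed numbers say nothing about the diabetes dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (assumption: roughly 85% of samples in class 0)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=33)

# Baseline that always predicts the most frequent training label
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_va, baseline.predict(X_va))
forest_acc = accuracy_score(y_va, forest.predict(X_va))

print("Accuracy majority-class baseline: ", baseline_acc)
print("Accuracy random forest: ", forest_acc)
```

In the notebook, the same comparison would use training_data, training_labels, validation_data and validation_labels instead of the synthetic arrays.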


Now the model is pickled to be able to use it later:

In [41]:
# pickle.dump(base_model, open("RF_base_model.p","wb"))
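
For reference, a minimal sketch of a full pickle round trip (saving and loading back) is given below. It uses `with` blocks so the file handles are closed again, a temporary directory, and a tiny model trained on synthetic data. The file name `demo_model.p` is hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic problem so the pickled file stays small
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.p")  # hypothetical file name

    # Save the fitted model
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back into a new object
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give exactly the same predictions
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```

A pickled model should only be loaded with a compatible scikit-learn version and only from a trusted source.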

## <b> GridSearchCV </b>
To boost the performance of the model, a grid search with cross-validation (GridSearchCV) is performed. The training data is split multiple times into training and validation folds, and every hyperparameter combination in the grid is evaluated on these folds to determine the best hyperparameters for the model in combination with the given data.

### <b>Set up the parameter grid</b>
The next piece of code sets the parameter grid for the GridSearchCV later. <br>
An initial online search for suitable parameters led to keeping the default values for the following: <br>
<ul>
    <li> bootstrap (default = True): each tree is built on a bootstrap sample of the training data. If it were False, the complete training data would be used to build every tree.</li> 
    <li> max_features (default = auto, named sqrt in newer scikit-learn versions): refers to max_features=sqrt(n_features).</li>
    <li> criterion (default = gini): the Gini impurity is used to measure the quality of a candidate split. Gini is typically faster to compute than entropy, and the difference between them should not be major.</li> 
    <li> max_depth (default = None): a tree may grow until all leaves are pure or contain fewer than min_samples_split samples.</li> 
    <li> min_samples_leaf (default = 1): at least one sample needs to be present in each leaf.</li>
</ul>
<b> * Please note that keeping as many parameters as possible at their defaults keeps the grid small, which reduces the training time. Training is already quite long for the RF algorithm. </b>
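
To make the gini versus entropy remark concrete, the sketch below computes both impurity measures for a single node from its class counts: Gini is 1 minus the sum of squared class proportions, and entropy is the negative sum of p * log2(p). The helper names are hypothetical, and the base-2 logarithm is a choice made for this illustration.

```python
import numpy as np


def gini_impurity(class_counts):
    """Gini impurity of a node: 1 - sum(p_k^2) over the class proportions p_k."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)


def entropy(class_counts):
    """Entropy of a node in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # avoid log2(0) for classes that are absent from the node
    return -np.sum(p * np.log2(p))


print(gini_impurity([50, 50]))  # 0.5, the maximum for two classes
print(entropy([50, 50]))        # 1.0, the maximum for two classes
print(gini_impurity([100, 0]))  # 0.0, a pure node
```

Both measures are zero for a pure node and largest for an evenly mixed node, which is why they tend to prefer similar splits.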

In [35]:
# Number of trees in the random forest
n_estimators = [50, 100, 150, 200]
    
# Minimum number of samples a node must contain before it may be split
min_samples_split = [2, 4, 8, 16, 32, 64]

# Create param_grid
param_grid = {'n_estimators': n_estimators,
              'min_samples_split': min_samples_split}

### <b> Initialize model and perform GridSearchCV </b>
Now the Random Forest Classifier is initialized and a grid search with 4-fold cross-validation is set up.

In [37]:
cv_model = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid, cv=4)
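
To get a feeling for the amount of work this grid search involves, the sketch below counts the parameter combinations with sklearn's `ParameterGrid` and multiplies them by the number of folds. The extra fit accounts for GridSearchCV retraining the best combination on all training data, which it does by default (refit=True).

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined in the notebook
param_grid = {
    "n_estimators": [50, 100, 150, 200],
    "min_samples_split": [2, 4, 8, 16, 32, 64],
}
n_folds = 4

# Every combination is trained once per fold
n_combinations = len(ParameterGrid(param_grid))

# +1 for the final refit of the best combination on the full training data
n_fits = n_combinations * n_folds + 1

print(n_combinations)  # 24
print(n_fits)          # 97
```

Adding a third parameter with, for example, three values would triple the number of fits, which is why the grid is kept to two parameters here.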

### <b> Fit the model</b>
Now the model is fitted with the training data and training labels.<br>
<b> Note that running this piece of code can take quite some time (max 10 min.)</b>

In [38]:
cv_model.fit(training_data, training_labels)

GridSearchCV(cv=4, estimator=RandomForestClassifier(),
             param_grid={'min_samples_split': [2, 4, 8, 16, 32, 64],
                         'n_estimators': [50, 100, 150, 200]})

### <b> Check best parameters and accuracy</b>
Check which parameters were considered the best this round:

In [48]:
# Print out the best parameters according to GridSearchCV
print("Best parameters: ", cv_model.best_params_)

# Determine accuracy of model based on validation set
pred_cv_model = cv_model.predict(validation_data)
print("Accuracy GridSearchCV model: ", accuracy_score(validation_labels, pred_cv_model))

# Pickle model
# pickle.dump(cv_model, open("RF_cv_model.p","wb"))

Best parameters:  {'min_samples_split': 32, 'n_estimators': 200}
Accuracy GridSearchCV model:  0.8680969646121116
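
Besides best_params_, a fitted GridSearchCV also stores the cross-validated score of every combination in cv_results_. The sketch below shows how these could be put into a DataFrame to see how close the runner-up combinations are. It uses a deliberately small grid on synthetic data so that it runs in seconds; in the notebook the same lines would be applied to cv_model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem and grid (assumptions, chosen only for speed)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
small_grid = {"n_estimators": [10, 20], "min_samples_split": [2, 8]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid=small_grid, cv=4
).fit(X, y)

# One row per parameter combination, best combination first
columns = [
    "param_n_estimators",
    "param_min_samples_split",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")

print(results)
print("Best cross-validated accuracy: ", search.best_score_)
```

Note that best_score_ is the mean accuracy over the cross-validation folds of the training data, so it is a different number from the accuracy on the separate validation set.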


## <b>Feature selection</b>
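To boost the performance of the model, feature selection is performed. SelectFromModel fits a Random Forest and keeps only the features whose importance is at least its threshold, which by default for this kind of estimator is the mean importance over all features.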

In [50]:
selection = SelectFromModel(RandomForestClassifier())
f_select_model = selection.fit(training_data, training_labels)
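
The selection rule can be checked by hand. The sketch below, on synthetic data, compares the mask returned by get_support() with a manual mask built from the feature importances of the fitted forest, assuming the default threshold (the mean importance).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data with a few informative features (assumption, for illustration)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, random_state=0
)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0)
).fit(X, y)

# Importances of the forest that SelectFromModel fitted internally
importances = selector.estimator_.feature_importances_

# Manual version of the default rule: keep features at or above the mean importance
manual_mask = importances >= importances.mean()

print("Threshold used: ", selector.threshold_)
print("Selected features: ", selector.get_support())
print("Matches manual mask: ", np.array_equal(selector.get_support(), manual_mask))
```

A stricter or looser selection can be obtained by passing threshold= to SelectFromModel, for example "median" or a float.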

In [63]:
# Determine which features were selected as best
selected_features = training_data.columns[(f_select_model.get_support())]

# Remove non important features from training and validation set
f_select_training_data = training_data[selected_features]
print("Snapshot of training data with important features:", "\n",f_select_training_data.head())

f_select_validation_data = validation_data[selected_features]

# Train model again with selected features
f_model = RandomForestClassifier().fit(f_select_training_data, training_labels)


Snapshot of training data with important features: 
          BMI  GenHlth  MentHlth  PhysHlth   Age  Education  Income
120489  31.0      5.0       0.0      15.0  11.0        5.0     3.0
129057  21.0      3.0       2.0       0.0  13.0        6.0     8.0
48071   32.0      3.0       0.0       0.0   6.0        6.0     8.0
145333  44.0      4.0       5.0      20.0   5.0        4.0     3.0
30240   25.0      1.0       4.0       0.0   3.0        6.0     7.0


In [65]:
# Determine accuracy of model based on validation set
pred_f_model = f_model.predict(f_select_validation_data)
print("Accuracy feature selection model: ", accuracy_score(validation_labels, pred_f_model))

# Pickle model
# pickle.dump(f_model, open("RF_feat_select_model.p","wb"))

Accuracy feature selection model:  0.8443652054055674


## <b>Combining the feature selection and GridSearchCV</b>
Let's see whether combining these two methods boosts the accuracy.
The param_grid is reused, as is the feature-selected data created earlier.

In [66]:
CVF_model = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid, cv=4)
CVF_model.fit(f_select_training_data, training_labels)

GridSearchCV(cv=4, estimator=RandomForestClassifier(),
             param_grid={'min_samples_split': [2, 4, 8, 16, 32, 64],
                         'n_estimators': [50, 100, 150, 200]})
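
In the cell above the features were selected once on the complete training data, before cross-validation started, so every fold is scored on features that were chosen with the help of that fold. A possible alternative, which is not what this notebook does, is to put the selection step and the classifier in a sklearn `Pipeline` so that the selection is refitted inside each fold. The sketch below shows the idea on synthetic data with a small grid. The step names "select" and "forest" are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data (assumption, chosen only so the sketch runs quickly)
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)

# Feature selection followed by the classifier, fitted together per fold
pipeline = Pipeline([
    ("select", SelectFromModel(RandomForestClassifier(n_estimators=20, random_state=0))),
    ("forest", RandomForestClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
pipeline_grid = {
    "forest__n_estimators": [10, 20],
    "forest__min_samples_split": [2, 8],
}

pipeline_search = GridSearchCV(pipeline, param_grid=pipeline_grid, cv=4).fit(X, y)
print("Best parameters: ", pipeline_search.best_params_)
```

With this setup the fitted search object can predict directly on the unselected validation data, because the pipeline applies the feature selection itself.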

Below the accuracy is determined for this combined classifier.

In [67]:
# Determine accuracy of model based on validation set
pred_CVF_model = CVF_model.predict(f_select_validation_data)
print("Accuracy feature selection + GridSearchCV model: ", accuracy_score(validation_labels, pred_CVF_model))

# Pickle model
# pickle.dump(CVF_model, open("RF_combi_CVF_model.p","wb"))

Accuracy feature selection + GridSearchCV model:  0.8635723489048033
