# Random Forest Classifier 

Random Forest Classifier Steps: <br>
- Load Main Dataset <br>
- Split the Dataset into training and testing sets <br>
- Build Bootstrapped Datasets <br>
- Train Decision Trees on those datasets using different groups of features<br>
- Run every entry of the Test Data through all the trees and take note of each tree's result<br>
- Aggregate the results from the different trees and choose the final value through majority voting<br>
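
As a rough illustration of two of the steps above (bootstrapping and majority voting), here is a minimal sketch in plain Python. The helper names `bootstrap_sample` and `majority_vote`, the toy rows and the example votes are all made up for illustration; this is not how sklearn implements them internally.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (a bootstrapped dataset)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the most common prediction among the trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))                 # stand-in for the training rows
sample = bootstrap_sample(rows, rng)
# Same size as the original; rows can repeat, so there are usually fewer unique ones
print(len(sample), len(set(sample)))

# Hypothetical votes from five trees for a single test entry
votes = ["cancer", "non-cancer", "cancer", "cancer", "non-cancer"]
print(majority_vote(votes))
```

Each tree would be trained on its own bootstrapped sample, and the forest's answer for a test entry is whatever label most trees voted for.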
            


## Library imports

In [25]:
import pandas as pd
import numpy as np
import os
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from scipy.stats import randint
from matplotlib import pyplot 


#mainpath="C:\\Users\\alexa\\Desktop\\github\\Project-LabIACD\\metadata.csv"
mainpath="C:\\Users\\alexa\\Desktop\\Tudo\\Aulas\\LABS\\Project-LabIACD\\metadata.csv"

## Load Main Dataset

In [26]:
main_dataset=pd.read_csv(mainpath)
main_dataset.tail()


Unnamed: 0,Subject ID,Study UID,Study Description,Study Date,Series ID,Series Description,Number of images,File Size (Bytes),Collection Name,Modality,Manufacturer
1303,LIDC-IDRI-0030,1.3.6.1.4.1.14519.5.2.1.6279.6001.573428694448...,,2000-01-01 00:00:00.0,1.3.6.1.4.1.14519.5.2.1.6279.6001.986011151772...,,119,62645518,LIDC-IDRI,CT,GE MEDICAL SYSTEMS
1304,LIDC-IDRI-0121,1.3.6.1.4.1.14519.5.2.1.6279.6001.985098410443...,,2000-01-01 00:00:00.0,1.3.6.1.4.1.14519.5.2.1.6279.6001.987704869630...,,1,13228740,LIDC-IDRI,CR,Philips Medical Systems
1305,LIDC-IDRI-0974,1.3.6.1.4.1.14519.5.2.1.6279.6001.214565862082...,,2000-01-01 00:00:00.0,1.3.6.1.4.1.14519.5.2.1.6279.6001.991510467496...,,101,53158868,LIDC-IDRI,CT,SIEMENS
1306,LIDC-IDRI-0473,1.3.6.1.4.1.14519.5.2.1.6279.6001.210105060472...,CT CHEST O CONTR,2000-01-01 00:00:00.0,1.3.6.1.4.1.14519.5.2.1.6279.6001.994459772950...,,291,153155158,LIDC-IDRI,CT,GE MEDICAL SYSTEMS
1307,LIDC-IDRI-0493,1.3.6.1.4.1.14519.5.2.1.6279.6001.333362756208...,CHEST,2000-01-01 00:00:00.0,1.3.6.1.4.1.14519.5.2.1.6279.6001.997611074084...,,285,150013636,LIDC-IDRI,CT,GE MEDICAL SYSTEMS


## Splitting Training and Testing Datasets

We will first split the data according to the Pareto principle (80% training / 20% testing). <br>
Later we may try the split proposed in the paper "A scaling law
for the validation-set training-set size ratio" by Isabelle Guyon.
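
To make the 80/20 split concrete, the sketch below shuffles row indices and cuts them into two parts using only the standard library. The row count of 1308 is an assumption inferred from the last index (1307) shown by `tail()`, and `split_indices` is a hypothetical helper; rounding the test part up is meant to mirror what `train_test_split` does, as far as we understand it.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices and split them into train and test parts."""
    n_test = math.ceil(n_rows * test_size)   # the test part is rounded up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

# 1308 rows is an assumption based on the last index printed by tail()
train_idx, test_idx = split_indices(1308)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, which helps when comparing models across runs.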

In [27]:
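# Name of the column used as the target label
collumn_name='Manufacturer'
# The features are every column except the target
features_collumns=main_dataset.drop(collumn_name,axis=1)
has_cancer_collumns=main_dataset[collumn_name]

# 80% training / 20% testing; no random_state is set, so the split changes on every run
features_treino, features_teste, has_cancer_treino, has_cancer_teste=train_test_split(features_collumns,has_cancer_collumns, test_size=0.2)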


## Random Forest Classifier

Although the step list at the top says we have to build the Bootstrapped Dataset, the sklearn random forest classifier can do that for us, allowing us to skip that step.<br>
<br>
This chapter is divided into sub-chapters so that we can experiment with the Random Forest Classifier and try to get the best possible model.


### Default Random Forest

#### Data Fitting

In [None]:
RandomForest = RandomForestClassifier()
RandomForest.fit(features_treino,has_cancer_treino)

#### Model Evaluation

##### Prediction Part

In [None]:
has_cancer_prediction=RandomForest.predict(features_teste)

##### Accuracy Part

In [None]:
accuracy = accuracy_score(has_cancer_teste, has_cancer_prediction)
print(f"Accuracy = {accuracy} in a percentage of 100 = {accuracy*100}")

### Model Tuning

Now let's see whether changing the parameters gives us a better solution.<br>
<br>
sklearn has some commands that help us with this, such as RandomizedSearchCV, but let's first talk about the parameters we'll be tuning:<br>
- n_estimators: the number of Decision Trees in our Forest; we'll change the value in increments of 50 to try to find an "optimal" value in [50,400]<br>
- criterion: the function used to measure the quality of a split in the trees; we'll be testing gini vs entropy<br>
- max_features: the number of features considered when looking for the best split; according to some studies the best values should be around $\sqrt{TotalFeatures}$ or log<sub>2</sub>(TotalFeatures)<br>
<br>
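
To make the criterion and max_features options more tangible, here is a small sketch that computes both impurity measures by hand and the two candidate feature counts. The 50/50 node and the value `total_features = 11` (12 columns minus the target) are assumptions for illustration; truncating to an integer reflects our understanding of how sklearn interprets "sqrt" and "log2".

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Hypothetical node holding a 50/50 mix of the two classes
node = ["cancer"] * 5 + ["non-cancer"] * 5
print(gini(node), entropy(node))   # 0.5 and 1.0, the maximum for two classes

# Assumption: 12 columns in the dataset minus the target column
total_features = 11
print(int(math.sqrt(total_features)), int(math.log2(total_features)))   # 3 and 3
```

Both measures are 0 for a pure node and largest for an evenly mixed one, so in practice they often pick similar splits.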
Note that we will not be changing the tree max_depth, although it might seem important. The reasoning is that, while some trees may overfit, with this many trees there will be overfitting trees for both results (cancer and non-cancer), which should end up balancing things out.<br>
It is also worth remembering that even if some trees overfit, they will probably be a minority.
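
The intuition that many imperfect trees balance each other out can be illustrated with a simple binomial calculation. The sketch below assumes every tree is right with the same probability and that the trees err independently, which is optimistic for a real forest (its trees are correlated), so treat it as an idealised picture rather than a guarantee. `majority_correct_probability` is a hypothetical helper.

```python
from math import comb

def majority_correct_probability(n_trees, p_correct):
    """Probability that more than half of n_trees independent voters are right."""
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )

# Assumption: each tree is right 60% of the time, independently of the others
for n in (1, 11, 101):
    print(n, round(majority_correct_probability(n, 0.6), 3))
```

Under these assumptions the majority vote becomes more reliable as the number of trees grows, even though each individual tree is only slightly better than chance.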

#### Gini with Square root and tree number

##### Find the best tree number for gini with square root

In [None]:
a=50   # smallest forest size to try
b=400  # largest forest size to try
values=[]
treenumber=[]
# Sweep n_estimators from a to b (inclusive) in increments of 50
for i in range(a,b+1,50):
    RandomForest = RandomForestClassifier(n_estimators=i,criterion="gini",max_features="sqrt")
    RandomForest.fit(features_treino,has_cancer_treino)
    has_cancer_prediction=RandomForest.predict(features_teste)
    accuracy = accuracy_score(has_cancer_teste, has_cancer_prediction)
    values.append(accuracy)
    treenumber.append(i)
pyplot.plot(treenumber,values)
pyplot.xlabel('trees')
pyplot.ylabel('accuracy')
pyplot.title('Correlation between trees and Accuracy')
pyplot.show()
values=np.array(values)
# argmax returns the first maximum, so ties resolve to the smallest forest
besttree=treenumber[int(np.argmax(values))]
print(f"Best tree value={besttree}")
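
The notebook imports RandomizedSearchCV and randint, so a manual sweep like the one above could also be expressed as a randomized search that scores each candidate with cross-validation on the training data instead of on the test set. The sketch below runs on synthetic data from `make_classification` (an assumption, so that it is self-contained); `n_iter=5` and `cv=3` are arbitrary small values chosen to keep it quick.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data; the notebook's own features are not used here
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_distributions = {
    "n_estimators": randint(50, 401),      # integers in [50, 400]
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # small on purpose so the sketch runs quickly
    cv=3,                # 3-fold cross-validation on the training set
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.score(X_test, y_test))   # accuracy of the refitted best model
```

Because the candidates are compared with cross-validation, the test set is only touched once at the end, which gives a less optimistic estimate than choosing the tree number directly on the test data.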

##### Model With Best Tree Number

In [None]:
RandomForest = RandomForestClassifier(n_estimators=besttree,criterion="gini",max_features="sqrt")
RandomForest.fit(features_treino,has_cancer_treino)
has_cancer_prediction=RandomForest.predict(features_teste)

###### Accuracy


In [None]:
accuracy = accuracy_score(has_cancer_teste, has_cancer_prediction)
print(f"Accuracy = {accuracy} in a percentage of 100 = {accuracy*100}")

###### Precision

In [None]:
precision = precision_score(has_cancer_teste, has_cancer_prediction)
print(f"Precision = {precision} in a percentage of 100 = {precision*100}")
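
For reference, precision is TP / (TP + FP), and its companion recall (the notebook imports recall_score but does not use it) is TP / (TP + FN). The sketch below computes both by hand on made-up labels; `precision_recall` is a hypothetical helper and the label lists are not taken from the dataset.

```python
def precision_recall(y_true, y_pred, positive):
    """Compute precision and recall for the chosen positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: 1 = cancer, 0 = non-cancer
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(precision_recall(y_true, y_pred, positive=1))
```

Precision answers "of the entries predicted as cancer, how many really were?", while recall answers "of the real cancer entries, how many did we catch?".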

###### Confusion Matrix

In [None]:
cm=confusion_matrix(has_cancer_teste, has_cancer_prediction)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
# For a binary target, the order returned by ravel() is TN, FP, FN, TP
True_negatives, False_positives, False_negatives, True_positives=cm.ravel()
print(f"True Positives={True_positives} \n False Positives={False_positives} \n True Negatives={True_negatives} \n False Negatives={False_negatives}")
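
The layout of a binary confusion matrix is easy to mix up, so here is a hand-rolled sketch of it. It follows sklearn's convention of true labels as rows and predicted labels as columns, with labels in sorted order, so flattening row by row gives TN, FP, FN, TP. The labels are made up and `binary_confusion_matrix` is a hypothetical helper.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Rows are true labels (0, 1); columns are predicted labels (0, 1)."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# Made-up labels: 1 = cancer, 0 = non-cancer
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0]
cm = binary_confusion_matrix(y_true, y_pred)

# Flattening row by row yields TN, FP, FN, TP
tn, fp, fn, tp = [cell for row in cm for cell in row]
print(cm)               # [[3, 1], [1, 2]]
print(tn, fp, fn, tp)   # 3 1 1 2
```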

#### Gini with Log and tree number

##### Find the best tree number for gini with log

In [None]:
a=50   # smallest forest size to try
b=400  # largest forest size to try
values=[]
treenumber=[]
# Sweep n_estimators from a to b (inclusive) in increments of 50
for i in range(a,b+1,50):
    RandomForest = RandomForestClassifier(n_estimators=i,criterion="gini",max_features="log2")
    RandomForest.fit(features_treino,has_cancer_treino)
    has_cancer_prediction=RandomForest.predict(features_teste)
    accuracy = accuracy_score(has_cancer_teste, has_cancer_prediction)
    values.append(accuracy)
    treenumber.append(i)
pyplot.plot(treenumber,values)
pyplot.xlabel('trees')
pyplot.ylabel('accuracy')
pyplot.title('Correlation between trees and Accuracy')
pyplot.show()
values=np.array(values)
# argmax returns the first maximum, so ties resolve to the smallest forest
besttree=treenumber[int(np.argmax(values))]
print(f"Best tree value={besttree}")

##### Model With Best Tree Number

In [None]:
RandomForest = RandomForestClassifier(n_estimators=besttree,criterion="gini",max_features="log2")
RandomForest.fit(features_treino,has_cancer_treino)
has_cancer_prediction=RandomForest.predict(features_teste)

###### Accuracy

In [None]:
accuracy = accuracy_score(has_cancer_teste, has_cancer_prediction)
print(f"Accuracy = {accuracy} in a percentage of 100 = {accuracy*100}")

###### Precision

In [None]:
precision = precision_score(has_cancer_teste, has_cancer_prediction)
print(f"Precision = {precision} in a percentage of 100 = {precision*100}")

###### Confusion Matrix

In [1]:
cm=confusion_matrix(has_cancer_teste, has_cancer_prediction)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
# For a binary target, the order returned by ravel() is TN, FP, FN, TP
True_negatives, False_positives, False_negatives, True_positives=cm.ravel()
print(f"True Positives={True_positives} \n False Positives={False_positives} \n True Negatives={True_negatives} \n False Negatives={False_negatives}")
