## Random Forest classification with Scikit-Learn
The process of fitting a decision tree to our data can be done in Scikit-Learn with the **DecisionTreeClassifier** estimator:
  
```python
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier().fit(X, y)
```
However, as we have seen earlier, it’s better to **use Random Forest to reduce overfitting**. Applying the Random Forest classifier to the titanic data is very similar to Naïve Bayes.   
  
First we explore, transform and clean the data the same way as we did for Naïve Bayes.

In [56]:
# import the library 
import pandas as pd
url = 'https://raw.githubusercontent.com/HOGENT-Databases/DB3-Workshops/master/data/titanic.csv'
titanic = pd.read_csv(url)
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [57]:
# explore the data to estimate if we have enough (statistically relevant) data for both classes
titanic.groupby('Survived').count()

Unnamed: 0_level_0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
Survived,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,549,549,549,549,424,549,549,549,549,68,549
1,342,342,342,342,290,342,342,342,342,136,340


In [58]:
# We drop clearly irrelevant attributes. Pay attention for bias! Don't let your own opinion play. 
titanic = titanic.drop(['PassengerId','Name','Ticket','Fare','Cabin','Embarked'],axis=1)
titanic.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch
0,0,3,male,22.0,1,0
1,1,1,female,38.0,1,0
2,1,3,female,26.0,0,0
3,1,1,female,35.0,1,0
4,0,3,male,35.0,0,0


In [59]:
print('Before')
print(titanic.count())
print()

# drop all lines that contain empty (null or NaN) values
titanic = titanic.dropna()

print('After')
print(titanic.count())

Before
Survived    891
Pclass      891
Sex         891
Age         714
SibSp       891
Parch       891
dtype: int64

After
Survived    714
Pclass      714
Sex         714
Age         714
SibSp       714
Parch       714
dtype: int64


In [60]:
# see what remains
titanic.groupby('Survived').count()

Unnamed: 0_level_0,Pclass,Sex,Age,SibSp,Parch
Survived,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,424,424,424,424,424
1,290,290,290,290,290


In [61]:
import numpy as np
titanic['Sex'] = np.where(titanic['Sex']>='male', 1, 2)
titanic.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch
0,0,3,1,22.0,1,0
1,1,1,2,38.0,1,0
2,1,3,2,26.0,0,0
3,1,1,2,35.0,1,0
4,0,3,1,35.0,0,0


In [62]:
from sklearn.model_selection import train_test_split
X = titanic.drop('Survived',axis=1)
y = titanic['Survived']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)

Now we are ready to train the data using the RandomForestClassifier. Parameter n_estimators is the number of trees in the forest. Default = 10. 

In [63]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=300)
model.fit(X_train, y_train)

RandomForestClassifier(n_estimators=300)

In [64]:
y_test2 = model.predict(X_test)

In [65]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_test2)

0.8093023255813954

In [66]:
# We could also try to find the optimal number of trees in a automated way. 

best_accuracy = 0
best_trees = 0

for trees in range(50,550,50):
    model = RandomForestClassifier(n_estimators=trees)
    model.fit(X_train, y_train)    
    y_test2 = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_test2)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_trees = trees
        
print('Optimal number of trees = % s' %(best_trees))
print('Accuracy = % 3.2f' % (best_accuracy)) 

Optimal number of trees = 50
Accuracy =  0.82


The above approach involves a certain risk of _"leaking"_ the test data to the tuning of the algorithm. Indeed, we are in fact searching for the optimal hyperparameters (i.e. the number of trees) by using the same (randomly determined) test set in every iteration. This means we are tuning the classifier for a specific version of the test set instead of for the problem in general. A better approach would be to use a training, a validation and a test set and use the validation set for hyperparameter tuning and only use the test set to determine the accuracy after these tuning. Therefore, in the next cell, we further split the training data in a training and a validation set. To obtain maximum randomization we repeat this random split for each number of trees. 



In [67]:
X_remainder, X_test, y_remainder, y_test = train_test_split(X,y,test_size=0.30)

best_accuracy = 0
best_trees = 0

for trees in range(50,550,50):
    X_train, X_validation, y_train, y_validation = train_test_split(X_remainder,y_remainder,test_size=0.30)
    model = RandomForestClassifier(n_estimators=trees)
    model.fit(X_train, y_train)    
    y_validation2 = model.predict(X_validation)
    accuracy = accuracy_score(y_validation, y_validation2)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_trees = trees
        best_validation = model.predict(X_test)
        
print('Optimal number of trees = % s' %(best_trees))
print('Accuracy on validation set = % 3.2f' % (best_accuracy)) 
accuracyOnTestSet = accuracy_score(y_test, best_validation)
print('Accuracy on test set = % 3.2f' % (accuracyOnTestSet))

Optimal number of trees = 200
Accuracy on validation set =  0.85
Accuracy on test set =  0.78


It  turns out that the accuracy of the Random Forest classifier is very close to Naïve Bayes. However, Decision Tree and Random Forest classifiers have one major advantage over Naïve Bayes: you can **determine the relative importance of each feature**.  
  
From an ethical point of view this is also a very important feature if you have to declare why a model makes a certain decision, for instance in case of deciding to grant a loan to a bank customer.

In [68]:
print(X_train.columns)
print(model.feature_importances_)

Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch'], dtype='object')
[0.16389414 0.26973187 0.43242171 0.06725788 0.06669439]


In [69]:
# we now combine those two collections into a dataframe
pd.DataFrame(model.feature_importances_,columns=['Importance'],index=X_train.columns).sort_values(by='Importance',ascending=False)

Unnamed: 0,Importance
Age,0.432422
Sex,0.269732
Pclass,0.163894
SibSp,0.067258
Parch,0.066694


We learn from this that by far the three most important criteria to survive the Titanic disaster were: (1) Age, (2) Sex and (3) Ticket class. 

In [70]:
# Determine the false negative rate: what's the proportion of the passengers 
# who survived that we declared death. 
results = pd.DataFrame({'true':y_test,'estimated':y_test2})

results['TP'] = np.where((results['true'] == 1) & (results['estimated'] == 1),1,0)
results['TN'] = np.where((results['true'] == 0) & (results['estimated'] == 0),1,0)
results['FP'] = np.where((results['true'] == 0) & (results['estimated'] == 1),1,0)
results['FN'] = np.where((results['true'] == 1) & (results['estimated'] == 0),1,0)

FNrate = results['FN'].sum()/(results['FN'].sum() + results['TP'].sum())
print(FNrate)

0.6375


### One hot encoding for categorical features
One common type of non-numerical data is categorical data. For example, in the previous example, Sex = male or female. 

We encode male as 1 and female as 2 because the algorithms we used only work with numerical features. 
However, Scikit-Learn models make the fundamental assumption that numerical features reflect algebraic quantities, 
so in our example they would assume that a female is twice a male, which does not make much sense. 
In this case, one proven technique is to use one-hot encoding, which effectively creates extra columns indicating 
the presence or absence of a category with a value of 1 or 0 respectively. 
This has the benefit of not weighting a value improperly but does have the downside of adding more columns to the data set. 
Pandas supports this feature using the function _get_dummies_. This function is named this way because it creates 
dummy/indicator variables (aka 1 or 0).  
  
In our example we can then replace the line
```python
titanic['Sex'] = np.where(titanic['Sex']>='male', 1, 2)
```
by
```python
titanic = pd.get_dummies(titanic, columns=["Sex"], prefix=["Sex"])
```
Of course, you can also use one hot encoding for 'numerical' features that are categorical by nature. For example, if in the original titanic data the gender would have been encoded as 1 or 2.  
  
Let's now redo the modeling using one-hot-encoding for Sex. For brevity we don't search for the optimal number of trees, be use what we have found above.

In [71]:
url = 'https://raw.githubusercontent.com/HOGENT-Databases/DB3-Workshops/master/data/titanic.csv'
titanic = pd.read_csv(url)
titanic = titanic.drop(['PassengerId','Name','Ticket','Fare','Cabin','Embarked'],axis=1)
titanic = titanic.dropna()
titanic = pd.get_dummies(titanic, columns=["Sex"], prefix=["Sex"])
print(titanic.head())
X = titanic.drop('Survived',axis=1)
y = titanic['Survived']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)
model = RandomForestClassifier(n_estimators=best_trees) # ues the optimal number of trees we have found above
model.fit(X_train, y_train)
y_test2 = model.predict(X_test)
accuracy_score(y_test, y_test2)

   Survived  Pclass   Age  SibSp  Parch  Sex_female  Sex_male
0         0       3  22.0      1      0           0         1
1         1       1  38.0      1      0           1         0
2         1       3  26.0      0      0           1         0
3         1       1  35.0      1      0           1         0
4         0       3  35.0      0      0           0         1


0.7953488372093023

In this specific case one-hot-encoding is not offering an advantage but in general this is a better approach. 
A disadvantage of splitting the columns is that the relative importances are also split: 

In [72]:
importances = pd.DataFrame(model.feature_importances_,columns=['Importance'],index=X_train.columns).sort_values(by='Importance',ascending=False).reset_index()
importances


Unnamed: 0,index,Importance
0,Age,0.407596
1,Sex_male,0.194067
2,Sex_female,0.153117
3,Pclass,0.142342
4,SibSp,0.054395
5,Parch,0.048482


In [73]:
# We can group these relative importances together and make the sum of there values: 
importances['index'] = np.where(importances['index'].str.startswith ('Sex'),'Sex',importances['index'])
imp = importances.groupby(['index'])['Importance'].sum().reset_index().sort_values(by='Importance',ascending=False).reset_index()
imp

Unnamed: 0,level_0,index,Importance
0,0,Age,0.407596
1,3,Sex,0.347185
2,2,Pclass,0.142342
3,4,SibSp,0.054395
4,1,Parch,0.048482
