# Part 30: Random Forests

http://hamelg.blogspot.com/2015/12/python-for-data-analysis-part-30-random.html

A random forest model is a collection of decision tree models whose predictions are combined. When you build a random forest, you specify the number of decision trees to use. The algorithm then draws random samples of observations from your training data and builds a decision tree on each sample. The samples are typically drawn with replacement, so the same observation can appear more than once in a sample. The end result is a set of decision trees, each built from a different group of records drawn from the original training data.
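The sampling step described above can be sketched by hand. This is a minimal illustration, not how the library is implemented: it uses synthetic data from `make_classification` rather than the Titanic set, the tree count of 25 is an arbitrary choice, and combining trees by simple majority vote is an assumption made for clarity (scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic data used below)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative choice; odd, so a binary majority vote cannot tie

trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement, same size as the data
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Combine the trees by majority vote over their 0/1 class predictions
votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("Distinct rows in the last bootstrap sample:", len(np.unique(idx)))
print("Training accuracy of the hand-built ensemble:", (ensemble_pred == y).mean())
```

Because rows are drawn with replacement, each bootstrap sample contains fewer distinct rows than the original data, which is why the individual trees differ from one another.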

In [2]:
import numpy as np
import pandas as pd
import os

%matplotlib inline

In [3]:
os.chdir('/home/sindhuvarun/github/ML-Learning/staticsAndProbability/PythonForDataAnalytics/dataset/Titanic')
titanic_train = pd.read_csv('train.csv')

char_cabin = titanic_train["Cabin"].astype(str)     # Convert cabin to str

new_Cabin = np.array([cabin[0] for cabin in char_cabin]) # Take first letter

titanic_train["Cabin"] = pd.Categorical(new_Cabin)  # Save the new cabin var

# Impute median Age for NA Age values
new_age_var = np.where(titanic_train["Age"].isnull(), # Logical check
                       28,                       # Value if check is true
                       titanic_train["Age"])     # Value if check is false

titanic_train["Age"] = new_age_var

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing

In [9]:
np.random.seed(12)

label_encoder = preprocessing.LabelEncoder()

titanic_train['Sex'] = label_encoder.fit_transform(titanic_train['Sex'])
#titanic_train['Embarked'] = label_encoder.fit_transform(titanic_train['Embarked'])

# 1000 trees, 2 candidate features per split, out-of-bag scoring enabled
rf_model = RandomForestClassifier(n_estimators=1000, max_features=2, oob_score=True)

features = ['Sex', 'Pclass', 'SibSp', 'Age', 'Fare']

rf_model.fit(X=titanic_train[features], y=titanic_train['Survived'])

print("OOB Accuracy:")
print(rf_model.oob_score_)


OOB Accuracy:
0.8204264870931538
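The out-of-bag (OOB) score relies on the rows that a tree's bootstrap sample happened to leave out: each tree is scored on observations it never saw during training. The short simulation below, which assumes an illustrative sample size of 1000 rows rather than the Titanic data, estimates how large that left-out share tends to be.

```python
import numpy as np

rng = np.random.default_rng(12)
n_rows = 1000   # illustrative sample size (assumption)
n_draws = 500   # number of simulated bootstrap samples

oob_fractions = []
for _ in range(n_draws):
    # One bootstrap sample: n_rows draws with replacement
    idx = rng.integers(0, n_rows, size=n_rows)
    in_bag = np.zeros(n_rows, dtype=bool)
    in_bag[idx] = True
    # Share of rows that were never drawn, i.e. out-of-bag for this sample
    oob_fractions.append(1 - in_bag.mean())

mean_oob = np.mean(oob_fractions)
print("Mean out-of-bag fraction:", mean_oob)
print("Large-sample limit 1/e:  ", np.exp(-1))
```

Roughly a third of the rows end up out-of-bag for any given tree, so the OOB score acts as a built-in validation estimate without a separate hold-out set.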


The random forest classifier assigns an importance value to each feature used in training. Features with higher importance were more influential in building the model, which suggests a stronger association with the response variable. In scikit-learn these values are impurity-based and normalized so that they sum to 1.

In [10]:
for feature, imp in zip(features, rf_model.feature_importances_):
    print(feature, imp)

Sex 0.27345124368239543
Pclass 0.08964149328540527
SibSp 0.04911270949151079
Age 0.27740791776462315
Fare 0.3103866357760655
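Importance values are easier to read when they are labeled and sorted. The sketch below shows one way to do that with a pandas Series. It is fit on synthetic data with hypothetical feature names (`feature_0` to `feature_4`), so its numbers are unrelated to the Titanic results above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names (assumption)
X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=3, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Label each importance with its feature name and sort from most to least important
importances = pd.Series(model.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)

print(importances)
print("Sum of importances:", importances.sum())
```

In the notebook itself, the same pattern would apply to `rf_model.feature_importances_` with `features` as the index.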
