## Random Forests

Decision trees are models with low bias but high variance: as we have seen, they tend to overfit the training data. Bagging is therefore a good way to reduce the variance of a decision tree.
Instead of bagging decision trees directly, we can use a random forest, which is more convenient and better optimized for trees. The main issue with bagging is that the trees built on the bootstrap samples are not independent, i.e. their predictions are correlated. A random forest tweaks the bagging algorithm to reduce this correlation: it introduces extra randomness while each tree is built.
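Why correlation matters can be made precise with a standard result. As a sketch, assume the $B$ trees $T_1, \dots, T_B$ have identically distributed predictions at a point $x$, each with variance $\sigma^2$ and pairwise correlation $\rho$ (these symbols are introduced here for illustration). The variance of their average is then:

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding more trees shrinks only the second term. The first term stays unless $\rho$ itself is reduced, which is what the extra randomness is meant to achieve.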

Let’s understand how the algorithm works for a random forest model:

1)	Just like in bagging, different samples are drawn from the training dataset using bootstrapping.

2)	A tree is trained on each sample, and the trees are allowed to grow deep.

    The difference in a random forest lies in how the trees are formed. In bagging, every predictor may be considered when splitting a node; in a random forest this is not the case. When building a decision tree, each time a split is to happen, a random sample of ‘m’ predictors is chosen from the total ‘p’ predictors, and only those ‘m’ predictors may be used for the split.

    Why is that?

    Suppose that one of the ‘p’ predictors is very strong. It will remain the strongest predictor in every bootstrap sample, so every tree will choose it for its top splits and the trees will all look alike. This makes the trees' predictions correlated, and averaging correlated predictions does not reduce the variance much. That’s why a random forest restricts which predictors can be considered at each split: it introduces randomness into the formation of the trees as well.
    At each split, most of the predictors are not allowed to be considered.
    Generally, the value of ‘m’ is taken as m ≈ √p, where ‘p’ is the total number of predictors.

    When m = p, the random forest model becomes a bagging model.
              
    *This method is also referred to as “Feature Sampling”.

<img src="rf/7.PNG">


    The graph above shows the test classification error for different values of ‘m’.

3)	Once the trees are formed, the random forest makes a prediction by aggregating the predictions of all the trees. For a regression model, the final prediction is the mean of all the predictions; for a classification model, it is the mode of all the predictions.
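The three steps above can be sketched in a few lines. This is a minimal illustration, not how scikit-learn's own forest is implemented: the helper names `fit_forest` and `predict_forest`, the synthetic dataset, and the choice of 25 trees are all assumptions made for the example. Feature sampling at each split is delegated to `DecisionTreeClassifier(max_features="sqrt")`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, seed=0):
    """Fit unpruned trees on bootstrap samples, sampling features at each split."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample (draw n_samples row indices with replacement).
        rows = rng.integers(0, n_samples, size=n_samples)
        # Step 2: grow a deep tree; max_features="sqrt" restricts every split
        # to a random subset of about sqrt(p) predictors.
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[rows], y[rows])
        trees.append(tree)
    return trees


def predict_forest(trees, X):
    """Step 3: majority vote over the trees (labels are assumed to be 0/1)."""
    votes = np.stack([tree.predict(X) for tree in trees])  # shape (n_trees, n_rows)
    return (votes.mean(axis=0) >= 0.5).astype(int)


# Synthetic binary classification data, used only for illustration.
X, y = make_classification(
    n_samples=400, n_features=16, n_informative=6, random_state=0
)
trees = fit_forest(X[:300], y[:300])
y_hat = predict_forest(trees, X[300:])
print("held-out accuracy:", (y_hat == y[300:]).mean())
```

Passing `max_features=None` to the tree instead would let every split see all ‘p’ predictors, which corresponds to the m = p (plain bagging) case.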


#### Working of a Random Forest Model

<img src="rf/random_forest.PNG">

From the given dataset different samples are created by bootstrapping and these samples are used to train different decision trees. Once the training is complete, prediction is made using all the different models.
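Because a bootstrap sample is drawn with replacement, it repeats some rows and leaves others out. The short sketch below (the row count and seed are arbitrary choices for illustration) measures how many distinct rows end up in one sample and compares that with the theoretical limit 1 - 1/e.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000

# One bootstrap sample: n_rows indices drawn with replacement.
sample = rng.integers(0, n_rows, size=n_rows)

# Fraction of the original rows that appear at least once in the sample.
in_bag = np.unique(sample).size / n_rows
print(f"distinct rows in the sample: {in_bag:.3f}")
print(f"theoretical limit 1 - 1/e:   {1 - np.exp(-1):.3f}")
```

Each tree therefore sees a different subset of roughly two thirds of the rows, which is one source of the differences between the trees.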


#### Predicting Outcome

<img src="rf/random_forest2.PNG">

The random forest makes its prediction by taking the mode of the predictions made by all the trees, since this is a classification problem. This process is also known as “majority voting”.
We can also use predicted probabilities to make the final prediction. The predict_proba method predicts, for each row, a probability from 0 to 1 that a given class is the right one. For a problem whose output is only 0 or 1, we get a matrix with as many rows as there are in the data and 2 columns. predict_proba will return something like this:

<img src="rf/8.png">

Each row corresponds to a prediction. The first column is the probability that the prediction is a 0, the second column is the probability that the prediction is a 1. Each row adds up to 1.

If we just take the second column, we get the probability the classifier assigns to class 1 for each row. If there's a 0.9 probability that the correct classification is 1, we can use 0.9 as the value the classifier is predicting. This gives us a continuous output in a single vector instead of just 0 or 1.
We can then add together the vectors obtained this way from every member of the ensemble and divide by the number of vectors to get the ensemble's mean prediction. Rounding the result gives 0 or 1 predictions.
Similarly, for regression, the random forest makes its prediction by taking the mean of the predictions made by the different trees. 
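The probability-averaging procedure described above can be checked directly against scikit-learn. The sketch below uses a synthetic dataset and a 50-tree forest (both are assumptions for illustration): it collects the second `predict_proba` column from every tree in `estimators_`, averages the vectors, and thresholds the result at 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Column 1 of each tree's predict_proba is that tree's P(class == 1) per row.
per_tree = np.stack(
    [tree.predict_proba(X)[:, 1] for tree in forest.estimators_]
)  # shape (n_trees, n_rows)

# Add the vectors, divide by the number of trees, then turn into 0/1 labels.
mean_proba = per_tree.mean(axis=0)
labels = (mean_proba > 0.5).astype(int)

# The forest's own probabilities, for comparison.
forest_proba = forest.predict_proba(X)
print(np.allclose(forest_proba[:, 1], mean_proba))
```

In scikit-learn, `RandomForestClassifier` combines its trees this way: it averages their predicted probabilities and picks the class with the highest mean, rather than counting hard votes.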

#### Advantages and Disadvantages of Random Forest:

1)	It can be used for both regression and classification problems.

2)	Since the base model is a tree, missing values can be handled relatively easily, although this depends on the implementation.

3)	It generally gives accurate results with low variance.

4)	The results of a random forest are much harder to interpret than those of a single decision tree.

5)	It takes more computation time than a single tree or other simpler models.


Random forest should be used where accuracy is of the utmost priority and interpretability is not very important, and where the extra computation time is an acceptable cost for the improved results.


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('ML_df.csv')
df.head()

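| Unnamed: 0 | carat | depth | table | x | y | z | Good | Ideal | Premium | Very Good | ... | I | J | IF | SI1 | SI2 | VS1 | VS2 | VVS1 | VVS2 | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.198168 | -0.174092 | -1.099672 | -1.587837 | -1.536196 | -1.571129 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 326 |
| 1 | -1.240361 | -1.360738 | 1.585529 | -1.641325 | -1.658774 | -1.741175 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 326 |
| 2 | -1.198168 | -3.385019 | 3.375663 | -1.498691 | -1.457395 | -1.741175 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 327 |
| 3 | -1.071587 | 0.454133 | 0.242928 | -1.364971 | -1.317305 | -1.28772 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 334 |
| 4 | -1.029394 | 1.082358 | 0.242928 | -1.240167 | -1.212238 | -1.117674 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 335 |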


In [3]:
from sklearn.model_selection import train_test_split

In [4]:
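# Note: 'Unnamed: 0' (most likely the row index saved with the CSV) stays in X here,
# so the model can use it as a feature; the scores below were obtained with it included.
X = df.drop('price',axis=1)
y = df.price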

In [8]:
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=33)

In [9]:
from sklearn.ensemble import RandomForestRegressor

In [10]:
# Baseline model with default hyperparameters (named rand_clf here, although it is a regressor)
rand_clf = RandomForestRegressor()
rand_clf.fit(x_train,y_train)

RandomForestRegressor()

In [11]:
rand_clf.score(x_train,y_train)

0.9965370254679945

In [12]:
rand_clf.score(x_test,y_test)

0.9719221695838091
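For a regressor, `score` returns the coefficient of determination R², not an accuracy. The sketch below recomputes it by hand to show what the number means. It runs on synthetic data from `make_regression`, not on the diamonds data, and the helper `r_squared` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data, used only for illustration.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=33)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)


def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


manual = r_squared(y_te, model.predict(X_te))
builtin = model.score(X_te, y_te)
train_gap = model.score(X_tr, y_tr) - builtin
print(f"manual R^2: {manual:.4f}, model.score: {builtin:.4f}")
print(f"train R^2 minus test R^2: {train_gap:.4f}")
```

A training score that is higher than the test score, as in the two cells above, is expected for a forest of deep trees; the size of the gap gives a rough idea of how much the model overfits.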

In [14]:
import numpy as np
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; for regressors, 1.0 is the equivalent value)
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = range(2,11,1)
# Minimum number of samples required at each leaf node
min_samples_leaf = range(2,11,1)
# Method of selecting samples for training each tree
bootstrap = [True, False]

In [15]:
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [16]:
from sklearn.model_selection import RandomizedSearchCV 

In [17]:
rand_search = RandomizedSearchCV(estimator=rand_clf,param_distributions=random_grid,cv=5,n_jobs=-1,verbose=3,random_state=33)
rand_search.fit(x_train,y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits




RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(), n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, 110,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': range(2, 11),
                                        'min_samples_split': range(2, 11),
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   random_state=33, verbose=3)
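The log line "10 candidates, totalling 50 fits" follows from the `RandomizedSearchCV` default of `n_iter=10` combined with `cv=5`. The sketch below counts how large the full grid is, to show how small a fraction of it the randomized search samples. The candidate lists are re-typed from the cell above, with `1.0` standing in for `'auto'`, which is an assumption about the equivalent value in newer scikit-learn versions.

```python
import math

# Candidate values re-typed from the search space defined earlier.
random_grid = {
    "n_estimators": list(range(200, 2001, 200)),       # 200, 400, ..., 2000
    "max_features": [1.0, "sqrt"],                     # 1.0 stands in for 'auto'
    "max_depth": list(range(10, 111, 10)) + [None],    # 10, 20, ..., 110, None
    "min_samples_split": list(range(2, 11)),
    "min_samples_leaf": list(range(2, 11)),
    "bootstrap": [True, False],
}

# Size of the full grid: product of the number of options per parameter.
n_combinations = math.prod(len(values) for values in random_grid.values())

n_iter, cv = 10, 5  # RandomizedSearchCV default n_iter, and cv=5 as used above
n_fits = n_iter * cv

print("combinations in the full grid:", n_combinations)
print("model fits in the randomized search:", n_fits)
```

The full grid has 38,880 combinations, so an exhaustive grid search with 5 folds would need 194,400 fits; the randomized search tries 10 of those combinations, and raising `n_iter` trades run time for a better chance of finding good settings.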

In [18]:
rand_search.best_params_

{'n_estimators': 1400,
 'min_samples_split': 3,
 'min_samples_leaf': 5,
 'max_features': 'auto',
 'max_depth': 110,
 'bootstrap': True}

In [19]:
best_random_grid = rand_search.best_estimator_

In [20]:
y_pred = best_random_grid.predict(x_test)

In [21]:
rand_search.score(x_train,y_train)

0.9856957432917445

In [22]:
rand_search.score(x_test,y_test)

0.9697848969693821

In [23]:
import pickle

In [24]:
filename = 'random_forest.pickle'
# Use a context manager so the file is closed once the model has been written
with open(filename,'wb') as f:
    pickle.dump(rand_search,f)
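To reuse a saved model later, it can be loaded back with `pickle.load`. The sketch below is self-contained: it uses a small synthetic model and a temporary directory as stand-ins for the fitted search object and the real file location (both are assumptions for illustration).

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small stand-in model trained on synthetic data.
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "random_forest.pickle")

    # Save the fitted model.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load it back, as a later session would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X), restored.predict(X))
print("predictions match after reload:", same)
```

Pickle files should only be loaded from trusted sources, and a pickled model is best loaded with the same scikit-learn version that was used to save it.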