
# Random Forest 


Random forest is a supervised machine learning algorithm based on ensemble learning. 
Ensemble learning combines several models, either different algorithms or many instances of the same algorithm, into a single, more powerful prediction model. 


The random forest algorithm combines multiple models of the same type, namely decision trees, into a forest of trees, hence the name "Random Forest". It can be used for both regression and classification tasks.

### How the Random Forest Algorithm Works


The following are the basic steps involved in performing the random forest algorithm:

    1) Pick N random records from the dataset.

    2) Build a decision tree based on these N records.

    3) Choose the number of trees you want in your algorithm and repeat steps 1 and 2.

In a regression problem, each tree in the forest predicts a value for Y (the output) for a new record. 
The final value is the average of the values predicted by all the trees in the forest. 

In a classification problem, each tree in the forest predicts the category to which the new record belongs. The new record is then assigned to the category that wins the majority vote.
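
The steps above can be sketched by hand with individual decision trees. This is a minimal illustration rather than the way Scikit-Learn implements its forest: the synthetic data, the number of trees, and the names `trees`, `per_tree`, and `votes` are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption, for illustration only).
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

n_trees = 25
trees = []
for _ in range(n_trees):
    # Step 1: pick N random records (here sampled with replacement).
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: build one decision tree on those records.
    tree = DecisionTreeRegressor(random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Regression: the forest prediction is the mean of the per-tree predictions.
X_new = np.array([[0.5, -1.0], [2.0, 1.0]])
per_tree = np.stack([t.predict(X_new) for t in trees])  # shape (n_trees, 2)
forest_pred = per_tree.mean(axis=0)
print(forest_pred)

# Classification analogue: each tree votes and the most common label wins.
votes = np.array([0, 1, 1, 2, 1])  # hypothetical labels from five trees
majority = np.bincount(votes).argmax()
print(majority)
```

Scikit-Learn's own random forest classes can additionally restrict each split to a random subset of features (the `max_features` parameter), which this sketch leaves out.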

## Advantages of using Random Forest
As with any algorithm, there are advantages and disadvantages to using it. In the next two sections we'll take a look at the pros and cons of using random forest for classification and regression.



1) The random forest algorithm is less prone to overfitting than a single decision tree, since there are multiple trees and each tree is trained on a different random subset of the data. 
    
    Basically, the random forest algorithm relies on the power of "the crowd": averaging over many trees cancels out much of the error made by any individual tree.

2) This algorithm is very stable. Even if a new data point is introduced into the dataset, the overall model is not affected much: the new data may impact one tree, but it is very hard for it to impact all the trees.

3) The random forest algorithm works well when you have both categorical and numerical features.

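4) The random forest algorithm also works well when the data has not been scaled, because each tree splits on feature thresholds rather than on distances. It is often described as tolerant of missing values as well, although whether missing values are handled natively depends on the implementation.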
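
The point about scaling can be checked with a small experiment. The sketch below uses synthetic data (an assumption, not the petrol dataset) with one feature in the tens and one in the thousands, and trains the same forest on raw and on standardized features. The two sets of predictions are expected to be nearly identical, since standardization does not change the order of the values within a feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (illustrative assumption).
X = np.column_stack([
    rng.uniform(5, 10, 300),       # values around ten
    rng.uniform(3000, 5000, 300),  # values in the thousands
])
y = 50 * X[:, 0] - 0.05 * X[:, 1] + rng.normal(scale=5, size=300)

X_train, X_test = X[:240], X[240:]
y_train = y[:240]

# Same model and seed, once on raw features and once on standardized ones.
raw = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_train, y_train)

scaler = StandardScaler().fit(X_train)
scaled = RandomForestRegressor(n_estimators=20, random_state=0).fit(
    scaler.transform(X_train), y_train
)

pred_raw = raw.predict(X_test)
pred_scaled = scaled.predict(scaler.transform(X_test))
max_gap = np.max(np.abs(pred_raw - pred_scaled))
print("largest difference between the two sets of predictions:", max_gap)
```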

## Disadvantages of using Random Forest


1) A major disadvantage of random forests lies in their complexity. They require much more computational resources, owing to the large number of decision trees joined together.

2) Due to their complexity, they require much more time to train than other comparable algorithms.

#### Problem Definition
The problem here is to predict the gas consumption (in millions of gallons) in 48 of the US states based on petrol tax (in cents), per capita income (in dollars), paved highways (in miles) and the proportion of the population with a driver's license.

#### Solution
To solve this regression problem we will use the random forest algorithm via the Scikit-Learn Python library. We will follow the traditional machine learning pipeline to solve this problem.

In [0]:
import pandas as pd
import numpy as np

In [0]:
# import the dataset:
dataset = pd.read_csv('https://raw.githubusercontent.com/MukundVarmaT/PythonMLworkshop/master/ML/petrol_consumption.csv')

In [13]:
# To get a high-level view of what the dataset looks like, execute the following command:
dataset.head()

   Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
0         9.0            3571            1976                         0.525                 541
1         9.0            4092            1250                         0.572                 524
2         9.0            3865            1586                         0.580                 561
3         7.5            4870            2351                         0.529                 414
4         8.0            4399             431                         0.544                 410


We can see that the columns in our dataset are on very different scales. We will standardize them before training the algorithm.

### Preparing Data For Training
Two tasks will be performed in this section. The first task is to divide data into 'attributes' and 'label' sets. The resultant data is then divided into training and test sets.

The following script divides data into attributes and labels:

In [0]:
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
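
To see what `test_size=0.2` means for a dataset of 48 records, the following sketch runs the same split on placeholder arrays of the same shape. The arrays `X_demo` and `y_demo` are made up for the illustration and contain no real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays shaped like the petrol data: 48 rows, 4 features.
X_demo = np.arange(48 * 4).reshape(48, 4)
y_demo = np.arange(48)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# 20% of 48 is 9.6, which is rounded up to 10 test rows; 38 rows remain for training.
print(X_tr.shape, X_te.shape)  # (38, 4) (10, 4)
```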

### Feature Scaling
Our features are not yet on a common scale: for instance, the Average_income field has values in the thousands while Petrol_tax has values close to ten. 

Therefore, it would be beneficial to scale our data (although, as mentioned earlier, this step isn't as important for the random forest algorithm).

To do so, we will use Scikit-Learn's StandardScaler class. Execute the following code:

In [0]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
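
What `StandardScaler` computes can be reproduced by hand: each feature has its training mean subtracted and is divided by its training standard deviation, and the test set reuses those training statistics. The sketch below checks this on a few rows borrowed from the table above (Petrol_tax and Average_income only); treating the third row as "test" data is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Petrol_tax and Average_income for a handful of rows (illustrative subset).
train = np.array([[9.0, 3571.0], [9.0, 4092.0], [7.5, 4870.0], [8.0, 4399.0]])
test = np.array([[9.0, 3865.0]])

# Statistics come from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

manual_train = (train - mean) / std
manual_test = (test - mean) / std  # test rows reuse the training mean and std

demo_scaler = StandardScaler().fit(train)
print(np.allclose(manual_train, demo_scaler.transform(train)))
print(np.allclose(manual_test, demo_scaler.transform(test)))
```

Fitting on the training set and only transforming the test set, as in the cell above, keeps information about the test records out of the preprocessing step.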

Now that we have scaled our dataset, it is time to train our random forest algorithm to solve this regression problem. Execute the following code:


In [17]:

from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
# Put each true value next to the prediction for the same test record
df = pd.DataFrame({'True Value': y_test, 'Model Predicted': y_pred})
print(df)


   True Value  Model Predicted
0         534           574.10
1         410           514.60
2         577           604.80
3         571           589.75
4         577           625.55
5         704           592.50
6         487           594.90
7         587           573.30
8         467           468.55
9         580           536.80



The RandomForestRegressor class of the sklearn.ensemble module is used to solve regression problems via random forest.
The most important parameter of the RandomForestRegressor class is the n_estimators parameter. 

This parameter defines the number of trees in the random forest. We start with n_estimators=20 to see how our algorithm performs.
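
One way to get a feel for `n_estimators` is to train forests of several sizes and compare their test error. The sketch below does this on synthetic data from `make_regression` (an assumption, so the numbers say nothing about the petrol dataset); the sizes 5, 20, and 100 and the name `mae_by_size` are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem with four features (illustrative only).
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

mae_by_size = {}
for n in (5, 20, 100):
    model = RandomForestRegressor(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    mae_by_size[n] = mean_absolute_error(y_te, model.predict(X_te))
    print(n, "trees -> MAE", round(mae_by_size[n], 2))
```

More trees usually give steadier predictions at the cost of longer training, so in practice the value is tuned rather than fixed in advance.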

### Evaluating the Algorithm
The final step of solving a machine learning problem is to evaluate the performance of the algorithm. For regression problems, commonly used metrics are mean absolute error, mean squared error, and root mean squared error. Execute the following code to find these values:

In [19]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 51.76500000000001
Mean Squared Error: 4216.166749999999
Root Mean Squared Error: 64.93201637097064
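
The three metrics are simple enough to compute by hand: the mean absolute error averages $|y_i - \hat{y}_i|$, the mean squared error averages $(y_i - \hat{y}_i)^2$, and the root mean squared error is the square root of the latter. The sketch below applies these definitions with NumPy to the ten test targets and ten predictions printed by the run above; the array names `y_true` and `y_hat` are chosen for the illustration.

```python
import numpy as np

# The ten test targets and the ten forest predictions from the printed run.
y_true = np.array([534, 410, 577, 571, 577, 704, 487, 587, 467, 580])
y_hat = np.array([574.10, 514.60, 604.80, 589.75, 625.55,
                  592.50, 594.90, 573.30, 468.55, 536.80])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))  # mean absolute error
mse = np.mean(errors ** 2)     # mean squared error
rmse = np.sqrt(mse)            # root mean squared error

print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
```

Up to floating-point rounding, these agree with the values reported by `sklearn.metrics`.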
