# Random Forest for Regression

## Introduction

In this Jupyter Notebook, we're going to work on a dataset that I retrieved from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/student+performance). It contains many features related to students' performance. We'll build a machine learning model that predicts students' final grades from their preceding grades.

### How does it work?



Random Forest (for classification) is already explained in this repository. You can find further information [here](../Random%20Forest%20-%20Classification/). A Random Forest consists of **Decision Trees** as base estimators. If you would like a refresher on Decision Trees, you can look at [this notebook](..//Decision%20Tree%20-%20Regression/). Each base estimator is trained on a bootstrap sample of the training set. For a new instance, every tree makes its own prediction. In regression, the final prediction is the average of the trees' predictions (majority voting is the classification counterpart).

<img src="img.png" width="500" height="500" align="left"/>
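The bootstrap-and-average idea can be sketched by hand. This is only a minimal illustration on synthetic data, not the internals of `RandomForestRegressor`: the arrays `X_demo` and `y_demo`, the number of trees, and the coefficients used to generate the target are all assumptions made for the example. A real Random Forest can additionally consider a random subset of features at each split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): two features, noisy linear target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 20, size=(200, 2))
y_demo = 0.3 * X_demo[:, 0] + 0.7 * X_demo[:, 1] + rng.normal(0, 1, size=200)

n_trees = 25
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeRegressor(random_state=i)
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Each tree predicts on the new instances
X_new = np.array([[10.0, 12.0], [5.0, 6.0]])
per_tree = np.stack([t.predict(X_new) for t in trees])  # shape: (n_trees, n_new)

# Regression: the ensemble prediction is the mean over trees
forest_pred = per_tree.mean(axis=0)
print(forest_pred)
```

Averaging many trees trained on different bootstrap samples tends to reduce the variance of a single, fully grown tree.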

## 1. First steps

First of all, we need to import the necessary libraries and packages.

In [4]:
# Import fundamental libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# We'll use RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

# To evaluate our model
from sklearn.metrics import mean_squared_error

# To split our data into training and test sets
from sklearn.model_selection import train_test_split

# Import the GridSearchCV for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# We are going to visualize our tree
from sklearn.tree import plot_tree

# Render the resulting plots inline in the notebook
%matplotlib inline

# Enable Jupyter Notebook's intellisense
%config IPCompleter.greedy=True

Load the dataset into a DataFrame named `data`.

In [5]:
data = pd.read_csv('../datasets/student-mat.csv',delimiter=';')

## 2. Exploration and Preprocessing

Before building the model, we have to do some exploratory data analysis on the data we're working with.

In [6]:
# Print first five rows
display(data.head())

# Print the summary statistics
print("\nSummary Statistics\n")
print(data.describe())

# Print the information (info() prints directly and returns None)
print("\nInfo\n")
data.info()

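|   | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|--------|-----|-----|---------|---------|---------|------|------|------|------|-----|--------|----------|-------|------|------|--------|----------|----|----|----|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |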



Summary Statistics

              age        Medu        Fedu  traveltime   studytime    failures  \
count  395.000000  395.000000  395.000000  395.000000  395.000000  395.000000   
mean    16.696203    2.749367    2.521519    1.448101    2.035443    0.334177   
std      1.276043    1.094735    1.088201    0.697505    0.839240    0.743651   
min     15.000000    0.000000    0.000000    1.000000    1.000000    0.000000   
25%     16.000000    2.000000    2.000000    1.000000    1.000000    0.000000   
50%     17.000000    3.000000    2.000000    1.000000    2.000000    0.000000   
75%     18.000000    4.000000    3.000000    2.000000    2.000000    0.000000   
max     22.000000    4.000000    4.000000    4.000000    4.000000    3.000000   

           famrel    freetime       goout        Dalc        Walc      health  \
count  395.000000  395.000000  395.000000  395.000000  395.000000  395.000000   
mean     3.944304    3.235443    3.108861    1.481013    2.291139    3.554430   
std   

As we can see, there are lots of features. We only need the G1, G2, and G3 columns, which hold the exam grades. According to the info output, there are no missing values to deal with.
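The missing-value check and the relationship between the grade columns can also be verified directly. The sketch below is self-contained, so it rebuilds only the G1, G2, and G3 values of the five rows shown in the `head()` output above; the names `head_grades`, `missing_per_column`, and `corr_with_target` are introduced just for this example. On the full dataset the same calls would be applied to `data[['G1','G2','G3']]`.

```python
import pandas as pd

# G1, G2, G3 values copied from the five rows printed by data.head()
head_grades = pd.DataFrame({
    "G1": [5, 5, 7, 15, 6],
    "G2": [6, 5, 8, 14, 10],
    "G3": [6, 6, 10, 15, 10],
})

# Count missing values per column (0 everywhere means nothing to impute)
missing_per_column = head_grades.isnull().sum()
print(missing_per_column)

# Pearson correlation of each column with the target G3
corr_with_target = head_grades.corr()["G3"]
print(corr_with_target)
```

Five rows are far too few to draw conclusions from; the point is only to show which calls to use.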

In [7]:
# Select the features
X = data[['G1','G2']]

# Select the target
y = data['G3']

## 3. Build the model

Now we can start building our regression model. First, split the data into training and test sets.

In [8]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=34)
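To see what `test_size=0.2` and `random_state=34` do, here is a small sketch on placeholder arrays with the same number of rows as the dataset (395, per the summary statistics). The arrays `X_demo` and `y_demo` are stand-ins, not the real grades.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 395 rows and 2 feature columns (stand-ins for G1, G2)
X_demo = np.arange(395 * 2).reshape(395, 2)
y_demo = np.arange(395)

# 20% of the rows go to the test set, the rest to the training set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=34
)
print(X_tr.shape, X_te.shape)

# Fixing random_state makes the shuffle reproducible: same split every time
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=34
)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```

Without a fixed `random_state`, the reported errors would change slightly from run to run.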

Let's create the Random Forest and tune its hyperparameters with a grid search.

In [9]:
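# Create the parameters dictionary
# Note: scikit-learn 1.0 renamed these criteria to 'squared_error' and
# 'absolute_error'; the old names 'mse' and 'mae' were removed in 1.2
parameters = {'n_estimators':[100,200,300,500,700],'criterion':['mse','mae']}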

# Initialize the regressor
rf = RandomForestRegressor()

# Initialize the GridSearchCV
search = GridSearchCV(estimator=rf, param_grid=parameters, cv=10)

# Fit the grid search
search.fit(X_train, y_train)

# Print the best parameters and the best mean cross-validated R^2
print("Best parameters:",search.best_params_)
print("Best R^2:",search.best_score_)

# Retrieve the best estimator (GridSearchCV has already refit it on the training set)
rf_best = search.best_estimator_

Best parameters: {'criterion': 'mae', 'n_estimators': 200}
Best R^2: 0.8068209650801483
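The reported best score is the mean of the cross-validation scores for the winning parameter combination. Since no `scoring` argument is passed, the regressor's own `score` method (R^2) is used. The sketch below shows how to inspect all candidates through `cv_results_`. It runs on synthetic data with a much smaller grid to stay fast, so `X_demo`, `y_demo`, and the grid values are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: two grade-like features and a noisy target
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 21, size=(120, 2)).astype(float)
y_demo = X_demo.mean(axis=1) + rng.normal(0, 1, size=120)

# A deliberately small grid so the example runs quickly
search_demo = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 50]},
    cv=5,
)
search_demo.fit(X_demo, y_demo)

# One row per parameter combination, with mean and std of the fold scores
results = pd.DataFrame(search_demo.cv_results_)[
    ["param_n_estimators", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results)
print("Best mean CV R^2:", search_demo.best_score_)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the gap between two candidates is larger than the fold-to-fold noise.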


## 4. Evaluate the model

In [10]:
# Make predictions
y_pred_test = rf_best.predict(X_test)
y_pred_train = rf_best.predict(X_train)

# TRAINING ERRORS
print("\nTraining Errors:\n ")

# Mean Squared Error
print("MSE:",mean_squared_error(y_train, y_pred_train))

# Root Mean Squared Error
print("RMSE:",np.sqrt(mean_squared_error(y_train, y_pred_train)))
      
# R^2
print("R^2:",rf_best.score(X_train, y_train))


# TEST ERRORS
print("\nTest Errors:\n ")

# Mean Squared Error
print("MSE:",mean_squared_error(y_test, y_pred_test))

# Root Mean Squared Error
print("RMSE:",np.sqrt(mean_squared_error(y_test, y_pred_test)))
      
# R^2
print("R^2:",rf_best.score(X_test, y_test))



Training Errors:
 
MSE: 3.02726914556962
RMSE: 1.7399049242902958
R^2: 0.8498291151427237

Test Errors:
 
MSE: 5.148100949367089
RMSE: 2.2689426941567055
R^2: 0.782598735858121


This is the end of the notebook. I hope it helps you understand the basics of RandomForestRegressor.