# Random forest model 

A random forest is a somewhat more sophisticated model than linear regression. I will fit it to the same data as the linear regression (LR) model so the two can be compared.

Code based on [this](https://towardsdatascience.com/random-forest-in-python-24d0893d51c0) article.

In [2]:
# Firstly import the modules
import numpy as np 
import pandas as pd 
from sklearn.model_selection import train_test_split
import sklearn.metrics as skm

In [3]:
# Import the same data as before
data = "https://raw.githubusercontent.com/Blackman9t/Machine_Learning/master/Original_2000_2014_Fuel_Consumption_Ratings.csv"
missing_data = ["n/a","na","--","?","non","Non","None"]

df = pd.read_csv(data, na_values=missing_data)

df.rename(columns={'FUEL_CONSUMPTION_CITY(L/100km)': 'FUEL_CONS_CITY',
                   'ENGINE_SIZE(L)': 'ENGINE_SIZE',
                   'HWY_(L/100km)': 'HWY_L100km',
                   'COMB_(L/100km)': 'COMB_L100km',
                   'COMB_(mpg)': 'COMB_MPG',
                   'CO2_EMISSIONS(g/km)': 'CO2_EMISSIONS'},
          inplace=True)
target = df["CO2_EMISSIONS"]
df.head()

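| | MODEL_YEAR | MAKE | MODEL | VEHICLE_CLASS | ENGINE_SIZE | CYLINDERS | TRANSMISSION | FUEL_TYPE | FUEL_CONS_CITY | HWY_L100km | COMB_L100km | COMB_MPG | CO2_EMISSIONS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | ACURA | 1.6EL | COMPACT | 1.6 | 4 | A4 | X | 9.2 | 6.7 | 8.1 | 35 | 186 |
| 1 | 2000 | ACURA | 1.6EL | COMPACT | 1.6 | 4 | M5 | X | 8.5 | 6.5 | 7.6 | 37 | 175 |
| 2 | 2000 | ACURA | 3.2TL | MID-SIZE | 3.2 | 6 | AS5 | Z | 12.2 | 7.4 | 10.0 | 28 | 230 |
| 3 | 2000 | ACURA | 3.5RL | MID-SIZE | 3.5 | 6 | A4 | Z | 13.4 | 9.2 | 11.5 | 25 | 264 |
| 4 | 2000 | ACURA | INTEGRA | SUBCOMPACT | 1.8 | 4 | A4 | X | 10.0 | 7.0 | 8.6 | 33 | 198 |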


In [4]:
# Keep only the columns that are strongly correlated with the target.
# Previous work showed which columns are highly correlated and which are collinear, so the rest can be dropped.

df = df[['ENGINE_SIZE','CYLINDERS','FUEL_CONS_CITY','COMB_MPG', 'HWY_L100km', 'COMB_L100km']]
# This second selection supersedes the first and leaves three features
df = df[['ENGINE_SIZE','CYLINDERS','COMB_MPG']]

# check the target hasn't changed
target.head(3)

0    186
1    175
2    230
Name: CO2_EMISSIONS, dtype: int64

In [5]:
features = df 
features.describe()
df.head()

| | ENGINE_SIZE | CYLINDERS | COMB_MPG |
|---|---|---|---|
| 0 | 1.6 | 4 | 35 |
| 1 | 1.6 | 4 | 37 |
| 2 | 3.2 | 6 | 28 |
| 3 | 3.5 | 6 | 25 |
| 4 | 1.8 | 4 | 33 |


#### One-hot encoding
One-hot encoding converts non-numerical (categorical) variables into binary indicator columns that can be fed into a model.

For this model, one-hot encoding is not necessary because all the selected features are numerical.
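As a minimal sketch of what one-hot encoding would look like if a categorical column were kept, the example below applies `pd.get_dummies` to a tiny hand-made frame. The frame `demo` is made up for illustration; only the `FUEL_TYPE` codes `X` and `Z` are taken from the first rows of the data.

```python
import pandas as pd

# Tiny illustrative frame (not the real dataset): one numeric and one categorical column
demo = pd.DataFrame({
    "ENGINE_SIZE": [1.6, 3.2, 1.8],
    "FUEL_TYPE": ["X", "Z", "X"],
})

# Replace FUEL_TYPE with one indicator column per category
encoded = pd.get_dummies(demo, columns=["FUEL_TYPE"])
print(list(encoded.columns))
```

Each category becomes its own column, so a column with many distinct values (such as `MODEL`) would add many features.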

In [6]:
labels = np.array(target)
labels

array([186, 175, 230, ..., 237, 225, 258], dtype=int64)

In [7]:
# Saving feature names for later use
feature_list = list(features.columns)
print(feature_list)

['ENGINE_SIZE', 'CYLINDERS', 'COMB_MPG']


In [8]:
# Convert to numpy array
features = np.array(features)

In [9]:
# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25, random_state = 42)
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (10757, 3)
Training Labels Shape: (10757,)
Testing Features Shape: (3586, 3)
Testing Labels Shape: (3586,)
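The shapes above follow from how `train_test_split` sizes the two sets when `test_size` is a fraction: the test set gets the row count times `test_size`, rounded up, and the training set gets the rest. The sketch below checks this on placeholder arrays with the same number of rows (14343, the sum of the two shapes above). The arrays are dummies, not the real features.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 10757 + 3586  # total rows implied by the shapes printed above
dummy_features = np.zeros((n_rows, 3))  # placeholder for the three feature columns
dummy_labels = np.zeros(n_rows)

X_train, X_test, y_train, y_test = train_test_split(
    dummy_features, dummy_labels, test_size=0.25, random_state=42
)

# Test-set size expected from rounding up 25% of the rows
expected_test = math.ceil(0.25 * n_rows)
print(X_train.shape, X_test.shape, expected_test)
```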


### Create a baseline model that we can evaluate against.

The OLS model from before can serve as the baseline. Its mean absolute error was ***14.95***, which is the target to beat.
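If the earlier model were not available, a simple fallback baseline is to predict the mean of the training labels for every test row and measure its MAE. The sketch below shows the idea on a handful of made-up CO2 values. It is not the OLS baseline quoted above, and the numbers are illustrative only.

```python
import numpy as np

# Illustrative labels only (not the real train/test split)
demo_train_labels = np.array([186, 175, 230, 264])
demo_test_labels = np.array([198, 237])

# Naive baseline: always predict the training mean
baseline_pred = np.full(demo_test_labels.shape, demo_train_labels.mean())

# Mean absolute error of the naive baseline
baseline_mae = np.mean(np.abs(demo_test_labels - baseline_pred))
print(baseline_mae)
```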


## Now we can apply the model. 



In [10]:
# import model
from sklearn.ensemble import RandomForestRegressor


In [11]:
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
# Train the model on training data
rf.fit(train_features, train_labels)

RandomForestRegressor(n_estimators=1000, random_state=42)
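To make concrete what the fitted forest does at prediction time: a `RandomForestRegressor` averages the predictions of its individual trees, which are exposed in `estimators_`. A minimal sketch on synthetic data follows. The data and the small `n_estimators=10` are assumptions chosen to keep it fast, not the settings used above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 20 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 1, size=200)

demo_rf = RandomForestRegressor(n_estimators=10, random_state=42)
demo_rf.fit(X, y)

# Average the per-tree predictions by hand
per_tree = np.array([t.predict(X[:5]) for t in demo_rf.estimators_])
manual_mean = per_tree.mean(axis=0)

# Compare with the forest's own predictions
print(np.allclose(manual_mean, demo_rf.predict(X[:5])))
```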

In [12]:
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)
predictions

array([241.08152022, 190.76135925, 223.72521971, ..., 148.07984455,
       237.67081232, 293.00190189])

In [16]:
# Calculate the mean absolute error (MAE)
mae = skm.mean_absolute_error(test_labels, predictions)
mae

4.114111678128973

In [17]:
# Calculate the mean squared error (MSE)
mse = skm.mean_squared_error(test_labels, predictions)
mse

102.57075828871535

## A much lower MAE/MSE! 

The random forest's MAE of about 4.11 is well below the OLS baseline of 14.95, so on this test split it predicts CO2 emissions much more accurately.
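For reference, both metrics are simple averages over the test errors, and taking the square root of the MSE puts it back in the units of the target. The sketch below uses made-up values (the arrays are illustrative, not the real predictions). The last line applies the same square root to the MSE reported above.

```python
import math
import numpy as np

# Illustrative values only
y_true = np.array([186.0, 175.0, 230.0, 264.0])
y_pred = np.array([190.0, 170.0, 230.0, 250.0])

errors = y_true - y_pred
mae_demo = np.mean(np.abs(errors))   # mean absolute error
mse_demo = np.mean(errors ** 2)      # mean squared error
rmse_demo = math.sqrt(mse_demo)      # root mean squared error, same units as y

print(mae_demo, mse_demo, rmse_demo)

# RMSE implied by the MSE reported in the notebook output
reported_rmse = round(math.sqrt(102.57075828871535), 2)
print(reported_rmse)
```

An RMSE that is noticeably larger than the MAE usually indicates that a few predictions have large errors, since squaring weights them more heavily.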

## Visualise the trees


In [14]:
# Import tools needed for visualization
from sklearn.tree import export_graphviz
import pydot

In [15]:
# Pull out one tree from the forest
tree = rf.estimators_[5]
# Export the image to a dot file
export_graphviz(tree, out_file = 'tree.dot', feature_names = feature_list, rounded = True, precision = 1)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')
# Write the graph to a png file (requires Graphviz to be installed)
graph.write_png('tree.png')
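If Graphviz is not installed, a text dump is a lighter-weight way to inspect a tree. The sketch below uses `sklearn.tree.export_text` on a small forest fitted to synthetic data. The data, the shallow `max_depth=2`, and the variable names are assumptions for illustration, and this is an alternative to the `export_graphviz` route above rather than part of it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

feature_names = ["ENGINE_SIZE", "CYLINDERS", "COMB_MPG"]

# Synthetic stand-in data, not the real features
rng = np.random.default_rng(0)
X = rng.uniform(1, 8, size=(100, 3))
y = 40 * X[:, 0] + rng.normal(0, 1, size=100)

small_rf = RandomForestRegressor(n_estimators=10, max_depth=2, random_state=42)
small_rf.fit(X, y)

# Print the split rules of one tree as indented text
tree_text = export_text(small_rf.estimators_[5], feature_names=feature_names)
print(tree_text)
```

Limiting `max_depth` keeps the printed tree short. A fully grown tree from the 1000-tree forest would produce a very long dump.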
