Random Forest Regression is essentially multiple decision trees working together (hence random FOREST): each tree is trained on a random sample of the data, and the trees' predictions are averaged. It's one of the most effective regression models and often outperforms the others, with SVR being a common exception, though results vary by dataset. This is a simple dataset from Kaggle, used here to predict the profit of 50 startups based on 4 predictor variables.

### Importing the libraries
These are the three go-to libraries for most ML.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Importing the dataset
A simple dataset import using a Pandas dataframe, with iloc used to assign our independent variable(s) (everything besides the last column) and our dependent variable (the last column). Update the name of the dataset to match your file, which must be in the same folder as your .py file or uploaded to Jupyter Notebook or Google Colab.

In [2]:
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
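
To make the slicing concrete, here is a minimal sketch on a small made-up dataframe. The column names and values are hypothetical and are not taken from 50_Startups.csv; only the two iloc expressions mirror the cell above.

```python
import pandas as pd

# Hypothetical toy data: two predictor columns followed by the target column.
toy = pd.DataFrame({
    'spend_a': [10.0, 20.0, 30.0],
    'spend_b': [1.0, 2.0, 3.0],
    'profit': [100.0, 200.0, 300.0],
})

toy_features = toy.iloc[:, :-1].values  # every row, every column except the last
toy_target = toy.iloc[:, -1].values     # every row, last column only

print(toy_features.shape)  # (3, 2)
print(toy_target.shape)    # (3,)
```

The features come out as a 2D array and the target as a 1D array, which is the layout scikit-learn expects.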

### Encoding categorical data
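The column at index 3 holds categorical data, which has to be converted into numeric columns using one-hot encoding (OneHotEncoder). Note that ColumnTransformer places the encoded columns first and the passed-through columns after them.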

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = ct.fit_transform(X)
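
As a rough illustration of what the encoder does, the sketch below runs the same ColumnTransformer setup on a tiny made-up array. The values and the `toy_` names are hypothetical; the categorical column sits at index 1 here rather than index 3.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numeric column and one categorical column.
toy_rows = np.array([
    [100.0, 'New York'],
    [200.0, 'California'],
    [300.0, 'Florida'],
    [400.0, 'New York'],
], dtype=object)

toy_ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [1])],
    remainder='passthrough',
)
toy_encoded = toy_ct.fit_transform(toy_rows)

# Depending on sparsity, the result can be a sparse matrix; make it dense.
if hasattr(toy_encoded, 'toarray'):
    toy_encoded = toy_encoded.toarray()
toy_encoded = np.asarray(toy_encoded, dtype=float)

# Categories are sorted alphabetically, one output column per category.
toy_categories = list(toy_ct.named_transformers_['encoder'].categories_[0])
print(toy_categories)   # ['California', 'Florida', 'New York']
print(toy_encoded[0])   # one-hot columns first, then the numeric column
```

The first row ends up with a 1 in the 'New York' column, zeros in the other two category columns, and the original numeric value in the last position.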

### Splitting the dataset into the Training set and Test set
Because the dataset is small and has only a few predictor variables, I used an 80/20 split. The random state is set to 0 for the sake of consistency, so the same rows land in the training and test sets on every run.

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
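
The sketch below checks both claims on placeholder data with the same number of rows as this dataset: an 80/20 split of 50 rows leaves 10 rows for testing, and a fixed random state reproduces the same split. The `demo_` arrays are hypothetical stand-ins, not the startup data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data with 50 rows, like the startup dataset.
demo_X = np.arange(100).reshape(50, 2)
demo_y = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.20, random_state=0)

# Splitting again with the same random_state yields the same rows.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    demo_X, demo_y, test_size=0.20, random_state=0)

print(X_tr.shape, X_te.shape)        # (40, 2) (10, 2)
print(np.array_equal(X_te, X_te2))   # True
```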

### Training the Random Forest Regression model on the Training set
Here, 'n_estimators' sets how many trees the forest has. 10 is a common starting point (it was scikit-learn's default before version 0.22 raised it to 100), but with model tuning through grid search and k-fold cross-validation you can find the optimal number for your specific problem.

In [5]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=10, n_jobs=None, oob_score=False,
                      random_state=0, verbose=0, warm_start=False)
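
One possible way to do that tuning is scikit-learn's GridSearchCV, which combines grid search with k-fold cross-validation. The sketch below is an assumption-laden illustration and not part of the original notebook: it uses synthetic data from make_regression instead of the startup dataset, and the candidate values in `candidate_trees` are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (50 rows, 4 features); not the startup dataset.
syn_X, syn_y = make_regression(
    n_samples=50, n_features=4, noise=10.0, random_state=0)

# Arbitrary candidate forest sizes to compare.
candidate_trees = [10, 50, 100]

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={'n_estimators': candidate_trees},
    scoring='r2',
    cv=5,  # 5-fold cross-validation for each candidate
)
search.fit(syn_X, syn_y)

print(search.best_params_)  # the candidate with the best mean r^2
print(search.best_score_)   # that candidate's mean cross-validated r^2
```

On the real data you would fit the search on X_train and y_train only, so the test set stays untouched until the final evaluation.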

### Predicting the Test set results
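By using the concatenate function, I display the predicted values and actual values side by side in a 2D array for easy viewing. Each vector is first turned into a single column with 'reshape(len(y_wtv), 1)', where 'y_wtv' stands for whichever vector is being reshaped.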

In [6]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=0, suppress=True)
print(np.concatenate((y_pred.astype(int).reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[103669. 103282.]
 [133310. 144259.]
 [136197. 146122.]
 [ 81563.  77799.]
 [183960. 191050.]
 [114786. 105008.]
 [ 76475.  81229.]
 [100724.  97484.]
 [113105. 110352.]
 [162041. 166188.]]
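
As an aside, NumPy's column_stack produces the same side-by-side layout without the explicit reshape calls. The short sketch below compares the two approaches on made-up numbers; `pred_demo` and `actual_demo` are hypothetical and are not the model's outputs.

```python
import numpy as np

# Hypothetical predicted and actual values.
pred_demo = np.array([10.0, 20.0, 30.0])
actual_demo = np.array([11.0, 19.0, 33.0])

# Approach used above: reshape each vector into a column, then concatenate.
via_concat = np.concatenate(
    (pred_demo.reshape(len(pred_demo), 1),
     actual_demo.reshape(len(actual_demo), 1)), 1)

# Alternative: column_stack turns 1D arrays into columns directly.
via_stack = np.column_stack((pred_demo, actual_demo))

print(via_stack.shape)                       # (3, 2)
print(np.array_equal(via_concat, via_stack)) # True
```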


### Evaluating Model Performance
We use two metrics to evaluate our model performance, with r^2 being the more informative of the two because it is scale-free, while MSE is expressed in squared units of profit. These are both simple to understand and are covered in one of my Medium articles! This model achieves an r^2 of .96 (.01 better than decision tree regression) with only 50 instances in the whole dataset, 10 of which are held out for testing! Pretty impressive.

In [7]:
from sklearn.metrics import r2_score, mean_squared_error as mse
print("r^2: " + str(r2_score(y_test, y_pred)))
print("MSE: " + str(mse(y_test, y_pred)))

r^2: 0.9658739721928109
MSE: 43643490.06520345
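
For readers who want to see what these two numbers mean, here is a minimal sketch that computes them by hand and checks the results against scikit-learn. The arrays are small made-up values, not the model's predictions.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical actual and predicted values.
actual_vals = np.array([3.0, 5.0, 7.0, 9.0])
predicted_vals = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: the average squared prediction error.
manual_mse = np.mean((actual_vals - predicted_vals) ** 2)

# r^2: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((actual_vals - predicted_vals) ** 2)
ss_tot = np.sum((actual_vals - actual_vals.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

# Both hand-computed values should agree with scikit-learn.
print(np.isclose(manual_mse, mean_squared_error(actual_vals, predicted_vals)))  # True
print(np.isclose(manual_r2, r2_score(actual_vals, predicted_vals)))             # True
```

Because MSE is in squared units, its square root (RMSE) is often easier to read: for the MSE reported above, that works out to roughly 6,600 in the same units as profit.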
