# Random Forests

The problem tackled here is predicting tomorrow's maximum temperature in a chosen city from the previous year's weather data. The data are for Seattle, WA, from the year 2019 and were retrieved using the NOAA Climate Data Online tool. The setup assumes no access to any weather forecast. The only inputs available are the historical average highs, the maximum temperatures of the previous two days, and an estimate from a friend who claims to know everything about the weather.

In [1]:
# Import of necessary libraries
import pandas as pd
import numpy as np

In [2]:
# Import the data file and display the first 5 rows of the dataset
features = pd.read_csv('temps.csv')
features.head(5)

Unnamed: 0,year,month,day,week,temp_2,temp_1,average,actual,friend
0,2019,1,1,Fri,45,45,45.6,45,29
1,2019,1,2,Sat,44,45,45.7,44,61
2,2019,1,3,Sun,45,44,45.8,41,56
3,2019,1,4,Mon,44,41,45.9,40,53
4,2019,1,5,Tues,41,40,46.0,44,41


Since the dataset includes the categorical variable 'week', whose values are character strings, one-hot encoding must be applied. This process gives each day of the week a numerical representation with no arbitrary ordering: the single column of day names is replaced by seven columns of binary data, one per day.

In [3]:
# One-hot encode the categorical day-of-week variable and display the first 5 rows again
features = pd.get_dummies(features)
features.head(5)

Unnamed: 0,year,month,day,temp_2,temp_1,average,actual,friend,week_Fri,week_Mon,week_Sat,week_Sun,week_Thurs,week_Tues,week_Wed
0,2019,1,1,45,45,45.6,45,29,1,0,0,0,0,0,0
1,2019,1,2,44,45,45.7,44,61,0,0,1,0,0,0,0
2,2019,1,3,45,44,45.8,41,56,0,0,0,1,0,0,0
3,2019,1,4,44,41,45.9,40,53,0,1,0,0,0,0,0
4,2019,1,5,41,40,46.0,44,41,0,0,0,0,0,1,0
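
To see what `pd.get_dummies` does in isolation, the following minimal sketch applies it to a small made-up frame. The `toy` frame and its values are hypothetical and are not taken from `temps.csv`. Numeric columns pass through unchanged and only the string column is expanded. The `dtype=int` argument is an assumption added here so that the indicators print as 0/1, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one string column
toy = pd.DataFrame({'temp_1': [45, 44, 41],
                    'week': ['Fri', 'Sat', 'Fri']})

# Only the string column is expanded; dtype=int requests 0/1 indicators
encoded = pd.get_dummies(toy, dtype=int)

print(list(encoded.columns))
print(encoded)
```

Each distinct day name becomes its own `week_<day>` column, which is why the full dataset ends up with seven such columns.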


In [4]:
# Separation of data into features and targets

In [5]:
# Labels are the values the random forest aims to predict
labels = np.array(features['actual'])

In [6]:
# Dropping the 'actual' column from the dataset
features = features.drop('actual', axis = 1)

In [7]:
# Save the feature names for later use
feature_list = list(features.columns)

In [8]:
# Conversion of the Pandas dataframe to a Numpy array
features = np.array(features)

In [9]:
# Import the Scikit-learn function for splitting the data into training and test sets
from sklearn.model_selection import train_test_split

In [12]:
# Splitting of the data into training and test sets
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size = 0.25, random_state = 42)
# Fixing the random state at 42 makes the split identical on every run,
# so the results are reproducible

In [14]:
# Check that the split worked by printing the shape of each array
# The training and test features should have the same number of columns, and each
# feature array should have the same number of rows as its corresponding labels:
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (261, 14)
Training Labels Shape: (261,)
Testing Features Shape: (87, 14)
Testing Labels Shape: (87,)
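
The shapes above and the role of `random_state` can be checked without the weather file. The sketch below is only an illustration: `X` and `y` are placeholder arrays with the same dimensions as the data here (348 rows, 14 features), not the real values. It splits them twice with the same seed and confirms that the two splits are identical and that features and labels stay aligned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays shaped like the weather data: row i starts at 14 * i, label i is i
X = np.arange(348 * 14).reshape(348, 14)
y = np.arange(348)

# Two independent splits using the same seed
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.25, random_state=42)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.25, random_state=42)

# 25% of 348 rows is 87 test rows, leaving 261 training rows
print(X_train_a.shape, X_test_a.shape)

# The same seed yields the same rows in the same order
print(np.array_equal(X_test_a, X_test_b))

# Each feature row is still paired with its own label
print(np.array_equal(X_test_a[:, 0], 14 * y_test_a))
```

Changing or omitting `random_state` would generally select different rows on each run, so error figures computed on the test set would no longer be directly comparable between runs.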


In [15]:
# Use the historical average maximum temperatures as the baseline prediction
baseline_preds = test_features[:, feature_list.index('average')]

In [16]:
# Baseline errors, and display of average baseline error
baseline_errors = abs(baseline_preds - test_labels)
print('Average baseline error: ', round(np.mean(baseline_errors), 2), 'degrees')

Average baseline error:  5.06 degrees


This baseline sets the target to beat. If the model cannot achieve an average error below 5 degrees, the approach should be reconsidered.
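
As a worked illustration of the baseline calculation, the sketch below applies the same formula to the five rows displayed at the top of the notebook. The array names `average` and `actual` are chosen here for illustration. These are the first five rows of the dataset rather than the test set, so the result is not expected to match the 5.06 degrees reported above.

```python
import numpy as np

# 'average' and 'actual' values from the five rows shown by features.head(5)
average = np.array([45.6, 45.7, 45.8, 45.9, 46.0])
actual = np.array([45, 44, 41, 40, 44])

# Absolute error of predicting the historical average for each day
sample_errors = np.abs(average - actual)

# Errors are 0.6, 1.7, 4.8, 5.9 and 2.0, which average to about 3 degrees
sample_baseline_error = np.mean(sample_errors)
print('Baseline error on the five sample rows:', round(sample_baseline_error, 2), 'degrees')
```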

In [17]:
# Import of the random forest regression model from the Scikit-learn library
from sklearn.ensemble import RandomForestRegressor

In [19]:
# Instantiate the model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)

In [20]:
# Train the model on training data
rf.fit(train_features, train_labels)

In [21]:
# Using the random forest model prediction method on test data
predictions = rf.predict(test_features)

In [22]:
# Calculation of the absolute errors
errors = abs(predictions - test_labels)

In [23]:
# Printing out the mean absolute error (mae)
print('Mean Absolute Error: ', round(np.mean(errors), 2), 'degrees')

Mean Absolute Error:  3.83 degrees


The average estimate is off by 3.83 degrees, an improvement of more than one degree over the baseline. While this may not seem significant, it is almost 25% better than the baseline, which, depending on the field and the problem, could represent millions of dollars for a company.
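
The improvement figures can be verified directly from the two errors printed earlier. The variable names below are chosen for illustration.

```python
# Errors reported above, in degrees
baseline_mae = 5.06
model_mae = 3.83

# Absolute improvement in degrees
absolute_gain = baseline_mae - model_mae

# Improvement relative to the baseline, in percent
relative_gain = 100 * absolute_gain / baseline_mae

print('Absolute improvement:', round(absolute_gain, 2), 'degrees')
print('Relative improvement:', round(relative_gain, 1), '%')
```

The relative improvement comes to roughly 24.3%, which is the "almost 25%" quoted in the text.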

To put the model's predictions into perspective, its accuracy is calculated as 100% minus the mean absolute percentage error (MAPE).

In [24]:
# Calculation of the absolute percentage error of each prediction (their mean is the MAPE)
mape = 100 * (errors / test_labels)

In [25]:
# Calculation and display of the model's accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy: ', round(accuracy, 2), '%')

Accuracy:  93.98 %


The model has learned to predict the next day's maximum temperature in Seattle with 93.98% accuracy.
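
The accuracy figure used here is simply 100 minus the MAPE, so it inherits one limitation of that metric: the percentage error is undefined whenever an actual value is zero. The sketch below wraps the calculation in a hypothetical helper, `accuracy_from_mape`, which is not part of the notebook, and adds a guard for that case. The example values are the five displayed rows with the historical average standing in for a prediction, not output from the random forest.

```python
import numpy as np

def accuracy_from_mape(actual, predicted):
    """Return 100 minus the mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Percentage error divides by the actual value, so zeros cannot be handled
    if np.any(actual == 0):
        raise ValueError('MAPE is undefined when an actual value is 0')
    percentage_errors = 100 * np.abs(predicted - actual) / np.abs(actual)
    return 100 - np.mean(percentage_errors)

# Illustrative inputs: 'actual' and 'average' columns of the five displayed rows
acc = accuracy_from_mape([45, 44, 41, 40, 44], [45.6, 45.7, 45.8, 45.9, 46.0])
print('Accuracy: ', round(acc, 2), '%')
```

For Seattle highs recorded in degrees Fahrenheit a zero is unlikely, but the same code applied to temperatures in Celsius could hit this case.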