# Predicting traffic using an Extremely Random Forests regressor

Let's apply the concepts we learned in the previous sections to a real-world problem. We will use the dataset available at https://archive.ics.uci.edu/ml/datasets/Dodgers+Loop+Sensor. It counts the number of vehicles passing by on the road during baseball games played at the Los Angeles Dodgers stadium. To make the data readily available for analysis, we need to pre-process it. The pre-processed data is in the file traffic_data.txt, where each line contains comma-separated strings. Let's take the first line as an example:



__Tuesday,00:00,San Francisco,no,3__

The fields are, in order: day of the week, time of day, opponent team, a binary value (yes/no) indicating whether a baseball game is currently going on, and the number of vehicles passing by.
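
As a small illustration of that field layout, the sketch below splits the sample line into named fields. The field names and the `parse_record` helper are hypothetical labels chosen for this example; the data file itself has no header.

```python
# Field names are illustrative labels, not part of the data file.
FIELDS = ("day", "time", "opponent", "game_on", "vehicle_count")

def parse_record(line):
    """Split one comma-separated line into a dict keyed by field name."""
    values = line.strip().split(",")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    # The last field is the quantity we want to predict, so store it as an int.
    record["vehicle_count"] = int(record["vehicle_count"])
    return record

record = parse_record("Tuesday,00:00,San Francisco,no,3")
print(record)
```

The first four fields stay as strings here; they are the ones that need encoding before a regressor can use them.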

__Our goal is to predict the number of vehicles going by using the given information.__ Since the output variable is continuous valued, we need to build a regressor that can predict the output. We will be using Extremely Random Forests to build this regressor. Let's go ahead and see how to do that.

In [1]:
import numpy as np
from sklearn import preprocessing
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split  # splits data into training and testing sets


Load the data in the file traffic_data.txt:

In [2]:
# Load input data
input_file = 'traffic_data.txt'
data = []
with open(input_file, 'r') as f:
    for line in f:
        # strip() drops the trailing newline without cutting a character
        # from a final line that has no newline
        items = line.strip().split(',')
        data.append(items)

data = np.array(data)
print("Data:\n", data[0:5])

Data:
 [['Tuesday' '00:00' 'San Francisco' 'no' '3']
 ['Tuesday' '00:05' 'San Francisco' 'no' '8']
 ['Tuesday' '00:10' 'San Francisco' 'no' '10']
 ['Tuesday' '00:15' 'San Francisco' 'no' '6']
 ['Tuesday' '00:20' 'San Francisco' 'no' '1']]


We need to encode the non-numerical features in the data while leaving the numerical ones untouched. Each encoded feature gets its own label encoder. We keep track of these encoders because we will need them again when we compute the output for an unknown data point. Let's create those label encoders:

In [3]:
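# Convert string data to numerical data
label_encoder = []  # one LabelEncoder per non-numeric column, in column order
X_encoded = np.empty(data.shape)
for i, item in enumerate(data[0]):
    if item.isdigit():
        # Numeric column (here, the vehicle count): copy as-is
        X_encoded[:, i] = data[:, i]
    else:
        # Non-numeric column: fit a dedicated encoder and keep it for later
        label_encoder.append(preprocessing.LabelEncoder())
        X_encoded[:, i] = label_encoder[-1].fit_transform(data[:, i])

# The last column is the target; every column before it is a feature
X = X_encoded[:, :-1].astype(int)
y = X_encoded[:, -1].astype(int)
print("X:\n", X[0:5])
print("\ny:\n", y[0:5])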

X:
 [[ 5  0 13  0]
 [ 5  1 13  0]
 [ 5  2 13  0]
 [ 5  3 13  0]
 [ 5  4 13  0]]

y:
 [ 3  8 10  6  1]
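
To see where the integer codes come from, here is a minimal sketch of `LabelEncoder` on its own. It assigns codes according to the sorted order of the distinct labels, which would explain why Tuesday appears as 5 in the output above, assuming all seven day names occur in the file. The `days` list is written out by hand for this example.

```python
from sklearn.preprocessing import LabelEncoder

# Hand-written list of labels for illustration; not read from the data file.
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

encoder = LabelEncoder()
encoder.fit(days)

# classes_ holds the sorted unique labels; a label's code is its index there.
print(list(encoder.classes_))

# transform maps strings to integer codes.
codes = encoder.transform(["Tuesday", "Saturday"])
print(codes)

# inverse_transform maps integer codes back to the original strings.
decoded = list(encoder.inverse_transform(codes))
print(decoded)
```

Because the codes follow alphabetical order rather than calendar order, they carry no meaningful ranking of the days.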


Split the data into training and testing datasets:

In [4]:
# Split data into training and testing datasets  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=5)
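
The following sketch shows what this call does, using a synthetic 20-row array in place of the traffic data. The names `X_demo` and `y_demo` are made up for the example. With `test_size=0.25`, a quarter of the rows go to the test set, and fixing `random_state` makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20 rows with 4 feature columns; not the traffic data.
X_demo = np.arange(80).reshape(20, 4)
y_demo = np.arange(20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=5
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state yields the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=5
)
print(np.array_equal(X_te, X_te2))
```

Features and targets are shuffled together, so each row of `X_te` still lines up with its entry in `y_te`.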

Train an Extremely Random Forests regressor:

In [5]:
# Extremely Random Forests regressor 
params = {'n_estimators': 100, 'max_depth': 4, 'random_state': 0} 
regressor = ExtraTreesRegressor(**params) 
regressor.fit(X_train, y_train) 

ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=4,
          max_features='auto', max_leaf_nodes=None,
          min_impurity_decrease=0.0, min_impurity_split=None,
          min_samples_leaf=1, min_samples_split=2,
          min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
          oob_score=False, random_state=0, verbose=0, warm_start=False)

Compute the performance of the regressor on testing data:

In [6]:
# Compute the regressor performance on test data 
y_pred = regressor.predict(X_test) 
print("Mean absolute error:", round(mean_absolute_error(y_test, y_pred), 2)) 

Mean absolute error: 7.42
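
The mean absolute error is the average of the absolute differences between true and predicted values, so an error of 7.42 means the predictions are off by about 7 vehicles on average. Here is a minimal sketch of that computation with NumPy. The helper name `mean_absolute_error_manual` is hypothetical, and the predicted values are made up for illustration.

```python
import numpy as np

def mean_absolute_error_manual(y_true, y_pred):
    """Average of |y_true - y_pred| over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# True values are the first five targets printed earlier;
# the predictions are invented for this example.
y_true_demo = [3, 8, 10, 6, 1]
y_pred_demo = [5, 7, 12, 6, 4]

# Absolute errors are 2, 1, 2, 0, 3, so the mean is 8 / 5.
mae = mean_absolute_error_manual(y_true_demo, y_pred_demo)
print("Mean absolute error:", round(mae, 2))
```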


Let's see how to compute the output for an unknown data point. We will be using those label encoders to convert non-numerical features into numerical values:

In [15]:
# Testing encoding on single data instance
test_datapoint = ['Saturday', '10:20', 'Atlanta', 'no']
test_datapoint_encoded = [-1] * len(test_datapoint)
count = 0  # index of the next encoder in label_encoder

for i, item in enumerate(test_datapoint):
    if item.isdigit():
        test_datapoint_encoded[i] = int(test_datapoint[i])
    else:
        # Encoders were stored only for non-numeric columns, so index them
        # with count rather than the column index i
        test_datapoint_encoded[i] = int(label_encoder[count].transform([test_datapoint[i]])[0])
        count = count + 1

test_datapoint_encoded = np.array(test_datapoint_encoded)


Predict the output:

In [39]:
# Predict the output for the test datapoint 
print("Predicted traffic:", int(regressor.predict([test_datapoint_encoded])[0])) 

Predicted traffic: 26
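
The encoding step can also be wrapped in a small helper so that it is easy to reuse for other data points. This is a minimal sketch, assuming every feature column is categorical (true of the four features here). The names `encode_datapoint` and `encoders`, and the three-row `sample` array, are hypothetical stand-ins and not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# A tiny stand-in for the four feature columns; illustrative only.
sample = np.array([
    ["Tuesday", "00:00", "San Francisco", "no"],
    ["Saturday", "10:20", "Atlanta", "no"],
    ["Saturday", "19:40", "Atlanta", "yes"],
])

# One fitted encoder per column, in column order.
encoders = [LabelEncoder().fit(sample[:, i]) for i in range(sample.shape[1])]

def encode_datapoint(datapoint, encoders):
    """Encode one row of string features with the matching encoders.

    LabelEncoder raises ValueError for a label it did not see during fit.
    """
    if len(datapoint) != len(encoders):
        raise ValueError("datapoint and encoders must have the same length")
    return np.array(
        [enc.transform([value])[0] for value, enc in zip(datapoint, encoders)]
    )

encoded = encode_datapoint(["Saturday", "10:20", "Atlanta", "no"], encoders)
print(encoded)
```

The codes printed here differ from those in the notebook because the encoders were fitted on three rows only. The same fitted encoders must be used for training and for prediction.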


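You will get 26 as the output, which is close to the values recorded under similar conditions. You can confirm this from the data file by listing the Saturday rows against Atlanta whose vehicle count is 26; several of them fall within minutes of 10:20.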

In [44]:
input_file = 'traffic_data.txt'
data = []
with open(input_file, 'r') as f:
    for line in f:
        items = line.strip().split(',')
        data.append(items)

data = np.array(data)
print("Data Shape:\n", np.shape(data))

# Print the Saturday/Atlanta rows whose vehicle count equals the prediction
for row in data:
    if row[-1] == '26' and row[0] == 'Saturday' and row[2] == 'Atlanta':
        print(row)


Data Shape:
 (17568, 5)
['Saturday' '09:40' 'Atlanta' 'no' '26']
['Saturday' '10:05' 'Atlanta' 'no' '26']
['Saturday' '10:10' 'Atlanta' 'no' '26']
['Saturday' '10:30' 'Atlanta' 'no' '26']
['Saturday' '11:10' 'Atlanta' 'no' '26']
['Saturday' '11:25' 'Atlanta' 'no' '26']
['Saturday' '12:30' 'Atlanta' 'no' '26']
['Saturday' '15:00' 'Atlanta' 'no' '26']
['Saturday' '18:40' 'Atlanta' 'no' '26']
['Saturday' '19:40' 'Atlanta' 'yes' '26']
['Saturday' '20:30' 'Atlanta' 'yes' '26']
['Saturday' '23:25' 'Atlanta' 'no' '26']
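
A more direct check is to look at the recorded counts around the query time instead of searching for the predicted value. The sketch below does this on four rows copied from the output above. The helpers `to_minutes` and `counts_near` and the 15-minute window are assumptions made for this example. On the full file, the same day and opponent may occur on more than one date, so several rows per time slot are possible.

```python
# Rows copied from the output above; on the full dataset, pass `data` instead.
rows = [
    ["Saturday", "10:05", "Atlanta", "no", "26"],
    ["Saturday", "10:10", "Atlanta", "no", "26"],
    ["Saturday", "10:30", "Atlanta", "no", "26"],
    ["Saturday", "19:40", "Atlanta", "yes", "26"],
]

def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def counts_near(rows, day, time, opponent, window=15):
    """Return (time, count) pairs within `window` minutes of `time`."""
    target = to_minutes(time)
    return [
        (row[1], int(row[4]))
        for row in rows
        if row[0] == day
        and row[2] == opponent
        and abs(to_minutes(row[1]) - target) <= window
    ]

nearby = counts_near(rows, "Saturday", "10:20", "Atlanta")
print(nearby)
```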
