# Car sales forecast

In [13]:
import pandas as pd

def import_data():
    # Load the CSV file (expected in the ./data subdirectory)
    data = pd.read_csv("./data/norway_new_car_sales_by_make.csv")
    # Create a column "Period" with both the Year and the Month
    data['Period'] = data['Year'].astype(str) + '-' + data['Month'].astype(str)
    # Use datetime formatting to make sure the format is consistent (YYYY-MM)
    data['Period'] = pd.to_datetime(data['Period']).dt.strftime('%Y-%m')
    # Pivot the data to show the periods on columns and the car makers on rows
    df = pd.pivot_table(data=data, values='Quantity', index='Make', columns='Period', aggfunc='sum', fill_value=0)
    return df

df = import_data()
df.head()

Make,2007-01,2007-02,2007-03,2007-04,2007-05,2007-06,2007-07,2007-08,2007-09,2007-10,...,2016-04,2016-05,2016-06,2016-07,2016-08,2016-09,2016-10,2016-11,2016-12,2017-01
Alfa Romeo,16,9,21,20,17,21,14,12,15,10,...,3,1,2,1,6,15,3,4,3,6
Aston Martin,0,0,1,0,4,3,3,0,0,0,...,0,0,1,0,0,0,0,0,0,0
Audi,599,498,682,556,630,498,562,590,393,554,...,685,540,551,687,794,688,603,645,827,565
BMW,352,335,365,360,431,477,403,348,271,562,...,1052,832,808,636,1031,1193,1096,1663,866,1540
Bentley,0,0,0,0,0,1,0,0,0,0,...,0,0,1,1,1,0,0,0,0,0
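
To make the pivot step in import_data more concrete, here is a minimal sketch on made-up data. The toy rows and the names toy and toy_pivot are assumptions for illustration only. The period label is built with zero-padding via str.zfill rather than with pd.to_datetime, simply to keep the example short.

```python
import pandas as pd

# Hypothetical toy data in the same long format as the CSV (values are made up)
toy = pd.DataFrame({
    "Year": [2007, 2007, 2007, 2007],
    "Month": [1, 1, 2, 2],
    "Make": ["Audi", "BMW", "Audi", "Audi"],
    "Quantity": [10, 5, 7, 3],
})

# Build a zero-padded 'YYYY-MM' label (stand-in for the to_datetime/strftime step)
toy["Period"] = toy["Year"].astype(str) + "-" + toy["Month"].astype(str).str.zfill(2)

# One row per make, one column per period; duplicates are summed, gaps become 0
toy_pivot = pd.pivot_table(toy, values="Quantity", index="Make",
                           columns="Period", aggfunc="sum", fill_value=0)
print(toy_pivot)
```

The two Audi rows for 2007-02 are summed into a single cell, and the missing BMW entry for 2007-02 is filled with 0.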


Now that we have our dataset with the proper formatting, we can create our training and test sets. For this purpose, we will create a function datasets that takes as inputs:

- df our initial historical demand;
- x_len the number of months we will use to make a prediction;
- y_len the number of months we want to predict;
- y_test_len the number of months we leave as a final test;

and returns X_train, Y_train, X_test & Y_test.

In [17]:
df.values.shape[1]

121

In [25]:
import numpy as np

def datasets(df, x_len=12, y_len=1, y_test_len=12):
    D = df.values
    periods = D.shape[1]  # number of columns

    # Training set creation: run through all the possible time windows
    loops = periods + 1 - x_len - y_len - y_test_len
    train = []
    for col in range(loops):
        train.append(D[:, col:col+x_len+y_len])
    train = np.vstack(train)
    X_train, Y_train = np.split(train, [x_len], axis=1)

    # Test set creation: unseen "future" data with the demand just before
    max_col_test = periods - x_len - y_len + 1
    test = []
    for col in range(loops, max_col_test):
        test.append(D[:, col:col+x_len+y_len])
    test = np.vstack(test)
    X_test, Y_test = np.split(test, [x_len], axis=1)

    # This data formatting is needed if we only predict a single period
    if y_len == 1:
        Y_train = Y_train.ravel()
        Y_test = Y_test.ravel()

    return X_train, Y_train, X_test, Y_test

In our function, we call .ravel() on both Y_train and Y_test when we only want to predict one period at a time.
array.ravel() flattens a NumPy array to 1D.
Our function always creates Y_train and Y_test as 2D arrays (i.e. arrays with rows and columns). If we only predict one period at a time, these arrays have a single column (and multiple rows). The functions we will use later expect 1D arrays in that case, hence the flattening.
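
As a quick illustration of what .ravel() does to a single-column array, here is a minimal sketch. The array values are made up.

```python
import numpy as np

col = np.array([[3], [5], [8]])  # 2D array: 3 rows, 1 column
flat = col.ravel()               # 1D array with the same 3 values

print(col.shape, flat.shape)
```

The values are unchanged; only the shape goes from (3, 1) to (3,).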

In [26]:
import numpy as np

X_train, Y_train, X_test, Y_test = datasets(df)

We now obtain the datasets we need to feed our machine learning algorithm (X_train & Y_train) and the datasets we need to test it (X_test & Y_test).

Note that we took y_test_len as 12 periods. That means we will test our algorithm over 12 different predictions (as we only predict one period at a time).
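
The number of time windows can be checked by hand with the same arithmetic as in datasets. The sketch below assumes the 121 periods reported earlier and the default parameters; the variable names train_windows and test_windows are only for illustration.

```python
periods, x_len, y_len, y_test_len = 121, 12, 1, 12

# Same formulas as in datasets()
train_windows = periods + 1 - x_len - y_len - y_test_len
max_col_test = periods - x_len - y_len + 1
test_windows = max_col_test - train_windows

print(train_windows, test_windows)
```

Each window contributes one row per car maker, so the number of rows in X_train and X_test is the number of windows times the number of makes in df.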

Forecasting multiple periods at once: you can change y_len if you want to forecast multiple periods at once. Keep y_test_len ≥ y_len; otherwise you won’t be able to test all the predictions of your algorithm.

# Regression Tree
As a first machine learning algorithm, we will use a decision tree. Decision trees are a class of machine learning algorithms that will create a map (a tree actually) of questions to make a prediction. We call these trees regression trees if we want them to predict a number, or classification trees if we want them to predict a category or a label.

In order to make a prediction, the tree will start at its foundation with a first yes/no question, and based on the answer it will continue asking new yes/no questions until it gets to a final prediction. 

We will use the scikit-learn Python library (www.scikit-learn.org) to grow our first tree. This is a well-known open-source library that is used all over the world by data scientists. It is built on top of NumPy, so that it interacts easily with the rest of our code.

The first step is to call scikit-learn and create an instance of a regression tree. Once this is done, we have to train it based on our X_train and Y_train arrays.

In [27]:
from sklearn.tree import DecisionTreeRegressor 
 
# Instantiate a decision tree regressor
tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=5)

# Fit the tree to the training data
tree.fit(X_train, Y_train)

DecisionTreeRegressor(criterion='mse', max_depth=5, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=5,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
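
To see the yes/no questions a fitted tree actually asks, scikit-learn can print them with export_text. The sketch below uses made-up data and hypothetical feature names (last_month, flag); it is not the tree fitted above, only a small stand-in that is easy to read.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up data: the target depends only on whether the first feature is small or large
X = np.array([[1, 0], [2, 0], [3, 1], [10, 0], [11, 1], [12, 1]])
y = np.array([5.0, 5.0, 5.0, 50.0, 50.0, 50.0])

toy_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Print the learned questions as indented text
rules = export_text(toy_tree, feature_names=["last_month", "flag"])
print(rules)
```

Each line with a threshold is one yes/no question, and each leaf line gives the value the tree predicts once the questions are answered.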

In [30]:
# Create a prediction based on our model 
Y_train_pred = tree.predict(X_train) 
 
# Compute the Mean Absolute Error as a share of the average demand (MAE%)
import numpy as np
MAE_tree = np.mean(np.abs(Y_train - Y_train_pred))/np.mean(Y_train)
 
# Print the results 
print('Tree on train set MAE%:',round(MAE_tree*100,1))

Tree on train set MAE%: 18.1


In [32]:
Y_test_pred = tree.predict(X_test) 
MAE_test = np.mean(abs(Y_test - Y_test_pred))/np.mean(Y_test) 
print('Tree on test set MAE%:',round(MAE_test*100,1))

Tree on test set MAE%: 21.1
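
The metric used in the two cells above is the mean absolute error divided by the average demand. A minimal sketch of it as a reusable helper follows; the function name mae_pct and the sample numbers are assumptions for illustration.

```python
import numpy as np

def mae_pct(y_true, y_pred):
    """Mean absolute error scaled by the mean of the actual values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred)) / np.mean(y_true)

# Made-up example: absolute errors are 10, 20 and 30, average demand is 200
y_true = [100, 200, 300]
y_pred = [110, 180, 330]
score = mae_pct(y_true, y_pred)
print(round(score * 100, 1))
```

Here the average absolute error is 20 against an average demand of 200, so the scaled error is 10%.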
