<a href="https://colab.research.google.com/github/SushiFou/ML-Business-Case-Project/blob/main/notebooks/Random_Forest_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <div align="center"><b> Machine Learning Business Case Project </b></div>
---
<div align="center">Authors : Maxime Lepeytre | Soumaya Sabry | Alexandre Zajac | Olivier Boivin | Yann Kervella


<center>
<img src="https://github.com/SushiFou/ML-Business-Case-Project/blob/main/cover_image_tech.jpg?raw=1" width="800px"/>
</center>
</div>

<div align="center"><font color='red' size='12'> DON'T FORGET TO COMMIT CHANGES ON GITHUB, FOLKS! Good luck! </font></div>

## Context

You are a consultant data scientist at a large French consulting firm.
Your client is a company that generates a lot of data, but so far no
Machine Learning model has been put in place to exploit it. It has
therefore naturally turned to you.

The company operates more than 3,000 stores in 7 European countries.
Currently, store managers are responsible for estimating their daily
sales up to six weeks in advance. Store sales are influenced by many
factors, including promotions, competition, school holidays, seasonality
and locality. With thousands of individual managers predicting sales
based on their own circumstances, the accuracy of the results can vary
widely.

With your team of consultant data scientists, you collect the available
data and will carry out a complete project to exploit it and address the
problem. By 21 January 2021, you must prepare and defend a slide deck
that draws the conclusions of your work, including a demonstration that
presents the results of the Machine Learning model visually for end
users. You will have 12 minutes to present your work to your client
sponsor.

## Requirements


In [226]:
!pip install ipython-autotime
%load_ext autotime

The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 2.6 s (started: 2021-01-18 19:27:03 +00:00)


In [227]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

time: 3.11 ms (started: 2021-01-18 19:27:05 +00:00)


In [228]:
# Graphical settings
CB91_Blue = '#2CBDFE'
CB91_Green = '#47DBCD'
CB91_Pink = '#F3A0F2'
CB91_Purple = '#9D2EC5'
CB91_Violet = '#661D98'
CB91_Amber = '#F5B14C'
color_list = [CB91_Blue, CB91_Pink, CB91_Green, CB91_Amber,
              CB91_Purple, CB91_Violet]
params = {"ytick.color" : "w",
          "xtick.color" : "w",
          'axes.labelsize' : 15,
          "axes.labelcolor" : "w",
          "axes.edgecolor" : "w",
          "axes.titlecolor": "w", 
          'figure.figsize': [20, 8], 
          'axes.prop_cycle': plt.cycler(color=color_list), 
          'figure.dpi' : 75, 
          'legend.fontsize': 10,
          'font.size': 15 
          }
plt.rcParams.update(params)

time: 7.38 ms (started: 2021-01-18 19:27:05 +00:00)


## Data Importation

### Download the data with gdown (when using Colab)

In [229]:
!gdown "https://drive.google.com/uc?id=1IHr_vKHZ0P0lUIAksJ9joRLUoUtZdDSY"

Downloading...
From: https://drive.google.com/uc?id=1IHr_vKHZ0P0lUIAksJ9joRLUoUtZdDSY
To: /content/store.csv
  0% 0.00/45.0k [00:00<?, ?B/s]100% 45.0k/45.0k [00:00<00:00, 42.5MB/s]
time: 784 ms (started: 2021-01-18 19:27:05 +00:00)


In [230]:
!gdown "https://drive.google.com/uc?id=17ur-ILBNAZDgjpqgPU1XBLYSIXc5cn5d"

Downloading...
From: https://drive.google.com/uc?id=17ur-ILBNAZDgjpqgPU1XBLYSIXc5cn5d
To: /content/test.csv
  0% 0.00/1.43M [00:00<?, ?B/s]100% 1.43M/1.43M [00:00<00:00, 94.0MB/s]
time: 890 ms (started: 2021-01-18 19:27:06 +00:00)


In [231]:
!gdown "https://drive.google.com/uc?id=1kx5sSTcRj4aVS8KZgSCcdo9-5i1axh5n"

Downloading...
From: https://drive.google.com/uc?id=1kx5sSTcRj4aVS8KZgSCcdo9-5i1axh5n
To: /content/train.csv
38.1MB [00:00, 103MB/s] 
time: 3 s (started: 2021-01-18 19:27:07 +00:00)


In [232]:
!gdown "https://drive.google.com/uc?id=10p7JyO2DNkWbMRZoMNVPmipy1msZpBEV"

Downloading...
From: https://drive.google.com/uc?id=10p7JyO2DNkWbMRZoMNVPmipy1msZpBEV
To: /content/variables.txt
  0% 0.00/1.58k [00:00<?, ?B/s]100% 1.58k/1.58k [00:00<00:00, 2.60MB/s]
time: 993 ms (started: 2021-01-18 19:27:10 +00:00)


## Data Exploration

In [233]:
# Use a context manager so the file is closed after reading
with open("variables.txt", "r") as f:
  print(f.read())

Most of the fields are self-explanatory. The following are descriptions for those that aren't.

Id - an Id that represents a (Store, Date) duple within the test set
Store - a unique Id for each store
Sales - the turnover for any given day
Customers - the number of customers on a given day
Open - an indicator for whether the store was open: 0 = closed, 1 = open
StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
StoreType - differentiates between 4 different store models: a, b, c, d
Assortment - describes an assortment level: a = basic, b = extra, c = extended
CompetitionDistance - distance in meters to the nearest competitor store
CompetitionOpenSince[Month/Year] - gives the approximate year and month of t

In [234]:
store_data = pd.read_csv('store.csv')
print(f'Dataframe shape : rows = {store_data.shape[0]}, columns = {store_data.shape[1]}')
store_data.head()

Dataframe shape : rows = 1115, columns = 10


Unnamed: 0,Store,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,c,a,1270.0,9.0,2008.0,0,,,
1,2,a,a,570.0,11.0,2007.0,1,13.0,2010.0,"Jan,Apr,Jul,Oct"
2,3,a,a,14130.0,12.0,2006.0,1,14.0,2011.0,"Jan,Apr,Jul,Oct"
3,4,c,c,620.0,9.0,2009.0,0,,,
4,5,a,a,29910.0,4.0,2015.0,0,,,


time: 41.6 ms (started: 2021-01-18 19:27:11 +00:00)


In [235]:
train_data = pd.read_csv('train.csv', low_memory = False)
print(f'Dataframe shape : rows = {train_data.shape[0]}, columns = {train_data.shape[1]}')
train_data.head()

Dataframe shape : rows = 1017209, columns = 9


Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,1,5,2015-07-31,5263,555,1,1,0,1
1,2,5,2015-07-31,6064,625,1,1,0,1
2,3,5,2015-07-31,8314,821,1,1,0,1
3,4,5,2015-07-31,13995,1498,1,1,0,1
4,5,5,2015-07-31,4822,559,1,1,0,1


time: 812 ms (started: 2021-01-18 19:27:11 +00:00)


In [236]:
test_data = pd.read_csv('test.csv')
print(f'Dataframe shape : rows = {test_data.shape[0]}, columns = {test_data.shape[1]}')
test_data.head()

Dataframe shape : rows = 41088, columns = 8


Unnamed: 0,Id,Store,DayOfWeek,Date,Open,Promo,StateHoliday,SchoolHoliday
0,1,1,4,2015-09-17,1.0,1,0,0
1,2,3,4,2015-09-17,1.0,1,0,0
2,3,7,4,2015-09-17,1.0,1,0,0
3,4,8,4,2015-09-17,1.0,1,0,0
4,5,9,4,2015-09-17,1.0,1,0,0


time: 50.2 ms (started: 2021-01-18 19:27:12 +00:00)
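The previews above show that `train.csv` and `test.csv` do not share the same columns. A minimal sketch of that comparison, with the column lists copied from the outputs above (the variable names `train_columns`, `test_columns` and `missing_in_test` are illustrative):

```python
# Column names copied from the train.csv and test.csv previews above.
train_columns = ['Store', 'DayOfWeek', 'Date', 'Sales', 'Customers',
                 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday']
test_columns = ['Id', 'Store', 'DayOfWeek', 'Date', 'Open', 'Promo',
                'StateHoliday', 'SchoolHoliday']

# Columns available at training time but absent at prediction time.
missing_in_test = sorted(set(train_columns) - set(test_columns))
print(missing_in_test)
```

`Sales` is the target, so its absence from `test.csv` is expected. `Customers` is absent as well, so a model that uses it as a feature cannot be applied to `test.csv` without further work.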


### Check NaN Values

In [237]:
store_data.isna().sum()

Store                          0
StoreType                      0
Assortment                     0
CompetitionDistance            3
CompetitionOpenSinceMonth    354
CompetitionOpenSinceYear     354
Promo2                         0
Promo2SinceWeek              544
Promo2SinceYear              544
PromoInterval                544
dtype: int64

time: 7.28 ms (started: 2021-01-18 19:27:12 +00:00)


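`CompetitionDistance` has only 3 missing values.

The other missing values look structural rather than accidental: `CompetitionOpenSinceMonth` and `CompetitionOpenSinceYear` (354 each) are missing where no competitor opening date was recorded, and `Promo2SinceWeek`, `Promo2SinceYear` and `PromoInterval` (544 each) are presumably missing for stores that do not take part in `Promo2`.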
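One way to check whether missing promotion fields are structural is to compare them with the `Promo2` flag. The sketch below uses a small hand-made frame, `toy_store`, rather than the real `store.csv`; its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for store.csv: values are illustrative, not taken from the real file.
toy_store = pd.DataFrame({
    'Store': [1, 2, 3, 4],
    'Promo2': [0, 1, 1, 0],
    'Promo2SinceWeek': [np.nan, 13.0, 14.0, np.nan],
    'Promo2SinceYear': [np.nan, 2010.0, 2011.0, np.nan],
})

# True when the promotion field is missing exactly for the stores with Promo2 == 0.
missing_week = toy_store['Promo2SinceWeek'].isna()
structural = bool((missing_week == (toy_store['Promo2'] == 0)).all())
print(structural)
```

The same comparison could be run on `store_data` before any `fillna` call to confirm the pattern on the real data.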

In [238]:
train_data.isna().sum()

Store            0
DayOfWeek        0
Date             0
Sales            0
Customers        0
Open             0
Promo            0
StateHoliday     0
SchoolHoliday    0
dtype: int64

time: 130 ms (started: 2021-01-18 19:27:12 +00:00)


In [239]:
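# Impute the 3 missing distances with the column mean (assignment avoids chained in-place modification)
store_data['CompetitionDistance'] = store_data['CompetitionDistance'].fillna(store_data['CompetitionDistance'].mean())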

time: 2.31 ms (started: 2021-01-18 19:27:12 +00:00)


In [240]:
# Mark the remaining missing values with the sentinel -1
store_data.fillna(-1, inplace=True)

time: 3.32 ms (started: 2021-01-18 19:27:12 +00:00)


In [241]:
store_data.isna().sum()

Store                        0
StoreType                    0
Assortment                   0
CompetitionDistance          0
CompetitionOpenSinceMonth    0
CompetitionOpenSinceYear     0
Promo2                       0
Promo2SinceWeek              0
Promo2SinceYear              0
PromoInterval                0
dtype: int64

time: 6.91 ms (started: 2021-01-18 19:27:12 +00:00)


### Check Outliers

In [242]:
store_data.describe()

Unnamed: 0,Store,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear
count,1115.0,1115.0,1115.0,1115.0,1115.0,1115.0,1115.0
mean,558.0,5404.901079,4.613453,1370.621525,0.512108,11.595516,1029.75157
std,322.01708,7652.849306,4.65954,935.933356,0.500078,15.925223,1006.53886
min,1.0,20.0,-1.0,-1.0,0.0,-1.0,-1.0
25%,279.5,720.0,-1.0,-1.0,0.0,-1.0,-1.0
50%,558.0,2330.0,4.0,2006.0,1.0,1.0,2009.0
75%,836.5,6875.0,9.0,2011.0,1.0,22.0,2012.0
max,1115.0,75860.0,12.0,2015.0,1.0,50.0,2015.0


time: 54.8 ms (started: 2021-01-18 19:27:12 +00:00)


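Next, handle the outliers in `CompetitionDistance`: compute the absolute z-score of each store's distance and keep only the stores whose z-score is below 3.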

In [243]:
from scipy import stats
store_data['CD_zscore'] = np.abs(stats.zscore(store_data['CompetitionDistance'].to_numpy()))
store_data.head()

Unnamed: 0,Store,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval,CD_zscore
0,1,c,a,1270.0,9.0,2008.0,0,-1.0,-1.0,-1,0.540551
1,2,a,a,570.0,11.0,2007.0,1,13.0,2010.0,"Jan,Apr,Jul,Oct",0.632061
2,3,a,a,14130.0,12.0,2006.0,1,14.0,2011.0,"Jan,Apr,Jul,Oct",1.140623
3,4,c,c,620.0,9.0,2009.0,0,-1.0,-1.0,-1,0.625525
4,5,a,a,29910.0,4.0,2015.0,0,-1.0,-1.0,-1,3.203525


time: 31.1 ms (started: 2021-01-18 19:27:12 +00:00)


In [244]:
store_data_cleaned = store_data[store_data['CD_zscore'] < 3]
store_data_cleaned.describe()

Unnamed: 0,Store,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,CD_zscore
count,1093.0,1093.0,1093.0,1093.0,1093.0,1093.0,1093.0,1093.0
mean,559.150046,4725.40229,4.617566,1363.26441,0.522415,11.849039,1050.498628,0.608184
std,322.261812,5825.426321,4.6832,938.761871,0.499726,15.983167,1005.830924,0.466511
min,1.0,20.0,-1.0,-1.0,0.0,-1.0,-1.0,0.0
25%,279.0,700.0,-1.0,-1.0,0.0,-1.0,-1.0,0.349713
50%,560.0,2280.0,4.0,2006.0,1.0,1.0,2009.0,0.55365
75%,839.0,6360.0,9.0,2011.0,1.0,22.0,2012.0,0.673895
max,1115.0,27650.0,12.0,2015.0,1.0,50.0,2015.0,2.908078


time: 45.3 ms (started: 2021-01-18 19:27:12 +00:00)
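The z-score filter can be reproduced with pandas alone. The sketch below applies the same `|z| < 3` rule to a toy series (illustrative values, not the real distances) and uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Toy distances in meters (illustrative values, not the real store.csv).
distances = pd.Series([500.0] * 19 + [50000.0], name='CompetitionDistance')

# Population z-score (ddof=0), matching the default of scipy.stats.zscore.
z = (distances - distances.mean()) / distances.std(ddof=0)

# Keep only the observations within 3 standard deviations of the mean.
kept = distances[z.abs() < 3]
print(len(distances), len(kept))
```

On the real data the filter reduces the store table from 1115 to 1093 rows. Because `DataFrame.merge` performs an inner join by default, any later merge on `Store` also drops the sales history of the removed stores.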


In [245]:
store_data = store_data_cleaned.drop(columns='CD_zscore')
store_data.head()

Unnamed: 0,Store,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,c,a,1270.0,9.0,2008.0,0,-1.0,-1.0,-1
1,2,a,a,570.0,11.0,2007.0,1,13.0,2010.0,"Jan,Apr,Jul,Oct"
2,3,a,a,14130.0,12.0,2006.0,1,14.0,2011.0,"Jan,Apr,Jul,Oct"
3,4,c,c,620.0,9.0,2009.0,0,-1.0,-1.0,-1
5,6,a,a,310.0,12.0,2013.0,0,-1.0,-1.0,-1


time: 31.7 ms (started: 2021-01-18 19:27:12 +00:00)


In [246]:
train_data.describe()

Unnamed: 0,Store,DayOfWeek,Sales,Customers,Open,Promo,SchoolHoliday
count,1017209.0,1017209.0,1017209.0,1017209.0,1017209.0,1017209.0,1017209.0
mean,558.4297,3.998341,5773.819,633.1459,0.8301067,0.3815145,0.1786467
std,321.9087,1.997391,3849.926,464.4117,0.3755392,0.4857586,0.3830564
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,280.0,2.0,3727.0,405.0,1.0,0.0,0.0
50%,558.0,4.0,5744.0,609.0,1.0,0.0,0.0
75%,838.0,6.0,7856.0,837.0,1.0,1.0,0.0
max,1115.0,7.0,41551.0,7388.0,1.0,1.0,1.0


time: 257 ms (started: 2021-01-18 19:27:12 +00:00)


## Model

In [247]:
combined_data = store_data.merge(train_data, on=['Store'])
combined_data.set_index('Date', inplace=True)
combined_data.head()

Date,Store,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval,DayOfWeek,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
2015-07-31,1,c,a,1270.0,9.0,2008.0,0,-1.0,-1.0,-1,5,5263,555,1,1,0,1
2015-07-30,1,c,a,1270.0,9.0,2008.0,0,-1.0,-1.0,-1,4,5020,546,1,1,0,1
2015-07-29,1,c,a,1270.0,9.0,2008.0,0,-1.0,-1.0,-1,3,4782,523,1,1,0,1
2015-07-28,1,c,a,1270.0,9.0,2008.0,0,-1.0,-1.0,-1,2,5011,560,1,1,0,1
2015-07-27,1,c,a,1270.0,9.0,2008.0,0,-1.0,-1.0,-1,1,6102,612,1,1,0,1


time: 568 ms (started: 2021-01-18 19:27:13 +00:00)


In [248]:
combined_data.reset_index(level=0, inplace=True)
model_data = combined_data.drop(columns = ['PromoInterval'])

time: 191 ms (started: 2021-01-18 19:27:13 +00:00)


### X & y Separation

In [249]:
# Take an explicit copy so the Date conversion does not write to a slice of model_data
y = model_data[['Date', 'Store', 'Sales']].copy()
X = model_data.drop(columns='Sales')
X['Date'] = pd.to_datetime(X['Date'])
y['Date'] = pd.to_datetime(y['Date'])

time: 392 ms (started: 2021-01-18 19:27:13 +00:00)


In [250]:
date_ref = pd.to_datetime(X.iloc[0, 0]) - pd.to_timedelta(42, unit='d')
date_ref

Timestamp('2015-06-19 00:00:00')

time: 4.55 ms (started: 2021-01-18 19:27:14 +00:00)


In [251]:
X_train = X[X['Date'] <= date_ref]
X_test = X[X['Date'] > date_ref]
y_train = y[y['Date'] <= date_ref]
y_test = y[y['Date'] > date_ref]
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

time: 217 ms (started: 2021-01-18 19:27:14 +00:00)
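The split above holds out the last 42 days (six weeks) of history, which mirrors the forecast horizon from the context section. A minimal sketch of the same idea on a toy frame (illustrative values); unlike the cell above, it takes the cutoff from `Date.max()` rather than from the first row, so it does not depend on row order:

```python
import pandas as pd

# Toy daily history for one store (illustrative values only).
toy = pd.DataFrame({
    'Date': pd.date_range('2015-01-01', '2015-07-31', freq='D'),
    'Store': 1,
})
toy['Sales'] = range(len(toy))

# Hold out the final 42 days (six weeks) as the evaluation period.
cutoff = toy['Date'].max() - pd.Timedelta(days=42)
train_part = toy[toy['Date'] <= cutoff]
test_part = toy[toy['Date'] > cutoff]
print(cutoff.date(), len(train_part), len(test_part))
```

Splitting by date rather than at random keeps future observations out of the training set, which is what a forecasting evaluation needs.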


In [252]:
X_train.head()

Unnamed: 0,Date,Store,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,DayOfWeek,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,2015-06-19,1,c,a,1270.0,9.0,2008.0,0,-1.0,-1.0,5,487,1,1,0,0
1,2015-06-18,1,c,a,1270.0,9.0,2008.0,0,-1.0,-1.0,4,498,1,1,0,0
2,2015-06-17,1,c,a,1270.0,9.0,2008.0,0,-1.0,-1.0,3,476,1,1,0,0
3,2015-06-16,1,c,a,1270.0,9.0,2008.0,0,-1.0,-1.0,2,503,1,1,0,0
4,2015-06-15,1,c,a,1270.0,9.0,2008.0,0,-1.0,-1.0,1,586,1,1,0,0


time: 32.5 ms (started: 2021-01-18 19:27:14 +00:00)


In [253]:
X_test.head()

Unnamed: 0,Date,Store,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,DayOfWeek,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,2015-07-31,1,c,a,1270.0,9.0,2008.0,0,-1.0,-1.0,5,555,1,1,0,1
1,2015-07-30,1,c,a,1270.0,9.0,2008.0,0,-1.0,-1.0,4,546,1,1,0,1
2,2015-07-29,1,c,a,1270.0,9.0,2008.0,0,-1.0,-1.0,3,523,1,1,0,1
3,2015-07-28,1,c,a,1270.0,9.0,2008.0,0,-1.0,-1.0,2,560,1,1,0,1
4,2015-07-27,1,c,a,1270.0,9.0,2008.0,0,-1.0,-1.0,1,612,1,1,0,1


time: 34.5 ms (started: 2021-01-18 19:27:14 +00:00)


In [254]:
y_train.head()

Unnamed: 0,Date,Store,Sales
0,2015-06-19,1,4202
1,2015-06-18,1,4645
2,2015-06-17,1,4000
3,2015-06-16,1,4852
4,2015-06-15,1,5518


time: 16.3 ms (started: 2021-01-18 19:27:14 +00:00)


In [255]:
y_test.head()

Unnamed: 0,Date,Store,Sales
0,2015-07-31,1,5263
1,2015-07-30,1,5020
2,2015-07-29,1,4782
3,2015-07-28,1,5011
4,2015-07-27,1,6102


time: 15.1 ms (started: 2021-01-18 19:27:14 +00:00)


### Encoding

In [256]:
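def one_hot_encoding(X):
  # One-hot encode the categorical columns and append them to the remaining features
  features = ['StoreType', 'Assortment', 'StateHoliday']
  encoder = OneHotEncoder()
  # .toarray() densifies the result without relying on the sparse/sparse_output argument
  encoded = encoder.fit_transform(X[features]).toarray()
  # get_feature_names was replaced by get_feature_names_out in recent scikit-learn versions
  get_names = getattr(encoder, 'get_feature_names_out', None) or encoder.get_feature_names
  X_encoded = pd.DataFrame(encoded, columns=get_names(features), index=X.index)
  return pd.concat([X.drop(features, axis=1), X_encoded], axis=1)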

time: 6.16 ms (started: 2021-01-18 19:27:14 +00:00)


In [257]:
X_train_encoded = one_hot_encoding(X_train)
X_train_encoded.head()

Unnamed: 0,Date,Store,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,DayOfWeek,Customers,Open,Promo,SchoolHoliday,StoreType_a,StoreType_b,StoreType_c,StoreType_d,Assortment_a,Assortment_b,Assortment_c,StateHoliday_0,StateHoliday_a,StateHoliday_b,StateHoliday_c
0,2015-06-19,1,1270.0,9.0,2008.0,0,-1.0,-1.0,5,487,1,1,0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
1,2015-06-18,1,1270.0,9.0,2008.0,0,-1.0,-1.0,4,498,1,1,0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,2015-06-17,1,1270.0,9.0,2008.0,0,-1.0,-1.0,3,476,1,1,0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
3,2015-06-16,1,1270.0,9.0,2008.0,0,-1.0,-1.0,2,503,1,1,0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
4,2015-06-15,1,1270.0,9.0,2008.0,0,-1.0,-1.0,1,586,1,1,0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0


time: 1.04 s (started: 2021-01-18 19:27:14 +00:00)


In [258]:
X_test_encoded = one_hot_encoding(X_test)
X_test_encoded.head()

Unnamed: 0,Date,Store,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,DayOfWeek,Customers,Open,Promo,SchoolHoliday,StoreType_a,StoreType_b,StoreType_c,StoreType_d,Assortment_a,Assortment_b,Assortment_c,StateHoliday_0
0,2015-07-31,1,1270.0,9.0,2008.0,0,-1.0,-1.0,5,555,1,1,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
1,2015-07-30,1,1270.0,9.0,2008.0,0,-1.0,-1.0,4,546,1,1,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,2015-07-29,1,1270.0,9.0,2008.0,0,-1.0,-1.0,3,523,1,1,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
3,2015-07-28,1,1270.0,9.0,2008.0,0,-1.0,-1.0,2,560,1,1,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,2015-07-27,1,1270.0,9.0,2008.0,0,-1.0,-1.0,1,612,1,1,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0


time: 96 ms (started: 2021-01-18 19:27:15 +00:00)


In [259]:
X_train_encoded.set_index(['Date', 'Store'], inplace=True)
X_test_encoded.set_index(['Date', 'Store'], inplace=True)
y_train.set_index(['Date', 'Store'], inplace=True)
y_test.set_index(['Date', 'Store'], inplace=True)

time: 113 ms (started: 2021-01-18 19:27:15 +00:00)


In [260]:
X_train_encoded.head()

Date,Store,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,DayOfWeek,Customers,Open,Promo,SchoolHoliday,StoreType_a,StoreType_b,StoreType_c,StoreType_d,Assortment_a,Assortment_b,Assortment_c,StateHoliday_0,StateHoliday_a,StateHoliday_b,StateHoliday_c
2015-06-19,1,1270.0,9.0,2008.0,0,-1.0,-1.0,5,487,1,1,0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2015-06-18,1,1270.0,9.0,2008.0,0,-1.0,-1.0,4,498,1,1,0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2015-06-17,1,1270.0,9.0,2008.0,0,-1.0,-1.0,3,476,1,1,0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2015-06-16,1,1270.0,9.0,2008.0,0,-1.0,-1.0,2,503,1,1,0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2015-06-15,1,1270.0,9.0,2008.0,0,-1.0,-1.0,1,586,1,1,0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0


time: 50.9 ms (started: 2021-01-18 19:27:15 +00:00)


In [261]:
# Create encoded columns in the test sets that are missing.
for column in np.asarray(X_train_encoded.columns):
  if column not in np.asarray(X_test_encoded.columns):
    X_test_encoded[column] = np.zeros(X_test_encoded.shape[0])

time: 7.28 ms (started: 2021-01-18 19:27:15 +00:00)
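Encoding the train and test sets separately can produce different columns, because a category that never occurs in the test period (here the state holidays) gets no column there. An alternative to the loop above is `DataFrame.reindex`, which adds the missing columns and also enforces the training column order. The sketch below is an assumption-laden illustration: it uses toy frames and `pd.get_dummies` instead of `OneHotEncoder` to stay short.

```python
import pandas as pd

# Toy frames (illustrative): the test period contains no state holidays.
train_toy = pd.DataFrame({'StateHoliday': ['0', 'a', 'b', 'c'], 'Promo': [1, 0, 0, 1]})
test_toy = pd.DataFrame({'StateHoliday': ['0', '0'], 'Promo': [1, 0]})

# Encode each set separately, as the notebook does.
train_enc = pd.get_dummies(train_toy, columns=['StateHoliday'], dtype=float)
test_enc = pd.get_dummies(test_toy, columns=['StateHoliday'], dtype=float)

# Add the missing columns as zeros and enforce the training column order.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0.0)
print(list(test_enc.columns))
```

Matching the column order matters because the fitted model expects the features in the same positions it saw during training.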


In [262]:
X_test_encoded.head()

Date,Store,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,DayOfWeek,Customers,Open,Promo,SchoolHoliday,StoreType_a,StoreType_b,StoreType_c,StoreType_d,Assortment_a,Assortment_b,Assortment_c,StateHoliday_0,StateHoliday_a,StateHoliday_b,StateHoliday_c
2015-07-31,1,1270.0,9.0,2008.0,0,-1.0,-1.0,5,555,1,1,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2015-07-30,1,1270.0,9.0,2008.0,0,-1.0,-1.0,4,546,1,1,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2015-07-29,1,1270.0,9.0,2008.0,0,-1.0,-1.0,3,523,1,1,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2015-07-28,1,1270.0,9.0,2008.0,0,-1.0,-1.0,2,560,1,1,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2015-07-27,1,1270.0,9.0,2008.0,0,-1.0,-1.0,1,612,1,1,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0


time: 47.5 ms (started: 2021-01-18 19:27:15 +00:00)


### Model Training & Prediction

In [263]:
model = RandomForestRegressor(n_estimators=10)

time: 1.41 ms (started: 2021-01-18 19:27:16 +00:00)


In [264]:
model.fit(X_train_encoded, y_train.values.ravel())

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=10, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

time: 46.5 s (started: 2021-01-18 19:27:16 +00:00)


In [265]:
X_test_encoded.describe()

Unnamed: 0,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,DayOfWeek,Customers,Open,Promo,SchoolHoliday,StoreType_a,StoreType_b,StoreType_c,StoreType_d,Assortment_a,Assortment_b,Assortment_c,StateHoliday_0,StateHoliday_a,StateHoliday_b,StateHoliday_c
count,45906.0,45906.0,45906.0,45906.0,45906.0,45906.0,45906.0,45906.0,45906.0,45906.0,45906.0,45906.0,45906.0,45906.0,45906.0,45906.0,45906.0,45906.0,45906.0,45906.0,45906.0,45906.0
mean,4725.40229,4.617566,1363.26441,0.522415,11.849039,1050.498628,4.0,630.096175,0.85943,0.357143,0.285126,0.537054,0.015554,0.133577,0.313815,0.537969,0.008234,0.453797,1.0,0.0,0.0,0.0
std,5822.824253,4.681108,938.34255,0.499503,15.976028,1005.381645,2.000022,440.971926,0.347581,0.479163,0.451479,0.498631,0.123741,0.340201,0.464047,0.498562,0.090369,0.497866,0.0,0.0,0.0,0.0
min,20.0,-1.0,-1.0,0.0,-1.0,-1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
25%,700.0,-1.0,-1.0,0.0,-1.0,-1.0,2.0,428.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
50%,2280.0,4.0,2006.0,1.0,1.0,2009.0,4.0,599.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
75%,6360.0,9.0,2011.0,1.0,22.0,2012.0,6.0,804.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0
max,27650.0,12.0,2015.0,1.0,50.0,2015.0,7.0,4783.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0


time: 132 ms (started: 2021-01-18 19:28:02 +00:00)


In [266]:
y_pred = model.predict(X_test_encoded)

time: 276 ms (started: 2021-01-18 19:28:02 +00:00)


In [267]:
# scikit-learn metrics take (y_true, y_pred); MAE is symmetric, so the value is unchanged
mean_absolute_error(y_test, y_pred)

409.4538677619059

time: 5.08 ms (started: 2021-01-18 19:28:02 +00:00)


In [268]:
np.mean(y_test)

Sales    5998.447719
dtype: float64

time: 4.84 ms (started: 2021-01-18 19:28:02 +00:00)
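To put the error in context, the MAE can be expressed relative to the average daily sales of the hold-out period. A small sketch using the two values printed above (the variable names are illustrative):

```python
# Values copied from the two outputs above.
mae = 409.4538677619059
mean_sales = 5998.447719

# MAE as a fraction of the average daily sales on the hold-out period.
relative_mae = mae / mean_sales
print(f'{relative_mae:.1%}')
```

This figure is probably optimistic: the hold-out set includes days when stores are closed (the mean of `Open` is about 0.86 in the summary above), and the model uses `Customers`, which is not available in `test.csv`.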


In [269]:
y_test['Pred'] = y_pred
y_test.reset_index(level=0, inplace=True)
y_test['Date'] = pd.to_datetime(y_test['Date'])
y_test.set_index('Date', drop=True, inplace=True)
y_test.head()

Date,Sales,Pred
2015-07-31,5263,4732.1
2015-07-30,5020,4532.7
2015-07-29,4782,4539.9
2015-07-28,5011,4620.1
2015-07-27,6102,5591.5


time: 48.7 ms (started: 2021-01-18 19:28:02 +00:00)


In [270]:
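# y_pred is a NumPy array and has no reset_index(); wrap it in a Series indexed like the test features
y_pred_series = pd.Series(y_pred, index=X_test_encoded.index, name='Pred')
y_pred_series.head()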

time: 16.7 ms (started: 2021-01-18 19:28:03 +00:00)


In [None]:
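def get_forecast_of_shop(ID_shop):
  # y_pred is aligned row by row with X_test_encoded, whose index is (Date, Store)
  forecasts = pd.Series(y_pred, index=X_test_encoded.index, name='Pred')
  return forecasts.xs(ID_shop, level='Store').sort_index()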

In [None]:
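# y_test holds the actual ('Sales') and predicted ('Pred') values indexed by date;
# sum over stores so that each curve shows one total per day
daily = y_test.groupby(level='Date')[['Sales', 'Pred']].sum()
fig, ax = plt.subplots()
ax.plot(daily.index, daily['Sales'], label='Actual sales')
ax.plot(daily.index, daily['Pred'], label='Predicted sales')
ax.legend()
plt.show()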