<a href="https://colab.research.google.com/github/KonstantinBurkin/Machine_Learning_Project/blob/main/Machine_Learning_Delivery_Club.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>Sales Forecasting</center>    

<br>

### <center>Konstantin Burkin</center>  

<center> <i>22 April 2022 </i> </center>  

---
### Table of contents:  
- <a href=#Introduction>Introduction</a> <br>
  - <a href=#Brief-outline>Brief outline</a> <br>
- <a href=#Import-of-data-and-Python-libraries>Import of data and Python libraries</a> <br>
- <a href=#Exploratory-Data-Analysis>Exploratory Data Analysis</a> <br>
- <a href=#Prediction>Prediction</a> <br>
- <a href=#Building-ML-models>Building ML models</a> <br>
- <a href=#Results>Results</a> <br>
---

## Introduction 

<p align="justify">This work was created and submitted as a final course project for MSU course "Machine learning for applied problems" in spring semester 2022. The project was written in <a href="https://https://colab.research.google.com">Google Colab</a>, using Python <i>version 3.7.13</i>.</p> 

<p align="justify">The goal is to create Machine Learning (ML) model that predicts the number of sales for each product in every store in Delivery Club App for the next week. The data set describing Delivery Club sales from 160 stores in 10 russian cities was provided by the supervisor. The data set contained information abouth the date of sales, weather conditions, product types, stores, and their locations.</p> 
  
### Brief outline  
---
* first step
* 
---

<br>   
  
<p align="justify">The result of this project is a dataset containing id of products and corresponding number of predicted sales. The error of predicted values is calculated as mean absotute error and should not exceede more than 4.10 units. The accuracy of predictions was evaluated using <a href="https://contest.yandex.ru/">Yandex.contest</a> platform.</p> 

## Import of data and Python libraries

The following data analysis includes several Python libraries for data analysis, builduing ML models, ploting data, etc:
- Matplotlib
- Numpy
- Pandas
- Sklearn
- Seaborn  
- Google colab   

Here, I download 2 datasets (train.csv and test.csv) from my Github repository and read the data csv files. Both data sets are available on my Github [page](https://github.com/KonstantinBurkin/Machine_Learning_Project/tree/main/data). The train data set is used to examine data, find patterns and factors that correlate to number of sales, add new variables and test machine learning models. The test data set is used to build predictions that will be evaluated later in Yandex.contest.

In [1]:
# Import libraries

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
import plotly
import plotly.express as px
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from google.colab import output

# я использовал классификаторные модели, а нужны регрессионные 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# 
from pandas import DatetimeIndex as dt
from sklearn.preprocessing import StandardScaler
from google.colab import files

In [2]:
# Download data from Github repo

!wget --no-cache --backups=1 {"https://raw.githubusercontent.com/KonstantinBurkin/Machine_Learning_Project/main/data/train.csv"}
!wget --no-cache --backups=1 {"https://raw.githubusercontent.com/KonstantinBurkin/Machine_Learning_Project/main/data/test.csv" }
output.clear()

In [3]:
# Read csv files

train = pd.read_csv("train.csv")      # download train data
test = pd.read_csv("test.csv")        # download test data

In [21]:
# for embedding plotly graphs
# plotly.offline.init_notebook_mode(connected=True)

## Exploratory Data Analysis

Two data sets (train.csv and test.csv) have common list of 11 variables. "Sales" variable is present only in train data set.  <br>

<center>Table of variables</center>

<style type="text/css">
.tg  {border:none;border-collapse:collapse;border-spacing:0;}
.tg td{border-style:solid;border-width:0px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;
  padding:10px 5px;word-break:normal;}
.tg th{border-style:solid;border-width:0px;font-family:Arial, sans-serif;font-size:14px;font-weight:normal;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-7btt{border-color:inherit;font-weight:bold;text-align:center;vertical-align:top}
.tg .tg-j2vi{border-color:inherit;font-weight:bold;text-align:center;vertical-align:top}
.tg .tg-pcvp{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-ihkz{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-7btt">№</th>
    <th class="tg-7btt">Variable</th>
    <th class="tg-7btt">Description</th>
    <th class="tg-7btt">Data type</th>
    <th class="tg-7btt">NAs</th>
    <th class="tg-7btt">Unique values</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-j2vi"><span style="font-weight:bold">1</span></td>
    <td class="tg-pcvp">id</td>
    <td class="tg-pcvp">Unique identifier representing a bundle (product_id, store_id, date)  <br>There is only one id, it does not repeat in the data</td>
    <td class="tg-pcvp">int64</td>
    <td class="tg-ihkz">0</td>
    <td class="tg-pcvp">666676</td>
  </tr>
  <tr>
    <td class="tg-7btt"><span style="font-weight:bold">2</span></td>
    <td class="tg-0pky">date<br></td>
    <td class="tg-0pky">Date of sale</td>
    <td class="tg-0pky">object</td>
    <td class="tg-c3ow">0</td>
    <td class="tg-0pky">200</td>
  </tr>
  <tr>
    <td class="tg-j2vi"><span style="font-weight:bold">3</span></td>
    <td class="tg-pcvp">city_name</td>
    <td class="tg-pcvp">Name of the city where the sale took place</td>
    <td class="tg-pcvp">object</td>
    <td class="tg-ihkz">0</td>
    <td class="tg-pcvp">10</td>
  </tr>
  <tr>
    <td class="tg-7btt"><span style="font-weight:bold">4</span></td>
    <td class="tg-0pky">store_id</td>
    <td class="tg-0pky">Unique identifier for each store</td>
    <td class="tg-0pky">int64</td>
    <td class="tg-c3ow">0</td>
    <td class="tg-0pky">160<br></td>
  </tr>
  <tr>
    <td class="tg-j2vi"><span style="font-weight:bold">5</span></td>
    <td class="tg-pcvp">category_id</td>
    <td class="tg-pcvp">Product category</td>
    <td class="tg-pcvp">int64</td>
    <td class="tg-ihkz">0</td>
    <td class="tg-pcvp">9</td>
  </tr>
  <tr>
    <td class="tg-7btt"><span style="font-weight:bold">6</span></td>
    <td class="tg-0pky">product_id</td>
    <td class="tg-0pky">Unique identifier for each type of product</td>
    <td class="tg-0pky">int64</td>
    <td class="tg-c3ow">0</td>
    <td class="tg-0pky">32</td>
  </tr>
  <tr>
    <td class="tg-j2vi"><span style="font-weight:bold">7</span></td>
    <td class="tg-pcvp">price</td>
    <td class="tg-pcvp">Price of the product</td>
    <td class="tg-pcvp">float64</td>
    <td class="tg-ihkz">0</td>
    <td class="tg-pcvp">29</td>
  </tr>
  <tr>
    <td class="tg-7btt"><span style="font-weight:bold">8</span></td>
    <td class="tg-0pky">weather_desc</td>
    <td class="tg-0pky">Weather description</td>
    <td class="tg-0pky">object</td>
    <td class="tg-c3ow">0</td>
    <td class="tg-0pky">16</td>
  </tr>
  <tr>
    <td class="tg-j2vi">9</td>
    <td class="tg-pcvp">humidity</td>
    <td class="tg-pcvp">Humidity in the city on the day of sale</td>
    <td class="tg-pcvp">float64</td>
    <td class="tg-ihkz">0</td>
    <td class="tg-pcvp">916</td>
  </tr>
  <tr>
    <td class="tg-7btt">10</td>
    <td class="tg-0pky">temperature</td>
    <td class="tg-0pky">Temperature in the city on the day of sale</td>
    <td class="tg-0pky">float64</td>
    <td class="tg-c3ow">0</td>
    <td class="tg-0pky">505</td>
  </tr>
  <tr>
    <td class="tg-j2vi">11</td>
    <td class="tg-pcvp">pressure</td>
    <td class="tg-pcvp">Atmosphere pressure in the city on the day of sale</td>
    <td class="tg-pcvp">float64</td>
    <td class="tg-ihkz">0</td>
    <td class="tg-pcvp">344</td>
  </tr>
  <tr>
    <td class="tg-7btt">12</td>
    <td class="tg-0pky">sales</td>
    <td class="tg-0pky">Number of product sales (this is what I should predict)</td>
    <td class="tg-0pky">int64</td>
    <td class="tg-c3ow">0</td>
    <td class="tg-0pky">249</td>
  </tr>
</tbody>
</table>


There are `666676` observations in the train data set and `24836` observations in the test data set.  <br>
The train dataset describes data in 7 months period, from June 29, 2021 to February 13, 2022. <br>
The test dataset describes data in one week period from February 14, 2022 to February 20, 2022.  <br>
The prices of products range from 1.93 to 18.63.<br>

<center>The data set contains information from 10 cities:</center>
<style type="text/css">
.tg  {border:none;border-collapse:collapse;border-spacing:0;}
.tg td{border-style:solid;border-width:0px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;
  padding:10px 5px;word-break:normal;}
.tg th{border-style:solid;border-width:0px;font-family:Arial, sans-serif;font-size:14px;font-weight:normal;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-b7tv{border-color:#ffffff;text-align:left;vertical-align:top}
.tg .tg-zv4m{border-color:#ffffff;text-align:left;vertical-align:top}
.tg .tg-aw21{border-color:#ffffff;font-weight:bold;text-align:center;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-aw21" colspan="2">Cities</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-b7tv">Moscow</td>
    <td class="tg-b7tv">St. Petersburg</td>
  </tr>
  <tr>
    <td class="tg-zv4m">Krasnodar</td>
    <td class="tg-zv4m">Samara</td>
  </tr>
  <tr>
    <td class="tg-b7tv">Nizhny Novgorod</td>
    <td class="tg-b7tv">Rostov-on-Don</td>
  </tr>
  <tr>
    <td class="tg-zv4m">Volgograd</td>
    <td class="tg-zv4m">Voronezh</td>
  </tr>
  <tr>
    <td class="tg-b7tv">Kazan</td>
    <td class="tg-b7tv">Yekaterinburg</td>
  </tr>
</tbody>
</table>

<br>

  
Create table
https://www.tablesgenerator.com/html_tables

https://towardsdatascience.com/cheat-sheet-for-google-colab-63853778c093
  
markdaown cheatsheet

https://colab.research.google.com/github/Tanu-N-Prabhu/Python/blob/master/Cheat_sheet_for_Google_Colab.ipynb#scrollTo=8fFhNQ2EGOcR

forms

https://colab.research.google.com/notebooks/forms.ipynb#scrollTo=ZCEBZPwUDGOg

<p align="justify">The graph below shows the number of sales during 7 month period in 10 cities. As seen from the graph the number of sales changes through the week in each city. In different cities there is a different rate of sales. In larger cities like Moscow the sales are the highest. The outstanding sharp decrease in sales can be observed around New Year, which is an important national holiday in Russia, during which people generally stay at home with their families and most of the stores are closed.</p>

In [7]:
# Transform city names to english format

train = train.replace(            
    ('Москва', 'Санкт-Петербург', 'Краснодар', 'Самара','Нижний Новгород', 'Ростов-на-Дону', 'Волгоград', 'Воронеж', 'Казань', 'Екатеринбург'),
    ("Moscow", "St.Petersburg", "Krasnodar", "Samara", "Nizhny.Novgorod", "Rostov-on-Don", "Volgograd", "Voronezh", "Kazan", "Yekaterinburg")  )


In [22]:
# Number of sales in each city

group = train[['date', 'city_name', 'sales']].groupby(['date', 'city_name'], as_index=False).sum()
fig = px.line(group, x="date", y="sales", color='city_name', template='plotly_dark')
fig.update_layout(
    title={
        'text': "Sales in every city for 7 month period",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        xaxis_title='Date, Month-Year',
        yaxis_title='Sales',
        legend=dict(y=0.5, title="City"))
fig.show() 

It can be seen that in every city the total number of sales did not change over several months except for slight decrease in autumn. 

In [9]:
# How sales trends changed over months

group = train[['date', 'city_name', 'sales']]
group.date = pd.to_datetime(group.date)  
group = group[(group.date.dt.month != 2) & (group.date.dt.month != 7)] # subset data - so that I compare whole months 
group = group.groupby([group.date.dt.month, group.city_name], as_index=False).sum()

group["month"] = np.repeat(pd.to_datetime(train.date[(pd.to_datetime(train.date).dt.month != 2) & (pd.to_datetime(train.date).dt.month != 7)]).dt.strftime("%B").unique(), 10)

# group.shape
# # plot

fig = px.line(group, x="month", y="sales", color='city_name', template='plotly_dark')
fig.update_layout(
    title={
        'text': "Monthly sales in every city",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        xaxis_title='Month',
        yaxis_title='Sales',
        legend=dict(y=0.5, title="City"))
fig.show() 



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Mean of sales is low at the beginning of the week and that sales rise up by the weekend. Since the dependence from day of the week is quite substansive it is reasonable to add additinal variable "Day of the week" to the dataset.

In [10]:
# Mean sales over the week

group = train[['date', 'sales']]
group.date = pd.to_datetime(group.date)  
group = group.groupby([group.date.dt.weekday], as_index=False).mean()

group["weekdays"] = pd.to_datetime(train.iloc[4:].date).dt.strftime("%A").unique() # add weekdays from monday to sunday


# plot

fig = px.bar(group, x="weekdays", y="sales",  template='plotly_dark')
fig.update_layout(
    title={
        'text': "Mean sales for every day of the week",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        xaxis_title='Day of the week',
        yaxis_title='Mean of sales',
        width=750)

fig.show() 



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



The cheaper the product the more often it gets selled. The dependence have bell-like structure, so products with the price lower that median are selled the most. Since the strong dependence from price is observed price tags should be transform into ... parameters.

In [20]:
# histogram of sales by prices

group = train[['price', 'sales']]
fig = px.histogram(group, x="price",  template='plotly_dark', nbins=20)
fig.update_layout(
    title={
        'text': "Histogram of sales by prices",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        xaxis_title='Price',
        yaxis_title='Sales',
        width=750)
fig.show() 

For further evaluation of ML models mean absolute error (MAE) is used as quality metric. MAE shows how many times the forecast is wrong on average. This metric is very easy to interpret. If we have a metric value equal to 5, then this means that the model is on average wrong by 5. The metric cannot be negative, since all errors are taken modulo, for an ideal model this metric will be equal to 0. The metric is not so sensitive to outliers.  

$$MAE = \frac1N \sum ^{N}_{i=1} |y_i-\hat y_i|$$

## Prediction

In [None]:
# Upload data sets into the project

train = pd.read_csv("train.csv")      # download train data
test = pd.read_csv("test.csv")        # download test data
df = pd.concat([train,test], axis=0)  # concat both data sets

In [None]:
df.date = pd.to_datetime(df.date)                              # convert date column to date format    
df = df.assign(dayofweek=df.date.dt.dayofweek)                 # create day of week column
df = df.assign(weekend=lambda x: 1*(df.date.dt.dayofweek>4))   # create weekend column

In [None]:
# add day_product_mean which shows mean of sales for product_id from store_id for each day of the week

group = df.groupby(['product_id', 'store_id', 'dayofweek'])[['sales']].mean().reset_index()
group.rename(columns={'sales':'day_product_mean'}, inplace=True)
df = pd.merge(df, group, how="left", on=['product_id', 'store_id', 'dayofweek'])

In [None]:
# add lag_day_7-15 which shows lag of sales for product_id from store_id 1-3 weeks ago

group = df.groupby(['product_id', 'store_id', 'date', ])[['sales']].sum().reset_index()
group.dropna(inplace=True)                                                         # drop data with unknown sales
for i in range(7, 15):
    group[f'lag_day_{i}'] = group['sales'].shift(i)                                # add lags for 1-3 weeks
group.drop(['sales'], axis=1, inplace=True)
group.dropna(inplace=True)                                                         # drop created NAs from first 3 weeks

#  делаю мердж с исходным датафреймом
df = pd.merge(df, group, how="left", on=['product_id', 'store_id', 'date'])        # first three weeks will have NA b/c no lags exist for them

In [None]:
df.drop(index=df.index[:14], axis=0, inplace=True)    # drop first three weeks

In [None]:
# df.isna().sum()

In [None]:
# df.dayofweek = df.dayofweek.astype(str)            # convert dayofweek column to string format
# df.category_id = df.category_id.astype(str)
# df.product_id = df.product_id.astype(str)
# df.store_id = df.store_id.astype(str)
df.price = df.price.astype(str)

df.drop(labels=["weather_desc"], axis=1, inplace=True)
df = pd.get_dummies(df)                            # convert string columns to binary columns

In [None]:
sales = df.sales.dropna()                            # extract sales column = y    

In [None]:
df_test = df[df.id > 666676].drop(labels=["sales", "date"], axis=1, inplace=False)  # drop column with date format and column with y data
df_train = df[df.id < 666677].drop(labels=["sales", "date"], axis=1, inplace=False)  # drop column with date format and column with y data

In [None]:
df_train.head()

Unnamed: 0,id,store_id,category_id,product_id,humidity,temperature,pressure,dayofweek,weekend,day_product_mean,...,price_4.18,price_4.64,price_4.79,price_5.9,price_6.02,price_6.2,price_6.58,price_7.68,price_7.78,price_8.15
14,15,1,1,1,64.875,21.5,746.0,3,0,19.344828,...,0,0,1,0,0,0,0,0,0,0
15,16,1,1,1,75.4375,17.125,748.0,4,0,24.862069,...,0,0,1,0,0,0,0,0,0,0
16,17,1,1,1,55.5,19.8125,751.3125,5,1,25.758621,...,0,0,1,0,0,0,0,0,0,0
17,18,1,1,1,58.5625,22.1875,746.625,6,1,25.551724,...,0,0,1,0,0,0,0,0,0,0
18,19,1,1,1,53.0625,22.8125,747.3125,0,0,16.178571,...,0,0,1,0,0,0,0,0,0,0


In [None]:
# make 4 subsets for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    df_train,
    sales,
    train_size = 0.9999, 
    test_size = 0.0001,
    shuffle = True)


In [None]:
# Linear model

model = LinearRegression()
model.fit(X_train, y_train)

forecast_lm = model.predict(X_test)
mae = mean_absolute_error(y_test, forecast_lm)

print(f"Linear model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"Linear model: MAE = {mae:.2f} > 4.10 ")

Linear model: MAE = 3.58 < 4.10 


In [None]:
# TreeClassifier model 
tree_clf = DecisionTreeClassifier(max_depth=10).fit(X_train, y_train)
forecast_tree = tree_clf.predict(X_test)
mae = mean_absolute_error(y_test, forecast_tree)
print(f"TreeClassifier model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"TreeClassifier model: MAE = {mae:.2f} > 4.10 ")


TreeClassifier model: MAE = 4.48 > 4.10 


In [None]:
# KNN model
# have to choose perfect number of neighbours
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)
forecast_knn = knn_clf.predict(X_test)
mae = mean_absolute_error(y_test, forecast_knn)
print(f"KNN model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"KNN model: MAE = {mae:.2f} > 4.10 ")

KNN model: MAE = 6.28 > 4.10 


In [None]:
# Sum of KNN, Linear model and TreeClassifier
sum_of_voices = (forecast_knn + forecast_tree + forecast_lm)/3
mae = mean_absolute_error(y_test, sum_of_voices)
print(f"Ensemble model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"Ensemble model: MAE = {mae:.2f} > 4.10 ")

Ensemble model: MAE = 4.30 > 4.10 


In [None]:
# RandomForestClassifier model
rf_model = RandomForestClassifier(n_estimators = 4, min_samples_split=250)
rf_model.fit(X_train, y_train)
forecast = rf_model.predict(X_test)
mae = mean_absolute_error(y_test, forecast)
print(f"KNN model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"KNN model: MAE = {mae:.2f} > 4.10 ")

KNN model: MAE = 3.21 < 4.10 


In [None]:
# TreeClassifier model for actual test data - bad mae
tree_clf = DecisionTreeClassifier(max_depth=12).fit(df_train, sales)
forecast = tree_clf.predict(df_test)

forecast1 = pd.DataFrame(forecast, columns = ['prediction'])
id = pd.DataFrame(df_test.id, columns = ['id'])

forecast1.reset_index(drop=True, inplace=True)
id.reset_index(drop=True, inplace=True)

result = pd.concat([id, forecast1], axis=1)

In [None]:
# Linear  model for actual test data

model = LinearRegression()
model.fit(df_train, sales)

forecast = model.predict(df_test)

forecast1 = pd.DataFrame(forecast, columns = ['prediction'])
id = pd.DataFrame(df_test.id, columns = ['id'])

forecast1.reset_index(drop=True, inplace=True)
id.reset_index(drop=True, inplace=True)

result = pd.concat([id, forecast1], axis=1)

In [None]:
# KNN model for actual test data

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(df_train, sales)
forecast_knn = knn_clf.predict(df_test)

forecast1 = pd.DataFrame(forecast_knn, columns = ['prediction'])
id = pd.DataFrame(df_test.id, columns = ['id'])

forecast1.reset_index(drop=True, inplace=True)
id.reset_index(drop=True, inplace=True)

result = pd.concat([id, forecast1], axis=1)

In [None]:
# Sum of KNN, Linear model and TreeClassifier 

tree_clf = DecisionTreeClassifier(min_samples_split=50).fit(df_train, sales)
forecast_tree = tree_clf.predict(df_test)

model = LinearRegression()
model.fit(df_train, sales)
forecast_lm = model.predict(df_test)

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(df_train, sales)
forecast_knn = knn_clf.predict(df_test)


sum_of_voices = (forecast_knn + forecast_tree + forecast_lm)/3

forecast1 = pd.DataFrame(sum_of_voices, columns = ['prediction'])
id = pd.DataFrame(df_test.id, columns = ['id'])

forecast1.reset_index(drop=True, inplace=True)
id.reset_index(drop=True, inplace=True)

result = pd.concat([id, forecast1], axis=1)

In [None]:
result.head()

Unnamed: 0,id,prediction
0,666677,23.300873
1,666678,23.107509
2,666679,24.155128
3,666680,25.67957
4,666681,32.695642


In [None]:
result.to_csv("prediction.csv", index=False)
files.download("prediction.csv")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

## Building ML models

  ML models:

- k-nearest neighbors algorithm (KNN) is a supervised learning method - classification and regression
- Linear - - regression
- Tree - - classification and regression
- Ensemble
  - Average of different classifiers
  - Random forest
  - Gradient Boosting

In [None]:
# 
train.date = pd.to_datetime(train.date)
train = train.assign(dayofweek=train.date.dt.dayofweek)
train = train.assign(weekend=lambda x: 1*(train.date.dt.dayofweek>4))
train.product_id = train.product_id.astype(str)
train.store_id = train.store_id.astype(str)
train = pd.get_dummies(train)

In [None]:
# add average of sales for each day of the week
z = []
for i in range(7):
  z.append(train[train.dayofweek == i].sales.mean())

day_sales_mean = {'day_sales_mean': z, 'dayofweek': [0,1,2,3,4,5,6]}
day_sales_mean = pd.DataFrame(day_sales_mean)
train = pd.merge(train, day_sales_mean, on="dayofweek")

In [None]:
train.dayofweek = train.dayofweek.astype(str)
train.category_id = train.category_id.astype(str)
train.price = train.price.astype(str)
train = pd.get_dummies(train)

In [None]:
# add lags
for i in (1, 2, 3):
    train[f'lag_day_{i*7}'] = train['sales'].shift(i)
train = train.dropna()

In [None]:
sales = train.sales
train.drop(labels=["sales", "date"], axis=1, inplace=True) #, "id"

In [None]:
# print(sales.shape, train.shape)

In [None]:
# scaler = StandardScaler()
# train.iloc[:,[1,2,3,4]] = scaler.fit_transform(train.iloc[:,[1,2,3,4]])

In [None]:
# make 4 subsets for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    train,
    sales,
    train_size = 0.9999, 
    test_size = 0.0001,
    shuffle = True)


In [None]:
# Linear model

model = LinearRegression()
model.fit(X_train, y_train)

forecast_lm = model.predict(X_test)
mae = mean_absolute_error(y_test, forecast_lm)

print(f"Linear model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"Linear model: MAE = {mae:.2f} > 4.10 ")

Linear model: MAE = 4.02 < 4.10 


In [None]:
# TreeClassifier model 
tree_clf = DecisionTreeClassifier().fit(X_train, y_train)
forecast_tree = tree_clf.predict(X_test)
mae = mean_absolute_error(y_test, forecast_tree)
print(f"TreeClassifier model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"TreeClassifier model: MAE = {mae:.2f} > 4.10 ")


TreeClassifier model: MAE = 5.67 > 4.10 


In [None]:
# Sum of Linear regression and TreeClassifier
sum_of_voices = (forecast_lm + forecast_tree)/2
mae = mean_absolute_error(y_test, sum_of_voices)
print(f"Ensemble model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"Ensemble model: MAE = {mae:.2f} > 4.10 ")

Ensemble model: MAE = 4.35 > 4.10 


## old code

In [None]:
# Linear model - refression

model = LinearRegression()
model.fit(X_train.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False), # выкидываю character data type and NA
          y_train.dropna(axis=0, how='any', inplace=False)) # выкидываю character data type and NA

forecast_lm = model.predict(X_test.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False))
mae = mean_absolute_error(y_test, forecast_lm)

print(f"Linear model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"Linear model: MAE = {mae:.2f} > 4.10 ")


Linear model: MAE = 7.57 > 4.10 


In [None]:
# TreeClassifier model - classification
tree_clf = DecisionTreeClassifier().fit(X_train.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False), 
                                        y_train.dropna(axis=0, how='any', inplace=False))
forecast_tree = tree_clf.predict(X_test.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False))
mae = mean_absolute_error(y_test, forecast_tree)
print(f"TreeClassifier model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"TreeClassifier model: MAE = {mae:.2f} > 4.10 ")

TreeClassifier model: MAE = 4.32 > 4.10 


In [None]:
# KNN model
# have to choose perfect number of neighbours
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False), y_train.dropna(axis=0, how='any', inplace=False))
forecast_knn = knn_clf.predict(X_test.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False))
mae = mean_absolute_error(y_test, forecast_knn)
print(f"KNN model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"KNN model: MAE = {mae:.2f} > 4.10 ")

KNN model: MAE = 0.35 < 4.10 


In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV

In [None]:
# https://chrisalbon.com/code/machine_learning/nearest_neighbors/identifying_best_value_of_k/
y = y_train.dropna(axis=0, how='any', inplace=False)
# Create standardizer
standardizer = StandardScaler()

# Standardize features
X_std = standardizer.fit_transform(X_train.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False))

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean', n_jobs=-1).fit(X_std, y)

# Create a pipeline
pipe = Pipeline([('standardizer', standardizer), ('knn', knn)])

# Create space of candidate values
search_space = [{'knn__n_neighbors': list(range(3,30,4))}]
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0).fit(X_std, y)
# Best neighborhood size (k)
clf.best_estimator_.get_params()['knn__n_neighbors']



3

In [None]:
# mean_absolute_error
from sklearn.metrics import make_scorer
custom_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
y = y_train.dropna(axis=0, how='any', inplace=False)
# Create standardizer
standardizer = StandardScaler()

# Standardize features
X_std = standardizer.fit_transform(X_train.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False))

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean', n_jobs=-1).fit(X_std, y)

# Create a pipeline
pipe = Pipeline([('standardizer', standardizer), ('knn', knn)])

# Create space of candidate values
search_space = [{'knn__n_neighbors': list(range(3,30,4))}]
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0, scoring=custom_scorer).fit(X_std, y)
# Best neighborhood size (k)
clf.best_estimator_.get_params()['knn__n_neighbors']

# считал минут 8 - результат 3



3

In [None]:
# RandomForestClassifier model
rf_model = RandomForestClassifier(n_estimators = 4)
rf_model.fit(X_train.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False), 
                                        y_train.dropna(axis=0, how='any', inplace=False))
forecast = rf_model.predict(X_test.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False))
mae = mean_absolute_error(y_test, forecast)
print(f"KNN model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"KNN model: MAE = {mae:.2f} > 4.10 ")

In [None]:
# Gradient Boosting


In [None]:
index = [1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35]

In [None]:
# index

In [None]:
train.loc[:,train.columns != "sales"].shape

(666676, 37)

In [None]:
index1 = list(range(2,32))
index.insert(0, 0)

In [None]:
test.iloc[:,index1].head(2)

Unnamed: 0,store_id,category_id,product_id,price,humidity,temperature,pressure,dayofweek,weekend,city_name_Волгоград,...,weather_desc_облачно,"weather_desc_облачно, без существенных осадков","weather_desc_облачно, небольшие осадки","weather_desc_облачно, небольшой дождь","weather_desc_облачно, небольшой снег",weather_desc_осадки,weather_desc_переменная облачность,"weather_desc_переменная облачность, небольшие осадки",weather_desc_снег,weather_desc_ясно
0,1,1,1,4.79,87.3125,-1.9375,749.3125,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,1,1,1,4.79,88.75,-1.25,752.6875,1,0,0,...,0,0,0,0,0,0,1,0,0,0


In [None]:
test.shape

(24836, 32)

In [None]:
train.columns

Index(['id', 'date', 'store_id', 'category_id', 'product_id', 'price',
       'humidity', 'temperature', 'pressure', 'sales', 'dayofweek', 'weekend',
       'city_name_Волгоград', 'city_name_Воронеж', 'city_name_Екатеринбург',
       'city_name_Казань', 'city_name_Краснодар', 'city_name_Москва',
       'city_name_Нижний Новгород', 'city_name_Ростов-на-Дону',
       'city_name_Самара', 'city_name_Санкт-Петербург', 'weather_desc_дождь',
       'weather_desc_дождь, гроза', 'weather_desc_метель',
       'weather_desc_облачно',
       'weather_desc_облачно, без существенных осадков',
       'weather_desc_облачно, небольшие осадки',
       'weather_desc_облачно, небольшой дождь',
       'weather_desc_облачно, небольшой снег', 'weather_desc_осадки',
       'weather_desc_переменная облачность',
       'weather_desc_переменная облачность, дождь',
       'weather_desc_переменная облачность, небольшие осадки',
       'weather_desc_переменная облачность, небольшой дождь',
       'weather_desc_пере

In [None]:
test.columns

Index(['id', 'date', 'store_id', 'category_id', 'product_id', 'price',
       'humidity', 'temperature', 'pressure', 'dayofweek', 'weekend',
       'city_name_Волгоград', 'city_name_Воронеж', 'city_name_Екатеринбург',
       'city_name_Казань', 'city_name_Краснодар', 'city_name_Москва',
       'city_name_Нижний Новгород', 'city_name_Ростов-на-Дону',
       'city_name_Самара', 'city_name_Санкт-Петербург', 'weather_desc_дождь',
       'weather_desc_облачно',
       'weather_desc_облачно, без существенных осадков',
       'weather_desc_облачно, небольшие осадки',
       'weather_desc_облачно, небольшой дождь',
       'weather_desc_облачно, небольшой снег', 'weather_desc_осадки',
       'weather_desc_переменная облачность',
       'weather_desc_переменная облачность, небольшие осадки',
       'weather_desc_снег', 'weather_desc_ясно'],
      dtype='object')

In [None]:
X_train.loc[:, X_train.columns != "date"].columns

Index(['id', 'store_id', 'category_id', 'product_id', 'price', 'humidity',
       'temperature', 'pressure', 'sales', 'dayofweek'],
      dtype='object')

In [None]:
X_train.columns

Index(['id', 'date', 'store_id', 'category_id', 'product_id', 'price',
       'humidity', 'temperature', 'pressure', 'sales', 'dayofweek', 'weekend',
       'city_name_Волгоград', 'city_name_Воронеж', 'city_name_Екатеринбург',
       'city_name_Казань', 'city_name_Краснодар', 'city_name_Москва',
       'city_name_Нижний Новгород', 'city_name_Ростов-на-Дону',
       'city_name_Самара', 'city_name_Санкт-Петербург', 'weather_desc_дождь',
       'weather_desc_дождь, гроза', 'weather_desc_метель',
       'weather_desc_облачно',
       'weather_desc_облачно, без существенных осадков',
       'weather_desc_облачно, небольшие осадки',
       'weather_desc_облачно, небольшой дождь',
       'weather_desc_облачно, небольшой снег', 'weather_desc_осадки',
       'weather_desc_переменная облачность',
       'weather_desc_переменная облачность, дождь',
       'weather_desc_переменная облачность, небольшие осадки',
       'weather_desc_переменная облачность, небольшой дождь',
       'weather_desc_пере

In [None]:
# TreeClassifier model - classification
tree_clf = DecisionTreeClassifier().fit(X_train, 
                                        y_train)
forecast_tree = tree_clf.predict(X_test)
mae = mean_absolute_error(y_test, forecast_tree)
print(f"TreeClassifier model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"TreeClassifier model: MAE = {mae:.2f} > 4.10 ")

TypeError: ignored

In [None]:
train.columns

Index(['id', 'date', 'store_id', 'category_id', 'product_id', 'price',
       'humidity', 'temperature', 'pressure', 'sales', 'dayofweek', 'weekend',
       'city_name_Волгоград', 'city_name_Воронеж', 'city_name_Екатеринбург',
       'city_name_Казань', 'city_name_Краснодар', 'city_name_Москва',
       'city_name_Нижний Новгород', 'city_name_Ростов-на-Дону',
       'city_name_Самара', 'city_name_Санкт-Петербург', 'weather_desc_дождь',
       'weather_desc_дождь, гроза', 'weather_desc_метель',
       'weather_desc_облачно',
       'weather_desc_облачно, без существенных осадков',
       'weather_desc_облачно, небольшие осадки',
       'weather_desc_облачно, небольшой дождь',
       'weather_desc_облачно, небольшой снег', 'weather_desc_осадки',
       'weather_desc_переменная облачность',
       'weather_desc_переменная облачность, дождь',
       'weather_desc_переменная облачность, небольшие осадки',
       'weather_desc_переменная облачность, небольшой дождь',
       'weather_desc_пере

In [None]:
train.iloc[:,9]

0         26
1         37
2         25
3         26
4         22
          ..
666671    11
666672    17
666673     2
666674     7
666675    18
Name: sales, Length: 666676, dtype: int64

In [None]:
train.iloc[:,[0,1,2,3, 4, 5, 6, 7, 8,10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37]].columns

Index(['id', 'date', 'store_id', 'category_id', 'product_id', 'price',
       'humidity', 'temperature', 'pressure', 'dayofweek', 'weekend',
       'city_name_Волгоград', 'city_name_Воронеж', 'city_name_Екатеринбург',
       'city_name_Казань', 'city_name_Краснодар', 'city_name_Москва',
       'city_name_Нижний Новгород', 'city_name_Ростов-на-Дону',
       'city_name_Самара', 'city_name_Санкт-Петербург', 'weather_desc_дождь',
       'weather_desc_дождь, гроза', 'weather_desc_метель',
       'weather_desc_облачно',
       'weather_desc_облачно, без существенных осадков',
       'weather_desc_облачно, небольшие осадки',
       'weather_desc_облачно, небольшой дождь',
       'weather_desc_облачно, небольшой снег', 'weather_desc_осадки',
       'weather_desc_переменная облачность',
       'weather_desc_переменная облачность, дождь',
       'weather_desc_переменная облачность, небольшие осадки',
       'weather_desc_переменная облачность, небольшой дождь',
       'weather_desc_переменная об

In [None]:
train.loc[:,train.columns not in ("sales","date")].columns

ValueError: ignored

In [None]:
(train.columns not in ("sales","date")).any()

ValueError: ignored

In [None]:
# make 4 subsets for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    train.loc[:,train.columns != "sales" | train.columns != "date"],
    train.iloc[:,9],
    train_size = 0.8, 
    test_size = 0.2,
    random_state = 2022)

In [None]:
# convert date column from string to datetime and add weekday column into 
train.date = pd.to_datetime(train.date)
train = train.assign(dayofweek=train.date.dt.dayofweek)
train = train.assign(weekend=lambda x: 1*(train.date.dt.dayofweek>4))
train = pd.get_dummies(train)

In [None]:
# convert date column from string to datetime and add weekday column into 
test.date = pd.to_datetime(test.date)
test = test.assign(dayofweek=test.date.dt.dayofweek)
test = test.assign(weekend=lambda x: 1*(test.date.dt.dayofweek>4))
test = pd.get_dummies(test)

In [None]:
train.info()

In [None]:
train.head(2)

Unnamed: 0,id,date,store_id,category_id,product_id,price,humidity,temperature,pressure,sales,...,"weather_desc_облачно, небольшой дождь","weather_desc_облачно, небольшой снег",weather_desc_осадки,weather_desc_переменная облачность,"weather_desc_переменная облачность, дождь","weather_desc_переменная облачность, небольшие осадки","weather_desc_переменная облачность, небольшой дождь","weather_desc_переменная облачность, небольшой снег",weather_desc_снег,weather_desc_ясно
0,1,2021-07-29,1,1,1,4.79,61.9375,23.1875,741.0,26,...,0,0,0,0,0,0,1,0,0,0
1,2,2021-07-30,1,1,1,4.79,70.25,22.1875,740.3125,37,...,0,0,0,0,0,0,1,0,0,0


In [None]:
train.category_id.unique()

array([1, 2, 3, 4, 5, 7, 8, 9, 6])

In [None]:
train.date.max()

Timestamp('2022-02-13 00:00:00')

In [None]:
test.date = pd.to_datetime(test.date)
test.date.max()

Timestamp('2022-02-20 00:00:00')

In [None]:
train = pd.get_dummies(train)

In [None]:
train.columns

Index(['id', 'date', 'store_id', 'category_id', 'product_id', 'price',
       'humidity', 'temperature', 'pressure', 'sales', 'dayofweek', 'weekend',
       'city_name_Kazan', 'city_name_Krasnodar', 'city_name_Moscow',
       'city_name_Nizhny.Novgorod', 'city_name_Rostov-on-Don',
       'city_name_Samara', 'city_name_St.Petersburg', 'city_name_Volgograd',
       'city_name_Voronezh', 'city_name_Yekaterinburg', 'weather_desc_дождь',
       'weather_desc_дождь, гроза', 'weather_desc_метель',
       'weather_desc_облачно',
       'weather_desc_облачно, без существенных осадков',
       'weather_desc_облачно, небольшие осадки',
       'weather_desc_облачно, небольшой дождь',
       'weather_desc_облачно, небольшой снег', 'weather_desc_осадки',
       'weather_desc_переменная облачность',
       'weather_desc_переменная облачность, дождь',
       'weather_desc_переменная облачность, небольшие осадки',
       'weather_desc_переменная облачность, небольшой дождь',
       'weather_desc_переме

In [None]:
# Добавим лаговые фичи от 7 до 14 дней
# df = df.sort_values(['region_id', 'date', 'hour']).reset_index(drop=True)
# group = df.groupby(['hour', 'region_id'])
for i in range(7, 15):
    train[f'lag_day_{i}'] = train['sales'].shift(i)

In [None]:
df[(df['region_id'] == 3) & (df['hour'] == 14)].iloc[-14:]

In [None]:
train.weather_desc.unique()

array(['переменная облачность, небольшой дождь', 'переменная облачность',
       'облачно, небольшой дождь', 'дождь, гроза',
       'облачно, без существенных осадков',
       'переменная облачность, дождь', 'дождь', 'облачно', 'ясно',
       'облачно, небольшой снег',
       'переменная облачность, небольшие осадки',
       'облачно, небольшие осадки', 'снег', 'метель', 'осадки',
       'переменная облачность, небольшой снег'], dtype=object)

## Results

In [None]:
# df.to_csv("prediction.csv")

In [None]:
# %%shell
# jupyter nbconvert /content/Machine_Learning_Delivery_Club.ipynb --to html --no-input

# --no-input for hiding code

In [None]:
# import IPython
# IPython.display.HTML(filename='/content/file.html')

Inslude graphs into html report:
https://medium.com/@282abhishek/using-plotly-with-nbconvert-in-google-colab-96834c4f2850



```
import plotly
plotly.offline.init_notebook_mode(connected=True)
```

Set dark theme in html:  
https://github.com/jupyter/nbconvert/pull/1703  
https://blog.jupyter.org/the-templating-system-of-nbconvert-6-47ea781eacd2

maybe I should convert to md format first