<a href="https://colab.research.google.com/github/KonstantinBurkin/Machine_Learning_Project/blob/main/Machine_Learning_Delivery_Club.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sales Forecasting

In this project I build a machine learning model that predicts next week's number of orders for each product in each store on the Delivery Club platform. I use a data set describing sales of goods in stores on the platform in 10 cities in Russia (Moscow, St. Petersburg, Krasnodar, Samara, Nizhny Novgorod, Rostov-on-Don, Volgograd, Voronezh, Kazan, Yekaterinburg).

Two data sets (train.csv and test.csv) are used to train a regression model and evaluate its accuracy. Both data sets have the same list of variables, except that sales (the target) is present only in train.csv:
- id - unique identifier of a (product_id, store_id, date) triple. Each such triple has exactly one id, and no id is repeated in the data
- date - date the product was sold
- city_name - name of the city where the sale took place
- store_id - unique identifier of the store
- category_id - category of the product sold
- product_id - unique identifier of the product
- price - price of the product
- weather_desc - short description of the weather in that city on the day of the sale
- humidity - humidity in that city on the day of the sale
- temperature - temperature in that city on the day of the sale
- pressure - atmospheric pressure in that city on the day of the sale
- sales - number of units of the product sold (this is the value to forecast)
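
The uniqueness property described for id can be checked directly. The following is a minimal sketch on a toy frame with made-up values; only the column names come from the list above, and the same two checks could be run on train and test.

```python
import pandas as pd

# Toy rows with hypothetical values; column names follow the variable list.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "product_id": [1, 1, 2, 2],
    "store_id": [1, 1, 1, 1],
    "date": ["2021-07-29", "2021-07-30", "2021-07-29", "2021-07-30"],
})

key = ["product_id", "store_id", "date"]

# Each (product_id, store_id, date) triple should occur exactly once ...
triples_unique = not toy.duplicated(subset=key).any()
# ... and each id should occur exactly once as well.
ids_unique = toy["id"].is_unique

print(triples_unique, ids_unique)
```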

- MAE - mean absolute error, which shows by how many orders the forecast is off on average. This metric is very easy to interpret: a value of 5, for example, means that the model is off by 5 sales on average for each product, store, and day. The metric is never negative because every error is taken in absolute value, and it equals 0 for a perfect model. It is also less sensitive to outliers than squared-error metrics such as MSE.

$$MAE = \frac1N \sum ^{N}_{i=1} |y_i-\hat y_i|$$
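
As a worked example of the formula, the sketch below computes MAE by hand on four illustrative (made-up) values and compares the result with mean_absolute_error from scikit-learn, the function used later in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([26.0, 37.0, 25.0, 26.0])  # illustrative actual sales
y_pred = np.array([30.0, 30.0, 25.0, 20.0])  # illustrative forecasts

# Absolute errors are 4, 7, 0 and 6, so MAE = 17 / 4 = 4.25
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```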

In [1]:
# Import libraries

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
import plotly.express as px
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from google.colab import output
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [2]:
from google.colab import files
uploaded = files.upload()

In [None]:
# Download data from Github

# !wget --no-cache --backups=1 {"https://github.com/KonstantinBurkin/Machine_Learning_Project/blob/main/train.csv"}
# !wget --no-cache --backups=1 {"https://github.com/KonstantinBurkin/Machine_Learning_Project/blob/main/public_data.zip"}
# output.clear()

In [2]:
# Load the data sets into the project

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
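
By default read_csv keeps date columns as plain strings. If date arithmetic is needed later (for example weekday or month features), the column can be parsed at load time. A minimal sketch on an in-memory CSV with made-up rows, assuming the date format matches the one in train.csv:

```python
import io
import pandas as pd

# Small in-memory CSV standing in for train.csv (hypothetical rows)
csv_text = "id,date,sales\n1,2021-07-29,26.0\n2,2021-07-30,37.0\n"

# parse_dates converts the listed columns to datetime64 while reading
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(toy.dtypes)
```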

## Data sets description

In [4]:
train = train.replace(
    ('Москва', 'Санкт-Петербург', 'Краснодар', 'Самара','Нижний Новгород', 'Ростов-на-Дону', 'Волгоград', 'Воронеж', 'Казань', 'Екатеринбург'),
    ("Moscow", "St.Petersburg", "Krasnodar", "Samara", "Nizhny.Novgorod", "Rostov-on-Don", "Volgograd", "Voronezh", "Kazan", "Yekaterinburg")  )

In [5]:
pd.unique(train["city_name"])
# pd.unique(train["store_id"])
# could plot the number of stores in each city

array(['Moscow', 'St.Petersburg', 'Krasnodar', 'Samara',
       'Nizhny.Novgorod', 'Rostov-on-Don', 'Volgograd', 'Voronezh',
       'Kazan', 'Yekaterinburg'], dtype=object)
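
The renaming above is applied to train only, while test.head() further down still shows Russian city names. One way to keep both frames consistent is a single dictionary applied to each. This is a sketch on toy frames, not the notebook's original code; the English spellings are copied from the cell above.

```python
import pandas as pd

# Russian -> English city names, spelled as in the replace call above
city_map = {
    "Москва": "Moscow",
    "Санкт-Петербург": "St.Petersburg",
    "Краснодар": "Krasnodar",
    "Самара": "Samara",
    "Нижний Новгород": "Nizhny.Novgorod",
    "Ростов-на-Дону": "Rostov-on-Don",
    "Волгоград": "Volgograd",
    "Воронеж": "Voronezh",
    "Казань": "Kazan",
    "Екатеринбург": "Yekaterinburg",
}

# Toy frames standing in for train and test
toy_train = pd.DataFrame({"city_name": ["Москва", "Казань"]})
toy_test = pd.DataFrame({"city_name": ["Москва", "Воронеж"]})

# Apply the same mapping to both frames so the categories agree
toy_train["city_name"] = toy_train["city_name"].replace(city_map)
toy_test["city_name"] = toy_test["city_name"].replace(city_map)

print(toy_train["city_name"].tolist(), toy_test["city_name"].tolist())
```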

In [3]:
train.shape

(666676, 12)

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28489 entries, 0 to 28488
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            28489 non-null  int64  
 1   date          28488 non-null  object 
 2   city_name     28488 non-null  object 
 3   store_id      28488 non-null  float64
 4   category_id   28488 non-null  float64
 5   product_id    28488 non-null  float64
 6   price         28488 non-null  float64
 7   weather_desc  28488 non-null  object 
 8   humidity      28488 non-null  float64
 9   temperature   28488 non-null  float64
 10  pressure      28488 non-null  float64
 11  sales         28488 non-null  float64
dtypes: float64(8), int64(1), object(3)
memory usage: 2.6+ MB


In [None]:
train.head()

|  | id | date | city_name | store_id | category_id | product_id | price | weather_desc | humidity | temperature | pressure | sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2021-07-29 | Moscow | 1.0 | 1.0 | 1.0 | 4.79 | переменная облачность, небольшой дождь | 61.9375 | 23.1875 | 741.0 | 26.0 |
| 1 | 2 | 2021-07-30 | Moscow | 1.0 | 1.0 | 1.0 | 4.79 | переменная облачность, небольшой дождь | 70.25 | 22.1875 | 740.3125 | 37.0 |
| 2 | 3 | 2021-07-31 | Moscow | 1.0 | 1.0 | 1.0 | 4.79 | переменная облачность | 52.625 | 21.8125 | 741.625 | 25.0 |
| 3 | 4 | 2021-08-01 | Moscow | 1.0 | 1.0 | 1.0 | 4.79 | облачно, небольшой дождь | 87.4375 | 20.0625 | 743.3125 | 26.0 |
| 4 | 5 | 2021-08-02 | Moscow | 1.0 | 1.0 | 1.0 | 4.79 | переменная облачность | 66.1875 | 23.4375 | 739.625 | 22.0 |


In [None]:
train.describe()

|  | id | store_id | category_id | product_id | price | humidity | temperature | pressure | sales |
|---|---|---|---|---|---|---|---|---|---|
| count | 28489.0 | 28488.0 | 28488.0 | 28488.0 | 28488.0 | 28488.0 | 28488.0 | 28488.0 | 28488.0 |
| mean | 14244.00007 | 3.615768 | 2.398238 | 17.317397 | 5.104391 | 73.685486 | 7.737734 | 754.513429 | 12.057217 |
| std | 8224.210124 | 1.955576 | 1.902261 | 10.883911 | 3.345185 | 16.58221 | 10.770423 | 9.361326 | 15.567595 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.93 | 27.125 | -18.3125 | 718.0625 | 0.0 |
| 25% | 7122.0 | 2.0 | 1.0 | 9.0 | 3.0 | 61.5625 | 0.375 | 748.3125 | 3.0 |
| 50% | 14244.0 | 3.0 | 1.0 | 16.0 | 4.09 | 73.0625 | 7.6875 | 756.0 | 7.0 |
| 75% | 21366.0 | 5.0 | 4.0 | 28.0 | 6.2 | 89.3125 | 15.0625 | 761.0 | 15.0 |
| max | 28488.0 | 7.0 | 8.0 | 35.0 | 18.63 | 100.0 | 33.25 | 779.0 | 169.0 |


In [None]:
test.shape

(24836, 11)

In [None]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24836 entries, 0 to 24835
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            24836 non-null  int64  
 1   date          24836 non-null  object 
 2   city_name     24836 non-null  object 
 3   store_id      24836 non-null  int64  
 4   category_id   24836 non-null  int64  
 5   product_id    24836 non-null  int64  
 6   price         24836 non-null  float64
 7   weather_desc  24836 non-null  object 
 8   humidity      24836 non-null  float64
 9   temperature   24836 non-null  float64
 10  pressure      24836 non-null  float64
dtypes: float64(4), int64(4), object(3)
memory usage: 2.1+ MB


In [None]:
test.head()

|  | id | date | city_name | store_id | category_id | product_id | price | weather_desc | humidity | temperature | pressure |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 666677 | 2022-02-14 | Москва | 1 | 1 | 1 | 4.79 | облачно | 87.3125 | -1.9375 | 749.3125 |
| 1 | 666678 | 2022-02-15 | Москва | 1 | 1 | 1 | 4.79 | переменная облачность | 88.75 | -1.25 | 752.6875 |
| 2 | 666679 | 2022-02-16 | Москва | 1 | 1 | 1 | 4.79 | переменная облачность | 90.375 | -1.5625 | 746.3125 |
| 3 | 666680 | 2022-02-17 | Москва | 1 | 1 | 1 | 4.79 | облачно, небольшой дождь | 98.0 | 1.75 | 732.6875 |
| 4 | 666681 | 2022-02-18 | Москва | 1 | 1 | 1 | 4.79 | облачно, небольшие осадки | 95.5 | 1.375 | 733.0 |


In [None]:
test.describe()

|  | id | store_id | category_id | product_id | price | humidity | temperature | pressure |
|---|---|---|---|---|---|---|---|---|
| count | 24836.0 | 24836.0 | 24836.0 | 24836.0 | 24836.0 | 24836.0 | 24836.0 | 24836.0 |
| mean | 679094.5 | 78.053551 | 2.375423 | 17.8323 | 5.201144 | 87.285168 | -0.751719 | 747.908286 |
| std | 7169.679979 | 45.689019 | 1.876578 | 10.826993 | 3.491933 | 9.839292 | 4.059063 | 9.743387 |
| min | 666677.0 | 1.0 | 1.0 | 1.0 | 1.93 | 55.875 | -10.5 | 730.3125 |
| 25% | 672885.75 | 40.0 | 1.0 | 9.0 | 3.0 | 84.8125 | -3.625 | 740.0 |
| 50% | 679094.5 | 76.0 | 1.0 | 17.0 | 4.09 | 89.9375 | -0.3125 | 748.9375 |
| 75% | 685303.25 | 117.0 | 4.0 | 28.0 | 6.02 | 94.3125 | 1.75 | 754.6875 |
| max | 691512.0 | 164.0 | 9.0 | 35.0 | 18.63 | 98.625 | 9.0625 | 769.0 |


In [137]:
# plot of orders by city
group = train[['date', 'city_name', 'sales']].groupby(['date', 'city_name'], as_index=False).sum()
fig = px.line(group, x="date", y="sales", color='city_name', template='plotly_dark')
fig.update_layout(
    title={
        'text': "Orders in each city",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show() 
# could also add how orders changed from month to month and over the week
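
The weekly and monthly breakdown mentioned in the comment could be computed roughly as follows. This is a sketch on a toy frame with made-up values, assuming date is first converted to a datetime type; the resulting series could then be passed to px.bar or px.line.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": ["2021-07-29", "2021-07-30", "2021-08-02", "2021-08-05"],
    "sales": [26.0, 37.0, 22.0, 10.0],
})
toy["date"] = pd.to_datetime(toy["date"])

# Total sales per weekday (Monday=0 ... Sunday=6)
by_weekday = toy.groupby(toy["date"].dt.dayofweek)["sales"].sum()
# Total sales per calendar month (1-12)
by_month = toy.groupby(toy["date"].dt.month)["sales"].sum()

print(by_weekday)
print(by_month)
```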

## Building regression model

In [4]:
# make 4 subsets for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    train.iloc[:,0:11],
    train.iloc[:,11],
    train_size = 0.8, 
    test_size = 0.2,
    random_state = 2022,
    shuffle = True)
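
Because the goal is to forecast the following week, a chronological holdout may give a more realistic error estimate than a shuffled split, since the validation rows then lie strictly after the training rows. The sketch below illustrates the idea on a toy frame with made-up values; it is an alternative under that assumption, not the split used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.date_range("2021-07-29", periods=10, freq="D"),
    "sales": [26, 37, 25, 26, 22, 30, 28, 31, 27, 24],
})

# Hold out the last 7 days as the validation period
cutoff = toy["date"].max() - pd.Timedelta(days=7)
fit_part = toy[toy["date"] <= cutoff]
valid_part = toy[toy["date"] > cutoff]

print(len(fit_part), len(valid_part))
```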


In [5]:
# Linear model

model = LinearRegression()
model.fit(X_train.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False), # keep numeric columns only (drop string columns) and drop rows with NA
          y_train.dropna(axis=0, how='any', inplace=False)) # drop rows with NA

forecast_lm = model.predict(X_test.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False))
mae = mean_absolute_error(y_test, forecast_lm)

print(f"Linear model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"Linear model: MAE = {mae:.2f} > 4.10 ")


Linear model: MAE = 8.70 > 4.10 
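
The cell above calls dropna separately on the features and on the target. That works only when missing values fall in the same rows of both; otherwise the two can end up with different lengths or misaligned rows. A safer pattern is to drop incomplete rows once, before separating X and y. A minimal sketch on a toy frame, where feature_cols is a hypothetical subset of the numeric columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "price": [4.79, np.nan, 3.00, 6.20],
    "temperature": [23.2, 22.2, 21.8, 20.1],
    "sales": [26.0, 37.0, np.nan, 26.0],
})

feature_cols = ["price", "temperature"]  # hypothetical feature subset

# Drop a row if any feature or the target is missing, then split X and y
clean = toy.dropna(subset=feature_cols + ["sales"])
X, y = clean[feature_cols], clean["sales"]

print(X.index.tolist(), y.index.tolist())
```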


In [6]:
# TreeClassifier model
tree_clf = DecisionTreeClassifier().fit(X_train.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False), 
                                        y_train.dropna(axis=0, how='any', inplace=False))
forecast_tree = tree_clf.predict(X_test.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False))
mae = mean_absolute_error(y_test, forecast_tree)
print(f"TreeClassifier model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"TreeClassifier model: MAE = {mae:.2f} > 4.10 ")

TreeClassifier model: MAE = 5.24 > 4.10 


In [8]:
# KNN model
knn_clf = KNeighborsClassifier().fit(X_train.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False), 
                                        y_train.dropna(axis=0, how='any', inplace=False))
forecast_knn = knn_clf.predict(X_test.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False))
mae = mean_absolute_error(y_test, forecast_knn)
print(f"KNN model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"KNN model: MAE = {mae:.2f} > 4.10 ")

KNN model: MAE = 5.18 > 4.10 
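
The tree and KNN models above are classifiers, so each distinct sales count is treated as a separate class. Since sales is a numeric target, the regressor counterparts from scikit-learn are a natural alternative to try. The sketch below fits them on synthetic data (not the Delivery Club data), so its scores say nothing about how they would perform here.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2022)
X = rng.uniform(1.0, 20.0, size=(200, 2))              # synthetic features
y = np.round(3.0 * X[:, 0] + rng.normal(0, 1, 200))    # synthetic sales counts

# Simple holdout: first 160 rows for fitting, last 40 for validation
X_fit, X_val, y_fit, y_val = X[:160], X[160:], y[:160], y[160:]

tree_reg = DecisionTreeRegressor(random_state=2022).fit(X_fit, y_fit)
knn_reg = KNeighborsRegressor(n_neighbors=5).fit(X_fit, y_fit)

mae_tree = mean_absolute_error(y_val, tree_reg.predict(X_val))
mae_knn = mean_absolute_error(y_val, knn_reg.predict(X_val))

print(f"tree MAE = {mae_tree:.2f}, knn MAE = {mae_knn:.2f}")
```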


In [9]:
# Average of the KNN and TreeClassifier forecasts
sum_of_voices = (forecast_knn + forecast_tree)/2
mae = mean_absolute_error(y_test, sum_of_voices)
print(f"KNN + TreeClassifier average: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"KNN + TreeClassifier average: MAE = {mae:.2f} > 4.10 ")

KNN + TreeClassifier average: MAE = 4.49 > 4.10 
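
The averaged forecast scores 4.49, better than either model alone (5.18 and 5.24). One plausible reason is that the two models err in different directions on some rows, so their errors partly cancel. The toy numbers below are made up to show the effect; in general the MAE of an average is never larger than the average of the individual MAEs, although it is not guaranteed to beat the better model.

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0])
pred_a = np.array([12.0, 18.0, 33.0])  # errors: +2, -2, +3
pred_b = np.array([8.0, 23.0, 29.0])   # errors: -2, +3, -1


def mae(y, p):
    """Mean absolute error of forecast p against actual values y."""
    return float(np.mean(np.abs(y - p)))


avg = (pred_a + pred_b) / 2  # [10.0, 20.5, 31.0], errors: 0, +0.5, +1

print(mae(y_true, pred_a), mae(y_true, pred_b), mae(y_true, avg))
```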


In [10]:
# RandomForestClassifier model
rf_model = RandomForestClassifier(n_estimators = 4)
rf_model.fit(X_train.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False), 
                                        y_train.dropna(axis=0, how='any', inplace=False))
forecast = rf_model.predict(X_test.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False))
mae = mean_absolute_error(y_test, forecast)
print(f"RandomForestClassifier model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"RandomForestClassifier model: MAE = {mae:.2f} > 4.10 ")

RandomForestClassifier model: MAE = 5.81 > 4.10 


## Results

In [None]:
# df.to_csv("prediction.csv")
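
The commented line refers to a df that is not defined in this notebook. A prediction file could be assembled along the following lines. This is a sketch only: the two-column id, sales layout and the clipping at zero are assumptions, and the forecast values are made up.

```python
import numpy as np
import pandas as pd

toy_test = pd.DataFrame({"id": [666677, 666678, 666679]})
toy_forecast = np.array([12.4, 7.0, -0.3])  # hypothetical model output

submission = pd.DataFrame({
    "id": toy_test["id"],
    # sales cannot be negative, so clip forecasts at zero
    "sales": np.clip(toy_forecast, 0, None),
})

# With no path, to_csv returns the CSV text; pass "prediction.csv" to write a file
csv_text = submission.to_csv(index=False)
print(csv_text)
```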