# Sales Forecasting




In this project I build a machine learning model that predicts next week's number of orders for each product in each store on the Delivery Club platform. The data set describes sales of goods in Delivery Club stores in 10 Russian cities (Moscow, St. Petersburg, Krasnodar, Samara, Nizhny Novgorod, Rostov-on-Don, Volgograd, Voronezh, Kazan, Yekaterinburg).

Two data sets (train.csv and test.csv) are used to train a regression model and evaluate its accuracy. Both data sets share the same variables, except that `sales` is present only in train.csv:
- id - unique identifier of the (product_id, store_id, date) combination. Each such triple has exactly one id, and it never repeats in the data
- date - date of the sale
- city_name - name of the city where the sale took place
- store_id - unique identifier of the store
- category_id - category of the product sold
- product_id - unique identifier of the product
- price - price of the product
- weather_desc - short description of the weather in that city on the day of the sale
- humidity - humidity in that city on the day of the sale
- temperature - temperature in that city on the day of the sale
- pressure - atmospheric pressure in that city on the day of the sale
- sales - number of sales of the product (this is the value to predict)

- MAE - mean absolute error; it shows by how many orders the forecast is off on average. The metric is easy to interpret: a value of 5, for example, means the model is off by 5 orders on average for each product in each store on each day. The metric is never negative because all errors are taken in absolute value, and it equals 0 for a perfect model. It is also less sensitive to outliers than squared-error metrics.

$$MAE = \frac1N \sum ^{N}_{i=1} |y_i-\hat y_i|$$
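As a quick check of the formula, the sketch below computes MAE by hand with NumPy. The sales figures and forecasts are made-up illustrative numbers, not values from the data set; `mean_absolute_error` from scikit-learn should return the same value.

```python
import numpy as np

# Hypothetical daily sales for one product and a model's forecasts
y_true = np.array([26, 37, 25, 26, 22])
y_pred = np.array([24, 30, 27, 26, 25])

# MAE straight from the formula: the mean of the absolute differences
mae_manual = np.mean(np.abs(y_true - y_pred))

print(f"MAE = {mae_manual:.2f}")
```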

In [2]:
# Import libraries

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
import plotly.express as px
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from google.colab import output
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from pandas import DatetimeIndex as dt
from sklearn.preprocessing import StandardScaler
from google.colab import files

In [228]:
# Download data from Github

!wget --no-cache --backups=1 {"https://raw.githubusercontent.com/KonstantinBurkin/Machine_Learning_Project/main/train.csv"}
!wget --no-cache --backups=1 {"https://raw.githubusercontent.com/KonstantinBurkin/Machine_Learning_Project/main/test.csv" }
output.clear()

## Prediction

In [282]:
# Load the data sets

train = pd.read_csv("train.csv")      # load train data
test = pd.read_csv("test.csv")        # load test data
df = pd.concat([train,test], axis=0)  # stack both data sets so features are built the same way for each

In [283]:
df.date = pd.to_datetime(df.date)                                # convert date column to datetime
df = df.assign(dayofweek=df.date.dt.dayofweek)                   # day of week column (Monday = 0)
df = df.assign(weekend=(df.date.dt.dayofweek > 4).astype(int))   # weekend column: 1 for Saturday and Sunday, else 0

In [284]:
# add day_product_mean: mean sales of each product_id in each store_id for each day of the week

group = df.groupby(['product_id', 'store_id', 'dayofweek'])[['sales']].mean().reset_index()
group.rename(columns={'sales':'day_product_mean'}, inplace=True)
df = pd.merge(df, group, how="left", on=['product_id', 'store_id', 'dayofweek'])

In [285]:
# add lag_day_7 ... lag_day_14: sales of each product_id in each store_id 7 to 14 days (1-2 weeks) earlier
# note: shift() below runs over the whole sorted frame, not per (product_id, store_id),
# so the first rows of each series take their lags from the end of the previous series

group = df.groupby(['product_id', 'store_id', 'date', ])[['sales']].sum().reset_index()
group.dropna(inplace=True)                                                         # drop data with unknown sales
for i in range(7, 15):
    group[f'lag_day_{i}'] = group['sales'].shift(i)                                # add lags for 7-14 days
group.drop(['sales'], axis=1, inplace=True)
group.dropna(inplace=True)                                                         # drop the first 14 rows, which lack complete lags

# merge the lags back into the original dataframe
df = pd.merge(df, group, how="left", on=['product_id', 'store_id', 'date'])        # the first 14 rows get NA because no lags exist for them
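A lag that stays inside one (product_id, store_id) series can be built with `groupby(...).shift(...)`. The sketch below is an illustration on a hypothetical toy frame (`toy` and its values are assumptions), not the code that produced the results in this notebook. It also assumes one row per day in each series, so that a shift by `i` rows equals a lag of `i` days.

```python
import pandas as pd

# Toy data: two (product_id, store_id) series with three days each
toy = pd.DataFrame({
    "product_id": [1, 1, 1, 2, 2, 2],
    "store_id": [1, 1, 1, 1, 1, 1],
    "date": pd.to_datetime(["2021-07-29", "2021-07-30", "2021-07-31"] * 2),
    "sales": [26, 37, 25, 5, 7, 6],
})

toy = toy.sort_values(["product_id", "store_id", "date"]).reset_index(drop=True)

# Shift inside each series so a lag never crosses into another product or store
toy["lag_day_1"] = toy.groupby(["product_id", "store_id"])["sales"].shift(1)

print(toy)
```

With this approach every series starts with missing lags, so those rows would have to be dropped or imputed per series rather than only at the top of the frame.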

In [286]:
df.drop(index=df.index[:14], axis=0, inplace=True)    # drop the first 14 rows (two weeks), which have no lag values

In [287]:
# df.isna().sum()

In [288]:
# df.dayofweek = df.dayofweek.astype(str)            # convert dayofweek column to string format
# df.category_id = df.category_id.astype(str)
# df.product_id = df.product_id.astype(str)
# df.store_id = df.store_id.astype(str)
df.price = df.price.astype(str)                      # treat price as categorical so get_dummies one-hot encodes it

df.drop(labels=["weather_desc"], axis=1, inplace=True)
df = pd.get_dummies(df)                            # convert string columns to binary columns

In [289]:
sales = df.sales.dropna()                            # extract sales column = y    

In [290]:
df_test = df[df.id > 666676].drop(labels=["sales", "date"], axis=1, inplace=False)  # test rows (id 666677 and above): drop the date column and the target
df_train = df[df.id < 666677].drop(labels=["sales", "date"], axis=1, inplace=False)  # train rows (id up to 666676): drop the date column and the target

In [291]:
df_train.head()

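| | id | store_id | category_id | product_id | humidity | temperature | pressure | dayofweek | weekend | day_product_mean | ... | price_4.18 | price_4.64 | price_4.79 | price_5.9 | price_6.02 | price_6.2 | price_6.58 | price_7.68 | price_7.78 | price_8.15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | 15 | 1 | 1 | 1 | 64.875 | 21.5 | 746.0 | 3 | 0 | 19.344828 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15 | 16 | 1 | 1 | 1 | 75.4375 | 17.125 | 748.0 | 4 | 0 | 24.862069 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 16 | 17 | 1 | 1 | 1 | 55.5 | 19.8125 | 751.3125 | 5 | 1 | 25.758621 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17 | 18 | 1 | 1 | 1 | 58.5625 | 22.1875 | 746.625 | 6 | 1 | 25.551724 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 18 | 19 | 1 | 1 | 1 | 53.0625 | 22.8125 | 747.3125 | 0 | 0 | 16.178571 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |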


In [298]:
# split features and target into train and validation subsets (only 0.01% of rows are held out)
X_train, X_test, y_train, y_test = train_test_split(
    df_train,
    sales,
    train_size = 0.9999, 
    test_size = 0.0001,
    shuffle = True)
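Because the task is to forecast the following week, a validation set made of the last week of data may mirror it more closely than a shuffled split. The sketch below shows one way to do that on a hypothetical toy frame (`toy`, `train_part` and `valid_part` are illustrative names). In this notebook it would have to run before the `date` column is dropped.

```python
import pandas as pd

# Toy frame standing in for the training data: 28 consecutive days
toy = pd.DataFrame({
    "date": pd.date_range("2022-01-17", periods=28, freq="D"),
    "sales": range(28),
})

# Hold out the final 7 days as the validation period
cutoff = toy["date"].max() - pd.Timedelta(days=7)
train_part = toy[toy["date"] <= cutoff]
valid_part = toy[toy["date"] > cutoff]

print(len(train_part), len(valid_part))
```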


In [299]:
# Linear model

model = LinearRegression()
model.fit(X_train, y_train)

forecast_lm = model.predict(X_test)
mae = mean_absolute_error(y_test, forecast_lm)

print(f"Linear model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"Linear model: MAE = {mae:.2f} > 4.10 ")

Linear model: MAE = 3.58 < 4.10 


In [295]:
# TreeClassifier model (a classifier treats every distinct sales value as a separate class)
tree_clf = DecisionTreeClassifier(max_depth=10).fit(X_train, y_train)
forecast_tree = tree_clf.predict(X_test)
mae = mean_absolute_error(y_test, forecast_tree)
print(f"TreeClassifier model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"TreeClassifier model: MAE = {mae:.2f} > 4.10 ")


TreeClassifier model: MAE = 4.48 > 4.10 
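Since sales is a numeric target, a regression tree is the more usual choice than a classification tree. The sketch below fits `DecisionTreeRegressor` on synthetic data; the data, the slice-based split and the variable names are assumptions for illustration, and the resulting MAE says nothing about how the model would score on the Delivery Club data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in: one feature and a count-like target
X = rng.uniform(0, 10, size=(200, 1))
y = np.round(3 * X[:, 0] + rng.normal(0, 1, size=200))

X_fit, X_val = X[:150], X[150:]
y_fit, y_val = y[:150], y[150:]

# A regression tree predicts the mean target of the training rows in each leaf
tree_reg = DecisionTreeRegressor(max_depth=10, random_state=0).fit(X_fit, y_fit)
forecast_tree_reg = tree_reg.predict(X_val)

mae_tree_reg = mean_absolute_error(y_val, forecast_tree_reg)
print(f"DecisionTreeRegressor: MAE = {mae_tree_reg:.2f}")
```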


In [275]:
# KNN model
# the number of neighbours still has to be tuned
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)
forecast_knn = knn_clf.predict(X_test)
mae = mean_absolute_error(y_test, forecast_knn)
print(f"KNN model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"KNN model: MAE = {mae:.2f} > 4.10 ")

KNN model: MAE = 6.28 > 4.10 


In [276]:
# Average of KNN, Linear model and TreeClassifier
sum_of_voices = (forecast_knn + forecast_tree + forecast_lm)/3
mae = mean_absolute_error(y_test, sum_of_voices)
print(f"Ensemble model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"Ensemble model: MAE = {mae:.2f} > 4.10 ")

Ensemble model: MAE = 4.30 > 4.10 
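The same kind of averaging can be expressed with scikit-learn's `VotingRegressor`, which fits each member and returns the mean of their predictions. The sketch below uses synthetic data and regressor versions of the three models; both choices are assumptions for illustration rather than the setup used above.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data: two features, noisy linear target
X = rng.uniform(0, 10, size=(200, 2))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(0, 1, size=200)

members = [
    ("lm", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=10, random_state=0)),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
]

# With no weights given, the ensemble prediction is the plain average of the members
ensemble = VotingRegressor(members).fit(X[:150], y[:150])
forecast_avg = ensemble.predict(X[150:])

# The same average computed by hand from the fitted members
manual_avg = np.mean([m.predict(X[150:]) for m in ensemble.estimators_], axis=0)
```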


In [271]:
# RandomForestClassifier model
rf_model = RandomForestClassifier(n_estimators = 4, min_samples_split=250)
rf_model.fit(X_train, y_train)
forecast = rf_model.predict(X_test)
mae = mean_absolute_error(y_test, forecast)
print(f"RandomForest model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"RandomForest model: MAE = {mae:.2f} > 4.10 ")

RandomForest model: MAE = 3.21 < 4.10 


In [289]:
# TreeClassifier model for the actual test data (gave a poor MAE)
tree_clf = DecisionTreeClassifier(max_depth=12).fit(df_train, sales)
forecast = tree_clf.predict(df_test)

forecast1 = pd.DataFrame(forecast, columns = ['prediction'])
id = pd.DataFrame(df_test.id, columns = ['id'])

forecast1.reset_index(drop=True, inplace=True)
id.reset_index(drop=True, inplace=True)

result = pd.concat([id, forecast1], axis=1)

In [300]:
# Linear  model for actual test data

model = LinearRegression()
model.fit(df_train, sales)

forecast = model.predict(df_test)

forecast1 = pd.DataFrame(forecast, columns = ['prediction'])
id = pd.DataFrame(df_test.id, columns = ['id'])

forecast1.reset_index(drop=True, inplace=True)
id.reset_index(drop=True, inplace=True)

result = pd.concat([id, forecast1], axis=1)

In [None]:
# KNN model for actual test data

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(df_train, sales)
forecast_knn = knn_clf.predict(df_test)

forecast1 = pd.DataFrame(forecast_knn, columns = ['prediction'])
id = pd.DataFrame(df_test.id, columns = ['id'])

forecast1.reset_index(drop=True, inplace=True)
id.reset_index(drop=True, inplace=True)

result = pd.concat([id, forecast1], axis=1)

In [212]:
# Average of KNN, Linear model and TreeClassifier

tree_clf = DecisionTreeClassifier(min_samples_split=50).fit(df_train, sales)
forecast_tree = tree_clf.predict(df_test)

model = LinearRegression()
model.fit(df_train, sales)
forecast_lm = model.predict(df_test)

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(df_train, sales)
forecast_knn = knn_clf.predict(df_test)


sum_of_voices = (forecast_knn + forecast_tree + forecast_lm)/3

forecast1 = pd.DataFrame(sum_of_voices, columns = ['prediction'])
id = pd.DataFrame(df_test.id, columns = ['id'])

forecast1.reset_index(drop=True, inplace=True)
id.reset_index(drop=True, inplace=True)

result = pd.concat([id, forecast1], axis=1)
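The submission-building steps repeated in the cells above could be collected into one helper. The sketch below is a hypothetical refactoring (`make_submission` is not part of the original notebook); it pairs ids with predictions by position, which is what the `reset_index` calls above achieve.

```python
import numpy as np
import pandas as pd


def make_submission(ids, predictions):
    """Pair each test id with its prediction by position, ignoring the original index."""
    return pd.DataFrame({
        "id": np.asarray(ids),
        "prediction": np.asarray(predictions),
    })


# Hypothetical ids carrying a non-default index, plus made-up forecasts
ids = pd.Series([666677, 666678, 666679], index=[100, 101, 102])
preds = np.array([23.3, 23.1, 24.2])

result_demo = make_submission(ids, preds)
print(result_demo)
```

Using a helper also avoids naming a variable `id`, which shadows the Python built-in of the same name.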

In [301]:
result.head()

| | id | prediction |
|---|---|---|
| 0 | 666677 | 23.300873 |
| 1 | 666678 | 23.107509 |
| 2 | 666679 | 24.155128 |
| 3 | 666680 | 25.67957 |
| 4 | 666681 | 32.695642 |


In [302]:
result.to_csv("prediction.csv", index=False)
files.download("prediction.csv")

## Building ML models

ML models:

- k-nearest neighbors (KNN), a supervised learning method - classification and regression
- Linear model - regression
- Decision tree - classification and regression
- Ensemble
  - Average of different models
  - Random forest
  - Gradient Boosting

In [None]:
# build date features and one-hot encode the training data
train.date = pd.to_datetime(train.date)
train = train.assign(dayofweek=train.date.dt.dayofweek)
train = train.assign(weekend=(train.date.dt.dayofweek > 4).astype(int))
train.product_id = train.product_id.astype(str)
train.store_id = train.store_id.astype(str)
train = pd.get_dummies(train)

In [None]:
# add average of sales for each day of the week
z = []
for i in range(7):
  z.append(train[train.dayofweek == i].sales.mean())

day_sales_mean = {'day_sales_mean': z, 'dayofweek': [0,1,2,3,4,5,6]}
day_sales_mean = pd.DataFrame(day_sales_mean)
train = pd.merge(train, day_sales_mean, on="dayofweek")
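The loop and merge above can also be written as a single `groupby(...).transform("mean")`, which returns the group mean aligned to every row. The sketch below shows this on a hypothetical toy frame with made-up values.

```python
import pandas as pd

# Toy data: sales on three different days of the week
toy = pd.DataFrame({
    "dayofweek": [0, 0, 1, 1, 2],
    "sales": [10, 20, 5, 15, 7],
})

# Mean sales per day of week, broadcast back to each row
toy["day_sales_mean"] = toy.groupby("dayofweek")["sales"].transform("mean")

print(toy)
```

Unlike a merge, `transform` keeps the original row order, which matters if lags are later computed with `shift`.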

In [None]:
train.dayofweek = train.dayofweek.astype(str)
train.category_id = train.category_id.astype(str)
train.price = train.price.astype(str)
train = pd.get_dummies(train)

In [None]:
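# add lags
# note: shift(i) moves sales by i rows, so lag_day_7, lag_day_14 and lag_day_21 actually hold
# the sales from 1, 2 and 3 rows earlier; shift(i*7) within each product and store series
# would give true weekly lags
for i in (1, 2, 3):
    train[f'lag_day_{i*7}'] = train['sales'].shift(i)
train = train.dropna()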

In [None]:
sales = train.sales
train.drop(labels=["sales", "date"], axis=1, inplace=True) #, "id"

In [None]:
# print(sales.shape, train.shape)

In [None]:
# scaler = StandardScaler()
# train.iloc[:,[1,2,3,4]] = scaler.fit_transform(train.iloc[:,[1,2,3,4]])

In [None]:
# split features and target into train and validation subsets (only 0.01% of rows are held out)
X_train, X_test, y_train, y_test = train_test_split(
    train,
    sales,
    train_size = 0.9999, 
    test_size = 0.0001,
    shuffle = True)


In [None]:
# Linear model

model = LinearRegression()
model.fit(X_train, y_train)

forecast_lm = model.predict(X_test)
mae = mean_absolute_error(y_test, forecast_lm)

print(f"Linear model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"Linear model: MAE = {mae:.2f} > 4.10 ")

Linear model: MAE = 4.02 < 4.10 


In [None]:
# TreeClassifier model 
tree_clf = DecisionTreeClassifier().fit(X_train, y_train)
forecast_tree = tree_clf.predict(X_test)
mae = mean_absolute_error(y_test, forecast_tree)
print(f"TreeClassifier model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"TreeClassifier model: MAE = {mae:.2f} > 4.10 ")


TreeClassifier model: MAE = 5.67 > 4.10 


In [None]:
# Average of Linear regression and TreeClassifier
sum_of_voices = (forecast_lm + forecast_tree)/2
mae = mean_absolute_error(y_test, sum_of_voices)
print(f"Ensemble model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"Ensemble model: MAE = {mae:.2f} > 4.10 ")

Ensemble model: MAE = 4.35 > 4.10 


## old code

In [None]:
# Linear model - regression

model = LinearRegression()
model.fit(X_train.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False), # drop character-type columns and NA
          y_train.dropna(axis=0, how='any', inplace=False)) # drop NA

forecast_lm = model.predict(X_test.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False))
mae = mean_absolute_error(y_test, forecast_lm)

print(f"Linear model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"Linear model: MAE = {mae:.2f} > 4.10 ")


Linear model: MAE = 7.57 > 4.10 


In [None]:
# TreeClassifier model - classification
tree_clf = DecisionTreeClassifier().fit(X_train.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False), 
                                        y_train.dropna(axis=0, how='any', inplace=False))
forecast_tree = tree_clf.predict(X_test.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False))
mae = mean_absolute_error(y_test, forecast_tree)
print(f"TreeClassifier model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"TreeClassifier model: MAE = {mae:.2f} > 4.10 ")

TreeClassifier model: MAE = 4.32 > 4.10 


In [None]:
# KNN model
# the number of neighbours still has to be tuned
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False), y_train.dropna(axis=0, how='any', inplace=False))
forecast_knn = knn_clf.predict(X_test.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False))
mae = mean_absolute_error(y_test, forecast_knn)
print(f"KNN model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"KNN model: MAE = {mae:.2f} > 4.10 ")

KNN model: MAE = 0.35 < 4.10 


In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV

In [None]:
# https://chrisalbon.com/code/machine_learning/nearest_neighbors/identifying_best_value_of_k/
y = y_train.dropna(axis=0, how='any', inplace=False)
# Create standardizer
standardizer = StandardScaler()

# Standardize features
X_std = standardizer.fit_transform(X_train.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False))

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean', n_jobs=-1).fit(X_std, y)

# Create a pipeline
pipe = Pipeline([('standardizer', standardizer), ('knn', knn)])

# Create space of candidate values
search_space = [{'knn__n_neighbors': list(range(3,30,4))}]
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0).fit(X_std, y)
# Best neighborhood size (k)
clf.best_estimator_.get_params()['knn__n_neighbors']



3

In [None]:
# mean_absolute_error
from sklearn.metrics import make_scorer
custom_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
y = y_train.dropna(axis=0, how='any', inplace=False)
# Create standardizer
standardizer = StandardScaler()

# Standardize features
X_std = standardizer.fit_transform(X_train.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False))

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean', n_jobs=-1).fit(X_std, y)

# Create a pipeline
pipe = Pipeline([('standardizer', standardizer), ('knn', knn)])

# Create space of candidate values
search_space = [{'knn__n_neighbors': list(range(3,30,4))}]
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0, scoring=custom_scorer).fit(X_std, y)
# Best neighborhood size (k)
clf.best_estimator_.get_params()['knn__n_neighbors']

# took about 8 minutes to run; the result is 3



3
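When the scaler and the model sit in one unfitted `Pipeline`, `GridSearchCV` refits the scaler on each training fold, so the features do not need to be standardized beforehand. The sketch below tunes `n_neighbors` this way on synthetic data, using `KNeighborsRegressor` and the built-in `"neg_mean_absolute_error"` scorer. The data and the choice of a regressor are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: three features, noisy target
X = rng.normal(size=(120, 3))
y = 2 * X[:, 0] + rng.normal(0, 0.5, size=120)

# Unfitted steps: the search fits them on each cross-validation fold
pipe = Pipeline([
    ("standardizer", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])

# Same candidate values as in the cells above
search_space = {"knn__n_neighbors": list(range(3, 30, 4))}
search = GridSearchCV(pipe, search_space, cv=5, scoring="neg_mean_absolute_error")
search.fit(X, y)

best_k = search.best_params_["knn__n_neighbors"]
best_mae = -search.best_score_  # the scorer is negated, so flip the sign back
print(best_k, round(best_mae, 3))
```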

In [None]:
# RandomForestClassifier model
rf_model = RandomForestClassifier(n_estimators = 4)
rf_model.fit(X_train.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False), 
                                        y_train.dropna(axis=0, how='any', inplace=False))
forecast = rf_model.predict(X_test.iloc[:,[0,3,4,5,6,8,9,10]].dropna(axis=0, how='any', subset=None, inplace=False))
mae = mean_absolute_error(y_test, forecast)
print(f"RandomForest model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"RandomForest model: MAE = {mae:.2f} > 4.10 ")

In [None]:
# Gradient Boosting
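The gradient boosting cell was left empty in the notebook. As a possible starting point, the sketch below fits scikit-learn's `GradientBoostingRegressor` on synthetic data. The data and the hyperparameters are assumptions, not tuned settings, and no result for the real data set is implied.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Synthetic stand-in data: two features, non-linear target
X = rng.uniform(0, 10, size=(300, 2))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.5, size=300)

X_fit, X_val = X[:240], X[240:]
y_fit, y_val = y[:240], y[240:]

# Each new tree is fitted to the errors left by the trees before it
gb_model = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
gb_model.fit(X_fit, y_fit)

forecast_gb = gb_model.predict(X_val)
mae_gb = mean_absolute_error(y_val, forecast_gb)
print(f"Gradient boosting: MAE = {mae_gb:.2f}")
```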


In [None]:
index = [1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35]

In [None]:
# index

In [None]:
train.loc[:,train.columns != "sales"].shape

(666676, 37)

In [None]:
index1 = list(range(2,32))
index.insert(0, 0)

In [None]:
test.iloc[:,index1].head(2)

| | store_id | category_id | product_id | price | humidity | temperature | pressure | dayofweek | weekend | city_name_Волгоград | ... | weather_desc_облачно | weather_desc_облачно, без существенных осадков | weather_desc_облачно, небольшие осадки | weather_desc_облачно, небольшой дождь | weather_desc_облачно, небольшой снег | weather_desc_осадки | weather_desc_переменная облачность | weather_desc_переменная облачность, небольшие осадки | weather_desc_снег | weather_desc_ясно |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 4.79 | 87.3125 | -1.9375 | 749.3125 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 4.79 | 88.75 | -1.25 | 752.6875 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |


In [None]:
test.shape

(24836, 32)

In [None]:
train.columns

Index(['id', 'date', 'store_id', 'category_id', 'product_id', 'price',
       'humidity', 'temperature', 'pressure', 'sales', 'dayofweek', 'weekend',
       'city_name_Волгоград', 'city_name_Воронеж', 'city_name_Екатеринбург',
       'city_name_Казань', 'city_name_Краснодар', 'city_name_Москва',
       'city_name_Нижний Новгород', 'city_name_Ростов-на-Дону',
       'city_name_Самара', 'city_name_Санкт-Петербург', 'weather_desc_дождь',
       'weather_desc_дождь, гроза', 'weather_desc_метель',
       'weather_desc_облачно',
       'weather_desc_облачно, без существенных осадков',
       'weather_desc_облачно, небольшие осадки',
       'weather_desc_облачно, небольшой дождь',
       'weather_desc_облачно, небольшой снег', 'weather_desc_осадки',
       'weather_desc_переменная облачность',
       'weather_desc_переменная облачность, дождь',
       'weather_desc_переменная облачность, небольшие осадки',
       'weather_desc_переменная облачность, небольшой дождь',
       'weather_desc_пере

In [None]:
test.columns

Index(['id', 'date', 'store_id', 'category_id', 'product_id', 'price',
       'humidity', 'temperature', 'pressure', 'dayofweek', 'weekend',
       'city_name_Волгоград', 'city_name_Воронеж', 'city_name_Екатеринбург',
       'city_name_Казань', 'city_name_Краснодар', 'city_name_Москва',
       'city_name_Нижний Новгород', 'city_name_Ростов-на-Дону',
       'city_name_Самара', 'city_name_Санкт-Петербург', 'weather_desc_дождь',
       'weather_desc_облачно',
       'weather_desc_облачно, без существенных осадков',
       'weather_desc_облачно, небольшие осадки',
       'weather_desc_облачно, небольшой дождь',
       'weather_desc_облачно, небольшой снег', 'weather_desc_осадки',
       'weather_desc_переменная облачность',
       'weather_desc_переменная облачность, небольшие осадки',
       'weather_desc_снег', 'weather_desc_ясно'],
      dtype='object')
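The listings above show that one-hot encoding train and test separately yields different columns, because some weather categories occur only in train. One way to align them is to reindex the encoded test frame to the train columns. The sketch below uses hypothetical toy frames with English labels; it is an alternative to concatenating both sets before encoding, as done in the Prediction section.

```python
import pandas as pd

# Toy frames: "blizzard" occurs only in the training data
train_toy = pd.DataFrame({"weather_desc": ["rain", "snow", "blizzard"]})
test_toy = pd.DataFrame({"weather_desc": ["rain", "snow"]})

train_enc = pd.get_dummies(train_toy)
test_enc = pd.get_dummies(test_toy)

# Give test the same columns in the same order; categories it never saw become 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
```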

In [None]:
X_train.loc[:, X_train.columns != "date"].columns

Index(['id', 'store_id', 'category_id', 'product_id', 'price', 'humidity',
       'temperature', 'pressure', 'sales', 'dayofweek'],
      dtype='object')

In [None]:
X_train.columns

Index(['id', 'date', 'store_id', 'category_id', 'product_id', 'price',
       'humidity', 'temperature', 'pressure', 'sales', 'dayofweek', 'weekend',
       'city_name_Волгоград', 'city_name_Воронеж', 'city_name_Екатеринбург',
       'city_name_Казань', 'city_name_Краснодар', 'city_name_Москва',
       'city_name_Нижний Новгород', 'city_name_Ростов-на-Дону',
       'city_name_Самара', 'city_name_Санкт-Петербург', 'weather_desc_дождь',
       'weather_desc_дождь, гроза', 'weather_desc_метель',
       'weather_desc_облачно',
       'weather_desc_облачно, без существенных осадков',
       'weather_desc_облачно, небольшие осадки',
       'weather_desc_облачно, небольшой дождь',
       'weather_desc_облачно, небольшой снег', 'weather_desc_осадки',
       'weather_desc_переменная облачность',
       'weather_desc_переменная облачность, дождь',
       'weather_desc_переменная облачность, небольшие осадки',
       'weather_desc_переменная облачность, небольшой дождь',
       'weather_desc_пере

In [None]:
# TreeClassifier model - classification
tree_clf = DecisionTreeClassifier().fit(X_train, 
                                        y_train)
forecast_tree = tree_clf.predict(X_test)
mae = mean_absolute_error(y_test, forecast_tree)
print(f"TreeClassifier model: MAE = {mae:.2f} < 4.10 " if mae < 4.10 else f"TreeClassifier model: MAE = {mae:.2f} > 4.10 ")

TypeError: ignored

In [None]:
train.columns

Index(['id', 'date', 'store_id', 'category_id', 'product_id', 'price',
       'humidity', 'temperature', 'pressure', 'sales', 'dayofweek', 'weekend',
       'city_name_Волгоград', 'city_name_Воронеж', 'city_name_Екатеринбург',
       'city_name_Казань', 'city_name_Краснодар', 'city_name_Москва',
       'city_name_Нижний Новгород', 'city_name_Ростов-на-Дону',
       'city_name_Самара', 'city_name_Санкт-Петербург', 'weather_desc_дождь',
       'weather_desc_дождь, гроза', 'weather_desc_метель',
       'weather_desc_облачно',
       'weather_desc_облачно, без существенных осадков',
       'weather_desc_облачно, небольшие осадки',
       'weather_desc_облачно, небольшой дождь',
       'weather_desc_облачно, небольшой снег', 'weather_desc_осадки',
       'weather_desc_переменная облачность',
       'weather_desc_переменная облачность, дождь',
       'weather_desc_переменная облачность, небольшие осадки',
       'weather_desc_переменная облачность, небольшой дождь',
       'weather_desc_пере

In [None]:
train.iloc[:,9]

0         26
1         37
2         25
3         26
4         22
          ..
666671    11
666672    17
666673     2
666674     7
666675    18
Name: sales, Length: 666676, dtype: int64

In [None]:
train.iloc[:,[0,1,2,3, 4, 5, 6, 7, 8,10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37]].columns

Index(['id', 'date', 'store_id', 'category_id', 'product_id', 'price',
       'humidity', 'temperature', 'pressure', 'dayofweek', 'weekend',
       'city_name_Волгоград', 'city_name_Воронеж', 'city_name_Екатеринбург',
       'city_name_Казань', 'city_name_Краснодар', 'city_name_Москва',
       'city_name_Нижний Новгород', 'city_name_Ростов-на-Дону',
       'city_name_Самара', 'city_name_Санкт-Петербург', 'weather_desc_дождь',
       'weather_desc_дождь, гроза', 'weather_desc_метель',
       'weather_desc_облачно',
       'weather_desc_облачно, без существенных осадков',
       'weather_desc_облачно, небольшие осадки',
       'weather_desc_облачно, небольшой дождь',
       'weather_desc_облачно, небольшой снег', 'weather_desc_осадки',
       'weather_desc_переменная облачность',
       'weather_desc_переменная облачность, дождь',
       'weather_desc_переменная облачность, небольшие осадки',
       'weather_desc_переменная облачность, небольшой дождь',
       'weather_desc_переменная об

In [None]:
train.loc[:,train.columns not in ("sales","date")].columns

ValueError: ignored

In [None]:
(train.columns not in ("sales","date")).any()

ValueError: ignored
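The two attempts above fail because `not in` compares the whole `Index` object with the tuple at once. Column-wise membership is usually tested with `Index.isin`, as in the sketch below on a hypothetical toy frame.

```python
import pandas as pd

# Toy frame with a target column and a date column
toy = pd.DataFrame({
    "id": [1, 2],
    "date": ["2021-07-29", "2021-07-30"],
    "price": [4.79, 4.79],
    "sales": [26, 37],
})

# Boolean mask: True for every column that is neither the target nor the date
feature_mask = ~toy.columns.isin(["sales", "date"])
features = toy.loc[:, feature_mask]

print(list(features.columns))
```

`toy.drop(columns=["sales", "date"])` gives the same result and may read more directly.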

In [None]:
# split features and target into train and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    train.loc[:, ~train.columns.isin(["sales", "date"])],  # all columns except the target and the date
    train["sales"],
    train_size = 0.8, 
    test_size = 0.2,
    random_state = 2022)

In [None]:
# convert the date column from string to datetime, add dayofweek and weekend columns, then one-hot encode
train.date = pd.to_datetime(train.date)
train = train.assign(dayofweek=train.date.dt.dayofweek)
train = train.assign(weekend=(train.date.dt.dayofweek > 4).astype(int))
train = pd.get_dummies(train)

In [None]:
# convert the date column from string to datetime, add dayofweek and weekend columns, then one-hot encode
test.date = pd.to_datetime(test.date)
test = test.assign(dayofweek=test.date.dt.dayofweek)
test = test.assign(weekend=(test.date.dt.dayofweek > 4).astype(int))
test = pd.get_dummies(test)

In [None]:
train.info()

In [None]:
train.head(2)

| | id | date | store_id | category_id | product_id | price | humidity | temperature | pressure | sales | ... | weather_desc_облачно, небольшой дождь | weather_desc_облачно, небольшой снег | weather_desc_осадки | weather_desc_переменная облачность | weather_desc_переменная облачность, дождь | weather_desc_переменная облачность, небольшие осадки | weather_desc_переменная облачность, небольшой дождь | weather_desc_переменная облачность, небольшой снег | weather_desc_снег | weather_desc_ясно |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2021-07-29 | 1 | 1 | 1 | 4.79 | 61.9375 | 23.1875 | 741.0 | 26 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 2021-07-30 | 1 | 1 | 1 | 4.79 | 70.25 | 22.1875 | 740.3125 | 37 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |


In [None]:
train.category_id.unique()

array([1, 2, 3, 4, 5, 7, 8, 9, 6])

In [None]:
train.date.max()

Timestamp('2022-02-13 00:00:00')

In [None]:
test.date = pd.to_datetime(test.date)
test.date.max()

Timestamp('2022-02-20 00:00:00')

In [None]:
train = pd.get_dummies(train)

In [None]:
train.columns

Index(['id', 'date', 'store_id', 'category_id', 'product_id', 'price',
       'humidity', 'temperature', 'pressure', 'sales', 'dayofweek', 'weekend',
       'city_name_Kazan', 'city_name_Krasnodar', 'city_name_Moscow',
       'city_name_Nizhny.Novgorod', 'city_name_Rostov-on-Don',
       'city_name_Samara', 'city_name_St.Petersburg', 'city_name_Volgograd',
       'city_name_Voronezh', 'city_name_Yekaterinburg', 'weather_desc_дождь',
       'weather_desc_дождь, гроза', 'weather_desc_метель',
       'weather_desc_облачно',
       'weather_desc_облачно, без существенных осадков',
       'weather_desc_облачно, небольшие осадки',
       'weather_desc_облачно, небольшой дождь',
       'weather_desc_облачно, небольшой снег', 'weather_desc_осадки',
       'weather_desc_переменная облачность',
       'weather_desc_переменная облачность, дождь',
       'weather_desc_переменная облачность, небольшие осадки',
       'weather_desc_переменная облачность, небольшой дождь',
       'weather_desc_переме

In [None]:
# Add lag features from 7 to 14 days
# df = df.sort_values(['region_id', 'date', 'hour']).reset_index(drop=True)
# group = df.groupby(['hour', 'region_id'])
for i in range(7, 15):
    train[f'lag_day_{i}'] = train['sales'].shift(i)

In [None]:
df[(df['region_id'] == 3) & (df['hour'] == 14)].iloc[-14:]

In [None]:
train.weather_desc.unique()

array(['переменная облачность, небольшой дождь', 'переменная облачность',
       'облачно, небольшой дождь', 'дождь, гроза',
       'облачно, без существенных осадков',
       'переменная облачность, дождь', 'дождь', 'облачно', 'ясно',
       'облачно, небольшой снег',
       'переменная облачность, небольшие осадки',
       'облачно, небольшие осадки', 'снег', 'метель', 'осадки',
       'переменная облачность, небольшой снег'], dtype=object)

## Data sets description

In [184]:
train = train.replace(
    ('Москва', 'Санкт-Петербург', 'Краснодар', 'Самара','Нижний Новгород', 'Ростов-на-Дону', 'Волгоград', 'Воронеж', 'Казань', 'Екатеринбург'),
    ("Moscow", "St.Petersburg", "Krasnodar", "Samara", "Nizhny.Novgorod", "Rostov-on-Don", "Volgograd", "Voronezh", "Kazan", "Yekaterinburg")  )

In [185]:
pd.unique(train["city_name"])
# pd.unique(train["store_id"])
# could plot the number of stores in each city

array(['Moscow', 'St.Petersburg', 'Krasnodar', 'Samara',
       'Nizhny.Novgorod', 'Rostov-on-Don', 'Volgograd', 'Voronezh',
       'Kazan', 'Yekaterinburg'], dtype=object)

In [None]:
train.shape

(666676, 12)

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28489 entries, 0 to 28488
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            28489 non-null  int64  
 1   date          28488 non-null  object 
 2   city_name     28488 non-null  object 
 3   store_id      28488 non-null  float64
 4   category_id   28488 non-null  float64
 5   product_id    28488 non-null  float64
 6   price         28488 non-null  float64
 7   weather_desc  28488 non-null  object 
 8   humidity      28488 non-null  float64
 9   temperature   28488 non-null  float64
 10  pressure      28488 non-null  float64
 11  sales         28488 non-null  float64
dtypes: float64(8), int64(1), object(3)
memory usage: 2.6+ MB


In [None]:
train.head()

| | id | date | city_name | store_id | category_id | product_id | price | weather_desc | humidity | temperature | pressure | sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2021-07-29 | Moscow | 1.0 | 1.0 | 1.0 | 4.79 | переменная облачность, небольшой дождь | 61.9375 | 23.1875 | 741.0 | 26.0 |
| 1 | 2 | 2021-07-30 | Moscow | 1.0 | 1.0 | 1.0 | 4.79 | переменная облачность, небольшой дождь | 70.25 | 22.1875 | 740.3125 | 37.0 |
| 2 | 3 | 2021-07-31 | Moscow | 1.0 | 1.0 | 1.0 | 4.79 | переменная облачность | 52.625 | 21.8125 | 741.625 | 25.0 |
| 3 | 4 | 2021-08-01 | Moscow | 1.0 | 1.0 | 1.0 | 4.79 | облачно, небольшой дождь | 87.4375 | 20.0625 | 743.3125 | 26.0 |
| 4 | 5 | 2021-08-02 | Moscow | 1.0 | 1.0 | 1.0 | 4.79 | переменная облачность | 66.1875 | 23.4375 | 739.625 | 22.0 |


In [None]:
train.describe()

| | id | store_id | category_id | product_id | price | humidity | temperature | pressure | sales |
|---|---|---|---|---|---|---|---|---|---|
| count | 28489.0 | 28488.0 | 28488.0 | 28488.0 | 28488.0 | 28488.0 | 28488.0 | 28488.0 | 28488.0 |
| mean | 14244.00007 | 3.615768 | 2.398238 | 17.317397 | 5.104391 | 73.685486 | 7.737734 | 754.513429 | 12.057217 |
| std | 8224.210124 | 1.955576 | 1.902261 | 10.883911 | 3.345185 | 16.58221 | 10.770423 | 9.361326 | 15.567595 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.93 | 27.125 | -18.3125 | 718.0625 | 0.0 |
| 25% | 7122.0 | 2.0 | 1.0 | 9.0 | 3.0 | 61.5625 | 0.375 | 748.3125 | 3.0 |
| 50% | 14244.0 | 3.0 | 1.0 | 16.0 | 4.09 | 73.0625 | 7.6875 | 756.0 | 7.0 |
| 75% | 21366.0 | 5.0 | 4.0 | 28.0 | 6.2 | 89.3125 | 15.0625 | 761.0 | 15.0 |
| max | 28488.0 | 7.0 | 8.0 | 35.0 | 18.63 | 100.0 | 33.25 | 779.0 | 169.0 |


In [None]:
test.shape

(24836, 11)

In [None]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24836 entries, 0 to 24835
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            24836 non-null  int64  
 1   date          24836 non-null  object 
 2   city_name     24836 non-null  object 
 3   store_id      24836 non-null  int64  
 4   category_id   24836 non-null  int64  
 5   product_id    24836 non-null  int64  
 6   price         24836 non-null  float64
 7   weather_desc  24836 non-null  object 
 8   humidity      24836 non-null  float64
 9   temperature   24836 non-null  float64
 10  pressure      24836 non-null  float64
dtypes: float64(4), int64(4), object(3)
memory usage: 2.1+ MB


In [None]:
test.head()

| | id | date | city_name | store_id | category_id | product_id | price | weather_desc | humidity | temperature | pressure |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 666677 | 2022-02-14 | Москва | 1 | 1 | 1 | 4.79 | облачно | 87.3125 | -1.9375 | 749.3125 |
| 1 | 666678 | 2022-02-15 | Москва | 1 | 1 | 1 | 4.79 | переменная облачность | 88.75 | -1.25 | 752.6875 |
| 2 | 666679 | 2022-02-16 | Москва | 1 | 1 | 1 | 4.79 | переменная облачность | 90.375 | -1.5625 | 746.3125 |
| 3 | 666680 | 2022-02-17 | Москва | 1 | 1 | 1 | 4.79 | облачно, небольшой дождь | 98.0 | 1.75 | 732.6875 |
| 4 | 666681 | 2022-02-18 | Москва | 1 | 1 | 1 | 4.79 | облачно, небольшие осадки | 95.5 | 1.375 | 733.0 |


In [None]:
test.describe()

| | id | store_id | category_id | product_id | price | humidity | temperature | pressure |
|---|---|---|---|---|---|---|---|---|
| count | 24836.0 | 24836.0 | 24836.0 | 24836.0 | 24836.0 | 24836.0 | 24836.0 | 24836.0 |
| mean | 679094.5 | 78.053551 | 2.375423 | 17.8323 | 5.201144 | 87.285168 | -0.751719 | 747.908286 |
| std | 7169.679979 | 45.689019 | 1.876578 | 10.826993 | 3.491933 | 9.839292 | 4.059063 | 9.743387 |
| min | 666677.0 | 1.0 | 1.0 | 1.0 | 1.93 | 55.875 | -10.5 | 730.3125 |
| 25% | 672885.75 | 40.0 | 1.0 | 9.0 | 3.0 | 84.8125 | -3.625 | 740.0 |
| 50% | 679094.5 | 76.0 | 1.0 | 17.0 | 4.09 | 89.9375 | -0.3125 | 748.9375 |
| 75% | 685303.25 | 117.0 | 4.0 | 28.0 | 6.02 | 94.3125 | 1.75 | 754.6875 |
| max | 691512.0 | 164.0 | 9.0 | 35.0 | 18.63 | 98.625 | 9.0625 | 769.0 |


In [None]:
group.head()

NameError: ignored

In [246]:
# plot of orders by city
group = train[['date', 'city_name', 'sales']].groupby(['date', 'city_name'], as_index=False).sum()
fig = px.line(group, x="date", y="sales", color='city_name', template='plotly_dark')
fig.update_layout(
    title={
        'text': "Orders in each city",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show() 
# could also add how orders changed by month and within the week

In [281]:
# plot of total orders by price in each city
group = df[['price', 'city_name', 'sales']].groupby(['price', 'city_name'], as_index=False).sum()
fig = px.line(group, x="price", y="sales", color = 'city_name', template='plotly_dark')
fig.update_layout(
    title={
        'text': "Total orders by price in each city",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show() 
# could also add how orders changed by month and within the week
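The within-week view mentioned in the comment could start from a table of mean orders per city and day of week. The sketch below builds such a table on a hypothetical toy frame with made-up sales; the result could then be passed to `px.line` with `x="dayofweek"` in the same way as the plots above.

```python
import pandas as pd

# Toy data: three Moscow rows and one Kazan row
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-08-02", "2021-08-09", "2021-08-03", "2021-08-02"]),
    "city_name": ["Moscow", "Moscow", "Moscow", "Kazan"],
    "sales": [22, 30, 18, 9],
})
toy["dayofweek"] = toy["date"].dt.dayofweek  # Monday = 0

# Mean sales for each city on each day of the week
weekday_mean = toy.groupby(["city_name", "dayofweek"], as_index=False)["sales"].mean()

print(weekday_mean)
```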

## Results

In [None]:
# df.to_csv("prediction.csv")