# Predict Future Sales

Final project for the Coursera course "How to Win a Data Science Competition"
- analyst: Esan (https://www.github.com/Esantomi)
- link: https://www.kaggle.com/c/competitive-data-science-predict-future-sales/

## File descriptions
- sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
- test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.
- sample_submission.csv - a sample submission file in the correct format.
- items.csv - supplemental information about the items/products.
- item_categories.csv - supplemental information about the item categories.
- shops.csv - supplemental information about the shops.

## Data fields
- ID - an Id that represents a (Shop, Item) tuple within the test set
- shop_id - unique identifier of a shop
- item_id - unique identifier of a product
- item_category_id - unique identifier of item category
- item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
- item_price - current price of an item
- date - date in format dd.mm.yyyy (for example, 02.01.2013)
- date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1, ..., October 2015 is 33
- item_name - name of item
- shop_name - name of shop
- item_category_name - name of item category
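
The field descriptions above imply two preparation steps: parsing the day-first `date` strings and summing `item_cnt_day` into a monthly total per shop and item. The following is a minimal sketch on a few toy rows, not the notebook's actual pipeline. The row values are illustrative, the target name `item_cnt_month` is borrowed from `sample_submission.csv`, and the names `toy`, `block_from_date`, and `monthly` are hypothetical.

```python
import pandas as pd

# Toy rows shaped like sales_train.csv (illustrative values, not real totals).
toy = pd.DataFrame({
    "date": ["02.01.2013", "03.01.2013", "05.01.2013", "10.02.2013"],
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [59, 25, 25, 25],
    "item_id": [22154, 2552, 2552, 2552],
    "item_cnt_day": [1.0, 1.0, -1.0, 2.0],
})

# Parse day-first dates with an explicit format to avoid day/month ambiguity.
toy["date"] = pd.to_datetime(toy["date"], format="%d.%m.%Y")

# date_block_num counts months starting from January 2013 (= 0).
block_from_date = (toy["date"].dt.year - 2013) * 12 + (toy["date"].dt.month - 1)

# Sum daily counts into one row per (month, shop, item).
monthly = (
    toy.groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)
print(monthly)
```

Recomputing the month index from the parsed date is a cheap way to check that the dates were read day-first: it should agree with the `date_block_num` column.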

## Data

In [1]:
import pandas as pd
import numpy as np

train = pd.read_csv('./sales_train.csv')
test = pd.read_csv('./test.csv')
sample = pd.read_csv('./sample_submission.csv')
items = pd.read_csv('./items.csv')
item_cat = pd.read_csv('./item_categories.csv')
shops = pd.read_csv('./shops.csv')

In [5]:
train.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
2,05.01.2013,0,25,2552,899.0,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.0,1.0


In [6]:
test.head()

Unnamed: 0,ID,shop_id,item_id
0,0,5,5037
1,1,5,5320
2,2,5,5233
3,3,5,5232
4,4,5,5268


In [7]:
sample.head()

Unnamed: 0,ID,item_cnt_month
0,0,0.5
1,1,0.5
2,2,0.5
3,3,0.5
4,4,0.5
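
The sample output above shows the expected submission shape: one row per test `ID` with an `item_cnt_month` value. As a rough sketch, a constant baseline in that format could be built as follows. The three-row `test` frame is a toy stand-in, the constant 0.5 simply mirrors the sample file, and the in-memory `buffer` is used only to keep the example free of file writes.

```python
import io

import pandas as pd

# Toy stand-in for test.csv (first three rows shown in the notebook).
test = pd.DataFrame({"ID": [0, 1, 2], "shop_id": [5, 5, 5], "item_id": [5037, 5320, 5233]})

# One prediction per test ID; 0.5 mirrors the constant used in the sample file.
submission = pd.DataFrame({"ID": test["ID"], "item_cnt_month": 0.5})

# Write without the index so the file has exactly the two expected columns.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

For a real submission, a file path would replace `buffer` in the `to_csv` call.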


In [2]:
items.head()

Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40
4,***КОРОБКА (СТЕКЛО) D,4,40


In [3]:
item_cat.head()

Unnamed: 0,item_category_name,item_category_id
0,PC - Гарнитуры/Наушники,0
1,Аксессуары - PS2,1
2,Аксессуары - PS3,2
3,Аксессуары - PS4,3
4,Аксессуары - PSP,4


In [4]:
shops.head()

Unnamed: 0,shop_name,shop_id
0,"!Якутск Орджоникидзе, 56 фран",0
1,"!Якутск ТЦ ""Центральный"" фран",1
2,"Адыгея ТЦ ""Мега""",2
3,"Балашиха ТРК ""Октябрь-Киномир""",3
4,"Волжский ТЦ ""Волга Молл""",4
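
The supplemental tables share keys with the sales data (`item_id`, `item_category_id`, `shop_id`), so they can be joined onto the training rows. Below is a minimal sketch using toy stand-ins with shortened, made-up names; `enriched` is a hypothetical variable, and whether the notebook performs this join later is not shown in this section.

```python
import pandas as pd

# Toy stand-ins for the loaded tables (a few rows each, placeholder names).
train = pd.DataFrame({"shop_id": [0, 2], "item_id": [0, 1], "item_cnt_day": [1.0, 1.0]})
items = pd.DataFrame({
    "item_name": ["item A", "item B"],
    "item_id": [0, 1],
    "item_category_id": [40, 76],
})
item_cat = pd.DataFrame({
    "item_category_name": ["category 40", "category 76"],
    "item_category_id": [40, 76],
})
shops = pd.DataFrame({"shop_name": ["shop 0", "shop 2"], "shop_id": [0, 2]})

# Left joins keep every sales row even when a lookup key has no match.
enriched = (
    train.merge(items, on="item_id", how="left")
    .merge(item_cat, on="item_category_id", how="left")
    .merge(shops, on="shop_id", how="left")
)
print(enriched.columns.tolist())
```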


In [8]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   date_block_num  int64  
 2   shop_id         int64  
 3   item_id         int64  
 4   item_price      float64
 5   item_cnt_day    float64
dtypes: float64(2), int64(3), object(1)
memory usage: 134.4+ MB


In [10]:
train.describe(include = 'all')

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
count,2935849,2935849.0,2935849.0,2935849.0,2935849.0,2935849.0
unique,1034,,,,,
top,28.12.2013,,,,,
freq,9434,,,,,
mean,,14.56991,33.00173,10197.23,890.8532,1.242641
std,,9.422988,16.22697,6324.297,1729.8,2.618834
min,,0.0,0.0,0.0,-1.0,-22.0
25%,,7.0,22.0,4476.0,249.0,1.0
50%,,14.0,31.0,9343.0,399.0,1.0
75%,,23.0,47.0,15684.0,999.0,1.0
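
The summary above reports a minimum `item_price` of -1 and a minimum `item_cnt_day` of -22, so it may be worth counting such rows before modeling. The sketch below does this on toy data. Treating non-positive prices as invalid is an assumption, as is the common reading of negative counts as returns; the names `bad_price`, `negative_cnt`, and `cleaned` are hypothetical.

```python
import pandas as pd

# Toy rows with one non-positive price and one negative daily count.
toy = pd.DataFrame({
    "item_price": [999.0, 899.0, -1.0, 1709.05],
    "item_cnt_day": [1.0, -1.0, 1.0, 1.0],
})

# Boolean masks for the two suspicious conditions.
bad_price = toy["item_price"] <= 0
negative_cnt = toy["item_cnt_day"] < 0

n_bad_price = int(bad_price.sum())
n_negative_cnt = int(negative_cnt.sum())
print(n_bad_price, n_negative_cnt)

# Assumption: drop rows with non-positive prices, but keep negative counts
# so that they still offset sales when summed per month.
cleaned = toy.loc[~bad_price]
```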


## Linear regression

In [14]:
# !pip install tensorflow

Collecting tensorflow
  Downloading tensorflow-2.5.0-cp38-cp38-manylinux2010_x86_64.whl (454.4 MB)
Collecting six~=1.15.0
  Using cached six-1.15.0-py2.py3-none-any.whl (10 kB)
Collecting wheel~=0.35
  Using cached wheel-0.36.2-py2.py3-none-any.whl (35 kB)
Collecting protobuf>=3.9.2
  Downloading protobuf-3.17.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.0 MB)
Collecting grpcio~=1.34.0
  Downloading grpcio-1.34.1-cp38-cp38-manylinux2014_x86_64.whl (4.0 MB)
Collecting google-pasta~=0.2
  Using cached google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting astunparse~=1.6.3
  Using cached astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Processing /home/esantomi/.cache/pip/wheels/5f

In [18]:
# !pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-0.24.2-cp38-cp38-manylinux2010_x86_64.whl (24.9 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-2.2.0-py3-none-any.whl (12 kB)
Collecting joblib>=0.11
  Downloading joblib-1.0.1-py3-none-any.whl (303 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.0.1 scikit-learn-0.24.2 threadpoolctl-2.2.0


In [19]:
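# numpy, pandas, train and test were already loaded in the Data section;
# they are repeated here so that this section can run on its own.
import numpy as np
import pandas as pd
import tensorflow as tf

# Use the Keras API bundled with TensorFlow so both come from one installation.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.model_selection import train_test_split

train = pd.read_csv('./sales_train.csv')
test = pd.read_csv('./test.csv')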