## ETL layer

### **Description:**

- Create an ETL layer based on DQC

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sys
sys.path.append('../')
import scripts.etl as etl # etl.py module
import scripts.dqc as dqc # for "check_negative_values" function

## 1. Change dtypes for **train_df** columns

### First, load all necessary data into dataframes

In [2]:
train_df = pd.read_csv('../data/sales_train.csv')
test_df = pd.read_csv('../data/test.csv')
items_df = pd.read_csv('../data/items.csv')
categories_df = pd.read_csv('../data/item_categories.csv')
shops_df = pd.read_csv('../data/shops.csv')

In [3]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   date_block_num  int64  
 2   shop_id         int64  
 3   item_id         int64  
 4   item_price      float64
 5   item_cnt_day    float64
dtypes: float64(2), int64(3), object(1)
memory usage: 134.4+ MB


As mentioned earlier, all numerical data can be safely stored as 32-bit int/float dtypes. Moreover, all **item_cnt_day** values are actually integers, and the **date** feature should be of 'datetime' type.
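Before casting, it can help to confirm that a column really is safe to store as `int32`. The sketch below is only an illustration on made-up values: `can_downcast_to_int32` is a hypothetical helper, not part of `scripts/etl.py` or `scripts/dqc.py`.

```python
import numpy as np
import pandas as pd

def can_downcast_to_int32(series: pd.Series) -> bool:
    """Return True if every value is a whole number inside the int32 range."""
    values = series.to_numpy(dtype="float64")
    info = np.iinfo(np.int32)
    # NaN fails the comparison below, so columns with missing values return False.
    is_whole = np.all(np.mod(values, 1) == 0)
    in_range = values.min() >= info.min and values.max() <= info.max
    return bool(is_whole and in_range)

# Toy frame standing in for sales_train.csv (values are made up).
toy = pd.DataFrame({
    "item_id": [22154, 2552, 2554],
    "item_cnt_day": [1.0, -1.0, 3.0],
    "item_price": [999.0, 899.0, 1709.05],
})

print(can_downcast_to_int32(toy["item_cnt_day"]))  # whole numbers stored as floats
print(can_downcast_to_int32(toy["item_price"]))    # has a fractional part
```

Note that `float32` keeps only about seven significant digits, so a price such as 1709.05 would be stored as roughly 1709.050049 after the cast.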

In [4]:
train_df['date'] = pd.to_datetime(train_df['date'], format='%d.%m.%Y')

int_columns = ['date_block_num', 'shop_id', 'item_id', 'item_cnt_day']
float_columns = ['item_price']

train_df = etl.transform_df_types(train_df, int_columns, float_columns)
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
 #   Column          Dtype         
---  ------          -----         
 0   date            datetime64[ns]
 1   date_block_num  int32         
 2   shop_id         int32         
 3   item_id         int32         
 4   item_price      float32       
 5   item_cnt_day    int32         
dtypes: datetime64[ns](1), float32(1), int32(4)
memory usage: 78.4 MB
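The body of `etl.transform_df_types` is not shown in this notebook. Judging by the `info()` outputs (int columns become `int32`, float columns `float32`, and name columns later appear as `category`), it might look roughly like the sketch below. The function name `transform_types_sketch` and its exact behaviour are assumptions.

```python
import pandas as pd

def transform_types_sketch(df, int_columns=(), float_columns=(), object_columns=()):
    """Hypothetical stand-in for etl.transform_df_types; the real code is not shown."""
    out = df.copy()
    for col in int_columns:
        out[col] = out[col].astype("int32")
    for col in float_columns:
        out[col] = out[col].astype("float32")
    for col in object_columns:
        # Assumption: text columns are stored as pandas categoricals.
        out[col] = out[col].astype("category")
    return out

# Toy data with made-up values.
toy = pd.DataFrame({
    "shop_id": [59, 25],
    "item_cnt_day": [1.0, -1.0],
    "item_price": [999.0, 899.0],
    "shop_name": ["Shop A", "Shop B"],
})
converted = transform_types_sketch(
    toy, ["shop_id", "item_cnt_day"], ["item_price"], ["shop_name"]
)
print(converted.dtypes)
```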


### Do the same for the other dataframes' columns

In [5]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214200 entries, 0 to 214199
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   ID       214200 non-null  int64
 1   shop_id  214200 non-null  int64
 2   item_id  214200 non-null  int64
dtypes: int64(3)
memory usage: 4.9 MB


In [6]:
test_df = etl.transform_df_types(test_df, int_columns=test_df.columns.to_list())

In [7]:
items_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22170 entries, 0 to 22169
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   item_name         22170 non-null  object
 1   item_id           22170 non-null  int64 
 2   item_category_id  22170 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 519.7+ KB


In [8]:
items_df = etl.transform_df_types(items_df, int_columns=['item_id', 'item_category_id'], object_columns=['item_name'])

In [9]:
categories_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   item_category_name  84 non-null     object
 1   item_category_id    84 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.4+ KB


In [10]:
categories_df = etl.transform_df_types(categories_df, int_columns=['item_category_id'], object_columns=['item_category_name'])

In [11]:
shops_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   shop_name  60 non-null     object
 1   shop_id    60 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.1+ KB


In [12]:
shops_df = etl.transform_df_types(shops_df, int_columns=['shop_id'], object_columns=['shop_name'])

## 2. Delete rows with negative values

As we already know, the 'item_price' feature has negative values that should be deleted.

In [13]:
dqc.check_negative_values(train_df, 'item_price')

3.406169731481422e-05 percent of values are negative


In [14]:
train_df = etl.del_negative(train_df, 'item_price')

### Check for negative values again

In [15]:
dqc.check_negative_values(train_df, 'item_price')

No negative values found
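The implementations of `dqc.check_negative_values` and `etl.del_negative` are not shown here. A minimal sketch of the same idea, using the hypothetical names `negative_share_percent` and `drop_negative_rows` on made-up data, could be:

```python
import pandas as pd

def negative_share_percent(df, column):
    """Percentage of rows whose value in `column` is below zero."""
    return float((df[column] < 0).mean() * 100)

def drop_negative_rows(df, column):
    """Return a copy without rows that have a negative value in `column`."""
    # `~(x < 0)` keeps NaN rows; only strictly negative values are removed.
    return df.loc[~(df[column] < 0)].copy()

toy = pd.DataFrame({"item_price": [999.0, -1.0, 899.0, 1709.05]})
print(negative_share_percent(toy, "item_price"))  # 25.0

cleaned = drop_negative_rows(toy, "item_price")
print(cleaned.index.tolist())  # [0, 2, 3]: the index now has a gap
```

Filtering leaves a gap in the index, which is why the index is reset after the deletion.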


In [16]:
train_df.reset_index(drop=True, inplace=True)
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935848 entries, 0 to 2935847
Data columns (total 6 columns):
 #   Column          Dtype         
---  ------          -----         
 0   date            datetime64[ns]
 1   date_block_num  int32         
 2   shop_id         int32         
 3   item_id         int32         
 4   item_price      float32       
 5   item_cnt_day    int32         
dtypes: datetime64[ns](1), float32(1), int32(4)
memory usage: 78.4 MB


## 3. Create an aggregated copy of **train_df** with monthly sales

In [17]:
train_aggregated = train_df.groupby(['date_block_num', 'shop_id', 'item_id']).agg({'item_cnt_day': 'sum', 'item_price': 'mean'}).reset_index()

train_aggregated.rename(columns={'item_cnt_day': 'item_cnt_month'}, inplace=True)
train_aggregated.head()

|   | date_block_num | shop_id | item_id | item_cnt_month | item_price |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 32 | 6 | 221.0 |
| 1 | 0 | 0 | 33 | 3 | 347.0 |
| 2 | 0 | 0 | 35 | 1 | 247.0 |
| 3 | 0 | 0 | 43 | 1 | 221.0 |
| 4 | 0 | 0 | 51 | 2 | 128.5 |
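A quick way to sanity-check this kind of aggregation is to confirm that the total number of units is preserved. The toy example below uses made-up rows and pandas named aggregation. It is a sketch of the check, not part of the original pipeline.

```python
import pandas as pd

# Made-up daily rows: one item is sold and then returned in the same month.
daily = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [25, 25, 25, 25],
    "item_id": [2552, 2552, 2554, 2552],
    "item_cnt_day": [1, -1, 1, 2],
    "item_price": [899.0, 899.0, 1709.0, 850.0],
})

monthly = (
    daily.groupby(["date_block_num", "shop_id", "item_id"], as_index=False)
    .agg(item_cnt_month=("item_cnt_day", "sum"), item_price=("item_price", "mean"))
)

# Aggregation must preserve the total number of units sold.
assert monthly["item_cnt_month"].sum() == daily["item_cnt_day"].sum()
print(monthly)
```

Note that `mean` here is an unweighted average of the daily prices, so days with many units sold count the same as days with a single sale.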


## 4. Add **year** and **month** columns to the resulting datasets

In [18]:
train_df = etl.add_month_year_columns(train_df)
train_aggregated = etl.add_month_year_columns(train_aggregated)

train_df.head()

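|   | date | date_block_num | shop_id | item_id | item_price | item_cnt_day | month | year |
|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-02 | 0 | 59 | 22154 | 999.0 | 1 | 0 | 0 |
| 1 | 2013-01-03 | 0 | 25 | 2552 | 899.0 | 1 | 0 | 0 |
| 2 | 2013-01-05 | 0 | 25 | 2552 | 899.0 | -1 | 0 | 0 |
| 3 | 2013-01-06 | 0 | 25 | 2554 | 1709.050049 | 1 | 0 | 0 |
| 4 | 2013-01-15 | 0 | 25 | 2555 | 1099.0 | 1 | 0 | 0 |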
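The outputs in this notebook show zero-based values (for example, `date_block_num` 14 gives month 2 and year 1), which is consistent with deriving both columns from `date_block_num`. The sketch below assumes exactly that. `add_month_year_sketch` is a hypothetical name, and the real `etl.add_month_year_columns` may be implemented differently.

```python
import pandas as pd

def add_month_year_sketch(df):
    """Assumed logic: zero-based month and year derived from date_block_num."""
    out = df.copy()
    out["month"] = out["date_block_num"] % 12   # 0 = January ... 11 = December
    out["year"] = out["date_block_num"] // 12   # 0 = 2013, 1 = 2014, 2 = 2015
    return out

toy = pd.DataFrame({"date_block_num": [0, 11, 14, 24, 33]})
result = add_month_year_sketch(toy)
print(result["month"].tolist())  # [0, 11, 2, 0, 9]
print(result["year"].tolist())   # [0, 0, 1, 2, 2]
```

Deriving the columns from `date_block_num` rather than from `date` also works for the aggregated dataframe, which has no `date` column.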


## 5. Merge **train_df** with **items_df, categories_df, shops_df**

In [19]:
merged_train_df = etl.merge_df(train_df, items_df, categories_df, shops_df)
merged_train_aggregated_df = etl.merge_df(train_aggregated, items_df, categories_df, shops_df)

merged_train_df.head()

|   | date | date_block_num | shop_id | item_id | item_price | item_cnt_day | month | year | item_name | item_category_id | item_category_name | shop_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-02 | 0 | 59 | 22154 | 999.0 | 1 | 0 | 0 | ЯВЛЕНИЕ 2012 (BD) | 37 | Кино - Blu-Ray | Ярославль ТЦ "Альтаир" |
| 1 | 2013-01-03 | 0 | 25 | 2552 | 899.0 | 1 | 0 | 0 | DEEP PURPLE The House Of Blue Light LP | 58 | Музыка - Винил | Москва ТРК "Атриум" |
| 2 | 2013-01-05 | 0 | 25 | 2552 | 899.0 | -1 | 0 | 0 | DEEP PURPLE The House Of Blue Light LP | 58 | Музыка - Винил | Москва ТРК "Атриум" |
| 3 | 2013-01-06 | 0 | 25 | 2554 | 1709.050049 | 1 | 0 | 0 | DEEP PURPLE Who Do You Think We Are LP | 58 | Музыка - Винил | Москва ТРК "Атриум" |
| 4 | 2013-01-15 | 0 | 25 | 2555 | 1099.0 | 1 | 0 | 0 | DEEP PURPLE 30 Very Best Of 2CD (Фирм.) | 56 | Музыка - CD фирменного производства | Москва ТРК "Атриум" |


In [20]:
merged_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935848 entries, 0 to 2935847
Data columns (total 12 columns):
 #   Column              Dtype         
---  ------              -----         
 0   date                datetime64[ns]
 1   date_block_num      int32         
 2   shop_id             int32         
 3   item_id             int32         
 4   item_price          float32       
 5   item_cnt_day        int32         
 6   month               int32         
 7   year                int32         
 8   item_name           category      
 9   item_category_id    int32         
 10  item_category_name  category      
 11  shop_name           category      
dtypes: category(3), datetime64[ns](1), float32(1), int32(7)
memory usage: 123.9 MB
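The row count is unchanged after the merge (2935848 before and after), which is what a left join against lookup tables with unique keys would give. The sketch below shows one way such a merge could be written. `merge_lookup_tables` is a hypothetical stand-in for `etl.merge_df`, whose code is not shown, and the toy names are made up.

```python
import pandas as pd

def merge_lookup_tables(sales, items, categories, shops):
    """Hypothetical stand-in for etl.merge_df: left-join the three lookup tables."""
    # validate="many_to_one" raises if a lookup key is duplicated,
    # which guards against silently multiplying sales rows.
    out = sales.merge(items, on="item_id", how="left", validate="many_to_one")
    out = out.merge(categories, on="item_category_id", how="left", validate="many_to_one")
    out = out.merge(shops, on="shop_id", how="left", validate="many_to_one")
    return out

# Toy lookup tables with made-up names.
sales = pd.DataFrame({"shop_id": [25, 25, 59], "item_id": [2552, 2554, 2552]})
items = pd.DataFrame({
    "item_name": ["Item A", "Item B"],
    "item_id": [2552, 2554],
    "item_category_id": [58, 58],
})
categories = pd.DataFrame({"item_category_name": ["Category X"], "item_category_id": [58]})
shops = pd.DataFrame({"shop_name": ["Shop A", "Shop B"], "shop_id": [25, 59]})

merged = merge_lookup_tables(sales, items, categories, shops)
print(merged)
```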


In [21]:
merged_train_aggregated_df.head()

|   | date_block_num | shop_id | item_id | item_cnt_month | item_price | month | year | item_name | item_category_id | item_category_name | shop_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 32 | 6 | 221.0 | 0 | 0 | 1+1 | 40 | Кино - DVD | !Якутск Орджоникидзе, 56 фран |
| 1 | 0 | 0 | 33 | 3 | 347.0 | 0 | 0 | 1+1 (BD) | 37 | Кино - Blu-Ray | !Якутск Орджоникидзе, 56 фран |
| 2 | 0 | 0 | 35 | 1 | 247.0 | 0 | 0 | 10 ЛЕТ СПУСТЯ | 40 | Кино - DVD | !Якутск Орджоникидзе, 56 фран |
| 3 | 0 | 0 | 43 | 1 | 221.0 | 0 | 0 | 100 МИЛЛИОНОВ ЕВРО | 40 | Кино - DVD | !Якутск Орджоникидзе, 56 фран |
| 4 | 0 | 0 | 51 | 2 | 128.5 | 0 | 0 | 100 лучших произведений классики (mp3-CD) (Dig... | 57 | Музыка - MP3 | !Якутск Орджоникидзе, 56 фран |


In [22]:
merged_train_aggregated_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1609124 entries, 0 to 1609123
Data columns (total 11 columns):
 #   Column              Non-Null Count    Dtype   
---  ------              --------------    -----   
 0   date_block_num      1609124 non-null  int32   
 1   shop_id             1609124 non-null  int32   
 2   item_id             1609124 non-null  int32   
 3   item_cnt_month      1609124 non-null  int32   
 4   item_price          1609124 non-null  float32 
 5   month               1609124 non-null  int32   
 6   year                1609124 non-null  int32   
 7   item_name           1609124 non-null  category
 8   item_category_id    1609124 non-null  int32   
 9   item_category_name  1609124 non-null  category
 10  shop_name           1609124 non-null  category
dtypes: category(3), float32(1), int32(7)
memory usage: 55.9 MB


## 6. Do the same for **test_df**

In [23]:
merged_test_df = etl.merge_df(test_df, items_df, categories_df, shops_df)
merged_test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214200 entries, 0 to 214199
Data columns (total 7 columns):
 #   Column              Non-Null Count   Dtype   
---  ------              --------------   -----   
 0   ID                  214200 non-null  int32   
 1   shop_id             214200 non-null  int32   
 2   item_id             214200 non-null  int32   
 3   item_name           214200 non-null  category
 4   item_category_id    214200 non-null  int32   
 5   item_category_name  214200 non-null  category
 6   shop_name           214200 non-null  category
dtypes: category(3), int32(4)
memory usage: 4.8 MB


## 7. Look at outliers in merged dataframe

In [24]:
merged_train_df.sort_values('item_price', ascending=False).head()

|   | date | date_block_num | shop_id | item_id | item_price | item_cnt_day | month | year | item_name | item_category_id | item_category_name | shop_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1163157 | 2013-12-13 | 11 | 12 | 6066 | 307980.0 | 1 | 11 | 0 | Radmin 3 - 522 лиц. | 75 | Программы - Для дома и офиса | Интернет-магазин ЧС |
| 885137 | 2013-09-17 | 8 | 12 | 11365 | 59200.0 | 1 | 8 | 0 | Доставка (EMS) | 9 | Доставка товара | Интернет-магазин ЧС |
| 1488134 | 2014-03-20 | 14 | 25 | 13199 | 50999.0 | 1 | 2 | 1 | Коллекционные шахматы (Властелин Колец) | 69 | Подарки - Сувениры | Москва ТРК "Атриум" |
| 2327158 | 2015-01-29 | 24 | 12 | 7241 | 49782.0 | 1 | 0 | 2 | UserGate Proxy & Firewall 6.X с модулем фильтр... | 75 | Программы - Для дома и офиса | Интернет-магазин ЧС |
| 2931379 | 2015-10-20 | 33 | 22 | 13403 | 42990.0 | 1 | 9 | 2 | Комплект "Microsoft Xbox One 1TB  Limited Edit... | 16 | Игровые консоли - XBOX ONE | Москва Магазин С21 |


In [25]:
merged_train_df.sort_values('item_cnt_day', ascending=False).head()

|   | date | date_block_num | shop_id | item_id | item_price | item_cnt_day | month | year | item_name | item_category_id | item_category_name | shop_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2909817 | 2015-10-28 | 33 | 12 | 11373 | 0.908714 | 2169 | 9 | 2 | Доставка до пункта выдачи (Boxberry) | 9 | Доставка товара | Интернет-магазин ЧС |
| 2326929 | 2015-01-15 | 24 | 12 | 20949 | 4.0 | 1000 | 0 | 2 | Фирменный пакет майка 1С Интерес белый (34*42)... | 71 | Подарки - Сумки, Альбомы, Коврики д/мыши | Интернет-магазин ЧС |
| 2864234 | 2015-09-30 | 32 | 12 | 9248 | 1692.526123 | 669 | 8 | 2 | Билет "ИгроМир 2015" - 3 октября 2015 (сайт) [... | 80 | Служебные - Билеты | Интернет-магазин ЧС |
| 2851090 | 2015-09-30 | 32 | 55 | 9249 | 1702.825806 | 637 | 8 | 2 | Билет "ИгроМир 2015" - 3 октября 2015 (сайт) У... | 8 | Билеты (Цифра) | Цифровой склад 1С-Онлайн |
| 2608039 | 2015-04-14 | 27 | 12 | 3731 | 1904.548096 | 624 | 3 | 2 | Grand Theft Auto V [PC, русские субтитры] | 30 | Игры PC - Стандартные издания | Интернет-магазин ЧС |


The rows containing outliers look like genuine sales rather than data errors, so they may be helpful and informative later. For now, I will leave all the outliers in the dataset, at least until EDA is performed, so that they can be explored more carefully. Moreover, some models do not lose quality when the training set contains outliers, so I see no reason to process them at this stage.
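If it becomes useful to track extreme rows without removing them, one option is to mark them with a boolean column. This is only a sketch on made-up prices: the helper `flag_above_quantile` and the quantile threshold are assumptions, not something this notebook's pipeline does.

```python
import pandas as pd

def flag_above_quantile(df, column, q=0.99):
    """Add a boolean `<column>_outlier` column; no rows are removed."""
    out = df.copy()
    threshold = out[column].quantile(q)
    out[f"{column}_outlier"] = out[column] > threshold
    return out

# Made-up prices with one extreme value.
toy = pd.DataFrame({"item_price": [999.0, 899.0, 1099.0, 1709.0, 307980.0]})
flagged = flag_above_quantile(toy, "item_price", q=0.9)
print(flagged)
```

The flagged frame keeps every row, so the decision about what to do with outliers can still be postponed until EDA.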

## 8. Export dataframes to .csv files

In [26]:
merged_test_df.to_csv('../data/merged_test.csv', index=False)
merged_train_df.to_csv('../data/merged_train.csv', index=False)
merged_train_aggregated_df.to_csv('../data/merged_train_aggregated.csv', index=False)
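CSV files do not store dtypes, so the `int32`, `float32`, `category`, and datetime columns set up above revert to pandas defaults when the files are read back with a plain `read_csv`. The sketch below shows how the dtypes could be restored on load. It uses an in-memory buffer and made-up values instead of the real files.

```python
import io
import pandas as pd

frame = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-02", "2013-01-03"]),
    "shop_id": pd.Series([59, 25], dtype="int32"),
    "item_price": pd.Series([999.0, 899.0], dtype="float32"),
    "shop_name": pd.Series(["Shop A", "Shop B"], dtype="category"),
})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# A plain read falls back to default dtypes (64-bit numbers, text dates).
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing dtype and parse_dates restores the compact types.
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    parse_dates=["date"],
    dtype={"shop_id": "int32", "item_price": "float32", "shop_name": "category"},
)
print(plain.dtypes)
print(restored.dtypes)
```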