<img src='../img/logo.png' alt='DS Market logo' height='150px'>

# 2 - Feature Engineering

## Table of Contents

* [A. Introduction](#introduction)
* [B. Importing Libraries](#libraries)
* [C. Importing data](#data)
* [D. Creating a master dataframe](#master_df)
* [E. Creating aggregated dataframes](#master_df)


## A. Introduction <a class="anchor" id="introduction"></a>

In this notebook, we generate the features needed for the analysis and for the models built later on.

Disclaimer: Running this notebook takes quite some time (up to 30 - 40 minutes, depending on your computer). Whenever possible, download the resulting file from the following [GDrive link](https://drive.google.com/file/d/1_OCpE6AZK3ju5RJVTJZm8ox5Cn8G3_Ag/view?usp=sharing) instead.

## B. Importing Libraries <a class="anchor" id="libraries"></a>

In [2]:
# system and path management
import sys
sys.path.append('../scripts') # including helper functions inside the scripts folder

# removing system warnings
import warnings
warnings.filterwarnings('ignore')

# data manipulation
import pandas as pd
import numpy as np

# helper functions
import file_management

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.options.display.float_format = '{:,.2f}'.format

## C. Importing Data <a class="anchor" id="data"></a>

In [3]:
# downloading the processed data files from gdrive, in case these were not available
directory = '../data/processed/'
urls = [
    {'filename': 'sales_processed.csv', 'url': 'https://drive.google.com/file/d/1JdeAgraKcaFQJrjG2HPVb5D0VD0iTlNB/view?usp=sharing'},
    {'filename': 'prices_processed.csv', 'url': 'https://drive.google.com/file/d/1pSEJAQfAU-owDjKmxcPrxf3CpGFivwa6/view?usp=sharing'},
    {'filename': 'calendar_processed.csv', 'url': 'https://drive.google.com/file/d/1Lnji96iBkTpFiWo-QXeW3TvESiNYWCML/view?usp=sharing'}
]
        
file_management.download_files_from_url(urls, directory)        

sales_processed.csv file already exists in ../data/processed/
prices_processed.csv file already exists in ../data/processed/
calendar_processed.csv file already exists in ../data/processed/


['./../data/processed//sales_processed.csv',
 './../data/processed//prices_processed.csv',
 './../data/processed//calendar_processed.csv']

In [4]:
sales = pd.read_csv(directory + 'sales_processed.csv', index_col = 0)
prices = pd.read_csv(directory + 'prices_processed.csv', index_col = 0)
calendar = pd.read_csv(directory + 'calendar_processed.csv', index_col = 0)

## D. Creating a master dataframe <a class="anchor" id="master_df"></a>

In [None]:
# generating a dataframe with each row being the total amount of sales per day with each product
master = sales.melt(
        id_vars = ['id', 'item', 'category', 'department', 'store', 'store_code', 'region'], 
        var_name = 'd', 
        value_name = 'num_sales'
)
master
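
As a minimal sketch of what `melt` does in the cell above, the toy table below (made-up ids and values, not the real sales file) is reshaped from one column per day into one row per product and day:

```python
import pandas as pd

# Toy wide table: one row per product, one column per day (values are made up)
wide = pd.DataFrame({
    'id': ['A_BOS_1', 'B_BOS_1'],
    'store': ['BOS_1', 'BOS_1'],
    'd_1': [0, 2],
    'd_2': [1, 0],
    'd_3': [3, 5],
})

# Every column not listed in id_vars is unpivoted into a (d, num_sales) pair
long = wide.melt(id_vars=['id', 'store'], var_name='d', value_name='num_sales')

print(long.shape)  # 2 products x 3 days = 6 rows, 4 columns
```

The number of rows grows to products x days, which is why the real master dataframe is so large.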

In [None]:
# removing unnecessary columns that can be easily regenerated with the id or with a simple dictionary ('store')
master.drop(columns = ['item', 'category', 'department', 'store_code', 'region', 'store'], inplace = True)
master

In [None]:
# merging the master sales with calendar in order to get the date translation between 'd' and the real 'date'
master = pd.merge(
    master,
    calendar,
    on = 'd'
)

master

In [None]:
# master dataframe sorting, cleanup and feature generation for other merges
master.sort_values(by = ['id', 'date'], inplace = True)

master['date'] = pd.to_datetime(master['date'])

master['year'] = master['date'].dt.year
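# Series.dt.week was removed in pandas 2.0; isocalendar() returns the same ISO week number
master['week'] = master['date'].dt.isocalendar().week.astype(int)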

master.drop(columns = 'd', inplace = True)
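
One thing worth checking before merging on `week` and `year`: the week number is an ISO week, while `dt.year` is the calendar year, and the two can disagree around New Year. The sketch below only illustrates the mismatch on three dates. Whether the prices file follows the same convention is not shown in this notebook, so treat it as a sanity check rather than a fix.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2011-12-31', '2012-01-01', '2012-01-02']))

# isocalendar() returns ISO 8601 year, week and day columns
iso = dates.dt.isocalendar()

check = pd.DataFrame({
    'date': dates,
    'calendar_year': dates.dt.year,
    'iso_year': iso['year'].astype(int),
    'iso_week': iso['week'].astype(int),
})
print(check)
```

Here 2012-01-01 has calendar year 2012 but belongs to ISO week 52 of ISO year 2011, so a `(year, week)` key built from `dt.year` and the ISO week would read (2012, 52) for that day.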

In [None]:
# adding features and dropping columns in 'prices' for a more efficient merging
prices['id'] = prices['item'] + '_' + prices['store_code']
prices.drop(columns = ['item', 'store_code', 'category'], inplace = True)
prices

In [None]:
# merging 'master' dataframe with 'prices'
master = pd.merge(
    master,
    prices,
    how = 'left',
    on = ['id', 'week', 'year']
)

master
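
A left merge keeps every sales row, so rows without a matching price end up with a null `sell_price`. A minimal sketch, on toy frames with made-up values, of how the unmatched rows could be counted with the `indicator` option of `merge`:

```python
import pandas as pd

# Toy stand-ins for 'master' and 'prices' (values are made up)
toy_sales = pd.DataFrame({
    'id': ['A_BOS_1', 'A_BOS_1', 'A_BOS_1'],
    'year': [2011, 2011, 2011],
    'week': [4, 5, 6],
    'num_sales': [0, 1, 2],
})
toy_prices = pd.DataFrame({
    'id': ['A_BOS_1', 'A_BOS_1'],
    'year': [2011, 2011],
    'week': [5, 6],
    'sell_price': [12.74, 12.74],
})

merged = toy_sales.merge(
    toy_prices,
    how='left',
    on=['id', 'week', 'year'],
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
    validate='many_to_one',  # raises if a key appears more than once in toy_prices
)

unmatched = (merged['_merge'] == 'left_only').sum()
print(unmatched)
```

`validate='many_to_one'` is an optional guard: it fails early if the price table contains duplicate keys, which would otherwise silently duplicate sales rows.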

In [None]:
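# filling in nulls in prices using backfill: each missing price takes the next available price, assuming it didn't change
# note: the fill runs over the whole column, so a gap at the end of one 'id' is filled with the first price of the next 'id'
master['sell_price'] = master['sell_price'].bfill()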

master
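
The backfill above runs over the whole `sell_price` column. As a hedged alternative, the sketch below compares it with a backfill applied per `id` on a toy frame (made-up values). The grouped version is an assumption about what may be preferable, not what this notebook does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'id': ['A', 'A', 'A', 'B', 'B', 'B'],
    'sell_price': [np.nan, 2.0, np.nan, np.nan, 5.0, 5.0],
})

# Backfill over the whole column: the trailing gap of 'A' takes the price of 'B'
toy['global_bfill'] = toy['sell_price'].bfill()

# Backfill within each id: the trailing gap of 'A' stays null
toy['grouped_bfill'] = toy.groupby('id')['sell_price'].bfill()

print(toy)
```

With the grouped version, any remaining nulls would still need a decision of their own, for example a forward fill within the same `id`.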

In [None]:
# saving master dataframe
directory = '../data/features'
dfs = [
    { 'filename': 'master', 'df': master }
]

file_management.save_dfs_to_csv(dfs, directory)

## E. Creating aggregated dataframes <a class="anchor" id="master_df"></a>

In this section we generate several aggregated dataframes that are used throughout the different analyses.

In [5]:
# downloading the feature file from gdrive - in case you didn't run the previous section
directory = '../data/features/'
urls = [
    {'filename': 'master.csv', 'url': 'https://drive.google.com/file/d/1_OCpE6AZK3ju5RJVTJZm8ox5Cn8G3_Ag/view?usp=sharing'},
]
        
file_management.download_files_from_url(urls, directory)

master = pd.read_csv(directory + 'master.csv', index_col = 0)

master.csv file already exists in ../data/features/


### Preparing the master file

In [None]:
# income per row = units sold * unit price
master['total_income'] = master['num_sales'] * master['sell_price']
master.drop(columns = ['weekday', 'year', 'week', 'weekday_int', 'event'], inplace = True)

### Global sales DF

In [None]:
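# numeric_only = True keeps the text column 'id' out of the sum (pandas 2.x no longer skips it by default)
sales_by_date = master.drop(columns = 'sell_price').groupby(['date']).sum(numeric_only = True)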
sales_by_date
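
A minimal sketch of the same per-date aggregation on a toy frame (made-up values), selecting the numeric columns explicitly so that text columns such as `id` never enter the sum:

```python
import pandas as pd

toy = pd.DataFrame({
    'id': ['A_BOS_1', 'B_BOS_1', 'A_BOS_1', 'B_BOS_1'],
    'date': ['2011-01-29', '2011-01-29', '2011-01-30', '2011-01-30'],
    'num_sales': [1, 2, 0, 4],
    'sell_price': [2.0, 3.0, 2.0, 3.0],
})
toy['total_income'] = toy['num_sales'] * toy['sell_price']

# Sum only the columns that are meaningful to add up
by_date = toy.groupby('date')[['num_sales', 'total_income']].sum()
print(by_date)
```

Selecting the columns up front also makes it explicit that `sell_price` is left out, since summing unit prices across products has no useful meaning.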

In [None]:
directory = '../data/features'
dfs = [
    { 'filename': 'sales_by_date', 'df': sales_by_date },
]

file_management.save_dfs_to_csv(dfs, directory, prefix = '')

### Global Sales by City

In [None]:
# the id ends in '<city>_<store number>' (e.g. 'BOS_1'), so characters -5 to -3 are the city code
master['city'] = master['id'].str[-5:-2]
master
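
The slice above relies on fixed character positions. As a sketch, assuming every id ends in `_<3-letter city>_<1-digit store number>` like the two ids shown in this notebook's outputs, the same pieces can also be obtained by splitting on the last two underscores (the column names below are chosen for illustration):

```python
import pandas as pd

ids = pd.Series(['ACCESORIES_1_001_BOS_1', 'SUPERMARKET_3_827_PHI_3'])

# Split from the right on the last two underscores: item, city, store number
parts = ids.str.rsplit('_', n=2, expand=True)
parts.columns = ['item', 'city', 'store_number']
parts['store'] = parts['city'] + '_' + parts['store_number']

print(parts)
```

On ids with this shape, both approaches give the same result. The split version would keep working if a store number had two digits, whereas the fixed slices would not.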

In [None]:
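# numeric_only = True keeps the text column 'id' out of the sum (pandas 2.x no longer skips it by default)
sales_by_date_city = master.drop(columns = 'sell_price').groupby(['date', 'city']).sum(numeric_only = True)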
sales_by_date_city

In [None]:
directory = '../data/features'
dfs = [
    { 'filename': 'sales_by_date_city', 'df': sales_by_date_city },
]

file_management.save_dfs_to_csv(dfs, directory, prefix = '')

### Global Sales by Store

In [None]:
# the last five characters of the id are the store code (e.g. 'BOS_1')
master['store'] = master['id'].str[-5:]
master

In [None]:
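# numeric_only = True keeps the text column 'id' out of the sum (pandas 2.x no longer skips it by default)
sales_by_date_store = master.drop(columns = ['sell_price', 'city']).groupby(['date', 'store']).sum(numeric_only = True)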
sales_by_date_store

In [None]:
directory = '../data/features'
dfs = [
    { 'filename': 'sales_by_date_store', 'df': sales_by_date_store },
]

file_management.save_dfs_to_csv(dfs, directory, prefix = '')

### Sales per Item

In [5]:
# reloading the master file, since the previous sections added and dropped columns in place
# (the file is downloaded from gdrive in case you didn't run the previous sections)
directory = '../data/features/'
urls = [
    {'filename': 'master.csv', 'url': 'https://drive.google.com/file/d/1_OCpE6AZK3ju5RJVTJZm8ox5Cn8G3_Ag/view?usp=sharing'},
]
        
file_management.download_files_from_url(urls, directory)

master = pd.read_csv(directory + 'master.csv', index_col = 0)

master.csv file already exists in ../data/features/


In [6]:
master.head()

| | id | num_sales | date | weekday | weekday_int | event | year | week | sell_price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ACCESORIES_1_001_BOS_1 | 0 | 2011-01-29 | Saturday | 1 | | 2011 | 4 | 12.74 |
| 1 | ACCESORIES_1_001_BOS_1 | 0 | 2011-01-30 | Sunday | 2 | | 2011 | 4 | 12.74 |
| 2 | ACCESORIES_1_001_BOS_1 | 0 | 2011-01-31 | Monday | 3 | | 2011 | 5 | 12.74 |
| 3 | ACCESORIES_1_001_BOS_1 | 0 | 2011-02-01 | Tuesday | 4 | | 2011 | 5 | 12.74 |
| 4 | ACCESORIES_1_001_BOS_1 | 0 | 2011-02-02 | Wednesday | 5 | | 2011 | 5 | 12.74 |


In [7]:
master.drop(columns = ['date', 'weekday', 'weekday_int', 'event', 'year', 'week' ], inplace = True)

In [8]:
master['item'] = master['id'].apply(lambda x: x[:-6])
master

| | id | num_sales | sell_price | item |
|---|---|---|---|---|
| 0 | ACCESORIES_1_001_BOS_1 | 0 | 12.74 | ACCESORIES_1_001 |
| 1 | ACCESORIES_1_001_BOS_1 | 0 | 12.74 | ACCESORIES_1_001 |
| 2 | ACCESORIES_1_001_BOS_1 | 0 | 12.74 | ACCESORIES_1_001 |
| 3 | ACCESORIES_1_001_BOS_1 | 0 | 12.74 | ACCESORIES_1_001 |
| 4 | ACCESORIES_1_001_BOS_1 | 0 | 12.74 | ACCESORIES_1_001 |
| ... | ... | ... | ... | ... |
| 58327365 | SUPERMARKET_3_827_PHI_3 | 0 | 1.20 | SUPERMARKET_3_827 |
| 58327366 | SUPERMARKET_3_827_PHI_3 | 0 | 1.20 | SUPERMARKET_3_827 |
| 58327367 | SUPERMARKET_3_827_PHI_3 | 0 | 1.20 | SUPERMARKET_3_827 |
| 58327368 | SUPERMARKET_3_827_PHI_3 | 0 | 1.20 | SUPERMARKET_3_827 |


In [13]:
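# total, mean and standard deviation of units sold and of price per item
# (string names are used because passing numpy functions to agg is deprecated in recent pandas)
sales_by_product = master.drop(columns = ['id']).groupby(['item']).agg(['sum', 'mean', 'std'])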
sales_by_product

| item | num_sales sum | num_sales mean | num_sales std | sell_price sum | sell_price mean | sell_price std |
|---|---|---|---|---|---|---|
| ACCESORIES_1_001 | 4093 | 0.21 | 0.58 | 219344.58 | 11.47 | 0.73 |
| ACCESORIES_1_002 | 5059 | 0.26 | 0.59 | 100941.36 | 5.28 | 0.09 |
| ACCESORIES_1_003 | 1435 | 0.08 | 0.32 | 75518.03 | 3.95 | 0.13 |
| ACCESORIES_1_004 | 39175 | 2.05 | 2.67 | 114391.52 | 5.98 | 0.28 |
| ACCESORIES_1_005 | 14621 | 0.76 | 1.23 | 73414.40 | 3.84 | 0.22 |
| ... | ... | ... | ... | ... | ... | ... |
| SUPERMARKET_3_823 | 15388 | 0.80 | 1.71 | 63984.10 | 3.34 | 0.23 |
| SUPERMARKET_3_824 | 8325 | 0.44 | 0.95 | 57897.00 | 3.03 | 0.26 |
| SUPERMARKET_3_825 | 13526 | 0.71 | 1.20 | 94370.36 | 4.93 | 0.24 |
| SUPERMARKET_3_826 | 12188 | 0.64 | 1.25 | 29381.33 | 1.54 | 0.01 |
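
Passing a list of functions to `agg` produces two-level column names, which become awkward header rows once the frame is saved to CSV. A minimal sketch, on a toy frame with made-up values, of pandas named aggregation as one way to get flat column names (the output names are chosen for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'item': ['A', 'A', 'B', 'B'],
    'num_sales': [0, 2, 1, 3],
    'sell_price': [1.0, 1.0, 4.0, 6.0],
})

# Each keyword is an output column name mapped to (input column, aggregation)
flat = toy.groupby('item').agg(
    num_sales_sum=('num_sales', 'sum'),
    num_sales_mean=('num_sales', 'mean'),
    num_sales_std=('num_sales', 'std'),
    sell_price_mean=('sell_price', 'mean'),
    sell_price_std=('sell_price', 'std'),
)
print(flat)
```

This form also makes it easy to leave out aggregations with little meaning, such as the sum of unit prices.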


In [14]:
directory = '../data/features'
dfs = [
    { 'filename': 'sales_by_product', 'df': sales_by_product },
]

file_management.save_dfs_to_csv(dfs, directory, prefix = '')

./../data/features/sales_by_product.csv doesn't exist. Creating new file


['./../data/features/sales_by_product.csv']