<img src='../img/logo.png' alt='DS Market logo' height='150px'>

# 2 - Feature Engineering

## Table of Contents

* [A. Introduction](#introduction)
* [B. Importing Libraries](#libraries)
* [C. Importing Data](#data)
* [D. Creating a master dataframe](#master_df)

## A. Introduction <a class="anchor" id="introduction"></a>

In this notebook, we generate the features needed for the analysis and for the models we will build later.

Disclaimer: Running this notebook takes quite some time (8 to 20 minutes, depending on your computer). Whenever possible, download the file from the following [GDrive link](https://drive.google.com/file/d/1_OCpE6AZK3ju5RJVTJZm8ox5Cn8G3_Ag/view?usp=sharing).

## B. Importing Libraries <a class="anchor" id="libraries"></a>

In [1]:
# system and path management
import sys
sys.path.append('../scripts') # including helper functions inside the scripts folder

# removing system warnings
import warnings
warnings.filterwarnings('ignore')

# data manipulation
import pandas as pd
import numpy as np

# helper functions
import file_management

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.options.display.float_format = '{:,.2f}'.format

## C. Importing Data <a class="anchor" id="data"></a>

In [2]:
# downloading the processed data files from gdrive, if they are not already available locally
directory = '../data/processed/'
urls = [
    {'filename': 'sales_processed.csv', 'url': 'https://drive.google.com/file/d/1JdeAgraKcaFQJrjG2HPVb5D0VD0iTlNB/view?usp=sharing'},
    {'filename': 'prices_processed.csv', 'url': 'https://drive.google.com/file/d/1pSEJAQfAU-owDjKmxcPrxf3CpGFivwa6/view?usp=sharing'},
    {'filename': 'calendar_processed.csv', 'url': 'https://drive.google.com/file/d/1Lnji96iBkTpFiWo-QXeW3TvESiNYWCML/view?usp=sharing'}
]
        
file_management.download_files_from_url(urls, directory)        

sales_processed.csv file already exists in ../data/processed/
prices_processed.csv file already exists in ../data/processed/
calendar_processed.csv file already exists in ../data/processed/


['./../data/processed//sales_processed.csv',
 './../data/processed//prices_processed.csv',
 './../data/processed//calendar_processed.csv']

## D. Creating a master dataframe <a class="anchor" id="master_df"></a>

In [3]:
sales = pd.read_csv(directory + 'sales_processed.csv', index_col = 0)
prices = pd.read_csv(directory + 'prices_processed.csv', index_col = 0)
calendar = pd.read_csv(directory + 'calendar_processed.csv', index_col = 0)

In [4]:
# reshaping 'sales' from wide to long format: one row per product id and day, with that day's sales in 'num_sales'
master = sales.melt(
        id_vars = ['id', 'item', 'category', 'department', 'store', 'store_code', 'region'], 
        var_name = 'd', 
        value_name = 'num_sales'
)
master

Unnamed: 0,id,item,category,department,store,store_code,region,d,num_sales
0,ACCESORIES_1_001_NYC_1,ACCESORIES_1_001,ACCESORIES,ACCESORIES_1,Greenwich_Village,NYC_1,New York,d_1,0
1,ACCESORIES_1_002_NYC_1,ACCESORIES_1_002,ACCESORIES,ACCESORIES_1,Greenwich_Village,NYC_1,New York,d_1,0
2,ACCESORIES_1_003_NYC_1,ACCESORIES_1_003,ACCESORIES,ACCESORIES_1,Greenwich_Village,NYC_1,New York,d_1,0
3,ACCESORIES_1_004_NYC_1,ACCESORIES_1_004,ACCESORIES,ACCESORIES_1,Greenwich_Village,NYC_1,New York,d_1,0
4,ACCESORIES_1_005_NYC_1,ACCESORIES_1_005,ACCESORIES,ACCESORIES_1,Greenwich_Village,NYC_1,New York,d_1,0
...,...,...,...,...,...,...,...,...,...
58327365,SUPERMARKET_3_823_PHI_3,SUPERMARKET_3_823,SUPERMARKET,SUPERMARKET_3,Queen_Village,PHI_3,Philadelphia,d_1913,1
58327366,SUPERMARKET_3_824_PHI_3,SUPERMARKET_3_824,SUPERMARKET,SUPERMARKET_3,Queen_Village,PHI_3,Philadelphia,d_1913,0
58327367,SUPERMARKET_3_825_PHI_3,SUPERMARKET_3_825,SUPERMARKET,SUPERMARKET_3,Queen_Village,PHI_3,Philadelphia,d_1913,0
58327368,SUPERMARKET_3_826_PHI_3,SUPERMARKET_3_826,SUPERMARKET,SUPERMARKET_3,Queen_Village,PHI_3,Philadelphia,d_1913,3
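
The reshape above can be illustrated on a toy table. This is a minimal sketch with made-up values and only two identifier columns; the names `wide` and `long` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy wide-format table: one row per product, one column per day (values are made up)
wide = pd.DataFrame({
    'id': ['A_1', 'B_1'],
    'store_code': ['NYC_1', 'NYC_1'],
    'd_1': [0, 2],
    'd_2': [1, 0],
    'd_3': [4, 3],
})

# Identifier columns are repeated; every other column becomes a ('d', 'num_sales') pair
long = wide.melt(
    id_vars=['id', 'store_code'],
    var_name='d',
    value_name='num_sales',
)

# The long table holds (number of products) x (number of day columns) rows
expected_rows = len(wide) * (wide.shape[1] - 2)
print(long)
print(len(long) == expected_rows)
```

The same row-count identity is a cheap sanity check on the real `master` dataframe after melting.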


In [5]:
# removing unnecessary columns that can be easily regenerated with the id or with a simple dictionary ('store')
master.drop(columns = ['item', 'category', 'department', 'store_code', 'region', 'store'], inplace = True)
master

Unnamed: 0,id,d,num_sales
0,ACCESORIES_1_001_NYC_1,d_1,0
1,ACCESORIES_1_002_NYC_1,d_1,0
2,ACCESORIES_1_003_NYC_1,d_1,0
3,ACCESORIES_1_004_NYC_1,d_1,0
4,ACCESORIES_1_005_NYC_1,d_1,0
...,...,...,...
58327365,SUPERMARKET_3_823_PHI_3,d_1913,1
58327366,SUPERMARKET_3_824_PHI_3,d_1913,0
58327367,SUPERMARKET_3_825_PHI_3,d_1913,0
58327368,SUPERMARKET_3_826_PHI_3,d_1913,3
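
As a rough sketch of how the dropped columns could be rebuilt later, the example below splits the `id` and uses small lookup dictionaries. It assumes every `id` follows the layout `<category>_<department number>_<item number>_<city>_<store number>`, which matches the rows shown above but is not verified for the full dataset. The dictionaries are deliberately partial and contain only values visible in the outputs of this notebook.

```python
import pandas as pd

ids = pd.Series(['ACCESORIES_1_001_NYC_1', 'SUPERMARKET_3_823_PHI_3'], name='id')

# Assumed id layout: <category>_<dept number>_<item number>_<city>_<store number>
parts = ids.str.split('_', expand=True)

rebuilt = pd.DataFrame({
    'id': ids,
    'category': parts[0],
    'department': parts[0] + '_' + parts[1],
    'item': parts[0] + '_' + parts[1] + '_' + parts[2],
    'store_code': parts[3] + '_' + parts[4],
})

# Partial lookups, limited to values that appear in the outputs above
store_names = {'NYC_1': 'Greenwich_Village', 'PHI_3': 'Queen_Village'}
regions = {'NYC': 'New York', 'PHI': 'Philadelphia'}

rebuilt['store'] = rebuilt['store_code'].map(store_names)
rebuilt['region'] = parts[3].map(regions)
print(rebuilt)
```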


In [6]:
# merging the master sales with calendar in order to get the date translation between 'd' and the real 'date'
master = pd.merge(
    master,
    calendar,
    on = 'd'
)

master

Unnamed: 0,id,d,num_sales,date,weekday,weekday_int,event
0,ACCESORIES_1_001_NYC_1,d_1,0,2011-01-29,Saturday,1,
1,ACCESORIES_1_002_NYC_1,d_1,0,2011-01-29,Saturday,1,
2,ACCESORIES_1_003_NYC_1,d_1,0,2011-01-29,Saturday,1,
3,ACCESORIES_1_004_NYC_1,d_1,0,2011-01-29,Saturday,1,
4,ACCESORIES_1_005_NYC_1,d_1,0,2011-01-29,Saturday,1,
...,...,...,...,...,...,...,...
58327365,SUPERMARKET_3_823_PHI_3,d_1913,1,2016-04-24,Sunday,2,
58327366,SUPERMARKET_3_824_PHI_3,d_1913,0,2016-04-24,Sunday,2,
58327367,SUPERMARKET_3_825_PHI_3,d_1913,0,2016-04-24,Sunday,2,
58327368,SUPERMARKET_3_826_PHI_3,d_1913,3,2016-04-24,Sunday,2,
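
The merge above relies on `calendar` having exactly one row per `d`; otherwise rows in `master` would be duplicated. One possible safeguard, sketched here on toy data with hypothetical names, is the `validate` argument of `pd.merge`, which raises an error when the right-hand keys are not unique.

```python
import pandas as pd

sales_long = pd.DataFrame({
    'id': ['A_1', 'B_1', 'A_1', 'B_1'],
    'd': ['d_1', 'd_1', 'd_2', 'd_2'],
    'num_sales': [0, 2, 1, 0],
})
calendar_toy = pd.DataFrame({
    'd': ['d_1', 'd_2'],
    'date': ['2011-01-29', '2011-01-30'],
    'weekday': ['Saturday', 'Sunday'],
})

# many_to_one: many sales rows per day, at most one calendar row per day
merged = pd.merge(sales_long, calendar_toy, on='d', validate='many_to_one')
print(len(merged) == len(sales_long))

# A calendar with a repeated day is rejected instead of silently duplicating rows
dup_calendar = pd.concat([calendar_toy, calendar_toy.iloc[[0]]], ignore_index=True)
try:
    pd.merge(sales_long, dup_calendar, on='d', validate='many_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```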


In [7]:
# master dataframe sorting, cleanup and feature generation for other merges
master.sort_values(by = ['id', 'date'], inplace = True)

master['date'] = pd.to_datetime(master['date'])

master['year'] = master['date'].dt.year
master['week'] = master['date'].dt.isocalendar().week.astype(int) # ISO week number; Series.dt.week was removed in pandas 2.0

master.drop(columns = 'd', inplace = True)
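
One detail worth checking with the `year` and `week` features above: the week number follows the ISO calendar, while `dt.year` is the calendar year, so the two can disagree around New Year. The sketch below shows the effect on a few dates. Whether this matters for the merge with `prices` depends on how `year` and `week` were built in `prices_processed.csv`, which this notebook does not show.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ['2011-01-29', '2011-12-31', '2012-01-01', '2012-01-02']
))

# isocalendar() returns a DataFrame with the ISO 'year', 'week' and 'day'
iso = dates.dt.isocalendar()

features = pd.DataFrame({
    'date': dates,
    'calendar_year': dates.dt.year,
    'iso_year': iso['year'].astype(int),
    'iso_week': iso['week'].astype(int),
})

# Rows where the calendar year and the ISO year differ
mismatch = features[features['calendar_year'] != features['iso_year']]
print(features)
print(mismatch)
```

Here 2012-01-01 (a Sunday) still belongs to ISO week 52 of 2011, so pairing it with the calendar year 2012 gives a (year, week) key that points to the end of 2012 rather than the end of 2011.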

In [8]:
# adding features and dropping columns in 'prices' for a more efficient merging
prices['id'] = prices['item'] + '_' + prices['store_code']
prices.drop(columns = ['item', 'store_code', 'category'], inplace = True)
prices

Unnamed: 0,sell_price,year,week,id
0,12.74,2013,28,ACCESORIES_1_001_NYC_1
1,12.74,2013,29,ACCESORIES_1_001_NYC_1
2,10.99,2013,30,ACCESORIES_1_001_NYC_1
3,10.99,2013,31,ACCESORIES_1_001_NYC_1
4,10.99,2013,32,ACCESORIES_1_001_NYC_1
...,...,...,...,...
6965701,1.20,2016,21,SUPERMARKET_3_827_PHI_3
6965702,1.20,2016,22,SUPERMARKET_3_827_PHI_3
6965703,1.20,2016,23,SUPERMARKET_3_827_PHI_3
6965704,1.20,2016,24,SUPERMARKET_3_827_PHI_3


In [9]:
# merging 'master' dataframe with 'prices'
master = pd.merge(
    master,
    prices,
    how = 'left',
    on = ['id', 'week', 'year']
)

master

Unnamed: 0,id,num_sales,date,weekday,weekday_int,event,year,week,sell_price
0,ACCESORIES_1_001_BOS_1,0,2011-01-29,Saturday,1,,2011,4,
1,ACCESORIES_1_001_BOS_1,0,2011-01-30,Sunday,2,,2011,4,
2,ACCESORIES_1_001_BOS_1,0,2011-01-31,Monday,3,,2011,5,
3,ACCESORIES_1_001_BOS_1,0,2011-02-01,Tuesday,4,,2011,5,
4,ACCESORIES_1_001_BOS_1,0,2011-02-02,Wednesday,5,,2011,5,
...,...,...,...,...,...,...,...,...,...
58327365,SUPERMARKET_3_827_PHI_3,0,2016-04-20,Wednesday,5,,2016,16,1.20
58327366,SUPERMARKET_3_827_PHI_3,0,2016-04-21,Thursday,6,,2016,16,1.20
58327367,SUPERMARKET_3_827_PHI_3,0,2016-04-22,Friday,7,,2016,16,1.20
58327368,SUPERMARKET_3_827_PHI_3,0,2016-04-23,Saturday,1,,2016,16,1.20
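
Because this is a left merge, rows of `master` without a matching price keep a null `sell_price`, as in the first rows above. A quick way to quantify this, sketched on toy data with hypothetical names, is the `indicator` argument of `pd.merge`.

```python
import pandas as pd

master_toy = pd.DataFrame({
    'id': ['A_1', 'A_1', 'A_1'],
    'year': [2011, 2011, 2011],
    'week': [4, 5, 6],
    'num_sales': [0, 1, 2],
})
prices_toy = pd.DataFrame({
    'id': ['A_1', 'A_1'],
    'year': [2011, 2011],
    'week': [5, 6],
    'sell_price': [12.74, 10.99],
})

# indicator=True adds a '_merge' column telling where each row was found
checked = pd.merge(
    master_toy,
    prices_toy,
    how='left',
    on=['id', 'week', 'year'],
    indicator=True,
)

unmatched = checked[checked['_merge'] == 'left_only']
missing_share = checked['sell_price'].isna().mean()
print(unmatched)
print(missing_share)
```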


In [14]:
# filling in null prices with a backfill: each gap takes the next available price, assuming the price had not changed before it
# note: the fill runs over the whole column, so it relies on 'master' being sorted by 'id' and 'date'
master['sell_price'] = master['sell_price'].bfill()

master

Unnamed: 0,id,num_sales,date,weekday,weekday_int,event,year,week,sell_price
0,ACCESORIES_1_001_BOS_1,0,2011-01-29,Saturday,1,,2011,4,12.74
1,ACCESORIES_1_001_BOS_1,0,2011-01-30,Sunday,2,,2011,4,12.74
2,ACCESORIES_1_001_BOS_1,0,2011-01-31,Monday,3,,2011,5,12.74
3,ACCESORIES_1_001_BOS_1,0,2011-02-01,Tuesday,4,,2011,5,12.74
4,ACCESORIES_1_001_BOS_1,0,2011-02-02,Wednesday,5,,2011,5,12.74
...,...,...,...,...,...,...,...,...,...
58327365,SUPERMARKET_3_827_PHI_3,0,2016-04-20,Wednesday,5,,2016,16,1.20
58327366,SUPERMARKET_3_827_PHI_3,0,2016-04-21,Thursday,6,,2016,16,1.20
58327367,SUPERMARKET_3_827_PHI_3,0,2016-04-22,Friday,7,,2016,16,1.20
58327368,SUPERMARKET_3_827_PHI_3,0,2016-04-23,Saturday,1,,2016,16,1.20
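
The backfill above runs over the whole `sell_price` column, so a gap at the end of one product's rows would be filled with the first price of the next product. A possible alternative, sketched below on made-up data, fills within each `id` only. The extra forward fill for trailing gaps is an assumption of this sketch and not something the notebook does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'id': ['A', 'A', 'A', 'B', 'B', 'B'],
    'sell_price': [np.nan, 12.74, np.nan, np.nan, 1.20, 1.20],
})

# Global backfill: the trailing gap of product A takes a price from product B
toy['global_bfill'] = toy['sell_price'].bfill()

# Per-id backfill: gaps are filled only with later prices of the same product
toy['per_id_fill'] = toy.groupby('id')['sell_price'].bfill()

# Trailing gaps have no later price, so fall back to the last known one (assumption)
toy['per_id_fill'] = toy.groupby('id')['per_id_fill'].ffill()

print(toy)
```

With the global fill, the third row of product A ends up with 1.20, the price of product B; the per-id version keeps 12.74.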


In [11]:
# saving master dataframe
directory = '../data/features'
dfs = [
    { 'filename': 'master', 'df': master }
]

file_management.save_dfs_to_csv(dfs, directory)

master file already exists in ../data/features
Overwriting file


['./../data/features/master.csv']
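
When the saved file is read back in a later notebook, keep in mind that CSV does not store column types, so `date` comes back as text unless it is parsed again. The sketch below shows this round trip with an in-memory buffer instead of a real file. It uses plain `to_csv` and `read_csv`, since the internals of `file_management.save_dfs_to_csv` are not shown here.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    'id': ['A_1', 'A_1'],
    'date': pd.to_datetime(['2011-01-29', '2011-01-30']),
    'sell_price': [12.74, 12.74],
})

# Write to an in-memory buffer that stands in for the CSV file
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading without parse_dates leaves 'date' as plain text
buffer.seek(0)
plain = pd.read_csv(buffer, index_col=0)

# parse_dates restores the datetime type
buffer.seek(0)
parsed = pd.read_csv(buffer, index_col=0, parse_dates=['date'])

print(plain.dtypes)
print(parsed.dtypes)
```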