# Feature Engineering

## To-dos
- Apply filters
    - Only shipped orders
- Resample products with missing dates
- Create time series basic features:
    - lag
    - moving average
    - min, max window values
    - frequency
    - holidays


## Future
- Group the sales to create new features:
    - total sales
    - product line sales
    - product sales
    - country sales
    - territory sales
    - deal size sales
- Create stacking ensemble model using:
    - ARIMA
    - Prophet
    - Random Forest

In [1]:
import pandas as pd

In [178]:
INDEX_COLUMNS = ['PRODUCTCODE', 'YEAR_MONTH']
PREDICT_LAG = 1

In [148]:
df = pd.read_csv('../data/raw/sales_data_sample.csv', encoding="ISO-8859-1")

### Filters

In [8]:
df_filtered = df[df['STATUS'] == 'Shipped'].copy()

### Group Sales per product

In [169]:
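# Encode year and month as a YYYYMM integer (e.g. year 2003, month 1 -> 200301).
df_filtered['YEAR_MONTH'] = (df_filtered['MONTH_ID']
                                + df_filtered['YEAR_ID'] * 100)
# Parse the YYYYMM value into a timestamp for the first day of that month.
df_filtered['YEAR_MONTH'] = pd.to_datetime(df_filtered['YEAR_MONTH'],
                                              format='%Y%m')
# Sanity check: number of order lines per month.
df_filtered['YEAR_MONTH'].value_counts().sort_index()

# Aggregate to one row per product and month:
# qt_sales is the total quantity ordered, vl_sales the total sales value.
df_groupped = df_filtered.groupby(
    INDEX_COLUMNS, as_index=False).agg(qt_sales=('QUANTITYORDERED', 'sum'),
                                       vl_sales=('SALES', 'sum'))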

### Resample

In [170]:
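# Every month start ('MS') between the first and last observed month.
all_dates = pd.date_range(start=df_groupped['YEAR_MONTH'].min(),
                          end=df_groupped['YEAR_MONTH'].max(), freq='MS'
).to_series(name='YEAR_MONTH')
all_product = pd.Series(df_groupped['PRODUCTCODE'].unique(), name='PRODUCTCODE')

# Cross join: one row for every (product, month) combination.
df_resample = pd.merge(all_product, all_dates, how='cross')

# Left-join the observed sales onto the full grid; months in which a product
# had no shipped orders get qt_sales = 0 and vl_sales = 0.
df_groupped = df_resample.merge(df_groupped, on=INDEX_COLUMNS,
                                how='left').fillna(0)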
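
To make the effect of the resample step concrete, here is a minimal sketch on toy data. The frame `sales` and its values are made up for illustration and are not taken from the dataset; the cross join and left merge follow the same pattern as the cell above.

```python
import pandas as pd

# Hypothetical toy data: product A has no row for February,
# product B only has a row for February.
sales = pd.DataFrame({
    'PRODUCTCODE': ['A', 'A', 'B'],
    'YEAR_MONTH': pd.to_datetime(['2003-01-01', '2003-03-01', '2003-02-01']),
    'qt_sales': [10, 30, 5],
})

# Full grid of products x months.
products = pd.DataFrame({'PRODUCTCODE': sales['PRODUCTCODE'].unique()})
months = pd.DataFrame({'YEAR_MONTH': pd.date_range(
    sales['YEAR_MONTH'].min(), sales['YEAR_MONTH'].max(), freq='MS')})
grid = products.merge(months, how='cross')

# Missing (product, month) pairs appear as NaN after the left join
# and are filled with 0.
full = grid.merge(sales, on=['PRODUCTCODE', 'YEAR_MONTH'], how='left').fillna(0)
print(full)
```

Each product ends up with one row per month, so later positional operations such as `shift` and `rolling` step through calendar months rather than through whichever months happen to have orders.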

### Basic time series features

### Lag

In [172]:
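number_of_lags = 12

# qt_sales_lag_k holds the quantity sold k months earlier for the same product.
# shift() is positional, so rows must be ordered by date within each product
# (the cross join in the resample step produces that order).
for lag in range(1, number_of_lags + 1):
    df_groupped[f'qt_sales_lag_{lag}'] = df_groupped.groupby('PRODUCTCODE')[
        'qt_sales'].shift(lag)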
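
As a quick check of what the grouped shift does, here is a small sketch on made-up data (the frame `toy` is hypothetical). It shows that the lag restarts for every product, so the first month of one product never receives the last value of the previous product.

```python
import pandas as pd

# Hypothetical toy data, already sorted by product and then by month.
toy = pd.DataFrame({
    'PRODUCTCODE': ['A', 'A', 'A', 'B', 'B', 'B'],
    'qt_sales': [10, 20, 30, 1, 2, 3],
})

# Shift within each product: the first row of every group becomes NaN.
toy['qt_sales_lag_1'] = toy.groupby('PRODUCTCODE')['qt_sales'].shift(1)
print(toy)
```

A plain `toy['qt_sales'].shift(1)` without the `groupby` would instead put product A's last value (30) into product B's first row.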

### Moving Average

$\text{MovingAverage}_n = \frac{1}{k}\sum_{i=n-k+1}^{n}{p_i}$

Where $p_i$ is the observation at period $i$, $n$ is the position of the most recent observation in the window, and $k$ is the window size, i.e. the number of values being averaged.
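
As a worked instance of the formula, the sketch below evaluates it by hand for a window of $k = 3$ values and compares the result with pandas `rolling`. The series `p` is made-up data. It also shows the effect of `min_periods=1`, which is what the second positional argument in `rolling(3, 1)` sets.

```python
import pandas as pd

p = pd.Series([10, 20, 30, 40])  # hypothetical observations p_1 .. p_4
k = 3                            # window size

# Formula evaluated at n = 4: (p_2 + p_3 + p_4) / k
n = 4
manual = p.iloc[n - k:n].sum() / k

# Strict window: NaN until k observations are available.
strict = p.rolling(k).mean()

# min_periods=1: average over however many values are available so far.
partial = p.rolling(k, min_periods=1).mean()

print(manual)
print(strict.tolist())
print(partial.tolist())
```

With `min_periods=1` the first values are averages over fewer than $k$ observations, so early rows are noisier than later ones.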

In [179]:
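# Moving averages are computed on the lagged column rather than on qt_sales,
# so the value for a given month only uses months that precede it by at
# least PREDICT_LAG. min_periods=1 allows partial windows at the start.
df_groupped['qt_sales_mavg_3'] = df_groupped.groupby('PRODUCTCODE')[
    f'qt_sales_lag_{PREDICT_LAG}'].transform(
        lambda x: x.rolling(3, min_periods=1).mean())
df_groupped['qt_sales_mavg_6'] = df_groupped.groupby('PRODUCTCODE')[
    f'qt_sales_lag_{PREDICT_LAG}'].transform(
        lambda x: x.rolling(6, min_periods=1).mean())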
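
The to-do list also mentions min and max window values, which are not computed in this notebook. One possible sketch, following the same grouped-rolling pattern on the lagged column, is shown below. The column names `qt_sales_min_3` and `qt_sales_max_3` and the toy frame are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data, sorted by product and then by month.
toy = pd.DataFrame({
    'PRODUCTCODE': ['A'] * 4 + ['B'] * 4,
    'qt_sales': [10, 30, 20, 40, 5, 0, 15, 10],
})

# Lag by one month first so the current month is not part of its own window.
toy['qt_sales_lag_1'] = toy.groupby('PRODUCTCODE')['qt_sales'].shift(1)

lagged = toy.groupby('PRODUCTCODE')['qt_sales_lag_1']
# Rolling minimum and maximum over the previous 3 months (partial windows allowed).
toy['qt_sales_min_3'] = lagged.transform(
    lambda x: x.rolling(3, min_periods=1).min())
toy['qt_sales_max_3'] = lagged.transform(
    lambda x: x.rolling(3, min_periods=1).max())
print(toy)
```

The first row of each product stays NaN because no earlier month exists. In this notebook such values would be set to 0 by the final `fillna(0)`.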

In [183]:
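# The remaining NaNs are lag and moving-average values at the start of each
# product's history, where no earlier month exists; set them to 0.
df_groupped.fillna(0, inplace=True)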

In [184]:
df_groupped.to_parquet('../data/temp/df_model.parquet')