This notebook was made for the Walmart Kaggle competition, which can be found here: https://www.kaggle.com/competitions/walmart-recruiting-store-sales-forecasting

Notebook Notes:

Some things I wish to point out before running the cells:

*   The PyCaret install is a little weird. It wouldn't install on a Kaggle notebook for me. On a Colab notebook, you'll have to run the install cell, restart the runtime, then run the cell again (don't ask). After that it works, although pip still reports dependency errors for the `kaggle` package install 🤷.
*   That's why I suggest you skip the `pip install` of `kaggle` and the 2 cells after it, and import the datasets manually.
*   Otherwise, you'll need to download `kaggle.json` from your own Kaggle account to be able to use the API. Place it in the root, and the second cell should place it into the right folder for you. Here's a walkthrough on how to get it: https://www.kaggle.com/general/74235
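
If you would rather place the token from Python than from shell commands, here is a minimal sketch. It assumes the Kaggle client looks for the file at `~/.kaggle/kaggle.json`, which is where the second cell copies it. The helper name `install_kaggle_token` and its `home_dir` parameter are made up for this example, and the demo runs in a throwaway directory with a fake token so nothing real is touched.

```python
import shutil
import tempfile
from pathlib import Path


def install_kaggle_token(token_path, home_dir=None):
    """Copy kaggle.json into <home>/.kaggle and restrict its permissions.

    Hypothetical helper for illustration; not part of the Kaggle package.
    """
    home = Path(home_dir) if home_dir is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)
    target.chmod(0o600)  # owner read/write only
    return target


# Demo with a fake token inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    fake_token = Path(tmp) / "kaggle.json"
    fake_token.write_text('{"username": "example", "key": "not-a-real-key"}')
    installed = install_kaggle_token(fake_token, home_dir=Path(tmp) / "home")
    installed_ok = installed.exists()
    installed_name = installed.name
```

On Colab you would call `install_kaggle_token('kaggle.json')` with the file you uploaded to the root.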

The pipeline is as follows:
1.   Join the features dataframe to the main dataframe.
2.   Join the stores dataframe to the main dataframe (which also has the features).
3.   Create features for the occasions that are listed on the competition page. The list can be found at the bottom of: https://www.kaggle.com/competitions/walmart-recruiting-store-sales-forecasting/data
4.   Fill NaN values in the 5 MarkDown columns with 0, because otherwise I believe they would be imputed with the column mean. I do this because NaN here means there was no markdown (i.e., 0).
5.   Create 4 columns: LastHoliday, ComingHoliday, WeeksSinceLastHoliday, WeeksToNextHoliday. For weeks that are holidays, LastHoliday and ComingHoliday are set to the current week's date, so WeeksSinceLastHoliday and WeeksToNextHoliday are both 0 for holiday weeks. Lastly, you'll notice I added previous holiday dates that fall outside the timeframe of the dataframe. I got them from this website: https://www.timeanddate.com/holidays/us/2010?hol=1
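
To make step 5 concrete, here is a small sketch on toy dates (not the competition data). It uses `np.searchsorted` on a sorted holiday list, which is a different technique from the loop used further down in this notebook, but it should produce the same four columns. It also shows why a holiday date before the first week is needed: without `2010-01-15`, the first row would have no "last holiday" to point to.

```python
import numpy as np
import pandas as pd

# Toy weekly dates and a toy, sorted holiday list (assumed values for illustration).
dates = pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-19", "2010-02-26"])
holidays = pd.to_datetime(["2010-01-15", "2010-02-12", "2010-03-05"])

toy = pd.DataFrame({"Date": dates})
date_values = toy["Date"].to_numpy()
holiday_values = holidays.to_numpy()

# Index of the latest holiday on or before each date.
last_idx = np.searchsorted(holiday_values, date_values, side="right") - 1
# Index of the earliest holiday on or after each date.
next_idx = np.searchsorted(holiday_values, date_values, side="left")

toy["LastHoliday"] = holiday_values[last_idx]
toy["ComingHoliday"] = holiday_values[next_idx]

one_week = pd.Timedelta(weeks=1)
toy["WeeksSinceLastHoliday"] = (toy["Date"] - toy["LastHoliday"]) / one_week
toy["WeeksToNextHoliday"] = (toy["ComingHoliday"] - toy["Date"]) / one_week

print(toy)
```

The holiday week (2010-02-12) gets 0 in both week columns, matching the behaviour described in step 5.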

Training and evaluating the results:

*   I only train a random forest because it performed best, but you're welcome to remove the `include` param from the `compare_models` call and see how the other models stack up.
*   Training takes a while on Google Colab, even with just the random forest: definitely more than 30 minutes for me (way longer than expected).
*   The validation set is made by counting the number of weeks in the test set and holding out that many of the latest weeks from the train dataframe.
*   The evaluation metric, WMAE, is implemented as per the competition's guidelines.
*   You should end up with a WMAE of about 2265 on the validation set, which beats 2nd place on the public leaderboard: https://www.kaggle.com/competitions/walmart-recruiting-store-sales-forecasting/leaderboard?tab=public
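
For reference, WMAE is the weighted mean absolute error: sum of `w_i * |y_i - yhat_i|` divided by the sum of `w_i`, where `w_i` is 5 for holiday weeks and 1 otherwise. Here is a tiny worked example on made-up numbers (the column names mirror the ones used in this notebook; the values are invented) to show how a miss on a holiday week pulls the score up compared to plain MAE.

```python
import numpy as np
import pandas as pd

# Made-up rows: one holiday week with the largest error.
toy = pd.DataFrame({
    "IsHoliday": [False, False, True, False],
    "Weekly_Sales": [100.0, 200.0, 300.0, 400.0],
    "Label": [110.0, 190.0, 320.0, 400.0],
})

weights = np.where(toy["IsHoliday"], 5, 1)               # 5 for holiday weeks, 1 otherwise
abs_error = (toy["Weekly_Sales"] - toy["Label"]).abs()   # 10, 10, 20, 0

toy_wmae = float((weights * abs_error).sum() / weights.sum())  # (10 + 10 + 100 + 0) / 8
toy_mae = float(abs_error.mean())                              # 40 / 4

print(toy_wmae, toy_mae)  # 15.0 10.0
```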

In [None]:
!pip install -q pycaret==2.3.10
!pip install -q pandas==1.4.4
!pip install -q numpy==1.19.5
!pip install -q kaggle==1.5.2

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kaggle 1.5.2 requires urllib3<1.23.0,>=1.15, but you have urllib3 1.26.13 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
docker 6.0.1 requires urllib3>=1.26.0, but you have urllib3 1.22 which is incompatible.


In [None]:
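# -p avoids an error if the folder already exists (e.g. when re-running the cell).
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
# Restrict the token to the current user.
!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c walmart-recruiting-store-sales-forecasting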

Downloading features.csv.zip to /content
  0% 0.00/158k [00:00<?, ?B/s]
100% 158k/158k [00:00<00:00, 48.7MB/s]
Downloading train.csv.zip to /content
  0% 0.00/2.47M [00:00<?, ?B/s]
100% 2.47M/2.47M [00:00<00:00, 116MB/s]
Downloading stores.csv to /content
  0% 0.00/532 [00:00<?, ?B/s]
100% 532/532 [00:00<00:00, 466kB/s]
Downloading test.csv.zip to /content
  0% 0.00/235k [00:00<?, ?B/s]
100% 235k/235k [00:00<00:00, 116MB/s]
Downloading sampleSubmission.csv.zip to /content
  0% 0.00/220k [00:00<?, ?B/s]
100% 220k/220k [00:00<00:00, 118MB/s]


In [None]:
from zipfile import ZipFile

# Extract each downloaded archive into the current working directory.
for name in ['test', 'train', 'features']:
    with ZipFile(f'/content/{name}.csv.zip', 'r') as archive:
        archive.extractall()

In [None]:
from pycaret.regression import *
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('train.csv', parse_dates=['Date'])
features = pd.read_csv('features.csv', parse_dates=['Date'])
stores = pd.read_csv('stores.csv')
test = pd.read_csv('test.csv', parse_dates=['Date'])

In [None]:
occasions_dict = {
    'SuperBowl' : ['12-Feb-10', '11-Feb-11', '10-Feb-12', '8-Feb-13'],
    'LaborDay' : ['10-Sep-10', '9-Sep-11', '7-Sep-12', '6-Sep-13'],
    'Thanksgiving' : ['26-Nov-10', '25-Nov-11', '23-Nov-12', '29-Nov-13'],
    'Christmas' : ['31-Dec-10', '30-Dec-11', '28-Dec-12', '27-Dec-13']
}


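def create_occasion_flag_columns(df, date_column='Date', occasions_dict=occasions_dict):
    for occasion, dates in occasions_dict.items():
        feat_name = 'Is' + occasion
        # Parse the 'dd-Mon-yy' strings explicitly so they compare against datetime values.
        occasion_dates = pd.to_datetime(dates, format='%d-%b-%y')
        df[feat_name] = df[date_column].isin(occasion_dates).astype(int)

    return df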

def fill_general_columns_na(df, look_for='MarkDown', fill_with=0):
    # Fill NaN in every column whose name contains `look_for`.
    features = [feature for feature in df.columns if look_for in feature]
    df[features] = df[features].fillna(fill_with)

    return df

def get_holiday_dates(df, holiday_flag_column='IsHoliday', additional_dates=[np.datetime64('2010-01-15'), np.datetime64('2012-10-05'), np.datetime64('2012-11-09')]):
    get_unique_dates = df[df[holiday_flag_column]==1].Date.unique()
    listify_dates = list(get_unique_dates)
    all_holiday_dates = listify_dates + additional_dates
    return all_holiday_dates

def calculate_weeks_from_to_holiday(df):
    holidays_list = get_holiday_dates(df)
    holidays_list.sort()
    holidays_tuple = [(holidays_list[i], holidays_list[i+1]) for i in range(len(holidays_list)-1)]

    df['LastHoliday'] = df['Date']
    df['ComingHoliday'] = df['Date']

    for last_holiday, coming_holiday in holidays_tuple:
        date_filter = (df['Date']>last_holiday) & (df['Date']<coming_holiday)
        df.loc[date_filter, 'LastHoliday'] = last_holiday
        df.loc[date_filter, 'ComingHoliday'] = coming_holiday

    week_divider = np.timedelta64(1,'W')
    df['WeeksSinceLastHoliday'] = np.abs((df['Date']-df['LastHoliday'])/week_divider)
    df['WeeksToNextHoliday'] = np.abs((df['Date']-df['ComingHoliday'])/week_divider)

    return df

def wmae(df, weight_column='IsHoliday', weights={0: 1, 1: 5}, actual_column='Weekly_Sales', predicted_column='Label'):
    # Cast the flag to int first so boolean True/False values match the 0/1 keys in `weights`.
    row_weights = df[weight_column].astype(int).map(weights)
    abs_difference = np.abs(df[actual_column] - df[predicted_column])
    numerator = np.sum(row_weights * abs_difference)
    denominator = np.sum(row_weights)
    return numerator/denominator

In [None]:
df = (df.pipe(pd.merge, right=features, on=['Store', 'Date', 'IsHoliday']) # 1
              .pipe(pd.merge, right=stores, on=['Store']) # 2
              .pipe(create_occasion_flag_columns) # 3
              .pipe(fill_general_columns_na, look_for='MarkDown', fill_with=0) # 4
              .pipe(calculate_weeks_from_to_holiday) # 5
     )

In [None]:
amount_of_weeks_in_test = len(test.Date.unique())
split_date = np.sort(df['Date'].unique())[-amount_of_weeks_in_test]

train = df[df['Date']<split_date]
val = df[df['Date']>=split_date]
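
The split above can be checked on a toy frame. This is only a sketch with invented dates and a pretend test length of 2 weeks; it mirrors the same `np.sort(...)[-n]` logic to show that the validation set ends up holding exactly the latest `n` weeks, with no overlap in dates.

```python
import numpy as np
import pandas as pd

# Six toy weeks, two stores per week (invented data).
weeks = pd.date_range("2012-01-06", periods=6, freq="7D")
toy = pd.DataFrame({"Date": weeks.repeat(2), "Store": [1, 2] * 6})
toy_test_weeks = 2  # pretend the test set spans two weeks

# The n-th latest unique date becomes the first validation week.
split_date = np.sort(toy["Date"].unique())[-toy_test_weeks]

toy_train = toy[toy["Date"] < split_date]
toy_val = toy[toy["Date"] >= split_date]

print(len(toy_train), len(toy_val))  # 8 4
```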

In [None]:
exp_reg101 = setup(data=train, target='Weekly_Sales', session_id=123)

Unnamed: 0,Description,Value
0,session_id,123
1,Target,Weekly_Sales
2,Original Data,"(305982, 24)"
3,Missing Values,False
4,Numeric Features,14
5,Categorical Features,6
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(214187, 51)"


INFO:logs:create_model_container: 0
INFO:logs:master_model_container: 0
INFO:logs:display_container: 1
INFO:logs:Pipeline(memory=None,
         steps=[('dtypes',
                 DataTypes_Auto_infer(categorical_features=[],
                                      display_types=True, features_todrop=[],
                                      id_columns=[], ml_usecase='regression',
                                      numerical_features=[],
                                      target='Weekly_Sales',
                                      time_features=[])),
                ('imputer',
                 Simple_Imputer(categorical_strategy='not_available',
                                fill_value_categorical=None,
                                fill_value_numerical=None,
                                numeric_str...
                ('scaling', 'passthrough'), ('P_transform', 'passthrough'),
                ('binn', 'passthrough'), ('rem_outliers', 'passthrough'),
                ('cluste

In [None]:
best = compare_models(include=['rf'])

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
rf,Random Forest Regressor,1543.8001,16745370.0,4078.5353,0.9682,0.4307,3.5066,263.064


INFO:logs:create_model_container: 1
INFO:logs:master_model_container: 1
INFO:logs:display_container: 2
INFO:logs:RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=-1, oob_score=False,
                      random_state=123, verbose=0, warm_start=False)
INFO:logs:compare_models() succesfully completed......................................


In [None]:
wmae(predict_model(best, data=val), actual_column='Weekly_Sales', predicted_column='Label')

INFO:logs:Initializing predict_model()
INFO:logs:predict_model(estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=-1, oob_score=False,
                      random_state=123, verbose=0, warm_start=False), probability_threshold=None, encoded_labels=True, drift_report=False, raw_score=False, round=4, verbose=True, ml_usecase=MLUsecase.REGRESSION, display=None, drift_kwargs=None)
INFO:logs:Checking exceptions
INFO:logs:Preloading libraries
INFO:logs:Preparing display monitor


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Random Forest Regressor,2120.8974,21076960.0,4590.9653,0.9571,0.6172,36.8045


2265.5267334836944