# 1.3 Create Variables

We will create:

- An ***independent*** dataframe: one row per day, independent of store numbers and product groups. Here we create variables that depend only on the date, regardless of the store (for example, the day of the week).

- An ***all stores*** dataframe: we aggregate all 54 stores and add variables that depend on the store (for example, `store_closed`).

- The ***product group*** dataframes: the main dataframes, where we aggregate per product and then add the previously created data from the independent and all stores dataframes.
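
As an illustration of the first item, here is a minimal sketch of a date-only dataframe. The function name `build_independent_df`, the column names and the date range are assumptions made for this example; they are not the notebook's actual code.

```python
import pandas as pd


def build_independent_df(start, end):
    """One row per calendar day, with features that depend only on the date."""
    out = pd.DataFrame({'date': pd.date_range(start, end, freq='D')})
    out['day_of_week'] = out['date'].dt.dayofweek   # Monday = 0 ... Sunday = 6
    out['month'] = out['date'].dt.month
    out['is_weekend'] = (out['day_of_week'] >= 5).astype(int)
    return out


# Hypothetical one-week range, chosen only for the demonstration
independent_df = build_independent_df('2017-08-01', '2017-08-07')
print(independent_df)
```

Because these features do not depend on the store or the product, they can be computed once and merged onto every other dataframe by `date`.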

# 1.3.1 Independent Dataframe

In [11]:
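def create_multi_store_one_product_df(df, product_name):
    """Return the rows of all stores for a single product family."""

    # .copy() so that later in-place edits do not act on a slice of df
    multistore_single_product = df[df['family'] == product_name].copy()

    return multistore_single_product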

In [12]:
def create_holiday_variables(df):

    holidays = pd.read_csv('/kaggle/input/store-sales-time-series-forecasting/holidays_events.csv')

    # Keep only holidays that were not transferred to another date
    # (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
    holidays = holidays[holidays['transferred'] == False].copy()
    holidays['holiday_type'] = holidays['type']
    holidays.drop(['transferred', 'description', 'type'], axis=1, inplace=True)

    # National holidays: one row per date, merged on date only
    national_holidays = holidays[holidays['locale'] == 'National'].copy()
    national_holidays['national_holiday_type'] = national_holidays['holiday_type']
    national_holidays.drop(['locale', 'locale_name', 'holiday_type'], axis=1, inplace=True)
    national_holidays.drop_duplicates(subset='date', keep="first", inplace=True)
    df = pd.merge(df, national_holidays, how='left', on=['date'])

    # Regional holidays: merged on date and state
    state_holidays = holidays[holidays['locale'] == 'Regional'].copy()
    state_holidays['state'] = state_holidays['locale_name']
    state_holidays['state_holiday_type'] = state_holidays['holiday_type']
    state_holidays.drop(['locale', 'locale_name', 'holiday_type'], axis=1, inplace=True)
    df = pd.merge(df, state_holidays, how='left', on=['date', 'state'])

    # Local holidays: merged on date and city
    city_holidays = holidays[holidays['locale'] == 'Local'].copy()
    city_holidays['city'] = city_holidays['locale_name']
    city_holidays['city_holiday_type'] = city_holidays['holiday_type']
    city_holidays.drop(['locale', 'locale_name', 'holiday_type'], axis=1, inplace=True)
    # Drop the row with index label 265 (hard-coded for this holidays file)
    city_holidays.drop([265], axis=0, inplace=True)
    df = pd.merge(df, city_holidays, how='left', on=['date', 'city'])

    # Single holiday_type column, with priority national > state > city
    df['holiday_type'] = (df['national_holiday_type']
                          .fillna(df['state_holiday_type'])
                          .fillna(df['city_holiday_type']))
    df.drop(['national_holiday_type', 'state_holiday_type', 'city_holiday_type'], axis=1, inplace=True)

    return df
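
The last step of `create_holiday_variables` collapses the three holiday columns into one by taking the first non-missing value in the order national, state, city. The toy example below sketches that priority rule in isolation; the four rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: each column holds the holiday type found at that level, if any
toy = pd.DataFrame({
    'national_holiday_type': ['Holiday', np.nan, np.nan, np.nan],
    'state_holiday_type':    ['Event', 'Additional', np.nan, np.nan],
    'city_holiday_type':     [np.nan, 'Bridge', 'Holiday', np.nan],
})

# Take the first non-missing value in priority order: national, state, city
toy['holiday_type'] = (
    toy['national_holiday_type']
    .fillna(toy['state_holiday_type'])
    .fillna(toy['city_holiday_type'])
)

print(toy['holiday_type'].tolist())
```

A day with no holiday at any level keeps a missing value in `holiday_type`.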

# 1.3.3 Product Dataframes

The last step of our variable creation phase is to create 33 **product** dataframes. We create a pipeline that integrates both our **independent** and our **all stores** dataframes.

In [17]:
# --- Execute Full Product Pipeline for each product --- #

# List all product families:

list_of_families = ['AUTOMOTIVE', 'BABY CARE', 'BEAUTY', 'BEVERAGES', 'BOOKS',
                    'BREAD/BAKERY', 'CELEBRATION', 'CLEANING', 'DAIRY', 'DELI', 'EGGS',
                    'FROZEN FOODS', 'GROCERY I', 'GROCERY II', 'HARDWARE',
                    'HOME AND KITCHEN I', 'HOME AND KITCHEN II', 'HOME APPLIANCES',
                    'HOME CARE', 'LADIESWEAR', 'LAWN AND GARDEN', 'LINGERIE',
                    'LIQUOR,WINE,BEER', 'MAGAZINES', 'MEATS', 'PERSONAL CARE',
                    'PET SUPPLIES', 'PLAYERS AND ELECTRONICS', 'POULTRY',
                    'PREPARED FOODS', 'PRODUCE', 'SCHOOL AND OFFICE SUPPLIES',
                    'SEAFOOD']

# Create new .csv for each product family:

for x in list_of_families:

    this_df = full_product_pipeline(x)

    # '/' would be read as a directory separator in the output path
    if x == 'BREAD/BAKERY':
        x = 'BREADBAKERY'

    print('Completed eda for ' + str(x))
    this_df.to_csv('/kaggle/working/' + str(x) + '.csv', index=False)

Completed eda for AUTOMOTIVE
Completed eda for BABY CARE
Completed eda for BEAUTY
Completed eda for BEVERAGES
Completed eda for BOOKS
Completed eda for BREADBAKERY
Completed eda for CELEBRATION
Completed eda for CLEANING
Completed eda for DAIRY
Completed eda for DELI
Completed eda for EGGS
Completed eda for FROZEN FOODS
Completed eda for GROCERY I
Completed eda for GROCERY II
Completed eda for HARDWARE
Completed eda for HOME AND KITCHEN I
Completed eda for HOME AND KITCHEN II
Completed eda for HOME APPLIANCES
Completed eda for HOME CARE
Completed eda for LADIESWEAR
Completed eda for LAWN AND GARDEN
Completed eda for LINGERIE
Completed eda for LIQUOR,WINE,BEER
Completed eda for MAGAZINES
Completed eda for MEATS
Completed eda for PERSONAL CARE
Completed eda for PET SUPPLIES
Completed eda for PLAYERS AND ELECTRONICS
Completed eda for POULTRY
Completed eda for PREPARED FOODS
Completed eda for PRODUCE
Completed eda for SCHOOL AND OFFICE SUPPLIES
Completed eda for SEAFOOD
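
The loop above special-cases `'BREAD/BAKERY'` because the slash would be interpreted as a directory separator. A small helper could apply the same rule to any family name; `safe_filename` is a hypothetical name introduced for this sketch and is not part of the notebook.

```python
def safe_filename(family):
    """Remove path separators so the family name can be used as a file name."""
    return family.replace('/', '')


print(safe_filename('BREAD/BAKERY'))
print(safe_filename('GROCERY I'))
```

Using one helper both when writing and when reading the per-family files would keep the two code paths in sync.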


# 2. Modelling

# 2.1 Validation Testing

In [18]:
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBRegressor

sample_submission = pd.read_csv('/kaggle/input/store-sales-time-series-forecasting/sample_submission.csv')

In [19]:
def scorethis_rmsle(prediction_list, y_list):

    # Inputs are expected to be on the log scale already, so the RMSE
    # computed here corresponds to the RMSLE of the raw sales.
    scorelist = list()

    for x in range(len(prediction_list)):

        log_score_x = np.abs(np.abs(prediction_list[x]) - np.abs(y_list[x]))

        try:
            # Series of differences: add every value
            scorelist.extend(log_score_x.values)
        except AttributeError:
            # Scalar difference: add it as is
            scorelist.append(log_score_x)

    score_array = np.array(scorelist)

    rmsle = np.sqrt(np.mean(score_array**2)) # sqrt of the mean of the squared log differences
    rmsle = np.round(rmsle, 3)

    return rmsle
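
The scoring function relies on the fact that a plain RMSE computed on log-transformed values is the RMSLE of the raw values. The sketch below checks this numerically, assuming the transform is `np.log1p`; the helper names and the sales figures are made up for the example.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on raw (untransformed) values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))


def rmse(a, b):
    """Plain root mean squared error."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(np.mean((a - b) ** 2))


sales_true = [0.0, 3.0, 10.0, 250.0]   # made-up raw sales
sales_pred = [1.0, 2.0, 12.0, 200.0]   # made-up raw predictions

# RMSE on log1p-transformed values equals RMSLE on the raw values
direct = rmsle(sales_true, sales_pred)
via_logs = rmse(np.log1p(sales_true), np.log1p(sales_pred))
print(np.isclose(direct, via_logs))
```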

In [20]:
def create_validation(this_family_df, validation=True):

    if validation is True:

        # For validation, drop the last 864 rows, which are the submission rows
        this_family_df = this_family_df[:-864]

    # Work on a copy so the in-place drop below does not act on a slice
    this_family_df = this_family_df.copy()
    this_family_sales = this_family_df['sales']

    this_family_df.drop(['sales', 'date'], axis=1, inplace=True)

    # Scale Data           #

    scaler = MinMaxScaler()
    this_family_df[this_family_df.columns] = scaler.fit_transform(this_family_df[this_family_df.columns])

    # Split Train and Test #

    test = this_family_df.iloc[-864:]
    test_y = this_family_sales.iloc[-864:]

    train = this_family_df.iloc[:-864]
    train_y = this_family_sales.iloc[:-864]

    return train, train_y, test, test_y
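
`create_validation` scales all rows together before splitting off the last 864 rows (a number that would correspond to 54 stores times 16 days). A possible variant, sketched below under the assumption that one prefers the holdout rows not to influence the scaling, splits first and uses only training statistics. `split_then_scale` and the toy columns are hypothetical, and plain pandas arithmetic stands in for `MinMaxScaler` to keep the example dependency-free.

```python
import pandas as pd


def split_then_scale(features, holdout=864):
    """Split off the last `holdout` rows, then min-max scale with training statistics only."""
    train = features.iloc[:-holdout]
    test = features.iloc[-holdout:]

    col_min = train.min()
    # Replace a zero range by 1 to avoid division by zero for constant columns
    col_range = (train.max() - col_min).replace(0, 1)

    train_scaled = (train - col_min) / col_range
    test_scaled = (test - col_min) / col_range   # may fall outside [0, 1]
    return train_scaled, test_scaled


# Made-up feature frame with 10 rows; the last 2 act as the holdout
toy = pd.DataFrame({'onpromotion': range(10), 'day_of_week': [d % 7 for d in range(10)]})
train_scaled, test_scaled = split_then_scale(toy, holdout=2)
print(train_scaled.shape, test_scaled.shape)
```

Tree-based models split on feature thresholds and are insensitive to monotonic rescaling of a feature, so the practical difference for the model used here is likely small.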

In [21]:
def xgb_run(train, train_y, test, test_y, validation=True):

    # Create Model  #

    xgb_model = XGBRegressor(
        colsample_bytree=0.7,
        learning_rate=0.055,
        min_child_samples=10,  # LightGBM parameter name; XGBoost does not use it and warns that it "might not be used"
        max_depth=6,
        objective='reg:squarederror',
        n_estimators=1000,
        n_jobs=4,
        random_state=337)

    # Execute XGB   #

    xgb_model.fit(train, train_y)
    xgb_pred = xgb_model.predict(test).tolist()
    xgb_pred = [round(x, 2) for x in xgb_pred]

    if validation:

        # validation set also has ground truths:
        test_y = test_y.to_list()

        return xgb_pred, test_y

    else:

        return xgb_pred


In [22]:
def execute_validation(thisfunc):

    double_list_of_predictions = []
    double_list_of_ground_truths = []

    for x in list_of_families: # 33
        
        if x == 'BREAD/BAKERY':

            x = 'BREADBAKERY'
            # Otherwise would create an error searching for the BREAD/ directory instead of the file

        print('Evaluating '+str(x)+'...')
        
        this_df = pd.read_csv('/kaggle/working/' + str(x) + '.csv')

        train, train_y, test, test_y = create_validation(this_df)
        pred, y = thisfunc(train, train_y, test, test_y, validation=True)
        
        if x == 'BOOKS':

            # Constant prediction for BOOKS: 0.6931471805599453 = ln(2), which maps
            # back to 0 under the inverse transform used for the submission
            # (np.expm1(x) - 1)
            zero_list = [0.6931471805599453] * 864

            double_list_of_predictions.append(zero_list)
            double_list_of_ground_truths.append(y) 
            
        else:
            
            double_list_of_predictions.append(pred) # 33 * [864]
            double_list_of_ground_truths.append(y) # 33 * [864]

    list_of_predictions = list()
    list_of_ground_truths = list()

    for x in double_list_of_predictions:
        for y in x:
            list_of_predictions.append(y) # unpack 33 * 864

    for x in double_list_of_ground_truths:
        for z in x:
            list_of_ground_truths.append(z) # unpack 33 * 864

    return list_of_predictions, list_of_ground_truths
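
Two details of `execute_validation` can be checked in isolation: the constant used for BOOKS, and the flattening of the per-family lists. The sketch below assumes the inverse transform `np.expm1(x) - 1` that appears later in the submission code; the nested list holds made-up values.

```python
from itertools import chain

import numpy as np

BOOKS_CONSTANT = 0.6931471805599453   # value hard-coded in the notebook

# The constant is ln(2) ...
print(np.isclose(BOOKS_CONSTANT, np.log(2)))
# ... so the inverse expm1(x) - 1 sends it to 0
print(np.isclose(np.expm1(BOOKS_CONSTANT) - 1, 0.0))

# Flattening a list of per-family lists, equivalent to the two nested loops
nested = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # made-up values
flat = list(chain.from_iterable(nested))
print(flat)
```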

# 2.2 Kaggle Submission

Now we run the same **XGBRegressor** model we tested above on our Kaggle submission set.

In [24]:
def execute_submission(thisfunc):

    list_of_predictions = []

    for x in list_of_families:
        
        if x == 'BREAD/BAKERY':

            x = 'BREADBAKERY'
            # Otherwise would create an error searching for the BREAD/ directory instead of the file

        print('Evaluating '+str(x)+'...')
        this_df = pd.read_csv('/kaggle/working/' + str(x) + '.csv')
        
        if x == 'BOOKS':

            # Constant prediction for BOOKS: 0.6931471805599453 = ln(2), which maps
            # back to 0 under the inverse transform applied below (np.expm1(x) - 1)
            zero_list = [0.6931471805599453] * 864

            list_of_predictions.append(zero_list)

        else:
    
            train, train_y, test, test_y = create_validation(this_df, validation=False)
            pred = thisfunc(train, train_y, test, test_y=None, validation=False)
            list_of_predictions.append(pred)
    

    # Put Back In Submission Form # 
    
    restructured_predictions = list()

    for y in range(864):

        for z in range(33):
            restructured_predictions.append(list_of_predictions[z][y])

    restructured_predictions = np.expm1(restructured_predictions) - 1

    return restructured_predictions
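
The restructuring loop in `execute_submission` interleaves the per-family lists so that, for each row position, the predictions of all families sit next to each other. This is a transpose followed by a row-wise flatten, as the toy comparison below sketches with 3 families and 4 rows instead of 33 and 864.

```python
import numpy as np

n_families, n_rows = 3, 4   # toy sizes; the notebook uses 33 and 864
list_of_predictions = [[10 * f + r for r in range(n_rows)] for f in range(n_families)]

# Loop version, as in execute_submission
looped = []
for y in range(n_rows):
    for z in range(n_families):
        looped.append(list_of_predictions[z][y])

# Vectorised version: transpose to (rows, families), then flatten row by row
vectorised = np.asarray(list_of_predictions).T.ravel()

print(looped)
print(vectorised.tolist())
```

Both versions require every family to contribute the same number of predictions.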

In [25]:
# --- Execute Submission --- #

restructured_predictions = execute_submission(xgb_run)
sample_submission['sales'] = restructured_predictions

# Convert some (slightly) negative predictions to a zero prediction:
sample_submission['sales'] = [0 if x < 0 else x for x in sample_submission['sales']]

sample_submission.to_csv('/kaggle/working/submission.csv', index=False)

Evaluating AUTOMOTIVE...
Parameters: { "min_child_samples" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Evaluating BABY CARE...

Evaluating BEAUTY...

Thanks for reading,


- **If you enjoyed this notebook or if you learned something, a simple upvote would be greatly appreciated.**

- **If you find a way to improve on this notebook, let me know in the comments!**


Arnout