# Introduction

This notebook is dedicated to building a regression model based on LGBMRegressor (Light Gradient Boosting Machine) to predict the sales of a set of stores. Various data sources are considered for this purpose:

1. General information regarding sales and stores (files `train.csv`, `test.csv`, and `stores.csv`),
2. Variations in oil prices (`oil.csv`), a factor that can influence consumer purchasing power,
3. Public holidays and special events (`holidays_events.csv`) that may affect buying behaviors,
4. Transaction records (`transactions.csv`).

After importing this data, the notebook proceeds with a series of preprocessing steps, including merging the different sources of information, filling in missing values, and encoding categorical data. Time-based features are also added to enrich the model.

Once these preparation steps are completed, the labeled data is split into a training set and a hold-out set, and an LGBMRegressor model is trained on the training portion. Predictions are generated for the hold-out set, negative values are clipped to zero since a sales prediction cannot be negative, and the result is evaluated using the mean squared logarithmic error.

In [28]:
import pandas as pd
import numpy as np
import scipy as sp 
import seaborn as sns
import matplotlib.pyplot as plt
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error

The following cells import multiple datasets from CSV files and prepare the data for further analysis.

1. `holidays = pd.read_csv('holidays_events.csv', parse_dates=True)`: Imports the special events and holidays data. Note that `parse_dates=True` only attempts to parse the index, so without an `index_col` the `date` column is still read as strings; it is converted explicitly in step 7.

2. `oil = pd.read_csv('oil.csv', parse_dates=True, index_col=0)`: Imports the oil price data, while setting the first column (which is the date) as the DataFrame's index.

3. `sample_submission = pd.read_csv('sample_submission.csv')`: Imports the submission template used to properly format our predictions before submitting them for evaluation.

4. `stores = pd.read_csv('stores.csv')`: Imports detailed information about each store.

5. `data_test = pd.read_csv('test.csv')` and `data_train = pd.read_csv('train.csv')`: Import the test and training data respectively. The training data contains the `sales` target to learn from; the test data contains the rows to predict.

6. `transactions = pd.read_csv('transactions.csv')`: Imports information about all transactions made.

7. `data_test['date'] = pd.to_datetime(data_test.date)`, `data_train['date'] = pd.to_datetime(data_train.date)`, and `holidays['date'] = pd.to_datetime(holidays.date)`: Converts the date column in the test, training, and holidays DataFrames to datetime. This makes it easier to further process dates and times.
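
As a small illustration of how `parse_dates` behaves, the sketch below reads a made-up two-row CSV from memory. The `csv_text` content and the variable names are hypothetical and are not taken from the competition files. With `parse_dates=True` and no `index_col`, the `date` column should remain a string column, whereas naming the column explicitly converts it while reading.

```python
import io
import pandas as pd

# Hypothetical stand-in for a file such as holidays_events.csv
csv_text = "date,type\n2017-01-01,Holiday\n2017-01-02,Work Day\n"

# parse_dates=True only attempts to parse the index, so 'date' stays a string column
as_strings = pd.read_csv(io.StringIO(csv_text), parse_dates=True)

# Naming the column explicitly parses it at load time
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

print(as_strings['date'].dtype)
print(as_dates['date'].dtype)
```

Passing `parse_dates=['date']` would make the later `pd.to_datetime` calls unnecessary, although keeping them is harmless.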

In [2]:
holidays = pd.read_csv('holidays_events.csv', parse_dates=True)
oil = pd.read_csv('oil.csv', parse_dates=True, index_col=0)
sample_submission = pd.read_csv('sample_submission.csv')
stores = pd.read_csv('stores.csv')
data_test = pd.read_csv('test.csv')
data_train = pd.read_csv('train.csv')
transactions = pd.read_csv('transactions.csv')

In [3]:
data_test['date'] = pd.to_datetime(data_test.date)
data_train['date'] = pd.to_datetime(data_train.date)
holidays['date'] = pd.to_datetime(holidays.date)

This portion of code performs several preprocessing operations on the data:

1. **Aligning the 'oil' data:** The code first builds a new index for the 'oil' DataFrame from the unique dates present in the 'data_train' and 'data_test' DataFrames. Reindexing 'oil' on it introduces missing prices for dates without a quote, and these are filled using forward fill followed by backward fill.

2. **DataFrames merging:** The 'data_train' and 'data_test' DataFrames are then enriched by merging them with the 'oil', 'stores', and 'holidays' DataFrames on the appropriate columns. The 'left' method is used, which means that all rows from the 'data_train' and 'data_test' DataFrames are preserved, and only the corresponding rows from the other DataFrames are used. The 'holidays' DataFrame is first deduplicated by date so that a date with several events does not multiply the rows.

3. **Handling missing values:** Next, all columns with missing values in 'data_test' are identified. In both 'data_train' and 'data_test', the missing values in these columns are filled with the string 'None', so that absent entries form their own category when the columns are encoded.

4. **Data factorization:** The 'data_train' and 'data_test' DataFrames are then concatenated into a single DataFrame called 'data_factorize', and all columns of type 'object' are factorized. This means that unique string values are replaced with integer numbers, which is often necessary for machine learning algorithms.

5. **Train-test data separation:** The 'data_factorize' DataFrame is then split back into 'data_train' and 'data_test' using the previously stored index ('test_idx').

6. **Adding time features:** Finally, the 'add_time_cols' function is defined and used to add time features to the 'data_train' and 'data_test' DataFrames: 'day' (which holds the day of the week), 'month', and 'year'. The 'date' column is then removed from these DataFrames.
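
To make the merging step more concrete, here is a minimal sketch on invented data. The `sales` and `events` frames are hypothetical toy stand-ins, with column names chosen to resemble the notebook's. It shows why deduplicating by date before a left merge keeps the row count unchanged.

```python
import pandas as pd

# Hypothetical toy frames; the values are invented for illustration
sales = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-03']),
    'sales': [10.0, 12.0, 9.0],
})
events = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-01', '2017-01-01']),
    'type': ['Holiday', 'Event'],
})

# Two events share a date, so a plain left merge duplicates that sales row
naive = pd.merge(sales, events, on='date', how='left')

# Keeping one event per date preserves the original row count
deduped = pd.merge(sales, events.drop_duplicates(subset='date'), on='date', how='left')

print(len(sales), len(naive), len(deduped))  # prints: 3 4 3
```

Dates without any event end up with missing values in the merged columns, which is what the subsequent fill step addresses.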

In [4]:
# Build one index covering every date in the training and test sets
index_oil = pd.Index(
    pd.concat([pd.Series(data_train.date.unique()), pd.Series(data_test.date.unique())],
              ignore_index=True),
    name='date')
oil = oil.reindex(index_oil).reset_index()
# fillna(method=...) is deprecated in recent pandas; ffill()/bfill() are the equivalents
oil = oil.ffill().bfill()
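
The effect of reindexing and then filling can be checked on a tiny example. This is only a sketch on made-up numbers: the `prices` frame, its `price` column, and the date range are hypothetical.

```python
import pandas as pd

# Hypothetical prices quoted on only two of six calendar days
prices = pd.DataFrame(
    {'price': [50.0, 52.0]},
    index=pd.to_datetime(['2017-01-02', '2017-01-05']),
)
all_dates = pd.Index(pd.date_range('2017-01-01', '2017-01-06'), name='date')

# Dates without a quote become NaN after reindexing
aligned = prices.reindex(all_dates)

# Carry the last known price forward, then backfill the leading gap
filled = aligned.ffill().bfill()

print(filled['price'].tolist())  # prints: [50.0, 50.0, 50.0, 50.0, 52.0, 52.0]
```

The backward fill only matters for dates that precede the first available quote; every other gap is already covered by the forward fill.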

In [5]:
data_train = pd.merge(data_train, oil.reset_index(), on='date', how='left').drop(['index'], axis=1)
data_test = pd.merge(data_test, oil.reset_index(), on='date', how='left').drop(['index'], axis=1)
data_train = pd.merge(data_train, stores, on='store_nbr', how='left')
data_test = pd.merge(data_test, stores, on='store_nbr', how='left')
data_train = pd.merge(data_train, holidays.drop_duplicates(subset='date'), on='date', how='left')
data_test = pd.merge(data_test, holidays.drop_duplicates(subset='date'), on='date', how='left')

In [6]:
data_train = data_train.set_index('id')
data_test = data_test.set_index('id')

In [7]:
na_col = [c for c in data_test.isna().any()[data_test.isna().any()].index]
data_train[na_col] = data_train[na_col].fillna(value='None')
data_test[na_col] = data_test[na_col].fillna(value='None')

In [8]:
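# Positional boundary between training and test rows in the concatenated frame;
# using len() avoids relying on the 'id' labels matching row positions for iloc
test_idx = len(data_train)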
data_factorize = pd.concat([data_train, data_test], axis=0)
for col, types in zip(data_factorize, data_factorize.dtypes):
    if types == 'object':
        data_factorize[col] = pd.factorize(data_factorize[col])[0]
        
data_test = data_factorize.iloc[test_idx:].drop(['sales'], axis=1)
data_train = data_factorize.iloc[:test_idx]
del data_factorize
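
The idea behind factorizing the concatenated frame is that each string receives the same integer code in both splits. Below is a minimal sketch on invented data; the `train` and `test` frames and the `category` column are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames standing in for data_train and data_test
train = pd.DataFrame({'category': ['a', 'b', 'a'], 'sales': [1.0, 2.0, 3.0]})
test = pd.DataFrame({'category': ['b', 'c']})

n_train = len(train)
combined = pd.concat([train, test], axis=0, ignore_index=True)

# Factorizing the combined column assigns consistent codes across both splits
codes, uniques = pd.factorize(combined['category'])
combined['category'] = codes

# Split back by position; the test part has no target column
train_enc = combined.iloc[:n_train]
test_enc = combined.iloc[n_train:].drop(columns=['sales'])

print(train_enc['category'].tolist(), test_enc['category'].tolist())  # prints: [0, 1, 0] [1, 2]
```

Factorizing each split separately could assign different codes to the same string, which is presumably why the notebook concatenates first.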

In [20]:
def add_time_cols(data):
    data['day'] = data['date'].dt.dayofweek
    data['month'] = data['date'].dt.month
    data['year'] = data['date'].dt.year
    data.drop(['date'], axis=1, inplace=True)
    return data

data_train, data_test = add_time_cols(data_train), add_time_cols(data_test)

This portion of code corresponds to training and evaluating the regression model:

1. **Separation of features and target:** The training data is divided into features (X) and target (y). The target is the 'sales' column that the model will try to predict.

2. **Splitting the data into training and hold-out sets:** The data is then split using the `train_test_split` function from scikit-learn. The first 80% of the rows are used for training the model, and the remaining 20% are held out (as `X_test` and `y_test`) to measure its performance. The `shuffle=False` option ensures that the data is not shuffled before the split, which is important for preserving the temporal structure of the data; with shuffling disabled, `random_state` has no effect.

3. **Training the model:** A LightGBM regression model is then trained on the training set. The `n_jobs=-1` option allows using all the CPU cores for training.

4. **Prediction and evaluation:** The model is used to predict sales on the held-out 20% of the data. Negative predictions are set to 0 because a sales prediction cannot be negative, and the metric is not defined for negative values. The mean squared logarithmic error (MSLE) between the predicted and actual values is then calculated and displayed.

5. **Checking predictions:** Lastly, the minimum of the predictions is displayed as a sanity check: a value of at least 0 confirms that no negative prediction remains.
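
For reference, the mean squared logarithmic error is the mean of the squared differences between `log(1 + y_true)` and `log(1 + y_pred)`. The sketch below, on invented numbers, clips negative predictions and compares a manual computation with scikit-learn's; `np.clip` is used here as an alternative to the notebook's `np.where`.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Hypothetical values; one prediction is negative on purpose
y_true = np.array([0.0, 10.0, 100.0])
y_pred = np.array([-2.0, 12.0, 90.0])

# Clip negatives to zero; the metric is not defined for negative inputs
y_pred = np.clip(y_pred, 0, None)

# Manual MSLE: mean squared difference of log(1 + value)
manual = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)
library = mean_squared_log_error(y_true, y_pred)

print(manual, library)
```

Because the metric works on logarithms, it penalizes relative rather than absolute errors.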

In [25]:
X, y = data_train.drop(['sales'], axis=1), data_train['sales']

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=False)
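
What `shuffle=False` does can be seen on a toy array (the `rows` array is hypothetical): the last 20% of the rows, in their original order, form the held-out part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten rows in chronological order, represented by their positions
rows = np.arange(10)

# Without shuffling, the split is a simple cut: first 80%, then last 20%
first_part, last_part = train_test_split(rows, test_size=0.2, shuffle=False)

print(first_part.tolist(), last_part.tolist())  # prints: [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```
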

In [52]:
model = LGBMRegressor(n_jobs=-1)
model.fit(X_train, y_train)
preds = model.predict(X_test)
preds = np.where(preds < 0, 0, preds)
print(mean_squared_log_error(y_test, preds))

3.118522077604036


In [49]:
np.where(preds < 0, 0, preds).min()

125.5684359719774