# Introduction

## Brief Description of the Competition and Its Goals

In this competition, we aim to predict sales for the product families sold at Favorita stores in Ecuador. The training data includes dates, store and product family information, the number of items on promotion, and the sales figures. Supplementary files (store metadata, oil prices, and holidays and events) provide additional information that may be useful in building our models.

Specifically, we'll build a model that more accurately predicts the unit sales for thousands of items sold at different Favorita stores. This will help ensure retailers have just enough of the right products at the right time, decreasing food waste related to overstocking and improving customer satisfaction.

## Explanation of the Evaluation Metric (RMSLE)

The evaluation metric for this competition is Root Mean Squared Logarithmic Error (RMSLE). The RMSLE is calculated as:

$$
RMSLE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\log(1 + \hat{y}_i) - \log(1 + y_i))^2}
$$

where:
- \( n \) is the total number of instances,
- \( \hat{y}_i \) is the predicted value of the target for instance \( i \),
- \( y_i \) is the actual value of the target for instance \( i \), and
- \( \log \) is the natural logarithm.

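This metric suits the competition for two reasons. Because it compares logarithms, it measures relative rather than absolute error, so a miss of 10 units matters more for a slow-selling family than for a high-volume one. It also penalizes an underestimate more than an overestimate of the same absolute size, which is important in a retail context where underestimating sales can lead to stockouts and lost revenue.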
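To make the asymmetry concrete, here is a minimal sketch that implements the formula directly with NumPy. The helper name `rmsle` and the sales figures are our own illustrative choices, not part of the competition data.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error, following the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # np.log1p(x) computes log(1 + x) and stays accurate for small x
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Hypothetical case: actual sales of 100 units, missed by 50 in each direction
actual = [100.0]
under = rmsle(actual, [50.0])   # underestimate by 50 units
over = rmsle(actual, [150.0])   # overestimate by 50 units
print(f"underestimate: {under:.4f}, overestimate: {over:.4f}")
```

With these numbers the underestimate scores roughly 0.68 and the overestimate roughly 0.40, even though both predictions are off by the same 50 units.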

# Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Load Dataset
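Load the training and test sets together with the supplementary store, oil price, and holiday files from the `data/` directory.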

In [2]:
# Load the training and test datasets
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

# Load the supplementary datasets
stores_df = pd.read_csv('data/stores.csv')
oil_df = pd.read_csv('data/oil.csv')
holidays_events_df = pd.read_csv('data/holidays_events.csv')

In [7]:
# Display the first few rows of each dataset
print("Training Data:")
print(train_df.head())
print("Size of training data:", train_df.shape)

print("\nTest Data:")
print(test_df.head())
print("Size of test data:", test_df.shape)

print("\nStores Data:")
print(stores_df.head())
print("Size of stores data:", stores_df.shape)

print("\nOil Data:")
print(oil_df.head())
print("Size of oil data:", oil_df.shape)

print("\nHolidays and Events Data:")
print(holidays_events_df.head())
print("Size of holidays and events data:", holidays_events_df.shape)

Training Data:
   id        date  store_nbr      family  sales  onpromotion
0   0  2013-01-01          1  AUTOMOTIVE    0.0            0
1   1  2013-01-01          1   BABY CARE    0.0            0
2   2  2013-01-01          1      BEAUTY    0.0            0
3   3  2013-01-01          1   BEVERAGES    0.0            0
4   4  2013-01-01          1       BOOKS    0.0            0
Size of training data: (3000888, 6)

Test Data:
        id        date  store_nbr      family  onpromotion
0  3000888  2017-08-16          1  AUTOMOTIVE            0
1  3000889  2017-08-16          1   BABY CARE            0
2  3000890  2017-08-16          1      BEAUTY            2
3  3000891  2017-08-16          1   BEVERAGES           20
4  3000892  2017-08-16          1       BOOKS            0
Size of test data: (28512, 5)

Stores Data:
   store_nbr           city                           state type  cluster
0          1          Quito                       Pichincha    D       13
1          2          Qui

# Data Preprocessing

In [5]:
# Check for missing values in the training dataset
print("Missing values in training data:")
print(train_df.isnull().sum())

# Check for missing values in the test dataset
print("\nMissing values in test data:")
print(test_df.isnull().sum())

# Check for missing values in the stores dataset
print("\nMissing values in stores data:")
print(stores_df.isnull().sum())

# Check for missing values in the oil dataset
print("\nMissing values in oil data:")
print(oil_df.isnull().sum())

# Check for missing values in the holidays and events dataset
print("\nMissing values in holidays and events data:")
print(holidays_events_df.isnull().sum())

Missing values in training data:
id             0
date           0
store_nbr      0
family         0
sales          0
onpromotion    0
dtype: int64

Missing values in test data:
id             0
date           0
store_nbr      0
family         0
onpromotion    0
dtype: int64

Missing values in stores data:
store_nbr    0
city         0
state        0
type         0
cluster      0
dtype: int64

Missing values in oil data:
date           0
dcoilwtico    43
dtype: int64

Missing values in holidays and events data:
date           0
type           0
locale         0
locale_name    0
description    0
transferred    0
dtype: int64


Only the oil data has missing values: 43 of its 1218 rows.

Since oil prices form a daily time series, forward fill (propagating the last observed value forward) is a convenient choice. It preserves the continuity of the series and fills each gap with a locally plausible price, whereas mean or median imputation would pull every gap toward a single global value.

In [8]:
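# Forward fill: replace each missing value with the last observed value.
# Assign the result back to the column; calling fillna(..., inplace=True) on a
# column selection is deprecated and will stop working in pandas 3.0.
oil_df['dcoilwtico'] = oil_df['dcoilwtico'].ffill()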

# Verify that there are no missing values left
print("Missing values in oil data after forward fill:")
print(oil_df.isnull().sum())

Missing values in oil data after forward fill:
date          0
dcoilwtico    1
dtype: int64
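One value is still missing after the forward fill. Forward fill can only copy from an earlier observation, so any value it leaves behind must sit at the very start of the series. One way to handle it, sketched here on a toy frame with made-up prices rather than the real `oil_df`, is to chain a backward fill after the forward fill:

```python
import numpy as np
import pandas as pd

# Toy stand-in for oil_df (prices are made up for illustration)
toy_oil = pd.DataFrame({
    "date": pd.date_range("2013-01-01", periods=5, freq="D"),
    "dcoilwtico": [np.nan, 93.14, np.nan, 92.97, np.nan],
})

# Forward fill handles interior and trailing gaps; the leading gap has no
# earlier value to copy, so a backward fill covers it afterwards
toy_oil["dcoilwtico"] = toy_oil["dcoilwtico"].ffill().bfill()

remaining = int(toy_oil["dcoilwtico"].isnull().sum())
print(toy_oil)
print("Missing values left:", remaining)
```

Note that the backward fill borrows the next day's price for the first row. For a single leading value this is probably harmless, but it is a judgment call; dropping that row would be an alternative.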




# Exploratory Data Analysis
Perform exploratory data analysis using visualizations to understand the data distribution and relationships.
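As a starting point, the sketch below computes a few aggregates that are natural candidates for plotting. It runs on a small synthetic frame with the same columns as `train_df`; the values are randomly generated, so the numbers themselves carry no meaning.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train_df with the same columns (values are made up)
rng = np.random.default_rng(0)
dates = pd.date_range("2017-01-01", periods=28, freq="D")
toy_train = pd.MultiIndex.from_product(
    [dates, [1, 2], ["BEVERAGES", "BOOKS"]],
    names=["date", "store_nbr", "family"],
).to_frame(index=False)
toy_train["sales"] = rng.poisson(20, size=len(toy_train)).astype(float)
toy_train["onpromotion"] = rng.integers(0, 3, size=len(toy_train))

# Total sales per day: the series to plot when looking for trend and seasonality
daily_sales = toy_train.groupby("date")["sales"].sum()

# Average sales per product family, largest first
family_sales = toy_train.groupby("family")["sales"].mean().sort_values(ascending=False)

# Average sales by day of week (Monday = 0) to expose weekly patterns
day_of_week = toy_train["date"].dt.dayofweek.rename("day_of_week")
dow_sales = toy_train.groupby(day_of_week)["sales"].mean()

# Share of rows with zero sales, which matters for a log-based metric
zero_share = (toy_train["sales"] == 0).mean()

print(daily_sales.head())
print(family_sales)
print(dow_sales)
print(f"Share of zero-sales rows: {zero_share:.2%}")
```

Each resulting Series can then be handed to matplotlib or seaborn, for example as a line plot for `daily_sales` and bar plots for the other two.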

# Feature Engineering
Create new features from existing ones to improve model performance.
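A minimal sketch of what this step could look like: calendar features derived from `date`, plus left joins that attach store metadata and the oil price to each sales row. The helper `add_features` and the toy frames are our own illustrative assumptions; only the column names come from the datasets loaded earlier.

```python
import pandas as pd

# Toy stand-ins for train_df, stores_df and oil_df (values are made up)
toy_train = pd.DataFrame({
    "date": ["2017-08-14", "2017-08-15", "2017-08-15"],
    "store_nbr": [1, 1, 2],
    "family": ["BOOKS", "BOOKS", "BEAUTY"],
    "sales": [0.0, 3.0, 5.0],
    "onpromotion": [0, 1, 0],
})
toy_stores = pd.DataFrame({
    "store_nbr": [1, 2],
    "city": ["CityA", "CityB"],
    "type": ["D", "D"],
    "cluster": [13, 13],
})
toy_oil = pd.DataFrame({
    "date": ["2017-08-14", "2017-08-15"],
    "dcoilwtico": [47.59, 47.57],
})

def add_features(df, stores, oil):
    """Return a copy of df with calendar, store and oil price features."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    oil = oil.assign(date=pd.to_datetime(oil["date"]))

    # Calendar features derived from the date
    out["year"] = out["date"].dt.year
    out["month"] = out["date"].dt.month
    out["day_of_week"] = out["date"].dt.dayofweek  # Monday = 0
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)

    # Left joins keep every sales row even when a lookup value is missing
    out = out.merge(stores, on="store_nbr", how="left")
    out = out.merge(oil, on="date", how="left")
    return out

features = add_features(toy_train, toy_stores, toy_oil)
print(features)
```

Applying the same function to both the training and test frames keeps their feature columns consistent.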

# Model Training
Train machine learning models using scikit-learn and other libraries.
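The sketch below shows one possible training setup on synthetic data. Two choices in it are our own assumptions rather than anything fixed by the competition: the split is chronological instead of random, so the model is validated on dates later than those it was trained on, and the target is transformed with `log1p`, so that minimizing squared error lines up with the RMSLE metric.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic daily data (made up): sales rise with promotions and on weekends
rng = np.random.default_rng(42)
dates = pd.date_range("2017-01-01", periods=120, freq="D")
toy = pd.DataFrame({
    "date": dates,
    "day_of_week": dates.dayofweek,
    "onpromotion": rng.integers(0, 5, size=len(dates)),
})
rate = (10 + 3 * toy["onpromotion"] + 2 * (toy["day_of_week"] >= 5)).to_numpy()
toy["sales"] = rng.poisson(rate).astype(float)

# Chronological split: the last 20 days are held out for validation
cutoff = dates[-20]
train_part = toy[toy["date"] < cutoff]
valid_part = toy[toy["date"] >= cutoff]

feature_cols = ["day_of_week", "onpromotion"]
model = LinearRegression()
# Fit on log1p(sales) so that squared error on the target matches RMSLE
model.fit(train_part[feature_cols], np.log1p(train_part["sales"]))

# Undo the transform with expm1 and clip, since sales cannot be negative
valid_pred = np.clip(np.expm1(model.predict(valid_part[feature_cols])), 0, None)
print(valid_pred[:5])
```

The same pattern should carry over to `RandomForestRegressor` or `XGBRegressor` by swapping the estimator.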

# Model Evaluation
Evaluate the trained models using appropriate metrics and validation techniques.
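For the competition metric, scikit-learn's `mean_squared_log_error` can be wrapped as follows. The wrapper name `rmsle_score` and all numbers are hypothetical; the constant baseline is included only to show how a model score might be put in context.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

def rmsle_score(y_true, y_pred):
    """RMSLE via scikit-learn, with predictions clipped at zero."""
    # mean_squared_log_error rejects negative values, so clip predictions first
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(mean_squared_log_error(y_true, y_pred)))

# Hypothetical validation targets and predictions
y_valid = np.array([0.0, 4.0, 10.0, 25.0])
model_pred = np.array([0.5, 3.0, 12.0, 20.0])
baseline_pred = np.full(len(y_valid), 8.0)  # constant guess, e.g. a training mean

model_score = rmsle_score(y_valid, model_pred)
baseline_score = rmsle_score(y_valid, baseline_pred)
print(f"model RMSLE: {model_score:.3f}, baseline RMSLE: {baseline_score:.3f}")
```

A model is only worth keeping if it beats such a naive baseline on held-out dates.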

# Model Deployment
Prepare the model for deployment, including saving the model and creating an API for predictions.
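As a rough sketch of the saving and loading part, the example below pickles a small placeholder model and wraps prediction in a function that a web API could call. The file name `sales_model.pkl`, the function `predict_sales`, and the two-column feature layout are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny placeholder model fitted on made-up data (target is log1p of sales)
X = np.array([[0, 0], [1, 1], [2, 0], [3, 2]], dtype=float)
y = np.log1p(np.array([5.0, 9.0, 6.0, 14.0]))
model = LinearRegression().fit(X, y)

# Persist the fitted model to disk (a temporary directory in this sketch)
model_path = Path(tempfile.mkdtemp()) / "sales_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

def predict_sales(features, path=model_path):
    """Load the saved model and return non-negative sales predictions."""
    with open(path, "rb") as f:
        loaded = pickle.load(f)
    log_pred = loaded.predict(np.asarray(features, dtype=float))
    # Undo the log1p transform and clip, since sales cannot be negative
    return np.clip(np.expm1(log_pred), 0, None)

preds = predict_sales([[1, 1], [3, 2]])
print(preds)
```

Pickle files should only be loaded from trusted sources, and the scikit-learn version used for loading should match the one used for saving.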