## 📈 Introduction

Welcome to my notebook for the **"Store Sales - Time Series Forecasting"** competition on Kaggle!

In this competition, we're working with sales data from **Corporación Favorita**, a large grocery retailer in Ecuador. The task is to build a model that accurately predicts **unit sales** for thousands of items across multiple stores.

This challenge is a great opportunity to explore **time-series forecasting**, practice working with real-world retail data, and apply machine learning to a problem with real business impact. Accurate forecasts help retailers reduce overstock, minimize food waste, and ensure products are available when customers need them.

Throughout this notebook, I'll walk through:

- 🧹 Data exploration and preprocessing
- 🧠 Feature engineering and modeling
- 📊 Evaluation and results
- ✅ Submission to the competition

## Import Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import seaborn as sns

In [None]:
# reading files
def read_csv(path):
    return pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/" + path)

data_holidays = read_csv("holidays_events.csv")
data_oil = read_csv("oil.csv")
data_stores = read_csv("stores.csv")
data_transactions = read_csv("transactions.csv")
data_train = read_csv("train.csv")
data_test = read_csv("test.csv")

In [None]:
# setting up columns to proper date format
data_oil['date'] = pd.to_datetime(data_oil['date'])
data_holidays['date'] = pd.to_datetime(data_holidays['date'])
data_train['date'] = pd.to_datetime(data_train['date'])
data_test['date'] = pd.to_datetime(data_test['date'])
data_transactions['date'] = pd.to_datetime(data_transactions['date'])

# only keeping rows after 2013-01-01
data_train = data_train[data_train['date'] > '2013-01-01']
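
The date conversion above can also happen at read time. A minimal sketch, assuming a small inline CSV (the rows are made up, only the column names follow `oil.csv`) in place of the Kaggle files:

```python
import io

import pandas as pd

# Made-up rows standing in for oil.csv; the first price is deliberately empty.
csv_text = "date,dcoilwtico\n2013-01-01,\n2013-01-02,93.14\n2013-01-03,92.97\n"

# parse_dates converts the column while reading, so no separate to_datetime call is needed.
oil_demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Same filter as in the cell above: keep rows strictly after 2013-01-01.
oil_demo = oil_demo[oil_demo["date"] > "2013-01-01"]
print(oil_demo.dtypes)
```

With the real files, the same `parse_dates=["date"]` argument could be passed through the `read_csv` helper.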

## Analysis

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Prepare KPI data from the training set
kpi_labels = [
    'Number of Stores',
    'Number of Different Products',
    'Window Start Date',
    'Window End Date',
    '# Rows in Training Set',
    '# Date Points in Train Dataset'
]

kpi_values = [
    data_train['store_nbr'].nunique(),
    data_train['family'].nunique(),
    str(data_train['date'].min().date()),  # Format date nicely
    str(data_train['date'].max().date()),
    f"{data_train.shape[0]:,}",  # Comma separator for large numbers
    data_train['date'].nunique()
]

kpi_df = pd.DataFrame({'KPI': kpi_labels, 'Value': kpi_values})

# Plotting
fig, ax = plt.subplots(figsize=(7, 4))
ax.axis('off')
ax.set_title("BASIC KPIs of TRAIN DATA", fontsize=14, fontweight='bold', pad=20)

# Create the table
table = ax.table(cellText=kpi_df.values, colLabels=kpi_df.columns, cellLoc='left', loc='center')

# Style the table
table.auto_set_font_size(False)
table.set_fontsize(11)
table.scale(1.2, 1.4)

# Bold header row
for (row, col), cell in table.get_celld().items():
    if row == 0:
        cell.set_text_props(weight='bold', color='white')
        cell.set_facecolor('#40466e')
    else:
        cell.set_facecolor('#f1f1f2')

plt.tight_layout()
plt.show()


In [None]:
import matplotlib.pyplot as plt

# Prepare data
train_aux = data_train[['date', 'sales', 'onpromotion']].groupby('date').mean().reset_index()

# Plot using Matplotlib
plt.figure(figsize=(12, 6))
plt.plot(train_aux['date'], train_aux['sales'], color='blue', label='Average Sales')

plt.title('Avg Sales by Date for All Stores and Products')
plt.xlabel('Date')
plt.ylabel('Avg Unit Sold')
plt.xticks(rotation=45)
plt.tight_layout()
plt.grid(True)
plt.legend()
plt.show()


- Overall, sales show an increasing trend.
- From about July 2015 to the end of the training window (roughly the last two years), the level is relatively stable, so the series is close to stationary.
- A clear weekly seasonality is present, repeating every 7 days.
- Sales are consistently higher on weekends, with peaks on Saturdays and Sundays.
- On January 1st each year, sales drop to zero as supermarkets are closed.
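
The weekly pattern described above can be checked numerically by averaging sales per weekday. A minimal sketch on a synthetic series (the values are invented, with a weekend uplift built in), not on the competition data:

```python
import numpy as np
import pandas as pd

# Synthetic daily series: a flat level of 100 with 150 on weekends.
dates = pd.date_range("2016-01-04", periods=28, freq="D")  # 2016-01-04 is a Monday
is_weekend = dates.dayofweek >= 5  # Saturday=5, Sunday=6
demo = pd.DataFrame({"date": dates, "sales": np.where(is_weekend, 150.0, 100.0)})

# Average sales per weekday; a weekly pattern shows up as differences between rows.
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
weekday_profile = (
    demo.groupby(demo["date"].dt.day_name())["sales"]
    .mean()
    .reindex(weekday_order)
)
print(weekday_profile)
```

On the real data, `train_aux` from the plot above would take the place of `demo`.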


## Sales Distribution

In [None]:
plt.figure(figsize=(10,5))
plt.hist(data_train['sales'], bins=50, color='skyblue', edgecolor='black')
plt.title('Sales Distribution')
plt.xlabel('Sales')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


## Store Type Distribution

In [None]:
data_stores['type'].value_counts().plot(kind='bar', title='Store Type Distribution', color='purple')
plt.xlabel('Store Type')
plt.ylabel('Count')
plt.grid(True)
plt.show()


## Transactions Trend Over Time

In [None]:
trans_by_date = data_transactions.groupby('date')['transactions'].sum().reset_index()

plt.figure(figsize=(14,6))
plt.plot(trans_by_date['date'], trans_by_date['transactions'], color='green')
plt.title('Total Transactions Over Time')
plt.xlabel('Date')
plt.ylabel('Transactions')
plt.grid(True)
plt.show()


## Oil Price Over Time

In [None]:
plt.figure(figsize=(14,6))
plt.plot(data_oil['date'], data_oil['dcoilwtico'], color='orange')
plt.title('Oil Prices Over Time')
plt.xlabel('Date')
plt.ylabel('Oil Price (WTI)')
plt.grid(True)
plt.show()


## Sales Distribution on Holidays

In [None]:
# Ensure datetime
data_train['date'] = pd.to_datetime(data_train['date'])
data_holidays['date'] = pd.to_datetime(data_holidays['date'])

# Filter only non-transferred holidays
valid_holidays = data_holidays[data_holidays['transferred'] == False].copy()

# Aggregate daily sales
daily_sales = data_train.groupby('date')['sales'].sum().reset_index()

# Merge holidays into sales
sales_with_holidays = pd.merge(daily_sales, valid_holidays, on='date', how='left')
sales_with_holidays['is_holiday'] = sales_with_holidays['type'].notnull()

# Compare average sales: holiday vs non-holiday
avg_sales_comparison = sales_with_holidays.groupby('is_holiday')['sales'].mean()

# Average sales by holiday type
avg_sales_by_type = sales_with_holidays[sales_with_holidays['is_holiday']] \
                    .groupby('type')['sales'].mean().sort_values(ascending=False)

# Plotting both charts side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: Avg Sales Holiday vs Non-Holiday
axes[0].bar(['Non-Holiday', 'Holiday'], avg_sales_comparison.values, color=['skyblue', 'salmon'])
axes[0].set_title('Avg Sales: Holiday vs Non-Holiday')
axes[0].set_ylabel('Average Sales')
axes[0].grid(True)

# Right: Avg Sales by Holiday Type
avg_sales_by_type.plot(kind='bar', ax=axes[1], color='teal')
axes[1].set_title('Avg Sales by Holiday Type')
axes[1].set_ylabel('Average Sales')
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(True)

plt.tight_layout()
plt.show()
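
A left merge on `date` repeats a sales row once per matching holiday row, so any date that carries several events would be counted more than once in the averages. A small sketch with made-up rows shows the effect and one possible way to keep a single row per date:

```python
import pandas as pd

# Made-up rows: two events fall on 2016-05-01, none on 2016-05-02.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2016-05-01", "2016-05-02"]),
    "sales": [100.0, 80.0],
})
events = pd.DataFrame({
    "date": pd.to_datetime(["2016-05-01", "2016-05-01"]),
    "type": ["Holiday", "Event"],
})

# A plain left merge repeats the 2016-05-01 sales row once per event.
merged_raw = daily.merge(events, on="date", how="left")

# Keeping one event per date restores one row per day.
merged_dedup = daily.merge(events.drop_duplicates("date"), on="date", how="left")

print(len(merged_raw), len(merged_dedup))
```

Which event to keep for a date is a modeling choice; `drop_duplicates` simply keeps the first one it encounters.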


## Avg Sales vs Store Number

In [None]:
# Ensure store number is same type for merging
data_train['store_nbr'] = data_train['store_nbr'].astype(int)
data_stores['store_nbr'] = data_stores['store_nbr'].astype(int)

# Step 1: Compute average sales per store
avg_sales_store = data_train.groupby('store_nbr')['sales'].mean().reset_index()

# Step 2: Merge with store metadata to get 'type'
avg_sales_store = avg_sales_store.merge(data_stores[['store_nbr', 'type']], on='store_nbr', how='left')

# Step 3: Plot
plt.figure(figsize=(12,6))
sns.scatterplot(data=avg_sales_store, x='store_nbr', y='sales', hue='type', palette='Set2', s=100, edgecolor='black')

plt.title('Average Sales vs Store Number (Colored by Store Type)')
plt.xlabel('Store Number')
plt.ylabel('Average Sales')
plt.grid(True)
plt.legend(title='Store Type', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()


## Avg Sales vs Avg Onpromotion per City (Bubble Size = Number of Stores)

In [None]:
# Step 1: Merge data_train with store info to get city
data_train_city = data_train.merge(data_stores[['store_nbr', 'city']], on='store_nbr', how='left')

# Step 2: Aggregate avg sales, avg onpromotion, number of stores per city
city_summary = data_train_city.groupby('city').agg({
    'sales': 'mean',
    'onpromotion': 'mean'
}).reset_index()

store_counts = data_stores.groupby('city')['store_nbr'].nunique().reset_index()
store_counts.rename(columns={'store_nbr': 'num_stores'}, inplace=True)

# Step 3: Merge store counts into city summary
city_summary = city_summary.merge(store_counts, on='city')

# Step 4: Bubble plot
plt.figure(figsize=(14, 8))
sns.scatterplot(
    data=city_summary,
    x='onpromotion',
    y='sales',
    size='num_stores',
    hue='city',
    sizes=(100, 1000),
    alpha=0.7,
    palette='tab10',
    legend=False,
    edgecolor='black'
)

# Add labels to each point
for i, row in city_summary.iterrows():
    plt.text(row['onpromotion'] + 0.02, row['sales'], row['city'], fontsize=9)

plt.title('Avg Sales vs Avg Onpromotion per City (Bubble Size = #Stores)')
plt.xlabel('Average Onpromotion')
plt.ylabel('Average Sales')
plt.grid(True)
plt.tight_layout()
plt.show()



## Processing Data

### 🛠️ `process_data()` Function Breakdown

1. **Selective Column Filtering**:
   - Keeps only necessary columns for training and test datasets.
   - Columns differ slightly based on whether `is_test=True`.

2. **Merging External Data**:
   - Merges oil prices (`data_oil`) and holiday information (`data_holidays`) using the `date` column as the key.
   - Renames:
     - `"dcoilwtico"` to `"crude_price"`
     - `"type"` to `"day_type"`

3. **Handling Missing Holiday Data**:
   - Fills missing `day_type` with `"Work Day"`.
   - Fills missing `transferred` values with `False`.

4. **Holiday Flag Creation**:
   - Adds a binary feature `is_holiday`:
     - `1` if the date is a valid, non-transferred holiday.
     - `0` otherwise.

5. **Promotion Encoding**:
   - Converts `onpromotion` values to binary:
     - `0` if it's `0.0`, else `1`.

6. **Sequential Day Encoding**:
   - Generates a new feature `day_number`:
     - Encodes unique dates into sequential integers using `pd.factorize()`.

7. **Crude Price Imputation**:
   - Fills missing crude oil prices using the **forward fill** method.

8. **One-Hot Encoding for Product Family**:
   - Applies one-hot encoding to the `family` column.
   - Merges new binary columns back into the dataset.
   - Drops the original `family` column.

9. **Final Cleanup**:
   - Drops unused columns: `day_type`, `transferred`, `date`.
   - Removes any remaining rows with missing values.
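
One caveat with step 6: `pd.factorize()` numbers the dates of whichever DataFrame it receives, so the numbering starts again at 1 for the test set. A hedged alternative, assuming a fixed reference date, counts calendar days instead. The `REFERENCE_DATE` constant and the `add_day_number` helper are hypothetical names, not part of this notebook:

```python
import pandas as pd

REFERENCE_DATE = pd.Timestamp("2013-01-01")  # assumed origin; any fixed date works


def add_day_number(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with day_number = calendar days elapsed since REFERENCE_DATE."""
    out = df.copy()
    out["day_number"] = (out["date"] - REFERENCE_DATE).dt.days
    return out


# Made-up dates: two rows share a day in "train", and "test" continues afterwards.
train_demo = add_day_number(
    pd.DataFrame({"date": pd.to_datetime(["2013-01-02", "2013-01-02", "2013-01-03"])})
)
test_demo = add_day_number(pd.DataFrame({"date": pd.to_datetime(["2013-01-04"])}))

print(train_demo["day_number"].tolist(), test_demo["day_number"].tolist())
```

Unlike `pd.factorize()`, this leaves gaps in the numbering wherever a calendar day has no rows, but train and test share one scale.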

In [None]:
def process_data(df: pd.DataFrame, is_test=False):
    # columns that we want to keep
    col_train = ['date', 'family', 'onpromotion', 'sales', 'store_nbr', 'id']
    col_test = ['date', 'family', 'onpromotion', 'store_nbr', 'id']
    col_holidays = ['date', 'type', 'transferred']

    col_df = col_test if is_test else col_train

    # a date can carry several holiday entries; keep one row per date
    # (preferring a non-transferred one) so the merge does not duplicate sales rows
    holidays = (
        data_holidays[col_holidays]
        .sort_values('transferred', kind='stable')
        .drop_duplicates('date')
    )

    # merge oil prices and holiday info onto the sales rows
    df = (
        df[col_df]
        .merge(data_oil, how='left', on='date')
        .merge(holidays, how='left', on='date')
        .rename(columns={'type': 'day_type', 'dcoilwtico': 'crude_price'})
    )

    # dates with no holiday entry are ordinary work days
    df['day_type'] = df['day_type'].fillna('Work Day')
    df['transferred'] = df['transferred'].fillna(False).astype(bool)

    # flag only holidays that were not transferred, i.e. true holidays
    holiday_types = ['Holiday', 'Additional', 'Event', 'Transfer', 'Bridge']
    df['is_holiday'] = (df['day_type'].isin(holiday_types) & ~df['transferred']).astype(int)

    # where onpromotion is 0.0 change it to 0, else 1
    df['onpromotion'] = np.where(df['onpromotion'] == 0.0, 0, 1)

    # sequential day index starting at 1; rows sharing a date share the number
    # (the numbering restarts for every DataFrame passed in)
    df['day_number'] = pd.factorize(df['date'])[0] + 1
    df = df.drop(['day_type', 'transferred', 'date'], axis=1)

    # wherever crude price is NaN, fill it with the last valid value
    df['crude_price'] = df['crude_price'].ffill()

    # one-hot encode the product family, then drop the original column
    family_encoding = pd.get_dummies(df['family'], prefix='family').astype(int)
    df = pd.concat([df, family_encoding], axis=1)
    df = df.drop(['family'], axis=1)

    # drop any rows that still contain missing values
    df = df.dropna()

    return df

In [None]:
data_train = process_data(data_train)
data_test = process_data(data_test, is_test=True)

data_train.head()
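
Since `process_data()` runs separately on train and test, the one-hot columns line up only if both sets contain the same product families. A minimal sketch, with made-up family values, of aligning the two layouts using `reindex`:

```python
import pandas as pd

# Made-up frames: the test sample happens to lack the "DAIRY" family.
train_demo = pd.DataFrame({"family": ["BREAD", "DAIRY", "BREAD"]})
test_demo = pd.DataFrame({"family": ["BREAD"]})

train_enc = pd.get_dummies(train_demo["family"], prefix="family").astype(int)
test_enc = pd.get_dummies(test_demo["family"], prefix="family").astype(int)

# Reindex the test columns to the training layout;
# categories absent from the test sample become all-zero columns.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```

If both files already contain every family, the `reindex` call changes nothing, so it is a cheap safeguard.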

In [None]:
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

train_X = data_train.drop(['sales'], axis=1)
train_y = data_train['sales']

# replace because xgboost doesn't work when headers have " "
train_X.columns = [col.replace(" ", "_").replace("-", "_") for col in train_X.columns]
data_test.columns = [col.replace(" ", "_").replace("-", "_") for col in data_test.columns]

# initial prediction (base_score): log1p of the mean target
log_mean_target = np.log1p(train_y.mean())
print(log_mean_target)

# Initialize the XGBoost model
xgb_model = xgb.XGBRegressor(objective='count:poisson',  n_estimators=100, base_score = log_mean_target)

# Train the model
xgb_model.fit(train_X, train_y)

# Make predictions on the test set
predict_y_xg = xgb_model.predict(data_test)

print(predict_y_xg)
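
The model above is fit on all training rows, so nothing is left to estimate its error. A hedged sketch of a time-based hold-out scored with RMSLE (root mean squared logarithmic error, the metric I assume the competition uses) follows. The data is made up, and a mean baseline stands in for the fitted model so the example stays self-contained:

```python
import numpy as np
import pandas as pd


def rmsle(y_true, y_pred) -> float:
    """Root mean squared logarithmic error; inputs must be non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Made-up frame with the day_number / sales columns used in this notebook.
demo = pd.DataFrame({
    "day_number": [1, 1, 2, 2, 3, 3, 4, 4],
    "sales": [0.0, 10.0, 2.0, 12.0, 1.0, 11.0, 3.0, 9.0],
})

# Time-based split: the most recent day is held out, mimicking a forecast of unseen dates.
cutoff = demo["day_number"].max() - 1
fit_part = demo[demo["day_number"] <= cutoff]
valid_part = demo[demo["day_number"] > cutoff]

# Stand-in predictor (mean of the fitting window); a fitted model's predictions would go here.
baseline_pred = np.full(len(valid_part), fit_part["sales"].mean())
print(rmsle(valid_part["sales"], baseline_pred))
```

Splitting by date rather than at random keeps future rows out of the fitting window, which is closer to how the test set relates to the training set.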

In [None]:
end_data = data_test[['id']].copy()  # Copy the 'id' column as a DataFrame
end_data['sales'] = predict_y_xg  # Add the predicted sales as a new column

# Save to a CSV file
end_data.to_csv('/kaggle/working/submission.csv', index=False)
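
Because `process_data()` ends with `dropna()`, test rows could in principle disappear before prediction. A small sketch of a pre-submission check follows. `find_submission_problems` is a hypothetical helper, and the ids and values are made up:

```python
import pandas as pd


def find_submission_problems(sub: pd.DataFrame, expected_ids) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    if list(sub.columns) != ["id", "sales"]:
        return ["unexpected columns"]
    problems = []
    if not sub["id"].is_unique:
        problems.append("duplicate ids")
    if set(sub["id"]) != set(expected_ids):
        problems.append("ids do not match the test set")
    if sub["sales"].isna().any():
        problems.append("missing predictions")
    if (sub["sales"] < 0).any():
        problems.append("negative sales")
    return problems


# Made-up ids and predictions for illustration.
demo_sub = pd.DataFrame({"id": [101, 102], "sales": [4.2, 0.0]})
print(find_submission_problems(demo_sub, expected_ids=[101, 102]))
```

In this notebook the check could be run on `end_data`, with the expected ids taken from a fresh read of `test.csv`.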