# Store Sales - Time Series Forecasting
This notebook provides a complete solution for the Kaggle competition using LightGBM.

# Store Sales - Time Series Forecasting
### Kaggle Competition Overview

This notebook provides a complete solution for the **Store Sales - Time Series Forecasting** Kaggle competition. The goal of this competition is to predict store item sales on specific dates. It is a challenging time-series problem involving multiple datasets, external features, and real-world complexities.

---

## Problem Statement
The task involves predicting the **daily sales** of items for given stores over a defined period using historical data and additional external features. Accurate sales forecasts can improve inventory management, reduce waste, and increase overall operational efficiency for businesses.

### Key Objectives:
1. **Model Training**:
   - Use historical sales data to train a predictive model.
   - Incorporate time-series trends, seasonality, and external factors such as holidays and oil prices.
2. **Feature Engineering**:
   - Extract meaningful features from the dataset (e.g., time-based features, lag features).
   - Leverage external datasets (e.g., oil prices, holidays) to enhance prediction accuracy.
3. **Prediction**:
   - Forecast future sales for each item in the test dataset.
   - Ensure predictions align with Kaggle’s submission requirements.
4. **Evaluation**:
   - Optimize the model to minimize the **Root Mean Squared Logarithmic Error (RMSLE)**.

---

## Dataset Overview

This competition provides the following datasets:

1. **`train.csv`**: Historical sales data containing the following columns:
   - `date`: The date of the sale.
   - `store_nbr`: Identifier for the store.
   - `family`: The category of the product.
   - `sales`: The target variable representing sales on a given day.
   - `onpromotion`: Indicates whether the item was on promotion that day.

2. **`test.csv`**: Contains the `id` column (unique identifier) and other columns for which we need to predict sales.

3. **Additional Datasets**:
   - **`stores.csv`**: Information about the stores (e.g., location, type).
   - **`oil.csv`**: Daily oil prices, which may affect the economy and sales.
   - **`holidays_events.csv`**: Public holidays and events that can influence sales.
   - **`transactions.csv`**: Historical daily transactions at each store.

---

## Workflow Summary
This notebook is divided into the following steps:

1. **Import Required Libraries**:
   - Load all necessary libraries for data processing, feature engineering, and modeling.
   
2. **Load and Inspect Datasets**:
   - Read the provided datasets, inspect their structure, and understand their relationships.
   - Convert `date` columns to datetime format for better handling of time-series data.

3. **Preprocessing and Feature Engineering**:
   - Merge external datasets (e.g., oil prices, holiday information) with `train` and `test`.
   - Handle missing values appropriately (e.g., forward-filling oil prices).
   - Encode categorical features (e.g., `family`, `type`) into numeric format using `LabelEncoder`.
   - Extract additional time-based features (e.g., day of the week, month).

4. **Train-Test Split**:
   - Split the `train` dataset into training and validation sets.
   - Use the validation set to evaluate the model's performance.

5. **Model Training with LightGBM**:
   - Train a gradient boosting model using LightGBM, optimizing for the RMSLE metric.
   - Use early stopping to prevent overfitting.

6. **Prediction and Submission**:
   - Generate predictions for the test dataset.
   - Create a submission file adhering to Kaggle’s format (`id` and `sales` columns).

---

## Metadata Description

- **Competition URL**: [Store Sales - Time Series Forecasting](https://www.kaggle.com/competitions/store-sales-time-series-forecasting)
- **Submission Format**: A CSV file with the following columns:
  - `id`: Unique identifier for each row in the test set.
  - `sales`: Predicted sales for the corresponding `id`.
- **Evaluation Metric**: Root Mean Squared Logarithmic Error (RMSLE).
  - RMSLE penalizes large differences between the logarithms of predicted and actual values, making it ideal for tasks with a wide range of target values.

---

## Expected Outcome
By the end of this notebook, you will:
1. Understand the data preprocessing, feature engineering, and modeling steps for time-series forecasting.
2. Generate accurate sales predictions for the test dataset.
3. Create a valid submission file for the Kaggle competition.


In [2]:
# Step 1: Install Kaggle API and Download Dataset
!pip install kaggle
from google.colab import files
files.upload()  # Upload kaggle.json
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c store-sales-time-series-forecasting
!unzip -q store-sales-time-series-forecasting.zip -d store_sales_data



Saving kaggle.json to kaggle.json
Downloading store-sales-time-series-forecasting.zip to /content
 23% 5.00M/21.4M [00:00<00:00, 43.7MB/s]
100% 21.4M/21.4M [00:00<00:00, 125MB/s] 


### Import Required Libraries
In this step, we import all necessary libraries, including:
- **pandas** for data manipulation.
- **numpy** for numerical operations.
- **lightgbm** for training the model.
- **sklearn** for splitting the data and evaluation metrics.
- **LabelEncoder** for encoding categorical features.

### Load and Inspect Datasets
- Load the datasets provided by Kaggle, including:
  - `train.csv`: Contains the historical sales data.
  - `test.csv`: Contains the test data for which predictions need to be made.
  - Additional files (`stores.csv`, `oil.csv`, `holidays_events.csv`, `transactions.csv`).
- Check the structure of the datasets using `.info()` and `.head()`.
- Convert `date` columns to datetime format for easy manipulation.


In [3]:
# Load Libraries and Dataset
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error
from sklearn.preprocessing import LabelEncoder

train = pd.read_csv('store_sales_data/train.csv')
test = pd.read_csv('store_sales_data/test.csv')
stores = pd.read_csv('store_sales_data/stores.csv')
oil = pd.read_csv('store_sales_data/oil.csv')
holidays = pd.read_csv('store_sales_data/holidays_events.csv')
transactions = pd.read_csv('store_sales_data/transactions.csv')

# Convert date columns to datetime
train['date'] = pd.to_datetime(train['date'])
test['date'] = pd.to_datetime(test['date'])
oil['date'] = pd.to_datetime(oil['date'])
holidays['date'] = pd.to_datetime(holidays['date'])
transactions['date'] = pd.to_datetime(transactions['date'])

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



### Data Preprocessing and Feature Engineering
This step involves:
- Merging external datasets (e.g., `stores`, `oil`, `transactions`, `holidays`) with the train and test datasets.
- Handling missing values using forward-fill for time-series data.
- Encoding categorical columns using `LabelEncoder` to convert them into numeric format.
- Creating time-based features from the `date` column such as `day_of_week`, `month`, and `year`.
- Ensuring that test and train datasets have the same features.


In [6]:
# Step 3: Preprocessing and Feature Engineering
# Merge external datasets
train = train.merge(stores, on='store_nbr', how='left')
test = test.merge(stores, on='store_nbr', how='left')
train = train.merge(oil, on='date', how='left')
test = test.merge(oil, on='date', how='left')
train = train.merge(transactions, on=['date', 'store_nbr'], how='left')
test = test.merge(transactions, on=['date', 'store_nbr'], how='left')
train = train.merge(holidays, on='date', how='left')
test = test.merge(holidays, on='date', how='left')

# Fill missing values
train['dcoilwtico'] = train['dcoilwtico'].ffill()
test['dcoilwtico'] = test['dcoilwtico'].ffill()

# Label Encoding for Categorical Columns
categorical_cols = train.select_dtypes(include=['object', 'string']).columns
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    train[col] = le.fit_transform(train[col].astype(str))
    test[col] = le.transform(test[col].astype(str))
    label_encoders[col] = le

# Extract time-based features
for df in [train, test]:
    df['day_of_week'] = df['date'].dt.dayofweek
    df['month'] = df['date'].dt.month
    df['year'] = df['date'].dt.year

# Drop non-numeric and unnecessary columns
train = train.drop(columns=['id', 'date'])
test = test.drop(columns=['id', 'date'])

### Split Data into Training and Validation Sets
- Separate the target variable (`sales`) from the training dataset.
- Split the data into training and validation sets using `train_test_split`.
- Ensure validation data is reserved for evaluating model performance.


In [7]:
# Step 4: Split Training Data
X = train.drop(columns=['sales'])
y = train['sales']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

### Train LightGBM Model
- Define LightGBM parameters (`params`) such as learning rate, number of leaves, and objective.
- Create `lgb.Dataset` objects for training and validation data.
- Train the model using `lgb.train` with early stopping and evaluation metrics logging.

In [9]:
import lightgbm as lgb
from lightgbm import early_stopping, log_evaluation

# Define LightGBM parameters
params = {
    'objective': 'regression',         # Regression task
    'metric': 'rmse',                  # Root Mean Squared Error
    'boosting_type': 'gbdt',           # Gradient Boosting Decision Trees
    'learning_rate': 0.05,             # Step size for updating weights
    'num_leaves': 31,                  # Maximum leaves per tree
    'verbose': -1                      # Suppress training logs
}

# Prepare LightGBM datasets
lgb_train = lgb.Dataset(X_train, label=y_train)  # Training data
lgb_val = lgb.Dataset(X_val, label=y_val, reference=lgb_train)  # Validation data

# Train LightGBM model with callbacks for early stopping and logging
model = lgb.train(
    params,
    lgb_train,
    valid_sets=[lgb_train, lgb_val],  # Evaluate performance on training and validation data
    num_boost_round=1000,             # Maximum number of boosting rounds
    callbacks=[
        early_stopping(stopping_rounds=50),  # Stop early if no improvement for 50 rounds
        log_evaluation(50)                  # Log metrics every 50 rounds
    ]
)

# Output best iteration and evaluation results
print(f"Best iteration: {model.best_iteration}")
print(f"Best RMSE: {model.best_score['valid_1']['rmse']}")

Training until validation scores don't improve for 50 rounds
[50]	training's rmse: 451.035	valid_1's rmse: 435.653
[100]	training's rmse: 372.938	valid_1's rmse: 357.48
[150]	training's rmse: 338.795	valid_1's rmse: 324.18
[200]	training's rmse: 315.953	valid_1's rmse: 302.429
[250]	training's rmse: 299.481	valid_1's rmse: 286.572
[300]	training's rmse: 288.644	valid_1's rmse: 276.477
[350]	training's rmse: 279.697	valid_1's rmse: 268.12
[400]	training's rmse: 272.604	valid_1's rmse: 261.494
[450]	training's rmse: 266.578	valid_1's rmse: 256.238
[500]	training's rmse: 261.34	valid_1's rmse: 251.883
[550]	training's rmse: 257.154	valid_1's rmse: 248.247
[600]	training's rmse: 253.084	valid_1's rmse: 245.254
[650]	training's rmse: 249.867	valid_1's rmse: 242.595
[700]	training's rmse: 246.558	valid_1's rmse: 239.679
[750]	training's rmse: 243.609	valid_1's rmse: 237.052
[800]	training's rmse: 240.678	valid_1's rmse: 234.71
[850]	training's rmse: 238.036	valid_1's rmse: 232.573
[900]	trai

### Generate Predictions and Prepare Submission
- Align the test dataset features with the training dataset.
- Fill missing values in the test dataset to ensure compatibility with the model.
- Generate predictions for the test dataset.
- Create a submission file containing the `id` column from the original test dataset and the predictions.
- Verify that the submission file matches the required format and has the correct number of rows.

In [19]:
# Step 6: Predict and Create Submission File
# Align Test Dataset with Training Features
test_orig = pd.read_csv('store_sales_data/test.csv')
test_aligned, _ = test.align(X_train, axis=1, join='left')

# Fill Missing Values in Test Dataset
test_aligned.fillna(0, inplace=True)

# Generate Predictions
test_pred = model.predict(test_aligned)

# Create Submission File
# Use the 'id' column from the original test dataset to ensure IDs are correct
submission = pd.DataFrame({
    'id': test_orig['id'],   # Correct IDs from the original test dataset
    'sales': test_pred       # Predicted sales
})

# Verify Submission
print(f"Number of rows in submission: {submission.shape[0]}")
print(f"Missing IDs: {set(test_orig['id']) - set(submission['id'])}")
print(f"Extra IDs: {set(submission['id']) - set(test_orig['id'])}")

# Save Submission File
submission.to_csv('submission.csv', index=False)

Number of rows in submission: 28512
Missing IDs: set()
Extra IDs: set()
