# **Introduction**

## **Why This Project?**
This project is part of the "How to Win a Data Science Competition" Coursera course, and it presents a real-world challenge of predicting sales for one of Russia's largest software firms, 1C Company. Accurate sales forecasting is a critical aspect of retail operations, helping businesses optimize inventory, reduce overstock and understock situations, and improve customer satisfaction. The ability to predict future sales empowers businesses to plan better, minimize wastage, and maximize profits.

In this competition, we are tasked with forecasting the total sales for each product in every store for the upcoming month. Given the dynamic nature of product availability, pricing fluctuations, and store-specific trends, this problem mirrors the complex forecasting issues faced by retail companies worldwide.

## **Real-Life Problem Solved**
Sales forecasting is a fundamental problem for any retail chain, as it directly impacts supply chain management, operational efficiency, and profitability. Retailers need to handle large volumes of sales data across numerous stores and products, all while accounting for seasonal variations, promotional periods, and demand fluctuations. By developing a robust forecasting model, companies can ensure they maintain optimal stock levels, reducing the costs associated with both overstock and stockouts. The insights gained from this model can be extended to other industries reliant on accurate demand prediction, such as manufacturing, e-commerce, and supply chain management.

## **Strategies to Solve the Problem**
1. **Feature Engineering**: A key part of this project involves creating meaningful features from historical sales data, such as monthly sales aggregates, lag-based features (e.g., previous month sales), and price trends. We will also explore combining supplemental data such as item categories and shop information to enrich our predictions.
   
2. **Handling Time Series Data**: Since the dataset consists of daily sales data from January 2013 to October 2015, we need to model temporal dependencies effectively. Strategies such as using **rolling windows** or **lag features** will help capture past trends, while features like the **date_block_num** will be used to track time progression.

3. **Dealing with Data Sparsity**: In real-world retail settings, not every product is sold in every shop every month. This leads to sparse data, which can introduce noise into predictions. Techniques like **data aggregation** and **missing data imputation** will help mitigate these challenges.

4. **Modeling Approach**: For this problem, we will experiment with both traditional machine learning models and advanced algorithms. Some potential models include:
   - **Gradient Boosting Machines (GBM)**: XGBoost or LightGBM, which are well-suited for tabular data and can handle large datasets efficiently.
   - **Neural Networks for Time Series**: Potentially exploring recurrent neural networks (RNN) or LSTMs to capture time dependencies in the sales data.
   - **Ensembling**: Combining predictions from multiple models to create a more accurate and robust forecast.
   

5. **Evaluation Metric**: The competition uses Root Mean Squared Error (RMSE) as the evaluation metric, which penalizes larger errors more heavily. As the target values are clipped between 0 and 20, our model needs to focus on predicting values within this range accurately, avoiding extreme predictions.
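
To make the metric concrete, here is a minimal sketch of RMSE computed after clipping to the competition range. The helper name `clipped_rmse` and the toy numbers are illustrative assumptions, not part of the competition tooling.

```python
import numpy as np

def clipped_rmse(y_true, y_pred, low=0.0, high=20.0):
    """RMSE after clipping both arrays to [low, high] (hypothetical helper)."""
    y_true = np.clip(np.asarray(y_true, dtype=float), low, high)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), low, high)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values: 25 and 30 are both clipped to 20, so that row contributes no error
y_true = [0, 1, 3, 25]
y_pred = [0.5, 1.0, 2.0, 30.0]
score = clipped_rmse(y_true, y_pred)
print(score)
```

Because errors are squared before averaging, the single miss of 1.0 contributes four times as much as the miss of 0.5.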

## **Roadmap to Success**
- **Data Exploration and Preprocessing**: Initial exploratory data analysis (EDA) will help uncover key patterns in the data. Handling missing values, outliers, and understanding the distribution of target variables will form the foundation of the project.
- **Feature Engineering**: Creating time-related features, capturing shop and item-level characteristics, and analyzing sales trends over time will be the core strategies to improve model performance.
- **Modeling and Hyperparameter Tuning**: Experimenting with various machine learning algorithms, tuning hyperparameters, and leveraging cross-validation techniques will be crucial for model improvement.
- **Final Submission**: We will predict the monthly sales for November 2015, ensuring the submission file follows the required format: `ID,item_cnt_month`.
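
The lag and rolling-window features mentioned in the strategy list can be sketched on a toy monthly table. The frame `monthly` and the column names `lag_1` and `roll_mean_2` are assumptions made for illustration, and the sketch assumes every shop-item pair has a row for every month.

```python
import pandas as pd

# Toy monthly sales for two shops selling the same item (illustrative data)
monthly = pd.DataFrame({
    "shop_id": [0, 0, 0, 0, 1, 1, 1, 1],
    "item_id": [10] * 8,
    "date_block_num": [0, 1, 2, 3, 0, 1, 2, 3],
    "item_cnt_month": [2, 0, 5, 1, 1, 1, 0, 4],
})

# Sort so that shifting within a group moves strictly backwards in time
monthly = monthly.sort_values(["shop_id", "item_id", "date_block_num"])
grouped = monthly.groupby(["shop_id", "item_id"])["item_cnt_month"]

# Previous month's sales for the same shop-item pair
monthly["lag_1"] = grouped.shift(1)

# Mean of the two previous months; shift(1) keeps the current month out of the window
monthly["roll_mean_2"] = grouped.transform(lambda s: s.shift(1).rolling(2).mean())

print(monthly)
```

Shifting before rolling keeps the current month's value out of its own feature, which avoids leaking the target into the inputs.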


In [None]:
import pandas as pd
import os

# Function to load multiple CSV files
def load_data(file_paths):
    data = {}
    for name, path in file_paths.items():
        data[name] = pd.read_csv(path)
    return data

# Define file paths in a dictionary
file_paths = {
    'sales_data': '/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv',
    'item_cat': '/kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv',
    'items': '/kaggle/input/competitive-data-science-predict-future-sales/items.csv',
    'shops': '/kaggle/input/competitive-data-science-predict-future-sales/shops.csv',
    'test':'/kaggle/input/competitive-data-science-predict-future-sales/test.csv'
}

# Load the data
data = load_data(file_paths)

# Access each dataset by its key
sales_data = data['sales_data']
item_categories = data['item_cat']
items = data['items']
shops = data['shops']
test_data=data['test']


In [None]:
# Import necessary libraries

import matplotlib.pyplot as plt
import seaborn as sns

# Basic information about datasets
print("Sales Data Info:\n")
print(sales_data.info())
print("\nTest Data Info:\n")
print(test_data.info())
print("\nItems Info:\n")
print(items.info())
print("\nItem Categories Info:\n")
print(item_categories.info())
print("\nShops Info:\n")
print(shops.info())

In [None]:
# Check for missing values
print("\nMissing Values in Sales Data:\n", sales_data.isnull().sum())
print("\nMissing Values in Test Data:\n", test_data.isnull().sum())

In [None]:
# Date parsing and feature extraction

sales_data['date'] = pd.to_datetime(sales_data['date'], format='%d.%m.%Y')
sales_data['year'] = sales_data['date'].dt.year
sales_data['month'] = sales_data['date'].dt.month

In [None]:
sales_data.head()

In [None]:
# Grouping by 'date', 'shop_id', and 'item_id' to get the total number of items sold
grouped_sales = sales_data.groupby(['date', 'shop_id', 'item_id']).agg({'item_cnt_day': 'sum'}).reset_index()

# Renaming the 'item_cnt_day' column to 'total_items_sold' for clarity
grouped_sales.rename(columns={'item_cnt_day': 'total_items_sold'}, inplace=True)

# Display the result
print(grouped_sales.head(10))  # Display the first few rows to check the result

# Time Series Plot for Total Items Sold by Date

This shows how the total number of items sold changes over time.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Group by 'date' to get total items sold across all shops and items for each date
datewise_sales = grouped_sales.groupby('date')['total_items_sold'].sum().reset_index()

# 'date' was already parsed to datetime earlier, so no further conversion is needed here

# Plot the total items sold over time
plt.figure(figsize=(12, 6))
sns.lineplot(x='date', y='total_items_sold', data=datewise_sales)
plt.title('Total Items Sold Over Time')
plt.xlabel('Date')
plt.ylabel('Total Items Sold')
plt.xticks(rotation=45)
plt.show()


# Bar Plot for Total Items Sold by Shop

This shows how many total items were sold in each shop.

In [None]:
# Group by 'shop_id' to get total items sold for each shop
shopwise_sales = grouped_sales.groupby('shop_id')['total_items_sold'].sum().reset_index()

# Plot total items sold per shop
plt.figure(figsize=(12, 6))
sns.barplot(x='shop_id', y='total_items_sold', data=shopwise_sales)
plt.title('Total Items Sold per Shop')
plt.xlabel('Shop ID')
plt.ylabel('Total Items Sold')
plt.xticks(rotation=45)
plt.show()


In [None]:
# Count the number of unique shop_ids
num_shops = sales_data['shop_id'].nunique()

# Display the result
print(f'Total number of unique shops: {num_shops}')

# Monthly sales trend

In [None]:
monthly_sales = sales_data.groupby(['year', 'month']).agg({'item_cnt_day': 'sum'}).reset_index()
monthly_sales['date'] = pd.to_datetime(monthly_sales[['year', 'month']].assign(day=1))

plt.figure(figsize=(12, 6))
plt.plot(monthly_sales['date'], monthly_sales['item_cnt_day'], marker='o')
plt.title('Monthly Sales Trend (2013-2015)')
plt.xlabel('Date')
plt.ylabel('Total Sales')
plt.grid(True)
plt.show()


# Top 10 selling items

In [None]:
top_items = sales_data.groupby('item_id').agg({'item_cnt_day': 'sum'}).sort_values(by='item_cnt_day', ascending=False).head(10)
top_items = top_items.merge(items[['item_id', 'item_name']], on='item_id')

plt.figure(figsize=(12, 6))
sns.barplot(x='item_name', y='item_cnt_day', data=top_items)
plt.xticks(rotation=90)
plt.title('Top 10 Selling Items')
plt.xlabel('Item Name')
plt.ylabel('Total Sales')
plt.show()

# Top 10 selling shops

In [None]:
top_shops = sales_data.groupby('shop_id').agg({'item_cnt_day': 'sum'}).sort_values(by='item_cnt_day', ascending=False).head(10)
top_shops = top_shops.merge(shops[['shop_id', 'shop_name']], on='shop_id')

plt.figure(figsize=(12, 6))
sns.barplot(x='shop_id', y='item_cnt_day', data=top_shops)
plt.xticks(rotation=90)
plt.title('Top 10 Shops by Sales')
plt.xlabel('Shop ID')
plt.ylabel('Total Sales')
plt.show()

In [None]:
test_data.head()

# Reshape the data to analyse Total Sales over time

Let's analyze total sales per shop and per item over time. For that we need to reshape the data. The `pivot_table` function summarizes the sales data by turning each month into its own column.

By grouping the data based on `shop_id` and `item_id`, and distributing it across different months (`date_block_num`), we can easily observe sales trends for each product-shop combination over time. This pivot table helps us in visualizing the sales patterns and preparing the data for future time-series analysis and forecasting.

The `fill_value=0` ensures that any missing values are replaced with `0`, indicating that no sales occurred for a specific product-shop combination during certain months. The `reset_index()` function is used to convert the pivoted data back into a standard DataFrame format with default integer indexing for easier manipulation in subsequent steps.
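
A small sketch of the same reshaping on toy data may help. The frame `toy_sales` and its values are made up for illustration; only the `pivot_table` arguments mirror the ones described above.

```python
import pandas as pd

# Toy daily sales: shop 0 sells item 10 twice in month 0 (illustrative data)
toy_sales = pd.DataFrame({
    "shop_id": [0, 0, 0, 1],
    "item_id": [10, 10, 11, 10],
    "date_block_num": [0, 0, 1, 1],
    "item_cnt_day": [1, 2, 4, 3],
})

# One row per (shop_id, item_id), one column per month, daily counts summed
toy_pivot = toy_sales.pivot_table(
    index=["shop_id", "item_id"],
    columns="date_block_num",
    values="item_cnt_day",
    aggfunc="sum",
    fill_value=0,  # months with no recorded sale become 0 instead of NaN
).reset_index()    # shop_id and item_id become ordinary columns again

print(toy_pivot)
```

After `reset_index()`, the month columns keep their integer labels (`0`, `1`, ...) next to the string-named `shop_id` and `item_id` columns.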

In [None]:
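# Pivot to one row per (shop_id, item_id) and one column per month (date_block_num);
# reset_index() turns shop_id and item_id back into ordinary columns
dataset = sales_data.pivot_table(index=['shop_id', 'item_id'], values='item_cnt_day',
                                 columns='date_block_num', fill_value=0, aggfunc='sum').reset_index()

# Defensive flattening: the columns are already single-level here because `values`
# is a single column name, so this line leaves them unchanged
dataset.columns = ['_'.join(map(str, col)) if isinstance(col, tuple) else col for col in dataset.columns]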

# Merge with test_data
merged_data = pd.merge(test_data, dataset, on=['item_id', 'shop_id'], how='left')

# Display the merged dataset
print(merged_data.head())


# Purpose of the Merge

The reason for merging the test data with the dataset is to incorporate historical sales data from the training period into the test dataset. This allows the model to make predictions for each shop_id and item_id combination in the test data based on past sales patterns.

### Key Points:

* Left join ensures that all rows from test_data are preserved, even if there’s no corresponding historical data for the specific shop_id and item_id.

* Historical features such as past sales from the dataset are now included in the test_data, making it ready for prediction.

### Parameters used:

* `on=['item_id', 'shop_id']`: This tells the `merge()` function to combine the two DataFrames based on the common columns `item_id` and `shop_id`.

* `how='left'`: This specifies a left join, meaning:

    * All rows from `test_data` will be retained.

    * Only matching rows from `dataset` will be included. If there's no match for a particular `item_id` and `shop_id` in `dataset`, those values will be set to NaN.
    
**This ensures that your model is trained on historical data relevant to the combinations it needs to predict in the test set. The result is a more focused dataset that contains only the historical sales information necessary for making predictions for the specified shop_id and item_id pairs in the test data.**
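
The left-join behaviour can be checked on a toy example. The frames `toy_test` and `toy_history` are illustrative assumptions; `indicator=True` is an optional `pd.merge` argument added here only to show which rows found a match.

```python
import pandas as pd

# Three test rows; the pair (shop 5, item 101) has no sales history (illustrative data)
toy_test = pd.DataFrame({"ID": [0, 1, 2], "shop_id": [5, 5, 6], "item_id": [100, 101, 100]})
toy_history = pd.DataFrame({"shop_id": [5, 6], "item_id": [100, 100], 0: [2, 0], 1: [1, 4]})

# Left join keeps every test row; unmatched rows get NaN in the month columns
toy_merged = pd.merge(toy_test, toy_history, on=["item_id", "shop_id"],
                      how="left", indicator=True)
print(toy_merged)

# Treat "no history" as zero sales
toy_filled = toy_merged.drop(columns="_merge").fillna(0)
print(toy_filled)
```

Rows marked `left_only` are shop-item pairs that appear in the test set but never in the training data, which is why they need filling before modeling.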

In [None]:
merged_data.head()

In [None]:
merged_data.fillna(0,inplace=True)
merged_data.head()

In [None]:
merged_data.drop(['shop_id','item_id','ID'],axis=1,inplace=True)
merged_data.head()

### Why Drop These Columns:

- **Redundancy**: `shop_id`, `item_id`, and `ID` are identifiers rather than sales values. After the merge they have done their job of aligning each test row with its sales history, and the sequence models below take only the monthly sales columns as input. `ID` is still available in `test_data` for building the submission file.
  
- **Focus on Predictive Features**: The goal is to keep only the columns the model learns from, which here are the monthly sales counts. Columns that don't contribute to the model's learning are removed to avoid clutter.
  
- **Model Efficiency**: By removing unnecessary columns, the dataset becomes cleaner, and this can improve model training efficiency and reduce memory usage.


In [None]:
import numpy as np
X_train=np.expand_dims(merged_data.values[:,:-1],axis=2)

*`merged_data.values[:, :-1]` selects all rows and every column except the last one from the `merged_data` dataframe.*

`date_block_num` starts at 0, which represents January 2013, and increases by one for each subsequent month.

To find the month and year that `date_block_num = 33` corresponds to:

* Month 0 = January 2013
* Month 33 = 33 months after January 2013 (the 34th month in the data).

To calculate:

* 33 months after January 2013 = October 2015.

**So column 33 in merged_data corresponds to October 2015, the last month of the training data. It is our training target, so it is removed from X_train, which keeps columns 0 to 32. The month we ultimately need to predict, November 2015, is `date_block_num = 34`, one step beyond the available data.**
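
The mapping from `date_block_num` to a calendar month can be checked directly with pandas. The helper name `block_to_month` is a hypothetical one introduced for this sketch.

```python
import pandas as pd

def block_to_month(date_block_num, start="2013-01"):
    """Map a date_block_num to its calendar month, with block 0 = January 2013."""
    return pd.Period(start, freq="M") + date_block_num

print(block_to_month(0))   # 2013-01
print(block_to_month(33))  # 2015-10
print(block_to_month(34))  # 2015-11
```

Block 33 is the last month present in the training data, and block 34 is the month the submission is for.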

In [None]:
len(X_train[0])

# Why Use `np.expand_dims` for Training Data?

The code `X_train = np.expand_dims(merged_data.values[:, :-1], axis=2)` is used to reshape the data into a format suitable for machine learning models, particularly those dealing with sequences, such as Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs). Let's break down why this transformation is necessary:

#### 1. Reshaping the Input:
- `merged_data.values[:, :-1]` selects all rows and every column except the last one from the `merged_data` dataframe. This extracts the feature columns (months 0 to 32) and excludes the training target, which is the last column (month 33).
  
#### 2. Adding an Extra Dimension:
- `np.expand_dims(..., axis=2)` introduces a new dimension at the third axis (`axis=2`). This operation transforms a 2D array (rows and columns) into a 3D array (rows, columns, and depth).

#### 3. Why Add an Extra Dimension?
- **Sequential/Time-Series Data**: Sequence models (e.g., RNNs, LSTMs, 1D CNNs) expect 3D input in which the second axis indexes the time steps and the third axis holds the features observed at each step. Here each row is one shop-item pair, each of the 33 columns is one month, and the new axis holds the single feature per month (the sales count).
- **Shape for Model Input**: Adding this extra dimension gives the data the shape `(samples, timesteps, features)`, here `(n_rows, 33, 1)`. Without it the array is 2D, and these layers reject it.

#### Summary:
This transformation prepares the data for models requiring 3D input by adding an extra dimension, enabling the model to process the data correctly, especially for tasks like time-series forecasting or sequence modeling.
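
The effect on shapes can be seen on a small stand-in array. The array `toy` is an assumption that mimics `merged_data.values` with two shop-item pairs and 34 monthly columns; the slicing mirrors the notebook's `X_train`, `y_train`, and `X_test`.

```python
import numpy as np

# Stand-in for merged_data.values: 2 shop-item pairs x 34 months (illustrative data)
toy = np.arange(2 * 34).reshape(2, 34)

# Inputs: months 0..32, with a trailing axis of size 1 for the single feature
X = np.expand_dims(toy[:, :-1], axis=2)

# Target: month 33, kept 2D so it has shape (samples, 1)
y = toy[:, -1:]

# Shifted window for the next month: months 1..33, same length as X
X_next = np.expand_dims(toy[:, 1:], axis=2)

print(X.shape, y.shape, X_next.shape)
```

Both input windows have 33 time steps, so a model fitted on `X` can be applied to `X_next` without any change to its input layer.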


In [None]:
# Training target: the last monthly column (date_block_num 33)
y_train=merged_data.values[:,-1:]

In [None]:
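# Test input: drop the first month so the 33-step window ends at the latest month (columns 1 to 33)
X_test=np.expand_dims(merged_data.values[:,1:],axis = 2)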

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import optimizers
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Conv1D,MaxPooling1D
from tensorflow.keras.layers import Dense,Dropout,LSTM,RepeatVector,TimeDistributed,Flatten

In [None]:
'''import tensorflow as tf
from tensorflow.keras.layers import Input

# Initialize the model
model_lstm = tf.keras.Sequential()

# Add an Input layer to specify the input shape
model_lstm.add(Input(shape=(X_train.shape[1], X_train.shape[2])))

# Add LSTM layer
model_lstm.add(tf.keras.layers.LSTM(units=64))

# Add Dropout layer
model_lstm.add(tf.keras.layers.Dropout(0.4))

# Add Dense layer
model_lstm.add(tf.keras.layers.Dense(1))

# Compile the model
model_lstm.compile(loss='mse', optimizer='SGD', metrics=['mean_squared_error'])

# Display the model summary
model_lstm.summary()'''


In [None]:
X_train.shape[2]

In [None]:
'''history_lstm=model_lstm.fit(X_train,y_train,batch_size=4096,epochs=20)'''

In [None]:
'''import matplotlib.pyplot as plt
plt.plot(history_lstm.history['loss'],color='b',label="Training Loss")
plt.legend(loc='best',shadow=True)'''

In [None]:
'''submission_pfs=model_lstm.predict(X_test)
submission_pfs=submission_pfs.clip(0,20)
submission=pd.DataFrame({'ID':test_data['ID'],'item_cnt_month':submission_pfs.ravel()})
submission.to_csv('submission.csv',index=False)'''


In [None]:
'''final=pd.read_csv('submission.csv')
final.head()'''

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Input

# Initialize the model
model_lstm = tf.keras.Sequential()

# Add an Input layer to specify the input shape
model_lstm.add(Input(shape=(X_train.shape[1], X_train.shape[2])))

# Add LSTM layer with 128 units
model_lstm.add(tf.keras.layers.LSTM(units=128))

# Add Dropout layer with a dropout rate of 0.2
model_lstm.add(tf.keras.layers.Dropout(0.2))

# Add Dense layer for output
model_lstm.add(tf.keras.layers.Dense(1))

# Compile the model with Adam optimizer and learning rate of 0.001
model_lstm.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), metrics=['mean_squared_error'])

# Display the model summary
model_lstm.summary()


In [None]:
history_lstm=model_lstm.fit(X_train,y_train,batch_size=4096,epochs=20)

In [None]:
import matplotlib.pyplot as plt
plt.plot(history_lstm.history['loss'],color='b',label="Training Loss")
plt.legend(loc='best',shadow=True)

In [None]:
submission_pfs=model_lstm.predict(X_test)
submission_pfs=submission_pfs.clip(0,20)
submission=pd.DataFrame({'ID':test_data['ID'],'item_cnt_month':submission_pfs.ravel()})
submission.to_csv('submission.csv',index=False)


# Hyperparameter tuning
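
When comparing hyperparameter settings, scoring each candidate on held-out data rather than on the data it was trained on gives a fairer comparison. The sketch below shows that selection loop with a deliberately simple stand-in model (closed-form ridge regression in NumPy, not the LSTM used in this notebook) on synthetic data; every name in it is an assumption made for illustration.

```python
import itertools
import numpy as np

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=200)

# Hold out the last 20% of rows for validation
split = int(0.8 * len(X))
X_tr, X_val = X[:split], X[split:]
y_tr, y_val = y[:split], y[split:]

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression weights (stand-in for model training)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

param_grid = {"alpha": [0.01, 1.0, 100.0]}
best_params, best_val_mse = None, float("inf")

# Try every combination and keep the one with the lowest validation MSE
for values in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    w = fit_ridge(X_tr, y_tr, **params)
    val_mse = float(np.mean((X_val @ w - y_val) ** 2))
    if val_mse < best_val_mse:
        best_params, best_val_mse = params, val_mse

print(best_params, best_val_mse)
```

The same pattern carries over to a Keras model by fitting on the training split and scoring on the validation split inside the loop.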

In [None]:
'''import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import ParameterGrid

# Prepare your data (make sure to replace these with your actual data)
# X_train, y_train = ...

# Define hyperparameters to tune
param_grid = {
    'units': [32, 64, 128, 256],
    'dropout_rate': [0.2, 0.3, 0.4, 0.5],
    'learning_rate': [0.001, 0.01, 0.1]
}

# Store the best model and lowest MSE
best_model = None
lowest_mse = float('inf')

# Loop through all combinations of hyperparameters
for params in ParameterGrid(param_grid):
    print(f'Testing parameters: {params}')

    # Initialize the model
    model_lstm = Sequential()

    # Input layer
    model_lstm.add(Input(shape=(X_train.shape[1], X_train.shape[2])))

    # LSTM layer with variable units
    model_lstm.add(LSTM(units=params['units'], return_sequences=True))
    model_lstm.add(Dropout(params['dropout_rate']))

    # Add another LSTM layer
    model_lstm.add(LSTM(units=params['units'] // 2))  # Half the units for the second layer
    model_lstm.add(Dropout(params['dropout_rate']))

    # Dense output layer
    model_lstm.add(Dense(1))

    # Compile the model with variable learning rate
    model_lstm.compile(loss='mse', optimizer=Adam(learning_rate=params['learning_rate']), metrics=['mean_squared_error'])

    # Train the model
    model_lstm.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)

    # Evaluate the model
    mse = model_lstm.evaluate(X_train, y_train, verbose=0)[0]
    print(f'MSE for parameters {params}: {mse}')

    # Check if this model is the best one
    if mse < lowest_mse:
        lowest_mse = mse
        best_model = model_lstm
        best_params = params  # remember which combination produced it

print(f'Best model parameters: {best_params}')
print(f'Lowest MSE achieved: {lowest_mse}')
'''

In [None]:
'''#2nd model Multilayer Perceptron
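adam=optimizers.Adam()

model_mlp=tf.keras.Sequential()
# X_train is 3D (samples, timesteps, 1), so flatten it before the Dense layers
model_mlp.add(tf.keras.layers.Input(shape=(X_train.shape[1],X_train.shape[2])))
model_mlp.add(tf.keras.layers.Flatten())
model_mlp.add(tf.keras.layers.Dense(100,activation='relu'))
model_mlp.add(tf.keras.layers.Dropout(0.4))
model_mlp.add(tf.keras.layers.Dense(1))

# Pass the optimizer object created above rather than the string 'adam'
model_mlp.compile(loss='mse',optimizer=adam,metrics=['mean_squared_error'])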
model_mlp.summary()'''

In [None]:
'''history_mlp=model_mlp.fit(X_train,y_train,batch_size=4096,epochs=20)'''

In [None]:
'''plt.plot(history_mlp.history['loss'],color='b',label='Training Loss')
plt.legend(loc='best',shadow=True)'''

In [None]:
'''submission_pfs=model_mlp.predict(X_test)
submission_pfs=submission_pfs.clip(0,20)
submission=pd.DataFrame({'ID':test_data['ID'],'item_cnt_month':submission_pfs.ravel()})
submission.to_csv('submission.csv',index=False)'''

In [None]:
'''final=pd.read_csv('submission.csv')
final.head()'''

# CNN

In [None]:
'''adam=optimizers.Adam()
model_cnn=tf.keras.Sequential()
model_cnn.add(Conv1D(filters=64,kernel_size=2,activation='relu',input_shape=(X_train.shape[1],X_train.shape[2])))
model_cnn.add(MaxPooling1D(pool_size=2))
model_cnn.add(Flatten())
model_cnn.add(Dense(150,activation='relu'))
model_cnn.add(Dense(1))
model_cnn.compile(loss='mse',optimizer=adam)
model_cnn.summary()'''

In [None]:
'''cnn_history=model_cnn.fit(X_train,y_train,epochs=20,verbose=2)
'''

In [None]:
'''plt.plot(cnn_history.history['loss'],color='b',label='Training Loss')
plt.legend(loc='best',shadow=True)'''

In [None]:
'''submission_pfs=model_cnn.predict(X_test)
submission_pfs=submission_pfs.clip(0,20)
submission=pd.DataFrame({'ID':test_data['ID'],'item_cnt_month':submission_pfs.ravel()})
submission.to_csv('submission.csv',index=False)'''