<a href="https://colab.research.google.com/github/AshinDevUA/GNN/blob/main/GNN_code_grocery_sales_forecasting_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### ***Step 1: Mounting Google Drive***
* Mount Google Drive to access datasets and save outputs.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# pandas for data manipulation and analysis (e.g., working with DataFrames).
import pandas as pd

# Plotly Express for quick charts and visualizations (aliased as Plex throughout this notebook).
import plotly.express as Plex

# NumPy for numerical operations (e.g., array manipulation and mathematical functions).
import numpy as np

# to_categorical from Keras for encoding labels as one-hot vectors (typically used for classification tasks).
from keras.utils import to_categorical

# LinearRegression from scikit-learn for fitting a linear regression model.
from sklearn.linear_model import LinearRegression

# graph_objects from Plotly for more customizable, advanced plots.
import plotly.graph_objects as go

# mean_squared_log_error and mean_absolute_error from scikit-learn for evaluating model performance.
from sklearn.metrics import mean_squared_log_error, mean_absolute_error

### ***Step 2: Data Preparation and Merging for Favorita Grocery Sales Forecasting***
This section prepares and merges the datasets for a sales forecasting project based on the Corporación Favorita data. The steps are:

* Dataset Paths and Loading:

Paths to all required CSV files are specified.
The datasets (holidays_events, items, oil, sample_submission, stores, transactions, train, and test) are read with pd.read_csv.

* Handling Data Types:

The onpromotion column in the train dataset is explicitly read as a string (dtype={'onpromotion': str}) to avoid mixed-type issues.

* Datetime Conversion:

Date columns in train, oil, holidays_events, and transactions are converted to datetime format for consistency and easier filtering.

* Filling Missing Oil Prices:

Missing values in the dcoilwtico column of the oil dataset are forward-filled with .ffill(), then back-filled with .bfill() to cover any gap at the start of the series.

* Date Range Filtering:

Specific date ranges define the training, validation, and test sets:
Train: March 1, 2017, to June 21, 2017.
Validation: June 28, 2017, to July 13, 2017.
Test: July 19, 2017, to July 23, 2017.
All three subsets are filtered from the train dataset using these ranges.

* Merging Datasets:

A function merge_datasets joins the lookup tables (stores, items, oil, holidays_events, transactions) to the main data (train, validation, or test) with left joins.
Merges use the shared columns store_nbr, item_nbr, and date.

* Resultant Dataset Shapes:

The shapes of the merged training, validation, and test datasets are printed to verify the merges.

In [None]:
# Creating a directory called 'dataset'
!mkdir 'dataset'

In [None]:
# Verify the directory was created
!ls -l

total 468656
drwxr-xr-x 2 root root      4096 Dec  9 10:06 dataset
-rw-r--r-- 1 root root      1898 Dec 11  2019 holidays_events.csv.7z
-rw-r--r-- 1 root root     14315 Dec 11  2019 items.csv.7z
-rw-r--r-- 1 root root      3762 Dec 11  2019 oil.csv.7z
-rw-r--r-- 1 root root    666528 Dec 11  2019 sample_submission.csv.7z
-rw-r--r-- 1 root root       648 Dec 11  2019 stores.csv.7z
-rw-r--r-- 1 root root   4885065 Dec 11  2019 test.csv.7z
-rw-r--r-- 1 root root 474092593 Dec 11  2019 train.csv.7z
-rw-r--r-- 1 root root    219499 Dec 11  2019 transactions.csv.7z


In [None]:
%cd 'dataset'
!unzip '/content/drive/MyDrive/project/favorita-grocery-sales-forecasting'

/content/dataset/dataset/dataset
Archive:  /content/drive/MyDrive/project/favorita-grocery-sales-forecasting.zip
  inflating: holidays_events.csv.7z  
  inflating: items.csv.7z            
  inflating: oil.csv.7z              
  inflating: sample_submission.csv.7z  
  inflating: stores.csv.7z           
  inflating: test.csv.7z             
  inflating: train.csv.7z            
  inflating: transactions.csv.7z     


In [None]:
import os
# Paths to the .csv files
datasets_paths = {
    'holidays_events': '/content/drive/MyDrive/project/holidays_events.csv',
    'items': '/content/drive/MyDrive/project/items.csv',
    'oil': '/content/drive/MyDrive/project/oil.csv',
    'sample_submission': '/content/drive/MyDrive/project/sample_submission.csv',
    'stores': '/content/drive/MyDrive/project/stores.csv',
    'train': '/content/drive/MyDrive/project/train.csv',
    'test': '/content/drive/MyDrive/project/test.csv',
    'transactions': '/content/drive/MyDrive/project/transactions.csv'
}

In [None]:
# Read the datasets directly using pd.read_csv
holidays_events = pd.read_csv(datasets_paths['holidays_events'], encoding='ISO-8859-1')
items = pd.read_csv(datasets_paths['items'], encoding='ISO-8859-1')
oil = pd.read_csv(datasets_paths['oil'], encoding='ISO-8859-1')
sample_submission = pd.read_csv(datasets_paths['sample_submission'], encoding='ISO-8859-1')
stores = pd.read_csv(datasets_paths['stores'], encoding='ISO-8859-1')

# Specify dtype to handle mixed types
train = pd.read_csv(datasets_paths['train'], encoding='ISO-8859-1', dtype={'onpromotion': str})
test = pd.read_csv(datasets_paths['test'], encoding='ISO-8859-1')
transactions = pd.read_csv(datasets_paths['transactions'], encoding='ISO-8859-1')
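Reading `onpromotion` as a string avoids the mixed-type issue but leaves text values in the column. A minimal sketch, using a small inline CSV invented for illustration (not the real `train.csv`), of how such a column could afterwards be mapped to a nullable boolean while parsing `date` at load time:

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for train.csv
sample_csv = io.StringIO(
    "id,date,store_nbr,item_nbr,unit_sales,onpromotion\n"
    "0,2013-01-01,25,103665,7.0,\n"
    "1,2017-03-01,1,105574,10.0,False\n"
    "2,2017-03-01,1,105575,18.0,True\n"
)

# Same dtype override as the cell above, plus date parsing during the read
sample = pd.read_csv(sample_csv, dtype={'onpromotion': str}, parse_dates=['date'])

# Map the text values to booleans; an empty field stays missing
sample['onpromotion'] = (
    sample['onpromotion'].map({'True': True, 'False': False}).astype('boolean')
)
```

Whether the extra conversion is worthwhile depends on how many rows in the chosen date range actually have a missing promotion flag.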

In [None]:
holidays_events.head()

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False


In [None]:
items.head()

Unnamed: 0,item_nbr,family,class,perishable
0,96995,GROCERY I,1093,0
1,99197,GROCERY I,1067,0
2,103501,CLEANING,3008,0
3,103520,GROCERY I,1028,0
4,103665,BREAD/BAKERY,2712,1


In [None]:
stores.head()

Unnamed: 0,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4


In [None]:
train.tail()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion
125497035,125497035,2017-08-15,54,2089339,4.0,False
125497036,125497036,2017-08-15,54,2106464,1.0,True
125497037,125497037,2017-08-15,54,2110456,192.0,False
125497038,125497038,2017-08-15,54,2113914,198.0,True
125497039,125497039,2017-08-15,54,2116416,2.0,False


In [None]:
test.head()

Unnamed: 0,id,date,store_nbr,item_nbr,onpromotion
0,125497040,2017-08-16,1,96995,False
1,125497041,2017-08-16,1,99197,False
2,125497042,2017-08-16,1,103501,False
3,125497043,2017-08-16,1,103520,False
4,125497044,2017-08-16,1,103665,False


In [None]:
transactions.head()

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922


In [None]:
# Rename the 'type' column in both holidays_events and stores so the two do not clash after merging

In [None]:
holidays_events = holidays_events.rename(columns={'type': 'event_type'})

In [None]:
stores = stores.rename(columns={'type': 'store_type'})

In [None]:
holidays_events.head()

Unnamed: 0,date,event_type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False


In [None]:
holidays_events['event_type'].unique()

array(['Holiday', 'Transfer', 'Additional', 'Bridge', 'Work Day', 'Event'],
      dtype=object)

In [None]:
print(holidays_events.isnull().sum())

date           0
event_type     0
locale         0
locale_name    0
description    0
transferred    0
dtype: int64


In [None]:
stores.head()

Unnamed: 0,store_nbr,city,state,store_type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4


In [None]:
# Convert 'date' columns to datetime format
train['date'] = pd.to_datetime(train['date'])
oil['date'] = pd.to_datetime(oil['date'])
holidays_events['date'] = pd.to_datetime(holidays_events['date'])
transactions['date'] = pd.to_datetime(transactions['date'])

In [None]:
# Use ffill() to fill NaN values
oil['dcoilwtico'] = oil['dcoilwtico'].ffill()

In [None]:
print(oil.isnull().sum())

date          0
dcoilwtico    0
dtype: int64


In [None]:
oil.head(5)

Unnamed: 0,date,dcoilwtico
0,2013-01-01,93.14
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2


In [None]:
# Use bfill() to fill NaN values
oil['dcoilwtico'] = oil['dcoilwtico'].bfill()

In [None]:
print(oil.isnull().sum())

date          0
dcoilwtico    0
dtype: int64
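Forward fill can only copy an earlier value forward, so a gap at the very start of a series survives it; the backward fill above covers that case. A toy illustration with made-up prices (not taken from `oil.csv`):

```python
import numpy as np
import pandas as pd

# Made-up prices with a leading gap and an interior gap
prices = pd.Series([np.nan, 93.14, np.nan, 92.97])

after_ffill = prices.ffill()      # interior gap filled, leading gap remains
after_both = after_ffill.bfill()  # leading gap takes the next known value
```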


In [None]:
# Define date ranges for filtering
train_start_date = '2017-03-01'
train_end_date = '2017-06-21'
validation_start_date = '2017-06-28'
validation_end_date = '2017-07-13'
test_start_date = '2017-07-19'
test_end_date = '2017-07-23'

# Filter train dataset by date ranges
train_filtered = train[(train['date'] >= train_start_date) & (train['date'] <= train_end_date)]
validation_filtered = train[(train['date'] >= validation_start_date) & (train['date'] <= validation_end_date)]
test_filtered = train[(train['date'] >= test_start_date) & (train['date'] <= test_end_date)]
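The three date windows above leave gaps between them (June 22 to 27 and July 14 to 18 are not used). The following sketch, on an invented daily frame rather than the real data, shows one way to express the same inclusive filters with `Series.between` and count the days that fall into each split:

```python
import pandas as pd

# Made-up frame with one row per day over the whole period of interest
demo = pd.DataFrame({'date': pd.date_range('2017-03-01', '2017-07-23', freq='D')})

splits = {
    'train': ('2017-03-01', '2017-06-21'),
    'validation': ('2017-06-28', '2017-07-13'),
    'test': ('2017-07-19', '2017-07-23'),
}

# between() is inclusive on both ends by default, matching the >= and <= filters
parts = {
    name: demo[demo['date'].between(start, end)]
    for name, (start, end) in splits.items()
}
day_counts = {name: len(part) for name, part in parts.items()}

# Days that belong to no split at all
unused_days = len(demo) - sum(day_counts.values())
```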

In [None]:
# Verify the filtered datasets
print("Training Data:")
print(train_filtered.head())

print("Validation Data:")
print(validation_filtered.head())

print("Testing Data:")
print(test_filtered.head())

Training Data:
                  id       date  store_nbr  item_nbr  unit_sales onpromotion
107758056  107758056 2017-03-01          1    105574        10.0       False
107758057  107758057 2017-03-01          1    105575        18.0       False
107758058  107758058 2017-03-01          1    105737         3.0       False
107758059  107758059 2017-03-01          1    106716         2.0       False
107758060  107758060 2017-03-01          1    108698         6.0       False
Validation Data:
                  id       date  store_nbr  item_nbr  unit_sales onpromotion
120336876  120336876 2017-06-28          1     99197         3.0       False
120336877  120336877 2017-06-28          1    103520         3.0       False
120336878  120336878 2017-06-28          1    105574         5.0       False
120336879  120336879 2017-06-28          1    105575        11.0       False
120336880  120336880 2017-06-28          1    105577         3.0       False
Testing Data:
                  id       dat

In [None]:
# Left-join the lookup tables onto the sales rows so no sales record is dropped
def merge_datasets(data):
    # Store and item attributes
    data = data.merge(stores, on='store_nbr', how='left')
    data = data.merge(items, on='item_nbr', how='left')
    # Daily oil price and holiday/event information, matched on date
    data = data.merge(oil, on='date', how='left')
    data = data.merge(holidays_events, on='date', how='left')
    # Daily transaction counts per store
    data = data.merge(transactions, on=['date', 'store_nbr'], how='left')
    return data

# Apply merging
train_merged = merge_datasets(train_filtered)
validation_merged = merge_datasets(validation_filtered)
test_merged = merge_datasets(test_filtered)

# Print the shapes of the merged datasets
print(f"Merged training dataset shape: {train_merged.shape}")
print(f"Merged validation dataset shape: {validation_merged.shape}")
print(f"Merged test dataset shape: {test_merged.shape}")

Merged training dataset shape: (12051403, 20)
Merged validation dataset shape: (1807983, 20)
Merged test dataset shape: (523365, 20)
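Because every merge is a left join, a lookup table with repeated keys would silently multiply sales rows, for example if holidays_events listed more than one event on the same date. A hedged sketch on invented frames showing how the `validate` argument of `DataFrame.merge` can flag that situation:

```python
import pandas as pd
from pandas.errors import MergeError

# Invented sales rows and an event table with two entries on the same date
sales = pd.DataFrame({
    'date': pd.to_datetime(['2017-07-23', '2017-07-24']),
    'unit_sales': [3.0, 5.0],
})
events = pd.DataFrame({
    'date': pd.to_datetime(['2017-07-23', '2017-07-23']),
    'description': ['Event A', 'Event B'],
})

# Without validation the left join quietly grows from 2 rows to 3
merged = sales.merge(events, on='date', how='left')

# validate='many_to_one' raises if the right-hand keys are not unique
try:
    sales.merge(events, on='date', how='left', validate='many_to_one')
    duplicate_keys_detected = False
except MergeError:
    duplicate_keys_detected = True
```

Comparing the row count of each filtered subset with its merged counterpart is a simpler check that serves the same purpose.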


In [None]:
train_merged.head()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion,city,state,store_type,cluster,family,class,perishable,dcoilwtico,event_type,locale,locale_name,description,transferred,transactions
0,107758056,2017-03-01,1,105574,10.0,False,Quito,Pichincha,D,13,GROCERY I,1045,0,53.82,,,,,,1873
1,107758057,2017-03-01,1,105575,18.0,False,Quito,Pichincha,D,13,GROCERY I,1045,0,53.82,,,,,,1873
2,107758058,2017-03-01,1,105737,3.0,False,Quito,Pichincha,D,13,GROCERY I,1044,0,53.82,,,,,,1873
3,107758059,2017-03-01,1,106716,2.0,False,Quito,Pichincha,D,13,GROCERY I,1032,0,53.82,,,,,,1873
4,107758060,2017-03-01,1,108698,6.0,False,Quito,Pichincha,D,13,DELI,2644,1,53.82,,,,,,1873


### ***Step 3: Check Data Integrity***
Check for null values in the merged datasets, then handle them: missing oil prices are forward-filled, and missing holiday/event fields are filled with the placeholder 'No Event'.


In [None]:
print(f"Missing values in training set:\n{train_merged.isnull().sum()}")
print(f"Missing values in validation set:\n{validation_merged.isnull().sum()}")
print(f"Missing values in test set:\n{test_merged.isnull().sum()}")

Missing values in training set:
id                     0
date                   0
store_nbr              0
item_nbr               0
unit_sales             0
onpromotion            0
city                   0
state                  0
store_type             0
cluster                0
family                 0
class                  0
perishable             0
dcoilwtico       3601003
event_type      10661224
locale          10661224
locale_name     10661224
description     10661224
transferred     10661224
transactions           0
dtype: int64
Missing values in validation set:
id                    0
date                  0
store_nbr             0
item_nbr              0
unit_sales            0
onpromotion           0
city                  0
state                 0
store_type            0
cluster               0
family                0
class                 0
perishable            0
dcoilwtico       460120
event_type      1589503
locale          1589503
locale_name     1589503
description  

In [None]:
# Forward fill to replace missing oil price values
train_merged['dcoilwtico'] = train_merged['dcoilwtico'].ffill()
validation_merged['dcoilwtico'] = validation_merged['dcoilwtico'].ffill()
test_merged['dcoilwtico'] = test_merged['dcoilwtico'].ffill()

In [None]:
# Fill missing holiday/event-related columns with default value 'No Event'
train_merged[['event_type', 'locale', 'locale_name', 'description', 'transferred']] = \
    train_merged[['event_type', 'locale', 'locale_name', 'description', 'transferred']].fillna('No Event')

validation_merged[['event_type', 'locale', 'locale_name', 'description', 'transferred']] = \
    validation_merged[['event_type', 'locale', 'locale_name', 'description', 'transferred']].fillna('No Event')

test_merged[['event_type', 'locale', 'locale_name', 'description', 'transferred']] = \
    test_merged[['event_type', 'locale', 'locale_name', 'description', 'transferred']].fillna('No Event')
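Filling `transferred` with the string 'No Event' mixes text into a boolean column, which is why it is reported as `object` further down. One possible alternative, shown on invented rows and not what this notebook does, is to record a separate flag for "an event matched" and keep `transferred` boolean. The column name `has_event` is hypothetical:

```python
import numpy as np
import pandas as pd

# Invented rows: one with a holiday attached, one without
demo = pd.DataFrame({
    'event_type': ['Holiday', np.nan],
    'transferred': [False, np.nan],
})

# Record whether any event matched before the gaps are filled
demo['has_event'] = demo['event_type'].notna()

# Text columns can still take the placeholder label
demo['event_type'] = demo['event_type'].fillna('No Event')

# True only where the original value was True; missing becomes False
demo['transferred'] = demo['transferred'].eq(True)
```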


In [None]:
# Saving to Google Drive directory
train_merged.to_csv('/content/drive/MyDrive/project/train_merged.csv', index=False)
validation_merged.to_csv('/content/drive/MyDrive/project/validation_merged.csv', index=False)
test_merged.to_csv('/content/drive/MyDrive/project/test_merged.csv', index=False)

### ***Step 4: Data Overview***
* Display the shape of datasets.
* Use .info() to inspect data types and non-null counts.
* Use .describe() for statistical summaries.

In [None]:
# Read the datasets back from Google Drive
TrainSetData = pd.read_csv('/content/drive/MyDrive/project/train_merged.csv', low_memory=False)
ValidationSetData = pd.read_csv('/content/drive/MyDrive/project/validation_merged.csv', low_memory=False)
TestSetData = pd.read_csv('/content/drive/MyDrive/project/test_merged.csv', low_memory=False)
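A CSV round trip does not preserve the datetime dtype, so `date` comes back as text unless it is parsed again. A small sketch with an in-memory buffer standing in for the Drive files (the two rows are invented):

```python
import io
import pandas as pd

original = pd.DataFrame({
    'date': pd.to_datetime(['2017-03-01', '2017-03-02']),
    'unit_sales': [10.0, 18.0],
})

# Write to an in-memory buffer instead of Google Drive for this illustration
buffer = io.StringIO()
original.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)                         # date comes back as text
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['date'])  # date comes back as datetime64
```

Passing `parse_dates=['date']` to the three `pd.read_csv` calls above would have the same effect on the merged files.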

In [None]:
print("Shape of TrainSetData     :", TrainSetData.shape)
print("Shape of ValidationSetData:", ValidationSetData.shape)
print("Shape of TestSetData      :", TestSetData.shape)

Shape of TrainSetData     : (12051403, 20)
Shape of ValidationSetData: (1807983, 20)
Shape of TestSetData      : (523365, 20)


In [None]:
print(f"Missing values in training set:\n{TrainSetData.isnull().sum()}")
print(f"Missing values in validation set:\n{ValidationSetData.isnull().sum()}")
print(f"Missing values in test set:\n{TestSetData.isnull().sum()}")

Missing values in training set:
id              0
date            0
store_nbr       0
item_nbr        0
unit_sales      0
onpromotion     0
city            0
state           0
store_type      0
cluster         0
family          0
class           0
perishable      0
dcoilwtico      0
event_type      0
locale          0
locale_name     0
description     0
transferred     0
transactions    0
dtype: int64
Missing values in validation set:
id              0
date            0
store_nbr       0
item_nbr        0
unit_sales      0
onpromotion     0
city            0
state           0
store_type      0
cluster         0
family          0
class           0
perishable      0
dcoilwtico      0
event_type      0
locale          0
locale_name     0
description     0
transferred     0
transactions    0
dtype: int64
Missing values in test set:
id              0
date            0
store_nbr       0
item_nbr        0
unit_sales      0
onpromotion     0
city            0
state           0
store_type      

In [None]:
TrainSetData.head()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion,city,state,store_type,cluster,family,class,perishable,dcoilwtico,event_type,locale,locale_name,description,transferred,transactions
0,107758056,2017-03-01,1,105574,10.0,False,Quito,Pichincha,D,13,GROCERY I,1045,0,53.82,No Event,No Event,No Event,No Event,No Event,1873
1,107758057,2017-03-01,1,105575,18.0,False,Quito,Pichincha,D,13,GROCERY I,1045,0,53.82,No Event,No Event,No Event,No Event,No Event,1873
2,107758058,2017-03-01,1,105737,3.0,False,Quito,Pichincha,D,13,GROCERY I,1044,0,53.82,No Event,No Event,No Event,No Event,No Event,1873
3,107758059,2017-03-01,1,106716,2.0,False,Quito,Pichincha,D,13,GROCERY I,1032,0,53.82,No Event,No Event,No Event,No Event,No Event,1873
4,107758060,2017-03-01,1,108698,6.0,False,Quito,Pichincha,D,13,DELI,2644,1,53.82,No Event,No Event,No Event,No Event,No Event,1873


In [None]:
TrainSetData.tail()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion,city,state,store_type,cluster,family,class,perishable,dcoilwtico,event_type,locale,locale_name,description,transferred,transactions
12051398,119707611,2017-06-21,54,2088922,3.0,False,El Carmen,Manabi,C,3,GROCERY I,1076,0,42.48,No Event,No Event,No Event,No Event,No Event,658
12051399,119707612,2017-06-21,54,2089036,1.0,False,El Carmen,Manabi,C,3,GROCERY I,1034,0,42.48,No Event,No Event,No Event,No Event,No Event,658
12051400,119707613,2017-06-21,54,2089339,5.0,False,El Carmen,Manabi,C,3,GROCERY I,1006,0,42.48,No Event,No Event,No Event,No Event,No Event,658
12051401,119707614,2017-06-21,54,2103250,2.0,True,El Carmen,Manabi,C,3,BEAUTY,4254,0,42.48,No Event,No Event,No Event,No Event,No Event,658
12051402,119707615,2017-06-21,54,2106464,1.0,True,El Carmen,Manabi,C,3,BEVERAGES,1148,0,42.48,No Event,No Event,No Event,No Event,No Event,658


In [None]:
ValidationSetData.head()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion,city,state,store_type,cluster,family,class,perishable,dcoilwtico,event_type,locale,locale_name,description,transferred,transactions
0,120336876,2017-06-28,1,99197,3.0,False,Quito,Pichincha,D,13,GROCERY I,1067,0,44.74,No Event,No Event,No Event,No Event,No Event,1906
1,120336877,2017-06-28,1,103520,3.0,False,Quito,Pichincha,D,13,GROCERY I,1028,0,44.74,No Event,No Event,No Event,No Event,No Event,1906
2,120336878,2017-06-28,1,105574,5.0,False,Quito,Pichincha,D,13,GROCERY I,1045,0,44.74,No Event,No Event,No Event,No Event,No Event,1906
3,120336879,2017-06-28,1,105575,11.0,False,Quito,Pichincha,D,13,GROCERY I,1045,0,44.74,No Event,No Event,No Event,No Event,No Event,1906
4,120336880,2017-06-28,1,105577,3.0,False,Quito,Pichincha,D,13,GROCERY I,1045,0,44.74,No Event,No Event,No Event,No Event,No Event,1906


In [None]:
ValidationSetData.tail()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion,city,state,store_type,cluster,family,class,perishable,dcoilwtico,event_type,locale,locale_name,description,transferred,transactions
1807978,122035614,2017-07-13,54,2088922,6.0,False,El Carmen,Manabi,C,3,GROCERY I,1076,0,46.06,No Event,No Event,No Event,No Event,No Event,683
1807979,122035615,2017-07-13,54,2089339,3.0,False,El Carmen,Manabi,C,3,GROCERY I,1006,0,46.06,No Event,No Event,No Event,No Event,No Event,683
1807980,122035616,2017-07-13,54,2106464,1.0,False,El Carmen,Manabi,C,3,BEVERAGES,1148,0,46.06,No Event,No Event,No Event,No Event,No Event,683
1807981,122035617,2017-07-13,54,2110456,13.0,False,El Carmen,Manabi,C,3,BEVERAGES,1120,0,46.06,No Event,No Event,No Event,No Event,No Event,683
1807982,122035618,2017-07-13,54,2113914,200.0,True,El Carmen,Manabi,C,3,CLEANING,3040,0,46.06,No Event,No Event,No Event,No Event,No Event,683


In [None]:
TestSetData.head()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion,city,state,store_type,cluster,family,class,perishable,dcoilwtico,event_type,locale,locale_name,description,transferred,transactions
0,122566434,2017-07-19,1,99197,2.0,False,Quito,Pichincha,D,13,GROCERY I,1067,0,47.1,No Event,No Event,No Event,No Event,No Event,1797
1,122566435,2017-07-19,1,103520,1.0,False,Quito,Pichincha,D,13,GROCERY I,1028,0,47.1,No Event,No Event,No Event,No Event,No Event,1797
2,122566436,2017-07-19,1,103665,3.0,False,Quito,Pichincha,D,13,BREAD/BAKERY,2712,1,47.1,No Event,No Event,No Event,No Event,No Event,1797
3,122566437,2017-07-19,1,105574,4.0,False,Quito,Pichincha,D,13,GROCERY I,1045,0,47.1,No Event,No Event,No Event,No Event,No Event,1797
4,122566438,2017-07-19,1,105575,12.0,False,Quito,Pichincha,D,13,GROCERY I,1045,0,47.1,No Event,No Event,No Event,No Event,No Event,1797


In [None]:
TestSetData.tail()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion,city,state,store_type,cluster,family,class,perishable,dcoilwtico,event_type,locale,locale_name,description,transferred,transactions
523360,123089794,2017-07-23,54,2106464,1.0,False,El Carmen,Manabi,C,3,BEVERAGES,1148,0,45.78,Holiday,Local,Cayambe,Cantonizacion de Cayambe,False,926
523361,123089795,2017-07-23,54,2108569,3.0,False,El Carmen,Manabi,C,3,GROCERY I,1086,0,45.78,Holiday,Local,Cayambe,Cantonizacion de Cayambe,False,926
523362,123089796,2017-07-23,54,2110456,179.0,False,El Carmen,Manabi,C,3,BEVERAGES,1120,0,45.78,Holiday,Local,Cayambe,Cantonizacion de Cayambe,False,926
523363,123089797,2017-07-23,54,2113343,1.0,False,El Carmen,Manabi,C,3,BEVERAGES,1114,0,45.78,Holiday,Local,Cayambe,Cantonizacion de Cayambe,False,926
523364,123089798,2017-07-23,54,2113914,3.0,True,El Carmen,Manabi,C,3,CLEANING,3040,0,45.78,Holiday,Local,Cayambe,Cantonizacion de Cayambe,False,926


In [None]:
TrainSetData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12051403 entries, 0 to 12051402
Data columns (total 20 columns):
 #   Column        Dtype  
---  ------        -----  
 0   id            int64  
 1   date          object 
 2   store_nbr     int64  
 3   item_nbr      int64  
 4   unit_sales    float64
 5   onpromotion   bool   
 6   city          object 
 7   state         object 
 8   store_type    object 
 9   cluster       int64  
 10  family        object 
 11  class         int64  
 12  perishable    int64  
 13  dcoilwtico    float64
 14  event_type    object 
 15  locale        object 
 16  locale_name   object 
 17  description   object 
 18  transferred   object 
 19  transactions  int64  
dtypes: bool(1), float64(2), int64(7), object(10)
memory usage: 1.7+ GB


In [None]:
ValidationSetData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1807983 entries, 0 to 1807982
Data columns (total 20 columns):
 #   Column        Dtype  
---  ------        -----  
 0   id            int64  
 1   date          object 
 2   store_nbr     int64  
 3   item_nbr      int64  
 4   unit_sales    float64
 5   onpromotion   bool   
 6   city          object 
 7   state         object 
 8   store_type    object 
 9   cluster       int64  
 10  family        object 
 11  class         int64  
 12  perishable    int64  
 13  dcoilwtico    float64
 14  event_type    object 
 15  locale        object 
 16  locale_name   object 
 17  description   object 
 18  transferred   object 
 19  transactions  int64  
dtypes: bool(1), float64(2), int64(7), object(10)
memory usage: 263.8+ MB


In [None]:
TestSetData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 523365 entries, 0 to 523364
Data columns (total 20 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            523365 non-null  int64  
 1   date          523365 non-null  object 
 2   store_nbr     523365 non-null  int64  
 3   item_nbr      523365 non-null  int64  
 4   unit_sales    523365 non-null  float64
 5   onpromotion   523365 non-null  bool   
 6   city          523365 non-null  object 
 7   state         523365 non-null  object 
 8   store_type    523365 non-null  object 
 9   cluster       523365 non-null  int64  
 10  family        523365 non-null  object 
 11  class         523365 non-null  int64  
 12  perishable    523365 non-null  int64  
 13  dcoilwtico    523365 non-null  float64
 14  event_type    523365 non-null  object 
 15  locale        523365 non-null  object 
 16  locale_name   523365 non-null  object 
 17  description   523365 non-null  object 
 18  tran

In [None]:
TrainSetData.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,12051403.0,113721800.0,3437011.0,107758100.0,110770906.5,113681900.0,116694800.0,119707600.0
store_nbr,12051403.0,27.95042,16.22438,1.0,13.0,28.0,43.0,54.0
item_nbr,12051403.0,1162065.0,580617.1,96995.0,691945.0,1178696.0,1501581.0,2112404.0
unit_sales,12051403.0,8.206965,25.79124,-10002.0,2.0,4.0,8.0,17146.0
cluster,12051403.0,8.687854,4.584365,1.0,5.0,9.0,13.0,17.0
class,12051403.0,1971.561,1200.182,1002.0,1058.0,1190.0,2708.0,7780.0
perishable,12051403.0,0.2545377,0.4356011,0.0,0.0,0.0,1.0,1.0
dcoilwtico,12051403.0,48.95658,2.534697,42.48,47.3,48.83,50.54,53.82
transactions,12051403.0,1854.261,1016.059,292.0,1159.0,1508.0,2287.0,6398.0


In [None]:
ValidationSetData.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,1807983.0,121170900.0,479216.307233,120336900.0,120788900.0,121131600.0,121583600.0,122035600.0
store_nbr,1807983.0,28.21997,16.315943,1.0,13.0,29.0,44.0,54.0
item_nbr,1807983.0,1167997.0,586419.820723,96995.0,686036.0,1179580.0,1576313.0,2116416.0
unit_sales,1807983.0,8.138476,22.666561,-274.0,2.0,4.0,8.0,7033.0
cluster,1807983.0,8.711412,4.563887,1.0,5.0,9.0,13.0,17.0
class,1807983.0,1970.046,1197.425657,1002.0,1056.0,1190.0,2708.0,7780.0
perishable,1807983.0,0.2558254,0.436324,0.0,0.0,0.0,1.0,1.0
dcoilwtico,1807983.0,45.30683,0.711627,44.25,44.74,45.48,46.02,46.06
transactions,1807983.0,1851.103,1005.681769,427.0,1147.0,1537.0,2281.0,5664.0


In [None]:
TestSetData.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,523365.0,122828100.0,151082.606155,122566400.0,122697300.0,122828100.0,122959000.0,123089798.0
store_nbr,523365.0,28.30047,16.353602,1.0,13.0,29.0,44.0,54.0
item_nbr,523365.0,1169200.0,587435.337295,96995.0,686036.0,1209718.0,1576332.0,2127114.0
unit_sales,523365.0,8.017112,23.135861,-23.0,2.0,4.0,8.0,5639.0
cluster,523365.0,8.705173,4.567693,1.0,5.0,9.0,13.0,17.0
class,523365.0,1967.93,1194.758229,1002.0,1058.0,1190.0,2702.0,7780.0
perishable,523365.0,0.2585652,0.437847,0.0,0.0,0.0,1.0,1.0
dcoilwtico,523365.0,46.21664,0.564353,45.78,45.78,45.78,46.73,47.1
transactions,523365.0,1854.725,1029.870174,474.0,1142.0,1517.0,2239.0,5294.0
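The `min` of `unit_sales` is negative in all three summaries; the Kaggle data description attributes negative values to returns. This matters for the imported `mean_squared_log_error`, which rejects negative targets. A common workaround, offered here as an assumption rather than something this section of the notebook does, is to clip targets at zero before applying a log-based metric. The sketch below uses invented numbers and computes the metric by hand with NumPy:

```python
import numpy as np

# Invented sales values, including one negative entry
unit_sales = np.array([-3.0, 0.0, 4.0, 8.0])
predictions = np.array([0.0, 1.0, 3.0, 9.0])

# Clip negatives to zero so log1p is defined for every target
clipped = np.clip(unit_sales, 0, None)

# Root mean squared logarithmic error computed by hand
rmsle = np.sqrt(np.mean((np.log1p(predictions) - np.log1p(clipped)) ** 2))
```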


# ***EDA: Exploratory Data Analysis***

### ***Data Visualization***
Description: Analyze and visualize data to uncover patterns and trends.
* Unit Sales Over Time: Line plots for train, validation, and test datasets.
* Promotion Analysis: Bar charts for average unit sales based on promotion status.
* Sales by Item Family: Sunburst charts to visualize sales contributions by family.

In [None]:
# Function to plot unit sales over time for a given dataset
def plot_sales_over_time(data, title):
    sales_time_series = data.groupby('date')['unit_sales'].sum().reset_index()
    fig = Plex.line(sales_time_series, x='date', y='unit_sales', title=title)
    fig.update_layout(xaxis_title='Date', yaxis_title='Total Unit Sales', xaxis_rangeslider_visible=True)
    fig.show()

# Plot each dataset
plot_sales_over_time(TrainSetData, 'Unit Sales Over Time (Train Data)')
plot_sales_over_time(ValidationSetData, 'Unit Sales Over Time (Validation Data)')
plot_sales_over_time(TestSetData, 'Unit Sales Over Time (Test Data)')

In [None]:
# Function to plot average unit sales based on promotion status for a given dataset
def plot_promo_sales(data, title, color):
    promo_sales = data.groupby('onpromotion')['unit_sales'].mean().reset_index()
    fig = Plex.bar(promo_sales, x='onpromotion', y='unit_sales', title=title, color_discrete_sequence=[color])
    fig.update_layout(xaxis_title='On Promotion', yaxis_title='Average Unit Sales')
    fig.show()

# Plot each dataset with different colors
plot_promo_sales(TrainSetData, 'Average Unit Sales: Promotion vs No Promotion (Train Data)', 'blue')
plot_promo_sales(TestSetData, 'Average Unit Sales: Promotion vs No Promotion (Test Data)', 'green')
plot_promo_sales(ValidationSetData, 'Average Unit Sales: Promotion vs No Promotion (Validation Data)', 'red')
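The summaries in Step 4 show a standard deviation of `unit_sales` roughly three times its mean, with maxima in the thousands, so a group mean can be pulled around by a few extreme rows. A minimal sketch on invented rows of how the median and group size could be reported alongside the mean:

```python
import pandas as pd

# Invented rows with one extreme value in the promoted group
demo = pd.DataFrame({
    'onpromotion': [False, False, False, True, True, True],
    'unit_sales': [2.0, 4.0, 6.0, 3.0, 5.0, 200.0],
})

# One row per promotion status, with mean, median and group size side by side
summary = demo.groupby('onpromotion')['unit_sales'].agg(['mean', 'median', 'count'])
```

In this toy case the promoted group's mean is dominated by the single large value while its median stays close to the non-promoted group.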

In [None]:
# Function to plot unit sales by family in a sunburst chart for a given dataset
def plot_sunburst_sales(data, title, color_sequence):
    sales_by_family = data.groupby(['family'])['unit_sales'].sum().reset_index()
    fig = Plex.sunburst(sales_by_family, path=['family'], values='unit_sales', title=title, color_discrete_sequence=color_sequence)
    fig.show()

# Plot each dataset with different color sequences
plot_sunburst_sales(TrainSetData, 'Unit Sales by Item Family (Train Data)', Plex.colors.qualitative.Pastel)
plot_sunburst_sales(TestSetData, 'Unit Sales by Item Family (Test Data)', Plex.colors.qualitative.Vivid)
plot_sunburst_sales(ValidationSetData, 'Unit Sales by Item Family (Validation Data)', Plex.colors.qualitative.Prism)

### ***Step 12: Advanced Visualizations***
Actions:
* Create sunburst charts for sales by city and family.
* Analyze trends in specific item families or cities.

In [None]:
# Function to plot unit sales by city and family in a sunburst chart for a given dataset
# (named separately so it does not overwrite the family-only plot_sunburst_sales defined earlier)
def plot_sunburst_city_family_sales(data, title, color_sequence):
    sales_by_city_family = data.groupby(['city', 'family'])['unit_sales'].sum().reset_index()
    fig = Plex.sunburst(sales_by_city_family, path=['city', 'family'], values='unit_sales', title=title, color_discrete_sequence=color_sequence)
    fig.show()

# Plot each dataset with different color sequences
plot_sunburst_city_family_sales(TrainSetData, 'Unit Sales by City and Item Family (Train Data)', Plex.colors.qualitative.Pastel)
plot_sunburst_city_family_sales(TestSetData, 'Unit Sales by City and Item Family (Test Data)', Plex.colors.qualitative.Vivid)
plot_sunburst_city_family_sales(ValidationSetData, 'Unit Sales by City and Item Family (Validation Data)', Plex.colors.qualitative.Prism)

In [None]:
# Function to plot top 10 cities by total unit sales in a horizontal bar chart
def plot_top_cities_sales(data, title, color):
    sales_by_city = data.groupby('city')['unit_sales'].sum().sort_values(ascending=False).head(10).reset_index()
    fig = Plex.bar(sales_by_city, x='unit_sales', y='city', title=title, orientation='h', color_discrete_sequence=[color])
    fig.update_layout(xaxis_title='Total Unit Sales', yaxis_title='City')
    fig.show()

# Plot each dataset with different colors
plot_top_cities_sales(TrainSetData, 'Top 10 Cities by Total Unit Sales (Train Data)', 'blue')
plot_top_cities_sales(TestSetData, 'Top 10 Cities by Total Unit Sales (Test Data)', 'green')
plot_top_cities_sales(ValidationSetData, 'Top 10 Cities by Total Unit Sales (Validation Data)', 'red')

In [None]:
# Function to plot unit sales by family and city in a treemap chart for a given dataset
def plot_treemap_sales(data, title, color_sequence):
    sales_by_family_city = data.groupby(['family', 'city'])['unit_sales'].sum().reset_index()
    fig = Plex.treemap(sales_by_family_city, path=['family', 'city'], values='unit_sales', title=title, color_discrete_sequence=color_sequence)
    fig.show()

# Plot each dataset with different color sequences
plot_treemap_sales(TrainSetData, 'Treemap of Unit Sales by Item Family and City (Train Data)', Plex.colors.qualitative.Pastel)
plot_treemap_sales(TestSetData, 'Treemap of Unit Sales by Item Family and City (Test Data)', Plex.colors.qualitative.Vivid)
plot_treemap_sales(ValidationSetData, 'Treemap of Unit Sales by Item Family and City (Validation Data)', Plex.colors.qualitative.Prism)

In [None]:
# Function to plot unit sales by city and store type in a stacked bar chart for a given dataset
def plot_sales_by_city_type(data, title, color_sequence):
    sales_by_city_type = data.groupby(['city', 'store_type'])['unit_sales'].sum().reset_index()
    fig = Plex.bar(sales_by_city_type, x='city', y='unit_sales', color='store_type',
                 title=title, barmode='stack', color_discrete_sequence=color_sequence)
    fig.update_layout(xaxis_title='City', yaxis_title='Total Unit Sales')
    fig.show()

# Plot each dataset with different color sequences
plot_sales_by_city_type(TrainSetData, 'Total Unit Sales by City and Store Type (Train Data)', Plex.colors.qualitative.Pastel)
plot_sales_by_city_type(TestSetData, 'Total Unit Sales by City and Store Type (Test Data)', Plex.colors.qualitative.Vivid)
plot_sales_by_city_type(ValidationSetData, 'Total Unit Sales by City and Store Type (Validation Data)', Plex.colors.qualitative.Prism)

In [None]:
# Function to plot a bubble chart of unit sales by store number and city for a given dataset
def plot_bubble_chart(data, title):
    store_sales = data.groupby(['store_nbr', 'city'])['unit_sales'].sum().reset_index()
    fig = Plex.scatter(store_sales, x='store_nbr', y='city', size='unit_sales',
                     title=title,
                     labels={'store_nbr': 'Store Number', 'city': 'City', 'unit_sales': 'Total Unit Sales'},
                     hover_name='city')  # Adding city name to hover for better context
    fig.update_layout(xaxis_title='Store Number', yaxis_title='City')
    fig.show()

# Plot each dataset
plot_bubble_chart(TrainSetData, 'Bubble Chart of Unit Sales by Store Location (Train Data)')
plot_bubble_chart(TestSetData, 'Bubble Chart of Unit Sales by Store Location (Test Data)')
plot_bubble_chart(ValidationSetData, 'Bubble Chart of Unit Sales by Store Location (Validation Data)')

### ***Group By Date for Time Series Aggregation***

In [None]:
# Group by 'date' and sum 'unit_sales' for each dataset
train_time_series = TrainSetData.groupby('date')['unit_sales'].sum().reset_index()
validation_time_series = ValidationSetData.groupby('date')['unit_sales'].sum().reset_index()
test_time_series = TestSetData.groupby('date')['unit_sales'].sum().reset_index()

In [None]:
train_time_series

Unnamed: 0,date,unit_sales
0,2017-03-01,1008521.710
1,2017-03-02,836225.179
2,2017-03-03,882639.775
3,2017-03-04,1125736.347
4,2017-03-05,1196983.690
...,...,...
108,2017-06-17,1096133.551
109,2017-06-18,965144.121
110,2017-06-19,791146.394
111,2017-06-20,787326.717


In [None]:
validation_time_series

Unnamed: 0,date,unit_sales
0,2017-06-28,731896.985
1,2017-06-29,630811.803
2,2017-06-30,802273.139
3,2017-07-01,1207529.922
4,2017-07-02,1296379.217
5,2017-07-03,1850286.818
6,2017-07-04,832359.286
7,2017-07-05,844301.613
8,2017-07-06,700272.01
9,2017-07-07,805792.302


In [None]:
test_time_series

Unnamed: 0,date,unit_sales
0,2017-07-19,767978.778
1,2017-07-20,688288.068
2,2017-07-21,782418.299
3,2017-07-22,932902.047
4,2017-07-23,1024288.741


### ***Convert Date Column to Datetime and Set as Index for Time Series Data***

In [None]:
# Convert the 'date' column in the training time series data to a datetime object
train_time_series['date'] = pd.to_datetime(train_time_series['date'])

# Set the 'date' column as the index for the training time series data
train_time_series = train_time_series.set_index('date')

# Convert the 'date' column in the test time series data to a datetime object
test_time_series['date'] = pd.to_datetime(test_time_series['date'])

# Set the 'date' column as the index for the test time series data
test_time_series = test_time_series.set_index('date')

# Convert the 'date' column in the validation time series data to a datetime object
validation_time_series['date'] = pd.to_datetime(validation_time_series['date'])

# Set the 'date' column as the index for the validation time series data
validation_time_series = validation_time_series.set_index('date')

### ***Standardize Time Series Data Using StandardScaler***

In [None]:
from sklearn.preprocessing import StandardScaler

# Assuming 'unit_sales' is the column you want to normalize
scaler = StandardScaler()

# Fit the scaler on the training set only, so no validation or test statistics leak into it.
train_scaled = scaler.fit_transform(train_time_series[['unit_sales']])

# The same scaler should be used to transform the test and validation data.
validation_scaled = scaler.transform(validation_time_series[['unit_sales']])
test_scaled = scaler.transform(test_time_series[['unit_sales']])

# Convert the scaled data back into DataFrames
train_scaled_df = pd.DataFrame(train_scaled, columns=['unit_sales'], index=train_time_series.index)
validation_scaled_df = pd.DataFrame(validation_scaled, columns=['unit_sales'], index=validation_time_series.index)
test_scaled_df = pd.DataFrame(test_scaled, columns=['unit_sales'], index=test_time_series.index)

# Print the first few rows of the scaled dataframes to confirm
print(train_scaled_df.head())
print(validation_scaled_df.head())
print(test_scaled_df.head())


            unit_sales
date                  
2017-03-01    0.726178
2017-03-02   -0.212777
2017-03-03    0.040166
2017-03-04    1.364956
2017-03-05    1.753229
            unit_sales
date                  
2017-06-28   -0.781329
2017-06-29   -1.332208
2017-06-30   -0.397804
2017-07-01    1.810702
2017-07-02    2.294899
            unit_sales
date                  
2017-07-19   -0.584696
2017-07-20   -1.018982
2017-07-21   -0.506006
2017-07-22    0.314077
2017-07-23    0.812102
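
To make the scaling step concrete, here is a small sketch of the same computation done by hand with NumPy. The arrays are made-up values, not Favorita data. It assumes the default `StandardScaler` behavior, which subtracts the training mean and divides by the population standard deviation (`ddof=0`).

```python
import numpy as np

# Toy "training" and "validation" series, for illustration only
train = np.array([10.0, 20.0, 30.0, 40.0])
val = np.array([25.0, 50.0])

# Statistics come from the training data only
mu = train.mean()
sigma = train.std()          # population std (ddof=0), matching StandardScaler

train_z = (train - mu) / sigma
val_z = (val - mu) / sigma   # validation reuses the training mean and std

# Inverting the transform recovers the original units
val_back = val_z * sigma + mu

print(train_z)
print(val_z)
print(val_back)
```

Because the validation and test sets reuse the training statistics, their scaled values need not have zero mean or unit variance, which is consistent with the outputs above.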


### ***Create Sliding Window Dataset for Time Series Forecasting***

In [None]:
import numpy as np

def create_sliding_window_dataset(data, window_size):
    X = []  # List to store feature windows
    y = []  # List to store target values

    # Loop through the data, creating sliding windows
    for i in range(len(data) - window_size):
        # Append a window of 'unit_sales' values as features
        X.append(data.iloc[i:i + window_size]['unit_sales'].values)
        # Append the next 'unit_sales' value as the target
        y.append(data.iloc[i + window_size]['unit_sales'])

    # Convert lists to NumPy arrays for machine learning compatibility
    return np.array(X), np.array(y)


# Define the size of the sliding window
window_size = 4

# Create sliding window datasets for training set, validation set, and test set
X_trn, y_trn = create_sliding_window_dataset(train_scaled_df, window_size)
X_val, y_val = create_sliding_window_dataset(validation_scaled_df, window_size)
X_tst, y_tst = create_sliding_window_dataset(test_scaled_df, window_size)

# Print the shapes of the feature and target datasets to verify
print("X_trn shape:", X_trn.shape)  # Shape of training features
print("y_trn shape:", y_trn.shape)  # Shape of training targets
print("X_val shape:", X_val.shape)  # Shape of validation features
print("y_val shape:", y_val.shape)  # Shape of validation targets
print("X_tst shape:", X_tst.shape)  # Shape of testing features
print("y_tst shape:", y_tst.shape)  # Shape of testing targets

X_trn shape: (109, 4)
y_trn shape: (109,)
X_val shape: (12, 4)
y_val shape: (12,)
X_tst shape: (1, 4)
y_tst shape: (1,)
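
The windowing logic can be checked on a short toy series. The sketch below is a vectorized alternative to the loop, not the notebook's own function; `sliding_windows` is a hypothetical name, and it assumes NumPy 1.20 or later (for `sliding_window_view`) and a series at least `window_size` long.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sliding_windows(values, window_size):
    """Each row of X is one window of consecutive values; y is the value that follows it."""
    values = np.asarray(values)
    # All windows of length window_size; drop the last one, which has no following target
    X = sliding_window_view(values, window_size)[:-1]
    y = values[window_size:]
    return X, y

series = np.arange(7)    # toy series: 0, 1, ..., 6
X, y = sliding_windows(series, window_size=4)

print(X)
print(y)
print(X.shape, y.shape)
```

A series of length `n` yields `n - window_size` samples, which is why the five-day test series above produces a single window.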


In [None]:
X_tst

array([[-0.58469623, -1.01898241, -0.50600594,  0.31407725]])

### ***Reshape Data for LSTM or CNN Model Input***

In [None]:
# Reshape the training data to add a third dimension
# The new shape will be (number of samples, window size, 1)
X_train_re = X_trn.reshape(X_trn.shape[0], X_trn.shape[1], 1)

# Reshape the validation data to add a third dimension
# This is needed for compatibility with models like LSTMs or CNNs that expect 3D input
X_val_re = X_val.reshape(X_val.shape[0], X_val.shape[1], 1)

# Reshape the test data to add a third dimension
# The third dimension typically represents features per time step
X_test_re = X_tst.reshape(X_tst.shape[0], X_tst.shape[1], 1)

In [None]:
X_test_re

array([[[-0.58469623],
        [-1.01898241],
        [-0.50600594],
        [ 0.31407725]]])
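
As a side note, the trailing feature axis can be added in several equivalent ways. The sketch below uses a made-up array to compare the explicit `reshape` call with two common NumPy idioms.

```python
import numpy as np

X = np.arange(8, dtype=float).reshape(2, 4)   # toy array: 2 samples, window of 4

a = X.reshape(X.shape[0], X.shape[1], 1)      # form used in the cell above
b = X[..., np.newaxis]                        # append a length-1 axis by indexing
c = np.expand_dims(X, axis=-1)                # same result with a named helper

print(a.shape, b.shape, c.shape)
```

All three produce the layout `(samples, time steps, features)` with one feature per time step.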

### ***One-Hot Encode Target Variables for Classification***

In [None]:
# Convert the training target data into one-hot encoded format
# `to_categorical` transforms integer labels into a binary matrix representation
# Note: y_trn, y_val, and y_tst hold standardized continuous values, so they are cast to
# integers here; the regression models below train on y_trn and y_val directly.
y_train1 = to_categorical(y_trn)

# Convert the validation target data into one-hot encoded format
y_val1 = to_categorical(y_val)

# Convert the test target data into one-hot encoded format
y_test1 = to_categorical(y_tst)

In [None]:
print("y_train1 shape:", y_train1.shape)
print("y_val1 shape  :", y_val1.shape)
print("y_test1 shape :", y_test1.shape)

y_train1 shape: (109, 4)
y_val1 shape  : (12, 6)
y_test1 shape : (1, 1)


In [None]:
y_test1

array([[1.]], dtype=float32)
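
The column counts above (4, 6, and 1) are consistent with how `to_categorical` treats non-integer input: it expects integer class indices, so float targets appear to be cast to integers first, and the number of columns becomes the largest integer plus one. The sketch below mimics that cast with NumPy on a few made-up values. It approximates the behavior for illustration and is not the Keras implementation.

```python
import numpy as np

# Made-up standardized targets, similar in range to the scaled unit_sales values
y_scaled = np.array([-0.78, -1.33, 0.81, 1.81, 2.29, 5.31])

as_int = y_scaled.astype(int)    # truncates toward zero: the fractional part is lost
n_columns = as_int.max() + 1     # width a one-hot matrix would need

print(as_int)
print(n_columns)
```

Distinct sales levels collapse into a handful of integer buckets, so these one-hot matrices do not carry the information a regression target needs.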

# ***Model DL: Graph Neural Network (GNN) with CNN Layers***

In [None]:
# Import TensorFlow and Keras modules for building the model
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Define the GNN-based model
def build_gnn_model(input_shape):
    """
    Builds a neural network model with convolutional and attention layers.

    Args:
        input_shape (tuple): The shape of the input data.

    Returns:
        keras.Model: An uncompiled Keras model.
    """
    # Define the input layer with the specified shape
    inputs = keras.Input(shape=input_shape)

    # Add a 1D convolutional layer for feature extraction with ReLU activation
    x = layers.Conv1D(filters=64, kernel_size=3, activation='relu')(inputs)

    # Add a multi-head self-attention layer to capture dependencies in the sequence
    attention = layers.MultiHeadAttention(num_heads=8, key_dim=64)(x, x)

    # Add a residual connection and combine it with the attention output
    x = layers.Add()([x, attention])

    # Normalize the output using LayerNormalization
    x = layers.LayerNormalization()(x)

    # Add a max pooling layer to reduce the length of the sequence
    x = layers.MaxPooling1D(pool_size=2)(x)

    # Flatten the output to prepare it for dense layers
    x = layers.Flatten()(x)

    # Add a dense layer with 50 units and ReLU activation
    x = layers.Dense(50, activation='relu')(x)

    # Add the final dense layer with 1 unit (for regression tasks)
    outputs = layers.Dense(1)(x)

    # Create the Keras model with defined inputs and outputs
    model = keras.Model(inputs=inputs, outputs=outputs)

    # Return the uncompiled model; it is compiled after construction
    return model

# Specify the input shape for the model based on training data
input_shape = (X_train_re.shape[1], 1)

# Initialize the GNN model. The builder has its own name so that re-running
# this cell does not overwrite the function with the model instance.
GNN_Model = build_gnn_model(input_shape)

# Compile the GNN model with the specified optimizer and loss function.
GNN_Model.compile(optimizer='adam', loss='mse')

# Train the GNN model using the training data.
# Parameters:
# - X_train_re: Input features for training.
# - y_trn: Target labels for training.
# - epochs=50: Number of complete passes through the training dataset.
# - batch_size=32: Number of samples per gradient update.
# - validation_data=(X_val_re, y_val): Validation data (features and labels) used for performance monitoring during training.
GNN_Model.fit(X_train_re, y_trn, epochs=50, batch_size=32, validation_data=(X_val_re, y_val))

Epoch 1/50
...
Epoch 50/50


<keras.src.callbacks.History at 0x7aac081c0cd0>
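
With a window of only 4 time steps, it helps to trace how the sequence length shrinks through the layers of the model above. The sketch below does that arithmetic in plain Python, assuming Keras defaults (`padding='valid'`, convolution stride 1, pooling stride equal to `pool_size`). The helper names `conv1d_out_len` and `pool1d_out_len` are hypothetical.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

def pool1d_out_len(length, pool_size):
    """Output length of 1D max pooling with 'valid' padding and stride == pool_size."""
    return (length - pool_size) // pool_size + 1

window_size = 4                                            # time steps per sample
after_conv = conv1d_out_len(window_size, kernel_size=3)    # Conv1D(64, 3): (4, 1) -> (2, 64)
after_attn = after_conv                                    # attention, Add, LayerNorm keep the shape
after_pool = pool1d_out_len(after_attn, pool_size=2)       # MaxPooling1D(2): (2, 64) -> (1, 64)
flat_units = after_pool * 64                               # Flatten: (1, 64) -> 64 values

print(after_conv, after_pool, flat_units)
```

Under these assumptions the attention layer sees only two positions, and a `window_size` below 4 would leave the pooling layer with no output steps.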

### ***Calculate RMSLE and RMALE for Deep Learning Model Evaluation***

In [None]:
# Predict the output using the trained GNN model on the validation data.
y_predGNN = GNN_Model.predict(X_val_re)

# Inverse transform the predicted values to their original scale using the scaler.
# Reshaping the predictions to match the expected input shape for the scaler.
y_pred_GNN = scaler.inverse_transform(y_predGNN.reshape(-1, 1))

# Inverse transform the true values to their original scale using the scaler.
y_true_GNN = scaler.inverse_transform(y_val.reshape(-1, 1))

# Calculate the RMSLE (Root Mean Squared Logarithmic Error) between the true and predicted values.
# This metric is useful for regression tasks where the target values have a skewed distribution.
rmsle2 = np.sqrt(mean_squared_log_error(y_true_GNN, y_pred_GNN))

# Print the RMSLE result.
print("RMSLE:", rmsle2)

# Define a function to calculate the Mean Absolute Log Error (MALE) between the true and predicted values.
# MALE computes the average of the absolute differences between the log-transformed true and predicted values.
def mean_absolute_log_error(y_true, y_pred):
    """Calculate the Mean Absolute Log Error (MALE)."""
    return np.mean(np.abs(np.log(y_true + 1) - np.log(y_pred + 1)))

# Calculate the RMALE (Root Mean Absolute Log Error) between the true and predicted values.
rmale2 = np.sqrt(mean_absolute_log_error(y_true_GNN, y_pred_GNN))

# Print the RMALE result.
print("RMALE:", rmale2)


RMSLE: 0.22464034315237263
RMALE: 0.3824572168407713
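
The two metrics can be checked by hand on values whose logarithms are easy to follow. In the sketch below the inputs are made up so that `log(1 + y)` is a whole number. `rmsle` matches the definition used by `mean_squared_log_error`, while `rmale` follows the definition in this notebook and is not a standard scikit-learn metric.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root of the mean squared difference between log(1 + y) values."""
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

def rmale(y_true, y_pred):
    """Root of the mean absolute difference between log(1 + y) values."""
    return np.sqrt(np.mean(np.abs(np.log1p(y_true) - np.log1p(y_pred))))

# log1p(e - 1) = 1 and log1p(e**3 - 1) = 3, so the log differences are [-2, 0]
y_true = np.array([np.e - 1, np.e - 1])
y_pred = np.array([np.e ** 3 - 1, np.e - 1])

print(rmsle(y_true, y_pred))   # sqrt((4 + 0) / 2)
print(rmale(y_true, y_pred))   # sqrt((2 + 0) / 2)
```

Both metrics work on ratios rather than absolute differences, so an error of a fixed percentage counts the same for a low-sales day as for a high-sales day.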


### ***Visualize Actual vs Predicted Values of the GNN Model with Plotly***

In [None]:
# Import the plotly.graph_objects library to create interactive visualizations.
import plotly.graph_objects as go

# Create a new figure object for the plot.
fig = go.Figure()

# Add a trace for the actual values (true values), plotted as lines with markers.
# `y_true_GNN.flatten()` flattens the array to a 1D array for plotting.
fig.add_trace(go.Scatter(y=y_true_GNN.flatten(),
                         mode='lines+markers', name='Actual'))

# Add a trace for the predicted values, also plotted as lines with markers.
# `y_pred_GNN.flatten()` flattens the predicted array for plotting.
fig.add_trace(go.Scatter(y=y_pred_GNN.flatten(),
                         mode='lines+markers', name='Predicted'))

# Update the layout of the plot: setting titles for the plot and axes.
fig.update_layout(title='Actual vs. Predicted Values',
                  xaxis_title='Time Step',
                  yaxis_title='Sales')

# Display the figure with the added traces and layout settings.
fig.show()

# ***Model DL: Hybrid (CNN-LSTM-GRU)***

In [None]:
# Import necessary libraries from TensorFlow and Keras for building the model.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Conv1D, LSTM, GRU, Dense, Flatten, Input, MultiHeadAttention, LayerNormalization, Add
)
from tensorflow.keras.models import Model
from sklearn.metrics import mean_squared_log_error, mean_absolute_error

# Define the input shape based on the training data (number of time steps, features).
input_shape = (X_train_re.shape[1], 1)

# Define the input layer with the specified input shape.
inputs = Input(shape=input_shape)

# Add a 1D convolutional layer with 64 filters, kernel size of 2, and ReLU activation.
conv = Conv1D(filters=64, kernel_size=2, activation='relu')(inputs)

# Apply multi-head attention to the convolutional output, using 8 heads and a key dimension matching the last dimension of the conv layer.
attention = MultiHeadAttention(num_heads=8, key_dim=conv.shape[-1])(conv, conv)

# Add the original convolutional output back to the attention output (residual connection) and normalize the result.
attention = Add()([attention, conv])
attention = LayerNormalization()(attention)

# Add an LSTM layer with 50 units and ReLU activation, keeping the sequence output for the next layer.
lstm = LSTM(50, activation='relu', return_sequences=True)(attention)

# Add a GRU layer with 50 units and ReLU activation, returning only the last output (not a sequence).
gru = GRU(50, activation='relu')(lstm)

# Add a dense layer with 1 unit to produce the final output (scalar value for regression).
outputs = Dense(1)(gru)

# Create the model using the input and output layers.
model = Model(inputs, outputs)

# Compile the model using the Adam optimizer and mean squared error loss function.
model.compile(optimizer='adam', loss='mse')

# Train the model using the training data, specifying 50 epochs and a batch size of 32.
# Use validation data to monitor performance during training.
model.fit(X_train_re, y_trn, epochs=50, batch_size=32, validation_data=(X_val_re, y_val))

Epoch 1/50
...
Epoch 50/50


<keras.src.callbacks.History at 0x7aa40c5f12a0>

### ***Calculate RMSLE and RMALE for Deep Learning Model Evaluation***

In [None]:
# Predict the output using the trained model on the validation data.
y_pred = model.predict(X_val_re)

# Inverse transform the predicted values to their original scale using the scaler.
# Reshaping the predictions to match the expected input shape for the scaler.
y_pred_original = scaler.inverse_transform(y_pred.reshape(-1, 1))

# Inverse transform the true values to their original scale using the scaler.
y_true_original = scaler.inverse_transform(y_val.reshape(-1, 1))

# Calculate the RMSLE (Root Mean Squared Logarithmic Error) between the true and predicted values.
# RMSLE is useful for regression tasks where the target values have a skewed distribution.
rmsle = np.sqrt(mean_squared_log_error(y_true_original, y_pred_original))

# Print the RMSLE result.
print("RMSLE:", rmsle)

# Define a function to calculate the Mean Absolute Log Error (MALE) between the true and predicted values.
# MALE computes the average of the absolute differences between the log-transformed true and predicted values.
def mean_absolute_log_error(y_true, y_pred):
    """Calculate the Mean Absolute Log Error (MALE)."""
    return np.mean(np.abs(np.log(y_true + 1) - np.log(y_pred + 1)))

# Calculate the RMALE (Root Mean Absolute Log Error) between the true and predicted values.
rmale = np.sqrt(mean_absolute_log_error(y_true_original, y_pred_original))

# Print the RMALE result.
print("RMALE:", rmale)

RMSLE: 0.26584635479500407
RMALE: 0.4017295583153643
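
For a quick side-by-side reading of the two evaluation cells, the printed validation metrics can be collected in a small dictionary. The sketch below uses only the numbers reported in this notebook; the dictionary layout and the descriptive labels are illustrative choices.

```python
# Validation metrics as printed by the two evaluation cells in this notebook
results = {
    'GNN (Conv1D + attention)': {'RMSLE': 0.22464034315237263, 'RMALE': 0.3824572168407713},
    'Hybrid (CNN-LSTM-GRU)': {'RMSLE': 0.26584635479500407, 'RMALE': 0.4017295583153643},
}

# Lower is better for both metrics; pick the model with the smallest RMSLE
best = min(results, key=lambda name: results[name]['RMSLE'])

# Relative RMSLE gap of the hybrid model with respect to the first model
gap = results['Hybrid (CNN-LSTM-GRU)']['RMSLE'] / results['GNN (Conv1D + attention)']['RMSLE'] - 1

print(best)
print(f"{gap:.1%}")
```

The validation set yields only 12 windows, so a gap of this size may not be stable across training runs or random seeds.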


### ***Visualize Actual vs Predicted Unit Sales with Plotly***

In [None]:
# Create a new figure object for the plot.
fig = go.Figure()

# Add a trace for the actual values (true values) of unit sales.
# Flatten the array to a 1D array for plotting and display it as a line.
fig.add_trace(go.Scatter(y=y_true_original.flatten(),
                         mode='lines',  # Display as a line plot
                         name='Actual'))  # Label the trace as 'Actual'

# Add a trace for the predicted values (model predictions) of unit sales.
# Flatten the array to a 1D array for plotting and display it as a line.
fig.add_trace(go.Scatter(y=y_pred_original.flatten(),
                         mode='lines',  # Display as a line plot
                         name='Predicted'))  # Label the trace as 'Predicted'

# Update the layout of the plot: setting titles for the plot and axes.
fig.update_layout(title='Actual vs Predicted Values',  # Title of the plot
                  xaxis_title='Time Step',  # Label for the x-axis
                  yaxis_title='Unit Sales')  # Label for the y-axis

# Display the figure with the added traces and layout settings.
fig.show()