## Part 1: Data Ingestion and Exploratory Data Analysis (EDA)
-------------------------------------------------------------

In this section, we will focus on **ingesting** the Rossmann Store Sales dataset and conducting a comprehensive **Exploratory Data Analysis (EDA)** prior to applying time series modeling and forecasting techniques. The workflow will proceed through the following steps:

1. **Import Libraries and Dependencies:** Load all necessary Python libraries and packages required for data manipulation, visualization, and analysis.

2. **Data Ingestion:** Load the Rossmann Store Sales dataset into the working environment for analysis.

3. **Exploratory Data Analysis (EDA):** Perform a detailed examination of the dataset to understand its structure and key characteristics:

    - **Inspect dataset metadata**: data types, number of observations (rows), and variables (columns)

    - **Identify and quantify missing values**

    - **Detect and handle duplicate records**

    - **Generate summary statistics** (mean, median, standard deviation, etc.)

    - **Analyze individual features and their distributions**

    - **Apply feature engineering techniques to enhance model readiness**

    - **Evaluate feature correlations to identify relationships**

    - **Visualize data using appropriate plots and charts**

    - **Conduct deeper analysis to uncover trends, patterns, and seasonality within the time series**

## 1. Setup & Library Imports
---------------------------------------

In [1]:
import time 

In [2]:
# Step 1: Setup and Import Libraries
print("Step 1: Setup and Import Libraries in progress...")
time.sleep(1)  # Simulate processing time

Step 1: Setup and Import Libraries in progress...


In [3]:
# Data Manipulation & Processing
import os
import holidays
import pandas as pd
import numpy as np
from pathlib import Path
import scipy.stats as stats
from datetime import datetime
from sklearn.preprocessing import *

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.2f}'.format)

# Warnings
import warnings
warnings.simplefilter('ignore')
warnings.filterwarnings('ignore')

print("="*60)
print("Rossmann Store Sales Time Series Analysis - Part 1")
print("="*60)
print("All libraries imported successfully!")

Rossmann Store Sales Time Series Analysis - Part 1
All libraries imported successfully!


In [4]:
print("✅ Setup and Import Libraries completed.\n")

✅ Setup and Import Libraries completed.



In [5]:
# Start analysis

analysis_begin = pd.Timestamp.now()

bold_start = '\033[1m'
bold_end = '\033[0m'

print("🔍 Analysis Started")
print(f"🟢 Begin Date: {bold_start}{analysis_begin.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}\n")

🔍 Analysis Started
🟢 Begin Date: [1m2025-08-13 21:09:59[0m




## 2. Data Ingestion
----------------------------

In [6]:
# Step 2: Data Ingestion
print("Step 2: Data Ingestion in progress...")
time.sleep(1)  # Simulate processing time

Step 2: Data Ingestion in progress...


In [7]:
# Use absolute path to avoid pwd issues
base_path = Path.home() /"Desktop"/"Time_Series_Analysis"/"data"/"raw"
train_path = base_path /"Retail_train_data.csv"
test_path = base_path / "Retail_test_data.csv"

# Check if files exist
print(f"Looking for files in: {base_path}\n")
print(f"Train file exists: {train_path.exists()}")
print(f"Test file exists: {test_path.exists()}")

if train_path.exists() and test_path.exists():
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)

    print("\nFiles loaded successfully!\n")
    print(f"Train data shape: {train_df.shape}")
    print(f"Test data shape: {test_df.shape}\n")
    print('There are : %s Rows and %s Columns in training data' % (str(train_df.shape[0]) ,str(train_df.shape[1])))
    print('There are : %s Rows and %s Columns in testing data' % (str(test_df.shape[0]) ,str(test_df.shape[1])))
else:
    print("Files not found. Please check the file paths and names.")

Looking for files in: /home/mukwa/Desktop/Time_Series_Analysis/data/raw

Train file exists: True
Test file exists: True

Files loaded successfully!

Train data shape: (982644, 9)
Test data shape: (34565, 9)

There are : 982644 Rows and 9 Columns in training data
There are : 34565 Rows and 9 Columns in testing data


In [8]:
print("✅ Data Ingestion completed.\n")

✅ Data Ingestion completed.
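The existence check above only prints a message when a file is missing, which leaves `train_df` undefined for every later cell. A minimal sketch of a stricter loader follows. `load_sales_csv` is a hypothetical helper, the demo file and its column names are made up for illustration, and `low_memory=False` is used on the assumption that chunked type inference is what produces mixed values such as `'0'` and `0` in a column like `stateholiday`.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_sales_csv(path: Path) -> pd.DataFrame:
    """Hypothetical helper: read one CSV, fail loudly if it is missing,
    and normalise column names and basic dtypes in one place."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Expected data file not found: {path}")

    # low_memory=False makes pandas infer each column's type from the whole file
    df = pd.read_csv(path, low_memory=False)
    df.columns = df.columns.str.lower()
    df["date"] = pd.to_datetime(df["date"])
    df["stateholiday"] = df["stateholiday"].astype(str)
    return df


# Toy demonstration with a temporary file (not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text(
        "Store,Date,Sales,StateHoliday\n"
        "1,2015-06-30,5735,0\n"
        "2,2015-06-30,0,a\n"
    )
    demo_df = load_sales_csv(demo_path)

print(demo_df.dtypes)
```

Raising an exception stops the notebook at the ingestion step, so a wrong path cannot surface later as a confusing `NameError`.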



## 3. Exploratory Data Analysis (EDA)
## 3.1. Basic Inspection
-------------------

In [9]:
# Step 3: Exploratory Data Analysis (EDA)
print("📊 Step 3: Exploratory Data Analysis in progress...")
time.sleep(1)  # Simulate processing time

📊 Step 3: Exploratory Data Analysis in progress...


In [10]:
train_df.columns = train_df.columns.str.lower()
test_df.columns = test_df.columns.str.lower()

# --- BASIC INFO AND DUPLICATES ---
print("DataFrame Info:")
train_df.info()

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 982644 entries, 0 to 982643
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   store          982644 non-null  int64 
 1   dayofweek      982644 non-null  int64 
 2   date           982644 non-null  object
 3   sales          982644 non-null  int64 
 4   customers      982644 non-null  int64 
 5   open           982644 non-null  int64 
 6   promo          982644 non-null  int64 
 7   stateholiday   982644 non-null  object
 8   schoolholiday  982644 non-null  int64 
dtypes: int64(7), object(2)
memory usage: 67.5+ MB


## View or Display Dataset

In [11]:
print("\nTrain Data Preview:")
print(train_df.head(), "\n")


Train Data Preview:
   store  dayofweek        date  sales  customers  open  promo stateholiday  schoolholiday
0      1          2  2015-06-30   5735        568     1      1            0              0
1      2          2  2015-06-30   9863        877     1      1            0              0
2      3          2  2015-06-30  13261       1072     1      1            0              1
3      4          2  2015-06-30  13106       1488     1      1            0              0
4      5          2  2015-06-30   6635        645     1      1            0              0 



In [12]:
train_df['date'].min(), train_df['date'].max()

('2013-01-01', '2015-06-30')
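At this point `date` is still a string column, so `min()` and `max()` work only because ISO `YYYY-MM-DD` strings sort chronologically. The short sketch below uses the two dates and the row count reported above to check how complete the store-by-day panel is. It assumes that every store ID from 1 to 1115 is present, which this notebook has not verified.

```python
from datetime import date

first_day = date(2013, 1, 1)   # minimum date reported above
last_day = date(2015, 6, 30)   # maximum date reported above

n_days = (last_day - first_day).days + 1   # inclusive number of calendar days
n_stores = 1115                            # assumption: all IDs 1..1115 are present
n_rows = 982_644                           # training rows reported at load time

full_panel = n_days * n_stores             # rows if every store had every day
missing_store_days = full_panel - n_rows

print(f"Calendar days covered:    {n_days}")
print(f"Rows in a complete panel: {full_panel}")
print(f"Store-days without a row: {missing_store_days}")
```

Under that assumption the training set is slightly smaller than a complete panel, so some stores have gaps in their series that are worth locating before any time series modeling.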

In [13]:
train_df.sort_values('date', inplace=True, ascending=True)

print("\nTrain Data Preview:")
print(train_df.head(), "\n")
print("\nTest Data Preview:")
print(test_df.tail())


Train Data Preview:
        store  dayofweek        date  sales  customers  open  promo stateholiday  schoolholiday
982643   1115          2  2013-01-01      0          0     0      0            a              1
981908    379          2  2013-01-01      0          0     0      0            a              1
981907    378          2  2013-01-01      0          0     0      0            a              1
981906    377          2  2013-01-01      0          0     0      0            a              1
981905    376          2  2013-01-01      0          0     0      0            a              1 


Test Data Preview:
       store  dayofweek        date  sales  customers  open  promo  stateholiday  schoolholiday
34560   1111          3  2015-07-01   3701        351     1      1             0              1
34561   1112          3  2015-07-01  10620        716     1      1             0              1
34562   1113          3  2015-07-01   8222        770     1      1             0             

In [14]:

test_df.sort_values('date', inplace=True, ascending=True)

print("\nTrain Data Preview:")
print(train_df.head(), "\n")
print("\nTest Data Preview:")
print(test_df.tail())


Train Data Preview:
        store  dayofweek        date  sales  customers  open  promo stateholiday  schoolholiday
982643   1115          2  2013-01-01      0          0     0      0            a              1
981908    379          2  2013-01-01      0          0     0      0            a              1
981907    378          2  2013-01-01      0          0     0      0            a              1
981906    377          2  2013-01-01      0          0     0      0            a              1
981905    376          2  2013-01-01      0          0     0      0            a              1 


Test Data Preview:
     store  dayofweek        date  sales  customers  open  promo  stateholiday  schoolholiday
745    746          5  2015-07-31   9082        638     1      1             0              1
746    747          5  2015-07-31  10708        826     1      1             0              1
747    748          5  2015-07-31   7481        578     1      1             0              1
741   

In [15]:
print("\nMissing Values per Column:")
print(train_df.isna().sum())
print('\nNumber of duplicated rows:', train_df.duplicated().sum())


Missing Values per Column:
store            0
dayofweek        0
date             0
sales            0
customers        0
open             0
promo            0
stateholiday     0
schoolholiday    0
dtype: int64

Number of duplicated rows: 0
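`duplicated()` with no arguments only flags rows that are identical in every column. For a store-by-day panel, a stricter check is whether any `(store, date)` pair occurs more than once with different values. The sketch below illustrates the difference on a small made-up frame; the toy values are not taken from the dataset.

```python
import pandas as pd

# Toy data: store 2 has two conflicting rows for the same date
toy = pd.DataFrame({
    "store": [1, 1, 2, 2],
    "date": ["2015-06-29", "2015-06-30", "2015-06-30", "2015-06-30"],
    "sales": [5000, 5735, 9863, 9900],
})

full_row_dupes = toy.duplicated().sum()                        # identical rows
key_dupes = toy.duplicated(subset=["store", "date"]).sum()     # repeated keys
conflicting = toy[toy.duplicated(subset=["store", "date"], keep=False)]

print(f"Fully duplicated rows: {full_row_dupes}")
print(f"Repeated (store, date) keys: {key_dupes}")
print(conflicting)
```

The same two calls can be run on `train_df` to confirm that each store has at most one row per day.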


## 3.2 Summary Statistics
---------------------

In [16]:
print("\nSummary Statistics:")
print(train_df.describe())


Summary Statistics:
          store  dayofweek     sales  customers      open     promo  schoolholiday
count 982644.00  982644.00 982644.00  982644.00 982644.00 982644.00      982644.00
mean     558.44       4.00   5760.84     632.77      0.83      0.38           0.17
std      321.91       2.00   3857.57     465.40      0.38      0.49           0.38
min        1.00       1.00      0.00       0.00      0.00      0.00           0.00
25%      280.00       2.00   3705.00     403.00      1.00      0.00           0.00
50%      558.00       4.00   5731.00     609.00      1.00      0.00           0.00
75%      838.00       6.00   7847.00     838.00      1.00      1.00           0.00
max     1115.00       7.00  41551.00    7388.00      1.00      1.00           1.00


----------------------------------------------------

### Previous version of the statistical analysis

1. **Store-Level Insights:**
 *  Store IDs range from **1 to 1115**, so the dataset covers up to **1,115** stores.
 *  Store IDs are identifiers, so their mean and standard deviation **(mean ≈ 558, std ≈ 322)** carry no business meaning. Values this close to those of a uniform 1 to 1115 range only indicate that each store contributes a similar number of rows.

2. **Temporal Patterns:**
 *  DayOfWeek ranges from **1 to 7** with a mean of **~4.00**, indicating a fairly uniform distribution of entries across all days.
 *  It is worth examining how sales and customers vary by weekday, including whether any day stands out as a peak or a trough.

3. **Sales & Customer Traffic:**
 *  Average daily sales per entry: **5,760.84** with a very wide spread **(std ≈ 3,857.57)** and a single-day maximum of **€41,551**, a strong hint at major spikes for certain stores or events.
 *  Customer count averages **~633** per day, also with high variability **(max nearly 7,400)**, pointing to uneven foot traffic across locations and days.
 *  Zero-sales and zero-customer days are common **(≈ 17% of records)**. These likely correspond to closed stores or unusual business days and should be accounted for in any forecasting or revenue analysis.
 *  The mean for sales is **5,760.84** and the median **5,731.00**. A mean slightly above the median, together with a maximum far above the 75th percentile, suggests the **sales distribution is right-skewed**.

4. **Store Availability:**
 *  Open flag mean is **0.829**, meaning stores are open on roughly **83%** of store-days.

5. **Promotional Impact:**
 *  38% of entries have active promotions, and sales/customer variability suggests these promotions could have significant but uneven impact.
 *  Worth analyzing sales uplift due to **Promo = 1 compared to Promo = 0**.

6. **School Holidays:**
 *  **Only 17.2%** of entries fall on a school holiday. This share alone says nothing about sales levels; the effect on sales needs a direct comparison of holiday and non-holiday days.
 *  This feature may influence customer counts, especially for stores near schools or those that rely on family shoppers.

--------------------------------------------------------
### Current (updated) version of the statistical analysis

Looking at these retail summary statistics, several key insights emerge about this dataset's business patterns and data quality:

**Business Operations Pattern**
Stores are open on **82.9%** of store-days, which corresponds to roughly **5-6 days per week**. The day-of-week column **(mean ~4.0, std 2.0)** matches what an even spread over days 1 to 7 would give, so every weekday is covered about equally. Whether sales differ by weekday has to be checked against the sales column itself.

**Sales Performance Distribution**
Daily sales **average €5,761** with substantial variation **(std €3,858)**, indicating significant differences between high and low-performing store-days. The distribution appears right-skewed: the mean sits slightly above the **median (€5,731)**, and the **maximum of €41,551** lies far beyond the **75th percentile (€7,847)**, which points to a long upper tail of exceptional days.

**Customer traffic** follows a similar pattern, **averaging 633 customers per day** with a similar **right-skewed distribution**. The sales-to-customer ratio of approximately **€9.10 per customer** (5,760.84 / 632.77) is consistent with an everyday retail environment such as a drugstore chain.

**Promotional and Seasonal Effects**
Promotions are active on **38%** of records, suggesting strategic rather than constant promotional activity. School holidays affect only **17.2%** of the dataset, so most records fall in regular school periods.

**Data Quality Concerns**
The most striking finding is that **168,494 rows (17.15%)** show zero sales, with an almost identical number showing zero customers **(168,492)**. Both counts sit close to the share of closed store-days implied by the "Open" mean (about 17.1%), which suggests that most of them are genuine store closures rather than data collection errors. These closed days pull the overall averages down, so averages meant to describe trading days should be computed on open days only.

**Strategic Implications**
The wide performance range and promotional frequency suggest opportunities for optimizing both timing and targeting of marketing efforts. The substantial day-to-day variation in both sales and customer counts indicates strong external factors **(likely day-of-week effects, promotions, and seasonal patterns)** that could be leveraged for better forecasting and inventory management.

This appears to be a comprehensive retail dataset suitable for time series analysis, promotional effectiveness studies, and customer behavior modeling.
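The ratios quoted in this summary can be reproduced from the `describe()` output with a few lines of arithmetic. The sketch below is only a worked check of those figures. The open-day average additionally assumes that closed store-days always record zero sales, which the notebook has not confirmed at this point.

```python
# Figures copied from the summary statistics above
mean_sales = 5760.84
mean_customers = 632.77
open_rate = 0.829            # mean of the `open` flag as quoted in the text

# Ratio of the two means: average spend per customer
sales_per_customer = mean_sales / mean_customers

# Share of open store-days expressed as days per week
open_days_per_week = open_rate * 7

# Assumption: closed store-days contribute zero sales, so the
# open-day average is the overall mean divided by the open rate
mean_sales_open_days = mean_sales / open_rate

print(f"Sales per customer:        {sales_per_customer:.2f}")
print(f"Open days per week:        {open_days_per_week:.1f}")
print(f"Mean sales on open days:   {mean_sales_open_days:.0f}")
```

The last figure shows how much the closed days pull the overall mean down, which is why open-only averages are the more useful baseline for forecasting.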

## 3.3 Variables Analysis
-------------------------

#### 1. Suspicious Days (Quality Check)

In [17]:
# Days with zero customers and non-zero sales (quality check)
odd_days = train_df[(train_df["customers"] == 0) & (train_df["sales"] > 0)]
print(f"Suspicious days with sales but no customers: {odd_days.shape[0]}")

Suspicious days with sales but no customers: 0


#### 2. Unique days of the week

In [18]:
unique_days = train_df['dayofweek'].sort_values().unique()
print(f"Unique days of the week in the dataset: {unique_days}")

Unique days of the week in the dataset: [1 2 3 4 5 6 7]


#### 3. Unique State Holidays

In [19]:
unique_stateholiday = train_df['stateholiday'].unique()
print(f"Unique State holiday in the dataset: {unique_stateholiday}")

Unique State holiday in the dataset: ['a' '0' 'b' 'c' 0]


In [20]:
# -- The array mixes '0' (string) and 0 (integer), so the column has inconsistent data types, which can break mappings and analysis.

# -- Clean the Data Types
train_df['stateholiday'] = train_df['stateholiday'].astype(str)

# --- Sorted Output with counts
print("State Holiday Value Counts:")
print(train_df['stateholiday'].value_counts().sort_index())

State Holiday Value Counts:
stateholiday
0    951594
a     20260
b      6690
c      4100
Name: count, dtype: int64
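The cleaning step above works because `astype(str)` turns the integer `0` into the string `'0'`, so both spellings collapse into one category. The sketch below shows this on a small made-up Series and adds one caveat worth keeping in mind: a float zero would become `'0.0'` and would not merge with `'0'`.

```python
import pandas as pd

# Toy Series reproducing the mixed values seen in the unique() output above
mixed = pd.Series(["a", "0", "b", "c", 0], dtype=object)

types_before = sorted({type(v).__name__ for v in mixed})   # Python types present
cleaned = mixed.astype(str)
codes_after = sorted(cleaned.unique())

# Caveat: a float zero is rendered as '0.0', not '0'
float_zero = pd.Series([0.0], dtype=object).astype(str).iloc[0]

print("Types before cleaning:", types_before)
print("Codes after cleaning: ", codes_after)
print("Float zero as string: ", float_zero)
```

In this dataset the stray values are integers, so the simple cast is sufficient.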


-------------------------------------------------------
### State Holiday naming convention

 *  **a** stands for **Public Holiday**
 *  **b** stands for **Easter Holiday**
 *  **c** stands for **Christmas Holiday**
 *  **0** or None means **Normal Day**

These codes were defined by the dataset creators to simplify **holiday categorization across German states**. So when we preprocess the data, we map those codes to their actual meanings to make analysis and visualization more intuitive.

-------------------------------------------------------

#### 4. Closed Count

In [21]:
closed_count = (train_df['open'] == 0).sum()
print(f"\nTotal number of times stores were closed: {bold_start}{closed_count}{bold_end}")


Total number of times stores were closed: 168440


#### 5. Opened Count

In [22]:
opened_count = (train_df['open'] == 1).sum()
print(f"\nTotal number of times stores were opened: {bold_start}{opened_count}{bold_end}")


Total number of times stores were opened: 814204


#### 6. Compare Open vs Closed

In [23]:
total_days = len(train_df)

closed_count = (train_df['open'] == 0).sum()
opened_count = (train_df['open'] == 1).sum()

print(f"Opened: {bold_start}{opened_count}{bold_end} times ({bold_start}{(opened_count / total_days) * 100:.2f}%){bold_end}")
print(f"Closed: {bold_start}{closed_count}{bold_end} times ({bold_start}{(closed_count / total_days) * 100:.2f}%){bold_end}")

Opened: [1m814204[0m times ([1m82.86%)[0m
Closed: [1m168440[0m times ([1m17.14%)[0m


#### 7. Check for zero Sales

In [24]:
no_sales_count = (train_df['sales'] == 0).sum()
print(f"\nThere are {bold_start}{no_sales_count}{bold_end} times when stores made no sales.")


There are [1m168494[0m times when stores made no sales.


#### 8. Check for Sales

In [25]:
sales_count = (train_df['sales'] != 0).sum()
print(f"\nThere are {bold_start}{sales_count}{bold_end} times when stores recorded sales.")


There are [1m814150[0m times when stores recorded sales.


#### 9. Compare Sales vs No Sales

In [26]:
total_days = len(train_df)

print(f"Sales: {bold_start}{sales_count}{bold_end}  times ({bold_start}{(sales_count / total_days) * 100:.2f}%){bold_end}")
print(f"No-sales: {bold_start}{no_sales_count}{bold_end} times ({bold_start}{(no_sales_count / total_days) * 100:.2f}%){bold_end}")

Sales: [1m814150[0m  times ([1m82.85%)[0m
No-sales: [1m168494[0m times ([1m17.15%)[0m


#### 10. Customers Check

In [27]:
total_customers = len(train_df)
no_customers_count = (train_df['customers'] == 0).sum()

print(f"\nThere are {bold_start}{no_customers_count}{bold_end} times when no customers visited the stores.")
print(f"No-Customers: {bold_start}{no_customers_count}{bold_end} on ({bold_start}{(no_customers_count / total_customers) * 100:.2f}%){bold_end}")


There are [1m168492[0m times when no customers visited the stores.
No-Customers: [1m168492[0m on ([1m17.15%)[0m


#### 11. Impact of Day of the Week - Average Customers and Sales by Day

In [28]:
day_analysis = (
    train_df.groupby('dayofweek')[['customers', 'sales']]
    .mean()
    .reset_index()
    .rename(columns={'customers': 'avg_customers', 'sales': 'avg_sales'})
)

print(day_analysis)


   dayofweek  avg_customers  avg_sales
0          1         812.93    7797.64
1          2         761.86    7005.52
2          3         721.20    6536.45
3          4         695.78    6216.11
4          5         742.53    6703.50
5          6         658.76    5856.78
6          7          35.58     202.62
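The very low averages for day 7 are computed over all store-days, including closed ones, so they may reflect how often stores are closed rather than how little an open store sells. A minimal sketch of the open-only comparison follows, on made-up numbers. It uses named aggregation, which produces the `avg_` column names without a separate `rename` step.

```python
import pandas as pd

# Toy data: on day 7 only one of four store-days is open
toy = pd.DataFrame({
    "dayofweek": [1, 1, 7, 7, 7, 7],
    "open":      [1, 1, 1, 0, 0, 0],
    "sales":     [8000, 7000, 6000, 0, 0, 0],
    "customers": [800, 700, 900, 0, 0, 0],
})

# Average over every store-day (closed days count as zero)
all_days = toy.groupby("dayofweek").agg(avg_sales=("sales", "mean"))

# Average over open store-days only
open_days = (
    toy[toy["open"] == 1]
    .groupby("dayofweek")
    .agg(avg_customers=("customers", "mean"), avg_sales=("sales", "mean"))
)

print(all_days)
print(open_days)
```

In the toy frame the day 7 average rises from 1,500 to 6,000 once closed days are excluded; the size of the effect on `train_df` has to be measured on the real data.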


#### 12. Impact of Promotion - Average Customers and Sales by Promotion Status

In [29]:
promo_analysis = (
    train_df.groupby('promo')[['customers', 'sales']]
    .mean()
    .reset_index()
    .rename(columns={'customers': 'avg_customers', 'sales': 'avg_sales'})
)

print(promo_analysis)

   promo  avg_customers  avg_sales
0      0         517.45    4397.13
1      1         820.77    7984.12
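The two rows above can be turned into relative uplift figures with simple arithmetic. The sketch below only restates the printed averages; `relative_uplift` is a hypothetical helper name. Because the averages include closed store-days, part of the gap may come from promotions not coinciding with closed days, so the numbers are descriptive rather than causal.

```python
# Averages copied from the promo_analysis output above
promo_means = {
    "no_promo": {"avg_customers": 517.45, "avg_sales": 4397.13},
    "promo":    {"avg_customers": 820.77, "avg_sales": 7984.12},
}


def relative_uplift(with_promo: float, without_promo: float) -> float:
    """Relative change of the promo average over the non-promo average."""
    return with_promo / without_promo - 1.0


sales_uplift = relative_uplift(
    promo_means["promo"]["avg_sales"], promo_means["no_promo"]["avg_sales"]
)
customer_uplift = relative_uplift(
    promo_means["promo"]["avg_customers"], promo_means["no_promo"]["avg_customers"]
)

# Average spend per customer in each group
basket_promo = promo_means["promo"]["avg_sales"] / promo_means["promo"]["avg_customers"]
basket_no_promo = promo_means["no_promo"]["avg_sales"] / promo_means["no_promo"]["avg_customers"]

print(f"Sales uplift:     {sales_uplift:.1%}")
print(f"Customer uplift:  {customer_uplift:.1%}")
print(f"Spend per customer: {basket_no_promo:.2f} without promo, {basket_promo:.2f} with promo")
```

Sales rise faster than customer counts, which suggests that promotions are associated with both more visitors and a larger spend per visitor.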


#### 13. Day of Week Impact

In [30]:
# Average customers and sales by day of the week
dow_analysis = (
    train_df
    .groupby("dayofweek")[["customers", "sales"]]
    .mean()
    .reset_index()
    .rename(columns={"customers": "avg_customers", "sales": "avg_sales"})
)

print(dow_analysis)


   dayofweek  avg_customers  avg_sales
0          1         812.93    7797.64
1          2         761.86    7005.52
2          3         721.20    6536.45
3          4         695.78    6216.11
4          5         742.53    6703.50
5          6         658.76    5856.78
6          7          35.58     202.62


#### 14. SchoolHoliday Impact

In [31]:
schoolholiday_analysis = (
    train_df
    .groupby("schoolholiday")[["customers", "sales"]]
    .mean()
    .reset_index()
    .rename(columns={"customers": "avg_customers", "sales": "avg_sales"})
)

print(schoolholiday_analysis)

   schoolholiday  avg_customers  avg_sales
0              0         619.03    5627.01
1              1         698.96    6405.44


#### 15. Top 10 Crowded Store

In [32]:
top10_crowded_store = (
    train_df.groupby('store')['customers']
    .mean()
    .nlargest(10)
    .reset_index()
    .rename(columns={'customers': 'avg_customers'})
)

print(top10_crowded_store)

   store  avg_customers
0    733        3403.45
1    262        3400.41
2    562        3106.59
3    769        3071.50
4   1114        2652.57
5    817        2605.20
6   1097        2412.39
7    335        2390.88
8    259        2333.87
9    251        2028.02


#### 16. Top 10 Highest-Selling Stores

In [33]:
top10_selling_store = (
    train_df.groupby('store')['sales']
    .mean()
    .nlargest(10)
    .reset_index()
    .rename(columns={'sales': 'avg_sales'})
)

print(top10_selling_store)

   store  avg_sales
0    262   20684.09
1    817   18105.70
2    562   17984.64
3   1114   17097.69
4    251   15814.01
5    513   15133.66
6    842   15117.74
7    733   14945.51
8    788   14930.54
9    383   14294.00


#### 17. Crowded vs. Selling Stores (sorted by avg_sales, descending)

In [34]:
# Outer-merge the two top-10 lists. A store that is missing from one list gets 0 via fillna,
# so 0.00 below means "not in that top 10", not a true average of zero.
comparison_df = (
    pd.merge(top10_crowded_store, top10_selling_store, on='store', how='outer')
    .fillna(0)
    .sort_values(by='avg_sales', ascending=False)
    .reset_index(drop=True)
)

print(comparison_df)

    store  avg_customers  avg_sales
0     262        3400.41   20684.09
1     817        2605.20   18105.70
2     562        3106.59   17984.64
3    1114        2652.57   17097.69
4     251        2028.02   15814.01
5     513           0.00   15133.66
6     842           0.00   15117.74
7     733        3403.45   14945.51
8     788           0.00   14930.54
9     383           0.00   14294.00
10    335        2390.88       0.00
11    259        2333.87       0.00
12    769        3071.50       0.00
13   1097        2412.39       0.00


#### 18. Dual Comparison

In [35]:
comparison_df['top_both'] = (
    comparison_df['store'].isin(top10_crowded_store['store']) &
    comparison_df['store'].isin(top10_selling_store['store'])
)

print(comparison_df)

    store  avg_customers  avg_sales  top_both
0     262        3400.41   20684.09      True
1     817        2605.20   18105.70      True
2     562        3106.59   17984.64      True
3    1114        2652.57   17097.69      True
4     251        2028.02   15814.01      True
5     513           0.00   15133.66     False
6     842           0.00   15117.74     False
7     733        3403.45   14945.51      True
8     788           0.00   14930.54     False
9     383           0.00   14294.00     False
10    335        2390.88       0.00     False
11    259        2333.87       0.00     False
12    769        3071.50       0.00     False
13   1097        2412.39       0.00     False


## 4. Feature Engineering (for Visualization Only)
--------------------------

In [36]:
# Make a copy of the original dataframe to avoid modifying it
df_viz_feat = train_df.copy()

# Ensure date column is in datetime format
if not pd.api.types.is_datetime64_any_dtype(df_viz_feat['date']):
    df_viz_feat['date'] = pd.to_datetime(df_viz_feat['date'])

# Sort by date in ascending order, in place
df_viz_feat.sort_values(by='date', ascending=True, inplace=True)

# Create Time-based Covariates - Basic Temporal Features
df_viz_feat['day'] = df_viz_feat['date'].dt.strftime('%a')
df_viz_feat['week'] = df_viz_feat['date'].dt.isocalendar().week
df_viz_feat['month'] = df_viz_feat['date'].dt.strftime('%b')
df_viz_feat['quarter'] = df_viz_feat['date'].dt.quarter
df_viz_feat['year'] = df_viz_feat['date'].dt.year
df_viz_feat['isweekend']= df_viz_feat['dayofweek'] > 5

print(f"\nDays type distribution:\n{df_viz_feat[['day', 'isweekend']].value_counts()}\n")

print(df_viz_feat['year'].unique())
print(df_viz_feat['year'].dtype)



Days type distribution:
day  isweekend
Tue  False        141204
Mon  False        140270
Fri  False        140270
Sat  True         140270
Sun  True         140270
Thu  False        140270
Wed  False        140090
Name: count, dtype: int64

[2013 2014 2015]
int32
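The `isweekend` rule above relies on the convention that `dayofweek` runs from Monday = 1 to Sunday = 7. A quick consistency check is to derive the weekday from `date` and compare it with the stored column. The sketch below does this for three dates whose `dayofweek` values appear in the previews earlier in this notebook; pandas numbers Monday as 0, hence the `+ 1`.

```python
import pandas as pd

# Dates and dayofweek values as shown in the data previews above
toy = pd.DataFrame({
    "date": ["2013-01-01", "2015-06-30", "2015-07-31"],
    "dayofweek": [2, 2, 5],
})
toy["date"] = pd.to_datetime(toy["date"])

# pandas uses Monday=0 ... Sunday=6, so shift by one to get Monday=1 ... Sunday=7
derived = toy["date"].dt.dayofweek + 1
mismatches = toy[derived != toy["dayofweek"]]

# Three-letter weekday label, equivalent to strftime('%a') in an English locale
toy["day"] = toy["date"].dt.day_name().str[:3]

print(toy)
print(f"Rows where the stored dayofweek disagrees with the date: {len(mismatches)}")
```

If the same comparison on `df_viz_feat` returns no mismatches, `dayofweek > 5` is a safe definition of the weekend.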


#### 4.1. StateHoliday Mapping

In [37]:
# Define the holiday type mapping
holiday_map = {
    "0": "Normal Day",
    "a": "Public",
    "b": "Easter",
    "c": "Christmas"  
}

df_viz_feat['stateholiday']= df_viz_feat['stateholiday'].map(holiday_map)

# Create IsHoliday feature
df_viz_feat['isholiday']= df_viz_feat['stateholiday'] !="Normal Day"

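# isschoolDay feature - simplifying rule: treat every day that is not a Public, Easter, or Christmas holiday as a school day.
# Note: this rule ignores the schoolholiday column and weekends.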
df_viz_feat["isschoolDay"] = ~df_viz_feat["stateholiday"].isin(["Public", "Easter", "Christmas"])

# Mapping Check
unmapped_rows = df_viz_feat[df_viz_feat['stateholiday'].isna()]
print(f"Unmapped rows after mapping:\n{unmapped_rows[['stateholiday']]}")

# Print the count of each holiday type, including any missing (NaN) values for unmapped entries
print(f"\nHoliday type distribution:\n{df_viz_feat['stateholiday'].value_counts(dropna=False)}")

Unmapped rows after mapping:
Empty DataFrame
Columns: [stateholiday]
Index: []

Holiday type distribution:
stateholiday
Normal Day    951594
Public         20260
Easter          6690
Christmas       4100
Name: count, dtype: int64
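The mapping check matters because `Series.map` with a dictionary returns a missing value for any key that is not in the dictionary, without raising an error. The sketch below demonstrates that behaviour with the same `holiday_map` and a made-up extra code `"d"` that does not occur in the dataset.

```python
import pandas as pd

holiday_map = {
    "0": "Normal Day",
    "a": "Public",
    "b": "Easter",
    "c": "Christmas",
}

# Toy codes; "d" is a made-up value with no entry in holiday_map
toy_codes = pd.Series(["0", "a", "b", "c", "d"], name="stateholiday")

mapped = toy_codes.map(holiday_map)
unmapped_codes = sorted(toy_codes[mapped.isna()].unique())

if unmapped_codes:
    print(f"Codes without a mapping: {unmapped_codes}")
else:
    print("All codes mapped.")
```

Listing the offending codes, rather than only the unmapped rows, makes it clear which dictionary entry is missing.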


#### Promo mapping

In [38]:
df_viz_feat['promo'] = df_viz_feat['promo'].astype(str).map({'1': 'Promo', '0': 'No Promo'})

#### Feature Engineering Check

In [39]:
print(f'The shape before feature engineering: {train_df.shape}')
print(f'The shape after adding features : {df_viz_feat.shape}')

The shape before feature engineering: (982644, 9)
The shape after adding features : (982644, 17)


In [40]:
df_viz_feat.head()

        store  dayofweek        date  sales  customers  open     promo  stateholiday  schoolholiday  day  week  month  quarter  year  isweekend  isholiday  isschoolDay
982643   1115          2  2013-01-01      0          0     0  No Promo        Public              1  Tue     1    Jan        1  2013      False       True        False
982640   1112          2  2013-01-01      0          0     0  No Promo        Public              1  Tue     1    Jan        1  2013      False       True        False
982639   1111          2  2013-01-01      0          0     0  No Promo        Public              1  Tue     1    Jan        1  2013      False       True        False
982638   1110          2  2013-01-01      0          0     0  No Promo        Public              1  Tue     1    Jan        1  2013      False       True        False
982637   1109          2  2013-01-01      0          0     0  No Promo        Public              1  Tue     1    Jan        1  2013      False       True        False


In [41]:
# Persist train_df and df_viz_feat so that another JupyterLab notebook can restore them with %store -r
%store train_df
%store df_viz_feat


Stored 'train_df' (DataFrame)
Stored 'df_viz_feat' (DataFrame)
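`%store` keeps the frames in IPython's own storage, which ties the handoff to the same IPython profile. A file-based handoff is one alternative that also works outside Jupyter. The sketch below shows a pickle round trip on a small made-up frame in a temporary directory; the file name is hypothetical, and pickle files should only be loaded from sources you trust.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Toy frame standing in for df_viz_feat
toy = pd.DataFrame({
    "store": [1, 2],
    "date": pd.to_datetime(["2013-01-01", "2013-01-02"]),
    "promo": ["Promo", "No Promo"],
})

with tempfile.TemporaryDirectory() as tmp:
    handoff_path = Path(tmp) / "df_viz_feat.pkl"   # hypothetical file name
    toy.to_pickle(handoff_path)                    # written by the first notebook
    restored = pd.read_pickle(handoff_path)        # read by the second notebook

print(restored.dtypes)
```

Unlike a CSV export, the pickle preserves dtypes such as the parsed `date` column, so the receiving notebook does not need to repeat the conversions.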


---------------------------

In [42]:
print("✅ Data Ingestion and Exploratory Data Analysis completed successfully!")
print(f"🗓️ Analysis Date: {bold_start}{pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")

✅ Data Ingestion and Exploratory Data Analysis completed successfully!
🗓️ Analysis Date: 2025-08-13 21:10:12


--------------------------------

In [43]:
# End analysis
analysis_end = pd.Timestamp.now()
duration = analysis_end - analysis_begin

# Final summary print
print("\n📋 Analysis Summary")
print(f"🟢 Begin Date: {bold_start}{analysis_begin.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")
print(f"✅ End Date:   {bold_start}{analysis_end.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")
print(f"⏱️ Duration:   {bold_start}{str(duration)}{bold_end}")


📋 Analysis Summary
🟢 Begin Date: [1m2025-08-13 21:09:59[0m
✅ End Date:   [1m2025-08-13 21:10:12[0m
⏱️ Duration:   [1m0 days 00:00:12.963469[0m


-------------------------
## Project Design Rationale: Notebook Separation

To promote **clarity, maintainability, and scalability** within the project, **data engineering** and **visualization tasks** are intentionally separated into distinct notebooks. This modular approach prevents the accumulation of excessive code in a single notebook, making it easier to **debug, update, and collaborate across different stages of the workflow**. By isolating data transformation logic from visual analysis, **each notebook remains focused and purpose-driven**, ultimately **enhancing the overall efficiency and readability of the project**.