# Part 2.1: Feature Engineering and Visualizations

This phase enriches the dataset with features that capture **temporal patterns, store-specific behaviors, and customer and sales patterns**. It also includes **visual exploration to uncover trends, seasonality, and anomalies**, laying the groundwork for robust forecasting models.


## 1. Setup & Library Imports
-------------------------------

In [None]:
import time 

In [None]:
# Step 1: Setup & Library Imports
print("Step 1: Setup and Import Libraries started...")
time.sleep(1)  # Simulate processing time

In [None]:
# Data Manipulation & Processing
import math
import numpy as np
import pandas as pd
from pathlib import Path
import scipy.stats as stats
from datetime import datetime
from sklearn.preprocessing import *

# Data Visualization
import seaborn as sbn
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from pandas.plotting import scatter_matrix

sbn.set(rc={'figure.figsize':(14,6)})
plt.style.use('seaborn-v0_8')
sbn.set_palette("husl")

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.2f}'.format)

# Warnings
import warnings
warnings.simplefilter('ignore')
warnings.filterwarnings('ignore')

In [None]:
print("="*60)
print("Rossmann Store Sales Time Series Analysis - Part 2")
print("="*60)
print("All libraries imported successfully!")
print("Analysis Date:", pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S'))


In [None]:
print("✅ Setup and Import Libraries completed.\n")

In [None]:
# Start analysis

data_viz_begin = pd.Timestamp.now()

bold_start = '\033[1m'
bold_end = '\033[0m'

print("🔍 Part 2 Started ...")
print(f"🟢 Begin Date: {bold_start}{data_viz_begin.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}\n")


## Restore the file
----------------------------

In [None]:
%store -r train_df

### View or Display Dataset

In [None]:
print("\nTrain Data Preview:")
print("\n",train_df.head())

# 2. Feature Engineering
-------------------------

In [None]:
# Step 2: Feature Engineering
print("Step 2: Feature Engineering started...")
time.sleep(1)  # Simulate processing time

In [None]:
# Make a copy of the original dataframe to avoid modifying it
df_features = train_df.copy()

#### Promo mapping

In [None]:
df_features['promo'] = df_features['promo'].astype(str).map({'1': 'Promo', '0': 'No Promo'})

#### Check 

In [None]:
value_counts = df_features['promo'].value_counts()
value_counts

#### Define the holiday type mapping

In [None]:
# Define the holiday type mapping
holiday_map = {"0": "Normal Day","a": "Public","b": "Easter", "c": "Christmas"}

#### Spot Unexpected Values

In [None]:
# Before mapping
unexpected_values = df_features[~df_features['stateholiday'].isin(holiday_map.keys())]['stateholiday'].unique()
print(f"Unexpected values before mapping: {unexpected_values}")


#### StateHoliday Mapping

In [None]:
# StateHoliday mapping
df_features['stateholiday']= df_features['stateholiday'].map(holiday_map)

# Create IsHoliday feature
df_features['isholiday']= df_features['stateholiday'] !="Normal Day"

# IsSchool Feature - Rule: assume school is out for Public, Easter, and Christmas Breaks
df_features["isschoolDay"] = ~df_features["stateholiday"].isin(["Public", "Easter", "Christmas"])

# Mapping Check
unmapped_rows = df_features[df_features['stateholiday'].isna()]
print(f"Unmapped rows after mapping:\n{unmapped_rows[['stateholiday']]}")

# Print the count of each holiday type, including any missing (NaN) values for unmapped entries
print(f"\nHoliday type distribution:\n{df_features['stateholiday'].value_counts(dropna=False)}")
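
If the unexpected-values check reports entries such as the integer `0` next to the string `"0"` (this can happen when a CSV column is read with mixed types), one option is to cast the column to string before mapping. The following is a minimal sketch on a toy Series; the sample values, including the stray `"x"`, are illustrative assumptions and are not taken from the dataset.

```python
import pandas as pd

holiday_map = {"0": "Normal Day", "a": "Public", "b": "Easter", "c": "Christmas"}

# Toy column with mixed types (illustrative values only)
raw = pd.Series([0, "0", "a", "b", "c", "x"])

# Cast to string and strip whitespace so that 0 and "0" share the same key
normalized = raw.astype(str).str.strip()
mapped = normalized.map(holiday_map)

# Entries that remain unmapped surface as NaN and are counted explicitly
print(mapped.value_counts(dropna=False))
n_unmapped = int(mapped.isna().sum())
print(f"Unmapped entries: {n_unmapped}")
```

Normalizing before mapping also keeps unmapped rows from being flagged as holidays, since a NaN compared with `"Normal Day"` evaluates as not equal.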

#### Temporal Features

In [None]:
# Ensure date column is in datetime format
if not pd.api.types.is_datetime64_any_dtype(df_features['date']):
    df_features['date'] = pd.to_datetime(df_features['date'])

# Sort by date in ascending order
df_features = df_features.sort_values(by='date', ascending=True)

# Basic Temporal Features
df_features['day'] = df_features['date'].dt.strftime('%a')
df_features['week'] = df_features['date'].dt.isocalendar().week
df_features['month'] = df_features['date'].dt.strftime('%b')
df_features['quarter'] = df_features['date'].dt.quarter
df_features['year'] = df_features['date'].dt.year.astype(int)
df_features['isweekend']= df_features['dayofweek'] > 5

print(f"\nDays type distribution:\n{df_features[['day', 'isweekend']].value_counts()}\n")

print(df_features['year'].unique())
print(df_features['year'].dtype)
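
The `isweekend` rule relies on the encoding of the `dayofweek` column. A small sketch of a cross-check against the `date` column follows, assuming `dayofweek` is encoded as 1 = Monday through 7 = Sunday; that encoding is an assumption to verify against the data, and the toy frame is purely illustrative.

```python
import pandas as pd

# Toy frame: one full week starting on Monday 2015-01-05
toy = pd.DataFrame({"date": pd.date_range("2015-01-05", periods=7, freq="D")})

# Assumed encoding of the dataset's dayofweek column: 1 = Monday ... 7 = Sunday
toy["dayofweek"] = toy["date"].dt.dayofweek + 1

weekend_from_column = toy["dayofweek"] > 5         # rule used in the notebook
weekend_from_date = toy["date"].dt.dayofweek >= 5  # pandas convention: Monday = 0

# Both rules should flag exactly Saturday and Sunday
assert weekend_from_column.equals(weekend_from_date)
print(toy.assign(isweekend=weekend_from_date))
```

If the two rules disagree on the real data, the `dayofweek` column likely uses a different encoding and the threshold should be adjusted.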


#### Feature Engineering Check

In [None]:
print(f'The shape before feature engineering: {train_df.shape}')
print(f'The shape after adding features : {df_features.shape}')

In [None]:
df_features.head()

#### Store Features Dataframe

In [None]:
# To pull df_features from one notebook to another in JupyterLab
%store df_features

In [None]:
print("✅ Feature Engineering completed.\n")

# 3. Data Visualization
----------------------

In [None]:
# Step 3: Data Visualization
print("Step 3: Data Visualization started...")
time.sleep(1)  # Simulate processing time

In [None]:

def export_plotly_chart(fig, name="chart", output_subdir="results_visualization"):
    """
    Exports a Plotly figure to .html and .svg formats under drafts/results_visualization.

    Parameters:
    - fig: Plotly figure object
    - name: Base filename (without extension)
    - output_subdir: Subdirectory under 'drafts/' to save files (default: results_visualization)
    """
    # Find repo root assuming you're inside drafts/<subdir>
    repo_root = Path.cwd().resolve().parents[1]  # e.g., from drafts/data_exploration/
    output_path = repo_root / "drafts" / output_subdir
    output_path.mkdir(parents=True, exist_ok=True)

    # Save HTML (for docs or README links)
    html_path = output_path / f"{name}.html"
    fig.write_html(html_path, include_plotlyjs='cdn')

    # Save SVG (for markdown embedding)
    svg_path = output_path / f"{name}.svg"
    pio.write_image(fig, svg_path, format="svg", width=1200, height=500)

    print(f"✅ Exported to:\n- {html_path}\n- {svg_path}")
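
`export_plotly_chart` assumes the notebook runs exactly two levels below the repository root. A possible alternative is to walk upward until a marker directory is found. The helper `find_repo_root` and the choice of `drafts` as the marker are hypothetical and not part of the original notebook.

```python
from pathlib import Path
import tempfile


def find_repo_root(start: Path, marker: str = "drafts") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Demonstration on a throwaway tree: <tmp>/drafts/data_exploration
with tempfile.TemporaryDirectory() as tmp:
    notebook_dir = Path(tmp) / "drafts" / "data_exploration"
    notebook_dir.mkdir(parents=True)
    root = find_repo_root(notebook_dir)
    found_marker = (root / "drafts").is_dir()
    same_root = root == Path(tmp).resolve()

print(f"Marker found: {found_marker}, root matches: {same_root}")
```

Inside the export function, `find_repo_root(Path.cwd())` could stand in for the fixed `parents[1]` lookup, so the export would keep working if the notebook moved to a different depth.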



### Percentage Distribution per Open

In [None]:

# Count frequency of each unique value
value_counts = df_features['open'].value_counts()

# Map numeric labels to descriptive ones using if-else logic
labels = ["Open" if val == 1 else "Closed" for val in value_counts.index]
values = value_counts.values.tolist()

# Dynamically create 'pull' values to highlight the largest slice
pull = [0.1 if i == 0 else 0 for i in range(len(labels))]

# Create pie chart
fig = go.Figure(data=[go.Pie(
    labels=labels,
    values=values,
    pull=pull,
    textinfo='percent+label',
    hoverinfo='label+value+percent'
)])

# Update layout with left-aligned title
fig.update_layout(
    title_text='📊 Store Status Distribution: Open vs Closed',
    title_x=0.0,  # Left-aligned title
    showlegend=True,
    width=1200,    # Increased width
    height=450    # Increased height
)

fig.show()

export_plotly_chart(fig, name="open_vs_closed_pie")

### Percentage Distribution with Respect to Promo

In [None]:
# Get value counts (Series), extracting both labels and values in matching order
counts = df_features['promo'].value_counts()
labels = counts.index.tolist()
values = counts.values.tolist()

# Create a 'pull' list to highlight the first class, rest unpulled
pull = [0.1 if i == 0 else 0 for i in range(len(labels))]

# Create the pie chart
fig = go.Figure(
    data =[go.Pie(
        labels = labels,
        values = values,
        pull = pull,
        textinfo ='percent+label',
        insidetextorientation ='radial',
    )]
)

# Update the layout to fix title overlap
fig.update_layout(
    title_text ='📊 % Distribution per Promo',
    title_x = 0.5,  # Centered title
    title_font_size = 20,  # Optional: adjust font size
    margin = dict(t = 60, r = 60, b = 60, l = 60),  # Increased top margin
    width = 1200,
    height = 450
)

fig.show()

export_plotly_chart(fig, name="promo_pie")

### Percentage Distribution per Holiday Type

In [None]:

# Get value counts (Series), extracting both labels and values in matching order
counts = df_features['stateholiday'].value_counts()
labels = counts.index.tolist()
values = counts.values.tolist()

# Create a 'pull' list to highlight the first class, rest unpulled
pull = [0.1 if i == 0 else 0 for i in range(len(labels))]

# Create the pie chart
fig = go.Figure(
    data =[go.Pie(
        labels = labels,
        values = values,
        pull = pull,
        textinfo ='percent+label',
        insidetextorientation ='radial',
    )]
)

# Update the layout to fix title overlap
fig.update_layout(
    title_text ='📊 % Distribution per State Holiday',
    title_x = 0.5,  # Centered title
    title_font_size = 20,  # Optional: adjust font size
    margin = dict(t = 60, r = 60, b = 60, l = 60),  # Increased top margin
    width = 1200,
    height = 450
)

fig.show()

export_plotly_chart(fig, name="stateholiday_pie")

### Percentage Distribution per School Holiday

In [None]:
# Efficient value counting and labeling
counts = df_features['schoolholiday'].value_counts().sort_index()
labels = ['No School Holiday', 'School Holiday']  # Assumes 0=no, 1=yes
values = counts.values

# Create donut chart
fig = go.Figure(data=go.Pie(
    labels=labels,
    values=values,
    pull=[0, 0.1],  # Emphasize 'School Holiday' slice
    hole=0.3,
    textinfo='label+percent',
    marker=dict(colors=['#4ECDC4', '#FF6B6B'], line=dict(color='#FFFFFF', width=2))
))

# Layout for clarity and aesthetics
fig.update_layout(
    title_text = '📊 % Distribution per School Holiday',
    title_x = 0.5,
    font = dict(size = 12, family = 'Arial, sans-serif'),
    margin = dict(t = 60, b = 40, l = 40, r = 100),
    width = 1200,
    height = 450,
    showlegend = True
)

# Annotation for total records
fig.add_annotation(
    text = f"Total Records: {len(df_features):,}",
    x = 0.5,
    y = -0.1,
    xref ="paper",
    yref = "paper",
    showarrow = False,
    font = dict(size = 10, color="gray")
)

fig.show()

export_plotly_chart(fig, name="schoolholiday_pie")

### Sales Distribution

In [None]:
# Simple histogram
fig = px.histogram(
    df_features, 
    x='sales',
    nbins=30,
    title='📊 Sales Distribution'
)

# Minimal layout
fig.update_layout(title_x=0.5, width=1200, height=450)

fig.show()

# Simple summary
sales = df_features['sales']
print(f"\nSales Summary:")
print(f"Mean: €{sales.mean():,.0f}")
print(f"Median: €{sales.median():,.0f}")
print(f"Range: €{sales.min():,.0f} - €{sales.max():,.0f}")

# Simple outlier summary using IQR method
sales = df_features['sales']
q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = sales[(sales < lower_bound) | (sales > upper_bound)]

print(f"\nOutlier Analysis:")
print(f"Q1: €{q1:,.0f}")
print(f"Q3: €{q3:,.0f}")
print(f"IQR: €{iqr:,.0f}")
print(f"Outlier bounds: €{lower_bound:,.0f} - €{upper_bound:,.0f}")
print(f"Outliers found: {len(outliers):,} ({len(outliers)/len(sales)*100:.1f}%)")
print(f"Outlier range: €{outliers.min():,.0f} - €{outliers.max():,.0f}")

print("\n")
export_plotly_chart(fig, name="sales_histogram")

### Customers Distribution

In [None]:
# Simple histogram
fig = px.histogram(
    df_features, 
    x='customers',
    nbins=30,
    title='📊 Customers Distribution'
)

# Minimal layout
fig.update_layout(title_x=0.5, width = 1200, height = 450
)

fig.show()

# Simple summary (customers are counts, so no currency symbol)
customers = df_features['customers']
print(f"\nCustomers Summary:")
print(f"Mean: {customers.mean():,.0f}")
print(f"Median: {customers.median():,.0f}")
print(f"Range: {customers.min():,.0f} - {customers.max():,.0f}")

# Simple outlier summary using IQR method
q1 = customers.quantile(0.25)
q3 = customers.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = customers[(customers < lower_bound) | (customers > upper_bound)]

print(f"\nOutlier Analysis:")
print(f"Q1: {q1:,.0f}")
print(f"Q3: {q3:,.0f}")
print(f"IQR: {iqr:,.0f}")
print(f"Outlier bounds: {lower_bound:,.0f} - {upper_bound:,.0f}")
print(f"Outliers found: {len(outliers):,} ({len(outliers)/len(customers)*100:.1f}%)")
print(f"Outlier range: {outliers.min():,.0f} - {outliers.max():,.0f}")

print("\n")
export_plotly_chart(fig, name="customer_histogram")
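
The IQR outlier logic is repeated for `sales` and `customers`. One way to avoid the duplication is a small helper, sketched below. The function name `iqr_outlier_summary`, its `k` parameter, and the toy values are illustrative assumptions rather than part of the original notebook.

```python
import pandas as pd


def iqr_outlier_summary(series: pd.Series, k: float = 1.5) -> dict:
    """Return quartiles, IQR bounds, and outlier counts for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = series[(series < lower) | (series > upper)]
    return {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_bound": lower,
        "upper_bound": upper,
        "n_outliers": len(outliers),
        "pct_outliers": len(outliers) / len(series) * 100,
    }


# Toy data with one obvious outlier (illustrative values only)
toy = pd.Series([10, 12, 11, 13, 12, 11, 10, 100])
summary = iqr_outlier_summary(toy)
print(summary)
```

With such a helper, the same summary could be produced by calling `iqr_outlier_summary(df_features['sales'])` and `iqr_outlier_summary(df_features['customers'])`.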


## Customer Analysis

#### Average Customers Trend per Day

In [None]:

# Define weekday order
weekday_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Calculate mean customers by day
dow_agg = df_features.groupby('day')['customers'].mean().reset_index()

# Apply categorical ordering
dow_agg['day'] = pd.Categorical(dow_agg['day'], categories=weekday_order, ordered=True)
dow_agg = dow_agg.sort_values('day')

# Create simple line chart
fig = px.line(
    dow_agg, 
    x = 'day', 
    y ='customers', 
    title ='📈 Average Customers by Day of Week',
    markers = True
)

# Simple styling
fig.update_layout(
    title={'x': 0.5, 'xanchor': 'center'},
    xaxis_title ='Day of Week',
    yaxis_title ='Average Customers',
)

# Find peak for simple annotation
peak_idx = dow_agg['customers'].idxmax()
peak_day = dow_agg.loc[peak_idx, 'day']
peak_value = dow_agg.loc[peak_idx, 'customers']

# Simple annotation
fig.add_annotation(
    x = peak_day,
    y = peak_value,
    text = f"Peak: {peak_day} ({peak_value:.0f})",
    showarrow = True,
    arrowcolor ='red',
    font = dict(color = 'red')
)

fig.update_layout(margin=dict(t=60, r=40, b=40, l=40),  width = 1200,height = 400)
fig.show()

print("Daily Customers Data:")
print(dow_agg)

print("\n")
export_plotly_chart(fig, name="daily_avg_customers_trend")
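
The weekday ordering above is applied after aggregation and is repeated in later cells. As a sketch of an alternative, the column can be converted to an ordered categorical once, so that `groupby` returns the days in calendar order directly. The toy frame below is an illustrative assumption.

```python
import pandas as pd

weekday_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "day": ["Sun", "Mon", "Wed", "Mon", "Sun", "Fri"],
    "customers": [0, 600, 550, 650, 10, 700],
})

# Convert once; every later groupby on 'day' follows the category order
toy["day"] = pd.Categorical(toy["day"], categories=weekday_order, ordered=True)

# observed=True keeps only the weekdays that actually appear in the data
dow_avg = toy.groupby("day", observed=True)["customers"].mean().reset_index()
print(dow_avg)
```

The same approach would apply to the `month` column with `month_order`.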


#### Average Customers Trend per Month

In [None]:

# Ensure month is ordered
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun','Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

monthly_grp = df_features.groupby('month', as_index=False)['customers'].mean()
monthly_grp['month'] = pd.Categorical(monthly_grp['month'], categories=month_order, ordered=True)
monthly_grp = monthly_grp.sort_values('month')

# Identify peak
peak_row = monthly_grp.loc[monthly_grp['customers'].idxmax()]
peak_month = peak_row['month']
peak_value = peak_row['customers']

# Create Plotly Express line plot
fig = px.line(monthly_grp, x='month', y='customers',
              markers=True,
              title='📈  Average Customers Trend per Month',
              labels={'customers': 'Average Customers', 'month': 'Month'},
              line_shape='linear')  # You can switch to 'spline' for smooth curves

# Annotate peak
fig.add_annotation(x=peak_month, y=peak_value,
                   text=f'Peak: {peak_month} ({peak_value:.1f})',
                   showarrow=True,
                   arrowhead=2,
                   arrowsize=1,
                   arrowwidth=2,
                   arrowcolor='red',
                   font=dict(color='red', size=12),
                   yshift=15)

fig.update_layout(margin=dict(t=60, r=40, b=40, l=40),width = 1200,height = 450)
fig.show()

export_plotly_chart(fig, name="monthly_avg_customers_trend")

## Sales Analysis

#### Average Sales Trend per Day

In [None]:

# Define weekday order
weekday_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Calculate mean sales by day
dow_agg = df_features.groupby('day')['sales'].mean().reset_index()

# Apply categorical ordering
dow_agg['day'] = pd.Categorical(dow_agg['day'], categories=weekday_order, ordered=True)
dow_agg = dow_agg.sort_values('day')

# Create simple line chart
fig = px.line(
    dow_agg, 
    x = 'day', 
    y ='sales', 
    title ='📈 Average Sales by Day of Week',
    markers = True
)

# Simple styling
fig.update_layout(
    title={'x': 0.5, 'xanchor': 'center'},
    xaxis_title ='Day of Week',
    yaxis_title ='Average Sales',
)

# Find peak for simple annotation
peak_idx = dow_agg['sales'].idxmax()
peak_day = dow_agg.loc[peak_idx, 'day']
peak_value = dow_agg.loc[peak_idx, 'sales']

# Simple annotation
fig.add_annotation(
    x = peak_day,
    y = peak_value,
    text = f"Peak: {peak_day} ({peak_value:.0f})",
    showarrow = True,
    arrowcolor ='red',
    font = dict(color = 'red')
)

fig.update_layout(margin=dict(t=60, r=40, b=40, l=40), width = 1200,height = 400)
fig.show()

export_plotly_chart(fig, name="daily_avg_sales_trend")

#### Average Sales Trend per Month

In [None]:

# Ensure month is ordered
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun','Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

monthly_grp = df_features.groupby('month', as_index=False)['sales'].mean()
monthly_grp['month'] = pd.Categorical(monthly_grp['month'], categories=month_order, ordered=True)
monthly_grp = monthly_grp.sort_values('month')

# Identify peak
peak_row = monthly_grp.loc[monthly_grp['sales'].idxmax()]
peak_month = peak_row['month']
peak_value = peak_row['sales']

# Create Plotly Express line plot
fig = px.line(monthly_grp, x='month', y='sales',
              markers=True,
              title='📈  Average Sales Trend per Month',
              labels={'sales': 'Average Sales', 'month': 'Month'},
              line_shape='linear')  # You can switch to 'spline' for smooth curves

# Annotate peak
fig.add_annotation(x=peak_month, y=peak_value,
                   text=f'Peak: {peak_month} ({peak_value:.1f})',
                   showarrow=True,
                   arrowhead=2,
                   arrowsize=1,
                   arrowwidth=2,
                   arrowcolor='red',
                   font=dict(color='red', size=12),
                   yshift=15)

fig.update_layout(margin=dict(t=60, r=40, b=40, l=40),width = 1200,height = 450)
fig.show()

export_plotly_chart(fig, name="monthly_avg_sales_trend")

### Box Plots by Time Segment

In [None]:
# Box plot by Month
fig1 = px.box(
    df_features, 
    x='month', 
    y='sales',
    title='Sales Distribution by Month',
    category_orders={'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun','Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']}
)
fig1.update_layout(title_x=0.5, width = 1200,height=500)
fig1.show()

# Box plot by Day of Week  
fig2 = px.box(
    df_features, 
    x='day', 
    y='sales',
    title='Sales Distribution by Day of Week',
    category_orders={'day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']}
)
fig2.update_layout(title_x=0.5, width = 1200,height=500)
fig2.show()

# Box plot by Year
fig3 = px.box(
    df_features, 
    x='year', 
    y='sales',
    title='Sales Distribution by Year'
)
fig3.update_layout(title_x=0.5, width = 1200,height = 500)
fig3.show()

# Simple summary statistics
print("Sales Distribution Summary by Category:")
print("=" * 45)

print("\nBy Month:")
monthly_stats = df_features.groupby('month')['sales'].agg(['mean', 'median', 'std']).round(0)
# Use 'row' as the loop variable to avoid shadowing the scipy.stats import
for month, row in monthly_stats.iterrows():
    print(f"{month}: Mean=€{row['mean']:,.0f}, Median=€{row['median']:,.0f}")

print("\nBy Day:")
daily_stats = df_features.groupby('day')['sales'].agg(['mean', 'median', 'std']).round(0)
for day, row in daily_stats.iterrows():
    print(f"{day}: Mean=€{row['mean']:,.0f}, Median=€{row['median']:,.0f}")

print("\n")
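# Export each box plot; 'fig' still refers to the earlier monthly line chart
export_plotly_chart(fig1, name="timesegment_sales_boxplot_month")
export_plotly_chart(fig2, name="timesegment_sales_boxplot_day")
export_plotly_chart(fig3, name="timesegment_sales_boxplot_year")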


In [None]:
print("✅ Data Visualization completed.\n")

--------------------------------------------

In [None]:
print("✅ Feature Engineering and Data Visualization (I) completed successfully!")
print(f"🗓️ Analysis Date: {bold_start}{pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")

In [None]:
# End analysis
data_viz_end = pd.Timestamp.now()
duration = data_viz_end - data_viz_begin

# Final summary print
print("\n📋 Feature Engineering & Data Visualization Summary")
print(f"🟢 Begin Date: {bold_start}{data_viz_begin.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")
print(f"✅ End Date:   {bold_start}{data_viz_end.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")
print(f"⏱️ Duration:   {bold_start}{str(duration)}{bold_end}")

-------------------------
## Project Design Rationale: Notebook Separation

To promote **clarity, maintainability, and scalability** within the project, **Visualization Impact Analysis tasks** are intentionally separated into distinct notebooks. This modular approach prevents the accumulation of excessive code in a single notebook, making it easier to **debug, update, and collaborate across different stages of the workflow**. By isolating data transformation logic from visual analysis, **each notebook remains focused and purpose-driven**, ultimately **enhancing the overall efficiency and readability of the project**.


ARNAUD DAVY - MUKWA NDUDI 
-----------------------------------------