# Daily Pizza Sales Prediction


#### Project Workflow
1. Understand the Dataset
   - Review all columns and their meanings (you've already done this, great start!)
   - Identify which variables are:
     - Independent (features): weather, promotions, school status, holidays, etc.
     - Dependent (target): `daily_sales_NGN`
2. Clean and Prepare the Data
   - Check for missing values or anomalies (e.g., nulls in temperature or sales)
   - Convert the date column to datetime format
   - Create new features if needed:
     - Week number
     - Is exam week
     - Ramadan or Lent flag (already modeled, but you can double-check)
3. Explore the Data (EDA)
   Use visualizations to uncover patterns:
   - 📈 Line plots of sales over time
   - 📊 Bar charts comparing average sales by:
     - Day of week
     - Month
     - Holiday vs non-holiday
     - School in session vs strike
   - 📉 Boxplots to see sales distribution by weather or promotion
   - 📌 Correlation heatmap to see which features are most strongly associated with sales
4. Model Sales Drivers
   - Use regression models (e.g., Linear Regression, Random Forest, XGBoost) to predict `daily_sales_NGN`
   - Evaluate feature importance: which variables drive sales the most?
   - Try time series models (e.g., ARIMA, Prophet) if you're forecasting future sales
5. Segment Your Insights
   - Compare sales during:
     - Strike vs normal periods
     - Ramadan vs non-Ramadan
     - Exam weeks vs regular weeks
   - Identify high-performing days (e.g., Fridays with promotions)
6. Make Recommendations
   Based on your findings, suggest:
   - Best times to run promotions
   - How to prepare for low-traffic periods (e.g., strikes, Lent)
   - Staffing or inventory adjustments based on seasonality
7. Present Your Work
   - Create a dashboard (Excel, Power BI, or Tableau)
   - Summarize key insights in a slide deck or report
   - Include visuals, trends, and actionable takeaways
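
The segment comparisons in step 5 can be sketched with a `groupby`. This is a minimal illustration on a made-up frame: the column names `day_of_week`, `promotion`, and `daily_sales_NGN` follow the dataset used in this notebook, but the values are invented for the example.

```python
import pandas as pd

# Toy data for illustration only; the values are invented
toy = pd.DataFrame({
    "day_of_week": ["Friday", "Friday", "Monday", "Monday"],
    "promotion": [True, False, True, False],
    "daily_sales_NGN": [200.0, 150.0, 120.0, 80.0],
})

# Average sales with and without a promotion
avg_by_promo = toy.groupby("promotion")["daily_sales_NGN"].mean()

# Average sales for each (day of week, promotion) combination
avg_by_day_promo = toy.groupby(["day_of_week", "promotion"])["daily_sales_NGN"].mean()

print(avg_by_promo)
print(avg_by_day_promo)
```

The same pattern should carry over to the other segments (strike vs normal periods, exam weeks) once the matching flag columns exist.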


# Data Cleaning

In [None]:
# Importing Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [None]:
#Importing Dataset
df = pd.read_csv('pizza_sales_2021_2025.csv')

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.drop(['ramadan', 'lent'], axis=1, inplace=True)

In [None]:
print(df.info())

In [None]:
df.describe()

In [None]:
print(df.isnull().sum())

`Note:` In the `public_holiday_name` column, missing values most likely mean "not a public holiday", which is perfectly valid. They are not errors or gaps in data collection, just non-holiday days.
- `is_holiday` is a Boolean column, so you can use it to filter or group.
- When `is_holiday` is False, `public_holiday_name` is expected to be NaN.
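
One way to confirm this expectation is to look for rows where the flag and the name disagree. The sketch below uses a small invented frame with the same two column names; on the real data the same comparison would run against `df`.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the values are invented
toy = pd.DataFrame({
    "is_holiday": [True, False, False, True],
    "public_holiday_name": ["New Year's Day", np.nan, np.nan, np.nan],
})

# A row is consistent when "has a holiday name" matches the is_holiday flag
has_name = toy["public_holiday_name"].notna()
mismatch = toy[toy["is_holiday"] != has_name]

print(f"Inconsistent rows: {len(mismatch)}")
```

An empty `mismatch` frame would support treating the NaN values as plain non-holiday days.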





In [None]:
# Check for Duplicates
print(df.duplicated().sum())

In [None]:
# Check and fix data types
print(df.dtypes)

In [None]:
# Convert Date column to datetime dtype
df['date'] = pd.to_datetime(df['date'], errors='coerce')


In [None]:
#Convert Category columns to category dtype
cat_cols = ['day_of_week', 'month', 'public_holiday_name', 'university_calendar_status', 'weather']
for col in cat_cols:
    df[col] = df[col].astype('category')
    

In [None]:
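# Convert Boolean columns to bool dtype
# Caution: astype('bool') maps any non-empty string (including 'False') and NaN to True,
# so confirm these columns hold 0/1 or real booleans before converting.
bool_cols = ['is_weekend', 'is_holiday', 'is_school_in_session', 'promotion']
for col in bool_cols:
    df[col] = df[col].astype('bool')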

In [None]:
# Convert numeric columns to float dtype
num_cols = ['temperature_C', 'foot_traffic_index', 'student_density_index', 'daily_sales_NGN', 'transactions_count', 'avg_order_value_NGN']
for col in num_cols:
    df[col] = df[col].astype('float64')


In [None]:
print(df.dtypes)

In [None]:
# Checking Time Continuity to ensure no missing dates

# Create a complete date range
full_range = pd.date_range(start=df['date'].min(), end=df['date'].max())

# Compare with actual dates
missing_dates = full_range.difference(df['date'])

print(f"Missing dates: {missing_dates}")


## Exploratory Data Analysis

In [None]:
# plot a histogram for each numerical attribute
df.hist(bins=50, figsize=(20,15))
plt.show()


In [None]:
# Summary stats
print(
    df[['daily_sales_NGN', 'temperature_C', 'student_density_index', 'foot_traffic_index', 'transactions_count', 'avg_order_value_NGN']].describe()
)


# Visualizing boxplots for numerical columns
for col in ['daily_sales_NGN', 'temperature_C', 'student_density_index', 'foot_traffic_index', 'transactions_count', 'avg_order_value_NGN']:
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
    plt.show()

### Looking for Correlations

In [None]:
corr_matrix = df[num_cols].corr()

In [None]:
# Visualizing the correlation matrix using a heatmap

plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Numeric Features')
plt.show()

Understand the `correlation matrix` before continuing with Data Preprocessing.
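
To read the matrix with the target in mind, it can help to pull out the target's column and sort it by absolute value. This is a sketch on invented numbers; the column names mirror `num_cols`, and on the real data `corr_matrix['daily_sales_NGN']` would take the place of `toy.corr()[...]`.

```python
import pandas as pd

# Toy data for illustration only; the values are invented
toy = pd.DataFrame({
    "daily_sales_NGN": [10.0, 20.0, 30.0, 40.0],
    "transactions_count": [1.0, 2.0, 3.0, 4.0],
    "temperature_C": [30.0, 28.0, 29.0, 27.0],
})

# Correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["daily_sales_NGN"]
    .drop("daily_sales_NGN")
    .sort_values(key=abs, ascending=False)
)

print(target_corr)
```

A feature whose correlation with the target is close to 1 may be almost a restatement of the target, which is worth checking before it is used as a predictor.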

## Data Preprocessing

- Handle Missing Values
- Define Variables
- Feature Engineering & Feature Scaling where necessary
- Encoding Categorical data


`Note:` For feature engineering on this project:

Addressing Special Non-Public Holidays

Create a new binary feature specifically to capture the effect of non-official festive days that strongly influence consumer spending and dining habits.

1. Create a New Feature: `Festive_Day_Flag`

Instead of trying to fit Valentine's Day into the `is_holiday` column (which should be reserved for nationally recognized public holidays), create a separate binary flag:

- New Column Name: `Festive_Day_Flag`
- Data Type: Binary (0 or 1)
- Description: 1 if the date is a major festive day known to influence dining, 0 otherwise
- Dates to Flag (Examples): February 14th (Valentine's Day), Mother's Day, Father's Day, New Year's Eve (December 31st).
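
A minimal sketch of such a flag, assuming a `date` column of datetime dtype. The frame and the short `example_festive_days` list are invented for the illustration; only the column name `Festive_Day_Flag` comes from the note above.

```python
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-02-13", "2021-02-14", "2021-12-31"]),
})

# Hypothetical short list of festive dates, used only for this example
example_festive_days = pd.to_datetime(["2021-02-14", "2021-12-31"])

# 1 when the date is in the festive list, 0 otherwise
toy["Festive_Day_Flag"] = toy["date"].isin(example_festive_days).astype(int)

print(toy)
```

Keeping the flag separate leaves `is_holiday` untouched, so the model can weigh official holidays and festive days independently.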


`Note:` For encoding categorical data:
1. One-Hot Encoding (Dummy Variables)

Works well with tree-based models (Random Forest, XGBoost) and linear models.

`df_encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True)`

- `drop_first=True` removes one category per feature, which avoids perfect multicollinearity in linear models; tree-based models do not need it.
- Each remaining category becomes a binary column (0 or 1).
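
The effect of `drop_first=True` is easiest to see on a tiny example. The weather labels below are invented and may not match the categories in the actual dataset.

```python
import pandas as pd

# Toy frame for illustration only; the category labels are invented
toy = pd.DataFrame({
    "weather": ["Sunny", "Rainy", "Cloudy", "Rainy"],
    "daily_sales_NGN": [100.0, 80.0, 90.0, 70.0],
})

# One-hot encode 'weather'; the first category in sorted order (Cloudy) is dropped
encoded = pd.get_dummies(toy, columns=["weather"], drop_first=True, dtype=int)

print(encoded)
```

A row with zeros in every `weather_*` column represents the dropped category, so no information is lost.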


In [None]:
print(df.columns)


In [None]:
# Feature engineering: lag features
# (assumes the rows are sorted by date, with one row per day)
# Sales lags
df['sales_lag_1'] = df['daily_sales_NGN'].shift(1)
df['sales_lag_7'] = df['daily_sales_NGN'].shift(7)
df['sales_lag_30'] = df['daily_sales_NGN'].shift(30)

# Foot traffic lags
df['traffic_lag_1'] = df['foot_traffic_index'].shift(1)
df['traffic_lag_7'] = df['foot_traffic_index'].shift(7)

# Transactions lags
df['transactions_lag_1'] = df['transactions_count'].shift(1)
df['transactions_lag_7'] = df['transactions_count'].shift(7)


In [None]:
# Rolling averages help the model pick up recent trends and smooth out daily noise.
# The sales windows use shift(1) so that the current day's sales (the target)
# are not included in their own average, which would leak the target into the features.

# Rolling averages for sales (previous 7 and 30 days)
df['sales_7d_avg'] = df['daily_sales_NGN'].shift(1).rolling(window=7).mean()
df['sales_30d_avg'] = df['daily_sales_NGN'].shift(1).rolling(window=30).mean()

# Rolling averages for foot traffic and transactions
df['traffic_7d_avg'] = df['foot_traffic_index'].rolling(window=7).mean()
df['transactions_7d_avg'] = df['transactions_count'].rolling(window=7).mean()
df.head(15)

Find answers to this later: if I drop the rows that contain NaN values, how will the model learn from the useful details in the other columns of those rows?
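
One way to approach this question is to count how many rows the lag features actually blank out. The sketch below assumes the NaN values come only from `shift`, and it uses an invented 40-day series: the NaNs are confined to the first rows, up to the longest lag, so every later row keeps all of its columns.

```python
import numpy as np
import pandas as pd

# Toy series of 40 consecutive days; the values are invented
toy = pd.DataFrame({"daily_sales_NGN": np.arange(1.0, 41.0)})
toy["sales_lag_7"] = toy["daily_sales_NGN"].shift(7)
toy["sales_lag_30"] = toy["daily_sales_NGN"].shift(30)

lag_cols = ["sales_lag_7", "sales_lag_30"]

# Rows with at least one NaN lag value: only the warm-up period at the start
rows_lost = int(toy[lag_cols].isna().any(axis=1).sum())

# Dropping them removes the warm-up rows and nothing else
trimmed = toy.dropna(subset=lag_cols)

print(f"Rows lost: {rows_lost}, rows kept: {len(trimmed)}")
```

With several years of daily data, losing the first 30 rows is a small cost. An alternative to dropping is to keep the rows and use a model that tolerates missing values.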

In [None]:
# Adding non-official holidays or festive days that might affect daily sales

festive_days = [
    '2021-02-14', '2021-03-14', '2021-06-20', '2021-12-24', '2021-12-31',
    '2022-02-14', '2022-03-27', '2022-06-19', '2022-12-24', '2022-12-31',
    '2023-02-14', '2023-03-19', '2023-06-18', '2023-12-24', '2023-12-31',
    '2024-02-14', '2024-03-10', '2024-06-16', '2024-12-24', '2024-12-31',
    '2025-02-14', '2025-03-30', '2025-06-15', '2025-12-24', '2025-12-31'
]

In [None]:
# Adding the names of the non-official holidays

festive_names = {
    '2021-02-14': "Valentine's Day",
    '2021-03-14': "Mother's Day",
    '2021-06-20': "Father's Day",
    '2021-12-24': "Christmas Eve",
    '2021-12-31': "New Year's Eve",
    '2022-02-14': "Valentine's Day",
    '2022-03-27': "Mother's Day",
    '2022-06-19': "Father's Day",
    '2022-12-24': "Christmas Eve",
    '2022-12-31': "New Year's Eve",
    '2023-02-14': "Valentine's Day",
    '2023-03-19': "Mother's Day",
    '2023-06-18': "Father's Day",
    '2023-12-24': "Christmas Eve",
    '2023-12-31': "New Year's Eve",
    '2024-02-14': "Valentine's Day",
    '2024-03-10': "Mother's Day",
    '2024-06-16': "Father's Day",
    '2024-12-24': "Christmas Eve",
    '2024-12-31': "New Year's Eve",
    '2025-02-14': "Valentine's Day",
    '2025-03-30': "Mother's Day",
    '2025-06-15': "Father's Day",
    '2025-12-24': "Christmas Eve",
    '2025-12-31': "New Year's Eve"
}

In [None]:
#Convert Date Column to String Format for mapping
df['date_str'] = df['date'].dt.strftime('%Y-%m-%d') 


In [None]:
# Update the columns to include festive days and their names
df.loc[df['date_str'].isin(festive_days), 'is_holiday'] = True


# Append festive names to public holiday name
df['public_holiday_name'] = df.apply(
    lambda row: festive_names[row['date_str']] if row['date_str'] in festive_names
    else row['public_holiday_name'], axis=1
)

In [None]:
df.drop(columns=['date_str'], inplace=True)  # drop the temporary string-formatted date column

In [None]:
df

- Cyclical Time Encoding: turn day_of_week and month into sine/cosine features
- Train-Test Split: prepare your data for modeling
- Feature Selection: identify the most predictive features
- Modeling: build and evaluate your forecasting model


In [None]:
# Handling Missing Values after Feature Engineering
print(df.isnull().sum())

In [None]:
# Handling NaNs in 'public_holiday_name'
df['public_holiday_name'] = df['public_holiday_name'].fillna('None')


In [None]:
df

## Train Test Split
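
For daily time series data, one common option is a chronological split, where the model trains on the earlier dates and is evaluated on the later ones. The sketch below is an assumption about how this section might proceed, not the notebook's final method; the 80/20 ratio and the toy frame are chosen only for illustration.

```python
import pandas as pd

# Toy frame of 10 consecutive days; the values are invented
toy = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=10, freq="D"),
    "daily_sales_NGN": [float(i) for i in range(10)],
})

# Sort by date so the split respects time order
toy = toy.sort_values("date").reset_index(drop=True)

# First 80% of the days for training, the remaining 20% for testing
split_idx = int(len(toy) * 0.8)
train = toy.iloc[:split_idx]
test = toy.iloc[split_idx:]

print(f"Train: {train['date'].min().date()} to {train['date'].max().date()}")
print(f"Test:  {test['date'].min().date()} to {test['date'].max().date()}")
```

A random shuffle would let the model see days that come after the ones it is tested on, which tends to overstate accuracy when lag features are in play.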

## Train or Fit the Model into the Training Dataset