# 📈 **Sales Prediction Model Development**

## **🎯 Objective**
Develop a robust sales prediction model for Rossmann Pharmaceuticals to forecast daily sales across stores.

## **🛠️ Approach**

### **1️⃣ Data Preprocessing**
- **Handle Missing Values**: Impute or remove missing data.
- **Outlier Detection**: Identify and manage outliers.
- **Feature Scaling**: Normalize or standardize features.
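
One common way to carry out the outlier step is the interquartile-range rule. The sketch below is only an illustration: `flag_iqr_outliers`, the multiplier `k=1.5`, and the final toy value of 250000 are assumptions, not part of the project code.

```python
import pandas as pd

def flag_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy data: five plausible daily sales values plus one made-up extreme value
sales = pd.Series([5263, 6064, 8314, 13995, 4822, 250000], name="Sales")
outlier_mask = flag_iqr_outliers(sales)
print(sales[outlier_mask])
```

Flagged rows can then be inspected, capped, or dropped, depending on whether they are data errors or genuine peaks.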

### **2️⃣ Feature Engineering**
- **Create Features**: Extract time-based features (day, month, holidays) and lagged sales data.
- **Feature Selection**: Use correlation analysis and feature importance scores.
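
Lagged sales features can be built per store with a grouped shift. This is a minimal sketch on toy data; `add_lag_features` and the lag choices are hypothetical. Because the test set has no `Sales` column, lags would need extra care at forecast time.

```python
import pandas as pd

def add_lag_features(df: pd.DataFrame, lags=(1, 7)) -> pd.DataFrame:
    """Add Sales_lag_<n> columns holding each store's sales n rows (days) earlier."""
    df = df.sort_values(["Store", "Date"]).copy()
    for lag in lags:
        # shift() within each store so one store's history never leaks into another
        df[f"Sales_lag_{lag}"] = df.groupby("Store")["Sales"].shift(lag)
    return df

toy = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Date": pd.to_datetime(["2015-07-29", "2015-07-30", "2015-07-31"] * 2),
    "Sales": [100, 110, 120, 200, 210, 220],
})
lagged = add_lag_features(toy, lags=(1,))
print(lagged)
```

The first row of every store has no earlier observation, so its lag is `NaN` and has to be imputed or dropped before training.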

### **3️⃣ Model Development**
- **Model Selection**: Compare algorithms like Linear Regression, Random Forest, and LSTM.
- **Model Training**: Split data into training/validation sets and tune hyperparameters.
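
For time-ordered sales data, a chronological split is usually safer than a random one, because the validation set then mimics forecasting into the future. The sketch below assumes a `Date` column and holds out the last 42 days (six weeks, matching the forecast horizon); `time_based_split` is a hypothetical helper.

```python
import pandas as pd

def time_based_split(df: pd.DataFrame, date_col: str = "Date", holdout_days: int = 42):
    """Split so that the most recent `holdout_days` days form the validation set."""
    cutoff = df[date_col].max() - pd.Timedelta(days=holdout_days)
    train = df[df[date_col] <= cutoff]
    valid = df[df[date_col] > cutoff]
    return train, valid

# Toy data: 100 consecutive days
toy = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=100, freq="D"),
    "Sales": range(100),
})
train_part, valid_part = time_based_split(toy)
print(len(train_part), len(valid_part))
```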

### **4️⃣ Model Evaluation**
- **Performance Metrics**: Use MAE, RMSE, and R-squared for evaluation.
- **Validation**: Ensure the model generalizes well on unseen data.
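
The three metrics can be computed directly from their definitions. The helper below is a small NumPy sketch (`regression_metrics` is a hypothetical name); `sklearn.metrics` offers equivalent functions.

```python
import numpy as np

def regression_metrics(y_true, y_pred) -> dict:
    """Compute MAE, RMSE and R-squared for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {"MAE": mae, "RMSE": rmse, "R2": 1.0 - ss_res / ss_tot}

metrics = regression_metrics([100, 200, 300], [110, 190, 300])
print(metrics)
```

MAE and RMSE are in sales units, while R-squared is unitless, so reporting all three gives both an absolute and a relative view of the error.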

### **5️⃣ Prediction and Insights**
- **Sales Forecasting**: Generate predictions for the next six weeks.
- **Actionable Insights**: Summarize findings and recommendations.

## **✅ Summary of Steps**
1. Preprocess data.
2. Engineer features.
3. Train models.
4. Evaluate performance.
5. Generate forecasts and insights.

<style>
    h1 {
        color: #aaee99;
        text-shadow: 2px 2px 5px #000;
        font-family: "Comic Sans MS", sans-serif;
    }
</style>

<h1>✨ Preprocessing ✨</h1>


In [1]:
import logging

# Configure logging
logging.basicConfig(
    filename="predictive.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

logger = logging.getLogger()

# Example log
logger.info("Logging setup complete.")


<style>
    h2 {
        color: #ffaa00;
        text-shadow: 2px 2px 5px #000;
        font-family: "Comic Sans MS", sans-serif;
    }
</style>

<h2>✨ Import Modules ✨</h2>


In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn.preprocessing import StandardScaler
import os
import sys
notebook_dir = os.getcwd()
sys.path.append(os.path.abspath(os.path.join(notebook_dir, '..')))
sys.path.append(os.path.abspath('../scripts'))
from scripts.Data_loader import load_data


<h2>✨ Load and Preview Data ✨</h2>


In [3]:
# Load the dataset
train_data = load_data(r'C:\Users\fikad\Desktop\10acedamy\Rossmann-Pharmaceuticals-Sales-Prediction\Data\train.csv')
test_data = pd.read_csv(r'C:\Users\fikad\Desktop\10acedamy\Rossmann-Pharmaceuticals-Sales-Prediction\Data\test.csv')
store_data = pd.read_csv(r'C:\Users\fikad\Desktop\10acedamy\Rossmann-Pharmaceuticals-Sales-Prediction\Data\store.csv')
logger.info("Loaded training, test, and store datasets.")

# Preview the datasets
print(train_data.head())
print(test_data.head())
print(store_data.head())
logger.info("Previewed datasets.")




   Store  DayOfWeek        Date  Sales  Customers  Open  Promo StateHoliday  \
0      1          5  2015-07-31   5263        555     1      1            0   
1      2          5  2015-07-31   6064        625     1      1            0   
2      3          5  2015-07-31   8314        821     1      1            0   
3      4          5  2015-07-31  13995       1498     1      1            0   
4      5          5  2015-07-31   4822        559     1      1            0   

   SchoolHoliday  
0              1  
1              1  
2              1  
3              1  
4              1  
   Id  Store  DayOfWeek        Date  Open  Promo StateHoliday  SchoolHoliday
0   1      1          4  2015-09-17   1.0      1            0              0
1   2      3          4  2015-09-17   1.0      1            0              0
2   3      7          4  2015-09-17   1.0      1            0              0
3   4      8          4  2015-09-17   1.0      1            0              0
4   5      9          4  2
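
`StateHoliday` mixes the value `0` with letter codes such as `a`, so pandas has to guess its type while reading. One option is to declare the type up front and parse `Date` at load time. The loader below is a hypothetical sketch, not the contents of `scripts.Data_loader`, and it reads from an in-memory sample instead of the real files.

```python
import io
import pandas as pd

def load_sales_csv(path_or_buffer) -> pd.DataFrame:
    """Hypothetical loader: read StateHoliday as text and parse Date as datetime."""
    return pd.read_csv(
        path_or_buffer,
        dtype={"StateHoliday": str},   # keep '0' and 'a' as strings
        parse_dates=["Date"],
    )

# Two rows in the same shape as train.csv, trimmed to four columns
sample_csv = io.StringIO(
    "Store,Date,Sales,StateHoliday\n"
    "1,2015-07-31,5263,0\n"
    "1111,2013-01-01,0,a\n"
)
sample = load_sales_csv(sample_csv)
print(sample.dtypes)
```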


<h2>✨ Merge Data and Handle Missing Values ✨</h2>



In [4]:
# Merge store information into training and test data
train_data = train_data.merge(store_data, on='Store', how='left')
test_data = test_data.merge(store_data, on='Store', how='left')

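# Handle missing values in merged datasets.
# Note: 0 also stands in for missing CompetitionDistance, Promo2Since* and PromoInterval values,
# so a filled entry cannot be told apart from a genuine zero.
train_data = train_data.fillna(0)
test_data = test_data.fillna(0)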

logger.info("Merged datasets and handled missing values.")
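
Filling every gap with 0 treats a missing `CompetitionDistance` as a competitor at distance zero. One alternative, sketched here on toy data, keeps a missing-value flag and imputes the median instead. The `_missing` column name and the choice of the median are assumptions, and the `NaN` for store 2 is made up for the example.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({"Store": [1, 2, 3], "Sales": [5263, 6064, 8314]})
stores = pd.DataFrame({
    "Store": [1, 2, 3],
    "CompetitionDistance": [1270.0, np.nan, 14130.0],
})

# validate= raises if a store appears more than once in the store table
merged = sales.merge(stores, on="Store", how="left", validate="many_to_one")

# Record where the value was missing before overwriting it
merged["CompetitionDistance_missing"] = merged["CompetitionDistance"].isna().astype(int)
median_distance = merged["CompetitionDistance"].median()
merged["CompetitionDistance"] = merged["CompetitionDistance"].fillna(median_distance)
print(merged)
```

In practice the median would be computed on the training data only and reused for the test data.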


In [5]:
# Only the last expression in a cell is rendered, so the output below is the test preview
test_data.head()

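|  | Id | Store | DayOfWeek | Date | Open | Promo | StateHoliday | SchoolHoliday | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear | PromoInterval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 4 | 2015-09-17 | 1.0 | 1 | 0 | 0 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | 0.0 | 0.0 | 0 |
| 1 | 2 | 3 | 4 | 2015-09-17 | 1.0 | 1 | 0 | 0 | a | a | 14130.0 | 12.0 | 2006.0 | 1 | 14.0 | 2011.0 | Jan,Apr,Jul,Oct |
| 2 | 3 | 7 | 4 | 2015-09-17 | 1.0 | 1 | 0 | 0 | a | c | 24000.0 | 4.0 | 2013.0 | 0 | 0.0 | 0.0 | 0 |
| 3 | 4 | 8 | 4 | 2015-09-17 | 1.0 | 1 | 0 | 0 | a | a | 7520.0 | 10.0 | 2014.0 | 0 | 0.0 | 0.0 | 0 |
| 4 | 5 | 9 | 4 | 2015-09-17 | 1.0 | 1 | 0 | 0 | a | c | 2030.0 | 8.0 | 2000.0 | 0 | 0.0 | 0.0 | 0 |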



<h2>✨ Extract Features from Date ✨</h2>

In [6]:
# Extract date-related features
def extract_date_features(df):
    df['Date'] = pd.to_datetime(df['Date'])
    df['Year'] = df['Date'].dt.year
    df['Month'] = df['Date'].dt.month
    df['Day'] = df['Date'].dt.day
    df['WeekOfYear'] = df['Date'].dt.isocalendar().week
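    # Overwrites the raw 1-based DayOfWeek column with pandas' 0-based convention (Monday=0)
    df['DayOfWeek'] = df['Date'].dt.dayofweek
    # Saturday (5) and Sunday (6) count as the weekend
    df['IsWeekend'] = (df['DayOfWeek'] >= 5).astype(int)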
    df['IsMonthStart'] = df['Date'].dt.is_month_start.astype(int)
    df['IsMonthEnd'] = df['Date'].dt.is_month_end.astype(int)
    return df

train_data = extract_date_features(train_data)
test_data = extract_date_features(test_data)

logger.info("Extracted date features.")
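
The raw files number weekdays from 1, while `Series.dt.dayofweek` numbers them from 0 (Monday). That is why 2015-07-31, a Friday, shows `DayOfWeek` 5 in the raw preview and 4 after feature extraction. A quick check on three toy dates makes the convention explicit; the variable names are illustrative.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2015-07-31", "2015-08-01", "2015-08-02"]))

zero_based = dates.dt.dayofweek              # Monday=0 ... Sunday=6
one_based = zero_based + 1                   # Monday=1 ... Sunday=7, as in the raw CSV
is_weekend = (zero_based >= 5).astype(int)   # Saturday and Sunday

print(pd.DataFrame({
    "Date": dates,
    "zero_based": zero_based,
    "one_based": one_based,
    "IsWeekend": is_weekend,
}))
```

Either convention works for modelling, as long as train and test data use the same one.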

In [7]:
train_data

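|  | Store | DayOfWeek | Date | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday | StoreType | ... | Promo2SinceWeek | Promo2SinceYear | PromoInterval | Year | Month | Day | WeekOfYear | IsWeekend | IsMonthStart | IsMonthEnd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 4 | 2015-07-31 | 5263 | 555 | 1 | 1 | 0 | 1 | c | ... | 0.0 | 0.0 | 0 | 2015 | 7 | 31 | 31 | 0 | 0 | 1 |
| 1 | 2 | 4 | 2015-07-31 | 6064 | 625 | 1 | 1 | 0 | 1 | a | ... | 13.0 | 2010.0 | Jan,Apr,Jul,Oct | 2015 | 7 | 31 | 31 | 0 | 0 | 1 |
| 2 | 3 | 4 | 2015-07-31 | 8314 | 821 | 1 | 1 | 0 | 1 | a | ... | 14.0 | 2011.0 | Jan,Apr,Jul,Oct | 2015 | 7 | 31 | 31 | 0 | 0 | 1 |
| 3 | 4 | 4 | 2015-07-31 | 13995 | 1498 | 1 | 1 | 0 | 1 | c | ... | 0.0 | 0.0 | 0 | 2015 | 7 | 31 | 31 | 0 | 0 | 1 |
| 4 | 5 | 4 | 2015-07-31 | 4822 | 559 | 1 | 1 | 0 | 1 | a | ... | 0.0 | 0.0 | 0 | 2015 | 7 | 31 | 31 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1017204 | 1111 | 1 | 2013-01-01 | 0 | 0 | 0 | 0 | a | 1 | a | ... | 31.0 | 2013.0 | Jan,Apr,Jul,Oct | 2013 | 1 | 1 | 1 | 0 | 1 | 0 |
| 1017205 | 1112 | 1 | 2013-01-01 | 0 | 0 | 0 | 0 | a | 1 | c | ... | 0.0 | 0.0 | 0 | 2013 | 1 | 1 | 1 | 0 | 1 | 0 |
| 1017206 | 1113 | 1 | 2013-01-01 | 0 | 0 | 0 | 0 | a | 1 | a | ... | 0.0 | 0.0 | 0 | 2013 | 1 | 1 | 1 | 0 | 1 | 0 |
| 1017207 | 1114 | 1 | 2013-01-01 | 0 | 0 | 0 | 0 | a | 1 | a | ... | 0.0 | 0.0 | 0 | 2013 | 1 | 1 | 1 | 0 | 1 | 0 |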



<h2>✨ Additional Features (Days to/from Holidays, Promotions, etc.) ✨</h2>


In [8]:
# Create custom features
def custom_features(df):
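    # Example holiday dates only; all fall in 2015, so 2013-2014 rows are measured against 2015
    holidays = pd.to_datetime(['2015-12-25', '2015-01-01', '2015-07-04'])
    # Days until the next holiday on or after the date (0 on a holiday, and also 0 when none remains)
    df['DaysToHoliday'] = df['Date'].apply(lambda x: min([(h - x).days for h in holidays if h >= x], default=0))
    # Days since the latest holiday on or before the date (0 on a holiday, and also 0 when none precedes)
    df['DaysAfterHoliday'] = df['Date'].apply(lambda x: min([(x - h).days for h in holidays if h <= x], default=0))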
    return df

train_data = custom_features(train_data)
test_data = custom_features(test_data)

logger.info("Added custom features.")
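
A row-wise `apply` over roughly a million dates is slow. The same "days to next holiday" value can be computed in one vectorized pass with `np.searchsorted`. This is a sketch under stated assumptions: `days_to_next_holiday` is a hypothetical helper, and it returns `NaN` where no later holiday exists, which differs from the default of 0 used above.

```python
import numpy as np
import pandas as pd

def days_to_next_holiday(dates: pd.Series, holidays) -> pd.Series:
    """Days from each date to the next holiday on or after it; NaN if none remains."""
    hol = np.sort(pd.to_datetime(list(holidays)).values.astype("datetime64[D]"))
    day = dates.values.astype("datetime64[D]")
    idx = np.searchsorted(hol, day, side="left")   # index of first holiday >= date
    has_next = idx < len(hol)
    nxt = hol[np.minimum(idx, len(hol) - 1)]       # clamp so indexing stays valid
    delta = (nxt - day).astype("int64")            # whole days as integers
    return pd.Series(np.where(has_next, delta, np.nan), index=dates.index)

toy_dates = pd.Series(pd.to_datetime(["2015-06-30", "2015-12-25", "2015-12-26"]))
days_ahead = days_to_next_holiday(toy_dates, ["2015-12-25", "2015-01-01", "2015-07-04"])
print(days_ahead)
```

Keeping `NaN` for "no holiday ahead" lets a later imputation step decide how to encode that case instead of mixing it with an actual holiday.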

In [13]:
train_data.dtypes

Store                                 int64
DayOfWeek                             int32
Date                         datetime64[ns]
Sales                               float64
Customers                           float64
Open                                  int64
Promo                                 int64
StateHoliday                         object
SchoolHoliday                         int64
StoreType                            object
Assortment                           object
CompetitionDistance                 float64
CompetitionOpenSinceMonth           float64
CompetitionOpenSinceYear            float64
Promo2                                int64
Promo2SinceWeek                     float64
Promo2SinceYear                     float64
PromoInterval                        object
Year                                  int32
Month                                 int32
Day                                   int32
WeekOfYear                           UInt32
IsWeekend                       

In [14]:
# Step 1: Identify numeric columns in train_data
numeric_features_train = train_data.select_dtypes(include=['int64', 'float64', 'int32', 'UInt32']).columns.tolist()

# Step 2: Identify numeric columns in test_data
numeric_features_test = test_data.select_dtypes(include=['int64', 'float64', 'int32', 'UInt32']).columns.tolist()

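# Step 3: Find common numeric features between train and test data
# (sorted for a reproducible column order; note this also picks up identifier and flag columns such as Store and Promo)
common_numeric_features = sorted(set(numeric_features_train) & set(numeric_features_test))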

# Step 4: Initialize the scaler
scaler = StandardScaler()

# Step 5: Fit and transform numeric features in train_data
train_data[common_numeric_features] = scaler.fit_transform(train_data[common_numeric_features])

# Step 6: Transform numeric features in test_data
test_data[common_numeric_features] = scaler.transform(test_data[common_numeric_features])

# Log a message to confirm scaling
logger.info("Scaled numeric features for both train and test datasets.")
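
Standardizing every shared numeric column also rescales identifiers and binary flags such as `Store` and `Promo`. A narrower variant, sketched below on toy frames, scales only columns that are genuinely continuous. The `continuous_cols` list is an assumption chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy_train = pd.DataFrame({
    "Store": [1, 3, 7, 8],
    "Promo": [1, 0, 1, 0],
    "CompetitionDistance": [1270.0, 14130.0, 24000.0, 7520.0],
})
toy_test = pd.DataFrame({"Store": [9], "Promo": [1], "CompetitionDistance": [2030.0]})

# Assumed list of continuous features; IDs and 0/1 flags are left untouched
continuous_cols = ["CompetitionDistance"]

toy_scaler = StandardScaler()
# Fit on training data only, then reuse the same statistics for the test data
toy_train[continuous_cols] = toy_scaler.fit_transform(toy_train[continuous_cols])
toy_test[continuous_cols] = toy_scaler.transform(toy_test[continuous_cols])
print(toy_train)
```

Tree-based models such as Random Forest do not need scaling at all, so this step matters mainly for linear models and neural networks.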
