# 📈 **Sales Prediction Model Development**

## **🎯 Objective**
Develop a robust sales prediction model for Rossmann Pharmaceuticals to forecast daily sales across stores.

## **🛠️ Approach**

### **1️⃣ Data Preprocessing**
- **Handle Missing Values**: Impute or remove missing data.
- **Outlier Detection**: Identify and manage outliers.
- **Feature Scaling**: Normalize or standardize features.

### **2️⃣ Feature Engineering**
- **Create Features**: Extract time-based features (day, month, holidays) and lagged sales data.
- **Feature Selection**: Use correlation analysis and feature importance scores.
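
The lagged sales features mentioned above could be built roughly as follows. This is only a sketch: the helper name `add_lag_features`, the lag values, and the toy numbers are assumptions for illustration, and only the `Store`, `Date`, and `Sales` column names come from the dataset.

```python
import pandas as pd

def add_lag_features(df, lags=(1, 7)):
    """Hypothetical helper: add per-store lagged sales columns."""
    out = df.sort_values(["Store", "Date"]).copy()
    for lag in lags:
        # Shift within each store so one store's history never leaks into another
        out[f"Sales_lag_{lag}"] = out.groupby("Store")["Sales"].shift(lag)
    return out

# Toy data (made-up values): two stores, three consecutive days each
toy = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03"] * 2),
    "Sales": [100, 110, 120, 200, 210, 220],
})
lagged = add_lag_features(toy, lags=(1,))
print(lagged)
```

A row-based shift equals a day-based lag only when every store has exactly one row per day with no gaps. For a six-week forecast, lags shorter than the horizon would not be known at prediction time, so longer lags are probably the safer choice.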

### **3️⃣ Model Development**
- **Model Selection**: Compare algorithms like Linear Regression, Random Forest, and LSTM.
- **Model Training**: Split the data by date into training (through 2014) and validation (2015) sets, then tune hyperparameters.

### **4️⃣ Model Evaluation**
- **Performance Metrics**: Use MAE, RMSE, and R-squared for evaluation.
- **Validation**: Ensure the model generalizes well on unseen data.
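
For reference, the three metrics listed above can be computed directly from their definitions. The function name `regression_metrics` and the sample values are illustrative assumptions, not results from this project.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R-squared computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))            # mean absolute error
    rmse = np.sqrt(np.mean(errors ** 2))     # root mean squared error
    ss_res = np.sum(errors ** 2)             # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot               # coefficient of determination
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Made-up actual and predicted sales
metrics = regression_metrics([100, 200, 300], [110, 190, 300])
print(metrics)
```

MAE and RMSE are in the same units as sales, while R-squared is unitless, which is one reason to report them together.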

### **5️⃣ Prediction and Insights**
- **Sales Forecasting**: Generate predictions for the next six weeks.
- **Actionable Insights**: Summarize findings and recommendations.

## **✅ Summary of Steps**
1. Preprocess data.
2. Engineer features.
3. Train models.
4. Evaluate performance.
5. Generate forecasts and insights.

<style>
    h1 {
        color: #aaee99;
        text-shadow: 2px 2px 5px #000;
        font-family: "Comic Sans MS", sans-serif;
    }
</style>

<h1>✨ Preprocessing ✨</h1>


In [1]:
import logging

# Configure logging
logging.basicConfig(
    filename="predictive.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

logger = logging.getLogger()

# Example log
logger.info("Logging setup complete.")


<style>
    h2 {
        color: #ffaa00;
        text-shadow: 2px 2px 5px #000;
        font-family: "Comic Sans MS", sans-serif;
    }
</style>

<h2>✨ Import Modules ✨</h2>


In [2]:
# Standard library
import logging
import os
import pickle
import sys
import warnings
from datetime import datetime

# Third-party libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Make the project root and its scripts folder importable from the notebook
notebook_dir = os.getcwd()
sys.path.append(os.path.abspath(os.path.join(notebook_dir, '..')))
sys.path.append(os.path.abspath('../scripts'))

# Project helpers
from scripts.Data_loader import load_data, clean_data, clean_test_data
from scripts.Analysis import create_pipeline, rmspe, try_model


<h2>✨ Load and Preview Data ✨</h2>


In [3]:
# File paths
train_path = r'C:\Users\fikad\Desktop\10acedamy\Rossmann-Pharmaceuticals-Sales-Prediction\Data\train.csv'
test_path = r'C:\Users\fikad\Desktop\10acedamy\Rossmann-Pharmaceuticals-Sales-Prediction\Data\test.csv'
store_path = r'C:\Users\fikad\Desktop\10acedamy\Rossmann-Pharmaceuticals-Sales-Prediction\Data\store.csv'

# Load and clean data
train, test = load_data(train_path, test_path, store_path)

# Clean train and test data separately
train_cleaned = clean_data(train)
test_cleaned = clean_test_data(test)


  train = pd.read_csv(train_path)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['CompetitionDistance'].fillna(df['CompetitionDistance'].median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Promo2SinceWeek'].fillna(0, inplace=True)
The behavior will change in pandas 3.0. This inplace method will neve
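
The `FutureWarning` above is triggered by chained `fillna(..., inplace=True)` calls, and the message itself suggests assigning the result back to the column. A minimal sketch of that pattern on a toy frame follows. The column names are taken from the warning text, while the values are made up and this is not the actual `clean_data` code.

```python
import numpy as np
import pandas as pd

# Toy frame with the two columns named in the warning (made-up values)
df = pd.DataFrame({
    "CompetitionDistance": [1270.0, np.nan, 570.0],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan],
})

# Assign the filled column back instead of calling fillna(inplace=True) on a selection
df["CompetitionDistance"] = df["CompetitionDistance"].fillna(df["CompetitionDistance"].median())
df["Promo2SinceWeek"] = df["Promo2SinceWeek"].fillna(0)

print(df)
```

Passing a column-to-value mapping to `df.fillna({...})` is the other form the warning suggests, and it fills several columns in one call.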

In [4]:
print("Train columns:", train_cleaned.columns)
print("Test columns:", test_cleaned.columns)

Train columns: Index(['Store', 'DayOfWeek', 'Date', 'Sales', 'Customers', 'Open', 'Promo',
       'StateHoliday', 'SchoolHoliday', 'StoreType', 'Assortment',
       'CompetitionDistance', 'CompetitionOpenSinceMonth',
       'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek',
       'Promo2SinceYear', 'PromoInterval'],
      dtype='object')
Test columns: Index(['Id', 'Store', 'DayOfWeek', 'Date', 'Open', 'Promo', 'StateHoliday',
       'SchoolHoliday', 'StoreType', 'Assortment', 'CompetitionDistance',
       'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2',
       'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval'],
      dtype='object')


 
<h2>✨ Preparing Training and Validation Datasets for Sales Prediction ✨</h2>




In [5]:
# Ensure the Date column is in datetime format
train_cleaned['Date'] = pd.to_datetime(train_cleaned['Date'])
test_cleaned['Date'] = pd.to_datetime(test_cleaned['Date'])

# Extract Day, Month, and Year from the Date column
train_cleaned['Day'] = train_cleaned['Date'].dt.day
train_cleaned['Month'] = train_cleaned['Date'].dt.month
train_cleaned['Year'] = train_cleaned['Date'].dt.year
test_cleaned['Day'] = test_cleaned['Date'].dt.day
test_cleaned['Month'] = test_cleaned['Date'].dt.month
test_cleaned['Year'] = test_cleaned['Date'].dt.year

# Prepare training and validation datasets
train_df = train_cleaned[train_cleaned.Date.dt.year <= 2014]
val_df = train_cleaned[train_cleaned.Date.dt.year == 2015]

# Define input and target columns
input_cols = ['Store', 'DayOfWeek', 'Promo', 'StateHoliday', 'StoreType', 'Assortment', 'Promo2', 'Day', 'Month', 'Year']
target_col = 'Sales'

# Extract inputs and targets for train, validation, and test sets
train_inputs = train_df[input_cols].copy()
train_targets = train_df[target_col].copy()
val_inputs = val_df[input_cols].copy()
val_targets = val_df[target_col].copy()
test_inputs = test_cleaned[input_cols].copy()

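# Numerical and categorical columns (DayOfWeek is treated as categorical only, so it is not encoded twice)
# Note: 'StateHoliday' is in input_cols but is assigned to neither list
num_cols = ['Store', 'Day', 'Month', 'Year']
cat_cols = ['DayOfWeek', 'Promo', 'StoreType', 'Assortment', 'Promo2']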

# Check the first few rows
train_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_cleaned['Date'] = pd.to_datetime(train_cleaned['Date'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_cleaned['Day'] = train_cleaned['Date'].dt.day
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_cleaned['Month'] = train_cleaned['Date'].dt.month
A value is trying to be set on a co

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday,StoreType,...,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval,Day,Month,Year
236380,1,3,2014-12-31,2605,327,1,0,0,1,c,...,1270.0,9.0,2008.0,0,0.0,0.0,,31,12,2014
236381,2,3,2014-12-31,2269,252,1,0,0,1,a,...,570.0,11.0,2007.0,1,13.0,2010.0,"Jan,Apr,Jul,Oct",31,12,2014
236382,3,3,2014-12-31,3804,408,1,0,0,1,a,...,14130.0,12.0,2006.0,1,14.0,2011.0,"Jan,Apr,Jul,Oct",31,12,2014
236383,4,3,2014-12-31,10152,1311,1,0,0,1,c,...,620.0,9.0,2009.0,0,0.0,0.0,,31,12,2014
236384,5,3,2014-12-31,1830,217,1,0,0,1,a,...,29910.0,4.0,2015.0,0,0.0,0.0,,31,12,2014
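
The `SettingWithCopyWarning` messages in the output above usually appear when new columns are assigned on a DataFrame that was itself produced by filtering another one. Whether `clean_data` returns such a slice is an assumption here. One common remedy is to take an explicit copy before adding columns, sketched below on made-up data.

```python
import pandas as pd

# Toy stand-in for the raw training data (made-up values)
raw = pd.DataFrame({
    "Store": [1, 2, 3],
    "Open": [1, 0, 1],
    "Date": ["2014-12-31", "2014-12-31", "2015-07-31"],
})

# .copy() makes the filtered result an independent DataFrame,
# so later column assignments do not target a slice of `raw`
cleaned = raw[raw["Open"] == 1].copy()
cleaned["Date"] = pd.to_datetime(cleaned["Date"])
cleaned["Year"] = cleaned["Date"].dt.year

print(cleaned)
```

The original frame `raw` is left untouched, which also makes the cell safe to re-run.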


In [6]:
val_df.head()

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday,StoreType,...,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval,Day,Month,Year
0,1,5,2015-07-31,5263,555,1,1,0,1,c,...,1270.0,9.0,2008.0,0,0.0,0.0,,31,7,2015
1,2,5,2015-07-31,6064,625,1,1,0,1,a,...,570.0,11.0,2007.0,1,13.0,2010.0,"Jan,Apr,Jul,Oct",31,7,2015
2,3,5,2015-07-31,8314,821,1,1,0,1,a,...,14130.0,12.0,2006.0,1,14.0,2011.0,"Jan,Apr,Jul,Oct",31,7,2015
3,4,5,2015-07-31,13995,1498,1,1,0,1,c,...,620.0,9.0,2009.0,0,0.0,0.0,,31,7,2015
4,5,5,2015-07-31,4822,559,1,1,0,1,a,...,29910.0,4.0,2015.0,0,0.0,0.0,,31,7,2015


 
<h2>✨ Building a Data Preprocessing Pipeline for Sales Prediction ✨</h2>

In [7]:
pipeline = create_pipeline(input_cols, num_cols, cat_cols)
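
`create_pipeline` is defined in `scripts/Analysis.py` and is not shown in this notebook. Judging only from the imports (`ColumnTransformer`, `MinMaxScaler`, `OneHotEncoder`, `Pipeline`), it might look roughly like the sketch below. The name `create_pipeline_sketch`, its signature, and the toy data are assumptions, not the project's actual implementation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

def create_pipeline_sketch(num_cols, cat_cols):
    """Hypothetical stand-in: scale numeric columns, one-hot encode categorical ones."""
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", MinMaxScaler(), num_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ]
    )
    return Pipeline([("preprocessor", preprocessor)])

# Toy inputs (made-up values)
toy = pd.DataFrame({
    "Store": [1, 2, 3],
    "Month": [1, 6, 12],
    "StoreType": ["a", "c", "a"],
})
sketch = create_pipeline_sketch(num_cols=["Store", "Month"], cat_cols=["StoreType"])
transformed = sketch.fit_transform(toy)
print(transformed.shape)
```

With this layout, any column listed in neither `num_cols` nor `cat_cols` is dropped, because `ColumnTransformer` defaults to `remainder="drop"`. Here the result has two scaled columns plus one one-hot column per store type.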

In [8]:
import os
import pickle
from datetime import datetime
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Define models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42, n_jobs=-1)
}

# Directory to save trained models
saved_dir = r"C:\Users\fikad\Desktop\10acedamy\Predictive_Pharmaceutical_data_analysis_Saved_models"
os.makedirs(saved_dir, exist_ok=True)  # Ensure the directory exists

# Fit the preprocessing pipeline once on the training inputs and reuse the result for every model
train_inputs_transformed = pipeline.fit_transform(train_inputs)
val_inputs_transformed = pipeline.transform(val_inputs)

for name, model in models.items():
    print(f"Evaluating model: {name}")

    # Train and evaluate the model using the custom `try_model` function
    try_model(model, train_inputs_transformed, train_targets, val_inputs_transformed, val_targets)

    # Generate a timestamped filename for the model
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    model_filename = f"{name.replace(' ', '_')}_{timestamp}.pkl"
    model_filepath = os.path.join(saved_dir, model_filename)

    # Save the trained model as a pickle file
    with open(model_filepath, 'wb') as file:
        pickle.dump(model, file)

    print(f"Model saved as: {model_filepath}")


Evaluating model: Linear Regression


TypeError: got an unexpected keyword argument 'squared'
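
The `TypeError` above is what recent scikit-learn releases raise when `mean_squared_error` is called with `squared=False`. That parameter was deprecated in version 1.4 and removed in 1.6. The call presumably sits inside `try_model` or a helper it uses, which is not shown here. A version-independent way to get RMSE is to take the square root of the MSE, as in this sketch. `try_model_sketch` and the toy data are assumptions, not the project's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    # Square root of the MSE; avoids the `squared` keyword entirely
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def try_model_sketch(model, train_X, train_y, val_X, val_y):
    """Hypothetical stand-in for try_model: fit, then report train/validation RMSE."""
    model.fit(train_X, train_y)
    train_rmse = rmse(train_y, model.predict(train_X))
    val_rmse = rmse(val_y, model.predict(val_X))
    return train_rmse, val_rmse

# Toy data with an exact linear relationship (y = 2x + 1)
X_train_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train_toy = 2 * X_train_toy.ravel() + 1
X_val_toy = np.array([[4.0], [5.0]])
y_val_toy = 2 * X_val_toy.ravel() + 1

train_rmse, val_rmse = try_model_sketch(
    LinearRegression(), X_train_toy, y_train_toy, X_val_toy, y_val_toy
)
print(train_rmse, val_rmse)
```

On scikit-learn 1.4 or later, `sklearn.metrics.root_mean_squared_error` is an alternative, but the square-root form also runs on older versions.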

In [9]:
from sklearn.metrics import mean_squared_error

In [10]:
# Fit and evaluate models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42, n_jobs=-1)
}

# Path to save the models
saved_dir = r"C:\Users\fikad\Desktop\10acedamy\Predictive_Pharmaceutical_data_analysis_Saved_models"
os.makedirs(saved_dir, exist_ok=True)  

# Fit the preprocessing pipeline once; the transformed inputs are shared by all models
train_inputs_transformed = pipeline.fit_transform(train_inputs)
val_inputs_transformed = pipeline.transform(val_inputs)

for name, model in models.items():
    print(f"Evaluating model: {name}")

    # try_model fits and evaluates the model,
    # using the RMSE loss function defined in ML_pipline_functions
    try_model(model, train_inputs_transformed, train_targets, val_inputs_transformed, val_targets)

    # Serialize the model
    timestamp = datetime.now().strftime("%d-%m-%Y-%H-%M-%S")
    model_filename = f"{name.replace(' ', '_')}_{timestamp}.pkl"
    model_filepath = os.path.join(saved_dir, model_filename)  # Save to the correct directory
    
    with open(model_filepath, 'wb') as file:  # Use full file path
        pickle.dump(model, file)
        
    print(f"Model saved as: {model_filepath}")

Evaluating model: Linear Regression


TypeError: got an unexpected keyword argument 'squared'

<h2>✨ Extract Features from Date ✨</h2>

In [6]:
# Extract date-related features
def extract_date_features(df):
    df = df.copy()  # avoid modifying the caller's DataFrame (and SettingWithCopyWarning)
    df['Date'] = pd.to_datetime(df['Date'])
    df['Year'] = df['Date'].dt.year
    df['Month'] = df['Date'].dt.month
    df['Day'] = df['Date'].dt.day
    df['WeekOfYear'] = df['Date'].dt.isocalendar().week
    # Monday=0 ... Sunday=6; this overwrites the dataset's own DayOfWeek column (coded 1-7)
    df['DayOfWeek'] = df['Date'].dt.dayofweek
    df['IsWeekend'] = (df['DayOfWeek'] >= 5).astype(int)  # Saturday or Sunday
    df['IsMonthStart'] = df['Date'].dt.is_month_start.astype(int)
    df['IsMonthEnd'] = df['Date'].dt.is_month_end.astype(int)
    return df

train_data = extract_date_features(train_data)
test_data = extract_date_features(test_data)

logger.info("Extracted date features.")

In [None]:
train_data

<h2>✨ Additional Features (Days to/from Holidays, Promotions, etc.) ✨</h2>


In [8]:
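# Create custom features
def custom_features(df):
    # Example holiday dates only; replace with the holiday calendar that applies to the stores
    holidays = pd.to_datetime(['2015-01-01', '2015-07-04', '2015-12-25'])
    # Days until the next listed holiday (0 on a holiday, and 0 when no listed holiday lies ahead)
    df['DaysToHoliday'] = df['Date'].apply(lambda x: min([(h - x).days for h in holidays if h >= x], default=0))
    # Days since the most recent listed holiday (0 on a holiday, and 0 when no listed holiday lies behind)
    df['DaysAfterHoliday'] = df['Date'].apply(lambda x: min([(x - h).days for h in holidays if h <= x], default=0))
    return df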

train_data = custom_features(train_data)
test_data = custom_features(test_data)

logger.info("Added custom features.")
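
Row-wise `apply` with a Python loop over the holidays can be slow on a dataset of this size. A vectorized alternative using `np.searchsorted` is sketched below. It keeps the same convention as `custom_features` (0 when no listed holiday lies ahead or behind). The function name `holiday_distances` is an assumption, and the holiday dates are the example dates from the cell above.

```python
import numpy as np
import pandas as pd

def holiday_distances(dates, holidays):
    """Hypothetical vectorized version of the DaysToHoliday / DaysAfterHoliday features."""
    hol = np.sort(pd.to_datetime(holidays).values.astype("datetime64[D]"))
    d = pd.to_datetime(dates).values.astype("datetime64[D]")

    nxt = np.searchsorted(hol, d, side="left")       # index of first holiday >= date
    prv = np.searchsorted(hol, d, side="right") - 1  # index of last holiday <= date

    # Clip the indices so the lookup is always valid, then mask the out-of-range cases with 0
    to_next = (hol[np.minimum(nxt, len(hol) - 1)] - d).astype(int)
    since_prev = (d - hol[np.maximum(prv, 0)]).astype(int)
    return pd.DataFrame({
        "DaysToHoliday": np.where(nxt < len(hol), to_next, 0),
        "DaysAfterHoliday": np.where(prv >= 0, since_prev, 0),
    }, index=dates.index)

example_dates = pd.Series(pd.to_datetime(["2015-01-01", "2015-06-30", "2015-12-31", "2014-12-30"]))
features = holiday_distances(example_dates, ["2015-12-25", "2015-01-01", "2015-07-04"])
print(features)
```

The result can be joined back with `df[["DaysToHoliday", "DaysAfterHoliday"]] = holiday_distances(df["Date"], holidays)`.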

In [None]:
train_data.dtypes

In [14]:
# Step 1: Identify numeric columns in train_data
numeric_features_train = train_data.select_dtypes(include=['int64', 'float64', 'int32', 'UInt32']).columns.tolist()

# Step 2: Identify numeric columns in test_data
numeric_features_test = test_data.select_dtypes(include=['int64', 'float64', 'int32', 'UInt32']).columns.tolist()

# Step 3: Find common numeric features between train and test data (sorted for a reproducible column order)
common_numeric_features = sorted(set(numeric_features_train) & set(numeric_features_test))

# Step 4: Initialize the scaler
scaler = StandardScaler()

# Step 5: Fit and transform numeric features in train_data
train_data[common_numeric_features] = scaler.fit_transform(train_data[common_numeric_features])

# Step 6: Transform numeric features in test_data
test_data[common_numeric_features] = scaler.transform(test_data[common_numeric_features])

# Log a message to confirm scaling
logger.info("Scaled numeric features for both train and test datasets.")


In [None]:
train_data.columns

In [None]:
test_data.columns

<h1>✨ Build Models with Sklearn Pipelines ✨</h1>


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Identify categorical columns
categorical_columns = ['StateHoliday', 'StoreType', 'Assortment', 'PromoInterval']

# Identify numerical columns
# (X_train, y_train, and X_test are assumed to have been prepared in an earlier step)
numerical_columns = X_train.select_dtypes(include=['float64', 'int64']).columns.tolist()

# Remove categorical columns from numerical columns
numerical_columns = [col for col in numerical_columns if col not in categorical_columns]

# Define column transformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numerical_columns),  # Pass through numerical columns
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns)  # One-hot encode categorical columns
    ]
)

# Define pipeline with preprocessing and model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=42))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Make predictions on test data
y_pred = pipeline.predict(X_test)

# Save predictions
test_data['Predicted_Sales'] = y_pred
test_data[['Id', 'Predicted_Sales']].to_csv('predicted_sales.csv', index=False)

print("Model trained and predictions saved successfully.")
