## 🛠️ Data Preprocessing for Modeling

In this step, we prepare the dataset for time series modeling by:

- Renaming columns to match the expected format (`ds`, `y`)
- Converting data types (e.g., `date` to datetime)
- Dropping columns the per-item models do not use (`item_id`, `category`, `month`, `is_holiday`), keeping only what Prophet needs for training
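
As a minimal sketch of these steps, the snippet below applies them to a tiny made-up frame rather than the real `sales_data.csv`. The sample rows and the `to_prophet_frame` helper name are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def to_prophet_frame(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the `ds`/`y` column names and dtypes Prophet expects."""
    # Keep only the columns used for per-item modeling
    out = raw[["item_name", "date", "sales_qty"]].copy()
    # Parse date strings into datetime64 values
    out["date"] = pd.to_datetime(out["date"])
    return out.rename(columns={"date": "ds", "sales_qty": "y"})


# Made-up sample rows, for illustration only
sample = pd.DataFrame(
    {
        "item_name": ["A", "A", "B"],
        "date": ["2023-01-01", "2023-01-02", "2023-01-01"],
        "sales_qty": [11, 25, 7],
        "month": [1, 1, 1],
    }
)

prepared = to_prophet_frame(sample)
print(prepared.dtypes)
```

The helper works on a copy, so the original frame keeps its columns and can be inspected again later.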


In [16]:
import numpy as np 
import pandas as pd  

df = pd.read_csv('../data/sales_data.csv')

In [17]:
df.head()

|  | item_id | item_name | category | date | sales_qty | month | is_holiday |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Piątnica Mleko 3.2% | Молочні продукти | 2023-01-01 | 11 | 1 | True |
| 1 | 1 | Piątnica Mleko 3.2% | Молочні продукти | 2023-01-02 | 25 | 1 | False |
| 2 | 1 | Piątnica Mleko 3.2% | Молочні продукти | 2023-01-03 | 37 | 1 | False |
| 3 | 1 | Piątnica Mleko 3.2% | Молочні продукти | 2023-01-04 | 60 | 1 | False |
| 4 | 1 | Piątnica Mleko 3.2% | Молочні продукти | 2023-01-05 | 82 | 1 | False |


In [18]:
df.describe()

|  | item_id | sales_qty | month |
|---|---|---|---|
| count | 475150.0 | 475150.0 | 475150.0 |
| mean | 325.5 | 21.603424 | 6.519836 |
| std | 187.638813 | 12.040455 | 3.449555 |
| min | 1.0 | 0.0 | 1.0 |
| 25% | 163.0 | 13.0 | 4.0 |
| 50% | 325.5 | 19.0 | 7.0 |
| 75% | 488.0 | 28.0 | 10.0 |
| max | 650.0 | 141.0 | 12.0 |


In [19]:
df.drop(columns=['item_id', 'category', 'month', 'is_holiday'], inplace=True)

In [20]:
df.describe()

|  | sales_qty |
|---|---|
| count | 475150.0 |
| mean | 21.603424 |
| std | 12.040455 |
| min | 0.0 |
| 25% | 13.0 |
| 50% | 19.0 |
| 75% | 28.0 |
| max | 141.0 |


In [21]:
df.head()

|  | item_name | date | sales_qty |
|---|---|---|---|
| 0 | Piątnica Mleko 3.2% | 2023-01-01 | 11 |
| 1 | Piątnica Mleko 3.2% | 2023-01-02 | 25 |
| 2 | Piątnica Mleko 3.2% | 2023-01-03 | 37 |
| 3 | Piątnica Mleko 3.2% | 2023-01-04 | 60 |
| 4 | Piątnica Mleko 3.2% | 2023-01-05 | 82 |


In [22]:
df.dtypes

item_name    object
date         object
sales_qty     int64
dtype: object

In [23]:
df['date'] = pd.to_datetime(df['date'])
df['item_name'] = df['item_name'].astype('category')


In [24]:
df.head()

|  | item_name | date | sales_qty |
|---|---|---|---|
| 0 | Piątnica Mleko 3.2% | 2023-01-01 | 11 |
| 1 | Piątnica Mleko 3.2% | 2023-01-02 | 25 |
| 2 | Piątnica Mleko 3.2% | 2023-01-03 | 37 |
| 3 | Piątnica Mleko 3.2% | 2023-01-04 | 60 |
| 4 | Piątnica Mleko 3.2% | 2023-01-05 | 82 |


In [25]:
df = df.rename(columns={'date': 'ds', 'sales_qty': 'y'})

In [26]:
df.head()

|  | item_name | ds | y |
|---|---|---|---|
| 0 | Piątnica Mleko 3.2% | 2023-01-01 | 11 |
| 1 | Piątnica Mleko 3.2% | 2023-01-02 | 25 |
| 2 | Piątnica Mleko 3.2% | 2023-01-03 | 37 |
| 3 | Piątnica Mleko 3.2% | 2023-01-04 | 60 |
| 4 | Piątnica Mleko 3.2% | 2023-01-05 | 82 |


## 🔮 Forecasting with Prophet

In this step, we train an individual Prophet time series model for each unique product in the dataset:

- Each product's historical daily sales data is used to train a separate model
- The model forecasts the next 30 days of sales
- Forecasts include predicted values (`yhat`) and uncertainty intervals (`yhat_lower`, `yhat_upper`; 80% wide by default in Prophet)
- All models are saved as `.pkl` files for future use
- Forecast results are combined into a single DataFrame for easy access and visualization


In [27]:
from prophet import Prophet
import pandas as pd
import pickle
import os

# 📁 Create directory for saving models
model_dir = 'saved_models'
os.makedirs(model_dir, exist_ok=True)

# 🔁 Loop through unique products
forecast_list = []
forecast_horizon = 30  # number of days to forecast

for item in df['item_name'].unique():
    # Keep only the two columns Prophet needs for this item
    item_df = df[df['item_name'] == item][['ds', 'y']].copy()

    try:
        # 🧠 Train the model
        model = Prophet()
        model.fit(item_df)

        # 📈 Make forecast (history plus the next `forecast_horizon` days)
        future = model.make_future_dataframe(periods=forecast_horizon)
        forecast = model.predict(future)

        # 📊 Prepare results
        forecast_result = forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].copy()
        forecast_result['item_name'] = item
        forecast_list.append(forecast_result)

        # 💾 Save the model (the item name is used verbatim in the file name)
        with open(os.path.join(model_dir, f'prophet_{item}.pkl'), 'wb') as f:
            pickle.dump(model, f)

    except Exception as e:
        print(f"⚠️ Error processing '{item}': {e}")

# 📚 Combine all forecasts into one DataFrame
final_forecast_df = pd.concat(forecast_list, ignore_index=True)

# ✅ Final result
print(final_forecast_df.head())


20:39:34 - cmdstanpy - INFO - Chain [1] start processing
20:39:34 - cmdstanpy - INFO - Chain [1] done processing

          ds       yhat  yhat_lower  yhat_upper            item_name
0 2023-01-01  32.540017   17.764524   47.585393  Piątnica Mleko 3.2%
1 2023-01-02  35.012600   19.931512   49.799393  Piątnica Mleko 3.2%
2 2023-01-03  39.967029   24.647311   55.826834  Piątnica Mleko 3.2%
3 2023-01-04  37.870754   23.129344   52.617087  Piątnica Mleko 3.2%
4 2023-01-05  35.593315   20.309577   50.305416  Piątnica Mleko 3.2%
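
The training loop writes each model to a `.pkl` file named after the raw item name. As a hedged sketch of how saving and reloading could look with file-system-safe names, the snippet below uses hypothetical helpers (`model_path`, `save_model`, `load_model`) and a plain dictionary as a stand-in for a fitted model, so it runs without Prophet installed. The item name in it is made up.

```python
import os
import pickle
import re
import tempfile


def model_path(model_dir: str, item_name: str) -> str:
    """Build a .pkl path, replacing characters that are risky in file names."""
    # \w matches Unicode letters and digits, so names like "Piątnica" survive
    safe = re.sub(r"[^\w.-]+", "_", item_name).strip("_")
    return os.path.join(model_dir, f"prophet_{safe}.pkl")


def save_model(model, path: str) -> None:
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path: str):
    # Only unpickle files you created yourself: pickle can run arbitrary code
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in object (assumption): a real run would pass the fitted Prophet model
fake_model = {"item": "Brand/Milk 3.2%", "params": [1.0, 2.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = model_path(tmp_dir, fake_model["item"])
    save_model(fake_model, path)
    restored = load_model(path)

print(os.path.basename(path))  # prophet_Brand_Milk_3.2.pkl
```

Two different item names can map to the same sanitized file name, so a real pipeline may want to key files by `item_id` instead. Prophet also ships JSON helpers in `prophet.serialize` (`model_to_json`, `model_from_json`), which may be worth considering as an alternative to pickle.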


### Interactive Widget to Display 30-Day Forecast per Item

This code creates a dropdown menu to select a product and prints the predicted sales quantity for the next 30 days from the forecast DataFrame `final_forecast_df`. Because `final_forecast_df` holds fitted values for the historical dates as well as the forecast, the code treats the last 30 days in the DataFrame as the forecast and everything before them as history.
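
The history/forecast split can also be made per item rather than from one global cutoff date. The sketch below is an illustration on a made-up stand-in for `final_forecast_df` (two items, 40 daily rows each), assuming the last `horizon` rows of every item are the forecast part, as is the case when each item's future frame is built with `periods=horizon`.

```python
import pandas as pd

horizon = 30  # assumed to match forecast_horizon in the training cell

# Made-up stand-in for final_forecast_df: 2 items, 40 daily rows each
dates = pd.date_range("2024-01-01", periods=40, freq="D")
demo = pd.concat(
    [
        pd.DataFrame({"ds": dates, "yhat": float(i), "item_name": name})
        for i, name in enumerate(["A", "B"])
    ],
    ignore_index=True,
)

# Sort so that "last rows" means "latest dates" within each item
demo = demo.sort_values(["item_name", "ds"])

# The last `horizon` rows per item are the forecast; the rest is fitted history
future_part = demo.groupby("item_name", sort=False).tail(horizon)
history_part = demo.drop(future_part.index)

print(len(history_part), len(future_part))  # 20 60
```

Unlike a single cutoff derived from the overall maximum date, this per-item split still works if some items stop selling earlier than others.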


In [28]:
import pandas as pd
import ipywidgets as widgets
from IPython.display import display, clear_output

# Ensure 'ds' is datetime (a no-op if it already is)
final_forecast_df['ds'] = pd.to_datetime(final_forecast_df['ds'])

# Assuming all items share the same last observed date, rows dated after
# (max date - 30 days) are forecasts and earlier rows are fitted history
last_real_date = final_forecast_df['ds'].max() - pd.Timedelta(days=30)

# Create a dropdown widget for item selection
item_dropdown = widgets.Dropdown(
    options=final_forecast_df['item_name'].unique().tolist(),
    description='Select Item:',
    disabled=False,
)

output = widgets.Output()

def show_forecast(selected_item):
    """Print the 30-day forecast for one item inside the output widget."""
    with output:
        clear_output()
        # Keep only this item's rows dated after last_real_date (the forecast part)
        df_item = final_forecast_df[
            (final_forecast_df['item_name'] == selected_item) &
            (final_forecast_df['ds'] > last_real_date)
        ][['ds', 'yhat']].copy()  # copy to avoid modifying a view of the original
        df_item['yhat'] = df_item['yhat'].round().astype(int)
        df_item['ds'] = df_item['ds'].dt.strftime('%Y-%m-%d')

        if df_item.empty:
            print("No forecast available for the next 30 days")
        else:
            for _, row in df_item.iterrows():
                print(f"{row['ds']}: {row['yhat']} units")

def on_change(change):
    show_forecast(change['new'])

# Subscribe to changes of the dropdown's value only
item_dropdown.observe(on_change, names='value')

display(item_dropdown, output)

# Show the forecast for the first item on start; re-assigning the value the
# dropdown already holds would not fire a change event
show_forecast(item_dropdown.value)


Dropdown(description='Select Item:', options=('Piątnica Mleko 3.2%', 'Mlekovita Ser żółty plasterki', 'Presi…

Output()

In [29]:
# 'ds' is already datetime after preprocessing, so read the range from it directly
# Get the minimum and maximum dates from the dataset
min_date = df['ds'].min()
max_date = df['ds'].max()

# Print the date range
print(f"Start date in dataset: {min_date.strftime('%Y-%m-%d')}")
print(f"End date in dataset: {max_date.strftime('%Y-%m-%d')}")


Start date in dataset: 2023-01-01
End date in dataset: 2024-12-31
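
The date range also allows a quick completeness check. From 2023-01-01 to 2024-12-31 there are 731 calendar days, and 650 items × 731 days = 475,150, which matches the row count reported by `df.describe()`. The sketch below shows one way to run the same check per item. It uses a made-up frame in the notebook's `ds`/`y` layout, in which item "B" is deliberately missing a day.

```python
import pandas as pd

# Made-up frame for illustration; item "B" has no row for 2023-01-02
demo = pd.DataFrame(
    {
        "item_name": ["A", "A", "A", "B", "B"],
        "ds": pd.to_datetime(
            ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-01", "2023-01-03"]
        ),
        "y": [11, 25, 37, 5, 9],
    }
)

# Number of calendar days the full range should contain (both ends included)
expected_days = (demo["ds"].max() - demo["ds"].min()).days + 1

# Distinct dates actually present for each item
days_per_item = demo.groupby("item_name")["ds"].nunique()
incomplete = days_per_item[days_per_item < expected_days]

print(expected_days)         # 3
print(incomplete.to_dict())  # {'B': 2}
```

An empty `incomplete` result on the real data would indicate that every item has one row per day across the whole range.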
