# Overview of project
We are provided historic data of raw material deliveries and orders through the end of 2024. GOAL: Develop a model that forecasts the cumulative weight of incoming deliveries of each raw material from Jan 1, 2025, up to any specified end date between Jan 1 and May 31, 2025.

- receivals = historical records of material receivals
- purchase_orders = ordered quantities and expected delivery dates
- materials(opt) = metadata on various raw materials
- transportation(opt) = transport-related data

QuantileLoss0.2(Fi,Ai) = max(0.2*(Ai − Fi), 0.8*(Fi − Ai)).

rm_id = unique identifier for a raw material
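
As a quick sanity check on the metric above, here is a minimal sketch of the 0.2 quantile loss for a single forecast/actual pair. The function name `quantile_loss_02` and the numbers are made up for illustration. The point it shows is that forecasting too high costs four times as much as forecasting too low by the same amount.

```python
def quantile_loss_02(forecast: float, actual: float) -> float:
    """QuantileLoss0.2(F, A) = max(0.2 * (A - F), 0.8 * (F - A))."""
    return max(0.2 * (actual - forecast), 0.8 * (forecast - actual))


# Illustrative numbers only (not taken from the dataset).
under = quantile_loss_02(forecast=900.0, actual=1000.0)   # forecast is 100 too low
over = quantile_loss_02(forecast=1100.0, actual=1000.0)   # forecast is 100 too high
print(under, over)
```

Because of this asymmetry, a cautious (low) forecast is usually the safer choice under this metric.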

In [65]:
# We need to explore the data

# First I want to check the difference between purchase orders
# and receivals. How big is the gap between the two?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
data_orders = pd.read_csv('data/kernel/purchase_orders.csv')
data_receivals = pd.read_csv('data/kernel/receivals.csv')

# Link the first 5 orders to the receivals
data = pd.merge(data_orders.head(5), data_receivals, on='purchase_order_id', suffixes=('_order', '_receival'), how='left')

data.head(2)

# NOT EVERY ORDER HAS A RECEIVAL? Oh... it makes sense, because some orders are never received? But I only took the first 5 with head()...
# 5 whole orders with no receivals? That seems like a lot... Maybe purchase_order_id alone is just a bad key to merge on.
# purchase_order_item_no alone is no better: it is only a small item counter (1, 2, ...) within each order.

Unnamed: 0,purchase_order_id,purchase_order_item_no_order,quantity,delivery_date,product_id_order,product_version,created_date_time,modified_date_time,unit_id,unit,...,status,rm_id,product_id_receival,purchase_order_item_no_receival,receival_item_no,batch_id,date_arrival,receival_status,net_weight,supplier_id
0,1,1,-14.0,2003-05-12 00:00:00.0000000 +02:00,91900143,1,2003-05-12 10:00:48.0000000 +00:00,2004-06-15 06:16:18.0000000 +00:00,,,...,Closed,,,,,,,,,
1,22,1,23880.0,2003-05-27 00:00:00.0000000 +02:00,91900160,1,2003-05-27 12:42:07.0000000 +00:00,2012-06-29 09:41:13.0000000 +00:00,,,...,Closed,,,,,,,,,


In [60]:
# Count all rows with quantity ordered negative
print(data_orders['quantity'].lt(0).sum())

print(data_orders['quantity'].eq(150000).sum())
# 6 rows with negative quantity... prob wrong..

6
563


In [61]:
# I want to check the transportation of the 5 orders in the head
data_transport = pd.read_csv('data/extended/transportation.csv')

data = pd.merge(data_orders.head(5), data_transport, on='purchase_order_id', suffixes=('_order', '_transport'))

print(data)

# Observation: Not all orders are transported either....

Empty DataFrame
Columns: [purchase_order_id, purchase_order_item_no_order, quantity, delivery_date, product_id_order, product_version, created_date_time, modified_date_time, unit_id, unit, status_id, status, rm_id, product_id_transport, purchase_order_item_no_transport, receival_item_no, batch_id, transporter_name, vehicle_no, unit_status, vehicle_start_weight, vehicle_end_weight, gross_weight, tare_weight, net_weight, wood, ironbands, plastic, water, ice, other, chips, packaging, cardboard]
Index: []

[0 rows x 34 columns]


In [53]:
# I want to check the material details of the 5 orders in the head
# Never mind: orders have no column that links them directly to materials

In [None]:
# Okay, let's try dropping all the orders with no receivals and predicting from the rest. In a real scenario I probably shouldn't,
# because an order with no receivals might simply mean 0 was received. I don't know if that's true, so I have to test it.
# So try 2 things: 1. drop the orders with no receivals, 2. set the received weight to 0 if there are no receivals

# But first I need to know what my model will predict from. Will I get both orders and receivals, or only the previous orders?
# I don't think I'll get any new orders for the forecast period, so I guess I have to predict based on previous orders

# Purchase orders have an expected delivery_date though.
# The dates use the YYYY-MM-DD format, I guess
# The forecast window is 2025-01-01 to 2025-05-31
# We have some deliveries expected in 2025-03-XX, but none after, so we probably need to predict that there will be more orders.

# By making a model that predicts the quantity ordered based on previous orders, I can then use that to predict future orders
# I should probably make a model for each of the materials and then sum them up for each order date

In [62]:
# Start by dropping the orders with no receivals (inner merge)
data = pd.merge(
    data_orders,
    data_receivals,
    on=['purchase_order_id', 'purchase_order_item_no'],
    suffixes=('_order', '_receival')
)

# 122537 rows, but receivals has 122590 rows. So some receivals are from orders not in the orders dataset?
data_extra_receivals = pd.merge(
    data_orders,
    data_receivals,
    on=['purchase_order_id', 'purchase_order_item_no'],
    suffixes=('_order', '_receival'),
    how='right'
)

print(data_extra_receivals.shape)
print(data.shape)
# 122590 rows vs 122537, so 53 extra receivals that are not in the orders dataset
# Let's check if they are all from the same purchase_order_id

# I want the data_extra_receivals rows that are not in data
data_diff = pd.concat([data_extra_receivals, data]).drop_duplicates(keep=False)
print(data_diff.head(5))



(122590, 20)
(122537, 20)
       purchase_order_id  purchase_order_item_no  quantity delivery_date  \
61798                NaN                     NaN       NaN           NaN   
63356                NaN                     NaN       NaN           NaN   
64105                NaN                     NaN       NaN           NaN   
65448                NaN                     NaN       NaN           NaN   
71981                NaN                     NaN       NaN           NaN   

       product_id_order  product_version created_date_time modified_date_time  \
61798               NaN              NaN               NaN                NaN   
63356               NaN              NaN               NaN                NaN   
64105               NaN              NaN               NaN                NaN   
65448               NaN              NaN               NaN                NaN   
71981               NaN              NaN               NaN                NaN   

       unit_id unit  status_id

In [63]:
# Check receivals with no purchase_order_id or purchase_order_item_no
print(data_receivals['purchase_order_id'].isna().sum())
print(data_receivals['purchase_order_item_no'].isna().sum())

53
53


In [67]:
# Okay, let me first try a left merge: all orders, with receivals attached where they can be linked to a purchase order
data = pd.merge(
    data_orders,
    data_receivals,
    on=['purchase_order_id', 'purchase_order_item_no'],
    suffixes=('_order', '_receival'),
    how='left'
)

In [82]:
data = pd.read_csv('data/kernel/receivals.csv')
print(data['receival_status'].unique())
print(data['receival_status'].value_counts())

['Completed' 'Finished unloading' 'Planned' 'Start unloading']
receival_status
Completed             122448
Finished unloading       106
Start unloading           32
Planned                    4
Name: count, dtype: int64


In [1]:
# Found out that some materials stop being ordered after some time. Maybe they are obsolete?
# Some dates use different time zones than others
# Some units are in KG, others in LBs or pounds --> Need to convert them to KG and drop that column
# Some materials have their stock marked as deleted in the materials file

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

############## CLEANING THE PURCHASE ORDERS DATA ##############

orders = pd.read_csv("./data/kernel/purchase_orders.csv")

# Convert the orders given in PUND (pounds) to KG and change quantity accordingly
# 1 PUND = 0.45359237 kilogram
orders.loc[orders['unit'] == 'PUND', 'quantity'] = orders.loc[orders['unit'] == 'PUND', 'quantity'] * 0.45359237
# Change the unit to KG too: orders.loc[orders['unit'] == 'PUND', 'unit'] = 'KG'
# Drop unit_id and unit columns
orders = orders.drop(columns=['unit_id', 'unit'])

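# Timestamps are converted to a fixed UTC+2 offset ('Etc/GMT-2'; the sign in Etc/ zone names is inverted).
# Norway is on UTC+2 only during summer time (UTC+1 in winter), so 'Europe/Oslo' would be needed to follow DST.
# Convert delivery_date, created_date_time and modified_date_time to UTC+2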
orders['delivery_date'] = pd.to_datetime(orders['delivery_date'], utc=True).dt.tz_convert('Etc/GMT-2')
orders['created_date_time'] = pd.to_datetime(orders['created_date_time'], utc=True).dt.tz_convert('Etc/GMT-2')
orders['modified_date_time'] = pd.to_datetime(orders['modified_date_time'], utc=True).dt.tz_convert('Etc/GMT-2')

# Save the cleaned data to a new CSV file in data_cleaned folder
#orders.to_csv('./data_cleaned/purchase_orders_cleaned.csv', index=False)

   purchase_order_id  purchase_order_item_no  quantity  \
0                  1                       1     -14.0   
1                 22                       1   23880.0   

              delivery_date  product_id  product_version  \
0 2003-05-12 00:00:00+02:00    91900143                1   
1 2003-05-27 00:00:00+02:00    91900160                1   

          created_date_time        modified_date_time  status_id  status  
0 2003-05-12 12:00:48+02:00 2004-06-15 08:16:18+02:00          2  Closed  
1 2003-05-27 14:42:07+02:00 2012-06-29 11:41:13+02:00          2  Closed  
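
The cleaning cell above converts every timestamp with `'Etc/GMT-2'`. As a small sketch of what that choice means, the example below converts two made-up UTC timestamps (one in winter, one in summer) to both the fixed offset and to `'Europe/Oslo'`. The timestamps are illustrative and not taken from the dataset.

```python
import pandas as pd

# Two made-up UTC timestamps: one in winter, one in summer.
utc_times = pd.Series(
    pd.to_datetime(["2024-01-15 12:00:00", "2024-07-15 12:00:00"], utc=True)
)

# 'Etc/GMT-2' is always UTC+2 (the sign in Etc/ zone names is inverted).
fixed_offset = utc_times.dt.tz_convert("Etc/GMT-2")
# 'Europe/Oslo' follows daylight saving time: UTC+1 in winter, UTC+2 in summer.
oslo = utc_times.dt.tz_convert("Europe/Oslo")

winter_hours = (int(fixed_offset.dt.hour.iloc[0]), int(oslo.dt.hour.iloc[0]))
summer_hours = (int(fixed_offset.dt.hour.iloc[1]), int(oslo.dt.hour.iloc[1]))
print(winter_hours, summer_hours)
```

The two conversions agree in summer and differ by one hour in winter. For daily aggregation this only matters for timestamps close to midnight.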


In [24]:
### CLEANING THE RECEIVALS DATA ###
receivals = pd.read_csv("./data/kernel/receivals.csv")

# Convert date_arrival to the same fixed UTC+2 offset
receivals['date_arrival'] = pd.to_datetime(receivals['date_arrival'], utc=True).dt.tz_convert('Etc/GMT-2')
# Save the cleaned data to a new CSV file in data_cleaned folder
receivals.to_csv('./data_cleaned/receivals_cleaned.csv', index=False)

In [None]:
# METHOD 1: Aggregate the receivals per order line first,
# then merge the aggregated receivals into orders. THIS IS NOT IN USE RIGHT NOW
import pandas as pd

orders = pd.read_csv("./data_cleaned/purchase_orders_cleaned.csv", parse_dates=["delivery_date", "created_date_time", "modified_date_time"])
receivals = pd.read_csv("./data_cleaned/receivals_cleaned.csv", parse_dates=["date_arrival"])

# --- Aggregate receivals per order line ---
order_receivals = receivals.groupby(
    ["purchase_order_id", "purchase_order_item_no", "rm_id"]
).agg(
    total_received_qty=("net_weight", "sum"),
    first_receival_date=("date_arrival", "min")
).reset_index()

# --- Merge into orders ---
orders = orders.merge(
    order_receivals,
    on=["purchase_order_id", "purchase_order_item_no"],
    how="left"
)
print(orders.shape)

# --- Fill missing values for undelivered orders ---
orders["total_received_qty"] = orders["total_received_qty"].fillna(0)
orders["first_receival_date"] = pd.to_datetime(orders["first_receival_date"])

# --- Derived features ---
orders["fill_fraction"] = orders["total_received_qty"] / orders["quantity"]
orders["lead_time"] = (orders["first_receival_date"] - orders["delivery_date"]).dt.days
orders["lead_time"] = orders["lead_time"].fillna(0)

# --- Save the final cleaned and merged dataset ---
#orders.to_csv('./data_cleaned/orders_with_receivals.csv', index=False)
orders.to_csv('./data_cleaned/rm_idGROUPED.csv', index=False)

(38297, 13)


In [52]:
# Method 2: Merge orders and receivals directly, without aggregating first
# This creates duplicate rows for orders with multiple receivals, but we can aggregate them later

# --- Load data ---
orders = pd.read_csv(
    "./data_cleaned/purchase_orders_cleaned.csv",
    parse_dates=["delivery_date", "created_date_time", "modified_date_time"]
)
receivals = pd.read_csv(
    "./data_cleaned/receivals_cleaned.csv",
    parse_dates=["date_arrival"]
)

# --- Merge orders and receivals WITHOUT aggregation ---
orders_with_receivals = orders.merge(
    receivals,
    on=["purchase_order_id", "purchase_order_item_no"],
    how="left",
    suffixes=('_order', '_receival')
)

# --- Fill missing values for orders with no receivals ---
orders_with_receivals["net_weight"] = orders_with_receivals["net_weight"].fillna(0)
orders_with_receivals["date_arrival"] = pd.to_datetime(orders_with_receivals["date_arrival"])

# --- Derived features ---
orders_with_receivals["fill_fraction"] = orders_with_receivals["net_weight"] / orders_with_receivals["quantity"]
orders_with_receivals["lead_time"] = (
    orders_with_receivals["date_arrival"] - orders_with_receivals["delivery_date"]
).dt.days
orders_with_receivals["lead_time"] = orders_with_receivals["lead_time"].fillna(0)

# --- Save result ---
#orders_with_receivals.to_csv("./data_cleaned/orders_with_receivals_detailed.csv", index=False)

print(orders_with_receivals.shape)


(133409, 20)
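
One thing to keep in mind with Method 2: `fill_fraction` is computed per merged row, so an order that arrives in several receivals gets several small fractions instead of one. The sketch below uses made-up toy data (one order line of 100, delivered as 40 + 50) to compare the row-level mean with the order-level value. The names `row_level` and `order_level` are only for illustration.

```python
import pandas as pd

keys = ["purchase_order_id", "purchase_order_item_no"]

# Toy data: one order line of 100 units, received in two deliveries.
orders = pd.DataFrame(
    {"purchase_order_id": [1], "purchase_order_item_no": [1], "quantity": [100.0]}
)
receivals = pd.DataFrame(
    {
        "purchase_order_id": [1, 1],
        "purchase_order_item_no": [1, 1],
        "net_weight": [40.0, 50.0],
    }
)

merged = orders.merge(receivals, on=keys, how="left")

# Row-level: each receival is divided by the full ordered quantity, then averaged.
row_level = (merged["net_weight"] / merged["quantity"]).mean()

# Order-level: sum the receivals per order line before dividing.
per_order = merged.groupby(keys).agg(
    quantity=("quantity", "first"), received=("net_weight", "sum")
)
order_level = (per_order["received"] / per_order["quantity"]).iloc[0]
print(row_level, order_level)
```

Here the row-level mean is about half the order-level fill fraction, so averages of the row-level `fill_fraction` per material should be read with that in mind.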


In [77]:
orders_merged = pd.read_csv("./data_cleaned/orders_with_receivals.csv")

# print how many orders have value for total_received_qty that is different from 0
print((orders_merged['total_received_qty'] != 0).sum())
print((orders_merged['total_received_qty'] == 0).sum())
print(orders_merged.shape)

27410
10887
(38297, 15)


In [55]:
# Check how many orders with no receivals
print((orders_with_receivals['net_weight'] == 0).sum())
print(orders_with_receivals.shape)
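# 122 590 rows in the right merge, so 53 extra receivals that are not in the orders dataset
# 11 026 rows with net_weight == 0
# 122 537 + 11 026 = 133 563
# So it doesn't make sense that I get 133 409 rows when merging
# Likely reason: 11 026 counts rows, not orders without receivals. After fillna(0) it also includes
# matched receivals whose net_weight is 0 or missing. Orders without receivals would be 133 409 - 122 537 = 10 872
# Still unclear why Method 1 gives 10 887 above and this gives 11 026

# TODO: CONFIRM WHY THE ROW COUNTS DIFFER WHEN MERGING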

11026
(133409, 20)
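
One way to dig into the row-count question is to let pandas label where each merged row came from with `indicator=True`. The sketch below uses made-up toy frames, not the real files. It separates orders without any receival (`left_only`) from rows that merely end up with `net_weight == 0` after `fillna(0)`.

```python
import numpy as np
import pandas as pd

keys = ["purchase_order_id", "purchase_order_item_no"]

# Toy data: order 1 has two receivals (one with weight 0), order 2 has one
# receival with a missing weight, order 3 has no receival at all.
orders = pd.DataFrame(
    {"purchase_order_id": [1, 2, 3], "purchase_order_item_no": [1, 1, 1]}
)
receivals = pd.DataFrame(
    {
        "purchase_order_id": [1, 1, 2],
        "purchase_order_item_no": [1, 1, 1],
        "net_weight": [10.0, 0.0, np.nan],
    }
)

merged = orders.merge(receivals, on=keys, how="left", indicator=True)

matched_rows = int((merged["_merge"] == "both").sum())
unmatched_orders = int((merged["_merge"] == "left_only").sum())
zero_weight_rows = int((merged["net_weight"].fillna(0) == 0).sum())
print(len(merged), matched_rows, unmatched_orders, zero_weight_rows)
```

In the toy data only one order has no receival, yet three rows have a zero weight after filling. If the real data behaves the same way, the `_merge` column would show directly how many of the zero-weight rows are matched receivals.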


In [3]:
# Okay now use orders_with_receivals_detailed.csv to make a model that predicts net_weight based on previous orders
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

orders = pd.read_csv("./data_cleaned/purchase_orders_cleaned.csv", parse_dates=["delivery_date", "created_date_time", "modified_date_time"])
receivals = pd.read_csv("./data_cleaned/receivals_cleaned.csv", parse_dates=["date_arrival"])
orders_with_receivals = pd.read_csv("./data_cleaned/orders_with_receivals_detailed.csv", parse_dates=["delivery_date", "created_date_time", "modified_date_time", "date_arrival"])

# Make sure dates are tz-naive for calculations
orders_with_receivals['date_arrival'] = orders_with_receivals['date_arrival'].dt.tz_convert(None)
orders_with_receivals['delivery_date'] = orders_with_receivals['delivery_date'].dt.tz_convert(None)
orders_with_receivals['created_date_time'] = orders_with_receivals['created_date_time'].dt.tz_convert(None)
orders_with_receivals['modified_date_time'] = orders_with_receivals['modified_date_time'].dt.tz_convert(None)

# Mean fill_fraction, lead_time and weekly order quantity per material

material_stats = orders_with_receivals.groupby('rm_id').agg(
    avg_fill_fraction=('fill_fraction', 'mean'),
    avg_lead_time=('lead_time', 'mean'),
    avg_weekly_order_qty=('quantity', lambda x: x.sum() / ((orders_with_receivals['delivery_date'].max() - orders_with_receivals['delivery_date'].min()).days / 7))
).reset_index()

print(material_stats.head(2))

   rm_id  avg_fill_fraction  avg_lead_time  avg_weekly_order_qty
0  342.0           0.479615           23.0             42.573099
1  343.0           0.004804         -642.0           3708.771930


In [4]:
# But thinking about it more closely: why don't I make a model that trains on
# data up to 2023 and predicts 2024? Then I can actually see if it works
# So I need to split the data into train and test based on date

# Sidenote: I just thought of something for transportation: maybe some transporters are more reliable than others?
# Some transporters might always deliver on time, while others are late. They probably get better as time goes on too?

# Okay let's start with splitting the data into train and test based on date
train_data = orders_with_receivals[orders_with_receivals['delivery_date'] < '2024-01-01']
test_data = orders_with_receivals[orders_with_receivals['delivery_date'] >= '2024-01-01']

# The thing is I have so much data from previous years so I feel like I will overfit if I use all of it
# So I will use only the most recent 3 years of data for training? Hmm some years we got events like
# COVID or NM in skiing that might make some years different too

train_data = train_data[train_data['delivery_date'] >= '2021-01-01']

print(train_data.shape)
print(test_data.shape)




(19430, 20)
(6340, 20)


In [5]:
### TRYING TO PREDICT 2024 BY USING EARLIER YEARS

import pandas as pd
import numpy as np
import lightgbm as lgb

# -----------------------------
# 1. Prepare data
# -----------------------------
# Make sure datetime columns are correct
orders_with_receivals['delivery_date'] = pd.to_datetime(orders_with_receivals['delivery_date'])
orders_with_receivals['date_arrival'] = pd.to_datetime(orders_with_receivals['date_arrival'])

# Fill missing values for lead_time and fill_fraction
orders_with_receivals['lead_time'] = orders_with_receivals['lead_time'].fillna(orders_with_receivals['lead_time'].mean())
orders_with_receivals['fill_fraction'] = orders_with_receivals['fill_fraction'].fillna(orders_with_receivals['fill_fraction'].mean())

# Compute received_qty (target)
orders_with_receivals['received_qty'] = orders_with_receivals['net_weight']

# -----------------------------
# 2. Aggregate daily per rm_id
# -----------------------------
daily_data = orders_with_receivals.groupby(
    ['rm_id', 'delivery_date']
).agg(
    quantity=('quantity', 'sum'),
    avg_lead_time=('lead_time', 'mean'),
    avg_fill_fraction=('fill_fraction', 'mean'),
    received_qty=('received_qty', 'sum')
).reset_index()

# Extract temporal features
daily_data['day_of_week'] = daily_data['delivery_date'].dt.dayofweek
daily_data['week_of_year'] = daily_data['delivery_date'].dt.isocalendar().week
daily_data['month'] = daily_data['delivery_date'].dt.month

# -----------------------------
# 3. Train/test split
# -----------------------------
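# .copy() so that the column assignments further down do not trigger SettingWithCopyWarning
train_df = daily_data[daily_data['delivery_date'] < '2024-01-01'].copy()
test_df  = daily_data[(daily_data['delivery_date'] >= '2024-01-01') & (daily_data['delivery_date'] < '2025-01-01')].copy()

# NOTE: avg_lead_time and avg_fill_fraction are derived from the receivals (date_arrival, net_weight),
# so they are not known before delivery and leak information about the target
features = ['rm_id', 'quantity', 'avg_lead_time', 'avg_fill_fraction', 'day_of_week', 'week_of_year', 'month']
target = 'received_qty'

X_train = train_df[features].copy()
y_train = train_df[target]
X_test = test_df[features].copy()
y_test = test_df[target]

# Convert rm_id to categorical
X_train['rm_id'] = X_train['rm_id'].astype('category')
X_test['rm_id'] = X_test['rm_id'].astype('category')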

# -----------------------------
# 4. Train LightGBM for 0.2 Quantile Regression
# -----------------------------
train_data = lgb.Dataset(X_train, label=y_train, categorical_feature=['rm_id'])
val_data = lgb.Dataset(X_test, label=y_test, reference=train_data, categorical_feature=['rm_id'])

params_quantile = {
    'objective': 'quantile',
    'alpha': 0.2,  # 0.2 quantile
    'metric': 'quantile',
    'boosting_type': 'gbdt',
    'learning_rate': 0.05,
    'num_leaves': 31,
    'seed': 42
}

model_q = lgb.train(
    params_quantile,
    train_data,
    valid_sets=[train_data, val_data],
    num_boost_round=10000,
    callbacks=[
        lgb.early_stopping(stopping_rounds=50),
        lgb.log_evaluation(period=50)
    ]
)

# -----------------------------
# 5. Predict 2024 daily received quantities
# -----------------------------
test_df['forecast_received_qty_0_2'] = model_q.predict(X_test)

# Compute cumulative forecast per rm_id
test_df['forecast_cum_qty_0_2'] = test_df.groupby('rm_id')['forecast_received_qty_0_2'].cumsum()

# -----------------------------
# 6. Compute 0.2 Quantile Loss for model
# -----------------------------
def quantile_loss_0_2(actual, forecast):
    return np.mean(np.maximum(0.2*(actual - forecast), 0.8*(forecast - actual)))

qloss_model = quantile_loss_0_2(y_test.values, test_df['forecast_received_qty_0_2'].values)
print("Quantile Loss 0.2 on 2024 data (model):", qloss_model)

# -----------------------------
# 7. Compare against zeros baseline
# -----------------------------
zero_forecast = np.zeros_like(y_test.values)
qloss_zero = quantile_loss_0_2(y_test.values, zero_forecast)
print("Quantile Loss 0.2 on 2024 data (all zeros):", qloss_zero)


# -----------------------------
# 8. Inspect forecast
# -----------------------------
print(test_df.head(2))



[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000807 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 971
[LightGBM] [Info] Number of data points in the train set: 6203, number of used features: 7
[LightGBM] [Info] Start training from score 10093.000000
Training until validation scores don't improve for 50 rounds
[50]	training's quantile: 32609.4	valid_1's quantile: 32665.8
[100]	training's quantile: 27325.5	valid_1's quantile: 25683.3
[150]	training's quantile: 20760.4	valid_1's quantile: 19496.3
[200]	training's quantile: 13914	valid_1's quantile: 17042.8
[250]	training's quantile: 12347.8	valid_1's quantile: 16252.5
[300]	training's quantile: 11880.1	valid_1's quantile: 16019.3
[350]	training's quantile: 11567.5	valid_1's quantile: 15620.5
[400]	training's quantile: 11341.8	valid_1's quantile: 15580.6
[450]	training's quantile: 1112
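
The loss in this cell is computed on daily rows, while the project goal is the cumulative weight up to an end date. As a hedged sketch of how the two views can differ, the example below scores made-up daily numbers both ways. The toy frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def quantile_loss_0_2(actual, forecast):
    return np.mean(np.maximum(0.2 * (actual - forecast), 0.8 * (forecast - actual)))


# Made-up daily actuals and forecasts for two materials.
toy = pd.DataFrame(
    {
        "rm_id": [1, 1, 1, 2, 2, 2],
        "delivery_date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
        "received_qty": [100.0, 0.0, 50.0, 10.0, 10.0, 10.0],
        "forecast": [40.0, 40.0, 40.0, 20.0, 0.0, 10.0],
    }
).sort_values(["rm_id", "delivery_date"])

# Running totals per material, which is what the cumulative target looks like.
toy["actual_cum"] = toy.groupby("rm_id")["received_qty"].cumsum()
toy["forecast_cum"] = toy.groupby("rm_id")["forecast"].cumsum()

daily_loss = quantile_loss_0_2(toy["received_qty"].to_numpy(), toy["forecast"].to_numpy())
cumulative_loss = quantile_loss_0_2(toy["actual_cum"].to_numpy(), toy["forecast_cum"].to_numpy())
print(daily_loss, cumulative_loss)
```

Daily errors can cancel out or pile up in the running total, so a model that looks good on daily rows is not guaranteed to look good on the cumulative target, and vice versa.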


In [6]:

### TRYING TO PREDICT 2025 BY USING EARLIER YEARS
import pandas as pd
import numpy as np
import lightgbm as lgb

# -----------------------------
# 1. Prepare data
# -----------------------------
# Ensure datetime columns
orders_with_receivals['delivery_date'] = pd.to_datetime(orders_with_receivals['delivery_date'])
orders_with_receivals['date_arrival'] = pd.to_datetime(orders_with_receivals['date_arrival'])

# Fill missing values
orders_with_receivals['lead_time'] = orders_with_receivals['lead_time'].fillna(orders_with_receivals['lead_time'].mean())
orders_with_receivals['fill_fraction'] = orders_with_receivals['fill_fraction'].fillna(orders_with_receivals['fill_fraction'].mean())
orders_with_receivals['received_qty'] = orders_with_receivals['net_weight']

# Filter historical data from 2014 onward
hist_data = orders_with_receivals[orders_with_receivals['delivery_date'] >= '2014-01-01'].copy()

# -----------------------------
# 2. Aggregate daily per rm_id
# -----------------------------
daily_data = hist_data.groupby(['rm_id', 'delivery_date']).agg(
    quantity=('quantity', 'sum'),
    avg_lead_time=('lead_time', 'mean'),
    avg_fill_fraction=('fill_fraction', 'mean'),
    received_qty=('received_qty', 'sum')
).reset_index()

# Temporal features
daily_data['day_of_week'] = daily_data['delivery_date'].dt.dayofweek
daily_data['week_of_year'] = daily_data['delivery_date'].dt.isocalendar().week
daily_data['month'] = daily_data['delivery_date'].dt.month

# -----------------------------
# 3. Prepare features for training
# -----------------------------
features = ['rm_id', 'quantity', 'avg_lead_time', 'avg_fill_fraction', 'day_of_week', 'week_of_year', 'month']
target = 'received_qty'

X_train = daily_data[features].copy()
y_train = daily_data[target]

X_train['rm_id'] = X_train['rm_id'].astype('category')

# -----------------------------
# 4. Train LightGBM 0.2 Quantile Regression
# -----------------------------
train_data = lgb.Dataset(X_train, label=y_train, categorical_feature=['rm_id'])

params_quantile = {
    'objective': 'quantile',
    'alpha': 0.2,
    'metric': 'quantile',
    'boosting_type': 'gbdt',
    'learning_rate': 0.05,
    'num_leaves': 31,
    'seed': 42
}

model_q = lgb.train(
    params_quantile,
    train_data,
    num_boost_round=1000,
    valid_sets=[train_data],
    callbacks=[lgb.log_evaluation(period=50)]
)

# -----------------------------
# 5. Generate future submission
# -----------------------------
future_dates = pd.date_range('2025-01-01', '2025-05-31', freq='D')
rm_ids = daily_data['rm_id'].unique()

submission = pd.MultiIndex.from_product([rm_ids, future_dates], names=['rm_id', 'delivery_date']).to_frame(index=False)

# Fill features using historical averages per rm_id
submission['quantity'] = submission['rm_id'].map(daily_data.groupby('rm_id')['quantity'].mean())
submission['avg_lead_time'] = submission['rm_id'].map(daily_data.groupby('rm_id')['avg_lead_time'].mean())
submission['avg_fill_fraction'] = submission['rm_id'].map(daily_data.groupby('rm_id')['avg_fill_fraction'].mean())
submission['day_of_week'] = submission['delivery_date'].dt.dayofweek
submission['week_of_year'] = submission['delivery_date'].dt.isocalendar().week
submission['month'] = submission['delivery_date'].dt.month

submission['rm_id'] = submission['rm_id'].astype('category')

# Predict
submission['forecast_received_qty_0_2'] = model_q.predict(submission[features])

# Cumulative forecast per rm_id
submission['forecast_cum_qty_0_2'] = submission.groupby('rm_id')['forecast_received_qty_0_2'].cumsum()

# -----------------------------
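# 6. Compare the forecast with an all-zeros forecast
# -----------------------------
# There are no 2025 actuals, so the model forecast stands in for `actual` here.
# The result only measures the size of the forecast, not its accuracy.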
submission['zero_forecast'] = 0
def quantile_loss_0_2(actual, forecast):
    return np.mean(np.maximum(0.2*(actual - forecast), 0.8*(forecast - actual)))

qloss_zero = quantile_loss_0_2(submission['forecast_received_qty_0_2'].values, submission['zero_forecast'].values)
print("Quantile Loss 0.2 between model forecast and all zeros:", qloss_zero)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000633 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 938
[LightGBM] [Info] Number of data points in the train set: 4105, number of used features: 7
[LightGBM] [Info] Start training from score 12172.000977
[50]	training's quantile: 28548.7
[100]	training's quantile: 22746
[150]	training's quantile: 15419.7
[200]	training's quantile: 10614.7
[250]	training's quantile: 8962.28
[300]	training's quantile: 8472.58
[350]	training's quantile: 8201.08
[400]	training's quantile: 7967.16
[450]	training's quantile: 7636.87
[500]	training's quantile: 7501.64
[550]	training's quantile: 7322.02
[600]	training's quantile: 7229.18
[650]	training's quantile: 7104.31
[700]	training's quantile: 6978.08
[750]	training's quantile: 6888.69
[800]	training's quantile: 6788.24
[850]	training's quantile: 6714
[900]	training's quantile: 6649.14
[950]	training's quantile: 6584.1



In [40]:
print(submission.head(3))

   rm_id delivery_date  quantity  avg_lead_time  avg_fill_fraction  \
0  355.0    2025-01-01  250000.0           -2.0            0.09888   
1  355.0    2025-01-02  250000.0           -2.0            0.09888   
2  355.0    2025-01-03  250000.0           -2.0            0.09888   

   day_of_week  week_of_year  month  forecast_received_qty_0_2  \
0            2             1      1               20750.417123   
1            3             1      1               21402.471634   
2            4             1      1               19804.558908   

   forecast_cum_qty_0_2  zero_forecast  
0          20750.417123              0  
1          42152.888756              0  
2          61957.447664              0  


In [8]:
import pandas as pd

# 'submission' = the daily forecast DataFrame built in the previous cells
# columns: ['rm_id', 'delivery_date', 'forecast_received_qty_0_2', ...]

# Load sample_submission and mapping
sample_submission = pd.read_csv("./data/sample_submission.csv")  # columns: ID, predicted_weight (empty)
mapping = pd.read_csv("./data/prediction_mapping.csv")  # columns: ID, rm_id, forecast_start_date, forecast_end_date

mapping['forecast_start_date'] = pd.to_datetime(mapping['forecast_start_date'])
mapping['forecast_end_date'] = pd.to_datetime(mapping['forecast_end_date'])

# Aggregate daily forecasts into predicted_weight
def aggregate_forecast(row):
    mask = (
        (submission['rm_id'] == row['rm_id']) &
        (submission['delivery_date'] >= row['forecast_start_date']) &
        (submission['delivery_date'] <= row['forecast_end_date'])
    )
    return submission.loc[mask, 'forecast_received_qty_0_2'].sum()

mapping['predicted_weight'] = mapping.apply(aggregate_forecast, axis=1)

# Merge back into sample_submission by ID
sample_submission = sample_submission.drop(columns=['predicted_weight']).merge(
    mapping[['ID','predicted_weight']], on='ID', how='left'
)

# Scale every predicted_weight in the filled sample_submission by 0.001
sample_submission['predicted_weight'] = sample_submission['predicted_weight'] * 0.001

# I want predictions with earlier end dates to be more conservative
# So I multiply predictions ending in Jan by 0.01, Feb by 0.02, Mar by 0.5, Apr by 0.95, May by 1.0

# Merge forecast_end_date into sample_submission
sample_submission = sample_submission.merge(
    mapping[['ID', 'forecast_end_date']],
    on='ID',
    how='left'
)

def adjust_for_month(row):
    month = row['forecast_end_date'].month
    if month == 1:
        return row['predicted_weight'] * 0.01
    elif month == 2:
        return row['predicted_weight'] * 0.02
    elif month == 3:
        return row['predicted_weight'] * 0.5
    elif month == 4:
        return row['predicted_weight'] * 0.95
    elif month == 5:
        return row['predicted_weight'] * 1.0
    return row['predicted_weight']

sample_submission['predicted_weight'] = sample_submission.apply(adjust_for_month, axis=1)

# Save ready-to-submit CSV
sample_submission.to_csv("filled_sample_submission.csv", index=False)
print(sample_submission.head(20))


    ID  predicted_weight forecast_end_date
0    1               0.0        2025-01-02
1    2               0.0        2025-01-03
2    3               0.0        2025-01-04
3    4               0.0        2025-01-05
4    5               0.0        2025-01-06
5    6               0.0        2025-01-07
6    7               0.0        2025-01-08
7    8               0.0        2025-01-09
8    9               0.0        2025-01-10
9   10               0.0        2025-01-11
10  11               0.0        2025-01-12
11  12               0.0        2025-01-13
12  13               0.0        2025-01-14
13  14               0.0        2025-01-15
14  15               0.0        2025-01-16
15  16               0.0        2025-01-17
16  17               0.0        2025-01-18
17  18               0.0        2025-01-19
18  19               0.0        2025-01-20
19  20               0.0        2025-01-21
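
The row-wise `apply` above scans the whole daily forecast once per mapping row. A possible faster alternative, sketched below on made-up toy data, is to take a running total per `rm_id` and look up its value at each `forecast_end_date`. This assumes every window starts on the first forecast day (as the goal statement suggests) and that `rm_id` has the same dtype in both frames. The frame and column names here are illustrative.

```python
import pandas as pd

# Made-up daily forecasts for two materials over three days.
daily = pd.DataFrame(
    {
        "rm_id": [1, 1, 1, 2, 2, 2],
        "delivery_date": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-03"] * 2),
        "forecast": [5.0, 7.0, 1.0, 0.0, 2.0, 4.0],
    }
)
# Made-up mapping; every window is assumed to start on the first forecast day.
mapping = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "rm_id": [1, 1, 2],
        "forecast_start_date": pd.to_datetime(["2025-01-01"] * 3),
        "forecast_end_date": pd.to_datetime(["2025-01-02", "2025-01-03", "2025-01-03"]),
    }
)

# Running total of the daily forecast per material.
daily = daily.sort_values(["rm_id", "delivery_date"])
daily["forecast_cum"] = daily.groupby("rm_id")["forecast"].cumsum()

# Look up the running total on each window's end date.
result = mapping.merge(
    daily[["rm_id", "delivery_date", "forecast_cum"]],
    left_on=["rm_id", "forecast_end_date"],
    right_on=["rm_id", "delivery_date"],
    how="left",
)
# Materials without any forecast rows get 0, like the row-wise sum would give.
result["predicted_weight"] = result["forecast_cum"].fillna(0.0)
print(result[["ID", "predicted_weight"]])
```

If the merge keys have different dtypes (for example a categorical or float `rm_id` on one side and integers on the other), pandas may raise an error or find no matches, so it is worth checking the dtypes before merging.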


In [None]:
### I want every rm_id in the submission to have 0 predicted weight if it is in materials.csv and its stock is deleted?
import pandas as pd
# 'submission' = the daily forecast DataFrame
# columns: ['rm_id', 'delivery_date', 'forecast_received_qty_0_2', ...]
# Load materials data
materials = pd.read_csv("./data/extended/materials.csv")
# Filter materials with stock_deleted = True
deleted_materials = materials[materials['stock_deleted'] == True]['rm_id'].unique()
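
The cell above stops after collecting `deleted_materials`. A minimal sketch of the remaining step, on made-up stand-in frames, could look like the following. It assumes `stock_deleted` is a boolean column and that the mapping links each submission `ID` to an `rm_id`. Whether a deleted stock really means no further deliveries is still an open question from the notes above.

```python
import pandas as pd

# Made-up stand-ins for materials.csv, prediction_mapping.csv and the filled submission.
materials = pd.DataFrame({"rm_id": [1, 2, 3], "stock_deleted": [False, True, False]})
mapping = pd.DataFrame({"ID": [10, 11, 12], "rm_id": [1, 2, 3]})
filled = pd.DataFrame({"ID": [10, 11, 12], "predicted_weight": [5.0, 8.0, 3.0]})

deleted_materials = materials.loc[materials["stock_deleted"] == True, "rm_id"].unique()

# IDs whose raw material is flagged as deleted get a forecast of zero.
deleted_ids = mapping.loc[mapping["rm_id"].isin(deleted_materials), "ID"]
filled.loc[filled["ID"].isin(deleted_ids), "predicted_weight"] = 0.0
print(filled)
```

Under the 0.2 quantile loss, a zero forecast for a material that does receive deliveries only costs 0.2 per unit missed, which is one reason this kind of zeroing can be a fairly cheap bet.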
