# Overview of project
We are given historical data on raw material deliveries and orders through the end of 2024. GOAL: Develop a model that forecasts the cumulative weight of incoming deliveries of each raw material from Jan 1, 2025, up to any specified end date between Jan 1 and May 31, 2025.

- receivals = historical records of material receivals
- purchase_orders = ordered quantities and expected delivery dates
- materials(opt) = metadata on various raw materials
- transportation(opt) = transport-related data

$\mathrm{QuantileLoss}_{0.2}(F_i, A_i) = \max\bigl(0.2\,(A_i - F_i),\ 0.8\,(F_i - A_i)\bigr)$
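
The loss above can be written as a small NumPy function. This is only a sketch: it assumes `F_i` is the forecast and `A_i` the actual value for item `i`, and the name `quantile_loss` is made up for illustration. It returns the per-item losses and leaves any averaging to the caller.

```python
import numpy as np

def quantile_loss(forecast, actual, q=0.2):
    """Per-item quantile (pinball) loss: max(q*(A - F), (1 - q)*(F - A))."""
    forecast = np.asarray(forecast, dtype=float)
    actual = np.asarray(actual, dtype=float)
    diff = actual - forecast
    # diff > 0 means the forecast was too low, penalised with weight q.
    # diff < 0 means the forecast was too high, penalised with weight 1 - q.
    return np.maximum(q * diff, (q - 1.0) * diff)

# Same forecast, once 10 units too low and once 10 units too high
losses = quantile_loss([100.0, 100.0], [110.0, 90.0])
print(losses)
```

With q = 0.2, forecasting 10 units too low costs 2, while forecasting 10 units too high costs 8, so the metric favours conservative forecasts.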

rm_id = unique identifier for raw material

In [69]:
# We need to explore the data

# First I want to check the difference between purchase orders
# and receivals. How large is the gap between the two?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
data_orders = pd.read_csv('data/kernel/purchase_orders.csv')
data_receivals = pd.read_csv('data/kernel/receivals.csv')

# Link the first 5 orders to the receivals
data = pd.merge(data_orders.head(5), data_receivals, on='purchase_order_id', suffixes=('_order', '_receival'), how='left')

data.head(2)

# NOT EVERY ORDER HAS A RECEIVAL? Oh... that makes sense, since some orders are never received. But I only took the first 5 orders....
# 5 whole orders not received? That seems like a lot.... Maybe purchase_order_id on its own is just a bad key to merge on.
# Let's try purchase_order_item_no... Absolutely not. I forgot it is just a simple counter (1, 2, ...).

Unnamed: 0,purchase_order_id,purchase_order_item_no_order,quantity,delivery_date,product_id_order,product_version,created_date_time,modified_date_time,unit_id,unit,...,status,rm_id,product_id_receival,purchase_order_item_no_receival,receival_item_no,batch_id,date_arrival,receival_status,net_weight,supplier_id
0,1,1,-14.0,2003-05-12 00:00:00.0000000 +02:00,91900143,1,2003-05-12 10:00:48.0000000 +00:00,2004-06-15 06:16:18.0000000 +00:00,,,...,Closed,,,,,,,,,
1,22,1,23880.0,2003-05-27 00:00:00.0000000 +02:00,91900160,1,2003-05-27 12:42:07.0000000 +00:00,2012-06-29 09:41:13.0000000 +00:00,,,...,Closed,,,,,,,,,
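
The question raised above (how many orders have no receival at all) can be checked for the whole table instead of just the first five rows. The sketch below uses small made-up frames in place of the real CSVs. Only the column names and the first two order rows come from the output above; the remaining values are hypothetical. It matches on both `purchase_order_id` and `purchase_order_item_no`.

```python
import pandas as pd

# Toy stand-ins for purchase_orders.csv and receivals.csv (values are illustrative)
orders = pd.DataFrame({
    "purchase_order_id": [1, 22, 30, 30],
    "purchase_order_item_no": [1, 1, 1, 2],
    "quantity": [-14.0, 23880.0, 5000.0, 7000.0],
})
receivals = pd.DataFrame({
    "purchase_order_id": [30, 30, 30],
    "purchase_order_item_no": [1, 1, 2],
    "net_weight": [2400.0, 2500.0, 6900.0],
})

keys = ["purchase_order_id", "purchase_order_item_no"]

# One (id, item_no) tuple per row, then test membership in the receival keys
order_keys = pd.MultiIndex.from_frame(orders[keys])
receival_keys = pd.MultiIndex.from_frame(receivals[keys])
has_receival = order_keys.isin(receival_keys)

n_unmatched = int((~has_receival).sum())
print(f"{n_unmatched} of {len(orders)} order lines have no receival")
```

Run against the real frames, this would show whether unreceived orders are limited to the oldest rows (the first five date from 2003) or are common throughout.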


In [59]:
# Count all rows with a negative ordered quantity
print(data_orders['quantity'].lt(0).sum())

print(data_orders['quantity'].eq(150000).sum())
# 6 rows with negative quantity... probably data errors

6
563


In [None]:
# I want to check the transportation of the 5 orders in the head
data_transport = pd.read_csv('data/extended/transportation.csv')

data = pd.merge(data_orders.head(5), data_transport, on='purchase_order_id', suffixes=('_order', '_transport'))

print(data)

# Observation: Not all orders are transported either....

Empty DataFrame
Columns: [purchase_order_id, purchase_order_item_no_order, quantity, delivery_date, product_id_order, product_version, created_date_time, modified_date_time, unit_id, unit, status_id, status, rm_id, product_id_transport, purchase_order_item_no_transport, receival_item_no, batch_id, transporter_name, vehicle_no, unit_status, vehicle_start_weight, vehicle_end_weight, gross_weight, tare_weight, net_weight, wood, ironbands, plastic, water, ice, other, chips, packaging, cardboard]
Index: []

[0 rows x 34 columns]


In [53]:
# I want to check the material details of the 5 orders in the head
# NVM... orders have no column that links them directly to materials

In [None]:
# OKAY! Let's try dropping all the orders with no receivals and then predict? In a real scenario I probably shouldn't,
# because orders with no receivals might simply mean 0 received. I don't know if that's true, so I have to test it.
# So try 2 things: 1. drop the orders with no receivals, 2. set the received weight to 0 where there are no receivals

# But first I need to know what my model will get as input. Will I get orders and receivals, or only previous orders?
# I don't think I'll get any more orders in the future, so I guess I just have to predict based on previous orders

# Purchase orders have an expected delivery_date though.
# They are using YYYY-MM-DD format I guess
# They want from 2025-01-01 to 2025-05-31
# We got some deliveries expected in 2025-03-XX, but none after, so we prob need to predict that there will be more orders.

# By making a model that predicts the quantity ordered based on previous orders, I can then use that to predict future orders
# I should prob make a model for each of the materials and then sum them up for each order date
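
To make the target concrete, the same quantity can be computed on historical receivals: the total `net_weight` per `rm_id` between a start date and an end date. The sketch below uses made-up rows and a hypothetical helper name (`cumulative_weight`). It assumes `date_arrival` has already been parsed to datetimes, whereas the real column carries timestamps with timezone offsets that would need parsing first.

```python
import pandas as pd

# Toy receivals (values are illustrative, column names as in receivals.csv)
receivals = pd.DataFrame({
    "rm_id": [101, 101, 102, 101, 102],
    "date_arrival": ["2024-01-02", "2024-01-05", "2024-01-05",
                     "2024-02-01", "2024-06-10"],
    "net_weight": [1000.0, 500.0, 200.0, 300.0, 400.0],
})
receivals["date_arrival"] = pd.to_datetime(receivals["date_arrival"])

def cumulative_weight(df, start, end):
    """Total net_weight per rm_id for arrivals in [start, end], both inclusive."""
    in_window = df["date_arrival"].between(pd.Timestamp(start), pd.Timestamp(end))
    return df.loc[in_window].groupby("rm_id")["net_weight"].sum()

# Historical analogue of the 2025 target: Jan 1 up to a chosen end date
jan_totals = cumulative_weight(receivals, "2024-01-01", "2024-01-31")
may_totals = cumulative_weight(receivals, "2024-01-01", "2024-05-31")
print(jan_totals)
print(may_totals)
```

Computing this for the same Jan-to-May window in earlier years gives one possible baseline and a set of training labels for a per-material model.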

In [77]:
# Starting by dropping the orders with no receivals
data = pd.merge(
    data_orders,
    data_receivals,
    on=['purchase_order_id', 'purchase_order_item_no'],
    suffixes=('_order', '_receival')
)

# 122537 rows, but receivals has 122590 rows. So some receivals are from orders not in the orders dataset?
data_extra_receivals = pd.merge(
    data_orders,
    data_receivals,
    on=['purchase_order_id', 'purchase_order_item_no'],
    suffixes=('_order', '_receival'),
    how='right'
)

print(data_extra_receivals.shape)
print(data.shape)
# 122590 rows, so 53 extra receivals that are not in the orders dataset
# Let's check if they are all from the same purchase_order_id

# I want the data_extra_receivals rows that are not in data
data_diff = pd.concat([data_extra_receivals, data]).drop_duplicates(keep=False)
print(data_diff.head(5))



(122590, 20)
(122537, 20)
       purchase_order_id  purchase_order_item_no  quantity delivery_date  \
61798                NaN                     NaN       NaN           NaN   
63356                NaN                     NaN       NaN           NaN   
64105                NaN                     NaN       NaN           NaN   
65448                NaN                     NaN       NaN           NaN   
71981                NaN                     NaN       NaN           NaN   

       product_id_order  product_version created_date_time modified_date_time  \
61798               NaN              NaN               NaN                NaN   
63356               NaN              NaN               NaN                NaN   
64105               NaN              NaN               NaN                NaN   
65448               NaN              NaN               NaN                NaN   
71981               NaN              NaN               NaN                NaN   

       unit_id unit  status_id

In [79]:
# Check receivals with no purchase_order_id or purchase_order_item_no
print(data_receivals['purchase_order_id'].isna().sum())
print(data_receivals['purchase_order_item_no'].isna().sum())

53
53
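
The unmatched receivals can also be isolated directly with the `indicator=True` option of `pd.merge`, which adds a `_merge` column saying where each row came from. This is an alternative to the `concat` + `drop_duplicates(keep=False)` approach, sketched here on made-up frames (all values are hypothetical).

```python
import numpy as np
import pandas as pd

# Toy frames: two receivals have no purchase order keys at all
orders = pd.DataFrame({
    "purchase_order_id": [30.0, 31.0],
    "purchase_order_item_no": [1.0, 1.0],
    "quantity": [5000.0, 7000.0],
})
receivals = pd.DataFrame({
    "purchase_order_id": [30.0, np.nan, 31.0, np.nan],
    "purchase_order_item_no": [1.0, np.nan, 1.0, np.nan],
    "net_weight": [2400.0, 800.0, 6900.0, 650.0],
})

keys = ["purchase_order_id", "purchase_order_item_no"]

# how="right" keeps every receival; _merge marks rows without a matching order
merged = pd.merge(orders, receivals, on=keys, how="right", indicator=True)
unmatched = merged[merged["_merge"] == "right_only"]

print(len(unmatched))
print(unmatched[keys + ["net_weight"]])
```

Unlike the duplicate-dropping trick, this does not depend on rows being unique, and the unmatched rows keep their `net_weight`, so they can still count toward a material's total if they carry an `rm_id`.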


In [None]:
# Okay, let me first try using only the receivals that can be linked to a purchase order
data = pd.merge(
    data_orders,
    data_receivals,
    on=['purchase_order_id', 'purchase_order_item_no'],
    suffixes=('_order', '_receival')
)

# Let's try CatBoost to predict the 


(122537, 20)


In [82]:
data = pd.read_csv('data/kernel/receivals.csv')
print(data['receival_status'].unique())
print(data['receival_status'].value_counts())

['Completed' 'Finished unloading' 'Planned' 'Start unloading']
receival_status
Completed             122448
Finished unloading       106
Start unloading           32
Planned                    4
Name: count, dtype: int64
