# **Data Analysis & Machine Learning - Final Project**
# **MPS 18 Months**

# Step 1: Data Understanding (Dataset Overview)

### Attribute Information (Understand what each column is representing)

•   PLN-MM/YY   : Planning period in YYYYMM form (e.g., 202310 = October 2023). <br>
•   Sales_Doc.  : Unique identifier of the sales document.<br>
•   Item        : Line item number within the sales document.<br>
•   V           : Sales document number concatenated with the item number (e.g., 30079276 + 20 = 3007927620).<br>
•   Loaded_line : Line code, with values such as CR, PP, and GI (these may refer to process types).<br>
•   New_Plan    : Text version of the planning month (e.g., "2024 Jan").<br>
•   Sold_to_party : Customer name in Arabic (shown as codes such as L0056 in the file loaded below).<br>
•   Country_Key : Country (mostly "Egypt").<br>
•   Destination : Final customer destination category ("Local" or "Export").<br>
•   Material    : Product code (e.g., CRCF, POCF).<br>
•   CUST_THICKNESS : Customer-requested thickness (likely in mm).<br>
•   CUST_WIDTH  : Customer-requested width (likely in mm).<br>
•   KS_GRADE01  : Product grade (e.g., DC01, DX51D).<br>
•   ZINC01      : Zinc coating weight (values like 80, 120, 180; may be missing if not applicable).<br>
•   Remain_GRF  : Remaining quantity or planning figure (possibly in tons or tons-equivalent).<br>
•   Destination (second column with this name; pandas loads it as Destination.1) : Sub-destination or business segment (e.g., Corporate, Service Center).<br>
•   Creation    : Incoterm, i.e. the terms of delivery (e.g., EXW, CPT).<br>
•   CBE$        : Cost or price per unit in USD, converted at the Central Bank of Egypt exchange rate.<br>
•   BM$         : Cost or price per unit in USD, converted at the black market exchange rate.<br>
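
The description of `V` can be checked directly. The sketch below rebuilds three rows from the dataset preview shown later in this notebook and tests whether `V` equals `Sales_Doc.` and `Item` joined as text. The small `sample` frame and the `rebuilt` name are illustrative assumptions and are not part of the original pipeline.

```python
import pandas as pd

# Three rows copied from the dataset preview (illustrative subset only).
sample = pd.DataFrame({
    'Sales_Doc.': [30079276, 30081591, 30097656],
    'Item': [20, 10, 350],
    'V': [3007927620, 3008159110, 30097656350],
})

# Join the two identifiers as text, then compare with V.
rebuilt = sample['Sales_Doc.'].astype(str) + sample['Item'].astype(str)
matches = rebuilt == sample['V'].astype(str)
print(matches.all())
```

If the same check holds on the full frame, `V` carries no information beyond the two columns it is built from.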

# Step 2: Install Necessary Libraries

In [4]:
##! pip install jupyterlab

In [5]:
##! pip install notebook

In [6]:
##! pip install voila

In [7]:
##! pip install numpy

In [8]:
##! pip install pandas

In [9]:
##! pip install scikit-learn

In [10]:
##! pip install plotly

In [11]:
##! pip install streamlit

In [12]:
##! pip install seaborn

In [13]:
##! pip install xgboost

In [14]:
##! pip install catboost


In [15]:
##! pip install lightgbm

In [16]:
##! pip install --upgrade lightgbm

In [17]:
##! pip install pipreqs

In [18]:
##! pip install joblib

In [19]:
##! pip install catboost --prefer-binary

In [20]:
##! pip install --upgrade scikit-learn

In [21]:
import plotly.io as pio
pio.templates.default = "plotly"

In [22]:
import numpy as np
import pandas as pd
import plotly.express as px

In [23]:
import sklearn

# Step 3: Data Loading

In [24]:
df = pd.read_csv('MPS_18_Months_V1.csv')
df

Unnamed: 0,PLN-MM/YY,Sales_Doc.,Item,V,Loaded_line,New_Plan,Sold_to_party,Country_Key,Destination,Material,CUST_THICKNESS,CUST_WIDTH,KS_GRADE01,ZINC01,Remain_GRF,Destination.1,Creation,CBE$,BM$
0,202310,30079276,20,3007927620,CR,2024 Jan,L0056,Egypt,Local,CRCF,3.0,1250.0,DC01,,6.000,Corporate,CPT,2110,2938
1,202401,30081591,10,3008159110,PP,2024 Jan,L0076,Egypt,Local,POCF,0.4,1250.0,DX51D,80,20.000,Service Center,EXW,1570,2515
2,202401,30081591,20,3008159120,PP,2024 Jan,L0076,Egypt,Local,POCF,0.4,1250.0,DX51D,80,20.000,Service Center,EXW,1770,2515
3,202401,30081591,30,3008159130,PP,2024 Jan,L0076,Egypt,Local,POCF,,1250.0,DX51D,120,40.000,Service Center,EXW,1570,2508
4,202401,30081591,40,3008159140,PP,2024 Jan,L0076,Egypt,Local,POCF,0.5,1250.0,DX51D,180,40.000,Service Center,EXW,1770,2574
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16387,202506,30097656,350,30097656350,CR,2025 June,L0003,Egypt,Local,CRSF,0.5,1074.0,DC01,,52.035,High Tech industries,,2090,775
16388,202506,30097656,360,30097656360,CR,2025 June,L0003,Egypt,Local,CRSF,0.5,1285.0,DC01,,145.530,High Tech industries,CPT,2490,775
16389,202506,30102527,40,3010252740,CR,2025 June,L0280,Egypt,Local,CRNF,0.7,945.0,DC04EK,,40.000,Large Corporate,EXW,2210,235
16390,202506,30102527,70,3010252770,CR,2025 June,L0280,Egypt,Local,CRNF,0.8,960.0,DC04EK,,40.000,Large Corporate,,2210,235


In [25]:
df.shape[0]

16392

# Step 4: Data Exploration (Overview about the data)

### i-Check Data Types

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16392 entries, 0 to 16391
Data columns (total 19 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   PLN-MM/YY       16392 non-null  int64  
 1   Sales_Doc.      16392 non-null  int64  
 2   Item            16392 non-null  int64  
 3   V               16392 non-null  int64  
 4   Loaded_line     16392 non-null  object 
 5   New_Plan        16392 non-null  object 
 6   Sold_to_party   16392 non-null  object 
 7   Country_Key     16392 non-null  object 
 8   Destination     16065 non-null  object 
 9   Material        16392 non-null  object 
 10  CUST_THICKNESS  15573 non-null  float64
 11  CUST_WIDTH      15573 non-null  float64
 12  KS_GRADE01      15901 non-null  object 
 13  ZINC01          11425 non-null  object 
 14  Remain_GRF      16392 non-null  float64
 15  Destination.1   16233 non-null  object 
 16  Creation        15262 non-null  object 
 17  CBE$            16392 non-null 

In [27]:
cols_to_convert = ['PLN-MM/YY', 'Sales_Doc.', 'Item', 'V']
df[cols_to_convert] = df[cols_to_convert].astype(object)

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16392 entries, 0 to 16391
Data columns (total 19 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   PLN-MM/YY       16392 non-null  object 
 1   Sales_Doc.      16392 non-null  object 
 2   Item            16392 non-null  object 
 3   V               16392 non-null  object 
 4   Loaded_line     16392 non-null  object 
 5   New_Plan        16392 non-null  object 
 6   Sold_to_party   16392 non-null  object 
 7   Country_Key     16392 non-null  object 
 8   Destination     16065 non-null  object 
 9   Material        16392 non-null  object 
 10  CUST_THICKNESS  15573 non-null  float64
 11  CUST_WIDTH      15573 non-null  float64
 12  KS_GRADE01      15901 non-null  object 
 13  ZINC01          11425 non-null  object 
 14  Remain_GRF      16392 non-null  float64
 15  Destination.1   16233 non-null  object 
 16  Creation        15262 non-null  object 
 17  CBE$            16392 non-null 

### ii- Check Summary Statistics for Numerical Columns


In [29]:
df.describe().round(1)

Unnamed: 0,CUST_THICKNESS,CUST_WIDTH,Remain_GRF,CBE$,BM$
count,15573.0,15573.0,16392.0,16392.0,16392.0
mean,1.0,976.6,58.1,2328.7,1891.6
std,0.6,370.1,111.5,469.3,1111.4
min,0.2,20.0,0.0,1130.0,0.0
25%,0.5,780.0,17.9,1970.0,829.0
50%,0.8,1219.0,27.6,2370.0,1990.0
75%,1.2,1250.0,60.0,2590.0,2690.0
max,3.3,1300.0,3761.9,3830.0,6181.0


### iii- Check Summary Statistics for Categorical Columns

In [30]:
df.describe(include= 'object')

Unnamed: 0,PLN-MM/YY,Sales_Doc.,Item,V,Loaded_line,New_Plan,Sold_to_party,Country_Key,Destination,Material,KS_GRADE01,ZINC01,Destination.1,Creation
count,16392,16392,16392,16392,16392,16392,16392,16392,16065,16392,15901,11425,16233,15262
unique,31,2441,57,10022,9,18,550,42,2,45,33,20,14,5
top,202501,35006385,10,3009103810,GI,2025 Mar,L0131,Egypt,Local,GCCF,DX51D,275,Large Corporate,EXW
freq,1364,103,3554,8,8896,1137,481,10452,10257,5188,9518,2420,3045,5287


# Step 5: Data Cleaning (In depth check for each column)

### i- Drop unnecessary index column (key)

In [31]:
# Renumber the rows 0..n-1; drop=True discards the old index instead of keeping it as a column
df.reset_index(inplace=True, drop=True)

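### ii- Unreasonable Values Found During Data Exploration

In [32]:
# Remove rows with a zero price. Per the summary statistics above, CBE$ has a minimum of 1130,
# so this filter removes no rows; the zero minimum appears in BM$.
df = df[df['CBE$'] != 0]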
df

Unnamed: 0,PLN-MM/YY,Sales_Doc.,Item,V,Loaded_line,New_Plan,Sold_to_party,Country_Key,Destination,Material,CUST_THICKNESS,CUST_WIDTH,KS_GRADE01,ZINC01,Remain_GRF,Destination.1,Creation,CBE$,BM$
0,202310,30079276,20,3007927620,CR,2024 Jan,L0056,Egypt,Local,CRCF,3.0,1250.0,DC01,,6.000,Corporate,CPT,2110,2938
1,202401,30081591,10,3008159110,PP,2024 Jan,L0076,Egypt,Local,POCF,0.4,1250.0,DX51D,80,20.000,Service Center,EXW,1570,2515
2,202401,30081591,20,3008159120,PP,2024 Jan,L0076,Egypt,Local,POCF,0.4,1250.0,DX51D,80,20.000,Service Center,EXW,1770,2515
3,202401,30081591,30,3008159130,PP,2024 Jan,L0076,Egypt,Local,POCF,,1250.0,DX51D,120,40.000,Service Center,EXW,1570,2508
4,202401,30081591,40,3008159140,PP,2024 Jan,L0076,Egypt,Local,POCF,0.5,1250.0,DX51D,180,40.000,Service Center,EXW,1770,2574
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16387,202506,30097656,350,30097656350,CR,2025 June,L0003,Egypt,Local,CRSF,0.5,1074.0,DC01,,52.035,High Tech industries,,2090,775
16388,202506,30097656,360,30097656360,CR,2025 June,L0003,Egypt,Local,CRSF,0.5,1285.0,DC01,,145.530,High Tech industries,CPT,2490,775
16389,202506,30102527,40,3010252740,CR,2025 June,L0280,Egypt,Local,CRNF,0.7,945.0,DC04EK,,40.000,Large Corporate,EXW,2210,235
16390,202506,30102527,70,3010252770,CR,2025 June,L0280,Egypt,Local,CRNF,0.8,960.0,DC04EK,,40.000,Large Corporate,,2210,235


### iii- Checking and Dropping Duplicates

In [33]:

df.duplicated().sum()

0

### iv- Checking and Handling Missing Values

In [34]:
df.select_dtypes(include= 'float64').columns

Index(['CUST_THICKNESS', 'CUST_WIDTH', 'Remain_GRF'], dtype='object')

In [35]:
df.select_dtypes(include= 'object').columns

Index(['PLN-MM/YY', 'Sales_Doc.', 'Item', 'V', 'Loaded_line', 'New_Plan',
       'Sold_to_party', 'Country_Key', 'Destination', 'Material', 'KS_GRADE01',
       'ZINC01', 'Destination.1', 'Creation'],
      dtype='object')

In [36]:

df.isna().mean().round(4)*100

PLN-MM/YY          0.00
Sales_Doc.         0.00
Item               0.00
V                  0.00
Loaded_line        0.00
New_Plan           0.00
Sold_to_party      0.00
Country_Key        0.00
Destination        1.99
Material           0.00
CUST_THICKNESS     5.00
CUST_WIDTH         5.00
KS_GRADE01         3.00
ZINC01            30.30
Remain_GRF         0.00
Destination.1      0.97
Creation           6.89
CBE$               0.00
BM$                0.00
dtype: float64

In [37]:
round((df.isna().mean()) * 100, 2)

PLN-MM/YY          0.00
Sales_Doc.         0.00
Item               0.00
V                  0.00
Loaded_line        0.00
New_Plan           0.00
Sold_to_party      0.00
Country_Key        0.00
Destination        1.99
Material           0.00
CUST_THICKNESS     5.00
CUST_WIDTH         5.00
KS_GRADE01         3.00
ZINC01            30.30
Remain_GRF         0.00
Destination.1      0.97
Creation           6.89
CBE$               0.00
BM$                0.00
dtype: float64

In [38]:
round((df.isna().sum() / df.shape[0]) * 100, 2)

PLN-MM/YY          0.00
Sales_Doc.         0.00
Item               0.00
V                  0.00
Loaded_line        0.00
New_Plan           0.00
Sold_to_party      0.00
Country_Key        0.00
Destination        1.99
Material           0.00
CUST_THICKNESS     5.00
CUST_WIDTH         5.00
KS_GRADE01         3.00
ZINC01            30.30
Remain_GRF         0.00
Destination.1      0.97
Creation           6.89
CBE$               0.00
BM$                0.00
dtype: float64
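
The three cells above compute the same quantity in different ways. A minimal sketch on a made-up frame (the column names `a` and `b` are hypothetical) shows why: the mean of a boolean mask is the share of `True` values, which equals the count of `True` values divided by the number of rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],   # 2 of 4 values missing
    'b': ['x', 'y', None, 'z'],        # 1 of 4 values missing
})

pct_mean = toy.isna().mean().round(4) * 100                 # style of In [36]
pct_round = round(toy.isna().mean() * 100, 2)               # style of In [37]
pct_sum = round(toy.isna().sum() / toy.shape[0] * 100, 2)   # style of In [38]

print(pct_sum)
```

Any one of the three forms is enough in practice; keeping a single version makes the notebook shorter without changing the result.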

In [39]:
# Drop rows with NaN in columns where fewer than 5% of the values are missing
df = df.dropna(subset=['Destination', 'KS_GRADE01', 'Destination.1', 'CBE$', 'BM$'])

In [40]:
df.shape[0]

15432

In [41]:
round((df.isna().sum() / df.shape[0]) * 100, 2)

PLN-MM/YY          0.00
Sales_Doc.         0.00
Item               0.00
V                  0.00
Loaded_line        0.00
New_Plan           0.00
Sold_to_party      0.00
Country_Key        0.00
Destination        0.00
Material           0.00
CUST_THICKNESS     5.00
CUST_WIDTH         5.03
KS_GRADE01         0.00
ZINC01            30.17
Remain_GRF         0.00
Destination.1      0.00
Creation           6.89
CBE$               0.00
BM$                0.00
dtype: float64

In [42]:
# Impute columns where fewer than 40% of the values are missing
# Numerical columns: KNN imputation (default n_neighbors=5)
from sklearn.impute import KNNImputer

df = df.copy()  # work on an explicit copy of the filtered frame
num_imputer = KNNImputer()
cols_to_impute = ['CUST_THICKNESS', 'CUST_WIDTH']
df.loc[:, cols_to_impute] = num_imputer.fit_transform(df[cols_to_impute])

In [43]:
df.shape[0]

15432

In [44]:
round((df.isna().mean()) * 100, 2)

PLN-MM/YY          0.00
Sales_Doc.         0.00
Item               0.00
V                  0.00
Loaded_line        0.00
New_Plan           0.00
Sold_to_party      0.00
Country_Key        0.00
Destination        0.00
Material           0.00
CUST_THICKNESS     0.00
CUST_WIDTH         0.00
KS_GRADE01         0.00
ZINC01            30.17
Remain_GRF         0.00
Destination.1      0.00
Creation           6.89
CBE$               0.00
BM$                0.00
dtype: float64

In [45]:
# Impute columns where fewer than 40% of the values are missing
# Categorical columns: fill with the most frequent value

from sklearn.impute import SimpleImputer

df = df.copy()  # work on an explicit copy of the filtered frame
cat_imputer = SimpleImputer(strategy='most_frequent')
cols_to_impute = ['Creation']
df[cols_to_impute] = cat_imputer.fit_transform(df[cols_to_impute])

In [46]:
df.shape[0]

15432

In [47]:
round((df.isna().mean()) * 100, 2)

PLN-MM/YY          0.00
Sales_Doc.         0.00
Item               0.00
V                  0.00
Loaded_line        0.00
New_Plan           0.00
Sold_to_party      0.00
Country_Key        0.00
Destination        0.00
Material           0.00
CUST_THICKNESS     0.00
CUST_WIDTH         0.00
KS_GRADE01         0.00
ZINC01            30.17
Remain_GRF         0.00
Destination.1      0.00
Creation           0.00
CBE$               0.00
BM$                0.00
dtype: float64
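
For a single categorical column, most-frequent imputation amounts to filling the gaps with the column's mode. The sketch below shows that idea in plain pandas on made-up values. It illustrates the strategy only and is not the `SimpleImputer` call used in the notebook.

```python
import pandas as pd

# Made-up Incoterm values with two gaps.
creation = pd.Series(['EXW', 'CPT', None, 'EXW', None, 'FOB'], name='Creation')

# mode() ignores missing values and returns the most frequent value(s); take the first.
most_frequent = creation.mode().iloc[0]
filled = creation.fillna(most_frequent)
print(filled.tolist())
```

Filling with the mode enlarges the dominant category, so it is worth comparing the share of that category before and after imputation.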

### v- Special Case for Column ZINC01

In [48]:
df['ZINC01'] = pd.to_numeric(df['ZINC01'], errors='coerce')

In [49]:
cols_to_convert = ['ZINC01']
df[cols_to_convert] = df[cols_to_convert].astype('float64')

In [50]:
# Separate genuine NaN from structural NaN in ZINC01:
# if Material starts with P or G, a NaN is a genuinely missing zinc value;
# otherwise the NaN only means that no zinc value applies, so it is replaced with 0.
df[['Material', 'ZINC01']]

Unnamed: 0,Material,ZINC01
0,CRCF,
1,POCF,80.0
2,POCF,80.0
3,POCF,120.0
4,POCF,180.0
...,...,...
16387,CRSF,
16388,CRSF,
16389,CRNF,
16390,CRNF,


In [51]:
df.loc[~df['Material'].str.startswith(('P', 'G')) & df['ZINC01'].isna(), 'ZINC01'] = 0

In [52]:
df[['Material', 'ZINC01']].head(10)


Unnamed: 0,Material,ZINC01
0,CRCF,0.0
1,POCF,80.0
2,POCF,80.0
3,POCF,120.0
4,POCF,180.0
8,POSF,120.0
9,GCCF,100.0
10,GHCF,100.0
11,GCCF,100.0
19,GHCF,180.0


In [53]:
zinc_nan_material_P_G = df['Material'].str.startswith(('P', 'G'))
zinc_is_nan = df['ZINC01'].isna()
real_zinc_nan = (zinc_nan_material_P_G) & (zinc_is_nan)
fake_zinc_nan = (~zinc_nan_material_P_G) & (zinc_is_nan)
df.loc[fake_zinc_nan, 'ZINC01'] = 0
df[df['ZINC01'].isna()][['Material', 'ZINC01']]

Unnamed: 0,Material,ZINC01
15017,POCF,


In [54]:
df = df.dropna(subset=['ZINC01' ])

In [55]:
df.shape[0]

15431
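
The ZINC01 rule used above can be summarized on a small made-up frame. In this sketch the flag name `expects_zinc` is a hypothetical label for "Material starts with P or G", and the five rows are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Material': ['CRCF', 'POCF', 'GCCF', 'CRSF', 'POCF'],
    'ZINC01':   [np.nan, 80.0, 100.0, np.nan, np.nan],
})

expects_zinc = toy['Material'].str.startswith(('P', 'G'))
missing = toy['ZINC01'].isna()

# No zinc expected and no value recorded: the gap is structural, so use 0.
toy.loc[~expects_zinc & missing, 'ZINC01'] = 0
# Zinc expected but no value recorded: genuinely unknown, so drop the row.
toy = toy.dropna(subset=['ZINC01'])
print(toy)
```

Only the last row is dropped here, which mirrors the single `POCF` row removed from the real data.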

In [56]:
df[df['ZINC01'] == 0][['Material', 'ZINC01']]

Unnamed: 0,Material,ZINC01
0,CRCF,0.0
23,CRCF,0.0
24,CRCF,0.0
25,CRCF,0.0
32,CRSF,0.0
...,...,...
16386,CRSF,0.0
16387,CRSF,0.0
16388,CRSF,0.0
16389,CRNF,0.0


In [57]:
zeros = df[df['ZINC01'] == 0]
mask_pg = zeros['Material'].str.startswith(('P','G'))
zeros_pg = zeros[mask_pg][['Material', 'ZINC01']]
zeros_pg

Unnamed: 0,Material,ZINC01


In [58]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15431 entries, 0 to 16391
Data columns (total 19 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   PLN-MM/YY       15431 non-null  object 
 1   Sales_Doc.      15431 non-null  object 
 2   Item            15431 non-null  object 
 3   V               15431 non-null  object 
 4   Loaded_line     15431 non-null  object 
 5   New_Plan        15431 non-null  object 
 6   Sold_to_party   15431 non-null  object 
 7   Country_Key     15431 non-null  object 
 8   Destination     15431 non-null  object 
 9   Material        15431 non-null  object 
 10  CUST_THICKNESS  15431 non-null  float64
 11  CUST_WIDTH      15431 non-null  float64
 12  KS_GRADE01      15431 non-null  object 
 13  ZINC01          15431 non-null  float64
 14  Remain_GRF      15431 non-null  float64
 15  Destination.1   15431 non-null  object 
 16  Creation        15431 non-null  object 
 17  CBE$            15431 non-null 

In [59]:
df.shape[0]

15431

In [60]:
round((df.isna().sum() / df.shape[0]) * 100, 2)

PLN-MM/YY         0.0
Sales_Doc.        0.0
Item              0.0
V                 0.0
Loaded_line       0.0
New_Plan          0.0
Sold_to_party     0.0
Country_Key       0.0
Destination       0.0
Material          0.0
CUST_THICKNESS    0.0
CUST_WIDTH        0.0
KS_GRADE01        0.0
ZINC01            0.0
Remain_GRF        0.0
Destination.1     0.0
Creation          0.0
CBE$              0.0
BM$               0.0
dtype: float64

### vi- Handling Outliers

In [61]:
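# Count IQR outliers per column (outliers are either dropped before the split or imputed after it)
# Only float64 columns are selected here, so CBE$ and BM$ are not included in this count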
numeric_cols = df.select_dtypes(include=['float64']).columns
Q1 = df[numeric_cols].quantile(0.25)
Q3 = df[numeric_cols].quantile(0.75)
IQR = Q3 - Q1
outliers_count = ((df[numeric_cols] < (Q1 - 1.5*IQR)) | (df[numeric_cols] > (Q3 + 1.5*IQR))).sum()
outliers_count

CUST_THICKNESS     690
CUST_WIDTH        1203
ZINC01               0
Remain_GRF        1494
dtype: int64

In [62]:
Q1 = df['CBE$'].quantile(0.25)
Q3 = df['CBE$'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[(df['CBE$'] >= lower_bound) & (df['CBE$'] <= upper_bound)]


In [63]:
df.shape[0]

15274

In [64]:
Q1 = df['BM$'].quantile(0.25)
Q3 = df['BM$'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[(df['BM$'] >= lower_bound) & (df['BM$'] <= upper_bound)]

In [65]:
Q1 = df['CUST_WIDTH'].quantile(0.25)
Q3 = df['CUST_WIDTH'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[(df['CUST_WIDTH'] >= lower_bound) & (df['CUST_WIDTH'] <= upper_bound)]

In [66]:
Q1 = df['CUST_THICKNESS'].quantile(0.25)
Q3 = df['CUST_THICKNESS'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[(df['CUST_THICKNESS'] >= lower_bound) & (df['CUST_THICKNESS'] <= upper_bound)]

In [67]:
df.shape[0]

13384
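
The four filtering cells above repeat the same IQR rule. A minimal sketch of that rule as a reusable helper is given below; the function names `iqr_bounds` and `drop_iqr_outliers` and the `price` column are hypothetical and do not appear in the notebook.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_iqr_outliers(frame, column, k=1.5):
    """Keep only rows whose `column` value lies inside the IQR fences (inclusive)."""
    lower, upper = iqr_bounds(frame[column], k)
    return frame[frame[column].between(lower, upper)]

toy = pd.DataFrame({'price': [1, 2, 3, 4, 5, 100]})
lower, upper = iqr_bounds(toy['price'])
cleaned = drop_iqr_outliers(toy, 'price')
print(lower, upper, len(cleaned))
```

Because each call recomputes the quartiles on data that has already been filtered, applying the rule to several columns in sequence can shift the fences of a column treated earlier, so a later recount may still flag values.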

In [68]:
# Recount IQR outliers per float64 column after the filtering above
numeric_cols = df.select_dtypes(include=['float64']).columns
Q1 = df[numeric_cols].quantile(0.25)
Q3 = df[numeric_cols].quantile(0.75)
IQR = Q3 - Q1
outliers_count = ((df[numeric_cols] < (Q1 - 1.5*IQR)) | (df[numeric_cols] > (Q3 + 1.5*IQR))).sum()
outliers_count

CUST_THICKNESS       0
CUST_WIDTH        1227
ZINC01               0
Remain_GRF        1323
dtype: int64

In [69]:
# As shown above, outliers remain in CUST_WIDTH and Remain_GRF among the float64 columns.
# In practice Remain_GRF is not a deciding column, and the remaining outliers in the other
# numeric columns are too important to be dropped at this stage.

### vii- Important Edits For Some Categorical Data

In [70]:
df.rename(columns={'Destination.1': 'Sector'}, inplace=True)

In [71]:
cat_cols=df.select_dtypes(include='object').columns
cat_cols

Index(['PLN-MM/YY', 'Sales_Doc.', 'Item', 'V', 'Loaded_line', 'New_Plan',
       'Sold_to_party', 'Country_Key', 'Destination', 'Material', 'KS_GRADE01',
       'Sector', 'Creation'],
      dtype='object')

In [72]:
for col in cat_cols:
    print(col)
    print(df[col].nunique())
    print(df[col].unique())
    print('-' * 100)


PLN-MM/YY
30
[202401 202309 202308 202311 202302 202312 202310 202305 202303 202402
 202306 202301 202307 202403 202404 202405 202406 202407 202408 202409
 202410 202411 202412 202501 202502 202503 202504 202505 202506 202507]
----------------------------------------------------------------------------------------------------
Sales_Doc.
2195
[30081591 30078038 35005563 ... 30102529 35006679 30102527]
----------------------------------------------------------------------------------------------------
Item
57
[10 20 30 40 80 90 70 140 120 60 50 100 110 190 180 130 170 150 160 200
 210 220 250 260 280 290 300 310 320 230 240 270 360 340 350 370 380 330
 390 420 560 400 410 460 500 430 490 510 450 470 440 480 520 530 540 550
 570]
----------------------------------------------------------------------------------------------------
V
8539
[3008159110 3008159120 3008159130 ... 3010032930 3010252740 3010252770]
-----------------------------------------------------------------------------------

In [73]:
def clean_loaded_line(x):
    if x == 'pp':
        return 'PP'
    if x not in ['PP', 'CR', 'GI']:
        return 'Other'
    else:
        return x
df.Loaded_line = df.Loaded_line.apply(clean_loaded_line)

In [74]:
df.Loaded_line.unique()

array(['PP', 'GI', 'CR', 'Other'], dtype=object)

In [75]:
df.Loaded_line.value_counts()

GI       7050
CR       3777
PP       2276
Other     281
Name: Loaded_line, dtype: int64

In [76]:
def clean_Creation(x):
    if x in ['CIF', 'CFR', 'CPT']:
        return 'Other'
    else:
        return x

df['Creation'] = df['Creation'].apply(clean_Creation)

In [77]:
df.Creation.unique()

array(['EXW', 'FOB', 'Other'], dtype=object)

In [78]:
df['Sector'].unique()

array(['Service Center', 'Large Corporate', 'Europe', 'Corporate',
       'Projects', 'High Tech industries', 'North America', 'Middle East',
       'Africa', 'Stocking and Distrib', 'KAMA Service Center',
       'South America', 'Australia', 'MTS&قليوب&Other'], dtype=object)

In [79]:
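def clean_Sector(x):
    # Collapse the raw sectors into four groups; any value not listed below
    # (e.g., 'Europe', 'Middle East') falls through to 'Export Near'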
    if x in ['Service Center', 'Stocking and Distrib', 'MTS&قليوب&Other']:
        return 'Local Traders'
    if x in ['Corporate', 'Large Corporate', 'High Tech industries', 'KAMA Service Center', 'Projects']:
        return 'Local Companies'
    if x in ['Africa', 'North America', 'South America', 'Australia']:
        return 'Export Far'
    else:
        return 'Export Near'

df['Sector'] = df['Sector'].apply(clean_Sector)

In [80]:
df['Sector'].unique()

array(['Local Traders', 'Local Companies', 'Export Near', 'Export Far'],
      dtype=object)
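
The same kind of grouping can also be written as a lookup table, which keeps the rules in one place. The sketch below uses only a subset of the sector names for brevity; the `sector_groups` dictionary and the `raw` series are illustrative assumptions.

```python
import pandas as pd

# Partial lookup table (illustrative subset of the sectors listed above).
sector_groups = {
    'Service Center': 'Local Traders',
    'Stocking and Distrib': 'Local Traders',
    'Corporate': 'Local Companies',
    'Large Corporate': 'Local Companies',
    'Africa': 'Export Far',
    'North America': 'Export Far',
}

raw = pd.Series(['Service Center', 'Corporate', 'Africa', 'Europe'])

# Values missing from the table become NaN after map(); fillna() supplies the default group.
grouped = raw.map(sector_groups).fillna('Export Near')
print(grouped.tolist())
```

One caveat of this form is that a genuinely missing sector would also receive the default label, so it suits data whose missing values have already been handled.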

In [81]:
df['KS_GRADE01'].unique()

array(['DX51D', 'DX52D', 'DC01', 'DX56D', 'DC04', 'S220GD', 'S350GD',
       'S280GD', 'DC04EK', 'DX53D', 'C350GD', 'DC06', 'DC05', 'DC03',
       'BSQH', 'DX54D', 'DD11', 'S320GD', 'DC01X', 'C280GD', 'S250GD',
       'DD11F', 'S550GD', 'S235M', 'DD13', 'S355', 'M1050-50D', 'HC260LA',
       'S355GD', 'DD14'], dtype=object)

In [82]:
# Grades treated as special; every other grade is labeled normal
SPECIAL_GRADES = {
    'DC04', 'S235M', 'DD13', 'DX53D', 'DC06', 'DC04EK', 'S350GD', 'DC05', 'S320GD',
    'DX56D', 'DX54D', 'C350GD', 'HC260LA', 'S550GD', 'M1050-50D', 'S355GD', 'DD14',
}

def clean_KS_GRADE01(x):
    return 'Special Grade' if x in SPECIAL_GRADES else 'Normal Grade'

df['KS_GRADE01'] = df['KS_GRADE01'].apply(clean_KS_GRADE01)

In [83]:
df['KS_GRADE01'].unique()

array(['Normal Grade', 'Special Grade'], dtype=object)

In [84]:
df

Unnamed: 0,PLN-MM/YY,Sales_Doc.,Item,V,Loaded_line,New_Plan,Sold_to_party,Country_Key,Destination,Material,CUST_THICKNESS,CUST_WIDTH,KS_GRADE01,ZINC01,Remain_GRF,Sector,Creation,CBE$,BM$
1,202401,30081591,10,3008159110,PP,2024 Jan,L0076,Egypt,Local,POCF,0.40,1250.0,Normal Grade,80.0,20.000,Local Traders,EXW,1570,2515
2,202401,30081591,20,3008159120,PP,2024 Jan,L0076,Egypt,Local,POCF,0.40,1250.0,Normal Grade,80.0,20.000,Local Traders,EXW,1770,2515
3,202401,30081591,30,3008159130,PP,2024 Jan,L0076,Egypt,Local,POCF,1.14,1250.0,Normal Grade,120.0,40.000,Local Traders,EXW,1570,2508
4,202401,30081591,40,3008159140,PP,2024 Jan,L0076,Egypt,Local,POCF,0.50,1250.0,Normal Grade,180.0,40.000,Local Traders,EXW,1770,2574
8,202309,30078038,30,3007803830,PP,2024 Jan,L0078,Egypt,Local,POSF,0.40,699.8,Normal Grade,120.0,1.700,Local Companies,EXW,2410,4320
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16387,202506,30097656,350,30097656350,CR,2025 June,L0003,Egypt,Local,CRSF,0.50,1074.0,Normal Grade,0.0,52.035,Local Companies,EXW,2090,775
16388,202506,30097656,360,30097656360,CR,2025 June,L0003,Egypt,Local,CRSF,0.50,1285.0,Normal Grade,0.0,145.530,Local Companies,Other,2490,775
16389,202506,30102527,40,3010252740,CR,2025 June,L0280,Egypt,Local,CRNF,0.70,945.0,Special Grade,0.0,40.000,Local Companies,EXW,2210,235
16390,202506,30102527,70,3010252770,CR,2025 June,L0280,Egypt,Local,CRNF,0.80,960.0,Special Grade,0.0,40.000,Local Companies,EXW,2210,235


In [85]:
df.duplicated().sum()

0

# Step 6: Feature Engineering

In [86]:
# Add any feature that can help the analysis or that is closely related to the target column
df.head()

Unnamed: 0,PLN-MM/YY,Sales_Doc.,Item,V,Loaded_line,New_Plan,Sold_to_party,Country_Key,Destination,Material,CUST_THICKNESS,CUST_WIDTH,KS_GRADE01,ZINC01,Remain_GRF,Sector,Creation,CBE$,BM$
1,202401,30081591,10,3008159110,PP,2024 Jan,L0076,Egypt,Local,POCF,0.4,1250.0,Normal Grade,80.0,20.0,Local Traders,EXW,1570,2515
2,202401,30081591,20,3008159120,PP,2024 Jan,L0076,Egypt,Local,POCF,0.4,1250.0,Normal Grade,80.0,20.0,Local Traders,EXW,1770,2515
3,202401,30081591,30,3008159130,PP,2024 Jan,L0076,Egypt,Local,POCF,1.14,1250.0,Normal Grade,120.0,40.0,Local Traders,EXW,1570,2508
4,202401,30081591,40,3008159140,PP,2024 Jan,L0076,Egypt,Local,POCF,0.5,1250.0,Normal Grade,180.0,40.0,Local Traders,EXW,1770,2574
8,202309,30078038,30,3007803830,PP,2024 Jan,L0078,Egypt,Local,POSF,0.4,699.8,Normal Grade,120.0,1.7,Local Companies,EXW,2410,4320


In [87]:
df['Demanded_Month'] = df['PLN-MM/YY'].astype(str).str[4:6]
month_map = {
    '01': 'Jan', '02': 'Feb', '03': 'Mar', '04': 'Apr',
    '05': 'May', '06': 'Jun', '07': 'Jul', '08': 'Aug',
    '09': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}
df['Demanded_Month'] = df['Demanded_Month'].map(month_map)
df['Demanded_Month']

1        Jan
2        Jan
3        Jan
4        Jan
8        Sep
        ... 
16387    Jun
16388    Jun
16389    Jun
16390    Jun
16391    Dec
Name: Demanded_Month, Length: 13384, dtype: object

In [88]:
df['Demanded_Month'].unique()

array(['Jan', 'Sep', 'Aug', 'Nov', 'Feb', 'Dec', 'Oct', 'May', 'Mar',
       'Jun', 'Jul', 'Apr'], dtype=object)

In [89]:
# New_Plan looks like '2024 Jan' or '2025 June'; characters 5 to 7 give the three-letter month
df['Plan_Month'] = df['New_Plan'].astype(str).str[5:8]
df['Plan_Month']

1        Jan
2        Jan
3        Jan
4        Jan
8        Jan
        ... 
16387    Jun
16388    Jun
16389    Jun
16390    Jun
16391    Jun
Name: Plan_Month, Length: 13384, dtype: object

In [90]:
df['Plan_Month'].unique()

array(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep',
       'Oct', 'Nov', 'Dec'], dtype=object)
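
The month can also be taken from a YYYYMM period with integer arithmetic instead of string slicing. This is a sketch under the assumption that the periods are integers (an object column would need `astype(int)` first); the `periods` series and the `abbr` list are made up for illustration.

```python
import pandas as pd

periods = pd.Series([202310, 202401, 202506])

year = periods // 100    # 202310 -> 2023
month = periods % 100    # 202310 -> 10

# Explicit abbreviations avoid any dependence on the system locale.
abbr = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
        'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_name = month.map(lambda m: abbr[m - 1])
print(year.tolist(), month_name.tolist())
```

Keeping the numeric `year` and `month` alongside the label makes it possible to sort months chronologically rather than alphabetically.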

In [91]:
df.head()

Unnamed: 0,PLN-MM/YY,Sales_Doc.,Item,V,Loaded_line,New_Plan,Sold_to_party,Country_Key,Destination,Material,...,CUST_WIDTH,KS_GRADE01,ZINC01,Remain_GRF,Sector,Creation,CBE$,BM$,Demanded_Month,Plan_Month
1,202401,30081591,10,3008159110,PP,2024 Jan,L0076,Egypt,Local,POCF,...,1250.0,Normal Grade,80.0,20.0,Local Traders,EXW,1570,2515,Jan,Jan
2,202401,30081591,20,3008159120,PP,2024 Jan,L0076,Egypt,Local,POCF,...,1250.0,Normal Grade,80.0,20.0,Local Traders,EXW,1770,2515,Jan,Jan
3,202401,30081591,30,3008159130,PP,2024 Jan,L0076,Egypt,Local,POCF,...,1250.0,Normal Grade,120.0,40.0,Local Traders,EXW,1570,2508,Jan,Jan
4,202401,30081591,40,3008159140,PP,2024 Jan,L0076,Egypt,Local,POCF,...,1250.0,Normal Grade,180.0,40.0,Local Traders,EXW,1770,2574,Jan,Jan
8,202309,30078038,30,3007803830,PP,2024 Jan,L0078,Egypt,Local,POSF,...,699.8,Normal Grade,120.0,1.7,Local Companies,EXW,2410,4320,Sep,Jan


In [92]:
df['Remain_GRF']

1         20.000
2         20.000
3         40.000
4         40.000
8          1.700
          ...   
16387     52.035
16388    145.530
16389     40.000
16390     40.000
16391     50.000
Name: Remain_GRF, Length: 13384, dtype: float64

In [93]:
def order_item_type(x):
    if x > 199:
        return 'Large Item'
    elif x < 50:
        return 'Small Item'
    else:
        return 'Medium Item'

df['order_item_type'] = df['Remain_GRF'].apply(order_item_type)

In [94]:
df['order_item_type'].value_counts()

Small Item     8713
Medium Item    3889
Large Item      782
Name: order_item_type, dtype: int64
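
The size rule above (greater than 199 is large, less than 50 is small, everything else is medium) can be expressed without a row-by-row function. The sketch below uses `numpy.select` on made-up quantities chosen to sit on both sides of each threshold; the `qty` series is hypothetical.

```python
import numpy as np
import pandas as pd

qty = pd.Series([1.7, 49.9, 50.0, 199.0, 199.1, 3761.9])

# Conditions are checked in order; rows matching none of them get the default label.
conditions = [qty > 199, qty < 50]
choices = ['Large Item', 'Small Item']
labels = pd.Series(np.select(conditions, choices, default='Medium Item'), index=qty.index)
print(labels.tolist())
```

Both boundary values, 50 and 199, land in the medium group, exactly as in the `order_item_type` function.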

In [95]:
df.head()

Unnamed: 0,PLN-MM/YY,Sales_Doc.,Item,V,Loaded_line,New_Plan,Sold_to_party,Country_Key,Destination,Material,...,KS_GRADE01,ZINC01,Remain_GRF,Sector,Creation,CBE$,BM$,Demanded_Month,Plan_Month,order_item_type
1,202401,30081591,10,3008159110,PP,2024 Jan,L0076,Egypt,Local,POCF,...,Normal Grade,80.0,20.0,Local Traders,EXW,1570,2515,Jan,Jan,Small Item
2,202401,30081591,20,3008159120,PP,2024 Jan,L0076,Egypt,Local,POCF,...,Normal Grade,80.0,20.0,Local Traders,EXW,1770,2515,Jan,Jan,Small Item
3,202401,30081591,30,3008159130,PP,2024 Jan,L0076,Egypt,Local,POCF,...,Normal Grade,120.0,40.0,Local Traders,EXW,1570,2508,Jan,Jan,Small Item
4,202401,30081591,40,3008159140,PP,2024 Jan,L0076,Egypt,Local,POCF,...,Normal Grade,180.0,40.0,Local Traders,EXW,1770,2574,Jan,Jan,Small Item
8,202309,30078038,30,3007803830,PP,2024 Jan,L0078,Egypt,Local,POSF,...,Normal Grade,120.0,1.7,Local Companies,EXW,2410,4320,Sep,Jan,Small Item


# Step 7: Data Analysis (Questions)


## i- Univariate Analysis

### What is the percentage of each loaded line (product) in the data?

In [96]:
px.pie(data_frame=df,names='Loaded_line',width=800, height=500)

### What is the percentage of each Incoterm in the data?

In [97]:
px.pie(data_frame=df,names='Creation',width=800, height=500)

### What is the percentage of each Destination (Export/Local) in the data?

In [98]:
px.pie(data_frame=df,names='Destination', width=800, height=500)

### What is the percentage of each Sector (detailed destination) in the data?

In [99]:
px.pie(data_frame=df,names='Sector', width=800, height=500)

### What is the percentage of each Order Item Size in the data?

In [100]:
px.pie(data_frame=df,names='order_item_type', width=800, height=500)

### How is Zinc Weight distributed in the data? (Box Plot)

In [101]:
px.box(data_frame=df,x='ZINC01', width=800, height=400)

### How is BM$ distributed in the data? (Box Plot)

In [102]:
px.box(data_frame=df,x='BM$', width=800, height=400)

## ii- Bivariate Analysis

### What is the relation between CBE$ and BM$ in the data?

In [103]:
px.scatter(df,x='CBE$',y='BM$',width=600, height=400)

### Is there a relation between CUST_THICKNESS and CUST_WIDTH in the data?

In [104]:
px.scatter(df,x='CUST_THICKNESS',y='CUST_WIDTH',width=600, height=400)

### What is the correlation between all numerical columns? (Heat Map)

In [105]:
corr_df= df.select_dtypes(include=['number']).corr().round(2)
corr_df

Unnamed: 0,CUST_THICKNESS,CUST_WIDTH,ZINC01,Remain_GRF,CBE$,BM$
CUST_THICKNESS,1.0,0.18,0.14,0.01,-0.18,-0.03
CUST_WIDTH,0.18,1.0,0.28,0.09,-0.26,-0.0
ZINC01,0.14,0.28,1.0,-0.05,-0.08,0.0
Remain_GRF,0.01,0.09,-0.05,1.0,0.05,0.07
CBE$,-0.18,-0.26,-0.08,0.05,1.0,0.48
BM$,-0.03,-0.0,0.0,0.07,0.48,1.0


In [106]:
px.imshow(corr_df,text_auto=True,width=600, height=400)
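
A heat map is easier to read next to a ranked list of the strongest pairs. The sketch below rebuilds the rounded correlation table printed above and sorts the unique column pairs by absolute correlation; the names `corr`, `pairs`, and `ranked` are illustrative.

```python
import numpy as np
import pandas as pd

cols = ['CUST_THICKNESS', 'CUST_WIDTH', 'ZINC01', 'Remain_GRF', 'CBE$', 'BM$']
values = [
    [1.0, 0.18, 0.14, 0.01, -0.18, -0.03],
    [0.18, 1.0, 0.28, 0.09, -0.26, -0.0],
    [0.14, 0.28, 1.0, -0.05, -0.08, 0.0],
    [0.01, 0.09, -0.05, 1.0, 0.05, 0.07],
    [-0.18, -0.26, -0.08, 0.05, 1.0, 0.48],
    [-0.03, -0.0, 0.0, 0.07, 0.48, 1.0],
]
corr = pd.DataFrame(values, index=cols, columns=cols)

# Keep each pair once: mask everything except the strict upper triangle.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank the pairs by the magnitude of the correlation, strongest first.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.head(3))
```

On these figures the strongest pair is `CBE$` with `BM$` at 0.48, followed by `CUST_WIDTH` with `ZINC01` at 0.28 and `CUST_WIDTH` with `CBE$` at -0.26.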

### How does CBE$ compare across loaded lines? (Box Plot)

In [107]:
px.box(data_frame=df, x='Loaded_line', y='CBE$')

### How does CBE$ compare across Incoterms? (Box Plot)

In [108]:
px.box(data_frame=df, x='Creation', y='CBE$')

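# num_cols is not defined earlier in the notebook; derive it from the numeric columns
num_cols = df.select_dtypes(include='number').columns
for col in num_cols:
    px.histogram(data_frame=df, x=col, title=col).show()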


In [109]:
df.head()

Unnamed: 0,PLN-MM/YY,Sales_Doc.,Item,V,Loaded_line,New_Plan,Sold_to_party,Country_Key,Destination,Material,...,KS_GRADE01,ZINC01,Remain_GRF,Sector,Creation,CBE$,BM$,Demanded_Month,Plan_Month,order_item_type
1,202401,30081591,10,3008159110,PP,2024 Jan,L0076,Egypt,Local,POCF,...,Normal Grade,80.0,20.0,Local Traders,EXW,1570,2515,Jan,Jan,Small Item
2,202401,30081591,20,3008159120,PP,2024 Jan,L0076,Egypt,Local,POCF,...,Normal Grade,80.0,20.0,Local Traders,EXW,1770,2515,Jan,Jan,Small Item
3,202401,30081591,30,3008159130,PP,2024 Jan,L0076,Egypt,Local,POCF,...,Normal Grade,120.0,40.0,Local Traders,EXW,1570,2508,Jan,Jan,Small Item
4,202401,30081591,40,3008159140,PP,2024 Jan,L0076,Egypt,Local,POCF,...,Normal Grade,180.0,40.0,Local Traders,EXW,1770,2574,Jan,Jan,Small Item
8,202309,30078038,30,3007803830,PP,2024 Jan,L0078,Egypt,Local,POSF,...,Normal Grade,120.0,1.7,Local Companies,EXW,2410,4320,Sep,Jan,Small Item


### How does the average Remain_GRF compare across Demanded Month and Plan Month? (Grouped Bar Chart)

In [110]:
Remain_GRF_per_Demanded_Month_Per_Plan_Month=(df.groupby(['Demanded_Month','Plan_Month'])['Remain_GRF']).mean().round(2)
Remain_GRF_per_Demanded_Month_Per_Plan_Month

Demanded_Month  Plan_Month
Apr             Apr            49.24
                Aug            25.86
                Jul           102.29
                Jun            67.35
                Mar            49.37
                               ...  
Sep             Jan            58.76
                Mar            17.84
                Nov            96.63
                Oct            86.10
                Sep            88.89
Name: Remain_GRF, Length: 113, dtype: float64

In [111]:
df_plot = Remain_GRF_per_Demanded_Month_Per_Plan_Month.reset_index(name='Remain_GRF')


In [112]:
import plotly.express as px

fig = px.bar(
    data_frame=df_plot,
    x='Demanded_Month',
    y='Remain_GRF',
    color='Plan_Month',
    text_auto=True,
    title='Remain_GRF per Demanded Month per Plan Month',
    barmode='group'
)
fig.show()

### Grouped Bar Chart: Mean Remain_GRF by Destination and Loaded Line

In [113]:
Remain_GRF_per_Destination_Per_Loaded_line=(df.groupby(['Destination','Loaded_line'])['Remain_GRF']).mean().round(2)
Remain_GRF_per_Destination_Per_Loaded_line

Destination  Loaded_line
Export       CR              87.06
             GI              96.92
             Other          200.20
             PP              55.55
Local        CR              47.51
             GI              37.32
             Other           72.63
             PP              37.92
Name: Remain_GRF, dtype: float64

In [114]:
df_plot2 = Remain_GRF_per_Destination_Per_Loaded_line.reset_index(name='Remain_GRF')


In [115]:
import plotly.express as px

fig = px.bar(
    data_frame=df_plot2,
    x='Destination',
    y='Remain_GRF',
    color='Loaded_line',
    text_auto=True,
    title='Remain_GRF_per_Destination_Per_Loaded_line',
    barmode='group'
)
fig.show()

# Step 8: Early Feature Selection

## i- Very Low or Very High Variance Filter 

In [116]:
numeric_df = df.select_dtypes(include=['number'])
column_variances = numeric_df.var()
column_variances

CUST_THICKNESS    2.288301e-01
CUST_WIDTH        6.499810e+04
ZINC01            9.289552e+03
Remain_GRF        1.339347e+04
CBE$              1.941431e+05
BM$               1.191312e+06
dtype: float64
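
The raw variances above sit on very different scales (about 0.23 for CUST_THICKNESS versus about 1.19e+06 for BM$), so they cannot be compared against a single cut-off directly. A minimal sketch of one possible approach, assuming min-max scaling before thresholding; the toy frame, column names, and the 0.01 threshold are hypothetical and are not taken from the project data:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Toy numeric frame (hypothetical values, not the project data).
toy = pd.DataFrame({
    "thickness": [0.4, 0.5, 0.7, 1.2, 2.0],
    "width": [1250.0, 1000.0, 900.0, 1250.0, 700.0],
    "constant_flag": [1.0, 1.0, 1.0, 1.0, 1.0],  # zero variance
})

# Rescale every column to [0, 1] so the variances share a common scale.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# Keep columns whose scaled variance exceeds an assumed threshold of 0.01.
selector = VarianceThreshold(threshold=0.01)
selector.fit(scaled)
kept_columns = list(toy.columns[selector.get_support()])
print(kept_columns)
```

Only the constant column is filtered out in this toy case; the two varying columns are kept.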

## ii- High Correlation Filter

In [117]:
numeric_df = df.select_dtypes(include=['number'])
correlation_with_target = numeric_df.corr()['CBE$'].sort_values(ascending=False)
correlation_with_target

CBE$              1.000000
BM$               0.475877
Remain_GRF        0.053503
ZINC01           -0.077492
CUST_THICKNESS   -0.180468
CUST_WIDTH       -0.260533
Name: CBE$, dtype: float64
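
The cell above ranks features by their correlation with the target. A high correlation filter is often also applied between the input features themselves, to spot near-duplicate columns. In the correlation matrix shown earlier, the largest value between two non-price features is 0.28 (CUST_WIDTH vs ZINC01), so nothing would be flagged here. The sketch below illustrates the idea on synthetic data; the column names and the 0.9 cut-off are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "a_copy": base * 2.0 + rng.normal(scale=0.01, size=200),  # nearly duplicates "a"
    "b": rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```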

## iii- High Cardinality Filter

In [118]:
df.describe(include= 'object')

Unnamed: 0,PLN-MM/YY,Sales_Doc.,Item,V,Loaded_line,New_Plan,Sold_to_party,Country_Key,Destination,Material,KS_GRADE01,Sector,Creation,Demanded_Month,Plan_Month,order_item_type
count,13384,13384,13384,13384,13384,13384,13384,13384,13384,13384,13384,13384,13384,13384,13384,13384
unique,30,2195,57,8539,4,18,516,39,2,44,2,4,3,12,12,3
top,202501,35006385,10,3008123950,GI,2025 Mar,L0131,Egypt,Local,GCCF,Normal Grade,Local Companies,Other,Jan,Apr,Small Item
freq,1183,98,2899,8,7050,971,411,8700,8709,4918,11969,5669,7118,1996,1745,8713
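
In the summary above, columns such as V (8539 unique values in 13384 rows), Sales_Doc. (2195) and Sold_to_party (516) behave more like identifiers than like categories. One way to flag such columns is the share of distinct values per column. This is only a sketch on toy data, and the 0.5 cut-off is an assumption:

```python
import pandas as pd

toy = pd.DataFrame({
    "order_id": [f"SO{i}" for i in range(100)],  # unique per row
    "line": ["CR", "GI", "PP", "Other"] * 25,     # 4 levels
})

# Share of distinct values per column; values near 1 behave like identifiers.
unique_ratio = toy.nunique() / len(toy)

max_ratio = 0.5  # assumed cut-off
high_cardinality = unique_ratio[unique_ratio > max_ratio].index.tolist()
print(high_cardinality)
```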


In [119]:
df = df.drop(columns=['PLN-MM/YY' ,'BM$', 'Country_Key','Sales_Doc.','Item','V','New_Plan','Sold_to_party','Plan_Month','Remain_GRF', 'order_item_type', 'Demanded_Month' ])

In [120]:
df

Unnamed: 0,Loaded_line,Destination,Material,CUST_THICKNESS,CUST_WIDTH,KS_GRADE01,ZINC01,Sector,Creation,CBE$
1,PP,Local,POCF,0.40,1250.0,Normal Grade,80.0,Local Traders,EXW,1570
2,PP,Local,POCF,0.40,1250.0,Normal Grade,80.0,Local Traders,EXW,1770
3,PP,Local,POCF,1.14,1250.0,Normal Grade,120.0,Local Traders,EXW,1570
4,PP,Local,POCF,0.50,1250.0,Normal Grade,180.0,Local Traders,EXW,1770
8,PP,Local,POSF,0.40,699.8,Normal Grade,120.0,Local Companies,EXW,2410
...,...,...,...,...,...,...,...,...,...,...
16387,CR,Local,CRSF,0.50,1074.0,Normal Grade,0.0,Local Companies,EXW,2090
16388,CR,Local,CRSF,0.50,1285.0,Normal Grade,0.0,Local Companies,Other,2490
16389,CR,Local,CRNF,0.70,945.0,Special Grade,0.0,Local Companies,EXW,2210
16390,CR,Local,CRNF,0.80,960.0,Special Grade,0.0,Local Companies,EXW,2210


In [121]:
df.duplicated().sum()

9283

In [122]:
df = df.drop_duplicates()

In [123]:
df.shape[0]

4101
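
`drop_duplicates()` only removes rows that match on every column, including the target. Rows 1 and 2 in the preview above share identical inputs but have different CBE$ values (1570 vs 1770), so both are kept. Such groups limit how accurate any model can be on these features. A small sketch of how they might be counted, using a toy frame with a subset of the columns:

```python
import pandas as pd

toy = pd.DataFrame({
    "Material": ["POCF", "POCF", "CRSF"],
    "CUST_THICKNESS": [0.4, 0.4, 0.5],
    "CBE$": [1570, 1770, 2090],
})

feature_cols = [c for c in toy.columns if c != "CBE$"]

# Number of distinct target values for each unique feature combination.
targets_per_combo = toy.groupby(feature_cols)["CBE$"].nunique()
conflicting = targets_per_combo[targets_per_combo > 1]
print(len(conflicting))
```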

In [124]:
# Save Cleaned df
df.to_csv('cleaned_df.csv')

# Step 9: Data Preprocessing for Machine Learning (Preprocessing Pipeline)

In [125]:
## All steps below are combined into one preprocessing pipeline

## i- Split Data into Input Features and Target Column (Inputs & Output)

In [126]:
x = df.drop('CBE$', axis= 1)
y = df['CBE$']

## ii- Numerical Pipeline

In [127]:
num_cols = x.select_dtypes(include='number').columns
num_cols

Index(['CUST_THICKNESS', 'CUST_WIDTH', 'ZINC01'], dtype='object')

In [128]:
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import RobustScaler

knn_imputer = KNNImputer()
rc = RobustScaler()

num_pipeline = Pipeline([('Knn', knn_imputer), ('Scaling', rc)])
num_pipeline

Pipeline(steps=[('Knn', KNNImputer()), ('Scaling', RobustScaler())])
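
To make the two steps concrete, here is a small sketch of the same pipeline on a toy array with one missing value. The numbers and `n_neighbors=2` are assumptions chosen for readability (the notebook uses the default settings):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Toy data: column 0 is complete, column 1 has one missing value.
X = np.array([
    [0.4, 1250.0],
    [0.5, np.nan],
    [0.7, 1000.0],
    [1.2, 1250.0],
    [2.0, 700.0],
])

pipe = Pipeline([
    ("Knn", KNNImputer(n_neighbors=2)),  # fill NaN from the 2 nearest rows
    ("Scaling", RobustScaler()),         # subtract median, divide by IQR
])
X_out = pipe.fit_transform(X)

print(np.isnan(X_out).any())
print(np.median(X_out, axis=0))
```

After the transform no missing values remain, and each column is centred on its median, which makes the scaling less sensitive to outliers than mean and standard deviation.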

## iii- Categorical Pipeline

### a-OneHotEncoder Pipeline

In [129]:
df.describe(include= 'object')

Unnamed: 0,Loaded_line,Destination,Material,KS_GRADE01,Sector,Creation
count,4101,4101,4101,4101,4101,4101
unique,4,2,44,2,4,3
top,GI,Local,GCCF,Normal Grade,Local Companies,Other
freq,2106,2333,1325,3500,1905,2256


In [130]:
ohe_cols = ['Loaded_line' ,'KS_GRADE01' , 'Destination' , 'Creation', 'Sector']
ohe_cols

['Loaded_line', 'KS_GRADE01', 'Destination', 'Creation', 'Sector']

In [131]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(drop= 'first', sparse= False)

ohe_pipeline = Pipeline(steps= [ ('OHE', ohe) ])
ohe_pipeline

Pipeline(steps=[('OHE', OneHotEncoder(drop='first', sparse=False))])
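
Note on versions: in scikit-learn the `sparse` argument of `OneHotEncoder` was renamed to `sparse_output` in release 1.2 and the old name was removed in 1.4, so the cell above may fail on a newer installation. The sketch below picks whichever keyword the installed version accepts; the toy column is hypothetical:

```python
import inspect

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Choose the keyword supported by the installed scikit-learn version.
params = inspect.signature(OneHotEncoder.__init__).parameters
dense_kwarg = "sparse_output" if "sparse_output" in params else "sparse"

ohe_demo = OneHotEncoder(drop="first", **{dense_kwarg: False})

toy = pd.DataFrame({"Loaded_line": ["CR", "GI", "PP", "CR"]})
encoded = ohe_demo.fit_transform(toy)
print(encoded.shape)
```

With three categories and `drop='first'`, two indicator columns remain.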

### b-Binary Encoder Pipeline

In [132]:
df.describe(include= 'object')

Unnamed: 0,Loaded_line,Destination,Material,KS_GRADE01,Sector,Creation
count,4101,4101,4101,4101,4101,4101
unique,4,2,44,2,4,3
top,GI,Local,GCCF,Normal Grade,Local Companies,Other
freq,2106,2333,1325,3500,1905,2256


In [133]:
be_cols = ['Material' ]
be_cols

['Material']

In [134]:
from category_encoders import BinaryEncoder

be = BinaryEncoder()

be_pipeline = Pipeline(steps= [ ('BE', be) ])
be_pipeline

Pipeline(steps=[('BE', BinaryEncoder())])
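
Material has 44 distinct values, so one-hot encoding with `drop='first'` would add 43 columns. Binary encoding instead writes an integer code for each category in base 2, which needs far fewer columns. The function below is a hand-rolled illustration of that idea, not the `category_encoders` implementation; the exact column layout produced by the library may differ:

```python
import math

import pandas as pd


def binary_encode(series: pd.Series) -> pd.DataFrame:
    """Illustrative binary encoding: integer code per category, one column per bit."""
    categories = sorted(series.unique())
    # Codes start at 1 so that 0 stays free (for example for unseen values).
    codes = series.map({cat: i + 1 for i, cat in enumerate(categories)})
    n_bits = max(1, math.ceil(math.log2(len(categories) + 1)))
    bits = {
        f"{series.name}_{b}": (codes // (2 ** (n_bits - 1 - b))) % 2
        for b in range(n_bits)
    }
    return pd.DataFrame(bits, index=series.index)


# 44 hypothetical material codes, matching the cardinality reported above.
materials = pd.Series([f"M{i:02d}" for i in range(44)], name="Material")
encoded = binary_encode(materials)
print(encoded.shape)
```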

## iv- Column Transformer to Assign columns to be processed

In [135]:
from sklearn.compose import ColumnTransformer

preprocessing = ColumnTransformer(transformers= [ ('Num Pipeline', num_pipeline, num_cols),
                                  ('OHE Pipeline', ohe_pipeline, ohe_cols),
                                  ('BE Pipeline', be_pipeline, be_cols) ],
                                  remainder= 'passthrough')
preprocessing

ColumnTransformer(remainder='passthrough',
                  transformers=[('Num Pipeline',
                                 Pipeline(steps=[('Knn', KNNImputer()),
                                                 ('Scaling', RobustScaler())]),
                                 Index(['CUST_THICKNESS', 'CUST_WIDTH', 'ZINC01'], dtype='object')),
                                ('OHE Pipeline',
                                 Pipeline(steps=[('OHE',
                                                  OneHotEncoder(drop='first',
                                                                sparse=False))]),
                                 ['Loaded_line', 'KS_GRADE01', 'Destination',
                                  'Creation', 'Sector']),
                                ('BE Pipeline',
                                 Pipeline(steps=[('BE', BinaryEncoder())]),
                                 ['Material'])])

## v- Handling Imbalance 

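(In a classification task, imbalance refers to an uneven distribution of the target classes. It can also matter in a regression task when the categorical input columns are imbalanced. For example, if a category appears only once, or only a handful of times, the model has very few rows to learn from, and that category may be absent from some cross-validation folds.)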

In [136]:
df.describe()

Unnamed: 0,CUST_THICKNESS,CUST_WIDTH,ZINC01,CBE$
count,4101.0,4101.0,4101.0,4101.0
mean,0.90517,1023.421054,86.726652,2359.729334
std,0.489796,264.759109,87.657071,396.271956
min,0.2,265.0,0.0,1130.0
25%,0.5,915.0,0.0,2110.0
50%,0.8,1120.0,80.0,2370.0
75%,1.2,1250.0,120.0,2590.0
max,2.25,1300.0,330.0,3510.0


In [137]:
df.describe(include= 'object')

Unnamed: 0,Loaded_line,Destination,Material,KS_GRADE01,Sector,Creation
count,4101,4101,4101,4101,4101,4101
unique,4,2,44,2,4,3
top,GI,Local,GCCF,Normal Grade,Local Companies,Other
freq,2106,2333,1325,3500,1905,2256
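
The `describe` output shows only the most frequent level per column, not the rare ones that the note above warns about. A minimal sketch for listing rare levels, on toy data and with an assumed minimum of 2 rows per category:

```python
import pandas as pd

toy = pd.DataFrame({
    "Material": ["GCCF"] * 8 + ["POCF"] * 3 + ["RARE"],
    "Destination": ["Local"] * 7 + ["Export"] * 5,
})

min_count = 2  # assumed minimum number of rows per category
rare = {}
for col in toy.columns:
    counts = toy[col].value_counts()
    rare_levels = counts[counts < min_count].index.tolist()
    if rare_levels:
        rare[col] = rare_levels

print(rare)
```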


# Step 10: Model Selection + Cross Validation + Target Column Scaling

### Supervised Machine Learning, Import Models to be Used

In [138]:
## 1- Linear Regression (LinearRegression)
## 2- K-Nearest Neighbors (KNN) (KNeighborsRegressor)
## 3- Random Forests (RandomForestRegressor)
## 4- Decision Trees (DecisionTreeRegressor)
## 5- XGBoost (XGBRegressor)
## 6- Catboost (CatBoostRegressor)
## 7- Lightgbm (LGBMRegressor)

In [139]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor

### Import Cross Validation

In [140]:
from sklearn.model_selection import cross_validate

### Import Transformed Target Regressor, Target Column Scaling

In [141]:
from sklearn.compose import TransformedTargetRegressor
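
`TransformedTargetRegressor` fits the wrapped model on `func(y)` and applies `inverse_func` to its predictions, so scores are still computed in the original price units. The sketch below checks this against the equivalent manual steps on synthetic data (a toy linear model stands in for the full pipeline):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 1))
y = np.exp(7.0 + X[:, 0])  # strictly positive toy target

wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
wrapped.fit(X, y)

# Equivalent manual steps: fit on log1p(y), then undo the transform with expm1.
manual = LinearRegression().fit(X, np.log1p(y))
manual_pred = np.expm1(manual.predict(X))

print(np.allclose(wrapped.predict(X), manual_pred))
```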

### Model Selection

In [142]:
models = [
    ('Linear Regression', LinearRegression()),
    ('KNN', KNeighborsRegressor()),
    ('Decision Tree', DecisionTreeRegressor(random_state=42)),
    ('Random Forest', RandomForestRegressor(random_state=42)),
    ('XGBoost', XGBRegressor()),
    ('CatBoost', CatBoostRegressor()),
    ('LightGBM', LGBMRegressor(verbose=-1))]

for model in models:

    model_pipeline = Pipeline(steps=[('Preprocessing', preprocessing), ('Model', model[1])])

    model_pipeline_scaled_target = TransformedTargetRegressor(model_pipeline, func=np.log1p, inverse_func=np.expm1)

    result = cross_validate( model_pipeline_scaled_target, x, y, cv=5, scoring='r2', return_train_score=True, n_jobs=-1)

    print(model[0])
    print('Train Score :', round(result['train_score'].mean() * 100, 2))
    print('Test Score :', round(result['test_score'].mean() * 100, 2))
    print('-' * 50)

Linear Regression
Train Score : 81.03
Test Score : 79.63
--------------------------------------------------
KNN
Train Score : 88.76
Test Score : 79.56
--------------------------------------------------
Decision Tree
Train Score : 97.08
Test Score : 72.1
--------------------------------------------------
Random Forest
Train Score : 95.71
Test Score : 80.46
--------------------------------------------------
XGBoost
Train Score : 94.35
Test Score : 83.58
--------------------------------------------------
CatBoost
Train Score : 92.26
Test Score : 85.89
--------------------------------------------------
LightGBM
Train Score : 91.06
Test Score : 84.83
--------------------------------------------------
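
The scores printed above can be collected into one table to compare the cross-validated test score with the train/test gap, which is a rough indicator of overfitting. The numbers below are copied from the output above:

```python
import pandas as pd

scores = pd.DataFrame(
    {
        "train_r2": [81.03, 88.76, 97.08, 95.71, 94.35, 92.26, 91.06],
        "test_r2": [79.63, 79.56, 72.10, 80.46, 83.58, 85.89, 84.83],
    },
    index=["Linear Regression", "KNN", "Decision Tree", "Random Forest",
           "XGBoost", "CatBoost", "LightGBM"],
)
scores["gap"] = (scores["train_r2"] - scores["test_r2"]).round(2)

print(scores.sort_values("test_r2", ascending=False))
```

CatBoost has the highest test score, and the single Decision Tree shows the largest gap (24.98 points), which is why the boosted models are carried forward to tuning.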


# Step 11: Hyperparameter Tuning for Selected Models

##### XGBoost Hyperparameter Tuning

In [143]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor
import numpy as np

# Make sure 'preprocessing' is defined
# preprocessing = ...

param_grid = {
    "regressor__Model__n_estimators": [500, 1000, 1500],
    "regressor__Model__max_depth": [3, 5, 7],
    "regressor__Model__min_child_weight": [1, 3, 5],
    "regressor__Model__colsample_bytree": [0.6, 0.8, 1.0]
}

model_pipeline = Pipeline(steps=[
    ('Preprocessing', preprocessing),
    ('Model', XGBRegressor(objective='reg:squarederror', random_state=42))
])

model_pipeline_scaled_target = TransformedTargetRegressor(
    regressor=model_pipeline,
    func=np.log1p,
    inverse_func=np.expm1
)

result = RandomizedSearchCV(
    estimator=model_pipeline_scaled_target,
    param_distributions=param_grid,
    cv=5,
    scoring='r2',
    return_train_score=True,
    n_jobs=-1
)

result.fit(x, y)

RandomizedSearchCV(cv=5,
                   estimator=TransformedTargetRegressor(func=<ufunc 'log1p'>,
                                                        inverse_func=<ufunc 'expm1'>,
                                                        regressor=Pipeline(steps=[('Preprocessing',
                                                                                   ColumnTransformer(remainder='passthrough',
                                                                                                     transformers=[('Num '
                                                                                                                    'Pipeline',
                                                                                                                    Pipeline(steps=[('Knn',
                                                                                                                                     KNNImputer()),
                                              

In [144]:
result.cv_results_['mean_test_score']

array([0.79260191, 0.79410087, 0.83606562, 0.83360667, 0.82375291,
       0.82134123, 0.83861428, 0.80072671, 0.84735114, 0.79302231])

In [145]:
result.best_score_ * 100

84.73511393670574

In [146]:
result.best_params_

{'regressor__Model__n_estimators': 500,
 'regressor__Model__min_child_weight': 5,
 'regressor__Model__max_depth': 3,
 'regressor__Model__colsample_bytree': 0.8}
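
Two details of the search above are worth spelling out. The parameter names follow the nesting `regressor` (the wrapped pipeline), then `Model` (the step name), then the estimator parameter. Also, the grid holds 3 x 3 x 3 x 3 = 81 combinations, while `RandomizedSearchCV` samples only `n_iter=10` of them by default, which matches the ten scores shown; since no `random_state` is passed to the search, a rerun may sample different combinations. A small sketch, using a decision tree as a stand-in model:

```python
from itertools import product

from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

wrapped = TransformedTargetRegressor(
    regressor=Pipeline([("Model", DecisionTreeRegressor())])
)
# "regressor" -> wrapped Pipeline, "Model" -> step name, then the parameter.
has_param = "regressor__Model__max_depth" in wrapped.get_params()
print(has_param)

# Same value lists as the XGBoost grid above, without the prefix.
grid = {
    "n_estimators": [500, 1000, 1500],
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 3, 5],
    "colsample_bytree": [0.6, 0.8, 1.0],
}
n_combinations = len(list(product(*grid.values())))
print(n_combinations)
```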

##### CatBoost Hyperparameter Tuning

In [147]:
from sklearn.model_selection import RandomizedSearchCV
param_grid = {
    "regressor__Model__learning_rate": [0.01, 0.05, 0.1],
    "regressor__Model__l2_leaf_reg": [3, 5, 10],
    "regressor__Model__iterations": [1000, 1500 , 2000] }


model_pipeline = Pipeline(steps= [ ('Preprocessing', preprocessing), ('Model', CatBoostRegressor())])

model_pipeline_scaled_target = TransformedTargetRegressor(model_pipeline, func= np.log1p, inverse_func= np.expm1)

result = RandomizedSearchCV(model_pipeline_scaled_target, param_grid, cv= 5, scoring= 'r2', return_train_score= True, n_jobs= -1)

result.fit(x, y)

0:	learn: 0.1719438	total: 135ms	remaining: 4m 30s
1:	learn: 0.1708033	total: 138ms	remaining: 2m 17s
2:	learn: 0.1697146	total: 141ms	remaining: 1m 33s
3:	learn: 0.1686032	total: 143ms	remaining: 1m 11s
4:	learn: 0.1674991	total: 145ms	remaining: 57.9s
5:	learn: 0.1663965	total: 147ms	remaining: 48.8s
6:	learn: 0.1653035	total: 149ms	remaining: 42.4s
7:	learn: 0.1643314	total: 151ms	remaining: 37.6s
8:	learn: 0.1632633	total: 153ms	remaining: 33.8s
9:	learn: 0.1622146	total: 155ms	remaining: 30.8s
10:	learn: 0.1611862	total: 157ms	remaining: 28.3s
11:	learn: 0.1601468	total: 158ms	remaining: 26.2s
12:	learn: 0.1591314	total: 160ms	remaining: 24.5s
13:	learn: 0.1581445	total: 162ms	remaining: 23s
14:	learn: 0.1571411	total: 164ms	remaining: 21.7s
15:	learn: 0.1561503	total: 166ms	remaining: 20.6s
16:	learn: 0.1551728	total: 168ms	remaining: 19.6s
17:	learn: 0.1542841	total: 170ms	remaining: 18.7s
18:	learn: 0.1533327	total: 171ms	remaining: 17.9s
19:	learn: 0.1523836	total: 173ms	remai

RandomizedSearchCV(cv=5,
                   estimator=TransformedTargetRegressor(func=<ufunc 'log1p'>,
                                                        inverse_func=<ufunc 'expm1'>,
                                                        regressor=Pipeline(steps=[('Preprocessing',
                                                                                   ColumnTransformer(remainder='passthrough',
                                                                                                     transformers=[('Num '
                                                                                                                    'Pipeline',
                                                                                                                    Pipeline(steps=[('Knn',
                                                                                                                                     KNNImputer()),
                                              

In [148]:
result.cv_results_['mean_test_score']

array([0.85284786, 0.85806037, 0.8570747 , 0.84767701, 0.85474106,
       0.85094288, 0.85866745, 0.85400497, 0.8536853 , 0.85506236])

In [149]:
result.best_score_ * 100

85.86674516983518

In [150]:
result.best_params_

{'regressor__Model__learning_rate': 0.01,
 'regressor__Model__l2_leaf_reg': 10,
 'regressor__Model__iterations': 2000}

##### LightGBM Hyperparameter Tuning

In [151]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor
from lightgbm import LGBMRegressor
import numpy as np

param_grid = {
    "regressor__Model__num_leaves": [31, 50, 70],
    "regressor__Model__learning_rate": [0.01, 0.05, 0.1],
    "regressor__Model__n_estimators": [500, 1000, 1500],
    "regressor__Model__max_depth": [-1, 5, 10]}

model_pipeline = Pipeline(steps=[
    ('Preprocessing', preprocessing),
    ('Model', LGBMRegressor(objective='regression', random_state=42))])

model_pipeline_scaled_target = TransformedTargetRegressor(
    regressor=model_pipeline, func=np.log1p, inverse_func=np.expm1)

result = RandomizedSearchCV(estimator=model_pipeline_scaled_target, param_distributions=param_grid, cv=5, scoring='r2', return_train_score=True, n_jobs=-1 )

result.fit(x, y)


RandomizedSearchCV(cv=5,
                   estimator=TransformedTargetRegressor(func=<ufunc 'log1p'>,
                                                        inverse_func=<ufunc 'expm1'>,
                                                        regressor=Pipeline(steps=[('Preprocessing',
                                                                                   ColumnTransformer(remainder='passthrough',
                                                                                                     transformers=[('Num '
                                                                                                                    'Pipeline',
                                                                                                                    Pipeline(steps=[('Knn',
                                                                                                                                     KNNImputer()),
                                              

In [152]:
result.cv_results_['mean_test_score']

array([0.84927474, 0.84068881, 0.83446664, 0.83696976, 0.83893914,
       0.83575577, 0.84068881, 0.83992699, 0.84947483, 0.84414554])

In [153]:
result.best_score_ * 100

84.94748326092582

In [154]:
result.best_params_

{'regressor__Model__num_leaves': 70,
 'regressor__Model__n_estimators': 1000,
 'regressor__Model__max_depth': 5,
 'regressor__Model__learning_rate': 0.01}

# Step 12: Selected Model + Quick Test

In [155]:
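# Note: the CatBoost search above reported l2_leaf_reg = 10 as the best value; this cell fits with l2_leaf_reg = 3.
catboost_pipeline = Pipeline(steps= [ ('Preprocessing', preprocessing),
                                      ('Model', CatBoostRegressor(learning_rate = 0.01, l2_leaf_reg = 3, iterations = 2000))])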

catboost_pipeline_scaled_target = TransformedTargetRegressor(catboost_pipeline, func= np.log1p, inverse_func= np.expm1)

catboost_pipeline_scaled_target.fit(x, y)

0:	learn: 0.1718630	total: 4.15ms	remaining: 8.29s
1:	learn: 0.1706568	total: 8.87ms	remaining: 8.86s
2:	learn: 0.1695093	total: 13.2ms	remaining: 8.82s
3:	learn: 0.1683295	total: 17.9ms	remaining: 8.91s
4:	learn: 0.1671575	total: 22.2ms	remaining: 8.86s
5:	learn: 0.1660101	total: 26.4ms	remaining: 8.78s
6:	learn: 0.1648472	total: 29.7ms	remaining: 8.45s
7:	learn: 0.1638133	total: 33ms	remaining: 8.22s
8:	learn: 0.1626791	total: 36ms	remaining: 7.97s
9:	learn: 0.1615933	total: 38.9ms	remaining: 7.75s
10:	learn: 0.1605085	total: 42.1ms	remaining: 7.62s
11:	learn: 0.1594058	total: 45.1ms	remaining: 7.46s
12:	learn: 0.1583385	total: 47.8ms	remaining: 7.3s
13:	learn: 0.1573060	total: 50.4ms	remaining: 7.15s
14:	learn: 0.1562554	total: 53ms	remaining: 7.01s
15:	learn: 0.1552068	total: 55.3ms	remaining: 6.85s
16:	learn: 0.1541749	total: 57.4ms	remaining: 6.69s
17:	learn: 0.1532459	total: 59.2ms	remaining: 6.52s
18:	learn: 0.1522404	total: 61.2ms	remaining: 6.38s
19:	learn: 0.1512383	total: 6

TransformedTargetRegressor(func=<ufunc 'log1p'>, inverse_func=<ufunc 'expm1'>,
                           regressor=Pipeline(steps=[('Preprocessing',
                                                      ColumnTransformer(remainder='passthrough',
                                                                        transformers=[('Num '
                                                                                       'Pipeline',
                                                                                       Pipeline(steps=[('Knn',
                                                                                                        KNNImputer()),
                                                                                                       ('Scaling',
                                                                                                        RobustScaler())]),
                                                                                       Index

In [156]:
df[0:1]

Unnamed: 0,Loaded_line,Destination,Material,CUST_THICKNESS,CUST_WIDTH,KS_GRADE01,ZINC01,Sector,Creation,CBE$
1,PP,Local,POCF,0.4,1250.0,Normal Grade,80.0,Local Traders,EXW,1570


In [157]:
actual = df[0:1]['CBE$'].iloc[0]
predicted = catboost_pipeline_scaled_target.predict(x[0:1]).round(2)[0]
error = abs(((predicted - actual) / actual * 100).round(1))
print("Actual:", actual)
print("predicted:", predicted)
print("error %:", error)


Actual: 1570
predicted: 1744.77
error %: 11.1


In [158]:
df[14:15]

Unnamed: 0,Loaded_line,Destination,Material,CUST_THICKNESS,CUST_WIDTH,KS_GRADE01,ZINC01,Sector,Creation,CBE$
25,CR,Export,CRCF,0.7,1000.0,Normal Grade,0.0,Export Near,Other,2590


In [159]:
actual = df[14:15]['CBE$'].iloc[0]
predicted = catboost_pipeline_scaled_target.predict(x[14:15]).round(2)[0]
error = abs(((predicted - actual) / actual * 100).round(1))
print("Actual:", actual)
print("predicted:", predicted)
print("error %:", error)

Actual: 2590
predicted: 2523.01
error %: 2.6
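
Both rows checked above were part of the data the model was fitted on, so these errors are likely optimistic. A hold-out split gives a fairer picture. The sketch below shows the pattern on synthetic data with a toy linear model; the coefficients, noise level, and 80/20 split are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = 2000.0 + 500.0 * X[:, 0] - 300.0 * X[:, 1] + rng.normal(scale=50.0, size=200)

# Hold back 20% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Mean absolute percentage error on unseen rows, in percent.
mape = mean_absolute_percentage_error(y_test, model.predict(X_test)) * 100
print(round(mape, 2))
```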


# Step 13: Saving the Selected Model with joblib

In [160]:
import joblib

joblib.dump(catboost_pipeline_scaled_target, 'catboost.pkl')

['catboost.pkl']
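
A saved model can be sanity-checked by loading it back and comparing predictions. The sketch below shows the round trip with a toy model and a temporary file; the file name is hypothetical:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, path)        # serialize the fitted estimator
    restored = joblib.load(path)    # load it back from disk

same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

The pickle should be loaded with the same library versions that were used to create it, since compatibility across versions is not guaranteed.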

In [161]:
df.describe()

Unnamed: 0,CUST_THICKNESS,CUST_WIDTH,ZINC01,CBE$
count,4101.0,4101.0,4101.0,4101.0
mean,0.90517,1023.421054,86.726652,2359.729334
std,0.489796,264.759109,87.657071,396.271956
min,0.2,265.0,0.0,1130.0
25%,0.5,915.0,0.0,2110.0
50%,0.8,1120.0,80.0,2370.0
75%,1.2,1250.0,120.0,2590.0
max,2.25,1300.0,330.0,3510.0


In [162]:
df

Unnamed: 0,Loaded_line,Destination,Material,CUST_THICKNESS,CUST_WIDTH,KS_GRADE01,ZINC01,Sector,Creation,CBE$
1,PP,Local,POCF,0.40,1250.0,Normal Grade,80.0,Local Traders,EXW,1570
2,PP,Local,POCF,0.40,1250.0,Normal Grade,80.0,Local Traders,EXW,1770
3,PP,Local,POCF,1.14,1250.0,Normal Grade,120.0,Local Traders,EXW,1570
4,PP,Local,POCF,0.50,1250.0,Normal Grade,180.0,Local Traders,EXW,1770
8,PP,Local,POSF,0.40,699.8,Normal Grade,120.0,Local Companies,EXW,2410
...,...,...,...,...,...,...,...,...,...,...
16378,GI,Local,GHCF,1.50,1250.0,Normal Grade,60.0,Local Traders,EXW,1130
16382,GI,Local,GHCF,1.50,1180.0,Special Grade,275.0,Local Traders,EXW,1810
16384,GI,Local,GHCF,1.50,1070.0,Special Grade,275.0,Local Traders,EXW,1810
16385,GI,Local,GHCF,2.00,1070.0,Special Grade,275.0,Local Traders,EXW,1810


In [163]:
df.columns

Index(['Loaded_line', 'Destination', 'Material', 'CUST_THICKNESS',
       'CUST_WIDTH', 'KS_GRADE01', 'ZINC01', 'Sector', 'Creation', 'CBE$'],
      dtype='object')

In [164]:

df.duplicated().sum()

0

# Step 14: Deployment

In [165]:
%%writefile Flat_Steel_Price.py

import pandas as pd
import streamlit as st
import joblib
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from category_encoders import BinaryEncoder
from catboost import CatBoostRegressor

st.set_page_config(layout='wide', page_title='Flat Steel Prices')

html_title = """<h1 style="color:white;text-align:center;"> Flat Steel Prices </h1>"""
st.markdown(html_title, unsafe_allow_html=True)

st.image('https://www.shutterstock.com/image-photo/packed-rolls-steel-sheet-cold-600nw-338337974.jpg')

df = pd.read_csv('cleaned_df.csv', index_col= 0)
st.dataframe(df.head())

Loaded_line = st.selectbox('Loaded_line', df.Loaded_line.unique())
KS_GRADE01 = st.selectbox('KS_GRADE01', df.KS_GRADE01.unique())
CUST_THICKNESS = st.sidebar.slider('CUST_THICKNESS', min_value=0.2, max_value=2.25, step=0.05)
CUST_WIDTH = st.sidebar.slider('CUST_WIDTH', min_value=265, max_value=1300, step=5)
ZINC01 = st.sidebar.slider('ZINC01', min_value=0, max_value=330, step=10)
Destination = st.sidebar.selectbox('Destination', df.Destination.unique())
Sector = st.sidebar.selectbox('Sector', df.Sector.unique())
Creation = st.selectbox('Creation', df.Creation.unique())
Material = st.selectbox('Material', df.Material.unique())

ml_model = joblib.load('catboost.pkl')

if st.button('Predict Flat Steel Price'):

    new_data = pd.DataFrame(columns= df.columns.drop('CBE$'), data= [[Loaded_line, Destination, Material, CUST_THICKNESS,
       CUST_WIDTH, KS_GRADE01, ZINC01, Sector, Creation]])

    st.write('Flat Steel Price :', ml_model.predict(new_data).round(2)[0])

Overwriting Flat_Steel_Price.py
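
The app builds the one-row input by position, so the values must be listed in exactly the column order used during training. A keyed helper makes that less error-prone. `build_input_row` and `FEATURE_ORDER` are hypothetical names introduced for this sketch; the column order is the one printed by `df.columns` above, without the target:

```python
import pandas as pd

# Training column order, taken from df.columns above (target CBE$ excluded).
FEATURE_ORDER = ["Loaded_line", "Destination", "Material", "CUST_THICKNESS",
                 "CUST_WIDTH", "KS_GRADE01", "ZINC01", "Sector", "Creation"]


def build_input_row(**values) -> pd.DataFrame:
    """Return a one-row frame with columns in training order (hypothetical helper)."""
    missing = [c for c in FEATURE_ORDER if c not in values]
    if missing:
        raise ValueError(f"Missing inputs: {missing}")
    return pd.DataFrame([{c: values[c] for c in FEATURE_ORDER}])


# Arguments can be passed in any order; the helper restores the training order.
row = build_input_row(
    ZINC01=80, CUST_WIDTH=1250, CUST_THICKNESS=0.4, Creation="EXW",
    Sector="Local Traders", KS_GRADE01="Normal Grade", Material="POCF",
    Destination="Local", Loaded_line="PP",
)
print(list(row.columns))
```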


In [None]:
! streamlit run Flat_Steel_Price.py

In [None]:
! pip install pipreqs



In [None]:
import pipreqs
! pipreqs

INFO: Successfully saved requirements file in e:\3-Academic\Data Science Diploma\B-Final Project-MPS 18 Months (Price Prediction)\requirements.txt
