# Data Preprocessing
## Food Delivery Time Dataset

## Stage Objectives

This stage aims to:
- Load the food delivery dataset
- Understand the structure and data types
- Identify missing values and duplicates
- Validate the target variable (`Delivery_Time_min`)

No aggressive data cleaning is performed at this stage.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load dataset
df = pd.read_csv("../data/raw/Food_Delivery_Times.csv")

# Show the first 5 rows
df.head()

|   | Order_ID | Distance_km | Weather | Traffic_Level | Time_of_Day | Vehicle_Type | Preparation_Time_min | Courier_Experience_yrs | Delivery_Time_min |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 522 | 7.93 | Windy | Low | Afternoon | Scooter | 12 | 1.0 | 43 |
| 1 | 738 | 16.42 | Clear | Medium | Evening | Bike | 20 | 2.0 | 84 |
| 2 | 741 | 9.52 | Foggy | Low | Night | Scooter | 28 | 1.0 | 59 |
| 3 | 661 | 7.44 | Rainy | Medium | Afternoon | Scooter | 5 | 1.0 | 37 |
| 4 | 412 | 19.03 | Clear | Low | Morning | Bike | 16 | 5.0 | 68 |


In [3]:
# Number of rows and columns
df.shape

(1000, 9)

In [4]:
# Structure and data type information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Order_ID                1000 non-null   int64  
 1   Distance_km             1000 non-null   float64
 2   Weather                 970 non-null    object 
 3   Traffic_Level           970 non-null    object 
 4   Time_of_Day             970 non-null    object 
 5   Vehicle_Type            1000 non-null   object 
 6   Preparation_Time_min    1000 non-null   int64  
 7   Courier_Experience_yrs  970 non-null    float64
 8   Delivery_Time_min       1000 non-null   int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 70.4+ KB


In [5]:
# Number of missing values per column
df.isna().sum()

Order_ID                   0
Distance_km                0
Weather                   30
Traffic_Level             30
Time_of_Day               30
Vehicle_Type               0
Preparation_Time_min       0
Courier_Experience_yrs    30
Delivery_Time_min          0
dtype: int64
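
In the output above, four columns each have 30 missing values out of 1000 rows, which is 3% per column. The counts alone do not show whether those gaps fall in the same rows or in different ones. A minimal sketch of how that could be checked, using a small hypothetical frame (`toy`, with made-up values) in place of the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Weather": ["Clear", np.nan, "Rainy", "Clear", np.nan],
    "Traffic_Level": ["Low", "High", np.nan, "Medium", np.nan],
    "Courier_Experience_yrs": [1.0, 2.0, 3.0, np.nan, 5.0],
    "Delivery_Time_min": [43, 84, 59, 37, 68],
})

# Share of missing values per column, in percent
missing_pct = toy.isna().mean().mul(100).round(1)

# Number of missing fields per row, to see whether gaps cluster in the same rows
missing_per_row = toy.isna().sum(axis=1)
rows_with_any_missing = int((missing_per_row > 0).sum())

print(missing_pct)
print("Rows with at least one missing value:", rows_with_any_missing)
```

Applied to `df`, the same two expressions would show how many distinct rows are affected, which helps when choosing between imputing values and dropping rows.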

In [6]:
# Number of duplicate rows
df.duplicated().sum()

np.int64(0)
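
One caveat: `df.duplicated()` compares whole rows, including `Order_ID`. If the ID is unique per row (not verified in this notebook), two otherwise identical orders would never be counted as duplicates. A sketch of a check that leaves the ID out of the comparison, on a hypothetical toy frame with made-up values:

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 carry the same features under different IDs
toy = pd.DataFrame({
    "Order_ID": [101, 102, 103],
    "Distance_km": [7.93, 16.42, 7.93],
    "Weather": ["Windy", "Clear", "Windy"],
    "Delivery_Time_min": [43, 84, 43],
})

# Full-row comparison: the distinct IDs make every row look different
full_row_dupes = int(toy.duplicated().sum())

# Comparison on every column except the ID
feature_cols = toy.columns.drop("Order_ID").tolist()
feature_dupes = int(toy.duplicated(subset=feature_cols).sum())

print(full_row_dupes, feature_dupes)
```

Whether feature-level duplicates should be removed is a judgment call, since two genuine orders can share the same attributes.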

In [7]:
# Check that the target column Delivery_Time_min exists
"Delivery_Time_min" in df.columns

True

In [8]:
# Descriptive statistics of the target variable
df["Delivery_Time_min"].describe()

count    1000.000000
mean       56.732000
std        22.070915
min         8.000000
25%        41.000000
50%        55.500000
75%        71.000000
max       153.000000
Name: Delivery_Time_min, dtype: float64
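
The summary above shows a maximum of 153 minutes against a 75th percentile of 71. Under the common 1.5 × IQR rule (an assumption here, not a step taken in this notebook), the upper fence would be 71 + 1.5 × (71 − 41) = 116, so at least one delivery time lies beyond it. A minimal sketch of that check on a hypothetical series with made-up values:

```python
import pandas as pd

# Hypothetical toy values; in the notebook this would be df["Delivery_Time_min"]
delivery = pd.Series([37, 41, 43, 55, 59, 68, 71, 84, 153], name="Delivery_Time_min")

# Quartiles and interquartile range
q1 = delivery.quantile(0.25)
q3 = delivery.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are only flagged, not removed
flagged = delivery[(delivery < lower) | (delivery > upper)]
print(flagged.tolist())
```

Flagged values are candidates for inspection during EDA. Long delivery times may be genuine, so the sketch does not drop them.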

## Data Cleaning & Transformation

This stage aims to:
- Drop columns that are not relevant
- Handle missing values in a reasoned way
- Prepare the dataset for EDA and modeling

In [9]:
# Drop the ID column
df = df.drop(columns=["Order_ID"])

df.head()

|   | Distance_km | Weather | Traffic_Level | Time_of_Day | Vehicle_Type | Preparation_Time_min | Courier_Experience_yrs | Delivery_Time_min |
|---|---|---|---|---|---|---|---|---|
| 0 | 7.93 | Windy | Low | Afternoon | Scooter | 12 | 1.0 | 43 |
| 1 | 16.42 | Clear | Medium | Evening | Bike | 20 | 2.0 | 84 |
| 2 | 9.52 | Foggy | Low | Night | Scooter | 28 | 1.0 | 59 |
| 3 | 7.44 | Rainy | Medium | Afternoon | Scooter | 5 | 1.0 | 37 |
| 4 | 19.03 | Clear | Low | Morning | Bike | 16 | 5.0 | 68 |


## Missing Values After Dropping the ID Column

In [10]:
df.isna().sum()

Distance_km                0
Weather                   30
Traffic_Level             30
Time_of_Day               30
Vehicle_Type               0
Preparation_Time_min       0
Courier_Experience_yrs    30
Delivery_Time_min          0
dtype: int64

In [11]:
# Fill missing values in categorical columns with the mode
categorical_cols = ["Weather", "Traffic_Level", "Time_of_Day"]

for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

In [12]:
# Fill missing values in the numeric column with the median
df["Courier_Experience_yrs"] = df["Courier_Experience_yrs"].fillna(df["Courier_Experience_yrs"].median())
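
The two cells above compute the mode and median on the full dataset. If the data is later split for modeling, one option is to learn the fill values on the training portion only and reuse them elsewhere. The sketch below illustrates that idea. The helper `fit_fill_values` and the `train`/`test` frames are hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd


def fit_fill_values(frame, categorical_cols, numeric_cols):
    """Learn one fill value per column: mode for categorical, median for numeric."""
    fills = {col: frame[col].mode()[0] for col in categorical_cols}
    fills.update({col: frame[col].median() for col in numeric_cols})
    return fills


# Hypothetical toy split with made-up values
train = pd.DataFrame({
    "Weather": ["Clear", "Clear", "Rainy", np.nan],
    "Courier_Experience_yrs": [1.0, 2.0, np.nan, 9.0],
})
test = pd.DataFrame({
    "Weather": [np.nan, "Foggy"],
    "Courier_Experience_yrs": [np.nan, 4.0],
})

# Fill values come from the training frame only
fills = fit_fill_values(train, ["Weather"], ["Courier_Experience_yrs"])

# DataFrame.fillna accepts a dict mapping column name to fill value
train_filled = train.fillna(fills)
test_filled = test.fillna(fills)

print(fills)
```

For a single exploratory pass over the whole dataset, as in this notebook, the simpler in-place approach gives the same result as calling the helper on `df` itself.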

In [13]:
df.isna().sum()

Distance_km               0
Weather                   0
Traffic_Level             0
Time_of_Day               0
Vehicle_Type              0
Preparation_Time_min      0
Courier_Experience_yrs    0
Delivery_Time_min         0
dtype: int64

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Distance_km             1000 non-null   float64
 1   Weather                 1000 non-null   object 
 2   Traffic_Level           1000 non-null   object 
 3   Time_of_Day             1000 non-null   object 
 4   Vehicle_Type            1000 non-null   object 
 5   Preparation_Time_min    1000 non-null   int64  
 6   Courier_Experience_yrs  1000 non-null   float64
 7   Delivery_Time_min       1000 non-null   int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 62.6+ KB


In [21]:
df.describe()

|   | Distance_km | Preparation_Time_min | Courier_Experience_yrs | Delivery_Time_min |
|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 10.05997 | 16.982 | 4.592 | 56.732 |
| std | 5.696656 | 7.204553 | 2.871198 | 22.070915 |
| min | 0.59 | 5.0 | 0.0 | 8.0 |
| 25% | 5.105 | 11.0 | 2.0 | 41.0 |
| 50% | 10.19 | 17.0 | 5.0 | 55.5 |
| 75% | 15.0175 | 23.0 | 7.0 | 71.0 |
| max | 19.99 | 29.0 | 9.0 | 153.0 |


In [16]:
# Save the cleaned dataset
df.to_csv("../data/processed/food_delivery_clean.csv", index=False)
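
`to_csv` does not create missing directories, so the write fails if the output folder does not exist yet. A minimal sketch of creating the folder first and then reading the file back as a sanity check. It uses a temporary directory and a hypothetical toy frame (`clean`, with made-up rows) rather than the project paths:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame standing in for the cleaned dataset
clean = pd.DataFrame({
    "Distance_km": [7.93, 16.42],
    "Weather": ["Windy", "Clear"],
    "Delivery_Time_min": [43, 84],
})

with tempfile.TemporaryDirectory() as tmp:
    # Create the output folder before writing
    out_dir = Path(tmp) / "data" / "processed"
    out_dir.mkdir(parents=True, exist_ok=True)

    out_path = out_dir / "food_delivery_clean.csv"
    clean.to_csv(out_path, index=False)

    # Read the file back while the temporary directory still exists
    reloaded = pd.read_csv(out_path)

# Basic round-trip checks: same shape, same columns, no missing values
same_shape = reloaded.shape == clean.shape
same_columns = list(reloaded.columns) == list(clean.columns)
no_missing = int(reloaded.isna().sum().sum()) == 0

print(same_shape, same_columns, no_missing)
```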