# Food Delivery Time Analysis
## Notebook 01: Data Exploration & Cleaning

### Objective
In this notebook, we will perform the initial steps of data analysis for the `Food_Delivery_Times.csv` dataset, which include:

- **Data Loading:** Loading the dataset into a pandas DataFrame.
- **Data Exploration:** Understanding the structure, content, and basic statistics of the data.
- **Data Cleaning:** Handling missing values, correcting data types, and addressing any inconsistencies.

### Problem Context

This project aims to analyze food delivery times. Understanding the dataset's characteristics is crucial for subsequent analysis, which may involve predicting delivery times, identifying factors influencing delays, or optimizing delivery routes.

The purpose of this initial phase is to prepare a clean and well-understood dataset for further modeling and insights.

## Step 1: Load Required Libraries
We import libraries required for data handling and numerical analysis.


In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',100)
print("Libraries imported successfully")

Libraries imported successfully


## Step 2: Load the Dataset

In this step, we load the raw dataset from the `data/raw` directory.
This dataset is used throughout the project for exploration, cleaning, and modeling.


In [2]:
df=pd.read_csv("../data/raw/Food_Delivery_Times.csv")
print("Dataset loaded successfully")

Dataset loaded successfully
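
A quick schema check right after loading can catch a renamed or missing column early. The following is a minimal sketch, not part of the original notebook: the helper `check_columns` is hypothetical, the expected column names are those reported by `df.info()` in this notebook, and the inline CSV stands in for the real file using two rows copied from the `df.head()` output.

```python
import io

import pandas as pd

# Column names as reported by df.info() in this notebook.
EXPECTED_COLUMNS = [
    "Order_ID", "Distance_km", "Weather", "Traffic_Level", "Time_of_Day",
    "Vehicle_Type", "Preparation_Time_min", "Courier_Experience_yrs",
    "Delivery_Time_min",
]


def check_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected columns missing from `frame` (hypothetical helper)."""
    return [col for col in expected if col not in frame.columns]


# Inline stand-in for the CSV file: header plus two rows copied from df.head().
sample_csv = io.StringIO(
    "Order_ID,Distance_km,Weather,Traffic_Level,Time_of_Day,Vehicle_Type,"
    "Preparation_Time_min,Courier_Experience_yrs,Delivery_Time_min\n"
    "522,7.93,Windy,Low,Afternoon,Scooter,12,1.0,43\n"
    "738,16.42,Clear,Medium,Evening,Bike,20,2.0,84\n"
)
sample = pd.read_csv(sample_csv)

missing_columns = check_columns(sample)
print("Missing columns:", missing_columns)
```

In the notebook itself, the same helper could be called as `check_columns(df)` once `df` has been loaded.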


## Step 3: Initial Data Overview

In this step, we'll gain a preliminary understanding of our dataset by inspecting its basic characteristics. We will look at:

- **Shape:** The number of rows and columns.
- **Info:** Data types, non-null counts, and memory usage.
- **Describe:** Statistical summary of numerical columns.
- **Head & Tail:** The first and last few rows to get a glimpse of the data content.

In [3]:
print(f"Dataset shape: rows: {df.shape[0]}, columns: {df.shape[1]}")

Dataset shape: rows: 1000, columns: 9


In [4]:
print("Dataset information")
print()
df.info()

Dataset information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Order_ID                1000 non-null   int64  
 1   Distance_km             1000 non-null   float64
 2   Weather                 970 non-null    object 
 3   Traffic_Level           970 non-null    object 
 4   Time_of_Day             970 non-null    object 
 5   Vehicle_Type            1000 non-null   object 
 6   Preparation_Time_min    1000 non-null   int64  
 7   Courier_Experience_yrs  970 non-null    float64
 8   Delivery_Time_min       1000 non-null   int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 70.4+ KB


In [5]:
print("Statistical summary of numerical columns")
print()
df.describe()

Statistical summary of numerical columns



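|       | Order_ID   | Distance_km | Preparation_Time_min | Courier_Experience_yrs | Delivery_Time_min |
|-------|------------|-------------|----------------------|------------------------|-------------------|
| count | 1000.0     | 1000.0      | 1000.0               | 970.0                  | 1000.0            |
| mean  | 500.5      | 10.05997    | 16.982               | 4.579381               | 56.732            |
| std   | 288.819436 | 5.696656    | 7.204553             | 2.914394               | 22.070915         |
| min   | 1.0        | 0.59        | 5.0                  | 0.0                    | 8.0               |
| 25%   | 250.75     | 5.105       | 11.0                 | 2.0                    | 41.0              |
| 50%   | 500.5      | 10.19       | 17.0                 | 5.0                    | 55.5              |
| 75%   | 750.25     | 15.0175     | 23.0                 | 7.0                    | 71.0              |
| max   | 1000.0     | 19.99       | 29.0                 | 9.0                    | 153.0             |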


In [6]:
print("First 5 rows of the dataset")
print()
df.head()

First 5 rows of the dataset



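|   | Order_ID | Distance_km | Weather | Traffic_Level | Time_of_Day | Vehicle_Type | Preparation_Time_min | Courier_Experience_yrs | Delivery_Time_min |
|---|----------|-------------|---------|---------------|-------------|--------------|----------------------|------------------------|-------------------|
| 0 | 522      | 7.93        | Windy   | Low           | Afternoon   | Scooter      | 12                   | 1.0                    | 43                |
| 1 | 738      | 16.42       | Clear   | Medium        | Evening     | Bike         | 20                   | 2.0                    | 84                |
| 2 | 741      | 9.52        | Foggy   | Low           | Night       | Scooter      | 28                   | 1.0                    | 59                |
| 3 | 661      | 7.44        | Rainy   | Medium        | Afternoon   | Scooter      | 5                    | 1.0                    | 37                |
| 4 | 412      | 19.03       | Clear   | Low           | Morning     | Bike         | 16                   | 5.0                    | 68                |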


In [7]:
print("Last 5 rows of the dataset")
print()
df.tail()

Last 5 rows of the dataset



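|     | Order_ID | Distance_km | Weather | Traffic_Level | Time_of_Day | Vehicle_Type | Preparation_Time_min | Courier_Experience_yrs | Delivery_Time_min |
|-----|----------|-------------|---------|---------------|-------------|--------------|----------------------|------------------------|-------------------|
| 995 | 107      | 8.5         | Clear   | High          | Evening     | Car          | 13                   | 3.0                    | 54                |
| 996 | 271      | 16.28       | Rainy   | Low           | Morning     | Scooter      | 8                    | 9.0                    | 71                |
| 997 | 861      | 15.62       | Snowy   | High          | Evening     | Scooter      | 26                   | 2.0                    | 81                |
| 998 | 436      | 14.17       | Clear   | Low           | Afternoon   | Bike         | 8                    | 0.0                    | 55                |
| 999 | 103      | 6.63        | Foggy   | Low           | Night       | Scooter      | 24                   | 3.0                    | 58                |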


## Observations from Initial Data Overview

Based on the initial inspection, here are the key observations about the `Food_Delivery_Times.csv` dataset:

*   **Dataset Dimensions:** The dataset comprises `1000` rows and `9` columns.

*   **Data Types and Missing Values:**
    *   `Order_ID`, `Distance_km`, `Preparation_Time_min`, `Vehicle_Type`, and `Delivery_Time_min` have no missing values. The numerical columns among them have appropriate data types (`int64` or `float64`), while `Vehicle_Type` is stored as `object`.
    *   Columns `Weather`, `Traffic_Level`, `Time_of_Day`, and `Courier_Experience_yrs` each have `970` non-null entries out of `1000`, indicating `30` missing values in each of these columns. These will need to be addressed during data cleaning.
    *   `Weather`, `Traffic_Level`, `Time_of_Day`, and `Vehicle_Type` are of `object` type, suggesting they are categorical features.

*   **Statistical Summary (`df.describe()`):**
    *   `Distance_km` ranges from approximately 0.59 km to 19.99 km, with a mean of about 10 km.
    *   `Preparation_Time_min` ranges from 5 minutes to 29 minutes.
    *   `Courier_Experience_yrs` ranges from 0 to 9 years, with a mean of about 4.58 years.
    *   `Delivery_Time_min` ranges from 8 minutes to 153 minutes, with a mean of about 56.73 minutes and a standard deviation of 22.07. The maximum is more than twice the 75th percentile (71 minutes), which points to a long right tail that is worth checking for outliers.

*   **Data Content (`df.head()` and `df.tail()`):**
    *   The first and last few rows confirm the presence of various categorical values for `Weather` (e.g., 'Windy', 'Clear', 'Foggy', 'Rainy', 'Snowy'), `Traffic_Level` (e.g., 'Low', 'Medium', 'High'), `Time_of_Day` (e.g., 'Afternoon', 'Evening', 'Night', 'Morning'), and `Vehicle_Type` (e.g., 'Scooter', 'Bike', 'Car').
    *   The numerical values in these rows (distance, preparation time, courier experience, and delivery time) fall within the minimum and maximum reported by `describe()`.
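
One way to judge how extreme the largest delivery time is uses Tukey's 1.5 × IQR rule. This rule is an assumption introduced here for illustration, since the notebook does not apply it. The sketch below uses only the quartiles and the maximum copied from the `df.describe()` output, and `iqr_fences` is a hypothetical helper name.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and maximum of Delivery_Time_min, copied from df.describe().
q1, q3, observed_max = 41.0, 71.0, 153.0

lower, upper = iqr_fences(q1, q3)
print(f"IQR fences: lower={lower}, upper={upper}")
print("Maximum exceeds the upper fence:", observed_max > upper)
```

With these quartiles the upper fence is 116 minutes, so the 153-minute maximum would count as a potential outlier under this rule. Whether such rows should be kept is a modeling decision left to later notebooks.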

## Step 4: Data Cleaning - Handling Missing Values and Duplicates

In this step, we will address data quality issues identified during the initial exploration. This involves:

-   **Handling Missing Values:** Imputing or removing missing values in columns such as `Weather`, `Traffic_Level`, `Time_of_Day`, and `Courier_Experience_yrs`.
-   **Handling Duplicate Values:** Identifying and removing any duplicate rows to ensure data integrity.

In [8]:
# Count missing values per column and express them as a percentage of all rows.
df_missing = df.isna().sum()
df_missing_percentage = (df_missing / len(df) * 100).round(2)
missing_data = pd.DataFrame({"Columns": df_missing.index, "Missing_Count": df_missing.values, "Missing_Percentage": df_missing_percentage.values})
# Keep only the columns that actually contain missing values.
missing_data = missing_data[missing_data["Missing_Count"] > 0].sort_values(by="Missing_Percentage", ascending=False)
missing_data

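|   | Columns                | Missing_Count | Missing_Percentage |
|---|------------------------|---------------|--------------------|
| 2 | Weather                | 30            | 3.0                |
| 3 | Traffic_Level          | 30            | 3.0                |
| 4 | Time_of_Day            | 30            | 3.0                |
| 7 | Courier_Experience_yrs | 30            | 3.0                |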


### Strategy for Handling Missing Values

Based on the analysis of missing values, we will employ the following imputation strategies:

*   **Categorical Columns (`Weather`, `Traffic_Level`, `Time_of_Day`):** These columns contain categorical data. A simple and common choice for such features is to replace missing values with the **mode** (the most frequent value). With only 3% of values missing in each column, this has a limited effect on the category distribution, although it does slightly increase the share of the most frequent category.

*   **Numerical Column (`Courier_Experience_yrs`):** This column contains numerical data representing courier experience. The mean and median are relatively close (4.58 vs. 5.0), so the distribution does not appear heavily skewed, and we impute missing values with the **mean** of the column. Note that the imputed value (about 4.58) is not a whole number, unlike the values seen in `df.head()` and `df.tail()`.

In [9]:
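for col in ['Weather', 'Traffic_Level', 'Time_of_Day']:
    # Assign the result back instead of calling fillna(inplace=True) on a column selection,
    # which is deprecated in recent pandas versions and may not update df.
    df[col] = df[col].fillna(df[col].mode()[0])

print("Missing values in categorical columns have been imputed with their modes.")

df['Courier_Experience_yrs'] = df['Courier_Experience_yrs'].fillna(df['Courier_Experience_yrs'].mean())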
print("Missing values in the 'Courier_Experience_yrs' column have been imputed with the mean.")

Missing values in categorical columns have been imputed with their modes.
Missing values in the 'Courier_Experience_yrs' column have been imputed with the mean.
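
After imputation it is worth confirming that no missing values remain. The sketch below illustrates the check on a small toy frame; the values are made up for illustration and only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Weather": ["Clear", "Rainy", None, "Clear"],
    "Courier_Experience_yrs": [2.0, np.nan, 4.0, 6.0],
})

# Mode for the categorical column, mean for the numerical one.
toy["Weather"] = toy["Weather"].fillna(toy["Weather"].mode()[0])
toy["Courier_Experience_yrs"] = toy["Courier_Experience_yrs"].fillna(
    toy["Courier_Experience_yrs"].mean()
)

# Total number of missing cells left in the frame.
remaining = int(toy.isna().sum().sum())
print("Remaining missing values:", remaining)
```

Applied to the notebook's DataFrame, the equivalent check would be `df.isna().sum().sum() == 0`.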


In [10]:
if df.duplicated().sum()>0:
    df.drop_duplicates(inplace=True)
    print("Duplicate rows have been removed")
else:
    print("No duplicate rows found")

No duplicate rows found
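
Note that `df.duplicated()` only flags rows that are identical in every column. A repeated `Order_ID` with different values in other columns would not be detected. The following sketch, on a toy frame with made-up values, shows how the two checks can differ; treating `Order_ID` as a unique key is an assumption, not something the notebook states.

```python
import pandas as pd

# Toy frame with made-up values: Order_ID 2 appears twice with different times.
toy = pd.DataFrame({
    "Order_ID": [1, 2, 2, 3],
    "Delivery_Time_min": [40, 55, 60, 35],
})

# Rows identical in every column (what df.duplicated() checks by default).
full_row_duplicates = int(toy.duplicated().sum())

# Rows whose Order_ID has already appeared earlier in the frame.
repeated_ids = int(toy.duplicated(subset=["Order_ID"]).sum())

print("Fully duplicated rows:", full_row_duplicates)
print("Repeated Order_ID values:", repeated_ids)
```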


In [11]:
df.to_csv("../data/processed/clean_food_delivery_times.csv",index=False)
print("Clean dataset saved successfully")

Clean dataset saved successfully
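
`to_csv` raises an error if the target directory does not exist. The sketch below shows one way to create the directory first and then confirm that the file reads back unchanged. It writes a toy frame with made-up values to a temporary directory, so the paths are placeholders rather than the project's `data/processed` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with made-up values standing in for the cleaned dataset.
toy = pd.DataFrame({"Order_ID": [1, 2], "Delivery_Time_min": [40, 55]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed" / "clean_food_delivery_times.csv"
    # Create the output directory (and any parents) if it is missing.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)
    # Read the file back to confirm the round trip preserves the data.
    reloaded = pd.read_csv(out_path)

round_trip_ok = reloaded.equals(toy)
print("Round trip OK:", round_trip_ok)
```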


## Notebook Summary: Data Exploration & Cleaning

This notebook, '01: Data Exploration & Cleaning', completed the initial phase of analyzing the `Food_Delivery_Times.csv` dataset. The key steps and outcomes are as follows:

### 1. Data Loading
*   The `Food_Delivery_Times.csv` dataset was loaded into a pandas DataFrame, making it ready for analysis.

### 2. Data Exploration
*   **Initial Data Overview:** We gained a comprehensive understanding of the dataset's structure, including:
    *   **Shape:** Identified `1000` rows and `9` columns.
    *   **Data Types and Non-Null Counts:** Examined data types and noted `30` missing values in `Weather`, `Traffic_Level`, `Time_of_Day`, and `Courier_Experience_yrs`.
    *   **Statistical Summary:** Analyzed numerical columns to understand their distribution, range, and central tendencies.
    *   **Data Content:** Used `head()` and `tail()` to inspect the actual data entries for categorical and numerical features.

### 3. Data Cleaning
*   **Handling Missing Values:**
    *   Categorical columns (`Weather`, `Traffic_Level`, `Time_of_Day`) were successfully imputed with their respective **modes**.
    *   The numerical column (`Courier_Experience_yrs`) was imputed with its **mean**.
*   **Handling Duplicate Values:** It was confirmed that there were no duplicate rows in the dataset, ensuring data integrity.

### Conclusion
By the end of this notebook, we have a clean and well-understood dataset, ready for further in-depth analysis, feature engineering, and model building in subsequent stages of the project. The cleaned dataset has been saved as `clean_food_delivery_times.csv`.