## 1. Gather the Data
**Answer:**
The dataset was loaded from the provided URL. It contains 1,000 food delivery orders from New Delhi across 12 columns: Order ID, Customer ID, Restaurant ID, Order Date and Time, Delivery Date and Time, Order Value, Delivery Fee, Payment Method, Discounts and Offers, Commission Fee, Payment Processing Fee, and Refunds/Chargebacks.

In [18]:
import pandas as pd
import numpy as np
from scipy import stats

# Load the dataset
url = "https://statso.io/wp-content/uploads/2024/02/food_orders_new_delhi.csv"
df = pd.read_csv(url)
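
A quick sanity check after loading can confirm that the expected columns are present and show where values are missing. The sketch below is only an illustration: it reads a hypothetical two-row sample from an in-memory string instead of the real URL, and the `expected` set is an assumed subset of the real column names.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the downloaded CSV.
sample_csv = io.StringIO(
    "Order ID,Customer ID,Order Value,Discounts and Offers\n"
    "1,C001,500,10%\n"
    "2,C002,750,\n"
)
sample = pd.read_csv(sample_csv)

# Assumed subset of column names to verify after loading.
expected = {"Order ID", "Customer ID", "Order Value", "Discounts and Offers"}
missing = expected - set(sample.columns)

print("Shape:", sample.shape)
print("Missing expected columns:", missing)
print(sample.isnull().sum())
```

The same three checks (shape, expected columns, null counts) can be run on `df` directly once the real file has loaded.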

## 2. Clean the Dataset
**Answer:**
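The dataset was cleaned by:

Dropping rows where an identifier (Order ID, Customer ID, Restaurant ID) is missing.

Filling missing numeric values (Order Value, Delivery Fee, Commission Fee, Payment Processing Fee, Refunds/Chargebacks) with the column median, and missing Discounts and Offers values with the mode.

Removing duplicate records.

Converting Order Date and Time and Delivery Date and Time to datetime.

Removing outliers in the numeric columns using the 1.5 × IQR rule.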
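
The cleaning cell in this section fills numeric gaps with the column median and categorical gaps with the most frequent value. A minimal sketch of that idea on a toy frame follows; the values are invented for illustration and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Order Value": [100.0, np.nan, 300.0, 500.0],
    "Discounts and Offers": ["10%", None, "10%", "5% on App"],
})

# Numeric gap: fill with the column median (median of 100, 300, 500 is 300).
toy["Order Value"] = toy["Order Value"].fillna(toy["Order Value"].median())

# Categorical gap: fill with the most frequent value; mode() returns a Series,
# so [0] picks the first (most common) entry.
toy["Discounts and Offers"] = toy["Discounts and Offers"].fillna(
    toy["Discounts and Offers"].mode()[0]
)

print(toy)
```

The median is less sensitive to extreme values than the mean, which is one reason it is often preferred for skewed monetary columns.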

In [24]:
import pandas as pd

# Load the dataset
df = pd.read_csv("food_orders_new_delhi.csv")

# Display basic info
print("Initial Dataset Info:")
print(df.info())

# Step 1: Handle missing values
missing_values = df.isnull().sum()
print("\nMissing Values Before Handling:\n", missing_values)

# Drop rows where essential columns (Order ID, Customer ID, Restaurant ID) are missing
df.dropna(subset=['Order ID', 'Customer ID', 'Restaurant ID'], inplace=True)

# Fill missing numerical values with median
num_cols = ['Order Value', 'Delivery Fee', 'Commission Fee', 'Payment Processing Fee', 'Refunds/Chargebacks']
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Fill missing categorical values with mode
cat_cols = ['Discounts and Offers']
df[cat_cols] = df[cat_cols].apply(lambda x: x.fillna(x.mode()[0]))

# Step 2: Remove duplicate records
df.drop_duplicates(inplace=True)

# Step 3: Convert data types
df['Order Date and Time'] = pd.to_datetime(df['Order Date and Time'])
df['Delivery Date and Time'] = pd.to_datetime(df['Delivery Date and Time'])

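# Step 4: Identify and handle outliers using the IQR method.
# Rows are filtered one column at a time, so each column's bounds are
# computed on the rows that survived the previous columns.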
for col in num_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

# Display cleaned dataset info
print("\nCleaned Dataset Info:")
print(df.info())

# Save the cleaned dataset
df.to_csv("food_orders_cleaned.csv", index=False)

print("\nData Cleaning Completed. Cleaned dataset saved as 'food_orders_cleaned.csv'.")
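
The IQR rule used in the loop above can be sketched on a toy frame. This is a variant, not the cell's exact behavior: it computes every column's bounds once on the unfiltered data and combines the masks at the end, whereas the loop above filters column by column. The helper `iqr_mask` and the toy values are hypothetical.

```python
import pandas as pd

def iqr_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

# Invented values: one large order and one non-zero refund.
toy = pd.DataFrame({
    "Order Value": [100, 110, 120, 130, 140, 1000],
    "Refunds/Chargebacks": [0, 0, 30, 0, 0, 0],
})

# Build every mask on the unfiltered frame, then combine them once.
keep = pd.Series(True, index=toy.index)
for col in toy.columns:
    keep &= iqr_mask(toy[col])

filtered = toy[keep]
print(filtered)
```

In this toy data the refund column is zero in most rows, so its IQR is 0 and the single non-zero refund is dropped along with the 1000 order. A column dominated by one value behaves this way under the IQR rule, which is worth checking before applying it to a column such as Refunds/Chargebacks.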

Initial Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Order ID                1000 non-null   int64 
 1   Customer ID             1000 non-null   object
 2   Restaurant ID           1000 non-null   object
 3   Order Date and Time     1000 non-null   object
 4   Delivery Date and Time  1000 non-null   object
 5   Order Value             1000 non-null   int64 
 6   Delivery Fee            1000 non-null   int64 
 7   Payment Method          1000 non-null   object
 8   Discounts and Offers    815 non-null    object
 9   Commission Fee          1000 non-null   int64 
 10  Payment Processing Fee  1000 non-null   int64 
 11  Refunds/Chargebacks     1000 non-null   int64 
dtypes: int64(6), object(6)
memory usage: 93.9+ KB
None

Missing Values Before Handling:
 Order ID                    0
Customer ID             