
<h2 style="color:#1F618D;">Milestone 1: Requirements & Data Preparation</h2>
<h3 style="color:#2874A6;">Step 1: Data Loading and Initial Exploration</h3>

<p style="font-size:15px;">As part of the data ingestion and preparation phase, this step imports the required Python libraries and loads the dataset for analysis.
The dataset is sourced from the <b>Kaggle Dynamic Pricing Dataset</b> and the <b>Statso Case Study</b>. As the sample output shows, it contains ride demand and supply counts (riders and drivers), customer attributes, booking details, and the historical cost of each ride.</p>


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore

# Load dataset
df = pd.read_csv("dynamic_pricing.csv")

# Basic info
print("Shape:", df.shape)
print("\nData Types:\n", df.dtypes)
print("\nFirst 5 rows:\n", df.head())



**Sample Output:**  
```
Shape: (1000, 10)

Data Types:
 Number_of_Riders             int64
Number_of_Drivers            int64
Location_Category           object
Customer_Loyalty_Status     object
Number_of_Past_Rides         int64
Average_Ratings            float64
Time_of_Booking             object
Vehicle_Type                object
Expected_Ride_Duration       int64
Historical_Cost_of_Ride    float64
dtype: object

First 5 rows:
    Number_of_Riders  Number_of_Drivers Location_Category  Customer_Loyalty_Status  Number_of_Past_Rides  Average_Ratings  Time_of_Booking  Vehicle_Type  Expected_Ride_Duration  Historical_Cost_of_Ride
0                90                 45             Urban                  Silver                    13             4.47           Night      Premium                      90               284.257273
1                58                 39          Suburban                  Silver                    72             4.06         Evening      Economy                      43               173.874753
2                42                 31             Rural                  Silver                     0             3.99       Afternoon      Premium                      76               329.795469
3                89                 28             Rural                 Regular                    67             4.31       Afternoon      Premium                     134               470.201232
4                78                 22             Rural                 Regular                    74             3.77       Afternoon      Economy                     149               579.681422
```
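
<p style="font-size:15px;">Initial exploration usually also checks for missing values, duplicate rows, and basic summary statistics. The following is a minimal sketch of those checks, not part of the original notebook. It runs on a small hypothetical <code>sample</code> frame that borrows three column names from the dataset; the rows themselves are made up for illustration.</p>

```python
import pandas as pd

# Hypothetical stand-in rows (illustration only, not the real dataset)
sample = pd.DataFrame({
    "Number_of_Riders": [90, 58, 42, 89],
    "Location_Category": ["Urban", "Suburban", "Rural", None],
    "Historical_Cost_of_Ride": [284.26, 173.87, 329.80, 470.20],
})

# Count missing values in each column
missing_counts = sample.isnull().sum()

# Count rows that are exact duplicates of an earlier row
duplicate_rows = int(sample.duplicated().sum())

# Summary statistics for the numeric columns only
numeric_summary = sample.describe()

print("Missing values per column:\n", missing_counts)
print("\nDuplicate rows:", duplicate_rows)
print("\nNumeric summary:\n", numeric_summary)
```

<p style="font-size:15px;">The same three calls can be applied to <code>df</code> directly once <code>dynamic_pricing.csv</code> has been loaded.</p>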



<h3 style="color:#2874A6;">Step 3: Handle Categorical Variables (Encoding)</h3>
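<p style="font-size:15px;">Before model training, all categorical (text-based) features in the dataset are converted to numeric form so that machine learning algorithms such as XGBoost and LightGBM can consume them. Here each text column is label-encoded: every category is replaced by an integer code. <b>LabelEncoder</b> assigns codes in sorted order of the category labels, so in <b>Location_Category</b> the values Rural, Suburban, and Urban become 0, 1, and 2.</p>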


In [None]:

from sklearn.preprocessing import LabelEncoder

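# Encode categorical variables.
# Each column gets its own encoder so the category-to-code mapping
# can be recovered later (a single shared encoder would be overwritten
# on every iteration and keep only the last column's classes).
label_encoders = {}
for col in df.select_dtypes(include="object").columns:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])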

print("Categorical variables encoded")
print(df.head())



**Sample Output:**  
```
Categorical variables encoded
   Number_of_Riders  Number_of_Drivers  Location_Category  Customer_Loyalty_Status  Number_of_Past_Rides  Average_Ratings  Time_of_Booking  Vehicle_Type  Expected_Ride_Duration  Historical_Cost_of_Ride
0                90                 45                  2                        2                    13             4.47                3             1                      90               284.257273
1                58                 39                  1                        2                    72             4.06                1             0                      43               173.874753
2                42                 31                  0                        2                     0             3.99                0             1                      76               329.795469
3                89                 28                  0                        1                    67             4.31                0             1                     134               470.201232
4                78                 22                  0                        1                    74             3.77                0             0                     149               579.681422
```
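
<p style="font-size:15px;">To make the integer codes in the output easier to interpret, the sketch below shows how a fitted <b>LabelEncoder</b> exposes its mapping and how codes can be turned back into labels. It is an illustration under assumptions: the <code>demo</code> frame and the <code>encoders</code> dictionary are hypothetical names, and the four rows are made up.</p>

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows using two of the dataset's categorical columns
demo = pd.DataFrame({
    "Location_Category": ["Urban", "Suburban", "Rural", "Rural"],
    "Vehicle_Type": ["Premium", "Economy", "Premium", "Premium"],
})

# Fit one encoder per column and keep it for later decoding
encoders = {}
for col in list(demo.columns):
    enc = LabelEncoder()
    demo[col] = enc.fit_transform(demo[col])
    encoders[col] = enc

# classes_ is sorted, and the position of a label is its integer code
location_mapping = {
    str(label): code
    for code, label in enumerate(encoders["Location_Category"].classes_)
}
print(location_mapping)

# inverse_transform recovers the original labels from the codes
decoded = list(encoders["Location_Category"].inverse_transform(demo["Location_Category"]))
print(decoded)
```

<p style="font-size:15px;">Note that these codes carry an arbitrary alphabetical order. Tree-based models generally tolerate this, while linear or distance-based models may be better served by one-hot encoding.</p>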



<h3 style="color:#2874A6;">Step 4: Save Cleaned and Encoded Dataset</h3>
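<p style="font-size:15px;">After all preprocessing steps are complete, including missing value handling and encoding, the cleaned dataset is saved for later use in feature engineering and model training. Passing <code>index=False</code> keeps the DataFrame index out of the file, so reloading it does not produce an extra unnamed column.</p>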


In [None]:

# Save cleaned dataset
df.to_csv("cleaned_csv_data.csv", index=False)
print("Cleaned dataset saved as 'cleaned_csv_data.csv'")



**Output:**  
```
Cleaned dataset saved as 'cleaned_csv_data.csv'
```
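
<p style="font-size:15px;">A quick round-trip check can confirm that the saved file reloads with the same columns and shape. The sketch below is illustrative only: it writes a small hypothetical <code>encoded</code> frame to a temporary directory rather than touching the real <code>cleaned_csv_data.csv</code>.</p>

```python
import os
import tempfile

import pandas as pd

# Hypothetical encoded rows standing in for the cleaned dataset
encoded = pd.DataFrame({
    "Number_of_Riders": [90, 58],
    "Location_Category": [2, 1],
    "Historical_Cost_of_Ride": [284.257273, 173.874753],
})

# Write to a temporary directory so the check leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_csv_data.csv")
    encoded.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare structure of the reloaded frame with the original
same_columns = list(reloaded.columns) == list(encoded.columns)
same_shape = reloaded.shape == encoded.shape

print("Columns match:", same_columns)
print("Shape matches:", same_shape)
```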



<h3 style="color:#2874A6;">Step 5: Basic Data Loading and Cleaning Pipeline</h3>
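<p style="font-size:15px;">To ensure a standardized, reproducible data preparation process, a preprocessing pipeline is constructed using Scikit-learn’s <b>Pipeline</b> and <b>ColumnTransformer</b> classes. It applies mean imputation followed by standard scaling to the numeric columns. A most-frequent imputer for categorical columns is also defined, but it is not registered in the <b>ColumnTransformer</b> here: the categorical columns were label-encoded in the previous steps, so all 10 columns are treated as numeric.</p>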


In [None]:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

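# After label encoding every column is numeric, so this selects all 10 columns.
# include="number" also matches integer widths other than int64.
numeric_features = df.select_dtypes(include="number").columns
categorical_features = []  # already encoded, so nothing is left for a categorical branch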

# Define transformers
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])

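# Defined for completeness only: it is not added to the ColumnTransformer
# below because categorical_features is empty.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent"))
])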

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features)
    ]
)

X_processed = preprocessor.fit_transform(df)

print("Pipeline applied successfully!")
print("Processed data shape:", X_processed.shape)



**Output:**  
```
Pipeline applied successfully!
Processed data shape: (1000, 10)
```
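
<p style="font-size:15px;">For comparison, the sketch below shows how both branches of a <b>ColumnTransformer</b> could be used when the pipeline is applied to raw, un-encoded data. This is an alternative arrangement under assumptions, not the notebook's method: the <code>raw</code> frame is made up, the names <code>numeric_branch</code>, <code>categorical_branch</code>, and <code>demo_preprocessor</code> are hypothetical, and one-hot encoding is swapped in for label encoding.</p>

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw rows with one missing numeric and one missing categorical value
raw = pd.DataFrame({
    "Number_of_Riders": [90.0, 58.0, np.nan, 89.0],
    "Expected_Ride_Duration": [90, 43, 76, 134],
    "Location_Category": pd.Series(["Urban", "Urban", "Rural", np.nan], dtype=object),
    "Vehicle_Type": pd.Series(["Premium", "Economy", "Premium", "Premium"], dtype=object),
})

numeric_cols = ["Number_of_Riders", "Expected_Ride_Duration"]
categorical_cols = ["Location_Category", "Vehicle_Type"]

# Numeric branch: fill gaps with the column mean, then standardize
numeric_branch = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Categorical branch: fill gaps with the most frequent label, then one-hot encode
categorical_branch = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense NumPy array as output
demo_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_branch, numeric_cols),
        ("cat", categorical_branch, categorical_cols),
    ],
    sparse_threshold=0,
)

X_demo = demo_preprocessor.fit_transform(raw)
print("Processed data shape:", X_demo.shape)
```

<p style="font-size:15px;">With two numeric columns and two categories in each categorical column, the result has six columns. Keeping the encoding inside the pipeline means the same fitted transformations can be reapplied to new data with <code>demo_preprocessor.transform</code>.</p>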
