## Milestone 1: Requirements & Data Preparation

### Step 1: Data Loading and Initial Exploration

As part of the **data ingestion and preparation phase**, we begin by importing the required Python libraries and loading the dataset for analysis.  
The dataset is sourced from the **Kaggle Dynamic Pricing Dataset** and the **Statso Case Study**. As the output below shows, it contains rider demand and driver supply counts, customer attributes, booking details, and the historical cost of each ride.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore

# Load dataset
df = pd.read_csv("dynamic_pricing.csv")

# Basic info
print("Shape:", df.shape)
print("\nData Types:\n", df.dtypes)
print("\nFirst 5 rows:\n", df.head())


Shape: (1000, 10)

Data Types:
 Number_of_Riders             int64
Number_of_Drivers            int64
Location_Category           object
Customer_Loyalty_Status     object
Number_of_Past_Rides         int64
Average_Ratings            float64
Time_of_Booking             object
Vehicle_Type                object
Expected_Ride_Duration       int64
Historical_Cost_of_Ride    float64
dtype: object

First 5 rows:
    Number_of_Riders  Number_of_Drivers Location_Category  \
0                90                 45             Urban   
1                58                 39          Suburban   
2                42                 31             Rural   
3                89                 28             Rural   
4                78                 22             Rural   

  Customer_Loyalty_Status  Number_of_Past_Rides  Average_Ratings  \
0                  Silver                    13             4.47   
1                  Silver                    72             4.06   
2                  Silv
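
Beyond shape and dtypes, initial exploration usually also counts missing values and duplicate rows. The following is a minimal sketch on a toy frame, not on the real file: the helper name `quality_report` is hypothetical, and the missing value and the duplicated row are invented for illustration.

```python
import pandas as pd

# Toy frame reusing three column names from the dataset.
# The None entry and the repeated last row are invented for illustration.
sample = pd.DataFrame({
    "Number_of_Riders": [90, 58, 42, 42],
    "Number_of_Drivers": [45, 39, 31, 31],
    "Location_Category": ["Urban", None, "Rural", "Rural"],
})


def quality_report(frame):
    """Return simple data-quality counts for a DataFrame (hypothetical helper)."""
    return {
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


report = quality_report(sample)
print(report)
```

Running the same helper on `df` would show whether any imputation or de-duplication is needed before encoding.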

### Step 3: Handle Categorical Variables (Encoding)

Before model training, all categorical (text-based) features in the dataset must be converted into numerical form.
This ensures that machine learning algorithms such as XGBoost, LightGBM, and other models used later can process the data.
Encoding categorical variables is a preprocessing step that directly supports the later Feature Engineering and Model Development stages. Label encoding, used here, replaces each category with an integer code; the codes identify categories and carry no inherent ranking.

In [2]:
# Handle categorical variables (encoding)
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load dataset
df = pd.read_csv("dynamic_pricing.csv")

# Encode each text column with its own LabelEncoder and keep the fitted
# encoders so the integer codes can be mapped back to labels later
label_encoders = {}
for col in df.select_dtypes(include="object").columns:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])

print(" Categorical variables encoded")
print(df.head())


 Categorical variables encoded
   Number_of_Riders  Number_of_Drivers  Location_Category  \
0                90                 45                  2   
1                58                 39                  1   
2                42                 31                  0   
3                89                 28                  0   
4                78                 22                  0   

   Customer_Loyalty_Status  Number_of_Past_Rides  Average_Ratings  \
0                        2                    13             4.47   
1                        2                    72             4.06   
2                        2                     0             3.99   
3                        1                    67             4.31   
4                        1                    74             3.77   

   Time_of_Booking  Vehicle_Type  Expected_Ride_Duration  \
0                3             1                      90   
1                1             0                      43   
2      
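
The integer codes in the output above follow the sorted order of each column's labels, which is how `LabelEncoder` assigns them. The sketch below illustrates this using only the five `Location_Category` values visible in the Step 1 output; the variable names are illustrative and not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# First five Location_Category values as shown in the Step 1 output.
locations = ["Urban", "Suburban", "Rural", "Rural", "Rural"]

encoder = LabelEncoder()
codes = encoder.fit_transform(locations)

# classes_ holds the labels in sorted order; a label's index is its code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}

# inverse_transform recovers the original labels from the codes.
decoded = list(encoder.inverse_transform(codes))

print(mapping)
print(list(codes))
```

Because the codes come from alphabetical order, they should not be read as a meaningful ranking of the categories.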

### Step 4: Save Cleaned and Encoded Dataset

After completing the data preprocessing steps (missing value handling, encoding of categorical features, and data validation), the cleaned dataset is saved for later use in feature engineering and model training.
Saving the preprocessed data gives later stages of the AI: PriceOptima pipeline a fixed, reusable input, which supports reproducibility and traceability.

In [3]:
# Save cleaned dataset
df.to_csv("cleaned_csv_data.csv", index=False)

print(" Cleaned dataset saved as 'cleaned_csv_data.csv'")


 Cleaned dataset saved as 'cleaned_csv_data.csv'
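
A quick way to confirm that the saved file is usable downstream is to read it back and compare it with the frame in memory. The sketch below does this on a small stand-in frame written to a temporary directory, so it does not touch the real `cleaned_csv_data.csv`; the values and variable names are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the encoded frame; values are illustrative only.
encoded = pd.DataFrame({
    "Number_of_Riders": [90, 58, 42],
    "Location_Category": [2, 1, 0],
    "Average_Ratings": [4.47, 4.06, 3.99],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cleaned_csv_data.csv"
    encoded.to_csv(path, index=False)   # same call pattern as the cell above
    restored = pd.read_csv(path)

# Raises AssertionError if shape, column names, dtypes, or values differ.
pd.testing.assert_frame_equal(encoded, restored)
round_trip_ok = True
print("Round trip OK:", restored.shape)
```

Writing with `index=False` matters here: without it, the row index would come back as an extra unnamed column.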


### Step 5: Basic Data Loading and Cleaning Pipeline

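To ensure a standardized, reproducible data preparation process, a preprocessing pipeline is constructed using scikit-learn's Pipeline and ColumnTransformer classes.
This step automates mean imputation and standard scaling of the numeric columns, and most-frequent imputation of any text (object) columns. Note that if this cell runs on the df produced in Step 3, the categorical columns are already integer-encoded, so categorical_features is empty and every column passes through the numeric branch.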

In [4]:
# A basic preprocessing pipeline: impute missing values and scale numeric features

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer


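# Split columns by dtype: numeric columns are imputed and scaled,
# text (object) columns are only imputed
numeric_features = df.select_dtypes(include=['int64', 'float64']).columns
categorical_features = df.select_dtypes(include=['object']).columns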

# Define transformers
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),   # Fill missing numeric values
    ("scaler", StandardScaler())                   # Scale numeric features
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent"))  # Fill missing categories with the mode
])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)


# Fit the transformers on df and apply them in one call
X_processed = preprocessor.fit_transform(df)

print("Pipeline applied successfully!")
print("Processed data shape:", X_processed.shape)


Pipeline applied successfully!
Processed data shape: (1000, 10)
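
To see what each branch of the pipeline does, it can help to run the same structure on a tiny frame that actually contains missing values and a text column. The sketch below is an illustration under assumptions, not the notebook's own run: the toy values and gaps are invented, and the column lists are written out by hand instead of being derived with `select_dtypes`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame; the missing values are invented to exercise the imputers.
toy = pd.DataFrame({
    "Number_of_Riders": [90.0, 58.0, np.nan, 89.0],
    "Average_Ratings": [4.47, np.nan, 3.99, 4.31],
    "Location_Category": pd.Series(
        ["Urban", "Urban", np.nan, "Rural"], dtype=object
    ),
})

numeric_cols = ["Number_of_Riders", "Average_Ratings"]
categorical_cols = ["Location_Category"]

toy_preprocessor = ColumnTransformer(transformers=[
    # Numeric branch: fill gaps with the column mean, then standardize.
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), numeric_cols),
    # Categorical branch: fill gaps with the most frequent label.
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_cols),
])

X_toy = toy_preprocessor.fit_transform(toy)
print(X_toy.shape)
print(X_toy[2])  # the row that had two missing values
```

In the third row, the missing rider count is replaced by the column mean, which becomes 0 after standardization, and the missing location becomes "Urban", the most frequent label. The text column is still text after this step, so it would need encoding before a model could use it.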
