
<h1 style="text-align:center; color:darkgreen; font-family:Helvetica, Arial, sans-serif; font-weight:bold;">
Milestone 2 – Data Ingestion Pipeline
</h1>


### Milestone 2: Data Ingestion Pipeline
#### Step 1: Handle Missing Values

Before any transformation or model training, missing values need to be handled: many estimators reject NaN inputs, and dropping incomplete rows can skew the training sample.
This step fills missing numerical values with the column mean and missing categorical values with the column mode, so that every column is complete.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load dataset
df = pd.read_csv("dynamic_pricing.csv")

# Handle missing values
# Fill missing numerical values with the column mean
for col in df.select_dtypes(include=np.number).columns:
    df[col] = df[col].fillna(df[col].mean())

# Fill missing categorical with mode
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna(df[col].mode()[0])

print("\nAfter filling NA values:\n", df.isnull().sum())



After filling NA values:
 Number_of_Riders           0
Number_of_Drivers          0
Location_Category          0
Customer_Loyalty_Status    0
Number_of_Past_Rides       0
Average_Ratings            0
Time_of_Booking            0
Vehicle_Type               0
Expected_Ride_Duration     0
Historical_Cost_of_Ride    0
dtype: int64
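
To make the effect of mean and mode imputation concrete, here is a minimal sketch on a made-up three-row frame. The helper name `fill_missing`, the column names `riders` and `vehicle`, and the values are all hypothetical and are not taken from `dynamic_pricing.csv`. The sketch also skips columns that are entirely empty, which is an added assumption: such a column has no mode, so `mode()[0]` would raise an error.

```python
import numpy as np
import pandas as pd


def fill_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: mean for numeric columns, mode for the rest."""
    out = frame.copy()
    # Numeric columns: replace NaN with the column mean.
    for col in out.select_dtypes(include=np.number).columns:
        out[col] = out[col].fillna(out[col].mean())
    # Non-numeric columns: replace missing entries with the most frequent value.
    for col in out.select_dtypes(exclude=np.number).columns:
        modes = out[col].mode()
        if not modes.empty:  # an all-missing column has no mode; leave it alone
            out[col] = out[col].fillna(modes.iloc[0])
    return out


# Toy data (made up for illustration only).
toy = pd.DataFrame({
    "riders": [10.0, np.nan, 30.0],
    "vehicle": ["Economy", None, "Economy"],
})
filled = fill_missing(toy)
print(filled)
```

The missing `riders` entry becomes the mean of the two observed values, and the missing `vehicle` entry becomes the most frequent label. Working on a copy leaves the input frame untouched, which makes it easy to compare before and after.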


#### Step 2: Export Cleaned Dataset to CSV

After handling missing values and performing necessary preprocessing, the cleaned dataset is exported as a CSV file for downstream tasks such as feature engineering and model development.

In [5]:
# Save the cleaned dataset to CSV (index=False omits the row index column)
df.to_csv("dynamic_pricing_cleaned.csv", index=False)
print(" Cleaned dataset saved as dynamic_pricing_cleaned.csv")


 Cleaned dataset saved as dynamic_pricing_cleaned.csv
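
One way to gain confidence in the export is to read the file back and compare it with the frame that was written. The sketch below is an illustration under stated assumptions, not part of the original notebook: `cleaned` is a tiny stand-in frame with made-up values, and the file is written to a temporary directory so nothing in the working directory is overwritten.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the cleaned frame (made-up values); the notebook would use `df`.
cleaned = pd.DataFrame({
    "Number_of_Riders": [1, 2],
    "Vehicle_Type": ["A", "B"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "dynamic_pricing_cleaned.csv"
    cleaned.to_csv(out_path, index=False)  # same call as in the cell above
    reloaded = pd.read_csv(out_path)

# Round-trip checks: same shape, same column order, no missing values.
same_shape = reloaded.shape == cleaned.shape
same_columns = list(reloaded.columns) == list(cleaned.columns)
no_missing = int(reloaded.isnull().sum().sum()) == 0
print(same_shape, same_columns, no_missing)
```

Without `index=False`, the row index would be written as an extra unnamed column, and the reloaded frame would have one more column than the original.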



#### Steps 1–4: Data Cleaning and Preprocessing Pipeline

In **Milestone 2: Data Ingestion Pipeline**, the goal is to automate the **data ingestion and preprocessing** workflow so that model training and dashboard analytics receive clean, reliable input.  
This step wraps the work into a single function that **loads, cleans, encodes, and stores** the processed dataset, ready for feature engineering and model development.

In [8]:
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd

def preprocess_data(file_path, save_path="cleaned_csv_data.csv"):
    """Compound function: loads, cleans, encodes, and saves dataset."""
    
    # Step 1: Load dataset
    df = pd.read_csv(file_path)
    print(f"✅ Data loaded successfully: {df.shape[0]} rows, {df.shape[1]} columns")

    # Step 2: Handle missing values
    for col in df.select_dtypes(include=np.number).columns:
        df[col] = df[col].fillna(df[col].mean())
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].fillna(df[col].mode()[0])
    print("✅ Missing values handled.")

    # Step 3: Encode categorical variables
    # A fresh encoder per column keeps each column's label mapping separate.
    for col in df.select_dtypes(include="object").columns:
        df[col] = LabelEncoder().fit_transform(df[col])
    print("✅ Categorical columns encoded.")

    # Step 4: Save the cleaned dataset
    df.to_csv(save_path, index=False)
    print(f"✅ Cleaned data saved as {save_path}")

    return df

# 🚀 Run the compound pipeline
file_path = "dynamic_pricing.csv"
df_cleaned = preprocess_data(file_path)


✅ Data loaded successfully: 1000 rows, 10 columns
✅ Missing values handled.
✅ Categorical columns encoded.
✅ Cleaned data saved as cleaned_csv_data.csv
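
After encoding, the CSV contains only integer codes, so it can help to keep track of which code stands for which label. The following sketch shows one possible approach, assuming one `LabelEncoder` per column stored in a dictionary. The `encoders` dictionary and the category values are made up for illustration; only the column names come from the dataset. Note that scikit-learn documents `LabelEncoder` as intended for target labels, with `OrdinalEncoder` as the feature-oriented alternative, and that integer codes imply an ordering that some models may interpret literally.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame: column names match the dataset, values are made up.
toy = pd.DataFrame({
    "Vehicle_Type": ["Premium", "Economy", "Premium"],
    "Location_Category": ["Urban", "Rural", "Suburban"],
})

encoders = {}  # hypothetical registry: column name -> fitted LabelEncoder
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ is sorted, so a label's position in it is its integer code.
mapping = {col: list(enc.classes_) for col, enc in encoders.items()}
print(mapping)

# The stored encoder can turn codes back into the original labels.
restored = encoders["Vehicle_Type"].inverse_transform(encoded["Vehicle_Type"])
print(list(restored))
```

Keeping the fitted encoders around means the same mapping can later be applied to new data with `transform`, rather than refitting and risking different codes for the same label.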
