**Milestone 2: Data Ingestion Pipeline**

A data ingestion pipeline is an automated process that collects, cleans, and prepares raw data from different sources so it's ready for analysis, machine learning, or reporting.

| Step | Task                 | Example                                           |
| ---- | -------------------- | ------------------------------------------------- |
| 1    | **Ingest (collect)** | Load data from CSV or API                         |
| 2    | **Clean**            | Handle missing values, remove duplicates          |
| 3    | **Transform**        | Encode categorical variables, standardize formats |
| 4    | **Store**            | Save cleaned file or push to a database           |
| 5    | **Automate**         | Schedule it daily or trigger on new uploads       |
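
Steps 2 and 3 of the table can be illustrated on a tiny in-memory DataFrame before running them on a real file. This is only a minimal sketch: the column names and values below are made up for illustration, and the choice of median for numeric gaps and most frequent value for categorical gaps is one common convention, not the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: gaps in both columns, plus rows that become duplicates once filled.
raw = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 40.0],
    "region": ["north", "south", "north", None],
})

clean = raw.copy()

# Step 2 (clean): median for the numeric gap, most frequent value for the categorical gap.
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["region"] = clean["region"].fillna(clean["region"].mode()[0])
clean = clean.drop_duplicates().reset_index(drop=True)

# Step 3 (transform): one-hot encode the categorical column, dropping the first level.
encoded = pd.get_dummies(clean, columns=["region"], drop_first=True)
print(encoded)
```

After filling, the last two rows are identical, so `drop_duplicates` leaves three rows, and `drop_first=True` keeps a single `region_south` indicator column.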


In [2]:
import pandas as pd
import numpy as np
import os

def data_ingestion_pipeline(file_path):
    """
    Generic Data Ingestion Pipeline using pandas
    Automatically loads, cleans, encodes, and saves a cleaned version.
    Works for most structured CSV datasets.
    """
    print(f"🚀 Loading data from: {file_path}")
    df = pd.read_csv(file_path)

    print("\n✅ Step 1: Basic Info")
    df.info()  # info() prints directly and returns None, so it is not wrapped in print()

    # Detect columns by dtype
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    
    print("\nCategorical Columns:", categorical_cols)
    print("Numerical Columns:", numerical_cols)

    # Step 2: Handle missing values
    print("\n🧩 Handling Missing Values...")
    for col in df.columns:
        if df[col].isnull().sum() > 0:
            if col in numerical_cols:
                # Assign the result back; chained fillna(..., inplace=True) acts on a copy
                df[col] = df[col].fillna(df[col].median())  # use median for numerical
            else:
                df[col] = df[col].fillna(df[col].mode()[0])  # use mode for categorical

    # Step 3: Encoding categorical columns
    print("\n🔡 Encoding categorical features...")
    df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

    # Step 4: Remove duplicates
    before = df.shape[0]
    df.drop_duplicates(inplace=True)
    after = df.shape[0]
    print(f"\n🧹 Removed {before - after} duplicate rows")

    # Step 5: Save cleaned dataset
    cleaned_path = "cleaned_" + os.path.basename(file_path)
    df.to_csv(cleaned_path, index=False)
    print(f"\n💾 Cleaned data saved to: {cleaned_path}")

    return df

# Example usage:
df_clean = data_ingestion_pipeline("medical_insurance.csv")


🚀 Loading data from: medical_insurance.csv

✅ Step 1: Basic Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 54 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   person_id                    100000 non-null  int64  
 1   age                          100000 non-null  int64  
 2   sex                          100000 non-null  object 
 3   region                       100000 non-null  object 
 4   urban_rural                  100000 non-null  object 
 5   income                       100000 non-null  float64
 6   education                    100000 non-null  object 
 7   marital_status               100000 non-null  object 
 8   employment_status            100000 non-null  object 
 9   household_size               100000 non-null  int64  
 10  dependents                   100000 non-null  int64  
 11  bmi                          100000 non-null  fl

🧹 Removed 0 duplicate rows

💾 Cleaned data saved to: cleaned_medical_insurance.csv
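
Step 4 of the table also mentions pushing the result to a database instead of saving a file. One possible sketch uses the standard-library `sqlite3` module together with pandas' `DataFrame.to_sql`. The helper name `store_in_sqlite`, the table name `insurance_clean`, and the demo rows are assumptions for illustration and are not part of the pipeline above.

```python
import sqlite3

import pandas as pd


def store_in_sqlite(df, db_path, table_name):
    """Write df to a SQLite table (replacing it if present) and return the stored row count."""
    conn = sqlite3.connect(db_path)
    try:
        df.to_sql(table_name, conn, if_exists="replace", index=False)
        conn.commit()
        (row_count,) = conn.execute(f'SELECT COUNT(*) FROM "{table_name}"').fetchone()
    finally:
        conn.close()
    return row_count


# Hypothetical stand-in for the DataFrame returned by data_ingestion_pipeline().
demo = pd.DataFrame({"person_id": [1, 2, 3], "age": [34, 51, 28]})
stored = store_in_sqlite(demo, ":memory:", "insurance_clean")
print(f"Stored {stored} rows")
```

Passing a file path such as `"insurance.db"` instead of `":memory:"` would persist the table on disk. `if_exists="replace"` overwrites the table on every run, so `"append"` may be the better choice for incremental loads.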

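
Step 5 ("trigger on new uploads") could be approximated by scanning a folder and handing each unseen CSV to the pipeline. The sketch below uses only the standard library. `process_new_files`, the `seen` set, and the demo file name are hypothetical, and a production setup would more likely rely on a scheduler or a file-watching service.

```python
import tempfile
from pathlib import Path


def process_new_files(inbox_dir, seen, handler):
    """Call handler(path) once for each CSV in inbox_dir whose name is not yet in `seen`."""
    processed = []
    for path in sorted(Path(inbox_dir).glob("*.csv")):
        # Skip files already handled, and outputs the pipeline itself wrote.
        if path.name in seen or path.name.startswith("cleaned_"):
            continue
        handler(str(path))
        seen.add(path.name)
        processed.append(path.name)
    return processed


# Demo with a temporary folder; in practice handler would be data_ingestion_pipeline.
with tempfile.TemporaryDirectory() as inbox:
    Path(inbox, "batch_1.csv").write_text("a,b\n1,2\n")
    seen = set()
    first = process_new_files(inbox, seen, handler=print)
    second = process_new_files(inbox, seen, handler=print)

print(first, second)
```

The second scan returns an empty list because `batch_1.csv` is already recorded in `seen`. Running the scan from a daily scheduled job would also cover the "schedule it daily" variant.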