
Import the required libraries.

In [2]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

This script follows an ETL (Extract, Transform, Load) process: it extracts data from a CSV file, transforms it by scaling numerical columns with `StandardScaler` and one-hot encoding categorical columns with `OneHotEncoder`, and loads the result into a new CSV file. A `ColumnTransformer` routes each group of columns to its transformer, and a `Pipeline` wraps the preprocessing so that it can be fitted once and reused.

In [4]:
def extract_data(file_path):
    """Extract raw data from a CSV file."""
    return pd.read_csv(file_path)

def transform_data(df):
    """Transform data using scaling and encoding."""
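    # Select every numeric dtype (not only int64/float64) as numerical and
    # object-dtype columns as categorical. Columns of any other dtype are
    # dropped by ColumnTransformer's default remainder='drop'.
    numerical_features = df.select_dtypes(include='number').columns
    categorical_features = df.select_dtypes(include=['object']).columns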

    numeric_transformer = StandardScaler()
    categorical_transformer = OneHotEncoder(handle_unknown='ignore')

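    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numerical_features),
            ('cat', categorical_transformer, categorical_features)
        ],
        # OneHotEncoder produces sparse output by default; a threshold of 0
        # forces a dense array that pd.DataFrame can wrap directly in load_data.
        sparse_threshold=0
    )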

    pipeline = Pipeline(steps=[('preprocessor', preprocessor)])
    transformed_data = pipeline.fit_transform(df)

    return transformed_data, pipeline

def load_data(transformed_data, output_path):
    """Load the transformed data into a CSV file."""
    pd.DataFrame(transformed_data).to_csv(output_path, index=False)

The `main()` function ties the three steps together: it reads `Salary_Data.csv` from Colab's `/content` directory, scales the numerical features and one-hot encodes the categorical ones, writes the transformed data to `processed_data.csv`, and prints a status message after each step.

In [5]:
def main():
    input_file = '/content/Salary_Data.csv'
    output_file = 'processed_data.csv'

    # Extract
    df = extract_data(input_file)
    print("Data extracted successfully")

    # Transform
    transformed_data, _ = transform_data(df)
    print("Data transformed successfully")

    # Load
    load_data(transformed_data, output_file)
    print("Data loaded successfully")

if __name__ == "__main__":
    main()

Data extracted successfully
Data transformed successfully
Data loaded successfully
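Note that `main()` discards the fitted pipeline returned by `transform_data` (the `_` in the assignment). Keeping it would allow new rows to be transformed with the statistics and categories learned from the original data. The following is a minimal sketch of that idea, using made-up columns `age` and `gender` rather than the real dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data; "X" is a category that never appears during fitting.
train = pd.DataFrame({"age": [20.0, 30.0, 40.0], "gender": ["F", "M", "F"]})
new_rows = pd.DataFrame({"age": [30.0], "gender": ["X"]})

demo_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), ["age"]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
        ],
        sparse_threshold=0,  # always return a dense array
    )),
])

# Fit once on the original data, then reuse the fitted pipeline on new rows.
demo_pipeline.fit(train)
new_transformed = demo_pipeline.transform(new_rows)

# age 30 equals the training mean, so it scales to 0; the unseen category
# "X" is encoded as all zeros because handle_unknown="ignore".
print(new_transformed)
```

Calling `transform` instead of `fit_transform` on the new rows is what keeps the scaling and the set of encoded columns consistent with the original data.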
