# Task-1: Data Pipeline Development (ETL)

**Internship:** CODTECH  
**Task:** Create a pipeline for data preprocessing, transformation, and loading  
**Tools Used:** Pandas, NumPy, Scikit-learn  
**Dataset:** Titanic Dataset (Kaggle)

---

## Objective
The goal of this task is to build an automated data pipeline that:
1. Extracts data from a CSV file  
2. Preprocesses and transforms the data  
3. Loads the cleaned data into a new CSV file  

This notebook demonstrates a complete ETL workflow using Python.
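
As a rough orientation, the three phases can be sketched as three small functions. This is only an illustration, not the notebook's actual code: the function names `extract`, `transform`, and `load`, the toy transform, and the in-memory sample data are all assumptions made for the example.

```python
import io
import pandas as pd

def extract(source):
    """Read raw CSV data into a DataFrame (source is a path or file-like object)."""
    return pd.read_csv(source)

def transform(df):
    """Toy transform: drop rows with missing values and upper-case the column names."""
    cleaned = df.dropna().copy()
    cleaned.columns = [name.upper() for name in cleaned.columns]
    return cleaned

def load(df, target):
    """Write the processed DataFrame as CSV to a path or file-like target."""
    df.to_csv(target, index=False)

raw_csv = io.StringIO("name,age\nAnn,30\nBob,\n")  # hypothetical sample data
buffer = io.StringIO()
load(transform(extract(raw_csv)), buffer)
etl_output = buffer.getvalue()
print(etl_output)
```

Keeping each phase in its own function makes it possible to test or replace one phase without touching the others.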


## Step 1: Import Required Libraries

In this step, we import all the necessary Python libraries required for:
- Data handling
- Numerical operations
- Data preprocessing
- Pipeline creation


In [15]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

## Step 2: Load the Dataset (Extract)

We load the Titanic dataset (`train.csv`) using Pandas.
This step represents the **Extract** phase of the ETL pipeline.


In [16]:
df = pd.read_csv("train.csv")
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
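
A quick check after extraction can confirm that the file has the columns the later steps rely on. The sketch below is an optional addition, not part of the original notebook; the helper name `missing_columns` and the shortened in-memory sample are assumptions for illustration.

```python
import io
import pandas as pd

# Column names as they appear in the Titanic train.csv header.
EXPECTED_COLUMNS = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
                    "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]

# Hypothetical truncated file: only the first six columns are present.
sample = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22.0\n'
)
sample_df = pd.read_csv(sample)
absent = missing_columns(sample_df)
print(absent)
```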


## Step 3: Understand the Dataset

We inspect the dataset to understand:
- Column names
- Data types
- Missing values


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
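
The `df.info()` output reports non-null counts, so the number of missing values has to be worked out by subtraction. A more direct view is to count them per column. The sketch below uses a small made-up frame rather than the real dataset; the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", "S"],
})

missing_counts = toy.isna().sum()            # missing values per column
missing_share = toy.isna().mean().round(2)   # fraction of rows missing per column
print(missing_counts[missing_counts > 0])
```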


## Step 4: Handle Missing Values

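Missing values can break or distort later processing steps. In this dataset, `Age`, `Cabin`, and `Embarked` have gaps (714, 204, and 889 non-null values out of 891).
We use **forward fill (ffill)**, which copies the last valid value from the rows above into each gap.
This is simple to apply, but it has limits here: neighbouring rows are not related in any obvious way, so each filled value comes from an unrelated passenger, and a gap in the very first row (such as `Cabin` for passenger 1) has nothing above it and stays missing.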


In [18]:
df.ffill(inplace=True)
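
To make the behaviour of forward fill concrete, the sketch below applies it to a tiny made-up frame and compares it with a common alternative: filling numbers with the median and text with the most frequent value. The alternative is shown only for comparison and is not what this notebook does; the frame and variable names are assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
})

# Forward fill copies the value from the row above.
# The first Cabin has no row above it, so it stays missing.
filled = toy.ffill()

# Alternative for comparison: median for numbers, most frequent value for text.
imputed = toy.copy()
imputed["Age"] = toy["Age"].fillna(toy["Age"].median())
imputed["Cabin"] = toy["Cabin"].fillna(toy["Cabin"].mode()[0])

print(filled)
print(imputed)
```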

## Step 5: Separate Features and Target Variable

- **Features (X):** Input columns
- **Target (y):** Output column (`Survived`)

In [19]:
X = df.drop("Survived", axis=1)
y = df["Survived"]

## Step 6: Identify Numerical and Categorical Columns

Different preprocessing techniques are applied to:
- Numerical columns
- Categorical (text) columns

In [20]:
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns
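
Selecting by dtype alone puts `PassengerId` in the numeric group and `Name`, `Ticket`, and `Cabin` in the categorical group. Columns whose values are nearly unique produce one encoded column per distinct value when one-hot encoded. One possible way to spot them is to count distinct values per column, as in the sketch below. The helper name `low_cardinality_text_columns` and the `max_unique` threshold are hypothetical, and the frame is a made-up sample.

```python
import pandas as pd

toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["Braund", "Cumings", "Heikkinen", "Futrelle"],
    "Sex": ["male", "female", "female", "female"],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

def low_cardinality_text_columns(df, max_unique):
    """Return non-numeric columns with at most max_unique distinct values."""
    text_cols = df.select_dtypes(exclude="number").columns
    return [col for col in text_cols if df[col].nunique() <= max_unique]

kept = low_cardinality_text_columns(toy, max_unique=2)
print(kept)  # "Name" is excluded because every value is distinct
```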

## Step 7: Create Data Transformation Pipelines

- Numerical data is scaled using **StandardScaler**
- Categorical data is encoded using **OneHotEncoder**

In [21]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
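
The effect of the two transformers is easiest to see on a few values. The sketch below is a standalone illustration on made-up arrays, not the notebook's data: `StandardScaler` rescales a column to zero mean and unit variance, and `OneHotEncoder(handle_unknown='ignore')` encodes a category it never saw during fitting as a row of zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Scaling: subtract the column mean, divide by the column standard deviation.
scaler = StandardScaler()
scaled = scaler.fit_transform(np.array([[1.0], [2.0], [3.0]]))

# Encoding: categories are learned during fit and sorted ("C", then "S").
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["S"], ["C"], ["S"]], dtype=object))
seen = encoder.transform(np.array([["C"]], dtype=object)).toarray()
unseen = encoder.transform(np.array([["Q"]], dtype=object)).toarray()

print(scaled.ravel())
print(seen, unseen)
```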

## Step 8: Combine Pipelines Using ColumnTransformer

ColumnTransformer allows applying different preprocessing steps
to different columns in a single pipeline.

In [22]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
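
One detail worth knowing is that `ColumnTransformer` drops any column not listed in `transformers`, because its `remainder` parameter defaults to `'drop'`. The sketch below shows this on a small made-up frame; the name `toy_preprocessor` and the choice of columns are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.93],
    "Sex": ["male", "female", "female"],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282"],
})

# "Ticket" is not listed, so the default remainder="drop" leaves it out.
toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ]
)
result = toy_preprocessor.fit_transform(toy)
print(result.shape)  # one scaled column plus two one-hot columns
```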

## Step 9: Apply Data Transformations (Transform)

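The preprocessor is applied to the feature set: `fit_transform` first learns the scaling statistics and category lists from `X`, then uses them to transform `X`.
This step represents the **Transform** phase of ETL.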

In [23]:
X_processed = preprocessor.fit_transform(X)
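
Once fitted, the same preprocessor can be reused on data that arrives later by calling `transform` instead of `fit_transform`, so new rows are scaled and encoded with the statistics learned earlier. The sketch below illustrates this on made-up frames; `train`, `new`, and `toy_preprocessor` are hypothetical names, and `sparse_threshold=0` is set only to force a dense array for easy inspection.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({"Fare": [10.0, 20.0, 30.0], "Embarked": ["S", "C", "S"]})
new = pd.DataFrame({"Fare": [40.0], "Embarked": ["C"]})  # data seen after fitting

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Embarked"]),
    ],
    sparse_threshold=0,  # always return a dense array in this demo
)
toy_preprocessor.fit(train)                      # learn mean, std, and categories
new_processed = toy_preprocessor.transform(new)  # apply them without refitting
print(new_processed)
```

The scaled fare is computed from the mean and standard deviation of `train`, not of `new`.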

## Step 10: Convert Processed Data to DataFrame

After transformation, the data is converted into a Pandas DataFrame
so that it can be saved easily.

In [24]:
processed_df = pd.DataFrame(
    X_processed.toarray() if hasattr(X_processed, "toarray") else X_processed
)

## Step 11: Save the Processed Data (Load)

The cleaned and transformed features are saved into a new CSV file; the target `y` is kept separately and is not written to this file.
This represents the **Load** phase of the ETL pipeline.

In [25]:
processed_df.to_csv("processed_data.csv", index=False)
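
A simple way to confirm the load step worked is to read the file back and compare it with what was written. The sketch below does this with a small made-up frame and a temporary directory, so it does not touch the real `processed_data.csv`; the frame contents and variable names are assumptions.

```python
import os
import tempfile
import pandas as pd

processed = pd.DataFrame({
    "num__Fare": [-1.0, 0.0, 1.0],
    "cat__Sex_male": [1.0, 0.0, 0.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "processed_data.csv")
    processed.to_csv(out_path, index=False)  # index=False avoids an extra index column
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == processed.shape
same_columns = list(reloaded.columns) == list(processed.columns)
print(same_shape, same_columns)
```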

## ETL Pipeline Completed Successfully 🎉

The data pipeline has successfully:
- Extracted raw data
- Transformed and preprocessed it
- Loaded the processed data into a new CSV file

In [26]:
print("ETL Data Pipeline executed successfully!")

ETL Data Pipeline executed successfully!
