# Titanic Dataset — End-to-End ML Pipeline

## Evolution from Notebook 02
In the previous notebook, we built preprocessing and the model as **separate steps**: calling `preprocessor.fit_transform()`, then `model.fit()`, then `model.predict()` individually. That works, but the preprocessor and the model have to be kept in sync by hand. Forgetting to transform new data before predicting, or re-fitting the preprocessor on it, is an easy mistake to make.

Here, we combine everything into a **single scikit-learn Pipeline** — one object that goes from raw features straight to predictions.

## Why Pipelines Matter
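- **Reduces the risk of data leakage**: `fit` learns imputation values and encodings only from the data you pass it, and `predict` only applies them, so test rows cannot influence preprocessing as long as you fit on the training set alone
- **Cleaner code**: one `fit()` call, one `predict()` call, no intermediate transformed arrays to track
- **Easier to deploy**: preprocessing and model are saved and loaded as a single object
- **Common practice**: this is the standard scikit-learn pattern for combining preprocessing with a model, and it plugs directly into cross-validation and grid search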
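
The leakage point can be sketched with cross-validation. When a pipeline is passed to `cross_val_score`, each fold clones and refits the whole pipeline on that fold's training portion, so the imputer's median never sees the held-out rows. The data below is a synthetic stand-in (the `toy` frame, its values, and the name `toy_pipeline` are assumptions for illustration, not the Titanic file).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for a few Titanic-like columns (hypothetical values)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.normal(30, 12, n),
    "Fare": rng.exponential(30, n),
    "Sex": rng.choice(["male", "female"], n),
})
toy.loc[rng.choice(n, 30, replace=False), "Age"] = np.nan  # inject missing ages

# Noisy target loosely tied to Sex, only so the model has something to learn
flip = rng.random(n) < 0.2
toy_y = ((toy["Sex"] == "female") ^ flip).astype(int)

toy_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), ["Age", "Fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])

# Each of the 5 folds refits imputer, encoder, and model on its own training rows
scores = cross_val_score(toy_pipeline, toy, toy_y, cv=5, scoring="accuracy")
print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy:   {scores.mean():.3f}")
```

Doing the same thing by hand would mean imputing and encoding inside every fold separately. Imputing once on the full dataset before splitting is the typical way leakage creeps in.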

## 1. Import Libraries and Load Data

In [9]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
df = pd.read_csv('data/train.csv')
print(f"Dataset shape: {df.shape}")
df.head(3)

Dataset shape: (891, 12)


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |


## 2. Prepare Features and Target

In [10]:
# Target
y = df['Survived']

# Features — drop target + non-useful columns in one step
X = df.drop(columns=['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin'])

print(f"Features: {list(X.columns)}")
print(f"X shape: {X.shape}, y shape: {y.shape}")
X.head()

Features: ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X shape: (891, 7), y shape: (891,)


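| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |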


## 3. Train/Test Split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set:     {X_test.shape}")

Training set: (712, 7)
Test set:     (179, 7)


## 4. Build the End-to-End Pipeline

This is the key upgrade. Instead of separate preprocessing and model steps, we create **one Pipeline** that chains:

1. **Preprocessor** (ColumnTransformer) — handles imputation and encoding
2. **Model** (Logistic Regression) — learns from the preprocessed data

When we call `pipeline.fit(X_train, y_train)`, it automatically:
- Fits the preprocessor on `X_train` and transforms it
- Fits the model on the transformed result

When we call `pipeline.predict(X_test)`, it automatically:
- Transforms `X_test` using the already-fitted preprocessor
- Predicts using the already-fitted model
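
The two call sequences described above can be checked against each other. The sketch below fits a pipeline and, separately, runs the equivalent manual steps, then compares the predictions. It uses a small synthetic frame rather than the Titanic data. The names `toy_X`, `toy_y`, and `make_preprocessor` are hypothetical helpers introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (hypothetical values, not the Titanic file)
rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame({
    "Age": rng.normal(30, 12, n),
    "Fare": rng.exponential(30, n),
    "Sex": rng.choice(["male", "female"], n),
})
toy_X.loc[rng.choice(n, 30, replace=False), "Age"] = np.nan
flip = rng.random(n) < 0.2
toy_y = ((toy_X["Sex"] == "female") ^ flip).astype(int)

X_tr, X_te = toy_X.iloc[:150], toy_X.iloc[150:]
y_tr = toy_y.iloc[:150]


def make_preprocessor():
    """Build a fresh, unfitted preprocessor (hypothetical helper)."""
    return ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), ["Age", "Fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ])


# Pipeline route: one fit, one predict
pipe = Pipeline([
    ("preprocessor", make_preprocessor()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Manual route: the same steps written out
prep = make_preprocessor()
model = LogisticRegression(max_iter=1000)
X_tr_t = prep.fit_transform(X_tr)  # fit + transform on training rows only
model.fit(X_tr_t, y_tr)
X_te_t = prep.transform(X_te)      # transform only, no re-fitting
manual_pred = model.predict(X_te_t)

print("Same predictions:", np.array_equal(pipe_pred, manual_pred))
```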

In [12]:
# Define column types
# Pclass is an ordinal code (1, 2, 3) and is treated as numeric here
numeric_cols = ['Age', 'SibSp', 'Parch', 'Fare', 'Pclass']
categorical_cols = ['Sex', 'Embarked']

In [13]:
# Preprocessing step (same logic as notebook 02)
preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='median'), numeric_cols),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(drop='first', sparse_output=False))
    ]), categorical_cols)
])

# End-to-end pipeline: preprocessing + model in one object
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', LogisticRegression(max_iter=1000))
])

print("Pipeline created:")
print(pipeline)

Pipeline created:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  SimpleImputer(strategy='median'),
                                                  ['Age', 'SibSp', 'Parch',
                                                   'Fare', 'Pclass']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(drop='first',
                                                                                 sparse_output=False))]),
                                                  ['Sex', 'Embarked'])])),
                ('model', LogisticRegression(max_iter=1000))])


## 5. Train and Predict

Notice how clean this is — **two lines** to go from raw data to predictions.

In [14]:
# Train: preprocesses + fits the model in one call
pipeline.fit(X_train, y_train)

# Predict: preprocesses + predicts in one call
y_pred = pipeline.predict(X_test)
y_pred_proba = pipeline.predict_proba(X_test)

print("Pipeline trained and predictions generated!")
print(f"Predictions shape: {y_pred.shape}")
print(f"Probabilities shape: {y_pred_proba.shape}")

Pipeline trained and predictions generated!
Predictions shape: (179,)
Probabilities shape: (179, 2)


## 6. Evaluate the Model

In [15]:
# --- Accuracy ---
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f} ({accuracy * 100:.2f}%)\n")

# --- Confusion Matrix ---
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print(f"""
  - True Negatives  (correctly predicted died):          {cm[0][0]}
  - False Positives (predicted survived, actually died): {cm[0][1]}
  - False Negatives (predicted died, actually survived): {cm[1][0]}
  - True Positives  (correctly predicted survived):      {cm[1][1]}
""")

# --- Classification Report ---
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Died', 'Survived']))

Accuracy: 0.8045 (80.45%)

Confusion Matrix:
[[98 12]
 [23 46]]

  - True Negatives  (correctly predicted died):          98
  - False Positives (predicted survived, actually died): 12
  - False Negatives (predicted died, actually survived): 23
  - True Positives  (correctly predicted survived):      46

Classification Report:
              precision    recall  f1-score   support

        Died       0.81      0.89      0.85       110
    Survived       0.79      0.67      0.72        69

    accuracy                           0.80       179
   macro avg       0.80      0.78      0.79       179
weighted avg       0.80      0.80      0.80       179
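
As a cross-check, the headline numbers in the report can be recomputed by hand from the four confusion-matrix cells printed above. This is plain arithmetic on those counts. The variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = actual, columns = predicted
cm = np.array([[98, 12],
               [23, 46]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()            # 144 / 179
precision_survived = tp / (tp + fp)        # 46 / 58
recall_survived = tp / (tp + fn)           # 46 / 69
f1_survived = (2 * precision_survived * recall_survived
               / (precision_survived + recall_survived))

print(f"Accuracy:             {accuracy:.4f}")            # 0.8045
print(f"Precision (Survived): {precision_survived:.2f}")  # 0.79
print(f"Recall (Survived):    {recall_survived:.2f}")     # 0.67
print(f"F1 (Survived):        {f1_survived:.2f}")         # 0.72
```

The recall of 0.67 means 23 of the 69 actual survivors in the test set were predicted as "Died", which is the weakest part of the current model.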



## 7. Compare: Before vs After

| Aspect | Notebook 02 (Separate Steps) | Notebook 03 (Pipeline) |
|---|---|---|
| **Preprocessing** | Manual `fit_transform` + `transform` calls | Handled automatically inside the pipeline |
| **Training** | `model.fit(X_train_processed, y_train)` | `pipeline.fit(X_train, y_train)` — takes raw data directly |
| **Prediction** | Must remember to transform first, then predict | `pipeline.predict(X_test)` — one call does everything |
| **Leakage risk** | Possible if you accidentally fit on test data | Lower: `predict` only transforms and never re-fits, provided `fit` is called on training data only |
| **Deployment** | Must save preprocessor and model separately | Save one pipeline object |

The results (accuracy, confusion matrix, classification report) are **identical** — we didn't change the logic, just the structure. This is a **refactor**, not a new model.
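
The "save one pipeline object" row can be sketched with `joblib`, which scikit-learn itself depends on. This is a minimal illustration on synthetic data, not the notebook's actual deployment step. The file name `titanic_pipeline.joblib` and the `toy_*` names are assumptions.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (hypothetical values, not the Titanic file)
rng = np.random.default_rng(0)
n = 100
toy_X = pd.DataFrame({
    "Age": rng.normal(30, 12, n),
    "Sex": rng.choice(["male", "female"], n),
})
toy_X.loc[rng.choice(n, 10, replace=False), "Age"] = np.nan
flip = rng.random(n) < 0.2
toy_y = ((toy_X["Sex"] == "female") ^ flip).astype(int)

toy_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), ["Age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "titanic_pipeline.joblib"  # hypothetical file name
    joblib.dump(toy_pipeline, path)               # preprocessing + model in one file
    restored = joblib.load(path)

# The restored object accepts raw features, exactly like the original
same = np.array_equal(toy_pipeline.predict(toy_X), restored.predict(toy_X))
print("Restored pipeline gives the same predictions:", same)
```

Because `joblib` files are pickle-based, they should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.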

## 8. Conclusion & Next Steps

### What we achieved
- Refactored separate preprocessing and model steps into a **single end-to-end Pipeline**
- Confirmed that results are identical — this was a structural improvement, not a model change
- The pipeline is now easier to ship: one object handles everything from raw data to predictions

### What's next
- **Try different models**: swap `LogisticRegression` for Random Forest, SVM, or Gradient Boosting within the same pipeline structure
- **Feature engineering**: create new features (e.g., family size from `SibSp` + `Parch`) and check whether they improve recall on the "Survived" class, currently 0.67
- **Hyperparameter tuning**: use `GridSearchCV` or `RandomizedSearchCV` with the pipeline to search over settings; step parameters are addressed as `<step>__<param>`, e.g. `model__C`
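
Two of these next steps, swapping the model and tuning hyperparameters, can be combined in one grid search, since the `model` step itself can be treated as a parameter. The sketch below runs on synthetic data. The grid values are arbitrary starting points rather than recommendations, and the `toy_*` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (hypothetical values, not the Titanic file)
rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame({
    "Age": rng.normal(30, 12, n),
    "Fare": rng.exponential(30, n),
    "Sex": rng.choice(["male", "female"], n),
})
toy_X.loc[rng.choice(n, 30, replace=False), "Age"] = np.nan
flip = rng.random(n) < 0.2
toy_y = ((toy_X["Sex"] == "female") ^ flip).astype(int)

toy_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), ["Age", "Fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])

# One sub-grid per candidate model; "<step>__<param>" reaches into a step
param_grid = [
    {
        "model": [LogisticRegression(max_iter=1000)],
        "model__C": [0.1, 1.0, 10.0],
    },
    {
        "model": [RandomForestClassifier(random_state=0)],
        "model__n_estimators": [50, 100],
    },
]

search = GridSearchCV(toy_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(toy_X, toy_y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Because the search wraps the whole pipeline, preprocessing is refit inside every fold for every candidate, so the tuning itself does not leak information from validation rows.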