# Titanic Logistic Regression Pipeline

This project builds a full pipeline to predict survival on the Titanic dataset using logistic regression, covering data preprocessing, training, evaluation, validation, and model persistence.

In [None]:
import joblib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

## 1. Data Import

**Note:** Place the Titanic dataset (`train.csv`) in your notebook's working directory.

In [None]:
df = pd.read_csv('train.csv')
df.head()
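
Before choosing a preprocessing strategy, it can help to count missing values per column. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not rows from `train.csv`); the same `isna().sum()` call can be run on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train.csv; all values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Count missing entries per column to see which columns need imputation.
missing = toy.isna().sum()
print(missing)
```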

## 2. Feature Selection & Preprocessing

We will use these columns as features:
- `Pclass`, `Sex`, `Age`, `SibSp`, `Parch`, `Fare`, `Embarked`

Target column:
- `Survived`

In [None]:
X = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
y = df['Survived'].astype(int)

## 3. Preprocessing Pipelines

- **Numerical columns**: impute missing values with the median, then standardize.
- **Categorical columns**: impute missing values with the most frequent category, then one-hot encode.

In [None]:
num_cols = ['Age', 'SibSp', 'Parch', 'Fare']
cat_cols = ['Pclass', 'Sex', 'Embarked']

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])

## 4. Model Pipeline and Split

We combine the preprocessor and logistic regression into one pipeline and split the data into train and test sets.

In [None]:
pipe = Pipeline([
    ('preprocess', preprocessor),
    ('clf', LogisticRegression(max_iter=1000))
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
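
The `stratify=y` argument keeps the class balance the same in the train and test splits. A minimal sketch on made-up labels (`X_toy` and `y_toy` are hypothetical, not the Titanic data) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with a 60/40 class balance.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The positive-class fraction should match across the full set and both splits.
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```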

## 5. Train the Model

In [None]:
pipe.fit(X_train, y_train)

## 6. Evaluate the Model

Report classification metrics and ROC-AUC on the test set, then plot the confusion matrix and the ROC curve.

In [None]:
y_pred = pipe.predict(X_test)
y_proba = pipe.predict_proba(X_test)[:, 1]

print("Classification Report on Test Set")
print(classification_report(y_test, y_pred))
print(f"Test ROC-AUC Score: {roc_auc_score(y_test, y_proba):.3f}")

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.title("Confusion Matrix (Test)")
plt.show()

RocCurveDisplay.from_predictions(y_test, y_proba)
plt.title("ROC Curve (Test)")
plt.show()

## 7. 5-Fold Cross-Validation (ROC-AUC)

In [None]:
cv_auc = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
print(f"5-Fold CV ROC-AUC Mean = {cv_auc.mean():.3f}, Std = {cv_auc.std():.3f}")
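
When `cv` is an integer and the estimator is a classifier, scikit-learn uses stratified folds without shuffling. If shuffled folds are preferred, a `StratifiedKFold` object can be passed instead. The sketch below assumes synthetic data from `make_classification` (the names `X_syn` and `y_syn` are hypothetical) rather than the Titanic pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data, used only for illustration.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

# Shuffled, stratified folds with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_syn, y_syn, cv=cv, scoring="roc_auc"
)
print(f"Mean = {scores.mean():.3f}, Std = {scores.std():.3f}")
```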

## 8. Save and Reload the Model

The pipeline is saved to disk and reloaded to demonstrate reuse and deployment.

In [None]:
joblib.dump(pipe, 'model.joblib')
print("Model pipeline saved as 'model.joblib'.")

model = joblib.load('model.joblib')
print("Sample predictions (reloaded model):", model.predict(X_test.head()))
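
A quick way to confirm that the round trip preserved the model is to compare predictions before and after reloading. The sketch below assumes a small synthetic pipeline (`toy_pipe` and the file name `toy_model.joblib` are hypothetical) and writes to a temporary directory so nothing is left on disk:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a small pipeline, used only for illustration.
X_syn, y_syn = make_classification(n_samples=100, n_features=4, random_state=0)
toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_syn, y_syn)

# Save and reload inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.joblib")
    joblib.dump(toy_pipe, path)
    reloaded = joblib.load(path)

# The reloaded pipeline should give the same probabilities as the original.
same = np.allclose(toy_pipe.predict_proba(X_syn), reloaded.predict_proba(X_syn))
print(same)
```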

## 9. Summary & Next Steps

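- Likely key features: `Sex`, `Fare`, `Pclass` (the notebook does not inspect the fitted coefficients, so confirm this against the model)
- Logistic regression is a strong, interpretable baseline
- Evaluated with test-set ROC-AUC and 5-fold cross-validated ROC-AUC
- Next steps: feature engineering, hyperparameter tuning, or more advanced models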