
# 📘 Notebook Outline — Titanic Logistic Regression (CSV) + EDA
**Goal:** Predict `Survived` (0/1) from a Kaggle-style Titanic CSV.

## 1) Introduction — What & Why
- **Definition:** Logistic Regression models the log-odds of the positive class as a linear function of features.
- **Why classification:** `Survived` is binary (0/1); we need probabilities & class labels.
- **Why Titanic:** Classic, interpretable dataset for teaching classification + metrics.
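
To make the log-odds definition concrete, here is a minimal sketch of how a fitted logistic regression turns features into a probability and a class label. The intercept, coefficients, and feature values are made up for illustration; they are not estimates from the Titanic data.

```python
import numpy as np

def sigmoid(z):
    """Map log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and inputs, chosen only for illustration
intercept = -1.0
coefs = np.array([2.0, -0.5])
x = np.array([1.0, 1.0])            # one made-up passenger

log_odds = intercept + coefs @ x    # linear in the features
prob = sigmoid(log_odds)            # P(positive class)
label = int(prob >= 0.5)            # default decision threshold

print(log_odds, round(prob, 3), label)
```

Taking `np.log(prob / (1 - prob))` recovers `log_odds`, which is why the coefficients are read on the log-odds scale.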

## 2) Load Dataset — Local CSV
- Load a local copy with `pd.read_csv('titanic.csv')` so the notebook runs without network access.

## 3) EDA — Understand the data before modeling
- Check class balance and missingness.
- Visualize Age distribution and survival rates by key features.

## 4) Data Preprocessing
- **Imputation:** median for numeric features, most frequent value for categorical features.
- **Encoding:** one-hot for `Sex`, `Embarked`, and the engineered `Alone` flag (1 when `SibSp + Parch == 0`).
- **Scaling:** standardize numeric features.

## 5) Train–Test Split — Stratify on `Survived`

## 6) Model Training — Logistic Regression

## 7) Evaluation — Accuracy, Precision, Recall, F1, ROC‑AUC + diagnostics

## 8) Interpretation — coefficients, odds ratios, insights


## Setup & Data Loading

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load local CSV
df = pd.read_csv('titanic.csv')
print(df.shape)
df.head()


## EDA — Structure, Missingness, Class Balance

In [None]:

# Data types and non-null counts
df.info()


In [None]:

# Missing values per column
df.isna().sum().sort_values(ascending=False)


In [None]:

# Class balance for Survived
df['Survived'].value_counts(normalize=True).rename('proportion')
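
The class proportions also give a reference point for later metrics: a model that always predicts the majority class scores an accuracy equal to the majority share. A small sketch, using made-up labels rather than the real Titanic counts:

```python
import pandas as pd

# Made-up labels for illustration only (not the real Titanic counts)
survived = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 0, 1], name='Survived')

proportions = survived.value_counts(normalize=True)
majority_class = proportions.idxmax()     # most common label
baseline_accuracy = proportions.max()     # accuracy of always predicting it

print(majority_class, baseline_accuracy)
```

Any trained classifier should beat this baseline on the test set to be considered useful.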


## EDA — Distributions & Survival Rates

In [None]:

# Age distribution
plt.figure(figsize=(6,4))
df['Age'].plot(kind='hist', bins=30)
plt.xlabel('Age')
plt.title('Age Distribution')
plt.show()


In [None]:

# Survival rate by Sex (bar chart via groupby)
sex_rate = df.groupby('Sex')['Survived'].mean().sort_values(ascending=False)
plt.figure(figsize=(6,4))
sex_rate.plot(kind='bar')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Sex')
plt.ylim(0,1)
plt.show()


In [None]:

# Survival rate by Pclass
pclass_rate = df.groupby('Pclass')['Survived'].mean()
plt.figure(figsize=(6,4))
pclass_rate.plot(kind='bar')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Pclass')
plt.ylim(0,1)
plt.show()


## Target & Features

In [None]:

cols = ['Survived','Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
data = df[cols].copy()

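# Engineered feature: 1 if the passenger has no siblings/spouse (SibSp)
# and no parents/children (Parch) aboard, else 0
data['Alone'] = ((data['SibSp'].fillna(0) + data['Parch'].fillna(0)) == 0).astype(int)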

y = data['Survived']
X = data.drop(columns=['Survived'])

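# Pclass is ordinal (1-3) and is treated as numeric here
num_features = ['Pclass','Age','SibSp','Parch','Fare']
# Alone is already 0/1 but is one-hot encoded along with the other categoricals
cat_features = ['Sex','Embarked','Alone']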


## Preprocessing Pipeline

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, confusion_matrix, RocCurveDisplay)

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_features),
        ('cat', categorical_transformer, cat_features)
    ]
)


## Train–Test Split

In [None]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
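
To see what `stratify` does, the following sketch splits synthetic labels (60 zeros, 40 ones; not the Titanic data) and compares the positive rate in each part. The names `X_toy` and `y_toy` are placeholders introduced for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 60 negatives and 40 positives
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the overall positive rate
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, the test-set positive rate can drift from the overall rate by chance, which makes metrics on a small test set noisier.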


## Model — Logistic Regression

In [None]:

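# Preprocessing and classifier live in one pipeline, so the imputers,
# scaler, and encoder are fit on the training split only
model = Pipeline(steps=[
    ('preprocess', preprocess),
    ('clf', LogisticRegression(max_iter=1000))
])

model.fit(X_train, y_train)
y_pred = model.predict(X_test)               # class labels at the 0.5 threshold
y_proba = model.predict_proba(X_test)[:, 1]  # P(Survived = 1)

# Precision, recall, and F1 are reported for the positive class (Survived = 1)
metrics = {
    'accuracy': round(accuracy_score(y_test, y_pred), 3),
    'precision': round(precision_score(y_test, y_pred), 3),
    'recall': round(recall_score(y_test, y_pred), 3),
    'f1': round(f1_score(y_test, y_pred), 3),
    'roc_auc': round(roc_auc_score(y_test, y_proba), 3)
}
metrics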


## Confusion Matrix & ROC Curve

In [None]:

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
pd.DataFrame(cm, index=['Actual 0','Actual 1'], columns=['Pred 0','Pred 1'])
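
The headline metrics can be read directly off the four cells of the confusion matrix. A short sketch with made-up labels and predictions (not results from this model):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted survivors, how many survived
recall = tp / (tp + fn)      # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp, accuracy, precision, recall, f1)
```

Rows of the matrix are actual classes and columns are predicted classes, matching the `Actual`/`Pred` labels used above.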


In [None]:

# ROC Curve
RocCurveDisplay.from_estimator(model, X_test, y_test)
plt.title('ROC Curve')
plt.show()


## Coefficients (Odds Interpretation)

In [None]:

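# Recover feature names. The order must match the ColumnTransformer output:
# numeric columns first, then the one-hot columns.
ct = model.named_steps['preprocess']
ohe = ct.named_transformers_['cat'].named_steps['onehot']
num_names = list(num_features)
cat_names = list(ohe.get_feature_names_out(cat_features))
all_feature_names = num_names + cat_names

# exp(coef) is the multiplicative change in the odds of survival.
# Numeric features were standardized, so their unit is one standard deviation.
coef = model.named_steps['clf'].coef_[0]
coef_df = pd.DataFrame({'feature': all_feature_names, 'coef': coef})
coef_df['odds_ratio'] = np.exp(coef_df['coef'])
coef_df.sort_values('odds_ratio', ascending=False).head(12)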
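
As a reading aid for the odds-ratio column, this sketch shows how one coefficient shifts a probability. The coefficient and starting probability are hypothetical values, not outputs of the model above.

```python
import numpy as np

coef = -0.5                    # hypothetical coefficient on a standardized feature
odds_ratio = np.exp(coef)      # odds are multiplied by this per +1 SD

p_before = 0.40                # hypothetical starting survival probability
odds_before = p_before / (1 - p_before)
odds_after = odds_before * odds_ratio
p_after = odds_after / (1 + odds_after)

print(round(odds_ratio, 3), round(p_after, 3))
```

An odds ratio below 1 lowers the odds of survival and one above 1 raises them. The change in probability depends on the starting point, so odds ratios should not be read as percentage-point changes.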
