
# Titanic ML Starter Project (NumPy • pandas • Matplotlib • scikit-learn)

Welcome! This notebook is a hands-on, beginner-friendly project that walks you through an end‑to‑end ML workflow using the classic **Titanic** dataset.

**You will practice:**
- Loading & inspecting data (pandas, NumPy)
- Exploratory Data Analysis (Matplotlib only)
- Data cleaning & feature engineering
- Building a baseline ML model (scikit‑learn `LogisticRegression`)
- Evaluating with accuracy, confusion matrix, ROC‑AUC
- Packaging steps with `Pipeline` + `ColumnTransformer`

> Note: We use `seaborn` only to **load** the Titanic dataset. All **charts** use Matplotlib (as required). `sns.load_dataset` downloads the file on first use, so the first run needs an internet connection.


## 0) Setup

In [None]:

# !pip install numpy pandas matplotlib scikit-learn seaborn --quiet  # (Uncomment if needed)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# We import seaborn JUST to load the Titanic dataset easily.
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    roc_auc_score,
    RocCurveDisplay,
)


## 1) Load the dataset

In [None]:

# Load Titanic dataset from seaborn
df = sns.load_dataset('titanic')
print(df.shape)
df.head()


## 2) Quick inspection

In [None]:

# Overview of columns, types, and missingness
df.info()


In [None]:

df.describe(include='all').T


## 3) Exploratory Data Analysis (Matplotlib only)

In [None]:

# Target distribution (survived)
fig = plt.figure()
df['survived'].value_counts().sort_index().plot(kind='bar')
plt.title('Target Distribution: Survived (0 = No, 1 = Yes)')
plt.xlabel('Survived')
plt.ylabel('Count')
plt.show()


In [None]:

# Age histogram
fig = plt.figure()
df['age'].plot(kind='hist', bins=30)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()


In [None]:

# Fare by survival outcome (Matplotlib boxplot)
fig = plt.figure()
plt.boxplot([df.loc[df['survived']==0, 'fare'].dropna(), df.loc[df['survived']==1, 'fare'].dropna()])
# Label the boxes with xticks: boxplot's `labels` argument was renamed to `tick_labels` in Matplotlib 3.9
plt.xticks([1, 2], ['Not Survived', 'Survived'])
plt.title('Fare by Survival')
plt.ylabel('Fare')
plt.show()



## 4) Choose features & target

We'll keep things simple:
- **Target**: `survived`
- **Features**: `pclass`, `sex`, `age`, `sibsp`, `parch`, `fare`, `embarked`

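We'll one‑hot encode the categorical variables and standardize the numeric ones. Scaling puts the features on a comparable scale, which helps the regularized logistic regression converge and makes its coefficients easier to compare.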


In [None]:

target = 'survived'
features = ['pclass','sex','age','sibsp','parch','fare','embarked']

X = df[features].copy()
y = df[target].astype(int)
X.head()


## 5) Train/Test Split

In [None]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train.shape, X_test.shape
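
The `stratify=y` argument keeps the class balance the same in both splits. The sketch below illustrates this on synthetic labels (a hypothetical 70/30 class mix, not the Titanic target), using separate variable names so the notebook's split is left untouched.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 70 zeros and 30 ones (hypothetical class mix)
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of class 1 in each split; with stratification both match the full data
print(y_toy_train.mean(), y_toy_test.mean())
```

Without `stratify`, the two shares can drift apart by chance, which matters more the smaller the test set is.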


## 6) Preprocessing: Impute • Encode • Scale

In [None]:

# Identify numeric and categorical columns
numeric_features = ['age','sibsp','parch','fare']
categorical_features = ['pclass','sex','embarked']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
preprocess
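
One reason to keep imputation inside the pipeline is that the fill values are learned from the training data only and then reused on the test data. A minimal sketch of that behavior, assuming made-up age values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical ages; np.nan marks a missing value
train_age = np.array([[20.0], [30.0], [40.0], [np.nan]])
test_age = np.array([[np.nan], [80.0]])

toy_imputer = SimpleImputer(strategy='median')
toy_imputer.fit(train_age)           # learns the median of the observed training ages
print(toy_imputer.statistics_)

# The missing test value is filled with the training median, not the test median
test_filled = toy_imputer.transform(test_age)
print(test_filled.ravel())
```

The same holds for the scaler and the encoder: calling `fit` on the full pipeline with `X_train` means no statistic is computed from `X_test`.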


## 7) Baseline Model: Logistic Regression (with Pipeline)

In [None]:

clf = Pipeline(steps=[
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=1000))
])

clf.fit(X_train, y_train)
print('Training done.')


## 8) Evaluation: Accuracy, Confusion Matrix, ROC-AUC

In [None]:

y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f'Accuracy: {acc:.3f}')
print('Confusion Matrix:\n', cm)

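# Confusion matrix plot
# Pass our own Axes: disp.plot() creates a new figure when no ax is given,
# which would leave an empty figure behind from a separate plt.figure() call.
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
fig, ax = plt.subplots()
disp.plot(ax=ax, values_format='d')
ax.set_title('Confusion Matrix')
plt.show()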

# ROC-AUC
y_proba = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)
print(f'ROC-AUC: {auc:.3f}')

# Draw on an explicit Axes so no empty extra figure is created
fig, ax = plt.subplots()
RocCurveDisplay.from_predictions(y_test, y_proba, ax=ax)
ax.set_title('ROC Curve')
plt.show()
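
To see how these metrics relate to each other, the sketch below derives accuracy, precision, and recall from the four confusion-matrix cells. The labels and scores are hypothetical and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hypothetical true labels and predicted probabilities for class 1
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.3, 0.6, 0.7, 0.9])
y_hat = (y_score >= 0.5).astype(int)  # threshold the scores at 0.5

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found

# ROC-AUC uses the scores, not the thresholded labels
auc_toy = roc_auc_score(y_true, y_score)

print(accuracy, accuracy_score(y_true, y_hat))
print(precision, recall, auc_toy)
```

Accuracy depends on the chosen threshold, while ROC-AUC summarizes how well the scores rank positives above negatives across all thresholds.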


## 9) (Optional) Peek inside the model

In [None]:

# Recover feature names in the order the ColumnTransformer emits them:
# numeric columns first, then the one-hot encoded columns.
ohe = clf.named_steps['preprocess'].named_transformers_['cat'].named_steps['onehot']
cat_feature_names = ohe.get_feature_names_out(categorical_features)
all_feature_names = np.concatenate([numeric_features, cat_feature_names])

# Extract coefficients from the logistic regression
coefs = clf.named_steps['model'].coef_.ravel()
coef_df = pd.DataFrame({'feature': all_feature_names, 'coef': coefs}).sort_values('coef', ascending=False)
coef_df.head(10)
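
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below uses hypothetical coefficient values and placeholder feature names, not output from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients, for illustration only
toy_coefs = pd.DataFrame({
    'feature': ['feature_a', 'feature_b', 'feature_c'],
    'coef': [0.0, 0.693, -0.693],
})

# exp(coef) is the multiplicative change in the odds for a one-unit increase
toy_coefs['odds_ratio'] = np.exp(toy_coefs['coef'])
print(toy_coefs)
```

The same line can be applied to `coef_df` above. Because the numeric features were standardized, one unit there means one standard deviation; for a one-hot column it means belonging to that category.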



## 10) Next Steps & Exercises

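- Try adding more features: `who`, `alone`, `deck` (after cleaning). `class` and `embark_town` are also available, but they repeat the information in `pclass` and `embarked`. Never use `alive` as a feature: it is the target under another name.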
- Compare models: `RandomForestClassifier`, `XGBClassifier` (if available), `SVC`.
- Hyperparameter tuning: `GridSearchCV` or `RandomizedSearchCV` on Logistic Regression `C`, or RandomForest `n_estimators` / `max_depth`.
- Cross‑validation: Use `cross_val_score` to get more stable estimates.
- Feature importance: Use permutation importance (`sklearn.inspection.permutation_importance`).
- Write a short README: problem statement, data steps, model choice, metrics, and how to run.
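
As a starting point for the cross-validation and tuning exercises, here is a minimal sketch on synthetic data. It assumes a pipeline whose final step is named `model`, as in this notebook; the data from `make_classification` and the `C` grid are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the example runs on its own
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=42)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation: one accuracy score per fold
cv_scores = cross_val_score(pipe, X_syn, y_syn, cv=5, scoring='accuracy')
print(f'CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}')

# Grid search over the regularization strength C of the 'model' step
grid = GridSearchCV(
    pipe,
    param_grid={'model__C': [0.01, 0.1, 1, 10]},
    cv=5,
    scoring='roc_auc',
)
grid.fit(X_syn, y_syn)
print(grid.best_params_, round(grid.best_score_, 3))
```

In this notebook, the same calls would take `clf` in place of `pipe`, fitted on `X_train` and `y_train` so the test set stays untouched for the final evaluation. The `model__C` key follows the `<step name>__<parameter>` convention for pipelines.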

**Stretch goal:** Deploy with **Streamlit** (inputs → prediction) or a simple **Flask** API.
