# Kaggle Titanic Baseline

This notebook mirrors the baseline pipeline in `src/train.py` and produces a `submission.csv` file for the Kaggle Titanic competition.

**Before you start:** place `train.csv` and `test.csv` inside the `data/` directory. The final submission file will be written to `output/submission.csv`.

In [None]:
from pathlib import Path

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


## Load and inspect the data

We check that both input files exist, load them with pandas, and print each frame's shape. The `head()` preview shows the column names and a few sample rows, which is useful when debugging.

In [None]:
data_dir = Path("data")
train_path = data_dir / "train.csv"
test_path = data_dir / "test.csv"

# Collect any missing input files so the error names all of them at once.
missing = [str(p) for p in (train_path, test_path) if not p.exists()]
if missing:
    raise FileNotFoundError(
        "Missing input files: "
        + ", ".join(missing)
        + ". Download them from Kaggle and place them in the data/ directory."
    )

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")
train_df.head()
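
Beyond shape and a row preview, a per-column count of missing values can show which columns the imputers will need to fill. The snippet below is a minimal sketch on a toy frame; `toy_df` and its values are made up for illustration and stand in for `train_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df (assumption); the real frame comes from data/train.csv.
toy_df = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, np.nan],
        "Fare": [7.25, 71.28, 7.92, 8.05],
        "Embarked": ["S", "C", None, "S"],
    }
)

# Count missing values per column and keep only the columns that have any.
missing_counts = toy_df.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

Running the same two lines on `train_df` should list the columns that contain gaps in the real data.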


## Build the preprocessing + model pipeline

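We use the same feature list as the training script and split it into numeric and categorical columns. Numeric columns are imputed with the median; categorical columns are imputed with the most frequent value and then one-hot encoded. The preprocessed features feed a logistic regression classifier.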

In [None]:
feature_cols = [
    "Pclass",
    "Sex",
    "Age",
    "SibSp",
    "Parch",
    "Fare",
    "Embarked",
]

missing_cols = [col for col in feature_cols if col not in train_df.columns]
if missing_cols:
    raise ValueError(f"Training data is missing columns: {missing_cols}")

missing_test_cols = [col for col in feature_cols if col not in test_df.columns]
if missing_test_cols:
    raise ValueError(f"Test data is missing columns: {missing_test_cols}")

numeric_features = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
categorical_features = ["Sex", "Embarked"]

numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median"))]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

model = LogisticRegression(max_iter=1000)

pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", model),
    ]
)
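
To see what this kind of preprocessing produces, the sketch below applies the same imputation and encoding steps to a three-row toy frame. The frame, the name `toy_preprocessor`, and the reduced column set are assumptions made for illustration; the notebook's own `preprocessor` covers all seven features.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame with one numeric and one categorical column (made-up values).
toy_df = pd.DataFrame(
    {
        "Age": [20.0, np.nan, 40.0],
        "Sex": ["male", "female", "female"],
    }
)

# Same steps as the notebook's preprocessor, restricted to two columns.
toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), ["Age"]),
        (
            "cat",
            Pipeline(
                steps=[
                    ("imputer", SimpleImputer(strategy="most_frequent")),
                    ("onehot", OneHotEncoder(handle_unknown="ignore")),
                ]
            ),
            ["Sex"],
        ),
    ]
)

transformed = toy_preprocessor.fit_transform(toy_df)
# The result may be a sparse matrix; convert to a dense array for display.
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()
print(transformed)
```

The missing `Age` is replaced by the median of the observed values, and `Sex` expands into one indicator column per category, so the output has three columns in total.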


## Train the model

We fit the pipeline on the full training dataset, mirroring the approach in `src/train.py`.

In [None]:
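# Guard against a training file without the target column.
if "Survived" not in train_df.columns:
    raise ValueError("Training data is missing the target column: Survived")

x = train_df[feature_cols]
y = train_df["Survived"]
x_test = test_df[feature_cols]

pipeline.fit(x, y)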
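
Fitting on the full training set leaves no held-out data for estimating accuracy. One common option, not something this notebook states `src/train.py` does, is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features, and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic features and labels (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_model = LogisticRegression(max_iter=1000)

# Five-fold cross-validation; each score is the accuracy on one held-out fold.
scores = cross_val_score(demo_model, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"Mean CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

With the notebook's objects, the equivalent call would be `cross_val_score(pipeline, x, y, cv=5, scoring="accuracy")`. Passing the whole pipeline means the imputers and encoder are refit inside each fold.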


## Generate a submission file

We predict on the test set, pair each prediction with its `PassengerId`, and save the Kaggle-ready CSV to `output/submission.csv`, creating the `output/` directory if it does not exist.

In [None]:
predictions = pipeline.predict(x_test)

submission = pd.DataFrame(
    {"PassengerId": test_df["PassengerId"], "Survived": predictions}
)

output_dir = Path("output")
output_dir.mkdir(parents=True, exist_ok=True)
output_path = output_dir / "submission.csv"
submission.to_csv(output_path, index=False)

print(f"Saved submission file to {output_path.resolve()}")
submission.head()
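
Before uploading, a quick structural check on the submission frame can catch formatting mistakes. The sketch below assumes the expected layout is exactly the two columns written above, one row per test passenger, and binary `Survived` values; `check_submission` is a hypothetical helper, and the toy IDs and predictions are made up.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Raise ValueError if the frame does not look like a valid submission."""
    if list(submission.columns) != ["PassengerId", "Survived"]:
        raise ValueError(f"Unexpected columns: {list(submission.columns)}")
    if submission["PassengerId"].tolist() != expected_ids.tolist():
        raise ValueError("PassengerId values do not match the test set.")
    if not submission["Survived"].isin([0, 1]).all():
        raise ValueError("Survived must contain only 0 or 1.")


# Toy data standing in for test_df["PassengerId"] and the model's predictions.
demo_ids = pd.Series([892, 893, 894], name="PassengerId")
demo_submission = pd.DataFrame({"PassengerId": demo_ids, "Survived": [0, 1, 0]})

check_submission(demo_submission, demo_ids)
print("Submission looks well formed.")
```

In the notebook itself, the call would be `check_submission(submission, test_df["PassengerId"])`.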
