```markdown
# Predicting Titanic Survival

This code demonstrates a basic machine learning workflow to predict whether a passenger survived the Titanic shipwreck using the `train.csv` dataset.

## 1. Setup and Data Loading

First, we'll import the necessary libraries and load the dataset.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset
try:
    df = pd.read_csv('data/titanic/train.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'data/titanic/train.csv' not found.")
    print("Please make sure the train.csv file is in the 'data/titanic/' directory relative to where you run this code.")
    raise # Re-raise so execution stops here; everything below needs df

# Display the first few rows
print("\n--- Data Head ---")
print(df.head())

# Display data info and check for missing values
print("\n--- Data Info ---")
df.info()

print("\n--- Missing Values ---")
print(df.isnull().sum())

## 2. Data Preprocessing

The data needs cleaning and transformation before being used by a machine learning model.

*   **Handle Missing Values:**
    *   `Age`: Impute with the median age.
    *   `Fare`: Impute with the median fare. The training data has no missing `Fare` values, but imputing keeps the pipeline robust to data that does.
    *   `Embarked`: Impute with the most frequent embarkation port.
    *   `Cabin`: Too many values are missing, so we drop this column.
*   **Handle Categorical Features:**
    *   `Sex`, `Embarked`, and `Pclass`: Convert these categorical features into numerical representations using one-hot encoding.
*   **Feature Selection:**
    *   Drop `PassengerId`, `Name`, `Ticket`, and `Cabin`, which this basic model does not use.

In [None]:
# Separate target variable
X = df.drop('Survived', axis=1)
y = df['Survived']

# Select the features to use. 'PassengerId', 'Name', 'Ticket', and 'Cabin' are
# left out of this list, so the column selection below drops them.
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = X[features] # Keep only the selected feature columns in X

# Identify column types for preprocessing
# Note: Pclass is technically ordinal but often treated as categorical or numerical.
# We'll treat it as categorical here and one-hot encode it.
numerical_features = ['Age', 'SibSp', 'Parch', 'Fare']
categorical_features = ['Sex', 'Embarked', 'Pclass']

# Create preprocessing pipelines for numerical and categorical features
# Numerical pipeline: Impute missing values with median
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

# Categorical pipeline: Impute missing values with most frequent, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), # For Embarked
    ('onehot', OneHotEncoder(handle_unknown='ignore')) # handle_unknown='ignore' is useful for prediction on unseen data
])

# Combine preprocessing steps using ColumnTransformer
# remainder='drop' discards any column not listed in the transformers; X already
# contains only the listed features, so this acts as a safeguard.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='drop'
)

print("\n--- Preprocessing defined ---")
print(preprocessor)

## 3. Split Data

Split the data into training and testing sets, holding out 20% of the rows for testing. The model will be trained on the training data and evaluated on the unseen testing data.

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"\nTraining set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

## 4. Model Training and Evaluation

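We will train a few different classification models and evaluate each one using accuracy, a classification report (precision, recall, F1-score), and a confusion matrix.

We use a `Pipeline` to chain the preprocessing steps and the model together. Because the whole pipeline is fitted on the training set only, the imputation values and one-hot categories are learned from the training data and then applied unchanged to the test data.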

In [None]:
# Define models to train
models = {
    'Logistic Regression': LogisticRegression(solver='liblinear', random_state=42), # liblinear is good for small datasets
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

results = {}

# Train and evaluate each model
for name, model in models.items():
    print(f"\n--- Training {name} ---")

    # Create a pipeline that first preprocesses then trains the model
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', model)])

    # Train the model
    pipeline.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = pipeline.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)

    # Store results
    results[name] = {
        'accuracy': accuracy,
        'report': report,
        'confusion_matrix': cm
    }

    # Print evaluation results
    print(f"{name} Accuracy: {accuracy:.4f}")
    print(f"{name} Classification Report:\n{report}")
    print(f"{name} Confusion Matrix:\n{cm}")

## 5. Conclusion

The code performs the following steps:

1.  Loads the Titanic training data.
2.  Inspects the data, including checking for missing values.
3.  Selects the feature columns (leaving out `PassengerId`, `Name`, `Ticket`, and `Cabin`) and defines preprocessing steps using `ColumnTransformer` and `Pipeline` to handle missing values (imputation) and categorical features (one-hot encoding).
4.  Splits the data into training and testing sets.
5.  Trains and evaluates three different classification models (Logistic Regression, Decision Tree, Random Forest) using a pipeline that includes the preprocessing.
6.  Prints the accuracy, classification report, and confusion matrix for each model on the test set.

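The output shows the performance of each model. You can compare the metrics (accuracy, precision, recall, F1-score) to see which model performed best on this particular split of the data. Random Forest often performs well on this dataset because averaging many decision trees reduces the overfitting a single tree is prone to, but small differences on one train/test split should not be treated as conclusive.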

This is a basic example. Further improvements could include:
*   More sophisticated feature engineering (e.g., creating 'FamilySize' from 'SibSp' and 'Parch', extracting titles from 'Name').
*   Hyperparameter tuning for the models.
*   Cross-validation for more robust evaluation.
*   Handling outliers.
```
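
Two of the improvements listed above, feature engineering and cross-validation, could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not part of the original workflow: the `toy` DataFrame is made-up data that only mimics the column names of `train.csv`, the helper `add_family_features` is a hypothetical name introduced here, and the choice of three stratified folds is arbitrary (the toy data is too small for more).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows that only mimic the train.csv column names (NOT real Titanic data).
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
    'Pclass':   [3, 1, 3, 3, 2, 3, 1, 2, 1, 3, 2, 3],
    'Name': ['Smith, Mr. John', 'Jones, Mrs. Ann', 'Brown, Miss. Eva',
             'Clark, Mr. Paul', 'Davis, Miss. Amy', 'Evans, Mr. Tom',
             'Ford, Mr. Hugh', 'Green, Mrs. Ida', 'Hall, Mrs. May',
             'Irwin, Mr. Sam', 'King, Miss. Joy', 'Lane, Mr. Roy'],
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male',
            'male', 'female', 'female', 'male', 'female', 'male'],
    'Age': [22, 38, 26, np.nan, 14, 35, 54, 27, 58, 20, np.nan, 39],
    'SibSp': [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0],
    'Parch': [0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0],
    'Fare': [7.25, 71.28, 7.93, 8.46, 30.07, 8.05,
             51.86, 11.13, 26.55, 8.05, 16.70, 7.90],
    'Embarked': ['S', 'C', 'S', 'Q', 'C', 'S', 'S', 'S', np.nan, 'S', 'C', 'S'],
})


def add_family_features(frame):
    """Return a copy with FamilySize and Title columns added (hypothetical helper)."""
    out = frame.copy()
    # The passenger plus siblings/spouses plus parents/children aboard.
    out['FamilySize'] = out['SibSp'] + out['Parch'] + 1
    # Assumes names look like "Surname, Title. Given names".
    out['Title'] = out['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
    return out


data = add_family_features(toy)
X = data[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'FamilySize', 'Title']]
y = data['Survived']

numerical_features = ['Age', 'Fare', 'FamilySize']
categorical_features = ['Pclass', 'Sex', 'Embarked', 'Title']

preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='median'), numerical_features),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        # handle_unknown='ignore' matters here: a title may be absent from a training fold
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical_features),
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42)),
])

# Stratified folds keep the survived/not-survived ratio similar in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# cross_val_score refits a fresh copy of the whole pipeline on each training fold,
# so imputation and encoding never see the held-out fold.
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')

print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

With the real data, the same pattern would apply to `df` loaded from `train.csv`, typically with more folds. Passing the full pipeline to `cross_val_score`, rather than preprocessing first, keeps each fold's evaluation free of information from its held-out rows.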