# Module 5, Assignment 1, Question 1: Neural Networks on Tabular Data  
## Predicting Titanic Survival  

The Titanic dataset is one of the most famous datasets in machine learning. It contains information about passengers aboard the Titanic, which tragically sank on April 15, 1912. Each row corresponds to a passenger and records their demographic information, ticket details, and survival outcome.

The goal of our task is to predict whether a passenger survived (1) or not (0) based on the available features. This is a binary classification problem.

Features in the dataset:

- `PassengerId`: A unique identifier for each passenger (not useful for prediction).
- `Survived`: The target variable (0 = Did not survive, 1 = Survived).
- `Pclass`: Passenger’s class of travel (1 = First class, 2 = Second class, 3 = Third class). This is a proxy for socio-economic status.
- `Name`: Passenger’s full name (contains titles like Mr., Mrs., Miss, which could be extracted for prediction).
- `Sex`: Gender of the passenger (male, female).
- `Age`: Age of the passenger in years.
- `SibSp`: Number of siblings or spouses traveling with the passenger.
- `Parch`: Number of parents or children traveling with the passenger.
- `Ticket`: Ticket number (not very predictive on its own, often dropped).
- `Fare`: Price paid for the ticket.
- `Cabin`: Cabin number (often missing, but the first letter can be used as the deck).
- `Embarked`: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

By exploring and preprocessing these features, we aim to build a neural network model that learns patterns in the data and helps us predict survival outcomes for Titanic passengers.
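The feature list mentions two derived features that this notebook does not actually use: the title inside `Name` and the deck letter from `Cabin`. As a minimal sketch, assuming names follow the "Surname, Title. Given names" pattern seen in the first rows, they could be extracted as shown here. The `sample` frame and the `Title` and `Deck` column names are illustrative, not part of the dataset.

```python
import pandas as pd

# Three names and cabins copied from the first rows of the dataset.
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    ],
    "Cabin": [None, None, "C123"],
})

# Title: the text between the comma and the first period, e.g. "Mr".
sample["Title"] = sample["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Deck: first character of the cabin code; missing cabins stay missing.
sample["Deck"] = sample["Cabin"].str[0]

print(sample[["Title", "Deck"]])
```

Both new columns are categorical, so they would need the same impute-and-one-hot treatment as `Sex` and `Embarked` before reaching a model.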

### Loading the data

We first load and print the first 5 rows of the data.

In [3]:
# Load Titanic training dataset
import pandas as pd

train_df = pd.read_csv('titanic.csv')
train_df.head()



| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


### Missing values, feature selection, preprocessing, and split

We’ll do three things:

(1) check how many missing values each column has,

(2) pick a small, sensible set of features that could matter for survival, and

(3) build a simple preprocessing recipe that fills in missing numbers with the median, fills in missing categories with the most frequent label, one-hot encodes the categorical features, and standardizes the numeric ones so they’re on a comparable scale.

Finally, we split the data into a training set (to fit the model) and a test set (to fairly evaluate it).

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Print number of NaN
train_df.isna().sum()

| Column | Missing values |
|---|---|
| PassengerId | 0 |
| Survived | 0 |
| Pclass | 0 |
| Name | 0 |
| Sex | 0 |
| Age | 177 |
| SibSp | 0 |
| Parch | 0 |
| Ticket | 0 |
| Fare | 0 |
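
Raw counts are easier to judge as fractions. The split later in this notebook shows 712 + 179 = 891 rows, so the 177 missing `Age` values are roughly 20% of the data. A minimal sketch of getting both views at once, using a hypothetical four-row frame (`toy` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame, only to illustrate the calls.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

missing_count = toy.isna().sum()   # number of NaN values per column
missing_share = toy.isna().mean()  # fraction of rows that are NaN per column

print(pd.DataFrame({"count": missing_count, "share": missing_share}))
```

Because `isna()` returns booleans, taking the mean gives the missing fraction directly.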


### Train a single Perceptron baseline

We first select the features, build the preprocessing pipeline, and split the data. Then we train a Perceptron, a simple linear classifier, as our baseline and report train and test accuracy.

In [5]:
# Features and target
core_cols = ['Pclass','Sex','Age','Fare','Embarked','SibSp','Parch']
X = train_df[core_cols].copy()
y = train_df['Survived'].astype(int)

# Define column groups
num_cols = [c for c in X.columns if c not in ['Sex','Embarked']]
cat_cols = ['Sex','Embarked']

# Simple preprocessing
preprocess = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('imp', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), num_cols),
        ('cat', Pipeline([
            ('imp', SimpleImputer(strategy='most_frequent')),
            # sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False)
            ('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
        ]), cat_cols),
    ]
)

# Train/test split
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

X_train = preprocess.fit_transform(X_train_raw)
X_test = preprocess.transform(X_test_raw)

print("X_train shape:", X_train.shape, "  X_test shape:", X_test.shape)


X_train shape: (712, 10)   X_test shape: (179, 10)


In [6]:
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

# Train perceptron
perc = Perceptron(max_iter=1000, tol=1e-3, random_state=42)
perc.fit(X_train, y_train)

# Predictions
y_pred_train = perc.predict(X_train)
y_pred_test = perc.predict(X_test)

print("Train Accuracy:", accuracy_score(y_train, y_pred_train))
print("Test Accuracy:", accuracy_score(y_test, y_pred_test))


Train Accuracy: 0.7654494382022472
Test Accuracy: 0.7430167597765364


### Decision scores and thresholding with confusion matrices

Linear models output a raw decision score for each passenger. By default, the Perceptron predicts class “1” when the score is greater than 0. Here, we sweep several thresholds (−0.5, 0.0, 0.5, 1.0), convert scores to predictions, and print the confusion matrix for each one. In each matrix, rows are the true classes and columns are the predicted classes, so the layout is [[TN, FP], [FN, TP]]. Raising the threshold makes the model predict “survived” less often, which trades false positives for false negatives, so you can choose a threshold that fits the scenario.

In [7]:
import numpy as np
from sklearn.metrics import confusion_matrix

# Raw decision values
scores = perc.decision_function(X_test)

# Try different thresholds
thresholds = [-0.5, 0.0, 0.5, 1.0]

for t in thresholds:
    preds = (scores > t).astype(int)
    cm = confusion_matrix(y_test, preds)
    print(f"\nThreshold {t}")
    print(cm)



Threshold -0.5
[[75 35]
 [16 53]]

Threshold 0.0
[[80 30]
 [16 53]]

Threshold 0.5
[[81 29]
 [17 52]]

Threshold 1.0
[[83 27]
 [20 49]]
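
To make the trade-off concrete, the matrices can be turned into accuracy, precision, and recall. The sketch below re-enters the four matrices printed above by hand (the `matrices` and `results` names are illustrative) and assumes scikit-learn's layout of [[TN, FP], [FN, TP]].

```python
import numpy as np

# Confusion matrices copied from the printed output, keyed by threshold.
# Layout follows scikit-learn: [[TN, FP], [FN, TP]].
matrices = {
    -0.5: np.array([[75, 35], [16, 53]]),
    0.0: np.array([[80, 30], [16, 53]]),
    0.5: np.array([[81, 29], [17, 52]]),
    1.0: np.array([[83, 27], [20, 49]]),
}

results = {}
for t, cm in matrices.items():
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tp + tn) / cm.sum()   # share of all passengers classified correctly
    precision = tp / (tp + fp)        # share of predicted survivors who survived
    recall = tp / (tp + fn)           # share of actual survivors that were found
    results[t] = (accuracy, precision, recall)
    print(f"t={t:+.1f}  acc={accuracy:.3f}  prec={precision:.3f}  rec={recall:.3f}")
```

Moving the threshold from −0.5 to 1.0 cuts false positives from 35 to 27 but raises false negatives from 16 to 20, so recall falls from 53/69 to 49/69 while precision rises slightly.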


### A stronger non-linear model (MLP with two hidden layers)

Now we try a Multi-Layer Perceptron (MLP) with two hidden layers to capture non-linear patterns the linear Perceptron can’t. We’ll train it and compare train/test accuracy. Expect the MLP to fit training data better; the key question is whether it also improves test accuracy without overfitting too much.

In [8]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(10,10), activation='relu',
                    solver='adam', max_iter=1000, random_state=42)
mlp.fit(X_train, y_train)

print("Train Accuracy:", mlp.score(X_train, y_train))
print("Test Accuracy:", mlp.score(X_test, y_test))


Train Accuracy: 0.8665730337078652
Test Accuracy: 0.7988826815642458
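
One simple way to quantify "overfitting too much" is the gap between train and test accuracy. The sketch below only does arithmetic on the accuracies reported in this notebook; the `reported` and `gaps` names are illustrative.

```python
# Train/test accuracies copied from the outputs above.
reported = {
    "Perceptron": (0.7654494382022472, 0.7430167597765364),
    "MLP (10, 10)": (0.8665730337078652, 0.7988826815642458),
}

# Generalization gap: how much accuracy is lost going from train to test.
gaps = {name: train - test for name, (train, test) in reported.items()}

for name, gap in gaps.items():
    train, test = reported[name]
    print(f"{name}: train={train:.3f}  test={test:.3f}  gap={gap:.3f}")
```

The MLP improves test accuracy by about 5.6 percentage points over the Perceptron, but its train/test gap is roughly three times larger, which is the overfitting signal to watch as capacity grows.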


### Exploring model capacity: hidden-layer sweep

Finally, we’ll systematically vary the MLP’s architecture (number of hidden layers and units per layer) and record train/test accuracy and how many epochs each model needed to converge. This helps you see the capacity vs. generalization trade-off in practice.

In [9]:
from sklearn.neural_network import MLPClassifier

def eval_mlp(hidden, alpha=0.0, early_stopping=False):
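    """
    Train one MLP and return (train accuracy, test accuracy, fitted model).
    hidden: tuple of hidden-layer sizes, e.g. (8,), (16,8), (32,16,8)
    alpha: L2 regularization strength passed to MLPClassifier
    early_stopping: if True, hold out part of the training data and stop
        when the validation score stops improving
    """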
    clf = MLPClassifier(
        hidden_layer_sizes=hidden,
        activation="relu",
        solver="adam",
        max_iter=1000,
        alpha=alpha,
        early_stopping=early_stopping,
        random_state=42
    )
    clf.fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)
    test_acc  = clf.score(X_test, y_test)
    return train_acc, test_acc, clf

# Sweep a few sizes
sizes = [(2,), (8,), (16, 8), (32, 16, 8)]

print("Hidden-layer sweep (alpha=0.0, early_stopping=False):")
for h in sizes:
    tr, te, model = eval_mlp(h, alpha=0.0, early_stopping=False)
    print(f"{h}: train={tr:.3f} | test={te:.3f} | epochs={model.n_iter_}")


Hidden-layer sweep (alpha=0.0, early_stopping=False):
(2,): train=0.817 | test=0.793 | epochs=383
(8,): train=0.844 | test=0.782 | epochs=384
(16, 8): train=0.875 | test=0.788 | epochs=771
(32, 16, 8): train=0.909 | test=0.760 | epochs=556
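
In this sweep, train accuracy climbs from 0.817 to 0.909 as the network grows, while test accuracy drops from 0.793 to 0.760. Counting trainable parameters helps explain why. The sketch below assumes 10 input columns (the width of `X_train`) and a single output unit, which is how `MLPClassifier` handles binary targets; `count_mlp_params` is a hypothetical helper, not a scikit-learn function.

```python
def count_mlp_params(n_inputs, hidden, n_outputs=1):
    """Count weights and biases of a fully connected network."""
    layer_sizes = [n_inputs, *hidden, n_outputs]
    return sum(
        fan_in * fan_out + fan_out  # weight matrix plus one bias per unit
        for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )

n_inputs = 10  # columns in X_train after preprocessing
sizes = [(2,), (8,), (16, 8), (32, 16, 8)]

counts = {h: count_mlp_params(n_inputs, h) for h in sizes}
for h, n in counts.items():
    print(h, n)
```

The largest architecture has 1,025 parameters, more than the 712 training rows, so it has enough capacity to memorize noise. The `alpha` and `early_stopping` arguments of `eval_mlp` are the natural knobs to try next.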
