In [4]:
import numpy as np
import pandas as pd
import random
import os 
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

SEED = 42

def set_reproducibility(seed):
    # 1. Standard Python randomness
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

    # 2. Modern NumPy (Generator-based)
    # Use this 'rng' for any direct numpy random calls in your code
    rng = np.random.default_rng(seed)
    # Optional: still set legacy global seed for libraries that depend on it
    np.random.seed(seed) 

    # 3. PyTorch (CPU and all GPUs)
    torch.manual_seed(seed) # for CPU/GPU weight initialization
    torch.cuda.manual_seed_all(seed) # ensures all GPUs are seeded if you use multiple. 

    # 4. Force deterministic cuDNN algorithms
    # This makes cuDNN-backed operations such as convolutions repeatable across runs
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    
    return rng

# Initialize your setup
rng = set_reproducibility(SEED)

For a neural network project, it is best practice to seed all possible sources of randomness at once to ensure your weights and training paths are identical every time you run the script. Setting os.environ['PYTHONHASHSEED'] remains a standard part of "all-in-one" reproducibility scripts. There is a major technical limitation you must know: modifying this inside your script usually does nothing.
Python reads this environment variable only once, at the very moment the interpreter starts. By the time your code reaches os.environ['PYTHONHASHSEED'] = ..., the interpreter has already initialized its hashing mechanism with a random value. 
How to use it correctly
If you need hash values to be reproducible across runs, you must set the variable before the interpreter starts, for example from a terminal (POSIX shell syntax):

export PYTHONHASHSEED=42 && python my_script.py
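
One way to see that the variable takes effect only at interpreter startup is to launch fresh Python processes with it already set. The sketch below is illustrative only: the helper name `hash_in_fresh_interpreter` and the string `'titanic'` are made up for this example.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(seed):
    """Return hash('titanic') from a new Python process started with PYTHONHASHSEED=seed."""
    # Copy the current environment and set the variable before the child starts.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('titanic'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


first = hash_in_fresh_interpreter(42)
second = hash_in_fresh_interpreter(42)
print(first == second)  # True: same seed at startup gives the same string hash
```

Assigning `os.environ['PYTHONHASHSEED']` inside a running script still affects child processes started afterwards, which is why the helper works; it just cannot change the hashing of the process that is already running.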


In [6]:
# Load data 
train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
test_df  = pd.read_csv("/kaggle/input/titanic/test.csv")

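# Identify column types
# Caution: selecting by dtype also picks up "PassengerId" and the target "Survived"
# as numerical features; drop them from num_cols if they should not be model inputs.
num_cols = train_df.select_dtypes(include=["int64", "float64"]).columns
cat_cols = train_df.select_dtypes(include=["object"]).columns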

# Convert to Tensors/Lists
# Numerical data must be a Float Tensor for nanmean to work
numerical_tensor = torch.tensor(train_df[num_cols].values, dtype=torch.float32)
# Categorical data must be a list of lists for the _build_mappings logic
categorical_data = train_df[cat_cols].astype(str).values.tolist()

# Define target
y_tensor = torch.tensor(train_df["Survived"].values, dtype=torch.float32)

A class is a way to group data (attributes) and actions (methods) together into one package. PyTorch uses classes for almost everything because machine learning models are stateful. Unlike a function, which forgets its local variables once it finishes, an instance of a class keeps its attributes between calls.

self refers to the specific object being created, and assigning to self.something attaches information to that object. cat_mappings=None is a setting that allows the class to either build a new category-to-number mapping or reuse an existing one.

The underscore _ at the start of the name is a Python convention. It signals to other programmers: "This is an internal helper tool. You don't need to call this yourself; the class will use it automatically during setup."
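
The three ideas above (state stored on self, an optional mapping argument, and an underscore-prefixed helper) can be sketched in a few lines. `LabelEncoder` here is a hypothetical single-column class written only for illustration; it is not the class used later in this notebook.

```python
class LabelEncoder:
    """Hypothetical single-column label encoder, for illustration only."""

    def __init__(self, values, mapping=None):
        # Reuse a mapping when one is given; otherwise build one from the data.
        self.mapping = mapping if mapping is not None else self._build_mapping(values)
        # The encoded values are stored on the object, so they persist after __init__ returns.
        self.codes = [self.mapping[v] for v in values]

    def _build_mapping(self, values):
        # Internal helper: sorted unique values -> consecutive integers.
        return {v: i for i, v in enumerate(sorted(set(values)))}


train = LabelEncoder(["S", "C", "S", "Q"])
test = LabelEncoder(["Q", "S"], mapping=train.mapping)  # reuse the training mapping

print(train.mapping)  # {'C': 0, 'Q': 1, 'S': 2}
print(test.codes)     # [1, 2]
```

Reusing the training mapping for new data keeps each category tied to the same integer in both sets.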


Docstrings act as a "manual" for anyone reading your code (including your future self). By writing [N, num_features], the programmer is saying: "This class expects a 2D table. If you try to pass a 1D list or a 3D cube of data, the code will break."

The Docstring (""" ... """)
A Docstring (Documentation String) is meant for the user of the code and the Python system itself. 
Format: Wrapped in triple quotes (""") right under a class or function definition.
Behavior: Python actually stores this text as a special attribute called __doc__. It is not ignored.
Purpose: It explains how to use the class/function, what the inputs are, and what it returns.
Visibility: You can see it without opening the file. You can access it by typing help(YourClassName) or hovering your mouse over the code in a modern editor.
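
A small sketch of the difference between a docstring and a comment. The class `Scaler` is a made-up example, not part of this notebook's pipeline.

```python
class Scaler:
    """Multiply every value in a list by a constant factor.

    values: list of floats
    """

    # This is a regular comment: Python discards it and it never reaches __doc__.
    def __init__(self, factor):
        self.factor = factor

    def apply(self, values):
        """Return a new list with each value multiplied by the factor."""
        return [v * self.factor for v in values]


# The docstring is stored on the class and can be read at runtime.
first_line = Scaler.__doc__.splitlines()[0]
print(first_line)  # Multiply every value in a list by a constant factor.
```

Calling help(Scaler) shows the same text, together with the docstrings of the methods.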

Understanding dim (short for dimension) is one of the most important concepts for manipulating data in PyTorch.
When you see dim=0 in a function like torch.nanmean(data, dim=0), you are telling PyTorch which axis to collapse when performing a calculation on a table (tensor).

Think of a standard data table (a 2D Tensor):
dim=0 indexes the rows. Reducing over dim=0 moves down the rows (vertically) and returns one value per column.
dim=1 indexes the columns. Reducing over dim=1 moves across the columns (horizontally) and returns one value per row.
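
A minimal sketch of the two directions on a made-up 3x2 table with one missing value (the numbers are invented for illustration):

```python
import torch

# 3 rows (samples) x 2 columns (features), with one NaN in the second column.
table = torch.tensor([[1.0, 10.0],
                      [3.0, float("nan")],
                      [5.0, 30.0]])

# dim=0: collapse the rows, leaving one mean per column.
col_means = torch.nanmean(table, dim=0)
# dim=1: collapse the columns, leaving one mean per row.
row_means = torch.nanmean(table, dim=1)

print(col_means)  # tensor([ 3., 20.])
print(row_means)  # tensor([ 5.5000,  3.0000, 17.5000])
```

The result of dim=0 has one entry per feature, which is the shape needed for per-column mean imputation.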

In PyTorch logic, we prefer Label Encoding because:
Memory: It keeps your data small (just one column of integers).
Embeddings: PyTorch has a powerful layer called nn.Embedding that takes these integers and turns them into "concept vectors," allowing the model to learn that "London" and "Paris" are both European cities.
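
As a rough sketch of how label-encoded integers feed nn.Embedding: the city mapping and the embedding size of 4 below are arbitrary choices for illustration, and the vectors are random until the layer is trained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical mapping produced by label encoding.
mapping = {"London": 0, "New York": 1, "Paris": 2}
cities = ["Paris", "London", "Paris"]
codes = torch.tensor([mapping[c] for c in cities], dtype=torch.long)

# One learnable 4-dimensional vector per category.
embedding = nn.Embedding(num_embeddings=len(mapping), embedding_dim=4)
vectors = embedding(codes)

print(vectors.shape)                        # torch.Size([3, 4])
print(torch.equal(vectors[0], vectors[2]))  # True: both rows are "Paris"
```

Note that the model later in this notebook does not use nn.Embedding; it passes the integer codes straight into nn.Linear as ordinary numbers.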

In [7]:
# user-defined class that inherits from a built-in PyTorch template called Dataset.
class PreprocessingDataset(Dataset):
    
    # Method definition (A function that belongs to a specific class)
    def __init__(self, numerical_data, categorical_data, cat_mappings=None):

        #Docstrings
        """
        numerical_data: Tensor with NaNs [N, num_features]
        categorical_data: List of lists or Array of strings [N, num_cats]
        """
        # 1. IMPUTATION (Logical Mean Imputation)
        # Calculate mean while ignoring NaNs
        self.num_means = torch.nanmean(numerical_data, dim=0)
        # Fill NaNs with the calculated mean
        self.num_data = torch.where(torch.isnan(numerical_data), self.num_means, numerical_data)
        
        # 2. STANDARDIZATION (Z-Score)
        self.num_std = torch.std(self.num_data, dim=0)
        # Avoid division by zero for constant features
        self.num_std[self.num_std == 0] = 1.0 # replace 0 with 1. Dividing by 1.0 doesn't change the value of the numerator
        self.num_standardized = (self.num_data - self.num_means) / self.num_std
        
        # 3. CATEGORICAL ENCODING (Label Encoding)
        self.cat_mappings = cat_mappings or self._build_mappings(categorical_data) # This line decides whether to create or reuse the mappings.
        self.cat_encoded = self._encode(categorical_data)

    # Implements label encoding: build one value-to-integer dictionary per column
    def _build_mappings(self, data):
        mappings = []
        # This loop goes through your data column by column.
        for col_idx in range(len(data[0])): # data[0] looks at the first row to count how many columns exist.
            # set(...): It grabs every value in the column and throws away duplicates. If "London" appears 500 times, the set keeps it only once.
            # list(...): It converts those unique items into a list.
            # sorted(...): It puts them in alphabetical order. This is crucial for consistency—it ensures that "London" always gets the same number every time you run the code
            unique_vals = sorted(list(set(row[col_idx] for row in data)))
            # Creates a dictionary for the columns
            # enumerate: Assigns a number to each item in your sorted list (0, 1, 2...). It creates a map like {"London": 0, "New York": 1, "Paris": 2}.
            mappings.append({val: i for i, val in enumerate(unique_vals)})
        return mappings

    def _encode(self, data):
        encoded = []
        for row in data:
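            # The actual "translation": look up each value in its column's mapping
            # Unseen values fall back to 0, which is also the code of the first known category
            encoded_row = [self.cat_mappings[i].get(val, 0) for i, val in enumerate(row)]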
            encoded.append(encoded_row)
        # After all rows are processed, the code turns the big list of numbers into a PyTorch Tensor.
        return torch.tensor(encoded, dtype=torch.long) # In PyTorch, categorical labels must be integers (64-bit integers, specifically)
        
    # return a single number: the total number of rows in the data.
    def __len__(self):
        return len(self.num_standardized)

    # to get a single row of data when given an index (like "row #42").
    def __getitem__(self, idx):
        return self.num_standardized[idx], self.cat_encoded[idx]
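
In the class above, `.get(val, 0)` sends unseen values to 0, the same code the first sorted category receives. One possible alternative, sketched here with hypothetical helper names (`build_mapping`, `encode`, `UNKNOWN`), is to start real categories at 1 and reserve 0 for unknown values.

```python
UNKNOWN = 0  # hypothetical reserved code for values not seen when the mapping was built


def build_mapping(values):
    # Start real categories at 1 so that 0 stays free for "unknown".
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}


def encode(values, mapping):
    # Known values get their code; anything else gets the reserved UNKNOWN code.
    return [mapping.get(v, UNKNOWN) for v in values]


mapping = build_mapping(["male", "female", "male"])
print(mapping)                               # {'female': 1, 'male': 2}
print(encode(["female", "other"], mapping))  # [1, 0]
```

This is a design sketch, not a change to the notebook's results; adopting it would shift every category code by one.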

In [10]:
# --- EXECUTION ---

# Initialize the cleaning class
preprocessor = PreprocessingDataset(numerical_tensor, categorical_data)

# --- STEP 1: Finalize X_np and y_np ---
# num_standardized and cat_encoded are tensors from our previous cleaning steps
# Combine standardized numbers and encoded categories into one matrix
X_tensor = torch.cat([preprocessor.num_standardized, preprocessor.cat_encoded], dim=1)

X_np = X_tensor.numpy()    # Convert features to a NumPy array
y_np = y_tensor.numpy()    # Convert labels to a NumPy array

In [11]:
# PyTorch Dataset
class TitanicDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32).view(-1, 1) # Align shape for loss

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

train_ds = TitanicDataset(X_np, y_np)
# Create a pipeline that feeds the neural network
# batch_size groups the data into sets of 32 samples at a time; shuffle=True reorders the rows every epoch
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)

Processing the entire dataset at once can exceed your computer's memory (RAM/VRAM) when the dataset is large. Hence the DataLoader splits it into batches of batch_size samples, and the model's weights are updated once per batch.
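
A quick sketch of what batching looks like in practice, using 100 rows of random stand-in data (the sizes are chosen only for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 100 samples with 12 features each, plus one label per sample.
features = torch.randn(100, 12)
labels = torch.zeros(100, 1)

loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

# Collect the number of samples in every batch of one pass over the data.
batch_sizes = [x_batch.shape[0] for x_batch, _ in loader]

print(len(loader))   # 4
print(batch_sizes)   # [32, 32, 32, 4]
```

The last batch is smaller because 100 is not a multiple of 32; DataLoader keeps it unless drop_last=True is passed.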

In [12]:
# Model definition
class BaselineNet(nn.Module): # The physical structure of the brain (the neurons).
    def __init__(self, input_size):
        super(BaselineNet, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 1),
            nn.Sigmoid()
        )
    
    def forward(self, x): # The thought process (data goes in, a decision comes out).
        return self.net(x)

# setup
model = BaselineNet(input_size=X_np.shape[1])
criterion = nn.BCELoss() # Binary Cross Entropy for 0/1 prediction (how wrong was the decision?)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01) # The learning process (adjusting the brain so there is less regret next time).


This is a Multilayer Perceptron (MLP). It uses a series of "Linear" layers to transform input data into a prediction.
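nn.Module: This is the base class for all neural networks in PyTorch. Inheriting from it lets your model register its layers and their weights, so that model.parameters() can find them and the optimizer can update them. (The gradients themselves are tracked by PyTorch's autograd on the weight tensors.)
super().__init__(): This initializes the parent nn.Module (the cell above writes it as super(BaselineNet, self).__init__(), which is equivalent in Python 3). Without it, assigning a layer such as self.net raises an error, because the module's internal registries have not been created yet.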
nn.Sequential: This is a container that runs layers in order, like an assembly line.
nn.Linear(input_size, 16): The "Input Layer." It takes your features (Age, Sex, etc.) and expands them to 16 hidden neurons.
nn.ReLU(): An activation function. It turns negative numbers into 0. This "non-linearity" is what allows the model to learn complex patterns instead of just simple lines.
nn.Linear(16, 8): A "Hidden Layer." It condenses the 16 features down to 8, looking for deeper patterns.
nn.Linear(8, 1): The "Output Layer." It condenses everything down to a single number.
nn.Sigmoid(): This squashes the output number to be between 0 and 1. This is perfect for the Titanic dataset because we want a probability (e.g., "There is a 0.85 chance this person survived").

In PyTorch, you don't call .predict(). Instead, you define a forward method. When you do model(X), PyTorch automatically runs this method. It simply passes the data x through the self.net assembly line we defined in the setup.
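
The sketch below uses a smaller stand-in network, `TinyNet` (a hypothetical name with arbitrary layer sizes), to show that calling the model runs forward and that the Sigmoid keeps every output between 0 and 1.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, 4),
            nn.ReLU(),
            nn.Linear(4, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)


torch.manual_seed(0)
model = TinyNet(input_size=3)
x = torch.randn(5, 3)  # 5 samples, 3 features

with torch.no_grad():
    via_call = model(x)             # preferred: goes through nn.Module.__call__
    via_forward = model.forward(x)  # same computation, called directly

print(via_call.shape)                      # torch.Size([5, 1])
print(torch.equal(via_call, via_forward))  # True
```

Calling model(x) is preferred over model.forward(x) because the former also runs any hooks registered on the module.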

input_size=X_np.shape[1]: This tells the first layer how many columns are in your data. shape[1] refers to the number of columns (features).
nn.BCELoss(): Stands for Binary Cross Entropy. This is the standard "scoring" system for 0/1 problems.
If the model predicts 0.9 and the person survived (1), the loss is low.
If the model predicts 0.1 and the person survived (1), the loss is high.
torch.optim.SGD: The Stochastic Gradient Descent optimizer. This is the component that updates the model's weights after each batch.
model.parameters(): You tell the optimizer: "These are the weights you are allowed to change."
lr=0.01: The Learning Rate. This controls how big of a "step" the model takes when correcting an error. Too big, and it overshoots; too small, and it takes forever to learn.
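
The two BCELoss cases described above can be checked numerically. For a single sample with label 1, the loss is -ln(p), where p is the predicted probability.

```python
import math

import torch
import torch.nn as nn

criterion = nn.BCELoss()
target = torch.tensor([1.0])  # the person survived

loss_good = criterion(torch.tensor([0.9]), target).item()  # confident and right
loss_bad = criterion(torch.tensor([0.1]), target).item()   # confident and wrong

print(round(loss_good, 4))  # 0.1054, i.e. -ln(0.9)
print(round(loss_bad, 4))   # 2.3026, i.e. -ln(0.1)
```

The wrong prediction costs roughly twenty times more than the right one, which is what pushes the weights toward better probabilities.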
Summary Analogy
BaselineNet: The physical structure of the brain (the neurons).
forward: The thought process (data goes in, a decision comes out).
BCELoss: The feeling of regret (how wrong was the decision?).
SGD Optimizer: The learning process (adjusting the brain so there is less regret next time).
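
The "learning process" in the analogy comes down to one update rule: new weight = old weight - lr * gradient. A minimal sketch with a single made-up weight and a toy loss of w squared:

```python
import torch

# One trainable weight, starting at 1.0.
w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.01)

loss = (w ** 2).sum()  # toy loss; its gradient with respect to w is 2 * w = 2.0

optimizer.zero_grad()
loss.backward()        # fills w.grad with 2.0
grad_value = w.grad.item()
optimizer.step()       # w <- 1.0 - 0.01 * 2.0

print(grad_value)            # 2.0
print(round(w.item(), 4))    # 0.98
```

A larger lr would move the weight further in one step, which is the overshooting risk mentioned above.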

In [16]:
# TRAINING LOOP
# Set the number of epochs (times the model sees the whole dataset)
epochs = 20

for epoch in range(epochs):
    model.train()
    running_loss = 0.0  # Reset the "accumulator" at the start of each epoch
    correct_predictions = 0
    total_samples = 0
    
    for X_batch, y_batch in train_loader:
        # 1. Reset the gradients (clear the chalkboard)
        optimizer.zero_grad()
        
        # 2. Forward pass (the model makes guesses)
        # The model ends in nn.Sigmoid, so these are probabilities, not raw logits
        probs = model(X_batch)

        # 3. Calculate how wrong the guesses were
        loss = criterion(probs, y_batch)

        # 4. Backward pass (find out who to blame for the error)
        loss.backward()

        # 5. Optimization step (adjust the weights)
        optimizer.step()

        # Add the batch loss to our running total
        # We use .item() to get a plain Python number, so the computation graph is not kept alive
        running_loss += loss.item()

        # --- ACCURACY LOGIC ---
        # 1. Convert probabilities to 0 or 1 (Threshold = 0.5)
        # If probability >= 0.5, predicted = 1. Else 0.
        predicted = (probs >= 0.5).float()
        
        # 2. Count how many matches the actual labels (y_batch)
        correct_predictions += (predicted == y_batch).sum().item()
        total_samples += y_batch.size(0)
    
    # Calculate the average loss for this entire epoch
    avg_loss = running_loss / len(train_loader)
    accuracy = (correct_predictions / total_samples) * 100
    
    # Print progress every epoch (or use an if-statement for every 5th epoch)
    print(f"Epoch [{epoch+1:02d}] | Loss: {avg_loss:.4f} | Accuracy: {accuracy:.2f}%")

print("\nTraining Complete!")

Epoch [01] | Loss: 0.6332 | Accuracy: 65.66%
Epoch [02] | Loss: 0.6369 | Accuracy: 66.11%
Epoch [03] | Loss: 0.6258 | Accuracy: 67.12%
Epoch [04] | Loss: 0.6189 | Accuracy: 66.33%
Epoch [05] | Loss: 0.6265 | Accuracy: 66.22%
Epoch [06] | Loss: 0.6215 | Accuracy: 66.22%
Epoch [07] | Loss: 0.6208 | Accuracy: 66.55%
Epoch [08] | Loss: 0.6252 | Accuracy: 65.10%
Epoch [09] | Loss: 0.6315 | Accuracy: 66.33%
Epoch [10] | Loss: 0.6265 | Accuracy: 66.78%
Epoch [11] | Loss: 0.6242 | Accuracy: 67.79%
Epoch [12] | Loss: 0.6178 | Accuracy: 66.67%
Epoch [13] | Loss: 0.6305 | Accuracy: 65.21%
Epoch [14] | Loss: 0.6239 | Accuracy: 67.12%
Epoch [15] | Loss: 0.6341 | Accuracy: 66.22%
Epoch [16] | Loss: 0.6237 | Accuracy: 65.66%
Epoch [17] | Loss: 0.6226 | Accuracy: 66.67%
Epoch [18] | Loss: 0.6171 | Accuracy: 67.23%
Epoch [19] | Loss: 0.6247 | Accuracy: 66.89%
Epoch [20] | Loss: 0.6182 | Accuracy: 65.21%

Training Complete!


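In PyTorch, models have two primary modes: Train and Eval (Evaluation). Calling model.train() at the top of each epoch selects Train mode.
What it does: It tells the model, "I am about to start training, so enable all behaviors necessary for learning."
Why it's needed: Some neural network components (like Dropout or Batch Normalization) behave differently during training than they do during testing.
In Train mode, Dropout randomly zeroes activations and Batch Normalization normalizes with the current batch's statistics.
In Eval mode (model.eval()), Dropout is switched off and Batch Normalization uses its stored running statistics, so the model gives consistent, stable predictions. BaselineNet contains neither layer, so the two modes behave identically here, but switching modes is still a good habit.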
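
A small sketch of the mode switch on a standalone nn.Dropout layer (p=0.5 and the all-ones input are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()           # Train mode: each element is zeroed with probability 0.5,
train_out = drop(x)    # and survivors are scaled by 1 / (1 - p) = 2.0

drop.eval()            # Eval mode: dropout is a no-op
eval_out = drop(x)

print(train_out)               # a mix of 0.0 and 2.0 values
print(torch.equal(eval_out, x))  # True
```

The same .train() and .eval() calls on a full model switch every such layer inside it at once.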

In [14]:
# --- FINAL VERIFICATIONS ---
# PRINT RESULTS
# 1. Print Final Loss
print(f"\nFinal Average Loss: {avg_loss:.4f}")

# 2. Run a Forward Pass (on one batch)
model.eval()  # Switch to evaluation mode
with torch.no_grad():  # Disable gradient tracking to save memory
    first_batch_features, _ = next(iter(train_loader))
    forward_output = model(first_batch_features)

# 3. Confirm no NaNs
nan_check = torch.isnan(forward_output).any().item()
print(f"Forward Pass successful. Contains NaNs: {nan_check}")

# 4. Confirm Shapes
print(f"Input Shape: {first_batch_features.shape}")   # Expected: [32, Total_Features]
print(f"Output Shape: {forward_output.shape}")       # Expected: [32, 1]


Final Average Loss: 0.6226
Forward Pass successful. Contains NaNs: False
Input Shape: torch.Size([32, 12])
Output Shape: torch.Size([32, 1])


In [15]:
print(X_np.shape[1])

12


## Reflections
Even without tuning:

- Loss went down
- The model learned something

But:

- It didn’t obviously beat logistic regression
- It felt fragile
- Small changes caused big behavior shifts

That’s not a bug.