


---


# Titanic Survival Prediction using PyTorch


---



**Objective:**

To build a machine learning model using PyTorch to predict whether a passenger survived the Titanic shipwreck, based on historical data such as age, sex, ticket class, fare, and more.

**How This Adds Value:**

This project demonstrates my ability to clean real-world data, build and train deep learning models using PyTorch, and generate accurate predictions for a real Kaggle competition.

In [None]:
# For Google Colab: Upload train.csv manually using file picker
from google.colab import files

uploaded = files.upload()

Saving train.csv to train (1).csv


In [None]:
# Load the uploaded file (run the upload cell first)
import pandas as pd

# Load the training dataset
df = pd.read_csv("train.csv")

# Show first five rows
print(df.head())

# Show dataset shape (rows, columns)
print(f"Shape: {df.shape}")

# Check column info and datatype
print("Info:")
print(df.info())

# Check for missing values
print("Missing values:")
print(df.isnull().sum())

# Total number of rows
print(f"Total number of rows (passengers): {len(df)}")


   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
Sh

In [None]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Step 2: Handle Missing Values

**Check Current Missing Values**

In [None]:
df.isnull().sum()

Unnamed: 0,0
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0


**Fill Missing Age with Median**

In [None]:
df["Age"] = df["Age"].fillna(df["Age"].median())


**Drop the Cabin Column**

In [None]:
df.drop(columns=["Cabin"], inplace=True)

**Fill the Missing Embarked Values**

In [None]:
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])


**Encode Sex before using it in a machine learning model.**

PyTorch models operate on numeric tensors, so the string labels 'male' and 'female' must be mapped to numbers first.

In [None]:
df["Sex"] = df["Sex"].map({'male': 0, 'female': 1})

The Sex column was encoded numerically, mapping 'male' to 0 and 'female' to 1. This gives the model a consistent numeric input while preserving the original meaning.

**Drop Name, Ticket, and PassengerId**

In [None]:
df.drop(columns=['Name','Ticket','PassengerId'], inplace=True)

* Name, Ticket, and PassengerId were dropped because their values are mostly unique per passenger and carry little direct signal for a survival prediction model. They could be explored in more advanced feature engineering, but for this project they were excluded to reduce noise and simplify the model.

**Now we’ll convert the Embarked column into numerical form**

In [None]:
# one-hot encode 'Embarked' column
df=pd.get_dummies(df, columns=['Embarked'], drop_first=True)

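Since machine learning models require numeric input, the Embarked column was one-hot encoded using pd.get_dummies(). This converted the text categories (C, Q, S) into binary columns (Embarked_Q, Embarked_S). With drop_first=True the first category, C, gets no column of its own: a passenger who embarked at C has both remaining columns set to False, so a third column would be redundant.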
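
The effect of `drop_first=True` can be checked on a small toy frame. This is only a sketch: the four rows below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with the three embarkation ports that appear in the Titanic data.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Categories are sorted (C, Q, S), and drop_first=True removes the first one.
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)
print(list(encoded.columns))

# Rows where both indicator columns are 0/False belong to the dropped category, C.
is_c = (encoded["Embarked_Q"] == 0) & (encoded["Embarked_S"] == 0)
print(is_c.tolist())
```

Only the second toy row is flagged, which is the one row that embarked at C.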

In [None]:
# checking datatypes
print(df.dtypes)

Survived        int64
Pclass          int64
Sex             int64
Age           float64
SibSp           int64
Parch           int64
Fare          float64
Embarked_Q       bool
Embarked_S       bool
dtype: object


In [None]:
df['Age'].dtype

dtype('float64')

In [None]:
df['Age']=pd.to_numeric(df['Age'], errors='coerce')

In [None]:
df['Age'].dtype

dtype('float64')

The Titanic dataset was cleaned by handling missing values (filling Age with the median and Embarked with the mode) and dropping Cabin along with features not used by the model, such as Name, Ticket, and PassengerId. The categorical features Sex and Embarked were converted into numerical form using mapping and one-hot encoding. All columns were checked to ensure they were numeric and free from null values, making the dataset suitable for machine learning model training.


---


# Let’s build your PyTorch neural network model


---



# **Convert Data to PyTorch Tensors**


**1. Split into Features and Target**





In [None]:
# separate features (x) and target (y)
y=df['Survived']
x=df.drop(columns=['Survived'])

 **2. Split into Training and Validation Sets**

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,
                                                  y,
                                                  test_size=0.2,
                                                  random_state=42
                                                  )

In [None]:
from sklearn.preprocessing import StandardScaler

# step 1: create the scaler
scaler=StandardScaler()

# step 2: fit the scaler on the training data and transform it
x_train_scaled=scaler.fit_transform(x_train)

# step 3: now transform the validation data using the same scaler
x_test_scaled=scaler.transform(x_test)
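
As a rough illustration of why the scaler is fitted on the training split only, the sketch below uses a made-up single-feature array. The name `scaler_demo` and the numbers are assumptions for the example, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (made-up numbers).
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler_demo.transform(test)        # reuses the training mean and std

# The training split ends up with mean 0 and standard deviation 1.
print(train_scaled.mean(), train_scaled.std())

# The held-out value is expressed in training-set standard deviations.
print(test_scaled)
```

Because the held-out data is transformed with the training statistics, nothing about its distribution leaks into preprocessing.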

In [None]:
import numpy as np

x_train_scaled=np.nan_to_num(x_train_scaled)
x_test_scaled=np.nan_to_num(x_test_scaled)

# Convert to Tensors

In [None]:
import torch

x_train_tensor=torch.tensor(x_train_scaled, dtype=torch.float32)
y_train_tensor=torch.tensor(y_train.values, dtype=torch.long)
x_test_tensor=torch.tensor(x_test_scaled, dtype=torch.float32)
y_test_tensor=torch.tensor(y_test.values, dtype=torch.long)

After scaling the features using StandardScaler, the data was converted into PyTorch tensors. Input features were converted to float32 tensors and target labels to long (int64) tensors, because nn.CrossEntropyLoss expects its class-index targets in that dtype.
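
A minimal sketch of the shapes and dtypes `nn.CrossEntropyLoss` works with, using made-up logits rather than real model output. The manual computation is included only to show what the loss averages.

```python
import torch
from torch import nn

# Two samples, two classes: raw, unnormalized scores (logits) of shape (N, 2).
logits = torch.tensor([[2.0, 0.5], [0.1, 1.2]], dtype=torch.float32)

# Class indices of shape (N,), stored as int64 (torch.long).
targets = torch.tensor([0, 1], dtype=torch.long)

loss_value = nn.CrossEntropyLoss()(logits, targets)

# The same quantity written out: mean negative log-softmax at the target index.
manual = -torch.log_softmax(logits, dim=1)[torch.arange(2), targets].mean()

print(loss_value.item(), manual.item())
```

The loss applies log-softmax internally, which is why the model's last layer outputs raw logits with no activation.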

# Build and Train the Model

In [None]:
from torch import nn

class Titanic(nn.Module):
  def __init__(self,input_size:int, output_size:int, hidden_units:int):
    super().__init__()
    self.layer_stack=nn.Sequential(
        nn.Linear(in_features=input_size,out_features=hidden_units),
        nn.ReLU(),
        nn.Dropout(p=0.3),
        nn.Linear(in_features=hidden_units,out_features=hidden_units),
        nn.ReLU(),
        nn.Dropout(p=0.3),
        nn.Linear(in_features=hidden_units,out_features=output_size)
    )

  def forward(self,x):
    return self.layer_stack(x)

In [None]:
# create an instance of our model
model=Titanic(input_size=x_train_tensor.shape[1],
              output_size=2,
              hidden_units=64)
model

Titanic(
  (layer_stack): Sequential(
    (0): Linear(in_features=8, out_features=64, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.3, inplace=False)
    (3): Linear(in_features=64, out_features=64, bias=True)
    (4): ReLU()
    (5): Dropout(p=0.3, inplace=False)
    (6): Linear(in_features=64, out_features=2, bias=True)
  )
)

# Set Up the Loss Function and Optimizer

In [None]:
# Set up the loss function and optimizer

# CrossEntropyLoss fits here because the model outputs one logit per class (2 classes)
loss_fn=nn.CrossEntropyLoss()

optimizer=torch.optim.Adam(params=model.parameters(), lr=0.01)

In [None]:
# Create an accuracy function
def Accuracy(y_true,y_pred):
  correct=(y_true==y_pred).sum().item()
  total=len(y_true)
  return (correct/total)*100  # convert to a percentage


# Creating the Training Loop

In [None]:
# Create a training and evaluation loop

# Number of epochs
epochs=100

# Early stopping state
best_test_loss=float('inf') # start with infinite loss
patience=10                 # how many consecutive bad epochs to tolerate
trigger_times=0             # how many epochs in a row test loss did NOT improve

## Training loop
for epoch in range(epochs):
  ### Put the model in training mode
  model.train()

  # 1. Forward pass
  y_logits=model(x_train_tensor)
  y_pred=torch.argmax(y_logits, dim=1)

  # 2. Calculate the loss/acc
  loss=loss_fn(y_logits,y_train_tensor)
  acc=Accuracy(y_pred=y_pred,y_true=y_train_tensor)

  # 3. Optimizer zero grad
  optimizer.zero_grad()

  # 4. Loss backward
  loss.backward()

  # 5. Optimizer step
  optimizer.step()

  ### Testing
  model.eval()
  with torch.inference_mode():
    # 1. Forward pass
    test_logits=model(x_test_tensor)
    test_pred=torch.argmax(test_logits, dim=1)
    # 2. Calculate the loss/acc
    test_loss=loss_fn(test_logits,y_test_tensor)
    test_acc=Accuracy(y_pred=test_pred,y_true=y_test_tensor)

    # Early stopping: save the weights whenever test loss improves
    if test_loss<best_test_loss:
      best_test_loss=test_loss.item()
      trigger_times=0
      torch.save(model.state_dict(), 'best_model.pth')
      print(f"model saved at epoch {epoch}")
    else:
      trigger_times+=1
      print(f"triggered: {trigger_times}/{patience}")
      if trigger_times==patience:
        print("Early stopping")
        break

  # Print what's happening
  if epoch%10==0:
    print(f"Epoch: {epoch} | Loss: {loss:.5f} | Acc: {acc:.2f}% | Test Loss: {test_loss:.5f} | Test Acc: {test_acc:.2f}%")



model saved at epoch 0
Epoch: 0 | Loss: 0.72482 | Acc: 36.94% | Test Loss: 0.63208 | Test Acc: 74.86%
model saved at epoch 1
model saved at epoch 2
model saved at epoch 3
model saved at epoch 4
model saved at epoch 5
model saved at epoch 6
model saved at epoch 7
model saved at epoch 8
model saved at epoch 9
model saved at epoch 10
Epoch: 10 | Loss: 0.44771 | Acc: 80.90% | Test Loss: 0.43148 | Test Acc: 79.89%
triggered: 1/10
triggered: 2/10
triggered: 3/10
triggered: 4/10
triggered: 5/10
triggered: 6/10
triggered: 7/10
triggered: 8/10
triggered: 9/10
triggered: 10/10
Early stopping


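Early stopping was used to limit overfitting. The loop monitors the loss on the held-out split, saves the model weights to best_model.pth whenever that loss improves, and stops after 10 consecutive epochs without improvement.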
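
The patience logic can be isolated from PyTorch and traced on a plain list of losses. The sketch below is an illustration only: `early_stopping_epoch` is a hypothetical helper, not a function from the notebook or from any library, and the loss values are made up.

```python
def early_stopping_epoch(losses, patience):
    """Hypothetical helper: return (stop_epoch, best_epoch) for per-epoch losses."""
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            # Improvement: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs == patience:
                return epoch, best_epoch
    # Patience never ran out: training ends at the last epoch.
    return len(losses) - 1, best_epoch


# Made-up loss curve: improves for three epochs, then stalls.
losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.53]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

With a patience of 3, training stops at epoch 5 while the best weights come from epoch 2, which is why the best checkpoint is reloaded before making predictions.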

In [None]:
# Check the training features for NaN or infinite values
print(torch.isnan(x_train_tensor).any())
print(torch.isinf(x_train_tensor).any())

tensor(False)
tensor(False)


# Load Best Model

In [None]:
model.load_state_dict(torch.load('best_model.pth'))
model.eval()

Titanic(
  (layer_stack): Sequential(
    (0): Linear(in_features=8, out_features=64, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.3, inplace=False)
    (3): Linear(in_features=64, out_features=64, bias=True)
    (4): ReLU()
    (5): Dropout(p=0.3, inplace=False)
    (6): Linear(in_features=64, out_features=2, bias=True)
  )
)

# Evaluating our model on test data it has never seen before

**Step 1: Load test.csv**

In [None]:
# For Google Colab: Upload test.csv manually using file picker
from google.colab import files
uploaded=files.upload()

Saving test.csv to test (1).csv


In [None]:
# Then load the file (make sure it’s uploaded)
kaggle_test_df=pd.read_csv("test.csv")

# print first five rows of data
kaggle_test_df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


**Step 2: Clean the test.csv data**

In [None]:
kaggle_test_df.isnull().sum()

Unnamed: 0,0
PassengerId,0
Pclass,0
Name,0
Sex,0
Age,86
SibSp,0
Parch,0
Ticket,0
Fare,1
Cabin,327


In [None]:
# Fill missing values in the Age and Fare columns with each column's median
# (Fare has one missing value in test.csv, which would otherwise reach the model as NaN)
kaggle_test_df['Age'] = kaggle_test_df['Age'].fillna(kaggle_test_df['Age'].median())
kaggle_test_df['Fare'] = kaggle_test_df['Fare'].fillna(kaggle_test_df['Fare'].median())


In [None]:
kaggle_test_df.shape

(418, 11)

The Cabin column has 327 missing values out of 418 rows (about 78%), so it is dropped, as it was for the training data.

In [None]:
# Drop Cabin (mostly missing) and the columns not used as features
kaggle_test_df.drop(columns=['Cabin','Name','Ticket','PassengerId'], inplace=True)

With missing values handled, the next step is to convert the data into numerical form.

In [None]:
# Convert string values to integers
kaggle_test_df['Sex']=kaggle_test_df['Sex'].map({'male': 0, 'female': 1})

In [None]:
kaggle_test_df=pd.get_dummies(kaggle_test_df, columns=['Embarked'], drop_first=True,dtype=int)

In [None]:
# Scale the features with the scaler fitted on the training data
kaggle_test_df=scaler.transform(kaggle_test_df)
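
The scaler expects the same feature columns, in the same order, as the training data. One possible safeguard is sketched below, assuming a test file in which one category never occurs. The column list and the toy rows are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical column order of the training features.
train_columns = ["Pclass", "Sex", "Embarked_Q", "Embarked_S"]

# Toy test frame in which no passenger embarked at Q, so get_dummies
# never creates an Embarked_Q column.
toy_test = pd.DataFrame({"Pclass": [3, 1], "Sex": [0, 1], "Embarked": ["S", "C"]})
toy_test = pd.get_dummies(toy_test, columns=["Embarked"], drop_first=True, dtype=int)

# Add any missing columns (filled with 0) and enforce the training order.
aligned = toy_test.reindex(columns=train_columns, fill_value=0)
print(list(aligned.columns))
```

In this notebook the test columns already match the training columns, so the step is a precaution rather than a required fix.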

In [None]:
# checking datatype
kaggle_test_df.dtype

dtype('float64')

**Now that the data is clean, let’s convert it into tensors.**

# Step 3: Convert to Tensors

In [None]:
kaggle_test_tensor=torch.tensor(kaggle_test_df, dtype=torch.float32)

# Step 4: Load Your Best Model & Predict

In [None]:
model.load_state_dict(torch.load('best_model.pth'))
model.eval()

with torch.inference_mode():
  test_logits=model(kaggle_test_tensor)
  test_preds=torch.argmax(test_logits, dim=1)

print(test_preds[:20])


tensor([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1])


In [None]:
# Step 1: Reload Kaggle test data to recover PassengerId
original_test_df=pd.read_csv("test.csv")

# Step 2: Extract PassengerId
passenger_ids=original_test_df['PassengerId']

# Step 3: Create submission DataFrame
submission_df=pd.DataFrame({
    "PassengerId":passenger_ids,
    "Survived":test_preds.numpy() # Model’s predictions (0s and 1s)
})

# Step 4: Save submission file
submission_df.to_csv("submission.csv", index=False)

# Optional: Preview first few rows
print(submission_df.head(10))

   PassengerId  Survived
0          892         0
1          893         1
2          894         0
3          895         0
4          896         1
5          897         0
6          898         1
7          899         0
8          900         1
9          901         0


In [None]:
print(torch.unique(test_preds))

tensor([0, 1])


**Conclusion & Reflection**

In this project, I successfully built a machine learning pipeline using PyTorch to predict Titanic passenger survival. I handled real-world challenges like missing data, categorical encoding, feature scaling, and model evaluation.

This project strengthened my understanding of model development, PyTorch fundamentals, and real-world data preprocessing, all of which are essential skills for a career in data science and machine learning.