# Session 17-18 Support Vector Machine

# Exercise: Titanic Survival Prediction Using SVM

You are a data scientist tasked with analyzing the Titanic dataset:

https://www.kaggle.com/competitions/titanic

The goal of this exercise is to build and evaluate a Support Vector Machine (SVM) classifier to predict whether a passenger survived the Titanic disaster.

# Step 1. Data Preparation

## Goal: Load data, select features, handle missing values, encode categorical variables.

In [1]:
# Step 1: Data Preparation

import pandas as pd
import numpy as np

In [3]:
# Load Titanic dataset
data = pd.read_csv("train.csv")

# Inspect dataset
print(data.head())
print(data.info())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  

In [4]:
# Target variable
y = data["Survived"]

# Select relevant features
features = ["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch"]
X = data[features].copy()  # copy avoids SettingWithCopyWarning

# Handle missing values
X["Age"] = X["Age"].fillna(X["Age"].median())

# Encode categorical variable
# male -> 0, female -> 1
X["Sex"] = X["Sex"].map({"male": 0, "female": 1})
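A quick sanity check after preprocessing can confirm that no missing values remain. The sketch below applies the same two steps to a small invented DataFrame (the `toy` values are made up for illustration and are not Titanic rows).

```python
import numpy as np
import pandas as pd

# Toy stand-in for two of the feature columns (values are invented).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
})

# Same preprocessing steps as in the cell above.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

# map() yields NaN for any label missing from the dictionary,
# so a NaN count also catches unexpected category values.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running the same `isna().sum()` check on `X` before the train-test split would likely be a reasonable safeguard.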

# Step 2. Train–Test Split

## Goal: Split data into 70% training and 30% testing.

In [5]:
# Step 2: Train-Test Split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.30,
    random_state=42,
    stratify=y  # keeps class balance
)

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)


Training set size: (623, 6)
Testing set size: (268, 6)
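The effect of `stratify=y` can be illustrated on made-up labels. This is a minimal sketch: the 60/40 class mix and the names `y_toy`, `X_toy` are assumptions chosen for illustration, not properties of the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented labels: 60 negatives and 40 positives.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy
)

# With stratification, both subsets keep the original share of positives.
print("Positive share (all)  :", y_toy.mean())
print("Positive share (train):", y_tr.mean())
print("Positive share (test) :", y_te.mean())
```

Without `stratify`, the two shares would generally differ from split to split, which can make test metrics harder to compare.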


# Step 3. Model Development (SVM)

## Goal: Train an SVM classifier and specify kernel + hyperparameters.

We use the RBF (radial basis function) kernel, which can model non-linear decision boundaries.
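For intuition, the RBF kernel measures similarity as `exp(-gamma * ||x - z||^2)`, so nearby points score close to 1 and distant points close to 0. The sketch below computes it by hand and compares it with scikit-learn's `rbf_kernel`; the two points and `gamma = 0.5` are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, z, gamma):
    """RBF kernel value for two 1-D vectors."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Two arbitrary points; their squared distance is 2.
x = np.array([0.0, 0.0])
z = np.array([1.0, 1.0])
gamma = 0.5

k_manual = rbf(x, z, gamma)  # exp(-0.5 * 2) = exp(-1)
k_sklearn = rbf_kernel(x.reshape(1, -1), z.reshape(1, -1), gamma=gamma)[0, 0]

print("manual :", k_manual)
print("sklearn:", k_sklearn)
```

Because the kernel depends on distances between rows, features on large scales (such as `Fare`) would dominate unless the inputs are standardized first.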

In [6]:
# Step 3: Model Development

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Build a pipeline:
# - Scale features (VERY important for SVM)
# - Train SVM with RBF kernel
svm_model = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(
        kernel="rbf",     # radial basis function kernel
        C=1.0,            # regularization parameter
        gamma="scale",    # kernel coefficient
        probability=True  # enables predict_proba, used for ROC-AUC in Step 4
    ))
])

# Train the SVM model
svm_model.fit(X_train, y_train)

Pipeline(steps=[('scaler', StandardScaler()),
                ('svm', SVC(probability=True))])


# Step 4. Model Evaluation

## Goal: Evaluate using Accuracy, Precision, Recall, ROC-AUC.

In [8]:
# Step 4: Model Evaluation

from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Predictions
y_pred = svm_model.predict(X_test)
y_pred_proba = svm_model.predict_proba(X_test)[:, 1]

# Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print("SVM Model Performance:")
print("Accuracy :", accuracy)
print("Precision:", precision)
print("Recall   :", recall)
print("ROC-AUC  :", roc_auc)

SVM Model Performance:
Accuracy : 0.832089552238806
Precision: 0.8152173913043478
Recall   : 0.7281553398058253
ROC-AUC  : 0.8223300970873786
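ROC-AUC only needs a ranking score, so it can also be computed from `decision_function` without enabling probability estimates. The sketch below shows this on synthetic data from `make_classification` (an assumption for illustration; the names `demo_model`, `X_demo`, and `y_demo` are hypothetical and the resulting number is unrelated to the Titanic result above).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic features).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

# probability is left at its default (False), which skips the extra
# internal cross-validation used to calibrate predict_proba.
demo_model = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", C=1.0, gamma="scale")),
])
demo_model.fit(X_tr, y_tr)

# Signed distance to the decision boundary; larger means "more positive".
scores = demo_model.decision_function(X_te)
auc_from_scores = roc_auc_score(y_te, scores)
print("ROC-AUC from decision_function:", auc_from_scores)
```

Scores from `decision_function` and `predict_proba` can rank samples slightly differently, so the two AUC values need not match exactly.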


## Drawing ROC curve

In [9]:
# (1): Import required functions
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# (2): Compute False Positive Rate (FPR) and True Positive Rate (TPR)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# (3): Compute Area Under the Curve (AUC)
roc_auc = auc(fpr, tpr)

In [None]:
# (4): Plot ROC curve

plt.figure()
plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.3f})")

# Diagonal line = random classifier
plt.plot([0, 1], [0, 1], linestyle="--", label="Random classifier")

# Labels and title
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - SVM Titanic Survival Prediction")
plt.legend()
plt.show()

# Step 5. Confusion Matrix Analysis

## Goal: Compute and interpret TP, FP, TN, FN.

In [12]:
# Step 5: Confusion Matrix Analysis

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
TN, FP, FN, TP = cm.ravel()

print("Confusion Matrix:")
print(cm)

print("\nInterpretation:")
print("TP:", TP, "- Correctly predicted survivors")
print("FP:", FP, "- Predicted survived but did not")
print("TN:", TN, "- Correctly predicted non-survivors")
print("FN:", FN, "- Predicted non-survivor but survived")

Confusion Matrix:
[[148  17]
 [ 28  75]]

Interpretation:
TP: 75 - Correctly predicted survivors
FP: 17 - Predicted survived but did not
TN: 148 - Correctly predicted non-survivors
FN: 28 - Predicted non-survivor but survived
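The metrics from Step 4 can be recovered directly from these four counts, which is a useful cross-check. The sketch below hard-codes the counts printed above; specificity is an extra metric added here for illustration and is not computed elsewhere in this notebook.

```python
# Counts copied from the confusion matrix output above.
TN, FP, FN, TP = 148, 17, 28, 75

accuracy = (TP + TN) / (TP + TN + FP + FN)  # share of all predictions that are correct
precision = TP / (TP + FP)                  # share of predicted survivors who survived
recall = TP / (TP + FN)                     # share of actual survivors that were found
specificity = TN / (TN + FP)                # share of actual non-survivors that were found

print(f"Accuracy   : {accuracy:.4f}")     # 0.8321
print(f"Precision  : {precision:.4f}")    # 0.8152
print(f"Recall     : {recall:.4f}")       # 0.7282
print(f"Specificity: {specificity:.4f}")  # 0.8970
```

The first three values agree with the `accuracy_score`, `precision_score`, and `recall_score` outputs reported in Step 4.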


# Step 6. Discussion

## Goal: Discuss strengths, weaknesses, and error importance.

Strengths of SVM:
- Effective in high-dimensional spaces
- Works well with non-linear decision boundaries
- Robust to overfitting with proper regularization

Weaknesses of SVM:
- Computationally expensive on large datasets
- Sensitive to feature scaling
- Hyperparameter tuning (C, gamma) is crucial
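Since `C` and `gamma` matter this much, one common approach is a cross-validated grid search over the whole pipeline, so that scaling is refit inside each fold. The sketch below is a minimal example on synthetic data; the grid values and the names `pipe`, `param_grid`, and `search` are illustrative assumptions, not tuned settings for the Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic features).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__gamma": ["scale", 0.01, 0.1],
}

# 5-fold cross-validation, scored by ROC-AUC.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV ROC-AUC:", search.best_score_)
```

On the Titanic data, the search would be fit on `X_train` and `y_train` only, keeping the test set for the final evaluation.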

Error Importance in Titanic Context:
- False Negatives (FN): Predicting death when the passenger survived
  -> More critical, because actual survivors are missed (28 of the 103 survivors in the test set)
- False Positives (FP): Predicting survival when the passenger died
  -> Less critical in this historical analysis (17 of the 165 non-survivors in the test set)

In safety-related problems, minimizing FN is often prioritized, even at the cost of more false positives.
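One way to trade false negatives for false positives is to lower the decision threshold applied to the SVM scores. The sketch below illustrates this on synthetic data; the threshold `-0.5` and all variable names are arbitrary assumptions chosen for illustration, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic features).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])
model.fit(X_tr, y_tr)
scores = model.decision_function(X_te)

# A binary SVC predicts the positive class when the score is above 0.
default_pred = (scores > 0).astype(int)
# A lower threshold labels more samples positive, so FN cannot increase.
lenient_pred = (scores > -0.5).astype(int)

recall_default = recall_score(y_te, default_pred)
recall_lenient = recall_score(y_te, lenient_pred)
print("Recall    (threshold  0.0):", recall_default)
print("Recall    (threshold -0.5):", recall_lenient)
print("Precision (threshold  0.0):", precision_score(y_te, default_pred))
print("Precision (threshold -0.5):", precision_score(y_te, lenient_pred))
```

Recall can only stay the same or rise as the threshold drops, while precision typically falls, so the threshold would normally be chosen on validation data rather than on the test set.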