# Titanic Logistic Regression Notebook Documentation

This document details the implementation and evaluation of two logistic regression models applied to the Titanic dataset. Both models use the Newton-Raphson method to optimize the parameters; however, they differ in the preprocessing and feature engineering steps applied:

- **Model 1 (Basic):** Uses basic preprocessing (simple imputation and encoding) and trains the model on the fundamental features.
- **Model 2 (Enhanced / Feature Engineering):** Applies advanced data cleaning and feature extraction (e.g., extracting deck from Cabin, title from Name, and processing Ticket) to potentially improve the model’s predictive performance.

---


## Table of Contents

1. [Introduction](#1-introduction)
2. [Dataset Description](#2-dataset-description)
3. [Model 1: Basic Preprocessing and Logistic Regression](#3-model-1-basic-preprocessing-and-logistic-regression)
    - [Basic Preprocessing](#basic-preprocessing)
    - [Model 1 Implementation](#model-1-implementation)
    - [Model 1 Evaluation](#model-1-evaluation)
4. [Model 2: Advanced Preprocessing and Feature Engineering](#4-model-2-advanced-preprocessing-and-feature-engineering)
    - [Advanced Preprocessing](#advanced-preprocessing)
    - [Model 2 Implementation](#model-2-implementation)
    - [Model 2 Evaluation](#model-2-evaluation)
5. [Conclusions](#5-conclusions)

---


## 1. Introduction

The purpose of this notebook is to implement a logistic regression model using the Newton-Raphson method from scratch, thereby deepening our understanding of the underlying mathematics of optimization. Two approaches are explored:

- **Model 1:** Uses basic preprocessing and feature selection.
- **Model 2:** Incorporates advanced preprocessing (feature engineering) by extracting additional information from variables such as *Cabin*, *Name*, and *Ticket*, which is expected to positively impact the model’s accuracy.

---


## 2. Dataset Description

The Titanic dataset is divided into two sets:

- **Training set (`train.csv`):** Contains the target variable `Survived` (0 = No, 1 = Yes) along with features like `Pclass`, `Sex`, `Age`, `SibSp`, `Parch`, `Fare`, etc.
- **Test set (`test.csv`):** Does not include the `Survived` variable and is used for generating predictions for Kaggle submission.

### Key Variables

- **Pclass:** Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).  
  *(A proxy for socioeconomic status)*
- **Sex:** Passenger gender. Model 1 maps it to numerical values (0 for 'male' and 1 for 'female'); Model 2 one-hot encodes it.
- **Age:** Passenger age (can be fractional).
- **SibSp / Parch:** Number of siblings/spouses and parents/children aboard.
- **Fare:** Ticket fare.
- **Cabin:** Cabin number; its first letter gives the deck, which Model 2 extracts as a feature.
- **Embarked:** Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

---



## 3. Model 1: Basic Preprocessing and Logistic Regression

### Basic Preprocessing

In this approach, the following transformations are performed:

- **Missing Value Imputation:**  
  - `Age` and `Fare` are filled with their median values.
  - `Embarked` is filled with its mode.
- **Variable Transformation:**  
  - Map `Sex` to numerical values.
  - Apply one-hot encoding to the `Embarked` variable.
- **Feature Selection:**  
  - Select relevant features such as `Pclass`, `Age`, `SibSp`, `Parch`, `Fare`, `Sex`, and the one-hot encoded columns for `Embarked`.
- **Intercept Term:**  
  - Add a column of ones to represent the intercept in the model.


In [7]:
import numpy as np
import pandas as pd

# Load the data
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

def preprocess_data(df, is_train=True):
    df = df.copy()
    
    # Impute missing values
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    
    if 'Embarked' in df.columns:
        df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    
    # Map 'Sex' to numerical values
    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
    
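    # One-hot encoding for 'Embarked'. Train and test are encoded separately, so the
    # resulting columns match only when both frames contain the same set of ports.
    if 'Embarked' in df.columns:
        df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)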
    
    features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex']
    embarked_cols = [col for col in df.columns if col.startswith('Embarked_')]
    features += embarked_cols
    
    if is_train:
        X = df[features].values.astype(np.float64)
        y = df['Survived'].values.astype(np.float64)
        return X, y
    else:
        X = df[features].values.astype(np.float64)
        return X

# Preprocess training and test sets
X_train, y_train = preprocess_data(train_df, is_train=True)
X_test = preprocess_data(test_df, is_train=False)

# Add the intercept term
X_train = np.hstack([np.ones((X_train.shape[0], 1), dtype=np.float64), X_train])
X_test = np.hstack([np.ones((X_test.shape[0], 1), dtype=np.float64), X_test])
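
The preprocessing above imputes each frame with its own medians. A common alternative is to learn the fill values from the training frame and reuse them for the test frame. The sketch below illustrates that idea on made-up rows; `fit_fill_values`, `preprocess_with`, and `EMBARKED_PORTS` are hypothetical names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Assumption: the three ports listed in the dataset description.
# A fixed category list keeps the dummy columns aligned across frames.
EMBARKED_PORTS = ["C", "Q", "S"]

def fit_fill_values(train_df):
    """Learn imputation values from the training frame only."""
    return {
        "Age": train_df["Age"].median(),
        "Fare": train_df["Fare"].median(),
        "Embarked": train_df["Embarked"].mode()[0],
    }

def preprocess_with(df, fill_values):
    """Impute with precomputed values, then encode Sex and Embarked."""
    df = df.copy()
    for column, value in fill_values.items():
        df[column] = df[column].fillna(value)
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    df["Embarked"] = pd.Categorical(df["Embarked"], categories=EMBARKED_PORTS)
    df = pd.get_dummies(df, columns=["Embarked"], drop_first=True)
    return df.astype(np.float64)

# Made-up rows in the dataset's format (not real passengers)
train_toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Fare": [7.25, 71.28, 8.05, 10.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", np.nan],
})
test_toy = pd.DataFrame({
    "Age": [np.nan], "Fare": [np.nan], "Sex": ["male"], "Embarked": ["Q"],
})

fill_values = fit_fill_values(train_toy)
train_encoded = preprocess_with(train_toy, fill_values)
test_encoded = preprocess_with(test_toy, fill_values)
print(test_encoded)
```

Because the category list is fixed, both frames end up with the same `Embarked_Q` and `Embarked_S` columns even if a port is absent from one of them.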



### Model 1 Implementation

The logistic regression model is optimized using the Newton-Raphson method.

#### Mathematical Background

The **sigmoid function** is defined as:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

The **log-likelihood** function is given by:

$$
\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1-y_i) \log(1-p_i) \right]
$$

where $p_i = \sigma(x_i^\top \beta)$ is the predicted survival probability for row $x_i$ of $X$.

The Newton-Raphson algorithm updates the parameters as:

$$
\beta_{\text{new}} = \beta - H^{-1} \nabla \ell(\beta)
$$

with:
- $\nabla \ell(\beta) = X^\top (y - p)$ as the gradient, and
- $H = -X^\top W X$ as the Hessian matrix, where $W = \operatorname{diag}\big(p_i (1 - p_i)\big)$.
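
One way to sanity-check the gradient and Hessian expressions is to compare the analytic gradient with a central finite difference of the log-likelihood on a small synthetic problem. The sketch below does this; the data, `numerical_gradient`, and the `*_toy` names are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(beta, X, y):
    # X^T (y - p)
    return X.T @ (y - sigmoid(X @ beta))

def hessian(beta, X):
    # -X^T W X with W = diag(p * (1 - p)), computed without building W
    p = sigmoid(X @ beta)
    return -(X.T * (p * (1 - p))) @ X

def numerical_gradient(f, beta, eps=1e-6):
    """Central finite-difference approximation of the gradient of f."""
    grad = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = eps
        grad[j] = (f(beta + step) - f(beta - step)) / (2 * eps)
    return grad

# Synthetic data: an intercept column plus two random features
rng = np.random.default_rng(0)
X_toy = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
y_toy = (rng.random(50) < 0.5).astype(np.float64)
beta_toy = np.array([0.1, -0.2, 0.3])

analytic = gradient(beta_toy, X_toy, y_toy)
numeric = numerical_gradient(lambda b: log_likelihood(b, X_toy, y_toy), beta_toy)
max_abs_diff = np.max(np.abs(analytic - numeric))

# For a full-rank design matrix the Hessian of the log-likelihood is negative definite
H_toy = hessian(beta_toy, X_toy)
hessian_is_negative_definite = bool(np.all(np.linalg.eigvalsh(H_toy) < 0))

print("max |analytic - numeric| =", max_abs_diff)
print("Hessian negative definite:", hessian_is_negative_definite)
```

A negative definite Hessian means the log-likelihood is concave, which is why subtracting $H^{-1} \nabla \ell(\beta)$ moves the parameters toward the maximum.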


In [9]:
# Sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Log-likelihood function
def log_likelihood(beta, X, y):
    z = np.dot(X, beta)
    p = sigmoid(z)
    return np.sum(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

# Gradient of the log-likelihood
def gradient(beta, X, y):
    z = np.dot(X, beta)
    p = sigmoid(z)
    return np.dot(X.T, (y - p))

# Hessian matrix
def hessian(beta, X, y):
    z = np.dot(X, beta)
    p = sigmoid(z)
    W = np.diag(p * (1 - p))
    return -np.dot(X.T, np.dot(W, X))

# Initialize parameters
beta = np.zeros(X_train.shape[1], dtype=np.float64)
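# Step-size tolerance. 1e-24 is far below float64 resolution for coefficients of this
# magnitude, so the check in the loop is met only if a step is (almost) exactly zero;
# the recorded output below contains no convergence message.
tol = 1e-24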
max_iter = 100

# Newton-Raphson algorithm
for i in range(max_iter):
    grad = gradient(beta, X_train, y_train)
    H = hessian(beta, X_train, y_train)
    delta = np.linalg.solve(H, grad)
    beta_new = beta - delta
    if np.linalg.norm(beta_new - beta, 2) < tol:
        beta = beta_new
        print(f"Convergence reached in {i+1} iterations.")
        break
    beta = beta_new

print("Estimated coefficients:", beta)

# Prediction on the test set
z_test = np.dot(X_test, beta)
p_test = sigmoid(z_test)
y_pred = (p_test >= 0.5).astype(int)

# Prepare the submission file
submission = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': y_pred
})
submission.to_csv('submission.csv', index=False)
print("Submission file 'submission.csv' generated successfully.")


Estimated coefficients: [ 2.51984934e+00 -1.10076974e+00 -3.90888921e-02 -3.25369301e-01
 -9.06363873e-02  1.98291345e-03  2.72840804e+00 -6.12958075e-02
 -4.06609580e-01]
Submission file 'submission.csv' generated successfully.
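
Two details of the cell above can matter on larger or unscaled inputs: `np.exp(-z)` can overflow for very negative `z`, and `np.diag(p * (1 - p))` builds an n-by-n matrix. The following sketch shows one possible way to avoid both; `stable_sigmoid` and `hessian_no_diag` are hypothetical helper names, not functions from the notebook.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid for array input that never exponentiates a large positive number."""
    z = np.asarray(z, dtype=np.float64)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    exp_z = np.exp(z[~pos])          # z < 0 here, so exp(z) <= 1
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

def hessian_no_diag(beta, X):
    """-X^T W X, scaling the rows of X by the weights instead of forming W."""
    p = stable_sigmoid(X @ beta)
    w = p * (1.0 - p)
    return -(X.T * w) @ X

# Comparison against the np.diag formulation on synthetic data
rng = np.random.default_rng(1)
X_toy = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
beta_toy = rng.normal(size=4)

p_toy = stable_sigmoid(X_toy @ beta_toy)
H_diag = -X_toy.T @ np.diag(p_toy * (1 - p_toy)) @ X_toy
H_fast = hessian_no_diag(beta_toy, X_toy)

extreme = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(extreme)
print(np.allclose(H_diag, H_fast))
```

Both formulations give the same matrix; the row-scaling version only changes memory use and speed.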


### Model 1 Evaluation

The performance on the training set is evaluated using several metrics:

- **Accuracy:** The proportion of correct predictions at a 0.5 probability threshold.
- **Log-Likelihood:** The value of the log-likelihood function at the fitted coefficients; values closer to zero indicate a better fit.
- **Confusion Matrix:** Counts of true negatives, false positives, false negatives, and true positives.
- **Classification Report:** Includes precision, recall, and F1-score per class.
- **ROC AUC Score:** Measures the model's ability to rank survivors above non-survivors across all thresholds.

---

In [11]:
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

z_train = np.dot(X_train, beta)
p_train = sigmoid(z_train)
y_train_pred = (p_train >= 0.5).astype(int)

# Accuracy
accuracy = np.mean(y_train_pred == y_train)
print("Training Accuracy:", accuracy)

# Log-likelihood
ll_value = log_likelihood(beta, X_train, y_train)
print("Training Log-Likelihood:", ll_value)

# Confusion Matrix
cm = confusion_matrix(y_train, y_train_pred)
print("Confusion Matrix:")
print(cm)

# Classification Report
print("Classification Report:")
print(classification_report(y_train, y_train_pred))

# ROC AUC Score
roc_auc = roc_auc_score(y_train, p_train)
print("Training ROC AUC:", roc_auc)


Training Accuracy: 0.797979797979798
Training Log-Likelihood: -392.7616718226784
Confusion Matrix:
[[471  78]
 [102 240]]
Classification Report:
              precision    recall  f1-score   support

         0.0       0.82      0.86      0.84       549
         1.0       0.75      0.70      0.73       342

    accuracy                           0.80       891
   macro avg       0.79      0.78      0.78       891
weighted avg       0.80      0.80      0.80       891

Training ROC AUC: 0.8571219335527648
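
The figures in the classification report can be cross-checked by hand from the confusion matrix printed above. The sketch below redoes that arithmetic for class 1 (survived), using scikit-learn's layout of true labels in rows and predicted labels in columns; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above for Model 1 (rows: true class, columns: predicted class)
cm_model1 = np.array([[471, 78],
                      [102, 240]])
tn, fp, fn, tp = cm_model1.ravel()

accuracy_from_cm = (tp + tn) / cm_model1.sum()
precision_pos = tp / (tp + fp)   # of predicted survivors, the share who survived
recall_pos = tp / (tp + fn)      # of actual survivors, the share who were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy  = {accuracy_from_cm:.4f}")
print(f"precision = {precision_pos:.4f}")
print(f"recall    = {recall_pos:.4f}")
print(f"f1        = {f1_pos:.4f}")
```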



## 4. Model 2: Advanced Preprocessing and Feature Engineering

This approach enriches the dataset by extracting additional features, which may enhance the predictive performance.

### Advanced Preprocessing

Additional transformations include:

- **Cabin:**  
  Extract the deck letter from the Cabin (if missing, assign `'U'`).
- **Name:**  
  Extract the title (e.g., Mr, Mrs, Miss) from the passenger's name.
- **Ticket:**  
  Separate the ticket number and prefix, converting the number to numeric and creating new variables based on the prefix.
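
The three extractions can be illustrated on a couple of rows written in the dataset's format. This is a minimal sketch using pandas string methods; the regular expression for the title is an assumption of this example and is not necessarily how the notebook splits names.

```python
import pandas as pd

# Illustrative rows in the dataset's format
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Cabin": [None, "C85"],
    "Ticket": ["A/5 21171", "113803"],
})

# Cabin: first letter is the deck, 'U' marks a missing cabin
toy["Cabin_deck"] = toy["Cabin"].fillna("U").str[0]

# Name: the title sits between the comma and the first period
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Ticket: last token is the number, anything before it is the prefix
tokens = toy["Ticket"].str.split()
toy["Ticket_number"] = pd.to_numeric(tokens.str[-1], errors="coerce").fillna(0)
toy["Ticket_item"] = tokens.apply(lambda t: " ".join(t[:-1]) if len(t) > 1 else "NONE")

print(toy[["Cabin_deck", "Title", "Ticket_number", "Ticket_item"]])
```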



In [13]:
def preprocess_df(df, is_train=True):
    # is_train is kept for symmetry with preprocess_data; it is not used in this function
    df = df.copy()
    
    # Impute numerical values
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    
    # Impute 'Embarked'
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    
    # Process Cabin: extract the deck letter ('U' marks a missing cabin)
    df['Cabin_deck'] = df['Cabin'].fillna('U').str[0]
    
    # Process Name: extract title
    def extract_title(name):
        if ',' in name:
            parts = name.split(',')
            if len(parts) > 1:
                title_part = parts[1].strip()
                title = title_part.split(' ')[0].replace('.', '')
                return title
        return 'None'
    df['Title'] = df['Name'].apply(extract_title)
    
    # Process Ticket: extract number and prefix
    def ticket_number(x):
        tokens = x.split()
        return tokens[-1] if tokens else '0'
    def ticket_item(x):
        tokens = x.split()
        return " ".join(tokens[:-1]) if len(tokens) > 1 else 'NONE'
    df['Ticket_number'] = pd.to_numeric(df['Ticket'].apply(ticket_number), errors='coerce').fillna(0)
    df['Ticket_item'] = df['Ticket'].apply(ticket_item)
    
    # Drop original columns that are no longer needed
    df = df.drop(columns=['Name', 'Ticket', 'Cabin'])
    
    return df

# Load datasets
train_df = pd.read_csv("train.csv")
test_df  = pd.read_csv("test.csv")

# Apply preprocessing
train_df_proc = preprocess_df(train_df, is_train=True)
test_df_proc  = preprocess_df(test_df, is_train=False)

# Define variables to use
categorical_cols = ['Sex', 'Embarked', 'Cabin_deck', 'Title', 'Ticket_item']
numerical_cols   = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Ticket_number']

# Ensure 'Sex' is treated as a categorical variable
train_df_proc['Sex'] = train_df_proc['Sex'].astype(str)
test_df_proc['Sex']  = test_df_proc['Sex'].astype(str)

# Concatenate for consistent encoding between train and test sets
all_df = pd.concat([train_df_proc, test_df_proc], sort=False, ignore_index=True)
all_df = pd.get_dummies(all_df, columns=categorical_cols, drop_first=True)

# Separate the train and test sets
train_processed = all_df.iloc[:len(train_df_proc)].copy()
test_processed  = all_df.iloc[len(train_df_proc):].copy()

# Define target and features for training
y_train = train_processed['Survived'].values.astype(np.float64)
X_train = train_processed.drop(columns=['PassengerId', 'Survived']).values.astype(np.float64)
passenger_ids = test_processed['PassengerId'].values
X_test = test_processed.drop(columns=['PassengerId', 'Survived'], errors='ignore').values.astype(np.float64)

# Add the intercept term
X_train = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
X_test  = np.hstack([np.ones((X_test.shape[0], 1)), X_test])

print("Shapes: X_train =", X_train.shape, "; y_train =", y_train.shape, "; X_test =", X_test.shape)


Shapes: X_train = (891, 85) ; y_train = (891,) ; X_test = (418, 85)
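
Encoding the concatenated train and test frames keeps their columns aligned, but a category that occurs only in the test rows becomes an all-zero column in the training matrix. The toy sketch below shows how such columns could be detected and how they reduce the matrix rank; the titles and the `*_toy` names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames: 'Sir' occurs only in the test rows
train_toy = pd.DataFrame({"Title": ["Mr", "Mrs", "Miss", "Mr"]})
test_toy = pd.DataFrame({"Title": ["Mr", "Sir"]})

all_toy = pd.concat([train_toy, test_toy], ignore_index=True)
encoded = pd.get_dummies(all_toy, columns=["Title"], drop_first=True).astype(np.float64)

# Training rows only, with an intercept column in front
X_train_toy = encoded.iloc[:len(train_toy)].to_numpy()
X_train_toy = np.hstack([np.ones((X_train_toy.shape[0], 1)), X_train_toy])
column_names = ["intercept"] + encoded.columns.tolist()

# Columns (other than the intercept) that never vary in the training rows
constant_columns = [
    name
    for name, col in zip(column_names[1:], X_train_toy[:, 1:].T)
    if col.max() == col.min()
]
rank = np.linalg.matrix_rank(X_train_toy)

print("columns:", column_names)
print("constant in train:", constant_columns)
print("rank:", rank, "of", X_train_toy.shape[1])
```

When the rank is lower than the number of columns, the matrix $X^\top W X$ is singular and a plain linear solve of the Newton system fails.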



### Model 2 Implementation

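The implementation mirrors Model 1, except that the Newton step uses the Moore-Penrose pseudo-inverse of the Hessian (`np.linalg.pinv`) instead of `np.linalg.solve`. With 85 columns, the Hessian can be singular (for example, when a dummy column is all zeros in the training rows), and the pseudo-inverse still returns a usable step in that case.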
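
The effect of the pseudo-inverse can be seen on a tiny design matrix whose last column is all zeros. In this sketch, which uses made-up data, `np.linalg.solve` raises an error on the singular Hessian while `np.linalg.pinv` yields a step that leaves the unidentifiable coefficient at zero.

```python
import numpy as np

# Toy design matrix: intercept, one binary feature, and an all-zero column
X_toy = np.array([[1.0, 0.0, 0.0],
                  [1.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [1.0, 1.0, 0.0]])
y_toy = np.array([0.0, 1.0, 1.0, 1.0])
beta = np.zeros(3)

p = 1.0 / (1.0 + np.exp(-(X_toy @ beta)))   # all 0.5 at beta = 0
grad = X_toy.T @ (y_toy - p)
H = -(X_toy.T * (p * (1 - p))) @ X_toy      # singular: last row and column are zero

try:
    np.linalg.solve(H, grad)
    solve_failed = False
except np.linalg.LinAlgError:
    solve_failed = True

# Minimum-norm Newton step via the pseudo-inverse
delta_pinv = np.linalg.pinv(H) @ grad
beta_new = beta - delta_pinv

print("solve failed:", solve_failed)
print("pinv step:", delta_pinv)
print("updated beta:", beta_new)
```

The pseudo-inverse picks the minimum-norm solution, so directions the data cannot identify receive no update.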



In [15]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_likelihood(beta, X, y):
    z = np.dot(X, beta)
    p = sigmoid(z)
    return np.sum(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

def gradient(beta, X, y):
    z = np.dot(X, beta)
    p = sigmoid(z)
    return np.dot(X.T, (y - p))

def hessian(beta, X, y):
    z = np.dot(X, beta)
    p = sigmoid(z)
    W = np.diag(p * (1 - p))
    return -np.dot(X.T, np.dot(W, X))

# Initialize parameters
beta = np.zeros(X_train.shape[1], dtype=np.float64)
tol = 1e-12
max_iter = 100

for i in range(max_iter):
    grad = gradient(beta, X_train, y_train)
    H = hessian(beta, X_train, y_train)
    # Use pseudo-inverse to handle singular Hessian matrices
    delta = np.dot(np.linalg.pinv(H), grad)
    beta_new = beta - delta
    if np.linalg.norm(beta_new - beta, 2) < tol:
        beta = beta_new
        print(f"Convergence reached in {i+1} iterations.")
        break
    beta = beta_new

print("Estimated coefficients:", beta)
print("Log-Likelihood:", log_likelihood(beta, X_train, y_train))


Estimated coefficients: [ 5.27501290e+00 -9.23075709e-01 -3.20759269e-02 -5.29792233e-01
 -3.80086444e-01  6.04613824e-03 -1.10569050e-07 -3.44790578e+00
 -1.74320325e-01 -6.20565404e-01  3.86784739e-01 -1.15999087e-01
  1.13306396e+00  1.98805170e+00  7.09697417e-01 -1.95845636e+00
 -1.90725974e+00 -1.88088662e-01  5.38011976e-01 -2.81828662e+00
 -1.28972557e-01 -2.24788946e-01 -3.03283985e+00  1.10077761e+00
  3.35994555e-01  2.85768705e+00 -1.02541824e+00  6.15477129e-01
  8.07704139e-01 -6.54154845e-01 -5.82036855e-02  1.84885376e+00
 -3.53975783e+00  4.22547128e+00  7.41029118e-01 -1.21919033e+00
 -7.22342383e-01 -1.16832422e+00 -1.59608834e+00  1.11798083e+00
 -3.50629617e-01 -5.86417193e-01 -5.59382891e-01 -1.10439104e-01
  8.89834569e-02  2.33123297e+00  1.14874273e+00 -1.41181948e+00
 -2.87836803e+00  1.53156563e+00 -2.39397051e+00  2.36432967e+00
 -3.89921894e-01  1.41471055e-02  5.80927785e-01  8.85694791e-02
 -3.96803071e-01  2.95333682e+00 -5.91644056e-01  1.46749331e-01
 

### Model 2 Evaluation

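Evaluation metrics are computed as in Model 1: accuracy, confusion matrix, classification report, and ROC AUC, all on the training set. The log-likelihood is printed at the end of the previous cell rather than here. The same cell also writes the test-set predictions to `submission.csv`.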



In [17]:
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

z_train = np.dot(X_train, beta)
p_train = sigmoid(z_train)
y_train_pred = (p_train >= 0.5).astype(int)

accuracy = np.mean(y_train_pred == y_train)
print("Training Accuracy:", accuracy)

cm = confusion_matrix(y_train, y_train_pred)
print("Confusion Matrix:")
print(cm)

print("Classification Report:")
print(classification_report(y_train, y_train_pred))

roc_auc = roc_auc_score(y_train, p_train)
print("Training ROC AUC:", roc_auc)

# Prediction on the test set and submission file generation
z_test = np.dot(X_test, beta)
p_test = sigmoid(z_test)
y_test_pred = (p_test >= 0.5).astype(int)

submission = pd.DataFrame({
    'PassengerId': passenger_ids,
    'Survived': y_test_pred
})
# Note: this reuses the file name from Model 1, so the earlier submission is overwritten
submission.to_csv('submission.csv', index=False)
print("Submission file 'submission.csv' generated successfully.")


Training Accuracy: 0.8451178451178452
Confusion Matrix:
[[484  65]
 [ 73 269]]
Classification Report:
              precision    recall  f1-score   support

         0.0       0.87      0.88      0.88       549
         1.0       0.81      0.79      0.80       342

    accuracy                           0.85       891
   macro avg       0.84      0.83      0.84       891
weighted avg       0.84      0.85      0.84       891

Training ROC AUC: 0.9024222669606621
Submission file 'submission.csv' generated successfully.
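
All metrics above are measured on the rows used for fitting. A held-out split is one way to estimate how a fitted model behaves on unseen passengers. The sketch below shows the mechanics on synthetic data, assuming scikit-learn's `train_test_split`; `fit_newton` and the `*_syn` names are hypothetical, and the numbers it prints say nothing about the Titanic models.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_newton(X, y, max_iter=25, tol=1e-8):
    """Newton-Raphson for logistic regression with a pseudo-inverse step."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)
        H = -(X.T * (p * (1 - p))) @ X
        step = np.linalg.pinv(H) @ grad
        beta = beta - step
        if np.linalg.norm(step) < tol:
            break
    return beta

# Synthetic data standing in for a preprocessed design matrix
rng = np.random.default_rng(42)
n = 400
X_syn = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 3))])
true_beta = np.array([-0.5, 1.0, -1.0, 0.5])
y_syn = (rng.random(n) < sigmoid(X_syn @ true_beta)).astype(np.float64)

# Hold out 25% of the rows, keeping the class balance similar in both parts
X_fit, X_val, y_fit, y_val = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0, stratify=y_syn
)

beta_hat = fit_newton(X_fit, y_fit)
fit_accuracy = np.mean((sigmoid(X_fit @ beta_hat) >= 0.5) == y_fit)
val_accuracy = np.mean((sigmoid(X_val @ beta_hat) >= 0.5) == y_val)

print("accuracy on fitting rows:", fit_accuracy)
print("accuracy on held-out rows:", val_accuracy)
```

A large gap between the two accuracies would suggest that the model fits noise in the rows it was trained on.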


---



## 5. Conclusions

- **Model 1 (Basic):**  
  - **Preprocessing:** Applied simple imputation (median for `Age` and `Fare`, mode for `Embarked`) and basic encoding.  
  - **Implementation:** Logistic regression via Newton-Raphson, solving the Newton system directly with `np.linalg.solve`.  
  - **Results:** Training accuracy of 0.798, log-likelihood of -392.76, and ROC AUC of 0.857, which leaves room for improvement.

- **Model 2 (Enhanced/Feature Engineering):**  
  - **Preprocessing:** Enriched the dataset by extracting information from variables like *Cabin*, *Name*, and *Ticket* and encoded multiple categorical variables consistently across the train and test sets.  
  - **Implementation:** Used the pseudo-inverse of the Hessian in the Newton-Raphson step so that a singular Hessian does not stop the fit.  
  - **Results:** Training accuracy rose to 0.845 and ROC AUC to 0.902, suggesting that advanced preprocessing and feature engineering can improve the model's predictive ability. Both models are scored on the training set only, and Model 2 uses 85 columns against 9 for Model 1, so the gain would still need to be confirmed on held-out data or the Kaggle test set.



