# Basic Introduction and Summary of Assignment

In this assignment, we will be working with the Titanic dataset. The goal is to load, preprocess, and analyze the data to gain insights into the factors that influenced the survival of passengers. We will perform various data preprocessing steps such as handling missing values, encoding categorical variables, and feature scaling.

## Loading and Preprocessing of the Titanic Dataset

We will start by loading the Titanic dataset and performing necessary preprocessing steps to prepare the data for analysis. This includes:

1. Handling missing values.
2. Encoding categorical variables.
3. Feature scaling.


# Titanic Dataset Description

1. **survived**: Survival (0 = No; 1 = Yes).
2. **pclass**: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd).
3. **name**: Name.
4. **sex**: Sex.
5. **age**: Age in years.
6. **sibsp**: Number of Siblings/Spouses Aboard.
7. **parch**: Number of Parents/Children Aboard.
8. **ticket**: Ticket Number.
9. **fare**: Passenger Fare.
10. **cabin**: Cabin.
11. **embarked**: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).
12. **boat**: Lifeboat (if survived).
13. **body**: Body number (if did not survive and the body was recovered).
14. **home.dest**: Home/Destination.


In [None]:
%pip install boruta
%pip install imbalanced-learn
%pip install seaborn
%pip install scikit-learn
%pip install matplotlib
%pip install pandas
%pip install xlrd  # pd.read_excel needs xlrd to read .xls files
%pip install numpy
%pip install scipy
%pip install statsmodels

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import scipy.stats as stats

from scipy.stats import pearsonr
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from boruta import BorutaPy


# Task 1: Data Loading and Initial Exploration

Basic data loading and exploration

In [None]:
# Load the dataset
file_path = 'titanic3.xls'
titanic_df = pd.read_excel(file_path)

# Display the first few rows of the dataset
print(titanic_df.head())

# Display the dataset information (info() prints directly and returns None)
titanic_df.info()

# Display the summary statistics of the dataset
print(titanic_df.describe())

# Task 2: Managing Missing Values

Counting the Missing Values in Each Column and Filtering Out Incomplete Rows

In [None]:
# Count missing values for each column
missing_values_count = titanic_df.isnull().sum()
print("\nMissing Values Count for Each Column:")
print(missing_values_count)

# Filter out rows with missing values
filtered_df = titanic_df.dropna()

# Display the first few rows of the filtered dataset
print("\nFiltered Dataset (No Missing Values):")
print(filtered_df.head())

# Display the dataset information (info() prints directly and returns None)
print("\nFiltered Dataset Info:")
filtered_df.info()

# Display the summary statistics of the filtered dataset
print("\nFiltered Dataset Summary Statistics:")
print(filtered_df.describe())
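
The raw counts above are easier to compare when expressed as a share of all rows. The following is a minimal sketch on a small hypothetical frame (`toy_df` and its values are made up for illustration and are not the Titanic data); the same expression should work on `titanic_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_df
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, np.nan, 8.05],
    "sex": ["male", "female", "female", "male"],
})

# isnull().mean() gives the fraction of missing entries per column
missing_pct = toy_df.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))
```

Columns with a very high share of missing values are candidates for dropping or for a dedicated imputation strategy rather than row-wise filtering.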

Calculating the Number of People Survived/!Survived depending on Boat/!Boat

In [None]:
# Filter the dataset for people who have a boat value and also survived
boat_and_survived = titanic_df[(titanic_df['boat'].notnull()) & (titanic_df['survived'] == 1)]

# Calculate the number of people who have a boat value and also survived
num_boat_and_survived = boat_and_survived.shape[0]

print(f"Number of people who have a boat value and also survived: {num_boat_and_survived}")

# Filter the dataset for people who have a boat value and didn't survive
boat_and_not_survived = titanic_df[(titanic_df['boat'].notnull()) & (titanic_df['survived'] == 0)]

# Calculate the number of people who have a boat value and didn't survive
num_boat_and_not_survived = boat_and_not_survived.shape[0]

print(f"Number of people who have a boat value and didn't survive: {num_boat_and_not_survived}")

# Calculate the number of people who didn't have a boat value and survived
no_boat_and_survived = titanic_df[(titanic_df['boat'].isnull()) & (titanic_df['survived'] == 1)]
num_no_boat_and_survived = no_boat_and_survived.shape[0]

print(f"Number of people who didn't have a boat value and survived: {num_no_boat_and_survived}")

# Calculate the number of people who didn't have a boat value and didn't survive
no_boat_and_not_survived = titanic_df[(titanic_df['boat'].isnull()) & (titanic_df['survived'] == 0)]
num_no_boat_and_not_survived = no_boat_and_not_survived.shape[0]

print(f"Number of people who didn't have a boat value and didn't survive: {num_no_boat_and_not_survived}")
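
The four counts above can also be produced in one step with a cross-tabulation. This is a sketch on hypothetical toy data (`toy_df` is invented for illustration), assuming that a missing `boat` value means no lifeboat was recorded.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN in 'boat' means no lifeboat was recorded
toy_df = pd.DataFrame({
    "boat": ["2", np.nan, "11", np.nan, np.nan, "C"],
    "survived": [1, 0, 1, 0, 1, 0],
})

# Turn the boat column into a readable two-level label
boat_status = (
    toy_df["boat"].notnull()
    .map({True: "Boat", False: "No Boat"})
    .rename("boat_status")
)

# Rows: boat status, columns: survived (0/1)
counts = pd.crosstab(boat_status, toy_df["survived"])
print(counts)
```

The resulting table has the same layout as the 2x2 matrix, so it could be passed to a heatmap directly.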

Plotting 2x2 Matrix of Survived/!Survived depending on Boat/!Boat

In [None]:
# Create a 2x2 matrix: rows are boat status, columns are survival status
matrix = [
    [num_boat_and_survived, num_boat_and_not_survived],
    [num_no_boat_and_survived, num_no_boat_and_not_survived]
]

# Create a heatmap; the tick labels must match the row/column order of the matrix
plt.figure(figsize=(8, 6))
sns.heatmap(matrix, annot=True, fmt="d", cmap="YlGnBu", xticklabels=['Survived', 'Not Survived'], yticklabels=['Boat', 'No Boat'])
plt.title('Survival vs Boat Presence')
plt.xlabel('Survival Status')
plt.ylabel('Boat Presence')
plt.show()

Although the boat column is highly correlated with survival status, I will drop it. A lifeboat assignment is only recorded during the evacuation, so it is future information that would leak the outcome into the model (data leakage), and it also depends on the other factors.

In [None]:
# Remove the 'boat' column
titanic_df.drop(columns=['boat'], inplace=True)

The Body Column will also be dropped, as it is future information.

In [None]:
# Remove the 'body' column, as this is dependent on survival status
titanic_df.drop(columns=['body'], inplace=True)

Although a title in the name, such as Dr. or Ms., could have an impact on the survival rate, I will drop the name column. The raw text cannot be used in the model directly, and extracting features from it would add complexity.

In [None]:
# Drop the 'name' column
titanic_df.drop(columns=['name'], inplace=True)

The ticket column will also be dropped. The raw ticket number cannot be used in the model directly, and it would add complexity.

In [None]:
# Drop the 'ticket' column, as it would add complexity
titanic_df.drop(columns=['ticket'], inplace=True)

The embarked column will be dropped: it adds complexity, and I do not consider the port of embarkation a feature the model should rely on.

In [None]:
# Drop the 'embarked' column
titanic_df.drop(columns=['embarked'], inplace=True)

I do not consider the home destination a usable feature for the model either, so I will drop the home.dest column.

In [None]:
# Drop the 'home.dest' column
titanic_df.drop(columns=['home.dest'], inplace=True)

I will fill the missing ages with the median age of the passenger's sex, so that the imputed values do not bias the age column towards one group. I use the median rather than the mean because the age distribution is skewed and has a high standard deviation, which makes the mean sensitive to extreme values.

In [None]:
# Calculate the median age for each sex
median_age_per_sex = titanic_df.groupby('sex')['age'].median()

# Function to fill missing age values based on sex
def fill_age(row):
    if pd.isnull(row['age']):
        return median_age_per_sex[row['sex']]
    else:
        return row['age']

# Apply the function to fill missing age values
titanic_df['age'] = titanic_df.apply(fill_age, axis=1)
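
The same group-wise imputation can be written without a row-wise `apply`. The following is a sketch on hypothetical toy data (`toy_df` is invented for illustration); `groupby(...).transform("median")` returns one median per row, aligned with the original index, which `fillna` can then use.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age per sex
toy_df = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "age": [20.0, 40.0, np.nan, 30.0, np.nan, 50.0],
})

# Median age of each row's group, aligned with the original rows
group_median = toy_df.groupby("sex")["age"].transform("median")

# Only the missing entries are replaced
toy_df["age"] = toy_df["age"].fillna(group_median)
print(toy_df)
```

On a dataset of this size both approaches are fast enough; the vectorized form mainly avoids the helper function.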

I will calculate the average fare per passenger class and use it to fill the missing fare values. Fares of 0 are treated as missing and are replaced in the same way. This is a reasonable assumption, as fare prices are correlated with passenger class.

In [None]:
# Calculate the average fare for each class, excluding fares of 0 or N/A
average_fare_per_class = titanic_df[titanic_df['fare'] > 0].groupby('pclass')['fare'].mean()

# Fill missing fare values with the average fare of their respective class
titanic_df['fare'] = titanic_df.apply(
    lambda row: average_fare_per_class[row['pclass']] if pd.isnull(row['fare']) or row['fare'] == 0 else row['fare'],
    axis=1
)

Using the median fare of each cabin deck, I will reverse engineer the cabin column. Only the first letter of the cabin (the deck) is kept, and a passenger with a missing cabin is assigned a deck whose median fare is at least as high as the fare they paid. This assumes that fare prices are correlated with cabin assignments.

In [None]:
# Keep only the deck letter, then calculate the median fare for each deck
# (the dictionary holds medians, despite the 'average' in its name)
titanic_df['cabin'] = titanic_df['cabin'].str[0]  # Extract the first letter of the cabin
average_fare_per_cabin = titanic_df.groupby('cabin')['fare'].median().to_dict()

# Plot the distribution of cabin decks based on the first letter
plt.figure(figsize=(10, 6))
sns.countplot(data=titanic_df, x='cabin', order=sorted(titanic_df['cabin'].dropna().unique()))
plt.title('Distribution of Cabin Decks')
plt.xlabel('Cabin Deck')
plt.ylabel('Count')
plt.show()

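# Function to assign cabin based on fare price.
# Decks are checked in order of increasing median fare, so the result does not
# depend on the (alphabetical) order of the dictionary.
def assign_cabin(fare):
    decks_by_fare = sorted(average_fare_per_cabin.items(), key=lambda item: item[1])
    for cabin, median_fare in decks_by_fare:
        if fare <= median_fare:
            return cabin
    return decks_by_fare[-1][0]  # Fare is higher than all medians: use the most expensive deck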

# Fill missing cabin values based on fare price
titanic_df['cabin'] = titanic_df.apply(
    lambda row: assign_cabin(row['fare']) if pd.isnull(row['cabin']) else row['cabin'],
    axis=1
)

# Plot the distribution of cabin decks after filling missing values
plt.figure(figsize=(10, 6))
sns.countplot(data=titanic_df, x='cabin', order=sorted(titanic_df['cabin'].unique()))
plt.title('Distribution of Cabin Decks After Filling Missing Values')
plt.xlabel('Cabin Deck')
plt.ylabel('Count')
plt.show()

Showing that there are no missing values left in the dataset.

In [None]:
# Display the updated dataset information (info() prints directly and returns None)
titanic_df.info(verbose=True)

# Show columns with missing values
columns_with_missing_values = titanic_df.columns[titanic_df.isnull().any()]
print("\nColumns with Missing Values:")
print(columns_with_missing_values)

# Task 3: Encoding Categorical Variables

I will perform one-hot encoding on the passenger class, because it is a categorical variable. With drop_first=True, the first class is dropped and serves as the reference category.

In [None]:
# Perform one-hot encoding for the 'pclass' column
titanic_df = pd.get_dummies(titanic_df, columns=['pclass'], drop_first=True)

I will also perform one-hot encoding on the passenger's sex, as it is a categorical variable. This represents the information numerically, so that machine learning algorithms can interpret and use the feature.

In [None]:
# Perform one-hot encoding for the 'sex' column
titanic_df = pd.get_dummies(titanic_df, columns=['sex'], drop_first=True)

In [None]:
print(titanic_df.columns)
print(titanic_df.head())
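
To make the effect of `drop_first=True` concrete, here is a small sketch on hypothetical toy data (`toy_df` is invented for illustration). The first category of each column is dropped, so three classes become two indicator columns and two sexes become one.

```python
import pandas as pd

# Hypothetical toy data with the two categorical columns encoded above
toy_df = pd.DataFrame({
    "pclass": [1, 2, 3, 1],
    "sex": ["male", "female", "female", "male"],
})

# drop_first=True removes the first category of each column (pclass 1, female)
encoded = pd.get_dummies(toy_df, columns=["pclass", "sex"], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

A row with zeros in both `pclass_2` and `pclass_3` therefore represents first class, which avoids perfectly collinear columns in a linear model.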

# Task 4: Feature Scaling

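I will visualize the fare column with a histogram and a Q-Q plot to see whether it is normally distributed. I will then perform feature scaling so that the continuous features are on comparable scales. Note that scaling changes the range of a feature, not the shape of its distribution.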

In [None]:
# Plot histogram of the 'fare' column
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.histplot(titanic_df['fare'], kde=True)
plt.title('Histogram of Fare')

# Plot Q-Q plot of the 'fare' column
plt.subplot(1, 2, 2)
stats.probplot(titanic_df['fare'], dist="norm", plot=plt)
plt.title('Q-Q Plot of Fare')

plt.tight_layout()
plt.show()

I apply standard scaling to the fare column, as it is a continuous variable. The scaled version will be used in the model, since features on a comparable scale are easier for the model to work with.

In [None]:
# Initialize the StandardScaler
scaler = StandardScaler()

# Perform standardized scaling for the 'fare' column
titanic_df['fare_scaled'] = scaler.fit_transform(titanic_df[['fare']])

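# Display the first few rows of the updated dataset
print(titanic_df[['fare', 'fare_scaled']].head())

# Drop the 'fare' column as it is now scaled (otherwise both versions reach the model)
titanic_df.drop(columns=['fare'], inplace=True)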

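I plot the histogram and the Q-Q plot again for the scaled fare. Standard scaling is a linear transformation, so the shape of the distribution stays the same and only the axis values change. I will be using the scaled version of the fare column in the model.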

In [None]:
# Plot histogram and Q-Q plot of the 'fare_scaled' column
plt.figure(figsize=(12, 6))

# Histogram
plt.subplot(1, 2, 1)
sns.histplot(titanic_df['fare_scaled'], kde=True)
plt.title('Histogram of Scaled Fare')
plt.xlabel('Scaled Fare')
plt.ylabel('Frequency')

# Q-Q plot
plt.subplot(1, 2, 2)
stats.probplot(titanic_df['fare_scaled'], dist="norm", plot=plt)
plt.title('Q-Q Plot of Scaled Fare')

plt.tight_layout()
plt.show()
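
The effect of standard scaling on the shape of a distribution can be checked numerically. This sketch uses hypothetical right-skewed data drawn from a log-normal distribution (not the real fares) and compares the skewness before and after scaling. The `log1p` transform is included only as one common option for reducing skew and is an assumption of this sketch, not part of the pipeline above.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler

# Hypothetical right-skewed values standing in for the fare column
rng = np.random.default_rng(0)
fares = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

# Standard scaling: subtract the mean, divide by the standard deviation
scaled = StandardScaler().fit_transform(fares.reshape(-1, 1)).ravel()

# A log transform, by contrast, changes the shape of the distribution
log_fares = np.log1p(fares)

print(f"Skewness raw:    {skew(fares):.3f}")
print(f"Skewness scaled: {skew(scaled):.3f}")
print(f"Skewness log1p:  {skew(log_fares):.3f}")
```

The raw and scaled skewness agree, which is why the two Q-Q plots look alike apart from their axes.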

I scale the age using MinMaxScaler, as it is a continuous variable. The scaled version, which lies between 0 and 1, will be used in the model.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Initialize the MinMaxScaler
minmax_scaler = MinMaxScaler()

# Perform MinMax scaling for the 'age' column
titanic_df['age_scaled'] = minmax_scaler.fit_transform(titanic_df[['age']])

# Display the first few rows of the updated dataset
print(titanic_df[['age', 'age_scaled']].head())

# Drop the 'age' column as it is now scaled
titanic_df.drop(columns=['age'], inplace=True)
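
One caveat with scaling before the split is that the scaler sees the minimum and maximum of rows that later end up in the validation and testing sets. A sketch of the alternative, on hypothetical ages (the data and variable names are invented for illustration), fits the scaler on the training rows only and reuses it for the held-out rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages standing in for the age column
rng = np.random.default_rng(42)
ages = rng.uniform(1, 80, size=(200, 1))

ages_train, ages_test = train_test_split(ages, test_size=0.2, random_state=42)

# Learn min and max from the training rows only
scaler = MinMaxScaler().fit(ages_train)

ages_train_scaled = scaler.transform(ages_train)
ages_test_scaled = scaler.transform(ages_test)  # reuses the training min and max

print(ages_train_scaled.min(), ages_train_scaled.max())
print(ages_test_scaled.min(), ages_test_scaled.max())
```

With this approach the held-out values can fall slightly outside the 0 to 1 range, which is expected.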

In [None]:
print(titanic_df.columns)
print(titanic_df.head())

# Task 5: Data Splitting

I will be splitting the dataset into three parts: training, validation, and testing sets. The training set will consist of 50% of the data, the validation set of 30%, and the testing set of 20%. This approach ensures that the model is trained on a substantial portion of the data, while separate validation and testing sets remain available to evaluate the model's performance and generalization.

In [None]:
# Preprocess the data to convert categorical features to numeric values
X = pd.get_dummies(titanic_df.drop(columns=['survived']))
y = titanic_df['survived']

# Split the dataset into training (50%) and temp (50%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.5, random_state=42)

# Split the temp set into validation (60% of temp, which is 30% of original) and testing (40% of temp, which is 20% of original) sets
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.4, random_state=42)

# Display the sizes of the splits
print(f"Training set size: {X_train.shape[0]}")
print(f"Validation set size: {X_val.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")

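# Function to plot pie chart for survived/!survived
def plot_survival_pie_chart(data, title, ax):
    # Order the counts by class label (0, 1) so they always match the fixed labels below;
    # value_counts() alone sorts by frequency, which can swap the labels
    survival_counts = data.value_counts().reindex([0, 1], fill_value=0)
    ax.pie(survival_counts, labels=['Not Survived', 'Survived'], autopct='%1.1f%%', startangle=140, colors=['#ff9999','#66b3ff'])
    ax.set_title(title)
    ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.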

# Create subplots
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Plot pie charts for each dataset
plot_survival_pie_chart(y_train, 'Training Data Split: Survived vs Not Survived', axes[0])
plot_survival_pie_chart(y_val, 'Validation Data Split: Survived vs Not Survived', axes[1])
plot_survival_pie_chart(y_test, 'Testing Data Split: Survived vs Not Survived', axes[2])

plt.tight_layout()
plt.show()
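
If the class proportions in the three pie charts differ noticeably, the `stratify` parameter of `train_test_split` can keep them aligned. This is a sketch on hypothetical labels (the 38% positive rate and the array names are assumptions for illustration), using the same 50/30/20 proportions as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and labels with a 38% positive rate
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 380 + [0] * 620)

# 50% training, 50% temporary; stratify keeps the class ratio in both parts
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.5, stratify=y_toy, random_state=42
)

# Split the temporary part into validation (30% overall) and testing (20% overall)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.4, stratify=y_tmp, random_state=42
)

print(len(y_tr), len(y_va), len(y_te))
print(y_tr.mean(), y_va.mean(), y_te.mean())
```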

# Task 6: Addressing Class Imbalance

As this dataset is imbalanced, I will be using SMOTE to balance the training data only. Training on a balanced set keeps the model from favouring the majority class. The validation and testing sets are left untouched, so that they reflect the real-world class distribution and the evaluation of the model's performance stays accurate.

In [None]:
# Count the number of survivors and non-survivors in the training set
survival_counts = y_train.value_counts()

# Create subplots
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Plot pie charts for each dataset
plot_survival_pie_chart(y_train, 'Training Data Split: Survived vs Not Survived', axes[0])
plot_survival_pie_chart(y_val, 'Validation Data Split: Survived vs Not Survived', axes[1])
plot_survival_pie_chart(y_test, 'Testing Data Split: Survived vs Not Survived', axes[2])

plt.tight_layout()
plt.show()

In [None]:
# Initialize the SMOTE object
smote = SMOTE(random_state=42)

# Apply SMOTE to the training set
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Display the sizes of the original and resampled training sets
print(f"Original training set size: {X_train.shape[0]}")
print(f"Resampled training set size: {X_train_smote.shape[0]}")

# Display the distribution of the target variable in the resampled training set
print("\nDistribution of the target variable in the resampled training set:")
print(y_train_smote.value_counts())
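
To illustrate where the additional training rows come from: SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a simplified sketch of that idea in plain NumPy, not the imbalanced-learn implementation. The helper `synthesize` and the points are hypothetical, and it uses only the single nearest neighbour, whereas the library chooses among several neighbours.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a two-feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

def synthesize(points, rng):
    """Create one synthetic point between a random sample and its nearest neighbour."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf                      # exclude the point itself
    neighbour = points[np.argmin(dists)]
    lam = rng.random()                     # interpolation weight in [0, 1)
    return points[i] + lam * (neighbour - points[i])

new_point = synthesize(minority, rng)
print(new_point)
```

Because each synthetic point lies on a line segment between two real minority points, the resampled rows are new values rather than duplicates.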

Checking the Class Imbalance before and after SMOTE.

In [None]:
# Count the number of survivors and non-survivors in the new dataset
survival_counts_smote = y_train_smote.value_counts()

# Create subplots
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Plot pie charts for each dataset
plot_survival_pie_chart(y_train_smote, 'Training Data Split (SMOTE): Survived vs Not Survived', axes[0])
plot_survival_pie_chart(y_val, 'Validation Data Split: Survived vs Not Survived', axes[1])
plot_survival_pie_chart(y_test, 'Testing Data Split: Survived vs Not Survived', axes[2])

plt.tight_layout()
plt.show()

# Task 7: Feature Selection

In [None]:
print(X_train_smote.columns)
print(X_train_smote.head())

I will use the Boruta algorithm to select the most relevant features for the model. Boruta adds shuffled copies of the features (shadow features), trains a random forest, and over several iterations keeps only the features that are consistently more important than the best shadow feature.

In [None]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)

boruta_selector = BorutaPy(rf, n_estimators='auto', random_state=42)
boruta_selector.fit(X_train_smote.values, y_train_smote.values)

selected_features = X_train_smote.columns[boruta_selector.support_].to_list()
print("Selected Features:", selected_features)
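
The core idea behind Boruta can be sketched in a single round: every real feature is compared against shuffled "shadow" copies that carry no information about the target. The code below is a simplified illustration on synthetic data from `make_classification`, not the BorutaPy implementation, which repeats this comparison many times and applies a statistical test. All names in it are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first 3 columns are informative, the rest noise
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=42
)

# Shadow features: each column shuffled independently, breaking its link to y
rng = np.random.default_rng(42)
X_shadow = np.column_stack([rng.permutation(col) for col in X_toy.T])

# Train one forest on the real and shadow features together
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(np.hstack([X_toy, X_shadow]), y_toy)

n_real = X_toy.shape[1]
real_importance = forest.feature_importances_[:n_real]
shadow_max = forest.feature_importances_[n_real:].max()

# A real feature scores a "hit" if it beats the best shadow feature
hits = real_importance > shadow_max
print(hits)
```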

I used GridSearchCV to find the best hyperparameters for my Logistic Regression model. This helps me optimize the model's performance.

In [None]:
# GridSearchCV is the only new import; the other modules were already imported above
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
}



# Task 8: Training a Logistic Regression Model


I use a simple logistic regression model to predict the survival of passengers based on the selected features. I will evaluate the model's performance on the validation set using accuracy, precision, recall, and F1-score.

In [None]:
%%time
%pip install joblib
import joblib

# Initialize the logistic regression model
lr = LogisticRegression(max_iter=3000, random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=lr, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=2)

# Fit GridSearchCV on the training set with selected features
grid_search.fit(X_train_smote[selected_features], y_train_smote)

# Get the best hyperparameters
best_params = grid_search.best_params_
print(f"Best hyperparameters: {best_params}")

# Train the logistic regression model with the best hyperparameters
best_lr = LogisticRegression(**best_params, max_iter=3000, random_state=42)
best_lr.fit(X_train_smote[selected_features], y_train_smote)
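
As a side note, `GridSearchCV` refits the best configuration on the full training data by default (`refit=True`) and exposes it as `best_estimator_`, so the explicit retraining step is optional. The sketch below shows this on synthetic data from `make_classification`; the data and the reduced parameter grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the resampled training set
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)

search = GridSearchCV(
    LogisticRegression(max_iter=3000, random_state=42),
    param_grid={"C": [0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

# Already refit on all of X_toy with the best hyperparameters
model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

Retraining by hand, as done above, gives the same kind of model and keeps the step explicit.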


I evaluated my model on the validation set to check its performance. I looked at accuracy, the confusion matrix, and the classification report.

In [None]:
# Evaluate the model on the validation set
y_val_pred = best_lr.predict(X_val[selected_features])
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy: {val_accuracy}")

# Display the confusion matrix
val_cm = confusion_matrix(y_val, y_val_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(val_cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Survived', 'Survived'], yticklabels=['Not Survived', 'Survived'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Validation Confusion Matrix')
plt.show()

# Display the classification report
val_cr = classification_report(y_val, y_val_pred)
print("Validation Classification Report:")
print(val_cr)

I combined the training and validation sets to retrain my model with the best hyperparameters. This helps improve the model by using more data.

In [None]:
# Combine the training and validation sets
X_train_combined = pd.concat([X_train_smote, X_val])
y_train_combined = pd.concat([y_train_smote, y_val])

# Retrain the logistic regression model on the combined dataset with the best hyperparameters
final_lr = LogisticRegression(**best_params, max_iter=3000, random_state=42)
final_lr.fit(X_train_combined[selected_features], y_train_combined)

I evaluated my final model on the testing set to see how well it performs on completely unseen data. I checked the accuracy, confusion matrix, and classification report.

In [None]:
# Make predictions on the testing set
y_test_pred = final_lr.predict(X_test[selected_features])

# Calculate accuracy
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Testing Accuracy: {test_accuracy}")

# Display the confusion matrix
test_cm = confusion_matrix(y_test, y_test_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(test_cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Survived', 'Survived'], yticklabels=['Not Survived', 'Survived'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Testing Confusion Matrix')
plt.show()

# Display the classification report
test_cr = classification_report(y_test, y_test_pred)
print("Testing Classification Report:")
print(test_cr)

In conclusion, the model has an accuracy of 0.75. I consider this a good result for a simple logistic regression model, as it predicts the survival of passengers with a reasonable degree of accuracy. The precision, recall, and F1-score metrics are also good, indicating that the model performs well at predicting the survival of passengers.
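
An accuracy figure is easier to judge next to a naive baseline that always predicts the most frequent class. This sketch uses `DummyClassifier` on hypothetical labels (the class counts are invented for illustration and are not the actual split sizes); on the real data, `y_train` and `y_test` would take their place.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels; the features are irrelevant for this baseline
y_train_toy = np.array([0] * 62 + [1] * 38)
y_test_toy = np.array([0] * 31 + [1] * 19)
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# Always predicts the majority class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_train_toy, y_train_toy)
baseline_accuracy = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

A model is only informative to the extent that it beats this baseline, which is also why precision and recall for the minority class matter alongside accuracy.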