## Exploring Bayesian Machine Learning: A Case Study on Titanic Survival Prediction


As part of my journey to master Bayesian machine learning concepts, I am undertaking a comprehensive analysis of the Titanic dataset using Bayesian techniques. This project aims to deepen my understanding of Bayesian methods and their applications in predictive modeling.


#### Importing Necessary Libraries and Loading the [Kaggle Data Set](https://www.kaggle.com/datasets/waqi786/titanic-dataset/data)


In [82]:
import pandas as pd
import numpy as np
df = pd.read_csv("Titanic data set.csv")
df.head()

|  | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | Survived |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 3 | Allison Hill | male | 17 | 4 | 2 | 43d75413-a939-4bd1-a516-b0d47d3572cc | 144.08 | Q | 1 |
| 1 | 2 | 1 | Noah Rhodes | male | 60 | 2 | 2 | 6334fa2a-8b4b-47e7-a451-5ae01754bf08 | 249.04 | S | 0 |
| 2 | 3 | 3 | Angie Henderson | male | 64 | 0 | 0 | 61a66444-e2af-4629-9efb-336e2f546033 | 50.31 | Q | 1 |
| 3 | 4 | 3 | Daniel Wagner | male | 35 | 4 | 0 | 0b6c03c8-721e-4419-afc3-e6495e911b91 | 235.2 | C | 1 |
| 4 | 5 | 1 | Cristian Santos | female | 70 | 0 | 3 | 436e3c49-770e-49db-b092-d40143675d58 | 160.17 | C | 1 |


### Dropping Unnecessary Columns

This operation removes the identifier columns (`PassengerId`, `Name`, `Ticket`) and the features not used in this model (`SibSp`, `Parch`, `Embarked`), which simplifies the dataset for analysis and modeling.

In [83]:
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Embarked'],axis='columns',inplace=True)
df.head()

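|  | Pclass | Sex | Age | Fare | Survived |
| --- | --- | --- | --- | --- | --- |
| 0 | 3 | male | 17 | 144.08 | 1 |
| 1 | 1 | male | 60 | 249.04 | 0 |
| 2 | 3 | male | 64 | 50.31 | 1 |
| 3 | 3 | male | 35 | 235.2 | 1 |
| 4 | 1 | female | 70 | 160.17 | 1 |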



#### Convert categorical variables to numeric

In [85]:
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

In [86]:
df.head()

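|  | Pclass | Sex | Age | Fare | Survived |
| --- | --- | --- | --- | --- | --- |
| 0 | 3 | 0 | 17 | 144.08 | 1 |
| 1 | 1 | 0 | 60 | 249.04 | 0 |
| 2 | 3 | 0 | 64 | 50.31 | 1 |
| 3 | 3 | 0 | 35 | 235.2 | 1 |
| 4 | 1 | 1 | 70 | 160.17 | 1 |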


### Handling Missing Values

This operation fills in missing values in the dataset to ensure all features are complete for modeling.

**Details:**
- **Age and Fare**: Missing values are filled with the **mean** of their respective columns.
- **Pclass**: Missing values are filled with the **mode** (most common value) of the `Pclass` column.
- **Sex**: Missing values are filled with the **mode** of the `Sex` column.

This step helps to maintain the integrity of the dataset and prevent errors during model training.


In [87]:
df[['Age','Fare']] = df[['Age', 'Fare']].fillna(df[['Age', 'Fare']].mean())


In [92]:
df['Pclass'] = df['Pclass'].fillna(df['Pclass'].mode()[0])
df['Sex'] = df['Sex'].fillna(df['Sex'].mode()[0])
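
The imputation rules described above can be checked on a small frame before applying them to the real data. The sketch below uses a hypothetical four-row `toy` DataFrame (not the Kaggle file) and counts missing values before and after filling, so the effect of the mean and mode fills is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column (not the Kaggle data)
toy = pd.DataFrame({
    "Pclass": [1, 3, np.nan, 3],
    "Sex": [0, 1, 1, np.nan],
    "Age": [20.0, np.nan, 40.0, 60.0],
    "Fare": [10.0, 30.0, np.nan, 50.0],
})

# Count missing values per column before imputing
missing_before = toy.isna().sum()

# Continuous columns: fill each gap with the column mean
toy[["Age", "Fare"]] = toy[["Age", "Fare"]].fillna(toy[["Age", "Fare"]].mean())

# Categorical codes: fill each gap with the most frequent value
for col in ["Pclass", "Sex"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

missing_after = toy.isna().sum()

print(missing_before.to_dict())
print(missing_after.to_dict())
print(toy)
```

Running `df.isna().sum()` on the real DataFrame in the same way shows whether any column actually needed imputation.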

### Splitting the Data

This operation divides the dataset into training and testing sets for model evaluation.

**Details:**
- **Features (X)**: Selected columns are `Pclass`, `Sex`, `Age`, and `Fare`.
- **Target (y)**: The target variable is `Survived`.
- **Train-Test Split**: The data is split into training (80%) and testing (20%) sets using a random state for reproducibility.

This step is crucial for assessing the model's performance on unseen data.


In [109]:
from sklearn.model_selection import train_test_split

# Separate the features from the target
X = df[['Pclass', 'Sex', 'Age', 'Fare']]
y = df['Survived']
# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [110]:
y_test[0:10]

521    0
737    0
740    1
660    0
411    1
678    1
626    1
513    0
859    1
136    1
Name: Survived, dtype: int64

In [111]:
X_test[0:10]

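|  | Pclass | Sex | Age | Fare |
| --- | --- | --- | --- | --- |
| 521 | 1 | 1 | 78 | 145.2 |
| 737 | 1 | 1 | 65 | 454.93 |
| 740 | 2 | 0 | 45 | 80.3 |
| 660 | 3 | 1 | 14 | 30.37 |
| 411 | 2 | 1 | 48 | 29.73 |
| 678 | 1 | 1 | 32 | 20.26 |
| 626 | 3 | 0 | 22 | 344.27 |
| 513 | 1 | 0 | 37 | 424.97 |
| 859 | 1 | 0 | 12 | 249.93 |
| 136 | 2 | 0 | 46 | 389.87 |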


### Initializing and Training the Model

This operation sets up the Gaussian Naive Bayes model and trains it using the training dataset.

**Details:**
- **Model**: `GaussianNB()` is used. It assumes that the features are conditionally independent given the class and that each feature is normally distributed within each class.
- **Training**: The model is trained with the training features (`X_train`) and the corresponding target values (`y_train`).

This step prepares the model to make predictions based on the patterns learned from the training data.
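
To make the idea behind Gaussian Naive Bayes concrete, here is a minimal from-scratch sketch in plain Python. It illustrates the general technique (class prior times a product of per-feature Gaussian densities, then normalization) and is not scikit-learn's internal implementation. The function names, the toy data, and the small `1e-9` variance floor are all assumptions made for this illustration.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    params = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Small floor keeps every variance strictly positive (assumption of this sketch)
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        params[label] = (n / len(X), means, variances)
    return params

def predict_proba(params, row):
    """Return normalized posterior probabilities for one feature row."""
    log_post = {}
    for label, (prior, means, variances) in params.items():
        total = math.log(prior)
        for v, m, var in zip(row, means, variances):
            # Log of the Gaussian density of this feature given the class
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        log_post[label] = total
    # Subtract the largest log value before exponentiating for numerical stability
    peak = max(log_post.values())
    unnorm = {k: math.exp(v - peak) for k, v in log_post.items()}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

# Hypothetical toy data: columns are (Age, Fare), labels are 0/1
X_toy = [[20, 10.0], [25, 12.0], [30, 11.0], [60, 80.0], [65, 90.0], [70, 85.0]]
y_toy = [0, 0, 0, 1, 1, 1]

params = fit_gaussian_nb(X_toy, y_toy)
probs = predict_proba(params, [22, 11.0])
print(probs)
```

Because the two toy classes are well separated, the query row is assigned to class 0 with a probability close to 1. When the per-class feature distributions overlap heavily, the posteriors drift toward 0.5 instead.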


In [112]:
# Initialize and train the model
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train)

In [113]:
# Make predictions
model.predict(X_test[0:10])

array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1], dtype=int64)

### Understanding Predictions from the Titanic Dataset

After training the Gaussian Naive Bayes model, predictions were made for the first ten passengers of the test set.

**Predictions:**
- Output: `array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1], dtype=int64)`

**Interpretation:**
- Each value in the array corresponds to a passenger in the test set, where:
  - **0** indicates that the passenger did **not survive**.
  - **1** indicates that the passenger **survived**.
  
In this case, the model predicts that seven of the ten passengers did not survive (the first six and the eighth) and that three survived (the seventh, ninth, and tenth). The predictions are based on the four features provided: passenger class, sex, age, and fare.
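
Predictions only become meaningful when compared with the true labels. The short sketch below copies the ten actual values from the `y_test[0:10]` output and the ten predicted values from the `model.predict` output, then counts how many agree. The variable names are illustrative.

```python
# Values copied from the y_test[0:10] and model.predict(X_test[0:10]) outputs
actual = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]

# Element-wise comparison of true and predicted labels
matches = [a == p for a, p in zip(actual, predicted)]
n_correct = sum(matches)
missed = [i for i, ok in enumerate(matches) if not ok]

print(f"{n_correct} of {len(actual)} correct")
print("Misclassified positions (0-based):", missed)
```

On these ten rows, 7 predictions match, and all three misses are survivors predicted as non-survivors. Ten rows are far too few to judge the model, so the full test set is scored next.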


In [114]:
# Evaluate the model on the full test set
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.3f}')

Accuracy: 0.455


In [115]:
model.predict_proba(X_test[:10])

array([[0.51365293, 0.48634707],
       [0.52805038, 0.47194962],
       [0.51906068, 0.48093932],
       [0.53527401, 0.46472599],
       [0.5416603 , 0.4583397 ],
       [0.54359921, 0.45640079],
       [0.48301363, 0.51698637],
       [0.50708723, 0.49291277],
       [0.47523716, 0.52476284],
       [0.49719877, 0.50280123]])

### Analysis of Predicted Probabilities

The Gaussian Naive Bayes model outputs a pair of class probabilities for each passenger.

Key Insights:

- Each row presents the probabilities of not surviving (first value) and surviving (second value).
- For instance, the first passenger has a 51.37% chance of not surviving and a 48.63% chance of survival.
- Probabilities close to 0.5 indicate uncertainty in the model's predictions, while values much closer to 0 or 1 suggest stronger confidence. Here every probability lies between roughly 0.46 and 0.54, so the model is not confident about any of these ten passengers.


This probability output shows how confident the model is in each prediction, which the hard 0/1 labels alone do not reveal.
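
The link between the probabilities and the predicted labels can be made explicit. The sketch below copies the `predict_proba` output, picks the class with the larger probability for each row, and measures how far the winning probability sits above the 0.5 decision boundary. The names `labels` and `margins` are illustrative.

```python
# Probabilities copied from the predict_proba output: [P(not survived), P(survived)]
proba = [
    [0.51365293, 0.48634707],
    [0.52805038, 0.47194962],
    [0.51906068, 0.48093932],
    [0.53527401, 0.46472599],
    [0.5416603, 0.4583397],
    [0.54359921, 0.45640079],
    [0.48301363, 0.51698637],
    [0.50708723, 0.49291277],
    [0.47523716, 0.52476284],
    [0.49719877, 0.50280123],
]

# The predicted label is the class with the larger probability
labels = [0 if p0 >= p1 else 1 for p0, p1 in proba]

# Distance of the winning probability from the 0.5 decision boundary
margins = [max(p0, p1) - 0.5 for p0, p1 in proba]

print(labels)
print(f"Largest margin: {max(margins):.4f}")
```

The derived labels reproduce the earlier prediction array, and the largest margin is only about 0.044, which confirms that every one of these predictions is a near coin flip.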

In [116]:
from sklearn.model_selection import cross_val_score
cross_val_score(GaussianNB(),X_train, y_train, cv=5)

array([0.50625, 0.5125 , 0.5    , 0.53125, 0.475  ])

### Cross-Validation Analysis

To evaluate the performance of the Gaussian Naive Bayes model, we performed cross-validation using 5 folds on the training data.

Key Insights:

- The cross-validation scores represent the accuracy of the model on each held-out fold of the training data.
- The scores range from 0.475 to about 0.531, so performance varies only slightly across the folds.

The average accuracy across the 5 folds is about 50.5%, roughly what random guessing would achieve on a balanced binary target, and the test accuracy of 0.455 is no better. This suggests that the model has little to no predictive power with these features, and points to the need for feature engineering, model tuning, or different algorithms.
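
The fold scores are easier to judge when summarized as a mean and a spread. The sketch below copies the five values from the `cross_val_score` output and uses only the standard library; the variable names are illustrative.

```python
from statistics import mean, pstdev

# Fold accuracies copied from the cross_val_score output
scores = [0.50625, 0.5125, 0.5, 0.53125, 0.475]

avg = mean(scores)
spread = pstdev(scores)  # population standard deviation across the folds

print(f"Mean accuracy: {avg:.3f}")
print(f"Std across folds: {spread:.3f}")
```

The mean comes out at about 0.505 with a standard deviation near 0.018, so the folds agree with each other and all sit close to chance level.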

### Conclusion

In this project, we utilized the Titanic dataset to predict passenger survival using the Gaussian Naive Bayes algorithm. The following key points summarize our findings:

- **Data Preparation**: We successfully cleaned the dataset by dropping irrelevant features and handling missing values, ensuring a robust foundation for analysis.
  
- **Model Training**: After splitting the data into training and testing sets, we trained the model and generated survival predictions for the test set, reaching a test accuracy of 0.455.

- **Probability Assessment**: The model also provided predicted probabilities, which offered insights into the confidence of each prediction.

- **Cross-Validation Performance**: The cross-validation scores averaged about 50%, close to chance level for a binary target, highlighting the need for further refinement in feature selection or the exploration of other algorithms.

**Final Thoughts**: The project demonstrated how to apply a Bayesian machine learning method end to end on the Titanic dataset. With accuracy near chance level, however, the model does not support conclusions about which factors affected survival. Future work could involve enhancing model performance through additional feature engineering and experimenting with different classification algorithms.
