**Assignment: Gaussian Naive Bayes for Titanic Survival Prediction**

Instructions:

You are required to write Python code to solve the given problems.

Use the scikit-learn library for implementing Gaussian Naive Bayes.

Comment your code appropriately for better understanding.

Submit the Python script with the completed code.

1. Load the "Titanic" dataset using pandas and explore its structure and contents. Show the result.
2. Perform data preprocessing tasks such as handling missing values, feature selection, and feature encoding. Consider using techniques like imputation for missing values and one-hot encoding for categorical variables.
3. Split the dataset into training and testing sets.

*Use 7x% of the data for training and (100 - 7x)% for testing.*

**[x = 2nd last digit of your roll number]**

*Include your roll number and its 2nd last digit as a comment in your code.*

4. Implement Gaussian Naive Bayes using the training set. Train the model on the selected features and the corresponding target variable (Survived).
5. Apply the trained model to make predictions on the testing set.
6. Evaluate the performance of the Gaussian Naive Bayes model by calculating metrics such as accuracy, precision, recall, and F1-score. Interpret the results and analyze the model's effectiveness. (Be creative here)
7. Implement any chart of your choice. (Bonus)
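
The split rule in step 3 is plain arithmetic on the roll number. A minimal sketch, assuming "7x" denotes the two-digit number 70 + x (so the training share always falls between 70% and 79%); the helper name `split_percentages` is hypothetical and not part of the assignment:

```python
def split_percentages(roll_number):
    """Return (train %, test %) for a roll number.

    Assumption: "7x%" means the two-digit number 70 + x, where x is the
    2nd last digit of the roll number.
    """
    x = (roll_number // 10) % 10  # drop the last digit, then take the new last digit
    train_pct = 70 + x
    return train_pct, 100 - train_pct


# Roll number 220825 has 2 as its 2nd last digit
train_pct, test_pct = split_percentages(220825)
print(train_pct, test_pct)
```

Dividing either percentage by 100 gives the fraction that `train_test_split` expects for `train_size` or `test_size`.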

import pandas as pd

# 1. Load the dataset
titanic_data = pd.read_csv('titanic.csv')

# Explore the structure and contents of the dataset
print(titanic_data.head())  # first five rows
titanic_data.info()  # column types and non-null counts (prints directly, returns None)

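# 2. Data preprocessing
# Fill missing values: median for 'Age', most frequent value for 'Embarked'
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].median())
titanic_data['Embarked'] = titanic_data['Embarked'].fillna(titanic_data['Embarked'].mode()[0])

# Feature encoding: one-hot encode the categorical columns
titanic_data = pd.get_dummies(titanic_data, columns=['Sex', 'Embarked'])

# Feature selection: numeric columns plus the one-hot columns that
# get_dummies created in place of 'Sex' and 'Embarked'
encoded_columns = [c for c in titanic_data.columns if c.startswith(('Sex_', 'Embarked_'))]
selected_features = ['Pclass', 'Age', 'Fare'] + encoded_columns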

# 3. Split the dataset into training and testing sets.

from sklearn.model_selection import train_test_split

X = titanic_data[selected_features]
y = titanic_data['Survived']

# Use 72% of the data for training and 28% for testing,
# reading "7x%" as the two-digit number 70 + x (7 * x would leave no training data for x = 0)
x = 2  # 2nd last digit of roll number 220825
train_size = (70 + x) / 100  # fraction of rows used for training
test_size = 1 - train_size  # fraction of rows used for testing

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=test_size, random_state=42)


# 4. Implement Gaussian Naive Bayes

from sklearn.naive_bayes import GaussianNB

# Create the Gaussian Naive Bayes model
model = GaussianNB()

# Train the model on the training set
model.fit(train_X, train_y)

# 5. Apply the trained model 

predictions = model.predict(test_X)

# 6. Evaluate the performance of the Gaussian Naive Bayes model

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(test_y, predictions)
precision = precision_score(test_y, predictions)
recall = recall_score(test_y, predictions)
f1 = f1_score(test_y, predictions)

# Report the metrics as percentages; precision, recall, and F1 refer to
# the positive class (Survived = 1)

print("Accuracy: {:.2f}%".format(accuracy * 100))
print("Precision: {:.2f}%".format(precision * 100))
print("Recall: {:.2f}%".format(recall * 100))
print("F1-score: {:.2f}%".format(f1 * 100))

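# 7. Bonus: bar chart of the survival distribution

import matplotlib.pyplot as plt

# Count non-survivors (0) and survivors (1); sort_index() keeps the counts
# in 0, 1 order so they line up with the bar labels below
survivors = titanic_data['Survived'].value_counts().sort_index()

# Plot a bar chart
plt.bar(['Not Survived', 'Survived'], survivors)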
plt.xlabel('Survival Status')
plt.ylabel('Count')
plt.title('Survival Distribution')
plt.show()
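
To interpret the four scores reported in step 6, it helps to see how each one follows from the confusion-matrix counts. A minimal sketch in plain Python, assuming binary labels with 1 = survived; the function name `binary_metrics` and the label lists are made up for illustration and are not Titanic results:

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # survivors predicted as survivors
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-survivors predicted as survivors
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # survivors the model missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-survivors predicted correctly

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (illustrative only): 3 true positives, 1 false positive,
# 1 false negative, 3 true negatives
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(toy_true, toy_pred))
```

Precision answers "of the passengers predicted to survive, how many did?", while recall answers "of the passengers who survived, how many did the model find?". A large gap between the two indicates which kind of error the model makes more often.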
