<a href="https://colab.research.google.com/github/Bosy-Ayman/Machine_Learning/blob/main/Assignment(6)_Naive_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Objective:


In this assignment, you will implement a Naive Bayes classifier, using built-in
libraries, to predict the survival of passengers on the Titanic. You will use
the Titanic dataset, which contains various features about the passengers, to
train and evaluate your model.

# Assignment Tasks:

# 1- Data Exploration and Preprocessing:
Load the train and test datasets using pandas.

Explore the data to understand the features and their types.

Handle missing values appropriately (e.g., imputing or
removing).

Convert categorical features into numerical ones using
techniques like one-hot encoding.

Normalize or scale the features if necessary.

# 2- Feature Selection:
Select the features you will use to train your model.
Justify your choices.

# 3- Model Implementation and Evaluation:

Implement a Naive Bayes classifier using scikit-learn. You
may choose between GaussianNB, MultinomialNB, or BernoulliNB based on your
feature types.

Train your model on the training dataset.

Evaluate your model using appropriate metrics, such as
accuracy, precision, recall, and F1-score.

Perform cross-validation to ensure the robustness of your
model.

# 4- Prediction:
Use your trained model to make predictions on the test
dataset.

# Report (2 marks):

Write a detailed report (maximum 5 pages), including:

An introduction to the problem and the Naive Bayes
classifier.

Data exploration and preprocessing steps.
Feature selection rationale.

Model training process and parameter tuning.

Discussion of the evaluation results.

Conclusion and any challenges faced.

In [80]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [81]:
data = pd.read_csv('titanic_dataset.csv')
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Drop unneeded columns

In [82]:
df = data.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin'], axis='columns')
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked
0,0,3,male,22.0,7.25,S
1,1,1,female,38.0,71.2833,C
2,1,3,female,26.0,7.925,S
3,1,1,female,35.0,53.1,S
4,0,3,male,35.0,8.05,S


In [83]:
df.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
Fare          0
Embarked      2
dtype: int64

*Null values exist: 177 in `Age` and 2 in `Embarked`.*
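
When deciding whether to impute or drop, the share of missing values per column is often easier to read than raw counts. A minimal sketch on a small made-up frame (the `toy` DataFrame and its values are illustrative assumptions, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a few Titanic columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Embarked": ["S", "C", None, "S"],
})

# Fraction of missing entries per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

A column with a large missing share is usually a candidate for imputation rather than row removal, since dropping rows would discard much of the data.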

# Preprocessing

# Clean Data

In [84]:
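# Impute missing ages with the median; assign the result back rather than
# calling fillna(inplace=True) on a column selection
df['Age'] = df['Age'].fillna(df['Age'].median())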


In [85]:
df.isnull().sum()

Survived    0
Pclass      0
Sex         0
Age         0
Fare        0
Embarked    2
dtype: int64

In [86]:
# Impute the missing ports with the most frequent value
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

In [87]:
# Safety net: drop any rows that still contain nulls
df = df.dropna(axis=0)

In [88]:
df.isnull().sum()

Survived    0
Pclass      0
Sex         0
Age         0
Fare        0
Embarked    0
dtype: int64
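
The cells above compute the median on the whole dataset before splitting. An alternative, shown here only as a sketch and not as the notebook's method, is to wrap the imputer and the classifier in a scikit-learn pipeline so the median is learned from the training folds alone. The data below is synthetic, and the variable names are assumptions made for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic data: two numeric features, label depends on the first one.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

# Knock out roughly 10% of the entries to simulate missing values.
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer's median is re-learned on each training fold only.
model = make_pipeline(SimpleImputer(strategy="median"), GaussianNB())
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

With only a few missing values the difference is usually small, but the pipeline keeps the evaluation free of information from the validation rows.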

# Dealing with categorical data

# One Hot encoding

In [89]:
df = pd.get_dummies(df, columns=['Sex','Embarked'], drop_first=True)
df.head()

Unnamed: 0,Survived,Pclass,Age,Fare,Sex_male,Embarked_Q,Embarked_S
0,0,3,22.0,7.25,True,False,True
1,1,1,38.0,71.2833,False,False,False
2,1,3,26.0,7.925,False,False,True
3,1,1,35.0,53.1,False,False,True
4,0,3,35.0,8.05,True,False,True
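
To see why only `Sex_male`, `Embarked_Q`, and `Embarked_S` appear, here is a small sketch of what `drop_first=True` does. The three-row `toy` frame is made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# drop_first=True removes the first level of each column in sorted order
# ("female" and "C" here), which becomes the implicit baseline.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)
print(list(encoded.columns))
print(encoded)
```

A passenger who is female and embarked at C is therefore represented by all three dummy columns being false, as in row 1 of the output above.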


# Preview the encoded data

In [90]:
df.head()

Unnamed: 0,Survived,Pclass,Age,Fare,Sex_male,Embarked_Q,Embarked_S
0,0,3,22.0,7.25,True,False,True
1,1,1,38.0,71.2833,False,False,False
2,1,3,26.0,7.925,False,False,True
3,1,1,35.0,53.1,False,False,True
4,0,3,35.0,8.05,True,False,True


# Feature selection

In [91]:
features = df.drop('Survived', axis=1)
label = df['Survived']

# Splitting data

In [92]:
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=1)
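
The split above is purely random. When the classes are imbalanced, passing `stratify` keeps the class ratio the same in both partitions. A sketch on synthetic labels (the arrays below are made up, and this option is not used in the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 60 zeros and 40 ones.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 60 + [1] * 40)

# stratify=y preserves the 60/40 class ratio in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
print(y_tr.mean(), y_te.mean())
```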

In [93]:
clf = GaussianNB()
clf.fit(X_train, y_train)
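
Gaussian Naive Bayes treats each feature as an independent normal variable within each class and predicts the class with the largest log prior plus summed log densities. The sketch below shows that idea in plain Python on made-up one-feature data. `fit_gaussian_nb` and `predict_one` are hypothetical helper names, and the sketch leaves out the variance smoothing that scikit-learn applies:

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    params = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Maximum-likelihood (population) variance per feature.
        variances = [
            sum((v - m) ** 2 for v in col) / n
            for col, m in zip(zip(*rows), means)
        ]
        params[label] = (n / len(X), means, variances)
    return params

def predict_one(params, row):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_posterior(prior, means, variances):
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(params, key=lambda c: log_posterior(*params[c]))

# Made-up data: one feature, two well-separated classes.
X = [[1.0], [1.2], [0.8], [5.0], [5.3], [4.7]]
y = [0, 0, 0, 1, 1, 1]
params = fit_gaussian_nb(X, y)
print(predict_one(params, [1.1]), predict_one(params, [4.9]))
```

Because the model assumes normally distributed features, it fits continuous columns such as `Age` and `Fare` more naturally than the boolean dummy columns.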

In [94]:
# GaussianNB, train_test_split and accuracy_score were already imported in the first cell
from sklearn.metrics import precision_score, recall_score, f1_score

# Prediction

In [95]:
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)


print("Accuracy: ", accuracy)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1 Score: ", f1)


Accuracy:  0.770949720670391
Precision:  0.7285714285714285
Recall:  0.6986301369863014
F1 Score:  0.7132867132867133


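The model reaches about 77% accuracy on the held-out split, with precision of about 0.73 and recall of about 0.70 for the survived class.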
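
All four printed scores derive from the confusion matrix. The counts in the sketch below were inferred from the printed values and are an assumption, not an output of the notebook. They reproduce the four scores over 179 test rows:

```python
# Counts inferred from the printed scores (assumption, not notebook output).
tp, fp, fn, tn = 51, 19, 22, 87

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of predicted survivors, how many survived
recall = tp / (tp + fn)      # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Read this way, the model misses 22 of the 73 actual survivors and wrongly flags 19 non-survivors, which says more than the accuracy figure alone.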

In [96]:
from sklearn.model_selection import cross_val_score

In [97]:
# 5-fold cross-validation accuracy on the full dataset
cv_scores = cross_val_score(clf, features, label, cv=5)

In [98]:
cv_scores

array([0.69273743, 0.79213483, 0.81460674, 0.76404494, 0.79213483])
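
The five fold scores are easier to report as a mean and a spread. A short sketch using the values printed above (`fold_scores` is a name introduced here for illustration):

```python
import statistics

# Fold accuracies as printed by the notebook.
fold_scores = [0.69273743, 0.79213483, 0.81460674, 0.76404494, 0.79213483]

mean_acc = statistics.mean(fold_scores)
std_acc = statistics.stdev(fold_scores)  # sample standard deviation
print(f"{mean_acc:.3f} +/- {std_acc:.3f}")
```

The mean is about 0.771, which is in line with the single train/test split, while the first fold (about 0.69) shows that the score varies noticeably with the choice of rows.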

In [99]:
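# Predict on the held-out split (X_test comes from train_test_split, not a separate test file)
test_predictions = clf.predict(X_test)

# Store predictions next to the true labels; rows outside the held-out split stay NaN
df.loc[X_test.index, 'Predicted_Survived'] = test_predictions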

In [100]:
df

Unnamed: 0,Survived,Pclass,Age,Fare,Sex_male,Embarked_Q,Embarked_S,Predicted_Survived
0,0,3,22.0,7.2500,True,False,True,
1,1,1,38.0,71.2833,False,False,False,
2,1,3,26.0,7.9250,False,False,True,1.0
3,1,1,35.0,53.1000,False,False,True,1.0
4,0,3,35.0,8.0500,True,False,True,
...,...,...,...,...,...,...,...,...
886,0,2,27.0,13.0000,True,False,True,
887,1,1,19.0,30.0000,False,False,True,1.0
888,0,3,28.0,23.4500,False,False,True,
889,1,1,26.0,30.0000,True,False,False,
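
The `Predicted_Survived` column is filled only for rows in the held-out 20% split, which is why most rows show no value. If a separate test file is scored instead, its one-hot columns have to match the training columns. A sketch under that assumption, with hypothetical `train_X` and `new_raw` frames and made-up values:

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Hypothetical training features (already encoded) and labels.
train_X = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 27.0, 28.0],
    "Sex_male": [1, 0, 0, 0, 1, 1],
    "Embarked_Q": [0, 0, 0, 0, 1, 0],
    "Embarked_S": [1, 0, 1, 1, 0, 1],
})
train_y = [0, 1, 1, 1, 0, 0]
model = GaussianNB().fit(train_X, train_y)

# Hypothetical unseen rows; nobody here embarked at Q.
new_raw = pd.DataFrame({
    "Pclass": [1, 3],
    "Age": [30.0, 40.0],
    "Sex": ["female", "male"],
    "Embarked": ["C", "S"],
})
new_X = pd.get_dummies(new_raw, columns=["Sex", "Embarked"], drop_first=True)

# Align to the training columns; dummies absent from the new data become 0.
new_X = new_X.reindex(columns=train_X.columns, fill_value=0).astype(float)
print(model.predict(new_X))
```

Note that `drop_first` picks its baseline from the levels present in the frame, so this sketch assumes the new data contains the same first level of each category as the training data.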
