#### Dataset Used: Titanic Dataset
#### Name: Rajath C Aralikatti
#### Roll No: 181CO241 Section 2

## Naive Bayes Classification
- Naive Bayes Classification is a supervised classification algorithm.
- It applies Bayes' theorem under the assumption that the input features are conditionally independent given the class. This independence assumption is what makes the method "naive".
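
To make the independence assumption concrete, here is a minimal from-scratch sketch of a Gaussian Naive Bayes classifier on toy data. The function names (`fit_gaussian_nb`, `predict_gaussian_nb`) and the toy arrays are illustrative assumptions and are not part of this notebook; the variance smoothing is a simplified constant, not scikit-learn's exact scheme.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate class priors plus per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant added so a zero variance never causes a division by zero
    # (a simplification; hypothetical choice for this sketch).
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + var_smoothing
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Return the class with the highest log posterior for each row of X."""
    # Broadcast to shape (n_samples, n_classes, n_features).
    diff = X[:, None, :] - means[None, :, :]
    log_density = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                          + diff ** 2 / variances[None, :, :])
    # Conditional independence: the joint log likelihood is the SUM of the
    # per-feature log densities.
    log_posterior = np.log(priors)[None, :] + log_density.sum(axis=2)
    return classes[np.argmax(log_posterior, axis=1)]

# Toy data: two well-separated clusters (illustrative only).
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_nb(X_toy, y_toy)
toy_pred = predict_gaussian_nb(np.array([[1.1, 2.1], [5.1, 5.9]]), *params)
print(toy_pred)
```

Working in log space avoids numerical underflow when many small per-feature probabilities are multiplied together.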

## Import the required libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

## Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import os
os.chdir('/content/drive/MyDrive/Colab Notebooks')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Reading the Data from titanic.csv

In [None]:
df = pd.read_csv('./Data/titanic.csv')
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Dropping Redundant Columns and removing Rows with Missing Data (Data Pre-Processing)

In [None]:
print(df.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


In [None]:
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Fare', 'Cabin'], axis=1)

In [None]:
print(df.columns)

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked'], dtype='object')


In [None]:
print(df.isna().sum())

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Embarked      2
dtype: int64


In [None]:
print(df.shape)

(891, 7)


In [None]:
df = df.dropna()
print(df.shape)

(712, 7)


In [None]:
print(df.isna().sum())

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Embarked    0
dtype: int64


In [None]:
print(df.groupby('Survived').size())

Survived
0    424
1    288
dtype: int64


## Replacing Strings with Numeric Data

- Column: Sex<br>
>Female : 0<br>
>Male : 1<br><br>

- Column: Embarked<br>
>C : 0<br>
>Q : 1<br>
>S : 2

In [None]:
# Encode each categorical column explicitly so that only the intended values change
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})
df['Embarked'] = df['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

## Splitting data between Train and Test sets

In [None]:
train, test = train_test_split(df, test_size=0.3, stratify=df['Survived'], random_state=0)
print('Train Shape', train.shape, '\n', train.groupby('Survived').size())
print('\nTest Shape', test.shape, '\n', test.groupby('Survived').size())

Train Shape (498, 7) 
 Survived
0    297
1    201
dtype: int64

Test Shape (214, 7) 
 Survived
0    127
1     87
dtype: int64
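
The `stratify` argument keeps the class proportions roughly equal across the two splits: in the output above, class 0 makes up 297/498 ≈ 0.60 of the training set and 127/214 ≈ 0.59 of the test set. The sketch below checks the same property on synthetic labels; the arrays `features` and `labels` are made-up illustrations, not data from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 60 samples of class 0 and 40 of class 1.
labels = np.array([0] * 60 + [1] * 40)
features = np.arange(100).reshape(-1, 1)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=0
)

# Fraction of class 1 in each split; both should stay close to 0.40.
train_ratio = l_train.mean()
test_ratio = l_test.mean()
print('Class 1 share in train:', train_ratio)
print('Class 1 share in test: ', test_ratio)
```

Without `stratify`, a random split of a small dataset can leave one class noticeably over- or under-represented in the test set.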


In [None]:
y_train = (train.pop('Survived')).to_numpy()
y_test = (test.pop('Survived')).to_numpy()
x_train = train.to_numpy()
x_test = test.to_numpy()

In [None]:
print('Dimensions and datatype of')

print('x_train:', x_train.shape, '\tdtype:', x_train.dtype, '\tRange:', x_train.min(), 'to', x_train.max())
print('y_train:', y_train.shape, '\tdtype:', y_train.dtype, '\tRange:', y_train.min(), 'to', y_train.max())

print('x_test:', x_test.shape, '\tdtype:', x_test.dtype, '\tRange:', x_test.min(), 'to', x_test.max())
print('y_test:', y_test.shape, '\t\tdtype:', y_test.dtype, '\tRange:', y_test.min(), 'to', y_test.max())

Dimensions and datatype of
x_train: (498, 6) 	dtype: float64 	Range: 0.0 to 80.0
y_train: (498,) 	dtype: int64 	Range: 0 to 1
x_test: (214, 6) 	dtype: float64 	Range: 0.0 to 80.0
y_test: (214,) 		dtype: int64 	Range: 0 to 1


## Normalize the data

In [None]:
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)

In [None]:
x_train = (x_train - mean) / std
x_test = (x_test - mean) / std
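
The statistics are computed on the training set only and then reused for the test set, so no information from the test data leaks into preprocessing. As a hedged cross-check, the sketch below shows that this manual standardization should match scikit-learn's `StandardScaler` when it is fitted on the training data. The arrays `demo_train` and `demo_test` are randomly generated stand-ins, not the Titanic features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-in data (illustrative only).
rng = np.random.default_rng(0)
demo_train = rng.normal(loc=30.0, scale=10.0, size=(50, 3))
demo_test = rng.normal(loc=30.0, scale=10.0, size=(20, 3))

# Manual standardization, as done in this notebook.
manual_mean = demo_train.mean(axis=0)
manual_std = demo_train.std(axis=0)  # population std (ddof=0)
manual_test = (demo_test - manual_mean) / manual_std

# Equivalent scikit-learn version: fit on train, transform test.
scaler = StandardScaler().fit(demo_train)
scaled_test = scaler.transform(demo_test)

same = np.allclose(manual_test, scaled_test)
print('Manual and StandardScaler results match:', same)
```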

In [None]:
print('Dimensions and datatype of')

print('x_train:', x_train.shape, '\tdtype:', x_train.dtype, '\tRange:', x_train.min(), 'to', x_train.max())
print('y_train:', y_train.shape, '\tdtype:', y_train.dtype, '\tRange:', y_train.min(), 'to', y_train.max())

print('x_test:', x_test.shape, '\tdtype:', x_test.dtype, '\tRange:', x_test.min(), 'to', x_test.max())
print('y_test:', y_test.shape, '\t\tdtype:', y_test.dtype, '\tRange:', y_test.min(), 'to', y_test.max())

Dimensions and datatype of
x_train: (498, 6) 	dtype: float64 	Range: -2.0776848062419653 to 6.470294457908575
y_train: (498,) 	dtype: int64 	Range: 0 to 1
x_test: (214, 6) 	dtype: float64 	Range: -2.0776848062419653 to 6.470294457908575
y_test: (214,) 		dtype: int64 	Range: 0 to 1


## Loading and Fitting the Model

In [None]:
from sklearn.naive_bayes import GaussianNB
naive_bayes = GaussianNB()
naive_bayes.fit(x_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

## Prediction on the Test Set

In [None]:
y_pred = naive_bayes.predict(x_test)
print(y_test)
print(y_pred)

[1 0 0 1 1 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 0 0 1 1
 1 0 0 0 0 1 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0
 1 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 1
 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 1 0 0 0 1 1 1 0 0 0 1 0 1 0 0 0
 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 1 0 0 1 1 1 0 0 0 1
 0 0 1 1 1 1 0 0 1 0 0 1 0 1 0 1 0 0 1 0 1 1 0 0 1 0 0 0 1]
[1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1 1
 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 0 0 1 1
 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
 1 1 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1 0 0 1 1
 0 0 1 1 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 1 0 1 0 1]


## Test Accuracy and Classification Report

In [None]:
def results(y_test, y_predict):
    """Print the accuracy (in percent) and the per-class classification report."""
    print('Accuracy -', accuracy_score(y_test, y_predict) * 100)
    print('Report\n', classification_report(y_test, y_predict))

In [None]:
results(y_test, y_pred)

Accuracy - 81.77570093457945
Report
               precision    recall  f1-score   support

           0       0.82      0.88      0.85       127
           1       0.81      0.72      0.76        87

    accuracy                           0.82       214
   macro avg       0.82      0.80      0.81       214
weighted avg       0.82      0.82      0.82       214
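
The report gives per-class precision and recall but not the raw counts of correct and incorrect predictions. One way to obtain them is `sklearn.metrics.confusion_matrix`, sketched below. The helper `show_confusion` and the toy label arrays are hypothetical; in this notebook the call would presumably be `show_confusion(y_test, y_pred)`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def show_confusion(y_true, y_hat):
    """Print and return the 2x2 confusion matrix (rows: actual, columns: predicted)."""
    cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()
    print('True negatives :', tn)
    print('False positives:', fp)
    print('False negatives:', fn)
    print('True positives :', tp)
    return cm

# Toy labels for illustration only (not the notebook's predictions).
toy_true = np.array([1, 0, 0, 1, 1, 0, 0, 0])
toy_hat = np.array([1, 0, 0, 0, 1, 1, 0, 0])
toy_cm = show_confusion(toy_true, toy_hat)
```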

