This project applies data analysis and machine learning in Python to the Titanic passenger dataset. It imports the required libraries and loads 'titanic.csv' into a pandas DataFrame. Preprocessing drops the columns the model does not use, encodes the categorical 'Sex' column as dummy variables, and fills missing 'Age' values with the median age. The data is separated into features ('inputs') and target ('Survived'), then split into training and testing sets (70% / 30%). A Gaussian Naive Bayes model is trained on the training set and scored on the testing set, and its predictions are compared with the actual values to assess accuracy.

1. This cell imports the necessary libraries (pandas, matplotlib, numpy, and files from google.colab) and uploads titanic.csv to the Colab session with files.upload().

In [1]:
import pandas as pd # for dataset handling
import matplotlib.pyplot as plt # display purposes
import numpy as np # calculations
from google.colab import files # work with external files
uploaded=files.upload()

Saving titanic.csv to titanic.csv


2. This cell reads titanic.csv into a pandas DataFrame and displays its first five rows with df.head().

In [2]:
df=pd.read_csv("titanic.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Fare','Cabin','Embarked'],axis='columns',inplace=True) # keep only Survived, Pclass, Sex and Age
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age
0,0,3,male,22.0
1,1,1,female,38.0
2,1,3,female,26.0
3,1,1,female,35.0
4,0,3,male,35.0


In [5]:
inputs=df.drop('Survived',axis='columns') # with Survived removed, inputs contains only the independent variables
target=df.Survived # Survived is the dependent variable, so separate it from the independent variables
inputs

Unnamed: 0,Pclass,Sex,Age
0,3,male,22.0
1,1,female,38.0
2,3,female,26.0
3,1,female,35.0
4,3,male,35.0
...,...,...,...
886,2,male,27.0
887,1,female,19.0
888,3,female,
889,1,male,26.0


In [6]:
target

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

In [8]:
dummies=pd.get_dummies(inputs.Sex) # get_dummies() creates one flag column per distinct value of the given column
dummies.head(30)

Unnamed: 0,female,male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1
5,0,1
6,0,1
7,0,1
8,1,0
9,1,0


In [9]:
inputs=pd.concat([inputs,dummies],axis='columns') # append the dummy columns to the inputs DataFrame
inputs.head(3)

Unnamed: 0,Pclass,Sex,Age,female,male
0,3,male,22.0,0,1
1,1,female,38.0,1,0
2,3,female,26.0,1,0


In [10]:
inputs.drop(['Sex','male'],axis='columns',inplace=True) # drop the original Sex column and the redundant male flag; the female flag alone encodes the value
inputs.head(3)

Unnamed: 0,Pclass,Age,female
0,3,22.0,0
1,1,38.0,1
2,3,26.0,1
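
The three encoding cells above (get_dummies, concat, drop) end with a single `female` flag. As a rough sketch of the same end result in one step, a direct comparison can build that flag. The `toy` frame and its values are made up for illustration and are not the Titanic data.

```python
import pandas as pd

# Toy frame standing in for `inputs` (illustrative values only)
toy = pd.DataFrame({"Pclass": [3, 1, 3], "Sex": ["male", "female", "female"]})

# Build the single 0/1 flag directly: 1 where Sex is 'female', else 0
toy["female"] = (toy["Sex"] == "female").astype(int)

# The original text column is no longer needed once the flag exists
toy = toy.drop(columns="Sex")
print(toy)
```

This only works cleanly for a two-valued column; for categories with more values, `pd.get_dummies` as used in the notebook remains the more general tool.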


In [27]:
inputs.columns[inputs.isna().any()] # list the columns that contain missing values; an empty Index means none (In [27] ran after the Age fill in In [25], hence the empty result)

Index([], dtype='object')
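
The check above only names the affected columns. A minimal sketch of a per-column count, which also shows how many values are missing, might look like this. The `toy` frame is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing Age (illustrative values only)
toy = pd.DataFrame({"Pclass": [3, 1, 3], "Age": [22.0, np.nan, 26.0]})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Same check as in the notebook: names of columns with at least one NaN
cols_with_missing = toy.columns[toy.isna().any()].tolist()

print(missing_per_column)
print(cols_with_missing)
```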

In [16]:
inputs['Age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [25]:
inputs['Age'] = inputs['Age'].fillna(inputs['Age'].median()) # fill missing ages with the median age
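
To make the median fill concrete, here is a small worked sketch on a made-up Series (the four ages are hypothetical). `median()` skips NaN by default, so the fill value is computed from the known ages only.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry
ages = pd.Series([22.0, 38.0, np.nan, 35.0], name="Age")

# NaN is ignored, so this is the median of 22, 35 and 38
median_age = ages.median()

# Replace only the missing entries; known ages are unchanged
filled = ages.fillna(median_age)
print(median_age)
print(filled.tolist())
```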

In [29]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(inputs,target,test_size=0.3) # hold out 30% of the rows for testing
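
The split above is random on every run, so the rows in each set (and the score) change between executions. A sketch of a repeatable variant follows; `random_state=42` and `stratify` are assumptions added here and are not used in the notebook, and the toy data is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 20 rows, half of them labelled 1
toy_inputs = pd.DataFrame({"Pclass": [1, 2, 3, 1, 2] * 4, "female": [0, 1] * 10})
toy_target = pd.Series([0] * 10 + [1] * 10, name="Survived")

# random_state fixes the shuffle; stratify keeps the class ratio in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_inputs, toy_target, test_size=0.3, random_state=42, stratify=toy_target
)

# Repeating the call with the same random_state selects the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    toy_inputs, toy_target, test_size=0.3, random_state=42, stratify=toy_target
)
print(len(X_tr), len(X_te))
print(X_te.index.equals(X_te2.index))
```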

In [30]:
X_train

Unnamed: 0,Pclass,Age,female
665,2,32.0,0
365,3,30.0,0
680,3,29.7,1
592,3,47.0,0
377,1,27.0,0
...,...,...,...
555,1,62.0,0
114,3,17.0,1
698,1,49.0,0
516,2,34.0,1


In [31]:
y_train

665    0
365    0
680    0
592    0
377    0
      ..
555    0
114    0
698    0
516    1
76     0
Name: Survived, Length: 623, dtype: int64

In [32]:
from sklearn.naive_bayes import GaussianNB
model=GaussianNB()

In [33]:
model.fit(X_train,y_train) # estimate per-class mean and variance of each feature from the training data

In [34]:
model.score(X_test,y_test) # mean accuracy on the held-out test set

0.7873134328358209
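
For a classifier, `score` returns the fraction of rows predicted correctly; the value above is consistent with 211 correct out of 268 test rows (891 total minus 623 training rows). The sketch below, on made-up one-feature data, shows that `score`, `accuracy_score`, and a manual comparison agree. The names `X`, `y`, and `clf` are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Toy, well-separated data (illustrative only)
X = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB().fit(X, y)
pred = clf.predict(X)

via_score = clf.score(X, y)           # built-in mean accuracy
via_metric = accuracy_score(y, pred)  # same quantity from sklearn.metrics
via_manual = (pred == y).mean()       # fraction of matching labels

print(via_score, via_metric, via_manual)
```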

In [None]:
X_test[0:10]

Unnamed: 0,Pclass,Age,female
36,3,18.0,0
308,1,24.0,1
198,2,24.0,1
288,1,26.0,1
522,3,40.5,0
435,1,64.0,0
339,1,24.0,1
564,3,29.0,1
674,3,18.0,1
643,3,19.0,0


In [39]:
y_test[0:10]

42     0
306    1
175    0
743    0
533    1
494    0
132    0
7      0
230    1
167    0
Name: Survived, dtype: int64

In [40]:
model.predict(X_test[0:10]) # predict() takes a batch of rows and returns one label per row; index the result to get a single value


array([0, 1, 0, 0, 1, 0, 1, 0, 1, 1])
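
One possible way to get a single prediction, as the comment above suggests, is to pass a one-row frame and take the first element of the result. The toy data and the names `toy_X`, `toy_y`, and `clf` are assumptions for illustration.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Toy training data with the same three columns as the notebook (made-up values)
toy_X = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 30.0, 40.0],
    "female": [0, 1, 1, 1, 0, 0],
})
toy_y = pd.Series([0, 1, 1, 1, 0, 0], name="Survived")

clf = GaussianNB().fit(toy_X, toy_y)

# Double brackets keep the selection two-dimensional (a one-row DataFrame)
one_row = toy_X.iloc[[1]]

# predict() still returns an array; take its only element
single_pred = clf.predict(one_row)[0]
print(single_pred)
```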

In [None]:
model.predict_proba(X_test[0:10]) # one row per passenger; columns are P(Survived=0) and P(Survived=1)

array([[0.91798555, 0.08201445],
       [0.03768991, 0.96231009],
       [0.11411203, 0.88588797],
       [0.03837647, 0.96162353],
       [0.92995132, 0.07004868],
       [0.63424982, 0.36575018],
       [0.03768991, 0.96231009],
       [0.22949044, 0.77050956],
       [0.2105884 , 0.7894116 ],
       [0.91889439, 0.08110561]])
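
The columns of `predict_proba` follow the order of `classes_`, and the predicted label is the class with the highest probability in each row. A small sketch on made-up data (the names `X`, `y`, and `clf` are illustrative, not from the notebook):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy, well-separated data (illustrative only)
X = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB().fit(X, y)
probs = clf.predict_proba(X)

# Column i of probs belongs to clf.classes_[i]
print(clf.classes_)

# Picking the most probable class per row reproduces predict()
labels_from_probs = clf.classes_[probs.argmax(axis=1)]
print(labels_from_probs)
```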