Naive Bayes makes the naive assumption that the features are independent of each other given the class.
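
To make the independence assumption concrete, here is a minimal sketch of how a Gaussian Naive Bayes score can be put together by hand: multiply the class prior by one Gaussian likelihood per feature, then normalise. The priors, means and variances below are hypothetical numbers chosen only for illustration; they are not estimated from titanic.csv.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical class priors and per-class (mean, variance) for each feature.
priors = {0: 0.6, 1: 0.4}
stats = {
    0: {"age": (30.0, 200.0), "fare": (22.0, 900.0)},
    1: {"age": (28.0, 220.0), "fare": (48.0, 4000.0)},
}

sample = {"age": 38.0, "fare": 71.28}

# Independence assumption: the joint likelihood is the product of the
# per-feature likelihoods, computed separately for each class.
scores = {}
for cls, prior in priors.items():
    likelihood = 1.0
    for feature, value in sample.items():
        mean, var = stats[cls][feature]
        likelihood *= gaussian_pdf(value, mean, var)
    scores[cls] = prior * likelihood

# Normalise so the two class scores sum to 1.
total = sum(scores.values())
posterior = {cls: score / total for cls, score in scores.items()}
print(posterior)
```

In practice the products are usually computed as sums of logarithms to avoid numerical underflow when there are many features.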

In [78]:
import pandas as pd
df = pd.read_csv("titanic.csv")
df.head()

Unnamed: 0,sex,age,sibsp,parch,fare,embarked,class,who,alone,survived
0,male,22.0,1,0,7.25,S,Third,man,False,0
1,female,38.0,1,0,71.2833,C,First,woman,False,1
2,female,26.0,0,0,7.925,S,Third,woman,True,1
3,female,35.0,1,0,53.1,S,First,woman,False,1
4,male,35.0,0,0,8.05,S,Third,man,True,0


Now, drop the columns that are not needed.

In [79]:
df.drop(['sibsp','parch','embarked','class','who','alone'], axis='columns',inplace=True)
df.head(10)

Unnamed: 0,sex,age,fare,survived
0,male,22.0,7.25,0
1,female,38.0,71.2833,1
2,female,26.0,7.925,1
3,female,35.0,53.1,1
4,male,35.0,8.05,0
5,male,,8.4583,0
6,male,54.0,51.8625,0
7,male,2.0,21.075,0
8,female,27.0,11.1333,1
9,female,14.0,30.0708,1


Store the values of the survived column as the target variable, and keep the remaining columns as the inputs.

In [80]:
target = df.survived
inputs = df.drop('survived',axis = 'columns')

Use get_dummies to create two different columns, one for female and one for male.

In [81]:
dummies = pd.get_dummies(inputs.sex)
dummies.head()

Unnamed: 0,female,male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1
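
The two dummy columns carry the same information, since female is always 1 minus male. As a small sketch on a hand-made Series (not the Titanic data), the drop_first option of get_dummies keeps a single column instead; dtype=int is passed here only so the output is 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy data standing in for the sex column.
sex = pd.Series(["male", "female", "female", "male"], name="sex")

# One column per category, as in the cell above.
both = pd.get_dummies(sex, dtype=int)

# drop_first=True drops the first category (female), leaving only male.
single = pd.get_dummies(sex, drop_first=True, dtype=int)

print(both)
print(single)
```

Keeping both columns does not break Gaussian Naive Bayes, but a single column avoids feeding the same feature to the model twice.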


Now, concatenate those columns to the dataframe.

In [82]:
inputs = pd.concat([inputs,dummies],axis='columns')
inputs.head()

Unnamed: 0,sex,age,fare,female,male
0,male,22.0,7.25,0,1
1,female,38.0,71.2833,1,0
2,female,26.0,7.925,1,0
3,female,35.0,53.1,1,0
4,male,35.0,8.05,0,1


Now, drop the sex column, since the dummy columns replace it.

In [83]:
inputs.drop('sex',axis='columns',inplace=True)
inputs.head(3)

Unnamed: 0,age,fare,female,male
0,22.0,7.25,0,1
1,38.0,71.2833,1,0
2,26.0,7.925,1,0


Use isna() to find out whether any column has NaN values.

In [84]:
inputs.columns[inputs.isna().any()]

Index(['age'], dtype='object')

In [85]:
inputs.age[:10]

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: age, dtype: float64

Compute the mean age and use it to fill the NaN values.

In [86]:
inputs.age = inputs.age.fillna(inputs.age.mean())
inputs.head(6)

Unnamed: 0,age,fare,female,male
0,22.0,7.25,0,1
1,38.0,71.2833,1,0
2,26.0,7.925,1,0
3,35.0,53.1,1,0
4,35.0,8.05,0,1
5,29.699118,8.4583,0,1
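
As a small illustration of the imputation step on a hand-made Series (not the Titanic data), the sketch below fills a missing value with the mean and, as one possible alternative, with the median, which is less sensitive to outliers.

```python
import pandas as pd

# Toy ages with one missing value.
age = pd.Series([22.0, 38.0, None, 35.0], name="age")

# mean() and median() skip NaN by default.
mean_filled = age.fillna(age.mean())
median_filled = age.fillna(age.median())

print(mean_filled.tolist())
print(median_filled.tolist())
```

Note that the notebook computes the mean over all rows before splitting, so the test rows contribute to the fill value. A stricter setup would compute the mean on the training rows only.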


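I have used the train_test_split method so that the model is scored on data it has not seen during training; scoring it on the training data would give a biased estimate.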

In [87]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(inputs,target,test_size=.2)

In [88]:
len(x_train)

712

In [89]:
len(x_test)

179

In [90]:
len(inputs)

891
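
The split above is random, so the rows in each part change on every run. A minimal sketch, using synthetic arrays of the same length as the dataset rather than the real inputs, shows how random_state makes the split repeatable and how stratify keeps the class balance similar in both parts. The value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for inputs and target: 891 rows, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(891, 4))
y = rng.integers(0, 2, size=891)

# random_state fixes the shuffle; stratify=y preserves the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```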

In [91]:
x_train

Unnamed: 0,age,fare,female,male
602,29.699118,42.400,0,1
369,24.000000,69.300,1,0
782,29.000000,30.000,0,1
747,30.000000,13.000,1,0
475,29.699118,52.000,0,1
...,...,...,...,...
217,42.000000,27.000,0,1
95,29.699118,8.050,0,1
407,3.000000,18.750,0,1
682,20.000000,9.225,0,1


Create a Naive Bayes model.

In [92]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

In [93]:
model.fit(x_train, y_train)

GaussianNB()

In [94]:
model.score(x_test, y_test)

0.7318435754189944
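
Because no random_state was given to the split, this single score will vary from run to run. One way to get a steadier estimate is cross-validation; the sketch below uses cross_val_score on synthetic data (not the Titanic inputs), so the numbers it prints say nothing about this dataset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data: class 1 is shifted by +1 on every feature.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 3)) + y[:, None]

# Fit and score the model on 5 different train/test folds.
scores = cross_val_score(GaussianNB(), X, y, cv=5)

print(scores)
print(scores.mean(), scores.std())
```

With the notebook's own data, the same call would take inputs and target in place of X and y.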

In [95]:
x_test[:10]

Unnamed: 0,age,fare,female,male
337,41.0,134.5,1,0
852,9.0,15.2458,1,0
566,19.0,7.8958,0,1
542,11.0,31.275,1,0
206,32.0,15.85,0,1
131,20.0,7.05,0,1
700,18.0,227.525,1,0
207,26.0,18.7875,0,1
655,24.0,73.5,0,1
350,23.0,9.225,0,1


In [96]:
y_test[:10]

337    1
852    0
566    0
542    0
206    0
131    0
700    1
207    1
655    0
350    0
Name: survived, dtype: int64

In [97]:
model.predict(x_test[:10])

array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=int64)

In [98]:
model.predict_proba(x_test[:10])

array([[8.70400350e-06, 9.99991296e-01],
       [1.79350136e-02, 9.82064986e-01],
       [9.81962392e-01, 1.80376082e-02],
       [1.69855241e-02, 9.83014476e-01],
       [9.86014656e-01, 1.39853444e-02],
       [9.82301053e-01, 1.76989474e-02],
       [1.13349466e-13, 1.00000000e+00],
       [9.84965049e-01, 1.50349515e-02],
       [9.00487235e-01, 9.95127651e-02],
       [9.83814260e-01, 1.61857403e-02]])
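
Each row of predict_proba holds the probability of class 0 followed by the probability of class 1, and predict returns the class with the larger value. In the output above, rows 852, 542 and 207 are predicted confidently but disagree with y_test, which is a reminder that a high probability is not a guarantee. The relationship between the two methods can be checked with a short sketch on synthetic data (not the Titanic inputs):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data: class 1 is shifted by +2 on every feature.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 2)) + 2 * y[:, None]

model = GaussianNB().fit(X, y)

# Columns of proba follow the order of model.classes_.
proba = model.predict_proba(X[:5])
labels = model.classes_[np.argmax(proba, axis=1)]

print(proba.sum(axis=1))   # each row sums to 1
print(labels)
print(model.predict(X[:5]))
```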