# **Titanic with Naive Bayes Classifier from scratch (without using Scikit-learn)**
A fundamental part of learning ML is understanding the theory behind each statistical model you use. Otherwise, Scikit-learn is nothing but a magical black box. The Naive Bayes algorithm is simple and fun to start with. Here it is explained and applied to the Titanic survival prediction problem, using Python.

**Note:** My goal is to show the basics of the algorithm and how I implemented it, so I did not put much effort into feature engineering to obtain the highest score possible.

## Theory: Bayes Theorem
Bayes' theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. Using it, we can find the probability of \\( y \\) happening, given that \\( x \\) has occurred. Here, \\( x \\) is the evidence, \\( P(y) \\) is the prior (what we know about \\( y \\) before seeing \\( x \\)) and \\( P(x|y) \\) is the likelihood. The theorem itself does not assume anything about independence; that assumption is what Naive Bayes adds on top of it, as described further down.

Its equation is as follows:

\\( P(y|x)= \dfrac{P(x|y)P(y)}{P(x)} \\)
where 
* \\( y,x \\) = Events
* \\( P(y|x) \\) = Probability of \\( y \\) given \\( x \\)
* \\( P(x|y) \\) = Probability of \\( x \\) given \\( y \\)
* \\( P(y), P(x) \\) = Marginal probabilities of \\( y \\) and \\( x \\), each considered on its own
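
To make the formula concrete, here is a tiny numeric sketch. All three input numbers are made up for illustration (they do not come from the Titanic data), and \\( P(x) \\) is obtained from the law of total probability over the two cases "\\( y \\)" and "not \\( y \\)".

```python
# Hypothetical numbers, chosen only to illustrate the formula.
p_y = 0.4               # prior P(y)
p_x_given_y = 0.7       # likelihood P(x|y)
p_x_given_not_y = 0.2   # P(x|not y)

# Evidence P(x) via the law of total probability.
p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)

# Posterior P(y|x) from Bayes' theorem.
p_y_given_x = p_x_given_y * p_y / p_x

print(p_x, p_y_given_x)
```

Seeing \\( x \\) raises the probability of \\( y \\) above its prior here, because \\( x \\) is more likely when \\( y \\) holds than when it does not.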

Naive Bayes Methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.

In our case, the variable \\( y \\) is the class variable (Survival 0 or 1), which represents whether a passenger survives or not given the conditions. The variable \\( X \\) represents the parameters/features. \\( X \\) is given as \\( X=(x_{1}, x_{2}, ..., x_{n}) \\) and its components can be mapped to Age, Class, Sex, etc. By substituting \\( X \\) into Bayes' rule and applying the independence assumption to both the numerator and the denominator, we get
        
\\( P(y|x_{1}, x_{2}, ..., x_{n})= \dfrac{P(x_{1}|y)P(x_{2}|y)...P(x_{n}|y)P(y)}{P(x_{1})P(x_{2})...P(x_{n})} \\)

Now you can calculate each probability from the dataset and substitute it into the equation. In our case, the class variable \\( y \\) has two outcomes: 0 or 1. So the probability of each outcome, '*Survival*' or '*No survival*' (1 or 0, respectively), needs to be calculated for each passenger. The outcome with the higher probability is the prediction. That is, if \\( P(y=1|X) > P(y=0|X) \\), the passenger is predicted to survive. The denominator does not depend on \\( y \\), so it is the same for both outcomes and does not affect the comparison.
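
The decision rule can be sketched in a few lines. The two numerators below are hypothetical values for a single passenger, not numbers from the dataset. The sketch normalizes them by their sum so that the two results add up to 1, and shows that dividing both by the same positive number leaves the winner unchanged.

```python
# Hypothetical numerators P(x1|y)...P(xn|y)P(y) for one passenger, per class.
numerators = {0: 0.012, 1: 0.030}

# Normalizing by the sum of the numerators makes the two values add up to 1.
evidence = sum(numerators.values())
posteriors = {label: value / evidence for label, value in numerators.items()}

# The class with the largest value wins, before or after the division.
predicted_from_numerators = max(numerators, key=numerators.get)
predicted_from_posteriors = max(posteriors, key=posteriors.get)

print(posteriors, predicted_from_numerators, predicted_from_posteriors)
```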




In [None]:
import pandas as pd
import seaborn as sns


df_train=pd.read_csv('../input/titanic/train.csv')
df_test=pd.read_csv('../input/titanic/test.csv')

Some feature engineering below (for simplicity I used 3 features only: Class, Sex and Age):

In [None]:
sexos={"male":0, "female":1}
df_train.Sex=[sexos[item] for item in df_train.Sex]
df_test.Sex=[sexos[item] for item in df_test.Sex]

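# Fill missing ages with the mean; assign the result back, since calling
# fillna(..., inplace=True) on a selected column is unreliable in recent pandas
df_train['Age']=df_train['Age'].fillna(df_train['Age'].mean())
df_test['Age']=df_test['Age'].fillna(df_test['Age'].mean())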

df_train.Age=df_train.Age.astype(int)
df_test.Age=df_test.Age.astype(int)

#A wild plot has appeared, just for the heck of it
sns.violinplot(x='Pclass', y='Age', hue='Survived', data=df_train, split=True)

#Ages grouped
data = [df_train, df_test]
for dataset in data:
    dataset.loc[ dataset['Age'] <= 11, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 11) & (dataset['Age'] <= 18), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 18) & (dataset['Age'] <= 22), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 22) & (dataset['Age'] <= 27), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 27) & (dataset['Age'] <= 33), 'Age'] = 4
    dataset.loc[(dataset['Age'] > 33) & (dataset['Age'] <= 40), 'Age'] = 5
    dataset.loc[(dataset['Age'] > 40) & (dataset['Age'] <= 66), 'Age'] = 6
    dataset.loc[ dataset['Age'] > 66, 'Age'] = 7
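
As a side note, the same grouping can probably be written more compactly with `pd.cut`. The sketch below is not what the notebook runs; it uses a small made-up `ages` Series instead of the real `Age` column, and bin edges that mirror the thresholds above (each bin includes its right edge).

```python
import pandas as pd

# Toy ages standing in for the Age column (illustrative values only).
ages = pd.Series([4, 11, 12, 18, 19, 30, 40, 41, 66, 67, 80])

# Right-inclusive bin edges: (-inf, 11], (11, 18], ..., (66, inf).
edges = [-float("inf"), 11, 18, 22, 27, 33, 40, 66, float("inf")]

# labels=False returns the integer bin number (0 to 7) instead of an interval.
age_group = pd.cut(ages, bins=edges, labels=False)

print(age_group.tolist())
```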

## The Machine Learning model construction starts here:

We have 3 features, so the Bayes rule takes the following form,

\\( P(y|x_{1}, x_{2},x_{3})= \dfrac{P(x_{1}|y)P(x_{2}|y)P(x_{3}|y)P(y)}{P(x_{1})P(x_{2})P(x_{3})} \\)

where 

* \\( P(y) \\) = Probability of survival (for 0 and for 1), so it is an array with 2 entries (denoted as p_y in the code)
* \\( P(x_{1}) \\) = Probability of Pclass, an array with 3 entries (denoted as p_Class in the code)
* \\( P(x_{2}) \\) = Probability of gender, an array with 2 entries (denoted as p_Sex in the code)
* \\( P(x_{3}) \\) = Probability of Age, an array with 8 entries, one per age group (denoted as p_Age in the code)

and the conditional probabilities

*  \\( P(x_{1}|y) \\) = Probability of Pclass given survival (0 or 1)
*  \\( P(x_{2}|y) \\) = Probability of gender given survival (0 or 1)
*  \\( P(x_{3}|y) \\) =  Probability of Age given survival (0 or 1)

The probabilities are calculated below:

In [None]:
#probabilities of the features
    
Class_counts=df_train['Pclass'].value_counts()  
p_Class=Class_counts/len(df_train)

Sex_counts=df_train['Sex'].value_counts()
p_Sex=Sex_counts/len(df_train)

Age_counts=df_train['Age'].value_counts()
p_Age=Age_counts/len(df_train)

# Survival and Death probabilities
y_counts=df_train['Survived'].value_counts()
p_y=y_counts/len(df_train)

df_survived=df_train.loc[df_train['Survived'] == 1]
df_died=df_train.loc[df_train['Survived'] == 0]

# Conditional probabilities
#class
class_survived_counts=df_survived['Pclass'].value_counts()  
p_class_survived=class_survived_counts/len(df_survived)

class_died_counts=df_died['Pclass'].value_counts()  
p_class_died=class_died_counts/len(df_died)

#sex
sex_survived_counts=df_survived['Sex'].value_counts()  
p_sex_survived=sex_survived_counts/len(df_survived)

sex_died_counts=df_died['Sex'].value_counts()  
p_sex_died=sex_died_counts/len(df_died)

#Age
age_survived_counts=df_survived['Age'].value_counts()  
p_age_survived=age_survived_counts/len(df_survived)

age_died_counts=df_died['Age'].value_counts()  
p_age_died=age_died_counts/len(df_died)
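
For reference, the same kind of table can be built with `value_counts(normalize=True)`, optionally after a `groupby` on the class column. This is only a sketch on a made-up `toy` DataFrame (the rows are illustrative, not the real training data), shown for one feature.

```python
import pandas as pd

# Toy stand-in for df_train (illustrative rows, not the real data).
toy = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 0, 0, 0, 0],
    "Sex":      [1, 1, 0, 0, 0, 0, 1, 0],
})

# Marginal P(Sex): relative frequency of each value.
p_sex = toy["Sex"].value_counts(normalize=True)

# Conditional P(Sex | Survived): relative frequencies within each class.
# The result is indexed by (Survived, Sex) pairs.
p_sex_given_y = toy.groupby("Survived")["Sex"].value_counts(normalize=True)

print(p_sex.loc[1])               # P(Sex=1)
print(p_sex_given_y.loc[(1, 1)])  # P(Sex=1 | Survived=1)
print(p_sex_given_y.loc[(0, 1)])  # P(Sex=1 | Survived=0)
```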

Bayes function defined below:

In [None]:
def Bayes(py, px1y, px2y, px3y, px1, px2, px3):
    numerator=px1y*px2y*px3y*py
    denominator=px1*px2*px3
    p=numerator/denominator
    return p
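
A quick usage sketch of this function, with hypothetical probabilities for a single passenger (none of these numbers come from the dataset). The function is repeated so that the snippet runs on its own.

```python
def Bayes(py, px1y, px2y, px3y, px1, px2, px3):
    numerator = px1y * px2y * px3y * py
    denominator = px1 * px2 * px3
    return numerator / denominator

# Hypothetical inputs: priors 0.4 / 0.6, three conditional probabilities
# per class, and the three feature marginals shared by both calls.
score_survived = Bayes(0.4, 0.4, 0.7, 0.2, 0.25, 0.37, 0.2)
score_died = Bayes(0.6, 0.15, 0.15, 0.2, 0.25, 0.37, 0.2)

prediction = 1 if score_survived >= score_died else 0

# Dividing by the sum rescales the two scores so that they add up to 1.
p_survived = score_survived / (score_survived + score_died)

print(score_survived, score_died, prediction, p_survived)
```

Note that `score_survived` comes out larger than 1 here. With the denominator written as a product of the individual feature probabilities, the returned values are best read as scores to compare between the two classes rather than as exact probabilities.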

The probabilities of survival for each passenger calculated below:

In [None]:
result_array=[]

for i in range(len(df_test)):  # one prediction per test passenger
    feature_class=df_test.iloc[i]['Pclass']
    feature_sex=df_test.iloc[i]['Sex']
    feature_age=df_test.iloc[i]['Age']
    
    P_Y1=Bayes(p_y[1], p_class_survived[feature_class], p_sex_survived[feature_sex], p_age_survived[feature_age], p_Class[feature_class], p_Sex[feature_sex], p_Age[feature_age])
    P_Y0=Bayes(p_y[0], p_class_died[feature_class], p_sex_died[feature_sex], p_age_died[feature_age], p_Class[feature_class], p_Sex[feature_sex], p_Age[feature_age])
    
    if P_Y0 > P_Y1:
        result=0
    else:
        result=1
        
    result_array.append(result)


output = pd.DataFrame({'PassengerId': df_test.PassengerId,'Survived': result_array})
output.to_csv('submission.csv', index=False)
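
The per-passenger loop could also be expressed with `Series.map`, which looks up each feature value in a probability table for the whole column at once. The sketch below is an alternative, not the code used for the submission: it uses one feature only, made-up tables and a three-row `toy_test` frame, and it compares numerators only, since the denominator is shared by both classes.

```python
import pandas as pd

# Toy prior and conditional tables (illustrative numbers only).
p_y = pd.Series({0: 0.6, 1: 0.4})
p_sex_survived = pd.Series({0: 0.3, 1: 0.7})
p_sex_died = pd.Series({0: 0.85, 1: 0.15})

toy_test = pd.DataFrame({"PassengerId": [1, 2, 3], "Sex": [0, 1, 0]})

# Look up P(Sex | y) for every row, then multiply by the prior P(y).
score_1 = toy_test["Sex"].map(p_sex_survived) * p_y.loc[1]
score_0 = toy_test["Sex"].map(p_sex_died) * p_y.loc[0]

# Same tie rule as the loop: predict 0 only when the score for 0 is larger.
toy_test["Survived"] = (score_1 >= score_0).astype(int)

print(toy_test)
```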

## Submission:

This predictor scored **0.77033**, which is not bad for a model built **without the Scikit-learn library!**