## Useful tricks vol. 3 - Naive Bayes
### Properties
- generally faster than linear models, but often slightly less accurate
- `GaussianNB`: continuous data
- `BernoulliNB`: binary data
- `MultinomialNB`: count data
- assumes conditional independence among features given the class (rarely true in practice, but the model often still works as a good approximation when the assumption is not fulfilled)
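
To make the independence assumption concrete, the sketch below recomputes the posterior of a fitted `GaussianNB` by hand: the class prior multiplied by one univariate Gaussian density per feature. The synthetic data and the helper `manual_posterior` are hypothetical, and the code assumes a scikit-learn release that exposes `var_` (older releases name it `sigma_`).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical two-class data with two continuous features
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),
    rng.normal(2.0, 1.0, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

gnb = GaussianNB().fit(X, y)
# Per-class, per-feature variances (attribute name depends on the version)
var = gnb.var_ if hasattr(gnb, "var_") else gnb.sigma_


def manual_posterior(x):
    """Posterior from log P(y) + sum_i log N(x_i | mean_yi, var_yi)."""
    log_prior = np.log(gnb.class_prior_)
    # Independence: the joint likelihood is a sum of per-feature log densities
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * var) + (x - gnb.theta_) ** 2 / var, axis=1
    )
    joint = log_prior + log_lik
    joint -= joint.max()  # stabilise before exponentiating
    post = np.exp(joint)
    return post / post.sum()


x_new = np.array([1.0, 1.0])
manual = manual_posterior(x_new)
from_sklearn = gnb.predict_proba(x_new.reshape(1, -1))[0]
print(manual, from_sklearn)
```

Both arrays should agree, which shows that the model never looks at interactions between features.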

### Advantages
- fast
- works well with smaller datasets
- works well with a lot of features

### Disadvantages
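- simple model; can perform poorly when many feature values have zero counts for a class in the training data (the zero-frequency problem, usually mitigated by smoothing)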

### When to use
Only for classification. Even faster than linear models, good for very large datasets and high-dimensional data. Often less accurate than linear models.
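
As a rough illustration of matching the variant to the data type, the sketch below fits each of the three classifiers on synthetic data of the kind it expects. The data generation (`X_cont`, `X_counts`, `X_bin`) is entirely hypothetical, and the training accuracies are only a sanity check, not a benchmark.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

# Continuous features whose mean depends on the class
X_cont = rng.normal(loc=2.0 * y[:, None], scale=1.0, size=(200, 3))
# Non-negative counts whose rate depends on the class
X_counts = rng.poisson(lam=1.0 + 3.0 * y[:, None], size=(200, 3))
# Binary indicators derived from the counts
X_bin = (X_counts > 2).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_cont),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_bin),
}

# Fit each model on its matching data and record the training accuracy
scores = {name: model.fit(X, y).score(X, y) for name, (model, X) in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```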

## 1. Fit a small weather dataset

In [92]:
import numpy as np
import mglearn
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB

Let's create a simple dataset

In [93]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny',
'Rainy','Sunny','Overcast','Overcast','Rainy']
temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild']

play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']

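# LabelEncoder assigns integer codes in sorted (alphabetical) order of the labels,
# e.g. Overcast=0, Rainy=1, Sunny=2. Each fit_transform call refits `le`.
weather_enc=le.fit_transform(weather)
temp_enc=le.fit_transform(temp)
label=le.fit_transform(play)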
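
Because a single encoder is refitted for every column, it only remembers the last mapping it learned. A minimal sketch, assuming one encoder per column (the names `weather_le` and `temp_le` are hypothetical), shows how to inspect and reverse the mapping through `classes_` and `inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

weather = ['Sunny', 'Overcast', 'Rainy', 'Sunny']
temp = ['Hot', 'Mild', 'Cool', 'Hot']

# One encoder per column keeps every mapping available for later decoding
weather_le = LabelEncoder().fit(weather)
temp_le = LabelEncoder().fit(temp)

# classes_ is sorted, and the position of a label is its integer code
weather_codes = {str(c): i for i, c in enumerate(weather_le.classes_)}
temp_codes = {str(c): i for i, c in enumerate(temp_le.classes_)}
print(weather_codes)
print(temp_codes)

# Integer codes can be mapped back to the original labels
decoded = temp_le.inverse_transform([0, 2])
print(list(decoded))
```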

In [94]:
features = np.array((temp_enc, weather_enc))
print(features.shape)
label.shape

(2, 14)


(14,)

In [95]:
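m = GaussianNB()
# features has shape (2, 14); transpose it to (n_samples, n_features)
m.fit(features.T,label)
# Columns are (temp, weather): 0 = Cool, 2 = Sunny
pred = m.predict([[0,2]])
pred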

array([0], dtype=int64)
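
The raw output `0` is an encoded class. The following sketch repeats the fit with one encoder per column (hypothetical names `temp_le`, `weather_le`, `play_le`) so that the query can be written with labels and the prediction decoded back into `Yes`/`No`:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
temp = ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
        'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
        'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

temp_le, weather_le, play_le = LabelEncoder(), LabelEncoder(), LabelEncoder()
# Same column order as above: (temp, weather)
X = np.column_stack([temp_le.fit_transform(temp), weather_le.fit_transform(weather)])
y = play_le.fit_transform(play)

model = GaussianNB().fit(X, y)

# Build the query from labels instead of hand-written integer codes
query = np.array([[temp_le.transform(['Cool'])[0],
                   weather_le.transform(['Sunny'])[0]]])
pred_label = play_le.inverse_transform(model.predict(query))[0]
proba = model.predict_proba(query)[0]
print(pred_label, dict(zip(play_le.classes_, proba.round(3))))
```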

#### Let's try to use it with the Titanic dataset

In [96]:
import pandas as pd
df = pd.read_csv('data/titanic/train.csv')

In [97]:
df["Sex_le"] = le.fit_transform(df["Sex"])
# Embarked cannot be encoded yet because it contains NaN values

In [98]:
df["Embarked"].value_counts(dropna=False)

S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64

Let's handle the `NaN` values by assigning the most frequent value instead
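
A boolean mask on its own selects whole rows, so the column has to be named explicitly when filling missing values. A minimal sketch on a hypothetical four-row frame (`toy` is not part of the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with one missing Embarked value
toy = pd.DataFrame({
    "Embarked": ["S", np.nan, "C", "S"],
    "Fare": [7.25, 80.0, 71.28, 8.05],
})

# mode() ignores NaN; [0] picks the most frequent value
most_frequent = toy["Embarked"].mode()[0]

filled = toy.copy()
# Only the Embarked column is touched, Fare stays as it was
filled["Embarked"] = filled["Embarked"].fillna(most_frequent)
print(filled)
```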

In [99]:
# Fill only the Embarked column; indexing with the mask alone would overwrite entire rows
most_frequent = df["Embarked"].value_counts().idxmax()
df["Embarked"] = df["Embarked"].fillna(most_frequent)

In [100]:
df["Embarked_le"] = le.fit_transform(df["Embarked"])

In [101]:
features_to_consider = [
    "Survived",
    "Pclass",
    "Sex_le",
    "Age",
    "SibSp",
    "Parch",
    "Fare",
    "Embarked_le"
]
df = df[features_to_consider].dropna(axis=0, how='any')
target = "Survived"
# Exclude the target by name; set(target) would split the string into characters
features = [col for col in features_to_consider if col != target]
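
Separating the target from the feature list deserves a quick check, because `set("Survived")` is a set of single characters rather than a set holding the column name. The sketch below uses a shortened, hypothetical column list to show the difference:

```python
columns = ["Survived", "Pclass", "Age"]
target = "Survived"

# set() iterates over the string, so this holds characters such as 'S' and 'u'
as_chars = set(target)

# Subtracting characters removes nothing: the target stays among the features
leaky = list(set(columns) - as_chars)

# Comparing by name removes the target and keeps the column order
features = [col for col in columns if col != target]

print(sorted(leaky))
print(features)
```

If the target stays among the features, the classifier can simply read off the answer, and the reported accuracy is meaningless.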

In [103]:
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(df, test_size=0.5)

gnb = GaussianNB()
gnb.fit(
    X_train[features],
    X_train[target]
)
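y_pred = gnb.predict(X_test[features])
# Accuracy is the fraction of correct predictions
accuracy = (X_test[target].values == y_pred).mean()
print(f"Accuracy: {accuracy:.3f}")

The exact value depends on the random split. A perfect score at this point would signal that the target is still among the features.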
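
Accuracy is the share of matching predictions, not one minus the number of errors. A small sketch with hypothetical label arrays compares the manual computation with `sklearn.metrics.accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: two of the eight predictions are wrong
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Fraction of correct predictions
manual = (y_true == y_pred).mean()
library = accuracy_score(y_true, y_pred)

# Subtracting the error count from 1 is not a fraction and can go negative
one_minus_count = 1 - (y_true != y_pred).sum()

print(manual, library, one_minus_count)
```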
