# Naive Bayes Classifier 

This notebook covers the basics of the Naive Bayes classifier and implements one with scikit-learn.

# 1. Introduction to the Naive Bayes Classifier

The Naive Bayes classifier uses Bayes' theorem to estimate, for each class, the probability that a given record or data point belongs to that class.

It can be used for:

* Text classification
* Sentiment analysis
* Spam filtering
* Recommender systems

### What is Naive? Why is it naive? 

It is naive because it ignores all dependencies between features: it assumes that the features are conditionally independent given the class, so one feature does not affect another.
Let me explain with an example. For a spam classifier and a single word, Bayes' theorem gives the equation below.

$$P(Spam \, | \, Word) = \frac{P(Word \, | \, Spam) \, P(Spam)} {P(Word)}$$ 

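For a sample sentence such as "we are good", the naive assumption lets the likelihood factor into one term per word, while the prior $P(Spam)$ and the evidence each appear only once:

$$P(Spam \, | \, We, are, Good) = \frac{P(We \, | \, Spam) \times P(are \, | \, Spam) \times P(Good \, | \, Spam) \times P(Spam)} {P(We, are, Good)}$$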
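
As a rough sketch of this calculation, the snippet below multiplies a prior by per-word likelihoods for two classes and normalises the result. All probabilities are made up for illustration, and the `ham` (non-spam) class and the `joint_score` helper are hypothetical names that do not appear elsewhere in this notebook.

```python
# Toy, made-up probabilities purely for illustration (not estimated from any data).
p_spam, p_ham = 0.4, 0.6
p_word_given_spam = {"we": 0.05, "are": 0.04, "good": 0.01}
p_word_given_ham = {"we": 0.06, "are": 0.05, "good": 0.03}


def joint_score(words, prior, likelihoods):
    """Prior times the product of per-word likelihoods (the naive assumption)."""
    score = prior
    for word in words:
        score *= likelihoods[word]
    return score


sentence = ["we", "are", "good"]
spam_score = joint_score(sentence, p_spam, p_word_given_spam)
ham_score = joint_score(sentence, p_ham, p_word_given_ham)

# Dividing by the sum of both scores plays the role of the denominator P(we, are, good).
p_spam_given_sentence = spam_score / (spam_score + ham_score)
print(round(p_spam_given_sentence, 4))
```

Because the denominator is the same for every class, it can be skipped when only the most likely class is needed; comparing `spam_score` with `ham_score` is enough.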

# 2. Notebook Imports

In [None]:
import pandas as pd
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt

import category_encoders as ce

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

import warnings

warnings.filterwarnings('ignore')

%matplotlib inline

# 3. Import Dataset

In [None]:
data = pd.read_csv('../input/adult-dataset/adult.csv')

In [None]:
data.head() 

In [None]:
# The fifth column of the Adult dataset is the numeric education level (education_num)
data.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
             'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

data.tail() 

In [None]:
# Encode the target as 1 for ' >50K' and 0 otherwise (the raw values carry a leading space)
data['income'] = (data['income'] == ' >50K').astype(int)

In [None]:
data.tail()

In [None]:
data.info() 

This dataset has both numeric and categorical features. The numeric features can be used as they are, but the categorical variables need to be analyzed and encoded.

To do this, let's first collect them.

## Explore Categorical Variables

In [None]:
categorical_names = []
for feature in data.columns: 
    if data[feature].dtype == object: 
        categorical_names.append(feature)
categorical_names

In [None]:
data[categorical_names].head() 

In [None]:
data[categorical_names].isnull().any()

In [None]:
data[categorical_names].isna().any()

In [None]:
for feature in data[categorical_names].columns:
    print('FEATURE NAME:', feature)
    print(data[feature].value_counts())
    

As you can see above, some of the categorical columns contain missing values. Pandas methods such as isna() or isnull() cannot detect them because they are coded as the string ' ?' rather than as NaN.

We are going to replace them with NaN values and then visualize them.

In [None]:
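# Replace the ' ?' placeholder with NaN across the whole frame.
# Reassigning avoids in-place changes on a column selection, which newer pandas versions may not propagate.
data = data.replace(' ?', np.nan)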
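
The same idea can be tried on a small frame. This is a minimal sketch: the `toy` DataFrame and its values are hypothetical, and only the ' ?' placeholder mirrors the one found in the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the ' ?' placeholder mirrors the one described above.
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" Sales", " ?", " ?"],
})

# The placeholder is an ordinary string, so nothing is reported as missing.
missing_before = toy.isnull().sum()

# DataFrame.replace returns a new frame with the placeholder turned into NaN.
toy_clean = toy.replace(" ?", np.nan)
missing_after = toy_clean.isnull().sum()

print(missing_before.to_dict())
print(missing_after.to_dict())
```

Before the replacement every count is zero; afterwards the counts match the number of placeholders in each column.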

In [None]:
# Sanity check: no ' ?' placeholders should remain, so this should return an empty DataFrame

data[data.occupation == ' ?']

In [None]:
data.native_country.value_counts()

In [None]:
data[categorical_names].isnull().any() 

Let's see it with a plot. 

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(data[categorical_names].isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.show() 

## Explore Numerical Variables

In [None]:
numerical_features = [var for var in data.columns if data[var].dtype!='O']

data[numerical_features].head() 

In [None]:
data[numerical_features].isnull().any()

In [None]:
data[categorical_names].isnull().mean()

### Impute missing categorical variables with most frequent value

In [None]:
# The mode of a set of values is the value that appears most often. It can be multiple values.
data.workclass.mode()

In [None]:
data.workclass.value_counts()

In [None]:
# Names of the columns that contain at least one missing value
na_colls = data.columns[data.isnull().any()]
na_colls

In [None]:
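for i in na_colls:
    # mode() returns a Series, so [0] takes the first (most frequent) value;
    # assigning back avoids in-place changes on a column selection
    data[i] = data[i].fillna(data[i].mode()[0])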

data.isnull().any() 
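
Mode imputation is easy to check on a tiny example. The sketch below uses a hypothetical `toy` frame, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
toy = pd.DataFrame({"workclass": ["Private", np.nan, "Private", "State-gov", np.nan]})

# mode() returns a Series because ties are possible; [0] picks the first mode.
most_frequent = toy["workclass"].mode()[0]

# Assigning the result back to the column keeps the change explicit.
toy["workclass"] = toy["workclass"].fillna(most_frequent)
print(toy["workclass"].tolist())
```

One thing to keep in mind is that this kind of imputation makes the most frequent category even more dominant, which matters more when a large share of the values is missing.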

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(data[categorical_names].isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.show() 

# Encoding Categorical Variables

In [None]:
data[categorical_names].head() 

We are going to use one-hot encoding. It is the most widespread approach, and it works well unless a categorical variable takes on a large number of values.
One-hot encoding creates new binary columns, each indicating the presence of one possible value from the original data.

* In our dataset, only one column is a problem for this technique: NATIVE_COUNTRY has 41 different categories, which would add many sparse columns, so it is handled separately.
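
To make the idea concrete, here is a minimal sketch on a hypothetical two-column frame. It uses `pandas.get_dummies` rather than the `category_encoders` encoder used in this notebook, simply because it needs no extra dependency; the resulting column names differ between the two.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column.
toy = pd.DataFrame({"sex": ["Male", "Female", "Female"], "age": [39, 50, 38]})

# Each distinct value of 'sex' becomes its own 0/1 indicator column.
encoded = pd.get_dummies(toy, columns=["sex"], dtype=int)
print(encoded.columns.tolist())
print(encoded.to_dict("list"))
```

A column with 41 categories would produce 41 such indicator columns, which is why a different encoding is used for it.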

In [None]:
for i in categorical_names: 
    print(str.upper(i), data[i].value_counts().shape[0]) 

In [None]:
# Build a new list so that categorical_names itself is left unchanged
categorical_names_withoutone = [name for name in categorical_names if name != 'native_country']
encoder = ce.OneHotEncoder(cols=categorical_names_withoutone)

data_encoded = encoder.fit_transform(data)

data_encoded.head()

For native_country, I am going to use mean encoding (also called target encoding).
Mean encoding replaces each category with the mean of the target variable for that category, which here is the share of rows with income above 50K.
Let's see this.

In [None]:
print('native_country has got', data.native_country.value_counts().shape[0], 'categories.')

In [None]:
mean_encoded_nativeCont = data_encoded.groupby(['native_country'])['income'].mean().to_dict() 
data_encoded.native_country = data_encoded.native_country.map(mean_encoded_nativeCont)

In [None]:
data_encoded.native_country
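
Because mean encoding is computed from the target, one common precaution is to learn the mapping from the training rows only and then apply it to the test rows, so that test labels do not leak into a feature. The sketch below shows that variant on hypothetical toy frames; the `train` and `test` frames, the `native_country_enc` column, and the fallback to the global mean are all assumptions for illustration, not what the cells above do.

```python
import pandas as pd

# Hypothetical toy data: 'income' is the 0/1 target.
train = pd.DataFrame({
    "native_country": ["US", "US", "Mexico", "Mexico", "India"],
    "income": [1, 0, 0, 0, 1],
})
test = pd.DataFrame({"native_country": ["US", "India", "Peru"]})

# Learn the per-category target mean from the training rows only.
country_means = train.groupby("native_country")["income"].mean()
global_mean = train["income"].mean()

# Categories never seen in training (here 'Peru') fall back to the global mean.
test["native_country_enc"] = test["native_country"].map(country_means).fillna(global_mean)
print(test["native_country_enc"].tolist())
```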

# Creating the Model

In [None]:
target = data_encoded.income 
features = data_encoded.drop('income', axis=1) 

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.33)

gnb = GaussianNB() 
gnb.fit(X_train, y_train)
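
To give some intuition for what a Gaussian Naive Bayes model does, the sketch below fits one by hand with NumPy on synthetic data: per class it estimates a prior plus a mean and variance for each feature, then scores a sample by the log prior plus the sum of per-feature Gaussian log likelihoods. This is an illustration under simplifying assumptions, not scikit-learn's implementation (for instance it adds no variance smoothing), and `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic two-class data with two features; class means are far apart on purpose.
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X1 = rng.normal(loc=[6.0, 6.0], scale=1.0, size=(50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)


def fit_gaussian_nb(X, y):
    """Estimate the prior, per-feature mean and variance for every class."""
    params = {}
    for label in np.unique(y):
        rows = X[y == label]
        params[int(label)] = (len(rows) / len(X), rows.mean(axis=0), rows.var(axis=0))
    return params


def predict_gaussian_nb(params, X):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    labels = sorted(params)
    scores = []
    for label in labels:
        prior, mean, var = params[label]
        # Independent features: the joint log likelihood is a sum over features.
        log_likelihood = -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var)
        scores.append(np.log(prior) + log_likelihood.sum(axis=1))
    return np.array(labels)[np.argmax(np.vstack(scores), axis=0)]


params = fit_gaussian_nb(X, y)
predictions = predict_gaussian_nb(params, X)
print((predictions == y).mean())
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause.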

In [None]:
prediction = gnb.predict(X_test)

prediction

# Metrics and Evaluation

## Accuracy

In [None]:
correct = (y_test == prediction).sum() 
print('classified correctly', correct) 
wrong = X_test.shape[0] - correct 
print('classified incorrectly', wrong)

In [None]:
print('The Accuracy is', correct / X_test.shape[0])
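
Accuracy is easier to interpret next to a baseline, especially when one class is much more common than the other. The sketch below uses hypothetical labels and predictions (not the notebook's results) to compare accuracy with the score of always predicting the majority class, and to count the four confusion-matrix cells.

```python
import numpy as np

# Hypothetical labels and predictions, only to show the bookkeeping.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 0, 0, 1, 1])

accuracy = (y_true == y_pred).mean()

# Accuracy of always predicting the most common class.
majority_baseline = np.bincount(y_true).max() / len(y_true)

# Confusion counts, treating class 1 as the positive class.
tp = int(((y_true == 1) & (y_pred == 1)).sum())
fp = int(((y_true == 0) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())
tn = int(((y_true == 0) & (y_pred == 0)).sum())

print(accuracy, majority_baseline)
print(tp, fp, fn, tn)
```

If the model's accuracy is only slightly above the majority baseline, the confusion counts usually show which class it is getting wrong.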

In [None]:
prediction_train = gnb.predict(X_train)
correct_train = (y_train == prediction_train).sum()
print('classified correctly in train set', correct_train) 
wrong_train = X_train.shape[0] - correct_train
print('classified incorrectly in train set', wrong_train)

### Check for overfitting and underfitting

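The training-set accuracy is 0.7957827, while the test-set accuracy is 0.79311. The two scores are close, so there is no sign of overfitting. The exact values change from run to run because the train/test split is not seeded.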

In [None]:
print('The accuracy for train set is', correct_train / X_train.shape[0])

## Visualising the Results

In [None]:
# chart styling info 

yaxis_label = '>50K'
xaxis_label = '<=50K'

In [None]:
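# predict_proba returns plain probabilities (not log-probabilities), one column per class
probabilities = gnb.predict_proba(X_test)
prob0 = probabilities[:,0]  # P(income <= 50K), class 0
prob1 = probabilities[:,1]  # P(income > 50K), class 1

# Pair each probability with the matching label: class 0 is '<=50K', class 1 is '>50K'
summary_df = pd.DataFrame({xaxis_label: prob0, yaxis_label: prob1, 'labels':y_test})
summary_df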

In [None]:
sns.lmplot(x=xaxis_label, y=yaxis_label, data=summary_df, height=6.5, fit_reg=False, legend=False,
          scatter_kws={'alpha': 0.5, 's': 25}, hue='labels', markers=['o', 'x'], palette='hls')



plt.show()