# Naive Bayes Classifiers
In machine learning, Naïve Bayes classification is a straightforward and powerful algorithm for the classification task. Naïve Bayes classification is based on applying Bayes’ theorem with strong independence assumption between the features. Naïve Bayes classification produces good results when we use it for textual data analysis such as Natural Language Processing.

Today's dataset contains information about adults with the target value of income. 

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('adult.csv', header=None)

In [3]:
col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
             'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

df.columns = col_names

In [4]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


## Exercise 1
Display the information about the dataset. 

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


## Exercise 2 
How many categorical values do you have in the dataset? 


In [6]:
intCol = df.count(numeric_only=True).count()
# Since there are only Int64 and Object Dtypes in the Dataset, I minused the total columns with count of int columns
len(df.columns) - intCol

9

## Exercise 3
How many sample of each label do you have in each column? (use for loop)

In [7]:
# categ_col_names = ['workclass', 'education', 'marital_status', 
#                    'occupation', 'relationship', 'race', 'sex', 'native_country', 'income']
for i in df.columns:
    print(df[i].value_counts())
    print('\n')

36    898
31    888
34    886
23    877
35    876
     ... 
83      6
85      3
88      3
87      1
86      1
Name: age, Length: 73, dtype: int64


 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64


164190    13
203488    13
123011    13
113364    12
121124    12
          ..
284211     1
312881     1
177711     1
179758     1
229376     1
Name: fnlwgt, Length: 21648, dtype: int64


 HS-grad         10501
 Some-college     7291
 Bachelors        5355
 Masters          1723
 Assoc-voc        1382
 11th             1175
 Assoc-acdm       1067
 10th              933
 7th-8th           646
 Prof-school       576
 9th               514
 12th              433
 Doctorate         413
 5th-6th           333
 1st-4th           168
 Preschool          51
Name: education, dtype

## Exercise 4
Define your x, y and do a 75-25 training-testing split

In [8]:
x = df.drop(columns=['income']) # All columns besides the first one
y = df.drop(df.loc[:, 'age':'native_country'].columns, axis = 1) # Only the first column

# Encode before splitting
import category_encoders as ce
ce_one_hot = ce.OneHotEncoder()
x = ce.OneHotEncoder().fit_transform(x, y)
y = ce.OrdinalEncoder().fit_transform(y, y)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 1)
print('x test:  ', X_test.count()[1], ' | y test:  ', y_test.count()[0])
print('x train: ', X_train.count()[1], '| y train: ', y_train.count()[0])

  elif pd.api.types.is_categorical(cols):


x test:   8141  | y test:   8141
x train:  24420 | y train:  24420


## Exercise 5 
Use category encoder to encode each categorical column. 


In [9]:
import category_encoders as ce

In [10]:
# I do this during the spliting phase, up in Exercise 4

## Exercise 6 
Create and train your model.

In [11]:
# import it if you haven't
from sklearn.naive_bayes import GaussianNB

# instantiate the model
gnb = GaussianNB()

In [12]:
gnb.fit(X_train, np.ravel(y_train))

GaussianNB()

## Exercise 7 
Display the accuracy (testing) and verify there is no overfitting. 
Additionally compare your results with a null classifer. 
A null classifier is when you always predict the most frequent target. 

In [13]:
y_pred = gnb.predict(X_test)
print("Test Accuracy:", gnb.score(y_test, y_pred))

Test Accuracy: 0.8927650165827294


In [14]:
y_pred = gnb.predict(X_train)
print("Training Accuracy:", gnb.score(y_train, y_pred))

Training Accuracy: 0.8828828828828829


In [15]:
# The accuracy isnt that high so there isnt any overfitting.

In [16]:
# Null classifier, again the percenage isnt that high when compared to the other accuracy so there is no overfitting
y_test.value_counts().head(1) / len(y_test)

income
1         0.774229
dtype: float64

## Homework: Improve the accuracy
use anything you know to improve the performance 

In [17]:
df2 = df.copy()

In [18]:
# Preprocessing
df2['education'] = df2['education'].replace([' 11th', ' 10th', ' 9th', ' 7th-8th', ' 12th', 
                                             ' 5th-6th', ' 1st-4th', ' Preschool', ' Prof-school'], ' Didn\'t pass School')
df2['education'] = df2['education'].replace([' Assoc-voc', ' Assoc-acdm'], ' Assoc')
df2['marital_status'] = df2['marital_status'].replace([' Married-civ-spouse', ' Married-spouse-absent', ' Married-AF-spouse'], ' Married')
#df2['education'].value_counts()

In [19]:
df2['income'].value_counts()

 <=50K    24720
 >50K      7841
Name: income, dtype: int64

In [20]:
# Fixing undersampling
df2_1 = df2.loc[df2['income'] == ' <=50K'][:7800]
df2_2 = df2.loc[df2['income'] == ' >50K'][:7800]

df2 = df2_1.append(df2_2).reset_index(drop=True)
df2['income'].value_counts()

 <=50K    7800
 >50K     7800
Name: income, dtype: int64

In [21]:
x = df2.drop(columns=['income']) # All columns besides the first one
y = df2.drop(df2.loc[:, 'age':'native_country'].columns, axis = 1) # Only the first column

# Encode before splitting
import category_encoders as ce
ce_one_hot = ce.OneHotEncoder()
x = ce.OneHotEncoder().fit_transform(x, y)
y = ce.OrdinalEncoder().fit_transform(y, y)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 1)
print('x test:  ', X_test.count()[1], ' | y test:  ', y_test.count()[0])
print('x train: ', X_train.count()[1], '| y train: ', y_train.count()[0])

  elif pd.api.types.is_categorical(cols):


x test:   3900  | y test:   3900
x train:  11700 | y train:  11700


In [22]:
gnb = GaussianNB()
gnb.fit(X_train, np.ravel(y_train))

GaussianNB()

In [23]:
y_pred = gnb.predict(X_test)
print("Test Accuracy:", gnb.score(y_test, y_pred))

Test Accuracy: 0.8192307692307692


In [24]:
y_pred = gnb.predict(X_train)
print("Training Accuracy:", gnb.score(y_train, y_pred))

Training Accuracy: 0.8202564102564103


In [25]:
y_test.value_counts().head(1) / len(y_test)

income
1         0.500513
dtype: float64

#### another method
#### **Decision Tree Doesn't help improve the accuracy**

In [26]:
from sklearn.tree import DecisionTreeClassifier

In [27]:
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)

DecisionTreeClassifier()

In [28]:
y_pred = dtree.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))

Test Accuracy: 0.7612820512820513


In [29]:
y_pred = dtree.predict(X_train)
print("Training Accuracy:", accuracy_score(y_train, y_pred))

Training Accuracy: 0.9999145299145299
