# Naive Bayes Classifier - Salary data 

Data Description:

|Feature|Description|
|-------|-----------|
|age|Age of an individual|
|workclass|Work class (grouping of employment type) of an individual|
|education|Education of an individual|
|maritalstatus|Marital status of an individual|
|occupation|Occupation of an individual|
|relationship|Relationship status of an individual (e.g. Husband, Wife, Not-in-family)|
|race|Race of an individual|
|sex|Gender of an individual|
|capitalgain|Profit received from the sale of an investment|
|capitalloss|Decrease in the value of a capital asset|
|hoursperweek|Number of hours worked per week|
|native|Native country of an individual|
|Salary|Salary bracket of an individual|


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('SalaryData_Train.csv')
df.head()

Unnamed: 0,age,workclass,education,educationno,maritalstatus,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native,Salary
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


We performed exploratory data analysis of the Salary dataset while implementing the SVM algorithm.
Let's import the cleaned dataset, apply a Naive Bayes classifier and evaluate its performance.

Additionally, we shall perform a chi-square test to measure how strongly each predictor feature depends on the label, and thereby identify the important features.
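
As a rough illustration of what the chi-square test reports, the sketch below runs `sklearn.feature_selection.chi2` on two made-up binary features: one built to follow the label and one generated independently of it. The toy arrays and names (`y_toy`, `dependent`, `unrelated`) are assumptions for illustration only and are not part of the Salary data.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=1000)          # hypothetical binary label
flip = rng.random(1000) < 0.1                   # flip roughly 10% of the entries
dependent = np.where(flip, 1 - y_toy, y_toy)    # feature that mostly follows the label
unrelated = rng.integers(0, 2, size=1000)       # feature generated independently of the label

X_toy = np.column_stack([dependent, unrelated])

# chi2 returns one statistic and one p-value per column of X_toy
chi2_stats, p_values = chi2(X_toy, y_toy)
print(p_values)
```

A small p-value suggests the feature and the label are dependent, which is the criterion used for feature selection later in this notebook.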

In [4]:
df = pd.read_csv('Salary_train_modified.csv')
df_test = pd.read_csv('Salary_test_modified.csv')
df.head()

Unnamed: 0,age,educationno,sex,hoursperweek,Salary,workclass_2,workclass_3,education_2,education_3,maritalstatus_1,occupation_2,occupation_3,relationship_1,race_1,native_1
0,39,13,1,40,0,1,0,0,1,0,0,0,0,1,1
1,50,13,1,13,0,0,0,0,1,1,0,1,1,1,1
2,38,9,1,40,0,1,0,1,0,0,0,0,0,1,1
3,53,7,1,40,0,1,0,0,0,1,0,0,1,0,1
4,28,13,0,40,0,1,0,0,1,1,0,1,1,0,1


Label: Salary

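Some families of algorithms, such as Linear Discriminant Analysis (LDA) and Naive Bayes, account for the range of each feature on their own (Gaussian Naive Bayes, for instance, estimates a separate mean and variance per feature), and hence scaling does not materially affect these algorithms.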
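
A quick way to check the claim about scaling is to fit Gaussian Naive Bayes on raw and on standardized copies of the same data and compare the predictions. This is a minimal sketch on synthetic data from `make_classification`; the `_demo` names are hypothetical and the Salary data is not used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to illustrate the point
X_demo, y_demo = make_classification(n_samples=500, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)

pred_raw = GaussianNB().fit(X_demo, y_demo).predict(X_demo)
pred_scaled = GaussianNB().fit(X_scaled, y_demo).predict(X_scaled)

# Fraction of samples that receive the same prediction with and without scaling
agreement = np.mean(pred_raw == pred_scaled)
print(agreement)
```

The agreement should be essentially complete. Any residual difference would come from the small `var_smoothing` term that `GaussianNB` adds to the variances.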

The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.
Hence we will apply Naive Bayes directly, without much feature engineering.
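
To make the "one-dimensional estimation" point concrete, the sketch below checks that the per-class means stored by a fitted `GaussianNB` (its `theta_` attribute) are simply the column-wise means within each class, computed one feature at a time. The data is synthetic and the names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=2,
                                     n_redundant=0, random_state=1)
nb = GaussianNB().fit(X_demo, y_demo)

# Per-class, per-feature means computed by hand, each column independently
manual_means = np.vstack([X_demo[y_demo == c].mean(axis=0) for c in nb.classes_])

means_match = np.allclose(nb.theta_, manual_means)
print(means_match)
```

No joint distribution over the features is ever estimated, so the number of parameters grows only linearly with the number of features.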

Even though NB relies on the naive assumption that features are independent of each other given the class, which rarely holds in practice, it still works well.

Although naive Bayes is known as a decent classifier, it is known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.
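
One way to see how far the predicted probabilities can be trusted is to compare them with observed frequencies using `sklearn.calibration.calibration_curve`. The sketch below does this on synthetic data with deliberately redundant (correlated) features, an assumption chosen to violate the independence assumption; it does not use the Salary data.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant features, used only for illustration
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, n_informative=3,
                                     n_redundant=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5, random_state=0)

# Predicted probability of the positive class on held-out data
proba = GaussianNB().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# For each probability bin: observed fraction of positives vs. mean predicted probability
frac_positive, mean_predicted = calibration_curve(y_te, proba, n_bins=5)
for observed, predicted in zip(frac_positive, mean_predicted):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```

For a well-calibrated model the two columns would be close; large gaps indicate that the probabilities are better read as scores than as true likelihoods.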

The different Naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(X|y). We shall use Gaussian NB, which assumes that each feature follows a Normal (Gaussian) distribution within each class.
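
For reference, the model can be written out as follows. Under the independence assumption the posterior factorises over the features $x_1, \dots, x_n$, and Gaussian NB models each per-feature likelihood with a class-specific mean $\mu_{y,i}$ and variance $\sigma_{y,i}^2$ (these symbols are introduced here for illustration):

$$
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}\right)
$$

The predicted class is the value of $y$ that maximises the left-hand expression.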

In [5]:
X = df.drop(columns=['educationno','Salary'])
y = df['Salary']

X_test = df_test.drop(columns=['educationno','Salary'])
y_test = df_test['Salary'] 

In [6]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X,y)

GaussianNB()

In [8]:
from sklearn.metrics import classification_report
print(classification_report(y_test,model.predict(X_test)))

              precision    recall  f1-score   support

           0       0.96      0.59      0.73     11360
           1       0.42      0.92      0.58      3700

    accuracy                           0.67     15060
   macro avg       0.69      0.76      0.66     15060
weighted avg       0.83      0.67      0.70     15060



In [20]:
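df = pd.read_csv('SalaryData_Train.csv')
df.duplicated().sum()  # count of duplicate rows (not displayed, since it is not the last expression)
df.drop_duplicates(inplace=True)
# Split predictors into categorical and numerical parts
df_cat = df.select_dtypes(include='object').drop(columns='Salary')
df_num = df.select_dtypes(exclude='object')
# One-hot encode the categorical predictors, dropping the first level of each
df_cat = pd.get_dummies(df_cat,drop_first=True)
X = pd.concat([df_num,df_cat],axis=1)
y = df.Salary
X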

Unnamed: 0,age,educationno,capitalgain,capitalloss,hoursperweek,workclass_ Local-gov,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,...,native_ Portugal,native_ Puerto-Rico,native_ Scotland,native_ South,native_ Taiwan,native_ Thailand,native_ Trinadad&Tobago,native_ United-States,native_ Vietnam,native_ Yugoslavia
0,39,13,2174,0,40,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
1,50,13,0,0,13,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
2,38,9,0,0,40,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,53,7,0,0,40,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,28,13,0,0,40,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30154,53,14,0,0,40,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
30155,22,10,0,0,40,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
30156,27,12,0,0,38,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
30158,58,9,0,0,40,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [21]:
df = pd.read_csv('SalaryData_Test.csv')
df.duplicated().sum()
df.drop_duplicates(inplace=True)
df_cat = df.select_dtypes(include='object').drop(columns='Salary')
df_num = df.select_dtypes(exclude='object')
df_cat = pd.get_dummies(df_cat,drop_first=True)
X_test = pd.concat([df_num,df_cat],axis=1)
y_test = df.Salary
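
Encoding the train and test files separately, as above, yields matching columns only when both files contain the same categories. If that were not the case, one possible safeguard is to reindex the encoded test frame to the training columns. The sketch below shows the idea on two tiny made-up frames (`train_toy`, `test_toy` are hypothetical).

```python
import pandas as pd

# Toy frames whose categories differ between train and test
train_toy = pd.DataFrame({'native': ['Cuba', 'India', 'Mexico']})
test_toy = pd.DataFrame({'native': ['Cuba', 'Mexico', 'Peru']})

train_enc = pd.get_dummies(train_toy, drop_first=True)
test_enc = pd.get_dummies(test_toy, drop_first=True)

# Give the test matrix exactly the training columns:
# columns missing from test are added as 0, columns unseen in training are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))  # ['native_India', 'native_Mexico']
```

With this alignment the model always receives the feature layout it was trained on.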

In [22]:
model.fit(X,y)
print(classification_report(y_test,model.predict(X_test)))

              precision    recall  f1-score   support

       <=50K       0.90      0.85      0.88     10620
        >50K       0.62      0.73      0.67      3510

    accuracy                           0.82     14130
   macro avg       0.76      0.79      0.77     14130
weighted avg       0.83      0.82      0.82     14130



In [50]:
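from sklearn.feature_selection import chi2

df = pd.read_csv('SalaryData_Train.csv')
df.duplicated().sum()
df.drop_duplicates(inplace=True)
# Use only the categorical predictors, one-hot encoded (chi2 requires non-negative features)
X = df.select_dtypes(include='object').drop(columns='Salary')
X = pd.get_dummies(X,drop_first=True)

y = df.Salary
# chi2 returns a tuple: (chi-square statistics, p-values), one entry per column of X
scores = chi2(X,y)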

In [51]:
# scores[1] holds the p-values; sort them and keep features significant at the 5% level
feat = pd.DataFrame(data=scores[1],index=X.columns,columns=['p-values']).sort_values(by='p-values')
feat = feat[feat['p-values']<0.05]
feat.index

Index(['maritalstatus_ Never-married', 'maritalstatus_ Married-civ-spouse',
       'relationship_ Own-child', 'occupation_ Exec-managerial',
       'relationship_ Not-in-family', 'occupation_ Prof-specialty',
       'education_ Masters', 'occupation_ Other-service',
       'education_ Prof-school', 'education_ Bachelors',
       'relationship_ Unmarried', 'workclass_ Self-emp-inc',
       'education_ Doctorate', 'sex_ Male', 'relationship_ Wife',
       'education_ HS-grad', 'race_ Black', 'occupation_ Handlers-cleaners',
       'relationship_ Other-relative', 'education_ 11th',
       'maritalstatus_ Separated', 'occupation_ Machine-op-inspct',
       'native_ Mexico', 'maritalstatus_ Widowed', 'education_ 7th-8th',
       'workclass_ Private', 'education_ 9th', 'occupation_ Farming-fishing',
       'education_ Some-college', 'education_ 5th-6th',
       'maritalstatus_ Married-spouse-absent', 'education_ 12th',
       'occupation_ Priv-house-serv', 'education_ 1st-4th', 'race_ White'

We will include all continuous numerical features, since there are only a few of them (otherwise we could use an ANOVA test to find the important continuous features).
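
For completeness, the ANOVA alternative mentioned above could look roughly like this, using `sklearn.feature_selection.f_classif`. The two continuous features are made up for illustration (one whose mean shifts with the class, one pure noise) and are not taken from the Salary data.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=500)                     # hypothetical binary label
informative = rng.normal(loc=y_toy * 2.0, scale=1.0)     # class mean shifts by 2
noise = rng.normal(size=500)                             # same distribution in both classes

X_toy = np.column_stack([informative, noise])

# f_classif returns one F-statistic and one p-value per column of X_toy
f_stats, p_values = f_classif(X_toy, y_toy)
print(p_values)
```

As with the chi-square test, features with a p-value below a chosen threshold (0.05 in this notebook) would be kept.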

In [60]:
df = pd.read_csv('SalaryData_Train.csv')
df.duplicated().sum()
df.drop_duplicates(inplace=True)
df_num = df.select_dtypes(exclude='object')
df_cat = df.select_dtypes(include='object').drop(columns='Salary')
df_cat = pd.get_dummies(df_cat,drop_first=True)
df_cat = df_cat[feat.index.to_list()]
X = pd.concat([df_num,df_cat],axis=1)
y = df.Salary

In [61]:
df = pd.read_csv('SalaryData_Test.csv')
df.duplicated().sum()
df.drop_duplicates(inplace=True)
df_num = df.select_dtypes(exclude='object')
df_cat = df.select_dtypes(include='object').drop(columns='Salary')
df_cat = pd.get_dummies(df_cat,drop_first=True)
df_cat = df_cat[feat.index.to_list()]
X_test = pd.concat([df_num,df_cat],axis=1)
y_test = df.Salary

In [62]:
model.fit(X,y)
print(classification_report(y_test,model.predict(X_test)))

# As expected, the scores are unchanged after feature selection: Naive Bayes is relatively robust to the curse of dimensionality

              precision    recall  f1-score   support

       <=50K       0.90      0.85      0.88     10620
        >50K       0.62      0.73      0.67      3510

    accuracy                           0.82     14130
   macro avg       0.76      0.79      0.77     14130
weighted avg       0.83      0.82      0.82     14130



# Thank you!