**Naive Bayes model creations for Salary Data**

In [1]:
import pandas as pd
import numpy as np

# import Naive Bayes libraries
from sklearn.naive_bayes import MultinomialNB as MB
from sklearn.naive_bayes import GaussianNB as GB

## **Salary Train Data Processing and cleanup**

In [2]:
df_sal_train = pd.read_csv('SalaryData_Train.csv')
df_sal_train.head(5)

Unnamed: 0,age,workclass,education,educationno,maritalstatus,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native,Salary
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


### **EDA (Exploratory Data Analysis - Salary Train Data)**

Observation: martialstatus, relationship, race, sex may not impacting salary. Hence, dropping these variables 

In [3]:
df_sal_train = df_sal_train.drop(['maritalstatus', 'relationship', 'race', 'sex'], axis='columns')

In [4]:
df_sal_train.head()

Unnamed: 0,age,workclass,education,educationno,occupation,capitalgain,capitalloss,hoursperweek,native,Salary
0,39,State-gov,Bachelors,13,Adm-clerical,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Exec-managerial,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Handlers-cleaners,0,0,40,United-States,<=50K
3,53,Private,11th,7,Handlers-cleaners,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Prof-specialty,0,0,40,Cuba,<=50K


In [5]:
df_sal_train.shape

(30161, 10)

In [6]:
df_sal_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30161 entries, 0 to 30160
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   age           30161 non-null  int64 
 1   workclass     30161 non-null  object
 2   education     30161 non-null  object
 3   educationno   30161 non-null  int64 
 4   occupation    30161 non-null  object
 5   capitalgain   30161 non-null  int64 
 6   capitalloss   30161 non-null  int64 
 7   hoursperweek  30161 non-null  int64 
 8   native        30161 non-null  object
 9   Salary        30161 non-null  object
dtypes: int64(5), object(5)
memory usage: 2.3+ MB


In [7]:
columns = ['workclass', 'education', 'occupation', 'native', 'Salary']

for x in columns:
  print('Value counts of column:' , x)
  print(df_sal_train[x].value_counts())


Value counts of column: workclass
 Private             22285
 Self-emp-not-inc     2499
 Local-gov            2067
 State-gov            1279
 Self-emp-inc         1074
 Federal-gov           943
 Without-pay            14
Name: workclass, dtype: int64
Value counts of column: education
 HS-grad         9840
 Some-college    6677
 Bachelors       5044
 Masters         1627
 Assoc-voc       1307
 11th            1048
 Assoc-acdm      1008
 10th             820
 7th-8th          557
 Prof-school      542
 9th              455
 12th             377
 Doctorate        375
 5th-6th          288
 1st-4th          151
 Preschool         45
Name: education, dtype: int64
Value counts of column: occupation
 Prof-specialty       4038
 Craft-repair         4030
 Exec-managerial      3992
 Adm-clerical         3721
 Sales                3584
 Other-service        3212
 Machine-op-inspct    1965
 Transport-moving     1572
 Handlers-cleaners    1350
 Farming-fishing       989
 Tech-support          912

In [8]:
# All above five columns need to be changed to categorical
columns = ['workclass', 'education', 'occupation', 'native', 'Salary']

for x in columns:
  df_sal_train[x] = df_sal_train[x].astype('category')

df_sal_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30161 entries, 0 to 30160
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   age           30161 non-null  int64   
 1   workclass     30161 non-null  category
 2   education     30161 non-null  category
 3   educationno   30161 non-null  int64   
 4   occupation    30161 non-null  category
 5   capitalgain   30161 non-null  int64   
 6   capitalloss   30161 non-null  int64   
 7   hoursperweek  30161 non-null  int64   
 8   native        30161 non-null  category
 9   Salary        30161 non-null  category
dtypes: category(5), int64(5)
memory usage: 1.3 MB


In [9]:
# Separate dataset into input and output columns
df_sal_train_x = df_sal_train.drop('Salary', axis=1)
print(df_sal_train_x.head())

print('y data now')
df_sal_train_y = df_sal_train.Salary
print(df_sal_train_y.head())

   age          workclass   education  educationno          occupation  \
0   39          State-gov   Bachelors           13        Adm-clerical   
1   50   Self-emp-not-inc   Bachelors           13     Exec-managerial   
2   38            Private     HS-grad            9   Handlers-cleaners   
3   53            Private        11th            7   Handlers-cleaners   
4   28            Private   Bachelors           13      Prof-specialty   

   capitalgain  capitalloss  hoursperweek          native  
0         2174            0            40   United-States  
1            0            0            13   United-States  
2            0            0            40   United-States  
3            0            0            40   United-States  
4            0            0            40            Cuba  
y data now
0     <=50K
1     <=50K
2     <=50K
3     <=50K
4     <=50K
Name: Salary, dtype: category
Categories (2, object): [' <=50K', ' >50K']


In [10]:
# Change categorical strings to numerical values using get dummies
df_sal_train_x = pd.get_dummies(df_sal_train_x)
df_sal_train_x.head()

Unnamed: 0,age,educationno,capitalgain,capitalloss,hoursperweek,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,...,native_ Portugal,native_ Puerto-Rico,native_ Scotland,native_ South,native_ Taiwan,native_ Thailand,native_ Trinadad&Tobago,native_ United-States,native_ Vietnam,native_ Yugoslavia
0,39,13,2174,0,40,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,13,0,0,13,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
2,38,9,0,0,40,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
3,53,7,0,0,40,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
4,28,13,0,0,40,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


## **Salary Test Data processing and cleanup**

In [11]:
df_sal_test = pd.read_csv('SalaryData_Test.csv')
df_sal_test.head()

Unnamed: 0,age,workclass,education,educationno,maritalstatus,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native,Salary
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,34,Private,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K


### **EDA (Exploratory Data Analysis - Salary Test)**

In [12]:
df_sal_test = df_sal_test.drop(['maritalstatus', 'relationship', 'race', 'sex'], axis='columns')

In [13]:
df_sal_test.head()

Unnamed: 0,age,workclass,education,educationno,occupation,capitalgain,capitalloss,hoursperweek,native,Salary
0,25,Private,11th,7,Machine-op-inspct,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Farming-fishing,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Protective-serv,0,0,40,United-States,>50K
3,44,Private,Some-college,10,Machine-op-inspct,7688,0,40,United-States,>50K
4,34,Private,10th,6,Other-service,0,0,30,United-States,<=50K


In [14]:
df_sal_test.shape

(15060, 10)

In [15]:
df_sal_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15060 entries, 0 to 15059
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   age           15060 non-null  int64 
 1   workclass     15060 non-null  object
 2   education     15060 non-null  object
 3   educationno   15060 non-null  int64 
 4   occupation    15060 non-null  object
 5   capitalgain   15060 non-null  int64 
 6   capitalloss   15060 non-null  int64 
 7   hoursperweek  15060 non-null  int64 
 8   native        15060 non-null  object
 9   Salary        15060 non-null  object
dtypes: int64(5), object(5)
memory usage: 1.1+ MB


In [16]:
# All above five columns need to be changed to categorical
columns = ['workclass', 'education', 'occupation', 'native', 'Salary']

for x in columns:
  df_sal_test[x] = df_sal_train[x].astype('category')

df_sal_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15060 entries, 0 to 15059
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   age           15060 non-null  int64   
 1   workclass     15060 non-null  category
 2   education     15060 non-null  category
 3   educationno   15060 non-null  int64   
 4   occupation    15060 non-null  category
 5   capitalgain   15060 non-null  int64   
 6   capitalloss   15060 non-null  int64   
 7   hoursperweek  15060 non-null  int64   
 8   native        15060 non-null  category
 9   Salary        15060 non-null  category
dtypes: category(5), int64(5)
memory usage: 665.1 KB


In [17]:
# Separate dataset into input and output columns
df_sal_test_x = df_sal_test.drop('Salary', axis=1)
print(df_sal_test_x.head())

print('y data now')
df_sal_test_y = df_sal_test.Salary
print(df_sal_test_y.head())

   age          workclass   education  educationno          occupation  \
0   25          State-gov   Bachelors            7        Adm-clerical   
1   38   Self-emp-not-inc   Bachelors            9     Exec-managerial   
2   28            Private     HS-grad           12   Handlers-cleaners   
3   44            Private        11th           10   Handlers-cleaners   
4   34            Private   Bachelors            6      Prof-specialty   

   capitalgain  capitalloss  hoursperweek          native  
0            0            0            40   United-States  
1            0            0            50   United-States  
2            0            0            40   United-States  
3         7688            0            40   United-States  
4            0            0            30            Cuba  
y data now
0     <=50K
1     <=50K
2     <=50K
3     <=50K
4     <=50K
Name: Salary, dtype: category
Categories (2, object): [' <=50K', ' >50K']


In [18]:
# Change categorical strings to numerical values using get dummies
df_sal_test_x = pd.get_dummies(df_sal_test_x)
df_sal_test_x.head()

Unnamed: 0,age,educationno,capitalgain,capitalloss,hoursperweek,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,...,native_ Portugal,native_ Puerto-Rico,native_ Scotland,native_ South,native_ Taiwan,native_ Thailand,native_ Trinadad&Tobago,native_ United-States,native_ Vietnam,native_ Yugoslavia
0,25,7,0,0,40,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,38,9,0,0,50,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
2,28,12,0,0,40,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
3,44,10,7688,0,40,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
4,34,6,0,0,30,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
df_sal_test_x.shape

(15060, 82)

In [20]:
df_sal_test_y.shape

(15060,)

# **Naive Bayes Models Creation**

## **Multinomial Naive Bayes creation**

In [21]:
# Preparing a naive bayes model on training data set 

# Multinomial Naive Bayes
classifier_mb = MB()
classifier_mb.fit(df_sal_train_x, df_sal_train_y)
train_pred_mb = classifier_mb.predict(df_sal_train_x)
accuracy_train_mb = np.mean(train_pred_mb==df_sal_train_y)

In [22]:
train_pred_mb

array([' >50K', ' <=50K', ' <=50K', ..., ' <=50K', ' <=50K', ' >50K'],
      dtype='<U6')

In [23]:
accuracy_train_mb

0.7729186698053778

In [24]:
classifier_mb.score(df_sal_train_x, df_sal_train_y)

0.7729186698053778

**Prediction for Test Dataset (Multinomial)**

In [25]:
test_pred_mb = classifier_mb.predict(df_sal_test_x)
accuracy_test_mb = np.mean(test_pred_mb==df_sal_test_y)
print(accuracy_test_mb)

0.7090969455511288


In [26]:
classifier_mb.score(df_sal_test_x, df_sal_test_y)# Gaussian Naive Bayes 

0.7090969455511288

## **Gaussian Naive Bayes Creation**

In [27]:
classifier_gb = GB()
classifier_gb.fit(df_sal_train_x, df_sal_train_y)
train_pred_gb = classifier_gb.predict(df_sal_train_x)
accuracy_train_gb = np.mean(train_pred_gb==df_sal_train_y)
print(accuracy_train_gb)

0.8013991578528563


**Prediction for Test Dataset (Gaussian)**

In [28]:
test_pred_gb = classifier_gb.predict(df_sal_test_x)
accuracy_test_gb = np.mean(test_pred_gb==df_sal_test_y)
print(accuracy_test_gb)

0.7134130146082337


In [29]:
classifier_gb.score(df_sal_test_x, df_sal_test_y)# Gaussian Naive Bayes 

0.7134130146082337

# Conclusion: Both models are giving similar accuracy. So, any one can be considered for the classification of data.