# Naive Bayes Project
Predict whether an individual's annual income exceeds $50K/yr based on census data. The dataset is also known as the "Census Income" dataset.

### 1. Import libraries

In [53]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

%matplotlib inline is a magic command in Jupyter Notebook that allows Matplotlib plots to be displayed directly in the notebook rather than in a separate window.

Explanation:<br>
%matplotlib is an IPython magic command for configuring Matplotlib.<br>
The inline argument tells Jupyter to render the plots inside the output cell.<br>
It is mainly used in Jupyter Notebook and JupyterLab but is not needed in standard Python scripts.
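For comparison, outside a notebook a figure has to be shown or saved explicitly. The following is a minimal sketch, assuming a standalone script without a display; the `Agg` backend, the sample data, and the in-memory buffer are illustrative choices and not part of this project.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for a headless script
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])  # made-up sample data
ax.set_xlabel("x")
ax.set_ylabel("y")

# Without %matplotlib inline nothing is rendered automatically:
# call plt.show() in an interactive session, or save the figure explicitly.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```

In a script with a display available, `plt.show()` would take the place of the `savefig` call.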

In [54]:
import warnings 

warnings.filterwarnings('ignore')
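Ignoring every warning can also hide deprecation notices that matter later. A more targeted alternative, sketched here with the standard library only, suppresses a single warning category inside a limited block; the function `noisy` is hypothetical and stands in for any call that emits a warning.

```python
import warnings


def noisy():
    # Hypothetical stand-in for a library call that emits a warning
    warnings.warn("this API will change", FutureWarning)
    return 42


# Suppress only FutureWarning, and only inside this block.
# record=True collects any warning that still gets through.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy()

suppressed_count = len(caught)  # ignored warnings are not recorded
```

The previous warning filters are restored when the `with` block exits, so warnings elsewhere in the notebook stay visible.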

### 2. Import dataset

In [55]:
df = pd.read_csv('adult.csv')


### Exploratory data analysis

In [56]:
df.head()

,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [57]:
# View the dimensions
df.shape

(48842, 15)

### Look at the categorical features

In [58]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [59]:
# Get the categorical columns

cat_cols = [col for col in df.columns if df[col].dtype == "O"]

# Print the number of categorical columns (features)
print("The number of categorical columns is {}: ".format(len(cat_cols)))
print("The categorical columns are:\n {}".format(cat_cols))


The number of categorical columns is 9: 
The categorical columns are:
 ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'native-country', 'income']


In [60]:
# Look at the categorical columns

df[cat_cols].head()

,workclass,education,marital-status,occupation,relationship,race,gender,native-country,income
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States,<=50K
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States,<=50K
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States,>50K
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States,>50K
4,?,Some-college,Never-married,?,Own-child,White,Female,United-States,<=50K


##### Summary

- There are 9 categorical columns: 8 features plus the target.
- The target variable is `income`.
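As a variation on the list comprehension used earlier, the non-numeric columns can be selected with `select_dtypes`, and the target can then be separated from the predictors. This is a minimal sketch on a toy frame with made-up rows; the names `toy`, `cat_cols_toy`, and `cat_features` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame with made-up rows, mirroring a few columns of the census data
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": ["Private", "Private", "Local-gov"],
    "gender": ["Male", "Male", "Female"],
    "income": ["<=50K", "<=50K", ">50K"],
})

# Everything that is not numeric is treated as categorical here
cat_cols_toy = toy.select_dtypes(exclude="number").columns.tolist()

# The target is categorical too, so drop it to get the predictor columns
target = "income"
cat_features = [col for col in cat_cols_toy if col != target]
```

Excluding numeric dtypes rather than matching `"O"` is a design choice in this sketch: it does not depend on how a given pandas version stores strings.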

### Check missing values

In [61]:
df[cat_cols].isnull().sum()

workclass         0
education         0
marital-status    0
occupation        0
relationship      0
race              0
gender            0
native-country    0
income            0
dtype: int64

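`isnull()` reports no missing values, so there are no NaN entries to handle. Note, however, that `?` appears as a label in `workclass` and `occupation` (row 4 of `df.head()`), which likely marks missing data that `isnull()` does not detect.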
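`isnull()` only detects NaN/None, so a placeholder such as the `?` visible in the `df.head()` output goes unnoticed. The sketch below illustrates this on a toy frame with made-up rows; treating `?` as a missing-value marker is an assumption about this dataset, not something the notebook confirms.

```python
import numpy as np
import pandas as pd

# Toy rows (made up) that mimic the "?" placeholder seen in df.head()
toy = pd.DataFrame({
    "workclass": ["Private", "?", "Local-gov", "?"],
    "occupation": ["Sales", "?", "Tech-support", "Craft-repair"],
})

# isnull() sees "?" as an ordinary string, so it reports zero missing values
missing_before = toy.isnull().sum()

# Replacing the placeholder with NaN makes the gaps visible
cleaned = toy.replace("?", np.nan)
missing_after = cleaned.isnull().sum()
```

If the assumption holds for the real file, passing `na_values="?"` to `pd.read_csv` would likely have the same effect at load time.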

### Frequency counts of categorical variables

In [92]:

for col in cat_cols:
    print(df[col].value_counts())

workclass
Private             33906
Self-emp-not-inc     3862
Local-gov            3136
?                    2799
State-gov            1981
Self-emp-inc         1695
Federal-gov          1432
Without-pay            21
Never-worked           10
Name: count, dtype: int64
education
HS-grad         15784
Some-college    10878
Bachelors        8025
Masters          2657
Assoc-voc        2061
11th             1812
Assoc-acdm       1601
10th             1389
7th-8th           955
Prof-school       834
9th               756
12th              657
Doctorate         594
5th-6th           509
1st-4th           247
Preschool          83
Name: count, dtype: int64
marital-status
Married-civ-spouse       22379
Never-married            16117
Divorced                  6633
Separated                 1530
Widowed                   1518
Married-spouse-absent      628
Married-AF-spouse           37
Name: count, dtype: int64
occupation
Prof-specialty       6172
Craft-repair         6112
Exec-managerial      

In [96]:

for col in cat_cols:
    print(df[col].value_counts() / len(df))

workclass
Private             0.694198
Self-emp-not-inc    0.079071
Local-gov           0.064207
?                   0.057307
State-gov           0.040559
Self-emp-inc        0.034704
Federal-gov         0.029319
Without-pay         0.000430
Never-worked        0.000205
Name: count, dtype: float64
education
HS-grad         0.323164
Some-college    0.222718
Bachelors       0.164305
Masters         0.054400
Assoc-voc       0.042197
11th            0.037099
Assoc-acdm      0.032779
10th            0.028439
7th-8th         0.019553
Prof-school     0.017075
9th             0.015478
12th            0.013452
Doctorate       0.012162
5th-6th         0.010421
1st-4th         0.005057
Preschool       0.001699
Name: count, dtype: float64
marital-status
Married-civ-spouse       0.458192
Never-married            0.329982
Divorced                 0.135805
Separated                0.031325
Widowed                  0.031080
Married-spouse-absent    0.012858
Married-AF-spouse        0.000758
Name: coun
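Relative frequencies like these can also be obtained directly from `value_counts`. A minimal sketch on made-up labels; the series `workclass` below is illustrative and not the notebook's column.

```python
import pandas as pd

# Made-up labels, only to illustrate the call
workclass = pd.Series(
    ["Private", "Private", "Private", "Local-gov"], name="workclass"
)

# normalize=True returns relative frequencies instead of raw counts
shares = workclass.value_counts(normalize=True)

# Equivalent to dividing the counts by the number of rows
manual = workclass.value_counts() / len(workclass)
```

Both results hold the same values; only the name attached to the resulting series may differ between the two forms, depending on the pandas version.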

### Number of labels: cardinality

Cardinality is the number of unique values (labels) in a categorical variable. A variable with many labels has high cardinality, which can cause problems for a machine learning model: encodings grow wide and rare labels are backed by very few rows. So, I will check the cardinality of each categorical column.

Why is cardinality important?

- Low cardinality → few unique values (e.g., "Yes"/"No", "Male"/"Female") <br>
✅ Easy to encode (One-Hot Encoding works well)

- High cardinality → many unique values (e.g., "City" with 1000+ unique values) <br>
⚠️ One-Hot Encoding creates too many columns → alternatives such as Ordinal Encoding or Target Encoding are often preferred
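To make the column growth concrete, the sketch below counts labels and one-hot columns on a toy frame. The rows are made up and the frame `toy` is illustrative; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Made-up data: one low-cardinality and one higher-cardinality column
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "native-country": ["United-States", "Mexico", "India", "Cuba", "Peru", "Japan"],
})

# nunique() gives the cardinality of every column in one call
cardinality = toy.nunique()

# One-hot encoding adds one column per label
encoded = pd.get_dummies(toy)
n_encoded_columns = encoded.shape[1]
```

Here two original columns become eight encoded columns (2 for `gender`, 6 for `native-country`), and the count grows with every additional label.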



In [62]:
# Cardinality for categorical variables
for col in cat_cols:
    print(col, ' contains ', len(df[col].unique()), ' labels')

workclass  contains  9  labels
education  contains  16  labels
marital-status  contains  7  labels
occupation  contains  15  labels
relationship  contains  6  labels
race  contains  5  labels
gender  contains  2  labels
native-country  contains  42  labels
income  contains  2  labels


### Visualizing Our Data