# Predicting Adult Income in the US

**Goal** : Build a model that accurately predicts whether an individual earns more than $50,000 a year, using data collected from the 1994 U.S. Census.

**Dataset** : [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Census+Income)

**Dataset Description** : This dataset consists of approximately 32,000 data points. The `adult.csv` file used here has 15 columns (14 attributes plus the income label); after two columns are dropped below, 13 remain. It is a modified version of the dataset published in the paper “Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid”, by Ron Kohavi.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Analysis

The `adult.names` file was read to make sense of the data in the `adult.data` and `adult.test` files. After some cleanup, such as trimming spaces and removing unwanted first rows, the datasets were converted to [`adult_income_train.csv`](./adult_income_train.csv) and [`adult_income_test.csv`](./adult_income_test.csv), respectively, so that `pandas` can read them easily.
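The cleanup described above could also be done at read time. The following is only a minimal sketch, not the conversion actually used for the CSV files: the helper name `load_raw_adult` and the in-memory sample rows are hypothetical, and it assumes the raw files use `, ` as separator, `?` for missing values, and (for the test file) one leading non-data line.

```python
import io
import pandas as pd

# Column names as used later in this notebook.
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                'marital_status', 'occupation', 'relationship', 'race', 'sex',
                'capital_gain', 'capital_loss', 'hours_per_week',
                'native_country', 'income_class']

def load_raw_adult(source, skip_first_row=False):
    """Hypothetical helper: read a raw, header-less file in the UCI layout."""
    return pd.read_csv(
        source,
        names=column_names,
        skipinitialspace=True,                 # drop the space after each comma
        skiprows=1 if skip_first_row else 0,   # skip a leading non-data line
        na_values='?',                         # treat '?' as missing
    )

# Small in-memory stand-in for a raw test file (assumed layout, made-up subset).
sample = io.StringIO(
    "|1x3 Cross validator\n"
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K.\n"
    "50, ?, 83311, Bachelors, 13, Married-civ-spouse, ?, "
    "Husband, White, Male, 0, 0, 13, United-States, >50K.\n"
)
raw = load_raw_adult(sample, skip_first_row=True)
print(raw[['age', 'workclass', 'occupation', 'income_class']])
```

Note that labels read this way keep a trailing period (`<=50K.`), which is why the class mapping later in the notebook handles both spellings.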

In [4]:
# column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income_class']
# train = pd.read_csv('adult_income_train.csv', names=column_names)
# test = pd.read_csv('adult_income_test.csv', names=column_names)

In [5]:
# df = pd.concat([train, test], ignore_index=True)

The already-cleaned [`adult.csv`](./adult.csv) dataset, which was provided later, is used instead.

In [6]:
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income_class']
df = pd.read_csv('adult.csv', names=column_names, header=0)

In [7]:
df.replace('?',np.nan,inplace=True)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         30725 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education_num     32561 non-null int64
marital_status    32561 non-null object
occupation        30718 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital_gain      32561 non-null int64
capital_loss      32561 non-null int64
hours_per_week    32561 non-null int64
native_country    31978 non-null object
income_class      32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
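The non-null counts above imply 32561 − 30725 = 1836 missing values in `workclass`, 32561 − 30718 = 1843 in `occupation`, and 32561 − 31978 = 583 in `native_country`. A direct way to get such counts is sketched below on a small made-up frame (`toy` and its values are illustrative only, not rows from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data, using '?' as the missing marker.
toy = pd.DataFrame({
    'age': [39, 50, 38, 53],
    'workclass': ['Private', '?', 'State-gov', 'Private'],
    'occupation': ['Sales', '?', '?', 'Tech-support'],
})
toy = toy.replace('?', np.nan)

missing = toy.isna().sum()         # number of missing values per column
missing_share = toy.isna().mean()  # fraction of missing values per column

print(missing[missing > 0])
print(missing_share[missing_share > 0])
```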


In [9]:
df.drop(columns=['education','native_country'], inplace=True)

In [10]:
df.shape

(32561, 13)

Exploring null-containing columns

In [11]:
df.workclass.value_counts(dropna=False)

Private             22696
Self-emp-not-inc     2541
Local-gov            2093
NaN                  1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64

In [12]:
# df.native_country.value_counts(dropna=False)

In [13]:
df.occupation.value_counts(dropna=False)

Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
NaN                  1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: occupation, dtype: int64

In [14]:
df[df.occupation.isnull()].income_class.value_counts()
# cannot assume NaN in df.occupation equals Unemployed by default

<=50K    1652
>50K      191
Name: income_class, dtype: int64

Creating two distinct classes for classification

In [15]:
df.replace({'income_class':{'<=50K':0,'<=50K.':0,'>50K':1,'>50K.':1}},inplace=True)

Filling NaN values with respective column mode values

In [16]:
for col in ['workclass', 'occupation']: #, 'native_country'
    # Assign the result back; in-place fillna on a selected column is unreliable in recent pandas
    df[col] = df[col].fillna(df[col].mode()[0])
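The same mode-based fill can be expressed with scikit-learn's `SimpleImputer`. This is a sketch on a made-up frame (`toy` is illustrative only), shown as one possible alternative rather than the approach used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up stand-in with missing categorical values.
toy = pd.DataFrame({
    'workclass': ['Private', np.nan, 'Private', 'State-gov'],
    'occupation': ['Sales', 'Sales', np.nan, 'Tech-support'],
}, dtype=object)

# 'most_frequent' replaces each NaN with the column's most common value.
imputer = SimpleImputer(strategy='most_frequent')
filled = pd.DataFrame(imputer.fit_transform(toy),
                      columns=toy.columns, index=toy.index)
print(filled)
```

One reason to prefer an imputer object is that it can be fitted on training data only and then reused on held-out data, so the fill values do not depend on the test set.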

In [17]:
df

| | age | workclass | fnlwgt | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | income_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | Private | 77053 | 9 | Widowed | Prof-specialty | Not-in-family | White | Female | 0 | 4356 | 40 | 0 |
| 1 | 82 | Private | 132870 | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | 0 |
| 2 | 66 | Private | 186061 | 10 | Widowed | Prof-specialty | Unmarried | Black | Female | 0 | 4356 | 40 | 0 |
| 3 | 54 | Private | 140359 | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | 0 |
| 4 | 41 | Private | 264663 | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32556 | 22 | Private | 310152 | 10 | Never-married | Protective-serv | Not-in-family | White | Male | 0 | 0 | 40 | 0 |
| 32557 | 27 | Private | 257302 | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | 0 |
| 32558 | 40 | Private | 154374 | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | 1 |
| 32559 | 58 | Private | 151910 | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | 0 |


## Data Preprocessing

In [18]:
from sklearn.preprocessing import LabelEncoder

In [19]:
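# Positions 1, 4, 5, 6, 7, 8 are the categorical columns:
# workclass, marital_status, occupation, relationship, race, sex
other_cols = df.iloc[:,[1,4,5,6,7,8]].apply(LabelEncoder().fit_transform)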

In [20]:
df.drop(columns=df.columns[[1,4,5,6,7,8]],inplace=True)

In [21]:
df = pd.concat([df,other_cols], axis=1)
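`LabelEncoder` assigns integer codes following the sorted order of the labels, so the numeric gaps between codes carry no real meaning, yet a distance-based method such as K-Means will treat them as meaningful. One common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on a made-up frame (`toy` is illustrative only); it is not the encoding used in this notebook.

```python
import pandas as pd

# Made-up stand-in with one numeric and two categorical columns.
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'workclass': ['Private', 'State-gov', 'Private'],
    'sex': ['Male', 'Female', 'Male'],
})

# Each category becomes its own 0/1 column, so no artificial order is implied.
encoded = pd.get_dummies(toy, columns=['workclass', 'sex'], dtype=int)
print(encoded)
```

The trade-off is a wider feature matrix: every distinct category value adds a column.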

## Splitting dataset

In [22]:
X = df.drop('income_class',axis=1)
y = df['income_class']

In [23]:
X.shape

(32561, 12)

In [24]:
y.shape

(32561,)

## K-Means Clustering

In [25]:
from sklearn.cluster import KMeans
Kmean = KMeans(n_clusters=2,tol=0.35,random_state=1)
Kmean.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=1, tol=0.35, verbose=0)
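K-Means relies on Euclidean distances, so a column with a large numeric range (here `fnlwgt`, with values in the hundreds of thousands) can dominate the clustering when features are left unscaled, as they are above. The sketch below shows one way to standardize features first. It runs on synthetic data (`wide`, `narrow`, and `X_toy` are made up for illustration), omits the `tol` setting used above, and makes no claim about the effect on the real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: a wide-range column carrying no group structure,
# and a narrow-range column that separates two groups of 100 points each.
wide = rng.normal(200_000, 100_000, size=200)
narrow = np.concatenate([rng.normal(0, 0.5, 100), rng.normal(10, 0.5, 100)])
X_toy = np.column_stack([wide, narrow])
truth = np.repeat([0, 1], 100)

# Standardize each column to zero mean and unit variance, then cluster.
scaled_kmeans = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=1),
)
labels = scaled_kmeans.fit_predict(X_toy)

# Cluster ids are arbitrary, so score both labelings and keep the better one.
agree = np.mean(labels == truth)
agreement = max(agree, 1 - agree)
print(agreement)
```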

In [26]:
y_predicted = Kmean.predict(X)

In [27]:
y_predicted = np.array(y_predicted,dtype='int')
np.unique(y_predicted,return_counts=True)

(array([0, 1]), array([24021,  8540]))

In [28]:
y = np.array(y,dtype='int')
np.unique(y,return_counts=True)

(array([0, 1]), array([24720,  7841]))

In [29]:
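def new_acc(y_predicted, y_real):
    # K-Means cluster ids are arbitrary, so cluster 0 may correspond to class 1.
    # Score both possible labelings and keep the better one.
    one = np.sum(y_predicted == y_real)/len(y_real)  # cluster ids taken as-is
    two = np.sum(y_predicted != y_real)/len(y_real)  # cluster ids swapped
    return max(one,two)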

In [30]:
new_acc(y_predicted, y)

0.6155830594883449
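To put this score in context, it helps to compare it with a trivial baseline that always predicts the majority class. The sketch below recomputes that baseline from the class counts reported above (24720 and 7841) and also checks, on made-up labels, that `new_acc` is unaffected by swapped cluster ids. The function is restated in an equivalent form so the block is self-contained.

```python
import numpy as np

def new_acc(y_predicted, y_real):
    # Equivalent restatement: agreement and disagreement always sum to 1.
    agree = np.mean(y_predicted == y_real)
    return max(agree, 1 - agree)

# Made-up labels: a perfect clustering whose cluster ids happen to be swapped.
y_real = np.array([0, 0, 0, 1, 1])
flipped = 1 - y_real
print(new_acc(flipped, y_real))

# Majority-class baseline from the class counts shown earlier in the notebook.
counts = np.array([24720, 7841])
majority_baseline = counts.max() / counts.sum()
print(round(majority_baseline, 3))
```

The baseline comes out at roughly 0.759, which is higher than the 0.6156 obtained by the two-cluster K-Means above, so the clusters found here do not line up well with the income classes.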

In [31]:
# Best: 0.6345935321396763