# Predicting Adult Income in the US

**Goal** : Build a model that accurately predicts whether an individual earns more than $50,000 per year, using data collected from the 1994 U.S. Census.

**Dataset** : [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Census+Income)

**Dataset Description** : This dataset consists of approximately 32,000 data points (32,561 rows in `adult.csv`), each with 14 features plus the `income_class` target. It is a modified version of the dataset published in the paper “Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid”, by Ron Kohavi.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Analysis

The first step was reading through `adult.names` to make sense of the data in the `adult.data` and `adult.test` files. After trimming stray spaces and removing unwanted first rows, the two files were converted to [`adult_income_train.csv`](./adult_income_train.csv) and [`adult_income_test.csv`](./adult_income_test.csv), respectively, so that `pandas` can read them easily.
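As a rough sketch of the same cleanup done at read time, `pandas.read_csv` can strip the space after each comma (`skipinitialspace=True`) and mark `?` as missing (`na_values="?"`). The two rows below are illustrative values in the raw `adult.data` layout, not a claim about how the CSV files above were actually produced. For `adult.test`, passing `skiprows=1` would be one way to skip a leading non-data line, assuming the file has exactly one.

```python
import io

import pandas as pd

column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                'marital_status', 'occupation', 'relationship', 'race', 'sex',
                'capital_gain', 'capital_loss', 'hours_per_week',
                'native_country', 'income_class']

# Illustrative rows in the raw layout: fields are separated by ", "
raw_sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K\n"
)

sample = pd.read_csv(
    raw_sample,
    names=column_names,
    skipinitialspace=True,  # drop the space that follows each comma
    na_values="?",          # read "?" as a missing value
)
print(sample[['workclass', 'occupation', 'income_class']])
```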

In [4]:
# column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income_class']
# train = pd.read_csv('adult_income_train.csv', names=column_names)
# test = pd.read_csv('adult_income_test.csv', names=column_names)

In [5]:
# df = pd.concat([train, test], ignore_index=True)

Using the already-cleaned [`adult.csv`](./adult.csv) dataset, which was provided later, instead.

In [6]:
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income_class']
df = pd.read_csv('adult.csv', names=column_names, header=0)

In [7]:
df.replace('?',np.nan,inplace=True)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         30725 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education_num     32561 non-null int64
marital_status    32561 non-null object
occupation        30718 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital_gain      32561 non-null int64
capital_loss      32561 non-null int64
hours_per_week    32561 non-null int64
native_country    31978 non-null object
income_class      32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
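The non-null counts above show that only `workclass`, `occupation`, and `native_country` contain missing values. A compact way to list such columns directly is sketched below on a small made-up frame (`toy` and its values are placeholders, not rows from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for df after '?' has been replaced by NaN
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "State-gov", np.nan],
    "occupation": ["Sales", np.nan, "Adm-clerical", np.nan],
    "age": [25, 38, 52, 41],
})

null_counts = toy.isna().sum()                   # missing values per column
null_columns = null_counts[null_counts > 0]      # keep only columns with gaps
null_share = (null_columns / len(toy)).round(3)  # fraction of rows affected

print(null_columns)
print(null_share)
```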


Exploring null-containing columns

In [9]:
df.workclass.value_counts(dropna=False)

Private             22696
Self-emp-not-inc     2541
Local-gov            2093
NaN                  1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64

In [10]:
df.native_country.value_counts(dropna=False)

United-States                 29170
Mexico                          643
NaN                             583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
Greece                      

In [11]:
df.occupation.value_counts(dropna=False)

Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
NaN                  1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: occupation, dtype: int64

In [12]:
df[df.occupation.isnull()].income_class.value_counts()
# some rows with a missing occupation earn >50K, so NaN in df.occupation cannot be assumed to mean unemployed

<=50K    1652
>50K      191
Name: income_class, dtype: int64
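About 10% of the rows with a missing occupation (191 of 1,843) fall in the `>50K` class. The counts above also show 1,843 missing occupations against 1,836 missing work classes, a difference of 7 that matches the `Never-worked` count, which suggests (but does not prove) that the two columns are mostly missing together. One way to check that overlap is sketched below on made-up rows; `toy` is a placeholder frame.

```python
import numpy as np
import pandas as pd

# Made-up rows; the real check would run on df
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "Never-worked", np.nan, "Local-gov"],
    "occupation": ["Sales", np.nan, np.nan, np.nan, "Tech-support"],
    "income_class": ["<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

occupation_missing = toy.occupation.isna()

# Rows where both columns are missing vs. only occupation is missing
both_missing = int((toy.workclass.isna() & occupation_missing).sum())
only_occupation = int((toy.workclass.notna() & occupation_missing).sum())

# Class shares among rows with a missing occupation
shares = toy[occupation_missing].income_class.value_counts(normalize=True)

print(both_missing, only_occupation)
print(shares)
```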

Encoding the target as two distinct classes for classification: `0` for `<=50K` and `1` for `>50K`

In [13]:
df.replace({'income_class':{'<=50K':0,'<=50K.':0,'>50K':1,'>50K.':1}},inplace=True)
df.income_class = df.income_class.astype('int')
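The mapping above lists the trailing-period variants (`<=50K.`, `>50K.`) explicitly. An equivalent sketch, shown here on a made-up series rather than on `df`, strips the period first and then fails loudly if any label is left unmapped.

```python
import pandas as pd

# Made-up labels covering all four spellings handled above
labels = pd.Series(["<=50K", ">50K.", "<=50K.", ">50K"])

# Strip a trailing period, then map the two remaining labels to 0/1
target = labels.str.rstrip(".").map({"<=50K": 0, ">50K": 1})

# map() yields NaN for unknown labels, so check before casting to int
if target.isna().any():
    raise ValueError("unexpected income label found")
target = target.astype("int64")

print(target.tolist())
```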

Filling NaN values with each column's mode (its most frequent value), based on the exploration above

In [14]:
for col in ['workclass', 'occupation', 'native_country']:
    # assign back instead of inplace=True on a column selection, which newer pandas versions warn about
    df[col] = df[col].fillna(df[col].mode()[0])
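For comparison, scikit-learn's `SimpleImputer` with `strategy="most_frequent"` performs the same kind of mode imputation and can later be placed inside a pipeline. This is a sketch on a made-up frame, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up categorical columns with gaps
toy = pd.DataFrame({
    "workclass": ["Private", "Private", np.nan, "State-gov"],
    "occupation": ["Sales", np.nan, "Sales", "Tech-support"],
})

# Learn each column's most frequent value and use it to fill the gaps
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```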

## Data Preprocessing

In [15]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer, make_column_transformer

In [16]:
int_columns = list(df.select_dtypes(include=['int']).columns)
obj_columns = list(df.select_dtypes(include=['object']).columns)

Making transformations:

 - OneHot Encoding columns with `dtype = 'object'`
 - Standard Scaling columns with `dtype = 'int'`

In [17]:
ct = make_column_transformer(
        (StandardScaler(), int_columns[:-1]), 
        (OneHotEncoder(), obj_columns), sparse_threshold=0
)

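The transformer *does not* scale the last integer column, `income_class`, which holds the `y` values; `int_columns[:-1]` excludes it.

Because columns not listed in the transformer are dropped by default, `income_class` is left out of the transformed features and remains categorical with `0`s and `1`s.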
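To make the effect of the transformer concrete, here is a small sketch on a made-up three-row frame (`toy` and `toy_ct` are placeholder names). The numeric column is standardized, the categorical column becomes one indicator column per category, and the unlisted `income_class` column is dropped from the output.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up frame with one numeric, one categorical, and one target column
toy = pd.DataFrame({
    "age": [20, 40, 60],
    "sex": ["Male", "Female", "Male"],
    "income_class": [0, 1, 1],
})

toy_ct = make_column_transformer(
    (StandardScaler(), ["age"]),   # zero mean, unit variance
    (OneHotEncoder(), ["sex"]),    # one 0/1 column per category
    sparse_threshold=0,            # always return a dense array
)

out = toy_ct.fit_transform(toy)
print(out.shape)  # one scaled column plus two indicator columns
print(out)
```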

## Splitting the processed data

In [18]:
X = pd.DataFrame(ct.fit_transform(df))
y = df.iloc[:,-1]

In [19]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
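Note that the cells above fit the transformer on the full dataset before splitting, so the scaler's statistics also reflect the test rows. A common alternative, sketched here on made-up data with placeholder names, is to split first, fit the transformer on the training rows only, and reuse it on the test rows. `handle_unknown="ignore"` is an added assumption so that categories seen only in the test split do not raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data standing in for df
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 80, size=100),
    "sex": rng.choice(["Male", "Female"], size=100),
    "income_class": rng.integers(0, 2, size=100),
})

# Split the raw rows first
raw_train, raw_test = train_test_split(toy, test_size=0.3, random_state=0)

toy_ct = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    sparse_threshold=0,
)

# Fit on the training rows only, then apply the same transform to the test rows
X_train_toy = toy_ct.fit_transform(raw_train)
X_test_toy = toy_ct.transform(raw_test)
y_train_toy = raw_train["income_class"]
y_test_toy = raw_test["income_class"]

print(X_train_toy.shape, X_test_toy.shape)
```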

## Support Vector Machine

In [20]:
from sklearn import svm
svmach = svm.SVC(gamma='scale')

In [21]:
svmach.fit(X_train, y_train)
svmach.score(X_test, y_test)

0.8523902139420616

In [22]:
# Best: 0.8523902139420616
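An accuracy figure is easier to interpret next to a majority-class baseline and a confusion matrix, since a model can score well on imbalanced data by mostly predicting the larger class. The overall class shares are not shown in this notebook, so the sketch below uses made-up labels and predictions purely to illustrate the calls.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 8 negatives, 2 positives
y_toy = np.array([0] * 8 + [1] * 2)
X_toy = np.zeros((10, 1))  # features are ignored by the dummy model

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)

# Made-up predictions from some hypothetical classifier
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])
model_acc = accuracy_score(y_toy, y_pred)
cm = confusion_matrix(y_toy, y_pred)  # rows: true class, columns: predicted

print(baseline_acc, model_acc)
print(cm)
```

In this toy case both accuracies are equal, yet the confusion matrix shows that the hypothetical classifier recovers one of the two positives while the baseline recovers none.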