Predict whether income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset.

- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

 Description of fnlwgt (final weight)


 The weights on the CPS files are controlled to independent estimates of the
 civilian noninstitutional population of the US.  These are prepared monthly
 for us by Population Division here at the Census Bureau.  We use 3 sets of
 controls.

  These are:
          1.  A single cell estimate of the population 16+ for each state.
          2.  Controls for Hispanic Origin by age and sex.
          3.  Controls by Race, age and sex.

 We use all three sets of controls in our weighting program and "rake" through
 them 6 times so that by the end we come back to all the controls we used.

 The term estimate refers to population totals derived from CPS by creating
 "weighted tallies" of any specified socio-economic characteristics of the
 population.

 People with similar demographic characteristics should have
 similar weights.  There is one important caveat to remember
 about this statement.  That is that since the CPS sample is
 actually a collection of 51 state samples, each with its own
 probability of selection, the statement only applies within
 state.
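
The "raking" described above is commonly known as iterative proportional fitting: weights are rescaled to match one set of control totals, then the next, and the cycle is repeated. The sketch below illustrates the idea on five made-up sample units with two hypothetical control sets (sex and age group). The `rake` function, the categories, and the control totals are assumptions for illustration only, not the Census Bureau's weighting program.

```python
import numpy as np

def rake(weights, groups_list, controls_list, n_passes=6):
    """Rescale unit weights so weighted totals match each control set in turn."""
    w = np.asarray(weights, dtype=float).copy()
    for _ in range(n_passes):
        for groups, controls in zip(groups_list, controls_list):
            for level, target in controls.items():
                mask = groups == level
                total = w[mask].sum()
                if total > 0:
                    w[mask] *= target / total  # match this control total
    return w

# Toy sample: one row per respondent (hypothetical data)
sex = np.array(["F", "F", "M", "M", "M"])
age = np.array(["young", "old", "young", "old", "old"])
base_weights = np.ones(5)

# Hypothetical independent population estimates (both sum to 100)
controls_sex = {"F": 60.0, "M": 40.0}
controls_age = {"young": 50.0, "old": 50.0}

raked = rake(base_weights, [sex, age], [controls_sex, controls_age])

# A "weighted tally": estimated population count per category
tally_sex = {level: raked[sex == level].sum() for level in controls_sex}
tally_age = {level: raked[age == level].sum() for level in controls_age}
print(tally_sex, tally_age)
```

The control set applied last is matched exactly; the others are matched approximately, which is why several passes are made.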

Source: http://archive.ics.uci.edu/ml/machine-learning-databases/adult/

In [1]:
from IPython.display import Markdown, display

import pandas as pd
import numpy as np

%matplotlib inline

In [2]:
# Column names shared by both files (passed via `names`, so no header row is read)
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
           'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
           'hours_per_week', 'native_country', 'income']

# '?' marks a missing value in the raw files
df1 = pd.read_csv('adult.csv', skipinitialspace=True, na_values=['?'], names=columns)
df2 = pd.read_csv('adult_test.csv', skipinitialspace=True, na_values=['?'], names=columns)

# Stack both files into one frame; each file's original row index is kept
df = pd.concat([df1, df2])
df.head(3)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K


In [3]:
df.shape

(48842, 15)

In [4]:
df.iloc[0]

age                          39
workclass             State-gov
fnlwgt                    77516
education             Bachelors
education_num                13
marital_status    Never-married
occupation         Adm-clerical
relationship      Not-in-family
race                      White
sex                        Male
capital_gain               2174
capital_loss                  0
hours_per_week               40
native_country    United-States
income                    <=50K
Name: 0, dtype: object

In [5]:
df.isnull().values.any()

True

In [6]:
# Drop rows with a missing value in any column
df = df.dropna()
df.head(3)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K


In [7]:
d = {'Private': 1, 'Self-emp-not-inc': 2, 'Self-emp-inc': 3, 'Federal-gov': 4, 'Local-gov': 5, 'State-gov': 6, 'Without-pay': 7, 'Never-worked': 8}
df['workclass'] = df['workclass'].map(d)

#education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
d = {'Bachelors': 1, 'Some-college': 2, '11th': 3, 'HS-grad': 4, 'Prof-school': 5, 'Assoc-acdm': 6, 'Assoc-voc': 7, '9th': 8, '7th-8th': 9, '12th':10, 'Masters':11, '1st-4th':12, '10th':13 , 'Doctorate': 14, '5th-6th': 15, 'Preschool': 16}
df['education'] = df['education'].map(d)

# marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
e = {'Married-civ-spouse': 1, 'Divorced': 2, 'Never-married': 3, 'Separated': 4, 'Widowed': 5, 'Married-spouse-absent': 6,
     'Married-AF-spouse': 7}
df['marital_status'] = df['marital_status'].map(e)

# occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, 
# Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
d = {'Tech-support':1, 'Craft-repair':2, 'Other-service':3, 'Sales':4, 'Exec-managerial':5, 'Prof-specialty':6, 'Handlers-cleaners':7, 'Machine-op-inspct':8, 'Adm-clerical':9, 'Farming-fishing':10, 'Transport-moving':11, 'Priv-house-serv':12, 'Protective-serv':13, 'Armed-Forces':14}
df['occupation'] = df['occupation'].map(d)

# relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
d = {'Wife': 1, 'Own-child': 2, 'Husband': 3, 'Not-in-family': 4, 'Other-relative': 5, 'Unmarried': 6}
df['relationship'] = df['relationship'].map(d)

# race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
d = {'White': 1, 'Asian-Pac-Islander': 2, 'Amer-Indian-Eskimo': 3, 'Other': 4, 'Black': 5}
df['race'] = df['race'].map(d)

# sex: Female, Male.
d = {'Female': 1, 'Male': 2}
df['sex'] = df['sex'].map(d)

# native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India,
#     Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, 
# Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, 
# Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
d = {'United-States':1, 'Cambodia':2, 'England':3, 'Puerto-Rico':4, 'Canada':5, 'Germany':6, 'Outlying-US(Guam-USVI-etc)':7, 'India':8,
'Japan':9, 'Greece':10, 'South':11, 'China':12, 'Cuba':13, 'Iran':14, 'Honduras':15, 'Philippines':16, 'Italy':17, 'Poland':18, 
'Jamaica':19, 'Vietnam':20, 'Mexico':21, 'Portugal':22, 
'Ireland':23, 'France':24, 'Dominican-Republic':25, 'Laos':26, 'Ecuador':27, 'Taiwan':28, 'Haiti':29, 'Columbia':30,
     'Hungary':31, 'Guatemala':32, 'Nicaragua':33, 'Scotland':34, 
'Thailand':35, 'Yugoslavia':36, 'El-Salvador':37, 'Trinadad&Tobago':38, 'Peru':39, 'Hong':40, 'Holand-Netherlands':41}
df['native_country'] = df['native_country'].map(d)

# income: <=50K -> 0, >50K -> 1 (Series.map turns any label missing from the dict into NaN)
d = {'<=50K': 0, '>50K': 1}
df['income'] = df['income'].map(d)

df.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,6,77516,1,13,3,9,4,1,2,2174,0,40,1,0.0
1,50,2,83311,1,13,1,5,3,1,2,0,0,13,1,0.0
2,38,1,215646,4,9,2,7,4,1,2,0,0,40,1,0.0
3,53,1,234721,3,7,1,7,3,5,2,0,0,40,1,0.0
4,28,1,338409,1,13,1,6,1,5,1,0,0,40,13,0.0
5,37,1,284582,11,14,1,5,1,1,1,0,0,40,1,0.0
6,49,1,160187,8,5,6,3,4,5,1,0,0,16,19,0.0
7,52,2,209642,4,9,1,5,3,1,2,0,0,45,1,1.0
8,31,1,45781,11,14,3,6,4,1,1,14084,0,50,1,1.0
9,42,1,159449,1,13,1,5,3,1,2,5178,0,40,1,1.0
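
The `income` column prints as `0.0`/`1.0` rather than as integers, which in pandas usually means `map` produced `NaN` for some rows. One possible cause, assuming `adult_test.csv` keeps the label spelling of the UCI `adult.test` file (where labels end in a period, e.g. `<=50K.`), is that those labels are not keys of the mapping. The toy series below is made up to show the effect and one way around it; it is a sketch, not a confirmed diagnosis of this notebook's data.

```python
import pandas as pd

mapping = {"<=50K": 0, ">50K": 1}

# Toy labels: the last two carry a trailing period (assumed test-file spelling)
income = pd.Series(["<=50K", ">50K", "<=50K.", ">50K."])

# Labels that are not keys of the dict become NaN, which forces a float dtype
naive = income.map(mapping)

# Stripping the trailing period first lets every label match a key
cleaned = income.str.rstrip(".").map(mapping)

print(naive.isna().sum(), naive.dtype)
print(cleaned.tolist())
```

If the assumption holds, applying `df['income'].str.rstrip('.')` before the mapping would keep the test-file rows instead of losing them at the next `dropna()`.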


In [9]:
# Drop rows where a mapping above produced NaN (labels missing from the dictionaries)
df = df.dropna()
df.head(3)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,6,77516,1,13,3,9,4,1,2,2174,0,40,1,0.0
1,50,2,83311,1,13,1,5,3,1,2,0,0,13,1,0.0
2,38,1,215646,4,9,2,7,4,1,2,0,0,40,1,0.0


In [10]:
df.isnull().values.any()

False

In [11]:
df.describe()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
count,30162.0,30162.0,30162.0,30162.0,30162.0,30162.0,30162.0,30162.0,30162.0,30162.0,30162.0,30162.0,30162.0,30162.0,30162.0
mean,38.437902,1.736788,189793.8,4.37282,10.121312,2.053213,5.742159,3.393276,1.445196,1.675685,1092.007858,88.372489,40.931238,2.515583,0.248922
std,13.134665,1.461908,105653.0,3.429379,2.549995,1.170881,2.978754,1.229789,1.196958,0.468126,7406.346497,404.29837,11.979984,5.641075,0.432396
min,17.0,1.0,13769.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0
25%,28.0,1.0,117627.2,2.0,9.0,1.0,3.0,3.0,1.0,1.0,0.0,0.0,40.0,1.0,0.0
50%,37.0,1.0,178425.0,4.0,10.0,2.0,5.0,3.0,1.0,2.0,0.0,0.0,40.0,1.0,0.0
75%,47.0,2.0,237628.5,5.0,13.0,3.0,8.0,4.0,1.0,2.0,0.0,0.0,45.0,1.0,0.0
max,90.0,7.0,1484705.0,16.0,16.0,7.0,14.0,6.0,5.0,2.0,99999.0,4356.0,99.0,41.0,1.0


In [12]:

# Feature columns: every column except the 'income' target
feature_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
                 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
                 'hours_per_week', 'native_country']

# Feature matrix and target vector as NumPy arrays
all_features = df[feature_names].values
all_classes = df['income'].values

all_features

array([[    39,      6,  77516, ...,      0,     40,      1],
       [    50,      2,  83311, ...,      0,     13,      1],
       [    38,      1, 215646, ...,      0,     40,      1],
       ...,
       [    58,      1, 151910, ...,      0,     40,      1],
       [    22,      1, 201490, ...,      0,     20,      1],
       [    52,      3, 287927, ...,      0,     40,      1]], dtype=int64)
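
The integer codes used above give nominal categories such as `workclass` or `occupation` an arbitrary numeric order. A common alternative, not used in this notebook, is to one-hot encode the categorical features. The sketch below uses `pandas.get_dummies` on a made-up three-row frame; the `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns (hypothetical rows)
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "sex": ["Male", "Male", "Female"],
})

# Each category becomes its own 0/1 column; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["workclass", "sex"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The trade-off is a wider feature matrix, which also increases the number of network weights the optimizers below have to search over.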

Neural network

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder



X_train, X_test, y_train, y_test = train_test_split(all_features, all_classes,  
                                                    test_size = 0.2, random_state = 3)

# Normalize feature data
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# One-hot encode the target values (class 0 -> [1, 0], class 1 -> [0, 1])
one_hot = OneHotEncoder(handle_unknown = 'ignore')

# toarray() returns a plain ndarray; todense() would return the legacy np.matrix type
y_train_hot = one_hot.fit_transform(y_train.reshape(-1, 1)).toarray()
y_test_hot = one_hot.transform(y_test.reshape(-1, 1)).toarray()
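
To make the two preprocessing steps above concrete, here is a minimal NumPy sketch of what min-max scaling and one-hot encoding compute. The toy arrays and the `toy_` names are made up for illustration. The key point is that the minimum and range are taken from the training rows only and then reused for the test rows.

```python
import numpy as np

# Toy data: columns stand in for age and hours_per_week (hypothetical values)
toy_train = np.array([[17.0, 1.0], [90.0, 99.0], [38.0, 40.0]])
toy_test = np.array([[50.0, 13.0]])

# "fit": learn per-column minimum and range from the training rows only
col_min = toy_train.min(axis=0)
col_range = toy_train.max(axis=0) - col_min

# "transform": apply the same shift and scale to both sets
toy_train_scaled = (toy_train - col_min) / col_range
toy_test_scaled = (toy_test - col_min) / col_range

# One-hot encoding of a binary target: row i of the identity matrix encodes class i
toy_y = np.array([0, 1, 1, 0])
toy_y_hot = np.eye(2)[toy_y]

print(toy_train_scaled)
print(toy_test_scaled)
print(toy_y_hot)
```

Test values outside the training range would map outside [0, 1], which is expected behaviour rather than an error.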



In [15]:
# !pip install mlrose

In [16]:
import mlrose
import numpy as np

## RHC (randomized hill climbing)

In [41]:
# Initialize the neural network and fit it with randomized hill climbing, timing the run
from datetime import datetime

startTime = datetime.now()
np.random.seed(3)
nn_model1 = mlrose.NeuralNetwork(hidden_nodes = [2], activation = 'relu', algorithm = 'random_hill_climb',
                                 max_iters = 1000, bias = True, is_classifier = True, learning_rate = 0.0001, 
                                 early_stopping = True, clip_max =5, max_attempts =100)


nn_model1.fit(X_train_scaled, y_train_hot)
weights= nn_model1.fitted_weights

print()
print(datetime.now() - startTime)


0:00:16.263181
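
For intuition about the `random_hill_climb` option used above, the following is a minimal sketch of a hill-climbing search over a weight vector. It is not mlrose's implementation: the function `random_hill_climb`, the toy quadratic loss, and the parameter names `step`, `clip_max`, `max_iters` and `max_attempts` are assumptions chosen to loosely mirror the arguments in the cell above.

```python
import numpy as np

def random_hill_climb(loss, n_weights, step=0.1, clip_max=5.0,
                      max_iters=1000, max_attempts=100, seed=3):
    """Perturb one weight at a time and keep the change only if the loss drops."""
    rng = np.random.default_rng(seed)
    weights = rng.uniform(-1.0, 1.0, n_weights)
    best = loss(weights)
    attempts = 0
    for _ in range(max_iters):
        candidate = weights.copy()
        i = rng.integers(n_weights)                      # pick one weight
        candidate[i] += step * rng.choice([-1.0, 1.0])   # nudge it up or down
        candidate = np.clip(candidate, -clip_max, clip_max)
        cand_loss = loss(candidate)
        if cand_loss < best:
            weights, best, attempts = candidate, cand_loss, 0
        else:
            attempts += 1
            if attempts >= max_attempts:  # stop after too many failed tries in a row
                break
    return weights, best

# Toy loss: squared distance to a known target vector
target = np.array([1.0, -2.0, 0.5])
toy_loss = lambda w: float(np.sum((w - target) ** 2))

w, final_loss = random_hill_climb(toy_loss, n_weights=3)
print(w, final_loss)
```

Because only improving moves are accepted, the search can stall in a local optimum; clipping also explains how weights can end up pinned at the bounds.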


In [20]:
from sklearn.metrics import accuracy_score

startTime = datetime.now()
# Predict labels for train set and assess accuracy
y_train_pred = nn_model1.predict(X_train_scaled)
y_train_accuracy = accuracy_score(y_train_hot, y_train_pred)
print(y_train_accuracy)
print()
print(datetime.now() - startTime)

0.7532844295246384

0:00:00.014957


In [20]:
print(weights)

[-5.         -5.         -0.41819052 -5.         -2.          5.
 -0.74882938  5.         -0.89706559 -0.11838031  5.          5.
  5.         -2.          5.          5.         -2.          5.
 -0.48149511 -5.         -5.         -5.         -5.         -5.
  0.08929804  5.         -0.38727294  5.         -0.22405748  0.8727673
  0.95199084  0.34476735  0.80566822  0.69150174]
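
The 34 values printed above are consistent with a fully connected network that has 14 input features plus one bias input, one hidden layer of 2 nodes, and 2 one-hot output nodes. The helper below is a hypothetical sketch of that arithmetic; treating the bias as one extra input column is an assumption about how the library lays out its weights, not something the notebook states.

```python
def count_weights(n_features, hidden_nodes, n_outputs, bias=True):
    """Number of weights in a fully connected network, bias as an extra input."""
    layer_sizes = [n_features + (1 if bias else 0), *hidden_nodes, n_outputs]
    # Each pair of consecutive layers contributes (size_in * size_out) weights
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

n_weights = count_weights(n_features=14, hidden_nodes=[2], n_outputs=2)
print(n_weights)  # (14 + 1) * 2 + 2 * 2 = 34
```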


## SA (simulated annealing)

In [46]:
startTime = datetime.now()

np.random.seed(3)
nn_model1 = mlrose.NeuralNetwork(hidden_nodes = [2], activation = 'relu', algorithm = 'simulated_annealing',
                                 max_iters = 5000, bias = True, is_classifier = True, learning_rate = 7, 
                                 early_stopping = True, clip_max = 5, max_attempts =100)


nn_model1.fit(X_train_scaled, y_train_hot)
weights= nn_model1.fitted_weights

print()
print(datetime.now() - startTime)


0:01:24.815030


In [22]:
from sklearn.metrics import accuracy_score
# Predict labels for train set and assess accuracy
y_train_pred = nn_model1.predict(X_train_scaled)
y_train_accuracy = accuracy_score(y_train_hot, y_train_pred)
print(y_train_accuracy)


0.7532844295246384
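
This training accuracy matches the RHC figure to every digit. One possible explanation, which the notebook does not confirm, is that both models predict the same single class for every row. A quick check is to compare the score with the majority-class rate; the helper `majority_baseline` and the toy labels below are assumptions for illustration.

```python
import numpy as np

def majority_baseline(y):
    """Accuracy obtained by always predicting the most frequent class."""
    y = np.asarray(y).ravel().astype(int)
    counts = np.bincount(y)
    return counts.max() / counts.sum()

# Toy labels: three negatives, one positive
toy_y = np.array([0, 0, 0, 1])
always_zero = np.zeros_like(toy_y)

baseline = majority_baseline(toy_y)
accuracy_of_constant = float(np.mean(always_zero == toy_y))
print(baseline, accuracy_of_constant)
```

In the notebook, `majority_baseline(y_train)` would give the figure to compare against; a model is only informative if it beats that number.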


In [23]:
print(weights)

[-2. -5.  5. -5.  5.  5.  2.  2. -2. -5.  2.  5. -5. -5. -5.  5.  5.  2.
 -5. -5.  5. -5.  2. -5.  2. -5.  2.  5.  5.  5.  5.  5.  5.  2.]
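
As a rough illustration of the `simulated_annealing` option, the sketch below shows the core acceptance rule: a worse candidate is still accepted with probability exp(-delta / temperature), and the temperature decays over time. This is not mlrose's code; `simulated_annealing`, `t0`, `decay` and the toy loss are hypothetical names and values.

```python
import math
import random

def simulated_annealing(loss, start, step=0.5, t0=1.0, decay=0.99,
                        max_iters=2000, seed=3):
    """Minimise `loss`, sometimes accepting worse moves while the temperature is high."""
    rng = random.Random(seed)
    current = list(start)
    current_loss = loss(current)
    best, best_loss = list(current), current_loss
    temp = t0
    for _ in range(max_iters):
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] += rng.choice([-step, step])
        delta = loss(candidate) - current_loss
        # Always accept improvements; accept worse moves with prob exp(-delta / temp)
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_loss = candidate, current_loss + delta
            if current_loss < best_loss:
                best, best_loss = list(current), current_loss
        temp = max(temp * decay, 1e-6)  # geometric cooling with a small floor
    return best, best_loss

target = [1.0, -2.0, 0.5]
toy_loss = lambda w: sum((a - b) ** 2 for a, b in zip(w, target))

start = [0.0, 0.0, 0.0]
best_w, best_loss = simulated_annealing(toy_loss, start)
print(best_w, best_loss)
```

With a large step relative to the weight bounds, accepted moves can jump between coarse values, which may be related to the very round weights printed above.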


## GA (genetic algorithm)

In [50]:
startTime = datetime.now()

np.random.seed(3)
nn_model1 = mlrose.NeuralNetwork(hidden_nodes = [2], activation = 'relu', algorithm = 'genetic_alg',
                                 max_iters = 4000, bias = True, is_classifier = True, learning_rate = 7, 
                                 early_stopping = True, clip_max = 5, max_attempts =100)


nn_model1.fit(X_train_scaled, y_train_hot)
weights= nn_model1.fitted_weights

print()
print(datetime.now() - startTime)


0:05:19.035714


In [25]:
from sklearn.metrics import accuracy_score
# Predict labels for train set and assess accuracy
y_train_pred = nn_model1.predict(X_train_scaled)
y_train_accuracy = accuracy_score(y_train_hot, y_train_pred)
print(y_train_accuracy)

0.7404782626714742


In [26]:
print(weights)

[-1.76053541 -1.25957587 -1.54809096  1.03391236  3.34929261 -3.08721269
  1.90532025 -0.49618676  1.42629456  1.23823247  4.36253606  3.617811
  0.64935842  3.7643912  -2.06526822  2.27685176 -2.70330867  0.13043817
 -1.54829239 -2.91577968  4.45015189  4.80781478  3.92216963 -2.00712654
  1.98212143 -3.29980498  1.46228181  1.3713631  -1.19102024  3.00731644
  3.30093334  3.96057883  0.66358853  0.27223606]
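
For comparison with the two searches above, here is a minimal sketch of a genetic algorithm over weight vectors: a population is evaluated, fitter members are more likely to be chosen as parents, children are formed by one-point crossover, and a few weights are randomly mutated. The function `genetic_minimize`, its parameters, and the toy loss are assumptions for illustration and do not reproduce mlrose's `genetic_alg`.

```python
import numpy as np

def genetic_minimize(loss, n_weights, pop_size=20, mutation_prob=0.1,
                     clip_max=5.0, generations=50, seed=3):
    """Evolve a population of weight vectors; requires n_weights >= 2."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-clip_max, clip_max, (pop_size, n_weights))
    best, best_loss = None, np.inf
    for _ in range(generations):
        losses = np.array([loss(ind) for ind in pop])
        i_best = int(np.argmin(losses))
        if losses[i_best] < best_loss:  # remember the best member seen so far
            best, best_loss = pop[i_best].copy(), float(losses[i_best])
        # Selection: lower loss gives a higher chance of becoming a parent
        fitness = losses.max() - losses + 1e-12
        probs = fitness / fitness.sum()
        parents_a = pop[rng.choice(pop_size, pop_size, p=probs)]
        parents_b = pop[rng.choice(pop_size, pop_size, p=probs)]
        # One-point crossover: head from parent A, tail from parent B
        cut = rng.integers(1, n_weights, size=pop_size)
        head = np.arange(n_weights) < cut[:, None]
        children = np.where(head, parents_a, parents_b)
        # Mutation: replace a few weights with fresh random values
        mutate = rng.random(children.shape) < mutation_prob
        children[mutate] = rng.uniform(-clip_max, clip_max, int(mutate.sum()))
        pop = children
    return best, best_loss

target = np.array([1.0, -2.0, 0.5])
toy_loss = lambda w: float(np.sum((w - target) ** 2))

ga_w, ga_loss = genetic_minimize(toy_loss, n_weights=3)
print(ga_w, ga_loss)
```

Each generation evaluates the loss once per population member, which is one reason a genetic search can take much longer per iteration than the single-candidate searches.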
