# Census Experiments 

We download the Adult Census Income dataset from Kaggle:
https://www.kaggle.com/datasets/uciml/adult-census-income?resource=download


This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year.

## Preprocessing 



In [17]:
import numpy as np 
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, KBinsDiscretizer
from sklearn import svm,model_selection,preprocessing, linear_model
from sklearn.decomposition import PCA
from sklearn.kernel_approximation import RBFSampler

import timeit

df=pd.read_csv('../datasets/adult.csv')

In [2]:
# Drop duplicate rows, keeping the first occurrence
df = df.drop_duplicates(keep='first')
df.info()
# Count columns per dtype (pd.value_counts is deprecated in recent pandas)
print(df.dtypes.value_counts())
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32537 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32537 non-null  int64 
 1   workclass       32537 non-null  object
 2   fnlwgt          32537 non-null  int64 
 3   education       32537 non-null  object
 4   education.num   32537 non-null  int64 
 5   marital.status  32537 non-null  object
 6   occupation      32537 non-null  object
 7   relationship    32537 non-null  object
 8   race            32537 non-null  object
 9   sex             32537 non-null  object
 10  capital.gain    32537 non-null  int64 
 11  capital.loss    32537 non-null  int64 
 12  hours.per.week  32537 non-null  int64 
 13  native.country  32537 non-null  object
 14  income          32537 non-null  object
dtypes: int64(6), object(9)
memory usage: 4.0+ MB
object    9
int64     6
dtype: int64


|   | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
| 1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
| 2 | 66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
| 3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |


Values recorded as "?" are treated as missing. For simplicity, we drop every row that contains a missing value instead of imputing it, assuming that the missing entries do not bias the data.



In [3]:
# Mark every "?" entry as a missing value
df = df.replace('?', np.nan)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32537 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32537 non-null  int64 
 1   workclass       30701 non-null  object
 2   fnlwgt          32537 non-null  int64 
 3   education       32537 non-null  object
 4   education.num   32537 non-null  int64 
 5   marital.status  32537 non-null  object
 6   occupation      30694 non-null  object
 7   relationship    32537 non-null  object
 8   race            32537 non-null  object
 9   sex             32537 non-null  object
 10  capital.gain    32537 non-null  int64 
 11  capital.loss    32537 non-null  int64 
 12  hours.per.week  32537 non-null  int64 
 13  native.country  31955 non-null  object
 14  income          32537 non-null  object
dtypes: int64(6), object(9)
memory usage: 4.0+ MB


In [4]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30139 entries, 1 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             30139 non-null  int64 
 1   workclass       30139 non-null  object
 2   fnlwgt          30139 non-null  int64 
 3   education       30139 non-null  object
 4   education.num   30139 non-null  int64 
 5   marital.status  30139 non-null  object
 6   occupation      30139 non-null  object
 7   relationship    30139 non-null  object
 8   race            30139 non-null  object
 9   sex             30139 non-null  object
 10  capital.gain    30139 non-null  int64 
 11  capital.loss    30139 non-null  int64 
 12  hours.per.week  30139 non-null  int64 
 13  native.country  30139 non-null  object
 14  income          30139 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
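
Dropping rows is only harmless if the missing entries are spread evenly across the income classes. The sketch below shows one way to check that assumption by comparing the class balance before and after `dropna()`. The frame `toy` and its values are made up for illustration and are not taken from the census file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the census frame; the values are illustrative only.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "Private", "State-gov", "?", "Private"],
    "occupation": ["Sales", "?", "Tech-support", "Adm-clerical", "?", "Sales"],
    "income": ["<=50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K"],
})

# Mark "?" entries as missing, as in the cells above.
toy = toy.replace("?", np.nan)

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Share of each income class before and after dropping incomplete rows.
balance_before = toy["income"].value_counts(normalize=True)
balance_after = toy.dropna()["income"].value_counts(normalize=True)

print(missing_per_column)
print(pd.DataFrame({"before": balance_before, "after": balance_after}))
```

A large shift between the two columns would suggest that the dropped rows are not a random subset, in which case imputation might be worth considering.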


### Categorical data and scaler 

In [5]:
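# Separate the features from the target
x = df.drop('income', axis=1)
y = df['income']

# Note: the encoder is applied to every column, so numeric columns such as
# age and fnlwgt also get one indicator per distinct value
encoder = OneHotEncoder()
encoder.fit(x)

one_hot_encoded_data = encoder.transform(x).toarray()

# Hold out 40% of the rows for testing
x_train, x_test, y_train, y_test = model_selection.train_test_split(one_hot_encoded_data, y, test_size=0.4, random_state=10)

# Standardize using statistics computed on the training split only
scaler = preprocessing.StandardScaler()

x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

print('Final x_train shape ', x_train.shape)
print('Final x_test shape ', x_test.shape)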

Final x_train shape  (18083, 20751)
Final x_test shape  (12056, 20751)
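
The 20751 columns arise because `OneHotEncoder` is fitted on all features, numeric ones included. As a possible alternative, the sketch below one-hot encodes only the non-numeric columns and standardizes the numeric ones with a `ColumnTransformer`. This is not what the notebook does, and the frame `toy_x` is a made-up sample used only to keep the example self-contained.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up sample with two numeric and two categorical columns.
toy_x = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "hours.per.week": [40, 13, 40, 40],
    "workclass": ["State-gov", "Private", "Private", "Private"],
    "sex": ["Male", "Male", "Female", "Male"],
})

# Split the column names by dtype.
numeric_cols = toy_x.select_dtypes(include="number").columns.tolist()
categorical_cols = toy_x.select_dtypes(exclude="number").columns.tolist()

# Scale numeric columns; one-hot encode categorical columns only.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

encoded = preprocess.fit_transform(toy_x)
print(encoded.shape)  # 2 numeric + 2 workclass + 2 sex indicators
```

In a real pipeline the transformer would be fitted on the training split and then applied to the test split, so that no test-set information reaches the encoder.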


To match the number of dimensions used in the reference experiment (119), we reduce the dimensionality with PCA.

In [6]:
n_components = 119
pca = PCA(n_components = n_components)
x_train_dims = pca.fit_transform(x_train)
x_test_dims = pca.transform(x_test)

print('Final x_train shape ', x_train_dims.shape)
print('Final x_test shape ', x_test_dims.shape)

Final x_train shape  (18083, 119)
Final x_test shape  (12056, 119)
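
When the number of components is fixed in advance, it helps to know how much variance the projection keeps. The sketch below reads the cumulative `explained_variance_ratio_` from a fitted `PCA`. It runs on synthetic data (the matrix `X` and its five latent factors are assumptions for illustration), so the printed number says nothing about the census data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 50 features driven by 5 latent factors plus noise.
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

pca = PCA(n_components=10).fit(X)

# Fraction of the total variance retained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(f"Variance retained by 10 components: {cumulative[-1]:.3f}")
```

Applied to the notebook's own `pca` object, the same `np.cumsum` call would report how much of the variance of `x_train` survives the reduction to 119 dimensions.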


## SVM fit

In [7]:
# Training
clf = svm.SVC(gamma='scale')  # RBF-kernel SVM classifier
init_time = timeit.default_timer()
clf.fit(x_train_dims, y_train)
end_time = timeit.default_timer()
spent_time = end_time - init_time
print(f'Time spent in SVC: {spent_time:.4f}s')
# Test: score() returns the accuracy as a fraction in [0, 1]
score = clf.score(x_test_dims, y_test)
print(f'Accuracy in test {score:.4f}')

Time spent in SVC: 10.2496s
Accuracy in test 0.8546


## Random Fourier features 


In [15]:
D = 500            # number of random Fourier features
best_alpha = 0.5   # ridge regularization strength
gamma = 0.5        # RBF kernel parameter

rbf_feature = RBFSampler(gamma=gamma, n_components=D, random_state=1)
clf = linear_model.RidgeClassifier(alpha=best_alpha)
init_time = timeit.default_timer()
X_features = rbf_feature.fit_transform(x_train_dims)
clf.fit(X_features, y_train)
end_time = timeit.default_timer()
spent_time = end_time - init_time

print(f'Time spent in RFF D={D} gamma = {gamma} : {spent_time:.4f} s')
# Reuse the feature map fitted on the training data; do not refit on the test set
X_features_test = rbf_feature.transform(x_test_dims)

# score() returns the accuracy as a fraction in [0, 1]
score = clf.score(X_features_test, y_test)
print(f'Accuracy RFF D={D} gamma = {gamma}: {score:.4f}')

Time spent in RFF D=500 gamma = 0.5 : 0.4650 s
Accuracy RFF D=500 gamma = 0.5: 0.7527
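
The SVC cell uses `gamma='scale'` while the RFF cell uses `gamma=0.5`, so the two models do not necessarily approximate the same kernel, which may account for part of the accuracy gap. The sketch below computes the numeric value that `gamma='scale'` stands for, `1 / (n_features * X.var())`, and checks how closely the inner products of `RBFSampler` features match the exact RBF kernel as `D` grows. The matrix `X` is synthetic and the values of `D` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # synthetic stand-in for x_train_dims

# Numeric value that SVC(gamma='scale') would use on this matrix.
gamma_scale = 1.0 / (X.shape[1] * X.var())

# Exact RBF kernel matrix for comparison.
exact = rbf_kernel(X, gamma=gamma_scale)

errors = {}
for D in (50, 500, 5000):
    sampler = RBFSampler(gamma=gamma_scale, n_components=D, random_state=1)
    Z = sampler.fit_transform(X)
    # Z @ Z.T is the random-feature approximation of the kernel matrix.
    errors[D] = np.abs(Z @ Z.T - exact).mean()
    print(f"D={D}: mean abs error {errors[D]:.4f}")
```

Passing the same `gamma_scale` value, computed on `x_train_dims`, to `RBFSampler` would make the RFF model target the kernel that the SVC actually uses.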


In [20]:
P = 30  # number of bins per feature
# Replace each PCA component with the index of its uniform-width bin
est = KBinsDiscretizer(n_bins=P, encode='ordinal',
                                strategy='uniform')

clf = linear_model.RidgeClassifier(alpha=best_alpha)
init_time = timeit.default_timer()
X_features = est.fit_transform(x_train_dims)
clf.fit(X_features, y_train)
end_time = timeit.default_timer()
spent_time = end_time - init_time

# gamma is carried over from the previous cell; KBinsDiscretizer does not use it
print(f'Time spent in RBF + LS P={P} gamma = {gamma}  alpha = {best_alpha:.3f}: {spent_time:.4f} s')
X_features_test = est.transform(x_test_dims)

# score() returns the accuracy as a fraction in [0, 1]
score = clf.score(X_features_test, y_test)
print(f'Score RBF + LS P={P} gamma = {gamma} R: {score:.4f}\n')

Time spent in RBF + LS P=30 gamma = 0.5  alpha = 0.500: 0.2892 s
Score RBF + LS P=30 gamma = 0.5 R: 0.8397
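
With `encode='ordinal'`, each feature is replaced by a single bin index, so the ridge classifier still sees one roughly linear input per feature. With a one-hot encoding it learns a separate weight per bin instead. The sketch below contrasts the two encodings on a synthetic one-dimensional problem whose label is not monotone in the feature. The data and the rule `|x| < 1` are assumptions chosen to make the difference visible, and the result does not predict how either encoding would behave on the census data.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)

# Synthetic problem: the positive class sits in the middle of the range.
x = rng.uniform(-3, 3, size=(600, 1))
y = (np.abs(x[:, 0]) < 1).astype(int)

scores = {}
for encode in ("ordinal", "onehot-dense"):
    est = KBinsDiscretizer(n_bins=30, encode=encode, strategy="uniform")
    features = est.fit_transform(x)
    clf = RidgeClassifier(alpha=0.5).fit(features, y)
    # Training accuracy is enough here to show what each encoding can represent.
    scores[encode] = clf.score(features, y)
    print(f"{encode}: training accuracy {scores[encode]:.3f}")
```

The one-hot variant multiplies the number of columns by the number of bins, so it trades memory and fitting time for flexibility.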

