https://archive.ics.uci.edu/ml/datasets/Adult


There are several ways to convert categorical variables to numeric ones so they can be used in a model.

* Label encoding: each distinct category in the column is mapped to a numeric label. In sklearn, "LabelEncoder" assigns increasing integer values to the categories in alphabetical order. If the categories have no natural hierarchy, "LabelEncoder" still imposes one, namely the alphabetical order. This can be a problem if the model "detects" an ordering that does not actually exist.

* One Hot Encoding: dummy variables are created for each category of the categorical variable(s). This can be done in sklearn with "OneHotEncoder", or in pandas with "get_dummies".
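
As a small illustration of the two approaches, the sketch below encodes a made-up `color` column. The toy data and the variable names are assumptions for this example and are not part of the Adult dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column invented for illustration; not taken from the Adult dataset.
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: the classes are sorted, so blue -> 0, green -> 1, red -> 2.
encoder = LabelEncoder()
toy_labels = encoder.fit_transform(toy["color"])
print(encoder.classes_.tolist())  # ['blue', 'green', 'red']
print(toy_labels.tolist())        # [2, 1, 0, 1]

# One-hot encoding: one indicator column per category, with no implied order.
toy_dummies = pd.get_dummies(toy, columns=["color"])
print(list(toy_dummies.columns))  # ['color_blue', 'color_green', 'color_red']
```

Note that the label-encoded values suggest blue < green < red, which is exactly the kind of artificial ordering described above.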

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split


In [2]:
df = pd.read_csv('adult_na.csv',na_values='?')

In [3]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25.0,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38.0,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28.0,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44.0,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18.0,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K


In [4]:
df = df.dropna()

In [5]:
target = 'income'

categ_cols = [cname for cname in df.columns if df[cname].dtype == "object"]
numeric_cols = [cname for cname in df.columns if df[cname].dtype in ['int64', 'float64', 'uint8','int8']]
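
An alternative way to build these lists is `DataFrame.select_dtypes`, which selects by dtype family instead of comparing against individual dtype names. The sketch below uses a tiny stand-in frame (`demo` and the `demo_*` names are assumptions for illustration); it treats every non-numeric column as categorical, which matches the list comprehensions above only when all non-numeric columns are strings.

```python
import pandas as pd

# Tiny stand-in frame; the notebook itself works on the Adult DataFrame `df`.
demo = pd.DataFrame({
    "age": [25.0, 38.0],
    "workclass": ["Private", "Local-gov"],
    "hours-per-week": [40, 50],
    "income": ["<=50K", ">50K"],
})

# Numeric columns (any integer or float width) versus everything else.
demo_numeric_cols = demo.select_dtypes(include="number").columns.tolist()
demo_categ_cols = demo.select_dtypes(exclude="number").columns.tolist()

# The target is also a string column, so drop it to get the feature columns only.
demo_feature_categ_cols = [c for c in demo_categ_cols if c != "income"]

print(demo_numeric_cols)        # ['age', 'hours-per-week']
print(demo_feature_categ_cols)  # ['workclass']
```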

In [8]:
X = df.drop(target,axis=1)
Y = df[target]

In [12]:
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=.30,random_state=0)

In [13]:
X_train.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
22528,18.0,Private,401051,10th,6,Never-married,Sales,Own-child,White,Female,0,0,35,United-States
34681,27.0,Self-emp-inc,376936,HS-grad,9,Never-married,Sales,Not-in-family,White,Male,0,0,50,United-States
13851,38.0,Self-emp-inc,66687,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,50,Portugal
27552,70.0,Private,235781,Some-college,10,Divorced,Farming-fishing,Not-in-family,Asian-Pac-Islander,Male,0,0,40,Vietnam
28853,56.0,Private,244605,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States


In [14]:
X_train['workclass'].value_counts()

Private             23171
Self-emp-not-inc     2633
Local-gov            2163
State-gov            1370
Self-emp-inc         1168
Federal-gov           964
Without-pay            17
Name: workclass, dtype: int64

In [15]:
X_train['education'].value_counts()

HS-grad         10305
Some-college     6863
Bachelors        5333
Masters          1711
Assoc-voc        1367
11th             1126
Assoc-acdm       1058
10th              860
7th-8th           573
Prof-school       535
9th               467
12th              394
Doctorate         389
5th-6th           303
1st-4th           148
Preschool          54
Name: education, dtype: int64

In [16]:
X_train['occupation'].value_counts()

Craft-repair         4183
Prof-specialty       4171
Exec-managerial      4148
Adm-clerical         3829
Sales                3803
Other-service        3339
Machine-op-inspct    2050
Transport-moving     1627
Handlers-cleaners    1441
Farming-fishing      1029
Tech-support          984
Protective-serv       709
Priv-house-serv       163
Armed-Forces           10
Name: occupation, dtype: int64

In [17]:
X_train['marital-status'].value_counts()

Married-civ-spouse       14806
Never-married            10088
Divorced                  4330
Separated                  993
Widowed                    864
Married-spouse-absent      380
Married-AF-spouse           25
Name: marital-status, dtype: int64

In [19]:
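from sklearn.preprocessing import LabelEncoder

# Work on a copy so the column assignments below do not trigger SettingWithCopyWarning.
X_train = X_train.copy()
label_encoder = LabelEncoder()
# The same encoder is refit on every column, so only the last column's mapping is kept.
for colname in categ_cols:
  if colname == target:  # categ_cols was built from df, but the target is not in X_train
    continue
  X_train[colname] = label_encoder.fit_transform(X_train[colname])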


In [20]:
X_train.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
22528,18.0,2,401051,0,6,4,11,3,4,0,0,0,35,38
34681,27.0,3,376936,11,9,4,11,1,4,1,0,0,50,38
13851,38.0,3,66687,11,9,2,3,0,4,1,5178,0,50,31
27552,70.0,2,235781,15,10,0,4,1,1,1,0,0,40,39
28853,56.0,2,244605,11,9,2,2,0,4,1,0,0,40,38
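
The loop above encodes only `X_train` and refits a single encoder per column, so the fitted mappings cannot be reused on `X_test`. One possible way to handle this, shown here as a sketch and not as the notebook's method, is sklearn's `OrdinalEncoder`, which encodes several feature columns at once and remembers the categories of each (the sklearn documentation describes `LabelEncoder` as intended for the target `y`). The toy frames and the choice of `-1` for unseen categories are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy train/test frames invented for illustration.
train = pd.DataFrame({"workclass": ["Private", "Local-gov", "Private"],
                      "education": ["HS-grad", "Bachelors", "10th"]})
test = pd.DataFrame({"workclass": ["State-gov", "Private"],
                     "education": ["Bachelors", "HS-grad"]})

# Fit on the training data only; categories not seen during fit are mapped to -1.
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_encoded = ordinal.fit_transform(train)
test_encoded = ordinal.transform(test)

print(train_encoded.tolist())  # [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]]
print(test_encoded.tolist())   # [[-1.0, 1.0], [1.0, 2.0]]
```

As with `LabelEncoder`, the integer codes follow the sorted category names, so the caveat about artificial ordering still applies.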


## One Hot Encoding

"Dummy variables" are created, one per category (not advisable when a variable has many categories).
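
The cells below use pandas `get_dummies`. For comparison, here is a minimal sketch of the sklearn route with `OneHotEncoder`, fitted on training data and then applied to test data. The toy frames are assumptions for illustration, and `handle_unknown="ignore"` is one possible policy for categories that appear only in the test set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames invented for illustration; "State-gov" appears only in the test frame.
train = pd.DataFrame({"workclass": ["Private", "Local-gov", "Private"]})
test = pd.DataFrame({"workclass": ["Private", "State-gov"]})

# Unseen categories are encoded as an all-zero row instead of raising an error.
onehot = OneHotEncoder(handle_unknown="ignore")
train_onehot = onehot.fit_transform(train).toarray()  # sparse matrix -> dense array
test_onehot = onehot.transform(test).toarray()

print(onehot.categories_[0].tolist())  # ['Local-gov', 'Private']
print(train_onehot.tolist())           # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(test_onehot.tolist())            # [[0.0, 1.0], [0.0, 0.0]]
```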

In [21]:
# Re-split with the same random_state to recover the original, unencoded X_train.
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=.30,random_state=0)

In [22]:
X_train.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
22528,18.0,Private,401051,10th,6,Never-married,Sales,Own-child,White,Female,0,0,35,United-States
34681,27.0,Self-emp-inc,376936,HS-grad,9,Never-married,Sales,Not-in-family,White,Male,0,0,50,United-States
13851,38.0,Self-emp-inc,66687,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,50,Portugal
27552,70.0,Private,235781,Some-college,10,Divorced,Farming-fishing,Not-in-family,Asian-Pac-Islander,Male,0,0,40,Vietnam
28853,56.0,Private,244605,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States


In [23]:
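X_train = X_train.copy()
label_encoder = LabelEncoder()
# Label-encode every categorical column except 'gender', which is one-hot encoded below.
for colname in categ_cols:
  if colname in (target, 'gender'):
    continue
  X_train[colname] = label_encoder.fit_transform(X_train[colname])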

In [24]:
X_train.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
22528,18.0,2,401051,0,6,4,11,3,4,Female,0,0,35,38
34681,27.0,3,376936,11,9,4,11,1,4,Male,0,0,50,38
13851,38.0,3,66687,11,9,2,3,0,4,Male,5178,0,50,31
27552,70.0,2,235781,15,10,0,4,1,1,Male,0,0,40,39
28853,56.0,2,244605,11,9,2,2,0,4,Male,0,0,40,38


In [25]:
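# 'gender' is the only remaining non-numeric column, so only it is expanded.
X_train = pd.get_dummies(X_train)
# drop_first=True would keep a single indicator column (gender_Male) instead of two.
#X_train = pd.get_dummies(X_train,drop_first=True)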


In [26]:
X_train.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,capital-gain,capital-loss,hours-per-week,native-country,gender_Female,gender_Male
22528,18.0,2,401051,0,6,4,11,3,4,0,0,35,38,1,0
34681,27.0,3,376936,11,9,4,11,1,4,0,0,50,38,0,1
13851,38.0,3,66687,11,9,2,3,0,4,5178,0,50,31,0,1
27552,70.0,2,235781,15,10,0,4,1,1,0,0,40,39,0,1
28853,56.0,2,244605,11,9,2,2,0,4,0,0,40,38,0,1
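
Because `get_dummies` creates columns only for the categories present in the frame it receives, calling it separately on train and test data can produce different column sets. A minimal sketch of one way to align them is shown below; the toy frames are assumptions for illustration. `dtype=int` is passed so the indicators are 0/1 as in the output above, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy frames invented for illustration; the test frame has no "Female" row.
train = pd.DataFrame({"age": [18.0, 27.0, 38.0],
                      "gender": ["Female", "Male", "Male"]})
test = pd.DataFrame({"age": [70.0, 56.0],
                     "gender": ["Male", "Male"]})

train_d = pd.get_dummies(train, dtype=int)
test_d = pd.get_dummies(test, dtype=int)  # has no gender_Female column

# Reindex onto the training columns; indicators missing from the test frame become 0.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_d.columns))               # ['age', 'gender_Female', 'gender_Male']
print(test_d["gender_Female"].tolist())   # [0, 0]
```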
