### Nominal Data

Often datasets involve "nominal" or "categorical" features.

These features do not have a natural numerical order. 

For example, in an economic dataset a person might be "Employed", "Unemployed", or "Not in workforce".

Often data is categorical even when it is numerical. 

For example, in an astronomy dataset a star might have a feature called "type" that takes values in the set $\{1,2,3,4\}$.  But these types do not really have an ordering. Maybe the type just refers to the "metallicity" or the chemical composition of the star. 

Type 1 and type 2 stars are not necessarily more similar than type 1 and type 4 stars. 

The labeling is arbitrary.

---

You have probably noticed that all of the learning algorithms we have discussed assume that all features are numerical. 

Not only that, but many assume that the numerical columns have a meaningful ordering.  

This assumption appears clearly in decision trees, for example, where we use inequalities on feature values to refine the data into different tree branches.

The effectiveness of the linear methods we discussed, as well as KNN, rests on an assumption of *continuity*. 

This means that two data points whose feature values are "close" have an increased likelihood of belonging to the same class. 

This learning bias (assumption of continuity) will not hold for data in which nominal data is transformed into arbitrary numerical codes.
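
To make the problem concrete, here is a small sketch based on the star "type" example above. The type names and both labelings are made up for illustration; the point is only that which types look "close" depends entirely on the arbitrary choice of codes.

```python
# Two equally valid (and equally arbitrary) ways to number four star types.
# The type names "A".."D" and both labelings are hypothetical.
labeling_1 = {"A": 1, "B": 2, "C": 3, "D": 4}
labeling_2 = {"A": 1, "B": 4, "C": 3, "D": 2}

def code_distance(labeling, s, t):
    """Distance between two categories as a distance-based learner would see it."""
    return abs(labeling[s] - labeling[t])

for name, labeling in [("labeling_1", labeling_1), ("labeling_2", labeling_2)]:
    d_ab = code_distance(labeling, "A", "B")
    d_ad = code_distance(labeling, "A", "D")
    print(f"{name}: dist(A,B)={d_ab}, dist(A,D)={d_ad}")
```

Under the first labeling A looks closer to B; under the second it looks closer to D. Nothing about the stars changed, only the codes.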

---

In the cells below we will look at the `adult` dataset from the UCI ML Repo.

This dataset is not as exciting as it sounds.  

It has many nominal features.  

First we will experiment with a simplified version of the data to see the basic methods of nominal encoding, and then we will do an industrial-strength conversion of the dataset to fully numeric features. 









In [30]:
!cat adult.names

| This data was extracted from the census bureau database found at
| http://www.census.gov/ftp/pub/DES/www/welcome.html
| Donor: Ronny Kohavi and Barry Becker,
|        Data Mining and Visualization
|        Silicon Graphics.
|        e-mail: ronnyk@sgi.com for questions.
| Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
| 48842 instances, mix of continuous and discrete    (train=32561, test=16281)
| 45222 if instances with unknown values are removed (train=30162, test=15060)
| Duplicate or conflicting instances : 6
| Class probabilities for adult.all file
| Probability for the label '>50K'  : 23.93% / 24.78% (without unknowns)
| Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
|
| Extraction was done by Barry Becker from the 1994 Census database.  A set of
|   reasonably clean records was extracted using the following conditions:
|   ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
|
| Prediction task is to determine whether a person makes over

In [31]:
import pandas as pd

df = pd.read_csv("adult.data",header=None)

columns=["age","wk_class","fnlwgt","education","ed_num","marital","occupation","relationship","race","sex","capital-gain","capital-loss","hours-per-week","country","income"]
df.columns = columns

df.head()

Unnamed: 0,age,wk_class,fnlwgt,education,ed_num,marital,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [32]:
df.shape

(32561, 15)

In [33]:
objcolumns = df.columns[df.dtypes==object]

for oc in objcolumns:
    print("\t\t\t{} is an object column.".format(oc))
    print("\t\tThe possible values in this column are:\n{}".format(df[oc].unique()))
    print(df[oc].value_counts())
        

			wk_class is an object column.
		The possible values in this column are:
[' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov'
 ' ?' ' Self-emp-inc' ' Without-pay' ' Never-worked']
 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: wk_class, dtype: int64
			education is an object column.
		The possible values in this column are:
[' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th' ' Some-college'
 ' Assoc-acdm' ' Assoc-voc' ' 7th-8th' ' Doctorate' ' Prof-school'
 ' 5th-6th' ' 10th' ' 1st-4th' ' Preschool' ' 12th']
 HS-grad         10501
 Some-college     7291
 Bachelors        5355
 Masters          1723
 Assoc-voc        1382
 11th             1175
 Assoc-acdm       1067
 10th              933
 7th-8th           646
 Prof-school       576
 9th               514
 12th     
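
Notice that every value printed above starts with a space (`' State-gov'`, `' Private'`), because the raw file separates fields with a comma followed by a space. The cells in this notebook keep those leading spaces, which is why the unknown marker has to be matched as `' ?'`. If you would rather remove them, one option is sketched below on a made-up two-row sample (the inline CSV text is an assumption standing in for `adult.data`).

```python
import io
import pandas as pd

# Tiny stand-in for adult.data: fields are separated by ", " as in the real file.
sample_text = "39, State-gov, Bachelors\n50, Private, HS-grad\n"

# Default parsing keeps the blank that follows each comma.
raw = pd.read_csv(io.StringIO(sample_text), header=None)

# skipinitialspace=True tells read_csv to drop the blank after each delimiter.
clean = pd.read_csv(io.StringIO(sample_text), header=None, skipinitialspace=True)

print(repr(raw.loc[0, 1]))    # value keeps its leading space
print(repr(clean.loc[0, 1]))  # leading space removed

# For a frame that is already loaded, .str.strip() does the same for one column.
stripped = raw[1].str.strip()
```

If you do strip the values, remember that later comparisons such as `' ?'` need to be adjusted to match.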

### The "education" and "ed_num" features

A little inspection shows that "education" is actually redundant.

The same information is represented in the "ed_num" column.

And the values in this column are numerical *and* meaningfully ordered (a higher `ed_num` corresponds to more schooling), so they fit the continuity assumption. 

So we will just drop the "education" feature.

In [34]:
       
df[["education","ed_num"]].sort_values("ed_num").drop_duplicates()

Unnamed: 0,education,ed_num
22940,Preschool,1
8211,1st-4th,2
5340,5th-6th,3
28914,7th-8th,4
23833,9th,5
3376,10th,6
9137,11th,7
4199,12th,8
12488,HS-grad,9
2293,Some-college,10
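
The visual check above can also be done programmatically. The sketch below uses a small hand-made frame (hypothetical rows, not the real data) and tests that each `education` label maps to exactly one `ed_num` and vice versa; the same two `groupby` lines should work on `df`.

```python
import pandas as pd

# Hypothetical miniature of the two columns.
toy = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "HS-grad", "11th", "Bachelors"],
    "ed_num":    [9,         13,          9,         7,      13],
})

# Number of distinct partners for each value, in both directions.
nums_per_label = toy.groupby("education")["ed_num"].nunique()
labels_per_num = toy.groupby("ed_num")["education"].nunique()

# The columns carry the same information only if every count is 1.
is_redundant = bool((nums_per_label == 1).all() and (labels_per_num == 1).all())
print(is_redundant)
```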


In [35]:
df=df.drop("education",axis=1)

### Simplify simplify simplify

To better show the techniques for manipulating nominal data, let's simplify the dataset (temporarily).

We will consider just ten rows, and restrict our attention to three nominal columns.



In [37]:
simple_df = df[["marital","occupation","relationship"]].sample(10)
simple_df

Unnamed: 0,marital,occupation,relationship
26346,Divorced,Prof-specialty,Unmarried
22865,Married-civ-spouse,Other-service,Husband
30434,Never-married,Exec-managerial,Own-child
23406,Married-civ-spouse,Craft-repair,Husband
8184,Married-civ-spouse,Protective-serv,Husband
19969,Never-married,Adm-clerical,Own-child
2910,Never-married,Handlers-cleaners,Own-child
30628,Married-civ-spouse,?,Husband
8438,Never-married,Adm-clerical,Not-in-family
18515,Never-married,Other-service,Own-child


### Many methods

As we often say, there are many ways to skin a cat.

Three common methods are:

* arbitrary numerical codes for categories
* one-hot encoding
* hash encoding

#### Arbitrary numerical encoding

One way to convert nominal to numerical data is to convert the string categories to arbitrary numerical codes.

This may work fine.

However, the new numerical columns will violate the continuity assumption of many learning algorithms.

We illustrate one way to do arbitrary numerical encoding using pandas.  

There are many others.


In [43]:
simple_df.dtypes

marital         object
occupation      object
relationship    object
dtype: object

In [53]:
simple_df_cat = simple_df.astype("category")
print(simple_df_cat["marital"])
print(simple_df_cat["marital"].cat.codes)

26346               Divorced
22865     Married-civ-spouse
30434          Never-married
23406     Married-civ-spouse
8184      Married-civ-spouse
19969          Never-married
2910           Never-married
30628     Married-civ-spouse
8438           Never-married
18515          Never-married
Name: marital, dtype: category
Categories (3, object): [' Divorced', ' Married-civ-spouse', ' Never-married']
26346    0
22865    1
30434    2
23406    1
8184     1
19969    2
2910     2
30628    1
8438     2
18515    2
dtype: int8


In [58]:
for col in simple_df.columns:
    simple_df[col] = simple_df[col].astype("category").cat.codes

simple_df

Unnamed: 0,marital,occupation,relationship
26346,0,6,3
22865,1,5,0
30434,2,3,2
23406,1,2,0
8184,1,7,0
19969,2,1,2
2910,2,4,2
30628,1,0,0
8438,2,1,1
18515,2,5,2
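
As one of the "many others" mentioned above, `pd.factorize` also produces integer codes. A small sketch on a made-up column follows. Note that it numbers categories in order of first appearance, whereas `.cat.codes` numbers them in sorted order, which is another reminder that the codes themselves carry no meaning.

```python
import pandas as pd

# Hypothetical column, chosen so the two orderings differ.
marital = pd.Series(["Never-married", "Divorced", "Never-married", "Widowed"])

# Sorted-category codes, as in the cells above.
cat_codes = marital.astype("category").cat.codes.tolist()

# First-appearance codes, plus the lookup table to map codes back to labels.
fact_codes, uniques = pd.factorize(marital)

print(cat_codes)            # codes under alphabetical ordering
print(fact_codes.tolist())  # codes under order of appearance
print(list(uniques))        # label for each factorize code
```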


### One-hot encoding

The idea of one-hot encoding is to replace a nominal feature with "radio buttons" that encode the different possible feature values.

This satisfies the continuity assumption but at the expense of adding new features.

We illustrate how this works below.



In [69]:
simple_df = df[["marital","occupation","relationship"]].sample(10)
simple_df

Unnamed: 0,marital,occupation,relationship
12868,Divorced,Machine-op-inspct,Unmarried
25981,Widowed,Other-service,Unmarried
1110,Married-civ-spouse,Sales,Husband
24687,Divorced,Adm-clerical,Unmarried
26266,Divorced,Exec-managerial,Not-in-family
2191,Never-married,Adm-clerical,Not-in-family
23270,Married-civ-spouse,Sales,Husband
29710,Divorced,?,Not-in-family
16961,Married-civ-spouse,Exec-managerial,Husband
4956,Widowed,Exec-managerial,Unmarried


In [70]:
pd.get_dummies(simple_df)


Unnamed: 0,marital_ Divorced,marital_ Married-civ-spouse,marital_ Never-married,marital_ Widowed,occupation_ ?,occupation_ Adm-clerical,occupation_ Exec-managerial,occupation_ Machine-op-inspct,occupation_ Other-service,occupation_ Sales,relationship_ Husband,relationship_ Not-in-family,relationship_ Unmarried
12868,1,0,0,0,0,0,0,1,0,0,0,0,1
25981,0,0,0,1,0,0,0,0,1,0,0,0,1
1110,0,1,0,0,0,0,0,0,0,1,1,0,0
24687,1,0,0,0,0,1,0,0,0,0,0,0,1
26266,1,0,0,0,0,0,1,0,0,0,0,1,0
2191,0,0,1,0,0,1,0,0,0,0,0,1,0
23270,0,1,0,0,0,0,0,0,0,1,1,0,0
29710,1,0,0,0,1,0,0,0,0,0,0,1,0
16961,0,1,0,0,0,0,1,0,0,0,1,0,0
4956,0,0,0,1,0,0,1,0,0,0,0,0,1
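
One way to see why one-hot encoding is friendlier to distance-based methods: every pair of distinct categories ends up the same distance apart, so no category is accidentally "closer" to another. A minimal numpy sketch with four hypothetical categories:

```python
import numpy as np

# One-hot vectors for four categories are the rows of the 4x4 identity matrix.
onehot = np.eye(4)

# Pairwise Euclidean distances between all rows (broadcast to a 4x4x4 array).
diffs = onehot[:, None, :] - onehot[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=-1))

# Zero on the diagonal, the same value everywhere else.
print(np.round(dists, 3))
```

Compare this with arbitrary codes 0, 1, 2, 3, where the distance between two categories ranges from 1 to 3 depending on the labels they happened to receive.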


### One-hot redundancy

There is some inherent redundancy in one-hot encoding. 

Consider the four "marital" columns above.  If you can see any three of them, you can always infer the value of the fourth.  

Thus one column is unnecessary, because its value is already determined by the others.

You can tell the one-hot encoder to drop the first column of each encoded feature to save on dimensions in the transformed data. 

In [71]:
pd.get_dummies(simple_df,drop_first=True)


Unnamed: 0,marital_ Married-civ-spouse,marital_ Never-married,marital_ Widowed,occupation_ Adm-clerical,occupation_ Exec-managerial,occupation_ Machine-op-inspct,occupation_ Other-service,occupation_ Sales,relationship_ Not-in-family,relationship_ Unmarried
12868,0,0,0,0,0,1,0,0,0,1
25981,0,0,1,0,0,0,1,0,0,1
1110,1,0,0,0,0,0,0,1,0,0
24687,0,0,0,1,0,0,0,0,0,1
26266,0,0,0,0,1,0,0,0,1,0
2191,0,1,0,1,0,0,0,0,1,0
23270,1,0,0,0,0,0,0,1,0,0
29710,0,0,0,0,0,0,0,0,1,0
16961,1,0,0,0,1,0,0,0,0,0
4956,0,0,1,0,1,0,0,0,0,1
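
To check the claim that the dropped column is recoverable, the sketch below encodes a small made-up column both ways and rebuilds the missing indicator as one minus the sum of the others. `dtype=int` is passed so the result is 0/1 regardless of pandas version (recent versions default to booleans).

```python
import pandas as pd

# Hypothetical sample of the "marital" feature.
marital = pd.Series(
    ["Divorced", "Widowed", "Married", "Divorced", "Never-married"],
    name="marital",
)

full = pd.get_dummies(marital, dtype=int)                      # all indicator columns
reduced = pd.get_dummies(marital, drop_first=True, dtype=int)  # first one dropped

# Exactly one indicator is 1 per row, so the dropped column is 1 - (sum of the rest).
dropped_name = full.columns[0]
rebuilt = 1 - reduced.sum(axis=1)

print(dropped_name)
print((rebuilt == full[dropped_name]).all())
```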


### Hash encoding

One-hot encoding has a drawback.  If a categorical feature has $n$ categories, then the encoding replaces it with $n$ binary features (or $n-1$ with `drop_first=True`).

This is okay for our census data.  But consider a hospital billing dataset with thousands of categorical procedure codes (one for each possible medical procedure).

In a case like that we may not want to pay the price of adding thousands of features to a dataset that is probably already fairly big (one row for each medical procedure ever performed at the hospital). 

The hash encoding method is a compromise between the benefits of one-hot encoding and the drawbacks of adding many features.

It works by adding a pre-determined number of binary features (by default 8).

Then, for each category, it hashes the category to exactly one of the binary columns.

In terms of radio buttons, it hashes a category to a radio button. 

There may be information loss due to collisions.

To use this method you will probably need to install a new library.

     pip install category_encoders


In [74]:
simple_df

Unnamed: 0,marital,occupation,relationship
12868,Divorced,Machine-op-inspct,Unmarried
25981,Widowed,Other-service,Unmarried
1110,Married-civ-spouse,Sales,Husband
24687,Divorced,Adm-clerical,Unmarried
26266,Divorced,Exec-managerial,Not-in-family
2191,Never-married,Adm-clerical,Not-in-family
23270,Married-civ-spouse,Sales,Husband
29710,Divorced,?,Not-in-family
16961,Married-civ-spouse,Exec-managerial,Husband
4956,Widowed,Exec-managerial,Unmarried


In [82]:
import category_encoders as ce
encoder=ce.HashingEncoder(cols=["marital"],hash_method="md5")
hash_res = encoder.fit_transform(simple_df)
hash_res

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,occupation,relationship
12868,0,0,0,0,1,0,0,0,Machine-op-inspct,Unmarried
25981,0,0,0,0,0,0,1,0,Other-service,Unmarried
1110,0,0,0,0,0,0,1,0,Sales,Husband
24687,0,0,0,0,1,0,0,0,Adm-clerical,Unmarried
26266,0,0,0,0,1,0,0,0,Exec-managerial,Not-in-family
2191,0,0,1,0,0,0,0,0,Adm-clerical,Not-in-family
23270,0,0,0,0,0,0,1,0,Sales,Husband
29710,0,0,0,0,1,0,0,0,?,Not-in-family
16961,0,0,0,0,0,0,1,0,Exec-managerial,Husband
4956,0,0,0,0,0,0,1,0,Exec-managerial,Unmarried
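
In the output above, `Widowed` and `Married-civ-spouse` both landed in `col_6`, which is exactly the kind of collision mentioned earlier. The sketch below illustrates the general idea in plain Python. It is not the `category_encoders` implementation, so the column it picks for a given category need not match the output above; the helper names `hash_bucket` and `hash_encode` are made up for illustration.

```python
import hashlib

def hash_bucket(category, n_components=8):
    """Map a category string to a column index in range(n_components)."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

def hash_encode(values, n_components=8):
    """Return one binary row per value, with a single 1 in the hashed column."""
    rows = []
    for value in values:
        row = [0] * n_components
        row[hash_bucket(value, n_components)] = 1
        rows.append(row)
    return rows

categories = ["Divorced", "Widowed", "Married-civ-spouse", "Never-married"]
for cat, row in zip(categories, hash_encode(categories)):
    print(f"{cat:20s} {row}")

# With fewer columns than categories, at least two categories must share a column.
buckets = {cat: hash_bucket(cat, n_components=2) for cat in categories}
print(buckets)
```

The number of columns stays fixed no matter how many categories appear, which is the whole appeal; the price is that colliding categories become indistinguishable to the model.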


### Others

There are many other methods for encoding nominal features.

This article provides a nice summary.

https://practicaldatascience.co.uk/machine-learning/how-to-use-category-encoders-to-transform-categorical-variables


### Homework

Your homework is to apply nominal encoding methods to the `adult` dataset.

You should try at least arbitrary numerical encoding and one-hot encoding. 

Also try encoding the `country` column using the hashing method with 8 binary columns.

Which method or combination of methods gives the best performance?

Note that you do not have to have the same policy for every column.  You can use hashing in the `country` column, but one-hot encoding in the `marital` column, for instance.

In `adult.names` you will see that `income` is the target variable.

In your experiments try using a RandomForestClassifier and the KNN classifier with different hyperparameters.

Answer the following questions in prose after you do your analysis:

* What is the highest achievable accuracy score on the test set (that you were able to find)? 

* Does encoding make a big difference in performance for these data? 

* Do you see a way that transforming the features relates to bias and variance?

* Which features are the most important, according to the RandomForest model?

* Was training time an issue? 
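
For the feature-importance question, a fitted `RandomForestClassifier` exposes a `feature_importances_` array with one entry per input column. The sketch below uses synthetic data (the feature names and the rule generating `y` are invented) purely to show how to rank the columns; the last few lines should carry over to your own fitted model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: only "signal" determines the label.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})
y = (X["signal"] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# Pair each importance with its column name and sort from most to least important.
importances = pd.Series(clf.feature_importances_, index=X.columns)
ranked = importances.sort_values(ascending=False)
print(ranked)
```

Keep in mind that after one-hot encoding, the importance of one original feature is spread across its indicator columns, so you may want to sum them per original feature before comparing.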

## Solution

In [67]:
import pandas as pd

df = pd.read_csv("adult.fulldata",header=None)

columns=["age","wk_class","fnlwgt","education","ed_num","marital","occupation","relationship","race","sex","capital-gain","capital-loss","hours-per-week","country","income"]
df.columns = columns

df.head()

Unnamed: 0,age,wk_class,fnlwgt,education,ed_num,marital,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [68]:
df=df.drop("education",axis=1)
df["occupation"] = df["occupation"].replace(' ?',"unknown")
df["wk_class"] = df["wk_class"].replace(' ?',"unknown")

In [69]:
df["income"].value_counts()

 <=50K     24720
 <=50K.    12435
 >50K       7841
 >50K.      3846
Name: income, dtype: int64

In [70]:
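# The labels occur with and without a trailing period (" <=50K" and " <=50K."),
# so group them by comparison sign: "<=50K" becomes 1 and ">50K" becomes -1.
income_labels = df["income"].unique()
positive_class = [l for l in income_labels if "<=" in l]
negative_class = [l for l in income_labels if ">" in l]
for pc in positive_class:
    df["income"] = df["income"].replace(pc,1)
for nc in negative_class:
    df["income"] = df["income"].replace(nc,-1)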

df["income"].value_counts()
    

 1    37155
-1    11687
Name: income, dtype: int64

In [71]:
objcolumns = df.columns[df.dtypes==object]
objcolumns

Index(['wk_class', 'marital', 'occupation', 'relationship', 'race', 'sex',
       'country'],
      dtype='object')

In [72]:
df_arb_num = df.copy()
for column in objcolumns:
    df_arb_num[column] = df_arb_num[column].astype("category")
    df_arb_num[column] = df_arb_num[column].cat.codes

df_arb_num.sample(6)

Unnamed: 0,age,wk_class,fnlwgt,ed_num,marital,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,country,income
15702,18,3,144711,9,4,11,1,4,0,0,1721,40,39,1
30401,37,1,44694,14,2,9,5,4,0,0,0,45,39,-1
29337,21,0,181096,10,4,12,3,2,1,0,0,20,39,1
34237,37,5,200352,13,2,2,0,4,1,0,0,50,39,-1
41112,21,5,83704,5,4,11,3,4,1,0,0,30,39,1
24576,52,5,129311,10,2,3,0,4,1,0,0,95,39,-1


In [73]:
df_all_onehot = df.copy()
df_all_onehot = pd.get_dummies(df_all_onehot)
df_all_onehot

Unnamed: 0,age,fnlwgt,ed_num,capital-gain,capital-loss,hours-per-week,income,wk_class_ Federal-gov,wk_class_ Local-gov,wk_class_ Never-worked,...,country_ Portugal,country_ Puerto-Rico,country_ Scotland,country_ South,country_ Taiwan,country_ Thailand,country_ Trinadad&Tobago,country_ United-States,country_ Vietnam,country_ Yugoslavia
0,39,77516,13,2174,0,40,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,40,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,215419,13,0,0,36,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
48838,64,321403,9,0,0,40,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
48839,38,374983,13,0,0,50,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
48840,44,83891,13,5455,0,40,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [86]:
df_some_onehot = df.copy()

for column in objcolumns:
    if column=="country":
        continue
    one_hot = pd.get_dummies(df_some_onehot[column])
    one_hot.columns = [f"{column}_{c}" for c in one_hot.columns]
    df_some_onehot = df_some_onehot.drop(column,axis=1)
    try:
        df_some_onehot = df_some_onehot.join(one_hot)
    except ValueError:
        # join raises ValueError if the new column names clash with existing ones
        print(column)

df_some_onehot

Unnamed: 0,age,fnlwgt,ed_num,capital-gain,capital-loss,hours-per-week,country,income,wk_class_ Federal-gov,wk_class_ Local-gov,...,relationship_ Own-child,relationship_ Unmarried,relationship_ Wife,race_ Amer-Indian-Eskimo,race_ Asian-Pac-Islander,race_ Black,race_ Other,race_ White,sex_ Female,sex_ Male
0,39,77516,13,2174,0,40,United-States,1,0,0,...,0,0,0,0,0,0,0,1,0,1
1,50,83311,13,0,0,13,United-States,1,0,0,...,0,0,0,0,0,0,0,1,0,1
2,38,215646,9,0,0,40,United-States,1,0,0,...,0,0,0,0,0,0,0,1,0,1
3,53,234721,7,0,0,40,United-States,1,0,0,...,0,0,0,0,0,1,0,0,0,1
4,28,338409,13,0,0,40,Cuba,1,0,0,...,0,0,1,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,215419,13,0,0,36,United-States,1,0,0,...,0,0,0,0,0,0,0,1,1,0
48838,64,321403,9,0,0,40,United-States,1,0,0,...,0,0,0,0,0,1,0,0,0,1
48839,38,374983,13,0,0,50,United-States,1,0,0,...,0,0,0,0,0,0,0,1,0,1
48840,44,83891,13,5455,0,40,United-States,1,0,0,...,1,0,0,0,1,0,0,0,0,1


In [87]:
import category_encoders as ce
encoder=ce.HashingEncoder(cols=["country"],hash_method="md5")
df_some_onehot = encoder.fit_transform(df_some_onehot)
# Rename the hashed columns col_0 ... col_7 to country_0 ... country_7.
df_some_onehot = df_some_onehot.rename(columns={f"col_{i}": f"country_{i}" for i in range(8)})

In [101]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
import time

clf_names = "SVC RandomForest LogReg".split()
classifiers = [SVC(),RandomForestClassifier(),LogisticRegression()]
types = "all_onehot arb_numerical some_onehot".split()
for clf_nom,clf in zip(clf_names,classifiers):
    for t,_df in zip(types,[df_all_onehot,df_arb_num,df_some_onehot]):
        y = _df["income"]
        others = [c for c in _df.columns if c != "income"]
        X = _df[others]
        X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)
        scl = StandardScaler()
        print(t)
        scl.fit(X_train)
        X_train = scl.transform(X_train)
        X_test = scl.transform(X_test)
        start = time.time()
        clf.fit(X_train,y_train)
        stop = time.time()
        print(f"Training time: {stop-start:5.2f}s")
        score = clf.score(X_test,y_test)
        print(t,clf_nom,score)



all_onehot
Training time: 40.85s
all_onehot SVC 0.8488248300712472
arb_numerical
Training time:  9.90s
arb_numerical SVC 0.846204242076816
some_onehot
Training time: 26.15s
some_onehot SVC 0.8506264843174187
all_onehot
Training time:  2.35s
all_onehot RandomForest 0.8554581934321513
arb_numerical
Training time:  2.20s
arb_numerical RandomForest 0.8562771271804112
some_onehot
Training time:  2.16s
some_onehot RandomForest 0.8536565391859798
all_onehot
Training time:  0.34s
all_onehot LogReg 0.8509540578167226
arb_numerical
Training time:  0.04s
arb_numerical LogReg 0.8230284170010647
some_onehot
Training time:  0.39s
some_onehot LogReg 0.8525100319384162


In [102]:
"end"

'end'