# Encoding Categorical Variables

* Ordinal encoding
* Label encoding
* One-hot encoding
* Binary encoding
* Hashing
* Target encoding
* Date-time encoding

In [62]:
# import lib
import pandas as pd

# read csv file
data = pd.read_csv("aug_test.csv")

# show variable types
print(data.dtypes)

enrollee_id                 int64
city                       object
city_development_index    float64
gender                     object
relevent_experience        object
enrolled_university        object
education_level            object
major_discipline           object
experience                 object
company_size               object
company_type               object
last_new_job               object
training_hours              int64
dtype: object


The output shows several features with dtype = object, which tells us those features could be text or a mix of text and numerical values.

The reason to spend time on encoding is that many machine learning models cannot handle text and only work with numbers. The dataset must be encoded into numbers before we begin to train, test, or evaluate a model.
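
As a quick way to list the columns that still need encoding, the sketch below selects every non-numeric column of a data frame. The toy data frame and its values are assumptions made up for illustration; they are not rows from aug_test.csv.

```python
import pandas as pd

# toy data frame; the values are invented for illustration
toy = pd.DataFrame({
    "enrollee_id": [1, 2, 3],
    "city": ["city_103", "city_21", "city_16"],
    "training_hours": [21, 98, 15],
})

# non-numeric columns are the ones that need encoding before modelling
columns_to_encode = toy.select_dtypes(exclude="number").columns.tolist()
print(columns_to_encode)
```

On the real dataset the same call would be applied to `data` instead of `toy`.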

-> Ordinal Encoding

In [63]:
print(data["education_level"].value_counts())

Graduate          1269
Masters            496
High School        222
Phd                 54
Primary School      36
Name: education_level, dtype: int64


This is a clear example of ordinal data: the education_level labels have a natural order, from “Primary School” at the bottom up to “Phd” at the top.

The output above lists the labels by count, not by rank, so we define the hierarchy ourselves, from highest to lowest:
* Phd
* Masters
* Graduate
* High School
* Primary School

We need to convert these labels into numbers, and we can do this with two different approaches.

First, we can create a dictionary where every label is a key and its numeric rank is the value. ‘Phd’ gets the highest score and ‘Primary School’ the lowest. We then map each label in the education_level column to its numeric value and store the result in a new column called education_rating. Rows where education_level is missing stay NaN, which is why the ratings are shown as floats.

In [64]:
# map each label to its rank, from lowest (0) to highest (4)
rating_dict = {"Primary School":0, "High School":1, "Graduate":2, "Masters":3, "Phd":4}

# create a new column with the numeric ratings
data["education_rating"] = data["education_level"].map(rating_dict)

# show the distribution of the new ratings
print(data["education_rating"].value_counts())

2.0    1269
3.0     496
1.0     222
4.0      54
0.0      36
Name: education_rating, dtype: int64
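
An alternative that stays within pandas is an ordered categorical. The sketch below is not part of the original notebook; it uses a small made-up column (including one missing value) to show that the category codes follow the order we supply and that missing entries receive the code -1.

```python
import pandas as pd

# toy column; the values are invented for illustration
levels = pd.Series(["Graduate", "Phd", None, "Primary School", "Masters"])

# category order from lowest to highest
order = ["Primary School", "High School", "Graduate", "Masters", "Phd"]

# an ordered categorical keeps the ranking together with the labels
ordered = pd.Categorical(levels, categories=order, ordered=True)

# .codes is the position of each label in `order`; missing values become -1
codes = pd.Series(ordered.codes, index=levels.index)
print(codes.tolist())
```

The -1 code makes missing values explicit, whereas the dictionary approach leaves them as NaN.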


The second approach uses the OrdinalEncoder class from the <span style="color:green">sklearn.preprocessing</span> library. We follow a similar approach: we set our categories as an ordered list, and then we <span style="color:green">.fit_transform</span> the values in our feature education_level. The encoder expects a 2-D array, so we reshape the column with <span style="color:green">.reshape(-1, 1)</span>.

Note that this method will not work if the feature has NaN values. Those need to be addressed before running <span style="color:green">.fit_transform</span>.

In [65]:
# count the NaN values in education_level
print(data["education_level"].isna().sum())

# drop every row that has a NaN in any column; .copy() gives an independent
# data frame, so adding a column later does not trigger SettingWithCopyWarning
new_df = data.dropna().copy()
print(new_df["education_level"].isna().sum())

52
0


In [66]:
# import sklearn lib
from sklearn.preprocessing import OrdinalEncoder

# create encoder and set the category order, from lowest to highest
encoder = OrdinalEncoder(categories = [['Primary School', 'High School', 'Graduate', 'Masters', 'Phd']])

# reshape feature into a 2-D array
education_reshaped = new_df["education_level"].values.reshape(-1, 1)

# create a new column with the assigned numbers
new_df["education_rating_sklearn"] = encoder.fit_transform(education_reshaped)

# show the results
print(new_df["education_rating_sklearn"].value_counts())

2.0    697
3.0    291
4.0     31
Name: education_rating_sklearn, dtype: int64
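
As a small variation on the OrdinalEncoder approach, the sketch below selects the column with double brackets, which already yields a 2-D object, and then maps the numbers back to labels with `inverse_transform`. The toy data frame is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# toy data frame; the values are invented for illustration
toy = pd.DataFrame({"education_level": ["Graduate", "Phd", "Masters", "Graduate"]})

# category order from lowest to highest
order = ["Primary School", "High School", "Graduate", "Masters", "Phd"]
encoder = OrdinalEncoder(categories=[order])

# double brackets select a one-column data frame, so no reshape is needed
encoded = encoder.fit_transform(toy[["education_level"]])
print(encoded.ravel().tolist())

# inverse_transform recovers the original labels from the numbers
decoded = encoder.inverse_transform(encoded)
print(decoded.ravel().tolist())
```

Being able to invert the encoding is handy when reporting results in terms of the original labels.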


-> Label Encoding

Nominal data has to be handled differently from ordinal data, because its labels have no inherent order. Our city feature has a lot of different labels; here are the five cities that appear most often in the data frame.

In [67]:
print(data["city"].nunique())
print(data['city'].value_counts()[:5])

108
city_103    473
city_21     318
city_16     168
city_114    155
city_160    113
Name: city, dtype: int64


To prepare this feature, we still need to convert the text to numbers. We will demonstrate two different approaches; the first converts the feature from the object type to the category type and then takes the category codes.

In [68]:
# convert feature to category type
data["city_codes"] = data["city"].astype('category')

# save new version of category codes
data["city_codes"] = data["city_codes"].cat.codes

# show to see transformation
print(data["city_codes"].value_counts()[:5])

5     473
55    318
41    168
11    155
42    113
Name: city_codes, dtype: int64


Another way to transform this feature is the LabelEncoder class from the <span style="color:green">sklearn.preprocessing</span> library. This method will not work if your feature has NaN values. Those need to be addressed prior to running <span style="color:green">.fit_transform</span>.

In [69]:
# import sklearn lib
from sklearn.preprocessing import LabelEncoder

# instantiate encoder
encoder = LabelEncoder()

# create new variable with the assigned numbers
data["city_codes_sklearn"] = encoder.fit_transform(data["city"])

# show transformation
print(data["city_codes_sklearn"].value_counts()[:5])

5     473
55    318
41    168
11    155
42    113
Name: city_codes_sklearn, dtype: int64
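
The two outputs above are identical, and the sketch below suggests why: both approaches sort the distinct labels and number them from 0. It uses a made-up handful of city labels. Note that the labels are sorted as text, so 'city_103' comes before 'city_16'.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy column; the values are invented for illustration
cities = pd.Series(["city_21", "city_103", "city_16", "city_103"])

# pandas: categories are sorted, then numbered from 0
pandas_codes = cities.astype("category").cat.codes.tolist()

# scikit-learn: classes_ is sorted as well, so the codes match
encoder = LabelEncoder()
sklearn_codes = encoder.fit_transform(cities).tolist()

print(pandas_codes, sklearn_codes)
print(list(encoder.classes_))
```

Because the numbers only reflect alphabetical position, they carry no meaningful order for a model to learn from.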


-> One-hot Encoding

One-hot encoding creates a dummy variable for each value of a categorical feature, where a dummy variable is a numeric variable with two values: 0 and 1.

This approach works well for nominal features: the model sees each category as its own feature and does not try to impose an order between one value and another.

In [70]:
# import lib
import pandas as pd

# use pandas .get_dummies to create new column for each city
ohe = pd.get_dummies(data["city"])

# join the new columns back onto the data frame
data = data.join(ohe)

# show results
print(data.columns)

Index(['enrollee_id', 'city', 'city_development_index', 'gender',
       'relevent_experience', 'enrolled_university', 'education_level',
       'major_discipline', 'experience', 'company_size',
       ...
       'city_84', 'city_89', 'city_9', 'city_90', 'city_91', 'city_93',
       'city_94', 'city_97', 'city_98', 'city_99'],
      dtype='object', length=124)


A downside to this approach is that it can create a lot of features, which in turn produces a very sparse matrix.
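
To make the sparsity concrete, the sketch below one-hot encodes a made-up column with three distinct cities and measures the share of zeros. Each row contains exactly one 1, so with k categories the share of zeros is 1 - 1/k; for the 108 cities in this dataset that is more than 99%.

```python
import pandas as pd

# toy column; the values are invented for illustration
cities = pd.Series(["city_103", "city_21", "city_16", "city_103"], name="city")

# one column per distinct value; dtype=int gives 0/1 instead of True/False
ohe = pd.get_dummies(cities, dtype=int)
print(ohe)

# every row has exactly one 1, so the rest of the matrix is zeros
row_sums = ohe.sum(axis=1).tolist()
zero_share = float((ohe == 0).to_numpy().mean())
print(ohe.shape, row_sums, round(zero_share, 3))
```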

-> Binary Encoding

A binary encoder finds the number of unique categories, assigns each category an integer, and converts that integer to its binary representation, with one column per binary digit.

For example, say we have 19 unique values. The number 19 is 10011 in binary, which is 5 digits long, so our binary encoder needs 5 columns to represent every category.
If we use this process instead of the traditional one-hot encoder, we get 5 numerical features instead of 19, reducing our features by about 74%.
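
The arithmetic in this example can be checked with a few lines of plain Python. The sketch assumes the categories are numbered from 1 to 19, as in the example; the exact numbering a library uses internally may differ.

```python
n_categories = 19

# number of binary digits needed to write the largest code (19 is '10011')
n_columns = n_categories.bit_length()
print(format(n_categories, "b"), n_columns)

# share of columns saved compared with one-hot encoding
reduction = 1 - n_columns / n_categories
print(round(reduction, 3))

# binary digits for a few category codes, padded to n_columns
encoded = {code: [int(bit) for bit in format(code, f"0{n_columns}b")] for code in (1, 2, 3, 19)}
for code, bits in encoded.items():
    print(code, bits)
```

Each list of digits corresponds to one row of the binary feature columns.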

To make this happen, we’ll use a library called <span style="color:green">category_encoders</span> and import <span style="color:green">BinaryEncoder</span>. We specify which column to transform and set <span style="color:green">drop_invariant</span> to True, so that any column with zero variance is dropped. If it is left at the default, False, we could end up with an additional column full of zeros.

In [71]:
# import lib
from category_encoders import BinaryEncoder

# fit_transform returns a new data frame that holds only the binary feature columns replacing company_type (3 columns here)
encoder = BinaryEncoder(cols = ['company_type'], drop_invariant = True)

binary_results = encoder.fit_transform(data['company_type'])

# show result
print(binary_results.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2129 entries, 0 to 2128
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   company_type_0  2129 non-null   int64
 1   company_type_1  2129 non-null   int64
 2   company_type_2  2129 non-null   int64
dtypes: int64(3)
memory usage: 50.0 KB
None


-> Hashing

Hashing is another encoding option. It is similar to one-hot encoding in that it creates new numeric columns, but you decide, through a parameter, how many features to output.
A huge advantage is reduced dimensionality, but a large disadvantage is that some categories will be mapped to the same values. This is called a collision.
When two categories share the same hash value, we have lost some information, and our model won’t be able to tell them apart.
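
To illustrate collisions, the sketch below hashes six hypothetical labels into four columns with a stable hash from the standard library. It is a simplified stand-in for the idea, not the exact algorithm used by any particular encoder; the labels and the helper name `hash_bucket` are invented for illustration.

```python
import hashlib

def hash_bucket(label: str, n_components: int) -> int:
    """Map a label to a bucket index in [0, n_components) with a stable hash."""
    digest = hashlib.md5(label.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

# hypothetical labels, invented for illustration
labels = ["type_a", "type_b", "type_c", "type_d", "type_e", "type_f"]
n_components = 4

buckets = {label: hash_bucket(label, n_components) for label in labels}

# each label becomes a row with a single 1 in the column of its bucket
rows = {label: [int(i == b) for i in range(n_components)] for label, b in buckets.items()}
for label, row in rows.items():
    print(label, row)

# 6 labels cannot fit into 4 buckets without sharing, so collisions are guaranteed
n_collisions = len(labels) - len(set(buckets.values()))
print("labels sharing a bucket with another label:", n_collisions)
```

Labels that land in the same bucket produce identical rows, which is exactly the information loss described above.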

Here is how to make this work in Python. The final <span style="color:green">hash_results</span> is a data frame of just 5 columns, so we will want to concatenate this new data onto our original data frame.

In [72]:
# import lib
from category_encoders import HashingEncoder

# instantiate encoder
encoder = HashingEncoder(cols = 'company_type', n_components = 5)

# fit and transform the selected column and store the result in a new variable
hash_results = encoder.fit_transform(data['company_type'])

# show result
print(hash_results.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2129 entries, 0 to 2128
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col_0   2129 non-null   int64
 1   col_1   2129 non-null   int64
 2   col_2   2129 non-null   int64
 3   col_3   2129 non-null   int64
 4   col_4   2129 non-null   int64
dtypes: int64(5)
memory usage: 83.3 KB
None


When would I use this if I’m going to lose information and my model will treat different 'company_type' values with the same hash value as the same thing? This can still be a good solution for your project and dataset if you are not interested in assessing the impact of any particular categorical value.

For this example, maybe you aren’t interested in knowing which 'company_type' had an impact on your final prediction, but you want to get the best performance from your model. In that case, this encoding may be a good approach.

-> Target Encoding

Target encoding is a Bayesian encoder that transforms categorical features into numerical values derived from the target, and it is sometimes called the mean encoder. This encoder can be used for data sets that are being prepared for regression-based supervised learning, because it takes into account the mean of the target variable for each individual category of our feature.

In the example below, it replaces each 'experience' value with a blend of the mean 'training_hours' for that 'experience' value and the mean 'training_hours' of all the data. Had it been predicting something categorical, it would have used a Bayesian target statistic instead of a mean.

Some drawbacks to this approach are overfitting and unevenly distributed values that could lead to extremes.

Say we are preparing our dataset for a regression-based supervised learning algorithm that tries to predict 'training_hours', and, just for the sake of example, we assume that 'experience' has some kind of relationship with 'training_hours'.

In [73]:
from category_encoders import TargetEncoder

# instantiate encoder
encoder = TargetEncoder(cols = 'experience')

# fit transform
target_results = encoder.fit_transform(data['experience'], data['training_hours'])

# show result
print(target_results.head(5))

   experience
0   64.522166
1   65.110429
2   67.327250
3   67.798496
4   67.945170
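
To make the blend concrete, the sketch below computes a smoothed target mean by hand on a made-up data frame. It uses simple additive smoothing with a hypothetical weight `m`; this is an assumption chosen for clarity, and TargetEncoder applies its own weighting, so the numbers will not match the library output.

```python
import pandas as pd

# toy data; the values are invented for illustration
toy = pd.DataFrame({
    "experience": ["<1", "<1", "5", "5", "5", ">20"],
    "training_hours": [10, 30, 40, 50, 60, 100],
})

m = 2.0  # hypothetical smoothing weight: larger m pulls harder toward the global mean
global_mean = toy["training_hours"].mean()

# per-category mean and row count of the target
stats = toy.groupby("experience")["training_hours"].agg(["mean", "count"])

# blend of the category mean and the global mean, weighted by category size
blend = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

toy["experience_encoded"] = toy["experience"].map(blend)
print(toy)
```

Categories with few rows, such as '>20' here, are pulled most strongly toward the global mean, which is how smoothing limits the overfitting mentioned above.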


-> Date-time Encoding

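Date-time encoding converts a date that is stored as text or as an integer into a date-time object, so that components such as the month or the day of the week can be extracted as new features.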

In [74]:
# load a new dataset and check the type of the column that we want to convert to a date-time object
data = pd.read_csv("profile.csv")
print(data['became_member_on'].dtypes)

int64


In [75]:
print(data['became_member_on'].head(5))

0    20170212
1    20170715
2    20180712
3    20170509
4    20170804
Name: became_member_on, dtype: int64


In [76]:
# the column stores dates as integers such as 20170212, so convert them to
# strings and parse with an explicit format; without the format, pandas reads
# the integers as nanoseconds since 1970-01-01
data['became_member_on'] = pd.to_datetime(data['became_member_on'].astype(str), format='%Y%m%d')
print(data['became_member_on'].info())

# create new variable for month
data['month'] = data['became_member_on'].dt.month

# create new variable for day of the week (Monday = 0, Sunday = 6)
data['dayofweek'] = data['became_member_on'].dt.dayofweek

# show all results
print(data['became_member_on'].head(3), data['month'].head(3), data['dayofweek'].head(3))

<class 'pandas.core.series.Series'>
RangeIndex: 17000 entries, 0 to 16999
Series name: became_member_on
Non-Null Count  Dtype         
--------------  -----         
17000 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 132.9 KB
None
0   2017-02-12
1   2017-07-15
2   2018-07-12
Name: became_member_on, dtype: datetime64[ns] 0    2
1    7
2    7
Name: month, dtype: int64 0    6
1    5
2    3
Name: dayofweek, dtype: int64
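
Integer dates such as 20170212 are easy to misparse. The sketch below contrasts parsing with and without an explicit format on three values taken from the head of the column shown earlier: without a format the integers are treated as nanoseconds since 1970-01-01, while with the format they are read as year, month, and day.

```python
import pandas as pd

# first three values of became_member_on, as printed earlier
raw = pd.Series([20170212, 20170715, 20180712])

# without a format, the integers are treated as nanoseconds since 1970-01-01
naive = pd.to_datetime(raw)
naive_years = naive.dt.year.tolist()
print(naive_years)

# with an explicit format, the digits are parsed as year, month, day
parsed = pd.to_datetime(raw.astype(str), format="%Y%m%d")
dates = parsed.dt.strftime("%Y-%m-%d").tolist()
months = parsed.dt.month.tolist()
weekdays = parsed.dt.dayofweek.tolist()  # Monday = 0, Sunday = 6
print(dates, months, weekdays)
```

Note that `.dt.day` returns the day of the month, whereas `.dt.dayofweek` returns the weekday.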
