# Encoding Categorical Variables

In [71]:
# import lib
import pandas as pd

# read csv file
data = pd.read_csv("aug_test.csv")

# show variable types
print(data.dtypes)

enrollee_id                 int64
city                       object
city_development_index    float64
gender                     object
relevent_experience        object
enrolled_university        object
education_level            object
major_discipline           object
experience                 object
company_size               object
company_type               object
last_new_job               object
training_hours              int64
dtype: object


The output shows several features with dtype object, which tells us those features contain text or a mix of text and numerical values.

The reason to spend time on this level of encoding is that many machine learning models cannot handle text and only work with numbers. The dataset must be encoded into numbers before we begin to train, test, or evaluate a model.
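As a rough sketch of how to find the columns that still need encoding, the snippet below selects every non-numeric column. The `toy` frame and its values are made up for illustration and only stand in for the real CSV.

```python
import pandas as pd

# Toy frame standing in for the real CSV (values are invented for illustration)
toy = pd.DataFrame({
    "enrollee_id": [1, 2, 3],
    "city": ["city_103", "city_21", "city_103"],
    "education_level": ["Graduate", "Masters", None],
    "training_hours": [36, 47, 8],
})

# Every column that is not numeric is a candidate for encoding
text_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```

Running the same selection on the full data frame should list the features shown with dtype object above.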

-> Ordinal Encoding

In [72]:
print(data["education_level"].value_counts())

Graduate          1269
Masters            496
High School        222
Phd                 54
Primary School      36
Name: education_level, dtype: int64


This is a clear example of ordinal data: the education_level labels can be ranked from “Primary School” at the bottom to “Phd” at the top.

The output above lists the labels by frequency, not by rank. We assume the following hierarchy, from highest to lowest:
* Phd
* Masters
* Graduate
* High School
* Primary School

We need to convert these labels into numbers, and we can do this with two different approaches.

First, we can create a dictionary where every label is a key and its numeric rank is the value. ‘Phd’ gets the highest score and ‘Primary School’ the lowest. Then we map each label in the education_level column to its numeric value and store the result in a new column called education_rating.

In [73]:
# create dictionary of label: value pairs, ranked from highest to lowest
rating_dict = {"Phd":4, "Masters":3, "Graduate":2, "High School":1, "Primary School":0}

# create a new column
data["education_rating"] = data["education_level"].map(rating_dict)

# show the counts of the encoded values
print(data["education_rating"].value_counts())

2.0    1269
3.0     496
1.0     222
4.0      54
0.0      36
Name: education_rating, dtype: int64
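The encoded values print as floats (2.0 rather than 2) because `.map` leaves missing labels as NaN, and a column containing NaN is stored as float. A minimal sketch of that behavior, using a made-up `levels` series and an illustrative mapping in which a larger number means a higher level:

```python
import pandas as pd

# Made-up sample with one missing label
levels = pd.Series(["Graduate", "Phd", None, "Masters", "Primary School"])

# Illustrative mapping: larger number = higher education level
rating_dict = {"Primary School": 0, "High School": 1, "Graduate": 2, "Masters": 3, "Phd": 4}

ratings = levels.map(rating_dict)

# The missing label stays missing, so the result is float rather than int
print(ratings.tolist())
print(ratings.isna().sum())
```

Labels that are absent from the dictionary would also come out as NaN, so counting NaN values after mapping is a quick sanity check.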


The second approach uses the OrdinalEncoder class from <span style="color:green">sklearn.preprocessing</span>. We follow a similar procedure: we pass our categories as a list, where a label's position in the list becomes its code, and then call <span style="color:green">.fit_transform</span> on the values of our education_level feature. The encoder expects a 2-D array, so note the call to <span style="color:green">.reshape(-1, 1)</span>.

Note that this method will not work if the feature has NaN values. Those need to be addressed before running <span style="color:green">.fit_transform</span>.

In [74]:
# count NaN values in education_level
print(data["education_level"].isna().sum())

# drop every row that has a NaN in any column, which also removes the NaN values in education_level
new_df = data.dropna()
print(new_df["education_level"].isna().sum())

52
0
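Calling `.dropna()` with no arguments removes a row if any of its columns is missing, which can discard more data than intended. As a hedged alternative, the sketch below compares it with `dropna(subset=...)` on a made-up `toy` frame; the column values are invented for illustration.

```python
import pandas as pd

# Made-up frame: one row misses education_level, another misses major_discipline
toy = pd.DataFrame({
    "education_level": ["Graduate", None, "High School", "Phd"],
    "major_discipline": ["STEM", "STEM", None, "Arts"],
})

# Drops rows with a NaN in any column
dropped_any = toy.dropna()

# Drops only the rows where education_level itself is missing
dropped_subset = toy.dropna(subset=["education_level"])

print(len(dropped_any), len(dropped_subset))
```

With `subset`, the "High School" row survives even though its major_discipline is missing.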


In [75]:
# import sklearn lib
from sklearn.preprocessing import OrdinalEncoder

# create encoder and set category order, from lowest to highest
encoder = OrdinalEncoder(categories = [['Primary School', 'High School', 'Graduate', 'Masters', 'Phd']])

# work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
new_df = new_df.copy()

# reshape feature into a 2-D array
education_reshaped = new_df["education_level"].values.reshape(-1, 1)

# create new column with the assigned numbers
new_df["education_rating_sklearn"] = encoder.fit_transform(education_reshaped)

# show the results
print(new_df["education_rating_sklearn"].value_counts())

2.0    697
3.0    291
4.0     31
Name: education_rating_sklearn, dtype: int64
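As a variation on the cell above, selecting the column with double brackets yields a one-column DataFrame that is already 2-D, so no reshape is needed. This is only a sketch on a made-up `toy` frame, assuming the categories are listed from lowest to highest; it also shows `inverse_transform` mapping the codes back to labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up sample of education levels
toy = pd.DataFrame({"education_level": ["Graduate", "Phd", "Masters", "Graduate"]})

# Position in this list becomes the code: 'Primary School' -> 0, ..., 'Phd' -> 4
order = ["Primary School", "High School", "Graduate", "Masters", "Phd"]
encoder = OrdinalEncoder(categories=[order])

# Double brackets select a one-column DataFrame, which is already 2-D
codes = encoder.fit_transform(toy[["education_level"]])

# Flatten the (n, 1) result before storing it as a column
toy["education_rating_sklearn"] = codes.ravel()
print(toy)

# inverse_transform maps the codes back to the original labels
restored = encoder.inverse_transform(codes).ravel().tolist()
print(restored)
```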


-> Label Encoding

Now let's turn to nominal data, which we have to approach differently from ordinal data. Our city feature has many distinct labels; the cell below counts them and shows the five cities that appear most often in the data frame.

In [76]:
print(data["city"].nunique())
print(data['city'].value_counts()[:5])

108
city_103    473
city_21     318
city_16     168
city_114    155
city_160    113
Name: city, dtype: int64


To prepare this feature, we still need to convert our text to numbers. We will demonstrate two different approaches; the first converts the feature from the object type to the category type.

In [77]:
# convert feature to category type
data["city_codes"] = data["city"].astype('category')

# replace the categories with their integer codes
data["city_codes"] = data["city_codes"].cat.codes

# show the transformation
print(data["city_codes"].value_counts()[:5])

5     473
55    318
41    168
11    155
42    113
Name: city_codes, dtype: int64
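The codes produced by `.cat.codes` follow the sorted order of the labels and carry no ranking. A small sketch, on a made-up `cities` series, of how the code-to-label mapping can be recovered from `.cat.categories`, and of how a missing value is encoded:

```python
import pandas as pd

# Made-up sample with one missing city
cities = pd.Series(["city_21", "city_103", None, "city_21"]).astype("category")

# Codes index into the sorted categories; a missing value becomes -1
codes = cities.cat.codes.tolist()
print(codes)

# Build a lookup from code back to label
code_to_label = dict(enumerate(cities.cat.categories))
print(code_to_label)
```

Keeping such a lookup around makes it possible to translate the codes back into city names later.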


Another way to transform this feature is the LabelEncoder class from <span style="color:green">sklearn.preprocessing</span>. Note that the scikit-learn documentation describes LabelEncoder as intended for target labels rather than input features. This method will not work if your feature has NaN values. Those need to be addressed before running <span style="color:green">.fit_transform</span>.

In [78]:
# import sklearn lib
from sklearn.preprocessing import LabelEncoder

# instantiate encoder
encoder = LabelEncoder()

# create new variable with the assigned numbers
data["city_codes_sklearn"] = encoder.fit_transform(data["city"])

# show transformation
print(data["city_codes_sklearn"].value_counts()[:5])

5     473
55    318
41    168
11    155
42    113
Name: city_codes_sklearn, dtype: int64
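After fitting, the encoder keeps the sorted labels in its `classes_` attribute, and `inverse_transform` turns codes back into labels. A minimal sketch on a made-up list of cities:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of city labels
cities = ["city_21", "city_103", "city_16", "city_21"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# Each code is the label's position in the sorted classes_ array
print(codes.tolist())
print(encoder.classes_.tolist())

# Map the codes back to the original labels
restored = encoder.inverse_transform(codes).tolist()
print(restored)
```

Because the labels are sorted as strings, the resulting codes match the ones from the category approach, which is why both outputs above are identical.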
