<a href="https://colab.research.google.com/github/MinakoNG63/DSFB/blob/main/13_Feature_Transformation_Categorical_Data_63070240.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Engineering (Feature Transformation) on Categorical Data

Term 1 2022 - Instructor: Teerapong Leelanupab

Teaching Assistant:
1. Piyawat Chuangkrud (Sam)
2. Suvapat Manu (Mint)

***

Credit: Dipanjan (DJ) Sarkar, [Categorical Data](https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63)
***

In [None]:
#---------------------------------
#download files from Google Drive
#---------------------------------
#download vgsales.csv
!gdown --id 1u_ADLdhLtfH0780fgjRCWOV27c3JZ_OR

#download Pokemon.csv
!gdown --id 18Q4wKu6jUC6ZXtENtwJ0lWpNyn13wjcR

Downloading...
From: https://drive.google.com/uc?id=1u_ADLdhLtfH0780fgjRCWOV27c3JZ_OR
To: /content/vgsales.csv
100% 1.36M/1.36M [00:00<00:00, 133MB/s]
Downloading...
From: https://drive.google.com/uc?id=18Q4wKu6jUC6ZXtENtwJ0lWpNyn13wjcR
To: /content/Pokemon.csv
100% 47.2k/47.2k [00:00<00:00, 55.9MB/s]


# When to use a Label Encoding vs. One Hot Encoding
The right choice generally depends on your dataset and the model you plan to use. Still, a few points are worth noting before choosing an encoding technique:

We apply One-Hot Encoding when:

1. The categorical feature is not ordinal (like country names)
2. The number of categories is small, so one-hot encoding can be applied without creating too many columns

We apply Label Encoding when:
1. The categorical feature is ordinal (like junior kindergarten, senior kindergarten, primary school, high school)
2. The number of categories is quite large, as one-hot encoding can lead to high memory consumption

Credit: ALAKH SETHI, [one-hot-encoding-vs-label-encoding](https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/)
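As a rough illustration of the two rules of thumb above, the following sketch encodes one nominal and one ordinal column. The toy data frame and the column names `country` and `school` are hypothetical and are not part of the notebook's datasets.

```python
import pandas as pd

# Hypothetical toy data, not taken from the notebook's datasets.
toy = pd.DataFrame({
    "country": ["Japan", "Thailand", "India", "Thailand"],      # nominal
    "school": ["Primary", "High", "Kindergarten", "Primary"],   # ordinal
})

# Nominal feature: one binary column per category (columns are sorted by name).
country_ohe = pd.get_dummies(toy["country"], dtype=int)

# Ordinal feature: integer codes that preserve the order we specify ourselves.
school_order = {"Kindergarten": 0, "Primary": 1, "High": 2}
school_codes = toy["school"].map(school_order)

print(country_ohe)
print(school_codes.tolist())  # [1, 2, 0, 1]
```

The nominal column grows to one column per country, while the ordinal column stays a single integer column whose values can be compared meaningfully.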


# Strategies for working with discrete, categorical data


In the previous notebook ([Notebook 1](https://drive.google.com/file/d/1wX62EW9LPZjJ7AqKWkMgdRrBRvoy0cZV/view?usp=sharing)), we covered various feature engineering strategies for structured continuous numeric data. In this notebook, we look at another type of structured data, which is discrete in nature and is popularly termed categorical data. Dealing with numeric data is often easier than dealing with categorical data, because numeric data does not carry the additional complexity of the semantics attached to each category value. We will use a hands-on approach to discuss several encoding schemes for categorical data, along with a couple of popular techniques for dealing with large-scale feature explosion, often known as the “[curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality)”.

In practice, most machine learning algorithms cannot work directly with categorical data, so you need to engineer and transform this data before you can start modeling.

## Understanding Categorical Data
Let’s get an idea about categorical data representations before diving into feature engineering strategies. Typically, any data attribute which is categorical in nature represents discrete values which belong to a specific finite set of categories or classes. These are also often known as classes or labels in the context of attributes or variables which are to be predicted by a model (popularly known as response variables). These discrete values can be text or numeric in nature (or even unstructured data like images!). There are two major classes of categorical data, nominal and ordinal.

In any **nominal categorical data attribute**, there is no concept of ordering amongst the values of that attribute. Consider a simple example of weather categories, as depicted in the following figure. We can see that we have six major classes or categories in this particular scenario without any concept or notion of order (*windy* doesn’t always occur before *sunny* nor is it smaller or bigger than *sunny*). In other words, the values of the feature have no order (non-ordinal).

<figure>
<center>
<img src='https://www.it.kmitl.ac.th/~teerapong/Exxon/Week1/images/Lab32_weather.png' alt='Weather as a categorical attribute'/>
<figcaption><em>Fig. 1: Weather as a categorical attribute</em></figcaption></center>
</figure>


Similarly movie, music and video game genres, country names, food and cuisine types are other examples of nominal categorical attributes.

**Ordinal categorical attributes** have some sense or notion of order amongst their values. For instance, look at the following figure for shirt sizes. It is quite evident that order, or in this case ‘size’, matters when thinking about shirts ($S$ is smaller than $M$, which is smaller than $L$, and so on). In other words, the values of the feature can be put in order (ordinal).

<figure>
<center>
<img src='https://www.it.kmitl.ac.th/~teerapong/Exxon/Week1/images/Lab32_shirt_size.png' alt='Shirt size as an ordinal categorical attribute'/>
<figcaption><em>Fig. 2: Shirt size as an ordinal categorical attribute</em></figcaption></center>
</figure>

Shoe sizes, education level and employment roles are some other examples of ordinal categorical attributes. Having a decent idea about categorical data, let’s now look at some feature engineering strategies.
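One way to make the nominal/ordinal distinction explicit in `pandas` is an ordered categorical dtype. The sketch below uses a hypothetical list of shirt sizes; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical shirt sizes, for illustration only.
sizes = pd.Series(["M", "S", "XL", "L", "S"])

# ordered=True tells pandas that S < M < L < XL.
size_type = pd.CategoricalDtype(categories=["S", "M", "L", "XL"], ordered=True)
sizes_ord = sizes.astype(size_type)

print(sizes_ord.cat.codes.tolist())   # [1, 0, 3, 2, 0]
print((sizes_ord < "L").tolist())     # [True, True, False, False, True]
print(sizes_ord.min(), sizes_ord.max())  # S XL
```

For a nominal attribute you would leave `ordered=False` (the default), in which case comparisons such as `<` are not allowed.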

# Feature Engineering on Categorical Data

While many machine learning frameworks have advanced to accept complex categorical data types like text labels, a standard feature engineering workflow typically still involves some form of **transformation** of these categorical values into numeric labels, followed by applying some **encoding scheme** to these values. We load up the necessary essentials before getting started.

Steps
1. Identify categorical features
2. Transform them into numeric labels
3. Apply encoding scheme (optional)

## 2. Transform them into numeric labels (nominal or ordinal attributes)

In [None]:
import pandas as pd
import numpy as np

### 2.1 Transforming <font color='red'>Nominal</font> Attributes (**<font color='red'>LabelEncoder</font>**)
Nominal attributes consist of discrete categorical values with no notion or sense of order amongst them. The idea here is to transform these attributes into a more representative numerical format which can be easily understood by downstream code and pipelines. Let’s look at a new dataset pertaining to video game sales. This dataset is also available on [Kaggle](https://www.kaggle.com/gregorut/videogamesales).

In [None]:
vg_df = pd.read_csv('vgsales.csv', encoding='utf-8')
vg_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].iloc[1:7]

Unnamed: 0,Name,Platform,Year,Genre,Publisher
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo
5,Tetris,GB,1989.0,Puzzle,Nintendo
6,New Super Mario Bros.,DS,2006.0,Platform,Nintendo


Let’s focus on the video game `Genre` attribute as depicted in the above data frame. It is quite evident that this is a nominal categorical attribute just like `Publisher` and `Platform`. We can easily get the list of unique video game genres as follows.

In [None]:
genres = np.unique(vg_df['Genre'])
print("Number of unique labels: " + str(len(genres)))
genres

Number of unique labels: 12


array(['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle',
       'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports',
       'Strategy'], dtype=object)

This tells us that we have 12 distinct video game genres. We can now generate a label encoding scheme for mapping each category to a numeric value by leveraging `scikit-learn`.

For binary classification, the encoded labels are $\{0, 1\}$; for multi-class classification with $m$ classes, they are $\{0, 1, \ldots, m-1\}$.

In [None]:
from sklearn.preprocessing import LabelEncoder

gle = LabelEncoder()
genre_labels = gle.fit_transform(vg_df['Genre'])
genre_mappings = {index: label for index, label in enumerate(gle.classes_)}
genre_mappings

{0: 'Action',
 1: 'Adventure',
 2: 'Fighting',
 3: 'Misc',
 4: 'Platform',
 5: 'Puzzle',
 6: 'Racing',
 7: 'Role-Playing',
 8: 'Shooter',
 9: 'Simulation',
 10: 'Sports',
 11: 'Strategy'}

Thus a mapping scheme has been generated where each genre value is mapped to a number with the help of the `LabelEncoder` object `gle`. The transformed labels are stored in the `genre_labels` variable, which we can write back to our data frame.

In [None]:
vg_df['GenreLabel'] = genre_labels
vg_df[['Name', 'Platform', 'Year', 'Genre', 'GenreLabel']].iloc[1:7]

Unnamed: 0,Name,Platform,Year,Genre,GenreLabel
1,Super Mario Bros.,NES,1985.0,Platform,4
2,Mario Kart Wii,Wii,2008.0,Racing,6
3,Wii Sports Resort,Wii,2009.0,Sports,10
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,7
5,Tetris,GB,1989.0,Puzzle,5
6,New Super Mario Bros.,DS,2006.0,Platform,4


These labels can often be used directly, especially with frameworks like `scikit-learn`, if you plan to use them as response variables for prediction. However, as discussed earlier, we need an additional encoding step before we can use them as features.
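A fitted `LabelEncoder` can also map numeric labels back to the original category names with `inverse_transform`, which is handy when interpreting predictions. A minimal sketch on a hypothetical list of genres (not the full `vg_df` column):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical subset of genres, for illustration only.
toy_genres = ["Sports", "Racing", "Puzzle", "Sports"]

le = LabelEncoder()
codes = le.fit_transform(toy_genres)    # classes are sorted alphabetically
decoded = le.inverse_transform(codes)   # numeric labels back to names

print(list(le.classes_))  # ['Puzzle', 'Racing', 'Sports']
print(codes.tolist())     # [2, 1, 0, 2]
print(decoded.tolist())   # ['Sports', 'Racing', 'Puzzle', 'Sports']
```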

### 2.2 Transforming <font color='red'> Ordinal</font> Attributes **(<font color='red'>map</font> function)**
Ordinal attributes are categorical attributes with a sense of order amongst the values. Let’s consider our [Pokémon](https://www.kaggle.com/abcsds/pokemon/data)$^1$ dataset which we used in [Lab 31](../Week11_Feature_Engineering_needprepare/31_Feature_Engineering_Continuous%20_Numeric_Data.ipynb) of this series. Let’s focus more specifically on the `Generation` attribute.

<br />

$^1$ Note that the `Generation` attribute in the dataset downloaded from Kaggle differs from the one provided with this notebook: in the Kaggle version, it is already encoded as six ordinal numbers, i.e. {1, 2, 3, ..., 6}.

In [None]:
poke_df = pd.read_csv('Pokemon.csv', encoding='utf-8')
poke_df = poke_df.sample(random_state=1, frac=1).reset_index(drop=True)

poke_df

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,Gen 1,False
1,460,Abomasnow,Grass,Ice,494,90,92,75,92,85,60,Gen 4,False
2,161,Sentret,Normal,,215,35,46,34,35,45,20,Gen 2,False
3,667,Litleo,Fire,Normal,369,62,50,58,73,54,72,Gen 6,False
4,224,Octillery,Water,,480,75,105,75,105,75,45,Gen 2,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,648,MeloettaAria Forme,Normal,Psychic,600,100,77,77,128,128,90,Gen 5,False
796,697,Tyrantrum,Rock,Dragon,521,82,121,119,69,59,71,Gen 6,False
797,66,Machop,Fighting,,305,70,80,50,35,35,35,Gen 1,False
798,217,Ursaring,Normal,,500,90,130,75,75,75,55,Gen 2,False


In [None]:
np.unique(poke_df['Generation'])

array(['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6'], dtype=object)

Based on the above output, we can see there are a total of **6** generations. Each Pokémon typically belongs to a specific generation based on when it was released in the video games, and the television series follows a similar timeline. This attribute is typically ordinal (domain knowledge is necessary here) because most Pokémon belonging to *Generation 1* were introduced earlier in the video games and the television shows than those of *Generation 2*, and so on. Fans can check out the following figure to remember some of the popular Pokémon of each generation (views may differ among fans!).

<figure>
<center>
<img src='https://www.it.kmitl.ac.th/~teerapong/Exxon/Week1/images/Lab32_pokemon_gen_chart.png' alt='Popular Pokémon based on generation and type'/>
<figcaption><em>Fig. 3: Popular Pokémon based on generation and type (source: <a href="https://www.reddit.com/r/pokemon/comments/2s2upx/heres_my_favorite_pokemon_by_type_and_gen_chart">https://www.reddit.com/r/pokemon/comments/2s2upx/heres_my_favorite_pokemon_by_type_and_gen_chart</a>)</em></figcaption></center>
</figure>





Hence they have a sense of order amongst them. In general, no generic function can infer the intended order of such values automatically, so we supply it ourselves through a custom encoding/mapping scheme.

In [None]:
gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3, 'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}
poke_df['GenerationLabel'] = poke_df['Generation'].map(gen_ord_map)
poke_df[['Name', 'Generation', 'GenerationLabel']].iloc[4:10]

Unnamed: 0,Name,Generation,GenerationLabel
4,Octillery,Gen 2,2
5,Helioptile,Gen 6,6
6,Dialga,Gen 4,4
7,DeoxysDefense Forme,Gen 3,3
8,Rapidash,Gen 1,1
9,Swanna,Gen 5,5


It is quite evident from the above code that the **map(…)** function from pandas is quite helpful in transforming this ordinal feature.
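Two details are worth keeping in mind with a hand-written mapping. First, `map(…)` returns `NaN` for any value missing from the dictionary. Second, `scikit-learn` offers `OrdinalEncoder`, which accepts an explicit category order and produces 0-based codes. The sketch below shows both on a hypothetical toy frame; it is an alternative illustration, not the approach used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame, for illustration only.
toy = pd.DataFrame({"Generation": ["Gen 2", "Gen 6", "Gen 1", "Gen 3"]})
gen_order = ["Gen 1", "Gen 2", "Gen 3", "Gen 4", "Gen 5", "Gen 6"]

# OrdinalEncoder uses the order we pass in; codes start at 0, not 1.
oe = OrdinalEncoder(categories=[gen_order])
toy["GenerationCode"] = oe.fit_transform(toy[["Generation"]]).ravel().astype(int)
print(toy["GenerationCode"].tolist())  # [1, 5, 0, 2]

# A dictionary-based map silently yields NaN for values it does not know.
gen_ord_map = {g: i + 1 for i, g in enumerate(gen_order)}
unmapped = pd.Series(["Gen 7"]).map(gen_ord_map)
print(unmapped.isna().tolist())  # [True]
```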

## 3. Encoding Categorical Attributes
As mentioned earlier, feature engineering on categorical data typically involves a transformation process, which we depicted in the previous section, followed by an encoding process in which we apply specific encoding schemes to create dummy variables or features for each category/value in a specific categorical attribute.

You might be wondering: we just converted categories to numerical labels in the previous section, so why do we need this now? The reason is quite simple. Considering video game genres, if we directly fed the `GenreLabel` attribute as a feature into a machine learning model, the model would treat it as a continuous numeric feature and assume that value **10** (*Sports*) is greater than **6** (*Racing*). That is meaningless, because the *Sports* genre is certainly not bigger or smaller than *Racing*; these are essentially different values or categories which cannot be compared directly. Hence we need an additional layer of encoding schemes where dummy features are created for each unique value or category out of all the distinct categories per attribute.
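The point can be made concrete by comparing distances. With label codes, *Sports* appears closer to *Racing* than to *Action* purely because of alphabetical order; with one-hot vectors, every pair of distinct genres is equally far apart. The sketch below uses three label codes from the genre mapping shown earlier; the helper function names are hypothetical.

```python
import numpy as np

# Label codes for three genres, taken from the genre mapping shown earlier.
label_codes = {"Action": 0, "Racing": 6, "Sports": 10}

# One-hot vectors restricted to these three genres (rows of an identity matrix).
one_hot = dict(zip(label_codes, np.eye(len(label_codes))))

def label_distance(a, b):
    """Absolute difference between the integer label codes."""
    return abs(label_codes[a] - label_codes[b])

def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))

print(label_distance("Sports", "Racing"), label_distance("Sports", "Action"))    # 4 10
print(one_hot_distance("Sports", "Racing"), one_hot_distance("Sports", "Action"))  # both sqrt(2)
```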

### 3.1) <font color='red'>One-hot Encoding</font> Scheme ($m$ labels -> $m$ binary features) **(<font color='red'>OneHotEncoder</font> or <font color='red'>get_dummies</font> function)**

Given the numeric representation of any categorical attribute with $m$ labels (after transformation), the one-hot encoding scheme encodes or transforms the attribute into $m$ binary features which can only contain a value of 1 or 0. Each observation in the categorical feature is thus converted into a vector of size $m$ with only one of the values as **1** (indicating it as active). Let’s take a subset of our Pokémon dataset depicting two attributes of interest.


For example, if weather = {"summer", "autumn", "winter", "spring"},
we encode it by removing the weather attribute and adding four new attributes, "summer", "autumn", "winter" and "spring", each of which is a binary feature.
If weather = "summer", it is encoded as summer = 1, autumn = 0, winter = 0 and spring = 0.

In [None]:
poke_df[['Name', 'Generation', 'Legendary']].iloc[4:10]

Unnamed: 0,Name,Generation,Legendary
4,Octillery,Gen 2,False
5,Helioptile,Gen 6,False
6,Dialga,Gen 4,True
7,DeoxysDefense Forme,Gen 3,True
8,Rapidash,Gen 1,False
9,Swanna,Gen 5,False


The attributes of interest are Pokémon `Generation` and their `Legendary` status. The first step is to *transform* these attributes into numeric representations based on what we learnt earlier.

In [None]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# transform and map pokemon generations
gen_le = LabelEncoder()
gen_labels = gen_le.fit_transform(poke_df['Generation'])
poke_df['Gen_Label'] = gen_labels

# transform and map pokemon legendary status
leg_le = LabelEncoder()
leg_labels = leg_le.fit_transform(poke_df['Legendary'])
poke_df['Lgnd_Label'] = leg_labels

poke_df_sub = poke_df[['Name', 'Generation', 'Gen_Label',  'Legendary', 'Lgnd_Label']]
poke_df_sub.iloc[4:10]

Unnamed: 0,Name,Generation,Gen_Label,Legendary,Lgnd_Label
4,Octillery,Gen 2,1,False,0
5,Helioptile,Gen 6,5,False,0
6,Dialga,Gen 4,3,True,1
7,DeoxysDefense Forme,Gen 3,2,True,1
8,Rapidash,Gen 1,0,False,0
9,Swanna,Gen 5,4,False,0


The features `Gen_Label` and `Lgnd_Label` now depict the *numeric representations* of our categorical features. Let’s now apply the **one-hot encoding** scheme on these features.

In [None]:
# encode generation labels using one-hot encoding scheme
# Option 1 - New usage: Don't need to convert the categories to integers any more before using OneHotEncoder.
gen_ohe = OneHotEncoder()
gen_feature_arr = gen_ohe.fit_transform(poke_df[['Generation']]).toarray()

# Option 2
# gen_ohe = OneHotEncoder(categories='auto')
# gen_feature_arr = gen_ohe.fit_transform(poke_df[['Gen_Label']]).toarray()

gen_feature_labels = list(gen_le.classes_)
gen_features = pd.DataFrame(gen_feature_arr, columns=gen_feature_labels)

# encode legendary status labels using one-hot encoding scheme
leg_ohe = OneHotEncoder(categories='auto')
leg_feature_arr = leg_ohe.fit_transform(poke_df[['Lgnd_Label']]).toarray()

leg_feature_labels = ['Legendary_' + str(cls_label) for cls_label in leg_le.classes_]
leg_features = pd.DataFrame(leg_feature_arr, columns=leg_feature_labels)

In general, you can always encode both the features together using the `fit_transform(…)` function by passing it a two dimensional array of the two features together (Check out the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)!). But we encode each feature separately, to make things easier to understand. Besides this, we can also create separate data frames and label them accordingly. Let’s now concatenate these feature frames and see the final result.

In [None]:
poke_df_ohe = pd.concat([poke_df_sub, gen_features, leg_features], axis=1)
columns = sum([['Name', 'Generation', 'Gen_Label'],
               gen_feature_labels, ['Legendary', 'Lgnd_Label'],
               leg_feature_labels], [])
poke_df_ohe[columns].iloc[4:10]

Unnamed: 0,Name,Generation,Gen_Label,Gen 1,Gen 2,Gen 3,Gen 4,Gen 5,Gen 6,Legendary,Lgnd_Label,Legendary_False,Legendary_True
4,Octillery,Gen 2,1,0.0,1.0,0.0,0.0,0.0,0.0,False,0,1.0,0.0
5,Helioptile,Gen 6,5,0.0,0.0,0.0,0.0,0.0,1.0,False,0,1.0,0.0
6,Dialga,Gen 4,3,0.0,0.0,0.0,1.0,0.0,0.0,True,1,0.0,1.0
7,DeoxysDefense Forme,Gen 3,2,0.0,0.0,1.0,0.0,0.0,0.0,True,1,0.0,1.0
8,Rapidash,Gen 1,0,1.0,0.0,0.0,0.0,0.0,0.0,False,0,1.0,0.0
9,Swanna,Gen 5,4,0.0,0.0,0.0,0.0,1.0,0.0,False,0,1.0,0.0


Thus you can see that **6** dummy variables or binary features have been created for Generation and **2** for Legendary since those are the total number of distinct categories in each of these attributes respectively. <strong><em>Active</em></strong> state of a category is indicated by the **1** value in one of these dummy variables which is quite evident from the above data frame.

Suppose you built this encoding scheme on your training data and trained a model, and now you have some new data which has to be engineered into features before prediction, as follows.

In [None]:
new_poke_df = pd.DataFrame([['PikaZoom', 'Gen 3', True],
                           ['CharMyToast', 'Gen 4', False]],
                       columns=['Name', 'Generation', 'Legendary'])
new_poke_df

Unnamed: 0,Name,Generation,Legendary
0,PikaZoom,Gen 3,True
1,CharMyToast,Gen 4,False


You can leverage `scikit-learn`’s excellent API here by calling the `transform(…)` function of the previously built `LabelEncoder` and `OneHotEncoder` objects on the new data. Remember our workflow: first we do the <strong><em>transformation</em></strong>.

In [None]:
new_gen_labels = gen_le.transform(new_poke_df['Generation']) #Use the same encoder to encode the new data
new_poke_df['Gen_Label'] = new_gen_labels

new_leg_labels = leg_le.transform(new_poke_df['Legendary'])    #Use the same encoder to encode the new data
new_poke_df['Lgnd_Label'] = new_leg_labels

new_poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary', 'Lgnd_Label']]

Unnamed: 0,Name,Generation,Gen_Label,Legendary,Lgnd_Label
0,PikaZoom,Gen 3,2,True,1
1,CharMyToast,Gen 4,3,False,0


Once we have numerical labels, let’s apply the encoding scheme now!

In [None]:
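# Option 1: matches gen_ohe as fitted earlier on the raw 'Generation' strings
new_gen_feature_arr = gen_ohe.transform(new_poke_df[['Generation']]).toarray()

# Option 2: only valid if gen_ohe was fitted on 'Gen_Label' instead
# new_gen_feature_arr = gen_ohe.transform(new_poke_df[['Gen_Label']]).toarray()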
new_gen_features = pd.DataFrame(new_gen_feature_arr, columns=gen_feature_labels)

new_leg_feature_arr = leg_ohe.transform(new_poke_df[['Lgnd_Label']]).toarray()
new_leg_features = pd.DataFrame(new_leg_feature_arr, columns=leg_feature_labels)

new_poke_ohe = pd.concat([new_poke_df, new_gen_features, new_leg_features], axis=1)
columns = sum([['Name', 'Generation', 'Gen_Label'], gen_feature_labels, ['Legendary', 'Lgnd_Label'], leg_feature_labels], [])

new_poke_ohe[columns]

Unnamed: 0,Name,Generation,Gen_Label,Gen 1,Gen 2,Gen 3,Gen 4,Gen 5,Gen 6,Legendary,Lgnd_Label,Legendary_False,Legendary_True
0,PikaZoom,Gen 3,2,0.0,0.0,1.0,0.0,0.0,0.0,True,1,0.0,1.0
1,CharMyToast,Gen 4,3,0.0,0.0,0.0,1.0,0.0,0.0,False,0,1.0,0.0


Thus you can see it’s quite easy to apply this scheme to new data by leveraging `scikit-learn`’s powerful API.
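New data may contain a category that was never seen during fitting. By default `OneHotEncoder.transform` raises an error in that case; with `handle_unknown="ignore"` the unseen category is encoded as all zeros instead. A minimal sketch on hypothetical data ("Gen 7" is invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and new data.
train = pd.DataFrame({"Generation": ["Gen 1", "Gen 2", "Gen 3"]})
new = pd.DataFrame({"Generation": ["Gen 2", "Gen 7"]})  # "Gen 7" was never seen

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)
encoded = ohe.transform(new).toarray()

print(encoded.tolist())  # [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```

Whether silently mapping unknown categories to zeros is acceptable depends on the model and the application.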

You can also apply the one-hot encoding scheme easily by leveraging the `get_dummies(…)` function from `pandas`.

In [None]:
gen_onehot_features = pd.get_dummies(poke_df['Generation'])
pd.concat([poke_df[['Name', 'Generation']], gen_onehot_features], axis=1).iloc[4:10]

Unnamed: 0,Name,Generation,Gen 1,Gen 2,Gen 3,Gen 4,Gen 5,Gen 6
4,Octillery,Gen 2,0,1,0,0,0,0
5,Helioptile,Gen 6,0,0,0,0,0,1
6,Dialga,Gen 4,0,0,0,1,0,0
7,DeoxysDefense Forme,Gen 3,0,0,1,0,0,0
8,Rapidash,Gen 1,1,0,0,0,0,0
9,Swanna,Gen 5,0,0,0,0,1,0


The above data frame depicts the one-hot encoding scheme applied to the `Generation` attribute, and the results match the earlier ones, as expected.

### 3.2) Dummy Coding Scheme ($m$ labels -> $m-1$ binary features) **(<font color='red'>get_dummies</font> function with <font color='red'>drop_first</font>, or dropping the last column manually)**

The dummy coding scheme is similar to the one-hot encoding scheme, except that when it is applied to a categorical feature with $m$ distinct labels, we get $m - 1$ binary features. Thus each value of the categorical variable gets converted into a vector of size $m - 1$. The extra feature is completely disregarded: if the category values range over $\{0, 1, \ldots, m-1\}$, **either** the $0$th (first label) **or** the $(m-1)$th (last label) feature column is dropped, and the corresponding category is represented by a vector of all zeros ($0$). Let’s try applying the dummy coding scheme to Pokémon Generation by dropping the first level binary encoded feature (`Gen 1`).

In [None]:
gen_dummy_features = pd.get_dummies(poke_df['Generation'], drop_first=True)
pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features],  axis=1).iloc[4:10]

Unnamed: 0,Name,Generation,Gen 2,Gen 3,Gen 4,Gen 5,Gen 6
4,Octillery,Gen 2,1,0,0,0,0
5,Helioptile,Gen 6,0,0,0,0,1
6,Dialga,Gen 4,0,0,1,0,0
7,DeoxysDefense Forme,Gen 3,0,1,0,0,0
8,Rapidash,Gen 1,0,0,0,0,0
9,Swanna,Gen 5,0,0,0,1,0


If you want, you can also choose to drop the last level binary encoded feature (`Gen 6`) as follows.

In [None]:
gen_onehot_features = pd.get_dummies(poke_df['Generation'])
gen_dummy_features = gen_onehot_features.iloc[:,:-1]
pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features], axis=1).iloc[4:10]

Unnamed: 0,Name,Generation,Gen 1,Gen 2,Gen 3,Gen 4,Gen 5
4,Octillery,Gen 2,0,1,0,0,0
5,Helioptile,Gen 6,0,0,0,0,0
6,Dialga,Gen 4,0,0,0,1,0
7,DeoxysDefense Forme,Gen 3,0,0,1,0,0
8,Rapidash,Gen 1,1,0,0,0,0
9,Swanna,Gen 5,0,0,0,0,1


Based on the above depictions, it is quite clear that categories belonging to the dropped feature are represented as a vector of zeros (**0**) like we discussed earlier.
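The same dummy coding can be obtained from `scikit-learn` by passing `drop="first"` to `OneHotEncoder`, which is convenient when the encoder must be reused on new data. A sketch on a hypothetical toy frame:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame, for illustration only.
toy = pd.DataFrame({"Generation": ["Gen 1", "Gen 2", "Gen 3", "Gen 1"]})

# drop="first" removes the column of the first category (Gen 1) per feature.
dummy_enc = OneHotEncoder(drop="first")
dummy_arr = dummy_enc.fit_transform(toy[["Generation"]]).toarray()

print(dummy_arr.tolist())
# [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
```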

### 3.3) Effect Coding Scheme

The effect coding scheme is very similar to the dummy coding scheme, with one difference: the category that is represented by a vector of all **0** values in the dummy coding scheme is represented by a vector of all **-1** values in the effect coding scheme. This will become clearer with the following example.

In [None]:
# One-hot encode with a signed integer dtype so that -1 can be stored
# (the default dtype of get_dummies is unsigned or boolean, depending on the pandas version).
gen_onehot_features = pd.get_dummies(poke_df['Generation'], dtype='int8')
# Drop the last level (Gen 6); .copy() gives an independent frame that is safe to modify.
gen_effect_features = gen_onehot_features.iloc[:, :-1].copy()
# Rows that are all zeros belong to the dropped level: set the whole row to -1.
gen_effect_features.loc[np.all(gen_effect_features == 0, axis=1)] = -1
pd.concat([poke_df[['Name', 'Generation']], gen_effect_features], axis=1).iloc[4:10]

Unnamed: 0,Name,Generation,Gen 1,Gen 2,Gen 3,Gen 4,Gen 5
4,Octillery,Gen 2,0,1,0,0,0
5,Helioptile,Gen 6,-1,-1,-1,-1,-1
6,Dialga,Gen 4,0,0,0,1,0
7,DeoxysDefense Forme,Gen 3,0,0,1,0,0
8,Rapidash,Gen 1,1,0,0,0,0
9,Swanna,Gen 5,0,0,0,0,1


The above output clearly shows that the Pokémon belonging to `Gen 6` are now represented by a vector of **-1** values, as compared to zeros in dummy coding.
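The steps of effect coding can be wrapped in a small helper that lets you choose the reference level explicitly. The function name `effect_code` and the toy series are hypothetical; this is a sketch of the idea rather than a library API.

```python
import pandas as pd

def effect_code(series, reference):
    """Effect-code a categorical series; rows of the reference level become all -1."""
    # Signed dtype is required so that -1 can be stored.
    dummies = pd.get_dummies(series, dtype="int8").drop(columns=reference)
    dummies.loc[series == reference, :] = -1
    return dummies

# Hypothetical toy series, for illustration only.
toy_gen = pd.Series(["Gen 1", "Gen 3", "Gen 2", "Gen 3"])
effect = effect_code(toy_gen, reference="Gen 3")

print(effect.values.tolist())  # [[1, 0], [-1, -1], [0, 1], [-1, -1]]
```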

### 3.4) Bin-counting Scheme

The encoding schemes we have discussed so far work quite well on categorical data in general, but they start causing problems when the number of distinct categories in a feature becomes very large. Essentially, for any categorical feature with $m$ distinct labels, you get $m$ separate features. This can easily inflate the feature set, causing storage issues and making model training more expensive in time and memory. Besides this, we also have to deal with what is popularly known as the ‘[curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality)’: with an enormous number of features and not enough representative samples, model performance starts to suffer, often leading to overfitting.



<figure>
<center>
<img src='https://www.it.kmitl.ac.th/~teerapong/Exxon/Week1/images/Lab32_curse_of_dim.png' alt='Curse of Dimensionality'/>
<figcaption><em>Fig. 4: Curse of Dimensionality</em></figcaption></center>
</figure>

Hence we need to look towards other categorical data feature engineering schemes for features having a large number of possible categories (like IP addresses). The bin-counting scheme is a useful scheme for dealing with categorical variables having many categories. In this scheme, *instead of using the actual label values for encoding, we use probability-based statistical information about the value and the actual target or response value which we aim to predict in our modeling efforts*.

A simple example would be based on historical data for IP addresses and the ones which were used in DDoS attacks; we can estimate the probability of a DDoS attack being caused by each IP address. Using this information, we can encode an input feature which states, if the same IP address appears in the future, the probability of a DDoS attack being caused by it. This scheme needs historical data as a prerequisite and is an elaborate one. Depicting it with a complete example is difficult here, but there are several online resources which you can refer to.
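A very small sketch of the idea, on a hypothetical log of IP addresses with a binary `is_attack` target: each IP is replaced by the observed attack rate for that IP. The column names, the data, and the fallback to the global rate for unseen IPs are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical historical log; 1 means the request was part of an attack.
log = pd.DataFrame({
    "ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.2", "10.0.0.3"],
    "is_attack": [1, 1, 0, 1, 0, 0],
})

# Bin counting: estimate P(attack | ip) from the historical counts.
attack_rate = log.groupby("ip")["is_attack"].mean()
log["ip_attack_rate"] = log["ip"].map(attack_rate)

# Encoding new data: unseen IPs fall back to the global attack rate (an assumption).
global_rate = log["is_attack"].mean()
new_ips = pd.Series(["10.0.0.2", "10.0.0.9"])
new_encoded = new_ips.map(attack_rate).fillna(global_rate)

print(attack_rate.to_dict())
print(new_encoded.tolist())
```

In practice the statistics should be computed on data separate from the rows being encoded for training; otherwise the target leaks into the feature.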

### 3.5) Feature Hashing Scheme
The feature hashing scheme is another useful feature engineering scheme for dealing with large-scale categorical features. In this scheme, a hash function is typically used with the number of encoded features pre-set (as a vector of pre-defined length), such that the hashed values of the features are used as indices into this pre-defined vector and the values at those indices are updated accordingly. Since a hash function maps a large number of values into a small finite set of values, multiple different values might produce the same hash; these are termed collisions. Typically, a signed hash function is used, so that the sign of the hash value determines the sign of the value stored at the corresponding index of the final feature vector. This way, colliding values tend to cancel out rather than accumulate, which reduces the error introduced by collisions.

Hashing schemes work on strings, numbers and other structures like vectors. You can think of hashed outputs as a finite set of $b$ bins, such that when the hash function is applied to the same values/categories, they get assigned to the same bin (or subset of bins) out of the $b$ bins based on the hash value. We can pre-define the value of $b$, which becomes the final size of the encoded feature vector for each categorical attribute that we encode using the feature hashing scheme.
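The bin assignment can be sketched by hand. The function below is a hypothetical illustration that uses MD5 from the standard library as a stand-in hash (it is not the hash function used by `scikit-learn`): some bytes of the digest pick the bin, and one further bit picks the sign.

```python
import hashlib
import numpy as np

def hash_encode(value, n_bins=6):
    """Map a string to a signed one-hot-like vector of length n_bins (illustrative only)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "little") % n_bins  # which bin to update
    sign = 1 if digest[4] & 1 == 0 else -1                  # signed hashing
    vec = np.zeros(n_bins)
    vec[index] += sign
    return vec

# The same value always lands in the same bin with the same sign.
print(hash_encode("Platform"))
print(hash_encode("Racing"))
```

Because there are only `n_bins` positions, two different values may end up in the same bin; that is the collision mentioned above.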

Thus even if we have over **1000** distinct categories in a feature and we set $b=10$ as the final feature vector size, the output feature set will still have only **10** features as compared to **1000** binary features if we used a one-hot encoding scheme. Let’s consider the `Genre` attribute in our video game dataset.

In [None]:
unique_genres = np.unique(vg_df[['Genre']])
print("Total game genres:", len(unique_genres))
print(unique_genres)

Total game genres: 12
['Action' 'Adventure' 'Fighting' 'Misc' 'Platform' 'Puzzle' 'Racing'
 'Role-Playing' 'Shooter' 'Simulation' 'Sports' 'Strategy']


We can see that there are a total of 12 genres of video games. If we used a one-hot encoding scheme on the `Genre` feature, we would end up with 12 binary features. Instead, we will now use a feature hashing scheme by leveraging `scikit-learn`’s `FeatureHasher` class, which uses a signed 32-bit version of the *Murmurhash3* hash function. We will pre-define the final feature vector size to be **6** in this case.

In [None]:
from sklearn.feature_extraction import FeatureHasher

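# input_type='string' expects each sample to be an iterable of string tokens.
# Splitting each genre into its characters reproduces what older scikit-learn
# versions did with a bare string; newer versions raise an error for bare strings.
fh = FeatureHasher(n_features=6, input_type='string')
hashed_features = fh.fit_transform(vg_df['Genre'].apply(list))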
hashed_features = hashed_features.toarray()
pd.concat([vg_df[['Name', 'Genre']], pd.DataFrame(hashed_features)], axis=1).iloc[1:7]

Unnamed: 0,Name,Genre,0,1,2,3,4,5
1,Super Mario Bros.,Platform,0.0,2.0,2.0,-1.0,1.0,0.0
2,Mario Kart Wii,Racing,-1.0,0.0,0.0,0.0,0.0,-1.0
3,Wii Sports Resort,Sports,-2.0,2.0,0.0,-2.0,0.0,0.0
4,Pokemon Red/Pokemon Blue,Role-Playing,-1.0,1.0,2.0,0.0,1.0,-1.0
5,Tetris,Puzzle,0.0,1.0,1.0,-2.0,1.0,-1.0
6,New Super Mario Bros.,Platform,0.0,2.0,2.0,-1.0,1.0,0.0


Based on the above output, the `Genre` categorical attribute has been encoded using the hashing scheme into **6** features instead of **12**. We can also see that rows **1** and **6**, which denote the same genre of games, <em><strong>Platform</strong></em>, have rightly been encoded into the same feature vector.
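`FeatureHasher` with `input_type='string'` treats every sample as an iterable of string tokens. A variant that hashes each genre as one single token, so that every row has exactly one non-zero entry, might look like the sketch below. The list `toy_genres` is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

# Hypothetical subset of genres, for illustration only.
toy_genres = ["Platform", "Racing", "Platform", "Puzzle"]

fh_token = FeatureHasher(n_features=6, input_type="string")
# Wrap each genre in a list so the whole string is hashed as one token.
hashed = fh_token.transform([[g] for g in toy_genres]).toarray()

print(hashed)
```

With only 6 bins and a single token per row, distinct genres can collide in the same bin, so a larger `n_features` is usually chosen for real data.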

## Conclusion

These examples should give you a good idea about popular strategies for feature engineering on discrete, categorical data. If you read [Notebook 1](https://drive.google.com/file/d/1wX62EW9LPZjJ7AqKWkMgdRrBRvoy0cZV/view?usp=sharing) of this class, you would have seen that it is slightly challenging to work with categorical data as compared to continuous, numeric data but definitely interesting!

We also talked about some ways to handle large feature spaces using feature engineering but you should also remember that there are other techniques including [feature selection](https://en.wikipedia.org/wiki/Feature_selection) and [dimensionality reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction) methods to handle large feature spaces. We will cover some of these methods in a later article.