# Import necessary dependencies and settings

In [27]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Transforming Nominal Features

Nominal attributes consist of discrete categorical values with no inherent order among them. The idea here is to transform these attributes into a numerical representation that downstream code and pipelines can consume. Let’s look at a new dataset pertaining to video game sales.

In [118]:
vg_df = pd.read_csv(r'F:\Programacion\1.BOOTCAMP\data\general_dfs\vgsales.csv', encoding='utf-8')
vg_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].iloc[1:7]

Unnamed: 0,Name,Platform,Year,Genre,Publisher
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo
5,Tetris,GB,1989.0,Puzzle,Nintendo
6,New Super Mario Bros.,DS,2006.0,Platform,Nintendo


### Get the list of unique video game genres 

In [87]:
names = list(vg_df.Name.unique())
genres = list(vg_df.Genre.unique())
genres= pd.DataFrame(genres, columns= ['Genres'])

genres

Unnamed: 0,Genres
0,Sports
1,Platform
2,Racing
3,Role-Playing
4,Puzzle
5,Misc
6,Shooter
7,Simulation
8,Action
9,Fighting


The dataset contains 12 distinct video game genres; the table above shows only the first ten.

### We can now generate a label encoding scheme that maps each category to a numeric value by leveraging scikit-learn’s LabelEncoder

In [64]:
# One-hot encode the Genre column; the result is a sparse matrix with one column per genre
sth = OneHotEncoder().fit_transform(vg_df[['Genre']])
sth

<16598x12 sparse matrix of type '<class 'numpy.float64'>'
	with 16598 stored elements in Compressed Sparse Row format>

In [88]:
from collections import defaultdict

# One LabelEncoder per column, created on first access by column name
d = defaultdict(LabelEncoder)

# Fit an encoder for each column and encode its values
fit = genres.apply(lambda x: d[x.name].fit_transform(x))

# Sanity check: decoding the labels recovers the original genre names
fit.apply(lambda x: d[x.name].inverse_transform(x))

# Reuse the fitted encoder to add the labels as a new column
genres['MkII'] = d['Genres'].transform(genres['Genres'])

In [89]:
genres

Unnamed: 0,Genres,MkII
0,Sports,10
1,Platform,4
2,Racing,6
3,Role-Playing,7
4,Puzzle,5
5,Misc,3
6,Shooter,8
7,Simulation,9
8,Action,0
9,Fighting,2
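As a simpler illustration of the same idea, the sketch below fits a single `LabelEncoder` directly on a list of genre names and builds a lookup from each numeric label back to its genre. The names `toy_genres`, `gle`, and `genre_mappings` are hypothetical and the five values are made up for the example; `LabelEncoder` assigns labels in sorted order of the class names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample; in the notebook this would be vg_df['Genre']
toy_genres = ['Sports', 'Platform', 'Racing', 'Sports', 'Action']

gle = LabelEncoder()
genre_labels = gle.fit_transform(toy_genres)

# classes_ holds the sorted class names; the position of a name is its label
genre_mappings = {index: label for index, label in enumerate(gle.classes_)}

print(genre_labels)
print(genre_mappings)
```

Because the labels follow alphabetical order, they carry no meaning beyond identity, which is why a further one-hot step is often applied to nominal features.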


In [120]:
vg_df.insert(4, 'Genres_encoded', 0) 

In [124]:
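# Write each genre's label into the matching rows.
# A single .loc call avoids chained assignment, which may silently fail to update vg_df.
for genre, code in zip(genres['Genres'], genres['MkII']):
    vg_df.loc[vg_df['Genre'] == genre, 'Genres_encoded'] = code
print('Done')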

Done


In [123]:
vg_df

Unnamed: 0,Rank,Name,Platform,Year,Genres_encoded,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,10.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,4.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,6.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,10.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.00
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,7.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.00,31.37
...,...,...,...,...,...,...,...,...,...,...,...,...
16593,16596,Woody Woodpecker in Crazy Castle 5,GBA,2002.0,4.0,Platform,Kemco,0.01,0.00,0.00,0.00,0.01
16594,16597,Men in Black II: Alien Escape,GC,2003.0,8.0,Shooter,Infogrames,0.01,0.00,0.00,0.00,0.01
16595,16598,SCORE International Baja 1000: The Official Game,PS2,2008.0,6.0,Racing,Activision,0.00,0.00,0.00,0.00,0.01
16596,16599,Know How 2,DS,2010.0,5.0,Puzzle,7G//AMES,0.00,0.01,0.00,0.00,0.01


In [125]:
vg_df.Genres_encoded.value_counts()

0.0     3316
10.0    2346
3.0     1739
7.0     1488
8.0     1310
1.0     1286
6.0     1249
4.0      886
9.0      867
2.0      848
11.0     681
5.0      582
Name: Genres_encoded, dtype: int64

### Show the transformed label values and the dataframe


# Transforming Ordinal Features

Ordinal attributes are categorical attributes whose values have a meaningful order. Let’s consider the Pokémon dataset, focusing specifically on the Generation attribute.


In [50]:
poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
poke_df = poke_df.sample(random_state=1, frac=1).reset_index(drop=True)




### Show the different generations present in the dataset

In general, no module or function can infer the order of such values automatically, so the ordering has to be supplied by hand. Hence we can use a custom encoding/mapping scheme based on a dictionary. (scikit-learn’s OrdinalEncoder likewise needs the order passed explicitly through its categories parameter.)

In [51]:
gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3, 
               'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}

# map the values to the dataframe
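One way to apply such a dictionary is the pandas `Series.map` method. The sketch below uses a small made-up DataFrame in place of `poke_df`; the name `toy_df` and the column name `GenerationLabel` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for poke_df; these rows are made up for illustration
toy_df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Generation': ['Gen 3', 'Gen 1', 'Gen 6', 'Gen 1'],
})

gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3,
               'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}

# Series.map looks each value up in the dictionary;
# a value missing from the dictionary would become NaN
toy_df['GenerationLabel'] = toy_df['Generation'].map(gen_ord_map)
print(toy_df)
```

Unlike `LabelEncoder`, this keeps the numeric order under your control, so Gen 1 < Gen 2 < ... < Gen 6 holds by construction.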


# Encoding Categorical Features

## One-hot Encoding Scheme

In [52]:
poke_df[['Name', 'Generation', 'Legendary']].iloc[4:10]


In [53]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

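# transform and map pokemon generations with LabelEncoder
gen_le = LabelEncoder()
poke_df['Gen_Label'] = gen_le.fit_transform(poke_df['Generation'])

# transform and map pokemon legendary status with LabelEncoder
leg_le = LabelEncoder()
poke_df['Lgnd_Label'] = leg_le.fit_transform(poke_df['Legendary'])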


The features Gen_Label and Lgnd_Label now depict the numeric representations of our categorical features. Let’s now apply the one-hot encoding scheme on these features.

In [54]:
# encode generation labels using one-hot encoding scheme


# encode legendary status labels using one-hot encoding scheme


Unnamed: 0,Name,Generation,Gen_Label,Gen 1,Gen 2,Gen 3,Gen 4,Gen 5,Gen 6,Legendary,Lgnd_Label,Legendary_False,Legendary_True
4,Octillery,Gen 2,1,0.0,1.0,0.0,0.0,0.0,0.0,False,0,1.0,0.0
5,Helioptile,Gen 6,5,0.0,0.0,0.0,0.0,0.0,1.0,False,0,1.0,0.0
6,Dialga,Gen 4,3,0.0,0.0,0.0,1.0,0.0,0.0,True,1,0.0,1.0
7,DeoxysDefense Forme,Gen 3,2,0.0,0.0,1.0,0.0,0.0,0.0,True,1,0.0,1.0
8,Rapidash,Gen 1,0,1.0,0.0,0.0,0.0,0.0,0.0,False,0,1.0,0.0
9,Swanna,Gen 5,4,0.0,0.0,0.0,0.0,1.0,0.0,False,0,1.0,0.0
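A table like the one above could be produced roughly as follows. This is a minimal sketch on made-up rows, not the notebook's own cell: `toy_df`, `gen_ohe`, and `gen_features` are hypothetical names, and only the Generation attribute is encoded to keep the example short.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy stand-in for poke_df; these rows are made up for illustration
toy_df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Generation': ['Gen 2', 'Gen 6', 'Gen 4', 'Gen 2'],
})

# Step 1: label-encode the generation (classes are sorted, so 'Gen 2' -> 0 here)
gen_le = LabelEncoder()
toy_df['Gen_Label'] = gen_le.fit_transform(toy_df['Generation'])

# Step 2: one-hot encode the integer labels; the encoder expects 2-D input
gen_ohe = OneHotEncoder()
gen_feature_arr = gen_ohe.fit_transform(toy_df[['Gen_Label']]).toarray()

# Name the new columns after the original categories
gen_features = pd.DataFrame(gen_feature_arr,
                            columns=list(gen_le.classes_),
                            index=toy_df.index)

toy_ohe = pd.concat([toy_df, gen_features], axis=1)
print(toy_ohe)
```

Each row ends up with exactly one 1.0 among the generation columns, in the column matching its category.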


Suppose you built this encoding scheme on your training data and trained a model, and you now have new data whose features must be engineered in the same way before prediction, as follows.

In [55]:
new_poke_df = pd.DataFrame([['PikaZoom', 'Gen 3', True], 
                           ['CharMyToast', 'Gen 4', False]],
                           columns=['Name', 'Generation', 'Legendary'])
new_poke_df

Unnamed: 0,Name,Generation,Legendary
0,PikaZoom,Gen 3,True
1,CharMyToast,Gen 4,False


In [56]:
new_gen_labels = gen_le.transform(new_poke_df['Generation'])
new_poke_df['Gen_Label'] = new_gen_labels

new_leg_labels = leg_le.transform(new_poke_df['Legendary'])
new_poke_df['Lgnd_Label'] = new_leg_labels

new_poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary', 'Lgnd_Label']]


You can leverage scikit-learn’s API here by calling the transform(…) function of the previously built LabelEncoder and OneHotEncoder objects on the new data.
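The fit-once, transform-later pattern can be sketched as below. Note the assumptions: the example fits `OneHotEncoder` directly on the string column rather than on label-encoded integers, the `train` and `new` frames are made up, and `handle_unknown='ignore'` is an optional setting added here so that a category unseen during fitting yields an all-zero row instead of an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up training and new data for illustration
train = pd.DataFrame({'Generation': ['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4']})
new = pd.DataFrame({'Generation': ['Gen 3', 'Gen 4']})

# Fit once on the training data only
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train[['Generation']])

# Reuse the fitted encoder: call transform, never fit again on new data
new_features = pd.DataFrame(
    ohe.transform(new[['Generation']]).toarray(),
    columns=ohe.categories_[0],
    index=new.index,
)
print(new_features)
```

Refitting on the new data would produce only two columns here, which would no longer line up with the features the model was trained on.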

### You can also apply the one-hot encoding scheme easily by leveraging the get_dummies(…) function from pandas.
Use it on the Generation column of poke_df.
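A minimal sketch of that call, using a made-up frame in place of `poke_df` (the names `toy_df` and `gen_onehot_features` are hypothetical):

```python
import pandas as pd

# Toy stand-in for poke_df; these rows are made up for illustration
toy_df = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Generation': ['Gen 2', 'Gen 6', 'Gen 2'],
})

# One indicator column per distinct value, in sorted order
gen_onehot_features = pd.get_dummies(toy_df['Generation'])

result = pd.concat([toy_df[['Name', 'Generation']], gen_onehot_features], axis=1)
print(result)
```

The indicator columns may be boolean or integer depending on the pandas version. Keep in mind that `get_dummies` has no fitted state, so the columns it creates depend on the values present in whatever data you pass in.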

## Dummy Coding Scheme

Let’s apply the dummy coding scheme to the Pokémon Generation attribute by dropping the first-level binary encoded feature (Gen 1).


In [57]:
gen_dummy_features = pd.get_dummies(poke_df['Generation'], drop_first=True)
pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features], axis=1).iloc[4:10]


If you prefer, you can instead drop the last-level binary encoded feature (Gen 6).
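`get_dummies` only offers `drop_first`, so one way to drop the last level is to build the full one-hot frame and slice off its final column. The sketch below assumes a made-up `toy_df` with three generations, so the dropped level is Gen 3 rather than Gen 6.

```python
import pandas as pd

# Toy stand-in for poke_df; these rows are made up for illustration
toy_df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Generation': ['Gen 1', 'Gen 2', 'Gen 3', 'Gen 1'],
})

gen_onehot_features = pd.get_dummies(toy_df['Generation'])

# Keep every column except the last one (the highest level in sorted order)
gen_dummy_features = gen_onehot_features.iloc[:, :-1]

print(pd.concat([toy_df, gen_dummy_features], axis=1))
```

Rows belonging to the dropped level are then represented by all zeros, which is what makes that level the reference category.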

## Feature Hashing scheme

Find the number of distinct 'Genre' values in the dataset.

Total game genres: 12
['Action' 'Adventure' 'Fighting' 'Misc' 'Platform' 'Puzzle' 'Racing'
 'Role-Playing' 'Shooter' 'Simulation' 'Sports' 'Strategy']


We can see that there are a total of 12 genres of video games. If we used a one-hot encoding scheme on the Genre feature, we would end up with 12 binary features. Instead, we will now use a feature hashing scheme by leveraging scikit-learn’s FeatureHasher class, which uses a signed 32-bit version of the MurmurHash3 hash function. We will pre-define the final feature vector size to be 6 in this case.
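A minimal sketch of that step, assuming a small made-up series in place of `vg_df['Genre']` (the names `genre_col`, `fh`, and `hashed_df` are hypothetical):

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Made-up sample; in the notebook this would be vg_df['Genre']
genre_col = pd.Series(['Sports', 'Platform', 'Racing', 'Sports', 'Role-Playing'],
                      name='Genre')

# n_features fixes the output width regardless of how many genres exist
fh = FeatureHasher(n_features=6, input_type='string')

# With input_type='string' each sample must be an iterable of strings,
# so every genre is wrapped in a one-element list
hashed = fh.transform([[g] for g in genre_col]).toarray()

hashed_df = pd.DataFrame(hashed, index=genre_col.index)
print(pd.concat([genre_col, hashed_df], axis=1))
```

Each row has a single non-zero entry of +1 or -1, and identical genres always hash to the same row. With only 6 columns for 12 genres, some distinct genres may land in the same column; that collision risk is the price of the smaller feature vector.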