# Import necessary dependencies and settings

In [3]:
# import pandas and numpy
import pandas as pd
import numpy as np

# Transforming Nominal Features

Nominal attributes consist of discrete categorical values with no notion or sense of order amongst them. The idea here is to transform these attributes into a more representative numerical format which can be easily understood by downstream code and pipelines. Let’s look at a new dataset pertaining to video game sales.

In [16]:
# read 'vgsales.csv'
# show the first 6 rows of the columns 'Name', 'Platform', 'Year', 'Genre', 'Publisher'
df = pd.read_csv("..\\día_3\\ficheros_FE_categoricas\\vgsales.csv", encoding='utf-8')
df.loc[:,'Name':'Publisher'].head(6)

Unnamed: 0,Name,Platform,Year,Genre,Publisher
0,Wii Sports,Wii,2006.0,Sports,Nintendo
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo
5,Tetris,GB,1989.0,Puzzle,Nintendo


In [17]:
df_sort = df.loc[:,'Name':'Publisher'].copy()

### Get the list of unique video game genres 

In [21]:
generos = df['Genre'].unique()
generos

array(['Sports', 'Platform', 'Racing', 'Role-Playing', 'Puzzle', 'Misc',
       'Shooter', 'Simulation', 'Action', 'Fighting', 'Adventure',
       'Strategy'], dtype=object)

This tells us that we have 12 distinct video game genres. 

### We can now generate a label encoding scheme that maps each category to a numeric value by leveraging scikit-learn's LabelEncoder

In [70]:
# using LabelEncoder, show the genres and the category code associated with each genre
from sklearn import preprocessing

le = preprocessing.LabelEncoder().fit(generos)
print(le.classes_)
# Here we only encode the unique genre values; the same can be done on the full Series of the DataFrame
np_le = le.transform(generos)
print("\n")
np_le

['Action' 'Adventure' 'Fighting' 'Misc' 'Platform' 'Puzzle' 'Racing'
 'Role-Playing' 'Shooter' 'Simulation' 'Sports' 'Strategy']




array([10,  4,  6,  7,  5,  3,  8,  9,  0,  2,  1, 11])

In [52]:
# Example of the inverse of transform
list(le.inverse_transform(np_le))[-15:]

['Sports',
 'Platform',
 'Racing',
 'Role-Playing',
 'Puzzle',
 'Misc',
 'Shooter',
 'Simulation',
 'Action',
 'Fighting',
 'Adventure',
 'Strategy']

In [56]:
# Applied to the full Series of the DataFrame
np_le_full = preprocessing.LabelEncoder().fit_transform(df['Genre'])

### Show the transformed labels values and the dataframe

In [57]:
# first show only the genres of the DataFrame
df['Genre'].head(10)

0          Sports
1        Platform
2          Racing
3          Sports
4    Role-Playing
5          Puzzle
6        Platform
7            Misc
8        Platform
9         Shooter
Name: Genre, dtype: object

In [69]:
# show the genres and their associated category codes in the DataFrame
df['Game_Category'] = preprocessing.LabelEncoder().fit_transform(df['Genre'])
df.head()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Game_Category
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74,10
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,4
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82,6
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0,10
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,7


In [61]:
df[['Genre','Game_Category']][20:30]

Unnamed: 0,Genre,Game_Category
20,Role-Playing,7
21,Platform,4
22,Platform,4
23,Action,0
24,Action,0
25,Role-Playing,7
26,Role-Playing,7
27,Puzzle,5
28,Racing,6
29,Shooter,8


In [72]:
le.transform(df['Genre'])[20:30]

array([7, 4, 4, 0, 0, 7, 7, 5, 6, 8])
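
The behaviour seen above can be reproduced on a toy list. The following is a minimal sketch, assuming scikit-learn is installed; the `genres` list is made up for illustration and is not read from vgsales.csv. It suggests that LabelEncoder sorts the unique labels before numbering them, that inverse_transform recovers the original strings, and that a label not seen during fit cannot be encoded.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, not read from vgsales.csv
genres = ["Sports", "Platform", "Racing", "Sports", "Puzzle"]

encoder = LabelEncoder().fit(genres)

# classes_ holds the unique labels in sorted order; the position is the code
classes = list(encoder.classes_)
codes = encoder.transform(genres).tolist()
decoded = list(encoder.inverse_transform(codes))

print(classes)
print(codes)
print(decoded)

# A label that was not present during fit raises a ValueError
try:
    encoder.transform(["Strategy"])
    unseen_ok = True
except ValueError:
    unseen_ok = False
print(unseen_ok)
```

Because the codes depend only on the sorted set of labels, fitting on the unique values or on the full column yields the same mapping, which is why `le.transform(df['Genre'])` matches the `Game_Category` column.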


# Transforming Ordinal Features

Ordinal attributes are categorical attributes with a sense of order amongst the values. Let’s consider the Pokémon dataset and focus specifically on the Type 1 attribute. For this exercise, we will assume that each Type 1 value has a different power level, so that the types can be ordered.


In [4]:
# read Pokemon.csv and show a head()
poke = pd.read_csv('..\\..\\semana_9\\día_2\\Pokemon.csv', encoding = 'latin_1', index_col=0)
poke.head()

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Stage,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,2,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,3,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,405,58,64,58,80,65,80,2,False


In [5]:
poke.shape

(151, 12)

In [6]:
# use sample() with seed 1 over the whole DataFrame to shuffle it randomly
# reset the index and show a head()

poke = poke.sample(n=len(poke), random_state = 1)
# With frac, 1 means 100% of the rows and 0.5 means 50%
# Note: replace=True samples with replacement, so rows can repeat; omit it for a pure shuffle
#poke.sample(frac=1, replace = True, random_state=1)

poke.head()

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Stage,Legendary
15,Beedrill,Bug,Poison,395,65,90,40,45,80,75,3,False
99,Kingler,Water,,475,55,130,115,50,50,75,2,False
76,Golem,Rock,Ground,495,80,120,130,55,65,45,3,False
17,Pidgeotto,Normal,Flying,349,63,60,55,50,50,71,2,False
132,Ditto,Normal,,288,48,48,48,48,48,48,1,False


In [7]:
poke_df = poke.reset_index(drop=True)
poke_df

Unnamed: 0,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Stage,Legendary
0,Beedrill,Bug,Poison,395,65,90,40,45,80,75,3,False
1,Kingler,Water,,475,55,130,115,50,50,75,2,False
2,Golem,Rock,Ground,495,80,120,130,55,65,45,3,False
3,Pidgeotto,Normal,Flying,349,63,60,55,50,50,71,2,False
4,Ditto,Normal,,288,48,48,48,48,48,48,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
146,Vaporeon,Water,,525,130,65,60,110,95,65,2,False
147,Omanyte,Rock,Water,355,35,40,100,90,55,35,1,False
148,Tentacruel,Water,Poison,515,80,70,65,80,120,100,2,False
149,Kabutops,Rock,Water,495,60,115,105,65,70,80,2,False
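
The shuffle-then-reset pattern used above can be sketched on a small frame. This is only an illustration: the `toy` DataFrame is hypothetical, and `frac=1` is used here as an alternative way of asking for every row.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Pokémon data
toy = pd.DataFrame({"Name": ["A", "B", "C", "D", "E"],
                    "Stage": [1, 2, 3, 1, 2]})

# frac=1 draws every row exactly once (no replacement), i.e. a shuffle
shuffled = toy.sample(frac=1, random_state=1)

# The original index labels travel with the rows;
# reset_index(drop=True) discards them and renumbers from 0
shuffled_reset = shuffled.reset_index(drop=True)

print(shuffled.index.tolist())
print(shuffled_reset.index.tolist())
```

Without `drop=True`, the old index would be kept as an extra column, which is usually not wanted after a shuffle.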


In [8]:
# show the columns of the DataFrame

In [9]:
poke.columns

Index(['Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense',
       'Sp. Atk', 'Sp. Def', 'Speed', 'Stage', 'Legendary'],
      dtype='object')

### Show the different type 1 present in the dataset

In general, there is no generic module or function that automatically maps and transforms these features into numeric representations based on their order. Hence we can use a custom encoding/mapping scheme based on a dictionary.

In [10]:
# write a dictionary that maps each Type 1 to a number reflecting how good that Type 1 is.
# In other words, assume that these labels can be ordered.
# Use DataFrame['Type 1'].unique() to take the values in that order and assign them 1, 2, 3...
# For example: 'Bug' corresponds to 1, 'Water' corresponds to 2...
print("Different Pokemon Type 1 values")
poke_df['Type 1'].unique()

type_1_map = {'Bug': 1, 'Water': 2, 'Rock': 3, 'Normal': 4, 'Fighting': 5, 'Grass': 6, 'Poison': 7,
       'Fire': 8, 'Ghost': 9, 'Fairy': 10, 'Electric': 11, 'Dragon':12, 'Ground':13,
       'Psychic':14, 'Ice':15}
type_1_map


Different Pokemon Type 1 values


{'Bug': 1,
 'Water': 2,
 'Rock': 3,
 'Normal': 4,
 'Fighting': 5,
 'Grass': 6,
 'Poison': 7,
 'Fire': 8,
 'Ghost': 9,
 'Fairy': 10,
 'Electric': 11,
 'Dragon': 12,
 'Ground': 13,
 'Psychic': 14,
 'Ice': 15}

In [11]:
# map the values in the DataFrame into a column called 'type1_num'
# show a head()
poke_df['type1_num'] = poke_df['Type 1'].map(type_1_map)
poke_df.head()

Unnamed: 0,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Stage,Legendary,type1_num
0,Beedrill,Bug,Poison,395,65,90,40,45,80,75,3,False,1
1,Kingler,Water,,475,55,130,115,50,50,75,2,False,2
2,Golem,Rock,Ground,495,80,120,130,55,65,45,3,False,3
3,Pidgeotto,Normal,Flying,349,63,60,55,50,50,71,2,False,4
4,Ditto,Normal,,288,48,48,48,48,48,48,1,False,4
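
The dictionary-based mapping can be sketched on toy data as follows. The series and the ordering in `assumed_order` are made up for illustration. The second option, an ordered pandas categorical, is not used in this notebook and is shown only as one possible alternative.

```python
import pandas as pd

# Hypothetical toy column and an assumed (made-up) ordering of the types
types = pd.Series(["Bug", "Water", "Rock", "Bug", "Ghost"], name="Type 1")
assumed_order = ["Bug", "Water", "Rock"]

# Option 1: explicit dictionary, with codes starting at 1 as in the notebook
type_map = {label: rank for rank, label in enumerate(assumed_order, start=1)}
mapped = types.map(type_map)

# Labels missing from the dictionary ('Ghost') become NaN instead of raising
print(mapped.tolist())

# Option 2: an ordered categorical; codes start at 0 and unknown labels get -1
ordered = pd.Categorical(types, categories=assumed_order, ordered=True)
print(ordered.codes.tolist())
```

Since `map` silently produces NaN for unmapped labels, it is worth checking `mapped.isna().sum()` after building the dictionary by hand.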


# Encoding Categorical Features

## One-hot Encoding Scheme

In [12]:
poke_df[['Name', 'Stage', 'Legendary']].iloc[4:10]

Unnamed: 0,Name,Stage,Legendary
4,Ditto,1,False
5,Primeape,2,False
6,Aerodactyl,1,False
7,Vileplume,3,False
8,Nidorina,2,False
9,Starmie,2,False


In [13]:
# use LabelEncoder
from sklearn import preprocessing
# transform and map pokemon Type 1 with LabelEncoder
# hint: zip can help
type1_zip = dict(zip(poke['Type 1'].unique(),preprocessing.LabelEncoder().fit_transform(poke['Type 1'].unique())))
print(type1_zip)
poke_df['type1_zip'] = poke_df['Type 1'].map(type1_zip)
poke_df.head()

# transform and map pokemon legendary status with Label Encoder
poke_df['Legendary_zip'] = poke_df['Legendary'].map(dict(zip(poke['Legendary'],preprocessing.LabelEncoder().fit_transform(poke['Legendary']))))
poke_df.head()

{'Bug': 0, 'Water': 14, 'Rock': 13, 'Normal': 10, 'Fighting': 4, 'Grass': 7, 'Poison': 11, 'Fire': 5, 'Ghost': 6, 'Fairy': 3, 'Electric': 2, 'Dragon': 1, 'Ground': 8, 'Psychic': 12, 'Ice': 9}


Unnamed: 0,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Stage,Legendary,type1_num,type1_zip,Legendary_zip
0,Beedrill,Bug,Poison,395,65,90,40,45,80,75,3,False,1,0,0
1,Kingler,Water,,475,55,130,115,50,50,75,2,False,2,14,0
2,Golem,Rock,Ground,495,80,120,130,55,65,45,3,False,3,13,0
3,Pidgeotto,Normal,Flying,349,63,60,55,50,50,71,2,False,4,10,0
4,Ditto,Normal,,288,48,48,48,48,48,48,1,False,4,10,0


In [14]:
# A simpler way, using transform
# This is what fit and transform are for!
# Many transformations are split into fit (learns the parameters of the transformation)
# and transform (applies the changes)

le_pokemon = preprocessing.LabelEncoder()
le_pokemon.fit(poke_df['Type 1'])
poke_df['type1_transformed'] = le_pokemon.transform(poke_df['Type 1'])

In [15]:
# show a head()
poke_df.head()

Unnamed: 0,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Stage,Legendary,type1_num,type1_zip,Legendary_zip,type1_transformed
0,Beedrill,Bug,Poison,395,65,90,40,45,80,75,3,False,1,0,0,0
1,Kingler,Water,,475,55,130,115,50,50,75,2,False,2,14,0,14
2,Golem,Rock,Ground,495,80,120,130,55,65,45,3,False,3,13,0,13
3,Pidgeotto,Normal,Flying,349,63,60,55,50,50,71,2,False,4,10,0,10
4,Ditto,Normal,,288,48,48,48,48,48,48,1,False,4,10,0,10


In [39]:
# check that LabelEncoder assigns its codes in alphabetical order
poke_df['Type 1'].sort_values().unique().reshape(-1,1), poke_df['type1_zip'].sort_values().unique().reshape(-1,1)

(array([['Bug'],
        ['Dragon'],
        ['Electric'],
        ['Fairy'],
        ['Fighting'],
        ['Fire'],
        ['Ghost'],
        ['Grass'],
        ['Ground'],
        ['Ice'],
        ['Normal'],
        ['Poison'],
        ['Psychic'],
        ['Rock'],
        ['Water']], dtype=object), array([[ 0],
        [ 1],
        [ 2],
        [ 3],
        [ 4],
        [ 5],
        [ 6],
        [ 7],
        [ 8],
        [ 9],
        [10],
        [11],
        [12],
        [13],
        [14]], dtype=int64))

In [17]:
# show a head()

The features type1_zip and Legendary_zip now hold the numeric representations of our categorical features. Let’s now apply the one-hot encoding scheme to Type 1 and Legendary using the get_dummies() method.

In [41]:
# encode Type 1 labels using one-hot encoding scheme
one_hot_df_type_1 = pd.get_dummies(poke_df['Type 1'], prefix='Type_1')

# encode legendary status labels using one-hot encoding scheme
one_hot_df_legendary = pd.get_dummies(poke_df['Legendary'], prefix='Legendary')



In [43]:
one_hot_df_type_1.head()

Unnamed: 0,Type_1_Bug,Type_1_Dragon,Type_1_Electric,Type_1_Fairy,Type_1_Fighting,Type_1_Fire,Type_1_Ghost,Type_1_Grass,Type_1_Ground,Type_1_Ice,Type_1_Normal,Type_1_Poison,Type_1_Psychic,Type_1_Rock,Type_1_Water
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


In [44]:
one_hot_df_legendary.head()

Unnamed: 0,Legendary_False,Legendary_True
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0


In [51]:
sum(one_hot_df_legendary['Legendary_True'] == 1 )
sum(one_hot_df_legendary['Legendary_True'] == True )
sum(poke_df['Legendary'] == 1)
sum(poke_df['Legendary'] == True)
sum(poke_df['Legendary'])
# check that there are only 4 legendary Pokémon

sum(poke_df['Legendary']) == sum(one_hot_df_legendary['Legendary_True'] == 1 )


True

In [65]:
# concatenate the original DataFrame with the one-hot encodings of Type 1 and Legendary

df_one_hot = pd.concat([poke_df,one_hot_df_type_1,one_hot_df_legendary], axis=1)
df_one_hot.iloc[:5, df_one_hot.shape[1]-8:]

Unnamed: 0,Type_1_Ice,Type_1_Normal,Type_1_Poison,Type_1_Psychic,Type_1_Rock,Type_1_Water,Legendary_False,Legendary_True
0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,1,1,0
2,0,0,0,0,1,0,1,0
3,0,1,0,0,0,0,1,0
4,0,1,0,0,0,0,1,0
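
The one-hot step can be sketched on a toy frame as follows. The `toy` data is hypothetical. The `dtype=int` argument is an addition made here because recent pandas versions return boolean columns by default, whereas the outputs shown in this notebook are 0/1 integers.

```python
import pandas as pd

# Hypothetical toy frame, not the real Pokémon data
toy = pd.DataFrame({"Name": ["A", "B", "C"],
                    "Type 1": ["Bug", "Water", "Bug"]})

# dtype=int forces 0/1 columns; recent pandas versions default to booleans
one_hot = pd.get_dummies(toy["Type 1"], prefix="Type_1", dtype=int)

# In a one-hot encoding every row has exactly one active column
row_sums = one_hot.sum(axis=1).tolist()

# Attach the encoded columns to the original frame, side by side
combined = pd.concat([toy, one_hot], axis=1)
print(combined)
```

The column names come from the prefix plus each category value, in sorted order of the categories.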


Suppose you built this encoding scheme on your training data and trained a model on it. Now you have new data whose features must be engineered in the same way before making predictions, as follows.

In [66]:

new_poke_df = pd.DataFrame([['PikaZoom', 'Bug', True], 
                           ['CharMyToast', 'Water', False]],
                           columns=['Name', 'Type 1', 'Legendary'])
new_poke_df


Unnamed: 0,Name,Type 1,Legendary
0,PikaZoom,Bug,True
1,CharMyToast,Water,False


In [71]:
# using fit() and transform(), add Type_1_label and Legendary_label to the DataFrame

# fit on the training data, then transform the new data
le_pokemon.fit(poke_df['Type 1'])
new_type1_labels = le_pokemon.transform(new_poke_df['Type 1'])
new_poke_df['Type_1_label'] = new_type1_labels

# a separate encoder for Legendary keeps the Type 1 fit of le_pokemon intact
le_legendary = preprocessing.LabelEncoder().fit(poke_df['Legendary'])
new_lgnd_label = le_legendary.transform(new_poke_df['Legendary'])
new_poke_df['Legendary_label'] = new_lgnd_label

new_poke_df[['Name', 'Type 1', 'Type_1_label', 'Legendary', 'Legendary_label']]

Unnamed: 0,Name,Type 1,Type_1_label,Legendary,Legendary_label
0,PikaZoom,Bug,0,True,1
1,CharMyToast,Water,14,False,0


You can leverage scikit-learn’s excellent API here by calling the transform(…) function of the previously built LabelEncoder objects on the new data.
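
The same concern applies to the one-hot columns: new data usually contains only a few of the categories, so calling get_dummies on it alone yields fewer columns than the model was trained on. One possible way to handle this, sketched here on hypothetical `train` and `new` frames, is to reindex the new encoding against the training columns. This is an assumption about how one might proceed, not a step taken in this notebook.

```python
import pandas as pd

# Hypothetical training and new data
train = pd.DataFrame({"Type 1": ["Bug", "Water", "Rock", "Bug"]})
new = pd.DataFrame({"Type 1": ["Water", "Bug"]})

train_one_hot = pd.get_dummies(train["Type 1"], prefix="Type_1", dtype=int)
new_one_hot = pd.get_dummies(new["Type 1"], prefix="Type_1", dtype=int)

# Align with the training columns: categories absent from the new data
# ('Rock' here) become all-zero columns
new_aligned = new_one_hot.reindex(columns=train_one_hot.columns, fill_value=0)
print(new_aligned)
```

Note that `reindex` also drops any column for a category that never appeared in training, so such rows end up as all zeros.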

## Dummy Coding Scheme

Let’s try applying dummy coding scheme on Pokémon Type 1 by dropping the first level binary encoded feature (Type 1 = Bug).


In [19]:
# use get_dummies to build a dummy coding
# show rows 4 through 9 (inclusive)




If you want, you can also choose to drop the last level binary encoded feature instead.
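
Both variants of dummy coding can be sketched on toy data. The `types` series is hypothetical. `drop_first=True` is a real parameter of get_dummies, while dropping the last level is done here with a plain column selection, which is just one way of doing it.

```python
import pandas as pd

# Hypothetical toy column
types = pd.Series(["Bug", "Water", "Rock", "Bug"], name="Type 1")

# Full one-hot encoding: one column per level
full = pd.get_dummies(types, prefix="Type_1", dtype=int)

# Dummy coding: drop the first level (Type_1_Bug, first in sorted order)
drop_first = pd.get_dummies(types, prefix="Type_1", drop_first=True, dtype=int)

# Dropping the last level instead, by selecting all but the last column
drop_last = full.iloc[:, :-1]

print(drop_first.columns.tolist())
print(drop_last.columns.tolist())
```

Rows belonging to the dropped level are represented by zeros in every remaining column.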

In [20]:
# fit() an encoder on Type 1 and look at the resulting classes



In [21]:
# build the dummies from Type 1 alone without dropping any column
# show a head()



In [22]:
# check against the encoding with the dropped column (dummy)
# hint: isin can help



In [23]:
# check what the ~ operator does



In [24]:
# read the DataFrame using the last statement that uses ~



In [25]:
# assign it to a variable and show a head
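
The `isin` hint and the `~` operator mentioned in the cells above can be illustrated on toy data. This is a sketch on a hypothetical frame, not the solution to the exercise.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"Name": ["A", "B", "C", "D"],
                    "Type 1": ["Bug", "Water", "Rock", "Bug"]})

# isin returns a boolean mask: True where the value is in the given list
is_bug = toy["Type 1"].isin(["Bug"])

# ~ inverts a boolean mask element-wise
not_bug = ~is_bug

# Use the inverted mask to read only the rows that are not 'Bug'
others = toy[not_bug]
print(others)
```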




## Feature Hashing scheme

Find the number of distinct 'Genre' values in the dataset.

In [26]:
# Use vgsales.csv: read it and show a head()




In [27]:
# print('Total game genres: ' + str(len(df_videojuegos.Genre.unique())))
# print(df_videojuegos.Genre.sort_values().unique())

### We can see that there are a total of 12 genres of video games. If we used a one-hot encoding scheme on the Genre feature, we would end up with 12 binary features. Instead, we will now use a feature hashing scheme by leveraging scikit-learn’s FeatureHasher class, which uses a signed 32-bit version of the MurmurHash3 hash function. We will pre-define the final feature vector size to be 6 in this case.
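
A minimal sketch of the hashing step, assuming scikit-learn is installed. The `genres` list is hypothetical and stands in for the Genre column. With `input_type='string'`, each sample must be an iterable of strings, so every genre is wrapped in a one-element list.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical genre values standing in for the Genre column
genres = ["Sports", "Platform", "Racing", "Sports"]

# n_features=6 fixes the width of the output vector
hasher = FeatureHasher(n_features=6, input_type="string")

# Each sample is a one-element list holding the genre string
hashed = hasher.transform([[g] for g in genres]).toarray()

print(hashed.shape)
print(hashed)
```

Each row has a single non-zero entry of +1 or -1, and identical genres always map to the same row. Because 12 genres are squeezed into 6 columns, different genres may collide in the same column, and the encoding cannot be inverted back to the original labels.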