##### Pair Programming - LINEAR REGRESSION 8 - ENCODING

In today's pair programming we will use the dataset you saved in the normalization and standardization pair programming.

Your dataset should have at least one categorical variable. The goals of today's pair programming are:

- Encode the categorical variable(s) in your dataset.

- Remember that the first thing to do is decide whether or not your variables have an order, so that you can choose one approach or the other accordingly.

- Save the dataframe, which should now contain the standardized, normalized and encoded variables, to a csv for use in the next pair programming.
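The decision described in the second point can be sketched as follows. The `size` and `color` columns and the ordering `small < medium < large` are hypothetical, made up only for illustration; they are not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: 'size' has a natural order, 'color' does not
toy = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "blue", "red", "green"],
})

# Ordered variable: map each category to an integer that respects the order
size_order = {"small": 0, "medium": 1, "large": 2}
toy["size_encoded"] = toy["size"].map(size_order)

# Unordered variable: one binary column per category
color_dummies = pd.get_dummies(toy["color"], prefix="color", dtype=int)

toy_encoded = pd.concat([toy[["size_encoded"]], color_dummies], axis=1)
print(toy_encoded)
```

With an ordered variable a single integer column is enough; with an unordered one, separate binary columns avoid suggesting a ranking that does not exist.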

In [1]:
# Data handling
# -----------------------------------------------------------------------
import pandas as pd

# Encoding of categorical variables
# -----------------------------------------------------------------------
from sklearn.preprocessing import LabelEncoder  # for Label Encoding
from sklearn.preprocessing import OneHotEncoder  # for One-Hot Encoding

# Suppress warnings in Jupyter
# -----------------------------------------------------------------------
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('../datos/pokePd_estandarizado.csv', index_col = 0)
df.head()

Unnamed: 0,Type,Total,HP,Attack,Defense,Sp.Atk,Sp.Def,Speed_BOX
0,Grass,-0.819945,-0.75,-0.738095,-0.5,-0.088889,-0.131579,16.160251
1,Poison,-0.819945,-0.75,-0.738095,-0.5,-0.088889,-0.131579,16.160251
2,Grass,-0.33795,-0.28125,-0.428571,-0.166667,0.244444,0.263158,19.72477
3,Poison,-0.33795,-0.28125,-0.428571,-0.166667,0.244444,0.263158,19.72477
4,Grass,0.32687,0.34375,0.047619,0.309524,0.688889,0.789474,24.005888


First, we see that our categorical variable (the 'Type' column) is nominal, that is, it has no order. We will therefore use encoding methods that do not impose an order on the categories: One-Hot Encoder and Get Dummies.
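To see why an integer-based encoding is a poor fit for a nominal variable, here is a minimal sketch using the `LabelEncoder` imported above on four values taken from the 'Type' column. The variable names `types_sample`, `le_demo` and `codes` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A few values taken from the 'Type' column shown above
types_sample = pd.Series(["Grass", "Poison", "Fire", "Water"])

# LabelEncoder assigns integers following the sorted order of the labels
le_demo = LabelEncoder()
codes = le_demo.fit_transform(types_sample)

print(list(le_demo.classes_))
print(list(codes))
```

The resulting integers follow alphabetical order, so a linear model would treat 'Water' as "larger" than 'Fire', which has no meaning for Pokémon types.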

One-Hot Encoder method

In [3]:
# instantiate the OneHotEncoder
oh = OneHotEncoder()

In [4]:
# here we work with the Type column, which has no order

df.Type.unique()

array(['Grass', 'Poison', 'Fire', 'Flying', 'Dragon', 'Water', 'Bug',
       'Normal', 'Dark', 'Electric', 'Psychic', 'Ground', 'Ice', 'Steel',
       'Fairy', 'Fighting', 'Rock', 'Ghost'], dtype=object)

In [5]:
# encode the data for the given variable

transformados = oh.fit_transform(df[['Type']])

In [6]:
# turn the encoded array into a dataframe

oh_df = pd.DataFrame(transformados.toarray())
oh_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
oh_df.columns = oh.get_feature_names_out()
oh_df.columns

Index(['Type_Bug', 'Type_Dark', 'Type_Dragon', 'Type_Electric', 'Type_Fairy',
       'Type_Fighting', 'Type_Fire', 'Type_Flying', 'Type_Ghost', 'Type_Grass',
       'Type_Ground', 'Type_Ice', 'Type_Normal', 'Type_Poison', 'Type_Psychic',
       'Type_Rock', 'Type_Steel', 'Type_Water'],
      dtype='object')

In [8]:
# concatenate the original dataframe with the one just created

final = pd.concat([df,oh_df],axis=1)

In [9]:
pd.options.display.max_columns = None

In [10]:
final.head()

Unnamed: 0,Type,Total,HP,Attack,Defense,Sp.Atk,Sp.Def,Speed_BOX,Type_Bug,Type_Dark,Type_Dragon,Type_Electric,Type_Fairy,Type_Fighting,Type_Fire,Type_Flying,Type_Ghost,Type_Grass,Type_Ground,Type_Ice,Type_Normal,Type_Poison,Type_Psychic,Type_Rock,Type_Steel,Type_Water
0,Grass,-0.819945,-0.75,-0.738095,-0.5,-0.088889,-0.131579,16.160251,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Poison,-0.819945,-0.75,-0.738095,-0.5,-0.088889,-0.131579,16.160251,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,Grass,-0.33795,-0.28125,-0.428571,-0.166667,0.244444,0.263158,19.72477,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Poison,-0.33795,-0.28125,-0.428571,-0.166667,0.244444,0.263158,19.72477,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,Grass,0.32687,0.34375,0.047619,0.309524,0.688889,0.789474,24.005888,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
# define a function that applies this method

def one_hot_encoder_one(df, columna, keep_first=True):
    """One-hot encode `columna` and return a new dataframe without the original column."""

    # instantiate the OneHotEncoder
    # assumption: keep_first=False is taken to mean "drop the first category",
    # so that k categories yield k-1 columns
    oh = OneHotEncoder(drop=None if keep_first else 'first')

    # encode the data for the given variable
    transformados = oh.fit_transform(df[[columna]])

    # turn the encoded array into a dataframe, reusing the index of df so rows stay aligned
    oh_df = pd.DataFrame(transformados.toarray(), index=df.index)

    # get_feature_names_out gives the names of the newly generated columns
    oh_df.columns = oh.get_feature_names_out()

    # concatenate the original dataframe with the one just created
    final = pd.concat([df, oh_df], axis=1)

    # drop the original column
    final.drop(columna, axis=1, inplace=True)
    return final

In [12]:
# apply our function to the dataset to encode the "Type" column
df_tipo = one_hot_encoder_one(df, "Type")

In [13]:
df_tipo.head()

Unnamed: 0,Total,HP,Attack,Defense,Sp.Atk,Sp.Def,Speed_BOX,Type_Bug,Type_Dark,Type_Dragon,Type_Electric,Type_Fairy,Type_Fighting,Type_Fire,Type_Flying,Type_Ghost,Type_Grass,Type_Ground,Type_Ice,Type_Normal,Type_Poison,Type_Psychic,Type_Rock,Type_Steel,Type_Water
0,-0.819945,-0.75,-0.738095,-0.5,-0.088889,-0.131579,16.160251,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-0.819945,-0.75,-0.738095,-0.5,-0.088889,-0.131579,16.160251,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,-0.33795,-0.28125,-0.428571,-0.166667,0.244444,0.263158,19.72477,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.33795,-0.28125,-0.428571,-0.166667,0.244444,0.263158,19.72477,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.32687,0.34375,0.047619,0.309524,0.688889,0.789474,24.005888,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Get Dummies method

In [14]:
df.Type.unique()

array(['Grass', 'Poison', 'Fire', 'Flying', 'Dragon', 'Water', 'Bug',
       'Normal', 'Dark', 'Electric', 'Psychic', 'Ground', 'Ice', 'Steel',
       'Fairy', 'Fighting', 'Rock', 'Ghost'], dtype=object)

In [15]:
dummies = pd.get_dummies(df["Type"], prefix_sep = "_", prefix = "Type", dtype = int)
dummies.head(2)

Unnamed: 0,Type_Bug,Type_Dark,Type_Dragon,Type_Electric,Type_Fairy,Type_Fighting,Type_Fire,Type_Flying,Type_Ghost,Type_Grass,Type_Ground,Type_Ice,Type_Normal,Type_Poison,Type_Psychic,Type_Rock,Type_Steel,Type_Water
0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


In [16]:
df_dummies = pd.concat([df, dummies], axis = 1)
df_dummies.head(2)

Unnamed: 0,Type,Total,HP,Attack,Defense,Sp.Atk,Sp.Def,Speed_BOX,Type_Bug,Type_Dark,Type_Dragon,Type_Electric,Type_Fairy,Type_Fighting,Type_Fire,Type_Flying,Type_Ghost,Type_Grass,Type_Ground,Type_Ice,Type_Normal,Type_Poison,Type_Psychic,Type_Rock,Type_Steel,Type_Water
0,Grass,-0.819945,-0.75,-0.738095,-0.5,-0.088889,-0.131579,16.160251,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
1,Poison,-0.819945,-0.75,-0.738095,-0.5,-0.088889,-0.131579,16.160251,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
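Note that `df_dummies` still carries the original text column 'Type' next to its encoded columns. As an alternative sketch, `pd.get_dummies` can receive the whole dataframe together with the `columns` argument, which encodes the listed columns, removes the originals and keeps the rest in a single call. The frame `sample_df` below is a reduced, illustrative stand-in for `df`:

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Type": ["Grass", "Poison", "Grass"],
    "HP": [-0.75, -0.75, -0.28125],
})

# Encode 'Type', drop the original column and keep the other columns
encoded_df = pd.get_dummies(sample_df, columns=["Type"], dtype=int)

print(encoded_df.columns.tolist())
```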


In [17]:
df_tipo.to_csv('../datos/pokePd_codificado_onehot.csv')

In [18]:
# save the df to a csv file
df_dummies.to_csv('../datos/pokePdcodificado.csv')
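Because `to_csv` writes the index as the first column, the file should be read back with `index_col=0`, as done at the start of this notebook. A quick round-trip check on a small illustrative frame, written to a temporary directory so the real files are not touched (the names `encoded_sample` and `encoded_sample.csv` are hypothetical):

```python
import os
import tempfile

import pandas as pd

encoded_sample = pd.DataFrame({
    "HP": [-0.75, 0.34375],
    "Type_Grass": [1, 0],
    "Type_Poison": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "encoded_sample.csv")
    encoded_sample.to_csv(path)
    # index_col=0 restores the saved index instead of creating an 'Unnamed: 0' column
    reloaded = pd.read_csv(path, index_col=0)

print(reloaded.columns.tolist())
```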