
# Encoding categorical variables
The effectiveness of a machine learning model depends not only on the model and its hyperparameters but also on how we prepare the variables we feed into it. Since most machine learning models work only with numerical data, categorical variables must be preprocessed: our task is to transform them into numeric representations that the model can interpret.

Data scientists are often said to spend **70–80% of their time** on data cleaning and preparation. Among these tasks, **converting categorical data** is crucial: it improves model quality and supports better feature engineering. The question then is: **which categorical data encoding method should we choose?**

## Label Encoding


In machine learning projects, we often work with datasets containing several categorical columns. Some of these hold ordinal variables, such as an 'income level' column with the values 'low', 'medium', and 'high'. In such cases, we can replace the categories with numeric labels (e.g., 1 for 'low', 2 for 'medium', and 3 for 'high'). This encoding preserves the order of the categories: a higher-ranked category receives a larger number.
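As a minimal sketch of the 'income level' example, the mapping can be written as a plain dictionary and applied with pandas. The DataFrame, the column names `income_level` and `income_level_encoded`, and the sample rows are hypothetical and only serve as illustration.

```python
import pandas as pd

# Hypothetical toy data; the column name and values are illustrative only.
income = pd.DataFrame({"income_level": ["low", "high", "medium", "low"]})

# Explicit mapping that preserves the natural order low < medium < high.
order = {"low": 1, "medium": 2, "high": 3}
income["income_level_encoded"] = income["income_level"].map(order)

print(income)
```

Writing the mapping by hand keeps the order under our control instead of leaving it to the encoder.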

**Label Encoding** converts a categorical column into a numerical one by assigning an integer label to each category, so that machine learning models that require numerical input can process it.

For example, consider a dataset with a 'Height' column containing the categories 'Tall', 'Medium', and 'Short'. Applying label encoding transforms this column into a numerical one, where 'Tall' corresponds to 0, 'Medium' to 1, and 'Short' to 2.

The integers act as identifiers for the categories. When the order of the categories matters, the mapping should be specified explicitly, as in the 'income level' example.
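The 'Height' example can be reproduced with `pandas.factorize`, which assigns integer codes in order of first appearance. This is only one possible way to obtain such labels; the sample values below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of the 'Height' column.
heights = pd.Series(["Tall", "Medium", "Short", "Tall"], name="Height")

# factorize returns the integer codes and the distinct categories,
# numbered in the order in which they first appear.
codes, categories = pd.factorize(heights)

print(list(codes))       # [0, 1, 2, 0]
print(list(categories))  # ['Tall', 'Medium', 'Short']
```

Note that scikit-learn's `LabelEncoder` sorts the categories before numbering them, so it would assign 'Medium' to 0, 'Short' to 1, and 'Tall' to 2 for the same data.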

In [2]:
pip install category_encoders

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [11]:
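import pandas as pd

# Load the dataset (local path on the author's machine)
df = pd.read_csv("C:/Users/maria/Desktop/Segundo_periodo/ING_CARACT/datos_20.csv")

# Collect the distinct values of 'Grado' (education level) in order of first appearance
lista_de_columnas = []
for i in df['Grado']:
    if i not in lista_de_columnas:
        lista_de_columnas.append(i)
        print(i)
print(df)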

Primaria
Secundaria
Primaria trunca
Preparatoria
Licenciatura o superior
Sin escolaridad
                  Grado  POBTOT_BC  PROM_HNV_BC
0              Primaria   3.173729     1.154342
1              Primaria   5.278282     1.302897
2              Primaria   3.173729     1.759130
3              Primaria   5.221993     1.388040
4              Primaria   4.219874     1.420656
...                 ...        ...          ...
105492  Primaria trunca   3.037064     1.540916
105493  Primaria trunca   2.234823     2.071155
105494         Primaria   3.716448     1.699794
105495         Primaria   4.123288     1.781686
105496         Primaria   3.774494     1.863812

[105497 rows x 3 columns]


In [15]:
df["Grado"].value_counts()

Grado
Primaria                   59567
Primaria trunca            36023
Secundaria                  8131
Preparatoria                1432
Sin escolaridad              274
Licenciatura o superior       70
Name: count, dtype: int64

In [16]:
import category_encoders as ce
import pandas as pd

# One row per distinct education level found above
train_df = pd.DataFrame({'Degree': lista_de_columnas})

# Create the ordinal encoder with an explicit mapping, from lowest to highest education level
encoder = ce.OrdinalEncoder(cols=['Degree'], return_df=True,
                            mapping=[{'col': 'Degree',
                                      'mapping': {'Sin escolaridad': 0, 'Primaria trunca': 1, 'Primaria': 2,
                                                  'Secundaria': 3, 'Preparatoria': 4, 'Licenciatura o superior': 5}}])

# Original data
print(train_df)

                    Degree
0                 Primaria
1               Secundaria
2          Primaria trunca
3             Preparatoria
4  Licenciatura o superior
5          Sin escolaridad


In [17]:
df_train_transformed = encoder.fit_transform(train_df)
df_train_transformed

Unnamed: 0,Degree
0,2
1,3
2,1
3,4
4,5
5,0


## One-Hot Encoding
 
One-Hot Encoding is a categorical data encoding technique used when the features are **nominal** (meaning they do not have any inherent order). Here's how it works:

1. For each level of a categorical feature, we create a new binary variable.
2. In each row, the variable takes the value 1 if that category is present and 0 if it is absent.
3. These newly created binary features are known as **dummy variables**.
4. The number of dummy variables equals the number of distinct levels in the categorical variable.

Let's illustrate this with an example: Suppose we have a dataset with a categorical feature called "Animal," which includes different animals like Dog, Cat, Sheep, Horse, and Lion. To one-hot encode this data, we create separate binary columns for each animal, indicating whether it is present (1) or absent (0). This technique allows machine learning models to effectively handle categorical data.

\begin{array}{|c|c|c|c|c|c|c|c|c|}
\hline \text { Index } & \text { Animal } & & \text { Index } & \text { Dog } & \text { Cat } & \text { Sheep } & \text { Lion } & \text { Horse } \\
\hline 0 & \text { Dog } & \text { One-Hot code } & 0 & 1 & 0 & 0 & 0 & 0 \\
\hline 1 & \text { Cat } & & 1 & 0 & 1 & 0 & 0 & 0 \\
\hline 2 & \text { Sheep } & & 2 & 0 & 0 & 1 & 0 & 0 \\
\hline 3 & \text { Horse } & & 3 & 0 & 0 & 0 & 0 & 1 \\
\hline 4 & \text { Lion } & & 4 & 0 & 0 & 0 & 1 & 0 \\
\hline
\end{array}
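The table above can be reproduced with a short sketch using `pandas.get_dummies`. The DataFrame and the variable names `animals` and `one_hot` are assumptions made for this illustration; only the column reordering is added to match the layout of the table.

```python
import pandas as pd

# The five animals from the example, in the same row order as the table.
animals = pd.DataFrame({"Animal": ["Dog", "Cat", "Sheep", "Horse", "Lion"]})

# One binary column per distinct animal (1 = present, 0 = absent).
one_hot = pd.get_dummies(animals["Animal"], dtype=int)

# get_dummies sorts the columns alphabetically; reorder them to match the table.
one_hot = one_hot[["Dog", "Cat", "Sheep", "Lion", "Horse"]]

print(one_hot)
```

Each row contains exactly one 1, in the column of the animal found in that row.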


In [18]:
import category_encoders as ce
import pandas as pd
data=pd.DataFrame({'Degree':lista_de_columnas})

#Create object for one-hot encoding
encoder=ce.OneHotEncoder(cols='Degree',handle_unknown='return_nan',return_df=True,use_cat_names=True)

#Original Data
data

Unnamed: 0,Degree
0,Primaria
1,Secundaria
2,Primaria trunca
3,Preparatoria
4,Licenciatura o superior
5,Sin escolaridad


In [19]:
data_encoded = encoder.fit_transform(data)
data_encoded

Unnamed: 0,Degree_Primaria,Degree_Secundaria,Degree_Primaria trunca,Degree_Preparatoria,Degree_Licenciatura o superior,Degree_Sin escolaridad
0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0
5,0.0,0.0,0.0,0.0,0.0,1.0


In [20]:
import pandas as pd

# Create the dataframe
data = pd.DataFrame({'Degree': lista_de_columnas})

# pandas' built-in one-hot encoding; the dummy columns come out in alphabetical order
pd.get_dummies(data, dtype=float)


Unnamed: 0,Degree_Licenciatura o superior,Degree_Preparatoria,Degree_Primaria,Degree_Primaria trunca,Degree_Secundaria,Degree_Sin escolaridad
0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,1.0


In [21]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a sample DataFrame
df_ = pd.DataFrame({'Degree':lista_de_columnas})

# Initialize the One-Hot Encoder
encoder = OneHotEncoder()

# Fit and transform the data
encoded_data = encoder.fit_transform(df_[['Degree']]).toarray()

# Create a new DataFrame with the encoded columns
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Degree']))

# Concatenate the original DataFrame with the encoded DataFrame
df_f = pd.concat([df_, encoded_df], axis=1)

# Print the final DataFrame
print(df_f)


                    Degree  Degree_Licenciatura o superior  \
0                 Primaria                             0.0   
1               Secundaria                             0.0   
2          Primaria trunca                             0.0   
3             Preparatoria                             0.0   
4  Licenciatura o superior                             1.0   
5          Sin escolaridad                             0.0   

   Degree_Preparatoria  Degree_Primaria  Degree_Primaria trunca  \
0                  0.0              1.0                     0.0   
1                  0.0              0.0                     0.0   
2                  0.0              0.0                     1.0   
3                  1.0              0.0                     0.0   
4                  0.0              0.0                     0.0   
5                  0.0              0.0                     0.0   

   Degree_Secundaria  Degree_Sin escolaridad  
0                0.0                     0.0  
1

## HashingEncoder

The hashing encoder applies a hash function to each category and maps the result into a fixed number of output columns, set with `n_components`. Because the number of columns is fixed, different categories can collide in the same column, as the output below with `n_components=3` shows.

In [22]:
import category_encoders as ce
import pandas as pd

#Create the dataframe
data=pd.DataFrame({'Degree':lista_de_columnas})

#Create object for hash encoder
encoder=ce.HashingEncoder(cols='Degree',n_components=3)
#Fit and Transform Data
encoder.fit_transform(data)

Unnamed: 0,col_0,col_1,col_2
0,0,1,0
1,0,1,0
2,0,0,1
3,0,1,0
4,0,1,0
5,1,0,0


