<center>
    <h1 id='dealing-with-categorical-variables' style='color:#7159c1'>🔨 Dealing with Categorical Variables 🔨</h1>
    <i>Encoding Categorical Variables</i>
</center>

```
- Dropping
- Ordinal Encoding
- One-hot Encoder
```

---

<h1 id='0-dropping' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>0 | Dropping</h1>

`Dropping` categorical variables is only recommended when the variables carry no useful information for the model. In that case, dropping them does not affect the model's results. In practice, this technique normally performs worse than the others.

In [3]:
# ---- Reading Dataset ----
import pandas as pd # pip install pandas

houses_df = pd.read_csv('./datasets/melb_data.csv')
houses_without_categorical_variables_df = houses_df.select_dtypes(exclude=['object'])
houses_without_categorical_variables_df.head()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
0,2,1480000.0,2.5,3067.0,2.0,1.0,1.0,202.0,,,-37.7996,144.9984,4019.0
1,2,1035000.0,2.5,3067.0,2.0,1.0,0.0,156.0,79.0,1900.0,-37.8079,144.9934,4019.0
2,3,1465000.0,2.5,3067.0,3.0,2.0,0.0,134.0,150.0,1900.0,-37.8093,144.9944,4019.0
3,3,850000.0,2.5,3067.0,3.0,2.0,1.0,94.0,,,-37.7969,144.9969,4019.0
4,4,1600000.0,2.5,3067.0,3.0,1.0,2.0,120.0,142.0,2014.0,-37.8072,144.9941,4019.0


<h1 id='1-ordinal-encoder' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>1 | Ordinal Encoder</h1>

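`Ordinal Encoder` transforms each possible categorical value into a number. For instance, the list:

```python
fruits = ['apple', 'grape', 'pineapple', 'apple']
```

would look like this:

```python
fruits = [0, 1, 2, 0]
```

This technique also implies an order between the values: the higher the number that represents a value, the higher that value ranks. By default, scikit-learn's `OrdinalEncoder` numbers the categories in sorted (alphabetical) order, so the technique fits best when the categories have a natural order and that order is given to the encoder explicitly.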
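
To make the point about order concrete, here is a minimal sketch that compares the default numbering with an explicit one passed through the `categories` parameter. The `size` column and its values are hypothetical toy data, not part of the Melbourne dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with a natural order: small < medium < large
sizes_df = pd.DataFrame({'size': ['medium', 'small', 'large', 'small']})

# Default behaviour: categories are numbered in sorted (alphabetical) order,
# so 'large' gets 0, 'medium' gets 1 and 'small' gets 2
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(sizes_df[['size']]).ravel().tolist()

# Explicit order: one list of categories per encoded column
ordered_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered_codes = ordered_encoder.fit_transform(sizes_df[['size']]).ravel().tolist()

print(default_codes)  # [1.0, 2.0, 0.0, 2.0]
print(ordered_codes)  # [1.0, 0.0, 2.0, 0.0]
```

With the explicit list, the numbers follow the intended ranking instead of the alphabet, which is usually what a model should see when the order matters.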

In [7]:
# ---- Ordinal Encoder ----
from sklearn.preprocessing import OrdinalEncoder # pip install scikit-learn

encoder = OrdinalEncoder()

categorical_variables = [
    column for column in houses_df.columns
    if houses_df[column].dtype in ['object', 'o']
]

ordinal_encoder_df = houses_df.copy()
# Assigning the encoded NumPy array directly avoids any index alignment issues
ordinal_encoder_df[categorical_variables] = encoder.fit_transform(ordinal_encoder_df[categorical_variables])
ordinal_encoder_df[categorical_variables].head()

Unnamed: 0,Suburb,Address,Type,Method,SellerG,Date,CouncilArea,Regionname
0,0.0,12794.0,0.0,1.0,23.0,45.0,31.0,2.0
1,0.0,5943.0,0.0,1.0,23.0,47.0,31.0,2.0
2,0.0,9814.0,0.0,3.0,23.0,48.0,31.0,2.0
3,0.0,9004.0,0.0,0.0,23.0,48.0,31.0,2.0
4,0.0,10589.0,0.0,4.0,155.0,49.0,31.0,2.0


---

Sometimes we stumble upon this problem:

The validation dataset contains some categorical values (classes)
that the training one doesn't. For example:

```python
train_df = ['red', 'green', 'blue']
valid_df = ['red', 'yellow', 'blue', 'green']
```

The training data doesn't have the 'yellow' class, so the encoder never
learns a number for it, and encoding the validation column raises an error.

One option is to drop all of the columns that have this problem.

In [27]:
# ---- Ordinal Encoding ----
from sklearn.model_selection import train_test_split # pip install scikit-learn

x_train_df, x_valid_df, y_train_df, y_valid_df = train_test_split(
    houses_df.loc[:, 'Price':]
    , houses_df.loc[:, 'Rooms']
    , train_size=0.70
    , test_size=0.30
)

categorical_variables = [
    column for column in x_train_df.columns
    if x_train_df[column].dtype in ['object', 'o']
]

In [32]:
# ---- Ordinal Encoding ----

# Columns that can be safely ordinal encoded
good_label_cols = [
    column for column in categorical_variables
    if set(x_valid_df[column]).issubset(set(x_train_df[column]))
]

# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(categorical_variables) - set(good_label_cols))

# Dropping categorical columns that will not be encoded
labelled_x_train_df = x_train_df.drop(bad_label_cols, axis=1)
labelled_x_valid_df = x_valid_df.drop(bad_label_cols, axis=1)

In [35]:
# ---- Ordinal Encoding ----

# Encoding
labelled_categorical_variables = [
    column for column in labelled_x_train_df.columns
    if labelled_x_train_df[column].dtype in ['object', 'o']
]

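# The encoded NumPy arrays are assigned directly: wrapping them in pd.DataFrame
# would give them a fresh 0..n-1 index, and the assignment would then align on
# that index and mismatch the shuffled train/valid rows
labelled_x_train_df[labelled_categorical_variables] = encoder.fit_transform(
    labelled_x_train_df[labelled_categorical_variables]
)

labelled_x_valid_df[labelled_categorical_variables] = encoder.transform(
    labelled_x_valid_df[labelled_categorical_variables]
)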
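
Dropping the problematic columns is only one option. As a hedged alternative, the sketch below keeps the column and maps classes that were never seen during training to a placeholder number, using the `handle_unknown='use_encoded_value'` and `unknown_value` parameters of `OrdinalEncoder`. The color data mirrors the toy example above, and the choice of `-1` as the placeholder is an assumption made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data: 'yellow' appears only in the validation set
train_colors_df = pd.DataFrame({'color': ['red', 'green', 'blue']})
valid_colors_df = pd.DataFrame({'color': ['red', 'yellow', 'blue', 'green']})

# Unseen classes are encoded as -1 instead of raising an error
unknown_safe_encoder = OrdinalEncoder(
    handle_unknown='use_encoded_value'
    , unknown_value=-1
)

unknown_safe_encoder.fit(train_colors_df[['color']])
valid_codes = unknown_safe_encoder.transform(valid_colors_df[['color']]).ravel().tolist()

# Learned order is alphabetical: blue=0, green=1, red=2
print(valid_codes)  # [2.0, -1.0, 0.0, 1.0]
```

Whether a placeholder value is acceptable depends on the model: tree-based models usually cope with it, while linear models may read `-1` as a meaningful rank.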

<h1 id='2-one-hot-encoding' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>2 | One-Hot Encoding</h1>

`One-hot Encoding` is recommended when the categorical variables contain `fewer than 10 possible values` AND `there is no hierarchy between them`. The technique adds one new column for each possible value, so with many distinct values the dataframe can quickly get too large.


When creating the `OneHotEncoder` instance, we have two parameters to set:

> **handle_unknown** - `ignore >> avoid errors when the validation dataset contains classes (categorical values) that are not represented in the training one`;

> **sparse_output** - `False >> returns a numpy array. True >> returns a sparse matrix` (this parameter was named `sparse` before scikit-learn 1.2).

It normally works better than the others!!

In [40]:
# ---- One-Hot Encoding ----
from sklearn.preprocessing import OneHotEncoder # pip install scikit-learn

oh_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

x_train_df, x_valid_df, y_train_df, y_valid_df = train_test_split(
    houses_df.loc[:, 'Price':]
    , houses_df.loc[:, 'Rooms']
    , train_size=0.70
    , test_size=0.30
)

categorical_variables = [
    column for column in x_train_df.columns
    if x_train_df[column].dtype in ['object', 'o']
       and x_train_df[column].nunique() < 10
]

In [42]:
# ---- One-Hot Encoding ----

# Encoding the Variables
oh_x_train_df = pd.DataFrame(oh_encoder.fit_transform(x_train_df[categorical_variables]))
oh_x_valid_df = pd.DataFrame(oh_encoder.transform(x_valid_df[categorical_variables]))

# One-Hot Encoder removes the index, so we have to get them back
oh_x_train_df.index = x_train_df.index
oh_x_valid_df.index = x_valid_df.index

# Removing the Categorical Columns from the Dataset
numerical_x_train_df = x_train_df.drop(categorical_variables, axis=1)
numerical_x_valid_df = x_valid_df.drop(categorical_variables, axis=1)

# Joining the One-Hot Encoded Columns with the Remaining Columns
oh_x_train_df = pd.concat([numerical_x_train_df, oh_x_train_df], axis=1)
oh_x_valid_df = pd.concat([numerical_x_valid_df, oh_x_valid_df], axis=1)

# Showing the Result
oh_x_train_df.head()

Unnamed: 0,Price,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,...,3,4,5,6,7,8,9,10,11,12
1094,4700000.0,Hodges,12/11/2016,11.2,3186.0,5.0,3.0,2.0,1073.0,221.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4596,750000.0,Jellis,30/07/2016,2.6,3052.0,2.0,1.0,1.0,2193.0,,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
5604,801000.0,Biggin,6/08/2016,3.3,3141.0,2.0,2.0,1.0,0.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3528,870000.0,Greg,8/10/2016,4.2,3031.0,3.0,1.0,1.0,211.0,95.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
12913,625000.0,Barry,19/08/2017,16.1,3088.0,2.0,2.0,1.0,189.0,,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


---

Example of how to check how many classes a categorical variable has. Remember that One-Hot Encoding works best with categorical columns that have fewer than 10 classes.

In [44]:
# ---- Checking out how many Classes a Categorical Variable contains ----
categorical_variables_nunique_values = list(map(
    lambda column: x_train_df[column].nunique(), categorical_variables
))

categorical_variables_nunique_dictionary = dict(zip(
    categorical_variables, categorical_variables_nunique_values
))

# Sorting by the class count (item[1]) and rebuilding a dictionary
dict(sorted(
    categorical_variables_nunique_dictionary.items()
    , key = lambda item: item[1]
))

{'Method': 5, 'Regionname': 8}
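
The same count can also be obtained more compactly with pandas alone. This is a minimal sketch on hypothetical toy data; on the real dataset the call would be applied to `x_train_df[categorical_variables]`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the categorical columns
toy_categorical_df = pd.DataFrame({
    'Method': ['S', 'SP', 'S', 'VB']
    , 'Regionname': ['Northern', 'Northern', 'Western', 'Northern']
})

# nunique() counts the distinct classes per column; sort_values() orders them ascending
class_counts = toy_categorical_df.nunique().sort_values()

print(class_counts.to_dict())  # {'Regionname': 2, 'Method': 3}
```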

---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).