# More Information About Categorical Data Feature Engineering:

As stated before, many machine learning algorithms cannot work with categorical data directly; the categories must be converted into numbers. This is required for both input and output variables that are categorical.

We will discuss the different ways we can do this below.

The key thing to remember is that we want to help the machine learning model learn as many unique traits and characteristics of the dataset as possible, so we should lose as little information as possible when we convert (encode) the categorical values to numerical values.

## Different Types Of Categorical Data:
There are two different types of categorical data and it is important to understand the difference before we discuss how to convert these categorical features to numbers.

<br> 

### `Nominal` Categorical Data 
Nominal scales are used for labeling variables, without any quantitative value.  “Nominal” scales could simply be called “labels.”  None of these values have any numerical significance. 

**e.g.**

What's your hair color?
- `1- Brown`
- `2- Black`
- `3- Blonde`
- `4- Grey`
- `5- Other`

Note: These are strings (text) not numbers

None of these values has any numerical significance or inherent order, unlike `Ordinal` categorical data.

<br>

### `Ordinal` Categorical Data
With ordinal scales, the order of the values is what’s important and significant, but the differences between them are not really known.

**e.g.**

How do you feel today?
- `1. Very Happy`
- `2. Happy`
- `3. Ok`
- `4. Unhappy`
- `5. Very Unhappy`

Note: These are strings (text) not numbers

or 

What size shirt do you wear?
- `1. Extra Small`
- `2. Small`
- `3. Medium`
- `4. Large`
- `5. Extra Large `

Note: These are strings (text) not numbers

Through your experience you can agree that:
```Extra Small < Small < Medium < Large < Extra Large```

Unlike `nominal` categorical data, you might be able to see that there is a significance of order in `ordinal` categorical data.

In each case there is a clear order among the unique values, so you can compare them to each other, which gives them more semantic meaning. Although you can say that `Extra Small` is less than `Small` and `Large` is less than `Extra Large`, you cannot really measure the difference within either comparison. For example, if the difference in size between `Extra Small` and `Small` is `15%`, we cannot assume that the difference between `Large` and `Extra Large` is the same; it could just as well be `10%`. Likewise, we do not know whether the difference between “OK” and “Unhappy” is the same as the difference between “Very Happy” and “Happy”. We just can’t say.

Ordinal scales are typically measures of non-numeric concepts like satisfaction, happiness, discomfort, etc.

**Advanced note:** The best way to determine central tendency on a set of ordinal data is to use the mode or median; a purist will tell you that the mean cannot be defined from an ordinal set.
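
As a minimal sketch of the points above, pandas can represent ordinal data with an ordered categorical type. The survey answers below are made up for illustration, and taking the median through the integer category codes is only one reasonable approach (it works cleanly here because the number of answers is odd).

```python
import pandas as pd

# Hypothetical shirt-size answers, made up for illustration.
sizes = pd.Series(["Small", "Extra Large", "Medium", "Small", "Large"])

# Declare the order explicitly; pandas cannot infer it from the strings.
order = ["Extra Small", "Small", "Medium", "Large", "Extra Large"]
sizes = sizes.astype(pd.CategoricalDtype(categories=order, ordered=True))

# Order comparisons are now meaningful.
print((sizes < "Large").tolist())  # [True, False, True, True, False]

# Mode: the most frequent category.
mode_size = sizes.mode()[0]

# Median: take the middle position in the order, then map it back to a label.
median_code = int(sizes.cat.codes.median())
median_size = order[median_code]

print(mode_size, median_size)  # Small Medium
```

Note that nothing here measures how far apart two sizes are; only their relative order is used.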

[Reference for different types of categorical data](https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/)

<hr>

## Converting Categorical Data 

Now that we understand the difference between `nominal` and `ordinal` categorical data, we can discuss how to convert these text values to numeric values.

### Converting `nominal` data:

For `nominal` data we will perform `one-hot encoding`. `One-hot encoding` converts categorical variables into a form that ML algorithms can use to do a better job in prediction. It allows the representation of categorical data to be more expressive.

`One-hot encoding` converts a column of unique values into multiple columns. Each new column corresponds to a single unique value in the original column. The values in each new column are binary: 1 signifies that the row has that value and 0 signifies that it does not.

For example:

In [0]:
import pandas as pd
data = {'hair color': ['Black', 'Black', 'Blonde', 'Brown', 'Grey', 'Grey', 'Other']}
df = pd.DataFrame(data)
display(df)

Unnamed: 0,hair color
0,Black
1,Black
2,Blonde
3,Brown
4,Grey
5,Grey
6,Other


After performing `one-hot encoding`:

In [0]:
# dtype=int keeps the output as 0/1; recent pandas versions default to True/False
pd.get_dummies(df, dtype=int)

Unnamed: 0,hair color_Black,hair color_Blonde,hair color_Brown,hair color_Grey,hair color_Other
0,1,0,0,0,0
1,1,0,0,0,0
2,0,1,0,0,0
3,0,0,1,0,0
4,0,0,0,1,0
5,0,0,0,1,0
6,0,0,0,0,1


These binary variables are often called “dummy variables” in other fields, such as statistics.

One thing to watch for with `one-hot encoding` is the `curse of dimensionality`. If a column has a very large number of unique values, `one-hot encoding` it can hurt certain machine learning algorithms. In that case, try to group unique values together using domain knowledge.
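
When domain knowledge does not suggest a grouping, one simple fallback is to lump rare values into a single `Other` category before encoding. The sketch below assumes a made-up column and an arbitrary frequency threshold (`min_count`); both are illustrative choices, not recommendations.

```python
import pandas as pd

# Hypothetical column with several rare categories (made-up data).
colors = pd.Series(
    ["Black", "Black", "Brown", "Brown", "Brown", "Blonde", "Red", "Auburn", "White"],
    name="hair color",
)

# Keep categories seen at least `min_count` times; lump the rest together.
min_count = 2  # assumed threshold, tune it for your data
counts = colors.value_counts()
frequent = counts[counts >= min_count].index
grouped = colors.where(colors.isin(frequent), "Other")

# One-hot encode the grouped column: 3 columns instead of 6.
encoded = pd.get_dummies(grouped, dtype=int)
print(encoded.columns.tolist())  # ['Black', 'Brown', 'Other']
```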

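Note that the opposite mistake is also possible: giving nominal categories plain integer codes (for example `Brown = 1`, `Black = 2`) and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).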

### Converting `ordinal` data:

For `ordinal` categorical data we will perform a different type of encoding. `Ordinal encoding` ensures that the encoded variable retains its ordinal nature. In short, we map the categorical values to numeric values through an explicit, controlled mapping rather than letting the computer assign them arbitrarily. Once the values are converted we do not `one-hot encode` them; we leave the numeric values in a single column instead of expanding it into multiple columns.

"Treating ordered categorical data as a numerical variable preserves the information contained in the ordering that would be lost if it were transformed using `one-hot encoding`" -O'Reilly, Practical Statistics for Data Scientists.

One way to do this is with an explicit mapping:

In [0]:
data = {'shirt size': ['XS', 'XS', 'S',  'M', 'M', 'L', 'XL', 'XL']}
df = pd.DataFrame(data)
display(df)

Unnamed: 0,shirt size
0,XS
1,XS
2,S
3,M
4,M
5,L
6,XL
7,XL


After `ordinal encoding`:

In [0]:
size_dict = {
    'XS':1,
    'S':2,
    'M':3,
    'L':4,
    'XL':5
    }

df['shirt size ordinal'] = df['shirt size'].map(size_dict)
display(df)

Unnamed: 0,shirt size,shirt size ordinal
0,XS,1
1,XS,1
2,S,2
3,M,3
4,M,3
5,L,4
6,XL,5
7,XL,5
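
The same mapping can also be sketched with scikit-learn's `OrdinalEncoder`, which is convenient when the encoding has to live inside a pipeline. This is an alternative to the dictionary approach above, not a replacement for it. The order must be passed explicitly through `categories`; otherwise the encoder sorts the labels alphabetically. Note that its codes start at 0 rather than 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"shirt size": ["XS", "XS", "S", "M", "M", "L", "XL", "XL"]})

# List the categories from smallest to largest.
size_order = ["XS", "S", "M", "L", "XL"]
encoder = OrdinalEncoder(categories=[size_order])

# fit_transform expects a 2D input and returns a 2D float array.
df["shirt size ordinal"] = encoder.fit_transform(df[["shirt size"]]).ravel()
print(df["shirt size ordinal"].tolist())  # [0.0, 0.0, 1.0, 2.0, 2.0, 3.0, 4.0, 4.0]
```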


### Converting Categorical Data Recap: 
- If a column has a very large number of unique values, one-hot encoding it can hurt certain machine learning algorithms.
- "Treating ordered categorical data as a numerical variable preserves the information contained in the ordering that would be lost if it were transformed using one-hot encoding" -O'Reilly, Practical Statistics for Data Scientists.

Check out this link if you are curious about the [pros and cons of treating ordinal values as numerical values rather than one-hot encoding](https://www.theanalysisfactor.com/pros-and-cons-of-treating-ordinal-variables-as-nominal-or-continuous/).

<br>

## **Of course, we will have to build a model with each way of encoding to see which performs better on the specific dataset. Remember that this is an iterative process and there is no one-size-fits-all answer: every dataset and model behaves differently.**
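
A rough sketch of such a comparison is shown below. Everything in it is an assumption made for illustration: the data is synthetic, the target is constructed to grow with shirt size, and linear regression with 5-fold cross-validation is just one possible model and evaluation choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Synthetic data: a single ordinal feature and a noisy target.
rng = np.random.default_rng(0)
size_order = ["XS", "S", "M", "L", "XL"]
X = pd.DataFrame({"shirt size": rng.choice(size_order, size=200)})
rank = X["shirt size"].map({s: i for i, s in enumerate(size_order)})
y = rank * 2.0 + rng.normal(0, 1, size=200)

# The two encodings discussed above.
encoders = {
    "one-hot": OneHotEncoder(handle_unknown="ignore"),
    "ordinal": OrdinalEncoder(categories=[size_order]),
}

# Same model and same folds for both, so only the encoding differs.
scores = {}
for name, encoder in encoders.items():
    model = make_pipeline(encoder, LinearRegression())
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean R^2 = {scores[name]:.3f}")
```

On real data the ranking can go either way, which is why both variants are worth trying.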

<br>

There are actually many more ways to encode categorical data; check out the following link for more information:

[Categorical Encoding Reference](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02)


<hr>

<br>

## Numerical Data Feature Engineering

### `Discretization`
Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because it can transform a dataset of continuous attributes into one with only nominal attributes; in other words, it reduces a column to fewer unique values.

More technically `discretization` is the process of transferring continuous functions, models, variables, and equations into discrete counterparts or partitions.

In other words, we convert continuous data points to discrete data points in the hope of helping the model perform better.

"Discretized features can make a model more expressive, while maintaining interpretability. For instance, pre-processing with a discretizer can introduce nonlinearity to linear models." -[scikit-learn.org](https://scikit-learn.org/stable/modules/preprocessing.html#discretization)


#### Sklearn's preprocessing K-bins discretization
We will use the `KBinsDiscretizer` class from sklearn's `preprocessing` module to discretize features into k bins.

#### Sklearn's KBinsDiscretizer() parameters:
- `n_bins` (int or array-like): The number of bins to produce for each column. **Note:** Raises `ValueError` if `n_bins < 2`.
- `encode` ({‘onehot’, ‘onehot-dense’, ‘ordinal’}): Method used to encode the transformed result. If set to `ordinal`, the bin identifier is returned as an integer value.

We will call `fit()` on the `KBinsDiscretizer` instance to fit the estimator. This works out the range of each bin.

In [0]:
from sklearn import preprocessing
import numpy as np
X = np.array([[ -3., 5., 15 ],
              [  0., 6., 14 ],
              [  6., 3., 11 ]])

est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(X)


By default the output is one-hot encoded into a sparse matrix ([See Encoding categorical features](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features)) and this can be configured with the encode parameter.
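
To make the effect of `encode` concrete, the sketch below fits the same toy array with all three settings and compares the output shapes. With `n_bins=[3, 2, 2]` the one-hot variants produce 3 + 2 + 2 = 7 columns, while `ordinal` keeps one column per feature.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-3., 5., 15.],
              [ 0., 6., 14.],
              [ 6., 3., 11.]])
n_bins = [3, 2, 2]

# Default encode="onehot": sparse matrix, one column per bin.
default_out = KBinsDiscretizer(n_bins=n_bins).fit_transform(X)
# Same columns, returned as a dense NumPy array.
dense_out = KBinsDiscretizer(n_bins=n_bins, encode="onehot-dense").fit_transform(X)
# One integer bin identifier per feature.
ordinal_out = KBinsDiscretizer(n_bins=n_bins, encode="ordinal").fit_transform(X)

print(sparse.issparse(default_out), default_out.shape)  # True (3, 7)
print(dense_out.shape)    # (3, 7)
print(ordinal_out.shape)  # (3, 3)
```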

For each feature, the bin edges are computed during fit and together with the number of bins, they will define the intervals. Therefore, for the current example, these intervals are defined as:

**Key:** 
- `[` and `]` are inclusive
- `(` and `)` are exclusive

feature 1: ```(-infinity, -1), [-1, 2), [2, infinity)```

feature 2: ```(-infinity, 5), [5, infinity)```

feature 3: ```(-infinity, 14), [14, infinity)```
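
The fitted edges can be inspected through the estimator's `bin_edges_` attribute, as in the sketch below. The exact inner edge values are not printed here as expected output because they depend on how the installed scikit-learn version computes quantiles. The stored outer edges are the minimum and maximum seen during `fit`, while `transform` treats the outer bins as open-ended, as in the intervals above.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-3., 5., 15.],
              [ 0., 6., 14.],
              [ 6., 3., 11.]])

est = KBinsDiscretizer(n_bins=[3, 2, 2], encode="ordinal").fit(X)

# One array of edges per feature; k bins are described by k + 1 edges.
for i, edges in enumerate(est.bin_edges_, start=1):
    print(f"feature {i}: {len(edges) - 1} bins, edges = {edges}")
```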

<br>

We will call `.transform()` on the fitted estimator to convert the continuous numerical values into discrete bin identifiers.

Based on these bin intervals, `X` is transformed as follows:

In [0]:
est.transform(X)

array([[0., 1., 1.],
       [1., 1., 1.],
       [2., 0., 0.]])
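
The fitted estimator can also be applied to rows it has never seen. In the sketch below, `X_new` is a made-up row whose values fall outside the training range; they land in the first or last bin, which is what the open-ended intervals imply.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-3., 5., 15.],
              [ 0., 6., 14.],
              [ 6., 3., 11.]])
est = KBinsDiscretizer(n_bins=[3, 2, 2], encode="ordinal").fit(X)

# Hypothetical new row: -10 is below every edge, 20 is above every edge.
X_new = np.array([[-10., 4., 20.]])
result = est.transform(X_new)
print(result)  # [[0. 0. 1.]]
```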

Refer here for the documentation:
- [scikit-learn discretization documentation](https://scikit-learn.org/stable/modules/preprocessing.html#discretization)

## **Of course, we will have to build a model both with and without discretization to see which performs better on the specific dataset. Remember that this is an iterative process and there is no one-size-fits-all answer: every dataset and model behaves differently.**
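
A rough sketch of that comparison follows. The data is synthetic and deliberately nonlinear (a noisy sine curve), so it is a case where discretization is expected to help a linear model; the bin count, the `uniform` strategy and the 5-fold cross-validation are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic nonlinear data: y = sin(x) plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)

# Same linear model, with and without a discretization step in front.
plain = LinearRegression()
binned = make_pipeline(
    KBinsDiscretizer(n_bins=10, encode="onehot-dense", strategy="uniform"),
    LinearRegression(),
)

scores = {
    "no discretization": cross_val_score(plain, X, y, cv=5).mean(),
    "with discretization": cross_val_score(binned, X, y, cv=5).mean(),
}
for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

On data with a roughly linear relationship the result can easily flip, so treat this as a template for the comparison rather than evidence that discretization always helps.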