<h1><center> Encoding </center></h1>

*https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/*

*https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/*

## *Written by Nathanael Hitch*

<hr>

Machine learning algorithms often require, or at least work better when, data is prepared in specific ways before a model is fitted. The main example here is one-hot encoding of categorical data.

Categorical data are variables that contain label values rather than numeric values; the number of possible values is often limited to a fixed set. For example:

- A "pet" variable with the values: "dog" and "cat".
- A "color" variable with the values: "red", "green" and "blue".
- A "place" variable with the values: "first", "second" and "third".

Some algorithms can work with categorical data directly; for example, a decision tree can learn from categorical data with no data transform required, depending on the specific implementation.<br>
However, many machine learning algorithms cannot operate on label data directly and require all input and output variables to be numeric.

This means that categorical data must be converted to a numerical form. Additionally, if the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.

### Converting from Categorical to Numerical

**1. Integer Encoding**

First, each unique category value is assigned an integer value. For example, "red" is 1, "green" is 2, and "blue" is 3. This is called a **label encoding** or **integer encoding**, and it is easily reversible.

For some variables, this may be enough: when the integer values have a natural ordered relationship, machine learning algorithms may be able to learn and exploit it. Ordinal variables such as the "place" example above are a case where a label encoding is sufficient.

In [1]:
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

In [2]:
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data, "\n")

# define ordinal encoding
encoder = OrdinalEncoder()

# transform data
result = encoder.fit_transform(data)
print(result)

[['red']
 ['green']
 ['blue']] 

[[2.]
 [1.]
 [0.]]
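
By default, `OrdinalEncoder` assigns integers in sorted (alphabetical) order of the category values, which is why "blue" became 0 and "red" became 2 above. For a genuinely ordinal variable, that order may not be the one you want. The sketch below uses a hypothetical "size" variable (not part of the original examples) to show how the `categories` parameter can be used to state the order explicitly; treat it as an illustration rather than a recipe.

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# hypothetical ordinal variable, made up for this illustration
sizes = asarray([['small'], ['large'], ['medium']])

# default behaviour: categories are sorted alphabetically (large, medium, small)
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(sizes)
print(default_encoder.categories_)
print(default_codes.ravel())

# explicit ordering: small < medium < large
ordered_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered_codes = ordered_encoder.fit_transform(sizes)
print(ordered_codes.ravel())
```

With the explicit list, "small" maps to 0, "medium" to 1 and "large" to 2, so the integers follow the real ordering of the categories.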


**2. One-Hot Encoding**

For categorical variables where no such ordinal relationship exists, the integer encoding is not enough.<br>
Using this encoding allows the model to assume a natural ordering between categories, which could result in poor performance or unexpected results (such as predictions halfway between categories).

In this case, a one-hot encoding can be applied to the integer representation: the integer-encoded variable is removed and a new binary variable is added for each unique integer value.

For the "color" variable example, there are 3 categories and therefore 3 binary variables are needed. A "1" value is placed in the binary variable for the color and "0" values in the binary variables for the other colors:

- red = 1, 0, 0
- green = 0, 1, 0
- blue = 0, 0, 1
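
To make the mechanics concrete, here is a minimal NumPy sketch of the same idea done by hand. The `colors` sample and the column order in `categories` are assumptions chosen to match the mapping above; this is an illustration of the technique, not how scikit-learn implements it.

```python
import numpy as np

# sample values to encode (made up for this illustration)
colors = ['red', 'green', 'blue', 'green']

# column order chosen by hand to match the mapping above
categories = ['red', 'green', 'blue']

# step 1: integer encoding (red -> 0, green -> 1, blue -> 2)
index = {category: i for i, category in enumerate(categories)}
codes = np.array([index[color] for color in colors])

# step 2: one binary column per category, with a 1 in the column of each code
onehot_manual = np.eye(len(categories), dtype=int)[codes]
print(onehot_manual)
```

Each row contains exactly one 1, in the column that belongs to that row's category.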

In [3]:
from sklearn.preprocessing import OneHotEncoder

In [4]:
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data, "\n")

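# define one hot encoding, returning a dense array
# (the argument is sparse_output in scikit-learn >= 1.2; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)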

# transform data
onehot = encoder.fit_transform(data)
print(onehot)

[['red']
 ['green']
 ['blue']] 

[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
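
Note that the columns in the output above follow the encoder's sorted category order (blue, green, red), which is why "red" comes out as `0 0 1` rather than `1 0 0`. The sketch below shows how the column order can be inspected and how the encoding can be reversed. It also sets `handle_unknown='ignore'`, an option that the cell above does not use, to show one way of dealing with a category that was not seen during fitting; "purple" is a made-up value for this purpose.

```python
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder

data = asarray([['red'], ['green'], ['blue']])

# handle_unknown='ignore' is an assumption for this sketch, not part of the cell above
encoder = OneHotEncoder(handle_unknown='ignore')

# the default output is a sparse matrix, so convert it to a dense array for printing
onehot = encoder.fit_transform(data).toarray()

# the column order of the encoded output
columns = encoder.categories_[0].tolist()
print(columns)

# map the binary rows back to the original labels
decoded = encoder.inverse_transform(onehot).ravel().tolist()
print(decoded)

# a category not seen during fit is encoded as a row of zeros
unseen = encoder.transform(asarray([['purple']])).toarray()
print(unseen)
```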


Applying this to a real dataset (the breast cancer data, loaded directly from GitHub):

In [5]:
from pandas import read_csv

# define the location of the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"

# load the dataset
dataset = read_csv(url, header=None)

# retrieve the array of data
data = dataset.values

# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

# summarize
print('Input', X.shape)
print('Output', y.shape)

Input (286, 9)
Output (286,)


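The input variables (X) are encoded with the `OrdinalEncoder` as before. For the labels (y), we can use the **LabelEncoder**, which performs the same integer encoding but is designed for a one-dimensional target array: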

In [8]:
from sklearn.preprocessing import LabelEncoder

# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
X = ordinal_encoder.fit_transform(X)

# ordinal encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# summarize the transformed data
print('Input', X.shape)
print(X[:5, :])
print('Output', y.shape)
print(y[:5])

Input (286, 9)
[[2. 2. 2. 0. 1. 2. 1. 2. 0.]
 [3. 0. 2. 0. 0. 0. 1. 0. 0.]
 [3. 0. 6. 0. 0. 1. 0. 1. 0.]
 [2. 2. 6. 0. 1. 2. 1. 1. 1.]
 [2. 2. 5. 4. 1. 1. 0. 4. 0.]]
Output (286,)
[1 0 1 0 1]
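
As mentioned earlier, predictions for a categorical output often need to be converted back into label form. A minimal sketch of that round trip with `LabelEncoder` is shown below; the "no"/"yes" labels and the `predicted` values are made-up stand-ins, not the dataset's real classes or real model output.

```python
from sklearn.preprocessing import LabelEncoder

# made-up target labels standing in for a real output column
y_raw = ['no', 'yes', 'no', 'no', 'yes']

demo_encoder = LabelEncoder()
y_encoded = demo_encoder.fit_transform(y_raw)

# classes_ holds the labels in the order of their integer codes
classes = demo_encoder.classes_.tolist()
print(classes)
print(y_encoded)

# pretend these integers came from model.predict(...)
predicted = [1, 0, 1]

# convert the integer predictions back into the original labels
predicted_labels = demo_encoder.inverse_transform(predicted).tolist()
print(predicted_labels)
```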


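Finally, we split the data, fit the encoders on the training set only (so that nothing is learned from the test set), apply them to both sets, and evaluate a logistic regression model:

In [10]: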
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split

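# start again from the raw string columns, because X and y were overwritten with their encoded versions above
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)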

# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(X_train)

X_train = ordinal_encoder.transform(X_train)
X_test = ordinal_encoder.transform(X_test)

# ordinal encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)

y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)

# define the model
model = LogisticRegression()

# fit on the training set
model.fit(X_train, y_train)

# predict on test set
yhat = model.predict(X_test)

# evaluate predictions
print("Logistic Regression Accuracy: %.2f" % (metrics.accuracy_score(y_test, yhat) * 100))
print("Logistic Regression Precision: %.2f" % (metrics.precision_score(y_test, yhat) * 100))
print("Logistic Regression Recall: %.2f" % (metrics.recall_score(y_test, yhat) * 100))

Logistic Regression Accuracy: 75.79
Logistic Regression Precision: 91.67
Logistic Regression Recall: 33.33
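
The model above was trained on ordinal-encoded inputs. As a rough sketch of how the one-hot variant of the same workflow might look, the example below chains a `OneHotEncoder` and a `LogisticRegression` with `make_pipeline`, so the encoder is fitted only on the data passed to `fit`. The tiny `X_demo`/`y_demo` dataset is made up for illustration (it is not the breast cancer data), and no results for the real dataset are implied.

```python
from numpy import asarray
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# made-up categorical inputs: a color column and a size column
X_demo = asarray([
    ['red', 'small'], ['green', 'large'], ['blue', 'small'],
    ['red', 'large'], ['green', 'small'], ['blue', 'large'],
])

# made-up target: 1 for 'small' rows, 0 for 'large' rows
y_demo = asarray([1, 0, 1, 0, 1, 0])

# the pipeline one-hot encodes the inputs and then fits the model
pipeline = make_pipeline(OneHotEncoder(handle_unknown='ignore'), LogisticRegression())
pipeline.fit(X_demo, y_demo)

# 3 colors + 2 sizes give 5 binary columns
encoded_shape = pipeline[0].transform(X_demo).shape
print(encoded_shape)

# predict on new rows; encoding is applied automatically
preds = pipeline.predict(asarray([['red', 'small'], ['blue', 'large']]))
print(preds)
```

Keeping the encoder inside the pipeline means the same fitted encoding is reused at prediction time, which avoids fitting it separately on the training and test sets.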
