Read generated "datecat.csv" data and encode the following columns:
 Features
   hour - for cyclical encoding
   colour - for dummy encoding
   size - for ordinal encoding
 Target
   move - for label encoding

# Investigating category encodings

This notebook shows how pandas and scikit-learn can be used to encode
categorical features, so that they can be used in scikit-learn based predictors.

Generally, `pandas.get_dummies()` is the easiest way to
encode nominal data in one-hot-encoded (dummy) form.
However, scikit-learn is more flexible and offers a much broader range of
encoders.

At the end of this notebook, there is an example _pipeline_ showing how
encoding and prediction can be integrated. Pipelines and similar, more
advanced uses of Python and scikit-learn will be covered more fully in
Data Mining 2.

NB: usage of pipelines _is not required_ for CA3.

Import libraries

In [None]:
import numpy as np
import pandas as pd
import sklearn.preprocessing as skp
import sklearn.compose as skc
import sys

Read some example data with a variety of column types

In [None]:
df = pd.read_csv('data/datecat.csv', index_col=0)
df.head()
features = ['hour', 'colour', 'size']
target = ['move']
X = df[features].copy()
y = df[target].copy()

Define function, based on
https://www.kaggle.com/code/avanwyk/encoding-cyclical-features-for-deep-learning,
for cyclic encoding

In [None]:
def cyclicEncode(inDf, colName, period):
  encoded = {
    colName+'_cos' : np.cos(2 * np.pi * inDf[colName]/period),
    colName+'_sin' : np.sin(2 * np.pi * inDf[colName]/period)
    }
  outDf = pd.DataFrame(encoded, index=inDf.index)
  return outDf

Apply cyclicEncode() to the hour data

In [None]:
encodedHour = cyclicEncode(df, 'hour', 24)
encodedHour.head()
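
To see why the hour is mapped to two columns rather than kept as a raw number, the following minimal sketch compares distances between hours before and after the mapping. It uses only NumPy; `hour_to_xy` is a hypothetical helper introduced here for illustration and is not part of the notebook.

```python
import numpy as np

def hour_to_xy(hour, period=24):
    """Map an hour onto the unit circle as (cos, sin). Hypothetical helper."""
    angle = 2 * np.pi * hour / period
    return np.array([np.cos(angle), np.sin(angle)])

# On the raw scale, 23:00 and 00:00 look 23 hours apart.
raw_gap = abs(23 - 0)

# On the circle, the step 23:00 -> 00:00 is the same size as 00:00 -> 01:00.
gap_23_to_0 = np.linalg.norm(hour_to_xy(23) - hour_to_xy(0))
gap_0_to_1 = np.linalg.norm(hour_to_xy(0) - hour_to_xy(1))

# Hours 12 apart sit on opposite sides of the circle.
gap_0_to_12 = np.linalg.norm(hour_to_xy(0) - hour_to_xy(12))

print(raw_gap, gap_23_to_0, gap_0_to_1, gap_0_to_12)
```

A distance-based predictor such as KNN would therefore treat 23:00 and 00:00 as neighbours only after the cyclic encoding.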

Derive _full_ set of (0,1)-valued dummy (effectively, one-hot-encoded) variables for each value of Colour

In [None]:
colName = 'colour'
dfColourDummies = pd.get_dummies(df[colName],\
        prefix=colName,\
        dtype=int)
dfColourDummies.head()

Derive _minimal_ set of (0,1)-valued indicator variables from Colour categorical variable

In [None]:
colName = 'colour'
dfColourInd = pd.get_dummies(df[colName],\
        prefix=colName,\
        drop_first=True,\
        dtype=int)\
        .rename(columns={\
        colName+"_Green": "IsGreen",\
        colName+"_Blue": "IsBlue"\
        })
dfColourInd.head()
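
It can help to check which category `drop_first=True` actually removes. The sketch below uses a small, made-up colour column (the real categories in `datecat.csv` may differ) and suggests that the dropped category is the first one in sorted order. A `rename` entry for a dropped column would then simply have no effect.

```python
import pandas as pd

# Hypothetical toy column; the real categories in datecat.csv may differ.
toy = pd.Series(['Red', 'Green', 'Blue', 'Green'], name='colour')

full = pd.get_dummies(toy, prefix='colour', dtype=int)
minimal = pd.get_dummies(toy, prefix='colour', drop_first=True, dtype=int)

print(list(full.columns))
print(list(minimal.columns))

# A row with zeros in every remaining column belongs to the dropped category.
is_dropped = minimal.sum(axis=1) == 0
print(toy[is_dropped].unique())
```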

As an alternative to pandas for encoding to indicators, you can use sklearn's `OneHotEncoder`.
It is perhaps a bit more complex, but the transformation can be reversed easily.

In [None]:
# Fit (=derive) the one-hot-encoder for the colour column
colName = 'colour'
oheEnc = skp.OneHotEncoder(dtype=int) # dummy values are ints
oheEnc.fit(df[[colName]]) # "learn" the mapping to dummies
oheColours = oheEnc.transform(df[[colName]]) # Apply the mapping
dfOheColoursSk = pd.DataFrame(oheColours.toarray(),\
        columns=oheEnc.categories_[0],\
        index=df.index)
dfOheColoursSk.head()

In [None]:
# Fit (=derive) the one-hot-encoder for the colour column,
# dropping the first category to avoid the "dummy trap"
colName = 'colour'
oheEnc1 = skp.OneHotEncoder(dtype=int,\
        drop='first') # drop first category to avoid dummy trap
oheEnc1.fit(df[[colName]]) # "learn" the mapping to dummies
oheColours1 = oheEnc1.transform(df[[colName]]) # Apply the mapping
dfOheColours1Sk = pd.DataFrame(oheColours1.toarray(),\
       columns=oheEnc1.categories_[0][1:],\
       index=df.index)
dfOheColours1Sk.head()

In [None]:
# Derive integers in the right order for the ordinal category
colName = 'size'
categories=[['small','medium','large']]
ordinalEnc = skp.OrdinalEncoder(categories=categories, dtype=int)
sizeArr = df[colName].values.reshape(-1,1)
ordinalEnc.fit(sizeArr)
ordName = 'ord_' + colName
dfSizeOrdinal = pd.DataFrame(ordinalEnc.transform(sizeArr),\
        index=df.index, columns=[ordName])
dfSizeOrdinal.head()

In [None]:
# Complete the round trip: map the integer codes back to the original size labels
sizeOrdinalArr = dfSizeOrdinal.values
dfSizeOrdinalRoundTrip = pd.DataFrame(ordinalEnc.inverse_transform(sizeOrdinalArr),\
        index=df.index, columns=[colName])
dfSizeOrdinalRoundTrip.head()

In [None]:
# Apply label encoder to the target.
# NB: this encoder is intended for target columns. DO NOT USE on features!!!!
colName = 'move'
le = skp.LabelEncoder()
# See https://stackoverflow.com/a/36120015/1988855
le.fit(y.values.ravel())
classes = list(le.classes_)
#print(classes)
yE = le.transform(y.values.ravel())
yE

In [None]:
# We can invert the transformation, to go back from the label-encoded
# target to its original form.
yRoundTrip = list(le.inverse_transform(yE))
yRoundTrip

Category encoding is a major topic in machine learning. Indeed, an
effective choice of encoding can often improve the quality of
a predictive model.
scikit-learn provides many options but even more are provided in
a scikit-learn contributed package: `category_encoders`.
This can be installed using
  `conda install -c conda-forge category_encoders`

In many cases, its API is simpler than that of "vanilla" `sklearn.preprocessing`.
It also offers alternatives to one-hot-encoding for variables with
a large number of categories (say, more than 10). In such cases, an alternative
encoding such as _binary encoding_ is an attractive approach.
One-hot-encoding generates $O(n)$ columns, where $n$ is the number of
categories - for binary encoding, the number of columns generated is $O(\log n)$.
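
The column counts can be illustrated with a hand-rolled sketch in plain Python. This is not the `category_encoders` implementation, whose exact column layout may differ; `binary_encode` is a hypothetical helper that assigns each category an integer code and writes that code in binary digits.

```python
def binary_encode(values):
    """Hand-rolled binary encoding sketch (not the category_encoders implementation)."""
    categories = sorted(set(values))
    # Number of binary digits needed to give every category a distinct code.
    n_bits = max(1, (len(categories) - 1).bit_length())
    codes = {cat: i for i, cat in enumerate(categories)}
    # Each value becomes the n_bits binary digits of its integer code.
    rows = [[(codes[v] >> bit) & 1 for bit in reversed(range(n_bits))]
            for v in values]
    return rows, n_bits

# 12 hypothetical categories: one-hot needs 12 columns, this sketch needs 4.
cats = [f'cat{i:02d}' for i in range(12)]
rows, n_bits = binary_encode(cats)
print(n_bits)
print(rows[11])
```

The saving comes at a cost: the binary columns no longer correspond to individual categories, so they are harder to interpret than dummies.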

See [Guide to Encoding Categorical Values in Python](https://pbpython.com/categorical-encoding.html) for general advice on
category encoding, and [How to Deal with Categorical Data for Machine Learning](https://www.kdnuggets.com/2021/05/deal-with-categorical-data-machine-learning.html)
for more examples of the use of `category_encoders`, especially binary encoding.

In [None]:
# Setup imports and prepare for column transformer in pipeline
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

cyclicFeatures = ['hour']
nominalFeatures = ['colour']
ordinalFeatures = ['size']

In [None]:
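# Strictly speaking, hour should be transformed into 2 cyclic variables, but that
# requires generalising the encoder above so it can be used in a column transformer.
# Here, hour is passed through unchanged.
# The size categories are listed explicitly; otherwise OrdinalEncoder would order
# them alphabetically (large, medium, small).
column_trans = make_column_transformer(\
        (skp.OneHotEncoder(),nominalFeatures),\
        (skp.OrdinalEncoder(categories=[['small','medium','large']]), ordinalFeatures),\
        remainder='passthrough')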
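
One possible way to generalise the cyclic encoder for use in a column transformer is to wrap a function in `FunctionTransformer`. The sketch below is an assumption about how this could be done, not the notebook's own method. `cyclicPair`, `toyX` and `toy_trans` are hypothetical names, and the rows are made up. `sparse_threshold=0` is passed only so that the result is always a dense array.

```python
import numpy as np
import pandas as pd
import sklearn.preprocessing as skp
from sklearn.compose import make_column_transformer

def cyclicPair(X, period=24):
    """Return cos and sin columns for every input column (hypothetical helper)."""
    angle = 2 * np.pi * np.asarray(X, dtype=float) / period
    return np.hstack([np.cos(angle), np.sin(angle)])

cyclicEnc = skp.FunctionTransformer(cyclicPair, kw_args={'period': 24})

# Hypothetical toy rows with the same column names as the notebook's features.
toyX = pd.DataFrame({
    'hour': [0, 6, 12, 23],
    'colour': ['Red', 'Green', 'Blue', 'Green'],
    'size': ['small', 'large', 'medium', 'small'],
})

toy_trans = make_column_transformer(
    (cyclicEnc, ['hour']),
    (skp.OneHotEncoder(), ['colour']),
    (skp.OrdinalEncoder(categories=[['small', 'medium', 'large']]), ['size']),
    sparse_threshold=0)
encoded = toy_trans.fit_transform(toyX)
print(encoded.shape)  # 2 cyclic + 3 one-hot + 1 ordinal columns
```

Because the function is stateless, nothing is learned during `fit`; the period is supplied explicitly through `kw_args`.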

In [None]:
# The pipeline applies the column transformations, and then passes them
# to the KNN classifier.
knn = KNeighborsClassifier()
pipe = make_pipeline(column_trans, knn)

In [None]:
# We now plug in the data (X, yE) - note that the label-encoded target is used.
# The transformer-classifier pipeline is fitted on the training folds of each CV
# split, and the accuracy on each held-out fold is returned as a score.
# Since the data was generated randomly, the scores are not great, but it is clear
# how easy it is to integrate feature encoding with prediction.
# Data Mining 2 will include more advanced topics like these.
cv_accuracy = cross_val_score(pipe, X, yE, cv=5, scoring='accuracy')
cv_accuracy