
### **Encoding categorical features**

Categorical features are either nominal, ordinal or boolean.

### **Nominal data:**

**1. Label Encoding**

LabelEncoder from sklearn is intended for the target variable and hence accepts only a vector as input (not a matrix).
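As a minimal sketch of that vector-only behaviour, assuming scikit-learn is installed (the variable names `labels`, `encoder` and `encoded` are illustrative and not part of the original notebook):

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder works on a single 1-D sequence of labels.
labels = ['b', 'c', 'c', 'd']
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoded.tolist())           # integer code assigned to each label
print(encoder.classes_.tolist())  # sorted unique labels; list position = code
```

The `classes_` attribute keeps the learned labels, so the codes can be mapped back with `encoder.inverse_transform`.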

We will encode a matrix (multivariate encoding) using a pandas trick instead: cast the data to the category dtype, then read off the integer codes.

In [3]:
import pandas as pd
df = pd.DataFrame({'A':list('bccd')})
df

   A
0  b
1  c
2  c
3  d


In [5]:
df = df.astype('category')
df.dtypes

A    category
dtype: object

In [7]:
# Integer codes follow the sorted category order: b -> 0, c -> 1, d -> 2
codes = df['A'].cat.codes
codes

0    0
1    1
2    1
3    2
dtype: int8
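The cell above encodes a single column. A minimal sketch of the same trick applied to every column at once might look like the following; the second column `B` and the names `frame`, `encoded_frame` and `lookup` are hypothetical, added only for illustration:

```python
import pandas as pd

# Hypothetical two-column frame; column 'B' is not in the original notebook.
frame = pd.DataFrame({'A': list('bccd'), 'B': list('xyxz')})

# Cast every column to the category dtype, then take each column's integer codes.
as_category = frame.astype('category')
encoded_frame = as_category.apply(lambda col: col.cat.codes)

# Keep a code -> label lookup per column so the encoding can be reversed later.
lookup = {name: dict(enumerate(as_category[name].cat.categories))
          for name in as_category.columns}

print(encoded_frame)
print(lookup)
```

Each column gets its own independent set of codes, so the same integer can stand for different labels in different columns.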

### **Ordinal data:**

**1. Label encoding with specified order**

In [10]:
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'A':list('bccd')})
ordered = CategoricalDtype(categories=list('abcd'), ordered=True)
df = df.astype(ordered)
df

   A
0  b
1  c
2  c
3  d


In [11]:
# Codes follow the declared order a, b, c, d; 'a' never occurs, so 'b' maps to 1
codes = df['A'].cat.codes
codes

0    1
1    2
2    2
3    3
dtype: int8
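To see why the declared order matters, here is a small sketch with a hypothetical size feature (the labels and variable names are assumptions for illustration). Without an explicit order, pandas sorts the categories alphabetically, which rarely matches the real ranking:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical ordinal feature; these labels are not from the original notebook.
sizes = pd.Series(['medium', 'small', 'large', 'small'])

# Default: categories are sorted alphabetically (large, medium, small).
alphabetical_codes = sizes.astype('category').cat.codes

# Explicit order: codes follow the declared ranking small < medium < large.
size_order = CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
ranked = sizes.astype(size_order)
ranked_codes = ranked.cat.codes

print(alphabetical_codes.tolist())
print(ranked_codes.tolist())
print((ranked < 'large').tolist())  # order-aware comparison on the categories
```

With `ordered=True` the column also supports comparisons, `min`, `max` and sorting that respect the declared ranking.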

You might be tempted to encode nominal data, such as a neighborhood name, with a straightforward numerical mapping: {'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3}.

This is not a useful approach in Scikit-Learn: the package's models make the fundamental assumption that numerical features reflect algebraic quantities. Thus such a mapping would imply, for example, that Queen Anne < Fremont < Wallingford, or even that Wallingford - Queen Anne = Fremont, which does not make much sense.
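The problem can be made concrete with a few lines of plain Python. This is only an illustration of the arithmetic that such a mapping implies, not something a model does explicitly:

```python
# The integer mapping discussed in the text.
neighborhood_codes = {'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3}

# The codes impose an ordering that the neighborhoods do not have.
implied_order = (neighborhood_codes['Queen Anne']
                 < neighborhood_codes['Fremont']
                 < neighborhood_codes['Wallingford'])

# Differences between codes are equally meaningless, yet they are valid arithmetic.
difference = neighborhood_codes['Wallingford'] - neighborhood_codes['Queen Anne']

print(implied_order)
print(difference == neighborhood_codes['Fremont'])
```

Both checks succeed numerically, which is exactly the kind of relationship a model would be free to exploit even though it says nothing about the neighborhoods.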

Better methods for vectorizing categorical features:

**2. One-hot encoding**

In [None]:
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]

In [None]:
# When data is in the form of a list of dictionaries:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)

array([[     0,      1,      0, 850000,      4],
       [     1,      0,      0, 700000,      3],
       [     0,      0,      1, 650000,      3],
       [     1,      0,      0, 600000,      2]])

In [None]:
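# Get the meaning of each column.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
vec.get_feature_names_out().tolist()

['neighborhood=Fremont',
 'neighborhood=Queen Anne',
 'neighborhood=Wallingford',
 'price',
 'rooms']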

If your category has many possible values, this can greatly increase the size of your dataset. However, because the encoded data contains mostly zeros, a sparse output can be a very efficient solution:

In [None]:
vec = DictVectorizer(sparse=True, dtype=int)
vec.fit_transform(data)

<4x5 sparse matrix of type '<class 'numpy.int64'>'
	with 12 stored elements in Compressed Sparse Row format>

Many (though not yet all) of the Scikit-Learn estimators accept such sparse inputs when fitting and evaluating models. sklearn.preprocessing.OneHotEncoder and sklearn.feature_extraction.FeatureHasher are two additional tools that Scikit-Learn includes to support this type of encoding.
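As a rough sketch of the OneHotEncoder route, assuming the neighborhoods are held in a single-column DataFrame (the names `neighborhoods`, `encoder` and `one_hot` are illustrative, and `handle_unknown='ignore'` is one possible choice rather than a requirement):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same neighborhoods as in the DictVectorizer example, as a one-column frame.
neighborhoods = pd.DataFrame(
    {'neighborhood': ['Queen Anne', 'Fremont', 'Wallingford', 'Fremont']}
)

# The default output is a SciPy sparse matrix. handle_unknown='ignore' encodes
# categories that were not seen during fit as an all-zero row.
encoder = OneHotEncoder(handle_unknown='ignore')
one_hot = encoder.fit_transform(neighborhoods)

print(encoder.categories_[0].tolist())  # column order of the encoded matrix
print(one_hot.toarray())                # densified here only for display
```

Unlike DictVectorizer, OneHotEncoder works directly on array-like or DataFrame input, so numeric columns such as price and rooms would be passed through separately.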

**3. Ordinal Encoder**