<center>
    <h1>
        <b>
            Chapter 5. Handling Categorical Data
        </b>
    </h1>
</center>

---

### **5.0 Introduction**

---

### **5.1 Encoding Nominal Categorical Features**

**Problem**
- You have a feature with nominal classes that have no intrinsic ordering

**Solution**
- One-hot encode the feature using scikit-learn's `LabelBinarizer`

In [1]:
#Import libraries
import numpy as np
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

#Create feature
feature = np.array([['Texas'],
                    ['California'],
                    ['Texas'],
                    ['Delaware'],
                    ['Texas']])

#Create one-hot encoder
one_hot = LabelBinarizer()

#One-hot encode feature
one_hot.fit_transform(feature)

array([[0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 0, 1]])

Use the `classes_` attribute to view the classes:

In [2]:
#View feature classes
one_hot.classes_

array(['California', 'Delaware', 'Texas'], dtype='<U10')

To reverse the one-hot encoding, use `inverse_transform`:

In [3]:
#Reverse the one-hot encoding (the encoder was already fit above, so transform is enough)
one_hot.inverse_transform(one_hot.transform(feature))

array(['Texas', 'California', 'Texas', 'Delaware', 'Texas'], dtype='<U10')
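
`LabelBinarizer` is documented as a tool for target labels. For input features, scikit-learn also provides `OneHotEncoder`. The following is a minimal sketch, assuming the same toy feature as above; the `'Nevada'` row is a hypothetical unseen category added only for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same toy feature as above: a single column of state names
feature = np.array([['Texas'],
                    ['California'],
                    ['Texas'],
                    ['Delaware'],
                    ['Texas']])

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')

# fit_transform returns a SciPy sparse matrix by default; densify it for viewing
encoded = encoder.fit_transform(feature).toarray()

# Categories learned for the first (and only) input column
categories = encoder.categories_[0]

# A category that was not seen during fit (hypothetical example value)
unseen = encoder.transform(np.array([['Nevada']])).toarray()

print(encoded)
print(categories)
print(unseen)
```

Ignoring unknown categories is one possible design choice; the default, `handle_unknown='error'`, raises an error instead, which may be preferable when unseen values indicate a data problem.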

Alternatively, use pandas to one-hot encode the feature:

In [6]:
#Import library
import pandas as pd

#Create dummy variables from feature
pd.get_dummies(feature[:, 0])

|   | California | Delaware | Texas |
|---|------------|----------|-------|
| 0 | 0          | 0        | 1     |
| 1 | 1          | 0        | 0     |
| 2 | 0          | 0        | 1     |
| 3 | 0          | 1        | 0     |
| 4 | 0          | 0        | 1     |


When each observation can belong to multiple classes, use `MultiLabelBinarizer`:

In [7]:
#Create multiclass feature
multiclass_features = [('Texas', 'Florida'),
                       ('California', 'Alabama'),
                       ('Texas', 'Florida'),
                       ('Delaware', 'Florida'),
                       ('Texas', 'Alabama')]

#Create multiclass one-hot encoder
one_hot_multiclass = MultiLabelBinarizer()

#One-hot encode multiclass feature
one_hot_multiclass.fit_transform(multiclass_features)

array([[0, 0, 0, 1, 1],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 1, 1],
       [0, 0, 1, 1, 0],
       [1, 0, 0, 0, 1]])
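
The columns of the multilabel matrix are easier to read once the class order is known. The sketch below assumes a fresh `MultiLabelBinarizer` fit on the same five observations. It shows the `classes_` attribute and `inverse_transform`, which recovers the labels from the indicator matrix.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each observation carries two labels
multiclass_feature = [('Texas', 'Florida'),
                      ('California', 'Alabama'),
                      ('Texas', 'Florida'),
                      ('Delaware', 'Florida'),
                      ('Texas', 'Alabama')]

# Fit the encoder and build the indicator matrix
one_hot_multiclass = MultiLabelBinarizer()
encoded = one_hot_multiclass.fit_transform(multiclass_feature)

# Column order of the indicator matrix (class labels in sorted order)
classes = one_hot_multiclass.classes_

# Recover the labels of each observation as tuples, ordered as in classes_
decoded = one_hot_multiclass.inverse_transform(encoded)

print(classes)
print(decoded)
```

Note that the decoded tuples follow the sorted class order, not the order in which the labels were originally written.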

---

### **5.2 Encoding Ordinal Categorical Features**

**Problem**
- You have an ordinal categorical feature

**Solution**
- Use pandas DataFrame's `replace` method to transform string labels to numerical equivalents

In [9]:
#Load library
import pandas as pd

#Create features
dataframe = pd.DataFrame({
    'Score': ["Low", "Low", "Medium", "Medium", "High"]
})

#Create mapper
scale_mapper = {
    'Low': 1,
    'Medium': 2,
    'High': 3
}

#Replace feature values with scale
dataframe['Score'].replace(scale_mapper)

0    1
1    1
2    2
3    2
4    3
Name: Score, dtype: int64
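
Two related pandas idioms can serve the same purpose. The sketch below assumes the same `Score` column. `Series.map` yields `NaN` for any label missing from the mapper, which makes unexpected categories easy to spot, while an ordered `Categorical` keeps the original labels and stores the order explicitly.

```python
import pandas as pd

# Same ordinal feature as above
dataframe = pd.DataFrame({
    'Score': ["Low", "Low", "Medium", "Medium", "High"]
})

scale_mapper = {'Low': 1, 'Medium': 2, 'High': 3}

# map returns NaN for labels that are not keys of the mapper
encoded = dataframe['Score'].map(scale_mapper)

# Alternative: declare the order once and read zero-based integer codes
ordered = pd.Categorical(dataframe['Score'],
                         categories=['Low', 'Medium', 'High'],
                         ordered=True)
codes = ordered.codes  # Low=0, Medium=1, High=2

print(encoded.tolist())
print(codes.tolist())
```

Either way, the chosen integers imply equal spacing between adjacent classes. If that assumption does not hold for the data, the values in the mapper can be adjusted (for example, a hypothetical `'Medium': 2.5`).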

---

### **5.3 Encoding Dictionaries of Features**

**Problem**
- You have a dictionary and want to convert it into a feature matrix

**Solution**
- Use `DictVectorizer`


In [10]:
#Import library
from sklearn.feature_extraction import DictVectorizer

#Create dictionary
data_dict = [{'Red': 2, 'Blue': 4},
             {'Red': 4, 'Blue': 3},
             {'Red': 1, 'Yellow': 2},
             {'Red': 2, 'Yellow': 2}]

#Create dictionary vectorizer
dictvectorizer = DictVectorizer(sparse=False)

#Convert dictionary to feature matrix
features = dictvectorizer.fit_transform(data_dict)

#View feature matrix
features

array([[4., 2., 0.],
       [3., 4., 0.],
       [0., 1., 2.],
       [0., 2., 2.]])

Use the `get_feature_names_out` method to get the name of each generated feature:

In [12]:
#Get feature names
feature_names = dictvectorizer.get_feature_names_out()

#View feature names
feature_names

array(['Blue', 'Red', 'Yellow'], dtype=object)

---

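### **5.4 Imputing Missing Class Values**

**Problem**
- You have a categorical feature containing missing values that you want to replace with predicted values

**Solution**
- Train a k-nearest neighbors (KNN) classifier on the observations with known classes and use it to predict the missing ones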

In [1]:
#Load libraries
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

#Create feature matrix with categorical feature
X = np.array([[0, 2.1, 1.45],
              [1, 1.18, 1.33],
              [0, 1.22, 1.27],
              [1, -0.21, -1.19]])

#Create feature matrix with missing values in the categorical feature
X_with_nan = np.array([[np.nan, 0.87, 1.31],
                       [np.nan, -0.67, -0.22]])

#Train KNN learner
clf = KNeighborsClassifier(3, weights='distance')
trained_model = clf.fit(X[:,1:], X[:,0])

#Predict missing value's class
imputed_values = trained_model.predict(X_with_nan[:, 1:])

#Join column of predicted class with their other features
X_with_imputed = np.hstack((imputed_values.reshape(-1,1), X_with_nan[:,1:]))

#Join two feature matrices
np.vstack((X_with_imputed, X))


array([[ 0.  ,  0.87,  1.31],
       [ 1.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])

Fill in missing values with the feature's most frequent value

In [2]:
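#Load library (SimpleImputer replaces the Imputer class removed from sklearn.preprocessing)
from sklearn.impute import SimpleImputer

#Join the two feature matrices
X_complete = np.vstack((X_with_nan, X))

#Create imputer that fills each column with its most frequent value
imputer = SimpleImputer(strategy='most_frequent')

#Impute missing values (classes 0 and 1 are tied here, so the smaller value is used)
imputer.fit_transform(X_complete)

array([[ 0.  ,  0.87,  1.31],
       [ 0.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])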
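
The most-frequent strategy is not limited to numeric class codes. The sketch below assumes a hypothetical single-column feature of color names, stored as an object array so that `np.nan` can mark the missing entry, and uses `SimpleImputer` from `sklearn.impute`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical feature with one missing value
colors = np.array([['red'],
                   ['blue'],
                   ['red'],
                   [np.nan]], dtype=object)

# Fill missing entries with the most frequent value of each column
imputer = SimpleImputer(strategy='most_frequent')
imputed = imputer.fit_transform(colors)

# Value learned for the column during fit
fill_value = imputer.statistics_[0]

print(imputed.ravel().tolist())
print(fill_value)
```

This approach is simple and fast, but unlike the KNN approach it ignores the other features when choosing the replacement.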

---

### **5.5 Handling Imbalanced Classes**

---