# Handling Categorical Data

### Encoding Nominal Categorical Features
You have a feature with nominal classes that have no intrinsic ordering (e.g., apple,
pear, banana), and you want to one-hot encode it.

In [12]:
# One-hot encode using scikit-learn's LabelBinarizer
import numpy as np
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

#Create Feature
feature=np.array([['Texas'],
                 ['California'],
                 ['Delaware'],
                 ['Texas']])

#Create one hot encoder
one_hot = LabelBinarizer()

#One Hot encode feature
one_hot.fit_transform(feature)

array([[0, 0, 1],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])

In [5]:
#View feature classes
one_hot.classes_

array(['California', 'Delaware', 'Texas'], dtype='<U10')
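
The column order of the encoded matrix follows the sorted class labels shown above. As a rough illustration of what one-hot encoding does, here is a NumPy-only sketch that compares every observation against every class. It is not how scikit-learn implements the encoder, and the names `states`, `classes`, and `manual_one_hot` are made up for this example.

```python
import numpy as np

# Same values as the feature above, flattened to one dimension
states = np.array(['Texas', 'California', 'Delaware', 'Texas'])

# np.unique returns the distinct labels in sorted order
classes = np.unique(states)

# Broadcasting compares every observation (rows) with every class (columns)
manual_one_hot = (states[:, np.newaxis] == classes).astype(int)

print(classes)
print(manual_one_hot)
```

Each row contains a single 1, in the column of the class that the observation belongs to.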

If we want to reverse the one-hot encoding, we can use inverse_transform:


In [7]:
#Reverse one hot encoding 
one_hot.inverse_transform(one_hot.transform(feature))

array(['Texas', 'California', 'Delaware', 'Texas'], dtype='<U10')

We can also use pandas to one-hot encode the feature:

In [9]:
import pandas as pd

#dtype=int gives 0/1 columns (pandas 2.0 and later default to booleans)
pd.get_dummies(feature[:,0], dtype=int)

   California  Delaware  Texas
0           0         0      1
1           1         0      0
2           0         1      0
3           0         0      1


scikit-learn can also handle the case where each observation lists multiple
classes:

In [13]:
# Create multiclass feature
multiclass_feature = [("Texas", "Florida"),
                      ("California", "Alabama"),
                      ("Texas", "Florida"),
                      ("Delaware", "Florida"),
                      ("Texas", "Alabama")]

#Create multiclass one hot encoder
one_hot_multiclass = MultiLabelBinarizer()

#One hot encode multiclass feature
one_hot_multiclass.fit_transform(multiclass_feature)

array([[0, 0, 0, 1, 1],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 1, 1],
       [0, 0, 1, 1, 0],
       [1, 0, 0, 0, 1]])
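
To interpret the columns of this matrix, it may help to look at the fitted encoder's `classes_` attribute, as was done for LabelBinarizer earlier. The sketch below refits the encoder on the same data so that it runs on its own; it assumes the state name is spelled "Delaware", and the name `encoded` is only for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Same multi-label feature as above
multiclass_feature = [("Texas", "Florida"),
                      ("California", "Alabama"),
                      ("Texas", "Florida"),
                      ("Delaware", "Florida"),
                      ("Texas", "Alabama")]

one_hot_multiclass = MultiLabelBinarizer()
encoded = one_hot_multiclass.fit_transform(multiclass_feature)

# classes_ lists the labels in the same order as the matrix columns
print(one_hot_multiclass.classes_)
```

The labels come back in sorted order, so the first column corresponds to Alabama and the last to Texas.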

### Encoding Ordinal Categorical Features 
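You have an ordinal categorical feature (e.g., high, medium, low) and want to encode it
so that the order of the classes is preserved.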

In [2]:
import pandas as pd

#Create feature
dataframe = pd.DataFrame({'Score': ['Low', 'Low', 'Medium', 'Medium', 'High']})

#Create mapper
scale_mapper={'Low':1,
             'Medium':2,
             'High':3}

#Replace feature values with scale
dataframe['Score'].replace(scale_mapper)

0    1
1    1
2    2
3    2
4    3
Name: Score, dtype: int64
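
One thing to watch for: `replace` leaves any value that is missing from the mapper unchanged, so an unexpected category would stay in the column as a string. A possible alternative, sketched here with a made-up `'Very High'` category that is not in the original data, is `Series.map`, which turns unmapped values into NaN so they are easy to spot.

```python
import pandas as pd

# Hypothetical scores, including a category the mapper does not know about
scores = pd.Series(['Low', 'Medium', 'High', 'Very High'])

scale_mapper = {'Low': 1, 'Medium': 2, 'High': 3}

# map() returns NaN for any value without an entry in the mapper
mapped = scores.map(scale_mapper)

# Select the original values that could not be mapped
unmapped = scores[mapped.isna()]

print(mapped)
print(unmapped)
```

Which behavior is preferable depends on whether unknown categories should be kept as they are or flagged for review.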

### Encoding Dictionaries of Features
You have a list of dictionaries and want to convert it into a feature matrix.

In [4]:
#import library
from sklearn.feature_extraction import DictVectorizer

#Create Dictionary
data_dict = [{"Red": 2, "Blue": 4},
             {"Red": 4, "Blue": 3},
             {"Red": 1, "Yellow": 2},
             {"Red": 2, "Yellow": 2}]

#Create Dictionary vectorizer
dictvectorizer= DictVectorizer(sparse=False)

#Convert dictionary to feature matrix
feature = dictvectorizer.fit_transform(data_dict)

#View feature matrix
feature

array([[4., 2., 0.],
       [3., 4., 0.],
       [0., 1., 2.],
       [0., 2., 2.]])

We can get the names of each generated feature using the get_feature_names_out
method (get_feature_names was removed in scikit-learn 1.2):

In [6]:
#Get feature names
feature_names = dictvectorizer.get_feature_names_out()

#View feature names
feature_names

array(['Blue', 'Red', 'Yellow'], dtype=object)

While not necessary, for the sake of illustration we can create a pandas DataFrame to
view the output better:


In [8]:
pd.DataFrame(feature, columns=feature_names)

   Blue  Red  Yellow
0   4.0  2.0     0.0
1   3.0  4.0     0.0
2   0.0  1.0     2.0
3   0.0  2.0     2.0
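
The example above passes `sparse=False` to get a dense array. By default, DictVectorizer returns a SciPy sparse matrix, which stores only the entries present in the dictionaries and can save memory when most values are zero. A minimal sketch of the default behavior, using the same data (the names `sparse_vectorizer` and `sparse_features` are only for illustration):

```python
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

# Same list of dictionaries as above
data_dict = [{"Red": 2, "Blue": 4},
             {"Red": 4, "Blue": 3},
             {"Red": 1, "Yellow": 2},
             {"Red": 2, "Yellow": 2}]

# sparse=True is the default, so no argument is needed
sparse_vectorizer = DictVectorizer()
sparse_features = sparse_vectorizer.fit_transform(data_dict)

# Confirm the result is sparse, then convert it to a dense array for viewing
print(sparse.issparse(sparse_features))
print(sparse_features.toarray())
```

Converting with `toarray()` is fine for a small example like this one; for large, mostly empty matrices it is usually better to keep the sparse form.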


### Imputing Missing Class Values
You have a categorical feature containing missing values that you want to replace with
predicted values.

In [11]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

#Create feature matrix with categorical feature
X = np.array([[0, 2.10, 1.45],
              [1, 1.18, 1.33],
              [0, 1.22, 1.27],
              [1, -0.21, -1.19]])

# Create feature matrix with missing values in the categorical feature
X_with_nan = np.array([[np.nan, 0.87, 1.31],
                       [np.nan, -0.67, -0.22]])

#Train KNN learner
clf = KNeighborsClassifier(3, weights='distance')
trained_model=clf.fit(X[:, 1:], X[:,0])

#Predict missing values class
imputed_values = trained_model.predict(X_with_nan[:,1:])

#Join column of predicted class with other features
X_with_imputed = np.hstack((imputed_values.reshape(-1,1), X_with_nan[:,1:]))

#Join two feature matrices
np.vstack((X_with_imputed, X))

array([[ 0.  ,  0.87,  1.31],
       [ 1.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])

An alternative solution is to fill in missing values with the feature’s most frequent
value:

In [15]:
from sklearn.impute import SimpleImputer

#Join the two feature matrices
X_complete=np.vstack((X_with_nan, X))

#Fill missing values with each column's most frequent value
imputer=SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X_complete)

array([[ 0.  ,  0.87,  1.31],
       [ 0.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])
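
Note that in this data the two known classes are tied (two 0s and two 1s). scikit-learn's imputer documentation states that the smallest value is returned when several values are equally frequent, which is consistent with the 0s filled in above. A NumPy-only sketch of the same idea, not the library's implementation (all variable names here are made up):

```python
import numpy as np

# First column of the joined feature matrix: two missing values, then 0, 1, 0, 1
column = np.array([np.nan, np.nan, 0, 1, 0, 1])

# Count how often each observed (non-missing) value occurs
observed = column[~np.isnan(column)]
values, counts = np.unique(observed, return_counts=True)

# np.unique sorts the values and argmax returns the first maximum,
# so a tie resolves to the smallest value
most_frequent = values[np.argmax(counts)]

# Replace missing entries with the most frequent value
filled = np.where(np.isnan(column), most_frequent, column)

print(most_frequent)
print(filled)
```

With a tie like this one, the imputed class is somewhat arbitrary, so the KNN approach shown earlier may be a better fit when the other features carry information about the class.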