## Encoding Categorical Data

#### Many encoding schemes exist for handling categorical data. We will look at two simple ones:

- LabelEncoder for encoding class (target) information

- OrdinalEncoder for encoding categorical features
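
As a quick preview of the two encoders, here is a minimal sketch on a tiny made-up data set (the arrays `toy_X` and `toy_y` are invented for illustration and are not part of the breast cancer data). Both encoders assign integer codes to the sorted categories.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy data, made up for illustration only
toy_y = np.array(["no", "yes", "no", "yes"])
toy_X = np.array([["left", "premeno"],
                  ["right", "ge40"],
                  ["left", "ge40"],
                  ["right", "premeno"]])

# LabelEncoder expects a 1-D array of class labels
toy_y_encoded = LabelEncoder().fit_transform(toy_y)

# OrdinalEncoder expects a 2-D array, one column per feature
toy_X_encoded = OrdinalEncoder().fit_transform(toy_X)

print(toy_y_encoded)
print(toy_X_encoded)
```

Note the difference in input shape: LabelEncoder is meant for the target column, OrdinalEncoder for a table of features.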



In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder

In [2]:
from sklearn import metrics

### Somewhat sanitized data from the UCI ML repository

In [84]:
# data from https://archive.ics.uci.edu/ml/datasets/breast+cancer, modified
data = pd.read_csv("breast-cancer_modified.csv", na_values=['?'])

In [85]:
data

Unnamed: 0,Class,age,menopause,tumor-size,Inv-nodes,Node-caps,Deg-malig,breast,breast-quad,irradiat
0,no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no
1,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,right_up,no
2,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
3,no-recurrence-events,60-69,ge40,15-19,0-2,no,2,right,left_up,no
4,no-recurrence-events,40-49,premeno,0-4,0-2,no,2,right,right_low,no
...,...,...,...,...,...,...,...,...,...,...
279,recurrence-events,30-39,premeno,30-34,0-2,no,2,left,left_up,no
280,recurrence-events,30-39,premeno,20-24,0-2,no,3,left,left_up,yes
281,recurrence-events,60-69,ge40,20-24,0-2,no,1,right,left_up,no
282,recurrence-events,40-49,ge40,30-34,3-5,no,3,left,left_low,no


### Check for and drop missing values. Dropping rows is not the recommended procedure in a production environment; there you would normally impute the missing data using one of many techniques.

In [86]:
np.sum(data.isna())

Class          0
age            0
menopause      0
tumor-size     0
Inv-nodes      0
Node-caps      8
Deg-malig      0
breast         0
breast-quad    1
irradiat       0
dtype: int64

### There are 9 missing values: 8 in "Node-caps" and 1 in "breast-quad". We drop every row that contains one, which also throws away the valid values in those rows.

In [87]:
datac = data.dropna()

In [88]:
datac

Unnamed: 0,Class,age,menopause,tumor-size,Inv-nodes,Node-caps,Deg-malig,breast,breast-quad,irradiat
0,no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no
1,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,right_up,no
2,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
3,no-recurrence-events,60-69,ge40,15-19,0-2,no,2,right,left_up,no
4,no-recurrence-events,40-49,premeno,0-4,0-2,no,2,right,right_low,no
...,...,...,...,...,...,...,...,...,...,...
279,recurrence-events,30-39,premeno,30-34,0-2,no,2,left,left_up,no
280,recurrence-events,30-39,premeno,20-24,0-2,no,3,left,left_up,yes
281,recurrence-events,60-69,ge40,20-24,0-2,no,1,right,left_up,no
282,recurrence-events,40-49,ge40,30-34,3-5,no,3,left,left_low,no


In [89]:
np.sum(datac.isna())

Class          0
age            0
menopause      0
tumor-size     0
Inv-nodes      0
Node-caps      0
Deg-malig      0
breast         0
breast-quad    0
irradiat       0
dtype: int64

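### Isolate the features and the class/target column. Here the target (Class) is the first column; the leftmost column in the tables above is the DataFrame index, not a feature.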

In [90]:
X = datac.iloc[:, 1:]

In [91]:
y = datac.iloc[:, 0]

### Split into training and test data

In [92]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [93]:
X_train.head()

Unnamed: 0,age,menopause,tumor-size,Inv-nodes,Node-caps,Deg-malig,breast,breast-quad,irradiat
106,30-39,premeno,40-44,0-2,no,2,right,right_up,no
158,40-49,premeno,35-39,0-2,no,1,left,left_low,no
81,60-69,ge40,15-19,0-2,no,2,right,left_low,no
135,30-39,premeno,40-44,3-5,no,3,right,right_up,yes
157,50-59,ge40,30-34,0-2,no,1,right,central,no


In [94]:
X_test.head()

Unnamed: 0,age,menopause,tumor-size,Inv-nodes,Node-caps,Deg-malig,breast,breast-quad,irradiat
118,60-69,ge40,15-19,0-2,no,1,left,right_low,no
187,40-49,premeno,10-14,0-2,no,2,right,left_up,no
220,30-39,premeno,30-34,0-2,no,1,right,left_up,no
154,60-69,ge40,40-44,3-5,no,2,right,left_up,yes
215,50-59,ge40,20-24,0-2,no,2,left,left_up,no


### Encoding is done in two steps: OrdinalEncoder for the features and LabelEncoder for the class/target column

### Note that if you have both categorical and numeric feature columns, you will need to deal with them separately. Here all features are treated as categorical.
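
One common way to handle mixed column types is scikit-learn's `ColumnTransformer`. The sketch below is only an illustration: the frame `toy` and its numeric column `tumor_mm` are hypothetical and do not exist in the breast cancer data, and scaling the numeric column is just one possible choice.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical frame with one categorical and one numeric column
toy = pd.DataFrame({
    "menopause": ["premeno", "ge40", "ge40", "premeno"],
    "tumor_mm": [32.0, 17.0, 22.0, 41.0],  # made-up numeric feature
})

# Route each column to the transformer that suits its type
preprocess = ColumnTransformer([
    ("cat", OrdinalEncoder(), ["menopause"]),
    ("num", StandardScaler(), ["tumor_mm"]),
])

# Output columns follow the order of the transformers above
toy_encoded = preprocess.fit_transform(toy)
print(toy_encoded)
```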

In [95]:
# ordinal encode input variables; fit on the training features only
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(X_train)


OrdinalEncoder()

In [97]:
X_train_encoded = ordinal_encoder.transform(X_train)
X_test_encoded = ordinal_encoder.transform(X_test)


In [98]:
X_train_encoded[1:5,:]

array([[1., 2., 6., 0., 0., 0., 0., 1., 0.],
       [3., 0., 2., 0., 0., 1., 1., 1., 0.],
       [0., 2., 7., 3., 0., 2., 1., 4., 1.],
       [2., 0., 5., 0., 0., 0., 1., 0., 0.]])

In [99]:
X_test_encoded[1:5, :]

array([[1., 2., 1., 0., 0., 1., 1., 2., 0.],
       [0., 2., 5., 0., 0., 0., 1., 2., 0.],
       [3., 0., 7., 3., 0., 1., 1., 2., 1.],
       [2., 0., 3., 0., 0., 1., 0., 2., 0.]])

### Notice how we did a fit, followed by a transform. Both steps are necessary because fit alone doesn't transform the data; it only prepares the encoder. On the training data you can combine fit and transform into a single call to fit_transform(). The test data is only ever transformed, using the encoder fitted on the training data.
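
A minimal sketch of the combined call, on made-up arrays. It also shows one situation to watch for: a category that appears in the test data but not in the training data makes a default OrdinalEncoder raise an error at transform time. The `handle_unknown` and `unknown_value` settings used here (available in scikit-learn 0.24 and later) are one way around that; the value -1 is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy data, made up for illustration only
train = np.array([["left"], ["right"], ["left"]])
test = np.array([["right"], ["central"]])  # "central" never appears in train

# Map categories unseen during fit to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_encoded = encoder.fit_transform(train)  # fit and transform in one call
test_encoded = encoder.transform(test)        # reuse the fitted mapping

print(train_encoded)
print(test_encoded)
```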

### Now we encode the class labels using LabelEncoder

In [100]:
# label encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train_encoded = label_encoder.transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

In [101]:
y_train_encoded[1:5]

array([0, 0, 0, 0])
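
To see which class each integer stands for, you can inspect the fitted encoder. The sketch below uses a small stand-in array rather than `y_train`, so it runs on its own; the attribute `classes_` and the method `inverse_transform` are part of LabelEncoder.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Stand-in labels using the two class names from the data set
labels = np.array(["no-recurrence-events",
                   "recurrence-events",
                   "no-recurrence-events"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted labels; a label's position is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

# inverse_transform turns integer codes back into the original labels
decoded = encoder.inverse_transform(encoded)

print(mapping)
print(decoded)
```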

### We use a logistic regression model

In [102]:
model = LogisticRegression()

In [103]:
model.fit(X_train_encoded, y_train_encoded)

LogisticRegression()

In [104]:
predicted = model.predict(X_test_encoded)

In [105]:
accuracy = metrics.accuracy_score(y_test_encoded, predicted)

In [106]:
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 65.45
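
An accuracy figure is easier to judge next to a baseline. The sketch below, on made-up labels and predictions (not the results above), compares a model against always predicting the most frequent training class and prints a confusion matrix. Using `DummyClassifier` for the baseline is a suggestion here, not something the notebook itself does.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up encoded labels and predictions, for illustration only
y_train_toy = np.array([0, 0, 0, 1, 0, 1, 0, 0])
y_test_toy = np.array([0, 0, 1, 0, 1])
predicted_toy = np.array([0, 0, 0, 0, 1])

# Baseline: always predict the most frequent training class.
# The features are ignored by this strategy, so zeros are enough.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(np.zeros((len(y_train_toy), 1)), y_train_toy)
baseline_pred = baseline.predict(np.zeros((len(y_test_toy), 1)))

print("Baseline accuracy:", accuracy_score(y_test_toy, baseline_pred))
print("Model accuracy:", accuracy_score(y_test_toy, predicted_toy))

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test_toy, predicted_toy))
```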


### Oh well!!

### Try to rework this without throwing away whole rows that contain missing values
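
One possible starting point is to fill each missing value with the most frequent value of its column, using `SimpleImputer`. This is a sketch on a made-up frame that borrows two column names from the data set. It is only one of many imputation strategies, and in the real workflow the imputer should be fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frame with missing entries, for illustration only
toy = pd.DataFrame({
    "Node-caps": ["no", "yes", np.nan, "no"],
    "breast-quad": ["left_low", np.nan, "left_up", "left_low"],
}, dtype=object)

# Replace each NaN with the most frequent value in its column
imputer = SimpleImputer(strategy="most_frequent")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy),
                           columns=toy.columns, index=toy.index)

print(toy_imputed)
```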