In [1]:
#Shubham Tribedi | 1811100002037

## Integer Encoding

Integer encoding consists of replacing the categories with digits from 1 to n (or 0 to n-1, depending on the implementation), where n is the number of distinct categories of the variable.

The numbers are assigned arbitrarily. This encoding method allows for quick benchmarking of machine learning models. 
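
As a minimal sketch of the idea, the mapping can be built in plain Python. The helper name `build_integer_mapping` and the toy list are illustrative assumptions, not part of the assignment; integers are assigned in order of first appearance, which is only one of many possible arbitrary choices.

```python
def build_integer_mapping(values):
    """Assign 0..n-1 to the distinct categories, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


# toy data, not taken from the House Prices dataset
colours = ['red', 'green', 'red', 'blue', 'green']

mapping = build_integer_mapping(colours)
encoded = [mapping[c] for c in colours]

print(mapping)
print(encoded)
```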


### Advantages

- Straightforward to implement
- Does not expand the feature space


### Limitations

- Does not capture any information about the category labels
- Not suitable for linear models

Integer encoding is better suited to non-linear methods, which are able to navigate through the arbitrarily assigned digits to try and find patterns that relate them to the target.
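
The point about linear models can be illustrated with a small sketch. The categories, target values, and the helper `pearson` below are made up for illustration; the sketch only shows that two equally valid arbitrary numberings of the same data give very different linear relationships with the target.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# toy data: one observation per category, with made-up target values
categories = ['A', 'B', 'C']
target = [10, 30, 20]

# two equally valid arbitrary numberings of the same categories
numbering_1 = {'A': 0, 'B': 1, 'C': 2}
numbering_2 = {'A': 0, 'C': 1, 'B': 2}

r1 = pearson([numbering_1[c] for c in categories], target)
r2 = pearson([numbering_2[c] for c in categories], target)

print(round(r1, 2), round(r2, 2))
```

A linear model would see a much weaker linear relationship under the first numbering than under the second, although the data are identical. A tree-based model can isolate each integer with splits, so it is far less sensitive to the numbering.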


## In this assignment:

You have to perform integer encoding with:
- pandas
- Scikit-learn
- Feature-Engine

And discuss the advantages and limitations of each implementation using the House Prices dataset.

In [2]:
import numpy as np
import pandas as pd

# to split the datasets
from sklearn.model_selection import train_test_split

# for integer encoding using sklearn
from sklearn.preprocessing import LabelEncoder

# for integer encoding using feature-engine
from feature_engine.encoding import OrdinalEncoder

In [3]:
# load dataset
data = pd.read_csv(
    'houseprice.csv',
    usecols=['Neighborhood', 'Exterior1st', 'Exterior2nd', 'SalePrice'])

data.head()

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd,SalePrice
0,CollgCr,VinylSd,VinylSd,208500
1,Veenker,MetalSd,MetalSd,181500
2,CollgCr,VinylSd,VinylSd,223500
3,Crawfor,Wd Sdng,Wd Shng,140000
4,NoRidge,VinylSd,VinylSd,250000


In [4]:
# let's have a look at how many labels each variable has
for i in data:
    print(i,':',len(data[i].unique()),'labels')


Neighborhood : 25 labels
Exterior1st : 15 labels
Exterior2nd : 16 labels
SalePrice : 663 labels


In [5]:
# let's explore the unique categories for Neighborhood , Exterior1st, and Exterior2nd
data1 = data['Neighborhood']
data1.unique()

array(['CollgCr', 'Veenker', 'Crawfor', 'NoRidge', 'Mitchel', 'Somerst',
       'NWAmes', 'OldTown', 'BrkSide', 'Sawyer', 'NridgHt', 'NAmes',
       'SawyerW', 'IDOTRR', 'MeadowV', 'Edwards', 'Timber', 'Gilbert',
       'StoneBr', 'ClearCr', 'NPkVill', 'Blmngtn', 'BrDale', 'SWISU',
       'Blueste'], dtype=object)

In [6]:
data2 = data['Exterior1st']
data2.unique()

array(['VinylSd', 'MetalSd', 'Wd Sdng', 'HdBoard', 'BrkFace', 'WdShing',
       'CemntBd', 'Plywood', 'AsbShng', 'Stucco', 'BrkComm', 'AsphShn',
       'Stone', 'ImStucc', 'CBlock'], dtype=object)

In [7]:
data3 = data['Exterior2nd']
data3.unique()

array(['VinylSd', 'MetalSd', 'Wd Shng', 'HdBoard', 'Plywood', 'Wd Sdng',
       'CmentBd', 'BrkFace', 'Stucco', 'AsbShng', 'Brk Cmn', 'ImStucc',
       'AsphShn', 'Stone', 'Other', 'CBlock'], dtype=object)

### Encoding important



In [8]:
# let's separate into training and testing set
x = data[['Neighborhood', 'Exterior1st', 'Exterior2nd']]
y = data['SalePrice']
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0)
print(x_train.shape, x_test.shape)

(1022, 3) (438, 3)


## Integer encoding with pandas


### Advantages

- quick
- returns pandas dataframe

### Limitations of pandas:

- it does not preserve information from train data to propagate to test data

We need to capture and save the mappings one by one, manually, if we are planning to use them in production.
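
One way to capture and reuse a mapping is sketched below on toy data. The frames, the JSON round trip, and the variable names are assumptions made for illustration; the idea is simply to learn the mapping on the train set, store it, and apply the stored mapping to any later data with `Series.map`.

```python
import json

import pandas as pd

# toy data, not taken from the House Prices dataset
train = pd.DataFrame({'Neighborhood': ['CollgCr', 'Veenker', 'CollgCr', 'Crawfor']})
test = pd.DataFrame({'Neighborhood': ['Veenker', 'NoRidge']})

# learn the mapping from the training set only
mapping = {cat: i for i, cat in enumerate(train['Neighborhood'].unique())}

# serialise the mapping so it could be stored and reloaded at prediction time
saved = json.dumps(mapping)
restored = json.loads(saved)

# apply the same mapping to both sets; categories not in the mapping become NaN
train_encoded = train['Neighborhood'].map(restored)
test_encoded = test['Neighborhood'].map(restored)

print(restored)
print(test_encoded.tolist())
```

Categories that appear only in the test set ('NoRidge' here) come out as NaN, so they have to be handled explicitly before modelling.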

In [9]:
# first let's create a dictionary with the mappings of categories to numbers
dic = {category: index for index, category in enumerate(data1.unique())}
dic

{'CollgCr': 0,
 'Veenker': 1,
 'Crawfor': 2,
 'NoRidge': 3,
 'Mitchel': 4,
 'Somerst': 5,
 'NWAmes': 6,
 'OldTown': 7,
 'BrkSide': 8,
 'Sawyer': 9,
 'NridgHt': 10,
 'NAmes': 11,
 'SawyerW': 12,
 'IDOTRR': 13,
 'MeadowV': 14,
 'Edwards': 15,
 'Timber': 16,
 'Gilbert': 17,
 'StoneBr': 18,
 'ClearCr': 19,
 'NPkVill': 20,
 'Blmngtn': 21,
 'BrDale': 22,
 'SWISU': 23,
 'Blueste': 24}

The dictionary indicates which number will replace each category. Numbers were assigned arbitrarily from 0 to n - 1 where n is the number of distinct categories.

In [10]:
# work on a copy so pandas does not raise SettingWithCopyWarning
x_train = x_train.copy()

# replace the labels with the integers
x_train['Neighborhood'] = x_train['Neighborhood'].map(dic)

In [11]:
# let's explore the result

x_train['Neighborhood'].head(10)

64       0
682     19
960      8
1384    15
1100    23
416      9
1034     2
853     11
472     15
1011    15
Name: Neighborhood, dtype: int64

In [12]:
# Exterior1st
dic1 = {category: index for index, category in enumerate(data2.unique())}
# Exterior2nd
dic2 = {category: index for index, category in enumerate(data3.unique())}

In [13]:
# let's see the final result after encoding
x_train['Exterior1st'] = x_train['Exterior1st'].map(dic1)
x_train['Exterior2nd'] = x_train['Exterior2nd'].map(dic2)
x_train.head()


Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd
64,0,0,0
682,19,2,5
960,8,2,4
1384,15,5,2
1100,23,2,5


## Integer Encoding with Scikit-learn

In [14]:
# let's separate into training and testing set
x = data[['Neighborhood', 'Exterior1st', 'Exterior2nd']]
y = data['SalePrice']
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0)
print(x_train.shape, x_test.shape)

(1022, 3) (438, 3)


In [15]:
# encode a copy of the train set, fitting a fresh LabelEncoder on each column
train_transformed = x_train.copy()
for col in ['Neighborhood', 'Exterior1st', 'Exterior2nd']:
    le = LabelEncoder()
    train_transformed[col] = le.fit_transform(x_train[col])

train_transformed.head()

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd
64,5,12,13
682,4,13,14
960,3,13,10
1384,7,14,15
1100,18,13,14


In [16]:
# NOTE: the encoders are refit on the test set here, so the integers below do
# not follow the train-set mapping; to keep both sets consistent, fit on
# x_train only and call transform on x_test
test_transformed = x_test.copy()
for col in ['Neighborhood', 'Exterior1st', 'Exterior2nd']:
    le = LabelEncoder()
    test_transformed[col] = le.fit_transform(x_test[col])

test_transformed.head()

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd
529,5,9,9
491,11,9,12
459,2,5,7
279,3,6,8
655,1,4,6
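
A sketch of how the train-set mapping could be reused on the test set is shown below, on toy data rather than the House Prices split. The `encoders` dictionary is an illustrative assumption: it keeps one fitted `LabelEncoder` per column so that the same integers are applied to both sets. Note that scikit-learn documents `LabelEncoder` as a transformer for target labels, so using it on features is a convenience rather than its intended purpose.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy data, not taken from the House Prices dataset
train = pd.DataFrame({
    'Neighborhood': ['CollgCr', 'Veenker', 'Crawfor'],
    'Exterior1st': ['VinylSd', 'MetalSd', 'VinylSd'],
})
test = pd.DataFrame({
    'Neighborhood': ['Veenker', 'CollgCr'],
    'Exterior1st': ['MetalSd', 'VinylSd'],
})

# one encoder per column, fit on the training set only
encoders = {col: LabelEncoder().fit(train[col]) for col in train.columns}

train_encoded = train.copy()
test_encoded = test.copy()
for col, encoder in encoders.items():
    train_encoded[col] = encoder.transform(train[col])
    # reuse the train mapping; an unseen label would raise ValueError here
    test_encoded[col] = encoder.transform(test[col])

print(test_encoded)
```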


Finally, there is another Scikit-learn transformer, the OrdinalEncoder, to encode multiple variables at the same time. However, this transformer returns a NumPy array without column names, so it is not my favourite implementation. More details here: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html 
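
A minimal sketch of the Scikit-learn `OrdinalEncoder` on toy data follows. The alias `SkOrdinalEncoder` is an assumption introduced only to avoid clashing with the Feature-Engine class of the same name imported earlier in this notebook, and the choice of -1 for unseen categories is arbitrary. The NumPy output is wrapped back into a DataFrame by hand to recover the column names.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder as SkOrdinalEncoder

# toy data, not taken from the House Prices dataset
train = pd.DataFrame({
    'Neighborhood': ['CollgCr', 'Veenker', 'Crawfor'],
    'Exterior1st': ['VinylSd', 'MetalSd', 'VinylSd'],
})
test = pd.DataFrame({
    'Neighborhood': ['Veenker', 'NoRidge'],
    'Exterior1st': ['MetalSd', 'VinylSd'],
})

# categories unseen during fit are encoded as -1 instead of raising an error
encoder = SkOrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(train)

# transform returns a NumPy array, so restore the column names and index
test_encoded = pd.DataFrame(
    encoder.transform(test), columns=test.columns, index=test.index)

print(test_encoded)
```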

## Integer Encoding with Feature-Engine

In [17]:
# let's separate into training and testing set
x = data[['Neighborhood', 'Exterior1st', 'Exterior2nd']]
y = data['SalePrice']
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0)
print(x_train.shape, x_test.shape)

(1022, 3) (438, 3)


In [18]:
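# with the default encoding_method='ordered', the integers follow the mean of
# the target per category; pass encoding_method='arbitrary' for arbitrary integers
Ordinal = OrdinalEncoder()
Ordinal.fit(x_train, y_train)
Ordinal.transform(x_train)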

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd
64,16,11,13
682,17,5,7
960,4,5,8
1384,3,4,4
1100,8,5,7
...,...,...,...
763,24,11,13
835,6,11,9
1216,6,11,13
559,15,11,13


In [19]:
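# NOTE: refitting on the test set changes the mapping; to reuse the train-set
# mapping, skip this fit and call Ordinal.transform(x_test) directly
Ordinal.fit(x_test, y_test)
Ordinal.transform(x_test)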

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd
529,15,5,9
491,7,5,3
459,4,3,4
279,18,7,7
655,1,4,13
...,...,...,...
271,18,7,7
445,5,5,3
654,22,3,4
1280,12,9,10


**Note**

If the argument `variables` is left as None, the encoder will automatically identify all categorical variables. Is that not sweet?

The encoder will not encode numerical variables. So if some of your numerical variables are in fact categories, you will need to re-cast them as object before using the encoder.

Note that if the test set contains a category that was not seen in the train set, the encoder has no number to assign to it. Depending on the Feature-Engine version and configuration, it will then raise an error or return a missing value.
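
Both notes can be checked before calling any encoder. The sketch below uses toy data; the column `MSSubClass` and the helper `unseen_categories` are illustrative assumptions. It re-casts a numerical variable as object and then lists, per column, the test categories that never occur in the train set.

```python
import pandas as pd

# toy data, not taken from the House Prices dataset
train = pd.DataFrame({'Neighborhood': ['CollgCr', 'Veenker'], 'MSSubClass': [20, 60]})
test = pd.DataFrame({'Neighborhood': ['Veenker', 'NoRidge'], 'MSSubClass': [60, 20]})

# re-cast a numerical variable that is really a category
train['MSSubClass'] = train['MSSubClass'].astype('object')
test['MSSubClass'] = test['MSSubClass'].astype('object')


def unseen_categories(train_df, test_df):
    """Return, per column, the test categories that never occur in the train set."""
    unseen = {}
    for col in train_df.columns:
        missing = set(test_df[col].dropna()) - set(train_df[col].dropna())
        if missing:
            unseen[col] = sorted(missing)
    return unseen


report = unseen_categories(train, test)
print(report)
```

An empty report means every test category has a number in the train-set mapping.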