In [1]:
#Shubham Tribedi | 1811100002037

## Count or frequency encoding

In count encoding we replace each category with the number of observations that show that category in the dataset. Similarly, in frequency encoding we replace the category with the fraction (or percentage) of observations that show it. That is, if 10 of our 100 observations show the colour blue, we would replace blue by 10 if doing count encoding, or by 0.1 if doing frequency encoding. These techniques capture how well each label is represented in the dataset, but the encoding is not necessarily predictive of the outcome. They are, however, very popular encoding methods in Kaggle competitions.

The assumption of this technique is that the number of observations shown by each category is somewhat informative of the predictive power of that category.
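
As a small illustration of the colour example above, here is a minimal sketch with pandas. The toy data and the variable names (`colours`, `count_map`, `freq_map`) are made up for this example and are not part of the House Prices dataset.

```python
import pandas as pd

# toy data (hypothetical): 10 blue, 30 red and 60 green observations
colours = pd.Series(['blue'] * 10 + ['red'] * 30 + ['green'] * 60, name='colour')

# count encoding: each label is replaced by its number of observations
count_map = colours.value_counts().to_dict()
count_encoded = colours.map(count_map)

# frequency encoding: each label is replaced by its share of the observations
freq_map = colours.value_counts(normalize=True).to_dict()
freq_encoded = colours.map(freq_map)

print(count_map['blue'], freq_map['blue'])  # prints: 10 0.1
```

Both encodings keep a single column, which is why the feature space does not grow.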


### Advantages

- Simple
- Does not expand the feature space

### Disadvantages

- If 2 different categories appear the same number of times in the dataset, they will be replaced by the same number, so the encoding may lose valuable information.

For example, if there are 10 observations for the category blue and 10 observations for the category red, both will be replaced by 10, and therefore, after the encoding, will appear to be the same thing. 
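
The collision described above can be reproduced with a few lines of pandas. This is only a sketch on made-up data; the names `colours`, `counts` and `collisions` are illustrative.

```python
import pandas as pd

# toy data (hypothetical): blue and red both appear 10 times
colours = pd.Series(['blue'] * 10 + ['red'] * 10 + ['green'] * 5)

counts = colours.value_counts()
encoded = colours.map(counts)

# labels that share their count with at least one other label
collisions = counts[counts.duplicated(keep=False)]

print(sorted(collisions.index))             # the labels that collide
print(colours.nunique(), encoded.nunique())  # distinct values before and after
```

After the encoding there are fewer distinct values than original labels, so blue and red can no longer be told apart.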


Follow this [thread in Kaggle](https://www.kaggle.com/general/16927) for more information.



## In this assignment:

You have to perform count or frequency encoding with:
- pandas
- Feature-Engine

Then compare the advantages and limitations of each implementation using the House Prices dataset.

In [3]:
import numpy as np
import pandas as pd

# to split the datasets
from sklearn.model_selection import train_test_split

# to encode with feature-engine
from feature_engine.encoding import CountFrequencyEncoder

In [4]:
# load dataset

data = pd.read_csv('houseprice.csv',usecols=['Neighborhood', 'Exterior1st', 'Exterior2nd', 'SalePrice'])

data.head()

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd,SalePrice
0,CollgCr,VinylSd,VinylSd,208500
1,Veenker,MetalSd,MetalSd,181500
2,CollgCr,VinylSd,VinylSd,223500
3,Crawfor,Wd Sdng,Wd Shng,140000
4,NoRidge,VinylSd,VinylSd,250000


In [6]:
# let's have a look at how many labels each variable has
for i in data:
    print(i,':',len(data[i].unique()),'labels')

Neighborhood : 25 labels
Exterior1st : 15 labels
Exterior2nd : 16 labels
SalePrice : 663 labels


### Important

When doing count transformation of categorical variables, it is important to calculate the count (or frequency = count / total observations) **over the training set**, and then use those numbers to replace the labels in the test set.
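
A minimal sketch of that workflow, assuming two tiny made-up data frames called `train` and `test` (they are not the real split): the mapping is learned from the training data only and then applied to both sets.

```python
import pandas as pd

# hypothetical miniature train and test sets
train = pd.DataFrame({'Neighborhood': ['NAmes', 'NAmes', 'CollgCr', 'OldTown']})
test = pd.DataFrame({'Neighborhood': ['NAmes', 'CollgCr', 'Veenker']})

# learn the counts over the training set only
count_map = train['Neighborhood'].value_counts().to_dict()

# apply the same numbers to both sets
train_enc = train['Neighborhood'].map(count_map)
test_enc = test['Neighborhood'].map(count_map)

# 'Veenker' was never seen in the train set, so map() leaves it as NaN
print(test_enc.isna().sum())
```

With `map`, a label that is missing from the training data ends up as a missing value in the test set, so it is worth checking for NaN after encoding.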

In [7]:
# let's separate into training and testing set
X_train,X_test,y_train,y_test=train_test_split(data[['Neighborhood','Exterior1st','Exterior2nd']],data['SalePrice'],test_size=0.3)

X_train.shape, X_test.shape

((1022, 3), (438, 3))

## Count and Frequency encoding with pandas

In [8]:
# let's obtain the counts for each one of the labels
# in the variable Neighborhood, sorted from most to least frequent
cat_dict = X_train.groupby('Neighborhood').size().sort_values(ascending=False).to_dict()
cat_dict

{'NAmes': 162,
 'CollgCr': 114,
 'OldTown': 77,
 'Edwards': 65,
 'Somerst': 61,
 'NridgHt': 59,
 'Gilbert': 58,
 'Sawyer': 55,
 'NWAmes': 51,
 'SawyerW': 43,
 'BrkSide': 37,
 'Crawfor': 33,
 'NoRidge': 31,
 'Mitchel': 28,
 'Timber': 25,
 'IDOTRR': 23,
 'StoneBr': 20,
 'ClearCr': 18,
 'SWISU': 15,
 'BrDale': 12,
 'MeadowV': 12,
 'Blmngtn': 9,
 'Veenker': 7,
 'NPkVill': 5,
 'Blueste': 2}

The dictionary contains the number of observations per category in Neighborhood.

In [9]:
# replace the labels with the counts
X_train['Neighborhood'] = X_train['Neighborhood'].map(cat_dict)

In [10]:
# let's explore the result
X_train['Neighborhood'].head(10)

710      37
72       58
692      25
798      59
1023      9
528      65
341      43
1207    114
891      55
337     114
Name: Neighborhood, dtype: int64

In [11]:
X_train

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd
710,37,VinylSd,VinylSd
72,58,VinylSd,VinylSd
692,25,MetalSd,MetalSd
798,59,VinylSd,VinylSd
1023,9,VinylSd,VinylSd
...,...,...,...
1245,51,VinylSd,VinylSd
1187,31,ImStucc,ImStucc
1182,31,Wd Sdng,ImStucc
587,55,HdBoard,HdBoard


In [12]:
# if instead of the count we would like the frequency
# we need only divide the count by the number of observations in the train set
# (the counts were learned from X_train, so the denominator must be len(X_train));
# we build a new dictionary so that cat_dict keeps the counts

frequency_map3 = {k: int(v) / len(X_train) for k, v in cat_dict.items()}

# display the frequencies rounded to 4 decimals
{k: round(v, 4) for k, v in frequency_map3.items()}

{'NAmes': 0.1585,
 'CollgCr': 0.1115,
 'OldTown': 0.0753,
 'Edwards': 0.0636,
 'Somerst': 0.0597,
 'NridgHt': 0.0577,
 'Gilbert': 0.0568,
 'Sawyer': 0.0538,
 'NWAmes': 0.0499,
 'SawyerW': 0.0421,
 'BrkSide': 0.0362,
 'Crawfor': 0.0323,
 'NoRidge': 0.0303,
 'Mitchel': 0.0274,
 'Timber': 0.0245,
 'IDOTRR': 0.0225,
 'StoneBr': 0.0196,
 'ClearCr': 0.0176,
 'SWISU': 0.0147,
 'BrDale': 0.0117,
 'MeadowV': 0.0117,
 'Blmngtn': 0.0088,
 'Veenker': 0.0068,
 'NPkVill': 0.0049,
 'Blueste': 0.002}

In [13]:
X_train

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd
710,37,VinylSd,VinylSd
72,58,VinylSd,VinylSd
692,25,MetalSd,MetalSd
798,59,VinylSd,VinylSd
1023,9,VinylSd,VinylSd
...,...,...,...
1245,51,VinylSd,VinylSd
1187,31,ImStucc,ImStucc
1182,31,Wd Sdng,ImStucc
587,55,HdBoard,HdBoard


In [14]:
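# the Neighborhood column already holds the counts (see the table above), so a
# replace keyed on the original labels would leave it unchanged; dividing the
# counts by the number of training observations gives the frequencies
X_train['Neighborhood'] = X_train['Neighborhood'] / len(X_train)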
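
The manual steps above can be wrapped into two small functions so that several columns are encoded in one go. This is only a sketch: `fit_frequency_maps` and `apply_maps` are hypothetical helper names, and the data frame is a made-up miniature, not the real training set.

```python
import pandas as pd

def fit_frequency_maps(df, columns):
    """Learn a {label: frequency} dictionary for each column (hypothetical helper)."""
    return {col: df[col].value_counts(normalize=True).to_dict() for col in columns}

def apply_maps(df, maps):
    """Return a copy of df with each mapped column replaced by its numbers."""
    out = df.copy()
    for col, mapping in maps.items():
        out[col] = out[col].map(mapping)
    return out

# hypothetical miniature training set
train = pd.DataFrame({
    'Neighborhood': ['NAmes', 'NAmes', 'CollgCr', 'OldTown'],
    'Exterior1st': ['VinylSd', 'MetalSd', 'VinylSd', 'VinylSd'],
})

maps = fit_frequency_maps(train, ['Neighborhood', 'Exterior1st'])
train_encoded = apply_maps(train, maps)
print(train_encoded)
```

Because the maps are kept in a separate object, the same `maps` can later be passed to `apply_maps` together with the test set, and the original data frame is left untouched.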



## Count or Frequency Encoding with Feature-Engine

In [15]:
# let's separate into training and testing set
X_train,X_test,y_train,y_test=train_test_split(data[['Neighborhood','Exterior1st','Exterior2nd']],data['SalePrice'],test_size=0.3)

X_train.shape, X_test.shape

((1022, 3), (438, 3))

In [16]:
# set up the encoder, learn the counts from the train set and encode it
encoder = CountFrequencyEncoder(encoding_method='count')
X_train=encoder.fit_transform(X_train)
X_train.head()

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd
774,57,358,354
1449,13,45,44
493,165,39,139
1196,59,358,354
1140,165,154,145


**Note**

If the argument `variables` is left to None, the encoder will automatically identify all categorical variables. Is that not sweet?

By default, the encoder will not encode numerical variables. So if some of your numerical variables are in fact categories, you will need to re-cast them as object before using the encoder.

Note that if the test set contains a category for which the encoder doesn't have a number to assign (the category was not seen in the train set), the encoder will return an error or, depending on the Feature-engine version and its `unseen` setting, a missing value.
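
Both points can be checked with plain pandas before the encoder is called. The sketch below uses made-up data; `MSSubClass` stands in here as an assumed example of a numerical column that really holds categories, and `unseen` is just an illustrative variable name.

```python
import pandas as pd

# hypothetical miniature train and test sets
train = pd.DataFrame({'Neighborhood': ['NAmes', 'CollgCr'], 'MSSubClass': [20, 60]})
test = pd.DataFrame({'Neighborhood': ['NAmes', 'Blueste'], 'MSSubClass': [20, 20]})

# re-cast the numerical column as object so that it is treated as categorical
train['MSSubClass'] = train['MSSubClass'].astype('object')
test['MSSubClass'] = test['MSSubClass'].astype('object')

# columns we intend to encode
categorical = ['Neighborhood', 'MSSubClass']

# categories present in the test set but absent from the train set
unseen = {col: sorted(set(test[col]) - set(train[col])) for col in categorical}
print(unseen)
```

Any non-empty list in `unseen` points to labels that the encoder has no number for, so they need to be handled (for example grouped or removed) before transforming the test set.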