## Count or frequency encoding

In count encoding, we replace each category with the number of observations that show that category in the dataset. Similarly, we can replace the category with its frequency, that is, the fraction of observations in the dataset that show it. For example, if 10 of our 100 observations show the colour blue, we would replace blue with 10 when doing count encoding, or with 0.1 when doing frequency encoding. These techniques capture how well each label is represented in the dataset, but the encoding is not necessarily predictive of the outcome. They are, however, very popular encoding methods in Kaggle competitions.
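
The arithmetic above can be sketched on a toy column. This is only an illustration: the `colour` Series and its values are made up (100 observations, 10 of them blue) and are not part of the student dataset used later in this notebook.

```python
import pandas as pd

# hypothetical toy data: 100 observations, 10 of them 'blue'
colour = pd.Series(['blue'] * 10 + ['red'] * 30 + ['green'] * 60)

# count encoding: number of observations per category
count_map = colour.value_counts().to_dict()

# frequency encoding: fraction of observations per category
freq_map = colour.value_counts(normalize=True).to_dict()

# replace each label with its count or its frequency
colour_count = colour.map(count_map)
colour_freq = colour.map(freq_map)

print(count_map['blue'], freq_map['blue'])
```

Either mapping yields a single numerical column, which is why the feature space does not grow.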

The assumption behind this technique is that the number of observations shown by each category is somewhat informative of the category's predictive power.


### Advantages

- Simple
- Does not expand the feature space

### Disadvantages

- If 2 different categories appear the same number of times in the dataset, they will be replaced by the same number, so valuable information may be lost.

For example, if there are 10 observations for the category blue and 10 observations for the category red, both will be replaced by 10, and therefore, after the encoding, will appear to be the same thing. 
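
The collision described above can be reproduced with a small sketch. The data are hypothetical and chosen so that blue and red appear equally often.

```python
import pandas as pd

# hypothetical toy data: 'blue' and 'red' both appear 10 times
colour = pd.Series(['blue'] * 10 + ['red'] * 10 + ['green'] * 80)

count_map = colour.value_counts().to_dict()
encoded = colour.map(count_map)

# compare the number of distinct values before and after the encoding
n_labels_before = colour.nunique()
n_labels_after = encoded.nunique()
print(n_labels_before, n_labels_after)
```

After the encoding, a model can no longer tell blue observations from red ones using this column alone.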


Follow this [thread in Kaggle](https://www.kaggle.com/general/16927) for more information.



## In this demo:

We will see how to perform count or frequency encoding with:
- pandas
- Feature-Engine


In [1]:
import numpy as np
import pandas as pd

# to split the datasets
from sklearn.model_selection import train_test_split

# to encode with feature-engine
from feature_engine.categorical_encoders import CountFrequencyCategoricalEncoder

# Dataset
These data address student achievement in secondary education at two Portuguese schools.
The attributes include student grades and demographic, social and school-related features,
and they were collected using school reports and questionnaires.
The target, G3, is the final year grade (issued at the 3rd period).


In [2]:
columns = [ 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities',
             'Dalc', 'Walc' , 'health', 'G3']
# raw string so that the backslashes in the Windows path are not treated as escapes
data_raw = pd.read_csv(r'C:\Users\gusal\machine learning\Feature engineering\student-por.csv', delimiter=';', usecols=columns)

In [3]:
data_raw.head(5)

Unnamed: 0,Mjob,Fjob,reason,guardian,schoolsup,famsup,paid,activities,Dalc,Walc,health,G3
0,at_home,teacher,course,mother,yes,no,no,no,1,1,3,11
1,at_home,other,course,father,no,yes,no,no,1,1,3,11
2,at_home,other,other,mother,yes,no,no,no,2,3,3,12
3,health,services,home,mother,no,yes,no,yes,1,1,5,14
4,other,other,home,father,no,yes,no,no,1,2,5,13


In [4]:
data_raw.dtypes

Mjob          object
Fjob          object
reason        object
guardian      object
schoolsup     object
famsup        object
paid          object
activities    object
Dalc           int64
Walc           int64
health         int64
G3             int64
dtype: object

'Dalc', 'Walc' and 'health' are categorical variables, but pandas reads them as numerical, so they have to be re-cast as object.

In [5]:
data_raw.Dalc = data_raw.Dalc.astype(object)
data_raw.Walc = data_raw.Walc.astype(object)
data_raw.health = data_raw.health.astype(object)


In [6]:
data_raw.dtypes

Mjob          object
Fjob          object
reason        object
guardian      object
schoolsup     object
famsup        object
paid          object
activities    object
Dalc          object
Walc          object
health        object
G3             int64
dtype: object

In [7]:
# let's have a look at how many labels each variable has

for col in data_raw.columns:
    print(col, ': ', len(data_raw[col].unique()), ' labels')

Mjob :  5  labels
Fjob :  5  labels
reason :  4  labels
guardian :  3  labels
schoolsup :  2  labels
famsup :  2  labels
paid :  2  labels
activities :  2  labels
Dalc :  5  labels
Walc :  5  labels
health :  5  labels
G3 :  17  labels


# Counting the observations per label

### Important

When doing count transformation of categorical variables, it is important to calculate the count (or frequency = count / total observations) **over the training set**, and then use those numbers to replace the labels in the test set.
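
A minimal sketch of this rule, assuming a hypothetical toy split in which the label 'purple' occurs only in the test set. The mapping is learned from the train set and then applied unchanged to both sets.

```python
import pandas as pd

# hypothetical toy split; 'purple' occurs only in the test set
train = pd.DataFrame({'colour': ['blue', 'blue', 'red', 'green', 'green', 'green']})
test = pd.DataFrame({'colour': ['blue', 'green', 'purple']})

# learn the mapping from the training set only
count_map = train['colour'].value_counts().to_dict()

# apply the same mapping to both sets
train_enc = train['colour'].map(count_map)
test_enc = test['colour'].map(count_map)

# labels that were not seen during training have no count and become NaN
n_missing = test_enc.isnull().sum()
print(n_missing)
```

With plain pandas, unseen labels silently turn into missing values, so it is worth checking for NaN after encoding the test set.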

In [8]:
inputs = data_raw.drop(['G3'], axis = 1)
target = data_raw.G3

In [9]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    inputs, # predictors
    target,  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((454, 11), (195, 11))

In [11]:

for col in X_train.columns:
    print(col, ':\n ', X_train[col].value_counts().sort_values(ascending=False),'\n ')

Mjob :
  other       180
services     99
at_home      91
teacher      49
health       35
Name: Mjob, dtype: int64 
 
Fjob :
  other       253
services    126
at_home      32
teacher      26
health       17
Name: Fjob, dtype: int64 
 
reason :
  course        203
home          103
reputation     99
other          49
Name: reason, dtype: int64 
 
guardian :
  mother    326
father    103
other      25
Name: guardian, dtype: int64 
 
schoolsup :
  no     405
yes     49
Name: schoolsup, dtype: int64 
 
famsup :
  yes    282
no     172
Name: famsup, dtype: int64 
 
paid :
  no     430
yes     24
Name: paid, dtype: int64 
 
activities :
  no     230
yes    224
Name: activities, dtype: int64 
 
Dalc :
  1    314
2     85
3     32
5     12
4     11
Name: Dalc, dtype: int64 
 
Walc :
  1    171
2    105
3     84
4     65
5     29
Name: Walc, dtype: int64 
 
health :
  5    179
3     85
4     80
1     62
2     48
Name: health, dtype: int64 
 


# replace the labels with the counts


In [12]:

# let's obtain the counts for each one of the labels
# in each variable, using the train set only
X_train, X_test = X_train.copy(), X_test.copy()  # avoids SettingWithCopyWarning
for col in inputs.columns:
    count_map = X_train[col].value_counts().to_dict()
    X_train[col] = X_train[col].map(count_map)
    X_test[col] = X_test[col].map(count_map)



In [14]:
X_train.head()

Unnamed: 0,Mjob,Fjob,reason,guardian,schoolsup,famsup,paid,activities,Dalc,Walc,health
561,91,32,203,326,405,282,430,224,314,65,179
452,180,253,203,326,405,282,430,230,314,171,80
89,49,17,99,326,405,282,430,230,32,29,179
299,180,253,203,25,405,282,24,224,314,105,48
231,99,17,103,103,405,282,430,230,85,65,80


# if instead of the count we would like the frequency
# we need only divide the count by the total number of observations
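
The division by the number of observations can be sketched as follows. The toy train and test frames are hypothetical; the same pattern would apply column by column to the split used in this notebook.

```python
import pandas as pd

# hypothetical toy split
train = pd.DataFrame({'reason': ['course', 'course', 'home', 'other']})
test = pd.DataFrame({'reason': ['home', 'course']})

# frequency = count / total number of observations in the training set
freq_map = (train['reason'].value_counts() / len(train)).to_dict()

# replace the labels with the frequencies learned on the train set
train_freq = train['reason'].map(freq_map)
test_freq = test['reason'].map(freq_map)

print(freq_map)
```

Note that the denominator is the size of the train set, not of the test set, in line with the rule stated earlier.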


## Count or Frequency Encoding with Feature-Engine

In [16]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    inputs, # predictors
    target,  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((454, 11), (195, 11))

In [17]:
count_enc = CountFrequencyCategoricalEncoder(
    encoding_method='count', # to do frequency ==> encoding_method='frequency'
    variables=None)

count_enc.fit(X_train)

CountFrequencyCategoricalEncoder(encoding_method='count',
                                 variables=['Mjob', 'Fjob', 'reason',
                                            'guardian', 'schoolsup', 'famsup',
                                            'paid', 'activities', 'Dalc',
                                            'Walc', 'health'])

In [18]:
# in the encoder dict we can observe the number of 
# observations per category for each variable

count_enc.encoder_dict_

{'Mjob': {'other': 180,
  'services': 99,
  'at_home': 91,
  'teacher': 49,
  'health': 35},
 'Fjob': {'other': 253,
  'services': 126,
  'at_home': 32,
  'teacher': 26,
  'health': 17},
 'reason': {'course': 203, 'home': 103, 'reputation': 99, 'other': 49},
 'guardian': {'mother': 326, 'father': 103, 'other': 25},
 'schoolsup': {'no': 405, 'yes': 49},
 'famsup': {'yes': 282, 'no': 172},
 'paid': {'no': 430, 'yes': 24},
 'activities': {'no': 230, 'yes': 224},
 'Dalc': {1: 314, 2: 85, 3: 32, 5: 12, 4: 11},
 'Walc': {1: 171, 2: 105, 3: 84, 4: 65, 5: 29},
 'health': {5: 179, 3: 85, 4: 80, 1: 62, 2: 48}}

In [19]:
X_train = count_enc.transform(X_train)
X_test = count_enc.transform(X_test)

# let's explore the result
X_train.head()

Unnamed: 0,Mjob,Fjob,reason,guardian,schoolsup,famsup,paid,activities,Dalc,Walc,health
561,91,32,203,326,405,282,430,224,314,65,179
452,180,253,203,326,405,282,430,230,314,171,80
89,49,17,99,326,405,282,430,230,32,29,179
299,180,253,203,25,405,282,24,224,314,105,48
231,99,17,103,103,405,282,430,230,85,65,80


**Note**

If the variables argument is left as None, the encoder will automatically identify all categorical variables. Is that not sweet?

The encoder will not encode numerical variables. So if some of your numerical variables are in fact categories, you will need to re-cast them as object before using the encoder.

Note that if the test set contains a category for which the encoder has no number to assign (the category was not seen in the train set), the encoder will return an error.
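
One way to anticipate that error is to compare the labels of both sets before transforming. The sketch below uses plain pandas; `find_unseen_labels` is a hypothetical helper written for this illustration, not part of Feature-Engine, and the toy frames are made up.

```python
import pandas as pd

def find_unseen_labels(train, test):
    """Return, per column, the labels present in test but absent from train."""
    unseen = {}
    for col in train.columns:
        diff = set(test[col].unique()) - set(train[col].unique())
        if diff:
            unseen[col] = diff
    return unseen

# hypothetical toy split; 'other' appears only in the test set
train = pd.DataFrame({'guardian': ['mother', 'father', 'mother'],
                      'paid': ['no', 'yes', 'no']})
test = pd.DataFrame({'guardian': ['mother', 'other'],
                     'paid': ['no', 'no']})

unseen = find_unseen_labels(train, test)
print(unseen)
```

An empty dictionary would suggest that every test label has a count available from the train set.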