## Count or Frequency Encoding
When a categorical variable has many distinct labels (high cardinality), one-hot encoding expands the feature space dramatically.

Count or frequency encoding instead replaces each label of the categorical variable with a single number: either the count, which is the number of times the label appears in the dataset, or the frequency, which is the proportion of observations that carry that label.
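
To make the difference between the two variants concrete, here is a minimal sketch on a hypothetical toy column (the `colors` data and the column names are assumptions for illustration, not part of `train.csv`). It relies on `value_counts()` for counts and `value_counts(normalize=True)` for proportions.

```python
import pandas as pd

# Hypothetical toy column, not taken from train.csv
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"], name="color")

# Count encoding: each label -> number of rows carrying that label
count_map = colors.value_counts().to_dict()

# Frequency encoding: each label -> proportion of rows carrying that label
freq_map = colors.value_counts(normalize=True).to_dict()

encoded = pd.DataFrame({
    "color": colors,
    "color_count": colors.map(count_map),
    "color_freq": colors.map(freq_map),
})
print(encoded)
```

Both variants keep a single column per variable; the frequency version is simply the count divided by the number of rows.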

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('train.csv',usecols=['X1','X2'])
data.head()

Unnamed: 0,X1,X2
0,v,at
1,t,av
2,w,n
3,t,n
4,v,n


In [3]:
# Shape of the dataset after one-hot encoding both columns
pd.get_dummies(data).shape

(4209, 71)

In [4]:
# Let's have a look at how many distinct labels each column has

for cols in data.columns:
    print(cols," : ",len(data[cols].unique()),'labels')

X1  :  27 labels
X2  :  44 labels
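
The 71 one-hot columns reported earlier are simply 27 + 44, one column per distinct label. As a rough sketch of that relationship, the hypothetical frame `toy` below (an assumption standing in for the X1/X2 columns) compares the one-hot width with the width after count encoding.

```python
import pandas as pd

# Hypothetical toy frame standing in for the X1/X2 columns
toy = pd.DataFrame({
    "A": ["a", "b", "c", "a"],
    "B": ["x", "y", "x", "z"],
})

# Distinct labels per column
n_labels = toy.nunique()

# One-hot encoding creates one column per distinct label
one_hot_width = pd.get_dummies(toy).shape[1]

# Count encoding keeps exactly one column per original variable
count_encoded = toy.apply(lambda col: col.map(col.value_counts()))

print(n_labels.sum(), one_hot_width, count_encoded.shape[1])
```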


In [5]:
# Let's obtain the counts or frequency for each one of the labels in variable X2
# Let's capture this in a dictionary that we can use to re-map the labels

data.X2.value_counts().to_dict()

{'as': 1659,
 'ae': 496,
 'ai': 415,
 'm': 367,
 'ak': 265,
 'r': 153,
 'n': 137,
 's': 94,
 'f': 87,
 'e': 81,
 'aq': 63,
 'ay': 54,
 'a': 47,
 't': 29,
 'k': 25,
 'i': 25,
 'b': 21,
 'ao': 20,
 'ag': 19,
 'z': 19,
 'd': 18,
 'ac': 13,
 'g': 12,
 'y': 11,
 'ap': 11,
 'x': 10,
 'aw': 8,
 'h': 6,
 'at': 6,
 'al': 5,
 'q': 5,
 'an': 5,
 'av': 4,
 'ah': 4,
 'p': 4,
 'au': 3,
 'am': 1,
 'l': 1,
 'j': 1,
 'af': 1,
 'ar': 1,
 'o': 1,
 'aa': 1,
 'c': 1}

In [6]:
# Now let's replace each label with its frequency in X2 Column
# First we make a dictionary that maps each label to its frequency or count

data_freq_map = data.X2.value_counts().to_dict()

In [7]:
# now replace X2 labels in the dataset with their frequency or count
data.X2 = data.X2.map(data_freq_map)
data.head()

Unnamed: 0,X1,X2
0,v,6
1,t,4
2,w,137
3,t,137
4,v,137


In [8]:
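# Similarly, we can do the same for the X1 column.
# Wrap the two steps in a helper so they can be applied to any column.

def count_freq_encoding(data, col):
    """Replace the labels of `col` with their counts, modifying `data` in place."""
    data_freq_map = data[col].value_counts().to_dict()
    data[col] = data[col].map(data_freq_map)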

In [9]:
data = pd.read_csv('train.csv',usecols=['X1','X2'])
for cols in data.columns:
    count_freq_encoding(data,cols)

In [10]:
data.head()

Unnamed: 0,X1,X2
0,408,6
1,31,4
2,52,137
3,31,137
4,408,137
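
The cells above compute and apply the mapping on the same file. If the mapping were reused on a separate split, a label missing from the mapping would come out of `map` as `NaN`. The following is a minimal sketch of one way to handle that, under assumptions: the helpers `fit_count_map` and `apply_count_map`, the toy `train`/`test` series, and the choice of 0 for unseen labels are all hypothetical and not part of the original notebook.

```python
import pandas as pd

def fit_count_map(series):
    """Learn a label -> count mapping from one column (hypothetical helper)."""
    return series.value_counts().to_dict()

def apply_count_map(series, count_map, unseen_value=0):
    """Encode a column with a learned mapping; unseen labels get `unseen_value`."""
    return series.map(count_map).fillna(unseen_value).astype(int)

# Hypothetical train/test split, not taken from train.csv
train = pd.Series(["as", "ae", "as", "m", "as"])
test = pd.Series(["as", "m", "zz"])  # "zz" never appears in train

x2_map = fit_count_map(train)
train_encoded = apply_count_map(train, x2_map)
test_encoded = apply_count_map(test, x2_map)

print(train_encoded.tolist())
print(test_encoded.tolist())
```

Using 0 for unseen labels is only one option; whether it is appropriate depends on the model and the data.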


Count or frequency encoding has some advantages and disadvantages.

### Advantages:
1. It is very simple to implement.
2. It does not increase the dimensionality of the feature space.

### Disadvantages:
1. Labels that occur the same number of times are replaced with the same value, so they can no longer be told apart and valuable information may be lost.
2. It assigns somewhat arbitrary numbers, and therefore weights, to the different labels, and these may not be related to their predictive power.
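
The first disadvantage already shows up in the X2 counts printed earlier: `k` and `i` both occur 25 times, and `y` and `ap` both occur 11 times. A small sketch for spotting such collisions, using a handful of counts copied from that output (the variable names are assumptions for illustration):

```python
from collections import defaultdict

# A few label counts copied from the X2 value_counts output
x2_counts = {"as": 1659, "k": 25, "i": 25, "y": 11, "ap": 11, "x": 10}

# Group labels by their count
labels_by_count = defaultdict(list)
for label, count in x2_counts.items():
    labels_by_count[count].append(label)

# Any count shared by more than one label is a collision after encoding
collisions = {
    count: labels
    for count, labels in labels_by_count.items()
    if len(labels) > 1
}
print(collisions)
```

After encoding, the labels in each colliding group become indistinguishable to the model.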

### Example 2

In [11]:
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None, index_col=None)
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [14]:
columns =[1,3,5,6,7,8,9,13]

In [17]:
data = data[columns]

In [18]:
# Let's assign our own column names, because the dataset does not include a header row
data.columns = ['Employment', 'Degree','Status','Designation','Family_job','Race','Sex','Country']

In [19]:
data.head()

Unnamed: 0,Employment,Degree,Status,Designation,Family_job,Race,Sex,Country
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba


In [20]:
data.Employment.value_counts()

 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: Employment, dtype: int64

In [22]:
for feature in data.columns:
    print(feature," : ", len(data[feature].unique())," labels")

Employment  :  9  labels
Degree  :  16  labels
Status  :  7  labels
Designation  :  15  labels
Family_job  :  6  labels
Race  :  5  labels
Sex  :  2  labels
Country  :  42  labels


In [24]:
# let's do count/frequency encoding for country column
data['Country'].value_counts()

 United-States                 29170
 Mexico                          643
 ?                               583
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
 India                           100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 

In [38]:
Country_freq = data['Country'].value_counts().to_dict()

In [40]:
# Copy first: `data` is a column selection, and assigning into it raises SettingWithCopyWarning
data = data.copy()
data['Country'] = data['Country'].map(Country_freq)


In [41]:
data.head()

Unnamed: 0,Employment,Degree,Status,Designation,Family_job,Race,Sex,Country
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,29170
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,29170
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,29170
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,95
