### Feature_Engineering_Encoding

Topics covered:

1. Creating binary variables through one-hot encoding
2. Performing one-hot encoding of frequent categories
3. Replacing categories with ordinal numbers
4. Encoding with integers in an ordered manner
5. Encoding with the mean of the target
6. Encoding with the Weight of Evidence
7. Grouping rare or infrequent categories
8. Performing binary encoding
9. Performing feature hashing



### 1. Creating binary variables through one-hot encoding
In one-hot encoding, we represent a categorical variable as a group of binary variables,
where each binary variable represents one category. The binary variable indicates whether
the category is present in an observation (1) or not (0).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.preprocessing import OneHotEncoder

In [2]:
df = pd.read_csv(r'C:\Users\HP\Documents\EDA\EDA-BySunny\EDABySunny\Dataset\data1\Travel.csv')
df

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4883,204883,1,49.0,Self Enquiry,3,9.0,Small Business,Male,3,5.0,Deluxe,4.0,Unmarried,2.0,1,1,1,1.0,Manager,26576.0
4884,204884,1,28.0,Company Invited,1,31.0,Salaried,Male,4,5.0,Basic,3.0,Single,3.0,1,3,1,2.0,Executive,21212.0
4885,204885,1,52.0,Self Enquiry,3,17.0,Salaried,Female,4,4.0,Standard,4.0,Married,7.0,0,1,1,3.0,Senior Manager,31820.0
4886,204886,1,19.0,Self Enquiry,3,16.0,Small Business,Male,3,4.0,Basic,3.0,Single,3.0,0,5,0,2.0,Executive,20289.0


In [3]:
df.isnull().mean()

CustomerID                  0.000000
ProdTaken                   0.000000
Age                         0.046236
TypeofContact               0.005115
CityTier                    0.000000
DurationOfPitch             0.051350
Occupation                  0.000000
Gender                      0.000000
NumberOfPersonVisiting      0.000000
NumberOfFollowups           0.009206
ProductPitched              0.000000
PreferredPropertyStar       0.005319
MaritalStatus               0.000000
NumberOfTrips               0.028642
Passport                    0.000000
PitchSatisfactionScore      0.000000
OwnCar                      0.000000
NumberOfChildrenVisiting    0.013502
Designation                 0.000000
MonthlyIncome               0.047668
dtype: float64

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CustomerID                4888 non-null   int64  
 1   ProdTaken                 4888 non-null   int64  
 2   Age                       4662 non-null   float64
 3   TypeofContact             4863 non-null   object 
 4   CityTier                  4888 non-null   int64  
 5   DurationOfPitch           4637 non-null   float64
 6   Occupation                4888 non-null   object 
 7   Gender                    4888 non-null   object 
 8   NumberOfPersonVisiting    4888 non-null   int64  
 9   NumberOfFollowups         4843 non-null   float64
 10  ProductPitched            4888 non-null   object 
 11  PreferredPropertyStar     4862 non-null   float64
 12  MaritalStatus             4888 non-null   object 
 13  NumberOfTrips             4748 non-null   float64
 14  Passport

In [5]:
num_fet = [x for x in df.columns if df[x].dtype != 'O']
cat_fet = [x for x in df.columns if df[x].dtype == 'O']
print(num_fet,cat_fet)

['CustomerID', 'ProdTaken', 'Age', 'CityTier', 'DurationOfPitch', 'NumberOfPersonVisiting', 'NumberOfFollowups', 'PreferredPropertyStar', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar', 'NumberOfChildrenVisiting', 'MonthlyIncome'] ['TypeofContact', 'Occupation', 'Gender', 'ProductPitched', 'MaritalStatus', 'Designation']


### Encoding using get_dummies

In [6]:
import copy

In [7]:
df1 = copy.deepcopy(df)
df1

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4883,204883,1,49.0,Self Enquiry,3,9.0,Small Business,Male,3,5.0,Deluxe,4.0,Unmarried,2.0,1,1,1,1.0,Manager,26576.0
4884,204884,1,28.0,Company Invited,1,31.0,Salaried,Male,4,5.0,Basic,3.0,Single,3.0,1,3,1,2.0,Executive,21212.0
4885,204885,1,52.0,Self Enquiry,3,17.0,Salaried,Female,4,4.0,Standard,4.0,Married,7.0,0,1,1,3.0,Senior Manager,31820.0
4886,204886,1,19.0,Self Enquiry,3,16.0,Small Business,Male,3,4.0,Basic,3.0,Single,3.0,0,5,0,2.0,Executive,20289.0


In [8]:
cat_fet

['TypeofContact',
 'Occupation',
 'Gender',
 'ProductPitched',
 'MaritalStatus',
 'Designation']

In [9]:
g_d = ['Gender', 'MaritalStatus']
df1['Gender'].unique()

array(['Female', 'Male', 'Fe Male'], dtype=object)

In [10]:
df1['MaritalStatus'].unique()

array(['Single', 'Divorced', 'Married', 'Unmarried'], dtype=object)

In [11]:
df1_enc = pd.get_dummies(df1[g_d],drop_first=True)
df1_enc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Gender_Female            4888 non-null   uint8
 1   Gender_Male              4888 non-null   uint8
 2   MaritalStatus_Married    4888 non-null   uint8
 3   MaritalStatus_Single     4888 non-null   uint8
 4   MaritalStatus_Unmarried  4888 non-null   uint8
dtypes: uint8(5)
memory usage: 24.0 KB


In [12]:
df1_enc

Unnamed: 0,Gender_Female,Gender_Male,MaritalStatus_Married,MaritalStatus_Single,MaritalStatus_Unmarried
0,1,0,0,1,0
1,0,1,0,0,0
2,0,1,0,1,0
3,1,0,0,0,0
4,0,1,0,0,0
...,...,...,...,...,...
4883,0,1,0,0,1
4884,0,1,0,1,0
4885,1,0,1,0,0
4886,0,1,0,1,0


In [13]:
df1_enc1 = pd.get_dummies(df1,drop_first=True)
df1_enc1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 31 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   CustomerID                   4888 non-null   int64  
 1   ProdTaken                    4888 non-null   int64  
 2   Age                          4662 non-null   float64
 3   CityTier                     4888 non-null   int64  
 4   DurationOfPitch              4637 non-null   float64
 5   NumberOfPersonVisiting       4888 non-null   int64  
 6   NumberOfFollowups            4843 non-null   float64
 7   PreferredPropertyStar        4862 non-null   float64
 8   NumberOfTrips                4748 non-null   float64
 9   Passport                     4888 non-null   int64  
 10  PitchSatisfactionScore       4888 non-null   int64  
 11  OwnCar                       4888 non-null   int64  
 12  NumberOfChildrenVisiting     4822 non-null   float64
 13  MonthlyIncome     

In [14]:
encoder = OneHotEncoder(categories='auto', drop='first', sparse_output=False)  # use sparse=False on scikit-learn < 1.2
#encoder.fit(X_train[vars_categorical])
#X_train_enc = encoder.transform(X_train[vars_categorical])
#X_test_enc = encoder.transform(X_test[vars_categorical])
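The fit-on-train, transform-on-test pattern in the commented lines can also be approximated with pandas alone. The sketch below is illustrative only: `X_train` and `X_test` are hypothetical toy frames, not the Travel dataset, and no category is dropped here, unlike with `drop='first'`.

```python
import pandas as pd

# Hypothetical toy data standing in for a train/test split.
X_train = pd.DataFrame({"Gender": ["Female", "Male", "Male", "Female"]})
X_test = pd.DataFrame({"Gender": ["Male", "Other", "Female"]})

# Learn the dummy columns from the training set only.
train_enc = pd.get_dummies(X_train, columns=["Gender"], dtype=int)

# Encode the test set, then align it to the training columns:
# categories unseen in training are dropped, missing columns are filled with 0.
test_enc = pd.get_dummies(X_test, columns=["Gender"], dtype=int)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc)
```

Aligning with `reindex` keeps the test set's columns identical to the training set's, so a category that appears only in the test set ends up as a row of zeros instead of a new column.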


In [15]:
#from feature_engine.categorical_encoders import OneHotCategoricalEncoder
#ohe_enc = OneHotCategoricalEncoder(top_categories=None,drop_last=True)
#fit and transform

### Performing one-hot encoding of frequent categories


One-hot encoding represents each category of a categorical variable with a binary variable.
Hence, one-hot encoding of high-cardinality variables, or of datasets with many categorical
features, can expand the feature space dramatically. To reduce the number of binary
variables, we can one-hot encode only the most frequent categories. One-hot encoding of the
top categories is equivalent to treating the remaining, less frequent categories as a
single category.
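As a rough illustration of encoding only the most frequent categories, the following sketch uses plain pandas. The toy column and its values are hypothetical, and the choice of two top categories is arbitrary.

```python
import pandas as pd

# Hypothetical toy column; the values are made up for illustration.
X = pd.DataFrame({
    "Occupation": ["Salaried"] * 5 + ["Small Business"] * 3
                  + ["Large Business"] * 2 + ["Free Lancer"]
})

# Keep the 2 most frequent categories (value_counts sorts in descending order).
top_2 = X["Occupation"].value_counts().head(2).index.tolist()

# One binary column per frequent category; every other category gets 0 in all of them.
for cat in top_2:
    X["Occupation_" + cat] = (X["Occupation"] == cat).astype(int)

print(top_2)
print(X.drop(columns="Occupation").sum())
```

Rows belonging to the remaining categories are all zeros across the new columns, which is how they end up being treated as one group.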

In [16]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
#from feature_engine.categorical_encoders import
#OneHotCategoricalEncoder

In [17]:
#X_train['A6'].value_counts().sort_values(ascending=False).head(5)

In [18]:
#top_5 = [cat for cat in X_train['A6'].value_counts().sort_values(ascending=False).head(5).index]

In [19]:
#ohe_enc = OneHotCategoricalEncoder(top_categories=5, variables=['A6', 'A7'], drop_last=False)

In [20]:
#ohe_enc.fit(X_train)

In [21]:
#X_train_enc = ohe_enc.transform(X_train)
#X_test_enc = ohe_enc.transform(X_test)

### Replacing categories with ordinal numbers
Ordinal encoding consists of replacing the categories with digits from 1 to k (or 0 to k-1,
depending on the implementation), where k is the number of distinct categories of the
variable. The numbers are assigned arbitrarily. Ordinal encoding is better suited to
non-linear machine learning models, which can navigate through the arbitrarily assigned
digits to try and find patterns that relate to the target.

In [22]:
df2 = copy.deepcopy(df)
df2

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4883,204883,1,49.0,Self Enquiry,3,9.0,Small Business,Male,3,5.0,Deluxe,4.0,Unmarried,2.0,1,1,1,1.0,Manager,26576.0
4884,204884,1,28.0,Company Invited,1,31.0,Salaried,Male,4,5.0,Basic,3.0,Single,3.0,1,3,1,2.0,Executive,21212.0
4885,204885,1,52.0,Self Enquiry,3,17.0,Salaried,Female,4,4.0,Standard,4.0,Married,7.0,0,1,1,3.0,Senior Manager,31820.0
4886,204886,1,19.0,Self Enquiry,3,16.0,Small Business,Male,3,4.0,Basic,3.0,Single,3.0,0,5,0,2.0,Executive,20289.0


In [23]:
cat_fet

['TypeofContact',
 'Occupation',
 'Gender',
 'ProductPitched',
 'MaritalStatus',
 'Designation']

In [24]:
ordinal_mapping = {k: i for i, k in enumerate(df2['Designation'].unique(), 0) }
ordinal_mapping

{'Manager': 0, 'Executive': 1, 'Senior Manager': 2, 'AVP': 3, 'VP': 4}

In [25]:
df2['Designation']= df2['Designation'].map(ordinal_mapping)
df2['Designation'].sample(30)

4236    0
4494    1
247     0
3627    0
3266    0
3970    2
1733    3
2957    1
103     0
427     2
1425    0
1007    2
4359    3
2493    1
2367    0
3489    1
3255    2
3699    2
2171    3
3393    0
3496    1
2143    1
2196    0
4462    0
2287    0
600     0
4598    2
3311    3
2096    0
146     1
Name: Designation, dtype: int64

In [26]:
#le = OrdinalEncoder()
#le.fit(X_train[vars_categorical])
#X_train_enc = le.transform(X_train[vars_categorical])

In [27]:
#ordinal_enc = OrdinalCategoricalEncoder(encoding_method='arbitrary', variables=vars_categorical)
#ordinal_enc.fit(X_train)

In [28]:
#ordinal_enc.encoder_dict_.
#X_train = ordinal_enc.transform(X_train)

### Encoding with integers in an ordered manner

In the previous section, "Replacing categories with ordinal numbers", we replaced categories
with integers that were assigned arbitrarily. That encoding works well with non-linear
machine learning algorithms, which can navigate through the arbitrarily assigned digits to
try and find patterns that relate them to the target, but it may not work so well with
linear models. Here, we instead assign the integers in order of the mean target value per
category, which creates a monotonic relationship between the encoded variable and the target.
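A minimal sketch of ordering the integers by the target mean, assuming a binary target. The toy frame `df_toy` and its values are hypothetical; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a binary target.
df_toy = pd.DataFrame({
    "Designation": ["Executive", "Executive", "Executive",
                    "Manager", "Manager", "VP", "VP"],
    "ProdTaken": [1, 1, 0, 0, 1, 0, 0],
})

# Mean target per category, sorted from lowest to highest.
ordered_labels = df_toy.groupby("Designation")["ProdTaken"].mean().sort_values().index

# Assign 0 to the category with the lowest target mean, 1 to the next, and so on.
ordinal_mapping = {cat: i for i, cat in enumerate(ordered_labels)}

df_toy["Designation_enc"] = df_toy["Designation"].map(ordinal_mapping)
print(ordinal_mapping)
```

In practice the mapping would be learned on the training set only and then applied to the test set, to avoid leaking target information.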

In [29]:
#from feature_engine.categorical_encoders import OrdinalCategoricalEncoder

In [30]:
#ordered_labels =X_train.groupby(['A7'])['A16'].mean().sort_values().index
#ordinal_mapping = {k: i for i, k in enumerate(ordered_labels, 0)}

#X_train['A7'] = X_train['A7'].map(ordinal_mapping)
#

In [31]:
#ordinal_enc = OrdinalCategoricalEncoder(encoding_method='ordered',variables=None)
#ordinal_enc.fit(X_train, y_train)
#X_train_enc = ordinal_enc.transform(X_train)


### Encoding with the mean of the target

Mean encoding, or target encoding, replaces each category with the average target value
for that category. For example, suppose we have a City variable with the categories
London, Manchester, and Bristol, and we want to predict the default rate. If the default
rate for London is 30%, we replace London with 0.3; if the default rate for Manchester is
20%, we replace Manchester with 0.2; and so on. The same can be done with a continuous
target.


In [32]:
df4 = copy.deepcopy(df)
df4

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4883,204883,1,49.0,Self Enquiry,3,9.0,Small Business,Male,3,5.0,Deluxe,4.0,Unmarried,2.0,1,1,1,1.0,Manager,26576.0
4884,204884,1,28.0,Company Invited,1,31.0,Salaried,Male,4,5.0,Basic,3.0,Single,3.0,1,3,1,2.0,Executive,21212.0
4885,204885,1,52.0,Self Enquiry,3,17.0,Salaried,Female,4,4.0,Standard,4.0,Married,7.0,0,1,1,3.0,Senior Manager,31820.0
4886,204886,1,19.0,Self Enquiry,3,16.0,Small Business,Male,3,4.0,Basic,3.0,Single,3.0,0,5,0,2.0,Executive,20289.0


In [33]:
cat_fet

['TypeofContact',
 'Occupation',
 'Gender',
 'ProductPitched',
 'MaritalStatus',
 'Designation']

In [34]:
ord_lab = df4.groupby(['Designation'])['MonthlyIncome'].mean().to_dict()
ord_lab

{'AVP': 32241.04678362573,
 'Executive': 19939.92250712251,
 'Manager': 22643.15762925599,
 'Senior Manager': 26637.749326145553,
 'VP': 35878.94782608696}

In [35]:
df4['Designation'] = df4['Designation'].map(ord_lab)
df4['Designation']


0       22643.157629
1       22643.157629
2       19939.922507
3       19939.922507
4       19939.922507
            ...     
4883    22643.157629
4884    19939.922507
4885    26637.749326
4886    19939.922507
4887    19939.922507
Name: Designation, Length: 4888, dtype: float64

In [36]:
#mean_enc = MeanCategoricalEncoder(variables=None)
#fit_transform

### Encoding with the Weight of Evidence

The Weight of Evidence (WoE) was developed primarily for credit and financial industries
to facilitate variable screening and exploratory analysis and to build more predictive linear
models to evaluate the risk of loan default; that is, to predict how likely money lent to a
person or institution is to be lost.

In [37]:
df5= copy.deepcopy(df)

In [38]:
#from feature_engine.categorical_encoders import WoERatioCategoricalEncoder

In [39]:
cat_fet

['TypeofContact',
 'Occupation',
 'Gender',
 'ProductPitched',
 'MaritalStatus',
 'Designation']

In [40]:
num_fet

['CustomerID',
 'ProdTaken',
 'Age',
 'CityTier',
 'DurationOfPitch',
 'NumberOfPersonVisiting',
 'NumberOfFollowups',
 'PreferredPropertyStar',
 'NumberOfTrips',
 'Passport',
 'PitchSatisfactionScore',
 'OwnCar',
 'NumberOfChildrenVisiting',
 'MonthlyIncome']

In [45]:
p1 = df5.groupby('Occupation')['Age'].mean()
p1

Occupation
Free Lancer       37.500000
Large Business    36.637681
Salaried          37.649068
Small Business    37.796687
Name: Age, dtype: float64

In [46]:
p0 = 1- p1
p0

Occupation
Free Lancer      -36.500000
Large Business   -35.637681
Salaried         -36.649068
Small Business   -36.796687
Name: Age, dtype: float64

In [47]:
woe = dict(np.log(p1/p0)) 

In [48]:
woe

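{'Free Lancer': nan,
 'Large Business': nan,
 'Salaried': nan,
 'Small Business': nan}

__Every value is NaN because `Age` is not a binary target: `p1` should be the proportion of positive cases per category (for example, the mean of `ProdTaken`), so that `p0 = 1 - p1` stays between 0 and 1. Here `p1/p0` is negative, so its logarithm is undefined.__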
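For comparison, here is a sketch of a WoE calculation with a binary target. It uses one common definition, WoE = ln(share of events in the category / share of non-events in the category), which is an assumption rather than the formula used in the cells above. The toy data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a categorical feature and a binary target (1 = event).
df_toy = pd.DataFrame({
    "Occupation": ["Salaried"] * 4 + ["Small Business"] * 4,
    "ProdTaken": [1, 0, 0, 0, 1, 1, 1, 0],
})

# Count events (target == 1) and non-events (target == 0) per category.
counts = pd.crosstab(df_toy["Occupation"], df_toy["ProdTaken"])

# Share of all events / all non-events that fall in each category.
pct_events = counts[1] / counts[1].sum()
pct_non_events = counts[0] / counts[0].sum()

# WoE is the log of the ratio; it is undefined when either share is 0.
woe = np.log(pct_events / pct_non_events).to_dict()

df_toy["Occupation_woe"] = df_toy["Occupation"].map(woe)
print(woe)
```

Categories with no events or no non-events need special handling (for example, grouping them with other categories first), because the ratio becomes 0 or infinite.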

In [49]:
#use the map for the woe
#X_train['A1'] = X_train['A1'].map(woe)

In [50]:
# using feature_engine
#woe_enc = WoERatioCategoricalEncoder(encoding_method='woe',variables=['A1', 'A10', 'A12'])
#woe_enc.fit_transform(df)


### Grouping rare or infrequent categories


Rare values are those categories that are present in only a small percentage of the
observations. There is no fixed rule for how small that percentage must be, but typically
any category present in less than 5% of the observations can be considered rare. Infrequent
labels often appear only in the training set or only in the test set, which makes algorithms
prone to overfitting or unable to score an observation. To avoid these complications, we can
group infrequent categories into a new category called Rare or Other.

In [51]:
df6 = copy.deepcopy(df)
df6

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4883,204883,1,49.0,Self Enquiry,3,9.0,Small Business,Male,3,5.0,Deluxe,4.0,Unmarried,2.0,1,1,1,1.0,Manager,26576.0
4884,204884,1,28.0,Company Invited,1,31.0,Salaried,Male,4,5.0,Basic,3.0,Single,3.0,1,3,1,2.0,Executive,21212.0
4885,204885,1,52.0,Self Enquiry,3,17.0,Salaried,Female,4,4.0,Standard,4.0,Married,7.0,0,1,1,3.0,Senior Manager,31820.0
4886,204886,1,19.0,Self Enquiry,3,16.0,Small Business,Male,3,4.0,Basic,3.0,Single,3.0,0,5,0,2.0,Executive,20289.0


In [53]:
df6['NumberOfFollowups'].value_counts()/len(df)

4.0    0.423077
3.0    0.299918
5.0    0.157119
2.0    0.046849
1.0    0.036007
6.0    0.027823
Name: NumberOfFollowups, dtype: float64

If we treat labels present in less than 5% of the observations as rare, then 2.0, 1.0, and 6.0 are rare categories.

In [54]:
def find_frequent_labels(df, variable, tolerance):
    temp = df[variable].value_counts() / len(df)
    frequent = [x for x in temp.loc[temp>tolerance].index.values]
    return frequent

In [56]:
frequent_cat = find_frequent_labels(df6, 'NumberOfFollowups', 0.05)
frequent_cat   # frequent categories

[4.0, 3.0, 5.0]

In [57]:
df6['NumberOfFollowups'] = np.where(df6['NumberOfFollowups'].isin(frequent_cat),df6['NumberOfFollowups'], 'Rare')


In [58]:
df6['NumberOfFollowups'].value_counts()/len(df6)

4.0     0.423077
3.0     0.299918
5.0     0.157119
Rare    0.119885
Name: NumberOfFollowups, dtype: float64

__As shown above, the rare categories have been replaced with 'Rare'. Missing values are grouped into 'Rare' as well, because `isin` returns False for NaN.__

In [59]:
# using feature_engine
#rare_encoder = RareLabelCategoricalEncoder(tol=0.05, n_categories=4)
#rare_encoder.fit(X_train)
#rare_encoder.encoder_dict_.
#X_train_enc = rare_encoder.transform(X_train)
#X_test_enc = rare_encoder.transform(X_test)

### Performing binary encoding


Binary encoding is an alternative categorical encoding technique that uses binary code, that
is, a sequence of zeroes and ones, to represent the different categories of the variable. How
does it work? First, the categories are arbitrarily replaced by ordinal numbers. Then, those
numbers are converted into binary code. For example, with two digits, the integer 0 is
represented as the sequence 00, the integer 1 as 01, the integer 2 as 10, and the integer 3
as 11. The digits in the two positions of the binary string become the columns, which are
the encoded representation of the original variable.
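The two steps can be sketched in plain Python. The category names below are hypothetical, and libraries may number the categories differently (for example, starting from 1), so this is an illustration of the idea and not a reproduction of any particular encoder.

```python
# Hypothetical category values, used only for illustration.
categories = ["Basic", "Deluxe", "Standard", "King"]

# Step 1: replace each category with an arbitrary integer (here: order of appearance).
ordinal = {cat: i for i, cat in enumerate(categories)}

# Step 2: number of bits needed to represent the largest integer.
n_bits = max(1, (len(categories) - 1).bit_length())

# Step 3: write each integer in binary; each bit position becomes one column.
binary = {
    cat: [int(bit) for bit in format(i, f"0{n_bits}b")]
    for cat, i in ordinal.items()
}

for cat, bits in binary.items():
    print(cat, bits)
```

Four categories fit in two columns, whereas one-hot encoding would need four (or three with one dropped), which is why binary encoding is attractive for high-cardinality variables.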

In [61]:
df7 = copy.deepcopy(df)

In [62]:
cat_fet

['TypeofContact',
 'Occupation',
 'Gender',
 'ProductPitched',
 'MaritalStatus',
 'Designation']

In [63]:
df7['TypeofContact'].unique()

array(['Self Enquiry', 'Company Invited', nan], dtype=object)

In [65]:
#from category_encoders import BinaryEncoder

In [66]:
#encoder = BinaryEncoder(cols=['A7'], drop_invariant=True)
#encoder.fit(X_train)
#X_train_enc = encoder.transform(X_train)
#X_test_enc = encoder.transform(X_test)

### Performing feature hashing

With feature hashing, the categories of a variable are converted into a series of binary
vectors using a hashing function. How does this work? First, we determine, arbitrarily, the
number of binary vectors to represent the category. For example, let's say we would like to
use five vectors. Next, we need a hash function that will take a category and return a
number between 0 and n-1, where n is the number of binary vectors. In our example, the
hash function should return a value between 0 and 4. Let's say our hash function returns
the value of 3 for the category blue. That means that our category blue will be represented
by a 0 in the vectors 0, 1, 2, and 4 and 1 in the vector 3: [0,0,0,1,0]. Any hash function can be
used as long as it returns a number between 0 and n-1.
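The procedure described above can be sketched with the standard library. The helper `hash_encode` is a hypothetical name introduced for this illustration, and MD5 is just one possible hash function. It is chosen here because it gives the same result on every run, unlike Python's built-in `hash()` for strings.

```python
import hashlib


def hash_encode(category: str, n_components: int = 5) -> list:
    """Map a category to a binary vector of length n_components via a hash."""
    # Hash the category name and reduce it to an index between 0 and n_components - 1.
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    index = int(digest, 16) % n_components

    # Put a 1 at the hashed position and 0 everywhere else.
    vector = [0] * n_components
    vector[index] = 1
    return vector


for colour in ["blue", "red", "green"]:
    print(colour, hash_encode(colour))
```

Because the number of vectors is fixed in advance, two different categories can land on the same position (a collision). That is the price paid for a feature space whose size does not grow with the number of categories.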

In [67]:
#from category_encoders import HashingEncoder
#encoder = HashingEncoder(cols=['A7'], n_components=4)
#encoder.fit(X_train)
#X_train_enc = encoder.transform(X_train)
#X_test_enc = encoder.transform(X_test)

### Thank you 