Rare values are labels (categories) of a categorical variable that are present in only a small percentage of the observations.

There is no rule of thumb for how small a "small percentage" is, but typically any label present in less than 5% of the observations may cause over-fitting in trees.
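As a minimal sketch of how such labels can be found, the relative frequency of each label can be compared against the threshold. The helper name `find_rare_labels` and the toy `city` data are assumptions made for illustration; they are not part of the house-price dataset.

```python
import pandas as pd

def find_rare_labels(series, threshold=0.05):
    """Return the labels whose relative frequency is below `threshold`."""
    # value_counts(normalize=True) gives the share of observations per label
    freq = series.value_counts(normalize=True)
    return sorted(freq[freq < threshold].index)

# Hypothetical toy variable with 100 observations
city = pd.Series(['London'] * 70 + ['Manchester'] * 27 + ['Yorkshire'] * 3)

rare = find_rare_labels(city)
print(rare)  # ['Yorkshire'] (3% of the observations)
```

The 5% default is only the typical value mentioned above; it is worth adjusting per dataset.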

In this section of the course, I will expand on how to handle rare labels by re-categorising the observations that show rare labels for a certain variable. These observations can be re-categorised by:

1) Replacing the rare label with the most frequent label

2) Grouping the observations that show rare labels into a single new category (with a label such as 'Rare' or 'Other')

#### Grouping infrequent labels or categories under a new category called "Rare" or "Other" is the most common practice in ML for business
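The two options can be sketched as follows. This is an illustration under stated assumptions: the function names, the toy `city` variable and the hard-coded list of rare labels are hypothetical.

```python
import pandas as pd

def replace_with_most_frequent(series, rare_labels):
    """Option 1: overwrite rare labels with the most frequent label."""
    most_frequent = series.value_counts().idxmax()
    # where() keeps values for which the condition is True, replaces the rest
    return series.where(~series.isin(rare_labels), most_frequent)

def group_as_rare(series, rare_labels, new_label='Rare'):
    """Option 2: group all rare labels under one new label."""
    return series.where(~series.isin(rare_labels), new_label)

# Hypothetical toy variable; 'Yorkshire' is assumed to be the rare label
city = pd.Series(['London'] * 70 + ['Manchester'] * 27 + ['Yorkshire'] * 3)
rare_labels = ['Yorkshire']

option_1 = replace_with_most_frequent(city, rare_labels)
option_2 = group_as_rare(city, rare_labels)

print(option_1.value_counts().to_dict())
print(option_2.value_counts().to_dict())
```

Option 1 keeps the number of categories small but merges the rare observations into an existing group; option 2 keeps them distinguishable at the cost of one extra category.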

In [32]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

In [5]:
data = pd.read_csv('/Users/nandrajog/Downloads/House-price.csv',low_memory=False)
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [7]:
# divide dataset into train and test set
X_train, X_test, y_train, y_test = train_test_split(data, data.SalePrice,
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((1022, 81), (438, 81))

### Note


Rare labels should be identified using the training set only, and the result then propagated to the test set. When a label that was rare in the training set also appears in the test set, it should be replaced there too, regardless of whether it is rare in the test set (i.e., regardless of whether it appears in a tiny or a high percentage of the test observations).

For example: in the training set, the variable 'city' has the labels 'London', 'Manchester' and 'Yorkshire'. 'Yorkshire' is present in less than 5% of the observations, so we decide to replace it with 'London', the most frequent city in the training dataset.

In addition, if in the test set we find the category 'Milton Keynes', which was not present in the training set, we should also replace that category with 'London'. That is, all categories present in the test set but not in the training set should be treated as rare values and imputed accordingly.
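The 'city' example can be sketched as a fit step on the training set and an apply step on any other set. The function names and the toy series are assumptions for illustration; only the labels come from the example above.

```python
import pandas as pd

def fit_rare_encoder(train_col, threshold=0.05):
    """Learn, from the training set only, which labels are frequent."""
    freq = train_col.value_counts(normalize=True)
    frequent = freq[freq >= threshold].index.tolist()
    most_frequent = freq.idxmax()
    return frequent, most_frequent

def apply_rare_encoder(col, frequent, replacement):
    """Replace every label that was not frequent in the training set.

    This covers labels that were rare in training and labels never seen
    in training. Missing values (NaN) would be replaced as well, so
    handle them beforehand if that is not wanted.
    """
    return col.where(col.isin(frequent), replacement)

# Hypothetical toy data built from the labels in the example
train_city = pd.Series(['London'] * 70 + ['Manchester'] * 27 + ['Yorkshire'] * 3)
test_city = pd.Series(['London', 'Yorkshire', 'Yorkshire', 'Milton Keynes'])

frequent, most_frequent = fit_rare_encoder(train_city)
test_encoded = apply_rare_encoder(test_city, frequent, most_frequent)
print(test_encoded.tolist())  # ['London', 'London', 'London', 'London']
```

Note that 'Yorkshire' makes up half of this toy test set and is still replaced, because the decision was taken on the training set.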

In [21]:
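def categorical_number(df_train, df_test, columns):
    """Map each label of the given columns to an integer, in place.

    The mapping is learned from df_train only; labels that appear only in
    df_test have no entry in the mapping and become NaN.
    """
    for col in columns:
        print("col :", col)
        # Number the labels in order of first appearance in the training set
        labels_dict = {k: i for i, k in enumerate(df_train[col].unique())}
        print("labels_dict :", labels_dict)
        df_train.loc[:, col] = df_train.loc[:, col].map(labels_dict)
        df_test.loc[:, col] = df_test.loc[:, col].map(labels_dict)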


In [22]:
categorical_number(X_train,X_test,['Street'])

col : Street
labels_dict : {'Pave': 0, 'Grvl': 1}


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
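The warning above is most likely raised because `X_train` and `X_test` are slices of `data` that are then modified in place. One way to avoid it, sketched here under that assumption, is to work on explicit copies and return them. The name `encode_labels` and the toy frames are hypothetical.

```python
import pandas as pd

def encode_labels(df_train, df_test, columns):
    """Return encoded copies of both frames; the inputs are left untouched."""
    df_train, df_test = df_train.copy(), df_test.copy()
    for col in columns:
        # Mapping learned from the training set only, in order of appearance
        labels_dict = {k: i for i, k in enumerate(df_train[col].unique())}
        df_train[col] = df_train[col].map(labels_dict)
        df_test[col] = df_test[col].map(labels_dict)
    return df_train, df_test

# Hypothetical toy frames
train = pd.DataFrame({'Street': ['Pave', 'Pave', 'Grvl']})
test = pd.DataFrame({'Street': ['Grvl', 'Pave']})

train_enc, test_enc = encode_labels(train, test, ['Street'])
print(train_enc['Street'].tolist())  # [0, 0, 1]
print(test_enc['Street'].tolist())   # [1, 0]
```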


In [24]:
X_train['Street']

64      0
682     0
960     0
1384    0
1100    0
       ..
763     0
835     0
1216    0
559     0
684     0
Name: Street, Length: 1022, dtype: int64

In [26]:
X_train.Street.value_counts()

0    1017
1       5
Name: Street, dtype: int64

In [36]:
def train_rf(X_train, y_train, X_test, y_test, columns):
    
    rf = RandomForestRegressor(n_estimators=500, random_state=39,max_depth=10)
    rf.fit(X_train[columns], y_train.values)
    pred_train = rf.predict(X_train[columns])
    pred_test = rf.predict(X_test[columns])

    print('Train set')
    print('Random Forests mse: {}'.format(mean_squared_error(y_train, pred_train)))
    print('Test set')
    print('Random Forests mse: {}'.format(mean_squared_error(y_test, pred_test)))

In [37]:
train_rf(X_train,y_train,X_test,y_test,['Street'])

Train set
Random Forests mse: 6089194286.482007
Test set
Random Forests mse: 6779530374.068212


Here we focus mainly on variables with one dominant category.

In [67]:
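for col in data:
    # Only categorical (object) variables
    if data[col].dtype == 'object':
        # At most 2 distinct values (unique() counts a missing value as one)
        if len(data[col].unique()) < 3:
            print(data.groupby(col)[col].count())
            print("####### Percentage ########")
            # np.float was removed from NumPy; plain division already returns floats
            print(data.groupby(col)[col].count() / len(data))
            print()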
            

Street
Grvl       6
Pave    1454
Name: Street, dtype: int64
####### Percentage ########
Street
Grvl    0.00411
Pave    0.99589
Name: Street, dtype: float64

Utilities
AllPub    1459
NoSeWa       1
Name: Utilities, dtype: int64
####### Percentage ########
Utilities
AllPub    0.999315
NoSeWa    0.000685
Name: Utilities, dtype: float64

CentralAir
N      95
Y    1365
Name: CentralAir, dtype: int64
####### Percentage ########
CentralAir
N    0.065068
Y    0.934932
Name: CentralAir, dtype: float64



### Conclusion




In the first 2 variables, Street and Utilities, the variables show one dominating category which accounts for more than 99% of the observations. In the third variable, the dominating category is present in more than 90% of the observations.

When a variable has one dominating category, engineering the rare label is not an option: one has to choose between using the variable as it is and removing it from the dataset.

These types of variables are often not useful for our predictions, and we should remove them from the set of features that we are going to use to build machine learning models.

NOTE: There are of course exceptions, for example when the target is unbalanced.

Therefore, instead of automating a feature engineering pipeline, perhaps it is better to evaluate these variables individually.
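One possible way to evaluate such a variable individually is to look at the count, share and mean target per label before deciding whether to keep it. The sketch below uses an invented toy frame, not the house-price data, and the helper name `summarise_by_label` is hypothetical.

```python
import pandas as pd

def summarise_by_label(df, col, target):
    """Count, mean target and share of observations for each label of `col`."""
    summary = df.groupby(col)[target].agg(['count', 'mean'])
    summary['share'] = summary['count'] / len(df)
    return summary

# Invented toy values for illustration only
toy = pd.DataFrame({
    'Street': ['Pave'] * 8 + ['Grvl'] * 2,
    'SalePrice': [200] * 8 + [100] * 2,
})

summary = summarise_by_label(toy, 'Street', 'SalePrice')
print(summary)
```

A large difference in the mean target between labels suggests the variable may still carry signal, although with very few observations in the rare label such a difference should be read with caution.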