### Mean encoding (Target encoding)

     The general idea of this technique is to add new variables based on some feature. In the simplest case, we encode each level of a categorical variable with the corresponding mean of the target.
     
    Mean encoding lets us reach a better loss with shorter trees.
    Good indicators for mean encoding:
        - The presence of categorical variables with many levels

    The model tries to treat all those categories differently, and they are also very important for predicting the target. We can help the model with mean encodings.
    
    
<img src="files/Images/mean_enc1.png" width="500" height="200">

    Ways to use target variable
    
<img src="files/Images/mean_enc2.png" width="500" height="200">
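The figure is not reproduced in text, so the options below are an assumption: they are commonly used ways to turn a binary target into a per-category encoding (likelihood, a simple weight-of-evidence variant, count of positives, and difference). The toy data and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({'city': ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
                    'target': [1, 1, 0, 1, 0, 0, 0]})

grp = toy.groupby('city')['target']
goods = grp.sum()            # number of positive rows per category
bads = grp.count() - goods   # number of negative rows per category

# Assumed encodings; weight-of-evidence definitions vary between sources
enc = pd.DataFrame({
    'likelihood': goods / (goods + bads),   # this is the plain target mean
    'woe': np.log(goods / bads) * 100,
    'count': goods,
    'diff': goods - bads,
})

# Attach one of the encodings to the rows
toy['city_likelihood'] = toy['city'].map(enc['likelihood'])
```

Note that the weight-of-evidence variant is infinite for a category that has no positive or no negative rows, so it usually needs smoothing or clipping in practice.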

In [None]:
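# Target mean per category, computed on the training part only
means = X_tr.groupby(col)['target'].mean()
# Map the same training-derived means onto both train and validation
train_new[col + '_mean_target'] = train_new[col].map(means)
val_new[col + '_mean_target'] = val_new[col].map(means)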

In [None]:
dtrain = xgb.DMatrix(train_new, label=y_tr)
dvalid = xgb.DMatrix(val_new, label=y_val)

evallist = [(dtrain, 'train'),(dvalid, 'eval')]
evals_result3 = {}
model = xgb.train(xgb_par, dtrain, 3000, evals = evallist, 
                  verbose_eval = 30, evals_result = evals_result3,
                 early_stopping_rounds = 50)

### Regularization

    1. CV loop inside training data. (!)
        Intuitive and robust method.
        Usually gives decent results with 4-5 folds across different datasets
        Be careful with extreme cases such as leave-one-out (LOO)
<img src="files/Images/cvloop.png" width="500" height="200">

    2. Smoothing
        Alpha controls the amount of regularization
        Only works together with some other regularization method 
        
<img src="files/Images/Smoothing.png" width="500" height="200">

    3. Adding random noise
        Noise degrades the quality of encoding
        How much noise should we add?
        Usually used together with LOO
        
    4. Sorting and calculating expanding mean (!)
        Least amount of leakage
        No hyperparameters
        Irregular encoding quality
        Built into CatBoost
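Smoothing (item 2 above) can be sketched as follows. The formula used here, `(mean * n + global_mean * alpha) / (n + alpha)`, is one common form and is an assumption, since the smoothing figure is not reproduced in text. The helper name `smoothed_target_mean` and the toy data are hypothetical.

```python
import pandas as pd

def smoothed_target_mean(df, col, target, alpha):
    """Blend each category mean with the global mean (hypothetical helper).

    Assumed formula: (mean * n + global_mean * alpha) / (n + alpha),
    where n is the number of rows in the category.
    """
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(['mean', 'count'])
    return ((stats['mean'] * stats['count'] + global_mean * alpha)
            / (stats['count'] + alpha))

# Toy data, for illustration only
toy = pd.DataFrame({'city': ['a', 'a', 'a', 'a', 'b'],
                    'target': [1, 1, 1, 0, 1]})

smoothed = smoothed_target_mean(toy, 'city', 'target', alpha=2)
toy['city_mean_target'] = toy['city'].map(smoothed)
```

With `alpha=0` this reduces to the plain category mean; as `alpha` grows, rare categories are pulled towards the global mean. On its own this still uses each row's own target, which is why it is combined with another method such as the CV loop.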

In [None]:
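##CV_LOOP
from sklearn.model_selection import StratifiedKFold

y_tr = df_tr['target'].values
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

for tr_ind, val_ind in skf.split(df_tr, y_tr):
    X_tr, X_val = df_tr.iloc[tr_ind], df_tr.iloc[val_ind]
    for col in cols:  # iterate through the columns we want to encode
        # means come from the other folds only, so a row never sees its own target
        means = X_val[col].map(X_tr.groupby(col)['target'].mean())
        # assumes train_new has the same index as df_tr
        train_new.loc[X_val.index, col + '_mean_target'] = means
prior = df_tr['target'].mean()  # global mean
enc_cols = [col + '_mean_target' for col in cols]
# categories unseen in the other folds get the global mean
train_new[enc_cols] = train_new[enc_cols].fillna(prior)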

In [None]:
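##Calculating expanding mean
# Sum of targets of earlier rows in the same category (current row excluded)
cumsum = df_tr.groupby(col)['target'].cumsum() - df_tr['target']
# Number of earlier rows in the same category
cumcnt = df_tr.groupby(col).cumcount()
# The first occurrence of a category gives 0/0 = NaN
train_new[col + '_mean_target'] = cumsum / cumcnt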
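Item 3 of the list (random noise, usually together with LOO) has no code above, so here is a minimal sketch. The multiplicative Gaussian noise, the `noise_std` parameter, and the helper name `loo_mean_with_noise` are assumptions for illustration; how much noise to add remains something to tune.

```python
import numpy as np
import pandas as pd

def loo_mean_with_noise(df, col, target, noise_std=0.0, seed=0):
    """Leave-one-out target mean with optional noise (hypothetical helper)."""
    grp = df.groupby(col)[target]
    sums = grp.transform('sum')
    counts = grp.transform('count')
    # Exclude the current row's own target from its category statistic
    loo = (sums - df[target]) / (counts - 1)
    # A category seen only once has no other rows: fall back to the global mean
    loo = loo.where(counts > 1, df[target].mean())
    # Assumed noise model: multiply by values drawn from N(1, noise_std)
    rng = np.random.default_rng(seed)
    return loo * rng.normal(1.0, noise_std, size=len(df))

# Toy data, for illustration only
toy = pd.DataFrame({'city': ['a', 'a', 'a', 'b'],
                    'target': [1, 0, 1, 1]})
toy['city_mean_target'] = loo_mean_with_noise(toy, 'city', 'target',
                                              noise_std=0.05)
```

This applies to training rows only; validation and test rows would get the plain category means computed on the training data.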

### Extensions and generalizations

    Regression and multiclass:
        More statistics for regression tasks. Percentiles, std, distribution bins
        Introducing new information for one vs all classifiers in multi class tasks
    Many to many relations:
        Cross product of entities (Long representation)
        Statistics from vectors
    Time series:
        Time structure allows us to make a lot of complicated features
        Rolling statistics of target variable
    Interactions and numerical features:
        Analysing fitted model
        Binning numeric features and selecting interactions
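A rolling statistic of the target, as mentioned for time series above, might look like the sketch below. The column names (`date`, `shop`), the window of 2, and the toy data are hypothetical. The key point is the `shift(1)`, which keeps the current row's target out of its own feature.

```python
import pandas as pd

# Toy data, for illustration only; rows must be ordered in time within each group
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03',
                            '2020-01-04', '2020-01-01', '2020-01-02']),
    'shop': ['a', 'a', 'a', 'a', 'b', 'b'],
    'target': [1.0, 3.0, 5.0, 7.0, 2.0, 4.0],
}).sort_values(['shop', 'date'])

# Mean of the previous (up to) 2 targets of the same shop
toy['shop_roll2_target'] = (
    toy.groupby('shop')['target']
       .transform(lambda s: s.shift(1).rolling(window=2, min_periods=1).mean())
)
```

The first row of each shop has no history and stays NaN; it can be filled with a prior such as the global mean.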
        
     Correct validation reminder
     
<img src="files/Images/valrem.png" width="500" height="200">

### Summary

Main advantages:
    * Compact transformation of categorical variables
    * Powerful basis for feature engineering

Disadvantages:
    * Needs careful validation; there are a lot of ways to overfit
    * Significant improvements only on specific datasets