In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



### Using target to generate features

In [None]:
rows = [['Moscow', 1,0.4,0],['Moscow', 1,0.4,1],['Moscow', 1,0.4,1],['Moscow', 1,0.4,0],['Moscow', 1,0.4,0],
       ['Tver', 2,0.8,1],['Tver', 2,0.8,1],['Tver', 2,0.8,1],['Tver', 2,0.8,0],
       ['Klin', 0,0,0],['Klin',0,0,0],['Tver', 2,0.8,1]]


df = pd.DataFrame(rows, columns = ['feature','feature_label','feature_mean','target'])
df

- `feature_label` is the label encoding of `feature`
- `feature_mean` is the mean (target) encoding of `feature`
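
As a quick check, both encoded columns can be reproduced from `feature` and `target` alone. The sketch below rebuilds a toy frame with the same rows; the names `toy`, `label_enc` and `mean_enc` are introduced here for illustration only, and the label codes follow alphabetical order, so they differ from the hand-assigned `feature_label` values.

```python
import pandas as pd

# Same (feature, target) pairs as the rows above
toy = pd.DataFrame({
    'feature': ['Moscow'] * 5 + ['Tver'] * 4 + ['Klin'] * 2 + ['Tver'],
    'target':  [0, 1, 1, 0, 0,   1, 1, 1, 0,   0, 0,   1],
})

# Label encoding: an arbitrary integer per category (alphabetical order here)
toy['label_enc'] = toy['feature'].astype('category').cat.codes

# Mean encoding: mean of the target within each category
toy['mean_enc'] = toy.groupby('feature')['target'].transform('mean')

print(toy.drop_duplicates('feature'))
```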
    
## Why does it work?

1. Label encoding assigns the categories an arbitrary order, so the encoded value has no correlation with the target.
2. Mean encoding orders the categories by their target mean, which helps to separate zeros from ones.

In general, the more complex and non-linear the target dependency, the more effective mean encodings are. There are countless ways to derive features from the target.

### Indicator of usefulness for mean encodings
The presence of categorical variables with many levels is already a good indicator.

Train XGBoost at several tree depths, for example 7, 9 and 11, and compare the scores. If the score keeps improving as the depth grows, not only on the train set but also on the test set, that is unusual, because deeper trees would normally overfit. It tells us that some features need a tremendous number of splits before they give a good prediction.

Mean encodings can help the model with exactly those features.

## Ways to use target variable
Goods is the number of ones in a group; Bads is the number of zeros.

- Likelihood = $\frac{Goods}{Goods+Bads} = mean(target)$
- Weight of Evidence = $\ln \frac{Goods}{Bads} \times 100$
- Count = Goods = sum(target)
- Diff = Goods - Bads
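
The four statistics can be computed per category with a single `groupby`. This is a minimal sketch on made-up data; the frame `data` and its column names are assumptions for illustration. Weight of Evidence is left as NaN for any group with zero Goods or zero Bads, where the logarithm is undefined.

```python
import numpy as np
import pandas as pd

# Made-up binary-target data, for illustration only
data = pd.DataFrame({
    'city':   ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'],
    'target': [1,   1,   1,   0,   1,   0,   0,   0,   0],
})

grp = data.groupby('city')['target']
stats = pd.DataFrame({
    'goods': grp.sum(),                # number of ones
    'bads':  grp.count() - grp.sum(),  # number of zeros
})
stats['likelihood'] = stats['goods'] / (stats['goods'] + stats['bads'])
stats['count'] = stats['goods']
stats['diff'] = stats['goods'] - stats['bads']

# WoE = ln(Goods / Bads) * 100, computed only where both counts are positive
valid = (stats['goods'] > 0) & (stats['bads'] > 0)
stats['woe'] = np.nan
stats.loc[valid, 'woe'] = np.log(stats.loc[valid, 'goods'] / stats.loc[valid, 'bads']) * 100

print(stats)
```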

Example code:
```python
import xgboost as xgb

# Estimate the encoding on the training part only, then map it to both parts
means = X_tr.groupby(col)['target'].mean()
train_new[col + '_mean_target'] = train_new[col].map(means)
val_new[col + '_mean_target'] = val_new[col].map(means)

dtrain = xgb.DMatrix(train_new, label=y_tr)
dvalid = xgb.DMatrix(val_new, label=y_val)

# Track train and validation metrics; stop once validation stops improving
evallist = [(dtrain, 'train'), (dvalid, 'eval')]
evals_result3 = {}
model = xgb.train(xgb_par, dtrain, 3000, evals=evallist, verbose_eval=30,
                  evals_result=evals_result3, early_stopping_rounds=50)
```

If this causes overfitting, we need to counter it with regularization.

## Regularization
1. CV loop inside training data
    - Robust and intuitive
    - Usually decent results with 4-5 folds across different datasets
    - Compute the mean encodings inside the CV loop
    - Need to be careful with extreme cases such as LOO (leave-one-out)
    
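    ```python
    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    y_tr = df_tr['target'].values  # target variable
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

    train_new = df_tr.copy()  # output frame, one new column per encoded feature
    for col in cols:
        train_new[col + '_mean_target'] = np.nan  # filled in fold by fold

    for tr_ind, val_ind in skf.split(df_tr, y_tr):
        X_tr, X_val = df_tr.iloc[tr_ind], df_tr.iloc[val_ind]
        for col in cols:  # iterate through the columns we want to encode
            # means are estimated on the other folds only
            means = X_val[col].map(X_tr.groupby(col)['target'].mean())
            train_new.iloc[val_ind, train_new.columns.get_loc(col + '_mean_target')] = means.values

    prior = df_tr['target'].mean()  # global mean
    train_new.fillna(prior, inplace=True)  # fill NaNs with the global mean
    ```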
    
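    - With the LOO scheme the encoding becomes a "perfect" feature: within each category the encoded value reveals the row's own target
    - Target variable leakage is still present even with the KFold scheme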
    
2. Smoothing Regularization
    - Alpha controls the amount of regularization. If a category is large, its mean is estimated from many data points and the encoding can be trusted. If a category is rare, the opposite holds.
    - Formula: $\frac{mean(target) \cdot nrows + globalmean \cdot alpha}{nrows + alpha}$, where alpha can be read as the category size at which we start to trust the category mean.
    - Other formulas are possible: anything that penalizes the encoding of rare categories can be considered smoothing.
    
    
3. Noise 
    - Noise degrades the quality of encoding
    - How much noise should we add? This is hard to tune, which makes the method unstable
    - Usually used together with LOO
    
    
4. Expanding Mean
    - Least amount of leakage
    - No hyperparameters
    - Irregular encoding quality
    - Built into CatBoost, which performs very well on datasets with categorical features
    
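    ```python
    # sum of targets over the previous rows of the same category (current row excluded)
    cumsum = df_tr.groupby(col)['target'].cumsum() - df_tr['target']
    # number of previous rows of the same category
    cumcnt = df_tr.groupby(col).cumcount()
    # the first row of each category is 0/0 = NaN and still needs a fill value
    train_new[col + '_mean_target'] = cumsum / cumcnt
    ```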
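
The smoothing formula from item 2 can be sketched as follows. The helper name `smoothed_mean_encoding`, its parameters and the toy data are assumptions for illustration, and `alpha=5` is an arbitrary choice rather than a recommended value.

```python
import pandas as pd

def smoothed_mean_encoding(df, col, target, alpha):
    """Return (mean * nrows + global_mean * alpha) / (nrows + alpha) for each row."""
    global_mean = df[target].mean()
    agg = df.groupby(col)[target].agg(['mean', 'count'])
    smooth = (agg['mean'] * agg['count'] + global_mean * alpha) / (agg['count'] + alpha)
    return df[col].map(smooth)

# Toy data: one large category and one category seen only once
toy = pd.DataFrame({
    'city':   ['big'] * 8 + ['rare'],
    'target': [1, 1, 1, 1, 1, 1, 0, 0, 0],
})
toy['city_smooth'] = smoothed_mean_encoding(toy, 'city', 'target', alpha=5)
print(toy.drop_duplicates('city'))
```

The rare category is pulled most of the way toward the global mean, while the large category stays close to its own mean.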

## Extension and Generalization
1. Regression and multiclass
    - Regression tasks allow more statistics: percentiles, std, distribution bins.
    - In multiclass tasks, encoding the target per class introduces new information for one-vs-all classifiers.
    
2. Many-to-many relations
    - Long representation
    - Statistics from vectors
    
    
3. Time Series
    - Time structure allows us to make a lot of complicated features
    - Rolling statistics of target variable
    - The more data we have, the more complex the features we can create
    
    
4. Interactions and numerical features
    - Bin a numeric feature and treat it as a categorical feature
    - Train an XGBoost model without any encodings. If a numeric feature has many split points, it has some complicated dependency on the target, so it is worth trying to mean-encode that feature
    - The exact split points can then be used as bin edges for the numeric feature
    - How do we select useful feature combinations, and how do we extract interactions from a decision tree?
    - Two features interact in a tree if they sit in two neighbouring nodes, so we can count how many times each pair of features appears together
    - The most frequent interactions are the most likely to be worth mean-encoding: concatenate the two features and mean-encode the result
    - CatBoost relies on such feature interactions, which makes it a good choice for datasets with many categorical variables
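
A minimal sketch of two ideas from item 4: binning a numeric feature, and concatenating two categorical features, before mean-encoding the result. The data, the column names and the bin edges are made up for illustration; in practice the edges could come from the split points of a fitted tree model.

```python
import pandas as pd

# Made-up data, for illustration only
df_demo = pd.DataFrame({
    'age':    [18, 22, 25, 31, 38, 44, 52, 60],
    'city':   ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'],
    'device': ['x', 'y', 'x', 'y', 'x', 'y', 'x', 'y'],
    'target': [0, 0, 1, 1, 1, 0, 1, 0],
})

# 1) Bin the numeric feature (hypothetical edges) and mean-encode the bins
edges = [0, 24, 40, 100]
df_demo['age_bin'] = pd.cut(df_demo['age'], bins=edges, labels=False)
df_demo['age_bin_mean_target'] = df_demo.groupby('age_bin')['target'].transform('mean')

# 2) Concatenate two categorical features and mean-encode the combination
df_demo['city_device'] = df_demo['city'] + '_' + df_demo['device']
df_demo['city_device_mean_target'] = (
    df_demo.groupby('city_device')['target'].transform('mean')
)

print(df_demo)
```

No regularization is applied in this sketch, so in a real pipeline these encodings would still need one of the schemes from the Regularization section.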

### Correct validation reminder
1. Local experiments:
    - Estimate encodings on X_tr
    - Map them to X_tr and X_val
    - Regularize on X_tr
    - Validate model on X_tr/X_val split
    
2. Submission:
    - Estimate encodings on whole Train data
    - Map them to Train and Test
    - Regularize on Train
    - Fit on Train
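
The two workflows differ only in which frame the encoding is estimated on. A rough sketch follows, with hypothetical helper names `fit_mean_encoding` and `apply_mean_encoding` and toy data; regularization of the training-side encoding is omitted for brevity, and unseen categories fall back to the global mean.

```python
import pandas as pd

def fit_mean_encoding(fit_df, col, target='target'):
    """Estimate per-category target means and the global mean on fit_df only."""
    return fit_df.groupby(col)[target].mean(), fit_df[target].mean()

def apply_mean_encoding(df, col, means, prior):
    """Map the estimated means; unseen categories fall back to the prior."""
    return df[col].map(means).fillna(prior)

train = pd.DataFrame({'city': ['a', 'a', 'b', 'b', 'b', 'c'],
                      'target': [1, 0, 1, 1, 0, 0]})
test = pd.DataFrame({'city': ['a', 'b', 'd']})  # 'd' never appears in train

# Local experiment: estimate on X_tr, map to X_tr and X_val
X_tr, X_val = train.iloc[:4].copy(), train.iloc[4:].copy()
means, prior = fit_mean_encoding(X_tr, 'city')
X_tr['city_mean_target'] = apply_mean_encoding(X_tr, 'city', means, prior)
X_val['city_mean_target'] = apply_mean_encoding(X_val, 'city', means, prior)

# Submission: estimate on the whole train set, map to train and test
means, prior = fit_mean_encoding(train, 'city')
train['city_mean_target'] = apply_mean_encoding(train, 'city', means, prior)
test['city_mean_target'] = apply_mean_encoding(test, 'city', means, prior)
print(test)
```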
    
    
### Summary
1. Main advantages:
    - Compact transformation of categorical variables
    - Powerful basis for feature engineering
   
2. Disadvantages:
    - Needs careful validation; there are a lot of ways to overfit
    - Significant improvements only on specific datasets