## Validation types

    Holdout(ShuffleSplit)
        ngroups = 1
        Divides the data into 2 parts, train and validation. Samples do not overlap.
        A good choice if we have enough data, or if we are likely to get similar scores for different splits.
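
A minimal sketch of a holdout split with `ShuffleSplit`. The toy array, `test_size=0.2` and `random_state=0` are illustrative assumptions, not values from these notes.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data: 10 samples, 2 features (values are arbitrary).
X = np.arange(20).reshape(10, 2)

# n_splits=1 gives a single holdout split.
holdout = ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, valid_idx = next(holdout.split(X))

# The two parts share no samples and together cover the whole data set.
print(len(train_idx), len(valid_idx))
```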
        
    K-fold(KFold)
        ngroups = k
        Every sample is used for validation exactly once.
        This method is a good choice when we have a limited amount of data and can get either a sufficiently 
        big difference in quality or different optimal parameters between folds.
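
The claim that every sample is validated once can be checked directly. This is a small sketch on made-up data; `n_splits=4` and the seed are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(24).reshape(12, 2)  # toy data, 12 samples
kf = KFold(n_splits=4, shuffle=True, random_state=0)

# Count how many times each sample lands in a validation fold.
times_validated = np.zeros(len(X), dtype=int)
for train_idx, valid_idx in kf.split(X):
    times_validated[valid_idx] += 1

print(times_validated)
```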
        
    Leave-one-out(LeaveOneOut)
        ngroups = len(train)
        Each iteration uses all objects but one (len(train) - 1) as the train subset and the one left out as the test subset. 
        This method can be helpful if we have too little data and a model that is fast enough to retrain.
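
A quick sketch showing that leave-one-out produces as many splits as there are samples, each with a single test object. The 5-sample array is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(10).reshape(5, 2)  # toy data, 5 samples
loo = LeaveOneOut()
splits = list(loo.split(X))

# One split per sample; each test subset holds exactly one object.
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```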
    
    Stratification(StratifiedShuffleSplit) 
    It is just a way to ensure we get a similar target distribution over different folds.
    It is easy to guess that this matters more:
        first, for small data sets, like in this example; 
        second, for unbalanced data sets. In binary classification, that is the case when the target average is very 
        close to 0 or, vice versa, very close to 1; 
        and third, for multiclass classification tasks with a huge number of classes. 
        
        For good (large, balanced) classification data sets, a stratified split will be quite similar to a simple shuffle split, 
        that is, to a random split.
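
A sketch of what stratification buys on an unbalanced binary target. It uses `StratifiedKFold` (one of the two stratified splitters imported in this notebook); the 20/80 class ratio and the number of folds are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Unbalanced binary target: 20 positives out of 100 samples.
y = np.array([1] * 20 + [0] * 80)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Share of positives in each validation fold; stratification keeps it
# close to the overall share (0.2 here).
fold_positive_rates = [y[valid_idx].mean() for _, valid_idx in skf.split(X, y)]
print(fold_positive_rates)
```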
        
        
     Time-based splits
     
         If the train/validation split differs from the train/test split, validation scores stop 
         reflecting test performance and we are going to create a useless model. 
         We should, if possible, set up validation to mimic the train/test split.
     
         Generate features based on the validation and model type. 'Models indeed differ significantly, 
         including the fact that most useful features for one model are useless for another.'
        
          
     To find smart ideas for feature generation and to consistently improve our model, 
     we absolutely want to identify the train/test split made by the competition organizers 
     and reproduce it.
     
     
     Split types:
         1. Random, row-wise (usually when there are no dependencies between rows)
         2. Time-wise (features based on target, moving window)
         
   <img src="files/Images/Moving_window.png" width="800" height="400">
         
         3. By id 
       
   <img src="files/Images/id_split.png" width="400" height="100">
   
         4. Combined
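
The time-wise and by-id split types can be sketched with scikit-learn's `TimeSeriesSplit` and `GroupKFold`. Picking these two classes is an assumption of this example (they are not imported elsewhere in these notes), and `user_id` is a hypothetical id column. Note that `TimeSeriesSplit` grows the training window by default; its `max_train_size` parameter can cap it to get something closer to a moving window.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

n = 12
X = np.arange(n).reshape(-1, 1)       # rows assumed to be sorted by time
user_id = np.repeat(np.arange(4), 3)  # hypothetical id column: 4 ids, 3 rows each

# Time-wise: every validation block lies strictly after its training rows.
time_ok = all(
    train_idx.max() < valid_idx.min()
    for train_idx, valid_idx in TimeSeriesSplit(n_splits=3).split(X)
)

# By id: no id appears on both sides of a split.
id_ok = all(
    set(user_id[train_idx]).isdisjoint(user_id[valid_idx])
    for train_idx, valid_idx in GroupKFold(n_splits=4).split(X, groups=user_id)
)

print(time_ok, id_ok)
```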

In [1]:
from sklearn.model_selection import ShuffleSplit  # 1. Holdout
from sklearn.model_selection import KFold  # 2. K-fold
from sklearn.model_selection import LeaveOneOut  # 3. Leave-one-out
from sklearn.model_selection import StratifiedKFold  # 4.1 Stratified K-fold
from sklearn.model_selection import StratifiedShuffleSplit  # 4.2 Stratified holdout

## Validation problems:
    1. Validation stage
        1.1 Too little data
        1.2 Data is too diverse and inconsistent
        
        We should do extensive validation:
            1. Average scores from different K-Fold splits
            2. Tune model on one split, evaluate score on the other
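
A sketch of the first option, averaging scores over several K-Fold splits that differ only in their random seed. The synthetic data set, the `LogisticRegression` model and the choice of 3 seeds with 5 folds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a small data set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

seed_means = []
for seed in range(3):
    # Same number of folds, different shuffling on every repetition.
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=kf)  # accuracy by default for classifiers
    seed_means.append(scores.mean())

# The average over repetitions is less sensitive to one lucky or unlucky split.
final_score = float(np.mean(seed_means))
print(seed_means, final_score)
```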
            
    2. Submission stage
        We can observe that:
            * LB score is consistently higher/lower than validation score
            * LB score is not correlated with validation score at all
            
        0. We may already have quite different scores across k-fold splits
        Other reasons:
            1. Too little data in the public leaderboard
            2. Incorrect train/test split
            3. Train and test data are from different distributions
            4. Overfitting (check if you overfitted)
        
     Leaderboard probing:
        When train and test data come from different distributions, the simplest fix in a competition is to 
        figure out the optimal constant prediction for the train and for the test data, and shift your 
        predictions by the difference.
        Ground-truth values for the public test set can be obtained through submissions.
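
The shift described above could look like the sketch below. It assumes a metric such as MSE, for which the optimal constant is the target mean. `shift_predictions` is a hypothetical helper, and the test mean is a made-up number standing in for a value recovered by probing.

```python
import numpy as np

def shift_predictions(preds, train_target_mean, test_target_mean):
    """Shift predictions by the gap between the test and train optimal constants."""
    return np.asarray(preds, dtype=float) + (test_target_mean - train_target_mean)

train_target_mean = 10.0  # computed from the training target
test_target_mean = 12.5   # hypothetical: assumed to be found via leaderboard probing
preds = np.array([9.0, 10.0, 11.0])

shifted = shift_predictions(preds, train_target_mean, test_target_mean)
print(shifted)  # each prediction moved up by 2.5
```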
        
        
     Links:
         http://scikit-learn.org/stable/modules/cross_validation.html
         http://www.chioka.in/how-to-select-your-final-models-in-a-kaggle-competitio/
         

## Data Leakages

    Leaks in time series: 
        * Splits should be done by time
          - In real life we don't have information from the future
          - In competitions, the first thing to look at: is the train/public/private split done by time?
        * Even when split by time, features may contain information about the future.
          - User history in CTR tasks
          - Weather
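
One way to keep a user-history feature free of future information is to compute it from earlier rows only. The sketch below is an illustration under that assumption: `past_click_rate` is a hypothetical helper, and the rows are assumed to be sorted by time.

```python
from collections import defaultdict

def past_click_rate(user_ids, clicks):
    """For each row (sorted by time), the user's click rate over earlier rows only."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    rates = []
    for user, click in zip(user_ids, clicks):
        # Use only what was known before this row; None when there is no history yet.
        rates.append(clicked[user] / shown[user] if shown[user] else None)
        # Update the counters after the feature is computed, never before.
        shown[user] += 1
        clicked[user] += click
    return rates

user_ids = ["a", "a", "b", "a", "b"]
clicks = [1, 0, 1, 1, 0]
rates = past_click_rate(user_ids, clicks)
print(rates)  # [None, 1.0, None, 0.5, 1.0]
```

Computing the same rate over all of a user's rows, including later ones, would leak the current and future clicks into the feature.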
          
    Unexpected information:
        * Meta data
        * Information in IDs (can be generated with hash)
        * Row order
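
A simple first check for a row-order leak is to correlate the row index with the target. This sketch uses a synthetic target that was deliberately left roughly sorted; on real data, `target` would be the training target in file order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target that kept its original ordering: a trend plus noise.
target = np.linspace(0, 1, 200) + rng.normal(scale=0.1, size=200)
row_index = np.arange(len(target))

# A correlation far from 0 suggests the row order carries target information.
corr = np.corrcoef(row_index, target)[0, 1]
print(round(corr, 3))
```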