# <span style="color:#FFC300;">***Ubiquant Market Prediction Exploratory Data Analysis***</span>
_**<span style="color:#C70039;">I can’t thank you enough if you upvote this notebook.</span>**_  🔥💛🔥
## <span style="color:#FFC300;">*Table of contents*</span>
<a id="table-of-contents"></a>
- [1. Introduction](#1)
- [2. Preparations](#2)
- [3. Dataset Overview](#3) 
- [4. Features](#4) 
    - [4.1. f_0 - f_24](#4.1)
    - [4.2. f_25 - f_49](#4.2)
    - [4.3. f_50 - f_74](#4.3)
    - [4.4. f_75 - f_99](#4.4)
    - [4.5. f_100 - f_124](#4.5)
    - [4.6. f_125 - f_149](#4.6)
    - [4.7. f_150 - f_174](#4.7)
    - [4.8. f_175 - f_199](#4.8)
    - [4.9. f_200 - f_224](#4.9)
    - [4.10. f_225 - f_249](#4.10)
    - [4.11. f_250 - f_274](#4.11)
    - [4.12. f_275 - f_299](#4.12)
- [5. Target](#5) 
- [6. Correlation](#6)
    - [6.1. Correlation between features and target](#6.1)
    - [6.2. P-value between features and target](#6.2)
    - [6.3. Scatterplot between low correlation features and target](#6.3)
    - [6.4. Features Correlation Matrix](#6.4)
    - [6.5. Features VIF](#6.5)
- [7. Investment_id](#7)
- [8. Quantile is all you need for analysis](#8)

[back to top](#table-of-contents)
<a id="1"></a>
# **<span style="color:#FFC300;">1. Introduction</span>**
In this competition, you’ll build a model that forecasts an investment's return rate. Train and test your algorithm on historical prices. Top entries will solve this real-world data science problem with as much accuracy as possible.

Submissions are evaluated on the mean of the Pearson correlation coefficient for each time ID.
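
The metric can be sketched in a few lines of pandas. This is a minimal illustration on toy data, not the official scoring code: the `prediction` column and the helper name `mean_pearson_by_time_id` are assumptions made for the example.

```python
import pandas as pd


def mean_pearson_by_time_id(df: pd.DataFrame,
                            target_col: str = "target",
                            pred_col: str = "prediction") -> float:
    """Average the per-time_id Pearson correlation between target and prediction."""
    per_time = df.groupby("time_id")[[target_col, pred_col]].apply(
        lambda g: g[target_col].corr(g[pred_col])
    )
    # Series.mean() skips NaN, so groups with an undefined correlation are ignored here.
    return float(per_time.mean())


# Toy data: two time IDs with three investments each.
toy = pd.DataFrame({
    "time_id":    [0, 0, 0, 1, 1, 1],
    "target":     [0.1, 0.2, 0.3, 0.3, 0.2, 0.1],
    "prediction": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
})
score = mean_pearson_by_time_id(toy)
```

In the toy frame the predictions rank the targets perfectly at `time_id` 0 and in exactly the reverse order at `time_id` 1, so the two per-time correlations cancel and the mean is close to zero.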

**Files**

`train.csv`

`row_id -` A unique identifier for the row.

`time_id -` The ID code for the time the data was gathered. The time IDs are in order, but the real time between the time IDs is not constant and will likely be shorter for the final private test set than in the training set.

`investment_id -` The ID code for an investment. Not all investments have data in all time IDs.

`target -` The target.

`[f_0:f_299] -` Anonymized features generated from market data.

`example_test.csv` - Random data provided to demonstrate what shape and format of data the API will deliver to your notebook when you submit.

`example_sample_submission.csv` - An example submission file provided so the publicly accessible copy of the API provides the correct data shape and format.

`ubiquant/` - The image delivery API that will serve the test set. You may need Python 3.7 and a Linux environment to run the example test set through the API offline without errors.


[back to top](#table-of-contents)
<a id="2"></a>
# <span style="color:#FFC300;">2. Preparations</span>
This section loads the packages and data used in the analysis. The packages are mainly for data manipulation and data visualization. Because the data is high-dimensional, we read it from a Parquet file.

If you plan to run heavier computations, I recommend the Dask package:
https://dask.org

Many thanks to [@valleyzw](https://www.kaggle.com/valleyzw) for sharing the code.  
**Reference:**  https://www.kaggle.com/valleyzw/ubiquant-lgbm-baseline


In [None]:
import gc
import os
import random
import warnings
from argparse import Namespace
from pathlib import Path

import numpy as np
import pandas as pd
import seaborn as sns
import tqdm
from cycler import cycler
from matplotlib import pyplot as plt

# setting up options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')

In [None]:
args = Namespace(
    seed=21,
    folds=5,
    workers=4,
    samples=2500000,
    data_path=Path("../input/ubiquant-parquet/"),
)


def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

#train = reduce_mem_usage(train)

#def seed_everything(seed: int = 42) -> None:
#    random.seed(seed)
#    np.random.seed(seed)
#    os.environ["PYTHONHASHSEED"] = str(seed)
    
#seed_everything(args.seed)
#train = reduce_mem_usage(pd.read_parquet(args.data_path.joinpath("train_low_mem.parquet")))
train = pd.read_parquet(args.data_path.joinpath("train_low_mem.parquet"))

#references: https://www.kaggle.com/valleyzw/ubiquant-lgbm-baseline

FEATURES = [col for col in train.columns if col not in ['target', 'row_id']]
NUM_FEATURES = [col for col in train.columns if col not in ['target', 'row_id', 'investment_id', 'time_id']]
CAT_FEATURES = [feature for feature in FEATURES if feature not in NUM_FEATURES]

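# Pick three distinct investment IDs to overlay on the feature plots
# (random.choices samples with replacement and could return duplicates).
inv_ids = random.sample(list(train['investment_id'].unique()), k=3)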


In [None]:
train.head()

[back to top](#table-of-contents)
<a id="3"></a>
# <span style="color:#FFC300;">3. Dataset Overview</span>

This overview is meant to give a feel for the training data and its structure. It includes a quick look at missing values and basic statistics.

We are dealing with many different distributions, probably encoded financial factors. Some of them have outliers, while others look like qualitative variables. Working with this data calls for risk theory, which is the basis of investment.


In [None]:
train.info()

In [None]:
print('Rows and Columns in train dataset:', train.shape)
#print('Rows and Columns in test dataset:', test.shape)

In [None]:
print('Missing values in train dataset:', sum(train.isnull().sum()))
#print('Missing values in test dataset:', sum(test_df.isnull().sum()))

In [None]:
train.head(5)

In [None]:
train.describe()

[back to top](#table-of-contents)
<a id="4"></a>
# <span style="color:#FFC300;">4. Features</span>
**Feature distributions**
<a id="4.1"></a>
## <span style="color:#FFC300;">4.1. f_0 - f_24</span>

**Probably qualitative features** - f_9, f_14, f_18, f_22?, f_27, f_34, f_41, f_48, f_58, f_60?, f_66, f_73?, f_75?, f_81, f_107, f_120?, f_124, f_132, f_138, f_143, f_148, f_152, f_154, f_156, f_163, f_166, f_168, f_170, f_174, f_176, f_177, f_182, f_187, f_227, f_229, f_238, f_246, f_263, f_272

**Note** - f_111 and f_112 mirror each other.
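
Guesses like these can also be checked numerically rather than by eye. The sketch below runs on synthetic data and rests on two assumptions of mine: that a "qualitative" feature shows up as a column with few distinct values (the `max_unique=20` threshold is arbitrary), and that "mirror" means one column is close to the negation of the other. The helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def low_cardinality_columns(df, columns, max_unique=20):
    """Return the columns whose number of distinct values is at most max_unique."""
    counts = df[list(columns)].nunique()
    return counts[counts <= max_unique].index.tolist()


def is_mirror(a: pd.Series, b: pd.Series, tol: float = 1e-3) -> bool:
    """True when the two columns are almost perfectly anti-correlated."""
    return bool(a.corr(b) < -1 + tol)


rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "f_a": rng.normal(size=1000),                # continuous values
    "f_b": rng.integers(0, 5, size=1000) * 0.5,  # at most 5 distinct levels
})
demo["f_c"] = -demo["f_a"]                       # negated copy of f_a

candidates = low_cardinality_columns(demo, ["f_a", "f_b", "f_c"])
mirrored = is_mirror(demo["f_a"], demo["f_c"])
```

On the real data the same helpers could be called with `NUM_FEATURES` and with `train['f_111']`, `train['f_112']`; whether the threshold is sensible there would need to be checked against the plots.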


In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)
background_color = '#f6f5f5'
run_no = 0

colormap = ['#1DBA94','#1C5ED2', '#FFC300', '#C70039']
plt.rc('axes', prop_cycle=(cycler('color', colormap)))

for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  


features = list(train.columns[4:29])

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[train['investment_id'].isin(inv_ids)][col], hue=train[train['investment_id'].isin(inv_ids)]['investment_id'], zorder=2, alpha=1, fill=True, color=colormap, linewidth=0.5, legend=False, palette=colormap[:3], hue_order=sorted(inv_ids, reverse=True))
    
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    #locals()["ax"+str(run_no)].get_legend().remove()
    
    run_no += 1

plt.show()
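
The same 5 x 5 grid is rebuilt for every block of 25 features, so it may be convenient to wrap it in a function. The sketch below is one possible way to do that without writing axes into `locals()`. The helper names `make_axes_grid` and `plot_feature_grid` are hypothetical, and a density histogram stands in for `sns.kdeplot` so that the example needs only matplotlib.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def make_axes_grid(nrows=5, ncols=5, background="#f6f5f5"):
    """Create a styled nrows x ncols grid and return (fig, flat list of axes)."""
    fig, axes = plt.subplots(nrows, ncols, figsize=(10, 10), facecolor=background)
    fig.subplots_adjust(wspace=0.3, hspace=0.3)
    axes = list(np.atleast_1d(axes).ravel())
    for ax in axes:
        ax.set_facecolor(background)
        for side in ("top", "right"):
            ax.spines[side].set_visible(False)
    return fig, axes


def plot_feature_grid(df, columns):
    """Draw one density histogram per column (a stand-in for sns.kdeplot)."""
    fig, axes = make_axes_grid()
    for ax, col in zip(axes, columns):
        ax.hist(df[col], bins=50, density=True, color="#ffd514", zorder=2)
        ax.grid(which="major", color="#EEEEEE", linewidth=0.4, zorder=0)
        ax.set_xlabel(col, fontsize=4, fontweight="bold")
        ax.tick_params(labelsize=4, width=0.5)
    return fig, axes


# Synthetic stand-in for one block of 25 features.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(500, 25)),
                    columns=[f"f_{i}" for i in range(25)])
fig, axes = plot_feature_grid(demo, demo.columns)
```

With the real data the call would look like `plot_feature_grid(train, train.columns[4:29])`, and the `ax.hist` line is where the two `sns.kdeplot(ax=ax, ...)` calls from the cell above would go.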

[back to top](#table-of-contents)
<a id="4.2"></a>
## <span style="color:#FFC300;">4.2. f_25 - f_49</span>

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)
background_color = '#f6f5f5'
run_no = 0

colormap = ['#1DBA94','#1C5ED2', '#FFC300', '#C70039']
plt.rc('axes', prop_cycle=(cycler('color', colormap)))

for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  


features = list(train.columns[29:54])

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[train['investment_id'].isin(inv_ids)][col], hue=train[train['investment_id'].isin(inv_ids)]['investment_id'], zorder=2, alpha=1, fill=True, color=colormap, linewidth=0.5, legend=False, palette=colormap[:3], hue_order=sorted(inv_ids, reverse=True))
    
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    #locals()["ax"+str(run_no)].get_legend().remove()
    
    run_no += 1

plt.show()

[back to top](#table-of-contents)
<a id="4.3"></a>
## <span style="color:#FFC300;">4.3. f_50 - f_74</span>

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)
background_color = '#f6f5f5'
run_no = 0

colormap = ['#1DBA94','#1C5ED2', '#FFC300', '#C70039']
plt.rc('axes', prop_cycle=(cycler('color', colormap)))

for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  


features = list(train.columns[54:79])

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[train['investment_id'].isin(inv_ids)][col], hue=train[train['investment_id'].isin(inv_ids)]['investment_id'], zorder=2, alpha=1, fill=True, color=colormap, linewidth=0.5, legend=False, palette=colormap[:3], hue_order=sorted(inv_ids, reverse=True))
    
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    #locals()["ax"+str(run_no)].get_legend().remove()
    
    run_no += 1

plt.show()

[back to top](#table-of-contents)
<a id="4.4"></a>
## <span style="color:#FFC300;">4.4. f_75 - f_99</span>

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)
background_color = '#f6f5f5'
run_no = 0

colormap = ['#1DBA94','#1C5ED2', '#FFC300', '#C70039']
plt.rc('axes', prop_cycle=(cycler('color', colormap)))

for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  


features = list(train.columns[79:104])

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[train['investment_id'].isin(inv_ids)][col], hue=train[train['investment_id'].isin(inv_ids)]['investment_id'], zorder=2, alpha=1, fill=True, color=colormap, linewidth=0.5, legend=False, palette=colormap[:3], hue_order=sorted(inv_ids, reverse=True))
    
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    #locals()["ax"+str(run_no)].get_legend().remove()
    
    run_no += 1

plt.show()

[back to top](#table-of-contents)
<a id="4.5"></a>
## <span style="color:#FFC300;">4.5. f_100 - f_124</span>

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)
background_color = '#f6f5f5'
run_no = 0

colormap = ['#1DBA94','#1C5ED2', '#FFC300', '#C70039']
plt.rc('axes', prop_cycle=(cycler('color', colormap)))

for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  


features = list(train.columns[104:129])

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[train['investment_id'].isin(inv_ids)][col], hue=train[train['investment_id'].isin(inv_ids)]['investment_id'], zorder=2, alpha=1, fill=True, color=colormap, linewidth=0.5, legend=False, palette=colormap[:3], hue_order=sorted(inv_ids, reverse=True))
    
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    #locals()["ax"+str(run_no)].get_legend().remove()
    
    run_no += 1

plt.show()

[back to top](#table-of-contents)
<a id="4.6"></a>
## <span style="color:#FFC300;">4.6. f_125 - f_149</span>

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)
background_color = '#f6f5f5'
run_no = 0

colormap = ['#1DBA94','#1C5ED2', '#FFC300', '#C70039']
plt.rc('axes', prop_cycle=(cycler('color', colormap)))

for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  


features = list(train.columns[129:154])

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[train['investment_id'].isin(inv_ids)][col], hue=train[train['investment_id'].isin(inv_ids)]['investment_id'], zorder=2, alpha=1, fill=True, color=colormap, linewidth=0.5, legend=False, palette=colormap[:3], hue_order=sorted(inv_ids, reverse=True))
    
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    #locals()["ax"+str(run_no)].get_legend().remove()
    
    run_no += 1

plt.show()

[back to top](#table-of-contents)
<a id="4.7"></a>
## <span style="color:#FFC300;">4.7. f_150 - f_174</span>

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)
background_color = '#f6f5f5'
run_no = 0

colormap = ['#1DBA94','#1C5ED2', '#FFC300', '#C70039']
plt.rc('axes', prop_cycle=(cycler('color', colormap)))

for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  


features = list(train.columns[154:179])

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[train['investment_id'].isin(inv_ids)][col], hue=train[train['investment_id'].isin(inv_ids)]['investment_id'], zorder=2, alpha=1, fill=True, color=colormap, linewidth=0.5, legend=False, palette=colormap[:3], hue_order=sorted(inv_ids, reverse=True))
    
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    #locals()["ax"+str(run_no)].get_legend().remove()
    
    run_no += 1

plt.show()

[back to top](#table-of-contents)
<a id="4.8"></a>
## <span style="color:#FFC300;">4.8. f_175 - f_199</span>

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)
background_color = '#f6f5f5'
run_no = 0

colormap = ['#1DBA94','#1C5ED2', '#FFC300', '#C70039']
plt.rc('axes', prop_cycle=(cycler('color', colormap)))

for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  


features = list(train.columns[179:204])

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[train['investment_id'].isin(inv_ids)][col], hue=train[train['investment_id'].isin(inv_ids)]['investment_id'], zorder=2, alpha=1, fill=True, color=colormap, linewidth=0.5, legend=False, palette=colormap[:3], hue_order=sorted(inv_ids, reverse=True))
    
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    #locals()["ax"+str(run_no)].get_legend().remove()
    
    run_no += 1

plt.show()

[back to top](#table-of-contents)
<a id="4.9"></a>
## <span style="color:#FFC300;">4.9. f_200 - f_224</span>

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)
background_color = '#f6f5f5'
run_no = 0

colormap = ['#1DBA94','#1C5ED2', '#FFC300', '#C70039']
plt.rc('axes', prop_cycle=(cycler('color', colormap)))

for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  


features = list(train.columns[204:229])

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[train['investment_id'].isin(inv_ids)][col], hue=train[train['investment_id'].isin(inv_ids)]['investment_id'], zorder=2, alpha=1, fill=True, color=colormap, linewidth=0.5, legend=False, palette=colormap[:3], hue_order=sorted(inv_ids, reverse=True))
    
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    #locals()["ax"+str(run_no)].get_legend().remove()
    
    run_no += 1

plt.show()

[back to top](#table-of-contents)
<a id="4.10"></a>
## <span style="color:#FFC300;">4.10. f_225 - f_249</span>

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)
background_color = '#f6f5f5'
run_no = 0

colormap = ['#1DBA94','#1C5ED2', '#FFC300', '#C70039']
plt.rc('axes', prop_cycle=(cycler('color', colormap)))

for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  


features = list(train.columns[229:254])

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[train['investment_id'].isin(inv_ids)][col], hue=train[train['investment_id'].isin(inv_ids)]['investment_id'], zorder=2, alpha=1, fill=True, color=colormap, linewidth=0.5, legend=False, palette=colormap[:3], hue_order=sorted(inv_ids, reverse=True))
    
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    #locals()["ax"+str(run_no)].get_legend().remove()
    
    run_no += 1

plt.show()

[back to top](#table-of-contents)
<a id="4.11"></a>
## <span style="color:#FFC300;">4.11. f_250 - f_274</span>

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)
background_color = '#f6f5f5'
run_no = 0

colormap = ['#1DBA94','#1C5ED2', '#FFC300', '#C70039']
plt.rc('axes', prop_cycle=(cycler('color', colormap)))

for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  


features = list(train.columns[254:279])

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[train['investment_id'].isin(inv_ids)][col], hue=train[train['investment_id'].isin(inv_ids)]['investment_id'], zorder=2, alpha=1, fill=True, color=colormap, linewidth=0.5, legend=False, palette=colormap[:3], hue_order=sorted(inv_ids, reverse=True))
    
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    #locals()["ax"+str(run_no)].get_legend().remove()
    
    run_no += 1

plt.show()

[back to top](#table-of-contents)
<a id="4.12"></a>
## <span style="color:#FFC300;">4.12. f_275 - f_299</span>

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)
background_color = '#f6f5f5'
run_no = 0

colormap = ['#1DBA94','#1C5ED2', '#FFC300', '#C70039']
plt.rc('axes', prop_cycle=(cycler('color', colormap)))

for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  


features = list(train.columns[279:304])

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train[train['investment_id'].isin(inv_ids)][col], hue=train[train['investment_id'].isin(inv_ids)]['investment_id'], zorder=2, alpha=1, fill=True, color=colormap, linewidth=0.5, legend=False, palette=colormap[:3], hue_order=sorted(inv_ids, reverse=True))
    
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    #locals()["ax"+str(run_no)].get_legend().remove()
    
    run_no += 1

plt.show()

[back to top](#table-of-contents)
<a id="5"></a>
# <span style="color:#FFC300;">5. Target</span>
Distributions of the target and of the two ID columns, `investment_id` and `time_id`.

In [None]:
targets = ['target', 'investment_id', 'time_id']

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(5, 1), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 3)
gs.update(wspace=0.2, hspace=0.5)

background_color = "#f6f5f5"

run_no = 0
for row in range(0, 1):
    for col in range(0, 3):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        locals()["ax"+str(run_no)].tick_params(axis='y', which=u'both',length=0)
        locals()["ax"+str(run_no)].set_yticklabels([])
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

run_no = 0
for col in targets:
    sns.kdeplot(x=train[col], ax=locals()["ax"+str(run_no)], fill=True, color='#fcd12a', alpha=0.95, linewidth=0, zorder=2)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=5, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=5, width=0.5, length=1.5)
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)
    run_no += 1
    
#ax0.text(-1.2, 0.44, 'Target Distribution', fontsize=8, fontweight='bold')
#ax0.text(-1.2, 0.40, 'Target variables are showing a lognormal distribution', fontsize=5)

plt.show()

[back to top](#table-of-contents)
<a id="6"></a>
# <span style="color:#FFC300;">6. Correlation</span>
<a id="6.1"></a>
## <span style="color:#FFC300;">6.1 Correlation between features and target</span>
Lowest absolute correlations:

In [None]:
corr_coef = [np.corrcoef(train['target'], train[col])[0][1] for col in train.columns if col not in ['row_id', 'target', 'investment_id']]
corr_name = [col for col in train.columns if col not in ['row_id', 'target', 'investment_id']]

corr = pd.DataFrame(corr_coef, index=corr_name, columns=['corr'])
corr = np.abs(corr).sort_values(by='corr', ascending=False)

corr.tail(10)

Highest absolute correlations:

In [None]:
corr.head(10)
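
The same ranking can be produced in one pass with `DataFrame.corrwith`, which avoids the explicit list comprehension. The sketch below is an alternative formulation on synthetic data, not the code used for the tables above. The helper name `abs_corr_with_target` and the demo columns are made up for the example.

```python
import numpy as np
import pandas as pd


def abs_corr_with_target(df, feature_cols, target_col="target"):
    """Absolute Pearson correlation of each feature with the target, sorted descending."""
    return (
        df[list(feature_cols)]
        .corrwith(df[target_col])
        .abs()
        .sort_values(ascending=False)
    )


rng = np.random.default_rng(0)
x = rng.normal(size=2000)
demo = pd.DataFrame({
    "f_strong": x,                        # drives the target
    "f_noise": rng.normal(size=2000),     # unrelated to the target
    "target": 2.0 * x + rng.normal(scale=0.1, size=2000),
})
ranking = abs_corr_with_target(demo, ["f_strong", "f_noise"])
```

On the real data this would be called as `abs_corr_with_target(train, NUM_FEATURES)`, with `.head(10)` and `.tail(10)` giving the highest and lowest entries.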

[back to top](#table-of-contents)
<a id="6.2"></a>
## <span style="color:#FFC300;">6.2 P-value between features and target</span>
**reference**   https://www.kaggle.com/hasanbasriakcay/ubiquan-market-prediction-eda-ignore-cols

In [None]:
# reference https://www.kaggle.com/hasanbasriakcay/ubiquan-market-prediction-eda-ignore-cols
from scipy.stats import pearsonr

p_values_list = []
for c in NUM_FEATURES:
    p = round(pearsonr(train.loc[:,'target'], train.loc[:,c])[1], 4)
    p_values_list.append(p)

p_values_df = pd.DataFrame(p_values_list, columns=['target'], index=NUM_FEATURES)

def p_value_warning_background(cell_value):
    # Highlight p-values above the usual 0.05 significance level.
    highlight = 'background-color: #ffd514;'
    default = ''
    if cell_value > 0.05:
        return highlight
    return default

# Show only the features whose p-value exceeds 0.03.
p_values_df = p_values_df[p_values_df['target'] > 0.03]
p_values_df.style.applymap(p_value_warning_background)

[back to top](#table-of-contents)
<a id="6.3"></a>
## <span style="color:#FFC300;">6.3 Scatterplot between low correlation features and target</span>

In [None]:
targets = ['target', 'investment_id', 'time_id']

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(13, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(2, 5)
gs.update(wspace=0.2, hspace=0.5)

background_color = "#f6f5f5"

run_no = 0
for row in range(0, 2):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        locals()["ax"+str(run_no)].tick_params(axis='y', which=u'both',length=0)
        locals()["ax"+str(run_no)].set_yticklabels([])
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

run_no = 0
for col in corr.tail(10).index:
    sns.scatterplot(x=train[col], y=train['target'],ax=locals()["ax"+str(run_no)], color='#fcd12a', zorder=2)
    locals()["ax"+str(run_no)].set_ylabel('target')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=5, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=5, width=0.5, length=1.5)
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)
    run_no += 1
    
#ax0.text(-1.2, 0.44, 'Target Distribution', fontsize=8, fontweight='bold')
#ax0.text(-1.2, 0.40, 'Target variables are showing a lognormal distribution', fontsize=5)

plt.show();

[back to top](#table-of-contents)
<a id="6.4"></a>
## <span style="color:#FFC300;">6.4 Features Correlation Matrix</span>
**reference**  https://www.kaggle.com/igorkf/ubiquant-simple-linear-regression-baseline/notebook

In [None]:
%%time
# reference https://www.kaggle.com/igorkf/ubiquant-simple-linear-regression-baseline/notebook
train.sample(int(len(train) * 0.001), random_state=1)[FEATURES].corr().style.background_gradient(cmap='autumn',axis=None, vmin=-0.3, vmax=0.3, low=0.8, high=0.6) 

[back to top](#table-of-contents)
<a id="6.5"></a>
## <span style="color:#FFC300;">6.5 Features VIF</span>

In [None]:
# If this cell raises an error, you probably downcast the dataset to float16 to reduce memory. The linear algebra routines used here cannot work with float16, so cast the features back to float32 or float64 first.
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):
    # Compute the variance inflation factor of every column in X.
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif

vif_val = calc_vif(train[train['investment_id'].isin(inv_ids)][FEATURES])
vif_val.style.background_gradient(cmap='autumn',axis=0, vmin=2, vmax=10, low=0.8, high=0.6) 

In [None]:
multicorrel_val = vif_val[vif_val['VIF'] > 10]['variables'].to_list()

print('Multicollinear features (VIF > 10):')
print(multicorrel_val)
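
To make the VIF numbers easier to read, here is a minimal NumPy sketch of the textbook definition, VIF_i = 1 / (1 - R_i²), which equals the i-th diagonal element of the inverse correlation matrix. The `vif_from_correlation` helper and the `demo` frame are hypothetical names for illustration. Note that this version centers the data, while statsmodels' `variance_inflation_factor` does not add an intercept by itself, so the two can differ on uncentered features. The cast to float64 is there because `numpy.linalg` does not support float16.

```python
import numpy as np
import pandas as pd

def vif_from_correlation(X: pd.DataFrame) -> pd.Series:
    """VIF of each column, taken from the diagonal of the inverse correlation matrix."""
    # Cast up first: numpy.linalg routines do not accept float16 input.
    values = X.to_numpy(dtype=np.float64)
    corr_matrix = np.corrcoef(values, rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr_matrix)), index=X.columns, name="VIF")

# Synthetic example (assumption): "a_noisy_copy" is almost a duplicate of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=2_000)
b = rng.normal(size=2_000)
demo = pd.DataFrame({
    "a": a,
    "b": b,
    "a_noisy_copy": a + 0.1 * rng.normal(size=2_000),
}).astype(np.float16)  # mimic a memory-reduced dataset

vif_demo = vif_from_correlation(demo)
print(vif_demo)
```

The two nearly identical columns get a large VIF, while the independent column stays close to 1, which is the pattern the `VIF > 10` filter above is looking for.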

[back to top](#table-of-contents)
<a id="7"></a>
# <span style="color:#FFC300;">7. Counts of cat features</span>
Risk theory uses a measure called Value at Risk (VaR). The idea behind it is that the tails of a distribution are riskier than the values closer to the middle.  
Class imbalance is another big problem in this data. In my opinion, good models are the ones that remember the past and use stratified classes.   
As we can see, the data is widely scattered around time_id ~400.  
If you train a CatBoost baseline, its residuals will also be larger around time_id ~400.  
*Let me know if you have ideas on how to fix this. In my opinion, Bayesian models can predict this better than anything else, but good feature cleaning and feature engineering can do more.*
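
Assuming "VAR" above means Value at Risk, the simplest version is the historical one: take the left-tail quantile of the returns and report it as a loss. This is only a sketch on simulated returns. `historical_var` and `demo_returns` are hypothetical names, and nothing here is computed from the competition data.

```python
import numpy as np

def historical_var(returns, confidence=0.95):
    """Historical VaR: the loss that is exceeded with probability 1 - confidence."""
    returns = np.asarray(returns, dtype=np.float64)
    # The (1 - confidence) quantile is the left-tail cutoff; negate it to report a loss.
    return -np.quantile(returns, 1.0 - confidence)

# Simulated standard normal returns (assumption, for illustration only).
rng = np.random.default_rng(0)
demo_returns = rng.normal(loc=0.0, scale=1.0, size=100_000)

var_95 = historical_var(demo_returns, confidence=0.95)
print(f"95% historical VaR: {var_95:.3f}")
```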

In [None]:
print(f'{len(train.investment_id.unique())} investment identifiers are unique')
print(f'{len(train.time_id.unique())} time identifiers are unique')
print(f'{len(train.target.unique())} target values are unique')

In [None]:
color_real = '#1DBA94'
color_outliers = '#FFC300'

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(13, 10), facecolor='#f6f5f5')

investment_idx = random.choices(train.investment_id, k=9)

gs = fig.add_gridspec(3, 1)
gs.update(wspace=0.2, hspace=0.35)

background_color = "#f6f5f5"

run_no = 0

for col in range(0, 3):
    locals()["ax"+str(run_no)] = fig.add_subplot(gs[col, 0])
    locals()["ax"+str(run_no)].set_facecolor(background_color)
    locals()["ax"+str(run_no)].tick_params(axis='y', which=u'both',length=0)
    locals()["ax"+str(run_no)].set_yticklabels([])
    for s in ["top","right"]:
        locals()["ax"+str(run_no)].spines[s].set_visible(False)
    run_no += 1
        
        
sns.lineplot(y=train.groupby(['time_id'])['investment_id'].count(), x=train.groupby(['time_id'])['investment_id'].count().index, ax=ax0, color=color_outliers)
sns.lineplot(y=train.groupby(['time_id'])['target'].mean(), x=train.groupby(['time_id'])['target'].count().index, ax=ax1, color=color_real)
sns.histplot(x=train.groupby(['investment_id'])['target'].count(), ax=ax2, bins=50, color=color_outliers)

plt.show()

[back to top](#table-of-contents)
<a id="8"></a>
# <span style="color:#FFC300;">8. Quantile is all you need for analysis</span>
If you analyze the residuals of your model, you can see that it approximates the target badly outside roughly the 5% and 95% quantiles.  
In addition, some work has shown that this problem is related to the imbalance of investment_ids across time_ids.   
If you run a regular CatBoost baseline, you will probably run into this problem.   
Let's see how the features behave outside the 5% and 95% quantiles.  

**references:**
https://www.kaggle.com/lucamassaron/eda-target-analysis
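
The residual check described above can be sketched as follows: split the rows by whether the target lies inside or outside the 5% and 95% quantiles and compare the mean absolute residual of each group. Everything here is an assumption for illustration: `error_by_target_zone` is a hypothetical helper, and the "model" is just a prediction shrunk toward the mean, not a real CatBoost fit.

```python
import numpy as np
import pandas as pd

def error_by_target_zone(y_true, y_pred, lower=0.05, upper=0.95):
    """Mean absolute residual inside vs. outside the [lower, upper] target quantiles."""
    y_true = pd.Series(np.asarray(y_true, dtype=np.float64))
    residual = y_true - np.asarray(y_pred, dtype=np.float64)
    lo, hi = y_true.quantile([lower, upper])
    # Label each row by whether its target falls in the tails.
    zone = np.where((y_true < lo) | (y_true > hi), "outside", "inside")
    return residual.abs().groupby(zone).mean()

# Synthetic target and a stand-in prediction that shrinks toward zero (assumption).
rng = np.random.default_rng(1)
y_demo = rng.normal(size=10_000)
pred_demo = 0.3 * y_demo

zone_error = error_by_target_zone(y_demo, pred_demo)
print(zone_error)
```

With real predictions, the same function would take `train['target']` and the model output, assuming both are aligned row by row.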

### Target

In [None]:


threshold_right = train['target'].quantile(0.95)
threshold_left = train['target'].quantile(0.05)
#['#1DBA94','#1C5ED2', '#FFC300', '#C70039']
color_real = '#1DBA94'
color_outliers = '#FFC300'

print(f'==================   Statistics of target   ==================')
print(f'Max: {train["target"].max()}')
print(f'Mean: {train["target"].mean()}')
print(f'Median: {train["target"].median()}')
print(f'Min: {train["target"].min()}')
print(f'Std: {train["target"].std()}')
print(f'Skew: {train["target"].skew()}')  
print(f'Kurtosis: {train["target"].kurtosis()}')
print('===========================================================')

fig = plt.figure(figsize=(25, 8), facecolor='#f6f5f5')
gs = fig.add_gridspec(2, 2)
background_color = '#f6f5f5'
gs.update(wspace=0.2, hspace=0.5)
run_no = 0
for row in range(0, 2):
    for col in range(0, 2):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  

ax0.set_title(f'Distribution of target')                     
sns.histplot(train['target'], color=color_real, bins=50, ax=ax0)
ax1.set_title(f'Distribution of target outside quantiles')                     
sns.histplot(train[(train['target'] < threshold_left) | (train['target'] > threshold_right)]['target'], color=color_outliers, bins=50, ax=ax1)
time2target_mean_threshold = train[(train['target'] > threshold_left) & (train['target'] < threshold_right)].groupby('time_id')['target'].mean()
time2target_std_threshold = train[(train['target'] > threshold_left) & (train['target'] < threshold_right)].groupby('time_id')['target'].std()
time2target_mean = train.groupby('time_id')['target'].mean()
time2target_std = train.groupby('time_id')['target'].std()


ax2.set_title(f'Mean and Std changing by time, include quantiles')
ax2.fill_between(
        time2target_mean_threshold.index,
        time2target_mean_threshold - time2target_std_threshold,
        time2target_mean_threshold + time2target_std_threshold,
        alpha=0.2,
        color=color_outliers,
        label='inside quantile std')
ax2.fill_between(
        time2target_mean.index,
        time2target_mean - time2target_std,
        time2target_mean + time2target_std,
        alpha=0.2,
        color=color_real,
        label='real std')
ax2.plot(time2target_mean_threshold.index, time2target_mean_threshold, color=color_outliers, label='inside quantile mean')
ax2.plot(time2target_mean.index, time2target_mean, color=color_real, label='original mean')
ax2.legend(ncol=4, facecolor=background_color, edgecolor=background_color, loc='lower center')
ax3.remove()
plt.show()


### Features

In [None]:
#references: https://www.kaggle.com/lucamassaron/eda-target-analysis
#column = 'f_1'

# The thresholds and colors do not depend on the feature, so set them once before the loop.
threshold_right = train['target'].quantile(0.95)
threshold_left = train['target'].quantile(0.05)
#['#1DBA94','#1C5ED2', '#FFC300', '#C70039']
color_real = '#1DBA94'
color_outliers = '#FFC300'

for column in ['f_0', 'f_1', 'f_2', 'f_3', 'f_4']:

    print(f'==================   Statistics of {column}   ==================')
    print(f'Max: {train[column].max()}')
    print(f'Mean: {train[column].mean()}')
    print(f'Median: {train[column].median()}')
    print(f'Min: {train[column].min()}')
    print(f'Std: {train[column].std()}')
    print(f'Skew: {train[column].skew()}')  
    print(f'Kurtosis: {train[column].kurtosis()}')
    print('===========================================================')

    fig = plt.figure(figsize=(30, 10), facecolor='#f6f5f5')
    gs = fig.add_gridspec(3, 2)
    background_color = '#f6f5f5'
    gs.update(wspace=0.2, hspace=0.5)
    run_no = 0
    for row in range(0, 3):
        for col in range(0, 2):
            locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
            locals()["ax"+str(run_no)].set_facecolor(background_color)
            for s in ["top","right"]:
                locals()["ax"+str(run_no)].spines[s].set_visible(False)
            run_no += 1  

    ax0.set_title(f'Distribution of {column}')
    sns.histplot(train[column], bins=50, color=color_real, ax=ax0)

    ax1.set_title(f'Distribution of {column} outside quantiles')
    sns.histplot(train[(train['target'] < threshold_left) | (train['target'] > threshold_right)][column], color=color_outliers, ax=ax1, bins=50)

    ax2.set_title(f'Outliers of target by plot column-time_id')
    sns.scatterplot(y=train[column], x=train['time_id'], color=color_real, ax=ax2, edgecolor=None, label='origin')
    sns.scatterplot(y=train[(train['target'] < threshold_left) | (train['target'] > threshold_right)][column],x=train['time_id'], color=color_outliers, ax=ax2, edgecolor=None, label='outside quantile')
    ax2.legend(ncol=2, facecolor=background_color, edgecolor=background_color, loc='lower center')

    ax3.set_title(f'Outliers of target by plot column-investment_id')
    sns.scatterplot(y=train[column], x=train['investment_id'], color=color_real, ax=ax3, edgecolor=None, label='origin')
    sns.scatterplot(y=train[(train['target'] < threshold_left) | (train['target'] > threshold_right)][column],x=train['investment_id'], color=color_outliers, ax=ax3, edgecolor=None, label='outside quantile')
    ax3.legend(ncol=2, facecolor=background_color, edgecolor=background_color, loc='lower center')
    
    ax4.set_title(f'Outliers of target by scatter target-column')
    sns.scatterplot(y=train['target'], x=train[column], color=color_real, ax=ax4, edgecolor=None, label='origin')
    sns.scatterplot(y=train[(train['target'] < threshold_left) | (train['target'] > threshold_right)]['target'],x=train[column], color=color_outliers, ax=ax4, edgecolor=None, label='outside quantile')
    ax4.legend(ncol=2, facecolor=background_color, edgecolor=background_color, loc='lower center')

    time2column_mean_threshold = train[(train['target'] > threshold_left) & (train['target'] < threshold_right)].groupby('time_id')[column].mean()
    time2column_std_threshold = train[(train['target'] > threshold_left) & (train['target'] < threshold_right)].groupby('time_id')[column].std()
    time2column_mean = train.groupby('time_id')[column].mean()
    time2column_std = train.groupby('time_id')[column].std()


    ax5.set_title(f'Mean and Std changing by time, include quantiles')
    ax5.fill_between(
            time2column_mean_threshold.index,
            time2column_mean_threshold - time2column_std_threshold,
            time2column_mean_threshold + time2column_std_threshold,
            alpha=0.2,
            color=color_outliers,
            label='inside quantile std')
    ax5.fill_between(
            time2column_mean.index,
            time2column_mean - time2column_std,
            time2column_mean + time2column_std,
            alpha=0.2,
            color=color_real,
            label='original std')
    ax5.plot(time2column_mean_threshold.index, time2column_mean_threshold, color=color_outliers, label='inside quantile mean')
    ax5.plot(time2column_mean.index, time2column_mean, color=color_real, label='original mean')
    ax5.legend(ncol=4, facecolor=background_color, edgecolor=background_color, loc='lower center')

    plt.show()
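
The cell above rebuilds the same "target outside the quantiles" filter for every plot. One way to avoid this is to compute the boolean mask once and reuse it. This is a minimal sketch: `outside_quantiles` is a hypothetical helper, and the `demo` frame is a toy example rather than the training data.

```python
import pandas as pd

def outside_quantiles(series, lower=0.05, upper=0.95):
    """Boolean mask: True where the value lies strictly outside the [lower, upper] quantiles."""
    lo = series.quantile(lower)
    hi = series.quantile(upper)
    return (series < lo) | (series > hi)

# Toy frame (assumption): the target runs 0..99 and f_0 runs in the opposite direction.
demo = pd.DataFrame({"target": range(100), "f_0": range(99, -1, -1)})

tails = outside_quantiles(demo["target"])
tail_features = demo.loc[tails, "f_0"].tolist()
print(tail_features)
```

In the loop above, the mask would be built once as `outside_quantiles(train['target'])` and then used as `train.loc[mask, column]` for each feature.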

## **IN PROGRESS**
I update this analysis frequently.

references:

https://www.kaggle.com/igorkf/ubiquant-simple-linear-regression-baseline/notebook  
https://www.kaggle.com/lucamassaron/eda-target-analysis  
https://www.kaggle.com/valleyzw/ubiquant-lgbm-baseline  
https://www.kaggle.com/usharengaraju/tensorflow-probability-probabilisticbnn