# Introduction
Regardless of your investment strategy, fluctuations are expected in the financial market. Despite this variance, professional investors try to estimate their overall returns. Risks and returns differ based on investment types and other factors, which impact stability and volatility. To attempt to predict returns, there are many computer-based algorithms and models for financial market trading. Yet, with new techniques and approaches, data science could improve quantitative researchers' ability to forecast an investment's return.



Ubiquant Investment (Beijing) Co., Ltd is a leading domestic quantitative hedge fund based in China. Established in 2012, they rely on international talents in math and computer science along with cutting-edge technology to drive quantitative financial market investment. Overall, Ubiquant is committed to creating long-term stable returns for investors.

In this competition, you’ll build a model that forecasts an investment's return rate. Train and test your algorithm on historical prices. Top entries will solve this real-world data science problem with as much accuracy as possible.

If successful, you could improve the ability of quantitative researchers to forecast returns. This will enable investors at any scale to make better decisions. You may even discover you have a knack for financial datasets, opening up a world of new opportunities in many industries.

## Contents

1. Load data
2. General information
3. Analysis of the target
4. Analysis of the features
5. Investments

Credit: this analysis is inspired by the seminal [notebook](https://www.kaggle.com/carlmcbrideellis/jane-street-eda-of-day-0-and-feature-importance)

In [None]:
# numpy
import numpy as np
from scipy.stats import pearsonr as p
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set(rc={'figure.figsize':(15,13)})

import datatable as dt

# garbage collector to keep RAM in check
import gc  

# system
import warnings
warnings.filterwarnings('ignore')

# 1. Load data
#### The train.csv file is large: 18.557 GB with 3141411 lines (the `wc -l` count below includes the header row)

In [None]:
%%time
!wc -l ../input/ubiquant-market-prediction/train.csv

#### We should not load this file with **pandas** directly. Instead, we use **datatable** to avoid out-of-memory (OOM) issues and speed up loading, then convert the loaded data to a pandas DataFrame.
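After the conversion to pandas, memory can be reduced further by storing floats in 32 bits. The following is a minimal sketch, assuming the feature columns arrive as float64 and that float32 precision is acceptable for this analysis. `downcast_floats` is a hypothetical helper, and the demo frame is synthetic rather than the real file.

```python
import numpy as np
import pandas as pd

def downcast_floats(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every float64 column stored as float32."""
    float_cols = df.select_dtypes(include="float64").columns
    return df.astype({col: np.float32 for col in float_cols})

# Tiny synthetic stand-in for train_data (hypothetical values)
demo = pd.DataFrame({
    "investment_id": [0, 1, 2],
    "f_0": [0.1, 0.2, 0.3],
    "target": [-0.5, 0.0, 0.5],
})
small = downcast_floats(demo)
print(demo.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

Integer columns such as `investment_id` are left untouched, so they can still be used as keys.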

In [None]:
%%time
train_data = dt.fread('../input/ubiquant-market-prediction/train.csv')

In [None]:
%%time
train_data = train_data.to_pandas()

# 2. General information

#### There are 304 columns in the train dataset including *row_id*, *time_id*, *investment_id*, *target* and 300 features

In [None]:
print(train_data.columns)

#### There are 3579 investments indexed in the range from 0 to 3773, i.e. not all the indexes in this range are used.
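To see which indexes are skipped, the used ids can be compared against the full integer range. This is a small sketch on a synthetic Series; `unused_ids` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd

def unused_ids(ids: pd.Series) -> list:
    """Return the integers between min(ids) and max(ids) that never occur in ids."""
    used = set(ids.unique().tolist())
    return [i for i in range(int(ids.min()), int(ids.max()) + 1) if i not in used]

# Synthetic example: ids 2, 4 and 5 are missing from the range 0..6
demo_ids = pd.Series([0, 1, 1, 3, 6])
print(unused_ids(demo_ids))  # [2, 4, 5]
```

On the real data this would be called as `unused_ids(train_data['investment_id'])`.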

In [None]:
train_data['investment_id'].nunique()

In [None]:
print(f'Investment index is in the range from {train_data.investment_id.min()} to {train_data.investment_id.max()}')

# 3. The target: investment return rate (IRR)

#### IRR varies randomly within the range (-10, 10). Its spread seems to correlate with the number of assets, which increases roughly linearly with time. This trend may continue for the next 200 time_ids.
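The visual impression can be checked numerically by computing, for each time_id, the standard deviation of the target and the number of distinct assets. This is a sketch on synthetic data; `per_time_stats` is a hypothetical helper, and only the column names `time_id`, `investment_id` and `target` come from the dataset.

```python
import pandas as pd

def per_time_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Per time_id: standard deviation of the target and number of distinct assets."""
    grouped = df.groupby("time_id")
    return pd.DataFrame({
        "target_std": grouped["target"].std(),
        "n_assets": grouped["investment_id"].nunique(),
    })

# Synthetic stand-in for train_data
demo = pd.DataFrame({
    "time_id": [0, 0, 1, 1, 1],
    "investment_id": [0, 1, 0, 1, 2],
    "target": [-1.0, 1.0, -2.0, 0.0, 2.0],
})
stats = per_time_stats(demo)
print(stats)
```

On the full data, `stats['target_std'].corr(stats['n_assets'])` would give a single number for how closely the two quantities move together.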

In [None]:
plt.figure(figsize=(20,10))
plt.subplot(2,1,1)
plt.plot(train_data['time_id'], train_data['target'], color='black', lw=0.1)
plt.ylabel (f'Target', fontsize=18);
plt.xticks([])
plt.tight_layout()

plt.subplot(2,1,2)
plt.plot(train_data.groupby('time_id')['investment_id'].nunique(), color='black', lw=1)
plt.plot(train_data.groupby('time_id')['investment_id'].nunique().rolling(30).mean(), color='red', lw=2)
plt.ylabel (f'Number of assets', fontsize=18);
plt.xlabel ('Time_id', fontsize=18)
plt.tight_layout()

plt.show()
gc.collect()

In [None]:
plt.figure(figsize=(20,10))
plt.subplot(2,1,1)
plt.plot(train_data['investment_id'], train_data['target'], color='gray', lw=0.1)
plt.plot(train_data.groupby('investment_id')['target'].mean(), color='red', lw=1)
plt.plot(train_data.groupby('investment_id')['target'].std(), color='blue', lw=1)
plt.ylabel (f'Target', fontsize=18);
plt.xticks([])
plt.tight_layout()

plt.subplot(2,1,2)
plt.plot(train_data.groupby('investment_id')['time_id'].nunique(), color='black', lw=1)
plt.ylabel (f'Number of time_id', fontsize=18);
plt.xlabel ('Assets', fontsize=18)
plt.tight_layout()

plt.show()
gc.collect()

In [None]:
plt.figure(figsize=(20,5))
plt.plot(train_data.groupby('investment_id')['target'].mean(), color='red', lw=1)
plt.ylabel (f'Mean of target over each asset', fontsize=18);
plt.xlabel (f'Assets', fontsize=18);
plt.xticks([])
plt.tight_layout()
plt.show()
gc.collect()

In [None]:
plt.figure(figsize = (12,5))
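# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest equivalent
ax = sns.histplot(train_data['target'], bins=1000, stat='density', kde=True)
plt.xlim(-3,3)
plt.title("Histogram of the IRR values", size=18)
plt.xlabel("Target", size=18)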
plt.show();
gc.collect();

In [None]:
print(f'Skew and kurtosis of the target are: {train_data.target.skew()} and {train_data.target.kurtosis()}')

#### Assuming a uniform investment (all investments have the same weight), the overall investment is at a loss ^_^
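One possible reading of "uniform investment" is an equal-weighted portfolio: average the target over all assets at each time_id, then accumulate over time. The sketch below follows that reading, which is an assumption and differs from summing the target over all rows, since time_ids with more assets would otherwise weigh more. `equal_weight_cumulative` is a hypothetical helper, shown on synthetic data.

```python
import pandas as pd

def equal_weight_cumulative(df: pd.DataFrame) -> pd.Series:
    """Cumulative sum over time of the mean target across assets at each time_id."""
    per_time = df.groupby("time_id")["target"].mean()  # groupby sorts by time_id
    return per_time.cumsum()

# Synthetic stand-in for train_data
demo = pd.DataFrame({
    "time_id": [0, 0, 1, 1],
    "investment_id": [0, 1, 0, 1],
    "target": [1.0, -3.0, 2.0, 0.0],
})
print(equal_weight_cumulative(demo).tolist())  # [-1.0, 0.0]
```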

In [None]:
plt.figure(figsize=(20,5))
plt.title ('Cumulative net return', fontsize=18)
plt.plot(train_data['time_id'], train_data['target'].cumsum(), color='green', lw=2);
plt.ylabel (f'Overall return', fontsize=18);
plt.xlabel ('Time_id', fontsize=18)
plt.show()

#### Some individual investments have positive returns

In [None]:
plt.figure(figsize=(20,20))
for i in range(5):
    plt.subplot(5,1,i+1)
    cumReturn = train_data.loc[train_data['investment_id']==i,'target'].cumsum()
    time_id = train_data.loc[train_data['investment_id']==i,'time_id']
    plt.plot(time_id, cumReturn, color='green', lw=2);
    plt.ylabel (f'Return {i}', fontsize=18);

plt.xlabel ('Time_id', fontsize=18)
del cumReturn, time_id
gc.collect();

# 4. Features

#### Cumulative features of some investments

In [None]:
plt.figure(figsize=(20,20))
for i in range(5):
    plt.subplot(5,1,i+1)
    # select the rows of investment i once instead of once per feature
    mask = train_data['investment_id']==i
    time_id = train_data.loc[mask,'time_id']
    for j in range(300):
        feature = train_data.loc[mask,f'f_{j}'].cumsum()
        plt.plot(time_id, feature, color='gray', lw=0.1);
    plt.ylabel (f'Invest {i}', fontsize=18);

plt.xlabel ('Time_id', fontsize=18)
del feature, time_id
gc.collect();

#### Correlation between the first 100 features

In [None]:
%%time
sns.heatmap(train_data[[f'f_{i}' for i in range(100)]].corr());

#### Correlation between features and target
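The per-feature correlation can also be computed in a single call with `DataFrame.corrwith`, which returns a Series indexed by feature name. This is a sketch on a synthetic frame with two made-up features, one perfectly aligned with the target and one perfectly opposed.

```python
import pandas as pd

# Synthetic stand-in for train_data with two features
demo = pd.DataFrame({
    "f_0": [1.0, 2.0, 3.0, 4.0],
    "f_1": [4.0, 3.0, 2.0, 1.0],
    "target": [2.0, 4.0, 6.0, 8.0],
})
feature_cols = ["f_0", "f_1"]

# Pearson correlation of every feature column with the target, in one call
corr_series = demo[feature_cols].corrwith(demo["target"])
print(corr_series)
```

Keeping the result as a Series preserves the feature names, which makes later lookups less error-prone than positional indexes.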

In [None]:
corr = []
for i in range(300):
    corr.append( train_data['target'].corr(train_data[f'f_{i}']) )

In [None]:
plt.figure(figsize=(10,7))
plt.plot(corr, 'k')
plt.xlabel('Feature index', fontsize=16)
plt.ylabel('Correlation coefficient', fontsize=16)
plt.title('Correlation between target and features', fontsize=18)
plt.show()

#### Feature with the highest (most positive) correlation with the target
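Taking the maximum finds the most positive coefficient, so a strongly negative feature would be missed. If the goal is the strongest relationship in either direction, the absolute value can be ranked instead. This is a sketch with made-up coefficients; `strongest_feature` is a hypothetical helper.

```python
import numpy as np

def strongest_feature(corr) -> int:
    """Index of the correlation with the largest absolute value."""
    return int(np.argmax(np.abs(corr)))

# Made-up coefficients: index 1 is the strongest, index 2 the most positive
demo_corr = [0.10, -0.50, 0.30]
print(strongest_feature(demo_corr))     # 1
print(demo_corr.index(max(demo_corr)))  # 2
```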

In [None]:
corr.index(max(corr))

#### Most of the cumulative features correlate very strongly with the cumulative target!!!
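High correlations between cumulative sums should be read with care: any two series with a nonzero mean accumulate into trends, and trends correlate strongly even when the underlying values are unrelated. The sketch below illustrates this with two independent synthetic noise series. The drift of 0.5 and the series length are arbitrary assumptions, and nothing here uses the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Two independent noise series that share only a nonzero mean (assumed drift of 0.5)
a = pd.Series(0.5 + rng.standard_normal(n))
b = pd.Series(0.5 + rng.standard_normal(n))

raw_corr = a.corr(b)                     # correlation of the raw values
cum_corr = a.cumsum().corr(b.cumsum())   # correlation of the cumulative sums
print(f"raw: {raw_corr:.3f}, cumulative: {cum_corr:.3f}")
```

The raw correlation stays close to zero while the cumulative one is close to one, so a high cumulative correlation does not by itself indicate a predictive feature.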

In [None]:
corr = []
for i in range(300):
    corr.append( train_data['target'].cumsum().corr(train_data[f'f_{i}'].cumsum()) )

In [None]:
plt.figure(figsize=(10,7))
plt.plot(corr, 'k.')
plt.xlabel('Feature index', fontsize=16)
plt.ylabel('Correlation coefficient', fontsize=16)
plt.title('Correlation between cumulative target and cumulative features', fontsize=18)
plt.show()

#### 202 of the 300 cumulative features have an absolute correlation coefficient above 0.95 with the cumulative target

In [None]:
len([idx for idx, elem in enumerate(corr) if abs(elem)>0.95])

# 5. Investments

#### Many investments appear in only a few time ids. They may not be relevant for training.
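If such investments are to be excluded, rows can be filtered by how often their investment_id occurs. This is a sketch on synthetic data; `drop_sparse_investments` and the `min_obs` parameter are hypothetical names, and the right cut-off is a judgment call (the cell further down lists investments with fewer than 200 rows).

```python
import pandas as pd

def drop_sparse_investments(df: pd.DataFrame, min_obs: int) -> pd.DataFrame:
    """Keep only rows whose investment_id occurs at least min_obs times."""
    counts = df["investment_id"].map(df["investment_id"].value_counts())
    return df[counts >= min_obs]

# Synthetic stand-in: investment 1 appears only once
demo = pd.DataFrame({
    "investment_id": [0, 0, 0, 1, 2, 2],
    "time_id": [0, 1, 2, 0, 1, 2],
})
kept = drop_sparse_investments(demo, min_obs=2)
print(sorted(kept["investment_id"].unique().tolist()))  # [0, 2]
```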

In [None]:
groups = train_data.groupby('investment_id').size()

plt.plot(groups, 'k.')
plt.xlabel('Investments')
plt.ylabel('Number of time ids')
plt.show()

In [None]:
# use the groupby index (the actual investment_id), not the positional index from enumerate
groups[groups < 200].index.tolist()

In [None]:
groups = train_data.groupby('investment_id')['time_id'].max()

plt.plot(groups, 'k.')
plt.xlabel('Investments')
plt.ylabel('Max time ids')
plt.show()

In [None]:
groups = train_data.groupby('investment_id')['time_id'].min()

plt.plot(groups, 'k.')
plt.xlabel('Investments')
plt.ylabel('Min time ids')
plt.show()

## Good luck!