<h1 id="heading">
    <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#4fc5ad;
           font-size:100%;
           font-family:Verdana;
           letter-spacing:0.5px">
1. 📝 Introduction
<a class="anchor-link" href="https://www.kaggle.com/fangya/ubiquant-investment">¶</a>
</h1>
    
Hedge funds use quantitative trading strategies to buy and sell massive amounts of stock and other securities. Quantitative trading strategies rely heavily on mathematical modeling to identify and capture trading opportunities.

The Ubiquant Investment dataset provides more than 3 million trading records with 300 features. In this notebook, we present the fundamental concepts of quantitative trading, explore the data (EDA), investigate the relationship between the target and selected features, and build a model to predict the target value.

We hope individual investors can benefit from this analysis. We don't need to do what financial institutions do, but we will feel better about our money once we understand the concepts behind quant trading!

*Note: This is a mini example. Because of the running time on this huge dataset, we use only part of the data to perform the tasks!*
    
<img src="https://media-exp1.licdn.com/dms/image/C510BAQHlae3R4MzWBA/company-logo_200_200/0/1554264099251?e=2147483647&v=beta&t=DFzivV7lsTKKuViuDNFhpJIo_S857Bt743avpsDFES8"  width="200" align="center">


<h2 >
    <div style="color:#3482a4;
           display:fill;
           border-radius:2px;
           font-size:100%;
           font-family:Verdana">
Project Outline
</h2>

1. EDA
   - Overall Histogram 
   - Correlation plot
   - Target variation
   
    
2. Feature Selection
   - Regular Feature 
   - LightGBM
    
   
3. Modeling
   - Linear Regression
   - CNN


<h2 >
    <div style="color:#3482a4;
           display:fill;
           border-radius:2px;
           font-size:100%;
           font-family:Verdana">
General Trading Steps
</h2>


> A good trading system is easy to build when we have the correct information and a solid mathematical model, and we execute at the right time.

For example, if we know the Pfizer vaccine will work, the model computes how many shares we should buy based on our cash flow.
The challenge is to identify reliable news, build the correct model, and execute without letting personal feelings interfere.

1. **Info**

2. **Modeling** 

3. **Execution** 
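
To make the "how many shares should we buy" step concrete, here is a minimal sketch of a position-sizing rule. The function name `shares_to_buy` and the `risk_fraction` parameter are hypothetical and only illustrate the idea of tying the order size to available cash; a real model would be far more involved.

```python
import math


def shares_to_buy(cash, price, risk_fraction=0.25):
    """Hypothetical fixed-fraction rule: commit at most `risk_fraction` of the cash."""
    if price <= 0:
        raise ValueError("price must be positive")
    budget = cash * risk_fraction
    # Round down because we cannot buy a fraction of a share in this toy setup.
    return math.floor(budget / price)


# Assumed numbers, chosen only for illustration.
n_shares = shares_to_buy(cash=10_000, price=40.0, risk_fraction=0.25)
print(f"Shares to buy: {n_shares}")
```

The rule is deliberately mechanical: once the information and the model say "buy", the size of the order follows from the numbers rather than from how we feel that day.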

<h2 >
    <div style="color:#3482a4;
           display:fill;
           border-radius:5px;
           font-size:100%;
           font-family:Verdana">
Quantitative Trading Key Concepts
</h2>

- Construct mathematical models that automate trading decisions and eliminate emotional decision-making

- Backtest strategies on historical data under various scenarios to help identify opportunities for profit

- High-frequency trading (HFT) is profitable only while other market participants do not know the strategy and market conditions have not changed.
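
The backtesting idea above can be sketched on synthetic data. This is a toy illustration, not a strategy used anywhere in this notebook: the random-walk prices and the moving-average crossover rule (5-day vs 20-day) are assumptions chosen only to show the mechanics.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic daily prices (a random walk); this is not real market data.
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))

# Hypothetical rule: hold the asset when the 5-day mean is above the 20-day mean.
fast = prices.rolling(5).mean()
slow = prices.rolling(20).mean()
signal = (fast > slow).astype(int)

# Shift the signal by one day so each position only uses past information.
position = signal.shift(1).fillna(0)
daily_return = prices.pct_change().fillna(0)
strategy_return = position * daily_return

# Compound the daily returns to get the growth of one unit of capital.
cumulative = (1 + strategy_return).cumprod()
print(f"Buy-and-hold return: {prices.iloc[-1] / prices.iloc[0] - 1:.2%}")
print(f"Strategy return:     {cumulative.iloc[-1] - 1:.2%}")
```

The one-day shift is the important detail: without it the backtest would trade on information it could not have had at the time, and the result would look better than it should.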



In [None]:
import pandas as pd
import numpy as np
import gc
import matplotlib.pyplot as plt
import seaborn as sns
import math
import lightgbm
from sklearn.model_selection import StratifiedKFold 
from lightgbm import LGBMRegressor
from sklearn.linear_model import LinearRegression
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

<h1 id="h2">
    <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#4fc5ad;
           font-size:100%;
           font-family:Verdana;
           letter-spacing:0.5px">
2.🎲  Data Overview
<a class="anchor-link" href="https://www.kaggle.com/fangya/ubiquant-investment">¶</a>

</h1>
    
 ### Color Scheme   

In [None]:
sns.color_palette("mako_r",10)

### Read in the data by **parquet**

In [None]:
train = pd.read_parquet('../input/ubiquant-parquet/train.parquet',engine="pyarrow") 
train.head()

<h2 >
    <div style="color:#3482a4;
           display:fill;
           border-radius:2px;
           font-size:100%;
           font-family:Verdana">
Histogram
</h2>

### Histogram for Investment ID

In [None]:
def histg(data, color , label):
    fig, ax = plt.subplots(1, 1, figsize=(12, 6))
    data.plot.hist(bins=60, color=color)
    plt.title(label)
    plt.show()

In [None]:
obs_by_asset = train.groupby(['investment_id'])['target'].count()
histg(obs_by_asset,"#c6ebd1", "Investment ID Record Distribution")

In [None]:
mean_target = train.groupby(['investment_id'])['target'].mean()
mean_mean_target = np.mean(mean_target)
histg(mean_target,"#8bdab2", "Mean Investment_Id Target Distribution")

In [None]:
sts_target = train.groupby(['investment_id'])['target'].std()
mean_std_target = np.mean(sts_target)
histg(sts_target,"#4fc5ad", "SD for Investment_Id and Target Distribution")

In [None]:
fig, axes = plt.subplots(1, 5, figsize=(10,2.5), dpi=100, sharex=True, sharey=True)
#colors = ['tab:green', 'tab:blue', 'tab:pink', 'tab:red', 'tab:purple']
colors = ["#c6ebd1","#8bdab2","#4fc5ad","#38aaac","#348fa7"]
for i, (ax, investment_id) in enumerate(zip(axes.flatten(), np.random.choice(train["investment_id"].unique(), 5, replace=False))):
    x = train.loc[train.investment_id==investment_id, "target"]
    ax.hist(x, alpha=0.5, bins=55, density=True, stacked=True, label=str(investment_id), color=colors[i])
    ax.set_title(investment_id)
plt.suptitle('Target Histogram of Different Investment ID', y=1.05, size=16)  

#### Record for unique Investment ID by time

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
train.groupby('time_id')['investment_id'].nunique().plot(color="#38aaac")
plt.title("Number of Unique Assets by time")
plt.show()

#### Number of Records per Investment_ID 

In [None]:
train['investment_id'].value_counts().plot(kind = 'bar',figsize = (12,6), color=("#8bdab2","#2e1e3b"))
plt.title("Record Number Count for Investment Id")
plt.show()

### Top 10 Investment Id with most Record Numbers

In [None]:
train['investment_id'].value_counts().nlargest(10).plot(kind = 'bar',figsize = (10,5), 
                        color=("#c6ebd1","#8bdab2","#4fc5ad", "#38aaac","#359caa","#3482a4",
                            "#37659e","#40498e","#413d7b","#37284f","#241628"))
plt.title("Largest 10 Record Count Ubiquant Investment ID ")
plt.show()

### Top 10 Investment Id with least Record Numbers

In [None]:
train['investment_id'].value_counts().nsmallest(10).plot(kind = 'bar',figsize = (10,5), 
                        color=("#abe2be","#68d1ad", "#40b7ad", "#359caa","#3482a4",
                             "#37659e","#40498e","#3d3164","#2e1e3b","#180d16"))
plt.title(" 10 Smallest Record Count Ubiquant Investment ID ")
plt.show()

<h2 >
    <div style="color:#3482a4;
           display:fill;
           border-radius:2px;
           font-size:100%;
           font-family:Verdana">
Investment ID for Target Value and Time-Id
</h2>
    
We can see that investment IDs with more records show higher volatility in the target, which suggests more opportunities to trade profitably. For example, the target for ID 1415 is nearly a straight line, so it offers little opportunity for investment.
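
The visual impression can also be checked numerically. The sketch below is one possible way to do it, run on a small synthetic frame so that it is self-contained; the helper name `count_vs_volatility` is hypothetical, and on the real data you would pass `train` instead of `demo`.

```python
import numpy as np
import pandas as pd


def count_vs_volatility(frame):
    """Per investment_id: number of records and standard deviation of the target."""
    stats = frame.groupby("investment_id")["target"].agg(
        n_records="count", target_std="std"
    )
    # IDs with a single record have no defined standard deviation.
    return stats.dropna()


# Synthetic stand-in with the same two columns as the training data.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "investment_id": rng.integers(0, 50, size=5000),
    "target": rng.normal(0, 1, size=5000),
})

stats = count_vs_volatility(demo)
# Spearman correlation computed as the Pearson correlation of the ranks.
rank_corr = stats["n_records"].rank().corr(stats["target_std"].rank())
print(stats.head())
print(f"Rank correlation between record count and target std: {rank_corr:.3f}")
```

A clearly positive rank correlation on the real data would support the claim; a value near zero would suggest the plots for the selected IDs are not representative.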

In [None]:
array=[2140,2385,1062,1144,2727,194,2780,509,2406,952,1415,2800,3662,85,905]
s_id=train.loc[train['investment_id'].isin(array)]

In [None]:
def data_id(id1):
    df=s_id[s_id["investment_id"]==id1].set_index("time_id")
    return df

id1=data_id(2140)
id2=data_id(2385)
id3=data_id(1062)
id4=data_id(1144)
id5=data_id(2727)
id6=data_id(1415)
id7=data_id(2800)
id8=data_id(3662)
id9=data_id(85)
id10=data_id(905)

In [None]:
# Plot the target value for top investment id
f= plt.figure(figsize=(8,16))  

def gplot(no, data, color):
    ax=f.add_subplot(no)
    plt.plot(data["target"], label=str(data["investment_id"].iloc[0]), color=color)
    plt.legend()
    plt.xlabel("Time_id")
    plt.ylabel("Target Value")
    return plt

gplot(no=511, data=id1, color="#c6ebd1")
gplot(no=512, data=id2, color="#8bdab2")
gplot(no=513, data=id3, color="#4fc5ad")
gplot(no=514, data=id4, color="#38aaac")
gplot(no=515, data=id5, color="#359caa")
plt.suptitle('Target vs Time_id for Top 5  Investment ID', y=1, size=16) 
plt.tight_layout()
plt.show()

In [None]:
# Plot the target value for smallest investment id
f= plt.figure(figsize=(8,16))  

def gplot(no, data, color):
    ax=f.add_subplot(no)
    plt.plot(data["target"], label=str(data["investment_id"].iloc[0]), color=color)
    plt.legend()
    plt.xlabel("Time_id")
    plt.ylabel("Target Value")
    return plt

gplot(no=511, data=id6, color="#359caa")
gplot(no=512, data=id7, color="#3482a4")
gplot(no=513, data=id8, color="#37659e")
gplot(no=514, data=id9, color="#40498e")
gplot(no=515, data=id10, color="#3d3164")
plt.suptitle('Target vs Time_id for 5 Smallest  Investment ID', y=1, size=16) 
plt.tight_layout()
plt.show()

<h2 >
    <div style="color:#3482a4;
           display:fill;
           border-radius:2px;
           font-size:100%;
           font-family:Verdana">
Mini Correlation Plot
</h2>

We will use 0.01% of the data to generate a correlation plot.

In [None]:
data_types_dict = {
    'time_id': 'int32',
    'investment_id': 'int16',
    "target": 'float16',
}

features = [f'f_{i}' for i in range(300)]

for f in features:
    data_types_dict[f] = 'float16'
    
target = 'target'

In [None]:
sample_corr=train.sample(frac=0.0001, random_state=18)
correlation = sample_corr[[target] + features].corr()
sns.clustermap(correlation, figsize=(20, 20), cmap="mako_r")

<h1 id="h3">
    <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#4fc5ad;
           font-size:100%;
           font-family:Verdana;
           letter-spacing:0.5px">
3.✨  Mini Ubiquant Feature Selection
<a class="anchor-link" href="https://www.kaggle.com/fangya/ubiquant-investment">¶</a>
 </h1>


Our kernel kept restarting because of insufficient computing power.  
Therefore, we decided to use **1% of the Ubiquant data** for a mini analysis.  

In [None]:
#seed
s_train=train.sample(frac=0.01, random_state=8)
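
Besides sampling, another way to ease the pressure on the kernel is to store columns in smaller dtypes. The `data_types_dict` in this notebook lists such dtypes but is never applied, so here is a minimal sketch of what applying a dtype map could look like. The frame `demo` is synthetic (two features instead of 300), and keep in mind that `float16` trades precision for memory.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a slice of the training data.
rng = np.random.default_rng(2)
n_rows = 10_000
demo = pd.DataFrame({
    "time_id": rng.integers(0, 1200, size=n_rows),
    "investment_id": rng.integers(0, 3700, size=n_rows),
    "target": rng.normal(size=n_rows),
    "f_0": rng.normal(size=n_rows),
    "f_1": rng.normal(size=n_rows),
})

# Same dtype choices as the notebook's data_types_dict.
dtype_map = {"time_id": "int32", "investment_id": "int16", "target": "float16"}
for col in ["f_0", "f_1"]:
    dtype_map[col] = "float16"

before = demo.memory_usage(deep=True).sum()
small = demo.astype(dtype_map)
after = small.memory_usage(deep=True).sum()
print(f"Memory before: {before / 1024:.0f} KiB, after: {after / 1024:.0f} KiB")
```

The default 64-bit columns shrink to 16 or 32 bits each, which is where most of the saving comes from.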

<h2 >
    <div style="color:#3482a4;
           display:fill;
           border-radius:2px;
           font-size:100%;
           font-family:Verdana">
Correlation Feature Importance
</h2>
    
One of the challenges in this project is that the features are anonymized, so they are hard to interpret even once we have figured out which ones matter most.
    
I looked up some example features that the financial industry considers critical.

<div style="color:black;
           display:fill;
           border-radius:5px;
           background-color:#abe2be;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 12px;
              color:black;">
📌 Two Sigma ML Model Top Features
    
>    1. Relative Employment Size:  this can be viewed as a proxy of the stock’s sensitivity to wage                                   inflation
    
>    2. Industry: there are some sectors (and industries) that are more sensitive to inflation than                   others

>    3. Dividend Payout Ratio: value stocks with nearer-term cash flows are less impacted by rising                               inflation

>    4. EPS growth

>    5. Revenue per Employee

>    6. Stability of Operating Cash Flow

>    7. Return on Asset

>    8. Profit Margin : perhaps companies with high profit margins have a better chance of surviving                                         in High Inflation periods
</p>
</div>


In [None]:
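# Number of records per investment_id in the sample
obs_by_asset = s_train.groupby(['investment_id'])['target'].count().to_dict()
# Note: despite its name, `target` here holds each row's record count for its investment_id,
# so the correlations below measure association with record count, not with the target column
target = s_train.investment_id.map(obs_by_asset).astype(np.int16)
features = s_train.columns[4:]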

del(obs_by_asset)

In [None]:
corrs = list()
for col in features:
    corr = np.corrcoef(target, s_train[col])[0][1]
    corrs.append(corr)
    
del(target)

In [None]:
feat_importances = pd.Series(corrs, index=features)
feat_importances.nlargest(20).plot(kind='barh', figsize=(12, 6),
                                   color=("#c6ebd1","#abe2be","#8bdab2","#68d1ad","#4fc5ad",
                                          "#40b7ad","#38aaac","#359caa","#348fa7","#3482a4",
                                           "#3573a1","#37659e","#3b5799","#40498e","#413d7b",
                                         "#3d3164","#37284f","#2e1e3b","#241628","#180d16")).invert_yaxis()
plt.title("Top 20 Most Important Features for Mini Ubiquant Investment")
plt.show()
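
A related view is to rank features by their correlation with the `target` column itself, using absolute values so that strong negative relationships are not dropped. The sketch below runs on synthetic data to stay self-contained; the coefficients used to build the synthetic target are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 2000
demo = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"f_{i}" for i in range(5)])
# Synthetic target: f_1 pushes it up, f_3 pushes it down, everything else is noise.
demo["target"] = 0.5 * demo["f_1"] - 0.8 * demo["f_3"] + rng.normal(size=n)

feature_cols = [c for c in demo.columns if c.startswith("f_")]
# Pearson correlation of every feature column with the target.
corr_with_target = demo[feature_cols].corrwith(demo["target"])
# Order by magnitude but keep the sign in the reported values.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

On the notebook's sample, the same pattern would be `s_train[features].corrwith(s_train["target"])`, assuming `features` holds the `f_*` column names.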

<h2 >
    <div style="color:#3482a4;
           display:fill;
           border-radius:2px;
           font-size:100%;
           font-family:Verdana">
Lgbm Feature Importance 
</h2>

In [None]:
data_types_dict = {
    'time_id': 'int32',
    'investment_id': 'int16',
    "target": 'float16',
}

features = [f'f_{i}' for i in range(300)]

for f in features:
    data_types_dict[f] = 'float16'
    
target = 'target'

In [None]:
seed = 8
folds = 5
models = []

skf = StratifiedKFold(folds, shuffle = True, random_state = seed)

for train_index, test_index in skf.split(s_train, s_train['investment_id']):
    train1 = s_train.iloc[train_index]
    valid1 = s_train.iloc[test_index]
    
    lgbm = LGBMRegressor(
        num_leaves=2 ** np.random.randint(3, 8),
        learning_rate = 10 ** (-np.random.uniform(0.1,2)),
        n_estimators = 1000,
        min_child_samples = 1000, 
        subsample=np.random.uniform(0.5,1.0), 
        subsample_freq=1,
        n_jobs= -1
    )

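    # early_stopping_rounds was removed from fit() in LightGBM 4.0; the callback works in 3.x and 4.x
    lgbm.fit(train1[features], train1[target],
             eval_set=[(valid1[features], valid1[target])],
             callbacks=[lightgbm.early_stopping(stopping_rounds=10)])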
    models.append(lgbm)

In [None]:
# Average the importances over all fold models; they are ordered like `features`, not `s_train.columns`
mean_importance = np.mean([m.feature_importances_ for m in models], axis=0)
feature_imp = pd.DataFrame({'Value': mean_importance, 'Feature': features}).nlargest(20, "Value")

plt.figure(figsize=(20, 10))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False),  palette='mako_r')
plt.title('LightGBM Important Features (avg over folds)')
plt.tight_layout()
plt.show()

<h1 id="h4">
    <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#4fc5ad;
           font-size:100%;
           font-family:Verdana;
           letter-spacing:0.5px">
4.🌅  Mini Ubiquant Feature EDA
<a class="anchor-link" href="https://www.kaggle.com/fangya/ubiquant-investment">¶</a>
</h1>
    
We present EDA for the features selected by the LightGBM and correlation-based methods.

In [None]:
df = pd.DataFrame(s_train, columns= ["row_id","time_id","investment_id","target" ,'f_22','f_30',"f_61","f_72","f_90","f_95","f_97","f_113","f_164","f_194"])
df.head()

In [None]:
sample_features =[22,30,61,72,90,95,164,194,113]
fig, ax = plt.subplots(3,3, figsize=(18, 18))
for i, sample in enumerate(sample_features):
    # histplot with kde=True replaces the deprecated distplot (histogram plus density curve)
    sns.histplot(df[f'f_{sample}'], kde=True, stat="density", ax=ax[i // 3, i % 3], color='#38aaac').set_title(f'f_{sample} Distribution')
fig.suptitle('Top 9 Feature Density Plot', y=1, size=16) 
fig.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(3,3, figsize=(18, 18))
for i, sample in enumerate(sample_features):
    sns.regplot(data=df, x=f'f_{sample}', y="target", ax=ax[i // 3, i % 3], color='#348fa7').set_title(f'f_{sample} Scatter Plot with Target')
fig.suptitle('Top 9 Feature Scatter Plot + Linear Reg', y=1, size=16) 
fig.tight_layout()
plt.show()

<h1 id="h5">
    <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#4fc5ad;
           font-size:100%;
           font-family:Verdana;
           letter-spacing:0.5px">
5.🐱‍💻  Mini Ubiquant Model
<a class="anchor-link" href="https://www.kaggle.com/fangya/ubiquant-investment">¶</a>
        </h1>
    
 <h2 >
    <div style="color:#3482a4;
           display:fill;
           border-radius:2px;
           font-size:100%;
           font-family:Verdana">
Linear Regression
</h2>   
     
As we can see, linear regression performs extremely poorly.

In [None]:
Y=df["target"]
X = df[df.columns[4:]]

In [None]:
model = LinearRegression().fit(X, Y)
# R^2 computed on the same rows used for fitting (an in-sample score)
model.score(X,Y)

In [None]:
Y_pred = model.predict(X)
lin = pd.DataFrame({'Actual': Y, 'Predicted': Y_pred.flatten()})
display(lin)
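
The score above is measured on the same rows the model was fitted on. A hold-out split gives a more honest picture, and the sketch below shows one way to set it up. The data here is synthetic (a weak linear signal buried in noise, which is an assumption meant to mimic a hard-to-predict target), so the numbers say nothing about the Ubiquant data itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X_demo = rng.normal(size=(1000, 10))
# Synthetic target: only the first column carries a small amount of signal.
y_demo = 0.1 * X_demo[:, 0] + rng.normal(size=1000)

# Keep 25% of the rows aside; the model never sees them during fitting.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

reg = LinearRegression().fit(X_tr, y_tr)
r2_train = reg.score(X_tr, y_tr)
r2_test = reg.score(X_te, y_te)
print(f"R^2 on training rows: {r2_train:.4f}")
print(f"R^2 on held-out rows: {r2_test:.4f}")
```

With the notebook's data, the same split could be applied to `X` and `Y` before calling `fit`. For time-ordered data such as this, splitting by `time_id` rather than at random may be the safer choice.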

<h1 id="h6">
    <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#4fc5ad;
           font-size:100%;
           font-family:Verdana;
           letter-spacing:0.5px">
6.🎨 Artistry in Investment
<a class="anchor-link" href="https://www.kaggle.com/fangya/ubiquant-investment">¶</a>
        </h1>
    
At the beginning, I was overwhelmed by the huge dataset and its 300 unknown features, yet real trading can involve thousands of features, perhaps up to ten thousand. <br>    
Special thanks to **@Kehan Liu** and **@Jason GCL**, who provided many insightful ideas; together we decided to cut down the dataset and do a mini analysis.
\
It is a competition, but it is also about doing the best we can and learning something new!
    
My friend **ZMQ  阿秋** works in quant investment. She enjoys the career for **the speed at which knowledge is transformed into fortune, and for the chance to integrate convoluted theories with application**.

This is probably the most romantic and artistic view of quant investing I have heard.

> Beyond the science, supreme quant modeling requires **unconstrained imagination, individual creativity, and a    share of solitude**!
        - 🎆阿秋 
   

<h1 id="h7">
    <div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#4fc5ad;
           font-size:100%;
           font-family:Verdana;
           letter-spacing:0.5px">
7.🍵  Reference
<a class="anchor-link" href="https://www.kaggle.com/fangya/ubiquant-investment">¶</a>
        </h1>
    
[1] https://www.kaggle.com/robikscube/fast-data-loading-and-low-mem-with-parquet-files

[2] https://www.kaggle.com/ilialar/ubiquant-eda-and-baseline

[3] https://www.investopedia.com/terms/q/quantitative-trading.asp

[4] https://www.kaggle.com/lucamassaron/eda-target-analysis

[5] https://www.kaggle.com/edwardcrookenden/eda-and-lgbm-baseline-feature-imp

[6] https://www.twosigma.com/

#### Thanks for reading! Happy Studying!