# G-Research | Exploratory Data Analysis (EDA)

### Acknowledgements 😍
With thanks to:
1. [Yam Peleg. G-Research: XGBoost with GPU (Fit in 1min)](https://www.kaggle.com/yamqwe/g-research-xgboost-with-gpu-fit-in-1min).
2. [Carlos Stein N Brito. Tutorial to the G-Research Crypto Competition](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition).

### About 💸
In the G-Research Crypto Forecasting competition, participants are challenged to predict price returns across a bundle of major cryptocurrencies.

### 1. Import libraries

In [None]:
import time
import pandas as pd
import numpy as np
import scipy.stats as stats
from datetime import datetime
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

### 2. Read data

In [None]:
data_folder = "../input/g-research-crypto-forecasting/"
crypto_df = pd.read_csv(data_folder + 'train.csv')
crypto_df.head(10)

Our aim is to predict the **"Target" column**. You can find the full description of each column [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition?scriptVersionId=84976518&cellId=12).

We can also check the **asset information**, including the list of all assets, the Asset_ID to asset mapping, and the weight of each asset, which sets its relative importance in the evaluation metric.

In [None]:
asset_details = pd.read_csv(data_folder + 'asset_details.csv')
asset_details

### 3. Visualize data: Candlestick Bar
The trading data is an aggregated form of market data that includes the Open, High, Low and Close prices of each interval. We can visualize this data with the commonly used **candlestick bar** chart, which allows traders to perform technical analysis on intraday values.

The **bar's body length** represents the price range between the open and close of that interval's trading (one minute in this dataset). When the bar is **red**, the close was lower than the open (a bearish candlestick); when it is **green**, the close was higher than the open (a bullish candlestick). The wicks above and below the bars show the high and low prices of that interval's trading.

The bottom part of the plot shows a **rangeslider**, which you can use to zoom in on the plot.

In [None]:
# Asset_ID = 1 for Bitcoin, for example
btc = crypto_df[crypto_df["Asset_ID"]==1].set_index("timestamp") 

# select recent data rows
btc_mini = btc.iloc[-200:] 

fig = go.Figure(data=[go.Candlestick(x=btc_mini.index, 
                                     open=btc_mini['Open'], 
                                     high=btc_mini['High'], 
                                     low=btc_mini['Low'], 
                                     close=btc_mini['Close'])])
fig.show()
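
The candle anatomy described above can be reproduced with a few lines of pandas. This is a minimal sketch on made-up prices, not part of the original analysis; the `candles` frame, the `direction` helper and the "flat" label for candles whose close equals the open are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy one-minute candles; the prices are invented for illustration.
candles = pd.DataFrame({
    "Open":  [100.0, 102.0, 101.0],
    "High":  [103.0, 102.5, 101.5],
    "Low":   [99.5, 100.5, 100.0],
    "Close": [102.0, 101.0, 101.0],
})

# Body length: absolute distance between open and close.
candles["body"] = (candles["Close"] - candles["Open"]).abs()

# Wicks: distance from the body to the high (upper) and to the low (lower).
candles["upper_wick"] = candles["High"] - candles[["Open", "Close"]].max(axis=1)
candles["lower_wick"] = candles[["Open", "Close"]].min(axis=1) - candles["Low"]

# Direction: bullish if the close is above the open, bearish if below.
def direction(row):
    if row["Close"] > row["Open"]:
        return "bullish"
    if row["Close"] < row["Open"]:
        return "bearish"
    return "flat"  # hypothetical label for close == open

candles["direction"] = candles.apply(direction, axis=1)
print(candles)
```

The wick columns are the same quantities that are used later in the notebook as the upper and lower shadow features.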

### 4. Exploratory Data Analysis (EDA). Check for missing data

In [None]:
# Asset_ID = 6 for Ethereum, as an example
eth = crypto_df[crypto_df["Asset_ID"]==6].set_index("timestamp") 
eth.isna().sum()

As we can see, there are 340 missing "Target" values.

Let's check the time range of the Bitcoin and Ethereum data by converting the timestamps to datetimes.

In [None]:
beg_btc = btc.index[0].astype('datetime64[s]')
end_btc = btc.index[-1].astype('datetime64[s]')
beg_eth = eth.index[0].astype('datetime64[s]')
end_eth = eth.index[-1].astype('datetime64[s]')

print('BTC data goes from ', beg_btc, 'to ', end_btc)
print('Ethereum data goes from ', beg_eth, 'to ', end_eth)

After that, we'll check for time gaps:

In [None]:
(eth.index[1:]-eth.index[:-1]).value_counts().head()

To work with most time series models, we should preprocess our data into a format **without time gaps**. To fill the gaps, we can call the **.reindex()** method with `method='pad'`, which builds a regular one-minute index and fills each missing row with the previous valid value (forward filling).

In [None]:
eth = eth.reindex(range(eth.index[0],eth.index[-1]+60,60),method='pad')
btc = btc.reindex(range(btc.index[0],btc.index[-1]+60,60),method='pad')
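
To see what this reindexing does, here is a minimal sketch on a toy series, assuming timestamps in seconds with one row per minute as above. The series `s` and its values are invented for illustration.

```python
import pandas as pd

# Toy series indexed by timestamps in seconds; the rows for 120 and 180 are missing.
s = pd.Series([10.0, 11.0, 14.0], index=[0, 60, 240], name="Close")

# Step between consecutive timestamps: a gap shows up as a step larger than 60.
print(pd.Series(s.index[1:] - s.index[:-1]).value_counts())

# Build a complete one-minute grid and forward-fill the missing rows.
full_index = range(s.index[0], s.index[-1] + 60, 60)
filled = s.reindex(full_index, method="pad")
print(filled)
```

Forward filling copies the last known row, so the filled minutes show no price change at all. That is worth keeping in mind when returns are computed from the filled series.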

And check that there are no time gaps now:

In [None]:
(eth.index[1:]-eth.index[:-1]).value_counts().head()

In [None]:
(btc.index[1:]-btc.index[:-1]).value_counts().head()

### 5. Data visualisation

In [None]:
# plot Close price time series for both chosen assets
f = plt.figure(figsize=(15,4))

ax = f.add_subplot(121)
plt.plot(btc['Close'], color = 'green', label='BTC')
plt.legend()
plt.xlabel('Time')
plt.ylabel('Bitcoin')

ax2 = f.add_subplot(122)
ax2.plot(eth['Close'], color='orange', label='ETH')
plt.legend()
plt.xlabel('Time')
plt.ylabel('Ethereum')

plt.tight_layout()
plt.show()

Next, we zoom in on a recent period (June 2021) to see whether the two assets move together:

In [None]:
# auxiliary function: date string (dd/mm/yyyy) to Unix timestamp in seconds
# note: time.mktime interprets the date in the local time zone
totimestamp = lambda s: np.int32(time.mktime(datetime.strptime(s, "%d/%m/%Y").timetuple()))

# create intervals 
btc_mini_2021 = btc.loc[totimestamp('01/06/2021'):totimestamp('01/07/2021')]
eth_mini_2021 = eth.loc[totimestamp('01/06/2021'):totimestamp('01/07/2021')]

In [None]:
# plot time series for both chosen assets
f = plt.figure(figsize=(7,8))

ax = f.add_subplot(211)
plt.plot(btc_mini_2021['Close'], label='btc')
plt.legend()
plt.xlabel('Time')
plt.ylabel('Bitcoin Close')

ax2 = f.add_subplot(212)
ax2.plot(eth_mini_2021['Close'], color='red', label='eth')
plt.legend()
plt.xlabel('Time')
plt.ylabel('Ethereum Close')

plt.tight_layout()
plt.show()

#### Log returns
To analyze price changes for an asset we can work with the price difference. However, different assets have different price scales, so their price differences are not readily comparable. We can solve this problem by computing the percentage change in price instead, also known as the return. This return coincides with the percentage change in our invested capital. Here we use log returns, i.e. the difference of the logarithm of consecutive prices, which are close to percentage returns for small changes and are additive across time.

In [None]:
# define function to compute log returns
def log_return(series, periods=1):
    return np.log(series).diff(periods=periods)
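
Log returns, unlike simple percentage returns, add up across time. The following minimal sketch illustrates this on a made-up price series; `prices` and the other variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented prices: +10%, -10%, +10%.
prices = pd.Series([100.0, 110.0, 99.0, 108.9])

simple_ret = prices.pct_change()   # p_t / p_{t-1} - 1
log_ret = np.log(prices).diff()    # log(p_t / p_{t-1})

# Log returns are additive: their sum equals the log return over the whole period.
total_log = np.log(prices.iloc[-1] / prices.iloc[0])
print(np.isclose(log_ret.sum(), total_log))

# Simple returns are not: their sum differs from the total simple return.
total_simple = prices.iloc[-1] / prices.iloc[0] - 1
print(simple_ret.sum(), total_simple)

# The two kinds of return are linked by log_ret = log(1 + simple_ret).
print(np.allclose(log_ret.dropna(), np.log1p(simple_ret.dropna())))
```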

In [None]:
lret_btc = log_return(btc_mini_2021.Close)[1:]
lret_eth = log_return(eth_mini_2021.Close)[1:]
lret_btc.rename('lret_btc', inplace=True)
lret_eth.rename('lret_eth', inplace=True)

plt.figure(figsize=(8,4))
plt.plot(lret_btc);
plt.plot(lret_eth);
plt.show()

#### Correlation between assets
We can check how the correlation between Bitcoin and Ethereum changes over time, this time using the full history of both assets rather than only the 2021 window.

In [None]:
# join the two assets in a single DataFrame

lret_btc_long = log_return(btc.Close)[1:]
lret_eth_long = log_return(eth.Close)[1:]
lret_btc_long.rename('lret_btc', inplace=True)
lret_eth_long.rename('lret_eth', inplace=True)
two_assets = pd.concat([lret_btc_long, lret_eth_long], axis=1)

# group consecutive rows and use .corr() for correlation between columns
corr_time = two_assets.groupby(two_assets.index//(10000*60)).corr().loc[:,"lret_btc"].loc[:,"lret_eth"]

corr_time.plot()
plt.xticks([])
plt.ylabel("Correlation")
plt.title("Correlation between BTC and ETH over time")
plt.show()

Note the high but variable correlation between the assets. The dynamics clearly change over time, and this is critical for this time series challenge: we have to produce forecasts in a highly non-stationary environment.
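
The idea of correlating the two series block by block can be illustrated on synthetic data whose correlation changes halfway through. This is only a sketch under invented assumptions: the series `a` and `b`, the block size of 500 rows and the random seed are hypothetical choices, and the per-block correlation is computed with `Series.corr` rather than with the chained `.loc` selection used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
base = rng.normal(size=n)
noise = rng.normal(size=n)

# First half: b is a noisy copy of a (high correlation).
# Second half: b is independent noise (correlation near zero).
a = pd.Series(base, name="a")
b = pd.Series(np.where(np.arange(n) < n // 2, base + 0.1 * noise, noise), name="b")
df = pd.concat([a, b], axis=1)

# Assign each row to a block of 500 consecutive rows and correlate within each block.
block = df.index // 500
corr_by_block = df.groupby(block).apply(lambda g: g["a"].corr(g["b"]))
print(corr_by_block)
```

A single correlation computed over all 2000 rows would hide this change, which is why the block-wise view is useful for spotting non-stationarity.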

A stationary behaviour of a system or a process is characterized by non-changing statistical properties over time such as the mean, variance and autocorrelation. On the other hand, a non-stationary behaviour is characterized by a continuous change of statistical properties over time. Stationarity is important because many useful analytical tools and statistical tests and models rely on it.

We can also check the correlation between all assets by visualizing the correlation matrix. Note how some assets have much higher pairwise correlation than others.

In [None]:
# create dataframe with returns for all assets
all_assets_2021 = pd.DataFrame([])
for asset_id, asset_name in zip(asset_details.Asset_ID, asset_details.Asset_Name):
    asset = crypto_df[crypto_df["Asset_ID"]==asset_id].set_index("timestamp")
    asset = asset.loc[totimestamp('01/01/2021'):totimestamp('01/05/2021')]
    asset = asset.reindex(range(asset.index[0],asset.index[-1]+60,60),method='pad')
    lret = log_return(asset.Close.fillna(0))[1:]
    all_assets_2021 = all_assets_2021.join(lret, rsuffix=asset_name, how="outer")

In [None]:
asset_details.Asset_ID

In [None]:
asset_details.Asset_Name

In [None]:
plt.imshow(all_assets_2021.corr());
plt.yticks(range(14), asset_details.Asset_Name.values);
plt.xticks(range(14), asset_details.Asset_Name.values, rotation='vertical');
plt.colorbar()
plt.show()

### 6. Predictions
#### Feature engineering

In [None]:
# Select some input features from the trading data: 
# 5 min log return, abs(1 min log return), upper shadow, and lower shadow.
upper_shadow = lambda asset: asset.High - np.maximum(asset.Close,asset.Open)
lower_shadow = lambda asset: np.minimum(asset.Close,asset.Open)- asset.Low

X_btc = pd.concat([log_return(btc.VWAP,periods=5), 
                   log_return(btc.VWAP,periods=1).abs(), 
                   upper_shadow(btc), 
                   lower_shadow(btc)], axis=1)
y_btc = btc.Target

X_eth = pd.concat([log_return(eth.VWAP,periods=5), 
                   log_return(eth.VWAP,periods=1).abs(), 
                   upper_shadow(eth), 
                   lower_shadow(eth)], axis=1)
y_eth = eth.Target

#### Preparing the data for building predictive models
Because we will fit linear regression models, we need separate training and test sets. To get them, we compute X and y and split this data by time into a training window and a later test window.

In [None]:
# select training and test periods
train_window = [totimestamp("01/05/2021"), totimestamp("30/05/2021")]
test_window = [totimestamp("01/06/2021"), totimestamp("30/06/2021")]

# divide data into train and test, compute X and y
# we aim to build simple regression models using a window_size of 1
X_btc_train = X_btc.loc[train_window[0]:train_window[1]].fillna(0).to_numpy()  # filling NaN's with zeros
y_btc_train = y_btc.loc[train_window[0]:train_window[1]].fillna(0).to_numpy()  

X_btc_test = X_btc.loc[test_window[0]:test_window[1]].fillna(0).to_numpy() 
y_btc_test = y_btc.loc[test_window[0]:test_window[1]].fillna(0).to_numpy() 

X_eth_train = X_eth.loc[train_window[0]:train_window[1]].fillna(0).to_numpy()  
y_eth_train = y_eth.loc[train_window[0]:train_window[1]].fillna(0).to_numpy()  

X_eth_test = X_eth.loc[test_window[0]:test_window[1]].fillna(0).to_numpy() 
y_eth_test = y_eth.loc[test_window[0]:test_window[1]].fillna(0).to_numpy() 

#### Normalize data

In [None]:
# simple preprocessing of the data 
scaler = StandardScaler()

X_btc_train_scaled = scaler.fit_transform(X_btc_train)
X_btc_test_scaled = scaler.transform(X_btc_test)

X_eth_train_scaled = scaler.fit_transform(X_eth_train)
X_eth_test_scaled = scaler.transform(X_eth_test)
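
The scaler is fitted on the training data only and then reused for the test data, so no statistics of the test period leak into preprocessing. A minimal sketch on toy numbers (the arrays below are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0], [5.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from the training data
X_test_scaled = scaler.transform(X_test)        # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
# The training set is centred at zero; the test set generally is not.
print(X_train_scaled.mean(), X_test_scaled.mean())
```

Note that calling `fit_transform` again, as is done above for ETH, overwrites the statistics learned for BTC, so each test set has to be transformed before the scaler is refitted.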

#### Linear Regression

In [None]:
# implement basic ML baseline (one per asset)
lr = LinearRegression()
lr.fit(X_btc_train_scaled,y_btc_train)
y_pred_lr_btc = lr.predict(X_btc_test_scaled)

lr.fit(X_eth_train_scaled,y_eth_train)
y_pred_lr_eth = lr.predict(X_eth_test_scaled)

In [None]:
# we concatenate X and y for both assets
X_both_train = np.concatenate((X_btc_train_scaled, X_eth_train_scaled), axis=1)
X_both_test = np.concatenate((X_btc_test_scaled, X_eth_test_scaled), axis=1)
y_both_train = np.column_stack((y_btc_train, y_eth_train))
y_both_test = np.column_stack((y_btc_test, y_eth_test))

# define the direct multioutput model and fit it
mlr = MultiOutputRegressor(LinearRegression())
mlr.fit(X_both_train,y_both_train)
y_pred_lr_both = mlr.predict(X_both_test)
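
`MultiOutputRegressor` fits one copy of the wrapped estimator per target column, all on the same feature matrix. The sketch below checks this on synthetic data; the data, the coefficients and the variable names are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

# Two synthetic targets that depend linearly on the same features, plus noise.
coef = np.array([[1.0, 0.0], [0.0, 3.0], [-2.0, 1.0], [0.5, -1.0]])
Y = X @ coef + 0.1 * rng.normal(size=(200, 2))

multi = MultiOutputRegressor(LinearRegression()).fit(X, Y)
pred_multi = multi.predict(X)

# Fitting one LinearRegression per target by hand gives the same predictions.
pred_separate = np.column_stack([
    LinearRegression().fit(X, Y[:, k]).predict(X) for k in range(Y.shape[1])
])

print(len(multi.estimators_))                  # one fitted estimator per target
print(np.allclose(pred_multi, pred_separate))
```

So the difference from the single-asset baseline above does not come from the wrapper itself but from the inputs: here each target is predicted from the features of both assets.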

#### Evaluate results

In [None]:
print('Test score for LR baseline: BTC', 
      f"{np.corrcoef(y_pred_lr_btc, y_btc_test)[0,1]:.2f}", ', ETH', 
      f"{np.corrcoef(y_pred_lr_eth, y_eth_test)[0,1]:.2f}")

print('Test score for multiple output LR baseline: BTC', 
      f"{np.corrcoef(y_pred_lr_both[:,0], y_btc_test)[0,1]:.2f}",', ETH', 
      f"{np.corrcoef(y_pred_lr_both[:,1], y_eth_test)[0,1]:.2f}")
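
The score printed above is the Pearson correlation between predictions and targets, read from the off-diagonal entry of the matrix returned by `np.corrcoef`. A minimal sketch with invented numbers shows how it relates to the textbook definition:

```python
import numpy as np

# Made-up targets and predictions, for illustration only.
y_true = np.array([0.01, -0.02, 0.03, 0.00, -0.01])
y_pred = np.array([0.005, -0.01, 0.02, 0.001, -0.004])

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the correlation of the two inputs.
score = np.corrcoef(y_pred, y_true)[0, 1]

# Same value from the definition: covariance over the product of standard deviations.
dp = y_pred - y_pred.mean()
dt = y_true - y_true.mean()
manual = (dp * dt).sum() / np.sqrt((dp ** 2).sum() * (dt ** 2).sum())

# Shifting or positively rescaling the predictions leaves the score unchanged.
score_rescaled = np.corrcoef(10 * y_pred + 1, y_true)[0, 1]

print(score, manual, score_rescaled)
```

Because the score ignores scale and offset, it rewards getting the direction and relative size of the moves right rather than their absolute magnitude.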

### 7. Conclusion
Thank you for reading my new article! If you liked it, please upvote 💖

*My other articles:*

* [G-Research Forecast | Overlap | Score = 0.9999](https://www.kaggle.com/maricinnamon/g-research-forecast-overlap-score-0-9999)
* [Harry Potter Movies Dataset | Starter Notebook](https://www.kaggle.com/maricinnamon/harry-potter-movies-dataset-starter-notebook)
* [Contradictory, My Dear Watson | nlp | tensorflow](https://www.kaggle.com/maricinnamon/contradictory-my-dear-watson-nlp-tensorflow)

👉 You can find more articles on my [page](https://www.kaggle.com/maricinnamon/code?userId=3717445&sortBy=dateRun&tab=profile&privacy=public).
