## Creating a Dataset for Machine Learning with SD Score

The script below creates a machine learning dataset with the following features:

- TVL: Total Value Locked in USD
- APY: Annual Percentage Yield
- APY Mean 7D: The average APY over the last 7 days
- APY Std 7D: The standard deviation of the APY over the last 7 days
- TVL Percentile: The percentile rank of a pool's TVL among all pools on each date
- APY 7D Percentile: The percentile rank of a pool's 7-day mean APY among all pools on each date
- APY 30D Percentile: The percentile of the 30 day APY at each date
- APY 7D Std Ratio: The average APY for 7 days divided by the standard deviation of the APY for 7 days
- TVL Change 7D: The change in TVL over the last 7 days
- TVL Change 1D: The change in TVL over the last 1 day
- SD Score: The product of the 7-day APY percentile and the TVL percentile, multiplied by 100
- SD Score 7D Avg: The rolling mean of the SD Score over the past 7 days
- SD Score 7D Std: The standard deviation of the SD Score over the last 7 days

The dataset is a derived, statistical description of TVL and APY data for DeFi protocols. The source of the data is DeFiLlama.
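
As a minimal sketch of how the SD Score definition works, the toy example below ranks three hypothetical pools on a single date. The pool names and values are made up for illustration; only the column names follow the script.

```python
import pandas as pd

# Toy data: three hypothetical pools observed on a single date.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2023-11-30"] * 3),
    "pool": ["a", "b", "c"],
    "tvlUsd": [100.0, 200.0, 300.0],
    "apyMean7d": [3.0, 1.0, 2.0],
})

# Percentile rank within each date (values fall in the interval (0, 1]).
toy["tvlPercentile"] = toy.groupby("date")["tvlUsd"].rank(pct=True)
toy["apy7DPercentile"] = toy.groupby("date")["apyMean7d"].rank(pct=True)

# SD Score: product of the two percentiles, scaled to the range 0-100.
toy["SD_Score"] = toy["apy7DPercentile"] * toy["tvlPercentile"] * 100
print(toy[["pool", "tvlPercentile", "apy7DPercentile", "SD_Score"]])
```

Because the score is a product, a pool only scores high when it ranks well on both TVL and APY: pool "a" has the best APY but the smallest TVL, so it ends up below pool "c".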



In [31]:
# Import libraries and dependencies
import pandas as pd

data = pd.read_csv(r'/Users/karolk/Python_Work/Data_Sets/Global_Data/DeFi_Global_DB.csv', index_col=0)
pd.set_option("display.max_columns", None)
pd.set_option("display.precision", 4)
pd.options.display.float_format = '{:,.2f}'.format


In [32]:
# parse the 'date' and 'time added' columns as datetime objects
data['date'] = pd.to_datetime(data['date'])
data['time added'] = pd.to_datetime(data['time added'])

# drop all rows where the TVL or the APY is 0 or less
data = data[(data['tvlUsd'] > 0) & (data['apy'] > 0)]

# specify the analysis date and the date window around it
analysis_date = pd.to_datetime('2023-11-30')
# end date is 10 days after the analysis date
end_date = analysis_date + pd.DateOffset(days=10)
# start date is 21 days before the analysis date
start_date = analysis_date - pd.DateOffset(days=21)
data = data[(data['date'] >= start_date) & (data['date'] <= end_date)]


In [33]:

# optional: keep only stablecoin pools
#data = data[data['stablecoin'] == True]

# sort the data table by pool and by date
data = data.sort_values(['pool', 'date'], ascending=[True, True]).reset_index(drop=True)

# create a new column with the average APY over the last 7 rows (days) of each pool, using the 'apy' column
data['apyMean7d'] = data.groupby('pool')['apy'].transform(lambda x: x.rolling(7, 1).mean())

# create a new column for the standard deviation of the APY over the last 7 days
data['apyStd7d'] = data.groupby('pool')['apy'].transform(lambda x: x.rolling(7, 1).std())

# getting the forward mean APY for the next 7 days. 
data['apyMean7dForward'] = data.groupby('pool')['apy'].transform(lambda x: x.shift(-7).rolling(7, 1).mean())

# getting the APY in 7 days
data['apy7dForward'] = data.groupby('pool')['apy'].shift(-7)

# percentage change in APY compared with 7 rows (days) earlier in the same pool
data['apyChange7d'] = data.groupby(['pool'])['apy'].pct_change(7) * 100

# same calculation as 'apyChange7d'; both columns hold the 7-day percentage change
data['apyChange7dPercent'] = data.groupby(['pool'])['apy'].pct_change(7) * 100

# percentile rank of each pool's TVL among all pools on the same date
data['tvlPercentile'] = data.groupby('date')['tvlUsd'].rank(pct=True)

# percentile rank of each pool's 7-day mean APY among all pools on the same date
data['apy7DPercentile'] = data.groupby('date')['apyMean7d'].rank(pct=True)

# 7-day mean APY divided by the 7-day standard deviation of the APY
# (evaluates to inf when the standard deviation is 0)
data['apy7DStdRatio'] = data['apyMean7d'] / data['apyStd7d']

# calculate a new column for the change in TVL over the last 7 days
data['tvlChange7d'] = data.groupby(['pool'])['tvlUsd'].pct_change(periods=7) * 100

# creating column with 'SD_Score' which is the product of '7 day APY percentile' and 'tvl percentile' multiplied by 100
data['SD_Score'] = data['apy7DPercentile'] * data['tvlPercentile'] * 100

# creating a column 'SD_Score_7D_avg' which is the rolling mean of the SD score over the past 7 days
data['SD_Score_7D_avg'] = data.groupby('pool')['SD_Score'].transform(lambda x: x.rolling(7, 1).mean())

# create a column 'SD_Score_7D_std' which is the standard deviation of the SD score over the last 7 days
data['SD_Score_7D_std'] = data.groupby('pool')['SD_Score'].transform(lambda x: x.rolling(7, 1).std())

# create a new column for forward SD_Score_7D_avg
data['SD_Score_7D_forward_rolling'] = data.groupby('pool')['SD_Score'].transform(lambda x: x.shift(-7).rolling(7, 1).mean())

# create a new column for the forward SD_Score_7D which is the SD_Score 7 days in the future
data['SD_Score_7D_forward'] = data.groupby('pool')['SD_Score'].shift(-7)

# percentage change in SD_Score compared with 7 rows (days) earlier
data['SD_Score_7D_change'] = data.groupby(['pool'])['SD_Score'].pct_change(7) * 100

# percentage difference between the current SD_Score and the SD_Score 7 rows (days) ahead;
# pct_change(-7) uses the future value as the base: (current - future) / future
data['SD_Score_7D_forward_change'] = data.groupby(['pool'])['SD_Score'].pct_change(-7) * 100
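
The forward-looking columns rely on two pandas idioms whose behavior is easy to misread. The toy series below (hypothetical values, with a window of 2 instead of 7 to keep it short) is a sketch of what `shift(-k).rolling(k, 1).mean()` and `pct_change(-k)` return.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
k = 2  # window length for this toy example; the notebook uses 7

# Shift the series back by k rows, then take a trailing window of k rows.
# At row t (for t >= k - 1) this averages the original rows t+1 .. t+k.
forward_mean = s.shift(-k).rolling(k, 1).mean()

# pct_change(-k) compares the current row with the row k steps ahead and
# uses the future value as the base: (s[t] - s[t+k]) / s[t+k].
forward_change = s.pct_change(-k) * 100

print(pd.DataFrame({"s": s, "forward_mean": forward_mean,
                    "forward_change": forward_change}))
```

Two things stand out. The first k-1 rows of each group see a shortened forward window (row 0 here holds only the value from row 2), and the last rows are averaged over fewer values or are `NaN`. Also, because `pct_change(-k)` divides by the future value, a value that rises in the future shows up as a negative number.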


In [34]:
# Data Transformation - Adding new columns for the underlying tokens & taking only the most recent date

# create a list of all unique symbols in the data set
symbols = data['symbol'].unique()

# sort the symbols alphabetically
symbols.sort()

# split the 'symbol' column on '-' into at most 4 parts (n=3 splits);
# any tokens beyond the fourth stay joined in the last part
symbol_split = data['symbol'].str.split('-', expand=True, n=3)
# guarantee 4 columns even when no symbol in the data has 4 parts
symbol_split = symbol_split.reindex(columns=range(4))

# add the 4 new columns to the data dataframe
symbol_split.columns = ['token_id_1', 'token_id_2', 'token_id_3', 'token_id_4']
data = pd.concat([data, symbol_split], axis=1)

# create a new column called num_tokens which is the number of tokens in the symbol
data['num_tokens'] = data['symbol'].str.count('-') + 1
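
To illustrate how the split behaves for symbols of different lengths, here is a small sketch on three example symbols. `ISK-WETH` appears in the data; `USDT` and `A-B-C-D-E` are hypothetical inputs chosen to show the edge cases.

```python
import pandas as pd

symbols = pd.Series(["ISK-WETH", "USDT", "A-B-C-D-E"])

# n=3 allows at most 3 splits, so at most 4 parts per symbol.
parts = symbols.str.split("-", expand=True, n=3)
parts.columns = ["token_id_1", "token_id_2", "token_id_3", "token_id_4"]

# Counting separators gives the true number of tokens, which can exceed 4.
num_tokens = symbols.str.count("-") + 1

print(parts)
print(num_tokens.tolist())
```

Symbols with fewer than four tokens leave the remaining columns empty, while a five-token symbol keeps its last two tokens joined in `token_id_4` (`D-E` here) even though `num_tokens` reports 5.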


In [35]:
# keep only rows dated on or before the analysis date
data = data[data['date'] <= analysis_date]

# keep only rows dated at least 7 days after the start date
data = data[data['date'] >= start_date + pd.DateOffset(days=7)]

In [36]:
data

Unnamed: 0,chain,project,symbol,tvlUsd,apy,pool,stablecoin,ilRisk,exposure,outlier,apyMean30d,date,time added,new_upload,possible_error,apyMean7d,apyStd7d,apyMean7dForward,apy7dForward,apyChange7d,apyChange7dPercent,tvlPercentile,apy7DPercentile,apy7DStdRatio,tvlChange7d,SD_Score,SD_Score_7D_avg,SD_Score_7D_std,SD_Score_7D_forward_rolling,SD_Score_7D_forward,SD_Score_7D_change,SD_Score_7D_forward_change,token_id_1,token_id_2,token_id_3,token_id_4,num_tokens
7,Base,uniswap-v3,ISK-WETH,185290.00,0.98,0005d7bf-1f14-4c74-92cd-857c9931053e,False,yes,multi,False,1.13,2023-11-16,2023-11-16 07:18:02,False,False,1.26,0.68,1.10,1.33,-64.98,-64.98,0.54,0.17,1.87,4.51,9.17,11.92,2.25,8.32,8.59,-48.70,6.85,ISK,WETH,,,2
8,Base,uniswap-v3,ISK-WETH,185290.00,0.98,0005d7bf-1f14-4c74-92cd-857c9931053e,False,yes,multi,False,1.17,2023-11-17,2023-11-17 07:03:44,False,False,1.05,0.40,1.31,2.43,-60.62,-60.62,0.54,0.16,2.58,1.06,8.42,10.91,1.93,8.37,8.80,-45.81,-4.26,ISK,WETH,,,2
9,Base,uniswap-v3,ISK-WETH,185290.00,0.98,0005d7bf-1f14-4c74-92cd-857c9931053e,False,yes,multi,False,1.20,2023-11-18,2023-11-18 07:16:53,False,False,0.92,0.17,1.38,1.47,-47.92,-47.92,0.54,0.15,5.42,-0.18,8.03,10.02,1.53,8.48,8.81,-43.62,-8.81,ISK,WETH,,,2
10,Base,uniswap-v3,ISK-WETH,185290.00,0.98,0005d7bf-1f14-4c74-92cd-857c9931053e,False,yes,multi,False,1.15,2023-11-19,2023-11-19 07:02:02,False,False,0.98,0.00,1.61,2.63,83.84,83.84,0.54,0.15,inf,0.81,8.28,9.47,1.32,8.67,9.58,-31.66,-13.61,ISK,WETH,,,2
11,Base,uniswap-v3,ISK-WETH,111779.00,0.26,0005d7bf-1f14-4c74-92cd-857c9931053e,False,yes,multi,False,1.92,2023-11-27,2023-11-27 07:18:31,False,False,0.88,0.27,2.11,3.70,-73.63,-73.63,0.45,0.18,3.22,-39.67,8.08,8.99,1.09,9.11,11.15,-29.18,-27.53,ISK,WETH,,,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
293453,Ethereum,uniswap-v2,HT-WETH,173392.00,0.08,fff0af84-1d67-4d88-b2ec-cbba9d5c03ca,False,yes,multi,False,6.92,2023-11-30,2023-11-30 07:09:43,False,False,3.40,2.80,21.82,1.01,-98.66,-98.66,0.53,0.34,1.21,-2.78,18.01,20.99,2.15,23.46,37.66,5.41,-52.17,HT,WETH,,,2
293463,BSC,uniswap-v3,USDT-NCX,171541.00,1.82,fff671da-4520-4a5f-bb4e-3dab41a8d1a0,False,yes,multi,False,0.97,2023-11-27,2023-11-27 07:18:31,False,False,1.82,,3.09,3.09,,,0.53,0.25,,,13.22,13.22,,17.83,17.83,,-25.85,USDT,NCX,,,2
293464,BSC,uniswap-v3,USDT-NCX,235742.00,3.09,fff671da-4520-4a5f-bb4e-3dab41a8d1a0,False,yes,multi,False,1.11,2023-11-28,2023-11-28 07:09:56,False,False,2.45,0.90,3.09,,,,0.59,0.30,2.73,,17.61,15.42,3.10,17.83,,,,USDT,NCX,,,2
293465,BSC,uniswap-v3,USDT-NCX,235742.00,3.09,fff671da-4520-4a5f-bb4e-3dab41a8d1a0,False,yes,multi,False,1.11,2023-11-29,2023-11-29 07:08:22,False,False,2.66,0.73,3.09,,,,0.59,0.30,3.63,,17.82,16.22,2.60,17.83,,,,USDT,NCX,,,2
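
The ISK-WETH rows above jump from 2023-11-19 to 2023-11-27, so for that pool `rolling(7, 1)`, `shift(-7)` and `pct_change(7)` count rows rather than calendar days. If calendar-day windows are wanted, one possible approach (an assumption, not what the notebook does) is a time-based window such as `rolling('7D')` on a date index. The helper name `rolling_7d_mean` and the toy values below are hypothetical.

```python
import pandas as pd

# Toy pool with an 8-day gap between the third and fourth observation.
toy = pd.DataFrame({
    "pool": ["a"] * 4,
    "date": pd.to_datetime(
        ["2023-11-17", "2023-11-18", "2023-11-19", "2023-11-27"]
    ),
    "apy": [1.0, 2.0, 3.0, 10.0],
}).sort_values(["pool", "date"]).reset_index(drop=True)

# Row-based window, as in the notebook: the last row averages all four rows.
toy["apyMeanRows"] = toy.groupby("pool")["apy"].transform(
    lambda x: x.rolling(7, 1).mean()
)


def rolling_7d_mean(group):
    """Mean APY over the trailing 7 calendar days for one pool."""
    by_date = group.set_index("date")["apy"]
    values = by_date.rolling("7D").mean().to_numpy()
    # Return with the original row index so the result aligns with `toy`.
    return pd.Series(values, index=group.index)


# Calendar-based window: only observations from the last 7 days are averaged.
toy["apyMeanDays"] = pd.concat(
    [rolling_7d_mean(g) for _, g in toy.groupby("pool")]
)

print(toy)
```

In this sketch the last row's row-based mean mixes in values that are more than a week old, while the calendar-based mean uses only the observation from 2023-11-27.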


In [37]:
# save the data to a csv file
filepath = '/Users/karolk/Python_Work/ML_Price/Datasets/DeFi_Quant_Data.csv'
data.to_csv(filepath, index=False)
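
The saved file still contains `inf` values (where the 7-day standard deviation is 0) and `NaN` values (where a window has a single observation or no row exists 7 steps away), as the output of cell 36 shows. Many estimators reject such values, so a cleaning step is usually needed before training. The sketch below shows one possible approach on toy values; dropping rows is an assumption here, and imputation may suit some models better.

```python
import numpy as np
import pandas as pd

# Toy rows mirroring the three cases seen in the output:
# zero standard deviation, a normal value, and a missing standard deviation.
toy = pd.DataFrame({
    "apyMean7d": [0.98, 1.05, 1.82],
    "apyStd7d": [0.0, 0.40, np.nan],
})
toy["apy7DStdRatio"] = toy["apyMean7d"] / toy["apyStd7d"]

# Treat infinities as missing, then drop rows lacking the feature.
cleaned = (
    toy.replace([np.inf, -np.inf], np.nan)
       .dropna(subset=["apy7DStdRatio"])
)

print(cleaned)
```

Only the middle row survives in this example, since the first ratio is infinite and the third is undefined.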