## Creating a Dataset for Machine Learning with SD Score

The script below creates a machine learning dataset with the following features:

- TVL: Total Value Locked in USD
- APY: Annual Percentage Yield
- APY Mean 7D: The average APY over the last 7 days
- APY Std 7D: The standard deviation of the APY over the last 7 days
- TVL Percentile: The percentile rank of a pool's TVL among all pools on each date
- APY 7D Percentile: The percentile rank of the 7-day mean APY among all pools on each date
- APY 30D Percentile: The percentile rank of the 30-day mean APY among all pools on each date
- APY 7D Std Ratio: The 7-day mean APY divided by the 7-day standard deviation of the APY
- TVL Change 7D: The percentage change in TVL over the last 7 days
- TVL Change 1D: The percentage change in TVL over the last 1 day
- SD Score: The product of the 7-day APY percentile and the TVL percentile, multiplied by 100
- SD Score 7D Avg: The rolling mean of the SD Score over the last 7 days
- SD Score 7D Std: The rolling standard deviation of the SD Score over the last 7 days

The dataset is a derivative, statistical description of TVL and APY data for DeFi protocols. The source of the data is DeFiLlama.
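To make the SD Score definition concrete, here is a minimal sketch on made-up numbers. The `toy` frame and its four pools are hypothetical and only illustrate how the two per-date percentile ranks combine; the column names follow the ones used in the script.

```python
import pandas as pd

# Hypothetical toy data: four pools observed on a single date.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2023-12-15"] * 4),
    "pool": ["a", "b", "c", "d"],
    "tvlUsd": [100.0, 200.0, 300.0, 400.0],
    "apyMean7d": [8.0, 2.0, 4.0, 6.0],
})

# Percentile rank of each pool within its date (1.0 = highest value that day).
toy["tvlPercentile"] = toy.groupby("date")["tvlUsd"].rank(pct=True)
toy["apy7DPercentile"] = toy.groupby("date")["apyMean7d"].rank(pct=True)

# SD Score: product of the two percentiles, scaled to a 0-100 range.
toy["SD_Score"] = toy["apy7DPercentile"] * toy["tvlPercentile"] * 100

sd_scores = toy.set_index("pool")["SD_Score"].round(2).to_dict()
print(sd_scores)  # {'a': 25.0, 'b': 12.5, 'c': 37.5, 'd': 75.0}
```

Because the score is a product, a pool scores close to 100 only when it ranks high on both TVL and 7-day APY. Pool `a` has the highest APY here but the lowest TVL, so it ends up with a score of 25.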



In [9]:
# Import libraries and dependencies
import pandas as pd

data = pd.read_csv(r'/Users/karolk/Python_Work/Data_Sets/Global_Data/DeFi_Global_DB.csv', index_col=0)
pd.set_option("display.max_columns", None)
pd.set_option("display.precision", 4)
pd.options.display.float_format = '{:,.2f}'.format

display(data.head())


id,chain,project,symbol,tvlUsd,apy,pool,stablecoin,ilRisk,exposure,outlier,apyMean30d,date,time added,new_upload,possible_error
2708788,Ethereum,uniswap-v2,BLOCK-WETH,212108.0,0.0,ffff4226-4328-404f-be4c-428d01a06ccd,False,yes,multi,False,0.0,2023-12-16,2023-12-16 07:01:18,False,False
2695550,Ethereum,uniswap-v2,BLOCK-WETH,212108.0,0.0,ffff4226-4328-404f-be4c-428d01a06ccd,False,yes,multi,False,0.0,2023-12-15,2023-12-15 07:02:03,False,False
2682343,Ethereum,uniswap-v2,BLOCK-WETH,212108.0,0.0,ffff4226-4328-404f-be4c-428d01a06ccd,False,yes,multi,False,0.0,2023-12-14,2023-12-14 07:05:52,False,False
2669155,Ethereum,uniswap-v2,BLOCK-WETH,212108.0,0.0,ffff4226-4328-404f-be4c-428d01a06ccd,False,yes,multi,False,0.0,2023-12-13,2023-12-13 09:24:48,False,False
2656036,Ethereum,uniswap-v2,BLOCK-WETH,212108.0,0.0,ffff4226-4328-404f-be4c-428d01a06ccd,False,yes,multi,False,0.0,2023-12-12,2023-12-12 07:01:18,False,False


In [10]:
#set date and time added as datetime objects
data['date'] = pd.to_datetime(data['date'])
data['time added'] = pd.to_datetime(data['time added'])

#drop all rows where the TVL or the APY is not positive
data = data[(data['tvlUsd'] > 0) & (data['apy'] > 0)]

# keep only rows dated no earlier than 35 days before the analysis date
# (rows dated after the analysis date are not filtered out)
analysis_date = '2023-12-15' # change the analysis date here
data = data[data['date'] >= pd.to_datetime(analysis_date) - pd.DateOffset(days=35)]

# keep only non-stablecoin pools (stablecoin is False)
data = data[data['stablecoin'] == False]

#sort the data table by pool and by date
data = data.sort_values(['pool', 'date'], ascending=[True, True]).reset_index(drop=True)

#create a new column with the rolling average APY over the last 7 days, using the 'apy' column
data['apyMean7d'] = data.groupby('pool')['apy'].transform(lambda x: x.rolling(7, 1).mean())

#create a new column for the standard deviation of the APY over the last 7 days
data['apyStd7d'] = data.groupby('pool')['apy'].transform(lambda x: x.rolling(7, 1).std())

#create a column with each pool's TVL percentile rank among all pools on the same date
data['tvlPercentile'] = data.groupby('date')['tvlUsd'].rank(pct=True)

#create a column which has the percentile for 7 day APY and 30 day APY
data['apy7DPercentile'] = data.groupby('date')['apyMean7d'].rank(pct=True)
data['apy30DPercentile'] = data.groupby('date')['apyMean30d'].rank(pct=True)

#create a new column which is the 7 day average APY divided by the 7 day standard deviation of the APY
data['apy7DStdRatio'] = data['apyMean7d'] / data['apyStd7d']

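#calculate the percentage change in TVL over the previous 7 rows of each pool (7 days if each pool has one row per day)
data['tvlChange7d'] = data.groupby(['pool'])['tvlUsd'].pct_change(periods=7) * 100

#calculate the percentage change in TVL over the previous row of each pool (1 day under the same assumption)
data['tvlChange1d'] = data.groupby(['pool'])['tvlUsd'].pct_change(periods=1) * 100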

#creating column with 'SD_Score' which is the product of '7 day APY percentile' and 'tvl percentile' multiplied by 100
data['SD_Score'] = data['apy7DPercentile'] * data['tvlPercentile'] * 100

#creating a column 'SD_Score_7D_avg' which is the rolling mean of the past 7 days SD score
data['SD_Score_7D_avg'] = data.groupby('pool')['SD_Score'].transform(lambda x: x.rolling(7, 1).mean())

#create a column 'SD_Score_7D_std' which is the rolling standard deviation of the SD score over the last 7 days
data['SD_Score_7D_std'] = data.groupby('pool')['SD_Score'].transform(lambda x: x.rolling(7, 1).std())
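
The rolling windows in the cell above use `rolling(7, 1)`, that is, a 7-row window with `min_periods=1`. The sketch below, on a hypothetical single-pool series, shows what that implies for the first rows of each pool: the mean is defined from the first row, the sample standard deviation is `NaN` until there are two observations, and the mean/std ratio becomes infinite when the APY has not moved. The `apy7DStdRatio_clean` column is an illustrative assumption and not part of the original script.

```python
import numpy as np
import pandas as pd

# Hypothetical APY history for one pool, one row per day.
toy = pd.DataFrame({
    "pool": ["a"] * 4,
    "apy": [2.0, 2.0, 4.0, 6.0],
})

grouped = toy.groupby("pool")["apy"]
toy["apyMean7d"] = grouped.transform(lambda x: x.rolling(7, min_periods=1).mean())
toy["apyStd7d"] = grouped.transform(lambda x: x.rolling(7, min_periods=1).std())

# Row 0: std is NaN (a single observation), so the ratio is NaN.
# Row 1: std is 0 (two identical values), so the ratio is inf.
toy["apy7DStdRatio"] = toy["apyMean7d"] / toy["apyStd7d"]

# Optional cleanup (an assumption): treat infinite ratios as missing values
# so that downstream models do not receive inf.
toy["apy7DStdRatio_clean"] = toy["apy7DStdRatio"].replace([np.inf, -np.inf], np.nan)

print(toy)
```

Whether to drop, impute, or cap these rows depends on the model being trained. The point here is only that they exist in the output and should be handled deliberately.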

In [11]:
# save the data to a csv file
filepath = '/Users/karolk/Python_Work/ML_Price/Datasets/DeFi_Quant_Data_n_stables.csv'
data.to_csv(filepath, index=False)
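
One thing to keep in mind when loading the saved file later: CSV does not store dtypes, so the `date` column comes back as plain strings unless it is parsed again. A minimal sketch, using an in-memory buffer and a hypothetical two-row frame instead of the real file path:

```python
import io
import pandas as pd

# Hypothetical stand-in for the saved dataset.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2023-12-14", "2023-12-15"]),
    "pool": ["a", "a"],
    "SD_Score": [25.0, 37.5],
})

# Write to an in-memory buffer the same way the script writes to disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype on read.
reloaded = pd.read_csv(buffer, parse_dates=["date"])
print(reloaded.dtypes)
```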

