# Common Risk Factors in Cryptocurrency

# Master Thesis Project

<div style="text-align: right;font-size: 0.8em">Document Version 1.0.0</div>

This file requires `pandas`, `numpy`, `wand`, `pdf2image`, and `Pillow` to run, in addition to the standard-library modules `datetime` and `math`. If one of these imports fails, please install the corresponding library and make sure that the correct virtual environment is activated.

The project closely follows the methodology proposed by Liu, Tsyvinski, and Wu (2022) in their paper titled [Common Risk Factors in Cryptocurrency](https://onlinelibrary.wiley.com/doi/abs/10.1111/jofi.13119). Researchers and practitioners can use this project to check the results of the paper and perhaps retrieve an updated version of the basic findings. They can also use it as a toolbox for other projects or to run an extended analysis including further risk factors. Finally, asset management firms may use this code to assess the risk of their portfolios or to find anomalies in the returns of cryptocurrencies.

For this analysis, I occasionally had to make assumptions, for example, regarding the procedure to convert daily to weekly data. This was necessary because the authors of the paper did not describe all of their decisions in sufficient detail. There are other, perhaps better ways of doing certain steps, and I am always grateful for any feedback that you might provide.

The order of the following sections closely follows the structure of the paper. The outline is:
* [I. Data](#I.-Data): The files for all data sources can be found in the data folder. The main blockchain trading data is retrieved from CoinGecko (coingecko_data.py). It is advisable to download the cryptocurrency data set in smaller chunks (for example, 100 cryptocurrencies), since the data set is relatively large and takes a long time to download due to the API limit. The merge_data.py file can then be used to merge all individual cryptocurrency data files into one large file that is loaded into the code below. The daily cryptocurrency (aka coin) data is converted to weekly returns using the last available price $p_t$ of each week $t$: $$r_t = \frac{p_t-p_{t-1}}{p_{t-1}}$$ One can also compute log-returns instead. Weeks are defined as follows: the first 7 days of a given year form the first week, the following 50 weeks consist of 7 days each, and the last (52nd) week has 8 days, or 9 days in a leap year.
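
The week convention and the return formula described above can be illustrated with a minimal sketch. It is not the code in convert_frequency.py: the function names `week_of_year` and `weekly_simple_returns`, the column names `date` and `price`, and the example prices are assumptions made for illustration only. The sketch also assumes that every week in the sample has at least one price, so that consecutive rows are consecutive weeks.

```python
import datetime

import pandas as pd


def week_of_year(date: datetime.date) -> int:
    """Week number (1-52) under the convention described above (hypothetical helper)."""
    day_of_year = date.timetuple().tm_yday  # 1-based day within the year
    # days 1-7 -> week 1, ..., days 351-357 -> week 51, all remaining days -> week 52
    return min((day_of_year - 1) // 7 + 1, 52)


def weekly_simple_returns(daily: pd.DataFrame) -> pd.DataFrame:
    """Collapse the daily prices of one coin to weekly last prices and simple returns.

    `daily` is assumed to have a datetime column "date" and a numeric column "price".
    """
    daily = daily.sort_values("date").copy()
    daily["year"] = daily["date"].dt.year
    daily["week"] = daily["date"].dt.date.map(week_of_year)
    # last available (non-NaN) price within each (year, week)
    weekly = daily.groupby(["year", "week"], as_index=False)["price"].last()
    # r_t = (p_t - p_{t-1}) / p_{t-1}; the first week has no predecessor and stays NaN
    previous_price = weekly["price"].shift(1)
    weekly["return"] = (weekly["price"] - previous_price) / previous_price
    return weekly


# made-up example prices spanning a year boundary
dates = pd.date_range("2022-12-20", "2023-01-10", freq="D")
example = pd.DataFrame({"date": dates, "price": [100.0 + i for i in range(len(dates))]})
weekly = weekly_simple_returns(example)
print(weekly)
```

In this example, the days from 2022-12-24 to 2022-12-31 all fall into week 52, which therefore has 8 days. For log-returns, the last assignment would use the logarithm of the price ratio instead.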

# I. Data

In [1]:
import pandas as pd, datetime, numpy as np, math, time, os
pd.options.mode.chained_assignment = None
from wand.image import Image
# and other modules from the directory
import convert_frequency, data.coingecko_data as coingecko_data, merge_data, render

# specify the data range for the analysis
# in the paper, the authors start on 2014-01-01 due to data availability
start_date = "2014-01-01"
end_date = str(datetime.date.today())

# select the path to the directory where you want to store the data
data_path = r"/Users/Marc/Desktop/Past Affairs/Past Universities/SSE Courses/Master Thesis/Data"
"""
# downloading the data from CoinGecko.com and storing it in smaller data subsets at the specified location
# the data contains daily prices, market caps, and trading volumes
# this step can take up to 2 days due to the API traffic limit
# we are also always checking if the subsequent files already exist (/cg_data.csv, /cg_weekly_data.csv, /cg_weekly_returns.csv, /market_weekly_returns.csv)
# this helps in case the previous files have been deleted
if not os.path.exists(data_path + "/coingecko") and not os.path.exists(data_path + "/cg_data.csv") and not os.path.exists(data_path + "/cg_weekly_data.csv") and not os.path.exists(data_path + "/cg_weekly_returns.csv") and not os.path.exists(data_path + "/market_weekly_returns.csv"):
    coingecko_data.retrieve_data(start_date, end_date, path=data_path)
else:
    print("The individual data files already exist.")

# merging the data subsets and storing the result at the specified location
# this task also absorbs part of the preprocessing, so it's recommended to run this step in any case 
# this step can take up to 12 hours
if not os.path.exists(data_path + "/cg_data.csv") and not os.path.exists(data_path + "/cg_weekly_data.csv") and not os.path.exists(data_path + "/cg_weekly_returns.csv") and not os.path.exists(data_path + "/market_weekly_returns.csv"):
    merge_data.merge(start_date, end_date, path=data_path)
else:
    print("The data was already merged into a single file.")

# the data was retrieved on 2023-01-13
daily_trading_data = pd.read_csv(data_path+"/cg_data.csv")

# all unique coin IDs
coin_ids = pd.unique(daily_trading_data["id"])
print("There are " +str(len(coin_ids)) + " coins in the data set.")

# saving the weekly data to disk since the conversion process might also take a long time
# the conversion is only run if the file for weekly data does not already exist
if not os.path.exists(data_path + "/cg_weekly_data.csv") and not os.path.exists(data_path + "/cg_weekly_returns.csv") and not os.path.exists(data_path + "/market_weekly_returns.csv"):
    print("The data is being transformed into weekly data.")
    # converting the data subset for every coin into weekly frequency
    dfs = []
    percentage_counter = 1
    print("The conversion progress is: ")
    for coin_id in coin_ids:
        # printing the progress 
        progress = int(len(dfs) / len(coin_ids) * 100)
        if progress > percentage_counter:
            percentage_counter += 1
            print(str(progress) + "%")
        # get all the data for one coin
        coin_daily_trading_data = daily_trading_data[daily_trading_data["id"] == coin_id]
        # now we compute the weekly data
        # the function weekly_data is designed to perform this transformation for a single coin at a time
        # this step takes a long time since the data set has a large size
        coin_weekly_data = convert_frequency.weekly_data(coin_daily_trading_data, start_date, end_date, download=False)
        dfs.append(coin_weekly_data)
        
    # combining all dataframes in the dfs list
    weekly_trading_data = pd.concat(dfs)
    # saving the data to disk
    weekly_trading_data.to_csv(data_path + "/cg_weekly_data.csv", index=False)
else:
    # next, we need to load the data and "unwrap" it again
    print("The data has already been converted.")
    weekly_trading_data = pd.read_csv(data_path + "/cg_weekly_data.csv")

# saving the weekly returns to disk since the returns computation might also take a long time
# the returns are only computed if the file for weekly returns data does not already exist
if not os.path.exists(data_path + "/cg_weekly_returns.csv") and not os.path.exists(data_path + "/market_weekly_returns.csv"):
    print("The data is being transformed into weekly returns data.")
    # computing the weekly returns for every coin
    dfs = []
    percentage_counter = 1
    print("The conversion progress is: ")
    for coin_id in coin_ids:
        # printing the progress 
        progress = int(len(dfs) / len(coin_ids) * 100)
        if progress > percentage_counter:
            percentage_counter += 1
            print(str(progress) + "%")
        coin_weekly_prices = weekly_trading_data[weekly_trading_data["id"] == coin_id]
        # we are losing the first week, since we do not have a previous week for the first week (first week of 2014)
        coin_weekly_returns = [np.nan]
        # the indices of the dataframe for the respective coin
        indices = list(coin_weekly_prices.index)
        # at least two weekly prices are needed to compute one return
        if len(coin_weekly_prices) > 1:
            for i in range(len(coin_weekly_prices) - 1):
                # to retrieve the actual indices
                this_week_index = indices[i]
                next_week_index = indices[i + 1]
                try:
                    weekly_return = (coin_weekly_prices["price"][next_week_index] - coin_weekly_prices["price"][this_week_index]) / coin_weekly_prices["price"][this_week_index]
                    # alternatively, the log-return can be computed as follows (math.log() is the natural logarithm by default):
                    # weekly_log_return = math.log(coin_weekly_prices["price"][next_week_index] / coin_weekly_prices["price"][this_week_index])
                    coin_weekly_returns.append(weekly_return)
                except Exception:
                    # fallback in case the return cannot be computed from the current and the next price
                    coin_weekly_returns.append(np.nan)
        # adding the return column to the previous date column
        coin_weekly_prices["return"] = coin_weekly_returns
        dfs.append(coin_weekly_prices)
    # combining all dataframes in the dfs list
    coins_weekly_returns = pd.concat(dfs)
    # saving the data to disk
    coins_weekly_returns.to_csv(data_path + "/cg_weekly_returns.csv", index=False)
else:
    # next, we need to load the data and "unwrap" it again
    print("The data has already been transformed into returns data.")
    coins_weekly_returns = pd.read_csv(data_path + "/cg_weekly_returns.csv")

coins_weekly_returns_old = coins_weekly_returns
# storing the data in a dict with keys for the ID
coins_weekly_returns = {}
for coin_id in coin_ids:
    # get all the data for one coin
    coins_weekly_returns[coin_id] = coins_weekly_returns_old[coins_weekly_returns_old["id"] == coin_id]

# saving the market returns to disk since their computation might also take a long time
# the market returns are only computed if the corresponding file does not already exist
if not os.path.exists(data_path + "/market_weekly_returns.csv"):
    # constructing the cryptocurrency market returns
    years = []
    weeks = []
    market_returns = []
    included_ids = []
    # taking an arbitrary ID to loop through all weeks
    for i in coins_weekly_returns[coin_ids[0]].index:
        year = coins_weekly_returns[coin_ids[0]]["year"][i]
        years.append(year)
        week = coins_weekly_returns[coin_ids[0]]["week"][i]
        weeks.append(week)
        returns = []
        market_caps = []
        # to keep track of which and how many coins are included for every week
        weekly_included_ids = []
        for coin_id in coin_ids:
            coin_weekly_returns = coins_weekly_returns[coin_id]
            coin_weekly_data = coin_weekly_returns[(coin_weekly_returns["year"] == year) & (coin_weekly_returns["week"] == week)]
            # ignoring coins without a row for this week and rows that contain NaNs
            # .isna().sum().sum() counts the NaN cells, so 0 means that no cell value is NaN
            if len(coin_weekly_data) > 0 and coin_weekly_data.isna().sum().sum() == 0:
                # the ID is included
                weekly_included_ids.append(coin_id)
                # storing scalars instead of Series so that the sums below do not depend on index alignment
                returns.append(coin_weekly_data["return"].iloc[0])
                market_caps.append(coin_weekly_data["market_cap"].iloc[0])
        # if all returns are NaN (for example, in the first week of the time period considered)
        if len(returns) == 0:
            # if no value was added
            market_returns.append(np.nan)
            included_ids.append(np.nan)
        else:
            # for every week add the value-weighted market return (the sumproduct of the returns and the market caps divided by the sum of the market caps) and the included coin IDs
            weighted_average = sum(x * y for x, y in zip(returns, market_caps)) / sum(market_caps)
            market_returns.append(weighted_average)
            included_ids.append(weekly_included_ids)
    market_weekly_returns = pd.DataFrame({"year": years, "week": weeks, "average_return": market_returns, "included_ids": included_ids})
    # saving the market returns to disk
    market_weekly_returns.to_csv(data_path + "/market_weekly_returns.csv", index=False)
else:
    # next, we need to load the data and "unwrap" it again
    print("The market returns have already been computed.")
    market_weekly_returns = pd.read_csv(data_path + "/market_weekly_returns.csv")

print(market_weekly_returns.head())


"""

# mock data
import pandas as pd, datetime, random
start_date = "2014-01-01"
end_date = "2023-01-12"

daily_trading_data = pd.DataFrame({"id": [], "date": [], "price": [], "market_cap": [], "total_volume": []})
daily_trading_data["id"] = ["Check"] * 90 + ["Out"] * 90
daily_trading_data["date"] = ["2014"] * 10 + ["2015"] * 10 + ["2016"] * 10 + ["2017"] * 10 + ["2018"] * 10 + ["2019"] * 10 + ["2020"] * 10 + ["2021"] * 10 + ["2022"] * 10 + ["2014"] * 10 + ["2015"] * 10 + ["2016"] * 10 + ["2017"] * 10 + ["2018"] * 10 + ["2019"] * 10 + ["2020"] * 10 + ["2021"] * 10 + ["2022"] * 10
daily_trading_data["price"] = random.sample(range(10, 300), 180)
daily_trading_data["market_cap"] = random.sample(range(10, 300), 180)
daily_trading_data["total_volume"] = random.sample(range(10, 300), 180)

market_weekly_returns = pd.DataFrame({"year": [], "week": [], "average_return": [], "included_ids": []})
market_weekly_returns["year"] = [2014] * 10 + [2015] * 10 + [2016] * 10 + [2017] * 10 + [2018] * 10 + [2019] * 10 + [2020] * 10 + [2021] * 10 + [2022] * 10
market_weekly_returns["week"] = random.sample(range(10, 300), 90)
returns = random.sample(range(10, 300), 90)
returns = [x / 1000 for x in returns]
market_weekly_returns["average_return"] = returns
market_weekly_returns["included_ids"] = [["Test"]] * 90

coins_weekly_returns = {"ethereum": None, "bitcoin": None, "ripple": None, "other": None}
for coin in coins_weekly_returns.keys():
    year = [2014] * 10 + [2015] * 10 + [2016] * 10 + [2017] * 10 + [2018] * 10 + [2019] * 10 + [2020] * 10 + [2021] * 10 + [2022] * 10
    week = random.sample(range(10, 300), 90)
    price = random.sample(range(10, 300), 90)
    market_cap = random.sample(range(10, 300), 90)
    volume = random.sample(range(10, 300), 90)
    returns = random.sample(range(10, 300), 90)
    returns = [x / 1000 for x in returns]

    df = pd.DataFrame({"year": year, "week": week, "price": price, "market_cap": market_cap, "volume": volume, "return": returns})
    coins_weekly_returns[coin] = df
        



# creates a temporary PDF file named "cover.pdf"
# repeating the process overwrites the file
render.render_summary_statistics(start_date, end_date, daily_trading_data, market_weekly_returns, coins_weekly_returns)

pdf_path = os.getcwd() + "/cover.pdf"
# printing the PDF
# this code has to be in the main file
img = Image(filename=pdf_path, resolution=100)
img
