## Objective

Data collection is an integral part of any ML system, since models cannot be built without data. In this notebook, we look at one way to collect data for web3.

Cryptocurrencies are an integral part of the web3 ecosystem, which has thousands of tokens. Here, we will get the historical prices of these tokens.

[CoinGecko](https://www.coingecko.com/) is one of the largest cryptocurrency data aggregators and offers a [free API](https://www.coingecko.com/en/api). We will use this free API to pull the historical prices.

We will pull the data only for the top 50 coins by market cap. With small modifications, the same code can pull data for any other coins of interest.

So, let us get started.

## Installation of pycoingecko

We will use an unofficial Python client to get the data, so let us first install the [pycoingecko](https://github.com/man-c/pycoingecko) package.

In [2]:
!pip install -U pycoingecko

## Top 50 crypto tokens by market cap

In this section, let us first use the API to get the top 50 tokens by market cap.

In [4]:
import numpy as np
import pandas as pd
import requests
import json
import datetime
import time

from pycoingecko import CoinGeckoAPI
cg = CoinGeckoAPI()

We will use the `cg.get_coins_markets` method to get the data. By default, this returns the top 100 coins by market capitalization. We can also specify the currency in which we need the prices; let us use USD.

In [8]:
response = cg.get_coins_markets(vs_currency="USD")
response[0]

{'id': 'bitcoin',
 'symbol': 'btc',
 'name': 'Bitcoin',
 'image': 'https://assets.coingecko.com/coins/images/1/large/bitcoin.png?1547033579',
 'current_price': 18866.06,
 'market_cap': 358946992061,
 'market_cap_rank': 1,
 'fully_diluted_valuation': 393398141141,
 'total_volume': 27147370357,
 'high_24h': 19142.43,
 'low_24h': 18708.63,
 'price_change_24h': -224.82682384019063,
 'price_change_percentage_24h': -1.17767,
 'market_cap_change_24h': -6368424369.597046,
 'market_cap_change_percentage_24h': -1.74327,
 'circulating_supply': 19160962.0,
 'total_supply': 21000000.0,
 'max_supply': 21000000.0,
 'ath': 69045,
 'ath_change_percentage': -72.86982,
 'ath_date': '2021-11-10T14:24:11.849Z',
 'atl': 67.81,
 'atl_change_percentage': 27524.60911,
 'atl_date': '2013-07-06T00:00:00.000Z',
 'roi': None,
 'last_updated': '2022-09-26T07:27:03.188Z'}

There is quite a lot of information for each coin. We need the `id` field to pull the historical data, so let us get that for the first 50 coins.

In [11]:
coins_list = [coin["id"] for coin in response[:50]]
coins_list

['bitcoin',
 'ethereum',
 'tether',
 'usd-coin',
 'binancecoin',
 'ripple',
 'binance-usd',
 'cardano',
 'solana',
 'dogecoin',
 'polkadot',
 'dai',
 'shiba-inu',
 'staked-ether',
 'tron',
 'matic-network',
 'avalanche-2',
 'wrapped-bitcoin',
 'uniswap',
 'cosmos',
 'ethereum-classic',
 'leo-token',
 'okb',
 'chainlink',
 'litecoin',
 'ftx-token',
 'stellar',
 'near',
 'crypto-com-chain',
 'monero',
 'algorand',
 'bitcoin-cash',
 'apecoin',
 'flow',
 'filecoin',
 'vechain',
 'chain-2',
 'quant-network',
 'internet-computer',
 'hedera-hashgraph',
 'frax',
 'chiliz',
 'terra-luna',
 'tezos',
 'the-sandbox',
 'decentraland',
 'eos',
 'axie-infinity',
 'theta-token',
 'elrond-erd-2']
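
The call above relies on the API's default page size and then slices the result. A slightly more explicit variant asks for exactly the number of coins needed. The following is a minimal sketch, not the only way to do it: `top_coin_ids` is a hypothetical helper, and `order`, `per_page`, and `page` are assumed to be forwarded by pycoingecko as query parameters to CoinGecko's markets endpoint. A small stand-in client is used so the example runs offline.

```python
def top_coin_ids(client, n=50, vs_currency="usd"):
    """Return the ids of the top `n` coins by market cap (hypothetical helper)."""
    markets = client.get_coins_markets(
        vs_currency=vs_currency,
        order="market_cap_desc",  # assumed to match the API's default ordering
        per_page=n,
        page=1,
    )
    # Keep only entries that carry an id, in the order the API returned them.
    return [coin["id"] for coin in markets[:n] if "id" in coin]


class FakeClient:
    """Offline stand-in that mimics the shape of the real response."""

    def get_coins_markets(self, vs_currency, **kwargs):
        coins = [
            {"id": "bitcoin", "market_cap_rank": 1},
            {"id": "ethereum", "market_cap_rank": 2},
            {"id": "tether", "market_cap_rank": 3},
        ]
        return coins[: kwargs.get("per_page", 100)]


print(top_coin_ids(FakeClient(), n=2))  # ['bitcoin', 'ethereum']
```

With the real client created earlier, the equivalent call would be `top_coin_ids(cg, n=50)`.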

## Data Collection

Let us get the data from Jan 01, 2015.

In [12]:
# Number of days of history to request (counting back from today)
n_days = (datetime.date.today() - datetime.date(2014,12,31)).days
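
As a quick check of the arithmetic, the same subtraction can be run with a fixed end date instead of `datetime.date.today()`. The end date below (2022-09-26) is taken from the `last_updated` field in the sample response shown earlier; `days_since` is a hypothetical helper name used only for illustration.

```python
import datetime


def days_since(start, end):
    """Number of days between two dates (end - start)."""
    return (end - start).days


start = datetime.date(2014, 12, 31)
end = datetime.date(2022, 9, 26)  # fixed date so the result is reproducible
n_days_example = days_since(start, end)
print(n_days_example)  # 2826
```

Starting the count at Dec 31, 2014 means the requested window reaches back far enough to include Jan 01, 2015.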

We will now use some helper functions to get the data.

* `get_date_information` - converts the timestamp from epoch time (in milliseconds) to a human-readable datetime
* `get_value_from_list` - each data point from the CoinGecko API is a `[timestamp, value]` pair, so this function extracts the value from the pair
* `get_historical_data_for_coin` - given a coin `id` and the number of days to pull, this function pulls the data and saves it as a CSV file
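
To make the shape of the data concrete: each entry under `prices`, `market_caps`, and `total_volumes` is a two-element list holding a timestamp in milliseconds and a value. The sketch below uses a made-up price and converts the timestamp to UTC explicitly. Using UTC is an assumption on our part; `to_utc_datetime` is a hypothetical helper, not one of the functions in this notebook.

```python
import datetime

# Illustrative entry: the timestamp is midnight UTC on Jan 1, 2015,
# the price is a made-up placeholder.
entry = [1420070400000, 300.0]


def to_utc_datetime(pair):
    """Convert the millisecond epoch in a [timestamp, value] pair to UTC."""
    return datetime.datetime.fromtimestamp(pair[0] / 1000, tz=datetime.timezone.utc)


print(to_utc_datetime(entry).isoformat())  # 2015-01-01T00:00:00+00:00
print(entry[1])                            # 300.0
```

Passing an explicit time zone makes the result independent of the machine the notebook runs on.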

In [13]:
def get_date_information(value):
    """Convert the epoch time (in milliseconds) of a [timestamp, value] pair to a datetime"""
    # fromtimestamp without a tz argument returns local time
    return datetime.datetime.fromtimestamp(value[0]/1000)

def get_value_from_list(value):
    """Get the value from a [timestamp, value] pair in the response"""
    return value[1]

def get_historical_data_for_coin(coin_name, n_days):
    """Get the historical data for a single coin and save it as a CSV file"""
    coin_data = cg.get_coin_market_chart_by_id(id=coin_name, vs_currency='usd', days=n_days)
    coin_df = pd.DataFrame(coin_data)
    coin_df = coin_df.iloc[:-1] # Remove today
    # Each column holds [timestamp, value] pairs; split them into plain columns
    coin_df["date"] = coin_df["prices"].apply(get_date_information)
    coin_df["price"] = coin_df["prices"].apply(get_value_from_list)
    coin_df["total_volume"] = coin_df["total_volumes"].apply(get_value_from_list)
    coin_df["market_cap"] = coin_df["market_caps"].apply(get_value_from_list)
    coin_df["coin_name"] = coin_name
    coin_df = coin_df.drop(["prices", "total_volumes", "market_caps"], axis=1)
    coin_df.to_csv(f"./{coin_name}.csv", index=False)
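
The function above assumes that the three series returned by the API line up row by row. A minimal sketch of the same flattening step in plain Python, joining the series on their timestamps instead of on position, might look like this. `chart_to_rows` is a hypothetical helper and the numbers are made up; the resulting list of dicts can be passed straight to `pd.DataFrame`.

```python
import datetime


def chart_to_rows(coin_name, chart):
    """Flatten a market-chart style response into one dict per timestamp."""
    caps = {ts: value for ts, value in chart["market_caps"]}
    volumes = {ts: value for ts, value in chart["total_volumes"]}
    rows = []
    for ts, price in chart["prices"]:
        rows.append({
            "date": datetime.datetime.fromtimestamp(ts / 1000, tz=datetime.timezone.utc),
            "price": price,
            "total_volume": volumes.get(ts),  # None if the series has a gap
            "market_cap": caps.get(ts),
            "coin_name": coin_name,
        })
    return rows


# Made-up sample with the same keys as the real response.
sample = {
    "prices": [[1420070400000, 300.0], [1420156800000, 310.0]],
    "market_caps": [[1420070400000, 4.0e9], [1420156800000, 4.1e9]],
    "total_volumes": [[1420070400000, 2.0e7]],
}
rows = chart_to_rows("bitcoin", sample)
print(len(rows), rows[1]["total_volume"])  # 2 None
```

Joining on the timestamp leaves a visible `None` where one series has a gap, rather than silently shifting values by a row.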

Now, let us pull the data for the top 50 coins and save them as csv files.

In [14]:
for coin_name in coins_list:
    get_historical_data_for_coin(coin_name, n_days)
    time.sleep(10) # Pause between requests so as not to overload the API
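
A fixed pause works, but a long loop can still hit a rate limit or a transient network error partway through. One option is to retry a failed call with an increasing wait. The sketch below is generic: `call_with_retry` is a hypothetical helper, it catches any `Exception` (the specific exception type raised by the client is not assumed here), and the `sleep` function is injectable so the example runs instantly.

```python
import time


def call_with_retry(func, *args, retries=3, base_wait=10, sleep=time.sleep):
    """Call func(*args); on failure wait base_wait, 2*base_wait, ... and retry."""
    for attempt in range(1, retries + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == retries:
                raise  # give up after the last attempt
            sleep(base_wait * attempt)


# Stand-in for a flaky API call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(coin_name):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return f"{coin_name}: ok"


waits = []  # records the requested waits instead of actually sleeping
result = call_with_retry(flaky_fetch, "bitcoin", sleep=waits.append)
print(result, waits)  # bitcoin: ok [10, 20]
```

In the loop above, this could be used as `call_with_retry(get_historical_data_for_coin, coin_name, n_days)`, keeping the `time.sleep(10)` between coins.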

Tada! Now we have the data for the top 50 tokens to do some analysis.

If you are looking to pull data for other tokens, get their `id` values from CoinGecko and use the functions above to pull the data.

We have created a dataset on Kaggle based on the above code, which can be accessed here - [Cryptocoins Historical Prices - CoinGecko](https://www.kaggle.com/datasets/sudalairajkumar/cryptocurrency-historical-prices-coingecko). It is updated on a daily basis, so feel free to use it for your analysis.
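
One way to find an `id` programmatically is to search the full coin list, which pycoingecko exposes as `cg.get_coins_list()`; each entry carries an `id`, a `symbol`, and a `name`. The sketch below runs on a small hand-written list in that shape so it works offline; `find_ids_by_symbol` is a hypothetical helper.

```python
def find_ids_by_symbol(coins, symbol):
    """Return every id whose ticker symbol matches (case-insensitive)."""
    wanted = symbol.lower()
    return [coin["id"] for coin in coins if coin["symbol"].lower() == wanted]


# Hand-written sample in the same shape as the coin list;
# with the real client this would be: coins = cg.get_coins_list()
coins = [
    {"id": "bitcoin", "symbol": "btc", "name": "Bitcoin"},
    {"id": "wrapped-bitcoin", "symbol": "wbtc", "name": "Wrapped Bitcoin"},
    {"id": "matic-network", "symbol": "matic", "name": "Polygon"},
]
print(find_ids_by_symbol(coins, "MATIC"))  # ['matic-network']
```

The helper returns a list because ticker symbols are not guaranteed to be unique, whereas the `id` is what the historical-data call expects.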