# **Crypto market capitalization forecast based on S&P 500.**

## **Abstract**
   Abstract here. Give an executive summary of your project: goal, methods, results, conclusions. Usually no more than 200 words.


## **Introduction**

Here you have to explain the problem that you are solving. Explain why it is important, and what are the main challenges. Mention previous attempts (add papers as references) to solve it. Mainly focus on the techniques closely related to our approach. Briefly describe your approach and explain why it is promising for solving the addressed problem. Mention the dataset and the main results achieved.

In this section, you can add **text** and **figures**.

## **Methodology**
Describe the important steps you took to achieve your goal. Focus more on the most important steps (preprocessing, extra features, model aspects) that turned out to be important. Mention the original aspects of the project and state how they relate to existing work.

In this section, you can add **text** and **figures**. For instance, it is strongly suggested to add a picture of the best machine learning model that you implemented to solve your problem (and describe it).


### **Preprocessing**

The first step in our methodology was preprocessing the raw data from three CSV files: one containing cryptocurrency market data (e.g., Aave), one containing S&P 500 historical data, and one containing the VIX, the fear index of the S&P 500. For the cryptocurrency data, we focused on key features such as Date, Volume, and Marketcap. Similarly, for the S&P 500 data, we retained relevant columns like Date, Open, High, Low, Close, and Volume, along with the corresponding VIX values. The datasets were cleaned to remove unwanted data and handle any missing values, and the Date columns were standardized so that the datasets can be merged.

First, let's import the libraries that we need for the project by running the code below.


In [42]:
import os
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Once the libraries are imported, we can load the data and look at the first few rows, along with the date range covered by each dataset, by running the code below.

In [43]:
KAGGLE_DATA_PATH = 'KaggleData/'

def load_data(filename: str, date_col: str, date_format: str) -> pd.DataFrame:
    """
    Loads a CSV file into a pandas DataFrame and parses the date column.

    Args:
        filename (str): Name of the CSV file.
        date_col (str): Name of the date column.
        date_format (str): Format of the date in the CSV.

    Returns:
        pd.DataFrame: Processed DataFrame with parsed dates.
    """
    filepath = os.path.join(KAGGLE_DATA_PATH, filename)
    df = pd.read_csv(filepath)
    df[date_col] = pd.to_datetime(df[date_col], format=date_format)
    return df

crypto_df = load_data('All_Crypto.csv', 'Date', '%d-%m-%Y %H:%M')
stock_df = load_data('spy.csv', 'Date', '%Y-%m-%d')
vix_df = load_data('vix_daily.csv', 'Date', '%Y-%m-%d')

for name, df in zip(["Crypto", "Stock", "VIX"], [crypto_df, stock_df, vix_df]):
    print(f"\n{name} Dataset Preview:")
    print(df.head())

for name, df in zip(["Crypto", "Stock", "VIX"], [crypto_df, stock_df, vix_df]):
    print(f"\n{name} Dataset ==> Min Date: {df['Date'].min()} / Max Date: {df['Date'].max()}")



Crypto Dataset Preview:
   Sno  Name Symbol                Date       High        Low       Open  \
0    1  Aave   AAVE 2020-10-05 23:59:00  55.112358  49.787900  52.675035   
1    2  Aave   AAVE 2020-10-06 23:59:00  53.402270  40.734578  53.291969   
2    3  Aave   AAVE 2020-10-07 23:59:00  42.408314  35.970690  42.399947   
3    4  Aave   AAVE 2020-10-08 23:59:00  44.902511  36.696057  39.885262   
4    5  Aave   AAVE 2020-10-09 23:59:00  47.569533  43.291776  43.764463   

       Close        Volume     Marketcap  
0  53.219243  0.000000e+00  8.912813e+07  
1  42.401599  5.830915e+05  7.101144e+07  
2  40.083976  6.828342e+05  6.713004e+07  
3  43.764463  1.658817e+06  2.202651e+08  
4  46.817744  8.155377e+05  2.356322e+08  

Stock Dataset Preview:
        Date       Open       High        Low      Close   Volume  Day  \
0 1993-01-29  24.543517  24.543517  24.421410  24.526073  1003200   29   
1 1993-02-01  24.543515  24.700510  24.543515  24.700510   480500    1   
2 1993-02-02  
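
The `date_format` argument passed to `load_data` matters because day-first strings are ambiguous. Below is a minimal sketch of that point, assuming the raw crypto file stores dates in the `%d-%m-%Y %H:%M` layout used above; the sample strings are made up for illustration.

```python
import pandas as pd

# Hypothetical raw strings in the day-first layout assumed for the crypto CSV
raw = pd.Series(["05-10-2020 23:59", "06-10-2020 23:59"])

# An explicit format removes the day/month ambiguity:
# "05-10-2020" is read as 5 October, not 10 May
parsed = pd.to_datetime(raw, format="%d-%m-%Y %H:%M")

print(parsed.dt.month.tolist())  # [10, 10]
print(parsed.dt.day.tolist())    # [5, 6]
```

With an explicit format, a string that does not match raises an error instead of being silently misread.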

#### Crypto data

Let us focus on preprocessing the crypto data first. We can tackle the S&P 500 data afterwards.


Let us see how many crypto coins we are dealing with by running the following code.

In [None]:
unique_crypto_names = crypto_df['Name'].unique()
print(f"Total Unique Cryptocurrencies: {len(unique_crypto_names)}")
print(", ".join(unique_crypto_names)) 


Total Unique Cryptocurrencies: 23
Aave, Binance Coin, Bitcoin, Cardano, Chainlink, Cosmos, Crypto.com Coin, Dogecoin, EOS, Ethereum, IOTA, Litecoin, Monero, NEM, Polkadot, Solana, Stellar, Tether, TRON, Uniswap, USD Coin, XRP, Wrapped Bitcoin


There are 23 cryptocurrencies in the data we are using. Although we would have preferred data on all of the top 100 cryptocurrencies, this is sufficient for the scope of this project.  
However, we want to discard some of them: stablecoins and currencies that are not in the top 100.  
We discard stablecoins because they are designed to maintain a fixed value (usually pegged to the USD), so changes in their market cap are driven primarily by issuance and redemption rather than by market speculation or macroeconomic factors. Including them might introduce noise rather than meaningful predictive signals.  
In this case, we are discarding `Crypto.com Coin`, `NEM`, `Tether`, `USD Coin`, and `Wrapped Bitcoin`.

In [45]:
EXCLUDE_LIST = {'Crypto.com Coin', 'NEM', 'Tether', 'USD Coin', 'Wrapped Bitcoin'}
crypto_df_filtered = crypto_df[~crypto_df['Name'].isin(EXCLUDE_LIST)]
print(f"Remaining Cryptocurrencies After Filtering: {crypto_df_filtered['Name'].nunique()}")


Remaining Cryptocurrencies After Filtering: 18
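
The exclusion list above was chosen by hand. As a rough cross-check, one could flag coins whose price barely moves, since a USD-pegged coin should have a nearly constant closing price. The sketch below uses made-up prices and coin names, and the `PEG_THRESHOLD` value is a hypothetical cut-off, not something tuned on our data.

```python
import pandas as pd

# Toy prices (made up for illustration): one pegged coin, one free-floating coin
toy = pd.DataFrame({
    "Name":  ["PeggedCoin"] * 4 + ["FloatingCoin"] * 4,
    "Close": [1.00, 1.01, 0.99, 1.00, 50.0, 42.0, 40.0, 47.0],
})

# Coefficient of variation (std / mean) of the closing price per coin
cv = toy.groupby("Name")["Close"].agg(lambda s: s.std() / s.mean())

# Hypothetical threshold: coins whose price barely moves are treated as pegged
PEG_THRESHOLD = 0.02
likely_pegged = cv[cv < PEG_THRESHOLD].index.tolist()
print(likely_pegged)  # ['PeggedCoin']
```

This check only catches coins pegged to a fixed value. A coin pegged to another asset, such as Wrapped Bitcoin, follows that asset's price and would not be flagged.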


We should now look at the completeness of the data over a date range that I chose as an educated guess (2018-01-01 to 2021-06-07).  
Running the code below shows that some cryptocurrencies were created after 2018, so they have no data between the start of 2018 and their launch.  
This is an issue.  

In [None]:
date_range = pd.date_range(start='2018-01-01', end='2021-06-07', freq='D')

recent_cryptos = []
original_cryptos = []

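# Group by 'Name' and check for missing dates
for name, group in crypto_df_filtered.groupby('Name'):
    # Strip the time of day so that each row maps to its calendar date
    unique_dates = pd.DatetimeIndex(group['Date'].dt.normalize().unique())
    # Index.difference compares timestamps directly, whatever array type unique() returns
    missing_dates = date_range.difference(unique_dates)

    (recent_cryptos if len(missing_dates) > 0 else original_cryptos).append(name)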

print(f"Cryptos with complete data: {len(original_cryptos)} -> {original_cryptos}")
print(f"Cryptos missing some dates: {len(recent_cryptos)} -> {recent_cryptos}")


Cryptos with complete data: 13 -> ['Binance Coin', 'Bitcoin', 'Cardano', 'Chainlink', 'Dogecoin', 'EOS', 'Ethereum', 'IOTA', 'Litecoin', 'Monero', 'Stellar', 'TRON', 'XRP']
Cryptos missing some dates: 5 -> ['Aave', 'Cosmos', 'Polkadot', 'Solana', 'Uniswap']
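
The split above only tells us which coins are incomplete, not by how much. A small coverage table makes that easier to see. This is a sketch on made-up data (the coin names and the five-day window are invented), assuming the same `Name` and `Date` columns as our crypto data.

```python
import pandas as pd

# Toy data (made up): OldCoin covers the whole window, NewCoin starts late
window = pd.date_range("2021-01-01", "2021-01-05", freq="D")
toy = pd.DataFrame({
    "Name": ["OldCoin"] * 5 + ["NewCoin"] * 2,
    "Date": list(window) + list(window[3:]),
})

# Number of calendar days in the window with no row, per coin
missing = {
    name: len(window.difference(pd.DatetimeIndex(group["Date"]).normalize()))
    for name, group in toy.groupby("Name")
}

# First and last available date per coin, plus the missing-day count
coverage = toy.groupby("Name")["Date"].agg(["min", "max"])
coverage["missing_days"] = pd.Series(missing)
print(coverage)
```

Here `NewCoin` is missing the first three days of the window, which is the pattern we expect for a coin launched after the window starts.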


Following this, I decided to run the project as two separate experiments:  
1. Include only the cryptocurrencies whose data starts in 2018 or earlier and runs to 2021.  
2. Include all cryptocurrencies, but only from the launch date of the most recently created one to 2021.  

This means that `option 1` has fewer cryptocurrencies but a longer time frame, while `option 2` has more cryptocurrencies over a shorter time frame.

Let us create the dataset for `option 1`. (Full time frame)

In [None]:
min_start_dates = crypto_df_filtered.loc[crypto_df_filtered['Name'].isin(original_cryptos)] \
                                   .groupby('Name')['Date'].min()
latest_common_start = min_start_dates.max()

max_date = crypto_df['Date'].max()

full_time_frame_crypto_df = crypto_df_filtered[
    (crypto_df_filtered['Name'].isin(original_cryptos)) &
    (crypto_df_filtered['Date'].between(latest_common_start, max_date))
]

print(f"Dataset includes cryptos from {latest_common_start.date()} to {max_date.date()} "
      f"with {full_time_frame_crypto_df['Name'].nunique()} cryptos.")


Dataset includes cryptos from 2017-10-02 to 2021-07-06 with 13 cryptos.


Let us now create the dataset for `option 2`. (Partial time frame)

In [51]:
latest_start_date = crypto_df_filtered.groupby('Name')['Date'].min().max()
partial_time_frame_crypto_df = crypto_df_filtered[crypto_df_filtered['Date'] >= latest_start_date]

print(f"Dataset includes cryptos from {latest_start_date.date()} to {partial_time_frame_crypto_df['Date'].max().date()} "
      f"with {partial_time_frame_crypto_df['Name'].nunique()} cryptos.")



Dataset includes cryptos from 2020-10-05 to 2021-07-06 with 18 cryptos.
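
Both options rely on the same idea: take the first available date of each coin, then keep the latest of those dates as the common start. Here is a toy illustration with two invented coins, `A` and `B`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03",
                            "2020-01-02", "2020-01-03"]),
})

first_dates = toy.groupby("Name")["Date"].min()  # first observation per coin
common_start = first_dates.max()                 # latest of those first dates

# From common_start onward, every coin in the toy data has a row for each day
trimmed = toy[toy["Date"] >= common_start]
coins_per_day = trimmed.groupby("Date")["Name"].nunique()
print(common_start.date(), coins_per_day.tolist())  # 2020-01-02 [2, 2]
```

Note that this only guarantees full coverage if no coin has gaps after its first date, so a per-day coin count like `coins_per_day` is a useful extra check.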


#### S&P500 Data

Let us now process the stock market data.  
From the steps above, we know that the stock data must match the date ranges of our crypto data. This means we will have a `full_time_frame_stock_df` and a `partial_time_frame_stock_df`, ranging from `2017-10-02` to `2021-07-06` and from `2020-10-05` to `2021-07-06` respectively.

We also need to keep in mind that the stock market is closed on weekends. For this project, we therefore assume that the last available price (Friday’s) carries over to Saturday and Sunday, since stock prices do not change while the market is closed.  
We use forward fill to accomplish this, which keeps the dataset aligned with the daily crypto data. Any other non-trading day inside the range, such as a market holiday, is filled the same way.  
The same is done for the VIX data frame.
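
To make the forward fill concrete, here is a toy example with made-up closing prices for a Friday and the following Monday. It uses the same `reindex` plus `ffill` pattern, just on a four-day window.

```python
import pandas as pd

# Toy closes (made up) for a Friday and the following Monday
trading = pd.DataFrame(
    {"Close": [100.0, 102.0]},
    index=pd.to_datetime(["2021-01-08", "2021-01-11"]),  # Fri, Mon
)

# Expand to every calendar day, then carry the last known value forward
calendar = pd.date_range("2021-01-08", "2021-01-11", freq="D")
daily = trading.reindex(calendar).ffill()
print(daily["Close"].tolist())  # [100.0, 100.0, 100.0, 102.0]
```

One caveat: forward fill needs a previous value to copy, so if the window started on a non-trading day, the leading rows would stay empty.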

In [56]:
full_start_date, partial_start_date, end_date = "2017-10-02", "2020-10-05", "2021-07-06"

def handle_closed_weekends(df: pd.DataFrame, start_date: str, end_date: str, date_col: str = 'Date') -> pd.DataFrame:
    """
    Filters a dataframe within a date range, expands it to include weekends,
    and forward-fills missing values.
    
    Args:
        df (pd.DataFrame): The original dataframe with a date column.
        start_date (str): The start date for filtering.
        end_date (str): The end date for filtering.
        date_col (str): The name of the date column. Defaults to 'Date'.
    
    Returns:
        pd.DataFrame: Processed data with weekends included and missing values filled.
    """
    filtered_df = df[(df[date_col] >= start_date) & (df[date_col] <= end_date)].copy()

    date_range = pd.date_range(start=start_date, end=end_date, freq='D')

    # Reindex to ensure weekends are included, then forward-fill missing values
    return (
        filtered_df.set_index(date_col)
        .reindex(date_range)
        .ffill()
        .reset_index()
        .rename(columns={'index': date_col})
    )

In [57]:
# Preprocess stock data
full_time_frame_stock_df = handle_closed_weekends(stock_df, full_start_date, end_date)
partial_time_frame_stock_df = handle_closed_weekends(stock_df, partial_start_date, end_date)

# Print results
print(f"Full S&P500 dataset: {full_time_frame_stock_df['Date'].min()} to {full_time_frame_stock_df['Date'].max()} "
      f"with {len(full_time_frame_stock_df)} rows")
print(f"Partial S&P500 dataset: {partial_time_frame_stock_df['Date'].min()} to {partial_time_frame_stock_df['Date'].max()} "
      f"with {len(partial_time_frame_stock_df)} rows")

Full S&P500 dataset: 2017-10-02 00:00:00 to 2021-07-06 00:00:00 with 1374 rows
Partial S&P500 dataset: 2020-10-05 00:00:00 to 2021-07-06 00:00:00 with 275 rows


In [58]:
# Preprocess VIX data
full_time_frame_vix_df = handle_closed_weekends(vix_df, full_start_date, end_date)
partial_time_frame_vix_df = handle_closed_weekends(vix_df, partial_start_date, end_date)

# Print results
print(f"Full VIX dataset: {full_time_frame_vix_df['Date'].min()} to {full_time_frame_vix_df['Date'].max()} "
      f"with {len(full_time_frame_vix_df)} rows")
print(f"Partial VIX dataset: {partial_time_frame_vix_df['Date'].min()} to {partial_time_frame_vix_df['Date'].max()} "
      f"with {len(partial_time_frame_vix_df)} rows")

Full VIX dataset: 2017-10-02 00:00:00 to 2021-07-06 00:00:00 with 1374 rows
Partial VIX dataset: 2020-10-05 00:00:00 to 2021-07-06 00:00:00 with 275 rows
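
The row counts printed above can be cross-checked against the calendar: after the reindexing, each frame should have exactly one row per day between its start and end date, both included. A quick sketch of that check:

```python
import pandas as pd

# Number of calendar days in each window, both endpoints included
full_days = len(pd.date_range("2017-10-02", "2021-07-06", freq="D"))
partial_days = len(pd.date_range("2020-10-05", "2021-07-06", freq="D"))
print(full_days, partial_days)  # 1374 275
```

Both values match the 1374 and 275 rows reported for the stock and VIX data frames.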


We can now save all the data frames as CSV files, which might be useful later.

In [60]:
PROCESSED_DATA_PATH = 'ProcessedData/'
os.makedirs(PROCESSED_DATA_PATH, exist_ok=True)

def save_to_csv(df: pd.DataFrame, filename: str) -> None:
    """
    Saves a DataFrame to a CSV file in the processed data directory.
    
    Args:
        df (pd.DataFrame): The DataFrame to save.
        filename (str): The name of the CSV file.
    """
    filepath = os.path.join(PROCESSED_DATA_PATH, filename)
    df.to_csv(filepath, index=False)
    print(f"Saved {filename} to {filepath}")

In [61]:
# Save all processed DataFrames
save_to_csv(full_time_frame_crypto_df, 'full_time_frame_crypto_data.csv')
save_to_csv(partial_time_frame_crypto_df, 'partial_time_frame_crypto_data.csv')
save_to_csv(full_time_frame_stock_df, 'full_time_frame_stock_data.csv')
save_to_csv(partial_time_frame_stock_df, 'partial_time_frame_stock_data.csv')
save_to_csv(full_time_frame_vix_df, 'full_time_frame_vix_data.csv')
save_to_csv(partial_time_frame_vix_df, 'partial_time_frame_vix_data.csv')

Saved full_time_frame_crypto_data.csv to ProcessedData/full_time_frame_crypto_data.csv
Saved partial_time_frame_crypto_data.csv to ProcessedData/partial_time_frame_crypto_data.csv
Saved full_time_frame_stock_data.csv to ProcessedData/full_time_frame_stock_data.csv
Saved partial_time_frame_stock_data.csv to ProcessedData/partial_time_frame_stock_data.csv
Saved full_time_frame_vix_data.csv to ProcessedData/full_time_frame_vix_data.csv
Saved partial_time_frame_vix_data.csv to ProcessedData/partial_time_frame_vix_data.csv
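
One thing to keep in mind when reloading these files later: CSV does not store column types, so the `Date` column comes back as text unless we ask pandas to parse it. A minimal round-trip sketch, using a temporary directory and a hypothetical file name rather than our `ProcessedData/` folder:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-07-05", "2021-07-06"]),
    "Close": [1.0, 2.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")  # hypothetical file name
    df.to_csv(path, index=False)

    plain = pd.read_csv(path)                        # Date is read back as text
    typed = pd.read_csv(path, parse_dates=["Date"])  # Date is parsed as datetime

print(plain["Date"].dtype, typed["Date"].dtype)
```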


#### Merging the datasets

Perfect, now that we have cleaned the data, we still have a couple more steps before starting to play with models.  
Steps:  
- Extract the global volume and global market cap for each day in the crypto data.
- Merge the stock, VIX, and crypto market data together.

In [62]:
def aggregate_crypto_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Aggregates crypto data by date, summing 'Volume' and 'Marketcap',
    and renames the columns for clarity.
    
    Args:
        df (pd.DataFrame): The input DataFrame with 'Date', 'Volume', and 'Marketcap' columns.
    
    Returns:
        pd.DataFrame: Aggregated DataFrame with 'Date', 'Total_Volume', and 'Total_Market_Cap'.
    """
    aggregated_df = (
        df.groupby('Date', as_index=False)
        .agg({'Volume': 'sum', 'Marketcap': 'sum'})
        .rename(columns={'Volume': 'Total_Volume', 'Marketcap': 'Total_Market_Cap'})
    )
    # Drop the time of day so that 'Date' holds plain calendar dates
    aggregated_df['Date'] = aggregated_df['Date'].dt.date
    return aggregated_df

In [66]:
# Aggregate full and partial time frame crypto data
full_time_frame_crypto_aggregated_df = aggregate_crypto_data(full_time_frame_crypto_df)
partial_time_frame_crypto_aggregated_df = aggregate_crypto_data(partial_time_frame_crypto_df)

print("Full time frame crypto aggregate data: \n", full_time_frame_crypto_aggregated_df.head())
print("\nPartial time frame crypto aggregate data: \n", partial_time_frame_crypto_aggregated_df.head(), '\n')

save_to_csv(full_time_frame_crypto_aggregated_df, 'full_time_frame_aggregate_crypto_data.csv')
save_to_csv(partial_time_frame_crypto_aggregated_df, 'partial_time_frame_aggregate_crypto_data.csv')

Full time frame crypto aggregate data: 
          Date  Total_Volume  Total_Market_Cap
0  2017-10-02  2.084329e+09      1.166802e+11
1  2017-10-03  1.857098e+09      1.145126e+11
2  2017-10-04  1.648587e+09      1.134182e+11
3  2017-10-05  1.965073e+09      1.162994e+11
4  2017-10-06  1.711230e+09      1.180752e+11

Partial time frame crypto aggregate data: 
          Date  Total_Volume  Total_Market_Cap
0  2020-10-05  7.121521e+10      2.789262e+11
1  2020-10-06  7.336479e+10      2.719265e+11
2  2020-10-07  6.273427e+10      2.734531e+11
3  2020-10-08  9.018118e+10      2.803745e+11
4  2020-10-09  4.923914e+10      2.861705e+11 

Saved full_time_frame_aggregate_crypto_data.csv to ProcessedData/full_time_frame_aggregate_crypto_data.csv
Saved partial_time_frame_aggregate_crypto_data.csv to ProcessedData/partial_time_frame_aggregate_crypto_data.csv
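
To show what the aggregation does, here is a toy example with two invented coins over two days. As a variant of the approach above, this sketch groups on the calendar day (time stripped) rather than on the raw timestamp, so rows stamped at slightly different times on the same day would still be summed together. That is an assumption about what we want, not a description of the function above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01 23:59", "2021-01-01 23:59",
                            "2021-01-02 23:59", "2021-01-02 23:59"]),
    "Name": ["A", "B", "A", "B"],
    "Volume": [10.0, 5.0, 20.0, 5.0],
    "Marketcap": [100.0, 50.0, 110.0, 40.0],
})

# Sum volume and market cap over all coins for each calendar day
daily = (
    toy.groupby(toy["Date"].dt.normalize())[["Volume", "Marketcap"]]
    .sum()
    .rename(columns={"Volume": "Total_Volume", "Marketcap": "Total_Market_Cap"})
    .reset_index()
)
print(daily)
```

In the toy data, the total market cap is 150 on both days even though the individual coins moved, which is exactly the kind of information the per-coin rows hide.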


In [67]:
def process_dataset(df: pd.DataFrame, dataset_type: str) -> pd.DataFrame:
    """
    Processes a dataset by converting the 'Date' column to datetime,
    dropping unnecessary columns, and renaming columns for clarity.
    
    Args:
        df (pd.DataFrame): The input DataFrame.
        dataset_type (str): The type of dataset ('crypto', 'stock', or 'vix').
    
    Returns:
        pd.DataFrame: Processed DataFrame.
    """
    # Work on a copy so that the caller's DataFrame is not modified in place
    df = df.copy()
    # Convert 'Date' column to datetime
    df['Date'] = pd.to_datetime(df['Date'])
    
    # Process based on dataset type
    if dataset_type == 'stock':
        df = df.drop(['Day', 'Weekday', 'Week', 'Month', 'Year'], axis=1)
        df = df.rename(columns={
            'Open': 'S&P500_Open',
            'High': 'S&P500_High',
            'Low': 'S&P500_Low',
            'Close': 'S&P500_Close',
            'Volume': 'S&P500_Volume'
        })
    elif dataset_type == 'crypto':
        df = df.rename(columns={
            'Total_Volume': 'Crypto_Volume',
            'Total_Market_Cap': 'Crypto_Market_Cap'
        })
    elif dataset_type == 'vix':
        df = df.rename(columns={
            'Open': 'VIX_Open',
            'High': 'VIX_High',
            'Low': 'VIX_Low',
            'Close': 'VIX_Close'
        })
    else:
        raise ValueError(f"Invalid dataset type: {dataset_type}. Expected 'crypto', 'stock', or 'vix'.")
    
    return df

In [68]:
full_time_frame_crypto_aggregated_df = process_dataset(full_time_frame_crypto_aggregated_df, 'crypto')
full_time_frame_stock_df = process_dataset(full_time_frame_stock_df, 'stock')
full_time_frame_vix_df = process_dataset(full_time_frame_vix_df, 'vix')

partial_time_frame_crypto_aggregated_df = process_dataset(partial_time_frame_crypto_aggregated_df, 'crypto')
partial_time_frame_stock_df = process_dataset(partial_time_frame_stock_df, 'stock')
partial_time_frame_vix_df = process_dataset(partial_time_frame_vix_df, 'vix')

Let us now merge the crypto, stock, and VIX market data!

In [69]:
merged_df1 = pd.merge(full_time_frame_crypto_aggregated_df, full_time_frame_stock_df, on='Date', how='inner')
full_time_merged_df = pd.merge(merged_df1, full_time_frame_vix_df, on='Date', how='inner')

merged_df2 = pd.merge(partial_time_frame_crypto_aggregated_df, partial_time_frame_stock_df, on='Date', how='inner')
partial_time_merged_df = pd.merge(merged_df2, partial_time_frame_vix_df, on='Date', how='inner')

print('Merged data for full time: \n', full_time_merged_df.head())
print('Merged data for partial time: \n', partial_time_merged_df.head())

Merged data for full time: 
         Date  Crypto_Volume  Crypto_Market_Cap  S&P500_Open  S&P500_High  \
0 2017-10-02   2.084329e+09       1.166802e+11   223.421919   224.159286   
1 2017-10-03   1.857098e+09       1.145126e+11   224.159281   224.665658   
2 2017-10-04   1.648587e+09       1.134182e+11   224.487941   225.154236   
3 2017-10-05   1.965073e+09       1.162994e+11   225.243074   226.255841   
4 2017-10-06   1.711230e+09       1.180752e+11   225.784976   226.273594   

   S&P500_Low  S&P500_Close  S&P500_Volume  VIX_Open  VIX_High  VIX_Low  \
0  223.244229    224.159286     59023000.0      9.59     10.04     9.37   
1  224.079316    224.639008     66810200.0      9.30      9.75     9.30   
2  224.372446    224.905487     55953600.0      9.53      9.88     9.53   
3  224.941024    226.238083     63522800.0      9.48      9.62     9.13   
4  225.518469    225.980423     80646000.0      9.23     10.27     9.11   

   VIX_Close  
0       9.45  
1       9.51  
2       9.63  
3  

In [70]:
# Save the merged data
save_to_csv(full_time_merged_df, 'full_time_merged_data.csv')
save_to_csv(partial_time_merged_df, 'partial_time_merged_data.csv')

Saved full_time_merged_data.csv to ProcessedData/full_time_merged_data.csv
Saved partial_time_merged_data.csv to ProcessedData/partial_time_merged_data.csv
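
Since an inner join silently drops any day that is missing from one side, it may be worth validating the merged frames before modelling. The sketch below uses small made-up frames with the same column naming as ours. The `validate` argument of `pd.merge` makes pandas raise an error if a date appears more than once on either side.

```python
import pandas as pd

dates = pd.date_range("2021-01-01", periods=3, freq="D")
crypto = pd.DataFrame({"Date": dates, "Crypto_Market_Cap": [1.0, 2.0, 3.0]})
stock = pd.DataFrame({"Date": dates, "S&P500_Close": [10.0, 10.0, 11.0]})

# validate="one_to_one" raises a MergeError if a date is duplicated on either side
merged = pd.merge(crypto, stock, on="Date", how="inner", validate="one_to_one")

# An inner join drops unmatched days, so compare row counts explicitly
assert len(merged) == len(crypto) == len(stock)
# No missing values should remain after the forward fill and the merge
assert not merged.isna().any().any()
print(merged.shape)  # (3, 3)
```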


**And we're done with the preprocessing part!**

## **Experimental Setup**
Describe the datasets used for your experiments. List the machine learning techniques used to solve your problem and report the corresponding hyperparameters.

In this section, you can add **text**, **tables**, and **figures**.

## **Experimental Results**
Describe here the main experimental results. Critically discuss them. Compare them with results available in the literature (if applicable).

In this section, you can add **text** and **figures**, **tables**, **plots**, and code. Make sure the code is runnable and replicable.

## **Conclusions**

Summarize what you could and could not conclude based on your experiments.
In this section, you can add **text**.



## **References**
You can add here the citations of books, websites, or academic papers, etc.