# **Crypto market capitalization forecast based on S&P 500.**

## **Abstract**
   Abstract here. Give an executive summary of your project: goal, methods, results, conclusions. Usually no more than 200 words.


## **Introduction**

Here you have to explain the problem that you are solving. Explain why it is important, and what are the main challenges. Mention previous attempts (add papers as references) to solve it. Mainly focus on the techniques closely related to our approach. Briefly describe your approach and explain why it is promising for solving the addressed problem. Mention the dataset and the main results achieved.

In this section, you can add **text** and **figures**.

## **Methodology**
Describe the important steps you took to achieve your goal. Focus more on the most important steps (preprocessing, extra features, model aspects) that turned out to be important. Mention the original aspects of the project and state how they relate to existing work.

In this section, you can add **text** and **figures**. For instance, it is strongly suggested to add a picture of the best machine learning model that you implemented to solve your problem (and describe it).


### **Preprocessing**

The first step in our methodology was preprocessing the raw data from two sources: Kaggle and CoinCodex. We use the Kaggle data for everything related to the S&P 500 and CoinCodex for everything related to crypto. For the cryptocurrency data, we focused on key features such as Date, Volume, and Marketcap. Similarly, for the S&P 500 data, we retained relevant columns like Date, Open, High, Low, Close, Volume, and additional information on the fear index (VIX). The datasets were cleaned to handle any missing values and remove unwanted data, and the Date columns were standardized so that the datasets can be merged.

First, let's import the libraries we need for the project and define some constants.
Run the code below.


In [51]:
import os
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [61]:
STOCK_DATA_PATH = 'Data/StockData/'
CRYPTO_DATA_PATH_RAW = 'Data/CryptoData/RawData/'
CRYPTO_DATA_PATH_PROCESSED = 'Data/CryptoData/PreProcessedData/'
KAGGLE_DATA_PATH = 'Data/KaggleData/'

Once the libraries are imported, we can load the S&P 500 and VIX data and look at the first few rows, along with the date range each dataset covers, by running the code below.

In [62]:
def load_data(filename: str, date_col: str, date_format: str) -> pd.DataFrame:
    """
    Loads a CSV file into a pandas DataFrame and parses the date column.

    Args:
        filename (str): Name of the CSV file.
        date_col (str): Name of the date column.
        date_format (str): Format of the date in the CSV.

    Returns:
        pd.DataFrame: Processed DataFrame with parsed dates.
    """
    filepath = os.path.join(STOCK_DATA_PATH, filename)
    df = pd.read_csv(filepath)
    df[date_col] = pd.to_datetime(df[date_col], format=date_format)
    return df

stock_df = load_data('S&P500_Historical_Data.csv', 'Date', '%Y-%m-%d')
vix_df = load_data('VIX_Historical_Data.csv', 'Date', '%m/%d/%Y')

for name, df in zip(["Stock", "VIX"], [stock_df, vix_df]):
    print(f"\n{name} Dataset ==> Min Date: {df['Date'].min()} / Max Date: {df['Date'].max()}")



Stock Dataset ==> Min Date: 2017-01-03 00:00:00 / Max Date: 2025-04-04 00:00:00

VIX Dataset ==> Min Date: 1990-01-02 00:00:00 / Max Date: 2025-04-04 00:00:00


In [63]:
stock_df.head()

Unnamed: 0,Date,Price,Open,High,Low,Vol.,Change %
0,2025-04-04,5074.08,5292.14,5292.14,5069.9,,-5.97%
1,2025-04-03,5396.52,5492.74,5499.53,5390.83,,-4.84%
2,2025-04-02,5670.97,5580.76,5695.31,5571.48,,0.67%
3,2025-04-01,5633.07,5597.53,5650.57,5558.52,,0.38%
4,2025-03-31,5611.85,5527.91,5627.56,5488.73,,0.55%


In [64]:
vix_df.head()

Unnamed: 0,Date,Open,High,Low,Close
0,1990-01-02,17.24,17.24,17.24,17.24
1,1990-01-03,18.19,18.19,18.19,18.19
2,1990-01-04,19.22,19.22,19.22,19.22
3,1990-01-05,20.11,20.11,20.11,20.11
4,1990-01-08,20.26,20.26,20.26,20.26
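
The two datasets cover very different periods (the VIX data starts in 1990, the S&P 500 data in 2017), so they have to be aligned on `Date` before they can be used together. Below is a minimal sketch of one way to do that, using small toy frames with made-up values rather than the real files. The `VIX_Close` column name and the choice of an inner join are assumptions for illustration, not necessarily what the project does later.

```python
import pandas as pd

# Toy data for illustration only (not real prices).
stock_toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-03", "2020-01-02"]),  # newest first, like stock_df
    "Price": [101.0, 100.0],
})
vix_toy = pd.DataFrame({
    "Date": pd.to_datetime(["2019-12-31", "2020-01-02", "2020-01-03"]),
    "Close": [14.0, 12.5, 13.0],
})

# Rename the VIX close so it is unambiguous after the merge (hypothetical name).
vix_toy = vix_toy.rename(columns={"Close": "VIX_Close"})

# An inner join keeps only the dates present in both frames.
merged_toy = (
    stock_toy.merge(vix_toy[["Date", "VIX_Close"]], on="Date", how="inner")
    .sort_values("Date")
    .reset_index(drop=True)
)
print(merged_toy)
```

With an inner join, the VIX rows that predate the stock data are dropped automatically, so no separate date filter is needed for this step.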


Great, let's now tackle our raw crypto data.  
After some research and many hours spent looking for the best balance between the variety of cryptos and the amount of usable data, we decided to include 14 cryptos from the top 100 whose data stretches from 2017 to 2025.  

Let us take a look at an example of crypto data that we have:

**Note**: We have a CSV for every crypto. (14 CSVs total)

In [65]:
btc_df = pd.read_csv(os.path.join(CRYPTO_DATA_PATH_RAW, 'BTC.csv'))
btc_df.head()

Unnamed: 0,Start,End,Open,High,Low,Close,Volume,Market Cap
0,2025-04-06,2025-04-07,83533.45,83704.76,77296.39,78310.34,29747690000.0,1626852000000.0
1,2025-04-05,2025-04-06,83769.12,84219.7,82384.97,83582.03,54248860000.0,1654110000000.0
2,2025-04-04,2025-04-05,83259.08,84676.27,81767.53,83879.86,62632260000.0,1654911000000.0
3,2025-04-03,2025-04-04,82259.03,83781.7,81307.75,83199.95,77668430000.0,1643472000000.0
4,2025-04-02,2025-04-03,85170.68,87898.01,82487.4,82548.31,52376110000.0,1688190000000.0


Let us now standardize the timeframe. Looking at all the CSVs, the range should run from 2018-01-18 to 2025-04-04, where the end date matches the latest date in the stock market data.

In [68]:
date_range = pd.date_range(start='2018-1-18', end='2025-04-04', freq='D')

for filename in os.listdir(CRYPTO_DATA_PATH_RAW):
    try:
        raw_path = os.path.join(CRYPTO_DATA_PATH_RAW, filename)
        df = pd.read_csv(raw_path)
        df['Start'] = pd.to_datetime(df['Start'])
        # Keep only the rows inside the common date range, oldest first.
        df = df[df['Start'].isin(date_range)]
        df = df.sort_values('Start')

        # Save the trimmed data under the same filename in the processed folder.
        processed_path = os.path.join(CRYPTO_DATA_PATH_PROCESSED, filename)
        df.to_csv(processed_path, index=False)

    except Exception as e:
        print(f"Error processing {filename}: {str(e)}")

print("All files processed!")

All files processed!
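
The same trimming can also be written as a boolean mask on the two end dates. The sketch below uses a toy frame with made-up values. For daily rows stamped at midnight it selects the same rows as `isin(date_range)`; the variable names are illustrative only.

```python
import pandas as pd

# Toy frame for illustration only; dates chosen to straddle both bounds.
toy = pd.DataFrame({
    "Start": pd.to_datetime(
        ["2018-01-17", "2018-01-18", "2021-06-01", "2025-04-04", "2025-04-05"]
    ),
    "Close": [1.0, 2.0, 3.0, 4.0, 5.0],
})

start, end = pd.Timestamp("2018-01-18"), pd.Timestamp("2025-04-04")

# Series.between is inclusive on both ends by default.
window = toy[toy["Start"].between(start, end)].sort_values("Start")
print(window)
```

One difference to keep in mind: `isin` against a daily range only matches timestamps that fall exactly on midnight, while `between` keeps any timestamp inside the interval.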


After inspecting these new CSVs, we can see that more than 100 market cap values are missing in BNB and EOS (they are stored as 0 or -1).  
We will use data from Kaggle to fill them in.  

In [69]:
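def update_market_caps(processed_path, all_crypto_file):
    """
    Fills missing market cap values (stored as 0 or -1) in each processed
    crypto CSV using the matching symbol and date from the combined Kaggle file.
    Each CSV is overwritten in place.
    """
    all_crypto_df = pd.read_csv(all_crypto_file)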
    
    crypto_files = [f for f in os.listdir(processed_path) if f.endswith('.csv') and f != 'All_Crypto.csv']
    
    for filename in crypto_files:
        try:
            symbol = filename.replace('.csv', '')
            
            filepath = os.path.join(processed_path, filename)
            crypto_df = pd.read_csv(filepath)
            crypto_df['Start'] = pd.to_datetime(crypto_df['Start'])
            
            symbol_data = all_crypto_df[all_crypto_df['Symbol'] == symbol].copy()
            symbol_data['Date'] = pd.to_datetime(symbol_data['Date'], format='%d-%m-%Y %H:%M')
            
            crypto_df['Start_date'] = crypto_df['Start'].dt.normalize()
            symbol_data['Date_date'] = symbol_data['Date'].dt.normalize()
            
            market_cap_dict = dict(zip(symbol_data['Date_date'], symbol_data['Marketcap']))
            
            updated_count = 0
            for index, row in crypto_df.iterrows():
                if row['Market Cap'] == 0.0 or row['Market Cap'] == -1.0:
                    start_date = row['Start_date']
                    if start_date in market_cap_dict:
                        crypto_df.at[index, 'Market Cap'] = market_cap_dict[start_date]
                        updated_count += 1
            
            crypto_df.drop(columns=['Start_date'], inplace=True)
            crypto_df.to_csv(filepath, index=False)
            
            if updated_count > 0:
                print(f"{symbol}: Updated {updated_count} market cap values")
            
        except Exception as e:
            print(f"Error processing {filename}: {str(e)}")

update_market_caps(CRYPTO_DATA_PATH_PROCESSED, os.path.join(KAGGLE_DATA_PATH, 'All_Crypto.csv'))
print("\nAll files processed and overwritten!")

BNB: Updated 133 market cap values
EOS: Updated 133 market cap values

All files processed and overwritten!
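
The row-by-row loop above is fine for a few thousand rows. As a possible alternative, the same fill can be sketched in vectorized form with `Series.map`. The example uses toy frames with made-up values and assumes the reference data has at most one row per day (duplicate dates in the lookup would make `map` raise an error).

```python
import pandas as pd

# Toy data for illustration only; 0.0 and -1.0 mark missing market caps.
crypto_toy = pd.DataFrame({
    "Start": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "Market Cap": [0.0, 500.0, -1.0],
})
reference_toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01 23:59", "2020-01-02 23:59"]),
    "Marketcap": [450.0, 999.0],
})

# Lookup table: normalized (midnight) date -> reference market cap.
lookup = pd.Series(
    reference_toy["Marketcap"].values,
    index=reference_toy["Date"].dt.normalize(),
)
candidates = crypto_toy["Start"].dt.normalize().map(lookup)

# Fill only the rows flagged as missing for which a reference value exists.
missing = crypto_toy["Market Cap"].isin([0.0, -1.0])
fillable = missing & candidates.notna()
crypto_toy.loc[fillable, "Market Cap"] = candidates[fillable]

print(crypto_toy)
print(f"Updated {int(fillable.sum())} market cap values")
```

Rows that are flagged as missing but have no reference value for that date are left untouched, which mirrors the `if start_date in market_cap_dict` check in the loop version.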


Nice, our crypto data is now complete for the 2018 to 2025 time range.  
Let us now drop the unwanted features and build the merged crypto dataset.