# 01_data_preparation.ipynb

## Notebook Purpose
This notebook loads, cleans, and preprocesses historical cryptocurrency data, then calculates the technical indicators used for later analysis and model training.

## Instructions
1. **Import Necessary Libraries**:
   - Import `pandas` for data manipulation.
   - Import `load_data`, `preprocess_data`, and `calculate_indicators` from `utils.py` to load the data, preprocess it, and calculate technical indicators.

2. **Load Data**:
   - Use the `load_data` function to load the CSV file containing historical cryptocurrency data.

3. **Preprocess Data**:
   - Use the `preprocess_data` function to clean and preprocess the loaded data.
   - Handle missing values explicitly (for example, by dropping or forward-filling them) before calculating indicators.

4. **Calculate Technical Indicators**:
   - Use the `calculate_indicators` function to add technical indicators (e.g., SMA, EMA, RSI) to the data.

5. **Save Preprocessed Data**:
   - Save the cleaned and preprocessed data, including the calculated technical indicators, to a new CSV file for later use.

6. **Review Data**:
   - Display the first few rows of the preprocessed data to ensure it looks correct.
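
Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the contents of `utils.py`: the function names `preprocess_sketch` and `add_indicators_sketch`, the forward-fill policy, and the default 14-day window are all assumptions, and the RSI here uses plain rolling means rather than Wilder's smoothing.

```python
import pandas as pd


def preprocess_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step: sort by date, drop duplicate dates, fill gaps."""
    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"])
    out = out.sort_values("Date").drop_duplicates(subset="Date").set_index("Date")
    # Forward-fill missing values, then drop any rows that are still incomplete
    return out.ffill().dropna()


def add_indicators_sketch(df: pd.DataFrame, window: int = 14) -> pd.DataFrame:
    """Hypothetical indicator step: add SMA, EMA, and a simple-average RSI."""
    out = df.copy()
    close = out["Close"]
    out["SMA"] = close.rolling(window).mean()
    out["EMA"] = close.ewm(span=window, adjust=False).mean()
    # RSI: ratio of average gains to average losses over the window
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta).clip(lower=0).rolling(window).mean()
    out["RSI"] = 100 - 100 / (1 + gain / loss)
    return out


# Tiny made-up series with one missing value, only to exercise the functions
demo = pd.DataFrame({
    "Date": pd.date_range("2023-01-01", periods=6, freq="D"),
    "Close": [10.0, 11.0, None, 12.0, 11.0, 13.0],
})
clean = preprocess_sketch(demo)
result = add_indicators_sketch(clean, window=3)
print(result)
```

The first `window - 1` rows of `SMA` and the first `window` rows of `RSI` are `NaN` because the rolling window is not yet full. Whether to drop those rows before saving depends on how the downstream model handles missing values.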

## Example Code
```python
# Import necessary libraries
import pandas as pd
from scripts.utils import load_data, preprocess_data, calculate_indicators

# Load data
data_path = 'data/historical_data/btc_usd.csv'  # Update this path based on the selected cryptocurrency
data = load_data(data_path)

# Preprocess data
data = preprocess_data(data)

# Calculate technical indicators
data = calculate_indicators(data)

# Save the preprocessed data
data.to_csv('data/historical_data/btc_usd_preprocessed.csv')

# Display the first few rows of the preprocessed data
data.head()
```

In [1]:
# Cell 1: Import necessary libraries and verify
try:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import requests
    import os
    from dotenv import load_dotenv
    from datetime import datetime
    %matplotlib inline
    print("Libraries loaded successfully. Let's proceed!")
except ImportError as e:
    print(f"Uh-oh! Please verify the installation of: {e.name}")


Libraries loaded successfully. Let's proceed!


In [2]:
# Cell 2: Load environment variables and fetch API keys
load_dotenv()

# Fetch API keys
COINBASE_API_KEY = os.getenv("COINBASE_API_KEY")
COINBASE_API_SECRET = os.getenv("COINBASE_API_SECRET")
ALPHA_VANTAGE_API_KEY = os.getenv("ALPHA_VANTAGE_API_KEY")
CRYPTOCOMPARE_API_KEY = os.getenv("CRYPTOCOMPARE_API_KEY")


In [3]:
# Cell 3: Function to fetch data from Alpha Vantage
def fetch_alpha_vantage_data(symbol):
    base_url = "https://www.alphavantage.co/query"
    params = {
        "function": "DIGITAL_CURRENCY_DAILY",
        "symbol": symbol,
        "market": "USD",
        "apikey": ALPHA_VANTAGE_API_KEY
    }
    try:
        response = requests.get(base_url, params=params, timeout=30)  # avoid hanging on a stalled connection
        response.raise_for_status()
        data = response.json()
        if 'Time Series (Digital Currency Daily)' in data:
            time_series = data['Time Series (Digital Currency Daily)']
            df = pd.DataFrame.from_dict(time_series, orient='index')
            df = df.rename(columns={
                '1a. open (USD)': 'Open',
                '2a. high (USD)': 'High',
                '3a. low (USD)': 'Low',
                '4a. close (USD)': 'Close',
                '5. volume': 'Volume'
            })
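            # JSON values arrive as strings; convert the price columns to numbers
            for col in ['Open', 'High', 'Low', 'Close', 'Volume']:
                if col in df.columns:
                    df[col] = pd.to_numeric(df[col], errors='coerce')
            df.index = pd.to_datetime(df.index)
            df.reset_index(inplace=True)
            df = df.rename(columns={'index': 'Date'})
            return df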
        else:
            print(f"No 'Time Series (Digital Currency Daily)' data found for {symbol}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"API request failed: {e}")
        return None


In [4]:
# Cell 4: Function to fetch data from Coinbase
def fetch_coinbase_data(currency_pair):
    base_url = f"https://api.coinbase.com/v2/prices/{currency_pair}/spot"
    try:
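        response = requests.get(base_url, timeout=30)
        response.raise_for_status()
        data = response.json()
        if 'data' in data and 'amount' in data['data']:
            df = pd.DataFrame([data['data']])
            # The spot endpoint returns only the current price, so stamp it with the
            # current UTC time (timezone-naive, to match the other sources)
            df['Date'] = pd.Timestamp.now(tz='UTC').tz_localize(None)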
            df = df.rename(columns={'amount': 'Close'})
            df['Close'] = df['Close'].astype(float)
            return df[['Date', 'Close']]
        else:
            print(f"No 'amount' data found for {currency_pair} in Coinbase response")
            return None
    except requests.exceptions.RequestException as e:
        print(f"API request failed: {e}")
        return None


In [5]:
# Cell 5: Function to fetch data from CryptoCompare
def fetch_cryptocompare_data(symbol, start_date, end_date):
    base_url = "https://min-api.cryptocompare.com/data/v2/histoday"
    start, end = pd.Timestamp(start_date), pd.Timestamp(end_date)
    params = {
        "fsym": symbol,
        "tsym": "USD",
        "toTs": int(end.timestamp()),
        # Request only the days in the range; CryptoCompare allows up to 2000 per call
        "limit": min(2000, (end - start).days + 1),
        "api_key": CRYPTOCOMPARE_API_KEY
    }
    try:
        response = requests.get(base_url, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()
        if 'Data' in data and 'Data' in data['Data']:
            data = data['Data']['Data']
            df = pd.DataFrame(data)
            df['Date'] = pd.to_datetime(df['time'], unit='s')
            df = df.rename(columns={'open': 'Open', 'high': 'High', 'low': 'Low', 'close': 'Close', 'volumeto': 'Volume'})
            df = df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume']]
            # Keep only rows inside the requested date range
            df = df[(df['Date'] >= start) & (df['Date'] <= end)].reset_index(drop=True)
            return df
        else:
            print(f"No data available for {symbol} from CryptoCompare")
            return None
    except requests.exceptions.RequestException as e:
        print(f"API request failed: {e}")
        return None


In [6]:
# Cell 6: Function to save data to a CSV file
def save_data_to_csv(data, filename):
    if data is not None and not data.empty:
        data.to_csv(filename, index=False)
        print(f"Data saved to {filename}")
    else:
        print(f"No data to save for {filename}")


In [7]:
# Cell 7: Load manually downloaded data
def load_manual_data():
    manual_data = {}
    cryptos = ["BTC", "ETH", "SOL"]
    for crypto in cryptos:
        try:
            data = pd.read_csv(f'../data/historical_data/{crypto}-USD.csv', parse_dates=['Date'])
            manual_data[crypto] = data
            print(f"Manual data loaded for {crypto}")
        except Exception as e:
            print(f"Error loading manually downloaded data for {crypto}: {e}")
    return manual_data


In [8]:
# Cell 8: Fetch data from APIs
def fetch_api_data():
    api_data = {}
    cryptos = ["BTC", "ETH", "SOL"]
    for crypto in cryptos:
        # Fetch data from Alpha Vantage
        alpha_vantage_data = fetch_alpha_vantage_data(crypto)
        if isinstance(alpha_vantage_data, pd.DataFrame):
            save_data_to_csv(alpha_vantage_data, f'../data/historical_data/alpha_vantage/{crypto}_alpha_vantage.csv')
            api_data[f"{crypto}_alpha_vantage"] = alpha_vantage_data
        
        # Fetch data from Coinbase
        coinbase_data = fetch_coinbase_data(f"{crypto}-USD")
        if isinstance(coinbase_data, pd.DataFrame):
            save_data_to_csv(coinbase_data, f'../data/historical_data/coinbase/{crypto}_coinbase.csv')
            api_data[f"{crypto}_coinbase"] = coinbase_data
        
        # Fetch data from CryptoCompare
        for year in range(2018, 2024):
            start_date = f"{year}-01-01"
            end_date = f"{year}-12-31"
            cryptocompare_data = fetch_cryptocompare_data(crypto, start_date, end_date)
            if isinstance(cryptocompare_data, pd.DataFrame):
                save_data_to_csv(cryptocompare_data, f'../data/historical_data/cryptocompare/{crypto}_cryptocompare_{year}.csv')
                api_data[f"{crypto}_cryptocompare_{year}"] = cryptocompare_data
    
    return api_data


In [9]:
# Cell 9: Combine data from all sources
def combine_data_sources(api_data, manual_data):
    combined_data = {}
    cryptos = ["BTC", "ETH", "SOL"]
    for crypto in cryptos:
        # Collect frames in priority order: API sources first, manual download last
        frames = []
        for source in ['alpha_vantage', 'coinbase', 'cryptocompare']:
            for key, df in api_data.items():
                if key.startswith(f"{crypto}_{source}"):
                    frames.append(df)
        if crypto in manual_data:
            frames.append(manual_data[crypto])
        if not frames:
            print(f"No data available to combine for {crypto}")
            continue

        combined_df = pd.concat(frames, ignore_index=True)
        # A stable sort preserves the priority order among rows with the same date,
        # so drop_duplicates keeps the row from the first source listed above
        combined_df = (combined_df.sort_values('Date', kind='mergesort')
                       .drop_duplicates(subset='Date')
                       .reset_index(drop=True))
        combined_data[crypto] = combined_df
    return combined_data


In [10]:
# Cell 10: Save combined data to CSV
def save_combined_data(combined_data):
    for crypto, df in combined_data.items():
        df = df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume']]
        file_path = f'../data/cleaned_data/{crypto}_cleaned.csv'
        df.to_csv(file_path, index=False)
        print(f"Combined data saved to {file_path}")


In [11]:
# Cell 11: Execute the data preparation steps
print("Fetching API data...")
api_data = fetch_api_data()

print("Loading manual data...")
manual_data = load_manual_data()

print("Combining data sources...")
combined_data = combine_data_sources(api_data, manual_data)

print("Saving combined data...")
save_combined_data(combined_data)

print("Data preparation complete.")


Fetching API data...
No 'Time Series (Digital Currency Daily)' data found for BTC
Data saved to ../data/historical_data/coinbase/BTC_coinbase.csv
Data saved to ../data/historical_data/cryptocompare/BTC_cryptocompare_2018.csv
Data saved to ../data/historical_data/cryptocompare/BTC_cryptocompare_2019.csv
Data saved to ../data/historical_data/cryptocompare/BTC_cryptocompare_2020.csv
Data saved to ../data/historical_data/cryptocompare/BTC_cryptocompare_2021.csv
Data saved to ../data/historical_data/cryptocompare/BTC_cryptocompare_2022.csv
Data saved to ../data/historical_data/cryptocompare/BTC_cryptocompare_2023.csv
No 'Time Series (Digital Currency Daily)' data found for ETH
Data saved to ../data/historical_data/coinbase/ETH_coinbase.csv
Data saved to ../data/historical_data/cryptocompare/ETH_cryptocompare_2018.csv
Data saved to ../data/historical_data/cryptocompare/ETH_cryptocompare_2019.csv
Data saved to ../data/historical_data/cryptocompare/ETH_cryptocompare_2020.csv
Data saved to ../data/historical_data/cryptocompare/ETH_cryptocompare_2021.csv
Data saved to ../data/historical_data/cryptocompare/ETH_cryptocompare_2022.csv
Data saved to ../data/historical_data/cryptocompare/ETH_cryptocompare_2023.csv
No 'Time Series (Digital Currency Daily)' data found for SOL
Data saved to ../data/historical_data/coinbase/SOL_coinbase.csv
Data saved to ../data/historical_data/cryptocompare/SOL_cryptocompare_2018.csv
Data saved to ../data/historical_data/cryptocompare/SOL_cryptocompare_2019.csv
Data saved to ../data/historical_data/cryptocompare/SOL_cryptocompare_2020.csv
Data saved to ../data/historical_data/cryptocompare/SOL_cryptocompare_2021.csv
Data saved to ../data/historical_data/cryptocompare/SOL_cryptocompare_2022.csv
Data saved to ../data/historical_data/cryptocompare/SOL_cryptocompare_2023.csv
Loading manual data...
Manual data loaded for BTC
Manual data loaded for ETH
Manual data loaded for SOL
Combining data sources...
Saving combined data...
Combined data saved to ../data/cleaned_data/BTC_cleaned.csv
Combined data saved to ../data/cleaned_data/ETH_cleaned.csv
Combined data saved to ../data/cleaned_data/SOL_cleaned.csv
Data preparation complete.
