# 01_data_preparation.ipynb

## Notebook Purpose
This notebook is designed to load, clean, and preprocess historical cryptocurrency data. It will also calculate technical indicators that will be used for further analysis and model training.

## Instructions
1. **Import Necessary Libraries**:
   - Import `pandas` for data manipulation.
   - Import functions from `utils.py` for loading, preprocessing data, and calculating technical indicators.

2. **Load Data**:
   - Use the `load_data` function to load the CSV file containing historical cryptocurrency data.

3. **Preprocess Data**:
   - Use the `preprocess_data` function to clean and preprocess the loaded data.
   - Ensure any missing values are handled appropriately.

4. **Calculate Technical Indicators**:
   - Use the `calculate_indicators` function to add technical indicators (e.g., SMA, EMA, RSI) to the data.

5. **Save Preprocessed Data**:
   - Save the cleaned and preprocessed data, including the calculated technical indicators, to a new CSV file for later use.

6. **Review Data**:
   - Display the first few rows of the preprocessed data to ensure it looks correct.

## Example Code
```python
# Import necessary libraries
import pandas as pd
from scripts.utils import load_data, preprocess_data, calculate_indicators

# Load data
data_path = 'data/historical_data/btc_usd.csv'  # Update this path based on the selected cryptocurrency
data = load_data(data_path)

# Preprocess data
data = preprocess_data(data)

# Calculate technical indicators
data = calculate_indicators(data)

# Save the preprocessed data
data.to_csv('data/historical_data/btc_usd_preprocessed.csv')

# Display the first few rows of the preprocessed data
data.head()


In [22]:
# Cell 1: Import necessary libraries and verify
try:
    import pandas as pd
    import numpy as np
    from datetime import datetime
    import matplotlib.pyplot as plt
    import seaborn as sns
    import requests
    from dotenv import load_dotenv
    import os
    import ccxt
    %matplotlib inline
    print("Libraries loaded successfully. Let's proceed!")
except ImportError as e:
    print(f"Uh-oh! Please verify the installation of: {e.name}")


Libraries loaded successfully. Let's proceed!


In [23]:
# Cell 2: Load environment variables and fetch API keys
from dotenv import load_dotenv
import requests
import ccxt

# Load environment variables
load_dotenv()

# Fetch API keys
COINBASE_API_KEY = os.getenv("COINBASE_API_KEY")
COINBASE_API_SECRET = os.getenv("COINBASE_API_SECRET")
ALPHA_VANTAGE_API_KEY = os.getenv("ALPHA_VANTAGE_API_KEY")
CRYPTOCOMPARE_API_KEY = os.getenv("CRYPTOCOMPARE_API_KEY")

# Function to test API connection for each service
def test_api_connection():
    report = []

    # Test Coinbase API connection using ccxt
    try:
        print("Testing Coinbase connection...")
        coinbase = ccxt.coinbase({
            'apiKey': COINBASE_API_KEY,
            'secret': COINBASE_API_SECRET
        })
        print("Coinbase object created.")
        markets = coinbase.load_markets()
        print("Coinbase markets loaded.")
        balance = coinbase.fetch_balance()
        print("Coinbase balance fetched.")
        report.append("Coinbase: Connection successful.")
    except Exception as e:
        print("Error during Coinbase connection test:")
        print(e)
        report.append(f"Coinbase: Failed to connect. Error: {str(e)}")
    
    # Test Alpha Vantage API connection
    print("Testing Alpha Vantage connection...")
    if ALPHA_VANTAGE_API_KEY:
        try:
            response = requests.get(f"https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=IBM&apikey={ALPHA_VANTAGE_API_KEY}")
            if response.status_code == 200:
                report.append("Alpha Vantage: Connection successful.")
            else:
                report.append(f"Alpha Vantage: Failed to connect. Status code: {response.status_code}, Message: {response.text}")
        except Exception as e:
            report.append(f"Alpha Vantage: Failed to connect. Error: {str(e)}")
    else:
        report.append("Alpha Vantage: API key missing.")
    
    # Test CryptoCompare API connection
    print("Testing CryptoCompare connection...")
    if CRYPTOCOMPARE_API_KEY:
        try:
            response = requests.get(f"https://min-api.cryptocompare.com/data/pricemulti?fsyms=BTC&tsyms=USD", headers={"authorization": f"Apikey {CRYPTOCOMPARE_API_KEY}"})
            if response.status_code == 200:
                report.append("CryptoCompare: Connection successful.")
            else:
                report.append(f"CryptoCompare: Failed to connect. Status code: {response.status_code}, Message: {response.text}")
        except Exception as e:
            report.append(f"CryptoCompare: Failed to connect. Error: {str(e)}")
    else:
        report.append("CryptoCompare: API key missing.")
    
    # Print connection report
    for line in report:
        print(line)

# Run the API connection test
test_api_connection()


Testing Coinbase connection...
Coinbase object created.
Error during Coinbase connection test:
index out of range
Testing Alpha Vantage connection...
Testing CryptoCompare connection...
Coinbase: Failed to connect. Error: index out of range
Alpha Vantage: Connection successful.
CryptoCompare: Connection successful.


In [25]:
# Cell 3: Function to fetch historical data from APIs
def fetch_api_data():
    cryptos = ["BTC", "ETH", "SOL"]
    api_data = {}

    # Fetch data from Alpha Vantage
    for crypto in cryptos:
        alpha_vantage_data = fetch_alpha_vantage_data(crypto)
        if alpha_vantage_data is not None:
            api_data[f"{crypto}_alpha_vantage"] = alpha_vantage_data

    # Fetch data from Coinbase
    for crypto in cryptos:
        coinbase_data = fetch_coinbase_data(crypto, '2018-01-01', '2023-12-31')
        if coinbase_data is not None:
            api_data[f"{crypto}_coinbase"] = coinbase_data

    # Fetch data from CryptoCompare
    for crypto in cryptos:
        cryptocompare_data = fetch_cryptocompare_data(crypto, '2018-01-01', '2023-12-31')
        if cryptocompare_data is not None:
            api_data[f"{crypto}_cryptocompare"] = cryptocompare_data

    return api_data

# Fetch data from APIs
api_data = fetch_api_data()


Alpha Vantage Data Columns for BTC: Index(['date', '1. open', '2. high', '3. low', '4. close', 'volume'], dtype='object')
Missing expected columns in Alpha Vantage data for BTC
Alpha Vantage Data Columns for ETH: Index(['date', '1. open', '2. high', '3. low', '4. close', 'volume'], dtype='object')
Missing expected columns in Alpha Vantage data for ETH
Alpha Vantage Data Columns for SOL: Index(['date', '1. open', '2. high', '3. low', '4. close', 'volume'], dtype='object')
Missing expected columns in Alpha Vantage data for SOL
Creating Coinbase object for BTC...
Coinbase object created.
Failed to fetch data from Coinbase: index out of range
Creating Coinbase object for ETH...
Coinbase object created.
Failed to fetch data from Coinbase: index out of range
Creating Coinbase object for SOL...
Coinbase object created.
Failed to fetch data from Coinbase: index out of range


In [26]:
# Cell 4: Load manually downloaded data
def load_manual_data(cryptos):
    manual_data = {}
    for crypto in cryptos:
        try:
            data_path = f'../data/historical_data/{crypto}-USD.csv'
            df = pd.read_csv(data_path)
            print(f"Columns in {crypto} data: {df.columns}")  # Debug print to show the column names
            if 'Date' in df.columns:
                df = pd.read_csv(data_path, parse_dates=['Date'])
                df = df.rename(columns={'Date': 'time', 'Adj Close': 'adj_close', 'Volume': 'volume', 'Open': 'open', 'High': 'high', 'Low': 'low', 'Close': 'close'})
            else:
                raise ValueError(f"No 'Date' column found in {crypto} data")
            df = df.set_index('time')
            manual_data[crypto] = df
            print(f"Manually downloaded data loaded for {crypto}.")
        except Exception as e:
            print(f"Error loading manually downloaded data for {crypto}: {str(e)}")
    return manual_data

cryptos = ["BTC", "ETH", "SOL"]
manual_data = load_manual_data(cryptos)


Columns in BTC data: Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')
Manually downloaded data loaded for BTC.
Columns in ETH data: Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')
Manually downloaded data loaded for ETH.
Columns in SOL data: Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')
Manually downloaded data loaded for SOL.


In [28]:
# Cell 5: Combine data sources
def combine_data_sources(api_data, manual_data):
    combined_data = {}
    for crypto in ["BTC", "ETH", "SOL"]:
        combined_df = pd.DataFrame()
        for source, df in api_data.items():
            if crypto in source:
                df.index = pd.to_datetime(df.index)
                combined_df = pd.concat([combined_df, df])
        if crypto in manual_data:
            manual_df = manual_data[crypto]
            manual_df.index = pd.to_datetime(manual_df.index)
            combined_df = pd.concat([combined_df, manual_df])
        combined_df = combined_df.sort_index().drop_duplicates()
        combined_data[crypto] = combined_df
    return combined_data

# Combine data
combined_data = combine_data_sources(api_data, manual_data)



In [None]:
# Cell 6: Preprocess the data
def preprocess_data(data):
    data = data.dropna()
    data['SMA_20'] = data['close'].rolling(window=20).mean()
    data['SMA_50'] = data['close'].rolling(window=50).mean()
    data['EMA_20'] = data['close'].ewm(span=20, adjust=False).mean()
    data['EMA_50'] = data['close'].ewm(span=50, adjust=False).mean()
    data['Return'] = data['close'].pct_change()
    data['Volatility'] = data['Return'].rolling(window=20).std()
    return data.dropna()

# Preprocess combined data
preprocessed_data = {crypto: preprocess_data(df) for crypto, df in combined_data.items()}


In [31]:
# Cell 7: Save preprocessed data to CSV
# Function to save preprocessed data to CSV files
def save_preprocessed_data(preprocessed_data):
    os.makedirs('../data/cleaned_data', exist_ok=True)
    for crypto, df in preprocessed_data.items():
        file_path = f"../data/cleaned_data/{crypto}_cleaned.csv"
        df.to_csv(file_path)
        print(f"Preprocessed data saved to {file_path}")

# Save preprocessed data
save_preprocessed_data(combined_data)

Preprocessed data saved to ../data/cleaned_data/BTC_cleaned.csv
Preprocessed data saved to ../data/cleaned_data/ETH_cleaned.csv
Preprocessed data saved to ../data/cleaned_data/SOL_cleaned.csv


In [33]:
# Cell 8: Verify the preprocessed data
for crypto, df in combined_data.items():
    print(f"\n{crypto} Preprocessed Data Head:")
    print(df.head())



BTC Preprocessed Data Head:
                                    date     open     high      low    close  \
1970-01-01 00:00:00.000000000 2018-07-10  6668.84  6683.61  6277.23  6306.85   
1970-01-01 00:00:00.000000001 2018-07-11  6306.87  6405.59  6293.68  6394.36   
1970-01-01 00:00:00.000000002 2018-07-12  6394.36  6394.93  6084.00  6253.60   
1970-01-01 00:00:00.000000003 2018-07-13  6253.66  6349.21  6131.54  6229.83   
1970-01-01 00:00:00.000000004 2018-07-14  6229.61  6332.46  6190.18  6268.75   

                                     volume  adj_close  
1970-01-01 00:00:00.000000000  4.704321e+08        NaN  
1970-01-01 00:00:00.000000001  3.276678e+08        NaN  
1970-01-01 00:00:00.000000002  4.090782e+08        NaN  
1970-01-01 00:00:00.000000003  3.198023e+08        NaN  
1970-01-01 00:00:00.000000004  1.744168e+08        NaN  

ETH Preprocessed Data Head:
                                    date    open    high     low   close  \
1970-01-01 00:00:00.000000000 2018-07-10  4