# 03_model_generation.ipynb

## Notebook Purpose
This notebook trains machine learning models on the preprocessed cryptocurrency data. Subsequent notebooks use the trained models to make predictions.

## Instructions
1. **Import Necessary Libraries**:
   - Import `pandas` for data manipulation.
   - Import functions from `models.py` for training models.

2. **Load Preprocessed Data**:
   - Load the preprocessed CSV file created in the first notebook.

3. **Train Machine Learning Models**:
   - Split the data into training and testing sets.
   - Use the `train_model` function to train a machine learning model (e.g., Random Forest) on the training portion of the historical data.

4. **Save the Trained Model**:
   - Save the trained model to a file for later use in making predictions.

5. **Evaluate Model Performance**:
   - Evaluate the model's performance on the held-out test set using appropriate metrics (e.g., MSE and R^2 score).

## Example Code
```python
# Import necessary libraries
import pandas as pd
from scripts.models import train_model
import joblib

# Load preprocessed data
data_path = 'data/historical_data/btc_usd_preprocessed.csv'  # Update this path based on the selected cryptocurrency
data = pd.read_csv(data_path, parse_dates=['Date'], index_col='Date')

# Train model
model, X_test, y_test = train_model(data)

# Save the model and test data
joblib.dump(model, 'models/trained_model.pkl')
X_test.to_csv('data/historical_data/X_test.csv')
y_test.to_csv('data/historical_data/y_test.csv')

# Display model performance
print(f"Model trained. R^2 score on test data: {model.score(X_test, y_test)}")
```

In [1]:
# Cell 1: Import necessary libraries and verify
try:
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error, r2_score
    import joblib
    import os
    print("Libraries imported successfully.")
except ImportError as e:
    print(f"Error importing libraries: {e}")


Libraries imported successfully.


In [2]:
# Cell 2: Load preprocessed data
def load_data(data_path):
    try:
        data = pd.read_csv(data_path, parse_dates=['Date'], index_col='Date')
        print(f"Data loaded successfully from {data_path}.")
        return data
    except FileNotFoundError as e:
        print(f"Error loading data: {e}")
        return None
    except ValueError as e:
        # Fall back to an unparsed load; 'Date' then stays a plain column, not the index
        print(f"Error parsing dates: {e}")
        return pd.read_csv(data_path)

data_paths = {
    'BTC': '../data/cleaned_data/BTC_cleaned.csv',
    'ETH': '../data/cleaned_data/ETH_cleaned.csv',
    'SOL': '../data/cleaned_data/SOL_cleaned.csv'
}

crypto_data = {crypto: load_data(path) for crypto, path in data_paths.items()}

# Verify loaded data
for crypto, data in crypto_data.items():
    if data is not None:
        print(f"\n{crypto} data:")
        print(data.head())


Data loaded successfully from ../data/cleaned_data/BTC_cleaned.csv.
Data loaded successfully from ../data/cleaned_data/ETH_cleaned.csv.
Data loaded successfully from ../data/cleaned_data/SOL_cleaned.csv.

BTC data:
             Open    High    Low  Close      Volume
Date                                               
2013-07-10  76.70   89.84  75.53  88.00  4916740.89
2013-07-11  88.00   90.70  85.00  88.98  3084484.64
2013-07-12  88.98  104.17  88.00  93.99  9759561.48
2013-07-13  93.99   98.32  87.76  98.32  3186590.74
2013-07-14  98.32   99.00  92.86  94.42  1171458.48

ETH data:
              Open     High     Low   Close     Volume
Date                                                  
2013-07-10  0.0000   0.0000  0.0000  0.0000       0.00
2015-08-07  0.7812  27.7900  0.7809  2.7730  148608.32
2015-08-08  2.7730   2.5810  0.5958  0.8076  583543.48
2015-08-09  0.8076   0.9581  0.6043  0.7428  547528.03
2015-08-10  0.7428   0.7628  0.5990  0.6846  401107.09

SOL data:
              

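The ETH preview above begins with an all-zero row dated 2013-07-10, well before the first priced row. A minimal sketch for spotting and dropping such placeholder rows before returns are computed, assuming a hypothetical helper `drop_zero_price_rows` (not part of the notebook) and the `Close` column shown in the output:

```python
import pandas as pd


def drop_zero_price_rows(df: pd.DataFrame, price_col: str = "Close") -> pd.DataFrame:
    """Return a copy of df without rows whose price is zero or negative.

    Hypothetical helper for illustration; not part of the notebook.
    """
    mask = df[price_col] > 0
    return df.loc[mask].copy()


# Small frame mirroring the first ETH closes shown above
demo = pd.DataFrame(
    {"Close": [0.0, 2.7730, 0.8076]},
    index=pd.to_datetime(["2013-07-10", "2015-08-07", "2015-08-08"]),
)
cleaned = drop_zero_price_rows(demo)
print(len(demo) - len(cleaned), "placeholder row(s) dropped")
```

Dropping the row up front would also avoid the division by zero that `pct_change` otherwise hits on the following row.
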
In [3]:
# Cell 3: Prepare features and target variable with added debug prints
def prepare_features_target(data):
    data = data.copy()
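    # Same-day return: each row's target is computed from that row's own Close
    data['returns'] = data['Close'].pct_change()
    data.dropna(inplace=True)
    
    # pct_change yields inf when the previous Close is 0; drop those rows from the target
    y = data['returns'].replace([np.inf, -np.inf], np.nan).dropna()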

    # Align features with the cleaned target variable
    X = data.loc[y.index].drop(columns=['returns'])
    
    print("Prepared features X and target y:")
    print(f"X shape: {X.shape}")
    print(f"y shape: {y.shape}")
    
    return X, y

crypto_features_targets = {
    crypto: prepare_features_target(data)
    for crypto, data in crypto_data.items()
    if data is not None  # skip datasets that failed to load
}

# Verify prepared features and targets
for crypto, (X, y) in crypto_features_targets.items():
    print(f"\n{crypto} features and target:")
    print(X.head(), y.head())


Prepared features X and target y:
X shape: (4009, 5)
y shape: (4009,)
Prepared features X and target y:
X shape: (3251, 5)
y shape: (3251,)
Prepared features X and target y:
X shape: (1543, 5)
y shape: (1543,)

BTC features and target:
             Open    High    Low  Close      Volume
Date                                               
2013-07-11  88.00   90.70  85.00  88.98  3084484.64
2013-07-12  88.98  104.17  88.00  93.99  9759561.48
2013-07-13  93.99   98.32  87.76  98.32  3186590.74
2013-07-14  98.32   99.00  92.86  94.42  1171458.48
2013-07-15  94.42  101.94  93.11  98.89  3366165.02 Date
2013-07-11    0.011136
2013-07-12    0.056305
2013-07-13    0.046069
2013-07-14   -0.039666
2013-07-15    0.047342
Name: returns, dtype: float64

ETH features and target:
              Open    High     Low   Close      Volume
Date                                                  
2015-08-08  2.7730  2.5810  0.5958  0.8076   583543.48
2015-08-09  0.8076  0.9581  0.6043  0.7428   547528.03
2015
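
Cell 3 pairs each row's features with that same row's return, and `Close` is both a feature and the input to `returns`. If the goal is forecasting, one common alternative is to predict the next day's return instead. The sketch below assumes a hypothetical helper `prepare_next_day_target`; it is an illustration, not the notebook's method:

```python
import numpy as np
import pandas as pd


def prepare_next_day_target(data: pd.DataFrame):
    """Pair each day's features with the following day's return (hypothetical helper)."""
    df = data.copy()
    # Return realised on the next row, aligned back to the current row
    df["target"] = df["Close"].pct_change().shift(-1)
    # Treat infinite returns as missing, then drop rows without a target
    df = df.replace([np.inf, -np.inf], np.nan).dropna(subset=["target"])
    X = df.drop(columns=["target"])
    y = df["target"]
    return X, y


# First BTC closes from the output above
demo = pd.DataFrame(
    {"Close": [88.00, 88.98, 93.99, 98.32]},
    index=pd.to_datetime(["2013-07-10", "2013-07-11", "2013-07-12", "2013-07-13"]),
)
X_demo, y_demo = prepare_next_day_target(demo)
print(y_demo.round(6))
```

With this alignment the last row is dropped, because its next-day return is not yet known.
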

In [4]:
# Cell 4: Split data into training and testing sets
def split_data(X, y):
    print(f"Splitting data: X shape: {X.shape}, y shape: {y.shape}")
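    # X and y are already index-aligned; trimming to a common length is a safeguard
    min_length = min(len(X), len(y))
    X = X.iloc[:min_length]
    y = y.iloc[:min_length]
    # train_test_split shuffles by default, so test rows are interleaved in time with training rows
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)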
    print(f"Data split: X_train shape: {X_train.shape}, X_test shape: {X_test.shape}, y_train shape: {y_train.shape}, y_test shape: {y_test.shape}")
    return X_train, X_test, y_train, y_test

crypto_splits = {crypto: split_data(X, y) for crypto, (X, y) in crypto_features_targets.items()}

# Verify data splits
for crypto, (X_train, X_test, y_train, y_test) in crypto_splits.items():
    print(f"\n{crypto} data split:")
    print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
    print(f"y_train shape: {y_train.shape}, y_test shape: {y_test.shape}")


Splitting data: X shape: (4009, 5), y shape: (4009,)
Data split: X_train shape: (3207, 5), X_test shape: (802, 5), y_train shape: (3207,), y_test shape: (802,)
Splitting data: X shape: (3251, 5), y shape: (3251,)
Data split: X_train shape: (2600, 5), X_test shape: (651, 5), y_train shape: (2600,), y_test shape: (651,)
Splitting data: X shape: (1543, 5), y shape: (1543,)
Data split: X_train shape: (1234, 5), X_test shape: (309, 5), y_train shape: (1234,), y_test shape: (309,)

BTC data split:
X_train shape: (3207, 5), X_test shape: (802, 5)
y_train shape: (3207,), y_test shape: (802,)

ETH data split:
X_train shape: (2600, 5), X_test shape: (651, 5)
y_train shape: (2600,), y_test shape: (651,)

SOL data split:
X_train shape: (1234, 5), X_test shape: (309, 5)
y_train shape: (1234,), y_test shape: (309,)
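
`train_test_split` shuffles rows by default, so the test set in Cell 4 is interleaved in time with the training set. For time-ordered data, a chronological hold-out is a common alternative. A minimal sketch using `shuffle=False` on synthetic data (the frame below is illustrative, not the notebook's data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic daily series standing in for one cryptocurrency
idx = pd.date_range("2020-01-01", periods=100, freq="D")
X = pd.DataFrame({"Close": np.linspace(100.0, 200.0, 100)}, index=idx)
y = pd.Series(np.linspace(0.0, 1.0, 100), index=idx, name="returns")

# shuffle=False keeps row order: the last 20% of dates become the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print(X_train.index.max(), "<", X_test.index.min())
```

Evaluated this way, the model is scored only on dates later than anything it was trained on, which is closer to how it would be used for prediction.
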


In [5]:
# Cell 5: Train the machine learning models with debug prints
def train_model(X_train, y_train):
    try:
        model = RandomForestRegressor(n_estimators=100, random_state=42)
        model.fit(X_train, y_train)
        print("Model trained successfully.")
        return model
    except Exception as e:
        print(f"Error training model: {e}")
        return None

crypto_models = {}
for crypto, (X_train, X_test, y_train, y_test) in crypto_splits.items():
    print(f"Training model for {crypto}")
    model = train_model(X_train, y_train)
    if model is not None:
        crypto_models[crypto] = model

# Verify trained models
for crypto, model in crypto_models.items():
    print(f"{crypto} model trained successfully.")


Training model for BTC
Model trained successfully.
Training model for ETH
Model trained successfully.
Training model for SOL
Model trained successfully.
BTC model trained successfully.
ETH model trained successfully.
SOL model trained successfully.
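
Once a `RandomForestRegressor` is fitted, its `feature_importances_` attribute gives a rough view of which inputs the trees rely on. A small sketch on synthetic data that reuses the notebook's column names (the data and the target below are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
cols = ["Open", "High", "Low", "Close", "Volume"]
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=cols)
# Synthetic target driven mostly by one column, for illustration only
y = 2.0 * X["Close"] + 0.1 * rng.normal(size=200)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Importances sum to 1; larger values mean the feature was used for more impactful splits
importances = pd.Series(model.feature_importances_, index=cols).sort_values(ascending=False)
print(importances)
```
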


In [6]:
# Cell 6: Evaluate model performance with debug prints
def evaluate_model(model, X_test, y_test):
    try:
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        print(f"Model evaluation - MSE: {mse}, R2: {r2}")
        return mse, r2
    except Exception as e:
        print(f"Error evaluating model: {e}")
        return None, None

crypto_evaluations = {}
for crypto, model in crypto_models.items():
    X_test, y_test = crypto_splits[crypto][1], crypto_splits[crypto][3]
    print(f"Evaluating model for {crypto}")
    mse, r2 = evaluate_model(model, X_test, y_test)
    if mse is not None and r2 is not None:
        crypto_evaluations[crypto] = (mse, r2)

# Verify model evaluations
for crypto, (mse, r2) in crypto_evaluations.items():
    print(f"{crypto} model - MSE: {mse}, R2: {r2}")


Evaluating model for BTC
Model evaluation - MSE: 0.002026521418412084, R2: -0.3141519041306471
Evaluating model for ETH
Model evaluation - MSE: 0.002238998520861141, R2: 0.4239349895228589
Evaluating model for SOL
Model evaluation - MSE: 0.0013092403702317306, R2: 0.6325179616639734
BTC model - MSE: 0.002026521418412084, R2: -0.3141519041306471
ETH model - MSE: 0.002238998520861141, R2: 0.4239349895228589
SOL model - MSE: 0.0013092403702317306, R2: 0.6325179616639734
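
An R2 below zero, as for BTC above, means the model's squared error on the test set is larger than that of always predicting the test-set mean, since R^2 = 1 - SS_res / SS_tot. A quick way to see this with `r2_score`, sketched on made-up numbers (not the notebook's data):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([0.01, -0.02, 0.03, -0.01])

# Predicting the mean of y_true everywhere gives an R2 of 0
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_pred)

# Predictions that are further off than the mean give a negative R2
bad_pred = -y_true
r2_bad = r2_score(y_true, bad_pred)

print(r2_mean, r2_bad)
```
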


In [7]:
# Cell 7: Save the trained models and test data with debug prints
def save_model_data(model, X_test, y_test, model_path, X_test_path, y_test_path):
    try:
        os.makedirs(os.path.dirname(model_path), exist_ok=True)
        os.makedirs(os.path.dirname(X_test_path), exist_ok=True)
        os.makedirs(os.path.dirname(y_test_path), exist_ok=True)
        
        joblib.dump(model, model_path)
        X_test.to_csv(X_test_path)
        y_test.to_csv(y_test_path)
        print(f"Model and test data saved to {model_path}, {X_test_path}, {y_test_path}")
    except Exception as e:
        print(f"Error saving model data: {e}")

for crypto, model in crypto_models.items():
    X_train, X_test, y_train, y_test = crypto_splits[crypto]
    print(f"Saving model and test data for {crypto}")
    save_model_data(
        model, X_test, y_test,
        f'../models/{crypto}_trained_model.pkl',
        f'../data/cleaned_data/{crypto}_X_test.csv',
        f'../data/cleaned_data/{crypto}_y_test.csv',
    )


Saving model and test data for BTC
Model and test data saved to ../models/BTC_trained_model.pkl, ../data/cleaned_data/BTC_X_test.csv, ../data/cleaned_data/BTC_y_test.csv
Saving model and test data for ETH
Model and test data saved to ../models/ETH_trained_model.pkl, ../data/cleaned_data/ETH_X_test.csv, ../data/cleaned_data/ETH_y_test.csv
Saving model and test data for SOL
Model and test data saved to ../models/SOL_trained_model.pkl, ../data/cleaned_data/SOL_X_test.csv, ../data/cleaned_data/SOL_y_test.csv
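
To use a saved model later, it can be reloaded with `joblib.load`. A minimal round-trip sketch on synthetic data, writing to a temporary directory rather than the notebook's `../models/` path (the file name is illustrative):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)

model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")  # illustrative file name
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# The reloaded model should reproduce the original predictions
same = np.allclose(model.predict(X), reloaded.predict(X))
print("Predictions match after reload:", same)
```

When reading the saved test CSVs back, passing `index_col='Date'` to `pd.read_csv` should restore the date index that `to_csv` wrote as the first column.
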


In [8]:
# Cell 8: Display model performance for each cryptocurrency
print("Content of crypto_evaluations:")
print(crypto_evaluations)

for crypto, (mse, r2) in crypto_evaluations.items():
    print(f"{crypto} model - MSE: {mse}, R2: {r2}")


Content of crypto_evaluations:
{'BTC': (0.002026521418412084, -0.3141519041306471), 'ETH': (0.002238998520861141, 0.4239349895228589), 'SOL': (0.0013092403702317306, 0.6325179616639734)}
BTC model - MSE: 0.002026521418412084, R2: -0.3141519041306471
ETH model - MSE: 0.002238998520861141, R2: 0.4239349895228589
SOL model - MSE: 0.0013092403702317306, R2: 0.6325179616639734
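
The tuples in `crypto_evaluations` can be easier to compare as a table. A small sketch using the values printed above, rounded only for display:

```python
import pandas as pd

# Values copied from the output above
crypto_evaluations = {
    "BTC": (0.002026521418412084, -0.3141519041306471),
    "ETH": (0.002238998520861141, 0.4239349895228589),
    "SOL": (0.0013092403702317306, 0.6325179616639734),
}

# One row per cryptocurrency, one column per metric
summary = pd.DataFrame.from_dict(
    crypto_evaluations, orient="index", columns=["MSE", "R2"]
)
print(summary.sort_values("R2", ascending=False).round(4))
```
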
