
### 03_model_generation.ipynb
```markdown
# 03_model_generation.ipynb

## Notebook Purpose
This notebook develops and trains machine learning models on the preprocessed cryptocurrency data. Later notebooks use the trained models to make predictions.

## Instructions
1. **Import Necessary Libraries**:
   - Import `pandas` for data manipulation.
   - Import `joblib` for saving the trained model.
   - Import functions from `models.py` for training models.

2. **Load Preprocessed Data**:
   - Load the preprocessed CSV file created in the first notebook.

3. **Train Machine Learning Models**:
   - Split the data into training and testing sets.
   - Use the `train_model` function to train a machine learning model (e.g., Random Forest) on the training portion of the historical data.

4. **Save the Trained Model**:
   - Save the trained model to a file for later use in making predictions.

5. **Evaluate Model Performance**:
   - Evaluate the model's performance using appropriate metrics (e.g., R^2 score).

## Example Code
```python
# Import necessary libraries
import pandas as pd
from scripts.models import train_model
import joblib

# Load preprocessed data
data_path = 'data/historical_data/btc_usd_preprocessed.csv'  # Update this path based on the selected cryptocurrency
data = pd.read_csv(data_path, parse_dates=['Date'], index_col='Date')

# Train model
model, X_test, y_test = train_model(data)

# Save the model and test data
joblib.dump(model, 'models/trained_model.pkl')
X_test.to_csv('data/historical_data/X_test.csv')
y_test.to_csv('data/historical_data/y_test.csv')

# Display model performance
print(f"Model trained. R^2 score on test data: {model.score(X_test, y_test)}")
```
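
The example above imports `train_model` from `scripts/models.py`, which is not shown here. As a rough sketch only, assuming it takes a DataFrame with a `Close` target column, splits it chronologically, fits a Random Forest, and returns `(model, X_test, y_test)`, it might look like the following. The `target` and `test_size` parameters and the synthetic demo frame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def train_model(data, target="Close", test_size=0.2):
    """Hypothetical stand-in for scripts.models.train_model (assumed behavior)."""
    X = data.drop(columns=[target])
    y = data[target]
    # shuffle=False keeps the time order, so the test set is the most recent rows
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, shuffle=False
    )
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    return model, X_test, y_test


# Tiny synthetic frame, used only to show the call shape
rng = np.random.default_rng(0)
demo = pd.DataFrame({"feature_a": rng.normal(size=50), "feature_b": rng.normal(size=50)})
demo["Close"] = 2 * demo["feature_a"] + rng.normal(scale=0.1, size=50)

model, X_test, y_test = train_model(demo)
print(X_test.shape, y_test.shape)
```

Returning the held-out split alongside the model lets the caller save it and score the model on data it was not fitted on.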


In [1]:
# Cell 1: Import Necessary Libraries
try:
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error, r2_score
    import joblib  # For saving the model
    import matplotlib.pyplot as plt
    import seaborn as sns
    print("Libraries imported successfully.")
except ImportError as e:
    print(f"Error importing libraries: {e.name}")


Libraries imported successfully.


In [16]:
# Cell 2: Load Preprocessed Data for BTC, ETH, and SOL
cryptos = ['BTC', 'ETH', 'SOL']
crypto_data = {}

for crypto in cryptos:
    data_path = f'../data/cleaned_data/{crypto}_cleaned.csv'
    try:
        data = pd.read_csv(data_path)
        print(f"Columns in {crypto} data: {data.columns.tolist()}")
        print(data.head())  # Show the first few rows to understand the structure
        
        # Print the column names and first few rows to debug
        print(f"First few rows of {crypto} data:\n", data.head())
        
        # Use the first recognized date column as a datetime index
        date_col = next((col for col in ('Date', 'date', 'time', 'Time') if col in data.columns), None)
        if date_col is None:
            raise ValueError(f"No date column found in {crypto} data")
        data[date_col] = pd.to_datetime(data[date_col])
        data.set_index(date_col, inplace=True)
        crypto_data[crypto] = data
        print(f"Data loaded successfully from {data_path}.")
    except FileNotFoundError as e:
        print(f"Error loading data for {crypto}: {e}")

# Report how many files actually loaded instead of assuming success
print(f"Loaded data for {len(crypto_data)} of {len(cryptos)} cryptocurrencies.")


Columns in BTC data: ['date', 'open', 'high', 'low', 'close', 'volume', 'price', 'bad_data']
         date   open    high    low  close      volume      price  bad_data
0  2013-07-10  76.70   89.84  75.53  88.00  4916740.89  60910.985     False
1  2013-07-11  88.00   90.70  85.00  88.98  3084484.64  60910.985     False
2  2013-07-12  88.98  104.17  88.00  93.99  9759561.48  60910.985     False
3  2013-07-13  93.99   98.32  87.76  98.32  3186590.74  60910.985     False
4  2013-07-14  98.32   99.00  92.86  94.42  1171458.48  60910.985     False
First few rows of BTC data:
          date   open    high    low  close      volume      price  bad_data
0  2013-07-10  76.70   89.84  75.53  88.00  4916740.89  60910.985     False
1  2013-07-11  88.00   90.70  85.00  88.98  3084484.64  60910.985     False
2  2013-07-12  88.98  104.17  88.00  93.99  9759561.48  60910.985     False
3  2013-07-13  93.99   98.32  87.76  98.32  3186590.74  60910.985     False
4  2013-07-14  98.32   99.00  92.86  94.42

In [18]:
# Cell 3: Create Features for the Model
def create_features(df):
    df = df.copy()  # Work on a copy so the caller's DataFrame is not modified
    if 'close' in df.columns:
        df['Close'] = df['close']  # Copy the lowercase 'close' column into 'Close'
    elif 'price' in df.columns:
        df['Close'] = df['price']  # Fall back to 'price' when there is no 'close' column
    else:
        raise KeyError("No 'close' or 'price' column found in the data.")

    # Simple and exponential moving averages over 20 and 50 rows
    df['SMA_20'] = df['Close'].rolling(window=20).mean()
    df['SMA_50'] = df['Close'].rolling(window=50).mean()
    df['EMA_20'] = df['Close'].ewm(span=20, adjust=False).mean()
    df['EMA_50'] = df['Close'].ewm(span=50, adjust=False).mean()
    df['Return'] = df['Close'].pct_change()  # Row-over-row percentage change
    df['Volatility'] = df['Return'].rolling(window=20).std()  # 20-row rolling std of returns
    df = df.dropna()  # Drop rows with missing values, including the rolling-window warm-up rows
    return df

for crypto, data in crypto_data.items():
    print(f"Columns in {crypto} data before feature creation: {data.columns.tolist()}")
    crypto_data[crypto] = create_features(data)
    print(f"Features created successfully for {crypto}.")
    print(f"Columns in {crypto} data after feature creation: {crypto_data[crypto].columns.tolist()}")


Columns in BTC data before feature creation: ['open', 'high', 'low', 'close', 'volume', 'price', 'bad_data']
Features created successfully for BTC.
Columns in BTC data after feature creation: ['open', 'high', 'low', 'close', 'volume', 'price', 'bad_data', 'Close', 'SMA_20', 'SMA_50', 'EMA_20', 'EMA_50', 'Return', 'Volatility']
Columns in ETH data before feature creation: ['open', 'high', 'low', 'close', 'volume', 'price', 'bad_data']
Features created successfully for ETH.
Columns in ETH data after feature creation: ['open', 'high', 'low', 'close', 'volume', 'price', 'bad_data', 'Close', 'SMA_20', 'SMA_50', 'EMA_20', 'EMA_50', 'Return', 'Volatility']
Columns in SOL data before feature creation: ['open', 'high', 'low', 'close', 'volume', 'price', 'bad_data']
Features created successfully for SOL.
Columns in SOL data after feature creation: ['open', 'high', 'low', 'close', 'volume', 'price', 'bad_data', 'Close', 'SMA_20', 'SMA_50', 'EMA_20', 'EMA_50', 'Return', 'Volatility']
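
The indicators in Cell 3 are computed from the same row's `Close`, so each row's features already include the value being predicted. One common alternative, sketched here as an assumption and not as what this notebook does, is to shift the indicators by one row so that each row only sees earlier prices. The helper name `create_lagged_features`, its column names, and the toy series are hypothetical.

```python
import pandas as pd


def create_lagged_features(close, window=3):
    """Build features for row t from prices up to row t-1 only (hypothetical helper)."""
    features = pd.DataFrame(index=close.index)
    # shift(1) moves every indicator down one row, so row t holds values known at t-1
    features["SMA_lag1"] = close.rolling(window=window).mean().shift(1)
    features["EMA_lag1"] = close.ewm(span=window, adjust=False).mean().shift(1)
    features["Return_lag1"] = close.pct_change().shift(1)
    features["Target"] = close
    # Drop the leading rows where the shifted indicators are not yet defined
    return features.dropna()


toy_close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
lagged = create_lagged_features(toy_close)
print(lagged)
```

In the first surviving row, the moving average covers the three prices before the target, not the target itself.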


In [19]:
# Cell 4: Split Data into Training and Testing Sets
from sklearn.model_selection import train_test_split

train_data = {}
test_data = {}

for crypto, data in crypto_data.items():
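    # NOTE: only 'Close' is dropped, so X still contains 'close' (the column that
    # 'Close' was copied from) plus the same-day 'open', 'high', and 'low'.
    # The target is therefore present among the features, so the scores below
    # do not measure forecasting skill.
    X = data.drop(columns=['Close'])
    y = data['Close']
    # shuffle=False keeps chronological order: the last 20% of rows form the test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)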
    train_data[crypto] = (X_train, y_train)
    test_data[crypto] = (X_test, y_test)
    print(f"Data split into training and testing sets for {crypto}.")
    print(f"Training data shape: {X_train.shape}, Testing data shape: {X_test.shape}")


Data split into training and testing sets for BTC.
Training data shape: (3262, 13), Testing data shape: (816, 13)
Data split into training and testing sets for ETH.
Training data shape: (2656, 13), Testing data shape: (665, 13)
Data split into training and testing sets for SOL.
Training data shape: (1478, 13), Testing data shape: (370, 13)
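
Cell 4 passes `shuffle=False`, which makes `train_test_split` cut the data at a single point in time instead of sampling rows at random. A small sketch with a made-up date-indexed frame (the dates and values are illustrative only) shows that the test set is simply the last 20% of rows:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten consecutive days of made-up values, for illustration only
idx = pd.date_range("2024-01-01", periods=10, freq="D")
X = pd.DataFrame({"feature": range(10)}, index=idx)
y = pd.Series(range(10), index=idx, name="Close")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Every training date comes before every test date
print(X_train.index.max(), "<", X_test.index.min())
```

For time series this avoids training on rows that come after the ones used for testing.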


In [20]:
# Cell 5: Train the Random Forest Model
from sklearn.ensemble import RandomForestRegressor

models = {}

for crypto, (X_train, y_train) in train_data.items():
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    models[crypto] = model
    print(f"Random Forest model trained for {crypto}.")


Random Forest model trained for BTC.
Random Forest model trained for ETH.
Random Forest model trained for SOL.


In [21]:
# Cell 6: Evaluate the Model
from sklearn.metrics import mean_squared_error, r2_score

evaluation_results = {}

for crypto, model in models.items():
    X_train, y_train = train_data[crypto]
    X_test, y_test = test_data[crypto]
    
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    train_mse = mean_squared_error(y_train, y_train_pred)
    train_r2 = r2_score(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    
    evaluation_results[crypto] = {
        'train_mse': train_mse,
        'train_r2': train_r2,
        'test_mse': test_mse,
        'test_r2': test_r2
    }
    
    print(f"Evaluation results for {crypto}:")
    print(f"Train MSE: {train_mse}, Train R2: {train_r2}")
    print(f"Test MSE: {test_mse}, Test R2: {test_r2}")


Evaluation results for BTC:
Train MSE: 1187.5786553279734, Train R2: 0.9999953563300599
Test MSE: 791181.9746404784, Test R2: 0.9967433037702443
Evaluation results for ETH:
Train MSE: 2.8359457335444063, Train R2: 0.9999977010159834
Test MSE: 20.163046351264395, Test R2: 0.9999562896614471
Evaluation results for SOL:
Train MSE: 0.05530670301114794, Train R2: 0.9999805088346241
Test MSE: 0.4823760262702612, Test R2: 0.9998230732869194
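
R² values this close to 1 are easier to judge next to a simple baseline. On price levels, a naive forecast that repeats the previous close often scores a very high R² on its own, because consecutive prices are strongly correlated. The sketch below illustrates this on a synthetic random walk, not on the notebook's data; the series and seed are assumptions for demonstration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic random walk standing in for a price series (not real market data)
rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(loc=0.0, scale=1.0, size=2000))

# Persistence baseline: predict today's close with yesterday's close
y_true = prices[1:]
y_naive = prices[:-1]

baseline_r2 = r2_score(y_true, y_naive)
baseline_mse = mean_squared_error(y_true, y_naive)
print(f"Naive baseline MSE: {baseline_mse:.3f}, R2: {baseline_r2:.4f}")
```

Comparing a trained model against such a baseline on the same test rows shows how much of the score comes from the model and how much from the persistence of prices.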


In [22]:
# Cell 7: Save the Trained Model
import joblib
import os

os.makedirs("../models", exist_ok=True)

for crypto, model in models.items():
    model_path = f"../models/{crypto}_random_forest_model.pkl"
    joblib.dump(model, model_path)
    print(f"Model saved to {model_path}.")


Model saved to ../models/BTC_random_forest_model.pkl.
Model saved to ../models/ETH_random_forest_model.pkl.
Model saved to ../models/SOL_random_forest_model.pkl.
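
Later notebooks need to read these files back. A minimal round-trip sketch with `joblib.load`, using a temporary directory and a small synthetic model in place of the real `../models/` files (the `DEMO` file name and the data are hypothetical):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model standing in for one of the trained models
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(40, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=40)
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "DEMO_random_forest_model.pkl")
    joblib.dump(model, model_path)
    loaded_model = joblib.load(model_path)

# The reloaded model reproduces the original predictions
same_predictions = np.allclose(model.predict(X_demo), loaded_model.predict(X_demo))
print(same_predictions)
```

A reloaded model expects the same feature columns, in the same order, as the data it was trained on.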


In [23]:
# Cell 8: Summary of Model Training
for crypto, results in evaluation_results.items():
    print(f"Model Summary for {crypto} using Random Forest:")
    print(f"Train MSE: {results['train_mse']}, Train R2: {results['train_r2']}")
    print(f"Test MSE: {results['test_mse']}, Test R2: {results['test_r2']}")
    print(f"Model saved to ../models/{crypto}_random_forest_model.pkl")


Model Summary for BTC using Random Forest:
Train MSE: 1187.5786553279734, Train R2: 0.9999953563300599
Test MSE: 791181.9746404784, Test R2: 0.9967433037702443
Model saved to ../models/BTC_random_forest_model.pkl
Model Summary for ETH using Random Forest:
Train MSE: 2.8359457335444063, Train R2: 0.9999977010159834
Test MSE: 20.163046351264395, Test R2: 0.9999562896614471
Model saved to ../models/ETH_random_forest_model.pkl
Model Summary for SOL using Random Forest:
Train MSE: 0.05530670301114794, Train R2: 0.9999805088346241
Test MSE: 0.4823760262702612, Test R2: 0.9998230732869194
Model saved to ../models/SOL_random_forest_model.pkl
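
MSE is in squared price units, so the BTC, ETH, and SOL values above are hard to compare directly. Taking the square root gives RMSE, which is in the same units as the price. The sketch below applies this to the test MSE values printed in the summary:

```python
import math

# Test MSE values copied from the summary output above
test_mse = {
    "BTC": 791181.9746404784,
    "ETH": 20.163046351264395,
    "SOL": 0.4823760262702612,
}

# RMSE is the square root of MSE, expressed in the same units as the price
test_rmse = {crypto: math.sqrt(mse) for crypto, mse in test_mse.items()}

for crypto, rmse in test_rmse.items():
    print(f"{crypto} test RMSE: {rmse:.2f}")
```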
