---

# BTC Historical Data Analysis and Classification

Bitcoin (BTC), known for its high volatility, presents a unique opportunity for predictive modeling in financial markets. The objective of this notebook is to explore whether machine learning techniques like **Logistic Regression** and **Random Forest Classification** can accurately predict the direction of BTC's price movement based on historical trading data.

The classification task is framed as a binary prediction problem: will the BTC price increase or decrease after a specified number of days (`lookahead_days`)? Success in this task could provide valuable insights for designing trading strategies. However, the inherent complexity and noise in financial data require careful preprocessing, feature engineering, and model selection to achieve meaningful results.

We structure the analysis by first engineering features tailored to capture BTC price patterns, followed by training and evaluating both models. The comparison highlights how logistic regression, a linear model, struggles with non-linear patterns, and why Random Forest, a more flexible algorithm, might offer better performance.

---

## Feature Engineering

The quality of the features used can significantly influence the performance of a predictive model. For this task, we opted to create indicators commonly used in financial analysis, balancing simplicity with effectiveness:

1. **Moving Averages (MA):**
   These smooth out short-term fluctuations in price, helping to capture overall trends. We chose 7, 14, and 30-day windows to provide varying perspectives on price movement over different time frames. This decision ensures both short-term dynamics and broader trends are accounted for.

2. **Relative Strength Index (RSI):**
   RSI measures price momentum, helping identify overbought or oversold conditions. This feature was included to capture scenarios where price reversals might occur. Momentum indicators like RSI often add depth to trend-following features like moving averages.

3. **Bollinger Bands:**
   Volatility plays a critical role in cryptocurrency markets. By incorporating Bollinger Bands, which measure price deviations from a moving average, the model can account for periods of high volatility where prices deviate significantly.

4. **Volume-Based Indicators:**
   Volume can often signal upcoming price changes. A 7-day average of trading volume was chosen to highlight shifts in market activity, complementing price-based features.

These features were selected to provide a comprehensive yet interpretable view of the market. While they are straightforward, they align well with the domain knowledge of financial markets, making them robust starting points for model training.
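
For reference, the two less obvious indicators can be written out as follows. This is a sketch of the simple-average form of RSI and the standard Bollinger Band definition, using the window lengths that appear in the preprocessing code of this notebook (14 days for RSI, 20 days and a width of 2 standard deviations for the bands). The symbols $\overline{G}$, $\overline{L}$, $\mathrm{MA}$ and $\sigma$ are notation introduced here for illustration; other RSI variants use exponential smoothing instead of a simple average.

$$
\mathrm{RS}_t = \frac{\overline{G}^{(14)}_t}{\overline{L}^{(14)}_t}, \qquad
\mathrm{RSI}_t = 100 - \frac{100}{1 + \mathrm{RS}_t}
$$

$$
\mathrm{Upper}_t = \mathrm{MA}^{(20)}_t + 2\,\sigma^{(20)}_t, \qquad
\mathrm{Lower}_t = \mathrm{MA}^{(20)}_t - 2\,\sigma^{(20)}_t
$$

Here $\overline{G}^{(14)}_t$ and $\overline{L}^{(14)}_t$ are the average daily gain and average daily loss of the closing price over the last 14 days, and $\mathrm{MA}^{(20)}_t$ and $\sigma^{(20)}_t$ are the 20-day rolling mean and standard deviation of the closing price.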

---

## Dataset Details

The dataset includes key trading metrics such as opening price, closing price, highest and lowest price, and traded volume. We define the target variable as binary: 

- `1` if the BTC price increases after the specified `lookahead_days`.
- `0` if the BTC price decreases or stays the same.

This simple formulation allows us to focus on evaluating the models' capabilities in capturing price direction rather than precise numerical predictions.
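
A minimal sketch of this labelling rule in plain Python, mirroring the comparison against a shifted closing price used in the preprocessing code. The function name `label_direction` and the toy prices are assumptions for illustration. Because the comparison is strictly greater-than, an unchanged price is labelled `0`.

```python
def label_direction(closes, lookahead_days):
    """Return 1 where the close `lookahead_days` ahead is strictly higher, else 0.

    The last `lookahead_days` entries have no future price and are omitted,
    which corresponds to dropping rows whose future close is missing.
    """
    if lookahead_days < 1:
        raise ValueError("lookahead_days must be at least 1")
    labels = []
    for i in range(len(closes) - lookahead_days):
        future_close = closes[i + lookahead_days]
        labels.append(1 if future_close > closes[i] else 0)
    return labels


# Toy closing prices (illustrative values, not BTC data)
closes = [100, 102, 101, 102, 103, 99]
print(label_direction(closes, lookahead_days=2))
```

The second label in this toy series is `0` even though the price did not fall: 102 is compared with 102 two days later.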

While logistic regression serves as a baseline due to its simplicity and interpretability, the dataset's complexity and the likely presence of non-linear patterns in BTC price movement motivate the inclusion of Random Forest. This approach reflects a progression from understanding fundamental relationships to leveraging a model capable of capturing more intricate interactions.

By iteratively building upon these features and evaluating model performance, we aim to strike a balance between interpretability and predictive power. This process ensures that each decision is deliberate and grounded in the requirements of the task. 

---

### Imports

Import necessary modules.

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
    

The `load_and_preprocess` function focuses on preparing the raw BTC dataset for analysis and model training by integrating the feature engineering described earlier. It begins by importing the data, standardizing column names, converting dates to a usable format, and sorting the dataset chronologically to maintain the integrity of time-series data.

The primary objective of this function is to apply the predefined feature engineering pipeline—such as calculating moving averages, RSI, Bollinger Bands, and volume-based metrics—while ensuring compatibility for model training. It also defines the binary target variable (`1` for a price increase and `0` for a decrease) using a forward-looking approach based on `lookahead_days`. 

To finalize the preparation, any rows containing null values—introduced during rolling window calculations—are removed, ensuring a clean and complete dataset. This function operationalizes the feature engineering strategy and outputs a dataset ready for training and evaluation.
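
To illustrate why rows are lost, the sketch below reimplements a simple rolling mean in plain Python. The helper `rolling_mean` is an illustrative stand-in for a pandas rolling mean, not the notebook's code. The first `window - 1` positions have no complete window, so a 30-day moving average leaves the first 29 rows undefined.

```python
def rolling_mean(values, window):
    """Simple moving average; positions without a full window are None."""
    out = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window
            running_sum -= values[i - window]
        out.append(running_sum / window if i >= window - 1 else None)
    return out


prices = list(range(1, 41))            # 40 toy closing prices
ma_30 = rolling_mean(prices, 30)
missing = sum(v is None for v in ma_30)
print(missing)                         # rows a null-dropping step would remove for this column
```

In `load_and_preprocess`, rows at the end of the series are dropped as well, because the future closing price is undefined for the last `lookahead_days` rows.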

In [None]:

# Load Data
def load_and_preprocess(csv_path, lookahead_days):
    df = pd.read_csv(csv_path)
    df.columns = df.columns.str.lower()
    df['date'] = pd.to_datetime(df['date'])
    df = df.sort_values('date')

    # Feature Engineering
    df['ma_7day'] = df['close'].rolling(7).mean()
    df['ma_14day'] = df['close'].rolling(14).mean()
    df['ma_30day'] = df['close'].rolling(30).mean()

    # RSI Calculation
    df['price_change'] = df['close'].diff()
    df['gain'] = df['price_change'].clip(lower=0)
    df['loss'] = -1 * df['price_change'].clip(upper=0)
    avg_gain = df['gain'].rolling(14).mean()
    avg_loss = df['loss'].rolling(14).mean()
    rs = avg_gain / avg_loss
    df['rsi'] = 100 - (100 / (1 + rs))

    # Bollinger Bands
    df['bb_middle'] = df['close'].rolling(20).mean()
    df['bb_std'] = df['close'].rolling(20).std()
    df['bb_upper'] = df['bb_middle'] + (2 * df['bb_std'])
    df['bb_lower'] = df['bb_middle'] - (2 * df['bb_std'])

    # Volume-Based Indicators
    df['volume_ma_7day'] = df['volume'].rolling(7).mean()

    # Target Variable
    df['future_close'] = df['close'].shift(-lookahead_days)
    df['target'] = (df['future_close'] > df['close']).astype(int)

    # Drop rows with null values
    df = df.dropna()

    return df

# Dataset Path
csv_path = "../data/raw/btc_usd.csv"
df = load_and_preprocess(csv_path, lookahead_days=7)
    


### Feature Selection
The `feature_columns` list specifies the input features that will be used for model training. These include raw trading data (`open`, `high`, `low`, `volume`) as well as engineered features like moving averages, RSI, Bollinger Bands, and volume-based indicators. These features were chosen to capture a mix of trend, momentum, and volatility signals, providing a comprehensive view of BTC price dynamics. The variable `X` represents the feature set, while `y` holds the binary target variable (`1` for price increase, `0` for decrease).

### Train-Test Split
The data is split into two subsets:
- **Training Set (`X_train`, `y_train`)**: Used for training the model, comprising 70% of the data.
- **Testing Set (`X_test`, `y_test`)**: Used to evaluate model performance on unseen data, comprising the remaining 30%.

The `random_state` parameter ensures that the split is reproducible, maintaining consistent results across runs. Splitting the data helps evaluate the model's ability to generalize rather than overfit to the training data.
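
Note that `train_test_split` shuffles rows by default, so neighbouring days can end up on both sides of the split. A common alternative for time-ordered data, shown here only as a sketch and not what this notebook uses, is to split by position so that the test set is strictly later than the training set. The helper name `chronological_split` is an assumption for illustration.

```python
def chronological_split(rows, test_size=0.3):
    """Split an already time-ordered sequence into (train, test) without shuffling."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be between 0 and 1")
    n_test = round(len(rows) * test_size)
    split_index = len(rows) - n_test
    return rows[:split_index], rows[split_index:]


days = list(range(10))                 # stand-in for rows sorted by date
train, test = chronological_split(days, test_size=0.3)
print(train, test)
```

With this kind of split, every test row comes after every training row, which is closer to how a model would be used on future prices.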

### Feature Scaling
A `StandardScaler` is applied to standardize the feature values. This process centers each feature around a mean of 0 and scales it to have a standard deviation of 1:
- **Training Data (`X_train_scaled`)**: The scaler computes the mean and standard deviation from the training data and scales the features accordingly.
- **Testing Data (`X_test_scaled`)**: The same scaling parameters are applied to the testing data, ensuring consistency.

Feature scaling is crucial for algorithms like Logistic Regression, as it prevents features with larger ranges (e.g., volume) from dominating the learning process and puts all features on a comparable scale.
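
The fit-on-train, apply-to-test pattern can be sketched in plain Python for a single feature. The helper names and toy volumes below are assumptions for illustration; the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training data only."""
    mean = fmean(train_values)
    std = pstdev(train_values)
    # Fall back to 1.0 for constant features to avoid dividing by zero
    return mean, (std if std > 0 else 1.0)


def standardize(values, mean, std):
    """Apply previously learned parameters to any split."""
    return [(v - mean) / std for v in values]


train_volume = [1000.0, 2000.0, 3000.0]   # toy training values
test_volume = [4000.0]                    # toy test value
mean, std = fit_standardizer(train_volume)
scaled_train = standardize(train_volume, mean, std)
scaled_test = standardize(test_volume, mean, std)
print(scaled_train, scaled_test)
```

The test value is scaled with the training mean and standard deviation, so no information from the test set enters the scaling parameters.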

In [None]:

# Define Features and Target
feature_columns = [
    'open', 'high', 'low', 'volume',
    'ma_7day', 'ma_14day', 'ma_30day', 'rsi',
    'bb_upper', 'bb_lower', 'bb_middle', 'volume_ma_7day'
]
X = df[feature_columns]
y = df['target']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale Features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
    

### Logistic Regression

Logistic regression is chosen as a baseline model for this classification task due to its simplicity and interpretability. It assumes a linear relationship between the features and the log-odds of the target variable, making it a good starting point to understand how the features influence the outcome. The model is instantiated with a fixed random seed (`random_state=42`) for consistent results and a maximum iteration limit (`max_iter=500`) that gives the solver more room to converge than the scikit-learn default of 100.
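
As a small illustration of the linear log-odds assumption, the sketch below computes a predicted probability by hand. The function name, weights, bias, and feature values are all hypothetical and are not taken from the fitted model.

```python
import math


def predict_increase_probability(features, weights, bias):
    """Logistic model: the log-odds are a weighted sum of the (scaled) features."""
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical weights for two scaled features, e.g. rsi and ma_7day
weights = [0.8, -0.3]
bias = 0.1
p = predict_increase_probability([1.0, 0.5], weights, bias)
label = 1 if p >= 0.5 else 0   # class 1 when the probability reaches 0.5
print(round(p, 3), label)
```

Each feature shifts the log-odds by a fixed amount per unit, regardless of the values of the other features. This is the restriction that a tree-based model does not have.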


---

### Predictions

Once trained, the logistic regression model is used to make predictions:
- **Training Set Predictions**: The model predicts outcomes for the training set (`logistic_y_train_pred`) to assess how well it has captured patterns in the data it was trained on.
- **Testing Set Predictions**: Predictions on the testing set (`logistic_y_test_pred`) are used to evaluate how well the model generalizes to unseen data.

This dual evaluation provides a clearer picture of both overfitting and generalization performance.

---

### Metrics Calculation

To evaluate the logistic regression model, a range of metrics is computed:
- **Training Accuracy**: Measures how well the model fits the training data. High accuracy here could indicate a strong fit, but overly high accuracy relative to testing accuracy might signal overfitting.
- **Testing Accuracy**: Assesses the model's ability to generalize to unseen data, a critical factor in financial applications like BTC price prediction.
- **Precision**: Indicates the quality of positive predictions. In financial tasks, precision can be important when false positives carry a high cost (e.g., predicting price increases that do not occur).
- **Recall**: Reflects the model's ability to identify actual positive cases. In scenarios where missing opportunities (e.g., failing to predict a price increase) is costly, recall becomes vital.
- **F1 Score**: Provides a balanced view of precision and recall, especially useful when there’s an imbalance between classes.
- **Confusion Matrix**: Summarizes the prediction results into true positives, true negatives, false positives, and false negatives. This helps in understanding the types of errors the model is making.

---

### Thoughts on Logistic Regression

Logistic regression is a strong baseline model, but it may struggle in capturing non-linear relationships and complex interactions between features, which are often present in financial datasets like BTC price movements. The results from this model will provide a benchmark to compare against more complex models, such as Random Forest, which can better handle non-linearities and feature interactions. This decision-making process ensures that the progression from simple to complex models is logical and data-driven.

In [None]:

# Logistic Regression
logistic_model = LogisticRegression(random_state=42, max_iter=500)
logistic_model.fit(X_train_scaled, y_train)

# Logistic Regression Predictions
logistic_y_train_pred = logistic_model.predict(X_train_scaled)
logistic_y_test_pred = logistic_model.predict(X_test_scaled)

# Logistic Regression Metrics
logistic_metrics = {
    'train_accuracy': accuracy_score(y_train, logistic_y_train_pred),
    'test_accuracy': accuracy_score(y_test, logistic_y_test_pred),
    'precision': precision_score(y_test, logistic_y_test_pred),
    'recall': recall_score(y_test, logistic_y_test_pred),
    'f1_score': f1_score(y_test, logistic_y_test_pred),
    'confusion_matrix': confusion_matrix(y_test, logistic_y_test_pred)
}
    

### Report on Logistic Regression Results

The logistic regression model yielded the following metrics:

- **Training Accuracy**: 56.15%
- **Testing Accuracy**: 55.46%
- **Precision**: 56.45%
- **Recall**: 87.20%
- **F1 Score**: 68.54%

The **confusion matrix** shows significant imbalance in performance:
```
Confusion Matrix:
 [[ 77 415]
 [ 79 538]]
```
- **Class 0 (Price Decrease)**: The model performed poorly, with very few true negatives (77) and a large number of false positives (415).
- **Class 1 (Price Increase)**: The model achieved high recall (538 true positives), but at the cost of precision, as false positives are prevalent.
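
The reported metrics can be reproduced directly from the confusion matrix. The sketch below assumes the scikit-learn layout, where rows are true classes and columns are predicted classes; the helper name `metrics_from_confusion` is an assumption for illustration.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive common binary classification metrics from confusion matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        'accuracy': (tp + tn) / total,
        'precision': precision,
        'recall': recall,
        'f1_score': 2 * precision * recall / (precision + recall),
        'class0_recall': tn / (tn + fp),                  # share of actual decreases caught
        'always_predict_1_accuracy': (tp + fn) / total,   # trivial baseline
    }


# Counts taken from the matrix above: [[TN, FP], [FN, TP]]
m = metrics_from_confusion(tn=77, fp=415, fn=79, tp=538)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Under these counts, only 77 of the 492 actual decreases (about 15.7%) are identified, and always predicting `1` would score about 55.6% accuracy on this test set, which is roughly the same as the model's testing accuracy.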

---

### Thoughts on Results

1. **Model Performance**:
   - The low overall accuracy (~55%) indicates that the model struggles to differentiate between price increases and decreases effectively.
   - The disparity between precision and recall suggests the model heavily favors predicting the positive class (`1`, price increase), leading to high recall but low precision.

2. **Feature Engineering and Data**:
   - Logistic regression assumes a **linear relationship** between features and the log-odds of the target variable. However, financial markets like BTC prices often exhibit **non-linear dynamics**, which linear models struggle to capture.
   - While the engineered features (moving averages, RSI, Bollinger Bands) are robust, they might not fully encapsulate the non-linear interactions in the data.

3. **Class Imbalance**:
   - Although the target classes are not heavily imbalanced (617 increases versus 492 decreases in the test set), the model's prediction bias shows that it frequently misclassifies negative cases (price decreases). This could stem from subtle patterns in the data that logistic regression cannot pick up.

---


## Experiment: Increasing the Lookahead Timeframe

In this section, we explore the impact of varying the `lookahead_days` parameter on the model's performance. By increasing the lookahead period, we aim to observe how the model handles changes in the prediction window and whether it affects the accuracy, precision, recall, and other metrics.
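
The loop in the next cell calls `process_and_train_binary_classification`, which is not defined in the cells shown so far. The following is a minimal sketch of what such a helper might look like, assuming it simply reruns the preprocessing, split, scaling, and logistic regression steps from the earlier cells and returns `(model, metrics)`. The `loader` and `features` parameters are hypothetical additions so the sketch can also run outside the notebook; by default it falls back to the notebook's `load_and_preprocess` and `feature_columns`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def process_and_train_binary_classification(csv_path, lookahead_days,
                                            loader=None, features=None):
    """Hypothetical helper: rerun the earlier pipeline for one lookahead period."""
    if loader is None:
        loader = load_and_preprocess      # defined earlier in the notebook
    if features is None:
        features = feature_columns        # defined earlier in the notebook

    data = loader(csv_path, lookahead_days)
    X, y = data[features], data['target']

    # Same split and scaling settings as the earlier cells
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    model = LogisticRegression(random_state=42, max_iter=500)
    model.fit(X_train_scaled, y_train)

    y_train_pred = model.predict(X_train_scaled)
    y_test_pred = model.predict(X_test_scaled)
    metrics = {
        'train_accuracy': accuracy_score(y_train, y_train_pred),
        'test_accuracy': accuracy_score(y_test, y_test_pred),
        'precision': precision_score(y_test, y_test_pred),
        'recall': recall_score(y_test, y_test_pred),
        'f1_score': f1_score(y_test, y_test_pred),
        'confusion_matrix': confusion_matrix(y_test, y_test_pred),
    }
    return model, metrics
```

Keeping the split and model settings identical across lookahead periods means that differences in the metrics can be attributed to the prediction window rather than to the configuration.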
    

In [None]:

# Define different lookahead timeframes to test
timeframes = [7, 14, 30, 60]  # Lookahead periods in days
results = {}

# Assumes process_and_train_binary_classification(csv_path, lookahead_days) is
# already defined and returns (model, metrics); it is expected to wrap the
# preprocessing, split, scaling, and logistic regression steps shown earlier.
for days in timeframes:
    print(f"Running for lookahead_days = {days}")
    model, metrics = process_and_train_binary_classification(csv_path, lookahead_days=days)
    results[days] = metrics
    print(f"Results for lookahead_days = {days}: {metrics}")
        

In [None]:

# Summarize and analyze results for different lookahead periods
for days, metrics in results.items():
    print(f"\nLookahead Days: {days}")
    print(metrics)
    


To further investigate the effect of the prediction window (`lookahead_days`), we test the model with three extended periods: two weeks (`lookahead_days = 14`), one month (`lookahead_days = 30`), and two months (`lookahead_days = 60`).
    


## Final Analysis: Limitations of Logistic Regression and Transition to Random Forest

### Observations from Lookahead Experiments

#### Results Summary:
- **Two Weeks (`lookahead_days = 14`)**:
  - Training Accuracy: 58.19%
  - Testing Accuracy: 56.10%
  - F1 Score: 68.40%

- **One Month (`lookahead_days = 30`)**:
  - Training Accuracy: 59.69%
  - Testing Accuracy: 58.89%
  - F1 Score: 71.13%

- **Two Months (`lookahead_days = 60`)**:
  - Training Accuracy: 61.40%
  - Testing Accuracy: 62.03%
  - F1 Score: 72.99%

As the lookahead period increases, the model's performance improves modestly: testing accuracy rises from 55.46% at 7 days to 62.03% at 60 days, and the F1 score from 68.54% to 72.99%. This suggests that longer timeframes reduce short-term noise and allow the model to identify more stable patterns. However, the confusion matrix still highlights persistent challenges in predicting price decreases (`0`), with low precision and recall for this class.

### Why Logistic Regression Struggles

1. **Linear Assumptions**:
   Logistic regression assumes a linear relationship between features and the target, which is often too simplistic for financial data. BTC prices exhibit non-linear dynamics, where trends, momentum, and volatility interact in complex ways that logistic regression cannot capture.

2. **Feature Sensitivity**:
   Features like RSI and moving averages are more aligned with upward trends, introducing bias toward predicting price increases (`1`). This bias skews the confusion matrix, with many actual decreases (`0`) misclassified as increases, i.e. false positives.

3. **Class Imbalance**:
   Logistic regression, as configured here, minimizes overall log-loss without weighting the classes, so it tends to favor the more frequent class (`1`) at the expense of the less frequent class (`0`). This trade-off is evident in the low precision and recall for `0` across all experiments.

### Why Random Forest Can Help

1. **Non-Linear Relationships**:
   Random Forest does not assume linearity and can capture complex patterns in the data by building multiple decision trees that split the feature space based on non-linear thresholds.

2. **Feature Interactions**:
   Random Forest naturally incorporates feature interactions, such as the combined effect of RSI and Bollinger Bands, which may be critical for distinguishing between upward and downward price movements.

3. **Robustness to Noise**:
   By averaging predictions across trees, Random Forest reduces sensitivity to noise, making it better suited for volatile financial data.

4. **Balanced Performance**:
   With hyperparameter tuning, for example scikit-learn's `class_weight` setting, Random Forest can address class imbalances, potentially improving precision and recall for the minority class.

### Next Steps

To address the limitations of logistic regression, the next logical step is to implement Random Forest. By leveraging its ability to handle non-linear relationships and interactions, we aim to achieve better-balanced predictions and improved performance across both classes.
    