<a href="https://colab.research.google.com/github/Samuel-Solomon-1/Project-3-Machine-Learning-for-Predicting-Trading-Signals/blob/main/Project_3_Machine_Learning_for_Predicting_Trading_Signals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 3: Machine Learning for Predicting Trading Signals

## Overview

This project focuses on applying machine learning to predict 'Buy', 'Sell', or 'Hold' trading signals based on calculated technical indicators like **MACD** and **RSI**. We'll create these indicators manually using domain formulas, use them to generate labeled signals, and apply supervised machine learning models to classify trading decisions.

By the end of this project, we aim to:

- Engineer MACD and RSI from scratch
- Generate trading signals: Buy, Sell, or Hold
- Train and evaluate models: Logistic Regression, Random Forest, and SVM
- Use accuracy, precision, recall, and F1-score to assess model performance

## Dataset

The dataset used is the cleaned and transformed output from **Project 2**, which contains stock prices and technical features for selected tickers.

## Task 1: Feature Engineering – MACD, RSI & Signal Generation

In this task, we compute the technical indicators MACD and RSI directly from their mathematical formulas, using only pandas and NumPy rather than a dedicated technical-analysis library. Based on these indicators, we define:

- **Buy**: MACD is above its signal line (`macd_diff > 0`) and RSI is below 30
- **Sell**: MACD is below its signal line (`macd_diff < 0`) and RSI is above 70
- **Hold**: Otherwise
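
As a reference for the cell that follows, the indicator definitions can be written out explicitly. This is a sketch of the standard formulas as the code implements them; the symbols $p_t$ (scaled close on day $t$), $G_t$, $L_t$, and $\mathrm{Hist}_t$ (the `macd_diff` column) are notation introduced here, not names from the notebook.

$$
\mathrm{EMA}_n(x)_t = \alpha\, x_t + (1-\alpha)\,\mathrm{EMA}_n(x)_{t-1}, \qquad \alpha = \frac{2}{n+1}, \qquad \mathrm{EMA}_n(x)_0 = x_0
$$

$$
\mathrm{MACD}_t = \mathrm{EMA}_{12}(p)_t - \mathrm{EMA}_{26}(p)_t, \qquad \mathrm{Signal}_t = \mathrm{EMA}_{9}(\mathrm{MACD})_t, \qquad \mathrm{Hist}_t = \mathrm{MACD}_t - \mathrm{Signal}_t
$$

$$
G_t = \max(p_t - p_{t-1},\, 0), \qquad L_t = \max(p_{t-1} - p_t,\, 0), \qquad \mathrm{RSI}_t = 100 - \frac{100}{1 + \overline{G}_t / \overline{L}_t}
$$

Here $\overline{G}_t$ and $\overline{L}_t$ are exponentially weighted averages of the gains and losses with smoothing factor $\alpha = 1/14$, which is what `com = period - 1` selects in pandas.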

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
import numpy as np

# Load dataset (replace with your path)
input_path = '/content/drive/MyDrive/Project2/train_clean.csv'
df = pd.read_csv(input_path, parse_dates=['date'])

# Sort by ticker and date to ensure correct calculation of rolling EMAs
df = df.sort_values(['ticker', 'date'])

# Function to calculate EMA
def ema(series, span):
    return series.ewm(span=span, adjust=False).mean()

# Compute MACD and Signal Line for each ticker group
def compute_macd(group):
    close = group['close_scaled']
    ema12 = ema(close, 12)
    ema26 = ema(close, 26)
    macd_line = ema12 - ema26
    signal_line = ema(macd_line, 9)
    macd_diff = macd_line - signal_line

    group = group.assign(macd=macd_line, macd_signal=signal_line, macd_diff=macd_diff)
    return group

df = df.groupby('ticker').apply(compute_macd).reset_index(drop=True)

# Compute RSI manually for each ticker group
def compute_rsi(group, period=14):
    delta = group['close_scaled'].diff()
    gain = delta.clip(lower=0)
    loss = -delta.clip(upper=0)

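    # com = period - 1 gives a smoothing factor alpha = 1 / period;
    # min_periods leaves RSI as NaN until enough price changes are available.
    avg_gain = gain.ewm(com=period - 1, min_periods=period).mean()
    avg_loss = loss.ewm(com=period - 1, min_periods=period).mean()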

    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))
    group = group.assign(rsi=rsi)
    return group

df = df.groupby('ticker').apply(compute_rsi).reset_index(drop=True)

# Define signal generation functions

def get_macd_signal(row):
    # Buy while the MACD line is above its signal line (macd_diff > 0).
    # Note: this checks the current state, not the crossover event itself.
    if row['macd_diff'] > 0:
        return 'Buy'
    # Sell while the MACD line is below its signal line (macd_diff < 0)
    elif row['macd_diff'] < 0:
        return 'Sell'
    else:
        return 'Hold'

def get_rsi_signal(row):
    if row['rsi'] < 30:
        return 'Buy'
    elif row['rsi'] > 70:
        return 'Sell'
    else:
        return 'Hold'

def combine_signals(row):
    macd_signal = get_macd_signal(row)
    rsi_signal = get_rsi_signal(row)
    if macd_signal == 'Buy' and rsi_signal == 'Buy':
        return 'Buy'
    elif macd_signal == 'Sell' and rsi_signal == 'Sell':
        return 'Sell'
    else:
        return 'Hold'

# Apply signal generation
df['macd_signal_flag'] = df.apply(get_macd_signal, axis=1)
df['rsi_signal_flag'] = df.apply(get_rsi_signal, axis=1)
df['signal'] = df.apply(combine_signals, axis=1)

# Preview results
print(df[['ticker', 'date', 'close_scaled', 'macd', 'macd_signal', 'macd_diff', 'rsi', 'macd_signal_flag', 'rsi_signal_flag', 'signal']].head(20))

# Save to csv for further tasks
output_path = '/content/drive/MyDrive/Project3/train_with_signals.csv'
df.to_csv(output_path, index=False)
print(f"Saved processed data with signals to {output_path}")



   ticker       date  close_scaled      macd   macd_signal  macd_diff  \
0    AAPL 1981-01-26     -0.397000  0.000000  0.000000e+00   0.000000   
1    AAPL 1981-01-27     -0.397042 -0.000003 -6.608318e-07  -0.000003   
2    AAPL 1981-01-28     -0.397207 -0.000019 -4.343057e-06  -0.000015   
3    AAPL 1981-01-29     -0.397394 -0.000046 -1.268986e-05  -0.000033   
4    AAPL 1981-01-30     -0.397663 -0.000088 -2.778927e-05  -0.000060   
5    AAPL 1981-02-02     -0.397932 -0.000142 -5.056164e-05  -0.000091   
6    AAPL 1981-02-03     -0.397767 -0.000169 -7.419093e-05  -0.000095   
7    AAPL 1981-02-04     -0.397601 -0.000175 -9.430619e-05  -0.000080   
8    AAPL 1981-02-05     -0.397601 -0.000178 -1.109495e-04  -0.000067   
9    AAPL 1981-02-06     -0.397580 -0.000176 -1.239609e-04  -0.000052   
10   AAPL 1981-02-09     -0.397829 -0.000193 -1.376963e-04  -0.000055   
11   AAPL 1981-02-10     -0.397829 -0.000203 -1.508516e-04  -0.000053   
12   AAPL 1981-02-11     -0.397974 -0.000221 -1.649
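
The three row-wise `df.apply` calls in the cell above run a Python function once per row, which can be slow on a dataset with more than 100,000 rows. A minimal vectorized sketch of the same Buy/Sell/Hold rule is shown here, assuming the `macd_diff` and `rsi` columns computed above; the function name `vectorized_signals` and the small demo frame are hypothetical.

```python
import numpy as np
import pandas as pd


def vectorized_signals(df: pd.DataFrame) -> pd.Series:
    """Apply the combined MACD/RSI rule to whole columns at once."""
    macd_buy = df["macd_diff"] > 0
    macd_sell = df["macd_diff"] < 0
    rsi_buy = df["rsi"] < 30
    rsi_sell = df["rsi"] > 70
    # Buy/Sell only when both indicators agree; everything else is Hold.
    labels = np.select(
        [macd_buy & rsi_buy, macd_sell & rsi_sell],
        ["Buy", "Sell"],
        default="Hold",
    )
    return pd.Series(labels, index=df.index, name="signal")


# Hypothetical demo rows, including a zero macd_diff and a missing RSI.
demo = pd.DataFrame({
    "macd_diff": [0.5, -0.5, 0.5, 0.0, -0.2],
    "rsi": [25.0, 75.0, 50.0, 20.0, np.nan],
})
demo_signals = vectorized_signals(demo)
print(demo_signals.tolist())
```

Comparisons against `NaN` evaluate to `False`, so rows without an RSI value fall through to `Hold`, which matches the behavior of the row-wise functions.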

## Task 2: Data Preparation and Splitting

In this task, we prepare the dataset for machine learning. The `signal` column created in Task 1 (with values "Buy", "Sell", or "Hold") serves as the target label for supervised learning.

Steps include:

1. Loading the dataset with technical indicators and signals saved in Task 1.
2. Encoding the categorical target labels (`signal`) as integers (`signal_encoded`).
3. Selecting the feature columns for the prediction task (`macd`, `macd_signal`, `rsi`).
4. Splitting the dataset into training (70%), validation (15%), and test (15%) sets, stratified by the target.
5. Saving the prepared splits for use in the modeling phase.

This preparation ensures the data is structured correctly, with features and labels ready for machine learning algorithms.
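
Because later reports show the classes only as `0`, `1`, and `2`, it helps to know how `LabelEncoder` assigns codes: it sorts the class names, so the mapping is alphabetical. A small sketch on a hypothetical list of labels (the variable names are illustrative, not from the notebook):

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# Hypothetical labels in arbitrary order.
codes = encoder.fit_transform(["Hold", "Buy", "Sell", "Hold"])

# classes_ is sorted, so position i is the label behind integer code i.
code_to_label = {code: str(label) for code, label in enumerate(encoder.classes_)}
print(codes.tolist())
print(code_to_label)

# inverse_transform turns predicted codes back into readable labels.
decoded = encoder.inverse_transform([2, 0]).tolist()
print(decoded)
```

The same `inverse_transform` call can be applied to model predictions to read them as Buy, Hold, or Sell.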

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load the processed dataset with signals
input_path = '/content/drive/MyDrive/Project3/train_with_signals.csv'
df = pd.read_csv(input_path, parse_dates=['date'])

# Drop rows with missing RSI (the first rows of each ticker, where fewer than 14 price changes are available)
df = df.dropna(subset=['rsi']).reset_index(drop=True)

# Encode target signal labels (Buy, Hold, Sell) to integers
label_encoder = LabelEncoder()
df['signal_encoded'] = label_encoder.fit_transform(df['signal'])

# Select features to use for modeling
feature_cols = ['macd', 'macd_signal', 'rsi']

X = df[feature_cols]
y = df['signal_encoded']

# Split data: 70% train, 15% validation, 15% test (stratified by target)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

# Optional: Combine X and y to save prepared splits
train_df = X_train.copy()
train_df['signal_encoded'] = y_train

val_df = X_val.copy()
val_df['signal_encoded'] = y_val

test_df = X_test.copy()
test_df['signal_encoded'] = y_test

# Save the prepared datasets to Google Drive
output_dir = '/content/drive/MyDrive/Project3'
train_df.to_csv(f'{output_dir}/train_prepared.csv', index=False)
val_df.to_csv(f'{output_dir}/val_prepared.csv', index=False)
test_df.to_csv(f'{output_dir}/test_prepared.csv', index=False)

print("Data split and saved successfully!")
print(f"Training set size: {len(train_df)}")
print(f"Validation set size: {len(val_df)}")
print(f"Test set size: {len(test_df)}")
print(f"Encoded classes: {list(label_encoder.classes_)}")

Data split and saved successfully!
Training set size: 81109
Validation set size: 17381
Test set size: 17381
Encoded classes: ['Buy', 'Hold', 'Sell']
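
The split above is random and stratified, so rows from the same stretch of a ticker's history can end up in all three sets. For time-ordered data, a chronological split is a common alternative. The sketch below is not what this notebook does; it assumes a `date` column like the one loaded above, and the function name `chronological_split` and the demo frame are hypothetical.

```python
import pandas as pd


def chronological_split(frame, date_col="date", train_frac=0.70, val_frac=0.15):
    """Split rows by date so validation and test data are strictly later than training data."""
    unique_dates = frame[date_col].drop_duplicates().sort_values().reset_index(drop=True)
    n_dates = len(unique_dates)
    # Last date that still belongs to the training / validation period.
    train_end = unique_dates.iloc[round(n_dates * train_frac) - 1]
    val_end = unique_dates.iloc[round(n_dates * (train_frac + val_frac)) - 1]

    train = frame[frame[date_col] <= train_end]
    val = frame[(frame[date_col] > train_end) & (frame[date_col] <= val_end)]
    test = frame[frame[date_col] > val_end]
    return train, val, test


# Hypothetical demo: two tickers observed on the same 20 business days.
dates = pd.bdate_range("2024-01-01", periods=20)
demo = pd.DataFrame({
    "ticker": ["AAA"] * 20 + ["BBB"] * 20,
    "date": list(dates) * 2,
    "rsi": 50.0,
})
train_part, val_part, test_part = chronological_split(demo)
print(len(train_part), len(val_part), len(test_part))
```

A date-based split cannot be stratified, so with classes as rare as `Buy` and `Sell` it is worth checking that each split still contains some minority rows.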


## Task 3: Model Building and Validation

In this task, we implement and train three supervised machine learning models to predict stock trading signals (`Buy`, `Sell`, `Hold`) based on the technical indicators computed previously.

The models used are:
1. **Logistic Regression**
2. **Random Forest Classifier**
3. **Support Vector Machine (SVM)**

### Process

- Load the prepared training, validation, and test datasets.
- Select key technical indicator features (`macd`, `macd_signal`, `rsi`) as inputs.
- Use the integer target `signal_encoded` created in Task 2 (0 = Buy, 1 = Hold, 2 = Sell).
- Train each model on the training set.
- Validate each model’s performance on the validation set using classification metrics such as accuracy, precision, recall, and F1-score.
- Compare results to identify the best-performing model.

### Notes

- Logistic Regression provides a simple baseline model.
- Random Forest can capture complex nonlinear relationships.
- SVMs are powerful classifiers but can be computationally intensive on large datasets.

This step establishes a foundation for predictive modeling and informs the choice of models for further optimization and testing.
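
Logistic Regression and RBF SVMs can be sensitive to feature scale, and the features here differ by orders of magnitude: the `macd` values printed in Task 1 are around 1e-4, while RSI ranges from 0 to 100. One common option, not used in the cell below, is to wrap these models in a pipeline with `StandardScaler`. The following is a minimal sketch on synthetic data; the arrays `X_demo` and `y_demo` and the label rule are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for (macd, macd_signal, rsi): two tiny-scale columns and one 0-100 column.
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.normal(0.0, 1e-4, size=300),
    rng.normal(0.0, 1e-4, size=300),
    rng.uniform(0.0, 100.0, size=300),
])
# Hypothetical binary label that depends only on the tiny-scale first column.
y_demo = (X_demo[:, 0] > 0).astype(int)

# The scaler is fitted inside the pipeline, so it only ever sees the data passed to fit().
scaled_models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

scaled_scores = {}
for name, pipeline in scaled_models.items():
    pipeline.fit(X_demo, y_demo)
    scaled_scores[name] = pipeline.score(X_demo, y_demo)  # training accuracy
    print(f"{name}: training accuracy {scaled_scores[name]:.2f}")
```

Keeping the scaler inside the pipeline also means that cross-validation refits it on each training fold, which avoids leaking validation statistics into training.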

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils import shuffle

input_dir = '/content/drive/MyDrive/Project3'

train_df = pd.read_csv(f'{input_dir}/train_prepared.csv')
val_df = pd.read_csv(f'{input_dir}/val_prepared.csv')
test_df = pd.read_csv(f'{input_dir}/test_prepared.csv')

# Feature columns based on your data
feature_cols = ['macd', 'macd_signal', 'rsi']

X_train, y_train = train_df[feature_cols], train_df['signal_encoded']
X_val, y_val = val_df[feature_cols], val_df['signal_encoded']
X_test, y_test = test_df[feature_cols], test_df['signal_encoded']

# Shuffle training data
X_train, y_train = shuffle(X_train, y_train, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel='rbf', probability=True, random_state=42)
}

for name, model in models.items():
    print(f"\nTraining {name}...")
    model.fit(X_train, y_train)

    print(f"Validation performance for {name}:")
    y_pred = model.predict(X_val)
    print(classification_report(y_val, y_pred))

    cm = confusion_matrix(y_val, y_pred)
    print(f"Confusion Matrix:\n{cm}")

best_model_name = "Random Forest"  # Choose the best based on above
best_model = models[best_model_name]

print(f"\nEvaluating best model ({best_model_name}) on test set...")
y_test_pred = best_model.predict(X_test)
print(classification_report(y_test, y_test_pred))
print("Test Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred))


Training Logistic Regression...
Validation performance for Logistic Regression:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        40
           1       0.99      1.00      1.00     17294
           2       0.00      0.00      0.00        47

    accuracy                           0.99     17381
   macro avg       0.33      0.33      0.33     17381
weighted avg       0.99      0.99      0.99     17381

Confusion Matrix:
[[    0    40     0]
 [    0 17294     0]
 [    0    47     0]]

Training Random Forest...


Validation performance for Random Forest:
              precision    recall  f1-score   support

           0       0.78      0.35      0.48        40
           1       1.00      1.00      1.00     17294
           2       1.00      0.19      0.32        47

    accuracy                           1.00     17381
   macro avg       0.92      0.51      0.60     17381
weighted avg       1.00      1.00      1.00     17381

Confusion Matrix:
[[   14    26     0]
 [    4 17290     0]
 [    0    38     9]]

Training SVM...
Validation performance for SVM:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        40
           1       0.99      1.00      1.00     17294
           2       0.00      0.00      0.00        47

    accuracy                           0.99     17381
   macro avg       0.33      0.33      0.33     17381
weighted avg       0.99      0.99      0.99     17381

Confusion Matrix:
[[    0    40     0]
 [    0 17294     0]
 [   
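
The per-class numbers in the reports above can be recomputed directly from a confusion matrix, which makes it easier to see why overall accuracy is misleading here. This sketch uses the Random Forest validation matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Random Forest validation confusion matrix from the output above
# (rows = true class, columns = predicted class; order: Buy, Hold, Sell).
cm = np.array([
    [14, 26, 0],
    [4, 17290, 0],
    [0, 38, 9],
])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)      # share of each true class that was found
precision = true_positives / cm.sum(axis=0)   # share of each predicted class that was right
accuracy = true_positives.sum() / cm.sum()
balanced_accuracy = recall.mean()             # unweighted mean of per-class recall

print("recall:", recall.round(2))
print("precision:", precision.round(2))
print(f"accuracy: {accuracy:.3f}, balanced accuracy: {balanced_accuracy:.2f}")
```

The recall values reproduce the 0.35 (Buy) and 0.19 (Sell) shown in the report, and the balanced accuracy matches the macro-average recall of 0.51, far below the headline accuracy.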



## Task 4: Model Evaluation and Optimization

### Description

In this task, we perform a thorough evaluation of the trained models on the test dataset to measure their real-world predictive performance. Beyond just evaluating, we will optimize model performance by tuning hyperparameters using techniques such as Grid Search with cross-validation. This helps ensure that the models generalize well and are robust to unseen data.

**Key points:**

- Evaluate models (Logistic Regression, Random Forest, SVM) on the test set using metrics like accuracy, precision, recall, F1-score, and confusion matrix.
- Use cross-validation (e.g., Stratified K-Fold) on the training data for hyperparameter tuning.
- Perform hyperparameter search using Grid Search or Randomized Search.
- Re-train models with optimized hyperparameters and compare results.
- Document findings and select the best performing model.
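
Grid search cost grows with the product of the grid sizes and the number of folds, so it can be useful to count the fits before launching a search. The sketch below applies scikit-learn's `ParameterGrid` to the same grids used in the cell that follows; the dictionary names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same hyperparameter grids as in the tuning cell.
param_grids = {
    "Logistic Regression": {"C": [0.01, 0.1, 1, 10], "solver": ["liblinear", "lbfgs"]},
    "Random Forest": {
        "n_estimators": [100, 200],
        "max_depth": [None, 10, 20],
        "min_samples_split": [2, 5],
        "min_samples_leaf": [1, 2],
    },
    "SVM": {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"], "gamma": ["scale", "auto"]},
}

n_folds = 5
fit_counts = {}
for name, grid in param_grids.items():
    n_candidates = len(ParameterGrid(grid))  # number of hyperparameter combinations
    fit_counts[name] = n_candidates * n_folds
    print(f"{name}: {n_candidates} candidates x {n_folds} folds = {fit_counts[name]} fits")
```

For SVMs in particular, each fit on tens of thousands of rows can be slow, so trimming the grid or using a randomized search may be worth considering.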

In [None]:
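import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import SVC

# X_train, y_train, X_test, and y_test are reused from the Task 3 cell.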

# Define models with initial parameters and parameter grids for tuning
models_params = {
    "Logistic Regression": {
        "model": LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42),
        "params": {
            "C": [0.01, 0.1, 1, 10],
            "solver": ["liblinear", "lbfgs"]
        }
    },
    "Random Forest": {
        "model": RandomForestClassifier(class_weight='balanced', random_state=42),
        "params": {
            "n_estimators": [100, 200],
            "max_depth": [None, 10, 20],
            "min_samples_split": [2, 5],
            "min_samples_leaf": [1, 2]
        }
    },
    "SVM": {
        "model": SVC(class_weight='balanced', probability=True, random_state=42),
        "params": {
            "C": [0.1, 1, 10],
            "kernel": ["rbf", "linear"],
            "gamma": ["scale", "auto"]
        }
    }
}

# Stratified K-Fold cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

best_models = {}

for model_name, mp in models_params.items():
    print(f"\nStarting Grid Search for {model_name}...")
    grid_search = GridSearchCV(mp['model'], mp['params'], cv=cv, scoring='f1_weighted', n_jobs=-1, verbose=1)
    grid_search.fit(X_train, y_train)

    print(f"Best params for {model_name}: {grid_search.best_params_}")
    best_models[model_name] = grid_search.best_estimator_

# Evaluate best models on the test set
for model_name, model in best_models.items():
    print(f"\nEvaluating {model_name} on test data:")
    y_pred = model.predict(X_test)

    # Class names in LabelEncoder order (0 = Buy, 1 = Hold, 2 = Sell), as printed in Task 2
    print(classification_report(y_test, y_pred, target_names=['Buy', 'Hold', 'Sell']))

    cm = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix:")
    print(cm)


Starting Grid Search for Logistic Regression...
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best params for Logistic Regression: {'C': 0.01, 'solver': 'liblinear'}

Starting Grid Search for Random Forest...
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Best params for Random Forest: {'max_depth': 20, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 200}

Starting Grid Search for SVM...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
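
The grid search above selects hyperparameters by `f1_weighted`, which weights each class by its support. With data this imbalanced, that score is dominated by `Hold`. The sketch below illustrates the effect using the validation class counts from Task 3 and a degenerate model that always predicts `Hold`; `f1_macro` is shown as one possible alternative, not as the notebook's choice.

```python
import numpy as np
from sklearn.metrics import f1_score

# Validation class counts from Task 3: 40 Buy (0), 17294 Hold (1), 47 Sell (2).
y_true = np.repeat([0, 1, 2], [40, 17294, 47])
# A degenerate model that always predicts Hold, like the untuned Logistic Regression.
y_pred = np.ones_like(y_true)

# zero_division=0 scores classes that are never predicted as 0 without emitting warnings.
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"weighted F1: {weighted_f1:.3f}, macro F1: {macro_f1:.3f}")
```

The weighted score stays above 0.99 even though no `Buy` or `Sell` row is found, while the macro score drops to roughly one third. Passing `scoring='f1_macro'` to `GridSearchCV` would make the search reward minority-class performance.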


### Insights Gathered and Conclusion

#### Overview

This section summarizes the key findings from the end-to-end process of generating trading signals with machine learning. From manually computing MACD and RSI to optimizing model parameters, the project aimed to evaluate how well simple technical indicators can drive predictive models in a financial context.

#### Key Insights

1. **Manual Feature Engineering with Technical Indicators**
   - Computing MACD and RSI from scratch offered transparency and customization.
   - It ensured we fully understood the inner workings of these indicators, reinforcing their limitations and assumptions.

2. **Class Imbalance Dominated by ‘Hold’ Signals**
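   - Most entries in the dataset were labeled `Hold`: in the validation set, 17,294 of 17,381 rows (about 99.5%) are `Hold`, with only 40 `Buy` and 47 `Sell` rows.
   - This made it difficult for models to detect minority classes like `Buy` and `Sell`, and it makes accuracy uninformative: a model that always predicts `Hold` already scores about 99.5%.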

3. **Model Performance Comparison**
   - **Random Forest** showed the best performance among the three models:
     - It handled non-linear patterns well and generalized better across classes: on the validation set it reached a recall of 0.35 for `Buy` and 0.19 for `Sell`.
     - After hyperparameter tuning, it showed improved recall on minority classes.
   - **Logistic Regression** and **SVM** performed poorly on the minority classes even after tuning, primarily predicting `Hold`; on the validation set both had a recall of 0.00 for `Buy` and `Sell`.

4. **Effectiveness of Grid Search and Cross-Validation**
   - Grid Search CV helped fine-tune hyperparameters for each model.
   - Stratified 5-fold cross-validation ensured a fair evaluation by preserving class proportions across folds.

5. **Limitations of MACD and RSI for Signal Prediction**
   - These indicators are **lagging** and may not capture sharp price movements or market news.
   - This limits their usefulness in high-frequency or volatile trading scenarios.

#### Conclusion

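This project demonstrated that technical indicators like MACD and RSI can be used to build machine learning models for generating trading signals. However, because the labels are themselves derived from MACD and RSI thresholds, the models learn to reproduce that rule rather than to forecast price movements, and their predictive power is limited, especially under class imbalance and market noise.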

To improve results:
- Introduce additional features (e.g., Bollinger Bands, volume trends, sentiment scores).
- Apply class balancing techniques such as **SMOTE** or **undersampling**.
- Explore more sophisticated models (e.g., Gradient Boosting, LSTM for sequence modeling).
- Incorporate profitability-focused metrics beyond accuracy and recall.
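
As an illustration of the undersampling idea, the sketch below caps every class at a multiple of the rarest class using only pandas. The function name `undersample_majority`, the `max_ratio` parameter, and the demo data are hypothetical; SMOTE would need an additional library and is not shown.

```python
import pandas as pd


def undersample_majority(frame, label_col="signal", max_ratio=5, seed=42):
    """Cap every class at max_ratio times the size of the rarest class."""
    cap = int(frame[label_col].value_counts().min() * max_ratio)
    parts = [
        group.sample(n=min(len(group), cap), random_state=seed)
        for _, group in frame.groupby(label_col)
    ]
    # Shuffle so that classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Hypothetical imbalanced demo frame: 1000 Hold, 10 Buy, 12 Sell.
demo = pd.DataFrame({
    "rsi": range(1022),
    "signal": ["Hold"] * 1000 + ["Buy"] * 10 + ["Sell"] * 12,
})
balanced = undersample_majority(demo)
balanced_counts = balanced["signal"].value_counts().to_dict()
print(balanced_counts)
```

Resampling of this kind should be applied to the training split only, so that validation and test sets keep the real class distribution.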

While machine learning can support trading decisions, it must be paired with domain expertise and strong risk management to be effective in real-world applications.