<a href="https://colab.research.google.com/github/Raissa-hue310/Project-3-Machine-Learning-for-Predicting-Trading-Signals/blob/main/Project3_Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 3: Machine Learning for Predicting Trading Signals
## Course: Data Analytics and Business Intelligence Analyst
## Student: Raissa Maatho Mekjele

This project applies classification models (Logistic Regression, Random Forest, and SVM) to financial market data in order to predict Buy, Sell, and Hold trading signals derived from MACD and RSI.

# SECTION 1: FEATURE ENGINEERING WITH MACD, RSI, AND SIGNALS

In [2]:
# Load Project 2 cleaned dataset

import pandas as pd
import numpy as np

df = pd.read_csv("/content/full_clean_dataset.csv")
df = df.sort_values("date").reset_index(drop=True)
df.head()


Unnamed: 0,date,ticker,open,close,adj_close,low,high,volume,exchange,name,...,log_volume_capped,return,log_return,ma_7,ma_30,volatility_7,rsi_14,close_lag_1,return_lag_1,ticker_encoded
0,1981-01-05,AAPL,0.604911,0.602679,0.027219,0.602679,0.604911,8932000.0,NASDAQ,APPLE INC.,...,16.005151,-0.021739,-0.021979,0.616071,0.552381,0.045057,64.492758,0.616071,0.010989,0
1,1981-01-06,AAPL,0.578125,0.575893,0.026009,0.575893,0.578125,11289600.0,NASDAQ,APPLE INC.,...,16.239393,-0.044444,-0.045462,0.615434,0.55385,0.045987,64.492758,0.602679,-0.021739,0
2,1981-01-07,AAPL,0.553571,0.551339,0.0249,0.551339,0.553571,13921600.0,NASDAQ,APPLE INC.,...,16.448952,-0.042636,-0.043571,0.603635,0.553703,0.023536,66.917289,0.575893,-0.044444,0
3,1981-01-08,AAPL,0.542411,0.540179,0.024396,0.540179,0.542411,9956800.0,NASDAQ,APPLE INC.,...,16.113766,-0.020243,-0.020451,0.588967,0.552951,0.018383,63.157901,0.551339,-0.042636,0
4,1981-01-09,AAPL,0.569196,0.569196,0.025707,0.569196,0.571429,5376000.0,NASDAQ,APPLE INC.,...,15.594391,0.053719,0.052326,0.580676,0.553806,0.034789,64.999999,0.540179,-0.020243,0


## 1. Compute Exponential Moving Average (EMA)

In [3]:
def ema(series, span):
    return series.ewm(span=span, adjust=False).mean()
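
With `adjust=False`, pandas computes the EMA recursively with smoothing factor alpha = 2 / (span + 1). The sketch below checks that against a hand-written loop. The helper `ema_manual` and the sample prices are illustrative assumptions, not part of the project code.

```python
import numpy as np
import pandas as pd

def ema(series, span):
    # Same definition as in the notebook cell above
    return series.ewm(span=span, adjust=False).mean()

def ema_manual(values, span):
    """Hypothetical helper: y[0] = x[0], y[t] = alpha * x[t] + (1 - alpha) * y[t-1]."""
    alpha = 2.0 / (span + 1.0)
    out = [float(values[0])]
    for x in values[1:]:
        out.append(alpha * float(x) + (1.0 - alpha) * out[-1])
    return np.array(out)

prices = pd.Series([10.0, 11.0, 12.0, 11.5, 13.0])  # made-up prices for illustration
ema_pandas = ema(prices, span=3).to_numpy()
ema_loop = ema_manual(prices.to_numpy(), span=3)

print(np.allclose(ema_pandas, ema_loop))
```

Both versions should agree, which confirms that `span=12` and `span=26` correspond to the usual EMA12 and EMA26 smoothing factors.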


## 2. MACD Calculation (MANUAL)

MACD = EMA12 − EMA26

Signal Line = EMA9(MACD)

In [4]:
# MACD components
df["ema12"] = ema(df["close"], 12)
df["ema26"] = ema(df["close"], 26)

df["macd"] = df["ema12"] - df["ema26"]
df["signal_line"] = ema(df["macd"], 9)

# MACD Histogram (optional)
df["macd_hist"] = df["macd"] - df["signal_line"]


## 3. MACD Buy/Sell Signals (MANUAL CROSSOVER)

In [5]:
df["macd_buy"] = (df["macd"] > df["signal_line"]).astype(int)
df["macd_sell"] = (df["macd"] < df["signal_line"]).astype(int)
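
Note that these flags mark every day on which MACD sits above (or below) the signal line, not only the day on which it crosses. If event-style crossovers were wanted instead, one possible sketch is to compare each day's state with the previous day's. The column names `macd_cross_up` and `macd_cross_down` and the sample values are assumptions for illustration.

```python
import pandas as pd

# Made-up MACD and signal-line values for illustration
demo = pd.DataFrame({
    "macd":        [-0.2, -0.1, 0.1, 0.3, 0.1, -0.1],
    "signal_line": [ 0.0,  0.0, 0.0, 0.2, 0.2,  0.0],
})

above = demo["macd"] > demo["signal_line"]
below = demo["macd"] < demo["signal_line"]

# A bullish crossover: below the signal line yesterday, above it today
demo["macd_cross_up"] = (above & below.shift(1, fill_value=False)).astype(int)
# A bearish crossover: above the signal line yesterday, below it today
demo["macd_cross_down"] = (below & above.shift(1, fill_value=False)).astype(int)

print(demo)
```

Crossover events are much rarer than state flags, so switching to them would change the label distribution reported later in this notebook.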


## 4. RSI Calculation (MANUAL EMA-BASED)

RS = EMA(gains) / EMA(losses)

RSI = 100 − 100 / (1 + RS)

In [6]:
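def compute_rsi_ema(series, period=14):
    # Day-over-day price change; the first value is NaN and counts as neither gain nor loss
    delta = series.diff()

    # Split the changes into non-negative gain and loss magnitudes
    gain = np.where(delta > 0, delta, 0)
    loss = np.where(delta < 0, -delta, 0)

    # Smooth with the span-based EMA defined above; reuse the input index so the
    # result stays aligned with df even if its index is not 0..n-1
    gain_ema = ema(pd.Series(gain, index=series.index), period)
    loss_ema = ema(pd.Series(loss, index=series.index), period)

    # The small constant guards against division by zero when there are no losses
    rs = gain_ema / (loss_ema + 1e-9)
    rsi = 100 - (100 / (1 + rs))

    return rsi

df["rsi"] = compute_rsi_ema(df["close"], 14)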


## 5. RSI Buy/Sell Conditions (manual)

Buy if RSI < 30

Sell if RSI > 70

Else Hold

In [7]:
df["rsi_buy"] = (df["rsi"] < 30).astype(int)
df["rsi_sell"] = (df["rsi"] > 70).astype(int)


## 6. Combine MACD + RSI into Final Trading Signals

Rules from the assignment instructions:

| Condition              | Action |
| ---------------------- | ------ |
| MACD BUY AND RSI BUY   | Buy    |
| MACD SELL AND RSI SELL | Sell   |
| Otherwise              | Hold   |

Signal mapping:

Buy = 1

Sell = −1

Hold = 0

In [8]:
def combine_signals(row):
    if row["macd_buy"] == 1 and row["rsi_buy"] == 1:
        return 1       # BUY
    elif row["macd_sell"] == 1 and row["rsi_sell"] == 1:
        return -1      # SELL
    else:
        return 0       # HOLD

df["signal"] = df.apply(combine_signals, axis=1)
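
Row-wise `apply` works but is slow on large frames. As a sketch of an equivalent vectorized form, `np.select` can evaluate the same two rules column-wise. The small `demo` frame and the column names `signal_vec` and `signal_apply` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up indicator flags for illustration
demo = pd.DataFrame({
    "macd_buy":  [1, 0, 1, 0],
    "macd_sell": [0, 1, 0, 1],
    "rsi_buy":   [1, 0, 0, 0],
    "rsi_sell":  [0, 1, 0, 0],
})

def combine_signals(row):
    # Same rule as in the notebook cell above
    if row["macd_buy"] == 1 and row["rsi_buy"] == 1:
        return 1
    elif row["macd_sell"] == 1 and row["rsi_sell"] == 1:
        return -1
    return 0

conditions = [
    (demo["macd_buy"] == 1) & (demo["rsi_buy"] == 1),    # Buy
    (demo["macd_sell"] == 1) & (demo["rsi_sell"] == 1),  # Sell
]
demo["signal_vec"] = np.select(conditions, [1, -1], default=0)  # Hold otherwise
demo["signal_apply"] = demo.apply(combine_signals, axis=1)

print((demo["signal_vec"] == demo["signal_apply"]).all())
```

`np.select` checks the conditions in order, so the Buy rule takes precedence over the Sell rule just as in the `if`/`elif` chain.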


## Check distribution:

In [9]:
df["signal"].value_counts()

| signal | count |
| ------ | ----- |
| 0      | 9355  |
| -1     | 83    |
| 1      | 55    |


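## 7. Bollinger Bands

Bollinger Bands place an upper and a lower band two standard deviations around a 20-period simple moving average. They are added here as extra features: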

In [10]:
df["sma20"] = df["close"].rolling(20).mean()
df["std20"] = df["close"].rolling(20).std()

df["upper_band"] = df["sma20"] + 2 * df["std20"]
df["lower_band"] = df["sma20"] - 2 * df["std20"]


## 8. Drop NaN Created by Indicators



In [11]:
df = df.dropna().reset_index(drop=True)
print(df.shape)


(9474, 39)
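
The signal counts above sum to 9,355 + 83 + 55 = 9,493 rows, so `dropna` removed 19 rows. That is what a 20-period rolling window would leave undefined, assuming no other column contains NaN. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

close = pd.Series(np.arange(1.0, 31.0))  # 30 made-up prices for illustration

sma20 = close.rolling(20).mean()
std20 = close.rolling(20).std()

# The first window is only complete at the 20th row, so 19 rows are NaN
n_missing = int(sma20.isna().sum())
rows_left = len(pd.DataFrame({"sma20": sma20, "std20": std20}).dropna())

print(n_missing, rows_left)
```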


## 9. Define X (features) and y (signals)

The final feature set includes:

- price features

- Project 2 engineered features

- MACD, RSI, Bollinger

- lag features

- volume features

In [12]:
feature_cols = [
    "close", "return", "log_return",
    "ma_7", "ma_30", "volatility_7", "rsi",
    "ema12", "ema26", "macd", "signal_line",
    "macd_hist",
    "sma20", "upper_band", "lower_band",
    "close_lag_1", "return_lag_1",
    "log_volume_capped"
]

X = df[feature_cols]
y = df["signal"]


## 10. Train/Test Split


In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=False, test_size=0.2
)

X_train.shape, X_test.shape


((7579, 18), (1895, 18))
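
Because the split is chronological and the Buy and Sell labels are rare, it is worth checking how many examples of each class end up on each side. The helper `class_counts_by_split` and the sample labels below are assumptions for illustration. The helper mirrors an unshuffled 80/20 split by taking the last 20% of rows (rounded up) as the test set.

```python
import math
import pandas as pd

def class_counts_by_split(y, test_size=0.2):
    """Hypothetical helper: per-class counts for an unshuffled train/test split."""
    n_test = math.ceil(len(y) * test_size)
    train, test = y.iloc[:-n_test], y.iloc[-n_test:]
    counts = pd.DataFrame({
        "train": train.value_counts(),
        "test": test.value_counts(),
    })
    # A class missing from one side shows up as NaN; report it as 0
    return counts.fillna(0).astype(int)

y_demo = pd.Series([0, 0, 1, 0, -1, 0, 0, 0, -1, 0])  # made-up labels
counts = class_counts_by_split(y_demo)
print(counts)
```

In the notebook this could be called as `class_counts_by_split(y)`. A class with very few test examples makes its per-class metrics unreliable.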

# SECTION 2: MODEL BUILDING

In [14]:
# X_train, X_test, y_train, y_test already defined earlier
X_train.shape, X_test.shape


((7579, 18), (1895, 18))

## 1. LOGISTIC REGRESSION

We will:

- Train logistic regression

- Predict on test set

- Store predictions

We use balanced class weights because the signals are highly imbalanced: 9,355 Hold versus 83 Sell and 55 Buy in the full dataset.

In [15]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(
    multi_class="multinomial",
    class_weight="balanced",
    max_iter=500
)

log_reg.fit(X_train, y_train)

logreg_pred = log_reg.predict(X_test)


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
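
The warning above suggests scaling the data. One possible sketch, on synthetic data, wraps the model in a pipeline with `StandardScaler` so that features on very different scales (prices, returns, log volume) are standardized before fitting. The names `scaled_logreg`, `X_demo`, and `y_demo` are assumptions, and `multi_class` is left at its default here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on deliberately different scales
X_demo = rng.normal(size=(300, 3)) * np.array([1.0, 100.0, 10000.0])
# Synthetic three-class labels (-1, 0, 1) driven by the first feature
y_demo = np.where(X_demo[:, 0] > 0.5, 1, np.where(X_demo[:, 0] < -0.5, -1, 0))

scaled_logreg = make_pipeline(
    StandardScaler(),  # fitted on the training data only when .fit is called
    LogisticRegression(class_weight="balanced", max_iter=500),
)
scaled_logreg.fit(X_demo, y_demo)
demo_pred = scaled_logreg.predict(X_demo)
demo_accuracy = float((demo_pred == y_demo).mean())

print(round(demo_accuracy, 3))
```

In the notebook, the same pipeline could be fitted with `scaled_logreg.fit(X_train, y_train)`. Whether it removes the warning or improves the scores on this dataset has not been tested here.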


## 2. RANDOM FOREST CLASSIFIER

Random Forest handles non-linear patterns well and is a common choice for tabular features such as these technical indicators.

In [16]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,
    class_weight="balanced",
    random_state=42
)

rf.fit(X_train, y_train)

rf_pred = rf.predict(X_test)


## 3. SUPPORT VECTOR MACHINE (SVM)

SVM is powerful but computationally heavy.
We use:

- RBF kernel (`kernel="rbf"`)

- balanced class weights (`class_weight="balanced"`)

In [17]:
from sklearn.svm import SVC

svm_model = SVC(
    kernel="rbf",
    class_weight="balanced"
)

svm_model.fit(X_train, y_train)

svm_pred = svm_model.predict(X_test)


# SECTION 3: MODEL EVALUATION

We will compute:

- Accuracy
- Precision
- Recall
- F1-score
- Confusion Matrix
- Classification Report

## 1. Define Evaluation Function

In [18]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

def evaluate_model(name, y_true, y_pred):
    print("="*60)
    print(f"MODEL: {name}")
    print("="*60)
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print("Precision (macro):", precision_score(y_true, y_pred, average="macro"))
    print("Recall (macro):", recall_score(y_true, y_pred, average="macro"))
    print("F1 Score (macro):", f1_score(y_true, y_pred, average="macro"))
    print("\nClassification Report:\n")
    print(classification_report(y_true, y_pred))
    print("\nConfusion Matrix:\n", confusion_matrix(y_true, y_pred))
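
When a model never predicts one of the classes, precision for that class is undefined and scikit-learn emits an `UndefinedMetricWarning` (the `_warn_prf` lines in the outputs of this notebook). A sketch of how the `zero_division` parameter makes that choice explicit, using made-up labels:

```python
from sklearn.metrics import precision_score

y_true = [-1, 0, 0, 1]
y_pred = [-1, 0, 0, 0]  # class 1 is never predicted

# zero_division=0 scores the undefined per-class precision as 0 without warning
macro_precision = precision_score(
    y_true, y_pred, average="macro", zero_division=0
)
print(macro_precision)
```

Here the per-class precisions are 1 for class -1, 2/3 for class 0, and 0 for class 1, so the macro average is 5/9. The same argument is accepted by `recall_score`, `f1_score`, and `classification_report`.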


## 2. Evaluate All Three Models

In [19]:
evaluate_model("Logistic Regression", y_test, logreg_pred)
evaluate_model("Random Forest", y_test, rf_pred)
evaluate_model("SVM", y_test, svm_pred)


MODEL: Logistic Regression
Accuracy: 0.920844327176781
Precision (macro): 0.3915149969005884
Recall (macro): 0.5797602843315185
F1 Score (macro): 0.41730533354173344

Classification Report:

              precision    recall  f1-score   support

          -1       0.18      0.82      0.29        38
           0       1.00      0.92      0.96      1856
           1       0.00      0.00      0.00         1

    accuracy                           0.92      1895
   macro avg       0.39      0.58      0.42      1895
weighted avg       0.98      0.92      0.94      1895


Confusion Matrix:
 [[  31    7    0]
 [ 142 1714    0]
 [   0    1    0]]
MODEL: Random Forest
Accuracy: 0.9973614775725593
Precision (macro): 0.6657710908113917
Recall (macro): 0.6315789473684211
F1 Score (macro): 0.6476997578692494

Classification Report:

              precision    recall  f1-score   support

          -1       1.00      0.89      0.94        38
           0       1.00      1.00      1.00      1856
     

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


# SECTION 4: MODEL OPTIMIZATION (Hyperparameter Tuning)

We tune:

- Logistic Regression

- Random Forest

SVM tuning is optional because it is very slow, but the code is included as well.

## 1. Random Forest Hyperparameter Tuning

In [20]:
from sklearn.model_selection import GridSearchCV

rf_params = {
    "n_estimators": [100, 200, 300],
    "max_depth": [4, 6, 8, 10],
    "min_samples_split": [2, 5],
}

rf_grid = GridSearchCV(
    estimator=RandomForestClassifier(class_weight="balanced"),
    param_grid=rf_params,
    cv=3,
    scoring="f1_macro",
    n_jobs=-1
)

rf_grid.fit(X_train, y_train)

print("Best RF parameters:", rf_grid.best_params_)
best_rf = rf_grid.best_estimator_

rf_tuned_pred = best_rf.predict(X_test)
evaluate_model("Tuned Random Forest", y_test, rf_tuned_pred)


Best RF parameters: {'max_depth': 4, 'min_samples_split': 2, 'n_estimators': 200}
MODEL: Tuned Random Forest
Accuracy: 0.9984168865435357
Precision (macro): 0.66612874305182
Recall (macro): 0.6491228070175438
F1 Score (macro): 0.65738847865362

Classification Report:

              precision    recall  f1-score   support

          -1       1.00      0.95      0.97        38
           0       1.00      1.00      1.00      1856
           1       0.00      0.00      0.00         1

    accuracy                           1.00      1895
   macro avg       0.67      0.65      0.66      1895
weighted avg       1.00      1.00      1.00      1895


Confusion Matrix:
 [[  36    2    0]
 [   0 1856    0]
 [   0    1    0]]


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
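
With an integer `cv`, `GridSearchCV` uses stratified folds for classifiers, so a validation fold can precede part of its training data in time. For time-ordered data, one alternative is to pass a `TimeSeriesSplit`, which always validates on rows that come after the training rows. The sketch below uses synthetic data and a deliberately tiny grid. `demo_grid`, `X_demo`, and `y_demo` are assumed names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(240, 4))        # synthetic, time-ordered rows
y_demo = (X_demo[:, 0] > 0).astype(int)   # synthetic binary labels

tscv = TimeSeriesSplit(n_splits=3)
demo_grid = GridSearchCV(
    estimator=RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid={"n_estimators": [20], "max_depth": [2, 4]},
    cv=tscv,               # each fold trains on the past, validates on the future
    scoring="f1_macro",
)
demo_grid.fit(X_demo, y_demo)

# Every training index precedes every validation index in each fold
fold_is_chronological = all(
    train.max() < test.min() for train, test in tscv.split(X_demo)
)
print(demo_grid.best_params_, fold_is_chronological)
```

With classes as rare as Buy in this dataset, some folds may contain no examples of a class at all, so fold-level scores should be read with care.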


## 2. Logistic Regression Tuning

In [21]:
log_params = {
    "C": [0.01, 0.1, 1, 10],
    "solver": ["lbfgs", "newton-cg", "saga"]
}

log_grid = GridSearchCV(
    estimator=LogisticRegression(
        multi_class="multinomial",
        class_weight="balanced",
        max_iter=500
    ),
    param_grid=log_params,
    cv=3,
    scoring="f1_macro",
    n_jobs=-1
)

log_grid.fit(X_train, y_train)

print("Best Logistic Regression params:", log_grid.best_params_)
best_logreg = log_grid.best_estimator_

logreg_tuned_pred = best_logreg.predict(X_test)
evaluate_model("Tuned Logistic Regression", y_test, logreg_tuned_pred)




Best Logistic Regression params: {'C': 10, 'solver': 'newton-cg'}
MODEL: Tuned Logistic Regression
Accuracy: 0.9414248021108179
Precision (macro): 0.3953037716712926
Recall (macro): 0.5180259376890503
F1 Score (macro): 0.42143951103488675

Classification Report:

              precision    recall  f1-score   support

          -1       0.19      0.61      0.29        38
           0       0.99      0.95      0.97      1856
           1       0.00      0.00      0.00         1

    accuracy                           0.94      1895
   macro avg       0.40      0.52      0.42      1895
weighted avg       0.97      0.94      0.96      1895


Confusion Matrix:
 [[  23   15    0]
 [  95 1761    0]
 [   0    1    0]]


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


## 3. SVM Tuning

In [22]:
svm_params = {
    "C": [0.1, 1, 10],
    "kernel": ["rbf"],
    "gamma": ["scale", "auto"]
}

svm_grid = GridSearchCV(
    estimator=SVC(class_weight="balanced"),
    param_grid=svm_params,
    cv=3,
    scoring="f1_macro",
    n_jobs=-1
)

svm_grid.fit(X_train, y_train)

print("Best SVM params:", svm_grid.best_params_)
best_svm = svm_grid.best_estimator_

svm_tuned_pred = best_svm.predict(X_test)
evaluate_model("Tuned SVM", y_test, svm_tuned_pred)


Best SVM params: {'C': 10, 'gamma': 'auto', 'kernel': 'rbf'}
MODEL: Tuned SVM
Accuracy: 0.9794195250659631
Precision (macro): 0.3264731750219877
Recall (macro): 0.3333333333333333
F1 Score (macro): 0.3298675908646583

Classification Report:

              precision    recall  f1-score   support

          -1       0.00      0.00      0.00        38
           0       0.98      1.00      0.99      1856
           1       0.00      0.00      0.00         1

    accuracy                           0.98      1895
   macro avg       0.33      0.33      0.33      1895
weighted avg       0.96      0.98      0.97      1895


Confusion Matrix:
 [[   0   38    0]
 [   0 1856    0]
 [   0    1    0]]


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


# 🏁 Conclusion

In this project, I successfully developed a complete machine learning workflow to generate and classify trading signals using domain-driven technical indicators. Starting from the cleaned and feature-engineered dataset produced in Project 2, I manually implemented key indicators such as MACD, RSI, and Bollinger Bands, ensuring full control over the financial calculations and signal definitions.

The MACD and RSI indicators were then combined into Buy, Sell, and Hold trading labels using the rule table in Section 1. The dataset was split chronologically (first 80% for training, last 20% for testing) to preserve the time-series nature of financial data and avoid leaking future prices into the training process.

Three supervised machine learning models (Logistic Regression, Random Forest, and Support Vector Machine) were trained and evaluated. Performance was assessed using accuracy, precision, recall, F1-score, and confusion matrices. The tuned Random Forest performed best on the test set, with a macro F1 of about 0.66 and 36 of 38 Sell signals recovered. Logistic Regression provided an interpretable baseline but produced many false Sell signals (macro F1 of about 0.42). The tuned SVM predicted Hold for every test sample (macro F1 of about 0.33) despite its 98% accuracy, which shows how misleading accuracy is under this class imbalance. None of the reported models identified the single Buy signal in the test set.

This project highlights how financial indicators, combined with machine learning, can assist in detecting potential trading opportunities. However, it also demonstrates the challenges of market prediction due to noise, volatility, and imbalanced signal distribution. While the models provide meaningful insights, they should be viewed as supportive tools rather than standalone trading systems.

Overall, this project builds a strong foundation for more advanced market modeling approaches such as LSTM neural networks, reinforcement learning, or multi-asset trading systems. The methodologies applied here (feature engineering, signal generation, supervised learning, and model evaluation) meet the academic objectives of the assignment and provide practical experience applicable to real-world algorithmic trading workflows.


Link to GitHub repository: