# **Comparative Analysis of Eighteen Machine Learning Models for Stock Price Forecasting in the Financial Sector Using Historical Yahoo Finance Data.**

# Aim and Objectives of the Study

The general objective of this study is to conduct a comprehensive comparative analysis of eighteen machine learning models for stock price forecasting in the financial sector using historical Yahoo Finance data. The specific objectives of the study are to:

- Examine the predictive performance of traditional regression-based machine learning models in forecasting stock prices.

- Evaluate the effectiveness of tree-based and ensemble learning models for financial time-series prediction.

- Assess the capability of kernel-based and probabilistic models in capturing non-linear stock price dynamics.

- Analyze the forecasting accuracy of deep learning models in comparison with conventional machine learning techniques.

# **Library Importation**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, HistGradientBoostingClassifier
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.neural_network import MLPClassifier, MLPRegressor

from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import TomekLinks
from imblearn.ensemble import BalancedRandomForestClassifier

import tensorflow as tf
from tensorflow.keras import layers, models

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

from tabulate import tabulate

# **Data Loading**

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
df = pd.read_csv('yahooStock.csv')

In [None]:
df.head(5)

# **Data Preprocessing**

In [None]:
# Feature engineering: Clean 'transactionType' column
df_processed = df.copy()
df_processed['transactionType_cleaned'] = df_processed['transactionType'].astype(str).str.strip().str.upper()

# Debugging: Check unique cleaned transaction types before filtering
print("Unique cleaned transaction types found in the dataset:")
print(df_processed['transactionType_cleaned'].unique())
print("Value counts for cleaned transaction types:")
print(df_processed['transactionType_cleaned'].value_counts())

In [None]:
# Filter for relevant transaction types ('BUY' or 'SELL')
purchase_mask = df_processed['transactionType_cleaned'].str.contains('BUY', na=False)
sale_mask = df_processed['transactionType_cleaned'].str.contains('SELL', na=False)

# Combine masks for filtering
df_filtered = df_processed[purchase_mask | sale_mask].copy()

# Create 'Target' column: 1 for Purchase (BUY), 0 for Sale (SELL)
df_filtered['Target'] = np.where(df_filtered['transactionType_cleaned'].str.contains('BUY', na=False), 1, 0)

print("DataFrame filtered for 'BUY' and 'SELL' transactions, and 'Target' column created.")
print("Head of df_filtered:\n", df_filtered.head())

In [None]:
# Select features from available numerical columns and handle potential NaNs
numerical_features = ['amount', 'reportedPrice', 'usdValue']
for col in numerical_features:
    df_filtered[col] = pd.to_numeric(df_filtered[col], errors='coerce')

# Drop rows with NaN values only in the selected features or target
df_final = df_filtered.dropna(subset=numerical_features + ['Target'])

X = df_final[numerical_features]
y = df_final['Target']

# Check if X is empty after all processing steps
if X.empty:
    raise ValueError("DataFrame is empty after filtering and NaN removal. No data to train on. Check 'transactionType' values and numerical features.")

# Print target class distribution before train-test split
print("\nTarget class distribution before train-test split:")
print(y.value_counts())
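
The cleaning step above relies on `pd.to_numeric(..., errors='coerce')` turning unparseable entries into `NaN`, which `dropna` then removes. A minimal sketch of that behavior on a made-up frame (the values and the `raw` / `clean` names are illustrative assumptions, not rows from `yahooStock.csv`):

```python
import pandas as pd

# Hypothetical toy frame: one 'amount' entry cannot be parsed as a number.
raw = pd.DataFrame({"amount": ["100", "n/a", "250"], "Target": [1, 0, 1]})

# errors="coerce" replaces unparseable strings with NaN instead of raising.
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")

# Only rows with a valid feature value and target survive.
clean = raw.dropna(subset=["amount", "Target"])
print(clean)
```

Comparing `len(df_filtered)` with `len(df_final)` in the notebook would show how many rows this step discards.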

## **Exploratory Data Analysis (EDA)**

### **Visualize 'amount' with a Box Plot**


In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='Target', y='amount', data=df_final)
plt.title('Distribution of Transaction Amount by Target')
plt.xlabel('Target (0: SELL, 1: BUY)')
plt.ylabel('Amount')
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.violinplot(x='Target', y='reportedPrice', data=df_final)
plt.title('Distribution of Reported Price by Target')
plt.xlabel('Target (0: SELL, 1: BUY)')
plt.ylabel('Reported Price')
plt.show()

In [None]:
plt.figure(figsize=(10, 7))
sns.scatterplot(x='amount', y='usdValue', hue='Target', data=df_final, palette='viridis', alpha=0.7)
plt.title('Relationship between Amount and USD Value by Target')
plt.xlabel('Amount')
plt.ylabel('USD Value')
plt.legend(title='Target (0: SELL, 1: BUY)')
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df_final['eodHolding'], bins=50, kde=True)
plt.title('Distribution of End of Day Holding')
plt.xlabel('End of Day Holding')
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x='symbolType', data=df_final)
plt.title('Distribution of Symbol Type')
plt.xlabel('Symbol Type')
plt.ylabel('Count')
plt.show()

In [None]:
plt.figure(figsize=(8, 8))
df_final['hasOptions'].value_counts().plot.pie(autopct='%1.1f%%', startangle=90, colors=sns.color_palette('pastel'))
plt.title('Distribution of Has Options')
plt.ylabel('') # Hide the default y-label
plt.show()

### **Transaction Type distribution across Symbol Types**

In [None]:
plt.figure(figsize=(12, 7))
sns.countplot(x='symbolType', hue='Target', data=df_final, palette='viridis')
plt.title('Transaction Type Distribution Across Symbol Types')
plt.xlabel('Symbol Type')
plt.ylabel('Count')
plt.legend(title='Target (0: SELL, 1: BUY)')
plt.show()

### **Transaction Type distribution by Has Options**

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(x='hasOptions', hue='Target', data=df_final, palette='viridis')
plt.title('Transaction Type Distribution by Has Options')
plt.xlabel('Has Options')
plt.ylabel('Count')
plt.legend(title='Target (0: SELL, 1: BUY)')
plt.show()

### **Average Amount by Symbol Type**

In [None]:
# Calculate the average 'amount' for each 'symbolType'
average_amount_by_symbol_type = df_final.groupby('symbolType')['amount'].mean().reset_index()

# Create a bar plot
plt.figure(figsize=(12, 7))
sns.barplot(x='symbolType', y='amount', hue='symbolType', data=average_amount_by_symbol_type, palette='viridis', legend=False)
plt.title('Average Transaction Amount by Symbol Type')
plt.xlabel('Symbol Type')
plt.ylabel('Average Amount')
plt.show()

### **Average Amount by Has Options**


In [None]:
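# Calculate the average 'amount' for each 'hasOptions' value
average_amount_by_has_options = df_final.groupby('hasOptions')['amount'].mean().reset_index()

# Create a bar plot
plt.figure(figsize=(7, 5))
sns.barplot(x='hasOptions', y='amount', hue='hasOptions', data=average_amount_by_has_options, palette='viridis', legend=False)
plt.title('Average Transaction Amount by Has Options')
plt.xlabel('Has Options')
plt.ylabel('Average Amount')
plt.show()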

### **Distribution of 'symbolCode'**

In [None]:
plt.figure(figsize=(7, 5))
sns.countplot(x='symbolCode', data=df_final, palette='viridis', hue='symbolCode', legend=False)
plt.title('Distribution of Symbol Code')
plt.xlabel('Symbol Code')
plt.ylabel('Count')
plt.show()

### **Pair Plot of Numerical Features and Target**

In [None]:
print("\n### Pair Plot of Numerical Features and Target:\n")
sns.pairplot(df_final[numerical_features + ['Target']], hue='Target', diag_kind='kde')
plt.suptitle('Pair Plot of Numerical Features by Target', y=1.02)
plt.show()

# **Relationships Between Variables**

In [None]:
display(df_final[numerical_features + ['Target']].corr())

In [None]:
print("\n### Correlation Matrix:\n")
plt.figure(figsize=(8, 6))
sns.heatmap(df_final[numerical_features + ['Target']].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Numerical Features and Target')
plt.show()

# **Models**

### **Train-test split**

In [None]:

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

print("Data split into training and testing sets.")
print(f"x_train shape: {x_train.shape}")
print(f"x_test shape: {x_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
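
With `shuffle=False`, `train_test_split` keeps the row order, so the test set is simply the last 25% of rows. The sketch below illustrates this on a toy array; it assumes the rows are already in chronological order, which the notebook itself does not enforce (no sort by date is applied above). The `X_toy` / `y_toy` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 ordered rows, 2 features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10) % 2

# shuffle=False preserves order: the first rows train, the last rows test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, shuffle=False
)

print("train rows:", X_tr[:, 0].tolist())
print("test rows: ", X_te[:, 0].tolist())
```

If the CSV is not sorted by transaction date, sorting `df_final` before the split would be needed for the split to be a true past/future separation.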

### **Scaling**

In [None]:
# Scaling
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

print("Features scaled using StandardScaler.")
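
The scaler is fitted on the training split only and then reused on the test split, so the test rows are standardized with the training mean and standard deviation. A small sketch on made-up numbers (the `train` / `test` arrays are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Fit on train only; the learned mean and std are reused for test.
scaler = StandardScaler().fit(train)
scaled_test = scaler.transform(test)

# Equivalent manual computation with the training statistics
# (StandardScaler uses the population standard deviation, ddof=0).
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(scaled_test, manual)
```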

### **Handle imbalance (SMOTE) - only if there are at least two classes in y_train**

In [None]:
# Handle imbalance (SMOTE) - only if there are at least two classes in y_train
if len(np.unique(y_train)) > 1:
    smote = SMOTE(random_state=42)
    x_train, y_train = smote.fit_resample(x_train, y_train)
    print("SMOTE applied to x_train and y_train for imbalance handling.")
    print(f"x_train shape after SMOTE: {x_train.shape}")
    print(f"y_train shape after SMOTE: {y_train.shape}")
    print("Target class distribution after SMOTE:")
    print(pd.Series(y_train).value_counts())
else:
    print("Warning: SMOTE not applied. y_train contains only one class. Binary classification might be problematic.")
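
For intuition, SMOTE creates each synthetic minority sample by interpolating between a minority point and one of its minority-class nearest neighbors. The sketch below shows only that interpolation step; `smote_like_sample` is a hypothetical helper written for illustration, not imbalanced-learn's implementation, and the two points are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_like_sample(x_i, x_neighbor, rng):
    """Return a point on the segment between x_i and a minority-class neighbor."""
    gap = rng.uniform(0.0, 1.0)  # random position along the segment
    return x_i + gap * (x_neighbor - x_i)

x_i = np.array([1.0, 2.0])   # a minority sample
x_nn = np.array([3.0, 6.0])  # one of its minority-class neighbors
synthetic = smote_like_sample(x_i, x_nn, rng)
print(synthetic)
```

Because the resampling is applied to `x_train` and `y_train` only, the test set keeps its original class distribution.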

## **Model Accuracy**

In [None]:

def calculate_accuracy_at_k(y_true, y_pred_proba, k_percentages):
    """
    Calculates Accuracy@k (precision at top k%) for given true labels and predicted probabilities.

    Args:
        y_true (array-like): The actual labels (0 or 1).
        y_pred_proba (array-like): The predicted probabilities for the positive class.
        k_percentages (list): A list of float values representing the top k% thresholds (e.g., [0.10, 0.20, 0.30]).

    Returns:
        dict: A dictionary where keys are 'Accuracy@k%' (e.g., 'Accuracy@10%') and values are the calculated precision percentages.
    """
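    # 1. Combine y_true and y_pred_proba into a DataFrame.
    #    np.asarray drops any pandas index so the two inputs are paired by position.
    df_combined = pd.DataFrame({'true_label': np.asarray(y_true), 'predicted_proba': np.asarray(y_pred_proba)})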

    # 2. Sort DataFrame by 'predicted_proba' in descending order
    df_combined = df_combined.sort_values(by='predicted_proba', ascending=False).reset_index(drop=True)

    # 3. Initialize dictionary to store results
    accuracy_at_k_results = {}

    # 4. Iterate through each k_percentage
    for k_percentage in k_percentages:
        # a. Calculate the number of samples for the current top k% threshold
        k_count = int(len(y_true) * k_percentage)

        # Handle the case where k_count is zero (e.g., k_percentage is very small or y_true is empty)
        if k_count == 0:
            accuracy_at_k_results[f'Accuracy@{int(k_percentage * 100)}%'] = 0.0
            continue

        # b. Select the top k_count samples from the sorted DataFrame
        top_k_df = df_combined.head(k_count)

        # c. Within these top k_count samples, count how many have a 'true_label' equal to 1
        correct_predictions = top_k_df['true_label'].sum()

        # d. Calculate the precision at k
        precision_at_k = (correct_predictions / k_count) * 100

        # e. Store this calculated precision (percentage) in the results dictionary
        accuracy_at_k_results[f'Accuracy@{int(k_percentage * 100)}%'] = precision_at_k

    return accuracy_at_k_results

print("The 'calculate_accuracy_at_k' function has been defined.")
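
As a quick sanity check of what Accuracy@k measures, the sketch below recomputes the same quantity (share of true positives among the top-scored fraction) on a made-up ranking. `precision_at_top_fraction` is a hypothetical, compact re-implementation for illustration; the labels and scores are invented.

```python
import numpy as np

def precision_at_top_fraction(y_true, scores, fraction):
    """Percentage of label-1 rows among the top `fraction` of rows by score."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    k = int(len(y_true) * fraction)
    if k == 0:
        return 0.0
    # Indices of the k highest scores (stable sort keeps ties in input order).
    top_idx = np.argsort(-scores, kind="stable")[:k]
    return float(y_true[top_idx].mean() * 100)

y_toy = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
scores_toy = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]

p20 = precision_at_top_fraction(y_toy, scores_toy, 0.20)  # top 2 rows: labels 1, 0
p40 = precision_at_top_fraction(y_toy, scores_toy, 0.40)  # top 4 rows: labels 1, 0, 1, 1
print(p20, p40)
```

Note that the metric only looks at the positive class (BUY), so a model that ranks SELL rows well gets no credit from it.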

## **Linear & Probabilistic Models**

### **1. Linear Regression**

In [None]:
# Train Linear Regression
# It treats y (0 or 1) as a continuous target
lr_model = LinearRegression()
lr_model.fit(x_train, y_train)

# Convert Continuous Output to Binary Class
# We use 0.5 as the default threshold, but in imbalanced data,
# you might lower this (e.g., 0.2) to increase Recall.
raw_predictions = lr_model.predict(x_test)
y_pred = (raw_predictions > 0.5).astype(int)

# Accuracy on test data
train_score_lr = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_lr)
print("Accuracy percent of Linear Regression : ", round(train_score_lr*100, 1), "%")
# LinearRegression.score returns the R^2 coefficient, not classification accuracy
print("Model R^2 score (on train data): ", lr_model.score(x_train, y_train))

# Classification report and confusion matrix
print("\nLinear Regression Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
lr_accuracy_at_k = calculate_accuracy_at_k(y_test, raw_predictions, [0.10, 0.20, 0.30])
print("\nLinear Regression Accuracy@k:", lr_accuracy_at_k)
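
The cells in this notebook unpack `cm.ravel()` as `TN, FP, FN, TP`, which relies on scikit-learn's row = true label, column = predicted label layout for binary labels 0 and 1. The sketch below shows the thresholding and the unpacking on invented values; passing `labels=[0, 1]` is an optional safeguard (an addition here, not used in the notebook) that keeps the matrix 2x2 even if one class is missing from the test set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up continuous outputs and true labels.
raw = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7])
y_true_toy = np.array([1, 0, 0, 1, 0, 1])

# Same thresholding rule as the Linear Regression cell.
y_pred_toy = (raw > 0.5).astype(int)

# Rows are true labels, columns are predictions: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1]).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```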


### **2. Logistic Regression (with SMOTE)**

In [None]:
# Logic: Linear boundary with synthetic oversampling
# SMOTE was already applied to x_train and y_train in the data preparation step,
# so the resampled training data is used directly here.
model = LogisticRegression(max_iter=1000, random_state=42).fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)
y_pred_proba = model.predict_proba(x_test)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
# Note: this reassigns train_score_lr, overwriting the value stored by the Linear Regression cell
train_score_lr = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_lr)
print("Accuracy percent of Logistic Regression : ", round(train_score_lr*100, 1), "%")
print("Model score (on resampled train data): ", model.score(x_train, y_train))

# Classification report and confusion matrix
print("\nLogistic Regression Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
logistic_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nLogistic Regression Accuracy@k:", logistic_accuracy_at_k)


### **3. Naïve Bayes**



In [None]:
# Logic: Assumes features are conditionally independent given the class
model = GaussianNB().fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)
y_pred_proba = model.predict_proba(x_test)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
train_score_nb = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_nb)
print("Accuracy percent of Naïve Bayes : ", round(train_score_nb*100, 1), "%")
print("Model score (on train data): ", model.score(x_train, y_train))

# Classification report and confusion matrix
print("\nNaïve Bayes Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
naive_bayes_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nNaïve Bayes Accuracy@k:", naive_bayes_accuracy_at_k)

### **4. LDA (Linear Discriminant Analysis)**


In [None]:
# Logic: Maximizes class separability
model = LinearDiscriminantAnalysis().fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)
y_pred_proba = model.predict_proba(x_test)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
train_score_lda = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_lda)
print("Accuracy percent of LDA : ", round(train_score_lda*100, 1), "%")
print("Model score (on train data): ", model.score(x_train, y_train))

# Classification report and confusion matrix
print("\nLDA Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
lda_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nLDA Accuracy@k:", lda_accuracy_at_k)

### **5. Support Vector Machines (SVM)**

In [None]:
# Logic: Finds the hyperplane that maximizes the margin between classes
model = SVC(kernel='rbf', class_weight='balanced', probability=True, random_state=42)
model.fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)
y_pred_proba = model.predict_proba(x_test)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
train_score_svm = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_svm)
print("Accuracy percent of SVM : ", round(train_score_svm*100, 1), "%")
print("Model score (on resampled train data): ", model.score(x_train, y_train))

# Classification report and confusion matrix
print("\nSVM Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
svm_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nSVM Accuracy@k:", svm_accuracy_at_k)

### **6. k-Nearest Neighbors (k-NN)**



In [None]:
# Logic: Classifies each point by majority vote among its k nearest neighbors
# Apply TomekLinks under-sampling to remove borderline pairs before fitting
X_res_tl, y_res_tl = TomekLinks().fit_resample(x_train, y_train)

# Initialize and fit k-NN model
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_res_tl, y_res_tl)

# Predict
y_pred = model.predict(x_test)
y_pred_proba = model.predict_proba(x_test)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
train_score_knn = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_knn)
print("Accuracy percent of k-NN : ", round(train_score_knn*100, 1), "%")
print("Model score (on resampled train data): ", model.score(X_res_tl, y_res_tl))

# Classification report and confusion matrix
print("\nk-NN Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
knn_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nk-NN Accuracy@k:", knn_accuracy_at_k)


### **7. t-SNE (t-Distributed Stochastic Neighbor Embedding)**

A manifold learning technique for visualization and non-linear dimensionality reduction. Like PCA it reduces dimensionality, but it aims to preserve local neighborhoods rather than global variance, which often makes clusters in complex data easier to see.
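
Because scikit-learn's `TSNE` cannot project unseen rows, one alternative that keeps the train and test sets in the same reduced space is PCA, fitted on the training data only. This is a different technique from t-SNE (linear rather than manifold-based), shown here as a sketch on random toy data; the array names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train_toy = rng.normal(size=(100, 3))
X_test_toy = rng.normal(size=(25, 3))

# Learn the projection from the training rows only.
pca = PCA(n_components=2, random_state=42)
Z_train = pca.fit_transform(X_train_toy)

# Reuse the same projection for the test rows, so both live in one space.
Z_test = pca.transform(X_test_toy)
print(Z_train.shape, Z_test.shape)
```

A classifier trained on `Z_train` can then be evaluated on `Z_test` without the coordinate-system mismatch that two independent t-SNE runs introduce.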

In [None]:
# Logic: Projects high-dimensional data into 2D or 3D while preserving local structure
# Note: sklearn's TSNE has no 'transform' method for new data, so fit_transform
# is applied separately to the train and test sets here. The two embeddings are
# therefore NOT guaranteed to share a coordinate system, and the test metrics
# below should be read as a demonstration rather than a reliable evaluation.

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_train_embedded = tsne.fit_transform(x_train)

# For prediction on x_test, we re-run fit_transform on x_test.
# This creates an independent embedding for the test set.
# A more robust approach for production would involve learning a mapping
# from the original feature space to the embedded space.
tsne_test = TSNE(n_components=2, perplexity=30, random_state=42)
X_test_embedded = tsne_test.fit_transform(x_test)

# Train Logistic Regression on the 2D embedded training coordinates
model = LogisticRegression(random_state=42).fit(X_train_embedded, y_train)

# Predict on the 2D embedded test coordinates
y_pred = model.predict(X_test_embedded)
y_pred_proba = model.predict_proba(X_test_embedded)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
train_score_tsne_lr = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_tsne_lr)
print("Accuracy percent of t-SNE + LR : ", round(train_score_tsne_lr*100, 1), "%")
print("Model score (on embedded train data): ", model.score(X_train_embedded, y_train))

# Classification report and confusion matrix
print("\nt-SNE + Logistic Regression Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
tsne_lr_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nt-SNE + Logistic Regression Accuracy@k:", tsne_lr_accuracy_at_k)

## **Tree-Based Ensembles**

### **8. Decision Trees (Cost-Sensitive)**

In [None]:
# Logic: Non-linear splits focusing on minority class weight
model = DecisionTreeClassifier(class_weight='balanced', random_state=42).fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)
y_pred_proba = model.predict_proba(x_test)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
train_score_dt = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_dt)
print("Accuracy percent of Decision Tree : ", round(train_score_dt*100, 1), "%")
print("Model score (on train data): ", model.score(x_train, y_train))

# Classification report and confusion matrix
print("\nDecision Tree Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
dt_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nDecision Tree Accuracy@k:", dt_accuracy_at_k)

### **9. Random Forest (Balanced)**

In [None]:
# Logic: Bagging with internal under-sampling
rand_model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
rand_model.fit(x_train, y_train)

# Predict
pred2 = rand_model.predict(x_test)
y_pred_proba = rand_model.predict_proba(x_test)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
train_score_rand = metrics.accuracy_score(y_test, pred2)
print('Accuracy score : ', train_score_rand)
print("Accuracy percent of Balanced Random Forest : ", round(train_score_rand*100, 1), "%")
print("Model score (on resampled train data): ", rand_model.score(x_train, y_train))

# Classification report and confusion matrix
print("\nBalanced Random Forest Classification Report:\n", metrics.classification_report(y_test, pred2))
cm = metrics.confusion_matrix(y_test, pred2)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
rand_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nBalanced Random Forest Accuracy@k:", rand_accuracy_at_k)


### **10. Extra Trees (Extremely Randomized Trees)**

Similar to Random Forest, but instead of searching for the best cut point on each candidate feature, it draws a cut point at random per feature and keeps the best of these random splits, which can further reduce variance.
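
The difference in how split thresholds are chosen can be illustrated on a single feature. This is a simplified sketch of the idea, not scikit-learn's internal code; the feature values are invented.

```python
import numpy as np

rng = np.random.default_rng(42)
feature_values = np.array([2.0, 3.5, 5.0, 7.5, 9.0])

# Extra-Trees style: draw one cut point uniformly between the feature's min and max.
random_threshold = rng.uniform(feature_values.min(), feature_values.max())

# Random-Forest style: evaluate candidate cut points between consecutive sorted
# values and keep the one with the best impurity reduction (scoring not shown).
sorted_vals = np.sort(feature_values)
candidate_thresholds = (sorted_vals[:-1] + sorted_vals[1:]) / 2

print("random threshold:", random_threshold)
print("candidate thresholds:", candidate_thresholds.tolist())
```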

In [None]:
# Logic: Randomized split points to reduce over-fitting on imbalanced noise
model = ExtraTreesClassifier(n_estimators=100, class_weight='balanced', random_state=42)
model.fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)
y_pred_proba = model.predict_proba(x_test)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
train_score_et = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_et)
print("Accuracy percent of Extra Trees : ", round(train_score_et*100, 1), "%")
print("Model score (on resampled train data): ", model.score(x_train, y_train))

# Classification report and confusion matrix
print("\nExtra Trees Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
et_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nExtra Trees Accuracy@k:", et_accuracy_at_k)

### **11. AdaBoost (Adaptive Boosting)**

In [None]:
# Logic: Weights misclassified instances more heavily
X_res_adasyn, y_res_adasyn = ADASYN(random_state=42).fit_resample(x_train, y_train)
model = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X_res_adasyn, y_res_adasyn)

# Predict
y_pred = model.predict(x_test)
y_pred_proba = model.predict_proba(x_test)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
train_score_ada = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_ada)
print("Accuracy percent of AdaBoost : ", round(train_score_ada*100, 1), "%")
print("Model score (on resampled train data): ", model.score(X_res_adasyn, y_res_adasyn))

# Classification report and confusion matrix
print("\nAdaBoost Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
ada_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nAdaBoost Accuracy@k:", ada_accuracy_at_k)

### **12. Gradient Boosting (HistGradientBoosting)**

Modern gradient boosting that handles missing values and large data well.

In [None]:
# Logic: Sequential boosting focusing on previous errors
model = HistGradientBoostingClassifier(random_state=42)
model.fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)
y_pred_proba = model.predict_proba(x_test)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
train_score_gb = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_gb)
print("Accuracy percent of Gradient Boosting : ", round(train_score_gb*100, 1), "%")
print("Model score (on train data): ", model.score(x_train, y_train))

# Classification report and confusion matrix
print("\nGradient Boosting Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
gb_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nGradient Boosting Accuracy@k:", gb_accuracy_at_k)


### **13. XGBoost (Extreme Gradient Boosting)**

Highly optimized distributed gradient boosting library.


In [None]:
# Logic: Optimized Gradient Boosting with scale_pos_weight for imbalance
# Calculate scale_pos_weight based on the current y_train distribution
neg_count = np.sum(y_train == 0)
pos_count = np.sum(y_train == 1)
scale_pos_weight_value = neg_count / pos_count if pos_count > 0 else 1

model = XGBClassifier(scale_pos_weight=scale_pos_weight_value, random_state=42)
model.fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)
y_pred_proba = model.predict_proba(x_test)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
train_score_xgb = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_xgb)
print("Accuracy percent of XGBoost : ", round(train_score_xgb*100, 1), "%")
print("Model score (on train data): ", model.score(x_train, y_train))

# Classification report and confusion matrix
print("\nXGBoost Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
xgb_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nXGBoost Accuracy@k:", xgb_accuracy_at_k)
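
One point worth checking in the XGBoost cell: `scale_pos_weight` is computed from `y_train` after SMOTE has balanced it, so the ratio will typically come out as 1 and have no effect. The sketch below shows the same negative-to-positive ratio on invented label vectors; `neg_pos_ratio` is a hypothetical helper mirroring the cell's formula.

```python
import numpy as np

def neg_pos_ratio(y):
    """Negative-to-positive count ratio, as used for scale_pos_weight above."""
    y = np.asarray(y)
    neg = int(np.sum(y == 0))
    pos = int(np.sum(y == 1))
    return neg / pos if pos > 0 else 1.0

imbalanced = [0] * 90 + [1] * 10   # e.g. labels before resampling
balanced = [0] * 90 + [1] * 90     # e.g. labels after SMOTE

print(neg_pos_ratio(imbalanced))   # ratio for the imbalanced labels
print(neg_pos_ratio(balanced))     # ratio for the balanced labels
```

To let `scale_pos_weight` do the rebalancing instead of SMOTE, the ratio would have to be computed on the labels before resampling.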

### **14. LightGBM (Light Gradient Boosting Machine)**

A highly efficient framework that grows trees leaf-wise rather than level-wise, which often makes training faster than with level-wise (depth-wise) growth such as XGBoost's default.

In [None]:
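# Logic: Leaf-wise tree growth; is_unbalance=True reweights the classes automatically
# (GOSS sampling is available in LightGBM but is not enabled by these default settings)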
model = LGBMClassifier(is_unbalance=True, random_state=42)
model.fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)
y_pred_proba = model.predict_proba(x_test)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
train_score_lgbm = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_lgbm)
print("Accuracy percent of LightGBM : ", round(train_score_lgbm*100, 1), "%")
print("Model score (on train data): ", model.score(x_train, y_train))

# Classification report and confusion matrix
print("\nLightGBM Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
lgbm_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nLightGBM Accuracy@k:", lgbm_accuracy_at_k)

### **15. CatBoost (Categorical Boosting)**

Developed by Yandex, this algorithm handles categorical features natively. It uses ordered boosting to reduce prediction shift and builds symmetric (oblivious) trees, which act as a regularizer and speed up prediction.



In [None]:
# Logic: Ordered boosting to solve the gradient bias problem
model = CatBoostClassifier(auto_class_weights='Balanced', verbose=0, random_state=42)
model.fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)
y_pred_proba = model.predict_proba(x_test)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
train_score_cat = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_cat)
print("Accuracy percent of CatBoost : ", round(train_score_cat*100, 1), "%")
print("Model score (on train data): ", model.score(x_train, y_train))

# Classification report and confusion matrix
print("\nCatBoost Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
cat_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nCatBoost Accuracy@k:", cat_accuracy_at_k)

## **Neural Networks & Deep Learning**

### **16. Artificial Neural Networks (ANN - MLP)**

A Multi-layer Perceptron (MLP): a feed-forward network of fully connected layers.



In [None]:
# Logic: Layered neurons with non-linear activation functions
model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42)
model.fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)
y_pred_proba = model.predict_proba(x_test)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
train_score_ann = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_ann)
print("Accuracy percent of ANN (MLP) : ", round(train_score_ann*100, 1), "%")
print("Model score (on train data): ", model.score(x_train, y_train))

# Classification report and confusion matrix
print("\nANN (MLP) Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
ann_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nANN (MLP) Accuracy@k:", ann_accuracy_at_k)

### **17. Convolutional Neural Networks (CNN)**

`Conv1D` expects 3-D input of shape (samples, timesteps, features), so the tabular data must be reshaped first.

In [None]:
# Reshape input data for Conv1D: (n_samples, n_features) -> (n_samples, timesteps, features)
# Here, we treat each feature as a timestep, and the 'features' dimension is 1.
n_samples_train, n_features_train = x_train.shape
n_samples_test, n_features_test = x_test.shape

x_train_reshaped = x_train.reshape(n_samples_train, n_features_train, 1)
x_test_reshaped = x_test.reshape(n_samples_test, n_features_test, 1)

# Define the Conv1D model
model = models.Sequential([
    layers.Conv1D(filters=32, kernel_size=2, activation='relu', input_shape=(n_features_train, 1)),
    layers.MaxPooling1D(pool_size=1),  # pool_size=1 leaves the sequence length unchanged (no downsampling)
    layers.Flatten(),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid') # Sigmoid for binary classification
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(x_train_reshaped, y_train, epochs=50, batch_size=32, verbose=0)

# Predict probabilities on test data
y_pred_proba = model.predict(x_test_reshaped)
# Convert probabilities to binary predictions using a threshold (e.g., 0.5)
# ravel() turns the (n_samples, 1) Keras output into a 1-D label vector
y_pred = (y_pred_proba > 0.5).astype(int).ravel()

# Accuracy on test data
train_score_cnn = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_cnn)
print("Accuracy percent of CNN : ", round(train_score_cnn*100, 1), "%")

# Evaluate the model on the reshaped test data
loss, accuracy = model.evaluate(x_test_reshaped, y_test, verbose=0)
print(f"Model Loss (on test data): {loss:.4f}")
print(f"Model Accuracy (on test data): {accuracy:.4f}")

# Classification report and confusion matrix
print("\nCNN Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
cnn_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba.flatten(), [0.10, 0.20, 0.30])
print("\nCNN Accuracy@k:", cnn_accuracy_at_k)


### **18. Recurrent Neural Networks (RNN/LSTM)**

Best for sequential data.

In [None]:
# Reshape input data for LSTM: (n_samples, n_features) -> (n_samples, timesteps, features)
# Here, we treat each feature as a timestep, and the 'features' dimension is 1.
n_samples_train, n_features_train = x_train.shape
n_samples_test, n_features_test = x_test.shape

x_train_reshaped = x_train.reshape(n_samples_train, n_features_train, 1)
x_test_reshaped = x_test.reshape(n_samples_test, n_features_test, 1)

# Define the LSTM model
model = models.Sequential([
    layers.LSTM(64, input_shape=(n_features_train, 1), activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid') # Sigmoid for binary classification
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(x_train_reshaped, y_train, epochs=50, batch_size=32, verbose=0)

# Predict probabilities on test data
y_pred_proba = model.predict(x_test_reshaped)
# Convert probabilities to binary predictions using a threshold (e.g., 0.5)
# ravel() turns the (n_samples, 1) Keras output into a 1-D label vector
y_pred = (y_pred_proba > 0.5).astype(int).ravel()

# Accuracy on test data
train_score_lstm = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_lstm)
print("Accuracy percent of RNN (LSTM) : ", round(train_score_lstm*100, 1), "%")

# Evaluate the model on the reshaped test data
loss, accuracy = model.evaluate(x_test_reshaped, y_test, verbose=0)
print(f"Model Loss (on test data): {loss:.4f}")
print(f"Model Accuracy (on test data): {accuracy:.4f}")

# Classification report and confusion matrix
print("\nRNN (LSTM) Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
lstm_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba.flatten(), [0.10, 0.20, 0.30])
print("\nRNN (LSTM) Accuracy@k:", lstm_accuracy_at_k)


### **19. Autoencoders (Unsupervised Feature Extraction)**

Using a simple ANN structure.

In [None]:
# Logic: Learns a compressed representation (encoding) of the input
# Define the autoencoder (MLPRegressor) to reconstruct its input
# The hidden_layer_sizes define the encoded feature dimension (e.g., 2 for 2D visualization)
autoencoder = MLPRegressor(hidden_layer_sizes=(2,), activation='relu', random_state=42, max_iter=500)
autoencoder.fit(x_train, x_train)

# Manually extract the activations of the first hidden layer for x_train and x_test
# These are the encoded features
def get_encoded_features(model, data):
    # The output of the first layer is data @ coefs_[0] + intercepts_[0]
    hidden_layer_output = data @ model.coefs_[0] + model.intercepts_[0]
    # Apply the ReLU activation function, as specified in the MLPRegressor
    return np.maximum(0, hidden_layer_output)

X_train_encoded = get_encoded_features(autoencoder, x_train)
X_test_encoded = get_encoded_features(autoencoder, x_test)

# Train a Logistic Regression model on the encoded features
model_lr_encoded = LogisticRegression(random_state=42)
model_lr_encoded.fit(X_train_encoded, y_train)

# Predict on the encoded test data
y_pred = model_lr_encoded.predict(X_test_encoded)
y_pred_proba = model_lr_encoded.predict_proba(X_test_encoded)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
train_score_ae_lr = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_ae_lr)
print("Accuracy percent of Autoencoder Features + LR : ", round(train_score_ae_lr*100, 1), "%")
print("Model score (on encoded train data): ", model_lr_encoded.score(X_train_encoded, y_train))

# Classification report and confusion matrix
print("\nAutoencoder Features + LR Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
ae_lr_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nAutoencoder Features + LR Accuracy@k:", ae_lr_accuracy_at_k)

## **Clustering & Unsupervised**

### **20. k-Means Clustering**

Unsupervised learning used here for feature engineering.

In [None]:
# Logic: Groups data into K clusters; distance to cluster center becomes a feature
kmeans = KMeans(n_clusters=2, random_state=42, n_init='auto')
X_train_clus = kmeans.fit_transform(x_train)
X_test_clus = kmeans.transform(x_test)

# Train Logistic Regression on clustered features
model = LogisticRegression(random_state=42).fit(X_train_clus, y_train)

# Predict
y_pred = model.predict(X_test_clus)
y_pred_proba = model.predict_proba(X_test_clus)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
train_score_kmeans_lr = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_kmeans_lr)
print("Accuracy percent of k-Means Features + LR : ", round(train_score_kmeans_lr*100, 1), "%")
print("Model score (on train data): ", model.score(X_train_clus, y_train))

# Classification report and confusion matrix
print("\nk-Means Features + LR Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
kmeans_lr_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nk-Means Features + LR Accuracy@k:", kmeans_lr_accuracy_at_k)

### **21. Hierarchical Clustering**


In [None]:
# Logic: Builds a hierarchy of clusters
cluster = AgglomerativeClustering(n_clusters=2)
train_cluster_labels = cluster.fit_predict(x_train)

# Create pseudo-labels for x_train based on mapped clusters
y_train_mapped = np.full_like(y_train, -1, dtype=int)

cluster_to_class_map = {}
for cluster_id in np.unique(train_cluster_labels):
    cluster_indices = np.where(train_cluster_labels == cluster_id)[0]
    if len(cluster_indices) > 0:
        true_labels_in_cluster = y_train.iloc[cluster_indices]
        if not true_labels_in_cluster.empty:
            majority_class = np.bincount(true_labels_in_cluster).argmax()
            cluster_to_class_map[cluster_id] = majority_class
            y_train_mapped[cluster_indices] = majority_class

# Train a KNeighborsClassifier as a proxy model
proxy_model = KNeighborsClassifier(n_neighbors=5)
proxy_model.fit(x_train, y_train_mapped)

# Predict on x_test using the proxy classifier
y_pred = proxy_model.predict(x_test)
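# predict_proba has one column per class seen in y_train_mapped; use the class-1 column if it exists
proba = proxy_model.predict_proba(x_test)
pos_idx = np.flatnonzero(proxy_model.classes_ == 1)
y_pred_proba = proba[:, pos_idx[0]] if pos_idx.size else np.zeros(len(x_test))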

# Accuracy on test data
train_score_hc_proxy = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_hc_proxy)
print("Accuracy percent of Hierarchical Clustering + kNN Proxy : ", round(train_score_hc_proxy*100, 1), "%")
print("Proxy Model score (on mapped train data): ", proxy_model.score(x_train, y_train_mapped))

# Classification report and confusion matrix
print("\nHierarchical Clustering + kNN Proxy Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
hc_proxy_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nHierarchical Clustering + kNN Proxy Accuracy@k:", hc_proxy_accuracy_at_k)

### **22. DBSCAN**

Density-based spatial clustering of applications with noise.


In [None]:
# Logic: Groups dense areas; great for finding outliers (noise)
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(x_train)

# Get cluster labels for training data
train_cluster_labels = dbscan.labels_

# Create pseudo-labels for x_train based on mapped clusters
# Initialize with a placeholder for unmapped/noise points
y_train_mapped = np.full_like(y_train, -1, dtype=int)

# Majority class of entire y_train, to use for noise points if needed
overall_majority_class = np.bincount(y_train).argmax()

cluster_to_class_map = {}
for cluster_id in np.unique(train_cluster_labels):
    if cluster_id == -1:  # Noise points, will be handled later if unmapped
        continue

    cluster_indices = np.where(train_cluster_labels == cluster_id)[0]
    if len(cluster_indices) > 0:
        true_labels_in_cluster = y_train.iloc[cluster_indices]
        if not true_labels_in_cluster.empty:
            majority_class = np.bincount(true_labels_in_cluster).argmax()
            cluster_to_class_map[cluster_id] = majority_class
            y_train_mapped[cluster_indices] = majority_class  # Assign mapped label

# For noise points or unmapped clusters (still -1), assign the overall majority class
unmapped_indices = np.where(y_train_mapped == -1)[0]
if len(unmapped_indices) > 0:
    y_train_mapped[unmapped_indices] = overall_majority_class

# Train a KNeighborsClassifier as a proxy model
proxy_model = KNeighborsClassifier(n_neighbors=5)
proxy_model.fit(x_train, y_train_mapped)

# Predict on x_test using the proxy classifier
y_pred = proxy_model.predict(x_test)
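# predict_proba has one column per class seen in y_train_mapped; use the class-1 column if it exists
proba = proxy_model.predict_proba(x_test)
pos_idx = np.flatnonzero(proxy_model.classes_ == 1)
y_pred_proba = proba[:, pos_idx[0]] if pos_idx.size else np.zeros(len(x_test))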

# Accuracy on test data
train_score_dbscan_proxy = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_dbscan_proxy)
print("Accuracy percent of DBSCAN + kNN Proxy : ", round(train_score_dbscan_proxy*100, 1), "%")
print("Proxy Model score (on mapped train data): ", proxy_model.score(x_train, y_train_mapped))

# Classification report and confusion matrix
print("\nDBSCAN + kNN Proxy Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
dbscan_proxy_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nDBSCAN + kNN Proxy Accuracy@k:", dbscan_proxy_accuracy_at_k)

### **23. Principal Component Analysis (PCA)**

Used for dimensionality reduction.



In [None]:
# Logic: Projects data onto principal components to reduce noise
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(x_train)
X_test_pca = pca.transform(x_test)

# Train Logistic Regression on PCA features
model = LogisticRegression().fit(X_train_pca, y_train)

# Predict
y_pred = model.predict(X_test_pca)
y_pred_proba = model.predict_proba(X_test_pca)[:, 1] # Get probabilities for the positive class

# Accuracy on test data
train_score_pca_lr = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_pca_lr)
print("Accuracy percent of PCA + LR : ", round(train_score_pca_lr*100, 1), "%")
print("Model score (on train data): ", model.score(X_train_pca, y_train))

# Classification report and confusion matrix
print("\nPCA + Logistic Regression Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
pca_lr_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba, [0.10, 0.20, 0.30])
print("\nPCA + Logistic Regression Accuracy@k:", pca_lr_accuracy_at_k)

### **24. Isolation Forest (Anomaly Detection)**


In [None]:
# Logic: Isolates anomalies (minority class) by randomly partitioning features
# 'contamination' should match your minority class ratio (e.g., 0.05 for 5%)
# Here, we assume the minority class (1 for BUY) is the 'anomaly' detected by Isolation Forest.
iso_forest = IsolationForest(contamination=0.05, random_state=42)

# Fit on training data (unsupervised)
iso_forest.fit(x_train)

# Predict on test data
y_pred_iso = iso_forest.predict(x_test)

# Map Isolation Forest output (-1 for outlier, 1 for inlier) to binary target (1 for BUY, 0 for SELL)
# Assuming -1 (outlier) corresponds to the 'BUY' (1) class, and 1 (inlier) to 'SELL' (0) class.
y_pred = np.where(y_pred_iso == -1, 1, 0)

# Get anomaly scores (lower score = more anomalous = more likely to be positive class)
# Negate the scores so that higher values correspond to the positive class for Accuracy@k
y_pred_proba_iso = -iso_forest.decision_function(x_test)

# Accuracy on test data
train_score_iso = metrics.accuracy_score(y_test, y_pred)
print('Accuracy score : ', train_score_iso)
print("Accuracy percent of Isolation Forest : ", round(train_score_iso*100, 1), "%")

# Isolation Forest does not have a conventional 'model.score' on train data for classification
# We will skip printing model.score(x_train, y_train) for this model due to its unsupervised nature.

# Classification report and confusion matrix
print("\nIsolation Forest Classification Report:\n", metrics.classification_report(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

# TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print("TN={0}, FP={1}, FN={2}, TP={3}".format(TN, FP, FN, TP))

# Confusion matrix plot
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

# Calculate Accuracy@k
iso_accuracy_at_k = calculate_accuracy_at_k(y_test, y_pred_proba_iso, [0.10, 0.20, 0.30])
print("\nIsolation Forest Accuracy@k:", iso_accuracy_at_k)

# **Table Tabulation**

In [None]:
def calculate_accuracy_at_k(y_true, y_pred_proba, k_percentages):
    """
    Calculates Accuracy@k (precision at top k%) for given true labels and predicted probabilities.

    Args:
        y_true (array-like): The actual labels (0 or 1).
        y_pred_proba (array-like): The predicted probabilities for the positive class.
        k_percentages (list): A list of float values representing the top k% thresholds (e.g., [0.10, 0.20, 0.30]).

    Returns:
        dict: A dictionary where keys are 'Accuracy@k%' (e.g., 'Accuracy@10%') and values are the calculated precision percentages.
    """
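    # 1. Combine y_true and y_pred_proba into a DataFrame
    #    (converted to plain 1-D arrays so a pandas index on either input cannot misalign rows)
    df_combined = pd.DataFrame({'true_label': np.asarray(y_true).ravel(), 'predicted_proba': np.asarray(y_pred_proba).ravel()})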

    # 2. Sort DataFrame by 'predicted_proba' in descending order
    df_combined = df_combined.sort_values(by='predicted_proba', ascending=False).reset_index(drop=True)

    # 3. Initialize dictionary to store results
    accuracy_at_k_results = {}

    # 4. Iterate through each k_percentage
    for k_percentage in k_percentages:
        # a. Calculate the number of samples for the current top k% threshold
        k_count = int(len(y_true) * k_percentage)

        # Handle the case where k_count is zero (e.g., k_percentage is very small or y_true is empty)
        if k_count == 0:
            accuracy_at_k_results[f'Accuracy@{int(k_percentage * 100)}%'] = 0.0
            continue

        # b. Select the top k_count samples from the sorted DataFrame
        top_k_df = df_combined.head(k_count)

        # c. Within these top k_count samples, count how many have a 'true_label' equal to 1
        correct_predictions = top_k_df['true_label'].sum()

        # d. Calculate the precision at k
        precision_at_k = (correct_predictions / k_count) * 100

        # e. Store this calculated precision (percentage) in the results dictionary
        accuracy_at_k_results[f'Accuracy@{int(k_percentage * 100)}%'] = precision_at_k

    return accuracy_at_k_results

print("The 'calculate_accuracy_at_k' function has been defined.")

In [None]:
from tabulate import tabulate

data = [
    ["Linear Regression", round(train_score_lr * 100, 2), round(lr_accuracy_at_k['Accuracy@10%'], 2), round(lr_accuracy_at_k['Accuracy@20%'], 2), round(lr_accuracy_at_k['Accuracy@30%'], 2)],
    ["Logistic Regression", round(train_score_lr * 100, 2), round(logistic_accuracy_at_k['Accuracy@10%'], 2), round(logistic_accuracy_at_k['Accuracy@20%'], 2), round(logistic_accuracy_at_k['Accuracy@30%'], 2)],
    ["Naïve Bayes", round(train_score_nb * 100, 2), round(naive_bayes_accuracy_at_k['Accuracy@10%'], 2), round(naive_bayes_accuracy_at_k['Accuracy@20%'], 2), round(naive_bayes_accuracy_at_k['Accuracy@30%'], 2)],
    ["LDA", round(train_score_lda * 100, 2), round(lda_accuracy_at_k['Accuracy@10%'], 2), round(lda_accuracy_at_k['Accuracy@20%'], 2), round(lda_accuracy_at_k['Accuracy@30%'], 2)],
    ["SVM", round(train_score_svm * 100, 2), round(svm_accuracy_at_k['Accuracy@10%'], 2), round(svm_accuracy_at_k['Accuracy@20%'], 2), round(svm_accuracy_at_k['Accuracy@30%'], 2)],
    ["k-Nearest Neighbors", round(train_score_knn * 100, 2), round(knn_accuracy_at_k['Accuracy@10%'], 2), round(knn_accuracy_at_k['Accuracy@20%'], 2), round(knn_accuracy_at_k['Accuracy@30%'], 2)],
    ["t-SNE + Logistic Regression", round(train_score_tsne_lr * 100, 2), round(tsne_lr_accuracy_at_k['Accuracy@10%'], 2), round(tsne_lr_accuracy_at_k['Accuracy@20%'], 2), round(tsne_lr_accuracy_at_k['Accuracy@30%'], 2)],
    ["Decision Trees (Cost-Sensitive)", round(train_score_dt * 100, 2), round(dt_accuracy_at_k['Accuracy@10%'], 2), round(dt_accuracy_at_k['Accuracy@20%'], 2), round(dt_accuracy_at_k['Accuracy@30%'], 2)],
    ["Balanced Random Forest", round(train_score_rand * 100, 2), round(rand_accuracy_at_k['Accuracy@10%'], 2), round(rand_accuracy_at_k['Accuracy@20%'], 2), round(rand_accuracy_at_k['Accuracy@30%'], 2)],
    ["Extra Trees", round(train_score_et * 100, 2), round(et_accuracy_at_k['Accuracy@10%'], 2), round(et_accuracy_at_k['Accuracy@20%'], 2), round(et_accuracy_at_k['Accuracy@30%'], 2)],
    ["AdaBoost", round(train_score_ada * 100, 2), round(ada_accuracy_at_k['Accuracy@10%'], 2), round(ada_accuracy_at_k['Accuracy@20%'], 2), round(ada_accuracy_at_k['Accuracy@30%'], 2)],
    ["Gradient Boosting", round(train_score_gb * 100, 2), round(gb_accuracy_at_k['Accuracy@10%'], 2), round(gb_accuracy_at_k['Accuracy@20%'], 2), round(gb_accuracy_at_k['Accuracy@30%'], 2)],
    ["XGBoost", round(train_score_xgb * 100, 2), round(xgb_accuracy_at_k['Accuracy@10%'], 2), round(xgb_accuracy_at_k['Accuracy@20%'], 2), round(xgb_accuracy_at_k['Accuracy@30%'], 2)],
    ["LightGBM", round(train_score_lgbm * 100, 2), round(lgbm_accuracy_at_k['Accuracy@10%'], 2), round(lgbm_accuracy_at_k['Accuracy@20%'], 2), round(lgbm_accuracy_at_k['Accuracy@30%'], 2)],
    ["CatBoost", round(train_score_cat * 100, 2), round(cat_accuracy_at_k['Accuracy@10%'], 2), round(cat_accuracy_at_k['Accuracy@20%'], 2), round(cat_accuracy_at_k['Accuracy@30%'], 2)],
    ["ANN (MLP)", round(train_score_ann * 100, 2), round(ann_accuracy_at_k['Accuracy@10%'], 2), round(ann_accuracy_at_k['Accuracy@20%'], 2), round(ann_accuracy_at_k['Accuracy@30%'], 2)],
    ["CNN", round(train_score_cnn * 100, 2), round(cnn_accuracy_at_k['Accuracy@10%'], 2), round(cnn_accuracy_at_k['Accuracy@20%'], 2), round(cnn_accuracy_at_k['Accuracy@30%'], 2)],
    ["RNN (LSTM)", round(train_score_lstm * 100, 2), round(lstm_accuracy_at_k['Accuracy@10%'], 2), round(lstm_accuracy_at_k['Accuracy@20%'], 2), round(lstm_accuracy_at_k['Accuracy@30%'], 2)],
    ["Autoencoder Features + LR", round(train_score_ae_lr * 100, 2), round(ae_lr_accuracy_at_k['Accuracy@10%'], 2), round(ae_lr_accuracy_at_k['Accuracy@20%'], 2), round(ae_lr_accuracy_at_k['Accuracy@30%'], 2)],
    ["k-Means Features + LR", round(train_score_kmeans_lr * 100, 2), round(kmeans_lr_accuracy_at_k['Accuracy@10%'], 2), round(kmeans_lr_accuracy_at_k['Accuracy@20%'], 2), round(kmeans_lr_accuracy_at_k['Accuracy@30%'], 2)],
    ["Hierarchical Clustering + kNN Proxy", round(train_score_hc_proxy * 100, 2), round(hc_proxy_accuracy_at_k['Accuracy@10%'], 2), round(hc_proxy_accuracy_at_k['Accuracy@20%'], 2), round(hc_proxy_accuracy_at_k['Accuracy@30%'], 2)],
    ["DBSCAN + kNN Proxy", round(train_score_dbscan_proxy * 100, 2), round(dbscan_proxy_accuracy_at_k['Accuracy@10%'], 2), round(dbscan_proxy_accuracy_at_k['Accuracy@20%'], 2), round(dbscan_proxy_accuracy_at_k['Accuracy@30%'], 2)],
    ["PCA + LR", round(train_score_pca_lr * 100, 2), round(pca_lr_accuracy_at_k['Accuracy@10%'], 2), round(pca_lr_accuracy_at_k['Accuracy@20%'], 2), round(pca_lr_accuracy_at_k['Accuracy@30%'], 2)],
    ["Isolation Forest", round(train_score_iso * 100, 2), round(iso_accuracy_at_k['Accuracy@10%'], 2), round(iso_accuracy_at_k['Accuracy@20%'], 2), round(iso_accuracy_at_k['Accuracy@30%'], 2)]
]

columns = ["Algorithms", "Accuracy Percent", "Accuracy Percent @10%", "Accuracy Percent @20%", "Accuracy Percent @30%"]

print(tabulate(data, headers=columns, tablefmt="fancy_grid"))

## **Overall Accuracy Percent of Machine Learning Algorithms**

In [None]:
# Convert the data list to a DataFrame
df_results = pd.DataFrame(data, columns=columns)

# Plot Overall Accuracy Percent
plt.figure(figsize=(14, 8))
sns.barplot(x='Accuracy Percent', y='Algorithms', hue='Algorithms', data=df_results.sort_values(by='Accuracy Percent', ascending=False), palette='viridis', legend=False)
plt.title('Overall Accuracy Percent of Machine Learning Algorithms')
plt.xlabel('Accuracy Percent (%)')
plt.ylabel('Algorithms')
plt.tight_layout()
plt.show()

## **Accuracy of Machine Learning Algorithms @ 10%, 20%, 30%**

In [None]:
# Convert the data list to a DataFrame if not already done
df_results = pd.DataFrame(data, columns=columns)

# Melt the DataFrame to prepare for grouped bar plot of Accuracy@k
df_accuracy_at_k = df_results[['Algorithms', 'Accuracy Percent @10%', 'Accuracy Percent @20%', 'Accuracy Percent @30%']].melt(
    id_vars='Algorithms',
    var_name='Accuracy Type',
    value_name='Accuracy Value'
)

# Plot Accuracy@k metrics
plt.figure(figsize=(16, 9))
sns.barplot(x='Accuracy Value', y='Algorithms', hue='Accuracy Type', data=df_accuracy_at_k, palette='magma')
plt.title('Accuracy@k of Machine Learning Algorithms (10%, 20%, 30%)')
plt.xlabel('Accuracy Percent (%)')
plt.ylabel('Algorithms')
plt.legend(title='Accuracy@k Type', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

## Explanation: Algorithm Selection Rationale

### Rationale for Algorithm Selection

To conduct a comprehensive comparative analysis for stock price forecasting, a diverse set of machine learning algorithms from various paradigms was selected. This approach allows for evaluating different computational strategies in capturing the complex dynamics of financial time series data.

#### 1. Linear, Probabilistic, Kernel & Instance-Based Models (e.g., Linear Regression, Logistic Regression, Naïve Bayes, LDA, SVM, k-NN):

*   **Characteristics**: These models are comparatively simple and cheap to train, so they serve as baselines. Linear Regression, Logistic Regression, and LDA fit linear decision functions with interpretable coefficients; Naïve Bayes is probabilistic; SVM and k-NN are not linear in general, since SVMs can use kernels and k-NN is instance-based.
*   **Relevance for Stock Forecasting**: Although stock markets behave non-linearly, linear models can capture broad trends and relationships. Logistic Regression and Naïve Bayes can classify price movements (e.g., up/down), while LDA seeks to maximize class separability. SVMs can handle non-linearity through kernels, and k-NN classifies by local proximity, which is useful for identifying similar market conditions.

#### 2. Tree-Based & Ensemble Learning Models (e.g., Decision Trees, Random Forest, Extra Trees, AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost):

*   **Characteristics**: These models capture non-linear relationships and interactions between features. Ensemble methods combine many individual trees to improve predictive performance and robustness: bagging-style ensembles (Random Forest, Extra Trees) mainly reduce variance and overfitting, while boosting mainly reduces bias. They tend to work well on complex, high-dimensional tabular data.
*   **Relevance for Stock Forecasting**: Stock prices are influenced by numerous interdependent factors, so the relationship between features and price movements is highly non-linear. Tree-based ensembles suit this complexity. Boosting algorithms (AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost) fit each new tree to the errors of the current ensemble, which helps them pick up weak signals that a single model would miss.

#### 3. Neural Networks & Deep Learning Models (e.g., ANN/MLP, CNN, RNN/LSTM, Autoencoders):

*   **Characteristics**: Deep learning models, with their multi-layered architectures, can automatically learn hierarchical features and complex abstractions directly from raw data. RNNs and LSTMs are particularly adept at processing sequential data, making them suitable for time series. Autoencoders can learn efficient data encodings.
*   **Relevance for Stock Forecasting**: Stock prices are inherently sequential and exhibit temporal dependencies. RNNs and LSTMs are designed to handle such sequences, making them powerful tools for predicting future price movements based on historical patterns. ANNs/MLPs offer general non-linear mapping capabilities, while CNNs can identify local patterns within data windows. Autoencoders can be used for dimensionality reduction and feature learning, potentially extracting more robust signals from noisy financial data.
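
The CNN and LSTM cells in this notebook reshape each feature vector into a pseudo-sequence. An alternative that matches the "data windows" idea more closely is to build true time windows from a price series. The sketch below is illustrative only: the `close` array, the `make_windows` helper, and the window length of 5 are hypothetical and do not come from the notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(series, window):
    """Return (X, y): X[i] holds `window` consecutive values and y[i] is 1 if
    the value right after the window is higher than the window's last value."""
    series = np.asarray(series, dtype=float)
    # Every window of length `window` that still has a following value
    windows = sliding_window_view(series[:-1], window)  # shape (n - window, window)
    next_values = series[window:]                       # value right after each window
    y = (next_values > windows[:, -1]).astype(int)
    # Add a trailing axis so the shape is (samples, timesteps, features)
    X = windows[..., np.newaxis]
    return X, y

# Hypothetical closing prices (not real market data)
close = np.array([10.0, 10.5, 10.2, 10.8, 11.0, 10.9, 11.4, 11.1])
X, y = make_windows(close, window=5)
print(X.shape, y)
```

Each row of `X` can then be fed to a `Conv1D` or `LSTM` layer with `input_shape=(window, 1)`, so the time axis carries consecutive observations instead of unrelated features.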

#### 4. Clustering & Unsupervised Models (e.g., k-Means Clustering, Hierarchical Clustering, DBSCAN, PCA, Isolation Forest):

*   **Characteristics**: These models operate without labeled data, identifying inherent structures, groups, or anomalies within the dataset. They are valuable for exploratory data analysis, feature engineering, and anomaly detection.
*   **Relevance for Stock Forecasting**: While primarily unsupervised, these techniques can enhance supervised forecasting models. Clustering algorithms (k-Means, Hierarchical, DBSCAN) can identify different market regimes or stock behavior patterns, which can then be used as features. Dimensionality reduction techniques like PCA can reduce noise and multicollinearity, improving model stability. Isolation Forest, an anomaly detection algorithm, can be used to identify unusual market events or outliers, which might correspond to significant price shifts or unusual trading activity.

## Explanation: Hyperparameter Tuning Strategy

For this comparative analysis, most models were trained with their library defaults (scikit-learn, TensorFlow/Keras, XGBoost, LightGBM, CatBoost), apart from a few hand-set values such as class-balancing options (`is_unbalance=True`, `auto_class_weights='Balanced'`), network sizes (`hidden_layer_sizes=(64, 32)`), and fixed random seeds. No systematic hyperparameter search was performed, so the reported scores compare near-default configurations and should not be read as the best performance each model can achieve.

**Importance of Hyperparameter Tuning:**

Hyperparameter tuning is a critical step in the machine learning workflow, especially for complex tasks like stock price forecasting where minor improvements in prediction accuracy can have significant financial implications. Manually setting hyperparameters or relying on defaults can lead to suboptimal model performance, including issues such as overfitting, underfitting, or simply not capturing the underlying patterns in the data effectively.

**Common Hyperparameter Tuning Techniques:**

Techniques such as Grid Search, Random Search, Bayesian Optimization, and evolutionary algorithms are commonly employed to systematically explore the hyperparameter space and identify the combination that yields the best performance on a validation set. By tuning parameters like `n_estimators`, `learning_rate`, `max_depth`, `C`, `gamma`, `hidden_layer_sizes`, etc., models can be significantly optimized for the specific dataset and problem at hand.

**Future Work:**

To further enhance the predictive accuracy and robustness of the models for stock price forecasting, future work should explicitly incorporate comprehensive hyperparameter tuning for each model. This would involve:

1.  **Defining a search space** for each model's key hyperparameters.
2.  **Employing a cross-validation strategy** to ensure robust evaluation of different hyperparameter combinations.
3.  **Using appropriate evaluation metrics** (e.g., RMSE, MAE, MAPE, R², Accuracy@k for classification) to guide the optimization process.

Implementing such tuning would likely lead to improved model performance, making them more suitable for real-world financial applications.
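
The three steps listed above can be sketched as follows. This is an illustrative example under stated assumptions, not the tuning procedure used in this study: the data are synthetic, Extra Trees is used only as a representative model, and the search space values are hypothetical placeholders. It combines scikit-learn's `RandomizedSearchCV` with `TimeSeriesSplit` so that every validation fold lies after its training fold in time.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic, time-ordered stand-in data (assumption)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Step 1: define a search space (values are illustrative placeholders)
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 3, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

# Step 2: cross-validate with folds that respect temporal order
# Step 3: choose an evaluation metric to guide the search
search = RandomizedSearchCV(
    estimator=ExtraTreesClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=TimeSeriesSplit(n_splits=4),
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", round(search.best_score_, 4))
```

Random search is shown because it scales more gently than an exhaustive grid. `GridSearchCV` accepts the same `cv` and `scoring` arguments if a full grid is preferred.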

## Explanation: Training-Validation-Testing Split and Cross-Validation

### Data Splitting Strategy

For this analysis, a `train_test_split` with `shuffle=False` was employed to divide the dataset into training and testing sets. This approach is crucial when dealing with time-series data, as maintaining the temporal order of observations is essential. Shuffling would disrupt the chronological sequence, leading to data leakage where future information could inadvertently influence the training of models, thereby yielding misleading performance metrics.
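
The effect of `shuffle=False` can be checked with a small sketch. The toy DataFrame, its column names, and the 80/20 ratio below are assumptions for illustration only. The point is that every training date precedes every test date.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy daily series with a date index (hypothetical data)
dates = pd.date_range("2020-01-01", periods=10, freq="D")
X = pd.DataFrame({"close": np.arange(10, dtype=float)}, index=dates)
y = pd.Series([0, 1, 0, 1, 1, 0, 1, 0, 1, 1], index=dates, name="target")

# shuffle=False keeps the rows in chronological order
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print("Train period:", X_train.index.min().date(), "to", X_train.index.max().date())
print("Test period: ", X_test.index.min().date(), "to", X_test.index.max().date())
```

With `shuffle=False` the split is equivalent to slicing the first 80% of rows for training and the remaining 20% for testing.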

### Absence of Explicit Cross-Validation

Standard k-fold cross-validation, where data is randomly partitioned into subsets, is generally not suitable for time-series forecasting. The primary reason is that it violates the temporal dependency inherent in such data. Randomly splitting the data would lead to training on future observations and testing on past ones, which does not reflect real-world forecasting scenarios.

Explicit cross-validation was not performed in this notebook, so the models' robustness and generalization to different time periods were assessed less thoroughly than time-series-specific cross-validation would allow. Such techniques preserve the temporal order: **rolling-origin cross-validation** (also known as forward-chaining cross-validation) iteratively trains on a growing past segment of data and tests on the subsequent future segment, while **blocked cross-validation** evaluates on contiguous time blocks rather than randomly drawn observations.

For more rigorous evaluation of time-series models, implementing one of these time-series specific cross-validation methods would be recommended to ensure that the model's performance is consistently robust across various future periods.
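
As a minimal sketch of rolling-origin evaluation, scikit-learn's `TimeSeriesSplit` yields an expanding training window followed by the next block of observations. The twelve-row array below is a placeholder for time-ordered data and is not taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered observations (placeholder for real features)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    folds.append((train_idx, test_idx))
    print(
        f"Fold {fold}: train rows 0..{train_idx[-1]}, "
        f"test rows {test_idx[0]}..{test_idx[-1]}"
    )
```

Each fold trains only on rows that precede its test rows, which mirrors how a forecasting model would be used in practice. Averaging a metric over the folds gives a view of performance across several future periods instead of a single hold-out window.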

## **Results and Discussion**




## **Tabular and Graphical Presentation of Results**

The previous output provides a comprehensive table detailing the performance of each machine learning model across various metrics, including overall Accuracy Percent and Accuracy@k at different thresholds (10%, 20%, and 30%). This tabular format offers a direct quantitative comparison of how each model performs in identifying positive cases within the top predicted percentages.

To further enhance our understanding and visualize these comparisons more effectively, we will now proceed to create graphical representations of these results. The graphical comparison will specifically focus on:

- **Overall Accuracy Percent**: To visualize the general predictive power of each model.
- **Accuracy@10%, Accuracy@20%, and Accuracy@30%**: To highlight the models' effectiveness in identifying the most confident positive predictions, which is crucial for applications where focusing on a small, high-precision subset is valuable.

These visual summaries will offer a clear and intuitive way to compare the models' strengths and weaknesses across different evaluation criteria, complementing the detailed quantitative insights provided by the table.
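
A bar chart of this kind could be produced along the following lines. This is a sketch rather than the notebook's plotting cell: the dictionary holds a subset of the 'Accuracy Percent' values reported in this section, and the variable names and output file name are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; unnecessary inside a notebook
import matplotlib.pyplot as plt

# Subset of the reported 'Accuracy Percent' values (test set)
accuracy_percent = {
    "Extra Trees": 72.47,
    "Balanced Random Forest": 71.35,
    "XGBoost": 71.35,
    "Gradient Boosting": 70.79,
    "CatBoost": 69.66,
    "k-Nearest Neighbors": 68.54,
    "ANN (MLP)": 66.85,
    "Logistic Regression": 53.37,
    "Naïve Bayes": 43.82,
    "DBSCAN + kNN Proxy": 38.76,
    "Hierarchical Clustering + kNN Proxy": 38.20,
}

# Sort ascending so the best model appears at the top of a horizontal bar chart
ranked = sorted(accuracy_percent.items(), key=lambda item: item[1])
names = [name for name, _ in ranked]
scores = [score for _, score in ranked]

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.barh(names, scores, color="steelblue")
ax.set_xlabel("Accuracy Percent (test set)")
ax.set_title("Overall accuracy by model (subset)")
ax.set_xlim(0, 100)
for bar, score in zip(bars, scores):
    # Annotate each bar with its value
    ax.text(score + 1, bar.get_y() + bar.get_height() / 2, f"{score:.2f}%", va="center")
fig.tight_layout()
fig.savefig("accuracy_comparison.png", dpi=150)
```

The same pattern can be repeated for Accuracy@10%, Accuracy@20%, and Accuracy@30% by swapping in the corresponding column of the results table.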

## **Performance Comparison and Interpretation**

This section provides a comparative analysis of the eighteen machine learning models evaluated for stock price forecasting, focusing on their 'Accuracy Percent' and 'Accuracy@k' scores.

### Overall Accuracy Assessment

The 'Accuracy Percent' column represents the overall proportion of correctly classified instances on the test set. Higher values generally indicate better predictive capability for the entire dataset. From the tabulated results:

*   **Extra Trees (72.47%)**, **Balanced Random Forest (71.35%)**, and **XGBoost (71.35%)** demonstrate the highest overall accuracy. These ensemble methods, particularly tree-based ones, appear to capture the underlying patterns in the data more effectively than other models.
*   **Gradient Boosting (70.79%)** and **CatBoost (69.66%)** also perform strongly, reinforcing the robustness of gradient-boosting techniques.
*   **k-Nearest Neighbors (68.54%)** shows competitive performance, suggesting that local data structures are relevant for classification.
*   **Linear and Probabilistic models** like Linear Regression (53.37%), Logistic Regression (53.37%), Naïve Bayes (43.82%), and LDA (47.19%) generally exhibit lower overall accuracy, indicating the potential for non-linear relationships in the data that these models might struggle to capture.
*   **Neural Networks**, including ANN (MLP) (66.85%), CNN (61.24%), and RNN (LSTM) (64.04%), show moderate performance, with ANN being the strongest among them. Their performance could potentially be improved with more complex architectures or extensive hyperparameter tuning.
*   **Unsupervised feature extraction methods**, when combined with Logistic Regression (Autoencoder Features + LR: 51.12%, k-Means Features + LR: 51.12%, PCA + Logistic Regression: 50.00%), show lower accuracy. This suggests that the features extracted might not be optimally discriminative for the classification task, or that the linear classifier on top is insufficient.
*   **Clustering-based proxy models** like Hierarchical Clustering + kNN Proxy (38.20%) and DBSCAN + kNN Proxy (38.76%) exhibit the lowest accuracies, while **Isolation Forest (57.30%)** lands only a few points above the linear baselines. This indicates that these methods, designed for different tasks (unsupervised grouping, density-based clustering, anomaly detection), are less suited for direct supervised classification without significant adaptations or more sophisticated integration.

### Accuracy@k Interpretation

'Accuracy@k' (Precision@k) is a crucial metric, especially in scenarios like stock prediction where identifying a small subset of high-confidence predictions is more valuable than overall accuracy. It measures the percentage of correct positive predictions among the top k% of instances with the highest predicted probabilities for the positive class. This metric is critical for strategies that act on a limited number of high-conviction signals.
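
The definition above can be written as a short function. This is a sketch of the metric as described, not necessarily the exact implementation used in the notebook: the function name `accuracy_at_k`, the rounding-down of the top-k count, and the toy labels and scores are assumptions.

```python
import numpy as np

def accuracy_at_k(y_true, y_score, k_percent):
    """Share of true positives among the top k% of samples ranked by predicted score."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Number of samples in the top k% (rounded down, at least one sample)
    n_top = max(1, int(len(y_score) * k_percent / 100))
    # Indices of the highest scores, in descending order
    top_idx = np.argsort(-y_score, kind="stable")[:n_top]
    return float(y_true[top_idx].mean())

# Toy example: 1 = price went up ('BUY' was correct), 0 = it did not
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_score = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10])

print(accuracy_at_k(y_true, y_score, 20))  # top 2 samples: labels [1, 0] -> 0.5
print(accuracy_at_k(y_true, y_score, 40))  # top 4 samples: labels [1, 0, 1, 1] -> 0.75
```

Because only the top-ranked samples are counted, the value can change noticeably when a single prediction flips, especially at small k on a small test set.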

*   **Top 10% (Accuracy@10%)**: Here, **t-SNE + Logistic Regression (76.47%)**, **AdaBoost (76.47%)**, and **PCA + Logistic Regression (76.47%)** perform exceptionally well. While their overall accuracies might not be the highest, they are very effective at identifying a small, highly probable subset of 'BUY' signals. This suggests that while these models may not generalize well across all predictions, they can be highly precise when focusing on the most confident predictions.
*   **Top 20% (Accuracy@20%)**: **XGBoost (71.43%)** and **Autoencoder Features + LR (71.43%)** lead in this category. This indicates that their predicted probabilities rank positive instances well when a slightly larger portion of the dataset is selected, providing reliable signals at this threshold. Note that Accuracy@k reflects ranking quality rather than probability calibration.
*   **Top 30% (Accuracy@30%)**: **Balanced Random Forest (67.92%)**, **Extra Trees (67.92%)**, and **CatBoost (67.92%)** perform best here. These models maintain strong precision even when considering a larger fraction of the top predictions, highlighting their ability to consistently rank positive instances higher.

**Empirical Observations and Trade-offs:**

It's evident that models with high overall accuracy (like Extra Trees or Balanced Random Forest) generally perform well across Accuracy@k metrics, but some models, like t-SNE + LR, AdaBoost, and PCA + LR, show a significant boost in Accuracy@10% despite having moderate overall accuracy. This implies a trade-off: a model might have lower overall accuracy but be excellent at identifying a small, critical subset of highly probable positive outcomes, which is often more desirable in financial applications (e.g., maximizing returns by only acting on the strongest buy signals).

Conversely, models with very low overall accuracy (e.g., Hierarchical Clustering, DBSCAN) also show poor Accuracy@k scores, confirming their limited utility for this particular classification task.

### Conclusion

For a balanced performance considering both overall predictive power and the ability to identify high-confidence predictions, **tree-based ensemble models (Extra Trees, Balanced Random Forest, XGBoost, CatBoost, Gradient Boosting)** stand out as the most robust choices. However, for strategies prioritizing high precision on a very small set of top predictions, **t-SNE + Logistic Regression, AdaBoost, and PCA + Logistic Regression** present compelling alternatives, suggesting that dimensionality reduction or boosting can enhance the confidence of predictions for the most salient cases. The choice of the "best" model therefore depends on the specific financial strategy.

## **Strengths, Weaknesses, and Trade-offs**

### Linear & Probabilistic Models (Linear Regression, Logistic Regression, Naïve Bayes, LDA, SVM, k-NN):

**Strengths:**
- **Interpretability:** Generally highly interpretable, especially Linear and Logistic Regression, making it easier to understand feature importance.
- **Simplicity:** Simpler to implement and faster to train, particularly for smaller datasets.
- **Scalability:** Can scale well to large datasets, with some models (like Linear/Logistic Regression) having efficient implementations.
- **Robustness (SVM):** SVMs can be effective in high-dimensional spaces and with clear margins.

**Weaknesses:**
- **Limited to Linear Relationships:** Linear Regression and Logistic Regression struggle with non-linear relationships in data.
- **Assumption Dependent (Naïve Bayes, LDA):** Naïve Bayes assumes feature independence, which is rarely true in practice. LDA assumes Gaussian distributions and equal covariance matrices.
- **Sensitivity to Outliers:** Many linear models are sensitive to outliers.
- **Performance (k-NN):** Can be computationally expensive for prediction on large datasets due to distance calculations.

**Trade-offs:** High interpretability and faster training often come at the cost of lower predictive power on complex, non-linear data compared to ensemble or deep learning methods.

### Tree-Based Ensembles (Decision Trees, Balanced Random Forest, Extra Trees, AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost):

**Strengths:**
- **High Accuracy:** Often achieve state-of-the-art performance, especially Gradient Boosting machines, due to their ability to capture complex non-linear relationships.
- **Feature Importance:** Provide clear insights into feature importance.
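- **Handles Mixed Data Types:** Can handle both numerical and categorical features, although native categorical support depends on the implementation (e.g., CatBoost and LightGBM provide it, while many scikit-learn tree estimators expect numerically encoded inputs).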
- **Robust to Outliers (some):** Ensemble methods like Random Forests are less sensitive to outliers than single decision trees.
- **Scalability:** Many modern implementations (XGBoost, LightGBM, CatBoost) are highly optimized for speed and scalability.

**Weaknesses:**
- **Overfitting (Decision Trees):** Single Decision Trees are prone to overfitting without proper pruning.
- **Complexity:** Ensembles can be complex, making them less interpretable than linear models.
- **Parameter Tuning:** Often require careful hyperparameter tuning for optimal performance.
- **Computational Cost:** Can be computationally intensive to train, especially for large ensembles or deep trees.

**Trade-offs:** Excellent predictive power and robustness are achieved at the cost of increased complexity and reduced interpretability compared to linear models. Training times can be longer.

### Neural Networks & Deep Learning (ANN/MLP, CNN, RNN/LSTM, Autoencoders):

**Strengths:**
- **Captures Complex Patterns:** Exceptional at learning intricate patterns and representations from data, leading to high accuracy, especially on large, complex datasets.
- **Feature Learning:** Can automatically learn relevant features from raw data, reducing the need for manual feature engineering.
- **Versatility:** Applicable to various data types (sequences, images, text) and tasks.

**Weaknesses:**
- **Data Hungry:** Require very large datasets to perform optimally and avoid overfitting.
- **Computational Expense:** Extremely resource-intensive to train, requiring significant computational power (GPUs/TPUs).
- **Black Box:** Very difficult to interpret, often considered "black box" models.

## **Data Analysis Key Findings**

*   **Algorithm Selection Rationale**: A diverse set of machine learning algorithms was chosen to comprehensively compare performance in stock price forecasting.
    *   **Linear & Probabilistic Models** were selected as interpretable baselines to capture fundamental trends.
    *   **Tree-Based & Ensemble Learning Models** were included for their ability to handle non-linear relationships and improve robustness.
    *   **Neural Networks & Deep Learning Models** were chosen for their capability to learn hierarchical features and process sequential data, crucial for time series.
    *   **Clustering & Unsupervised Models** were selected for identifying inherent structures, reducing dimensionality, and detecting anomalies, thereby enhancing supervised models.
*   **Hyperparameter Tuning Strategy**: All models were trained using their default hyperparameters. This approach allowed for a direct comparison under standard configurations but implied that optimal performance for each model might not have been achieved. The importance of hyperparameter tuning for optimization and robustness was emphasized, with suggestions for future work involving defining search spaces, cross-validation, and appropriate evaluation metrics.
*   **Data Splitting Strategy**: A `train_test_split` with `shuffle=False` was used to maintain the temporal order of time-series data, preventing data leakage. Standard k-fold cross-validation was noted as unsuitable for time series, and the absence of explicit time-series specific cross-validation (like rolling-origin) was acknowledged, suggesting it as an area for more rigorous future evaluation.
*   **Performance Comparison and Interpretation**:
    *   **Overall Accuracy**: Tree-based ensemble models consistently showed the highest overall accuracy, with **Extra Trees (72.47%)**, **Balanced Random Forest (71.35%)**, and **XGBoost (71.35%)** leading. Linear and probabilistic models, along with unsupervised feature extraction/clustering-based proxy models, generally exhibited lower overall accuracies. Neural networks showed moderate performance.
    *   **Accuracy@k (Precision@k)**: This metric, crucial for high-conviction signals, revealed different leaders for specific thresholds.
        *   For **Accuracy@10%**, **t-SNE + Logistic Regression (76.47%)**, **AdaBoost (76.47%)**, and **PCA + Logistic Regression (76.47%)** performed exceptionally well, indicating their ability to precisely identify a small subset of high-probability positive predictions despite potentially lower overall accuracy.
        *   For **Accuracy@20%**, **XGBoost (71.43%)** and **Autoencoder Features + LR (71.43%)** were top performers.
        *   For **Accuracy@30%**, **Balanced Random Forest (67.92%)**, **Extra Trees (67.92%)**, and **CatBoost (67.92%)** maintained strong precision.
    *   A trade-off was observed where models with lower overall accuracy could still be highly effective at identifying a small, precise set of predictions, which is valuable in financial applications.
*   **Strengths, Weaknesses, and Trade-offs of Model Categories**:
    *   **Linear & Probabilistic Models**: High interpretability and simplicity at the cost of limited ability to capture non-linear relationships.
    *   **Tree-Based Ensembles**: High accuracy and robustness for complex non-linear data, but with increased complexity, need for tuning, and reduced interpretability compared to linear models.
    *   **Neural Networks & Deep Learning**: Highest potential for accuracy on complex data with automatic feature learning, but are data-hungry, computationally expensive, and lack interpretability ("black box").
    *   **Clustering & Unsupervised Models**: Excellent for pattern discovery, dimensionality reduction, and anomaly detection, primarily used to enhance supervised models rather than for direct prediction.

### Insights or Next Steps

*   To fully optimize model performance, comprehensive hyperparameter tuning and time-series specific cross-validation (e.g., rolling-origin) should be implemented in future work to ensure robustness and generalizability of the models.
*   The choice of the "best" model depends on the specific financial strategy: if maximizing overall correct predictions is key, tree-based ensembles are preferred; if high precision on a small set of high-confidence signals is critical, models like AdaBoost or dimensionality reduction combined with linear models (e.g., t-SNE + LR) show promise.


## **Answering the Aim and Objectives of the Study**

**General Objective:**
**To conduct a comprehensive comparative analysis of eighteen machine learning models for stock price forecasting in the financial sector using historical Yahoo Finance data.**

**Interpretation:** This objective was successfully met by evaluating eighteen diverse machine learning models across different paradigms (linear, tree-based, neural networks, and unsupervised techniques used for feature engineering). The study provided a detailed comparison of their performance using 'Accuracy Percent' and 'Accuracy@k' metrics, alongside discussions on their strengths, weaknesses, and trade-offs.

**Specific Objectives:**

1.  **Examine the predictive performance of traditional regression-based machine learning models in forecasting stock prices.**
    *   **Interpretation:** Traditional regression-based models like Linear Regression (53.37% accuracy), Logistic Regression (53.37% accuracy), and LDA (47.19% accuracy) generally showed lower overall predictive performance compared to ensemble methods. While some, when combined with dimensionality reduction (e.g., PCA + LR), achieved high 'Accuracy@10%', their overall ability to capture complex stock price dynamics was limited, suggesting that the problem is highly non-linear.

2.  **Evaluate the effectiveness of tree-based and ensemble learning models for financial time-series prediction.**
    *   **Interpretation:** Tree-based and ensemble learning models (Extra Trees, Balanced Random Forest, XGBoost, Gradient Boosting, CatBoost, AdaBoost) demonstrated superior effectiveness, consistently achieving the highest overall accuracy scores (ranging from 67.42% to 72.47%). They proved robust in capturing non-linear relationships and interactions within the financial data, making them highly suitable for stock price forecasting. Their strong performance across various Accuracy@k thresholds further validates their utility in identifying high-conviction signals.

3.  **Assess the capability of kernel-based and probabilistic models in capturing non-linear stock price dynamics.**
    *   **Interpretation:** Kernel-based models like SVM (57.87% accuracy) showed moderate capability, improving upon purely linear models by handling some non-linearity through the kernel trick. Probabilistic models like Naïve Bayes (43.82% accuracy) performed poorly, likely due to the strong assumption of feature independence, which is often violated in financial data. This suggests that while kernel methods can help, simpler probabilistic approaches may not be sophisticated enough for complex non-linear dynamics.

4.  **Analyze the forecasting accuracy of deep learning models in comparison with conventional machine learning techniques.**
    *   **Interpretation:** Deep learning models (ANN/MLP: 66.85%, CNN: 61.24%, RNN/LSTM: 64.04%) showed competitive but generally not superior accuracy compared to the top ensemble models. While they captured complex patterns, their performance on this dataset with limited features and default hyperparameters did not overwhelmingly outperform the best conventional techniques. This indicates that their full potential might be realized with more data, extensive feature engineering, or dedicated hyperparameter tuning and architecture optimization.

5.  **Identify the most robust and generalizable machine learning models for stock price forecasting using historical market data.**
    *   **Interpretation:** Based on the notebook's accuracy and Accuracy@k results, the **tree-based ensemble models** are the most robust for this classification task on the single chronological hold-out split used here. Specifically, **Extra Trees (72.47% overall accuracy)**, **Balanced Random Forest (71.35% overall accuracy)**, **XGBoost (71.35% overall accuracy)**, **Gradient Boosting (70.79% overall accuracy)**, and **CatBoost (69.66% overall accuracy)** consistently demonstrate strong performance across both overall accuracy and the various Accuracy@k thresholds. These models are effective at capturing complex, non-linear relationships in the data and offer a good balance between overall predictive power and the ability to identify high-confidence predictions. Because no time-series cross-validation was performed, their generalizability to other time periods remains to be confirmed.



In [None]:
import numpy as np
from sklearn import metrics

# Assuming y_test and y_pred are available from the last classification model
# Note: Applying regression metrics to binary classification outcomes provides insight
# into how 'close' the binary predictions are, but does not directly evaluate continuous forecasting performance.

# Mean Squared Error (MSE)
mse = metrics.mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.4f}")

# Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")

# Mean Absolute Error (MAE)
mae = metrics.mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae:.4f}")

# R-squared (Coefficient of Determination)
# Note: R-squared can be misleading or negative for classification problems with binary outputs
# as it's designed for continuous targets. A low or negative value is expected here.
r2 = metrics.r2_score(y_test, y_pred)
print(f"R-squared (R2): {r2:.4f}")
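
To make the caveat in the comments above concrete, the sketch below uses a small hypothetical pair of 0/1 label vectors (not the notebook's `y_test` and `y_pred`). When both the targets and the predictions are binary, every error has magnitude 1, so MSE and MAE both reduce to the misclassification rate, i.e. 1 minus accuracy.

```python
import numpy as np
from sklearn import metrics

# Hypothetical binary labels and predictions for illustration
y_true_demo = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

accuracy_demo = metrics.accuracy_score(y_true_demo, y_pred_demo)
mse_demo = metrics.mean_squared_error(y_true_demo, y_pred_demo)
mae_demo = metrics.mean_absolute_error(y_true_demo, y_pred_demo)

# Each error is 0 or 1, so squared and absolute errors coincide
print(f"Accuracy: {accuracy_demo:.2f}")
print(f"MSE: {mse_demo:.2f}  MAE: {mae_demo:.2f}  1 - accuracy: {1 - accuracy_demo:.2f}")
```

For binary outputs these regression metrics therefore carry the same information as accuracy. Evaluating continuous forecasting performance would require models that predict prices or returns directly.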