<a href="https://colab.research.google.com/github/FarazHeydar/Forecasting-Stock-Market-Trends/blob/main/Forecasting_Stock_Market_Trends.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Forecasting Stock Market Trends

## **Library Imports, Data Loading, and Analysis**
In this section, we import the necessary libraries to handle financial data, visualize trends, and build machine learning models. We perform the following steps:
1.  Data Handling: Import `pandas` and `numpy` for time-series manipulation.
2.  Visualization: Import `seaborn` and `matplotlib` to plot market trends and heatmaps.
3.  Machine Learning: Import `sklearn` modules for PCA, TimeSeriesSplit, and various classifiers (SVM, KNN, Random Forest) to build our predictive pipeline.

Then, we download the historical stock data and prepare it for analysis. We perform the following steps:
1.  Data Retrieval: Use `kagglehub` to fetch the Reliance 30 Years of Market Data dataset.
2.  Date formatting: Convert the Date column to datetime objects to ensure chronological sorting.
3.  Handling Missing Values: Apply Forward Fill (`ffill`) to propagate the last valid observation forward. This mirrors real-world conditions, where the last traded price stays in effect on non-trading days, and because only past values are used it avoids "look-ahead bias."
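
As a minimal sketch of the forward-fill step, the snippet below uses a small made-up price series (the values and dates are illustrative assumptions, not taken from the Reliance dataset). It shows that each gap inherits the last known value and that a leading gap, which has nothing to inherit from, is removed by `dropna`.

```python
import numpy as np
import pandas as pd

# Toy closing prices with gaps (hypothetical values for illustration only)
toy = pd.DataFrame(
    {"Close": [np.nan, 100.0, np.nan, np.nan, 103.0]},
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

filled = toy.ffill()       # each gap takes the last value known on that date
cleaned = filled.dropna()  # the leading NaN has no earlier value, so the row is dropped

print(cleaned)
```

Because a filled value always comes from an earlier row, no information from later dates flows backwards in time.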

In [None]:
import kagglehub
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from mpl_toolkits.mplot3d import Axes3D  # Only needed to register the 3D projection on older Matplotlib versions

# ==========================================
# RAW DATA LOADING & ANALYSIS
# ==========================================

print('=' * 65)
print("--- RAW DATA ANALYSIS ---")
print('=' * 65)

# Load Data
try:
    print("Downloading dataset...")
    path = kagglehub.dataset_download("jatinkalra17/reliance-30-years-of-market-data19942025")
    dataset_path = f'{path}/RELIANCE_NSE_1994-2025.csv'
    df = pd.read_csv(dataset_path)
except FileNotFoundError:
    print("Error: Dataset file not found. Please check the path.")
    raise  # Re-raise instead of calling exit(), which would shut down the notebook kernel

# Basic Prep
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date').reset_index(drop=True)

# Structural Analysis
print("\nData Structure:")
print(f"\nFirst 5 rows of the DataFrame:\n{df.head()}")

# Missing Value Analysis
print("\nMissing Values:")
missing_values_count = df.isnull().sum().sum()
print(f"Total missing values in the dataset = {missing_values_count}")

# Handling Missing Values (Standard Financial Approach)
# The count below covers only the OHLCV columns; 'Trades' has too many missing values and is not needed for indicators.
print(f"Missing values before fill: {df[['Open', 'High', 'Low', 'Close', 'Volume']].isnull().sum().sum()}")

# Forward Fill: If a day has a missing price, assume it is the same as the previous trading day
df = df.ffill()
df = df.dropna() # Drop rows that still contain NaNs in any column (e.g., at the very start, where there is nothing to fill from)

print("Data cleaned using Forward Fill.")
print(f"Final Raw Shape: {df.shape}")
print(f"Final Raw Head:\n{df.head()}")

## **Feature Engineering (Technical Indicators)**
In this section, we mathematically derive 13 technical indicators to capture market momentum, trend, and volatility. We perform the following steps:
1.  Trend Indicators: Calculate EMA (Exponential Moving Average) and SMA to identify the general direction of the stock price.
2.  Momentum Indicators: Compute RSI (Relative Strength Index) and MACD to detect overbought or oversold conditions.
3.  Target Generation: Create a binary `Target` variable. We check if Tomorrow's Close > Today's Close (assigned as `1`) or otherwise (assigned as `0`), which serves as the label for our supervised learning.
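
The target rule can be illustrated on a toy series. This is a sketch with made-up prices; the names `close_demo` and `labeled_demo` are hypothetical and exist only for this example. One detail worth noting is that the final row has no "tomorrow", so comparing against its shifted `NaN` yields `False`; the sketch drops that row rather than keeping a spurious `0`.

```python
import pandas as pd

# Hypothetical closing prices for five consecutive days
close_demo = pd.Series([10.0, 11.0, 10.5, 10.5, 12.0], name="Close")

# shift(-1) aligns tomorrow's close with today's row
next_close = close_demo.shift(-1)
target_demo = (next_close > close_demo).astype(int)

# The last row compares against NaN (always False), so it is removed
labeled_demo = pd.DataFrame({"Close": close_demo, "Target": target_demo}).iloc[:-1]

print(labeled_demo)
```

An unchanged close (day 3 to day 4 here) is labeled `0`, matching the "or otherwise" branch of the rule.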

In [None]:
# ==========================================
# COMPREHENSIVE FEATURE GENERATION (ALL 13 INDICATORS)
# ==========================================

print('=' * 65)
print("--- GENERATING ALL 13 INDICATORS ---")
print('=' * 65)

def calculate_all_indicators(data):
    df = data.copy()

    # 1. EMA (Exponential Moving Average)
    # Paper Eq(1): N=14 day
    df['EMA_14'] = df['Close'].ewm(span=14, adjust=False).mean()

    # 2. MACD
    # Paper Eq(2,3): EMA(12) - EMA(26). The 9-period signal line is not computed here; only the MACD line is used.
    exp12 = df['Close'].ewm(span=12, adjust=False).mean()
    exp26 = df['Close'].ewm(span=26, adjust=False).mean()
    df['MACD'] = exp12 - exp26

    # 3. VTS (Volatility Stop)
    # Paper Eq(5-8): Uses True Range (alpha)
    # Calculating True Range (alpha)
    high_low = df['High'] - df['Low']
    high_close = np.abs(df['High'] - df['Close'].shift())
    low_close = np.abs(df['Low'] - df['Close'].shift())
    # alpha (True Range)
    tr = pd.concat([high_low, high_close, low_close], axis=1).max(axis=1)
    # Calculate ATR (Smoothed TR) for stability over 14 days
    atr = tr.rolling(window=14).mean()
    # VTS = Close - (beta * alpha). Assuming beta=3 (Standard)
    df['VTS'] = df['Close'] - (3 * atr)

    # 4. T3 (Triple Exponential Moving Average)
    # Paper cites Tim Tillson. Standard Implementation:
    vfactor = 0.7
    a = vfactor
    c1 = -a**3
    c2 = 3*a**2 + 3*a**3
    c3 = -6*a**2 - 3*a - 3*a**3
    c4 = 1 + 3*a + a**3 + 3*a**2

    e1 = df['Close'].ewm(span=5, adjust=False).mean()
    e2 = e1.ewm(span=5, adjust=False).mean()
    e3 = e2.ewm(span=5, adjust=False).mean()
    e4 = e3.ewm(span=5, adjust=False).mean()
    e5 = e4.ewm(span=5, adjust=False).mean()
    e6 = e5.ewm(span=5, adjust=False).mean()
    df['T3'] = c1*e6 + c2*e5 + c3*e4 + c4*e3

    # 5. Parabolic SAR (Iterative Implementation)
    # This loop is necessary for accurate SAR calculation
    high = df['High'].values
    low = df['Low'].values
    close = df['Close'].values
    sar = np.zeros(len(df))

    # Initial values
    af = 0.02
    max_af = 0.2
    is_bull = True
    sar[0] = low[0]
    ep = high[0] # Extreme Point

    for i in range(1, len(df)):
        prev_sar = sar[i-1]

        # Calculate today's SAR
        curr_sar = prev_sar + af * (ep - prev_sar)

        # Check for reversal
        if is_bull:
            if low[i] < curr_sar: # Reversal to Bearish
                is_bull = False
                curr_sar = ep
                ep = low[i]
                af = 0.02
            else: # Continue Bullish
                if high[i] > ep:
                    ep = high[i]
                    af = min(af + 0.02, max_af)
                # SAR cannot be higher than the last two lows
                curr_sar = min(curr_sar, low[max(0, i-1)], low[max(0, i-2)])
        else:
            if high[i] > curr_sar: # Reversal to Bullish
                is_bull = True
                curr_sar = ep
                ep = high[i]
                af = 0.02
            else: # Continue Bearish
                if low[i] < ep:
                    ep = low[i]
                    af = min(af + 0.02, max_af)
                # SAR cannot be lower than the last two highs
                curr_sar = max(curr_sar, high[max(0, i-1)], high[max(0, i-2)])

        sar[i] = curr_sar

    df['SAR'] = sar

    # 6. Bollinger Bands (BB)
    # Paper mentions "Middle band values were used".
    sma20 = df['Close'].rolling(window=20).mean()
    std20 = df['Close'].rolling(window=20).std()
    df['BB_Middle'] = sma20 # Specifically requested by paper text
    df['BB_Upper'] = sma20 + (2 * std20)
    df['BB_Lower'] = sma20 - (2 * std20)

    # 7. OBV (On Balance Volume)
    # Paper: Momentum indicator based on volume flow.
    df['OBV'] = (np.sign(df['Close'].diff()) * df['Volume']).fillna(0).cumsum()

    # 8. CCI (Commodity Channel Index)
    # Paper: Detects beginning/ending trends.
    tp = (df['High'] + df['Low'] + df['Close']) / 3
    sma_tp = tp.rolling(window=14).mean()
    mean_dev = tp.rolling(window=14).apply(lambda x: np.abs(x - x.mean()).mean())
    df['CCI'] = (tp - sma_tp) / (0.015 * mean_dev)

    # 9. MTM (Momentum)
    # Paper Eq(9): C(t) - C(t-n).
    # Table 4 Header: MTM(6_6), so n = 6 is used here (not 10).
    df['MTM'] = df['Close'].diff(6)

    # 10. PPO (Percentage Price Oscillator)
    # Paper Eq(10) seems simplified, but the Table 4 data implies a percentage.
    # Header implies (26, 12, 9).
    ema12 = df['Close'].ewm(span=12, adjust=False).mean()
    ema26 = df['Close'].ewm(span=26, adjust=False).mean()
    df['PPO'] = ((ema12 - ema26) / ema26) * 100

    # 11. PERF (Performance)
    # Paper Eq(11): % change from start price.
    first_price = df['Close'].iloc[0]
    df['PERF'] = ((df['Close'] - first_price) / first_price) * 100

    # 12. RSI (Added by us as "Strong Predictor" from analysis)
    delta = df['Close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
    rs = gain / loss
    df['RSI'] = 100 - (100 / (1 + rs))

    # 13. SMA (Simple Moving Averages) - Included for comparison with EMA
    df['SMA_15'] = df['Close'].rolling(window=15).mean()
    df['SMA_50'] = df['Close'].rolling(window=50).mean()

    return df

df_all = calculate_all_indicators(df)

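# Create Target (1 if Next Day Up, 0 if Down)
df_all['Target'] = (df_all['Close'].shift(-1) > df_all['Close']).astype(int)
df_all = df_all.iloc[:-1] # The last row has no next day; its comparison with NaN would give a spurious 0
df_all = df_all.dropna() # Cleaning NaNs from indicator calculations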

print("Features Generated: EMA, MACD, VTS, T3, SAR, BB, OBV, CCI, MTM, PPO, PERF, RSI, SMA")
print(f"Dataset Shape after Engineering: {df_all.shape}")

# Structural Analysis
print("\nData Structure:")
print(f"\nFirst 5 rows of the DataFrame:\n{df_all.head()}")
print(f"\nTotal rows loaded: {len(df_all)}")
print(f"\nShape: {df_all.shape}")
print(f"\nDuplicates: {df_all.duplicated().sum()}")
print(f"\nUnique Values per Column (Content Check):\n{df_all.nunique()}")
print("\nDataFrame Info (Shows structure and missing data):")
df_all.info() # info() prints directly and returns None, so it is not embedded in an f-string
print(f"\nFull Statistical Summary (Before Cleaning):\n{df_all.describe().T}")

## **Exploratory Data Analysis (EDA)**
In this section, we visualize the data to understand feature relationships and identify potential anomalies. We perform the following steps:
1.  Correlation Heatmap: Generate a heatmap to check for multicollinearity. High correlations (e.g., between SMA_15 and EMA_14) indicate redundant features, justifying our later use of PCA.
2.  Class Balance Inspection: Plot the distribution of the `Target` variable to ensure the dataset has a roughly equal number of "Up" and "Down" days.
3.  Outlier Detection: Use boxplots to identify extreme values that might skew model training.
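
To make the multicollinearity point concrete, here is a small sketch on a synthetic random walk (an assumption standing in for real prices). It computes a 15-day SMA and a 14-day EMA in the same way as the feature-engineering step and measures how strongly the two move together.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk "price" (illustrative stand-in, not market data)
rng = np.random.default_rng(0)
price_demo = pd.Series(100 + rng.normal(0, 1, 500).cumsum())

feats_demo = pd.DataFrame({
    "SMA_15": price_demo.rolling(window=15).mean(),
    "EMA_14": price_demo.ewm(span=14, adjust=False).mean(),
}).dropna()

# Pearson correlation between the two smoothed series
corr_demo = feats_demo["SMA_15"].corr(feats_demo["EMA_14"])
print(f"Correlation between SMA_15 and EMA_14: {corr_demo:.3f}")
```

Two smoothers of the same price with similar window lengths carry almost the same information, which is why such pairs show up as near-saturated cells in the heatmap.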

In [None]:
# ==========================================
# ADVANCED EDA ON ENGINEERED FEATURES
# ==========================================

print('=' * 65)
print("--- ADVANCED EDA (FEATURE SELECTION) ---")
print('=' * 65)

# List of all generated features
indicators = ['EMA_14', 'MACD', 'VTS', 'T3', 'SAR', 'BB_Middle', 'OBV', 'CCI', 'MTM', 'PPO', 'PERF', 'RSI', 'SMA_15', 'SMA_50']
all_features = ['Open', 'High', 'Low', 'Close', 'Volume'] + indicators
target_col = ['Target']

# Target Variable Distribution
print("\nVisualizing Target Variable Distribution...")
plt.figure(figsize=(10, 5))
sns.countplot(x='Target', data=df_all, palette='viridis', hue='Target', legend=False)
plt.title('Distribution of Target Variable\n', fontsize=14)
plt.xlabel('Target (0: Down, 1: Up)', fontsize=10)
plt.ylabel('Count', fontsize=10)

# Add text labels
target_counts = df_all['Target'].value_counts().sort_index()
for i in range(len(target_counts)):
    plt.text(i, target_counts[i] + 100, f"{target_counts[i]} ({target_counts[i]/len(df_all)*100:.1f}%)", ha='center', fontsize=11)
plt.show()

# Feature Distribution (Visual Check)
print("\nVisualizing New Data Distributions...")
# Each feature gets its own figure below, so no shared suptitle is set (calling plt.suptitle here would open an extra empty figure)
for i, feature in enumerate(indicators):
    plt.figure(figsize=(10, 5)) # Create a new figure for each plot
    sns.histplot(df_all[feature], kde=True, bins=50)
    plt.title(f'Distribution of {feature}', fontsize=14)
    plt.tight_layout()
    plt.show()

# Time Series Trends
print("\nVisualizing Time Series Trends...")
for i, col in enumerate(indicators):
    plt.figure(figsize=(10, 5)) # Create a new figure for each plot
    plt.plot(df_all['Date'], df_all[col], label=f'{col} Trend')
    plt.title(f'{col} Time Series Trend', fontsize=14)
    plt.xlabel('Date')
    plt.ylabel(col)
    plt.legend()
    plt.show()

# Outlier Analysis (Boxplots)
print("\nVisualizing Outliers...")
for i, col in enumerate(all_features):
    plt.figure(figsize=(10, 5)) # Create a new figure for each plot
    sns.boxplot(x=df_all[col], color='lightblue') # A single box takes a color; palette without hue is deprecated in recent seaborn
    plt.title(f'Boxplot of {col}', fontsize=14)
    plt.grid(True)
    plt.show()

# Feature vs Target Analysis (Boxplots)
print("\nAnalyzing Feature vs Target Separation...")

# Compare each indicator's distribution on "Down" (0) vs "Up" (1) days
for i, col in enumerate(indicators):
    plt.figure(figsize=(10, 5))
    sns.boxplot(x='Target', y=col, data=df_all, hue='Target', palette='coolwarm', legend=False)
    plt.title(f'{col} Distribution by Target', fontsize=14)
    plt.tight_layout()
    plt.show()

# Correlation & Multicollinearity Analysis (Heatmap)
# This section is crucial to show why PCA or feature removal is necessary
print("\nCorrelation Heatmap (Checking for Redundancy)...")
plt.figure(figsize=(10, 10))
corr_matrix = df_all[indicators + target_col].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', annot_kws={"size": 8})
plt.title('Correlation Matrix of All Indicators and Target', fontsize=14)
plt.show()

## **Data Preprocessing and Dimensionality Reduction**
In this section, we transform the raw features into a format suitable for machine learning models. We perform the following steps:
1.  Standard Scaling: Standardize all 19 features using `StandardScaler` (mean 0, variance 1) so that large values like Volume don't dominate the model. The scaler is fitted on the training split only.
2.  Principal Component Analysis (PCA): Apply PCA to reduce the 19 correlated features into a smaller set of Principal Components that together explain 95% of the variance. This removes redundancy between features and can speed up training.
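
The two steps can be sketched on synthetic data. The example below is an illustration under stated assumptions: six made-up columns driven by only two latent signals, at very different scales. Passing a float to `n_components` asks scikit-learn's `PCA` to keep the smallest number of components whose cumulative explained variance exceeds that fraction.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300
latent = rng.normal(size=(n, 2))  # two hidden drivers (assumption for the demo)

# Six observed columns: noisy copies of the two drivers at different scales
X_demo = np.column_stack([
    latent[:, 0] * 1000 + rng.normal(scale=10, size=n),
    latent[:, 0] * 5 + rng.normal(scale=0.05, size=n),
    latent[:, 0] + rng.normal(scale=0.01, size=n),
    latent[:, 1] * 50 + rng.normal(scale=0.5, size=n),
    latent[:, 1] + rng.normal(scale=0.01, size=n),
    latent[:, 1] * 0.1 + rng.normal(scale=0.001, size=n),
])

# Chronological split: fit the scaler and PCA on the first 80% only
split_demo = int(n * 0.8)
scaler_demo = StandardScaler().fit(X_demo[:split_demo])
train_scaled = scaler_demo.transform(X_demo[:split_demo])
test_scaled = scaler_demo.transform(X_demo[split_demo:])

pca_demo = PCA(n_components=0.95).fit(train_scaled)
train_pca = pca_demo.transform(train_scaled)
test_pca = pca_demo.transform(test_scaled)

print(f"Columns before PCA: {train_scaled.shape[1]}")
print(f"Components kept:    {pca_demo.n_components_}")
print(f"Variance explained: {pca_demo.explained_variance_ratio_.sum():.4f}")
```

Fitting both transformers on the training rows and only applying them to the test rows keeps test-set statistics out of the model.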

In [None]:
# ==========================================
# FEATURE SELECTION & PREPROCESSING
# ==========================================

print('=' * 65)
print("--- FEATURE SELECTION & SCALING ---")
print('=' * 65)

# Based on the above analysis, we keep all the features and let PCA remove the redundancy.
X = df_all[all_features]
y = df_all['Target']

# Train/Test Split (TimeSeries safe)
split = int(len(X) * 0.8)
X_train_raw, X_test_raw = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_raw)
X_test_scaled = scaler.transform(X_test_raw)

# PCA (Dimensionality Reduction)
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print(f"Original Features: {X_train_scaled.shape[1]}")
print(f"Features after PCA: {X_train_pca.shape[1]}")

In [None]:
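# --- Fit Linear SVM to find the separating plane ---

# The plot shows only the first three principal components, so the SVM is fitted
# on those three alone; otherwise the plane drawn below would ignore the weights
# of the remaining components.
# The plane equation is: w0*x + w1*y + w2*z + b = 0
X_train_3d = X_train_pca[:, :3]
svc = SVC(kernel='linear')
svc.fit(X_train_3d, y_train)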

# Extract weights and bias
w = svc.coef_[0]
b = svc.intercept_[0]

# Generate grid for the plane
# z = -(w0*x + w1*y + b) / w2
x_min, x_max = X_train_pca[:, 0].min() - 1, X_train_pca[:, 0].max() + 1
y_min, y_max = X_train_pca[:, 1].min() - 1, X_train_pca[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 50),
                     np.linspace(y_min, y_max, 50))
zz = (-w[0] * xx - w[1] * yy - b) / w[2]

# Plotting
fig = plt.figure(figsize=(13, 13))
ax = fig.add_subplot(111, projection='3d')

# Plot Data Points
# Red = Down (0), Green = Up (1)
ax.scatter(X_train_pca[y_train==0, 0], X_train_pca[y_train==0, 1], X_train_pca[y_train==0, 2],
           c='red', label='Down (0)', alpha=0.5, s=25, edgecolors='k', linewidth=0.2)
ax.scatter(X_train_pca[y_train==1, 0], X_train_pca[y_train==1, 1], X_train_pca[y_train==1, 2],
           c='green', label='Up (1)', alpha=0.5, s=25, edgecolors='k', linewidth=0.2)

# Plot Hyperplane (The Blue Sheet)
ax.plot_surface(xx, yy, zz, alpha=0.3, color='blue', shade=False)

# Formatting
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
ax.set_title(f'3D PCA Visualization \nBlue Plane = Linear SVM Decision Boundary', fontsize=14)
ax.legend(loc='upper left', fontsize=12)
ax.view_init(elev=20, azim=45) # Adjust camera angle for best view
plt.show()

## **Model Training and Comparison**
In this section, we train and evaluate five different classifiers to predict stock direction. We perform the following steps:
1.  Time Series Cross-Validation: Use `TimeSeriesSplit` to create expanding training windows (e.g., Train on Years 1-5, Test on Year 6; then Train on Years 1-6, Test on Year 7). This prevents data leakage by ensuring we never train on future data.
2.  Model Selection: Train Logistic Regression, KNN, SVM, Random Forest, and MLP (Neural Network).
3.  Metric Evaluation: Measure ROC-AUC and Accuracy for each model to determine which algorithm best captures the market signal.
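
To see what `TimeSeriesSplit` actually produces, the sketch below splits twelve consecutive observations (a toy stand-in for trading days) into three folds and prints the index ranges. The variable names are hypothetical and used only here.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve consecutive observations standing in for trading days
X_days = np.arange(12).reshape(-1, 1)

tscv_demo = TimeSeriesSplit(n_splits=3)
folds_demo = [(train.tolist(), test.tolist()) for train, test in tscv_demo.split(X_days)]

for k, (train_idx, test_idx) in enumerate(folds_demo, start=1):
    print(f"Fold {k}: train={train_idx} test={test_idx}")
```

Every training window starts at the first observation and grows from fold to fold, while each test window lies strictly after its training window.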

In [None]:
# ==========================================
# MODEL TRAINING & COMPARISON
# ==========================================

print('=' * 65)
print("--- MODEL TRAINING & HYPERPARAMETER TUNING ---")
print('=' * 65)

# Define Models and Hyperparameter Grids
# We selected 5 diverse algorithms to cover different learning styles.
model_params = {
    'Logistic Regression': {
        'model': LogisticRegression(random_state=42, solver='liblinear'),
        'params': {
            'C': [0.1, 1, 10],
            'penalty': ['l1', 'l2']
        }
    },
    'K-Nearest Neighbors': {
        'model': KNeighborsClassifier(),
        'params': {
            'n_neighbors': [3, 5, 7, 9],
            'weights': ['uniform', 'distance'],
            'metric': ['euclidean', 'manhattan']
        }
    },
    'Support Vector Machine': {
        'model': SVC(probability=True, random_state=42),
        'params': {
            'C': [0.1, 1, 10],
            'kernel': ['rbf', 'linear'],
            'gamma': ['scale', 'auto']
        }
    },
    'Random Forest': {
        'model': RandomForestClassifier(random_state=42),
        'params': {
            'n_estimators': [50, 100, 200],
            'max_depth': [10, 20, None],
            'min_samples_split': [2, 5]
        }
    },
    'MLP Neural Network': {
        'model': MLPClassifier(max_iter=500, random_state=42),
        'params': {
            'hidden_layer_sizes': [(50,), (100,), (50, 50)],
            'activation': ['relu', 'tanh'],
            'alpha': [0.0001, 0.001]
        }
    }
}

# Training Loop with Time Series Cross-Validation
# We use TimeSeriesSplit to prevent data leakage (training on future data)
tscv = TimeSeriesSplit(n_splits=3)

results = []
best_estimators = {}
roc_data = {}

print("Starting training loop... (This may take a few minutes)")

for model_name, config in model_params.items():
    print(f"\nTraining {model_name}...")

    # Grid Search for Hyperparameter Tuning
    clf = GridSearchCV(config['model'], config['params'], cv=tscv, scoring='accuracy', n_jobs=-1, verbose=1)
    clf.fit(X_train_pca, y_train)

    # Store Best Model
    best_model = clf.best_estimator_
    best_estimators[model_name] = best_model

    # Evaluate on Test Set
    y_pred = best_model.predict(X_test_pca)
    y_prob = best_model.predict_proba(X_test_pca)[:, 1]

    # Calculate Metrics
    acc = accuracy_score(y_test, y_pred)
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)

    # Store Results
    results.append({
        'Model': model_name,
        'Accuracy': acc,
        'AUC Score': roc_auc,
        'Best Params': str(clf.best_params_)
    })
    roc_data[model_name] = (fpr, tpr, roc_auc)

    print(f"  > Best Accuracy (CV): {clf.best_score_:.4f}")
    print(f"  > Test Accuracy: {acc:.4f}")
    print(f"  > Test AUC: {roc_auc:.4f}")



## **Performance Evaluation and Visualization**
In this section, we visually assess the trade-off between sensitivity and specificity for our models. We perform the following steps:
1.  ROC Curve: Plot the False Positive Rate vs. True Positive Rate. An AUC score above 0.5 suggests the model ranks "Up" days better than random guessing on this test set; values close to 0.5 should be treated as noise.
2.  Confusion Matrix: Generate a matrix for the best model by test accuracy (KNN in our run) to analyze False Positives vs. False Negatives, helping us understand if the model is biased toward predicting "Up" or "Down."
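
A four-sample toy example shows how the two diagnostics are computed. The labels and scores below are made up for illustration; the 0.5 threshold is an assumption that mirrors the default decision rule of most classifiers.

```python
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_curve

# Hypothetical true labels and predicted probabilities of "Up"
y_true_demo = np.array([0, 0, 1, 1])
y_score_demo = np.array([0.10, 0.40, 0.35, 0.80])

# AUC: the share of (Up, Down) pairs in which the Up day gets the higher score
fpr_demo, tpr_demo, _ = roc_curve(y_true_demo, y_score_demo)
auc_demo = auc(fpr_demo, tpr_demo)

# Confusion matrix at a 0.5 threshold: rows = true label, columns = predicted label
y_pred_demo = (y_score_demo >= 0.5).astype(int)
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

print(f"AUC = {auc_demo:.2f}")
print(cm_demo)
```

Here three of the four (Up, Down) pairs are ranked correctly, giving an AUC of 0.75, while the matrix shows one "Up" day missed at the chosen threshold.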

In [None]:
# ==========================================
# PERFORMANCE COMPARISON & VISUALIZATION
# ==========================================

print('\n' + '=' * 65)
print("--- COMPARATIVE ANALYSIS ---")
print('=' * 65)

# Comparison Table
results_df = pd.DataFrame(results).sort_values(by='Accuracy', ascending=False)
print("\nModel Performance Table:")
print(results_df[['Model', 'Accuracy', 'AUC Score']])

# Performance Bar Chart
plt.figure(figsize=(10, 6))
sns.barplot(x='Accuracy', y='Model', hue='Model', data=results_df, palette='viridis', legend=False)
plt.title('Model Accuracy Comparison', fontsize=15)
plt.xlabel('Accuracy Score')
plt.xlim(0.4, 1.0) # Adjust limit to see differences clearly
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.show()

# ROC Curve Comparison (Crucial for Classification)
plt.figure(figsize=(10, 8))
for model_name, (fpr, tpr, roc_auc) in roc_data.items():
    plt.plot(fpr, tpr, lw=2, label=f'{model_name} (AUC = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison', fontsize=15)
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

# Detailed Classification Report for the Best Model
best_model_name = results_df.iloc[0]['Model']
best_model_instance = best_estimators[best_model_name]
print(f"\nDetailed Analysis for Best Model: {best_model_name}")

y_pred_best = best_model_instance.predict(X_test_pca)
print(classification_report(y_test, y_pred_best))

# Confusion Matrix for Best Model
plt.figure(figsize=(6, 5))
cm = confusion_matrix(y_test, y_pred_best)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title(f'Confusion Matrix: {best_model_name}', fontsize=14)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

print("\nFull Project Execution Completed Successfully.")

## **Ensemble Learning (Voting Classifier)**
In this section, we combine our individual models to create a more robust predictor. We perform the following steps:
1.  Soft Voting Strategy: Implement a `VotingClassifier` with `voting='soft'`.
2.  Probability Aggregation: Instead of counting votes, we average the predicted probabilities of all models. This captures the confidence level of each classifier, which often gives more stable predictions than any single model, though an improvement is not guaranteed.
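
The difference between counting votes and averaging probabilities is easiest to see with plain arithmetic. The sketch below uses three made-up probabilities for a single day; it illustrates the aggregation rule only and does not call `VotingClassifier` itself.

```python
import numpy as np

# Hypothetical P(Up) from three classifiers for one trading day
p_up = np.array([0.45, 0.45, 0.90])

# Hard voting: each model casts one vote, the majority wins
hard_votes = (p_up > 0.5).astype(int)
hard_pred = int(hard_votes.sum() > len(hard_votes) / 2)

# Soft voting: average the probabilities, then threshold
soft_prob = p_up.mean()
soft_pred = int(soft_prob > 0.5)

print(f"Hard voting predicts {hard_pred}")
print(f"Soft voting predicts {soft_pred} (mean P(Up) = {soft_prob:.2f})")
```

Two lukewarm "Down" votes win under hard voting, whereas under soft voting a single confident "Up" model outweighs them. This is also why every base model needs `predict_proba`, which `SVC` only provides when `probability=True` is set.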

In [None]:
# ==========================================
# VOTING CLASSIFIER (ENSEMBLE)
# ==========================================

print('=' * 65)
print("--- ENSEMBLE LEARNING (BOOSTING PERFORMANCE) ---")
print('=' * 65)

# 1. Retrieve best estimators from previous step
clf1 = best_estimators['Logistic Regression']
clf2 = best_estimators['K-Nearest Neighbors']
clf3 = best_estimators['Support Vector Machine']
clf4 = best_estimators['Random Forest']
clf5 = best_estimators['MLP Neural Network']

# 2. Create Voting Classifier (Soft Voting uses probabilities)
voting_clf = VotingClassifier(
    estimators=[
        ('lr', clf1),
        ('knn', clf2),
        ('svm', clf3),
        ('rf', clf4),
        ('mlp', clf5)
    ],
    voting='soft'
)

print("Training Voting Classifier...")
voting_clf.fit(X_train_pca, y_train)

# 3. Evaluate Ensemble
y_pred_ens = voting_clf.predict(X_test_pca)
y_prob_ens = voting_clf.predict_proba(X_test_pca)[:, 1]

acc_ens = accuracy_score(y_test, y_pred_ens)
fpr, tpr, _ = roc_curve(y_test, y_prob_ens)
roc_auc_ens = auc(fpr, tpr)

print(f"\n[Ensemble Results]")
print(f"  > Ensemble Accuracy: {acc_ens:.4f}")
print(f"  > Ensemble AUC: {roc_auc_ens:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_ens))

## **Feature Importance Analysis**
In this section, we identify which technical indicators are driving the market predictions. We perform the following steps:
1.  Raw Feature Training: Train a separate Random Forest Classifier on the original 19 features (without PCA) to access interpretable feature names.
2.  Gini Importance Extraction: Extract and plot the `feature_importances_` attribute to rank indicators. This shows which indicators the model relies on most, for example whether Volume or Momentum features carry more weight; it does not by itself prove that they precede price changes.
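
As a sanity check of what the importance ranking measures, here is a sketch on synthetic data in which the label depends on one column only (the column names `signal`, `noise_a`, and `noise_b` are hypothetical). A Random Forest should then concentrate its importance on that column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)

# One informative column and two pure-noise columns (assumption for the demo)
X_imp = pd.DataFrame({
    "signal": signal,
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y_imp = (signal > 0).astype(int)  # the label is determined by "signal" alone

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_imp, y_imp)

# Impurity-based (Gini) importances are normalized to sum to 1
imp_demo = pd.Series(rf_demo.feature_importances_, index=X_imp.columns)
imp_demo = imp_demo.sort_values(ascending=False)

print(imp_demo)
```

On real indicators the picture is usually less clean: when several features are highly correlated, the forest tends to spread importance across them, so a low score for one of a correlated group does not mean it is uninformative.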

In [None]:
# ==========================================
# FEATURE IMPORTANCE ANALYSIS
# ==========================================

print('\n' + '=' * 65)
print("--- FEATURE IMPORTANCE (DIAGNOSTICS) ---")
print('=' * 65)

# We use Random Forest to see which features actually matter
# We must use the Random Forest trained on SCALED data (before PCA) to interpret feature names
rf_diag = RandomForestClassifier(n_estimators=100, random_state=42)
rf_diag.fit(X_train_scaled, y_train)

importances = rf_diag.feature_importances_
feature_names = X.columns

# Create DataFrame for plotting
feat_imp_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

print("\nTop 10 Most Important Features:")
print(feat_imp_df.head(10))

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', hue='Feature', data=feat_imp_df, palette='magma', legend=False)
plt.title('Feature Importance (What drives the model?)', fontsize=15)
plt.show()