## Learning Objectives
<font color="#12A80D"> <b>By the end of this notebook, you will be able to:</br>  
1. Set up and configure the execution environment</br>• Install and upgrade required Python packages for machine learning, data processing, and visualization.</br>• Mount and interact with Google Drive for data and model storage.</br>
2. Load and manage ensemble model prediction inputs</br>• Locate and retrieve the latest prediction CSV files for multiple historical lookback periods (1D–365D).</br>• Organize and validate ensemble input data for further processing.</br>
3. Build and train a meta-model for stock price prediction</br>• Use Ridge Regression with L2 regularization to ensemble predictions from multiple base models.</br>• Fit the meta-model on historical prediction outputs to optimize performance.</br>
4. Evaluate predictive performance</br>• Apply regression metrics including R² score, MAE, and RMSE to assess accuracy.</br>• Compare meta-model performance against individual lookback models.</br>
5. Save and deploy trained models</br>• Serialize and save the trained meta-model and its feature column order to persistent storage for future inference.</br>
6. Integrate results into an end-to-end workflow</br>• Maintain a structured process from raw predictions to meta-model training, evaluation, and export.</b>

</font>


## Load Dependencies into the Colab Runtime Environment
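<font color="#12A80D"> <b>• Installs and upgrades required Python packages in the Colab environment.</br>• Installation errors for packages that this notebook does not import (such as <code>tensorflow</code>, <code>transformers</code>, or <code>newsapi-python</code>) can be ignored; they do not affect the execution of <code>Nvidia_Next_Day_Closing_Meta_Model_Train_Week5.ipynb</code>, which relies only on <code>pandas</code>, <code>numpy</code>, <code>scikit-learn</code>, and <code>joblib</code>.</b> </font>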

In [None]:
# =====================================
# SETUP AND INSTALL DEPENDENCIES
# =====================================
!pip install --upgrade pip
!pip install --quiet ipywidgets
!pip install --upgrade numpy==2.0.2
!pip install --quiet tensorflow==2.18.0
!pip install --quiet pandas==2.2.2 matplotlib seaborn scikit-learn==1.6.1 tqdm
!pip install --quiet transformers==4.53.1 tokenizers newsapi-python==0.2.7
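
Since some installation errors are described as ignorable, it can help to confirm which of the packages this notebook actually imports ended up in the runtime. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_versions` and the package list are illustrative assumptions, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages this notebook imports later on (illustrative list, an assumption)
REQUIRED_PACKAGES = ["numpy", "pandas", "scikit-learn", "joblib"]

def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = installed_versions(REQUIRED_PACKAGES)
for name, ver in versions.items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```

Any package reported as not installed here would need attention before continuing; failures for packages outside this list can be left alone.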



## Import libraries for data processing, modeling, and saving results
<font color="#12A80D"> <b>• Imports built-in modules for file system access (<code>os</code>, <code>glob</code>) and for parsing timestamps (<code>time</code>)</br>• Loads <code>pandas</code> and <code>numpy</code> for structured data handling and numerical operations</br>• Imports <code>Ridge</code> from scikit-learn for linear regression with L2 regularization</br>• Imports regression metrics (<code>r2_score</code>, <code>mean_absolute_error</code>, <code>mean_squared_error</code>) for performance evaluation</br>• Uses <code>joblib</code> for model serialization and saving to disk</b> </font>

In [None]:
# File and path handling
import os        # For file and directory operations
import glob      # For file pattern matching and searching

# Data manipulation and analysis
import pandas as pd   # DataFrames for structured data
import numpy as np    # Numerical computations

# Machine learning models and metrics
from sklearn.linear_model import Ridge  # Ridge regression model (L2 regularization)
from sklearn.metrics import (           # Performance metrics
    r2_score,               # Coefficient of determination
    mean_absolute_error,    # Mean absolute error
    mean_squared_error      # Mean squared error
)

# Model persistence
import joblib   # Save and load the trained model and the feature column list

# Timestamp parsing
import time     # time.strptime parses the run timestamps in folder names

## Mount Google Drive in the Colab notebook to access its contents
<font color="#12A80D"> <b>• Requires granting access to Google Drive</br>• Forces remounting even if already mounted</b> </font>

In [None]:
# Mount Google Drive in Google Colab
from google.colab import drive
drive.mount('/content/drive', force_remount=True)  # force_remount=True ensures a fresh mount

Mounted at /content/drive


## Define the target directory for saving the meta-model
<font color="#12A80D"> <b>• Specifies the Google Drive folder path where the trained meta-model and its related files will be saved</br>• Uses an absolute path to ensure outputs are stored in <code>/content/drive/My Drive/Nvidia_Stock_Market_History/Training/Meta_Model_Trained</code></br>• Requires Google Drive to be mounted before saving to this location</b> </font>

In [None]:
# Define the target directory for saving the trained meta-model
META_MODEL_SAVE_FOLDER = "/content/drive/My Drive/Nvidia_Stock_Market_History/Training/Meta_Model_Trained"

## Discover the latest prediction CSVs in each lookback folder
<font color="#12A80D"> <b>• Sets the root path for the ensemble input folders in Google Drive</br>• Defines the list of lookback period subfolders to scan (e.g., <code>365D</code>, <code>270D</code>, <code>1D</code>)</br>• For each lookback period, checks if the folder exists and contains subfolders</br>• Sorts subfolders by timestamp in their names to identify the most recent run</br>• Finds the latest <code>*_predictions.csv</code> file within that subfolder</br>• Appends the path of each latest CSV to the <code>csv_files</code> list for later ensemble processing</b> </font>

In [None]:
# ---------------------------
# Discover latest prediction CSVs in each lookback folder
# ---------------------------

# Base folder containing lookback subfolders with prediction results
ENSEMBLE_INPUTS_FOLDER = "/content/drive/My Drive/Nvidia_Stock_Market_History/Training/ensemble_inputs"

# Ordered list of lookback periods to search
LOOKBACK_FOLDERS = ["365D", "270D", "180D", "90D", "60D", "30D", "14D", "1D"]

csv_files = []  # Will store the paths to the latest prediction CSV from each lookback folder

for lookback in LOOKBACK_FOLDERS:
    # Path to the current lookback's folder
    lookback_path = os.path.join(ENSEMBLE_INPUTS_FOLDER, lookback)

    # Ensure the folder exists
    if not os.path.exists(lookback_path):
        raise FileNotFoundError(f"Lookback folder does not exist: {lookback_path}")

    # Get all subfolders inside the current lookback folder
    subfolders = [d for d in glob.glob(os.path.join(lookback_path, "*")) if os.path.isdir(d)]
    if not subfolders:
        raise FileNotFoundError(f"No subfolders found in {lookback_path}")

    # Sort subfolders by datetime in their name
    # Assumes folder names end with a timestamp in format YYYY-MM-DD_HH-MM-SS
    subfolders_sorted = sorted(
        subfolders,
        key=lambda x: time.strptime(
            x.split("_")[-2] + "_" + x.split("_")[-1],
            "%Y-%m-%d_%H-%M-%S"
        ),
        reverse=True
    )

    # Select the latest subfolder
    latest_subfolder = subfolders_sorted[0]

    # Find prediction CSV files in the latest subfolder
    prediction_csvs = glob.glob(os.path.join(latest_subfolder, "*_predictions.csv"))
    if not prediction_csvs:
        raise FileNotFoundError(f"No *_predictions.csv found in {latest_subfolder}")

    # Add one prediction CSV to the list; sorting makes the choice
    # deterministic if several files match the pattern
    csv_files.append(sorted(prediction_csvs)[0])
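
The "latest run" selection depends entirely on the timestamp at the end of each subfolder name. As a sketch of that step in isolation, the snippet below parses the trailing `YYYY-MM-DD_HH-MM-SS` stamp from the folder's base name and picks the most recent one. The helper `run_timestamp` and the example paths are hypothetical; only the naming pattern is taken from this notebook.

```python
import os
from datetime import datetime

def run_timestamp(folder_path):
    """Parse the trailing YYYY-MM-DD_HH-MM-SS stamp from a run folder name."""
    name = os.path.basename(os.path.normpath(folder_path))
    date_part, time_part = name.split("_")[-2:]
    return datetime.strptime(f"{date_part}_{time_part}", "%Y-%m-%d_%H-%M-%S")

# Hypothetical run folders that follow the naming pattern used in this notebook
example_folders = [
    "/tmp/ensemble_inputs/60D/Nvidia_Stock_Training_60D_SA_2025-08-12_05-01-55",
    "/tmp/ensemble_inputs/60D/Nvidia_Stock_Training_60D_SA_2025-08-13_03-13-06",
    "/tmp/ensemble_inputs/60D/Nvidia_Stock_Training_60D_SA_2025-08-12_23-59-59",
]

# max() with a key avoids sorting the whole list when only the latest is needed
latest = max(example_folders, key=run_timestamp)
print(latest)
```

A folder whose name does not end in a valid timestamp makes `strptime` raise `ValueError`, which surfaces naming problems early instead of silently picking the wrong run.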

## Print the discovered prediction file paths
<font color="#12A80D"> <b>• Outputs the list of most recent <code>*_predictions.csv</code> files found in each lookback folder</br>• Iterates over the <code>csv_files</code> list and prints each file path to verify correct discovery before proceeding with ensemble model processing</b> </font>

In [None]:
# Print all discovered prediction file paths
print("Found prediction files:")
for f in csv_files:
    print(f)

Found prediction files:
/content/drive/My Drive/Nvidia_Stock_Market_History/Training/ensemble_inputs/365D/Nvidia_Stock_Training_365D_SA_2025-08-13_04-56-15/Nvidia_C1D64_BiG550_BiG350_BAtt_D1_Lookback365_predictions.csv
/content/drive/My Drive/Nvidia_Stock_Market_History/Training/ensemble_inputs/270D/Nvidia_Stock_Training_270D_SA_2025-08-13_02-24-48/Nvidia_C1D64_BiG250_BiG250_BiG250_BAtt_D1_Lookback270_predictions.csv
/content/drive/My Drive/Nvidia_Stock_Market_History/Training/ensemble_inputs/180D/Nvidia_Stock_Training_180D_SA_2025-08-13_02-55-32/Nvidia_C1D64_BiG250_BiG250_BiG250_BAtt_D1_Lookback180_predictions.csv
/content/drive/My Drive/Nvidia_Stock_Market_History/Training/ensemble_inputs/90D/Nvidia_Stock_Training_90D_SA_2025-08-13_03-13-06/Nvidia_C1D64_BiG250_BiG250_BAtt_D1_Lookback90_predictions.csv
/content/drive/My Drive/Nvidia_Stock_Market_History/Training/ensemble_inputs/60D/Nvidia_Stock_Training_60D_SA_2025-08-12_05-01-55/Nvidia_C1D64_BiG250_BiG250_BAtt_D1_Lookback60_predictio

## Load and merge prediction CSVs for ensemble preparation
<font color="#12A80D"> <b>• Iterates through the discovered prediction CSV files and extracts the lookback period from the filename</br>• Reads each CSV, selects only the <code>Date</code> and <code>Predicted_Close</code> columns, and renames <code>Predicted_Close</code> to include the lookback period (e.g., <code>Pred_365</code>)</br>• Merges all prediction DataFrames on the <code>Date</code> column using an inner join to keep only matching dates</br>• Adds the <code>Actual_Close</code> column from one of the CSVs for reference</br>• Ensures the <code>Date</code> column is converted to datetime format for consistent handling</br>• Prints the merged DataFrame’s shape and first few rows for verification before training the meta-model</b> </font>

In [None]:
# ---------------------------
# Load each CSV and rename Predicted_Close to include Lookback
# ---------------------------

dfs = []  # Will store DataFrames from each lookback prediction file

for f in csv_files:
    # Extract lookback value from filename (assumes "LookbackXX" format is in filename)
    lookback = [s for s in f.split("_") if "Lookback" in s][0].replace("Lookback", "")

    # Read the CSV
    df = pd.read_csv(f)

    # Keep only Date and Predicted_Close columns
    df = df[["Date", "Predicted_Close"]]

    # Rename Predicted_Close to include the lookback period (e.g., Pred_365)
    df = df.rename(columns={"Predicted_Close": f"Pred_{lookback}"})

    dfs.append(df)

# Merge all prediction DataFrames on Date
merged_df = dfs[0]
for df in dfs[1:]:
    merged_df = pd.merge(merged_df, df, on="Date", how="inner")

# Load Actual_Close values from the first CSV and merge with predictions
actual_df = pd.read_csv(csv_files[0])[["Date", "Actual_Close"]]
merged_df = pd.merge(merged_df, actual_df, on="Date", how="inner")

# Ensure Date column is in datetime format
merged_df["Date"] = pd.to_datetime(merged_df["Date"])

# Print shape and preview of merged DataFrame
print("Merged DataFrame shape:", merged_df.shape)
print(merged_df.head())

Merged DataFrame shape: (6114, 10)
        Date  Pred_365  Pred_270  Pred_180   Pred_90   Pred_60   Pred_30  \
0 2001-04-18  0.569765  0.291505  0.298354  0.190794  0.158766  0.713176   
1 2001-04-19  0.573504  0.285288  0.304174  0.206957  0.159088  0.691540   
2 2001-04-20  0.574017  0.281664  0.312384  0.219704  0.162637  0.686194   
3 2001-04-23  0.575973  0.280502  0.319873  0.223180  0.171424  0.659945   
4 2001-04-24  0.577554  0.286060  0.324303  0.226387  0.188635  0.685888   

    Pred_14    Pred_1  Actual_Close  
0  0.371754  0.470702      0.295198  
1  0.338006  0.537584      0.320568  
2  0.319723  0.530042      0.332260  
3  0.281302  0.522054      0.314722  
4  0.237170  0.428162      0.292447  
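
Because the merge uses an inner join, any date missing from even one lookback file is dropped from the result. The sketch below illustrates that behavior on toy tables with made-up values (the frame names and numbers are assumptions) and reports how many rows each source loses.

```python
from functools import reduce

import pandas as pd

# Toy prediction tables (made-up values) for three hypothetical lookbacks
pred_365 = pd.DataFrame({"Date": ["2001-04-18", "2001-04-19", "2001-04-20"],
                         "Pred_365": [0.57, 0.58, 0.59]})
pred_30 = pd.DataFrame({"Date": ["2001-04-19", "2001-04-20", "2001-04-23"],
                        "Pred_30": [0.69, 0.68, 0.66]})
pred_1 = pd.DataFrame({"Date": ["2001-04-18", "2001-04-19", "2001-04-20"],
                       "Pred_1": [0.47, 0.54, 0.53]})

frames = [pred_365, pred_30, pred_1]

# Inner-join all frames on Date; only dates present in every frame survive
merged = reduce(
    lambda left, right: pd.merge(left, right, on="Date", how="inner"),
    frames,
)
merged["Date"] = pd.to_datetime(merged["Date"])

# Report how many rows each source lost in the join
for frame in frames:
    dropped = len(frame) - len(merged)
    print(f"{frame.columns[1]}: {len(frame)} rows in, {dropped} dropped")

print(merged)
```

Comparing the row count of each input with the merged shape is a quick way to notice when one lookback file covers a much shorter date range than the others.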


## Prepare training features and target for the meta-model
<font color="#12A80D"> <b>• Selects all columns in the merged DataFrame whose names start with <code>Pred_</code> as meta-model features</br>• Stores these prediction columns in <code>X</code> as the feature matrix</br>• Assigns the <code>Actual_Close</code> column to <code>y</code> as the target variable for supervised training</br>• This setup allows the meta-model to learn how to optimally combine predictions from multiple lookback models</b> </font>

In [None]:
# ---------------------------
# Prepare training data
# ---------------------------

# Select all columns that start with "Pred_" — these are model prediction features
feature_cols = [c for c in merged_df.columns if c.startswith("Pred_")]

# Feature matrix (X): prediction columns from all lookback models
X = merged_df[feature_cols].values

# Target vector (y): actual closing prices
y = merged_df["Actual_Close"].values

## Save the feature column order for inference
<font color="#12A80D"> <b>• Stores the list of feature column names (<code>feature_cols</code>) as a serialized <code>joblib</code> file in the meta-model save folder</br>• Ensures the target directory exists by creating it if necessary</br>• Preserves the exact feature ordering used during training so that future predictions use the same input structure</br>• Prints the save path for verification</b> </font>

In [None]:
# --------------------------
# Save feature column order for later use in prediction
# --------------------------

# Define path for saving the list of feature column names
feature_cols_path = os.path.join(META_MODEL_SAVE_FOLDER, "feature_cols.joblib")

# Ensure the target folder exists
os.makedirs(META_MODEL_SAVE_FOLDER, exist_ok=True)

# Save the feature column names using joblib for easy loading later
joblib.dump(feature_cols, feature_cols_path)

print(f"Feature columns saved to: {feature_cols_path}")

Feature columns saved to: /content/drive/My Drive/Nvidia_Stock_Market_History/Training/Meta_Model_Trained/feature_cols.joblib
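
The saved column list matters at inference time, when new base-model predictions may arrive in a different column order. A minimal sketch of how it might be used, assuming a temporary directory in place of Google Drive and made-up prediction values:

```python
import os
import tempfile

import joblib
import pandas as pd

# Column order as it would have been saved at training time (illustrative)
training_feature_cols = ["Pred_365", "Pred_30", "Pred_1"]

# Round-trip the list through joblib, using a temp folder instead of Drive
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "feature_cols.joblib")
    joblib.dump(training_feature_cols, path)
    loaded_cols = joblib.load(path)

# New predictions arriving with columns in a different order (made-up values)
new_preds = pd.DataFrame({"Pred_1": [0.47], "Pred_365": [0.57], "Pred_30": [0.71]})

# Fail loudly if a column is missing, then reorder to match training
missing = [c for c in loaded_cols if c not in new_preds.columns]
if missing:
    raise KeyError(f"Missing feature columns: {missing}")
X_new = new_preds[loaded_cols].values
print(X_new)
```

Indexing with the loaded list both reorders the columns and drops any extra ones, so the array passed to the model has the same layout as the training matrix.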


## Train the meta-model (Ridge regression)
<font color="#12A80D"> <b>• Initializes a <code>Ridge</code> regression model with <code>alpha=1.0</code>, which applies L2 regularization to prevent overfitting while combining predictions from multiple lookback models</br>• Fits the meta-model on the merged feature set (<code>X</code>) containing predictions from all base models as inputs, and the actual closing prices (<code>y</code>) as targets</br>• This meta-model effectively learns optimal weights for blending each lookback model’s predictions into a single, more accurate ensemble output</b> </font>

In [None]:
# ---------------------------
# Train Ridge regression
# ---------------------------


# Initialize Ridge regression model with L2 regularization strength alpha=1.0
meta_model = Ridge(alpha=1.0)

# Fit the model on stacked predictions (X) and actual closing prices (y)
meta_model.fit(X, y)
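
To make the "learned weights" concrete, the sketch below fits `Ridge` on synthetic data (not the notebook's data) and reproduces its coefficients with the closed-form ridge solution. It assumes the default `fit_intercept=True`, in which case the intercept is not penalized and the weights are solved on centered data.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic stand-in for stacked base-model predictions (an assumption)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([0.5, 0.3, 0.2]) + 1.0 + rng.normal(scale=0.1, size=200)

alpha = 1.0
model = Ridge(alpha=alpha).fit(X_demo, y_demo)

# Closed form: center the data (the intercept is not penalized),
# then solve (Xc^T Xc + alpha * I) w = Xc^T yc
x_mean, y_mean = X_demo.mean(axis=0), y_demo.mean()
Xc, yc = X_demo - x_mean, y_demo - y_mean
w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X_demo.shape[1]), Xc.T @ yc)
b = y_mean - x_mean @ w

print("sklearn coef:", model.coef_, "intercept:", model.intercept_)
print("closed form :", w, "intercept:", b)
```

In the notebook itself, `meta_model.coef_` paired with `feature_cols` would show the weight assigned to each lookback model.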

## Evaluate the meta-model performance
<font color="#12A80D"> <b>• Uses the trained <code>Ridge</code> regression model to generate predictions (<code>y_pred</code>) on the same data it was trained on</br>• Computes evaluation metrics to assess how well the meta-model fits the ensemble prediction task</b> </br></br>
Metrics computed:</br>
• <code>R² Score</code>: Proportion of variance in the actual closing prices explained by the meta-model’s predictions.</br>
• <code>Mean Absolute Error (MAE)</code>: Average absolute difference between predictions and actual values.</br>
• <code>Mean Squared Error (MSE)</code>: Average squared prediction error, which penalizes larger mistakes more heavily.</br>
• <code>Root Mean Squared Error (RMSE)</code>: Square root of MSE, which puts the error back on the same scale as the target (closing price).</br></br>
Purpose:</br>
This evaluation checks how closely the single weighted prediction fits the actual closing prices. Because it is computed on the training data and no per-lookback metrics are calculated here, it does not by itself show an improvement over the individual lookback models.
</font>

In [None]:
# ---------------------------
# Evaluate meta-model performance
# ---------------------------

# Predict using the trained meta-model
y_pred = meta_model.predict(X)

# Calculate performance metrics
r2 = r2_score(y, y_pred)                       # Coefficient of determination
mae = mean_absolute_error(y, y_pred)           # Mean Absolute Error
mse = mean_squared_error(y, y_pred)            # Mean Squared Error
rmse = np.sqrt(mse)                            # Root Mean Squared Error (manual calc)

# Print results
print("\nMeta-model training complete.")
print(f"R2 Score: {r2:.4f}")
print(f"Mean Absolute Error: {mae:.4f}")
print(f"Mean Squared Error: {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")



Meta-model training complete.
R2 Score: 0.9961
Mean Absolute Error: 0.7166
Mean Squared Error: 3.5630
Root Mean Squared Error: 1.8876
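
One way to put the meta-model's numbers next to those of the individual lookback models is to score each `Pred_` column directly against `Actual_Close`. The sketch below does this on synthetic data; `demo_df`, the noise levels, and the `score` helper are assumptions for illustration, and no results from the real data are implied.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(42)

# Synthetic stand-in for merged_df: a random-walk "price" plus three noisy
# base predictions (all values are made up)
n = 500
actual = 50 + np.cumsum(rng.normal(size=n))
demo_df = pd.DataFrame({
    "Pred_365": actual + rng.normal(scale=3.0, size=n),
    "Pred_30": actual + rng.normal(scale=2.0, size=n),
    "Pred_1": actual + rng.normal(scale=1.0, size=n),
    "Actual_Close": actual,
})

def score(y_true, y_hat):
    """Return the same metrics the notebook prints, as a dict."""
    mse = mean_squared_error(y_true, y_hat)
    return {"R2": r2_score(y_true, y_hat),
            "MAE": mean_absolute_error(y_true, y_hat),
            "RMSE": float(np.sqrt(mse))}

feature_cols_demo = [c for c in demo_df.columns if c.startswith("Pred_")]
y_true = demo_df["Actual_Close"].values

# One row per base model, scored as if it were used on its own
rows = {col: score(y_true, demo_df[col].values) for col in feature_cols_demo}

# Meta-model row (in-sample, like the evaluation in this notebook)
ridge = Ridge(alpha=1.0).fit(demo_df[feature_cols_demo].values, y_true)
rows["Meta (Ridge)"] = score(y_true, ridge.predict(demo_df[feature_cols_demo].values))

comparison = pd.DataFrame(rows).T.sort_values("RMSE")
print(comparison.round(4))
```

With the real `merged_df`, the same table would show whether the blended prediction actually beats the best single lookback model, at least in-sample.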


## Metrics Explained
<font color="#12A80D">
1. R² Score: 0.9961</br>
Interpretation:</br>
• The meta-model explains 99.61% of the variance in the actual NVIDIA closing prices in the training data, using the predictions from the different lookback models as inputs.</br>
Implication:</br>
• This is a very high in-sample goodness of fit: the weighted combination of the base models tracks almost all of the variation in the training data.</br></br>

2. Mean Absolute Error (MAE): 0.7166</br>
Interpretation:</br>
• On average, predictions are about $0.72 away from the actual closing price.</br>
Implication:</br>
• Whether this is small depends on the price level: the closing prices in the preview above are below $1 in 2001, so a $0.72 error is large for the early part of the series and small only where prices are much higher.</br></br>

3. Mean Squared Error (MSE): 3.5630</br>
Interpretation:</br>
• The average of the squared prediction errors is 3.56; squaring penalizes larger errors more than smaller ones.</br>
Implication:</br>
• MSE is about seven times MAE² (0.7166² ≈ 0.51), which suggests that a relatively small number of large errors accounts for much of the squared error.</br></br>

4. Root Mean Squared Error (RMSE): 1.8876</br>
Interpretation:</br>
• The typical error magnitude, weighted toward larger errors, is about $1.89 on the same scale as the stock price.</br>
Implication:</br>
• RMSE is never smaller than MAE, so the informative quantity is the ratio: here RMSE is about 2.6 times MAE, which points to occasional large errors alongside many small ones.</br></br>

Overall Conclusion</br>
• On the training data, this Ridge regression meta-model blends the predictions from multiple lookback windows into a single forecast that fits closely.</br>
The metrics indicate:</br>
• Very strong in-sample fit (R² close to 1.0)</br>
• Low average deviation (< $2)</br>
• Occasional larger errors (RMSE well above MAE)</br>
• Because the evaluation set is the training set, these numbers show only that the model represents the training data well. The next step is to validate on unseen data to confirm that it generalizes.
</font>
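
Validation on unseen data for a time series is usually done with a chronological split rather than a random one, so that the model is never scored on dates earlier than those it was fitted on. A minimal sketch on synthetic, date-ordered data follows; the 80/20 split ratio and all variable names are assumptions, not choices made in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(7)

# Synthetic, date-ordered stand-in for X and y (not the notebook's data)
n = 1000
target = 20 + np.cumsum(rng.normal(size=n))
features = np.column_stack(
    [target + rng.normal(scale=s, size=n) for s in (3.0, 2.0, 1.0)]
)

# Chronological split: first 80% for fitting, last 20% held out (no shuffling)
split = int(n * 0.8)
X_train, X_test = features[:split], features[split:]
y_train, y_test = target[:split], target[split:]

holdout_model = Ridge(alpha=1.0).fit(X_train, y_train)

# Report the same metrics on both parts so the gap is visible
for label, X_part, y_part in [("train", X_train, y_train), ("holdout", X_test, y_test)]:
    pred = holdout_model.predict(X_part)
    rmse = np.sqrt(mean_squared_error(y_part, pred))
    print(f"{label:8s} R2={r2_score(y_part, pred):.4f} "
          f"MAE={mean_absolute_error(y_part, pred):.4f} RMSE={rmse:.4f}")
```

A large gap between the train and holdout rows would indicate that the in-sample metrics overstate real performance. If the base models were themselves trained on the same dates, their predictions for those dates are also in-sample, which is a separate source of optimism to keep in mind.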

## Save trained meta-model
<font color="#12A80D"> <b>• Creates the target directory if it does not exist, then serializes the fitted Ridge regression model to a .joblib file named meta_model_ridge.joblib in the META_MODEL_SAVE_FOLDER path</br>• Allows quick reloading later without retraining</br>• Ensures that ensemble prediction workflows can reuse the trained model in production or evaluation scripts</b> </font>

In [None]:
# ---------------------------
# Save model
# ---------------------------

# Ensure the meta-model save folder exists
os.makedirs(META_MODEL_SAVE_FOLDER, exist_ok=True)

# Define full path for saving the trained Ridge meta-model
meta_model_path = os.path.join(META_MODEL_SAVE_FOLDER, "meta_model_ridge.joblib")

# Save the model using joblib for efficient serialization
joblib.dump(meta_model, meta_model_path)

print(f"\nMeta-model saved to: {meta_model_path}")


Meta-model saved to: /content/drive/My Drive/Nvidia_Stock_Market_History/Training/Meta_Model_Trained/meta_model_ridge.joblib
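
To illustrate the reload step that the saved file is meant to support, the sketch below saves a small stand-in model to a temporary directory, loads it back, and checks that both copies give the same prediction. The data, the temporary path, and the input row are assumptions; the real workflow would load from `META_MODEL_SAVE_FOLDER` instead.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)

# Fit a small stand-in model on synthetic data (not the notebook's model)
X_fit = rng.normal(size=(100, 3))
y_fit = X_fit @ np.array([0.2, 0.3, 0.5]) + rng.normal(scale=0.05, size=100)
fitted = Ridge(alpha=1.0).fit(X_fit, y_fit)

# Save and reload through joblib, using a temp folder instead of Drive
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "meta_model_ridge.joblib")
    joblib.dump(fitted, model_path)
    reloaded = joblib.load(model_path)

# One new row of base-model predictions, in the same column order as training
x_new = np.array([[0.57, 0.71, 0.47]])
print("original :", fitted.predict(x_new))
print("reloaded :", reloaded.predict(x_new))
```

Loading a joblib file in an environment with a different scikit-learn version than the one used for saving can produce warnings or errors, so pinning the version at inference time is a reasonable precaution.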
