# Week 2 — EDA & Feature Engineering for Nvidia Stock Market History

<font color="#12A80D"><b>Course:</b> Advanced AI Forecasting with TensorFlow and NLP</font>  
<font color="#12A80D"><b>Module:</b> Time Series Foundations, Dataset Exploration, and Lookback Theory</font>  

<font color="#12A80D">
This notebook is designed for <b>Google Colab</b> but can also be run locally with minor adaptations.  
It prepares the NVIDIA stock dataset for modeling in Week 2.
</font>

<br><font color='#12A80D'>
<b>Running this notebook requires a Google account with access to Google Drive.</b>
</font></br>

---

## Learning Objectives
<font color="#12A80D">
<br>- Load and inspect the NVIDIA stock dataset (OHLCV).</br>
<br>- Perform basic exploratory data analysis (EDA) with visualizations.</br>
<br>- Engineer baseline features (scaling, moving averages, rolling volatility).</br>
<br>- Understand the **theory of lookback windows** (implementation next week).</br>
</font>

---



# Environment Setup

## Load Dependencies into the Colab Runtime Environment
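<font color="#12A80D"> <b>• Installs and upgrades the required Python packages in the Colab environment.<br>• The dependency-conflict messages that pip prints here concern packages preinstalled in Colab (for example <code>tensorflow-text</code> and <code>tf-keras</code>) that <code>NSMH_EDA_Feature_Engineering_Week2.ipynb</code> does not use, so they can be ignored; a failed install of one of the pinned packages cannot.</b> </font>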

In [None]:
# =====================================
# SETUP AND INSTALL DEPENDENCIES
# =====================================
!pip install --upgrade pip
!pip install --quiet ipywidgets
!pip install --upgrade numpy==2.0.2
!pip install --quiet tensorflow==2.18.0
!pip install --quiet pandas==2.2.2 matplotlib seaborn scikit-learn==1.6.1 tqdm

Collecting pip
  Downloading pip-25.2-py3-none-any.whl.metadata (4.7 kB)
Downloading pip-25.2-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 72.3 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-decision-forests 1.12.0 requires tensorflow==2.19.0, but you have tensorflow 2.18.0 which is incompatible.
tensorstore 0.1.76 requires ml_dtypes>=0.5.0, but you have ml-dtypes 0.4.1 which is incompatible.
tf-keras 2.19.0 requires tensorflow<2.20,>=2.19, but you have tensorflow 2.18.0 which is incompatible.
tensorflow-text 2.19.0 requires tensorflow<2.20,

## Import the Necessary Libraries and Modules
<font color="#12A80D"> <b>• Imports the libraries used for data handling, feature scaling, visualization, deep learning, date handling, and randomization.<br>• Prints the Python, library, and platform versions so that a run can be reproduced.</b> </font>

In [None]:
# Core libraries
import sys, platform                # System and platform information
import pandas as pd                 # Data manipulation and analysis
import numpy as np                  # Numerical computations
import matplotlib.pyplot as plt     # Data visualization
from matplotlib.ticker import FuncFormatter  # Custom axis formatting in plots
from sklearn.preprocessing import MinMaxScaler # Feature scaling
import tensorflow as tf             # Deep learning framework
from datetime import datetime       # Date and time handling
import random                       # Random number generation

# Display environment information for reproducibility
print("Python:", sys.version.split()[0])
print("Pandas:", pd.__version__)
print("NumPy:", np.__version__)
print("Matplotlib:", plt.matplotlib.__version__)
print("TensorFlow:", tf.__version__)
print("Platform:", platform.platform())

Python: 3.11.13
Pandas: 2.2.2
NumPy: 2.0.2
Matplotlib: 3.10.0
TensorFlow: 2.18.0
Platform: Linux-6.1.123+-x86_64-with-glibc2.35


## Mount Google Drive in the Colab notebook to access its contents
<font color="#12A80D"> <b>• Requires granting access to Google Drive</br>• Forces remounting even if already mounted</b> </font>

In [None]:
# =====================================
# MOUNT GOOGLE DRIVE
# =====================================
# Mount Google Drive in Google Colab
from google.colab import drive
drive.mount('/content/drive', force_remount=True)  # force_remount=True ensures a fresh mount

Mounted at /content/drive


## Create the project directory in Google Drive
<font color="#12A80D"> <b>• Requires Google Drive to be mounted before directory creation.</br>• Creates the directory if it does not exist; no error if it already exists.</b> </font>

In [None]:
# =====================================
# CREATE PROJECT DIRECTORY
# =====================================
import os

project_dir = '/content/drive/MyDrive/Nvidia_Stock_Market_History'
os.makedirs(project_dir, exist_ok=True)

print(f"Project directory created at: {project_dir}")

Project directory created at: /content/drive/MyDrive/Nvidia_Stock_Market_History
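
For a local run, which the introduction mentions as possible, the Drive path can be swapped for an ordinary folder. The sketch below shows one way this might be done. `in_colab` and `resolve_project_dir` are hypothetical helper names that are not part of the notebook, and defaulting the local root to the current working directory is an assumption.

```python
from pathlib import Path


def in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
        return True
    except ImportError:
        return False


def resolve_project_dir(folder="Nvidia_Stock_Market_History", local_root=None) -> Path:
    """Pick the Drive path in Colab, otherwise a folder under local_root."""
    if in_colab():
        # Assumes Google Drive is already mounted at /content/drive
        return Path("/content/drive/MyDrive") / folder
    root = Path(local_root) if local_root is not None else Path.cwd()
    return root / folder


project_dir_example = resolve_project_dir()
print(project_dir_example)
```

The returned path could then be passed to `os.makedirs(..., exist_ok=True)` exactly as in the cell above.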


## Upload the dataset to Google Drive (flexible file name)
<font color="#12A80D"> <b>Accepts a local file under any name and saves it as <code>Data/nvidia_stock_data.csv</code> inside the project directory.</b> </font>

In [None]:
# =====================================
# DATASET UPLOAD TO GOOGLE DRIVE (Flexible Name)
# =====================================
from google.colab import files
import shutil

# Define destination path in Google Drive
drive_dataset_path = f'{project_dir}/Data/nvidia_stock_data.csv'

# Ask user to upload the dataset
print("Please upload your dataset (nvidia_stock_data.csv) from your local machine...")
uploaded = files.upload()

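# Make sure the destination folder exists (the Data subfolder is not created earlier)
os.makedirs(os.path.dirname(drive_dataset_path), exist_ok=True)

# Copy the first uploaded file (regardless of its name)
uploaded_filename = list(uploaded.keys())[0]
shutil.copy(f'/content/{uploaded_filename}', drive_dataset_path)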

print(f"Dataset successfully copied to: {drive_dataset_path}")

Please upload your dataset (nvidia_stock_data.csv) from your local machine...


Saving Nvidia_stock_data.csv to Nvidia_stock_data.csv
Dataset successfully copied to: /content/drive/MyDrive/Nvidia_Stock_Market_History/nvidia_stock_data.csv
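
Because the upload cell accepts any file, it can help to confirm that the header contains the columns the later cells read (`Date`, `Open`, `High`, `Low`, `Close`, `Volume`) before copying. The following is a minimal sketch using only the standard library. `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, and the demo file is synthetic.

```python
import csv
import os
import tempfile

# Column names taken from the feature-engineering cell of this notebook
REQUIRED_COLUMNS = ["Date", "Open", "High", "Low", "Close", "Volume"]


def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    # utf-8-sig also reads plain UTF-8 and strips a byte-order mark if present
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Demo on a throwaway file whose header lacks the Volume column
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        handle.write("Date,Open,High,Low,Close\n2020-01-02,1,2,0.5,1.5\n")
    demo_missing = missing_columns(demo_path)

print(demo_missing)
```

In the upload cell, the check could run on `f'/content/{uploaded_filename}'` and stop with a clear message when the returned list is not empty.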


## List the contents of the project directory in Google Drive to verify successful upload of dataset
<font color="#12A80D"> <b>Displays file sizes in a human-readable format.</b> </font>

In [None]:
!ls -lh /content/drive/MyDrive/Nvidia_Stock_Market_History

total 619K
-rw------- 1 root root 615K Aug  8 04:31 nvidia_stock_data.csv
drwx------ 2 root root 4.0K Aug  8 04:33 Training


## Define directories and collect user input for configuration
<font color="#12A80D"> <b>• Prompts for lookback window, output folder name, and graph name.</br>• Creates a timestamped subfolder; no error if it already exists.</b> </font>

In [None]:
# =====================================
# DIRECTORIES AND USER INPUTS
# =====================================

# Base directories for dataset and training outputs
base_dir = '/content/drive/MyDrive/Nvidia_Stock_Market_History'
training_base_dir = f'{base_dir}/Training'

# Path to the dataset CSV file
dataset_path = f'{base_dir}/Data/nvidia_stock_data.csv'
print(f"Dataset: {dataset_path}")

# Collect user inputs (with defaults if Enter is pressed without typing)
lookback = int(input("Enter lookback window (e.g., 20): ") or 20)        # Number of past rows (trading days) per input sequence
base_name = input("Enter a base name for output folder: ") or "Nvidia_Stock"
graph_base_name = input("Enter a base name for graphs: ") or "NvidiaGraph"

# Create a timestamped subfolder for saving outputs
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
subfolder = os.path.join(training_base_dir, f"{base_name}_{timestamp}")
os.makedirs(subfolder, exist_ok=True)

print(f"Outputs will be saved in: {subfolder}")

Dataset: /content/drive/MyDrive/Nvidia_Stock_Market_History_test/nvidia_stock_data.csv
Enter lookback window (e.g., 20): 365
Enter a base name for output folder: Nvidia_Stock_Training_365_days_SA
Enter a base name for graphs: Nvidia_
Outputs will be saved in: /content/drive/MyDrive/Nvidia_Stock_Market_History_test/Training/Nvidia_Stock_Training_365_days_SA_2025-08-08_04-33-41
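
The bare `int(input(...) or 20)` call raises an unhelpful error on non-numeric text and accepts zero or negative values. A small validation helper is one possible safeguard. This is a sketch only: `parse_lookback` and its `n_rows` parameter are hypothetical and do not appear in the notebook.

```python
from typing import Optional


def parse_lookback(raw: str, default: int = 20, n_rows: Optional[int] = None) -> int:
    """Convert the typed lookback to a positive int, falling back to a default."""
    text = raw.strip()
    if not text:
        return default  # Enter pressed without typing
    try:
        value = int(text)
    except ValueError:
        raise ValueError(f"Lookback must be a whole number, got {text!r}") from None
    if value < 1:
        raise ValueError("Lookback must be at least 1")
    # The sequence loop yields len(data) - lookback samples, so the window
    # has to be strictly shorter than the number of available rows
    if n_rows is not None and value >= n_rows:
        raise ValueError(f"Lookback {value} leaves no sequences for {n_rows} rows")
    return value


print(parse_lookback(""), parse_lookback(" 365 "))
```

It could wrap the prompt as `parse_lookback(input("Enter lookback window (e.g., 20): "))`. The `n_rows` check only becomes usable once the dataset has been loaded.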


## Set random seeds for reproducibility
<font color="#12A80D"> <b>• Ensures consistent results across runs by fixing random seeds for Python, NumPy, and TensorFlow</br>• Enables deterministic TensorFlow operations for reproducible training</b> </font>

In [None]:
# =====================================
# Reproducibility Settings
# =====================================

# Fixed random seed value for reproducible results across runs
SEED = 42

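# Record the hash seed for any subprocesses started from this notebook.
# Note: the hash seed of the already-running interpreter is fixed at startup,
# so this assignment does not change hashing in the current session.
os.environ['PYTHONHASHSEED'] = str(SEED)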

# Set the seed for Python's built-in random module
random.seed(SEED)

# Set the seed for NumPy's random number generator
np.random.seed(SEED)

# Set the seed for TensorFlow's random number generator
tf.random.set_seed(SEED)

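# Ask TensorFlow to use deterministic op implementations where available
# (may slow down training; ops without a deterministic version raise an error)
os.environ['TF_DETERMINISTIC_OPS'] = '1'  # environment flag read by older TensorFlow releases
tf.config.experimental.enable_op_determinism()  # supported API in recent releases such as the pinned 2.18.0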
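
A quick way to see what seeding buys is to reseed and draw again. The sketch below does this for Python's `random` module and NumPy only. TensorFlow is left out to keep the example light, and `set_seeds` is a hypothetical helper name.

```python
import random

import numpy as np


def set_seeds(seed: int = 42) -> None:
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


set_seeds(42)
first = (random.random(), np.random.rand(3))

set_seeds(42)
second = (random.random(), np.random.rand(3))

# Both comparisons hold because the generators restart from the same state
print(first[0] == second[0], np.array_equal(first[1], second[1]))
```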

## Load the dataset and perform feature engineering for time series modeling
<font color="#12A80D"> <b>• Loads NVIDIA OHLCV data from CSV, converts dates, and sorts chronologically.</br>• Creates technical indicators (moving averages, volatility, returns, Bollinger Bands, ranges, lag values, and rolling rank) to help the model capture market trends, volatility patterns, momentum, and price relationships.</br>• Drops NaN rows from rolling calculations before modeling.</b> </font>

In [None]:
# =====================================
# LOAD AND FEATURE ENGINEER DATA
# =====================================

# Load the dataset from CSV file
df = pd.read_csv(dataset_path)

# Convert 'Date' column to datetime format for proper time-series handling
df['Date'] = pd.to_datetime(df['Date'])

# Sort data chronologically and reset the index (important for time-series models)
df = df.sort_values('Date').reset_index(drop=True)

# Feature Engineering from Jupyter Notebook

# Moving Averages: capture short-, medium-, and long-term trends
df['MA20'] = df['Close'].rolling(window=20).mean() # 20-day moving average
df['MA50'] = df['Close'].rolling(window=50).mean() # 50-day moving average
df['MA200'] = df['Close'].rolling(window=200).mean() # 200-day moving average

# Rolling Standard Deviation: measure volatility over short and medium terms
df['STD20'] = df['Close'].rolling(window=20).std() # 20-day volatility
df['STD50'] = df['Close'].rolling(window=50).std() # 50-day volatility

# Percentage Returns: short-, medium-, and long-term return changes
df['Return1'] = df['Close'].pct_change()  # 1-day return
df['Return5'] = df['Close'].pct_change(5) # 5-day return
df['Return20'] = df['Close'].pct_change(20)  # 20-day return

# Bollinger Bands: volatility-based upper and lower price bounds
df['Bollinger_Upper'] = df['MA20'] + 2 * df['STD20']
df['Bollinger_Lower'] = df['MA20'] - 2 * df['STD20']

# Daily range: difference between the high and low price of the day
df['Range'] = df['High'] - df['Low']

# Intraday change: difference between closing and opening price
df['Close_Open'] = df['Close'] - df['Open']

# Lag feature: previous day's closing price (helps capture autocorrelation)
df['Lag1'] = df['Close'].shift(1)

# Rolling percentile rank: how the latest close ranks within the last 20 days
df['Rank20'] = df['Close'].rolling(20).apply(lambda x: pd.Series(x).rank(pct=True).iloc[-1])

# List of selected feature columns for modeling
features = [
    'Close', 'Volume', 'Open', 'High', 'Low',
    'MA20', 'MA50', 'MA200', 'STD20', 'STD50',
    'Return1', 'Return5', 'Return20',
    'Bollinger_Upper', 'Bollinger_Lower',
    'Range', 'Close_Open', 'Lag1', 'Rank20'
]

# Remove rows with NaN values (caused by rolling/shift operations) and reset index
df = df.dropna().reset_index(drop=True)

# Print summary of features used and final dataset shape
print("Columns used as features:", features)
print("Data shape after dropping NaNs:", df.shape)

Columns used as features: ['Close', 'Volume', 'Open', 'High', 'Low', 'MA20', 'MA50', 'MA200', 'STD20', 'STD50', 'Return1', 'Return5', 'Return20', 'Bollinger_Upper', 'Bollinger_Lower', 'Range', 'Close_Open', 'Lag1', 'Rank20']
Data shape after dropping NaNs: (6475, 20)
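
Rolling and shift features have no value until a full window is available, which is why rows are dropped: the 200-day moving average alone leaves the first 199 rows empty. The toy example below reproduces a few of the features on a six-row synthetic series with a 3-row window (both are assumptions chosen for readability) so the NaN pattern is easy to inspect.

```python
import pandas as pd

# Synthetic closing prices, not taken from the NVIDIA dataset
close = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0, 14.0], name="Close")
window = 3

toy = pd.DataFrame({"Close": close})
toy["MA"] = close.rolling(window).mean()     # NaN for the first window - 1 rows
toy["STD"] = close.rolling(window).std()     # sample standard deviation
toy["Upper"] = toy["MA"] + 2 * toy["STD"]    # Bollinger-style upper band
toy["Lower"] = toy["MA"] - 2 * toy["STD"]    # Bollinger-style lower band
toy["Return1"] = close.pct_change()          # NaN for the first row
toy["Lag1"] = close.shift(1)                 # NaN for the first row

print(toy)
print("Rows lost to NaNs:", len(toy) - len(toy.dropna()))
```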


## Scale feature columns using Min-Max normalization
<font color="#12A80D"> <b>• Scales each feature to the [0, 1] range using `MinMaxScaler`.</br>• Stores a separate scaler for each column for later inverse transformations.</b> </font>

In [None]:
# =====================================
# Feature Scaling
# =====================================

# Dictionary to store the fitted scaler for each feature (for inverse transforms later)
scalers = {}

# List to temporarily hold the scaled arrays for each feature
scaled_columns = []

# Loop through each feature column and scale individually
for col in features:
    scaler = MinMaxScaler() # Initialize Min-Max Scaler (scales data to [0, 1])
    scaled = scaler.fit_transform(df[[col]])  # Fit on the column and transform it
    scalers[col] = scaler # Save scaler for this column
    scaled_columns.append(scaled) # Append scaled column to list

# Combine all scaled columns horizontally into one NumPy array
# Result: rows = samples, columns = features, all values scaled to [0, 1]
scaled_data = np.hstack(scaled_columns)
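
The loop above fits every scaler on all rows, including rows that later end up in the validation and test sets, so their minima and maxima influence how the training inputs are scaled. A common alternative is to fit on the training rows only. The sketch below illustrates that idea under stated assumptions: `fit_scalers_on_train` is a hypothetical helper, the 0.7 fraction only approximates the later sequence-level split (which is offset by the lookback), and the data are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def fit_scalers_on_train(frame, columns, train_fraction=0.7):
    """Fit one MinMaxScaler per column on the leading rows, then scale all rows."""
    n_train = int(train_fraction * len(frame))
    scalers, scaled_columns = {}, []
    for col in columns:
        scaler = MinMaxScaler()
        scaler.fit(frame[[col]].iloc[:n_train])                # statistics from training rows only
        scaled_columns.append(scaler.transform(frame[[col]]))  # later rows may fall outside [0, 1]
        scalers[col] = scaler
    return np.hstack(scaled_columns), scalers


# Synthetic trending data standing in for the real feature table
demo = pd.DataFrame({
    "Close": np.arange(1.0, 11.0),
    "Volume": np.arange(100.0, 0.0, -10.0),
})
demo_scaled, demo_scalers = fit_scalers_on_train(demo, ["Close", "Volume"])
print(demo_scaled[:, 0])
```

With this approach, values outside [0, 1] in the later rows are expected for a series that trends beyond its training range. The stored scalers still invert them correctly.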

## Create input-output sequences for time series modeling
<font color="#12A80D"> <b>• Generates sequences of length `lookback` as model inputs (`X`) and the next time step’s target value (`y`).</br>• Uses the first feature column as the prediction target.</b> </font>

In [None]:
# =====================================
# Create sequences for time-series modeling
# =====================================

# X will hold the input sequences, y will hold the target values
X, y = [], []

# Loop through the dataset starting from 'lookback' index
# This ensures each sequence contains exactly 'lookback' timesteps of data
for i in range(lookback, len(scaled_data)):
    # Append the past 'lookback' rows (all features) as one training example
    X.append(scaled_data[i-lookback:i])

    # Append the target value: the 'Close' price at the current time step
    # Assumes the first column (index 0) in scaled_data corresponds to 'Close'
    y.append(scaled_data[i, 0])

# Convert lists to NumPy arrays for model input
# Shape of X: (num_samples, lookback, num_features)
# Shape of y: (num_samples,)
X, y = np.array(X), np.array(y)
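
The same windows can be built without a Python loop. The sketch below uses NumPy's `sliding_window_view` as one possible vectorized equivalent. `make_sequences` and its `target_col` parameter are hypothetical names, and the demo array is synthetic.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_sequences(data, lookback, target_col=0):
    """Build (X, y) with X[j] = data[j:j+lookback] and y[j] = data[j+lookback, target_col]."""
    if not 0 < lookback < len(data):
        raise ValueError("lookback must be between 1 and len(data) - 1")
    # Windows along the time axis have shape (n - lookback + 1, n_features, lookback)
    windows = sliding_window_view(data, window_shape=lookback, axis=0)
    # Drop the final window (no row follows it) and put the time axis before features
    X = windows[:-1].transpose(0, 2, 1).copy()
    y = data[lookback:, target_col].copy()
    return X, y


demo = np.arange(12, dtype=float).reshape(6, 2)  # 6 time steps, 2 features
X_demo, y_demo = make_sequences(demo, lookback=3)
print(X_demo.shape, y_demo.shape)
```

The `.copy()` calls turn the read-only views into ordinary arrays, at the cost of the same memory the loop version uses.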

## Split the dataset into training, validation, and test sets
<font color="#12A80D"> <b>Uses 70% of the sequences for training, 20% for validation, and 10% for testing based on index ranges.</b> </font>

In [None]:
# =====================================
# Split dataset into train, validation, and test sets
# =====================================

# Calculate split indices
split_1 = int(0.7 * len(X)) # First 70% for training
split_2 = int(0.9 * len(X))  # Next 20% for validation, final 10% for testing

# Training set: model learns patterns from this subset
X_train, y_train = X[:split_1], y[:split_1]

# Validation set: used to tune hyperparameters and monitor overfitting
X_val, y_val = X[split_1:split_2], y[split_1:split_2]

# Test set: completely unseen data for final performance evaluation
X_test, y_test = X[split_2:], y[split_2:]
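
The index-based split keeps the samples in time order, so every validation and test target lies after the training targets. The helper below wraps the same arithmetic and checks that ordering on synthetic data. `chronological_split` is a hypothetical name, and its cut points default to the 0.7 and 0.9 used above.

```python
import numpy as np


def chronological_split(X, y, train_end=0.7, val_end=0.9):
    """Split aligned arrays into train/val/test by position, without shuffling."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of samples")
    split_1 = int(train_end * len(X))
    split_2 = int(val_end * len(X))
    train = (X[:split_1], y[:split_1])
    val = (X[split_1:split_2], y[split_1:split_2])
    test = (X[split_2:], y[split_2:])
    return train, val, test


# Synthetic samples whose target equals their position in time
X_demo = np.arange(10, dtype=float).reshape(10, 1, 1)
y_demo = np.arange(10, dtype=float)
(train_X, train_y), (val_X, val_y), (test_X, test_y) = chronological_split(X_demo, y_demo)
print(len(train_y), len(val_y), len(test_y))
```

One caveat for lookbacks greater than 1: because consecutive windows overlap, the input windows of the first validation and test samples still contain rows that also appear in the preceding subset.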