<a href="https://colab.research.google.com/github/CrisMcode111/DI_Bootcamp/blob/main/w5_d4_stock_market.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Daily Challenge: Stock Price Prediction with LSTM


*  What you’ll learn
How to preprocess and prepare time-series data for machine learning models.
How to build and train an LSTM (Long Short-Term Memory) model using PyTorch.
How to evaluate the performance of a regression model using metrics like R².

*  What you will create
A preprocessed dataset for stock price prediction.
A trained LSTM model to predict future stock prices.


* Understanding PyTorch
PyTorch is an open-source machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing. It’s known for its flexibility and ease of use, making it popular for both research and production.

Key PyTorch Functions You’ll Use:

torch.nn.Module: Base class for all neural network modules. You’ll use this to define your LSTM model.
torch.nn.LSTM: Implements a Long Short-Term Memory (LSTM) network.
torch.nn.Linear: Applies a linear transformation to the incoming data (i.e., a fully connected layer).
torch.nn.Dropout: Applies dropout regularization to prevent overfitting.
torch.optim.Adam: Implements the Adam optimization algorithm.
torch.nn.MSELoss: Implements the Mean Squared Error loss function.
torch.utils.data.Dataset: An abstract class representing a dataset.
torch.utils.data.DataLoader: Combines a dataset and a sampler, and provides single- or multi-process iterators over the dataset.
torch.Tensor: A multi-dimensional matrix containing elements of a single data type.
torch.save and torch.load: Save and load trained models (or their state dictionaries).
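
To make the list above concrete, here is a minimal sketch of how tensor shapes flow through nn.LSTM and nn.Linear. The sizes are hypothetical and chosen only for illustration; they are not the ones used later in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes, chosen only for illustration.
batch, seq_len, n_features, hidden = 4, 10, 1, 8

lstm = nn.LSTM(input_size=n_features, hidden_size=hidden, batch_first=True)
head = nn.Linear(hidden, 1)

x = torch.randn(batch, seq_len, n_features)   # (batch, time, features)
out, (h_n, c_n) = lstm(x)                     # out holds the hidden state at every time step
last_step = out[:, -1, :]                     # hidden state at the final time step
y_hat = head(last_step)                       # one predicted value per sequence

print(out.shape, h_n.shape, y_hat.shape)
```

With batch_first=True the input and output are laid out as (batch, time, features), which is why the last time step is selected with out[:, -1, :].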



1. Install Required Libraries

Ensure the necessary libraries are installed. The later steps rely on torch, scikit-learn, pandas, and numpy; gensim and spaCy are also installed by the cell below but are not used afterwards.

In [1]:
!pip -q install --upgrade scikit-learn gensim spacy
!python -m spacy download en_core_web_sm

import torch, sklearn, gensim, spacy, pandas as pd, numpy as np, matplotlib
print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("scikit-learn:", sklearn.__version__)
print("gensim:", gensim.__version__)
print("spaCy:", spacy.__version__)


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Torch: 2.8.0+cu126
CUDA available: False
scikit-learn: 1.7.2
gensim: 4.4.0
spaCy: 3.8.7


In [6]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [14]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("jacksoncrow/stock-market-dataset")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'stock-market-dataset' dataset.
Path to dataset files: /kaggle/input/stock-market-dataset


In [15]:
!ls /kaggle/input/stock-market-dataset


etfs  stocks  symbols_valid_meta.csv


2. Load and Preprocess the Dataset

Download the stock market dataset.
Drop unnecessary columns and create a target column for the next day’s closing price.
Normalize the dataset using MinMaxScaler.
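
The "next day's closing price" target can be illustrated on a toy frame before applying it to real data. This is a small sketch with made-up prices; only the shift(-1) and dropna() pattern mirrors the step described above.

```python
import pandas as pd

# Hypothetical closing prices, not taken from the dataset.
toy = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0]})

# shift(-1) pulls each row's value from the row below it,
# so Target on day t is the close on day t+1.
toy["Target"] = toy["Close"].shift(-1)

# The final row has no following day, so its Target is NaN and is dropped.
toy = toy.dropna()
print(toy)
```

One row is always lost at the end of the series, because the last day has no "next day" to predict.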

In [17]:
!ls /kaggle/input/stock-market-dataset/stocks | head


AACG.csv
AA.csv
AAL.csv
AAMC.csv
AAME.csv
AAN.csv
AAOI.csv
AAON.csv
AAP.csv
AAPL.csv


In [18]:
# 2. Load and Preprocess the Dataset
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load one stock file (AAPL)
df = pd.read_csv("/kaggle/input/stock-market-dataset/stocks/AAPL.csv")

# Keep only the useful columns
df = df[['Date', 'Close']]

# Create the target column for the next day’s closing price
df['Target'] = df['Close'].shift(-1)
df = df.dropna()

# Normalize with MinMaxScaler (note: fit here on the full series, before any train/test split)
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(df[['Close', 'Target']])
df_scaled = pd.DataFrame(scaled, columns=['Close', 'Target'])

df_scaled.head()


      Close    Target
0  0.000969  0.000887
1  0.000887  0.000778
2  0.000778  0.000812
3  0.000812  0.000853
4  0.000853  0.000942
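
The cell above fits the scaler on the whole series, so the minimum and maximum it learns include later periods that end up in the validation and test sets. A common alternative, sketched here on a hypothetical trending series and assuming an 80% chronological training cut, is to fit the scaler on the training rows only and then apply it to the rest.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical upward-trending series standing in for closing prices.
prices = np.linspace(1.0, 100.0, 100).reshape(-1, 1)

train_end = int(0.8 * len(prices))          # chronological cut, assumed 80%
scaler_train = MinMaxScaler(feature_range=(0, 1))
scaler_train.fit(prices[:train_end])        # statistics come from training rows only

train_scaled = scaler_train.transform(prices[:train_end])
later_scaled = scaler_train.transform(prices[train_end:])

print(train_scaled.min(), train_scaled.max())   # training rows span the feature range
print(later_scaled.min(), later_scaled.max())   # later rows fall outside it
```

With a trending series, later values land above 1 after scaling. That is expected with this approach and makes the extrapolation problem visible instead of hiding it inside the scaler.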


3. Prepare the Dataset for Training

Split the dataset into training, validation, and testing sets.
Create a custom PyTorch Dataset class to handle the data.
Use DataLoader to create iterable datasets for training and evaluation.

In [19]:
# 3. Prepare the Dataset for Training
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

# --- Params ---
LOOKBACK = 60      # days of history per sequence
BATCH_SIZE = 64

# df_scaled must already exist from Step 2, with columns ['Close', 'Target']

def make_sequences(df_scaled, lookback=60):
    close = df_scaled['Close'].values.astype(np.float32)
    target = df_scaled['Target'].values.astype(np.float32)
    X, y = [], []
    for i in range(lookback, len(close)):
        # window: the last `lookback` closing values
        X.append(close[i-lookback:i])
        # target: the Target of the window's last day (i.e., the next day's close)
        y.append(target[i-1])
    X = np.array(X)[:, :, None]  # (num_samples, lookback, 1 feature)
    y = np.array(y)[:, None]     # (num_samples, 1)
    return X, y

X_all, y_all = make_sequences(df_scaled, LOOKBACK)

# --- Chronological split: 80% train, 10% val, 10% test ---
N = len(X_all)
train_end = int(0.8 * N)
val_end   = int(0.9 * N)

X_train, y_train = X_all[:train_end], y_all[:train_end]
X_val,   y_val   = X_all[train_end:val_end], y_all[train_end:val_end]
X_test,  y_test  = X_all[val_end:], y_all[val_end:]

class StockWindowDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.from_numpy(X).float()
        self.y = torch.from_numpy(y).float()
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

train_ds = StockWindowDataset(X_train, y_train)
val_ds   = StockWindowDataset(X_val,   y_val)
test_ds  = StockWindowDataset(X_test,  y_test)

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=False)
val_loader   = DataLoader(val_ds,   batch_size=BATCH_SIZE, shuffle=False)
test_loader  = DataLoader(test_ds,  batch_size=BATCH_SIZE, shuffle=False)

len(train_ds), len(val_ds), len(test_ds)


(7878, 985, 985)
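
The windowing loop in the cell above can also be written without an explicit Python loop. The sketch below is one possible vectorized variant, assuming NumPy 1.20 or newer for sliding_window_view; the function names and the toy series are hypothetical, and the loop version is repeated only to check that both produce the same arrays.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_sequences_loop(close, target, lookback):
    # Same indexing as the loop in the cell above.
    X, y = [], []
    for i in range(lookback, len(close)):
        X.append(close[i - lookback:i])
        y.append(target[i - 1])
    return np.array(X)[:, :, None], np.array(y)[:, None]

def make_sequences_vectorized(close, target, lookback):
    # One window per start index; the final window is dropped to match the loop.
    windows = sliding_window_view(close, lookback)[:-1]
    X = windows[:, :, None].copy()           # copy: the view itself is read-only
    y = target[lookback - 1:-1][:, None]     # Target of each window's last day
    return X, y

# Hypothetical toy series: target is the next value of close.
close = np.arange(10, dtype=np.float32)
target = close + 1
X_loop, y_loop = make_sequences_loop(close, target, lookback=3)
X_vec, y_vec = make_sequences_vectorized(close, target, lookback=3)

print(X_vec.shape, y_vec.shape)
print(np.array_equal(X_loop, X_vec), np.array_equal(y_loop, y_vec))
```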

4. Define the LSTM Model

Create an LSTM model using PyTorch.
Define the model architecture, including LSTM layers, dropout, and a dense layer.

In [21]:
# 4. Define the LSTM Model
import torch
import torch.nn as nn

class StockPriceLSTM(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, dropout=0.2):
        super(StockPriceLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # LSTM layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                            batch_first=True, dropout=dropout)

        # Dropout regularization
        self.dropout = nn.Dropout(dropout)

        # Fully connected (dense) output layer
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)

        out, _ = self.lstm(x, (h0, c0))
        out = self.dropout(out[:, -1, :])
        out = self.fc(out)
        return out
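
The forward method above builds zero-filled h0 and c0 explicitly. nn.LSTM also uses zeros for both states when none are passed, so the two calls should give the same output. The following sketch checks this on a hypothetical small LSTM; it is a sanity check, not a change to the model above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small LSTM; sizes are for illustration only.
lstm = nn.LSTM(input_size=1, hidden_size=16, num_layers=2,
               batch_first=True, dropout=0.2)
lstm.eval()  # turns off the inter-layer dropout so both calls are deterministic

x = torch.randn(8, 60, 1)
h0 = torch.zeros(2, 8, 16)   # (num_layers, batch, hidden_size)
c0 = torch.zeros(2, 8, 16)

with torch.no_grad():
    out_explicit, _ = lstm(x, (h0, c0))
    out_default, _ = lstm(x)          # initial states default to zeros

print(torch.allclose(out_explicit, out_default))
```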


5. Train the Model

Set up the optimizer and loss function.
Implement training and validation loops.
Train the model for a specified number of epochs.

In [22]:
# 5. Train the Model
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = StockPriceLSTM().to(device)

# Loss function and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

EPOCHS = 5

for epoch in range(EPOCHS):
    model.train()
    train_loss = 0.0
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)

        optimizer.zero_grad()
        preds = model(X_batch)
        loss = criterion(preds, y_batch)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

    # Validation
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for X_val_batch, y_val_batch in val_loader:
            X_val_batch, y_val_batch = X_val_batch.to(device), y_val_batch.to(device)
            val_preds = model(X_val_batch)
            v_loss = criterion(val_preds, y_val_batch)
            val_loss += v_loss.item()

    print(f"Epoch [{epoch+1}/{EPOCHS}] - Train Loss: {train_loss/len(train_loader):.6f}, Val Loss: {val_loss/len(val_loader):.6f}")

print("Training complete!")


Epoch [1/5] - Train Loss: 0.000404, Val Loss: 0.004056
Epoch [2/5] - Train Loss: 0.002245, Val Loss: 0.003880
Epoch [3/5] - Train Loss: 0.003423, Val Loss: 0.036346
Epoch [4/5] - Train Loss: 0.002926, Val Loss: 0.050371
Epoch [5/5] - Train Loss: 0.002603, Val Loss: 0.051012
Training complete!
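
In the log above, validation loss is lowest at epoch 2 and rises afterwards, so the weights left at the end of training are not the best ones seen. One common remedy is to keep a snapshot of the weights whenever validation loss improves. The sketch below shows only that bookkeeping, using a hypothetical linear model, random data, and an assumed file name; it is not the training run from this notebook.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the model and data; only the bookkeeping matters here.
model = nn.Linear(3, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
X_tr, y_tr = torch.randn(64, 3), torch.randn(64, 1)
X_va, y_va = torch.randn(32, 3), torch.randn(32, 1)

best_val = float("inf")
best_state = None

for epoch in range(5):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X_tr), y_tr)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_va), y_va).item()

    # Snapshot the weights whenever validation loss improves.
    if val_loss < best_val:
        best_val = val_loss
        best_state = copy.deepcopy(model.state_dict())

# Restore the best snapshot and persist it; the file name is an assumption.
model.load_state_dict(best_state)
torch.save(model.state_dict(), "best_model.pt")
```

The deep copy matters: state_dict() returns references to the live parameters, which keep changing as training continues.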


6. Evaluate the Model

Calculate the R² score to evaluate the model’s performance on the test set.
Save the scaler object for future predictions.

In [23]:
# 6. Evaluate the Model
from sklearn.metrics import r2_score
import joblib

model.eval()
y_true, y_pred = [], []

with torch.no_grad():
    for X_batch, y_batch in test_loader:
        X_batch = X_batch.to(device)
        preds = model(X_batch).cpu().numpy()
        y_pred.extend(preds.flatten())
        y_true.extend(y_batch.numpy().flatten())

# Compute R² score
r2 = r2_score(y_true, y_pred)
print(f"R² score on test set: {r2:.4f}")

# Save the scaler for future predictions
joblib.dump(scaler, "scaler.save")
print("✅ Scaler saved as 'scaler.save'")


R² score on test set: -8.3502
✅ Scaler saved as 'scaler.save'
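
A negative R² means the model's squared error on the test set is larger than that of a baseline that always predicts the mean of the test targets. The sketch below computes R² by hand on hypothetical values to show where the sign comes from, and then shows one way to map scaled predictions back to prices when the scaler was fit on two columns, [Close, Target], as in Step 2. All numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values, used only to illustrate the calculation.
y_true = np.array([0.80, 0.85, 0.90, 0.95])
y_pred = np.array([0.50, 0.52, 0.55, 0.56])   # systematically too low

ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of the mean baseline
r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))       # both negative here

# Mapping scaled predictions back to prices when the scaler was fit on
# two columns, [Close, Target]: use the statistics of column index 1.
raw = np.array([[10.0, 11.0], [11.0, 12.0], [12.0, 20.0]])
scaler = MinMaxScaler().fit(raw)
target_min, target_max = scaler.data_min_[1], scaler.data_max_[1]
scaled_pred = np.array([0.0, 0.5, 1.0])
price_pred = scaled_pred * (target_max - target_min) + target_min
print(price_pred)
```

Predictions that sit consistently below a rising test series, as in the toy values here, are one typical way to end up with a strongly negative score.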
