## Лабораторна робота 2 з ІСППР (варіант 26(1))
### Виконали студенти групи КІ-31мп Шабо О.А. та Сотник Д.C.

In [42]:
import torch
import random
import gc
import os

import torch.nn as nn
import torch.optim as optim

import pandas as pd
import numpy as np
import itertools as it
import matplotlib.pyplot as plt
import itertools as it

from torch.utils.data import DataLoader, TensorDataset

from tqdm.notebook import tqdm
from sklearn.preprocessing import StandardScaler

In [43]:
if torch.cuda.is_available():
    print("PyTorch GPU is available")
else:
    print("PyTorch GPU is not available")
DEVICE = "cuda"

PyTorch GPU is available


In [44]:
RANDOM_SEED = 10

# Seed the RNG for all devices (both CPU and CUDA)
torch.manual_seed(RANDOM_SEED)
# Set python seed
random.seed(RANDOM_SEED)
# Set numpy seed
np.random.seed(RANDOM_SEED)

torch.cuda.manual_seed_all(RANDOM_SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
os.environ['PYTHONHASHSEED'] = str(RANDOM_SEED)

# Worker initialization function for data loaders (simplest approach)
def seed_worker(worker_id):
    worker_seed = (torch.initial_seed() + worker_id) % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g_train = torch.Generator().manual_seed(RANDOM_SEED)

In [45]:
df = pd.read_csv('train.csv')

df.head()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1278 entries, 0 to 1277
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    1278 non-null   object 
 1   Open    1278 non-null   float64
 2   High    1278 non-null   float64
 3   Low     1278 non-null   float64
 4   Close   1278 non-null   object 
 5   Volume  1278 non-null   object 
dtypes: float64(3), object(3)
memory usage: 60.0+ KB


In [47]:
df['Close'] = pd.to_numeric(df['Close'].str.replace(',', ''), errors='coerce')
df['Volume'] = pd.to_numeric(df['Volume'].str.replace(',', ''), errors='coerce')

# Convert 'Date' to datetime (optional for modeling, but useful for analysis)
df['Date'] = pd.to_datetime(df['Date'])

# Select features and target for modeling
features = df[['Open', 'High', 'Low', 'Volume']]
target = df['Close']

# Normalize features
scaler = StandardScaler()
features_normalized = scaler.fit_transform(features)

# Split the data
split_index = int(len(df) * 0.75)
X_train = features_normalized[:split_index]
y_train = target.values[:split_index]
X_test = features_normalized[split_index:]
y_test = target.values[split_index:]

# Convert to tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)  # Reshaping for a single target
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)

In [48]:
X_train_tensor.shape

torch.Size([958, 4])

In [49]:
y_train_tensor.shape

torch.Size([958, 1])

In [51]:
class RNNModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, seq_length):
        super(RNNModel, self).__init__()
        self.rnn = nn.RNN(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.seq_length = seq_length

    def forward(self, x):
        if x.dim() == 2:  # If there are only 2 dimensions (batch_size, features)
            x = x.unsqueeze(1)  # Add a seq_length dimension
        out, _ = self.rnn(x)
        out = self.fc(out[:, -1, :])
        return out

    def fit(self, X_train, y_train, X_test, y_test, learning_rate=0.01,
            num_epochs=100, batch_size=32, verbose=True, optimizer_class=optim.Adam,
            loss_fn=nn.MSELoss):
        optimizer = optimizer_class(self.parameters(), lr=learning_rate)
        criterion = loss_fn()

        # Convert to tensor datasets
        train_dataset = TensorDataset(X_train, y_train)
        test_dataset = TensorDataset(X_test, y_test)

        # Create data loaders
        train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size,
                                  drop_last=True, worker_init_fn=seed_worker,
                                  generator=g_train, shuffle=True)
        test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size,
                                 shuffle=False)

        train_losses = []
        test_losses = []

        for epoch in tqdm(range(num_epochs), disable=not verbose):
            self.train()
            train_loss = 0.0
            for X_batch, y_batch in train_loader:
                optimizer.zero_grad()
                outputs = self(X_batch)
                loss = criterion(outputs, y_batch)
                loss.backward()
                optimizer.step()
                train_loss += loss.item()
            train_loss /= len(train_loader.dataset)
            train_losses.append(train_loss)

            test_loss = self.evaluate(test_loader, criterion)
            test_losses.append(test_loss)

            if verbose and (epoch+1) % 10 == 0:
                print(f'Epoch {epoch+1}/{num_epochs}, Training Loss: {train_loss:.4f}, Test Loss: {test_loss:.4f}')

        self.train_losses = train_losses
        self.test_losses = test_losses
        gc.collect()

    def plot_loss(self):
        plt.figure(figsize=(10, 5))
        plt.plot(self.train_losses, label='Training Loss')
        plt.plot(self.test_losses, label='Test Loss')
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.title('Loss over Epochs')
        plt.legend()
        plt.show()

    def evaluate(self, data_loader, criterion):
        self.eval()
        total_loss = 0.0
        with torch.no_grad():
            for X_batch, y_batch in data_loader:
                outputs = self(X_batch)
                loss = criterion(outputs, y_batch)
                total_loss += loss.item()
        return total_loss / len(data_loader.dataset)

    @staticmethod
    def find_best_parameters(X_train, y_train, X_test, y_test, parameter_grid,
                             criterion = nn.MSELoss()):
        # Split data into train/validation sets
        best_loss = float('inf')
        best_params = None

        for params in tqdm(list(it.product(*parameter_grid.values()))):
            hidden_dim, seq_length, num_epochs, batch_size = params
            input_dim = X_train.shape[-1]  # Assuming X_train is already appropriately shaped
            output_dim = y_train.shape[-1]  # Assuming y_train is also appropriately shaped

            model = RNNModel(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim, seq_length=seq_length)
            model.fit(X_train, y_train, X_test, y_test, num_epochs=num_epochs, batch_size=batch_size, verbose=False)

            test_dataset = TensorDataset(X_test, y_test)
            test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size,
                                    shuffle=False)
            val_loss = model.evaluate(test_loader, criterion)

            if val_loss < best_loss:
                best_loss = val_loss
                best_params = params

        print(f'Best parameters found: {best_params}, with validation loss: {best_loss}')
        return best_params

In [52]:
class LSTMModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(LSTMModel, self).__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        out, (hidden, cell) = self.lstm(x)
        out = self.fc(out[:, -1, :])
        return out

In [53]:
class GRUModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GRUModel, self).__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        out, _ = self.gru(x)
        out = self.fc(out[:, -1, :])
        return out

In [54]:
parameter_grid = {
    'hidden_dim': [20, 50, 100],
    'seq_length': [10, 20, 30],
    'num_epochs': [50, 100, 200],
    'batch_size': [16, 32, 64]
}

# parameter_grid = {
#     'hidden_dim': [20],
#     'seq_length': [10],
#     'num_epochs': [10],
#     'batch_size': [16]
# }

best_params = RNNModel.find_best_parameters(X_train_tensor, y_train_tensor, X_test_tensor, y_test_tensor, parameter_grid)


  0%|          | 0/1 [00:00<?, ?it/s]

Best parameters found: (20, 10, 10, 16), with validation loss: 23828.55771484375
