<a href="https://colab.research.google.com/github/MicheleGiambelli/Deep-Learning-Project/blob/Michele/Solana.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
import yfinance as yf
from google.colab import files
from torch import nn
import torch
from torch.utils.data import TensorDataset, DataLoader
from sklearn.preprocessing import StandardScaler
import torch.optim as optim


In [16]:
ticker = "SOL-USD"
btc_ticker = "BTC-USD"

solana_data = yf.download(ticker, start="2020-04-10")
btc_data = yf.download(btc_ticker, start="2020-04-10")

[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed


In [17]:

# Ensure the columns are renamed every time this cell is run
solana_data = solana_data.rename(columns={"Adj Close": "SOL_Adj_Close"})
btc_data = btc_data.rename(columns={"Adj Close": "BTC_Adj_Close"})

# Unione dei dati
# Instead of selecting individual columns, use the renamed dataframes directly
df = pd.concat([solana_data, btc_data], axis=1)
df = df.dropna()

# Calcolo dei rendimenti giornalieri
df["SOL_Return"] = df["SOL_Adj_Close"].pct_change()
df["BTC_Return"] = df["BTC_Adj_Close"].pct_change()

# Funzione per calcolare Beta
def rolling_beta(df, window):
    cov = df["SOL_Return"].rolling(window).cov(df["BTC_Return"])
    var = df["BTC_Return"].rolling(window).var()
    return cov / var

# Define n before using it in rolling_beta
n = 20  # For example, a 20-day rolling window for beta calculation

# Aggiungere Beta
df["Beta"] = rolling_beta(df, n)
df = df.dropna()


In [18]:
n = 20  # Periodo per bande di Bollinger e Beta
k = 2  # Deviazioni standard per bande di Bollinger

# Calcolo Bande di Bollinger
df["SMA"] = df["SOL_Adj_Close"].rolling(window=n).mean()
df["StdDev"] = df["SOL_Adj_Close"].rolling(window=n).std()
df["Upper_Band"] = df["SMA"] + k * df["StdDev"]
df["Lower_Band"] = df["SMA"] - k * df["StdDev"]
df = df.dropna()


In [19]:
def rolling_sharpe_ratio(df, window, risk_free_rate):
    rolling_mean = df["SOL_Return"].rolling(window).mean()
    rolling_std = df["SOL_Return"].rolling(window).std()
    return (rolling_mean - risk_free_rate) / rolling_std
risk_free_rate = 0.01
# Aggiungere Sharpe Ratio
df["Sharpe_Ratio"] = rolling_sharpe_ratio(df, n, risk_free_rate)
df = df.dropna()


In [20]:
df.drop(df.columns[[6,7,8,9,10,11,13]], axis=1, inplace=True) # tolgo le colonne relative a BTC che non mi servono
df.head()

Price,SOL_Adj_Close,Close,High,Low,Open,Volume,SOL_Return,Beta,SMA,StdDev,Upper_Band,Lower_Band,Sharpe_Ratio
Ticker,SOL-USD,SOL-USD,SOL-USD,SOL-USD,SOL-USD,SOL-USD,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2
2020-06-07,0.616578,0.616578,0.624444,0.593398,0.622443,716785,-0.009423,0.427358,0.595371,0.027581,0.650533,0.540209,-0.201951
2020-06-08,0.668313,0.668313,0.679001,0.61331,0.615078,1440234,0.083907,0.42893,0.597564,0.031483,0.660529,0.534599,-0.085599
2020-06-09,0.658002,0.658002,0.668088,0.627242,0.667784,988327,-0.015428,0.346123,0.601506,0.0339,0.669306,0.533706,-0.03844
2020-06-10,0.644867,0.644867,0.670043,0.633404,0.658038,1096203,-0.019962,0.524364,0.603304,0.03524,0.673784,0.532823,-0.103663
2020-06-11,0.573742,0.573742,0.650535,0.570082,0.644888,1122221,-0.110294,0.739265,0.600047,0.034787,0.66962,0.530474,-0.229623


**<h1>Pytorch Dataset</h1>**

### Description of the Code below

This code demonstrates how to preprocess a dataset stored in a Pandas DataFrame and transform it into a format suitable for use with PyTorch, while ensuring the data can be efficiently loaded in batches for training or inference.

1. **Feature and Target Definition**:
   - The column `Close` is chosen as the target variable (`y`), representing the Solana price.
   - The remaining columns are used as the feature set (`X`).

2. **Feature Normalization**:
   - The features are scaled using `StandardScaler` from `sklearn` to ensure they have zero mean and unit variance. This helps improve the stability and convergence of training neural networks.

3. **Conversion to PyTorch Tensors**:
   - The normalized features (`X`) are converted into a PyTorch tensor of type `torch.float32`.
   - The target (`y`) is similarly converted into a tensor, reshaped to have a shape of `(-1, 1)` to align with PyTorch's supervised learning expectations.

4. **Custom DataLoader Function**:
   - A function, `load_array`, is defined to create a PyTorch `DataLoader`. This function:
     - Accepts `data_arrays` (a tuple of feature and target tensors).
     - Uses unpacking (`*data_arrays`) to pass the tensors dynamically to `TensorDataset`.
     - Returns a `DataLoader` with a specified batch size.
     - Includes the `is_train` parameter to control whether data shuffling is enabled. Here, it defaults to `False` to maintain the historical order of the data.

5. **Batch Loading**:
   - The tensors for features (`X_tensor`) and target (`y_tensor`) are packed into a tuple `data_arrays`.
   - A `DataLoader` is created with a batch size of 32, ensuring that the data is processed in manageable chunks.

This code is well-suited for historical data processing (e.g., time-series or financial data) where the order of data is important, thanks to the use of `shuffle=False`. The preprocessing ensures the dataset is ready for efficient model training or evaluation in PyTorch.

In [21]:
# Assume your DataFrame pandas is named `df`
# Define 'Close' as the target and the rest as features
X = df.drop(columns=['Close'], level = 0).values  # Features
y = df['Close'].values  # Target

# Normalize the features, referring to the same numerical scale
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)
y_normalized = scaler.fit_transform(y.reshape(-1, 1))

# Convert to PyTorch tensors
X_tensor = torch.tensor(X_normalized, dtype=torch.float32)
y_tensor = torch.tensor(y_normalized, dtype=torch.float32)

# Create sequences
sequence_length = 5
X_sequences = []
y_sequences = []

for i in range(len(X_normalized) - sequence_length):
    X_seq = X_normalized[i:i + sequence_length]
    y_seq = y[i + sequence_length]
    X_sequences.append(X_seq)
    y_sequences.append(y_seq)

X_sequences = np.array(X_sequences)
y_sequences = np.array(y_sequences)

X_tensor = torch.tensor(X_sequences, dtype=torch.float32)
y_tensor = torch.tensor(y_sequences, dtype=torch.float32)

# Suddivisione in training e validation set
# Suddivisione temporale del dataset
n = len(X_tensor)

# Percentuali di suddivisione
train_split = int(n * 0.7)  # Prime 70% osservazioni per il training
val_split = int(n * 0.9)    # Successive 20% osservazioni per la validation
test_split = n              # Ultime 10% osservazioni per il test

train_dataset = (X_tensor[:train_split], y_tensor[:train_split])
val_dataset = (X_tensor[train_split:val_split], y_tensor[train_split:val_split])
test_dataset = (X_tensor[val_split:test_split], y_tensor[val_split:test_split])

# Define a function for loading data with unpacking
def load_array(data_arrays, batch_size, is_train=False): # False because we want to read the data historically
    """Construct a PyTorch DataLoader."""
    dataset = TensorDataset(*data_arrays)  # Unpacking the tensors
    return DataLoader(dataset, batch_size=batch_size, shuffle=is_train)

# Create the DataLoader
data_arrays = (X_tensor, y_tensor)  # Pack tensors into a tuple
batch_size = 32
train_loader = load_array(train_dataset, batch_size)
valid_loader = load_array(val_dataset, batch_size)
test_loader = load_array(test_dataset, batch_size)


In [22]:
batch_X, batch_y = next(iter(train_loader))
print(batch_X.shape)  # Expected: torch.Size([32, 10, 12])
print(batch_y.shape)

torch.Size([32, 5, 12])
torch.Size([32, 1])


# LSTM

In [23]:
class LSTM(nn.Module):

    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(LSTM, self).__init__()

        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True, dropout=0.2)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Stato nascosto iniziale e stato della cella
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)

        # Passaggio attraverso la LSTM
        out, _ = self.lstm(x, (h0, c0))

        # Passaggio attraverso il layer fully connected
        out = self.fc(out[:, -1, :])  # Usare solo l'ultimo stato nascosto
        return out

In [94]:
# Iperparametri
input_size = 12
hidden_size = 128
num_layers = 2
output_size = 1
learning_rate = 0.001
num_epochs = 200

# Inizializzazione del modello, della loss function e dell'optimizer
model = LSTM(input_size, hidden_size, num_layers, output_size)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [95]:
# Addestramento del modello
from sklearn.metrics import mean_absolute_error


# Early stopping
best_loss = float('inf')
epochs_no_improve = 0
early_stop = False


model.train()
for epoch in range(num_epochs):
    if early_stop:
        print(f"Early stopping at epoch {epoch}")
        break

    model.train()
    train_loss = 0.0
    train_mae = 0.0
    y_preds, y_real = [], []

    for batch_X, batch_y in train_loader:
        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward pass e aggiornamento dei pesi
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

        # Calcolo del MAE
        y_preds.extend(outputs.detach().cpu().numpy())
        y_real.extend(batch_y.cpu().numpy())
        # predictions = outputs.detach().cpu().numpy()  # Denormalizza le predizioni
        # actuals = scaler.inverse_transform(batch_y.numpy())  # Denormalizza i target
        # train_mae += mean_absolute_error(actuals, predictions)

    train_loss /= len(train_loader)
    train_mae = mean_absolute_error(y_real, y_preds)

    # Validation
    model.eval()
    val_loss = 0.0
    # val_mae = 0.0
    y_preds, y_real = [], []


    with torch.no_grad():
        for val_X, val_y in valid_loader:
            val_outputs = model(val_X)
            loss = criterion(val_outputs, val_y)
            val_loss += loss.item()

            # Calcolo del MAE per il validation set
            y_preds.extend(outputs.detach().cpu().numpy())
            y_real.extend(batch_y.cpu().numpy())

            # val_predictions = scaler.inverse_transform(val_outputs.numpy())
            # val_actuals = scaler.inverse_transform(val_y.numpy())
            # val_mae += mean_absolute_error(val_actuals, val_predictions)

    val_loss /= len(valid_loader)
    # val_mae /= len(valid_loader)
    val_mae = mean_absolute_error(y_real, y_preds)


    # Stampa della loss e del MAE ogni 10 epoche
    # if (epoch + 1) %2 == 0:
    print(f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Train MAE: {train_mae:.4f}, "
              f"Val Loss: {val_loss:.4f}, Val MAE: {val_mae:.4f}")

# Early stopping check
    if val_loss < best_loss:
      best_loss = val_loss
      epochs_no_improve = 0
      # Salva i pesi del miglior modello
      best_epoch = epoch+1
      best_model_weights = model.state_dict()
    elif val_loss > best_loss+0.001:
        epochs_no_improve += 1

    if epochs_no_improve >= 5:
        early_stop = True

if best_model_weights is not None:
   model.load_state_dict(best_model_weights)
   print(f"Weights of best epoch {best_epoch} uploaded {best_loss}.")

Epoch [1/200], Train Loss: 5452.2496, Train MAE: 46.0098, Val Loss: 11390.7333, Val MAE: 16.5236
Epoch [2/200], Train Loss: 4737.8772, Train MAE: 40.9565, Val Loss: 10199.9257, Val MAE: 12.1298
Epoch [3/200], Train Loss: 4438.6283, Train MAE: 39.2018, Val Loss: 9690.0341, Val MAE: 9.0332
Epoch [4/200], Train Loss: 4249.8868, Train MAE: 38.1356, Val Loss: 9271.3195, Val MAE: 6.3967
Epoch [5/200], Train Loss: 4099.1237, Train MAE: 37.3590, Val Loss: 8910.1502, Val MAE: 4.0524
Epoch [6/200], Train Loss: 3974.1109, Train MAE: 36.9032, Val Loss: 8591.5486, Val MAE: 2.3154
Epoch [7/200], Train Loss: 3868.7219, Train MAE: 36.6732, Val Loss: 8307.3341, Val MAE: 1.5473
Epoch [8/200], Train Loss: 3779.0808, Train MAE: 36.5920, Val Loss: 8052.0860, Val MAE: 2.0757
Epoch [9/200], Train Loss: 3702.0532, Train MAE: 36.5558, Val Loss: 7821.4944, Val MAE: 3.4192
Epoch [10/200], Train Loss: 3638.7742, Train MAE: 36.7916, Val Loss: 7607.6673, Val MAE: 4.9015
Epoch [11/200], Train Loss: 3549.2151, Train 

In [96]:
# Valutazione finale
X_test, y_test = next(iter(train_loader))
model.eval()
with torch.no_grad():
    predictions = model(X_test).numpy()
    predictions = scaler.inverse_transform(predictions)  # Denormalizza
    actuals = scaler.inverse_transform(y_test.numpy())

# Output delle performance
print("Predizioni vs Valori Reali")
print(np.concatenate((predictions[:10], actuals[:10]), axis=1))  # Mostra i primi 10 risultati

Predizioni vs Valori Reali
[[193.6508   108.08182 ]
 [219.04265  109.079506]
 [241.48936  107.632286]
 [222.2844   106.54432 ]
 [212.59007  107.532936]
 [220.9826   107.267265]
 [220.54163  109.8746  ]
 [243.81787  114.069214]
 [278.4994   111.37273 ]
 [230.44618  113.329056]]


Il formato del tensore torch.Size([32, 12]) che ottieni con next(iter(data_loader))[0].shape è parzialmente corretto per creare una LSTM, ma manca una dimensione fondamentale: la dimensione di sequenza temporale.

Per una LSTM, il tensore di input dovrebbe avere la seguente forma:

(
batch_size
,
sequence_length
,
input_size
)
(batch_size,sequence_length,input_size)

Spesso la scelta di sequence_length viene fatta sperimentalmente:

Prova diversi valori (ad esempio, 1, 5, 7) e confronta le prestazioni del modello (ad esempio, tramite metriche come MSE o MAE).
Usa la validazione incrociata per capire quale valore produce le previsioni più accurate.
