# Progetto di Applicazioni Data Intensive

## Studente Luca Rubboli 0000923420

### Anno Accademico 2021-2022

## Requisiti

Per scaricare il dataset dalla piattaforma Kaggle è necessario avere un account Kaggle, da cui creare un'API token, necessario per l'utilizzo dei dataset.

In [48]:
import os.path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from ipywidgets import interact, IntSlider
import ipywidgets as widgets
from datetime import datetime
import copy

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, cross_validate, train_test_split

## Analisi

L'analisi verte sull'andamento dei titoli Brent Oil, Heating Oil, Crude Oil WTI e Natural Gas, sfruttando il dataset presente sulla piattaforma Kaggle, che contiene informazioni a partire dal 2000 fino ad arrivare all'attuale 2022.

I dati a disposizione (aggiornato al 07/07/2022) sono rispettivamente
- 5768 per Brent Oil
- 5770 per Heating Oil
- 5744 per Crude Oil
- 5742 per Natural Gas

Per un totale di 23024 osservazioni.

Ogni osservazione è composta da una data (Date), un valore di apertura (Open), che rappresenta il valore azionario del titolo all'inizio della giornata, un valore di chiusura (Close), che rappresenta il valore azionario del titolo alla fine della giornata, il valore massimo raggiunto in giornata (High), il valore minimo raggiunto in giornata (Low) e il numero di asset di mercato scambiati in giornata (Volume).
Si può notare come la colonna Symbol possa essere considerata ridondante qualora si lavori su dataframe distinti per ogni titolo, mentre la colonna Currency è totalmente ridondante, dato che considererò sempre i valori economici in valuta USD.

In [2]:
path = "./historical-daily-oil-and-natural-gas-prices/oil and gas.csv"
if not os.path.exists(path):
    try:
        import opendatasets as od
    except ImportError as e:
        !pip install opendatasets
        import opendatasets as od
    od.download("https://www.kaggle.com/datasets/prasertk/historical-daily-oil-and-natural-gas-prices/download")

In [3]:
def print_stats(dataframe, name=None):
    if name == None:
        if "Symbol" in dataframe.columns:
            print("Dataframe for", ", ".join(str(title) for title in dataframe["Symbol"].unique()))
    else:
        print("Dataframe for", name)
    print("Starting from", dataframe["Date"].min().date() if "Date" in dataframe.columns else dataframe.index.min().date(), "to", dataframe["Date"].max().date() if "Date" in dataframe.columns else dataframe.index.max().date())
    display(dataframe)
    display(dataframe.describe())
    display(dataframe["Open"].value_counts().sort_index(ascending=True))

In [4]:
data = pd.read_csv(path)
data["Date"] = pd.to_datetime(data["Date"], format='%Y-%m-%d')
print_stats(data)

Dataframe for Brent Oil, Crude Oil WTI, Natural Gas, Heating Oil
Starting from 2000-01-04 to 2022-06-17


Unnamed: 0,Symbol,Date,Open,High,Low,Close,Volume,Currency
0,Brent Oil,2000-01-04,23.9000,24.7000,23.8900,24.3900,32509,USD
1,Brent Oil,2000-01-05,24.2500,24.3700,23.7000,23.7300,30310,USD
2,Brent Oil,2000-01-06,23.5500,24.2200,23.3500,23.6200,44662,USD
3,Brent Oil,2000-01-07,23.5700,23.9800,23.0500,23.0900,34826,USD
4,Brent Oil,2000-01-10,23.0400,23.7800,23.0400,23.7300,26388,USD
...,...,...,...,...,...,...,...,...
23019,Heating Oil,2022-06-13,4.3612,4.3762,4.1949,4.2834,46406,USD
23020,Heating Oil,2022-06-14,4.2749,4.4570,4.2488,4.3940,36652,USD
23021,Heating Oil,2022-06-15,4.3816,4.6070,4.3557,4.5470,36908,USD
23022,Heating Oil,2022-06-16,4.5320,4.5825,4.4124,4.5713,28269,USD


Unnamed: 0,Open,High,Low,Close,Volume
count,23024.0,23024.0,23024.0,23024.0,23024.0
mean,33.360681,33.849664,32.849399,33.255359,128810.3
std,36.010741,36.469603,35.521156,35.897015,147402.7
min,-14.0,0.5085,-16.74,0.4999,0.0
25%,2.741,2.785,2.695675,2.740875,32912.5
50%,14.3645,15.5325,13.8505,16.414,71120.5
75%,60.6525,61.5,59.7825,58.93,178621.2
max,146.3,147.5,144.25,146.08,1404916.0


-14.0000     1
 0.5000      1
 0.5010      1
 0.5070      1
 0.5075      1
            ..
 144.4000    1
 144.6900    1
 144.7600    1
 145.1900    1
 146.3000    1
Name: Open, Length: 13739, dtype: int64

Il dataset non presenta valori mancanti, però ho reputato necessario rimuovere eventuali osservazioni che non rispettano le condizioni di
- valori di entrata e chiusura contenuti tra il minimo e il massimo
- valori di titolo positivi

Dopo aver applicato queste restrizioni, le osservazioni valide risultano
- 5768 per Brent Oil
- 5763 per Heating Oil
- 4657 per Crude Oil
- 5731 per Natural Gas

Per un totale di 21919 osservazioni valide.

Considerando però le features High e Low è necessario scartare un'ulteriore osservazione per ogni titolo, per predire infatti l'andamento di una giornata si ha a disposizione al più le caratteristiche di High e Low del giorno precedente, in quanto queste vengono stilate a fine giornata insieme al valore di chiusura.

In [5]:
condition_between_low_high = lambda col: (data[col] >= data["Low"]) & (data[col] <= data["High"])
condition_positive = lambda col: data[col] >= 0

data.drop(data.loc[~condition_between_low_high("Open")].index, inplace=True)
data.drop(data.loc[~condition_between_low_high("Close")].index, inplace=True)
data.drop(data.loc[~condition_positive("Open")].index, inplace=True)
data.drop(data.loc[~condition_positive("Close")].index, inplace=True)

print_stats(data)

Dataframe for Brent Oil, Crude Oil WTI, Natural Gas, Heating Oil
Starting from 2000-01-04 to 2022-06-17


Unnamed: 0,Symbol,Date,Open,High,Low,Close,Volume,Currency
0,Brent Oil,2000-01-04,23.9000,24.7000,23.8900,24.3900,32509,USD
1,Brent Oil,2000-01-05,24.2500,24.3700,23.7000,23.7300,30310,USD
2,Brent Oil,2000-01-06,23.5500,24.2200,23.3500,23.6200,44662,USD
3,Brent Oil,2000-01-07,23.5700,23.9800,23.0500,23.0900,34826,USD
4,Brent Oil,2000-01-10,23.0400,23.7800,23.0400,23.7300,26388,USD
...,...,...,...,...,...,...,...,...
23019,Heating Oil,2022-06-13,4.3612,4.3762,4.1949,4.2834,46406,USD
23020,Heating Oil,2022-06-14,4.2749,4.4570,4.2488,4.3940,36652,USD
23021,Heating Oil,2022-06-15,4.3816,4.6070,4.3557,4.5470,36908,USD
23022,Heating Oil,2022-06-16,4.5320,4.5825,4.4124,4.5713,28269,USD


Unnamed: 0,Open,High,Low,Close,Volume
count,21919.0,21919.0,21919.0,21919.0,21919.0
mean,32.367632,32.841575,31.870425,32.369054,115484.3
std,36.510599,36.980259,36.010603,36.518215,116576.3
min,0.5,0.5085,0.493,0.4999,0.0
25%,2.6588,2.702,2.609,2.656,33192.0
50%,7.535,7.729,7.325,7.503,69263.0
75%,60.375,61.25,59.43,60.32,166058.0
max,146.3,147.5,144.25,146.08,1147389.0


0.5000      1
0.5010      1
0.5070      1
0.5075      1
0.5080      1
           ..
144.4000    1
144.6900    1
144.7600    1
145.1900    1
146.3000    1
Name: Open, Length: 13437, dtype: int64

## Obiettivo

Il programma ha l'obiettivo di individuare un modello che possa prevedere l'andamento delle chiusure giornaliere dei titoli, massimizzando così un eventuale ROI derivante da un investimento.

## Divisione dei titoli

Per poter lavorare in maniera indipendente su ciascun titolo, creo nuovi dataframe, uno per ciascuno, e ne stampo le statistiche.

In [6]:
brent_oil = data[data["Symbol"] == "Brent Oil"].set_index("Date").copy()
heating_oil = data[data["Symbol"] == "Heating Oil"].set_index("Date").copy()
crude_oil = data[data["Symbol"] == "Crude Oil WTI"].set_index("Date").copy()
natural_gas = data[data["Symbol"] == "Natural Gas"].set_index("Date").copy()
titles = {"Brent Oil": brent_oil, "Heating Oil": heating_oil, "Crude Oil": crude_oil, "Natural Gas": natural_gas}

@interact(dataset=titles.keys())
def show_dataset_stats(dataset):
    print_stats(titles[dataset])

interactive(children=(Dropdown(description='dataset', options=('Brent Oil', 'Heating Oil', 'Crude Oil', 'Natur…

## Visualizzazione

Per permettere una miglior visualizzazione degli andamenti dei titoli, ho costruito diversi grafici, rappresentanti le caratteristiche che ho reputato più importanti.

Per poter visualizzare nel dettaglio le features, ho impostato degli slider che permettono all'utente di impostare l'anno, a partire dal 2000, fino a cui la visualizzazione deve arrivare, e il numero di anni antecedenti a quello impostato da cui partire.

Ho adottato un approccio di questo tipo in quanto ritenevo fosse necessario poter visualizzare i dati in maniera più accurata, e quindi selezionarne solo una frazione dai dataset originari.

Per la feature riguardante i volumi, ho deciso di mostrare solo l'andamento mensile, in quanto la visualizzazione per singola giornata sarebbe risultata fin troppo confusionaria, e ho reputato che il riepilogo mensile possa ritenersi sufficientemente dettagliato per l'analisi in questione.

In [7]:
year_widget = IntSlider(min=2000, max=2022, step=1, value=2022)
n_year_widget = IntSlider(min=1, max=23, step=1, value=23)

def update_n_year_range(*args):
    n_year_widget.max = year_widget.value - 1999
year_widget.observe(update_n_year_range, 'value')

@interact(title=titles.keys(), y_finish=year_widget, n_years=n_year_widget)
def plot_data(title, y_finish=2022, n_years=23):
    dataframe = titles[title]
    dataframe_to_print = dataframe[(dataframe.index.year > y_finish - n_years) & (dataframe.index.year <= y_finish)]
    gain = dataframe_to_print[(dataframe_to_print["Close"] - dataframe_to_print["Open"]) > 0]
    loss = dataframe_to_print[(dataframe_to_print["Close"] - dataframe_to_print["Open"]) < 0]
    even = dataframe_to_print[(dataframe_to_print["Close"] - dataframe_to_print["Open"]) == 0]
    fig = plt.figure(figsize=(12, 8))
    fig.suptitle("Open and Close", fontsize=14, fontweight='bold')
    plt.plot(dataframe_to_print.index, dataframe_to_print["Open"])
    plt.plot(dataframe_to_print.index, dataframe_to_print["Close"])
    plt.scatter(gain.index, gain["Close"] - gain["Open"], s=15, c="green")
    plt.scatter(loss.index, loss["Close"] - loss["Open"], s=15, c="red")
    plt.scatter(even.index, even["Close"] - even["Open"], s=15, c="orange")
    plt.plot(dataframe_to_print.index, dataframe_to_print["Close"] - dataframe_to_print["Open"], c="pink")
    plt.legend(["Open", "Close", "Daily change"])
    plt.ylabel("USD")
    plt.xticks(rotation=45, ha='right')
    plt.grid()
    plt.show()
    
    fig = plt.figure(figsize=(12, 8))
    fig.suptitle("Highest and Lowest", fontsize=14, fontweight='bold')
    plt.plot(dataframe_to_print.index, dataframe_to_print["High"])
    plt.plot(dataframe_to_print.index, dataframe_to_print["Low"])
    plt.legend(["High", "Low"])
    plt.ylabel("USD")
    plt.xticks(rotation=45, ha='right')
    plt.grid()
    plt.show()
    
    fig = plt.figure(figsize=(12, 8))
    fig.suptitle("Volumes", fontsize=14, fontweight='bold')
    dataframe_monthly_grouped = dataframe_to_print.groupby([dataframe_to_print.index.year, dataframe_to_print.index.month])
    dataframe_month = dataframe_monthly_grouped.aggregate(np.sum)
    for idx in dataframe_month.index:
        plt.bar(datetime(idx[0], idx[1], 1), dataframe_month.loc[idx, "Volume"] / 10**3, 20)
    plt.ylabel("Volume [k units]")
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis="y")
    plt.show()
    

interactive(children=(Dropdown(description='title', options=('Brent Oil', 'Heating Oil', 'Crude Oil', 'Natural…

Al fine di ottenere dataframe efficienti su cui lavorare, seleziono solo le features rilevanti per l'analisi in questione.

## Individuazione di modelli

Una volta aver tabulato i dati, individuo più modelli in base a differenti features prese in considerazione o alla tipologia di modello, per effettuare poi un confronto e determinare il modello migliore, ovvero quello in grado di massimizzare il ROI.

Per ottenere un modello di regressione, sfrutto il metodo di hold out, dividendo i dati in training set e validation set secondo l'anno di osservazione.

In [8]:
def train_split(X, y, title, year):
    is_train = X.index.year < year
    X_train = X.loc[is_train]
    y_train = y.loc[is_train]
    X_val = X.loc[~is_train]
    y_val = y.loc[~is_train]
    print(title)
    print(f"Training set: {X_train.size/X.size:.2%}")
    print(f"Validation set: {X_val.size/X.size:.2%}", "\n")
    return [X_train, X_val, y_train, y_val]

Eseguo la separazione dei dataset in training e validation, optando per una divisione basata sull'anno in cui viene effettuata l'osservazione, privilegiando il training set con un numero maggiore di osservazioni.

## Modelli

Ho deciso di costruire diversi tipi di modelli, a seconda dei dati presi in considerazione.
Ho effettuato una regressione Lasso preliminare sul dataframe complessivo al fine di definire le peculiarità più importanti (con un peso $\neq$ 0).
Con $\alpha$ = 0.1 ottengo come rilevanti le feature di apertura, High e Low, queste due ultime prese dal giorno precedente, in quanto non disponibili per prevedere la chiusura, dato che vengono stilate a fine giornata.
- Modello 1, costituito dalle sole features ottenute dalla regressione Lasso.
- Modello 2, prevede l'aggiunta di $n$ valori di Open delle $n$ giornate precedenti del titolo in esame.
- Modello 3, prevede l'aggiunta degli altri valori di apertura dei titoli presi in considerazione; ho reputato utile aggiungere questo modello perchè trattandosi di titoli legati al petrolio e al gas naturale possono essere ritenuti correlati.

In [9]:
# Strutture dati per il modello 1
titles_work_first = {}
titles_split_first = {}
features_first = []

# Strutture dati per il modello 2
titles_work_second = {}
titles_split_second = {}
features_second = []

# Strutture dati per il modello 3
titles_work_third = {}
titles_split_third = {}
features_third = {}
# Dependency_features è un dizionario perchè è necessario che la lista di features di ogni modello abbia come primo
# elemento il valore di apertura del titolo stesso, introducendo i valori di apertura degli altri titoli, ho modificato
# il nome delle colonne con "Open {nome_titolo}", per cui è necessario avere una lista personalizzata per ogni titolo

In [10]:
# Escludo le colonne che non hanno un peso effettivo per il modello
X, y = data.drop(columns=["Close", "Date", "Symbol", "Currency"]).copy(), data["Close"].copy()

def shift_col(dataset, col, days=1):
    dataset[col] = dataset[col].shift(days)

shift_col(X, "High")
shift_col(X, "Low")
shift_col(X, "Volume")
X.dropna(inplace=True)

y = y.loc[X.index]

# Regressione Lasso sulle features rimanenti per scremare quelle reputate poco utili
model = Pipeline([
    ("scale", StandardScaler()),
    ("regr", Lasso(alpha=0.1))
])
model.fit(X, y)

weights = pd.Series(model.named_steps["regr"].coef_, X.columns)
print(weights)
weights = weights[weights != 0]
features_first = weights.index.tolist()
features_first.append(y.name)
print(features_first)

Open      36.404929
High       0.000001
Low        0.001638
Volume     0.000000
dtype: float64
['Open', 'High', 'Low', 'Close']


In [11]:
# Compongo la struttura dati per i dataframe su cui lavorare, per cui escludo tutte le features_first reputate poco utili
brent_oil_work = brent_oil[features_first].copy()
heating_oil_work = heating_oil[features_first].copy()
crude_oil_work = crude_oil[features_first].copy()
natural_gas_work = natural_gas[features_first].copy()
titles_work_first = {"Brent Oil": brent_oil_work, "Heating Oil": heating_oil_work, "Crude Oil": crude_oil_work, "Natural Gas": natural_gas_work}

# Traslo di 1 giorno le features_first che non posso utilizzare per la previsione del giorno, perchè di fatto posso accederevi
# solo in fase di chiusura della borsa, quando la previsione risulta ormai non sfruttabile
for title_name, title in titles_work_first.items():
    if "High" in features_first:
        title["High"] = title["High"].shift(1)
    if "Low" in features_first:
        title["Low"] = title["Low"].shift(1)
    if "Volume" in features_first:
        title["Volume"] = title["Volume"].shift(1)
    title.dropna(inplace=True)

@interact(title=titles_work_first.keys())
def show_stats(title):
    print_stats(titles_work_first[title], title)

interactive(children=(Dropdown(description='title', options=('Brent Oil', 'Heating Oil', 'Crude Oil', 'Natural…

In [86]:
# Struttura dati per contenere il training set e il validation, secondo il metodo di hold-out
for title in titles_work_first.keys():
    titles_split_first[title] = \
    list(train_test_split(titles_work_first[title][features_first[:-1]], titles_work_first[title][features_first[-1]], test_size=1/3, random_state=22))
    #train_split(titles_work_first[title][features_first[:-1]], titles_work_first[title][features_first[-1]], title,
    #2014)

In [87]:
# Partendo dai dataframe a cui ho già scremato le features meno utili, estraggo i dataframe per il secondo modello,
# aggiungendo i colonne che corrispondono alle aperture di i giorni traslati
for title_name, title in titles_work_first.items():
    dframe = title.copy()
    for i in range(1, 30):
        dframe.insert(loc=dframe.columns.size-1, column="OpenLag{}".format(i), value=dframe["Open"].shift(i))
    dframe.dropna(inplace=True)
    if len(features_second) == 0:
        features_second = dframe.columns.tolist()
    titles_work_second[title_name] = dframe
    titles_split_second[title_name] = \
    list(train_test_split(dframe[features_second[:-1]], dframe[features_second[-1]], test_size=1/3, random_state=22))
    #train_split(dframe[features_second[:-1]], dframe[features_second[-1]], title_name, 2014)

In [88]:
# Partendo dai dataframe a cui ho già scremato le features meno utili, aggiungo i valori di apertura degli altri titoli e,
# reputando i valori dei titoli dipendenti tra loro, ho aggiunto anche le colonne relative al prodotto tra i valori di 
# apertura dei titoli stessi
for title_name, title in titles_work_first.items():
    dframe = title.copy()
    dframe.rename(columns={"Open": "Open {}".format(title_name)}, inplace=True)
    for other_title_name, other_title in titles_work_first.items():
        if other_title_name != title_name:
            dframe.insert(loc=dframe.columns.size-1, column="Open {}".format(other_title_name), value=other_title["Open"])
    for i, tit in enumerate(titles_work_first.keys()):
        if i < len(titles_work_first.keys()) -1:
            for other_tit in list(titles_work_first.keys())[i+1:]:
                dframe.insert(loc=dframe.columns.size-1, column="{} * {}".format(tit, other_tit), value=dframe["Open {}".format(tit)] * dframe["Open {}".format(other_tit)])
    dframe.dropna(inplace=True)
    features_third[title_name] = dframe.columns.tolist()
    titles_work_third[title_name] = dframe
    titles_split_third[title_name] = \
    list(train_test_split(dframe[features_third[title_name][:-1]], dframe[features_third[title_name][-1]], test_size=1/3, random_state=22))
    #train_split(dframe[features_third[title_name][:-1]], dframe[features_third[title_name][-1]], title_name, 2014)

Ho definito alcune funzioni per stampare e graficare le peculiarità principali di ogni modello.

In [15]:
def err_rel(model, X, y, features):
    return np.mean(np.abs(model.predict(X[features[:model.n_features_in_]]) - y) / y)

In [16]:
def print_model_coefficients(features, coefficients, intercept, poly=None):
    if poly is None:
        thetas = [(a, round(b, 2)) for (a, b) in zip(features[:-1], coefficients)]
    else:
        thetas = [(a, round(b, 2)) for (a, b) in zip(poly.get_feature_names(features), coefficients)]
    print("Theta:", thetas)
    print(f"Intercept: {intercept:.2f}")

In [17]:
def print_model_stats(title, model, dataset, features):
    print("\n" + title)
    if "lin_regr" in model.named_steps:
        regr_name = "lin_regr"
    if "ridge_regr" in model.named_steps:
        regr_name = "ridge_regr"
    if "lasso_regr" in model.named_steps:
        regr_name = "lasso_regr"
    if "poly" not in model.named_steps:
        print_model_coefficients(features, model.named_steps[regr_name].coef_, model.named_steps[regr_name].intercept_)
    else:
        print_model_coefficients(features, model.named_steps[regr_name].coef_, model.named_steps[regr_name].intercept_, model.named_steps["poly"])
    print(f"Errore relativo training set: {err_rel(model, dataset[0], dataset[2], features):.2%}")
    print(f"Errore relativo validation set: {err_rel(model, dataset[1], dataset[3], features):.2%}")
    print(f"R^2: {model.score(dataset[1][features[:model.n_features_in_]], dataset[3]):.5}")

In [18]:
def plot_model_on_data(X, y, title, features, model=None):
    fig = plt.figure(figsize=(12, 8))
    fig.suptitle(title, fontsize=14, fontweight='bold')
    plt.scatter(X[features[0]], y, s=10)
    if model is not None:
        preds = model.predict(X[features[:model.n_features_in_]])
        plt.scatter(X[features[0]], preds, s=10)
    plt.grid()
    plt.legend(["Real values", "Predicted"])
    plt.ylabel("Close values")
    plt.xlabel("Open values")
    plt.show()
    
    fig = plt.figure(figsize=(12, 8))
    plt.plot(X.index, y, c="blue")
    if model is not None:
        plt.plot(X.index, preds, c="red")
    plt.grid()
    plt.legend(["Real values", "Predicted"])
    plt.ylabel("Close values")
    plt.show()

In [19]:
models = {
    "first": {
        "poly": {},
        "ridge": {},
        "lasso": {}
    },
    "second": {
        "poly": {},
        "ridge": {},
        "lasso": {}
    },
    "third": {
        "poly": {},
        "ridge": {},
        "lasso": {}
    }
}

Per ottenere più modelli di learning, ne costruisco diversi e, successivamente, eseguirò un confronto tra questi per decretare il migliore.
- Regressione polinomiale (grado 1 se non indicato)
- Regressione polinomiale con regolarizzazione Ridge
- Regressione polinomiale con regolarizzazione Lasso

In [20]:
def polynomial_regression(X, y, features, dg=1):
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=dg, include_bias=False)),
        ("scale", StandardScaler()),
        ("lin_regr", LinearRegression())
    ])
    model.fit(pd.DataFrame(X[features[:-1] if len(features) > 1 else features[0]]), y)
    return model

In [21]:
def ridge_regression(X, y, features, dg=1, alfa=1):
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=dg, include_bias=False)),
        ("scale", StandardScaler()),
        ("ridge_regr", Ridge(alpha=alfa))
    ])
    model.fit(pd.DataFrame(X[features[:-1] if len(features) > 1 else features[0]]), y)
    return model

In [22]:
def lasso_regression(X, y, features, dg=1, alfa=0.1):
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=dg, include_bias=False)),
        ("scale", StandardScaler()),
        ("lasso_regr", Lasso(alpha=alfa))
    ])
    model.fit(pd.DataFrame(X[features[:-1] if len(features) > 1 else features[0]]), y)
    return model

## Modello 1

In [60]:
for title in titles.keys():
    models["first"]["poly"][title] = polynomial_regression(titles_split_first[title][0], titles_split_first[title][2], features_first)
    models["first"]["ridge"][title] = ridge_regression(titles_split_first[title][0], titles_split_first[title][2], features_first)
    models["first"]["lasso"][title] = lasso_regression(titles_split_first[title][0], titles_split_first[title][2], features_first)

In [61]:
@interact(model = models["first"].keys(), title=titles_work_first.keys())
def regr_first(model, title):
    print_model_stats(title, models["first"][model][title], titles_split_first[title], features_first)
    plot_model_on_data(titles_work_first[title][features_first[:models["first"][model][title].n_features_in_]], titles_work_first[title][features_first[-1]], title, features_first, models["first"][model][title])

interactive(children=(Dropdown(description='model', options=('poly', 'ridge', 'lasso'), value='poly'), Dropdow…

## Modello 2

In [62]:
for title in titles.keys():
    models["second"]["poly"][title] = polynomial_regression(titles_split_second[title][0], titles_split_second[title][2], features_second)
    models["second"]["ridge"][title] = ridge_regression(titles_split_second[title][0], titles_split_second[title][2], features_second)
    models["second"]["lasso"][title] = ridge_regression(titles_split_second[title][0], titles_split_second[title][2], features_second)

In [63]:
@interact(model = models["second"].keys(), title=titles_work_second.keys())
def regr_second(model, title):
    print_model_stats(title, models["second"][model][title], titles_split_second[title], features_second)
    plot_model_on_data(titles_work_second[title][features_second[:models["second"][model][title].n_features_in_]], titles_work_second[title][features_second[-1]], title, features_second, models["second"][model][title])

interactive(children=(Dropdown(description='model', options=('poly', 'ridge', 'lasso'), value='poly'), Dropdow…

## Modello 3

In [64]:
for title in titles.keys():
    models["third"]["poly"][title] = polynomial_regression(titles_split_third[title][0], titles_split_third[title][2], features_third[title])
    models["third"]["ridge"][title] = ridge_regression(titles_split_third[title][0], titles_split_third[title][2], features_third[title])
    models["third"]["lasso"][title] = ridge_regression(titles_split_third[title][0], titles_split_third[title][2], features_third[title])

In [65]:
@interact(model = models["third"].keys(), title=titles_work_third.keys())
def regr_multivar(model, title):
    print_model_stats(title, models["third"][model][title], titles_split_third[title], features_third[title])
    plot_model_on_data(titles_work_third[title][features_third[title][:models["third"][model][title].n_features_in_]], titles_work_third[title][features_third[title][-1]], title, features_third[title], models["third"][model][title])

interactive(children=(Dropdown(description='model', options=('poly', 'ridge', 'lasso'), value='poly'), Dropdow…

## Confronto tra modelli

Il valore R^2 che determina lo score di un modello risulta molto per tutti e 3 i modelli, questo perchè il dominio dell'andamento dei titoli di borsa prevede variazioni molto piccole; inoltre, le features di riferimento per l'addestramento, risultano molto correlate con la variabile da predire.

Per poter stimare un effettivo guadagno, è necessario introdurre nuove misure, una è il Return On Investment (ROI), che quantifica l'effettivo ritorno di un investimento.

L'investimento consiste nella vendita o acquisto, data la previsione $C_{pred}$ della chiusura di giornata, di un titolo.
- Qualora il valore $C_{pred}$ sia maggiore del valore di apertura $O$, un titolo viene acquistato in apertura e rivenduto in chiusura, comportando il guadagno $G$ = $C$ - $O$
- Qualora il valore $C_{pred}$ sia minore del valore di apertura $O$, il titolo viene venduto allo scoperto, ciò prevede la vendita di un titolo che non si possiede, a discapito poi, in fase di chiusura di giornata, dell'acquisto del suddetto, comportando un guadagno $G$ = $O$ - $C$

In [29]:
def gain(dataset, model, features):
    C_pred = model.predict(dataset[features[:model.n_features_in_]])
    G = dataset[features[-1]] - dataset[features[0]]
    growth = C_pred > dataset[features[0]]
    decline = C_pred < dataset[features[0]]
    return G[growth].sum() - G[decline].sum()

In [30]:
def roi(dataset, model, features):
    mean_open = dataset[features[0]].mean()
    return gain(dataset, model, features) / mean_open

In [31]:
def print_eval(dataset, model, title, features):
    print(title)
    print(f"Gain: {gain(dataset, model, features):.2f}$")
    print(f" ROI: {roi(dataset, model, features):.2%}")

## Guadagno e ROI dei modelli

Per ottenere le metriche effettive, queste vengono calcolate solo sul validation set.

In [66]:
# Calcolo il ROI e il guadagno per ogni modello e per ogni titolo
gains_first = {}
rois_first = {}
investment_first = {}
for model_name, model in models["first"].items():
    #print(model_name)
    gain_first_title = {}
    roi_first_title = {}
    for title_name, title in titles_work_first.items():
        dframe = pd.merge(titles_split_first[title_name][1][features_first[:model[title_name].n_features_in_]], titles_split_first[title_name][3], left_index=True, right_index=True).dropna()
        #display(dframe)
        #print_eval(titles_work_first[title_name], model[title_name], title_name, features_first)
        gain_first_title[title_name] = gain(dframe, model[title_name], features_first)
        roi_first_title[title_name] = roi(dframe, model[title_name], features_first)
        if(len(investment_first) < len(titles_work_first.keys())):
            investment_first[title_name] = dframe[features_first[0]].mean()
    gains_first[model_name] = gain_first_title
    rois_first[model_name] = roi_first_title
    
@interact(model=models["first"].keys())
def show_model_stats(model):
    stats_to_print_first = {}
    gain_first_title = gains_first[model]
    roi_first_title = rois_first[model]
    model_first_gain = sum(gain_first_title.values())
    model_first_investment = sum(investment_first.values())
    model_first_roi = model_first_gain / model_first_investment
    stats_to_print_first["Gain"] = [f"{value:.2f}$" for value in gain_first_title.values()]
    stats_to_print_first["Roi"] = [f"{value:.2%}" for value in roi_first_title.values()]
    stats_to_print_first["Gain"].append(f"{model_first_gain:.2f}$")
    stats_to_print_first["Roi"].append(f"{model_first_roi:.2%}")
    titles_list = list(titles_work_first.keys())
    titles_list.append("Total")
    display(pd.DataFrame(data=stats_to_print_first, index=titles_list))
    #print(f"Guadagno totale: {model_first_gain:.2f}$ con base di investimento {model_first_investment:.2f}$ e roi del {model_first_roi:.2%}\n")

interactive(children=(Dropdown(description='model', options=('poly', 'ridge', 'lasso'), value='poly'), Output(…

In [67]:
gains_second = {}
rois_second = {}
investment_second = {}
for model_name, model in models["second"].items():
    #print(model_name)
    gain_second_title = {}
    roi_second_title = {}
    for title_name, title in titles_work_second.items():
        dframe = pd.merge(titles_split_second[title_name][1][features_second[:model[title_name].n_features_in_]], titles_split_second[title_name][3], left_index=True, right_index=True).dropna() #pd.DataFrame(frame)
        #display(dframe)
        #print_eval(dframe, model[title_name], title_name, features_second)
        gain_second_title[title_name] = gain(dframe, model[title_name], features_second)
        roi_second_title[title_name] = roi(dframe, model[title_name], features_second)
        if(len(investment_second) < len(titles_work_second.keys())):
            investment_second[title_name] = dframe[features_second[0]].mean()
    gains_second[model_name] = gain_second_title
    rois_second[model_name] = roi_second_title
    
@interact(model=models["second"].keys())
def show_model_stats(model):
    stats_to_print_second = {}
    gain_second_title = gains_second[model]
    roi_second_title = rois_second[model]
    model_second_gain = sum(gain_second_title.values())
    model_second_investment = sum(investment_second.values())
    model_second_roi = model_second_gain / model_second_investment
    stats_to_print_second["Gain"] = [f"{value:.2f}$" for value in gain_second_title.values()]
    stats_to_print_second["Roi"] = [f"{value:.2%}" for value in roi_second_title.values()]
    stats_to_print_second["Gain"].append(f"{model_second_gain:.2f}$")
    stats_to_print_second["Roi"].append(f"{model_second_roi:.2%}")
    titles_list = list(titles_work_second.keys())
    titles_list.append("Total")
    display(pd.DataFrame(data=stats_to_print_second, index=titles_list))
    #print(f"Guadagno totale: {model_second_gain:.2f}$ con base di investimento {model_second_investment:.2f}$ e roi del {model_second_roi:.2%}\n")

interactive(children=(Dropdown(description='model', options=('poly', 'ridge', 'lasso'), value='poly'), Output(…

In [68]:
gains_third = {}
rois_third = {}
investment_third = {}
for model_name, model in models["third"].items():
    #print(model_name)
    gain_third_title = {}
    roi_third_title = {}
    for title_name, title in titles_work_third.items():
        dframe = pd.merge(titles_split_third[title_name][1][features_third[title_name][:model[title_name].n_features_in_]], titles_split_third[title_name][3], left_index=True, right_index=True).dropna() #pd.DataFrame(frame)
        #display(dframe)
        #print_eval(dframe, model[title_name], title_name, features_third[title_name])
        gain_third_title[title_name] = gain(dframe, model[title_name], features_third[title_name])
        roi_third_title[title_name] = roi(dframe, model[title_name], features_third[title_name])
        if(len(investment_third) < len(titles_work_third.keys())):
            investment_third[title_name] = dframe[features_third[title_name][0]].mean()
    gains_third[model_name] = gain_third_title
    rois_third[model_name] = roi_third_title
    
@interact(model=models["third"].keys())
def show_model_stats(model):
    stats_to_print_third = {}
    gain_third_title = gains_third[model]
    roi_third_title = rois_third[model]
    model_third_gain = sum(gain_third_title.values())
    model_third_investment = sum(investment_third.values())
    model_third_roi = model_third_gain / model_third_investment
    stats_to_print_third["Gain"] = [f"{value:.2f}$" for value in gain_third_title.values()]
    stats_to_print_third["Roi"] = [f"{value:.2%}" for value in roi_third_title.values()]
    stats_to_print_third["Gain"].append(f"{model_third_gain:.2f}$")
    stats_to_print_third["Roi"].append(f"{model_third_roi:.2%}")
    titles_list = list(titles_work_third.keys())
    titles_list.append("Total")
    display(pd.DataFrame(data=stats_to_print_third, index=titles_list))
    #print(f"Guadagno totale: {model_third_gain:.2f}$ con base di investimento {model_third_investment:.2f}$ e roi del {model_third_roi:.2%}\n")

interactive(children=(Dropdown(description='model', options=('poly', 'ridge', 'lasso'), value='poly'), Output(…

## Valutazione con K-fold-cross

Per ottenere una valutazione più veritiera, una volta aver addestrato i modelli misuro le metriche di score su tutte le osservazioni, dividendole in training e validation set differenti.

In [69]:
kf = KFold(7, shuffle=True, random_state=42)

In [70]:
results = {}
for model_name, model in models["first"].items():
    results_model = {}
    for title_name, title in titles_work_first.items():
        cv_result = cross_validate(model[title_name], title[features_first[:model[title_name].n_features_in_]], title[features_first[-1]], cv=kf, return_train_score=True)
        cv_table = pd.DataFrame(cv_result)
        results_model[title_name] = cv_table
    results[model_name] = results_model

@interact(model=models["first"].keys(), title=titles_work_first.keys())
def show_model_stats(model, title):
    table = results[model][title]
    display(table.agg(["mean", "std"]))

interactive(children=(Dropdown(description='model', options=('poly', 'ridge', 'lasso'), value='poly'), Dropdow…

In [71]:
results = {}
for model_name, model in models["second"].items():
    results_model = {}
    for title_name, title in titles_work_second.items():
        cv_result = cross_validate(model[title_name], titles_work_second[title_name][features_second[:model[title_name].n_features_in_]], titles_work_second[title_name][features_second[-1]], cv=kf, return_train_score=True)
        cv_table = pd.DataFrame(cv_result)
        results_model[title_name] = cv_table
    results[model_name] = results_model

@interact(model=models["second"].keys(), title=titles_work_second.keys())
def show_model_stats(model, title):
    table = results[model][title]
    display(table.agg(["mean", "std"]))

interactive(children=(Dropdown(description='model', options=('poly', 'ridge', 'lasso'), value='poly'), Dropdow…

In [72]:
results = {}
for model_name, model in models["third"].items():
    results_model = {}
    for title_name, title in titles_work_third.items():
        cv_result = cross_validate(model[title_name], titles_work_third[title_name][features_third[title_name][:model[title_name].n_features_in_]], titles_work_third[title_name][features_third[title_name][-1]], cv=kf, return_train_score=True)
        cv_table = pd.DataFrame(cv_result)
        results_model[title_name] = cv_table
    results[model_name] = results_model

@interact(model=models["third"].keys(), title=titles_work_third.keys())
def show_model_stats(model, title):
    table = results[model][title]
    display(table.agg(["mean", "std"]))

interactive(children=(Dropdown(description='model', options=('poly', 'ridge', 'lasso'), value='poly'), Dropdow…

## Individuazione parametri migliori con grid search

Al fine di ottenere gli iperparametri migliori, eseguo una grid search su più valori di iperparametri.

In [73]:
from sklearn.model_selection import GridSearchCV

best_params = {
    "first": {},
    "second": {},
    "third": {}
}

model_with_best_params = {
    "first": {},
    "second": {},
    "third": {}
}

In [74]:
# Definisco i parametri su cui effettuare la ricerca, lasciandoli parametrici per consentire in maniera più agile di
# definirli in maniera personalizzata ad ogni modello
degrees = lambda n: range(1, n)
scalers = [None, StandardScaler()]
alphas = lambda n: np.linspace(0.1, 10, n)

In [75]:
for model_name, model in models["first"].items():
    best_params_model = {}
    model_fitted_best_params = {}
    for title_name, title in titles_work_first.items():
        grid = {}
        if("poly" in model[title_name].named_steps.keys()):
            grid["poly__degree"] = degrees(3)
        if("scale" in model[title_name].named_steps.keys()):
            grid["scale"] = scalers
        if("ridge_regr" in model[title_name].named_steps.keys()):
            grid["ridge_regr__alpha"] = alphas(5)
        if("lasso_regr" in model[title_name].named_steps.keys()):
            grid["lasso_regr__alpha"] = alphas(5)
        gs = GridSearchCV(model[title_name], grid, cv=kf)
        gs.fit(title[features_first[:model[title_name].n_features_in_]], title[features_first[-1]]);
        best_params_model[title_name] = pd.DataFrame(gs.cv_results_).sort_values("mean_test_score", ascending=False)
        model_fitted_best_params[title_name] = gs.best_estimator_
    best_params["first"][model_name] = best_params_model
    model_with_best_params["first"][model_name] = model_fitted_best_params
    

@interact(model=models["first"].keys(), title=titles_work_first.keys())
def show_model_best_params(model, title):
    table = best_params["first"][model][title]
    display(table.head(3))

  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = c

  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


interactive(children=(Dropdown(description='model', options=('poly', 'ridge', 'lasso'), value='poly'), Dropdow…

In [76]:
for model_name, model in models["second"].items():
    best_params_model = {}
    model_fitted_best_params = {}
    for title_name, title in titles_work_second.items():
        grid = {}
        if("poly" in model[title_name].named_steps.keys()):
            grid["poly__degree"] = degrees(3)
        if("scale" in model[title_name].named_steps.keys()):
            grid["scale"] = scalers
        if("ridge_regr" in model[title_name].named_steps.keys()):
            grid["ridge_regr__alpha"] = alphas(5)
        if("lasso_regr" in model[title_name].named_steps.keys()):
            grid["lasso_regr__alpha"] = alphas(5)
        gs = GridSearchCV(model[title_name], grid, cv=kf)
        gs.fit(titles_work_second[title_name][features_second[:model[title_name].n_features_in_]], titles_work_second[title_name][features_second[-1]]);
        best_params_model[title_name] = pd.DataFrame(gs.cv_results_).sort_values("mean_test_score", ascending=False)
        model_fitted_best_params[title_name] = gs.best_estimator_
    best_params["second"][model_name] = best_params_model
    model_with_best_params["second"][model_name] = model_fitted_best_params

@interact(model=models["second"].keys(), title=titles_work_second.keys())
def show_model_best_params(model, title):
    table = best_params["second"][model][title]
    display(table.head(3))

interactive(children=(Dropdown(description='model', options=('poly', 'ridge', 'lasso'), value='poly'), Dropdow…

In [77]:
for model_name, model in models["third"].items():
    best_params_model = {}
    model_fitted_best_params = {}
    for title_name, title in titles_work_third.items():
        grid = {}
        if("poly" in model[title_name].named_steps.keys()):
            grid["poly__degree"] = degrees(3)
        if("scale" in model[title_name].named_steps.keys()):
            grid["scale"] = scalers
        if("ridge_regr" in model[title_name].named_steps.keys()):
            grid["ridge_regr__alpha"] = alphas(5)
        if("lasso_regr" in model[title_name].named_steps.keys()):
            grid["lasso_regr__alpha"] = alphas(5)
        gs = GridSearchCV(model[title_name], grid, cv=kf)
        gs.fit(titles_work_third[title_name][features_third[title_name][:model[title_name].n_features_in_]], titles_work_third[title_name][features_third[title_name][-1]]);
        best_params_model[title_name] = pd.DataFrame(gs.cv_results_).sort_values("mean_test_score", ascending=False)
        model_fitted_best_params[title_name] = gs.best_estimator_
    best_params["third"][model_name] = best_params_model
    model_with_best_params["third"][model_name] = model_fitted_best_params

@interact(model=models["third"].keys(), title=titles_work_third.keys())
def show_model_best_params(model, title):
    table = best_params["third"][model][title]
    display(table.head(3))

  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return li

  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return li

  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return li

  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return li

  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return li

  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return li

  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return linalg.solve(A, Xy, sym_pos=True,
  return li

interactive(children=(Dropdown(description='model', options=('poly', 'ridge', 'lasso'), value='poly'), Dropdow…

In [78]:
gains_best_first = {}
rois_best_first = {}
investment_first = {}
for model_name, model in model_with_best_params["first"].items():
    #print(model_name)
    gain_first_title = {}
    roi_first_title = {}
    for title_name, title in titles_work_first.items():
        dframe = pd.merge(titles_split_first[title_name][1][features_first[:model[title_name].n_features_in_]], titles_split_first[title_name][3], left_index=True, right_index=True).dropna()
        #display(dframe)
        #print_eval(titles_work_first[title_name], model[title_name], title_name, features_first)
        gain_first_title[title_name] = gain(dframe, model[title_name], features_first)
        roi_first_title[title_name] = roi(dframe, model[title_name], features_first)
        if(len(investment_first) < len(titles_work_first.keys())):
            investment_first[title_name] = dframe[features_first[0]].mean()
    gains_best_first[model_name] = gain_first_title
    rois_best_first[model_name] = roi_first_title
    
@interact(model=model_with_best_params["first"].keys())
def show_model_stats(model):
    stats_to_print_first = {}
    delta_gain_first = {}
    delta_roi_first = {}
    delta_to_print_first = {}
    gain_first_title = gains_best_first[model]
    roi_first_title = rois_best_first[model]
    for title in gains_best_first[model]:
        delta_gain_first[title] = gains_best_first[model][title] - gains_first[model][title]
        delta_roi_first[title] = rois_best_first[model][title] - rois_first[model][title]
    model_first_gain = sum(gain_first_title.values())
    delta_first_gain = sum(delta_gain_first.values())
    model_first_investment = sum(investment_first.values())
    model_first_roi = model_first_gain / model_first_investment
    delta_first_roi = delta_first_gain / model_first_investment
    stats_to_print_first["Gain"] = [f"{value:.2f}$" for value in gain_first_title.values()]
    stats_to_print_first["Roi"] = [f"{value:.2%}" for value in roi_first_title.values()]
    stats_to_print_first["Gain"].append(f"{model_first_gain:.2f}$")
    stats_to_print_first["Roi"].append(f"{model_first_roi:.2%}")
    delta_to_print_first["Delta Gain"] = [f"{value:.2f}$" for value in delta_gain_first.values()]
    delta_to_print_first["Delta Roi"] = [f"{value:.2%}" for value in delta_roi_first.values()]
    delta_to_print_first["Delta Gain"].append(f"{delta_first_gain:.2f}$")
    delta_to_print_first["Delta Roi"].append(f"{delta_first_roi:.2%}")
    titles_list = list(titles_work_first.keys())
    titles_list.append("Total")
    display(pd.DataFrame(data=stats_to_print_first, index=titles_list))
    print("Differenza tra il modello addestrato con i migliori parametri e quello precedentemente addestrato:")
    display(pd.DataFrame(data=delta_to_print_first, index=titles_list))
    #print(f"Guadagno totale: {model_first_gain:.2f}$ con base di investimento {model_first_investment:.2f}$ e roi del {model_first_roi:.2%}\n")

interactive(children=(Dropdown(description='model', options=('poly', 'ridge', 'lasso'), value='poly'), Output(…

In [79]:
gains_best_second = {}
rois_best_second = {}
investment_second = {}
for model_name, model in model_with_best_params["second"].items():
    #print(model_name)
    gain_second_title = {}
    roi_second_title = {}
    for title_name, title in titles_work_second.items():
        dframe = pd.merge(titles_split_second[title_name][1][features_second[:model[title_name].n_features_in_]], titles_split_second[title_name][3], left_index=True, right_index=True).dropna() #pd.DataFrame(frame)
        #display(dframe)
        #print_eval(dframe, model[title_name], title_name, features_second)
        gain_second_title[title_name] = gain(dframe, model[title_name], features_second)
        roi_second_title[title_name] = roi(dframe, model[title_name], features_second)
        if(len(investment_second) < len(titles_work_second.keys())):
            investment_second[title_name] = dframe[features_second[0]].mean()
    gains_best_second[model_name] = gain_second_title
    rois_best_second[model_name] = roi_second_title
    
@interact(model=model_with_best_params["second"].keys())
def show_model_stats(model):
    stats_to_print_second = {}
    delta_gain_second = {}
    delta_roi_second = {}
    delta_to_print_second = {}
    gain_second_title = gains_best_second[model]
    roi_second_title = rois_best_second[model]
    for title in gains_best_second[model]:
        delta_gain_second[title] = gains_best_second[model][title] - gains_second[model][title]
        delta_roi_second[title] = rois_best_second[model][title] - rois_second[model][title]
    model_second_gain = sum(gain_second_title.values())
    delta_second_gain = sum(delta_gain_second.values())
    model_second_investment = sum(investment_second.values())
    model_second_roi = model_second_gain / model_second_investment
    delta_second_roi = delta_second_gain / model_second_investment
    stats_to_print_second["Gain"] = [f"{value:.2f}$" for value in gain_second_title.values()]
    stats_to_print_second["Roi"] = [f"{value:.2%}" for value in roi_second_title.values()]
    stats_to_print_second["Gain"].append(f"{model_second_gain:.2f}$")
    stats_to_print_second["Roi"].append(f"{model_second_roi:.2%}")
    delta_to_print_second["Delta Gain"] = [f"{value:.2f}$" for value in delta_gain_second.values()]
    delta_to_print_second["Delta Roi"] = [f"{value:.2%}" for value in delta_roi_second.values()]
    delta_to_print_second["Delta Gain"].append(f"{delta_second_gain:.2f}$")
    delta_to_print_second["Delta Roi"].append(f"{delta_second_roi:.2%}")
    titles_list = list(titles_work_second.keys())
    titles_list.append("Total")
    display(pd.DataFrame(data=stats_to_print_second, index=titles_list))
    print("Differenza tra il modello addestrato con i migliori parametri e quello precedentemente addestrato:")
    display(pd.DataFrame(data=delta_to_print_second, index=titles_list))
    #print(f"Guadagno totale: {model_second_gain:.2f}$ con base di investimento {model_second_investment:.2f}$ e roi del {model_second_roi:.2%}\n")

interactive(children=(Dropdown(description='model', options=('poly', 'ridge', 'lasso'), value='poly'), Output(…

In [80]:
gains_best_third = {}
rois_best_third = {}
investment_third = {}
for model_name, model in model_with_best_params["third"].items():
    #print(model_name)
    gain_third_title = {}
    roi_third_title = {}
    for title_name, title in titles_work_third.items():
        dframe = pd.merge(titles_split_third[title_name][1][features_third[title_name][:model[title_name].n_features_in_]], titles_split_third[title_name][3], left_index=True, right_index=True).dropna() #pd.DataFrame(frame)
        #display(dframe)
        #print_eval(dframe, model[title_name], title_name, features_third[title_name])
        gain_third_title[title_name] = gain(dframe, model[title_name], features_third[title_name])
        roi_third_title[title_name] = roi(dframe, model[title_name], features_third[title_name])
        if(len(investment_third) < len(titles_work_third.keys())):
            investment_third[title_name] = dframe[features_third[title_name][0]].mean()
    gains_best_third[model_name] = gain_third_title
    rois_best_third[model_name] = roi_third_title
    
@interact(model=model_with_best_params["third"].keys())
def show_model_stats(model):
    stats_to_print_third = {}
    delta_gain_third = {}
    delta_roi_third = {}
    delta_to_print_third = {}
    gain_third_title = gains_best_third[model]
    roi_third_title = rois_best_third[model]
    for title in gains_best_third[model]:
        delta_gain_third[title] = gains_best_third[model][title] - gains_third[model][title]
        delta_roi_third[title] = rois_best_third[model][title] - rois_third[model][title]
    model_third_gain = sum(gain_third_title.values())
    delta_third_gain = sum(delta_gain_third.values())
    model_third_investment = sum(investment_third.values())
    model_third_roi = model_third_gain / model_third_investment
    delta_third_roi = delta_third_gain / model_third_investment
    stats_to_print_third["Gain"] = [f"{value:.2f}$" for value in gain_third_title.values()]
    stats_to_print_third["Roi"] = [f"{value:.2%}" for value in roi_third_title.values()]
    stats_to_print_third["Gain"].append(f"{model_third_gain:.2f}$")
    stats_to_print_third["Roi"].append(f"{model_third_roi:.2%}")
    delta_to_print_third["Delta Gain"] = [f"{value:.2f}$" for value in delta_gain_third.values()]
    delta_to_print_third["Delta Roi"] = [f"{value:.2%}" for value in delta_roi_third.values()]
    delta_to_print_third["Delta Gain"].append(f"{delta_third_gain:.2f}$")
    delta_to_print_third["Delta Roi"].append(f"{delta_third_roi:.2%}")
    titles_list = list(titles_work_third.keys())
    titles_list.append("Total")
    display(pd.DataFrame(data=stats_to_print_third, index=titles_list))
    print("Differenza tra il modello addestrato con i migliori parametri e quello precedentemente addestrato:")
    display(pd.DataFrame(data=delta_to_print_third, index=titles_list))
    #print(f"Guadagno totale: {model_third_gain:.2f}$ con base di investimento {model_third_investment:.2f}$ e roi del {model_third_roi:.2%}\n")

interactive(children=(Dropdown(description='model', options=('poly', 'ridge', 'lasso'), value='poly'), Output(…

## Osservazioni

Dall'analisi in questione, il modello più promettente, a livello di guadagno e ROI, risulta essere il secondo all'aumentare dei giorni precedenti presi in considerazione.
Addestrando i modelli con i migliori parametri trovati con la grid search ottengo:
- Il ROI del primo modello, considerando la media dei 3 sotto-modelli, si appresta intorno allo 0%.
- Il ROI medio relativo al modello 2, considerando 30 giorni,  è $\approx$ 180%.
- Il ROI del terzo modello è $\approx$ 120%.

Queste metriche si basano sui modelli addestrati con le osservazioni fino al 2014, e analizzati dal 2015 al 2022, per un totale di circa 8 anni.

Utilizzando invece la funzione train_test_split interna a sklearn per dividere il training e validation set con seed=22 ottengo valori di ROI medi molto più alti, rispettivamente
- 130%
- 170%
- 180%

Questo risultato può essere dovuto al fatto che il training set selezionato sulla base degli anni e non in maniera casuale mantenendo le proporzioni impostate renda il modello overfitted, e non in grado di prevedere andamenti completamente differenti.

Effettivamente, analizzando gli andamenti dei titoli di borsa, questi tendono ad avere andamenti regolari per diversi periodi.
Addestrando in maniera mirata su un lasso di anni si verifica quindi un adattamento stretto all'andamento del periodo, comportando un'inferiore capacità di adattamento quando il modello viene poi utilizzato per predire l'andamento di valori in intervalli di tempo così discostati rispetto a quelli del training set.

## Analisi sulla correlazione delle features

Il dato più significativo è quello di apertura, seguito dal valore più basso, e, infine, influenza negativamente il valore più alto

- Modello 1: La regolarizzazione Lasso tende ad eliminare le due features meno influenti, High e Low.
- Modello 2: I valori di chiusura dei giorni precedenti non vengono in generale eliminati dalla regolarizzazione Lasso. Possibile effettuare più prove in base al numero di giorni precedenti considerati, va però annotato che ogni giorno aggiunto rimuove una possibile osservazione.
- Modello 3: Necessario considerare anche i prodotti data la dipendenza dei valori di apertura; la regolarizzazione Lasso non li elimina dall'analisi. Possibile, al fine di migliorare il ROI, considerare anche i valori di apertura dei titoli nei giorni precedenti.