#### Logistic Regression in Python - Predicting if the stock market is going Up or Down
https://www.youtube.com/watch?v=X9jjyh0p7x8&t=1302s


#### What is the logistic function?

Logistic regression is a statistical model (the logit model) used in machine learning classification algorithms to obtain the probability of belonging to a given class.

Classification based on logistic regression is a supervised ML algorithm.

It relies on the logistic (sigmoid) function, which maps any real value to a value between 0 and 1; this output is read as the probability that an instance belongs to a given class (in our case the classes are a rising market and a falling market).

__EXAMPLE__: the more we study, the higher the probability of passing the exam; the less we study, the lower that probability.

__NB__: although the name "logistic regression" suggests a regression algorithm, because the model passes a linear combination of the inputs (as in linear regression) through the logistic function, it is used here as a _classification_ algorithm.

In the training phase the algorithm receives a training dataset of N examples. Each example consists of m attributes X and a label y giving the correct classification.

The algorithm finds a weight vector W to associate with the attribute vector X of the examples, chosen so that the predicted probabilities agree as closely as possible with the observed labels (maximum likelihood), with the aim of getting as many classifications right as possible.

The linear combination z of the weights W and the attributes X gives the system's response for each example in the training dataset.
In logistic regression, z is the argument of the logistic function, which turns it into a value between 0 and 1.
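As a minimal sketch of the two steps just described (linear combination, then logistic function), the snippet below uses made-up weights and attributes. The names `sigmoid`, `w` and `x` are illustrative and do not come from the notebook.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and attributes for a single example (illustrative values).
w = [0.5, -1.0, 2.0]
x = [1.0, 0.3, 0.1]   # the leading 1.0 plays the role of the constant term

# Linear combination z = w . x
z = sum(wi * xi for wi, xi in zip(w, x))

# Probability that the example belongs to class 1
p = sigmoid(z)
print(z, p)
```

With a threshold of 0.5, this example would be assigned to class 1 because `p` is above 0.5.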

The logistic function is also used as an activation function for the nodes of a neural network.

https://www.eage.it/machine-learning/regressione-logistica

https://en.wikipedia.org/wiki/Logistic_function


Goal of the exercise: predict whether the stock market will go up or down tomorrow, using lagged market information.

In [1]:
from datetime import date, datetime
import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
import statsmodels.api as sm  # we use statsmodels, but the logistic model is also available in scikit-learn

In [2]:
start_date = '2014-5-31'
end_date = '2024-5-29'

In [3]:
data = yf.download('^GSPC', start_date, end_date) 

[*********************100%%**********************]  1 of 1 completed


In [4]:
data

Date,Open,High,Low,Close,Adj Close,Volume
2014-06-02,1923.869995,1925.880005,1915.979980,1924.969971,1924.969971,2509020000
2014-06-03,1923.069946,1925.069946,1918.790039,1924.239990,1924.239990,2867180000
2014-06-04,1923.060059,1928.630005,1918.599976,1927.880005,1927.880005,2793920000
2014-06-05,1928.520020,1941.739990,1922.930054,1940.459961,1940.459961,3113270000
2014-06-06,1942.410034,1949.439941,1942.410034,1949.439941,1949.439941,2864300000
...,...,...,...,...,...,...
2024-05-21,5298.689941,5324.319824,5297.870117,5321.410156,5321.410156,3662240000
2024-05-22,5319.279785,5323.180176,5286.009766,5307.009766,5307.009766,3847130000
2024-05-23,5340.259766,5341.879883,5256.930176,5267.839844,5267.839844,3869520000
2024-05-24,5281.450195,5311.649902,5278.390137,5304.720215,5304.720215,3005510000


In [5]:
df = data['Adj Close'].pct_change() * 100  # daily returns, in percent

In [6]:
df = df.rename("Today")
df

Date
2014-06-02         NaN
2014-06-03   -0.037922
2014-06-04    0.189166
2014-06-05    0.652528
2014-06-06    0.462776
                ...   
2024-05-21    0.250187
2024-05-22   -0.270612
2024-05-23   -0.738079
2024-05-24    0.700104
2024-05-28    0.024880
Name: Today, Length: 2515, dtype: float64
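To make the return calculation concrete, here is a small sketch that applies `pct_change()` to the first three adjusted closes shown in the price table and checks one value by hand. The variable names are illustrative.

```python
import pandas as pd

# First three adjusted closes of ^GSPC, copied from the price table above.
close = pd.Series([1924.969971, 1924.239990, 1927.880005])

# pct_change() computes (P_t / P_{t-1}) - 1; multiplying by 100 gives percent.
returns = close.pct_change() * 100

# The same quantity written out by hand for the second observation.
manual = (close.iloc[1] / close.iloc[0] - 1) * 100

print(returns.round(6).tolist())
print(round(manual, 6))
```

The first return is `NaN` because there is no previous price to compare with, which is why the first row of `Today` is empty.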

In [7]:
df = df.reset_index()
df

,Date,Today
0,2014-06-02,
1,2014-06-03,-0.037922
2,2014-06-04,0.189166
3,2014-06-05,0.652528
4,2014-06-06,0.462776
...,...,...
2510,2024-05-21,0.250187
2511,2024-05-22,-0.270612
2512,2024-05-23,-0.738079
2513,2024-05-24,0.700104


We create the columns with the lagged return values

In [8]:
for i in range(1,6):
    df['Lag_' + str(i)] = df['Today'].shift(i)

In [9]:
df.head(8)

,Date,Today,Lag_1,Lag_2,Lag_3,Lag_4,Lag_5
0,2014-06-02,,,,,,
1,2014-06-03,-0.037922,,,,,
2,2014-06-04,0.189166,-0.037922,,,,
3,2014-06-05,0.652528,0.189166,-0.037922,,,
4,2014-06-06,0.462776,0.652528,0.189166,-0.037922,,
5,2014-06-09,0.093877,0.462776,0.652528,0.189166,-0.037922,
6,2014-06-10,-0.024598,0.093877,0.462776,0.652528,0.189166,-0.037922
7,2014-06-11,-0.353704,-0.024598,0.093877,0.462776,0.652528,0.189166


Each row contains the current day's return and the returns of the previous days (1, 2, 3, 4 and 5 days earlier).
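The effect of `shift` can be seen on a tiny example. This sketch uses the first four returns from the output above and builds only two lags; the frame name `toy` is illustrative.

```python
import pandas as pd

# First few daily returns (percent), copied from the output above.
toy = pd.DataFrame({"Today": [-0.037922, 0.189166, 0.652528, 0.462776]})

# shift(i) moves the column down by i rows, so row t holds the value from t - i.
for i in range(1, 3):
    toy[f"Lag_{i}"] = toy["Today"].shift(i)

print(toy)
```

The first `i` rows of each `Lag_i` column are `NaN`, since there is no earlier observation to copy from.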

We add the previous day's volume

In [10]:
df['Volume'] = data.Volume.shift(1).values / 1_000_000_000  # previous day's volume, in billions

In [14]:
df

,Date,Today,Lag_1,Lag_2,Lag_3,Lag_4,Lag_5,Volume
6,2014-06-10,-0.024598,0.093877,0.462776,0.652528,0.189166,-0.037922,2.81218
7,2014-06-11,-0.353704,-0.024598,0.093877,0.462776,0.652528,0.189166,2.70236
8,2014-06-12,-0.708889,-0.353704,-0.024598,0.093877,0.462776,0.652528,2.71062
9,2014-06-13,0.313456,-0.708889,-0.353704,-0.024598,0.093877,0.462776,3.04048
10,2014-06-16,0.083671,0.313456,-0.708889,-0.353704,-0.024598,0.093877,2.59823
...,...,...,...,...,...,...,...,...
2510,2024-05-21,0.250187,0.091639,0.116477,-0.208167,1.171593,0.483781,3.42010
2511,2024-05-22,-0.270612,0.250187,0.091639,0.116477,-0.208167,1.171593,3.66224
2512,2024-05-23,-0.738079,-0.270612,0.250187,0.091639,0.116477,-0.208167,3.84713
2513,2024-05-24,0.700104,-0.738079,-0.270612,0.250187,0.091639,0.116477,3.86952


In [15]:
df = df.dropna()
df

,Date,Today,Lag_1,Lag_2,Lag_3,Lag_4,Lag_5,Volume
6,2014-06-10,-0.024598,0.093877,0.462776,0.652528,0.189166,-0.037922,2.81218
7,2014-06-11,-0.353704,-0.024598,0.093877,0.462776,0.652528,0.189166,2.70236
8,2014-06-12,-0.708889,-0.353704,-0.024598,0.093877,0.462776,0.652528,2.71062
9,2014-06-13,0.313456,-0.708889,-0.353704,-0.024598,0.093877,0.462776,3.04048
10,2014-06-16,0.083671,0.313456,-0.708889,-0.353704,-0.024598,0.093877,2.59823
...,...,...,...,...,...,...,...,...
2510,2024-05-21,0.250187,0.091639,0.116477,-0.208167,1.171593,0.483781,3.42010
2511,2024-05-22,-0.270612,0.250187,0.091639,0.116477,-0.208167,1.171593,3.66224
2512,2024-05-23,-0.738079,-0.270612,0.250187,0.091639,0.116477,-0.208167,3.84713
2513,2024-05-24,0.700104,-0.738079,-0.270612,0.250187,0.091639,0.116477,3.86952


We create the column with the market movements

In [16]:
df.loc[:, 'Direction'] = [1 if i > 0 else 0 for i in df['Today']]  # 1 = up day, 0 = flat or down day

In [17]:
df.head()

,Date,Today,Lag_1,Lag_2,Lag_3,Lag_4,Lag_5,Volume,Direction
6,2014-06-10,-0.024598,0.093877,0.462776,0.652528,0.189166,-0.037922,2.81218,0
7,2014-06-11,-0.353704,-0.024598,0.093877,0.462776,0.652528,0.189166,2.70236,0
8,2014-06-12,-0.708889,-0.353704,-0.024598,0.093877,0.462776,0.652528,2.71062,0
9,2014-06-13,0.313456,-0.708889,-0.353704,-0.024598,0.093877,0.462776,3.04048,1
10,2014-06-16,0.083671,0.313456,-0.708889,-0.353704,-0.024598,0.093877,2.59823,1


We add a constant column, otherwise the regression has no intercept

In [18]:
df = sm.add_constant(df)

In [19]:
df.head()

,const,Date,Today,Lag_1,Lag_2,Lag_3,Lag_4,Lag_5,Volume,Direction
6,1.0,2014-06-10,-0.024598,0.093877,0.462776,0.652528,0.189166,-0.037922,2.81218,0
7,1.0,2014-06-11,-0.353704,-0.024598,0.093877,0.462776,0.652528,0.189166,2.70236,0
8,1.0,2014-06-12,-0.708889,-0.353704,-0.024598,0.093877,0.462776,0.652528,2.71062,0
9,1.0,2014-06-13,0.313456,-0.708889,-0.353704,-0.024598,0.093877,0.462776,3.04048,1
10,1.0,2014-06-16,0.083671,0.313456,-0.708889,-0.353704,-0.024598,0.093877,2.59823,1
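To see why the constant matters: without an intercept, z is 0 whenever all predictors are 0, and the logistic function then returns exactly 0.5. The sketch below uses hypothetical coefficient values, chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

lags = [0.0, 0.0]            # all predictors equal to zero
weights = [-0.09, 0.01]      # hypothetical slope coefficients
intercept = 0.14             # hypothetical intercept

z_no_const = sum(w * x for w, x in zip(weights, lags))
z_with_const = intercept + z_no_const

p_no_const = sigmoid(z_no_const)      # forced to 0.5
p_with_const = sigmoid(z_with_const)  # free to differ from 0.5
print(p_no_const, p_with_const)
```

The `const` column of ones lets the model estimate a baseline probability of an up day that is not tied to 0.5.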


In [20]:
X = df[['const', 'Lag_1', 'Lag_2', 'Lag_3', 'Lag_4', 'Lag_5', 'Volume']] # independent variables

In [21]:
y = df.Direction # dependent variable

In [22]:
model = sm.Logit(y,X)
result = model.fit()

Optimization terminated successfully.
         Current function value: 0.689207
         Iterations 4


In [24]:
result.summary()

0,1,2,3
Dep. Variable:,Direction,No. Observations:,2509.0
Model:,Logit,Df Residuals:,2502.0
Method:,MLE,Df Model:,6.0
Date:,"Thu, 30 May 2024",Pseudo R-squ.:,0.002017
Time:,11:43:58,Log-Likelihood:,-1729.2
converged:,True,LL-Null:,-1732.7
Covariance Type:,nonrobust,LLR p-value:,0.3218

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,0.1414,0.174,0.812,0.417,-0.200,0.483
Lag_1,-0.0888,0.037,-2.384,0.017,-0.162,-0.016
Lag_2,0.0050,0.037,0.134,0.893,-0.068,0.078
Lag_3,-0.0211,0.037,-0.568,0.570,-0.094,0.052
Lag_4,-0.0241,0.037,-0.653,0.514,-0.097,0.048
Lag_5,-0.0185,0.037,-0.503,0.615,-0.091,0.054
Volume,0.0022,0.043,0.053,0.958,-0.081,0.086


The only coefficient that is statistically significant at the 5% level is that of Lag_1 (p = 0.017).
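One common way to read a logit coefficient is as an odds ratio. The sketch below uses the rounded Lag_1 values from the summary table; the variable names are illustrative.

```python
import math

coef_lag1 = -0.0888   # Lag_1 coefficient from the summary table
std_err = 0.037       # its standard error (rounded in the table)

# exp(coef) is the multiplicative change in the odds of an up day
# for a one-percentage-point increase in yesterday's return.
odds_ratio = math.exp(coef_lag1)

# The z statistic is coef / std err; the table reports -2.384 because
# it is computed from unrounded values.
z_stat = coef_lag1 / std_err

print(round(odds_ratio, 3), round(z_stat, 2))
```

An odds ratio of about 0.915 means that a one-point higher return yesterday is associated with roughly 8.5% lower odds of an up day today, which points to mild mean reversion.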

In [25]:
prediction = result.predict(X) 
prediction  # the forecast is expressed as the probability of a rise

6       0.530965
7       0.530235
8       0.538311
9       0.549615
10      0.530516
          ...   
2510    0.527144
2511    0.527153
2512    0.543493
2513    0.550980
2514    0.520064
Length: 2509, dtype: float64
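As a check on what `predict` does, the first probability can be reproduced approximately by hand: multiply the rounded coefficients from the summary by the first row's predictors and apply the logistic function. The dictionaries below are an illustrative sketch, and small differences come from rounding.

```python
import math

# Rounded coefficients from the summary table above.
coef = {"const": 0.1414, "Lag_1": -0.0888, "Lag_2": 0.0050, "Lag_3": -0.0211,
        "Lag_4": -0.0241, "Lag_5": -0.0185, "Volume": 0.0022}

# Predictor values for the first row (index 6, 2014-06-10).
row = {"const": 1.0, "Lag_1": 0.093877, "Lag_2": 0.462776, "Lag_3": 0.652528,
       "Lag_4": 0.189166, "Lag_5": -0.037922, "Volume": 2.81218}

z = sum(coef[name] * row[name] for name in coef)   # linear combination
p_up = 1.0 / (1.0 + math.exp(-z))                  # logistic function

print(round(p_up, 3))
```

The result agrees with the 0.530965 reported for index 6 to about three decimal places.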

We build a "confusion matrix" that compares predicted rises and falls with actual rises and falls.

In [26]:
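def confusion_matrix(act, pred):
    # Turn probabilities into labels: above 0.5 counts as a predicted rise
    predtrans = ['Up' if i > 0.5 else 'Down' for i in pred]
    # Direction is 1 for an up day and 0 otherwise
    actuals = ['Up' if i > 0 else 'Down' for i in act]
    # Rows are actual outcomes, columns are predictions
    table = pd.crosstab(pd.Series(actuals),
                        pd.Series(predtrans),
                        rownames=['Actual'],
                        colnames=['Predicted'])
    return table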
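A quick way to check how such a function behaves is to run it on a handful of made-up values. The sketch below redefines the helper under a different, hypothetical name (`toy_confusion_matrix`) so that it is self-contained; the labels and probabilities are invented for illustration.

```python
import pandas as pd

def toy_confusion_matrix(act, pred, threshold=0.5):
    """Cross-tabulate actual directions (0/1) against thresholded probabilities."""
    predicted = pd.Series(["Up" if p > threshold else "Down" for p in pred])
    actual = pd.Series(["Up" if a > 0 else "Down" for a in act])
    return pd.crosstab(actual, predicted, rownames=["Actual"], colnames=["Predicted"])

# Made-up labels and probabilities, for illustration only.
actual = [1, 0, 1, 1, 0]
probs = [0.61, 0.55, 0.48, 0.52, 0.45]

cm = toy_confusion_matrix(actual, probs)
print(cm)
```

Here three of the five predictions land on the diagonal (one correct "Down", two correct "Up").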



To gauge the model's predictive ability, we divide the number of cases in which it was right (predicted = actual) by the total number of cases. The correct predictions are the ones on the diagonal.

In [27]:
a = confusion_matrix(y,prediction)
print(a)

Predicted  Down    Up
Actual               
Down         63  1102
Up           40  1304


__IN PLAIN TERMS__: in this matrix, the number of correct predictions is the sum of the numbers on the diagonal.

Accuracy = (True positives + True negatives) / Total population

In [29]:
(63+1304)/2509

0.5448385811080112

The model's predictive ability is slightly better than what one would get by tossing a coin (0.50).
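Accuracy alone hides how the model treats each class. As a sketch, the counts from the confusion matrix above can be used to compute accuracy together with the share of down days and up days that were correctly identified; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above (rows = actual, columns = predicted).
down_down, down_up = 63, 1102
up_down, up_up = 40, 1304

total = down_down + down_up + up_down + up_up

accuracy = (down_down + up_up) / total           # diagonal / total
recall_down = down_down / (down_down + down_up)  # share of down days caught
recall_up = up_up / (up_down + up_up)            # share of up days caught

print(round(accuracy, 4), round(recall_down, 4), round(recall_up, 4))
```

The per-class figures show that the model predicts "Up" on almost every day: it identifies about 97% of up days but only about 5% of down days.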

This estimate has a problem: it is computed on the whole sample, so the model is evaluated on the same data it was fitted to. The series should be split into an estimation or "training" part (train) and a test part.

__IN PLAIN TERMS__: we need to train the model on one portion of the data and then feed it data it has not yet seen.
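For time series, the split is usually made by date rather than at random. The sketch below shows the idea on a made-up frame with one row per year; `toy` and `cutoff_year` are hypothetical names.

```python
import pandas as pd

# Toy frame with one observation per year (values are made up).
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-06-01", "2018-06-01", "2019-06-03", "2020-06-01"]),
    "Lag_1": [0.1, -0.2, 0.3, -0.4],
    "Direction": [1, 0, 1, 0],
})

cutoff_year = 2019  # hypothetical cutoff: train before it, test on it

train = toy[toy["Date"].dt.year < cutoff_year]
test = toy[toy["Date"].dt.year == cutoff_year]

# For time series the split must respect time order: every training
# date precedes every test date, so no future information leaks in.
print(len(train), len(test))
```

Note that a filter of the form `year == cutoff_year` leaves later years out of both sets.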

In [30]:
x_train = df[df.Date.dt.year < 2019][['const', 'Lag_1', 'Lag_2', 'Lag_3', 'Lag_4', 'Lag_5', 'Volume']]
y_train = df[df.Date.dt.year < 2019]['Direction']
x_test = df[df.Date.dt.year == 2019][['const', 'Lag_1', 'Lag_2', 'Lag_3', 'Lag_4', 'Lag_5', 'Volume']]
y_test = df[df.Date.dt.year == 2019]['Direction']

In [31]:
model = sm.Logit(y_train, x_train)

In [32]:
result = model.fit()

Optimization terminated successfully.
         Current function value: 0.689905
         Iterations 4


In [33]:
prediction = result.predict(x_test)

In [35]:
confusion_matrix(y_test, prediction)

Predicted,Down,Up
Actual,,
Down,23,79
Up,28,122


In [36]:
(23+122)/len(x_test)

0.5753968253968254

What happens if we drop all the variables except Lag_1 and Lag_2?

In [37]:
x_train = df[df.Date.dt.year < 2019][['const', 'Lag_1', 'Lag_2']]
y_train = df[df.Date.dt.year < 2019]['Direction']
x_test = df[df.Date.dt.year == 2019][['const', 'Lag_1', 'Lag_2']]
y_test = df[df.Date.dt.year == 2019]['Direction']

In [38]:
model = sm.Logit(y_train, x_train)

In [39]:
result = model.fit()

Optimization terminated successfully.
         Current function value: 0.690401
         Iterations 4


In [40]:
prediction = result.predict(x_test)

In [41]:
confusion_matrix(y_test, prediction)

Predicted,Down,Up
Actual,,
Down,13,89
Up,20,130


In [42]:
(13+130)/len(x_test)

0.5674603174603174
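To put the two test-set accuracies in context, they can be compared with a naive benchmark that always predicts "Up". This sketch uses only the counts from the two test-set confusion matrices above; the helper `accuracy` and the variable names are illustrative.

```python
# Test-set confusion matrices copied from the two outputs above
# (rows = actual Down/Up, columns = predicted Down/Up).
full_model = [[23, 79], [28, 122]]    # Lag_1..Lag_5 and Volume
small_model = [[13, 89], [20, 130]]   # Lag_1 and Lag_2 only

def accuracy(cm):
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(r) for r in cm)
    return correct / total

# Naive benchmark: always predict "Up" and be right on every actual up day.
actual_up = sum(full_model[1])
n_test = sum(sum(r) for r in full_model)
always_up = actual_up / n_test

print(round(accuracy(full_model), 4), round(accuracy(small_model), 4), round(always_up, 4))
```

On these 252 test days both models fall short of the always-"Up" benchmark (about 0.595), so the apparent edge over a coin toss should be read with caution.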

// minute 29:20 of lezione_30maggio_pt2