# Linear Algebra Project
## Markov Chains Applied to Finance

### Students
- Eric Rodrigues das Chagas - 12623971
- Pedro Perez - 7777970
- Gustavo Blois - 13688162






## What are Markov chains?

In mathematics, a Markov chain is a particular kind of stochastic process with discrete states (the parameter, usually time, may be discrete or continuous) in which the probability distribution of the next state depends only on the current state and not on the sequence of events that preceded it. This is called the Markov property, named after the mathematician Andrei Andreyevich Markov. The property, also described as memorylessness, means that previous states are irrelevant for predicting future states once the current state is known.
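
As a minimal sketch of this idea, the snippet below simulates a hypothetical two-state chain. The states, the matrix `P` and the helper `simulate_chain` are illustrative assumptions and are not part of the analysis that follows. Each step is drawn using only the row of `P` that belongs to the current state, which is exactly the Markov property.

```python
import numpy as np

# Hypothetical two-state chain: 0 = "down", 1 = "up" (illustrative values only).
P = np.array([[0.6, 0.4],
              [0.3, 0.7]])

def simulate_chain(P, start, n_steps, seed=0):
    """Sample a path; each step looks only at the current state."""
    rng = np.random.default_rng(seed)
    states = [start]
    for _ in range(n_steps):
        current = states[-1]
        # The next state is drawn from the row of the current state.
        states.append(int(rng.choice(len(P), p=P[current])))
    return states

path = simulate_chain(P, start=0, n_steps=10)
print(path)
```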

## Formal definition
A Markov chain is a sequence $X_0, X_1, X_2, ...$ of random variables. The set of values these variables can take is called the state space, and $X_n$ denotes the state of the process at time $n$. If the conditional probability distribution of $X_{n+1}$ given the past states is a function of $X_n$ alone, then:<br> $\Pr(X_{n+1} = x \mid X_0, X_1, X_2, ..., X_n) = \Pr(X_{n+1} = x \mid X_n),$ <br> where $x$ is some state of the process. The identity above defines the Markov property.
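
For a finite state space, the conditional probabilities can be collected in a transition matrix whose rows sum to 1, which is where linear algebra comes in. The sketch below uses a hypothetical 2x2 matrix `P` (an assumption for illustration, not estimated from data). It propagates a distribution with matrix products and recovers the stationary distribution as the left eigenvector of `P` for eigenvalue 1.

```python
import numpy as np

# Hypothetical transition matrix: P[i, j] = Pr(X_{n+1} = j | X_n = i).
P = np.array([[0.6, 0.4],
              [0.3, 0.7]])

# Start in state 0 with certainty and propagate the distribution.
pi0 = np.array([1.0, 0.0])
pi1 = pi0 @ P                                # distribution after 1 step
pi5 = pi0 @ np.linalg.matrix_power(P, 5)     # distribution after 5 steps

# Stationary distribution: left eigenvector of P for the eigenvalue closest to 1.
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))
stationary = np.real(eigvecs[:, idx])
stationary = stationary / stationary.sum()   # normalise so the entries sum to 1

print(pi1, pi5, stationary)
```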

In [190]:
# Libraries used. More detail on what each one does is given
# as they are used
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import io, base64, os, json, re
import pandas as pd
import numpy as np
import datetime
from random import randint

In [191]:
# Data obtained from Yahoo! Finance
!pip install yfinance
import yfinance as yf

Defaulting to user installation because normal site-packages is not writeable


In [192]:
# In this cell we download the last four years of daily Ibovespa (^BVSP) data;
# in the run shown below this covers 2019-12-02 to 2023-12-01
bvsp = yf.download("^BVSP",period = "4y")
bvsp_df = bvsp.reset_index()
print(bvsp_df)

[*********************100%%**********************]  1 of 1 completed
          Date           Open          High            Low         Close  \
0   2019-12-02  108246.000000  109279.00000  108245.000000  109061.00000   
1   2019-12-03  108931.000000  109198.00000  108190.000000  108956.00000   
2   2019-12-04  108962.000000  110301.00000  108962.000000  110301.00000   
3   2019-12-05  110297.000000  111073.00000  110008.000000  110622.00000   
4   2019-12-06  110623.000000  111430.00000  110623.000000  111126.00000   
..         ...            ...           ...            ...           ...   
989 2023-11-27  125517.000000  125826.00000  124840.000000  125683.00000   
990 2023-11-28  125726.000000  126916.00000  125388.000000  126538.00000   
991 2023-11-29  126541.000000  127388.00000  126018.000000  126101.00000   
992 2023-11-30  126168.000000  127399.00000  126168.000000  127331.00000   
993 2023-12-01  127331.117188  128184.90625  126655.671875  128184.90625   

        Adj Close 

In [193]:
# take random sets of sequential rows
new_set = []
number_lines = 50000
for row_set in range(0, number_lines):
    if row_set%2000==0: print(row_set)
    row_quant = randint(10, 30)
    row_start = randint(0, len(bvsp_df)-row_quant)
    market_subset = bvsp_df.iloc[row_start:row_start+row_quant]

    Close_Date = max(market_subset['Date'])
    if row_set%2000==0: print(Close_Date)

    # Close_Gap = (market_subset['Close'] - market_subset['Close'].shift(1)) / market_subset['Close'].shift(1)
    Close_Gap = market_subset['Close'].pct_change()
    High_Gap = market_subset['High'].pct_change()
    Low_Gap = market_subset['Low'].pct_change()
    Volume_Gap = market_subset['Volume'].pct_change()
    Daily_Change = (market_subset['Close'] - market_subset['Open']) / market_subset['Open']
    # Note: the outcome is the next day's change in traded volume, not in price
    Outcome_Next_Day_Direction = (market_subset['Volume'].shift(-1) - market_subset['Volume'])

    new_set.append(pd.DataFrame({'Sequence_ID':[row_set]*len(market_subset),
                            'Close_Date':[Close_Date]*len(market_subset),
                           'Close_Gap':Close_Gap,
                           'High_Gap':High_Gap,
                           'Low_Gap':Low_Gap,
                           'Volume_Gap':Volume_Gap,
                           'Daily_Change':Daily_Change,
                           'Outcome_Next_Day_Direction':Outcome_Next_Day_Direction}))


0
2022-12-19 00:00:00
2000
2022-08-18 00:00:00
4000
2022-05-17 00:00:00
6000
2022-12-09 00:00:00
8000
2020-06-19 00:00:00
10000
2022-03-17 00:00:00
12000
2022-08-30 00:00:00
14000
2022-11-07 00:00:00
16000
2021-04-07 00:00:00
18000
2023-04-25 00:00:00
20000
2022-06-21 00:00:00
22000
2021-07-14 00:00:00
24000
2021-10-26 00:00:00
26000
2022-06-20 00:00:00
28000
2023-05-03 00:00:00
30000
2022-04-13 00:00:00
32000
2021-06-17 00:00:00
34000
2020-12-10 00:00:00
36000
2023-08-25 00:00:00
38000
2021-04-01 00:00:00
40000
2020-10-06 00:00:00
42000
2023-10-31 00:00:00
44000
2023-08-31 00:00:00
46000
2021-10-13 00:00:00
48000
2023-09-12 00:00:00
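
The gap features above rely on `pct_change`, which is equivalent to the commented-out formula in the loop. A small sketch with made-up prices (not Ibovespa data) shows this, and also shows that the first row of each window has no previous value and becomes `NaN`. Together with the `shift(-1)` used for the outcome, every window therefore loses at least its first and last rows once rows with missing values are dropped.

```python
import pandas as pd

# Made-up closing prices, for illustration only.
close = pd.Series([100.0, 102.0, 99.96])

gap = close.pct_change()
manual = (close - close.shift(1)) / close.shift(1)

# Next-day difference: the last row has no "next day" and becomes NaN.
next_day_diff = close.shift(-1) - close

print(pd.DataFrame({'gap': gap, 'manual': manual, 'next_day_diff': next_day_diff}))
```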


In [194]:
len(market_subset)

19

In [195]:
new_set_df = pd.concat(new_set)
print(new_set_df.shape)
new_set_df = new_set_df.dropna(how='any')
print(new_set_df.shape)
new_set_df.tail(20)

(1001510, 8)
(900609, 8)


Unnamed: 0,Sequence_ID,Close_Date,Close_Gap,High_Gap,Low_Gap,Volume_Gap,Daily_Change,Outcome_Next_Day_Direction
618,49998,2022-06-07,0.009276,0.006951,0.003573,0.026649,0.009249,-1293600.0
619,49998,2022-06-07,-0.011486,-0.002813,-0.002545,-0.128702,-0.011478,-545100.0
620,49998,2022-06-07,-0.008245,-0.004066,-0.008293,-0.062244,-0.008245,1253800.0
261,49999,2021-01-19,-0.014132,-0.011301,-0.024728,0.198136,-0.016989,-3399200.0
262,49999,2021-01-19,0.002862,-0.009473,0.008001,-0.328517,0.004515,-464600.0
263,49999,2021-01-19,0.01297,0.012044,0.008543,-0.066869,0.010468,688400.0
264,49999,2021-01-19,0.010131,0.007624,0.010023,0.10618,0.010568,-402000.0
265,49999,2021-01-19,0.003561,0.005436,0.008022,-0.056054,0.002896,1466000.0
266,49999,2021-01-19,-0.001415,0.002411,0.001423,0.216553,-0.000871,505700.0
267,49999,2021-01-19,-0.00627,0.001698,-0.007207,0.061403,-0.003915,515700.0


In [196]:
new_set_df.head()

Unnamed: 0,Sequence_ID,Close_Date,Close_Gap,High_Gap,Low_Gap,Volume_Gap,Daily_Change,Outcome_Next_Day_Direction
737,0,2022-12-19,-0.003247,-0.008519,0.000324,-0.123408,-0.003237,-2901500.0
738,0,2022-12-19,0.028965,0.030443,0.008749,-0.219837,0.027424,2012200.0
739,0,2022-12-19,-0.025521,-0.005213,-0.002701,0.195418,-0.025521,-816500.0
740,0,2022-12-19,-0.001789,-0.022754,-0.001603,-0.066333,-0.001789,3229100.0
741,0,2022-12-19,0.019562,0.024763,0.003746,0.280972,0.019543,4981500.0


In [197]:
# create sequences
# simplify the data by binning values into three groups

# Close_Gap
new_set_df['Close_Gap_LMH'] = pd.qcut(new_set_df['Close_Gap'], 3, labels=["L", "M", "H"])

# High_Gap - not used in this example
new_set_df['High_Gap_LMH'] = pd.qcut(new_set_df['High_Gap'], 3, labels=["L", "M", "H"])

# Low_Gap - not used in this example
new_set_df['Low_Gap_LMH'] = pd.qcut(new_set_df['Low_Gap'], 3, labels=["L", "M", "H"])

# Volume_Gap
new_set_df['Volume_Gap_LMH'] = pd.qcut(new_set_df['Volume_Gap'], 3, labels=["L", "M", "H"])

# Daily_Change
new_set_df['Daily_Change_LMH'] = pd.qcut(new_set_df['Daily_Change'], 3, labels=["L", "M", "H"])

# new set
new_set_df = new_set_df[["Sequence_ID",
                         "Close_Date",
                         "Close_Gap_LMH",
                         "Volume_Gap_LMH",
                         "Daily_Change_LMH",
                         "Outcome_Next_Day_Direction"]]

new_set_df['Event_Pattern'] = new_set_df['Close_Gap_LMH'].astype(str) + new_set_df['Volume_Gap_LMH'].astype(str) + new_set_df['Daily_Change_LMH'].astype(str)



In [198]:
new_set_df.tail(10)

Unnamed: 0,Sequence_ID,Close_Date,Close_Gap_LMH,Volume_Gap_LMH,Daily_Change_LMH,Outcome_Next_Day_Direction,Event_Pattern
268,49999,2021-01-19,M,M,M,2381100.0,MMM
269,49999,2021-01-19,M,H,M,136600.0,MHM
270,49999,2021-01-19,H,M,H,-689000.0,HMH
271,49999,2021-01-19,H,M,H,-1548200.0,HMH
272,49999,2021-01-19,L,L,L,-588600.0,LLL
273,49999,2021-01-19,H,M,M,1342500.0,HMM
274,49999,2021-01-19,L,H,L,-1317100.0,LHL
275,49999,2021-01-19,H,L,H,413200.0,HLH
276,49999,2021-01-19,L,M,L,-2164500.0,LML
277,49999,2021-01-19,M,L,H,662900.0,MLH
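
To make the binning step concrete, here is a minimal sketch of how `pd.qcut` assigns the `L`/`M`/`H` labels and how labels are concatenated into an event pattern. The numbers are invented for illustration and only two features are combined, whereas the notebook combines three.

```python
import pandas as pd

# Invented daily changes and volume changes, for illustration only.
toy = pd.DataFrame({
    'Close_Gap':  [-0.03, -0.02, -0.01, 0.00, 0.01, 0.02],
    'Volume_Gap': [0.30, -0.20, 0.10, -0.10, 0.20, -0.30],
})

# qcut splits each column into three equally populated bins (terciles).
toy['Close_Gap_LMH'] = pd.qcut(toy['Close_Gap'], 3, labels=["L", "M", "H"])
toy['Volume_Gap_LMH'] = pd.qcut(toy['Volume_Gap'], 3, labels=["L", "M", "H"])

# Concatenate the labels into a single symbol per day.
toy['Event_Pattern'] = toy['Close_Gap_LMH'].astype(str) + toy['Volume_Gap_LMH'].astype(str)

close_labels = toy['Close_Gap_LMH'].astype(str).tolist()
print(toy)
```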


In [199]:
new_set_df['Outcome_Next_Day_Direction'].describe()

count    9.006090e+05
mean    -3.377410e+03
std      3.241571e+06
min     -1.923840e+07
25%     -1.397500e+06
50%     -9.600000e+03
75%      1.440000e+06
max      1.213530e+07
Name: Outcome_Next_Day_Direction, dtype: float64

In [200]:
# reduce the set
compressed_set = new_set_df.groupby(['Sequence_ID',
                                     'Close_Date'])['Event_Pattern'].apply(lambda x: "{%s}" % ', '.join(x)).reset_index()

print(compressed_set.shape)
compressed_set.head()

(50000, 3)


Unnamed: 0,Sequence_ID,Close_Date,Event_Pattern
0,0,2022-12-19,"{MLM, HLH, LHL, MMM, HHH, HHH, LLL, HMH, LLL, ..."
1,1,2023-02-23,"{MHL, LLL, LLL, HHH, HMH, MHH, LML, LLM, HLH, ..."
2,2,2021-08-10,"{LHL, LML, LHL, HLH, MMM, MLM, LML, HMH, LHL, ..."
3,3,2023-06-16,"{HLH, LLL, LHL, LHL, HMH, HMH, MLM, HHH, HMH, ..."
4,4,2022-12-21,"{HHH, LLL, HMH, LLL, MHM, MML, LML, MLM, LHL, ..."
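
The compression step turns many rows per sequence into one comma-separated string per sequence. A minimal sketch on a toy frame is shown below. The data is invented, and `agg(','.join)` is used here as a shorter alternative to the `apply` with braces in the cell above.

```python
import pandas as pd

# Toy data: two sequences with three and two events respectively.
toy = pd.DataFrame({
    'Sequence_ID':   [0, 0, 0, 1, 1],
    'Event_Pattern': ['MLM', 'HLH', 'LHL', 'MHL', 'LLL'],
})

# One row per sequence, events joined in their original order.
compressed = (toy.groupby('Sequence_ID')['Event_Pattern']
                 .agg(','.join)
                 .reset_index())
print(compressed)
```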


In [201]:
#compressed_outcomes = new_set_df[['Sequence_ID', 'Close_Date', 'Outcome_Next_Day_Direction']].groupby(['Sequence_ID', 'Close_Date']).agg()

compressed_outcomes = new_set_df.groupby(['Sequence_ID', 'Close_Date'])['Outcome_Next_Day_Direction'].mean()
compressed_outcomes = compressed_outcomes.to_frame().reset_index()
print(compressed_outcomes.shape)
compressed_outcomes.describe()

(50000, 3)


Unnamed: 0,Sequence_ID,Close_Date,Outcome_Next_Day_Direction
count,50000.0,50000,50000.0
mean,24999.5,2021-12-17 21:02:51.071999744,-4126.901
min,0.0,2019-12-13 00:00:00,-2463000.0
25%,12499.75,2020-12-23 00:00:00,-115698.5
50%,24999.5,2021-12-20 00:00:00,5947.619
75%,37499.25,2022-12-07 00:00:00,124500.0
max,49999.0,2023-12-01 00:00:00,1746000.0
std,14433.901067,,284369.4


In [202]:
compressed_set = pd.merge(compressed_set, compressed_outcomes, on= ['Sequence_ID', 'Close_Date'], how='inner')
print(compressed_set.shape)
compressed_set.head()

(50000, 4)


Unnamed: 0,Sequence_ID,Close_Date,Event_Pattern,Outcome_Next_Day_Direction
0,0,2022-12-19,"{MLM, HLH, LHL, MMM, HHH, HHH, LLL, HMH, LLL, ...",198611.111111
1,1,2023-02-23,"{MHL, LLL, LLL, HHH, HMH, MHH, LML, LLM, HLH, ...",-299289.285714
2,2,2021-08-10,"{LHL, LML, LHL, HLH, MMM, MLM, LML, HMH, LHL, ...",48833.333333
3,3,2023-06-16,"{HLH, LLL, LHL, LHL, HMH, HMH, MLM, HHH, HMH, ...",302285.714286
4,4,2022-12-21,"{HHH, LLL, HMH, LLL, MHM, MML, LML, MLM, LHL, ...",-137886.666667


In [203]:
# # reduce set

# compressed_set = new_set_df.groupby(['Sequence_ID', 'Close_Date','Outcome_Next_Day_Direction'])['Event_Pattern'].apply(lambda x: "{%s}" % ', '.join(x)).reset_index()

compressed_set['Event_Pattern'] = [''.join(e.split()).replace('{','')
                                   .replace('}','') for e in compressed_set['Event_Pattern'].values]
compressed_set.head()

Unnamed: 0,Sequence_ID,Close_Date,Event_Pattern,Outcome_Next_Day_Direction
0,0,2022-12-19,"MLM,HLH,LHL,MMM,HHH,HHH,LLL,HMH,LLL,MHM,MML,LM...",198611.111111
1,1,2023-02-23,"MHL,LLL,LLL,HHH,HMH,MHH,LML,LLM,HLH,HMH,MMM,LM...",-299289.285714
2,2,2021-08-10,"LHL,LML,LHL,HLH,MMM,MLM,LML,HMH,LHL,HHH,MLM,LH...",48833.333333
3,3,2023-06-16,"HLH,LLL,LHL,LHL,HMH,HMH,MLM,HHH,HMH,HMH,MLM,LM...",302285.714286
4,4,2022-12-21,"HHH,LLL,HMH,LLL,MHM,MML,LML,MLM,LHL,LHL,MHM,ML...",-137886.666667


In [204]:
# use the last 90 days (relative to the date the notebook is run) for validation
compressed_set_validation = compressed_set[compressed_set['Close_Date'] >= datetime.datetime.now()
                                           - datetime.timedelta(days=90)]

compressed_set_validation.shape


(3154, 4)

In [205]:
compressed_set = compressed_set[compressed_set['Close_Date'] < datetime.datetime.now()
                                           - datetime.timedelta(days=90)]
compressed_set.shape

(46846, 4)
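
Because the split above depends on `datetime.datetime.now()`, the sizes of the two sets change with the day the notebook is run. One possible alternative, sketched here on invented dates, is to anchor the cutoff to the latest date present in the data. The name `cutoff` and the toy frame are assumptions made for illustration.

```python
import pandas as pd

# Invented sequence end dates, for illustration only.
toy = pd.DataFrame({
    'Sequence_ID': [0, 1, 2, 3],
    'Close_Date': pd.to_datetime(['2023-06-01', '2023-08-15',
                                  '2023-10-02', '2023-11-30']),
})

# Cutoff relative to the data, so the split is reproducible.
cutoff = toy['Close_Date'].max() - pd.Timedelta(days=90)

train = toy[toy['Close_Date'] < cutoff]
validation = toy[toy['Close_Date'] >= cutoff]
print(cutoff.date(), len(train), len(validation))
```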

In [206]:
list(compressed_set)

['Sequence_ID', 'Close_Date', 'Event_Pattern', 'Outcome_Next_Day_Direction']

In [207]:
# drop date field
compressed_set = compressed_set[['Sequence_ID', 'Event_Pattern','Outcome_Next_Day_Direction']]
compressed_set_validation = compressed_set_validation[['Sequence_ID', 'Event_Pattern','Outcome_Next_Day_Direction']]

In [208]:
compressed_set['Outcome_Next_Day_Direction'].describe()

count    4.684600e+04
mean    -5.021442e+03
std      2.817747e+05
min     -2.463000e+06
25%     -1.165140e+05
50%      5.205655e+03
75%      1.227875e+05
max      1.643700e+06
Name: Outcome_Next_Day_Direction, dtype: float64

In [209]:
print(len(compressed_set['Outcome_Next_Day_Direction']))
len(compressed_set[abs(compressed_set['Outcome_Next_Day_Direction']) > 1000])

46846


46595

In [210]:
# keep only big/interesting moves
print('all moves:', len(compressed_set))
compressed_set = compressed_set[abs(compressed_set['Outcome_Next_Day_Direction']) > 1000]
compressed_set['Outcome_Next_Day_Direction'] = np.where((compressed_set['Outcome_Next_Day_Direction'] > 0), 1, 0)
compressed_set_validation['Outcome_Next_Day_Direction'] = np.where((compressed_set_validation['Outcome_Next_Day_Direction'] > 0), 1, 0)
print('big moves only:', len(compressed_set))

all moves: 46846
big moves only: 46595


In [211]:
compressed_set.head()

Unnamed: 0,Sequence_ID,Event_Pattern,Outcome_Next_Day_Direction
0,0,"MLM,HLH,LHL,MMM,HHH,HHH,LLL,HMH,LLL,MHM,MML,LM...",1
1,1,"MHL,LLL,LLL,HHH,HMH,MHH,LML,LLM,HLH,HMH,MMM,LM...",0
2,2,"LHL,LML,LHL,HLH,MMM,MLM,LML,HMH,LHL,HHH,MLM,LH...",1
3,3,"HLH,LLL,LHL,LHL,HMH,HMH,MLM,HHH,HMH,HMH,MLM,LM...",1
4,4,"HHH,LLL,HMH,LLL,MHM,MML,LML,MLM,LHL,LHL,MHM,ML...",0


In [212]:
# create two data sets - positive outcome / negative outcome
compressed_set_pos = compressed_set[compressed_set['Outcome_Next_Day_Direction']==1][['Sequence_ID', 'Event_Pattern']]
print(compressed_set_pos.shape)
compressed_set_neg = compressed_set[compressed_set['Outcome_Next_Day_Direction']==0][['Sequence_ID', 'Event_Pattern']]
print(compressed_set_neg.shape)

(23937, 2)
(22658, 2)


In [213]:
flat_list = [item.split(',') for item in compressed_set['Event_Pattern'].values ]
unique_patterns = ','.join(str(r) for v in flat_list for r in v)
unique_patterns = list(set(unique_patterns.split(',')))
len(unique_patterns)

19
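
With three features and three labels each, at most 3^3 = 27 distinct patterns can occur, and the run above observes 19 of them. The same set of observed symbols can be collected with a set comprehension, sketched here on two invented sequences.

```python
# Invented comma-separated sequences, for illustration only.
sequences = ['MLM,HLH,LHL', 'HLH,MMM']

# Every distinct symbol that appears in any sequence.
unique_patterns = sorted({p for seq in sequences for p in seq.split(',')})

max_possible = 3 ** 3  # three features, three labels each
print(unique_patterns, len(unique_patterns), 'of at most', max_possible)
```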

In [214]:
compressed_set['Outcome_Next_Day_Direction'].head()

0    1
1    0
2    1
3    1
4    0
Name: Outcome_Next_Day_Direction, dtype: int64

In [215]:
# build the markov transition grid
def build_transition_grid(compressed_grid, unique_patterns):
    # count how often each from_event is immediately followed by each to_event

    patterns = []
    counts = []
    for from_event in unique_patterns:

        # how many times
        for to_event in unique_patterns:
            pattern = from_event + ',' + to_event # e.g. MMM,MLM

            ids_matches = compressed_grid[compressed_grid['Event_Pattern'].str.contains(pattern)]
            found = 0
            if len(ids_matches) > 0:
                Event_Pattern = '---'.join(ids_matches['Event_Pattern'].values)
                found = Event_Pattern.count(pattern)
            patterns.append(pattern)
            counts.append(found)

    # create to/from grid
    grid_Df = pd.DataFrame({'pairs':patterns, 'counts': counts})

    # split "from,to" into two columns; recent pandas versions require
    # n and expand to be passed as keyword arguments
    grid_Df[['x', 'y']] = grid_Df['pairs'].str.split(',', n=1, expand=True)

    # rows (x) are the "from" event, columns (y) are the "to" event
    grid_Df = grid_Df.pivot(index='x', columns='y', values='counts')

    grid_Df.columns= [col for col in grid_Df.columns]

    # replace all NaN with zeros
    grid_Df = grid_Df.fillna(0)

    # normalise each row so that it sums to 1 (row-stochastic matrix)
    grid_Df = grid_Df.div(grid_Df.sum(axis=1), axis=0)
    return grid_Df

In [216]:
grid_pos = build_transition_grid(compressed_set_pos, unique_patterns)
grid_neg = build_transition_grid(compressed_set_neg, unique_patterns)

In [None]:
grid_neg.head()

In [None]:
grid_pos.head()
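
For a single sequence, a transition grid of this kind can also be estimated with `pd.crosstab`, which counts consecutive pairs and normalises each row. This is only a sketch of the idea on an invented sequence. The helper name `transition_matrix` is hypothetical and the function is not a replacement for `build_transition_grid`, which aggregates over many sequences.

```python
import pandas as pd

def transition_matrix(events):
    """Row-normalised counts of consecutive (from, to) pairs in one sequence."""
    pairs = pd.DataFrame({'from': events[:-1], 'to': events[1:]})
    # normalize='index' divides each row by its total, so rows sum to 1.
    return pd.crosstab(pairs['from'], pairs['to'], normalize='index')

# Invented sequence of binned events.
events = ['L', 'M', 'M', 'H', 'L', 'M']
P_hat = transition_matrix(events)
print(P_hat)
```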

In [None]:
def safe_log(x,y):
   try:
      lg = np.log(x/y)
   except:
      lg = 0
   return lg

# predict on out of sample data
actual = []
predicted = []
for seq_id in compressed_set_validation['Sequence_ID'].values:
    patterns = compressed_set_validation[compressed_set_validation['Sequence_ID'] == seq_id]['Event_Pattern'].values[0].split(',')
    pos = []
    neg = []
    log_odds = []

    for i in range(0, len(patterns)-1):
        # get log odds
        # logOdds = log(tp(i,j) / tn(i,j))
        if (patterns[i] in grid_pos.index and patterns[i+1] in grid_pos.columns and patterns[i] in grid_neg.index and patterns[i+1] in grid_neg.columns):

            # rows are the "from" event, columns are the "to" event
            numerator = grid_pos.loc[patterns[i], patterns[i+1]]
            denominator = grid_neg.loc[patterns[i], patterns[i+1]]
            if (numerator == 0 and denominator == 0):
                log_value = 0
            elif (denominator == 0):
                log_value = np.log(numerator / 0.00001)
            elif (numerator == 0):
                log_value = np.log(0.00001 / denominator)
            else:
                log_value = np.log(numerator/denominator)

            # record probabilities only for transitions present in both grids
            pos.append(numerator)
            neg.append(denominator)
        else:
            log_value = 0

        log_odds.append(log_value)

    print('outcome:', compressed_set_validation[compressed_set_validation['Sequence_ID']==seq_id]['Outcome_Next_Day_Direction'].values[0])
    # avoid dividing by zero when no transition was found in the negative grid
    if sum(neg) > 0:
        print(sum(pos)/sum(neg))
    print(sum(log_odds))

    actual.append(compressed_set_validation[compressed_set_validation['Sequence_ID']==seq_id]['Outcome_Next_Day_Direction'].values[0])
    predicted.append(sum(log_odds))

from sklearn.metrics import confusion_matrix

confusion_matrix(actual, [1 if p > 0 else 0 for p in predicted])

In [None]:
from sklearn.metrics import accuracy_score
score = accuracy_score(actual, [1 if p > 0 else 0 for p in predicted])
print('Accuracy:', round(score * 100,2), '%')
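
The log-odds scoring used for the predictions can be illustrated on a tiny example. The two grids below are invented (they are not the estimated `grid_pos` and `grid_neg`), and `log_odds_score` is a hypothetical helper. A sequence whose transitions are more likely under the positive grid gets a positive score and is classified as 1.

```python
import numpy as np
import pandas as pd

states = ['L', 'H']
# Invented row-stochastic grids: rows are "from", columns are "to".
grid_up = pd.DataFrame([[0.2, 0.8], [0.4, 0.6]], index=states, columns=states)
grid_down = pd.DataFrame([[0.7, 0.3], [0.5, 0.5]], index=states, columns=states)

def log_odds_score(sequence, pos_grid, neg_grid, eps=1e-5):
    """Sum of log(p_pos / p_neg) over consecutive transitions."""
    score = 0.0
    for a, b in zip(sequence[:-1], sequence[1:]):
        p = max(pos_grid.loc[a, b], eps)   # eps guards against log(0)
        q = max(neg_grid.loc[a, b], eps)
        score += np.log(p / q)
    return score

score_a = log_odds_score(['L', 'H', 'H'], grid_up, grid_down)
score_b = log_odds_score(['H', 'L', 'L'], grid_up, grid_down)
print(score_a, score_b)
```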

In [None]:
import seaborn as sns
cm = confusion_matrix(actual, [1 if p > 0 else 0 for p in predicted])
fig, ax = plt.subplots(figsize=(5, 5))
sns.heatmap(cm, annot=True, ax = ax, fmt='g')

ax.set_title('Confusion Matrix')
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')

# confusion_matrix orders classes as 0 (down) then 1 (up)
ax.xaxis.set_ticklabels(['down day','up day'])
ax.yaxis.set_ticklabels(['down day','up day'])
ax.set_yticklabels(ax.get_yticklabels(), rotation = 0, fontsize = 8)
ax.set_xticklabels(ax.get_xticklabels(), rotation = 90, fontsize = 8)
plt.show()
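
To read the heatmap correctly it helps to recall the layout of a binary confusion matrix: rows are the actual classes and columns the predicted ones, both ordered 0 then 1. The sketch below builds one by hand with NumPy on invented labels and scores, using the same "score > 0 means class 1" rule as above.

```python
import numpy as np

# Invented outcomes and log-odds scores, for illustration only.
actual = np.array([1, 0, 1, 1, 0, 0])
scores = np.array([0.8, -0.2, -0.1, 1.5, 0.3, -0.7])
predicted = (scores > 0).astype(int)

# cm[a, p] counts cases with actual class a and predicted class p.
cm = np.zeros((2, 2), dtype=int)
for a, p in zip(actual, predicted):
    cm[a, p] += 1

# Accuracy is the share of cases on the diagonal.
accuracy = np.trace(cm) / cm.sum()
print(cm)
print('Accuracy:', round(accuracy * 100, 2), '%')
```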