##### Created by: https://github.com/ThalesVGomes

# Quantitative Finance - Sequence Break (Trading Strategy)

#### Empirically, long daily sequences of consecutive up or down moves in the stock market are quite rare.

#### The objective of this study is to build an automatic system that exploits this observation: it finds stocks in a relatively long daily sequence of moves in one direction (relative to each stock's historical data) and generates signals to take the opposite position for the next day, hoping to profit from a reversal.

<img src="Sequence_Break_Example.png" width=600 height=600 />

#### We will analyse the historical prices of a stock, count the distribution of the sequences formed in the past and the probability of each one, and then try to generate good buy or sell signals based on the stock's current sequence.

##### For example, if our model learned that a particular stock had only a 1% chance of making a 6-day streak (up or down) and our threshold (1 - confidence) is greater than 1%, we look for a moment when the same stock is on a 5-day streak and take the opposite position. After 5 consecutive up days we go short (sell); after 5 consecutive down days we go long (buy).
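
The rule in the example above can be sketched in a few lines. This is only an illustration, not the code used later in the study: the function name `opposite_position`, the `tail_probabilities` dictionary and the numbers in it are hypothetical.

```python
def opposite_position(streak_length, direction, tail_probabilities, threshold):
    """Return 'buy', 'sell' or None for a stock that is currently in a streak.

    tail_probabilities[k] is an assumed historical probability that a
    streak reaches k days (illustrative values, not real market data).
    """
    # Probability that the streak survives one more day
    p_continue = tail_probabilities.get(streak_length + 1, 0.0)
    if p_continue >= threshold:
        return None  # a continuation is not rare enough, so no signal
    # Take the position opposite to the streak direction
    return 'sell' if direction == 'up' else 'buy'


tail = {5: 0.04, 6: 0.01}  # hypothetical: a 6-day streak has a 1% chance
print(opposite_position(5, 'up', tail, threshold=0.05))
print(opposite_position(5, 'down', tail, threshold=0.05))
```

With a threshold of 5%, a 5-day up streak produces a sell signal and a 5-day down streak produces a buy signal, because the assumed chance of reaching 6 days (1%) is below the threshold.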

#### Let's start by importing the external libraries we will need for this study

In [1]:
import pandas as pd
import numpy as np
from pandas_datareader import DataReader
from datetime import date

# Defining our main functions

#### Sometimes a sequence length will be missing from the historical data. For example, the past data may contain a sequence that lasted 8 days but none that lasted exactly 7.
#### Theoretically, if market moves are random and each day has a 50% chance of going up or down, daily moves behave like independent coin flips (a binomial model). Since each day then has a 50% chance of continuing the previous day's movement, the probability of a 7-day streak should be close to half the probability of a 6-day streak, so we use this approximation to impute the missing data.
### Disclaimer: Of course, we are not assuming in this study that every asset follows a binomial distribution. If that were the case, there would be no point in counting the historical sequences.
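
As a minimal sketch of this imputation step, the snippet below fills each missing streak length with half the count of the previous length, rounded up. The helper name `fill_missing_lengths` and the counts are made up for illustration, and the sketch assumes that length 1 is always present in the data.

```python
import math


def fill_missing_lengths(counts):
    """counts maps streak length -> number of times it was observed."""
    filled = dict(counts)
    # Assumes length 1 was observed, so we start checking from length 2
    for length in range(2, max(filled) + 1):
        if length not in filled:
            # Half of the previous length's count, rounded up
            filled[length] = math.ceil(filled[length - 1] / 2)
    return dict(sorted(filled.items()))


observed = {1: 40, 2: 22, 3: 9, 4: 5, 5: 3, 6: 2, 8: 1}  # no 7-day streak
print(fill_missing_lengths(observed))
```

Here the missing 7-day streak receives a count of 1, which is half of the two observed 6-day streaks.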

### Function to create the sequence distribution of a given stock

In [2]:
def get_sequence_distribution(ticker, start, end=date.today()):
    
    """
    Given a ticker, a start date and an end date,
    takes the adjusted closing prices (avoids distortions) and counts
    how many up/down sequences of each length the ticker had.
    
    Returns a tuple: a dictionary with the sequence distribution of the
    ticker in the given time period, and the current state
    of the sequence (or None if no data is found).
    ----------------------------------------------------------------
    Example of usage:
    
    seq_dist, curr_seq = get_sequence_distribution(ticker='GOLL4.SA', start='01-01-2020', end='01-01-2021')
    
    seq_dist ->  {1: 0.515, 2: 0.262, 3: 0.108, 4: 0.054, 5: 0.054, 6: 0.008}
                51.5% of the sequences are composed of one movement (1 up or 1 down) and so on.

    curr_seq -> 2
                   Means that in the end date (01-01-2021) the sequence of movements
                   in the same direction is equal to 2.
    
    """
    
    global data # For backtesting purposes
    
    data = None
    
    try:
        data = DataReader(ticker,'yahoo', start, end)['Adj Close'].to_frame()
    except Exception:
        print(f'No data found for: {ticker}')
        data = None
        return
    
    data['Returns'] = data['Adj Close'].pct_change()
    data.dropna(inplace=True)
    
    data['Direction'] = np.where(data['Returns'] > 0, 1, -1 ) # 1 for up and -1 for down
    
    data = data.drop_duplicates(subset=['Adj Close', 'Returns', 'Direction']) 
    # On days without trading the same closing value is repeated, so those rows are dropped
    
    n_rows = data.shape[0] # number of rows in the dataframe
    
    sequences = np.array([], dtype='int8') # Where the sequences will be stored
    streak = 1
    
    for day in range(1, n_rows):
        if data['Direction'].iloc[day] == data['Direction'].iloc[day-1]: # If the current movement is equal to the last
            streak += 1
        else:
            sequences = np.append(sequences, streak)
            streak = 1    
    sequences = np.append(sequences, streak) # Append the last sequence
    current_sequence = sequences[-1]
    
    # Creates the sequence distribution
    idx, counts = np.unique(sequences, return_counts=True)

    # Adjust for missing sequence
    aux = dict(zip(idx, counts))
    for i in range(1, idx.max()):
        if i not in idx:
            aux[i] = int(np.ceil(aux[i-1] / 2))
    aux = dict(sorted(aux.items()))
    idx = np.fromiter(aux.keys(), dtype='int8')
    counts = np.fromiter(aux.values(), dtype='int16')
    
    normalized_counts = counts / counts.sum()
    sequence_distribution = dict(zip(idx, normalized_counts))
    return sequence_distribution, current_sequence
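
To see the counting logic without downloading any prices, here is a small self-contained sketch on a hand-written list of directions. It uses `itertools.groupby` instead of the explicit loop above, which is a swapped-in technique and not the method of the function itself; the name `streak_lengths` and the sample data are assumptions made for the example.

```python
from collections import Counter
from itertools import groupby


def streak_lengths(directions):
    """Lengths of the consecutive runs of equal values, in order."""
    return [len(list(run)) for _, run in groupby(directions)]


# 1 = up day, -1 = down day (made-up sample)
directions = [1, 1, -1, 1, 1, 1, -1, -1, 1]

lengths = streak_lengths(directions)        # one entry per streak
counts = Counter(lengths)                   # how often each length occurred
total = sum(counts.values())
distribution = {k: counts[k] / total for k in sorted(counts)}
current_streak = lengths[-1]                # streak still open on the last day

print(lengths)
print(distribution)
print(current_streak)
```

The sample contains five streaks: two of length 1, two of length 2 and one of length 3, and the last day starts a new streak of length 1.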

### Probability of the movement to keep following the trend

In [3]:
def continue_seq_prob(sequence_distribution, current_sequence):
    """
    Based on the learned sequence distribution,
    gives the probability of the current sequence
    to continue its movement one day ahead.
    
    For example, if our current sequence is equal to 5,
    gives us the probability of a sequence equal to 6 happening
    based on the historical data of the given asset.
    """

    probabilities = np.fromiter(sequence_distribution.values(), dtype=np.float16)
    probabilities = probabilities[::-1]
    summed_probabilities = np.cumsum(probabilities)
    
    try:
        current_probability = summed_probabilities[-current_sequence-1]
    except IndexError:
        current_probability = 0 # If it's the first time that such a long sequence happens
    
    return current_probability, summed_probabilities
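
The quantity computed above is the probability that a streak ends up longer than the current one. A minimal sketch of the same idea, keyed on the streak length instead of on array positions, is shown below. The name `tail_probability` is hypothetical, and the distribution is the one quoted in the docstring of `get_sequence_distribution`.

```python
def tail_probability(sequence_distribution, current_sequence):
    """Probability mass of all streak lengths greater than current_sequence."""
    return sum(p for length, p in sequence_distribution.items()
               if length > current_sequence)


# Distribution taken from the docstring example of get_sequence_distribution
seq_dist = {1: 0.515, 2: 0.262, 3: 0.108, 4: 0.054, 5: 0.054, 6: 0.008}

print(tail_probability(seq_dist, 5))  # only the 6-day streaks remain
print(tail_probability(seq_dist, 6))  # longer than anything seen so far
```

When the streak lengths are consecutive integers starting at 1, this should agree with the reversed cumulative sum used in `continue_seq_prob`, and it returns 0 for a streak at least as long as any seen before.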

### Finds stocks in a list that are currently (as of the end date) in a long sequence of movements in the same direction and whose probability of breaking the streak is higher than the confidence level.

In [4]:
def run_program(tickers, start, end=date.today(), confidence=0.95, verbose=True):
    """Runs the complete algorithm in a list of tickers
    and returns the tickers with a chance higher
    than the confidence level of breaking
    the sequence for the next trading day after the end date"""
    
    threshold = 1 - confidence
    if isinstance(tickers, str) or not hasattr(tickers, '__iter__'):
        print(f"""Error: Tickers should be an iterable!
        Current tickers type: {type(tickers)}""")
        return
    
    results = {}
    for ticker in tickers:
        
        if verbose:
            print(f'Analysing {ticker}...')
                
        try:   
            sequences_count, current_sequence = get_sequence_distribution(ticker, start, end)
            probability, _ = continue_seq_prob(sequences_count, current_sequence)
            if probability < threshold:
                results[ticker] = '{:.2f}%'.format(probability*100)
            
        except Exception:
            # Skip tickers without data or with any other error
            pass
    if results:
        return results
    else:
        print()
        print('There are no stocks in a long sequence for the current end date.')
        print('Try using other stocks or decreasing the confidence level.')

## Example of the program in action:

#### List of Brazilian stock tickers (named as on Yahoo Finance - https://finance.yahoo.com/)

In [6]:
tickers_small = ['AALR3.SA','AERI3.SA','AGRO3.SA','ALSO3.SA','ALUP11.SA','AMAR3.SA',
 'AMBP3.SA','ANIM3.SA','ARZZ3.SA','AZUL4.SA','BEEF3.SA','BKBR3.SA','BMGB4.SA',
 'BRPR3.SA','BRSR6.SA','CAML3.SA','CEAB3.SA','CESP6.SA','CIEL3.SA','CSMG3.SA',
 'CYRE3.SA','DIRR3.SA','DTEX3.SA','ECOR3.SA','ENBR3.SA','EVEN3.SA','FESA4.SA',
 'GOAU4.SA','GOLL4.SA','GUAR3.SA','HBOR3.SA','HBSA3.SA','HGTX3.SA','IGTA3.SA',
 'JPSA3.SA','LCAM3.SA','LEVE3.SA','LINX3.SA','LJQQ3.SA','LOGG3.SA','LOGN3.SA',
 'MEAL3.SA','MILS3.SA','MOVI3.SA','MTRE3.SA','MULT3.SA','MYPK3.SA','ODPV3.SA',
 'PETZ3.SA','PNVL3.SA','POMO4.SA','POSI3.SA','PTBL3.SA','QUAL3.SA','RAPT4.SA',
 'RRRP3.SA','SAPR11.SA','SAPR4.SA','SBFG3.SA','SEER3.SA','SEQL3.SA','SIMH3.SA',
 'SMLS3.SA','SOMA3.SA','SQIA3.SA','TASA4.SA','TEND3.SA','TGMA3.SA','TRIS3.SA',
 'TUPY3.SA','UNIP6.SA','VIVA3.SA','VLID3.SA','VULC3.SA','WIZS3.SA']

In [7]:
tickers_ibov = ['ABEV3.SA','AZUL4.SA','B3SA3.SA','BBAS3.SA','BBDC3.SA','BBDC4.SA','BBSE3.SA',
 'BEEF3.SA','BPAC11.SA','BRAP4.SA','BRDT3.SA','BRFS3.SA','BRKM5.SA','BRML3.SA','BTOW3.SA',
 'CCRO3.SA','CIEL3.SA','CMIG4.SA','COGN3.SA','CPFE3.SA','CRFB3.SA','CSAN3.SA','CSNA3.SA',
 'CVCB3.SA','CYRE3.SA','ECOR3.SA','EGIE3.SA','ELET3.SA','ELET6.SA','EMBR3.SA','ENBR3.SA',
 'ENGI11.SA','EQTL3.SA','FLRY3.SA','GGBR4.SA','GNDI3.SA','GOAU4.SA','GOLL4.SA','HAPV3.SA',
 'HGTX3.SA','HYPE3.SA','IGTA3.SA','IRBR3.SA','ITSA4.SA','ITUB4.SA','JBSS3.SA','KLBN11.SA',
 'LAME4.SA','LREN3.SA','MGLU3.SA','MRFG3.SA','MRVE3.SA','MULT3.SA','NTCO3.SA','PCAR3.SA',
 'PETR3.SA','PETR4.SA','QUAL3.SA','RADL3.SA','RAIL3.SA','RENT3.SA','SANB11.SA','SBSP3.SA',
 'SULA11.SA','SUZB3.SA','TAEE11.SA','TIMP3.SA','TOTS3.SA','UGPA3.SA','USIM5.SA','VALE3.SA',
 'VIVT4.SA','VVAR3.SA','WEGE3.SA','YDUQ3.SA']

In [8]:
tickers = tickers_ibov + tickers_small
tickers = list(set(tickers)) # Remove duplicates

### Test with a small sample of tickers to run faster

In [9]:
import random
random.seed(42)
random_tickers = random.sample(tickers, 15)

In [10]:
run_program(random_tickers, start='01-01-2020', end=date.today(), confidence=0.9, verbose=True)

Analysing MOVI3.SA...
Analysing LREN3.SA...
Analysing ENBR3.SA...
Analysing RENT3.SA...
Analysing SULA11.SA...
Analysing TASA4.SA...
Analysing CSMG3.SA...
Analysing GNDI3.SA...
Analysing PETZ3.SA...
Analysing CIEL3.SA...
Analysing BBSE3.SA...
Analysing BPAC11.SA...
Analysing SBSP3.SA...
Analysing BMGB4.SA...
Analysing TGMA3.SA...


{'MOVI3.SA': '9.01%',
 'RENT3.SA': '4.22%',
 'GNDI3.SA': '4.00%',
 'PETZ3.SA': '5.34%',
 'SBSP3.SA': '2.51%'}

#### Based on the result above, we should take the opposite position on each listed stock. The percentage shown is the historical probability p that the streak continues one more day, in which case the trade loses money.

# *Backtesting* - The most important part

### Now that our strategy is fully working, we will simulate the returns this same strategy would have produced if it had been applied in the past

### The first and most important thing to do is to select a time period to serve as train data, where our model learns the sequence distributions, and a separate period as test data, where our model trades based on the learned distributions

In [11]:
def backtest(tickers, start_train, end_train, start_test, end_test, confidence=0.95):
    
    threshold = 1 - confidence
    total_returns = []
    
    for ticker in tickers:
        try:
            # Learns the sequences distribution with the train data
            sequence_distribution, current_sequence = get_sequence_distribution(ticker=ticker,
                                                                                start=start_train, end=end_train)

            _, summed_probabilities = continue_seq_prob(sequence_distribution, current_sequence)
            seq_threshold = (len(summed_probabilities) - np.where(summed_probabilities < threshold)[0])

            seq_break = seq_threshold[-1] # This is the first streak with a probability lower than our threshold
            seq_break -= 1 # So we'll buy in the previous day hoping for a break in the streak
        except Exception:
            continue

        # Apply the strategy in the test data based on the learned distributions
        get_sequence_distribution(ticker=ticker, start=start_test, end=end_test)
        data[f'Sign{seq_break}'] = data['Direction'].rolling(window=seq_break).sum()
        data['Trade'] = data[f'Sign{seq_break}'].shift()
        
        buy_profit = data[data['Trade'] == -seq_break]['Returns'].sum()
        
        # In a short position you earn when the market goes down
        sell_profit = (data[data['Trade'] == seq_break]['Returns'] * -1).sum()
        
        total = buy_profit + sell_profit
        
        print(f'{ticker} Return = {round(total * 100, 2)}%')
        total_returns.append(total)
        
    return f'Total Return = {round(sum(total_returns) * 100, 2)}%'
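
The trade detection inside `backtest` relies on a rolling sum of the `Direction` column: a window of `seq_break` days sums to `+seq_break` only after that many consecutive up days, and to `-seq_break` only after that many consecutive down days. The sketch below reproduces this on a made-up return series; the numbers and the value `seq_break = 3` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up daily returns: three up days, three down days, one up day
returns = pd.Series([0.01, 0.02, 0.01, -0.03, -0.01, -0.02, 0.04])
direction = pd.Series(np.where(returns > 0, 1, -1), index=returns.index)

seq_break = 3  # hypothetical streak length that triggers a trade

# +3 after three ups in a row, -3 after three downs in a row
rolling_sum = direction.rolling(window=seq_break).sum()
# Shift by one day: the trade is taken on the day after the streak completes
trade = rolling_sum.shift()

long_returns = returns[trade == -seq_break]    # bought after 3 down days
short_returns = -returns[trade == seq_break]   # sold short after 3 up days
total = long_returns.sum() + short_returns.sum()

print(round(total * 100, 2))
```

In this sample the short position is opened on the fourth day and the long position on the seventh day. As in `backtest`, the simple daily returns are summed rather than compounded.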

In [14]:
random_tickers = random.sample(tickers, 60)

In [16]:
backtest(tickers=random_tickers, start_train='01-01-2015', end_train='01-01-2020',
         start_test='02-01-2020', end_test='15-10-2021', confidence=0.95)

TAEE11.SA Return = -6.95%
DIRR3.SA Return = 17.83%
BBAS3.SA Return = -6.57%
ECOR3.SA Return = 11.06%
No data found for: TIMP3.SA
MILS3.SA Return = -0.38%
No data found for: HBSA3.SA
No data found for: LINX3.SA
POMO4.SA Return = -7.23%
KLBN11.SA Return = 8.13%
ENGI11.SA Return = 5.06%
PETR3.SA Return = -12.74%
WIZS3.SA Return = -21.78%
No data found for: VIVT4.SA
AZUL4.SA Return = -2.32%
CVCB3.SA Return = 44.95%
TASA4.SA Return = -28.75%
TGMA3.SA Return = 0.03%
BRFS3.SA Return = -1.05%
ODPV3.SA Return = 25.73%
JBSS3.SA Return = -13.59%
BRAP4.SA Return = 2.7%
TOTS3.SA Return = 7.07%
No data found for: BTOW3.SA
CSMG3.SA Return = -4.26%
GOAU4.SA Return = 13.55%
VULC3.SA Return = -3.41%
TEND3.SA Return = 12.11%
CESP6.SA Return = 4.62%
ALSO3.SA Return = 20.95%
No data found for: SMLS3.SA
ITSA4.SA Return = -0.95%
GGBR4.SA Return = 7.47%
No data found for: AERI3.SA
LEVE3.SA Return = 8.65%
USIM5.SA Return = -40.8%
HGTX3.SA Return = -13.39%
PETR4.SA Return = 8.9%
No data found for: SIMH3.SA
CPFE

'Total Return = 138.56%'

### Our sample portfolio backtest was quite promising and gave a good return (the total above is the sum of the individual ticker returns). The sequence break signal could also be used as an input to another, more sophisticated model.