## Approximate Nearest Neighbours for Stock Market Pattern Search/Recognition
The purpose of this notebook is to find candlestick patterns in market data that resemble a given input pattern, using approximate nearest neighbor (ANN) search. The helper functions defined below are all called from the last cell.


### Import Libraries
* Pandas: For Dataframes and Data Management
* Annoy: For Vector Similarity Search
* yfinance (Yahoo Finance): For Data Retrieval
* Plotly: For Visualizations

In [7]:
#!pip install yfinance

In [8]:
import pandas as pd

from annoy import AnnoyIndex

#import investpy
import yfinance as yf

import plotly.graph_objects as go
import plotly.express as py
from plotly.subplots import make_subplots

#### Data Retrieval with Yahoo Finance
The get_data function retrieves stock market data for a given ticker (i.e. a specific stock or index) and time interval.

After the retrieval, the function processes the data by sorting it from newest to oldest, removing the "Volume" and "Adj Close" columns, resetting the index, and renaming the columns to "time", "open", "high", "low", and "close".

It also calculates each bar's open-to-close change as a percentage: the difference between the closing and opening prices, divided by the opening price and multiplied by 100.

The function returns two dataframes: one containing the raw data and one containing the time and change data.
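
As a quick check of the change calculation, here is a minimal sketch in plain Python. The prices are made up, and `percent_change` and `notebook_form` are hypothetical helper names used only for this illustration; the second one mirrors the arithmetic as it is spelled in the notebook cell (divide by `open * 100`, then multiply by 10000).

```python
# Toy bars with made-up prices, newest first, mirroring the ordering get_data produces.
bars = [
    {"time": "2022-01-03", "open": 100.0, "close": 105.0},
    {"time": "2022-01-02", "open": 200.0, "close": 190.0},
    {"time": "2022-01-01", "open": 50.0, "close": 50.0},
]


def percent_change(open_price: float, close_price: float) -> float:
    """Open-to-close move of one bar, as a percentage of the open."""
    return (close_price - open_price) / open_price * 100


def notebook_form(open_price: float, close_price: float) -> float:
    """Same quantity, written the way the notebook cell spells it."""
    return ((close_price - open_price) / (open_price * 100)) * 10000


changes = [percent_change(b["open"], b["close"]) for b in bars]
for bar, change in zip(bars, changes):
    # Roughly +5 %, -5 % and 0 % for the three toy bars
    print(bar["time"], round(change, 4))
```

Both helpers give the same value up to floating-point rounding, so the series fed to the ANN index is simply the per-bar percentage move.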

In [9]:
def get_data(ticker: str, interval: str):
    """
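    ticker: Yahoo Finance ticker symbol, e.g. "BTC-USD"
    interval: bar interval passed to yfinance, e.g. "1d"

    Returns the OHLC dataframe (newest bar first, with a "change" column) and a dataframe holding only "time" and "change".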
    """

    df = yf.download(  # or pdr.get_data_yahoo(...
        # tickers list or string as well
        tickers = ticker,

        # use "period" instead of start/end
        # valid periods: 1d,5d,1mo,3mo,6mo,1y,2y,5y,10y,ytd,max
        # (optional, default is '1mo')
        period = "10y",

        # fetch data by interval (including intraday if period < 60 days)
        # valid intervals: 1m,2m,5m,15m,30m,60m,90m,1h,1d,5d,1wk,1mo,3mo
        # (optional, default is '1d')
        interval = interval,

        # Whether to ignore timezone when aligning ticker data from 
        # different timezones. Default is True. False may be useful for 
        # minute/hourly data.
        ignore_tz = False,

        # group by ticker (to access via data['SPY'])
        # (optional, default is 'column')
        group_by = 'column',

        # adjust all OHLC automatically
        # (optional, default is False)
        auto_adjust = False,

        # identify and attempt repair of currency unit mixups e.g. $/cents
        repair = False,

        # download pre/post regular market hours data
        # (optional, default is False)
        prepost = False,

        # use threads for mass downloading? (True/False/Integer)
        # (optional, default is True)
        threads = True,

        # proxy URL scheme to use when downloading?
        # (optional, default is None)
        proxy = None
    )

    # Sort it from newest bar to oldest bar
    df = df.iloc[::-1]
    df = df.drop(["Volume", "Adj Close"], axis = 1)
    df = df.reset_index()
    df.columns = ["time", "open", "high", "low", "close"]

    # Calculate each bar's open-to-close change in percent
    df["change"] = (df.close - df.open) / df.open * 100


    # finalize the change dataframe
    change_df = df[["time", "change"]].reset_index().drop(["index"], axis=1).copy()
    
    return df, change_df

#### Making the data ready for ANN algorithms

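The __alter_df__ function turns the change series into fixed-length vectors (embeddings). It adds window_size - 1 shifted copies of the change column to the dataframe, so that each row holds the change of its own bar followed by the changes of the window_size - 1 bars before it. The last window_size - 1 rows do not have enough history and are padded with zeros, so they are left out of the embeddings list.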
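
The windowing idea can be sketched without pandas. This is a minimal illustration, not the notebook's implementation: `make_windows` is a hypothetical name, and the change values are made up. It produces the same vectors the embeddings list is meant to hold, one per bar that has enough history.

```python
def make_windows(changes, window_size):
    """Return one vector per bar: the bar's own change followed by the
    changes of the window_size - 1 bars before it (the list is newest first).
    Bars without enough history are skipped."""
    return [changes[i:i + window_size]
            for i in range(len(changes) - window_size + 1)]


changes = [0.5, -1.0, 2.0, 0.25, -0.75]  # newest first, made-up values
windows = make_windows(changes, window_size=3)
print(windows)
# → [[0.5, -1.0, 2.0], [-1.0, 2.0, 0.25], [2.0, 0.25, -0.75]]
```

With 5 bars and a window of 3 there are 5 - 3 + 1 = 3 vectors, which matches dropping the last window_size - 1 rows.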

In [10]:
# For every bar, add the changes of the previous window_size - 1 bars as extra columns
def alter_df(change_df, window_size):
    change_list = change_df["change"].tolist()
    
    # Shift the list by one bar and pad the end with 0
    for i in range(window_size-1):
        change_list.pop(0)
        change_list.append(0)
        
        # Assign a column name
        change_df["Candle(" + str(i+1) +")"] = change_list
    
    # Create a list of embeddings
    embeddings = change_df.drop("time", axis=1).values.tolist()
    
    # Drop the trailing rows, which hold zero padding instead of real history
    for i in range(window_size-1):
        embeddings.pop()
        
    return embeddings, change_df

#### Retrieving/Reconstructing the Patterns

The __get_pattern_df__ function generates a dataframe containing a specified number of candlestick patterns from a larger dataset. 

The function takes three arguments: 
* check_for, an integer representing the index of the candlestick pattern to check for
* window_size, an integer specifying the number of previous or future bars to include in the dataframe
* future, a boolean value indicating whether to include future bars in the dataframe.

If future is set to __True__, the function returns the __window_size__ bars of the pattern plus the bars that follow it in time. As written, the loop steps through 2 * __window_size__ positions toward newer data, starting at __check_for__ itself. If future is False or not specified, the function returns only the __window_size__ bars that make up the pattern. Because the data is sorted newest first, newer bars sit at smaller indices; no bounds check is made, so for a pattern close to the newest bar the index becomes negative and wraps around to the oldest rows. To judge the quality of the matches, keep future set to False, so that each match covers the same span as the input pattern and can be compared with it directly.

The function collects the matching rows of the input dataset in a list, converts the list to a dataframe, and returns it.
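
The index arithmetic behind past and future bars can be sketched as follows. This is a bounds-checked variant under stated assumptions, not the notebook's `get_pattern_df`: the helper name `pattern_positions` and its `n_rows` parameter are hypothetical, and it takes at most window_size future bars.

```python
def pattern_positions(check_for, window_size, n_rows, future=False):
    """Row positions of a pattern in a newest-first table, oldest bar first.

    check_for is the newest bar of the pattern; older bars have larger
    positions and newer (future) bars have smaller positions.
    """
    # The pattern itself: check_for .. check_for + window_size - 1 (clipped to the table).
    last_past = min(check_for + window_size - 1, n_rows - 1)
    # Up to window_size newer bars, never going below row 0.
    first_future = max(check_for - window_size, 0) if future else check_for
    return list(range(last_past, first_future - 1, -1))


print(pattern_positions(5, 3, n_rows=100))               # → [7, 6, 5]
print(pattern_positions(5, 3, n_rows=100, future=True))  # → [7, 6, 5, 4, 3, 2]
print(pattern_positions(1, 3, n_rows=100, future=True))  # → [3, 2, 1, 0]
```

Clipping at row 0 means a match near the newest bar simply shows fewer future bars rather than wrapping around to the oldest data.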

In [11]:
def get_pattern_df(check_for: int, window_size: int, future: bool = False):
    
    """
    check_for: int -> Check for the candle stick at index = check_for from the data frame df
    window_size: int -> Look into past 'window_size' bars. If 20, return 20 previous bars.
    future: bool -> Look into future 'window_size' bars if checked. It will return additional bars if available in the dataset.
    
    Return the dataframe of candlesticks which belongs to the inputted pattern.
    """
    
    # Rows (candlesticks) collected for the pattern
    candles_list = []
    
    # If the user wants to see future bars as well, future must be True
    if future:
        # Walk 2 * window_size positions toward newer bars, starting at check_for
        for i in range(window_size*2, 0, -1):
            candles_list.append(df.iloc[check_for])
            check_for -= 1
            
        check_for += window_size*2
            
    # Collect the window_size bars of the pattern itself (check_for and older)
    for i in range(window_size):
        candles_list.append(df.iloc[check_for])
        check_for += 1
    
    
    return pd.DataFrame(candles_list)

### Execute ANN and Visualize the Data

The __visualize_patterns__ function uses the __plotly__ library to generate a graph of candlestick patterns from a given dataset. 

The function takes four arguments:
* main, an integer representing the index of the candlestick pattern to use as the main graph;
* window_size, explained above;
* closest_n, an integer specifying the number of patterns to include in the graph;
* future, explained above. 

### ANN: ANNOY by Spotify
The function first uses the __AnnoyIndex__ class from the __annoy__ library to build an index of the candlestick patterns in the global embeddings list, using the __manhattan__ distance metric and 1000 trees. It then calls the __get_nns_by_vector__ method on the index to retrieve the indices of the patterns that are most similar to the main pattern. Because the main pattern is normally its own nearest neighbour, the function removes it from the list of indices if it is present.
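
For intuition about what the index approximates, here is an exact brute-force search with the same Manhattan metric, written in plain Python. It is a sketch for comparison only and does not use Annoy; `manhattan` and `exact_neighbours` are hypothetical names, and the vectors are made up. On a few thousand bars such an exhaustive scan can also serve as a sanity check for the approximate results.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def exact_neighbours(vectors, query_index, n):
    """Indices of the n vectors closest to vectors[query_index], excluding the query itself."""
    query = vectors[query_index]
    others = [i for i in range(len(vectors)) if i != query_index]
    others.sort(key=lambda i: manhattan(query, vectors[i]))
    return others[:n]


vectors = [[1.0, 2.0], [1.5, 2.5], [-3.0, 0.0], [1.0, 2.25]]  # made-up embeddings
print(exact_neighbours(vectors, query_index=0, n=2))
# → [3, 1]
```

Excluding the query before truncating keeps the result at exactly n neighbours whenever enough vectors exist.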

It then generates a main graph of the main pattern using the __get_pattern_df__ function and __go.Candlestick__ from plotly. It also generates subgraphs of the closest patterns using the same methods. Finally, it uses the make_subplots and update_layout functions from plotly to arrange the main and subgraphs in a grid and adjust the layout of the overall graph. The function displays the graph using the show method.
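
The grid arrangement comes down to mapping a running plot number to a 1-based row and column. A minimal sketch of that mapping, assuming two columns as in the notebook; `grid_positions` and `rows_needed` are hypothetical helper names:

```python
def grid_positions(n_plots, n_cols=2):
    """Map plot numbers 0, 1, 2, ... to 1-based (row, col) cells, filling row by row."""
    return [(k // n_cols + 1, k % n_cols + 1) for k in range(n_plots)]


def rows_needed(n_plots, n_cols=2):
    """Smallest number of rows that holds n_plots cells (ceiling division)."""
    return -(-n_plots // n_cols)


# One main chart plus four neighbours
print(grid_positions(5))  # → [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1)]
print(rows_needed(5))     # → 3
```

Plot 0 is the main pattern at row 1, column 1; the neighbours fill the remaining cells from left to right.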

In [17]:
def visualize_patterns(main: int,
                       window_size: int,
                       closest_n: int,
                       future: bool = False) -> None:
    """
    main: The index of the candle stick where it is the newest bar of the input pattern 
    window_size: Length of the input pattern.
    closest_n: return closest n patterns
    future: Set it to true to see future patterns if available
    
    With plotly, visualize the price graph through candlesticks.
    'Main' is the first subgraph, and the closest ones to the main graph are other subgraphs.
    """
    
    vector_size = len(embeddings[0])

    t = AnnoyIndex(vector_size, 'manhattan')
    for i in range(len(embeddings)):
        t.add_item(i, embeddings[i])

    t.build(1000)
    
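    # Ask for one extra neighbour, because the main pattern itself is normally returned
    nearest_neighs = t.get_nns_by_vector(embeddings[main], closest_n + 1)

    # Remove the main pattern from the results, then keep at most closest_n neighbours
    if main in nearest_neighs:
        nearest_neighs.remove(main)
    nearest_neighs = nearest_neighs[:closest_n]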
        
    # Get the pattern of the inputted chart
    df_pattern_main = get_pattern_df(main, window_size)
    
    # Create the subgraph spots for down below
    fig = make_subplots(len(nearest_neighs)//2 + 1, cols=2)

    # Print the main graph
    row_i, col_i = 1, 1
    fig.add_trace( go.Candlestick(
                    x=df_pattern_main['time'],
                    open=df_pattern_main['open'],
                    high=df_pattern_main['high'],
                    low=df_pattern_main['low'],
                    close=df_pattern_main['close']),
                    row=1, col=1
                  )
    
    fig.update_xaxes(rangeslider= {'visible':False}, row=1, col=1)
    
    
    
    col_i += 1
    # Print the closest charts in the form of subgraphs
    for i in nearest_neighs:
        df_pattern_next = get_pattern_df(i, window_size, future)
        fig.add_trace( 
            go.Candlestick(
                x=df_pattern_next['time'],
                open=df_pattern_next['open'],
                high=df_pattern_next['high'],
                low=df_pattern_next['low'],
                close=df_pattern_next['close']
            ),
            row=row_i, col=col_i
        )
        if future:
            fig.add_vline(x = df_pattern_next["time"].iloc[window_size*2], row=row_i, col=col_i)
        else:
            fig.add_vline(x = df_pattern_next["time"].iloc[0], row=row_i, col=col_i)
            
        fig.update_xaxes(rangeslider= {'visible':False}, row=row_i, col=col_i)
        
        col_i = (col_i%2) + 1
        row_i = row_i + 1 if col_i == 1 else row_i
        
            


    fig.update_layout(height=1000, width=1000, xaxis_rangeslider_visible=False)
    fig.show()

#### Putting it all together
Edit the values between the __PARAM WINDOW__ markers, following the comments there. Look up the ticker symbol on Yahoo Finance; the default is BTC-USD, but any ticker can be used. The approach tends to work best for crypto, since those markets trade around the clock and each bar opens at the previous bar's close.

In [18]:
# PARAM WINDOW BEGIN ----------------------------

# Which stock, currency or commodity do you want to look at?
ticker_str = "BTC-USD"

# Which time interval do you want to work with?
interval_str = "1d"

# How many candlesticks make up the pattern? (The longer the pattern, the less similar the matches tend to be.)
window_size = 15

# How many results do you want to retrieve? (The more results, the lower the quality of the later ones.)
closest_n = 5

# Input pattern index: if set to 0, the pattern ends at the most recent bar. If set to n,
# the pattern ends at the bar n positions before the most recent one.
input_pattern_index = 0

# Do you want to see what happened after each matched pattern?
# Advice: to check the quality of the results, set this to False first. If the matches look good, set it to True
# without changing anything else to see what happened afterwards.
see_future = False

# PARAM WINDOW END ------------------------------



df, change_df = get_data(ticker = ticker_str, interval = interval_str)


embeddings, change_df = alter_df(change_df, window_size)

vector_size = len(embeddings[0])

visualize_patterns(input_pattern_index,
                   window_size,
                   closest_n,
                   future = see_future)

change_df


[*********************100%***********************]  1 of 1 completed


index,time,change,Candle(1),Candle(2),Candle(3),Candle(4),Candle(5),Candle(6),Candle(7),Candle(8),...,Candle(15),Candle(16),Candle(17),Candle(18),Candle(19),Candle(20),Candle(21),Candle(22),Candle(23),Candle(24)
0,2022-12-18 00:00:00+00:00,-0.146956,0.889707,-4.129463,-2.519301,0.188864,3.341057,0.607733,-0.148967,-0.032077,...,-1.064133,0.707047,-1.170020,4.396892,1.401824,-1.355823,-0.116958,-0.346788,-0.484438,-0.043174
1,2022-12-17 00:00:00+00:00,0.889707,-4.129463,-2.519301,0.188864,3.341057,0.607733,-0.148967,-0.032077,-0.574485,...,0.707047,-1.170020,4.396892,1.401824,-1.355823,-0.116958,-0.346788,-0.484438,-0.043174,2.563156
2,2022-12-16 00:00:00+00:00,-4.129463,-2.519301,0.188864,3.341057,0.607733,-0.148967,-0.032077,-0.574485,2.291904,...,-1.170020,4.396892,1.401824,-1.355823,-0.116958,-0.346788,-0.484438,-0.043174,2.563156,2.581808
3,2022-12-15 00:00:00+00:00,-2.519301,0.188864,3.341057,0.607733,-0.148967,-0.032077,-0.574485,2.291904,-1.412439,...,4.396892,1.401824,-1.355823,-0.116958,-0.346788,-0.484438,-0.043174,2.563156,2.581808,-3.093319
4,2022-12-14 00:00:00+00:00,0.188864,3.341057,0.607733,-0.148967,-0.032077,-0.574485,2.291904,-1.412439,0.673131,...,1.401824,-1.355823,-0.116958,-0.346788,-0.484438,-0.043174,2.563156,2.581808,-3.093319,-2.519535
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3010,2014-09-21 00:00:00+00:00,-2.270110,3.605767,-6.910351,-7.096262,-1.831006,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3011,2014-09-20 00:00:00+00:00,3.605767,-6.910351,-7.096262,-1.831006,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3012,2014-09-19 00:00:00+00:00,-6.910351,-7.096262,-1.831006,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3013,2014-09-18 00:00:00+00:00,-7.096262,-1.831006,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
