# Project 2: Breakout Strategy
## Instructions
In this project, I implemented several functions to build and evaluate a breakout trading strategy. Each function is designed to solve specific problems outlined in the project. After coding each function, I validated my work using unit tests provided in the `project_tests` package. While these tests help catch major errors, the final accuracy of my code will be confirmed upon submission.

## Packages
For this project, I used Python packages introduced during the course, including [pandas](https://pandas.pydata.org/) and [NumPy](http://www.numpy.org/). These libraries are sufficient for all the computations and data manipulations required.

In addition, I leveraged custom modules provided with the project:
- `helper` and `project_helper`: These modules offer utility functions and visualization tools that simplify the analysis and interpretation of results.
- `project_tests`: This package contains unit tests to validate the correctness of each implemented function.

### Install Packages

In [33]:
import sys
!{sys.executable} -m pip install -r requirements.txt

Collecting plotly>=4.0.0 (from -r requirements.txt (line 6))
  Using cached plotly-5.24.1-py3-none-any.whl.metadata (7.3 kB)
Using cached plotly-5.24.1-py3-none-any.whl (19.1 MB)
Installing collected packages: plotly
  Attempting uninstall: plotly
    Found existing installation: plotly 3.10.0
    Uninstalling plotly-3.10.0:
      Successfully uninstalled plotly-3.10.0
Successfully installed plotly-5.24.1


In [34]:
!python -m pip install plotly==3.10.0 --no-cache

Collecting plotly==3.10.0
  Downloading plotly-3.10.0-py2.py3-none-any.whl.metadata (6.2 kB)
Downloading plotly-3.10.0-py2.py3-none-any.whl (41.5 MB)
   ---------------------------------------- 41.5/41.5 MB 7.9 MB/s eta 0:00:00
Installing collected packages: plotly
  Attempting uninstall: plotly
    Found existing installation: plotly 5.24.1
    Uninstalling plotly-5.24.1:
      Successfully uninstalled plotly-5.24.1
Successfully installed plotly-3.10.0


In [35]:
# Restart the kernel first so the newly installed version is loaded
import plotly
print(plotly.__version__)
# Should print 3.10.0

3.10.0


### Load Packages

In [36]:
import pandas as pd
import numpy as np
import helper
import project_helper
import project_tests

## Market Data
### Load Data
Working with real data offers valuable hands-on experience, but it doesn't cover all the concepts I want to demonstrate in this project. To address this, I created a fictional scenario featuring companies mining [Terbium](https://en.wikipedia.org/wiki/Terbium), a sector experiencing significant growth and profitability. These companies are entirely fictional and represent a rapidly expanding market, designed to support the demonstrations and concepts I'll be showcasing later in this project.

In [37]:
# Generate the base market data: seeded random-walk prices for AAPL, MSFT, and GOOGL
np.random.seed(42)
dates = pd.date_range(start="2020-01-01", end="2022-12-31", freq="D")
tickers = ["AAPL", "MSFT", "GOOGL"]

original_data = {
    "date": [],
    "ticker": [],
    "adj_close": [],
    "adj_high": [],
    "adj_low": [],
}

for ticker in tickers:
    prices = np.cumsum(np.random.normal(loc=0.1, scale=2, size=len(dates))) + 100
    high = prices + np.random.uniform(1, 5, size=len(prices))
    low = prices - np.random.uniform(1, 5, size=len(prices))

    original_data["date"].extend(dates)
    original_data["ticker"].extend([ticker] * len(dates))
    original_data["adj_close"].extend(prices)
    original_data["adj_high"].extend(high)
    original_data["adj_low"].extend(low)

df_original = pd.DataFrame(original_data)

# Add TB sector to the market
df = df_original
df = pd.concat([df] + project_helper.generate_tb_sector(df[df['ticker'] == 'AAPL']['date']), ignore_index=True)

close = df.reset_index().pivot(index='date', columns='ticker', values='adj_close')
high = df.reset_index().pivot(index='date', columns='ticker', values='adj_high')
low = df.reset_index().pivot(index='date', columns='ticker', values='adj_low')

print('Loaded Data')

Loaded Data
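
As a minimal sketch of the reshaping step, the toy frame below shows how `pivot` turns long-format rows into a date-by-ticker matrix shaped like `close`. The tickers `AAA` and `BBB` and their prices are made up for illustration.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, ticker) pair
long_df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02']),
    'ticker': ['AAA', 'BBB', 'AAA', 'BBB'],
    'adj_close': [10.0, 20.0, 11.0, 19.0],
})

# Wide format: dates as the index, one column per ticker
wide = long_df.pivot(index='date', columns='ticker', values='adj_close')
print(wide)
```

Each column of the wide frame is one ticker's price series, which is why later cells can select a single stock with `close[googl_ticker]`.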


### View Data
To see what one of these 2-D matrices looks like, let's look at the closing-price matrix.

In [38]:
close

ticker,AAPL,AGENEN,ALTAIC,ARMENA,BAKERI,BIFLOR,CLUSIA,DASYST,GESNER,GOOGL,...,PRAEST,PULCHE,SAXATI,SCHREN,SPRENG,SYLVES,TARDA,TURKES,URUMIE,VVEDEN
2020-01-01,101.09342831,1.00457164,1.00123368,1.00933351,1.00041567,1.00594155,1.00714710,1.00032329,0.99468137,98.08614528,...,0.99427935,1.00183018,0.99430803,1.00634851,0.99820120,0.99288202,1.00326732,0.99686412,1.00859650,1.00660858
2020-01-02,100.91689970,1.00677279,0.99940857,1.00798407,1.00709751,1.00266227,1.00406725,1.00427950,1.00565944,98.28179592,...,1.00949623,1.00470930,1.00036397,1.00506064,1.00170909,1.01267471,1.00293829,1.00299351,1.00427780,0.99510207
2020-01-03,102.31227678,1.00457070,1.00898256,1.00647192,1.00897048,1.00853640,1.00480878,1.01536140,1.01264615,97.25340356,...,1.00324139,1.00233587,1.00619996,1.00543483,1.00813889,1.00519262,1.00078950,1.01005972,1.00659446,1.00592032
2020-01-04,105.45833649,1.01363638,1.01804305,1.01047066,1.00945928,1.00944313,1.01015286,1.01072533,1.01232259,94.36532797,...,1.01248435,1.01083288,1.00378929,1.01013325,1.00636580,1.00822530,1.01225353,1.01070300,1.00950610,1.01366693
2020-01-05,105.09002974,1.02224028,1.02195130,1.01034575,1.01426189,1.01498519,1.02092648,1.01108851,1.01297764,96.97561955,...,1.01795258,1.01503990,1.01899131,1.00990380,1.00919581,1.02073056,1.01434053,1.01570268,1.01220415,1.01064064
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-12-27,278.34949500,3004.85459588,3031.41764392,2997.43693201,3011.09157819,3005.71018836,3025.43309959,3018.20101512,3008.89723042,96.55318840,...,3001.00450346,2995.94988383,2981.37405126,3003.02932632,3016.99007629,2994.34822422,3010.55520050,3020.59787338,3024.08491955,2986.61426326
2022-12-28,279.90475409,3040.51585433,3053.73700724,3036.46767390,3041.33237774,3027.60932216,3025.92083584,3054.83912436,3025.27529284,94.45943824,...,3044.00278107,3032.59210008,3048.39295592,3023.78211011,3030.15346652,3019.86423823,3019.60494623,3056.29474104,3042.67535910,3063.50755978
2022-12-29,280.10864586,3057.00938416,3070.49060645,3064.90742724,3089.03402182,3086.91674127,3082.56864938,3068.29984350,3065.68824604,96.20357040,...,3068.63023382,3074.19114545,3070.31728283,3053.00765677,3061.20271612,3064.91116546,3066.54354536,3077.81701946,3055.01384757,3038.69649903
2022-12-30,281.67392601,3082.98085994,3094.25605673,3112.46102603,3076.33318666,3094.51436356,3078.10117229,3081.16351025,3084.42113434,98.04787661,...,3084.99700640,3080.58621444,3115.66049752,3079.01513376,3104.71669324,3099.01064179,3114.07654844,3080.23120394,3083.95368105,3116.65761330


### Stock Example
Let's see what a single stock looks like from the closing prices. For this example and future display examples in this project, we'll use Google's stock (GOOGL).

In [39]:
googl_ticker = 'GOOGL'
project_helper.plot_stock(close[googl_ticker], '{} Stock'.format(googl_ticker))

## The Alpha Research Process

In this project, I’m working on coding and evaluating a "breakout" trading signal. It’s essential to understand how these steps fit into the alpha research workflow. Since the signal-to-noise ratio in trading signals is very low, it’s easy to fall into the trap of overfitting to noise. To avoid this, I’m not jumping straight into coding the signal. Instead, I’m starting with a general observation and hypothesis.

For this project, I’m working under the assumption that the first three steps of the alpha research workflow ("observe & research," "form hypothesis," and "validate hypothesis") have already been completed. The hypothesis I’m using is outlined below:
- In the absence of news or significant trading interest, stocks tend to oscillate within a range.
- Traders aim to profit from this range-bound behavior by:
    - Selling or shorting at the top of the range
    - Buying or covering at the bottom
    - This repetitive trading activity reinforces the existence of the range
- When a stock breaks out of the range, typically because of a significant event such as a major news release or market pressure from a large investor:
    - Liquidity traders who have been providing liquidity at the range boundaries scramble to cover their positions to minimize losses.
    - The breakout also attracts other investors. Influenced by the behavioral bias of _herding_ ([Herd Behavior](https://www.investopedia.com/university/behavioral_finance/behavioral8.asp)), these investors build positions that further fuel the trend continuation.

## Compute the Highs and Lows in a Window
For the breakout strategy, I’ll be using **price highs** and **price lows** as key indicators. The goal is to identify ranges that stocks oscillate within, which are crucial for detecting breakouts.
In this section, I implemented the function `get_high_lows_lookback`. This function calculates the **maximum high price** and **minimum low price** over a specified lookback window of days. The number of days to look back is determined by the variable `lookback_days`.
A key detail is that the **current day is excluded** from the lookback calculation, so each day's close is compared only against the highs and lows of prior days rather than a range that already contains it.

In [40]:
def get_high_lows_lookback(high, low, lookback_days):
    """
    Get the highs and lows in a lookback window.
    
    Parameters
    ----------
    high : DataFrame
        High price for each ticker and date
    low : DataFrame
        Low price for each ticker and date
    lookback_days : int
        The number of days to look back
    
    Returns
    -------
    lookback_high : DataFrame
        Lookback high price for each ticker and date
    lookback_low : DataFrame
        Lookback low price for each ticker and date
    """
    
    # Rolling maximum of the highs over `lookback_days` rows;
    # shift(1) moves the result down one row so the current day is excluded
    lookback_high = high.rolling(window=lookback_days, min_periods=lookback_days).max().shift(1)
    
    # Rolling minimum of the lows over the same window, also excluding the current day
    lookback_low = low.rolling(window=lookback_days, min_periods=lookback_days).min().shift(1)

    print(lookback_days)
    
    print(high)
    
    print(lookback_high)
    
    return lookback_high, lookback_low

project_tests.test_get_high_lows_lookback(get_high_lows_lookback)

2
                   ANV         OHJ        REYF
2011-01-23 35.44110000 34.17990000 34.02230000
2011-01-24 92.11310000 91.05430000 90.95720000
2011-01-25 57.97080000 57.78140000 58.19820000
2011-01-26 34.17050000 92.45300000 58.51070000
                   ANV         OHJ        REYF
2011-01-23         NaN         NaN         NaN
2011-01-24         NaN         NaN         NaN
2011-01-25 92.11310000 91.05430000 90.95720000
2011-01-26 92.11310000 91.05430000 90.95720000
Tests Passed
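
To make the effect of `shift(1)` concrete, here is a small sketch on a made-up high-price series (the ticker `XYZ` and its values are hypothetical). With a 2-day lookback, the first two rows have no complete prior window and stay `NaN`.

```python
import pandas as pd

# Hypothetical daily highs for a single ticker
toy_high = pd.DataFrame({'XYZ': [10.0, 12.0, 11.0, 9.0, 13.0]})

# Rolling max over 2 rows, then shifted so each row only sees prior days
toy_lookback_high = toy_high.rolling(window=2, min_periods=2).max().shift(1)

print(pd.concat([toy_high['XYZ'], toy_lookback_high['XYZ']],
                axis=1, keys=['high', 'lookback_high']))
```

On the last row the high is 13.0 but the lookback high is 11.0, the larger of the two preceding days, so the current day's own value never enters its window.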


### View Data
Let's use our implementation of `get_high_lows_lookback` to get the highs and lows for the past 50 days and compare them to the corresponding stock's prices. As before, we'll use Google's stock as the example.

In [41]:
lookback_days = 50
lookback_high, lookback_low = get_high_lows_lookback(high, low, lookback_days)
project_helper.plot_high_low(
    close[googl_ticker],
    lookback_high[googl_ticker],
    lookback_low[googl_ticker],
    'High and Low of {} Stock'.format(googl_ticker))

50
ticker             AAPL        AGENEN        ALTAIC        ARMENA  \
date                                                                
2020-01-01 103.35805273    1.00535556    1.00517576    1.00933351   
2020-01-02 103.94583894    1.00816615    1.00455346    1.01014731   
2020-01-03 103.47856822    1.00635474    1.00898256    1.00779494   
2020-01-04 107.05170930    1.01363638    1.02046070    1.01213387   
2020-01-05 110.03655024    1.02224028    1.02195130    1.01566068   
...                 ...           ...           ...           ...   
2022-12-27 282.88335405 3006.44294346 3031.41764392 3000.03447943   
2022-12-28 282.69803031 3040.51585433 3053.73700724 3041.81308575   
2022-12-29 283.15037581 3057.00938416 3070.49060645 3067.40500385   
2022-12-30 285.18029606 3101.06187362 3109.42541607 3112.46102603   
2022-12-31 286.31802328 3139.05818907 3126.50542734 3125.55081935   

ticker            BAKERI        BIFLOR        CLUSIA        DASYST  \
date                         

## Compute Long and Short Signals
Using the previously calculated indicators of **highs** and **lows**, I implemented a breakout strategy to generate **long** and **short** signals. This is done in the function `get_long_short`, which assigns signals based on the following conditions:

| Signal | Condition |
|----|------|
| -1 | Low > Close Price |
| 1  | High < Close Price |
| 0  | Otherwise |

In this table, **Close Price** is the `close` parameter. **Low** and **High** are the values generated from `get_high_lows_lookback`, passed in as the `lookback_low` and `lookback_high` parameters.

In [42]:
def get_long_short(close, lookback_high, lookback_low):
    """
    Generate the signals long, short, and do nothing.
    
    Parameters
    ----------
    close : DataFrame
        Close price for each ticker and date
    lookback_high : DataFrame
        Lookback high price for each ticker and date
    lookback_low : DataFrame
        Lookback low price for each ticker and date
    
    Returns
    -------
    long_short : DataFrame
        The long, short, and do nothing signals for each ticker and date
    """
    
    # Create a new DataFrame to store signals
    long_short = pd.DataFrame(0, index=close.index, columns=close.columns, dtype='int64')
    
    # Iterate over every date and ticker
    for index in close.index:
        for col in close.columns:
            if close.loc[index, col] < lookback_low.loc[index, col]:
                long_short.at[index, col] = -1  # Short signal: close broke below the lookback low
            elif close.loc[index, col] > lookback_high.loc[index, col]:
                long_short.at[index, col] = 1  # Long signal: close broke above the lookback high
            else:
                long_short.at[index, col] = 0  # Do nothing
    
    return long_short

project_tests.test_get_long_short(get_long_short)

Tests Passed
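
The nested loop is easy to follow but slow on large frames. A vectorized variant is sketched below. `get_long_short_vectorized` is a hypothetical name, not part of the project, and the sketch assumes all three frames share the same index and columns. Comparisons against `NaN` evaluate to `False`, so dates without a full lookback window get `0`, just as in the loop version.

```python
import numpy as np
import pandas as pd

def get_long_short_vectorized(close, lookback_high, lookback_low):
    # 1 where the close breaks above the lookback high, else 0
    long_signal = (close > lookback_high).astype('int64')
    # 1 where the close breaks below the lookback low, else 0
    short_signal = (close < lookback_low).astype('int64')
    # Long minus short gives 1, -1, or 0 per cell
    return long_signal - short_signal

# Hypothetical single-ticker data
toy_close = pd.DataFrame({'XYZ': [10.0, 12.0, 8.0, 10.0]})
toy_lookback_high = pd.DataFrame({'XYZ': [np.nan, 11.0, 11.0, 12.0]})
toy_lookback_low = pd.DataFrame({'XYZ': [np.nan, 9.0, 9.0, 8.0]})

toy_signal = get_long_short_vectorized(toy_close, toy_lookback_high, toy_lookback_low)
print(toy_signal['XYZ'].tolist())
```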


### View Data

To evaluate the signals generated by the breakout strategy, I will plot them against the **close prices**. This chart visualizes the long and short signals on the stock price movement.

In [43]:
signal = get_long_short(close, lookback_high, lookback_low)
project_helper.plot_signal(
    close[googl_ticker],
    signal[googl_ticker],
    'Long and Short of {} Stock'.format(googl_ticker))

## Filter Signals

In my initial implementation of the breakout strategy, I noticed that it was generating too many repeated signals. For example, if I'm already shorting a stock, additional short signals don't add any value to the strategy. The same goes for repeated long signals when I'm already long on a stock.

### My Objective
My goal here is to filter out redundant signals within a given `lookahead_days` window. Here's what I aim to achieve:
- If the previous signal for a stock is the same as the current signal within the lookahead window, I'll replace the signal with `0` (a do-nothing signal).

### Example
To illustrate, here's a signal series for a single stock:
`[1, 0, 1, 0, 1, 0, -1, -1]`

After running the `filter_signals` function with a `lookahead_days` of 3, it becomes:
`[1, 0, 0, 0, 1, 0, -1, 0]`

### Helper Function: `clear_signals`
To make this implementation simpler, I’m using the `clear_signals` helper function. This function removes redundant signals within a given window size. For instance, consider this series of long signals:
`[0, 1, 0, 0, 1, 1, 0, 1, 0]`

Using `clear_signals` with a window size of 3, the result is:
`[0, 1, 0, 0, 0, 1, 0, 0, 0]`

It’s important to note that `clear_signals` works only on a single type of signal (either long or short). It doesn’t handle mixed signals.

### My Plan
I’ll implement the `filter_signals` function to remove redundant signals. This will make the strategy more efficient by ensuring I focus only on unique and meaningful signals.


In [44]:
def clear_signals(signals, window_size):
    """
    Clear out signals in a Series of just long or short signals.
    
    Reduce the signals to at most one within any window of `window_size` days.
    
    Parameters
    ----------
    signals : Pandas Series
        The long, short, or do nothing signals
    window_size : int
        The number of days to have a single signal       
    
    Returns
    -------
    signals : Pandas Series
        Signals with the redundant signals inside each window removed
    """
    # Start with buffer of window size
    # This handles the edge case of calculating past_signal in the beginning
    clean_signals = [0]*window_size
    
    for signal_i, current_signal in enumerate(signals):
        # Check if there was a signal in the past window_size of days
        has_past_signal = bool(sum(clean_signals[signal_i:signal_i+window_size]))
        # Use the current signal if there's no past signal, else 0/False
        clean_signals.append(not has_past_signal and current_signal)
        
    # Remove buffer
    clean_signals = clean_signals[window_size:]

    # Return the signals as a Series of Ints
    return pd.Series(np.array(clean_signals).astype(int), signals.index)


def filter_signals(signal, lookahead_days):
    """
    Filter out signals in a DataFrame.
    
    Parameters
    ----------
    signal : DataFrame
        The long, short, and do nothing signals for each ticker and date
    lookahead_days : int
        The number of days to look ahead
    
    Returns
    -------
    filtered_signal : DataFrame
        The filtered long, short, and do nothing signals for each ticker and date
    """
    
    # Initialize an empty DataFrame to store filtered signals
    filtered_signal = pd.DataFrame(index=signal.index, columns=signal.columns)

    # Loop through each ticker (column) in the DataFrame
    for ticker in signal.columns:
        # Access the signal series for this ticker
        signal_series = signal[ticker]
        
        # Apply the clear_signals function to remove extra signals
        long_signals = clear_signals(signal_series == 1, lookahead_days)
        short_signals = clear_signals(signal_series == -1, lookahead_days)

        # Combine long and short signals back
        filtered_signal[ticker] = long_signals - short_signals
    
    # Fill NaN values with 0 before converting to integers
    return filtered_signal.fillna(0).astype('int64')

project_tests.test_filter_signals(filter_signals)

Tests Passed
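
The two example series from the description can be checked with a standalone pure-Python sketch of the same windowing rule. The helper names `clear_one_direction` and `filter_series` are hypothetical and operate on plain lists rather than DataFrames.

```python
def clear_one_direction(flags, window_size):
    """Keep a flag only if no flag was kept in the previous `window_size` positions."""
    kept = []
    for i, flag in enumerate(flags):
        recent = kept[max(0, i - window_size):i]
        kept.append(1 if flag and not any(recent) else 0)
    return kept

def filter_series(signals, lookahead_days):
    """Filter long (1) and short (-1) signals separately, then recombine."""
    longs = clear_one_direction([s == 1 for s in signals], lookahead_days)
    shorts = clear_one_direction([s == -1 for s in signals], lookahead_days)
    return [l - s for l, s in zip(longs, shorts)]

mixed_example = [1, 0, 1, 0, 1, 0, -1, -1]
filtered_mixed = filter_series(mixed_example, 3)
print(filtered_mixed)

long_only_example = [0, 1, 0, 0, 1, 1, 0, 1, 0]
cleared_long = clear_one_direction(long_only_example, 3)
print(cleared_long)
```

Filtering the two directions separately is what lets a short signal survive right after a long one: the window only suppresses repeats of the same direction.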


### View Data
Let's view the same chart as before, but with the redundant signals removed.

In [45]:
signal_5 = filter_signals(signal, 5)
signal_10 = filter_signals(signal, 10)
signal_20 = filter_signals(signal, 20)
for signal_data, signal_days in [(signal_5, 5), (signal_10, 10), (signal_20, 20)]:
    project_helper.plot_signal(
        close[googl_ticker],
        signal_data[googl_ticker],
        'Long and Short of {} Stock with {} day signal window'.format(googl_ticker, signal_days))

## Lookahead Close Prices
With the trading signal done, we can start working on evaluating how many days to short or long the stocks. In this problem, we will implement `get_lookahead_prices` to get the close price `lookahead_days` days ahead in time. We'll use the lookahead prices to calculate future returns in another problem.

In [46]:
def get_lookahead_prices(close, lookahead_days):
    """
    Get the lookahead prices for `lookahead_days` number of days.
    
    Parameters
    ----------
    close : DataFrame
        Close price for each ticker and date
    lookahead_days : int
        The number of days to look ahead
    
    Returns
    -------
    lookahead_prices : DataFrame
        The lookahead prices for each ticker and date
    """
    
    return close.shift(-lookahead_days)

project_tests.test_get_lookahead_prices(get_lookahead_prices)

Tests Passed
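
A quick sketch of what a negative shift does, using a made-up price series: each row receives the close from `lookahead_days` rows later, and the final rows have no future price, so they become `NaN`.

```python
import pandas as pd

# Hypothetical closing prices
toy_close = pd.Series([100.0, 101.0, 103.0, 102.0, 105.0])

# Pull each price 2 rows earlier, so row i holds the close from row i + 2
toy_lookahead_2 = toy_close.shift(-2)

print(pd.concat([toy_close, toy_lookahead_2], axis=1, keys=['close', 'lookahead_2']))
```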


### View Data
Using the `get_lookahead_prices` function, let's generate lookahead closing prices for 5, 10, and 20 days.

Let's also chart a subsection of a few months of the Google stock instead of years. This will allow you to view the differences between the 5, 10, and 20 day lookaheads. Otherwise, they will mesh together when looking at a chart that is zoomed out.

In [47]:
lookahead_5 = get_lookahead_prices(close, 5)
lookahead_10 = get_lookahead_prices(close, 10)
lookahead_20 = get_lookahead_prices(close, 20)
project_helper.plot_lookahead_prices(
    close[googl_ticker].iloc[150:250],
    [
        (lookahead_5[googl_ticker].iloc[150:250], 5),
        (lookahead_10[googl_ticker].iloc[150:250], 10),
        (lookahead_20[googl_ticker].iloc[150:250], 20)],
    '5, 10, and 20 day Lookahead Prices for Slice of {} Stock'.format(googl_ticker))

## Lookahead Price Returns
We will implement `get_return_lookahead` to generate the log price return between the closing price and the lookahead price.

In [48]:
def get_return_lookahead(close, lookahead_prices):
    """
    Calculate the log returns from the lookahead days to the signal day.
    
    Parameters
    ----------
    close : DataFrame
        Close price for each ticker and date
    lookahead_prices : DataFrame
        The lookahead prices for each ticker and date
    
    Returns
    -------
    lookahead_returns : DataFrame
        The lookahead log returns for each ticker and date
    """
    
    return np.log(lookahead_prices/close)

project_tests.test_get_return_lookahead(get_return_lookahead)

Tests Passed
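
The log return over a horizon is `ln(lookahead_price / close)`. The sketch below uses made-up prices that grow 10% per step to show one reason log returns are convenient: consecutive log returns add up to the log return over the combined period.

```python
import numpy as np
import pandas as pd

# Hypothetical closes growing 10% per day
toy_close = pd.Series([100.0, 110.0, 121.0])

# One-day-ahead log returns; the last row has no future price and is NaN
one_day = np.log(toy_close.shift(-1) / toy_close)

# Two-day-ahead log return from the first day
two_day = np.log(toy_close.iloc[2] / toy_close.iloc[0])

print(one_day.tolist())
print(two_day, one_day.iloc[0] + one_day.iloc[1])
```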


### View Data
Using the same lookahead prices and same subsection of the Google stock from the previous problem, we'll view the lookahead returns.

In order to view price returns on the same chart as the stock, a second y-axis will be added. When viewing this chart, the axis for the price of the stock will be on the left side, like previous charts. The axis for price returns will be located on the right side.

In [49]:
price_return_5 = get_return_lookahead(close, lookahead_5)
price_return_10 = get_return_lookahead(close, lookahead_10)
price_return_20 = get_return_lookahead(close, lookahead_20)
project_helper.plot_price_returns(
    close[googl_ticker].iloc[150:250],
    [
        (price_return_5[googl_ticker].iloc[150:250], 5),
        (price_return_10[googl_ticker].iloc[150:250], 10),
        (price_return_20[googl_ticker].iloc[150:250], 20)],
    '5, 10, and 20 day Lookahead Returns for Slice of {} Stock'.format(googl_ticker))

## Compute the Signal Return
Using the price returns, generate the signal returns: each lookahead return multiplied by that day's signal (1, -1, or 0).

In [50]:
def get_signal_return(signal, lookahead_returns):
    """
    Compute the signal returns.
    
    Parameters
    ----------
    signal : DataFrame
        The long, short, and do nothing signals for each ticker and date
    lookahead_returns : DataFrame
        The lookahead log returns for each ticker and date
    
    Returns
    -------
    signal_return : DataFrame
        Signal returns for each ticker and date
    """
    
    return signal * lookahead_returns

project_tests.test_get_signal_return(get_signal_return)

Tests Passed
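
A small sketch with made-up numbers shows why the product works as a strategy return: a short signal (`-1`) on a day followed by a price drop yields a positive signal return, and days without a signal contribute zero.

```python
import pandas as pd

# Hypothetical signals and lookahead log returns for one ticker
toy_signal = pd.DataFrame({'XYZ': [1, -1, 0]})
toy_returns = pd.DataFrame({'XYZ': [0.02, -0.03, 0.05]})

# Element-wise product: long keeps the sign, short flips it, 0 discards the return
toy_signal_return = toy_signal * toy_returns
print(toy_signal_return['XYZ'].tolist())
```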


### View Data
Let's continue using the previous lookahead prices to view the signal returns. Just like before, the axis for the signal returns is on the right side of the chart.

In [51]:
title_string = '{} day Lookahead Signal Returns for {} Stock'
signal_return_5 = get_signal_return(signal_5, price_return_5)
signal_return_10 = get_signal_return(signal_10, price_return_10)
signal_return_20 = get_signal_return(signal_20, price_return_20)
project_helper.plot_signal_returns(
    close[googl_ticker],
    [
        (signal_return_5[googl_ticker], signal_5[googl_ticker], 5),
        (signal_return_10[googl_ticker], signal_10[googl_ticker], 10),
        (signal_return_20[googl_ticker], signal_20[googl_ticker], 20)],
    [title_string.format(5, googl_ticker), title_string.format(10, googl_ticker), title_string.format(20, googl_ticker)])

## Test for Significance
### Histogram
Let's plot a histogram of the signal return values.

In [52]:
project_helper.plot_signal_histograms(
    [signal_return_5, signal_return_10, signal_return_20],
    'Signal Return',
    ('5 Days', '10 Days', '20 Days'))

### Question: What do the histograms tell you about the signal returns?

Here's what the histograms suggest:

- The signal return distributions are centered around zero, showing that most returns are small in magnitude.
- As the time horizon increases (5 days to 20 days), the spread of signal returns becomes wider, reflecting larger cumulative returns or losses over longer periods.
- The distribution shows a slight skewness, which may vary across different time horizons.
- Outliers on both ends (positive and negative) are more noticeable as the time period lengthens, indicating potential extreme values in returns over longer durations.

## Outliers
I noticed the outliers in the 10- and 20-day histograms. To get a clearer view of these outliers, I'll compare the 5, 10, and 20-day signal returns to normal distributions with the same mean and standard deviation as each signal return distribution.

In [53]:
project_helper.plot_signal_to_normal_histograms(
    [signal_return_5, signal_return_10, signal_return_20],
    'Signal Return',
    ('5 Days', '10 Days', '20 Days'))

## Kolmogorov-Smirnov Test
To identify stocks causing outlying returns in the histogram, I’ll use the **Kolmogorov-Smirnov Test (KS-Test)**. This test compares the distribution of signal returns for each ticker to a reference distribution, measuring deviations with the **KS statistic** and **p-value**.

In [54]:
# Filter out returns that don't have a long or short signal.
long_short_signal_returns_5 = signal_return_5[signal_5 != 0].stack()
long_short_signal_returns_10 = signal_return_10[signal_10 != 0].stack()
long_short_signal_returns_20 = signal_return_20[signal_20 != 0].stack()

# Get just ticker and signal return
long_short_signal_returns_5 = long_short_signal_returns_5.reset_index().iloc[:, [1,2]]
long_short_signal_returns_5.columns = ['ticker', 'signal_return']
long_short_signal_returns_10 = long_short_signal_returns_10.reset_index().iloc[:, [1,2]]
long_short_signal_returns_10.columns = ['ticker', 'signal_return']
long_short_signal_returns_20 = long_short_signal_returns_20.reset_index().iloc[:, [1,2]]
long_short_signal_returns_20.columns = ['ticker', 'signal_return']

# View some of the data
long_short_signal_returns_5.head(10)

,ticker,signal_return
0,AGENEN,0.02031583
1,ARMENA,0.01861283
2,BAKERI,0.01651901
3,BIFLOR,0.02174528
4,DASYST,0.02707825
5,GREIGI,0.01506557
6,KAUFMA,0.01747242
7,ORPHAN,0.01602386
8,PULCHE,0.02394477
9,SAXATI,0.01965249


This sets up the data required for the KS-Test.

Now, I’ll implement the `calculate_kstest` function, which uses the Kolmogorov-Smirnov test (KS test) to compare each stock's signal returns against a normal distribution whose mean and standard deviation are computed from all the signal returns in the input DataFrame. The test is run once per stock using [`scipy.stats.kstest`](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.kstest.html#scipy-stats-kstest).

When calculating the standard deviation of the signal returns, I'll set the delta degrees of freedom to 0 (`ddof=0`), which gives the population standard deviation.

For this implementation, I won’t attempt a vectorized solution; instead, I'll use the [`groupby`](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.groupby.html) function and iterate over its groups.


In [55]:
from scipy.stats import kstest
import project_tests
import importlib

importlib.reload(project_tests)

def calculate_kstest(long_short_signal_returns):
    """
    Calculate the KS-Test against the signal returns with a long or short signal.
    
    Parameters
    ----------
    long_short_signal_returns : DataFrame
        The signal returns which have a signal.
        This DataFrame contains two columns, "ticker" and "signal_return"
    
    Returns
    -------
    ks_values : Pandas Series
        KS statistic for all the tickers
    p_values : Pandas Series
        P value for all the tickers
    """
    # Calculate the global mean and standard deviation across all tickers
    global_mean = long_short_signal_returns['signal_return'].mean()
    global_std = long_short_signal_returns['signal_return'].std(ddof=0)
    
    # Group the DataFrame by 'ticker'
    grouped = long_short_signal_returns.groupby('ticker')
    
    ks_values = {}
    p_values = {}
    
    # Iterate over each group (ticker)
    for ticker, group in grouped:
        returns = group['signal_return']
        
        # Perform the KS test against a normal distribution with the global mean and std
        ks_stat, p_value = kstest(returns, 'norm', args=(global_mean, global_std))
        
        # Store the results in dictionaries
        ks_values[ticker] = ks_stat
        p_values[ticker] = p_value
    
    # Convert dictionaries to Pandas Series
    ks_values = pd.Series(ks_values)
    p_values = pd.Series(p_values)
    
    return ks_values, p_values

# Test the function
project_tests.test_calculate_kstest(calculate_kstest)


Tests Passed
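
To build intuition for the KS statistic, the sketch below computes it by hand for a normal reference distribution: it is the largest vertical gap between the sample's empirical CDF and the reference CDF. This is an illustrative reimplementation, not how `calculate_kstest` works internally. The helper names and the simulated samples are assumptions, and the p-value is not computed here.

```python
import math
import numpy as np

def normal_cdf(x, mean, std):
    # CDF of a normal distribution, written with the error function
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def ks_statistic(sample, mean, std):
    # Largest gap between the empirical CDF and the normal CDF
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    cdf = np.array([normal_cdf(v, mean, std) for v in x])
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)   # ECDF above the reference
    d_minus = np.max(cdf - np.arange(0, n) / n)      # ECDF below the reference
    return max(d_plus, d_minus)

rng = np.random.RandomState(0)
matching = rng.normal(0.0, 1.0, size=500)   # drawn from the reference distribution
shifted = matching + 3.0                    # same shape, very different mean

d_matching = ks_statistic(matching, 0.0, 1.0)
d_shifted = ks_statistic(shifted, 0.0, 1.0)
print(d_matching, d_shifted)
```

A ticker whose returns sit far from the pooled distribution behaves like the shifted sample: its KS statistic approaches 1, which is the pattern the outlier search looks for.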


### View Data
Using the signal returns we created above, let's calculate the KS statistics and p-values.

In [56]:
ks_values_5, p_values_5 = calculate_kstest(long_short_signal_returns_5)
ks_values_10, p_values_10 = calculate_kstest(long_short_signal_returns_10)
ks_values_20, p_values_20 = calculate_kstest(long_short_signal_returns_20)

print('ks_values_5')
print(ks_values_5.head(10))
print('p_values_5')
print(p_values_5.head(10))

ks_values_5
AAPL     0.61240840
AGENEN   0.10093606
ALTAIC   0.10273738
ARMENA   0.10824062
BAKERI   0.04810483
BIFLOR   0.09307686
CLUSIA   0.08653026
DASYST   0.06876108
GESNER   0.09792763
GOOGL    0.79795195
dtype: float64
p_values_5
AAPL     0.00007442
AGENEN   0.05953498
ALTAIC   0.05362969
ARMENA   0.03612092
BAKERI   0.81091716
BIFLOR   0.10210180
CLUSIA   0.14783196
DASYST   0.38741557
GESNER   0.07593228
GOOGL    0.00000000
dtype: float64


## Find Outliers

Now that I have the KS and p-values calculated, the next step is to identify the symbols that qualify as outliers based on specific thresholds. These thresholds will help narrow down the symbols that exhibit unusual behavior.

### Objective
I'll implement the `find_outliers` function to identify symbols that meet the following conditions:
1. **p-value condition**: The symbol's p-value must be less than `pvalue_threshold`, meaning the null hypothesis (that its signal returns follow the reference normal distribution) is rejected.
2. **KS value condition**: The symbol must have a KS value above `ks_threshold`.

### Example
To classify a symbol as an outlier, both conditions must be satisfied:
- `p-value < pvalue_threshold`
- `KS value > ks_threshold`

The function will then return a list of all symbols that meet these criteria.


In [57]:
def find_outliers(ks_values, p_values, ks_threshold, pvalue_threshold=0.05):
    """
    Find outlying symbols using KS values and P-values
    
    Parameters
    ----------
    ks_values : Pandas Series
        KS statistic for all the tickers
    p_values : Pandas Series
        P value for all the tickers
    ks_threshold : float
        The threshold for the KS statistic
    pvalue_threshold : float
        The threshold for the p-value
    
    Returns
    -------
    outliers : set of str
        Symbols that are outliers
    """
    
    return set(ks_values[(ks_values > ks_threshold) & (p_values < pvalue_threshold)].index)


project_tests.test_find_outliers(find_outliers)

Tests Passed
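
The boolean-mask selection can be seen on a toy input. The tickers and values below are hypothetical. Only `AAA` satisfies both conditions: `BBB` has a high KS value but a p-value above the threshold, and `CCC` has a low p-value but a small KS value.

```python
import pandas as pd

# Hypothetical KS statistics and p-values, indexed by ticker
toy_ks = pd.Series({'AAA': 0.90, 'BBB': 0.85, 'CCC': 0.10})
toy_p = pd.Series({'AAA': 0.001, 'BBB': 0.20, 'CCC': 0.001})

# Both conditions must hold for a ticker to count as an outlier
mask = (toy_ks > 0.8) & (toy_p < 0.05)
toy_outliers = set(toy_ks[mask].index)
print(toy_outliers)
```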


### View Data
Using the `find_outliers` function we implemented, let's see what we found.

In [58]:
ks_threshold = 0.8
outliers_5 = find_outliers(ks_values_5, p_values_5, ks_threshold)
outliers_10 = find_outliers(ks_values_10, p_values_10, ks_threshold)
outliers_20 = find_outliers(ks_values_20, p_values_20, ks_threshold)

outlier_tickers = outliers_5.union(outliers_10).union(outliers_20)
print('{} Outliers Found:\n{}'.format(len(outlier_tickers), ', '.join(list(outlier_tickers))))

2 Outliers Found:
MSFT, AAPL


### Show Significance without Outliers
Let's compare the 5, 10, and 20 day signal returns without outliers to normal distributions. Also, let's see how the p-value has changed with the outliers removed.

In [59]:
good_tickers = list(set(close.columns) - outlier_tickers)

project_helper.plot_signal_to_normal_histograms(
    [signal_return_5[good_tickers], signal_return_10[good_tickers], signal_return_20[good_tickers]],
    'Signal Return Without Outliers',
    ('5 Days', '10 Days', '20 Days'))

With the outliers removed, the signal returns are closer to a normal distribution.