# Introduction

I'm going to review different stocks and run a time series analysis on them. The goal is to find the stocks whose price series are non-stationary, compute the Hurst exponent for each of them, and keep the stocks with the highest Hurst exponent, because a high value points to a trending series, which is what our momentum trading strategy needs.

I'm going to work with daily data from Yahoo Finance.

**Steps to follow:**

1 - Import the libraries

2 - Get the data from Yahoo Finance

3 - Analyze each time series with the Augmented Dickey-Fuller (`adfuller`) test and the Hurst exponent

4 - Keep the stocks with a Hurst exponent greater than a defined threshold and plot them

5 - Calculate the volatility of the selected stocks; in my case, I'm going to pick the stocks with the lowest and highest volatility for the momentum trading strategy

6 - Based on the Hurst exponent and the visual analysis, select the non-stationary stocks and pick the best one for our momentum trading strategy.

## Syllabus

* [1 - Import the libraries](#1)
* [2 - Get data](#2)
* [3 - Select time series for the momentum trading strategy](#3)
* [4 - Analyze selected stocks](#4)
  * [4.1 - Analyze stock with highest volatility](#4.1)
  * [4.2 - Analyze stock with lowest volatility](#4.2)

## 1 - Import the libraries <a id="1"></a>

In [8]:
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.stattools import adfuller
import datetime
import os
from hurst import compute_Hc
from typing import List

In [9]:
def get_stocks(tickers: List[str], start_date: str = '2000-01-01', 
               end_date: str = datetime.datetime.today().strftime('%Y-%m-%d')) -> pd.DataFrame:
    """
    Download historical stock data for the specified tickers.

    Parameters:
    ----------
    tickers : List[str]
        A list of ticker symbols to download data for.
    start_date : str, optional (default='2000-01-01')
        The start date for the data download in YYYY-MM-DD format.
    end_date : str, optional (default=today's date in YYYY-MM-DD format)
        The end date for the data download in YYYY-MM-DD format.

    Returns:
    -------
    pd.DataFrame
        A Pandas DataFrame containing the Open, High, Low, Close, and Volume data
        for the specified tickers and date range.
    """
    ohlc = yf.download(tickers, start=start_date, end=end_date, interval='1d')
    return ohlc

In [10]:
def remove_sparse_columns(df: pd.DataFrame, min_rows: int = 5000) -> pd.DataFrame:
    """
    Remove sparse columns from a Pandas DataFrame.

    Parameters:
    ----------
    df : pd.DataFrame
        A Pandas DataFrame to remove sparse columns from.
    min_rows : int, optional (default=5000)
        The minimum number of non-null values a column must have to be kept.

    Returns:
    -------
    pd.DataFrame
        A new Pandas DataFrame with sparse columns removed.
    """
    df_copy = df.copy()

    # Get columns with at least `min_rows` non-null values. More data is better.
    columns_with_non_null_values = df_copy.count()[df_copy.count() >= min_rows].index

    # Keep those columns, then drop any column that still contains a null value
    df_copy = df_copy[columns_with_non_null_values]
    df_copy = df_copy.dropna(axis=1)

    return df_copy

In [11]:
def normalize_df(df: pd.DataFrame, column: str = 'Adj Close', column_index: str = 'Date') -> pd.DataFrame:
    """
    Normalize a Pandas DataFrame by renaming columns and the index.

    Parameters:
    ----------
    df : pd.DataFrame
        A Pandas DataFrame to be normalized.
    column : str, optional (default='Adj Close')
        The column name to use for the normalized DataFrame.
    column_index : str, optional (default='Date')
        The name to use for the index of the normalized DataFrame.

    Returns:
    -------
    pd.DataFrame
        A new Pandas DataFrame with normalized column names and index.
    """
    df_copy = df.copy()

    # If the DataFrame has multi-level columns, select the specified column
    if df.columns.nlevels > 1:
        df_copy = df_copy[column]
        df_copy.index = df_copy.index.get_level_values(column_index)

    # Rename columns and index to remove special characters and convert to lowercase
    df_copy.columns = df_copy.columns.str.replace('[^0-9a-zA-Z]+', '_', regex=True).str.lower()
    df_copy.index.name = df_copy.index.name.lower()

    return df_copy
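
As a quick check of what this normalization does, the sketch below builds a small two-level frame shaped like a multi-ticker download and applies the same column selection and renaming steps. The tickers and prices are made up for illustration, and the code repeats the rule inline rather than calling `normalize_df`, so it can run on its own.

```python
import pandas as pd

# Toy two-level frame mimicking a multi-ticker download; all values are made up.
dates = pd.date_range("2024-01-01", periods=3, name="Date")
columns = pd.MultiIndex.from_product([["Adj Close", "Volume"], ["AAPL", "BRK-B"]])
raw = pd.DataFrame(
    [[10.0, 20.0, 100, 200], [11.0, 21.0, 110, 210], [12.0, 22.0, 120, 220]],
    index=dates,
    columns=columns,
)

# Selecting the top-level key leaves one column per ticker.
prices = raw["Adj Close"].copy()

# Same renaming rule as above: runs of non-alphanumeric characters become "_",
# then everything is lowercased.
prices.columns = prices.columns.str.replace("[^0-9a-zA-Z]+", "_", regex=True).str.lower()
prices.index.name = prices.index.name.lower()

print(list(prices.columns))
print(prices.index.name)
```

The ticker `BRK-B` ends up as `brk_b`, which is easier to use as a column name later on.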

## 2 - Get data <a id="2"></a>

We have a file, `assets/stock_info.csv`, that contains the list of stocks we want to analyze. We read it to get the list of tickers. If the price data is already saved locally in `assets/stocks.csv`, we load it from there; otherwise we download it from Yahoo Finance and save it to that file.
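
The load-or-download logic described above can be sketched in isolation. This is a minimal sketch, not the notebook's own code: `load_or_build` and `fake_download` are hypothetical names, the download is replaced by made-up data, and the cache lives in a temporary directory so nothing is written to `assets/`.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def load_or_build(path, build):
    """Return the cached frame at `path`, or call `build()` and cache its result."""
    if os.path.exists(path):
        # parse_dates=True turns the first column back into a DatetimeIndex
        return pd.read_csv(path, index_col=0, parse_dates=True)
    frame = build()
    frame.to_csv(path)
    return frame


def fake_download():
    # Hypothetical stand-in for the real download step; prices are made up.
    dates = pd.date_range("2024-01-01", periods=4, name="date")
    return pd.DataFrame({"aaa": np.arange(4.0), "bbb": np.arange(4.0) * 2}, index=dates)


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "stocks.csv")
    first = load_or_build(cache_path, fake_download)   # builds the frame and writes the file
    second = load_or_build(cache_path, fake_download)  # reads the file back

print(second.head())
```

Reading with `parse_dates=True` matters here: without it the cached index comes back as plain strings, which gets in the way of any date-based operation later.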

In [12]:
pd.read_csv('assets/stock_info.csv')

| Unnamed: 0 | Ticker | Name | Exchange |
|---|---|---|---|
| 0 | A | Agilent Technologies | NYSE |
| 1 | AA | Alcoa Inc. | NYSE |
| 2 | AAN | Aaron's Inc | NYSE |
| 3 | AAT | American Assets Trust | NYSE |
| 4 | AAV | Advantage Oil & Gas Ltd | NYSE |
| ... | ... | ... | ... |
| 5777 | ZN | Zion Oil & Gas Inc | NASDAQ |
| 5778 | ZNGA | Zynga Inc. | NASDAQ |
| 5779 | ZOLT | Zoltek Companies | NASDAQ |
| 5780 | ZOOM | Zoom Technologies | NASDAQ |


In [13]:
tickers = pd.read_csv('assets/stock_info.csv')['Ticker'].tolist()

In [14]:
tickers[:5]

['A', 'AA', 'AAN', 'AAT', 'AAV']

In [15]:
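loaded_from_file = False
if os.path.exists('assets/stocks.csv'):
    # Reuse the cached prices and parse the index back into dates
    stocks = pd.read_csv('assets/stocks.csv', index_col=0, parse_dates=True)
    loaded_from_file = True
else:
    stocks = get_stocks(tickers)
    # Cache the normalized, single-level frame so the file can be read back as-is
    normalize_df(stocks).to_csv('assets/stocks.csv')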

In [16]:
if not loaded_from_file:
    stocks = normalize_df(stocks)

In [17]:
stocks = remove_sparse_columns(stocks, min_rows=5000)

In [18]:
stocks.shape

(5827, 593)

## 3 - Select time series for the momentum trading strategy <a id="3"></a>

In this section we select the stocks whose time series are non-stationary and look for the best one for our momentum trading strategy. We use the Augmented Dickey-Fuller (`adfuller`) test to check whether each time series is stationary, and we drop the series for which the test rejects a unit root (p-value below the threshold). We then use the Hurst exponent to keep only the stocks whose value is greater than a defined threshold.

**Hurst exponent threshold:** 0.6

**p-value threshold:** 0.05
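
To build some intuition for the Hurst threshold: a value near 0.5 is what a random walk gives, values above 0.5 indicate a persistent (trending) series, and values below 0.5 indicate mean reversion. The sketch below is not the estimator behind `compute_Hc`; it is a simpler estimate based on how the spread of lagged differences grows with the lag, and `hurst_from_lags` is a hypothetical helper written only for this illustration on synthetic data.

```python
import numpy as np


def hurst_from_lags(series, max_lag=100):
    """Estimate H from how the spread of lagged differences grows with the lag."""
    values = np.asarray(series, dtype=float)
    lags = np.arange(2, max_lag)
    # Standard deviation of x[t + lag] - x[t] for each lag
    spread = [np.std(values[lag:] - values[:-lag]) for lag in lags]
    # For a self-similar series spread ~ lag ** H, so H is the log-log slope
    slope, _ = np.polyfit(np.log(lags), np.log(spread), 1)
    return slope


rng = np.random.default_rng(0)
steps = rng.normal(size=5000)

random_walk = np.cumsum(steps)  # no memory: H should come out near 0.5
white_noise = steps             # strongly mean-reverting: H should come out near 0

h_walk = hurst_from_lags(random_walk)
h_noise = hurst_from_lags(white_noise)
print(f"random walk H ~ {h_walk:.2f}")
print(f"white noise H ~ {h_noise:.2f}")
```

With a threshold of 0.6, a series has to sit clearly above the random-walk value before it is kept.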

In [19]:
def get_non_stationary_stocks(df: pd.DataFrame) -> pd.DataFrame:
    """
    Keep only the non-stationary stock timeseries of a DataFrame, using the Augmented Dickey-Fuller test to drop the stationary ones.
    
    Parameters
    ----------
    df : pandas.DataFrame
        Input DataFrame containing the stock timeseries data.
    
    Returns
    -------
    pandas.DataFrame
        DataFrame containing only the non-stationary stock timeseries data.
    
    Raises
    ------
    ValueError
        If the input DataFrame is empty or contains only non-numeric data.
    """
    selected_stocks = df.copy()

    count = 0

    for ticker in df.columns:
        pvalue = adfuller(df[ticker])[1]

        # A p-value below 0.05 rejects the unit-root null, i.e. the series looks stationary
        if pvalue < 0.05:
            selected_stocks = selected_stocks.drop(ticker, axis=1)
            count += 1

    print(f'{count} stationary stock timeseries removed')

    return selected_stocks

In [20]:
def get_trending_stocks(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """
    Identify and remove non-trending stock timeseries from a DataFrame using the Hurst exponent.
    
    Parameters
    ----------
    df : pandas.DataFrame
        Input DataFrame containing the stock timeseries data.
    threshold : float, optional
        Threshold value for the Hurst exponent below which a stock timeseries is considered non-trending.
        Default is 0.5.
    
    Returns
    -------
    pandas.DataFrame
        DataFrame containing only the trending stock timeseries data.
    
    Raises
    ------
    ValueError
        If the input DataFrame is empty or contains only non-numeric data.
    """
    selected_stocks = df.copy()

    count = 0

    for ticker in df.columns:
        try:
            # Compute the Hurst exponent using the random walk method
            H, _, _ = compute_Hc(df[ticker], kind='random_walk', simplified=True)
        except Exception:
            # If the computation fails (usually due to negative prices), set H to 0.0 so the stock is dropped
            H = 0.0

        if H <= threshold:
            selected_stocks = selected_stocks.drop(ticker, axis=1)
            count += 1

    print(f'{count} non-trending stock timeseries removed')

    return selected_stocks

In [21]:
selected_stocks = get_non_stationary_stocks(stocks)

In [None]:
selected_stocks = get_trending_stocks(selected_stocks, threshold=0.6)

In [None]:
selected_stocks.head(3)

In [None]:
selected_stocks.shape

## 4 - Analyze selected stocks <a id="4"></a>

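We are going to analyze the stocks with the highest and the lowest volatility to see which of the two is the better candidate for the strategy. Volatility here is the standard deviation of the daily log returns, annualized by multiplying by the square root of 252 trading days.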
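
The volatility calculation can be tried out on synthetic data first. This is a minimal sketch under stated assumptions: the two tickers `calm` and `wild` and their prices are made up, and the annualization assumes 252 trading days per year.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices for two made-up tickers with different noise levels
rng = np.random.default_rng(1)
dates = pd.bdate_range("2020-01-01", periods=500)
log_steps = pd.DataFrame(
    {"calm": rng.normal(0, 0.005, size=500), "wild": rng.normal(0, 0.03, size=500)},
    index=dates,
)
prices = 100 * np.exp(log_steps.cumsum())

# Daily log returns, their standard deviation, and the annualized figure
returns = np.log(prices / prices.shift(1)).dropna()
volatility = returns.std().to_frame("volatility")
volatility["annual_volatility"] = volatility["volatility"] * np.sqrt(252)

most_volatile = volatility["volatility"].idxmax()
least_volatile = volatility["volatility"].idxmin()

print(volatility)
print("highest:", most_volatile)
print("lowest:", least_volatile)
```

Using `idxmax` and `idxmin` gives the two tickers directly, without sorting the whole column.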

In [None]:
stock_returns = np.log(selected_stocks / selected_stocks.shift(1)).dropna()
stocks_volatility = pd.DataFrame(stock_returns.std(), columns=['volatility'])
stocks_volatility['annual_volatility'] = stocks_volatility['volatility'] * np.sqrt(252)

In [None]:
stocks_volatility['volatility'].sort_values(ascending=False).plot(kind='bar', figsize=(20, 10))

In [None]:
window = 14

### 4.1 - Analyze stock with highest volatility <a id="4.1"></a>

In [None]:
most_volatile_stock = stocks_volatility['volatility'].sort_values(ascending=False).index[0]
(stock_returns[most_volatile_stock].rolling(window).std() * 100).plot(figsize=(20, 10))
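# Assumption: monthly returns are the daily log returns summed within each calendar month
daily_returns = stock_returns[most_volatile_stock]
monthly_returns = daily_returns.groupby(pd.to_datetime(daily_returns.index).to_period('M')).sum()
monthly_returns.plot(figsize=(20, 10))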

### 4.2 - Analyze stock with lowest volatility <a id="4.2"></a>

In [None]:
least_volatile_stock = stocks_volatility['volatility'].sort_values(ascending=False).index[-1]
(stock_returns[least_volatile_stock].rolling(window).std() * 100).plot(figsize=(20, 10))