# Financial Data Processing for Stock Analysis

Stock market prediction has long been an intriguing subject for researchers across various fields. Some attempt to forecast stock prices by identifying technical patterns, others analyze news sentiment, while some simply replicate the strategies of more experienced traders.

But the fundamental question remains: Is there truly something within a price chart that can provide reliable insight into the future movement of a stock? In other words, is the market, to some extent, predictable?

If human analysts can identify price movement patterns through extensive data analysis, then it suggests a potential correlation between specific features and price fluctuations. Consequently, if such a correlation exists, machine learning models should also be capable of capturing it effectively.

This notebook aims to explore the feasibility of achieving a meaningful level of accuracy in stock price prediction from scratch. The process involves autonomously collecting data, constructing a dataset enriched with relevant indicators, and ultimately training a machine learning model to forecast future prices.

Regarding predictive precision, expecting to forecast exact prices would be overly ambitious and is not the primary objective of this study. Instead, a binary classification approach is adopted: the model predicts whether the next candlestick’s closing price will be higher (1) or lower (0) than the current one. Based on these signals, a trading strategy could involve taking a long position on a "1" signal and a short position on a "0" signal.
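
As a minimal sketch of this labelling rule, the target can be derived by comparing each close with the following one. The helper name `make_binary_label` is hypothetical and not part of the notebook's pipeline. The sketch assumes an `Adj Close`-style series in chronological order and, as a further assumption, maps ties (an unchanged close) to 0.

```python
import pandas as pd

def make_binary_label(close: pd.Series) -> pd.Series:
    """Return 1 where the next close is higher than the current one, else 0.

    The last row has no next candle, so its label is left missing (<NA>).
    """
    next_close = close.shift(-1)                      # close of the following candle
    label = (next_close > close).astype("Int64")      # nullable integer dtype
    label[next_close.isna()] = pd.NA                  # no future candle, no label
    return label

closes = pd.Series([100.0, 101.5, 101.0, 101.0, 102.3])
labels = make_binary_label(closes)
print(labels.tolist())  # → [1, 0, 0, 1, <NA>]
```

Rows whose label is missing would have to be dropped before training, since the model cannot be scored on a candle whose successor is unknown.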

The notebook is structured into four main sections:

- **Data Gathering and Processing**
- **ML Model Creation**
- **Results Analysis**
- **Future Improvements and References**

Throughout this study, the following technical terms will be used:

- **Candle**: An object representing the price range of an asset over a specific period, recording its open, close, high, and low prices; commonly used in market analysis.
- **Technical Indicator**: A mathematical calculation based on historical price, volume, or open interest data, utilized in financial markets to analyze trends, patterns, and potential future price movements.
- **Leverage**: The practice of using borrowed capital (via margin accounts) to amplify position sizes. Traders borrow funds from brokers to control larger trades than their own capital allows, increasing both potential gains and risks.
- **Stop-Loss**: A risk management order placed with a broker to automatically sell a security when it reaches a predetermined price, limiting potential losses on a position.
- **Take-Profit**: A predefined order to sell a security once it reaches a specific price level, ensuring that profits are realized before unfavorable market movements occur.

This study aims to assess the potential of machine learning in stock price prediction while acknowledging the inherent complexities and uncertainties of financial markets.

In [None]:
!pip install --upgrade pip
!pip install yfinance==0.2.61
!pip install pandas==2.2.2
!pip install numpy==1.26.4
!pip install matplotlib==3.9.2
!pip install mlxtend==0.23.3
!pip install imblearn==0.0
!pip install scikit-learn==1.5.2
!pip install missingno==0.5.2
!pip install alpaca_trade_api==3.2.0
!pip install xgboost
!pip install seaborn

In [None]:
import yfinance as yf
import pandas as pd
from datetime import datetime, timedelta
import numpy as np
import warnings
import alpaca_trade_api as tradeapi
import matplotlib.pyplot as plt
import seaborn as sns
import os 
warnings.filterwarnings('ignore')

# Part 1 -  Data gathering/processing

For this project, a large dataset is required—one that is both up-to-date and reliable. While pre-existing datasets from platforms such as Kaggle can be useful, there is no guarantee that they are clean, well-maintained, or updated regularly. Additionally, to ensure flexibility in selecting different stocks for analysis and adjusting date ranges without constraints, it is essential to have full control over data collection and preprocessing.

Thus, the approach involves downloading and cleaning the data independently.

This section of the notebook is structured into four key components:

- **1.1 Data Downloading**: Retrieving raw data from relevant sources.
- **1.2 Feature Calculation Functions**: Computing key features necessary for model input.
- **1.3 Label Generation**: Defining target variables for supervised learning.
- **Main Data Preprocessing "Pipeline"**: A wrapper to streamline and manage the entire preprocessing workflow.

## 1.1- Data downloading

To begin, it is essential to download data from a reliable (and preferably free) source. Two potential sources meeting these criteria have been identified: Yahoo Finance and Alpaca API. Both are excellent choices; however, each comes with its own set of challenges. Yahoo Finance provides live data without delays and covers virtually every available financial instrument, but its access wrapper is unofficial, and intraday candle data is limited to 30 days. On the other hand, Alpaca API offers an official, authorized library with no restrictions on intraday candle data, though its data is not live and it does not encompass all stocks and financial instruments.

Consequently, I have chosen to utilize both sources concurrently. Older data, which is not available via Yahoo Finance, is obtained from Alpaca API, while the most recent data is sourced from Yahoo Finance. Fortunately, the data formats from these two sources are fully compatible.

For Alpaca API, a demo account was employed solely for data retrieval, ensuring that no real funds are involved. The entire data acquisition process was encapsulated in a function to facilitate easier access to all necessary parameters.

Both data sources return a DataFrame with the following columns:
- Datetime
- Open
- High
- Low
- Adj Close
- Volume
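
Because both sources share this schema, stitching them together reduces to a concatenation on the common columns. The sketch below illustrates the idea on toy data. The helper `combine_sources`, the constant `STANDARD_COLUMNS`, and the rule of preferring the more recent source on overlapping timestamps are assumptions made for illustration, not the notebook's exact logic.

```python
import pandas as pd

STANDARD_COLUMNS = ["Datetime", "Open", "High", "Low", "Adj Close", "Volume"]

def combine_sources(older: pd.DataFrame, recent: pd.DataFrame) -> pd.DataFrame:
    """Stack two candle tables, preferring `recent` where timestamps overlap."""
    combined = pd.concat([older[STANDARD_COLUMNS], recent[STANDARD_COLUMNS]], axis=0)
    # `recent` comes last in the concat, so keep="last" retains its rows on overlap
    combined = combined.drop_duplicates(subset="Datetime", keep="last")
    return combined.sort_values("Datetime").reset_index(drop=True)

older = pd.DataFrame({
    "Datetime": pd.to_datetime(["2024-01-02 10:00", "2024-01-02 11:00", "2024-01-02 12:00"]),
    "Open": [1.0, 2.0, 3.0], "High": [1.5, 2.5, 3.5], "Low": [0.5, 1.5, 2.5],
    "Adj Close": [1.2, 2.2, 3.2], "Volume": [100, 200, 300],
    "vwap": [1.1, 2.1, 3.1],  # extra column that only one source provides
})
recent = pd.DataFrame({
    "Datetime": pd.to_datetime(["2024-01-02 12:00", "2024-01-02 13:00"]),
    "Open": [3.1, 4.0], "High": [3.6, 4.5], "Low": [2.6, 3.5],
    "Adj Close": [3.3, 4.2], "Volume": [310, 400],
})

combined = combine_sources(older, recent)
print(combined)
```

Selecting the shared columns before concatenating avoids the NaN-filled columns that appear when one source carries fields the other lacks.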

Each row represents a candle, and the timeframe for these candles can be selected from various durations. During the study and optimization phase, the following timeframes proved particularly interesting: 5 minutes, 10 minutes, 1 hour, and 1 day. Naturally, one would expect better performance with longer timeframes, as shorter 5- to 10-minute candles tend to contain more background noise, making it more challenging to discern correlations between features and price.

Nevertheless, 5-minute candles offer several advantages. More trades can be executed within a single day, and each trading day yields far more data points, so a dataset of a given size covers a much more recent period. For instance, 8,000 to 10,000 daily candles span roughly 31 to 39 years once market closures and holidays are accounted for, whereas the same number of 5-minute candles covers only about 6 months. The intraday dataset is therefore far more recent and reflects more homogeneous market conditions.
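
The figures above can be checked with a short calculation. It assumes roughly 252 trading days per year and a 6.5-hour regular session (both typical for US equities, but assumptions nonetheless); the function names are illustrative.

```python
TRADING_DAYS_PER_YEAR = 252      # assumption: typical US equity calendar
SESSION_MINUTES = 6.5 * 60       # assumption: 09:30-16:00 regular session

def years_of_daily_candles(n_candles: int) -> float:
    """Years of history needed to collect n daily candles."""
    return n_candles / TRADING_DAYS_PER_YEAR

def months_of_intraday_candles(n_candles: int, candle_minutes: int) -> float:
    """Months of history needed to collect n intraday candles."""
    candles_per_day = SESSION_MINUTES / candle_minutes
    trading_days = n_candles / candles_per_day
    return trading_days / (TRADING_DAYS_PER_YEAR / 12)

print(round(years_of_daily_candles(8_000), 1))          # → 31.7
print(round(years_of_daily_candles(10_000), 1))         # → 39.7
print(round(months_of_intraday_candles(10_000, 5), 1))  # → 6.1
```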

However, the hourly candle may be a worthwhile compromise: its data is still easy to obtain, and it carries less background noise than the 5-minute candle.


In [None]:
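def download_data_history(candle_time: str = "15m", path: str = r"assets/data_history.csv", download_option: str = "yf", ticker: str = "^GSPC", days_delta: int = 100, currency: str = "USD") -> None:
    # candle_time is forwarded to the selected API as a string, so its format depends on
    # download_option: e.g. "15m" or "1d" for "yf" and "mt5", and a plain number of
    # hours such as "1" for "mixed" (which builds "1h" and "1Hour" from it).
    # Create the output folder if it does not exist yet
    os.makedirs("assets", exist_ok=True)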
    
    # WARNING: API keys are exposed here only because they belong to a demo account used exclusively for data download
    # This is NOT recommended practice. In production, always store sensitive credentials in .env files
    API_KEY = 'PKJY9KJHWG0459RC02Y4'
    API_SECRET = '6UXCVPutT2hpqQcUnQjFc74RSxcqrezcD4AXm0aL'
    BASE_URL = 'https://paper-api.alpaca.markets'  
    api = tradeapi.REST(API_KEY, API_SECRET, BASE_URL, api_version='v2')
    print(download_option)
    if download_option == "yf":
        end_date = (datetime.now() + timedelta(days=1))
        start_date = end_date - timedelta(days=days_delta)
        tkr = yf.Ticker(ticker)
        df = tkr.history(start=start_date.strftime('%Y-%m-%d'),
                   end=end_date.strftime('%Y-%m-%d'),
                   interval=candle_time)
        df.reset_index(inplace=True)

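        # Replace a zero volume on the first candle with the mean of the next few candles.
        # The length check comes first so an empty download does not raise an IndexError,
        # and .loc avoids chained assignment, which may silently modify a copy.
        if len(df) > 1 and df['Volume'].iloc[0] == 0:
            df.loc[df.index[0], 'Volume'] = int(round(df['Volume'].iloc[1:6].mean()))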
        if "d" in candle_time:
            if df['Date'].dt.tz is not None:
                df['Date'] = df['Date'].dt.tz_convert(None)
        df.rename(columns={'Date': 'Datetime', 'Close':'Adj Close'}, inplace=True)
        # Drop corporate-action columns; "Capital Gains" only appears for some instruments
        df.drop(columns=["Dividends", "Stock Splits", "Capital Gains"], errors="ignore", inplace=True)
        df['Datetime'] = pd.to_datetime(df['Datetime'])
        df = df[~((df['Datetime'].dt.date == pd.to_datetime('2024-06-30').date()) & (df['Volume'] == 0))]
        df = df[~((df['Datetime'].dt.date == pd.to_datetime('2024-07-01').date()) & (df['Volume'] == 0))]
        df.to_csv(path, index=False)
    
    elif download_option == "alpaca":
        symbol = ticker  # Use provided ticker
        timeframe = f"{candle_time}"  # Timeframe string in the format expected by Alpaca
        end_date = datetime.utcnow()
        start_date = end_date - timedelta(days=days_delta)
        all_data = []

        while start_date < end_date:
            chunk_end = start_date + timedelta(days=30)  # Adjust chunk size if needed
            if chunk_end > end_date:
                chunk_end = end_date

            start_str = start_date.strftime('%Y-%m-%dT%H:%M:%SZ')
            end_str = chunk_end.strftime('%Y-%m-%dT%H:%M:%SZ')

            try:
                data = api.get_bars(symbol, timeframe, start=start_str, end=end_str, feed="iex").df
                if not data.empty:
                    all_data.append(data)
                    print(len(data))
            except Exception as e:
                print(f"Error fetching data from {start_str} to {end_str}: {e}")

            start_date = chunk_end

        if all_data:
            print(len(all_data))
            historical_data = pd.concat(all_data)
            historical_data['timestamp'] = historical_data.index.tz_localize(None)
            historical_data.rename(columns={"timestamp": "Datetime", 'close': 'Adj Close', "high": "High", "low": "Low", "volume": "Volume", "open": "Open"}, inplace=True)
            # Keep only the standard columns so the output matches the other data sources
            historical_data = historical_data[["Datetime", "Open", "High", "Low", "Adj Close", "Volume"]].dropna()
            historical_data.to_csv(path, index=False)
            print("Data saved successfully!")
        else:
            print("No data retrieved.")


    elif download_option == "mixed":
        # This option downloads older candles from Alpaca and the most recent ones from Yahoo Finance.
        # The two APIs return similar DataFrames; only small adjustments are needed.
        # Note: in this branch candle_time is interpreted as a number of hours (e.g. 1).
        standard_columns = ["Datetime", "Open", "High", "Low", "Adj Close", "Volume"]
        end_date_yf = (datetime.now() + pd.Timedelta(days=1)).strftime('%Y-%m-%d')
        start_date_yf = (datetime.now() - pd.Timedelta(days=58)).strftime('%Y-%m-%d')
        end_date_alpaca = (datetime.now() - pd.Timedelta(days=60)).strftime('%Y-%m-%d')
        start_date_alpaca = (datetime.now() - pd.Timedelta(days=days_delta)).strftime('%Y-%m-%d')
        tkr = yf.Ticker(ticker)
        df_yf = tkr.history(start=start_date_yf, end=end_date_yf, interval=f"{candle_time}h")
        df_yf.reset_index(inplace=True)
        # Keep the close under the shared 'Adj Close' name: dropping it would leave the
        # Yahoo rows without a close price, and the final dropna() would discard them all
        df_yf.rename(columns={"Close": "Adj Close"}, inplace=True)
        df_yf['Datetime'] = df_yf['Datetime'].dt.tz_localize(None)
        df_yf = df_yf[standard_columns]
        df_alpaca = api.get_bars(symbol=ticker, timeframe=f"{candle_time}Hour", start=start_date_alpaca, end=end_date_alpaca).df
        df_alpaca.reset_index(inplace=True)
        df_alpaca.rename(columns={"timestamp":"Datetime",'close': 'Adj Close',"high":"High", "low":"Low","volume":"Volume","open":"Open"}, inplace=True)
        df_alpaca['Datetime'] = df_alpaca['Datetime'].dt.tz_localize(None)
        # Selecting the standard columns also discards trade_count and vwap
        df_alpaca = df_alpaca[standard_columns]

        df_combined = pd.concat([df_alpaca, df_yf], axis=0)

        df_combined.reset_index(drop=True, inplace=True)
        df_combined.dropna(inplace=True)
        df_combined.to_csv(path, index=False)
        


    elif download_option == "crypto":
        symbol = ticker  # Use the provided ticker, e.g. "BTCUSD"
        timeframe = f"{candle_time}"  # For example, "1H" for hourly candles
        end_date = datetime.utcnow()
        start_date = end_date - timedelta(days=days_delta)
        all_data = []

        while start_date < end_date:
            chunk_end = start_date + timedelta(days=200)  # You can adjust chunk size if needed
            if chunk_end > end_date:
                chunk_end = end_date

            start_str = start_date.strftime('%Y-%m-%dT%H:%M:%SZ')
            end_str = chunk_end.strftime('%Y-%m-%dT%H:%M:%SZ')

            try:
                # Fetch crypto data (ensure to set the correct exchange parameter)
                data = api.get_crypto_bars(symbol, timeframe, start=start_str, end=end_str).df
                if not data.empty:
                    all_data.append(data)
                    print(f"Fetched {len(data)} rows from {start_str} to {end_str}")
            except Exception as e:
                print(f"Error fetching data from {start_str} to {end_str}: {e}")

            start_date = chunk_end

        if all_data:
            print(f"Number of chunks retrieved: {len(all_data)}")
            historical_data = pd.concat(all_data)
            historical_data['timestamp'] = historical_data.index.tz_localize(None)
            historical_data.rename(columns={
                "timestamp": "Datetime",
                "close": "Adj Close",
                "high": "High",
                "low": "Low",
                "volume": "Volume",
                "open": "Open"
            }, inplace=True)
            historical_data.drop(["trade_count","vwap","symbol"], axis=1, inplace=True)
            historical_data.dropna(inplace=True)
            historical_data.to_csv(path, index=False)
            print("Data saved successfully!")
        else:
            print("No data retrieved.")


    elif download_option == "mt5":
        try:
            import MetaTrader5 as mt5
            
            # Initialize MT5 connection
            if not mt5.initialize():
                print(f"MT5 initialization failed. Error code: {mt5.last_error()}")
                return
            
            # Get the symbol from user input
            symbol = ticker
            
            # Get timeframe from user input
            timeframe_options = {
                "1m": mt5.TIMEFRAME_M1,
                "5m": mt5.TIMEFRAME_M5,
                "10m": mt5.TIMEFRAME_M10,
                "15m": mt5.TIMEFRAME_M15,
                "30m": mt5.TIMEFRAME_M30,
                "1h": mt5.TIMEFRAME_H1,
                "4h": mt5.TIMEFRAME_H4,
                "1d": mt5.TIMEFRAME_D1,
                "1w": mt5.TIMEFRAME_W1,
                "1mn": mt5.TIMEFRAME_MN1
            }
            
            
            timeframe_input = candle_time
            if timeframe_input not in timeframe_options:
                print(f"Invalid timeframe. Using default 1d.")
                timeframe = mt5.TIMEFRAME_D1
            else:
                timeframe = timeframe_options[timeframe_input]
            
            # Get date range using days_delta
            try:
                days_delta = int(days_delta)
            except ValueError:
                print("Invalid number of days. Using default 365 days.")
                days_delta = 365
            
            to_date = datetime.now()
            from_date = to_date - timedelta(days=days_delta)
            
            # Request historical data
            rates = mt5.copy_rates_range(symbol, timeframe, from_date, to_date)
            
            if rates is None or len(rates) == 0:
                print(f"No data retrieved for {symbol}. Error: {mt5.last_error()}")
                mt5.shutdown()
                return
            
            # Convert to DataFrame
            df = pd.DataFrame(rates)
            
            # Convert time in seconds into datetime format
            df['Datetime'] = pd.to_datetime(df['time'], unit='s')
            
            # Drop the original time column
            df = df.drop(columns=['time', 'spread', 'real_volume'])
            
            # Rename columns to match our standard format
            df.rename(columns={
                'open': 'Open',
                'high': 'High',
                'low': 'Low',
                'close': 'Adj Close',
                'tick_volume': 'Volume'
            }, inplace=True)
            
            # Reorder columns
            df = df[['Datetime', 'Open', 'High', 'Low', 'Adj Close', 'Volume']]
            # Save to CSV
            df.to_csv(path, index=False)
            
            print(f"Downloaded {len(df)} records for {symbol} from {from_date.strftime('%Y-%m-%d')} to {to_date.strftime('%Y-%m-%d')}")
            # Shutdown MT5 connection
            mt5.shutdown()
            
        except ImportError:
            print("MetaTrader5 module not found. Please install it using: pip install MetaTrader5")
        except Exception as e:
            print(f"Error downloading data from MT5: {str(e)}")
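
The Alpaca and crypto branches walk the requested period in fixed-size chunks. That loop can be isolated as a small generator, sketched below. The name `iter_date_chunks` is hypothetical and the sketch makes no API calls; it only shows how the period is split so that every chunk ends where the next begins and the last one is clipped to the end date.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple

def iter_date_chunks(start: datetime, end: datetime, chunk_days: int) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (chunk_start, chunk_end) pairs covering [start, end)."""
    cursor = start
    step = timedelta(days=chunk_days)
    while cursor < end:
        chunk_end = min(cursor + step, end)  # never run past the requested end
        yield cursor, chunk_end
        cursor = chunk_end

chunks = list(iter_date_chunks(datetime(2024, 1, 1), datetime(2024, 3, 11), chunk_days=30))
for chunk_start, chunk_end in chunks:
    print(chunk_start.date(), "->", chunk_end.date())
```

Each pair can then be formatted with `strftime('%Y-%m-%dT%H:%M:%SZ')` and passed to the bar request, as the function above does inline.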


## 1.2- Feature Calculation Functions

In this section, various features will be added for each candlestick, consisting of multiple financial technical indicators. A broader set of indicators is initially included, extending beyond those deemed immediately useful. The goal is to later refine the selection through feature reduction techniques, ensuring that only the most relevant indicators contribute to the model's performance.

In [None]:
def pct_values(df):
    df["pct_close"] = (1-df["Adj Close"]/df["Adj Close"].shift(1))*100
    df["pct_open"] = (1-df["Open"]/df["Open"].shift(1))*100
    df["ptc_high"] = (1-df["High"]/df["Open"])*100
    df["ptc_low"] = (1-df["Low"]/df["Open"])*100
    df.drop(["Adj Close",'Open',"High","Low"], axis=1, inplace=True)
    df.rename(columns={"pct_close": "Adj Close", 'pct_open': 'Open', "ptc_high": "High", "ptc_low": "Low"}, inplace=True)
    return df

In [None]:
def calculate_rolling_std(price_column):
    rolling_std = price_column.rolling(window=14).std()
    return rolling_std

In [None]:
def calculate_rolling_spread_average(high_column, low_column, window=14):
    spread = high_column - low_column    
    spread_average = spread.rolling(window=window).mean()    
    return spread_average


In [None]:
def calculate_rolling_volume_average(volume_column, window=14):    
    volume_average = volume_column.rolling(window=window).mean()    
    return volume_average

In [None]:
def calculate_rolling_range(high_column, low_column):
    # Compute rolling highest high and lowest low
    max_high = high_column.rolling(window=14).max()
    min_low = low_column.rolling(window=14).min()
    # Compute range
    rolling_range = max_high - min_low
    return rolling_range


In [None]:
def calculate_price_rate_of_change(price_column):
    # shift(1) moves every value one row down, so each row is aligned with the price of the previous period
    previous_price = price_column.shift(1)
    # Calculate the Rate of Change
    roc = (price_column - previous_price) / previous_price
    return roc

In [None]:
def calculate_log_returns(price_column):
    previous_prices = price_column.shift(1)  # Get the previous period's prices
    log_returns = np.log(price_column / previous_prices)
    return log_returns
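
The rate of change and the log return carry nearly the same information: the log return equals `log(1 + rate_of_change)`, and log returns have the convenient property of adding up across periods. The toy prices below are made up purely to illustrate the relationship.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 102.0, 99.96, 104.958])
previous = prices.shift(1)

roc = (prices - previous) / previous          # simple return per period
log_ret = np.log(prices / previous)           # log return per period

# Identity: log return = log(1 + simple return)
identity_holds = np.allclose(log_ret.dropna(), np.log1p(roc.dropna()))

# Additivity: summing log returns gives the log return over the whole span
total_log_return = log_ret.sum()              # sum() skips the leading NaN
whole_span = np.log(prices.iloc[-1] / prices.iloc[0])

print(identity_holds, np.isclose(total_log_return, whole_span))
```

Since the two features are so closely related for small moves, one of them is a natural candidate for removal during feature reduction.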

In [None]:
def calculate_volume_delta(volume_column):
    """
    Calculate the Volume Delta between the current and previous volume.

    Args:
        volume_column (pandas Series): Trading volumes in chronological order (oldest to most recent).
    Returns:
        pandas Series: Current volume minus previous volume
    """
    previous_volume = volume_column.shift(1)
    volume_delta = volume_column - previous_volume
    return volume_delta

In [None]:
def calculate_price_to_volume_ratio(close_column, volume_column):
    """
    Calculate the Price to Volume Ratio.

    :param close_column: pandas Series of closing prices
    :param volume_column: pandas Series of trading volumes
    :return: pandas Series of Price to Volume Ratios
    """
    volume_column = volume_column.replace(0, np.nan)
    price_to_volume_ratio = close_column / volume_column
    return price_to_volume_ratio


In [None]:
def calculate_cumulative_return(price_column):
    """
    Calculate the cumulative return over a rolling window of 14 periods.

    Args:
        price_column (pandas Series): A pandas Series of prices in chronological order (oldest to most recent).
        
    Returns:
        pandas Series: The cumulative return for each row.
    """
    # Compounding the 13 returns inside a 14-price window telescopes to
    # last_price / first_price - 1, so no per-window Python callback is needed.
    # The first 13 rows are NaN because a full window is not yet available.
    window = 14
    rolling_returns = price_column / price_column.shift(window - 1) - 1
    return rolling_returns

In [None]:
def calculate_ema(adj_close_column, span=14):
    """
    Calculate the Exponential Moving Average (EMA) for a series.

    Args:
        adj_close_column (pandas Series): Adjusted closing prices in chronological order (oldest to most recent).
        span (int): EMA period.

    Returns:
        pandas Series: EMA values.
    """
    ema_series = adj_close_column.ewm(span=span, adjust=False).mean()
    ema_series.iloc[:span] = None  # Set the first `span` values to NaN due to insufficient data
    return ema_series
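
With `adjust=False`, pandas computes the EMA through the recursion `ema[t] = alpha * price[t] + (1 - alpha) * ema[t-1]`, where `alpha = 2 / (span + 1)` and the first value seeds the series. The sketch below checks this against a hand-written loop; the helper `ema_recursive` and the sample prices are illustrative only.

```python
import numpy as np
import pandas as pd

def ema_recursive(values: np.ndarray, span: int) -> np.ndarray:
    """Reference EMA: seeded with the first value, then updated recursively."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for x in values[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return np.array(out)

prices = pd.Series([10.0, 11.0, 12.0, 11.5, 13.0, 12.5])
ema_pandas = prices.ewm(span=3, adjust=False).mean().to_numpy()
ema_manual = ema_recursive(prices.to_numpy(), span=3)

print(np.allclose(ema_pandas, ema_manual))
```

Because the early values depend heavily on the seed, masking the first `span` rows, as the function above does, is a reasonable way to keep warm-up artefacts out of the dataset.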

In [None]:
def calculate_rsi(adj_close_column, period=14):
    """
    Calculate the Relative Strength Index (RSI) for a pandas Series of adjusted closing prices.
    
    Args:
        adj_close_column (Series): Adjusted closing prices in chronological order (oldest to most recent).
        period (int): Period over which to calculate RSI. Default is 14.

    Returns:
        Series: RSI values between 0 and 100.
    """
    # Calculate price changes (deltas) between consecutive periods
    deltas = adj_close_column.diff()

    # Separate gains and losses
    gains = deltas.where(deltas > 0, 0)
    losses = -deltas.where(deltas < 0, 0)

    # Calculate average gain and loss
    avg_gain = gains.rolling(window=period).mean()
    avg_loss = losses.rolling(window=period).mean()

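    # Smooth the averages with an additional EMA (note: this second smoothing pass
    # departs from the textbook RSI, which applies a single moving average)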
    avg_gain = avg_gain.ewm(span=period, adjust=False).mean()
    avg_loss = avg_loss.ewm(span=period, adjust=False).mean()

    # Avoid division by zero
    avg_loss = avg_loss.replace(0, 1e-10)

    # Calculate RSI
    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))

    return rsi
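
For comparison, the sketch below shows one common textbook variant of the RSI, which smooths gains and losses once with Wilder's average (an exponential average with `alpha = 1 / period`). It is offered as a reference point under that assumption, not as the notebook's method; the name `rsi_wilder` is hypothetical.

```python
import pandas as pd

def rsi_wilder(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI with a single Wilder smoothing pass (alpha = 1 / period)."""
    delta = close.diff()
    gain = delta.clip(lower=0)        # positive moves, zero otherwise
    loss = (-delta).clip(lower=0)     # magnitude of negative moves, zero otherwise
    avg_gain = gain.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss          # becomes inf when there are no losses
    return 100 - 100 / (1 + rs)

rising = pd.Series(range(1, 41), dtype=float)
falling = rising[::-1].reset_index(drop=True)

rsi_up = rsi_wilder(rising)
rsi_down = rsi_wilder(falling)
print(rsi_up.iloc[-1], rsi_down.iloc[-1])  # → 100.0 0.0
```

A series that only rises saturates at 100 and one that only falls sits at 0, which gives a quick sanity check for any RSI implementation.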

In [None]:
def calculate_macd(df):
    """
    Calculate the Moving Average Convergence Divergence (MACD) indicator.

    Steps:
    1. Calculate 12-period EMA of adjusted closing prices
    2. Calculate 26-period EMA of adjusted closing prices  
    3. MACD Line = 12-period EMA - 26-period EMA
    4. Signal Line = 9-period EMA of the MACD Line
    5. MACD Histogram = MACD Line - Signal Line

    Args:
        df (DataFrame): DataFrame containing the 'Adj Close' prices in chronological order (oldest to most recent).

    Returns:
        DataFrame: The same DataFrame with added MACD-related columns.
    """
    df['EMA_12'] = df['Adj Close'].ewm(span=12, adjust=False).mean()
    df['EMA_26'] = df['Adj Close'].ewm(span=26, adjust=False).mean()
    df['MACD'] = df['EMA_12'] - df['EMA_26']
    df['Signal_Line'] = df['MACD'].ewm(span=9, adjust=False).mean()
    df['MACD_Histogram'] = df['MACD'] - df['Signal_Line']
    
    return df

In [None]:
def calculate_mean(close_column):
    return close_column.rolling(window=14).mean()

In [None]:
def calculate_bollinger_upper_bands(price_column, num_std_dev=2):
    """
    Calculate the upper Bollinger Band.

    Parameters:
    price_column (pandas Series): Prices in chronological order (oldest to most recent).
    num_std_dev (float): Number of standard deviations to add to the mean (default is 2).

    Returns:
    pandas Series: Upper Bollinger Band.
    """
    mean = price_column.rolling(window=14).mean()
    std_dev = price_column.rolling(window=14).std()
    upper_band = mean + (num_std_dev * std_dev)
    return upper_band

In [None]:
def calculate_bollinger_lower_bands(price_column, num_std_dev=2):
    """
    Calculate the lower Bollinger Band.

    Parameters:
    price_column (pandas Series): Prices in chronological order (oldest to most recent).
    num_std_dev (float): Number of standard deviations to subtract from the mean (default is 2).

    Returns:
    pandas Series: Lower Bollinger Band.
    """
    mean = price_column.rolling(window=14).mean()
    std_dev = price_column.rolling(window=14).std()
    lower_band = mean - (num_std_dev * std_dev)
    return lower_band
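
As a usage sketch on synthetic data, the bands are built from the price column itself: a rolling mean forms the middle line, and a multiple of the rolling standard deviation is added or subtracted. The random-walk series and the variable names below are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 60).cumsum())  # synthetic random walk

window, num_std_dev = 14, 2
middle = close.rolling(window=window).mean()
std_dev = close.rolling(window=window).std()
upper = middle + num_std_dev * std_dev
lower = middle - num_std_dev * std_dev

# The first window - 1 rows have no band because the window is incomplete
valid = upper.notna()
inside = ((close <= upper) & (close >= lower))[valid].mean()
print(f"{valid.sum()} candles with bands, {inside:.0%} of closes inside them")
```

Note that the second argument of the two functions above is a scalar multiplier, so they should receive the price series plus, optionally, a number such as 2.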

In [None]:
def calculate_atr(high_column, low_column, close_column, period=14):
    """
    Calculate the Average True Range (ATR) using Wilder's smoothing method.

    Parameters:
        high_column (pd.Series): High prices in chronological order.
        low_column (pd.Series): Low prices in chronological order.
        close_column (pd.Series): Close prices in chronological order.
        period (int): Period for ATR calculation (default is 14).

    Returns:
        pd.Series: ATR values.
    """
    # True Range components
    tr1 = high_column - low_column
    tr2 = (high_column - close_column.shift(1)).abs()
    tr3 = (low_column - close_column.shift(1)).abs()
    
    # Combine into True Range
    true_range = pd.concat([tr1, tr2, tr3], axis=1).max(axis=1)

    # Initialize ATR Series with NaNs
    atr = pd.Series(index=true_range.index, dtype='float64')
    
    # First ATR is simple average
    atr.iloc[period - 1] = true_range.iloc[:period].mean()
    
    # Wilder's smoothing for the rest
    for i in range(period, len(true_range)):
        atr.iloc[i] = (atr.iloc[i - 1] * (period - 1) + true_range.iloc[i]) / period

    return atr
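
A small worked example makes the True Range and Wilder's update concrete. The three candles below are invented; the second one gaps up, so its True Range is driven by the distance to the previous close rather than by its own high-low spread. A period of 2 is used only to keep the numbers easy to follow.

```python
import pandas as pd

high = pd.Series([10.0, 12.5, 12.0])
low = pd.Series([9.0, 11.8, 10.5])
close = pd.Series([9.5, 12.0, 11.0])

prev_close = close.shift(1)
true_range = pd.concat(
    [high - low, (high - prev_close).abs(), (low - prev_close).abs()], axis=1
).max(axis=1)  # the first row falls back to high - low (no previous close)

period = 2
first_atr = true_range.iloc[:period].mean()                            # simple average seed
next_atr = (first_atr * (period - 1) + true_range.iloc[period]) / period  # Wilder update

print(true_range.tolist())   # → [1.0, 3.0, 1.5]
print(first_atr, next_atr)   # → 2.0 1.75
```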

In [None]:
def calculate_plus_di(high, low, close, window=14):
    """
    Calculate +DI (Positive Directional Indicator).

    Parameters:
        high (pd.Series): High prices in chronological order.
        low (pd.Series): Low prices in chronological order.
        close (pd.Series): Close prices in chronological order.
        window (int): Rolling window size (default: 14).

    Returns:
        pd.Series: +DI values.
    """
    up_move = high.diff()      # current high minus previous high
    down_move = -low.diff()    # previous low minus current low (positive when the low falls)

    plus_dm = up_move.where((up_move > down_move) & (up_move > 0), 0)

    tr = pd.concat([
        (high - low),
        (high - close.shift(1)).abs(),
        (low - close.shift(1)).abs()
    ], axis=1).max(axis=1)

    smoothed_plus_dm = plus_dm.rolling(window=window).mean()
    atr = tr.rolling(window=window).mean()

    plus_di = 100 * (smoothed_plus_dm / atr)
    return plus_di

def calculate_minus_di(high, low, close, window=14):
    """
    Calculate -DI (Negative Directional Indicator).

    Parameters:
        high (pd.Series): High prices in chronological order.
        low (pd.Series): Low prices in chronological order.
        close (pd.Series): Close prices in chronological order.
        window (int): Rolling window size (default: 14).

    Returns:
        pd.Series: -DI values.
    """
    up_move = high.diff()      # current high minus previous high
    down_move = -low.diff()    # previous low minus current low (positive when the low falls)

    minus_dm = down_move.where((down_move > up_move) & (down_move > 0), 0)

    tr = pd.concat([
        (high - low),
        (high - close.shift(1)).abs(),
        (low - close.shift(1)).abs()
    ], axis=1).max(axis=1)

    smoothed_minus_dm = minus_dm.rolling(window=window).mean()
    atr = tr.rolling(window=window).mean()

    minus_di = 100 * (smoothed_minus_dm / atr)
    return minus_di

In [None]:
def calculate_adx(high, low, close, window=14):
    """
    Calculate the Average Directional Index (ADX) using high, low, and close prices.

    Parameters:
        high (pd.Series): High prices in chronological order.
        low (pd.Series): Low prices in chronological order.
        close (pd.Series): Close prices in chronological order.
        window (int): Rolling window size (default: 14).

    Returns:
        pd.Series: ADX values.
    """
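    up_move = high.diff()      # current high minus previous high
    down_move = -low.diff()    # previous low minus current low (positive when the low falls)

    # Only the larger of the two moves counts, and only if it is positive
    plus_dm = up_move.where((up_move > down_move) & (up_move > 0), 0)
    minus_dm = down_move.where((down_move > up_move) & (down_move > 0), 0)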

    tr = pd.concat([
        high - low,
        (high - close.shift(1)).abs(),
        (low - close.shift(1)).abs()
    ], axis=1).max(axis=1)

    atr = tr.rolling(window=window).mean()
    smoothed_plus_dm = plus_dm.rolling(window=window).mean()
    smoothed_minus_dm = minus_dm.rolling(window=window).mean()

    plus_di = 100 * (smoothed_plus_dm / atr)
    minus_di = 100 * (smoothed_minus_dm / atr)

    dx = (abs(plus_di - minus_di) / (plus_di + minus_di).replace(0, np.nan)) * 100
    adx = dx.rolling(window=window).mean()

    return adx
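
Under Wilder's textbook definition, the up move is the current high minus the previous high, the down move is the previous low minus the current low, and only the larger of the two is kept when it is positive. The toy candles below are invented to show each case.

```python
import numpy as np
import pandas as pd

high = pd.Series([10.0, 11.0, 10.8, 10.9])
low = pd.Series([9.0, 9.5, 9.0, 9.2])

up_move = high.diff()        # H[t] - H[t-1]
down_move = -low.diff()      # L[t-1] - L[t], positive when the low falls

plus_dm = up_move.where((up_move > down_move) & (up_move > 0), 0.0)
minus_dm = down_move.where((down_move > up_move) & (down_move > 0), 0.0)

print(plus_dm.round(6).tolist())   # → [0.0, 1.0, 0.0, 0.1]
print(minus_dm.round(6).tolist())  # → [0.0, 0.0, 0.5, 0.0]
```

In the third candle the low drops by 0.5 while the high also slips, so the move is counted as negative directional movement; in the other candles the rising high dominates.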

In [None]:

def calculate_time_of_day_sin(datetime_col):
    """
    Calculate sine component of time of day encoding.
    
    Parameters:
    datetime_col (pandas.Series): Datetime values
    
    Returns:
    pandas.Series: Sine of the time-of-day angle, in [-1, 1]; both 0:00 and 24:00 map to the same point
    """
    minutes_in_day = datetime_col.dt.hour * 60 + datetime_col.dt.minute
    return np.sin(2 * np.pi * minutes_in_day / 1440)

def calculate_time_of_day_cos(datetime_col):
    """
    Calculate the cosine component of cyclical time of day encoding.
    
    Parameters:
    datetime_col (pandas.Series): Column containing datetime values
    
    Returns:
    pandas.Series: Cosine of the time-of-day angle, in [-1, 1],
                  where both 0:00 and 24:00 map to the same point
    """
    minutes_in_day = datetime_col.dt.hour * 60 + datetime_col.dt.minute
    return np.cos(2 * np.pi * minutes_in_day / 1440)


def calculate_day_of_week_sin(datetime_col):
    """
    Calculate the sine component of cyclical day of week encoding.
    
    Parameters:
    datetime_col (pandas.Series): Column containing datetime values
    
    Returns:
    pandas.Series: Sine of the day-of-week angle, in [-1, 1]; pandas numbers days from
                  Monday (0) to Sunday (6), and the 7-day period keeps Sunday adjacent to Monday
    """
    return np.sin(2 * np.pi * datetime_col.dt.dayofweek / 7)


def calculate_day_of_week_cos(datetime_col):
    """
    Calculate the cosine component of cyclical day of week encoding.
    
    Parameters:
    datetime_col (pandas.Series): Column containing datetime values
    
    Returns:
    pandas.Series: Cosine of the day-of-week angle, in [-1, 1]; pandas numbers days from
                  Monday (0) to Sunday (6), and the 7-day period keeps Sunday adjacent to Monday
    """
    return np.cos(2 * np.pi * datetime_col.dt.dayofweek / 7)
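
The point of encoding time as a (sine, cosine) pair is that moments close on the clock stay close in feature space, even across midnight. The sketch below illustrates this with three made-up timestamps; the helper `distance` is hypothetical.

```python
import numpy as np
import pandas as pd

times = pd.Series(pd.to_datetime([
    "2024-01-02 23:55",   # five minutes before midnight
    "2024-01-03 00:00",   # midnight
    "2024-01-03 12:00",   # noon
]))
minutes = times.dt.hour * 60 + times.dt.minute
angle = 2 * np.pi * minutes / 1440
encoded = np.column_stack([np.sin(angle), np.cos(angle)])

def distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two encoded time points."""
    return float(np.linalg.norm(a - b))

print(distance(encoded[0], encoded[1]))  # small: 23:55 and 00:00 are neighbours
print(distance(encoded[1], encoded[2]))  # largest possible: midnight vs noon
```

With a raw "minutes since midnight" feature, 23:55 and 00:00 would sit 1,435 units apart, which is exactly the discontinuity the cyclical encoding removes.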

## 1.3- Feature Calculation Functions Wrapper  

In this section, I implemented a wrapper function to integrate all the features. The process begins with the base features for each instance:

- **Open**
- **High**
- **Low**
- **Adj Close**
- **Volume**

Subsequently, I calculate additional features derived from these base values:

- **Price Rate of Change**
- **Log Returns**
- **Volume Delta**
- **Price-Volume Ratio**

After computing these, I replicate all of the features by incorporating data from the previous *n* instances. For each feature (e.g., Open), the values from the preceding instances are appended as separate features (e.g., Open-1, Open-2, Open-3, ..., Open-n), and this replication is applied to all the calculated features.

These features, referred to as "repeated features", form the first group: each one is replicated over the previous *n* instances as described above. A second group, the "single-time features", is computed once per instance from a rolling window (for example the rolling standard deviation, the EMA, or the RSI) and is not replicated across the preceding instances.
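
The replication step can be pictured on a toy table. The sketch below lags two columns over the previous two candles, using `shift` and the `{column}_n_{lag}` naming pattern; the tiny DataFrame and the choice of two lags are assumptions made to keep the output readable.

```python
import pandas as pd

df = pd.DataFrame({"Open": [1.0, 2.0, 3.0, 4.0], "Volume": [10, 20, 30, 40]})
n_precedent_candles = 2

# Build all lagged columns first, then attach them in a single concat
lagged = {
    f"{col}_n_{n}": df[col].shift(n)
    for n in range(1, n_precedent_candles + 1)
    for col in ["Open", "Volume"]
}
df_lagged = pd.concat([df, pd.DataFrame(lagged)], axis=1)

print(df_lagged)
```

The first `n_precedent_candles` rows contain NaN in the lagged columns because no earlier candles exist for them, so they need to be dropped before training.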

In [None]:
def extract_advanced_features(df_path, n_precedent_candles=14):

    df = pd.read_csv(df_path, on_bad_lines="warn")
    for column in df.columns:
        if column == "Datetime":
            continue
        df[column] = pd.to_numeric(df[column], errors='coerce')

    df.dropna(inplace=True)

    #repeated features
    df['price_rate_of_change'] = calculate_price_rate_of_change(df['Adj Close'])
    df['log_returns'] = calculate_log_returns(df['Adj Close'])
    df["volume_delta"] = calculate_volume_delta(df['Volume'])
    df["price_volume_ratio"] = calculate_price_to_volume_ratio(df['Adj Close'], df['Volume'])
    # Drop the raw "Close" column if present (only "Adj Close" is used)
    df.drop(columns="Close", errors="ignore", inplace=True)

    columns_names_list = [col for col in df.columns.values.tolist() if col != 'Datetime']
    new_columns = {}
    for n in range(1, n_precedent_candles + 1):
        for i in columns_names_list:
            new_columns[f"{i}_n_{n}"] = df[i].shift(+n)
    df = pd.concat([df, pd.DataFrame(new_columns)], axis=1)

    #Single time features
    #df = candles_analizer(df)
    df["rolling_std"] = calculate_rolling_std(df["Adj Close"])
    df["rolling_spread_average"] = calculate_rolling_spread_average(df["High"], df["Low"])
    df["rolling_volume_average"] = calculate_rolling_volume_average(df["Volume"]) 
    df["rolling_range"] = calculate_rolling_range(df["High"], df["Low"])
    df["cumulative_return"] = calculate_cumulative_return(df["Adj Close"])
    df["ema"] = calculate_ema(df["Adj Close"])
    df["rsi"] = calculate_rsi(df["Adj Close"])
    df["mean"] = calculate_mean(df["Adj Close"])
    df = calculate_macd(df)  
    df["upper_bollinger_bands"] = calculate_bollinger_upper_bands(df["mean"], df["rolling_std"])
    df["lower_bollinger_bands"] = calculate_bollinger_lower_bands(df["mean"], df["rolling_std"])
    df["average_true_range"] = calculate_atr(df["High"], df["Low"], df["Adj Close"])
    df['Datetime'] = pd.to_datetime(df['Datetime'])
    df['time_sin'] = calculate_time_of_day_sin(df['Datetime'])
    df['time_cos'] = calculate_time_of_day_cos(df['Datetime'])
    df['day_sin'] = calculate_day_of_week_sin(df['Datetime'])
    df['day_cos'] = calculate_day_of_week_cos(df['Datetime'])
    df["plus_di"] = calculate_plus_di(df["High"], df["Low"], df["Adj Close"])
    df["minus_di"] = calculate_minus_di(df["High"], df["Low"], df["Adj Close"])
    df["average_directional_index"] = calculate_adx(df["High"], df["Low"], df["Adj Close"])
    return df

## 1.4- Label Generation

This brief section contains only the labeling function, kept separate to keep the code structure clear. The label is defined as follows:

- **1**: Indicates that the adjusted closing price of the next candlestick will be greater than the current adjusted closing price (just concluded).
- **0**: Indicates that the adjusted closing price of the next candlestick will be lower than or equal to the current adjusted closing price (just concluded).

In [None]:
def add_classification_label(df):
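    """
    Add a binary classification label based on the next candle's price movement.

    Parameters:
    -----------
    df : pd.DataFrame
        Input dataframe; must contain an "Adj Close" column.

    Returns:
    --------
    pd.DataFrame
        The same dataframe with an added integer "label" column.
    """
    # 1 if the next adjusted close is strictly higher, else 0 (ties count as 0).
    # The last row has no next candle: the comparison with NaN is False, so it gets 0.
    df["label"] = (df["Adj Close"].shift(-1) > df["Adj Close"]).astype(int)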
    return df
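
A quick check of the labeling rule on a toy series (made-up prices) shows how ties and the final row are handled. This is only an illustrative sketch that repeats the same one-line expression.

```python
import pandas as pd

toy = pd.DataFrame({"Adj Close": [10.0, 11.0, 11.0, 9.0, 12.0]})
toy["label"] = (toy["Adj Close"].shift(-1) > toy["Adj Close"]).astype(int)
labels = toy["label"].tolist()
print(labels)  # [1, 0, 0, 1, 0]
```

The second row is a tie (11 followed by 11) and is labeled 0, and the last row is labeled 0 only because it has no successor, so it carries no real information about the next move.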


## 1.5- Data Preprocessing and Saving

In [None]:
input_file_path = 'assets/data_history.csv'
download_data_history(candle_time="1d", path=input_file_path, download_option="yf", ticker="^FTSE", days_delta=10000)

output_file_path_svm = 'assets/data_preprocessed_svm.csv'
preprocessed_df = extract_advanced_features(input_file_path, n_precedent_candles=13)
preprocessed_df = add_classification_label(preprocessed_df)
preprocessed_df.dropna(inplace=True)

# Datetime stays a regular column; the row index is not written, so it cannot
# come back as an extra numeric column when the file is reloaded
preprocessed_df.to_csv(output_file_path_svm, index=False)

To check that the feature functions were implemented and applied correctly, I plotted the distribution of every non-lagged feature. The plots made errors easier to spot, and during model testing they helped me detect and correct mistakes in how some feature transformations were applied.

The same plots are also useful later for choosing which type of feature scaling to apply.

In [None]:
cols = [column for column in preprocessed_df.columns if "_n_" not in column]
num_columns = len(cols)
num_rows = (num_columns + 1) // 2  
num_cols = 2  

plt.figure(figsize=(15, 5 * num_rows))

for i, column in enumerate(cols, 1):  
    plt.subplot(num_rows, num_cols, i)
    sns.histplot(preprocessed_df[column], kde=True)
    plt.title(f'Distribution of {column}')

plt.tight_layout()
plt.show()


# Part 2- ML model creation


In this section of the notebook, the objective is to identify an optimal machine learning model. This model will be structured as a pipeline comprising three key components:
 
1. **Preprocessing Step**: This stage involves data transformation and feature engineering to enhance model performance.
2. **Dimensionality Reduction (if necessary)**: If required, this step aims to reduce the number of input features while preserving essential information.
3. **Classification Model**: A classifier will be selected and fine-tuned to achieve optimal predictive performance.
 
The pipeline ensures a systematic approach to model development, improving efficiency and scalability while maintaining reproducibility.
 
This section of the notebook is structured into 2 key components:
 
- **2.1 Data loading, splitting and processing**: In this component, the preprocessed dataset is loaded, cleaned, and split into training and testing sets.
- **2.2 Grid search**: In this component, an exhaustive grid search is employed alongside cross-validation to evaluate various hyperparameter combinations for the different stages of the pipeline.

## 2.1 Data loading, splitting and processing

In [None]:
import itertools
from collections import defaultdict

import joblib
import numpy as np
import pandas as pd
from scipy.stats import loguniform

from sklearn.cluster import FeatureAgglomeration
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import NMF, PCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import (accuracy_score, classification_report, f1_score,
                             precision_recall_curve)
from sklearn.model_selection import (RandomizedSearchCV, RepeatedStratifiedKFold,
                                     cross_validate, learning_curve,
                                     train_test_split, validation_curve)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC, LinearSVC

from xgboost import XGBClassifier

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as IMBPipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS



In [None]:
df=pd.read_csv('assets/data_preprocessed_svm.csv')
df.drop(index=df.index[0], axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)


In [None]:
X = df.drop(columns=['label'])
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True) 


The dataset is already free of null values and all features are numerical, so the only transformation required is feature scaling, which brings features of different magnitudes onto a comparable range.

The only exception is the `Datetime` column, which is left out of the scaled features and retained for time-series visualization and more intuitive plotting of trends in later stages of the analysis.

To choose the scaling, I used the distribution plots above: normalization (min-max scaling) suits features with a non-Gaussian distribution, while standardization suits features that are approximately normally distributed.
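
As an illustration of that choice, the sketch below shows how a `ColumnTransformer` could apply a different scaler to each group of columns. The data is synthetic and the assignment of columns to groups is hypothetical; the notebook's own preprocessor applies `MinMaxScaler` to all numeric features.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "log_returns": rng.normal(0.0, 0.01, size=200),  # roughly bell-shaped
    "Volume": rng.exponential(1e6, size=200),         # strongly skewed
})

# Hypothetical assignment of columns to scalers, chosen by inspecting the histograms
gaussian_like = ["log_returns"]
skewed = ["Volume"]

mixed_preprocessor = ColumnTransformer(transformers=[
    ("standardize", StandardScaler(), gaussian_like),
    ("normalize", MinMaxScaler(feature_range=(0, 1)), skewed),
])
scaled = mixed_preprocessor.fit_transform(toy)

print(scaled[:, 0].mean(), scaled[:, 0].std())  # standardized column: mean ~0, std ~1
print(scaled[:, 1].min(), scaled[:, 1].max())   # min-max column: spans [0, 1]
```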

In [None]:

pipeline_complete = Pipeline([('scaler', MinMaxScaler(feature_range=(0,1)))])
num_features = [col for col in df.select_dtypes(include=['float64', 'int64']).columns if col != 'label']


preprocessor = ColumnTransformer(transformers=[
    ('norm', pipeline_complete, num_features),])


In [None]:
model_pipeline = IMBPipeline([
    ('preprocessor', preprocessor),
    ('dim_reduction', PCA(n_components=0.9)),  
    ('smote', SMOTE(sampling_strategy='auto')),  
    ('classifier', Perceptron())
])

### Scorer

Initially, the objective was to maximize the **F1 score**, defined as:

$$
F1 = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}
$$

However, optimizing for the F1 score caused the model to predict class 1 almost exclusively, which made the classifier unbalanced and ineffective. The likely reason is that the F1 score is computed from the precision and recall of the positive class only, so it does not reward correctly identified negatives (true negatives).

To mitigate this issue, a **custom scoring function** was implemented:

$$
\text{Custom Precision Score} =
\begin{cases}
0, & \text{if } P_0 < 0.5 \text{ or } P_1 < 0.5 \text{ or } P_0 = 0 \text{ or } P_1 = 0, \\
\left(P_0 \times P_1 + \max(P_0, P_1)\right) \times \left(1 - \lvert P_0 - P_1 \rvert\right), & \text{otherwise.}
\end{cases}
$$

While this custom metric led to some improvements, it still did not fully resolve the imbalance.
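
To get a feel for how the custom score behaves, the following sketch evaluates the formula above for a few made-up pairs of per-class precisions. The helper `custom_precision_from_values` is illustrative only and takes the two precisions directly instead of computing them from predictions.

```python
def custom_precision_from_values(p0, p1):
    """Evaluate the custom score from the two per-class precisions."""
    if p0 < 0.5 or p1 < 0.5:
        return 0.0
    return (p0 * p1 + max(p0, p1)) * (1 - abs(p0 - p1))


balanced = custom_precision_from_values(0.60, 0.70)  # (0.42 + 0.70) * 0.9
lopsided = custom_precision_from_values(0.55, 0.95)  # (0.5225 + 0.95) * 0.6
rejected = custom_precision_from_values(0.45, 0.90)  # P_0 below 0.5
print(balanced, lopsided, rejected)
```

With these values the balanced pair scores about 1.008 and the lopsided pair about 0.88, even though the lopsided pair has the higher average precision. The score is not bounded by 1, so it is meaningful only for ranking candidate models against each other.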

Recognizing the need for a more robust evaluation metric, the **Matthews Correlation Coefficient (MCC)** was explored. The MCC is given by:

$$
\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
$$

Because MCC accounts for all four elements of the confusion matrix (true positives, true negatives, false positives, and false negatives), it provides a more balanced and comprehensive assessment of the model's performance. Optimizing the model using MCC prevented the bias toward predicting class 1 and encouraged more balanced predictions across both classes.

**Conclusion:**  
Using the MCC as the primary evaluation metric leads to a more balanced classifier, effectively addressing the pitfalls of optimizing solely for the F1 score.
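
The difference between the two metrics can be reproduced with a small numeric sketch. The confusion-matrix counts below are made up, and the helper functions are illustrative; returning 0 when the MCC denominator vanishes is a common convention and is assumed here.

```python
import math


def f1_from_counts(tp, fp, fn):
    """F1 of the positive class; true negatives do not appear in the formula."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient; taken as 0 when the denominator is 0."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Made-up test set: 60 positives, 40 negatives
always_one = dict(tp=60, fp=40, tn=0, fn=0)  # predicts class 1 for every sample
mixed = dict(tp=36, fp=16, tn=24, fn=24)     # predicts both classes

f1_always = f1_from_counts(always_one["tp"], always_one["fp"], always_one["fn"])
f1_mixed = f1_from_counts(mixed["tp"], mixed["fp"], mixed["fn"])
mcc_always = mcc_from_counts(**always_one)
mcc_mixed = mcc_from_counts(**mixed)
print(f"always 1: F1={f1_always:.3f}, MCC={mcc_always:.3f}")
print(f"mixed:    F1={f1_mixed:.3f}, MCC={mcc_mixed:.3f}")
```

In this example the classifier that always predicts 1 has the higher F1 (0.750 against 0.643) but an MCC of 0, while the classifier that predicts both classes has a positive MCC (about 0.196). A search that maximizes F1 would therefore prefer the degenerate model, whereas one that maximizes MCC would not.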


In [None]:
from sklearn.metrics import precision_score, make_scorer


def custom_precision_score(y_true, y_pred):
    """Reward high and balanced per-class precision; return 0 if either precision is below 0.5."""
    precision_0 = precision_score(y_true, y_pred, pos_label=0)
    precision_1 = precision_score(y_true, y_pred, pos_label=1)
    # The "< 0.5" test already covers the case where a precision is exactly 0
    if precision_0 < 0.5 or precision_1 < 0.5:
        return 0
    imbalance_penalty = 1 - abs(precision_0 - precision_1)
    return (precision_0 * precision_1 + max(precision_0, precision_1)) * imbalance_penalty


custom_scorer = make_scorer(custom_precision_score)


## 2.2 Grid search


In [None]:
dim_reduction_configs = [
    {
        'dim_reduction': [None]
    },
    {
        'dim_reduction': [PCA()],
        'dim_reduction__n_components': [0.3, 0.5, 0.7, 0.9]
    },
    {
        'dim_reduction': [LDA()],
        'dim_reduction__n_components': [1]
    },
    {
        'dim_reduction': [SFS(estimator=Perceptron(), cv=None, scoring="matthews_corrcoef")],
        'dim_reduction__estimator': [Perceptron(), LogisticRegression(), SVC()],
        'dim_reduction__k_features': [10, 20, 30, 40, 50, 60, 70, 80]
    },
]

classifier_configs = [
    {
        'classifier': [Perceptron()],
        'classifier__eta0': loguniform(0.0001, 10),
        'classifier__max_iter': [50, 100, 200, 500],
        'classifier__class_weight': [None, 'balanced'],
        'classifier__penalty': ['l1', 'l2', None]
    },
    {
        'classifier': [LogisticRegression(solver='saga')],
        'classifier__C': loguniform(0.0001, 10),
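        'classifier__penalty': ['l1', 'l2', 'elasticnet'],
        # l1_ratio must be set when penalty='elasticnet' (0.5 is an arbitrary choice);
        # it is ignored, with a warning, for the other penalties
        'classifier__l1_ratio': [0.5],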
        'classifier__class_weight': [None, 'balanced'],
        'classifier__max_iter': [100, 500, 1000]
    },
    {
        'classifier': [KNeighborsClassifier()],
        'classifier__n_neighbors': [3, 5, 7, 9, 11, 13, 15, 17, 19, 21],
        'classifier__weights': ['uniform', 'distance'],
        'classifier__metric': ['euclidean', 'manhattan', 'chebyshev']
    },
    {
        'classifier': [RandomForestClassifier()],
        'classifier__n_estimators': [50, 100, 200, 500, 1000],
        'classifier__max_depth': [None, 10, 20, 30, 50],
        'classifier__min_samples_split': [2, 5, 10],
        'classifier__min_samples_leaf': [1, 2, 4]
    },
    {
        'classifier': [GradientBoostingClassifier()],
        'classifier__n_estimators': [50, 100, 200, 500],
        'classifier__learning_rate': loguniform(0.001, 0.1),
        'classifier__max_depth': [3, 5, 7, 9],
        'classifier__min_samples_split': [2, 5, 10],
        'classifier__min_samples_leaf': [1, 2, 4]
    },
    {
        'classifier': [XGBClassifier()],
        'classifier__n_estimators': [50, 100, 200],
        'classifier__learning_rate': loguniform(0.001, 0.1),
        'classifier__max_depth': [3, 5, 7],
        'classifier__subsample': [0.5, 0.7, 1.0],
        'classifier__colsample_bytree': [0.5, 0.7, 1.0]
    },
    {
        'classifier': [AdaBoostClassifier()],
        'classifier__n_estimators': [50, 100, 200],
        'classifier__learning_rate': loguniform(0.001, 1)
    },
    {
        'classifier': [LinearSVC()],
        'classifier__C': loguniform(0.001, 100),
        'classifier__max_iter': [1000, 2000, 3000],
        'classifier__class_weight': [None, 'balanced']
    }
]

In [None]:
# Merge every dimensionality-reduction config with every classifier config
all_configs = [
    {**dim_config, **clf_config}
    for dim_config, clf_config in itertools.product(dim_reduction_configs, classifier_configs)
]
print(len(all_configs))

In [None]:
rs = RandomizedSearchCV(model_pipeline,
    param_distributions=all_configs,
    n_iter=len(all_configs) * 100,
    n_jobs=-1,
    cv = 5,
    scoring='matthews_corrcoef'
)

In [None]:
scores = cross_validate(rs, X_train, y_train, scoring="matthews_corrcoef", cv = 10, return_estimator=True, verbose=3) 

In [None]:
for index, estimator in enumerate(scores['estimator']):
    print(estimator.best_estimator_.get_params()['dim_reduction'])
    print(estimator.best_estimator_.get_params()['classifier'],estimator.best_estimator_.get_params()['classifier'].get_params())
    print(scores['test_score'][index])
    print('-'*10)

In [None]:
for estimator in scores['estimator']:
    # Refit each fold's best pipeline on the full training set before scoring
    best_pipeline = estimator.best_estimator_
    best_pipeline.fit(X_train, y_train)
    pred_train = best_pipeline.predict(X_train)
    pred_test = best_pipeline.predict(X_test)
    f1_train = f1_score(y_train, pred_train)
    f1_test = f1_score(y_test, pred_test)
    print(f'F1 on training set:{f1_train}, F1 on test set:{f1_test}')

To make the results readable at a glance, I generate a single figure that shows the confusion matrix of each estimator returned by the outer cross-validation.  
This allows a quick comparison of classification performance across folds and makes patterns, strengths, and potential weaknesses in the model's predictions easier to identify.


In [None]:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

num_estimators = len(scores['estimator'])
ncols = 3
nrows = (num_estimators + ncols - 1) // ncols  # enough rows for every outer-fold estimator

fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(15, 5 * nrows))
axes = axes.flatten()
for index, estimator in enumerate(scores['estimator']):
    pred_test = estimator.best_estimator_.predict(X_test)
    report = classification_report(y_test, pred_test)
    print(report)
    cm = confusion_matrix(y_test, pred_test)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=estimator.classes_)
    disp.plot(cmap='Blues', ax=axes[index], colorbar=False)
    axes[index].set_title(f'Confusion Matrix for Estimator {index+1}')

for ax in axes[num_estimators:]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()


### Refinement of the selected model


In [None]:
best_model_pipeline_ref = IMBPipeline([
    ('trans', preprocessor),
    ('classifier', LogisticRegression(
        C=1.944795783724781,
        class_weight='balanced',
        max_iter=1000,
        penalty='l1',
        solver='saga',
        random_state=42
    ))
])
params = {
    'classifier__C': np.logspace(np.log10(1.0), np.log10(3.0), num=5),
    'classifier__penalty': ['l1', 'l2'],
    'classifier__max_iter': [500, 1000, 1500]
}


In [None]:
rs_best = RandomizedSearchCV(
    estimator = best_model_pipeline_ref,
    param_distributions = params,
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1),
    n_iter=20,
    scoring='matthews_corrcoef'
)


In [None]:
rs_best.fit(X_train, y_train)
cls = rs_best.best_estimator_
y_pred= cls.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)
print(cls.get_params())

# Part 3- Results Analysis

## 3.1 Grid Search Result

During the experimentation phase, I tested various configurations, modifying the following parameters:  

- **Stock selection**  
- **Number of preceding candles** considered in each instance  
- **Timeframe** (5 minutes, 10 minutes, 1 hour, 1 day)  

The results revealed the following insights:  

- Since the model relies solely on technical indicator analysis, without incorporating macroeconomic context or investor sentiment, **indices tend to perform better**. Individual stocks introduce excessive volatility, while, unexpectedly, currency pairs do not yield strong performance.  
- **Optimal performance is achieved when considering at least 8-9 preceding candles**. However, beyond 12 candles, performance deteriorates.  
- **The daily timeframe provides the best results** as it reduces noise. However, the difference compared to the 1-hour timeframe is not significant. Given similar performance, the 1-hour timeframe is preferable as it allows for a higher trade frequency, leading to potentially greater profits.

I also decided to save the probability the model assigns to its predicted class (0 or 1) for each instance; it can be used later as a confidence threshold for filtering trades.

### 3.1.1 Best model Result

In [None]:
#Stock: Spy index
#Timeframe: 1 hour

best_model_pipeline_1 = IMBPipeline([
    ('trans', preprocessor),
    ('classifier', LinearSVC(C=2.0257272878736794))
])
best_model_pipeline_1.fit(X_train,y_train)
y_pred_model_1 =  best_model_pipeline_1.predict(X_test)
report = classification_report(y_test, y_pred_model_1)
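# LinearSVC has no predict_proba; as an uncalibrated stand-in, squash the absolute
# decision-function margin through a sigmoid to get a confidence in [0.5, 1)
decision_margin = np.abs(best_model_pipeline_1.decision_function(X_test))
prediction_probabilities_model_1 = 1 / (1 + np.exp(-decision_margin))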

print(report)

After evaluating the best-performing model, the next step is to assess whether a basic buy/sell strategy based on the model's predicted labels can yield tangible results.

**Approach**

1. **Percentage Difference Calculation:**  
   We calculate the absolute percentage difference between two consecutive candlesticks by comparing the current candlestick's `Adj Close` with that of the next one.

2. **Capping for Stop-Loss/Take-Profit:**  
   The computed difference is capped at 1%, which simulates a stop-loss/take-profit mechanism. The cap is rarely reached on 1-hour candlesticks (except under extreme market conditions), but it proves useful when simulating daily candlesticks.

3. **Utilizing Prediction Probability:**  
   Tracking the prediction probability allows us to filter out low-confidence predictions:
   - By setting a threshold, we can choose to perform more trades (with lower prediction probabilities) or opt for fewer, higher-quality trades.
   - In real trading, favoring quality (i.e., higher prediction probability) is likely preferable, as it increases overall accuracy and reduces transaction costs.

>**Practical example**   
>Without the filter, over the course of a month of testing, approximately 2000 trades are executed with an average accuracy of 54% and an average prediction probability of about 53%. If trades that are nearly random (prediction probability below 60%) are excluded, the number of trades drops to roughly 200-300, while the accuracy rises to about 60%.
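
The effect of the probability threshold can be sketched on a handful of made-up predictions. The arrays and the `accuracy_above` helper below are purely illustrative and do not reproduce the figures quoted above.

```python
import numpy as np

# Made-up labels, predictions and confidences (probability of the predicted class)
label = np.array([1, 0, 1, 1, 0, 0, 1, 0])
pred_label = np.array([1, 1, 1, 0, 0, 1, 1, 1])
confidence = np.array([0.71, 0.52, 0.64, 0.51, 0.66, 0.55, 0.58, 0.62])


def accuracy_above(threshold):
    """Return (number of trades kept, accuracy on those trades)."""
    keep = confidence > threshold
    n_trades = int(keep.sum())
    accuracy = float((label[keep] == pred_label[keep]).mean()) if n_trades else 0.0
    return n_trades, accuracy


all_trades = accuracy_above(0.0)
confident_trades = accuracy_above(0.6)
print(all_trades, confident_trades)  # (8, 0.5) (4, 0.75)
```

Raising the threshold trades volume for quality: fewer trades survive the filter, but a larger share of them is correct, provided the model's confidence is informative in the first place.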

In [None]:
def process_and_analyze(X_test, y_test, y_pred, prediction_probabilities, model_n=1, prediction_probabilities_filter=0):
    # Note: X_test is modified in place; later cells rely on the columns added here
    # Absolute percentage change between this row's Adj Close and the next row's
    X_test["difference_abs_percentage"] = abs(100 - X_test["Adj Close"].shift(-1) * 100 / X_test["Adj Close"])
    X_test["label"] = y_test
    X_test["pred_label"] = y_pred
    X_test["prediction_probability"] = prediction_probabilities  # confidence of each prediction
    # Cap the move at 1% to simulate a stop-loss/take-profit
    X_test["Adj difference_abs_percentage"] = X_test["difference_abs_percentage"].clip(lower=-1, upper=1)
    X_test.fillna(0.02, inplace=True)

    Df = X_test[['Datetime', 'Adj Close', 'difference_abs_percentage', 'Adj difference_abs_percentage', 
                 'pred_label', 'label', 'prediction_probability']]
    Df.to_csv(f"output_model_{model_n}.csv", index=False)
    
    plt.figure(figsize=(10, 6))
    plt.plot(X_test['Datetime'], X_test['Adj Close'], label='Adj Close')
    plt.title('Adj Close over the Test Period')
    plt.xlabel('Datetime')
    plt.ylabel('Adj Close')
    plt.legend()
    plt.show()

    filtered_X_test = X_test[X_test["prediction_probability"] > prediction_probabilities_filter]
    
    correct_labels = (filtered_X_test['label'] == filtered_X_test['pred_label']).sum()
    wrong_labels = len(filtered_X_test) - correct_labels
    avg_accuracy = correct_labels * 100 / (correct_labels + wrong_labels) if (correct_labels + wrong_labels) > 0 else 0

    print(f"Number of correct labels: {correct_labels}")
    print(f"Number of wrong labels: {wrong_labels}")
    print(f"Avg accuracy: {avg_accuracy:.2f}%")
    print(f"Avg prediction_probability level: {np.mean(filtered_X_test['prediction_probability']) if not filtered_X_test.empty else 0:.2f}")


In [None]:
X_test_model_1= X_test
y_test_model_1= y_test
process_and_analyze(X_test_model_1, y_test_model_1, y_pred_model_1,prediction_probabilities_model_1,1, prediction_probabilities_filter=0.5)

The following function evaluates a high-leverage trading strategy that relies on predicted market movement signals. The strategy's performance is compared to a **Buy-and-Hold benchmark** and verified against key **FTMO Challenge criteria** to ensure it meets strict risk management and profitability standards.

---
**Buy/Short Sell Approach**  
- **Signal-Based Trading**:  
  For each candlestick, the strategy determines the trade direction based on the predicted label:  
    - **Predicted Label = 1** → Go **Long** (buy the asset), expecting the price to rise.  
    - **Predicted Label = 0** → Go **Short** (sell the asset), anticipating a price drop.  

**Investment and Leverage**  
- Each trade invests **2% of the current capital**.  
- The strategy applies **100x leverage**, significantly amplifying both profits and losses.  

**Prediction Filter**  
- Trades are executed only if the **prediction probability** exceeds a chosen threshold (for example **60%**), so that only high-confidence trades are taken.  
- If the predicted label matches the actual market movement (`label`), the trade results in profit. Otherwise, it incurs a loss.
---
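The arithmetic behind a single leveraged trade can be made explicit with a short sketch. The `trade_pnl` helper is illustrative; it uses the 2% stake and 100x leverage stated above and a made-up 0.5% price move.

```python
def trade_pnl(capital, move_pct, correct, investment_pct=0.02, leverage=100):
    """P&L of one trade: margin * leverage * relative price move, signed by correctness."""
    investment = capital * investment_pct  # margin committed to the trade
    exposure = investment * leverage       # notional position size
    pnl = exposure * abs(move_pct) / 100
    return pnl if correct else -pnl


capital = 10_000
win = trade_pnl(capital, move_pct=0.5, correct=True)
loss = trade_pnl(capital, move_pct=0.5, correct=False)
print(win, loss)
```

With these settings the notional exposure is twice the account capital (2% times 100), so a 0.5% move in the asset changes the account by about 1% in either direction.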

**FTMO Challenge Specification**  

The function verifies the strategy against FTMO Challenge rules, simulating a real-world evaluation used by proprietary trading firms:  

1. **Minimum Trading Days**:  
   - The strategy must execute trades on at least **4 distinct trading days**.

2. **Daily Loss Limit**:  
   - No single day should result in a loss exceeding **5% of the initial capital**.  

3. **Account Equity Maintenance**:  
   - The account equity must remain above **90% of the initial capital** at all times.  

4. **Profit Target**:  
   - The strategy must achieve at least **5% in profit** over the evaluation period.

> **Why Use FTMO Criteria?**  
> These criteria help evaluate the robustness and risk management of the strategy under realistic constraints. Meeting these requirements demonstrates that the strategy can handle pressure, control losses, and remain profitable over time—essential traits for professional trading.  
---
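The four rules can be expressed as simple checks on the daily profit and loss. The sketch below is a simplified illustration with made-up daily results and a hypothetical `check_ftmo_rules` helper; unlike a trade-by-trade simulation, it only looks at end-of-day equity.

```python
from datetime import date


def check_ftmo_rules(daily_pnl, initial_capital=10_000, min_trading_days=4,
                     max_daily_loss_pct=0.05, min_equity_ratio=0.90, profit_target_pct=0.05):
    """Check the four rules on a mapping {trading day: P&L of that day}."""
    equity = initial_capital
    lowest_equity = initial_capital
    for day in sorted(daily_pnl):
        equity += daily_pnl[day]
        lowest_equity = min(lowest_equity, equity)  # end-of-day equity only
    return {
        "min_trading_days": len(daily_pnl) >= min_trading_days,
        "daily_loss_limit": min(daily_pnl.values(), default=0) >= -max_daily_loss_pct * initial_capital,
        "equity_maintenance": lowest_equity >= min_equity_ratio * initial_capital,
        "profit_target": equity - initial_capital >= profit_target_pct * initial_capital,
    }


# Made-up daily results for five trading days
daily_pnl = {
    date(2024, 1, 2): 250.0,
    date(2024, 1, 3): -300.0,
    date(2024, 1, 4): 400.0,
    date(2024, 1, 5): -120.0,
    date(2024, 1, 8): 350.0,
}
result = check_ftmo_rules(daily_pnl)
print(result)
```

In this example all four checks pass: there are five trading days, the worst day loses 300 (below the 500 limit), equity never drops under 9,000, and the total profit of 580 exceeds the 500 target.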

**Performance Comparison**  

The function benchmarks the trading strategy against a **Buy-and-Hold approach**:  
- **Buy-and-Hold**: The initial capital is invested in the asset at the start of the period and held without trading.  
- The function tracks the **capital progression** for both strategies and visualizes them for comparison.  

---

**Performance Tracking**  

The function records and visualizes key metrics:  
1. **Capital Evolution**: Comparison between the trading strategy and Buy-and-Hold over time.  
2. **Maximum Drawdown**: The largest observed drop from a peak in trading capital.  
3. **Consecutive Losses**: The maximum number of consecutive losing trades.  
4. **Daily Profit and Loss (P&L)**: The accumulated profit or loss for each trading day.  
5. **Monthly Error Distribution**: A bar chart showing the number of incorrect predictions (trading errors) per month.
---
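Two of these metrics, maximum drawdown and consecutive losses, are easy to misread, so here is a minimal sketch of how they can be computed from an equity curve. The helper names and the equity values are made up for illustration.

```python
def max_drawdown(capital_history):
    """Largest relative drop from a running peak, as a fraction of that peak."""
    peak = capital_history[0]
    worst = 0.0
    for capital in capital_history:
        peak = max(peak, capital)
        worst = max(worst, (peak - capital) / peak)
    return worst


def max_consecutive_losses(trade_pnls):
    """Length of the longest run of losing trades."""
    longest = current = 0
    for pnl in trade_pnls:
        current = current + 1 if pnl < 0 else 0
        longest = max(longest, current)
    return longest


# Made-up equity curve and the per-trade P&L that produces it
capital_history = [10_000, 10_200, 9_900, 9_690, 10_100, 10_050]
trade_pnls = [later - earlier for earlier, later in zip(capital_history, capital_history[1:])]

print(max_drawdown(capital_history), max_consecutive_losses(trade_pnls))
```

Here the peak is 10,200 and the lowest point after it is 9,690, a drawdown of 5%, and the longest losing streak is two trades. Note that the drawdown is measured from the running peak, not from the initial capital.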

In [None]:
def evaluate_trading_strategy_and_verify(df, 
                                           initial_capital=10000, 
                                           investment_pct=0.02, 
                                           leverage=100, 
                                           min_trading_days=4,
                                           max_daily_loss=500,         
                                           min_account_equity_ratio=0.90, 
                                           profit_target_verification=500,
                                           prediction_probability= 0
                                          ):
    """
    Evaluate a trading strategy against a Buy-and-Hold benchmark and verify it
    against the FTMO challenge rules. A trade is taken only when the row's
    prediction_probability is strictly greater than the `prediction_probability`
    threshold passed to this function (default 0).
    """
    
    required_columns = ['Datetime', 'Adj Close', 'difference_abs_percentage', 'label', 'pred_label', 'prediction_probability']
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError("DataFrame is missing required columns: " + ", ".join(missing))
    
    df = df.rename(columns={'Adj Close': 'Adj_Close'})
    
    df['Datetime'] = pd.to_datetime(df['Datetime'])
    
    capital = initial_capital
    capital_history = [capital]
    drawdowns = []
    peak = capital

    current_consecutive_errors = 0
    max_consecutive_errors = 0
    errors_by_month = defaultdict(int)

    initial_price = df.iloc[0]['Adj_Close']
    buy_and_hold_value = [initial_capital]
    datetime_values = [df.iloc[0]['Datetime']]

    daily_pnl = defaultdict(float)
    trading_days = set()
    account_stop_loss_breached = False

    for row in df.itertuples(index=False):
        if row.prediction_probability <= prediction_probability:
            continue  

        if pd.isna(row.label) or pd.isna(row.pred_label):
            continue

        trade_day = row.Datetime.date()
        trade_month = row.Datetime.strftime("%Y-%m")
        trading_days.add(trade_day)

        # Investment calculation
        investment = capital * investment_pct
        raw_trade_pl = investment * (row.difference_abs_percentage / 100) * leverage

        # Determine profit or loss
        pnl = abs(raw_trade_pl) if row.label == row.pred_label else -abs(raw_trade_pl)
        capital += pnl

        # Update error tracking
        if pnl < 0:
            current_consecutive_errors += 1
            max_consecutive_errors = max(max_consecutive_errors, current_consecutive_errors)
            errors_by_month[trade_month] += 1
        else:
            current_consecutive_errors = 0

        # Record daily P&L
        daily_pnl[trade_day] += pnl

        # Update drawdown calculation
        capital_history.append(capital)
        peak = max(peak, capital)
        drawdowns.append((peak - capital) / peak if peak > 0 else 0)

        # Update Buy-and-Hold
        buy_and_hold_value.append(initial_capital * (row.Adj_Close / initial_price))
        datetime_values.append(row.Datetime)

        # Check if account equity falls below threshold
        if capital < initial_capital * min_account_equity_ratio:
            account_stop_loss_breached = True

    max_drawdown = max(drawdowns) if drawdowns else 0
    worst_daily_pnl = min(daily_pnl.values()) if daily_pnl else 0
    maximum_daily_loss_reached = abs(worst_daily_pnl)
    maximum_daily_loss_percentage = (maximum_daily_loss_reached / initial_capital) * 100

    # FTMO Challenge verification
    num_trading_days = len(trading_days)
    min_trading_days_met = num_trading_days >= min_trading_days
    daily_loss_breaches = {day: pnl for day, pnl in daily_pnl.items() if pnl < -max_daily_loss}
    max_daily_loss_met = (len(daily_loss_breaches) == 0)
    account_stop_loss_met = not account_stop_loss_breached
    profit_target_met = (capital - initial_capital) >= profit_target_verification

    # Plot Performance vs. Buy & Hold
    plt.figure(figsize=(12, 6))
    plt.plot(datetime_values, capital_history, label='Trading Strategy', color='blue')
    plt.plot(datetime_values, buy_and_hold_value, label='Buy & Hold', color='green', linestyle='dashed')
    plt.xlabel('Date')
    plt.ylabel('Capital ($)')
    plt.title('Trading Strategy vs. Buy & Hold Performance')
    plt.xticks(rotation=45)
    plt.grid(True)
    plt.legend()
    plt.show()

    # Plot errors per month
    months = sorted(errors_by_month.keys())
    errors = [errors_by_month[month] for month in months]
    
    plt.figure(figsize=(12, 6))
    plt.bar(months, errors, color='red')
    plt.xlabel('Month')
    plt.ylabel('Number of Errors')
    plt.title('Distribution of Wrong Labels by Month')
    plt.xticks(rotation=45)
    plt.grid(True)
    plt.show()

    print(f"Final Capital (Trading Strategy): ${capital:.2f}")
    print(f"Final Capital (Buy & Hold): ${buy_and_hold_value[-1]:.2f}")
    print(f"Max Drawdown: {max_drawdown * 100:.2f}%")
    print(f"Maximum Consecutive Errors: {max_consecutive_errors}")
    print(f"Max Daily Loss: ${maximum_daily_loss_reached:.2f}")
    print(f"Max Daily Loss Percentage: {maximum_daily_loss_percentage:.2f}%\n")
    
    print("FTMO Challenge Verification Results:")
    print(f" - Minimum Trading Days: {num_trading_days} (Required: {min_trading_days}) --> {'PASSED' if min_trading_days_met else 'FAILED'}")
    if daily_loss_breaches:
        for day, pnl in daily_loss_breaches.items():
            print(f" - Day {day} exceeded max daily loss: Loss = ${-pnl:.2f} (Limit: ${max_daily_loss})")
    print(f" - Max Daily Loss Rule: {'PASSED' if max_daily_loss_met else 'FAILED'}")
    print(f" - Account Stop-Loss: {'PASSED' if account_stop_loss_met else 'FAILED'}")
    print(f" - Profit Target: {'PASSED' if profit_target_met else 'FAILED'}")
    
    return {
        'final_capital': capital,
        'buy_and_hold_final': buy_and_hold_value[-1],
        'max_drawdown': max_drawdown,
        'max_consecutive_errors': max_consecutive_errors,
        'num_trading_days': num_trading_days,
        'daily_loss_breaches': daily_loss_breaches,
        'account_stop_loss_met': account_stop_loss_met,
        'profit_target_met': profit_target_met,
        'maximum_daily_loss_reached': maximum_daily_loss_reached,
        'maximum_daily_loss_percentage': maximum_daily_loss_percentage,
        'errors_by_month': dict(errors_by_month)
    }


In [None]:
df_1=pd.read_csv("output_model_1.csv")
evaluate_trading_strategy_and_verify(df_1,prediction_probability=0.5)

### 3.1.2 Second Best model Result

The results below follow the same approach as in the previous section, this time with an L1-regularized logistic regression as the decision model.

In [None]:
#Stock: Spy index
#Timeframe: 5min 
best_model_pipeline_2 = IMBPipeline([
    ('trans', preprocessor),
    ('classifier', LogisticRegression(C=0.784907138175208, class_weight='balanced', max_iter=500, penalty='l1', solver='saga'))
])
best_model_pipeline_2.fit(X_train,y_train)
y_pred_model_2= best_model_pipeline_2.predict(X_test)
report = classification_report(y_test, y_pred_model_2)
prediction_probabilities_model_2 = np.max((best_model_pipeline_2.predict_proba(X_test)), axis=1)

print(report)

In [None]:
X_test_model_2= X_test
y_test_model_2= y_test
process_and_analyze(X_test_model_2, y_test_model_2, y_pred_model_2,prediction_probabilities_model_2,2)

In [None]:
evaluate_trading_strategy_and_verify(X_test_model_2)

### 3.1.3 Third Best model Result

The results below follow the same approach as in the previous section, this time with a linear support vector classifier (`LinearSVC`) as the decision model.

In [None]:
#Stock: Spy index
#Timeframe: 1Hour 
best_model_pipeline_3 = IMBPipeline([
    ('trans', preprocessor),
    ('classifier', LinearSVC(C=2.0257272878736794))
])
best_model_pipeline_3.fit(X_train, y_train)
y_pred_model_3 = best_model_pipeline_3.predict(X_test)
report = classification_report(y_test, y_pred_model_3)
# LinearSVC does not implement predict_proba, so a constant placeholder confidence is used
prediction_probabilities_model_3 = [0.7] * len(X_test)

print(report)

In [None]:
X_test_model_3= X_test
y_test_model_3= y_test
process_and_analyze(X_test_model_3, y_test_model_3, y_pred_model_3,prediction_probabilities_model_3,3,prediction_probabilities_filter=0)

In [None]:
evaluate_trading_strategy_and_verify(X_test_model_3,prediction_probability=0)

## 3.2 Learning visualization


In [None]:
train_sizes, train_scores, test_scores = learning_curve(best_model_pipeline_1,
                                                       X=X_train,
                                                       y=y_train,
                                                       train_sizes= [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
                                                       cv = 5,
                                                       n_jobs = -1,
                                                       scoring = 'matthews_corrcoef',
                                                       shuffle = False)

### Learning Curve Plot

This code visualizes training and validation Matthews Correlation Coefficient (MCC) to assess model performance across different training set sizes.

In [None]:
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

fig=plt.figure(figsize=(12,7))
ax = fig.add_subplot()

ax.plot(train_sizes, train_mean,
         color='blue', marker='+',
         markersize=5, label='Training MCC')

ax.fill_between(train_sizes,
                 train_mean + train_std,
                 train_mean - train_std,
                 alpha=0.15, color='blue')

ax.plot(train_sizes, test_mean,
         color='green', linestyle='--',
         marker='d', markersize=5,
         label='Validation MCC')

ax.fill_between(train_sizes,
                 test_mean + test_std,
                 test_mean - test_std,
                 alpha=0.15, color='green')

ax.grid()
ax.set_xlabel('Training set size')
ax.set_ylabel('matthews_corrcoef')
ax.legend(loc='lower right')
ax.set_ylim([-0.05, 0.2])
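
For reference, the MCC plotted above can be computed directly from the four confusion-matrix counts. The snippet below is a minimal standalone sketch, not the notebook's own code (which relies on scikit-learn's `'matthews_corrcoef'` scorer); the function name and the counts are hypothetical and chosen only to illustrate the scale of the values on the y-axis.

```python
import math


def matthews_corrcoef_from_counts(tp, tn, fp, fn):
    """MCC from confusion-matrix counts; returns 0.0 when the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts: a balanced test set where 55% of predictions are correct
mcc = matthews_corrcoef_from_counts(tp=55, tn=55, fp=45, fn=45)
print(f"MCC: {mcc:.3f}")
```

On a balanced dataset with symmetric errors, an accuracy of 55% corresponds to an MCC of 0.1, which is why the y-axis of the plot is limited to a narrow band around zero.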

In [None]:
range_C = [0.001, 0.01, 0.1, 1, 10, 100]
train_scores, test_scores = validation_curve(best_model_pipeline_1,
                                             X=X_train,
                                             y=y_train,
                                             param_name='classifier__C',
                                             param_range=range_C,
                                             cv=5,
                                             n_jobs=-1,
                                             scoring='matthews_corrcoef')

### Validation Curve for Parameter C  
This code plots the validation curve for different values of `C`, analyzing its effect on training and validation Matthews Correlation Coefficient (MCC), with shaded regions showing ±1 standard deviation across the cross-validation folds.


In [None]:
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

fig=plt.figure(figsize=(12,7))
ax = fig.add_subplot()
ax.plot(range_C, train_mean,
         color='blue', marker='o',
         markersize=5, label='Training MCC')

ax.fill_between(range_C,
                 train_mean + train_std,
                 train_mean - train_std,
                 alpha=0.15, color='blue')

ax.plot(range_C, test_mean,
         color='green', linestyle='--',
         marker='s', markersize=5,
         label='Validation MCC')

ax.fill_between(range_C,
                 test_mean + test_std,
                 test_mean - test_std,
                 alpha=0.15, color='green')

ax.grid()
ax.set_xlabel('Parameter C')
ax.set_ylabel('matthews_corrcoef')
ax.legend(loc='lower right')
ax.set_ylim([-0.05, 0.15])
ax.set_xscale('log')
ax.set_xlim([0.05,100])

### Precision-Recall Curve  
By plotting **Precision and Recall vs. Threshold**, I can see how the choice of threshold on the model's decision scores trades precision against recall.


In [None]:
scores = best_model_pipeline_1.decision_function(X_train)

precisions, recalls, thresholds = precision_recall_curve(y_train, scores)
threshold = 0.2
fig = plt.figure(figsize=(10, 4))
ax = fig.add_subplot()
ax.plot(thresholds, precisions[:-1], "b--", label="Precision", lw=2)
ax.plot(thresholds, recalls[:-1], "g-", label="Recall", lw=2)
ax.vlines(threshold, 0, 1.0, "k", "dotted", label="threshold")

idx = (thresholds >= threshold).argmax()  # first index ≥ threshold
ax.plot(thresholds[idx], precisions[idx], "bo")
ax.plot(thresholds[idx], recalls[idx], "go")
ax.grid()
ax.set_xlabel("Threshold")
ax.set_xlim((-0.85, 1.5))
ax.legend(loc="center right")
plt.show()

In [None]:
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot()
ax.plot(recalls, precisions, lw=2, label="Precision/Recall curve")
ax.plot([recalls[idx], recalls[idx]], [0., precisions[idx]], "k:")
ax.plot([0.0, recalls[idx]], [precisions[idx], precisions[idx]], "k:")
ax.plot([recalls[idx]], [precisions[idx]], "ko",
         label=f"Point at threshold {threshold}")
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
ax.axis([0, 1, 0, 1])
ax.legend(loc="lower left")
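
To make the threshold trade-off concrete, the sketch below computes precision and recall at a few thresholds on a tiny hand-made example. The labels, scores, and the helper `precision_recall_at` are hypothetical and unrelated to the notebook's data; the rule "predict 1 when the score is at or above the threshold" matches the convention used by `precision_recall_curve`.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting 1 for every score >= threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    # With no positive predictions, precision is set to 1.0 by convention
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and decision scores
y_true = [0, 0, 1, 0, 1, 1]
scores = [-0.6, -0.1, 0.1, 0.3, 0.5, 0.9]

results = {t: precision_recall_at(y_true, scores, t) for t in (-0.5, 0.2, 0.6)}
for t, (p, r) in results.items():
    print(f"threshold={t:+.1f}  precision={p:.2f}  recall={r:.2f}")
```

In this toy example, raising the threshold increases precision and lowers recall, which is the pattern to look for in the plots above.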

### ROC Curve  

The ROC curve shows the trade-off between the False Positive Rate and Recall (True Positive Rate) across thresholds, with the AUC summarizing overall discrimination ability.

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
fprs, recalls, thresholds = roc_curve(y_train,scores) 
fig = plt.figure(figsize=(6, 5)) 
ax = fig.add_subplot()
ax.plot(fprs, recalls, linewidth=2, label="ROC curve")
ax.plot([0, 1], [0, 1], 'k:', label="Random classifier's ROC curve")
ax.set_xlabel('False Positive Rate - FPR')
ax.set_ylabel('Recall')
ax.axis([0, 1, 0, 1])
ax.legend(loc="lower right", fontsize=13)

roc_auc_score(y_train, scores)
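
One way to read the AUC value returned above: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way on hypothetical toy data; `roc_auc_pairwise` is an illustrative helper, not part of scikit-learn, and its quadratic cost makes it suitable only for small examples.

```python
def roc_auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and decision scores
y_true = [0, 0, 1, 0, 1, 1]
scores = [-0.6, -0.1, 0.1, 0.3, 0.5, 0.9]

auc = roc_auc_pairwise(y_true, scores)
print(f"AUC: {auc:.3f}")
```

A value of 0.5 corresponds to the dotted diagonal of a random classifier, so the distance from 0.5 is what matters when judging the curve.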

## 3.3 Financial Results

Among the machine learning classification models evaluated, the following algorithms achieved the best results:
- **Logistic Regression**
- **Support Vector Classifier (SVC)**

**Feature Reduction**:
During the experimentation phase, an attempt was made to enforce feature reduction prior to grid search, retaining only the top 70 features deemed most useful by the model. However, this approach did not yield significantly different results compared to a model trained on the full feature set. The primary benefit of this a priori feature reduction was a substantial decrease in grid search computation time.
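
As an illustration of this kind of reduction, the sketch below keeps the `k` features with the largest absolute coefficients of a linear model. This is only one common ranking criterion and is an assumption here, since the exact selection method is not shown in this section; the feature names and coefficient values are hypothetical.

```python
def top_k_features(feature_names, coefficients, k):
    """Return the k feature names with the largest absolute coefficients."""
    ranked = sorted(zip(feature_names, coefficients),
                    key=lambda pair: abs(pair[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical indicator names and linear-model coefficients
names = ["rsi_14", "macd", "sma_50", "volume_change", "atr_14"]
coefs = [0.42, -0.90, 0.03, 0.00, -0.15]

selected = top_k_features(names, coefs, k=3)
print(selected)
```

Ranking by coefficient magnitude is only meaningful when the features are on a comparable scale, for example after standardization in the preprocessing step.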

**Timeframe Analysis**:
The models performed better when analyzing longer timeframes, such as **1-day candles**, though the difference was not drastic:
- **1-Day Timeframe**: Average accuracy ~ **60%**
- **1-Hour Timeframe**: Average accuracy ~ **55%**

While the 1-Hour timeframe allows for more trades within the same period compared to the 1-day timeframe, the accuracy trade-off must be considered.

**Index vs. Single Stock Performance**:
The models performed better on indices such as the **S&P 500**, which tracks about 500 large U.S. companies, than on individual stocks. A likely reason is that the models currently rely on technical indicators alone: news, which weighs heavily on individual stock prices, is not incorporated.

**Investment Strategy**:
A simple investment strategy was employed:
- **Buy** the stock if the predicted label is **1**.
- **Short sell** the stock if the predicted label is **0**.

For this example, **100x leverage** (highly risky and not recommended for non-experts) and **1% of available capital per trade** were used. No stop-loss or take-profit mechanisms were applied; each trade was simply closed at the end of its 1-Hour candle, on the assumption that volatility within a single candle is limited.
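
The arithmetic behind a single trade can be sketched as follows. This is a simplified model under stated assumptions, not the code of `evaluate_trading_strategy_and_verify`: the exposure is taken to be capital × fraction × leverage, the position is opened at one price and closed at the end of the candle, and costs such as spread, commissions, and financing are ignored. The function name, parameter names, and prices are hypothetical.

```python
def trade_pnl(capital, entry_price, exit_price, predicted_label,
              capital_fraction=0.01, leverage=100):
    """Profit or loss of one trade held for a single candle (no costs)."""
    margin = capital * capital_fraction            # cash committed to the trade
    notional = margin * leverage                   # leveraged market exposure
    direction = 1 if predicted_label == 1 else -1  # long on 1, short on 0
    price_return = (exit_price - entry_price) / entry_price
    return direction * notional * price_return


capital = 10_000.0
# The price rises 0.2% over the candle: a long gains what a short loses
pnl_long = trade_pnl(capital, entry_price=500.0, exit_price=501.0, predicted_label=1)
pnl_short = trade_pnl(capital, entry_price=500.0, exit_price=501.0, predicted_label=0)
print(f"Long P&L: {pnl_long:+.2f}  Short P&L: {pnl_short:+.2f}")
```

Under these assumptions, 1% of capital at 100x leverage gives an exposure equal to the full account, so a 0.2% price move changes the capital by about 0.2% in either direction.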

**Test Period Results**:
Using the above setup on a **6-month test period** with an initial capital of **$10,000**, trading daily on the S&P 500 with 1-Hour candles, the model achieved the following results:
- **Final Capital**: ~$15,000–$16,000
- **Capital Increase**: ~50–60%
- **Maximum Drawdown**: ~9–11%
- **Maximum Consecutive Errors**: 4–8
- **Maximum Daily Loss Percentage**: ~4%
- **Accuracy (Correct Trades)**: ~53–58% (not considering the improvement brought by using probabilities as a filter)
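
Two of the metrics in this list can be illustrated with a short standalone sketch. The definitions below are the usual ones (largest peak-to-trough decline of the equity curve, and the longest run of wrong predictions) and are assumed to match what the evaluation function reports; the equity values and labels are hypothetical.

```python
def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def max_consecutive_errors(y_true, y_pred):
    """Length of the longest run of wrong predictions."""
    longest = current = 0
    for actual, predicted in zip(y_true, y_pred):
        current = current + 1 if actual != predicted else 0
        longest = max(longest, current)
    return longest


# Hypothetical equity curve and labels
equity = [10_000, 10_400, 9_880, 10_900, 10_355, 11_200]
drawdown = max_drawdown(equity)
streak = max_consecutive_errors([1, 0, 1, 1, 0, 1, 0], [1, 1, 0, 1, 1, 1, 1])
print(f"Max drawdown: {drawdown * 100:.2f}%  Longest error streak: {streak}")
```

Because drawdown is measured against the running peak, a strategy can end the period with a profit and still show a sizeable drawdown along the way.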

**Limitations**:
As demonstrated in Section 3.1, applying an investment strategy based on machine learning decision models is feasible and has shown promising results. However, several critical limitations must be acknowledged:
1. **Leverage Dependency**: The impressive results are largely attributable to the use of **financial leverage**, a powerful tool that can lead to significant losses if not managed properly (e.g., through stop-loss mechanisms). Without leverage, the algorithm's performance would be substantially weaker.
2. **Data Source Limitations**: The model was trained using data from two separate sources due to limitations in data availability. Yahoo Finance restricts historical data to the past month, while Alpaca API does not provide live data access.
3. **Future Performance Uncertainty**: While the model has performed well historically, there is no guarantee of continued success. Continuous retraining and monitoring are essential to maintain performance.

**Disclaimer**:
The information presented in this notebook is **not financial advice**. Relying solely on a machine learning model for investment decisions is not a sound strategy and should not be considered without thorough research and professional guidance. Always conduct your own due diligence before making any investment decisions.

# Part 4 Future Improvements and References


## Future Upgrades

The following improvements are proposed to enhance the robustness and practicality of the model:

1. **Generalization Across Stocks**  
   The predictor can be tested on a broader range of stocks to evaluate its robustness and adaptability. The goal is to develop a more generalized predictor capable of performing well across diverse stock market conditions.

2. **Portfolio Construction and Risk Diversification**  
   A portfolio comprising multiple stocks can be constructed to diversify risk and improve overall performance. Additionally, transaction costs should be incorporated into the evaluation framework to ensure the strategy’s effectiveness reflects real-world trading conditions.

3. **Real Trading Environment Backtesting**  
   Rigorous backtesting of the model in a simulated real trading environment can be conducted to assess its performance under realistic market dynamics and constraints.

4. **Stop-Loss and Take-Profit Strategies**  
   Robust stop-loss and take-profit mechanisms can be developed to mitigate losses and lock in gains, ensuring the model operates within predefined risk management parameters.

5. **Data Continuity with a Single Data Source**  
   Using a single, consistent data source ensures continuity and reduces discrepancies caused by different data providers. This improvement enhances the model’s stability and reliability during training and evaluation.

6. **Incorporating Additional Technical Features**  
   The feature set can be expanded with more technical indicators.

7. **Incorporating Additional Non-Technical Features**  
   Integrating sentiment analysis from financial news, social media, and earnings reports can provide valuable insights into market sentiment. Natural Language Processing (NLP) techniques can be used to analyze textual data and extract relevant signals that may impact stock movements.  
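
As a starting point for item 4, the sketch below shows one way a stop-loss and a take-profit could be applied to a long position using the high and low of the candle in which the trade is held. Everything here is hypothetical: the function name, the percentage levels, and the conservative assumption that the stop-loss is hit first when both levels fall inside the same candle. Price gaps beyond the stop level are ignored.

```python
def exit_price_long(entry_price, high, low, close,
                    stop_loss_pct=0.01, take_profit_pct=0.02):
    """Exit price of a long trade held for one candle with fixed SL/TP levels."""
    stop_level = entry_price * (1 - stop_loss_pct)
    target_level = entry_price * (1 + take_profit_pct)
    if low <= stop_level:
        return stop_level      # conservative: assume the stop is hit first
    if high >= target_level:
        return target_level
    return close               # neither level reached: close at candle end


# Hypothetical candles for a long opened at 100.0
stopped = exit_price_long(100.0, high=101.0, low=98.5, close=100.5)
target_hit = exit_price_long(100.0, high=102.5, low=99.5, close=101.0)
held = exit_price_long(100.0, high=101.0, low=99.5, close=100.5)
print(stopped, target_hit, held)
```

A short position would mirror the logic, with the stop level above the entry price and the target below it.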


## References

- Chen, Y., & Jiang, S. (2015). Stock market forecasting using machine learning algorithms. Stanford University, CS229 Project Report. Retrieved from https://cs229.stanford.edu/proj2015/009_report.pdf

- Dai, X., & Zhang, Y. (2013). Machine learning in stock price trend forecasting. Stanford University, CS229 Project Report. Retrieved from https://cs229.stanford.edu/proj2013/DaiZhang-MachineLearningInStockPriceTrendForecasting.pdf

- Shen, J., Jiang, S., & Zhang, Y. (2012). Stock market forecasting using machine learning algorithms. Stanford University, CS229 Project Report. Retrieved from https://cs229.stanford.edu/proj2012/ShenJiangZhang-StockMarketForecastingusingMachineLearningAlgorithms.pdf

- Scorpionhiccup. (n.d.). Stock price prediction. GitHub repository. Retrieved from https://github.com/scorpionhiccup/StockPricePrediction?tab=readme-ov-file

- Scikit-learn developers. (n.d.). Scikit-learn: Machine learning in Python. Retrieved from http://scikit-learn.org/

- StockCharts.com. (n.d.). Chart school. Retrieved from http://stockcharts.com/school/doku.php?id=chart_school



## License


This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).  
It allows others to:  
- **Share** — copy and redistribute the material in any medium or format  
- **Adapt** — remix, transform, and build upon the material  

Under the following conditions:  
- **Attribution** — Appropriate credit must be given, a link to the license provided, and any changes indicated.  
- **NonCommercial** — The material cannot be used for commercial purposes.  

For more information, visit the full license at [https://creativecommons.org/licenses/by-nc/4.0/](https://creativecommons.org/licenses/by-nc/4.0/).
