<a href="https://colab.research.google.com/github/SidS12345/Quant-projects/blob/main/Asset_Price_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Goal: to use machine learning methods to forecast asset prices

In this project, I explore the predictive potential of machine learning in financial markets by applying a range of supervised learning models to historical stock data. I use feature engineering to extract informative signals from price and technical indicators across multiple stocks. These features were used to train and evaluate various models — including linear regression, neural networks and kNN — on their ability to forecast short-term price movements. Performance is assessed using standard classification metrics, both on profits and on variance, enabling comparative analysis of model accuracy and robustness. The project highlights the challenges and opportunities in applying data-driven approaches to quantitative trading.

In [20]:
pip install ta



In [21]:
# Installing relevant libraries

import numpy as np
import pandas as pd
import yfinance as yf
from datetime import datetime, timedelta
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, explained_variance_score, mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, SGDClassifier, LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
from ta import trend, momentum

In [22]:
# Collecting tickers from a range of sectors

tickers = [
  # Tech
  'AAPL',     # Apple
  'GOOGL',    # Alphabet
  'MSFT',     # Microsoft
  'NVDA',     # Nvidia

  # Electric Vehicles
  'TSLA',     # Tesla

  # Finance
  'JPM',      # JPMorgan Chase
  'BAC',      # Bank of America

  # Healthcare
  'JNJ',      # Johnson & Johnson
  'PFE',      # Pfizer

  # Energy
  'XOM',      # Exxon Mobil

  # Retail / Consumer Goods
  'WMT',      # Walmart
  'PG',       # Procter & Gamble

  # ETFs
  'SPY',      # S&P 500 ETF
  'QQQ',      # Nasdaq-100 ETF

  # Crypto
  'BITO',      # Bitcoin Strategy ETF (ProShares)

  # Commodity ETF
  'GLD'       # Gold
]

# choosing start and end date, taking 1500 days of stock data
today = datetime.today()
yesterday = today - timedelta(days = 1)
end_date = yesterday - timedelta(days = 60)
start_date = end_date - timedelta(days = 1500)

In [23]:
ticker_feature_dfs = {}

for ticker in tickers:
  # Download price data
  data = yf.download(ticker, start=start_date, end=end_date, auto_adjust=False)

  df = pd.DataFrame(index=data.index)
  df['close'] = data['Close']

  # Returns & lags
  df['pct_returns'] = 100 * (df['close'] - df['close'].shift(1)) / df['close'].shift(1)
  df['lag_1'] = df['pct_returns'].shift(1)
  df['lag_5'] = df['pct_returns'].shift(5)

  # Moving Averages
  df['sma_5'] = df['close'].rolling(window=5).mean()
  df['sma_10'] = df['close'].rolling(window=10).mean()
  df['sma_20'] = df['close'].rolling(window=20).mean()

  # Volatility
  df['volatility_10'] = df['pct_returns'].rolling(window=10).std()
  df['volatility_20'] = df['pct_returns'].rolling(window=20).std()

  # Exponential Moving Averages
  df['ema_10'] = trend.EMAIndicator(close=df['close'], window=10).ema_indicator()
  df['ema_20'] = trend.EMAIndicator(close=df['close'], window=20).ema_indicator()

  # MACD
  macd = trend.MACD(close=df['close'])
  df['macd'] = macd.macd()
  df['macd_signal'] = macd.macd_signal()
  df['macd_diff'] = macd.macd_diff()

  # RSI
  df['rsi'] = momentum.RSIIndicator(close=df['close'], window=14).rsi()

  # ADX
  high = pd.Series(data['High'].values.flatten(), index=data.index)
  low = pd.Series(data['Low'].values.flatten(), index=data.index)
  close = pd.Series(data['Close'].values.flatten(), index=data.index)
  adx = trend.ADXIndicator(high, low, close)
  df['adx'] = adx.adx()

  # Drop NaNs (from indicators with lags/rolling)
  df = df.dropna()

  # Scale features
  features = df.copy()
  scaler = StandardScaler()
  scaled_features = scaler.fit_transform(features)

  df_scaled = pd.DataFrame(scaled_features, index=features.index, columns=features.columns).dropna()
  # Store in dictionary
  ticker_feature_dfs[ticker] = {
    'features': df_scaled,
    'raw features': df,
    'scaler': scaler,
    'mean': df['close'].mean(),
    'std': df['close'].std()
  }





[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%********

In [24]:

def rescale(dataset, std, mean):
  return dataset * std + mean

# Create model
def create_model(ticker, ticker_feature_dfs, model):
  df_scaled = ticker_feature_dfs[ticker]['features'].copy()
  raw_df = ticker_feature_dfs[ticker]['raw features']
  mean = ticker_feature_dfs[ticker]['mean']
  std = ticker_feature_dfs[ticker]['std']

  df_scaled['target'] = df_scaled['close'].shift(-1)
  df_scaled = df_scaled.dropna()

  X = df_scaled.drop(['target'], axis=1)
  y = df_scaled['target']

  seed = 10 # random state of split. Can evaluate how different splits affect our results. Choose if we use seed or not
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
  target = rescale(y_test.to_numpy(), std, mean)

  model.fit(X_train, y_train)
  model_preds = model.predict(X_test)
  model_preds = rescale(model_preds, std, mean)

  return model_preds, X_test, target, model, raw_df

Now, we will implement a simple trading strategy on our predicted data - if the predicted close price is above the current close price, we will buy. Otherwise, we will sell. We will neutralise the position on the next day, and then advance to the next predicted datapoint. This will help us understand a bit about how useful our predictions may be for a trading bot

In [25]:
'''
def simple_trading(preds, X_test, y_test):
  bought = 0
  sold = 0
  no_trade = 0
  profit = 0
  profitable_trades = 0
  unprofitable_trades = 0
  break_even_trades = 0
  profits = []
  losses = []
  for i in range(len(preds)):
    date = X_test.index[i]
    close_today = X_test.iloc[i]['close']
    pred = preds[i]
    next_close = y_test.iloc[i]
    if pred > close_today:
      bought += 1
      PnL = next_close - close_today
    elif pred < close_today:
      sold += 1
      PnL = close_today - next_close
    else:
      no_trade += 1
    if PnL > 0:
      profitable_trades += 1
      profits.append(PnL)
    elif PnL < 0:
      unprofitable_trades += 1
      losses.append(-PnL)
    else:
      break_even_trades += 1
    profit += PnL
  return profit, profitable_trades, unprofitable_trades
'''

"\ndef simple_trading(preds, X_test, y_test):\n  bought = 0\n  sold = 0\n  no_trade = 0\n  profit = 0\n  profitable_trades = 0\n  unprofitable_trades = 0\n  break_even_trades = 0\n  profits = []\n  losses = []\n  for i in range(len(preds)):\n    date = X_test.index[i]\n    close_today = X_test.iloc[i]['close']\n    pred = preds[i]\n    next_close = y_test.iloc[i]\n    if pred > close_today:\n      bought += 1\n      PnL = next_close - close_today\n    elif pred < close_today:\n      sold += 1\n      PnL = close_today - next_close\n    else:\n      no_trade += 1\n    if PnL > 0:\n      profitable_trades += 1\n      profits.append(PnL)\n    elif PnL < 0:\n      unprofitable_trades += 1\n      losses.append(-PnL)\n    else:\n      break_even_trades += 1\n    profit += PnL\n  return profit, profitable_trades, unprofitable_trades\n"

In [26]:


def simple_trading(preds, X_test, raw_df):
  bought = 0
  sold = 0
  profit = 0
  profitable_trades = 0
  unprofitable_trades = 0
  break_even_trades = 0
  profits = []
  losses = []

  for i in range(len(preds)):
    date = X_test.index[i]
    day = date.dayofweek
    next_day_add = 3 if day == 4 else 1
    next_date = date + timedelta(days=next_day_add)

    if next_date in raw_df.index:
      pred = preds[i]
      close_today = raw_df.loc[date, 'close']
      next_close = raw_df.loc[next_date, 'close']
      if pred > close_today:
        bought += 1
        PnL = next_close - close_today
      else:
        sold += 1
        PnL = close_today - next_close
      if PnL > 0:
        profitable_trades += 1
        profits.append(PnL)
      elif PnL < 0:
        unprofitable_trades += 1
        losses.append(-PnL)
      else:
        break_even_trades += 1
      profit += PnL

  return profit, profits, losses, break_even_trades


In [27]:
# Giving the option to choose model, rather than just sticking simply to linear regression
# Can adjust hyperparameters within this choose model function
def choose_model(n):
  lin_reg = LinearRegression()
  rf = RandomForestRegressor(n_estimators=100)
  grad_boost = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1)
  mlp = MLPRegressor(hidden_layer_sizes=(100,100), max_iter=500)
  knn = KNeighborsRegressor(n_neighbors=5)

  global list_of_models
  list_of_models = [lin_reg, rf, grad_boost, mlp, knn]

  global model_names
  model_names = ['lin_reg   ', 'rf        ', 'grad_boost', 'mlp       ', 'knn       '] # Padded strings with spaces to make everything the same length
# list_of_models = [lin_reg]
  return list_of_models[n]

In [28]:
nothing = choose_model(1)

# Done just to run choose_model and thus have our global variables available

In [29]:
def run_profits(tickers, ticker_feature_dfs, model):
  ticker_profs = {
      ticker: {
          'profit': 0,
          'profitable trades': None,
          'unprofitable trades': None,
          'trained model': None,
          'start prices': None,
          'predicted prices': None,
          'actual prices': None,
          'break even trades': None
      } for ticker in tickers
  }
  for ticker in tickers:
    model_preds, X_test, y_test, trained_model, raw_df = create_model(ticker, ticker_feature_dfs, model)
    profit, profits, losses, break_even_trades = simple_trading(model_preds, X_test, raw_df)
    ticker_profs[ticker]['profit'] = profit
    ticker_profs[ticker]['profits'] = profits
    ticker_profs[ticker]['losses'] = losses
    ticker_profs[ticker]['trained model'] = trained_model

    mean = ticker_feature_dfs[ticker]['mean']
    std = ticker_feature_dfs[ticker]['std']
    ticker_profs[ticker]['start prices'] = rescale(X_test['close'], std, mean)
    ticker_profs[ticker]['predicted prices'] = model_preds
    ticker_profs[ticker]['actual prices'] = y_test
    ticker_profs[ticker]['break even trades'] = break_even_trades

#  for ticker, profit in ticker_profs.items():
#    print(f"{ticker}: {profit}")
  return ticker_profs
'''model = choose_model(4)
ticker_profs = run_profits(tickers, ticker_feature_dfs, model)
for ticker in tickers:
  print(f"{ticker}: {ticker_profs[ticker]['profit']}")
'''

'model = choose_model(4)\nticker_profs = run_profits(tickers, ticker_feature_dfs, model)\nfor ticker in tickers:\n  print(f"{ticker}: {ticker_profs[ticker][\'profit\']}")\n'

Keeping note on how long it takes each ML model to train and run once, where we have 16 assets and creating a lot of dictionaries

model 0 - 0s

model 1 - 24s

model 2 - 13s

model 3 - 8s

model 4 - 0s


In [30]:
# evaluating profit of each model once:

def model_statistics(tickers, ticker_feature_dfs):
  model_dict = {
    i: {
        'model': None,
        'model type': None,
        'overall profit': 0,
        'all profitable trades': None,
        'all unprofitable trades': None,
        'all trained models': None,
        'all start prices': None,
        'all predicted prices': None,
        'all actual prices': None,
        'total break even trades': None
    }
    for i in range(len(list_of_models))
  }

  for i in range(len(list_of_models)):
    model = choose_model(i)
    ticker_profs_model = run_profits(tickers, ticker_feature_dfs, model)
    model_dict[i]['model'] = model
    model_dict[i]['model type'] = model_names[i]
    model_dict[i]['overall profit'] = sum([ticker_profs_model[ticker]['profit'] for ticker in tickers])
    model_dict[i]['all profits'] = [ticker_profs_model[ticker]['profits'] for ticker in tickers]
    model_dict[i]['all losses'] = [ticker_profs_model[ticker]['losses'] for ticker in tickers]
    model_dict[i]['all trained models'] = [ticker_profs_model[ticker]['trained model'] for ticker in tickers]
    model_dict[i]['all start prices'] = [ticker_profs_model[ticker]['start prices'] for ticker in tickers]
    model_dict[i]['all predicted prices'] = [ticker_profs_model[ticker]['predicted prices'] for ticker in tickers]
    model_dict[i]['all actual prices'] = [ticker_profs_model[ticker]['actual prices'] for ticker in tickers]
    model_dict[i]['total break even trades'] = [ticker_profs_model[ticker]['break even trades'] for ticker in tickers]

  return model_dict

In [31]:
'''
# Now simulating this many times and seeing how much overall profit we make on average
# Note - running this with 10 iterations takes about 10 minutes; currently very inefficient look for more efficient solutions
iterations = 1
iteration_profits = np.zeros(iterations)
per_ticker_profits = {
    ticker: 0 for ticker in tickers
}

overall_model_profits = {
    i: 0 for i in range(len(list_of_models))
}

for i in range(iterations):
  model_dict = model_statistics(tickers, ticker_feature_dfs)
  for j in range(len(list_of_models)):
    overall_model_profits[j] += model_dict[j]['overall profit']/iterations

for i, profit in overall_model_profits.items():
  print(f"Model {i}: {profit}")

'''

'''for j in range(len(list_of_models)):
  model = choose_model(j)
  overall_profits = 0
  for i in range(iterations):
    ticker_profs = run_profits(tickers, ticker_feature_dfs, model)
    iteration_profits[i] = sum(ticker_profs.values())
    for ticker, profit in ticker_profs.items():
      per_ticker_profits[ticker] += profit
    for ticker, profit in ticker_profs.items():
      overall_profits += profit/iterations


  model_profits[j] = overall_profits
#  print(f"{ticker}: {profit/iterations}")
for model, profit in model_profits.items():
  print(f"{model}: {profit}")


# Alternatively, we can find the average across a larger number of timestamps for a specific model

model = choose_model(4)
overall_profits = 0
iterations = 100
for i in range(iterations):
  ticker_profs = run_profits(tickers, ticker_feature_dfs, model)
  iteration_profits[i] = sum(ticker_profs.values())
  for ticker, profit in ticker_profs.items():
    per_ticker_profits[ticker] += profit
    overall_profits += profit/iterations
for ticker, profit in per_ticker_profits.items():
  print(f"{ticker}: {profit/iterations}")
'''

'for j in range(len(list_of_models)):\n  model = choose_model(j)\n  overall_profits = 0\n  for i in range(iterations):\n    ticker_profs = run_profits(tickers, ticker_feature_dfs, model)\n    iteration_profits[i] = sum(ticker_profs.values())\n    for ticker, profit in ticker_profs.items():\n      per_ticker_profits[ticker] += profit\n    for ticker, profit in ticker_profs.items():\n      overall_profits += profit/iterations\n\n\n  model_profits[j] = overall_profits\n#  print(f"{ticker}: {profit/iterations}")\nfor model, profit in model_profits.items():\n  print(f"{model}: {profit}")\n\n\n# Alternatively, we can find the average across a larger number of timestamps for a specific model\n\nmodel = choose_model(4)\noverall_profits = 0\niterations = 100\nfor i in range(iterations):\n  ticker_profs = run_profits(tickers, ticker_feature_dfs, model)\n  iteration_profits[i] = sum(ticker_profs.values())\n  for ticker, profit in ticker_profs.items():\n    per_ticker_profits[ticker] += profit\n  

In [32]:
model_dict = model_statistics(tickers, ticker_feature_dfs)

# Hard coding in our dictionary so that we can make running time of compare_models shorter

In [33]:
# Now, we want to compare different ML models, using a range of metrics

def compare_models():
# model_dict = model_statistics(tickers, ticker_feature_dfs)
  comparison_dict = {
      i: {
          'model type': None,
          # 'trained ticker models': None,

          # Trading profit performance metrics
          'number of trades': None,
          'PnL': None,
          'win rate': None,
          'average profit per trade': None,
          'profit standard deviation': None,
          'profit factor': None,
          'average profit per successful trade': None,
          'average loss per unsuccessful trade': None,
          'max profit': None,
          'max loss': None,
          'sharpe ratio': None,

          # Prediction accuracy metrics
          'mae': None,
          'rmse': None,
          'r2': None,
          'correlation': None,
          'mape': None,
          'explained variance': None,

          # Using more general classification accuracy metrics
          'accuracy': None,
          'precision': None,
          'recall': None,
          'f1 score': None,

          # Can also add other metrics to do with model specifically, e.g. how hyperparameters or random seed affect performance



      }
      for i in range(len(list_of_models))
  }
  for i in range(len(list_of_models)):

    # Profit metrics
    profits = model_dict[i]['all profits']
    all_profits = np.concatenate(profits)
    losses = model_dict[i]['all losses']
    all_losses = np.concatenate(losses)

    num_of_wins = len(all_profits)
    num_of_losses = len(all_losses)
    num_of_break_evens = sum(model_dict[i]['total break even trades'])
    num_of_trades = num_of_wins + num_of_losses + num_of_break_evens

    total_profits = sum(all_profits)
    total_losses = sum(all_losses)
    total_PnL = total_profits - total_losses
    win_rate = num_of_wins/num_of_trades
    average_PnL = total_PnL/num_of_trades
    profit_std = np.std(np.concatenate((all_profits,-all_losses)))
    profit_factor = total_profits/total_losses
    average_profit_per_successful_trade = total_profits/num_of_wins
    average_loss_per_unsuccessful_trade = total_losses/num_of_losses
    max_profit = max(all_profits)
    max_loss = max(all_losses)
    sharpe_ratio = average_PnL/profit_std

    # Prediction accuracy metrics
    y_test = np.concatenate(model_dict[i]['all actual prices'])
    y_pred = np.concatenate(model_dict[i]['all predicted prices'])
    y_start = np.concatenate(model_dict[i]['all start prices'])
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    correlation = np.corrcoef(y_test, y_pred)[0,1]
    mape = mean_absolute_percentage_error(y_test, y_pred)
    explained_variance = explained_variance_score(y_test, y_pred)

    # Metrics for assessing price movement (up/down) - use positive to mean price increased
    TP = 0
    FP = 0
    TN = 0
    FN = 0
    for j in range(len(y_test)):
      if y_pred[j] > y_start[j]:
        if y_test[j] > y_start[j]:
          TP += 1
        else:
          FP += 1
      else:
        if y_test[j] < y_start[j]:
          TN += 1
        else:
          FN += 1
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1_score = 2 * (precision * recall) / (precision + recall)

    comparison_dict[i]['model type'] = model_dict[i]['model type']
#   comparison_dict[i]['trained ticker models'] = model_dict[i]['all trained models']     -     Commented out to make dictionary more readable
    comparison_dict[i]['number of trades'] = num_of_trades
    comparison_dict[i]['PnL'] = total_PnL
    comparison_dict[i]['win rate'] = win_rate
    comparison_dict[i]['average profit per trade'] = average_PnL
    comparison_dict[i]['profit standard deviation'] = profit_std
    comparison_dict[i]['profit factor'] = profit_factor
    comparison_dict[i]['average profit per successful trade'] = average_profit_per_successful_trade
    comparison_dict[i]['average loss per unsuccessful trade'] = average_loss_per_unsuccessful_trade
    comparison_dict[i]['max profit'] = max_profit
    comparison_dict[i]['max loss'] = max_loss
    comparison_dict[i]['sharpe ratio'] = sharpe_ratio

    comparison_dict[i]['mae'] = mae
    comparison_dict[i]['rmse'] = rmse
    comparison_dict[i]['r2'] = r2
    comparison_dict[i]['correlation'] = correlation
    comparison_dict[i]['mape'] = mape
    comparison_dict[i]['explained variance'] = explained_variance

    comparison_dict[i]['accuracy'] = accuracy
    comparison_dict[i]['precision'] = precision
    comparison_dict[i]['recall'] = recall
    comparison_dict[i]['f1 score'] = f1_score

  return comparison_dict





In [34]:
def format_floats(obj, decimals=2):
    if isinstance(obj, dict):
        return {k: format_floats(v, decimals) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [format_floats(item, decimals) for item in obj]
    elif isinstance(obj, (float, np.floating)):
        return f"{obj:.{decimals}f}"
    else:
        return obj


In [35]:
returned = format_floats(compare_models())
for item, value in returned.items():
  print(f"{item}: {value}")

0: {'model type': 'lin_reg   ', 'number of trades': 3063, 'PnL': '47.15', 'win rate': '0.51', 'average profit per trade': '0.02', 'profit standard deviation': '3.77', 'profit factor': '1.01', 'average profit per successful trade': '2.05', 'average loss per unsuccessful trade': '2.14', 'max profit': '39.73', 'max loss': '40.52', 'sharpe ratio': '0.00', 'mae': '2.12', 'rmse': '3.81', 'r2': '1.00', 'correlation': '1.00', 'mape': '0.01', 'explained variance': '1.00', 'accuracy': '0.51', 'precision': '0.52', 'recall': '0.57', 'f1 score': '0.55'}
1: {'model type': 'rf        ', 'number of trades': 3051, 'PnL': '334.93', 'win rate': '0.52', 'average profit per trade': '0.11', 'profit standard deviation': '3.80', 'profit factor': '1.11', 'average profit per successful trade': '2.15', 'average loss per unsuccessful trade': '2.13', 'max profit': '49.94', 'max loss': '28.78', 'sharpe ratio': '0.03', 'mae': '2.31', 'rmse': '4.05', 'r2': '1.00', 'correlation': '1.00', 'mape': '0.01', 'explained var

Having issue with data volatility - results of our predictions seem to vary massively based on what our train test split is. Maybe because sometimes the actual test data is not so good as opposed to our model being bad, but unsure

See what fields in model_dict I'm using for comparison_dict, and which ones I can get rid of / implement more effectively

In [36]:
# running the models on the last 30 days of test data

closing_prices = pd.DataFrame()
for ticker in tickers:
  data = yf.download(ticker, start = end_date, end = today)
  closing_prices[ticker] = data["Close"]

# Now cleaning the data so that we can operate on it successfully

closing_prices = closing_prices.dropna(how='any')
closing_prices = closing_prices.drop_duplicates()

  data = yf.download(ticker, start = end_date, end = today)
[*********************100%***********************]  1 of 1 completed
  data = yf.download(ticker, start = end_date, end = today)
[*********************100%***********************]  1 of 1 completed
  data = yf.download(ticker, start = end_date, end = today)
[*********************100%***********************]  1 of 1 completed
  data = yf.download(ticker, start = end_date, end = today)
[*********************100%***********************]  1 of 1 completed
  data = yf.download(ticker, start = end_date, end = today)
[*********************100%***********************]  1 of 1 completed
  data = yf.download(ticker, start = end_date, end = today)
[*********************100%***********************]  1 of 1 completed
  data = yf.download(ticker, start = end_date, end = today)
[*********************100%***********************]  1 of 1 completed
  data = yf.download(ticker, start = end_date, end = today)
[*********************100%***********

In [37]:
ticker_feature_dfs_predict = {}

for ticker in tickers:
  # Download price data
  data = yf.download(ticker, start=end_date, end=yesterday)

  df = pd.DataFrame(index=data.index)
  df['close'] = data['Close']

  # Returns & lags
  df['pct_returns'] = 100 * (df['close'] - df['close'].shift(1)) / df['close'].shift(1)
  df['lag_1'] = df['pct_returns'].shift(1)
  df['lag_5'] = df['pct_returns'].shift(5)

  # Moving Averages
  df['sma_5'] = df['close'].rolling(window=5).mean()
  df['sma_10'] = df['close'].rolling(window=10).mean()
  df['sma_20'] = df['close'].rolling(window=20).mean()

  # Volatility
  df['volatility_10'] = df['pct_returns'].rolling(window=10).std()
  df['volatility_20'] = df['pct_returns'].rolling(window=20).std()

  # Exponential Moving Averages
  df['ema_10'] = trend.EMAIndicator(close=df['close'], window=10).ema_indicator()
  df['ema_20'] = trend.EMAIndicator(close=df['close'], window=20).ema_indicator()

  # MACD
  macd = trend.MACD(close=df['close'])
  df['macd'] = macd.macd()
  df['macd_signal'] = macd.macd_signal()
  df['macd_diff'] = macd.macd_diff()

  # RSI
  df['rsi'] = momentum.RSIIndicator(close=df['close'], window=14).rsi()

  # ADX
  high = pd.Series(data['High'].values.flatten(), index=data.index)
  low = pd.Series(data['Low'].values.flatten(), index=data.index)
  close = pd.Series(data['Close'].values.flatten(), index=data.index)
  adx = trend.ADXIndicator(high, low, close)
  df['adx'] = adx.adx()

  # Drop NaNs (from indicators with lags/rolling)
  df = df.dropna()

  # Scale features
  features = df.copy()
  scaler = ticker_feature_dfs[ticker]['scaler']
  scaled_features = scaler.fit_transform(features)

  df_scaled = pd.DataFrame(scaled_features, index=features.index, columns=features.columns).dropna()
  # Store in dictionary
  ticker_feature_dfs_predict[ticker] = {
    'features': df_scaled,
    'raw features': df,
    'scaler': scaler,
    'mean': df['close'].mean(),
    'std': df['close'].std()
  }





  data = yf.download(ticker, start=end_date, end=yesterday)
[*********************100%***********************]  1 of 1 completed
  data = yf.download(ticker, start=end_date, end=yesterday)
[*********************100%***********************]  1 of 1 completed
  data = yf.download(ticker, start=end_date, end=yesterday)
[*********************100%***********************]  1 of 1 completed
  data = yf.download(ticker, start=end_date, end=yesterday)
[*********************100%***********************]  1 of 1 completed
  data = yf.download(ticker, start=end_date, end=yesterday)
[*********************100%***********************]  1 of 1 completed
  data = yf.download(ticker, start=end_date, end=yesterday)
[*********************100%***********************]  1 of 1 completed
  data = yf.download(ticker, start=end_date, end=yesterday)
[*********************100%***********************]  1 of 1 completed
  data = yf.download(ticker, start=end_date, end=yesterday)
[*********************100%***********

In [38]:
# Ensure end_date is a datetime object
end_date = pd.to_datetime(end_date)

# Filter each ticker's data to only include rows from end_date onwards
for ticker in ticker_feature_dfs_predict:
    ticker_feature_dfs_predict[ticker]['features'] = ticker_feature_dfs_predict[ticker]['features'].loc[end_date:]
    ticker_feature_dfs_predict[ticker]['raw features'] = ticker_feature_dfs_predict[ticker]['raw features'].loc[end_date:]


In [39]:
# Running each model from model_dict on our predict data and seeing what profits we get

def predict_profits(ticker_feature_dfs_predict, model_index):
  overall_profit = 0
  for i in range(len(tickers)):
    model = model_dict[model_index]['all trained models'][i]
    preds = model.predict(ticker_feature_dfs_predict[tickers[i]]['features'])
    X_test = ticker_feature_dfs_predict[tickers[i]]['features']
    raw_df = ticker_feature_dfs_predict[tickers[i]]['raw features']
    profit, _, _, _ = simple_trading(preds, X_test, raw_df)
    overall_profit += profit
  return overall_profit

In [40]:
for j in range(len(list_of_models)):
  model_type = model_names[j]
  overall_profit = predict_profits(ticker_feature_dfs_predict, j)
  print(f"{model_type} made {overall_profit}")

lin_reg    made -90.98705863952637
rf         made -90.98705863952637
grad_boost made -90.98705863952637
mlp        made -90.98705863952637
knn        made -90.98705863952637


Conclusion so far:

We have that the 5 different models have the exact same PnL over the last 60 days, which suggests they make the exact same trades. This greatly differs from our test data from the train test split, where KNN often makes profits in the hundreds while linear regression makes small losses. This might suggest that, because of how KNN works fundamentally, the next day closing prices on the test data can be somewhat inferred from the training data.
Ultimately, so far all the models perform identically on the last 60 days, so to further learn about model performance I will run tests over longer time periods, and try to understand where our training data is less useful. This result indicates the intricacies of building a stock price predictor off pure historical data, so I will also try testing this on other backtesters, such as random walks for stock prices
