<a href="https://colab.research.google.com/github/Billy-Drunkenstein/MAFN/blob/main/Spring%202025/Machine%20Learning%20for%20Finance/Homework%202.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Context of the problem**

The goal is to predict S&P 500 daily movements using three market indicators:

The Volatility Index (VIX), which measures expected market volatility and is often read as a gauge of market sentiment.

Crude Oil Futures, which can serve as a proxy for global economic activity.

Gold Futures, often used as a safe-haven asset during times of market stress.

The Yahoo Finance tickers for the S&P 500 and these three indicators are:

^GSPC = S&P 500

^VIX = Volatility index

CL=F = Crude oil futures

GC=F = Gold futures

In [2]:
Tickers = {
    '^GSPC': 'S&P 500',
    '^VIX': 'Vol Index',
    'CL=F': 'Crude Oil Futures',
    'GC=F': 'Gold Futures',
}

**(0 points) Prerequisite - run the code to import the necessary modules**

In [4]:
import yfinance as yf
import pandas as pd
import numpy as np
import datetime

from sqlalchemy import create_engine
# create_engine creates a connection (engine) to an SQLite database.

import sqlite3
# sqlite3 is imported as well, although in this code the SQLAlchemy engine is used to handle database I/O.

from sklearn.metrics import accuracy_score, classification_report
# sklearn.metrics provides functions such as accuracy_score and classification_report to evaluate model performance.

from sklearn.linear_model import LogisticRegression, SGDClassifier
# sklearn.linear_model provides linear models. LogisticRegression fits classification models with various regularization methods. SGDClassifier is a stochastic gradient descent classifier that can be trained incrementally (allowing for iteration callbacks).

from sklearn.neighbors import KNeighborsClassifier
# KNeighborsClassifier (for k-nearest neighbors)

from sklearn.svm import SVC
# SVC (for support vector machines)

from sklearn.tree import DecisionTreeClassifier
# DecisionTreeClassifier (for decision trees)

from sklearn.ensemble import RandomForestClassifier
# RandomForestClassifier (for random forests)

**(0 points) Prerequisite - how to download data from Yahoo Finance without a premium subscription**

*Option 1: Excel*

1) In a browser, go to finance.yahoo.com and enter the stock ticker in the search box

2) On the left, select HISTORICAL DATA

3) For the period, select MAX

4) SELECT ALL from the table, then COPY

5) Open a new sheet in Excel and select PASTE AS HTML

6) Delete any extra data from the Excel sheet

7) Import the data from the Excel file

*Option 2: Yahoo Finance workaround*

Run this Python code to download the data.

!pip install yfinance openpyxl

-------------------------

msft_data = yf.download("MSFT", start="2014-01-24", end="2025-01-15", interval="1d", auto_adjust=False)

# Restore the canonical name if the column arrived as lower-case "adj close"
if "Adj Close" not in msft_data.columns and "adj close" in msft_data.columns.str.lower():
    msft_data.rename(columns={"adj close": "Adj Close"}, inplace=True)

file_path = "MSFT_Historical_Data.xlsx"
msft_data.to_excel(file_path, engine="openpyxl")

**Task 1 - Data preparation and exploration**



(10 points) Task 1.1 - Create a function to download data via API. Hint: Data will be downloaded from Yahoo Finance

In [5]:
def download_data(tickers, start_date, end_date):
    """
    Downloads historical 'Adj Close' or 'Close' data for the given tickers from Yahoo Finance.
    Handles cases where 'Adj Close' is missing.
    """
    data_dict = {}

    for ticker in tickers:
        data = yf.download(ticker, start = start_date, end = end_date,
                           interval = '1d', auto_adjust = False, progress = False)

        # Prefer 'Adj Close'; fall back to 'Close' when it is missing
        if 'Adj Close' in data.columns:
            price = data['Adj Close'].rename(columns = {'Adj Close' : ticker})
            print(ticker, 'Adj Close')

        else:
            price = data['Close'].rename(columns = {'Close' : ticker})
            print(ticker, 'Close')

        # Store the price column under its ticker
        data_dict[ticker] = price

    # Assemble DataFrame
    price_df = pd.concat(data_dict.values(), axis = 1, join = 'inner')
    price_df.index.name = None
    price_df.columns.name = None

    return price_df

In [6]:
start = '2014-01-24'
end = '2025-01-15'

Data = download_data(Tickers, start, end)

^GSPC Adj Close
^VIX Adj Close
CL=F Adj Close
GC=F Adj Close


(10 points) Task 1.2 - Create a function to clean the data. Hint: Drop the rows with missing values.

Calculate the daily percentage change (return) for the S&P 500. Create a binary target variable: 1 if the next day's S&P 500 return > 0, else 0.


In [7]:
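def clean_data(df):
    """
    Drops rows with missing values.
    Calculates the daily S&P 500 return.
    Creates the next-day binary target variable.
    Note: modifies df in place and returns it.
    """

    df.dropna(inplace = True)

    # Daily return
    df['Return'] = df['^GSPC'].pct_change()

    # Boolean target: True if the next day's return is positive
    df['Target'] = (df['Return'] > 0).shift(-1)

    # Drop the first row (no return) and the final row (no next-day target)
    df.dropna(inplace = True)

    return df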

In [8]:
Data = clean_data(Data)

print(Data.isna().sum().sum())
Data.head()

0


| | ^GSPC | ^VIX | CL=F | GC=F | Return | Target |
|---|---|---|---|---|---|---|
| 2014-01-27 | 1781.560059 | 17.42 | 95.720001 | 1263.599976 | -0.004876 | True |
| 2014-01-28 | 1792.5 | 15.8 | 97.410004 | 1251.0 | 0.006141 | False |
| 2014-01-29 | 1774.199951 | 17.35 | 97.360001 | 1262.199951 | -0.010209 | True |
| 2014-01-30 | 1794.189941 | 17.290001 | 98.230003 | 1242.199951 | 0.011267 | False |
| 2014-01-31 | 1782.589966 | 18.41 | 97.489998 | 1240.099976 | -0.006465 | False |
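
To see how the shift lines each row up with the following day's move, here is a minimal sketch on made-up prices. The five values and the dates are illustrative only, not market data, and the final 0/1 encoding is an optional extra step that `clean_data` itself does not perform.

```python
import pandas as pd

# Made-up closing prices for five consecutive business days (illustrative only)
toy = pd.DataFrame(
    {'^GSPC': [100.0, 101.0, 100.0, 102.0, 101.0]},
    index=pd.bdate_range('2024-01-01', periods=5),
)

# Same-day return: NaN on the first row because there is no previous close
toy['Return'] = toy['^GSPC'].pct_change()

# Next-day direction: shift(-1) moves tomorrow's flag onto today's row,
# which leaves NaN on the last row
toy['Target'] = (toy['Return'] > 0).shift(-1)

# Drop the first and last rows, then encode True/False as 1/0
toy = toy.dropna().assign(Target=lambda d: d['Target'].astype(int))

print(toy)
```

Each remaining row pairs that day's return with the direction of the next day's return, which is the alignment a next-day classifier needs.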


(10 points) Task 1.3 - Create a function to save the data to a local database

In [None]:
def save_to_db(df, db_name='market_data.db', table_name='market_data'):
    return
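
One possible shape for this function is sketched below. It is not the assignment's reference solution. It assumes the standard-library `sqlite3` connection (which pandas accepts directly) rather than the SQLAlchemy engine imported earlier; an engine from `create_engine(f'sqlite:///{db_name}')` could be passed to `to_sql` in the same way. The `Date` index label, the `load_from_db` helper, and the demo values are hypothetical names and data chosen for illustration.

```python
import os
import sqlite3
import tempfile

import pandas as pd


def save_to_db(df, db_name='market_data.db', table_name='market_data'):
    """Write df to an SQLite table, replacing any existing table of that name."""
    conn = sqlite3.connect(db_name)
    try:
        # The date index is written as a column labelled 'Date' (assumed name)
        df.to_sql(table_name, conn, if_exists='replace', index_label='Date')
    finally:
        conn.close()


def load_from_db(db_name='market_data.db', table_name='market_data'):
    """Hypothetical helper: read the table back with the dates as the index."""
    conn = sqlite3.connect(db_name)
    try:
        return pd.read_sql(f'SELECT * FROM "{table_name}"', conn,
                           index_col='Date', parse_dates=['Date'])
    finally:
        conn.close()


# Round trip on two made-up rows, written to a temporary file
demo = pd.DataFrame(
    {'^GSPC': [4700.0, 4710.5], 'Target': [1, 0]},
    index=pd.to_datetime(['2024-01-02', '2024-01-03']),
)
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.db')
save_to_db(demo, db_name=demo_path)
restored = load_from_db(db_name=demo_path)
print(restored)
```

Using `if_exists='replace'` makes repeated runs idempotent; `'append'` would be the choice if each run should only add new rows.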

**(10 points) Task 2 - Create a function to split the data into Train, Test, and Validation Sets**. Hint: Ensure the data is sorted by date first.

In [None]:
def split_data(df, feature_columns, target_column):

    return
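
A minimal sketch of a chronological split follows. The 70/15/15 proportions and the `train_frac` and `val_frac` parameters are assumptions, since the task does not specify them. Rows are sorted by date and sliced in order so that no future observation ends up in the training set.

```python
import numpy as np
import pandas as pd


def split_data(df, feature_columns, target_column, train_frac=0.70, val_frac=0.15):
    """Chronological train/validation/test split (no shuffling)."""
    df = df.sort_index()                      # oldest rows first
    n = len(df)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)

    X = df[feature_columns]
    y = df[target_column].astype(int)         # True/False -> 1/0
    return (X.iloc[:train_end], y.iloc[:train_end],
            X.iloc[train_end:val_end], y.iloc[train_end:val_end],
            X.iloc[val_end:], y.iloc[val_end:])


# Synthetic frame, shuffled on purpose to show that date order is restored
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {'^VIX': rng.normal(size=100), 'Target': rng.integers(0, 2, size=100)},
    index=pd.bdate_range('2024-01-01', periods=100),
).sample(frac=1.0, random_state=0)

X_train, y_train, X_val, y_val, X_test, y_test = split_data(demo, ['^VIX'], 'Target')
print(len(X_train), len(X_val), len(X_test))
```

A random split such as `train_test_split` with shuffling would mix later dates into the training data, which is why the slices here follow calendar order.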

**(10 points) Task 3.1 - Create a function to analyse the data by applying loss functions and regularisation methods such as ridge and lasso regularisation**

In [None]:
def logistic_regression_models(X_train, y_train, X_val, y_val):
    return
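
The sketch below shows one way this could look, under stated assumptions: ridge and lasso are taken to mean the L2 and L1 penalties of `LogisticRegression`, the `liblinear` solver is used because it supports both, features are standardised first, and log loss on the validation set is the reported loss. The `C` parameter and the synthetic data are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def logistic_regression_models(X_train, y_train, X_val, y_val, C=1.0):
    """Fit L2- and L1-penalised logistic regressions and score them on validation data."""
    penalties = {'ridge (L2)': 'l2', 'lasso (L1)': 'l1'}
    results = {}
    for name, penalty in penalties.items():
        # Scaling matters because the penalty acts on raw coefficient sizes
        model = make_pipeline(
            StandardScaler(),
            LogisticRegression(penalty=penalty, C=C, solver='liblinear'),
        )
        model.fit(X_train, y_train)
        proba = model.predict_proba(X_val)
        results[name] = {
            'model': model,
            'val_log_loss': log_loss(y_val, proba, labels=model.classes_),
            'val_accuracy': accuracy_score(y_val, model.predict(X_val)),
        }
    return results


# Synthetic data: the label depends mostly on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

demo_results = logistic_regression_models(X[:200], y[:200], X[200:], y[200:])
for name, res in demo_results.items():
    print(f"{name}: log loss = {res['val_log_loss']:.3f}, accuracy = {res['val_accuracy']:.3f}")
```

Smaller values of `C` mean stronger regularisation; with the L1 penalty, sufficiently small values can drive some coefficients to exactly zero.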

**(15 points) Task 4 - Create an iteration callback function to print information for troubleshooting**

During each epoch, the callback should print:
A. The current epoch number.
B. The loss value for that epoch.
C. The training accuracy for that epoch.

Comment on how the loss evolves over the epochs and whether this trend indicates proper convergence.

In [None]:
def sgd_classifier_with_callback(X_train, y_train, n_epochs=10):

    return

**(10 points) Task 5 - Create a function to explore classification algorithms in the following order: Nearest Neighbor, SVM, Decision Trees and Random Forest**.

Using the features from your dataset, your goal is to predict whether the S&P 500’s next-day return will be positive (1) or negative (0).

For each algorithm, compute and report accuracy on the test set and a detailed classification report (including precision, recall, and F1-score).

Hint: This can all be done within a single function, using a for loop.

In [None]:
def explore_classification_algorithms(X_train, y_train, X_test, y_test):
    return
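
The loop described in the hint could be sketched as follows. The hyperparameters (`n_neighbors=5`, an RBF kernel, `max_depth=5`, `n_estimators=100`) and the scaling step for the distance-based models are assumptions, not values given in the task, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def explore_classification_algorithms(X_train, y_train, X_test, y_test):
    """Fit each classifier in the requested order and report test metrics."""
    # Dicts keep insertion order, so the models run in the order listed
    models = {
        'Nearest Neighbor': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
        'SVM': make_pipeline(StandardScaler(), SVC(kernel='rbf')),
        'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=0),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        results[name] = accuracy_score(y_test, y_pred)
        print(f"=== {name} ===")
        print(f"Test accuracy: {results[name]:.4f}")
        # Precision, recall and F1-score per class
        print(classification_report(y_test, y_pred, zero_division=0))
    return results


# Synthetic data for a quick run
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)
demo_accuracy = explore_classification_algorithms(X[:200], y[:200], X[200:], y[200:])
```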

**(15 points) Task 6 - Create a simple data pipeline, which can be a scheduled script.** Hint: This serves as the main entry point that strings together all of the previous functions to form a complete data pipeline. This should be scheduled at regular intervals.

In [None]:
def run_pipeline():
    return
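
As a rough illustration of the orchestration pattern only, the sketch below chains stages in order and repeats the run at a fixed interval. It departs from the stub's signature: the `stages` argument, the `run_on_schedule` helper, and the stand-in lambdas are all hypothetical. In the notebook, the stages would wrap functions such as `download_data`, `clean_data`, and `save_to_db`, and an operating-system scheduler such as cron could call the script instead of the in-process loop shown here.

```python
import time
from datetime import datetime


def run_pipeline(stages, state=None):
    """Run each (name, function) stage in order, passing the result forward."""
    for name, stage in stages:
        print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] running {name}")
        state = stage(state)
    return state


def run_on_schedule(job, interval_seconds, max_runs):
    """Call `job` max_runs times, sleeping interval_seconds between runs."""
    results = []
    for run in range(max_runs):
        results.append(job())
        if run < max_runs - 1:
            time.sleep(interval_seconds)
    return results


# Stand-in stages operating on made-up prices (no network access needed)
demo_stages = [
    ('download', lambda _: [100.0, 101.0, 99.5, 102.0]),
    ('clean', lambda prices: [b / a - 1 for a, b in zip(prices, prices[1:])]),
    ('label', lambda returns: [int(r > 0) for r in returns]),
]

demo_results = run_on_schedule(
    lambda: run_pipeline(demo_stages), interval_seconds=0.01, max_runs=2
)
print(demo_results)
```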

**(10 points) Task 7 - What does the accuracy of each method used seem to be? What does this tell us both about the data that we used to train our prediction model and the predictive techniques which we tried to implement?**
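
When interpreting the accuracies, one possibly useful reference point is a naive benchmark that always predicts the most common training label. The helper below is a hypothetical sketch of that comparison on made-up labels; it is not part of the assignment's required functions.

```python
import numpy as np


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most common training label."""
    y_train = np.asarray(y_train).astype(int)
    y_test = np.asarray(y_test).astype(int)
    majority = int(np.bincount(y_train).argmax())
    return float(np.mean(y_test == majority))


# Made-up labels: 1 = next-day return positive, 0 = otherwise
demo_train = [1, 1, 0, 1, 0, 1]
demo_test = [1, 0, 1, 1]
baseline = majority_baseline_accuracy(demo_train, demo_test)
print(baseline)
```

A model whose test accuracy is close to this figure has learned little beyond the class balance of the data.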