# Overview

Hello, this ML project has been one of my passion projects for quite some time, and I am very happy to finally be able to implement my ideas. This field is vast and I have only just scratched the surface, so I am excited to learn more and build more projects! I intend for this to be a "living" project that I improve incrementally as my knowledge and understanding of machine learning grow.

## Options Pricing using Machine Learning Models

This project aims to predict options prices using various machine learning models, and I intend to implement and experiment with different algorithms. At the moment, it employs the K-Nearest Neighbors (KNN) algorithm for its prediction task. Furthermore, the project uses the Black-Scholes formula to calculate a theoretical price for each contract, which is intended as a feature for the machine learning model.
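
For reference, the Black-Scholes price of a European call can be written with the same symbols the code in this notebook uses ($S$ underlying price, $X$ strike, $T$ time to expiry in years, $r$ risk-free rate, $\sigma$ implied volatility):

$$
C = S\,N(d_1) - X e^{-rT} N(d_2), \qquad
d_1 = \frac{\ln(S/X) + \left(r + \tfrac{1}{2}\sigma^2\right) T}{\sigma\sqrt{T}}, \qquad
d_2 = d_1 - \sigma\sqrt{T}
$$

Here $N$ is the standard normal cumulative distribution function. The formula assumes European exercise and no dividends, so for SPY options it should be read as an approximation rather than an exact fair value.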

## Data Source and Features
I am using data fetched from Yahoo Finance using the yfinance library. The columns we are primarily interested in are:

- `strike`: The strike price
- `bs_price`: Theoretical option price calculated using the Black-Scholes formula
- `bid`: The bid price
- `ask`: The ask price
- `lastPrice`: The last traded price, which is our target variable

I also extract the expiry date of each contract from its symbol to obtain the 'days to expiry', which is used in the Black-Scholes formula.
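
As a rough sketch of the expiry extraction described above, the snippet below parses an OCC-style contract symbol (root, then YYMMDD, then C or P, then an 8-digit strike) with a regular expression. The helper names `parse_expiry` and `days_to_expiry`, the example symbol, and the reference date are all made up for illustration and are not part of the notebook.

```python
import re
from datetime import date

# Hypothetical helpers, not part of the notebook.
# OCC-style symbol layout: root + YYMMDD + C/P + 8-digit strike.
_OCC_PATTERN = re.compile(r"^([A-Z]+)(\d{2})(\d{2})(\d{2})([CP])(\d{8})$")


def parse_expiry(contract_symbol: str) -> date:
    match = _OCC_PATTERN.match(contract_symbol)
    if match is None:
        raise ValueError(f"Unrecognised contract symbol: {contract_symbol!r}")
    yy, mm, dd = (int(match.group(i)) for i in (2, 3, 4))
    return date(2000 + yy, mm, dd)


def days_to_expiry(contract_symbol: str, today: date) -> int:
    # Calendar days between `today` and the expiry date
    return (parse_expiry(contract_symbol) - today).days


# Made-up example symbol and reference date, for illustration only
print(days_to_expiry("SPY240621C00500000", date(2024, 6, 1)))  # 20
```

Matching on the pattern rather than on fixed character positions means the same helper would also work for tickers that are not exactly three letters long.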

## Data Preprocessing

We start by cleaning the data by dropping any rows with missing values. We then split the data into the features and the target variable, X and y respectively. The data is split into 80% training and 20% testing. Before applying the machine learning models, the feature variables are standardized (zero mean, unit variance) using StandardScaler, which is fitted on the training split only.
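
To make the standardization step concrete, here is a minimal pure-Python sketch of the same idea: the mean and population standard deviation are learned from the training values only and then applied to both splits. The strike values are made up, and `standardize` is a hypothetical helper rather than anything from scikit-learn.

```python
from statistics import fmean, pstdev

# Made-up strike prices, for illustration only
train_strikes = [400.0, 410.0, 420.0, 430.0]
test_strikes = [415.0, 440.0]

# "Fit": learn the mean and population standard deviation from the training split only
mu = fmean(train_strikes)
sigma = pstdev(train_strikes)


def standardize(values):
    # "Transform": apply the training statistics to any split
    return [(v - mu) / sigma for v in values]


train_scaled = standardize(train_strikes)
test_scaled = standardize(test_strikes)

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics on the test split keeps information about the test data from leaking into the model, which matters for a distance-based method like KNN.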

## Model Training and Evaluation

We use a grid search (GridSearchCV) to tune the hyperparameters of the KNN model. The grid covers the number of neighbors, the weighting scheme, and the distance measure. Each combination is evaluated using k-fold cross-validation (20 folds in the grid search, 10 folds when comparing KNN against other models) and scored using negative mean squared error (MSE).
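
The idea behind the grid search can be sketched without scikit-learn: try each candidate value, estimate its error with k-fold cross-validation, and keep the value with the lowest mean squared error. Everything below (the toy data, `knn_predict`, `cv_mse`, and the single-feature setup) is a simplified assumption for illustration, not the notebook's implementation. scikit-learn selects the highest score, which is why the notebook uses the negated MSE.

```python
from statistics import fmean


def knn_predict(train_x, train_y, query, k):
    # Average the targets of the k training points closest to `query`
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return fmean([train_y[i] for i in order[:k]])


def cv_mse(xs, ys, k, n_folds):
    # Fold f holds out every n_folds-th sample, starting at index f
    fold_scores = []
    for f in range(n_folds):
        test_idx = set(range(f, len(xs), n_folds))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        errors = [(knn_predict(train_x, train_y, xs[i], k) - ys[i]) ** 2 for i in test_idx]
        fold_scores.append(fmean(errors))
    return fmean(fold_scores)


# Made-up data: the target is a smooth function of a single feature
xs = [float(i) for i in range(20)]
ys = [0.5 * x * x for x in xs]

# "Grid" of candidate values for one hyperparameter
candidates = [1, 2, 3, 5, 8]
scores = {k: cv_mse(xs, ys, k, n_folds=5) for k in candidates}
best_k = min(scores, key=scores.get)

print(scores)
print("best k:", best_k)
```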

Then, after finding the best parameters, we use the best model to predict options prices on the test set. The model's performance is evaluated using several error metrics: Mean Squared Error, Root Mean Squared Error, Mean Absolute Error, Mean Absolute Percentage Error, Median Absolute Error, and R-squared.
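
For readers who want to see how these metrics are defined, the following sketch computes each of them by hand on a few made-up prices. The numbers are purely illustrative and are not results from the model.

```python
from math import sqrt
from statistics import fmean, median

# Made-up actual and predicted prices, for illustration only
actual = [2.0, 4.0, 5.0, 10.0]
predicted = [2.5, 3.5, 5.0, 8.0]

errors = [a - p for a, p in zip(actual, predicted)]
abs_errors = [abs(e) for e in errors]

mse = fmean([e ** 2 for e in errors])          # Mean Squared Error
rmse = sqrt(mse)                               # Root Mean Squared Error
mae = fmean(abs_errors)                        # Mean Absolute Error
medae = median(abs_errors)                     # Median Absolute Error
# Mean Absolute Percentage Error (undefined if any actual value is zero)
mape = fmean([ae / abs(a) for ae, a in zip(abs_errors, actual)]) * 100

# R-squared: 1 - (residual sum of squares / total sum of squares)
mean_actual = fmean(actual)
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, mae, medae, mape, r2)
```

The median absolute error is less sensitive to a few badly mispriced contracts than MSE, which is one reason to report both.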

## Data Visualization

We use Matplotlib and Seaborn to visualize the data and to compare the model's predictions against the actual values. This provides an intuitive understanding of the performance of our model. We also visualize the errors of the predictions in order to better understand the model's performance.

## Future Improvements

There is a lot of room for growth in this project. As I mentioned earlier, I intend for this to be a living project that is incrementally improved as my understanding of the concepts of machine learning grows. Potential future improvements could include using more sophisticated models, adding more features, working with a larger dataset, and adding time series forecasting models.

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import datetime
import yfinance as yf
import scipy.stats as si
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
import seaborn as sns



In [2]:
# Black-Scholes price of a European call option
def black_scholes(row):
    S = row['underlying_price']  # Price of the underlying
    X = row['strike']            # Strike price
    # Time to expiry in years; floored at one day so that contracts expiring
    # today (or already expired) do not produce a division by zero or a NaN
    T = max((row['expiry_date'] - datetime.datetime.now()).days, 1) / 365.0
    # Risk-free rate (assumed constant at 1%)
    r = 0.01
    σ = row['impliedVolatility']
    σ = max(1e-5, σ)  # Ensure σ is never zero by taking the maximum of σ and a small number

    d1 = (np.log(S / X) + (r + 0.5 * σ ** 2) * T) / (σ * np.sqrt(T))
    d2 = d1 - σ * np.sqrt(T)

    return S * si.norm.cdf(d1, 0.0, 1.0) - X * np.exp(-r * T) * si.norm.cdf(d2, 0.0, 1.0)

# Extract the expiry date from a Yahoo Finance contract symbol
# The slicing assumes a three-letter ticker such as SPY, followed by YYMMDD
def extract_expiry_date(contract_symbol):
    year = int(contract_symbol[3:5])
    month = int(contract_symbol[5:7])
    day = int(contract_symbol[7:9])
    return pd.Timestamp(year + 2000, month, day)
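
As a quick sanity check on the formula, here is a standalone scalar version that relies only on the standard library. The function names `norm_cdf` and `bs_call` and the input values are assumptions chosen for illustration; they are not taken from the dataset.

```python
from math import erf, exp, log, sqrt


def norm_cdf(x: float) -> float:
    # Standard normal CDF expressed through the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def bs_call(S: float, X: float, T: float, r: float, sigma: float) -> float:
    # Black-Scholes price of a European call (no dividends)
    d1 = (log(S / X) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - X * exp(-r * T) * norm_cdf(d2)


# Made-up inputs: at-the-money call, one year to expiry, 5% rate, 20% volatility
price = bs_call(S=100.0, X=100.0, T=1.0, r=0.05, sigma=0.2)
print(round(price, 2))  # 10.45
```

Comparing a hand-checked value like this against the DataFrame version is a cheap way to catch sign or unit mistakes, such as passing days instead of years for the time to expiry.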


In [3]:
# We are using the SPDR S&P 500 ETF (SPY) from Yahoo Finance for our dataset
ticker = yf.Ticker("SPY")
options = ticker.options
option_chain = ticker.option_chain(options[0])
calls = option_chain.calls

In [4]:
# Additional features
calls['expiry_date'] = calls['contractSymbol'].apply(extract_expiry_date)
calls['underlying_price'] = ticker.info['previousClose']
calls['bs_price'] = calls.apply(black_scholes, axis=1)

In [5]:
# Selecting the relevant features
# Note: bs_price is computed above but is not yet part of this feature list
# In the future, I will test and add more features and add more data
features = ['strike', 'bid', 'ask', 'lastPrice']
data = calls[features]

# Cleaning the data from NULL values
data = data.dropna()

# The model will be predicting last price
X = data.drop('lastPrice', axis=1)
y = data['lastPrice']


# 80% of data will be training, 20% will be testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Standardize the input features (the scaler is fitted on the training split only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [6]:
# We are using KNN to build our model
knn = KNeighborsRegressor(n_neighbors=5)

knn.fit(X_train_scaled, y_train)

# Finally, we are making predictions on the preprocessed test data
y_pred = knn.predict(X_test_scaled)

print(y_pred)

In [8]:
# Here we are tuning the model to find better hyperparameters
# Define the parameter values
k_range = list(range(1, 31))
weight_options = ['uniform', 'distance']
# Minkowski power parameter: p=1 is Manhattan distance, p=2 is Euclidean distance
p_values = [1, 2]

# Map the parameter names to the values that should be searched
param_grid = dict(n_neighbors=k_range, weights=weight_options, p=p_values)
grid = GridSearchCV(KNeighborsRegressor(), param_grid, cv=20, scoring='neg_mean_squared_error')

# simple_grid = GridSearchCV(KNeighborsRegressor(), {"n_neighbors": [5]}, cv=1)
# simple_grid.fit(X_train_scaled, y_train)

In [9]:
# Fit the grid with data
grid.fit(X_train_scaled, y_train)
print(grid.cv_results_)

# Examine the best model
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

{'mean_fit_time': array([0.00184293, 0.00031872, 0.00030203, 0.00029221, 0.00028415,
       0.0002686 , 0.00026484, 0.00025997, 0.00025067, 0.00024891,
       0.00024352, 0.00023446, 0.00023408, 0.00024266, 0.00022826,
       0.00022221, 0.00023017, 0.00022635, 0.00022535, 0.00024915,
       0.00023189, 0.00022655, 0.00022607, 0.00023146, 0.00025034,
       0.00022621, 0.0002315 , 0.00022345, 0.00022736, 0.000248  ,
       0.00022659, 0.00022745, 0.00022702, 0.00022664, 0.00023279,
       0.0002265 , 0.00022521, 0.00022607, 0.00022793, 0.00022669,
       0.00022588, 0.00022283, 0.00022645, 0.00022578, 0.00022411,
       0.00023336, 0.00022726, 0.00022459, 0.0002284 , 0.00022378,
       0.00023842, 0.00022292, 0.0002243 , 0.00022388, 0.00022192,
       0.00022874, 0.00022368, 0.00022697, 0.00022368, 0.00022335,
       0.00022774, 0.0002924 , 0.0002337 , 0.00022221, 0.00022173,
       0.00021849, 0.0002183 , 0.00021906, 0.00021601, 0.00021577,
       0.00022645, 0.00022111, 0.00022683, 0


In [None]:
# Here we are cross-validating with other models such as Linear regression and Decision trees to evaluate the performance of our model
models = [
    ('KNN', KNeighborsRegressor(**grid.best_params_)),
    ('Linear Regression', LinearRegression()),
    ('Decision Tree', DecisionTreeRegressor())
]

kfold = KFold(n_splits=10, random_state=42, shuffle=True)
results = []
names = []

# Cross validating
for name, model in models:
    cv_results = cross_val_score(model, X_train_scaled, y_train, cv=kfold, scoring='neg_mean_squared_error')
    results.append(cv_results)
    names.append(name)
    print(f'{name}: {cv_results.mean()} ({cv_results.std()})')

# Make a boxplot to visualize comparing the algorithms
fig = plt.figure(figsize=(10, 7))
fig.suptitle('Comparing the Algorithms')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

In [None]:
# The best model from the grid
knn_best = grid.best_estimator_ 

# Make predictions from the best model
y_pred = knn_best.predict(X_test_scaled)

# Calculate MSE and MAE and RMSE
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)

print("Mean Squared Error:", mse)
print("Mean Absolute Error:", mae)
print("Root Mean Squared Error:", rmse)


In [None]:
# Change the formatting of the output to be in decimal instead of scientific notation
np.set_printoptions(suppress=True)

# Print the predicted options prices for the test set
prediction = knn_best.predict(X_test_scaled)
print("Predicted options price:", prediction)


In [None]:
# Here we are calculating the error of the model
# Calculate the Mean Absolute Percentage Error
actual = np.array(y_test)
absolute_percentage_errors = np.abs((actual - y_pred) / actual)
mape = np.mean(absolute_percentage_errors) * 100
print("Mean Absolute Percentage Error:", mape)

# Calculate the Median Absolute Error
medae = np.median(np.abs(y_test - y_pred))
print("Median Absolute Error:", medae)

# Calculate R-Squared
r2 = r2_score(y_test, y_pred)
print("R-Squared:", r2)

In [None]:
# Finally, we are plotting our results to better understand the model and visually determine its performance
# We are using the seaborn library to modify the design of the plots
sns.set_style("whitegrid")
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, color='blue', alpha=0.7, label='Data Points')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--', label='Perfect Prediction')

# Plot labels
plt.xlabel('Actual Prices', fontsize=12)
plt.ylabel('Predicted Prices', fontsize=12)
plt.title('Actual vs Predicted Prices', fontsize=14)

# Tick Labels
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Plot Limits
plt.xlim(min(y_test), max(y_test))
plt.ylim(min(y_test), max(y_test))

# Legend (labels are attached to each artist above so they cannot be swapped)
plt.legend(loc='lower right', fontsize=10)

# Remove the Spines
sns.despine()

plt.tight_layout()
plt.show()
