STUDENT: Joel S. Mollel

NUMBER: C00313599

ALGORITHM: K-Nearest Neighbour


Given the provided K-Nearest Neighbour code, we are required to:

i) Make sure it runs

ii) Use another dataset and perform other operations

iii) Change the number of features and observe the impact

iv) Simulate it as an app


(i) Getting the code to run
After combining the code snippets from the page, the code ran successfully: it trained the model and calculated the classification accuracy.

In [None]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Function to get the design matrix (features) X
def get_X(data):
    """Return model design matrix X"""
    return data.filter(like='X').values

# Function to get the target variable y
def get_y(data):
    """Return dependent variable y"""
    y = data.Close.pct_change(48).shift(-48)  # Return 48 rows ahead (roughly two days if rows are hourly)
    y[y.between(-.004, .004)] = 0             # Treat returns within +/-0.4% as no movement
    y[y > 0] = 1
    y[y < 0] = -1
    return y

# Function to clean X and y (remove NaNs)
def get_clean_Xy(df):
    """Return (X, y) cleaned of NaN values"""
    X = get_X(df)
    y = get_y(df).values
    isnan = np.isnan(y)
    X = X[~isnan]
    y = y[~isnan]
    return X, y

# Simulate some sample data (you would replace this with your real data)
# Assuming that the data contains columns like 'X1', 'X2', 'Close', etc.
np.random.seed(0)
data = pd.DataFrame({
    'X1': np.random.rand(1000),
    'X2': np.random.rand(1000),
    'Close': np.random.rand(1000) * 100  # Simulating a 'Close' price column
})

# Get cleaned X and y
X, y = get_clean_Xy(data)

# Split data into training and test sets (50% split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Initialize and train the k-NN classifier with 7 neighbors
clf = KNeighborsClassifier(7)
clf.fit(X_train, y_train)

# Predict the test set
y_pred = clf.predict(X_test)

# Plot true vs predicted values
_ = pd.DataFrame({'y_true': y_test, 'y_pred': y_pred}).plot(figsize=(15, 2), alpha=0.7)
plt.show()

# Print classification accuracy
print('Classification accuracy: ', np.mean(y_test == y_pred))
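
To put the accuracy figure in context, it can help to compare it with a trivial baseline. The sketch below is an illustration rather than part of the provided code: it rebuilds the same kind of simulated data and compares the 7-neighbour classifier with scikit-learn's `DummyClassifier`, which always predicts the most frequent training class. Since the simulated features are random noise, the two scores would be expected to land close to each other.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Rebuild simulated data of the same shape as above (random features, random prices)
rng = np.random.RandomState(0)
data = pd.DataFrame({
    'X1': rng.rand(1000),
    'X2': rng.rand(1000),
    'Close': rng.rand(1000) * 100,
})

# Same labelling rule as get_y: sign of the 48-row-ahead return,
# with returns inside +/-0.4% set to 0
ret = data.Close.pct_change(48).shift(-48)
mask = ret.notna()
y = np.sign(ret[mask]).where(~ret[mask].between(-.004, .004), 0).values
X = data.loc[mask, ['X1', 'X2']].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

knn = KNeighborsClassifier(7).fit(X_train, y_train)
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)

knn_acc = knn.score(X_test, y_test)
base_acc = baseline.score(X_test, y_test)
print(f'k-NN accuracy:     {knn_acc:.3f}')
print(f'Baseline accuracy: {base_acc:.3f}')
```

If the k-NN score is not clearly above the baseline, the model has probably not learned anything useful from the features.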


(ii) Using another dataset and performing KNN-related operations

Tesla (TSLA) historical stock data from 2010 to December 31, 2024 is used in this part to visualize the trend of the opening and closing prices over time.

Steps taken:
Step 1: Installing all the required libraries, including yfinance (Yahoo Finance)

All the libraries are then imported.

In [None]:
import yfinance as yf
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler


Step 2: Fetching the Tesla stock data with yfinance and inspecting the first few rows

In [None]:
import yfinance as yf

# Fetch daily stock data for Tesla (TSLA)
data = yf.download("TSLA", start="2010-01-01", end="2025-01-01")
print(data.head())

Step 3: Feature engineering - creating lagged features (using the previous day's data)

In [None]:
data['lag_close'] = data['Close'].shift(1)
data['lag_open'] = data['Open'].shift(1)
data['lag_high'] = data['High'].shift(1)
data['lag_low'] = data['Low'].shift(1)
data['lag_volume'] = data['Volume'].shift(1)
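
As a small illustration of what `shift(1)` does, the sketch below applies it to a toy frame. The prices and dates are made up for the example and are not TSLA data.

```python
import pandas as pd

# Toy prices, purely illustrative (not real TSLA data)
toy = pd.DataFrame(
    {'Close': [10.0, 11.0, 12.0, 13.0]},
    index=pd.date_range('2024-01-01', periods=4, freq='D'),
)

# shift(1) moves each value down one row, so every row sees the previous day's close
toy['lag_close'] = toy['Close'].shift(1)
print(toy)

# The first row has no previous day, so its lag is NaN; dropna() removes that row
clean = toy.dropna()
print(len(toy), '->', len(clean))
```

This is why the lagged frame needs a NaN clean-up step before it can be passed to scikit-learn.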

Step 4: Data cleaning - dropping rows with NaN values

In [None]:
data.dropna(inplace=True)

Step 5: Creating the feature matrix (no target variable is required)

In [None]:
X = data[['lag_close', 'lag_open', 'lag_high', 'lag_low', 'lag_volume']].values

Step 6: Feature scaling to standardize the features (zero mean, unit variance)

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
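
Scaling matters for KNN because neighbours are found by distance, and a column with large values (such as volume) can dominate that distance. The sketch below uses three hypothetical rows to show the effect; the numbers are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [price in USD, volume in shares]
X = np.array([
    [200.0, 90_000_000.0],
    [250.0, 90_000_100.0],   # very different price, almost identical volume
    [200.5, 95_000_000.0],   # almost identical price, different volume
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Unscaled: volume dominates, so row 1 looks far closer to row 0 than row 2 does
print(dist(X[0], X[1]), dist(X[0], X[2]))

# Scaled: each column has mean 0 and unit variance, so price differences count too
X_scaled = StandardScaler().fit_transform(X)
print(dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2]))
```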

Step 7: Plotting the opening and closing prices over time

In [None]:
ticker = "TSLA"
plt.figure(figsize=(10, 6)) #Size of the figure

plt.plot(data.index, data['Open'], label='Open Price', color='green') 
plt.plot(data.index, data['Close'], label='Close Price', color='blue', alpha=0.4)

plt.title(f'{ticker} Stock Price(Opening and Closing price) Over Time') #Labels
plt.xlabel('Date')
plt.ylabel('Stock Price (USD)')
plt.legend()
plt.grid(True)
plt.show()


(iii) Changing the number of features to see the impact on model performance

==> In this case, using more features gave lower error than using two features.
When predicting the Close price from the Open and Volume features only, the Mean Squared Error was 138.64782553698436 and the Mean Absolute Error was 8.531002985044966. When the 'Open', 'Volume', 'High', 'Low', and 'Close' features were used, a significant improvement was observed, with a Mean Squared Error of 84.09050306528196 and a Mean Absolute Error of 1.804469963522518.


Mean Squared Error: 
Mean Absolute Error: 5.986933045185053

a) Libraries

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

b) Loading Tesla dataset

In [None]:
ticker = 'TSLA'
data = yf.download(ticker, start='2000-01-01', end='2024-12-31')

c) Feature selection, target declaration and scaling of all features
The Open price and Volume features are selected.

In [None]:
features = data[['Open', 'Volume']]
target = data['Close']

scaler = MinMaxScaler()
features_scaled = scaler.fit_transform(features)


d) Splitting data into training (80%) and testing (20%) sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.2, shuffle=False)
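
Passing `shuffle=False` keeps the rows in their original order, which for a date-indexed frame means the model is trained on earlier days and tested on later ones. The sketch below shows this on ten made-up values; the dates and numbers are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten consecutive business days of made-up prices
idx = pd.date_range('2024-01-01', periods=10, freq='B')
prices = pd.Series(np.arange(10, dtype=float), index=idx)
X = prices.to_frame('feature').values

X_tr, X_te, y_tr, y_te = train_test_split(X, prices, test_size=0.2, shuffle=False)

# With shuffle=False the first 80% of rows are training data and the last 20% are test data
print(len(y_tr), len(y_te))
print(y_tr.index.max(), '<', y_te.index.min())
```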


e) Training the model with number of neighbours = 5 and running the predictions

In [None]:
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)
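
With its default uniform weights, `KNeighborsRegressor` predicts the average target of the k nearest training rows. One consequence, sketched below on made-up data, is that a prediction can never go beyond the range of target values seen in training, which is worth keeping in mind when the test period contains prices higher than anything in the training period.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# One made-up feature with target = 2 * feature, training range 0..9
X_train = np.arange(10, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# Inside the training range: the 5 nearest x values to 4.0 are 2, 3, 4, 5, 6,
# so the prediction is mean(4, 6, 8, 10, 12) = 8.0
inside = knn.predict([[4.0]])[0]

# Far outside the range: the 5 nearest are still x = 5..9, so the prediction
# is mean(10, 12, 14, 16, 18) = 14.0 rather than 200.0
outside = knn.predict([[100.0]])[0]

print(inside, outside)
```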

f) Computing and printing the MSE and MAE

In [None]:
mse = mean_squared_error(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)
print(f'Mean Squared Error: {mse}')
print(f'Mean Absolute Error: {mae}')
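
To make the two metrics concrete, the sketch below works them out by hand on four made-up values and checks the result against scikit-learn. The numbers are for illustration only and are not results from the Tesla data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up prices for illustration only
y_true = np.array([100.0, 102.0, 105.0, 110.0])
y_hat = np.array([101.0, 100.0, 105.0, 120.0])

errors = y_true - y_hat                # [-1, 2, 0, -10]
mse_manual = np.mean(errors ** 2)      # (1 + 4 + 0 + 100) / 4 = 26.25
mae_manual = np.mean(np.abs(errors))   # (1 + 2 + 0 + 10) / 4 = 3.25

print(mse_manual, mean_squared_error(y_true, y_hat))
print(mae_manual, mean_absolute_error(y_true, y_hat))
```

Because errors are squared, the single miss of 10 contributes 100 of the 105 total in the MSE, so a few large misses can keep the MSE high even when the MAE is small.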

g) Plotting Actual vs Predicted stock prices

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(y_test.index, y_test, label='Actual', color='blue')
plt.plot(y_test.index, predictions, label='Predicted', color='red', linestyle='dashed')
plt.title('KNN Prediction of Tesla Stock Prices (Open, Volume as Features)')
plt.xlabel('Date')
plt.ylabel('Stock Price (USD)')
plt.legend()
plt.grid(True)
plt.show()

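==> Increasing the number of features reduced both the Mean Squared Error and the Mean Absolute Error

The features selected now are Open, Volume, High, Low and Close, and the goal is still to predict the Close price. Note that Close is then both an input and the target, so part of the improvement comes from the model seeing the value it is asked to predict. This run also uses StandardScaler instead of MinMaxScaler, so the two runs differ in more than the feature set.

Declaration and scaling of features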

In [None]:
features2 = data[['Open', 'Volume', 'High', 'Low', 'Close']]
target = data['Close']

scaler = StandardScaler()
features_scaled = scaler.fit_transform(features2)

Training (80%) and testing (20%) split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.2, shuffle=False)

Applying KNN Regression on the training set

In [None]:
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)

Performing predictions and evaluation of performance

In [None]:
predictions = knn.predict(X_test)

mse = mean_squared_error(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)
print(f'Mean Squared Error: {mse}')
print(f'Mean Absolute Error: {mae}')

Plotting the graph

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(y_test.index, y_test, label='Actual', color='blue')
plt.plot(y_test.index, predictions, label='Predicted', color='red', linestyle='dashed')
plt.title("Stock Price Prediction using KNN with features 'Open', 'Volume', 'High', 'Low', 'Close'")
plt.xlabel('Date')
plt.ylabel('Stock Price (USD)')
plt.legend()
plt.grid(True)
plt.show()
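
One way to compare feature sets on equal terms is to hold everything else fixed (same scaler, same k, same split) and vary only the columns. The sketch below is a possible arrangement, not the method used above: `evaluate` is a hypothetical helper, the OHLCV frame is synthetic random-walk data standing in for the download, and Close is left out of the inputs because it is the target.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic OHLCV frame standing in for the downloaded data (assumption)
rng = np.random.default_rng(0)
n = 500
close = 100 + np.cumsum(rng.normal(0, 1, n))
open_ = close + rng.normal(0, 0.5, n)
data = pd.DataFrame({
    'Open': open_,
    'High': np.maximum(open_, close) + rng.uniform(0, 1, n),
    'Low': np.minimum(open_, close) - rng.uniform(0, 1, n),
    'Close': close,
    'Volume': rng.integers(1_000_000, 5_000_000, n).astype(float),
})

def evaluate(feature_names, k=5):
    """Hypothetical helper: fit scaler + k-NN on the training part only, return (MSE, MAE)."""
    X = data[feature_names].values
    y = data['Close'].values
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)
    # The pipeline fits the scaler on the training rows only
    model = make_pipeline(MinMaxScaler(), KNeighborsRegressor(n_neighbors=k))
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return mean_squared_error(y_te, pred), mean_absolute_error(y_te, pred)

results = {}
for names in (['Open', 'Volume'], ['Open', 'Volume', 'High', 'Low']):
    results[tuple(names)] = evaluate(names)
    mse, mae = results[tuple(names)]
    print(f'{names}: MSE={mse:.3f}, MAE={mae:.3f}')
```

The printed values depend on the synthetic data and say nothing about Tesla; the point is only the structure of the comparison.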

(iv) Prompting for user input for customized predictions - a real-life application

==> Libraries and dataset import

In [None]:
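import yfinance as yf
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler  # the scaler actually used below
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error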

ticker = 'TSLA'
data = yf.download(ticker, start='2000-01-01', end='2024-12-31')

==> Declaration and scaling of features to predict "Close" price

In [None]:
features = data[['Open', 'Volume', 'High', 'Low', 'Close']]  # All 5 features
target = data['Close']  # Target is 'Close'

scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

==> Splitting data into training (80%) and testing (20%) datasets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features_scaled[:, :4], target, test_size=0.2, shuffle=False)

==> Train the KNN Regression Model using the training dataset

In [None]:
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)

==> Performing predictions using the test set

In [None]:
predictions = knn.predict(X_test)

In [None]:
# Evaluate Model Performance
mse = mean_squared_error(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)
print(f'Mean Squared Error: {mse}')
print(f'Mean Absolute Error: {mae}')

==> Prompt user input and predict the Close Price

In [None]:
def predict_stock_price():
    print("Please enter the following details to predict the stock price:")

    open_price = float(input("Enter the Open price: "))
    volume = int(input("Enter the Volume: "))
    high_price = float(input("Enter the High price: "))
    low_price = float(input("Enter the Low price: "))

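    # The scaler was fitted on 5 columns, so pass a placeholder 0 for Close
    # and keep only the first 4 scaled columns (Open, Volume, High, Low)
    scaled_input = scaler.transform([[open_price, volume, high_price, low_price, 0]])[:, :4]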

    predicted_close = knn.predict(scaled_input)

    print(f"Predicted Stock Close Price is ${predicted_close.item():.2f}")

    return open_price, volume, high_price, low_price

open_price, volume, high_price, low_price = predict_stock_price()

plt.figure(figsize=(10, 6))
plt.plot(y_test.index, y_test, label='Actual', color='blue')
plt.plot(y_test.index, predictions, label='Predicted', color='red', linestyle='dashed')
plt.title(f'KNN Prediction of Tesla Stock Prices using Open: ${open_price}, Volume: {volume}, High: ${high_price}, Low: ${low_price}')
plt.xlabel('Date')
plt.ylabel('Stock Price (USD)')
plt.legend()
plt.grid(True)
plt.show()
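
For testing the app logic without typing values at a prompt, the prediction step could be wrapped in a function that takes the four inputs as arguments. The sketch below is one possible shape under stated assumptions: `predict_close` is a hypothetical helper, the training frame is synthetic data standing in for the TSLA download, and the scaler is fitted on the four input columns only so no placeholder for Close is needed.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ['Open', 'Volume', 'High', 'Low']

# Synthetic training frame (assumption), standing in for the TSLA download
rng = np.random.default_rng(1)
n = 300
close = 200 + np.cumsum(rng.normal(0, 2, n))
open_ = close + rng.normal(0, 1, n)
train = pd.DataFrame({
    'Open': open_,
    'Volume': rng.integers(1_000_000, 5_000_000, n).astype(float),
    'High': np.maximum(open_, close) + rng.uniform(0, 2, n),
    'Low': np.minimum(open_, close) - rng.uniform(0, 2, n),
    'Close': close,
})

# Scaler and regressor are fitted together on the four input columns
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
model.fit(train[FEATURES].values, train['Close'].values)

def predict_close(open_price, volume, high_price, low_price):
    """Hypothetical helper: validate one day's inputs and return the predicted close."""
    if min(open_price, volume, high_price, low_price) < 0:
        raise ValueError('Prices and volume must be non-negative')
    if not low_price <= open_price <= high_price:
        raise ValueError('Expected low <= open <= high')
    row = np.array([[open_price, volume, high_price, low_price]], dtype=float)
    return float(model.predict(row)[0])

predicted = predict_close(250.0, 3_000_000, 255.0, 245.0)
print(f'Predicted close: ${predicted:.2f}')
```

The same function can then be called from an `input()` loop, a web form, or a unit test, and invalid entries such as a High below the Low raise a `ValueError` instead of producing a misleading prediction.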
