This mandatory assignment was done by Daniel Trofimovs (s374922) and Sander Schultz (s374968).

We have chosen the task of predicting the stock price of Tesla (TSLA). Since we are predicting a continuous value, this is a regression problem, so we excluded classification algorithms such as Naive Bayes and the classification variants of K-Nearest Neighbours (KNN) and SVM. We also ruled out linear regression as our final model, since it cannot capture the non-linear relationship between the date and the closing price and therefore gives a poor R² score; we still fit it below as a baseline. That leaves the Decision Tree algorithm, the Random Forest algorithm and Support Vector Regression (SVR). Spoiler: SVR did not score as well as the other two on this dataset.

Our findings (R² score on the test set): Linear Regression = 41.2%, SVR = 59.9%, Decision Tree = 99.892%, Random Forest = 99.899%. Although it is far from perfect, we have chosen to rely mostly on the Random Forest algorithm. It has the best score of the four algorithms, but it has the flaw of overfitting the CSV data. For example, it cannot predict today's Tesla stock price correctly, because the price fall in late 2021 and early 2022 is a crucial part of the data that is missing from the CSV file.

In [1]:
import pandas as pd
import matplotlib.pyplot as pyplot
import seaborn as sea

# Display is used to show several outputs per cell
from IPython.display import display

# Create dataframe from csv file
data = pd.read_csv('data/TSLA.csv')
display(data.head())

# Checking for missing values
display(data.isnull().sum())

         Date   Open   High    Low  Close  Adj Close    Volume
0  2010-06-29    3.8    5.0  3.508  4.778      4.778  93831500
1  2010-06-30  5.158  6.084   4.66  4.766      4.766  85935500
2  2010-07-01    5.0  5.184  4.054  4.392      4.392  41094000
3  2010-07-02    4.6   4.62  3.742   3.84       3.84  25699000
4  2010-07-06    4.0    4.0  3.166  3.222      3.222  34334500


Date         0
Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
dtype: int64

In [2]:
# Keep only the 'Date' and 'Close' columns (the only columns needed for this assignment)
modified_data = data[['Date', 'Close']]

# Save the modified data as a new CSV file
modified_data.to_csv('data/TESLA_modified.csv', index=False)

# Load the modified CSV file
data = pd.read_csv('data/TESLA_modified.csv')
data.head()

         Date  Close
0  2010-06-29  4.778
1  2010-06-30  4.766
2  2010-07-01  4.392
3  2010-07-02   3.84
4  2010-07-06  3.222


In [3]:
# Machine learning algorithms can't handle dates directly, so we convert them to ordinal numbers
# First convert to datetime format
data['Date'] = pd.to_datetime(data['Date'])

# Then convert to ordinal numbers (a day count, with 0001-01-01 as day 1)
data['Date_Ordinal'] = data['Date'].map(lambda date: date.toordinal())
data.head()

         Date  Close  Date_Ordinal
0  2010-06-29  4.778        733952
1  2010-06-30  4.766        733953
2  2010-07-01  4.392        733954
3  2010-07-02   3.84        733955
4  2010-07-06  3.222        733959
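
To make the ordinal conversion concrete, here is a minimal standard-library sketch. `date.toordinal()` returns a day count in which 0001-01-01 is day 1, and `date.fromordinal()` reverses it. The variable names are only for illustration.

```python
from datetime import date

first_row = date(2010, 6, 29)
ordinal = first_row.toordinal()
print(ordinal)  # 733952, the Date_Ordinal of the first row in the dataset

# The mapping is reversible, so a model input can be traced back to its date
assert date.fromordinal(ordinal) == first_row

# Ordinals count calendar days, so weekends and holidays show up as gaps:
# 2010-07-02 (Friday) to 2010-07-06 (Tuesday) is 4 days, not 1 trading day
gap = date(2010, 7, 6).toordinal() - date(2010, 7, 2).toordinal()
print(gap)  # 4
```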


In [4]:
# Split the data into training and test sets (70% training, 30% testing)
from sklearn.model_selection import train_test_split

X = data[['Date_Ordinal']]
y = data['Close']

# random_state=42 makes the split identical on every run, which keeps the results reproducible.
# Note: train_test_split shuffles the rows by default, so the test dates are interleaved with the training dates.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
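
Because `train_test_split` shuffles by default, the test set contains dates that lie between training dates. If the goal is to judge predictions for future dates, a chronological split is a common alternative. The sketch below is not what we used for the scores in this notebook; `chronological_split` is a hypothetical helper, and it assumes the rows are already sorted by date.

```python
def chronological_split(rows, test_size=0.3):
    """Split date-sorted rows so the newest `test_size` share becomes the test set."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    n_test = max(1, round(len(rows) * test_size))
    split_at = len(rows) - n_test
    return rows[:split_at], rows[split_at:]


# Toy (Date_Ordinal, Close) pairs taken from the first rows of the dataset
rows = [(733952, 4.778), (733953, 4.766), (733954, 4.392),
        (733955, 3.84), (733959, 3.222)]
train, test = chronological_split(rows, test_size=0.4)

print(train)  # the three oldest rows
print(test)   # the two newest rows
```

With a DataFrame sorted by `Date`, the same idea would be `data.iloc[:split_at]` and `data.iloc[split_at:]`.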

In [5]:
# Baseline: shows how poor an R² score linear regression gets on this dataset
from sklearn.linear_model import LinearRegression

# Initialize the Linear Regression model
lr_model = LinearRegression()

# Train the model
lr_model.fit(X_train, y_train)

# Make predictions
y_pred_lr = lr_model.predict(X_test)

# Calculate the R² score (how well the model fits the test data)
score_lr = lr_model.score(X_test, y_test)
print(f"Linear Regression R²: {score_lr}")

Linear Regression R²: 0.41159340695106783
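
For reference, the value returned by `.score()` on a scikit-learn regressor is the coefficient of determination, R² = 1 - SS_res / SS_tot. The plain-Python sketch below (the `r2_score_manual` name is ours, chosen for illustration) shows the calculation. A model that predicts perfectly scores 1.0, and one that always predicts the mean of the test targets scores 0.0. R² can also be negative, so it is not a percentage of correct predictions.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0]

perfect = r2_score_manual(y_true, [3.0, 5.0, 7.0])
mean_only = r2_score_manual(y_true, [5.0, 5.0, 5.0])
worse_than_mean = r2_score_manual(y_true, [7.0, 5.0, 3.0])

print(perfect, mean_only, worse_than_mean)  # 1.0 0.0 -3.0
```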


In [6]:
# SVR did not perform as well as we expected
from sklearn.svm import SVR

svr_model = SVR(kernel='rbf', C=100, gamma=0.1)
svr_model.fit(X_train, y_train)
y_pred_svr = svr_model.predict(X_test)
score_svr = svr_model.score(X_test, y_test)
print(f"SVR R^2: {score_svr}")

SVR R^2: 0.5992887645645717


In [7]:
# Decision tree algorithm: simple and effective
from sklearn.tree import DecisionTreeRegressor

tree_model = DecisionTreeRegressor()
tree_model.fit(X_train, y_train)
y_pred_tree = tree_model.predict(X_test)
score_tree = tree_model.score(X_test, y_test)
print(f"Decision Tree R^2: {score_tree}")

Decision Tree R^2: 0.9989192719416186


In [8]:
# Random forest: also a good algorithm for this dataset
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
score_rf = rf_model.score(X_test, y_test)
print(f"Random Forest R^2: {score_rf}")

Random Forest R^2: 0.9989909728288165


The cell below lets the user enter a date in YYYY-MM-DD format. The program prints the predicted stock price for that date, along with the model's R² score on the test set.

In [9]:
from datetime import datetime

# Function to take user input and predict stock price for the given date
def predict_stock_price_for_date(model):
    # Prompt the user to input a date in the format YYYY-MM-DD
    date_str = input("Enter a date in YYYY-MM-DD format (e.g., 2024-09-28): ")
    
    try:
        # Convert the input string to a datetime object
        user_date = datetime.strptime(date_str, "%Y-%m-%d")
        
        # Convert the date to ordinal format (what the model expects)
        user_date_ordinal = user_date.toordinal()
        
        # Convert to DataFrame for prediction (the same format the model was trained with)
        X_new = pd.DataFrame([[user_date_ordinal]], columns=['Date_Ordinal'])
        
        # Use the model to predict the stock price for the input date
        predicted_price = model.predict(X_new)
        
        # Output the predicted price
        print(f"Predicted stock price for Tesla on {user_date.strftime('%Y-%m-%d')}: ${predicted_price[0]:.2f}")
    
    except ValueError:
        print("Invalid date format. Please enter the date in YYYY-MM-DD format.")

# Choose which model to use
predict_stock_price_for_date(rf_model)
print(f"Model Prediction Accuracy (R^2 Score): {score_rf}")

Predicted stock price for Tesla on 2020-10-10: $431.83
Model Prediction Accuracy (R^2 Score): 0.9989909728288165
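
Since the model has only seen dates that are inside the CSV file, it can help to warn when a requested date falls outside that range. Below is a small sketch of such a check. `is_within_training_range` is a hypothetical helper and the end date is a made-up placeholder; neither is part of the program above. In the notebook the bounds could come from `data['Date'].min()` and `data['Date'].max()`.

```python
from datetime import date, datetime


def is_within_training_range(date_str, first_date, last_date):
    """Return True if date_str (YYYY-MM-DD) lies between first_date and last_date."""
    requested = datetime.strptime(date_str, "%Y-%m-%d").date()
    return first_date <= requested <= last_date


# The first date is taken from the dataset; the last date is a placeholder assumption
first_date = date(2010, 6, 29)
last_date = date(2021, 1, 1)

for candidate in ("2020-10-10", "2024-10-10"):
    if is_within_training_range(candidate, first_date, last_date):
        print(f"{candidate}: inside the training range")
    else:
        print(f"{candidate}: outside the training range, treat the prediction with caution")
```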


We then tested how well this program can predict today's Tesla stock price. After testing all four models on 2024-10-10, we got some interesting results. Even though SVR had a lower R² score than Random Forest and Decision Tree, it came closest, with a predicted price of $76.34 (the actual price on 2024-10-10 was $241.28). In this case we see that Random Forest and Decision Tree suffer from overfitting the CSV data. Overfitting means the model is too closely tailored to the historical data and may not perform as well on unseen or future data (e.g. today's stock price). We believe SVR came closest because the Tesla stock had a large decline in late 2021 and early 2022, which is not included in the CSV file. By the end of 2022 it had fallen about 70% from its peak in late 2021.
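
One mechanism that may contribute to this, and that we did not isolate in our tests, is that tree-based regressors cannot extrapolate: for any input beyond the largest value seen in training, a decision tree returns the value of its last leaf. The sketch below uses made-up toy data (not the Tesla data) to illustrate this with `DecisionTreeRegressor`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: a steadily rising "price" over ten consecutive days
X_toy = np.arange(10).reshape(-1, 1)        # days 0..9
y_toy = (np.arange(10) ** 2).astype(float)  # 0, 1, 4, ..., 81

tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_toy, y_toy)

# Inside the training range the fully grown tree reproduces the training target
inside = tree.predict([[5]])
print(inside)

# Beyond the training range every prediction equals the last training target (81)
beyond = tree.predict([[10], [100], [10_000]])
print(beyond)
```

A random forest averages many such trees, so it shows the same flat behaviour outside the training range.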