### Algorithm Choice

We chose **Linear Regression** for this problem because we want to predict the **stock price**, which is a **continuous number** rather than a **category**.  
Linear regression is simple, easy to interpret, and a natural first model for a regression task.  
The model fits a straight-line relationship between the **date** and the **stock price**, and it is **quick** to train.
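
To make the "straight-line relationship" concrete, here is a minimal sketch of ordinary least squares for a single feature, written in plain Python. The helper `fit_line` and the sample points are made up for illustration and are not part of the assignment; scikit-learn's `LinearRegression` solves the same least-squares problem for one or more features.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on the line y = 2x + 1
days = [0, 1, 2, 3, 4]
prices = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(days, prices)
print(slope, intercept)  # 2.0 1.0
```

Real price data will not lie on a line, so the fitted slope and intercept only describe the average trend over the dates used for training.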

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


GME_df = pd.read_csv("GME_stock.csv")

#Preview of the dataset 
GME_df.head()

Unnamed: 0,date,open_price,high_price,low_price,close_price,volume,adjclose_price
0,2021-01-28,265.0,483.0,112.25,193.600006,58815800.0,193.600006
1,2021-01-27,354.829987,380.0,249.0,347.51001,93396700.0,347.51001
2,2021-01-26,88.559998,150.0,80.199997,147.979996,178588000.0,147.979996
3,2021-01-25,96.730003,159.179993,61.130001,76.790001,177874000.0,76.790001
4,2021-01-22,42.59,76.760002,42.32,65.010002,196784300.0,65.010002


In [5]:
#We need to preprocess the data first
#The date is stored as a string (object dtype), so we need to convert it into a numerical feature
print(GME_df.info())  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4773 entries, 0 to 4772
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   date            4773 non-null   object 
 1   open_price      4773 non-null   float64
 2   high_price      4773 non-null   float64
 3   low_price       4773 non-null   float64
 4   close_price     4773 non-null   float64
 5   volume          4773 non-null   float64
 6   adjclose_price  4773 non-null   float64
dtypes: float64(6), object(1)
memory usage: 261.2+ KB
None


In [6]:
#Convert the date from a string to an ordinal number: the number of days counted from January 1 of year 1 (which is day 1)
GME_df['date'] = pd.to_datetime(GME_df['date'])
GME_df['date_ordinal'] = GME_df['date'].map(pd.Timestamp.toordinal)
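
As a quick sanity check on what an ordinal is, the sketch below uses the standard library's `datetime.date`, whose `toordinal` method `pd.Timestamp` inherits. The dates are taken from rows shown elsewhere in this notebook; the variable names are only illustrative.

```python
from datetime import date

d = date.fromisoformat("2019-03-13")
ordinal = d.toordinal()  # day count where 0001-01-01 is day 1
print(ordinal)           # 737131

# The conversion is reversible
assert date.fromordinal(ordinal) == d

# Trading days are not evenly spaced: a Friday and the following Monday
friday = date.fromisoformat("2019-03-08").toordinal()
monday = date.fromisoformat("2019-03-11").toordinal()
print(monday - friday)   # 3
```

Because the ordinal counts calendar days, weekends and holidays appear as gaps in the feature even though no trading happened on those days.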

In [7]:
#Verifying conversion, all good 
print(GME_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4773 entries, 0 to 4772
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   date            4773 non-null   datetime64[ns]
 1   open_price      4773 non-null   float64       
 2   high_price      4773 non-null   float64       
 3   low_price       4773 non-null   float64       
 4   close_price     4773 non-null   float64       
 5   volume          4773 non-null   float64       
 6   adjclose_price  4773 non-null   float64       
 7   date_ordinal    4773 non-null   int64         
dtypes: datetime64[ns](1), float64(6), int64(1)
memory usage: 298.4 KB
None


In [8]:
#Let's also check for any null values, there are none 
GME_df.isnull().sum()

date              0
open_price        0
high_price        0
low_price         0
close_price       0
volume            0
adjclose_price    0
date_ordinal      0
dtype: int64

In [9]:
#The first rows suggest that the adjclose_price column holds the same data as the close_price column. If this is the case for all rows,
#then we can safely drop this column because it is redundant.
#Let's check if this is the case for all the rows
(GME_df['close_price'] == GME_df['adjclose_price']).all()

np.False_

In [10]:
#This is not the case: the two columns differ in some rows, so we keep both columns
GME_df[GME_df['close_price'] != GME_df['adjclose_price']]   

Unnamed: 0,date,open_price,high_price,low_price,close_price,volume,adjclose_price,date_ordinal
474,2019-03-13,11.540,11.640,11.470,11.580,2191600.0,11.200000,737131
475,2019-03-12,11.280,11.550,11.230,11.470,2164900.0,11.093610,737130
476,2019-03-11,10.980,11.280,10.890,11.260,2703600.0,10.890501,737129
477,2019-03-08,11.070,11.220,10.750,10.970,6171600.0,10.610018,737126
478,2019-03-07,11.560,11.660,11.440,11.590,1811400.0,11.209672,737125
...,...,...,...,...,...,...,...,...
4768,2002-02-20,9.600,9.875,9.525,9.875,1723200.0,6.648838,730901
4769,2002-02-19,9.900,9.900,9.375,9.550,1852600.0,6.430017,730900
4770,2002-02-15,10.000,10.025,9.850,9.950,2097400.0,6.699336,730896
4771,2002-02-14,10.175,10.195,9.925,10.000,2755400.0,6.733003,730895


In [12]:
#We now need to select our feature and target. The assignment says that the input should be the date and the output should be the close price,
#so we pick those two to keep it simple
X = GME_df[['date_ordinal']]
y = GME_df['close_price']

#We now split the data into training and test sets, holding out 20% of the dataset for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Training the model
#LinearRegression is imported here as well, because running this cell without it raised:
#NameError: name 'LinearRegression' is not defined
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

#Then we use our model to predict the close price for the 20% test set
y_pred = model.predict(X_test)

#We then need to evaluate the model. Since the stock price is a continuous number, a confusion matrix won't work here.
#We instead use regression metrics: Mean Squared Error (MSE), which tells us how far off the predictions are (in squared price units),
#and the R² score, which tells us how much of the variation in the data the model explains.

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R² Score:", r2)

#The model is not accurate at all; we need to add more features or change the regression model
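
To help read the two metrics used above, here is a small plain-Python sketch of how MSE and R² are defined. The helper names `mse_by_hand` and `r2_by_hand` and the numbers are hypothetical and chosen only so the arithmetic is easy to follow; they are not results from the GME data.

```python
def mse_by_hand(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_by_hand(y_true, y_pred):
    """1 - SS_res / SS_tot: the share of variance explained by the predictions."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [10.0, 12.0, 14.0, 16.0]
perfect = [10.0, 12.0, 14.0, 16.0]  # predictions equal to the true values
flat = [13.0, 13.0, 13.0, 13.0]     # always predicts the mean of y_true

print(mse_by_hand(y_true, perfect), r2_by_hand(y_true, perfect))  # 0.0 1.0
print(mse_by_hand(y_true, flat), r2_by_hand(y_true, flat))        # 5.0 0.0
```

An R² near 0 therefore means the model does no better than predicting the average price, and a negative R² means it does worse. Taking the square root of the MSE gives an error in the same units as the price, which is often easier to interpret.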