# INTRODUCTION.

### I am building a predictive model for Samsung's stock price using scikit-learn's LinearRegression. The stock data is downloaded with the "yfinance" library, which returns it as a pandas DataFrame.

Importing all necessary libraries.

In [1]:
import yfinance as yf
import pandas as pd
from datetime import date, timedelta, datetime
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Getting the needed data.
"""We assume that no external factors influence Samsung's stock price."""

In [2]:
samsung_data = yf.Ticker('005930.KS')
hist = samsung_data.history(period="1y")

In [3]:
hist.sample(5)

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
2023-11-13 00:00:00+09:00,70329.262274,70329.262274,69342.877108,69441.515625,9246919,0.0,0.0
2024-03-08 00:00:00+09:00,72148.870481,72743.504029,71950.659298,72644.398438,19271349,0.0,0.0
2023-10-20 00:00:00+09:00,67961.935388,68257.850927,67172.827285,67863.296875,15204495,0.0,0.0
2024-02-20 00:00:00+09:00,73040.820803,73040.820803,72148.870481,72644.398438,14681477,0.0,0.0
2024-04-04 00:00:00+09:00,84821.678414,85120.346296,83925.674769,84921.234375,25248934,0.0,0.0


Checking the timezone of the date index (optional).

In [4]:
print(hist.index.tzinfo)

Asia/Seoul


I prefer to work on a copy of the dataset so that the original download stays untouched. This step can be skipped.

In [5]:
data = hist.copy()

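Checking for missing values. Note that isna().count() counts the rows in each column (246 here), not the missing entries; data.isna().sum() would report the number of missing values per column.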

In [6]:
data.isna().count()

Open            246
High            246
Low             246
Close           246
Volume          246
Dividends       246
Stock Splits    246
dtype: int64
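
The difference between counting rows and counting missing values can be seen on a small example. This is a minimal sketch on a toy DataFrame with made-up values (the name toy and its contents are assumptions, not part of the Samsung data):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with exactly one missing entry in "Close".
toy = pd.DataFrame({"Close": [100.0, np.nan, 102.0], "Volume": [10, 12, 11]})

# count() on the boolean mask counts its rows, so it always equals len(toy).
rows_per_column = toy.isna().count()

# sum() adds up the True values, i.e. the cells that are actually missing.
missing_per_column = toy.isna().sum()

print(rows_per_column)
print(missing_per_column)
```

On the real data, a result of all zeros from isna().sum() would confirm that no rows are empty.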

We are interested in the daily closing price and the trading volume.

In [7]:
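# .copy() gives an independent frame, so adding columns later does not raise SettingWithCopyWarning
data = data[["Close", "Volume"]].copy()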
data.sample(5)

Date,Close,Volume
2023-10-26 00:00:00+09:00,65791.890625,15517624
2024-07-17 00:00:00+09:00,86700.0,18186490
2023-10-19 00:00:00+09:00,68553.773438,13985012
2024-09-10 00:00:00+09:00,66200.0,30651376
2024-03-29 00:00:00+09:00,82034.117188,27126366


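Deriving features from the continuous data: the day-over-day log change in the closing price (priceRise) and in the trading volume (volumeRise).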

In [8]:
data['priceRise'] = np.log(data['Close'] / data['Close'].shift(1))
data['volumeRise'] = np.log(data['Volume'] / data['Volume'].shift(1))

In [9]:
data.sample(5)

Date,Close,Volume,priceRise,volumeRise
2024-09-05 00:00:00+09:00,69000.0,25686769,-0.014389,-0.063346
2024-07-01 00:00:00+09:00,81800.0,11317202,0.003674,0.179682
2024-04-11 00:00:00+09:00,83726.5625,25538009,0.005963,0.073598
2023-11-01 00:00:00+09:00,67666.023438,13775256,0.025094,-0.050508
2024-04-22 00:00:00+09:00,75762.085938,30469477,-0.019519,-0.027454
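
To make the log-change feature concrete, here is a small sketch on three made-up closing prices (the series close is hypothetical). It shows why the first row is NaN and that log changes add up over time:

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices for three consecutive days.
close = pd.Series([100.0, 110.0, 99.0])

# Same formula as priceRise: log of today's close over yesterday's close.
log_ret = np.log(close / close.shift(1))

# The first day has no previous close, so its value is NaN.
print(log_ret)

# Log changes are additive: their sum equals the log change over the whole period.
total = log_ret.sum()  # NaN is skipped by default
print(total, np.log(close.iloc[-1] / close.iloc[0]))
```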


In [10]:
# The first row has no previous day, so its log changes are NaN; drop it
data = data.dropna()

In [11]:
data.sample(5)

Date,Close,Volume,priceRise,volumeRise
2024-01-19 00:00:00+09:00,74031.875,23363427,0.040989,0.268978
2024-01-16 00:00:00+09:00,71950.664062,14760415,-0.017748,1.607464
2024-01-12 00:00:00+09:00,72446.1875,13038939,-0.001367,-1.487166
2024-05-02 00:00:00+09:00,77653.648438,18900640,0.006431,-0.005612
2024-03-12 00:00:00+09:00,72644.398438,13011654,0.012354,0.289553


Filter the DataFrame to only include the new columns.

In [12]:
# .copy() avoids SettingWithCopyWarning when the 'Pred' column is added below
data = data[['priceRise', 'volumeRise']].copy()

Generating the output variable (the target, or dependent variable): 1 if the next day's priceRise is above 0.01, -1 if it is below -0.01, and 0 otherwise.

In [13]:
conditions = [
            (data['priceRise'].shift(-1) > 0.01),
            (data['priceRise'].shift(-1) < -0.01)
            ]
choices = [1, -1]
data['Pred'] = np.select(conditions, choices, default=0)
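
The labelling rule can be traced by hand on a short series. The following is a sketch with made-up log returns (price_rise is hypothetical); it also shows that the last row, which has no next day, falls through to the default label 0:

```python
import numpy as np
import pandas as pd

# Hypothetical daily log returns.
price_rise = pd.Series([0.000, 0.020, -0.015, 0.005])

# Each row looks at the following day's return; the last row gets NaN.
next_day = price_rise.shift(-1)

conditions = [next_day > 0.01, next_day < -0.01]
labels = np.select(conditions, [1, -1], default=0)

# Comparisons with NaN are False, so the last row is labelled 0 even though
# its next-day move is unknown. Dropping that row before training is one option.
print(labels)  # [ 1 -1  0  0]
```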

Training and Evaluating the model.

In [14]:
features = data[['priceRise','volumeRise']].to_numpy()
features = np.around(features, decimals=2)
target = data['Pred'].to_numpy()

Splitting the data into training and testing sets and training the model.

In [15]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
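
By default train_test_split shuffles the rows, so the test set contains days that fall between training days. For daily data, a chronological split is a common alternative. The sketch below is not the split used in this notebook; it uses stand-in arrays (X and y are hypothetical) to show what shuffle=False does:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: ten "days" in chronological order (hypothetical values).
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# shuffle=False keeps the original order: the last 20% of rows become the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)

print(X_tr.ravel())  # [0 1 2 3 4 5 6 7]
print(X_te.ravel())  # [8 9]
```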

In [16]:
model = LinearRegression()
model.fit(X_train, y_train)

In [17]:
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)

In [18]:
print(f'Mean Squared Error: {mse:.3f}')

Mean Squared Error: 0.579


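A mean squared error of about 0.58 is the average squared difference between the model's continuous output and the -1/0/1 labels. It is an error measure, not an accuracy, so it does not mean the model predicted the next day's trajectory of Samsung's stock 58 percent of the time. Measuring how often the direction is right would require converting the predictions to classes and comparing them with y_test.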
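
One way to get an accuracy-style figure from a regression output is to round each prediction to the nearest label and count the matches. This is a rough sketch under that assumption, using made-up arrays rather than the notebook's y_test and predictions (the helper direction_accuracy is hypothetical):

```python
import numpy as np

def direction_accuracy(y_true, y_pred_continuous):
    """Round continuous predictions to the nearest label in {-1, 0, 1}
    and return the fraction that match the true labels."""
    y_class = np.clip(np.rint(y_pred_continuous), -1, 1).astype(int)
    return float(np.mean(y_class == np.asarray(y_true)))

# Hypothetical labels and regression outputs.
y_true = np.array([1, 0, -1, 0, 1])
y_pred = np.array([0.4, 0.1, -0.2, -0.6, 0.7])

mse = float(np.mean((y_true - y_pred) ** 2))
accuracy = direction_accuracy(y_true, y_pred)

print(f"MSE: {mse:.3f}")            # MSE: 0.292
print(f"Accuracy: {accuracy:.2f}")  # Accuracy: 0.40
```

The two numbers measure different things: here the MSE is 0.292 while only 40 percent of the rounded predictions match the labels.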