<h1 style="text-align: center"> Using Semantic analysis of news headlines to predict stock price direction</h1>
<div style ="text-align: center">
    <h4> 24COC102 - Advanced Artificial Intelligence Systems</h4>
    <h4> By Kenan Palmer (F123624) </h4>
</div>

<h2 style="text-align: center"> Abstract</h2>
<p>
This project is a tutorial on building an artificial neural network with PyTorch that predicts whether the price of a particular stock will increase or decrease, using news headlines as input. The data is scraped from the web and retrieved through API calls.
</p>

<h2 style="text-align: center"> Learning Outcomes</h2>
<ul>
    <li>Basic Web Scraping</li>
    <li>Pandas Dataframe Manipulation</li>
    <li>How to Download and Use a Pretrained Large Language Model with Hugging Face</li>
    <li>How to Preprocess Data</li>
    <li>How to Create and Train an Artificial Neural Network with PyTorch</li>
    <li>How to Compare the Performance of Different Artificial Neural Networks</li>
</ul>

<h2 style="text-align: center"> Table of Contents</h2>
<ol>
    <li>Web scraping to get news headlines</li>
    <li>Downloading a large language model from Hugging Face</li>
    <li>Running sentiment analysis on the news headlines and computing a 'sentiment score' for each day</li>
    <li>Retrieving stock price data from Yahoo Finance</li>
    <li>Combining and preprocessing data</li>
    <li>Creating an Artificial Neural Network with PyTorch</li>
    <li>Training the network with data</li>
    <li>Testing and comparing different hyperparameters and configurations of the Artificial Neural Network</li>
    <li>Evaluating the model</li>
</ol>

<h3 style="text-align: center"> Libraries required</h3>

In [1]:
#pip install pandas numpy matplotlib scikit-learn yfinance bs4 requests transformers lxml torch

In [2]:
import pandas as pd
import requests
import yfinance as yf
import numpy as np
import matplotlib.pyplot as plt


from sklearn.model_selection import train_test_split, GridSearchCV, ParameterGrid
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler

from bs4 import BeautifulSoup

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

<h2 style="text-align: center"> Step 1: Web Scraping to Get News Headlines</h2>
<p>
    The first step is to gather news headlines with their corresponding dates. This step is largely based on this video: https://www.youtube.com/watch?v=5tpEDlUCzjk. We will be using <b>Business Insider</b>, which has a search feature that lists news articles related to a specific stock, sorted by date. We will use the <b>requests</b> package to send each request and the <b>BeautifulSoup</b> package to parse the response and extract each headline with its date.
</p>
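
<p>
    To show how the parsing step works without touching the network, here is a minimal sketch that runs BeautifulSoup on a small, made-up HTML fragment. The fragment is hypothetical: it reuses the class names that the scraper below searches for, but the real page markup may differ or change over time. The sketch also skips any story with a missing element rather than failing on it, which is a design choice and not part of the original scraper.
</p>

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment using the class names the scraper looks for;
# the real page markup may differ or change over time
html = """
<div class="latest-news__story">
  <time class="latest-news__date" datetime="2018-08-02 10:00:00">1d</time>
  <a class="news-link" href="#">Example headline one</a>
</div>
<div class="latest-news__story">
  <time class="latest-news__date" datetime="2018-08-03 09:30:00">2d</time>
  <a class="news-link" href="#">Example headline two</a>
</div>
"""

# Python's built-in parser is used here so that lxml is not required for the sketch
soup = BeautifulSoup(html, 'html.parser')

rows = []
for article in soup.find_all('div', class_='latest-news__story'):
    link = article.find('a', class_='news-link')
    time_tag = article.find('time', class_='latest-news__date')
    # Skip stories that are missing either element instead of raising AttributeError
    if link is None or time_tag is None:
        continue
    rows.append((time_tag.get('datetime'), link.text.strip()))

print(rows)
```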

In [3]:
# Select Apple stock, but you can select any (Apple's ticker symbol is AAPL)
stock = 'aapl'

In [4]:
#create dataframe to store data
columns = ['Date', 'Headline']
df = pd.DataFrame(columns = columns)

In [5]:

counter = 0
numberOfPages = 300

#Loop through the result pages -> more pages means more historical data
for page in range(1, numberOfPages):
    #Get html webpage for the selected stock
    url = f'https://markets.businessinsider.com/news/{stock}-stock?p={page}&'
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, 'lxml')

    #Select divs that hold each news headline
    articles = soup.find_all('div', class_ = 'latest-news__story')
    for article in articles:
        #Extract news headline and date
        headline = article.find('a', class_ = 'news-link').text
        date = article.find('time', class_ = 'latest-news__date').get('datetime')

        #Store data in dataframe
        df = pd.concat([pd.DataFrame([[date, headline]], columns=df.columns), df], ignore_index=True)
        counter +=1

#Turn date column to datetime using pandas 
df['Date'] = pd.to_datetime(df['Date']).dt.date
print(f"number of articles gathered: {counter}")

#Store the dataframe as a csv file so you do not have to collect the data every time
df.to_csv('news_data.csv')

#Result
df.head()

number of articles gathered: 14950


Unnamed: 0,Date,Headline
0,2018-08-02,UPDATE 1-Apple working with Chinese telecom fi...
1,2018-08-02,'You're getting nothing': Steve Jobs' daughter...
2,2018-08-02,UPDATE 2-Apple in touch with Chinese telcos on...
3,2018-08-02,Apple in touch with Chinese telcos on ways to ...
4,2018-08-02,Apple's code may have just revealed details ab...


In [6]:
#Load the dataframe from the csv file - so you do not have to collect the data every time
df = pd.read_csv("news_data.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Date,Headline
0,0,2018-08-02,UPDATE 1-Apple working with Chinese telecom fi...
1,1,2018-08-02,'You're getting nothing': Steve Jobs' daughter...
2,2,2018-08-02,UPDATE 2-Apple in touch with Chinese telcos on...
3,3,2018-08-02,Apple in touch with Chinese telcos on ways to ...
4,4,2018-08-02,Apple's code may have just revealed details ab...


<h2 style="text-align: center"> Step 2(A): Downloading a Large Language Model From Hugging Face</h2>

<p>
Step 2 uses the <b>transformers</b> library from <b>Hugging Face</b>. As you will see, it lets you use pretrained AI models with very little configuration or setup. We will create a pipeline that classifies financial text. Put simply, the model takes some text and returns whether the text is positive, neutral or negative. The model also returns a second value representing its confidence in that label; the closer this value is to 1, the more confident the model is in the label it chose.
</p>

<p>
In order to use these values we need to map them to numbers that the artificial neural network we create can use. We map the labels 'positive', 'neutral' and 'negative' to 1, 0 and -1 respectively and multiply the result by the confidence value. We will call this value the 'score'. It is intended as a feature for our own artificial neural network later on.
</p>

In [7]:
# Function to get sentiment class and a confidence value - takes a pipeline and a text and then returns the sentiment
def get_sentiment_scores(sentiment_pipeline, text):

    # Map sentiment labels to integer values
    sentiment_map = {
        "positive": 1,
        "negative": -1,
        "neutral": 0
    }

    # Return the integer value of the label and its confidence value
    result = sentiment_pipeline(text)[0]
    label_int = sentiment_map.get(result['label'], 0)  # Default to 0 if label is unknown

    return label_int, result['score']

<p>
We will be using the natural language processing model FinBERT (more information can be found at <a href = 'https://arxiv.org/abs/1908.10063'>https://arxiv.org/abs/1908.10063</a>). We need to download the model from the online repository Hugging Face. FinBERT's 'model card' can be found at https://huggingface.co/ProsusAI/finbert.
</p>
<p>
In summary, FinBERT is built by further training Google's Bidirectional Encoder Representations from Transformers (BERT) language model. BERT is designed to understand the context behind words by considering the relationship each word has to the words on either side of it in a sentence. FinBERT is further trained on a "large financial corpus" to gain a better, more targeted understanding of financial texts.
</p>

In [8]:
model_name = "ProsusAI/finbert"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

sentiment_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer, framework="pt")

Device set to use cpu


<h4 style="text-align: center"> Here are a few examples to demonstrate the finbert pipeline working. Remember that a 1 represents positive, 0 neutral and -1 negative texts </h4>

In [9]:
print("Apple stock plummeted 20% after revenue miss, investors shaken:",
      get_sentiment_scores(sentiment_pipeline, "Apple stock plummeted 20% after revenue miss, investors shaken."))
print("Stocks rallied and the British pound gained.: ",
      get_sentiment_scores(sentiment_pipeline, "Stocks rallied and the British pound gained."))
print("The weather will be okay today.: ",
      get_sentiment_scores(sentiment_pipeline, "The weather will be okay today."))
print("bla bla bla bla bla bla bla: ",
      get_sentiment_scores(sentiment_pipeline, "bla bla bla bla bla bla bla"))

Apple stock plummeted 20% after revenue miss, investors shaken: (-1, 0.9701551795005798)
Stocks rallied and the British pound gained.:  (1, 0.8983617424964905)
The weather will be okay today.:  (1, 0.5298416018486023)
bla bla bla bla bla bla bla:  (0, 0.8873725533485413)


In [10]:
#Create 2 new columns, sentiment_label and confidence, by passing each headline through the model, then combine them into one column called score
#This might take some time
df[['sentiment_label', 'confidence']] = df["Headline"].apply(lambda text: pd.Series(get_sentiment_scores(sentiment_pipeline, text)))
df['score'] = df['sentiment_label'] * df['confidence']

In [11]:
df.head()

Unnamed: 0.1,Unnamed: 0,Date,Headline,sentiment_label,confidence,score
0,0,2018-08-02,UPDATE 1-Apple working with Chinese telecom fi...,1.0,0.941072,0.941072
1,1,2018-08-02,'You're getting nothing': Steve Jobs' daughter...,-1.0,0.500521,-0.500521
2,2,2018-08-02,UPDATE 2-Apple in touch with Chinese telcos on...,1.0,0.902042,0.902042
3,3,2018-08-02,Apple in touch with Chinese telcos on ways to ...,1.0,0.808363,0.808363
4,4,2018-08-02,Apple's code may have just revealed details ab...,0.0,0.944104,0.0


<h2 style="text-align: center"> Step 2(B): Grouping by Date</h2>

<p>
Now that we have a sentiment score for each headline, we group the scores by the date of each article and use the mean score as the sentiment score for that date.
</p>

In [12]:
date_sentiment_df = df.groupby('Date').agg(
    avg_sentiment_score=('score', 'mean'),
).reset_index()


In [13]:
date_sentiment_df.head()

Unnamed: 0,Date,avg_sentiment_score
0,2018-08-02,0.161824
1,2018-08-03,0.303217
2,2018-08-04,0.86968
3,2018-08-06,-0.504943
4,2018-08-07,0.073535


<h2 style="text-align: center"> Step 3(A): Gathering Historical Stock Data</h2>

<p>
 We cannot use only the sentiment value in our artificial neural network; we need to include other features. This data can be retrieved from Yahoo Finance through the yfinance library. We request the stock data for the dates we have a sentiment score for and join the two dataframes together to give us our features (X). You can improve the model by adding columns with analytical data such as simple moving averages, RSI indicators or volatility. To keep the model simple we use only a few input columns; adding such columns is left as an exercise for the reader.
</p>
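
<p>
As a starting point for that exercise, the sketch below computes a simple moving average and a rolling volatility with pandas. The prices, the 3-day window and the column names <b>sma_3</b>, <b>return</b> and <b>volatility_3</b> are all illustrative assumptions and are not used elsewhere in this tutorial.
</p>

```python
import pandas as pd

# Hypothetical close prices; in the tutorial these would come from the 'Close' column
close = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0])

features = pd.DataFrame({'Close': close})

# 3-day simple moving average: mean of the current close and the two before it
features['sma_3'] = close.rolling(window=3).mean()

# Daily return as a fraction of the previous close
features['return'] = close.pct_change()

# 3-day rolling volatility: standard deviation of the last three daily returns
features['volatility_3'] = features['return'].rolling(window=3).std()

print(features)
```

<p>
The first rows of each new column are NaN because a full window of history is not yet available, so they would need to be dropped before training.
</p>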

In [14]:
# Download Apple stock data
start = date_sentiment_df['Date'].min()
end = date_sentiment_df['Date'].max()

stock_df = yf.download(stock, start=start, end=end)

YF.download() has changed argument auto_adjust default to True


[*********************100%***********************]  1 of 1 completed


In [15]:
# Reset the index of each pandas dataframe to make 'Date' a column in both DataFrames
# Resetting the index and dropping the extra column level removes any multi-level indexing
date_sentiment_df.reset_index(inplace=True)
stock_df.reset_index(inplace=True)
stock_df.columns = stock_df.columns.droplevel(1)

# Ensure 'Date' is in the same format for both DataFrames
date_sentiment_df['Date'] = pd.to_datetime(date_sentiment_df['Date'])
stock_df['Date'] = pd.to_datetime(stock_df['Date'])

In [16]:
print(stock_df.head())
print('-' * 30)
print(date_sentiment_df.head())

Price       Date      Close       High        Low       Open     Volume
0     2018-08-02  49.122517  49.357010  47.455020  47.509497  249616000
1     2018-08-03  49.264641  49.442286  48.670118  49.037253  133789600
2     2018-08-06  49.520443  49.563076  49.046722  49.267000  101701600
3     2018-08-07  49.056206  49.622303  48.973303  49.579669  102349600
4     2018-08-08  49.089359  49.222001  48.442731  48.805127   90102000
------------------------------
   index       Date  avg_sentiment_score
0      0 2018-08-02             0.161824
1      1 2018-08-03             0.303217
2      2 2018-08-04             0.869680
3      3 2018-08-06            -0.504943
4      4 2018-08-07             0.073535


In [17]:
#Merge the two dataframes on their 'Date' columns
data = pd.merge(date_sentiment_df, stock_df, on='Date', how='inner')
data.head()

Unnamed: 0,index,Date,avg_sentiment_score,Close,High,Low,Open,Volume
0,0,2018-08-02,0.161824,49.122517,49.35701,47.45502,47.509497,249616000
1,1,2018-08-03,0.303217,49.264641,49.442286,48.670118,49.037253,133789600
2,3,2018-08-06,-0.504943,49.520443,49.563076,49.046722,49.267,101701600
3,4,2018-08-07,0.073535,49.056206,49.622303,48.973303,49.579669,102349600
4,5,2018-08-08,0.0,49.089359,49.222001,48.442731,48.805127,90102000


<h2 style="text-align: center"> Step 3(B): Label data</h2>

<p>
To turn this into a classification problem we need a label for each row of data. We use an integer value of 1 or 0. As computed by the code below, 1 means the close price is lower than the previous trading day's close and 0 means it is not. In this section we will also add a few columns holding the change in close price over the previous one to three days.
</p>

In [18]:
#Create target value
# Target is 1 if the previous day's close is higher than this day's close (the price fell), otherwise 0
data['Target'] = (data['Close'].shift(1) > data['Close']).astype(int)
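
<p>
The cell above compares each close with the previous day's close. If the goal is to predict the direction of the <i>next</i> day's move, one possible alternative is to compare the next close with the current one using <b>shift(-1)</b>. The sketch below assumes that goal; the prices and the <b>next_close</b> column are hypothetical and are not part of the pipeline in this tutorial.
</p>

```python
import pandas as pd

# Hypothetical close prices for five consecutive trading days
prices = pd.DataFrame({'Close': [10.0, 11.0, 10.5, 10.5, 12.0]})

# next_close holds the following day's close; the last row has no next day, so it is NaN
prices['next_close'] = prices['Close'].shift(-1)

# Drop the last row, whose label cannot be known yet
labelled = prices.dropna(subset=['next_close']).copy()

# 1 if the next close is strictly higher than the current close, otherwise 0
labelled['Target'] = (labelled['next_close'] > labelled['Close']).astype(int)

print(labelled['Target'].tolist())
```

<p>
With a label built this way, every feature in a row is known before the outcome it is paired with.
</p>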

In [19]:
#Add the change in close price over the previous 1 to DAYS_LAG days as separate columns
DAYS_LAG = 3
lags = {f'lag_{i}': data['Close'].diff(i) for i in range(1, DAYS_LAG + 1)}
data = data.assign(**lags)

# Some dates won't have all the information available, such as the first date, which has no previous data to look at
# We remove those rows so the input structure for our ANN stays the same
data.dropna(inplace=True)

In [20]:
data.head()

Unnamed: 0,index,Date,avg_sentiment_score,Close,High,Low,Open,Volume,Target,lag_1,lag_2,lag_3
3,4,2018-08-07,0.073535,49.056206,49.622303,48.973303,49.579669,102349600,1,-0.464237,-0.208435,-0.066311
4,5,2018-08-08,0.0,49.089359,49.222001,48.442731,48.805127,90102000,0,0.033154,-0.431084,-0.175282
5,6,2018-08-09,0.0,49.475437,49.68861,49.07751,49.629395,93970400,0,0.386078,0.419231,-0.045006
6,7,2018-08-10,0.0,49.328083,49.70126,49.123668,49.287676,98444800,1,-0.147354,0.238724,0.271877
7,9,2018-08-13,-0.402896,49.64658,50.140978,49.368481,49.751164,103563600,0,0.318497,0.171143,0.55722


<h2 style="text-align: center"> Step 4: Preprocess Data </h2>

<p>
As you can see, the values in the data above differ by orders of magnitude. If we fed these values in directly, the columns with values in the millions would have a greater influence on the output. To counter this we scale the columns. We use scikit-learn's StandardScaler, which subtracts each column's mean and divides by its standard deviation. An alternative would be to scale using the highest and lowest values of each column. Scaling ensures that each feature has roughly the same influence at the start of training, which greatly speeds up tuning the weights of the ANN.
</p>
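
<p>
To make the scaling concrete, the sketch below standardises a tiny, made-up feature matrix by hand with NumPy. The numbers are hypothetical. It mirrors the calculation described above: the mean and standard deviation are taken from the training rows only and then reused for the test rows.
</p>

```python
import numpy as np

# Hypothetical feature matrix: one price-like column and one volume-like column
train = np.array([[49.0, 100_000_000.0],
                  [50.0, 200_000_000.0],
                  [51.0, 300_000_000.0]])
test = np.array([[52.0, 400_000_000.0]])

# Statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# Apply the same shift and scale to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled)
print(test_scaled)
```

<p>
After scaling, both columns of the training data have mean 0 and standard deviation 1, even though the raw values differed by a factor of millions.
</p>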

In [21]:
# Data scaler for the input features
scaler_X = StandardScaler()

# Create X: input to the model
# Note: avg_sentiment_score is not in this list; add it here to feed the sentiment feature to the network
X = data[['High', 'Close', 'Open', 'Low', 'Volume'] + [f'lag_{i}' for i in range(1, DAYS_LAG + 1)]]

# Create Y: target we want to predict
Y = data[['Target']]

# Split data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only, then apply the same scaling to the test data
X_train = scaler_X.fit_transform(X_train)
X_test = scaler_X.transform(X_test)

# No need to scale Y as it is already either 0 or 1


In [22]:
print("The first 5 input values:")
print(X_train[0:5])
print("-" * 30)
print("The first 5 Target values:")
print(Y_train[0:5])

The first 5 input values:
[[ 0.22405816  0.24088793  0.22503185  0.24084783 -0.67794393  0.35454827
   0.23545838  0.66470973]
 [ 0.34204296  0.36050731  0.3302144   0.34485348 -0.51004593  1.11631696
   0.35111993  0.59975307]
 [ 0.21327284  0.22501699  0.18855458  0.19381892 -0.35783227  0.82579704
   1.68201991  1.24437961]
 [ 0.77810197  0.79723569  0.75404743  0.78124414 -0.46217569  0.99508649
   1.43607095  0.9361816 ]
 [-0.29790176 -0.31984675 -0.27948075 -0.29997165  0.39646851 -0.70637681
  -0.66610517 -0.54772762]]
------------------------------
The first 5 Target values:
      Target
810        0
1012       0
1043       0
1302       0
538        1


<h2 style="text-align: center"> Step 5(A): Setup for ANN using pytorch</h2>
<p>We will be using PyTorch to create and train an artificial neural network. We will create a class called ANN which takes 2 parameters: the size of the input layer and the size of the hidden layer. The input layer maps to the X values we've created, and the output layer is fixed at a single output that gives the ANN's prediction.</p>

<p>We will then create a parameter grid with different values for the parameters of the model and training. </p>

In [23]:
#Define a simple PyTorch ANN - information about Pytorch ANN classes can be found here https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html
class ANN(nn.Module):
    OUTPUT_DIMENSION = 1 # Output dimension should not be changed: we keep a single output
    
    def __init__(self, input_dimension, hidden_dimension):
        super().__init__()
        self.fc1 = nn.Linear(input_dimension, hidden_dimension)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dimension, self.OUTPUT_DIMENSION)

        # Sigmoid squashes the output into the range (0, 1) so it can be read as a probability
        self.activation = nn.Sigmoid()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return self.activation(x)

In [24]:
# Define model hyperparameters - we will try every combination of the parameters below to find a good model
parameter_grid = {
    'hidden_dimension': [2,4,8,16, 32, 64],  # Size of the hidden layer
    'batch_size': [16, 32, 64],              # Number of samples processed before updating weights
    'learning_rate': [0.001, 0.1],           # Learning rate
    'epochs' : [50, 100, 200]                # Number of passes through the whole training data
}
}
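
<p>
It helps to know how many models a grid like this implies before starting a search. The sketch below enumerates every combination with plain Python's <b>itertools.product</b>, as a stand-in for what scikit-learn's ParameterGrid produces; the ordering shown here is an assumption of this sketch, not a statement about ParameterGrid's own ordering.
</p>

```python
from itertools import product

# Same values as the parameter grid defined above
parameter_grid = {
    'hidden_dimension': [2, 4, 8, 16, 32, 64],
    'batch_size': [16, 32, 64],
    'learning_rate': [0.001, 0.1],
    'epochs': [50, 100, 200],
}

# Build one dictionary per combination of values
keys = sorted(parameter_grid)
combinations = [
    dict(zip(keys, values))
    for values in product(*(parameter_grid[key] for key in keys))
]

print(len(combinations))  # 6 * 3 * 2 * 3 = 108
print(combinations[0])
```

<p>
Each of the 108 combinations means training one model from scratch, which is why the search later in this tutorial can take a while.
</p>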

<h2 style="text-align: center"> Step 5(B): Setup Training Loop And Evaluation</h2>
<p> We need to define a training loop that can be used by each different permutation of model and hyperparameters. To train a model we need a way to measure how wrong its predictions are and a way to update the weights during training. We will be using PyTorch's <b>'BCELoss', or Binary Cross Entropy.</b> BCELoss measures how close the predicted value is to the actual value with the equation <b>Loss = −(y·log(p) + (1 − y)·log(1 − p)), where y is the actual value and p is the predicted value</b></p>
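
<p> As a worked example of the equation, the sketch below evaluates it by hand for a single sample whose actual value is 1. The helper name <b>bce</b> and the chosen probabilities are illustrative only. Note that PyTorch's BCELoss averages this quantity over the batch by default.</p>

```python
import math

def bce(y, p):
    """Binary cross entropy for one sample: y is the actual value (0 or 1), p the predicted probability."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# The actual value is 1 in all three cases; only the prediction changes
confident_right = bce(1, 0.9)   # prediction close to the actual value
unsure = bce(1, 0.5)            # prediction on the fence
confident_wrong = bce(1, 0.1)   # prediction far from the actual value

print(round(confident_right, 3), round(unsure, 3), round(confident_wrong, 3))
```

<p> The loss grows sharply as a confident prediction turns out to be wrong, which is what pushes the weights away from such predictions during training.</p>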

<p> We need an optimiser to update the weights of the model so that the loss decreases and the model's prediction accuracy improves. We will be using <b>Adam (Adaptive Moment Estimation)</b>, which is a popular choice. However, other optimisation functions are available and should be considered.</p>
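
<p> For intuition about what the optimiser does on each step, here is a simplified sketch of a single Adam update for one scalar weight, written in plain Python. It follows the standard published update rule and is not PyTorch's implementation; the function name <b>adam_step</b> and the toy objective are assumptions made for illustration.</p>

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a single weight and return the new weight and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad          # running average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2     # running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # correct the bias from starting at zero
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(theta) = theta ** 2, whose gradient is 2 * theta
theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, 2 * theta, m, v, t=1, lr=0.1)

print(theta)
```

<p> On the first step the update is roughly the learning rate in the direction opposite to the gradient, so the weight moves from 1.0 to about 0.9 regardless of how large the gradient is.</p>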


In [25]:
# Define the training loop and evaluate a model with given hyperparameters
def train_eval(X_train, Y_train, X_test, Y_test, parameters):

    model = ANN(X_train.shape[1], parameters['hidden_dimension'])               # Create model with given hidden dimension

    loss_func = nn.BCELoss()                                                    # Define loss function
    optimiser = optim.Adam(model.parameters(), lr=parameters['learning_rate'])  # Define optimiser

    # Pair each input row with its target: list(zip(...)) gives a list of (x, y) tuples
    # The data loader returns differently shuffled batches of these samples to train on for each epoch
    training_loader = DataLoader(list(zip(X_train, Y_train)), batch_size=parameters['batch_size'], shuffle=True)

    # Iterate over the whole training data once per epoch
    for epoch in range(parameters['epochs']):
        for batch_x, batch_y in training_loader:
            optimiser.zero_grad()                  # Clear gradients from the previous step
            predictions = model(batch_x)           # Get predictions
            loss = loss_func(predictions, batch_y) # Compare predictions with targets and get loss
            loss.backward()                        # Compute the gradients via backpropagation
            optimiser.step()                       # Update model's weights based on the gradients

    # Evaluate model's performance and return accuracy
    model.eval()                                                     # Set to evaluation mode
    with torch.no_grad():                                            # No gradients needed - saves computation
        y_pred = model(X_test)                                       # Get predictions, shape (N, 1)
        y_pred = (y_pred > 0.5).float()                              # Classify predictions as 1 or 0 - cast to float to match Y_test
        # Compare tensors of the same shape (N, 1); squeezing only one side would broadcast to (N, N)
        accuracy = (y_pred == Y_test).sum().item() / Y_test.size(0)  # Fraction of test samples correctly predicted

    return accuracy

    

In [26]:
#Wrap data in PyTorch tensors - necessary to use the PyTorch model
X_train = torch.tensor(X_train, dtype=torch.float32)
Y_train = torch.tensor(Y_train.values, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
Y_test = torch.tensor(Y_test.values, dtype=torch.float32)

In [None]:
# Train and evaluate a model for each combination of parameters - ParameterGrid is a scikit-learn class that generates every combination from a parameter grid
# Note: this might take a while
results = []
for parameters in ParameterGrid(parameter_grid):
    # train_eval creates its own model, so none is created here
    results.append([parameters, train_eval(X_train, Y_train, X_test, Y_test, parameters)])
    
#print(results)

In [None]:
max_result = max(results, key=lambda x: x[1])
print("Best parameters:", max_result[0])
# Accuracy is a fraction between 0 and 1, so format it as a percentage
print(f"Highest percentage correct: {max_result[1]:.2%}")
