# **Financial Headlines Sentiment Analysis**
# Author: Jakov Vodanović

# Loading data and essential libraries.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


To run this notebook yourself, change the path below to your own project folder.

In [2]:
%cd /content/drive/MyDrive/Arhitekture neuronskih mreža/Projekt

/content/drive/MyDrive/Arhitekture neuronskih mreža/Projekt


In [3]:
import numpy as np
import pandas as pd
import tensorflow as tf
import time
from sklearn.model_selection import train_test_split

In [4]:
df = pd.read_csv('archive/all-data.csv', encoding='latin-1')
print(df.head())
print("Total positive:", len(df[df['sentiment'] == 'positive']))
print("Total negative:", len(df[df['sentiment'] == 'negative']))
print("Total neutral:", len(df[df['sentiment'] == 'neutral']))

  sentiment                                           headline
0   neutral  According to Gran , the company has no plans t...
1   neutral  Technopolis plans to develop in stages an area...
2  negative  The international electronic industry company ...
3  positive  With the new production plant the company woul...
4  positive  According to the company 's updated strategy f...
Total positive: 1363
Total negative: 604
Total neutral: 2879


# Data processing.

In [5]:
# tokenize words in headlines
headlines = df['headline']
print(type(headlines))
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(headlines)
print(len(tokenizer.word_index)) # about 10,000 unique words is a small vocabulary, so we won't set a cap here

<class 'pandas.core.series.Series'>
10122


In [6]:
sequences = tokenizer.texts_to_sequences(headlines)

In [None]:
# pad sequences to equal length
max_length = max(len(seq) for seq in sequences)
print(f"Max length: {max_length}")
sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding="post", maxlen=max_length)
print(sequences.shape)
print(sequences)
# check randomly to ensure they are all padded to the correct length
print(len(sequences[66]))
print(len(sequences[77]))

Max length: 71
(4846, 71)
[[  94    5 3498 ...    0    0    0]
 [ 840  336    5 ...    0    0    0]
 [   1  293  656 ...    0    0    0]
 ...
 [  42   31  242 ...    0    0    0]
 [  30   27    2 ...    0    0    0]
 [  27    3   35 ...    0    0    0]]
71
71


It appears that all are of the same length, and the padding was successful.

In [None]:
# get sentiments (labels); sparse_categorical_crossentropy expects integer class ids, so we map the strings to integers
sentiment_mapping = {
    "negative" : 0,
    "positive" : 1,
    "neutral" : 2
}
y = df['sentiment'].map(sentiment_mapping)

x_train, x_test, y_train, y_test = train_test_split(sequences, y, train_size = 0.7, shuffle = True, random_state = 1)

In [None]:
print(x_train)
print(list(y_train))

[[5442  510   16 ...    0    0    0]
 [  22 1628    4 ...    0    0    0]
 [1141  936  136 ...    0    0    0]
 ...
 [   1  419   16 ...    0    0    0]
 [2586  123 3247 ...    0    0    0]
 [  30  615  555 ...    0    0    0]]
[1, 0, 2, 2, 1, 0, 0, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 0, 1, 2, 1, 2, 1, 1, 2, 1, 2, 2, 1, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 0, 2, 1, 1, 1, 2, 2, 0, 2, 2, 2, 1, 0, 1, 2, 1, 0, 0, 0, 1, 1, 2, 2, 1, 0, 2, 2, 2, 2, 1, 2, 0, 2, 0, 0, 2, 2, 0, 2, 2, 2, 1, 2, 1, 1, 0, 1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 0, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 0, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 2, 2, 0, 2, 2, 1, 2, 2, 2, 2, 0, 1, 1, 1, 1, 2, 2, 0, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 2, 0, 2, 1, 2, 0, 2, 2, 2, 1, 1, 1, 2, 2, 2, 0, 1, 1, 0, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 0, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 

First, we map each word to a distinct integer and encode every headline as a sequence of those integers. The integers carry no meaning of their own; they are just identifiers. The first layer of the network, the Embedding layer, is what makes them useful. It takes the integer-encoded headlines as input (padded with zeros and/or truncated to a uniform length) and outputs, for every word, a dense vector of a chosen size. Why is this better than the one-hot encoding used for categorical features? Firstly, each one-hot vector would have more than 10,000 entries here, with a single 1 and the rest 0s, which is impractical. Secondly, all one-hot vectors are equally distant from each other, so they carry no notion of similarity. Because the Embedding is trained together with the rest of the network, words that play a similar role for the task can end up closer to each other in Euclidean distance.
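To make the comparison with one-hot encoding concrete, here is a minimal NumPy sketch (not the Keras implementation). It shows that looking up a row of the embedding matrix gives the same vector as multiplying a one-hot vector by that matrix, and that any two distinct one-hot vectors are the same distance apart. The vocabulary size, embedding size, and random matrix are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 8, 3                   # toy sizes, assumed for illustration
E = rng.normal(size=(vocab_size, embed_dim))   # stands in for a trainable embedding matrix

word_id = 5
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

via_one_hot = one_hot @ E   # dense product with a mostly-zero vector
via_lookup = E[word_id]     # what an embedding lookup effectively does: pick one row

# Any two distinct one-hot vectors are exactly sqrt(2) apart,
# so one-hot encoding cannot express that two words are similar.
other = np.zeros(vocab_size)
other[2] = 1.0
one_hot_distance = np.linalg.norm(one_hot - other)
```

With a real vocabulary of about 10,000 words, the lookup avoids building a 10,000-entry vector for every token.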

# RNN Basics

Let's illustrate a simple RNN model, train it, and observe the results.

<img src='https://drive.google.com/uc?id=1cCqTXurPJYd7BwyIpfrsOo_1FU0kxx0U' width='90%'>

<br><br>

$$h_t = f_W(h_{t-1}, x_t)$$

>The same function $f$ and parameters $W$ are used at each step

1. input at step $t$:
 - $x_t$
2. hidden state update:
 - $h_t = \tanh(W_{hh}^Th_{t-1} + W_{xh}^Tx_t + b_h)$
3. output:
 - $\hat{y}_t = W_{hy}^Th_t + b_y$

During training, we update the weights $W_{hh}$, $W_{xh}$, $W_{hy}$ and the biases $b_h$, $b_y$.
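The three steps above can be sketched in a few lines of NumPy. This is only an illustration of the recurrence, not what `tf.keras.layers.SimpleRNN` does internally; the function name `rnn_forward`, the toy dimensions, and the random weights are assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over xs (shape: steps x input_dim); return all h_t and y_t."""
    h = np.zeros(W_hh.shape[0])          # h_0 starts at zero
    hs, ys = [], []
    for x_t in xs:
        # hidden state update: h_t = tanh(W_hh^T h_{t-1} + W_xh^T x_t + b_h)
        h = np.tanh(W_hh.T @ h + W_xh.T @ x_t + b_h)
        hs.append(h)
        # output: y_t = W_hy^T h_t + b_y
        ys.append(W_hy.T @ h + b_y)
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, steps = 4, 5, 3, 6   # toy sizes
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(hidden_dim, output_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

xs = rng.normal(size=(steps, input_dim))
hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y)
```

Note that the same weight matrices are reused at every step of the loop, which is exactly the parameter sharing described above.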


Before training the baseline model, we need to discuss its parameters. Firstly, in the Embedding layer we must choose the size of the vector that represents each word. With one-hot encoding this would be the size of the vocabulary, but here we want something much smaller. A larger vector can capture more about a word, but it also adds parameters and slows down training. There is no single dimension that works well for every dataset. Considering that the pre-trained Word2Vec vectors for Google News use 300 dimensions for a much larger vocabulary, it doesn't make sense to go beyond 300. We will try 50 and 100 dimensions and compare effectiveness, defined as: accuracy - 0.001*seconds.
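The effectiveness measure is simple enough to wrap in a small helper. The function name `effectiveness` and the parameter `penalty_per_second` are hypothetical names introduced here; the example numbers are made up and are not results from this notebook.

```python
def effectiveness(accuracy, seconds, penalty_per_second=0.001):
    """Accuracy penalised by training time: accuracy - 0.001 * seconds."""
    return accuracy - penalty_per_second * seconds

# Made-up example: 60% accuracy reached after 40 seconds of training.
score = effectiveness(accuracy=0.60, seconds=40.0)
```

With a penalty of 0.001 per second, every 10 seconds of training costs as much as one percentage point of accuracy, so the measure favours models that train quickly.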

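We will use the Adam optimizer. It combines momentum with per-parameter adaptive learning rates, which in practice often converges faster and needs less tuning than plain SGD, although it does not guarantee escaping poor local minima.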

# 50-dimensional vector.

In [None]:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=10123,
                                    output_dim=50,
                                    input_length=x_train.shape[1]))
model.add(tf.keras.layers.SimpleRNN(50))
model.add(tf.keras.layers.Dense(3, activation='softmax'))

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

In [None]:
start_time = time.time()

history = model.fit(
    x_train,
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=10,
)


end_time = time.time()
elapsed_time = end_time - start_time
print("Elapsed time: ", elapsed_time)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Elapsed time:  143.57347583770752


In [None]:
_, acc = model.evaluate(x_test, y_test)



In [None]:
acc - 0.001*elapsed_time

0.4410207738876343

As we can see, the accuracy on the validation set hardly changes. Let's implement early stopping to save time.

# Early stopping
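The idea behind early stopping fits in a short loop: remember the best validation loss seen so far and stop once it has not improved for `patience` consecutive epochs. Keras provides this as the `EarlyStopping` callback; the sketch below only mirrors the idea and is not the Keras source code. The helper name `early_stopping_epoch` and the validation losses are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-indexed, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0                      # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch   # stop; best weights came from best_epoch
    return len(val_losses), best_epoch     # never triggered: ran all epochs

# Made-up losses: the first epoch is the best, then three epochs without improvement.
val_losses = [0.95, 0.97, 0.96, 0.98, 0.99]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
```

In this made-up run, training stops after epoch 4 and the weights from epoch 1 would be the ones to restore.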

In [None]:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=10123,
                                    output_dim=50,
                                    input_length=x_train.shape[1]))
model.add(tf.keras.layers.SimpleRNN(50))
model.add(tf.keras.layers.Dense(3, activation='softmax'))

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

In [None]:
# stop training if the validation loss does not improve for 3 consecutive epochs
callback_ES = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

start_time = time.time()

history = model.fit(
    x_train,
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=10,
    callbacks=[callback_ES]
)


end_time = time.time()
elapsed_time = end_time - start_time
print("Elapsed time: ", elapsed_time)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Elapsed time:  41.82280445098877


In [None]:
_, acc = model.evaluate(x_test, y_test)



In [None]:
acc - 0.001*elapsed_time

0.5640919075012207

According to our effectiveness measure, the early stopping model is much better.

# 100-dimensional vector

In [None]:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=10123,
                                    output_dim=100,
                                    input_length=x_train.shape[1]))
model.add(tf.keras.layers.SimpleRNN(100))
model.add(tf.keras.layers.Dense(3, activation='softmax'))

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

In [None]:
# stop training if the validation loss does not improve for 3 consecutive epochs
callback_ES = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

start_time = time.time()

history = model.fit(
    x_train,
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=10,
    callbacks=[callback_ES]
)


end_time = time.time()
elapsed_time = end_time - start_time
print("Elapsed time: ", elapsed_time)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Elapsed time:  55.365785360336304


In [None]:
_, acc = model.evaluate(x_test, y_test)



In [None]:
acc - 0.001*elapsed_time

0.5292284643650055

In this attempt, the 100-dimensional vector scored lower on our effectiveness measure (0.529 versus 0.564). Note that this run also doubled the number of RNN units from 50 to 100. The difference is small, so we opt for the cheaper 50-dimensional vector. Given the small vocabulary, an even smaller dimension might suffice.

We notice that these accuracies hover around 59 percent. At first glance, it may not seem like a terrible result. However, a trivial network that only predicts "neutral" would give us 59 percent accuracy since we have approximately 59 percent neutral news in the data. Therefore, we definitely want higher accuracy than this.
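The majority-class baseline can be checked directly from the class counts printed when the data was loaded. This is plain arithmetic on those counts; the variable names are chosen here for illustration.

```python
# Class counts as printed after loading the data.
counts = {"positive": 1363, "negative": 604, "neutral": 2879}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
# Accuracy of a classifier that always predicts the most frequent class.
baseline_accuracy = counts[majority_class] / total

print(f"Always predicting '{majority_class}' gives {baseline_accuracy:.1%} accuracy")
```

Any model worth keeping has to beat this number, which is about 59.4 percent on the full dataset.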

# GRU

Our model learns very slowly and stops improving after a few epochs, which triggers early stopping. Since we used a simple RNN, we have likely run into the vanishing gradient problem. It arises from how gradients are propagated back through time: the gradient with respect to an early time step is a product of many per-step derivatives, and with sigmoid or hyperbolic tangent activations these factors are often smaller than 1, so the product shrinks toward zero. As a result, backpropagation barely updates the weights based on the early words of a headline. Let's change the model to a GRU now.
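A scalar toy RNN is enough to see the effect. The sketch below tracks how strongly the hidden state at step $t$ depends on the initial state by multiplying the per-step derivatives. The recurrent weight `w=0.9` and the random inputs are assumptions for illustration; 71 steps matches the padded headline length.

```python
import numpy as np

def gradient_through_time(w, xs):
    """For the scalar RNN h_t = tanh(w*h_{t-1} + x_t), return |d h_t / d h_0| for each t."""
    h, grad, grads = 0.0, 1.0, []
    for x_t in xs:
        h = np.tanh(w * h + x_t)
        grad *= w * (1.0 - h ** 2)   # chain rule factor: d h_t / d h_{t-1}
        grads.append(abs(grad))
    return grads

rng = np.random.default_rng(0)
xs = rng.normal(size=71)             # one made-up input per time step
grads = gradient_through_time(w=0.9, xs=xs)

print(f"after 1 step: {grads[0]:.3e}, after 71 steps: {grads[-1]:.3e}")
```

Each factor has magnitude at most 0.9 here, so after 71 steps the gradient is below $0.9^{71}$, which is less than 0.001. Gated architectures such as GRU and LSTM were designed to soften this decay.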

What is a GRU?

<br>

<img src='https://miro.medium.com/v2/resize:fit:720/format:webp/1*6eNTqLzQ08AABo-STFNiBw.png' width='90%'>

<br>
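In code, one GRU step can be sketched as follows. This is a minimal NumPy illustration, not the `tf.keras.layers.GRU` implementation. It uses one common convention for the final blend; some references swap the roles of $z$ and $1-z$. The parameter dictionary `p`, the toy sizes, and the random weights are assumptions for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step; p is a dict of weight matrices and bias vectors."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])   # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])   # reset gate
    # candidate state: the reset gate decides how much of the old state to use
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    # blend the old state and the candidate according to the update gate
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
input_dim, units = 4, 3   # toy sizes
p = {}
for gate in "zrh":
    p[f"W_{gate}"] = rng.normal(scale=0.5, size=(units, input_dim))
    p[f"U_{gate}"] = rng.normal(scale=0.5, size=(units, units))
    p[f"b_{gate}"] = np.zeros(units)

h = np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, p)
```

When the update gate is close to 0, the old state is copied forward almost unchanged, which gives gradients a path that does not shrink at every step.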


In [None]:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=10123,
                                    output_dim=50,
                                    input_length=x_train.shape[1]))
model.add(tf.keras.layers.GRU(256, activation='tanh', return_sequences=True))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(3, activation='softmax'))

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

Note the use of `return_sequences=True`: the GRU then returns its hidden state at every time step (one output per position in the sequence) instead of only the last one. `Flatten` turns this sequence of states into a single vector for the final `Dense` layer.
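The effect on tensor shapes can be illustrated without TensorFlow. The sketch below uses zero-filled NumPy arrays as stand-ins for the GRU output, with the sizes used in this notebook (71 time steps, 256 units, 3 classes, batch size 32), and counts the parameters of the final Dense layer in both cases.

```python
import numpy as np

batch, steps, units, classes = 32, 71, 256, 3

all_states = np.zeros((batch, steps, units))   # stand-in for return_sequences=True output
last_state = all_states[:, -1, :]              # what return_sequences=False would return

flattened = all_states.reshape(batch, -1)      # effect of Flatten()

# Dense layer parameters: one weight per input feature and class, plus one bias per class.
dense_params_flatten = flattened.shape[1] * classes + classes
dense_params_last = last_state.shape[1] * classes + classes

print(flattened.shape, dense_params_flatten)   # Dense layer after Flatten
print(last_state.shape, dense_params_last)     # Dense layer on the last state only
```

Flattening all states gives the Dense layer 71 times as many input features, so it can use information from every position at the cost of more parameters.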

In [None]:
# stop training if the validation loss does not improve for 3 consecutive epochs
callback_ES = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

start_time = time.time()

history = model.fit(
    x_train,
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=10,
    callbacks=[callback_ES]
)


end_time = time.time()
elapsed_time = end_time - start_time
print("Elapsed time: ", elapsed_time)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Elapsed time:  33.35904788970947


In [None]:
_, acc = model.evaluate(x_test, y_test)



In [None]:
acc - 0.001*elapsed_time

0.6791581540107727

As we can see, these are significantly better results. Let's check if it would be better without early stopping.

# GRU without early stopping

In [None]:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=10123,
                                    output_dim=50,
                                    input_length=x_train.shape[1]))
model.add(tf.keras.layers.GRU(256, activation='tanh', return_sequences=True))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(3, activation='softmax'))

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

In [None]:
start_time = time.time()

history = model.fit(
    x_train,
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=10,
)


end_time = time.time()
elapsed_time = end_time - start_time
print("Elapsed time: ", elapsed_time)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Elapsed time:  27.168657779693604


In [None]:
_, acc = model.evaluate(x_test, y_test)



In [None]:
acc - 0.001*elapsed_time

0.7135466074943543

The result is better without early stopping.

# LSTM

Let's now try with an LSTM. First, let's look at the structure of an LSTM.

<br>

<img src='https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png' width='90%'>

<br>
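One LSTM step can be sketched in NumPy as follows. This is an illustration of the standard gate equations, not the `tf.keras.layers.LSTM` implementation; the stacked weight layout and the gate order (input, forget, candidate, output) are assumptions of this sketch. The sizes mirror a 50-dimensional embedding feeding an LSTM with 50 units.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4*units, input_dim), U: (4*units, units), b: (4*units,)."""
    units = h_prev.shape[0]
    a = W @ x_t + U @ h_prev + b           # pre-activations of all four gates at once
    i = sigmoid(a[:units])                 # input gate
    f = sigmoid(a[units:2 * units])        # forget gate
    g = np.tanh(a[2 * units:3 * units])    # candidate cell values
    o = sigmoid(a[3 * units:])             # output gate
    c = f * c_prev + i * g                 # new cell state
    h = o * np.tanh(c)                     # new hidden state
    return h, c

rng = np.random.default_rng(0)
input_dim, units = 50, 50
W = rng.normal(scale=0.1, size=(4 * units, input_dim))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)

n_params = W.size + U.size + b.size        # 4 * ((input_dim + units) * units + units)
```

Compared with a GRU, the LSTM keeps a separate cell state and has one more gate, which is one reason it tends to have more parameters for the same number of units.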

In [None]:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=10123,
                                    output_dim=50,
                                    input_length=x_train.shape[1]))
model.add(tf.keras.layers.LSTM(50, return_sequences=True))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(3, activation='softmax'))

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

In [None]:
start_time = time.time()

history = model.fit(
    x_train,
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=10,
)


end_time = time.time()
elapsed_time = end_time - start_time
print("Elapsed time: ", elapsed_time)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Elapsed time:  40.371262073516846


In [None]:
_, acc = model.evaluate(x_test, y_test)



In [None]:
acc - 0.001*elapsed_time

0.6941541800498963

It's a bit slower than GRU and the results are similar.

# Bidirectional LSTM

<img src='https://production-media.paperswithcode.com/methods/Screen_Shot_2020-05-25_at_8.54.27_PM.png' width='90%'>

<br>

A bidirectional layer reads the sequence in both directions, so the representation of each word can draw on the words before and after it.
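The mechanism can be sketched with two independent recurrent passes whose outputs are concatenated per time step. For brevity this sketch uses a plain tanh RNN instead of an LSTM, and it is not the `tf.keras.layers.Bidirectional` implementation; the helper names, toy sizes, and random weights are assumptions.

```python
import numpy as np

def simple_rnn(xs, W_x, W_h):
    """Plain tanh RNN; returns the hidden state at every time step."""
    h = np.zeros(W_h.shape[0])
    out = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h)
        out.append(h)
    return np.array(out)

def bidirectional(xs, fwd, bwd):
    """Run one RNN left-to-right and another right-to-left, then concatenate per step."""
    forward = simple_rnn(xs, *fwd)
    backward = simple_rnn(xs[::-1], *bwd)[::-1]   # reverse back so step t lines up
    return np.concatenate([forward, backward], axis=-1)

rng = np.random.default_rng(0)
steps, input_dim, units = 4, 3, 2   # toy sizes
fwd = (rng.normal(size=(units, input_dim)), rng.normal(size=(units, units)))
bwd = (rng.normal(size=(units, input_dim)), rng.normal(size=(units, units)))
xs = rng.normal(size=(steps, input_dim))

out = bidirectional(xs, fwd, bwd)   # shape: (steps, 2 * units)

# Changing the LAST input leaves the forward half of the FIRST step untouched,
# but changes its backward half, which has already "seen" the end of the sequence.
xs_changed = xs.copy()
xs_changed[-1] += 1.0
out_changed = bidirectional(xs_changed, fwd, bwd)
```

The output per step is twice as wide as a single direction, which is why a bidirectional layer with 50 units yields 100 features per time step.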

In [None]:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=10123,
                                    output_dim=50,
                                    input_length=x_train.shape[1]))
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True)))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(3, activation='softmax'))

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

In [None]:
start_time = time.time()

history = model.fit(
    x_train,
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=10,
)


end_time = time.time()
elapsed_time = end_time - start_time
print("Elapsed time: ", elapsed_time)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Elapsed time:  32.15957164764404


In [None]:
_, acc = model.evaluate(x_test, y_test)



In [None]:
acc - 0.001*elapsed_time

0.704429144859314

There is no significant improvement.

# Bidirectional LSTM with dropout

Apart from implementing dropout, we will also save the model that had the best accuracy on validation and try applying it to the test set.
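As a reminder of what dropout does, here is a minimal NumPy sketch of the usual "inverted dropout" formulation: during training a random fraction of activations is zeroed and the rest are scaled up so that the expected value stays the same, while at inference time the input passes through unchanged. This illustrates the idea and is not the `tf.keras.layers.Dropout` source; the function signature is hypothetical.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of entries, rescale the survivors."""
    if not training or rate == 0.0:
        return x                               # inference: identity
    keep = rng.random(x.shape) >= rate         # True with probability 1 - rate
    return x * keep / (1.0 - rate)             # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
x = np.ones(100_000)
y = dropout(x, rate=0.3, rng=rng)

dropped_fraction = np.mean(y == 0.0)           # should be close to 0.3
```

Because different units are dropped in every batch, the Dense layer cannot rely on any single flattened feature, which acts as a regulariser.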

In [None]:
checkpoint_LSTM = tf.keras.callbacks.ModelCheckpoint('LSTM_best_val_acc.h5',
                                                     monitor='val_accuracy',
                                                     verbose=1,
                                                     save_best_only=True,
                                                     mode='max')

In [None]:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=10123,
                                    output_dim=50,
                                    input_length=x_train.shape[1]))
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True)))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Dense(3, activation='softmax'))

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

In [None]:
start_time = time.time()

history = model.fit(
    x_train,
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=10,
    callbacks=[checkpoint_LSTM]
)


end_time = time.time()
elapsed_time = end_time - start_time
print("Elapsed time: ", elapsed_time)

Epoch 1/10
Epoch 1: val_accuracy improved from -inf to 0.63623, saving model to LSTM_best_val_acc.h5
Epoch 2/10
Epoch 2: val_accuracy improved from 0.63623 to 0.67599, saving model to LSTM_best_val_acc.h5
Epoch 3/10
Epoch 3: val_accuracy improved from 0.67599 to 0.68483, saving model to LSTM_best_val_acc.h5
Epoch 4/10
Epoch 4: val_accuracy improved from 0.68483 to 0.69072, saving model to LSTM_best_val_acc.h5
Epoch 5/10
Epoch 5: val_accuracy improved from 0.69072 to 0.71134, saving model to LSTM_best_val_acc.h5
Epoch 6/10
Epoch 6: val_accuracy did not improve from 0.71134
Epoch 7/10
Epoch 7: val_accuracy improved from 0.71134 to 0.71429, saving model to LSTM_best_val_acc.h5
Epoch 8/10
Epoch 8: val_accuracy improved from 0.71429 to 0.71576, saving model to LSTM_best_val_acc.h5
Epoch 9/10
Epoch 9: val_accuracy did not improve from 0.71576
Epoch 10/10
Epoch 10: val_accuracy did not improve from 0.71576
Elapsed time:  31.557350158691406


In [None]:
best_LSTM = tf.keras.models.load_model('LSTM_best_val_acc.h5')

In [None]:
_, acc = best_LSTM.evaluate(x_test, y_test)



In [None]:
acc - 0.001*elapsed_time

0.7050313663482666

Dropout did not help.

# LSTM with pre-trained Glove Embedding

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

In [None]:
tokenizer = Tokenizer(num_words=15000)
tokenizer.fit_on_texts(headlines)
sequences = tokenizer.texts_to_sequences(headlines)

word_index = tokenizer.word_index



In [None]:
def load_glove_model(File):
    print("Loading Glove Model")
    glove_model = {}
    with open(File, 'r', encoding='utf-8') as f:
        for line in f:
            split_line = line.split()
            word = split_line[0]
            embedding = np.array(split_line[1:], dtype=np.float64)
            glove_model[word] = embedding
    print(f"{len(glove_model)} words loaded!")
    return glove_model

In [None]:
glove_model = load_glove_model('glove.6B.100d.txt')

Loading Glove Model
400000 words loaded!


In [None]:
embedding_matrix = np.zeros((len(word_index) + 1, 100))
for word, i in word_index.items():
    embedding_vector = glove_model.get(word)
    # words not found in the GloVe vocabulary keep their all-zero row
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [None]:
from keras.layers import Embedding

embedding_layer = Embedding(len(word_index) + 1,
                            100,
                            weights=[embedding_matrix],
                            input_length=x_train.shape[1],
                            trainable=False)

In [None]:
model = tf.keras.models.Sequential()
model.add(embedding_layer)
model.add(tf.keras.layers.LSTM(100, return_sequences=True))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(3, activation='softmax'))

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

In [None]:
start_time = time.time()

history = model.fit(
    x_train,
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=10,
)


end_time = time.time()
elapsed_time = end_time - start_time
print("Elapsed time: ", elapsed_time)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Elapsed time:  12.843188047409058


In [None]:
_, acc = model.evaluate(x_test, y_test)



In [None]:
acc - 0.001*elapsed_time

0.7223700320720673

Training with the pre-trained Embedding was significantly faster because the Embedding layer, which holds most of the network's parameters, is frozen (`trainable=False`) and therefore not updated. However, the results are relatively similar.
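A quick parameter count shows how much of this network sits in the Embedding layer. The numbers follow from the layer sizes used above (10,123 embedding rows, 100 dimensions, 100 LSTM units, 71 time steps, 3 classes) and the standard formulas for LSTM and Dense layers; treat it as a back-of-the-envelope check rather than the output of `model.summary()`.

```python
vocab_rows, embed_dim, lstm_units, steps, classes = 10123, 100, 100, 71, 3

embedding_params = vocab_rows * embed_dim
# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
# Dense on the flattened sequence of hidden states
dense_params = steps * lstm_units * classes + classes

total_params = embedding_params + lstm_params + dense_params
frozen_share = embedding_params / total_params

print(f"Embedding: {embedding_params:,} of {total_params:,} parameters ({frozen_share:.1%})")
```

Under these assumptions, roughly 91 percent of the parameters belong to the Embedding layer, so freezing it removes most of the weight updates from each training step.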

# Conclusion

In all models, the accuracy on the training set reached up to 95 percent, while the validation and test accuracy were significantly lower. This gap indicates overfitting, so I conclude that training for more epochs would not improve performance on unseen data and might even harm it.

<br>

The best model by test accuracy was the GRU without early stopping, at 74.07 percent. How satisfied can we be with this accuracy? It is definitely better than the initial attempts with a simple RNN, but none of the other models (which are more complex than the GRU) achieved higher accuracy; they mostly hover around 73-74 percent. My conclusion is that the dataset is most likely too small. Part of the problem is also likely caused by the imbalance between positive, negative, and neutral news, with neutral headlines accounting for about 59 percent.
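Since the notebook prints only the effectiveness score and the elapsed time for each run, the test accuracies can be recovered as accuracy = effectiveness + 0.001*seconds. The sketch below does this for all runs, using the values printed above; the dictionary keys are labels chosen here for readability.

```python
# (effectiveness, elapsed seconds) as printed for each run in this notebook
runs = {
    "SimpleRNN 50, no early stopping": (0.4410207738876343, 143.57347583770752),
    "SimpleRNN 50, early stopping": (0.5640919075012207, 41.82280445098877),
    "SimpleRNN 100, early stopping": (0.5292284643650055, 55.365785360336304),
    "GRU, early stopping": (0.6791581540107727, 33.35904788970947),
    "GRU, no early stopping": (0.7135466074943543, 27.168657779693604),
    "LSTM": (0.6941541800498963, 40.371262073516846),
    "Bidirectional LSTM": (0.704429144859314, 32.15957164764404),
    "Bidirectional LSTM + dropout": (0.7050313663482666, 31.557350158691406),
    "LSTM + GloVe": (0.7223700320720673, 12.843188047409058),
}

# Invert the effectiveness formula to get the test accuracy of each run.
accuracies = {name: eff + 0.001 * secs for name, (eff, secs) in runs.items()}

best_by_accuracy = max(accuracies, key=accuracies.get)
best_by_effectiveness = max(runs, key=lambda name: runs[name][0])

for name, acc in accuracies.items():
    print(f"{name:34s} accuracy={acc:.4f} effectiveness={runs[name][0]:.4f}")
```

By raw accuracy the GRU without early stopping comes out on top, while the time-penalised effectiveness score ranks the faster LSTM with GloVe embeddings highest, so the ranking depends on which measure is used.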