#Financial News Sentiment Analysis


> About the Dataset

Financial news sentiment analysis dataset for India, compiled from several news sources.

Date range: Jan 1, 2017 to April 15, 2021

News sources:
Indian sources: Economic Times, Money Control, Livemint, Business Today, Financial Express
Foreign sources: NY Times, WSJ, Washington Post

Keywords:
Indian sources: "economy" or "markets" or "inflation"
Foreign sources: "Indian economy" OR "India economy" OR "Indian businesses" OR "Indian business"

Sentiment analysis: Performed using the flair NLP model. Confidence scores for NEGATIVE datapoints have been multiplied by -1 relative to the original flair output, so the sign of the score encodes the sentiment. Basic cleanup removed repeated headlines, and all headlines shorter than 30 characters were ignored.
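
The cleanup script itself is not part of this notebook, so the following is only a minimal sketch of the two rules described above (sign flip for NEGATIVE scores, removal of repeated and short headlines). The function names `signed_confidence` and `clean_headlines` and the sample headlines are hypothetical.

```python
def signed_confidence(label, confidence):
    """Flip the sign of the confidence when the label is NEGATIVE."""
    return -confidence if label == "NEGATIVE" else confidence


def clean_headlines(headlines, min_length=30):
    """Keep the first copy of each headline with at least min_length characters."""
    seen = set()
    kept = []
    for title in headlines:
        # Skip headlines that are too short or already seen.
        if len(title) < min_length or title in seen:
            continue
        seen.add(title)
        kept.append(title)
    return kept


# Made-up headlines, for illustration only.
sample = [
    "Markets rally as inflation eases for a third straight month",
    "Markets rally as inflation eases for a third straight month",
    "Sensex up",
]
cleaned = clean_headlines(sample)
print(cleaned)
print(signed_confidence("NEGATIVE", 0.95))  # -0.95
print(signed_confidence("POSITIVE", 0.99))  # 0.99
```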

Acknowledgements: GDELT Headline Scrape script from Prof. Ken Blake (https://drkblake.com/gdeltheadlinescrape/) has been used to generate the news headlines dataset.

Motivation: The intent of generating this data was to compile recent years' financial news headlines for India and perform sentiment analysis on them.



Connecting Google Colab to Kaggle to download the dataset directly into Colab

Downloading the helper functions written by mrdbourke, which contain the custom utilities used below

In [None]:
! wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

Importing required functions from helper functions

In [72]:
from helper_functions import unzip_data, plot_loss_curves, make_confusion_matrix

Importing required libraries

In [73]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import layers

#Part 1 : Data Preprocessing

Importing the dataset

In [74]:
df = pd.read_csv("News_sentiment_Jan2017_to_Apr2021.csv")

In [75]:
df.head()

Unnamed: 0,Date,Title,URL,sentiment,confidence,Unnamed: 5
0,05/01/17,Eliminating shadow economy to have positive im...,http://economictimes.indiatimes.com/news/econo...,POSITIVE,0.996185,
1,05/01/17,Two Chinese companies hit roadblock with India...,http://economictimes.indiatimes.com/news/econo...,NEGATIVE,-0.955493,
2,05/01/17,SoftBank India Vision gets new $100,http://economictimes.indiatimes.com/small-biz/...,POSITIVE,0.595612,
3,05/01/17,Nissan halts joint development of luxury cars ...,http://economictimes.indiatimes.com/news/inter...,NEGATIVE,-0.996672,
4,05/01/17,Despite challenges Rajasthan continues to prog...,http://economictimes.indiatimes.com/news/polit...,POSITIVE,0.997388,


Label-encoding the sentiment column


> Details

Since this is a binary classification problem, the class names in the sentiment column ("POSITIVE", "NEGATIVE") must be converted to binary values (1, 0) before the data is passed to the neural network.



In [76]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['sentiment'] = le.fit_transform(df['sentiment'])
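
To see which class gets which number, here is a small sketch on a toy list (the variable names `demo_encoder`, `encoded` and `mapping` are only for illustration). `LabelEncoder` sorts the class names, so NEGATIVE should map to 0 and POSITIVE to 1, which matches the `df.head()` output below.

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(["POSITIVE", "NEGATIVE", "POSITIVE"])

# classes_ holds the sorted class names; the position is the encoded value.
mapping = {str(name): code for code, name in enumerate(demo_encoder.classes_)}
print(mapping)          # {'NEGATIVE': 0, 'POSITIVE': 1}
print(encoded.tolist()) # [1, 0, 1]
```

The fitted encoder also offers `inverse_transform` to turn predicted 0/1 values back into class names.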

In [77]:
df.head()

Unnamed: 0,Date,Title,URL,sentiment,confidence,Unnamed: 5
0,05/01/17,Eliminating shadow economy to have positive im...,http://economictimes.indiatimes.com/news/econo...,1,0.996185,
1,05/01/17,Two Chinese companies hit roadblock with India...,http://economictimes.indiatimes.com/news/econo...,0,-0.955493,
2,05/01/17,SoftBank India Vision gets new $100,http://economictimes.indiatimes.com/small-biz/...,1,0.595612,
3,05/01/17,Nissan halts joint development of luxury cars ...,http://economictimes.indiatimes.com/news/inter...,0,-0.996672,
4,05/01/17,Despite challenges Rajasthan continues to prog...,http://economictimes.indiatimes.com/news/polit...,1,0.997388,


Splitting the data into train_sentences, val_sentences, train_labels, val_labels

In [78]:
from sklearn.model_selection import train_test_split
train_sentences, val_sentences, train_labels, val_labels = train_test_split(df['Title'].to_numpy(),
                                                                            df['sentiment'].to_numpy(),
                                                                            test_size = 0.2,
                                                                            random_state = 42)
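
As a quick check of what `test_size = 0.2` does, here is a sketch on made-up data (the `demo_*` names are hypothetical). It also passes `stratify`, which the cell above does not use; that is an optional extra that keeps the class ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 placeholder headlines with perfectly balanced labels.
demo_titles = np.array([f"headline {i}" for i in range(20)])
demo_labels = np.array([0, 1] * 10)

tr_x, va_x, tr_y, va_y = train_test_split(
    demo_titles,
    demo_labels,
    test_size=0.2,         # 20% of the rows go to validation
    random_state=42,       # same split on every run
    stratify=demo_labels,  # assumption: not used in the original cell
)
print(len(tr_x), len(va_x))      # 16 4
print(tr_y.mean(), va_y.mean())  # 0.5 0.5
```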

Create datasets (as fast as possible)



> tf.data: Build TensorFlow input pipelines and get better performance with the tf.data API
  

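Batching and prefetching let TensorFlow prepare the next batch while the current one is being processed, so the data reaches the device (a GPU, when one is available) as fast as possible, which in turn leads to faster training.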



In [79]:
train_dataset = tf.data.Dataset.from_tensor_slices((train_sentences, train_labels))
valid_dataset = tf.data.Dataset.from_tensor_slices((val_sentences, val_labels))



In [80]:
train_dataset = train_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
valid_dataset = valid_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
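
To make the batching step concrete, here is a minimal sketch on five toy sentences with a batch size of 2 instead of 32 (the `toy_*` names are hypothetical). The last batch is smaller because the data does not divide evenly.

```python
import tensorflow as tf

toy_sentences = ["a", "b", "c", "d", "e"]
toy_labels = [1, 0, 1, 0, 1]

toy_dataset = tf.data.Dataset.from_tensor_slices((toy_sentences, toy_labels))
# Group into batches of 2 and let tf.data pick the prefetch buffer size.
toy_dataset = toy_dataset.batch(2).prefetch(tf.data.AUTOTUNE)

batch_sizes = [int(text_batch.shape[0]) for text_batch, _ in toy_dataset]
print(batch_sizes)  # [2, 2, 1]
```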

#Part 2 : Embedding the Inputs (sentences) using Transfer Learning



> Converting text into numbers

You can build your own tokenizer and embedding layer, but for this problem I am going to use pre-trained embeddings, i.e. the Universal Sentence Encoder.




Loading the pretrained model from TensorFlow Hub into Colab

In [81]:
embed = hub.load('https://tfhub.dev/google/universal-sentence-encoder-large/5')

Creating the sentence encoder layer that we are going to add to the neural network

In [82]:
sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder-large/5", input_shape = [], dtype = "string")

#Part 3 : Build the Deep Learning Model 

Building the LSTM model using the Functional API

In [83]:
inputs = layers.Input(shape = [], dtype = "string", name = "input_layer")
x = sentence_encoder_layer(inputs)
x = tf.expand_dims(x, axis = 1)
x = layers.Bidirectional(layers.LSTM(72, return_sequences = True))(x)
x = layers.Dropout(0.5)(x)
x = layers.Bidirectional(layers.LSTM(72, return_sequences = True))(x)
x = layers.Dropout(0.5)(x)
x = layers.Bidirectional(layers.LSTM(72))(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation = 'sigmoid', name = 'output_layer')(x)
model = tf.keras.Model(inputs, outputs, name = "model_lstm")

model.compile(loss = "binary_crossentropy", optimizer = 'adam', metrics = ['accuracy'])
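
The `tf.expand_dims` call is there because an LSTM layer expects 3D input of shape (batch, timesteps, features), while the Universal Sentence Encoder returns one 512-dimensional vector per sentence. The sketch below uses NumPy zeros as a stand-in for the encoder output (an assumption, to avoid downloading the model) and only shows how the shape changes.

```python
import numpy as np

batch_size, embedding_dim = 4, 512

# Stand-in for the encoder output: one vector per sentence.
sentence_embeddings = np.zeros((batch_size, embedding_dim), dtype=np.float32)

# Insert a timesteps axis of length 1, as tf.expand_dims(x, axis=1) does.
lstm_input = np.expand_dims(sentence_embeddings, axis=1)

print(sentence_embeddings.shape)  # (4, 512)
print(lstm_input.shape)           # (4, 1, 512)
```

Because the timesteps axis has length 1, each LSTM layer here processes a single step per sentence rather than a sequence of words.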




In [84]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
tf.test.gpu_device_name()

Num GPUs Available:  0
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 10970432882806203582
xla_global_id: -1
]


''

Fitting the Model

In [85]:
history = model.fit(train_dataset, validation_data = valid_dataset, epochs = 10)

Epoch 1/10
Epoch 2/10
Epoch 3/10




Plotting the loss and accuracy curves

In [86]:
plot_loss_curves(history)

#Part 4 : Evaluating the Trained Model

Testing Model on Validation sentences

In [87]:
y_probs = model.predict(val_sentences)


Converting the probabilities in y_probs to class labels

In [88]:
y_preds = tf.round(y_probs)
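
Rounding a sigmoid output is the same as thresholding it at 0.5. The sketch below shows this on made-up probabilities with NumPy, whose `np.round` rounds halves to even just like `tf.round`, so a probability of exactly 0.5 becomes 0. The `*_demo` names are hypothetical.

```python
import numpy as np

# Shape (4, 1), like the output of model.predict for four sentences.
y_probs_demo = np.array([[0.91], [0.08], [0.5], [0.51]])

rounded = np.round(y_probs_demo).squeeze().astype(int)
thresholded = (y_probs_demo > 0.5).squeeze().astype(int)

print(rounded.tolist())      # [1, 0, 0, 1]
print(thresholded.tolist())  # [1, 0, 0, 1]
```

The `squeeze` call drops the trailing axis so the predictions have the same 1D shape as `val_labels`.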

Comparing the model's predicted labels with the actual validation labels

In [89]:
y_preds[:10]

In [90]:
val_labels[:10]

Building the Confusion Matrix to check model performance

In [91]:
make_confusion_matrix(val_labels, y_preds)
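
The plot from `make_confusion_matrix` can be backed up with numbers. This is a small sketch using scikit-learn's metrics on made-up labels (the `*_demo` names are hypothetical); with the real data you would pass `val_labels` and the squeezed `y_preds` instead.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

y_true_demo = [1, 0, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1]

# For binary labels the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2

accuracy = accuracy_score(y_true_demo, y_pred_demo)    # (TP + TN) / total
precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP)
recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN)
print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```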

Saving the model for Deployment

In [92]:
model.save('best_model.h5')

Loading the model to check whether all weights were saved

In [93]:
model = tf.keras.models.load_model("best_model.h5",custom_objects={"KerasLayer": hub.KerasLayer})

In [94]:
model.evaluate(valid_dataset)

#Part 5 : Realtime Testing of the Trained Model before Deployment

Sentence from Economic Times

In [95]:
custom = "Student loan forgiveness has scammers ‘on the move,’ warns FTC"

In [96]:
custom = "Sobana is annoying"

Creating a function to predict whether a headline is positive or negative news

In [97]:
def predict_on_sentence(model, sentence):
  """
  Uses model to make a prediction on sentence.

  Prints the predicted label, the prediction probability and the sentence.
  """
  pred_prob = model.predict([sentence])
  pred_label = tf.squeeze(tf.round(pred_prob)).numpy()
  print(f"Pred: {pred_label}", "(It's a Positive News)" if pred_label > 0 else "(It's a Negative News)", f"Prob: {pred_prob[0][0]}")
  print(f"Text:\n{sentence}")

Results

In [98]:
predict_on_sentence(model = model, sentence=custom)