## Notebook used for news sentiment feature extraction from the Kaggle financial news dataset, as part of:


*Training LSTM models to predict stock index prices from sequential historical data and financial news sentiment*


In [1]:
!pip install --upgrade tensorflow
import re
import pickle
import pandas as pd
from tqdm import tqdm
from google.colab import drive
from google.colab import files
from datetime import datetime
from dateutil import parser
from collections import defaultdict
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


# **IMPORTING DATA AND PREPROCESSING**

Importing Data from Kaggle Account using API Key:

Go to your Kaggle account settings page (https://www.kaggle.com/account), scroll down to the "API" section, and click on the "Create New API Token" button. This will download a file named kaggle.json containing your API key.

Upload Kaggle API Key to Colab Below:

In [2]:
# Upload the Kaggle API key file
uploaded = files.upload()

Saving kaggle.json to kaggle (1).json


Make sure the Kaggle package is installed in your Colab notebook (uncomment the line below if it is missing):

In [6]:
#!pip install kaggle




Move the uploaded kaggle.json file to the Kaggle configuration directory and set appropriate permissions for the API key file by running the following commands:

In [3]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Now you can use the Kaggle API to download a dataset directly to your notebook.

In [None]:
DATASET = "financial-news-headlines"
!kaggle datasets download -d notlucasp/{DATASET}
!unzip -q {DATASET}.zip

In [5]:
reuters = pd.read_csv('reuters_headlines.csv')
guardian = pd.read_csv('guardian_headlines.csv')
cnbc = pd.read_csv('cnbc_headlines.csv')

In [6]:
concatenated_df = pd.concat([reuters, cnbc,  guardian], ignore_index=True)

In [22]:
concatenated_df.tail()

Unnamed: 0,Headlines,Time,Description
53645,How investing in solar energy can create a bri...,17-Dec-17,
53646,Poundland suppliers hit by insurance downgrade,17-Dec-17,
53647,Cryptocurrencies: City watchdog to investigate...,17-Dec-17,
53648,Unilever sells household name spreads to KKR f...,17-Dec-17,
53649,The Guardian view on Ryanair’s model: a union-...,17-Dec-17,


In [7]:
def parse_date(input_date):
    '''Parses a date from the differing formats of the Reuters, Guardian,
    and CNBC CSV files. Returns the date as a 'yyyy-mm-dd' string, or None
    if the input cannot be parsed.'''
    try:
        # Try parsing the date using dateutil.parser
        date_obj = parser.parse(input_date)
    except (ValueError, OverflowError):
        # If parsing fails, return None
        return None

    # Convert the date to the 'yyyy-mm-dd' format
    return date_obj.strftime('%Y-%m-%d')




In [8]:
concatenated_df['Time'] = concatenated_df['Time'].apply(lambda x: parse_date(x) if pd.notnull(x) else x)
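`dateutil` guesses the layout of each string, which is convenient but can silently misread ambiguous dates. As a minimal sketch of a stricter alternative, the helper below tries a fixed list of `strptime` formats. The name `parse_date_strict` and the format list are assumptions for illustration only: `%d-%b-%y` matches the `17-Dec-17` style visible in the table above, while the other entries are guesses that should be checked against the actual CSV files.

```python
from datetime import datetime

# Hypothetical format list; verify each entry against the real CSV files.
KNOWN_FORMATS = ("%d-%b-%y", "%b %d %Y", "%Y-%m-%d")

def parse_date_strict(input_date):
    """Return 'yyyy-mm-dd' for the first matching format, else None."""
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(input_date.strip(), fmt)
        except ValueError:
            continue  # this format does not match, try the next one
        return parsed.strftime("%Y-%m-%d")
    return None

print(parse_date_strict("17-Dec-17"))
print(parse_date_strict("Jul 18 2020"))
print(parse_date_strict("no date here"))
```

Strings that match none of the listed formats come back as `None`, so unexpected layouts show up as missing values instead of being parsed into a wrong date.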



In [9]:
features = concatenated_df[['Headlines','Time']]

In [26]:
features.head()

Unnamed: 0,Headlines,Time
0,TikTok considers London and other locations fo...,2020-07-18
1,Disney cuts ad spending on Facebook amid growi...,2020-07-18
2,Trail of missing Wirecard executive leads to B...,2020-07-18
3,Twitter says attackers downloaded data from up...,2020-07-18
4,U.S. Republicans seek liability protections as...,2020-07-17


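Splitting at 1 January 2020 for training/testing: headlines dated before 2020 form the training set, and headlines from 2020 onward form the test set.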

In [14]:
features['Time'] = pd.to_datetime(features['Time'])

# Specify the split date
split_date = pd.to_datetime('2020-01-01')

# Split the DataFrame into train and test based on the date
train_df = features[features['Time'] < split_date]
test_df = features[features['Time'] >= split_date]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  features['Time'] = pd.to_datetime(features['Time'])
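Because the split runs before the empty rows are removed, it is worth knowing how missing dates behave. The toy example below (made-up rows, with hypothetical names `toy`, `train_toy`, and `test_toy`) shows that a missing date becomes `NaT`, and that comparisons against `NaT` are false in both directions, so such a row lands in neither subset.

```python
import pandas as pd

# Made-up rows for illustration; the third row mimics an empty CSV line.
toy = pd.DataFrame({
    "Headlines": ["old news", "new year", None, "summer"],
    "Time": ["2019-12-31", "2020-01-01", None, "2020-07-17"],
})
toy["Time"] = pd.to_datetime(toy["Time"])  # None becomes NaT

split_date = pd.Timestamp("2020-01-01")
train_toy = toy[toy["Time"] < split_date]
test_toy = toy[toy["Time"] >= split_date]

# Comparisons with NaT are False, so the empty row is in neither subset.
print(len(train_toy), len(test_toy), len(toy))
```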


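Cleaning data. Some rows, such as index 32772 below, are empty in both columns and are removed before sentiment extraction.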

In [20]:
features.iloc[32770:32774]

Unnamed: 0,Headlines,Time
32770,Jim Cramer: A better way to invest in the Covi...,2020-07-17
32771,Cramer's lightning round: I would own Teradyne,2020-07-17
32772,,
32773,"Cramer's week ahead: Big week for earnings, ev...",2020-07-17


In [21]:
def remove_rows_with_nan(features, column_name):
    '''Takes in a dataframe and a column name and removes every row that has a
    NaN value in that column. Returns the cleaned/filtered dataframe.'''
    features_filtered = features.dropna(subset=[column_name])

    return features_filtered

In [24]:
# The column argument was missing from the original call; 'Headlines' is
# assumed here because it is the text column passed to FinBERT.
features = remove_rows_with_nan(features, 'Headlines')
features.iloc[32770:32774]

Unnamed: 0,Headlines,Time
32770,Jim Cramer: A better way to invest in the Covi...,2020-07-17
32771,Cramer's lightning round: I would own Teradyne,2020-07-17
32773,"Cramer's week ahead: Big week for earnings, ev...",2020-07-17
32774,IQ Capital CEO Keith Bliss says tech and healt...,2020-07-17
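The output above skips label 32772 because `dropna` keeps the original index labels of the surviving rows. A small sketch with made-up data (the names `toy` and `cleaned` are hypothetical) shows the gap and how `reset_index(drop=True)` would renumber the rows if contiguous labels were wanted.

```python
import pandas as pd

# Made-up data: the second row has no headline.
toy = pd.DataFrame({"Headlines": ["a", None, "c", "d"]})
cleaned = toy.dropna(subset=["Headlines"])

print(list(cleaned.index))                         # labels keep the gap
print(list(cleaned.reset_index(drop=True).index))  # labels renumbered from 0
```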


# **SENTIMENT EXTRACTION**

Importing the tokenizer and transformer model from Hugging Face.

In [10]:
pipe = pipeline("text-classification", model="ProsusAI/finbert")

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")

model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")



config.json:   0%|          | 0.00/758 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/252 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [30]:
import torch

def analyze_sentiment(dataframe, datelabel, textlabel):
    '''Given a dataframe, date column, and text column, tokenizes each text and
    gets its FinBERT sentiment probabilities. Returns a default dictionary of
    date (yyyy-mm-dd): {'pos': [...], 'neg': [...], 'neu': [...]}, where each
    list holds one score per article. This structure is used to account for
    dates with multiple articles.'''
    sentiments = defaultdict(lambda: {'pos': [], 'neg': [], 'neu': []})

    # Use tqdm as a wrapper around the dataframe's iterator
    for index, row in tqdm(dataframe.iterrows(), total=len(dataframe)):
        text = [row[textlabel]]

        # Tokenize the input text
        inputs = tokenizer(text, return_tensors="pt")

        # Forward pass through the model (inference only, so no gradients)
        with torch.no_grad():
            outputs = model(**inputs)

        # Get the predicted class probabilities
        probs = outputs.logits.softmax(dim=-1)

        # Class indices: 0 = positive, 1 = negative, 2 = neutral
        positive = probs[:, 0].tolist()
        negative = probs[:, 1].tolist()
        neutral = probs[:, 2].tolist()

        sentiments[row[datelabel]]['pos'].extend(positive)
        sentiments[row[datelabel]]['neg'].extend(negative)
        sentiments[row[datelabel]]['neu'].extend(neutral)

    return sentiments
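The softmax step in `analyze_sentiment` turns the model's three raw logits into probabilities that sum to 1. The pure-Python sketch below illustrates that conversion without loading the model. The logit values are made up, and the `LABELS` order is assumed to follow the column indexing used in the function above.

```python
import math

# Label order assumed to follow the indexing used in analyze_sentiment.
LABELS = ("pos", "neg", "neu")

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum first keeps exp() numerically stable.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [-1.2, 2.3, 0.4]  # made-up values, not real FinBERT output
probs = dict(zip(LABELS, softmax(example_logits)))
print(probs)
```

The largest logit receives the largest probability, so in this made-up case the headline would count mostly toward the negative score for its day.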


In [26]:
def compute_average_sentiments(sentiments):
    '''Given a dictionary of sentiments, computes the average positive, negative,
    and neutral sentiment values for each individual day. Returns a dictionary of
    yyyy-mm-dd : {'pos': xx, 'neg': xx, 'neu': xx}.'''
    accumulated_sentiments = {}

    # Accumulate sentiment scores and counts
    for date, scores in sentiments.items():
        accumulated_scores = {'pos': 0.0, 'neg': 0.0, 'neu': 0.0, 'count': 0}

        for i in range(len(scores['pos'])):
            pos_score, neg_score, neu_score = scores['pos'][i], scores['neg'][i], scores['neu'][i]
            accumulated_scores['pos'] += pos_score
            accumulated_scores['neg'] += neg_score
            accumulated_scores['neu'] += neu_score
            accumulated_scores['count'] += 1

        # Calculate averages; a day with no scores keeps its 0.0 totals
        averaged_scores = {
            'pos': accumulated_scores['pos'] / accumulated_scores['count'] if accumulated_scores['count'] > 0 else accumulated_scores['pos'],
            'neg': accumulated_scores['neg'] / accumulated_scores['count'] if accumulated_scores['count'] > 0 else accumulated_scores['neg'],
            'neu': accumulated_scores['neu'] / accumulated_scores['count'] if accumulated_scores['count'] > 0 else accumulated_scores['neu'],
        }

        accumulated_sentiments[date] = averaged_scores

    return accumulated_sentiments
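To make the per-day averaging concrete, here is a compact sketch on made-up scores using `statistics.mean`. The helper name `average_by_day` is hypothetical, and unlike the function above it skips days with no scores instead of returning zeros for them.

```python
from statistics import mean

def average_by_day(sentiments):
    """Average each per-article score list; days with no scores are skipped."""
    return {
        date: {label: mean(values) for label, values in scores.items()}
        for date, scores in sentiments.items()
        if scores["pos"]
    }

# Made-up scores: two articles on the 17th, one on the 18th.
toy_sentiments = {
    "2020-07-17": {"pos": [0.2, 0.4], "neg": [0.5, 0.1], "neu": [0.3, 0.5]},
    "2020-07-18": {"pos": [0.1], "neg": [0.3], "neu": [0.6]},
}
daily = average_by_day(toy_sentiments)
print(daily["2020-07-17"])
```

A day with a single article simply keeps that article's scores, which matches the behaviour of the accumulation loop above.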

Running data through FinBERT and getting average sentiment scores for each day. Saving the output to a pickle file.

In [31]:
sentiments = analyze_sentiment(features, 'Time', 'Headlines')
sentiments_dict = compute_average_sentiments(sentiments)

pickle_filename_drive = '/content/drive/MyDrive/AI/sentiments20k.pickle'

# Save the dictionary to a pickle file on Google Drive
with open(pickle_filename_drive, 'wb') as file:
    pickle.dump(sentiments_dict, file)

100%|██████████| 53370/53370 [2:10:16<00:00,  6.83it/s]


In [37]:
def visualize_sentiment_data(sentiments_dict):
    '''Given a dictionary of the form yyyy-mm-dd : {'pos': xx, 'neg': xx, 'neu': xx},
    converts the data to a pandas dataframe with distinct columns for each
    sentiment value.'''

    # Convert the dictionary to a DataFrame
    visualization_data = pd.DataFrame(list(sentiments_dict.items()), columns=['date', 'sentiment'])

    # Split the 'sentiment' column into pos, neg, and neu columns
    visualization_data[['pos', 'neg', 'neu']] = pd.DataFrame(visualization_data['sentiment'].tolist(), index=visualization_data.index)

    # Drop the original 'sentiment' column
    visualization_data = visualization_data.drop(columns=['sentiment'])

    return visualization_data
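The same table can also be built in one step with `pd.DataFrame.from_dict`. This is only an alternative sketch on made-up values, not the approach the notebook uses; the names `toy_daily` and `table` are hypothetical.

```python
import pandas as pd

# Made-up daily averages in the shape produced by compute_average_sentiments.
toy_daily = {
    "2020-07-18": {"pos": 0.14, "neg": 0.31, "neu": 0.55},
    "2020-07-17": {"pos": 0.19, "neg": 0.40, "neu": 0.41},
}

table = (
    pd.DataFrame.from_dict(toy_daily, orient="index")  # one row per date
    .rename_axis("date")                               # name the index
    .reset_index()[["date", "pos", "neg", "neu"]]      # fix the column order
)
print(table)
```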



Generating a dataframe and saving it to a pickle file for later use.

In [None]:

visualization_df = visualize_sentiment_data(sentiments_dict)

# Display the resulting DataFrame
visualization_df

pickle_filename_drive = '/content/drive/MyDrive/AI/sentiments_df.pickle'

# Save the DataFrame to a pickle file on Google Drive
with open(pickle_filename_drive, 'wb') as file:
    pickle.dump(visualization_df, file)

Re-opening the pickle file and renaming columns. Converting the dataframe to CSV and executing a local download.

In [6]:
with open('/content/drive/MyDrive/AI/sentiments_df.pickle', 'rb') as file:
    sentiment_df = pd.read_pickle(file)

sentiment_df.rename(columns = {'date': 'Date', 'pos' : 'Positive', 'neg': 'Negative', 'neu': 'Neutral'},  inplace=True)
sentiment_df

# Save the DataFrame to a CSV file
sentiment_df.to_csv('Financial_News_Sentiment.csv', index=False)

# Download the CSV file to your local machine
files.download('Financial_News_Sentiment.csv')

Unnamed: 0,Date,Positive,Negative,Neutral
0,2020-07-18,0.135504,0.314033,0.550463
1,2020-07-17,0.194660,0.400375,0.404965
2,2020-07-16,0.198061,0.429176,0.372763
3,2020-07-15,0.278595,0.337260,0.384145
4,2020-07-14,0.231314,0.385918,0.382769
...,...,...,...,...
928,2017-12-21,0.094427,0.396806,0.508767
929,2017-12-20,0.153500,0.408590,0.437910
930,2017-12-19,0.098566,0.410148,0.491286
931,2017-12-18,0.134950,0.427190,0.437860
