<a href="https://colab.research.google.com/github/DRSNAJ/BERT-LSTM-sentiment-trader/blob/main/BERT_LSTM_news_sentiment_trader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os, re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as md
from datetime import datetime
import time
import plotly.express as px

import tensorflow as tf
import torch
from keras.models import Sequential
from keras.layers import LSTM, Dense
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import TFBertModel, BertTokenizer
from transformers import Trainer, TrainingArguments
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from torch.nn.functional import softmax

#### Model Training and Loading Configuration

This cell sets two values that control whether the notebook trains a new model or loads an existing one: `train_model` toggles training, and `load_model` holds the path to a previously saved `.keras` model.

In [3]:
train_model = False
load_model = '/content/drive/MyDrive/Colab Notebooks/Saved Models/log_pert_LSTM-model-2024-05-11_1035.keras'
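
One way the two settings above might be consumed is sketched below. The helper `resolve_model` and its `train_fn` / `load_fn` parameters are hypothetical names introduced only for illustration; the notebook itself may branch differently. Stand-in callables are used so the sketch runs without Keras.

```python
from typing import Any, Callable


def resolve_model(train_model: bool, load_model: str,
                  train_fn: Callable[[], Any],
                  load_fn: Callable[[str], Any]) -> Any:
    """Train a fresh model when train_model is True, otherwise load a saved one."""
    if train_model:
        return train_fn()
    if not load_model:
        raise ValueError("train_model is False but no load_model path was given")
    return load_fn(load_model)


# Stand-in callables (assumptions) so the sketch runs on its own.
model = resolve_model(
    train_model=False,
    load_model="saved/example.keras",  # hypothetical path
    train_fn=lambda: "freshly trained model",
    load_fn=lambda path: f"model loaded from {path}",
)
print(model)
```

In the notebook, `load_fn` could plausibly be `keras.models.load_model` and `train_fn` whatever function builds and fits the LSTM.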

# Data Loading and Processing for News Tweets

The cell below loads tweet data from several news outlets and prepares it for analysis. The steps are:

1. **Initialize File Paths and Data Structures**:
    - `dataset_src` specifies the directory containing the tweet data files.
    - `news_df` initializes an empty DataFrame to store all tweets from different files.
    - `datainfo` is a dictionary to hold summary information about the tweets for each news outlet such as start and end dates and the number of tweets.

2. **List of News Outlets**:
    - `all_outlets` lists all news outlets for which the tweets are being analyzed.

3. **Load and Process Each File**:
    - Iterates through each file in the specified dataset source directory.
    - Reads each tweet file into a DataFrame and parses the `timestamp` from separate `date` and `time` columns.
    - Initializes columns for each news outlet in the DataFrame to mark which tweets belong to which outlet using flags (0 or 1).
  
4. **Identify and Mark the Data Source**:
    - Determines the news outlet by matching the file name and updates the corresponding outlet column to 1.
    - Updates the `datainfo` dictionary with the earliest and latest timestamps and the total number of tweets for each outlet.

5. **Handle Exceptions**:
    - Wraps outlet identification and the `datainfo` update in a try-except block, so a file that matches no known outlet is reported rather than stopping the loop.

6. **Concatenate DataFrames**:
    - Appends the data from each file to `news_df`.
    - Retains only the columns needed later: the timestamp, outlet flags, tweet text, and the reply, retweet, and like counts.

7. **Round Timestamps**:
    - Rounding `timestamp` to the nearest minute is currently commented out in the code, so timestamps keep their original precision.

8. **Clean Tweet Content**:
    - Uses regular expressions to remove URLs and specific phrases from the tweet text to clean and standardize the data.

9. **Summarize and Display Data**:
    - Constructs a formatted string to display a summary of the data collected, including the total number of tweets and the date range for tweets from each outlet.
    - Finally, prints the summary table to the console.

After this cell runs, `news_df` is indexed by `timestamp` and holds the cleaned tweets from all outlets, ready for the sentiment analysis that follows.
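
Steps 3, 4, and 8 above can be illustrated on plain strings, without pandas. This is a minimal sketch: the helper names and the simplified URL pattern are assumptions made for illustration and are not the notebook's exact code.

```python
import re

ALL_OUTLETS = ["bbc", "cnn", "eco"]

# Simplified URL pattern (an assumption, not the notebook's exact regex):
# optional leading whitespace, then http(s):// up to the next whitespace.
URL_PATTERN = re.compile(r"\s*https?://\S+")
LIVE_SUFFIX = ". Follow live updates:"


def outlet_from_filename(file_name):
    """Return the outlet key for names like 'tweets_bbc.csv', else None."""
    match = re.fullmatch(r"tweets_(\w+)\.csv", file_name)
    if match and match.group(1) in ALL_OUTLETS:
        return match.group(1)
    return None


def outlet_flags(outlet):
    """One 0/1 flag per outlet, with 1 marking the source outlet."""
    return {name: int(name == outlet) for name in ALL_OUTLETS}


def clean_tweet(text):
    """Strip URLs and the literal live-updates phrase from one tweet."""
    text = URL_PATTERN.sub("", text)
    return text.replace(LIVE_SUFFIX, "").strip()


print(outlet_flags(outlet_from_filename("tweets_cnn.csv")))
print(clean_tweet("Markets rally. Follow live updates: https://t.co/abc"))
```

The phrase is removed with a plain string replacement rather than a regular expression, because the leading `.` would otherwise match any character.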


In [6]:
dataset_src = '/content/drive/MyDrive/Colab Notebooks/datasets/news_tweets' # locations of the tweet csv files
news_df = pd.DataFrame()

all_outlets = ['bbc', 'cnn', 'eco']

datainfo = {"bbc":{"start_data":None,"end_data":None,"num_tweets":None},
            "cnn":{"start_data":None,"end_data":None,"num_tweets":None},
            "eco":{"start_data":None,"end_data":None,"num_tweets":None}}


for fileItem in os.listdir(dataset_src):
  data = pd.read_csv(dataset_src + "/" + fileItem)
  data['timestamp'] = pd.to_datetime((data['date'] + ' ' + data['time']))

  for i in all_outlets:
    data[i] = 0

  try:
    outlet = 'unknown'
    if (fileItem == 'tweets_bbc.csv'):
      outlet = "bbc"
    elif (fileItem == 'tweets_cnn.csv'):
      outlet = "cnn"
    elif (fileItem == 'tweets_eco.csv'):
      outlet = "eco"
    else:
      outlet = 'unknown'

    data[outlet] = 1

    # min()/max() give the first and last timestamp directly, even when several rows share them
    datainfo[outlet]["start_data"] = data['timestamp'].min()
    datainfo[outlet]["end_data"] = data['timestamp'].max()
    datainfo[outlet]["num_tweets"] = data.shape[0]

  except Exception as e:
    print(f"Could not summarize {fileItem}: {e}")


  news_df = pd.concat([news_df,data])

  print('The total columns of dataset: ' + str(list(news_df.columns)))
  news_df = news_df[['timestamp','bbc','cnn','eco','tweet','replies_count', 'retweets_count', 'likes_count']]

# news_df['timestamp'] = news_df['timestamp'].round('min') # minute

news_df = news_df.sort_values(by=['timestamp'])  # sort_values returns a new DataFrame


# Regular expression pattern to match URLs
tweet_link_format = r'(\s)http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Replace URLs with an empty string
news_df['tweet'] = news_df['tweet'].str.replace(tweet_link_format, '', regex=True)
news_df['tweet'] = news_df['tweet'].str.replace('. Follow live updates:', '', regex=False)  # literal match; '.' is not a wildcard here

# Printing results
total_tweets = 0
data_tbl = '\n==================================DATA SUMMARY===================================\n|Outlet \t|Start Date \t\t|End Date \t\t|Tweets \t|\n|---------------|-----------------------|-----------------------|---------------|\n'
for outlet in all_outlets:
  data_tbl = data_tbl + '|'+ outlet + '\t\t|'+ str(datainfo[outlet]["start_data"]) + '\t|'+ str(datainfo[outlet]["end_data"]) + '\t|'+ str(datainfo[outlet]["num_tweets"]) +'\t\t|\n'
  total_tweets = total_tweets + datainfo[outlet]["num_tweets"]
data_tbl = data_tbl +'|---------------|-----------------------|-----------------------|---------------|\n' + '|\t\t|\t\t\t|\t\t\t|' + str(total_tweets) + '\t\t|'
data_tbl = data_tbl +'\n=================================================================================\n'
print(data_tbl)

news_df.set_index('timestamp', inplace=True)




The total columns of dataset: ['id', 'conversation_id', 'created_at', 'date', 'time', 'timezone', 'user_id', 'username', 'name', 'place', 'tweet', 'language', 'mentions', 'urls', 'photos', 'replies_count', 'retweets_count', 'likes_count', 'hashtags', 'cashtags', 'link', 'retweet', 'quote_url', 'video', 'thumbnail', 'near', 'geo', 'source', 'user_rt_id', 'user_rt', 'retweet_id', 'reply_to', 'retweet_date', 'translate', 'trans_src', 'trans_dest', 'timestamp', 'bbc', 'cnn', 'eco']



The total columns of dataset: ['timestamp', 'bbc', 'cnn', 'eco', 'tweet', 'replies_count', 'retweets_count', 'likes_count', 'id', 'conversation_id', 'created_at', 'date', 'time', 'timezone', 'user_id', 'username', 'name', 'place', 'language', 'mentions', 'urls', 'photos', 'hashtags', 'cashtags', 'link', 'retweet', 'quote_url', 'video', 'thumbnail', 'near', 'geo', 'source', 'user_rt_id', 'user_rt', 'retweet_id', 'reply_to', 'retweet_date', 'translate', 'trans_src', 'trans_dest']



The total columns of dataset: ['timestamp', 'bbc', 'cnn', 'eco', 'tweet', 'replies_count', 'retweets_count', 'likes_count', 'id', 'conversation_id', 'created_at', 'date', 'time', 'timezone', 'user_id', 'username', 'name', 'place', 'language', 'mentions', 'urls', 'photos', 'hashtags', 'cashtags', 'link', 'retweet', 'quote_url', 'video', 'thumbnail', 'near', 'geo', 'source', 'user_rt_id', 'user_rt', 'retweet_id', 'reply_to', 'retweet_date', 'translate', 'trans_src', 'trans_dest']

|Outlet 	|Start Date 		|End Date 		|Tweets 	|
|---------------|-----------------------|-----------------------|---------------|
|bbc		|2010-01-01 19:40:04	|2021-07-02 15:28:43	|34547		|
|cnn		|2010-01-01 06:58:23	|2021-07-05 05:08:12	|55236		|
|eco		|2010-01-01 21:20:14	|2021-07-05 04:59:39	|254413		|
|---------------|-----------------------|-----------------------|---------------|
|		|			|			|344196		|



In [123]:
# resampled_df = news_df.resample('4H').agg({
#     'bbc': list,
#     'cnn': list,
#     'eco': list,
#     'tweet': list,
#     'replies_count': list,
#     'retweets_count': list,
#     'likes_count': list
# })
# # Remove rows where all list columns are empty
# resampled_df = resampled_df[~resampled_df.apply(lambda row: all(len(x) == 0 for x in row), axis=1)]

In [124]:
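news_df[['positive', 'negative', 'neutral']] = 0.0  # float placeholders for the class probabilities
news_tweets = news_df[news_df['eco']==0].copy()  # copy so later writes do not target a slice of news_df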

In [125]:
model_name = "ProsusAI/finbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

In [1]:
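model.eval()  # inference mode (disables dropout)
prob_cols = news_tweets.columns.get_indexer(['positive', 'negative', 'neutral'])

for row_idx, text in enumerate(news_tweets['tweet']):
  # Truncate so long tweets stay within the model's maximum input length
  encoded_input = tokenizer(text, truncation=True, return_tensors='pt')

  with torch.no_grad():
      output = model(**encoded_input)
  logits = output.logits

  probabilities = softmax(logits, dim=1)
  predicted_class = torch.argmax(probabilities).item()
  # Column order assumes the labels are (positive, negative, neutral); verify with model.config.id2label
  news_tweets.iloc[row_idx, prob_cols] = probabilities[0].tolist()  # write to the current row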



In [None]:
probabilities[0].tolist()
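
The list shown by the cell above holds one probability per sentiment class, obtained by applying softmax to the model's three logits. The pure-Python sketch below reproduces that step on made-up logits. The label order in `LABELS` is an assumption; for a loaded model it is worth checking `model.config.id2label`.

```python
import math

# Assumed label order; verify against model.config.id2label for a real model.
LABELS = ["positive", "negative", "neutral"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probabilities(logits):
    """Pair each class label with its softmax probability."""
    return dict(zip(LABELS, softmax(logits)))


example_logits = [2.0, 0.5, -1.0]  # made-up values for illustration
probs = label_probabilities(example_logits)
print(probs)
print(max(probs, key=probs.get))  # label with the highest probability
```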

In [None]:
news_tweets.iloc[1000][['positive', 'negative', 'neutral']]


print(news_tweets.iloc[1000])
print("=========")
print(probabilities[0].tolist())

In [None]:
# Requires resampled_df from the resampling cell above (currently commented out)
resampled_df['tweet'][101][1]

# Tokenize the tweets in the block; the model is a PyTorch model, so request PyTorch tensors
encoded_input = tokenizer(resampled_df['tweet'][101][1], padding=True, truncation=True, return_tensors='pt')

# Make sure the model is in evaluation mode
model.eval()

# Running the model without gradient tracking and getting the logits
with torch.no_grad():
    outputs = model(**encoded_input)
    logits = outputs.logits

In [None]:
resolution = 4*60 # length of each prediction block in minutes (e.g. 30 for half-hourly, 60 for hourly, 4*60 for every 4 hours, 24*60 for daily)


In [None]:
data_path = '/content/drive/MyDrive/Colab Notebooks/datasets/forex_data/DAT_MT_GBPUSD_M1' # location of the forex data

forex_data = pd.DataFrame()
column_names = ['date','time','open','high','low','close','na']

for f in os.listdir(data_path):
  data = pd.read_csv(data_path + '/' + f, names=column_names)

  # Formatting data and creating timestamps
  data['date'] = data['date'].str.replace('.', '-', regex=False)  # literal '.', not the regex wildcard
  data['timestamp'] = pd.to_datetime((data['date'] + ' ' + data['time']))

  forex_data = pd.concat([forex_data,data])

# Removing duplicates and sorting by time.
forex_data = forex_data[['timestamp','open','high','low','close']].drop_duplicates().sort_values(by='timestamp')

# Adding in missing timestamps and interpolating the forex prices between those values.
forex_data = forex_data.set_index('timestamp')[['open','high','low','close']].asfreq(freq='60s').interpolate()

# Smoothing the closing price over 4 hours (240 one-minute bars) to reduce noise
period = 60*4
forex_data['4hemw'] = forex_data['close'].ewm(span=period, adjust=False).mean() # Exponential Moving Average
forex_data['ma'] = forex_data['close'].rolling(window=period).mean() # Simple Moving Average
forex_data['ma'] = forex_data['ma'].shift(-int(np.round(period/2))) # Shift back half a window to center the average

# Calculating the rate of change (per-minute gradient) of the centered average
forex_data['pert_change'] = np.gradient(forex_data['ma'])
forex_data['pert_change'] = forex_data['pert_change'].rolling(window=period).mean() # can try ema here as well
forex_data['pert_change'] = forex_data['pert_change'].shift(-int(np.round(period/2)))

# Log-transformed change: log(1 + change)
forex_data['log_change']  = np.log(1 + forex_data['pert_change'])
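
The two smoothing steps used above can be written out by hand to show what they compute. The sketch below is pure Python on a toy price list: `ema` follows the recursion pandas uses for `ewm(span=..., adjust=False)`, and `centered_sma` imitates a trailing rolling mean followed by a backward shift of half a window. The function names and the toy data are illustrative assumptions, with `None` standing in for NaN.

```python
def ema(values, span):
    """Recursive EMA: y[0] = x[0], y[t] = (1 - a) * y[t-1] + a * x[t], a = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append((1 - alpha) * smoothed[-1] + alpha * x)
    return smoothed


def centered_sma(values, window):
    """Trailing mean over `window` points, then shifted back by round(window / 2)."""
    trailing = [None] * len(values)
    for i in range(window - 1, len(values)):
        trailing[i] = sum(values[i - window + 1:i + 1]) / window
    shift = round(window / 2)
    # Shifting back drops the first `shift` entries and pads the end with None
    return trailing[shift:] + [None] * shift


prices = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy data for illustration
print(ema(prices, span=3))
print(centered_sma(prices, window=4))
```

Because the shift pulls later values backwards, each centered average depends on prices that come after its own timestamp.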

# Loading BERT Model

In [None]:
sentiments = []

