Drew Lickman

CSCI 4820-001

Project #7

Due 12/??/24

A.I. Disclaimer: Work for this assignment was completed with the aid of artificial intelligence tools. Comprehensive documentation of the names of these tools, the input provided to them, and the output obtained from them is included as part of my assignment submission.

# Custom NLP Project using 3 Hugging Face Pipelines
### Dr. Sal Barbosa, Department of Computer Science, Middle Tennessee State University

# Project Description
This project analyzes transcripts published by the Federal Open Market Committee (FOMC).

This Jupyter notebook will:

1. Download PDF transcripts from the official Federal Open Market Committee website using `fomc-crawler.py`, then convert them to text files with `pdf-to-txt.py`
2. Classify sentiment with tabularisai's robust-sentiment-analysis, a (distil)BERT-based sentiment classification model: `https://huggingface.co/tabularisai/robust-sentiment-analysis`

In [12]:
# Libraries used for fomc-crawler.py and pdf-to-txt.py
# !pip install requests tqdm beautifulsoup4
# !pip install pdfplumber

In [16]:
import os
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import plotly.graph_objects as go
from   transformers import AutoTokenizer, AutoModelForSequenceClassification

---
### Load the Data
---

### Prerequisite
You must first run `./data/fomc-crawler.py` to download all the FOMC transcripts, then run `./data/pdf-to-txt.py` to convert the PDFs to TXT.

In [None]:
# Scrape FOMC Transcripts from https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm
# Please wait about 1 to 3 minutes.
# Code written by Claude 3.5 Sonnet (New)
!python ./data/fomc-crawler.py
# Outputs to ./data/fomc_transcripts

In [None]:
#!pip install pdfplumber
# Convert PDFs to TXT
# Please wait 1 to 3 minutes
# Code written by Claude 3.5 Sonnet (New)
!python ./data/pdf-to-txt.py
# Outputs to ./data/extracted_text

In [26]:
# Data directory
DATADIR = "./data/extracted_text" # Local FOMC transcript data as .txt

In [28]:
# List all TXT files in the directory
txt_files = [txt for txt in os.listdir(DATADIR) if txt.endswith('.txt')]
# Report how many TXT files were found
print(f"{len(txt_files)} documents ready for analysis!")

46 documents ready for analysis!


---
### (distil)BERT-based Sentiment Analysis
---

In [33]:
# If you encounter an error, you may not have Windows Long Path support enabled. 
# You can find information on how to enable this at https://pip.pypa.io/warnings/enable-long-paths
#!pip install transformers

In [None]:
# Load model and tokenizer
model_name = "tabularisai/robust-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Function to predict sentiment
def predict_sentiment(text):
	inputs = tokenizer(text.lower(), return_tensors="pt", truncation=True, padding=True, max_length=512)
	with torch.no_grad():
		outputs = model(**inputs)
	
	probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
	predicted_class = torch.argmax(probabilities, dim=-1).item()
	
	probs_list = probabilities[0].tolist()
	sentiment_map = {0: "Very Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Very Positive"}
	
	# Create a dictionary of sentiment labels and their probabilities
	sentiment_probs = {
		sentiment_map[i]: prob
		for i, prob in enumerate(probs_list)
	}

	return {
			'predicted_class': sentiment_map[predicted_class],
			'probabilities': sentiment_probs
			}

sentimentCount = {"Very Negative": 0, "Negative": 0, "Neutral": 0, "Positive": 0, "Very Positive": 0}
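for filename in txt_files:
	# Read the transcript contents; txt_files holds file names, not the text itself
	with open(os.path.join(DATADIR, filename), encoding="utf-8", errors="replace") as f:
		text = f.read()
	# Note: predict_sentiment truncates to 512 tokens, so only the start of each transcript is scored
	result = predict_sentiment(text)
	print(f"Text: {filename}")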
	print(f"Predicted Sentiment: {result['predicted_class']}")
	print("Probability Distribution:")
	for sentiment, prob in result['probabilities'].items():
		print(f"  {sentiment}: {prob * 100:.2f}%")
		sentimentCount[sentiment] += prob
	print()
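
Because `predict_sentiment` truncates its input at 512 tokens, a long transcript would be scored on its opening passage only. One possible workaround, sketched here and not part of the original notebook, is to split the text into smaller chunks, score each chunk, and average the per-class probabilities. The helper names `chunk_words` and `average_chunk_probs`, the `score_chunk` callable, and the 300-word chunk size are all assumptions for illustration. Whitespace-separated words only approximate model tokens, so a chunk of 300 words is not guaranteed to fit in 512 tokens.

```python
from typing import Callable, Dict, List


def chunk_words(text: str, words_per_chunk: int = 300) -> List[str]:
    """Split text into consecutive chunks of at most words_per_chunk words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def average_chunk_probs(
    text: str,
    score_chunk: Callable[[str], Dict[str, float]],
    words_per_chunk: int = 300,
) -> Dict[str, float]:
    """Average the label -> probability dicts returned by score_chunk over all chunks."""
    chunks = chunk_words(text, words_per_chunk)
    if not chunks:
        return {}  # empty document: nothing to score

    totals: Dict[str, float] = {}
    for chunk in chunks:
        for label, prob in score_chunk(chunk).items():
            totals[label] = totals.get(label, 0.0) + prob

    # Every chunk gets equal weight, regardless of its length
    return {label: total / len(chunks) for label, total in totals.items()}
```

With the notebook's function, the scorer could be passed as `lambda chunk: predict_sentiment(chunk)['probabilities']`. Equal weighting per chunk is a simple choice; weighting by chunk length would be an alternative.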

In [68]:
# Print the average probability assigned to each sentiment class
avgSentimentPcts = []
for sentiment in sentimentCount:
	avgPct = round(sentimentCount[sentiment] / len(txt_files) * 100, 2)
	avgSentimentPcts.append(avgPct)
	print(f"Average {sentiment}: \t{avgPct:.2f}%")

Average Very Negative: 	1.67%
Average Negative: 	4.65%
Average Neutral: 	60.32%
Average Positive: 	20.44%
Average Very Positive: 	12.92%


In [73]:
# Data preparation
sentiments = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]
percentages = avgSentimentPcts
colors = ["#ff4d4d", "#ff8c8c", "#8c8c8c", "#7fbf7f", "#2eb82e"]

# Create the bar chart
fig = go.Figure(data=[
    go.Bar(
        x=sentiments,
        y=percentages,
        marker_color=colors,
        text=[f'{p}%' for p in percentages],
        textposition='auto',
    )
])

fig.update_layout(
    title='Average FOMC Sentiment Distribution',
    xaxis_title='Sentiment',
    yaxis_title='Percentage (%)',
    yaxis_range=[0, 100],
    template='plotly_white',
    bargap=0.2
)

fig.show()

---
### Using Pre-Trained Word Embeddings (word2vec)
---

---
### Embeddings Matrix
---

---
### Converting word tokens to index values
---

---
### Indexing of a short speech

---

---
## Padding Speeches
---

---
## Split Data into Training, Validation, and Test

At a minimum, the data must be split into training and test sets. Many training loops also evaluate on validation data at the end of each epoch, which allows a comparison between training and validation losses (a validation loss that is much higher than the training loss, or that keeps growing while the training loss falls, may indicate overfitting).

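The split is controlled by `trainSplitPercent` and `validationSplitPercent`. The shapes printed below (3816, 212, and 212 rows out of 4240) correspond to 90% training and 5% each for validation and test.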

---
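
The two-stage slicing used for this split can be sketched on placeholder data. The helper name `three_way_split`, the zero-filled arrays, and the fractions 0.9 and 0.5 are assumptions for illustration. They were chosen because they reproduce the row counts printed by the notebook's split cell (3816 / 212 / 212), not because the notebook states those values.

```python
import numpy as np


def three_way_split(features, labels, train_frac, val_frac):
    """Slice off the training portion, then divide the remainder into validation and test."""
    tt_split = int(len(features) * train_frac)
    train = (features[:tt_split], labels[:tt_split])
    rest_x, rest_y = features[tt_split:], labels[tt_split:]

    # val_frac is a fraction of the remainder, not of the full dataset
    vt_split = int(len(rest_x) * val_frac)
    val = (rest_x[:vt_split], rest_y[:vt_split])
    test = (rest_x[vt_split:], rest_y[vt_split:])
    return train, val, test


# Placeholder data with the same shape as the notebook's feature matrix
features = np.zeros((4240, 507), dtype=np.int8)
labels = np.zeros(4240, dtype=np.int8)

train, val, test = three_way_split(features, labels, train_frac=0.9, val_frac=0.5)
print(train[0].shape, val[0].shape, test[0].shape)
```

Note that the slices are taken in order, so this sketch assumes the samples were already shuffled; otherwise the test set may not be representative.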

In [15]:
# training/test split (validation will come from test portion)
tt_split = int(len(features) * trainSplitPercent)

train_x, valtest_x = features[:tt_split], features[tt_split:]
train_y, valtest_y = encoded_labels[:tt_split], encoded_labels[tt_split:]

# Validation/test split (further split test data into validation and test)
vt_split = int(len(valtest_x) * validationSplitPercent) # Default 0.5
val_x, test_x = valtest_x[:vt_split], valtest_x[vt_split:]
val_y, test_y = valtest_y[:vt_split], valtest_y[vt_split:]

# Show shapes of data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
		"\nValidation set: \t{}".format(val_x.shape),
		"\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set: 		(3816, 507) 
Validation set: 	(212, 507) 
Test set: 		(212, 507)


---
## Batching and DataLoaders
---

---
### The model
---

---
### Model Parameters

---

---
### Neural Network Hyperparameters

---

---
### Training

---

---
### Training Loop

---

---
## Testing the Model

---

In [26]:
# Testing loop
# Metrics used below (harmless if already imported in an earlier cell)
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def rnn_test(test_loader):
    # Turn off gradient calculations (saves time and compute resources)
    with torch.no_grad():

        # Variables for tracking losses and correct predictions
        test_losses = []
        num_correct = 0

        true_list = []
        pred_list = []

        # Place model in evaluation mode
        rnn_model.eval()

        # Run test data through model
        for inputs, labels in test_loader:

            # Move test data batch to GPU/CPU
            inputs, labels = inputs.to(device), labels.to(device)

            # Get predicted output (one score per class)
            output = rnn_model(inputs)

            # Calculate the loss
            test_loss = rnn_criterion(output.squeeze(), labels)
            test_losses.append(test_loss.item())

            # Convert per-class scores to the index of the highest-scoring class
            pred = torch.argmax(output, dim=1)

            # Place true and predicted labels in list
            true_list += list(labels.cpu().numpy())
            pred_list += list(pred.cpu().numpy())

            # Compare predicted and true labels and count the number of correct predictions
            correct_tensor = pred.eq(labels.view_as(pred))
            num_correct += correct_tensor.sum().item()

    pred_list = [a.squeeze().tolist() for a in pred_list]
    print(confusion_matrix(true_list, pred_list))
    print()
    print(classification_report(true_list, pred_list))
    print()
    print(f"Accuracy {accuracy_score(true_list, pred_list):.2%}")

    # Output average test loss
    print("Test loss: {:.3f}".format(np.mean(test_losses)))

    # Output average accuracy
    test_acc = num_correct/len(test_loader.dataset)
    print("Test accuracy: {:.3f}".format(test_acc))

rnn_test(test_loader)

[[29  0  0  0  0  0]
 [ 1 39  1  2  5  0]
 [ 0  0 30  0  2  0]
 [ 0  1  0 30  3  0]
 [ 0  0  0  3 32  2]
 [ 1  0  0  0  1 30]]

              precision    recall  f1-score   support

           0       0.94      1.00      0.97        29
           1       0.97      0.81      0.89        48
           2       0.97      0.94      0.95        32
           3       0.86      0.88      0.87        34
           4       0.74      0.86      0.80        37
           5       0.94      0.94      0.94        32

    accuracy                           0.90       212
   macro avg       0.90      0.91      0.90       212
weighted avg       0.90      0.90      0.90       212


Accuracy 89.62%
Test loss: 0.340
Test accuracy: 0.896
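
The per-class figures in the report above can be re-derived from the confusion matrix by hand. The sketch below copies the printed matrix and relies on scikit-learn's convention that rows are true classes and columns are predicted classes. The variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = np.array([
    [29,  0,  0,  0,  0,  0],
    [ 1, 39,  1,  2,  5,  0],
    [ 0,  0, 30,  0,  2,  0],
    [ 0,  1,  0, 30,  3,  0],
    [ 0,  0,  0,  3, 32,  2],
    [ 1,  0,  0,  0,  1, 30],
])

correct = np.diag(cm)           # correctly classified samples per class
support = cm.sum(axis=1)        # number of true samples per class (row sums)
predicted = cm.sum(axis=0)      # number of predictions per class (column sums)

recall = correct / support      # fraction of each true class that was found
precision = correct / predicted # fraction of each predicted class that was right
accuracy = correct.sum() / cm.sum()

for cls in range(len(cm)):
    print(f"class {cls}: precision={precision[cls]:.2f} "
          f"recall={recall[cls]:.2f} support={support[cls]}")
print(f"accuracy={accuracy:.2%}")
```

For example, class 1 has 48 true samples, of which 39 were predicted correctly, which gives the recall of 39/48 shown in the report.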
