Drew Lickman

CSCI 4820-001

Project #7

Due 12/??/24

AI Disclaimer: A.I. Disclaimer: Work for this assignment was completed with the aid of artificial intelligence tools and comprehensive documentation of the names of, input provided to, and output obtained from, these tools is included as part of my assignment submission.

# Custom NLP Project using 3 Hugging Face Pipelines
### Dr. Sal Barbosa, Department of Computer Science, Middle Tennessee State University

# Project Description
This project is used to analyze the transcripts of the Federal Open Market Committees (FOMC).

I chose this project because I believe it is important for people to get a quick and easy-to-understand analysis of the FOMC meetings. The FOMC "reviews economic and financial conditions, determines the appropriate stance of monetary policy, and assesses the risks to its long-run goals of price stability and sustainable economic growth". (https://www.federalreserve.gov/monetarypolicy/fomc.htm)

The dataset I used is the FOMC transcripts from each of their meetings. I created a web scraper to read the FOMC website and download the PDFs.

This JupyterNotebook will:

1. Download PDF transcripts from the official FOMC website using `fomc-crawler.py`
2. Convert the PDFs to text files with `pdf-to-txt.py`
3. Use a slightly modified version of tabularisai's robust-sentiment-analysis (distil)BERT-based Sentiment Classification Model `https://huggingface.co/tabularisai/robust-sentiment-analysis`
4. Summarize each document via pipeline of Falconsai's text_summarization Fine-Tuned T5 Small for Text Summarization Model `https://huggingface.co/Falconsai/text_summarization`

In [1]:
# Libraries used for fomc-crawler.py and pdf-to-txt.py
# !pip install requests tqdm beautifulsoup4
# !pip install pdfplumber

In [2]:
import os
import nltk
import torch
import numpy as np
import torch.nn as nn
nltk.download('punkt')
import torch.optim as optim
import plotly.graph_objects as go
from   nltk.tokenize import sent_tokenize
from   transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\drew1\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
  from .autonotebook import tqdm as notebook_tqdm


---
### Load the Data
---

### Prerequisite
You must run `./data/fomc-crawler.py` and `./data/pdf-to-txt.py` to download all the FOMC transcripts first, then convert the PDFs to TXT.

In [18]:
# Scrape FOMC Transcripts from https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm
# Please wait about 1 to 3 minutes.
# Code written by Claude 3.5 Sonnet (New)
!python ./data/fomc-crawler.py
# Outputs to ./data/fomc_transcripts

Finding press conference pages...

Found 48 press conference pages.

Gathering transcript PDF links...

Found 46 transcript PDFs to download:
- https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20240131.pdf
- https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20240320.pdf
- https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20240501.pdf
- https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20240612.pdf
- https://www.federalreserve.gov/mediacenter/files/fomcpresconf20240731.pdf
- https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20240918.pdf
- https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20241107.pdf
- https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20230201.pdf
- https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20230322.pdf
- https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20230503.pdf
- https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20230614.pdf
- https://www.federalr


  0%|          | 0/48 [00:00<?, ?it/s]
  2%|▏         | 1/48 [00:00<00:16,  2.87it/s]
  4%|▍         | 2/48 [00:00<00:15,  3.01it/s]
  6%|▋         | 3/48 [00:01<00:15,  2.96it/s]
  8%|▊         | 4/48 [00:01<00:14,  3.06it/s]
 10%|█         | 5/48 [00:01<00:14,  2.97it/s]
 12%|█▎        | 6/48 [00:02<00:14,  2.88it/s]
 15%|█▍        | 7/48 [00:02<00:13,  2.98it/s]
 17%|█▋        | 8/48 [00:02<00:13,  2.96it/s]
 19%|█▉        | 9/48 [00:03<00:13,  2.87it/s]
 21%|██        | 10/48 [00:03<00:12,  3.02it/s]
 23%|██▎       | 11/48 [00:03<00:12,  2.96it/s]
 25%|██▌       | 12/48 [00:04<00:12,  2.84it/s]
 27%|██▋       | 13/48 [00:04<00:12,  2.73it/s]
 29%|██▉       | 14/48 [00:04<00:12,  2.66it/s]
 31%|███▏      | 15/48 [00:05<00:12,  2.71it/s]
 33%|███▎      | 16/48 [00:05<00:11,  2.73it/s]
 35%|███▌      | 17/48 [00:06<00:11,  2.66it/s]
 38%|███▊      | 18/48 [00:06<00:10,  2.77it/s]
 40%|███▉      | 19/48 [00:06<00:10,  2.85it/s]
 42%|████▏     | 20/48 [00:07<00:09,  2.85it/s]
 44%|████

In [4]:
#!pip install pdfplumber
# Convert PDFs to TXT
# Please wait 1 to 3 minutes
# Code written by Claude 3.5 Sonnet (New)
!python ./data/pdf-to-txt.py
# Outputs to ./data/extracted_text

Batch conversion completed successfully!


2024-11-23 16:47:28,774 - INFO - Successfully generated FOMCpresconf20190130.txt
2024-11-23 16:47:30,424 - INFO - Successfully generated FOMCpresconf20190320.txt
2024-11-23 16:47:31,833 - INFO - Successfully generated FOMCpresconf20190501.txt
2024-11-23 16:47:33,415 - INFO - Successfully generated FOMCpresconf20190619.txt
2024-11-23 16:47:35,072 - INFO - Successfully generated FOMCpresconf20190731.txt
2024-11-23 16:47:36,960 - INFO - Successfully generated FOMCpresconf20190918.txt
2024-11-23 16:47:38,692 - INFO - Successfully generated FOMCpresconf20191030.txt
2024-11-23 16:47:40,650 - INFO - Successfully generated FOMCpresconf20191211.txt
2024-11-23 16:47:42,658 - INFO - Successfully generated FOMCpresconf20200129.txt
2024-11-23 16:47:44,399 - INFO - Successfully generated FOMCpresconf20200429.txt
2024-11-23 16:47:46,677 - INFO - Successfully generated FOMCpresconf20200610.txt
2024-11-23 16:47:48,999 - INFO - Successfully generated FOMCpresconf20200729.txt
2024-11-23 16:47:51,319 - IN

In [5]:
# Data directory
DATADIR = "./data/extracted_text" # Local FOMC transcript data as .txt

In [6]:
# Save text files and their data to a dictionary
txt_fileNames = [txt for txt in os.listdir(DATADIR) if txt.endswith('.txt')]
# Print the title of each TXT file
print(f"{len(txt_fileNames)} documents ready for analysis!")

txt_data = [open(os.path.join(DATADIR, file), 'r', encoding='utf-8').read() for file in txt_fileNames]

textDict = {fileName: data for fileName, data in zip(txt_fileNames, txt_data)}

46 documents ready for analysis!


In [7]:
# Helper function that splits an input text into chunks, and attempts to keep sentences together
# Written by Claude 3.5 Sonnet (New)

def chunk_text(text, max_chunk_size):
    """
    Split text into chunks based on sentences to respect max token limit.
    Tries to keep sentences together while staying under the token limit.
    """
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_length = 0
    
    for sentence in sentences:
        # Rough approximation of tokens (words + punctuation)
        sentence_length = len(sentence.split())
        
        if current_length + sentence_length > max_chunk_size:
            if current_chunk:  # Save current chunk if it exists
                chunks.append(' '.join(current_chunk))
                current_chunk = [sentence]
                current_length = sentence_length
            else:  # Handle case where single sentence exceeds max_chunk_size
                chunks.append(sentence)
                current_chunk = []
                current_length = 0
        else:
            current_chunk.append(sentence)
            current_length += sentence_length
    
    # Add the last chunk if it exists
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks

---
### (distil)BERT-based Sentiment Analysis
---

In [8]:
# If you encounter an error, you may not have Windows Long Path support enabled. 
# You can find information on how to enable this at https://pip.pypa.io/warnings/enable-long-paths
#!pip install transformers

In [9]:
# tabularisai's robust-sentiment-analysis used via pipeline:
# Modified to be chunked for longer input texts
# also outputs probability distribution, rather than just the highest result
# Please wait 2 to 4 minutes.
model_name = "tabularisai/robust-sentiment-analysis"
sentimentAnalysis = pipeline(model=model_name, device=device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Pipeline from Hugging Face (copied from example on page, had to modify to get probability distribution)
def predict_sentiment(text):
	inputs = tokenizer(text.lower(), return_tensors="pt", truncation=True, padding=True, max_length=512)
	with torch.no_grad():
		outputs = model(**inputs)
	
	probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
	predicted_class = torch.argmax(probabilities, dim=-1).item()
	
	probs_list = probabilities[0].tolist()
	sentiment_map = {0: "Very Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Very Positive"}
	
	# Create a dictionary of sentiment labels and their probabilities
	sentiment_probs = {
						sentiment_map[i]: prob
						for i, prob in enumerate(probs_list)
						}

	return {
			'predicted_class': sentiment_map[predicted_class],
			'probabilities': sentiment_probs
			}

def analyze_long_text(text, max_chunk_size):
	"""
	Analyze sentiment of long text by breaking it into chunks and averaging results.
	"""
	# Clean text
	text = text.replace('\n', ' ').strip()
	
	# Split into chunks using existing chunk_text function
	chunks = chunk_text(text, max_chunk_size)
	
	# Analyze each chunk
	chunk_sentiments = {"Very Negative": 0, "Negative": 0, "Neutral": 0, "Positive": 0, "Very Positive": 0}
	valid_chunks = 0
	
	for chunk in chunks:
		try:
			result = predict_sentiment(chunk) # Uses modified pipeline
			for sentiment, prob in result['probabilities'].items():
				chunk_sentiments[sentiment] += prob
			valid_chunks += 1
		except Exception as e:
			print(f"Error processing chunk: {e}")
			continue
	
	# Average the sentiments
	if valid_chunks > 0:
		for sentiment in chunk_sentiments:
			chunk_sentiments[sentiment] /= valid_chunks
	
	# Determine overall sentiment
	max_sentiment = max(chunk_sentiments.items(), key=lambda x: x[1])
	
	return {
			'predicted_class': max_sentiment[0],
			'probabilities': chunk_sentiments
			}

# Updated sentiment analysis loop
sentimentCount = {"Very Negative": 0, "Negative": 0, "Neutral": 0, "Positive": 0, "Very Positive": 0}
for txt in textDict:
    try:
        result = analyze_long_text(textDict[txt], max_chunk_size=256)
        print(f"File: {txt}")
        print(f"Predicted Sentiment: {result['predicted_class']}")
        print("Probability Distribution:")
        for sentiment, prob in result['probabilities'].items():
            print(f"  {sentiment}: {prob * 100:.2f}%")
            sentimentCount[sentiment] += prob
        print()
    except Exception as e:
        print(f"Error processing {txt}: {e}")


File: FOMCpresconf20190130.txt
Predicted Sentiment: Neutral
Probability Distribution:
  Very Negative: 2.87%
  Negative: 10.58%
  Neutral: 63.98%
  Positive: 15.65%
  Very Positive: 6.92%

File: FOMCpresconf20190320.txt
Predicted Sentiment: Neutral
Probability Distribution:
  Very Negative: 3.92%
  Negative: 8.50%
  Neutral: 56.45%
  Positive: 22.29%
  Very Positive: 8.84%

File: FOMCpresconf20190501.txt
Predicted Sentiment: Neutral
Probability Distribution:
  Very Negative: 1.88%
  Negative: 8.70%
  Neutral: 65.71%
  Positive: 17.21%
  Very Positive: 6.50%

File: FOMCpresconf20190619.txt
Predicted Sentiment: Neutral
Probability Distribution:
  Very Negative: 1.36%
  Negative: 5.72%
  Neutral: 66.87%
  Positive: 18.17%
  Very Positive: 7.88%

File: FOMCpresconf20190731.txt
Predicted Sentiment: Neutral
Probability Distribution:
  Very Negative: 2.06%
  Negative: 8.54%
  Neutral: 64.95%
  Positive: 16.42%
  Very Positive: 8.03%

File: FOMCpresconf20190918.txt
Predicted Sentiment: Neutral

In [10]:
# Print average sentiment confidence
avgSentimentPcts = []
for sentiment in sentimentCount:
	avgSentimentPcts.append(float(f"{sentimentCount[sentiment]/len(textDict) * 100:.2f}"))
	print(f"Average {sentiment}: \t{sentimentCount[sentiment]/len(textDict) * 100:.2f}%")
#print(avgSentimentPcts)

Average Very Negative: 	3.18%
Average Negative: 	8.19%
Average Neutral: 	60.66%
Average Positive: 	18.30%
Average Very Positive: 	9.68%


In [11]:
# Data preparation
sentiments = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]
percentages = avgSentimentPcts
colors = ["#ff4d4d", "#ff8c8c", "#8c8c8c", "#7fbf7f", "#2eb82e"]

# Create the bar chart
fig = go.Figure(data=[
    go.Bar(
        x=sentiments,
        y=percentages,
        marker_color=colors,
        text=[f'{p}%' for p in percentages],
        textposition='auto',
    )
])

fig.update_layout(
    title='Average FOMC Sentiment Distribution',
    xaxis_title='Sentiment',
    yaxis_title='Percentage (%)',
    yaxis_range=[0, 100],
    template='plotly_white',
    bargap=0.2
)

fig.show()

---
### Summarize each document
---

In [37]:
# Falconsai's text_summarization used via pipeline:
# Modified to be chunked for longer input texts
# Please wait 14 - 18 minutes.
summarizer = pipeline(model="Falconsai/text_summarization", device=device)

def summarize_long_text(text, summarizer, max_length_div, min_length_div, max_chunk_size):
	"""
	Summarize long text by breaking it into chunks and combining summaries.
	"""
	# Clean text
	text = text.replace('\n', ' ').strip()
	
	# Split into chunks
	chunks = chunk_text(text, max_chunk_size)
	chunkLen = len(chunks)
	max_length = chunkLen // max_length_div
	min_length = chunkLen // min_length_div
	# Summarize each chunk
	chunk_summaries = []
	for chunk in chunks:
		try:
			result = summarizer(chunk, max_length=max_length, min_length=min_length) # Pipeline from Hugging Face
			chunk_summaries.append(result[0]['summary_text'])
		except Exception as e:
			print(f"Error processing chunk: {e}")
			continue
	
	# Combine chunk summaries
	if len(chunks) == 1:
		return chunk_summaries[0]
	else:
		# For multiple chunks, create a final summary of the combined summaries
		combined_summary = ' '.join(chunk_summaries)
		
		word_count = len(combined_summary.split()) # Count words, not characters
		try:
			max_length = word_count // max_length_div
			min_length = word_count // min_length_div
			
			final_summary = summarizer(combined_summary, 
									max_length=max_length,
									min_length=min_length)[0]['summary_text']
			return final_summary
		except Exception as e:
			print(f"Error in final summarization: {e}")
			return combined_summary

for txt in textDict:
	try:
		length = len(textDict[txt])
		summary = summarize_long_text(
			text=textDict[txt],
			summarizer=summarizer,
			max_length_div=2, #divisor of chunk
			min_length_div=10, #divisor of chunk
			max_chunk_size=256  # Adjust based on model's token limit
		)
		summary_dir = f"./data/summaries"
		if not os.path.exists(summary_dir):
			os.makedirs(summary_dir)
		with open(os.path.join(summary_dir, txt), "w+") as summary_file:
			summary_file.write(f"File: {txt}\nSummary: {summary}\n")
	except Exception as e:
		print(f"Error processing {txt}: {e}")

Token indices sequence length is longer than the specified maximum sequence length for this model (558 > 512). Running this sequence through the model will result in indexing errors


---
### Embeddings Matrix
---

---
### Converting word tokens to index values
---

---
### Indexing of a short speech

---

---
## Padding Speeches
---

---
## Split Data into Training, Validation, and Test

The data must be split into training and test data minimally. Many training loops can also use validation data at the end of each epoch, allowing a comparison between training and validation losses (if this value is high or growing it may indicate overfitting).

The split for this demonstration will be 80% training and 10% each for test and validation.

---

In [None]:
# training/test split (validation will come from test portion)
tt_split = int(len(features) * trainSplitPercent)

train_x, valtest_x = features[:tt_split], features[tt_split:]
train_y, valtest_y = encoded_labels[:tt_split], encoded_labels[tt_split:]

# Validation/test split (further split test data into validation and test)
vt_split = int(len(valtest_x) * validationSplitPercent) # Default 0.5
val_x, test_x = valtest_x[:vt_split], valtest_x[vt_split:]
val_y, test_y = valtest_y[:vt_split], valtest_y[vt_split:]

# Show shapes of data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
		"\nValidation set: \t{}".format(val_x.shape),
		"\nTest set: \t\t{}".format(test_x.shape))

---
## Batching and DataLoaders
---

---
### The model
---

---
### Model Parameters

---

---
### Neural Network Hyperparameters

---

---
### Training

---

---
### Training Loop

---

---
## Testing the Model

---

In [None]:
#Testing loop
def rnn_test(test_loader):
    # Turn off gradient calculations (saves time and compute resources)
    with torch.no_grad():
    
        # Variables for tracking losses
        test_losses = [] 
        num_correct = 0
    
        true_list = []
        pred_list = []
    
        # Place model in evaluation mode
        rnn_model.eval()
    
        # Run test data through model
        for inputs, labels in test_loader:
    
            # Move test data batch to GPU/CPU
            inputs, labels = inputs.to(device), labels.to(device)
    
            # Get predicted output
            output = rnn_model(inputs)
    
            # Calculate the loss
            # test_loss = rnn_criterion(output.squeeze(), labels.float())
            test_loss = rnn_criterion(output.squeeze(), labels)
            test_losses.append(test_loss.item())
    
            # Convert output sigmoid probabilities to predicted classes (0 or 1)
            #pred = torch.round(output.squeeze())  # rounds to the nearest integer
            pred = torch.argmax(output, dim=1)
    
            # Place true and predicted labels in list
            true_list += list(labels.cpu().numpy())
            pred_list += list(pred.cpu().numpy())
    
            # Compare predicted and true labels and count number of correct prediction
            correct_tensor = pred.eq(labels.float().view_as(pred))
            correct = np.squeeze(correct_tensor.numpy()) if device=='cpu' else np.squeeze(correct_tensor.cpu().numpy())
            num_correct += np.sum(correct)
    
    pred_list = [a.squeeze().tolist() for a in pred_list]
    print(confusion_matrix(true_list, pred_list))
    print()
    print(classification_report(true_list, pred_list))
    print()
    print(f"Accuracy {accuracy_score(true_list, pred_list):.2%}")
    
    # Output average test loss
    print("Test loss: {:.3f}".format(np.mean(test_losses)))
    
    # Output average accuracy
    test_acc = num_correct/len(test_loader.dataset)
    print("Test accuracy: {:.3f}".format(test_acc))

rnn_test(test_loader)