<a href="https://colab.research.google.com/github/Banking-Analytics-Lab/DLinBankingBook/blob/main/Labs/TextBook_Lab_Chap4_Textual_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Transformers for Text Analysis**

In this lab, we will explore Transformer models for text analysis using the [Hugging Face Transformers library](https://huggingface.co/docs/transformers/index). This library provides a wide range of pre-trained models that we can leverage for various natural language processing (NLP) tasks.


First, we need to install and import the necessary libraries.

In [None]:
# Install necessary packages, if not done before
!pip install transformers evaluate accelerate

## **Downloading Datasets**

Now, we will use two datasets for this lab:

1. **Federal Reserve Speeches (1996–2024)**  
   - This dataset contains **text data** from speeches delivered by Federal Reserve officials over the years.  

2. **[Chicago Fed National Activity Index (CFNAI)](https://fred.stlouisfed.org/series/CFNAI)**  
   - The CFNAI is a weighted average of **85 monthly indicators** of national economic activity, covering areas such as employment, production, and consumption.  
   - It helps measure national economic activity:
     - **Zero value** → Economy is growing at its historical trend rate.
     - **Negative values** → Below-average growth (treated in this lab as a downturn).
     - **Positive values** → Above-average growth (treated in this lab as an upturn).

In [None]:
!gdown --fuzzy 'https://drive.google.com/file/d/1uVt9BC2tgr-MWrFZvYvA_I8IzTabNZtL/view?usp=sharing'
!gdown --fuzzy 'https://drive.google.com/file/d/1I7isSks6Y8kJoigbDZJumgdQ1fFpwn4i/view?usp=sharing'

In [None]:
# Imports
import numpy as np
import os
import pandas as pd
import random
# Scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import roc_auc_score, confusion_matrix, roc_curve, auc

# Plots
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
%matplotlib inline

# Import PyTorch libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.data import TensorDataset, DataLoader, random_split
from torch.optim.lr_scheduler import _LRScheduler

# Huggingface
import transformers
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import pipeline
from transformers import set_seed
from datasets import load_dataset, Dataset, Value, ClassLabel, Features, load_from_disk
import evaluate

Now, we will load the Federal Reserve Speeches dataset.
Once loaded, we will inspect the structure of the dataset to understand its key columns, such as date, speaker, speech content, and topic. This will help us determine how to preprocess and analyze the text data effectively.

In [None]:
fed_speech = pd.read_csv("/content/fed_speeches.csv", delimiter=",", on_bad_lines="skip", engine="python")

In [None]:
fed_speech.info()

In [None]:
fed_speech.head(10)

## Data Preprocessing: Merging

In this step, we will preprocess the Federal Reserve Speeches dataset to prepare it for analysis. First, we will extract the **year** and **month** from the speech dates. Then, we will shift the month forward by one to align each speech with the economic conditions of the following month.

Our goal is to predict the next month's economic upturn or downturn based on the language used in Federal Reserve speeches. This adjustment ensures that our model learns from past speeches to forecast future economic trends more effectively.
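The month shift with its December rollover can be sketched in plain Python. The helper name `next_month` is an illustrative assumption and is not part of the lab code; the pandas cell that follows does the same thing column-wise.

```python
from typing import Tuple


def next_month(year: int, month: int) -> Tuple[int, int]:
    """Return the (year, month) pair that follows the given one."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # For a 1-based month, divmod(month, 12) gives (0, month) for January
    # to November and (1, 0) for December, i.e. a year carry plus the
    # zero-based index of the following month.
    carry, zero_based_next = divmod(month, 12)
    return year + carry, zero_based_next + 1


print(next_month(2024, 11))
print(next_month(2024, 12))
```

A speech given in December 2024 is therefore matched with the index value for January 2025.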

In [None]:
# Create year and month columns
fed_speech['date'] = pd.to_datetime(fed_speech['date'], errors='coerce')
fed_speech["year"] = fed_speech["date"].dt.year  # Extract year
fed_speech["month"] = fed_speech["date"].dt.month  # Extract month

# Shift each speech's month forward by 1
fed_speech["month"] += 1

# Handle December (12 → 13 → 1) and move those rows to the next year
december_rollover = fed_speech["month"] == 13
fed_speech.loc[december_rollover, "month"] = 1
fed_speech.loc[december_rollover, "year"] += 1

Next, we will load the Chicago Fed National Activity Index (CFNAI) dataset, which serves as a key indicator of U.S. economic activity.

Once loaded, we will merge the CFNAI dataset with the Federal Reserve Speeches dataset using year and month as the merging keys. Since we previously adjusted the speech dataset by shifting the month forward, this ensures that each speech is aligned with the economic activity of the following month.
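To see what a left merge on `year` and `month` does, here is a minimal sketch on toy data. The frames `speeches` and `index_values` and all numbers in them are made up for illustration and are not real CFNAI readings.

```python
import pandas as pd

# Toy speeches, already shifted to the month they should be matched with
speeches = pd.DataFrame({
    "year": [2024, 2024, 2025],
    "month": [11, 12, 1],
    "text": ["speech a", "speech b", "speech c"],
})

# Toy index values (invented); January 2025 is deliberately missing
index_values = pd.DataFrame({
    "year": [2024, 2024],
    "month": [11, 12],
    "CFNAI": [-0.12, 0.18],
})

# A left merge keeps every speech; months without a match get NaN
merged = speeches.merge(index_values, on=["year", "month"], how="left")
print(merged)
```

Because the merge is a left join, speeches without a matching index value stay in the result with a missing `CFNAI`, which is why such rows have to be handled separately later on.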

In [None]:
econ_index = pd.read_csv('/content/CFNAI.csv')
econ_index.columns = ['date', 'CFNAI']
econ_index['date'] = pd.to_datetime(econ_index['date'], errors='coerce')
econ_index["year"] = econ_index["date"].dt.year  # Extract year
econ_index["month"] = econ_index["date"].dt.month  # Extract month
econ_index.drop(columns=['date'], inplace=True)
econ_index

In [None]:
merged_df = fed_speech.merge(econ_index, on=["year", "month"], how="left")
merged_df

## **Text Preprocessing: Tokenization, Stopword Removal, and Cleaning**

In this step, we preprocess the speech text by **tokenizing, removing stopwords, and eliminating punctuation** to prepare the data for further analysis.

### **1. Import Required Libraries**
The following imports are used, two from **NLTK (Natural Language Toolkit)** and one from the Python standard library:
- `word_tokenize` (NLTK) → Splits text into individual words (tokens).
- `stopwords` (NLTK) → Provides a list of common English stopwords (e.g., "the", "is", "and").
- `string` (standard library) → Provides the punctuation characters to remove.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

nltk.download('stopwords')
nltk.download('punkt_tab')
stop_words = set(stopwords.words("english"))


def clean_text(text):
    text = text.lower()  # Convert to lowercase
    tokens = word_tokenize(text)  # Tokenize text
    table = str.maketrans('', '', string.punctuation)  # Create a table for removing punctuation
    filtered_tokens = [
        token.translate(table) for token in tokens
        if token.isalnum() and token not in stop_words  # Remove stop words here!
    ]
    cleaned_text = ' '.join(filtered_tokens)
    return cleaned_text

merged_df['text_cleaned'] = merged_df['text'].apply(clean_text)

As we discuss in the book, the best set of cleaning steps depends on the application, so you should test it empirically. Usually, one of three options performs best: leaving the text as is, removing stopwords, or lowercasing. Combining strategies may decrease performance. Try different strategies and see if you can improve this model!
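One way to compare cleaning strategies is to make each step optional. The sketch below is a simplified stand-in for `clean_text`: the function `clean_variant`, the regex tokenizer, and the tiny `TOY_STOPWORDS` set are all assumptions for illustration (NLTK's tokenizer and stopword list behave differently).

```python
import re

# Tiny illustrative stopword list (an assumption; NLTK's list is much longer)
TOY_STOPWORDS = {"the", "is", "and", "a", "of", "to", "in"}


def clean_variant(text, lowercase=True, remove_stopwords=True):
    """Apply an optional subset of cleaning steps and return the cleaned text."""
    if lowercase:
        text = text.lower()
    # Keep runs of letters and digits; this drops punctuation
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in TOY_STOPWORDS]
    return " ".join(tokens)


sample = "The rate of Inflation is rising, and the Committee is watching."
for lowercase in (False, True):
    for remove_stopwords in (False, True):
        print(lowercase, remove_stopwords, "->",
              clean_variant(sample, lowercase, remove_stopwords))
```

Each variant could then be written to its own column and used to train a separate model, so that the strategies are compared on the same train/test split.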

You can inspect how the **`clean_text`** function processed the text by viewing the cleaned version stored in the **`text_cleaned`** column.

In [None]:
merged_df.tail(5)

As observed above, rows corresponding to **January 2025 (2025-01)** do not have CFNAI values. To address this, we will extract these rows and set them aside for testing.


In [None]:
# Extract every row that contains a NaN value (here, the speeches without a CFNAI value)
nan_rows_df = merged_df[merged_df.isna().any(axis=1)]

# Remove these rows from the main DataFrame
merged_df = merged_df.drop(nan_rows_df.index)

## **Visualizing Frequent Words in FED Speeches with a Word Cloud**  

In this step, we generate a **word cloud** to visualize the most frequently used words in the **Federal Reserve speeches dataset**. A word cloud is a useful tool for quickly identifying common terms in text data.
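A word cloud is a visual rendering of word counts. As a rough sketch of the counting behind it, the snippet below uses `collections.Counter` on a few made-up strings; `top_words` and `toy_texts` are illustrative names, and the `wordcloud` package applies its own tokenization and scaling on top of this idea.

```python
from collections import Counter


def top_words(texts, n=5):
    """Count whitespace-separated tokens across texts and return the n most common."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts.most_common(n)


# Made-up cleaned texts, for illustration only
toy_texts = [
    "inflation risk financial stability",
    "inflation expectations policy risk",
    "financial markets inflation",
]
print(top_words(toy_texts, n=3))
```

On the real data, the same function could be called as `top_words(merged_df['text_cleaned'])` to get exact counts for the words that look largest in the cloud.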


In [None]:
from wordcloud import WordCloud

sample_txt = " ".join(i for i in merged_df['text_cleaned'])

wc = WordCloud(colormap="Set2",collocations=False).generate(sample_txt)
plt.title("Most Frequent Words in FED Speeches")
plt.axis("off")
plt.imshow(wc,interpolation='bilinear')
plt.show()

Pretty cool! We can see that the Federal Reserve frequently uses words like risk, inflation, and financial, among others.

## Labelling

In this step, we categorize the **Chicago Fed National Activity Index (CFNAI)** values into binary labels to prepare our dataset for classification.

---

### **1. Define Bins and Labels**  
We create two categories based on the CFNAI values:  
- **Negative or zero CFNAI (`≤ 0`) → Label 1**  
- **Positive CFNAI (`> 0`) → Label 0**  
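The labelling rule can be written as a one-line function. This is a minimal sketch for single values; `label_from_cfnai` is a hypothetical helper, and the test values are arbitrary.

```python
import math


def label_from_cfnai(value):
    """Return 1 for CFNAI <= 0 and 0 for CFNAI > 0."""
    if math.isnan(value):
        # Missing index values cannot be labelled and must be removed first
        raise ValueError("CFNAI value is missing")
    return 1 if value <= 0 else 0


for value in (-0.5, 0.0, 0.3):
    print(value, "->", label_from_cfnai(value))
```

Note that a value of exactly zero falls into class 1, matching the `≤ 0` rule above.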


In [None]:
# Define bins and labels
bins = [-float('inf'), 0, float('inf')]
labels = [1, 0]

# Apply categorization
merged_df["label"] = pd.cut(
    merged_df["CFNAI"], bins=bins, labels=labels, include_lowest=True
)

# Convert to integer type
merged_df["label"] = merged_df["label"].astype(int)

merged_df

We can see that the labels are fairly balanced.

In [None]:
merged_df.label.value_counts()

Now that we have cleaned and processed the dataset, we will save it for future use.


In [None]:
# Save to CSV
merged_df.to_csv('FEDSpeechesProcessed.csv', index=False)

## **Preparing the Dataset for Model Training**  

In this step, we **convert the preprocessed dataset into a Hugging Face `Dataset` format**, encode the labels, and split the data into training and testing sets.

---

We extract the **cleaned text (`text_cleaned`)** and its corresponding **label (`label`)** from `merged_df`, then convert it into a Hugging Face `Dataset`.


In [None]:
# Create the dataset
fed_speech_data = Dataset.from_pandas(merged_df.loc[:,['text_cleaned', 'label']])

# Set the label variable
fed_speech_data = fed_speech_data.class_encode_column("label")

# Drop the index variable
fed_speech_data = fed_speech_data.remove_columns(["__index_level_0__"])

# Train / test split (33% test); the seed makes the split reproducible
fed_speech_data = fed_speech_data.train_test_split(test_size=0.33, seed=42)
fed_speech_data

In [None]:
fed_speech_data['train'].features

## **Tokenizing the Text Data**  

Before feeding our text data into a Transformer model, we need to **tokenize** it. Tokenization converts raw text into sequences of integer token IDs that the model can process.
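The idea of mapping text to integer IDs, with truncation, can be illustrated with a toy whole-word vocabulary. This is only a sketch under simplifying assumptions: `build_vocab` and `encode` are hypothetical helpers, whereas the DistilBERT tokenizer splits words into subword pieces and adds special tokens.

```python
def build_vocab(texts):
    """Assign an integer id to every distinct whitespace-separated word."""
    vocab = {"[PAD]": 0, "[UNK]": 1}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab, max_length=None):
    """Map words to ids, using [UNK] for unseen words, then truncate."""
    ids = [vocab.get(word, vocab["[UNK]"]) for word in text.split()]
    return ids if max_length is None else ids[:max_length]


vocab = build_vocab(["inflation remains elevated", "labor market remains strong"])
print(encode("inflation remains stubborn", vocab))
print(encode("labor market remains strong", vocab, max_length=2))
```

Words that are not in the vocabulary map to the `[UNK]` id, and `max_length` cuts the sequence off, which is the role `truncation=True` plays for long speeches.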

---

We use the **DistilBERT tokenizer** from Hugging Face’s `transformers` library.


In [None]:
# Load the tokenizer that matches the model. The uncased checkpoint lowercases its input by default.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
# Tokenize and truncate to the model's maximum input length (512 tokens for DistilBERT). Our text is very long!
def preprocess_function(examples):
    return tokenizer(examples["text_cleaned"], truncation=True)

In [None]:
tokenized_fed_speech_data = fed_speech_data.map(preprocess_function, batched=True)

In [None]:
# Save the outcome to disk to not run this again.
tokenized_fed_speech_data.save_to_disk("TokenizedData")

## **Applying Data Collation for Efficient Batching**  

When working with Transformer models, input sequences need to be **padded** to the same length within a batch. To handle this efficiently, we use a **data collator**.
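What padding within a batch means can be shown on plain lists. The sketch below assumes a padding ID of 0 and uses a hypothetical helper `pad_batch`; `DataCollatorWithPadding` does the equivalent on tensors using the tokenizer's own padding token.

```python
def pad_batch(sequences, pad_id=0):
    """Pad each sequence to the longest length in the batch and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep mask value 1; padding positions get 0
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy sequences of different lengths (the ids are arbitrary)
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch)
```

Padding to the longest sequence in each batch, rather than to a global maximum, avoids wasted computation on batches that contain only short texts.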

---


In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
# How many classes there are.
num_labels = len(merged_df["label"].unique())
print(f'There are {num_labels} classes in the dataset.')

Now that we have tokenized the text data, we need to define the **Transformer model** that will be used for classification.



In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=num_labels
    )

We define the **evaluation metric** to assess the performance of our model. We use **accuracy**, which measures the proportion of correctly classified samples.
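As a small worked example of this metric, the sketch below turns toy logits into predicted classes with an argmax and computes the share of correct predictions. The helper names and the numbers are made up for illustration.

```python
def argmax(row):
    """Return the index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Share of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy logits for four samples and two classes
toy_logits = [[2.0, -1.0], [0.1, 0.4], [-0.3, 1.2], [1.5, 0.2]]
toy_labels = [0, 1, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))
```

Here three of the four predictions match the labels, so the accuracy is 0.75.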


In [None]:
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

## **Defining Training Parameters**  

Now, we configure the **training arguments** that determine how our model will be trained using the Hugging Face `Trainer` API.
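When choosing the batch size, number of epochs, and logging frequency, it helps to estimate how many optimizer steps a run will take. The sketch below assumes a single device, no gradient accumulation, and a hypothetical training set of 1,000 examples; the real size depends on your split.

```python
import math


def optimizer_steps(n_examples, batch_size, epochs):
    """Return (steps per epoch, total steps), counting a final partial batch as a step."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# n_examples=1000 is an assumed value, not the size of the lab's dataset
per_epoch, total = optimizer_steps(n_examples=1000, batch_size=200, epochs=15)
print(per_epoch, total)
```

Under these assumed numbers the run has 75 steps in total, so a logging interval of 100 steps would never be reached during training; a smaller interval may be more informative in that situation.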

---

In [None]:
training_args = TrainingArguments(
    # Where to store the model.
    output_dir="ModelOutput",
    # Learning rate to use.
    learning_rate=1e-4,
    # Batch size to use per GPU in training.
    #per_device_train_batch_size=32,  ## T4
    per_device_train_batch_size=200,  ## T4
    # Batch size to use per GPU in evaluation.
    #per_device_eval_batch_size=32,   ## T4
    per_device_eval_batch_size=200,  ## T4
    # Number of epochs to train.
    num_train_epochs=15,
    # Weight decay coefficient.
    weight_decay=5e-3,
    # When to evaluate the model.
    eval_strategy="epoch",
    # When to save a checkpoint.
    save_strategy="epoch",
    # Load the best model after training? No, as we have no separate validation set.
    load_best_model_at_end=False,
    # Push to the Hugging Face Hub? (Account required)
    push_to_hub=False,
    # How often (in steps) to log training metrics.
    logging_steps=100,
)

We **set a fixed random seed** for reproducibility and initialize the Hugging Face `Trainer` for model training.

In [None]:
import gc
gc.collect()
torch.cuda.empty_cache()

In [None]:
# Set a fixed seed value
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
transformers.set_seed(SEED)

# Empty VRAM
torch.cuda.empty_cache()

# Create trainer object.
trainer = Trainer(
    # What model to use.
    model=model,
    # Arguments to the model
    args=training_args,
    # Training data
    train_dataset=tokenized_fed_speech_data["train"],
    # Test dataset
    eval_dataset=tokenized_fed_speech_data["test"],
    # How to pad sequences
    data_collator=data_collator,
    # Error function
    compute_metrics=compute_metrics,
)

Now that we have set up the **dataset, tokenizer, model, training arguments, and `Trainer`**, it's time to **train the model**!  

To enable experiment tracking, we will use **Weights & Biases (W&B)** for logging training metrics. Before training, you need to sign in to [wandb.ai](https://wandb.ai/home) to get an API key.



In [None]:
trainer.train()

We can see that the **accuracy is increasing** as training progresses, indicating that the model is learning from the data.  


After training, we need to **save the model** so we can reuse it for evaluation, inference, or further fine-tuning without retraining from scratch.



In [None]:
# Save the model to a folder
trainer.save_model('FEDSppechModel')

In [None]:
# Zip it
!zip -r DistilBert.zip FEDSppechModel

In [None]:
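# Calculate AUC over the test set.
predictions = trainer.predict(tokenized_fed_speech_data["test"])
preds = predictions.predictions  # Logits, shape (n_samples, 2)

# Score for class 1: the logit difference is monotonic in the softmax probability of class 1
scores = preds[:, 1] - preds[:, 0]

# Plot ROC Curve
fpr, tpr, threshold = roc_curve(tokenized_fed_speech_data["test"]["label"], scores)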
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('c3_ROC_Curve.pdf')
plt.show()


## **Testing the Model on Recent FED Speeches**  

Now that we have trained our model, let's apply it to the **recent Federal Reserve speeches** that we set aside earlier and see what it predicts for the following month's economic conditions.

---


In [None]:
text = nan_rows_df.iloc[0]['text_cleaned']
text

In [None]:
# Apply tokenizer and return pytorch tensors
inputs = tokenizer(text, return_tensors="pt", truncation=True)

In [None]:
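model.eval()  # Make sure dropout is disabled for inference
with torch.no_grad():
    # Move the inputs to the same device as the model (GPU if available)
    outputs = model(**inputs.to(model.device))
    logits = outputs.logits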

In [None]:
# Probabilities
probs = nn.functional.softmax(logits, dim=1).cpu().numpy()
print(probs)

# Class
print(f'The text is predicted to be of class {np.argmax(probs)}')

Our model predicts an economic downturn in January 2025 based on Federal Reserve speeches from December 2024.

And the CFNAI for January 2025 is [-0.03](https://fred.stlouisfed.org/series/CFNAI), which is classified as an economic downturn in our analysis!


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Copy the zipped model to Drive
!cp DistilBert.zip '/content/drive/MyDrive/Colab Notebooks/DL in Banking Book/DeepLearningInBankingBook/TextBook_Lab'