# **A Step-by-Step Case Study using RoBERTa**

As in the previous section, applying a RoBERTa model involves the following steps:

* RoBERTa Initialization: Initializes the RoBERTa tokenizer and model
* Data Preparation: Loads and preprocesses the dataset
* Batch Tokenization: Tokenizes abstracts in batches
* Embedding Generation: Generates embeddings using RoBERTa and saves them
* Topic Modeling: Applies BERTopic with RoBERTa embeddings
* Improve and Fine-Tune: Removes stop words and reruns the pipeline
* Visualization: Plots the resulting topics

This section integrates RoBERTa into the topic modeling pipeline by passing precomputed RoBERTa embeddings to BERTopic.

## **Dataset**

We will be using the same dataset, "Web_of_Science_Query May 07 2024_1-5000.csv".

In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv('Web_of_Science_Query May 07 2024_1-5000.csv', encoding='utf-8')
abstracts = df['Abstract'].dropna().tolist()  # Ensure no NaN values

# Ensure all elements are strings
abstracts = [str(abstract) for abstract in abstracts]

# Debug: Print the first few elements to check
print(abstracts[:5])

["Relational values have been proposed as a way of capturing more inclusively the relationships that people have with nature and have been adopted within the conceptual framework of the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES). Relational values literature has taken strides towards a more comprehensive appreciation of human-nature interactions than previous frameworks. However, we see an opportunity to build further on the relational values concept through the frame of political ontology. In this Perspective, we argue that, in order to understand people's relationships with their environments, we must first ask the following question: what is nature to those who value their relationships with it? Comprehending the multiple natures that people experience and value can help us to achieve equitable and representative conservation policy, explain actions and behaviours, and identify obstacles to engagement with conservation agendas.", 'The cu

## **Tokenize the Data**

Convert the abstracts into tokens that the RoBERTa model can process.

In [None]:
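from transformers import RobertaTokenizer, RobertaModel

# Load the RoBERTa tokenizer and model (both are reused in the embedding step)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

# Function to tokenize in batches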
def batch_tokenize(texts, batch_size=32):
    all_inputs = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=512)
        all_inputs.append(inputs)
    return all_inputs

# Tokenize abstracts in batches
batched_inputs = batch_tokenize(abstracts)
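
To see how the slicing in `batch_tokenize` splits the texts, the sketch below applies the same `range(0, len(texts), batch_size)` pattern to placeholder strings. The helper `make_batches` and the list `docs` are hypothetical names used only for illustration; no tokenizer is involved.

```python
def make_batches(texts, batch_size=32):
    """Split a list into consecutive slices of at most batch_size items."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

# 70 placeholder documents standing in for the abstracts
docs = [f"abstract {n}" for n in range(70)]

batches = make_batches(docs, batch_size=32)
sizes = [len(b) for b in batches]
print(sizes)  # [32, 32, 6]
```

The last batch is simply shorter. Since `padding=True` pads each batch only to its own longest sequence, batches can differ in sequence length; pooling each batch to one vector per abstract before concatenating sidesteps that mismatch.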

## **Embedding Generation**

The following code generates an embedding for each abstract, one batch of tokenized inputs at a time. More specifically:

   - `inputs`: This parameter represents a list of tokenized inputs. Each element in the list corresponds to a batch of tokenized input data.
   - `embeddings = []`: This initializes an empty list to store the embeddings generated for each batch.
   - Batch Processing: The function iterates through each batch of tokenized inputs provided in the `inputs` list. Within each iteration, a `with torch.no_grad():` block ensures that no gradients are calculated during the forward pass, reducing memory consumption and speeding up computations.
   - `outputs = model(**input)`: This line feeds the current batch of tokenized inputs (`input`) to the RoBERTa model (`model`) to obtain the model outputs.
   - `outputs.last_hidden_state`: The `outputs` object contains various attributes, including the last hidden states of all tokens in the input sequence. Here, `last_hidden_state` retrieves these hidden states.
   - `batch_embeddings = outputs.last_hidden_state.mean(dim=1)`: This computes the mean of the last hidden states along the sequence dimension (dimension 1), resulting in a single vector representation (embedding) for each input sequence in the batch.
   - `torch.cat(embeddings)`: Finally, all the embeddings generated for different batches are concatenated along the batch dimension (dimension 0) using PyTorch's `torch.cat()` function, resulting in a tensor containing embeddings for all input sequences.
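
Note that `outputs.last_hidden_state.mean(dim=1)` averages over every position in the sequence, including the padding positions added by `padding=True`. A common variant weights the average by the tokenizer's `attention_mask` so that padding is skipped. The NumPy sketch below illustrates only the arithmetic on toy values; `mean_pool`, `hidden`, and `mask` are hypothetical names, and the padded position is set to zero for readability (a real model produces non-zero values there).

```python
import numpy as np

def mean_pool(hidden, mask=None):
    """Average token vectors per sequence; with a mask, padded positions are skipped."""
    if mask is None:
        return hidden.mean(axis=1)
    weights = mask[:, :, None].astype(hidden.dtype)  # shape (batch, seq_len, 1)
    summed = (hidden * weights).sum(axis=1)          # sum over real tokens only
    counts = np.clip(weights.sum(axis=1), 1, None)   # guard against division by zero
    return summed / counts

# Toy batch: 2 sequences, 3 token positions, hidden size 2
hidden = np.array([
    [[1.0, 1.0], [3.0, 3.0], [0.0, 0.0]],  # third position is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],  # no padding
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

plain = mean_pool(hidden)         # padding counted in the average
masked = mean_pool(hidden, mask)  # padding ignored
```

For the first sequence the plain mean divides by 3 and the masked mean by 2, so the two results differ; for the unpadded second sequence they agree. Whether masking changes the resulting topics on this dataset is not something the original analysis tests.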

Please note that executing this step may take **a substantial amount of time** due to its computational complexity.

In [None]:
import torch

# Function to generate embeddings for each batch
def batch_embed(inputs):
    embeddings = []
    for input in inputs:
        with torch.no_grad():
            outputs = model(**input)
            batch_embeddings = outputs.last_hidden_state.mean(dim=1)
            embeddings.append(batch_embeddings)
    return torch.cat(embeddings)

# Generate embeddings
embeddings = batch_embed(batched_inputs)

In [None]:
import csv

# Define the file path to save the embeddings
output_file = "embeddings_roberta.csv"

# Convert embeddings tensor to a numpy array
embeddings_array = embeddings.numpy()

# Write the embeddings to a CSV file
with open(output_file, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for embedding_row in embeddings_array:
        writer.writerow(embedding_row)
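
Since the embeddings are reloaded from this file in the next step, it can be worth confirming that a CSV round trip preserves shape and values. The sketch below does this on a small random array written to a temporary file; `fake_embeddings` and the file name are stand-ins, and 768 columns are assumed because that is the hidden size of `roberta-base`.

```python
import csv
import os
import tempfile

import numpy as np

# Stand-in for the real embeddings: 4 rows, 768 columns
rng = np.random.default_rng(0)
fake_embeddings = rng.standard_normal((4, 768)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings_demo.csv")

    # Write one embedding per row, as in the cell above
    with open(path, "w", newline="") as csvfile:
        csv.writer(csvfile).writerows(fake_embeddings)

    # Read the file back into a 2-D array
    loaded = np.loadtxt(path, delimiter=",", dtype=np.float32, ndmin=2)

same_shape = loaded.shape == fake_embeddings.shape
max_diff = float(np.abs(loaded - fake_embeddings).max())
print(same_shape, max_diff)
```

CSV keeps the file human-readable at the cost of size; a binary format such as NumPy's `np.save` is one possible alternative if the file becomes unwieldy.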

## **Topic Modeling**

In [None]:
import pandas as pd
from bertopic import BERTopic

# Load the embeddings from the CSV file
df = pd.read_csv("embeddings_roberta.csv", header=None)
embeddings = df.values

# Create a BERTopic instance without specifying an embedding model
topic_model = BERTopic()

# Fit the topic model and get topics and probabilities
topics, probabilities = topic_model.fit_transform(abstracts, embeddings)
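
`fit_transform(abstracts, embeddings)` pairs the i-th abstract with the i-th embedding row, so the two must have the same length and order. Because the embeddings are reloaded from disk while `abstracts` stays in memory, a quick check before fitting can catch a mismatch early. The helper `check_alignment` below is a hypothetical sketch, shown on placeholder data.

```python
import numpy as np

def check_alignment(docs, embeddings):
    """Raise ValueError if documents and embedding rows do not line up."""
    embeddings = np.asarray(embeddings)
    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {embeddings.ndim}-D")
    if len(docs) != embeddings.shape[0]:
        raise ValueError(
            f"{len(docs)} documents but {embeddings.shape[0]} embedding rows"
        )
    return embeddings.shape

# Placeholder data: 3 documents and 3 embedding rows
docs = ["first abstract", "second abstract", "third abstract"]
vectors = np.zeros((3, 768))

shape = check_alignment(docs, vectors)
print(shape)  # (3, 768)
```

In the notebook, the equivalent call would be `check_alignment(abstracts, embeddings)` just before fitting the topic model.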

## **Visualizing, Analyzing and Comparing Results**

As before, we first look at the Intertopic Distance Map.

In [None]:
topic_model.visualize_topics()

In [None]:
topic_model.visualize_hierarchy()

In [None]:
topic_model.get_topic_info()
topic_model.get_topics()

{-1: [('the', 0.016398694432739763),
  ('of', 0.015303074928970822),
  ('and', 0.014990280288738476),
  ('to', 0.013863891105568005),
  ('in', 0.013608581324400078),
  ('sustainability', 0.013060366043390559),
  ('for', 0.010864397549493872),
  ('is', 0.010534467221796436),
  ('that', 0.01030211589397393),
  ('this', 0.010240751871837817)],
 0: [('to', 0.016054661251884678),
  ('and', 0.015972621206749284),
  ('of', 0.014064047210023032),
  ('in', 0.01321205394623058),
  ('the', 0.012768271892586619),
  ('sustainability', 0.012762639759949596),
  ('for', 0.012323279970694646),
  ('that', 0.012148143756493203),
  ('we', 0.011698040271766266),
  ('this', 0.010667035107748132)],
 1: [('to', 0.015497311010740382),
  ('the', 0.015394500448339342),
  ('of', 0.014717508899309029),
  ('and', 0.013975670921707595),
  ('sustainability', 0.0128251638735908),
  ('in', 0.012748943258486696),
  ('is', 0.012430867143424704),
  ('for', 0.012274819598135034),
  ('this', 0.010356200417007686),
  ('on', 

## **Improve and Fine-Tune**

The topic representations above are dominated by stop words such as "we," "the," "of," and "and." These words are frequent but carry little meaning, so they crowd out the terms that would distinguish one topic from another. To improve the results, we can pre-process the text as follows:

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import re

# Load the dataset again
df = pd.read_csv('Web_of_Science_Query May 07 2024_1-5000.csv')
abstracts = df['Abstract'].dropna().tolist()

# Define a pre-processing function
def preprocess(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    words = text.split()
    words = [word for word in words if word not in ENGLISH_STOP_WORDS]  # Remove stop words
    return ' '.join(words)

# Preprocess the abstracts
abstracts = [preprocess(abstract) for abstract in abstracts]
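
To make the effect of each step concrete, the sketch below runs the same sequence of operations on one sample sentence. It is a standalone illustration: `preprocess_demo`, `DEMO_STOP_WORDS`, and the sample text are hypothetical, and the tiny stop list stands in for scikit-learn's `ENGLISH_STOP_WORDS`.

```python
import re

# Tiny stand-in stop list; the notebook uses sklearn's ENGLISH_STOP_WORDS
DEMO_STOP_WORDS = {"the", "of", "and", "to", "in", "we", "a", "that"}

def preprocess_demo(text, stop_words=DEMO_STOP_WORDS):
    text = text.lower()                  # lowercase
    text = re.sub(r'\d+', '', text)      # drop digits
    text = re.sub(r'\s+', ' ', text)     # collapse whitespace
    text = re.sub(r'[^\w\s]', '', text)  # drop punctuation (hyphens included)
    return ' '.join(w for w in text.split() if w not in stop_words)

sample = "In 2024, we study the human-nature relationships of 35 cities."
cleaned = preprocess_demo(sample)
print(cleaned)  # study humannature relationships cities
```

Because punctuation is deleted rather than replaced, hyphenated terms are merged ("human-nature" becomes "humannature"). Substituting a space for punctuation is one possible variant if such terms should stay separate.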

Then we repeat the analysis on the pre-processed abstracts:

In [None]:
from transformers import RobertaTokenizer, RobertaModel
import torch

# Again, load RoBERTa model and tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

# Function to tokenize text in batches
def batch_tokenize(texts, batch_size=32):
    all_inputs = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=512)
        all_inputs.append(inputs)
    return all_inputs

# Function to generate embeddings for each batch
def batch_embed(inputs):
    embeddings = []
    for input in inputs:
        with torch.no_grad():
            outputs = model(**input)
            batch_embeddings = outputs.last_hidden_state.mean(dim=1)
            embeddings.append(batch_embeddings)
    return torch.cat(embeddings)

# Generate embeddings
batched_inputs = batch_tokenize(abstracts)
embeddings = batch_embed(batched_inputs)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

This warning is expected here. It concerns the pooler layer, whereas the embeddings in this pipeline are computed from `last_hidden_state`, which does not depend on the pooler weights.


In [None]:
# Save the updated embeddings
output_file = "embeddings_roberta_updated.csv"

# Convert this embeddings tensor to a numpy array
embeddings_array = embeddings.numpy()

# Write the new embeddings to a CSV file
with open(output_file, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for embedding_row in embeddings_array:
        writer.writerow(embedding_row)

In [None]:
df = pd.read_csv("embeddings_roberta_updated.csv", header=None)
embeddings = df.values

# Create a BERTopic instance without specifying an embedding model
topic_model = BERTopic()

# Fit the topic model and get topics and probabilities
topics, probabilities = topic_model.fit_transform(abstracts, embeddings)

In [None]:
topic_model.visualize_topics() # Visualize the topics

In [None]:
topic_info = topic_model.get_topic_info()
print("Optimized Topic Information:")
print(topic_info.head(10))  # Print the top 10 topics

Optimized Topic Information:
   Topic  Count                                               Name  \
0     -1   3056  -1_sustainability_study_sustainable_environmental   
1      0    263      0_sustainability_assessment_indicators_social   
2      1    237       1_transitions_sustainability_research_change   
3      2     81       2_research_sustainability_social_sustainable   
4      3     75            3_sustainability_study_paper_evaluation   
5      4     74    4_management_companies_sustainability_practices   
6      5     64             5_education_students_teachers_learning   
7      6     61             6_innovation_sustainability_firms_firm   
8      7     58                          7_urban_cities_city_space   
9      8     55  8_sustainability_sustainable_environmental_design   

                                      Representation  \
0  [sustainability, study, sustainable, environme...   
1  [sustainability, assessment, indicators, socia...   
2  [transitions, sustainability,
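
In the table above, Topic -1 is BERTopic's outlier group, and with 3056 documents it is far larger than any named topic. The share of outliers can be computed directly from the `topics` list returned by `fit_transform`. The sketch below shows the calculation on a short made-up list; `outlier_share` and `demo_topics` are hypothetical names.

```python
from collections import Counter

def outlier_share(topics):
    """Fraction of documents assigned to the outlier topic (-1)."""
    if not topics:
        return 0.0
    counts = Counter(topics)
    return counts.get(-1, 0) / len(topics)

# Made-up topic assignments for 8 documents
demo_topics = [-1, -1, 0, 1, -1, 0, 2, -1]

share = outlier_share(demo_topics)
print(f"{share:.0%} of documents are outliers")  # 50% of documents are outliers
```

In the notebook, `outlier_share(topics)` would give the corresponding figure for the fitted model.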

In [None]:
topic_model.visualize_barchart(top_n_topics=15)

In [None]:
topic_model.visualize_hierarchy()

In [None]:
topic_model.visualize_heatmap()