## Run on HPC at Imperial

### Set up the TensorFlow environment
Follow the "conda" section of the guide at https://icl-rcs-user-guide.readthedocs.io/en/latest/hpc/applications/guides/tensorflow/ to set up a TensorFlow environment that can use the GPUs on the HPC.

### Install required packages
After setting up the environment, run the following commands to install the packages the code depends on.
  -  module load anaconda3/personal
  -  source activate "your virtual env name from the setup"
  -  python3 -m pip install "tensorflow-text==2.15.*"
  -  python3 -m pip install "tf-models-official==2.15.*"
  -  python3 -m pip install bertopic
  -  python3 -m pip install xlrd
  -  python3 -m pip install umap-learn hdbscan
  -  python3 -m pip install nbformat
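
Before submitting a job, it can help to confirm that the packages installed above are importable from the activated environment. The sketch below is one way to do that. The helper name `missing_modules` is hypothetical, and the module names are the import names assumed to correspond to the pip packages (for example `umap` for `umap-learn` and `official` for `tf-models-official`).

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names assumed to match the pip packages installed above.
REQUIRED = ["tensorflow", "tensorflow_text", "official", "bertopic",
            "xlrd", "umap", "hdbscan", "nbformat"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("Missing modules:", missing if missing else "none")
```

An empty result only shows that the modules can be located; it does not check versions or GPU support.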

### Prepare data
Upload the data to the HPC. The data (five .xls files) can be downloaded from this GitHub repository.
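
Since the notebook reads the five files by name from the working directory, a quick check after uploading can catch a missing or misnamed file early. A minimal sketch, assuming the files sit in the directory the notebook runs from; `missing_files` is a hypothetical helper, and the file names are the ones used in the notebook's merge step.

```python
from pathlib import Path

# File names as they appear in the notebook's merge step.
EXPECTED_FILES = [
    "Web_of_Science_Search_1-1000 results.xls",
    "Web_of_Science_Search_1001-2000 results.xls",
    "Web_of_Science_Search_2001-3000 results.xls",
    "Web_of_Science_Search_3001-4000 results.xls",
    "Web_of_Science_Search_4001-5000 results.xls",
]

def missing_files(data_dir, names=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in names if not (data_dir / name).is_file()]

if __name__ == "__main__":
    print("Missing files:", missing_files("."))
```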


In [1]:
import os
import shutil # Import the shutil module for file operations
import pandas as pd
import numpy as np
import torch
import csv

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization  # to create AdamW optimizer

import matplotlib.pyplot as plt

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer


# Set TensorFlow logger level to ERROR to suppress unnecessary output
tf.get_logger().setLevel('ERROR')

2024-05-29 09:43:32.302386: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-29 09:43:32.302522: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-29 09:43:32.493936: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-29 09:43:33.041933: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
  from .autonotebook import tqdm as notebook_tqdm


In [2]:
# List of file names
file_names = [
    "Web_of_Science_Search_1-1000 results.xls",
    "Web_of_Science_Search_1001-2000 results.xls",
    "Web_of_Science_Search_2001-3000 results.xls",
    "Web_of_Science_Search_3001-4000 results.xls",
    "Web_of_Science_Search_4001-5000 results.xls"
]

# List to store dataframes
dfs = []

# Read each Excel file and select the desired columns
for file_name in file_names:
    df = pd.read_excel(file_name)
    df_selected = df[["Publication Type", "Authors", "Article Title", "Source Title", "Abstract", "Publication Year", "DOI"]]
    dfs.append(df_selected)

# Concatenate all dataframes
merged_df = pd.concat(dfs, ignore_index=True)

# Write the merged dataframe to a CSV file
merged_df.to_csv("Web_of_Science_Query May 07 2024_1-5000.csv", index=False)

print("Merged CSV file created successfully.")

Merged CSV file created successfully.


In [3]:
# Load the dataset
df = pd.read_csv("Web_of_Science_Query May 07 2024_1-5000.csv", encoding='utf-8')

# Preview the data
print(df.head())

  Publication Type                                            Authors  \
0                J                             Campbell, S; Gurney, L   
1                J                            Carstens, M; Preiser, R   
2                J  Manuel-Navarrete, D; DeLuca, S; Friso, F; Poli...   
3                J                    Carmen, E; Fazey, I; Friend, RM   
4                J  Griesberger, P; Kunz, F; Hacklaender, K; Matts...   

                                       Article Title            Source Title  \
0  What are we protecting? Rethinking relational ...   ECOSYSTEMS AND PEOPLE   
1  Exploring relationality in African knowledge s...   ECOSYSTEMS AND PEOPLE   
2  Ayahuasca ceremonies, relationality, and inner...   ECOSYSTEMS AND PEOPLE   
3  Community-based sustainability initiatives: th...  SUSTAINABILITY SCIENCE   
4  Building a decision-support tool to inform sus...                   AMBIO   

                                            Abstract  Publication Year  \
0  Rel

In [4]:
# Preprocess the data to handle null values
df['Abstract'] = df['Abstract'].fillna('')  # Replace null values with empty strings

# Create a BERTopic instance
topic_model = BERTopic(verbose=True)

# Fit the model on your dataset
docs = df['Abstract'].tolist()
topics, probs = topic_model.fit_transform(docs)

2024-05-29 09:47:31,686 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 157/157 [00:11<00:00, 13.62it/s]
2024-05-29 09:47:58,892 - BERTopic - Embedding - Completed ✓
2024-05-29 09:47:58,893 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-05-29 09:48:35,139 - BERTopic - Dimensionality - Completed ✓
2024-05-29 09:48:35,142 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-05-29 09:48:35,294 - BERTopic - Cluster - Completed ✓
2024-05-29 09:48:35,298 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-05-29 09:48:35,907 - BERTopic - Representation - Completed ✓


In [5]:
num_documents = len(docs)
print("Number of documents:", num_documents)

Number of documents: 5000
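
Because missing abstracts were replaced with empty strings, the 5000 documents may include empty ones, which give the embedding model no text to work with. A minimal sketch of filtering them out while remembering each document's original row follows; `non_empty_docs` is a hypothetical helper, not part of BERTopic.

```python
def non_empty_docs(docs):
    """Return (row_indices, texts) for documents that contain non-whitespace text."""
    kept = [(i, doc) for i, doc in enumerate(docs)
            if isinstance(doc, str) and doc.strip()]
    indices = [i for i, _ in kept]
    texts = [doc for _, doc in kept]
    return indices, texts

# Toy input: two real abstracts, one empty string, one whitespace-only string.
indices, texts = non_empty_docs(["Relational values ...", "", "   ", "Another abstract"])
print(indices)     # [0, 3]
print(len(texts))  # 2
```

Keeping the indices makes it possible to map topic assignments back to the original rows, for example with `df.iloc[indices]`.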


In [6]:
# Initialize a SentenceTransformer model with the 'all-MiniLM-L6-v2' variant for generating embeddings
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Initialize a BERTopic model with the specified SentenceTransformer embedding model and enable verbose mode for logging
topic_model = BERTopic(embedding_model=embedding_model, verbose=True)

# Encode the list of documents into embeddings using the initialized SentenceTransformer model,
# showing a progress bar during the encoding process
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Save the embeddings to a NumPy array file (.npy)
np.save('embeddings.npy', embeddings)

# Save the embeddings to a pickle file (.pkl).
# Pickle is Python's binary serialization format: it stores an object
# (a list, dictionary, array, or custom class) so that it can be reloaded later.
import pickle
with open('embeddings.pkl', 'wb') as file:
    pickle.dump(embeddings, file)

# Convert the embeddings to a pandas DataFrame and export it to CSV without the index column
embeddings_df = pd.DataFrame(embeddings)
embeddings_df.to_csv('embeddings.csv', index=False)

Batches: 100%|██████████| 157/157 [00:06<00:00, 25.58it/s]
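
Of the three formats, `.npy` is usually the simplest to reload, since it preserves the array's shape and dtype. The sketch below checks that round trip on a small random array standing in for the real embeddings. The shape `(4, 384)` is illustrative (384 is the output size of `all-MiniLM-L6-v2`), and `roundtrip_npy` is a hypothetical helper.

```python
import os
import tempfile

import numpy as np

def roundtrip_npy(array):
    """Save an array to a temporary .npy file and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "embeddings.npy")
        np.save(path, array)
        return np.load(path)

# Stand-in for real sentence embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.random((4, 384), dtype=np.float32)

restored = roundtrip_npy(fake_embeddings)
print(restored.shape, restored.dtype)             # (4, 384) float32
print(np.array_equal(restored, fake_embeddings))  # True
```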


In [7]:
topic_model.fit(docs)

2024-05-29 09:48:54,721 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 157/157 [00:05<00:00, 27.72it/s]
2024-05-29 09:49:00,624 - BERTopic - Embedding - Completed ✓
2024-05-29 09:49:00,625 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-05-29 09:49:12,756 - BERTopic - Dimensionality - Completed ✓
2024-05-29 09:49:12,757 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-05-29 09:49:12,918 - BERTopic - Cluster - Completed ✓
2024-05-29 09:49:12,921 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-05-29 09:49:13,441 - BERTopic - Representation - Completed ✓


<bertopic._bertopic.BERTopic at 0x14625d953610>
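
The log above shows BERTopic embedding the documents again, even though the previous cell already encoded them. BERTopic's `fit` and `fit_transform` accept precomputed embeddings as a second argument, so the encoding step can be skipped. A sketch, with hypothetical helpers `check_alignment` and `fit_with_embeddings` that first confirm there is one embedding row per document:

```python
import numpy as np

def check_alignment(docs, embeddings):
    """Raise ValueError unless there is exactly one embedding row per document."""
    embeddings = np.asarray(embeddings)
    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {embeddings.ndim} dimension(s)")
    if embeddings.shape[0] != len(docs):
        raise ValueError(
            f"{len(docs)} documents but {embeddings.shape[0]} embedding rows"
        )
    return embeddings

def fit_with_embeddings(topic_model, docs, embeddings):
    """Fit a BERTopic model on docs, reusing embeddings computed earlier."""
    embeddings = check_alignment(docs, embeddings)
    return topic_model.fit_transform(docs, embeddings)
```

In the notebook this would be called as `topics, probs = fit_with_embeddings(topic_model, docs, embeddings)`.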

In [8]:
topic_model.visualize_topics().write_html('topics_before_tuning.html', auto_open=True) # open the file in a separate tab in your web browser.

In [9]:
topic_model.visualize_barchart(top_n_topics=15).write_html('topics_barchart_before_tuning.html', auto_open=True)

In [10]:
topic_model.visualize_hierarchy(top_n_topics=100).write_html('topics_hierarchy_before_tuning.html', auto_open=True)

In [11]:
topic_model.visualize_heatmap(top_n_topics=100).write_html('topics_heatmap_before_tuning.html', auto_open=True)
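
On a headless HPC node there may be no browser for `auto_open=True` to launch. A sketch of an alternative that only writes the HTML files into one folder, so they can be downloaded and opened locally; `save_figure` and the `figures` directory name are hypothetical.

```python
from pathlib import Path

def save_figure(fig, name, out_dir="figures"):
    """Write a Plotly figure to out_dir/name as HTML without opening a browser."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    path = out_dir / name
    fig.write_html(str(path), auto_open=False)
    return path
```

For example, `save_figure(topic_model.visualize_topics(), "topics_before_tuning.html")` would write the intertopic distance map to `figures/topics_before_tuning.html`.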

In [12]:
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
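
The RoBERTa model is loaded on the CPU by default. Since the environment is set up for GPU use, one option is to move the model and each tokenized batch to the GPU when one is visible. A sketch, assuming PyTorch can see the GPU on the node; `pick_device` is a hypothetical helper.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print("Using device:", device)
```

In the notebook, this would be followed by `model.to(device)`, and each batch would be moved with `{k: v.to(device) for k, v in batch.items()}` before calling the model. Results need `.cpu()` before `.numpy()`.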


In [13]:
# Load dataset
df = pd.read_csv('Web_of_Science_Query May 07 2024_1-5000.csv', encoding='utf-8')
abstracts = df['Abstract'].dropna().tolist()  # Ensure no NaN values

# Ensure all elements are strings
abstracts = [str(abstract) for abstract in abstracts]

# Debug: Print the first few elements to check
print(abstracts[:5])

["Relational values have been proposed as a way of capturing more inclusively the relationships that people have with nature and have been adopted within the conceptual framework of the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES). Relational values literature has taken strides towards a more comprehensive appreciation of human-nature interactions than previous frameworks. However, we see an opportunity to build further on the relational values concept through the frame of political ontology. In this Perspective, we argue that, in order to understand people's relationships with their environments, we must first ask the following question: what is nature to those who value their relationships with it? Comprehending the multiple natures that people experience and value can help us to achieve equitable and representative conservation policy, explain actions and behaviours, and identify obstacles to engagement with conservation agendas.", 'The cu

In [14]:
# Function to tokenize in batches
def batch_tokenize(texts, batch_size=32):
    all_inputs = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=512)
        all_inputs.append(inputs)
    return all_inputs

# Tokenize abstracts in batches
batched_inputs = batch_tokenize(abstracts)

In [15]:
# Function to generate embeddings for each batch
def batch_embed(inputs):
    embeddings = []
    for batch in inputs:
        with torch.no_grad():  # inference only, so no gradients are needed
            outputs = model(**batch)
            # Mean-pool the token embeddings into one vector per abstract
            batch_embeddings = outputs.last_hidden_state.mean(dim=1)
            embeddings.append(batch_embeddings)
    return torch.cat(embeddings)

# Generate embeddings
embeddings = batch_embed(batched_inputs)
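
`last_hidden_state.mean(dim=1)` averages over every token position, including the padding added by `padding=True`, so shorter abstracts in a batch are diluted by padding vectors. A common alternative weights the average by the attention mask. The sketch below shows the arithmetic in NumPy on a toy batch; `masked_mean` is a hypothetical helper, and in the notebook the same computation would be applied to `outputs.last_hidden_state` and the batch's `attention_mask`.

```python
import numpy as np

def masked_mean(hidden, mask):
    """Average token vectors over real tokens only.

    hidden: (batch, seq_len, dim) token embeddings
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    mask = np.asarray(mask, dtype=hidden.dtype)[:, :, None]  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)                     # padding contributes 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)           # avoid division by zero
    return summed / counts

# One sequence with two real tokens and one padding position.
hidden = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(masked_mean(hidden, mask))  # [[2. 2.]]
print(hidden.mean(axis=1))        # the plain mean is pulled towards the padding vector
```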

In [16]:
# Define the file path to save the embeddings
output_file = "embeddings_roberta.csv"

# Convert embeddings tensor to a numpy array
embeddings_array = embeddings.numpy()

# Write the embeddings to a CSV file
with open(output_file, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for embedding_row in embeddings_array:
        writer.writerow(embedding_row)

In [17]:
# Load the embeddings from the CSV file
df = pd.read_csv("embeddings_roberta.csv", header=None)
embeddings = df.values

# Create a BERTopic instance without specifying an embedding model
topic_model = BERTopic()

# Fit the topic model and get topics and probabilities
topics, probabilities = topic_model.fit_transform(abstracts, embeddings)

In [24]:
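# Pass the precomputed RoBERTa embeddings again; without them BERTopic
# re-embeds the abstracts with its default sentence-transformer model
topic_model.fit(abstracts, embeddings)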

<bertopic._bertopic.BERTopic at 0x148cbe15d090>

In [28]:
topic_model.visualize_topics().write_html('abstracts.html', auto_open=True)

In [29]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import re

# Load the dataset again
df = pd.read_csv('Web_of_Science_Query May 07 2024_1-5000.csv')
abstracts = df['Abstract'].dropna().tolist()

# Define a pre-processing function
def preprocess(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    words = text.split()
    words = [word for word in words if word not in ENGLISH_STOP_WORDS]  # Remove stop words
    return ' '.join(words)

# Preprocess the abstracts
abstracts = [preprocess(abstract) for abstract in abstracts]
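
One side effect of deleting punctuation outright is that hyphenated terms are fused: `human-nature` becomes `humannature`. If that is not wanted, punctuation can be replaced with a space instead. A sketch of that variant follows; `preprocess_spaced` is hypothetical and uses a tiny illustrative stop-word set in place of scikit-learn's `ENGLISH_STOP_WORDS` so that it runs on its own.

```python
import re

# Tiny illustrative stop-word set; the notebook uses sklearn's ENGLISH_STOP_WORDS.
STOP_WORDS = {"the", "of", "in", "and", "a", "to"}

def preprocess_spaced(text, stop_words=STOP_WORDS):
    """Lowercase, drop digits, turn punctuation into spaces, remove stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)      # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    words = [w for w in text.split() if w not in stop_words]
    return " ".join(words)

print(preprocess_spaced("Human-nature interactions in the 2024 framework."))
# human nature interactions framework
```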

In [30]:
from transformers import RobertaTokenizer, RobertaModel
import torch

# Again, load RoBERTa model and tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

# Function to tokenize text in batches
def batch_tokenize(texts, batch_size=32):
    all_inputs = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=512)
        all_inputs.append(inputs)
    return all_inputs

# Function to generate embeddings for each batch
def batch_embed(inputs):
    embeddings = []
    for batch in inputs:
        with torch.no_grad():  # inference only, so no gradients are needed
            outputs = model(**batch)
            # Mean-pool the token embeddings into one vector per abstract
            batch_embeddings = outputs.last_hidden_state.mean(dim=1)
            embeddings.append(batch_embeddings)
    return torch.cat(embeddings)

# Generate embeddings
batched_inputs = batch_tokenize(abstracts)
embeddings = batch_embed(batched_inputs)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [31]:
# Save the updated embeddings
output_file = "embeddings_roberta_updated.csv"

# Convert this embeddings tensor to a numpy array
embeddings_array = embeddings.numpy()

# Write the new embeddings to a CSV file
with open(output_file, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for embedding_row in embeddings_array:
        writer.writerow(embedding_row)

In [32]:
df = pd.read_csv("embeddings_roberta_updated.csv", header=None)
embeddings = df.values

# Create a BERTopic instance without specifying an embedding model
topic_model = BERTopic()

# Fit the topic model and get topics and probabilities
topics, probabilities = topic_model.fit_transform(abstracts, embeddings)

In [36]:
topic_model.visualize_topics().write_html('topics_update.html', auto_open=True) # Visualize the topics
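
Besides the visualization, the `topics` list returned by `fit_transform` can be summarized directly. BERTopic labels documents it could not assign to any cluster with topic `-1`, so counting the labels gives the topic sizes and the outlier share. A sketch on a made-up label list; `summarize_topics` is a hypothetical helper.

```python
from collections import Counter

def summarize_topics(topics):
    """Return (sizes, outlier_share) for a list of BERTopic topic labels."""
    sizes = Counter(topics)
    outlier_share = sizes.get(-1, 0) / len(topics) if topics else 0.0
    return sizes, outlier_share

# Made-up labels standing in for the `topics` returned by fit_transform.
example_topics = [0, 1, -1, 0, 2, -1, 0, 1]
sizes, outlier_share = summarize_topics(example_topics)
print(sizes.most_common(1))  # [(0, 3)]
print(outlier_share)         # 0.25
```

Comparing the outlier share before and after preprocessing is one simple way to see whether the updated embeddings cluster more cleanly.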