# **Analyzing and Generating Scientific Abstracts**

## **Introduction**

This project explores how Large Language Models (LLMs) can be applied to the analysis and generation of scientific abstracts, using data from the arXiv repository. With the growing volume of scientific publications, LLMs offer the potential to automate abstract and title generation, saving time for researchers. Key tasks include generating titles from abstracts, generating abstracts from titles, and predicting paper categories. The dataset, available on HuggingFace, allows us to test the model's performance and refine it using cross-validation and optimization techniques.

## **Use Case**

This project aims to help researchers and academics generate abstracts and titles for their papers. It may be useful to researchers who want a quick first draft of an abstract and to students who are learning how abstracts are written. The model can also predict paper categories, which helps researchers quickly identify papers relevant to their work.

## **Scope**

This project explores some of the capabilities of NLP models in generating scientific abstracts and titles, using the arXiv dataset. We aim to provide a tool capable of:

- Generating abstracts from titles;
- Generating titles from abstracts;
- Classifying papers into categories based on their abstracts and/or titles;
- Optimizing the model's performance to improve the generated titles, abstracts and categories.

## **Objectives**

The main objectives of this project are:

- Develop a tool capable of generating abstracts and titles using LLMs;
- Develop a model capable of predicting paper categories based on their abstracts and/or titles;
- Apply cross-validation to evaluate the model's accuracy, recall and F1-score;
- Improve the model's performance using, for example, existing pre-trained models (e.g., BERT, GPT) and fine-tuning them on the arXiv dataset;
- Experiment with summarization models available on HuggingFace and compare their performance;
- Analyze the arXiv category structure, propose a hierarchical taxonomy, and implement classification models to assess performance across different levels of this taxonomy.
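
As a rough illustration of the taxonomy objective, arXiv category identifiers such as `cs.LG` already encode two levels: an archive prefix and a subject class. The sketch below uses hypothetical helpers (`split_category`, `build_taxonomy`) and a handful of example labels to show one way of deriving parent-level groups. It is an assumption about how a hierarchy could be bootstrapped, not the taxonomy the project ultimately proposes.

```python
from collections import defaultdict


def split_category(category):
    """Split an arXiv category such as 'cs.LG' into (archive, subject).

    Categories without a dot (e.g. 'hep-th') have no subject class,
    so the subject is returned as None.
    """
    archive, _, subject = category.partition(".")
    return archive, (subject or None)


def build_taxonomy(categories):
    """Group full category names under their top-level archive."""
    taxonomy = defaultdict(set)
    for category in categories:
        archive, _ = split_category(category)
        taxonomy[archive].add(category)
    return {archive: sorted(children) for archive, children in taxonomy.items()}


# Illustrative labels only; the real list would come from the dataset
example_categories = ["cs.LG", "cs.CL", "math.CO", "hep-th", "stat.ML"]
taxonomy = build_taxonomy(example_categories)
print(taxonomy)
```

A classifier for the upper level could then be trained on the archive labels, and one for the lower level on the full category names.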

## **Tasks**

The tasks involved in this project include:

1. **Dataset Analysis**: Analyze the arXiv dataset to understand its structure, metadata, and the distribution of categories. Identify the most relevant features for title generation, abstract generation, and category prediction.

2. **Data Preparation**: Preprocess the data for model training and evaluation (e.g., handling missing data, tokenization, train-test split). Create proper datasets for each task (title generation, abstract generation, and classification).

3. **Baseline Model Development**: Develop baseline models for generating abstracts and titles and for predicting paper categories using basic architectures like Seq2Seq or LSTM. Evaluate the models’ performance using standard metrics (e.g., BLEU, ROUGE, F1).

4. **Fine-tuning Pre-trained Models**: Use pre-trained models (e.g., BERT, GPT, T5) and fine-tune them on the arXiv dataset for the tasks of abstract generation, title generation, and category prediction. Evaluate improvements over baseline models.

5. **Model Optimization**: Optimize the models through hyperparameter tuning (e.g., learning rate, optimizer choice, number of layers) and by using techniques like early stopping or data augmentation. Consider experimenting with a wider subset of the arXiv dataset.

6. **Parameter Tuning for Generation Tasks**: Play with different parameter configurations (e.g., temperature, top-k, top-p) in the title and abstract generation tasks. Compare results to determine optimal settings for quality output.

7. **Experiment with Summarization Models**: Test and compare various summarization models from the HuggingFace library (e.g., BART, T5) for abstract generation. Evaluate their performance against metrics like ROUGE and BLEU.

8. **Category Prediction and Taxonomy Analysis**: Analyze existing arXiv categories and develop a hierarchical taxonomy. Train models for both flat classification and hierarchical classification. Evaluate performance using metrics like F1-score at different levels of the taxonomy.

9. **Cross-Validation and Model Evaluation**: Apply cross-validation to evaluate model robustness. Focus on metrics like accuracy, precision, recall, and F1-score to assess classification models, and BLEU/ROUGE for generation tasks.
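
To make the generation parameters named in task 6 more concrete, the following is a simplified sketch of how temperature, top-k and top-p reshape a next-token distribution. The function names and the toy logits are hypothetical, and the procedure is a plain-Python approximation rather than the exact implementation of any particular library.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Keep the top_k most likely tokens and/or the smallest prefix whose
    cumulative probability reaches top_p, then renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept
    total = sum(probs[i] for i in ranked)
    return {i: probs[i] / total for i in ranked}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
probs = softmax_with_temperature(toy_logits, temperature=0.7)
candidates = filter_top_k_top_p(probs, top_k=3, top_p=0.9)
print(candidates)
```

Sampling the next token from `candidates` instead of `probs` trades diversity for coherence, which is the trade-off the parameter comparison is meant to explore.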

## **Data Analysis**

The dataset contains 1999486 rows and 10 columns. The columns are as follows:

- id: Unique identifier of the arXiv paper;
- submitter: Name of the user who submitted the paper;
- authors: Authors of the paper;
- title: Title of the paper;
- comments: Additional comments added to the paper;
- journal-ref: Identifier of the journal where the paper was submitted;
- doi: Persistent address for the paper;
- abstract: Abstract of the paper;
- report-no: Unique identifier of the paper within the organization;
- categories: Categories associated with the paper;
- versions: Version of the paper.

To analyse the most relevant attributes and how they interact with each other, we use the **pandas** and **ydata_profiling** libraries, which provide a simple way to manipulate and profile datasets.

In [None]:
import pandas as pd

# Load the data
db = pd.read_csv("abstracts_trimmed.csv")

# Remove all columns except the abstract, title and categories
db = db[["abstract", "title", "categories"]]

In [None]:
from ydata_profiling import ProfileReport

# Generate the Report
profile = ProfileReport(db, title="arXiv Abstracts Profile")

In [None]:
profile

# **Predicting Paper Categories**

First we prepare the data for the classification task, converting the categories column into multiple binary columns, one for each category, and format the data.
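
To make the encoding concrete, here is a minimal pure-Python sketch of what "one binary column per category" means for a few toy rows. The helper `multi_hot_encode` and the example strings are hypothetical; the notebook itself relies on pandas for this step.

```python
def multi_hot_encode(category_strings):
    """Turn space-separated category strings into binary indicator rows."""
    # The sorted vocabulary fixes the column order
    vocabulary = sorted({c for s in category_strings for c in s.split()})
    rows = []
    for s in category_strings:
        present = set(s.split())
        rows.append([1 if c in present else 0 for c in vocabulary])
    return vocabulary, rows


toy_categories = ["cs.LG stat.ML", "math.CO", "cs.LG"]  # illustrative values
vocabulary, rows = multi_hot_encode(toy_categories)
for row in rows:
    print(dict(zip(vocabulary, row)))
```

A paper with several categories gets several ones in its row, which is why the task is treated as multi-label classification later on.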

Here we developed 2 different data preparation methods:

- One with all categories;
- One with the top N categories in order to reduce the number of classes and check if the model's performance improves.

The first one is the following:

In [None]:
import pandas as pd

# Load the data
db = pd.read_csv("abstracts_trimmed.csv")

# Remove all columns except the abstract, title and categories
db = db[["abstract", "title", "categories"]]

In [None]:
# Clean up characters like "[", "]", "'" from the "categories" column
db["categories"] = db["categories"].str.replace(r"[\[\]']", "", regex=True)

# Separate the "categories" column into binary columns
db = db.join(db["categories"].str.get_dummies(" "))

# Remove the "categories" column
db = db.drop(columns=["categories"])

Select a small sample of the data; otherwise the training process will be too slow.

In [None]:
# Use the last X rows as a held-out prediction set
# (the dataset needs at least 2 * X rows for the two samples not to overlap)
X = 100
db2 = db.tail(X)

# Save the prediction set (db2, not the full db)
db2.to_csv("predict_dataset.csv", index=False)

# Use the first X rows as the training set to speed up the process
db = db.head(X)

# Save the training set
db.to_csv("abstracts_cleaned.csv", index=False)

The second one is the following:

In [None]:
import pandas as pd

# Load the data
db = pd.read_csv("/content/sample_data/abstracts_trimmed.csv")

# Remove all columns except the abstract, title and categories
db = db[["abstract", "title", "categories"]]

In [None]:
import pandas as pd
from collections import Counter

# Clean up characters like "[", "]", "'" from the "categories" column
db["categories"] = db["categories"].str.replace(r"[\[\]']", "", regex=True)

# Determine the top 20 categories in the dataset
all_categories = db["categories"].str.cat(sep=' ').split()
category_counts = Counter(all_categories)
top_20_categories = [category for category, _ in category_counts.most_common(20)]
top_20_set = set(top_20_categories)

# Split each row's categories once so that matching is exact
# (str.contains would treat "." as a regex wildcard and match substrings)
category_lists = db["categories"].str.split()

# Keep only rows whose categories all belong to the top 20
only_top_20 = category_lists.apply(lambda cats: all(cat in top_20_set for cat in cats))

# Collect the sampled rows for each category
sampled_frames = []

# Loop through each top category and sample up to 30 rows
for category in top_20_categories:
    # Filter rows that contain the current category and only top 20 categories
    has_category = category_lists.apply(lambda cats: category in cats)
    filtered_rows = db[has_category & only_top_20]

    # Sample 30 rows if available; otherwise take all of them
    if len(filtered_rows) > 30:
        sampled_rows = filtered_rows.sample(n=30, random_state=42)
    else:
        sampled_rows = filtered_rows

    sampled_frames.append(sampled_rows)

# Combine the samples; a paper with several top categories can be drawn more
# than once, so drop duplicates to avoid leaking rows across the later split
normalized_dataset = pd.concat(sampled_frames, ignore_index=True)
normalized_dataset = normalized_dataset.drop_duplicates().reset_index(drop=True)

# Separate the "categories" column into binary columns after normalizing
normalized_dataset = normalized_dataset.join(normalized_dataset["categories"].str.get_dummies(" "))

# Remove the "categories" column
normalized_dataset = normalized_dataset.drop(columns=["categories"])

In [None]:
# Identify the binary category columns (everything except the text columns)
category_columns_normalized = [col for col in normalized_dataset.columns if col not in ("abstract", "title")]

# Count occurrences of each category in the normalized dataset
normalized_category_counts = normalized_dataset[category_columns_normalized].sum()
print(normalized_category_counts)

# Print the shape of the normalized dataset
print("Normalized Dataset Shape:", normalized_dataset.shape)

In [None]:
# Randomly shuffle the dataset before splitting
normalized_dataset = normalized_dataset.sample(frac=1, random_state=42).reset_index(drop=True)

# Split the dataset into 90% and 10%
train_size = int(0.9 * len(normalized_dataset))
train_dataset = normalized_dataset.iloc[:train_size]
predict_dataset = normalized_dataset.iloc[train_size:]

# Save both datasets to CSV files
train_dataset.to_csv('abstracts_cleaned.csv', index=False)
predict_dataset.to_csv('predict_dataset.csv', index=False)

# Print the shape of the datasets
print("Training Dataset Shape:", train_dataset.shape)
print("Prediction Dataset Shape:", predict_dataset.shape)
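
The objectives mention cross-validation, while the cell above produces a single 90/10 split. As a minimal sketch of how k folds could be derived from row positions, assuming the rows are already shuffled as above, the hypothetical helper `k_fold_indices` below partitions the indices so that each row is used for validation exactly once. It is illustrative only and does not stratify by category; scikit-learn's `KFold` offers the same idea as a library utility.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for fold_number, (train, validation) in enumerate(folds):
    print(f"Fold {fold_number}: train={train} validation={validation}")
```

Each pair of index lists could be passed to `normalized_dataset.iloc[...]` to build the training and validation frames for one fold, with the metrics averaged across folds.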

## Fine-tuning a pre-trained model (BERT) to predict paper categories based on their abstracts and/or titles

In [None]:
# Libraries to install
!pip install datasets
!pip install evaluate

### Preparing the dataset to fine-tune the model

To simplify fine-tuning, we compacted all the target columns (categories), which hold binary values, into a single column. This column is named "labels" and contains a list of binary values, one per category (1 if the paper belongs to that category and 0 otherwise).

Additionally, since the model is supposed to predict the categories from the abstract and/or the title, we replaced the "title" and "abstract" columns with a single column named "input_text". Each original row is expanded into 3 rows that share the same "labels" value but differ in "input_text": one with the title and abstract concatenated, one with the title only, and one with the abstract only. Markers identify which text is the title and which is the abstract. This way, one row is expanded from something like the following:

| labels | title | abstract |
| --- | --- | --- |
| [0, 0, ..., 1, 0] | Title of the paper | Abstract of the paper |
| ... | ... | ... |

Into something like:

| labels | input_text |
| --- | --- |
| [0, 0, ..., 1, 0] | [TITLE] Title of the paper [ABSTRACT] Abstract of the paper |
| [0, 0, ..., 1, 0] | [TITLE] Title of the paper |
| [0, 0, ..., 1, 0] | [ABSTRACT] Abstract of the paper |
| ... | ... |

This way, we can use the "input_text" column as the input for the model and the "labels" column as the target, facilitating the fine-tuning process.

In [None]:
from datasets import load_dataset, DatasetDict

# First load the entire dataset
dataset = load_dataset("csv", data_files="abstracts_cleaned.csv")

# Define the category columns based on the columns in your dataset
category_columns = [col for col in dataset['train'].column_names if col not in ["title", "abstract"]]

# Create the labels column by mapping each category column into a list of binary labels
def create_labels(row):
    return [float(row[col]) for col in category_columns]

# Function to create the 3 required input text variations
def expand_rows(batch):
    titles = batch["title"]
    abstracts = batch["abstract"]
    labels_list = batch["labels"]

    input_texts = []
    labels = []

    for title, abstract, label in zip(titles, abstracts, labels_list):
        input_texts.extend([
            f"[TITLE] {title} [ABSTRACT] {abstract}",
            f"[TITLE] {title}",
            f"[ABSTRACT] {abstract}"
        ])
        # Duplicate the label for each variation
        labels.extend([label] * 3)

    return {"input_text": input_texts, "labels": labels}

# Apply the label creation and remove original category columns
dataset = dataset.map(lambda row: {'labels': create_labels(row)})
dataset = dataset.remove_columns(category_columns)

# Create the train-validation split from the original dataset
splits = dataset["train"].train_test_split(test_size=0.2, seed=42)

# Expand rows to have 3 rows for each original row for the training set
expanded_train_dataset = splits['train'].map(expand_rows, batched=True, remove_columns=["title", "abstract"])

# Expand rows to have 3 rows for each original row for the validation set
expanded_val_dataset = splits['test'].map(expand_rows, batched=True, remove_columns=["title", "abstract"])

# Create a new DatasetDict with the desired split names
dataset = DatasetDict({
    'train': expanded_train_dataset,
    'validation': expanded_val_dataset
})

# Access the train and validation sets
train_dataset = dataset['train']
val_dataset = dataset['validation']

# Check the dataset structure
print(dataset)

# Print some basic information about the splits
print("\nTrain dataset size:", len(train_dataset))
print("Validation dataset size:", len(val_dataset))

# Print the first example from each split to verify the data
print("\nFirst example from train split:")
print(train_dataset[0])
print(train_dataset[1])
print(train_dataset[2])

print("\nFirst example from validation split:")
print(val_dataset[0])
print(val_dataset[1])
print(val_dataset[2])

Use the BERT tokenizer to tokenize the input text and prepare the dataset for fine-tuning.

Note that we fine-tune two different BERT models of different sizes, so each model must be paired with its own tokenizer: run only the tokenizer cell that matches the model you intend to train.

In [None]:
# Tokenizer for the distilbert-base-uncased pre-trained model
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


def tokenize_function(examples):

    return tokenizer(examples["input_text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Check the dataset structure
print(tokenized_datasets)

# Print the first example from each tokenized split to verify the data
print("\nFirst example from train split:")
print(tokenized_datasets['train'][0])

print("\nFirst example from validation split:")
print(tokenized_datasets['validation'][0])

In [None]:
# Tokenizer for the TinyBERT pre-trained model
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")

def tokenize_function(examples):

    return tokenizer(examples["input_text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Check the dataset structure
print(tokenized_datasets)

# Print the first example from each tokenized split to verify the data
print("\nFirst example from train split:")
print(tokenized_datasets['train'][0])

print("\nFirst example from validation split:")
print(tokenized_datasets['validation'][0])

In [None]:
# Clean up the tokenized datasets by removing the original input_text column
tokenized_datasets = tokenized_datasets.remove_columns("input_text")
tokenized_datasets = tokenized_datasets.with_format("torch")

# Access the train and validation sets
print(tokenized_datasets["train"])
print(tokenized_datasets["validation"])

### Train the model

To train the model we used 2 different strategies:

- Train the model with the default loss, which for `problem_type="multi_label_classification"` is an unweighted binary cross-entropy;
- Train the model with a weighted Binary Cross Entropy (BCE) loss.

We also used both the distilbert-base-uncased and huawei-noah/TinyBERT_General_4L_312D pre-trained models.

The two model-loading cells and the two training cells follow; run one of each.

In [None]:
# Using the distilbert-base-uncased pre-trained model
from transformers import AutoModelForSequenceClassification

num_labels = len(category_columns)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=num_labels,
    problem_type="multi_label_classification"
)

In [None]:
# Using the TinyBERT pre-trained model
from transformers import AutoModelForSequenceClassification

num_labels = len(category_columns)
model = AutoModelForSequenceClassification.from_pretrained(
    "huawei-noah/TinyBERT_General_4L_312D",
    num_labels=num_labels,
    problem_type="multi_label_classification"
)

Using the default (unweighted) cross-entropy loss:

In [None]:
import numpy as np
import torch
from sklearn.metrics import accuracy_score
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
    learning_rate=1e-3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=4,
    weight_decay=0.01
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Convert logits to probabilities and apply a fixed decision threshold
    sigmoid_logits = torch.sigmoid(torch.tensor(logits))
    threshold = 0.3
    predictions = (sigmoid_logits > threshold).numpy().astype(np.int32)
    labels = labels.astype(np.int32)

    # Per-sample accuracy: fraction of label positions predicted correctly
    accuracies = [accuracy_score(label, pred) for pred, label in zip(predictions, labels)]
    return {"accuracy": np.mean(accuracies)}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()
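
To illustrate what `compute_metrics` does with the raw model outputs, the following pure-Python sketch applies a sigmoid and a decision threshold to a made-up logit vector and then measures the fraction of label positions that match the target. The helper names and the toy values are hypothetical and only mirror the logic of the cell above.

```python
import math


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold):
    """Mark every label whose sigmoid probability exceeds the threshold."""
    return [1 if sigmoid(z) > threshold else 0 for z in logits]


def per_sample_accuracy(predicted, expected):
    """Fraction of label positions where prediction and target agree."""
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


toy_logits = [2.0, -0.5, -3.0, 0.1]  # made-up outputs for four categories
toy_target = [1, 0, 0, 0]
for threshold in (0.3, 0.5):
    predicted = predict_labels(toy_logits, threshold)
    print(threshold, predicted, per_sample_accuracy(predicted, toy_target))
```

A lower threshold marks more categories as present, which raises recall at the cost of more false positives.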

Using weighted Binary Cross Entropy (BCE) as loss function:

In [None]:
import torch
import numpy as np
from transformers import TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, f1_score

# Calculate optimizer steps per epoch (assumes training on a single device)
total_samples = len(train_dataset)
batch_size = 8
gradient_accumulation_steps = 4
# One optimizer step consumes batch_size * gradient_accumulation_steps samples
steps_per_epoch = max(1, total_samples // (batch_size * gradient_accumulation_steps))
warmup_steps = steps_per_epoch  # One epoch of warmup

training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
    save_strategy="epoch",
    learning_rate=5e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=20,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    warmup_steps=warmup_steps,
    gradient_accumulation_steps=gradient_accumulation_steps,
    fp16=False,
    gradient_checkpointing=False,
    save_total_limit=2,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    labels = labels.astype(np.int32)

    # Apply sigmoid to get probabilities
    probs = 1 / (1 + np.exp(-logits))

    # Lower thresholds for small dataset
    thresholds = [0.1, 0.15, 0.2, 0.25, 0.3]
    best_f1 = -1.0  # Ensures the first threshold is always recorded
    best_threshold = thresholds[0]
    best_predictions = None

    # Find the best threshold
    for threshold in thresholds:
        predictions = (probs > threshold).astype(np.int32)
        f1 = f1_score(labels, predictions, average='macro', zero_division=1)

        if f1 > best_f1:
            best_f1 = f1
            best_threshold = threshold
            best_predictions = predictions

    # Calculate metrics using the best threshold
    predictions = best_predictions
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='macro', zero_division=1
    )
    # For multi-label input this is subset accuracy (all labels must match)
    accuracy = accuracy_score(labels, predictions)

    # Count positive predictions and actual positives
    pred_pos = predictions.sum()
    actual_pos = labels.sum()

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'best_threshold': best_threshold,
        'predicted_positives': int(pred_pos),
        'actual_positives': int(actual_pos)
    }

class MultilabelTrainer(Trainer):
    # **kwargs absorbs extra arguments (e.g. num_items_in_batch) that newer
    # transformers releases may pass to compute_loss
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits

        # Weighted BCE loss with higher weight for positive examples
        pos_weight = torch.ones_like(labels[0]).float() * 3.0
        loss_fct = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
        loss = loss_fct(logits.view(-1, logits.shape[-1]),
                        labels.float().view(-1, labels.shape[-1]))

        return (loss, outputs) if return_outputs else loss

trainer = MultilabelTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
)

print(f"Training with {total_samples} samples")
print(f"Steps per epoch: {steps_per_epoch}")
print(f"Warmup steps: {warmup_steps}")
print(f"Total steps: {steps_per_epoch * training_args.num_train_epochs}")

trainer.train()
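
To show what the positive-class weight in the weighted BCE loss changes, here is a small worked example in plain Python. It evaluates the per-label loss `-(w * y * log(p) + (1 - y) * log(1 - p))` with `p = sigmoid(logit)` and averages over labels, which is the formula `BCEWithLogitsLoss` documents for `pos_weight`. The function name and the toy numbers are hypothetical.

```python
import math


def weighted_bce_with_logits(logits, targets, pos_weight):
    """Mean binary cross-entropy over labels, positives scaled by pos_weight."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid probability
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)


toy_logits = [-1.0, -2.0, 0.5]  # the model is unsure about the one true label
toy_targets = [1, 0, 0]
plain = weighted_bce_with_logits(toy_logits, toy_targets, pos_weight=1.0)
weighted = weighted_bce_with_logits(toy_logits, toy_targets, pos_weight=3.0)
print(f"unweighted: {plain:.4f}, pos_weight=3: {weighted:.4f}")
```

Only the term for the positive label is scaled, so a missed category costs three times as much as before. With many categories and few positives per paper, this pushes the model away from predicting all zeros.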

### Test the model by predicting the categories of new papers

In [None]:
# Load the prediction dataset from the csv
prediction_dataset = load_dataset("csv", data_files="predict_dataset.csv")

# Apply the label creation and remove original category columns
prediction_dataset = prediction_dataset.map(lambda row: {'labels': create_labels(row)})
prediction_dataset = prediction_dataset.remove_columns(category_columns)

# Duplicate the dataset for each input text variation
prediction_dataset = prediction_dataset.map(expand_rows, batched=True, remove_columns=["title", "abstract"])

# Tokenize the prediction dataset
tokenized_prediction_dataset = prediction_dataset.map(tokenize_function, batched=True)

# Clean up the tokenized prediction dataset by removing the original input_text column
tokenized_prediction_dataset = tokenized_prediction_dataset.remove_columns("input_text")

# Access the prediction set
print("Prediction dataset:\n")
print(tokenized_prediction_dataset)

# Print the first example from the prediction set to verify the data
print("\nFirst example from prediction set:")
print(tokenized_prediction_dataset["train"][0])
print(tokenized_prediction_dataset["train"][1])
print(tokenized_prediction_dataset["train"][2])

In [None]:
# Make predictions on the prediction set
predictions = trainer.predict(tokenized_prediction_dataset["train"])

# Process predictions if necessary
predicted_labels = predictions.predictions  # Access the predictions output
predicted_categories = (torch.sigmoid(torch.tensor(predicted_labels)) > 0.5).int().numpy()

# Function to map binary vectors to category names
def get_category_names(binary_vector, category_columns):
    return [category_columns[i] for i in range(len(binary_vector)) if binary_vector[i] == 1]

# Access the actual labels from the tokenized prediction dataset
actual_labels = tokenized_prediction_dataset['train']['labels']

total_correct_title_only = 0
total_correct_abstract_only = 0
total_correct_title_abstract = 0

total_wrong_title_only = 0
total_wrong_abstract_only = 0
total_wrong_title_abstract = 0

# Display the expected and predicted categories along with the counts
for i in range(len(predicted_categories)):
    # Get expected categories from actual labels
    expected_categories = get_category_names(actual_labels[i], category_columns)
    # Get predicted categories from predicted categories
    predicted_category_names = get_category_names(predicted_categories[i], category_columns)
    
    # Calculate correct and wrong predictions
    correct_predictions = set(expected_categories) & set(predicted_category_names)
    wrong_predictions = set(predicted_category_names) - set(expected_categories)
    num_correct = len(correct_predictions)
    num_wrong = len(wrong_predictions)

    # Accumulate counts per input text variation; the order within each
    # group of 3 rows follows expand_rows: title + abstract, title, abstract
    if i % 3 == 0:
        total_correct_title_abstract += num_correct
        total_wrong_title_abstract += num_wrong
    elif i % 3 == 1:
        total_correct_title_only += num_correct
        total_wrong_title_only += num_wrong
    else:
        total_correct_abstract_only += num_correct
        total_wrong_abstract_only += num_wrong

    print(f"Example {i}: Expected categories: {expected_categories}, Predicted categories: {predicted_category_names}")
    print(f"Correctly predicted: {num_correct}, Incorrectly predicted: {num_wrong}\n")

# Print the total correct and wrong predictions for each input text variation
print(f"Total correct predictions for title only: {total_correct_title_only}; Total wrong predictions: {total_wrong_title_only}")
print(f"Total correct predictions for abstract only: {total_correct_abstract_only}; Total wrong predictions: {total_wrong_abstract_only}")
print(f"Total correct predictions for title and abstract: {total_correct_title_abstract}; Total wrong predictions: {total_wrong_title_abstract}")
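
The loop above counts correct and wrong predictions, but it does not count expected categories that the model missed. As a minimal sketch with hypothetical helpers and toy category lists, the same set operations can be extended to false negatives and turned into micro-averaged precision and recall:

```python
def score_prediction(expected, predicted):
    """Count true positives, false positives and false negatives for one paper."""
    expected, predicted = set(expected), set(predicted)
    return {
        "tp": len(expected & predicted),
        "fp": len(predicted - expected),
        "fn": len(expected - predicted),
    }


def precision_recall(tp, fp, fn):
    """Micro-averaged precision and recall; 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative values, not results from the model
toy_expected = [["cs.LG", "stat.ML"], ["math.CO"]]
toy_predicted = [["cs.LG"], ["math.CO", "cs.LG"]]

totals = {"tp": 0, "fp": 0, "fn": 0}
for expected, predicted in zip(toy_expected, toy_predicted):
    for key, value in score_prediction(expected, predicted).items():
        totals[key] += value

precision, recall = precision_recall(**totals)
print(totals, precision, recall)
```

Keeping one `totals` dictionary per input variation would allow title-only, abstract-only and combined inputs to be compared on precision and recall rather than on raw counts.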