# Creation of datasets for finetuning LLMs on arXiv abstracts

Useful info can be found here: https://info.arxiv.org/help/arxiv_identifier_for_services.html

# Outline

- [ 1 - Packages and setup](#1)
    - [1.1 - Log into huggingface hub](#1.1)
    - [1.2 - Define variables to automate extraction and upload](#1.2)
- [ 2 - Load full arXiv metadata (currently ~4.2 GB)](#2)
- [ 3 - Data manipulation](#3)
    - [3.1 - Identify small set of papers which we authored](#3.1)
    - [3.2 - Explore categories](#3.2)
    - [3.3 - Extract specific category/categories](#3.3)
    - [3.4 - Remove papers that have been withdrawn](#3.4)
- [ 4 - Look at the abstracts](#4)    
    - [4.1 - Length of abstracts](#4.1)
    - [4.2 - Keywords/PACS at end of abstracts](#4.2)
    - [4.3 - Multi-lingual abstracts](#4.3)
    - [4.4 - Look at distribution of dates from `id` column](#4.4)
- [ 5 - Clean the abstract data](#5)
- [ 6 - Convert to Huggingface dataset and push](#6)
    - [ 6.1 - Convert Pandas DataFrame to dataset Dataset](#6.1)
    - [ 6.2 - Split dataset into train, test and validation datasets](#6.2)
    - [ 6.3 - Upload data to Huggingface](#6.3)
- [ 7 - Concatenate hep-th_primary and hep-ph_gr-qc_primary datasets](#7)

<a name="1"></a>
## 1 - Packages and setup

In [1]:
import numpy as np 
import pandas as pd
import json
import re

import huggingface_hub
import datasets

<a name="1.1"></a>
### 1.1 - Log into huggingface hub

In [2]:
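try:
    # on Kaggle, read the token from the notebook secrets
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    secret_value_0 = user_secrets.get_secret("HFapi")
    huggingface_hub.login(secret_value_0)
except Exception:
    # elsewhere (or if the secret is unavailable), fall back to the interactive login
    huggingface_hub.login()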

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


<a name="1.2"></a>
### 1.2 - Define variables to automate extraction and upload

In this subsection we define several variables which control how this notebook operates. Once these variables are set you can click `Run All`.

The variables are:
1. `repo_id`: String. This is the name of the Huggingface repository that the dataset will be uploaded to. If the repository does not already exist it will be created.
2. `commit_message`: String or None. An optional commit message used in the `push_to_hub` function. Set it to `None` for an initial commit.
3. `wanted_categories`: List. A list of strings corresponding to arXiv categories. The code looks at the string data in the `categories` column and checks if any of the entries in `wanted_categories` appears.
4. `primary_classification_only`: Boolean. 
    * `True` will only match entries in `wanted_categories` to the first substring which appears in the `categories` column. 
    * `False` will match entries in `wanted_categories` if they appear *anywhere* in the `categories` column.
5. `train_size`: Float. The fraction (between 0 and 1) of the data used for the training dataset. The remaining `(1-train_size)` forms the non-train dataset, *i.e.* the combined test and validation set.
6. `validation_size`: Float. The fraction (between 0 and 1) of the non-train dataset used as the validation dataset. The remaining fraction, `(1-validation_size)`, is used as the test dataset. The test set is to be used for hyperparameter tuning etc., with the validation set left unused until the end to determine final model performance.

In [None]:
repo_id = "LLMsForHepth/hep-th_primary"
commit_message = None # None will use default in `push_to_hub` which is `"Upload dataset"`

wanted_categories = ['hep-th'] 
primary_classification_only = True

train_size = 0.7 # use 70% of the dataset for training, 30% for testing & validation
validation_size = 0.5 # test_size is 1 - validation_size
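
As a rough illustration of how the two size variables combine, the sketch below works out the share of the full dataset that each split receives. The helper name `split_fractions` is hypothetical and is not used elsewhere in this notebook.

```python
def split_fractions(train_size: float, validation_size: float) -> dict:
    """Return the share of the full dataset that ends up in each split."""
    non_train = 1.0 - train_size  # combined test and validation share
    return {
        "train": train_size,
        "validation": non_train * validation_size,
        "test": non_train * (1.0 - validation_size),
    }


# values used in the cell above
fractions = split_fractions(train_size=0.7, validation_size=0.5)
print(fractions)
```

With the values above, roughly 70% of the papers go to training and the remainder is shared equally between validation and test.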

<a name="2"></a>
## 2 - Load full arXiv metadata (currently ~4.2 GB)

The Kaggle dataset is described here: https://www.kaggle.com/datasets/Cornell-University/arxiv

In this notebook we have pinned the dataset to v193, which includes submissions up to around 22nd August 2024.


arXiv submissions are tightly controlled and should follow the instructions given here: https://info.arxiv.org/help/prep.html.

In particular, the Title and Abstract metadata must be ASCII input, with Unicode characters converted to their LaTeX equivalents.
Since ASCII is a subset of UTF-8, we can use UTF-8 encoding to parse the JSON file.

In [None]:
df_dir='/kaggle/input/arxiv/arxiv-metadata-oai-snapshot.json'
json_data = []

with open(df_dir, 'r', encoding='utf-8') as f:
    for line in f:
        # Parse JSON from each line
        try:
            json_object = json.loads(line)
            json_data.append(json_object)
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON: {e}")
            continue
            
df = pd.DataFrame(json_data)
del json_data
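
The loading cell keeps every field of every record. If memory becomes a concern and only a few columns are needed, one possible variation is to drop unwanted fields while reading. The sketch below is only an illustration: `iter_records` is a hypothetical helper, and the sample records are made up so that the block runs without the Kaggle file.

```python
import io
import json


def iter_records(lines, wanted_fields):
    """Yield one dict per valid JSON line, keeping only `wanted_fields`."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        yield {field: record.get(field) for field in wanted_fields}


# tiny in-memory stand-in for the snapshot file (made-up records)
sample = io.StringIO(
    '{"id": "0704.0001", "categories": "hep-ph", "abstract": "A", "title": "T"}\n'
    'not valid json\n'
    '{"id": "hep-th/9901001", "categories": "hep-th", "abstract": "B", "title": "U"}\n'
)
records = list(iter_records(sample, wanted_fields=["id", "categories", "abstract"]))
print(records)
```

In the notebook itself the same generator could be fed the open file handle `f`, with the resulting list passed to `pd.DataFrame`. Note that the dataset pushed later would then only contain the selected columns.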

<a name="3"></a>
## 3 - Data manipulation

In [None]:
df.shape

In [None]:
df.head()

<a name="3.1"></a>
### 3.1 - Identify small set of papers which we authored

We want to create a very small control dataset so we can see how well an LLM completes abstracts as it is being finetuned.

We choose the id of the paper which appears as the $n$th entry in each of our INSPIRE records (Sid is not included, having no hep-th papers)
with $n = \text{number of citeable papers} \, // \, 2$.

In [None]:
ids = ["1804.08625", "1404.0016", "1205.2086", "1209.5915", "1802.05268"]
df_overfit = df[df['id'].isin(ids)]
df_overfit

In [None]:
# remove our papers from main dataframe so we don't double count
df = df[~df['id'].isin(ids)]

<a name="3.2"></a>
### 3.2 - Explore categories

See https://arxiv.org/category_taxonomy for a description of the values which can appear.
Each arXiv article has a primary category and may also have one or more cross-lists to other categories.

Inspecting `df.head()` we see that `df.iloc[1]['categories'] = 'math.CO cs.CG'`. The *primary* classification is `math.CO` and it is also cross-listed to the `cs.CG` category.

In [None]:
# there are many combinations of primary and cross-list categories
df['categories'].value_counts()

In [None]:
# split string appearing in `categories` on white space and expand
# split_cats[0] is the *primary* classification
split_cats = df['categories'].str.split(n=-1, expand=True)
primary_cat = split_cats[0]

In [None]:
# get a list of primary classifications and associated count
# list is ordered in descending count  
primary_cats_and_counts = list(zip(split_cats[0].value_counts().keys().tolist(), split_cats[0].value_counts().tolist()))
primary_cats_and_counts

<a name="3.3"></a>
### 3.3 - Extract specific category/categories

**NB:** The variables `wanted_categories` and `primary_classification_only` are defined in section [1.2](#1.2)

In [None]:
if primary_classification_only:
    # we get those papers whose *primary* classification is in `wanted_categories`
    df = df[primary_cat.apply(lambda x: any(k in x for k in wanted_categories))]
else:
    # we get papers where `wanted_categories` appears anywhere in `categories` i.e. primary and also in cross-listing
    df = df[df['categories'].apply(lambda x: any(k in x for k in wanted_categories))]
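
Note that the cell above uses substring tests (`k in x`), so an entry such as `'astro-ph'` in `wanted_categories` would also match `'astro-ph.CO'`. If exact matching were preferred, one option is to compare whole whitespace-separated tokens. The sketch below is an illustration only; `has_wanted_category` is a hypothetical helper that is not part of this notebook.

```python
def has_wanted_category(categories, wanted, primary_only):
    """Exact-match check on the whitespace-separated `categories` string."""
    tokens = categories.split()
    if primary_only:
        tokens = tokens[:1]  # the first token is the primary classification
    return any(token in wanted for token in tokens)


wanted = {"hep-th"}
print(has_wanted_category("hep-th gr-qc", wanted, primary_only=True))
print(has_wanted_category("gr-qc hep-th", wanted, primary_only=True))
print(has_wanted_category("gr-qc hep-th", wanted, primary_only=False))
```

It could be applied to the DataFrame with `df['categories'].apply(...)`, in the same way as the lambda functions above.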

In [None]:
df.shape

<a name="3.4"></a>
### 3.4 - Remove papers that have been withdrawn

See https://info.arxiv.org/help/withdraw.html

In [None]:
# flag papers whose comments mention 'withdrawn' (case-insensitive)
# na=False treats missing comments as not withdrawn
withdrawn = df['comments'].str.contains('withdrawn', case=False, na=False)
withdrawn.value_counts()

In [None]:
# sanity check but takes a while
# make an index of abstracts which contain either 'Withdrawn' or 'withdrawn'
# this way is quicker than using contains('withdrawn', case=False)
withdrawn_abs = df['abstract'].str.contains('Withdrawn') | df['abstract'].str.contains('withdrawn') #| df['abstract'].str.contains('removed') 
withdrawn_abs.value_counts()

In [None]:
# look at entries with `withdrawn` in abstract but not in comments
df[~withdrawn & withdrawn_abs]

In [None]:
# drop the withdrawn papers
df = df[~(withdrawn | withdrawn_abs)]
df.shape
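
The two checks above can also be expressed as a single per-record test. The sketch below is a plain-Python illustration with made-up example strings; `is_withdrawn` is a hypothetical helper. Unlike the abstract check above, it is case-insensitive for both fields.

```python
import re

WITHDRAWN = re.compile(r"withdrawn", re.IGNORECASE)


def is_withdrawn(comments, abstract):
    """True if either field mentions 'withdrawn'; non-string values are ignored."""
    return any(
        WITHDRAWN.search(text) is not None
        for text in (comments, abstract)
        if isinstance(text, str)  # missing comments are not strings
    )


print(is_withdrawn(None, "This paper has been withdrawn by the author."))
print(is_withdrawn("12 pages, 3 figures", "We study black holes."))
```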

<a name="4"></a>
## 4 - Look at the abstracts

<a name="4.1"></a>
### 4.1 - Length of abstracts

In [None]:
df.reset_index(drop=True, inplace=True)

In [None]:
# get the number of characters in each abstract
abstract_len = df['abstract'].str.len()
# look at the summary statistics for abstract_len
abstract_len.describe()
# According to https://info.arxiv.org/help/prep.html
# abstracts longer than 1920 characters are not accepted. 
# So when did this rule begin as we have examples of abstract_len > 1920?

In [None]:
abstract_len.hist()

In [None]:
# Take a look at the longest abstract
df.iloc[abstract_len.idxmax()]['abstract']

Things to notice about the above abstract
- there is lots of whitespace at the start
- there are many \n instead of spaces

This suggests we replace `\n` with `' '` and strip out the extra leading/trailing whitespace.

<a name="4.2"></a>
### 4.2 - Keywords/PACS at end of abstracts

In [None]:
# It turns out there are PACS numbers and Keywords at the end of some abstracts,
# should we remove these for training an LLM?
# Find which abstracts contain either 'Keyword' or 'PACS'
has_keyword = df['abstract'].str.contains('Keyword|PACS', case=False)
df[has_keyword].shape[0]

In [None]:
# df[has_keyword].iloc[0]['abstract']

In [None]:
print(f"Percentage of abstracts with Keywords or PACS is {100 * df[has_keyword].shape[0] / df.shape[0]:.3f}%")
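
If we did decide to remove these tails, one possible approach is a regular expression that cuts everything from a trailing `PACS ...:` or `Keywords:` marker onwards. The sketch below is only a starting point under that assumption: the pattern and the helper `strip_keyword_tail` are hypothetical, the sample strings are made up, and a marker appearing mid-abstract would also trigger the cut.

```python
import re

# matches e.g. "PACS: ...", "PACS numbers: ..." or "Keywords: ..." up to the end of the text
TAIL = re.compile(
    r"\s*(?:PACS(?:\s+numbers?)?|Keywords?)\s*:.*\Z",
    re.IGNORECASE | re.DOTALL,
)


def strip_keyword_tail(abstract):
    """Remove a PACS/Keywords marker and everything that follows it."""
    return TAIL.sub("", abstract)


print(strip_keyword_tail("We compute the entropy. PACS numbers: 11.25.-w, 04.70.Dy"))
print(strip_keyword_tail("We analyse PACS data from a survey."))
```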

<a name="4.3"></a>
### 4.3 - Multi-lingual abstracts

See https://info.arxiv.org/help/faq/multilang.html

First we find any multi-lingual abstracts

In [None]:
multi = df['abstract'].str.contains("-----")
multi.value_counts()

In [None]:
print(f"Percentage of multi-lingual abstracts is {100 * df[multi].shape[0] / df.shape[0]:.3f}%")

We have to be careful because there are some abstracts which have metric signatures denoted by $+-----$ as can be seen below!

In [None]:
# df[multi]['abstract'].iloc[3]

To remove the "-----" and everything after it we would use the following

In [None]:
english_only = df['abstract'].apply(lambda x: x.split("-----")[0])

However, since there are very few examples we leave things as they are
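
Should the split ever be needed, a more conservative variant could require the separator to sit on a line of its own, so that an inline metric signature such as $+-----$ is left alone. This assumes the multi-lingual abstracts are formatted that way, which has not been checked here. The helper `english_part` and the sample strings are hypothetical.

```python
import re

# five or more dashes alone on a line (optionally surrounded by spaces/tabs)
SEPARATOR_LINE = re.compile(r"\n[ \t]*-{5,}[ \t]*\n")


def english_part(abstract):
    """Keep the text before the first separator line, if there is one."""
    return SEPARATOR_LINE.split(abstract, maxsplit=1)[0]


print(english_part("An English abstract.\n-----\nUn résumé en français."))
print(english_part("We use signature $+-----$ throughout.\nMore text."))
```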

<a name="4.4"></a>
### 4.4 - Look at distribution of dates from `id` column

Old scheme identifiers are of the form hep-th/9901001.

New scheme identifiers are of the form 0704.0001 or 1501.00001

In [None]:
# The dataframes look to be ordered by identifier
df['id'][:-10]

In [None]:
def get_year_from_id(id):
    if '.' in id:
        year = id[:2]
    else:
        tmp = id.split('/')[1]
        year = tmp[:2]
    if year[0] == '9':
        year = '19' + year
    else:
        year = '20' + year
    return year
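
As a quick check of the two identifier schemes, the sketch below is a compact variant that returns the year as an integer. It is not the function used in this notebook: `year_from_arxiv_id` is a hypothetical name, and the cut-off at 91 relies on arXiv having started in 1991.

```python
def year_from_arxiv_id(arxiv_id):
    """Return the submission year encoded in an old- or new-scheme identifier."""
    # old scheme 'hep-th/9901001': digits follow the slash
    # new scheme '0704.0001' or '1501.00001': the identifier starts with the digits
    digits = arxiv_id.split("/")[-1]
    two_digit_year = int(digits[:2])
    # two-digit years 91-99 belong to the 1990s, everything else to the 2000s
    return 1900 + two_digit_year if two_digit_year >= 91 else 2000 + two_digit_year


for example in ["hep-th/9901001", "0704.0001", "1501.00001"]:
    print(example, year_from_arxiv_id(example))
```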

In [None]:
years = df['id'].map(get_year_from_id)

In [None]:
years.sort_values(ascending=True).hist(figsize=(15,3))

<a name="5"></a>
## 5 - Clean the abstract data

In [None]:
def clean_abstracts(abstract):
    abstract = re.sub(r'\n\s*', ' ', abstract)  # replace '\n' and any whitespace immediately after it with a single whitespace
    abstract = abstract.strip()  # remove leading/trailing whitespace
    return abstract
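
Note that `clean_abstracts` only collapses whitespace that follows a newline, so repeated spaces inside a line are kept. For comparison, the sketch below sets it beside a stricter variant that collapses every run of whitespace. The helper `collapse_whitespace` and the sample text are hypothetical, and the stricter variant is not what this notebook uses.

```python
import re


def clean_abstracts(abstract):
    # same logic as the cell above
    abstract = re.sub(r'\n\s*', ' ', abstract)
    return abstract.strip()


def collapse_whitespace(abstract):
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    return ' '.join(abstract.split())


raw = "  We study\n  the  model.\n"
print(repr(clean_abstracts(raw)))
print(repr(collapse_whitespace(raw)))
```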

In [None]:
# apply `clean_abstracts` function to Series. Don't know how to do this inplace
# so we add a new column and then do some renaming
df['cleaned_abstract'] = df['abstract'].map(clean_abstracts)
df = df.rename(columns={"abstract": "orig_abstract", "cleaned_abstract": "abstract"})

In [None]:
df['abstract'].iloc[0]

<a name="6"></a>
## 6 - Convert to Huggingface dataset and push

<a name="6.1"></a>
### 6.1 - Convert Pandas DataFrame to dataset Dataset

In [None]:
raw_dataset = datasets.Dataset.from_pandas(df, preserve_index=False)

<a name="6.2"></a>
### 6.2 - Split dataset into train, test and validation datasets

See `?datasets.Dataset.train_test_split` for full documentation.

Since the DataFrame seems to be ordered by `id` column we must randomly shuffle before splitting.

**NB:** The variables `train_size` and `validation_size` are defined in section [1.2](#1.2)

In [None]:
train_testvalid = raw_dataset.train_test_split(train_size=train_size, seed=42, shuffle=True)
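# second split: a fraction `validation_size` of the non-train data becomes the validation set (the 'train' part here)
test_valid = train_testvalid['test'].train_test_split(train_size=validation_size, seed=42, shuffle=True)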

train_test_valid_dataset = datasets.DatasetDict({'train': train_testvalid['train'],
                                                 'test': test_valid['test'],
                                                 'validation': test_valid['train']})

In [None]:
# print the number of entries in each dataset
for name, data in train_test_valid_dataset.items():
    print(f"Dataset {name} has size {data.shape[0]}")

<a name="6.3"></a>
### 6.3 - Upload data to Huggingface

See `?datasets.Dataset.push_to_hub` for full documentation.

**NB:** The variables `repo_id` and `commit_message` are defined in section [1.2](#1.2)

In [None]:
# Push the Dataset to Huggingface
try:
    train_test_valid_dataset.push_to_hub(repo_id, commit_message=commit_message)
except Exception:
    # if the push fails (e.g. the repository is missing), create it and retry
    huggingface_hub.create_repo(repo_id=repo_id,
                                repo_type="dataset",
                                private=True,
                                exist_ok=True)
    train_test_valid_dataset.push_to_hub(repo_id, commit_message=commit_message)

In [None]:
# logout from Huggingface
huggingface_hub.logout()

**NB:** we can get previous instances of datasets by using

```
ds_old = datasets.load_dataset('LLMsForHepth/arxiv_hepth_first',
                               revision='346140be7a01f109af9845a0e3742b9fcd66fd9a')
```

where `'346140be7a01f109af9845a0e3742b9fcd66fd9a'` is a commit hash found on the repo website.

In [None]:
ds = datasets.load_dataset('LLMsForHepth/hep-th_primary')

<a name="7"></a>
## 7 - Concatenate hep-th_primary and hep-ph_gr-qc_primary datasets

In [3]:
from datasets import load_dataset, concatenate_datasets, DatasetDict

In [4]:
ds_1 = load_dataset('LLMsForHepth/hep-th_primary')
ds_2 = load_dataset('LLMsForHepth/hep-ph_gr-qc_primary')

Loading reports 73768 train, 15808 test and 15808 validation examples for `hep-th_primary`, and 137136 train, 29387 test and 29386 validation examples for `hep-ph_gr-qc_primary`.

In [5]:
ds_concat = DatasetDict()
names = ds_1.keys()

for name in names:
    ds_concat[name] = concatenate_datasets([ds_1[name], ds_2[name]])

In [6]:
ds_concat

DatasetDict({
    train: Dataset({
        features: ['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi', 'report-no', 'categories', 'license', 'orig_abstract', 'versions', 'update_date', 'authors_parsed', 'abstract'],
        num_rows: 210904
    })
    test: Dataset({
        features: ['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi', 'report-no', 'categories', 'license', 'orig_abstract', 'versions', 'update_date', 'authors_parsed', 'abstract'],
        num_rows: 45195
    })
    validation: Dataset({
        features: ['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi', 'report-no', 'categories', 'license', 'orig_abstract', 'versions', 'update_date', 'authors_parsed', 'abstract'],
        num_rows: 45194
    })
})

In [7]:
# shuffle each split and rewrite it so the new order is stored contiguously
for name in ds_concat:
    ds_concat[name] = ds_concat[name].shuffle(seed=42).flatten_indices()

In [10]:
print(ds_1['train'][0:5]['id'])
print(ds_2['train'][0:5]['id'])
print(ds_concat['train'][0:5]['id'])

['2205.12835', '0706.1875', 'hep-th/0306003', '1307.3106', '1601.01310']
['hep-ph/0001018', '2111.04548', '1306.4970', '2310.04053', 'hep-ph/0401114']
['1806.04140', 'hep-th/0209192', 'gr-qc/0505099', 'hep-th/9303053', 'hep-th/9404121']
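
Since each paper has exactly one primary category, the two source datasets should not share any `id`, provided both were built with `primary_classification_only = True` (an assumption about how the second dataset was created). A minimal sketch of such a check is shown below using the ids printed above. `shared_ids` is a hypothetical helper.

```python
def shared_ids(ids_a, ids_b):
    """Return the set of identifiers that appear in both collections."""
    return set(ids_a) & set(ids_b)


# first five ids of each train split, copied from the output above
ids_hep_th = ['2205.12835', '0706.1875', 'hep-th/0306003', '1307.3106', '1601.01310']
ids_hep_ph_gr_qc = ['hep-ph/0001018', '2111.04548', '1306.4970', '2310.04053', 'hep-ph/0401114']

overlap = shared_ids(ids_hep_th, ids_hep_ph_gr_qc)
print(f"Number of shared ids: {len(overlap)}")
```

On the full data the same function could be called with `ds_1['train']['id']` and `ds_2['train']['id']`.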


In [11]:
# Push the Dataset to Huggingface
concat_repo_id = 'LLMsForHepth/hep-th_hep-ph_gr-qc_primary_v3'
try:
    ds_concat.push_to_hub(concat_repo_id)
except Exception:
    # if the push fails (e.g. the repository is missing), create it and retry
    huggingface_hub.create_repo(repo_id=concat_repo_id,
                                repo_type="dataset",
                                private=False,
                                exist_ok=True)
    ds_concat.push_to_hub(concat_repo_id)
