# Data collection process

This notebook scrapes the arXiv website for papers in the category "cs.CV" (Computer Vision), "stat.ML" / "cs.LG" (Machine Learning) and "cs.AI" (Artificial Intelligence). The papers are then saved in a csv file.

In [3]:
!pip install arxiv

Collecting arxiv
  Downloading arxiv-2.1.3-py3-none-any.whl.metadata (6.1 kB)
Collecting feedparser~=6.0.10 (from arxiv)
  Downloading feedparser-6.0.11-py3-none-any.whl.metadata (2.4 kB)
Collecting sgmllib3k (from feedparser~=6.0.10->arxiv)
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... [?25ldone
Downloading arxiv-2.1.3-py3-none-any.whl (11 kB)
Downloading feedparser-6.0.11-py3-none-any.whl (81 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.3/81.3 kB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... [?25ldone
[?25h  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6049 sha256=461b7398343ae0a8b1aa17d43d3cbe1bdc9f5b79f6d051a0b8edc8867fe9444c
  Stored in directory: /root/.cache/pip/wheels/f0/69/93/a47e9d621be168e9e33c7ce60524393c0b92ae83cf6c6e89c5
Successfully built sgmllib3k
Installing collected packages: sg

In [4]:
import arxiv
import pandas as pd

from tqdm import tqdm
from pathlib import Path

## Scraping the arXiv website

Let's start by defining a list of keywords that we will use to query the arXiv API.

In [6]:
query_keywords = [
    "\"image segmentation\"",
    "\"self-supervised learning\"",
    "\"representation learning\"",
    "\"image generation\"",
    "\"object detection\"",
    "\"transfer learning\"",
    "\"transformers\"",
    "\"adversarial training",
    "\"generative adversarial networks\"",
    "\"model compressions\"",
    "\"image segmentation\"",
    "\"few-shot learning\"",
    "\"natural language\"",
    "\"graph\"",
    "\"colorization\"",
    "\"depth estimation\"",
    "\"point cloud\"",
    "\"structured data\"",
    "\"optical flow\"",
    "\"reinforcement learning\"",
    "\"super resolution\"",
    "\"attention\"",
    "\"tabular\"",
    "\"unsupervised learning\"",
    "\"semi-supervised learning\"",
    "\"explainable\"",
    "\"radiance field\"",
    "\"decision tree\"",
    "\"time series\"",
    "\"molecule\"",
    "\"large language models\"",
    "\"llms\"",
    "\"language models\"",
    "\"image classification\"",
    "\"document image classification\"",
    "\"encoder\"",
    "\"decoder\"",
    "\"multimodal\"",
    "\"multimodal deep learning\"",
]

Afterwards, we define a function that creates a search object using the given query. It sets the maximum number of results for each category to 6000 and sorts them by the last updated date. 

In [None]:
client = arxiv.Client(num_retries=20, page_size=500)


def query_with_keywords(query) -> tuple:
    """
    Query the arXiv API for research papers based on a specific query and filter results by selected categories.
    
    Args:
        query (str): The search query to be used for fetching research papers from arXiv.
    
    Returns:
        tuple: A tuple containing three lists - terms, titles, and abstracts of the filtered research papers.
        
            terms (list): A list of lists, where each inner list contains the categories associated with a research paper.
            titles (list): A list of titles of the research papers.
            abstracts (list): A list of abstracts (summaries) of the research papers.
            urls (list): A list of URLs for the papers' detail page on the arXiv website.
    """
    
    search = arxiv.Search(
        query=query,
        max_results=6000,
        sort_by=arxiv.SortCriterion.LastUpdatedDate
    )
    
    terms = []
    titles = []
    abstracts = []
    urls = []

    for res in tqdm(client.results(search), desc=query):
        if res.primary_category in ["cs.CV", "stat.ML", "cs.LG", "cs.AI"]:
            terms.append(res.categories)
            titles.append(res.title)
            abstracts.append(res.summary)
            urls.append(res.entry_id)

    return terms, titles, abstracts, urls

In [8]:
all_titles = []
all_abstracts = []
all_terms = []
all_urls = []

for query in query_keywords:
    terms, titles, abstracts, urls = query_with_keywords(query)
    all_titles.extend(titles)
    all_abstracts.extend(abstracts)
    all_terms.extend(terms)
    all_urls.extend(urls)

"image segmentation": 4583it [01:02, 73.07it/s]
"self-supervised learning": 0it [00:02, ?it/s]
"representation learning": 6000it [02:03, 48.48it/s]
"image generation": 4677it [01:53, 41.36it/s]
"object detection": 6000it [01:37, 61.65it/s]
"transfer learning": 6000it [01:31, 65.86it/s]
"transformers": 6000it [01:24, 71.17it/s]
"adversarial training: 0it [00:02, ?it/s]
"generative adversarial networks": 6000it [01:39, 60.54it/s]
"model compressions": 1102it [00:16, 67.43it/s]
"image segmentation": 4583it [00:58, 78.57it/s] 
"few-shot learning": 0it [00:03, ?it/s]
"natural language": 6000it [01:22, 72.69it/s]
"graph": 6000it [01:23, 71.90it/s]
"colorization": 6000it [01:23, 71.90it/s]
"depth estimation": 1930it [00:26, 73.80it/s]
"point cloud": 6000it [01:34, 63.59it/s]
"structured data": 2705it [00:46, 57.75it/s]
"optical flow": 2025it [00:30, 66.85it/s]
"reinforcement learning": 6000it [01:14, 80.51it/s]
"super resolution": 4177it [01:01, 67.83it/s]
"attention": 6000it [01:13, 82.03it/

Now, we create a pandas.DataFrame object to store the results.

In [9]:
arxiv_data = pd.DataFrame({
    'titles': all_titles,
    'abstracts': all_abstracts,
    'terms': all_terms,
    'urls': all_urls
})

Finally, we export the DataFrame to a csv file.

In [12]:
arxiv_data.to_csv('./data.csv', index=False)

In [14]:
arxiv_data_1 = arxiv_data[~arxiv_data["titles"].duplicated()]
print(f"There are {len(arxiv_data_1)} rows in the deduplicated dataset.")

There are 58789 rows in the deduplicated dataset.


In [15]:
arxiv_data_1.to_csv('./filtered_data.csv', index=False)