# Data collection process

This notebook queries the arXiv API for papers in the categories "cs.CV" (Computer Vision), "stat.ML" / "cs.LG" (Machine Learning), and "cs.AI" (Artificial Intelligence). The results are then saved to a CSV file.

In [9]:
import arxiv
import pandas as pd

from tqdm import tqdm
from pathlib import Path

In [10]:
PATH_DATA_BASE = Path.cwd().parent / "data"
PATH_DATA_BASE.mkdir(parents=True, exist_ok=True)

## Querying the arXiv API

Let's start by defining a list of keywords that we will use to query the arXiv API.

In [11]:
query_keywords = [
    "\"image segmentation\"",
    "\"self-supervised learning\"",
    "\"representation learning\"",
    "\"image generation\"",
    "\"object detection\"",
    "\"transfer learning\"",
    "\"transformers\"",
    "\"adversarial training\"",
    "\"generative adversarial networks\"",
    "\"model compressions\"",
    "\"image segmentation\"",
    "\"few-shot learning\"",
    "\"natural language\"",
    "\"graph\"",
    "\"colorization\"",
    "\"depth estimation\"",
    "\"point cloud\"",
    "\"structured data\"",
    "\"optical flow\"",
    "\"reinforcement learning\"",
    "\"super resolution\"",
    "\"attention\"",
    "\"tabular\"",
    "\"unsupervised learning\"",
    "\"semi-supervised learning\"",
    "\"explainable\"",
    "\"radiance field\"",
    "\"decision tree\"",
    "\"time series\"",
    "\"molecule\"",
    "\"large language models\"",
    "\"llms\"",
    "\"language models\"",
    "\"image classification\"",
    "\"document image classification\"",
    "\"encoder\"",
    "\"decoder\"",
    "\"multimodal\"",
    "\"multimodal deep learning\"",
]
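Hand-maintained keyword lists easily pick up repeated entries ("image segmentation" appears twice above) or unbalanced quotes, and each repeat costs an extra API request and produces duplicate rows. The following is a minimal sketch of a cleanup step; `clean_keywords` is a hypothetical helper that is not part of the original notebook, and it assumes every keyword is meant to be an exact phrase wrapped in double quotes.

```python
def clean_keywords(keywords):
    """Return keywords wrapped in balanced double quotes, without duplicates.

    Hypothetical helper: assumes each keyword should be sent as an exact phrase.
    The original order of first occurrences is preserved.
    """
    cleaned = []
    seen = set()
    for keyword in keywords:
        # Drop surrounding whitespace and any leading/trailing quotes,
        # then re-wrap the phrase so the quotes are always balanced.
        phrase = keyword.strip().strip('"')
        normalized = f'"{phrase}"'
        if normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


# Made-up sample with one unbalanced quote and one repeated entry.
sample = ['"image segmentation"', '"adversarial training', '"image segmentation"']
print(clean_keywords(sample))
```

Applied to the list above, this could be used as `query_keywords = clean_keywords(query_keywords)` before running any queries.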

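Next, we define a function that builds a search object for a given query. Each query returns at most 50 results, sorted by last-updated date. The function then keeps only the papers whose primary category is one of the four target categories, so a single query can contribute fewer than 50 papers.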

In [12]:
client = arxiv.Client(num_retries=20, page_size=500)


def query_with_keywords(query) -> tuple:
    """
    Query the arXiv API for research papers based on a specific query and filter results by selected categories.
    
    Args:
        query (str): The search query to be used for fetching research papers from arXiv.
    
    Returns:
        tuple: A tuple containing four lists - terms, titles, abstracts, and urls of the filtered research papers.
        
            terms (list): A list of lists, where each inner list contains the categories associated with a research paper.
            titles (list): A list of titles of the research papers.
            abstracts (list): A list of abstracts (summaries) of the research papers.
            urls (list): A list of URLs for the papers' detail page on the arXiv website.
    """
    
    # Create a search object with the query and sorting parameters.
    search = arxiv.Search(
        query=query,
        max_results=50,
        sort_by=arxiv.SortCriterion.LastUpdatedDate
    )
    
    # Initialize empty lists for terms, titles, abstracts, and urls.
    terms = []
    titles = []
    abstracts = []
    urls = []

    # For each result in the search...
    for res in tqdm(client.results(search), desc=query):
        # Check if the primary category of the result is in the specified list.
        if res.primary_category in ["cs.CV", "stat.ML", "cs.LG", "cs.AI"]:
            # If it is, append the result's categories, title, summary, and url to their respective lists.
            terms.append(res.categories)
            titles.append(res.title)
            abstracts.append(res.summary)
            urls.append(res.entry_id)

    # Return the four lists.
    return terms, titles, abstracts, urls
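Because the category filter runs only after the 50 results have arrived, a keyword that is popular in other fields can leave very few papers. One option is to move part of the restriction into the query string itself. The sketch below assumes the arXiv API's `cat:` and `all:` field prefixes and its `AND` / `OR` operators; `TARGET_CATEGORIES` and `build_category_query` are hypothetical names that do not appear in the original notebook.

```python
# Same four categories that the function above checks against.
TARGET_CATEGORIES = ("cs.CV", "stat.ML", "cs.LG", "cs.AI")


def build_category_query(keyword, categories=TARGET_CATEGORIES):
    """Combine a keyword with a category restriction in one query string.

    Hypothetical helper: `keyword` is expected to be already quoted,
    e.g. '"image segmentation"'.
    """
    category_clause = " OR ".join(f"cat:{category}" for category in categories)
    return f"({category_clause}) AND all:{keyword}"


print(build_category_query('"image segmentation"'))
```

The resulting string could be passed as `query=` to `arxiv.Search`. Note that `cat:` matches any category listed on a paper, not only the primary one, so the `primary_category` check in the loop would still be needed to reproduce the original filter exactly.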

In [13]:
all_titles = []
all_abstracts = []
all_terms = []
all_urls = []

for query in query_keywords:
    terms, titles, abstracts, urls = query_with_keywords(query)
    all_titles.extend(titles)
    all_abstracts.extend(abstracts)
    all_terms.extend(terms)
    all_urls.extend(urls)

"image segmentation": 50it [00:06,  7.62it/s]
"self-supervised learning": 0it [00:02, ?it/s]
"representation learning": 50it [00:08,  5.65it/s]
"image generation": 50it [00:08,  6.01it/s]
"object detection": 50it [00:08,  5.65it/s]
"transfer learning": 50it [00:12,  3.85it/s]
"transformers": 50it [00:12,  4.11it/s]
"adversarial training: 0it [00:02, ?it/s]
"generative adversarial networks": 50it [00:15,  3.19it/s]
"model compressions": 50it [00:04, 11.13it/s]
"image segmentation": 50it [00:04, 10.08it/s]
"few-shot learning": 0it [00:03, ?it/s]
"natural language": 50it [00:15,  3.14it/s]
"graph": 50it [00:26,  1.86it/s]
"colorization": 50it [01:18,  1.57s/it]
"depth estimation": 50it [00:50,  1.01s/it]
"point cloud": 50it [00:12,  4.05it/s]
"structured data": 50it [00:16,  3.05it/s]
"optical flow": 50it [00:22,  2.18it/s]
"reinforcement learning": 50it [00:57,  1.15s/it]
"super resolution": 50it [00:44,  1.13it/s]
"attention": 50it [00:20,  2.48it/s]
"tabular": 50it [00:16,  2.99it/s]

Now, we create a pandas.DataFrame object to store the results.

In [14]:
arxiv_data = pd.DataFrame({
    'titles': all_titles,
    'abstracts': all_abstracts,
    'terms': all_terms,
    'urls': all_urls
})
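Several keywords overlap (for example "language models" and "large language models"), so the same paper can be collected more than once. Below is a minimal sketch, in plain Python, of how the collected lists could be de-duplicated by URL before the DataFrame is built. `dedupe_by_url` is a hypothetical helper and the records are made up for illustration.

```python
def dedupe_by_url(titles, abstracts, terms, urls):
    """Keep only the first occurrence of each URL across the parallel lists.

    Hypothetical helper: returns a dict of lists with the same keys as the
    DataFrame columns used in this notebook.
    """
    seen = set()
    kept = {"titles": [], "abstracts": [], "terms": [], "urls": []}
    for title, abstract, term, url in zip(titles, abstracts, terms, urls):
        if url in seen:
            continue  # This paper was already collected by an earlier keyword.
        seen.add(url)
        kept["titles"].append(title)
        kept["abstracts"].append(abstract)
        kept["terms"].append(term)
        kept["urls"].append(url)
    return kept


# Made-up records: "Paper A" was returned by two different keywords.
example = dedupe_by_url(
    titles=["Paper A", "Paper B", "Paper A"],
    abstracts=["Abstract A", "Abstract B", "Abstract A"],
    terms=[["cs.CV"], ["cs.LG"], ["cs.CV"]],
    urls=["http://example.org/a", "http://example.org/b", "http://example.org/a"],
)
print(example["titles"])
```

The returned dict can be passed straight to `pd.DataFrame`. On an existing DataFrame, `arxiv_data.drop_duplicates(subset="urls")` has the same effect, since it also keeps the first occurrence by default.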

Finally, we export the DataFrame to a CSV file.

In [15]:
arxiv_data.to_csv(PATH_DATA_BASE / 'data.csv', index=False)
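One detail to keep in mind when loading this file later: the `terms` column holds Python lists, and a CSV cell can only store text, so each list is written as its string representation. The sketch below illustrates the round trip with the standard-library `csv` module and a made-up row; it is an illustration of the behaviour under that assumption, not part of the original notebook.

```python
import ast
import csv
import io

# Made-up row that mimics the structure of the exported data.
rows = [{"titles": "Example paper", "terms": ["cs.CV", "cs.LG"]}]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["titles", "terms"])
writer.writeheader()
for row in rows:
    writer.writerow(row)  # The list is stringified with str() when written.

# Read the data back: every cell is now a plain string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_terms = loaded[0]["terms"]

# ast.literal_eval safely turns the string back into a list.
parsed_terms = ast.literal_eval(raw_terms)
print(type(raw_terms).__name__, type(parsed_terms).__name__)
```

With pandas, the same conversion can be applied while reading, for example `pd.read_csv(PATH_DATA_BASE / 'data.csv', converters={'terms': ast.literal_eval})`.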