Daniele-Gregori/ArXiv-Hepth-Data-Analysis

Text analysis of all 163,000+ theoretical high-energy physics papers on arXiv (with hep-th as primary or cross-listed category), from 1986 to 2023.

The analysis explores the following tasks: 1) counting; 2) feature extraction; 3) classification; 4) question answering; 5) summarization; 6) recommending papers / research directions. The results are as follows:

  1. interesting temporal trends appear in the popularity of title words;

[Figure: title word shares]

  2. two-word combinations of title words turn out to correspond to hep-th concepts and allow effective feature extraction and CONCEPT embedding of abstracts;

  3. classifiers of article categories are built as neural networks (NNs) based on either the CONCEPT or the SciBERT embedding;

[Figure: confusion matrix proper]

  4. through a more sophisticated NN, the CONCEPT classifier also works for the subcategories within the hep-th category;

[Figure: confusion matrix]

  5. effective question answering and summarization of article introductions, through high-level AI functionality in WL;

  6. a first basic recommendation algorithm, based on distance in feature space.
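To make the feature extraction and recommendation steps more concrete, here is a minimal Python sketch. It is not the notebook's implementation, which is written in WL. It assumes that "two-word combinations" means adjacent word pairs after stopword removal, that a concept is any pair appearing in at least `min_count` titles, and that distance in feature space is cosine distance. The function names, the stopword list, and the toy titles and abstracts are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption of this sketch).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "for", "to", "with", "from"}

def tokenize(text):
    """Lowercase, strip edge punctuation, and drop stopwords."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]

def bigrams(text):
    """Adjacent two-word combinations of the tokenized text."""
    tokens = tokenize(text)
    return list(zip(tokens, tokens[1:]))

def build_vocabulary(titles, min_count=2):
    """Keep as 'concepts' the bigrams that occur in at least min_count titles."""
    counts = Counter(bg for title in titles for bg in set(bigrams(title)))
    return sorted(bg for bg, n in counts.items() if n >= min_count)

def embed(text, vocabulary):
    """Count vector of concept occurrences in a text, e.g. an abstract."""
    counts = Counter(bigrams(text))
    return [counts[concept] for concept in vocabulary]

def cosine_distance(u, v):
    """Cosine distance; vectors with no concepts are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def recommend(query_index, vectors, k=1):
    """Indices of the k papers closest to the query paper in concept space."""
    query = vectors[query_index]
    others = [i for i in range(len(vectors)) if i != query_index]
    return sorted(others, key=lambda i: cosine_distance(query, vectors[i]))[:k]

# Made-up toy data, for illustration only.
titles = [
    "Black hole entropy in string theory",
    "String theory and black hole microstates",
    "Conformal field theory on the torus",
    "Conformal field theory and modular invariance",
]
abstracts = [
    "We count black hole microstates using string theory.",
    "A conformal field theory computation of partition functions.",
    "String theory corrections to black hole entropy.",
]

vocabulary = build_vocabulary(titles)
vectors = [embed(abstract, vocabulary) for abstract in abstracts]
nearest = recommend(0, vectors, k=1)
```

On this toy input the first and third abstracts share the same concepts, so the third is the one recommended for the first. A real pipeline would need a much larger stopword list and some filtering of uninformative pairs.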

Looking ahead, it seems sensible to relate papers in feature space as a way to inspire new discoveries.

[Figure: graph of papers in concept space]

All of this can be found in the notebook arXivDataAnalysisV1.3 (which must be unzipped first).

As a partial aside, the notebook Affiliation Countries shows the computation of the total number of papers over affiliated co-authors, for each country in 2023. This is done directly through the inspirehep API. The results are shown first as totals:

[Figure: Affiliation Countries Total Table]

and then as shares per capita, rescaled to 10^6 inhabitants:

[Figure: Affiliation Countries Capita Table]
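The per-capita rescaling can be illustrated with a short Python sketch. It assumes that the per-country totals have already been obtained and that population figures are available as a separate mapping. The function names are hypothetical, and the country labels and numbers below are invented placeholders, not results from the notebook.

```python
def per_million_capita(paper_counts, populations):
    """Papers per 10^6 inhabitants, for countries present in both mappings."""
    return {
        country: paper_counts[country] * 1_000_000 / populations[country]
        for country in paper_counts
        if populations.get(country, 0) > 0
    }

def rank(table):
    """Sort (country, value) pairs from the largest value to the smallest."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Invented placeholder values, for illustration only.
paper_counts = {"A": 500.0, "B": 120.0, "C": 30.0}
populations = {"A": 100_000_000, "B": 6_000_000, "C": 500_000}

shares = per_million_capita(paper_counts, populations)
ranking = rank(shares)
```

With these placeholder numbers the ordering by total (A, B, C) is reversed in the per-capita ranking (C, B, A), which is why the two tables are worth reading side by side.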