# arXiv_scraper

v1.0.0 (19/06/2020)

v1.0.1 (23/11/2020)
- changed GetUrl() to return link to pdf.

v1.0.2 (15/02/2021)
- added bold font to printed papers' titles.
                          
Author: _Filippo M. Gambetta_

A simple IPython notebook that searches newly uploaded arXiv papers for matches with a given set of keywords.

Currently, only the arXiv "new" listings can be searched (e.g. https://arxiv.org/list/cond-mat/new, https://arxiv.org/list/quant-ph/new).
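The two example URLs share a common pattern, so the list of listings to search can be built from archive names. A minimal sketch, assuming a helper called `new_listing_url` (a hypothetical name, not part of the notebook):

```python
def new_listing_url(archive):
    """Build the URL of the arXiv "new" listing for an archive such as "quant-ph"."""
    # The URL pattern is taken from the two example links above.
    return "https://arxiv.org/list/{}/new".format(archive)

urls = [new_listing_url(archive) for archive in ["cond-mat", "quant-ph"]]
print(urls)
```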

Keywords can be imported from an external text file or supplied directly as a list of strings.
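For the list-of-strings option, something along these lines should work. The keywords themselves are placeholders chosen for illustration. Because the search compares keywords against lowercased text, it seems safest to normalise the entries in the same way as the file-based import does:

```python
# Hypothetical keywords, for illustration only.
Keywords_vect = ["Quantum Critical", " spin glass ", "Rydberg"]

# Strip stray white space and lowercase each entry, since the search
# matches against lowercased titles, authors, and abstracts.
Keywords_vect = [keyword.strip().lower() for keyword in Keywords_vect]
print(Keywords_vect)
```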

This notebook was inspired by
* Joel Grus, Data Science from Scratch (O’Reilly, 2019); Chapter 9.

## Class Paper and search function

In [6]:
from bs4 import BeautifulSoup
import requests
import re

In [7]:
# Class Paper
# The defaults abstract_max_lenght (= 10000) and max_authors (= 3) can be overridden through the __init__ arguments

class Paper(object):
    """
    Class for arxiv papers.
    """
    def __init__(self, identifier, title, author_list, abstract, replacement = False, print_abstract = False, abstract_max_lenght = 10000, max_authors = 3):
        self.identifier = identifier
        self.title = title
        self.author_list = author_list
        self.abstract = abstract
        self.replacement = replacement
        self.print_abstract = print_abstract
        self.abstract_max_lenght = abstract_max_lenght
        self.max_authors = max_authors
        
    def GetIdentifier(self):
        return self.identifier
    
    def GetTitle(self):
        return self.title
    
    def GetAuthors(self):
        return self.author_list
    
    def GetAbstract(self):
        return self.abstract
    
    def GetAbstractMaxLenght(self):
        return self.abstract_max_lenght
    
    def GetUrl(self):
        return "https://arxiv.org/pdf/"+str(self.identifier)+".pdf"
    
    def IsReplacement(self):
        return self.replacement
    
    def __str__(self):
        paper_str = "\033[1m"+str(self.title) + "\033[0m" + "\n" # "\033[1m" and "\033[0m": start and end bold font
        ind_max = min(len(self.author_list), self.max_authors)
        for ind in range(ind_max):
            paper_str += self.author_list[ind]
            if ind < (ind_max - 1):
                paper_str += ", "
        if len(self.author_list)> self.max_authors:
            paper_str += " et al."
        paper_str += "\n"+self.GetUrl()
        if self.print_abstract: 
            paper_str += "\n\n" + self.abstract[:self.abstract_max_lenght] # Limit the length of the printed abstract
        return paper_str
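
The author line built in `__str__` (at most `max_authors` names, followed by "et al." when the list is longer) can also be written with `str.join`. This is only an equivalent sketch; the function name `format_authors` and the author names are hypothetical:

```python
def format_authors(author_list, max_authors=3):
    """Join up to max_authors names and append " et al." if any were left out."""
    shown = ", ".join(author_list[:max_authors])
    if len(author_list) > max_authors:
        shown += " et al."
    return shown

# Placeholder names, for illustration only.
print(format_authors(["A. Author", "B. Author", "C. Author", "D. Author"]))
print(format_authors(["A. Author", "B. Author"]))
```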

In [8]:
# Search functions

def SearchNewPapers(urls, Keywords_vect, print_abstract = False, print_replacement = False, repetition = False):
    """
    Search for the keywords contained in Keywords_vect in the titles,
    authors' lists, and abstracts of all papers listed at the given urls,
    and print the matching papers.
    """
    
    all_interesting_papers_ids = set()
    
    for url in urls:
        html = requests.get(url).text
        soup = BeautifulSoup(html, 'html5lib')
        
        # Get the date
        announcement_string = soup.find("div", {"class": "list-dateline"}).text.split("announced ")

        # Building the lists of paper identifiers
        all_dt = soup('dt') # numbers and identifiers are contained in dt sections

        all_identifiers = [a for dt in all_dt for a in dt('a') if a.has_attr('href')]
        all_identifiers_list_tmp = [identifier.text.split()[0] for identifier in all_identifiers]

        regex=r"^arXiv"
        all_identifiers_list = [identifier.replace("arXiv:","") for identifier in all_identifiers_list_tmp if re.match(regex,identifier)]

        # Building the list of titles, authors, and abstracts
        all_dd = soup('dd') # titles, authors, and abstracts are contained in dd sections

        all_titles = [div for dd in all_dd for div in dd('div','list-title mathjax')]
        all_authors = [div for dd in all_dd for div in dd('div','list-authors')]
        all_abstracts = [p for dd in all_dd for p in dd('p','mathjax')]
        
        all_titles_list = [title.text.replace("\n","").replace("Title: ","").replace("  "," ") for title in all_titles]
        all_authors_list = [authors.text.replace("\n","").replace("Authors: ","").replace("  "," ").split(", ") for authors in all_authors]
        all_abstracts_list = [abstract.text.replace("\n"," ").replace("  "," ") for abstract in all_abstracts]

        all_titles_list_lower = [title.text.lower().replace("\n","").replace("title: ","").replace("  "," ") for title in all_titles]
        all_authors_list_lower = [authors.text.lower().replace("\n","").replace("authors: ","").replace("  "," ").split(", ") for authors in all_authors]
        all_abstracts_list_lower = [abstract.text.lower().replace("\n"," ").replace("  "," ") for abstract in all_abstracts]

        # Search for keywords in titles, authors' lists, and abstracts

        interesting_papers = set()

        for keyword in Keywords_vect:
            interesting_titles = set([index for index, title in enumerate(all_titles_list_lower) if keyword in title])
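            # Each entry of all_authors_list_lower is a list of names, so a keyword matches only if it equals one list entry exactly
            interesting_authors = set([index for index, author in enumerate(all_authors_list_lower) if keyword in author])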
            interesting_abstract = set([index for index, abstract in enumerate(all_abstracts_list_lower) if keyword in abstract])
            
            interesting_papers = interesting_papers.union(interesting_titles, interesting_authors,interesting_abstract)
            
        interesting_papers_list = sorted(interesting_papers)

        # Building the output with link to the papers

        # Taking care of replacements
        new_papers = len(all_abstracts_list)
        total_papers = len(all_identifiers_list)

        # Entries beyond the new papers are replacements: pad the abstracts list with a placeholder for each
        all_abstracts_list.extend(["This paper is a replacement."] * (total_papers - new_papers))

        # Printing results

        print("Today ({})".format(announcement_string[1]),"in", url,"there are", new_papers, "new papers and", total_papers-new_papers, "replacements.\n\n")

        todays_papers = []
        for item in interesting_papers_list:
            if item < new_papers:
                replacement = False
            else: 
                replacement = True
            paper = Paper(all_identifiers_list[item], all_titles_list[item], all_authors_list[item], all_abstracts_list[item], replacement, print_abstract)
            todays_papers.append(paper)

        new_interesting_papers = 0
        already_announced_papers = 0
        for paper in todays_papers:
            if not paper.IsReplacement():
                if paper.GetIdentifier() not in all_interesting_papers_ids:
                    new_interesting_papers += 1
                else:
                    already_announced_papers += 1

        print("There are",new_interesting_papers,"new interesting papers and", len(todays_papers)-new_interesting_papers-already_announced_papers,"interesting replacements. \n")
        
        ind=0
        for paper in todays_papers:
            if not print_replacement:
                if not paper.IsReplacement():
                    if paper.GetIdentifier() not in all_interesting_papers_ids:
                        print(str(ind+1)+")")
                        print(paper,"\n")
                        ind += 1
            else: 
                if paper.GetIdentifier() not in all_interesting_papers_ids:
                    print(str(ind+1)+")")
                    print(paper,"\n")
                    ind += 1
                
        # Creating a set with all ids of already announced papers to avoid repetition
        if not repetition:
            for paper in todays_papers: 
                all_interesting_papers_ids.add(paper.GetIdentifier())

        print("\n************************************\n")
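
The core of the keyword search is a case-insensitive substring test that collects the indices of matching entries in a set. A stripped-down sketch of that step, run on a few made-up titles rather than a live arXiv page (the function name `matching_indices` is hypothetical):

```python
def matching_indices(keywords, texts):
    """Return the sorted indices of texts that contain at least one keyword."""
    lowered = [text.lower() for text in texts]
    hits = set()  # a set, so a text matching several keywords is counted once
    for keyword in keywords:
        hits.update(index for index, text in enumerate(lowered) if keyword.lower() in text)
    return sorted(hits)

# Placeholder titles, for illustration only.
titles = ["Planckian Metal at a Quantum Critical Point",
          "Diamond Gate in Quantum Machine Learning",
          "Spin Glass Dynamics"]
print(matching_indices(["critical", "spin glass"], titles))
```

Since the test is a plain substring match, short keywords may also match inside longer words.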

## Main program

In [9]:
#Import keywords from txt file
#Format: Insert keywords as new lines. Lines beginning with # will be ignored. 
#Tip: Check if there are accidental white spaces at the end of each entry!

Keywords_vect=[]

with open('Keywords.txt') as file_for_reading:
    for line in file_for_reading:
        if not re.match('^#', line) and line.strip()!='':
            Keywords_vect.append(line.strip().lower())

#print(Keywords_vect)
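
To check the file format without touching `Keywords.txt`, the same filtering rules can be tried on an in-memory file. A minimal sketch; the helper name `parse_keywords` and the sample entries are assumptions made for illustration:

```python
import io

def parse_keywords(lines):
    """Return lowercase keywords, skipping blank lines and lines beginning with '#'."""
    keywords = []
    for line in lines:
        entry = line.strip()
        if entry and not line.startswith("#"):
            keywords.append(entry.lower())
    return keywords

# Sample content with a comment line, trailing white space, and a blank line.
sample_file = io.StringIO("# Topics\nRydberg atoms  \n\nSpin Glass\n#ignored\n")
sample_keywords = parse_keywords(sample_file)
print(sample_keywords)
```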

In [10]:
urls  = ["https://arxiv.org/list/cond-mat/new",
         "https://arxiv.org/list/quant-ph/new"] # Insert here the urls of the arXiv new sections to search

print_abstract = True            # if True, the abstract of the papers will be printed
print_replacement = False        # if True, replacements will be printed
repetition = False               # if True, papers already announced in a previous listing will be announced again
SearchNewPapers(urls, Keywords_vect, print_abstract, print_replacement, repetition)

Today (Wed, 17 Mar 21) in https://arxiv.org/list/cond-mat/new there are 83 new papers and 73 replacements.


There are 34 new interesting papers and 18 interesting replacements. 

1)
Planckian Metal at a Doping-Induced Quantum Critical Point
Philipp T. Dumitrescu, Nils Wentzell, Antoine Georges et al.
https://arxiv.org/pdf/2103.08607.pdf

We numerically study a model of interacting spin-$1/2$ electrons with random exchange coupling on a fully connected lattice. This model hosts a quantum critical point separating two distinct metallic phases as a function of doping: a Fermi liquid with a large Fermi surface volume and a low-doping phase with local moments ordering into a spin-glass. We show that this quantum critical point has non-Fermi liquid properties characterized by $T$-linear Planckian behavior, $\omega/T$ scaling and slow spin dynamics of the Sachdev-Ye-Kitaev (SYK) type. The $\omega/T$ scaling function associated with the electronic self-energy is found to have an intri

Today (Wed, 17 Mar 21) in https://arxiv.org/list/quant-ph/new there are 53 new papers and 31 replacements.


There are 34 new interesting papers and 19 interesting replacements. 

1)
Application of the Diamond Gate in Quantum Fourier Transformations and Quantum Machine Learning
E. Bahnsen, S. E. Rasmussen, N. J. S. Loft et al.
https://arxiv.org/pdf/2103.08605.pdf

As we are approaching actual application of quantum technology it is important to exploit the current quantum resources the best possible way. With this in mind it might not be beneficial to use the usual standard gate sets, inspired from classical logic gates, when compiling quantum algorithms, when other less standardized gates currently perform better. We therefore consider a native gate, which occurs naturally in superconducting circuits, known as the diamond gate. We show how the diamond gate can be decomposed into standard gates and with the use of single-qubit gates can work as a controlled-not-swap (\cns) gate