Authors: Louis Ravillon, Martin Piana
Date: October 2020

In [1]:
# packages

import requests
from bs4 import BeautifulSoup


## Parsing directly on CNRS database looks a bit complicated

For now we'll just try parsing the arXiv website and see how it goes.

In [4]:
arxiv_url = "https://arxiv.org/search/?query={}&searchtype=title&abstracts=show&order=-announced_date_first&size=50&start={}"


In [8]:
# Choose the query we want.
# Careful: each request returns at most 50 results (the website presents them 50 at a time).
# To get more, we iterate on the `start` offset in steps of 50.

def get_pdf_urls(website_url, query, total_size):
    """
    ARGS:
        - website_url: URL template with two placeholders, the query and the start offset
        - query: search terms, with spaces written as "+"
        - total_size: total number of articles listed for the query; we have to check
          the web page beforehand to know what it amounts to
    OUTPUT: a list of URLs linking to the articles' PDFs
    """
    urls = []
    # one request per page of 50 results; `start` is the offset of the first result
    for start in range(0, total_size, 50):
        x = requests.get(website_url.format(query, start)).content
        soup = BeautifulSoup(x, "html.parser")

        # the links of each search result sit in a <p class="list-title is-inline-block">
        results = soup("p", class_="list-title is-inline-block")

        for result in results:
            for a in result.find_all('a', href=True):
                if "pdf" in a['href']:
                    urls.append(a['href'] + ".pdf")
    return urls

urls = get_pdf_urls(arxiv_url, "machine+learning", 50)
print(urls)
print(type(urls[0]))

['https://arxiv.org/pdf/2010.02866.pdf', 'https://arxiv.org/pdf/2010.02749.pdf', 'https://arxiv.org/pdf/2010.02715.pdf', 'https://arxiv.org/pdf/2010.02670.pdf', 'https://arxiv.org/pdf/2010.02576.pdf', 'https://arxiv.org/pdf/2010.02523.pdf', 'https://arxiv.org/pdf/2010.02374.pdf', 'https://arxiv.org/pdf/2010.02317.pdf', 'https://arxiv.org/pdf/2010.02213.pdf', 'https://arxiv.org/pdf/2010.02174.pdf', 'https://arxiv.org/pdf/2010.02087.pdf', 'https://arxiv.org/pdf/2010.02086.pdf', 'https://arxiv.org/pdf/2010.02011.pdf', 'https://arxiv.org/pdf/2010.01996.pdf', 'https://arxiv.org/pdf/2010.01976.pdf', 'https://arxiv.org/pdf/2010.01968.pdf', 'https://arxiv.org/pdf/2010.01711.pdf', 'https://arxiv.org/pdf/2010.01709.pdf', 'https://arxiv.org/pdf/2010.01668.pdf', 'https://arxiv.org/pdf/2010.01582.pdf', 'https://arxiv.org/pdf/2010.01431.pdf', 'https://arxiv.org/pdf/2010.01213.pdf', 'https://arxiv.org/pdf/2010.01163.pdf', 'https://arxiv.org/pdf/2010.01149.pdf', 'https://arxiv.org/pdf/2010.01030.pdf',
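
The pagination described in the comments above (50 results per page, one request per `start` offset) can be sketched on its own, without touching the network. This is only an illustration: `page_urls` and `ARXIV_TEMPLATE` are hypothetical names, and the template is copied from `arxiv_url`.

```python
from urllib.parse import quote_plus

# Same template as arxiv_url: the first {} is the query, the second {} is the start offset.
ARXIV_TEMPLATE = (
    "https://arxiv.org/search/?query={}&searchtype=title&abstracts=show"
    "&order=-announced_date_first&size=50&start={}"
)


def page_urls(template, query, total_size, page_size=50):
    """Return one search URL per page of results (hypothetical helper)."""
    encoded = quote_plus(query)  # URL-encode the query; spaces become "+"
    return [
        template.format(encoded, start)
        for start in range(0, total_size, page_size)
    ]


# 120 results at 50 per page need three requests, at offsets 0, 50 and 100.
for url in page_urls(ARXIV_TEMPLATE, "machine learning", 120):
    print(url)
```

Building the list of page URLs first makes it easy to check how many requests a given `total_size` will trigger before sending any of them.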

### Getting the text

Now that we have the PDF URLs we have two options: either use Beautiful Soup, treating the web page as an HTML document and getting the text from there, or download the whole PDF and use pdfplumber or something similar.

In [2]:
# We'll try with beautiful soup

x = requests.get("https://arxiv.org/pdf/2010.02866.pdf").content
parser = "html.parser"
soup = BeautifulSoup(x, parser)
print(soup)
#results = soup.find_all("span", string=True, limit=5)
#print(results)




Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
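
The output above is consistent with the response body being a binary PDF rather than an HTML page, which would explain why the HTML parser only produces undecodable characters. A minimal sketch of a check that could be run on the downloaded bytes before parsing; `looks_like_pdf` and `looks_like_html` are hypothetical helper names, and the two samples are hand-written stand-ins, not real responses.

```python
def looks_like_pdf(data: bytes) -> bool:
    """True if the payload starts with the PDF magic bytes."""
    return data.startswith(b"%PDF-")


def looks_like_html(data: bytes) -> bool:
    """True if an HTML doctype or <html> tag appears near the start of the payload."""
    head = data[:1024].lower()
    return b"<!doctype html" in head or b"<html" in head


# Hand-written samples standing in for requests.get(url).content
pdf_sample = b"%PDF-1.5\n%\xd0\xd4\xc5\xd8\n1 0 obj"
html_sample = b"<!DOCTYPE html><html><body>hi</body></html>"

print(looks_like_pdf(pdf_sample), looks_like_html(pdf_sample))
print(looks_like_pdf(html_sample), looks_like_html(html_sample))
```

If the payload looks like a PDF, it makes more sense to save it to disk and hand it to a PDF library than to pass it to Beautiful Soup.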



### Problems with scraping

We'll try downloading the PDF here first and then getting the text out.

In [5]:
destination = "/home/martin/Desktop/ImpAgt/test.pdf"

chunk_size = 4000

import requests

url = "https://arxiv.org/pdf/2010.02866.pdf"
r = requests.get(url, stream=True)

with open(destination, 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)
        


Notes:

 - We are a bit stuck scraping the CNRS research database. We think it is because the site is dynamic, so we need to dig into it a bit more.
 - We moved to a simpler site: arXiv.
 - We can't retrieve the documents directly, so we take a detour: we download the whole file and use a package for that.

## Scraping a bigger database: Google Scholar

In [3]:
scholar_url = "https://scholar.google.fr/scholar?start={}&q={}&hl=fr&as_sdt=0,5&as_ylo={}&as_yhi={}"

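def get_pdf_urls(website_url, query, start_date, end_date):
    """
    ARGS:
        - website_url: URL template with placeholders for the start offset, the query,
          the first year and the last year
        - query: what we want in the search bar
        - start_date, end_date: first and last publication years to include
    OUTPUT: a list of URLs linking to the articles
    """
    urls = []
    # results come 10 per page; we fetch the first 5 pages (offsets 0 to 40)
    for start in range(0, 50, 10):
        x = requests.get(website_url.format(start, query, start_date, end_date)).content
        soup = BeautifulSoup(x, "html.parser")

        # each result title is an <h3 class="gs_rt"> wrapping the link to the article
        results = soup("h3", class_="gs_rt")

        for result in results:
            for a in result.find_all('a', href=True):
                urls.append(a['href'])
    return urls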

urls = get_pdf_urls(scholar_url, "machine+learning+agriculture", 2015, 2020)
print(urls)
print(len(urls))

        

['https://www.mdpi.com/327494', 'https://www.sciencedirect.com/science/article/pii/S0168169917314710', 'https://www.sciencedirect.com/science/article/pii/S0168169918304289', 'https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1002/2016WR019933', 'https://www.tandfonline.com/doi/abs/10.1080/01431161.2016.1278312', 'https://academic.oup.com/erae/article-abstract/47/3/849/5552525', 'https://www.mdpi.com/2072-4292/8/6/514', 'https://www.sciencedirect.com/science/article/pii/S0048969720338602', 'https://ieeexplore.ieee.org/abstract/document/8534558/', 'https://ieeexplore.ieee.org/abstract/document/7225403/', 'https://www.sciencedirect.com/science/article/pii/S016816991630117X', 'https://link.springer.com/article/10.1007/s11119-014-9372-7', 'https://www.sciencedirect.com/science/article/pii/S0168192315007467', 'https://ieeexplore.ieee.org/abstract/document/7838138/', 'https://link.springer.com/chapter/10.1007/978-981-13-7403-6_50', 'https://www.sciencedirect.com/science/article/pii/S01681699

In [50]:
def get_abstracts(url):

    x = requests.get(url).content
    parser = "html.parser"
    print(type(x))
    soup = BeautifulSoup(x, parser)
    print(type(soup))
    #results = soup("div", class_=re.compile("abstract"))
    results = soup("div")
    print(url)
    print(type(results))
    for result in results:
        print(type(result))
        #print(result.get_text())

    return 0

get_abstracts(urls[49])

<class 'bytes'>
<class 'bs4.BeautifulSoup'>
https://link.springer.com/article/10.1186/s40537-017-0077-4
<class 'bs4.element.ResultSet'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class

0

It's a real mess to get the abstracts: half of the websites don't let you in (protection against robots) and the other half are so inconsistently coded that you get a lot of information you don't want. The most relevant approach might still be to download the available PDFs and find a way to extract the abstracts from there.
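
One thing that might be worth trying before giving up on the HTML route: some publisher pages expose a summary in `<meta>` tags instead of a dedicated `div`. The sketch below uses only the standard library. The tag names in `CANDIDATE_NAMES` are an assumption that has not been checked against the sites listed above, `MetaCollector` and `guess_abstract` are hypothetical names, and the sample page is made up for illustration.

```python
from html.parser import HTMLParser

# Meta names that sometimes carry a summary. Which ones a given publisher
# actually uses is an assumption that has to be verified site by site.
CANDIDATE_NAMES = ("citation_abstract", "dc.description", "description", "og:description")


class MetaCollector(HTMLParser):
    """Collect the content of <meta name=...> and <meta property=...> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("name") or attrs.get("property")
        content = attrs.get("content")
        if key and content:
            # keep the first value seen for each key
            self.meta.setdefault(key.lower(), content.strip())


def guess_abstract(html_text):
    """Return the first candidate meta content found, or None."""
    collector = MetaCollector()
    collector.feed(html_text)
    for name in CANDIDATE_NAMES:
        if name in collector.meta:
            return collector.meta[name]
    return None


# Made-up page, only to show the mechanics
sample = """<html><head>
<meta name="description" content="Short page description.">
<meta name="citation_abstract" content="We study crop yield prediction.">
</head><body><div>menu</div></body></html>"""

print(guess_abstract(sample))
```

Pages blocked by robot protection would still return nothing useful here, so this only addresses the "too much unwanted information" half of the problem.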

14/10: We had a call with B Frank, who helped us bypass the security checks we're facing with Google Scholar and the websites it links to.

In [45]:
import urllib3
import json

In [48]:
http = urllib3.PoolManager()
S = 0  # number of pages that do not look like HTML
for i in range(len(urls)):
    r = http.request('GET', urls[i])
    # transform byte information to string info
    string = r.data.decode("utf-8")
    if "doctype html" not in string.lower():
        print(string[:15].lower())
        print(i)
        S += 1
print(S)


# It seems that with this method we get access to the HTML, whereas previously we didn't.
# We still have one or two problems (apparently with captchas), but they represent a small
# percentage of the articles.
# You can check by uncommenting the lines below.
"""
r = http.request('GET', urls[4])
 # transform byte information to string info
string = r.data.decode("utf-8")
print(string)
"""

wiley online li
3
1
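
A note on the check above: `r.data.decode("utf-8")` raises `UnicodeDecodeError` if a page is not valid UTF-8, which would stop the loop. A minimal sketch of a more tolerant version of the same test, working on bytes that have already been fetched; `decode_page` and `non_html_indices` are hypothetical names and the three payloads are hand-written examples.

```python
def decode_page(raw):
    """Decode bytes as UTF-8, replacing undecodable bytes instead of raising."""
    return raw.decode("utf-8", errors="replace")


def non_html_indices(pages):
    """Return the positions of payloads that do not contain an HTML doctype."""
    return [
        i for i, raw in enumerate(pages)
        if "doctype html" not in decode_page(raw).lower()
    ]


# Hand-written payloads standing in for r.data
pages = [
    b"<!DOCTYPE html><html></html>",
    b"\xff\xfe not utf-8, and not html",
    b"<!doctype HTML><html></html>",
]

print(non_html_indices(pages))
```

Returning the indices instead of printing them inside the loop makes it easier to revisit the problematic URLs afterwards.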


In [7]:

r = http.request('GET', urls[4])
print(urls[4])
 # transform byte information to string info
    
string = r.data.decode("utf-8")

#print(string)
byte_page = r.data



parser = "html.parser"
soup = BeautifulSoup(byte_page, parser)
#print(soup)
results = soup("h2", class_="widget-header header-none header-compact-vertical")
#print(results)
"""
for result in results:
    print(result.get_text())

        results = soup("h3", class_="gs_rt")


        for result in results:
            for a in result.find_all('a', href=True):

                urls.append(a['href'])
    return urls
"""

NameError: name 'http' is not defined

In this example we have a problem: parts of the HTML come back empty. Is this because the website is dynamic? We might need to use Selenium.

In [2]:
import selenium


In [3]:
from selenium import webdriver

# start web browser
browser=webdriver.Firefox()

# get source code
browser.get('https://www.tandfonline.com/doi/abs/10.1080/01431161.2016.1278312')
html = browser.page_source

print(html)

# close web browser
browser.close()


WebDriverException: Message: 'geckodriver' executable needs to be in PATH. 


To fix this, see https://stackoverflow.com/questions/40208051/selenium-using-python-geckodriver-executable-needs-to-be-in-path and https://pythonbasics.org/selenium-get-html/
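
Since the error says the `geckodriver` executable has to be on `PATH`, a quick check before starting the browser can save a failed launch. A minimal sketch using only the standard library; `find_driver` is a hypothetical helper name.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("geckodriver not found on PATH: install it or add its folder to PATH")
else:
    print("geckodriver found at", path)
```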
