Authors: Louis Ravillon, Martin Piana
Date: October 2020

In [1]:
# packages

import requests
from bs4 import BeautifulSoup


In [2]:
x=requests.get("https://bib.cnrs.fr/home/").content
print(x)

b'<html>\n    <head>\n        <title>BibCnrs</title>\n        <meta charset="utf-8"/>\n        <meta name="description" content="Portail BibCnrs IST">\n        <meta name="apple-mobile-web-app-capable" content="yes" />\n        <meta name="robots" content="index follow">\n        <meta name="viewport" content="width=device-width, initial-scale=1.0, shrink-to-fit=no">\n        <link href="https://fonts.googleapis.com/css?family=DM+Sans&display=swap" rel="stylesheet">            <link rel="stylesheet" media="all" href="https://bib.cnrs.fr/wp-content/themes/portail/style.css?v=2.2" type="text/css"><link rel=\'dns-prefetch\' href=\'//bib.cnrs.fr\' />\n<link rel=\'dns-prefetch\' href=\'//s.w.org\' />\n<link rel="alternate" type="application/rss+xml" title="BibCnrs &raquo; Home Comments Feed" href="https://bib.cnrs.fr/home/feed/" />\n\t\t<script type="text/javascript">\n\t\t\twindow._wpemojiSettings = {"baseUrl":"https:\\/\\/s.w.org\\/images\\/core\\/emoji\\/13.0.0\\/72x72\\/","ext":".png","

In [3]:
parser = "html.parser"
soup = BeautifulSoup(x, parser)
print(soup)

<html>
<head>
<title>BibCnrs</title>
<meta charset="utf-8"/>
<meta content="Portail BibCnrs IST" name="description"/>
<meta content="yes" name="apple-mobile-web-app-capable">
<meta content="index follow" name="robots"/>
<meta content="width=device-width, initial-scale=1.0, shrink-to-fit=no" name="viewport"/>
<link href="https://fonts.googleapis.com/css?family=DM+Sans&amp;display=swap" rel="stylesheet"/> <link href="https://bib.cnrs.fr/wp-content/themes/portail/style.css?v=2.2" media="all" rel="stylesheet" type="text/css"/><link href="//bib.cnrs.fr" rel="dns-prefetch">
<link href="//s.w.org" rel="dns-prefetch">
<link href="https://bib.cnrs.fr/home/feed/" rel="alternate" title="BibCnrs » Home Comments Feed" type="application/rss+xml"/>
<script type="text/javascript">
			window._wpemojiSettings = {"baseUrl":"https:\/\/s.w.org\/images\/core\/emoji\/13.0.0\/72x72\/","ext":".png","svgUrl":"https:\/\/s.w.org\/images\/core\/emoji\/13.0.0\/svg\/","svgExt":".svg","source":{"concatemoji":"https:\

## Parsing the CNRS database directly looks a bit complicated

For now we'll just try parsing the arXiv website and see how it goes.

In [4]:
arxiv_url = "https://arxiv.org/search/?query={}&searchtype=title&abstracts=show&order=-announced_date_first&size=50&start={}"


In [8]:
# Choose the query we want.
# Careful: the website presents results 50 at a time, so one request only returns 50 results.
# To get more, iterate over the `start` offset in steps of 50.


def get_pdf_urls(website_url, query, total_size):
    """
    ARGS:
         - website_url: search URL template with two placeholders (query, start offset)
         - query: the search terms, with spaces written as "+"
         - total_size: the total number of articles listed for the query; we have to check
           the web page beforehand to know what it amounts to
    OUTPUT: a list of urls linking to the articles
    """
    urls = []
    for start in range(0, total_size, 50):
        x = requests.get(website_url.format(query, start)).content
        parser = "html.parser"
        soup = BeautifulSoup(x, parser)

        # each search result keeps its links in a <p class="list-title is-inline-block">
        results = soup("p", class_="list-title is-inline-block")

        for result in results:
            for a in result.find_all('a', href=True):
                # keep only the link to the pdf
                if "pdf" in a['href']:
                    urls.append(a['href'] + ".pdf")
    return urls

urls = get_pdf_urls(arxiv_url, "machine+learning", 50)
print(urls)
print(type(urls[0]))

['https://arxiv.org/pdf/2010.02866.pdf', 'https://arxiv.org/pdf/2010.02749.pdf', 'https://arxiv.org/pdf/2010.02715.pdf', 'https://arxiv.org/pdf/2010.02670.pdf', 'https://arxiv.org/pdf/2010.02576.pdf', 'https://arxiv.org/pdf/2010.02523.pdf', 'https://arxiv.org/pdf/2010.02374.pdf', 'https://arxiv.org/pdf/2010.02317.pdf', 'https://arxiv.org/pdf/2010.02213.pdf', 'https://arxiv.org/pdf/2010.02174.pdf', 'https://arxiv.org/pdf/2010.02087.pdf', 'https://arxiv.org/pdf/2010.02086.pdf', 'https://arxiv.org/pdf/2010.02011.pdf', 'https://arxiv.org/pdf/2010.01996.pdf', 'https://arxiv.org/pdf/2010.01976.pdf', 'https://arxiv.org/pdf/2010.01968.pdf', 'https://arxiv.org/pdf/2010.01711.pdf', 'https://arxiv.org/pdf/2010.01709.pdf', 'https://arxiv.org/pdf/2010.01668.pdf', 'https://arxiv.org/pdf/2010.01582.pdf', 'https://arxiv.org/pdf/2010.01431.pdf', 'https://arxiv.org/pdf/2010.01213.pdf', 'https://arxiv.org/pdf/2010.01163.pdf', 'https://arxiv.org/pdf/2010.01149.pdf', 'https://arxiv.org/pdf/2010.01030.pdf',
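
The 50-by-50 pagination mentioned in the comments can be sketched without touching the network. This is only an illustration: `ARXIV_SEARCH` and `page_urls` are hypothetical names, and `quote_plus` is an assumed convenience so that the query can be written with spaces.

```python
from urllib.parse import quote_plus

# Same template as arxiv_url above: first placeholder is the query, second the start offset.
ARXIV_SEARCH = (
    "https://arxiv.org/search/?query={}&searchtype=title&abstracts=show"
    "&order=-announced_date_first&size=50&start={}"
)


def page_urls(query, total_size, page_size=50):
    """Return one search URL per results page, enough to cover total_size results."""
    encoded = quote_plus(query)  # "machine learning" becomes "machine+learning"
    return [ARXIV_SEARCH.format(encoded, start)
            for start in range(0, total_size, page_size)]


pages = page_urls("machine learning", 120)
print(len(pages))  # 120 results at 50 per page need 3 pages (start = 0, 50, 100)
```

Each URL in `pages` could then be fetched and parsed exactly as in `get_pdf_urls`.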

### Getting the text

Now that we have the pdf URLs we have two options: either use Beautiful Soup, treating the page as an HTML document and getting the text from there, or download the whole pdf and use pdfplumber or something similar.

In [2]:
# We'll try with beautiful soup

x = requests.get("https://arxiv.org/pdf/2010.02866.pdf").content
parser = "html.parser"
soup = BeautifulSoup(x, parser)
print(soup)
#results = soup.find_all("span", string=True, limit=5)
#print(results)




Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
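
The output above is consistent with the response being raw PDF bytes rather than HTML, which would explain why the HTML parser has nothing useful to work with. A minimal sketch of a check that could be run before choosing a parser, assuming only that PDF files begin with the `%PDF-` marker; `looks_like_pdf` is a hypothetical helper name.

```python
def looks_like_pdf(payload: bytes) -> bool:
    """Return True when the payload starts with the PDF header marker."""
    # Leading whitespace is tolerated; the marker itself is "%PDF-".
    return payload.lstrip().startswith(b"%PDF-")


# Illustrative payloads, not real responses.
print(looks_like_pdf(b"%PDF-1.5\n%\xd0\xd4\xc5\xd8"))   # True
print(looks_like_pdf(b"<html><head></head></html>"))    # False
```

With `requests`, the same test could be applied to `response.content` before handing it to Beautiful Soup.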



### Problems with scraping

We'll try downloading the pdf first and then extracting the text from it.

In [5]:
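import requests

destination = "/home/martin/Desktop/ImpAgt/test.pdf"

chunk_size = 4000  # number of bytes written per iteration

url = "https://arxiv.org/pdf/2010.02866.pdf"
# stream=True avoids loading the whole pdf into memory at once
r = requests.get(url, stream=True)
r.raise_for_status()  # stop here if the download failed

with open(destination, 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)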
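
To download every pdf in `urls` rather than a single one, each file needs its own destination. A small sketch of how a local path could be derived from the URL; `local_pdf_path` and the `papers` directory are hypothetical names chosen for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse


def local_pdf_path(url, directory):
    """Build a destination path from the last segment of the URL path."""
    name = Path(urlparse(url).path).name  # "/pdf/2010.02866.pdf" gives "2010.02866.pdf"
    if not name.lower().endswith(".pdf"):
        name += ".pdf"
    return Path(directory) / name


target = local_pdf_path("https://arxiv.org/pdf/2010.02866.pdf", "papers")
print(target.name)  # 2010.02866.pdf
```

The resulting path could replace the hard-coded `destination` inside a loop over `urls`.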
        


Notes:

 - We are somewhat stuck scraping the CNRS research database. We think it is because the site is dynamic, so we need to dig a little deeper.
 - We moved to a simpler site: arXiv.
 - We cannot retrieve the documents directly, so we take a detour: we download each pdf in full and use a package to extract the text.

## Scraping a bigger database: Google Scholar

In [33]:
scholar_url = "https://scholar.google.fr/scholar?start={}&q={}&hl=fr&as_sdt=0,5&as_ylo={}&as_yhi={}"

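def get_pdf_urls(website_url, query, start_date, end_date):
    """
    ARGS:
         - website_url: search URL template (start offset, query, first year, last year)
         - query: what we want in the search bar, with spaces written as "+"
         - start_date, end_date: first and last publication years to include
    OUTPUT: a list of urls linking to the articles
    """
    urls = []
    # step through the first 50 results, 10 per page
    for start in range(0, 50, 10):
        x = requests.get(website_url.format(start, query, start_date, end_date)).content
        parser = "html.parser"
        soup = BeautifulSoup(x, parser)

        # each result title is an <h3 class="gs_rt"> wrapping the link to the article
        results = soup("h3", class_="gs_rt")

        for result in results:
            for a in result.find_all('a', href=True):
                urls.append(a['href'])
    return urls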

urls = get_pdf_urls(scholar_url, "machine+learning+agriculture", 2015, 2020)
print(urls)
print(len(urls))
for url in urls:
    if "sciencedirect" in url:
        print("daumn")
        

['https://www.mdpi.com/327494', 'https://www.sciencedirect.com/science/article/pii/S0168169917314710', 'https://www.sciencedirect.com/science/article/pii/S0168169918304289', 'https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1002/2016WR019933', 'https://www.tandfonline.com/doi/abs/10.1080/01431161.2016.1278312', 'https://academic.oup.com/erae/article-abstract/47/3/849/5552525', 'https://www.mdpi.com/2072-4292/8/6/514', 'https://www.sciencedirect.com/science/article/pii/S0305054820300435', 'https://ieeexplore.ieee.org/abstract/document/7225403/', 'https://www.academia.edu/download/57549164/IRJET-V5I9158.pdf', 'https://www.sciencedirect.com/science/article/pii/S016816991630117X', 'https://link.springer.com/article/10.1007/s11119-014-9372-7', 'https://link.springer.com/article/10.1007/s11356-017-0496-y', 'https://www.sciencedirect.com/science/article/pii/S0168192315007467', 'https://ieeexplore.ieee.org/abstract/document/7838138/', 'https://www.sciencedirect.com/science/article/pii/S01681
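
Since each publisher needs its own handling, it can help to count how many results come from each host instead of testing one substring at a time. A minimal sketch using a few of the URLs printed above; `count_hosts` and `sample_urls` are hypothetical names.

```python
from collections import Counter
from urllib.parse import urlparse

# A handful of the URLs from the output above.
sample_urls = [
    "https://www.mdpi.com/327494",
    "https://www.sciencedirect.com/science/article/pii/S0168169917314710",
    "https://www.sciencedirect.com/science/article/pii/S0168169918304289",
    "https://link.springer.com/article/10.1007/s11119-014-9372-7",
]


def count_hosts(urls):
    """Count how many URLs point at each host name."""
    return Counter(urlparse(url).netloc for url in urls)


hosts = count_hosts(sample_urls)
print(hosts.most_common())
```

Running `count_hosts(urls)` on the full list would show which publishers are worth writing a dedicated parser for.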

In [36]:
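def get_abstracts(url):
    """Print the text of every <div> on the page (first attempt at locating the abstract)."""

    x = requests.get(url).content
    parser = "html.parser"
    soup = BeautifulSoup(x, parser)
    # narrower alternative (needs `import re`):
    #results = soup("div", class_=re.compile("abstract"))
    results = soup("div")
    print(url)
    #print(results)
    for result in results:
        print(result.get_text())

    return 0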

get_abstracts(urls[4])

https://www.tandfonline.com/doi/abs/10.1080/01431161.2016.1278312

Skip to Main Content

Access provided by Taylor & Francis Online

Log in
 | 
Register

Cart

Home
All Journals
International Journal of Remote Sensing
List of Issues
Volume 38, Issue 7
A machine learning approach for agricult ....

Search in:
This Journal
Anywhere

Advanced search

Journal
International Journal of Remote Sensing
Volume 38, 2017 - Issue 7: European remote sensing: progress, challenges, and opportunities

Submit an article
Journal homepage

3,566
Views
19
CrossRef citations to date
Altmetric

Listen

Articles
A machine learning approach for agricultural parcel delineation through agglomerative segmentationA. García-Pedrero Center for Biomed

0
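
Text returned by `get_text()` on a full page tends to carry long runs of blank lines and padded navigation labels. A minimal cleaning sketch, assuming we only want the non-empty lines; `clean_text` is a hypothetical helper and the sample string is made up for illustration.

```python
def clean_text(raw):
    """Strip each line and drop the ones that are empty after stripping."""
    stripped = (line.strip() for line in raw.splitlines())
    return "\n".join(line for line in stripped if line)


raw_sample = "Skip to Main Content\n\n\n\n  Log in \n\n | \n\nRegister\n\n\n"
print(clean_text(raw_sample))
```

This does not remove the unwanted navigation text, it only makes the output short enough to inspect.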

It's a real mess to get the abstracts: half of the websites don't let you in (protection against robots), and the other half are coded so inconsistently that you get a lot of information you don't want. The most pertinent approach might still be to download the available pdfs and find a way to extract the abstracts from them.
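
One possible way to pull the abstract out of text already extracted from a pdf (for example with pdfplumber) is to take whatever sits between an "Abstract" heading and the "Introduction" heading. This is a heuristic sketch under that assumption, not a tested method: `ABSTRACT_RE`, `find_abstract` and the sample text are hypothetical, and papers laid out differently will not match.

```python
import re

# Text after a line starting with "Abstract", up to a line starting with
# "Introduction" (optionally numbered), or up to the end of the text.
ABSTRACT_RE = re.compile(
    r"^\s*abstract\b[\s:.]*(?P<body>.+?)"
    r"(?=^\s*(?:\d+\.?\s+|[IVX]+\.?\s+)?introduction\b|\Z)",
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)


def find_abstract(text):
    """Return the abstract as a single line, or None when no heading is found."""
    match = ABSTRACT_RE.search(text)
    if match is None:
        return None
    return " ".join(match.group("body").split())


sample = (
    "A Title\nSome Author\n\n"
    "Abstract\nWe study crop fields\nwith segmentation.\n\n"
    "1 Introduction\nFields are everywhere.\n"
)
print(find_abstract(sample))  # We study crop fields with segmentation.
```

Working from the pdf text sidesteps the publisher pages entirely, at the cost of depending on how each paper is formatted.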