Authors: Louis Ravillon, Martin Piana
Date: October 2020

In [1]:
# packages

import requests
from bs4 import BeautifulSoup


## Parsing the CNRS database directly looks a bit complicated

For now we'll just try parsing the arXiv website and see how it goes.

In [4]:
arxiv_url = "https://arxiv.org/search/?query={}&searchtype=title&abstracts=show&order=-announced_date_first&size=50&start={}"


In [8]:
# choose the query we want
# careful: with total_size=50 we only fetch the first 50 results (the website serves them 50 at a time)
# to get more, raise total_size: the loop advances the `start` offset by 50 for each page


            
def get_pdf_urls(website_url, query, total_size):
    """
    ARGS:
         - website_url: search URL template with two placeholders (query, start offset)
         - query: search terms joined with "+"
         - total_size: total number of articles to cover; check the search page beforehand
           to know how many results exist
    OUTPUT: a list of urls linking to the article pdfs
    """
    urls = []
    # the site serves 50 results per page, so step the start offset by 50
    for start in range(0, total_size, 50):
        x = requests.get(website_url.format(query, start)).content
        parser = "html.parser"
        soup = BeautifulSoup(x, parser)

        results = soup("p", class_="list-title is-inline-block")


        for result in results:
            for a in result.find_all('a', href=True):
                if "pdf" in a['href']:
                    urls.append(a['href']+".pdf")
    return urls

urls = get_pdf_urls(arxiv_url, "machine+learning", 50)
print(urls)
print(type(urls[0]))

['https://arxiv.org/pdf/2010.02866.pdf', 'https://arxiv.org/pdf/2010.02749.pdf', 'https://arxiv.org/pdf/2010.02715.pdf', 'https://arxiv.org/pdf/2010.02670.pdf', 'https://arxiv.org/pdf/2010.02576.pdf', 'https://arxiv.org/pdf/2010.02523.pdf', 'https://arxiv.org/pdf/2010.02374.pdf', 'https://arxiv.org/pdf/2010.02317.pdf', 'https://arxiv.org/pdf/2010.02213.pdf', 'https://arxiv.org/pdf/2010.02174.pdf', 'https://arxiv.org/pdf/2010.02087.pdf', 'https://arxiv.org/pdf/2010.02086.pdf', 'https://arxiv.org/pdf/2010.02011.pdf', 'https://arxiv.org/pdf/2010.01996.pdf', 'https://arxiv.org/pdf/2010.01976.pdf', 'https://arxiv.org/pdf/2010.01968.pdf', 'https://arxiv.org/pdf/2010.01711.pdf', 'https://arxiv.org/pdf/2010.01709.pdf', 'https://arxiv.org/pdf/2010.01668.pdf', 'https://arxiv.org/pdf/2010.01582.pdf', 'https://arxiv.org/pdf/2010.01431.pdf', 'https://arxiv.org/pdf/2010.01213.pdf', 'https://arxiv.org/pdf/2010.01163.pdf', 'https://arxiv.org/pdf/2010.01149.pdf', 'https://arxiv.org/pdf/2010.01030.pdf',
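The pagination described in the comments of the cell above (results come 50 at a time, and the `start` parameter selects the page) can be checked without any network access. The sketch below only builds the page URLs; `page_offsets` and `build_page_urls` are hypothetical helper names, and the template is the same `arxiv_url` used above.

```python
from typing import Iterator, List

# Same template as arxiv_url above: first placeholder is the query,
# second is the start offset of the page.
ARXIV_URL = (
    "https://arxiv.org/search/?query={}&searchtype=title&abstracts=show"
    "&order=-announced_date_first&size=50&start={}"
)


def page_offsets(total_size: int, page_size: int = 50) -> Iterator[int]:
    """Yield the start offset of each result page: 0, 50, 100, ..."""
    return iter(range(0, total_size, page_size))


def build_page_urls(template: str, query: str, total_size: int) -> List[str]:
    """Return one search URL per page needed to cover total_size results."""
    return [template.format(query, start) for start in page_offsets(total_size)]


page_urls = build_page_urls(ARXIV_URL, "machine+learning", 120)
for page_url in page_urls:
    print(page_url)
```

Asking for 120 results produces three URLs, with start offsets 0, 50 and 100, which is what the loop in `get_pdf_urls` requests one after the other.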

### Getting the text

Now that we have the pdf urls we have two options: either use Beautiful Soup, treating the web page as an html document and getting the text from there, or download the pdf completely and use pdfplumber or something similar.
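One way to decide between the two options is to look at the first bytes of the response: a PDF file starts with the header `%PDF-`, while an html page does not. The sketch below is only an illustration of that check; `is_pdf` and `choose_reader` are hypothetical names, and the sample payloads are made up.

```python
def is_pdf(content: bytes) -> bool:
    """Return True when the payload starts with the PDF file header."""
    return content.startswith(b"%PDF-")


def choose_reader(content: bytes) -> str:
    """Pick a processing route for a downloaded payload.

    Returns "pdf" for binary PDF data (to be saved and read with a PDF
    library) and "html" for everything else (to be parsed as markup).
    """
    return "pdf" if is_pdf(content) else "html"


# Made-up payloads standing in for requests.get(url).content
print(choose_reader(b"%PDF-1.5\n%\xd0\xd4\xc5\xd8"))
print(choose_reader(b"<!DOCTYPE html><html><body>hello</body></html>"))
```

If the payload is a PDF, feeding it to an html parser only yields undecodable binary data, so the download route is the one to take.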

In [2]:
# We'll try with beautiful soup

x = requests.get("https://arxiv.org/pdf/2010.02866.pdf").content
parser = "html.parser"
soup = BeautifulSoup(x, parser)
print(soup)
#results = soup.find_all("span", string = True, limit = 5)
#print(results)




Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



### Problems with scraping

We'll try downloading it here first and then getting the text out

In [5]:
destination = "/home/martin/Desktop/ImpAgt/test.pdf"

chunk_size = 4000

import requests

url = "https://arxiv.org/pdf/2010.02866.pdf"
r = requests.get(url, stream=True)

with open(destination, 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)
        


Notes:

 - We are somewhat stuck scraping the CNRS research database. We think this is because the site is dynamic, so we need to dig a little deeper.
 - We moved to a simpler site: arXiv.
 - We cannot retrieve the documents' text directly, so we take a detour: we download each pdf in full and use a package to extract the text.

## Scraping a bigger database: Google Scholar

In [3]:
scholar_url = "https://scholar.google.fr/scholar?start={}&q={}&hl=fr&as_sdt=0,5&as_ylo={}&as_yhi={}"

def get_pdf_urls(website_url, query, start_date, end_date):
    """
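    ARGS: query: what we want to type in the search bar
          start_date, end_date: first and last publication year to include
    OUTPUT: a list of urls linking to the articles
    """
    urls = []
    # the loop steps the start offset by 10, one results page at a time (5 pages here)
    for start in range(0, 50, 10):

        x = requests.get(website_url.format(start, query, start_date, end_date)).content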
        parser = "html.parser"
        soup = BeautifulSoup(x, parser)

        results = soup("h3", class_="gs_rt")


        for result in results:
            for a in result.find_all('a', href=True):

                urls.append(a['href'])
    return urls

urls = get_pdf_urls(scholar_url, "machine+learning+agriculture", 2015, 2020)
print(urls)
print(len(urls))

        

['https://www.mdpi.com/327494', 'https://www.sciencedirect.com/science/article/pii/S0168169917314710', 'https://www.sciencedirect.com/science/article/pii/S0168169918304289', 'https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1002/2016WR019933', 'https://academic.oup.com/erae/article-abstract/47/3/849/5552525', 'https://www.tandfonline.com/doi/abs/10.1080/01431161.2016.1278312', 'https://www.mdpi.com/2072-4292/8/6/514', 'https://ieeexplore.ieee.org/abstract/document/8534558/', 'https://ieeexplore.ieee.org/abstract/document/7225403/', 'https://www.sciencedirect.com/science/article/pii/S016816991630117X', 'https://link.springer.com/article/10.1007/s11119-014-9372-7', 'https://www.sciencedirect.com/science/article/pii/S0168192315007467', 'https://ieeexplore.ieee.org/abstract/document/7838138/', 'https://link.springer.com/chapter/10.1007/978-981-13-7403-6_50', 'https://www.sciencedirect.com/science/article/pii/S0168169917308803', 'https://www.nature.com/articles/544S21a', 'https://www.scie
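The query string in `scholar_url` is filled in by hand, with the spaces of the query already replaced by `+`. As an alternative sketch, the standard library can do the encoding. The parameter names (`start`, `q`, `hl`, `as_sdt`, `as_ylo`, `as_yhi`) are copied from `scholar_url` above; `build_scholar_url` is a hypothetical helper, not something the notebook defines.

```python
from urllib.parse import urlencode

SCHOLAR_BASE = "https://scholar.google.fr/scholar"


def build_scholar_url(query: str, start: int, year_low: int, year_high: int) -> str:
    """Build a Google Scholar search URL from plain-text arguments.

    urlencode escapes the values, so the query can contain ordinary
    spaces instead of hand-written "+" signs.
    """
    params = {
        "start": start,      # offset of the first result on the page
        "q": query,          # search terms
        "hl": "fr",          # interface language, as in scholar_url
        "as_sdt": "0,5",     # copied unchanged from scholar_url
        "as_ylo": year_low,  # first publication year
        "as_yhi": year_high, # last publication year
    }
    return SCHOLAR_BASE + "?" + urlencode(params)


url = build_scholar_url("machine learning agriculture", 10, 2015, 2020)
print(url)
```

The comma in `as_sdt` comes out percent-encoded (`0%2C5`), which is an equivalent spelling of the same value.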

In [8]:
def get_abstracts(url):

    x = requests.get(url).content
    parser = "html.parser"
    print(type(x))
    soup = BeautifulSoup(x, parser)
    print(type(soup))
    #results = soup("div", class_=re.compile("abstract"))
    results = soup("div")
    print(url)
    print(type(results))
    for result in results:
        print(type(result))
        #print(result.get_text())

    return 0

get_abstracts(urls[49])

<class 'bytes'>
<class 'bs4.BeautifulSoup'>
https://www.aeaweb.org/articles?id=10.1257/aer.p20171038
<class 'bs4.element.ResultSet'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>


0

It's a real mess to get the abstracts: half of the websites block access (robot protection), and the other half are coded so inconsistently that you get a lot of information you don't want. The most relevant approach might still be to download the available pdfs and find a way to extract the abstracts from them.

14/10: We had a call with B Frank, who helped us bypass the security checks we're facing with Google Scholar and the websites it links to.

In [4]:
import urllib3
import json

In [5]:
http = urllib3.PoolManager()
"""
S=0
for i in range (len(urls)):
    r = http.request('GET', urls[i])
     # transform byte information to string info
    string = r.data.decode("utf-8")
    if "doctype html" not in str.lower(string):
        print(str.lower(string[:15]))
        print(i)
        S+=1
print(S)
"""

# it seems that with this method we get access to the html, whereas previously we didn't
# we still have one or two problems (apparently with captchas), but they affect only a small percentage
# of the articles
# you can check by uncommenting the lines below
"""
r = http.request('GET', urls[4])
 # transform byte information to string info
string = r.data.decode("utf-8")
print(string)
"""

'\nr = http.request(\'GET\', urls[4])\n # transform byte information to string info\nstring = r.data.decode("utf-8")\nprint(string)\n'
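The first commented-out block in the cell above counts the responses whose text does not contain "doctype html". The same check can be written as a small pure function and tried on plain strings; `looks_like_html` and `non_html_indices` are hypothetical names, and the sample pages are made up.

```python
from typing import Iterable, List


def looks_like_html(text: str) -> bool:
    """Same test as in the cell above: the page declares an html doctype."""
    return "doctype html" in text.lower()


def non_html_indices(pages: Iterable[str]) -> List[int]:
    """Return the positions of the pages that do not look like html."""
    return [i for i, page in enumerate(pages) if not looks_like_html(page)]


# Made-up decoded responses standing in for r.data.decode("utf-8")
sample_pages = [
    "<!DOCTYPE html><html><body>article</body></html>",
    "%PDF-1.4 binary data",
    "<!doctype HTML><html><body>other article</body></html>",
]
bad = non_html_indices(sample_pages)
print(bad, len(bad))
```

Returning the indices rather than only a count makes it easy to look up which urls failed afterwards.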

In [15]:
# By running this cell and uncommenting "print(soup)" you'll see that, for some
# reason, the abstract isn't present in the html

r = http.request('GET', urls[4])
print(urls[4])
 # transform byte information to string info
    
string = r.data.decode("utf-8")

#print(string)
byte_page = r.data



parser = "html.parser"
soup = BeautifulSoup(byte_page, parser)
#print(soup)
results = soup("div", class_="abstractSection abstractInFull")
#print(results)
"""
for result in results:
    print(result.get_text())

        results = soup("h3", class_="gs_rt")


        for result in results:
            for a in result.find_all('a', href=True):

                urls.append(a['href'])
    return urls
"""

https://academic.oup.com/erae/article-abstract/47/3/849/5552525


'\nfor result in results:\n    print(result.get_text())\n\n        results = soup("h3", class_="gs_rt")\n\n\n        for result in results:\n            for a in result.find_all(\'a\', href=True):\n\n                urls.append(a[\'href\'])\n    return urls\n'

### Checkpoint: 

In this example we have a problem: parts of the html are missing. Is this because the website is dynamic? We might need to use Selenium.

Installing Selenium with the right version of geckodriver was a mess. I recommend following these steps to get it working: https://tecadmin.net/setup-selenium-with-firefox-on-ubuntu/
(up to and including step 4)

In [2]:
import selenium
import os

In [3]:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.firefox.options import Options


# Note: you'll have to adapt the paths here. I installed geckodriver on my desktop; you'll have to find where yours is.
# to find where Firefox is on Linux, run "which firefox" in a terminal


binary = FirefoxBinary('/usr/bin/firefox')
binary = r'/usr/bin/firefox'
options = Options()
options.set_headless(headless=True)
options.binary = binary

cap = DesiredCapabilities().FIREFOX
cap["marionette"] = True #optional
browser = webdriver.Firefox(firefox_options=options, capabilities=cap,executable_path='/home/martin/Desktop/geckodriver-v0.25.0-linux64/geckodriver')

# get source code
browser.get('https://www.sciencedirect.com/science/article/pii/S0168169917314710')
html = browser.page_source

print(type(html))
#print(html)
# close web browser
browser.close()



  


<class 'str'>


With this new method everything seems to work: the abstract, at least, is present, as you can see if you uncomment "print(html)" in the cell above.

In [9]:
scholar_url = "https://scholar.google.se/scholar?start={}&q={}&hl=fr&as_sdt=0,5&as_ylo={}&as_yhi={}"

import time
from fake_useragent import UserAgent

def get_pdf_urls(website_url, query, start_date, end_date):
    """
    ARGS: query: what we want to type in the search bar
         
    OUTPUT: a list of urls linking to the articles
    """
    
    
    #https://stackoverflow.com/questions/58873022/how-to-make-selenium-script-undetectable-using-geckodriver-and-firefox-through-p
    profile = webdriver.FirefoxProfile('/home/martin/.mozilla/firefox/9ncorkym.ImpAgt-user')

    PROXY_HOST = "12.12.12.123"
    PROXY_PORT = "1234"
    profile.set_preference("network.proxy.type", 1)
    profile.set_preference("network.proxy.http", PROXY_HOST)
    profile.set_preference("network.proxy.http_port", int(PROXY_PORT))
    profile.set_preference("dom.webdriver.enabled", False)
    profile.set_preference('useAutomationExtension', False)
    profile.update_preferences()
    
    
    
    binary = FirefoxBinary('/usr/bin/firefox')
    binary = r'/usr/bin/firefox'
    options = Options()
    options.set_headless(headless=True)
    options.binary = binary
    ua = UserAgent()
    userAgent = ua.random
    print("useragent: ", userAgent)
    options.add_argument(f'user-agent={userAgent}')

    cap = DesiredCapabilities().FIREFOX
    cap["marionette"] = True #optional
    browser = webdriver.Firefox(firefox_profile=profile, firefox_options=options, capabilities=cap,executable_path='/home/martin/Desktop/geckodriver-v0.25.0-linux64/geckodriver')
    
    
    
    urls = []
    for i in range (0, 10, 10):
        size = i
        
        
        """
        
        binary = r'/usr/bin/firefox'
        options = Options()
        options.set_headless(headless=True)
        options.binary = binary
        ua = UserAgent()
        userAgent = ua.random
        print("useragent: ", userAgent)
        options.add_argument(f'user-agent={userAgent}')

        cap = DesiredCapabilities().FIREFOX
        cap["marionette"] = True #optional
        browser = webdriver.Firefox(firefox_options=options, capabilities=cap,executable_path='/home/martin/Desktop/geckodriver-v0.25.0-linux64/geckodriver')

        """
        
        
        
        browser.get(website_url.format(size, query, start_date, end_date))
        time.sleep(1)
        scholar_html = (browser.page_source)
        parser = "html.parser"
        soup = BeautifulSoup(scholar_html, parser)
        
        results = soup("h3", class_="gs_rt")

        for result in results:
            for a in result.find_all('a', href=True):

                urls.append(a['href'])
    browser.close()
    return urls

urls = get_pdf_urls(scholar_url, "machine+learning+agriculture", 2015, 2020)
print(urls)
print(len(urls))

print("non")



useragent:  Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0
['https://www.mdpi.com/327494', 'https://www.sciencedirect.com/science/article/pii/S0168169917314710', 'https://www.sciencedirect.com/science/article/pii/S0168169918304289', 'https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1002/2016WR019933', 'https://academic.oup.com/erae/article-abstract/47/3/849/5552525', 'https://www.tandfonline.com/doi/abs/10.1080/01431161.2016.1278312', 'https://www.mdpi.com/2072-4292/8/6/514', 'https://ieeexplore.ieee.org/abstract/document/8534558/', 'https://ieeexplore.ieee.org/abstract/document/7225403/', 'https://www.sciencedirect.com/science/article/pii/S016816991630117X']
10
non


['https://www.mdpi.com/327494', 'https://www.sciencedirect.com/science/article/pii/S0168169917314710', 'https://www.sciencedirect.com/science/article/pii/S0168169918304289', 'https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1002/2016WR019933', 'https://academic.oup.com/erae/article-abstract/47/3/849/5552525', 'https://www.tandfonline.com/doi/abs/10.1080/01431161.2016.1278312', 'https://www.mdpi.com/2072-4292/8/6/514', 'https://ieeexplore.ieee.org/abstract/document/8534558/', 'https://ieeexplore.ieee.org/abstract/document/7225403/', 'https://www.sciencedirect.com/science/article/pii/S016816991630117X']
url https://www.mdpi.com/327494 didnt work
url https://www.sciencedirect.com/science/article/pii/S0168169917314710 didnt work
url https://www.sciencedirect.com/science/article/pii/S0168169918304289 didnt work
url https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1002/2016WR019933 didnt work
url https://academic.oup.com/erae/article-abstract/47/3/849/5552525 didnt work
url https://www.t

Installing Chrome and chromedriver on Ubuntu: https://christopher.su/2015/selenium-chromedriver-ubuntu/

In [12]:
import os
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup 
import requests


scholar_url = "https://scholar.google.se/scholar?start={}&q={}&hl=fr&as_sdt=0,5&as_ylo={}&as_yhi={}"


# Load driver (for Google Chrome)  
chromedriver = "/usr/bin/chromedriver" # chromedriver is the connection between our python code and the browser
os.environ["webdriver.chrome.driver"] = chromedriver

def get_urls(website_url, query, start_date, end_date):
    urls = []
    driver = webdriver.Chrome(chromedriver)
    for i in range (0, 100, 10):
        
        driver.get(website_url.format(i, query, start_date, end_date))
        time.sleep(1)
        scholar_html = driver.page_source
        parser = "html.parser"
        soup = BeautifulSoup(scholar_html, parser)
        
        results = soup("h3", class_="gs_rt")

        for result in results:
            for a in result.find_all('a', href=True):
                
                urls.append(a['href'])
      
    driver.quit() # closing the webdriver 
    return urls

urls = get_urls(scholar_url, "agriculture+machine+learning", 2015, 2020)
print(urls)

['https://www.mdpi.com/327494', 'https://www.sciencedirect.com/science/article/pii/S0168169917314710', 'https://www.sciencedirect.com/science/article/pii/S0168169918304289', 'https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1002/2016WR019933', 'https://academic.oup.com/erae/article-abstract/47/3/849/5552525', 'https://www.tandfonline.com/doi/abs/10.1080/01431161.2016.1278312', 'https://link.springer.com/article/10.1007/s11119-014-9372-7', 'https://www.mdpi.com/2072-4292/8/6/514', 'https://www.sciencedirect.com/science/article/pii/S0168192315007467', 'https://www.sciencedirect.com/science/article/pii/S0168169918306987', 'https://ieeexplore.ieee.org/abstract/document/7225403/', 'https://link.springer.com/article/10.1007/s11119-017-9527-4', 'https://ieeexplore.ieee.org/abstract/document/7325900/', 'https://www.sciencedirect.com/science/article/pii/S2589721719300182', 'https://www.sciencedirect.com/science/article/pii/S0168169917314588', 'https://ieeexplore.ieee.org/abstract/document/783

In [7]:

# get source code
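def get_htmls():
    htmls = []
    print(urls)
    for url in urls:
        driver = None
        try:
            # one fresh browser per url, as before
            driver = webdriver.Chrome(chromedriver)
            driver.get(url)
            htmls.append(driver.page_source)
            print("html extracted from", url)
        except Exception:
            print("url {} didnt work".format(url))
        finally:
            # always release the browser, even when the page failed to load
            if driver is not None:
                driver.quit()
    return htmls

# keep the pages in a global so the next cells can use them
htmls = get_htmls()
#print(htmls[0])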

"""

"""

['https://www.mdpi.com/327494', 'https://www.sciencedirect.com/science/article/pii/S0168169917314710', 'https://www.sciencedirect.com/science/article/pii/S0168169918304289', 'https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1002/2016WR019933', 'https://academic.oup.com/erae/article-abstract/47/3/849/5552525', 'https://www.tandfonline.com/doi/abs/10.1080/01431161.2016.1278312', 'https://link.springer.com/article/10.1007/s11119-014-9372-7', 'https://www.mdpi.com/2072-4292/8/6/514', 'https://www.sciencedirect.com/science/article/pii/S0168192315007467', 'https://www.sciencedirect.com/science/article/pii/S0168169918306987']
html extracted from https://www.mdpi.com/327494
html extracted from https://www.sciencedirect.com/science/article/pii/S0168169917314710
html extracted from https://www.sciencedirect.com/science/article/pii/S0168169918304289
html extracted from https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1002/2016WR019933
html extracted from https://academic.oup.com/erae/article-

'\ni=0\nfor html in htmls:\n    text=\'\'\n    parser = "html.parser"\n    soup = BeautifulSoup(html, parser)\n    results = soup("div")\n    print("\n")\n    print("********")\n    print("\n")\n    for result in results:\n        a=result.get_text()\n        if len(a)>1000 and len(a)<4000 and "\n\n" not in a and "ScienceDirectJournals" not in a and "AccessGet" not in a :\n            text+=\'\n\'+a\n            #print(a)\n            #print("iiiiiiiiiiiiiiiii")\n    if len(text)>0:\n        print("this ones good: ", i)\n        print(text)\n    else:\n        print(i)\n\n    i+=1\n'

In [8]:
def get_abstracts(htmls):
    i=0
    for html in htmls:
        text=''
        parser = "html.parser"
        soup = BeautifulSoup(html, parser)
        results = soup("div")
        print("\n")
        print("********")
        print("\n")
        for result in results:
            a=result.get_text()
            # heuristic: keep abstract-sized blocks of text with no blank lines and no site navigation markers
            if 1000 < len(a) < 4000 and "\n\n" not in a and "ScienceDirectJournals" not in a and "AccessGet" not in a:
                text+='\n'+a
                #print(a)
                #print("iiiiiiiiiiiiiiiii")
        if len(text)>0:
            print("this ones good: ", i)
            print(text)
        else:
            print(i)

        i+=1
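The filter inside `get_abstracts` is a heuristic: a block of text is kept when it is between 1000 and 4000 characters long, contains no blank line, and contains neither of two navigation strings seen on ScienceDirect pages. Pulled out as a pure function it can be tried on plain strings without a browser. This is only a sketch: `looks_like_abstract` is a hypothetical name, and the thresholds and banned markers are the ones from the cell above.

```python
from typing import Sequence


def looks_like_abstract(
    text: str,
    min_len: int = 1000,
    max_len: int = 4000,
    banned: Sequence[str] = ("ScienceDirectJournals", "AccessGet"),
) -> bool:
    """Apply the length, blank-line and marker tests from get_abstracts."""
    if not (min_len < len(text) < max_len):
        return False  # too short for an abstract, or too long (whole page)
    if "\n\n" in text:
        return False  # blank lines suggest several page sections glued together
    return not any(marker in text for marker in banned)


# Made-up candidates standing in for result.get_text()
candidates = [
    "word " * 300,                     # 1500 characters, one block
    "short text",                      # far below the minimum length
    "word " * 300 + "\n\n" + "menu",   # contains a blank line
    "AccessGet " + "word " * 300,      # contains a navigation marker
]
kept = [c for c in candidates if looks_like_abstract(c)]
print(len(kept))
```

Because the thresholds are parameters, they can be tuned per publisher instead of being hard-coded in the loop.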

In [9]:
get_abstracts(htmls)



********


this ones good:  0


Machine learning has emerged with big data technologies and high-performance computing to create new opportunities for data intensive science in the multi-disciplinary agri-technologies domain. In this paper, we present a comprehensive review of research dedicated to applications of machine learning in agricultural production systems. The works analyzed were categorized in (a) crop management, including applications on yield prediction, disease detection, weed detection crop quality, and species recognition; (b) livestock management, including applications on animal welfare and livestock production; (c) water management; and (d) soil management. The filtering and classification of the presented articles demonstrate how agriculture will benefit from machine learning technologies. By applying machine learning to sensor data, farm management systems are evolving into real time artificial intelligence enabled programs that provide rich recommendations and 



********


this ones good:  3


Climate, groundwater extraction, and surface water flows have complex nonlinear relationships with groundwater level in agricultural regions. To better understand the relative importance of each driver and predict groundwater level change, we develop a new ensemble modeling framework based on spectral analysis, machine learning, and uncertainty analysis, as an alternative to complex and computationally expensive physical models. We apply and evaluate this new approach in the context of two aquifer systems supporting agricultural production in the United States: the High Plains aquifer (HPA) and the Mississippi River Valley alluvial aquifer (MRVA). We select input data sets by using a combination of mutual information, genetic algorithms, and lag analysis, and then use the selected data sets in a Multilayer Perceptron network architecture to simulate seasonal groundwater level change. As expected, model results suggest that irrigation demand has the hig



********


this ones good:  9

Computers and Electronics in AgricultureVolume 155, December 2018, Pages 41-49Original papersAn IoT based smart irrigation management system using Machine learning and open source technologiesAuthor links open overlay panelAmarendraGoapabDeepakSharmabA.K.ShuklabC.Rama KrishnaaShow morehttps://doi.org/10.1016/j.compag.2018.09.040Get rights and contentHighlights•IoT based architecture for smart irrigation using field sensors and weather forecast.•Machine-learning based Soil moisture prediction algorithm with higher accuracy.•Smart irrigation scheduling algorithm using predicted soil moisture and rain forecast.AbstractThe scarcity of clean water resources around the globe has generated a need for their optimum utilization. Internet of Things (IoT) solutions, based on the application specific sensors’ data acquisition and intelligent processing, are bridging the gaps between the cyber and physical worlds. IoT based smart irrigation management systems can he