# Dataset Extraction

### Date: January 12, 2020

The goal of this notebook is to experiment with extracting dataset mentions from research papers. The steps to get there are: selecting the relevant papers, downloading them, and extracting the parts of them that mention a dataset.

To perform this task, we are going to use an active learning approach: first, we extract the pages that mention the words "data", "dataset" or "datasets", and save those pages as JSON so they can be hand-labelled as relevant or not. This then provides us with a usable training set that can serve as the basis of a supervised binary classification model.
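The keyword filter described above can be sketched as follows. This is a minimal sketch, not the code used later in the notebook: the helper name `mentions_dataset` and the pattern `DATA_PATTERN` are hypothetical, and the word-boundary match is a stricter variant of a plain substring test (it would skip words such as "database" or "metadata").

```python
import re

# Word-boundary pattern: matches "data", "dataset" and "datasets" in any case,
# but not words that merely contain them, such as "database" or "metadata".
DATA_PATTERN = re.compile(r"\b(?:data|datasets?)\b", re.IGNORECASE)


def mentions_dataset(page_text):
    """Return True if the page text contains one of the target keywords."""
    return DATA_PATTERN.search(page_text) is not None
```

Whether the stricter match is preferable depends on how many false positives the hand-labelling step can tolerate.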

In [1]:
import warnings
warnings.filterwarnings('ignore')
import concurrent.futures
import time
from multiprocessing import Pool
from urllib.request import urlretrieve
import re
import textract
import PyPDF4
import json
import os
os.chdir(os.curdir)
os.chdir("..")
import fitz
from zipfile import ZipFile

### Experiments

Here we experiment with the available open source libraries for parsing PDFs. This is done with a small subset of papers published in Nature.

In [2]:
path_1 = 'pdf_samples/nature1.pdf'
path_2 = 'pdf_samples/nature2.pdf'
path_3 = 'pdf_samples/nature3.pdf'
path_4 = 'pdf_samples/nature4.pdf'
path_5 = 'pdf_samples/nature5.pdf'

##### textract library

In [3]:
#read the content of the pdf as text
text = textract.process(path_1)
#use four or more whitespace characters as the paragraph delimiter to convert the text into a list of paragraphs
prg = re.split(r'\s{4,}', text.decode("utf-8"))

In [4]:
prg[0]

'Climate change now detectable from any single \nday of weather at global scale\n\nSebastian Sippel'

In [5]:
prg[1]

'1,2,3*, Nicolai Meinshausen2, Erich M. Fischer'

In [6]:
prg[2]

'1, Enikő Székely'

Here, the structure of the document is not well preserved: the author list is split into fragments at the affiliation superscripts.

##### fitz library

In [7]:
doc = fitz.open(path_1)
print("number of pages: %i" % doc.pageCount)
print(doc.metadata)

number of pages: 13
{'format': 'PDF 1.4', 'title': 'Climate change now detectable from any single day of weather at global scale', 'author': 'Sebastian Sippel', 'subject': 'Nature Climate Change, doi:10.1038/s41558-019-0666-7', 'keywords': None, 'creator': 'Springer', 'producer': None, 'creationDate': "D:20191220113831+05'30'", 'modDate': "D:20191220113925+05'30'", 'encryption': None}


It can be a useful library for metadata extraction.
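The `creationDate` and `modDate` fields come back as raw PDF date strings such as `D:20191220113831+05'30'`. A minimal sketch of turning them into `datetime` objects is given below. The helper name `parse_pdf_date` is hypothetical, and treating a missing time zone as UTC is an assumption made for this sketch.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional offset such as +05'30' or Z
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})([+\-Z])?(\d{2})?'?(\d{2})?'?"
)


def parse_pdf_date(raw):
    """Parse a full PDF date string; return None if it does not match."""
    match = _PDF_DATE.match(raw)
    if match is None:
        return None
    groups = match.groups()
    year, month, day, hour, minute, second = (int(g) for g in groups[:6])
    sign, tz_hour, tz_minute = groups[6:]
    if sign in (None, "Z"):
        # assumption: a missing offset is treated as UTC
        tz = timezone.utc
    else:
        offset = timedelta(hours=int(tz_hour or 0), minutes=int(tz_minute or 0))
        tz = timezone(offset if sign == "+" else -offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)
```

The sketch only handles strings that include the full date and time, which is the form seen in the metadata above.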

##### pyPDF4 library

In [8]:
#open one of the sample pdfs
doc = open(path_1, mode='rb')

In [9]:
pdf_document = PyPDF4.PdfFileReader(doc)
#getPage is zero-indexed: this reads the fifth page
fifth_page = pdf_document.getPage(4)
fifth_page_ = re.sub(r'\n', ' ', fifth_page.extractText())
fifth_page_

'variations in surface climate. We conclude that physically consistent  forced climate signals can be picked up from the spatial pattern of   daily surface temperature and humidity even if the global mean sig - nal is removed. We assess whether climate change is detected at short timescales  by projecting in˜situ observations, reanalyses output and CMIP5  model simulations onto the fingerprints. This step yields a time  series of the AGMT test statistic for each day in a given year (Fig.  3 )   that is assessed against a proxy of natural climate variability. Here   we use the 2.5thŒ97.5th percentile range of the test statistic distri - bution in CMIP5 models in 1870Œ1950. The proxy is conservative   because of forced early twentieth-century warming 28  and exceeds  the ‚extremely likely™ 95% level in Intergovernmental Panel on   Climate Change terminology 29 . CMIP5 models predict that forced climate change can be  detected at a daily basis from the early 2000s onwards, where the  rang

Among the libraries tested here, this appears to be the best open source option for PDF parsing. As a result, this is the one that we are going to use.

## Pages extraction

Here we extract the pages relevant to our task: those that mention "data". The steps are:
- Extracting the URL of the paper from the dictionary
- Downloading the paper from the URL
- Looking at each page of the paper and extracting those that mention "data"
- Storing them in a JSON file
- Deleting the PDF of the paper

Functions to make:
- get_url: extract url of paper
- download_paper: download the paper from the url
- get_page: get the relevant pages out of a paper
- delete_paper: delete the paper
- parallelize: serialize and parallelize the process of those functions

In [10]:
def get_pages(path):
    '''
    Look at each page of the paper and, if the word "data" occurs on the page,
    store the page in a json file so that it can later be hand-labelled as
    relevant or irrelevant to our task.
    path: the path of the paper we are interested in
    '''
    os.makedirs('pages_selected_bis', exist_ok=True)
    #use only the file name so that directories in path do not break the output path
    name = os.path.basename(path)
    #the with block closes the pdf once all pages have been read
    with open(path, mode='rb') as doc_:
        doc = PyPDF4.PdfFileReader(doc_)
        for i in range(doc.getNumPages()):
            page = re.sub(r'\n', '', doc.getPage(i).extractText())
            #'data' is a substring of 'dataset' and 'datasets', so one test covers all three
            if 'data' in page:
                nme = 'pages_selected_bis/{}_page_{}.json'.format(name, i)
                with open(nme, 'w') as raw:
                    json.dump(page, raw, indent=4, sort_keys=False)
    return
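Once pages have been written to disk, they need to be gathered for hand-labelling. A minimal sketch of that step is shown below, assuming one JSON file per page that contains the page text as a single string. The helper name `load_pages_for_labelling` and the `label` field are hypothetical.

```python
import json
import os


def load_pages_for_labelling(directory):
    """Collect every saved page into a list of records with an empty label."""
    records = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith('.json'):
            continue
        with open(os.path.join(directory, name)) as handle:
            text = json.load(handle)
        # 'label' is a hypothetical field to fill in by hand (1 = relevant, 0 = not)
        records.append({'file': name, 'text': text, 'label': None})
    return records
```

The resulting list could then be exported to a spreadsheet or a labelling tool; that choice is left open here.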

In [11]:
#get_pages('test_pdf.pdf')

In [12]:
os.chdir('/Users/spezzata/Documents/Projects/AI4Good/data_aiminer')

In [13]:
!ls

An intrinsic characterization of p-symmetric Heegaard splittings.pdf
Computational Analysis for the Malfunction of Turbine Casing.pdf
Poly(silyl ester)s: A new route of synthesis via the condensation of Di-tert-butyl ester of dicarboxylic acid with dichlorosilane.pdf
Sugar Cane to Fuel-Ethanol... to green power? clean water? recycle sludge? reclaim soils?.pdf
Sympathetic skin response in patients with myasthenia gravis: A comparative analysis.pdf
The Indian Summer Monsoon and its Sensitivity to the Mean SSTs: Simulations with the ECHAM4 AGCM at T106 Horizontal Resolution..pdf
aminer_papers_0
aminer_papers_0.zip
aminer_papers_1.zip
aminer_papers_2.zip
aminer_papers_3.zip
download.pdf
my_pdf.pdf
papers


In [14]:
with ZipFile('aminer_papers_1.zip', 'r') as zip: 
    # printing all the contents of the zip file 
    zip.printdir()

File Name                                             Modified             Size
aminer_papers_4.txt                            2019-01-20 21:57:08  10000004774
aminer_papers_5.txt                            2019-01-20 22:12:26  10000003508
aminer_papers_6.txt                            2019-01-20 22:39:02  10000004818
aminer_papers_7.txt                            2019-01-20 22:54:18  10000003416


In [15]:
file1 = open("aminer_papers_0/aminer_papers_1.txt","r")

In [16]:
for i in range(5):
    fileline = file1.readline()
    print(fileline)

{"id": "53e99e61b7602d97027281bf", "title": "Anti-cancer mechanism of survivin siRNA plasmid mU6/survivin", "authors": [{"name": "LI Li-ping", "org": "Institute of Biochemistry and Molecular Biology,Guangdong Medical College,Zhanjiang ,China", "id": "542a1880dabfae849f6a9f12"}, {"name": "LIANG Nian-ci", "org": "Institute of Biochemistry and Molecular Biology,Guangdong Medical College,Zhanjiang ,China", "id": "54299e8fdabfaec70819b838"}, {"name": "ZHANG Zhi-zhen", "org": "Institute of Biochemistry and Molecular Biology,Guangdong Medical College,Zhanjiang ,China", "id": "542aac66dabfae646d57e62f"}, {"name": "LUO Chao-quan", "org": "Department of Biochemistry,Zhongshan Medical College,Sun Yat-Sen University,Guangzhou ,China", "id": "542c2658dabfae1bbfd21929"}], "venue": {"raw": "Journal of Modern Oncology", "id": "5451a5b9e0cf0b02b5f34e9c"}, "year": 2009, "keywords": ["siRNA plasmid", "survivin", "breast cancer cells", "mitotic cell death"], "n_citation": 1, "page_start": "40", "page_end"

In [17]:
#each line is a JSON object: json.loads parses it without executing it as code
d = json.loads(fileline)
d

{'id': '53e99e61b7602d97027281c3',
 'title': 'To Respect the Subjectivity of Party Members and Guarantee Their Democratic Rights:the Way of Realizing Inner-party Harmony',
 'authors': [{'name': 'ZENG Zhi-gang',
   'org': 'Jiangxi Provincial Party School,Nanchang,Jiangxi,,China',
   'id': '542a0d31dabfaec7081d9f3e'},
  {'name': 'HUANG Ming-zhe',
   'org': 'Ganzhou Municipal Party School,Ganzhou,Jiangxi,,China',
   'id': '53f63416dabfae3fc7c71645'}],
 'venue': {'raw': 'Journal of Yanbian University(Social Sciences)',
  'id': '5451a5b8e0cf0b02b5f34d26'},
 'year': 2008,
 'keywords': ['democratic rights',
  'subjectivity of Party member',
  'inner-party harmony'],
 'n_citation': 3,
 'page_start': '7',
 'page_end': '14',
 'lang': 'zh',
 'issue': '06',
 'abstract': "To respect the subjectivity of Party members and guarantee their democratic rights not only is key to promote the inner-party democracy and social solidarity,but also embodies the implementing of the scientific concept of developm

As we can see above, the URLs provided rarely point to a PDF. As a result, it might be better to use the title to run a Google search, look for PDF results and download them from there.

### Downloading papers from Google

In [18]:
from googlesearch import search

In [19]:
query = "filetype:pdf Persistent soil seed banks in Phacelia secunda (Hydrophyllaceae): experimental detection of variation along an altitudinal gradient in the Andes of central Chile (33 S)"

Using the googlesearch library, we extract the top 5 results. The pause argument is the time to wait between HTTP requests. Too short a pause may cause Google to block your IP; a longer pause makes the program slower but is the safer option. Once we parallelize the document extraction, this might limit how well the approach scales.

In [20]:
for j in search(query, tld="co.in", num=5, stop=5, pause=2): 
    print(j)

https://www.jstor.org/stable/3072115
https://besjournals.onlinelibrary.wiley.com/doi/pdf/10.1046/j.1365-2745.2001.00514.x
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.853.7675&rep=rep1&type=pdf
https://bioone.org/journalArticle/Download?fullDOI=10.1658%2F1100-9233(2003)014%5B0253%3ASBDITG%5D2.0.CO%3B2
https://pdfs.semanticscholar.org/eb4c/179206eeb88c3403628a26466f60eebee04d.pdf
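Because requests that come too quickly may get blocked, one option is to retry a failed call after an increasing delay. The sketch below is a generic retry helper and not part of the googlesearch library. The name `call_with_backoff` and its parameters are hypothetical, and catching every `Exception` is a simplification, since the exact error raised when a request is blocked is not shown in this notebook.

```python
import time


def call_with_backoff(func, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay, then twice that, and retry."""
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            # give up after the last attempt and let the error propagate
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

It could wrap the search call used above, for example `call_with_backoff(lambda: list(search(query, tld="com", num=5, stop=5, pause=2)))`.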


Storing those URLs as a list, and keeping only those that end in "pdf":

In [21]:
url_list = [url for url in search(query, tld="com", num=5, stop=5, pause=2) if url[-3:]=='pdf']

In [22]:
url_list

['http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.853.7675&rep=rep1&type=pdf',
 'https://pdfs.semanticscholar.org/eb4c/179206eeb88c3403628a26466f60eebee04d.pdf']
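The suffix test keeps the citeseerx link only because its query string happens to end in `type=pdf`. A slightly more explicit check parses the URL and looks at the path and the query separately. This is a minimal sketch: the helper name `looks_like_pdf` is hypothetical, and the `type` query parameter is an assumption based on the citeseerx link above.

```python
from urllib.parse import parse_qs, urlparse


def looks_like_pdf(url):
    """Return True if the URL path ends in .pdf or the query asks for type=pdf."""
    parsed = urlparse(url)
    if parsed.path.lower().endswith('.pdf'):
        return True
    query = parse_qs(parsed.query)
    return any(value.lower() == 'pdf' for value in query.get('type', []))
```

On the five results listed earlier, this keeps the same two links as the suffix test.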

Then, we can take the first pdf link that we have and download it:

In [23]:
url_to_download = url_list[0]

In [24]:
url_to_download

'http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.853.7675&rep=rep1&type=pdf'

In [25]:
#download pdf from online url
urlretrieve(url_to_download, "download_test.pdf")

('download_test.pdf', <http.client.HTTPMessage at 0x10f71cd50>)
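A link that looks like a PDF may still return an HTML page, so it can help to check the downloaded file before parsing it. The sketch below tests for the `%PDF-` marker that PDF files start with. The helper name `is_pdf_file` is hypothetical, and files with leading bytes before the marker would be rejected by this simple check.

```python
def is_pdf_file(path):
    """Return True if the file starts with the PDF header marker."""
    with open(path, 'rb') as handle:
        return handle.read(5) == b'%PDF-'
```

A call such as `is_pdf_file("download_test.pdf")` could be placed right after the download, before the file is handed to the parser.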

In [26]:
doc_test = open("download_test.pdf", mode='rb')

In [27]:
pdf_document = PyPDF4.PdfFileReader(doc_test)
first_page = pdf_document.getPage(0)
first_page_ = re.sub(r'\n', ' ', first_page.extractText())
first_page_

'African Journal of Agricultural Research Vol. 6(10), pp . 2329-2340, 18 May, 2011  Available online at http://www.academicjournals.org/AJA R  DOI: 10.5897/AJAR10.1099   ISSN 1991-637X ©2011 Academic Journals          Full Length Research Paper    Persistent soil seed banks along altitudinal gradie nts in  the Qilian Mountains in China and their significanc e for  conservation management    Qiuyan Li 1 , Haiyan Fang 2 *and Qiangguo Cai 2    1 College of Water Conservancy and Civil Engineering, C hina Agricultural University, Beijing, 100083, China .  2 Institute of Geographic Sciences and Natural Resources  Research, Chinese Academy of Sciences,   Beijing, 100101, China.    Accepted 31 January, 2011    The qualitative and quantitative parameters of pers istent soil seed bank, including species compositio n,  seed density, vertical distribution and the relatio nship of soil seed bank and vegetation, were assess ed  along an altitudinal gradient in seven communities  in shady slope, sunn

In [28]:
!ls

An intrinsic characterization of p-symmetric Heegaard splittings.pdf
Computational Analysis for the Malfunction of Turbine Casing.pdf
Poly(silyl ester)s: A new route of synthesis via the condensation of Di-tert-butyl ester of dicarboxylic acid with dichlorosilane.pdf
Sugar Cane to Fuel-Ethanol... to green power? clean water? recycle sludge? reclaim soils?.pdf
Sympathetic skin response in patients with myasthenia gravis: A comparative analysis.pdf
The Indian Summer Monsoon and its Sensitivity to the Mean SSTs: Simulations with the ECHAM4 AGCM at T106 Horizontal Resolution..pdf
aminer_papers_0
aminer_papers_0.zip
aminer_papers_1.zip
aminer_papers_2.zip
aminer_papers_3.zip
download.pdf
download_test.pdf
my_pdf.pdf
papers


Then, deleting the file:

In [29]:
os.remove("download_test.pdf")

In [30]:
!ls

An intrinsic characterization of p-symmetric Heegaard splittings.pdf
Computational Analysis for the Malfunction of Turbine Casing.pdf
Poly(silyl ester)s: A new route of synthesis via the condensation of Di-tert-butyl ester of dicarboxylic acid with dichlorosilane.pdf
Sugar Cane to Fuel-Ethanol... to green power? clean water? recycle sludge? reclaim soils?.pdf
Sympathetic skin response in patients with myasthenia gravis: A comparative analysis.pdf
The Indian Summer Monsoon and its Sensitivity to the Mean SSTs: Simulations with the ECHAM4 AGCM at T106 Horizontal Resolution..pdf
aminer_papers_0
aminer_papers_0.zip
aminer_papers_1.zip
aminer_papers_2.zip
aminer_papers_3.zip
download.pdf
my_pdf.pdf
papers


Now, we are creating a function that can process all of those actions:

In [31]:
def process_paper(fileline):
    '''
    This function extracts the relevant pages from a paper
    fileline: one line of the aminer text file, holding one paper as a JSON object
    '''
    #parse the line as JSON rather than evaluating it as Python code
    title = json.loads(fileline)['title']
    print(title)
    query = 'filetype:pdf ' + title
    url_list = [url for url in search(query, tld="com", num=5, stop=5, pause=2) if url[-3:]=='pdf']

    #if there is at least one pdf link among the top 5 google search results, we can extract it
    try: url_to_download = url_list[0]
    except IndexError: return

    #download pdf from online url
    urlretrieve(url_to_download, '{}.pdf'.format(title))
    print('doc_downloaded')
    #we extract the relevant pages
    get_pages(path='{}.pdf'.format(title))
    print('pages_parsed')
    #finally, we remove the downloaded document
    os.remove('{}.pdf'.format(title))
    print('doc_removed')

    return

Parallelizing using Pool and map:

In [32]:
count = 0
doc_txt = 'aminer_papers_0/aminer_papers_1.txt'
#count the lines, closing the file once done
with open(doc_txt) as myfile:
    for line in myfile: count += 1

In [33]:
count

6755748

In [34]:
doc_txt = 'aminer_papers_0/aminer_papers_1.txt'
with open(doc_txt) as myfile:
    lines_list = [next(myfile) for x in range(count)]

In [36]:
lines_list[:3]

['{"id": "53e99e61b7602d97027281bf", "title": "Anti-cancer mechanism of survivin siRNA plasmid mU6/survivin", "authors": [{"name": "LI Li-ping", "org": "Institute of Biochemistry and Molecular Biology,Guangdong Medical College,Zhanjiang ,China", "id": "542a1880dabfae849f6a9f12"}, {"name": "LIANG Nian-ci", "org": "Institute of Biochemistry and Molecular Biology,Guangdong Medical College,Zhanjiang ,China", "id": "54299e8fdabfaec70819b838"}, {"name": "ZHANG Zhi-zhen", "org": "Institute of Biochemistry and Molecular Biology,Guangdong Medical College,Zhanjiang ,China", "id": "542aac66dabfae646d57e62f"}, {"name": "LUO Chao-quan", "org": "Department of Biochemistry,Zhongshan Medical College,Sun Yat-Sen University,Guangzhou ,China", "id": "542c2658dabfae1bbfd21929"}], "venue": {"raw": "Journal of Modern Oncology", "id": "5451a5b9e0cf0b02b5f34e9c"}, "year": 2009, "keywords": ["siRNA plasmid", "survivin", "breast cancer cells", "mitotic cell death"], "n_citation": 1, "page_start": "40", "page_en

In [37]:
from multiprocessing import Pool
p = Pool(8)
start = time.perf_counter()
p.map(process_paper, lines_list)
end = time.perf_counter()

Anti-cancer mechanism of survivin siRNA plasmid mU6/survivin
Computational Analysis for the Malfunction of Turbine Casing
An intrinsic characterization of p-symmetric Heegaard splittings
Sympathetic skin response in patients with myasthenia gravis: A comparative analysis
Poly(silyl ester)s: A new route of synthesis via the condensation of Di-tert-butyl ester of dicarboxylic acid with dichlorosilane
doc_downloaded
doc_downloaded
Nitric oxide is involved in lithium-induced immediate early gene expressions in the adrenal medulla
doc_downloaded
Sugar Cane to Fuel-Ethanol... to green power? clean water? recycle sludge? reclaim soils?
doc_downloaded
Infection en réanimation : un défi permanent de prévention et de prise en charge
Anodisches Verhalten der Edelmetall-Legierungen und die Frage der Resistenzgrenzen
Effect of ranitidine on basal and bicarbonate enhanced intragastric PCO2: a tonometric study.
doc_downloaded
The Indian Summer Monsoon and its Sensitivity to the Mean SSTs: Simulations

doc_downloaded
doc_downloaded
pages_parsed
doc_removed
Sparse representation based on projection method in online least squares support vector machines
pages_parsed
doc_removed
Discussion on Managerial Experience of Sucker Rod Pumping Technique in Shanshan Oil Production Plant
doc_downloaded
pages_parsed
doc_removed
A spur unfolding model for the radiolysis of water
pages_parsed
doc_removed
Metabolic cost and mechanical work during walking after tibiotalar arthrodesis and the influence of footwear
doc_downloaded
pages_parsed
doc_removed
Numerical simulation of internal flow of tubular pump system
pages_parsed
doc_removed
Outcome of severe adult thrombotic microangiopathies in the intensive care unit.
doc_downloaded
doc_downloaded
pages_parsed
doc_removed
Pattern-Based Constraint Satisfaction and Logic Puzzles
doc_downloaded
pages_parsed
doc_removed
The 26th IUGG Conference on Mathematical Geophysics Sea of Galilee, Israel, 4-8 June 2006
pages_parsed
doc_removed
Hydriding behaviors of Z

FileNotFoundError: [Errno 2] No such file or directory: 'Anti-cancer mechanism of survivin siRNA plasmid mU6/survivin.pdf'
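The error above comes from the `/` in the title `mU6/survivin`: when the title is used as a file name, the slash is read as a directory separator. One possible fix is to replace such characters before building the path. This is a minimal sketch. The helper name `safe_filename`, the set of replaced characters and the length limit are all assumptions.

```python
import re


def safe_filename(title, max_length=150):
    """Replace characters that are unsafe in file names and trim the result."""
    # path separators and characters commonly rejected by file systems
    cleaned = re.sub(r'[\\/:*?"<>|]+', '_', title).strip()
    return cleaned[:max_length]
```

The download, parsing and removal steps would then all use `'{}.pdf'.format(safe_filename(title))` instead of the raw title.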

In [None]:
print(end-start)

In [40]:
import multiprocessing as mp

doc_txt = 'aminer_papers_0/aminer_papers_1.txt'

def process_wrapper(lineID):
    with open(doc_txt) as f:
        for i,line in enumerate(f):
            if i != lineID:
                continue
            else:
                process_paper(line)
                break

#init objects
cores=8
pool = mp.Pool(cores)
jobs = []

#create jobs
with open(doc_txt) as f:
    for ID,line in enumerate(f):
        #the arguments must be passed as a tuple: (ID) is just ID, (ID,) is a one-element tuple
        jobs.append( pool.apply_async(process_wrapper,(ID,)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()

Process ForkPoolWorker-78:
Process ForkPoolWorker-83:
Process ForkPoolWorker-81:


KeyboardInterrupt: 

Process ForkPoolWorker-80:
Process ForkPoolWorker-84:
Process ForkPoolWorker-82:
Process ForkPoolWorker-77:
Process ForkPoolWorker-79:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "//anaconda3/envs/ai4g/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
Traceback (most recent call last):
  File "//anaconda3/envs/ai4g/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "//anaconda3/envs/ai4g/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "//anaconda3/envs/ai4g/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "//anaconda3/envs/ai4g/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "//anaconda3/envs/
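Since most of the time per paper is spent waiting on the network, a thread pool is one alternative to a process pool here. The sketch below uses `concurrent.futures` from the standard library. The helper name `run_in_threads` and its `limit` parameter are hypothetical, and `worker` stands for any function taking one line, such as `process_paper`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def run_in_threads(worker, lines, max_workers=8, limit=None):
    """Apply worker to each line using a pool of threads; return the results in order."""
    if limit is not None:
        # only take the first `limit` lines, useful for a trial run
        lines = islice(lines, limit)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(worker, lines))
```

Passing the open file object as `lines` avoids re-reading the file for every job. For a file with millions of lines, `limit` keeps the number of queued tasks manageable.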

Now we can run this function over the whole aminer_papers_1.txt text file:

import multiprocessing as mp,os

doc_txt = 'aminer_papers_0/aminer_papers_1.txt'

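def process_wrapper(chunkStart, chunkSize):
    #chunk offsets are byte positions, so the file is read in binary mode and decoded afterwards
    with open(doc_txt, 'rb') as f:
        f.seek(chunkStart)
        lines = f.read(chunkSize).decode('utf-8').splitlines()
        for line in lines:
            process_paper(line)

def chunkify(fname,size=1024*1024):
    fileEnd = os.path.getsize(fname)
    #open the file once, so that the position is kept from one chunk to the next
    with open(fname,'rb') as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            f.seek(size,1)
            #move to the end of the current line so that no line is split between two chunks
            f.readline()
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd >= fileEnd:
                break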

#init objects
cores = 8
pool = mp.Pool(cores)
jobs = []

start = time.perf_counter()
#create jobs
for chunkStart,chunkSize in chunkify(doc_txt):
    jobs.append( pool.apply_async(process_wrapper,(chunkStart,chunkSize)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()
end = time.perf_counter()

In [93]:
import numpy as np

def process_paper_(self, dict_):
    #1st: Extract Metadata, falling back to NaN when a field is missing
    self.title = dict_.get('title', np.nan)
    self.authors = dict_.get('authors', np.nan)
    self.year = dict_.get('year', np.nan)
    self.citations = dict_.get('n_citation', np.nan)

    #2nd: Analyse the paper and extract pages with mention of 'data'
    self.url = dict_['url'][0]
    os.chdir('/Users/spezzata/Documents/Projects/AI4Good/data_aiminer/papers')
    #use the last segment of the url as local file name, since a full url is not a valid path
    path = os.path.basename(self.url)
    urlretrieve(self.url, path)
    get_pages(path)
    os.remove(path)

    #3rd: Store the information in a json file
    #os.chdir(dir_for_papers) ##TODO define dir_for_info
    #os.makedirs('{}', exist_ok=True)

    return

In [94]:
def test_url(dict_):
    url_ = dict_['url'][0]
    try:
        if url_[-3:] == 'pdf':
            process_paper(dict_)
        else: pass
    except: pass
    return

In [95]:
#TODO: make a function that can process in parallel all of the papers from the aiminer doc that you have

In [30]:
file1 = open("aminer_papers_0/aminer_papers_0.txt","r")

FileNotFoundError: [Errno 2] No such file or directory: 'aminer_papers_0/aminer_papers_0.txt'

In [31]:
file1 = open("MyFile.txt","r")

FileNotFoundError: [Errno 2] No such file or directory: 'MyFile.txt'

In [None]:
file1.read()

In [7]:
start = time.perf_counter()
page1 = doc.loadPage(0)
page1text = page1.getText("text")
end = time.perf_counter()
print(page1text)

Letters
https://doi.org/10.1038/s41558-019-0666-7
1Institute for Atmospheric and Climate Science, ETH Zurich, Zurich, Switzerland. 2Seminar for Statistics, ETH Zurich, Zurich, Switzerland. 3Norwegian Institute 
of Bioeconomy Research, Ås, Norway. 4Swiss Data Science Center, ETH Zurich and EPFL, Lausanne, Switzerland. *e-mail: sebastian.sippel@env.ethz.ch
For generations, climate scientists have educated the public 
that ‘weather is not climate’, and climate change has been 
framed as the change in the distribution of weather that slowly 
emerges from large variability over decades1–7. However, 
weather when considered globally is now in uncharted terri-
tory. Here we show that on the basis of a single day of globally 
observed temperature and moisture, we detect the fingerprint 
of externally driven climate change, and conclude that Earth as 
a whole is warming. Our detection approach invokes statisti-
cal learning and climate model simulations to encapsulate the 
relationship between 

In [8]:
page1text

'Letters\nhttps://doi.org/10.1038/s41558-019-0666-7\n1Institute for Atmospheric and Climate Science, ETH Zurich, Zurich, Switzerland. 2Seminar for Statistics, ETH Zurich, Zurich, Switzerland. 3Norwegian Institute \nof Bioeconomy Research, Ås, Norway. 4Swiss Data Science Center, ETH Zurich and EPFL, Lausanne, Switzerland. *e-mail: sebastian.sippel@env.ethz.ch\nFor generations, climate scientists have educated the public \nthat ‘weather is not climate’, and climate change has been \nframed as the change in the distribution of weather that slowly \nemerges from large variability over decades1–7. However, \nweather when considered globally is now in uncharted terri-\ntory. Here we show that on the basis of a single day of globally \nobserved temperature and moisture, we detect the fingerprint \nof externally driven climate change, and conclude that Earth as \na whole is warming. Our detection approach invokes statisti-\ncal learning and climate model simulations to encapsulate the \nrelati

In [40]:
print(end-start)

0.022048476000236406


In [38]:
for block in page1.getText("blocks"):
    print(block)

(469.82708740234375, 26.17009925842285, 544.9070434570312, 52.01009750366211, 'Letters', 0, 0)
(391.9768981933594, 52.054752349853516, 543.9119262695312, 62.38275146484375, 'https://doi.org/10.1038/s41558-019-0666-7', 1, 0)
(42.51969909667969, 726.0870361328125, 552.6649780273438, 747.8230590820312, '1Institute for Atmospheric and Climate Science, ETH Zurich, Zurich, Switzerland. 2Seminar for Statistics, ETH Zurich, Zurich, Switzerland. 3Norwegian Institute \nof Bioeconomy Research, Ås, Norway. 4Swiss Data Science Center, ETH Zurich and EPFL, Lausanne, Switzerland. *e-mail: sebastian.sippel@env.ethz.ch', 2, 0)
(42.49089813232422, 199.68580627441406, 293.76123046875, 706.2036743164062, 'For generations, climate scientists have educated the public \nthat ‘weather is not climate’, and climate change has been \nframed as the change in the distribution of weather that slowly \nemerges from large variability over decades1–7. However, \nweather when considered globally is now in uncharted ter

In [8]:
page1.getText("blocks")

[(469.82708740234375,
  26.17009925842285,
  544.9070434570312,
  52.01009750366211,
  'Letters',
  0,
  0),
 (391.9768981933594,
  52.054752349853516,
  543.9119262695312,
  62.38275146484375,
  'https://doi.org/10.1038/s41558-019-0666-7',
  1,
  0),
 (42.51969909667969,
  726.0870361328125,
  552.6649780273438,
  747.8230590820312,
  '1Institute for Atmospheric and Climate Science, ETH Zurich, Zurich, Switzerland. 2Seminar for Statistics, ETH Zurich, Zurich, Switzerland. 3Norwegian Institute \nof Bioeconomy Research, Ås, Norway. 4Swiss Data Science Center, ETH Zurich and EPFL, Lausanne, Switzerland. *e-mail: sebastian.sippel@env.ethz.ch',
  2,
  0),
 (42.49089813232422,
  199.68580627441406,
  293.76123046875,
  706.2036743164062,
  'For generations, climate scientists have educated the public \nthat ‘weather is not climate’, and climate change has been \nframed as the change in the distribution of weather that slowly \nemerges from large variability over decades1–7. However, \nweath

In [18]:
doc = open(path_1, mode='rb')
pdf_document = PyPDF4.PdfFileReader(doc)

In [19]:
first_page = pdf_document.getPage(0)
first_page.extractText()

'1\nInstitute for Atmospheric and Climate Science, ETH Zurich, Zurich, Switzerland. \n2\nSeminar for Statistics, ETH Zurich, Zurich, Switzerland. \n3\nNorwegian Institute \nof Bioeconomy Research, \n˜\ns, Norway. \n4\nSwiss Data Science Center, ETH Zurich and EPFL, Lausanne, Switzerland. *e-mail: \nsebastian.sippel@env.ethz.ch\nFor generations, climate scientists have educated the public \nthat ‚weather is not climate™, and climate change has been \nframed as the change in the distribution of weather that slowly \nemerges from large variability over decades\n1\nŒ\n7\n. However, \nweather when considered globally is now in uncharted terri\n-\ntory. Here we show that on the basis of a single day of globally \nobserved temperature and moisture, we detect the fingerprint \n\nof externally driven climate change, and conclude that Earth as \na whole is warming. Our detection approach invokes statisti\n-\ncal learning and climate model simulations to encapsulate the \n\nrelationship between s

In [20]:
first_page_ = re.sub(r'\n', '', first_page.extractText())

In [21]:
first_page_

'1Institute for Atmospheric and Climate Science, ETH Zurich, Zurich, Switzerland. 2Seminar for Statistics, ETH Zurich, Zurich, Switzerland. 3Norwegian Institute of Bioeconomy Research, ˜s, Norway. 4Swiss Data Science Center, ETH Zurich and EPFL, Lausanne, Switzerland. *e-mail: sebastian.sippel@env.ethz.chFor generations, climate scientists have educated the public that ‚weather is not climate™, and climate change has been framed as the change in the distribution of weather that slowly emerges from large variability over decades1Œ7. However, weather when considered globally is now in uncharted terri-tory. Here we show that on the basis of a single day of globally observed temperature and moisture, we detect the fingerprint of externally driven climate change, and conclude that Earth as a whole is warming. Our detection approach invokes statisti-cal learning and climate model simulations to encapsulate the relationship between spatial patterns of daily temperature and humidity, and key c

In [50]:
first_page_.find('difference')

3336

In [52]:
first_page_sentences = first_page_.split('.')

In [51]:
first_page_[3336:]

'd'

In [58]:
first_page_sentences[2].find('Norway')

50

In [53]:
first_page_sentences

['1Institute for Atmospheric and Climate Science, ETH Zurich, Zurich, Switzerland',
 ' 2Seminar for Statistics, ETH Zurich, Zurich, Switzerland',
 ' 3Norwegian Institute of Bioeconomy Research, ˜s, Norway',
 ' 4Swiss Data Science Center, ETH Zurich and EPFL, Lausanne, Switzerland',
 ' *e-mail: sebastian',
 'sippel@env',
 'ethz',
 'chFor generations, climate scientists have educated the public that ‚weather is not climate™, and climate change has been framed as the change in the distribution of weather that slowly emerges from large variability over decades1Œ7',
 ' However, weather when considered globally is now in uncharted terri-tory',
 ' Here we show that on the basis of a single day of globally observed temperature and moisture, we detect the fingerprint of externally driven climate change, and conclude that Earth as a whole is warming',
 ' Our detection approach invokes statisti-cal learning and climate model simulations to encapsulate the relationship between spatial patterns of 
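Splitting on every `.` also cuts the e-mail address into pieces, as the list above shows. A slightly more careful rule splits only where sentence-ending punctuation is followed by whitespace and a capital letter. This is a minimal sketch: the helper name `split_sentences` is hypothetical, and the rule will still miss boundaries where the extracted text has no space after the period.

```python
import re

# split at whitespace that follows ., ! or ? and precedes a capital letter
_SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')


def split_sentences(text):
    """Split text into sentences without breaking on dots inside words."""
    return [part for part in _SENTENCE_BOUNDARY.split(text) if part]
```

Because the page text here was built by removing newlines entirely, replacing them with a space first would give this rule more boundaries to work with.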

In [54]:
mentions = set()

In [55]:
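def get_extract_mention(sentence):
    #keep the sentences that mention 'data'
    if sentence.find('data')!=-1:
        mentions.add(sentence.strip())
    return

In [None]:
#map is lazy in Python 3, so loop explicitly to make sure the function runs
for sentence in first_page_sentences:
    get_extract_mention(sentence)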