# PDFDataExtractor Demo

PDFDataExtractor is a toolkit for automatically extracting semantic information from PDF files of scientific articles. It uses a template-based architecture and can extract information from articles published by the following publishers and journal families:
* Elsevier
* Royal Society of Chemistry
* Advanced Materials family (Wiley)
* Angewandte
* Chemistry A European Journal
* American Chemical Society
* Springer (Temporarily unavailable)

## To install PDFDataExtractor, first clone the repository by running the following command in your terminal

In [None]:
# git clone git@github.com:cat-lemonade/PDFDataExtractor.git

## Then run the following command from inside the cloned directory:

In [None]:
# python setup.py install

## Pass a single PDF file

### Import necessary module

In [1]:
from pdfdataextractor import Reader

In [2]:
path = r"F:/tadf_papers/wiley/10_1002_adom_201701147.pdf"

In [3]:
file = Reader()

In [4]:
pdf = file.read_file(path)

Reading:  F:/tadf_papers/wiley/10_1002_adom_201701147.pdf
*** Advanced Materials Family detected ***


### Test if the PDF is returned successfully

In [5]:
pdf.test()

PDF returned successfully


### Get Caption

In [6]:
pdf.caption()

{'figure 1': 'Figure 1.  Molecular structures of D–π–A-type CNBPz- and CNBQx-based  TADF materials.',
 'figure 2': 'Figure  2.  Comparison  of  the  intrinsic  electron-accepting  capability  (LUMO level) of various A units used for red TADF materials, estimated  by DFT at the PBE0/6-31G(d) level.',
 'figure 3': 'Figure 3.  Calculated S1 and T1 energies with oscillator strength (f), optimized molecular geometries, and HOMO and LUMO distributions for 1–4  using the PBE0/6-31G(d) method. H → L represents the HOMO to LUMO transition.',
 'figure 4': 'Figure 4.  Molecular structures of 1–3 (CCDC 1578725, 1578726, and 1576763, respectively) with thermal ellipsoids at 50% probability determined by  single-crystal X-ray analyses. The dihedral angles (θD–π and θA–π) are estimated from the respective interplanar angles for the donor (pink), phenylene  (green), and acceptor (blue) moieties.',
 'figure 5': 'Figure 5.  Photophysical properties of 1–4: a) UV–vis absorption and b) PL spectra in oxyge

### Get Keywords

In [7]:
pdf.keywords()  # Note: some articles do not contain keywords.

'donor–acceptor  systems,  luminescence,  organic  light  emitting  diodes, \norganic semiconductors, thermally activated delayed fluorescence'
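
The keyword string above comes back as a single value with repeated spaces and an embedded newline. A minimal sketch of turning it into a list follows; `split_keywords` is a hypothetical helper, not part of PDFDataExtractor, and it assumes the keywords are comma-separated as in this sample.

```python
import re


def split_keywords(raw):
    """Split a raw keyword string into a list of cleaned keywords.

    Hypothetical helper (not part of PDFDataExtractor); assumes the
    keywords are separated by commas, as in the sample output.
    """
    # Collapse runs of whitespace, including newlines, into single spaces
    flat = re.sub(r"\s+", " ", raw)
    # Split on commas and drop any empty pieces
    return [kw.strip() for kw in flat.split(",") if kw.strip()]


# Sample string copied from the pdf.keywords() output
raw = ('donor–acceptor  systems,  luminescence,  organic  light  emitting  diodes, \n'
       'organic semiconductors, thermally activated delayed fluorescence')
keywords = split_keywords(raw)
print(keywords)
```

Journals that separate keywords with semicolons or bullets would need a different delimiter.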

### Get Title

In [8]:
pdf.title()

'Highly Efficient Red–Orange Delayed Fluorescence  Emitters Based on Strong π-Accepting Dibenzophenazine  and Dibenzoquinoxaline Cores: toward a Rational Pure-Red  OLED Design'

### Get DOI

In [9]:
pdf.doi()

'10.1002/adom.201701147'

### Get Abstract

In [10]:
pdf.abstract()

['Organic luminescent materials that exhibit thermally activated delayed fluorescence (TADF) can harvest both singlet and triplet excitons for light emission, leading to high electroluminescence (EL) quantum efficiencies in organic light-emitting diodes (OLEDs). However, efficient red TADF materials are still very rare because of their restricted molecular design based on the energy gap law. To address this issue, elaborate π-conjugated donor–acceptor (D–A) systems that can simultaneously achieve a large fluorescence radiative rate and small singlet–triplet energy splitting should be strategically designed. In this study, to produce high-efficiency pure-red TADF materials, a remarkably strong π-accepting dicyanodibenzo[a,c]phena-zine (CNBPz) unit has been introduced in a D–π–A molecular framework, and combined with a phenylene-linked p-ditolylamine or 9,9-dimethylacridan moiety. The steady-state and time-resolved photophysical measurements revealed intense genuine red TADF emissions of

### Get Journal

In [26]:
pdf.journal()

{'name': 'adv. optical mater',
 'year': '2018',
 'volume': ' 6',
 'page': ' 1701147'}

### Get Journal name

In [27]:
pdf.journal('name')

'adv. optical mater'

### Get Journal Year

In [28]:
pdf.journal('year')

'2018'

### Get Journal Volume

In [29]:
pdf.journal('volume')

' 6'

### Get Journal Page

In [30]:
pdf.journal('page')

' 1701147'
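
The volume and page values in the outputs above carry a leading space. If that matters downstream, the dictionary can be tidied after extraction. The sketch below uses a hypothetical helper, `clean_journal`, and a literal copy of the sample output rather than a live `pdf.journal()` call.

```python
def clean_journal(journal):
    """Return a copy of the journal dict with surrounding whitespace removed.

    Hypothetical helper (not part of PDFDataExtractor). Non-string values,
    such as None for a missing field, are passed through unchanged.
    """
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in journal.items()
    }


# Sample values copied from the pdf.journal() output
journal = {'name': 'adv. optical mater', 'year': '2018', 'volume': ' 6', 'page': ' 1701147'}
cleaned = clean_journal(journal)
print(cleaned)
```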

### Get Plain Text

In [11]:
pdf.plaintext()

'OLEDs\n\nHighly Efficient Red–Orange Delayed Fluorescence \nEmitters Based on Strong π-Accepting Dibenzophenazine \nand Dibenzoquinoxaline Cores: toward a Rational Pure-Red \nOLED Design\n\nRyuhei Furue, Kyohei Matsuo, Yasuhiko Ashikari, Hirohito Ooka, Natsuki Amanokura, \nand Takuma Yasuda*\n\nOrganic luminescent materials that exhibit thermally activated delayed \nfluorescence (TADF) can harvest both singlet and triplet excitons for light \nemission, leading to high electroluminescence (EL) quantum efficiencies \nin organic light-emitting diodes (OLEDs). However, efficient red TADF \nmaterials are still very rare because of their restricted molecular design \nbased on the energy gap law. To address this issue, elaborate π-conjugated \ndonor–acceptor (D–A) systems that can simultaneously achieve a large \nfluorescence radiative rate and small singlet–triplet energy splitting should \nbe strategically designed. In this study, to produce high-efficiency pure-red \nTADF materials, a rem

### Get Section titles and corresponding text

In [32]:
pdf.section()

{'1. Introduction': ['Metal-free  purely  organic  luminophores  that display thermally activated delayed flu- orescence  (TADF)  have  recently  attracted  significant attention owing to their promi- sing applications in organic light-emitting  diodes  (OLEDs),[1,2]  light-emitting  elec- trochemical  cells,[3]  optical  upconversion  devices,[4]  and  time-resolved  fluorescence  imaging.[5]  In  particular  for  OLED  dis- plays  and  solid-state  lighting  applica- tions, TADF materials can be an attractive  to  phosphorescent  low-cost  alternative  organometallic  complexes  that  contain  expensive precious metals such as iridium  and platinum. With a small singlet–triplet  energy  splitting  (ΔEST,  typically  less  than  0.2 eV), these TADF materials enable effi- cient  upconversion  of  the  nonemissive  triplet (T1) excitons to the emissive singlet  (S1)  excitons  via  fast  reverse  intersystem  crossing (RISC), facilitating the utilization  of  the  electro-generated  exc
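
The section text above keeps the PDF's layout artifacts: doubled spaces and words split at line ends (for example `flu- orescence`). A rough post-processing sketch follows. `tidy_text` is a hypothetical helper, and its hyphen rule is only a heuristic: it will also merge a genuine compound word that happened to break at a line end.

```python
import re


def tidy_text(text):
    """Normalize whitespace and rejoin words split by line-end hyphenation.

    Hypothetical helper (not part of PDFDataExtractor).
    """
    # Collapse repeated whitespace left over from the PDF layout
    text = re.sub(r"\s+", " ", text)
    # Heuristic: lowercase letter, hyphen, one space, lowercase letter
    # is treated as a word broken across two lines
    return re.sub(r"([a-z])- ([a-z])", r"\1\2", text).strip()


# Fragment copied from the pdf.section() output
sample = ('display thermally activated delayed flu- orescence  (TADF)  have  recently  '
          'attracted  significant attention owing to their promi- sing applications')
tidied = tidy_text(sample)
print(tidied)
```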

### Get References

In [5]:
for seq, ref in pdf.reference().items():
    print(seq)
    print(ref)

0
['National Science and Technology Council', ' Oﬃce of Science and Technology Policy. Materials Genome Initiative for Global Competitive- ness; 2011. ']
1
['Olivares-Amaya', ' R.; Amador-Bedolla', ' C.; Hachmann', ' J.; Atahan- Evrenk', ' S.; Sanchez-Carrera', ' R. S.; Vogt', ' L.; Aspuru-Guzik', ' A. Accelerated Computational Discovery of High-performance Materials for Organic Photovoltaics by Means of Cheminformatics. Energy Environ. Sci. 2011', ' 4', ' 4849−4861. ']
2
['Jain', ' A.; Ong', ' S. P.; Hautier', ' G.; Chen', ' W.; Richards', ' W. D.; Dacek', ' S.; Cholia', ' S.; Gunter', ' D.; Skinner', ' D.; Ceder', ' G.; Persson', ' K. A. Commentary: The Materials Project: A Materials Genome Approach to Accelerating Materials Innovation. APL Mater. 2013', ' 1', ' 011002. ']
3
['Tsuruoka', ' Y.; Tateishi', ' Y.; Kim', ' J.-D.; Ohta', ' T.; McNaught', ' J.; Ananiadou', ' S.; Tsujii', ' J. In Advances in Informatics; Bozanis', ' P.', ' Houstis', ' E. N.', ' Eds.; Springer Berlin Heidelbe
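
Each reference above is returned as a list of string pieces. Judging from the sample, the pieces look like the original entry split at its commas, so joining them with commas should give back a readable string. This is an assumption based on the output shown, and `join_reference` is a hypothetical helper.

```python
def join_reference(parts):
    """Rebuild one reference string from its pieces.

    Hypothetical helper (not part of PDFDataExtractor); assumes the
    pieces were produced by splitting the reference at commas.
    """
    return ",".join(parts).strip()


# Entry 2 copied from the pdf.reference() output
parts = ['Jain', ' A.; Ong', ' S. P.; Hautier', ' G.; Chen', ' W.; Richards',
         ' W. D.; Dacek', ' S.; Cholia', ' S.; Gunter', ' D.; Skinner', ' D.; Ceder',
         ' G.; Persson',
         ' K. A. Commentary: The Materials Project: A Materials Genome Approach to '
         'Accelerating Materials Innovation. APL Mater. 2013', ' 1', ' 011002. ']
reference = join_reference(parts)
print(reference)
```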

## Pass multiple files at one time

In [2]:
import glob

In [47]:
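def read_single(file):
    # Read one PDF and print its abstract
    reader = Reader()
    pdf = reader.read_file(file)
    print(pdf.abstract())


def read_multiple(paths):
    # Process each PDF path in turn, printing a separator after each abstract
    for path in paths:
        read_single(path)
        print('-------------------', '\n')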


In [48]:
read_multiple(glob.glob(r'/Users/miao/Desktop/test/els/*.pdf'))

Reading:  /Users/miao/Desktop/test/els/6.pdf
*** Elsevier detected ***
For policymakers, planners, urban design practitioners and city service decision-makers who endeavour to create policies and take decisions to improve the function of cities, developing an understanding of cities, and the particular city in question, is important. However, in the ever-increasing ﬁeld of urban measurement and analysis, the challenges cities face are frequently presumed: crime and fear of crime, social inequality, environmental degradation, economic deterioration and disjointed governance. Although it may be that many cities share similar problems, it is unwise to assume that cities share the same challenges, to the same degree or in the same combination. And yet, diagnosing the challenges a city faces is often overlooked in preference for improving the understanding of known challenges. To address this oversight, this study evidences the need to diagnose urban challenges, introduces a novel mixed-met

*** Elsevier detected ***
Cities are increasingly challenged to improve their competitiveness. Performance indicators stand as an important element to interpret the success of the policy regime adopted by the municipality. Cities with a set of superior economic, social and environmental indicators have the potential to present better living conditions for their inhabitants. In this context, the aim of this research is to analyze whether the in- dicators published by Brazilian cities are aligned with the approach of a smart or sustainable city. The research used a set of 3150 data points regarding the performance of these cities. It analyzed the per- formance of the 150 best cities, divided into three groups of interest identiﬁed as small cities, medium- sized cities and big cities, on a set of 21 indicators. The set of identiﬁed indicators shows the attention of the cities to socioeconomic and information and communication technologies issues, thus revealing that Brazilian city manager
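
Printing is fine for a quick look, but for a larger batch it may be more convenient to collect the abstracts and keep going when one file fails. The sketch below takes the reading function as a parameter so it can run without the library installed. `collect_abstracts`, `FakePDF`, and `fake_read` are hypothetical names used only for illustration.

```python
def collect_abstracts(paths, read_pdf):
    """Map each path to its abstract, or to None if reading fails.

    Hypothetical helper: `read_pdf` is any callable that takes a path and
    returns an object with an `abstract()` method.
    """
    results = {}
    for path in paths:
        try:
            results[path] = read_pdf(path).abstract()
        except Exception as err:
            # One unreadable file should not stop the whole batch
            print(f"Skipping {path}: {err}")
            results[path] = None
    return results


class FakePDF:
    """Stand-in for the object returned by read_file (illustration only)."""

    def __init__(self, text):
        self._text = text

    def abstract(self):
        return self._text


def fake_read(path):
    # Simulate one file that cannot be parsed
    if path.endswith("bad.pdf"):
        raise ValueError("unsupported layout")
    return FakePDF(f"abstract of {path}")


abstracts = collect_abstracts(["a.pdf", "bad.pdf"], fake_read)
print(abstracts)
```

With PDFDataExtractor installed, the same function could presumably be called as `collect_abstracts(glob.glob(...), lambda p: Reader().read_file(p))`, mirroring the one-reader-per-file pattern used above.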

## Use PDFDataExtractor to perform chemistry-related extraction

### Use the flag "chem=True" to run chemistry-related information extraction with ChemDataExtractor while extracting metadata

In [3]:
file_test = r'/Volumes/Backup/PDE_papers/articles/Elesvier/dssc/The-effect-of-molecular-structure-on-the-properties-of-quinox_2020_Dyes-and-.pdf'

In [4]:
reader = Reader()

In [5]:
pdf = reader.read_file(file_test)

Reading:  /Volumes/Backup/PDE_papers/articles/Elesvier/dssc/The-effect-of-molecular-structure-on-the-properties-of-quinox_2020_Dyes-and-.pdf
*** Elsevier detected ***


### Pass True to 'chem'

In [6]:
r = pdf.abstract(chem=True)

### Show records

In [7]:
r.records.serialize()

[{'names': ['donor-π-bridge-acceptor-π']},
 {'names': ['quinoxaline']},
 {'names': ['deep red']}]
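
The serialized records are a list of dictionaries, each with a `names` list. To work with the names directly, they can be flattened into one list. The sketch below runs on a literal copy of the output above. `compound_names` is a hypothetical helper, and it assumes every record is a plain dictionary as in this sample.

```python
def compound_names(records):
    """Flatten serialized records into a list of unique names, in order.

    Hypothetical helper (not part of PDFDataExtractor or ChemDataExtractor).
    Records without a 'names' key are skipped.
    """
    names = []
    for record in records:
        for name in record.get('names', []):
            if name not in names:
                names.append(name)
    return names


# Records copied from the r.records.serialize() output
records = [{'names': ['donor-π-bridge-acceptor-π']},
           {'names': ['quinoxaline']},
           {'names': ['deep red']}]
names = compound_names(records)
print(names)
```

Not every extracted name is a compound ('deep red' here is not), so some manual or rule-based filtering may still be needed.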

## Things to note

### PDFDataExtractor uses ChemDataExtractor to perform all chemistry-related extraction. For more detailed use cases, please refer to http://chemdataextractor.org

## Known Issues

In ACS
* A few journals use two section title styles at the same time: numbered titles and titles marked with ■. This can confuse the title filtering function because the two styles have very different font sizes, but it does not affect reference extraction
* Extracted references might not be in order
* Parts of an extracted reference can be missing

In Elsevier
* Journal extraction is potentially weak, which can lead to missing journal information
* Unnumbered references can be messy

In RSC
* Title can be missing
* Journal year, volume and page numbers can be missing in certain articles
* Some section titles can be missed, but reference extraction remains reliable


In Advanced Materials family
* Reference entries can be mixed
* Keywords can be found inside reference entries (roughly 1 in 20)
* Some authors place their bios at the very end of the article; this text is not excluded from the references at the moment

In Chemistry A European Journal (CAEJ)
* Keywords can be incomplete

In Angewandte
* Keywords might not be in order