
# Extracting Text from PDF Files

Let's look at how to extract text from a PDF file, using the [`pdfx`](https://www.metachris.com/pdfx/) library in Python.

First we need to install the library:

In [1]:
!pip install pdfx

Collecting pdfx
  Downloading pdfx-1.4.1-py2.py3-none-any.whl (21 kB)
Collecting pdfminer.six==20201018 (from pdfx)
  Downloading pdfminer.six-20201018-py3-none-any.whl (5.6 MB)
Collecting chardet==4.0.0 (from pdfx)
  Downloading chardet-4.0.0-py2.py3-none-any.whl (178 kB)
Installing collected packages: chardet, pdfminer.six, pdfx
  Attempting uninstall: chardet
    Found existing installation: chardet 5.2.0
    Uninstalling chardet-5.2.0:
      Successfully uninstalled chardet-5.2.0
Successfully installed chardet-4.0.0 pdfminer.six-20201018 pdfx-1.4.1


Next, let's work with an example from the corpus in the [Rich Context leaderboard competition](https://github.com/Coleridge-Initiative/rclc/blob/master/corpus.ttl) – a machine learning competition about parsing named entities from PDFs of open access research publications.

The following snippets in [TTL format](https://en.wikipedia.org/wiki/Turtle_(syntax)) show a research paper `publication-7aa3d69253e37668541c` hosted on [EuropePMC](https://europepmc.org/) that has a known link to a dataset `dataset-0a7b604ab2e52411d45a` hosted by the [Centers for Disease Control and Prevention](https://wwwn.cdc.gov/nchs/nhanes/).

```
:publication-7aa3d69253e37668541c
  rdf:type :ResearchPublication ;
  foaf:page "http://europepmc.org/articles/PMC3001474"^^xsd:anyURI ;
  dct:publisher "PLoS One" ;
  dct:title "VKORC1 common variation and bone mineral density in the Third National Health and Nutrition Examination Survey" ;
  dct:identifier "10.1371/journal.pone.0015088" ;
  :openAccess "http://europepmc.org/articles/PMC3001474?pdf=render"^^xsd:anyURI ;
  cito:citesAsDataSource :dataset-0a7b604ab2e52411d45a ;
.

:dataset-0a7b604ab2e52411d45a
  rdf:type :Dataset ;
  foaf:page "https://wwwn.cdc.gov/nchs/nhanes/"^^xsd:anyURI ;
  dct:publisher "Centers for Disease Control and Prevention" ;
  dct:title "National Health and Nutrition Examination Survey" ;
  dct:alternative "NHANES" ;
  dct:alternative "NHANES I" ;
  dct:alternative "NHANES II" ;
  dct:alternative "NHANES III" ;
.
```

The paper is:

  * ["VKORC1 common variation and bone mineral density in the Third National Health and Nutrition Examination Survey"](http://europepmc.org/articles/PMC3001474); Dana C. Crawford, Kristin Brown-Gentry, Mark J. Rieder; _PLoS One_. 2010; 5(12): e15088.

We'll use `pdfx` to download the PDF file directly from the open access URL:

In [4]:
import pdfx

pdf = pdfx.PDFx("http://europepmc.org/articles/PMC3001474?pdf=render")

pdf

<pdfx.PDFx at 0x79732903e2c0>

Next, call the `get_text()` method on the `pdf` object to extract the text:

In [5]:
text = pdf.get_text()
text

'VKORC1Common Variation and Bone Mineral Density in\nthe Third National Health and Nutrition Examination\nSurvey\n\nDana C. Crawford1,2*, Kristin Brown-Gentry1, Mark J. Rieder3\n\n1 Center for Human Genetics Research, Vanderbilt University, Nashville, Tennessee, United States of America, 2 Department of Molecular Physiology and Biophysics,\nVanderbilt University, Nashville, Tennessee, United States of America, 3 Department of Genome Sciences, University of Washington, Seattle, Washington, United States of\nAmerica\n\nAbstract\n\nOsteoporosis, defined by low bone mineral density (BMD), is common among postmenopausal women. The distribution of\nBMD varies across populations and is shaped by both environmental and genetic factors. Because the candidate gene\nvitamin K epoxide reductase complex subunit 1 (VKORC1) generates vitamin K quinone, a cofactor for the gamma-\ncarboxylation of bone-related proteins such as osteocalcin, we hypothesized that VKORC1 genetic variants may be\nassociated
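The extracted string keeps the PDF's hard line breaks, including words split across lines such as `gamma-\ncarboxylation`. The rest of this notebook works on the raw text, but a minimal cleanup sketch could look like the following. The helper name `clean_pdf_text` is hypothetical, and the rule of keeping the hyphen when rejoining is an assumption that suits compounds like this one but not ordinary end-of-line hyphenation.

```python
import re

def clean_pdf_text(raw: str) -> str:
    """Hypothetical helper: tidy the line breaks in text extracted from a PDF."""
    # Rejoin words split across lines after a hyphen. The hyphen is kept
    # (an assumption), which is right for "gamma-carboxylation" but would be
    # wrong for a word that was only hyphenated to fit the line.
    joined = re.sub(r"(\w)-\n(\w)", r"\1-\2", raw)
    # Treat blank lines as paragraph breaks and preserve them.
    paragraphs = re.split(r"\n\s*\n", joined)
    # Inside each paragraph, collapse remaining newlines and runs of spaces.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)

# A short sample in the same shape as the pdfx output above.
sample = (
    "vitamin K quinone, a cofactor for the gamma-\n"
    "carboxylation of bone-related proteins such as osteocalcin\n\n"
    "Abstract\n\n"
    "Osteoporosis, defined by low bone mineral density (BMD), is common"
)
print(clean_pdf_text(sample))
```

In the notebook, the same helper could be applied as `clean_pdf_text(text)` before passing the result to a parser.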

Now we can use `spaCy` to parse that text:

In [6]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

Let's look at a dataframe of the parsed tokens:

In [7]:
import pandas as pd

cols = ("text", "lemma", "POS", "explain", "stopword")
rows = []

for t in doc:
    row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
    rows.append(row)

df = pd.DataFrame(rows, columns=cols)
df

| | text | lemma | POS | explain | stopword |
|---|---|---|---|---|---|
| 0 | VKORC1Common | VKORC1Common | PROPN | proper noun | False |
| 1 | Variation | Variation | PROPN | proper noun | False |
| 2 | and | and | CCONJ | coordinating conjunction | True |
| 3 | Bone | Bone | PROPN | proper noun | False |
| 4 | Mineral | Mineral | PROPN | proper noun | False |
| ... | ... | ... | ... | ... | ... |
| 9903 | Issue | Issue | PROPN | proper noun | False |
| 9904 | 12 | 12 | NUM | numeral | False |
| 9905 | \| | \| | NOUN | noun | False |
| 9906 | e15088 | e15088 | NOUN | noun | False |


The parsed tokens include extraction artifacts that could be cleaned up (for example `VKORC1Common`, where two words were fused), but for this demo, let's run *named entity recognition* in `spaCy` to extract the entities:

In [8]:
for ent in doc.ents:
    print(ent.text, ent.label_)

the Third National Health and Nutrition Examination
Survey

 ORG
Dana C. Crawford1,2* PERSON
Kristin Brown-Gentry1 PERSON
Mark J. Rieder3 PERSON
Center for Human Genetics Research ORG
Nashville GPE
Tennessee GPE
United States of America GPE
Department of Molecular Physiology and Biophysics ORG
Vanderbilt University ORG
Nashville GPE
Tennessee GPE
United States of America GPE
3 Department of Genome Sciences ORG
University of Washington ORG
Seattle GPE
Washington GPE
United States GPE
BMD ORG
K ORG
1 CARDINAL
VKORC1 PERSON
BMD ORG
six CARDINAL
VKORC1 PERSON
7,159 CARDINAL
the Third National Health and Nutrition Examination Survey ORG
BMD ORG
DEXA ORG
four CARDINAL
rs9923231 PERSON
BMD ORG
0.039 CARDINAL
0.024 CARDINAL
rs8050894 PERSON
BMD ORG
0.016 CARDINAL
non-Hispanic NORP
619 CARDINAL
VKORC1 rs2884737 PERSON
BMD ORG
Mexican NORP
795 CARDINAL
0.004 CARDINAL
VKORC1 PERSON
VKORC1 PERSON
BMD ORG
first ORDINAL
VKORC1 PERSON
BMD ORG
one CARDINAL
BMD ORG
Crawford DC GPE
Brown-Gentry K ORG
Ri

Great, that identified multiple mentions of the _NHANES_ dataset:

  * `the Third National Health and Nutrition Examination Survey` _ORG_
  * `NHANES III` _PERSON_
  
The default labels aren't correct (a dataset is neither an organization nor a person), but we could [update the Named Entity Recognizer](https://spacy.io/usage/training#ner) in `spaCy` to fix that.
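Short of retraining, a simple baseline is to search the text for the dataset names already listed in the corpus entry. The sketch below is not part of the original workflow: it uses only Python's `re` module, the helper name `find_dataset_mentions` is hypothetical, and the name list is copied from the `dct:title` and `dct:alternative` fields shown earlier.

```python
import re

# Known names for the dataset, copied from the dct:title and
# dct:alternative fields of the corpus entry.
DATASET_NAMES = [
    "National Health and Nutrition Examination Survey",
    "NHANES",
    "NHANES I",
    "NHANES II",
    "NHANES III",
]

def find_dataset_mentions(text, names=DATASET_NAMES):
    """Hypothetical helper: return (start, end, matched_text) per mention."""
    # Try longer names first so "NHANES III" wins over "NHANES".
    ordered = sorted(names, key=len, reverse=True)
    # Allow any whitespace, including line breaks, between the words of a
    # name, because PDF text often wraps in the middle of a title.
    alternatives = [
        r"\s+".join(re.escape(word) for word in name.split())
        for name in ordered
    ]
    pattern = re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

# A sample with a title wrapped across two lines, as in the extracted text.
sample = (
    "the Third National Health and Nutrition Examination\n"
    "Survey (NHANES III)"
)
for start, end, matched in find_dataset_mentions(sample):
    # Normalize whitespace for display only; offsets refer to the raw text.
    print(start, end, " ".join(matched.split()))
```

Exact matching like this misses misspellings and unlisted abbreviations, so it complements a trained recognizer but does not replace one.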