# Parse PubMed data in order to get plain text of relevant papers

PubMed data can be obtained interactively through the URLs below. The idea is to select papers by keyword (and perhaps by date, ...) in a first query that returns PMC IDs. The full text of those publications can then be obtained one by one.

These texts can be analyzed by scispacy language models that include POS tagging for biological/scienti

In [1]:
import os
import pandas as pd
import numpy as np
import scispacy
import spacy

from utils import perform_query, extract_clean, retrieve_paper

## Obtain and process data
Steps:
1. Define a query to search papers
2. Call API with that query to obtain all PMC IDs for that query
3. Read the full content of those papers one by one
4. Pre-process the full text
5. Run language model on clean text of body of paper
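
Steps 1 and 2 can be sketched as follows. This is not the `perform_query` helper imported from `utils` (its source is not shown here). It is a minimal illustration that assumes the NCBI E-utilities `esearch` endpoint with `db=pmc` and a JSON response. The names `build_esearch_url` and `parse_pmc_ids` are hypothetical, and the network call itself is left out so that the parsing logic can be checked offline.

```python
import json
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities search endpoint; the notebook's own
# perform_query helper may use a different URL or different parameters.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, retmax=20):
    """Build a search URL that asks the PMC database for matching record IDs."""
    params = {"db": "pmc", "term": query, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_pmc_ids(response_text):
    """Extract the IDs from an esearch JSON response and add the 'PMC' prefix."""
    payload = json.loads(response_text)
    id_list = payload.get("esearchresult", {}).get("idlist", [])
    return [f"PMC{uid}" for uid in id_list]


# Offline example with a hand-written response in the esearch JSON layout
example_response = '{"esearchresult": {"idlist": ["8796501", "8795820"]}}'
example_url = build_esearch_url("trichome")
example_ids = parse_pmc_ids(example_response)
print(example_url)
print(example_ids)
```

Keeping URL construction and response parsing in separate functions means either one can be swapped out if the real helper turns out to query a different endpoint.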

In [2]:
query = 'trichome'
IDs = perform_query(query)

savetext = True  # Set True if clean text of paper needs to be saved to a text file
os.makedirs('papers', exist_ok=True)  # Make sure the cache directory exists before writing to it

all_papers = []
for ID in IDs:
    fname = f'papers/cleantext_{ID}.txt'
    
    if os.path.exists(fname):
        with open(fname, 'r') as f:
            cleantext = f.read()
    else:  
        content = retrieve_paper(ID)
        cleantext = extract_clean(content)
        if savetext:
            with open(fname, 'w') as f:
                f.write(cleantext)
                
    all_papers.append(cleantext)
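
The read-from-disk-or-download pattern in the loop above could be factored into a small helper. The sketch below is one possible shape, not part of the notebook's `utils` module. `load_or_compute` and `fake_fetch` are hypothetical names, `fake_fetch` stands in for `retrieve_paper` followed by `extract_clean`, and `PMC0000000` is a placeholder ID.

```python
import tempfile
from pathlib import Path


def load_or_compute(path, compute, save=True):
    """Return the text cached at `path`, or call `compute()` and optionally cache it."""
    path = Path(path)
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = compute()
    if save:
        # Create the parent directory on first use so the write cannot fail
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")
    return text


# Usage with a stand-in for retrieve_paper + extract_clean
calls = []


def fake_fetch():
    calls.append(1)  # record each call so cache hits can be told apart
    return "clean text of the paper"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "papers" / "cleantext_PMC0000000.txt"
    first = load_or_compute(target, fake_fetch)   # computes and writes the file
    second = load_or_compute(target, fake_fetch)  # reads the cached file
```

Passing the expensive step as a callable keeps the helper independent of how a paper is actually retrieved.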
    

In [3]:
nlp = spacy.load("en_ner_bionlp13cg_md")

In [4]:
docs = [nlp(text) for text in all_papers]
len(docs)

['PMC8796501',
 'PMC8795820',
 'PMC8733389',
 'PMC8778064',
 'PMC8716684',
 'PMC8670628',
 'PMC8671615',
 'PMC8590222',
 'PMC8640136',
 'PMC8514689',
 'PMC8640987']

In [6]:
df_all = pd.DataFrame()
for i,doc in enumerate(docs):
    # For every entity ('word') you have the actual word, its position in the text, its class/label and the paper where it comes from
    df_entities = pd.DataFrame({'text':[ent.text for ent in doc.ents], 
                            'start':[ent.start_char for ent in doc.ents], 
                            'label':[ent.label_ for ent in doc.ents],
                            'pmcid': [IDs[i]] * len(doc.ents)}
                           )
    
    df_all = pd.concat([df_all, df_entities ])

In [9]:
with open('exclude_terms.txt', 'r') as f:
    terms = f.readlines()
    terms = [t.replace('\n','') for t in terms]



In [10]:
def remove_unwanted_terms_and_select_genes(df):
    # Work on a copy of the selection so that adding a column does not
    # raise pandas' SettingWithCopyWarning
    df = df.query("label == 'GENE_OR_GENE_PRODUCT'").copy()
    df["extra"] = 1
    for t in terms:
        # Set the flag to 0 for every entity whose text contains the excluded term
        df["extra"] = df["extra"] * np.array([t not in text_from_ent for text_from_ent in df.text])

    return df[df.extra == 1]

df_filtered = remove_unwanted_terms_and_select_genes(df_all)


In [11]:
df_filtered

index,text,start,label,pmcid,extra
18,GL1,1330.0,GENE_OR_GENE_PRODUCT,PMC8796501,1
19,GL2,1335.0,GENE_OR_GENE_PRODUCT,PMC8796501,1
20,GL3,1340.0,GENE_OR_GENE_PRODUCT,PMC8796501,1
21,TESTA,1358.0,GENE_OR_GENE_PRODUCT,PMC8796501,1
22,TTG1,1372.0,GENE_OR_GENE_PRODUCT,PMC8796501,1
...,...,...,...,...,...
357,cuticular fatty acid,24537.0,GENE_OR_GENE_PRODUCT,PMC8640987,1
361,LAAPPI-MS,24919.0,GENE_OR_GENE_PRODUCT,PMC8640987,1
379,LAAPPI-MS,25745.0,GENE_OR_GENE_PRODUCT,PMC8640987,1
384,LAAPPI-MS,26012.0,GENE_OR_GENE_PRODUCT,PMC8640987,1
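
The substring filter in `remove_unwanted_terms_and_select_genes` could also be written without the helper column and the Python-level loop. The sketch below is an alternative under stated assumptions, not the notebook's method: `filter_gene_entities` is a hypothetical name, the example rows are made up, and, unlike the original, empty lines in the terms file are skipped (an empty term is a substring of every string and would otherwise remove every row).

```python
import re

import pandas as pd


def filter_gene_entities(df, exclude_terms):
    """Keep GENE_OR_GENE_PRODUCT rows whose text contains none of the excluded terms."""
    genes = df[df["label"] == "GENE_OR_GENE_PRODUCT"]
    terms = [t for t in exclude_terms if t]  # ignore empty lines from the terms file
    if not terms:
        return genes.copy()
    # Escape each term so it is matched literally, as with Python's `in` operator
    pattern = "|".join(re.escape(t) for t in terms)
    unwanted = genes["text"].str.contains(pattern, regex=True)
    return genes[~unwanted].copy()


# Small made-up example; the rows are illustrative, not taken from the papers
example = pd.DataFrame({
    "text": ["GL1", "Fig. S6", "auxin", "TTG1"],
    "label": ["GENE_OR_GENE_PRODUCT", "GENE_OR_GENE_PRODUCT",
              "SIMPLE_CHEMICAL", "GENE_OR_GENE_PRODUCT"],
})
kept = filter_gene_entities(example, ["Fig.", ""])
print(kept["text"].tolist())
```

Returning an explicit copy means later column assignments on the result do not trigger a `SettingWithCopyWarning`.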


In [12]:
# df_entities.head(20)
counts = df_filtered.text.value_counts().sort_values(ascending=False)
counts[:40]

LAAPPI-MS                 25
DGC                       24
BR                        13
alpha bitter acid         11
auxin                     11
cyclins                    9
Gh14-3                     9
GL1                        8
GL2                        8
GL3                        8
C                          7
HlPT1L                     6
Cyclins                    6
HlETC1                     6
Ortmannian                 6
trichome-enriched          6
HpmE031                    6
GIS3                       6
IX                         5
Fig. S6                    5
CDK                        5
MYB-bHLH-WD40              5
pectin                     5
GIS2                       5
TTG1                       5
MYB                        5
AaMYB1                     4
TCL1                       4
cyclin                     4
JinqL                      4
HlMYB5                     4
pectic polysaccharides     4
YFP                        4
CPC                        4
trichome-deple
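
The counts above list `cyclins` and `Cyclins` separately because `value_counts` compares strings exactly. One way to merge such variants is to lowercase the entity text before counting, as in the minimal sketch below. `count_case_insensitive` is a hypothetical helper and the sample input is made up. Folding case is an assumption about what is wanted here and may be too coarse where capitalization distinguishes gene names. Singular and plural forms (`cyclin`, `cyclins`) are still counted separately.

```python
import pandas as pd


def count_case_insensitive(texts):
    """Count entity strings after trimming whitespace and lowercasing."""
    normalized = pd.Series(texts, dtype="object").str.strip().str.lower()
    return normalized.value_counts()


# Made-up input that mimics the case variants seen in the output above
sample = ["cyclins", "Cyclins", "GL1", "cyclins", "GL1", "TTG1"]
merged = count_case_insensitive(sample)
print(merged)
```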

In [None]:
for token in docs[2][100:110]:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

    SPACE _SP compound   False False
Artemisinin artemisinin NOUN NN nsubjpass Xxxxx True False
is be VERB VBZ auxpass xx True True
specifically specifically ADV RB advmod xxxx True False
biosynthesized biosynthesize VERB VBN ROOT xxxx True False
in in ADP IN case xx True True
the the DET DT det xxx True True
trichomes trichome NOUN NNS nmod xxxx True False
of of ADP IN case xx True True
leaves leave NOUN NNS nmod xxxx True False
