**Sentence Parsing**

The goal of this script is to experiment with sentence parsing in order to tightly associate nouns (here, the people named in acknowledgment sections) with the terms that describe them.

In [38]:
#load libraries, initialize spaCy

import os, glob
import numpy as np
import spacy as sp
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup
from spacy import displacy
import joblib
from multiprocessing import Pool
import time

#The 'xx' model is the largest multi-language one. It catches the most names
#The 'en' model does the best job of parsing organizations, and it labels verbs and other parts of speech
# Install the model with `python -m spacy download en`
# Note: the 'en' shortcut works in spaCy 2.x; spaCy 3+ requires a full model name such as 'en_core_web_sm'
nlp = sp.load('en')

In [9]:
# Sample acknowledgment sections; each assignment overwrites the previous one, so only the last sample is parsed
sample = '<ack><title>Acknowledgements</title><p>The authors wish to acknowledge Diya Ma, Matthew-Lun Wong, Ka-Long Ko, Ka-Hei Ko and Jin-Peng Lee for their important contributions to the software development.</p><sec id=""FPar1""><title>Funding</title><p id=""Par28"">The work described in this paper was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No.: CUHK 14113214), grants from the Innovation and Technology Commission (Project No: ITS/149/14FP, GHP/028/14SZ, ITS/293/14FP), grants from CUHK Technology and Business Development Fund (Project No.: TBF16MED002, TBF16MED004), a grant from The Science, Technology and Innovation Commission of Shenzhen Municipality (Project No.: CXZZ20140606164105361), and a grant from The Scientific Research Project of Guangdong Province (Project No.: 2014B090901055).</p></sec></ack>'
sample = '<ack id=""ack0010""><title>Acknowledgements</title><p>The authors thank Dr. R Kaneko for the gift of the iSip2 vector; and Mss. T Honma, K Harada, A Morita, and Y Shimoda for providing technical and secretarial assistance. We thank the staff at the Department of Genetic and Behavioral Neuroscience and Bioresource Center, Gunma University Graduate School of Medicine for their critical comments and technical assistance. This study was supported by <funding-source id=""gs1"">Grants-in-Aid for Scientific Research</funding-source> (23115503, 26290002, 15H01415 and 15H05872 to Y.Y.), a Grant-in-Aid for Scientific Research on Innovative Areas (Comprehensive Brain Science Network) (to Y.Y.) from the <funding-source id=""gs2"">Ministry of Education, Culture, Sports, Science and Technology (MEXT)</funding-source> of Japan, a grant from the Co-operative Study Program of the <funding-source id=""gs3"">National Institute for Physiological Sciences</funding-source>, Japan (to Y.Y.), and a grant from the <funding-source id=""gs4"">Takeda Science Foundation</funding-source> (to Y.Y.).</p></ack>'
sample = u'<ack id=""ack0005""><title>Acknowledgments</title><p>The project was supported by a start-up funding provided to the author by the <funding-source id=""gs0005"">Department of Neurology of the University of Utah</funding-source>.</p><p>This project was inspired by studying the work of Dr. Ed Dudek and the results of the initial experiments were discussed with him.</p><p>I am also grateful to Dr. Erika Scholl for her assistance in measuring rat serum osmolarity and to Dr. Noel Carlson for his insightful comments on the manuscript.</p></ack>'

soup = BeautifulSoup(sample,'lxml')
for ele in soup.find_all('title'):
    ele.decompose()
samp_txt = soup.find_all('ack')[0].get_text(separator=' ')

print(samp_txt)
print()
doc = nlp(samp_txt)

for sent in doc.sents:
    if 'fund' in sent.text or 'grant' in sent.text: continue #if it's a funding sentence, we don't care
    print(sent.text)

The project was supported by a start-up funding provided to the author by the  Department of Neurology of the University of Utah . This project was inspired by studying the work of Dr. Ed Dudek and the results of the initial experiments were discussed with him. I am also grateful to Dr. Erika Scholl for her assistance in measuring rat serum osmolarity and to Dr. Noel Carlson for his insightful comments on the manuscript.

This project was inspired by studying the work of Dr. Ed Dudek and the results of the initial experiments were discussed with him.
I am also grateful to Dr. Erika Scholl for her assistance in measuring rat serum osmolarity and to Dr. Noel Carlson for his insightful comments on the manuscript.


That's a good start, but sentences that refer to multiple people still need to be split into the individual parts that belong to each person.
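One possible way to do that split, shown here only as a rough sketch: cut the sentence at the character offset where each person's name begins, so that each segment holds one name and the words that follow it. The helper `split_by_person` is hypothetical and not part of this notebook. It assumes the spans are sorted and would normally come from spaCy's `ent.start_char` / `ent.end_char` on the PERSON entities; here they are found with `str.find` so the example runs without a model.

```python
def split_by_person(text, spans):
    """Split text into one (name, segment) pair per person span.

    spans: list of (start_char, end_char) tuples sorted by start, e.g. taken
    from spaCy's ent.start_char / ent.end_char for PERSON entities (assumption).
    Each segment runs from one person's start to the next person's start.
    """
    result = []
    for i, (start, end) in enumerate(spans):
        # The segment stops where the next name begins, or at the end of the text
        stop = spans[i + 1][0] if i + 1 < len(spans) else len(text)
        result.append((text[start:end], text[start:stop].strip()))
    return result

sentence = ("I am also grateful to Dr. Erika Scholl for her assistance in measuring "
            "rat serum osmolarity and to Dr. Noel Carlson for his insightful comments "
            "on the manuscript.")

# Stand-in for entity offsets: locate each name by plain string search
names = ["Erika Scholl", "Noel Carlson"]
spans = [(sentence.find(n), sentence.find(n) + len(n)) for n in names]

segments = split_by_person(sentence, spans)
for name, segment in segments:
    print(name, "->", segment)
```

The segments still carry trailing connectives such as "and to Dr.", so a split based on the dependency parse would likely be cleaner; this only illustrates the idea.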

In [28]:
for sent in doc.sents:
    if 'fund' in sent.text or 'grant' in sent.text or 'supported' in sent.text or 'financ' in sent.text: continue #if it's a funding sentence, we don't care
    #count named people (PERSON entities)
    sentdoc = nlp(sent.text) #if this needs speed optimization, we can search by start-stop characters instead
    sentppl = [n for n in sentdoc.ents if n.label_ == 'PERSON']
    people_count = len(sentppl)
    if people_count == 0: continue #no named people, disregard
    print(sent.text)

This project was inspired by studying the work of Dr. Ed Dudek and the results of the initial experiments were discussed with him.
I am also grateful to Dr. Erika Scholl for her assistance in measuring rat serum osmolarity and to Dr. Noel Carlson for his insightful comments on the manuscript.
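The comment in the cell above mentions searching by start-stop characters instead of re-parsing each sentence. A minimal sketch of that idea follows, assuming the entities of the full document are already available with character offsets (in spaCy these are `ent.start_char`, `ent.end_char` and `ent.label_`, and a sentence exposes `sent.start_char` and `sent.end_char`). The `Ent` tuple and the `make_ent` and `ents_in_span` helpers are hypothetical stand-ins so that the example runs without a model, and the text is shortened from the sample above.

```python
from typing import List, NamedTuple

class Ent(NamedTuple):
    """Hypothetical stand-in for a spaCy entity span."""
    text: str
    label: str
    start_char: int
    end_char: int

def make_ent(doc_text: str, name: str, label: str) -> Ent:
    """Build an Ent by locating the first occurrence of name in doc_text."""
    start = doc_text.index(name)
    return Ent(name, label, start, start + len(name))

def ents_in_span(ents: List[Ent], start_char: int, end_char: int,
                 label: str = "PERSON") -> List[Ent]:
    """Return entities with the given label lying fully inside [start_char, end_char)."""
    return [e for e in ents
            if e.label == label
            and e.start_char >= start_char
            and e.end_char <= end_char]

# Shortened version of the sample text used above
doc_text = ("This project was inspired by studying the work of Dr. Ed Dudek. "
            "I am also grateful to Dr. Erika Scholl and to Dr. Noel Carlson.")

ents = [make_ent(doc_text, "Ed Dudek", "PERSON"),
        make_ent(doc_text, "Erika Scholl", "PERSON"),
        make_ent(doc_text, "Noel Carlson", "PERSON")]

# Character bounds of the second sentence
sent_start = doc_text.index("I am also")
sent_end = len(doc_text)

people = ents_in_span(ents, sent_start, sent_end)
print([p.text for p in people])
```

With this approach the document is parsed once, and each sentence only filters the existing entity list by its own character bounds.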


In [31]:
#actually run the thing on the real stuff
F_CSV = glob.glob("../source_data/extracted/*.csv")


In [52]:
def parse_df(row, k):
    if k%1000==0:
        #progress report every 1000 rows: row index and wall-clock time
        print(k)
        print(time.asctime())
    
    #missing acknowledgments are read by pandas as NaN (a float), so skip them
    if row.Acknowledgment_Tag is None or isinstance(row.Acknowledgment_Tag, float):
        return []
    
    soup = BeautifulSoup(row.Acknowledgment_Tag,'lxml')

    for ele in soup.find_all('title'):
        ele.decompose()
        
    text = soup.find('ack')
    if text is None:
        return []
    
    text = text.get_text(separator=' ')
    
    doc = nlp(text)
    sentlist=[]
    for sent in doc.sents:
        #check for funding terminology (substring match) and skip the sentence if found
        fundwords = ['fund','grant','supported','financ','award']
        if any(f in sent.text for f in fundwords): continue
        
        #count named people (PERSON entities with at least two words)
        sentdoc = nlp(sent.text) #if this needs speed optimization, we can search by start-stop characters instead
        sentppl = [n for n in sentdoc.ents if n.label_ == 'PERSON' and len(n.text.split(' ')) > 1]
        people_count = len(sentppl)
        if people_count == 0: continue #no named people, disregard
            
        #at least one person, populate a new item and add it to the list
        item = {"filename":row.filename}
        item["Text"] = sent.text
        item["Verbs"] = ';'.join([ word.lemma_ for word in sentdoc if word.pos_ == 'VERB' and not word.is_stop])  
        item["Nouns"] = ';'.join([ word.text for word in sentdoc if word.pos_ == 'NOUN' and not word.is_stop])
        item["Names"] = ';'.join([ ent.text for ent in sentdoc.ents if ent.label_ == 'PERSON' and len(ent.text.split(' ')) > 1 ])
        item["Organizations"] = ';'.join([ ent.text for ent in sentdoc.ents if ent.label_ == 'ORG' ])
        sentlist.append(item)
        
    return sentlist
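The two filters inside `parse_df` are plain string checks, which makes their edge cases easy to see in isolation. The sketch below restates them as standalone helpers. The names `is_funding_sentence` and `is_full_name` and the `ignore_case` option are hypothetical additions for illustration; the keyword list is the one used above.

```python
FUNDWORDS = ['fund', 'grant', 'supported', 'financ', 'award']

def is_funding_sentence(text, fundwords=FUNDWORDS, ignore_case=False):
    """True if any funding keyword occurs as a substring of text.

    ignore_case is an illustrative option; parse_df matches case-sensitively.
    """
    haystack = text.lower() if ignore_case else text
    return any(word in haystack for word in fundwords)

def is_full_name(name):
    """True if the entity text has at least two space-separated parts."""
    return len(name.split(' ')) > 1

examples = [
    "This study was supported by a grant.",
    "Funding was provided by the department.",
    "We thank Dr. Ed Dudek for fundamental insights.",
]
for text in examples:
    print(is_funding_sentence(text), is_funding_sentence(text, ignore_case=True), text)

print(is_full_name("Erika Scholl"), is_full_name("Scholl"))
```

As written in `parse_df`, the match is case-sensitive and substring-based, so a sentence starting with "Funding" is kept while one containing "fundamental" is skipped.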

In [53]:
for f in F_CSV:
    df = pd.read_csv(f, nrows=3000)
    
    #dfunc = joblib.delayed(parse_df)
    #with joblib.Parallel(1) as MP:
    #    dx = MP(dfunc(row,k) for k,row in df.iterrows())
    dx=[]
    for k,row in df.iterrows():
        dx.extend(parse_df(row,k))
        
    dx = pd.DataFrame(dx).set_index('filename')
    f_save = os.path.join("../parsed_data/sentence_parse/",
                          os.path.basename(f))
    dx.to_csv(f_save)
dx  #display the results for the last file processed

0
Tue Sep 11 17:26:12 2018
1000
Tue Sep 11 17:26:37 2018
2000
Tue Sep 11 17:27:07 2018
0
Tue Sep 11 17:27:58 2018
1000
Tue Sep 11 17:28:30 2018
2000
Tue Sep 11 17:29:05 2018
0
Tue Sep 11 17:29:24 2018
1000
Tue Sep 11 17:30:18 2018
2000
Tue Sep 11 17:30:47 2018
0
Tue Sep 11 17:31:24 2018
1000
Tue Sep 11 17:32:13 2018
2000
Tue Sep 11 17:32:47 2018


filename,Names,Nouns,Organizations,Text,Verbs
CALPHAD/PMC4270480.nxml,R. Ganesan;Indira Gandhi Centre,p;values;publication,Atomic Research; C p,"They are also grateful to Dr. R. Ganesan, from...",provide
CALPHAD/PMC4270483.nxml,Stephan Puchegger;Olivia Appay,addition;authors;help;suggestions;measurements...,the Faculty of Physics of the University of Vi...,"In addition, the authors wish to thank Dr. Ste...",wish;thank;want;thank;prepare
CALPHAD/PMC4456117.nxml,H. Flandorfer;St. Puchegger,authors;suggestions;discussions;support;studies,SEM,The authors want to acknowledge Dr. H. Flandor...,want;acknowledge
CASE_(Phila)/PMC6058279.nxml,Maria Carr;Michael Yamashita,contributions,,"We thank Maria Carr, MD, and Michael Yamashita...",thank
CASE_(Phila)/PMC6058397.nxml,Hitoshi Sakuraba,suggestions,Meiji Pharmaceutical University,"We thank Professor Hitoshi Sakuraba, Meiji Pha...",thank
CASE_(Phila)/PMC6058763.nxml,Richard Van Praagh,review;case;insights;understanding;heart;disease,,We wish to gratefully acknowledge the thoughtf...,wish;acknowledge;continue;add
CASE_(Phila)/PMC6058275.nxml,Sheldon Singh;Gideon Cohen;Beth Abramson;Anna ...,information;images;report,,"Sheldon Singh, Gideon Cohen, Beth Abramson, an...",provide;include
CASE_(Phila)/PMC6058759.nxml,Jennifer Staley,edits;report,,We thank Jennifer Staley for edits of this rep...,thank
CASE_(Phila)/PMC6058918.nxml,Troy Jefferies;Dan Andrew Dyar,examinations;work,,"We thank Troy Jefferies and Dan Andrew Dyar, w...",thank;perform
CASE_(Phila)/PMC6058300.nxml,Jennifer Pfaff;Susan Nord;Brian Miller;Brian S...,preparation;manuscript;assistance;figures,Aurora Cardiovascular Services;Aurora Research...,We thank Jennifer Pfaff and Susan Nord of Auro...,thank
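
Each of the `Names`, `Nouns`, `Organizations` and `Verbs` columns stores several values joined with `;`. Below is a short sketch of reading a saved file back and splitting those fields into lists, using only the standard library. The helper `split_fields` is hypothetical, and the sample row is copied from the output above with its truncated text left as shown.

```python
import csv
import io

MULTI_VALUE_COLUMNS = ("Names", "Nouns", "Organizations", "Verbs")

def split_fields(row, columns=MULTI_VALUE_COLUMNS, sep=";"):
    """Return a copy of row with the ';'-joined columns turned into lists.

    Empty fields become empty lists.
    """
    out = dict(row)
    for col in columns:
        value = row.get(col) or ""
        out[col] = value.split(sep) if value else []
    return out

# In-memory stand-in for one of the files written by dx.to_csv(f_save)
saved = io.StringIO(
    "filename,Names,Nouns,Organizations,Text,Verbs\n"
    'CASE_(Phila)/PMC6058918.nxml,Troy Jefferies;Dan Andrew Dyar,examinations;work,,'
    '"We thank Troy Jefferies and Dan Andrew Dyar, w...",thank;perform\n'
)

records = [split_fields(row) for row in csv.DictReader(saved)]
print(records[0]["Names"])
print(records[0]["Organizations"])
print(records[0]["Verbs"])
```

Splitting on `;` assumes no entity text itself contains a semicolon; a different separator may be safer if that cannot be guaranteed.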
