# Antibody Specification

## First Phase: Extract all possible data

### Create Training dataset

**Step 1**  
Read the list of PMID/PMCID pairs from ```pmcids-pmids.txt```.

In [1]:
with open('resources/pmcids-pmids.txt', 'r') as file:
    lines = file.readlines()

Each line holds one PMID and one PMCID, separated by a tab.

In [2]:
list_of_pmids_and_pmcids = []

In [3]:
for line in lines:
    # each line has the form "PMID<TAB>PMCID"; drop the trailing newline first
    sep_line = line.rstrip('\n').split('\t')
    pmid = sep_line[0]
    pmcid = sep_line[1]

    list_of_pmids_and_pmcids.append({'pmid': pmid, 'pmcid': pmcid})
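
As a quick check of the line format, the sketch below parses a single record the same way as the loop above. The sample line reuses the first PMID/PMCID pair shown in the dataset preview later in this notebook; the helper name `parse_id_line` is hypothetical and not part of the original code.

```python
def parse_id_line(line):
    """Split one 'PMID<TAB>PMCID' line into a dict (hypothetical helper)."""
    # rstrip removes the trailing newline (and '\r' for Windows-style files)
    pmid, pmcid = line.rstrip('\r\n').split('\t')
    return {'pmid': pmid, 'pmcid': pmcid}


sample_line = '20723247\tPMC2936283\n'
record = parse_id_line(sample_line)
print(record)
```

Unpacking into exactly two names makes a malformed line fail loudly instead of silently producing a partial record.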

**Step 2**  
Find the snippets in each ```.nxml``` file.

In [4]:
from xml.etree import ElementTree
from tqdm import trange
import pprint
import re
from nltk.tokenize import sent_tokenize
import nltk

In [5]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ploy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Visit only the ```<p>``` elements. The full content of each ```<p>```, including any nested tags, is converted back to a string and the XML tags are stripped out.

From the resulting text, keep the sentences that match one of these regex patterns:
- (S|s)pecific
- (B|b)ackground staining
- (C|c)ross( |-)reactiv
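
To see what the three patterns above accept, here is a minimal sketch that combines them into one alternation and tests a few sentences. The sentences are made up for illustration, and the name `KEYWORD_PATTERN` is an assumption; the notebook itself loops over the patterns one at a time.

```python
import re

# The three patterns from the list above, written with character classes
# and combined into a single alternation (illustrative only).
KEYWORD_PATTERN = re.compile(r'[Ss]pecific|[Bb]ackground staining|[Cc]ross[ -]reactiv')

# Made-up sentences, not taken from the papers.
sentences = [
    'The antibody showed high specificity for the target.',
    'Strong background staining was observed.',
    'No cross-reactivity was detected.',
    'Cells were cultured overnight.',
]

matches = [bool(KEYWORD_PATTERN.search(s)) for s in sentences]
print(matches)
```

Because these are substring matches, a word such as "nonspecific" also matches the first pattern, which may or may not be desired.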

In [6]:
def remove_xml_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)
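
One side effect of serializing a node and stripping tags with a regex is that `ElementTree.tostring()` defaults to ASCII output, so non-ASCII characters come back as character references such as `&#8211;`. The sketch below contrasts that with `itertext()`, which joins the text nodes directly. It is an alternative under the assumption that only the plain text is needed; the sample paragraph is made up.

```python
from xml.etree import ElementTree

# Illustrative paragraph containing an en dash; not taken from the dataset.
node = ElementTree.fromstring('<p>RT\u2013PCR with <italic>anti-X</italic> antibody</p>')

# Approach used in this notebook: serialize, then strip tags with a regex.
# tostring() emits ASCII by default, so the en dash becomes '&#8211;'.
serialized = ElementTree.tostring(node).decode('utf-8')

# Alternative: concatenate all text nodes, keeping the original characters.
plain = ''.join(node.itertext())

print(serialized)
print(plain)
```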

In [7]:
def extract_snippets(text):
    """
    extract snippets from each paragraph
    """
    snippets = []
    define_words = ['(S|s)pecific', '((B|b)ackground staining)', '(C|c)ross( |-)reactiv']
    # split sentences from text
    split_texts = sent_tokenize(text)
    for word in define_words:
        snippet = []
        # find snippet which contains define_words
        for s_index in range(len(split_texts)):
            word_contain = re.findall(r"([^.]*?%s[^.]*\.)" % word, split_texts[s_index])
            if len(word_contain) != 0:
                snip = ''
                if s_index - 1 >= 0:
                    snip = snip + split_texts[s_index-1] + '\n'
                snip = snip + split_texts[s_index] + '\n'
                if s_index + 1 < len(split_texts):
                    snip = snip + split_texts[s_index+1] + '\n'
                
                # check duplicate sentences in snippet
                is_contain = False
                for s_i in range(len(snippet)):
                    if len(snippet[s_i]) < len(snip):
                        if snippet[s_i] in snip:
                            snippet[s_i] = snip
                            is_contain = True
                            break
                if is_contain == False:
                    snippet.append(snip)
        if len(snippet) != 0:
            snippets.append(set(snippet))
    if len(snippets) != 0:
        return snippets
    return None
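
The context window that `extract_snippets` builds (previous sentence, matching sentence, next sentence) can be isolated as a small helper. This is a sketch with a hypothetical function name, `context_window`, and made-up sentences; it works on an already split list and does not call `sent_tokenize`.

```python
def context_window(sentences, index):
    """Join the sentence at `index` with its previous and next neighbours.

    Each sentence is followed by a newline, mirroring the snippet format
    produced in extract_snippets (hypothetical helper).
    """
    start = max(index - 1, 0)               # no previous sentence at the start
    end = min(index + 2, len(sentences))    # no next sentence at the end
    return ''.join(s + '\n' for s in sentences[start:end])


# Made-up sentences for illustration.
sentences = [
    'First sentence.',
    'The antibody is specific.',
    'Third sentence.',
    'Fourth sentence.',
]
window = context_window(sentences, 1)
print(window)
```

Slicing with clamped bounds replaces the two explicit boundary checks while producing the same string.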

In [8]:
snippet_list = []

def find_paragraph(node):
    """
    find snippets in each <p>
    """
    global snippet_list
    if node.tag == 'p':
        # convert all contents in <p> to string
        xml_str = ElementTree.tostring(node).decode('utf-8')
        text = remove_xml_tags(xml_str)

        if node.text is not None:
            snippets = extract_snippets(text)
            if snippets is not None:
                snippet_list.append(snippets)
    for child in node:
        find_paragraph(child)
    
    return snippet_list

In [9]:
def get_snippets(tree):
    """
    get snippets from each file
    """
    global snippet_list
    snippets = []
    node = tree.find('./body')

    for elem in node:
        snippet = find_paragraph(elem)
        snippets.extend(snippet)
        snippet_list = []
        
    if snippets is not None and len(snippets) != 0:
        return snippets
    return None

Path to the directory containing the papers:

In [10]:
resources_path = 'resources/papers_4chunnan/'

In [11]:
def clean_snippet(snip):
    """flatten a snippet to one line"""
    snip = snip.replace('\n', ' ')
    # drop the trailing space left by the final newline
    return snip[:-1]

```outputs``` will hold one dict per snippet; these are saved to a ```.tsv``` file later.

In [12]:
outputs = []

To parse a file, pass an open file handle to ```ElementTree.parse()```.  
It reads the data, parses the XML, and returns an ```ElementTree``` object.
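
A minimal sketch of that call on an in-memory document, using `io.StringIO` as a stand-in for an open file handle. The XML content here is hypothetical and far simpler than a real `.nxml` file; the `./body` lookup is the same one used in `get_snippets`.

```python
import io
from xml.etree import ElementTree

# Hypothetical stand-in for an open .nxml file.
fake_file = io.StringIO(
    '<article><body><sec><p>Some text.</p></sec></body></article>'
)

tree = ElementTree.parse(fake_file)    # returns an ElementTree object
body = tree.find('./body')             # same lookup as in get_snippets
paragraphs = [p.text for p in body.iter('p')]
print(paragraphs)
```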

In [13]:
for index in trange(len(list_of_pmids_and_pmcids), desc='reading and finding snippets in file'):
    with open(resources_path + list_of_pmids_and_pmcids[index]['pmcid'] + '.nxml', 'rt') as file:
        tree = ElementTree.parse(file)
        snippets = get_snippets(tree)
        if snippets is not None:
            for snips in snippets:
                for paragraphs in snips:
                    for paragraph in paragraphs:
                        outputs.append(
                            { 
                              'pmid': list_of_pmids_and_pmcids[index]['pmid'], 
                              'pmcid': list_of_pmids_and_pmcids[index]['pmcid'], 
                              'snippet': clean_snippet(paragraph)
                            }
                        )

reading and finding snippets in file: 100%|██████████| 2223/2223 [14:09<00:00,  2.62it/s]


In [14]:
len(outputs)

22013

**Step 3**  
Write the outputs to a ```.tsv``` file.  
The column layout is ```SID\tAntibody related?\tSpecificity?\tPMID\tPMCID\tSnippet\n```,  
where the ```Antibody related?``` and ```Specificity?``` columns are left empty for later annotation.
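
If a snippet ever contains a tab or a quote character, plain string formatting would shift the columns. As a hedged alternative to the `file.write` calls in the cells that follow, the sketch below uses the standard `csv` module with a tab delimiter, which quotes such fields automatically. The column order follows the header written by the notebook's code; the snippet text is made up, and the in-memory buffer stands in for the output file.

```python
import csv
import io

# One illustrative row; the snippet text is hypothetical and contains a tab.
rows = [
    {'pmid': '20723247', 'pmcid': 'PMC2936283', 'snippet': 'Example\tsnippet text.'},
]

buffer = io.StringIO()  # stand-in for the open output file
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')

# Header in the same order as the notebook's file.write call.
writer.writerow(['SID', 'Antibody related?', 'Specificity?', 'PMID', 'PMCID', 'Snippet'])
for sid, row in enumerate(rows):
    # The two annotation columns are left empty.
    writer.writerow([sid, '', '', row['pmid'], row['pmcid'], row['snippet']])

tsv_text = buffer.getvalue()

# Reading it back recovers the snippet intact, embedded tab included.
parsed = list(csv.reader(io.StringIO(tsv_text), delimiter='\t'))
print(parsed[1])
```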

In [15]:
file = open('train_ex_antibody.tsv', 'a')

In [16]:
file.write('SID\tAntibody related?\tSpecificity?\tPMID\tPMCID\tSnippet\n')

54

In [17]:
for article_index in trange(len(outputs), desc='writing to file '):
    file.write('%d\t\t\t%s\t%s\t%s\n' % (article_index,
                                       outputs[article_index]['pmid'], 
                                       outputs[article_index]['pmcid'], 
                                       outputs[article_index]['snippet']))

writing to file : 100%|██████████| 22013/22013 [00:00<00:00, 347294.23it/s]


In [18]:
file.close()

### Some examples from the training dataset

In [19]:
import pandas as pd
import numpy as np

In [20]:
df = pd.read_csv('train_ex_antibody.tsv', sep='\t')

In [21]:
df.head()

Unnamed: 0,SID,Antibody related?,Specificity?,PMID,PMCID,Snippet
0,0,,,20723247,PMC2936283,To study a functional role of D-glucuronyl C5-...
1,1,,,20723247,PMC2936283,"Thus, our results suggest D-glucuronyl C5-epim..."
2,2,,,20723247,PMC2936283,Recent data reveal that there is extensive cro...
3,3,,,21654676,PMC3137399,Multiplex and quantitative RT&#8211;PCR analys...
4,4,,,22216273,PMC3247256,The family of AUF1 proteins appears to be able...


In [22]:
df.loc[df['PMCID'] == 'PMC3565217']

Unnamed: 0,SID,Antibody related?,Specificity?,PMID,PMCID,Snippet
9,9,,,23390418,PMC3565217,Panx1 has been proposed to fulfill a function ...
10,10,,,23390418,PMC3565217,"For example, in the study of Panx1 knockout mi..."
11,11,,,23390418,PMC3565217,The Diatheva and Dahl antibodies were two of t...
12,12,,,23390418,PMC3565217,(2011) where in situ hybridization images of P...
13,13,,,23390418,PMC3565217,Because Western blots are frequently treated a...
14,14,,,23390418,PMC3565217,Tests of these antibodies on tissue lysates fr...
15,15,,,23390418,PMC3565217,This antibody was reported to show no immunofl...
16,16,,,23390418,PMC3565217,Two of these antibodies have already been char...
17,17,,,23390418,PMC3565217,While not explicitly stated in Locovei et al. ...
18,18,,,23390418,PMC3565217,Common features provide clues as to how Panx1 ...


In [23]:
examples = np.array(df.loc[df['PMCID'] == 'PMC3565217'])

In [24]:
for ex in examples:
    print('---------------------------------------')
    print(ex[5])
    print('---------------------------------------')

---------------------------------------
Panx1 has been proposed to fulfill a function in adaptive/inflammation responses following specific stimuli (Sosinsky et al., 2011). Panx1 channels have been shown to release ATP during gustatory channel response in taste bud cells (Romanov et al., 2007), the activation of the immune response in macrophages (Pelegrin and Surprenant, 2006), T lymphocytes (Schenk et al., 2008), and neurons (Silverman et al., 2009), pressure overload-induced fibrosis in the heart (Nishida et al., 2008) and NMDA receptor epileptiform electrical activity in the hippocampus (Thompson et al., 2008).
---------------------------------------
---------------------------------------
For example, in the study of Panx1 knockout mice by Bargiotas et al. (2011) where in situ hybridization images of Panx1 KO brain tissue were devoid of staining for Panx1 transcripts, the authors state that only one antibody (Penuela et al., 2007) out of six tested showed specificity for Panx1 in 