## spaCy NER

Ref: https://medium.com/@jack.rory.staunton/hi-wuraola-bb4b3967ce08

The NER model in spaCy is a transition-based system based on the chunking model of Lample et al. (2016). Tokens are represented as hashed, embedded representations of the prefix, suffix, shape and lemmatized features of individual words. The data used to train the NER models in scispaCy is described next. Source: https://www.groundai.com/project/scispacy-fast-and-robust-models-for-biomedical-natural-language-processing/1
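
To make the feature description more concrete, here is a toy sketch of hashing lexical features into rows of a fixed-size embedding table. It is not spaCy's implementation: the helper names (`word_shape`, `lexical_features`, `hashed_rows`), the table size, the 3-character affix length and the use of the lowercased word in place of a real lemma are all assumptions made for illustration.

```python
import hashlib

def word_shape(word):
    """Map characters to a coarse shape: X for upper, x for lower, d for digit."""
    out = []
    for ch in word:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)  # punctuation is kept as-is
    return "".join(out)

def lexical_features(word, affix_len=3):
    """Return the four string features named in the text."""
    return {
        "prefix": word[:affix_len],
        "suffix": word[-affix_len:],
        "shape": word_shape(word),
        "lemma": word.lower(),  # assumption: a real pipeline would use a lemmatizer
    }

def hashed_rows(word, table_size=1000):
    """Hash each (feature name, value) pair to a row index of an embedding table."""
    rows = {}
    for name, value in lexical_features(word).items():
        digest = hashlib.md5(f"{name}={value}".encode("utf-8")).hexdigest()
        rows[name] = int(digest, 16) % table_size
    return rows

print(lexical_features("IL-2"))
print(hashed_rows("IL-2"))
```

Words that share a feature value (for example the shape `XX-d`) land on the same row, which is what lets unseen words reuse what was learned for similar ones.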


The main NER model in both released scispaCy packages is trained on the mention spans in the MedMentions dataset (Murty et al., 2018). Because MedMentions was originally designed for entity linking, this model recognizes a wide variety of entity types, as well as non-standard syntactic phrases such as verbs and modifiers, but it does not predict the entity type. For users with more specific requirements around entity types, scispaCy also provides four additional packages, `en_ner_{bc5cdr|craft|jnlpba|bionlp13cg}_md`, with finer-grained NER models trained on BC5CDR (chemicals and diseases; Li et al., 2016), CRAFT (cell types, chemicals, proteins, genes; Bada et al., 2011), JNLPBA (cell lines, cell types, DNAs, RNAs, proteins; Collier and Kim, 2004) and BioNLP13CG (cancer genetics; Pyysalo et al., 2015), respectively.
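
As a quick reference, the coverage described above can be written down as a lookup table. This is only a sketch: the dictionary and the helper `packages_covering` are hypothetical names, and the entries are the prose descriptions from the paragraph above, not the label strings the models emit.

```python
# Entity coverage of the four specialised packages, as listed in the text above.
SPECIALISED_MODELS = {
    "en_ner_bc5cdr_md": {"corpus": "BC5CDR", "covers": ["chemicals", "diseases"]},
    "en_ner_craft_md": {"corpus": "CRAFT", "covers": ["cell types", "chemicals", "proteins", "genes"]},
    "en_ner_jnlpba_md": {"corpus": "JNLPBA", "covers": ["cell lines", "cell types", "DNAs", "RNAs", "proteins"]},
    "en_ner_bionlp13cg_md": {"corpus": "BioNLP13CG", "covers": ["cancer genetics"]},
}

def packages_covering(entity_kind):
    """Return the package names whose description mentions the given entity kind."""
    wanted = entity_kind.lower()
    return sorted(
        name
        for name, info in SPECIALISED_MODELS.items()
        if wanted in (kind.lower() for kind in info["covers"])
    )

print(packages_covering("proteins"))  # ['en_ner_craft_md', 'en_ner_jnlpba_md']
print(packages_covering("diseases"))  # ['en_ner_bc5cdr_md']
```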

In [1]:
import scispacy
import swifter
import pandas as pd
from spacy import displacy
import en_core_sci_md
import en_ner_bc5cdr_md
import en_ner_jnlpba_md
import en_ner_craft_md
import en_ner_bionlp13cg_md
from scispacy.abbreviation import AbbreviationDetector
from scispacy.linking import EntityLinker
from collections import OrderedDict, Counter
from pprint import pprint
from tqdm import tqdm
import regex as re

tqdm.pandas()  # adds .progress_apply to pandas objects

In [None]:
# swifter (installable with pip) makes it easy to apply any function to a
# pandas Series or DataFrame in the fastest available manner.
# It first tries to run the operation in a vectorized fashion.
# Failing that, it automatically decides whether parallel processing
# or a plain pandas apply is faster.
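
A minimal sketch of how swifter is typically called, assuming the package is installed. The function `word_count` and the two-row frame are made up for this illustration, and the code falls back to a plain pandas `apply` when swifter cannot be imported.

```python
import pandas as pd

try:
    import swifter  # noqa: F401  (importing registers the .swifter accessor)
    HAVE_SWIFTER = True
except ImportError:
    HAVE_SWIFTER = False

def word_count(text):
    """Hypothetical per-row function used only for this illustration."""
    return len(str(text).split())

demo = pd.DataFrame(
    {"abstract": ["Viral recombination can impact evolution", "Severe sepsis"]}
)

if HAVE_SWIFTER:
    counts = demo["abstract"].swifter.apply(word_count)
else:
    counts = demo["abstract"].apply(word_count)  # plain pandas fallback

print(counts.tolist())
```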

In [5]:
df = pd.read_csv('capstone_data.csv')
df = df[['title', 'abstract', 'text_body']][:600]
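
The same selection can also be done while reading the file, which avoids loading columns and rows that are discarded straight away. The sketch below uses an in-memory stand-in for the CSV; the `journal` column and the sample values are hypothetical.

```python
import io
import pandas as pd

# Stand-in for 'capstone_data.csv'; the extra 'journal' column is hypothetical.
csv_text = io.StringIO(
    "title,abstract,text_body,journal\n"
    "T1,A1,B1,J1\n"
    "T2,A2,B2,J2\n"
    "T3,A3,B3,J3\n"
)

# usecols keeps only the named columns, nrows caps the number of rows read.
subset = pd.read_csv(csv_text, usecols=["title", "abstract", "text_body"], nrows=2)
print(subset.shape)  # (2, 3)
```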

In [6]:
df.shape

(507, 3)

In [10]:
# A little bit of superficial cleaning

def clean_dataset(text):
    text = str(text)  # missing values (NaN) become the string 'nan'
    text = re.sub(r"\[.*?\]", "", text)  # remove bracketed in-text citations, e.g. [12]
    text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)  # remove lines that begin with a hyperlink
    # some texts start with a run of 'a1111111111' placeholders; drop that first line
    text = re.sub(r'^a1111111111 a1111111111 a1111111111 a1111111111 a1111111111.*[\r\n]*', ' ', text)
    text = re.sub(r' +', ' ', text)  # collapse runs of spaces into one
    return text
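
The hyperlink pattern in `clean_dataset` is anchored with `^`, so it only removes URLs that start a line. If links inside sentences should go too, a variant along these lines could be used. This is a sketch: `strip_urls` is a hypothetical helper, and it uses the standard-library `re`, whose `sub` call has the same form as the `regex` module imported above.

```python
import re

URL_ANYWHERE = re.compile(r"https?://\S+")

def strip_urls(text):
    """Remove http(s) URLs wherever they occur, then collapse repeated spaces."""
    without_urls = URL_ANYWHERE.sub("", str(text))
    return re.sub(r" +", " ", without_urls).strip()

sample = "See https://example.org/paper for details [12]."
print(strip_urls(sample))  # See for details [12].
```

Because `\S+` runs to the next whitespace, punctuation glued to the end of a URL is removed with it.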

In [11]:
df['text_body'] = df['text_body'].apply(clean_dataset)
df['abstract'] = df['abstract'].apply(clean_dataset)

In [12]:
df.head()

Unnamed: 0,title,abstract,text_body
0,Recombination Every Day: Abundant Recombinatio...,Viral recombination can dramatically impact ev...,Introduction\n\nAs increasing numbers of full-...
1,Why can't I visit? The ethics of visitation re...,"Patients want, need and expect that their rela...",Introduction\n\nThe sudden emergence of severe...
2,Prospective evaluation of an internet-linked h...,INTRODUCTION: Critical care physicians may ben...,Introduction\n\nThe rate of expansion of medic...
3,Scanning the horizon: emerging hospital-wide t...,This commentary represents a selective survey ...,Introduction\n\nThis series of articles provid...
4,Characterization of the frameshift signal of E...,The ribosomal frameshifting signal of the mous...,INTRODUCTION\n\nProgrammed −1 ribosomal frames...


In [45]:
# Select the articles whose title contains "treatment"
# Dropping the regex and searching for a plain substring is faster
df_treat = df[df['title'].astype(str).str.contains("treatment", regex=False)
        ].reset_index(drop=True)
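
Note that the substring search is case-sensitive, so a title beginning with "Treatment" would not match. A hedged variant that ignores case and treats missing titles as non-matches is sketched below; the second title is invented for the example.

```python
import pandas as pd

titles = pd.DataFrame({"title": [
    "Diagnosis and treatment of severe sepsis",
    "Treatment options for viral pneumonia",  # hypothetical title
    None,                                     # missing title
]})

# case=False ignores capitalisation; na=False counts missing titles as no match.
mask = titles["title"].str.contains("treatment", case=False, regex=False, na=False)
print(titles[mask].reset_index(drop=True))
```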



In [46]:
#shape of treatment df
df_treat.shape

(3, 3)

In [47]:
df_treat

Unnamed: 0,title,abstract,text_body
0,Phosphatidylserine treatment relieves the bloc...,BACKGROUND: A major determinant of retrovirus ...,Background\n\nMany of the cellular receptors f...
1,Immune reconstitution inflammatory syndrome (I...,The immune reconstitution inflammatory syndrom...,Pathogenesis of IRIS ::: Introduction\n\nDespi...
2,Diagnosis and treatment of severe sepsis,The burden of infection in industrialized coun...,Introduction\n\nSevere sepsis and septic shock...


In [51]:
def display_entities(model, document):
    """Render the entities found in a document and collect the unique (text, label) pairs.

    Args:
        model: An imported spaCy or scispaCy model package (it must expose `load()`).
        document: Text data to be analysed.

    Returns:
        A tuple `(displacy_image, entity_and_label)`. With `jupyter=True` the
        rendering is displayed inline, so `displacy_image` is normally None;
        `entity_and_label` is a set of (entity text, entity label) pairs.
    """
    nlp = model.load()  # loads the pipeline on every call
    doc = nlp(document)
    displacy_image = displacy.render(doc, jupyter=True, style='ent')
    entity_and_label = {(ent.text, ent.label_) for ent in doc.ents}
    return displacy_image, entity_and_label
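
Since `display_entities` calls `model.load()` each time it runs, repeated calls with the same model reload the pipeline. One possible way around this is a small cache, sketched here. `get_pipeline` and `FakeModel` are hypothetical names; `FakeModel` stands in for a package such as `en_ner_bc5cdr_md` so the sketch runs without scispaCy installed.

```python
_PIPELINES = {}

def get_pipeline(model):
    """Return a cached pipeline for `model`, calling model.load() only once."""
    key = getattr(model, "__name__", id(model))
    if key not in _PIPELINES:
        _PIPELINES[key] = model.load()
    return _PIPELINES[key]

class FakeModel:
    """Stand-in for a model package; counts how often load() is called."""
    load_calls = 0

    @classmethod
    def load(cls):
        cls.load_calls += 1
        return lambda text: text.split()  # pretend pipeline: whitespace tokenizer

nlp_first = get_pipeline(FakeModel)
nlp_second = get_pipeline(FakeModel)
print(nlp_first is nlp_second, FakeModel.load_calls)  # True 1
```

Inside `display_entities`, `nlp = get_pipeline(model)` would then replace `nlp = model.load()`.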

In [52]:
document1=df_treat['abstract'][0]
document1

'BACKGROUND: A major determinant of retrovirus host range is the presence or absence of appropriate cell-surface receptors required for virus entry. Often orthologs of functional receptors are present in a wide range of species, but amino acid differences can render these receptors non-functional. In some cases amino acid differences result in additional N-linked glycosylation that blocks virus infection. The latter block to retrovirus infection can be overcome by treatment of cells with compounds such as tunicamycin, which prevent the addition of N-linked oligosaccharides. RESULTS: We have discovered that treatment of cells with liposomes composed of phosphatidylserine (PS) can also overcome the block to infection mediated by N-linked glycosylation. Importantly, this effect occurs without apparent change in the glycosylation state of the receptors for these viruses. This effect occurs with delayed kinetics compared to previous results showing enhancement of virus infection by PS treat

In [53]:
bc5dr_entities = display_entities(en_ner_bc5cdr_md,document1)

In [54]:
# Save the returned entities and their labels in a pandas DataFrame
bc5dr_entities_dataframe = pd.DataFrame(bc5dr_entities[1], columns=['Entity', 'Label'])
# Add a column with the constant name of the NER model
bc5dr_entities_dataframe['Ner_model'] = 'bc5dr'
bc5dr_entities_dataframe

Unnamed: 0,Entity,Label,Ner_model
0,amino acid,CHEMICAL,bc5dr
1,retrovirus infection,DISEASE,bc5dr
2,virus infection,DISEASE,bc5dr
3,tunicamycin,CHEMICAL,bc5dr
4,infection,DISEASE,bc5dr
5,phosphatidylserine,CHEMICAL,bc5dr
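
The same two steps are repeated for every model, so they could be wrapped in a helper. The sketch below is one way to do it: `entities_to_frame` is a hypothetical name, and sorting the pairs is an added choice that makes the row order reproducible, since a set has no fixed order.

```python
import pandas as pd

def entities_to_frame(entity_and_label, model_name):
    """Turn a set of (entity text, label) pairs into a DataFrame tagged with the model name."""
    frame = pd.DataFrame(sorted(entity_and_label), columns=["Entity", "Label"])
    frame["Ner_model"] = model_name
    return frame

# A few pairs copied from the table above.
pairs = {("tunicamycin", "CHEMICAL"), ("infection", "DISEASE"), ("amino acid", "CHEMICAL")}
print(entities_to_frame(pairs, "bc5dr"))
```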


In [56]:

bionlp13cg_entities = display_entities(en_ner_bionlp13cg_md,document1)

In [57]:

# Save the returned entities and their labels in a pandas DataFrame
bionlp13cg_entities_dataframe = pd.DataFrame(bionlp13cg_entities[1], columns=['Entity', 'Label'])
# Add a column with the constant name of the NER model
bionlp13cg_entities_dataframe['Ner_model'] = 'bionlp13cg'
bionlp13cg_entities_dataframe

Unnamed: 0,Entity,Label,Ner_model
0,liposomes,SIMPLE_CHEMICAL,bionlp13cg
1,cells,CELL,bionlp13cg
2,amino acid,AMINO_ACID,bionlp13cg
3,cell-surface receptors,GENE_OR_GENE_PRODUCT,bionlp13cg
4,retroviral,ORGANISM,bionlp13cg
5,retrovirus,ORGANISM,bionlp13cg
6,tunicamycin,SIMPLE_CHEMICAL,bionlp13cg
7,retrovirus host,ORGANISM,bionlp13cg
8,phosphatidylserine,SIMPLE_CHEMICAL,bionlp13cg


In [58]:
craft_entities = display_entities(en_ner_craft_md,document1)

In [60]:

# Save the returned entities and their labels in a pandas DataFrame
craft_entities_dataframe = pd.DataFrame(craft_entities[1], columns=['Entity', 'Label'])
# Add a column with the constant name of the NER model
craft_entities_dataframe['Ner_model'] = 'craft'
craft_entities_dataframe

Unnamed: 0,Entity,Label,Ner_model
0,viruses,TAXON,craft
1,liposomes,GO,craft
2,cells,CL,craft
3,retrovirus,TAXON,craft
4,compounds,CHEBI,craft
5,N-linked oligosaccharides,CHEBI,craft
6,retroviral,TAXON,craft
7,phosphatidylserine,CHEBI,craft
8,virus,TAXON,craft
9,orthologs,SO,craft


In [59]:
jnlpba_entities = display_entities(en_ner_jnlpba_md,document1)

In [62]:
# Save the returned entities and their labels in a pandas DataFrame
jnlpa_entities_dataframe = pd.DataFrame(jnlpba_entities[1], columns=['Entity', 'Label'])
# Add a column with the constant name of the NER model
jnlpa_entities_dataframe['Ner_model'] = 'jnlpa'
jnlpa_entities_dataframe

Unnamed: 0,Entity,Label,Ner_model
0,cell-surface receptors,PROTEIN,jnlpa
1,retroviral receptors,PROTEIN,jnlpa
2,N-linked,PROTEIN,jnlpa
3,functional receptors,PROTEIN,jnlpa


In [63]:
# Concatenate the four DataFrames into one
entities_and_label_from_4_NER_model_dataframe = pd.concat([bc5dr_entities_dataframe, bionlp13cg_entities_dataframe,
                                                           craft_entities_dataframe, jnlpa_entities_dataframe])
# Save the combined DataFrame to CSV
entities_and_label_from_4_NER_model_dataframe.to_csv('entities_and_label_from_4_scispacy_NER_models.csv', index=False)
entities_and_label_from_4_NER_model_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 0 to 3
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Entity     30 non-null     object
 1   Label      30 non-null     object
 2   Ner_model  30 non-null     object
dtypes: object(3)
memory usage: 960.0+ bytes
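
The `info()` output shows index labels running only from 0 to 3 because `pd.concat` keeps each frame's own index, so the labels repeat. A sketch of a variant with a unique index, followed by a simple comparison of how many models tagged each entity string, is given below. It uses a subset of rows copied from the per-model tables above; `combined` and `models_per_entity` are illustrative names.

```python
import pandas as pd

# Rows copied from the per-model tables above (a subset, for brevity).
combined = pd.concat(
    [
        pd.DataFrame({"Entity": ["tunicamycin", "phosphatidylserine"],
                      "Label": ["CHEMICAL", "CHEMICAL"],
                      "Ner_model": "bc5dr"}),
        pd.DataFrame({"Entity": ["tunicamycin", "phosphatidylserine", "cells"],
                      "Label": ["SIMPLE_CHEMICAL", "SIMPLE_CHEMICAL", "CELL"],
                      "Ner_model": "bionlp13cg"}),
        pd.DataFrame({"Entity": ["phosphatidylserine", "cells"],
                      "Label": ["CHEBI", "CL"],
                      "Ner_model": "craft"}),
    ],
    ignore_index=True,  # unique 0..n-1 index instead of repeated per-frame labels
)

# Number of distinct models that tagged each entity string.
models_per_entity = (
    combined.groupby("Entity")["Ner_model"].nunique().sort_values(ascending=False)
)
print(models_per_entity)
```

Entities tagged by several models, such as phosphatidylserine, show how the same span receives a different label from each corpus.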
