# **Text mining problem description**

For my project, I am attempting to train a spacy NER model to recognize miRNA as a named entity. Although finding miRNA within a document is something that could possibly be done with regex matching, spacy offers many features in the NLP pipeline regarding named entities, and in order to utilize these features with a named entity, it must be able to be recognized by spacy, which cannot be done with regex matching (example below). 

![](https://nlpforhackers.io/wp-content/uploads/2018/03/Screen-Shot-2018-03-28-at-12.09.32.png)

Ideally, the miRNA entity would be added to an existing model, such as `scispacy`. By updating `scispacy`'s `en_ner_bionlp13cg_md` model which has a NER for things like `GENE_OR_GENE_PRODUCT`, `ORGANISM`, `PATHOLOGICAL_FORMATION`, and more, we could theoretically obtain the relationship between miRNA and these entities, which could be used on a larger scale to potentially identify genes within literature that may be a ssociated with a given miR to uncover regulatory networks and other associations. 

For the purpose of this assignment, we will not be updating an existing model (due to time constraints), but we will run through the process of trianing a model, and at the end identify other potential improvements for this pipeline that would ultimately provide high level, practical performance.

# 2) **Dataset Description**

To train a NER model in spacy, we need to know the start and end index of the string of interest. Seeing as there is no annotated data in the wild where this exists for miRNA, we will have to make some. We will do this by:

- 1) Fetching miR abstracts from pubmed
- 2) Cleaning the abstracts and sentence tokenizing
- 3) Using Regex to find miRNA and their position in the string
- 4) Format the data to be used by spacy

### 2.1) Install Dependencies

Note, for this exercise, we need to used `spacy 2.2.2`. We will ensure that this is the correct version imported below:

In [1]:
!pip install xlrd
!pip uninstall spacy
!pip install spacy==2.2.2

Uninstalling spacy-2.2.2:
  Would remove:
    /usr/local/bin/spacy
    /usr/local/lib/python3.7/dist-packages/bin/*
    /usr/local/lib/python3.7/dist-packages/spacy-2.2.2.dist-info/*
    /usr/local/lib/python3.7/dist-packages/spacy/*
  Would not remove (might be manually added):
    /usr/local/lib/python3.7/dist-packages/bin/theano_cache.py
    /usr/local/lib/python3.7/dist-packages/bin/theano_nose.py
Proceed (y/n)? y
  Successfully uninstalled spacy-2.2.2
Collecting spacy==2.2.2
  Using cached https://files.pythonhosted.org/packages/b9/01/fcb8ae3e836fea5c11fdb4c074d27b52bdf74b47bd9bb28a811b7ab37d49/spacy-2.2.2-cp37-cp37m-manylinux1_x86_64.whl
Installing collected packages: spacy
Successfully installed spacy-2.2.2


In [2]:
import urllib.request
import io
import gzip
import os
from pathlib import Path
import pandas as pd
try:
    from Bio import Entrez
except ModuleNotFoundError:
    !pip install Bio
    from Bio import Entrez
import re
import random
import spacy

### 2.2) Creating a database of known miRs and accessions numbers to query literature

To standardize things, we will make a key between miR names and accessions numbers. We will submit accession numbers into python, which will be handlesd by this database to get the corresponding miR, which will be used to query pubmed.

In [51]:
response = urllib.request.urlopen('ftp://mirbase.org/pub/mirbase/CURRENT/miRNA.xls.gz')
compressed_file = io.BytesIO(response.read())
decompressed_file = gzip.GzipFile(fileobj=compressed_file)

with open(Path(os.getcwd(), 'miRNA.xlsx'), 'wb') as outfile:
    outfile.write(decompressed_file.read())

mir_database = pd.read_excel('miRNA.xlsx')

mir_database_1 = mir_database.loc[:, ['Accession', 'ID']]
mir_database_2 = mir_database.loc[:, ['Mature1_Acc', 'Mature1_ID']].rename(columns = {'Mature1_Acc':'Accession', 'Mature1_ID':'ID'})
mir_database_3 = mir_database.loc[:, ['Mature2_Acc', 'Mature2_ID']].rename(columns = {'Mature2_Acc':'Accession', 'Mature2_ID':'ID'})

final_database = pd.concat([mir_database_1, mir_database_2, mir_database_3])

final_database.head()

Unnamed: 0,Accession,ID
0,MI0000001,cel-let-7
1,MI0000002,cel-lin-4
2,MI0000003,cel-mir-1
3,MI0000004,cel-mir-2
4,MI0000005,cel-mir-34


### 2.3) Define functions to 1) fetch abstracts from pubmed, and 2) clean the abstracts and standardize their format, and finally 3) query pubmed and clean the abstract using functions 1 & 2. 

In [5]:
# Function 1

def fetch_abstract(pmid):
    handle = Entrez.efetch(db='pubmed', id = pmid, retmode='xml')
    article = Entrez.read(handle)['PubmedArticle'][0]['MedlineCitation']['Article']
    if 'Abstract' in article:
            return article['Abstract']['AbstractText']
        
# Function 2

def concat_article(x):
    final_article = str()
    for i in range(len(x)):
        final_article = final_article + str(x[i]) + ' '
    return final_article

# Function 3

def get_literature(user_mir):
    
    filtered_database = final_database[final_database['Accession']  == user_mir]['ID']
    filtered_database = final_database[final_database['Accession']  == user_mir]['ID']

    if filtered_database.size == 1:
        mir = filtered_database.iloc[0]
        print('The accession number ' + user_mir + ' corresponds to miR ' + mir)
    else:
        print('miR accession is incorrect. Try again (caps sensitive)')

    Entrez.email = 'anonymous@gmail.com'
    esearch_query = Entrez.esearch(db="pubmed", term="mir-100", retmode="xml")
    esearch_result = Entrez.read(esearch_query)
    pmid_list = esearch_result['IdList']
    print("pmid's obtained: " + str(len(pmid_list)))
    
    abs_list = []

    for i in pmid_list:
        abs = fetch_abstract(i)
        abs_list.append(abs)
        
    abs_list = [concat_article(i) for i in abs_list if i is not None]
    
    return(abs_list)

### 2.4) Fetching Abstracts

Now, we will maually enter the accession number of 10 miRNA to use to get literature on using our functions.

In [6]:
training_mir = ['MI0000692', 'MI0000159', 'MI0000172', 'MI0000406', 'MI0000111', 'MI0000684', 'MI0000256', 'MI0000170', 'MI0000268', 'MI0002470']

Getting literature:

In [7]:
all_abstracts = []

for i in training_mir:
    abstracts = get_literature(i)
    all_abstracts = all_abstracts + abstracts

The accession number MI0000692 corresponds to miR mmu-mir-100
pmid's obtained: 20
The accession number MI0000159 corresponds to miR mmu-mir-133a-1
pmid's obtained: 20
The accession number MI0000172 corresponds to miR mmu-mir-150
pmid's obtained: 20
The accession number MI0000406 corresponds to miR mmu-mir-106a
pmid's obtained: 20
The accession number MI0000111 corresponds to miR hsa-mir-105-1
pmid's obtained: 20
The accession number MI0000684 corresponds to miR mmu-mir-107
pmid's obtained: 20
The accession number MI0000256 corresponds to miR mmu-mir-122
pmid's obtained: 20
The accession number MI0000170 corresponds to miR mmu-mir-146a
pmid's obtained: 20
The accession number MI0000268 corresponds to miR hsa-mir-34a
pmid's obtained: 20
The accession number MI0002470 corresponds to miR hsa-mir-486-1
pmid's obtained: 20


# 3) **Text Processing**

### 3.1) Sentence Tokenization

We will split all the abstracts into sentence tokens, seeing as this is standard practice when creating a spacy NER.

In [8]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')

all_sentences = []

for mir_abs in all_abstracts:
    abstr_sentences = sent_tokenize(mir_abs)
    all_sentences = all_sentences + abstr_sentences


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### 3.2) Making training and testing data

From our sentences, we will first randomly shuffle them:

In [9]:
random.shuffle(all_sentences)

And then we will allocate 80% for training, and 20% for testing:

In [10]:
training_sentences = all_sentences[0:int(.8 * len(all_sentences))]
testing_sentences = all_sentences[int(.8 * len(all_sentences)):len(all_sentences)]

### 3.3) Labelling the training data

Now, we will use a regex to find miRNA occurances in a sentence. For each occurance we find, we will also locate the string index that corresponds to the start and end of the string, which is required by spacy.

In [11]:
def make_training_data(string):
    if len([i for i in re.finditer('mir-\d+[^\s|.|,|!|?| |:|;]*', string.lower())]) != 0:
        ent_list = []
        for i in re.finditer('mir-\d+[^\s|.|,|!|?| |:|;]*', string.lower()):
            ent_code = (i.start(), i.end(), 'miR')
            ent_list.append(ent_code)
            
    else:
        ent_list = []      
    return((string, {'entities' : ent_list}))    

In [12]:
training_data = [make_training_data(i) for i in training_sentences]

An example of the format of the input data:

In [52]:
training_data[0:3]

[('Using TarBase V8 in DIANA tools, we acquired 1,520 potential targets (mRNA) from the five key DE-miRNAs, among which the159 DE-mRNAs also included 11 DEGs.',
  {'entities': []}),
 ('In conclusion, miR-100, miR-125b, miR-199a and miR-194 may have potential as prognostic and diagnostic biomarkers for GC.',
  {'entities': [(15, 22, 'miR'),
    (24, 32, 'miR'),
    (34, 42, 'miR'),
    (47, 54, 'miR')]}),
 ('This study aimed to identify differentially expressed (DE) miRNAs in breast cancer using the Cancer Genome Atlas.',
  {'entities': []})]

### 3.4) Data Exploration

Pick up here

In [14]:
!pip install spacy-lookups-data



In [15]:
def train_spacy(data,iterations):
    TRAIN_DATA = data
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
       

    # add labels
    for _, annotations in TRAIN_DATA:
         for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)
    return nlp

In [16]:
miRnlp = train_spacy(training_data, 10)

Starting iteration 0
{'ner': 724.416278879913}
Starting iteration 1
{'ner': 95.70888923383187}
Starting iteration 2
{'ner': 107.7119142875932}
Starting iteration 3
{'ner': 37.14966305430066}
Starting iteration 4
{'ner': 39.662877787169094}
Starting iteration 5
{'ner': 138.27446647558605}
Starting iteration 6
{'ner': 52.25174629664164}
Starting iteration 7
{'ner': 118.52414245554984}
Starting iteration 8
{'ner': 13.692096061201951}
Starting iteration 9
{'ner': 31.45547221134935}


In [17]:
for i in testing_sentences[0:5]:
    print(i)
    doc = miRnlp(i)
    print("Entities", [(ent.start_char, ent.end_char, ent.text, ent.label_) for ent in doc.ents])
    print()

Here, a biosensor CPs/AuNP-AuE, the gold nanoparticle (AuNP)-modified Au electrode (AuE) which was coupled with DNA capture probes (CPs), was developed to detect the content of miR-100 in the sera of GC patients.
Entities [(177, 184, 'miR-100', 'miR')]

The aim of the current study was to evaluate whether hyperglycemia is able to affect the expression of selected miRNAs in VAT of prediabetic (IFG) and diabetic (T2DM) patients vs. normoglycemic (NG) subjects using qPCR.
Entities []

MiR-100, miR-125b and miR-199a predicted poor prognosis in GC, while miR-194 predicted favorable prognosis in GC.
Entities [(0, 7, 'MiR-100', 'miR'), (9, 17, 'miR-125b', 'miR'), (22, 30, 'miR-199a', 'miR'), (69, 76, 'miR-194', 'miR')]

Enrichment analyses indicated involvement of 11 top DE miRNAs in oxidative stress, inflammation and insulin signaling.
Entities []

For this purpose, MCF-7 and MDA-MB-435 cells were seeded different number in E-plate 16 for proliferation experiment using an electrical impedanc

In [18]:
def get_model_ents(sent):
  doc = miRnlp(sent)
  ent_list = []
  for ent in doc.ents:
    ent_code = str(ent.start_char) + '_' + str(ent.end_char) + '_' + str(ent.label_)
    ent_list.append(ent_code)

  return(ent_list)


# {'entities': [(58, 68, 'miR')]})]

In [19]:
[get_model_ents(i) for i in testing_sentences[0:10]]

[['177_184_miR'],
 [],
 ['0_7_miR', '9_17_miR', '22_30_miR', '69_76_miR'],
 [],
 [],
 [],
 [],
 ['103_113_miR', '115_125_miR', '127_137_miR', '143_154_miR'],
 [],
 ['55_64_miR']]

In [20]:
all_ent_predictions = [get_model_ents(i) for i in testing_sentences]

In [21]:
def check_testing_data(string):
    if len([i for i in re.finditer('mir-\d+[^\s|.|,|!|?| |:|;]*', string.lower())]) != 0:
        ent_list = []
        for i in re.finditer('mir-\d+[^\s|.|,|!|?| |:|;]*', string.lower()):
            ent_code = str(i.start()) + '_' + str(i.end()) + '_' + 'miR'
            ent_list.append(ent_code)
            
    else:
        ent_list = []      
    return(ent_list)  

In [22]:
[check_testing_data(i) for i in testing_sentences[0:10]]

[['177_184_miR'],
 [],
 ['0_7_miR', '9_17_miR', '22_30_miR', '69_76_miR'],
 [],
 [],
 [],
 [],
 ['103_113_miR', '115_125_miR', '127_137_miR', '143_154_miR'],
 [],
 ['55_64_miR']]

In [23]:
all_correct_labs = [check_testing_data(i) for i in testing_sentences]

In [43]:
true_positive = 0
false_positive = 0
false_negative = 0

for entry in range(len(all_correct_labs)):

  for pred_label in range(len(all_ent_predictions[entry])):
    true_positive += all_ent_predictions[entry][pred_label] in all_correct_labs[entry]
    false_positive += all_ent_predictions[entry][pred_label] not in all_correct_labs[entry]
  
  for true_label in range(len(all_correct_labs[entry])):
    false_negative += all_correct_labs[entry][pred_label] not in all_ent_predictions[entry]

![](https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/350px-Precisionrecall.svg.png)

In [48]:
miR_precision = true_positive / (true_positive + false_positive)
miR_recall = true_positive / (true_positive + false_negative)

print("Precision: " + str(miR_precision))
print("Recall: " + str(miR_recall))

Precision: 1.0
Recall: 1.0


![](https://miro.medium.com/max/1530/1*wUdjcIb9J9Bq6f2GvX1jSA.png)

In [50]:
miR_f1 = 2 * ((miR_precision * miR_recall) / (miR_precision + miR_recall))

print("F1 Score: " + str(miR_f1))

F1 Score: 1.0
