# **Biof395 Final Project**
Shane Chambers

## **Overview and Description**

The goal of this project is to make a dynamic application that will summarize the literature findings surrounding a miRNA that the user inputs into the program. This will require the program to:
- Receive user input
- Query and download the relevant literature
- Process the literature
- Transform the literature into feature representations
- Build a text mining model
- Evaluate the performance of the model

We will work through each aspect of this process below.

## **Receive User Input**

First, we will prompt the user to enter the [miRBase](http://www.mirbase.org/) accession of their miRNA of interest. This is a unique code associated with every known miRNA that standardizes the nomenclature to avoid confusion.

In [1]:
#user_mir = input('Enter the miRbase accession number of your miR of interest:')

# For this example, we will enter the accession of mmu-mir-100; MI0000692

user_mir = 'MI0000692'
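Before querying the database, it can help to check that the input at least looks like an accession. The sketch below is not part of the original project: `normalize_accession` and `ACCESSION_PATTERN` are hypothetical names, and the pattern assumes accessions take the form `MI` or `MIMAT` followed by seven digits.

```python
import re

# Assumption: accessions are "MI" (hairpin) or "MIMAT" (mature) plus 7 digits.
ACCESSION_PATTERN = re.compile(r"(MI|MIMAT)\d{7}")


def normalize_accession(raw):
    """Return a cleaned, upper-case accession, or None if the format is wrong."""
    cleaned = raw.strip().upper()
    if ACCESSION_PATTERN.fullmatch(cleaned):
        return cleaned
    return None


print(normalize_accession(" mi0000692 "))
print(normalize_accession("mir-100"))
```

Normalizing whitespace and case up front means a user who types a lower-case accession still gets a match later on.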

Now, we will look up this accession in a table from the miRBase FTP site [containing every known miR and its accession number](ftp://mirbase.org/pub/mirbase/CURRENT/miRNA.xls.gz). The first cell below downloads and decompresses the file and saves it in the working directory as `miRNA.xlsx`. The second cell imports and reshapes it using `pandas` so that `user_mir` can be queried against it.

In [9]:
import urllib.request
import io
import gzip
import os
from pathlib import Path

response = urllib.request.urlopen('ftp://mirbase.org/pub/mirbase/CURRENT/miRNA.xls.gz')
compressed_file = io.BytesIO(response.read())
decompressed_file = gzip.GzipFile(fileobj=compressed_file)

with open(Path(os.getcwd(), 'miRNA.xlsx'), 'wb') as outfile:
    outfile.write(decompressed_file.read())

In [10]:
import pandas as pd

mir_database = pd.read_excel('miRNA.xlsx')

mir_database_1 = mir_database.loc[:, ['Accession', 'ID']]
mir_database_2 = mir_database.loc[:, ['Mature1_Acc', 'Mature1_ID']].rename(columns = {'Mature1_Acc':'Accession', 'Mature1_ID':'ID'})
mir_database_3 = mir_database.loc[:, ['Mature2_Acc', 'Mature2_ID']].rename(columns = {'Mature2_Acc':'Accession', 'Mature2_ID':'ID'})

final_database = pd.concat([mir_database_1, mir_database_2, mir_database_3])

final_database.head()

| | Accession | ID |
|---|---|---|
| 0 | MI0000001 | cel-let-7 |
| 1 | MI0000002 | cel-lin-4 |
| 2 | MI0000003 | cel-mir-1 |
| 3 | MI0000004 | cel-mir-2 |
| 4 | MI0000005 | cel-mir-34 |


Now, we need to query `user_mir` against this database and store the corresponding `ID` as the term we will search for in PubMed. We will also check that exactly one `ID` corresponds to the given accession number.

In [3]:
filtered_database = final_database[final_database['Accession']  == user_mir]['ID']

if filtered_database.size == 1:
    mir = filtered_database.iloc[0]
    print('The accession number ' + user_mir + ' corresponds to miR ' + mir)
else:
    print('miR accession is incorrect. Try again (case sensitive)')

The accession number MI0000692 corresponds to miR mmu-mir-100
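Note that when the accession is not found, the check above only prints a message, so `mir` is never defined and later cells would fail. One possible alternative, sketched here with a hypothetical helper `resolve_accession` and a small in-memory stand-in for the table, is to raise an error so the problem surfaces immediately:

```python
def resolve_accession(accession, pairs):
    """Return the ID matching `accession`, or raise ValueError.

    `pairs` is any iterable of (accession, ID) tuples.
    """
    matches = [mir_id for acc, mir_id in pairs if acc == accession]
    if len(matches) != 1:
        raise ValueError(
            f"Expected exactly one ID for {accession!r}, found {len(matches)}"
        )
    return matches[0]


# Hypothetical stand-in for two rows of `final_database`
toy_pairs = [("MI0000001", "cel-let-7"), ("MI0000692", "mmu-mir-100")]
print(resolve_accession("MI0000692", toy_pairs))
```

With the real table, the pairs could probably be produced by `final_database.itertuples(index=False, name=None)`, since that table has exactly the two columns `Accession` and `ID`.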


## **Obtaining Relevant Literature**

Using the corresponding miR, we will query PubMed and get the relevant abstracts.

In [4]:
from Bio import Entrez

Entrez.email = 'anonymous@gmail.com'
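# Note: the search term is hard-coded here rather than derived from `mir`.
# With no `retmax` argument, esearch returns at most 20 IDs.
esearch_query = Entrez.esearch(db="pubmed", term="mir-100", retmode="xml")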
esearch_result = Entrez.read(esearch_query)
pmid_list = esearch_result['IdList']

This fetches the PubMed IDs (PMIDs) of the literature matching our search, which we can preview below:

In [5]:
print("pmid's obtained: " + str(len(pmid_list)))

print("Example pmid: " + str(pmid_list[0]))

pmid's obtained: 20
Example pmid: 33592729
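The search above uses the literal term `mir-100`, while the ID retrieved from the database was `mmu-mir-100`. A rough way to derive the search term from the ID is sketched below. The helper `pubmed_term` is hypothetical, and it assumes the ID begins with a three- or four-letter species code followed by a `mir`, `let`, or `lin` stem; IDs that do not fit this pattern are returned unchanged.

```python
import re

# Assumption: IDs look like "<species>-<stem>-<number...>", e.g. "mmu-mir-100".
_SPECIES_PREFIX = re.compile(r"^[a-z]{3,4}-((?:mir|let|lin)-.+)$", re.IGNORECASE)


def pubmed_term(mirbase_id):
    """Strip a leading species code, e.g. 'mmu-mir-100' -> 'mir-100'."""
    match = _SPECIES_PREFIX.match(mirbase_id)
    if match:
        return match.group(1)
    # Leave anything that does not fit the assumed pattern untouched
    return mirbase_id


print(pubmed_term("mmu-mir-100"))
```

The result could then be passed as the `term` argument of `Entrez.esearch` in place of the literal string.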


Now, we will create a function to fetch the abstract for a given PMID:

In [6]:
def fetch_abstract(pmid):
    # Download the PubMed record for one PMID and parse the XML
    handle = Entrez.efetch(db='pubmed', id=pmid, retmode='xml')
    article = Entrez.read(handle)['PubmedArticle'][0]['MedlineCitation']['Article']
    # Records without an abstract fall through and return None
    if 'Abstract' in article:
        return article['Abstract']['AbstractText']
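The parsing step can also be separated from the network call so that it can be checked offline. The sketch below uses a hypothetical helper, `extract_abstract`, and assumes the parsed record behaves like nested dictionaries and lists with the same keys as in the function above. It returns an empty list instead of `None` when there is no abstract, and also when the `PubmedArticle` list is empty.

```python
def extract_abstract(record):
    """Return the list of abstract sections, or an empty list if none exist."""
    articles = record.get("PubmedArticle", [])
    if not articles:
        return []
    article = articles[0]["MedlineCitation"]["Article"]
    return list(article.get("Abstract", {}).get("AbstractText", []))


# Hypothetical minimal record mimicking the structure used above
toy_record = {
    "PubmedArticle": [
        {"MedlineCitation": {"Article": {"Abstract": {"AbstractText": ["Some text."]}}}}
    ]
}
print(extract_abstract(toy_record))
```

Returning a list in every case means downstream code can treat all articles the same way without checking for `None` first.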

And now, we can iterate through the PMIDs and download each corresponding abstract:

In [7]:
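test_dict = {}

# Key each abstract as article_1, article_2, ... in search-result order
for counter, pmid in enumerate(pmid_list, start=1):
    title = 'article_' + str(counter)
    # Avoid naming the variable `abs`, which would shadow the built-in function
    test_dict[title] = fetch_abstract(pmid)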

Now, let's look at some of the abstracts we downloaded and see if any further processing is needed:

In [8]:
print(test_dict['article_1'])
print()
print(test_dict['article_2'])
print()
print(test_dict['article_3'])
print()
print(test_dict['article_4'])

['Gastric cancer (GC) is a common malignant digestive tract tumor that leads to high mortality worldwide. Early diagnosis of GC is very important for adequate treatment. However, a rapid, specific and sensitive method for the detection of GC is currently not available. Here, a biosensor CPs/AuNP-AuE, the gold nanoparticle (AuNP)-modified Au electrode (AuE) which was coupled with DNA capture probes (CPs), was developed to detect the content of miR-100 in the sera of GC patients. The results showed that AuNPs were uniformly deposited on the surface of AuE. AuNPs enhanced the electrical conductivity and improved the effective area of AuE. CPs were successfully assembled on AuNP-AuE that could be digested by duplex-specific nuclease (DSN) from the miR-100/CPs complex on the electrode, improving the sensitivity of the biosensor by recycling miR-100. The data revealed that the biosensor was highly specific for the detection of miR-100, which had the ability to distinguish one base-pair mista

`article_4` is formatted differently from the others: its abstract is split into separate sections (intro, methods, results, etc.), so it is stored as a list with several elements rather than one. To fix this, we need to concatenate all sections into one coherent text chunk. Below, we will confirm the difference by checking the length of each entry, and then design a function that concatenates the sections so that the dictionary entry can be replaced with the correctly formatted text.

In [9]:
len(test_dict['article_4'])

5

In [10]:
len(test_dict['article_1'])

1

In [11]:
str(test_dict['article_4'][0])

'Several studies have reported an association between microRNAs (miRNAs) and hypertension or cardiovascular disease (CVD). In a previous study performed on a group of 38 patients, we observed a cluster of 3 miRNAs (miR-378a-3p, miR-100-5p, and miR-486-5p) that were functionally associated with the cardiovascular system that predicted a favorable blood pressure (BP) response to continuous positive airway pressure (CPAP) treatment in patients with resistant hypertension (RH) and obstructive sleep apnea (OSA) (HIPARCO score). However, little is known regarding the molecular mechanisms underlying this phenomenon.'

In [12]:
def concat_article(x):
    final_article = str()
    for i in range(len(x)):
        final_article = final_article + str(x[i]) + ' '
    return final_article

Testing this function on `article_4` (whose sections it should join) and `article_1` (which has a single section, so its text should be unchanged):

In [13]:
print(concat_article(test_dict['article_4']))
print()
print(concat_article(test_dict['article_1']))

Several studies have reported an association between microRNAs (miRNAs) and hypertension or cardiovascular disease (CVD). In a previous study performed on a group of 38 patients, we observed a cluster of 3 miRNAs (miR-378a-3p, miR-100-5p, and miR-486-5p) that were functionally associated with the cardiovascular system that predicted a favorable blood pressure (BP) response to continuous positive airway pressure (CPAP) treatment in patients with resistant hypertension (RH) and obstructive sleep apnea (OSA) (HIPARCO score). However, little is known regarding the molecular mechanisms underlying this phenomenon. The aim of the study was to perform a post hoc analysis to investigate the genes, functions, and pathways related to the previously found HIPARCO score miRNAs. We performed an enrichment analysis using Ingenuity pathway analysis. The genes potentially associated with the miRNAs were filtered based on their confidence level. Particularly for CVD, only the genes regulated by at least
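The same concatenation can be written more compactly with `str.join`. This is a sketch rather than a replacement for the function above: `join_sections` is a hypothetical name, and unlike `concat_article` it leaves no trailing space and returns an empty string when no abstract was found.

```python
def join_sections(sections):
    """Join abstract sections into one string separated by single spaces."""
    if not sections:
        # Covers both None (no abstract) and an empty list
        return ""
    return " ".join(str(section).strip() for section in sections)


print(join_sections(["Background text.", "Methods text.", "Results text."]))
```

Handling the missing-abstract case here avoids an error when the function is later applied to every entry in the dictionary.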

Now, we will iterate through our dictionary and concatenate all our articles using the function above:

In [14]:
for title, sections in test_dict.items():
    # Write the result back into the dictionary; reassigning the loop
    # variable alone would leave the stored abstracts unchanged.
    # Skip articles for which no abstract was returned.
    if sections is not None:
        test_dict[title] = concat_article(sections)

And look at articles 1 and 4 to ensure they are formatted how we'd like:

In [15]:
print(test_dict['article_4'])
print()
print(test_dict['article_1'])

Several studies have reported an association between microRNAs (miRNAs) and hypertension or cardiovascular disease (CVD). In a previous study performed on a group of 38 patients, we observed a cluster of 3 miRNAs (miR-378a-3p, miR-100-5p, and miR-486-5p) that were functionally associated with the cardiovascular system that predicted a favorable blood pressure (BP) response to continuous positive airway pressure (CPAP) treatment in patients with resistant hypertension (RH) and obstructive sleep apnea (OSA) (HIPARCO score). However, little is known regarding the molecular mechanisms underlying this phenomenon. The aim of the study was to perform a post hoc analysis to investigate the genes, functions, and pathways related to the previously found HIPARCO score miRNAs. We performed an enrichment analysis using Ingenuity pathway analysis. The genes potentially associated with the miRNAs were filtered based on their confidence level. Particularly for CVD, only the genes regulated by at least