# Neuroinformatics: Using Python for literature searches

### Guest content & lecture by Monique Surles-Zeigler

In this notebook, you will learn to:
* Identify the conceptual and technical tools used to conduct informatics research (e.g., APIs, ontologies, Bio.Entrez, BLAST)
* Identify the structure and use of the JSON format
* Define MeSH terms & describe their role in informatics research
* Explain the role and importance of informatics research
* Conduct a PubMed search using Bio.Entrez

## Setup
We'll need several packages that aren't in our DataHub for today's lab. To get everything set up, run our setup script below.

<mark>This will take a few minutes. Please restart the kernel when it is finished.</mark>

In [None]:
# Install and update packages
# More information about installing packages - https://packaging.python.org/en/latest/tutorials/installing-packages/
! pip install xmljson
! pip install xmltodict
! pip install Biopython
! pip install myst-nb

Now we will import the libraries we just installed. A line of the form `from X import Y` imports the object `Y` from the module `X` and binds the name `Y` in the current namespace. A plain `import X` imports the whole module, whose contents are then accessed as `X.name`.
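To make the two import styles concrete, here is a minimal sketch using only the standard library's `itertools` (the variable names are just for illustration). Both styles give access to the same underlying object; they differ only in the name you type.

```python
import itertools                # whole module: use itertools.product
from itertools import product   # single name: use product directly

# Both names refer to the very same object
same_object = itertools.product is product
print(same_object)

# Either spelling produces the same combinations
via_module = list(itertools.product([1, 2], ["a"]))
via_name = list(product([1, 2], ["a"]))
print(via_module == via_name)
```

The `from ... import ...` form keeps later code shorter, while `import X` makes it obvious where each function comes from.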

In [3]:
from Bio import Entrez 

from xmljson import yahoo as yh
from xml.etree.ElementTree import fromstring
import xmltodict

import urllib.request  # the request submodule is needed for urllib.request.urlopen() later on

from copy import deepcopy
from itertools import product

import json

import pandas as pd 

# Show list of imported packages
%whos

Variable     Type        Data/Info
----------------------------------
Entrez       module      <module 'Bio.Entrez' from<...>/Bio/Entrez/__init__.py'>
deepcopy     function    <function deepcopy at 0x7f884bad3510>
fromstring   function    <function XML at 0x7f884e78e7b8>
json         module      <module 'json' from '/Use<...>hon3.7/json/__init__.py'>
l1           list        n=3
l2           list        n=3
l3           list        n=3
pd           module      <module 'pandas' from '/U<...>ages/pandas/__init__.py'>
product      type        <class 'itertools.product'>
time         module      <module 'time' (built-in)>
urllib       module      <module 'urllib' from '/U<...>n3.7/urllib/__init__.py'>
xmltodict    module      <module 'xmltodict' from <...>e-packages/xmltodict.py'>
yh           Yahoo       <xmljson.Yahoo object at 0x7f884e770240>


In [None]:
# Review all of the installed packages and their versions
! pip list

## Accessing NCBI databases with Biopython
**Biopython** is a set of freely available tools for biological computation written in Python. It contains a collection of Python modules for working with DNA, RNA, and protein sequences, with operations such as reverse complementing a DNA string or finding motifs in protein sequences.

`Bio.Entrez` is the module within the Biopython package that provides code to access NCBI over the web and retrieve various kinds of information. Its functions return the data as a handle object. A handle is the standard interface used in Python for reading data from a file: it provides methods such as `read()` and supports iteration over the contents line by line.
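As a rough illustration of what "handle" means, the sketch below uses `io.StringIO` from the standard library as a stand-in for the object that a `Bio.Entrez` function would return. This is an assumption made so the example runs without a network connection; the text inside it is invented.

```python
import io

# Stand-in for a handle: any file-like object is read the same way
handle = io.StringIO("line one\nline two\nline three\n")

# Read a single line, then iterate over whatever is left
first_line = handle.readline()
remaining = [line.rstrip("\n") for line in handle]
handle.close()

print(first_line)
print(remaining)
```

Note that a handle is consumed as you read it: once the lines have been read, reading again returns nothing unless you request the data again.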

### Bio.Entrez is not the only sub-module in Biopython. [Other packages include](https://biopython.org/docs/1.75/api/index.html):
- Bio.Geo - Access to data from the Gene Expression Omnibus database.
- Bio.KEGG - Access to data from the KEGG database.
- Bio.motifs - Access to tools for sequence motif analysis.


### Functions used in Bio.Entrez

Bio.Entrez has many different functions. We'll use a few of them (highlighted in bold below) in our notebook today. Read more about these functions on [the website](https://www.ncbi.nlm.nih.gov/books/NBK25499/).

The primary functions we'll use today are:

- **einfo** - Provides the number of records indexed in each field of a given database, the date of the last update of the database, and the available links from the database to other Entrez databases.
- **esearch** - Responds to a text query with the list of matching **Entrez Unique Identifiers (UIDs)** in a given database (for later use in ESummary, EFetch or ELink), along with the term translations of the query.
- **efetch** - Retrieves records in the requested format from a list of one or more primary IDs or from the user’s environment.
- **read** - Parses the XML results returned by any of the above functions.

**Note**: Entrez will request your email so that it can keep track of its users. Always provide your email with `Entrez.email = 'your@email.com'`.
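Under the hood, these functions talk to NCBI's E-utilities web API. The sketch below only builds (and does not send) the kind of URL that a PubMed search corresponds to, using the standard library. The search term and email are placeholders, and the exact parameters that Biopython sends for you may differ, so treat this as an illustration rather than a description of Biopython's internals.

```python
from urllib.parse import urlencode

# Public E-utilities endpoint for searches
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Placeholder query values, chosen only for illustration
params = {
    "db": "pubmed",                        # which database to search
    "term": "hippocampus AND microarray",  # the text query
    "retmax": 5,                           # maximum number of IDs to return
    "email": "your@email.com",             # lets NCBI identify the user
}

# urlencode escapes spaces and special characters for use in a URL
query_url = BASE_URL + "?" + urlencode(params)
print(query_url)
```

Seeing the URL makes it clearer why each function takes arguments such as `db` and `term`: they become the query parameters of a web request.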

<div class="alert alert-success">

**Task 1**: Create a <b>function</b> called `get_info` that requests your email as a string, and assigns it to the variable name `Entrez.email`

</div>

In [None]:
## Task 1 - code block

As mentioned in the slides, the Bio.Entrez module provides access to multiple biomedical databases. <mark>Below, we'll use the Entrez API to retrieve the list of databases available through Bio.Entrez.</mark>

In [None]:
# Call Entrez.einfo() and store the returned handle in a variable (e.g., handle),
# then parse the handle's XML contents with Entrez.read().
handle = Entrez.einfo()
record = Entrez.read(handle)
print(record)

<div class="alert alert-success">

**Task 2**: Assign the output of `Entrez.einfo()` to a variable called `handle`. This time, request information about the PubMed database by passing its name (as a string) to the `db` argument of `einfo`. There's additional information on how to do this [here](http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec145).
</div>

In [None]:
#Task 2 - code block

<div class="alert alert-success">

**Task 3**: Pull out information from the dictionary about 
1. Paper counts
2. Field list 
3. How many MeSH terms are associated with the papers?
4. What other information can you pull out?
5. Can you make the output more readable?
</div>

In [1]:
#Task 3 - code block: Pull out information from the dictionary here


### The product function
Let's learn about a new function: `product()`, which we imported above from `itertools`. Documentation on this function can be found at https://docs.python.org/3/library/itertools.html. 
You can pass multiple lists to `product()`, and it returns every possible combination of one value from each list. Remember that the result is an iterator (an object that yields its values one at a time), so `print()` does not display its contents.
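Here is a minimal sketch of that behavior with two small made-up lists (the names `colors` and `sizes` are only for illustration). It shows that the number of combinations is the product of the list lengths, and that an iterator can only be walked through once.

```python
from itertools import product

colors = ["red", "blue"]
sizes = ["S", "M", "L"]

pairs = product(colors, sizes)

# Converting to a list walks through the iterator and collects every tuple
combos = list(pairs)
print(len(combos))   # 2 colors x 3 sizes
print(combos[0])

# The iterator is now used up, so a second pass yields nothing
second_pass = list(pairs)
print(second_pass)
```

If you need the combinations more than once, keep the list rather than the iterator.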

<div class="alert alert-success">

**Task #4**: Let's test product out in the sample below
```{python}
l1 = ['a', 'b', 'c']
l2 = ['X', 'Y', 'Z']
l3 = [1, 2, 3]
p = product(l1, l2, l3)
```

1. What happens when you print p?
2. What type of object is p (e.g. list, dictionary, other)?
3. Can you figure out another way to view the output of p?


</div>


In [None]:
#Task 4 - code block

<div class="alert alert-success">

**Task #5**: The goal of this task is to create a function that takes in multiple lists (sound familiar?) and joins combinations of the lists together. 

Complete the function below by completing the following steps:

1. Pass the 3 lists (brain_region, cell_type, method) to the itertools function, product(). 
2. Iterate over the result of step 1, preferably through a list comprehension. Use the string method join() to join the elements of each combination together with the string " AND "
3. Return the results from step 2 
    
</div>

In [None]:
# Task 5 - Code block. Complete the function here
def comb_list(brain_region, cell_type, method):
    '''iterate over terms given - in this case brain region, cell type and method'''
    # 1. Pass the 3 lists to the itertools function, product(). This function returns every possible combination of values from a group of lists.
    # 2. Iterate over the result of step 1, preferably through a list comprehension. Use the string method join() to join the elements of each combination together with the string " AND "
    # 3. Return the results from step 2

In [9]:
# Lists to be fed into the comb_list() function
brain_region = ["hippocampus"]

cell_type = ["CA1 pyramidal cell", "CA2 pyramidal cell", "CA3 pyramidal cell", "CA4 Pyramidal Cell", "Dentate Gyrus Granule Cell", "Dentate Gyrus Basket Cell", "CA1 Basket Cell"]

method = ["rna seq*", "microarray", "in situ hybridization", "polymerase chain reaction"]

<div class="alert alert-success">

**Task #6**: Now, use the `comb_list()` function that you created to create all combinations of the three lists above. Store the result in a variable called `terms`, since later cells use that name.
</div>

In [None]:
#Task 6 - code block

<div class="alert alert-success">
 
**Task #7**: Create a script to search PubMed with the search terms created above (Task 6).
Previously, we used the script below 

```{python}
handle = Entrez.einfo()
record = Entrez.read(handle)
print (record)
```

This script allowed us to:
- Request information from the Entrez databases with `einfo()` and store the returned handle in a variable (e.g., `handle`)
- Parse the results returned by `einfo()` with `read()`

Now, let's use `esearch()` to search for papers in PubMed through Entrez, read those results, and return the output.
More information about `esearch()` and how to use it is [here](https://biopython-tutorial.readthedocs.io/en/latest/notebooks/09%20-%20Accessing%20NCBIs%20Entrez%20databases.html).

For this search, we will use the search terms created above to search for papers within PubMed. Let's try it below.

<mark>**Hint**: Remember, we need to search the terms above within the pubmed database. [more help here](https://dataguide.nlm.nih.gov/edirect/esearch.html)  .</mark>

What is the output? Is there a better way to view all of the results?
    
</div>

In [None]:
#Task 7 - code block

<div class="alert alert-success">

**Task #8**: Create a function called `get_abstract` that retrieves abstracts from PubMed for the search terms created above.
Previously, we used the script below to find out more about the databases within Entrez

```{python}
handle = Entrez.einfo()
record = Entrez.read(handle)
print (record)
```

Then, in Task 7 above, we used `esearch()` to search the PubMed database and retrieve PubMed IDs for our search terms.


Now, let's use `efetch()` to retrieve the abstracts from PubMed through Entrez, read those results, and return the output.
For the PubMed database, use the PMIDs as the `id` for the papers and load their abstracts.

<mark>**Hint**: Remember, we need to pass some or all of the PubMed IDs from Task 7. [more help here](https://dataguide.nlm.nih.gov/edirect/efetch.html)</mark>

Don't forget to use your function. How do the results look? 

</div>

In [None]:
#Task 8 - code block

Sanity check for Task 8: open this only after you have attempted the task. 
<br></br>

<details>
    <summary>Click once on <font color="pink"><b>this text</b></font> to hide/unhide the answer!</summary>
  
```{python}
def get_abstract(term_ids):
    fetch_handle = Entrez.efetch(db='pubmed', id=term_ids, retmode='xml', retmax=4000, rettype='abstract')
    return fetch_handle.read()
```
</details>

<div class="alert alert-success">

**Task #9**: Now for a bit of a cleanup step. The current output is difficult to read. Let's change the format of the text with the code below.

```{python}
y_abstracts = yh.data(fromstring(abstracts))
```
<mark>**Hint**: Add this line of code to the script from Task 8</mark>

</div>
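If XML is new to you, the sketch below shows how the `fromstring` function we imported turns XML text into a tree that you can navigate. The snippet is invented and far simpler than a real PubMed record; it is only loosely modeled on the kind of nesting you will see, so the paths here are assumptions to check against your own output.

```python
from xml.etree.ElementTree import fromstring

# Invented, heavily simplified XML for illustration only
xml_text = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <ArticleTitle>An invented title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

# Parse the text into an element tree; root is the outermost element
root = fromstring(xml_text)

# findall() returns the matching child elements as a list
articles = root.findall("PubmedArticle")
first = articles[0]

# findtext() follows a path of nested tags and returns the text it finds
pmid = first.findtext("MedlineCitation/PMID")
title = first.findtext("MedlineCitation/Article/ArticleTitle")
print(pmid, title)
```

Checking `root.tag` and `type(first)` as you go is a quick way to see where you are in the nested structure.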

In [None]:
# Task 9- Code block

<div class="alert alert-success">

***Task #10***: Exploratory task
- Have fun! Or as much as you can stand. 
Now, let's put everything together. This exercise is for you to see what the data looks like and how to extract data from nested data structures (e.g., lists, dictionaries, etc.). 
    - Explore and try to extract all or some of the data. It may help to look at only a few abstracts.
        -  Can you get to the PubMed article? <mark>(Hint) Remember to check the data types (`type()`) to help you.</mark>
        If so, delve deeper and see if you can get the 
        1. Article Title
        2. Journal Title
        3. Author List
        4. PMC ID
        5. DOI 
        6. MeSH headings
        7. Article date
    - Now, can you package this all in a function called `findformat_abstract()` that takes in the search terms and returns the abstracts?

</div>

In [None]:
# Task 10 - Code block

### Use findformat script
Below, we'll use a script (`findformat.ipynb`). Take a close look at the function it defines -- what is it doing? How is it different from your code?

In [None]:
%run findformat.ipynb

In [None]:
#get all abstracts   
new_abstracts = {}
gene_abstracts = findformat_abstract(terms) 

In [None]:
#only get the abstracts with pmc ids
pmc_abstracts = {k: v for k, v in gene_abstracts.items() if len(v['PMC']) > 0} 

In [None]:
# Make a copy of each dictionary as a backup.
# deepcopy() recursively copies an object and everything nested inside it, so changes to the copy do not affect the original.
gene_abstract_cp = deepcopy(gene_abstracts)
pmc_abstract_cp = deepcopy(pmc_abstracts)

In [None]:
print('original count', len(gene_abstracts))
print('PMC:', len(pmc_abstracts))
print('difference =', len(gene_abstracts)-len(pmc_abstracts))

<div class="alert alert-success">

The length of `pmc_abstracts` is less than that of `gene_abstracts`. Why is that?

</div>

## Read results as a dataframe
It's difficult to visually parse dictionaries. Thankfully we have another tool at our disposal: pandas.

<div class="alert alert-success">
    
**Task 12**: Turn `gene_abstracts` into a pandas dataframe called `gene_abstract_df`.
    
</div>

In [None]:
# Task 12: code block - Turn gene_abstract into a df

Hmm, it would make a lot more sense if each paper had its own row -- that's how we typically conceptualize dataframes, with each row as a different observation, patient, cell, etc. We can **transpose** the dataframe using the [`transpose`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html) (or `T` for short) method.

<div class="alert alert-success">

**Task 13**: Using the `iloc` indexer, view *just* the first abstract.
    
</div>

In [None]:
# Task 13 - code block

In [None]:
# Let's run this bit of code to get the PMC papers
# Work on a deep copy so that pmc_abstracts itself stays unchanged
pmc_abstract = deepcopy(pmc_abstracts)
# Find open access files, pull out the article body sections, and convert the XML to dictionaries

for k, v in pmc_abstracts.items():
    pmc = pmc_abstracts[k]['PMC']
    if len(pmc)>0:
        pmc_idno =(s.strip('PMC') for s in v['PMC'])
        #confirm that file is Open Access
        try_this = pmc_abstract[k]['PMC']
        find_pdf = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id={}".format(*try_this)
        #get xml record
        url = "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=GetRecord&identifier=oai:pubmedcentral.nih.gov:{}&metadataPrefix=pmc".format(*pmc_idno)
        pmc_abstract[k]['OA web address'] = url
        with urllib.request.urlopen(find_pdf) as response:
            the_file = response.read().decode('utf-8')
            dict_file = xmltodict.parse(the_file)

            try:
                # Store the result on this paper's entry (key k), not on the top-level dictionary
                pmc_abstract[k]['ftp_record'] = dict_file['OA']['records']['record']['link']['@href']
            except KeyError:
                pmc_abstract[k]['ftp_record'] = dict_file['OA']['error']['#text']
            except TypeError:
                pmc_abstract[k]['ftp_record'] = dict_file['OA']['records']['record']

        with urllib.request.urlopen(url) as responsec:
            the_filec = responsec.read().decode('utf-8')
            dict_filec = xmltodict.parse(the_filec)
            
            try:
                #print (dict_filec['OAI-PMH']['GetRecord'].keys())
                data_level = dict_filec['OAI-PMH']['GetRecord']['record']['metadata']['article']['body']['sec']
                print (data_level)
            except KeyError:
                continue

<div class="alert alert-success">

**Task #14**: Exploratory task
- Have fun...again! I know, it is too much fun...run the code below and try to extract the methods and results sections from each paper or a subset of papers.

I know it seems overwhelming, but the code is listed below. Let's just see if you can filter the text. 
  
Can you extract the methods and results sections? 
    
</div>

In [None]:
# Task 14 - Beginning code - Now explore...

# Work on a deep copy so that pmc_abstracts itself stays unchanged
pmc_abstract = deepcopy(pmc_abstracts)
# Find open access files, pull out the article body sections, and convert the XML to dictionaries

for k, v in pmc_abstracts.items():
    pmc = pmc_abstracts[k]['PMC']
    if len(pmc)>0:
        pmc_idno =(s.strip('PMC') for s in v['PMC'])
        #confirm that file is Open Access
        try_this = pmc_abstract[k]['PMC']
        find_pdf = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id={}".format(*try_this)
        #get xml record
        url = "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=GetRecord&identifier=oai:pubmedcentral.nih.gov:{}&metadataPrefix=pmc".format(*pmc_idno)
        pmc_abstract[k]['OA web address'] = url
        with urllib.request.urlopen(find_pdf) as response:
            the_file = response.read().decode('utf-8')
            dict_file = xmltodict.parse(the_file)

            try:
                # Store the result on this paper's entry (key k), not on the top-level dictionary
                pmc_abstract[k]['ftp_record'] = dict_file['OA']['records']['record']['link']['@href']
            except KeyError:
                pmc_abstract[k]['ftp_record'] = dict_file['OA']['error']['#text']
            except TypeError:
                pmc_abstract[k]['ftp_record'] = dict_file['OA']['records']['record']

        with urllib.request.urlopen(url) as responsec:
            the_filec = responsec.read().decode('utf-8')
            dict_filec = xmltodict.parse(the_filec)
            
            try:
                #print (dict_filec['OAI-PMH']['GetRecord'].keys())
                data_level = dict_filec['OAI-PMH']['GetRecord']['record']['metadata']['article']['body']['sec']
                print (data_level)
            except KeyError:
                continue

Below, we'll use a script (`getTexts.ipynb`). Take a close look at the function it defines -- what is it doing? How is it different from your code?

In [None]:
%run getTexts.ipynb

In [None]:
# Format the Methods and Results sections
g_updated_records = getTexts(gene_abstracts)

In [None]:
#make another copy of the file since a lot of information is in here.
pmc_papers = deepcopy(g_updated_records)
pmc_papers

## Save results as a json & excel file

Below, we'll save our findings as both a json and an Excel file. **JavaScript Object Notation (JSON)** is a standardized format commonly used to transfer data between systems and used by a lot of databases and APIs. 
Like Python dictionaries, it represents objects as name/value pairs.
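To see the correspondence between dictionaries and JSON, here is a minimal sketch that converts a small dictionary to JSON text and back. The record is invented for illustration; its keys and values are placeholders, not real results.

```python
import json

# Hypothetical record, invented for illustration
record = {
    "12345": {"Title": "An invented title", "PMC": ["PMC0000000"], "Year": 2020}
}

# dumps() turns the dictionary into a JSON-formatted string
json_text = json.dumps(record, indent=2)
print(json_text)

# loads() parses the string back into Python objects
restored = json.loads(json_text)
print(restored == record)
```

One thing to keep in mind: names in a JSON object are always strings, so a dictionary keyed by integers comes back keyed by strings after a round trip.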


In [None]:
#save file as a json file
with open('g_updated_records.json', 'w') as outfile:        
    json.dump(g_updated_records, outfile)
    
#read in json file    
with open('g_updated_records.json', 'r') as newfile:
    g_updated_records = json.load(newfile)
    
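# Save the results to an Excel file: build a pandas DataFrame (one row per paper), then write it out
# Note: DataFrame.to_excel() needs an Excel writer package such as openpyxl to be installed
df_updated_records = pd.DataFrame(gene_abstracts).T
df_updated_records.to_excel('g_updated_records.json.xlsx')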