# Assignment 2: Neuroscience Literature Mining

This assignment demonstrates how to pull data from an API and how to work with the [LISC package](https://lisc-tools.github.io/lisc/), and it tests your knowledge of working with classes and numpy.

**PLEASE DO NOT CHANGE THE NAME OF THIS FILE.**

**PLEASE DO NOT COPY & PASTE OR DELETE CELLS INCLUDED IN THE ASSIGNMENT.**


## Installing LISC on datahub

From the datahub ```Files``` browser, you'll see a dropdown menu titled ```New```. From there select ```Terminal```, and that will open a new terminal window.

Once in the Terminal, execute the following command:

```
pip install lisc
```

This will install the LISC package that you'll be using in this assignment.


## How to complete assignments

Whenever you see:

```
# YOUR CODE HERE
raise NotImplementedError()
```

You need to **replace (meaning, delete) these lines of code with code that answers the questions** and meets the specified criteria. Make sure you remove the 'raise' line when you do this (or your notebook will raise an error, regardless of any other code, and thus fail the grading tests).

You should write the answer to the questions in those cells (the ones with `# YOUR CODE HERE`), but you can also add extra cells to explore / investigate things if you need / want to. 

Any cell with `assert` statements in it is a test cell. You should not try to change or delete these cells. Note that there might be more than one assert that tests a particular question. 

If a test does fail, reading the error that is printed out should let you know which test failed, which may be useful for fixing it.
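
As a minimal sketch of what a failing test looks like (the helper `check_is_int` and its message are hypothetical and are not part of the assignment's tests), an `assert` with a message raises an `AssertionError` that carries that message:

```python
# Hypothetical helper for illustration; the assignment's own tests differ.
def check_is_int(value):
    # assert raises AssertionError with this message when the condition is False
    assert isinstance(value, int), f"expected an int, got {type(value).__name__}"

check_is_int(5468)  # passes silently

try:
    check_is_int('5468')  # a string, so the assertion fails
except AssertionError as err:
    message = str(err)
    print(message)
```

In a notebook the failing `assert` would normally not be wrapped in `try`/`except`; the traceback printed under the cell points at the line that failed.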

Note that some cells, including the test cells, may be read only, which means they won't let you edit them. If you cannot edit a cell - that is normal, and you shouldn't need to edit that cell.


## Tips & Tricks

The following are a couple of tips & tricks that may help you if you get stuck on anything.

#### Printing Variables
You can (and should) print and check variables as you go. This allows you to check what values they hold, and fix things if anything unexpected happens.

#### Restarting the Kernel
- If you run cells out of order, you can end up overwriting things in your namespace. 
- If things seem to go weird, a good first step is to restart the kernel, which you can do from the kernel menu above.
- Even if everything seems to be working, it's a nice check to 'Restart & Run All', to make sure everything runs properly in order.

# Part 1: Fetching data from URLs

In this section we'll show you the basics of looking up how many articles indexed in PubMed contain specific words or phrases (n-grams) in their titles and/or abstracts.

For this example, we begin by counting how many articles have been published whose titles or abstracts contain the phrases "working memory" or "short term memory". Look through the cell below to get a sense of what it is doing, and then run the cell.


In [1]:
import urllib.request # default library for requesting data from URLs

# this is the base URL for making use of the NIH NLM database search API
u_eutils = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

##### all of the code that follows is to construct the full search URL #####

# what database to search; here, pubmed
u_db = 'db=pubmed'

# what information to return; here, count of the number of articles
u_rettype = 'rettype=count'

# what field to search; here, tiab (pubmed's code for TItle and ABstract)
u_field = 'field=tiab'

# what term(s) to search for; here "working memory" OR "short term memory"
# note that the double quotes around the phrases indicate we only want to return
# results that contain that exact phrase
u_term = 'term='
url_searchterms = '"working memory" OR "short term memory"'
url_searchterms = url_searchterms.replace(' ', '+') # URLs don't do spaces; replace with +
u_term = u_term + url_searchterms

# stitch the URL together from the parts we gave it
url = u_eutils + '?' + u_db + '&' + u_rettype + '&' + u_field + '&' + u_term

##### end URL construction block #####

with urllib.request.urlopen(url) as response:
    xml = response.read()
xml

b'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">\n<eSearchResult>\n\t<Count>46034</Count>\n</eSearchResult>\n'
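
The manual string stitching in the cell above can also be sketched with the standard library's `urllib.parse.urlencode`. This is only an alternative illustration: the names `base_url`, `params`, `query_string`, and `url_encoded` are assumptions introduced here, and the graded cells still expect the manually built variables such as `url_searchterms`.

```python
from urllib.parse import urlencode

base_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

# Same parameters as the example cell, held in a dictionary (names are illustrative)
params = {
    'db': 'pubmed',
    'rettype': 'count',
    'field': 'tiab',
    'term': '"working memory" OR "short term memory"',
}

# urlencode joins key=value pairs with '&', turns spaces into '+',
# and percent-encodes the double quotes as %22
query_string = urlencode(params)
url_encoded = base_url + '?' + query_string
print(url_encoded)
```

Unlike the manual `replace(' ', '+')` approach, this also escapes the double quotes, so the resulting string is not character-for-character identical to the `url` built above.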

## Q1: Making your own search

You can see, between the `<Count>` tags in the long response above, that this returns (on Feb 01, 2022) 46034 articles in Pubmed with the phrases "working memory" or "short term memory" in their titles and/or abstracts. This number, of course, will keep increasing over time (and might have increased since we released the assignment!).
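
Rather than reading the number off by eye, the count can be pulled out of the response programmatically. The sketch below assumes the response has the same shape as the example output above and uses a copy of that output as `sample_xml`; the names `sample_xml`, `count_text`, and `parsed_count` are illustrative.

```python
import xml.etree.ElementTree as ET

# Sample response copied from the example output above (count as of Feb 01, 2022)
sample_xml = (
    b'<?xml version="1.0" encoding="UTF-8" ?>\n'
    b'<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" '
    b'"https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">\n'
    b'<eSearchResult>\n\t<Count>46034</Count>\n</eSearchResult>\n'
)

root = ET.fromstring(sample_xml)     # root element is <eSearchResult>
count_text = root.findtext('Count')  # text inside the <Count> tag
parsed_count = int(count_text)       # convert the text to an integer
print(parsed_count)
```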

1. Use the eutils documentation, specifically the esearch tools [here](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch) and [here](https://www.ncbi.nlm.nih.gov/books/NBK3827/#pubmedhelp.Title_TI), to modify the above code to find the count of how many articles have been published containing any of the following three terms <font color='red'>**in their titles only**</font>:

    * dentate gyrus
    * entorhinal cortex
    * subiculum


2. In the cell below the URL creation cell (after you've retrieved the XML), create a new variable, `number_of_articles`, that holds that count (as an integer). Remember the double quotes around the phrases, and name the search phrase string `url_searchterms`, just as in the example above.

In [2]:
# Run API query here

### BEGIN SOLUTION

# this is the base URL for making use of the NIH NLM database search API
u_eutils = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

##### all of the code that follows is to construct the full search URL #####

# what database to search; here, pubmed
u_db = 'db=pubmed'

# what information to return; here, count of the number of articles
u_rettype = 'rettype=count'

# what field to search; here, ti (pubmed's code for TItle)
u_field = 'field='
u_field = u_field + 'ti'

# what term(s) to search for; here "dentate gyrus" OR "entorhinal cortex" OR "subiculum"
# note that the double quotes around the phrases indicate we only want to return
# results that contain that exact phrase
u_term = 'term='
url_searchterms = '"dentate gyrus" OR "entorhinal cortex" OR "subiculum"'
url_searchterms = url_searchterms.replace(' ', '+') # URLs don't do spaces; replace with +
u_term = u_term + url_searchterms

url = u_eutils + '?' + u_db + '&' + u_rettype + '&' + u_field + '&' + u_term

##### end URL construction block #####

with urllib.request.urlopen(url) as response:
    xml = response.read()
print(xml)

### END SOLUTION

b'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">\n<eSearchResult>\n\t<Count>5468</Count>\n</eSearchResult>\n'


In [13]:
# Create number_of_articles here
### BEGIN SOLUTION
number_of_articles = 5468
### END SOLUTION

In [9]:
# Tests for Q1 (worth 5 points)
assert isinstance(u_field,str)
assert isinstance(url_searchterms,str)
assert isinstance(url,str)

In [10]:
# Hidden tests for Q1 (worth 5 points)
### BEGIN HIDDEN TESTS
assert u_field == 'field=title' or u_field == 'field=ti'
### END HIDDEN TESTS

In [11]:
# Hidden tests for Q1 (worth 5 points)
### BEGIN HIDDEN TESTS
assert url_searchterms == '"dentate+gyrus"+OR+"entorhinal+cortex"+OR+"subiculum"'
### END HIDDEN TESTS

In [12]:
# Additional hidden tests for Q1 (worth 5 points)
### BEGIN HIDDEN TESTS
wiggle_room = 100
expected_count = 5468  # count returned when the solution above was run
assert ((number_of_articles > (expected_count - wiggle_room))
        and (number_of_articles < (expected_count + wiggle_room)))
### END HIDDEN TESTS

# Part 2: Basic literature scanner (LISC) usage

Now, instead of all of this manual URL construction, we're going to take advantage of an open-source toolbox from [Tom Donoghue](https://tomdonoghue.github.io) and [Voytek's lab](https://voyteklab.com) here at UC San Diego called LISC (for LIterature SCanner). It's available for pip install from PyPI [here](https://pypi.org/project/lisc/).

## Demonstration of LISC
The cell below contains all of the LISC imports you'll need.

Here, we create a list called <code>terms_a</code>, that contains the same three search terms we used above:

* dentate gyrus
* entorhinal cortex
* subiculum

Note the double quotes around the phrases. This time we're passing an argument for field that searches both titles and abstracts (<code>field='tiab'</code>). (Give it a few seconds to run.)

In [14]:
from lisc import Counts
from lisc.utils.db import SCDB
from lisc.plts.counts import *

terms_a = [['"dentate gyrus"'], ['"entorhinal cortex"'], ['"subiculum"']]

# Initialize counts object and use the add_terms method to add terms that we want to search
counts = Counts()
counts.add_terms(terms_a)

# Collect data using the run_collection method
counts.run_collection(verbose=True, db='pubmed', field='tiab')

# Check how many articles were found for each search term
counts.check_counts()

Running counts for:  "dentate gyrus"
Running counts for:  "entorhinal cortex"
Running counts for:  "subiculum"
The number of documents found for each search term is:
  '"dentate gyrus"'       -   18338
  '"entorhinal cortex"'   -    7536
  '"subiculum"'           -    2968


## Q2
`check_counts` is a(n):

* `A` attribute of `counts`
* `B` method of `counts`
* `C` inherited class of `counts`

Write your answer below as a new variable, <code>q2_answer</code>. So if the answer was a hypothetical option <code>D</code>, you would write:

<code>q2_answer = 'D'</code>
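
As a reminder of the vocabulary used in the options, here is a minimal sketch of the difference between an attribute and a method. The class `PaperTally` is hypothetical and is not part of LISC.

```python
# Hypothetical class for illustration only; it is not part of LISC.
class PaperTally:
    def __init__(self, n_papers):
        self.n_papers = n_papers  # attribute: data stored on the instance

    def describe(self):           # method: a function defined on the class
        return f"{self.n_papers} papers"

tally = PaperTally(3)
print(tally.n_papers)    # attributes are accessed without parentheses
print(tally.describe())  # methods are called with parentheses

# callable() distinguishes the two: False for the attribute, True for the method
print(callable(tally.n_papers), callable(tally.describe))
```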

In [16]:
### BEGIN SOLUTION
q2_answer = 'B'
### END SOLUTION

In [17]:
# Tests for Q2 (worth 5 points)

assert isinstance(q2_answer,str)

### BEGIN HIDDEN TESTS
assert q2_answer == 'B'
### END HIDDEN TESTS

## Q3
1. Create a new list called <code>terms_b</code>, that contains the four lobes of the brain: 'frontal lobe', 'temporal lobe', 'parietal lobe', and 'occipital lobe'. Remember the quotes around the phrases. Check the counts for these terms. 
2. Create a new variable, <code>number_of_lobes_articles</code>, that is equal to the *total* number of articles (as an integer) that mention any of those phrases in their titles or abstracts.

In [18]:
# Run your LISC query here
### BEGIN SOLUTION
terms_b = [['"frontal lobe"'], ['"temporal lobe"'], ['"parietal lobe"'], ['"occipital lobe"']]

counts = Counts()
counts.add_terms(terms_b)

# Collect data
counts.run_collection(verbose=True, db='pubmed', field='tiab')

# Check how many articles were found for each search term
counts.check_counts()

### END SOLUTION

Running counts for:  "frontal lobe"
Running counts for:  "temporal lobe"
Running counts for:  "parietal lobe"
Running counts for:  "occipital lobe"
The number of documents found for each search term is:
  '"frontal lobe"'     -   20373
  '"temporal lobe"'    -   34333
  '"parietal lobe"'    -    8939
  '"occipital lobe"'   -    6357


In [19]:
# Assign number_of_lobes_articles here
number_of_lobes_articles = 70002
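
The total assigned above can be reproduced by summing the per-term counts printed by `check_counts`. The sketch below copies those four numbers by hand into a dictionary; `lobe_counts` and `total_lobe_counts` are illustrative names.

```python
# Per-term counts copied from the check_counts output above
lobe_counts = {
    '"frontal lobe"': 20373,
    '"temporal lobe"': 34333,
    '"parietal lobe"': 8939,
    '"occipital lobe"': 6357,
}

# Sum the counts across the four search terms
total_lobe_counts = sum(lobe_counts.values())
print(total_lobe_counts)
```

Note that a simple sum counts an article once for every term it matches, so an article mentioning two lobes contributes twice to this total.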

In [20]:
# Tests for Q3, worth 5 points
assert isinstance(counts,Counts)
assert isinstance(number_of_lobes_articles,int)

In [21]:
# Hidden tests, worth 5 points
### BEGIN HIDDEN TESTS
wiggle_room = 1500
expected_total = 70002  # sum of the four counts returned when the solution above was run
assert ((number_of_lobes_articles > (expected_total - wiggle_room)) and
        (number_of_lobes_articles < (expected_total + wiggle_room)))
### END HIDDEN TESTS

## LISC: Term co-occurrences
LISC not only finds the counts for each search term, but also runs each pairwise comparison between terms to find how many published papers mention both <code>term_i</code> <em>and</em> <code>term_j</code>. This is reported as an array in <code>counts.counts</code>, where the entry at index <em>i,j</em> is the number of published papers that mention <code>term_i</code> together with <code>term_j</code>.

For example, for the four lobes, we get the following:

In [22]:
counts.counts

array([[   0, 7294, 4261, 2424],
       [7294,    0, 4317, 2854],
       [4261, 4317,    0, 2193],
       [2424, 2854, 2193,    0]])

The top row and left column are co-occurrences for the zeroth item in our list, "frontal lobe", with itself and all other terms. Note that, by convention, LISC sets a term as having 0 publications co-occurring with itself. You can see that more papers mention "frontal lobe" together with "temporal lobe" (>7200) than mention the "occipital lobe" at all (6357).
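
The structure of this array can be explored with numpy. The sketch below copies the four-lobe values shown above into a new array (named `cooc` here so it does not overwrite anything used in the questions), checks that it is symmetric, and finds the pair with the largest co-occurrence. The names `lobes`, `cooc`, `upper`, and `top_pair` are illustrative.

```python
import numpy as np

lobes = ['frontal lobe', 'temporal lobe', 'parietal lobe', 'occipital lobe']

# Co-occurrence values copied from the counts.counts output above
cooc = np.array([[   0, 7294, 4261, 2424],
                 [7294,    0, 4317, 2854],
                 [4261, 4317,    0, 2193],
                 [2424, 2854, 2193,    0]])

# The matrix is symmetric: term_i with term_j equals term_j with term_i
is_symmetric = bool(np.array_equal(cooc, cooc.T))

# Keep only the entries above the diagonal so each pair is considered once,
# then locate the largest one
upper = np.triu(cooc, k=1)
i, j = np.unravel_index(np.argmax(upper), upper.shape)
top_pair = (lobes[i], lobes[j])
print(is_symmetric, top_pair, cooc[i, j])
```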

An interesting question is whether this is driven by true differences in how those regions are studied, or whether it is an artifact of the terminology we use when collecting counts with LISC. For example, neurophysiologists generally don't study the "occipital lobe"; rather, they focus on subregions (V1, V2), or refer to the cortex specifically using the phrase "occipital cortex" rather than "occipital lobe". One could try to get around this problem by explicitly including all of the subregions of the occipital lobe. Alternatively, one could simply search for the more general "occipital", to encompass both "occipital lobe" and "occipital cortex"; however, this runs into the issue that the medical literature will also include research on the "occipital bone".
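
One way to sketch the "include the subregions explicitly" idea is to build a single OR query from a list of phrases, in the same quoted style used in Part 1. The helper `build_or_query` is hypothetical, and the phrase list is only an example taken from the paragraph above, not an exhaustive set of occipital subregions.

```python
# Hypothetical helper; the phrase list is an example, not a complete set of subregions.
def build_or_query(phrases):
    # wrap each phrase in double quotes for exact matching, then join with OR
    return ' OR '.join(f'"{phrase}"' for phrase in phrases)

occipital_phrases = ['occipital lobe', 'occipital cortex', 'V1', 'V2']
occipital_query = build_or_query(occipital_phrases)
print(occipital_query)
```

Short labels such as "V1" may also match unrelated uses of the same abbreviation, so a query like this trades one terminology problem for another.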

## Q4
Using LISC and the <code>terms_a</code> list, store the <code>counts.counts</code> array as a new variable, <code>cooccurance_matrix</code>.

In [23]:
# Initialize counts object and add the terms that we want to search
counts = Counts()
counts.add_terms(terms_a)

# Collect data
counts.run_collection(verbose=True, db='pubmed', field='tiab')

cooccurance_matrix = counts.counts
cooccurance_matrix

Running counts for:  "dentate gyrus"
Running counts for:  "entorhinal cortex"
Running counts for:  "subiculum"


array([[   0, 1428,  837],
       [1428,    0,  720],
       [ 837,  720,    0]])

In [25]:
# Tests for Q4 (worth 5 points)
import numpy as np
assert isinstance(cooccurance_matrix,np.ndarray)

### BEGIN HIDDEN TESTS
assert sum(np.diag(cooccurance_matrix)) == 0
### END HIDDEN TESTS

## Q5
At this point, you might have noticed that `counts.counts` is a numpy array. Use the cell below to assign the **shape** of your numpy array to a variable `matrix_shape`.

In [26]:
### BEGIN SOLUTION
import numpy as np
matrix_shape = np.shape(cooccurance_matrix)
### END SOLUTION

In [27]:
# Tests for Q5, worth 5 points
assert isinstance(matrix_shape,tuple)

### BEGIN HIDDEN TESTS
assert matrix_shape == (3, 3)
### END HIDDEN TESTS

## Q6
Which pair of terms is mentioned together in the most published papers?

* <code>A</code>: dentate gyrus with entorhinal cortex
* <code>B</code>: dentate gyrus with subiculum
* <code>C</code>: entorhinal cortex with subiculum

Write your answer below as a new variable, <code>q6_answer</code>. So if the answer was a hypothetical option <code>D</code>, you would write:

<code>q6_answer = 'D'</code>

In [28]:
### BEGIN SOLUTION
q6_answer = 'A'
### END SOLUTION

In [29]:
# Tests for Q6 (worth 5 points)

assert isinstance(q6_answer,str)

### BEGIN HIDDEN TESTS
assert q6_answer == 'A'
### END HIDDEN TESTS