## Exercise 1

Use the requests module (or urllib) to query the Entrez API (see slides 5) and identify the PubMed IDs for 1000 Alzheimers papers from 2022 and for 1000 cancer papers from 2022. (9 points)

Note: To search for a disease and a publication year, structure the term like: Alzheimers+AND+2022[pdat] (Here [pdat] indicates that this is a publication year, and the AND (has to be all caps) means both conditions should apply.) 
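As a minimal sketch of the term structure described above, the query string can be assembled with the standard library. The helper name `build_esearch_url` and the constant `ESEARCH_URL` are illustrative assumptions, not part of the assignment; `urlencode` turns the spaces into `+` and percent-encodes the square brackets.

```python
from urllib.parse import urlencode

# Base URL of the Entrez esearch endpoint
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(disease, year, retmax=1000):
    """Return an esearch URL for papers on `disease` published in `year`."""
    params = {
        "db": "pubmed",
        # AND must be upper case; [pdat] restricts the year to the publication date
        "term": f"{disease} AND {year}[pdat]",
        "retmax": retmax,
        "retmode": "xml",
    }
    # urlencode writes spaces as '+' and escapes '[' and ']'
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url("Alzheimers", 2022)
print(url)
```

The resulting URL can be passed to `requests.get` or `urllib.request.urlopen` unchanged.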

Use the Entrez API via requests/urllib to pull the metadata for each paper found above (both cancer and Alzheimers), and save a JSON file storing each paper's title, abstract, and the query that found it, in the general form: (12 points)

Here 32008517 would be the PubMed ID of one of the 2000 papers, specifically one that came from searching for Alzheimer's papers (it won't be in your data set because it was published in 2019). You should include the full AbstractText; I'm abridging here for clarity.

There are of course many more papers of each category, but is there any overlap in the two sets of papers that you identified? (3 points)

Hint: To do this, you'll probably want to look at one of the XML responses with a text editor so that you understand how it is structured.

Hint: Some papers like 32008517 have multiple AbstractText fields (e.g. when the abstract is structured). Be sure to store all parts. You could do this in many ways, from using a dictionary or a list or simply concatenating with a space in between. Discuss any pros or cons of your choice in your readme (1 point).
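To illustrate two of the storage options mentioned in the hint, here is a small sketch that parses a made-up structured abstract. The `SAMPLE` fragment is a hypothetical stand-in for a real efetch response, and the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mimicking a structured abstract
SAMPLE = """<Abstract>
  <AbstractText Label="BACKGROUND">First part.</AbstractText>
  <AbstractText Label="RESULTS">Second part.</AbstractText>
</Abstract>"""

abstract = ET.fromstring(SAMPLE)

# Collect the full text of every AbstractText element, in document order
parts = ["".join(item.itertext()) for item in abstract.iter("AbstractText")]

as_list = parts               # option 1: keep the parts separate
as_string = " ".join(parts)   # option 2: concatenate with a space in between

print(as_list)
print(as_string)
```

A list keeps the section boundaries, while a single string is simpler to search and to feed into text-processing tools.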

Caution: the PubMed API allows a rate of at most one query at a time and no more than 3 per second unless you have an API key. To be safe, use

after each query to the PubMed API. 

Note: This doesn't require 2002 separate queries. You can get the metadata for many articles at a time by using a comma separated list of ids. While GET queries have a total line length limit, you could use a POST query instead and get the information for all the papers in one pass. (We can use POST instead of GET here in part because this is not a RESTful API.)
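One possible way to prepare batched POST requests is sketched below using only the standard library. The helper names `chunked` and `build_efetch_request`, the batch size of 200, and the placeholder IDs are all assumptions made for illustration; the requests are built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Base URL of the Entrez efetch endpoint
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def chunked(ids, size):
    """Yield consecutive slices of `ids` with at most `size` items each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_efetch_request(ids):
    """Build (but do not send) a POST request for a batch of PubMed IDs."""
    body = urlencode({"db": "pubmed", "retmode": "xml", "id": ",".join(ids)})
    # Supplying `data` makes urllib issue a POST instead of a GET
    return Request(EFETCH_URL, data=body.encode("ascii"))


ids = [str(n) for n in range(1, 451)]   # placeholder IDs, not real papers
batches = list(chunked(ids, 200))       # batch size is an arbitrary choice
pending = [build_efetch_request(batch) for batch in batches]
print(len(pending), pending[0].get_method())
```

Each request could then be sent with `urllib.request.urlopen`, pausing between calls to respect the rate limit. The papers in a batched response have to be matched back to their IDs when parsing.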

Note: BioPython provides functions for accessing the PubMed API. Do not use them; use the requests module to do an HTTP or HTTPS request directly on a URL that you specify with the parameters that you specify. Why? Because this approach is general and will work in many contexts whereas BioPython only works for PubMed and only from Python.

NOTE: sometimes papers (e.g. 31842501) have italics in their title or abstract. If so, using .text won't work well; use ET.tostring(item, method="text").decode() instead.
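The difference is easy to see on a small made-up title (the XML string below is a hypothetical example, not taken from paper 31842501):

```python
import xml.etree.ElementTree as ET

# Hypothetical title containing an italicized gene name
item = ET.fromstring("<ArticleTitle>Role of <i>APOE</i> in disease.</ArticleTitle>")

# .text stops at the first child element
text_only = item.text

# method="text" serializes all nested text and drops the tags
full = ET.tostring(item, method="text").decode()

# encoding="unicode" returns a str directly, so no decode() is needed
full_unicode = ET.tostring(item, encoding="unicode", method="text")

print(repr(text_only))
print(repr(full))
```

`"".join(item.itertext())` is another way to obtain the same string.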

## Response

In [84]:
# import modules 
import requests
import xml.dom.minidom as m
import xml.etree.ElementTree as et
import json
import time

#### Creating a function to return the list of PubMed IDs for a specified disease using the requests module

In [85]:
def get_id(disease):
    # disease: search term, e.g. "Alzheimers" or "Cancer"
    # year: 2022 (fixed by the [pdat] filter)
    # count: up to 1000 papers per disease (retmax)
    # response format: XML
    r = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
        f"?db=pubmed&term={disease}+AND+2022[pdat]&retmax=1000&retmode=xml"
    )
    time.sleep(1)  # stay under the PubMed rate limit
    doc = m.parseString(r.text)
    # each <Id> element holds one PubMed ID
    return [node.firstChild.data for node in doc.getElementsByTagName('Id')]

In [86]:
## Checking the total number of IDs: 1000 per disease, so 2000 in total

len(get_id("Alzheimers") + get_id("Cancer"))

2000

In [95]:
## PubMedId for Disease - Alzheimers

get_id("Alzheimers")

['36323061',
 '36322888',
 '36322800',
 '36322495',
 '36322470',
 '36321981',
 '36321927',
 '36321882',
 '36321654',
 '36321615',
 '36321363',
 '36321205',
 '36321194',
 '36320609',
 '36320346',
 '36319270',
 '36319045',
 '36318754',
 '36318594',
 '36318545',
 '36318372',
 '36317468',
 '36317413',
 '36316970',
 '36316783',
 '36316708',
 '36316501',
 '36316487',
 '36316461',
 '36316282',
 '36316035',
 '36315527',
 '36315115',
 '36314730',
 '36314503',
 '36314232',
 '36314212',
 '36314211',
 '36314210',
 '36314209',
 '36314208',
 '36314207',
 '36314206',
 '36314205',
 '36314204',
 '36314203',
 '36314202',
 '36314201',
 '36314200',
 '36314199',
 '36314055',
 '36313968',
 '36313967',
 '36313955',
 '36313229',
 '36312018',
 '36311713',
 '36311031',
 '36310167',
 '36309938',
 '36309725',
 '36309404',
 '36309183',
 '36309087',
 '36308033',
 '36307888',
 '36307518',
 '36306920',
 '36306735',
 '36306540',
 '36306459',
 '36306458',
 '36305768',
 '36305541',
 '36305459',
 '36305148',
 '36305125',

In [96]:
## PubMedId for Disease - Cancer

get_id("Cancer")

['36323507',
 '36323504',
 '36323475',
 '36323457',
 '36323452',
 '36323443',
 '36323436',
 '36323435',
 '36323434',
 '36323433',
 '36323432',
 '36323431',
 '36323430',
 '36323420',
 '36323419',
 '36323417',
 '36323370',
 '36323360',
 '36323356',
 '36323332',
 '36323327',
 '36323321',
 '36323311',
 '36323309',
 '36323304',
 '36323278',
 '36323268',
 '36323264',
 '36323262',
 '36323258',
 '36323253',
 '36323249',
 '36323248',
 '36323247',
 '36323234',
 '36323232',
 '36323225',
 '36323190',
 '36323185',
 '36323179',
 '36323171',
 '36323147',
 '36323135',
 '36323109',
 '36323093',
 '36323090',
 '36323089',
 '36323072',
 '36323071',
 '36323070',
 '36323057',
 '36323053',
 '36323052',
 '36323051',
 '36323050',
 '36323049',
 '36323048',
 '36323005',
 '36323004',
 '36322991',
 '36322988',
 '36322977',
 '36322963',
 '36322939',
 '36322938',
 '36322935',
 '36322931',
 '36322930',
 '36322929',
 '36322928',
 '36322919',
 '36322906',
 '36322884',
 '36322882',
 '36322880',
 '36322879',
 '36322878',

#### Finding overlap between two sets of papers

In [91]:
def overlap_PubmedId(disease1, disease2):
    return set(get_id(disease1)) & set(get_id(disease2))

In [92]:
overlap_PubmedId("Alzheimers", "Cancer")

{'36314209'}

#### Fetching the metadata of the papers in the Alzheimers and Cancer sets

In [93]:
def node_text(node):
    # Recursively collect the text of a minidom node, including text nested
    # inside inline tags such as <i>, <sub>, or <sup>.
    if node.nodeType in (m.Node.TEXT_NODE, m.Node.CDATA_SECTION_NODE):
        return node.data
    return "".join(node_text(child) for child in node.childNodes)


def metadata(disease):
    PubmedIdList = get_id(disease)
    disease_paper_dict = {}
    for PubmedId in PubmedIdList:
        time.sleep(1)  # stay under the PubMed rate limit
        r = requests.post(
            "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
            data={"db": "pubmed", "retmode": "xml", "id": PubmedId},
        )
        doc = m.parseString(r.text)

        # Title, including any italicized or otherwise nested text
        Title = "".join(node_text(elm) for elm in doc.getElementsByTagName('ArticleTitle'))

        # Structured abstracts have several AbstractText elements;
        # join the parts with a space in between
        Abstract = " ".join(node_text(elm) for elm in doc.getElementsByTagName('AbstractText'))

        # MeSH descriptor name of each heading
        ArticleMeshTerms = []
        for heading in doc.getElementsByTagName('MeshHeading'):
            descriptors = heading.getElementsByTagName('DescriptorName')
            if descriptors:
                ArticleMeshTerms.append(node_text(descriptors[0]))

        disease_paper_dict[PubmedId] = {
            'ArticleTitle': Title,
            'ArticleAbstract': Abstract,
            'Query': disease,
            'Mesh': ArticleMeshTerms
        }

    return disease_paper_dict

#### I looped over every AbstractText element and concatenated the parts into a single string, which is then stored in a common dictionary alongside the other keys. I think this is an efficient and less time-consuming way to find and store the data. The drawback is that the section labels of structured abstracts (such as objective, methods, results) are not stored, so it would be difficult to extract the objective, methods, or results of each paper separately.
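If the section labels were needed, one alternative would be to keep each part together with the `Label` attribute that structured abstracts carry on `AbstractText`. The sketch below is not what the code above does; the function name `labeled_abstract` and the sample XML are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mimicking a structured abstract
SAMPLE = """<Abstract>
  <AbstractText Label="OBJECTIVE">Why the study was done.</AbstractText>
  <AbstractText Label="RESULTS">What was <i>found</i>.</AbstractText>
  <AbstractText>Unlabeled closing remark.</AbstractText>
</Abstract>"""


def labeled_abstract(abstract):
    """Return a list of {'label', 'text'} dicts, one per AbstractText."""
    sections = []
    for item in abstract.iter("AbstractText"):
        sections.append({
            "label": item.get("Label"),        # None when no label is present
            "text": "".join(item.itertext()),  # includes text inside <i> etc.
        })
    return sections


sections = labeled_abstract(ET.fromstring(SAMPLE))
print(sections)
```

A list of dictionaries preserves the order of the parts and tolerates missing or repeated labels, at the cost of a more nested JSON file.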

#### Pulling Metadata for Alzheimers and saving the JSON file

In [36]:
alzheimers_metadata = metadata('Alzheimers')

In [37]:
with open("Alzheimer.json", "w") as f:
    json.dump(alzheimers_metadata, f, indent=4)

#### Pulling Metadata for Cancer and saving the JSON file

In [38]:
cancer_metadata = metadata('Cancer')

In [39]:
with open("Cancer.json", "w") as f:
    json.dump(cancer_metadata, f, indent=4)

#### Combining the data in a single JSON file

In [94]:
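# Reuse the metadata fetched above instead of querying PubMed again
combined_data = dict(alzheimers_metadata)
# Note: a paper found by both queries keeps only its 'Cancer' entry
combined_data.update(cancer_metadata)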

with open('papers.json','w') as f:
    json.dump(combined_data,f)
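Because one paper (36314209) appeared in both sets, a plain dictionary update keeps only one query for it. A possible alternative, sketched below with placeholder records, stores the queries as a list. The function name `merge_by_query` and the placeholder titles are assumptions, and this changes the `Query` field from a string to a list.

```python
def merge_by_query(*datasets):
    """Merge paper dictionaries, collecting every query that found a paper."""
    merged = {}
    for dataset in datasets:
        for pmid, record in dataset.items():
            if pmid not in merged:
                # First sighting: copy the record and wrap the query in a list
                merged[pmid] = {**record, "Query": [record["Query"]]}
            else:
                # Seen before: only record the additional query
                merged[pmid]["Query"].append(record["Query"])
    return merged


# Placeholder records; the titles are not real metadata
alz = {
    "36323061": {"ArticleTitle": "title A", "Query": "Alzheimers"},
    "36314209": {"ArticleTitle": "title B", "Query": "Alzheimers"},
}
cancer = {
    "36314209": {"ArticleTitle": "title B", "Query": "Cancer"},
    "36323507": {"ArticleTitle": "title C", "Query": "Cancer"},
}

merged = merge_by_query(alz, cancer)
print(merged["36314209"]["Query"])
```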