# Classical Studies
## Practicum Milestone 1



### Download and unzip CSV

https://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz

PubMed lists all available articles in this compressed .csv file. We first need to download it, unzip it, and load it into a data structure so we can explore the available data.
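The notebook does not show the download step itself, so here is a minimal sketch of one way to do it with `requests` and the standard-library `gzip` module. The helper names `download_file` and `gunzip`, the chunk size, and the timeout are assumptions made for illustration, not part of the original practicum.

```python
import gzip
import shutil
from pathlib import Path

import requests  # pip install requests

PMC_IDS_URL = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz"


def download_file(url, dest, chunk_size=1 << 20):
    """Stream a remote file to disk (hypothetical helper, not from the notebook)."""
    dest = Path(dest)
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(dest, "wb") as out:
            # Write the archive chunk by chunk so it is never fully held in memory.
            for chunk in response.iter_content(chunk_size=chunk_size):
                out.write(chunk)
    return dest


def gunzip(src, dest):
    """Decompress a .gz file to dest and return the destination path."""
    dest = Path(dest)
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return dest
```

Calling `gunzip(download_file(PMC_IDS_URL, "PMC-ids.csv.gz"), "PMC-ids.csv")` would produce the `PMC-ids.csv` file that the next cells read. The archive is large, so expect the download to take a while.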

In [30]:
import requests # pip install requests
import pubmed_parser as pp # Install from https://github.com/titipata/pubmed_parser
import pandas as pd
from bs4 import BeautifulSoup

In [15]:
# Read CSV into a pandas DataFrame
table = pd.read_csv('PMC-ids.csv')



In [3]:
# Check the column names
table.columns

Index(['Journal Title', 'ISSN', 'eISSN', 'Year', 'Volume', 'Issue', 'Page',
       'DOI', 'PMCID', 'PMID', 'Manuscript Id', 'Release Date'],
      dtype='object')

In [18]:
# Check the data types of each column
table.dtypes

Journal Title     object
ISSN              object
eISSN             object
Year              object
Volume            object
Issue             object
Page              object
DOI               object
PMCID             object
PMID             float64
Manuscript Id     object
Release Date      object
dtype: object
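`PMID` is the only column reported as `float64`. pandas stores an integer column as floats when some values are missing, which is probably what happens here (that cause is an assumption; the notebook does not check it). The toy series below sketches the effect and one possible fix using pandas' nullable `Int64` type.

```python
import pandas as pd

# Toy column mimicking PMID: one value present, one missing.
pmid = pd.Series([11250746.0, None], name="PMID")
print(pmid.dtype)  # float64, because NaN cannot live in a plain int column

# Convert to the nullable integer type (note the capital "I" in "Int64").
pmid_int = pmid.astype("Int64")
print(pmid_int.dtype)
print(pmid_int.iloc[0])
```

On the real table the equivalent call would be `table['PMID'].astype('Int64')`, provided every non-missing PMID is a whole number.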

In [22]:
# The data is held in a pandas DataFrame. Find out how many rows (records) and columns are in the CSV file
# https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

shape = None #TODO
rows = None #TODO
columns = None #TODO

print(shape, rows, columns)


None None None


In [31]:
# Look at first few rows
table.head()

Unnamed: 0,Journal Title,ISSN,eISSN,Year,Volume,Issue,Page,DOI,PMCID,PMID,Manuscript Id,Release Date
0,Breast Cancer Res,1465-5411,1465-542X,2000,3,1,55,,PMC13900,11250746.0,,live
1,Breast Cancer Res,1465-5411,1465-542X,2000,3,1,61,,PMC13901,11250747.0,,live
2,Breast Cancer Res,1465-5411,1465-542X,2000,3,1,66,,PMC13902,11250748.0,,live
3,Breast Cancer Res,1465-5411,1465-542X,1999,2,1,59,10.1186/bcr29,PMC13911,11056684.0,,live
4,Breast Cancer Res,1465-5411,1465-542X,1999,2,1,64,,PMC13912,11400682.0,,live


### Create datasets of retracted and unretracted articles
We need to find which articles in the available set are retracted so we can analyze and compare them. To do this we use the DataFrame we made from the .csv and a retraction database.

In [6]:
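# Check whether an article is retracted using the Open Retractions API
def is_retracted(doi):
    # Rows without a DOI (NaN) cannot be looked up, so treat them as not retracted
    if pd.isna(doi):
        return False
    r = requests.get('http://openretractions.com/api/doi/{}/data.json'.format(doi))
    # Any non-200 response (e.g. DOI unknown to the database) counts as not retracted
    if r.status_code == 200:
        resp = r.json()
        if 'retracted' in resp and resp['retracted']:
            return True
    return False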

In [16]:
# Look at a subset of the table
df = table.iloc[71940:71960].copy()
# Add a new column by calling is_retracted on each row's DOI (one HTTP request per row)
df['Retracted'] = df['DOI'].apply(is_retracted)
df

Unnamed: 0,Journal Title,ISSN,eISSN,Year,Volume,Issue,Page,DOI,PMCID,PMID,Manuscript Id,Release Date,Retracted
71940,Proc Natl Acad Sci U S A,0027-8424,1091-6490,2002,99,6,3878,10.1073/pnas.002025599,PMC122617,11891271.0,,live,False
71941,Proc Natl Acad Sci U S A,0027-8424,1091-6490,2002,99,6,3884,10.1073/pnas.062321799,PMC122618,11891329.0,,live,False
71942,Proc Natl Acad Sci U S A,0027-8424,1091-6490,2002,99,6,3890,10.1073/pnas.062047499,PMC122619,11904439.0,,live,False
71943,Proc Natl Acad Sci U S A,0027-8424,1091-6490,2002,99,6,3896,10.1073/pnas.052496399,PMC122620,11891302.0,,live,False
71944,Proc Natl Acad Sci U S A,0027-8424,1091-6490,2002,99,6,3902,10.1073/pnas.052533799,PMC122621,11891303.0,,live,False
71945,Proc Natl Acad Sci U S A,0027-8424,1091-6490,2002,99,6,3908,10.1073/pnas.062010399,PMC122622,11904440.0,,live,True
71946,Proc Natl Acad Sci U S A,0027-8424,1091-6490,2002,99,6,3914,10.1073/pnas.062578399,PMC122623,11904441.0,,live,False
71947,Proc Natl Acad Sci U S A,0027-8424,1091-6490,2002,99,6,3920,10.1073/pnas.002024599,PMC122624,11891270.0,,live,False
71948,Proc Natl Acad Sci U S A,0027-8424,1091-6490,2002,99,6,3926,10.1073/pnas.062043799,PMC122625,11891327.0,,live,False
71949,Proc Natl Acad Sci U S A,0027-8424,1091-6490,2002,99,6,3932,10.1073/pnas.052713799,PMC122626,11867761.0,,live,False


In [29]:
# Select the retracted article(s)
df[df['Retracted']]

Unnamed: 0,Journal Title,ISSN,eISSN,Year,Volume,Issue,Page,DOI,PMCID,PMID,Manuscript Id,Release Date,Retracted
71945,Proc Natl Acad Sci U S A,0027-8424,1091-6490,2002,99,6,3908,10.1073/pnas.062010399,PMC122622,11904440.0,,live,True
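The same boolean column can split the subset into the two datasets this section is after. The sketch below uses a three-row toy DataFrame with values copied from the output above; the names `retracted` and `unretracted` are illustrative and do not come from the notebook.

```python
import pandas as pd

# Toy stand-in for the annotated subset; values copied from the output above.
df = pd.DataFrame(
    {
        "PMCID": ["PMC122621", "PMC122622", "PMC122623"],
        "DOI": [
            "10.1073/pnas.052533799",
            "10.1073/pnas.062010399",
            "10.1073/pnas.062578399",
        ],
        "Retracted": [False, True, False],
    }
)

# Boolean indexing: keep rows where the flag is True, then the complement.
retracted = df[df["Retracted"]]
unretracted = df[~df["Retracted"]]
print(len(retracted), len(unretracted))
```

Because every lookup is a separate HTTP request, annotating the full table this way would be slow; working on slices, as the notebook does, keeps the experiment manageable.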


### Use BeautifulSoup and requests to download XML
For an individual article, the data is available in XML format and can be fetched from the PubMed API.

The XML contains metadata useful for our analysis and, where the publisher allows it, the full text.

In [28]:
# Use BeautifulSoup to inspect the XML for an individual PMCID (PMC122622, the retracted article found above)
r = requests.get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=122622')
bs = BeautifulSoup(r.content, 'xml')
bs

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE pmc-articleset PUBLIC "-//NLM//DTD ARTICLE SET 2.0//EN" "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd">
<pmc-articleset><article article-type="research-article" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<!--The publisher of this article does not allow downloading of the full text in XML form.-->
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Proc Natl Acad Sci U S A</journal-id>
<journal-id journal-id-type="publisher-id">PNAS</journal-id>
<journal-title>Proceedings of the National Academy of Sciences of the United States of America</journal-title>
<issn pub-type="ppub">0027-8424</issn>
<issn pub-type="epub">1091-6490</issn>
<publisher>
<publisher-name>The National Academy of Sciences</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">11904440</article-id>
<article-id pub-id-type="pmc">122622</article-id>
<article-i
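The metadata fields visible in the response above can be read programmatically. The sketch below works on a short, well-formed excerpt modelled on that response (the truncated tail is left out) and uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup, purely so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# Well-formed excerpt modelled on the response shown above.
xml_text = """<pmc-articleset><article article-type="research-article">
<front>
<journal-meta>
<journal-title>Proceedings of the National Academy of Sciences of the United States of America</journal-title>
<issn pub-type="ppub">0027-8424</issn>
<issn pub-type="epub">1091-6490</issn>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">11904440</article-id>
<article-id pub-id-type="pmc">122622</article-id>
</article-meta>
</front>
</article></pmc-articleset>"""

root = ET.fromstring(xml_text)

# Journal title: first matching element anywhere below the root.
journal = root.findtext(".//journal-title")

# Map each identifier type (pmid, pmc, ...) to its value.
article_ids = {el.get("pub-id-type"): el.text for el in root.iter("article-id")}

print(journal)
print(article_ids)
```

With the `bs` object from the cell above, calls such as `bs.find('journal-title')` should give comparable access to the same elements.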

In [32]:
# List of unformatted PMCIDs
ids = [
"pmc-id: PMC1165578;",
"pmc-id: PMC372567;",
"pmc-id: PMC372038;",
"pmc-id: PMC1165527;",
"pmc-id: PMC372360;",
"pmc-id: PMC371755;",
"pmc-id: PMC372035;",
"pmc-id: PMC1537255;",
"pmc-id: PMC370643;",
"pmc-id: PMC371636;",
"pmc-id: PMC370594;",
"pmc-id: PMC370790;",
"pmc-id: PMC1162518;",
"pmc-id: PMC2229793;"
]

### XML Parser
Next we use an XML parser (pubmed_parser) to extract the metadata and text returned by the PubMed API without navigating the XML by hand.

In [10]:
# Parse a single entry to extract the bare PMCID (e.g. 'PMC1165578')
pmid_str = ids[0]
pmid_str = pmid_str.split()[1]
pmid_str = pmid_str.split(';')[0]
pmcid = pmid_str
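The two `split` calls above rely on every entry having exactly the form `pmc-id: PMC...;`. A regular expression is one alternative that tolerates small formatting differences. The sketch below is illustrative only; `extract_pmcid` is a hypothetical helper, not something the notebook or pubmed_parser provides.

```python
import re

# A PMCID is the literal prefix "PMC" followed by one or more digits.
PMCID_PATTERN = re.compile(r"PMC\d+")


def extract_pmcid(raw):
    """Return the first PMCID found in raw (e.g. 'PMC1165578'), or None."""
    match = PMCID_PATTERN.search(raw)
    return match.group(0) if match else None


sample = ["pmc-id: PMC1165578;", "pmc-id: PMC372567;", "no id here"]
clean = [extract_pmcid(s) for s in sample]
print(clean)  # ['PMC1165578', 'PMC372567', None]
```

Returning `None` for entries without an ID lets the caller decide whether to skip or report them.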

In [17]:
# Get the author list
author_list = pp.parse_xml_web(pmcid, save_xml=False)['authors']
print(author_list.split(';'))
print(len(author_list.split(';')))

['R S Downie']
1


### Activity
Write a function that prints the author list of each paper in the `ids` list.

What other features can you extract using pubmed_parser (https://github.com/titipata/pubmed_parser)?