In [1]:
# import the Entrez module from Biopython (a module is just a .py file)
from Bio import Entrez

In [2]:
# tell NCBI who is making the requests (use your own email address)
Entrez.email = "benmainye@gmail.com"

In [3]:
# send the search to PubMed; the returned handle holds the raw result
handle = Entrez.esearch(db = "pubmed", term="[Open science] AND Kenya")

In [4]:
# parse the raw result into a dictionary-like record that includes
# the PubMed IDs matching the search
record = Entrez.read(handle)

In [5]:
# The result is a dictionary-like object that contains several values.
# Try displaying record on its own, without ["IdList"], to see all of them.
# %save only saves lines of input history, so write the IDs out with plain Python.
with open("NCBIids.txt", "w") as id_file:
    id_file.write("\n".join(record["IdList"]) + "\n")


Great! We now have the **PubMed IDs** of papers that match the term *[Open science] AND Kenya*. This is the same query you would type into the PubMed search box at NCBI. Let's go ahead and get more information on one of these papers, and the full text if possible, in the next step.
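By default an `esearch` call returns at most 20 IDs, while the `Count` field of the record reports the total number of matches. The sketch below shows one way to work out the `retstart` offsets needed to page through all of them. The helper name `page_starts` and the numbers in the stand-in record are made up for illustration; nothing here contacts NCBI.

```python
def page_starts(total_count, page_size=20):
    """Return the retstart offsets needed to page through total_count results."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return list(range(0, total_count, page_size))


# Hypothetical stand-in for the record returned by Entrez.read();
# the real values depend on the search term.
record_example = {"Count": "45", "RetMax": "20", "RetStart": "0"}

starts = page_starts(int(record_example["Count"]), int(record_example["RetMax"]))
print(starts)  # [0, 20, 40]
```

Each offset could then be passed to `Entrez.esearch` through its `retstart` argument, together with `retmax`, to collect one page of IDs at a time.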

In [6]:
# To get a summary of a paper we only need to swap esearch for esummary
# and pass the PubMed ID of the paper we are interested in
handle2 = Entrez.esummary(db="pubmed", id = "30123385")

In [7]:
# Let's bring our result back from NCBI
record2 = Entrez.read(handle2)

# display the record to see which fields we can subset
record2

[DictElement({'Item': [], 'Id': '30123385', 'PubDate': '2018', 'EPubDate': '2018 Jul 19', 'Source': 'Open AIDS J', 'AuthorList': ['Govender K', 'Masebo WGB', 'Nyamaruze P', 'Cowden RG', 'Schunter BT', 'Bains A'], 'LastAuthor': 'Bains A', 'Title': 'HIV Prevention in Adolescents and Young People in the Eastern and Southern African Region: A Review of Key Challenges Impeding Actions for an Effective Response.', 'Volume': '12', 'Issue': '', 'Pages': '53-67', 'LangList': ['English'], 'NlmUniqueID': '101480215', 'ISSN': '', 'ESSN': '1874-6136', 'PubTypeList': ['Journal Article', 'Review'], 'RecordStatus': 'PubMed', 'PubStatus': '', 'ArticleIds': DictElement({'pubmed': ['30123385'], 'medline': [], 'doi': '10.2174/1874613601812010053', 'pii': 'TOAIDJ-12-53', 'pmc': 'PMC6062910', 'rid': '30123385', 'eid': '30123385', 'pmcid': 'pmc-id: PMC6062910;'}, attributes={}), 'DOI': '10.2174/1874613601812010053', 'History': DictElement({'pubmed': ['2018/08/21 06:00'], 'medline': ['2018/08/21 06:01'], 'rec

In [8]:
print("Extract interesting entries in the data")
print("")
print(record2[0]['Id'])
print("")
print(record2[0]['Title'])
print("")
print(record2[0]['AuthorList'])
print("")
print(record2[0]['FullJournalName'])
print("")
print(record2[0]['EPubDate'])

Extract interesting entries in the data

30123385

HIV Prevention in Adolescents and Young People in the Eastern and Southern African Region: A Review of Key Challenges Impeding Actions for an Effective Response.

['Govender K', 'Masebo WGB', 'Nyamaruze P', 'Cowden RG', 'Schunter BT', 'Bains A']

The open AIDS journal

2018 Jul 19
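The same fields can be collected for many papers and written to a tab-delimited file, which is convenient for later analysis. The sketch below is one possible way to do that, not necessarily how any file used in this notebook was produced. The helper name `summaries_to_tsv` and the choice of fields are assumptions; the example record is a plain dict holding values copied from the output above, standing in for what `Entrez.read` returns.

```python
import csv
import io

# Fields to keep from each summary record (an illustrative choice)
FIELDS = ["Id", "Source", "FullJournalName", "EPubDate", "LastAuthor"]


def summaries_to_tsv(summaries, fields=FIELDS):
    """Write one tab-separated row per summary record and return the text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for summary in summaries:
        # .get() leaves a blank cell when a record lacks a field
        writer.writerow([summary.get(field, "") for field in fields])
    return buffer.getvalue()


# Plain-dict stand-in for one esummary record
example_summaries = [{
    "Id": "30123385",
    "Source": "Open AIDS J",
    "FullJournalName": "The open AIDS journal",
    "EPubDate": "2018 Jul 19",
    "LastAuthor": "Bains A",
}]

print(summaries_to_tsv(example_summaries))
```

Returning the text instead of writing straight to disk keeps the helper easy to test; the result can be saved with an ordinary `open(...).write(...)` call.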


In [9]:
# The summary doesn't contain everything we want, so let's fetch the
# full record if we can (read the efetch documentation first)
?Entrez.efetch

In [10]:
# As Caleb said, efetch returns XML by default, which we would then have to parse.
# Here rettype="abstract" with retmode="text" asks for plain text instead.
# Note that the documentation of each argument needs more annotation.
handle4 = Entrez.efetch(db="pubmed", id="30123385", rettype="abstract", retmode="text")

# fetching the result from the database
# print(handle4.read())

In [11]:
# Check what type of object is returned: XML, maybe?
# Remove the hash sign below to see for yourself
# type(handle4.read())

In [12]:
# Use the %store magic command to write the output to a file.
# A handle can only be read once, so write it out directly
# instead of planning to read it again later.
%store handle4.read() >> file.txt

Writing 'handle4.read()' (str) to file 'file.txt'.


In [13]:
# you can run shell commands directly in a Jupyter cell
%cat file.txt







1. Open AIDS J. 2018 Jul 19;12:53-67. doi: 10.2174/1874613601812010053. eCollection 
2018.

HIV Prevention in Adolescents and Young People in the Eastern and Southern
African Region: A Review of Key Challenges Impeding Actions for an Effective
Response.

Govender K(1), Masebo WGB(1), Nyamaruze P(2), Cowden RG(3), Schunter BT(4), Bains
A(4).

Author information: 
(1)Health Economics and HIV and AIDS Research Division, University of
KwaZulu-Natal, Durban, South Africa.
(2)School of Applied Human Sciences, University of KwaZulu-Natal, Durban, South
Africa.
(3)Department of Psychology, Middle Tennessee State University, Murfreesboro,
United States of America.
(4)UNICEF, Eastern and Southern Africa Regional Office, Nairobi, Kenya.

The global commitment to ending the AIDS epidemic by 2030 places HIV prevention
at the centre of the response. With the disease continuing to disproportionately 
affect young populations in the Eastern and Southern African Region (ESAR),
particularly adoles

Do we have to write this code 20 times? Nope, there's a more efficient way to solve this problem, at least partially. It's... you guessed it: write a function. We go back to the code we wrote, put everything together, change just a few things, and we are golden, right? RIGHT?
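Once the fetching step lives in a function, a loop can apply it to every ID. The sketch below shows the idea with a hypothetical helper, `fetch_all`, that accepts any function taking a PubMed ID and returning text. To keep it runnable offline it uses a made-up `fake_fetch` stand-in rather than a real NCBI request.

```python
def fetch_all(pubmed_ids, fetch_one):
    """Call fetch_one for every ID and collect the results in a dict.

    fetch_one is any function that takes a PubMed ID and returns text.
    IDs whose request fails are recorded with the value None.
    """
    results = {}
    for pmid in pubmed_ids:
        try:
            results[pmid] = fetch_one(pmid)
        except Exception as error:  # keep going if one request fails
            results[pmid] = None
            print(f"could not fetch {pmid}: {error}")
    return results


def fake_fetch(pmid):
    """Offline stand-in for a real request (hypothetical)."""
    if pmid == "bad":
        raise ValueError("no such record")
    return f"abstract text for {pmid}"


papers = fetch_all(["30123385", "30165703", "bad"], fake_fetch)
print(papers["30123385"])
```

In the notebook, `fetch_one` could be a small wrapper around the `paper_retriever` function defined in the next cell. When looping over real requests, a short `time.sleep` between calls helps stay within NCBI's request limits.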

In [14]:
# Write a function so we don't have to rewrite the same code over and over again.
# This is a hack: a loop or a generator could instead traverse the list of IDs
# and feed each one to the function. Try to improve it.
def paper_retriever(email, searchterm, pubmedid):
    '''Fetch the PubMed record for one paper as plain text.

    email: your email address, which is passed on to NCBI.
    searchterm: an NCBI-style query, given as a string.
    pubmedid: a PubMed ID, for example one returned by the search above.

    Returns the text of the record (citation and abstract) as a string.
    '''
    # Enter your own email
    Entrez.email = email

    # search NCBI for the search term with the esearch method
    handle = Entrez.esearch(db="pubmed", term=searchterm)

    # parse the search results (not used further in this version)
    record = Entrez.read(handle)

    # efetch retrieves the record for pubmedid and brings it back to the IPython session
    handle2 = Entrez.efetch(db="pubmed", id=pubmedid, rettype="abstract", retmode="text")

    # To print single fields such as Title or AuthorList, use Entrez.esummary
    # as in cell 8 above; the esearch record does not contain them.

    # return the fetched text so the caller can print or store it
    return handle2.read()

In [15]:
# calling the function as a test
print (paper_retriever(email="benmainye@gmail.com", searchterm="[Open science] AND Kenya",pubmedid=30123385))


1. Open AIDS J. 2018 Jul 19;12:53-67. doi: 10.2174/1874613601812010053. eCollection 
2018.

HIV Prevention in Adolescents and Young People in the Eastern and Southern
African Region: A Review of Key Challenges Impeding Actions for an Effective
Response.

Govender K(1), Masebo WGB(1), Nyamaruze P(2), Cowden RG(3), Schunter BT(4), Bains
A(4).

Author information: 
(1)Health Economics and HIV and AIDS Research Division, University of
KwaZulu-Natal, Durban, South Africa.
(2)School of Applied Human Sciences, University of KwaZulu-Natal, Durban, South
Africa.
(3)Department of Psychology, Middle Tennessee State University, Murfreesboro,
United States of America.
(4)UNICEF, Eastern and Southern Africa Regional Office, Nairobi, Kenya.

The global commitment to ending the AIDS epidemic by 2030 places HIV prevention
at the centre of the response. With the disease continuing to disproportionately 
affect young populations in the Eastern and Southern African Region (ESAR),
particularly adolescent 

In [16]:
paper = paper_retriever(email="benmainye@gmail.com", searchterm="[Open science] AND Kenya",pubmedid=30123385)

In [17]:
paper

'\n1. Open AIDS J. 2018 Jul 19;12:53-67. doi: 10.2174/1874613601812010053. eCollection \n2018.\n\nHIV Prevention in Adolescents and Young People in the Eastern and Southern\nAfrican Region: A Review of Key Challenges Impeding Actions for an Effective\nResponse.\n\nGovender K(1), Masebo WGB(1), Nyamaruze P(2), Cowden RG(3), Schunter BT(4), Bains\nA(4).\n\nAuthor information: \n(1)Health Economics and HIV and AIDS Research Division, University of\nKwaZulu-Natal, Durban, South Africa.\n(2)School of Applied Human Sciences, University of KwaZulu-Natal, Durban, South\nAfrica.\n(3)Department of Psychology, Middle Tennessee State University, Murfreesboro,\nUnited States of America.\n(4)UNICEF, Eastern and Southern Africa Regional Office, Nairobi, Kenya.\n\nThe global commitment to ending the AIDS epidemic by 2030 places HIV prevention\nat the centre of the response. With the disease continuing to disproportionately \naffect young populations in the Eastern and Southern African Region (ESAR),\n

In [18]:
%store paper >> file2.txt

Writing 'paper' (str) to file 'file2.txt'.


In [19]:
%cat file2.txt


1. Open AIDS J. 2018 Jul 19;12:53-67. doi: 10.2174/1874613601812010053. eCollection 
2018.

HIV Prevention in Adolescents and Young People in the Eastern and Southern
African Region: A Review of Key Challenges Impeding Actions for an Effective
Response.

Govender K(1), Masebo WGB(1), Nyamaruze P(2), Cowden RG(3), Schunter BT(4), Bains
A(4).

Author information: 
(1)Health Economics and HIV and AIDS Research Division, University of
KwaZulu-Natal, Durban, South Africa.
(2)School of Applied Human Sciences, University of KwaZulu-Natal, Durban, South
Africa.
(3)Department of Psychology, Middle Tennessee State University, Murfreesboro,
United States of America.
(4)UNICEF, Eastern and Southern Africa Regional Office, Nairobi, Kenya.

The global commitment to ending the AIDS epidemic by 2030 places HIV prevention
at the centre of the response. With the disease continuing to disproportionately 
affect young populations in the Eastern and Southern African Region (ESAR),
particularly adolescent 

In [20]:
%%writefile paper_retriever.py
from Bio import Entrez
def paper_retriever(email, searchterm, pubmedid):
    '''Fetch the PubMed record for one paper as plain text.

    email: your email address, which is passed on to NCBI.
    searchterm: an NCBI-style query, given as a string.
    pubmedid: a PubMed ID, for example one returned by a previous search.

    Returns the text of the record (citation and abstract) as a string.
    '''
    # Enter your own email
    Entrez.email = email

    # search NCBI for the search term with the esearch method
    handle = Entrez.esearch(db="pubmed", term=searchterm)

    # parse the search results (not used further in this version)
    record = Entrez.read(handle)

    # efetch retrieves the record for pubmedid and brings it back to the IPython session
    handle2 = Entrez.efetch(db="pubmed", id=pubmedid, rettype="abstract", retmode="text")

    # return the fetched text so the caller can print or store it
    return handle2.read()

paper1 = paper_retriever(email="benmainye@gmail.com", searchterm="[Open science] AND Kenya",pubmedid=30123385)
#%store paper >> papers1.txt
print(paper1)

Overwriting paper_retriever.py


In [21]:
from Bio import Entrez
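def paper_parser(term, identity):
    '''Print selected summary fields for one PubMed ID and return the search record.

    Note: the search is run on identity, so term is currently unused.
    '''
    # search PubMed for the ID itself
    handle = Entrez.esearch(db="pubmed", term=identity)
    record = Entrez.read(handle)
    print(record)
    # request the document summary for the same ID
    handle2 = Entrez.esummary(db="pubmed", id=identity)
    record2 = Entrez.read(handle2)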
    print("Extract interesting entries in the data")
    print("")
    print(record2[0]['Id'])
    print("")
    print(record2[0]['Title'])
    print("")
    print(record2[0]['AuthorList'])
    print("")
    print(record2[0]['FullJournalName'])
    print("")
    print(record2[0]['EPubDate'])
    return record

print(paper_parser(term="[Open science] AND Kenya", identity=30123385))

DictElement({'Count': '1', 'RetMax': '1', 'RetStart': '0', 'IdList': ['30123385'], 'TranslationSet': [], 'TranslationStack': [DictElement({'Term': '30123385[UID]', 'Field': 'UID', 'Count': '-1', 'Explode': 'N'}, attributes={}), 'GROUP'], 'QueryTranslation': '30123385[UID]'}, attributes={})
Extract interesting entries in the data

30123385

HIV Prevention in Adolescents and Young People in the Eastern and Southern African Region: A Review of Key Challenges Impeding Actions for an Effective Response.

['Govender K', 'Masebo WGB', 'Nyamaruze P', 'Cowden RG', 'Schunter BT', 'Bains A']

The open AIDS journal

2018 Jul 19
DictElement({'Count': '1', 'RetMax': '1', 'RetStart': '0', 'IdList': ['30123385'], 'TranslationSet': [], 'TranslationStack': [DictElement({'Term': '30123385[UID]', 'Field': 'UID', 'Count': '-1', 'Explode': 'N'}, attributes={}), 'GROUP'], 'QueryTranslation': '30123385[UID]'}, attributes={})


In [22]:
%%writefile paper_parser.py
from Bio import Entrez
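def paper_parser(term, identity):
    '''Print selected summary fields for one PubMed ID and return the summary record.

    Note: the search is run on identity, so term is currently unused.
    '''
    Entrez.email = "benmainye@gmail.com"  # use your own email
    # search PubMed for the ID itself
    handle = Entrez.esearch(db="pubmed", term=identity)
    record = Entrez.read(handle)
    print(record)
    # request the document summary for the same ID
    handle2 = Entrez.esummary(db="pubmed", id=identity)
    record2 = Entrez.read(handle2)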
    print("Extract interesting entries in the data")
    print("")
    print(record2[0]['Id'])
    print("")
    print(record2[0]['Title'])
    print("")
    print(record2[0]['AuthorList'])
    print("")
    print(record2[0]['FullJournalName'])
    print("")
    print(record2[0]['EPubDate'])
    return record2

paper2 = paper_parser(term="[Open science] AND Kenya", identity=30123385)
# %store paper2 >> papers2.txt  (run this line in an IPython session)
print(paper2)

Overwriting paper_parser.py


In [23]:
%run ../Code/paper_retriever.py


1. Open AIDS J. 2018 Jul 19;12:53-67. doi: 10.2174/1874613601812010053. eCollection 
2018.

HIV Prevention in Adolescents and Young People in the Eastern and Southern
African Region: A Review of Key Challenges Impeding Actions for an Effective
Response.

Govender K(1), Masebo WGB(1), Nyamaruze P(2), Cowden RG(3), Schunter BT(4), Bains
A(4).

Author information: 
(1)Health Economics and HIV and AIDS Research Division, University of
KwaZulu-Natal, Durban, South Africa.
(2)School of Applied Human Sciences, University of KwaZulu-Natal, Durban, South
Africa.
(3)Department of Psychology, Middle Tennessee State University, Murfreesboro,
United States of America.
(4)UNICEF, Eastern and Southern Africa Regional Office, Nairobi, Kenya.

The global commitment to ending the AIDS epidemic by 2030 places HIV prevention
at the centre of the response. With the disease continuing to disproportionately 
affect young populations in the Eastern and Southern African Region (ESAR),
particularly adolescent 

In [24]:
%run ../Code/paper_parser.py

DictElement({'Count': '1', 'RetMax': '1', 'RetStart': '0', 'IdList': ['30123385'], 'TranslationSet': [], 'TranslationStack': [DictElement({'Term': '30123385[UID]', 'Field': 'UID', 'Count': '-1', 'Explode': 'N'}, attributes={}), 'GROUP'], 'QueryTranslation': '30123385[UID]'}, attributes={})
Extract interesting entries in the data

30123385

HIV Prevention in Adolescents and Young People in the Eastern and Southern African Region: A Review of Key Challenges Impeding Actions for an Effective Response.

['Govender K', 'Masebo WGB', 'Nyamaruze P', 'Cowden RG', 'Schunter BT', 'Bains A']

The open AIDS journal

2018 Jul 19
[DictElement({'Item': [], 'Id': '30123385', 'PubDate': '2018', 'EPubDate': '2018 Jul 19', 'Source': 'Open AIDS J', 'AuthorList': ['Govender K', 'Masebo WGB', 'Nyamaruze P', 'Cowden RG', 'Schunter BT', 'Bains A'], 'LastAuthor': 'Bains A', 'Title': 'HIV Prevention in Adolescents and Young People in the Eastern and Southern African Region: A Review of Key Challenges Impeding A

In [25]:
# store the NCBI IDs of interest;
# %store strips the quotes in record["IdList"], so assign the list to a variable first
id_list = record["IdList"]
%store id_list >> NCBIids.txt

> Group publications by date

- import the data
- do some cleaning, that is, parsing datetime objects
- select the necessary columns
- group by the date and publication columns
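The dates in this dataset come as strings with different levels of detail, such as `2018 Aug 24`, `2018 Mar` or just `2018`. The sketch below shows one way the cleaning step could parse them with the standard library. The helper name `parse_pub_date` and the list of formats are assumptions based on the values that appear in the `PubDate` and `EPubDate` columns; month abbreviations are matched according to the current locale.

```python
from datetime import datetime

# Formats seen in the PubDate / EPubDate columns, most specific first
DATE_FORMATS = ("%Y %b %d", "%Y %b", "%Y")


def parse_pub_date(text):
    """Return a datetime for strings like '2018 Aug 24', '2018 Mar' or '2018'.

    Returns None when the value is missing or matches none of the formats.
    """
    if not isinstance(text, str):
        return None  # e.g. NaN for a missing EPubDate
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # try the next, less specific format
    return None


for raw in ["2018 Aug 24", "2018 Mar", "2018", "2018 Jul 6", None]:
    print(raw, "->", parse_pub_date(raw))
```

Year-only and month-only values fall back to the first day of the period, so they should be interpreted with that in mind. With pandas, the helper could be applied to a column with `df["EPubDate"].map(parse_pub_date)`.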

In [None]:
%cat kenyan_papers_details.txt

In [4]:
# import data manipulation library 
import pandas as pd

In [2]:
# importing data using the read_csv function delimiter set to tabs
df = pd.read_csv('../Data/kenyan_papers_details.txt', delimiter="\t")

In [3]:
# .head() allows you to see the first observations 
df.head()

Unnamed: 0,Id,AuthorList,DOI,EPubDate,FullJournalName,HasAbstract,LastAuthor,NlmUniqueID,PubDate,PubTypeList,RecordStatus,Source,Title
0,30165703,"['Achieng L', 'Riedel DJ']",10.1093/infdis/jiy436,2018 Aug 24,The Journal of infectious diseases,0,Riedel DJ,413675,2018 Aug 24,['Journal Article'],PubMed - as supplied by publisher,J Infect Dis,Dolutegravir Resistance and Failure in a Kenya...
1,30165548,"['Letizia A', 'Eller MA', 'Polyak C', 'Eller L...",10.1093/infdis/jiy509,2018 Aug 27,The Journal of infectious diseases,1,Ake JA,413675,2018 Aug 27,['Journal Article'],PubMed - as supplied by publisher,J Infect Dis,Biomarkers of Inflammation Correlate with Clin...
2,30165370,"['Lalani T', 'Tisdale MD', 'Liu J', 'Mitra I',...",10.1371/journal.pone.0202178,2018 Aug 30,PloS one,1,Riddle MS,101285081,2018,['Journal Article'],PubMed - in process,PLoS One,Comparison of stool collection and storage on ...
3,30161172,"['Ayieko J', 'Brown L', 'Anthierens S', 'Van R...",10.1371/journal.pone.0202990,2018 Aug 30,PloS one,1,Camlin CS,101285081,2018,['Journal Article'],PubMed - in process,PLoS One,"""Hurdles on the path to 90-90-90 and beyond"": ..."
4,30161101,"['Golicha Q', 'Shetty S', 'Nasiblov O', 'Husse...",10.15585/mmwr.mm6734a4,2018 Aug 31,MMWR. Morbidity and mortality weekly report,1,Burton JW,7802429,2018 Aug 31,['Journal Article'],PubMed - in process,MMWR Morb Mortal Wkly Rep,"Cholera Outbreak in Dadaab Refugee Camp, Kenya..."


In [60]:
# see the column names of the dataframe
df.columns

Index(['Id', 'AuthorList', 'DOI', 'EPubDate', 'FullJournalName', 'HasAbstract',
       'LastAuthor', 'NlmUniqueID', 'PubDate', 'PubTypeList', 'RecordStatus',
       'Source', 'Title'],
      dtype='object')

In [40]:
# gives a concise summary of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2269 entries, 0 to 2268
Data columns (total 13 columns):
Id                 2269 non-null int64
AuthorList         2269 non-null object
DOI                2222 non-null object
EPubDate           1805 non-null object
FullJournalName    2269 non-null object
HasAbstract        2269 non-null int64
LastAuthor         2263 non-null object
NlmUniqueID        2269 non-null object
PubDate            2269 non-null object
PubTypeList        2269 non-null object
RecordStatus       2269 non-null object
Source             2269 non-null object
Title              2269 non-null object
dtypes: int64(2), object(11)
memory usage: 230.5+ KB


In [73]:
# dimensions of the dataframe (rows, columns)
df.shape

(2269, 13)

Columns of interest include: EPubDate

In [None]:
df.PubDate[1]

In [45]:
# df['EPubDate'] = df['EPubDate']

In [64]:
# count how many papers share each electronic publication date;
# in this sample 2018 Jan 18 was the most common
df.EPubDate.value_counts()[:10]

2018 Jan 18    15
2017 Oct 23    14
2018 Mar 27    14
2017 Oct 10    13
2018 Mar 15    13
2018 Aug 10    12
2018 Jul 6     12
2017 Sep 13    12
2018 Apr 19    12
2017 Aug 22    11
Name: EPubDate, dtype: int64

In [65]:
# many of the publications in this sample are dated 2018;
# note that PubDate mixes year-only values with more specific dates
df.PubDate.value_counts()[:10]

2018        254
2017        197
2018 Mar     74
2017 Dec     63
2018 Jun     60
2018 Aug     57
2018 Jul     57
2018 Feb     53
2018 Jan     53
2018 Apr     51
Name: PubDate, dtype: int64
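Because `PubDate` mixes year-only values with more specific dates, the counts above are split across several labels for the same year. A minimal sketch of counting by year instead, using only the first four characters of each value, is shown here. The helper name `count_by_year` and the sample values are made up for illustration.

```python
from collections import Counter


def count_by_year(pub_dates):
    """Count publications per year from strings such as '2018 Mar' or '2018'."""
    years = Counter()
    for value in pub_dates:
        # skip missing values and anything that does not start with a year
        if isinstance(value, str) and value[:4].isdigit():
            years[value[:4]] += 1
    return years


# Made-up sample in the same style as the PubDate column
sample_dates = ["2018", "2018 Mar", "2017 Dec", "2018 Aug 24", "2017"]
print(count_by_year(sample_dates).most_common())  # [('2018', 3), ('2017', 2)]
```

On the DataFrame itself, `df.PubDate.str[:4].value_counts()` should give the equivalent result.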

In [80]:
# answers the question: which journals do researchers publish in most often?
# This could coincide with the type of research that is mostly done, for example the Malaria journal
df_group_pubs = df.groupby(by=["FullJournalName"]).count().sort_values(by="Source", ascending=False)
df_group_pubs

Unnamed: 0_level_0,Id,AuthorList,DOI,EPubDate,HasAbstract,LastAuthor,NlmUniqueID,PubDate,PubTypeList,RecordStatus,Source,Title
FullJournalName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
PloS one,138,138,138,138,138,138,138,138,138,138,138,138
Malaria journal,48,48,48,48,48,48,48,48,48,48,48,48
Scientific reports,47,47,47,47,47,47,47,47,47,47,47,47
PLoS neglected tropical diseases,42,42,42,42,42,42,42,42,42,42,42,42
The Pan African medical journal,38,38,38,38,38,38,38,38,38,38,38,38
The Journal of infectious diseases,35,35,35,4,35,35,35,35,35,35,35,35
The American journal of tropical medicine and hygiene,34,34,34,31,34,34,34,34,34,34,34,34
BMJ global health,33,33,33,32,33,30,33,33,33,33,33,33
BMC public health,32,32,32,32,32,32,32,32,32,32,32,32
"AIDS (London, England)",31,31,31,7,31,30,31,31,31,31,31,31


In [82]:
# grouping by publication date and journal name: many recent papers were published in PloS one
# to get all the results I recommend subsetting the column and writing it to a file
df_group_pubs2 = df.groupby(by=["PubDate","FullJournalName"]).count().sort_values(by="Source", ascending=False)
df_group_pubs2

Unnamed: 0_level_0,Unnamed: 1_level_0,Id,AuthorList,DOI,EPubDate,HasAbstract,LastAuthor,NlmUniqueID,PubTypeList,RecordStatus,Source,Title
PubDate,FullJournalName,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2018,PloS one,88,88,88,88,88,88,88,88,88,88,88
2017,PloS one,50,50,50,50,50,50,50,50,50,50,50
2017,The Pan African medical journal,30,30,30,30,30,30,30,30,30,30,30
2017,Wellcome open research,23,23,23,23,23,23,23,23,23,23,23
2018,BMJ global health,19,19,19,19,19,17,19,19,19,19,19
2017,BMJ global health,13,13,13,13,13,12,13,13,13,13,13
2018,Global health action,10,10,10,0,10,10,10,10,10,10,10
2018 Apr 25,Public health action,10,10,10,0,10,10,10,10,10,10,10
2017 Oct,PLoS neglected tropical diseases,9,9,9,9,9,9,9,9,9,9,9
2018,The Pan African medical journal,8,8,8,8,8,8,8,8,8,8,8


In [83]:
# count the papers per journal (those with an EPubDate) as confirmation that the grouping works
# Wellcome open research has 27 papers
df_group_pubs.EPubDate.sort_values(ascending=False)

FullJournalName
PloS one                                                                                          138
Malaria journal                                                                                    48
Scientific reports                                                                                 47
PLoS neglected tropical diseases                                                                   42
The Pan African medical journal                                                                    38
BMJ global health                                                                                  32
BMC public health                                                                                  32
The American journal of tropical medicine and hygiene                                              31
Wellcome open research                                                                             27
BMC health services research                                      

In [86]:
# same thing, but counting the PubDate column instead of EPubDate
df_group_pubs.PubDate.sort_values(ascending=False)

FullJournalName
PloS one                                                                                                138
Malaria journal                                                                                          48
Scientific reports                                                                                       47
PLoS neglected tropical diseases                                                                         42
The Pan African medical journal                                                                          38
The Journal of infectious diseases                                                                       35
The American journal of tropical medicine and hygiene                                                    34
BMJ global health                                                                                        33
BMC public health                                                                                        32
AIDS (London

In [68]:
df_group_pubs2.sort_values(by="FullJournalName",ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,Id,AuthorList,DOI,EPubDate,HasAbstract,LastAuthor,NlmUniqueID,PubTypeList,RecordStatus,Source,Title
PubDate,FullJournalName,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2018,mHealth,2,2,2,2,2,2,2,2,2,2,2
2018 Apr 3,mBio,1,1,1,1,1,1,1,1,1,1,1
2017 Sep 13,eLife,1,1,1,1,1,1,1,1,1,1,1
2018 Apr 25,eLife,1,1,1,1,1,1,1,1,1,1,1
2018 Mar 4,Zootaxa,1,1,0,1,1,1,1,1,1,1,1
2018 Apr 17,Zootaxa,1,1,0,1,1,1,1,1,1,1,1
2018 Mar 22,Zootaxa,1,1,0,1,1,1,1,1,1,1,1
2017 Nov 30,Zootaxa,1,1,0,1,1,1,1,1,1,1,1
2017 Oct 17,Zootaxa,1,1,0,1,1,1,1,1,1,1,1
2018 Mar 21,Zootaxa,1,1,0,1,1,1,1,1,1,1,1


Group publications by journal

- import the data
- do some cleaning, that is, parsing datetime objects
- select the necessary columns
- group by the date and publication columns

In [69]:
df["FullJournalName"].value_counts()

PloS one                                                                                                138
Malaria journal                                                                                          48
Scientific reports                                                                                       47
PLoS neglected tropical diseases                                                                         42
The Pan African medical journal                                                                          38
The Journal of infectious diseases                                                                       35
The American journal of tropical medicine and hygiene                                                    34
BMJ global health                                                                                        33
BMC public health                                                                                        32
AIDS (London, England)      

In [91]:
def summary(filename):
    '''Take a tab-delimited file of paper details and print summary statistics:
    * the number of records
    * the most common publication dates
    * publications grouped by date and journal
    * publications grouped by journal

    Prints tables with the above information and returns "Done".'''
    df = pd.read_csv(filename, delimiter="\t")
    msg = "The number of publications in the sample from NCBI (rows, columns): {}"
    # format() only builds the string, so it has to be printed explicitly
    print(msg.format(df.shape))
    print("=" * 100)
    print("Common electronic publication dates" + str(df.EPubDate.value_counts()[:10]))
    print("Common Publication dates" + str(df.PubDate.value_counts()[:10]))
    print("=" * 100)
    print("Grouping publications by date and journal")
    df_group_pubs2 = df.groupby(by=["PubDate","FullJournalName"]).count().sort_values(by="Source", ascending=False)
    print(df_group_pubs2)
    print("=" * 100)
    print("Sorting the publications found by journal")
    # build the per-journal counts from this file instead of relying on a global variable
    df_group_pubs = df.groupby(by=["FullJournalName"]).count().sort_values(by="Source", ascending=False)
    print(df_group_pubs.PubDate.sort_values(ascending=False))

    return "Done"

In [92]:
print (summary('../Data/kenyan_papers_details.txt'))

Common electronic publication dates2018 Jan 18    15
2017 Oct 23    14
2018 Mar 27    14
2017 Oct 10    13
2018 Mar 15    13
2018 Aug 10    12
2018 Jul 6     12
2017 Sep 13    12
2018 Apr 19    12
2017 Aug 22    11
Name: EPubDate, dtype: int64
Common Publication dates2018        254
2017        197
2018 Mar     74
2017 Dec     63
2018 Jun     60
2018 Aug     57
2018 Jul     57
2018 Feb     53
2018 Jan     53
2018 Apr     51
Name: PubDate, dtype: int64
Grouping publications by date and journal
                                                                Id  \
PubDate     FullJournalName                                          
2018        PloS one                                            88   
2017        PloS one                                            50   
            The Pan African medical journal                     30   
            Wellcome open research                              23   
2018        BMJ global health                                   19   
2017        

In [6]:
# use the AuthorList column to find the most frequent author lists in the sample of about 2000 papers
df.AuthorList.value_counts()[1:100]

# Found 4 occurrences of the following authors 
# 'Muchemi SK', 'Zebitz CPW', 'Borgemeister C', 'Akutse KS', 'Foba CN', 'Ekesi S', 'Fiaboe KKM'
# 'Williams PCM', 'Berkley JA'

['Muchemi SK', 'Zebitz CPW', 'Borgemeister C', 'Akutse KS', 'Foba CN', 'Ekesi S', 'Fiaboe KKM']                                                                                                                                                                                                                                                                                                                                                                                                                                             4
['Williams PCM', 'Berkley JA']                                                                                                                                                                                                                                                                                                                                                                                                                                                            
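Note that `value_counts()` on `AuthorList` counts whole author lists, because each cell holds the text of a Python list. To count individual authors, each cell can first be turned back into a list. The sketch below does this with `ast.literal_eval`; the helper name `count_authors` is hypothetical, and the sample cells reuse author lists that appear in the outputs above plus one deliberately invalid value.

```python
import ast
from collections import Counter


def count_authors(author_cells):
    """Count individual authors from cells like "['Achieng L', 'Riedel DJ']"."""
    counts = Counter()
    for cell in author_cells:
        try:
            authors = ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            continue  # skip cells that are not a valid Python literal
        if isinstance(authors, list):
            counts.update(authors)
    return counts


sample_cells = [
    "['Achieng L', 'Riedel DJ']",
    "['Williams PCM', 'Berkley JA']",
    "['Williams PCM', 'Berkley JA']",
    "not a list",  # invalid on purpose, to show it is skipped
]

author_counts = count_authors(sample_cells)
print(author_counts.most_common(2))
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text read from a file.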

In [10]:
# shell commands used to find the most and least common author names in our sample
# (shell commands need the "!" prefix; "%cut" is not an IPython magic)
# !cut -f2 kenyan_papers_details.txt | head -10
# !cut -f2 kenyan_papers_details.txt > authorList2000papers.txt

UsageError: Cell magic `%%cut` not found.


# common names in publications

|name| occurrences of name|
| ------------- |-----:|
|Van| 114| 
|Wang |60|
|Otieno| 52|
|Zhang |49|
|Bukusi| 46|
|Cheng|45|

Above are the most common names found in the kenyan_papers_details.txt file. Van and Wang were the most common, followed by Otieno, Zhang, Bukusi and Cheng. These counts were determined with the word cloud application [here](https://gettingappy.shinyapps.io/wordcloudunigrams/), using the file authorList2000papers.txt. In the image, the size of a word corresponds to its frequency in the text.

![The wordcloud and how the parameters were set](plots/wordcloud_result.png)

In [3]:
?pd.read_csv

In [5]:
# load in the dataframe and give the columns names
df2 = pd.read_csv("PMID_PMC_Journal_Year.txt",delimiter="\t",names=["pmid","pmd","journal","year"])

In [6]:
# no rows are lost:
# pandas either loads the whole file or raises an error
df2.shape

(25256, 4)

In [7]:
# concise summary of the dataset
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25256 entries, 0 to 25255
Data columns (total 4 columns):
pmid       25256 non-null object
pmd        6705 non-null object
journal    12313 non-null object
year       12304 non-null object
dtypes: object(4)
memory usage: 789.3+ KB
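The summary shows that every row has a `pmid`, but only part of the rows have values in the other three columns. The same kind of check can be sketched without pandas by counting the non-empty fields per column in a headerless tab-delimited file. The helper name `count_filled` is hypothetical and the sample rows are partly made up (the blank fields and the third row are invented) to show rows with missing values.

```python
import csv
import io


def count_filled(tsv_text, names):
    """Count non-empty values per column in headerless tab-delimited text."""
    filled = dict.fromkeys(names, 0)
    rows = 0
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        rows += 1
        # zip() stops at the shorter sequence, so short rows are handled too
        for name, value in zip(names, row):
            if value.strip():
                filled[name] += 1
    return rows, filled


# Sample rows in the same four-column layout as the file loaded above
sample_tsv = (
    "30123385\tPMC6062910\tOpen AIDS J\t2018\n"
    "30165703\t\tJ Infect Dis\t2018\n"
    "11111111\t\t\t\n"
)

n_rows, filled_counts = count_filled(sample_tsv, ["pmid", "pmd", "journal", "year"])
print(n_rows, filled_counts)
```

Comparing each count with the number of rows gives the share of records that have, for example, a PMC identifier.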


In [5]:
# Use R programming language in the notebook 
# https://www.linkedin.com/pulse/interfacing-r-from-python-3-jupyter-notebook-jared-stufft/
# import rpy2.rinterface 
# %load_ext rpy2.ipython

In [None]:
# %%R -i df2
# head(df2)

In [8]:
# convert the txt file into a csv for visualization in R
df2.to_csv("PMID_PMC_Journal_Year.csv")