# PubMed Topic Tracker
## 1. Search and download

This tool builds PubMed queries, downloads the matching entries, parses them and saves them to a .csv file. It takes a PubMed query as input and outputs a dataset, i.e. a folder containing the PubMed export, its metadata saved in the log file, and the Medline file for importing the references under analysis into Zotero or similar software.

The output can be explored with the second and third notebooks of this collection.

Dependencies:
- pandas 1.2.1
- IPython 7.19.0
- tqdm 4.55.1
- shutils 0.1.0

In [8]:
# Import libraries
from time import sleep
import PubGetParse as pg
import pandas as pd
import numpy as np
from IPython.display import clear_output
from tqdm import tqdm
import time
from collections import Counter
import re
import os
from shutil import copy2

# Define log file
log = "log.py"

In [9]:
# Log file
# Start a fresh log: opening in "w" mode creates the file or empties an existing one
timestr = time.strftime("%Y.%m.%d-%H:%M:%S")
with open(log, "w") as f:
    f.write("# This is a log file. It is saved as .py so that the following notebooks can easily import it and use its information.\n\n")
    f.write("# started at: " + timestr + "\n\n")
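Because the log is written as Python assignments, a later notebook can read its values back. The following is a minimal sketch, assuming the log file is valid Python; `load_log` is a hypothetical helper that is not part of `PubGetParse`, and the log content below is invented for illustration. It uses the standard-library `runpy` module so that the file does not have to be on the import path.

```python
import runpy
import tempfile
from pathlib import Path


def load_log(path):
    """Execute a log file made of Python assignments and return its variables."""
    namespace = runpy.run_path(str(path))
    # runpy also returns entries such as __name__; keep only the logged values.
    return {k: v for k, v in namespace.items() if not k.startswith("__")}


# Hypothetical log content, mimicking the lines this notebook writes.
example_log = (
    "# started at: 2022.03.03-21:50:00\n\n"
    'year0 = "2021"\n'
    'year1 = "2022"\n'
    'keywords = "xml"\n'
)

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "log.py"
    log_path.write_text(example_log)
    settings = load_log(log_path)

print(settings["year0"], settings["year1"], settings["keywords"])
```

The values come back as strings, exactly as they were logged, so a consumer would cast the years with `int()` before using them.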

## First step: definition and segmentation of the query
The query must not contain time references. PubMed allows only max. 100k results per query, hence the main query will be split by year; e.g. if the time references are 1990 - 1995, the software will run one query for 1990, one for 1991, and so on up to 1995. Detailed information on the segmented queries is saved in the log for reproducibility.

Every other PubMed tag can be used.
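The segmentation described above can be sketched as a small function. This is an illustration under stated assumptions, not the notebook's exact code: `segment_query_by_year` is a hypothetical name, and the sketch wraps the user query in parentheses, which the cell below does not do. The parentheses are there because PubMed processes Boolean operators from left to right, so a query containing `OR` could otherwise escape the date restriction.

```python
def segment_query_by_year(query, first_year, last_year):
    """Return one PubMed query per calendar year, restricted by publication date."""
    if first_year > last_year:
        raise ValueError("first_year must not be later than last_year")
    date_tag = "[Date - Publication]"
    return [
        f'"{year}/01/01"{date_tag} : "{year}/12/31"{date_tag} AND ({query})'
        for year in range(first_year, last_year + 1)
    ]


segments = segment_query_by_year("xml", 1990, 1995)
print(len(segments), "queries, the first one being:")
print(segments[0])
```

For the 1990 - 1995 example this yields six queries, one per year.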

In [10]:
# Definition of the query. The easiest approach is to build it in PubMed and then copy and paste it here.
print("Important: do not include timepoints in your query, they will be defined via this interface")
pubmed_query = input("Paste here your PubMed query:")
year0 = int(input("\nIn order to better manage the amount of results, the query will be segmented by year. \nFrom what year do you want to start?"))
year1 = int(input("Up to what year do you want to search?"))

# Build one query per year, plus a display query covering the whole interval
date_tag = "[Date - Publication]"
yearlist = list(range(year0, year1 + 1))
querylist = [
    f'"{year}/01/01"{date_tag} : "{year}/12/31"{date_tag} AND {pubmed_query}'
    for year in yearlist
]

displayquery = f'"{year0}/01/01"{date_tag} : "{year1}/12/31"{date_tag} AND {pubmed_query}'

# Add log
timestr = time.strftime("%Y.%m.%d-%H:%M:%S")
with open(log, "a") as f:
    f.write("year0 = " + "\"" +  str(year0) + "\"" + "\n")
    f.write("year1 = " + "\"" + str(year1) + "\"" + "\n")
    # repr() escapes any quotes inside the query, so the log stays valid Python
    f.write("keywords = " + repr(pubmed_query) + "\n\n")
    f.write("'''\n")
    f.write("query = " + displayquery + "\n") 
    f.write("   Segmented as:\n")
    for x in querylist:
        f.write(x + "\n")
    f.write("'''\n\n")
print("\nThis query will be performed in PubMed, segmented by year:\n", displayquery)


Important: do not include timepoints in your query, they will be defined via this interface

This query will be performed in PubMed, segmented by year:
 "2021/01/01"[Date - Publication] : "2022/12/31"[Date - Publication] AND xml


In [11]:
# Run the queries (segmented by year) and merge all the PubMed IDs in one list
pids = []
for timequery in tqdm(querylist):
    x = pg.get_p_ids(timequery)
    pids.extend(x)
    sleep(0.3)
len1 = len(pids)

# Clean duplicates from list
pids = list(dict.fromkeys(pids))
len2 = len(pids)
dropped = len1-len2

print("Query completed. " + str(len1) + " PubMed IDs retrieved. " + str(dropped) + " duplicate entries dropped.\n ")
print("Downloading the non-duplicate entries, which are " + str(len2))
# Add log
timestr = time.strftime("%Y.%m.%d-%H:%M:%S")
with open(log, "a") as f:
    f.write("# Query executed at: " + timestr + "\n")
    f.write("paper_count_original = " + "\"" + str(len1) + "\"" + "\n")
estimatedtime = round((len2 / 1.5)/60, 2)


100%|██████████| 2/2 [00:02<00:00,  1.24s/it]

Query completed. 53 PubMed IDs retrieved. 2 duplicate entries dropped.
 
Downloading the non-duplicate entries, which are 51





## Retrieving MedLine entries for each one of the IDs and parsing them
Here we pass every PubMed ID previously retrieved to the API. The API responds with the MEDLINE record, from which we parse and save the following fields:

p_id, pid_type, year, journal, publisher, title, book_title, abstract, oabstract, authors, editors, language, meshterms, keywords, coi, grant, doi
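To give an idea of what parsing a MEDLINE record involves, here is a minimal sketch. It is not the implementation used by `PubGetParse`: `parse_medline_record` is a hypothetical function, the sample record is invented, and the sketch only relies on the general layout of the MEDLINE display format (a short tag, a hyphen, a value, and indented continuation lines).

```python
import re
from collections import defaultdict

# A tag line looks like "PMID- 00000000" or "TI  - Some title":
# a short uppercase tag, optional padding, then "- " and the value.
TAG_LINE = re.compile(r"^([A-Z]{2,4})\s*- (.*)$")


def parse_medline_record(text):
    """Parse one MEDLINE-format record into a dict of tag -> list of values."""
    fields = defaultdict(list)
    current_tag = None
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            current_tag, value = match.groups()
            fields[current_tag].append(value.strip())
        elif current_tag and line.startswith("      "):
            # Indented lines continue the value of the previous tag.
            fields[current_tag][-1] += " " + line.strip()
    return dict(fields)


# Invented record, used only for illustration.
sample = """PMID- 00000000
TI  - An example title that
      wraps onto a second line.
AU  - Doe J
AU  - Roe R
LA  - eng
DP  - 2021
"""

record = parse_medline_record(sample)
print(record["PMID"][0], record["AU"], record["TI"][0])
```

Storing every tag as a list keeps repeated tags such as `AU` (one line per author) without special cases; single-valued fields are simply lists of length one.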

In [12]:
# Here we download every article and we parse it as a list

# Add log
timestr = time.strftime("%Y.%m.%d-%H:%M:%S")
with open(log, "a") as f:
    f.write("# Download started at: " + timestr + "\n")

estimatedtime = round((len(pids) / 1.5)/60, 2)
print("\n\nEstimated time for downloading and parsing: ", estimatedtime, "minutes. \nThis assumes 1.5 iterations per second.\nGo grab yourself a coffee ;)")

# Retrieve and parse entries
entrylist = []
for pid in tqdm(pids):
    x = pg.get_parse_article_re(pid)
    entrylist.append(x)
    sleep(0.3)  # slow down between requests to avoid being blocked by the server

# Add log
timestr = time.strftime("%Y.%m.%d-%H:%M:%S")
with open(log, "a") as f:
    f.write("# Download finished at: " + timestr + "\n")




Estimated time for downloading and parsing:  0.57 minutes. 
This assumes 1.5 iterations per second.
Go grab yourself a coffee ;)


100%|██████████| 51/51 [00:42<00:00,  1.19it/s]


---
## MedLine entries become a neat dataframe
Here we check for duplicates using PubMed IDs, remove articles published outside the time interval specified in the query, and create a dataframe with the content of every entry. Finally, we export it as a .csv file.

### Important: some cleaning is performed on the data
PubMed stores multiple dates per entry. Its results can therefore include papers published before the requested interval, because they were indexed (i.e. added to the database) years later.
To provide clean results, we remove from the dataset the papers whose actual publication date falls outside the scope of the query.
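The cleaning rule can be illustrated without pandas. The sketch below is only an illustration: `filter_by_publication_year` is a hypothetical function, the records are invented, and keeping entries whose year is unknown is a choice made for this sketch.

```python
def filter_by_publication_year(records, first_year, last_year):
    """Split records into those inside the year range and those outside it."""
    kept, dropped = [], []
    for record in records:
        year = record.get("year")
        # Entries with no parsed year are kept: nothing shows they are out of range.
        if year is None or first_year <= year <= last_year:
            kept.append(record)
        else:
            dropped.append(record)
    return kept, dropped


# Invented records: the first one was published in 2019 but indexed later.
records = [
    {"p_id": "1", "year": 2019},
    {"p_id": "2", "year": 2021},
    {"p_id": "3", "year": 2022},
    {"p_id": "4", "year": None},
]

kept, dropped = filter_by_publication_year(records, 2021, 2022)
print(f"Dropped {len(dropped)} entries, kept {len(kept)}.")
```

Returning the dropped entries as well makes it easy to report how many were removed, as the notebook does in its printed summary.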

In [13]:
df = pd.DataFrame(entrylist, columns =[
    "p_id", "pid_type", "year", "journal", "publisher", "title", "book_title", "abstract", "oabstract", "authors", "editors", "language", "meshterms", "keywords", "coi", "grant", "doi"])
df.index += 1 

# Replace empty cells with NA and cast year to INT
df = df.replace(r'^\s*$', np.nan, regex=True)
df["year"] = df["year"].astype('float').astype('Int32')

# Check time interval: drop entries published outside [year0, year1]
length0 = len(df.index)
df = df.drop(df[df.year < int(year0)].index)
df = df.drop(df[df.year > int(year1)].index)
df = df.reset_index(drop=True)
df.index += 1
length1 = len(df.index)

print("Dropped " + str(length0 - length1) + " entries due to publication time outside query parameters.")
print(str(length1) + " entries included.")

# Export the dataframe
timestr = time.strftime("%Y%m%d-%H%M%S")
exportdir = "export/" + timestr
os.makedirs(exportdir, exist_ok=True)  # also creates "export/" if it does not exist yet
df.to_csv(exportdir + "/PubMed full records.csv", sep=';')

print(str(len1) + " entries found. " + str(dropped) + " duplicate entries dropped.\n ")
# Report the number of rows actually written, i.e. after the time-interval filter
print(str(len(df)) + " records successfully saved to .csv in " + exportdir + ". \nYou can go ahead with the analysis :)")
display(df.head(20))

# Add log
timestr = time.strftime("%Y.%m.%d-%H:%M:%S")
with open(log, "a") as f:
    f.write("paper_count_no_duplicates = " + "\"" + str(len(df)) + "\"" + "\n")
    f.write("# Data exported at: " + timestr + " to : " + exportdir + "\n")
    f.write("exportdir = " + "\"" + exportdir + "\"" + "\n")

# Copy the log file to the export folder as documentation
destination_log = exportdir + "/log.txt"
copy2(log, destination_log)

Dropped 0 entries due to publication time outside query parameters.
51 entries included.
53 entries found. 2 duplicate entries dropped.
 
51 records successfully saved to .csv in export/20220303-215132. 
You can go ahead with the analysis :)


Unnamed: 0,p_id,pid_type,year,journal,publisher,title,book_title,abstract,oabstract,authors,editors,language,meshterms,keywords,coi,grant,doi
1,35194431,Article,2021,Iranian journal of pharmaceutical research : IJPR,,Pharmacogenomics Implementation and Hurdles to...,,"Having multiple dimensions, uncertainties and ...",,"Ayati N, Afzali M, Hasanzad M, Kebriaeezadeh A...",,eng,,"[Developing countries, Dynamic challenges, Ira...",The authors declare no conflict of interest.,,10.22037/ijpr.2021.114899.15091
2,35106138,Article,2021,F1000Research,,Improving the support for XML dynamic updates ...,,Background : As the standard for the exchange ...,,"Haw SC, Amin A, Wong CO, Subramaniam S",,eng,,"[XML databases, XML labeling scheme., XML-RDB ...",No competing interests were disclosed.,,10.12688/f1000research.69108.1
3,35028636,Article,2022,Journal of mass spectrometry and advances in t...,,Listening to your mass spectrometer: An open-s...,,Introduction: We have developed a set of tools...,,"Pablo A, Hoofnagle AN, Mathias PC",,eng,,"[Dashboard, Database, GB, Gigabyte, LC-MS/MS, ...",The authors declare that they have no known co...,,10.1016/j.jmsacl.2021.12.003
4,34972171,Article,2021,PloS one,,Medical data integration using HL7 standards f...,,Integration between information systems is cri...,,"AlQudah AA, Al-Emran M, Shaalan K",,eng,"[*Appointments and Schedules, Computer Securit...",,The authors have declared that no competing in...,,10.1371/journal.pone.0262067
5,34916929,Article,2021,Frontiers in pharmacology,,Efficacy and Safety of Traditional Chinese Med...,,Background: Heart failure as an important issu...,,"Lin S, Shi Q, Ge Z, Liu Y, Cao Y, Yang Y, Zhao...",,eng,,"[bayesian model, heart failure, network meta-a...",The authors declare that the research was cond...,,10.3389/fphar.2021.659707
6,34890097,Article,2021,The FEBS journal,,EnzymeML-a data exchange format for biocatalys...,,EnzymeML is an XML-based data exchange format ...,,"Range J, Halupczok C, Lohmann J, Swainston N, ...",,eng,,"[FAIR data principles, Python, Systems Biology...",,EXC310/Deutsche Forschungsgemeinschaft,10.1111/febs.16318
7,34770684,Article,2021,"Sensors (Basel, Switzerland)",,Control and Diagnostics System Generator for C...,,FPGA-based data acquisition and processing sys...,,"Zabolotny WM, Guminski M, Kruszewski M, Muller...",,eng,"[*Computers, *Software]","[FPGA, VHDL, Wishbone, control interface, syst...",,,10.3390/s21217378
8,34760250,Article,2021,Food science & nutrition,,The effect of curculigo orchioides (Xianmao) o...,,The Chinese materia medica Xianmao (XM) is wid...,,"Chen L, Qu B, Wang H, Liu H, Guan Y, Zhou J, Z...",,eng,,"[RT-PCR, Xianmao, kidney energy metabolism, me...",The authors declared no potential conflicts of...,,10.1002/fsn3.2573
9,34734333,Article,2021,Protoplasma,,Construction of an N6-methyladenosine lncRNA- ...,,The present paper aims to shed light on the in...,,"Yu ZL, Zhu ZM",,eng,,"[Bioinformatics analysis, Colorectal cancer, I...",,81860433/National Natural Science Foundation o...,10.1007/s00709-021-01718-x
10,34720253,Article,2021,Scientometrics,,Software review: The JATSdecoder package-extra...,,JATSdecoder is a general toolbox which facilit...,,Boschen I,,eng,,"[Meta-research, PubMed central, Software, Text...",Conflict of interestThe author declares no con...,,10.1007/s11192-021-04162-z


'export/20220303-215132/log.txt'

## Creating MedLine file

Here we create a MedLine file from the entries included in the analysis. The MedLine file can then be used to import the references (and to get the papers) in reference management software, e.g. Zotero.

In [14]:
# Create MedLine file from the dataframe for import in reference management software
print("Creating MedLine file from the dataframe...")
pids_to_get = df["p_id"].tolist()
medline_file = "medline.txt"
open(medline_file, "w").close()  # create the file, or empty it if it already exists

for x in tqdm(pids_to_get):
    x = str(x)
    pg.art_to_medline(x, medline_file)
message = ("MedLine file created.")
print(message)

destination_medline = exportdir + "/medline.txt"
copy2(medline_file, destination_medline)


Creating MedLine file from the dataframe...


100%|██████████| 51/51 [00:39<00:00,  1.29it/s]

MedLine file created.





'export/20220303-215132/medline.txt'

In [15]:
# Add log
timestr = time.strftime("%Y.%m.%d-%H:%M:%S")
with open(log, "a") as f:
    f.write("paper_count_no_duplicates = " + "\"" + str(len(df)) + "\"" + "\n")
    f.write("# Data exported at: " + timestr + " to : " + exportdir + "\n")
    f.write("exportdir = " + "\"" + exportdir + "\"" + "\n")

# Copy the log file to the export folder as documentation
destination_log = exportdir + "/log.txt"
copy2(log, destination_log)

'export/20220303-215132/log.txt'