# Biopython PubMed queries

In [1]:
from Bio import Entrez
import entrez_utils
import utils
import pandas as pd

**Both the API key and the email address must be set.**  
An API key can be requested on the NCBI account profile page (registration required).

In [4]:
# Entrez.api_key = '20af26e91ae36b8ec830da38ca84b872a209'
# Entrez.email = 'michiel.noback@gmail.com'

entrez_utils.init('20af26e91ae36b8ec830da38ca84b872a209', 'michiel.noback@gmail.com')
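
One way to supply both values without editing the notebook is to read them from environment variables. The sketch below is only an illustration: the variable names `NCBI_API_KEY` and `NCBI_EMAIL` and the helper `load_ncbi_credentials` are assumptions, not part of `entrez_utils` or Biopython.

```python
import os


def load_ncbi_credentials(env=os.environ):
    """Return (api_key, email) read from a mapping of environment variables.

    NCBI_API_KEY and NCBI_EMAIL are hypothetical names chosen for this sketch.
    """
    api_key = env.get("NCBI_API_KEY")
    email = env.get("NCBI_EMAIL")
    if not api_key or not email:
        raise RuntimeError("Set NCBI_API_KEY and NCBI_EMAIL before querying Entrez")
    return api_key, email


# Possible use in this notebook:
# api_key, email = load_ncbi_credentials()
# entrez_utils.init(api_key, email)
```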

In [3]:
#run this to reload the entrez_utils module after changes during session
#from importlib import reload 
#reload(entrez_utils);

In [5]:
# STEP 1: fetch ids for query terms
# see https://www.epa.gov/ingredients-used-pesticide-products/types-pesticide-ingredients
pesticide_terms_small = ["pesticide", "fungicide", "insecticide", "herbicide"]
pesticide_terms_extended = ["pesticide", "fungicide", "insecticide", "herbicide", "rodenticides", "algicides",
                           "antifoulant", "biocide", "defoliant", "miticide", "molluscicide", "ovicide", "nematicide"]

pm_ids = entrez_utils.query_pubmed(pesticide_terms_small,
                                    start_date="2020/01/01", end_date="2022/03/10", # end_date defaults to today
                                    retmax=10)

Querying Pubmed with: (pesticide OR fungicide OR insecticide OR herbicide) AND (2020/01/01:2022/03/10[dp])
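
The logged query shows the search terms joined with `OR` and combined with a publication-date range (`[dp]`). A minimal sketch of how such a string could be assembled is given below; `build_pubmed_query` is a hypothetical helper, and the actual implementation in `entrez_utils.query_pubmed` may differ.

```python
from datetime import date


def build_pubmed_query(terms, start_date=None, end_date=None):
    """Join terms with OR and optionally restrict by publication date.

    Hypothetical helper; dates are strings formatted as YYYY/MM/DD.
    """
    if not terms:
        raise ValueError("at least one search term is required")
    query = "(" + " OR ".join(terms) + ")"
    if start_date is not None:
        if end_date is None:
            # assumption: an open-ended range runs until today
            end_date = date.today().strftime("%Y/%m/%d")
        query += f" AND ({start_date}:{end_date}[dp])"
    return query


query = build_pubmed_query(
    ["pesticide", "fungicide", "insecticide", "herbicide"],
    start_date="2020/01/01",
    end_date="2022/03/10",
)
print(query)
```

A string built this way could be passed as the `term` argument of `Entrez.esearch(db="pubmed", term=query, retmax=10)`.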


In [6]:
# STEP 2: fetch title and abstract for each pubmed ID

#pm_ids = ['40059550', '40054240', '40054238', '40051450', '40049114']
entrez_utils.fetch_abstracts(pm_ids, output_file='new_abstracts.csv')

Fetching abstracts from 0 to 10.
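
The log line suggests that the IDs are fetched in batches. How `fetch_abstracts` does this is not shown here, so the sketch below only illustrates the general pattern; `batch_ranges` and the batch size of 10 are assumptions.

```python
def batch_ranges(n_items, batch_size):
    """Yield (start, end) index pairs that cover range(n_items) in batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


placeholder_ids = [str(i) for i in range(25)]  # stand-ins for real PubMed IDs
for start, end in batch_ranges(len(placeholder_ids), batch_size=10):
    batch = placeholder_ids[start:end]  # the IDs that one request would cover
    print(f"Fetching abstracts from {start} to {end}.")
```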


## New Validation datasets generated

To validate the ML algorithms, validation sets were constructed. To this end, five queries were carried out:

```python
# positives
positive_ids = entrez_utils.query_pubmed(pesticide_terms_small, retmax=1000)

# negatives: one query per control term
negative_terms = [["soil"], ["gene"], ["cell"], ["disease"], ["crop"],
                  ["health"], ["mouse"], ["bacteria"], ["protein"], ["cancer"]]
ids = entrez_utils.query_pubmed(["soil"], retmax=200)
# and so forth for each entry of negative_terms

```

The resulting files are concatenated into a single file (except for the pesticide file):

```bash
#concat without headers
awk 'FNR != 1' *.csv > tmp.csv
wc -l tmp.csv # 1200
sort -u tmp.csv > tmp2.csv
cat header.csv tmp2.csv > validation_set_master.csv
wc -l validation_set_master.csv # 879
```
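
The same merge can be sketched in Python with the standard `csv` module, which also copes with quoted fields that span several lines. This is an illustration rather than the procedure actually used: `merge_csv_files` is a hypothetical helper, the tab delimiter is an assumption, and rows keep their first-seen order instead of being sorted.

```python
import csv


def merge_csv_files(paths, out_path, delimiter="\t"):
    """Concatenate CSV files, keep one header, and drop duplicate rows.

    Returns the number of data rows written.
    """
    header = None
    seen = set()
    rows = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle, delimiter=delimiter)
            file_header = next(reader, None)
            if file_header is None:
                continue  # skip empty files
            if header is None:
                header = file_header  # keep the header of the first file
            for row in reader:
                key = tuple(row)
                if key not in seen:
                    seen.add(key)
                    rows.append(row)
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=delimiter)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)
    return len(rows)


# Possible use (file names are placeholders):
# merge_csv_files(["abstracts_soil_200.csv", "abstracts_gene_200.csv"],
#                 "validation_set_master.csv")
```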

A function was written to draw random validation sets with differing proportions of positives and negatives. See below.

In [13]:
# generate one negatives file per control term
#terms = [["soil"], ["gene"], ["cell"], ["disease"], ["crop"], ["health"], ["mouse"], ["bacteria"], ["protein"], ["cancer"]]
terms = [["soil"], ["gene"]]

import time
retmax=500
for term in terms:
    #ids = entrez_utils.query_pubmed(term, retmax=retmax)
    out_file = 'abstracts_' + "_".join(term) + '_' + str(retmax) + '.csv'
    print(f'writing {out_file}')
    # entrez_utils.fetch_abstracts(ids, output_file=out_file)
    # sleep between queries to avoid being blocked by NCBI
    time.sleep(5)

writing abstracts_soil_500.csv
writing abstracts_gene_500.csv


### Sampling validation data

The following function was written in `utils.py`:

```python
def sample_validation_set(positives_file, negatives_file, 
                          n_positive=30, n_negative=470,
                          out_file='validation_set.csv',
                          abstract_required=True):
    #code omitted
```
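
Since the body is omitted, the following is only a sketch of what such a sampling step might look like, working on DataFrames instead of files. The column names `mid`, `abstract`, `label` and `text_label` follow the output shown in this notebook. Several things are assumptions and may differ from the real `utils.sample_validation_set`: the value `label=1` for positives, the `seed` parameter, the final shuffle, and removing overlapping records before sampling rather than after.

```python
import pandas as pd


def sample_validation_frame(positives, negatives,
                            n_positive=30, n_negative=470,
                            abstract_required=True, seed=None):
    """Draw a labelled random sample from two DataFrames (illustrative sketch)."""
    if abstract_required:
        # discard records without an abstract
        positives = positives.dropna(subset=["abstract"])
        negatives = negatives.dropna(subset=["abstract"])
    positives = positives.drop_duplicates(subset="mid")
    negatives = negatives.drop_duplicates(subset="mid")
    # assumption: a record found in both sets counts as a positive only
    negatives = negatives[~negatives["mid"].isin(positives["mid"])]

    pos = positives.sample(n=min(n_positive, len(positives)), random_state=seed)
    neg = negatives.sample(n=min(n_negative, len(negatives)), random_state=seed)
    pos = pos.assign(label=1, text_label="pest")   # label=1 is an assumption
    neg = neg.assign(label=0, text_label="contr")

    combined = pd.concat([pos, neg], ignore_index=True)
    # shuffle so that positives and negatives are interleaved
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)
```

Removing the overlap before sampling keeps the requested counts whenever enough records are available; removing duplicates afterwards can leave the set smaller than requested.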

In [8]:
from importlib import reload 
reload(utils);

positives_file = './data/pesticide_abstracts_narrow.csv' # has 250
negatives_file = './data/validation_set_master.csv' # has 1000
out_file = './data/validation_set_500.csv'
utils.sample_validation_set(positives_file, negatives_file, out_file = out_file)

In [9]:
# check the generated file
val_data = pd.read_csv(out_file, sep='\t')
val_data.head()

Unnamed: 0,mid,title,abstract,label,text_label
0,40058542,Iron at the helm: steering arsenic speciation ...,The toxicity and bioavailability of arsenic (A...,0,contr
1,40065997,The assembly and annotation of two teinturier ...,"Teinturier grapevines, known for their pigment...",0,contr
2,40066500,Who cares about the dying? - Unpacking integra...,Integrating palliative care into the trajector...,0,contr
3,40064377,The stability and elimination of mammalian env...,We assessed the viability of aerosolized human...,0,contr
4,40064517,Microdroplet-Mediated Enzyme Activity Enhancem...,On-site measurements of enzyme activity in com...,0,contr


In [10]:
# check proportions of pos/neg
# counts can be lower than requested when the two datasets overlap,
# because duplicates are removed
val_data['text_label'].value_counts()

text_label
contr    470
pest      30
Name: count, dtype: int64

### Some general-purpose example code

In [3]:
stream = Entrez.einfo(db="pubmed")
record = Entrez.read(stream)
record["DbInfo"]["Description"]

'PubMed bibliographic record'

In [8]:
record["DbInfo"]["LastUpdate"]

'2025/03/10 18:38'

In [9]:
record["DbInfo"]["Count"]

'38514474'

In [62]:
print(record.keys())

dict_keys(['Count', 'RetMax', 'RetStart', 'IdList', 'TranslationSet', 'QueryTranslation'])


In [4]:
for field in record["DbInfo"]["FieldList"]:
    print("%(Name)s, %(FullName)s, %(Description)s" % field)

ALL, All Fields, All terms from all searchable fields
UID, UID, Unique number assigned to publication
FILT, Filter, Limits the records
TITL, Title, Words in title of publication
MESH, MeSH Terms, Medical Subject Headings assigned to publication
MAJR, MeSH Major Topic, MeSH terms of major importance to publication
JOUR, Journal, Journal abbreviation of publication
AFFL, Affiliation, Author's institutional affiliation and address
ECNO, EC/RN Number, EC number for enzyme or CAS registry number
SUBS, Supplementary Concept, CAS chemical name or MEDLINE Substance Name
PDAT, Date - Publication, Date of publication
EDAT, Date - Entry, Date publication first accessible through Entrez
VOL, Volume, Volume number of publication
PAGE, Pagination, Page number(s) of publication
PTYP, Publication Type, Type of publication (e.g., review)
LANG, Language, Language of publication
ISS, Issue, Issue number of publication
SUBH, MeSH Subheading, Additional specificity for MeSH term
SI, Secondary Source ID, Cr