# **Extracting article IDs from PubMed for a list of pharmaceutical names for statistical analysis**

The code performs the following task:

Extract the IDs of all articles that use a pharmaceutical name without referring to the corresponding scientific name, for a list of pharmaceutical names.

Required Installations

In [None]:
!pip install biopython
!pip install beautifulsoup4
!pip install --upgrade openpyxl==3.0.5
!pip install --upgrade pandas==1.1.2
# Python itself cannot be upgraded through pip; this notebook was written for Python 3.7
!pip install XlsxWriter

In [None]:
#python: 3.7 |openpyxl: 3.0.5 | pandas: 1.1.2 |xlrd 1.2.0
#pip install --upgrade pandas==0.19.2


Import dependent libraries

In [None]:
import pandas as pd
import re
import Bio
import requests
from bs4 import BeautifulSoup
from Bio import Entrez
Entrez.email =  "radhu.palliyana@gmail.com" # provide mail id after creating api key
from openpyxl import load_workbook
import xlsxwriter

In [None]:
from google.colab import drive
drive.mount('/content/drive') 

Load MPNS version 11 datasets:


1.   mpns_v11_non_sci_names_1.csv, containing the non-scientific (common or pharmaceutical) names of medicinal plants
2.   mpns_v11_plants_1.csv, containing the scientific names of medicinal plants
3.   mpns_v11_synonyms_1.csv, containing the synonyms (old scientific names) of medicinal plants

In [None]:
mpns_non_sci = pd.read_csv("/content/drive/MyDrive/Dissertation/mpns_v11_non_sci_names_1.csv")
mpns_plant = pd.read_csv("/content/drive/MyDrive/Dissertation/mpns_v11_plants_1.csv")
#mpns_synon = pd.read_csv("/content/drive/MyDrive/Dissertation/mpns_v11_synonyms_1.csv")

**Extracting all the article IDs for articles that use a pharmaceutical name without reference to the corresponding scientific name**

Merge Table 1 (the MPNS plants dataset) with Table 3 (the MPNS non-scientific names dataset) on the corresponding name ID.

According to the MPNS data dictionary, the name_id field links every row in TABLE 3 (Non Scientific Names) to exactly ONE row in either TABLE 1 (PLANTS) or TABLE 2 (SYNONYMS).
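
The join described above can be sketched on toy data. This is a minimal illustration, not the MPNS data: every value below is invented, and only the column names `name_id`, `name_id_non_sci`, `plant_id`, `name`, `name_type` and `full_scientific_name` are taken from this notebook. The `indicator=True` argument adds a `_merge` column, which makes it easy to see which plants found no non-scientific name in the left join.

```python
import pandas as pd

# Toy stand-ins for TABLE 1 (plants) and TABLE 3 (non-scientific names).
# All values are invented for illustration.
plants = pd.DataFrame({
    "name_id": ["p1", "p2"],
    "full_scientific_name": ["Genus alpha L.", "Genus beta Mill."],
})
non_sci = pd.DataFrame({
    "name_id_non_sci": ["n1", "n2"],
    "plant_id": ["p1", "p1"],
    "name": ["alphae radix", "alpha root"],
    "name_type": ["pharmaceutical", "common"],
})

# Same join shape as in the notebook, plus the _merge indicator column.
merged = pd.merge(plants, non_sci, how="left",
                  left_on="name_id", right_on="plant_id", indicator=True)

# Plants without any non-scientific name are marked 'left_only'.
unmatched = merged.loc[merged["_merge"] == "left_only", "name_id"].tolist()
print(merged[["name_id", "name", "_merge"]])
print("plants without a non-scientific name:", unmatched)
```

Rows marked `left_only` are the ones that the later null filter on `name` would remove.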


**Data Cleaning and feature engineering**

In [None]:
mpns_non_sci.rename(columns ={'name_id':'name_id_non_sci'},inplace= True)# renaming the column name_id to name_id_non_sci for mpns non scientific(table 3) dataset

In [None]:
art1 = pd.merge(mpns_plant,mpns_non_sci, how = "left" , left_on= "name_id",right_on="plant_id") # merging of the tables 1 and 3
#print(art1.head())
# Checking for header values
for col in art1.columns:
    print(col)

In [None]:
#checking for null values
print(art1.shape[0] - art1.count())
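
The null check above subtracts the per-column count of non-null values from the row count. As a small sketch on invented data, the same numbers can be obtained more directly with `isna().sum()`; the frame `df` below is hypothetical and only reuses two column names from the notebook.

```python
import pandas as pd

# Toy frame with invented values; None marks a missing entry.
df = pd.DataFrame({
    "name": ["alphae radix", None, "betae herba"],
    "full_scientific_name": ["Genus alpha L.", "Genus beta Mill.", None],
})

by_subtraction = df.shape[0] - df.count()  # the form used in the notebook
by_isna = df.isna().sum()                  # direct count of missing values

print(by_isna)
print("same result:", by_subtraction.equals(by_isna))
```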

In [None]:
#dataframe created for plants names with non_scientific names used after merging with dataset containing scientific name(mpns_plant) and non scientific name(mpns_non_sci) of plants
art1_non_sci = art1[~art1['name'].isnull()] # plants with non - scientific name null is removed
art1_non_sci = art1_non_sci[~art1_non_sci['full_scientific_name'].isnull()] #plants with scientific name null is removed as TypeError: decoding to str: need a bytes-like object, float found is displayed due to null scientific names
print(art1_non_sci.shape[0] - art1_non_sci.count()) # checking for null values

In [None]:
# feature selection
pd.options.mode.chained_assignment = None # silence the chained-assignment warning raised by the in-place drops below (default is 'warn')
# dropping the following columns as they are not required for the mapping
art1_non_sci.drop(['genus_hybrid','species_hybrid','infra_species','parent_author'],axis ='columns',inplace = True)
# low-quality matches are ignored, as confirmed with the Kew Gardens team
art1_non_sci.drop(art1_non_sci.loc[art1_non_sci['quality_rating']=='L'].index, inplace = True)
print(art1_non_sci.shape[0] - art1_non_sci.count())
print(art1_non_sci.head())

In [None]:
#dataframe created for plants names with pharmaceutical names used after merging with scientific name
art1_pharm = (art1_non_sci.loc[art1_non_sci['name_type']== 'pharmaceutical'])
print(art1_pharm.head())

In [None]:
#checking for null values
print(art1_pharm.shape[0] - art1_pharm.count())

Checking for multiple occurrences of a pharmaceutical name

In [None]:
art1_pharm_duplicate =art1_pharm[art1_pharm.duplicated('name')]
print('Duplicated rows are ', art1_pharm_duplicate) # 920 rows are flagged as duplicates, which means some names occur more than once
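
A duplicated pharmaceutical name matters later because the search loop takes only the first matching scientific name with `.iloc[0]`. The sketch below, on invented rows, shows two ways to inspect such names: `keep=False` flags every row of a repeated name (the default flags only the second and later occurrences), and a `groupby` counts how many distinct scientific names sit behind each pharmaceutical name. The frame `pharm` and its values are hypothetical.

```python
import pandas as pd

# Invented rows: one pharmaceutical name mapped to two different plants.
pharm = pd.DataFrame({
    "name": ["alphae radix", "alphae radix", "betae herba"],
    "full_scientific_name": ["Genus alpha L.", "Genus gamma Mill.", "Genus beta Mill."],
})

# Default: only the second and later occurrences are flagged.
later_only = pharm[pharm.duplicated("name")]
# keep=False: every row whose name occurs more than once is flagged.
all_rows = pharm[pharm.duplicated("name", keep=False)]

# Number of distinct scientific names behind each pharmaceutical name.
ambiguity = pharm.groupby("name")["full_scientific_name"].nunique()

print(len(later_only), len(all_rows))
print(ambiguity[ambiguity > 1])
```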

Checking for single occurrences (unique pharmaceutical names)

In [None]:
art1_pharm_unique = art1_pharm.name.unique()


When the first 100 rows were extracted, the code threw `AttributeError: 'NoneType' object has no attribute 'get_text'` for article ID 9495358 in 14th position, because no text was retrieved from that article.
When the last 100 rows were extracted, the same AttributeError was thrown for article ID 9476632 in 2nd position, so execution stopped at the second article.
Both articles are available when checked in PubMed:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9495358/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9476632/
However, the link used to fetch the XML from PubMed does not retrieve these articles:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=9495358
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=9476632

When at most 1 article was retrieved for each of the top 100 pharmaceutical names, the term in 13th position, with article ID 9495358, threw the AttributeError.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=9495358

Hence the top 10 pharmaceutical names can be retrieved for the study when retmax is kept at 50.

This is also due to a limitation of the code, which only handles articles that have both a body tag and an abstract tag.
Hence the single-occurrence pharmaceutical name search is run for each of the following selected pharmaceutical names, to obtain statistics on the number of articles that give the correct scientific name:
abelmoschi corolla, angelicae radix pulverata, ziziphi spinosae semen, hibisci radix, achillea millefolium, achyranthis bidentatae radix, aconiti radix cocta, actinidiae fructus, acori tatarinowii rhizoma, benzoe tonkinensis, boehmeriae radix
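
The limitation described above (articles without a body or abstract tag) can be handled by checking each section before reading its text. The following is a minimal sketch under stated assumptions, not the notebook's implementation: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra packages, and the two XML records, the helper names `section_text` and `name_flags`, and the names "alphae radix" and "Genus alpha" are all invented for illustration. Real PMC records are much larger and may need further handling.

```python
import xml.etree.ElementTree as ET


def section_text(root, tag):
    """Return the text of the first <tag> element, or None if it is absent."""
    node = root.find(f".//{tag}")
    return None if node is None else "".join(node.itertext())


def name_flags(xml_text, pharm_name, sci_name):
    """Report where each name occurs; None means the section is missing.

    The pharmaceutical name is matched case-insensitively and the
    scientific name case-sensitively.
    """
    root = ET.fromstring(xml_text)
    flags = {}
    for tag in ("abstract", "body"):
        text = section_text(root, tag)
        if text is None:
            flags[f"pharm_{tag}"] = None
            flags[f"sci_{tag}"] = None
        else:
            flags[f"pharm_{tag}"] = pharm_name.lower() in text.lower()
            flags[f"sci_{tag}"] = sci_name in text
    return flags


# Invented minimal records: one complete, one without a <body>.
complete = (
    "<article><front><abstract><p>Effects of alphae radix.</p></abstract></front>"
    "<body><p>Genus alpha roots were used.</p></body></article>"
)
abstract_only = (
    "<article><front><abstract><p>Only an abstract.</p></abstract></front></article>"
)

print(name_flags(complete, "alphae radix", "Genus alpha"))
print(name_flags(abstract_only, "alphae radix", "Genus alpha"))
```

Returning `None` for a missing section keeps "section absent" distinct from "name absent", so such articles can be counted separately instead of raising an AttributeError.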

In [None]:
art1_pharm_head = art1_pharm.head(15) # first 15 rows taken from the dataset

In [None]:
print(art1_pharm_head['name'])

In [None]:
#selecting range
art1_pharm_head = art1_pharm.iloc[506:507] # rows selected by position; the end index is exclusive, so this takes the single row at position 506 (earlier progress note: last update 401)

In [None]:

#phram_name = art1_pharm['name']
#phram_name =   ['epilobii herba','epimedii wushanensis folium','equiseti hiemalis herba'] #  'epilobii herba','epimedii wushanensis folium', 
import time
phram_name = art1_pharm_head['name']
#phram_name =   ['abelmosichi corolla','epimedii wushanensis folium','equiseti hiemalis herba'] #use if providing single name or a small list of names
 
for j in phram_name:
    
    term_pharm = f"{j}"
    #print(term_pharm)
    scientific_name_pharm =(art1_pharm.loc[art1_pharm['name']== term_pharm,'full_scientific_name']).iloc[0] # first full scientific name that corresponds to the pharmaceutical name
    scientific_name_pharm = scientific_name_pharm.strip('. ') # strip leading and trailing '.' and space characters
    len_sci_pharm= len(scientific_name_pharm ) # length of the scientific name for the given pharmaceutical name
    len_term_pharm= len(term_pharm ) # length of the given pharmaceutical name
    time.sleep(0.5)
    handle = Entrez.esearch(db ="pmc", term= term_pharm,retmax= "500") # search and retrieve at most 500 article IDs for each pharmaceutical name
    rec_list = Entrez.read(handle)
    handle.close()
    #print(rec_list['Count'])
    Total_article = (rec_list['Count'])
    #print(len(rec_list['IdList']))
    Ret_max_val = (len(rec_list['IdList']))
    #print(rec_list['IdList'])
    total_id = rec_list['IdList']
    #print("scientific name of term "+term_pharm+" is :",scientific_name_pharm )
    
    no_pharm_1 = 0 #count number of articles with pharmaceutical name in the body of the article
    no_pharm_2 = 0 #count number of articles with pharmaceutical name in the abstract of the article
    no_sci_1 = 0 #count number of articles with scientific name in the body of the article
    no_sci_2 = 0 #count number of articles with scientific name in the abstract of the article
    no_body = 0  # count of articles skipped because the body or the abstract is missing
    Tot_extrac =0 # actual number of articles retrieved and checked
    
    for id in total_id:
      time.sleep(0.5)
      handle = Entrez.efetch(db='pmc', id = id , retmode = 'xml')
      total_content =  handle.read()
      #print("Entire text in the article id",id)
      soup = BeautifulSoup(total_content,"html.parser")
      abstracts = soup.find('abstract')#find the tag named 'abstract'
      body = soup.find('body')#find the tag named 'body'
      
      try:
        body_text = body.get_text()
        body_term_text_pharm_1 = (body.get_text()).lower()# entire text converted to lower case        
        #print(body_text) #print entire body of the article
        sci_name_body= body_text.find(scientific_name_pharm )#gets the position or the starting index of the word
        abstract_text = abstracts.get_text()
        #print(abstract_text) #print entire abstract of the article
        sci_name_abstract= abstract_text.find(scientific_name_pharm )#gets the position or the starting index of the word
      
        term_name_body= body_term_text_pharm_1.find(term_pharm )#gets the position or the starting index of the word

        abstract_term_text_pharm_1 = (abstracts.get_text()).lower()# entire text converted to lower case
        term_name_abstract= abstract_term_text_pharm_1.find(term_pharm)#gets the position or the starting index of the word
      
        #Checking for pharmaceutical name in the body of article
        extract_body_term_pharm = body_term_text_pharm_1[term_name_body:term_name_body+ len_term_pharm] #the pharmaceutical name is extracted from the body of the article using string slicing
        print("pharmaceutical name",extract_body_term_pharm)
      
      
        if extract_body_term_pharm == term_pharm : # verify that the text sliced from the body matches the pharmaceutical name
          #print("pharmaceutical name is present in body of the article :",id)
          no_pharm_1 +=1
        else:
      
          print("pharmaceutical name is not present in body of the article :",id)

        #Checking for pharmaceutical name in the abstract of article
        extract_abstract_term_pharm = abstract_term_text_pharm_1[term_name_abstract:term_name_abstract+ len_term_pharm] # the pharmaceutical name is extracted from the abstract of the article using string slicing
        print("pharmaceutical name",extract_abstract_term_pharm)
      
      
        if extract_abstract_term_pharm == term_pharm : # verify that the text sliced from the abstract matches the pharmaceutical name
          #print("pharmaceutical name is present in abstract of the article :",id)
          no_pharm_2 +=1
        else:
      
          print("pharmaceutical name is not present in abstract of the article :",id)     


        #Checking for scientific name in the body of article
        extract_body_sci_pharm = body_text[sci_name_body:sci_name_body+ len_sci_pharm] #the scientific name is extracted from the body of the article using string slicing
        #print("scientific name",extract_body_sci_pharm)
      
        if extract_body_sci_pharm == scientific_name_pharm : # verifying if the scientific name in the body of the article matches with actual scientific name of the plant
          #print("Scientific name is present in body of the article :",id)
          no_sci_1 +=1
        else:
      
          print("Scientific name is not present in body of the article :",id)

        #Checking for scientific name in the abstract of the article
        extract_abstract_sci_pharm = abstract_text[sci_name_abstract:sci_name_abstract+ len_sci_pharm] #the scientific name is extracted from the abstract of the article using string slicing
        #print("scientific name",extract_abstract_sci_pharm)
      
        if extract_abstract_sci_pharm == scientific_name_pharm : # verifying if the scientific name in the abstract of the article matches with actual scientific name of the plant
          #print("Scientific name is present in abstract of the article :",id)
          no_sci_2 +=1
        else:      
          print("Scientific name is not present in  abstract of the article :",id)  
        
        Tot_extrac +=1

      except AttributeError:
        no_body +=1
        continue
    df_ex_pharm = pd.DataFrame({'Pharmaceutical_name':[term_pharm],'Scientific_name':[scientific_name_pharm],'pharm_body': [no_pharm_1],'pharm_abstract':[no_pharm_2],'sci_body':[no_sci_1],'sci_abstract':[no_sci_2],"Tot_No_PubMed_article": [Total_article],"Retrive_max_value": [Ret_max_val],"No_article_No_body": [no_body],"Actual_article_retreived":[Tot_extrac]})
    #with pd.ExcelWriter("Pharm_result.xlsx",mode="a",engine="openpyxl",if_sheet_exists="overlay") as writer:
    #df_ex_pharm.to_excel(writer, sheet_name="Pharm_Output",header=None, startrow=writer.sheets["Pharm_Output"].max_row,index=True,index_label="No.")
    writer = pd.ExcelWriter('Pharm_result.xlsx', engine ='openpyxl',mode ='a')#,if_sheet_exists="overlay"  #,if_sheet_exists="replace"
    writer.book = load_workbook('Pharm_result.xlsx')
    writer.sheets = dict((ws.title,ws) for ws in writer.book.worksheets)
    reader = pd.read_excel(r'Pharm_result.xlsx')
    df_ex_pharm.to_excel(writer,index= True,index_label="No.",header = False,sheet_name="Pharm_Output",startrow = len(reader)+1)
    #df_ex_pharm.to_excel(writer,index= True,index_label="No.",sheet_name="Pharm_Output",header = False,startrow=writer.sheets["Pharm_Output"].max_row )
    writer.save()
    writer.close()
        

#print("No of times pharmaceutical name "+term_pharm+ " appeared in article is : ",no_pharm_1+no_pharm_2)  
#print("No of times Scientific name "+scientific_name_pharm+" appeared ",no_sci_1+no_sci_2)
#print("No of times articles without body ",no_body)

"""
df_ex_pharm = pd.DataFrame({'Pharmaceutical_name':[term_pharm],'Scientific_name':[scientific_name_pharm],'pharm_body': [no_pharm_1],'pharm_abstract':[no_pharm_2],'sci_body':[no_sci_1],'sci_abstract':[no_sci_2],"Tot_No_PubMed_article": [Total_article],"Retrive_max_article": [Ret_max_val],"No_article_No_body": [no_body]})
writer.book = load_workbook('Pharm_result.xlsx')
writer.sheets = dict((ws.title,ws) for ws in writer.book.worksheets)
df_ex_pharm.to_excel(writer,index= True,index_label="No.",sheet_name="Pharm_Output", )
writer.save()
writer.close()
"""
"""
      if(no_pharm_1!=0):
        writer.book = load_workbook('Pharm_result.xlsx')
      df_ex_pharm.to_excel(writer,index= False,index_label="No.",sheet_name="Pharm_Output" )
      #no_pharm_1 = no_pharm_1 + 1
      writer.save()
      if(no_pharm_2!=0):
        writer.book = load_workbook('Pharm_result.xlsx')
      df_ex_pharm.to_excel(writer,index= True,index_label="No.",sheet_name="Pharm_Output")      
      #no_pharm_2 = no_pharm_2 + 1
      writer.save()      
      if(no_sci_1!=0):
        writer.book = load_workbook('Pharm_result.xlsx')
      df_ex_pharm.to_excel(writer,index= True,index_label="No.",sheet_name="Pharm_Output" )
      #no_sci_1 = no_sci_1 + 1
      writer.save()      
      if(no_sci_2!=0):
        writer.book = load_workbook('Pharm_result.xlsx')
      df_ex_pharm.to_excel(writer,index= True,index_label="No.",sheet_name="Pharm_Output")
      #no_sci_2 = no_sci_2 + 1
      writer.save()
      if(no_body!=0):
        writer.book = load_workbook('Pharm_result.xlsx')
      df_ex_pharm.to_excel(writer,index= True,index_label="No.",sheet_name="Pharm_Output" )
      #no_body = no_body + 1
      
      writer.save()
      writer.close()
      

"""


"""
    #print(df_ex_pharm)
    #df_ex_pharm.to_excel('Pharm_result.xlsx',index= True,index_label="No.",sheet_name="Pharm_Output")
    writer = pd.ExcelWriter('Pharm_result_2.xlsx', engine ='openpyxl') 
    writer.book = load_workbook('Pharm_result_2.xlsx') #try to open an existing workbook

    writer.sheets = dict((ws.title,ws) for ws in writer.book.worksheets)
    #read existing file
    reader = pd.read_excel(r'Pharm_result_2.xlsx')
    #write out the new sheet
    df_ex_pharm.to_excel(writer,index= True,index_label="No.",header = False,sheet_name="Pharm_Output",startrow = len(reader)+1)
"""

"""
    df_ex_pharm = pd.DataFrame({'Pharmaceutical_name':[term_pharm],'Scientific_name':[scientific_name_pharm],'pharm_body': [no_pharm_1],'pharm_abstract':[no_pharm_2],'sci_body':[no_sci_1],'sci_abstract':[no_sci_2],"Tot_No_PubMed_article": [Total_article],"Retrive_max_article": [Ret_max_val],"No_article_No_body": [no_body]})
    writer = pd.ExcelWriter('Pharm_result_3.xlsx', engine ='xlsxwriter',mode ='A')
    # Convert the dataframe to an XlsxWriter Excel object.
    df_ex_pharm.to_excel(writer,index= True,index_label="No.", sheet_name='Pharm_Output')    

    # Close the Pandas Excel writer and output the Excel file.
    writer.save()
 
"""
"""
      df_ex_pharm = pd.DataFrame({'Pharmaceutical_name':[term_pharm],'Scientific_name':[scientific_name_pharm],'pharm_body': [no_pharm_1],'pharm_abstract':[no_pharm_2],'sci_body':[no_sci_1],'sci_abstract':[no_sci_2],"Tot_No_PubMed_article": [Total_article],"Retrive_max_article": [Ret_max_val],"No_article_No_body": [no_body]})
    
   
      writer = pd.ExcelWriter('Pharm_result.xlsx', engine ='openpyxl')

      
      df_ex_pharm.to_excel('Pharm_result.xlsx',index= True,index_label="No.",sheet_name="Pharm_Output")
      writer.book = load_workbook('Pharm_result.xlsx')
      writer.sheets = dict((ws.title, ws) for ws in writer.book.worksheets)
      # Convert the dataframe to an XlsxWriter Excel object.
      #df_ex_pharm.to_excel(writer,index= True,index_label="No.", sheet_name='Pharm_Output')   
      reader = pd.read_excel(r'Pharm_result.xlsx') 
      df_ex_pharm.to_excel(writer,index= True,index_label="No.",header = False,sheet_name="Pharm_Output",startrow = len(reader)+1)

      # Close the Pandas Excel writer and output the Excel file.
      writer.save()
      writer.close()
"""
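
The attempts above all append one result row per pharmaceutical name to an existing Excel workbook. As a hedged alternative, and explicitly a swapped-in technique rather than the method used in this notebook, the rows could be appended to a CSV file with the standard library and converted to Excel once at the end. The helper `append_result`, the shortened `FIELDS` list, the temporary path and the example values are all hypothetical.

```python
import csv
import os
import tempfile

# Shortened, illustrative subset of the result columns used in the notebook.
FIELDS = ["Pharmaceutical_name", "Scientific_name", "pharm_body", "sci_body"]


def append_result(path, row):
    """Append one result row, writing the header only when the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# Demonstration in a temporary directory with invented values.
demo_path = os.path.join(tempfile.mkdtemp(), "Pharm_result.csv")
append_result(demo_path, {"Pharmaceutical_name": "alphae radix",
                          "Scientific_name": "Genus alpha L",
                          "pharm_body": 3, "sci_body": 1})
append_result(demo_path, {"Pharmaceutical_name": "betae herba",
                          "Scientific_name": "Genus beta Mill",
                          "pharm_body": 0, "sci_body": 2})

with open(demo_path, newline="", encoding="utf-8") as fh:
    rows = list(csv.DictReader(fh))
print(len(rows), [r["Pharmaceutical_name"] for r in rows])
```

Because each call opens, writes and closes the file, a run that stops partway still leaves the rows written so far on disk.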


# Ignore all code from this section onwards
It contains trial versions of the code.

In [None]:
#This is the first trial code for ignoring articles without body

#phram_name = art1_pharm['name']
phram_name =   ['echinaceae pallidae herba'] #'epilobii herba','epimedii wushanensis folium'

for j in phram_name:
    
    term_pharm = f"{j}"
    print(term_pharm)
    scientific_name_pharm =(art1_pharm.loc[art1_pharm['name']== term_pharm,'full_scientific_name']).iloc[0] #full scientific name corresponding to pharmaceutical name is selected
    scientific_name_pharm = scientific_name_pharm.strip('. ')# strip all '.' from the beginning and end of string
    len_sci_pharm= len(scientific_name_pharm ) #calculating the length of the scientific name for the given pharmaceutical name
    len_term_pharm= len(term_pharm ) #calculating the length of the for the given pharmaceutical name
    handle = Entrez.esearch(db ="pmc", term= term_pharm,retmax= "50")# search and retrieve max 50 article id for each pharmaceutical name
    rec_list = Entrez.read(handle)
    handle.close()
    print(rec_list['Count'])
    Total_article = (rec_list['Count'])
    print(len(rec_list['IdList']))
    Ret_max_val = (len(rec_list['IdList']))
    print(rec_list['IdList'])
    total_id = rec_list['IdList']
    print("scientific name of term "+term_pharm+" is :",scientific_name_pharm )
    
    no_pharm_1 = 0
    no_pharm_2 = 0
    no_sci_1 = 0
    no_sci_2 = 0
 
    
    writer = pd.ExcelWriter('Pharm_result.xlsx', engine ='openpyxl')
    

    for id in total_id:
      handle = Entrez.efetch(db='pmc', id = id , retmode = 'xml')
      total_content =  handle.read()
      print("Entire text in the article id",id)
      soup = BeautifulSoup(total_content,"html.parser")
      abstracts = soup.find('abstract')#find the tag named 'abstract'
      body = soup.find('body')#find the tag named 'body'
      no_body = 0
      try:
        body_text = body.get_text()
        body_term_text_pharm_1 = (body.get_text()).lower()# entire text converted to lower case        
        #print(body_text) #print entire body of the article
      except AttributeError:
        no_body +=1
        pass
        
      sci_name_body= body_text.find(scientific_name_pharm )#gets the position or the starting index of the word
      abstract_text = abstracts.get_text()
      #print(abstract_text) #print entire abstract of the article
      sci_name_abstract= abstract_text.find(scientific_name_pharm )#gets the position or the starting index of the word
      
      term_name_body= body_term_text_pharm_1.find(term_pharm )#gets the position or the starting index of the word

      abstract_term_text_pharm_1 = (abstracts.get_text()).lower()# entire text converted to lower case
      term_name_abstract= abstract_term_text_pharm_1.find(term_pharm)#gets the position or the starting index of the word
      
      #Checking for pharmaceutical name in the body of article
      extract_body_term_pharm = body_term_text_pharm_1[term_name_body:term_name_body+ len_term_pharm] #the pharmaceutical name is extracted from the body of the article using string slicing
      print("pharmaceutical name",extract_body_term_pharm)
      
      
      if extract_body_term_pharm == term_pharm : # verifying if the pharmaceutical name in the body of the article matches with actual scientific name of the plant
        print("pharmaceutical name is present in body of the article :",id)
        no_pharm_1 +=1
      else:
      
        print("pharmaceutical name is not present in body of the article :",id)

      #Checking for pharmaceutical name in the abstract of article
      extract_abstract_term_pharm = abstract_term_text_pharm_1[term_name_abstract:term_name_abstract+ len_term_pharm] #the pharmaceutical name is extracted from the body of the article using string slicing
      print("pharmaceutical name",extract_abstract_term_pharm)
      
      
      if extract_abstract_term_pharm == term_pharm : # verifying if the pharmaceutical name in the body of the article matches with actual scientific name of the plant
        print("pharmaceutical name is present in abstract of the article :",id)
        no_pharm_2 +=1
      else:
      
        print("pharmaceutical name is not present in abstract of the article :",id)     


      #Checking for scientific name in the body of article
      extract_body_sci_pharm = body_text[sci_name_body:sci_name_body+ len_sci_pharm] #the scientific name is extracted from the body of the article using string slicing
      print("scientific name",extract_body_sci_pharm)
      
      if extract_body_sci_pharm == scientific_name_pharm : # verifying if the scientific name in the body of the article matches with actual scientific name of the plant
        print("Scientific name is present in body of the article :",id)
        no_sci_1 +=1
      else:
      
        print("Scientific name is not present in body of the article :",id)

      #Checking for scientific name in the abstract of the article
      extract_abstract_sci_pharm = abstract_text[sci_name_abstract:sci_name_abstract+ len_sci_pharm] #the scientific name is extracted from the abstract of the article using string slicing
      print("scientific name",extract_abstract_sci_pharm)
      
      if extract_abstract_sci_pharm == scientific_name_pharm : # verifying if the scientific name in the abstract of the article matches with actual scientific name of the plant
        print("Scientific name is present in abstract of the article :",id)
        no_sci_2 +=1
      else:      
        print("Scientific name is not present in  abstract of the article :",id)  

      print("No of times pharmaceutical name "+term_pharm+ " appeared in article is : ",no_pharm_1+no_pharm_2)  
      print("No of times Scientific name "+scientific_name_pharm+" appeared ",no_sci_1+no_sci_2)
      print("No of times articles without body ",no_body)

      #change_excel = pd.DataFrame([no_pharm_1],[no_pharm_2],[no_sci_1],[no_sci_2]],columns=['pharm_body','pharm_abstract','sci_body','sci_abstract'])
      #print(change_excel)

      df_ex_pharm = pd.DataFrame({'Pharmaceutical_name':[term_pharm],'Scientific_name':[scientific_name_pharm],'pharm_body': [no_pharm_1],'pharm_abstract':[no_pharm_2],'sci_body':[no_sci_1],'sci_abstract':[no_sci_2],"Tot_No_PubMed_article": [Total_article],"Retrive_max_article": [Ret_max_val],"No_article_No_body": [no_body]})
      writer.book = load_workbook('Pharm_result.xlsx')
      writer.sheets = dict((ws.title,ws) for ws in writer.book.worksheets)
      df_ex_pharm.to_excel(writer,index= True,index_label="No.",sheet_name="Pharm_Output" )
      writer.save()
      writer.close()



⛔

Looking for articles having pharmaceutical names without reference to a scientific name

In [None]:
# taking pharmaceutical name and checking for scientific names in the articles    

term = 'ephedra radix'   # set this to each term in turn from the list: abelmoschi corolla, angelicae radix pulverata, ziziphi spinosae semen, hibisci radix, achillea millefolium, achyranthis bidentatae radix, aconiti radix cocta, actinidiae fructus, acori tatarinowii rhizoma, benzoe tonkinensis, boehmeriae radix
print(term)
handle = Entrez.esearch(db ="pmc", term= term,retmax= "338") # retmax is adjusted for each pharmaceutical name. If a run throws an error, the failing article ID is skipped and the number of articles processed up to that point is used as the retmax value for a successful run.
rec_list = Entrez.read(handle)
handle.close()
print(rec_list['Count']) # total number of article IDs that contain the given pharmaceutical name
print(len(rec_list['IdList'])) # number of article IDs retrieved; if more articles match than the retmax value, only retmax IDs are returned
total_id = rec_list['IdList']
print('The article ids corresponding to the given pharmaceutical name are :' ,total_id) # the article IDs that contain the given pharmaceutical name
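
When more articles match than a single `retmax` allows, the E-utilities `retstart` parameter can be combined with `retmax` to page through the ID list. The helper below is a minimal sketch that only computes the offsets; the function name `esearch_pages` and the page size of 100 are assumptions made for illustration, and no request is sent.

```python
def esearch_pages(total_count, page_size):
    """Return the (retstart, retmax) pairs needed to cover total_count IDs."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return [
        (start, min(page_size, total_count - start))
        for start in range(0, total_count, page_size)
    ]


# Example: 338 matching articles fetched in pages of 100.
pages = esearch_pages(338, 100)
print(pages)
```

Each pair would then be passed to the search call, for example `Entrez.esearch(db="pmc", term=term, retstart=start, retmax=size)`, with a pause between requests as in the main loop. That call is not executed here because it needs network access.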

In [None]:
scientific_name_pharm =(art1_pharm.loc[art1_pharm['name']== term,'full_scientific_name']).iloc[0] # first full scientific name that corresponds to the pharmaceutical name
scientific_name_pharm = scientific_name_pharm.strip('. ') # strip leading and trailing '.' and space characters
len_sci_pharm= len(scientific_name_pharm ) # length of the scientific name for the given pharmaceutical name
len_term_pharm= len(term ) # length of the given pharmaceutical name

In [None]:
print(scientific_name_pharm)

In [None]:
#extracting and checking for each article id corresponding to the given term if the correct scientific name is provided or not
#total_id =  ['9549772', '9492967', '9234258', '9206260', '9476726', '9404188', '9377893', '9376358', '9350179', '9135524', '8966664', '8901378', '9011225', '8861989', '8844433', '8759103', '8691044', '8485712', '8352711', '9009852', '8190277', '7964117', '7939017', '7801274', '7821280', '7735863', '7592451', '7556113', '9009856', '9009854', '7416080', '7391051', '7367001', '7331568', '7328533', '7319939', '7296303', '7273822', '7227203', '7159848', '7129866', '7104236', '7118592', '6521559', '6452558', '6304600', '6178163', '3776539', '3679842', '7126446']
total_id =  ['8165967',  '2521810','402576']


#


for id in total_id:
    handle = Entrez.efetch(db='pmc', id = id , retmode = 'xml')
    total_content =  handle.read()
    print("Entire text in the article id",id)
    #print(total_content) # print the entire content of the article in html
    soup = BeautifulSoup(total_content,"html.parser")
    abstracts = soup.find('abstract')  # first tag named 'abstract', or None if the article has none
    body = soup.find('body')  # first tag named 'body', or None if the full text is not available
    body_text = body.get_text() if body is not None else ''  # fall back to an empty string so the checks below still run
    #print(body_text) #print entire body of the article
    sci_name_body = body_text.find(scientific_name_pharm)  # start index of the scientific name, or -1 if it is absent
    abstract_text = abstracts.get_text() if abstracts is not None else ''
    #print(abstract_text) #print entire abstract of the article
    sci_name_abstract = abstract_text.find(scientific_name_pharm)  # start index of the scientific name, or -1 if it is absent

    body_term_text_pharm_1 = body_text.lower()  # body text converted to lower case for case-insensitive matching
    term_name_body = body_term_text_pharm_1.find(term.lower())  # start index of the pharmaceutical name, or -1 if it is absent

    abstract_term_text_pharm_1 = abstract_text.lower()  # abstract text converted to lower case
    term_name_abstract = abstract_term_text_pharm_1.find(term.lower())  # start index of the pharmaceutical name, or -1 if it is absent

    # Checking for the pharmaceutical name in the body of the article
    extract_body_term_pharm = body_term_text_pharm_1[term_name_body:term_name_body + len_term_pharm]  # candidate match sliced out of the lower-cased body text
    print("pharmaceutical name", extract_body_term_pharm)
    if extract_body_term_pharm == term.lower():  # the slice equals the pharmaceutical name only when find() located it
      print("pharmaceutical name is present in body of the article :", id)
    else:
      print("pharmaceutical name is not present in body of the article :", id)

    # Checking for the pharmaceutical name in the abstract of the article
    extract_abstract_term_pharm = abstract_term_text_pharm_1[term_name_abstract:term_name_abstract + len_term_pharm]  # candidate match sliced out of the lower-cased abstract text
    print("pharmaceutical name", extract_abstract_term_pharm)
    if extract_abstract_term_pharm == term.lower():  # the slice equals the pharmaceutical name only when find() located it
      print("pharmaceutical name is present in abstract of the article :", id)
    else:
      print("pharmaceutical name is not present in abstract of the article :", id)


    # Checking for the scientific name in the body of the article
    extract_body_sci_pharm = body_text[sci_name_body:sci_name_body + len_sci_pharm]  # candidate match sliced out of the body text
    print("scientific name", extract_body_sci_pharm)
    if extract_body_sci_pharm == scientific_name_pharm:  # the slice equals the scientific name only when find() located it
      print("Scientific name is present in body of the article :", id)
    else:
      print("Scientific name is not present in body of the article :", id)

    # Checking for the scientific name in the abstract of the article
    extract_abstract_sci_pharm = abstract_text[sci_name_abstract:sci_name_abstract + len_sci_pharm]  # candidate match sliced out of the abstract text
    print("scientific name", extract_abstract_sci_pharm)
    if extract_abstract_sci_pharm == scientific_name_pharm:  # the slice equals the scientific name only when find() located it
      print("Scientific name is present in abstract of the article :", id)
    else:
      print("Scientific name is not present in abstract of the article :", id)
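
The four find-and-slice checks in the loop above amount to substring membership tests. A minimal sketch of the same logic as a reusable helper, assuming the abstract and body text have already been extracted as plain strings; the function names `name_presence` and `lacks_scientific_name` and the sample sentences are hypothetical and not part of the notebook:

```python
def name_presence(abstract_text, body_text, term, scientific_name):
    """Report where the pharmaceutical term and the scientific name occur.

    The term is matched case-insensitively; the scientific name is matched
    exactly as written.
    """
    term_lower = term.lower()
    return {
        "term_in_abstract": term_lower in abstract_text.lower(),
        "term_in_body": term_lower in body_text.lower(),
        "sci_in_abstract": scientific_name in abstract_text,
        "sci_in_body": scientific_name in body_text,
    }


def lacks_scientific_name(presence):
    """True when the term is used somewhere but the scientific name appears nowhere."""
    uses_term = presence["term_in_abstract"] or presence["term_in_body"]
    has_sci = presence["sci_in_abstract"] or presence["sci_in_body"]
    return uses_term and not has_sci


# Made-up sample text, for illustration only.
sample = name_presence(
    abstract_text="Effects of Radix Ginseng extract on fatigue.",
    body_text="Radix ginseng was purchased from a local supplier.",
    term="radix ginseng",
    scientific_name="Panax ginseng C.A.Mey",
)
print(sample)
print(lacks_scientific_name(sample))
```

Returning booleans instead of printing makes it straightforward to collect, for each pharmaceutical name, the article IDs where `lacks_scientific_name` is true.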


In [None]:
""" 
# Code to loop over every pharmaceutical name and retrieve the corresponding article IDs
pharm_name = art1_pharm['name']
term_pharm = ''  # search term built from each pharmaceutical name

for j in pharm_name:
    term_pharm = f"{j}[name]"
    print(term_pharm)
    handle = Entrez.esearch(db="pmc", term=term_pharm, retmax="50")  # search and retrieve at most 50 article IDs for each pharmaceutical name
    rec_list = Entrez.read(handle)
    handle.close()
    print(rec_list['Count'])  # total number of matching articles
    print(len(rec_list['IdList']))  # number of IDs actually returned (capped by retmax)
    print(rec_list['IdList'])
    total_id = rec_list['IdList']  # overwritten on every iteration, so it ends up holding the IDs for the last name only
"""

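The loop in the cell above reassigns `total_id` on each pass, so only the last name's IDs survive. One possible way to keep the IDs for every name is to store them in a dictionary. This is a sketch under assumptions: `collect_ids` and `pmc_search` are hypothetical helpers, the IDs in the demonstration are made up, and the search function is passed in so the collection logic can be tried without network access.

```python
def collect_ids(names, search_fn):
    """Map each name to the list of article IDs returned by search_fn(name)."""
    ids_by_name = {}
    for name in names:
        ids_by_name[name] = list(search_fn(name))
    return ids_by_name


def pmc_search(term, retmax=50):
    """Query PMC with Biopython's Entrez.esearch (needs network access and Entrez.email)."""
    from Bio import Entrez  # imported lazily so collect_ids can be used without Biopython
    handle = Entrez.esearch(db="pmc", term=term, retmax=retmax)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return record["IdList"]


# Offline demonstration with a stand-in search function and made-up IDs.
fake_index = {"radix ginseng": ["111", "222"], "cortex cinnamomi": []}
result = collect_ids(fake_index, lambda name: fake_index[name])
print(result)
```

In the notebook, `collect_ids(art1_pharm['name'], pmc_search)` would be the corresponding call; any field tag, such as the `[name]` suffix used in the cell above, would have to be added to each term before it is passed on.
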
In [None]:
"""
# Code to retrieve article IDs when the search criteria combine two or more parameters, such as genus and species
pharm_gen = art1_pharm['genus'].tolist()  # convert the genus column to a list
pharm_spe = art1_pharm['species'].tolist()  # convert the species column to a list
term = ''  # search term built from each genus/species pair
for (i, j) in zip(pharm_gen, pharm_spe):
    term = f"{i}[genus] AND {j}[species]"
    print(term)
    handle = Entrez.esearch(db="pmc", term=term, retmax="50")  # search and retrieve at most 50 article IDs for each pair
    rec_list = Entrez.read(handle)
    handle.close()
    print(rec_list['Count'])  # total number of matching articles
    print(len(rec_list['IdList']))  # number of IDs actually returned (capped by retmax)
    print(rec_list['IdList'])
    total_id = rec_list['IdList']  # overwritten on every iteration, so it ends up holding the IDs for the last pair only

"""

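When the genus and species columns are read with pandas, empty cells arrive as `NaN`, which the f-string in the cell above would turn into the literal text `nan[species]`. A small sketch of a query builder that skips missing parts; `build_query` is a hypothetical helper, and the `[genus]` and `[species]` tags are copied from the cell above without checking whether PMC recognises them as search fields.

```python
def build_query(genus, species):
    """Combine genus and species into one query string, skipping missing values.

    The [genus] and [species] tags are taken from the notebook cell, not
    verified against the PMC search field list.
    """
    parts = []
    if isinstance(genus, str) and genus.strip():
        parts.append(f"{genus.strip()}[genus]")
    if isinstance(species, str) and species.strip():
        parts.append(f"{species.strip()}[species]")
    return " AND ".join(parts)


# Illustrative values; float("nan") stands in for an empty cell read by pandas.
genera = ["Panax", "Cinnamomum"]
species_list = ["ginseng", float("nan")]
queries = [build_query(g, s) for g, s in zip(genera, species_list)]
print(queries)  # ['Panax[genus] AND ginseng[species]', 'Cinnamomum[genus]']
```
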
**Reference:**
Entrez is the integrated search and retrieval system of the National Center for Biotechnology Information (NCBI). It provides access to NCBI databases, including PubMed, PubMed Central (PMC), and nucleotide and protein sequence data.

Entrez Programming Utilities user guide is available at : https://www.ncbi.nlm.nih.gov/books/NBK25501/