
PubMed is a free resource supporting the search and retrieval of biomedical and life sciences literature, with the aim of improving health both globally and personally.
The PubMed database contains more than 35 million citations and abstracts of biomedical literature. It does not include full-text journal articles; however, links to the full text are often present when it is available from other sources, such as the publisher's website or PubMed Central (PMC).
Available to the public online since 1996, PubMed was developed and is maintained by the National Center for Biotechnology Information (NCBI), at the U.S. National Library of Medicine (NLM), located at the National Institutes of Health (NIH).



*   Install the PubMed Python API package
*   Get abstracts
*   Clean texts
*   Save data


Install the package if it has not been installed already

In [1]:
 pip install metapub

Collecting metapub
  Downloading metapub-0.5.5.tar.gz (120 kB)
  Preparing metadata (setup.py) ... done
Collecting eutils (from metapub)
  Downloading eutils-0.6.0-py2.py3-none-any.whl (41 kB)
Collecting habanero (from metapub)
  Downloading habanero-1.2.3-py2.py3-none-any.whl (30 kB)
Collecting cssselect (from metapub)
  Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Collecting unidecode (from metapub)
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
Collecting docopt (from metapub)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collect

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('gdrive')
import os

Mounted at gdrive


In [4]:
data_dir = "/content/gdrive/My Drive/pubmed/"


In [5]:
# Fetch abstracts or titles for the articles that match the keywords
from metapub import PubMedFetcher


def get_pubmed_abs(keywords, num_article):
  """Return a DataFrame of PMIDs and abstracts for up to num_article hits."""
  fetch = PubMedFetcher()
  pmids = fetch.pmids_for_query(keywords, retmax=num_article)
  abstracts = {}
  for pmid in pmids:
    # one lookup per article; abstract is None when the record has none
    abstracts[pmid] = fetch.article_by_pmid(pmid).abstract
  return pd.DataFrame(list(abstracts.items()), columns=['pmid', 'Abstract'])

def get_pubmed_title(keywords, num_article):
  """Return a DataFrame of PMIDs and titles for up to num_article hits."""
  fetch = PubMedFetcher()
  pmids = fetch.pmids_for_query(keywords, retmax=num_article)
  titles = {}
  for pmid in pmids:
    titles[pmid] = fetch.article_by_pmid(pmid).title
  return pd.DataFrame(list(titles.items()), columns=['pmid', 'Title'])
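
Calling both helpers above runs the same search twice and looks up every article twice. A minimal sketch of a single-pass variant is shown here. The name `get_pubmed_records` is hypothetical, and the function takes the fetcher as an argument so that it can be tried with a stand-in object; it assumes only the `pmids_for_query`, `article_by_pmid`, `.title` and `.abstract` members already used in this notebook.

```python
def get_pubmed_records(fetch, keywords, num_article):
    """Collect pmid, title and abstract for each search hit in one pass.

    `fetch` is any object offering pmids_for_query() and article_by_pmid(),
    such as the PubMedFetcher instance used in this notebook.
    """
    records = []
    for pmid in fetch.pmids_for_query(keywords, retmax=num_article):
        article = fetch.article_by_pmid(pmid)  # one lookup per article
        records.append({
            "pmid": pmid,
            "Title": article.title,
            "Abstract": article.abstract,
        })
    return records
```

Passing the result to `pd.DataFrame(...)`, for example `pd.DataFrame(get_pubmed_records(PubMedFetcher(), keyword, num_article))`, should give one frame with `pmid`, `Title` and `Abstract` columns.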

In [6]:
keyword = "covid"
num_article = 300

In [7]:
df_abs = get_pubmed_abs(keyword, num_article)



In [8]:
df_abs.dropna(inplace = True)
df_abs.head()

| | pmid | Abstract |
|---|---|---|
| 0 | 37379638 | How is professional purpose impacted in the co... |
| 1 | 37379636 | This paper examines the daily practices of car... |
| 2 | 37379564 | Innovation in laboratory testing algorithms to... |
| 3 | 37379547 | Hamman's syndrome, or spontaneous pneumomedias... |
| 4 | 37379541 | The Muñiz hospital is an institution with hist... |


In [9]:
# Strip common section labels found in structured abstracts
for label in ("INTRODUCTION:", "IMPORTANCE:", "BACKGROUND:"):
  df_abs["Abstract"] = df_abs["Abstract"].str.replace(label, "", regex=False)
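
The cell above removes three specific labels, but structured abstracts often carry others such as "METHODS:" or "RESULTS:". The following is a sketch of a more general approach, not part of the original notebook. `strip_section_labels` and `SECTION_LABEL` are hypothetical names, and the pattern rests on the assumption that a label is a run of capital letters ending in a colon.

```python
import re

# Assumption: a section label is a run of capital letters (optionally with
# spaces, "&" or "/") followed by a colon, e.g. "METHODS:" or "MAIN OUTCOMES:".
SECTION_LABEL = re.compile(r"\b[A-Z][A-Z &/]{2,}:\s*")

def strip_section_labels(text):
    """Remove upper-case section labels wherever they occur in the text."""
    return SECTION_LABEL.sub("", text)

sample = "BACKGROUND: Masks reduce spread. METHODS: We surveyed 40 clinics."
print(strip_section_labels(sample))
# → Masks reduce spread. We surveyed 40 clinics.
```

The pattern would also remove an upper-case acronym that happens to be followed by a colon (for example "SARS:"), so it is worth checking a sample of the output before applying it to the whole column.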

In [10]:
df_abs.head()

| | pmid | Abstract |
|---|---|---|
| 0 | 37379638 | How is professional purpose impacted in the co... |
| 1 | 37379636 | This paper examines the daily practices of car... |
| 2 | 37379564 | Innovation in laboratory testing algorithms to... |
| 3 | 37379547 | Hamman's syndrome, or spontaneous pneumomedias... |
| 4 | 37379541 | The Muñiz hospital is an institution with hist... |


**Clean text: remove punctuation**

In [11]:
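def cleanup_text(text):
    import re
    # replace every character that is not an ASCII letter or digit with a space;
    # this covers punctuation and newlines, and also accented letters such as "ñ"
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    # collapse runs of spaces into a single space
    text = re.sub(r' +', ' ', text)
    return text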
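
Because `cleanup_text` keeps only ASCII letters and digits, a word such as "Muñiz" comes out as "Mu iz". If that matters for later analysis, one option is to transliterate accented letters before stripping. The sketch below is an illustration under that assumption; `cleanup_text_ascii` is a hypothetical name and the sample sentence is made up.

```python
import re
import unicodedata

def cleanup_text_ascii(text):
    """Like cleanup_text, but map accented letters to their base letter first."""
    # NFKD splits "ñ" into "n" plus a combining tilde; dropping the
    # combining marks leaves the plain base letter behind.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # From here on, the same two steps as cleanup_text.
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    return re.sub(r" +", " ", text)

print(cleanup_text_ascii("The Muñiz hospital, founded in 1904."))
```

Here "Muñiz" is kept as the single token "Muniz". As with `cleanup_text`, the final full stop becomes a trailing space, which `str.strip()` would remove if needed.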

In [12]:
df_abs["Abstract"] = df_abs["Abstract"].apply(cleanup_text)

**Save text data to a CSV file**

In [13]:
df_abs.to_csv(os.path.join(data_dir, "pubmed_abs.csv"), index = False)
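
Writing the file fails if the `pubmed` folder does not yet exist on the mounted drive. A small helper can create it first; this is a sketch, and `csv_path_in` is a hypothetical name rather than part of the notebook.

```python
import os

def csv_path_in(data_dir, filename):
    """Create data_dir if it is missing and return the full path for filename."""
    os.makedirs(data_dir, exist_ok=True)  # no error if it already exists
    return os.path.join(data_dir, filename)
```

It could then be used as `df_abs.to_csv(csv_path_in(data_dir, "pubmed_abs.csv"), index=False)`.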