**PubMed Insight Finder**

**Required Libraries**
- **requests**: For making HTTP requests to scrape data.
- **beautifulsoup4**: For parsing HTML content.
- **faiss-cpu**: For efficient vector similarity searches.
- **sentence-transformers**: For generating embeddings.
- **transformers**: For summarizing retrieved content.
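
Before running the later cells, it can help to confirm that the libraries above are importable. The snippet below is a minimal sketch: the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names, and the pip-name to module-name pairs are assumptions based on the imports used later in this notebook.

```python
from importlib.util import find_spec

# Map each pip package name to the module name it is imported under.
# This mapping is an assumption based on the imports used in this notebook.
REQUIRED = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "faiss-cpu": "faiss",
    "sentence-transformers": "sentence_transformers",
    "transformers": "transformers",
}

def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]

print("Missing packages:", missing_packages() or "none")
```

Anything listed as missing can then be installed with `!pip install <name>`.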


In [None]:
!pip install requests
!pip install beautifulsoup4
!pip install faiss-cpu
!pip install sentence-transformers
!pip install transformers

Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.9.0


**Step 2: Scrape Data from PubMed**

This step defines two functions:
1. **search_pubmed**: Searches PubMed for a query and returns the title and link of each matching article, enriched with the details fetched by the second function.
2. **get_detailed_article_data**: Retrieves detailed information, such as authors and abstracts, from each article's page.

I tested the scraper by querying PubMed for **diabetes** articles.
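
Both functions depend on well-formed URLs: the search URL must encode the query, and each result link has to be resolved against the site root. The sketch below shows one way to build both with the standard library. `build_search_url` and `build_article_url` are hypothetical helper names, and the `?term=` parameter and the `/35771962/` style of link are taken from the scraper cell and its output.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://pubmed.ncbi.nlm.nih.gov/"

def build_search_url(query):
    """Build a PubMed search URL with the query safely percent-encoded."""
    return f"{BASE_URL}?{urlencode({'term': query})}"

def build_article_url(href):
    """Resolve a result link such as '/35771962/' against the base URL."""
    return urljoin(BASE_URL, href)

print(build_search_url("type 2 diabetes & obesity"))
print(build_article_url("/35771962/"))
```

Using `urljoin` avoids a doubled slash when the base URL ends with `/` and the link starts with one, and `urlencode` handles characters such as `&` that a plain space-to-plus replacement would leave unescaped.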


In [None]:
from bs4 import BeautifulSoup
import requests
import time

def search_pubmed(query, max_results=10):
    base_url = 'https://pubmed.ncbi.nlm.nih.gov/'
    # Let requests URL-encode the query instead of replacing spaces by hand.
    response = requests.get(base_url, params={'term': query}, timeout=10)
    if response.status_code != 200:
        print("Failed to retrieve data from PubMed.")
        return []

    soup = BeautifulSoup(response.content, 'html.parser')
    articles = soup.find_all('article', class_='full-docsum', limit=max_results)

    results = []
    for article in articles:
        title_tag = article.find('a', class_='docsum-title')
        if title_tag is None:
            # Without a title link there is no article page to fetch.
            continue
        title = title_tag.text.strip()
        # The href already starts with '/', so trim the base URL's trailing slash.
        link = f"{base_url.rstrip('/')}{title_tag['href']}"

        article_data = get_detailed_article_data(link)
        results.append({
            'title': title,
            'link': link,
            'authors': article_data['authors'],
            'abstract': article_data['abstract'],
            # Reuse the 200-character preview built by get_detailed_article_data.
            'paragraph': article_data['paragraph']
        })
    return results

def get_detailed_article_data(article_url):
    time.sleep(1)  # Pause between requests to avoid overloading PubMed.
    response = requests.get(article_url, timeout=10)
    if response.status_code != 200:
        print(f"Failed to retrieve data from {article_url}")
        return {'authors': 'No Authors', 'abstract': 'No Abstract', 'paragraph': 'No Paragraph'}

    soup = BeautifulSoup(response.content, 'html.parser')

    authors_tag = soup.find('div', class_='authors')
    authors = ' '.join(authors_tag.text.split()) if authors_tag else 'No Authors'


    abstract_tag = soup.find('div', class_='abstract')
    abstract = ' '.join(abstract_tag.text.split()) if abstract_tag else 'No Abstract'

    paragraph = abstract[:200] + '...' if abstract != 'No Abstract' else 'No Paragraph'

    return {'authors': authors, 'abstract': abstract, 'paragraph': paragraph}

query = "diabetes"
articles = search_pubmed(query, max_results=5)

for article in articles:
    print(f"Title: {article['title']}")
    print(f"Link: {article['link']}")
    print(f"Authors: {article['authors']}")
    print(f"Abstract: {article['abstract']}\n")
    print(f"Paragraph: {article['paragraph']}\n")


Title: Diagnosis and Management of Central Diabetes Insipidus in Adults.
Link: https://pubmed.ncbi.nlm.nih.gov//35771962/
Authors: Maria Tomkins 1 , Sarah Lawless 1 , Julie Martin-Grace 1 , Mark Sherlock 1 , Chris J Thompson 1
Abstract: Abstract Central diabetes insipidus (CDI) is a clinical syndrome which results from loss or impaired function of vasopressinergic neurons in the hypothalamus/posterior pituitary, resulting in impaired synthesis and/or secretion of arginine vasopressin (AVP). AVP deficiency leads to the inability to concentrate urine and excessive renal water losses, resulting in a clinical syndrome of hypotonic polyuria with compensatory thirst. CDI is caused by diverse etiologies, although it typically develops due to neoplastic, traumatic, or autoimmune destruction of AVP-synthesizing/secreting neurons. This review focuses on the diagnosis and management of CDI, providing insights into the physiological disturbances underpinning the syndrome. Recent developments in di

**Step 3: Creating Embeddings and Indexing with FAISS**

In this cell I use `SentenceTransformer` to encode the article abstracts into embeddings, which are then indexed with FAISS for fast similarity search:
- Loads a `SentenceTransformer` model (`paraphrase-MiniLM-L6-v2`) to generate embeddings.
- Stores the embeddings in a FAISS `IndexFlatL2` index, which compares vectors by Euclidean (L2) distance.
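
To make the L2 search concrete, here is a minimal pure-Python sketch of what an exhaustive ("flat") index does: it compares the query with every stored vector and keeps the closest ones. The helper names `squared_l2` and `flat_l2_search` are hypothetical. As far as I know `IndexFlatL2` also reports squared distances, but treat this as an illustration of the idea, not of FAISS internals.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, top_k):
    """Exhaustively rank stored vectors by squared L2 distance to the query."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return scored[:top_k]

# Toy 2-D "embeddings"; real sentence embeddings have hundreds of dimensions.
vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
print(flat_l2_search(vectors, [1.0, 0.0], top_k=2))
```

Each result is a `(distance, position)` pair, and the position is what gets mapped back to an article in the list that was indexed.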



In [None]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
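
# paraphrase-MiniLM-L6-v2 is a public model, so no Hugging Face token is needed here.
# If you do need one, keep it in Colab secrets or an environment variable; never hard-code it.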

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

abstracts = [article['abstract'] for article in articles]
embeddings = model.encode(abstracts)

embedding_dim = embeddings.shape[1]
index = faiss.IndexFlatL2(embedding_dim)
index.add(np.array(embeddings).astype(np.float32))


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


**Step 4: Indexing and Article Retrieval Functions**

This cell defines functions to:
- `create_index()`: Generate embeddings from article abstracts and add them to a FAISS index.
- `retrieve_articles()`: Retrieve the top-k most relevant articles based on a query by searching in the FAISS index.
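
One detail to watch when mapping search results back to articles: if `top_k` is larger than the number of indexed vectors, FAISS fills the unused slots with index `-1`, and in Python `articles[-1]` would silently return the last article. The sketch below filters those slots out. `select_hits` is a hypothetical helper, and the index and distance rows are made-up example values.

```python
def select_hits(indices_row, distances_row, articles):
    """Pair each valid hit with its article, skipping -1 padding."""
    hits = []
    for idx, dist in zip(indices_row, distances_row):
        if idx < 0:  # -1 marks a slot with no neighbour
            continue
        hits.append({**articles[idx], "distance": float(dist)})
    return hits

# Made-up search result for top_k=3 over an index holding two articles.
articles = [{"title": "A"}, {"title": "B"}]
print(select_hits([1, 0, -1], [0.2, 0.9, 3.4e38], articles))
```

Keeping the distance next to each article also makes it easy to show how close each hit was to the query.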



In [None]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import os


def create_index(articles):
    abstracts = [article['abstract'] for article in articles]
    embeddings = model.encode(abstracts)

    embedding_dim = embeddings.shape[1]
    index = faiss.IndexFlatL2(embedding_dim)
    index.add(np.array(embeddings).astype(np.float32))

    return model, index

def retrieve_articles(query, top_k=3, model=None, index=None, articles=None):
    # Never ask for more neighbours than the index holds; FAISS pads missing hits with -1.
    top_k = min(top_k, index.ntotal)
    query_embedding = model.encode([query])
    distances, indices = index.search(np.array(query_embedding).astype(np.float32), top_k)
    retrieved_articles = [articles[i] for i in indices[0] if i >= 0]
    return retrieved_articles

query = "latest research on diabetes mellitus treatment"
articles = search_pubmed(query, max_results=5)
model, index = create_index(articles)
retrieved_articles = retrieve_articles(query, top_k=5, model=model, index=index, articles=articles)

for article in retrieved_articles:
    print(f"Title: {article['title']}")
    print(f"Link: {article['link']}")
    print(f"Authors: {article['authors']}")
    print(f"Abstract: {article['abstract']}\n")
    print(f"Paragraph: {article['paragraph']}\n")



Title: Advances in Research on Type 2 Diabetes Mellitus Targets and Therapeutic Agents.
Link: https://pubmed.ncbi.nlm.nih.gov//37686185/
Authors: Jingqian Su 1 2 3 , Yingsheng Luo 1 2 3 , Shan Hu 1 2 3 , Lu Tang 1 2 3 , Songying Ouyang 1 2 3 4
Abstract: Abstract Diabetes mellitus is a chronic multifaceted disease with multiple potential complications, the treatment of which can only delay and prolong the terminal stage of the disease, i.e., type 2 diabetes mellitus (T2DM). The World Health Organization predicts that diabetes will be the seventh leading cause of death by 2030. Although many antidiabetic medicines have been successfully developed in recent years, such as GLP-1 receptor agonists and SGLT-2 inhibitors, single-target drugs are gradually failing to meet the therapeutic requirements owing to the individual variability, diversity of pathogenesis, and organismal resistance. Therefore, there remains a need to investigate the pathogenesis of T2DM in more depth, identify multiple 

**Summarize Retrieved Articles**

I use Hugging Face's transformers library to summarize the abstracts of the retrieved articles.
- **generate_summary**: Takes the retrieved articles and creates a summary response.
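
As a rough sketch of how the summarization step could look, the code below attaches a summary to each retrieved article. The names `load_summarizer` and `summarize_articles` are hypothetical, the length limits are arbitrary, and the code assumes a transformers version that provides the `"summarization"` pipeline task with its default model. The summarizer is passed in as a plain callable, so the logic can be tried with a stand-in function before downloading a model.

```python
def load_summarizer():
    """Build a text -> summary callable backed by a transformers pipeline."""
    from transformers import pipeline  # imported lazily; needs `transformers` installed
    pipe = pipeline("summarization")   # downloads the default model on first use
    # truncation=True keeps over-long abstracts within the model's input limit.
    return lambda text: pipe(text, max_length=60, min_length=15,
                             truncation=True)[0]["summary_text"]

def summarize_articles(articles, summarize):
    """Attach a 'summary' to each article using the given text -> text callable."""
    summarized = []
    for article in articles:
        abstract = article.get("abstract", "")
        has_text = abstract and abstract != "No Abstract"
        summary = summarize(abstract) if has_text else "No summary available."
        summarized.append({**article, "summary": summary})
    return summarized

# Usage with the real model (needs network access on first run):
# summaries = summarize_articles(retrieved_articles, load_summarizer())
```

Articles whose abstract could not be scraped are kept in the result with a fallback message instead of being sent to the model.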


In [None]:
!pip install gradio

Collecting gradio
  Downloading gradio-5.4.0-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.4-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.4.0-py3-none-any.whl.metadata (2.9 kB)
Collecting gradio-client==1.4.2 (from gradio)
  Downloading gradio_client-1.4.2-py3-none-any.whl.metadata (7.1 kB)
Collecting httpx>=0.24.1 (from gradio)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting huggingface-hub>=0.25.1 (from gradio)
  Downloading huggingface_hub-0.26.2-py3-none-any.whl.metadata (13 kB)
Collecting markupsafe~=2.0 (from gradio)
  Downloading MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting orjson~=3.0 (from gradio)
  Downloading orjson-3.10.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.w

**Finally: Creating a User Interface with Gradio**

This cell defines a Gradio interface for a PubMed article summarizer:
- `generate_summary()`: A placeholder function to simulate article summary generation.
- `interface()`: Creates and formats article summaries for the Gradio interface.

The interface allows the user to:
1. Enter a search query.
2. Select the number of articles to retrieve.
3. View summarized article details.

The `iface.launch(share=True, debug=True)` line initiates the Gradio interface and provides a shareable link.
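
When the placeholder is swapped for real data, note that the scraper returns `authors` as a single string while the placeholder uses a list, and `', '.join(...)` on a string would separate every character. The sketch below is one way to format both shapes. `format_results` is a hypothetical helper, and the field names follow the dictionaries used in this notebook.

```python
def format_results(summaries):
    """Render article dicts as the text block shown in the interface output."""
    blocks = []
    for art in summaries:
        authors = art["authors"]
        # Accept either a list of names or a single pre-joined string.
        if not isinstance(authors, str):
            authors = ", ".join(authors)
        blocks.append(
            f"**Title:** {art['title']}\n"
            f"**Authors:** {authors}\n"
            f"**Link:** {art['link']}\n"
            f"**Summary:** {art['summary']}"
        )
    return "\n\n".join(blocks) if blocks else "No articles found."
```

A function like this could be called from `interface()` in place of the inline list comprehension, and it returns a readable message when a query yields no articles.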


In [None]:
import gradio as gr

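def generate_summary(query, max_results):
    # Placeholder: returns dummy records instead of real PubMed summaries.
    # The slider may pass a float, so cast before using it in range().
    max_results = int(max_results)
    return [
        {
            'title': f'Article Title {i+1}',
            'authors': ['Author A', 'Author B'],
            'link': f'https://pubmed.ncbi.nlm.nih.gov/{i+1}',
            'summary': f'Summary for article {i+1}'
        } for i in range(max_results)
    ]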

def interface(query, max_results):
    summaries = generate_summary(query, max_results)
    return "\n\n".join(
        [f"**Title:** {art['title']}\n**Authors:** {', '.join(art['authors'])}\n**Link:** {art['link']}\n**Summary:** {art['summary']}" for art in summaries]
    )
iface = gr.Interface(
    fn=interface,
    inputs=[
        gr.Textbox(label="Enter search query:"),
        gr.Slider(minimum=1, maximum=10, value=5, step=1, label="Number of articles to retrieve:")
    ],
    outputs="text",
    title="PubMed Article Summarizer",
    description="Enter a topic to search for PubMed articles and get summarized abstracts."
)

iface.launch(share=True, debug=True)


Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://4b02e709a1fae92429.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
