# CDSCO-Based Drug Information System
Large language models (LLMs) are increasingly being adopted for clinical decision support. However, most are trained primarily on biomedical corpora from the US and EU and lack context on Indian regulatory frameworks. This project develops a Retrieval-Augmented Generation (RAG) pipeline grounded in drug approvals published by the Central Drugs Standard Control Organization (CDSCO). It also lays the groundwork for future automation and integration into personalized healthcare systems that deliver up-to-date drug data.

## Environment Setup
This section installs Ollama and prepares the environment for the steps that follow. I'm using `colab-xterm` to work around Colab's GPU limits. Run these commands in the xterm shell to install Ollama, start the server, and pull the models:
```
curl -fsSL https://ollama.com/install.sh | sh
ollama serve > /dev/null 2>&1 &
ollama pull llama3.2
ollama pull nomic-embed-text
ollama list
```
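Before moving on, it helps to confirm that both models were actually pulled. The sketch below is a minimal check, assuming you copy the model names from the first column of `ollama list` by hand; the `missing_models` helper and the sample names are hypothetical and not part of Ollama itself.

```python
def missing_models(installed, required):
    """Return the required model names that are absent from `installed`.

    Tags are ignored, so "llama3.2:latest" satisfies "llama3.2".
    """
    base_names = {name.split(":", 1)[0] for name in installed}
    return [name for name in required if name.split(":", 1)[0] not in base_names]


# Hypothetical names, as they might be copied from the `ollama list` output.
installed = ["llama3.2:latest", "nomic-embed-text:latest"]
print(missing_models(installed, ["llama3.2", "nomic-embed-text"]))
```

An empty list means every required model is present; any name that comes back still needs an `ollama pull`.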

In [None]:
!pip install colab-xterm -q

In [None]:
%load_ext colabxterm
%xterm

In [None]:
!pip install langchain -q
!pip install langchain-core -q
!pip install langchain-community -q

In [None]:
from langchain_community.llms import Ollama

llm = Ollama(model="llama3.2")
# Baseline query to evaluate the model's current knowledge.
response = llm.invoke("According to 2025 CDSCO approvals, in which hematologic malignancies is Zanubrutinib indicated, and what are the recommended combinations for relapsed or refractory cases?")
print(response)


I can't provide real-time information or updates after my knowledge cutoff date of December 2023. For the most recent information on CDSCO approvals, indications, and recommended combinations for Zanubrutinib in hematologic malignancies as of 2025, I recommend consulting a reliable medical source or the official website of the Central Drugs Standard Control Organization (CDSCO) for the latest updates.


In [None]:
!pip install beautifulsoup4 chromadb langchain ollama PyMuPDF -q

In [None]:
import os
import csv
from urllib.parse import urljoin

import requests
import fitz
import ollama
from bs4 import BeautifulSoup

from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings

## Web Scraping and Document Ingestion
Approved drug information on the CDSCO website is not available in a directly usable format. Instead, the data is published through PDF documents linked within the webpage. Each link leads to a JSP-based intermediary page where the actual PDF file is rendered inside an `<iframe>`. As a result, the data must be retrieved by accessing and parsing the embedded PDF files individually.
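To make the iframe indirection concrete, here is a small dependency-free sketch of the resolution step on an inline HTML string. It uses the standard library's `HTMLParser` rather than BeautifulSoup, and the sample URL and markup are made up for illustration; the real intermediary pages may look different.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcParser(HTMLParser):
    """Collect the src attribute of the first <iframe> in a page."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # Only remember the first iframe encountered.
        if tag == "iframe" and self.src is None:
            self.src = dict(attrs).get("src")


def resolve_iframe_pdf(page_url, html):
    """Return the absolute URL of the first iframe's src, or None if absent."""
    parser = IframeSrcParser()
    parser.feed(html)
    return urljoin(page_url, parser.src) if parser.src else None


# Hypothetical intermediary page; not the actual CDSCO markup.
sample_url = "https://example.org/viewer/download.jsp?num=1"
sample_html = '<html><body><iframe src="/files/sample.pdf"></iframe></body></html>'
print(resolve_iframe_pdf(sample_url, sample_html))
```

Resolving the `src` against the page URL matters because the attribute may be a relative path rather than a full link.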

In [None]:
# Set up headers and base URL
headers = {
    "User-Agent": "Mozilla/5.0"
}
base_url = "https://cdsco.gov.in"

# Page to scrape
target_url = "https://cdsco.gov.in/opencms/opencms/en/Approval_new/Approved-New-Drugs/#"

In [None]:
# Request and parse HTML
response = requests.get(target_url, headers=headers, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

In [None]:
# Find all 'a' tags linking to download JSPs
pdf_links = []
for a in soup.find_all('a', href=True):
    href = a['href']
    if "download_file_division.jsp" in href:
        full_url = urljoin(base_url, href)
        pdf_links.append(full_url)

In [None]:
# Save to CSV
with open("pdf_links.csv", mode='w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['PDF_URL'])
    for link in pdf_links:
        writer.writerow([link])

print(f"Extracted and saved {len(pdf_links)} PDF download links to 'pdf_links.csv'")

Extracted and saved 38 PDF download links to 'pdf_links.csv'


In [None]:
# Load PDF URLs from CSV
pdf_urls = []

with open("pdf_links.csv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        pdf_urls.append(row['PDF_URL'])

In [None]:
docs = []

headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://cdsco.gov.in/opencms/opencms/en/Approval_new/Approved-New-Drugs/#"
}

# Fetch each intermediary JSP page loaded into pdf_urls and extract the embedded PDF
for i, jsp_url in enumerate(pdf_urls):
    try:
        response = requests.get(jsp_url, headers=headers, timeout=15)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        iframe = soup.find("iframe")

        if iframe and iframe.has_attr("src"):
            # Resolve the iframe src against the JSP URL and download the PDF
            pdf_url = urljoin(jsp_url, iframe["src"])
            pdf_resp = requests.get(pdf_url, headers=headers, timeout=15)
            pdf_resp.raise_for_status()

            if pdf_resp.content.startswith(b"%PDF"):  # Check the PDF magic bytes
                doc = fitz.open(stream=pdf_resp.content, filetype="pdf")
                text = "\n".join(page.get_text() for page in doc)
                doc.close()

                docs.append(Document(page_content=text, metadata={"source": f"url_{i}"}))

    except Exception as e:
        # Report the failure and continue with the remaining links
        print(f"Error on {jsp_url}: {e}")

## Text Processing
The PDFs contain tabular data with multi-line entries, so I chose a chunk size of 512 characters as a reasonable boundary for capturing a full entry, with an overlap of 50 characters in case a drug entry straddles the edge of a chunk. These chunks are then converted into vector embeddings and stored in Chroma.
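To see what the overlap buys, here is a simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as newlines); the `sliding_chunks` helper is a hypothetical stand-in that only illustrates the size and overlap arithmetic.

```python
def sliding_chunks(text, chunk_size=512, overlap=50):
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Synthetic 1200-character string standing in for extracted PDF text.
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])
```

The tail of each chunk reappears at the head of the next, so an entry cut at a boundary is still seen whole in at least one chunk, provided the cut falls within the 50-character overlap.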

In [None]:
# Split docs into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
splits = text_splitter.split_documents(docs)

In [None]:
# Create embeddings and vectorstore
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)
retriever = vectorstore.as_retriever()

## RAG Configuration
This section defines the functions that retrieve context from Chroma and pass it to the Ollama model so it can generate informed responses.
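Conceptually, the retrieval step ranks stored chunks by how close their embeddings are to the query embedding and keeps the top few. The sketch below shows that idea with toy 3-dimensional vectors and cosine similarity. These are assumptions for illustration: the actual store may use a different distance metric, real embeddings have far more dimensions, and `top_k` is a hypothetical helper, not a Chroma or LangChain function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_chunks, k=4):
    """Return the texts of the k chunks most similar to the query vector.

    indexed_chunks is a list of (text, vector) pairs.
    """
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy vectors standing in for real embeddings.
index = [
    ("chunk about drug A", [0.9, 0.1, 0.0]),
    ("chunk about drug B", [0.0, 1.0, 0.0]),
    ("chunk about drug C", [0.1, 0.0, 0.9]),
]
context = "\n\n".join(top_k([1.0, 0.0, 0.0], index, k=2))
print(context)
```

The selected chunks are joined with blank lines, the same way the retrieved documents are combined into a single context string before being placed in the prompt.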

In [None]:
# LLM response function
def ollama_llm(question, context):
    formatted_prompt = f"Question: {question}\nContext: {context}\nAnswer:"
    response = ollama.chat(model='llama3.2', messages=[{'role': 'user', 'content': formatted_prompt}])
    return response['message']['content']

In [None]:
# RAG chain
def rag_chain(question):
    retrieved_docs = retriever.invoke(question)
    formatted_context = "\n\n".join(doc.page_content for doc in retrieved_docs)
    return ollama_llm(question, formatted_context)

In [None]:
# Query RAG
answer = rag_chain("According to 2025 CDSCO approvals, in which hematologic malignancies is Zanubrutinib indicated, and what are the recommended combinations for relapsed or refractory cases?")
print(answer)

According to 2025 CDSCO approvals, Zanubrutinib is indicated in hematologic malignancies for the treatment of:

1. Mantle cell lymphoma (MCL) who have received at least one prior therapy.
2. Waldenstrom’s macrogloubulinemia (WM)
3. Relapsed or refractory marginal zone lymphoma (MZL) who have received at least one anti-CD20-based regimen.
4. Chronic lymphocytic leukemia (CLL) or small lymphocytic lymphoma (SLL).
5. Relapsed or refractory follicular lymphoma (FL), in combination with obinutuzumab, after two or more lines of systemic therapy.

Additionally, Zanubrutinib is indicated for the treatment of relapsed follicular B-cell non-Hodgkin lymphoma (FL) and small lymphocytic lymphoma (SLL) in patients who have received at least two prior systemic therapies.
