# 📘 Phase 1: Fetching and Preprocessing PubMed Data


In this section, we fetch PubMed abstracts on three mental health disorders (**depression**, **psychosis**, and **anxiety**) through the NCBI Entrez API. We then clean the text to remove noise and prepare it for vector embedding.
    

In [None]:

!pip install -q biopython nltk
from Bio import Entrez
import nltk
import re

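# Sentence tokenizer data used later by sent_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')  # required by recent NLTK releases; older ones may only need 'punkt'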
    

In [None]:

# Configure Entrez email
Entrez.email = "abiodunadebisi614@gmail.com"  # Replace with your email for Entrez access

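# Search and fetch PubMed abstracts related to mental health disorders
def fetch_pubmed_abstracts(query, max_results=10):
    # Step 1: search PubMed for matching article IDs
    handle = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
    record = Entrez.read(handle)
    handle.close()
    id_list = record["IdList"]

    abstracts = []
    if id_list:
        # Step 2: fetch the records as XML so each abstract can be extracted per article
        # (splitting the plain-text output on blank lines mixes in titles, authors, and IDs)
        handle = Entrez.efetch(db="pubmed", id=",".join(id_list), retmode="xml")
        records = Entrez.read(handle)
        handle.close()
        for article in records["PubmedArticle"]:
            abstract = article["MedlineCitation"]["Article"].get("Abstract")
            if abstract:  # some records have no abstract
                abstracts.append(" ".join(str(part) for part in abstract["AbstractText"]))
    return abstracts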

# Fetch sample data
abstracts = fetch_pubmed_abstracts("depression OR psychosis OR anxiety")
len(abstracts), abstracts[:2]
    

### 🔍 Clean and Normalize the Text

In [None]:

def clean_text(text):
    text = re.sub(r"\n", " ", text)
    text = re.sub(r"[^a-zA-Z0-9.,;:!?()\-\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# Use a name other than `abs` to avoid shadowing the built-in function
cleaned_abstracts = [clean_text(abstract) for abstract in abstracts if abstract.strip()]
cleaned_abstracts[:2]
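
To see what the character filter actually keeps, here is a small self-contained sketch that runs the same three substitutions on a made-up sentence (the sample string is hypothetical, not PubMed output). It suggests the whitelist is lossy for biomedical text: symbols such as `=`, `/`, brackets, and Greek letters are dropped, which can fuse or truncate tokens.

```python
import re

def clean_text(text):
    # Same three steps as the cell above
    text = re.sub(r"\n", " ", text)                        # newlines -> spaces
    text = re.sub(r"[^a-zA-Z0-9.,;:!?()\-\s]", "", text)   # drop characters outside the whitelist
    text = re.sub(r"\s+", " ", text)                       # collapse runs of whitespace
    return text.strip()

# Hypothetical input, written for illustration only
sample = "Patients (n=120)\nreceived 20 mg/day of an SSRI;  β-blockers were excluded [12]."
cleaned = clean_text(sample)
print(cleaned)
# Patients (n120) received 20 mgday of an SSRI; -blockers were excluded 12.
```

If units or citation markers matter downstream, widening the character class (for example, allowing `/`, `=`, and `%`) may be worth considering.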
    

### ✂️ Split into Chunks for Embedding

In [None]:

from nltk.tokenize import sent_tokenize

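def chunk_text(text, max_length=500):
    # Greedily pack whole sentences into chunks of at most max_length characters
    sentences = sent_tokenize(text)
    chunks, current_chunk = [], ""
    for sent in sentences:
        candidate = f"{current_chunk} {sent}".strip()
        if len(candidate) <= max_length:
            current_chunk = candidate
        else:
            if current_chunk:  # avoid emitting an empty chunk when the first sentence is too long
                chunks.append(current_chunk)
            current_chunk = sent  # a single over-long sentence becomes its own chunk
    if current_chunk:
        chunks.append(current_chunk)
    return chunks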

text_chunks = []
for doc in cleaned_abstracts:
    text_chunks.extend(chunk_text(doc))

len(text_chunks), text_chunks[:3]
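
The greedy packing idea can be checked on a tiny input. The sketch below is self-contained and makes two assumptions: `split_sentences` is a hypothetical regex stand-in for `sent_tokenize` (so the example runs without NLTK data), and `max_length` is lowered to 60 so the split is visible on three short sentences.

```python
import re

def split_sentences(text):
    # Hypothetical stand-in for nltk's sent_tokenize:
    # split after '.', '!' or '?' when followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def chunk_text(text, max_length=500):
    # Greedily pack whole sentences into chunks of at most max_length characters
    chunks, current_chunk = [], ""
    for sent in split_sentences(text):
        candidate = f"{current_chunk} {sent}".strip()
        if len(candidate) <= max_length:
            current_chunk = candidate
        else:
            if current_chunk:
                chunks.append(current_chunk)
            current_chunk = sent
    if current_chunk:
        chunks.append(current_chunk)
    return chunks

sample = "Anxiety is common. Depression often co-occurs with it. Psychosis is rarer."
chunks = chunk_text(sample, max_length=60)
for chunk in chunks:
    print(len(chunk), chunk)
# 54 Anxiety is common. Depression often co-occurs with it.
# 19 Psychosis is rarer.
```

The first two sentences fit together in 54 characters; adding the third would exceed 60, so it starts a new chunk. Sentences are never split, so a chunk can exceed `max_length` only when a single sentence is longer than the limit.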
    

# 📘 Phase 2: Embedding and FAISS Vector Store


Now that we have cleaned and chunked the text data, we convert each chunk into vector embeddings using a pre-trained model from `sentence-transformers`. Then, we store the vectors in **FAISS**, a high-performance similarity search library.
    

In [None]:

!pip install -q faiss-cpu sentence-transformers
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Load a lightweight general-purpose sentence encoder (not trained specifically on biomedical text)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Encode the chunks
embeddings = model.encode(text_chunks, show_progress_bar=True)
embeddings = np.array(embeddings).astype("float32")
embeddings.shape
    

### 🧠 Index the Embeddings Using FAISS

In [None]:

# Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
index.ntotal
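
`IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances. As a rough mental model, the same computation can be sketched in plain NumPy. The function name `brute_force_l2_search` and the toy 2-D vectors are hypothetical, chosen only for illustration; FAISS itself is far more optimized.

```python
import numpy as np

def brute_force_l2_search(vectors, queries, top_k):
    # Squared Euclidean distance between every query and every stored vector
    diffs = queries[:, None, :] - vectors[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Indices of the top_k smallest distances per query, nearest first
    order = np.argsort(sq_dists, axis=1)[:, :top_k]
    return np.take_along_axis(sq_dists, order, axis=1), order

# Toy 2-D vectors standing in for sentence embeddings (hypothetical data)
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
queries = np.array([[0.9, 0.1]], dtype="float32")

distances, indices = brute_force_l2_search(vectors, queries, top_k=2)
print(indices)    # nearest-first row indices into `vectors`
print(distances)  # matching squared L2 distances
```

Here the query is closest to the vector at row 1, then row 0, with squared distances of about 0.02 and 0.82. A flat index has the same linear cost per query, which is fine for a few thousand chunks.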
    

### 🔍 Sample Similarity Search

In [None]:

# Query with a new sentence
query = "drugs for treating severe depression"
query_vector = model.encode([query]).astype("float32")

# Search
top_k = 5
distances, indices = index.search(query_vector, top_k)

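print("🔎 Top retrieved chunks:")
for dist, idx in zip(distances[0], indices[0]):
    if idx == -1:  # FAISS pads results with -1 when the index holds fewer than top_k vectors
        continue
    print(f"- ({dist:.3f}) {text_chunks[idx]}")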
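
The values in `distances` are squared L2 distances, which can be hard to read on their own. If the embeddings were L2-normalized (an assumption: this notebook does not normalize them), squared L2 distance and cosine similarity would be tied by `||a - b||² = 2 - 2·cos(a, b)`, so ranking by either gives the same order. A minimal numerical check with random vectors standing in for embeddings:

```python
import numpy as np

# Two random vectors standing in for embeddings (hypothetical data)
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 8))

# Normalize each vector to unit length
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

sq_l2 = np.sum((a - b) ** 2)   # squared Euclidean distance
cosine = np.dot(a, b)          # cosine similarity of unit vectors

# The two printed values should agree up to floating-point error
print(sq_l2, 2 - 2 * cosine)
```

Under that assumption, a squared distance near 0 corresponds to cosine similarity near 1, and a value near 2 corresponds to unrelated (orthogonal) vectors.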
    