## Introduction
- Goal: Answer neuroscience-related questions using semantic search + QA.
- Stack: `sentence-transformers`, `ChromaDB`, `transformers`, `tinyroberta-squad2`

In [1]:
# Install required packages
!pip install -q biopython sentence-transformers chromadb #openai
!pip install -q transformers accelerate bitsandbytes
!pip uninstall -y bitsandbytes
!pip install --upgrade git+https://github.com/TimDettmers/bitsandbytes.git

In [2]:
from Bio import Entrez
from sentence_transformers import SentenceTransformer
import pandas as pd
from tqdm import tqdm
import time
import chromadb
#import openai
import os
import xml.etree.ElementTree as ET
from transformers import pipeline

### Step 1: Fetch Abstracts for Pipeline Prototyping

In [3]:
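# Set your contact email for NCBI E-utilities (Biopython prints a warning if it is missing).
# The OpenAI key is only needed if you later switch to an OpenAI model.
#openai.api_key = os.environ.get("OPENAI_API_KEY")
Entrez.email = os.environ.get("NCBI_EMAIL")  # stays None if NCBI_EMAIL is not set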

In [4]:
# Function that gets neuroscience abstracts from PubMed
def get_abstracts(query, max_results=100):
  """
  Get abstracts from PubMed based on a given query.
  query (str): a query line that contains keywords to search and filters, e.g., "semantic memory AND brain"
  max_results (int): the maximum number of abstracts to fetch; 100 to keep things light for prototyping
  return (pd.DataFrame): a dataframe with three columns: "title", "date" and "text"
  """
  handle = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
  record = Entrez.read(handle)
  handle.close()

  id_list = record["IdList"] # abstract IDs (PMIDs)
  abstracts = []
  batch_size = 100  # max IDs to fetch per request
  for start in tqdm(range(0, len(id_list), batch_size)):
    batch_ids = id_list[start:start + batch_size]

    # Fetch article metadata in XML
    handle = Entrez.efetch(db="pubmed", id=",".join(batch_ids), retmode="xml")
    xml_data = handle.read()
    handle.close()

    root = ET.fromstring(xml_data)
    print(root)  # quick check that the XML parsed
    # Iterate over articles per batch
    for article in root.findall(".//PubmedArticle"):
      # Extract abstract text parts; itertext() also keeps text inside
      # nested tags (e.g. <i>, <sub>), which elem.text alone would drop
      abstract_texts = article.findall(".//AbstractText")
      parts = ["".join(elem.itertext()).strip() for elem in abstract_texts]
      abstract = " ".join(part for part in parts if part)

      # Extract article title
      title_elem = article.find(".//ArticleTitle")
      title = "".join(title_elem.itertext()).strip() if title_elem is not None else ""
      title = title or "No Title"

      # Extract date of publishing (for double check)
      pub_date_elem = article.find(".//PubDate")
      if pub_date_elem is not None:
        year = pub_date_elem.findtext("Year", default="")
        month = pub_date_elem.findtext("Month", default="")
        day = pub_date_elem.findtext("Day", default="")
      else:
        year = month = day = ""  # some records have no PubDate element
      # Put these parts together into a whole date
      date = f"{year}-{month}-{day}" if year and month and day else "No Date"

      # Keep only articles that actually have an abstract
      if abstract:
        abstracts.append({"title": title, "date": date, "text": abstract})

  return pd.DataFrame(abstracts)


In [5]:
# Get abstracts!
df = get_abstracts("semantic memory AND brain", max_results=100)

            Email address is not specified.

            To make use of NCBI's E-utilities, NCBI requires you to specify your
            email address with each request.  As an example, if your email address
            is A.N.Other@example.com, you can specify it as follows:
               from Bio import Entrez
               Entrez.email = 'A.N.Other@example.com'
            In case of excessive usage of the E-utilities, NCBI will attempt to contact
            a user at the email address provided before blocking access to the
            E-utilities.
100%|██████████| 1/1 [00:01<00:00,  1.17s/it]

<Element 'PubmedArticleSet' at 0x7822443a5030>





In [6]:
# Clean the abstract text
df["text"] = df["text"].str.replace("\n", " ").str.strip()
# Save
df.to_csv("abstracts.csv", index=False)
df.tail()
# We seem to have ended up with abstracts from only the most recent years
# (2024-2025 in the rows shown). Not a problem for prototyping, and it can
# even reflect more recent trends when we look at semantic space later.
# But we need a more balanced selection for future scaling.

|    | title | date | text |
|----|-------|------|------|
| 91 | Multi-Branch CNN-LSTM Fusion Network-Driven Sy... | No Date | The high volume of emergency room patients oft... |
| 92 | FM-APP: Foundation Model for Any Phenotype Pre... | 2024-Nov-27 | Predicting individual-level non-neuroimaging p... |
| 93 | Mechanisms of Verbal Fluency Impairment in Str... | No Date | Verbal fluency provides a unique index of the ... |
| 94 | Validation of the Cognitive-Emotional Perspect... | No Date | BackgroundTheory of mind (ToM) is crucial for ... |
| 95 | Towards a genetics of semantics? False memorie... | 2025-Apr-15 | Williams syndrome (WS) is a rare genetic neuro... |
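
Several rows in the output above show `No Date` because `get_abstracts` only builds a date when `Year`, `Month` and `Day` are all present. PubMed's `PubDate` element can also carry just a year (or year and month), or a free-text `MedlineDate`. The sketch below shows one possible fallback. The helper name `parse_pub_date` and the sample XML snippets are made up for illustration and are not part of the original notebook.

```python
import xml.etree.ElementTree as ET

def parse_pub_date(pub_date_elem):
    """Return the most specific date string available, or "No Date"."""
    if pub_date_elem is None:
        return "No Date"
    # Keep Year, Month, Day in order, stopping at the first missing part
    kept = []
    for tag in ("Year", "Month", "Day"):
        value = pub_date_elem.findtext(tag, default="")
        if not value:
            break
        kept.append(value)
    if kept:
        return "-".join(kept)
    # Fall back to the free-text MedlineDate if no Year is given
    return pub_date_elem.findtext("MedlineDate", default="") or "No Date"

# Hypothetical PubDate snippets (not taken from real records)
year_month = ET.fromstring("<PubDate><Year>2025</Year><Month>Apr</Month></PubDate>")
medline = ET.fromstring("<PubDate><MedlineDate>2024 Nov-Dec</MedlineDate></PubDate>")
print(parse_pub_date(year_month))
print(parse_pub_date(medline))
```

Swapping this in for the `year and month and day` check would keep partial dates instead of discarding them, at the cost of a less uniform `date` column.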


### Step 2: Create Embeddings with Sentence-Transformers

In [7]:
model = SentenceTransformer("all-MiniLM-L6-v2")
# encode and save the embedding to the df
df["embedding"] = model.encode(df["text"].tolist(), show_progress_bar=True).tolist()

df.to_pickle("embedded_corpus.pkl")




In [8]:
# Use ChromaDB to store vector embeddings for scalable similarity search
chroma_client = chromadb.Client()
#chroma_client.delete_collection("abstracts")  # to be sure that it's clean slate
collection = chroma_client.create_collection("abstracts")

for i, row in df.iterrows():
  collection.add(
      ids=[f"doc_{i}"],
      documents=[row["text"]],
      metadatas=[{
          "title": row["title"] if row["title"] is not None else "Untitled",
          "date": row["date"] if row["date"] is not None else "Unknown",
      }],
      embeddings=[row["embedding"]],
  )

### Step 3: Semantic Search & Answer Generation

In [9]:
# Semantic search func
def semantic_search(query_Q, top_k=3):
  """
  Search semantically closest results based on your query (in this context a question asked by user)
  query_Q (str): a query to be encoded as embeddings, e.g., a question: "What brain regions are involved in semantic memory?"
  top_k (int): the number of semantically closest results to return
  return (dict): a dictionary with keys "ids" (doc_XX), "documents", "metadatas" and "distances"
  """
  # Convert the NumPy vector to a plain list, matching how the stored embeddings were added
  query_vec = model.encode([query_Q])[0].tolist()
  results = collection.query(query_embeddings=[query_vec], n_results=top_k)

  return results

# Tryout func
results = semantic_search("What cortical regions are involved in semantic memory?")
results["documents"][0]

['This study examines the evolving perspective on semantic processing, which has shifted from the traditional view of an isolated semantic memory system to one that recognizes the involvement of dynamic, distributed neural networks. Recent evidence supports the hypothesis that semantic processing engages both modality-specific and multimodal regions, with the latter serving as integrative "semantic hubs." In this context, our research focuses on the posterior parietal cortices (PPC) and their role in processing space-related semantics. We utilized a low-frequency repetitive Transcranial Magnetic Stimulation (rTMS) protocol targeting the PPC in 11 healthy participants across two tasks. In the first task, in which participants read aloud words from various semantic categories, including space-related terms, no rTMS effects were observed. In the second task, which required participants to respond aloud in a dichotomous manner to questions that either involved or did not involve spatial re
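
Conceptually, `collection.query` ranks the stored vectors by their distance to the query vector and returns the closest `top_k`. The sketch below is a brute-force version of that idea on toy 2-D vectors, assuming squared Euclidean distance. ChromaDB itself uses an approximate index and a configurable distance metric, so this is an illustration of the principle rather than its implementation. The function `top_k_nearest` and the toy vectors are hypothetical.

```python
def top_k_nearest(query_vec, doc_vecs, k=3):
    """Return (index, squared_distance) pairs for the k closest vectors."""
    scored = []
    for idx, vec in enumerate(doc_vecs):
        # Squared Euclidean distance between the query and this document vector
        dist = sum((q - v) ** 2 for q, v in zip(query_vec, vec))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    return [(idx, dist) for dist, idx in scored[:k]]

# Toy 2-D "embeddings" (real all-MiniLM-L6-v2 vectors have 384 dimensions)
toy_doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
toy_query = [1.0, 0.0]
nearest = top_k_nearest(toy_query, toy_doc_vecs, k=2)
print([idx for idx, _ in nearest])
```

A linear scan like this is fine for around 100 abstracts. A vector store becomes worthwhile once the corpus is too large to compare against every vector on each query.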

In [15]:
# Use a light, free model for a test
qa_pipeline = pipeline("question-answering", model="deepset/tinyroberta-squad2")

# Test model installation
question = "What is the main finding of the study?"
context = "The study found that memory recall and future thinking share neural patterns across the default mode network."

result = qa_pipeline(question=question, context=context)
print(result['answer'])


Device set to use cuda:0


memory recall and future thinking share neural patterns across the default mode network


In [11]:
# Answer Generation func

def generate_answer(question, docs):
  """
  Use semantic search results (retrieved abstracts) as context for the extractive QA model in `qa_pipeline`.
  question (str): question asked by user
  docs (dict): results of semantic search with keys "ids", "documents", "metadatas" and "distances"
  return (str): a string with the answer
  """
  # Use both abstract text and title as context
  context = "\n\n".join([f"{meta['title']}:\n{doc}" for meta, doc in zip(docs["metadatas"][0], docs["documents"][0])])

  # Apply the QA model
  result = qa_pipeline(question=question, context=context)

  # Return just the answer
  return result['answer']

In [16]:
question = "What cortical regions are involved in semantic memory?"
retrieved_docs = semantic_search(question, top_k=3)
answer = generate_answer(question, retrieved_docs)

print("ANSWER:\n", answer)

ANSWER:
 posterior parietal


In [17]:
question = "How does the brain transition between thoughts during memory recall?"
retrieved_docs = semantic_search(question, top_k=3)
answer = generate_answer(question, retrieved_docs)

print("ANSWER:\n", answer)

ANSWER:
 activate the brain's default and control networks
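
Because the QA model is extractive, the answer is a span copied from the context, so it can in principle be traced back to the abstract it came from. The sketch below assumes the nested-list result layout described in the `semantic_search` docstring. The helper `trace_answer_source` and the toy results are hypothetical and are not real retrieved abstracts.

```python
def trace_answer_source(answer, results):
    """Return (doc_id, title) pairs for retrieved docs whose title or text contains the answer."""
    needle = answer.strip().lower()
    if not needle:
        return []
    sources = []
    for doc_id, doc, meta in zip(results["ids"][0],
                                 results["documents"][0],
                                 results["metadatas"][0]):
        title = meta.get("title", "Untitled")
        # generate_answer builds its context from title + abstract, so search both
        if needle in f"{title}\n{doc}".lower():
            sources.append((doc_id, title))
    return sources

# Toy results in the same layout as semantic_search output (invented text)
toy_results = {
    "ids": [["doc_0", "doc_1"]],
    "documents": [["Stimulation targeted the posterior parietal cortices.",
                   "Verbal fluency was measured after stroke."]],
    "metadatas": [[{"title": "Toy title A"}, {"title": "Toy title B"}]],
}
sources = trace_answer_source("posterior parietal", toy_results)
print(sources)
```

Plain substring matching is a deliberate simplification. It can return several documents if the same phrase appears in more than one abstract.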


It seems that, with the current light, free extractive QA model (`tinyroberta-squad2`, which is a RoBERTa-based span extractor rather than a generative GPT model), our semantic Q&A is short but (more or less) accurate. The answers are short because the model can only return a span copied from the retrieved context, and the length of context it can process at once is clearly limited.

Future improvements could include:
- Better models (LLaMA, Mistral, or GPT)
- Longer context handling (chunking, summarization)
- Citation + source tracing
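
As a starting point for the chunking idea in the list above, here is a minimal sketch of a sliding window over words with overlap, so that a sentence cut at one chunk boundary still appears whole in a neighboring chunk. The function name `chunk_words` and the default sizes are assumptions chosen for illustration. A tokenizer-based split would match the model's limits more precisely.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of at most chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

# Tiny demonstration with 10 "words" and small window sizes
demo_text = " ".join(str(i) for i in range(10))
demo_chunks = chunk_words(demo_text, chunk_size=4, overlap=2)
print(demo_chunks)
```

Each chunk could then be embedded and stored as its own entry, with the abstract ID kept in the metadata so that answers remain traceable to their source.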

### Reflection: Semantic Memory in Neuroscience vs. NLP Embeddings

**Semantic memory (neuroscience):**  
- Our long-term store of general knowledge, facts, and concepts  
- Independent of personal experience (e.g., knowing that Paris is the capital of France)  
- Supported by distributed brain regions, especially temporal and parietal cortices  
- Structured, conceptual, and used for reasoning and language  

**“Semantic” embeddings in NLP/LLMs:**  
- Vector representations learned from large text corpora to capture contextual similarity  
- Used for similarity search and retrieval based on numerical proximity in embedding space  
- Reflect patterns of word usage rather than explicit conceptual structure  

**Why the distinction matters:**  
- Our semantic search pipeline uses these embeddings to retrieve relevant text snippets—an *approximation* of semantic memory  
- It captures associative similarity rather than true conceptual knowledge  
- To model human-like semantic memory more fully, systems would need explicit concept representations, ontologies, or symbolic reasoning  

**In summary:**  
While embeddings power effective and practical semantic search, they differ fundamentally from the structured knowledge that defines semantic memory in the brain. Our approach is a useful proxy, but the gap highlights exciting challenges ahead for building truly “semantic” AI.
