# Semantic search

Ready-to-use embedding models, compared for biomedical text:

| Model Name                                                               | Type                | Dim | Quality (Biomedical)                | Speed (CPU)    | Memory Usage            | Sentence-Level Optimized |
| ------------------------------------------------------------------------ | ------------------- | --- | ----------------------------------- | -------------- | ----------------------- | ------------------------ |
| **BioWordVec (BioSentVec)**<br>`BioWordVec_PubMed_MIMICIII_d200.vec.bin` | Static (word-level) | 200 | ⚠️ Low–Moderate                     | ✅✅✅ Very Fast  | ✅ Very Low (\~1 GB RAM) | ❌ No                     |
| **`all-MiniLM-L6-v2`**                                                   | SBERT (MiniLM)      | 384 | ✅ Moderate (general)                | ✅✅✅ Very Fast  | ✅ Low (\~80 MB)         | ✅ Yes                    |
| **`pritamdeka/S-PubMedBert-MS-MARCO`**                                   | SBERT (PubMedBERT)  | 768 | ✅✅✅ Excellent                       | ⚠️ Medium      | ⚠️ Moderate-High        | ✅ Yes                    |
| **`thenlper/gte-base`**                                                  | GTE (BERT)          | 768 | ✅✅ Good                             | ✅✅ Fast        | ✅ Moderate (\~400 MB)   | ✅ Yes                    |
| **`nomic-ai/nomic-embed-text-v1.5`**                                     | Nomic BERT          | 768 | ✅✅ Very Good (general + scientific) | ⚠️ Medium-Slow | ❗ High (\~1 GB+)        | ⚠️ Partial (CLS token)   |
| **`microsoft/BiomedNLP-PubMedBERT...`**                                  | Raw BERT            | 768 | ✅✅✅ Best-in-domain                  | 🐢 Slow        | ❗ High (\~1.2 GB)       | ❌ No (needs pooling)     |


In [None]:
import logging
import pandas as pd
from tqdm.auto import tqdm
import os
import psycopg2
from more_itertools import sliced
from math import ceil
import concurrent
import multiprocessing
import numpy as np
import concurrent.futures
import queue
import threading

from pysrc.papers.db.postgres_utils import ints_to_vals
from pysrc.config import PubtrendsConfig
config = PubtrendsConfig(test=False)






logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s: %(message)s')
logger = logging.getLogger('notebook')

%matplotlib inline
%config InlineBackend.figure_format='retina'

In [None]:
# Whether to use exact search (True) or approximate indexes (ivfflat in Postgres, IVFPQ in Faiss)
EXACT_SEARCH = False

# Connection to the main PubTrends database

In [None]:
connection_string_full_db = f"""
                    host={config.postgres_host} \
                    port={config.postgres_port} \
                    dbname={config.postgres_database} \
                    user={config.postgres_username} \
                    password={config.postgres_password}
                """.strip()

In [None]:
def load_publications(pids):
    # Load id, title, abstract and type for the given PubMed ids
    vals = ints_to_vals(pids)
    query = f'''
            SELECT P.pmid as id, title, abstract, type
            FROM PMPublications P
            WHERE P.pmid IN (VALUES {vals});
            '''
    with psycopg2.connect(connection_string_full_db) as connection:
        connection.set_session(readonly=True)
        with connection.cursor() as cursor:
            cursor.execute(query)
            df = pd.DataFrame(cursor.fetchall(),
                              columns=['id', 'title', 'abstract', 'type'],
                              dtype=object)
            return df


In [None]:
def load_publications_year(year):
    with psycopg2.connect(connection_string_full_db) as connection:
        connection.set_session(readonly=True)
        query = f'''
                SELECT P.pmid as id, title, abstract
                FROM PMPublications P
                WHERE year = {year}
                ORDER BY pmid;
                '''
        with connection.cursor() as cursor:
            cursor.execute(query)
            df = pd.DataFrame(cursor.fetchall(),
                              columns=['id', 'title', 'abstract'],
                              dtype=object)
            return df

In [None]:

! mkdir -p ~/pubtrends_years

def fetch_year(year):
    path = os.path.expanduser(f'~/pubtrends_years/{year}.csv.gz')
    if os.path.exists(path):
        try:
            return pd.read_csv(path, compression='gzip')
        except Exception as e:
            logger.error(f'Error reading {path}: {e}')
            os.remove(path)
            return fetch_year(year)
    else:
        df = load_publications_year(year)
        df.to_csv(path, index=None, compression='gzip')
        return df


In [None]:
# load_publications_year(2025).head(10)

# Chunking

In [None]:
from pysrc.papers.analysis.text import get_chunks

MAX_TOKENS = 128

text = "Staphylococcus aureus is a rare cause of postinfectious glomerulonephritis, and Staphylococcus-related glo-merulonephritis primarily occurs in middle-aged or elderly patients. Patients with Staphylococcus-related glomerulonephritis also present with hematuria, proteinuria of varying degrees, rising serum creatinine levels, and/or edema. The severity of renal insufficiency is proportional to the degree of proliferation and crescent formation. Here, we present a diabetic patient admitted with a history of 1 week of left elbow pain. Laboratory results revealed that erythrocyte sedimentation rate was 110 mm/hour, serum creatinine level was 1 mg/dL, C-reactive protein level was 150 mg/L, and magnetic resonance imaging showed signal changes in favor of osteomyelitis at the olecranon level, with diffuse edematous appearance in the elbow skin tissue and increased intra-articular effusion. After diagnosis of osteomyelitis, ampicillin/sulbactam and teicoplanin were administered. After day 7 of admission, the patient developed acute kidney injury requiring hemodialysis under antibiotic treatment. Kidney biopsy was performed to determine the underlying cause, which showed Staphylococcus-related glomerulonephritis. Recovery of renal func-tions was observed after antibiotic and supportive treatment."

chunks = get_chunks(text, MAX_TOKENS)
print(f"Number of chunks: {len(chunks)}")

for i, chunk in enumerate(chunks):
    print(f"\nChunk {i+1}:")
    print(chunk)
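
`get_chunks` is a project helper whose implementation is not shown in this notebook. As a rough illustration of what token-bounded chunking involves, the sketch below packs whole sentences greedily into chunks of at most `max_tokens` whitespace-separated tokens. The function name `greedy_chunks`, the regex sentence splitter, and whitespace tokenization are assumptions made for this example; the real helper may count model tokens and pick boundaries differently.

```python
import re


def greedy_chunks(text, max_tokens):
    """Pack whole sentences into chunks of at most max_tokens whitespace tokens.

    Hypothetical stand-in for get_chunks: a sentence longer than max_tokens
    is kept as its own oversized chunk rather than being split mid-sentence.
    """
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget
        if current and current_len + n_tokens > max_tokens:
            chunks.append(' '.join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(' '.join(current))
    return chunks


example = 'One two three. Four five six seven. Eight nine.'
example_chunks = greedy_chunks(example, max_tokens=5)
print(example_chunks)
```

Keeping sentences intact means each chunk stays readable on its own, at the cost of chunks that are often shorter than the token budget.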

# Embeddings with Sentence Transformer

In [None]:
from pysrc.embeddings.sentence_transformer.sentence_transformer import SentenceTransformerModel

sentence_transformer_model = SentenceTransformerModel()
sentence_transformer_model.download_and_load_model()
emb = sentence_transformer_model.encode(['This is a test.', 'This is a test2'])
print(emb.shape)

In [None]:
embedding_dimension = emb.shape[1]
text_embedding = lambda t: sentence_transformer_model.encode(t)
batch_texts_embeddings = sentence_transformer_model.encode_parallel
embeddings_model = sentence_transformer_model

In [None]:
device = sentence_transformer_model.device
embeddings_model_name = 'all_MiniLM_L6_v2'
embedding_dimension = 384

# Embeddings with Hugging Face wrapper model

In [None]:
# from more_itertools import sliced
# import numpy as np
# import torch
# from transformers import AutoModel, AutoTokenizer
#
# if torch.backends.mps.is_available() and torch.backends.mps.is_built():
#     device = 'mps'
# elif torch.cuda.is_available():
#     device = 'cuda'
# else:
#     device = 'cpu'
#
# class SentenceTransformerWrapper:
#     def __init__(self, model_name, attention):
#         print(f'Loading model into {device}')
#         self.device = device
#         self.attention = attention
#         self.tokenizer = AutoTokenizer.from_pretrained(model_name)
#         self.model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)
#         self.model.eval()
#
#     @staticmethod
#     def mean_pooling(model_output, attention_mask):
#         token_embeddings = model_output.last_hidden_state  # (batch_size, seq_len, hidden_size)
#         input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
#         summed = torch.sum(token_embeddings * input_mask_expanded, dim=1)
#         summed_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
#         return summed / summed_mask
#
#     def encode(self, sentences, batch_size=32):
#         all_embeddings = []
#
#         with torch.no_grad():
#             for batch in tqdm(list(sliced(sentences, batch_size))):
#                 inputs = self.tokenizer(
#                     batch,
#                     return_tensors="pt",
#                     padding=True,
#                     truncation=True,
#                     max_length=1024,
#                 ).to(self.device)
#
#                 outputs = self.model(**inputs)
#                 if self.attention:
#                     embeddings = SentenceTransformerWrapper.mean_pooling(outputs, inputs['attention_mask'])
#                 else:
#                     embeddings = outputs.last_hidden_state[:, 0, :]
#
#                 all_embeddings.append(embeddings.cpu().numpy())
#
#         return np.vstack(all_embeddings)

In [None]:
# # Decent model for biomedical embeddings
# # wrapped_model = SentenceTransformerWrapper("nomic-ai/nomic-embed-text-v1.5", False)
# # Also good, and slightly faster than nomic-embed
# wrapped_model = SentenceTransformerWrapper("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext", True)
# embeddings = wrapped_model.encode(['Test sentence'])
# embeddings.shape

In [None]:
# from more_itertools import sliced
# from math import ceil
# import concurrent
# import multiprocessing
# import numpy as np
#
# def parallel_texts_embeddings_wrapper(texts):
#     if device != 'cpu':
#         return wrapped_model.encode(texts)
#     # Default to number of CPUs for max workers
#     max_workers = multiprocessing.cpu_count()
#     # Compute in parallel on different threads, since they share the same model
#     with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
#         futures = [
#             executor.submit(lambda ts: wrapped_model.encode(ts), ts)
#                    for ts in sliced(texts, int(ceil(len(texts) / max_workers)))
#         ]
#         # Important: keep order of results!!!
#         return np.vstack([future.result() for future in futures])

In [None]:
# embeddings_model_name = 'BiomedNLP_PubMedBERT'
# text_embedding = lambda t: wrapped_model.encode([t])
# batch_texts_embeddings = parallel_texts_embeddings_wrapper
# embeddings_model = wrapped_model
# embedding_dimension = embeddings.shape[1]

# Prepare PostgreSQL + pgvector for embeddings search

Create a PostgreSQL database with the pgvector extension:
```
docker run --rm --name pubtrends-postgres -p 5432:5432 \
        -m 32G \
        -e POSTGRES_USER=biolabs -e POSTGRES_PASSWORD=mysecretpassword \
        -e POSTGRES_DB=pubtrends \
        -v ~/pgvector/:/var/lib/postgresql/data \
        -e PGDATA=/var/lib/postgresql/data/pgdata \
        -d pgvector/pgvector:pg17
```

In [None]:
semantics_search_host = 'localhost'
semantics_search_port = 5432
semantics_search_database = 'pubtrends'
semantics_search_username = 'biolabs'
semantics_search_password = 'mysecretpassword'

semantics_search_connection_string = f"""
                    host={semantics_search_host} \
                    port={semantics_search_port} \
                    dbname={semantics_search_database} \
                    user={semantics_search_username} \
                    password={semantics_search_password}
                """.strip()

In [None]:
# Embeddings DB initialization
with psycopg2.connect(semantics_search_connection_string) as connection:
    connection.set_session(readonly=False)
    query = f'''
            CREATE EXTENSION IF NOT EXISTS vector;
            create table {embeddings_model_name}(
                pmid    integer,
                chunk   integer,
                embedding vector({embedding_dimension})
            );
            CREATE INDEX pmid_chunk_idx_{embeddings_model_name}
            ON {embeddings_model_name}(pmid, chunk);
            '''
    with connection.cursor() as cursor:
        cursor.execute(query)
    connection.commit()

In [None]:
if not EXACT_SEARCH:
    # Create an index for fast approximate vector similarity search using cosine distance.
    # Results may differ slightly from exact search, but queries are much faster.
    # Note: pgvector recommends building ivfflat indexes after the table has some data.
    with psycopg2.connect(semantics_search_connection_string) as connection:
        connection.set_session(readonly=False)
        query = f'''
                CREATE INDEX embedding_idx_{embeddings_model_name}
                ON {embeddings_model_name}
                USING ivfflat (embedding vector_cosine_ops)
                WITH (lists = 100);
            '''
        with connection.cursor() as cursor:
            cursor.execute(query)
        connection.commit()
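
The index above uses a fixed `lists = 100`. The pgvector documentation suggests starting points that scale with table size: `rows / 1000` lists for up to 1M rows and `sqrt(rows)` above that, with `sqrt(lists)` probes at query time. The helper below is a small sketch of that heuristic; the name `suggest_ivfflat_params` and the 30-million-row figure are assumptions for illustration, not values taken from this notebook.

```python
import math


def suggest_ivfflat_params(n_rows):
    """Return (lists, probes) starting points for a pgvector IVFFlat index."""
    if n_rows <= 1_000_000:
        lists = max(1, n_rows // 1000)
    else:
        lists = int(math.sqrt(n_rows))
    # More probes means better recall and slower queries
    probes = max(1, int(math.sqrt(lists)))
    return lists, probes


# Hypothetical table of 30 million chunk embeddings
lists, probes = suggest_ivfflat_params(30_000_000)
print(f'lists={lists}, probes={probes}')
```

The number of probes is a query-time setting in pgvector (`SET ivfflat.probes = ...;`), so it can be tuned per session without rebuilding the index.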

In [None]:
def collect_ids_without_embeddings(pids):
    # Return the ids from pids that have no stored embeddings yet
    vals = ints_to_vals(pids)
    query = f'''
            SELECT pmid
            FROM {embeddings_model_name} P
            WHERE P.pmid IN (VALUES {vals});
            '''
    with psycopg2.connect(semantics_search_connection_string) as connection:
        connection.set_session(readonly=True)
        with connection.cursor() as cursor:
            cursor.execute(query)
            df = pd.DataFrame(cursor.fetchall(), columns=['pmid'], dtype=object)
            pids_with_embeddings = set(df['pmid'])
            return [pid for pid in pids if pid not in pids_with_embeddings]


# Store embeddings into PostgreSQL

In [None]:
def l2norm(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        norm = np.finfo(v.dtype).eps
    v /= norm
    return v
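
Embeddings are normalized before storing because, for unit-length vectors, the dot product equals cosine similarity, and the cosine distance that pgvector's `<=>` operator returns is one minus that value. The pure-Python sketch below checks the identity on two toy vectors; the helper names are made up for this example.

```python
import math


def unit(v):
    """Return v scaled to unit L2 length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [1.0, 2.0]
cos_ab = cosine(a, b)              # cosine similarity of the raw vectors
dot_unit = dot(unit(a), unit(b))   # plain dot product after normalization
cosine_distance = 1.0 - dot_unit   # the quantity a cosine-distance operator reports
print(cos_ab, dot_unit, cosine_distance)
```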

In [None]:
from psycopg2.extras import execute_values

def store_embeddings_to_postgresql(chunk_embeddings, chunk_idx):
    # Normalize embeddings if using cosine similarity
    data = [(pmid, chunk, l2norm(e).tolist())
            for (pmid, chunk), e in zip(chunk_idx, chunk_embeddings)]
    with psycopg2.connect(semantics_search_connection_string) as connection:
        with connection.cursor() as cursor:
            execute_values(
                cursor,
                f"INSERT INTO {embeddings_model_name} (pmid, chunk, embedding) VALUES %s",
                data
            )
        connection.commit()

In [None]:
from pysrc.papers.analysis.text import parallel_collect_chunks

# Thread safe queue to store chunks
chunks_queue = queue.Queue()

# Thread safe queue to store embeddings
embeddings_queue = queue.Queue()

def empty_embeddings_queue():
    while not chunks_queue.empty():
        chunks_queue.get()
    while not embeddings_queue.empty():
        embeddings_queue.get()

def collect_chunks_work(pids, texts):
    if pids is None or texts is None:
        return
    chunks, chunk_idx = parallel_collect_chunks(pids, texts, MAX_TOKENS, 1)
    chunks_queue.put((chunks, chunk_idx))

def compute_embeddings_work():
    try:
        chunks, chunk_idx = chunks_queue.get_nowait()  # Non-blocking
        chunk_embeddings = batch_texts_embeddings(chunks)
        embeddings_queue.put((chunk_embeddings, chunk_idx))
    except queue.Empty:
        pass

def store_embeddings_work():
    try:
        chunk_embeddings, chunk_idx = embeddings_queue.get_nowait()  # Non-blocking
        store_embeddings_to_postgresql(chunk_embeddings, chunk_idx)
    except queue.Empty:
        pass

def compute_embeddings_and_store_work():
    try:
        chunks, chunk_idx = chunks_queue.get_nowait()  # Non-blocking
        chunk_embeddings = batch_texts_embeddings(chunks)
        store_embeddings_to_postgresql(chunk_embeddings, chunk_idx)
    except queue.Empty:
        pass

def process_cpu_work(pids, texts):
    # Create threads
    threads = [
        threading.Thread(target=collect_chunks_work, args=(pids, texts)),
        threading.Thread(target=compute_embeddings_and_store_work, args=()),
    ]
    # Start the threads
    for t in threads:
        t.start()
    # Wait for both threads to complete
    for t in threads:
        t.join()

def process_gpu_work(pids, texts):
    # Create threads
    threads = [
        threading.Thread(target=collect_chunks_work, args=(pids, texts)),
        threading.Thread(target=compute_embeddings_work, args=()),
        threading.Thread(target=store_embeddings_work, args=())
    ]
    # Start the threads
    for t in threads:
        t.start()
    # Wait for all threads to complete
    for t in threads:
        t.join()

def process_work(pids, texts):
    if device == 'cpu':
        # Parallel: chunking | embeddings + store
        process_cpu_work(pids, texts)
    else:
        # Parallel: chunking | embeddings | store
        process_gpu_work(pids, texts)
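
The workers above poll their queues with `get_nowait()`, so a consumer started together with its producer usually finds the queue empty and picks the batch up on the next call. A common alternative, sketched below and not the approach used in this notebook, keeps one long-lived thread per stage, blocks on `get()`, and shuts the pipeline down with a sentinel. `fake_embed` and the toy batches are hypothetical stand-ins for the embedding and storing steps.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def stage(work, inbox, outbox=None):
    """Apply work() to every item from inbox until the STOP sentinel arrives."""
    while True:
        item = inbox.get()  # blocks until an item is available
        if item is STOP:
            if outbox is not None:
                outbox.put(STOP)  # propagate shutdown downstream
            break
        result = work(item)
        if outbox is not None:
            outbox.put(result)


def fake_embed(chunks):
    # Hypothetical embedding step: one number per chunk
    return [len(c) for c in chunks]


stored = []  # stands in for the database
chunks_q, embeddings_q = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
workers = [
    threading.Thread(target=stage, args=(fake_embed, chunks_q, embeddings_q)),
    threading.Thread(target=stage, args=(stored.append, embeddings_q)),
]
for w in workers:
    w.start()
for batch in (['a', 'bb'], ['ccc'], ['dddd', 'e']):
    chunks_q.put(batch)  # blocks when the queue is full (backpressure)
chunks_q.put(STOP)
for w in workers:
    w.join()
print(stored)
```

Bounded queues keep memory use flat when one stage is slower than the others, and the sentinel removes the need for extra flush calls at the end.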


In [None]:
from more_itertools import sliced

CHUNK_SIZE = 10_000

empty_embeddings_queue()

for year in range(2025, 1969, -1):
    print(f'Processing year {year}')
    df = fetch_year(year)
    pids_to_process = set(collect_ids_without_embeddings(df['id']))
    print(f'To process {len(pids_to_process)}')
    df = df[df['id'].isin(pids_to_process)]

    print('Storing embeddings into DB')
    index_slices = sliced(range(len(df)), CHUNK_SIZE)
    for index_slice in tqdm(list(index_slices)):
        print(f'\rProcessing chunks {index_slice[0]}-{index_slice[-1]}          ', end='')
        chunk_df = df.iloc[index_slice]
        pids_chunk = list(chunk_df['id'])
        texts = [f'{title}. {abstract}' for title, abstract in zip(chunk_df['title'], chunk_df['abstract'])]
        process_work(pids_chunk, texts)
    # Finally, process the work left in the queue
    for _ in range(3):
        process_work(None, None)

    print('Done')

# Semantic search with PostgreSQL

In [None]:
with psycopg2.connect(semantics_search_connection_string) as connection:
    with connection.cursor() as cursor:
        cursor.execute(f'SELECT COUNT(*) FROM {embeddings_model_name}')
        total_rows = cursor.fetchone()[0]
        print(f'Total embeddings: {total_rows}')

In [None]:
def semantic_search_postgresql(query, k):
    query_vector = text_embedding(query)
    # Normalize embeddings if using cosine similarity
    embedding = l2norm(query_vector).tolist()
    with psycopg2.connect(semantics_search_connection_string) as connection:
        with connection.cursor() as cursor:
            cursor.execute(f"""
                   SELECT pmid, chunk, embedding <=> %s::vector AS distance
                   FROM {embeddings_model_name}
                   ORDER BY distance
                   LIMIT %s
                   """, (embedding, k))

            results = cursor.fetchall()
            return pd.DataFrame(data=results, columns=['pmid', 'chunk', 'distance'])

In [None]:
search_pg = semantic_search_postgresql("epigenetic human aging", 1000)
search_pg

In [None]:
pmids_pg = search_pg['pmid']
len(pmids_pg.unique())

# Store embeddings into Faiss from PostgreSQL

In [None]:
import faiss

! mkdir -p ~/faiss_{embeddings_model_name}

FAISS_INDEX_FILE = os.path.expanduser(f'~/faiss_{embeddings_model_name}/embeddings.index')
PIDS_INDEX_FILE = os.path.expanduser(f'~/faiss_{embeddings_model_name}/pids.csv.gz')

def create_faiss():
    if EXACT_SEARCH:
        print('Exact search index')
        index = faiss.IndexFlatIP(embedding_dimension)
    else:
        print('Approximate search index')
        quantizer = faiss.IndexFlatL2(embedding_dimension)
        index = faiss.IndexIVFPQ(quantizer, embedding_dimension, 200, 16, 8)
    return index

def create_or_load_faiss():
    if os.path.exists(FAISS_INDEX_FILE):
        print('Loading Faiss index from existing file')
        index = faiss.read_index(FAISS_INDEX_FILE)
        # For accurate search
        index.nprobe = 200
    else:
        print('Creating empty Faiss index')
        index = create_faiss()
    if os.path.exists(PIDS_INDEX_FILE):
        pids_idx = pd.read_csv(PIDS_INDEX_FILE, compression='gzip')
    else:
        pids_idx = pd.DataFrame(data=[], columns=['pmid', 'chunk', 'year', 'noreview'], dtype=int)
    return index, pids_idx
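
The exact branch uses an inner-product index, while the approximate branch builds `IndexIVFPQ` on an `IndexFlatL2` quantizer, which uses the L2 metric by default. For unit-length vectors both give the same ranking, because the squared L2 distance equals `2 - 2 * inner_product`. The pure-Python sketch below checks this on toy vectors; the helper names and vectors are made up for illustration.

```python
import math


def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inner(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = unit([1.0, 0.2, 0.0])
docs = [unit(v) for v in ([0.9, 0.1, 0.1], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5])]

# Highest inner product first vs. smallest distance first
by_inner = sorted(range(len(docs)), key=lambda i: -inner(query, docs[i]))
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
print(by_inner, by_l2)
```

One practical consequence: scores returned by an L2-metric index are distances, where smaller means closer, so they should not be read as similarities.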

In [None]:
import numpy as np
import ast  # For safely converting string to list

def collect_pids_types_year(year):
    # Collect all PubMed ids published in the given year
    query = f'''SELECT pmid
            FROM PMPublications P
            WHERE year = {year}
            ORDER BY pmid;
            '''
    with psycopg2.connect(connection_string_full_db) as connection:
        connection.set_session(readonly=True)
        with connection.cursor() as cursor:
            cursor.execute(query)
            return [v[0] for v in cursor.fetchall()]

def sample_embeddings(n=10_000):
    with psycopg2.connect(semantics_search_connection_string) as connection:
        with connection.cursor() as cursor:
            print('Sampling training data')
            query = f"""
                    SELECT embedding FROM {embeddings_model_name}
                    TABLESAMPLE SYSTEM (0.1)  -- Approx. 0.1% of table
                    LIMIT {n};
            """
            cursor.execute(query)
            embeddings = [np.array(ast.literal_eval(row[0])) for row in cursor.fetchall()]
            return embeddings

def load_embeddings_by_ids(pids):
    vals = ints_to_vals(pids)
    with psycopg2.connect(semantics_search_connection_string) as connection:
        with connection.cursor() as cursor:
            query = f"""
                    SELECT pmid, chunk, embedding FROM {embeddings_model_name}
                    WHERE pmid IN (VALUES {vals})
                    ORDER BY pmid, chunk;
            """
            cursor.execute(query)
            result = cursor.fetchall()
            index = [(pmid, chunk) for pmid, chunk, _ in result]
            embeddings = [np.array(ast.literal_eval(row[2])) for row in result]
            return index, embeddings


In [None]:
# Thread safe queue to store embeddings
embeddings_pg_queue = queue.Queue()

def empty_to_fs_queue():
    # Empty the queue safely
    while not embeddings_pg_queue.empty():
        embeddings_pg_queue.get()

def load_embeddings_pg_work(pids):
    if len(pids) == 0:
        return
    index, embeddings = load_embeddings_by_ids(pids)
    embeddings_pg_queue.put((index, embeddings))

def store_embeddings_fs_work():
    global faiss_index
    global pids_idx
    try:
        index, embeddings = embeddings_pg_queue.get_nowait()  # Non-blocking
        embeddings = np.array(embeddings).astype('float32')
        if (len(embeddings.shape) == 1 or
                embeddings.shape[1] != embedding_dimension or
                len(index) != embeddings.shape[0]):
            print(f'Problematic chunk embeddings, {embeddings.shape}')
            return
        t = pd.DataFrame(data=index, columns=['pmid', 'chunk'])
        pids_idx = pd.concat([pids_idx, t], ignore_index=True).reset_index(drop=True)
        faiss_index.add(embeddings)
    except queue.Empty:
        pass

def process_store_embeddings_work(pids):
    # Create threads
    threads = [
        threading.Thread(target=load_embeddings_pg_work, args=(pids,)),
        threading.Thread(target=store_embeddings_fs_work, args=())
    ]
    # Start the threads
    for t in threads:
        t.start()
    # Wait for both threads to complete
    for t in threads:
        t.join()

In [None]:
faiss_index, pids_idx = create_or_load_faiss()

if not EXACT_SEARCH and len(pids_idx) == 0:
    embeddings = sample_embeddings()
    embeddings = np.array(embeddings).astype('float32')
    print(f'Training index on {embeddings.shape[0]} embeddings')
    faiss_index.train(embeddings)

assert len(pids_idx) == faiss_index.ntotal

In [None]:
CHUNK_SIZE = 1000

empty_to_fs_queue()

for year in range(2025, 2015, -1):
    print(f'Processing year {year}')
    pids_year = collect_pids_types_year(year)
    pids_year = list(set(pids_year) - set(pids_idx['pmid']))
    print(f'To process {len(pids_year)}')
    if len(pids_year) == 0:
        continue

    for i, chunk in tqdm(list(enumerate(sliced(pids_year, CHUNK_SIZE)))):
        chunk_offset = i * CHUNK_SIZE
        print(f'\rProcessing embeddings {chunk_offset}-{min(len(pids_year), chunk_offset + CHUNK_SIZE)}      ', end='')
        process_store_embeddings_work(chunk)

    # Finally, process the work left in the queue
    for _ in range(3):
        process_store_embeddings_work([])

    print('Storing FAISS index')
    faiss.write_index(faiss_index, FAISS_INDEX_FILE)
    print('Storing Ids index')
    pids_idx.to_csv(PIDS_INDEX_FILE, index=False, compression='gzip')

print('Done')

In [None]:
print(len(pids_idx))
print(faiss_index.ntotal)

# Semantic search with Faiss

In [None]:
faiss_index, pids_idx = create_or_load_faiss()
print(len(pids_idx))

In [None]:
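def semantic_search_faiss(query_text, k):
    # Faiss expects a contiguous float32 array of shape (1, dim)
    query_vector = np.ascontiguousarray(
        text_embedding(query_text).reshape(1, -1), dtype='float32')
    # Normalize embeddings if using cosine similarity
    faiss.normalize_L2(query_vector)
    scores, indices = faiss_index.search(query_vector, k)
    # Faiss pads missing results with index -1; drop them before the lookup
    found = indices[0] >= 0
    t = pids_idx.iloc[indices[0][found]].copy().reset_index(drop=True)
    # Inner product for the exact index; squared L2 distance for the approximate one
    t['similarity'] = scores[0][found]
    return t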

In [None]:
search_fs = semantic_search_faiss("epigenetic human aging", 10_000)
search_fs

In [None]:
# Beyond some rank, the remaining vectors are all similarly far from the query vector,
# which limits the number of useful results
pmids_fs = search_fs['pmid']
len(pmids_fs.unique())
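
Search hits are chunk-level, so the same paper can appear several times and the number of unique PMIDs is smaller than `k`. One way to get a paper-level ranking is to keep only the best-scoring chunk per PMID. The sketch below does this in pure Python on made-up hits; `best_hit_per_paper` and its `higher_is_better` flag are hypothetical names, with the flag covering both similarity scores and distances.

```python
def best_hit_per_paper(hits, higher_is_better=True):
    """Collapse (pmid, chunk, score) hits to one best-scoring hit per pmid."""
    best = {}
    for pmid, chunk, score in hits:
        current = best.get(pmid)
        is_better = (current is None
                     or (score > current[1] if higher_is_better else score < current[1]))
        if is_better:
            best[pmid] = (chunk, score)
    # Best paper first
    ranked = sorted(best.items(), key=lambda kv: kv[1][1], reverse=higher_is_better)
    return [(pmid, chunk, score) for pmid, (chunk, score) in ranked]


# Toy hits with made-up PMIDs and scores
hits = [(101, 0, 0.82), (202, 1, 0.79), (101, 2, 0.91), (303, 0, 0.40)]
papers = best_hit_per_paper(hits)
print(papers)
```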

In [None]:
load_publications(pmids_fs.unique())['title']

# Comparison PostgreSQL vs Faiss semantic search

In [None]:
# print(f'Postgresql {len(pmids_pg.unique())}')
# print(f'Faiss {len(pmids_fs.unique())}')
# overlap = set(list(pmids_pg)) & set(list(pmids_fs))
# print(f'Overlap {len(overlap)}')

# Apply additional semantic filtering on search results

In [None]:
search = semantic_search_faiss(
    "epigenetic changes in stem cell differentiation in human",
    1000
)
search_ids = search['pmid']
print(len(search_ids.unique()))
search

In [None]:
publications = load_publications(search_ids)
search_ids = publications['id']
publications.head(5)

In [None]:
def collect_chunks_embeddings(df):
    print('\rCollecting chunks           ', end='')
    pids = list(df['id'])
    texts = [f'{title}. {abstract}' for title, abstract in zip(df['title'], df['abstract'])]
    chunks, chunk_idx = parallel_collect_chunks(pids, texts, MAX_TOKENS)
    print(f'\rComputing {len(chunks)} embeddings   ', end='')
    chunk_embeddings = batch_texts_embeddings(chunks)
    return chunk_embeddings, chunk_idx

In [None]:
print('Compute documents embeddings')
embeddings, chunk_idx = collect_chunks_embeddings(publications)
embeddings = [l2norm(e) for e in embeddings]

In [None]:
print('Compute filters embeddings')

positive_filters = ['homo sapiens', 'human', 'mammal', 'human cell']
negative_filters = ['cancer', 'tumor', 'tumor genesis', 'adenoma', 'carcinoma', 'mouse']

print('Computing filters embeddings')
positive_filters_embeddings = [l2norm(e) for e in batch_texts_embeddings(positive_filters)]
negative_filters_embeddings = [l2norm(e) for e in batch_texts_embeddings(negative_filters)]

# Negative score: similarity to the closest negative filter
negative_filters_scores = [(pmid, max([np.dot(e, ne) for ne in negative_filters_embeddings]))
                           for (pmid,_), e in zip(chunk_idx, embeddings)]
# Positive score: similarity to the farthest positive filter
positive_filters_scores = [(pmid, min([np.dot(e, pe) for pe in positive_filters_embeddings]))
                           for (pmid,_), e in zip(chunk_idx, embeddings)]

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 4))
axes = [plt.subplot(1, 3, i+1) for i in range(3)]
ax = axes[0]
ns = [s for _, s in negative_filters_scores]
sns.histplot(ns, kde=True, ax=ax)
ax.set_title('Negative filters')

ax = axes[1]
ps = [s for _, s in positive_filters_scores]
sns.histplot(ps, kde=True, ax=ax)
ax.set_title('Positive filters')

ax = axes[2]
sns.scatterplot(x=ns, y=ps, ax=ax)
sns.rugplot(x=ns, y=ps, height=.1, alpha=0.01, ax=ax)
ax.set_xlabel('Negative filters')
ax.set_ylabel('Positive filters')
ax.set_title('Positive filters vs negative filters')

plt.show()

In [None]:
max_negative_filter_score = 0.1
min_positive_filter_score = 0.05

filtered_ids = [
    pmid for (pmid, ps), (_, ns) in zip(positive_filters_scores, negative_filters_scores)
    if ps > min_positive_filter_score and ns < max_negative_filter_score
]

filtered_publications = load_publications(filtered_ids)
filtered_publications['title']
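
The filtering step scores every chunk against the filter phrases: the maximum similarity over the negative filters (being close to any one of them disqualifies the chunk) and the minimum similarity over the positive filters (the chunk must be close to all of them). The sketch below reproduces that rule on toy 2D unit vectors in pure Python; the vectors, thresholds, and the name `passes_filters` are made up for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def passes_filters(embedding, positive, negative, min_positive, max_negative):
    """Keep a chunk only if it is close to every positive filter
    and not close to any negative filter."""
    positive_score = min(dot(embedding, p) for p in positive)
    negative_score = max(dot(embedding, n) for n in negative)
    return positive_score > min_positive and negative_score < max_negative


# Toy unit vectors: axis 0 stands for "human", axis 1 for "cancer"
positive = [[1.0, 0.0]]
negative = [[0.0, 1.0]]
toy_chunks = {
    'human aging': [1.0, 0.0],
    'human cancer': [0.6, 0.8],
    'mouse cancer': [0.0, 1.0],
}
kept = [name for name, e in toy_chunks.items()
        if passes_filters(e, positive, negative, min_positive=0.5, max_negative=0.5)]
print(kept)
```

In practice the thresholds depend on the embedding model, which is why the score distributions are plotted before the cut-offs are chosen.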

# Visualization of semantic search results

Launch the fasttext endpoint API so that the analyzer can use it:
  ```
  conda activate pubtrends
  export PYTHONPATH=$PYTHONPATH:$(pwd)
  python pysrc/fasttext/fasttext_app.py
  ```

In [None]:
from pysrc.papers.db.pm_postgres_loader import PubmedPostgresLoader
from pysrc.papers.analyzer import PapersAnalyzer

loader = PubmedPostgresLoader(config)
analyzer = PapersAnalyzer(loader, config)

In [None]:
try:
    analyzer.analyze_papers(filtered_ids, 5)
finally:
    loader.close_connection()
    analyzer.teardown()

In [None]:
from bokeh.plotting import show
from pysrc.papers.plot.plotter import Plotter

analyzer.search_ids = filtered_ids
plotter = Plotter(config, analyzer)


In [None]:
# show(plotter.plot_top_cited_papers())

In [None]:
show(plotter.plot_papers_graph())

In [None]:
show(plotter.topics_hierarchy_with_keywords())