<h1>Installs</h1>

This notebook contains the code for preparing, embedding, and uploading the abstracts.

In [1]:
! pip install pinecone-client
! pip install faunadb
! pip install ndg-httpsclient
! pip install pyopenssl
! pip install pyasn1
! pip install -U sentence-transformers
! pip install transformers




<h1>Extract Data from .csv</h1>
*   AB = Column that corresponds to the text of the abstract


In [None]:
import csv
from itertools import islice
import uuid

# Load data, store abstract text
batch_input = []
with open('data/INLPT_class/articles.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        batch_input.append(row["AB"])

In [None]:
print(len(batch_input))

57560


<h3>PubMedBert is a BERT sentence embedding model fine-tuned on PubMed data. We embed per abstract</h3>
*   https://huggingface.co/pritamdeka/S-PubMedBert-MS-MARCO

<h3>E5-large-V2 is a general-purpose embedding model for tasks that need a single-vector representation of text, such as retrieval, clustering, and classification</h3>
*   https://huggingface.co/intfloat/e5-large-v2
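
The E5 model card asks for every input text to start with either `query: ` or `passage: `. A minimal sketch of a prefix helper is given here; the name `with_e5_prefix` is hypothetical and not part of this notebook, which prepends `"query: "` to each chunk directly.

```python
# Hypothetical helper, not part of the notebook's pipeline.
E5_PREFIXES = ("query", "passage")

def with_e5_prefix(text, kind="query"):
    """Return text with an E5-style prefix, leaving an existing prefix untouched."""
    if kind not in E5_PREFIXES:
        raise ValueError(f"kind must be one of {E5_PREFIXES}")
    for prefix in E5_PREFIXES:
        if text.startswith(prefix + ": "):
            return text
    return f"{kind}: {text.strip()}"

print(with_e5_prefix("CASK disorder phenotype"))
```

Using the same prefix convention for the indexed chunks and for search queries keeps both sides of the similarity comparison in the format the model was trained on.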




In [None]:
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer
import re

#embed data
# model = SentenceTransformer('pritamdeka/S-PubMedBert-MS-MARCO')
# tokenizer = AutoTokenizer.from_pretrained('pritamdeka/S-PubMedBert-MS-MARCO')
# max_token_size = 350
# print(model)
# print(tokenizer)

model = SentenceTransformer('intfloat/e5-large-v2')
tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-large-v2')
max_token_size = 512
print(model)
print(tokenizer)


SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
BertTokenizerFast(name_or_path='intfloat/e5-large-v2', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)

<h3>Chunk data</h3>

- Divide each abstract into chunks of at most 512 tokens each
- Create overlaps between neighbouring chunks
- Add local context as metadata to each chunk
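
The steps above can be sketched on a toy scale. This is only an illustration under a loud assumption: every word counts as one token (the `count_tokens` default), whereas the notebook counts real tokenizer tokens. The names `pack_words` and `boundary_overlaps` are hypothetical.

```python
def pack_words(text, budget, count_tokens=lambda word: 1):
    """Greedily pack whole words into chunks of at most `budget` tokens."""
    chunks, current, used = [], [], 0
    for word in text.split():
        cost = count_tokens(word)
        # Close the current chunk when this word would exceed the budget
        if current and used + cost > budget:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(word)
        used += cost
    if current:
        chunks.append(" ".join(current))
    return chunks

def boundary_overlaps(chunks, half_budget):
    """For each pair of neighbours, join the tail of one with the head of the next."""
    overlaps = []
    for left, right in zip(chunks, chunks[1:]):
        tail = left.split()[-half_budget:]
        head = right.split()[:half_budget]
        overlaps.append(" ".join(tail + head))
    return overlaps

demo_chunks = pack_words("a b c d e f g h i j", budget=4)
print(demo_chunks)
print(boundary_overlaps(demo_chunks, half_budget=2))
```

The overlap chunks cover text that straddles a chunk boundary, so a passage split across two chunks can still be retrieved as one unit.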

In [None]:
#Helper functions

def get_front_context(s, m, overlap):
  # Words of s that precede the overlap (the overlap covers the tail of s).
  # Slicing by length avoids the empty result that [:-0] gives for an empty overlap.
  words = s.split()
  words = words[:len(words) - len(overlap.split())]
  # Keep the last m+1 words, the same bound as the original loop (count > m)
  return " ".join(words[-(m + 1):])

def get_back_context(s, m, overlap):
  # Words of s that follow the overlap (the overlap covers the head of s)
  words = s.split()[len(overlap.split()):]
  # Keep the first m+1 words, the same bound as the original loop (count > m)
  return " ".join(words[:m + 1])

def combine_strings(string_1, string_2, n):
    # Build an overlap chunk from the tail of string_1 and the head of string_2,
    # each limited to n tokens.
    front = []
    back = []

    count = 0
    for word in reversed(string_1.split()):
        count += len(tokenizer(word, add_special_tokens=False)["input_ids"])
        if count > n:
            break
        front.insert(0, word)

    # Reset the budget so the second half always gets its own n tokens
    count = 0
    for word in string_2.split():
        count += len(tokenizer(word, add_special_tokens=False)["input_ids"])
        if count > n:
            break
        back.append(word)

    front = " ".join(front)
    back = " ".join(back)
    combined_string = front + " " + back
    return (combined_string, front, back)

def refine_sentences_fixed(s, sentences):
    # Greedily pack whole words into chunks of at most s tokens each
    all_sentences = []
    current = []
    count = 0
    for word in sentences.split():
        n_tokens = len(tokenizer(word, add_special_tokens=False)["input_ids"])
        # Start a new chunk when adding this word would exceed the budget
        if current and count + n_tokens > s:
            all_sentences.append(" ".join(current))
            current = []
            count = 0
        current.append(word)
        count += n_tokens
    if current:
        all_sentences.append(" ".join(current))

    return all_sentences

def chop_text_by_words(text, n):
    # Number of context words stored on each side of a chunk
    m = int(n*0.2)
    token_size = max_token_size
    dataForm = []
    # Split the text into chunks of at most token_size tokens.
    # Note: the "query: " prefix and the [CLS]/[SEP] tokens come on top of this budget,
    # so a completely full chunk can be truncated slightly by the model.
    pieces = refine_sentences_fixed(token_size, text)
    # Each entry: [chunk text, front-context dict, back-context dict, id, full abstract]
    for i, chunk in enumerate(pieces):
        is_last = i == len(pieces) - 1
        front = " ".join(pieces[i-1].split()[-m:]) if i > 0 else ""
        back = " ".join(pieces[i+1].split()[:m]) if not is_last else ""
        dataForm.append(["query: " + chunk, {"front_context": front}, {"back_context": back}, str(uuid.uuid4()), text])
        if not is_last:
            # Add an overlap chunk spanning the boundary between this chunk and the next
            overlap = combine_strings(pieces[i], pieces[i+1], token_size // 2)
            dataForm.append(["query: " + overlap[0],
                             {"front_context": get_front_context(pieces[i], m, overlap[1])},
                             {"back_context": get_back_context(pieces[i+1], m, overlap[2])},
                             str(uuid.uuid4()), text])

    return dataForm

def remove_unicode_escape(s):
    return re.sub(r'\\u....', '', s)

In [None]:
from tqdm.notebook import tqdm
chunks = []
for abstract in tqdm(batch_input):
  chunks.append(chop_text_by_words(abstract, max_token_size))

  0%|          | 0/57560 [00:00<?, ?it/s]

<h3>Embed data</h3>

In [None]:
import numpy as np

embedding_list = []

# Create one embedding per chunk (or one per abstract if the abstract was not chunked)
all_abstract_chunks = [inner_list[0] for middle_list in chunks for inner_list in middle_list]
for chunk in tqdm(all_abstract_chunks):
    embedding_list.append(model.encode(chunk))

# Alternative: one embedding per abstract. The chunk embeddings of an abstract are merged by mean pooling, which makes no use of the local context.
# for batch in tqdm(chunks):
#   if(len(batch) == 1):
#       embedding_list.append(model.encode(batch[0][0]))
#   elif len(batch) > 1:
#       flattened_list = [level_2 for level_1 in batch for level_2 in level_1[0:1] if level_1]
#       multi_list = [model.encode(chunk) for chunk in flattened_list]
#       embedding_list.append(np.mean(multi_list, axis=0))


  0%|          | 0/67080 [00:00<?, ?it/s]

-   batch_input: list of abstracts
-   chunks: one list per abstract, holding one entry per chunk; each entry is [chunk text, front context, back context, id, text of the whole abstract]
-   embedding_list: list of embeddings, one per chunk, in the same order as the flattened chunks

Next, build the list of (id, embedding) pairs to upload to the Pinecone vector DB. Each embedding is uploaded together with an id that maps it back to its original text in the document DB (FaunaDB).
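
The pairing relies on the embeddings being in the same order as the flattened chunks. A small sketch with made-up data shows the idea and adds a length check; `build_vectors` and the demo values are hypothetical, and plain lists stand in for the NumPy arrays that `model.encode` returns.

```python
def build_vectors(chunk_groups, embeddings):
    """Pair each chunk id (entry[3]) with its embedding, in flattening order."""
    ids = [entry[3] for group in chunk_groups for entry in group]
    if len(ids) != len(embeddings):
        raise ValueError(f"{len(ids)} ids but {len(embeddings)} embeddings")
    # list() is enough for plain lists; NumPy arrays are better converted with .tolist()
    return [(chunk_id, list(vector)) for chunk_id, vector in zip(ids, embeddings)]

demo_chunks = [
    [["query: first", {"front_context": ""}, {"back_context": ""}, "id-1", "first"]],
    [["query: a", {"front_context": ""}, {"back_context": "b"}, "id-2", "a b"],
     ["query: b", {"front_context": "a"}, {"back_context": ""}, "id-3", "a b"]],
]
demo_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
demo_vectors = build_vectors(demo_chunks, demo_embeddings)
print(demo_vectors)
```

The length check catches a mismatch early, for example when the embedding cell was interrupted before all chunks were encoded.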

<h3>Prepare data, initialize pinecone manager and upload</h3>

In [None]:
import uuid

pinecone_vectors = []
count = 0
for batch in tqdm(chunks):
    for data in batch:
        pinecone_vectors.append((data[3], embedding_list[count].tolist()))
        count += 1
        
#createvec = [{pinecone_vectors.append((data[3], embedding_list[count])),}  for batch in tqdm(chunks) for data in batch]

  0%|          | 0/57560 [00:00<?, ?it/s]

In [None]:
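import os
import pinecone

#Init pinecone index
# The API key is read from the environment instead of being hard-coded;
# the variable name PINECONE_API_KEY is an assumption.
pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="gcp-starter")
index = pinecone.Index("inlp-med-ws2324")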

In [None]:
# Split a list into consecutive batches of at most n items, used for bulk upload
def bulk_upload(iterable, n):
  return [iterable[x:x+n] for x in range(0, len(iterable), n)]


In [None]:
#upsert data to pinecone
for ids_vectors_chunk in tqdm(bulk_upload(pinecone_vectors,500)):
  index.upsert(vectors=ids_vectors_chunk)

  0%|          | 0/135 [00:00<?, ?it/s]
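
The 135 iterations in the progress bar follow from the batch size: 67080 vectors in batches of 500. A quick check of that arithmetic, with a hypothetical helper `batch_count`:

```python
import math

def batch_count(n_items, batch_size):
    """Number of batches needed when the last batch may be partially filled."""
    return math.ceil(n_items / batch_size)

print(batch_count(67080, 500))
```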

<h3>Prepare Data, initialize FaunaDB client and upload</h3>

*   https://v4.dashboard.fauna.com/db/eu/medicalData




In [None]:
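#Initialize FaunaDB
import os
from faunadb import query as q
from faunadb.objects import Ref
from faunadb.client import FaunaClient

# The secret is read from the environment instead of being hard-coded;
# the variable name FAUNA_SECRET is an assumption.
client = FaunaClient(
  secret=os.environ["FAUNA_SECRET"],
)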

In [None]:
chunk_test = chunks
upload = []
for chunk in chunk_test:
    for data in chunk:
        upload.append(data)
            
print(len(upload))
chunkeddata = bulk_upload(upload, 5000)

67080


In [None]:
#Upload data into FaunaDB
def upload_data_to_fauna(data):
    # Create one document per chunk in the "metadata" collection
    for row in data:
        client.query(q.create(q.collection("metadata"), {"data": {
            "chunk": row[0],
            "front_context": row[1],
            "back_context": row[2],
            "id": row[3],
            "abstract": row[4],
        }}))

In [None]:
for batch in chunkeddata:
  upload_data_to_fauna(batch)

How to query from Pinecone

In [None]:
def return_document(query):
  # Embed the query with the same model that embedded the chunks
  embedded_vector = model.encode(query).tolist()

  query_response = index.query(
      vector=embedded_vector,
      top_k=3,
      )
  return query_response

return_document("CASK disorder phenotype")

{'matches': [{'id': '65639466-3a21-4e2d-8c91-6fc106523e02',
              'score': 0.850698709,
              'values': []},
             {'id': 'ef02e706-a21f-44c7-94f4-d79f1969b22f',
              'score': 0.844490886,
              'values': []},
             {'id': '7e9e6077-1f25-4b27-abb4-b477a17144b2',
              'score': 0.835636497,
              'values': []}],
 'namespace': ''}
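
The ids in the response are the keys for looking up the chunk text in FaunaDB. A minimal sketch of pulling them out, using a plain dict copied from the output above; `matched_ids` and the `min_score` cut-off are hypothetical additions.

```python
def matched_ids(query_response, min_score=0.0):
    """Return the ids of all matches whose score is at least min_score."""
    return [m["id"] for m in query_response["matches"] if m["score"] >= min_score]

# Plain-dict copy of the response shown above
demo_response = {
    "matches": [
        {"id": "65639466-3a21-4e2d-8c91-6fc106523e02", "score": 0.850698709, "values": []},
        {"id": "ef02e706-a21f-44c7-94f4-d79f1969b22f", "score": 0.844490886, "values": []},
        {"id": "7e9e6077-1f25-4b27-abb4-b477a17144b2", "score": 0.835636497, "values": []},
    ],
    "namespace": "",
}

print(matched_ids(demo_response))
print(matched_ids(demo_response, min_score=0.84))
```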

How to query FaunaDB: first create an index on the id field, then query it
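
The index creation itself is not shown in this notebook. A rough sketch of what the index definition could look like is given here, assuming an index named "metadata" (the name used in the query cell) over the "metadata" collection, with the `data.id` field as the search term. The returned `values` are an assumption, as is the helper name `metadata_index_params`.

```python
def metadata_index_params(collection_ref):
    """Parameters for an index that finds documents by their data.id field."""
    return {
        "name": "metadata",
        "source": collection_ref,
        # Search term: the uuid stored with every chunk
        "terms": [{"field": ["data", "id"]}],
        # Assumption: return the chunk text and the full abstract for each hit
        "values": [
            {"field": ["data", "chunk"]},
            {"field": ["data", "abstract"]},
        ],
    }

# With the notebook's client and query module, the index would be created once:
# client.query(q.create_index(metadata_index_params(q.collection("metadata"))))
print(metadata_index_params("<collection ref>")["terms"])
```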

In [None]:
result = client.query(
  q.paginate(q.match(q.index("metadata"), "65639466-3a21-4e2d-8c91-6fc106523e02"))
)
print(result)

{'data': [['query: CLINICAL CHARACTERISTICS: CASK disorders include a spectrum of phenotypes in both females and males. Two main types of clinical presentation are seen: Microcephalywith pontine and cerebellar hypoplasia (MICPCH), generally associated withpathogenic loss-of-function variants in CASK. X-linked intellectual disability(XLID) with or without nystagmus, generally associated with hypomorphic CASKpathogenic variants. MICPCH is typically seen in females with moderate-to-severeintellectual disability, progressive microcephaly with or without ophthalmologicanomalies, and sensorineural hearing loss. Most are able to sit independently;20%-25% attain the ability to walk; language is nearly absent in most. Neurologicfeatures may include axial hypotonia, hypertonia/spasticity of the extremities,and dystonia or other movement disorders. Nearly 40% have seizures by age tenyears. Behaviors may include sleep disturbances, hand stereotypies, and selfbiting. MICPCH in males may occur with 

ToDo: Implement Pinecone and FaunaDB API into backend
