## Converting and Storing Embeddings Locally

In [1]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [2]:
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

embedding_dim = len(embeddings.embed_query("hello world"))
index = faiss.IndexFlatL2(embedding_dim)

vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

In [3]:
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
import json
from pathlib import Path

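# Load every pre-processed JSON file and wrap it in a LangChain Document.
path_to_doc = Path('../processed_text/')
docs = []
for json_path in path_to_doc.glob('*.json'):
    # utf-8-sig tolerates a leading byte-order mark in the files
    with open(json_path, encoding='utf-8-sig') as file:
        document = json.load(file)
    docs.append(Document(
        page_content=document['processed_text'],
        metadata=document['metadata'],
    ))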

In [4]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200,separators=["\n\n", "\n", ".", "!", "?", " ", ""])
all_splits = text_splitter.split_documents(docs)
_ = vector_store.add_documents(documents=all_splits)

In [5]:
vector_store.save_local(Path("../faiss_index_store"))

## Testing Embeddings

In [6]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

# Use the same embedding model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Load the saved index
vector_store = FAISS.load_local(
    folder_path="../faiss_index_store",
    embeddings=embeddings,
    allow_dangerous_deserialization=True
)


In [7]:
doc = vector_store.docstore.search(vector_store.index_to_docstore_id[20])
print(doc.page_content[:500])


. These studies suggested that, in theory, any eukaryotic cell can be studied using scRNA-seq. Consistent with this, a consortium of biomedical researchers has recently committed to employ scRNA-seq for creating a transcriptomic atlas of every cell type in the human body—the Human Cell Atlas [51]. This will provide a highly valuable reference for future basic research and translational studies. Although there is great confidence in the general utility of scRNA-seq, one technical barrier must be 


In [9]:
query = "applications of single-cell RNA sequencing in cancer research"

results = vector_store.similarity_search(query, k=3)  # get top 3 matches

for i, res in enumerate(results):
    print(f"\n Result {i+1}:")
    print(res.page_content)  # print the full chunk


 Result 1:
. Oncogenesis. 2021;10(10):66. CAS 33. Haque A, Engel J, Teichmann SA, Lonnberg T. A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Med. 2017;9(1):75. 34. Lafzi A, Moutinho C, Picelli S, Heyn H. Tutorial: guidelines for the experimental design of single-cell RNA sequencing studies. Nat Protoc. 2018;13(12):2742–57. CAS 35. Kinker GS, Greenwald AC, Tal R, Orlova Z, Cuoco MS, Mcfarland JM, et al. Pan-cancer single-cell RNA-seq identifies recurring programs of cellular heterogeneity. Nat Genet. 2020;52(11):1208–18. CAS 36. Suva ML, Tirosh I. Single-cell RNA sequencing in cancer: lessons learned and emerging challenges. Mol Cell. 2019;75(1):7–12. CAS 37. Ramachandran P, Matchett KP, Dobie R, Wilson-Kanamori JR, Henderson NC. Single-cell technologies in hepatology: new insights into liver biology and disease pathogenesis. Nat Rev Gastroenterol Hepatol. 2020;17(8):457–72. 38

 Result 2:
. Single-cell profiling of tumor heter

## Problems

- OpenAI token limit exceeded
- Any API-based model comes with usage limits
- Tried different HuggingFace models
    - SPECTER2
    - BioBERT
    - mpnet
    - jinaai
- The cosine similarity for all the models is between 0.3 and 0.7
- Might be due to LaTeX formatting in the processed text.
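
One way to test the LaTeX hypothesis is to strip the markup from `processed_text` before splitting and embedding, then compare the scores. The helper below is only a rough sketch: `strip_latex` is a hypothetical name, the regular expressions cover common cases rather than full LaTeX, and it assumes that `$...$` in the text is always math (a literal dollar amount such as "$5 and $10" would be removed too).

```python
import re


def strip_latex(text: str) -> str:
    """Remove common LaTeX markup, keeping readable text where possible.

    Hypothetical helper: a heuristic, not a LaTeX parser.
    """
    # Drop display and inline math: $$...$$, \[...\], \(...\), $...$
    text = re.sub(r"\$\$.*?\$\$", " ", text, flags=re.DOTALL)
    text = re.sub(r"\\\[.*?\\\]", " ", text, flags=re.DOTALL)
    text = re.sub(r"\\\(.*?\\\)", " ", text, flags=re.DOTALL)
    text = re.sub(r"\$[^$\n]*\$", " ", text)
    # Drop environment markers such as \begin{itemize} and \end{itemize}
    text = re.sub(r"\\(begin|end)\{[^{}]*\}", " ", text)
    # Unwrap commands with one braced argument: \textbf{word} -> word
    # (this also turns \cite{key} into "key"; adjust if that is unwanted)
    text = re.sub(r"\\[a-zA-Z]+\*?\{([^{}]*)\}", r"\1", text)
    # Drop remaining bare commands such as \item or \newline
    text = re.sub(r"\\[a-zA-Z]+\*?", " ", text)
    # Remove leftover braces and collapse whitespace
    text = re.sub(r"[{}]", "", text)
    return re.sub(r"\s+", " ", text).strip()


sample = r"The \textbf{scRNA-seq} method with $x^2$ cells."
print(strip_latex(sample))
```

If this helps, it could be applied to `document['processed_text']` in the loading cell before the `Document` is created.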

### Apr 8th

- Cosine similarity is still giving low values
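
One thing worth checking, as an assumption rather than a confirmed cause: the index built earlier in this notebook is `faiss.IndexFlatL2`, which reports squared Euclidean distances, not cosine similarities. For unit-length vectors the two are related by `||a - b||^2 = 2 - 2*cos(a, b)`, so a distance can be converted back to a cosine value only if the embeddings are normalized. The sketch below illustrates that identity in plain Python; all function names are hypothetical and the vectors are toy values, not real embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of any length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_squared_l2(d2):
    """Valid only for unit-length vectors: cos = 1 - d2 / 2."""
    return 1.0 - d2 / 2.0


# Toy vectors standing in for a query and a chunk embedding
a = [1.0, 2.0, 3.0]
b = [2.0, 0.0, 1.0]

direct = cosine_similarity(a, b)
via_l2 = cosine_from_squared_l2(squared_l2(normalize(a), normalize(b)))
print(direct, via_l2)  # the two values agree up to floating-point error
```

If the embeddings are not normalized, an L2 score cannot be read as a cosine value at all. An inner-product index (`faiss.IndexFlatIP`) over normalized vectors would report cosine similarity directly.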