## Converting and Storing Embeddings Locally

In [2]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")


In [4]:
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

embedding_dim = len(embeddings.embed_query("hello world"))
index = faiss.IndexFlatL2(embedding_dim)

vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
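
Note that `faiss.IndexFlatL2` ranks by squared Euclidean distance, not cosine similarity. For unit-length vectors the two are related by d² = 2 − 2·cos, so the ranking is the same and one can be converted to the other. The following is a minimal pure-Python sketch of that relationship, assuming the embeddings are L2-normalized; the helper names are illustrative and not part of FAISS or LangChain.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_l2(a, b):
    """Squared Euclidean distance, the metric a flat L2 index ranks by."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_from_squared_l2(d2):
    """Convert a squared L2 distance to cosine similarity.

    Assumption: both vectors have unit length; otherwise this is wrong.
    """
    return 1.0 - d2 / 2.0

# Toy vectors standing in for real embeddings
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])
d2 = squared_l2(a, b)

print(cosine_similarity(a, b), cosine_from_squared_l2(d2))
```

If the embeddings are not normalized, the conversion no longer holds and the scores returned by the index should be read as distances (lower is closer).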

In [10]:
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
import json
from pathlib import Path

# Load every processed JSON file into a LangChain Document
path_to_doc = Path('../processed_text/')
docs = []
for json_path in path_to_doc.glob('*.json'):
    # utf-8-sig strips a leading byte-order mark if one is present
    with open(json_path, encoding='utf-8-sig') as file:
        document = json.load(file)
    docs.append(Document(
        page_content=document['processed_text'],
        metadata=document['metadata'],
    ))

In [12]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)
_ = vector_store.add_documents(documents=all_splits)
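
To make the effect of `chunk_size=1000` and `chunk_overlap=200` concrete, here is a simplified sketch of overlapping windows over a string. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); `sliding_chunks` is a hypothetical helper that only illustrates the size and overlap arithmetic.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

# 2500 characters of dummy text
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(sample, chunk_size=1000, chunk_overlap=200)

print([len(c) for c in chunks])  # [1000, 1000, 900]
```

Each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. That overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.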

In [13]:
# Persist the index and docstore to disk (same folder that is loaded below)
vector_store.save_local("../faiss_index_store")

## Testing Embeddings

In [14]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

# Use the same embedding model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

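# Load the saved index. The docstore is stored as a pickle, so LangChain
# requires an explicit opt-in; only enable it for index files you created yourself.
vector_store = FAISS.load_local(
    folder_path="../faiss_index_store",
    embeddings=embeddings,
    allow_dangerous_deserialization=True,
)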


In [15]:
doc = vector_store.docstore.search(vector_store.index_to_docstore_id[20])
print(doc.page_content[:500])


and their association with other minerals was obtained using STEM EDS spectral X-ray mapping. TEM and STEM images of two examples of these U-bearing grains are shown in Figure S1. ### Treatment and Preparation of Single Cells from 3D Colonoids for Single-Cell RNA Sequencing Colonoids used for sequencing are described in Table S1. UBD was diluted in organoid expansion media, then colonoids in Matrigel were treated overnight (18 h) with UBD, with control colonoids receiving the same volume/volume 


In [19]:
query = "methylation"

results = vector_store.similarity_search(query, k=3)  # get top 3 matches

for i, res in enumerate(results):
    print(f"\n Result {i+1}:")
    print(res.page_content)  # show the full text of the chunk


 Result 1:
in the human brain Article Open access 01 April 2021 ### Fundamentals of DNA methylation in development Article 10 December 2024 ## INTRODUCTION Genetics is the study of heritable changes in gene activity or function due to the direct alteration of the DNA sequence. Such alterations include point mutations, deletions, insertions, and translocation. In contrast, epigenetics is the study of heritable changes in gene activity or function that is not associated with any change of the DNA sequence itself. Although virtually all cells in an organism contain the same genetic information, not all genes are expressed simultaneously by all cell types. In a broader sense, epigenetic mechanisms mediate the diversified gene expression profiles in a variety of cells and tissues in multicellular organisms. In this chapter, we would introduce a major epigenetic mechanism involving direct chemical modification to the DNA called DNA methylation. Historically, DNA methylation was discovered i

## Problems

- OpenAI token limit exceeded
- Any API-based model comes with usage limits
- Tried different Hugging Face models
    - SPECTER2
    - BioBERT
    - MPNet
    - Jina AI
- The cosine similarity for all of the models falls between 0.3 and 0.7
- This might be due to LaTeX formatting in the processed text
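
One way to test the LaTeX hypothesis is to strip the markup before chunking and embedding, then compare similarity scores on the same queries. The sketch below is a rough, regex-based cleaner; `strip_latex` is a hypothetical helper, it assumes math is delimited by `$...$` or `$$...$$`, and it drops math content entirely, which may not be desirable for every corpus.

```python
import re

def strip_latex(text):
    """Remove common LaTeX markup, keeping the readable words."""
    # Drop display math ($$...$$) and then inline math ($...$).
    # Limitation: plain dollar amounts such as "$5 and $10" would also match.
    text = re.sub(r"\$\$.*?\$\$", " ", text, flags=re.DOTALL)
    text = re.sub(r"\$[^$]*\$", " ", text)
    # Replace commands with one braced argument, e.g. \textbf{word} -> word
    text = re.sub(r"\\[a-zA-Z]+\*?\{([^{}]*)\}", r"\1", text)
    # Remove any remaining bare commands such as \alpha
    text = re.sub(r"\\[a-zA-Z]+\*?", " ", text)
    # Remove leftover braces and collapse runs of whitespace
    text = re.sub(r"[{}]", "", text)
    return re.sub(r"\s+", " ", text).strip()

sample = r"DNA methylation at \textbf{CpG} sites was \emph{reduced}; see $\beta = 0.5$ for details."
cleaned = strip_latex(sample)
print(cleaned)
```

The cleaner could be applied to `document['processed_text']` before building each `Document`. If the scores do not change noticeably, the markup is probably not the cause of the narrow similarity range.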