<a href="https://colab.research.google.com/github/AlvinKimata/ml-projects/blob/main/RAG/write_faiss_embeddings_to_index_file.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
!pip install langchain faiss-cpu sentence_transformers chromadb

In [None]:
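# Imports for embeddings, text splitting, the Chroma vector store, and file loading.
# Note: recent LangChain releases move most of these classes to the
# langchain_community package; adjust the import paths if these fail.
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader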

In [None]:
# Load the file as a list of LangChain Document objects (no splitting happens here)
loader = TextLoader("/content/Testing the Massively Multilingual Speech (MMS) Model that Supports 1162 Languages.txt")
documents = loader.load()


In [None]:
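# Also read the raw text directly, since split_text() below expects a plain string
with open("/content/Testing the Massively Multilingual Speech (MMS) Model that Supports 1162 Languages.txt", "r", encoding="utf-8") as f:
    data = f.read()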

In [None]:
# Split the raw text into chunks of roughly 1000 characters with no overlap
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_text(data)
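
To make the effect of `chunk_size` concrete, here is a minimal sketch of separator-based chunking in plain Python. It is a simplified stand-in, not LangChain's implementation: the function name `split_on_separator` is hypothetical, and it assumes pieces are delimited by blank lines and packed greedily up to `chunk_size` characters. A single piece longer than `chunk_size` is kept whole rather than cut.

```python
def split_on_separator(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters. An oversized single piece is kept whole."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate  # the piece still fits (or starts a new chunk)
        else:
            chunks.append(current)  # close the full chunk, start a new one
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Three 400-character paragraphs: the first two fit together, the third does not
sample = "\n\n".join(["a" * 400, "b" * 400, "c" * 400])
chunks = split_on_separator(sample, chunk_size=1000)
print([len(c) for c in chunks])  # [802, 400]
```

The first chunk is 802 characters because it holds two paragraphs plus the two-character separator between them.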


In [None]:
# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

embedding_function

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
), model_name='all-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={})
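
The printed model ends with a `Normalize()` layer, so every embedding it returns has unit length. One consequence is that the dot product of two embeddings equals their cosine similarity. The sketch below checks this with toy 3-dimensional vectors standing in for the 384-dimensional ones; the helper names are illustrative only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for real embeddings
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

a_unit, b_unit = normalize(a), normalize(b)
print(round(dot(a_unit, b_unit), 6))  # 0.733333
print(round(cosine(a, b), 6))         # 0.733333
```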

In [None]:
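# Load the chunks into a Chroma vector store
db = Chroma.from_texts(docs, embedding_function)

# Query it. similarity_search ranks chunks by how close their embeddings are
# to the query embedding, so it returns the best-matching chunk rather than
# literally the first line of the file.
query = "What is the first line of the document?"
results = db.similarity_search(query)

# Print the best match
print(results[0].page_content)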

Meta used religious texts, such as the Bible, to build a model covering this wide range of languages. These texts have several interesting components: first, they are translated into many languages, and second, there are publicly available audio recordings of people reading these texts in different languages. Thus, the main dataset where this model was trained was the New Testament, which the research team was able to collect for over 1,100 languages and provided more than 32h of data per language. They went further to make it recognize 4,000 languages. This was done by using unlabeled recordings of various other Christian religious readings. From the experiments results, even though the data is from a specific domain, it can generalize well.
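
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The following is a brute-force sketch of that idea, assuming ranking by squared Euclidean distance; the chunk texts and 2-dimensional vectors are made up for illustration, and real vector stores typically use optimized index structures instead of a full scan.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, indexed, k=2):
    """Return the k texts whose vectors are closest to query_vec."""
    ranked = sorted(indexed, key=lambda item: squared_l2(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical chunks paired with made-up "embeddings"
indexed = [
    ("chunk about training data", [0.9, 0.1]),
    ("chunk about evaluation", [0.1, 0.9]),
    ("chunk about languages", [0.7, 0.3]),
]
query_vec = [1.0, 0.0]

print(top_k(query_vec, indexed, k=2))
# ['chunk about training data', 'chunk about languages']
```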


In [None]:
# Re-split the raw text with the splitter's default settings for the FAISS index
text_splitter = CharacterTextSplitter()
splits = text_splitter.split_text(data)



In [None]:
from langchain.vectorstores.faiss import FAISS

# Build a FAISS vector store from the text chunks
store = FAISS.from_texts(splits, embedding_function)

In [None]:
# Persist the index and its document store to a local folder
store.save_local('/content/faiss_index')

In [None]:
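# Reload the saved index with the same embedding function used to build it.
# Note: newer LangChain versions also require allow_dangerous_deserialization=True
# here, because the saved document store is a pickle file.
new_db = FAISS.load_local('/content/faiss_index', embedding_function)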
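
A useful sanity check after reloading is that the same query returns the same result before and after the round trip. The sketch below shows that check on a toy index in plain Python. The JSON file layout and the helper names (`save_index`, `load_index`, `nearest`) are hypothetical and unrelated to FAISS's actual on-disk format.

```python
import json
import os
import tempfile


def save_index(folder, texts, vectors):
    """Write texts and vectors to a JSON file inside folder."""
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, "index.json"), "w", encoding="utf-8") as f:
        json.dump({"texts": texts, "vectors": vectors}, f)


def load_index(folder):
    """Read texts and vectors back from folder."""
    with open(os.path.join(folder, "index.json"), "r", encoding="utf-8") as f:
        payload = json.load(f)
    return payload["texts"], payload["vectors"]


def nearest(query_vec, texts, vectors):
    """Return the text whose vector has the smallest squared L2 distance."""
    distances = [
        sum((q - v) ** 2 for q, v in zip(query_vec, vec)) for vec in vectors
    ]
    return texts[distances.index(min(distances))]


texts = ["first chunk", "second chunk"]
vectors = [[1.0, 0.0], [0.0, 1.0]]
query_vec = [0.9, 0.2]

# Save to a temporary folder, then load the data back
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "toy_index")
    save_index(folder, texts, vectors)
    loaded_texts, loaded_vectors = load_index(folder)

before = nearest(query_vec, texts, vectors)
after = nearest(query_vec, loaded_texts, loaded_vectors)
print(before == after)  # True
```

In the notebook itself, the equivalent check would be to compare `store.similarity_search(query)` with `new_db.similarity_search(query)`.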