In [1]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import DeepLake
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.chat_models import ChatOllama
from langchain.chains import RetrievalQA



In [2]:
# create our documents
texts = [
    "Napoleon Bonaparte was born in 15 August 1769",
    "Louis XIV was born in 5 September 1638",
    "Lady Gaga was born in 28 March 1986",
    "Michael Jeffrey Jordan was born in 17 February 1963"
]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.create_documents(texts)

In [4]:
# initialize embeddings model
embeddings = OllamaEmbeddings()


# create Deep Lake dataset
# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = "georgeasro"
my_activeloop_dataset_name = "langchain_course_embeddings"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
# pass the embeddings model via `embedding` (the `embedding_function` argument is deprecated)
db = DeepLake(dataset_path=dataset_path, embedding=embeddings)

# add documents to our Deep Lake dataset
db.add_documents(docs)



env: ACTIVELOOP_TOKEN=<redacted>
Your Deep Lake dataset has been successfully created!


Creating 4 embeddings in 1 batches of size 4:: 100%|██████████| 1/1 [00:19<00:00, 19.95s/it]

Dataset(path='hub://georgeasro/langchain_course_embeddings', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
   text       text      (4, 1)      str     None   
 metadata     json      (4, 1)      str     None   
 embedding  embedding  (4, 4096)  float32   None   
    id        text      (4, 1)      str     None   





['3935104e-d252-11ee-ae3b-00155d9fa9e7',
 '393510d0-d252-11ee-ae3b-00155d9fa9e7',
 '393510ee-d252-11ee-ae3b-00155d9fa9e7',
 '39351102-d252-11ee-ae3b-00155d9fa9e7']

In [5]:
# create retriever from db
retriever = db.as_retriever()

In [None]:
# instantiate the LLM wrapper
model = ChatOllama(model='llama2')

# create the question-answering chain
qa_chain = RetrievalQA.from_llm(model, retriever=retriever)

# ask a question to the chain
qa_chain.run("When was Michael Jeffrey Jordan born?")

### Let's break down each step to understand how these technologies work together.

- Ollama and LangChain Integration: LangChain, a library for chaining language-model components, is used here with models served locally by Ollama. The llama2 chat model, wrapped by ChatOllama, handles language understanding and generation.
- You've initialized Ollama embeddings using OllamaEmbeddings(), and these embeddings are later used to transform the text into a high-dimensional vector representation (4,096 dimensions per document, as the dataset summary above shows). This vector representation captures the semantic content of the text and is essential for information retrieval tasks.
- Deep Lake: Deep Lake is a Vector Store for creating, storing, and querying vector representations (also known as embeddings) of data.
- Text Retrieval: Using the db.as_retriever() function, you've transformed the Deep Lake dataset into a retriever object. This object is designed to fetch the most relevant pieces of text from the dataset based on the semantic similarity of their embeddings.
- Question Answering: The final step sets up a RetrievalQA chain from LangChain. This chain accepts a natural language question, has the retriever embed it with the same OllamaEmbeddings model and fetch the most relevant document chunks from the Deep Lake dataset, and then generates a natural language answer. The ChatOllama model is responsible only for answer generation; the question embedding is produced by the embedding model attached to the vector store, not by the chat model.
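
To make the retrieve-then-answer flow above more concrete, here is a minimal, dependency-free sketch of the same idea. It is not how Deep Lake or RetrievalQA are implemented: `toy_embed`, `retrieve`, and `build_prompt` are hypothetical helpers, the "embedding" is a simple word-count vector standing in for the dense vectors a real model produces, and the prompt wording is an assumption rather than LangChain's actual template.

```python
import math
import re
from collections import Counter

texts = [
    "Napoleon Bonaparte was born in 15 August 1769",
    "Louis XIV was born in 5 September 1638",
    "Lady Gaga was born in 28 March 1986",
    "Michael Jeffrey Jordan was born in 17 February 1963",
]


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, k=1):
    """Embed the question and return the k most similar documents."""
    query_vector = toy_embed(question)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, toy_embed(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context_docs):
    """Place the retrieved text in a prompt (assumed wording, not LangChain's)."""
    context = "\n".join(context_docs)
    return (
        "Use the context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


question = "When was Michael Jeffrey Jordan born?"
top_docs = retrieve(question, texts, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

In the notebook, the first two steps correspond roughly to what the retriever does with OllamaEmbeddings and Deep Lake, and the resulting prompt is what a chat model such as ChatOllama would receive to produce the final answer.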