In [1]:
%load_ext dotenv
%dotenv

In [2]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

In [3]:
embedding = OpenAIEmbeddings(model = "text-embedding-ada-002")

In [4]:
vectorstore_from_directory = Chroma(persist_directory = "./vector-store",
                                   embedding_function = embedding)

## Similarity Search

In [5]:
question = "What programming languages do data scientists use?"

In [6]:
# Retrieve the 3 chunks whose embeddings are closest to the question
retrieved_docs = vectorstore_from_directory.similarity_search(query = question,
                                                              k = 3)
for i in retrieved_docs:
    print(f"Page Content: {i.page_content}\n------------\nLecture Title: {i.metadata['Lecture Title']}\n")

Page Content: What about big data? Apart from R and Python, people working in this area are often proficient in other languages like Java or Scala. These two have not been developed specifically for doing statistical analyses, however they turn out to be very useful when combining data from multiple sources. All right! Let’s finish off with machine learning. When it comes to machine learning, we often deal with big data
------------
Lecture Title: Programming Languages & Software Employed in Data Science - All the Tools You Need

Page Content: What about big data? Apart from R and Python, people working in this area are often proficient in other languages like Java or Scala. These two have not been developed specifically for doing statistical analyses, however they turn out to be very useful when combining data from multiple sources. All right! Let’s finish off with machine learning. When it comes to machine learning, we often deal with big data
------------
Lecture Title: Programming 

A limitation of similarity search is that it ranks chunks purely by their similarity to the query. If the vector store contains duplicate or near-identical chunks, the same content can be returned more than once, as in the output above.
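
One simple workaround is to filter repeated chunks out after retrieval. The sketch below is a minimal illustration, not part of the Chroma or LangChain API: `drop_duplicate_chunks` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `page_content` attribute of a retrieved document.

```python
from types import SimpleNamespace


def drop_duplicate_chunks(docs):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


# Stand-in results (hypothetical data): the first two chunks are identical
sample_docs = [
    SimpleNamespace(page_content="What about big data?"),
    SimpleNamespace(page_content="What about big data?"),
    SimpleNamespace(page_content="R and Python are adaptable."),
]

unique_docs = drop_duplicate_chunks(sample_docs)
print(len(unique_docs))  # 2
```

Filtering after the fact can leave fewer than `k` results, which is one reason to prefer a search strategy that accounts for redundancy while selecting.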

## Maximal Marginal Relevance Search

In [7]:
question = "What software do data scientists use?"

In [16]:
# lambda_mult ranges from 0 (maximum diversity) to 1 (minimum diversity)
retrieved_docs = vectorstore_from_directory.max_marginal_relevance_search(
    query = question,
    k = 3,
    lambda_mult = 0.3
)

In [17]:
for i in retrieved_docs:
    print(f"Page Content: {i.page_content}\n------------\nLecture Title: {i.metadata['Lecture Title']}\n")

Page Content: As you can see from the infographic, R, and Python are the two most popular tools across all columns. Their biggest advantage is that they can manipulate data and are integrated within multiple data and data science software platforms. They are not just suitable for mathematical and statistical computations. In other words, R, and Python are adaptable. They can solve a wide variety of business and data-related problems from beginning to the end
------------
Lecture Title: Programming Languages & Software Employed in Data Science - All the Tools You Need

Page Content: Alright! So… How are the techniques used in data, business intelligence, or predictive analytics applied in real life? Certainly, with the help of computers. You can basically split the relevant tools into two categories—programming languages and software. Knowing a programming language enables you to devise programs that can execute specific operations. Moreover, you can reuse these programs whenever you ne
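
For intuition about what `lambda_mult` controls, here is a minimal pure-Python sketch of greedy MMR selection. It is not the code that Chroma or LangChain runs; the names `cosine` and `mmr_select` and the toy two-dimensional vectors are made up for illustration. Each step picks the candidate with the best trade-off between relevance to the query and similarity to the chunks already selected.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return the indices of up to k documents chosen greedily by MMR."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))

    def score(i):
        # Penalty: similarity to the closest already-selected document
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (hypothetical): documents 0 and 1 are exact duplicates
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.1], [0.6, 0.8]]

print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
print(mmr_select(query, docs, k=2, lambda_mult=0.3))  # [0, 2]
```

With `lambda_mult = 1.0` the redundancy penalty vanishes and the duplicate is returned, which matches plain similarity search. With `lambda_mult = 0.3` the duplicate is penalized heavily, so the less similar but distinct document is chosen instead.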

## Vectorstore-Backed Retriever

In [19]:
from langchain_openai import OpenAIEmbeddings
# langchain_chroma replaces the deprecated langchain_community.vectorstores.Chroma
from langchain_chroma import Chroma
from langchain_core.documents import Document

In [20]:
embedding = OpenAIEmbeddings()
vectorstore = Chroma(persist_directory = "./vector-store",
                    embedding_function = embedding)

In [22]:
len(vectorstore.get()["documents"])

140

In [25]:
retriever = vectorstore.as_retriever(search_type = "mmr",
                                     search_kwargs = {"k": 3,
                                                      "lambda_mult": 0.7})
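
The `search_type` and `search_kwargs` arguments determine which search the retriever runs. As a rough sketch, the dictionaries below show argument sets that could be unpacked into `as_retriever(**config)`. The variable names and the specific values (such as `fetch_k = 20` and `score_threshold = 0.8`) are illustrative assumptions, not settings taken from this notebook.

```python
# Plain similarity search: return the k most similar chunks
similarity_config = {
    "search_type": "similarity",
    "search_kwargs": {"k": 3},
}

# MMR: fetch fetch_k candidates first, then pick k of them balancing
# relevance against diversity according to lambda_mult
mmr_config = {
    "search_type": "mmr",
    "search_kwargs": {"k": 3, "fetch_k": 20, "lambda_mult": 0.7},
}

# Similarity search that only returns chunks scoring above a threshold
threshold_config = {
    "search_type": "similarity_score_threshold",
    "search_kwargs": {"score_threshold": 0.8},
}

# Example usage (requires an existing vector store):
# retriever = vectorstore.as_retriever(**mmr_config)
```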

In [26]:
question = "What software do data scientists use?"

In [28]:
# The retriever implements LangChain's Runnable interface, so it is called with invoke()
retrieved_docs = retriever.invoke(question)

In [29]:
for i in retrieved_docs:
    print(f"Page Content: {i.page_content}\n------------\nLecture Title: {i.metadata['Lecture Title']}\n")

Page Content: As you can see from the infographic, R, and Python are the two most popular tools across all columns. Their biggest advantage is that they can manipulate data and are integrated within multiple data and data science software platforms. They are not just suitable for mathematical and statistical computations. In other words, R, and Python are adaptable. They can solve a wide variety of business and data-related problems from beginning to the end
------------
Lecture Title: Programming Languages & Software Employed in Data Science - All the Tools You Need

Page Content: Alright! So… How are the techniques used in data, business intelligence, or predictive analytics applied in real life? Certainly, with the help of computers. You can basically split the relevant tools into two categories—programming languages and software. Knowing a programming language enables you to devise programs that can execute specific operations. Moreover, you can reuse these programs whenever you ne