Retrieval: Similarity search

In [3]:
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.documents import Document

In [4]:
embedding = OpenAIEmbeddings(model="text-embedding-ada-002")

In [5]:
vectorstore = Chroma(persist_directory="./database", embedding_function=embedding)

Adding a document to the vectorstore

In [6]:
added_document = Document(page_content='Alright! So… Let’s discuss the not-so-obvious differences between the terms analysis and analytics. Due to the similarity of the words, some people believe they share the same meaning, and thus use them interchangeably. Technically, this isn’t correct. There is, in fact, a distinct difference between the two. And the reason for one often being used instead of the other is the lack of a transparent understanding of both. So, let’s clear this up, shall we? First, we will start with analysis', 
                          metadata={'Course Title': 'Introduction to Data and Data Science', 
                                    'Lecture Title': 'Analysis vs Analytics'})

In [7]:
vectorstore.add_documents([added_document])

['0edb5dc4-ee2e-42dc-88bc-831966a50e04']

In [8]:
que = "What kind of programming languages do data scientists use?"

In [9]:
retrieved_docs = vectorstore.similarity_search(query=que, k=5)

In [10]:
for page in retrieved_docs:
    print(page.page_content)

What about big data? Apart from R and Python, people working in this area are often proficient in other languages like Java or Scala. These two have not been developed specifically for doing statistical analyses, however they turn out to be very useful when combining data from multiple sources. All right! Let’s finish off with machine learning. When it comes to machine learning, we often deal with big data
Thus, we need a lot of computational power, and we can expect people to use the languages similar to those in the big data column. Apart from R, Python, and MATLAB, other, faster languages are used like Java, JavaScript, C, C++, and Scala. Cool. What we said may be wonderful, but that’s not all! By using one or more programming languages, people create application software or, as they are sometimes called, software solutions, that are adjusted for specific business needs
Alright! So… How are the techniques used in data, business intelligence, or predictive analytics applied in real l
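
The ranking behind a similarity search can be sketched without any external service. The snippet below is a minimal illustration, not Chroma's implementation: it assumes cosine similarity as the metric (the metric Chroma actually uses depends on how the collection is configured), and the vectors and the helper names `cosine` and `top_k` are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 2-D "embeddings" standing in for real embedding vectors (hypothetical data).
toy_doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
toy_query_vec = [1.0, 0.0]

print(top_k(toy_query_vec, toy_doc_vecs, k=2))  # → [0, 1]
```

In the notebook, `k=5` plays the same role as `k=2` here: it caps how many of the highest-scoring chunks are returned.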

Retrieval: Maximal Marginal Relevance Search

In [13]:
retrieved_docs = vectorstore.max_marginal_relevance_search(
    query=que,
    k=5,
    lambda_mult=0.1,  # closer to 0 favors diversity, closer to 1 favors relevance
    filter={"Lecture Title": "Programming Languages & Software Employed in Data Science - All the Tools You Need"},
)

In [14]:
for page in retrieved_docs:
    print(page.page_content)

What about big data? Apart from R and Python, people working in this area are often proficient in other languages like Java or Scala. These two have not been developed specifically for doing statistical analyses, however they turn out to be very useful when combining data from multiple sources. All right! Let’s finish off with machine learning. When it comes to machine learning, we often deal with big data
Great! We hope we gave you a good idea about the level of applicability of the most frequently used programming and software tools in the field of data science. Thank you for watching!
It’s actually a software framework which was designed to address the complexity of big data and its computational intensity. Most notably, Hadoop distributes the computational tasks on multiple computers which is basically the way to handle big data nowadays. Power BI, SaS, Qlik, and especially Tableau are top-notch examples of software designed for business intelligence visualizations
Their smaller sc
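
To make the effect of `lambda_mult` concrete, here is a minimal pure-Python sketch of greedy maximal marginal relevance selection. It is an illustration under stated assumptions rather than the library's code: cosine similarity is assumed as the metric, the vectors are invented, the helper names `cosine` and `mmr_select` are hypothetical, and the pre-selection step that the real method controls through its `fetch_k` argument is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, doc_vecs, k, lambda_mult):
    """Greedily pick k indices, trading relevance against redundancy."""
    relevance = [cosine(query_vec, vec) for vec in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for idx in candidates:
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(doc_vecs[idx], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance[idx] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy vectors (hypothetical): index 1 is a near-duplicate of index 0,
# index 2 is less relevant to the query but points in a different direction.
toy_doc_vecs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
toy_query_vec = [1.0, 0.0]

print(mmr_select(toy_query_vec, toy_doc_vecs, k=2, lambda_mult=1.0))  # → [0, 1]
print(mmr_select(toy_query_vec, toy_doc_vecs, k=2, lambda_mult=0.1))  # → [0, 2]
```

With `lambda_mult=1.0` the selection reduces to plain relevance ranking, so the near-duplicate is returned. With `lambda_mult=0.1`, the value used in the cell above, the redundancy penalty dominates and the more distinct vector is picked second.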

Retrieval: Vectorstore Backend Retriever

In [15]:
len(vectorstore.get()['documents'])

23

In [16]:
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 3, "lambda_mult": 0.7})
retriever

VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], vectorstore=<langchain_chroma.vectorstores.Chroma object at 0x000002685CDBF0E0>, search_type='mmr', search_kwargs={'k': 3, 'lambda_mult': 0.7})

In [17]:
que = "What software do data scientists use?"

In [19]:
docs = retriever.invoke(que)
docs

[Document(id='b213b4ef-043c-4f70-a51a-4546fd61ef8e', metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'}, page_content='Alright! So… How are the techniques used in data, business intelligence, or predictive analytics applied in real life? Certainly, with the help of computers. You can basically split the relevant tools into two categories—programming languages and software. Knowing a programming language enables you to devise programs that can execute specific operations. Moreover, you can reuse these programs whenever you need to execute the same action'),
 Document(id='7e98f2f5-e551-47ad-93e4-616ec668e475', metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'}, page_content='Among the many applications we have plotted, we can say there is an increasing amount of software designed for working with big data such as Apache Hadoop, Apache Hbase, and Mongo DB. In terms o

In [21]:
for doc in docs:
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"{doc.page_content}---->\n{doc.metadata['Lecture Title']}\n")

Alright! So… How are the techniques used in data, business intelligence, or predictive analytics applied in real life? Certainly, with the help of computers. You can basically split the relevant tools into two categories—programming languages and software. Knowing a programming language enables you to devise programs that can execute specific operations. Moreover, you can reuse these programs whenever you need to execute the same action---->
Programming Languages & Software Employed in Data Science - All the Tools You Need

Among the many applications we have plotted, we can say there is an increasing amount of software designed for working with big data such as Apache Hadoop, Apache Hbase, and Mongo DB. In terms of big data, Hadoop is the name that must stick with you. Hadoop is listed as a software in the sense that it is a collection of programs, but don’t imagine it as a nice-looking application---->
Programming Languages & Software Employed in Data Science - All the Tools You Need
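
One way to picture the retriever used in this section is as a thin wrapper that stores a search type plus keyword arguments and forwards each query to the matching vectorstore method. The sketch below is a toy model of that idea, not LangChain's code: `ToyVectorStore` and `ToyRetriever` are hypothetical classes, and the store only reports which method was called instead of searching anything.

```python
class ToyVectorStore:
    """Hypothetical stand-in for a vectorstore; it only reports the call it received."""

    def similarity_search(self, query, **kwargs):
        return ("similarity", query, kwargs)

    def max_marginal_relevance_search(self, query, **kwargs):
        return ("mmr", query, kwargs)


class ToyRetriever:
    """Hypothetical wrapper that dispatches on search_type and forwards search_kwargs."""

    def __init__(self, vectorstore, search_type="similarity", search_kwargs=None):
        self.vectorstore = vectorstore
        self.search_type = search_type
        self.search_kwargs = search_kwargs or {}

    def invoke(self, query):
        if self.search_type == "similarity":
            return self.vectorstore.similarity_search(query, **self.search_kwargs)
        if self.search_type == "mmr":
            return self.vectorstore.max_marginal_relevance_search(query, **self.search_kwargs)
        raise ValueError(f"Unsupported search_type: {self.search_type!r}")


toy_retriever = ToyRetriever(
    ToyVectorStore(),
    search_type="mmr",
    search_kwargs={"k": 3, "lambda_mult": 0.7},
)
print(toy_retriever.invoke("example query"))
# → ('mmr', 'example query', {'k': 3, 'lambda_mult': 0.7})
```

The practical consequence is that search settings are fixed once when the retriever is created, so downstream code only needs to call `invoke` with a query string.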