### **Retriever and Chain with LangChain**

In [6]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("study.pdf")
docs = loader.load()
docs

[Document(page_content='ITDO6014 \nAI AND DS-1 \nModule 4: Introduction to DS CS380 1', metadata={'source': 'study.pdf', 'page': 0}),
 Document(page_content='Introduction to Data Science \n◻What is Data? \n◻Data is the collection of facts and bits of information. In \nthe real world, the data is either structured or \nunstructured. \n◻Structured data \xa0is data that has an order and a \nwell-defined structure. As the structured data is consistent \nand well-defined, it is an easy task to store and access it. \nAlso, searching for data is easy as we can use indexes to \nstore structured data.\xa0 \n◻\xa02', metadata={'source': 'study.pdf', 'page': 1}),
 Document(page_content='Introduction to Data Science \n◻Another type is unstructured data. It is an inconsistent \ntype as it doesn’t have any structure, format, or \nsequence. The unstructured data is error-prone when we \nperform indexing on it. Hence, it is a difficult task to \nunderstand and operate on unstructured data. \nInteresti

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the loaded pages into chunks of at most 1000 characters with a 20-character overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
documents = text_splitter.split_documents(docs)
documents[:5]

[Document(page_content='ITDO6014 \nAI AND DS-1 \nModule 4: Introduction to DS CS380 1', metadata={'source': 'study.pdf', 'page': 0}),
 Document(page_content='Introduction to Data Science \n◻What is Data? \n◻Data is the collection of facts and bits of information. In \nthe real world, the data is either structured or \nunstructured. \n◻Structured data \xa0is data that has an order and a \nwell-defined structure. As the structured data is consistent \nand well-defined, it is an easy task to store and access it. \nAlso, searching for data is easy as we can use indexes to \nstore structured data.\xa0 \n◻\xa02', metadata={'source': 'study.pdf', 'page': 1}),
 Document(page_content='Introduction to Data Science \n◻Another type is unstructured data. It is an inconsistent \ntype as it doesn’t have any structure, format, or \nsequence. The unstructured data is error-prone when we \nperform indexing on it. Hence, it is a difficult task to \nunderstand and operate on unstructured data. \nInteresti
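
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification, not the library's algorithm: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries before falling back to smaller pieces. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Tiny values so the overlap is visible; the notebook uses 1000 and 20.
example_chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(example_chunks)
```

In the output above, the first chunks match the loaded pages, presumably because those pages are shorter than 1000 characters and need no splitting.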

In [8]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

# Embed and index only the first 30 chunks.
db = FAISS.from_documents(documents[:30], OllamaEmbeddings())

In [9]:
db

<langchain_community.vectorstores.faiss.FAISS at 0x19aa52094f0>

In [10]:
query = "What is Data Science?"
result = db.similarity_search(query)
result[0].page_content

'Introduction to Data Science \n◻Structured Data : This type of data contains organized data. It has \na fixed schema. Thus, it is easy to understand and analyze \nstructured data. \n◻Semi-structured Data : The data in the form of various file formats \nlike XML, JSON, and CSV is categorized as semi-structured data. \nIt is partially organized data, which makes it difficult to \nunderstand. \n◻Unstructured Data : This type of data does not have a \nwell-defined structure or a schema. The real-world data is always \nunstructured and hence challenging to understand. This data is \ngenerated through various digital channels including mobile \nphones, the Internet, social media, and e-commerce websites. 11'
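
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows that idea with a brute-force scan over hand-written toy vectors. Everything in it is an assumption for illustration: the vectors are made up rather than produced by `OllamaEmbeddings`, Euclidean distance is assumed as the metric, and `nearest_chunks` is a hypothetical helper, not part of FAISS or LangChain.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 4) -> list[str]:
    """Return the texts of the k stored vectors closest to query_vec."""
    ranked = sorted(store, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy two-dimensional "embeddings"; real embedding vectors are much longer.
toy_store = [
    ("structured data", [1.0, 0.0]),
    ("unstructured data", [0.0, 1.0]),
    ("semi-structured data", [0.7, 0.7]),
]

top_two = nearest_chunks([0.9, 0.1], toy_store, k=2)
print(top_two)
```

A real FAISS index avoids re-sorting the whole store on every query, but the ranking idea is the same.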

In [11]:
from langchain_community.llms import Ollama

llm = Ollama(model="llama2")
llm

Ollama()

In [12]:
## Design Chat Prompt Template
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template("""
        Answer the following question based only on the provided context. Think step by step before providing a detailed answer. I will tip you $1000 if the user finds the answer helpful.
        <context>
            {context}
        </context>
        Question: {input}""")

In [13]:
from langchain.chains.combine_documents import create_stuff_documents_chain

doc_chain = create_stuff_documents_chain(llm, prompt)
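
The "stuff" strategy can be pictured as plain string formatting: all retrieved chunks are joined into one block of text and placed in the `{context}` slot, while the question fills `{input}`. The following is a minimal sketch of that idea only. `stuff_prompt` and the shortened `TEMPLATE` are hypothetical, and the blank-line separator is an assumption; the real chain also sends the formatted prompt to the LLM.

```python
# Shortened version of the notebook's prompt, with the same two placeholders.
TEMPLATE = (
    "Answer the following question based only on the provided context.\n"
    "<context>\n{context}\n</context>\n"
    "Question: {input}"
)


def stuff_prompt(page_contents: list[str], question: str, separator: str = "\n\n") -> str:
    """Join all chunks into one context string and fill the template."""
    context = separator.join(page_contents)
    return TEMPLATE.format(context=context, input=question)


filled = stuff_prompt(
    ["Data is the collection of facts.", "Structured data has a fixed schema."],
    "What is data?",
)
print(filled)
```

Because every chunk lands in a single prompt, this approach only works while the combined text fits in the model's context window.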

In [14]:
retriever = db.as_retriever()
retriever

VectorStoreRetriever(tags=['FAISS', 'OllamaEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x0000019AA52094F0>)

In [15]:
from langchain.chains import create_retrieval_chain

retriever_chain = create_retrieval_chain(retriever, doc_chain)
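
The retrieval chain wires the two pieces together: it fetches documents for the question, passes them to the document chain, and returns a dictionary that includes the answer. The sketch below mimics that flow with stand-ins so it runs without Ollama. `make_retrieval_chain`, `toy_retrieve`, `toy_combine`, and `CORPUS` are all hypothetical, and the keyword match and canned answer replace the vector search and the LLM call.

```python
from typing import Callable

CORPUS = [
    "Big data is described by volume, velocity, and variety.",
    "Structured data has a fixed schema.",
]


def toy_retrieve(query: str) -> list[str]:
    """Stand-in retriever: keep passages that share at least one word with the query."""
    words = set(query.lower().split())
    return [p for p in CORPUS if words & set(p.lower().split())]


def toy_combine(payload: dict) -> str:
    """Stand-in document chain: report how many passages were used instead of calling an LLM."""
    return f"{len(payload['context'])} passage(s) used for: {payload['input']}"


def make_retrieval_chain(retrieve: Callable[[str], list[str]], combine: Callable[[dict], str]):
    """Return a function that retrieves, combines, and packages the result."""
    def invoke(inputs: dict) -> dict:
        docs = retrieve(inputs["input"])
        answer = combine({"input": inputs["input"], "context": docs})
        return {**inputs, "context": docs, "answer": answer}
    return invoke


toy_chain = make_retrieval_chain(toy_retrieve, toy_combine)
toy_result = toy_chain({"input": "big volume"})
print(toy_result["answer"])
```

The result dictionary of the real chain should likewise expose the retrieved documents under `context`, which is useful for checking which chunks an answer was based on.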

In [18]:
res = retriever_chain.invoke({"input":"Characteristics of Big Data."})

In [19]:
res['answer']

'Thank you for providing the context! Based on the information provided, here are some characteristics of big data:\n\n1. Volume: Big data refers to the massive amounts of structured and unstructured data that companies generate every day. This data can range from terabytes to exabytes in size.\n2. Velocity: Big data is generated at a high speed, often in real-time, and requires processing and analysis to be done quickly to keep up with the pace of business.\n3. Variety: Big data comes in various forms, including structured data (e.g., relational databases), semi-structured data (e.g., XML), and unstructured data (e.g., social media posts, images, videos).\n4. Veracity: Big data may contain errors, inconsistencies, or missing values, which can affect the accuracy of analysis and decision-making.\n5. Value: Big data has the potential to provide significant business value by revealing insights, improving operational efficiency, and enabling innovation.\n6. Complexity: Big data often invo