<h1> Importing Necessary Libraries</h1>

Embeddings and Prompts

In [4]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.prompts import PromptTemplate

Chains and Utilities

In [6]:
from langchain.chains import RetrievalQA
from langchain.chains import LLMChain
from langchain.text_splitter import CharacterTextSplitter

PDF Handling

In [7]:
from PyPDF2 import PdfReader

Typing Extensions and Schema

In [8]:
from typing_extensions import Concatenate
from langchain.schema import Document

Conversational Memory

In [9]:
from langchain.memory import ConversationBufferWindowMemory

<h2> Loading a PDF File </h2>

To load a PDF file, initialize a PdfReader object for further processing:

In [10]:
loader = PdfReader('lecture_notes.pdf')

<h2>Extracting Text from PDF Pages</h2>

In [12]:
# Extract the text of each page of the loaded PDF and concatenate it.
# Pages without extractable text (for example scanned images) are skipped.

raw_text = ''
for page in loader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

In [13]:
raw_text

'What are large language models (LLMs)?\nLarge language models (LLMs) are a category of foundation models trained on immense amounts of data making them\ncapable of understanding and generating natural language and other types of content to perform a wide range of tasks.\nLLMs have become a household name thanks to the role they have played in bringing generative AI to the forefront of\nthe public interest, as well as the point on which organizations are focusing to adopt artificial intelligence across\nnumerous business functions and use cases.\nLLMs are a class of foundation models, which are trained on enormous amounts of data to provide the foundational\ncapabilities needed to drive multiple use cases and applications, as well as resolve a multitude of tasks. This is in stark\ncontrast to the idea of building and training domain specific models for each of these use cases individually, which is\nprohibitive under many criteria (most importantly cost and infrastructure), stifles syn

<h2> Splitting Text into Chunks </h2>

Splitting is the first step of the data-loading stage, in which the raw data is converted into our knowledge base (the vector DB).

Raw Data → Converted to Chunks → Indexed as Vectors in VectorDB

Character Splitting: The Naive Way
Fixed-length chunks with some overlap between successive chunks.

This is the simplest form of splitting, with no regard for sentence structure or semantics.

Chunk Size: the number of characters N per chunk.

Chunk Overlap: the number of characters shared between successive chunks.

In [17]:
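text_splitter = CharacterTextSplitter(
    separator="\n",        # split on line breaks
    chunk_size=800,        # maximum characters per chunk
    chunk_overlap=200,     # characters shared between successive chunks
    length_function=len,   # measure chunk length in characters
)
texts = text_splitter.split_text(raw_text)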
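To make the two parameters concrete, here is a minimal plain-Python sketch of fixed-length splitting with overlap. It is an illustration only: `split_fixed` is a hypothetical helper, not part of LangChain, and unlike `CharacterTextSplitter` it ignores the separator and cuts at exact character offsets.

```python
from typing import List


def split_fixed(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


sample_chunks = split_fixed("abcdefghij", chunk_size=4, chunk_overlap=2)
print(sample_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the role `chunk_overlap` plays: text that straddles a chunk boundary still appears whole in at least one chunk.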

<h2> Initializing HuggingFace Embeddings </h2>

An embedding is a numerical representation of a piece of information such as text, documents, images, or audio. The representation captures the semantic meaning of what is being embedded, so items with similar meaning end up with similar vectors. This is what makes embeddings useful for tasks such as semantic search and retrieval.
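A common way to compare two embeddings is cosine similarity. The sketch below uses hand-made three-dimensional vectors as stand-ins for real embeddings; the values are invented for illustration, and a real model produces vectors with hundreds of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative vectors only; they do not come from an actual embedding model.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.0]
invoice = [0.0, 0.1, 0.9]

# Related concepts point in similar directions, unrelated ones do not.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```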

In [18]:
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

<h2> Creating Document Objects </h2>

In [19]:
documents = [Document(page_content=text) for text in texts]

<h2> Initializing FAISS Vector Database </h2>

A vector database indexes and stores vector embeddings for fast retrieval and similarity search, with capabilities like CRUD operations, metadata filtering, horizontal scaling, and serverless deployment.

First, we use the embedding model to create vector embeddings for the content we want to index.
Next, each vector embedding is inserted into the vector database, along with a reference to the original content it was created from.
Finally, when the application issues a query, we use the same embedding model to embed the query and search the database for similar vector embeddings. Those similar embeddings point back to the original content that was used to create them.
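The three steps above can be sketched with a toy in-memory store. Everything here is hypothetical and for illustration: `toy_embed` is a keyword counter standing in for a real embedding model, and `ToyVectorStore` does a brute-force nearest-neighbour search by squared L2 distance rather than using a real index.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for an embedding model: counts of a few keywords."""
    vocabulary = ["language", "model", "pdf", "vector"]
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]


class ToyVectorStore:
    """Stores (vector, original text) pairs and searches them by distance."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self.embed = embed
        self.entries: List[Tuple[Vector, str]] = []

    def add(self, text: str) -> None:
        # Steps 1 and 2: embed the content and keep a reference to the original text.
        self.entries.append((self.embed(text), text))

    def search(self, query: str, k: int = 1) -> List[str]:
        # Step 3: embed the query with the same model and rank stored vectors.
        q = self.embed(query)

        def squared_l2(v: Vector) -> float:
            return sum((a - b) ** 2 for a, b in zip(q, v))

        ranked = sorted(self.entries, key=lambda entry: squared_l2(entry[0]))
        return [text for _, text in ranked[:k]]


store = ToyVectorStore(toy_embed)
store.add("a language model predicts the next token")
store.add("a pdf reader extracts text from pages")
store.add("a vector index stores embeddings")

print(store.search("what is a language model", k=1))
# ['a language model predicts the next token']
```

Using the same embedding function for indexing and querying matters: distances are only meaningful when both vectors live in the same space.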

Facebook AI Similarity Search (Faiss) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.

In [20]:
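# Requires the faiss-cpu (or faiss-gpu) package.
from langchain_community.vectorstores import FAISS

vectordb = FAISS.from_documents(documents=documents, embedding=embeddings)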

<h2> Initializing GooglePalm Language Model </h2>

PaLM 2 is Google's next-generation language model, with improved multilingual, reasoning and coding capabilities, that builds on Google’s legacy of breakthrough research in machine learning and responsible AI.

According to Google, it performs advanced reasoning tasks, including code and math, classification and question answering, translation and multilingual proficiency, and natural language generation, better than Google's previous state-of-the-art LLMs, including PaLM. Google attributes this to the way it was built: bringing together compute-optimal scaling, an improved dataset mixture, and model architecture improvements.

In [22]:
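# Assumed import path for the GooglePalm wrapper used in this notebook.
from langchain_community.llms import GooglePalm

# You can get an API key from https://aistudio.google.com/app/
llm = GooglePalm(google_api_key="YOUR_GOOGLEPALM_API", temperature=0.1)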

<h2> Setting Up Retrieval-Based Question Answering System </h2>

In [23]:
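# Note: to filter by relevance score, LangChain expects
# search_type="similarity_score_threshold" with search_kwargs={"score_threshold": 0.7};
# passed as a bare keyword as below, score_threshold may not be applied.
retriever = vectordb.as_retriever(score_threshold=0.7)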

prompt_template = """Given the following context and a question, generate an answer based on this context only. 
If applicable, include the source of the information. If the answer is not found in the context, state "I don't know."
Do not provide extra information or make up an answer.

CONTEXT: {context}

QUESTION: {question}
"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    input_key="query",
    return_source_documents=True,
    output_key='result',
    chain_type_kwargs={"prompt": PROMPT}
)
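The "stuff" chain type places every retrieved chunk into a single prompt. A rough sketch of that assembly step, using only string formatting, is shown here. `build_stuff_prompt` is a hypothetical helper, the template is shortened, and the blank-line separator between chunks is an assumption rather than a guarantee about LangChain's internals.

```python
from typing import List

# Shortened version of the notebook's template, for illustration.
PROMPT_TEMPLATE = """Given the following context and a question, generate an answer based on this context only.

CONTEXT: {context}

QUESTION: {question}
"""


def build_stuff_prompt(chunks: List[str], question: str, separator: str = "\n\n") -> str:
    """Join all retrieved chunks into one context string and fill the template."""
    context = separator.join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


retrieved = ["LLMs are foundation models.", "They are trained on large datasets."]
print(build_stuff_prompt(retrieved, "what is llm"))
```

Because every chunk lands in one prompt, this approach only works while the combined chunks fit in the model's context window.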

In [25]:
chain('what is llm')

{'query': 'what is llm',
 'result': 'A large language model (LLM) is a category of foundation models trained on immense amounts of data making them\ncapable of understanding and generating natural language and other types of content to perform a wide range of tasks.',
 'source_documents': [Document(page_content='What are large language models (LLMs)?\nLarge language models (LLMs) are a category of foundation models trained on immense amounts of data making them\ncapable of understanding and generating natural language and other types of content to perform a wide range of tasks.\nLLMs have become a household name thanks to the role they have played in bringing generative AI to the forefront of\nthe public interest, as well as the point on which organizations are focusing to adopt artificial intelligence across\nnumerous business functions and use cases.\nLLMs are a class of foundation models, which are trained on enormous amounts of data to provide the foundational\ncapabilities needed 

In [26]:
chain('what is llm')['result']

'A large language model (LLM) is a category of foundation models trained on immense amounts of data making them\ncapable of understanding and generating natural language and other types of content to perform a wide range of tasks.'

In [27]:
chain("llm full form")['result']

'Large language models'

In [28]:
chain("IOT full form")['result']

"I don't know."

In [29]:
chain("what is the previous question asked")['result']

"I don't know."

In the current setup, the chain retains no context from past interactions, so it cannot maintain continuity across a conversation or refer back to earlier questions.

<h2> Adding Conversation Memory </h2>

To let the system refer back to earlier turns, we add a memory of previous conversations.

<h2> Integrating Conversation Memory into Retrieval-Based QA System </h2>

In [30]:
prompt_template = """Given the following context and a question, generate an answer based on this context only and also check the history if asking any previous. 
If applicable, include the source of the information. If the answer is not found in the context, state "I don't know."
Do not provide extra information or make up an answer.

CONTEXT: {context}

QUESTION: {question}
"""


PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)



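# Keep only the last k=3 exchanges. The memory is attached to the outer chain; because the
# prompt above has no {history} placeholder, the stored history may not reach the LLM.
memory = ConversationBufferWindowMemory(k=3, output_key='result')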

chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    input_key="query",
    return_source_documents=True,
    output_key='result',
    chain_type_kwargs={"prompt": PROMPT},
    memory=memory
)
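A window memory keeps only the most recent k exchanges and drops older ones. The idea can be sketched with a bounded deque. `WindowMemory` is a hypothetical helper written for illustration, not LangChain's implementation, and the `Human:`/`AI:` labels are an assumed formatting choice.

```python
from collections import deque
from typing import Deque, Tuple


class WindowMemory:
    """Hypothetical helper: remembers only the last k (question, answer) pairs."""

    def __init__(self, k: int) -> None:
        # A deque with maxlen discards the oldest item once it is full.
        self.turns: Deque[Tuple[str, str]] = deque(maxlen=k)

    def save(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def history(self) -> str:
        """Render the remembered turns as text that could be placed into a prompt."""
        return "\n".join(f"Human: {q}\nAI: {a}" for q, a in self.turns)


memory_sketch = WindowMemory(k=3)
for n in range(1, 5):
    memory_sketch.save(f"question {n}", f"answer {n}")

# Only turns 2, 3 and 4 remain; turn 1 has been dropped.
print(memory_sketch.history())
```

For the history to influence an answer, the rendered text has to be part of the prompt the model actually receives.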

In [31]:
chain('What is LLM')

{'query': 'What is LLM',
 'history': '',
 'result': 'LLMs are a category of foundation models trained on immense amounts of data making them capable of understanding and generating natural language and other types of content to perform a wide range of tasks.',
 'source_documents': [Document(page_content='What are large language models (LLMs)?\nLarge language models (LLMs) are a category of foundation models trained on immense amounts of data making them\ncapable of understanding and generating natural language and other types of content to perform a wide range of tasks.\nLLMs have become a household name thanks to the role they have played in bringing generative AI to the forefront of\nthe public interest, as well as the point on which organizations are focusing to adopt artificial intelligence across\nnumerous business functions and use cases.\nLLMs are a class of foundation models, which are trained on enormous amounts of data to provide the foundational\ncapabilities needed to drive

In [32]:
chain("IOT full form")['result']

"I don't know."

In [33]:
chain("what is the first question asked")['result']

'The first question asked is "What are Large Language Models?"'

In [34]:
chain("What are some milestone model architectures and papers in the last few years?")['result']

'Transformer, introduced in the paper “Attention is All You Need” by Vaswani et al., have become the foundation for\nmost state-of-the-art NLP models.'

In [35]:
chain("What are the layers in a transformer block?")['result']

'Self-Attention Mechanism, Positional Encoding, Feed-Forward Neural Networks'

In [37]:
chain("Tell me about datasets used to train LLMs and how they are cleaned")['result']

'LLMs are trained on massive datasets of text, code, and other forms of content. The size of these datasets can range\nfrom hundreds of millions to billions of words. The most common sources of data for LLMs are Wikipedia, news articles,\nand books. However, LLMs can also be trained on data from social media, email, and other sources.\n\nThe process of cleaning data for LLMs is complex and challenging. The data must be free of errors and inconsistencies, and\nit must be structured in a way that the LLM can understand. Cleaning data for LLMs typically involves removing\nduplicates, correcting errors, and normalizing text.\n\nThere are a number of tools and techniques that can be used to clean data for LLMs. Some of the most common\ntechniques include:\n\n* **Tokenization:** This process breaks text into individual words or tokens.\n* **Stemming:** This process removes the endings of words, such as "-ed" and "-ing".\n* **Lemmatization:** This process reduces words to their base form, suc

In [40]:
chain("who are you")['result']

'I am a large language model (LLM) that was trained on a massive dataset of text and code. I am able to understand and generate natural language, translate languages, write different kinds of creative content, and answer your questions. I am still under development, but I am learning new things every day.'

<h2> Introducing Agents</h2>

In [50]:
import os
os.environ['SERPAPI_API_KEY'] = "YOUR_SERP_API"    # Get a free API key from https://serpapi.com/manage-api-key
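Outside a throwaway notebook, it is usually safer to read the key from the environment than to write it into the code. A minimal sketch, assuming the key is exported in the shell before the notebook starts; `require_env` is a hypothetical helper.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, failing loudly if it is missing."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Placeholder so the sketch runs on its own; a real key would be exported in the shell.
os.environ.setdefault("SERPAPI_API_KEY", "YOUR_SERP_API")
serpapi_key = require_env("SERPAPI_API_KEY")
```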

<h2> Initializing Agent with Tools and LLM </h2>

In [47]:
from langchain.agents import AgentType, initialize_agent, load_tools

tools = load_tools(["serpapi", "llm-math", "wikipedia"], llm=llm)
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)
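A zero-shot ReAct agent alternates between asking the model what to do and running the tool it names, feeding each observation back into the next prompt. The loop below is a simplified sketch of that pattern, not LangChain's `AgentExecutor`. `run_react`, `fake_llm` and `fake_lookup` are hypothetical names, and the model is replaced by scripted replies so the example runs offline.

```python
import re
from typing import Callable, Dict

ACTION_PATTERN = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)")
FINAL_PATTERN = re.compile(r"Final Answer:\s*(.+)")


def run_react(llm: Callable[[str], str],
              tools: Dict[str, Callable[[str], str]],
              question: str,
              max_steps: int = 5) -> str:
    """Loop: ask the model, run the requested tool, append the observation, repeat."""
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(scratchpad)
        final = FINAL_PATTERN.search(reply)
        if final:
            return final.group(1).strip()
        action = ACTION_PATTERN.search(reply)
        if action is None:
            raise ValueError(f"Could not parse model reply: {reply!r}")
        tool_name = action.group(1).strip()
        tool_input = action.group(2).strip()
        if tool_name not in tools:
            raise KeyError(f"Unknown tool: {tool_name}")
        observation = tools[tool_name](tool_input)
        scratchpad += f"{reply}\nObservation: {observation}\n"
    raise RuntimeError("No final answer within max_steps")


# Scripted replies stand in for a real LLM (illustration only).
scripted_replies = iter([
    "Thought: I should look this up\nAction: lookup\nAction Input: IOT",
    "Thought: I now know the answer\nFinal Answer: Internet of Things",
])


def fake_llm(prompt: str) -> str:
    return next(scripted_replies)


def fake_lookup(term: str) -> str:
    return {"IOT": "Internet of things"}.get(term, "no result")


print(run_react(fake_llm, {"lookup": fake_lookup}, "What is the full form of IOT?"))
# Internet of Things
```

The `max_steps` cap is a design choice in this sketch: without it, a model that never produces a final answer would loop indefinitely.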

In [48]:
agent.run("What is fullform of IOT?")



> Entering new AgentExecutor chain...
I need to know the fullform of IOT
Action: wikipedia
Action Input: IOT

Observation: Page: Internet of things
Summary: The Internet of things (IoT) describes devices with sensors, processing ability, software and other technologies that connect and exchange data with other devices and systems over the Internet or other communications networks. The Internet of things encompasses electronics, communication, and computer science engineering. "Internet of things" has been considered a misnomer because devices do not need to be connected to the public internet; they only need to be connected to a network and be individually addressable.
The field has evolved due to the convergence of multiple technologies, including ubiquitous computing, commodity sensors, and increasingly powerful embedded systems, as well as machine learning. Older fields of embedded systems, wireless sensor networks, control systems, automation (including home and building automation), independently and collectively enable the Internet of things.  In the consumer market, Io

'Internet of Things'

In [53]:
agent.run("What is India's Population in 2024?")



> Entering new AgentExecutor chain...
I need to find out India's population in 2024
Action: wikipedia
Action Input: India's population
Observation: Page: Demographics of India
Summary: India is the most populous country in the world with one-sixth of the world's population. 
According to estimates from the United Nations (UN), India has overtaken China as the country with the largest population in the world, with a population of 1,425,775,850 at the end of April 2023.
Between 1975 and 2010, the population doubled to 1.2 billion, reaching the billion mark in 2000. According to the UN's World Population dashboard, India's population now stands at slightly over 1.428 billion, edging past China's population of 1.425 billion people, as reported by the news agency Bloomberg. Its population is set to reach 1.7 billion by 2050. In 2017 its population growth rate was 0.98%, ranking 112th in the world; in contrast, from 1972 to 1983, India's population

'1.425,775,850'