### Notebook Summary

This notebook builds a question-answering system over the content of PDF documents using LangChain and Pinecone. The key steps are:

*   **Loading Documents:** Using `PyPDFDirectoryLoader` to load text content from PDF files in a specified directory.
*   **Text Splitting:** Employing `RecursiveCharacterTextSplitter` to break down the extracted text into smaller, manageable chunks for processing.
*   **Generating Embeddings:** Utilizing `OpenAIEmbeddings` to create vector representations of the text chunks. These embeddings capture the semantic meaning of the text.
*   **Initializing Pinecone:** Setting up a connection to the Pinecone vector database using your API key.
*   **Creating/Loading a Vector Store:** Using `PineconeVectorStore` to embed the chunks and upsert them into a Pinecone index, or to connect to an index that is already populated. The vector store supports efficient similarity search.
*   **Similarity Search:** Performing similarity searches on the vector store to find document chunks most relevant to a given query.
*   **Setting up a RetrievalQA Chain:** Configuring a `RetrievalQA` chain with a language model (`OpenAI`) and the vector store retriever. Setting `chain_type="stuff"` places all retrieved document chunks into a single prompt as context for the language model.
*   **Question Answering:** Using the `RetrievalQA` chain to answer questions based on the context provided by the retrieved document chunks.
*   **Interactive Chat Loop:** Implementing a simple command-line interface for interactive question answering with the built system.

### **Installing required packages**

In [None]:
# pypdf is required by PyPDFDirectoryLoader
!pip install -q langchain_pinecone langchain_openai langchain_community pinecone openai tiktoken pypdf --upgrade

### **Load the PDF Files**

In [None]:
# Create the folder (no error if it already exists), then upload your PDF files into it
!mkdir -p pdfs

### **Extract the Text from the PDFs**

In [None]:
from langchain_community.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("pdfs/")
text_data = loader.load()

In [None]:
print(text_data)

[Document(metadata={'producer': 'macOS Version 12.6.1 (Build 21G217) Quartz PDFContext', 'creator': 'Microsoft Word', 'creationdate': '2023-03-17T18:08:34+00:00', 'author': 'NVIDIA', 'keywords': 'Large Language Models, What is a large language model, llm, how do large language models work', 'moddate': '2023-03-17T11:30:56-07:00', 'subject': 'Large Language Models', 'title': 'A Beginner’s Guide to Large Language Models', 'source': 'pdfs/nvidia_llm.pdf', 'total_pages': 25, 'page': 0, 'page_label': '1'}, page_content='A Beginner’s Guide to \nLarge Language Models\nPart 1\nContributors:\nAnnamalai Chockalingam\nAnkur Patel\nShashank Verma\nTiffany Yeung'), Document(metadata={'producer': 'macOS Version 12.6.1 (Build 21G217) Quartz PDFContext', 'creator': 'Microsoft Word', 'creationdate': '2023-03-17T18:08:34+00:00', 'author': 'NVIDIA', 'keywords': 'Large Language Models, What is a large language model, llm, how do large language models work', 'moddate': '2023-03-17T11:30:56-07:00', 'subject

### **Split the Extracted Text into Chunks**

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=20,
    length_function=len
)

In [None]:
# Split the document into chunks
chunks = text_splitter.split_documents(text_data)
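
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding-window splitter. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to break on separators such as paragraphs and spaces). The function name `split_with_overlap` and the sample string are hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=20):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a simplified
    illustration, not RecursiveCharacterTextSplitter's separator-aware logic.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The shared characters (`hij`, `opq`, `vwx`) are what the overlap buys: a sentence cut at a window boundary still appears partly in both neighbours.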

In [None]:
import textwrap

def wrap_text(text, width):
    return textwrap.fill(text, width=width)

for index in range(4):
    print(f"Chunk_{index+1}\n\n{wrap_text(chunks[index].page_content, 80)}")
    print("\n================================================================\n")

Chunk_1

A Beginner’s Guide to  Large Language Models Part 1 Contributors: Annamalai
Chockalingam Ankur Patel Shashank Verma Tiffany Yeung


Chunk_2

A Beginner’s Guide to Large Language Models  2  Table of Contents  Preface .....
................................................................................
.................................................................. 3  Glossary .
................................................................................
..................................................................... 5


Chunk_3

Introduction to LLMs............................................................
...................................................................... 8  What
Are Large Language Models (LLMs)? ..............................................
............................................ 8  Foundation Language Models vs.
Fine-Tuned Language Models
...................................................... 11


Chunk_4

Evolution of Large Language

In [None]:
len(chunks)

128

### **Setting Up the Embedding Model**

In [None]:
import openai

# Store your OpenAI API key in this notebook's Secrets section
# and turn notebook access on for it.
from google.colab import userdata
openai.api_key = userdata.get('OPENAI_API_KEY')

In [None]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=openai.api_key)

In [None]:
result = embeddings.embed_query("Hello, world!")

In [None]:
len(result)

1536

### **Initializing Pinecone**

In [None]:
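import os

# Serverless Pinecone needs only an API key (no environment name).
PINECONE_API_KEY = userdata.get('PINECONE_API_KEY')
# PineconeVectorStore looks the key up in this environment variable.
os.environ["PINECONE_API_KEY"] = PINECONE_API_KEY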

In [None]:
from pinecone import Pinecone

pc = Pinecone(api_key=PINECONE_API_KEY)

index_name = "pinecone-1536-cosine"
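
The cells that follow assume the index named above already exists. If it does not, a sketch along these lines could create it first. The helper name `ensure_index` is hypothetical, and the dimension (1536, matching the embedding length printed earlier), the cosine metric, and the cloud and region in the usage comment are assumptions inferred from the index name rather than settings confirmed by the notebook.

```python
def ensure_index(pc, name, dimension, metric, spec):
    """Create the index only if the client does not already list it.

    Returns True when a new index was created, False when it already existed.
    """
    if name in pc.list_indexes().names():
        return False  # index is already there, nothing to do
    pc.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    return True


# Hypothetical usage with the objects defined above (cloud/region are assumptions):
# from pinecone import ServerlessSpec
# ensure_index(pc, index_name, dimension=1536, metric="cosine",
#              spec=ServerlessSpec(cloud="aws", region="us-east-1"))
```

Passing the client and spec in as arguments keeps the helper free of credentials and lets it be re-run safely.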

### **Create Embeddings for each Chunk**

In [None]:
from langchain_pinecone import PineconeVectorStore

# Embed every chunk and upsert it into the Pinecone index (vector store)
vectordb = PineconeVectorStore.from_texts(
    [t.page_content for t in chunks],
    embedding=embeddings,
    index_name=index_name
)

### **If you already have an index, you can load it like this**

In [None]:
vectordb = PineconeVectorStore.from_existing_index(
    index_name=index_name,
    embedding=embeddings
)

### **Similarity Search**

In [None]:
query1 = "What is an Large Language Model?"
query2 = "Why are Large Language Models useful?"

In [None]:
# k = number of similar documents to return
similar_docs = vectordb.similarity_search(query2, k=3)
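
Conceptually, a similarity search embeds the query and returns the `k` stored vectors that score highest against it. The sketch below shows that idea as a brute-force scan with cosine similarity over tiny hand-made vectors. It is an illustration only: Pinecone uses its own indexing rather than a linear scan, and the names `cosine_similarity`, `top_k`, `toy_docs`, and `toy_query` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best score first
    return [i for _, i in scored[:k]]


# Toy 2-dimensional "embeddings"; real ones here have 1536 dimensions.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
toy_query = [1.0, 0.1]
nearest = top_k(toy_query, toy_docs, k=2)
print(nearest)  # [0, 2]
```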

In [None]:
for index, doc in enumerate(similar_docs, start=1):
    print(f"doc_{index}\n\n{wrap_text(doc.page_content, 80)}")
    print("\n================================================================\n")

doc_1

A Beginner’s Guide to Large Language Models 20    How Enterprises Can Benefit
From Using  Large Language Models  Enterprises need to tackle language-related
tasks every day. This includes  more obvious text tasks, such as writing emails
or generating content, but also tasks like analyzing  patient data for health
risks or providing companionship to customers. All of these tasks can be
automated using large language models.


doc_2

A Beginner’s Guide to Large Language Models 20    How Enterprises Can Benefit
From Using  Large Language Models  Enterprises need to tackle language-related
tasks every day. This includes  more obvious text tasks, such as writing emails
or generating content, but also tasks like analyzing  patient data for health
risks or providing companionship to customers. All of these tasks can be
automated using large language models.


doc_3

A Beginner’s Guide to Large Language Models 9    Although all language models
can perform NLP tasks, they differ in other
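
Note that `doc_1` and `doc_2` above are identical. One plausible cause (an assumption, not something the notebook confirms) is that the upload cell was run more than once, so the same chunks were stored under different auto-generated IDs. A minimal sketch of one way to avoid this is to derive each ID from the chunk text, so that re-uploading overwrites instead of duplicating. The helper `chunk_id` and the sample texts are hypothetical.

```python
import hashlib


def chunk_id(text):
    """Stable ID: the same chunk text always maps to the same hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]


texts = ["first chunk", "second chunk", "first chunk"]
unique = {chunk_id(t): t for t in texts}  # identical texts collapse onto one ID
print(len(texts), "->", len(unique))  # 3 -> 2

# Hypothetical use with the notebook's objects (explicit ids passed to from_texts):
# PineconeVectorStore.from_texts(list(unique.values()), embedding=embeddings,
#                                index_name=index_name, ids=list(unique.keys()))
```

The trade-off is that two chunks with exactly the same text on different pages would also collapse into one entry, which may or may not be what you want.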

### **Creating LLM Wrapper for Structured Answer**

In [None]:
from langchain_openai import OpenAI

llm = OpenAI(temperature=0, openai_api_key=openai.api_key)

In [None]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectordb.as_retriever()
)
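
As a rough picture of what `chain_type="stuff"` does, the sketch below joins the retrieved chunks into one context block and places it in a single prompt together with the question. The prompt wording and the function name `build_stuff_prompt` are made up for illustration and are not the template LangChain actually uses.

```python
def build_stuff_prompt(question, chunk_texts):
    """Place every retrieved chunk into one prompt, followed by the question."""
    context = "\n\n".join(chunk_texts)  # all chunks "stuffed" into one block
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt("What is an LLM?", ["Chunk A text.", "Chunk B text."])
print(prompt)
```

Because everything goes into one prompt, this approach only works while the combined chunks fit within the model's context window.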

### **Question/Answering**

In [None]:
qa.invoke(query1)

{'query': 'What is an Large Language Model?',
 'result': ' A large language model is a type of artificial intelligence system that is capable of generating human-like text based on the patterns and relationships it learns from vast amounts of data. It uses deep learning to analyze and process large sets of data, such as books, articles, and web pages.'}

In [None]:
qa.invoke(query2)

{'query': 'Why are Large Language Models useful?',
 'result': ' Large Language Models are useful because they can automate various language-related tasks, such as writing emails, generating content, analyzing data, and providing customer companionship. They are considered large in size because they are trained using large amounts of data and have a huge number of learnable parameters, making them more accurate and efficient in performing tasks on new or never-before-seen data.'}

### **Interactive Loop for QA**

In [None]:
def chat_loop(qa_chain):
    print("🤖 Chatbot ready! Type 'exit' to quit.\n")

    while True:
        try:
            user_input = input("You: ").strip()

            if user_input.lower() in {"exit", "quit"}:
                print("👋 Exiting chatbot. Bye!")
                break

            if not user_input:
                continue  # skip empty inputs

            result = qa_chain.invoke({"query": user_input})
            answer = result.get("result", "⚠️ Sorry, I couldn't generate an answer.")

            print(f"Bot: {answer}\n")

        except KeyboardInterrupt:
            print("\n🛑 Interrupted. Exiting chatbot.")
            break
        except Exception as e:
            print(f"⚠️ Error: {e}\n")

# Usage
chat_loop(qa)

🤖 Chatbot ready! Type 'exit' to quit.

You: What is the use of an LLM?
Bot:  The use of an LLM is to unlock cutting-edge possibilities and revolutionize operations for enterprises.

You: Who developed the LLM?
Bot:  The LLM was developed by various companies and startups in the LLM field.

You: 
You: exit
👋 Exiting chatbot. Bye!
