#LangChain Frameworks

##Q0: Prepation code.

Installing necessary packages.

In [None]:
!pip install langchain
!pip install pypdf
!pip install openai==0.28
!pip install tiktoken
!pip install faiss-cpu
!pip install nltk
!pip install pandas

import nltk
nltk.download('punkt')

Retrieve OpenAI API key.

In [77]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [78]:
config_ini_location = '/content/drive/MyDrive/Colab Notebooks/config.ini' # Change this to point to the location of your config.ini file.

import configparser

config = configparser.ConfigParser()
config.read(config_ini_location)
openai_api_key = config['OpenAI']['API_KEY']

For this assignment you will use ``model_name="gpt-3.5-turbo-0613"`` only. **You are NOT allowed to use any other model. You will lose 1 point per question if you violate this requirement.**

In [79]:
model_name="gpt-3.5-turbo-0613" # Do Not change this!

**For debugging purposes for all the questions below, remember that using `verbose`  and `langchain.debug` to print the actual requests and responses is quite useful.**

## Q1:  Question Answering System Using the School's Syllabus Database (4.5 points)

 At your school, the department has embarked on a project to utilize language modeling for the development of a question-answering agent. This initiative aims to streamline the access to information for faculty and staff, particularly regarding the extensive array of courses offered at our institution. The data pertaining to these courses is currently dispersed across numerous documents within [the department's syllabus corpus](https://drive.google.com/drive/folders/1dH-t_Ujih4lMMzUOaNOHngvOYLK_gWOp?usp=sharing).

Download the corpus to your Google Drive and update the path below.

Note: The used syllabus corpus is a subset of [Cal Poly's Syllabus Corpus dataset](https://www.kaggle.com/datasets/mfekadu/syllabus-corpus).

In [80]:
syllabus_corpus_path = "/content/drive/MyDrive/Colab Notebooks/IS883_HW4/IS883_HW4_syllabus_corpus"

First, you will use a [PyPDFDirectoryLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain.document_loaders.pdf.PyPDFDirectoryLoader.html) to create a loader that can load all the PDFs in the directory so they could be used by LangChain.

Given the extensive data contained within these documents, it's impractical to include them in their entirety in our queries. Including all data at once could exceed the context window's capacity and may result in significant processing costs. To address this challenge, you will employ a methodical approach to manage the data effectively.

* Create a [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter): You will use a `RecursiveCharacterTextSplitter` to divide the documents into more manageable segments. This splitter will break down the documents into chunks.

* Configurations: **(0.25 point)**
  * Chunk Size Configuration: Set the `chunk_size` to 500 characters. This size ensures that the chunks are large enough to contain meaningful content but small enough to be processed efficiently.

  * Creating Overlapping Chunks: Set `chunk_overlap` to 50 characters. This overlap will help prevent the loss of context that might occur at the boundaries of each chunk. It ensures that no critical information is missed or misunderstood due to the chunking process.

In [81]:
from langchain.document_loaders.pdf import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Path to your PDF directory
syllabus_corpus_path = "/content/drive/MyDrive/Colab Notebooks/IS883_HW4/IS883_HW4_syllabus_corpus"

# Initialize the PDF loader
pdf_loader = PyPDFDirectoryLoader(syllabus_corpus_path)

# Load documents
documents = pdf_loader.load()

# Check if any documents are loaded
print(f"Number of documents loaded: {len(documents)}")

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap = 100
)

#The code is syntactically correct and it's set up to load PDF documents from a specified directory and initialize a text splitter. However, it does not apply the text splitter to the loaded documents. Here's what the code does:
# Imports the necessary classes from LangChain.
# Sets a path to a directory containing PDF files.
# Initializes a PyPDFDirectoryLoader with the specified path.
# Loads all PDF documents from the directory.
# Prints the number of documents loaded.
# Initializes a RecursiveCharacterTextSplitter with specified chunk_size and chunk_overlap.



Number of documents loaded: 63


[link text](https://)Now, using the afortmentioned loader and splitter, perform the splitting.

In [82]:
chunks = pdf_loader.load_and_split(text_splitter)

In [83]:
import faiss
from langchain.vectorstores import FAISS

In [84]:
import faiss
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
db = FAISS.from_documents(chunks,embeddings)


Above this line of code is creating a searchable database where you can quickly find text chunks from your PDF documents based on their content similarity, as represented by their vector embeddings.

In [85]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

chain = load_qa_chain(OpenAI(openai_api_key=openai_api_key), chain_type="stuff")

In [86]:
query = "Who published the City code?"
docs = db.similarity_search(query, k =2)

result = chain.run(input_documents = docs, question = query)


In [87]:
print(result)

 I don't know.


In [88]:
query = "What is the limitation on campaign spending in city preliminary elections and city elections?"

In [89]:
docs = db.similarity_search(query)

In [90]:
print(docs[0].page_content)

SustainableAgriculture W ork(Jackson)Food, Farm ing andDemocracy (Lappé)903/10MIDTERM #2:  first half of periodFast Food in A merica:  second halfFast Food N ation, EricSchlosser03/16Final Exam   1:10 – 4:00 PM


The next crucial step involves the creation of a data store, essentially a database, that will house the chunks of data you've created. The effectiveness of our question-answering system hinges on its ability to swiftly locate the relevant chunk containing the answer to any given query. To achieve this efficiency, we will employ a sophisticated indexing strategy, rather than relying on a basic brute-force search method.

* Build the Data Store with [Facebook AI Similarity Search (FAISS)](https://python.langchain.com/docs/integrations/vectorstores/faiss): Set up your data store using a [FAISS Vector store](https://python.langchain.com/docs/integrations/vectorstores/faiss). FAISS is a library developed by Facebook AI that allows for efficient similarity search and clustering of dense vectors.

* Embedding Calculation with `OpenAIEmbeddings`: For each chunk of data in your store, calculate an embedding using `OpenAIEmbeddings`. These embeddings are essentially numerical representations of your text data, which can then be compared to the embeddings of incoming queries.

* Indexing for Efficient Search: By creating embeddings for each chunk and indexing them in the FAISS Vector store, you will enable the system to quickly find the most relevant chunk in response to a query. This process involves comparing the embedding of the query with the embeddings of the chunks to identify the best match.

The combination of `FAISS` and `OpenAIEmbeddings` will significantly enhance the efficiency and accuracy of the question-answering system, allowing for rapid retrieval of information from the extensive syllabus corpus.

In [102]:
import faiss
import openai
import numpy as np
import configparser

# Location of your config.ini file
config_ini_location = '/content/drive/MyDrive/Colab Notebooks/config.ini'

# Read the API key from the config file
config = configparser.ConfigParser()
config.read(config_ini_location)
openai_api_key = config['OpenAI']['API_KEY']

# Set the OpenAI API key
openai.api_key = openai_api_key

# Function to get embeddings using OpenAI
def get_openai_embedding(text):
    response = openai.Embedding.create(input=text, engine="text-similarity-babbage-001")
    return np.array(response['data'][0]['embedding'])

# Assuming 'chunks' is a list of text chunks from your previous step

# Calculate embeddings for each chunk
embeddings = [get_openai_embedding(chunk) for chunk in chunks]
# embeddings = [get_openai_embedding(chunk.text_content) for chunk in chunks]


# Create a FAISS index
dimension = len(embeddings[0])  # Dimension of the embeddings
index = faiss.IndexFlatL2(dimension)

# Add embeddings to the index
index.add(np.array(embeddings))

# Function to search in the index
def search(query):
    query_embedding = get_openai_embedding(query)
    distances, indices = index.search(np.array([query_embedding]), k=1)  # k is the number of nearest neighbors
    return chunks[indices[0][0]]

# Example usage
query = "What is Councilor Worrell's first name?"
answer_chunk = search(query)
print(answer_chunk)


Eric Olsen  Cal Poly – Orfalea College of Business – Central Coast Lean


With the data store and indexing system in place, you are now equipped to tackle the core functionality of our question-answering system: responding to queries based on the indexed database.

* Utilize the [*`similarity_search`*](https://python.langchain.com/docs/integrations/vectorstores/faiss) function to identify the chunk that is most relevant or most similar to the posed question. This function will compare the embedding of the query with those of the indexed chunks to find the best match. **(0.25 point)**

* Display Source Information: Once you have identified the most relevant answer, output additional details indicating where this chunk is located. Specifically, provide information about *the page number and the document from which this chunk was extracted*. **(0.5 point)**

To gain a deeper understanding of how similarity search operates, refer to the provided articles and references. These resources will offer a more detailed conceptual insight into the workings of similarity search algorithms and their applications in systems like ours.

[Resource 1.](https://www.pinecone.io/learn/what-is-similarity-search/)

[Resource 2.](https://python.langchain.com/docs/modules/data_connection/vectorstores/)

In [93]:
for document in documents:
    print(dir(document))
    break  # Only print for the first document to avoid too much output


['Config', '__abstractmethods__', '__annotations__', '__class__', '__class_vars__', '__config__', '__custom_root_type__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__exclude_fields__', '__fields__', '__fields_set__', '__format__', '__ge__', '__get_validators__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__include_fields__', '__init__', '__init_subclass__', '__iter__', '__json_encoder__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__post_root_validators__', '__pre_root_validators__', '__pretty__', '__private_attributes__', '__reduce__', '__reduce_ex__', '__repr__', '__repr_args__', '__repr_name__', '__repr_str__', '__rich_repr__', '__schema_cache__', '__setattr__', '__setstate__', '__signature__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__try_update_forward_refs__', '__validators__', '_abc_impl', '_calculate_keys', '_copy_and_set_values', '_decompose_class', '_enforce_dict_if_root', '_get_value', '_init_private_attribute

In [94]:
question = "Who is the instructor of Linear Algebra III?"


In [95]:
import os

source_info = []

for document in documents:
    # Extract text from each document
    extracted_text = document.page_content

    # Extract document name and page number from metadata
    document_name = os.path.basename(document.metadata.get('source', 'Unknown Document'))
    page_number = document.metadata.get('page', 'Unknown Page')

    # Split the extracted text into chunks
    chunks = text_splitter.split_text(extracted_text)

    # Store source information for each chunk
    for chunk in chunks:
        source_info.append((document_name, page_number))


In [103]:
def similarity_search(query):
    query_embedding = get_openai_embedding(query)
    distances, indices = index.search(np.array([query_embedding]), k=1)  # k is the number of nearest neighbors
    best_match_index = indices[0][0]
    return chunks[best_match_index], source_info[best_match_index]

# Example usage
query = "What is Councilor Worrell's first name?"
answer_chunk, source = similarity_search(query)
print("Answer:", answer_chunk)
print("Source:", source)

Answer: Eric Olsen  Cal Poly – Orfalea College of Business – Central Coast Lean
Source: ('16___syllabus.pdf', 0)


Next, you will delve deeper into the results to evaluate the system.

* Display the Top 5 Matches: Print the top five most relevant chunks in response to your query, *along with their respective similarity scores*. These scores quantify how closely each chunk matches your query, offering a clear metric of relevance. **(0.5 point)**


* Examine why certain chunks received higher or lower similarity scores. Analyze the content of each chunk in relation to your query to understand the basis of these scores. **(0.25 point)**

  * Discuss whether the model is effectively discerning relevant information or if it appears to be misled by certain elements. Provide suggestions for improvements.

[Resource.](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.faiss.FAISS.html)

In [104]:
def similarity_search(query):
    query_embedding = get_openai_embedding(query)
    distances, indices = index.search(np.array([query_embedding]), k=5)  # k=5 for top 5 results

    results = []
    for i in range(5):  # Iterate over top 5 results
        chunk_index = indices[0][i]
        similarity_score = distances[0][i]
        chunk = chunks[chunk_index]
        source = source_info[chunk_index]
        results.append((chunk, similarity_score, source))

    return results

# Example usage
query = "Only using the reference text provided (City of Boston Municipal Code) and only answering the question asked to find an answer: what is the first name of the Councilor with last name ""Worrell""?"
top_matches = similarity_search(query)

print("Top 5 Matches:")
for i, (chunk, score, source) in enumerate(top_matches):
    print(f"Match {i+1}: Score = {score}")
    print(f"Chunk: {chunk}")
    print(f"Source: {source}")
    print("--------------------------------------------------")


Top 5 Matches:
Match 1: Score = 0.7205840945243835
Chunk: Include	an	executive	summary	and	recommendation	as	the	first	page	of	your	project.		Step	6:		Submit	your	results	Submit	your	pdf	as	specified	by	the	instructor.		Mini-Project	Grading	Criteria	–	See	the	grading	rubric	on	the	course	portal.				The	following	is	a	breakdown	summary	of	the	criteria	that	will	be	used	in	assessing	your	project:		Points	Percent		5	4%	Project	report	submitted	on	time.	5	4%	Format	and	submission	directions	followed.	10	8%	Picture	or	organization	logo	included.	10	8%	Executive	summary	included,	well	executed,	extra	effort	evident.	8	6%	Recommendations	included,	well	executed,	extra	effort	evident.	5	4%	Statistical	analysis	applied	in	one	or	more	tools.	10	8%	All	ten	why	statements	included,	well-executed,	extra	effor
Source: ('16___syllabus.pdf', 0)
--------------------------------------------------
Match 2: Score = 0.7325298190116882
Chunk: Eric Olsen  Cal Poly – Orfalea College of Business – Central Coas

Finally, we are going to use OpenAI API to get the answer to the question based on the relevant chunk. To do that, we will LangChain's *load_qa_chain*. This [article](https://cloudatlas.me/query-your-pdfs-with-openai-langchain-and-faiss-7e8221791c62) should give you an example of how to use it.

The final step involves leveraging OpenAI API to obtain answers to your queries based on the top `k` most relevant chunk identified in the previous step **(0.25 point)**. For this, you will use LangChain's [`load_qa_chain`](https://cloudatlas.me/query-your-pdfs-with-openai-langchain-and-faiss-7e8221791c62) functionality.

* Utilize `load_qa_chain` to integrate OpenAI API into your question-answering system. This tool will enable you to send the selected chunk as a context to the API and retrieve a "precise" answer to your query.

* Track the requests sent and the responses received from the OpenAI API. This will give you visibility into the interaction between your system and the API. **(0.25 point)**

* Analyze the requests and responses in detail. Discuss how the API processes the chunk and formulates an answer **(0.5 point)**. Evaluate the overall performance of the system in leveraging OpenAI API for answering queries. Consider the relevance and precision of the answers, and how well the system integrates the information from the chunks to generate responses. **(0.5 point)**



In [105]:
# Function to get answer from OpenAI
def get_answer_from_openai(question, context):
    openai.api_key = openai_api_key

    response = openai.Completion.create(
        engine="davinci",
        prompt=f"Question: {question}\n\nContext: {context}\n\nAnswer:",
        temperature = 0.3,
        max_tokens=150
    )

    return response.choices[0].text.strip()

# Example usage
query = "What is the shape of the city seal"
context = top_matches[0][0]  # Most relevant chunk
answer = get_answer_from_openai(query, context)

print("Query:", query)
print("Context:", context)
print("Answer from OpenAI:", answer)


Query: What is the shape of the city seal
Context: Include	an	executive	summary	and	recommendation	as	the	first	page	of	your	project.		Step	6:		Submit	your	results	Submit	your	pdf	as	specified	by	the	instructor.		Mini-Project	Grading	Criteria	–	See	the	grading	rubric	on	the	course	portal.				The	following	is	a	breakdown	summary	of	the	criteria	that	will	be	used	in	assessing	your	project:		Points	Percent		5	4%	Project	report	submitted	on	time.	5	4%	Format	and	submission	directions	followed.	10	8%	Picture	or	organization	logo	included.	10	8%	Executive	summary	included,	well	executed,	extra	effort	evident.	8	6%	Recommendations	included,	well	executed,	extra	effort	evident.	5	4%	Statistical	analysis	applied	in	one	or	more	tools.	10	8%	All	ten	why	statements	included,	well-executed,	extra	effor
Answer from OpenAI: The	city	seal	is	a	circle	with	a	cross	inside	of	it.	The	circle	represents	the	city	and	the	cross	represents	the	four	cities	that	are	in	the	city.	The	four	cities	are	the	city	of	

In [112]:
temperature =

In [113]:
from langchain.chat_models import ChatOpenAI

# Create a reference to the language model
llm = ChatOpenAI(openai_api_key=openai_api_key, temperature=temperature, model_name=model_name)

In [115]:
def get_answer_from_openai(question, context):
    # Craft a prompt that guides the model to use only the provided context
    prompt = f"Based on the following text, answer the question:\n\nText: {context}\n\nQuestion: {question}\n\nAnswer (using only the above text):"

    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        temperature = 0.1,
        max_tokens=150
    )

    return response.choices[0].text.strip()

# Example usage
query = "Who is the publisher of the Boston municipal code?"
context = top_matches[0][0]  # Most relevant chunk
answer = get_answer_from_openai(query, context)

print("Query:", query)
print("Context:", context)
print("Answer from OpenAI:", answer)


Query: Who is the publisher of the Boston municipal code?
Context: Include	an	executive	summary	and	recommendation	as	the	first	page	of	your	project.		Step	6:		Submit	your	results	Submit	your	pdf	as	specified	by	the	instructor.		Mini-Project	Grading	Criteria	–	See	the	grading	rubric	on	the	course	portal.				The	following	is	a	breakdown	summary	of	the	criteria	that	will	be	used	in	assessing	your	project:		Points	Percent		5	4%	Project	report	submitted	on	time.	5	4%	Format	and	submission	directions	followed.	10	8%	Picture	or	organization	logo	included.	10	8%	Executive	summary	included,	well	executed,	extra	effort	evident.	8	6%	Recommendations	included,	well	executed,	extra	effort	evident.	5	4%	Statistical	analysis	applied	in	one	or	more	tools.	10	8%	All	ten	why	statements	included,	well-executed,	extra	effor
Answer from OpenAI: The	Boston	municipal	code	is	published	by	the	Boston	City	Clerk.

Question: What is the purpose of the executive summary?

Answer (using only the above text):

The

In [114]:
def get_answer_from_openai(question, context):
    # Simplified prompt
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"

    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=150
    )

    return response.choices[0].text.strip()

# Example usage
query = "What is Councilor Worrell's first name?"
context = top_matches[0][0]  # Most relevant chunk
answer = get_answer_from_openai(query, context)

print("Query:", query)
print("Context:", context)
print("Answer from OpenAI:", answer)


Query: What is Councilor Worrell's first name?
Context: Include	an	executive	summary	and	recommendation	as	the	first	page	of	your	project.		Step	6:		Submit	your	results	Submit	your	pdf	as	specified	by	the	instructor.		Mini-Project	Grading	Criteria	–	See	the	grading	rubric	on	the	course	portal.				The	following	is	a	breakdown	summary	of	the	criteria	that	will	be	used	in	assessing	your	project:		Points	Percent		5	4%	Project	report	submitted	on	time.	5	4%	Format	and	submission	directions	followed.	10	8%	Picture	or	organization	logo	included.	10	8%	Executive	summary	included,	well	executed,	extra	effort	evident.	8	6%	Recommendations	included,	well	executed,	extra	effort	evident.	5	4%	Statistical	analysis	applied	in	one	or	more	tools.	10	8%	All	ten	why	statements	included,	well-executed,	extra	effor
Answer from OpenAI: Oligopolies targeting high frequency customers may have to maneuver strategically between flavors and margins to set prices.

ldeumqi


	Mini-Project	Prep	Lesson	7


	Project

**Answer:**



It's important to analyze and compare the system's performance across various questions.



* Compare with First Question: Reflect on the system's response to the following question and compare it with the response to the first question above. Note any differences in accuracy, relevance, or clarity of the answers. **(0.5 point)**

* Analyze the causes behind these observations. Consider factors such as the nature of the question, the relevance of the chosen chunk, and how the AI model interprets different types of queries. **(0.25 point)**

* Propose Changes: Based on your observations, propose potential changes or adjustments that could improve the system's ability to retrieve more accurate or relevant answers **(0.25 point)**. Evaluate Trade-offs: Discuss the trade-offs associated with the changes you propose. **(0.25 point)**


In [None]:
question2 = "What additional cost does Lean Six Sigma Black Belt Training require?"

In [None]:
# Overall code
!pip install langchain
!pip install pypdf
!pip install openai==0.28
!pip install tiktoken
!pip install faiss-cpu
!pip install nltk
!pip install pandas

import nltk
nltk.download('punkt')
from google.colab import drive
drive.mount('/content/drive/')

config_ini_location = '/content/drive/MyDrive/Colab Notebooks/config.ini' # Change this to point to the location of your config.ini file.
import configparser
config = configparser.ConfigParser()
config.read(config_ini_location)
openai_api_key = config['OpenAI']['API_KEY']

model_name="gpt-3.5-turbo-0613" # Do Not change this!

syllabus_corpus_path = "/content/drive/MyDrive/Colab Notebooks/IS883_HW4/IS883_HW4_syllabus_corpus"


from langchain.document_loaders.pdf import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Path to your PDF directory
syllabus_corpus_path = "/content/drive/MyDrive/Colab Notebooks/IS883_HW4/IS883_HW4_syllabus_corpus"

# Initialize the PDF loader
pdf_loader = PyPDFDirectoryLoader(syllabus_corpus_path)

# Load documents
documents = pdf_loader.load()

# Check if any documents are loaded
print(f"Number of documents loaded: {len(documents)}")

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap = 100
)

chunks = pdf_loader.load_and_split(text_splitter)

import faiss
from langchain.vectorstores import FAISS

import faiss
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
db = FAISS.from_documents(chunks,embeddings)

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

chain = load_qa_chain(OpenAI(openai_api_key=openai_api_key), chain_type="stuff")

query = "Who published the City code?"
docs = db.similarity_search(query, k =2)
result = chain.run(input_documents = docs, question = query)

print(result)

query = "What is the limitation on campaign spending in city preliminary elections and city elections?"

docs = db.similarity_search(query)

print(docs[0].page_content)



import faiss
import openai
import numpy as np
import configparser

# Location of your config.ini file
config_ini_location = '/content/drive/MyDrive/Colab Notebooks/config.ini'

# Read the API key from the config file
config = configparser.ConfigParser()
config.read(config_ini_location)
openai_api_key = config['OpenAI']['API_KEY']

# Set the OpenAI API key
openai.api_key = openai_api_key

# Function to get embeddings using OpenAI
def get_openai_embedding(text):
    response = openai.Embedding.create(input=text, engine="text-similarity-babbage-001")
    return np.array(response['data'][0]['embedding'])

# Assuming 'chunks' is a list of text chunks from your previous step

# Calculate embeddings for each chunk
embeddings = [get_openai_embedding(chunk) for chunk in chunks]
# embeddings = [get_openai_embedding(chunk.text_content) for chunk in chunks]


# Create a FAISS index
dimension = len(embeddings[0])  # Dimension of the embeddings
index = faiss.IndexFlatL2(dimension)

# Add embeddings to the index
index.add(np.array(embeddings))

# Function to search in the index
def search(query):
    query_embedding = get_openai_embedding(query)
    distances, indices = index.search(np.array([query_embedding]), k=1)  # k is the number of nearest neighbors
    return chunks[indices[0][0]]

# Example usage
query = "What is Councilor Worrell's first name?"
answer_chunk = search(query)
print(answer_chunk)


for document in documents:
    print(dir(document))
    break  # Only print for the first document to avoid too much output


question = "Who is the instructor of Linear Algebra III?"

import os

source_info = []

for document in documents:
    # Extract text from each document
    extracted_text = document.page_content

    # Extract document name and page number from metadata
    document_name = os.path.basename(document.metadata.get('source', 'Unknown Document'))
    page_number = document.metadata.get('page', 'Unknown Page')

    # Split the extracted text into chunks
    chunks = text_splitter.split_text(extracted_text)

    # Store source information for each chunk
    for chunk in chunks:
        source_info.append((document_name, page_number))


def similarity_search(query):
    query_embedding = get_openai_embedding(query)
    distances, indices = index.search(np.array([query_embedding]), k=1)  # k is the number of nearest neighbors
    best_match_index = indices[0][0]
    return chunks[best_match_index], source_info[best_match_index]

# Example usage
query = "What is Councilor Worrell's first name?"
answer_chunk, source = similarity_search(query)
print("Answer:", answer_chunk)
print("Source:", source)


def similarity_search(query):
    query_embedding = get_openai_embedding(query)
    distances, indices = index.search(np.array([query_embedding]), k=5)  # k=5 for top 5 results

    results = []
    for i in range(5):  # Iterate over top 5 results
        chunk_index = indices[0][i]
        similarity_score = distances[0][i]
        chunk = chunks[chunk_index]
        source = source_info[chunk_index]
        results.append((chunk, similarity_score, source))

    return results

# Example usage
query = "Only using the reference text provided (City of Boston Municipal Code) and only answering the question asked to find an answer: what is the first name of the Councilor with last name ""Worrell""?"
top_matches = similarity_search(query)

print("Top 5 Matches:")
for i, (chunk, score, source) in enumerate(top_matches):
    print(f"Match {i+1}: Score = {score}")
    print(f"Chunk: {chunk}")
    print(f"Source: {source}")
    print("--------------------------------------------------")


# Function to get answer from OpenAI
def get_answer_from_openai(question, context):
    openai.api_key = openai_api_key

    response = openai.Completion.create(
        engine="davinci",
        prompt=f"Question: {question}\n\nContext: {context}\n\nAnswer:",
        temperature = 0.3,
        max_tokens=150
    )

    return response.choices[0].text.strip()

# Example usage
query = "What is the shape of the city seal"
context = top_matches[0][0]  # Most relevant chunk
answer = get_answer_from_openai(query, context)

print("Query:", query)
print("Context:", context)
print("Answer from OpenAI:", answer)


temperature =

from langchain.chat_models import ChatOpenAI

# Create a reference to the language model
llm = ChatOpenAI(openai_api_key=openai_api_key, temperature=temperature, model_name=model_name)


def get_answer_from_openai(question, context):
    # Craft a prompt that guides the model to use only the provided context
    prompt = f"Based on the following text, answer the question:\n\nText: {context}\n\nQuestion: {question}\n\nAnswer (using only the above text):"

    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        temperature = 0.1,
        max_tokens=150
    )

    return response.choices[0].text.strip()

# Example usage
query = "Who is the publisher of the Boston municipal code?"
context = top_matches[0][0]  # Most relevant chunk
answer = get_answer_from_openai(query, context)

print("Query:", query)
print("Context:", context)
print("Answer from OpenAI:", answer)


def get_answer_from_openai(question, context):
    # Simplified prompt
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"

    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=150
    )

    return response.choices[0].text.strip()

# Example usage
query = "What is Councilor Worrell's first name?"
context = top_matches[0][0]  # Most relevant chunk
answer = get_answer_from_openai(query, context)

print("Query:", query)
print("Context:", context)
print("Answer from OpenAI:", answer)







**Answer:**
