# Prompt Engineering
This lab is based upon https://realpython.com/build-llm-rag-chatbot-with-langchain/ with the following caveats: 
1. *Versions are important.* Prompts for one version of ChatGPT might not work with later or earlier versions. 
2. *Watch out for context buffer length.* Overflowing the context buffer can have unexpected results.
3. *Use of LLMs to verify work is optional.* There are no truly free LLMs. You may have to spend money to verify your results. Spending money is optional.
4. As of writing, *Google Gemini* is the LLM most likely to allow free testing of your prompt.
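
Caveat 2 can be checked before a prompt is sent. The sketch below is a rough guard, assuming about four characters per token (a rule of thumb, not an exact tokenizer count) and a caller-chosen token budget. `estimate_tokens` and `truncate_to_budget` are illustrative names, not part of any library.

```python
CHARS_PER_TOKEN = 4  # assumption: rough average for English text


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def truncate_to_budget(context: str, max_tokens: int = 8000) -> str:
    """Cut the context so its estimated token count fits the budget."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    return context if len(context) <= max_chars else context[:max_chars]


page_text = "word " * 10_000  # stand-in for scraped page text
trimmed = truncate_to_budget(page_text, max_tokens=1000)
print(estimate_tokens(page_text), estimate_tokens(trimmed))
```

Truncating from the end is the simplest policy; retrieving only the relevant shards, as in the later questions, is usually a better way to stay within the budget.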

Study the tutorial above to get the basic idea. Then answer the following questions.

__Question 1:__ Using the strategy in the article, design a ChatGPT prompt that allows answering questions about the BSDS degree as documented here: https://engineering.tufts.edu/cs/current-students/undergraduate/bachelor-science-data-science. Display the proposed prompt below, leaving a placeholder for the question at the end. Optionally, run the prompt through an LLM and display the results of answering a specific question. 

In [37]:
#prompt template
bsds_template_str = """
Behave as though you are an academic advisor for the BSDS at Tufts University.
Your job is to answer questions about the Bachelor of Science in Data Science (BSDS) program.
Use only the provided context to answer the question.
Be as detailed as possible, but do not make up any information.
If the answer is not in the context, say "I don't know."

{context}

Question: {question}
"""
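
Before any library is involved, the two placeholders can be filled with plain `str.format`, which is a quick way to inspect the exact text an LLM would receive. The context and question below are made-up stand-ins, not text from the Tufts page.

```python
bsds_template_str = """
Behave as though you are an academic advisor for the BSDS at Tufts University.
Your job is to answer questions about the Bachelor of Science in Data Science (BSDS) program.
Use only the provided context to answer the question.
Be as detailed as possible, but do not make up any information.
If the answer is not in the context, say "I don't know."

{context}

Question: {question}
"""

# Placeholder values for illustration only.
sample_context = "PLACEHOLDER: program requirements text would go here."
sample_question = "What are the core requirements?"

filled = bsds_template_str.format(context=sample_context, question=sample_question)
print(filled)
```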

*Optionally, enter an example query and the response generated by an LLM. Tell me which one you used:*

__Question 2:__ Based upon the article, install `LangChain` and develop a LangChain template that embeds the user's question into your prompt. Demonstrate this template on an arbitrary question. 

In [38]:
from langchain.prompts import ChatPromptTemplate
from langchain.document_loaders import WebBaseLoader
import google.generativeai as genai

#define "context" for answering questions as the BSDS website
loader = WebBaseLoader("https://engineering.tufts.edu/cs/current-students/undergraduate/bachelor-science-data-science")
documents = loader.load()
context = "\n\n".join([doc.page_content for doc in documents])

#prompt template
bsds_template_str = """
Behave as though you are an academic advisor for the BSDS at Tufts University.
Your job is to answer questions about the Bachelor of Science in Data Science (BSDS) program.
Use only the provided context to answer the question.
Be as detailed as possible, but do not make up any information.
If the answer is not in the context, say "I don't know."

{context}

Question: {question}
"""
bsds_prompt = ChatPromptTemplate.from_template(bsds_template_str)

#sample question posed by user
question = "How many courses do I need to complete to earn the BSDS?"

#use Gemini to answer question and print response
genai.configure(api_key="REDACTED")  
model = genai.GenerativeModel("models/gemini-1.5-pro-latest")
response = model.generate_content(bsds_prompt.format(context=context, question=question))
print("LLM's Response:")
print(response.text)


LLM's Response:
38 courses are required to complete the BSDS program.
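
The cell above passes the API key as a string literal, which is easy to leak when a notebook is shared. One common alternative, sketched here, is to read the key from an environment variable. The variable name `GOOGLE_API_KEY` and the helper `load_api_key` are assumptions for illustration.

```python
import os


def load_api_key(var_name: str = "GOOGLE_API_KEY") -> str:
    """Return the key stored in the environment, or raise a clear error."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running this cell."
        )
    return key


# Usage (commented out so the sketch runs without a real key):
# genai.configure(api_key=load_api_key())
```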



__Question 3:__ Install `ChromaDB` as a substitute for AuraDB and populate it with information on all SOE undergraduate degrees as found here: https://engineering.tufts.edu/cs/current-students/undergraduate/. Add at least one entry per degree. It's fine to cut and paste here.  See https://docs.trychroma.com/docs/overview/getting-started .

In [39]:
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
from chromadb import PersistentClient

# Use CPU-compatible embedding function
embedding_fn = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

client = PersistentClient(path=".chroma")
collection = client.get_or_create_collection(
    name="SOE_undergraduate_degrees",
    embedding_function=embedding_fn
)

# Now populate the collection as before
degrees = {
    "bsds": "The Bachelor of Science in Data Science (BSDS) combines statistics, computer science, and domain knowledge to extract insights from data.",
    "bscs": "The Bachelor of Science in Computer Science (BSCS) offers core instruction in software development, algorithms, and computer systems.",
    "bsbme": "The Bachelor of Science in Biomedical Engineering (BSBME) integrates engineering with biological and medical sciences to improve human health.",
    "bsche": "The Bachelor of Science in Chemical Engineering (BSCHE) focuses on transforming raw materials into valuable products through chemical processes.",
    "bsce": "The Bachelor of Science in Civil Engineering (BSCE) prepares students to design and maintain infrastructure such as bridges, roads, and water systems.",
    "bsee": "The Bachelor of Science in Electrical Engineering (BSEE) emphasizes electronics, signal processing, and embedded systems.",
    "bsme": "The Bachelor of Science in Mechanical Engineering (BSME) is focused on machine design, thermodynamics, fluid dynamics, and manufacturing systems.",
    "bscpe": "The Bachelor of Science in Computer Engineering (BSCPE) bridges hardware and software through digital logic, embedded systems, and microprocessors.",
    "bshfe": "The Bachelor of Science in Human Factors Engineering (BSHFE) combines psychology and engineering to design human-centered systems and interfaces.",
    "bseve": "The Bachelor of Science in Environmental Engineering (BSEVE) addresses issues related to water quality, pollution, and sustainable development.",
    "bsas": "The Bachelor of Science in Architectural Studies (BSAS) blends architectural design, environmental sustainability, and structural principles.",
    "bses": "The Bachelor of Science in Engineering Science (BSES) provides a flexible interdisciplinary engineering curriculum tailored to student interests.",
    "bsep": "The Bachelor of Science in Engineering Physics (BSEP) prepares students for careers at the intersection of physics and advanced technology.",
    "bseh": "The Bachelor of Science in Environmental Health (BSEH) explores how environmental factors impact public health and safety."
}

collection.add(
    documents=list(degrees.values()),
    metadatas=[{"degree": key.upper()} for key in degrees],
    ids=list(degrees.keys())
)


Add of existing embedding ID: bsds
Add of existing embedding ID: bscs
Add of existing embedding ID: bsbme
Add of existing embedding ID: bsche
Add of existing embedding ID: bsce
Add of existing embedding ID: bsee
Add of existing embedding ID: bsme
Add of existing embedding ID: bscpe
Add of existing embedding ID: bshfe
Add of existing embedding ID: bseve
Add of existing embedding ID: bsas
Add of existing embedding ID: bses
Add of existing embedding ID: bsep
Add of existing embedding ID: bseh
Insert of existing embedding ID: bsds
Insert of existing embedding ID: bscs
Insert of existing embedding ID: bsbme
Insert of existing embedding ID: bsche
Insert of existing embedding ID: bsce
Insert of existing embedding ID: bsee
Insert of existing embedding ID: bsme
Insert of existing embedding ID: bscpe
Insert of existing embedding ID: bshfe
Insert of existing embedding ID: bseve
Insert of existing embedding ID: bsas
Insert of existing embedding ID: bses
Insert of existing embedding ID: bsep
Insert
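
The `Add of existing embedding ID` messages above typically appear when `add` is called on a persistent collection that already holds the same IDs, for example after re-running the cell. One way to make the cell safe to re-run, sketched below, is the collection's `upsert` method, which updates existing IDs and inserts new ones. `load_degrees` is a hypothetical helper name, written against any object that exposes `upsert` with these keyword arguments.

```python
def load_degrees(collection, degrees: dict) -> list:
    """Write one document per degree, overwriting entries whose ID already exists."""
    ids = list(degrees.keys())
    collection.upsert(
        documents=[degrees[i] for i in ids],
        metadatas=[{"degree": i.upper()} for i in ids],
        ids=ids,
    )
    return ids


# Usage with the notebook's objects:
# load_degrees(collection, degrees)
```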

In [40]:
#TEST SEARCHES FOR SOE UNDERGRADUATE DEGREE

test_queries = [
    "Which degree focuses on human health?",
    "Which program blends architecture and engineering?",
    "Which program deals with machines and thermodynamics?",
    "What degree is about water pollution and the environment?",
    "Which degree involves software, algorithms, and coding?",
    "What program is about designing user-friendly systems?",
    "What degree helps you understand how computers work at the hardware level?",
    "Which program offers an interdisciplinary engineering path?",
    "What program connects physics and technology?"
]

for q in test_queries:
    results = collection.query(query_texts=[q], n_results=1)
    answer = results["documents"][0][0]
    print(f"Query: {q}\nTop Match: {answer}\n{'-'*80}")

Query: Which degree focuses on human health?
Top Match: The Bachelor of Science in Biomedical Engineering (BSBME) integrates engineering with biological and medical sciences to improve human health.
--------------------------------------------------------------------------------
Query: Which program blends architecture and engineering?
Top Match: The Bachelor of Science in Computer Engineering (BSCPE) bridges hardware and software through digital logic, embedded systems, and microprocessors.
--------------------------------------------------------------------------------
Query: Which program deals with machines and thermodynamics?
Top Match: The Bachelor of Science in Mechanical Engineering (BSME) is focused on machine design, thermodynamics, fluid dynamics, and manufacturing systems.
--------------------------------------------------------------------------------
Query: What degree is about water pollution and the environment?
Top Match: The Bachelor of Science in Environmental Engine
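
In the output above, the architecture query returns the Computer Engineering entry rather than Architectural Studies, which was presumably the intended match, so it is worth scoring retrieval against hand-picked expected answers. The sketch below relies on the `ids` list that Chroma's `query` returns for each query text. The `expected` mapping and the name `score_retrieval` are illustrative.

```python
def score_retrieval(collection, expected: dict) -> dict:
    """Map each query to (top_id, is_correct) using the top-1 result."""
    report = {}
    for query, wanted_id in expected.items():
        results = collection.query(query_texts=[query], n_results=1)
        top_id = results["ids"][0][0]  # first query, best match
        report[query] = (top_id, top_id == wanted_id)
    return report


# Usage with the notebook's collection (expected IDs chosen by hand):
# expected = {"Which program blends architecture and engineering?": "bsas"}
# for q, (top_id, ok) in score_retrieval(collection, expected).items():
#     print(f"{'OK  ' if ok else 'MISS'} {top_id:6} {q}")
```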

__Question 4:__ Write code that does a `ChromaDB` query to determine the most reasonable shards to insert as information for a user query.

In [41]:
#name: get_relevant_shards
#purpose: To retrieve the top-k most relevant document shards from a ChromaDB collection 
#based on a user's query, for use in LLM prompting or downstream applications.
#inputs:
#   query (str): A natural language question or search query from the user.
#   collection (ChromaDB Collection): A ChromaDB collection containing embedded documents.
#   top_k (int): Optional. The number of most relevant shards to retrieve. Default is 3.
#returns:
#   context (str): A single string combining the top-k matched document shards, separated by newlines.
#   matches (list of str): A list of the top-k matched document shard strings.
def get_relevant_shards(query, collection, top_k=3):
    results = collection.query(query_texts=[query], n_results=top_k)
    matches = results["documents"][0]  # top_k most relevant text chunks
    context = "\n\n".join(matches)
    return context, matches

In [42]:
#TEST SOLUTION

user_query = "Which degree includes software development?"
context, shards = get_relevant_shards(user_query, collection)

print("Context to insert in LLM prompt:\n")
print(context)

print("\nShards:")
for i, shard in enumerate(shards, 1):
    print(f"\n--- Shard {i} ---\n{shard}")
    

Context to insert in LLM prompt:

The Bachelor of Science in Computer Science (BSCS) offers core instruction in software development, algorithms, and computer systems.

The Bachelor of Science in Computer Engineering (BSCPE) bridges hardware and software through digital logic, embedded systems, and microprocessors.

The Bachelor of Science in Civil Engineering (BSCE) prepares students to design and maintain infrastructure such as bridges, roads, and water systems.

Shards:

--- Shard 1 ---
The Bachelor of Science in Computer Science (BSCS) offers core instruction in software development, algorithms, and computer systems.

--- Shard 2 ---
The Bachelor of Science in Computer Engineering (BSCPE) bridges hardware and software through digital logic, embedded systems, and microprocessors.

--- Shard 3 ---
The Bachelor of Science in Civil Engineering (BSCE) prepares students to design and maintain infrastructure such as bridges, roads, and water systems.
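
With `top_k=3`, the third shard above (Civil Engineering) has little to do with software development. One possible refinement, sketched here, is to drop shards whose distance to the query exceeds a cutoff. The `include` argument asks Chroma's `query` to return `distances` alongside `documents`. The cutoff value `1.0` and the name `get_shards_within` are assumptions that would need tuning against the actual embedding model.

```python
def get_shards_within(query, collection, top_k=3, max_distance=1.0):
    """Return the top-k shards whose distance to the query is at most max_distance."""
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        include=["documents", "distances"],
    )
    docs = results["documents"][0]
    dists = results["distances"][0]
    # Smaller distance means a closer match.
    matches = [doc for doc, dist in zip(docs, dists) if dist <= max_distance]
    return "\n\n".join(matches), matches


# Usage with the notebook's collection:
# context, shards = get_shards_within("Which degree includes software development?", collection)
```

If every shard is filtered out, the context is an empty string, and the prompt's "I don't know." instruction then covers the case.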


__Question 5:__ Write an actual query handler that puts together LangChain, ChromaDB, and your prompt as recommended in the article. Test it and exhibit its response prompt. Optionally, try this with an LLM and exhibit the results. 

In [43]:
#Name: query_handler
#Purpose: To generate an LLM-based answer to a user's question by retrieving relevant context from ChromaDB, formatting 
#it into a LangChain prompt, and querying Gemini.
#Inputs:
#   user_question (str): The natural language question provided by the user.
#   collection (ChromaDB Collection): A ChromaDB collection containing embedded text shards.
#   top_k (int): Optional. The number of top-matching context shards to retrieve from ChromaDB. Default is 3.
#Returns:
#   A string containing the LLM's response to the user question, based only on the retrieved context.
def query_handler(user_question, collection, top_k=3):
    
    #query ChromaDB for relevant shards
    context, _ = get_relevant_shards(user_question, collection, top_k=top_k)

    #create prompt using LangChain
    template = """
    You are an academic advisor for Tufts University's School of Engineering.
    Use only the provided information to answer the user's question.
    If the answer is not in the context, say "I don't know."

    {context}

    Question: {question}
    """
    prompt_template = ChatPromptTemplate.from_template(template)
    full_prompt = prompt_template.format(context=context, question=user_question)

    genai.configure(api_key="REDACTED")
    model = genai.GenerativeModel("models/gemini-1.5-pro-latest")
    response = model.generate_content(full_prompt)

    return response.text.strip()
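
Because the template in `query_handler` is written inside an indented function body, every line sent to the model carries leading spaces, and the assembled prompt cannot be inspected without calling the LLM. The sketch below shows one way to separate prompt construction from the model call, using only the standard library. `ADVISOR_TEMPLATE` and `build_prompt` are hypothetical names, and the context string is a placeholder.

```python
import textwrap

# dedent strips the common leading indentation from the template lines.
ADVISOR_TEMPLATE = textwrap.dedent("""\
    You are an academic advisor for Tufts University's School of Engineering.
    Use only the provided information to answer the user's question.
    If the answer is not in the context, say "I don't know."

    {context}

    Question: {question}
    """)


def build_prompt(context: str, question: str) -> str:
    """Fill the template; no network call, so the result can be printed or tested."""
    return ADVISOR_TEMPLATE.format(context=context, question=question)


print(build_prompt("PLACEHOLDER shard text.", "Which degree involves coding?"))
```

Printing the result of `build_prompt` is also a direct way to exhibit the prompt that the handler would send.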


In [44]:
#TEST SOLUTION
user_q = "Which degree involves coding and algorithms?"
response = query_handler(user_q, collection)
print(f"Q: {user_q}\nA: {response}")

Q: Which degree involves coding and algorithms?
A: The Bachelor of Science in Computer Science (BSCS) and the Bachelor of Science in Data Science (BSDS) involve coding and algorithms.


*Optionally, paste results of running this on an LLM here:* 

# __To submit this assignment:__
1. __Leave computations' outputs alone for grading.__ Do not select `Kernel/Restart Kernel and run all cells`.
2. Check that all cells ran and that there are no syntax errors. 
3. Save this notebook.
4. Upload the saved notebook to GradeScope.