<a href="https://colab.research.google.com/github/jeffheaton/app_generative_ai/blob/main/assignments/assignment_yourname_t81_559_class6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-559: Applications of Generative AI
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/index.html)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-559/).

**Module 6 Assignment: RAG**

**Student Name: Your Name**

# Assignment Instructions

Use RAG to have the LLM read and analyze the story ["Clockwork Dreams and Brass Shadows"](https://data.heatonresearch.com/data/t81-559/assignments/clockwork.pdf). The story can be found at the following URL.

* https://data.heatonresearch.com/data/t81-559/assignments/clockwork.pdf

Answer the following questions.

1. What is the invention that could change everything?
2. What is Eliza Hawthorne's job title?
3. Who is orchestrating the conspiracy?
4. Does Victor have a last name? (yes or no)
5. What city does the story take place in?
6. What is Jasper Thorne's job title?

These answers should be as simple as possible: yes/no, a city name, or a simple job title like "software engineer". Produce a table that looks like the following. Note that these are **NOT** the correct answers. Submit all answers in lower case. Your answers must match the solution exactly.

| Question   | Answer                          |
|------------|---------------------------------|
| 1.         | computer                        |
| 2.         | software engineer               |
| 3.         | sebastian                       |
| 4.         | yes                             |
| 5.         | cincinnati                      |
| 6.         | police officer                  |

Submit a dataframe with answers to the questions above, in this format.
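Because answers must be lower case and match exactly, it can help to normalize the raw LLM output before building the table. The following is only a minimal sketch using the standard library; `normalize_answer` is a hypothetical helper name, and the sample answers are the placeholder values from the example table above, not the correct ones.

```python
import string


def normalize_answer(raw: str) -> str:
    """Lower-case an answer and trim surrounding whitespace and punctuation."""
    return raw.strip().strip(string.punctuation + string.whitespace).lower()


# Placeholder answers taken from the example table; these are NOT correct.
raw_answers = [
    "Computer.",
    " Software Engineer",
    "Sebastian",
    "Yes",
    "Cincinnati ",
    "Police Officer",
]

# Same two-column layout as the example table: question number and answer.
data = {
    "question": list(range(1, len(raw_answers) + 1)),
    "answer": [normalize_answer(a) for a in raw_answers],
}
print(data["answer"])
```

A dictionary shaped like `data` can then be passed to `pd.DataFrame(data)` to produce the dataframe that is submitted.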





# Google CoLab Instructions

If you are using Google CoLab, it will be necessary to mount your GDrive so that you can send your notebook during the submit process. Running the following code will map your GDrive to ```/content/drive```.
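When running outside CoLab, the cell below leaves the submission key blank and suggests an environment variable as one option. A minimal sketch of that option follows; the helper name `load_key` is hypothetical, and the variable name `T81_559_KEY` is an assumption that simply mirrors the CoLab secret name used below.

```python
import os


def load_key(env_var: str = "T81_559_KEY", default: str = "") -> str:
    """Return the submission key from an environment variable, or a default."""
    return os.environ.get(env_var, default)


# Assumed variable name; adjust it to whatever you actually set locally.
local_key = load_key()
if not local_key:
    print("No submission key found; set T81_559_KEY before submitting.")
```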

In [None]:
import os

try:
  from google.colab import drive, userdata
  drive.mount('/content/drive', force_remount=True)
  COLAB = True
  print("Note: using Google CoLab")
except ImportError:
  print("Note: not using Google CoLab")
  COLAB = False

# Assignment Submission Key - It was sent to you during the first week of class.
# If you are in both classes, this is the same key.
if COLAB:
  # For Colab, add to your "Secrets" (key icon at the left)
  key = userdata.get('T81_559_KEY')
else:
  # If not colab, enter your key here, or use an environment variable.
  # (this is only an example key, use yours)
  key = ""

# OpenAI Secrets
if COLAB:
    os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Install needed libraries in CoLab
if COLAB:
    !pip install langchain langchain_openai langchain_community pypdf pdfkit sentence-transformers chromadb
    !apt-get install wkhtmltopdf

Mounted at /content/drive
Note: using Google CoLab
E: dpkg was interrupted, you must manually run 'dpkg --configure -a' to correct the problem. 


# Assignment Submit Function

You will submit the 10 programming assignments electronically.  The following submit function can be used to do this.  My server will perform a basic check of each assignment and let you know if it sees any basic problems.

**It is unlikely that you will need to modify this function.**
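To make the payload format more concrete, the sketch below serializes a small answer table to CSV and base64-encodes it, which is roughly what the function below does to each dataframe via `to_csv`. It uses only the standard library; `encode_rows_as_csv` is a hypothetical helper, and the exact bytes pandas produces may differ (for example in line endings).

```python
import base64
import csv
import io


def encode_rows_as_csv(header, rows):
    """Write rows as CSV text, then return the base64 form as an ASCII string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return base64.b64encode(buffer.getvalue().encode("ascii")).decode("ascii")


# Placeholder rows only; these are not the correct answers.
encoded = encode_rows_as_csv(
    ["question", "answer"],
    [[1, "computer"], [2, "software engineer"]],
)

# Decoding the string recovers the original CSV text.
decoded = base64.b64decode(encoded).decode("ascii")
print(decoded)
```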

In [None]:
import base64
import os
import numpy as np
import pandas as pd
import requests
import PIL
import PIL.Image
import io
from typing import List, Union

# This function submits an assignment.  You can submit an assignment as many times as you like; only the final
# submission counts.  The parameters are as follows:
# data - List of pandas dataframes or images.
# key - Your student key that was emailed to you.
# course - The course that you are in, currently t81-558 or t81-559.
# no - The assignment class number, should be 1 through 10.
# source_file - The full path to your Python or IPYNB file.  This must have "_class1" as part of its name.
#               The number must match your assignment number.  For example "_class2" for class assignment #2.

def submit(
    data: List[Union[pd.DataFrame, PIL.Image.Image]],
    key: str,
    course: str,
    no: int,
    source_file: str = None
) -> None:
    if source_file is None and '__file__' not in globals():
        raise Exception("Must specify a filename when in a Jupyter notebook.")
    if source_file is None:
        source_file = __file__

    suffix = f'_class{no}'
    if suffix not in source_file:
        raise Exception(f"{suffix} must be part of the filename.")

    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in ['.ipynb', '.py']:
        raise Exception(f"Source file is {ext}; must be .py or .ipynb")

    with open(source_file, "rb") as file:
        encoded_python = base64.b64encode(file.read()).decode('ascii')

    payload = []
    for item in data:
        if isinstance(item, PIL.Image.Image):
            buffered = io.BytesIO()
            item.save(buffered, format="PNG")
            payload.append({'PNG': base64.b64encode(buffered.getvalue()).decode('ascii')})
        elif isinstance(item, pd.DataFrame):
            payload.append({'CSV': base64.b64encode(item.to_csv(index=False).encode('ascii')).decode("ascii")})
        else:
            raise ValueError(f"Unsupported data type: {type(item)}")

    response = requests.post(
        "https://api.heatonresearch.com/wu/submit",
        headers={'x-api-key': key},
        json={
            'payload': payload,
            'assignment': no,
            'course': course,
            'ext': ext,
            'py': encoded_python
        }
    )

    if response.status_code == 200:
        print(f"Success: {response.text}")
    else:
        print(f"Failure: {response.text}")

# Assignment #6 Sample Code

The following code provides a starting point for this assignment.
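The two core RAG steps, chunking and retrieval, can also be tried in isolation before running the full pipeline. The sketch below is a toy illustration under stated assumptions: it scores chunks by shared words with the question, which is a deliberately simple stand-in for the embedding similarity search a vector store performs. The helper names `tokenize` and `top_chunks` are hypothetical, and the sample sentences are made up and do not come from the story.

```python
import re


def chunk_text(text, chunk_size, overlap):
    """Split text into fixed-size character windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def tokenize(text):
    """Return the set of lower-case words in the text."""
    return set(re.findall(r"[a-z']+", text.lower()))


def top_chunks(question, chunks, k=2):
    """Rank chunks by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q_words & tokenize(c)), reverse=True)
    return ranked[:k]


# Chunking demo: windows of 4 characters that advance by 2 characters.
demo_chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)

# Retrieval demo on made-up sentences (not from the assignment story).
passages = [
    "the cat sat on the mat",
    "a dog barked at the moon",
    "rain fell all night",
]
best = top_chunks("where did the cat sit?", passages, k=1)
print(demo_chunks, best)
```

In a real pipeline the retrieved chunks are joined into a context string and sent to the LLM together with the question.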

In [None]:
import os
import pandas as pd
import requests
import pypdf
from io import BytesIO
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
from langchain import hub
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# You must identify your source file.  (modify for your local setup)
# file="/content/drive/My Drive/Colab Notebooks/assignment_yourname_t81_559_class6.ipynb"  # Google CoLab
# file='C:\\Users\\jeffh\\projects\\t81_559_deep_learning\\assignments\\assignment_yourname_t81_559_class6.ipynb'  # Windows
# file='/Users/jheaton/projects/t81_559_deep_learning/assignments/assignment_yourname_t81_559_class6.ipynb'  # Mac/Linux

file="/content/drive/My Drive/Colab Notebooks/assignment_ZhijiangLi_t81_559_class6.ipynb"

# Begin assignment

url = "https://data.heatonresearch.com/data/t81-559/assignments/clockwork.pdf"


def extract_pdf_text(pdf_content):
    pdf_file = BytesIO(pdf_content)
    reader = pypdf.PdfReader(pdf_file)
    text = ""
    for page in reader.pages:
        text += page.extract_text()
    return text

# Fetch the PDF content
response = requests.get(url)
response.raise_for_status()
content = extract_pdf_text(response.content)

chunk_size = 900
overlap = 300

def chunk_text(text, chunk_size, overlap):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i:i + chunk_size])
    return chunks

chunks = chunk_text(content, chunk_size, overlap)

# Convert chunks into LangChain documents
from langchain_core.documents import Document
documents = [Document(page_content=chunk) for chunk in chunks]


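# The manual chunks are already shorter than 1000 characters, so this
# splitter should leave them essentially unchanged.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
# Local sentence-transformers model used to embed each chunk
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")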



db = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_clockwork_db")


rag_prompt = hub.pull("rlm/rag-prompt")

def format_documents(documents):
    return "\n\n".join(doc.page_content for doc in documents)

retriever = db.as_retriever()
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2, n=1)

qa_chain = (
    {"context": retriever | format_documents, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

questions = [
    "What is the invention that could change everything?",
    "What is Eliza Hawthorne's job title?",
    "Who is orchestrating the conspiracy?",
    "Does Victor have a last name? (yes or no)",
    "What city does the story take place in?",
    "What is Jasper Thorne's job title?",
]

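import re

# Query the RAG model and process answers
def clean_answer(question, raw_answer):
    # `question` is unused but kept so existing calls stay the same.
    raw_answer = raw_answer.lower().strip()
    # Keyword -> canonical answer; the first matching keyword wins.
    mappings = {
        "automaton": "automaton",
        "inventor": "inventor",
        "engineer": "inventor",
        "victor": "victor",
        "yes": "yes",
        "no": "no",
        "london": "london",
        "captain": "airship captain"
    }
    for key, canonical in mappings.items():
        # Match whole words only (with an optional plural "s"), so "no"
        # does not fire inside words such as "renowned" or "not".
        if re.search(rf"\b{re.escape(key)}s?\b", raw_answer):
            return canonical
    return raw_answer  # Default to processed raw answer if no mapping found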

answers = [clean_answer(q, qa_chain.invoke(q)) for q in questions]

data = {
    'question': [1, 2, 3, 4, 5, 6],
    'answer': answers
}
df_submit = pd.DataFrame(data)

df_submit

# Note: this data is wrong; it's just an example of the format.
# data = {
#     'question': [1, 2, 3, 4, 5, 6],
#     'answer': ['computer', 'software engineer', 'sebastian', 'yes', 'cincinnati', 'police officer']
# }
# df_submit = pd.DataFrame(data)

# Submit
submit(source_file=file,data=[df_submit],course='t81-559',key=key,no=6)



Success: Submitted Assignment 6 (t81-559) for l.zhijiang:
You have submitted this assignment 8 times. (this is fine)
