# Question answering using embeddings-based search

GPT excels at answering questions, but only on topics it remembers from its training data. This lab uses the search-ask method to answer questions about private data.

1. Search: search your library of text for relevant text sections.
2. Ask: insert the retrieved text sections into a message to ChatGPT and ask it the question.

To complete this lab, you will need to set up an OpenAI API account and your development environment. Here's a link to the [QuickStart](https://platform.openai.com/docs/quickstart?context=python). Pay special attention to step 2, which covers setting up your API key.
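
Before running the cells below, it can help to confirm that the key is visible to Python. This is a minimal sketch, assuming the key is stored in the `OPENAI_API_KEY` environment variable, which the OpenAI Python library reads by default. The helper name `api_key_is_set` is hypothetical and introduced only for illustration.

```python
import os

def api_key_is_set(env=os.environ, name="OPENAI_API_KEY"):
    """Return True if the variable exists and is not blank."""
    return bool(env.get(name, "").strip())

if not api_key_is_set():
    print("OPENAI_API_KEY is not set; see step 2 of the QuickStart.")
```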


## Load Text to Search

Place the documents that your application will search in the `docs` folder.

These functions read several types of text files (PDF, DOCX, TXT). The following libraries are used:
- [PyPDF](https://pypdf.readthedocs.io/en/stable/)
- [python-docx](https://python-docx.readthedocs.io/en/latest/)
  

In [None]:
#!pip install python-docx
#!pip install pypdf
import os
from pypdf import PdfReader
import docx 

def read_pdf(file_path):
    """Return the text of every page in a PDF file."""
    with open(file_path, "rb") as file:
        pdf_reader = PdfReader(file)
        text = ""
        for page in pdf_reader.pages:
            # End each page with a newline so words at page breaks don't merge
            text += page.extract_text() + "\n"
    return text

def read_word(file_path):
    doc = docx.Document(file_path)
    text = ""
    for paragraph in doc.paragraphs:
        text += paragraph.text + "\n"
    return text

def read_txt(file_path):
    # UTF-8 is assumed; change the encoding if your files use another one
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()
    return text

def read_documents_from_directory(directory):
    combined_text = ""
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        if filename.endswith(".pdf"):
            combined_text += read_pdf(file_path)
        elif filename.endswith(".docx"):
            combined_text += read_word(file_path)
        elif filename.endswith(".txt"):
            combined_text += read_txt(file_path)
    return combined_text

train_directory = 'docs/'
text = read_documents_from_directory(train_directory)
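
Note that `filename.endswith(".pdf")` is case-sensitive, so a file named `REPORT.PDF` would be skipped silently. One possible variation, sketched here, looks up a reader by lower-cased extension. The names `READERS`, `read_plain_text`, and `read_any` are hypothetical. Only a plain-text reader is registered so the sketch runs without extra packages; `read_pdf` and `read_word` from the cell above could be registered the same way.

```python
from pathlib import Path

def read_plain_text(file_path):
    # UTF-8 is an assumption about how the files were saved.
    return Path(file_path).read_text(encoding="utf-8")

# Map lower-case extensions to reader functions.
READERS = {".txt": read_plain_text}

def read_any(file_path, readers=READERS):
    """Return the file's text, or "" if no reader handles its extension."""
    reader = readers.get(Path(file_path).suffix.lower())
    return reader(file_path) if reader else ""
```

With this layout, supporting another file type means adding one entry to `READERS` rather than another `elif` branch.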

## Split into chunks

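This lab uses the [LangChain](https://www.langchain.com) library. **LangChain** is a framework for developing applications powered by large language models (LLMs). Here, its `CharacterTextSplitter` splits the text on newlines and merges the pieces into chunks of roughly 1,000 characters, with about 200 characters of overlap between consecutive chunks.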

In [None]:
#!pip install langchain
from langchain.text_splitter import CharacterTextSplitter

char_text_splitter = CharacterTextSplitter(
    separator="\n", chunk_size=1000, chunk_overlap=200, length_function=len
)

text_chunks = char_text_splitter.split_text(text)
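
To see what `chunk_size` and `chunk_overlap` control, here is a simplified sketch in plain Python. It is not LangChain's implementation: `CharacterTextSplitter` splits on the separator and then merges pieces, whereas the hypothetical `sliding_chunks` helper below just slides a fixed-width window over the characters. The idea it illustrates is the same: consecutive chunks share some text, so a sentence cut at one boundary is more likely to appear whole in a neighboring chunk.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# prints ['abcd', 'cdef', 'efgh', 'ghij']
```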

## Create embeddings

There are two steps to creating an easily searchable set of embeddings.

First, create an [OpenAI embedding](https://platform.openai.com/docs/guides/embeddings) object using the LangChain library. Then use it to build a [FAISS](https://python.langchain.com/docs/integrations/vectorstores/faiss/) vector database from the text chunks for efficient search.

In [None]:
#!pip install langchain_openai
#!pip install langchain_community
#!pip install faiss-cpu  # or: !pip install faiss-gpu

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = OpenAIEmbeddings()
docsearch = FAISS.from_texts(text_chunks, embeddings)
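
Conceptually, the vector store answers a query by comparing the query's embedding with each chunk's embedding and returning the closest chunks. The sketch below shows that idea with hand-made three-dimensional vectors and cosine similarity. The vectors, the `top_k` helper, and the choice of cosine similarity are all assumptions for illustration; real OpenAI embeddings have far more dimensions, and a FAISS index may use a different distance metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return indices of the k chunks most similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    ranked = sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy vectors standing in for chunk embeddings (made up for this example).
chunk_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(top_k([1.0, 0.0, 0.0], chunk_vecs))  # prints [0, 1]
```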

## Questions

Write five questions related to your documents.

In [None]:
queries = [
    "Do emeritus professors have library privileges?", 
    "How many years of service are required to be eligible for emeritus status?", 
    "Who are the main characters in Emma?", 
    "When are Dr. Howard's office hours?", 
    "For Natural Language Processing, what percentage of the final grade are homework assignments?", 
    "When and where does Mobile Application Development meet?"]

Create a LangChain chain to send queries and related text to OpenAI.

In [None]:
from langchain.chains.question_answering import load_qa_chain
from langchain_openai import OpenAI

llm = OpenAI()  # get a reference to the LLM
chain = load_qa_chain(llm, chain_type="stuff")  # create a LangChain chain
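
The `"stuff"` chain type places all retrieved documents into a single prompt together with the question. The sketch below illustrates that idea in plain Python. The prompt wording and the `stuff_prompt` helper are hypothetical and are not the template LangChain actually uses.

```python
def stuff_prompt(question, sections):
    """Build one prompt that contains every retrieved section."""
    context = "\n\n".join(sections)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Placeholder sections standing in for retrieved chunks.
print(stuff_prompt("What does section A say?", ["Section A text.", "Section B text."]))
```

Because everything goes into one prompt, this approach only works while the combined sections fit within the model's context window.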

## Answer Questions

For each question:
1. Search the vector database for related text.
2. Send the related text and the question to the OpenAI API.
3. Print the question and answer.

In [None]:
for query in queries:
    # Retrieve the chunks most similar to the query
    docs = docsearch.similarity_search(query)
    # Ask the LLM to answer using only the retrieved chunks
    response = chain.invoke({"input_documents": docs, "question": query})
    print()
    print(query)
    print(response["output_text"])