## OpenAI vs. Local Embeddings
Performance comparison of two embedding models for retrieval-based question answering over a folder of PDF documents:
- OpenAI's embedding model
- InstructorEmbedding (https://huggingface.co/hkunlp/instructor-xl)

In [None]:
!pip -q install langchain openai tiktoken chromadb pypdf sentence_transformers InstructorEmbedding faiss-cpu

In [40]:
from secret_key import openapi_key
import os
os.environ['OPENAI_API_KEY'] = openapi_key

In [41]:
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader

In [42]:
# InstructorEmbedding
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings

In [43]:
# OpenAI Embedding
from langchain.embeddings import OpenAIEmbeddings

### Load Multiple Files from Directory

In [44]:
# connect your Google Drive
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive"

Mounted at /content/gdrive


In [45]:
# loader = TextLoader('single_text_file.txt')
loader = DirectoryLoader(f'{root_dir}/Documents/', glob="./*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

In [46]:
# documents

### Divide and Conquer

In [47]:
text_splitter = RecursiveCharacterTextSplitter(
                                               chunk_size=1000,
                                               chunk_overlap=200)

texts = text_splitter.split_documents(documents)

In [48]:
texts[0]

Document(page_content='Gracenote.ai: Legal Generative AI for Regulatory Compliance  Jules Ioannidis, Joshua Harper, Ming Sheng Quah 1 and Dan Hunter 1, 2  1 Gracenote.ai, Melbourne Australia 2 The Dickson Poon School of Law, King’s College London, United Kingdom*   Abstract: We investigate the transformative potential of large language models (LLMs) in the legal and regulatory compliance domain by developing advanced generative AI solutions, including a horizon scanning tool, an obligations generation tool, and an LLM-based expert system. Our approach combines the LangChain framework, OpenAI’s GPT-4, text embeddings, and prompt engineering techniques to effectively reduce hallucinations and generate reliable and accurate domain-specific outputs. A human-in-the-loop control mechanism is used as a final backstop to ensure accuracy and mitigate risk. Our findings emphasise the role of LLMs as foundation engines in specialist tools and lay the groundwork for building the next generation of

In [49]:
len(texts)

60
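
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal sketch of overlapping character windows. It is a simplification, not the splitter's actual algorithm: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and spaces, so real chunks are usually shorter than 1000 characters. The helper name `split_with_overlap` and the sample string are hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size character windows that overlap (simplified sketch)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Hypothetical sample: 2500 characters of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

With these settings, the last 200 characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still seen whole by at least one chunk.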

### Get Embeddings for OUR Documents

In [50]:
# !pip install faiss-cpu

In [51]:
import pickle
import faiss
from langchain.vectorstores import FAISS

In [52]:
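def store_embeddings(docs, embeddings, sotre_name, path):
    """Embed docs into a FAISS index and pickle it to <path>/faiss_<sotre_name>.pkl."""
    vector_store = FAISS.from_documents(docs, embeddings)

    # The target folder must already exist
    with open(f"{path}/faiss_{sotre_name}.pkl", "wb") as f:
        pickle.dump(vector_store, f)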

In [53]:
def load_embeddings(sotre_name, path):
    """Load the pickled FAISS vector store written by store_embeddings."""
    with open(f"{path}/faiss_{sotre_name}.pkl", "rb") as f:
        vector_store = pickle.load(f)
    return vector_store
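
The save and load round trip used by the two helpers above can be checked without any embeddings. The sketch below uses a plain dictionary as a stand-in, since whether a real FAISS store pickles cleanly may depend on the installed library versions. The names `store_object`, `load_object`, and `stand_in_store` are hypothetical; unlike the helpers above, this version also creates the target folder.

```python
import pickle
import tempfile
from pathlib import Path


def store_object(obj, store_name, path):
    """Pickle obj to <path>/faiss_<store_name>.pkl, creating the folder if needed."""
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"faiss_{store_name}.pkl"
    with open(target, "wb") as f:
        pickle.dump(obj, f)
    return target


def load_object(store_name, path):
    """Read back the object written by store_object."""
    with open(Path(path) / f"faiss_{store_name}.pkl", "rb") as f:
        return pickle.load(f)


# Stand-in for a vector store (assumption: any picklable object behaves the same way)
stand_in_store = {"ids": [0, 1], "vectors": [[0.1, 0.2], [0.3, 0.4]]}

with tempfile.TemporaryDirectory() as tmp:
    saved_to = store_object(stand_in_store, "demo", f"{tmp}/Embedding_store")
    restored = load_object("demo", f"{tmp}/Embedding_store")

print(saved_to.name, restored == stand_in_store)
```

Only unpickle files you wrote yourself, because `pickle.load` can execute arbitrary code from an untrusted file.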

### HF Instructor Embeddings

In [54]:
from langchain.embeddings import HuggingFaceInstructEmbeddings

instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl",
                                                      model_kwargs={"device": "cuda"})

load INSTRUCTOR_Transformer
max_seq_length  512


In [55]:
Embedding_store_path = f"{root_dir}/Embedding_store"

In [58]:
db_instructEmbedd = FAISS.from_documents(texts, instructor_embeddings)

In [59]:
retriever = db_instructEmbedd.as_retriever(search_kwargs={"k": 3})

In [60]:
retriever.search_type

'similarity'

In [61]:
retriever.search_kwargs

{'k': 3}
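
What a `'similarity'` search with `k=3` does can be sketched in plain Python: score every stored vector against the query vector and keep the three best. This sketch assumes cosine similarity; the FAISS index built above may rank by a different metric, such as L2 distance. The vectors and the helper names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy 2-D "embeddings" (real embeddings have hundreds of dimensions)
toy_doc_vecs = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
    [0.9, 0.1],
]
toy_query = [1.0, 0.0]

print(top_k(toy_query, toy_doc_vecs, k=3))
```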

In [80]:
docs = retriever.get_relevant_documents("Who is the author")

In [85]:

docs

[Document(page_content='interfaces, one for an author/publisher and one for the end-user/client.   For the authoring tool, upon login the author is presented with all work product that is pending from monitored feeds for that author. The author can adjust which feeds are monitored from a settings page.  The authoring environment itself has three main panes—the leftmost pane (“ASIC places interim stop orders...”) is the timeline for all content still to be published, the middle pane is the original content from the monitored source, and the rightmost pane contains the GPT-generated content. The author compares the original source with the summary to assess accuracy and quality, and they can adjust settings on the prompt and edit the summary prior to publication. This is done to reduce the hallucination/integrity issues inherent in LLMs. (A topic which we examine in more detail in section 5 below.). Refer to Figure 1 in Annexure A.  Upon publishing the content, the update is stored in a 

In [81]:
docs[0]

Document(page_content='interfaces, one for an author/publisher and one for the end-user/client.   For the authoring tool, upon login the author is presented with all work product that is pending from monitored feeds for that author. The author can adjust which feeds are monitored from a settings page.  The authoring environment itself has three main panes—the leftmost pane (“ASIC places interim stop orders...”) is the timeline for all content still to be published, the middle pane is the original content from the monitored source, and the rightmost pane contains the GPT-generated content. The author compares the original source with the summary to assess accuracy and quality, and they can adjust settings on the prompt and edit the summary prior to publication. This is done to reduce the hallucination/integrity issues inherent in LLMs. (A topic which we examine in more detail in section 5 below.). Refer to Figure 1 in Annexure A.  Upon publishing the content, the update is stored in a s

In [82]:
# create the chain to answer questions
qa_chain_instrucEmbed = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0.2),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)
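
The `"stuff"` chain type places all retrieved chunks into a single prompt and sends it to the LLM in one call. A rough sketch of that idea follows. The prompt wording here is an assumption for illustration and not LangChain's actual template; `build_stuff_prompt` and the sample chunks are hypothetical.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question. "
        "If you don't know the answer, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Stand-ins for the page_content of the k=3 retrieved documents
retrieved_chunks = ["Chunk one text.", "Chunk two text.", "Chunk three text."]
prompt = build_stuff_prompt("Who is the author?", retrieved_chunks)
print(prompt)
```

Because everything goes into one prompt, `k` times `chunk_size` has to fit in the model's context window, which is one reason to keep `k` small.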

### OpenAI's Embeddings

In [83]:
from langchain.embeddings import OpenAIEmbeddings

In [84]:
embeddings = OpenAIEmbeddings()

In [67]:
# store_embeddings(texts,
#                  embeddings,
#                  sotre_name='openAIEmbeddings',
#                  path=Embedding_store_path)

In [68]:
# db_openAIEmbedd = load_embeddings(sotre_name='openAIEmbeddings',
#                                     path=Embedding_store_path)

In [69]:
db_openAIEmbedd = FAISS.from_documents(texts, embeddings)
retriever_openai = db_openAIEmbedd.as_retriever(search_kwargs={"k": 3})

In [70]:
# create the chain to answer questions
qa_chain_openai = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0.2),
    chain_type="stuff",
    retriever=retriever_openai,
    return_source_documents=True,
)

### Testing both MODELS

In [71]:
# Helpers for printing an answer together with the sources it cites

import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

def process_llm_response(llm_response):
    print(wrap_text_preserve_newlines(llm_response['result']))
    print('\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])
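
Since each retrieved chunk carries its own `source` entry, the same file can be printed once per chunk. One possible variant, sketched here under the assumption that each source document exposes a `metadata` dict with a `source` key, lists each file only once while keeping the retrieval order. The helper `unique_sources` and the file names are hypothetical.

```python
from types import SimpleNamespace


def unique_sources(source_documents):
    """Return each document's source path once, in first-seen order."""
    seen = []
    for doc in source_documents:
        source = doc.metadata["source"]
        if source not in seen:
            seen.append(source)
    return seen


# Stand-ins for retrieved Document objects (only .metadata is used here)
fake_docs = [
    SimpleNamespace(metadata={"source": "a.pdf"}),
    SimpleNamespace(metadata={"source": "b.pdf"}),
    SimpleNamespace(metadata={"source": "a.pdf"}),
]

for source in unique_sources(fake_docs):
    print(source)
```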

In [72]:
query = 'who are the authors of GPT4all technical report?'

print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 The authors of GPT4all Technical Report are OpenAI.

Sources:
/content/gdrive/My Drive/Documents/paper3.pdf
/content/gdrive/My Drive/Documents/paper3.pdf
/content/gdrive/My Drive/Documents/paper3.pdf


In [73]:
query = 'who are the authors of GPT4all technical report?'

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

-------------------OpenAI Embeddings------------------
 I don't know.

Sources:
/content/gdrive/My Drive/Documents/paper3.pdf
/content/gdrive/My Drive/Documents/paper3.pdf
/content/gdrive/My Drive/Documents/paper3.pdf






In [74]:
query = 'How was the GPT4All-J model trained?'

print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 GPT4All-J was trained using prompt engineering to constrain the output generated by the model.

Sources:
/content/gdrive/My Drive/Documents/paper3.pdf
/content/gdrive/My Drive/Documents/paper3.pdf
/content/gdrive/My Drive/Documents/paper3.pdf


In [75]:
query = 'How was the GPT4All-J model trained?'

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

-------------------OpenAI Embeddings------------------




 I don't know.

Sources:
/content/gdrive/My Drive/Documents/paper3.pdf
/content/gdrive/My Drive/Documents/paper3.pdf
/content/gdrive/My Drive/Documents/paper3.pdf






In [76]:
query = '"What was the cost of training the GPT4all model?"'

print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)



-------------------Instructor Embeddings------------------





 I don't know.

Sources:
/content/gdrive/My Drive/Documents/paper3.pdf
/content/gdrive/My Drive/Documents/paper3.pdf
/content/gdrive/My Drive/Documents/paper3.pdf


In [77]:
query = '"What was the cost of training the GPT4all model?"'

print('-------------------OpenAI Embeddings------------------')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)
print('\n\n\n')

-------------------OpenAI Embeddings------------------




 I don't know.

Sources:
/content/gdrive/My Drive/Documents/paper3.pdf
/content/gdrive/My Drive/Documents/paper3.pdf
/content/gdrive/My Drive/Documents/paper3.pdf






In [None]:
query = "what license is GPT4All-J using?"

# print('-------------------OpenAI Embeddings------------------')
# llm_response = qa_chain_openai(query)
# process_llm_response(llm_response)
# print('\n\n\n')
print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

 GPT4All-J is using an Apache 2 license.

Sources:
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All_Technical_Report.pdf
/content/gdrive/My Drive/Documents/2023_GPT4All-J_Technical_Report_2.pdf
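
The query cells above repeat the same pattern for each chain. A small harness like the following could run every query through both chains and collect the answers side by side. It is a sketch under the assumption that each chain is callable with a query string and returns a dict with a `result` key, as the chains above do. The function `compare_chains` is hypothetical, and the two stand-in chains return placeholder text, not real model output.

```python
def compare_chains(queries, chains):
    """Run every query through every chain and collect the stripped answers."""
    rows = []
    for query in queries:
        row = {"query": query}
        for name, chain in chains.items():
            row[name] = chain(query)["result"].strip()
        rows.append(row)
    return rows


# Stand-in chains so the sketch runs without API keys or a GPU (placeholder answers)
def fake_instructor_chain(query):
    return {"result": " answer from chain A", "source_documents": []}


def fake_openai_chain(query):
    return {"result": " I don't know.", "source_documents": []}


rows = compare_chains(
    ["example question?"],
    {"instructor": fake_instructor_chain, "openai": fake_openai_chain},
)
for row in rows:
    print(row)
```

In the notebook, the dictionary would instead map names to `qa_chain_instrucEmbed` and `qa_chain_openai`.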
