# **RAG: Get insights from your business data through an LLM**

**By Rodolphe Segbedji**


0.   Installing dependencies, Importing Libraries, and API Keys
1.   Loading data source with LangChain
2.   Splitting or Chunking with LangChain
3.   Embedding text and storing embeddings
4.   Creating retrieval function
5.   Creating chatbot with chat memory (OPTIONAL) 


0.   Installing dependencies, Importing Libraries, and API Keys

In [3]:
# !pip install -q langchain
# !pip install -q langchain-community
# !pip install -q langchain-openai
# !pip install -q openai
# !pip install -q faiss-cpu
# !pip install -q tiktoken
# !pip install -q python-dotenv
# !pip install -q pandas
# !pip install -q matplotlib
# !pip install -q PyPDF2
# !pip install -q pypdf

# !pip install -q weaviate-client
# !pip install -q pinecone-client
# !pip install -q pgvector
# !pip install -q unstructured[pdf]

In [4]:
import os
import faiss
import tiktoken
import pandas as pd
import matplotlib.pyplot as plt
from dotenv import load_dotenv
from PyPDF2 import PdfReader
#import gradio as gr

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import CharacterTextSplitter

from langchain_openai import OpenAIEmbeddings

from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.weaviate import Weaviate
from langchain_community.vectorstores.pgvector import PGVector

from langchain.chains.question_answering import load_qa_chain
from langchain_openai import OpenAI  # LangChain LLM wrapper, not the OpenAI SDK client
from langchain.chains import ConversationalRetrievalChain

In [5]:
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
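
When checking that the key was picked up from `.env`, it is safer to print a redacted form than the key itself. The helper below is a minimal sketch of that idea; `mask_secret` is a hypothetical name, not part of `python-dotenv` or the OpenAI SDK.

```python
import os


def mask_secret(value, visible=4):
    """Return a redacted form of a secret that is safe to print or log."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        return "*" * len(value)
    # Keep only the last `visible` characters so the key can still be recognised.
    return "*" * (len(value) - visible) + value[-visible:]


# Confirms whether the key is present without exposing it in the notebook output.
print("OPENAI_API_KEY:", mask_secret(os.getenv("OPENAI_API_KEY")))
```

Notebook outputs are saved with the file, so a printed key ends up in version control along with the code.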

In [6]:
print(os.getcwd())
# Never print the API key itself; only confirm that it was loaded
print("OPENAI_API_KEY loaded:", bool(OPENAI_API_KEY))

/home/crs/10Academy/w6/my_rag/notebooks
OPENAI_API_KEY loaded: True


1. Loading data source with LangChain

In [7]:
# Load the PDF with LangChain's PyPDFLoader
pdfloader = PyPDFLoader('../data/gpt-4.pdf')

# Load the pages and split them with the loader's default text splitter
pages = pdfloader.load_and_split()
print(len(pages), pages[0])

# Re-split into chunks of at most 800 characters, with 200 characters of overlap
splitter = RecursiveCharacterTextSplitter(
    #separators = ["\n\n", "\n", " ", ""],
    chunk_size = 800,
    chunk_overlap  = 200,
    length_function = len,
)

pages_chunks = splitter.split_documents(pages)
print(len(pages_chunks), pages_chunks[0])


113 page_content='GPT-4 Technical Report\nOpenAI∗\nAbstract\nWe report the development of GPT-4, a large-scale, multimodal model which can\naccept image and text inputs and produce text outputs. While less capable than\nhumans in many real-world scenarios, GPT-4 exhibits human-level performance\non various professional and academic benchmarks, including passing a simulated\nbar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-\nbased model pre-trained to predict the next token in a document. The post-training\nalignment process results in improved performance on measures of factuality and\nadherence to desired behavior. A core component of this project was developing\ninfrastructure and optimization methods that behave predictably across a wide\nrange of scales. This allowed us to accurately predict some aspects of GPT-4’s\nperformance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GP
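
To see what `chunk_size` and `chunk_overlap` mean in the cell above, here is a simplified sketch that cuts a string into fixed-size windows sharing a few characters with their neighbours. It is an illustration only: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and lines, which this hypothetical `sliding_window_chunks` helper does not do.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.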

Another method to load, split, and embed

In [8]:
pdfReader = PdfReader('../data/gpt-4.pdf')

# Read the text from every page of the PDF
raw_text = ''
for page in pdfReader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

# Split `raw_text` into chunks small enough to stay within the model's token limit.
# Note: with the default separator ("\n\n"), text that contains only single
# newlines may come back as one chunk; uncomment `separator` to split on line breaks.
text_splitter = CharacterTextSplitter(
    #separator = '\n',
    chunk_size = 800,
    chunk_overlap  = 200,
    length_function = len,
)
text_chunks = text_splitter.split_text(raw_text)


In [9]:
type(raw_text), type(pdfReader), type(pdfloader), type(pages[0])

(str,
 PyPDF2._reader.PdfReader,
 langchain_community.document_loaders.pdf.PyPDFLoader,
 langchain_core.documents.base.Document)

In [10]:
print(f"{type(text_chunks)}")
print(f"{len(text_chunks)}")
print(f"{text_chunks[0]}")
#print(f"{text_chunks[1]}")

<class 'list'>
1
GPT-4 Technical Report
OpenAI∗
Abstract
We report the development of GPT-4, a large-scale, multimodal model which can
accept image and text inputs and produce text outputs. While less capable than
humans in many real-world scenarios, GPT-4 exhibits human-level performance
on various professional and academic benchmarks, including passing a simulated
bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-
based model pre-trained to predict the next token in a document. The post-training
alignment process results in improved performance on measures of factuality and
adherence to desired behavior. A core component of this project was developing
infrastructure and optimization methods that behave predictably across a wide
range of scales. This allowed us to accurately predict some aspects of GPT-4’s
performance based on models trained with no more than 1/1,000th the compute of
GPT-4.
1 Introduction
This technical report presents GPT-4, a large mult
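
The output above reports a single chunk. A likely reason is that a separator-based splitter can only cut where the separator occurs. The sketch below illustrates this with a hypothetical `split_and_merge` helper that splits on a separator and greedily merges the pieces up to `chunk_size`; it ignores overlap and is not the `CharacterTextSplitter` implementation.

```python
def split_and_merge(text, separator, chunk_size):
    """Split on `separator`, then greedily merge pieces into chunks of at most `chunk_size`."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # An oversized first piece is kept whole: there is nowhere to cut it.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Toy text that, like many PDF extractions, contains only single newlines.
text = "\n".join(f"line {i}" for i in range(20))

print(len(split_and_merge(text, "\n\n", 40)))  # separator never occurs
by_line = split_and_merge(text, "\n", 40)
print(len(by_line), max(len(c) for c in by_line))
```

With `"\n\n"` the whole text stays in one oversized chunk, while `"\n"` yields several chunks that respect the size limit.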

In [11]:
type(pages_chunks[0]), type(text_chunks[0])

(langchain_core.documents.base.Document, str)

3. Embedding text and storing embeddings

In [12]:
embeddings = OpenAIEmbeddings(disallowed_special=())

Storing the embeddings in a vector store (FAISS)

In [13]:
db = FAISS.from_documents(pages_chunks, embeddings)

In [14]:
db

<langchain_community.vectorstores.faiss.FAISS at 0x7f192e2e73a0>

In [15]:
vector_store = FAISS.from_texts(text_chunks, embeddings)

In [16]:
vector_store

<langchain_community.vectorstores.faiss.FAISS at 0x7f192e2e5cc0>

4. Creating retrieval function

In [17]:
# Check similarity search is working
query = "what are the limitations of gpt-4 ?"
docs = db.similarity_search(query)
docs[0]

Document(page_content='5 Limitations\nDespite its capabilities, GPT-4 has similar limitations as earlier GPT models. Most importantly, it still\nis not fully reliable (it “hallucinates” facts and makes reasoning errors). Great care should be taken\nwhen using language model outputs, particularly in high-stakes contexts, with the exact protocol\n(such as human review, grounding with additional context, or avoiding high-stakes uses altogether)\nmatching the needs of specific applications. See our System Card for details.\nGPT-4 significantly reduces hallucinations relative to previous GPT-3.5 models (which have them-\nselves been improving with continued iteration). GPT-4 scores 19 percentage points higher than our\nlatest GPT-3.5 on our internal, adversarially-designed factuality evaluations (Figure 6).', metadata={'source': '../data/gpt-4.pdf', 'page': 9})
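
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea as a brute-force search over hand-made three-dimensional vectors. It assumes Euclidean (L2) distance, which as far as I know is the default for LangChain's FAISS wrapper; the vectors, texts, and the `top_k` helper are all hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=4):
    """Return the texts of the k stored vectors closest to `query_vec`."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy "index" of (text, embedding) pairs; real embeddings have far more dimensions.
toy_index = [
    ("limitations of the model", [0.9, 0.1, 0.0]),
    ("training compute", [0.1, 0.9, 0.0]),
    ("exam benchmarks", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # stands in for the embedded query

results = top_k(query_vec, toy_index, k=2)
print(results)
```

A real vector store avoids the full sort by using an index structure, but the ranking criterion is the same.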

In [18]:
# Create QA chain to integrate similarity search with user queries (answer query from knowledge base)

chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")

query = "Who created transformers?"
docs = db.similarity_search(query)

chain.run(input_documents=docs, question=query)


' The creators of transformers include William Fedus, Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Lukasz Kaiser, Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu, Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, Jacob Devlin,'
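
The `"stuff"` chain type places all retrieved chunks into one prompt together with the question, so the answer can only be as good as the chunks that were retrieved. The sketch below shows the idea with a hypothetical `build_stuff_prompt` helper; the wording of the template is an assumption and is not LangChain's actual prompt.

```python
def build_stuff_prompt(documents, question):
    """Concatenate all retrieved chunks into a single prompt for the LLM."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks standing in for `docs` from similarity_search.
retrieved = [
    "GPT-4 is a Transformer-based model.",
    "GPT-4 has similar limitations as earlier GPT models.",
]
prompt = build_stuff_prompt(retrieved, "What are the limitations of GPT-4?")
print(prompt)
```

Because every chunk is included verbatim, this approach only works while the combined chunks fit in the model's context window.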

5. Creating chatbot with chat memory (OPTIONAL)

In [19]:
from IPython.display import display
import ipywidgets as widgets

# Create a conversation chain that uses our vector DB as the retriever; it also handles chat history
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.1), db.as_retriever())

In [20]:
chat_history = []

def on_submit(_):
    query = input_box.value
    input_box.value = ""
    
    if query.lower() == 'exit':
        print("Thank you for using the Transformers chatbot!")
        return
    
    result = qa({"question": query, "chat_history": chat_history})
    chat_history.append((query, result['answer']))
    
    display(widgets.HTML(f'<b>User:</b> {query}'))
    display(widgets.HTML(f'<b><font color="blue">Chatbot:</font></b> {result["answer"]}'))

print("Welcome to the Transformers chatbot! Type 'exit' to stop.")

input_box = widgets.Text(placeholder='Please enter your question:')
input_box.on_submit(on_submit)

display(input_box)

Welcome to the Transformers chatbot! Type 'exit' to stop.



Text(value='', placeholder='Please enter your question:')
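
The chat memory in the cell above is just the `chat_history` list of `(question, answer)` tuples. A conversational retrieval chain typically uses it to rewrite each follow-up into a standalone question before searching the vector store. The sketch below illustrates that step; `format_chat_history`, `build_condense_prompt`, the prompt wording, and the sample answer are hypothetical and not taken from LangChain.

```python
def format_chat_history(chat_history):
    """Render (question, answer) tuples as a plain-text transcript."""
    return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)


def build_condense_prompt(chat_history, follow_up):
    """Build a prompt asking the LLM to turn a follow-up into a standalone question."""
    return (
        "Given the conversation below and a follow-up question, rewrite the "
        "follow-up as a standalone question.\n\n"
        f"Chat history:\n{format_chat_history(chat_history)}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )


chat_history = []
# Each turn is appended the same way as in on_submit above.
chat_history.append(("What is GPT-4?", "A large multimodal model."))

prompt = build_condense_prompt(chat_history, "What are its limitations?")
print(prompt)
```

Without this rewriting step, a follow-up such as "What are its limitations?" would be embedded on its own and the retriever would not know what "its" refers to.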

Automatic Prompt Generation 

In [None]:
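# `OpenAI` imported earlier is LangChain's LLM wrapper; `completions.create` needs the OpenAI SDK client
from openai import OpenAI as OpenAIClient

client = OpenAIClient(api_key=OPENAI_API_KEY)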

def generate_response(prompt):
    user_prompt = f'''"Break down the prompt generation step by step based on the following prompt pairs = "Linux Terminal","I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets [like this]"
    "English Translator and Improver","I want you to act as an English translator, spelling corrector and improver. I will speak to you in any language and you will detect the language, translate it and answer in the corrected and improved version of my text, in English. I want you to replace my simplified A0-level words and sentences with more beautiful and elegant, upper level English words and sentences. Keep the meaning same, but make them more literary. I want you to only reply the correction, the improvements and nothing else, do not write explanations."
    "`position` Interviewer","I want you to act as an interviewer. I will be the candidate and you will ask me the interview questions for the `position` position. I want you to only reply as the interviewer. Do not write all the conservation at once. I want you to only do the interview with me. Ask me the questions and wait for my answers. Do not write explanations. Ask me the questions one by one like an interviewer does and wait for my answers. "
    "JavaScript Console","I want you to act as a javascript console. I will type commands and you will reply with what the javascript console should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets [like this]. "
    "Excel Sheet","I want you to act as a text based excel. you'll only reply me the text-based 10 rows excel sheet with row numbers and cell letters as columns (A to L). First column header should be empty to reference row number. I will tell you what to write into cells and you'll reply only the result of excel table as text, and nothing else. Do not write explanations. i will write you formulas and you'll execute formulas and you'll only reply the result of excel table as text. "
    "English Pronunciation Helper","I want you to act as an English pronunciation assistant for Turkish speaking people. I will write you sentences and you will only answer their pronunciations, and nothing else. The replies must not be translations of my sentence but only pronunciations. Pronunciations should use Turkish Latin letters for phonetics. Do not write explanations on replies."
    "Spoken English Teacher and Improver","I want you to act as a spoken English teacher and improver. I will speak to you in English and you will reply to me in English to practice my spoken English. I want you to keep your reply neat, limiting the reply to 100 words. I want you to strictly correct my grammar mistakes, typos, and factual errors. I want you to ask me a question in your reply. Now let's start practicing, you could ask me a question first. Remember, I want you to strictly correct my grammar mistakes, typos, and factual errors."
    "Travel Guide","I want you to act as a travel guide. I will write you my location and you will suggest a place to visit near my location. In some cases, I will also give you the type of places I will visit. You will also suggest me places of similar type that are close to my first location. "
    "Plagiarism Checker","I want you to act as a plagiarism checker. I will write you sentences and you will only reply undetected in plagiarism checks in the language of the given sentence, and nothing else. Do not write explanations on replies. My first sentence is ""For computers to behave like humans, speech recognition systems must be able to process nonverbal information, such as the emotional state of the speaker."""
    "Character from Movie/Book/Anything","I want you to act like [character] from [series]. I want you to respond and answer like [character] using the tone, manner and vocabulary [character] would use. Do not write any explanations. Only answer like [character]. You must know all of the knowledge of [character]. "
    "Advertiser","I want you to act as an advertiser. You will create a campaign to promote a product or service of your choice. You will choose a target audience, develop key messages and slogans, select the media channels for promotion, and decide on any additional activities needed to reach your goals. "
    Use these topic/prompt pair examples only as guidelines to create an effective prompt for the next topic, even if the topic is mentioned before. You will create
    only a prompt for it and not act on the previous description. If the topic is mentioned already,
    do not use the prompt which you were given; change it.
    "{prompt}"'''
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=user_prompt,
        temperature=0.7,
        max_tokens=250,
    )
    return response.choices[0].text
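
The long prompt string above is a few-shot prompt: a list of topic/prompt pairs followed by the new topic. One way to keep it maintainable is to hold the pairs in a list and assemble the string in code. The sketch below assumes a hypothetical `build_few_shot_prompt` helper and shortened example prompts; it does not call the API.

```python
def build_few_shot_prompt(examples, topic):
    """Assemble a few-shot prompt from (title, prompt) pairs and a new topic."""
    lines = ["Use the topic/prompt pairs below only as guidelines."]
    for title, example_prompt in examples:
        lines.append(f'"{title}","{example_prompt}"')
    lines.append("Create a new prompt for the next topic. Do not act on it.")
    lines.append(f'"{topic}"')
    return "\n".join(lines)


# Shortened, hypothetical versions of the example pairs.
examples = [
    ("Linux Terminal", "I want you to act as a linux terminal. "
                       "I will put English text inside curly brackets {like this}."),
    ("Travel Guide", "I want you to act as a travel guide."),
]

few_shot_prompt = build_few_shot_prompt(examples, "an SQL generator")
print(few_shot_prompt)
```

Because the example text is passed in as values rather than written inside the f-string template, literal curly brackets in the examples need no escaping.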

In [None]:
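import gradio as gr  # commented out in the import cell above, but required here

app = gr.Interface(
    generate_response,
    title="Retrieval Augmented Generation",
    inputs="text",
    outputs="text",
    allow_flagging="never",  # string form; the boolean form is deprecated in recent Gradio releases
    examples=[["Prompt Generator"], ["a cmd prompt"], ['a translator'], ['an SQL generator'], ['an image generator']]
)

app.launch()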