# A RAG application using the ChatGPT API

Before you begin, go to https://platform.openai.com/api-keys and create a new secret key if you don't already have one. Copy the secret key and paste it after the equals sign in the cell below.

In [None]:
%env OPENAI_API_KEY = 
MODEL = "gpt-3.5-turbo"


env: OPENAI_API_KEY=
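
If you would rather not paste the key into the notebook itself, one option is to read it from the environment and fail early when it is missing. The helper below is a minimal sketch under that assumption; `require_env` and the `DEMO_API_KEY` variable are hypothetical names used only for illustration, and the value shown is a placeholder, not a real key.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; create a key and export it first.")
    return value


# Demonstration with a placeholder value (hypothetical variable, not a real key).
os.environ["DEMO_API_KEY"] = "sk-placeholder"
print(require_env("DEMO_API_KEY"))
```

Failing early like this tends to give a clearer error than a rejected API call later in the notebook.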


Install the required dependencies

In [26]:
!pip install langchain langchain-openai langchain_pinecone "langchain[docarray]" docarray pydantic==1.10.8 pytube python-dotenv tiktoken pinecone-client scikit-learn ruff pypdf

Collecting pypdf
  Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m290.4/290.4 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: pypdf
Successfully installed pypdf-4.2.0


Let's import the libraries and test if the API works

In [None]:
import os

from langchain_openai.chat_models import ChatOpenAI
from langchain_community.embeddings import OllamaEmbeddings
from langchain_openai.embeddings import OpenAIEmbeddings

# Use OpenAI's chat model and embeddings when MODEL names a GPT model.
if MODEL.startswith("gpt"):
    model = ChatOpenAI(openai_api_key=os.getenv('OPENAI_API_KEY'), model=MODEL)
    embeddings = OpenAIEmbeddings()

# Quick smoke test: send one prompt and inspect the raw response.
model.invoke("Tell me a joke")

AIMessage(content="Why don't scientists trust atoms?\n\nBecause they make up everything!", response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 11, 'total_tokens': 24}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-a361395e-8a16-4f16-91cb-82b349f7fc66-0')

The previous output included a lot of extra information, such as token counts, the model name, and logprobs. Let's use LangChain's output parser within a chain and ask ChatGPT to tell another joke.

In [None]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

chain = model | parser
chain.invoke("Tell me a joke")

"Why couldn't the bicycle stand up by itself?\n\nBecause it was two-tired!"

Much better! Ignore the \n sequences; they are escape codes for a new line (\t, when it appears, is a tab). Next, we provide a template that outlines how prompts should be structured around a given "context" and "question". We then fill this template with actual content, guiding the language model to generate answers that are precise and relevant. The key takeaway is that clear, structured input gives us a more controlled and meaningful interaction with the language model.

In [None]:
from langchain.prompts import PromptTemplate

template = """
Answer the question based on the context below. If you can't
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)
prompt.format(context="Here is some context", question="Here is a question")

'\nAnswer the question based on the context below. If you can\'t\nanswer the question, reply "I don\'t know".\n\nContext: Here is some context\n\nQuestion: Here is a question\n'
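
Filling the template amounts to substituting named placeholders. The sketch below reproduces that idea with Python's built-in `str.format`; it is an illustration only, not how `PromptTemplate` is implemented, and the `fill` helper is a hypothetical name.

```python
TEMPLATE = """
Answer the question based on the context below. If you can't
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""


def fill(template: str, **values: str) -> str:
    """Substitute named placeholders; raise a clear error if one is missing."""
    try:
        return template.format(**values)
    except KeyError as missing:
        raise ValueError(f"missing template variable: {missing}") from None


filled = fill(TEMPLATE, context="Here is some context", question="Here is a question")
print(filled)
```

Leaving out either variable raises an error instead of silently sending an incomplete prompt to the model.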

Now we plug the prompt into the chain we created earlier by adding it at the front. We then call invoke() with a dictionary that supplies the context and the question.

In [22]:
chain = prompt | model | parser

chain.invoke({"context": "My parents named me Santiago", "question": "What's my name?"})

'Your name is Santiago.'
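
The `|` syntax composes steps so that each step's output becomes the next step's input. The toy sketch below shows that idea using Python's `__or__` operator. `Step`, `toy_prompt`, `toy_model`, and `toy_parser` are hypothetical stand-ins; no API is called and this is not LangChain's actual implementation.

```python
from typing import Any, Callable


class Step:
    """Wrap a function so that `a | b` builds a new step running a, then b."""

    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        # Compose: feed this step's output into the next step.
        return Step(lambda value: other.fn(self.fn(value)))

    def invoke(self, value: Any) -> Any:
        return self.fn(value)


# Stand-ins for the prompt, the model, and the output parser.
toy_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
toy_model = Step(lambda text: {"content": f"(echo) {text}", "metadata": {}})
toy_parser = Step(lambda message: message["content"])

toy_chain = toy_prompt | toy_model | toy_parser
result = toy_chain.invoke(
    {"context": "My parents named me Santiago", "question": "What's my name?"}
)
print(result)
```

The stand-in model simply echoes its input, so the printed result is the formatted prompt with the metadata stripped away, which is the role the parser plays in the real chain.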

Click the Files icon on the left and upload Pandora Papers.pdf, which you can download from the repository. We then load the document and split it into pages that will serve as context later.

In [27]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/Pandora Papers.pdf")
pages = loader.load_and_split()
pages

[Document(page_content="Symbol used to represent the leak\nby the International Consortium of\nInvestigative Journalists\nPandora Papers\nThe Pandora Papers  are 11.9 million leaked documents  with 2.9\nterabytes  of data that the International Consortium of\nInvestigative Jour nalists  (ICIJ) published beginning on 3 October\n2021.[1][2][3] The leak exposed  the secret offshore accounts of 35\nworld leaders, including current and former presidents, prime\nministers, and heads of state as well as more than 100 business\nleaders, billionaires, and celebrities. The news organizations of the\nICIJ described the document leak as their most expansive exposé\nof financial secrecy yet, containing documents, images, emails and\nspreadsheets from 14 financial service companies, in nations\nincluding Panam a, Switzerland and the United Arab\nEmirates.[4][5] The size of the leak surpassed their previous release\nof the Panama Papers  in 2016, which had 11.5 million confidential\ndocuments and 2.6
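
`load_and_split` returns a list of `Document` chunks. The general idea of cutting long text into overlapping chunks can be sketched in plain Python. This is a simplified fixed-size splitter for illustration, not the algorithm the loader uses; `split_text` and its parameters are hypothetical.

```python
from typing import List


def split_text(text: str, chunk_size: int = 40, overlap: int = 10) -> List[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "abcdefghij" * 10  # 100 characters of placeholder text
chunks = split_text(sample, chunk_size=40, overlap=10)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which can help retrieval later.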

The code below sets up an in-memory vector search system. It takes a collection of documents (pages) and computes a vector representation (embedding) for each one. This lets us search and retrieve documents by vector similarity, which is useful in applications such as semantic search or recommendation systems.

In [28]:
from langchain_community.vectorstores import DocArrayInMemorySearch

vectorstore = DocArrayInMemorySearch.from_documents(pages, embedding=embeddings)
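
To make "vector similarity" concrete, here is a toy search that runs without any API. It assumes a crude word-count embedding in place of the OpenAI embeddings used above, and ranks documents by cosine similarity; `embed`, `cosine`, `search`, and `toy_docs` are hypothetical names for this sketch.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, docs: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k documents most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


toy_docs = [
    "offshore accounts held by world leaders",
    "money laundering investigation in Uruguay",
    "recipe for sourdough bread",
]
top = search("money laundering", toy_docs, k=1)
print(top)
```

A real embedding model captures meaning rather than exact word overlap, but the ranking step works the same way: embed the query, score every stored vector, and keep the best matches.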

Let's use the in-memory vector store to search for documents related to a query. First, we convert the vector store into a retriever object with the as_retriever() method. We then use this retriever to run a search with the query "money laundering".

In [29]:
retriever = vectorstore.as_retriever()
retriever.invoke("money laundering")

[Document(page_content='Uruguay : The National Secretariat for the Fight against Money Laundering and Terrorism\nFinancing of the Presidential Of fice of Uruguay started ex officio  to research on the case to find\nout about the roles of law firms  based in Montevideo aimed to provide of fshore entities to clients\nworldwide as intermediaries of Alcogal.[91]\nList of people named in the Pandora Papers\nAzerbaijani Laundromat\nBahamas Leaks\nBanca Privada d\'Andorra\nCyprus Papers\nDubai Uncovered\nFinCEN Files\nThe Laundromat  (2019 film)\nLuxLeaks\nMauritius Leaks\nOffshore Leaks\nPanama Papers\nParadise Papers\nRussian Laundromat\nSwiss Leaks\nSuisse secrets\n1. Miller , Greg; Cenziper , Debbie; Whoriskey , Peter (3 October 2021). "Pandora Papers – A Global\nInvestigation – Billions Hidden Beyond Reach – Trove of secret files details opaque financial\nuniverse where global elite shield riches from taxes, probes and accountability"  (https://www .washi\nngtonpost.com/business/interact

This pipeline chains together a sequence of operations: extracting the question, retrieving related context, generating the prompt, processing that prompt with a model, and parsing the output. This approach is highly modular, allowing each component of the pipeline to be swapped out or modified independently.

In [30]:
from operator import itemgetter

chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
    }
    | prompt
    | model
    | parser
)
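
The dictionary at the head of the chain takes the single input, a dictionary with a "question" key, and produces both values the prompt needs. A plain-Python sketch of that fan-out follows. `fake_retriever` and `fan_out` are hypothetical stand-ins that return canned text, so nothing is actually searched.

```python
from operator import itemgetter
from typing import Any, Callable, Dict, List


def fake_retriever(query: str) -> List[str]:
    """Stand-in retriever: returns a canned passage instead of searching a vector store."""
    return [f"passage about {query}"]


def fan_out(
    steps: Dict[str, Callable[[Dict[str, Any]], Any]], inputs: Dict[str, Any]
) -> Dict[str, Any]:
    """Run every step on the same input and collect the results under each step's key."""
    return {key: step(inputs) for key, step in steps.items()}


steps = {
    # "context": pull out the question, then look up related passages.
    "context": lambda inputs: fake_retriever(itemgetter("question")(inputs)),
    # "question": pass the question through unchanged.
    "question": itemgetter("question"),
}

prompt_inputs = fan_out(steps, {"question": "What are the Pandora Papers?"})
print(prompt_inputs)
```

The resulting dictionary has exactly the "context" and "question" keys that the prompt template expects.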

Let's ask some questions about the Pandora Papers below.

In [31]:
questions = [
    "What are the Pandora Papers and how were they obtained?",
    "Who are some of the most notable individuals mentioned in the Pandora Papers and what accusations are made against them?",
    "How do the Pandora Papers differ from previous leaks like the Panama Papers and Paradise Papers?",
    "What methods do wealthy individuals and corporations use to hide their wealth as revealed by the Pandora Papers?",
    "What are the legal and ethical implications of the findings in the Pandora Papers?",
    "How have governments around the world reacted to the revelations in the Pandora Papers?",
    "What role do offshore financial centers play in the global economy, according to insights from the Pandora Papers?",
    "How have the Pandora Papers influenced public opinion on wealth inequality and tax justice?",
    "What measures are being proposed or implemented to prevent the kind of secretive financial activities revealed by the Pandora Papers?",
    "What challenges do journalists face when investigating and reporting on leaks like the Pandora Papers?",
]

for question in questions:
    print(f"Question: {question}")
    print(f"Answer: {chain.invoke({'question': question})}")
    print()

Question: What are the Pandora Papers and how were they obtained?
Answer: The Pandora Papers are 11.9 million leaked documents with 2.9 terabytes of data that were obtained by the International Consortium of Investigative Journalists (ICIJ) and published beginning on October 3, 2021.

Question: Who are some of the most notable individuals mentioned in the Pandora Papers and what accusations are made against them?
Answer: Some of the most notable individuals mentioned in the Pandora Papers include former British Prime Minister Tony Blair, Chilean President Sebastián Piñera, former Kenyan President Uhuru Kenyatta, Montenegrin President Milo Đukanović, Ukrainian President Volodymyr Zelenskyy, Qatari Emir Tamim bin Hamad Al Thani, and many others. The accusations made against them include hiding assets in tax havens, involvement in financial fraud, and using offshore accounts for financial secrecy.

Question: How do the Pandora Papers differ from previous leaks like the Panama Papers and P