<a href="https://colab.research.google.com/github/Nickguild1993/Working_with_PDFs/blob/main/PDF_RAG_ChatBot_openAI_pynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Loading a PDF, extracting its text (tables and images are possible as well), and querying a RAG system about the PDF's contents through a chatbot**

RAG: Retrieval-Augmented Generation

RAG is an AI technique that pairs a large language model (LLM) with a defined set of knowledge sources or documents: relevant passages are retrieved first, then handed to the model as context for its answer.

A few Python libraries can read PDFs, including pypdf, pdfplumber, and PyMuPDF.

In [44]:
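# This function removes the annoyingly long output that occurs when !pip installing a package that's already loaded
# Maybe I'm the only one bothered by this, haha.

import importlib.metadata
import subprocess
import sys

def install_package(package_name):
  # Look up the pip distribution name (e.g. "faiss-gpu", "PyMuPDF") rather than
  # trying to import it, since the import name can differ ("faiss", "fitz").
  try:
    importlib.metadata.version(package_name)
    print(f"{package_name} is already installed.")
  except importlib.metadata.PackageNotFoundError:
    # Use the notebook's own interpreter so the package lands in the right environment
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--quiet", package_name])
    print(f"{package_name} installed successfully.")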

In [45]:
# import the normies

import pandas as pd
import numpy as np
from google.colab import drive
from PIL import Image
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [48]:
install_package("langchain")
install_package("openai")
install_package("tiktoken")
install_package("faiss-gpu")
install_package("langchain_experimental")
install_package('pypdf')

langchain is already installed.
openai is already installed.
tiktoken is already installed.
faiss-gpu installed successfully.
langchain_experimental is already installed.
pypdf is already installed.


In [49]:
# pip installing these directly instead of using the function b/c they use a different format (extras, -U flag)
!pip install "langchain[docarray]"
!pip install -U langchain-openai




In [50]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain_openai import ChatOpenAI

In [51]:
# this version uses pypdf instead of PyMuPDF, but it's still a good install if you're doing further text/image/table analysis

install_package('PyMuPDF')
import fitz
install_package('pdf2image')
import pdf2image

PyMuPDF installed successfully.
pdf2image is already installed.


In [52]:
# get PDF path

pdf_path = '/content/drive/MyDrive/Colab Notebooks/PDF_distilling_and_summarization/monetary_and_social_markets_Ariely.pdf'

# load the path and define the pages

loader = PyPDFLoader(pdf_path)
pages = loader.load_and_split()
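
`load_and_split()` loads the PDF and breaks it into `Document` chunks using a default text splitter. To make the chunking idea concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. The `chunk_text` helper and its `chunk_size`/`overlap` values are hypothetical, chosen for illustration; this is not what LangChain does internally (its splitters also try to break on separators such as newlines).

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghij" * 5  # 50 characters of stand-in text
chunks = chunk_text(sample, chunk_size=20, overlap=5)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary is less likely to be cut off from its context.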

In [9]:
# Check to make sure, should return the first page

pages[0]

Document(metadata={'source': '/content/drive/MyDrive/Colab Notebooks/PDF_distilling_and_summarization/monetary_and_social_markets_Ariely.pdf', 'page': 0}, page_content='Research Article\nEffort for Payment\nA Tale of Two Markets\nJames Heyman1 and Dan Ariely2\n1University of California, Berkeley, and2Massachusetts Institute of Technology\nABSTRACT— The standard model of labor is one in which\nindividuals trade their time and energy in return for\nmonetary rewards. Building on Fiske’s relational theory\n(1992), we propose that there are two types of markets\nthat determine relationships between effort and payment:\nmonetary and social. We hypothesize that monetary\nmarkets are highly sensitive to the magnitude of compen-\nsation, whereas social markets are not. This perspective\ncan shed light on the well-established observation that\npeople sometimes expend more effort in exchange for no\npayment (a social market) than they expend when they\nreceive low payment (a monetary market). Thr

AS OF 12/16, A BREAKING UPDATE MEANS YOU HAVE TO REVERT THE OPENAI VERSION: with openai 1.56.0, client init fails with "got an unexpected keyword argument 'proxies'". Details:

https://community.openai.com/t/error-with-openai-1-56-0-client-init-got-an-unexpected-keyword-argument-proxies/1040332/11

In [53]:
%%capture
!pip install openai==1.55.3 httpx==0.27.2 --force-reinstall --quiet

In [54]:
from openai import OpenAI


from google.colab import userdata

# Retrieve API KEY
api_key = userdata.get('nick_open_ai') # reads the key stored under this name in Colab's Secrets

Create the embeddings and store them in the vector database

In [55]:
embeddings = OpenAIEmbeddings(openai_api_key = api_key)

vectorstore = FAISS.from_documents(pages, embedding = embeddings)
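
Conceptually, a vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings lie closest to the query's embedding. The sketch below is a brute-force, standard-library stand-in for that idea. The 3-dimensional vectors, the chunk labels, and the `nearest_chunks` helper are all made up for illustration; real embeddings have far more dimensions, and FAISS uses optimized indexes rather than a Python loop.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_chunks(query_vec, store, k=2):
    """Return the k (text, distance) pairs closest to query_vec."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy "vector store": (chunk text, made-up embedding)
store = [
    ("chunk A: monetary markets", [0.9, 0.1, 0.0]),
    ("chunk B: social markets", [0.1, 0.9, 0.0]),
    ("chunk C: unrelated section", [0.0, 0.1, 0.9]),
]

query_vec = [0.8, 0.2, 0.0]  # pretend embedding of the user's question
results = nearest_chunks(query_vec, store, k=2)
for text, dist in results:
    print(f"{dist:.3f}  {text}")
```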

Load the LLM

In [56]:
llm = ChatOpenAI(temperature = 0.7, model_name = 'gpt-4o-mini', openai_api_key = api_key)

Set up a memory buffer to store the conversation history

In [57]:
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
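
A conversation buffer is essentially an append-only list of past messages that gets passed back in with each new question. Here is a rough sketch of the idea, using a hypothetical `SimpleBuffer` class rather than LangChain's actual implementation:

```python
class SimpleBuffer:
    """Toy stand-in for a chat-history buffer (not LangChain's class)."""

    def __init__(self):
        self.messages = []

    def save_turn(self, question, answer):
        # Store both sides of the exchange in order
        self.messages.append({"role": "human", "content": question})
        self.messages.append({"role": "ai", "content": answer})

    def as_text(self):
        # Flatten the history so it can be prepended to the next prompt
        return "\n".join(f"{m['role']}: {m['content']}" for m in self.messages)

buffer = SimpleBuffer()
buffer.save_turn("What are the two markets?", "Monetary and social.")
buffer.save_turn("Which one is sensitive to payment?", "The monetary market.")
print(buffer.as_text())
```

Because a plain buffer keeps every turn, the history sent to the model grows with each question, which is worth keeping in mind for long chats.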

Connect those 3 elements (vector store, LLM, memory) into a new conversation chain

In [58]:
conversation_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    chain_type="stuff", # lol
    retriever=vectorstore.as_retriever(),
    memory=memory
)
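
The `"stuff"` chain type is the simplest option: it stuffs all retrieved chunks into a single prompt alongside the question. A minimal sketch of that idea follows. The `build_stuff_prompt` helper and the prompt wording are assumptions for illustration, not LangChain's actual template.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate every retrieved chunk into one context block."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

chunks = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = build_stuff_prompt("How do the two markets differ?", chunks)
print(prompt)
```

This approach only works while the combined chunks fit inside the model's context window, which is usually fine for a handful of retrieved pages from one paper.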

Create a target_folder variable for the export path we want to use

In [None]:
# folder for the exports
target_folder = "/content/drive/MyDrive/Colab Notebooks/PDF_distilling_and_summarization/processed_text"

LET'S ASK THE MODEL SOME QUESTIONS, BOYS

In [60]:
query = "Please give a detailed explanation of the two types of markets and how they differ."
result = conversation_chain.invoke({"question": query}) # plugging in the above query in the question field
answer = result["answer"]

response_text = answer
print(response_text)

The two types of markets described in the research article are monetary markets and social markets. Here’s a detailed explanation of each type and their differences:

1. **Monetary Markets**:
   - In monetary markets, individuals exchange their time and effort for monetary compensation. This type of market is characterized by a direct relationship between the amount of payment and the effort exerted. The more money offered, the more effort individuals are likely to put in. This relationship is often described as monotonic, meaning that effort increases with higher payment levels.
   - In these markets, norms and expectations associated with monetary compensation come into play. Participants are likely to view tasks as work and may feel obligated to reciprocate the payment with the corresponding level of effort. The presence of money creates a transactional mindset where the relationship is defined by economic exchange.

2. **Social Markets**:
   - Social markets, on the other hand, are

In [37]:
# change response_text from NoneType into string (not needed: the answer is already a str)

# response_text_string = str(response_text)

Set up the export

In [61]:


file_path = f'{target_folder}/ariely_two_markets.txt'

with open(file_path, 'w') as f:
  f.write(response_text)
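
If the target folder does not exist yet, `open(..., 'w')` raises `FileNotFoundError`. A slightly more defensive variant is sketched below, with a hypothetical `save_text` helper and a temporary directory standing in for the Drive folder:

```python
import tempfile
from pathlib import Path

def save_text(text, folder, filename):
    """Create the folder if needed, then write text as UTF-8."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    out_path.write_text(text, encoding="utf-8")
    return out_path

# Temporary directory used here only so the example runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    path = save_text("Here’s a test.", Path(tmp) / "processed_text", "demo.txt")
    saved = path.read_text(encoding="utf-8")
    print(path.name, len(saved))
```

Passing `encoding="utf-8"` explicitly keeps characters such as the curly apostrophes in the model's answers intact regardless of the platform default.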

SUMMARIZE THE PDF!

In [40]:
query = "Please provide an in-depth summary of the paper"
result = conversation_chain.invoke({"question": query})
answer = result["answer"]

summary_text = answer
print(summary_text)

The paper titled "Effort for Payment: A Tale of Two Markets" by James Heyman and Dan Ariely explores the relationship between effort and compensation, distinguishing between two types of markets: monetary and social. The authors argue that these markets influence how individuals respond to different forms of payment and levels of compensation.

**Key Concepts:**
1. **Monetary Markets:** These are sensitive to the magnitude of compensation. In these markets, people are more likely to exert effort in exchange for higher monetary rewards.
2. **Social Markets:** These are less sensitive to compensation levels. People may exert more effort when no payment is offered than when low monetary payments are made, as social motivations can play a significant role.

**Research Focus:**
The paper examines the conditions under which individuals are motivated to help others, highlighting the complexities of compensation. It poses questions about the effectiveness of different types of rewards (cash vs

In [43]:
# export it as a .txt file


file_path = f'{target_folder}/ariely_summarization_of_two_markets.txt'
with open(file_path, 'w') as f: # w = write. we're writing it to f, which is file
  f.write(summary_text)

SCRAP BELOW, M'BOY