# Welcome to Lab-1
## In this notebook we will learn how to pass different documents to GPT and have them analysed

## First, we install the packages that we will need along the way
1. openai: This package lets us call OpenAI's chat completion method to generate results using GPT.
2. PyMuPDF: This package is used for easy PDF manipulation.
3. tiktoken: This package is used to count the tokens in a text.

In [None]:
%%capture
!pip install openai==1.3.9 PyMuPDF==1.24.2 PyMuPDFb==1.24.1 tqdm tiktoken

In [None]:
import openai
import fitz
from tqdm import tqdm
import os
import tiktoken

# As described in prerequisite step 1, replace the placeholder below with your own OpenAI API key.
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
model_name = "gpt-3.5-turbo"

token_encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

Here we define a function that generates a response from OpenAI.

The `openai` library reads the API key from the `OPENAI_API_KEY` environment variable, and the function uses it to call the `openai.chat.completions.create` method.

This method takes the system message and the user message as input for generating the response.

### The temperature parameter defines the randomness of the output

The higher the temperature, the more creative the LLM becomes with its answers, so higher temperatures are used for generating poems, jokes, etc.
A lower temperature gives more deterministic results: the model tends to return the most probable next token.
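The effect of temperature can be illustrated with a minimal sketch. It assumes the common formulation in which the model's raw scores (logits) are divided by the temperature before being turned into probabilities; the function name `softmax_with_temperature` and the example scores are hypothetical and are not part of the OpenAI API. The sketch requires a temperature above zero, whereas a setting of 0 is usually treated as "always pick the top token".

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [score / temperature for score in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
low = softmax_with_temperature(logits, 0.2)   # low temperature
high = softmax_with_temperature(logits, 2.0)  # high temperature
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At the low temperature almost all of the probability lands on the top-scoring token, while at the high temperature the three candidates end up much closer together.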

### Top P


Imagine the model has a bag of words it can choose from, each with its own probability. Top P, also called nucleus sampling, shrinks that bag to the most likely words: the model keeps only the smallest set of words whose combined probability reaches the Top P value, and then picks from that set.

A low value keeps only the very best words, leading to safe but boring answers that are more exact and less creative. A high value also lets in some less likely words, making the answers more interesting and surprising but potentially less accurate.
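Top P filtering can be sketched as follows. This is an illustrative toy, not the model's actual implementation: the function name `nucleus` and the word probabilities are made up for the example.

```python
def nucleus(probabilities, top_p):
    """Return the smallest set of tokens whose cumulative probability reaches top_p."""
    # Rank tokens from most to least likely.
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        cumulative += prob
        if cumulative >= top_p:
            break
    return kept

# Hypothetical next-word probabilities.
probs = {"law": 0.5, "court": 0.3, "banana": 0.15, "xylophone": 0.05}
print(nucleus(probs, 0.1))  # ['law']
print(nucleus(probs, 0.9))  # ['law', 'court', 'banana']
```

In this sketch a Top P of 0 keeps only the single most likely word, which is consistent with the near-deterministic behaviour this notebook aims for.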

In [None]:
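def CallOpenAI(user,system):
  """Send a system message and a user message to the chat model and return the raw response."""
  response = openai.chat.completions.create(
      model=model_name,
      temperature=0,  # keep the output as deterministic as possible
      top_p=0,  # restrict sampling to the most likely tokens
      messages=[
          {"role": "system", "content": system},
          {"role": "user", "content": user}
      ]
  )
  return response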

## Let's take a contract and try to analyse it without much instruction

First we load the PDF, extract its text, and count the tokens in that text.

In [None]:
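def extract_text(pdf_path):
  """Return the text of every page in the PDF and print its token count."""
  pdf = fitz.open(pdf_path)
  text = ''

  # Concatenate the text of each page in reading order.
  for page in pdf:
    text += page.get_text()

  # Count tokens with the encoding that matches the chat model.
  num_tokens = len(token_encoding.encode(text))
  print("Number of tokens in the entire Document: ", num_tokens)
  return text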

Here we can see that the token count of the document is 11590, which is well within the 16000-token context limit of the GPT-3.5 model.

Remember to upload the PDF to the Colab runtime first.

Follow the instructions in Lab-0, which show how to upload files to the Colab runtime.

Here's the link:

https://github.com/initmahesh/MLAI-community-labs/blob/main/Class-Labs/Lab-0(Pre-requisites)/README.md

In [None]:
short_document = extract_text("AWS1.pdf")

Number of tokens in the entire Document:  11590


## We concatenate the text from the PDF with the question the user wants to ask GPT about it, forming the prompt we will use to generate the response with the `openai.chat.completions.create` method

In [None]:
Question = "What is the governing law for Amazon Web Services South Africa Proprietary Limited?"

full_prompt_SD = "<Context>"+short_document+"</Context>" +"\n\n" +"<Question>"+Question+"</Question>"
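Since the same prompt layout is reused for every document, the concatenation can be wrapped in a small helper. The sketch below is one possible way to do it; `build_prompt` is a hypothetical name that does not appear elsewhere in this lab, and the example context is a placeholder rather than text from the contract.

```python
def build_prompt(context, question):
    """Wrap the document text and the question in the tags used in this notebook."""
    return "<Context>" + context + "</Context>" + "\n\n" + "<Question>" + question + "</Question>"

# Placeholder context and question, for illustration only.
example = build_prompt("Clause 1: ...", "What does clause 1 say?")
print(example)
```

The tags make it easier for the model to tell where the document ends and the question begins.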

In [None]:
response = CallOpenAI(full_prompt_SD,"You are a Professional lawyer who can analyse documents thoroughly")


### We can see that GPT was able to generate the answer by referring to the prompt and gave the correct result

In [None]:
print(response.choices[0].message.content)


## Now let's load a document that has more than 16000 tokens, which is the limit of GPT-3.5-Turbo

Upload the PDF to the Colab runtime.

Follow the instructions in Lab-0, which show how to upload files to the Colab runtime.

Here's the link:

https://github.com/initmahesh/MLAI-community-labs/blob/main/Class-Labs/Lab-0(Pre-requisites)/README.md

In [None]:
long_document = extract_text("PROFRAC HOLDINGS, LLC credit agreement.pdf")

Number of tokens in the entire Document:  163227
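A quick bit of arithmetic shows how far this document overshoots the limit. The sketch below uses the two token counts printed in this notebook and the approximate 16000-token limit quoted in this lab; the helper names `fits_in_context` and `windows_needed` are hypothetical. It ignores the extra tokens taken up by the system message, the question, and the reply, so the real budget is somewhat tighter.

```python
import math

CONTEXT_LIMIT = 16000  # approximate limit quoted in this lab

def fits_in_context(num_tokens, limit=CONTEXT_LIMIT):
    """True when the text alone fits inside the context window."""
    return num_tokens <= limit

def windows_needed(num_tokens, limit=CONTEXT_LIMIT):
    """Minimum number of context-sized pieces needed to cover the text."""
    return math.ceil(num_tokens / limit)

# Token counts printed by extract_text for the two documents in this lab.
for name, count in [("AWS1.pdf", 11590), ("PROFRAC credit agreement", 163227)]:
    print(name, fits_in_context(count), windows_needed(count))
```

The short contract fits in a single request, while the credit agreement would need at least 11 context-sized pieces, so it cannot be sent in one prompt.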


In [None]:
Question = "What is the Acknowledgement Regarding Any Supported QFCs?"

full_prompt_LD = "<Context>"+long_document+"</Context>" +"\n\n" +"<Question>"+Question+"</Question>"

## Here you can see that when the message length exceeds the limit of GPT, the API throws an error.
### This problem will be addressed in the next lab, where you will see how Retrieval Augmented Generation (RAG) enables us to analyse documents of any length.

In [None]:
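try:
  response = CallOpenAI(full_prompt_LD,"You are a Professional lawyer who can analyse documents thoroughly")
except openai.BadRequestError as e:
  # The API rejects prompts that exceed the context window with a 400 error.
  # Other failures, such as an invalid API key, are not hidden by this handler.
  print("Context length is more than the input capacity of 16000")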

Context length is more than the input capacity of 16000


## Let's discuss in the next lab how we solve this problem with RAG (Retrieval Augmented Generation)