<a href="https://colab.research.google.com/github/RCarteri/openAi_api/blob/main/Speak_with_any_PDF_file_PDF_AI_Clone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 4: Talk with any Document - Integrating ChatCompletion API, Embeddings, and Pinecone

In this advanced section of our course, we'll build an interactive system that lets users 'talk' with any document. Combining OpenAI's ChatCompletion API, embeddings, and the Pinecone vector database, we'll create an application that retrieves information from documents and answers questions about it conversationally.

## What You Will Learn

- **Integration of OpenAI Services**: Understand how to seamlessly integrate various OpenAI services such as ChatCompletion API and Embeddings to create a powerful AI system.
- **Pinecone for Vector Searching**: Get acquainted with Pinecone, a vector database perfect for handling complex queries over embeddings, to efficiently index and retrieve document information.
- **Natural Language Understanding**: Enhance the system's ability to comprehend and process human language within documents for more natural interactions.
- **User Interface for Document Interaction**: Build a user-friendly interface that allows users to upload documents and engage in conversations with the content.
- **Conversational Context Management**: Develop strategies to maintain the context of the conversation, ensuring relevant and accurate responses.

## Project Objectives

By the end of this project, you will have developed a system that can:

1. **Interpret Documents**: Analyze and understand the content of various documents through the power of embeddings.
2. **Conversational Interface**: Provide users with the ability to ask questions and receive answers as if they were talking to a human expert on the document's content.
3. **Contextual Awareness**: Maintain the thread of conversation, taking into account previous interactions and the document's subject matter.
4. **Scalable Document Handling**: Efficiently manage and query a large number of documents using Pinecone's vector database capabilities.

## Preparation Checklist

Before we dive in, make sure you have:

- A Google Colab account.
- A foundational understanding of Python, APIs, and natural language processing concepts.
- An OpenAI API key with access to the ChatCompletion and Embeddings features ([OpenAI](https://platform.openai.com/account/api-keys)).
- Familiarity with LangChain and Pinecone services.

## Ready to Talk with Documents?

We are about to transform how you interact with text-based information. Prepare to build a conversational bridge between users and the vast world of documents!

NOTE:

Retrieval-augmented generation (RAG) for large language models (LLMs) aims to improve prediction quality by using an external datastore at inference time to build a richer prompt that includes some combination of context, history, and recent/relevant knowledge.
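
As a minimal sketch of that idea (not the code used later in this notebook), the snippet below picks the most relevant passage from a tiny in-memory datastore and folds it into a prompt. The word-overlap scoring, the function names `retrieve` and `build_prompt`, and the sample passages are illustrative assumptions; the project itself relies on embeddings and Pinecone for the retrieval step.

```python
# Toy retrieval-augmented prompt builder (illustrative only).
# Assumption: relevance is scored by shared lowercase words, a stand-in
# for the embedding similarity used later in the project.

def retrieve(question, datastore, top_k=1):
    """Return the top_k passages sharing the most words with the question."""
    question_words = set(question.lower().split())

    def overlap(passage):
        return len(question_words & set(passage.lower().split()))

    return sorted(datastore, key=overlap, reverse=True)[:top_k]


def build_prompt(question, passages):
    """Place the retrieved passages ahead of the question in one prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Hypothetical sample passages, written only for this example
datastore = [
    "Pinecone stores vectors and returns the nearest neighbours of a query.",
    "PyPDF2 extracts text from the pages of a PDF file.",
]
question = "Which library extracts text from a PDF file?"
prompt = build_prompt(question, retrieve(question, datastore))
print(prompt)
```

The prompt that reaches the model contains only the passage judged relevant, which is what keeps the context short and focused.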


# 2. Libraries import

In [1]:
!pip install openai



In [2]:
!pip install PyPDF2
!pip install pinecone-client
!pip install python-dotenv

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [24]:
import os
import openai
import PyPDF2
import random

from pinecone import Pinecone
from openai import OpenAI
from dotenv import load_dotenv

# 3. Working with PDF files

![](https://miro.medium.com/v2/resize:fit:1400/1*FWwgOvUE660a04zoQplS7A.png)

Source: https://betterprogramming.pub/building-a-multi-document-reader-and-chatbot-with-langchain-and-chatgpt-d1864d47e339


### 3.1 Setting up API Key

In [4]:
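# Load variables from a local .env file into the environment
load_dotenv()

# OpenAI() also reads OPENAI_API_KEY from the environment by default;
# passing it explicitly makes the dependency visible
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))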
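
A missing key usually surfaces only later, as an authentication error from the API. One way to fail earlier, sketched here under the assumption that keys live in environment variables, is a small guard such as the hypothetical `require_env` helper below. The variable name `EXAMPLE_API_KEY` is a placeholder set only for the demonstration.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value


# Demonstration with a hypothetical variable, set here only for the example
os.environ["EXAMPLE_API_KEY"] = "sk-example"
print(require_env("EXAMPLE_API_KEY"))
```

In this notebook the same guard could wrap `OPENAI_API_KEY` right after `load_dotenv()`.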

### 3.2 Loading a PDF file




In [17]:
# Function to extract the text of every page of a given PDF file
def load_pdf(file_name):
    text_from_pdf = ""

    # The context manager closes the file even if extraction fails
    with open(file_name, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)

        for page in pdf_reader.pages:
            # Fall back to an empty string if a page yields no text
            text_from_pdf += page.extract_text() or ""

    return text_from_pdf

In [6]:
# Function to chunk text by number of words or characters with a given size and overlap
def chunk_text(text, chunk_size=1500, chunk_overlap=100, by='word'):
    if by not in ['word', 'char']:
        raise ValueError("Invalid value for 'by'. Use 'word' or 'char'.")
    if not 0 <= chunk_overlap < chunk_size:
        # Otherwise the window would never advance and the loop would not end
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size.")

    chunks = []

    # In 'word' mode work on a list of words; in 'char' mode slice the string directly
    units = text.split() if by == 'word' else text

    current_chunk_start = 0
    while current_chunk_start < len(units):
        current_chunk_end = current_chunk_start + chunk_size
        window = units[current_chunk_start:current_chunk_end]

        # Words are re-joined with single spaces; characters are already a string
        chunks.append(" ".join(window) if by == 'word' else window)

        # Step forward, keeping chunk_overlap units shared with the next chunk
        current_chunk_start += chunk_size - chunk_overlap
    return chunks
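
The chunking function advances its window by `chunk_size - chunk_overlap` units on every iteration, so consecutive chunks share `chunk_overlap` units. The sketch below isolates that arithmetic. `chunk_starts` is a hypothetical helper written for illustration, not part of the notebook.

```python
import math


def chunk_starts(total_units, chunk_size=1500, chunk_overlap=100):
    """Start offsets of each chunk for a sliding window with overlap."""
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return list(range(0, total_units, step))


starts = chunk_starts(3000)
print(starts)  # [0, 1400, 2800]

# The number of chunks equals ceil(total_units / step)
print(len(starts) == math.ceil(3000 / 1400))  # True
```

With the default settings, a 3000-unit text yields three chunks, the last one holding only the final 200 units.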

In [20]:
pdf_loaded = load_pdf("files/state_of_ai_docs.pdf")
pdf_loaded[:100]

'As organizations rapidly deploy generative AI tools, survey respondents \nexpect significant effects '

In [22]:
chunks = chunk_text(pdf_loaded, by='char')
chunks[:2]

['As organizations rapidly deploy generative AI tools, survey respondents \nexpect significant effects on their industries and workforces.The state of AI in \n2023: Generative AI’s \nbreakout year\nAugust 2023The state of AI in 2023: Generative AI’s breakout yearThe latest annual McKinsey Global Survey  on the current  \nstate of AI confirms the explosive growth of generative AI  \n(gen AI) tools. Less than a year after many of these tools debuted, \none-third of our survey respondents say their organizations are \nusing gen AI regularly in at least one business function. Amid \nrecent advances, AI has risen from a topic relegated to tech \nemployees to a focus of company leaders: nearly one-quarter  \nof surveyed C-suite executives say they are personally using  \ngen AI tools for work, and more than one-quarter of respondents \nfrom companies using AI say gen AI is already on their boards’ \nagendas. What’s more, 40 percent of respondents say their \norganizations will increase their

# 4. Building the RAG system (Retrieval-Augmented Generation)

In [25]:
# Pinecone init (the variable name must match the one in your .env file)
pc = Pinecone(api_key=os.getenv('PINECONE_API_KEY'))

# Get the index by name in Pinecone (the index must already exist)
index = pc.Index("rag-test")

In [26]:
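# Embed each chunk and store it in Pinecone together with its original text
for i in range(len(chunks)):
    vector = client.embeddings.create(
        model = "text-embedding-ada-002",
        input = chunks[i]
    )

    # Each record is an (id, values, metadata) tuple; the chunk position is the id
    insert_stats = index.upsert(
        vectors = [
            (
                str(i),
                vector.data[0].embedding,
                {
                    "org_text": chunks[i]
                }
            )
        ]
    )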
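
The loop above makes one upsert call per chunk. For larger documents, grouping records into batches may reduce the number of round trips. The following is a sketch only: `batched` is a hypothetical helper, the batch size of 100 is an assumed value rather than a documented limit, and the records use placeholder vectors instead of real embeddings.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical records in the same (id, values, metadata) shape as above;
# the short vectors are placeholders, not real embeddings.
records = [(str(i), [0.0, 0.0, 0.0], {"org_text": f"chunk {i}"}) for i in range(250)]

batch_sizes = [len(batch) for batch in batched(records)]
print(batch_sizes)  # [100, 100, 50]

# With a live index, each batch could then be sent in a single call:
# for batch in batched(records):
#     index.upsert(vectors=batch)
```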

# 5. Building an interface to get answers based on the document


In [29]:
user_input = "The top AI trends in 2023/2024"

user_vector = client.embeddings.create(
    model = "text-embedding-ada-002",
    input = user_input
)

user_vector = user_vector.data[0].embedding

matches = index.query(
    vector = user_vector,
    top_k = 1,
    include_metadata = True
)

print(matches['matches'][0]['metadata']['org_text'])

As organizations rapidly deploy generative AI tools, survey respondents 
expect significant effects on their industries and workforces.The state of AI in 
2023: Generative AI’s 
breakout year
August 2023The state of AI in 2023: Generative AI’s breakout yearThe latest annual McKinsey Global Survey  on the current  
state of AI confirms the explosive growth of generative AI  
(gen AI) tools. Less than a year after many of these tools debuted, 
one-third of our survey respondents say their organizations are 
using gen AI regularly in at least one business function. Amid 
recent advances, AI has risen from a topic relegated to tech 
employees to a focus of company leaders: nearly one-quarter  
of surveyed C-suite executives say they are personally using  
gen AI tools for work, and more than one-quarter of respondents 
from companies using AI say gen AI is already on their boards’ 
agendas. What’s more, 40 percent of respondents say their 
organizations will increase their investment in AI
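
Conceptually, the query compares the question's embedding with every stored vector and returns the closest ones. The sketch below reproduces that ranking in plain Python with toy two-dimensional vectors. Cosine similarity is an assumption here, since the notebook does not state which metric the `rag-test` index uses, and `top_k_matches` is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_matches(query, vectors, k=1):
    """Return (id, score) pairs for the k vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), vec_id) for vec_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [(vec_id, score) for score, vec_id in scored[:k]]


# Toy vectors standing in for stored chunk embeddings
vectors = {"0": [1.0, 0.0], "1": [0.0, 1.0], "2": [1.0, 1.0]}
best = top_k_matches([0.9, 0.1], vectors, k=2)
print([vec_id for vec_id, _ in best])  # ['0', '2']
```

A vector database performs the same kind of ranking, but with indexing structures that avoid scanning every vector.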

In [41]:
messages = [{"role": "system", "content": """I want you to act as a support agent. Your name is "My Super Assistant". You will provide me with answers from the given info. If the answer is not included, say exactly "Ooops! I don't know that." and stop after that. Refuse to answer any question not about the info. Never break character."""}]
messages.append({"role": "user", "content": matches['matches'][0]['metadata']['org_text']})
messages.append({"role": "user", "content": "trends in 20203 and"})

chat_messages = client.chat.completions.create(
    model = "gpt-3.5-turbo",
    messages = messages,
    temperature=0,
    max_tokens=400
)

print(chat_messages.choices[0].message.content)

I'm sorry, but I can only provide information based on the text you provided about the state of AI in 2023. If you have any questions related to that, feel free to ask!
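
The cell above assembles the message list inline from a single match. When `top_k` is raised above 1, the same assembly could be wrapped in a small helper that accepts several retrieved chunks. This is a sketch under assumptions: `build_messages` and the `Info:` label are hypothetical, and the sample strings are placeholders.

```python
def build_messages(system_prompt, context_chunks, question):
    """Assemble a chat message list: system prompt, retrieved context, question."""
    context = "\n\n".join(context_chunks)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Info:\n{context}"},
        {"role": "user", "content": question},
    ]


example_messages = build_messages(
    "Answer only from the given info.",
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What does the info say?",
)
for message in example_messages:
    print(message["role"], "->", message["content"][:40])
```

The resulting list has the same shape as the `messages` argument passed to `client.chat.completions.create` above.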
