# Introduction ✒️

This notebook follows the LangChain course offered on the DeepLearning.AI platform. This part wraps up the course as a whole, which covered loading the data, splitting it into chunks (an optimization that reduces processing cost when the volume of data is large), and creating vector databases in which the content is stored.

The goal now is to build a question-answering chat, and a second one that does the same job but with memory, to make the interaction with the user smoother. Given a question, the chain should return a relevant answer; in the memory-enabled version, each exchange is also kept in the chain's "memory", so the user does not have to repeat earlier context in follow-up questions.

## Libraries 📚

In [1]:
!pip install --upgrade langchain -q

!pip install --upgrade langchain_community -q

!pip install google-generativeai -q

!pip install langchain_google_genai -q

!pip install docarray -q

!pip install pypdf -q

!pip install chromadb -q

In [2]:
import warnings
warnings.filterwarnings('ignore')

import os
import sys
sys.path.append('../..')
import google.generativeai as genai

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.embeddings import GooglePalmEmbeddings
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

## Setting the environment variables

In [3]:
# Set the environment variables.

os.environ["GOOGLE_API_KEY"] = "your-api-key"

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
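A key typed directly into a cell is easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming a hypothetical helper named `load_api_key` (it is not part of any library): read the key from the environment first and ask for it only when it is missing. The `env` and `prompt` parameters exist only so the example can be run without keyboard input.

```python
import os
from getpass import getpass


def load_api_key(name="GOOGLE_API_KEY", env=None, prompt=getpass):
    """Return the key stored under `name`, asking the user only if it is missing.

    Hypothetical helper for illustration; not part of LangChain or google.generativeai.
    """
    env = os.environ if env is None else env
    key = env.get(name)
    if not key:
        key = prompt(f"Enter {name}: ")
        env[name] = key  # keep it available for later cells
    return key


# Demonstration with a plain dict and a fake prompt, so nothing is read from the keyboard.
fake_env = {}
key = load_api_key(env=fake_env, prompt=lambda message: "demo-key")
print(key)
print(fake_env)
```

In the notebook itself, `load_api_key()` with no arguments would use `os.environ` and `getpass`, and its return value could be passed to `genai.configure(api_key=...)`.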

## Instantiating the models

In [4]:
# Instantiate the embedding model and the LLM that will be used.

embedding = GooglePalmEmbeddings()

llm = ChatGoogleGenerativeAI(
    model = "gemini-1.5-pro-latest",
    temperature = 0.5
)

## Loading the data

In [5]:
# Load the data that will be used.

loaders = [
    PyPDFLoader("https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture03.pdf")
]

In [6]:
# Build a list of PDF documents.

# The extend method adds each document to the pdf_docs list separately.

pdf_docs = []

for loader in loaders:
  pdf_docs.extend(loader.load())
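The difference between `extend` and `append` is what keeps `pdf_docs` flat. A small illustration with made-up page names standing in for the objects returned by `loader.load()`:

```python
# Each inner list stands in for the pages returned by one loader.load() call.
pages_per_file = [["lec1-p1", "lec1-p2"], ["lec2-p1"], ["lec3-p1", "lec3-p2"]]

flat, nested = [], []
for pages in pages_per_file:
    flat.extend(pages)    # adds every page as its own item
    nested.append(pages)  # adds the whole list as a single item

print(len(flat))    # one entry per page
print(len(nested))  # one entry per file
```

With `append`, the later splitting step would receive a list of lists instead of a list of documents.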

In [7]:
# Number of pages in the loaded files.

len(pdf_docs)

56

## Splitting the content into smaller chunks

In [8]:
# Split the text into small chunks to make the information easier to handle
# and, above all, easier to retrieve.

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000,
    chunk_overlap = 200,
    length_function = len,
    separators = ["\n\n", "\n", "."]
)

In [11]:
pdf_docs_split = text_splitter.split_documents(pdf_docs)

In [13]:
# The number of items changes after the split, roughly doubling
# relative to the original count.

print(f'Number of items before the split: {len(pdf_docs)}')
print(f'Number of items after the split: {len(pdf_docs_split)}')

Number of items before the split: 56
Number of items after the split: 108
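To see why the count grows, it helps to look at what `chunk_size` and `chunk_overlap` mean. The sketch below is a deliberate simplification: it cuts fixed-size character windows, whereas `RecursiveCharacterTextSplitter` also tries to break at the listed separators. The function name `split_with_overlap` and the sample text are made up for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


page = "x" * 4500  # stand-in for one long page of transcript text
chunks = split_with_overlap(page, chunk_size=2000, chunk_overlap=200)
print([len(chunk) for chunk in chunks])
```

A page longer than `chunk_size` yields more than one chunk, and consecutive chunks repeat the last 200 characters so that a sentence cut at a boundary still appears whole in one of them.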


## Creating the vector database with Chroma

In [9]:
# Creating the vector database:

persist_directory = 'docs/chroma/'

In [10]:
!rm -rf ./docs/chroma  # Remove the old database, if it exists.

In [14]:
# Instantiate the vector database with Chroma, an open-source framework
# for building vector databases. It mainly needs the content to be stored
# and the embedding model used to turn that content into vectors.

docs_vectordb = Chroma.from_documents(
    documents=pdf_docs_split,
    embedding=embedding,
    persist_directory=persist_directory
)

In [16]:
# Check that the content was stored in the database.

docs_vectordb._collection.count()

108
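Conceptually, the retriever built on this database embeds the question and returns the stored chunks whose vectors are closest to it. A toy sketch of that idea, assuming cosine similarity and hand-written 3-dimensional vectors (the metric Chroma actually uses depends on its configuration, and real embeddings have far more dimensions):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the k texts whose vectors are most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up vectors, chosen so that the two "notes" entries point in a similar direction.
store = [
    ("linear regression notes", [0.9, 0.1, 0.0]),
    ("course logistics", [0.0, 0.2, 0.9]),
    ("gradient descent notes", [0.8, 0.3, 0.1]),
]
print(top_k([1.0, 0.2, 0.0], store, k=2))
```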

## Creating the RetrievalQA chain

In [17]:
from langchain.chains import RetrievalQA

In [18]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="refine",
    retriever=docs_vectordb.as_retriever()
)

In [22]:
# Querying qa_chain:

question = "What are the major topics for this class?"

result = qa_chain({"query": question})



In [23]:
result['result']

'The new context provides additional clues about the course content, allowing us to refine the initial answer. \n\n**Refined Major Topics:**\n\n* **Machine Learning:** Confirmed as the core focus.\n* **Linear Regression:**  Specifically mentioned as a major topic.\n* **Gradient Descent:**  A key optimization algorithm used in machine learning, covered in detail.\n* **Normal Equations:**  An alternative approach to solving linear regression problems.\n* **Linear Algebra:**  Fundamental to the course, with a dedicated review session. \n* **Convex Optimization:** Likely to be important given the focus on gradient descent (a convex optimization algorithm). \n* **Programming:**  While not explicitly mentioned, practical application is implied through assignments.\n\n**Removed/Uncertain Topics:**\n\n* **Hidden Markov Models:**  No longer mentioned, so their inclusion is uncertain. \n\n**Overall:**\n\nThe new context confirms and expands upon our initial understanding. The course heavily emph

In [24]:
question_2 = "In this class talking about neural networks ?"

result_2 = qa_chain({"query": question_2})

In [26]:
result_2

{'query': 'In this class talking about neural networks ?',
 'result': 'The new context, while providing information about the course structure and MATLAB, **doesn\'t offer any further clues about whether neural networks will be specifically covered.** \n\nThe excerpt focuses on:\n\n* **Student feedback:**  The anecdote about MATLAB highlights its usefulness but doesn\'t reveal anything about the course content itself.\n* **Discussion section purpose:** The description emphasizes reviewing prerequisites (probability, statistics, algebra) and covering potential extensions to the main lecture material. \n\nSince neural networks aren\'t mentioned in either context, we can\'t definitively say if they are part of the curriculum.  The original answer still stands: it\'s possible neural networks are included, especially given the emphasis on "state-of-the-art" algorithms, but we don\'t have enough information to be certain. \n'}
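Phrases such as "The new context provides..." in these answers come from the `refine` chain type, which drafts an answer from the first retrieved chunk and then asks the model to revise it once for each remaining chunk. A rough sketch of that loop, assuming at least one chunk is retrieved; the prompt wording and the `fake_llm` stand-in are invented for illustration and are not LangChain's actual prompts:

```python
def refine_answer(question, documents, llm):
    """Draft an answer from the first document, then revise it once per remaining document."""
    first, *rest = documents
    answer = llm(f"Context: {first}\nQuestion: {question}\nAnswer:")
    for doc in rest:
        answer = llm(
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"New context: {doc}\n"
            "Refine the existing answer if the new context helps; otherwise repeat it."
        )
    return answer


calls = []


def fake_llm(prompt):
    """Stand-in for a real model: records the prompt and returns a numbered draft."""
    calls.append(prompt)
    return f"draft {len(calls)}"


final = refine_answer("What are the major topics?", ["chunk A", "chunk B", "chunk C"], fake_llm)
print(final)
print(len(calls))  # one model call per retrieved chunk
```

Because only the last revision is returned, the final text can read like a commentary on the previous draft, as in the outputs above.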

## Combining RetrievalQA with PromptTemplate

This technique steers the LLM toward more accurate and better-structured answers, which improves the user experience.

In [27]:
from langchain.prompts import PromptTemplate

In [30]:
template = """Use the following pieces of {context} to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Question: {question}

Helpful Answer:
"""

qa_prompt_template = PromptTemplate.from_template(template)

In [34]:
qa_chain_template = RetrievalQA.from_chain_type(
    llm = llm,
    chain_type = "stuff",
    retriever = docs_vectordb.as_retriever(),
    return_source_documents = True, # Also return the source documents behind the answer.
    chain_type_kwargs = {"prompt": qa_prompt_template}
)
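With `chain_type = "stuff"`, all retrieved chunks are placed into a single prompt and the model is called once. A plain-Python sketch of that step, using `str.format` in place of `PromptTemplate`; the template wording here is illustrative and puts the context on its own line, and `build_stuff_prompt` is a hypothetical name:

```python
TEMPLATE = (
    "Use the following pieces of context to answer the question at the end.\n"
    "{context}\n\n"
    "Question: {question}\n\n"
    "Helpful Answer:"
)


def build_stuff_prompt(question, documents, template=TEMPLATE):
    """Join every retrieved chunk into one context string and fill the template."""
    context = "\n\n".join(documents)
    return template.format(context=context, question=question)


prompt = build_stuff_prompt(
    "Is this class about machine learning?",
    ["Welcome to CS229, the machine learning class.", "Today we go over the logistics."],
)
print(prompt)
```

This is cheaper than `refine` (one call instead of one per chunk), but the combined chunks must fit in the model's context window.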

In [35]:
question_3 = "In this class talking about machine learning ?"

result_3 = qa_chain_template({"query": question_3})

In [38]:
result_3

{'query': 'In this class talking about machine learning ?',
 'result': 'The transcript provided is from a machine learning class, and the instructor explicitly states he will be talking about machine learning. So the answer is yes. \n',
 'source_documents': [Document(metadata={'page': 0, 'source': 'https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf'}, page_content="MachineLearning-Lecture01  \nInstructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine \nlearning class. So what I wanna do today is ju st spend a little time going over the logistics \nof the class, and then we'll start to  talk a bit about machine learning.  \nBy way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so \nI personally work in machine learning, and I' ve worked on it for about 15 years now, and \nI actually think that machine learning is th e most exciting field of all the computer \nsciences. So I'm actually always excited 

In [39]:
# Check the output when the source documents are not returned.

qa_chain_template = RetrievalQA.from_chain_type(
    llm = llm,
    chain_type = "stuff",
    retriever = docs_vectordb.as_retriever(),
    return_source_documents = False,
    chain_type_kwargs = {"prompt": qa_prompt_template}
)

In [40]:
question_3 = "In this class talking about machine learning ?"

result_3 = qa_chain_template({"query": question_3})

In [41]:
result_3

{'query': 'In this class talking about machine learning ?',
 'result': 'Helpful Answer: Yes, the class is about machine learning. \n'}

## Creating a chat

In [42]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

In [43]:
from langchain.chains import ConversationalRetrievalChain

retriever=docs_vectordb.as_retriever()

qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=retriever,
    memory=memory
)

In [44]:
question = "Is probability a class topic?"
result = qa({"question": question})

In [45]:
result

{'question': 'Is probability a class topic?',
 'chat_history': [HumanMessage(content='Is probability a class topic?'),
  AIMessage(content="Yes, probability is a class topic. The instructor mentions that the discussion sections will cover prerequisites like probability, especially for students who haven't encountered it recently or need a refresher. \n")],
 'answer': "Yes, probability is a class topic. The instructor mentions that the discussion sections will cover prerequisites like probability, especially for students who haven't encountered it recently or need a refresher. \n"}

In [46]:
question = "why are those prerequesites needed?"
result = qa({"question": question})

In [47]:
result

{'question': 'why are those prerequesites needed?',
 'chat_history': [HumanMessage(content='Is probability a class topic?'),
  AIMessage(content="Yes, probability is a class topic. The instructor mentions that the discussion sections will cover prerequisites like probability, especially for students who haven't encountered it recently or need a refresher. \n"),
  HumanMessage(content='why are those prerequesites needed?'),
  AIMessage(content='The instructor states that a basic understanding of probability and statistics is needed because students will need to know things like "what random variables are", "what expectation is", and "what a variance of a random variable is". \n')],
 'answer': 'The instructor states that a basic understanding of probability and statistics is needed because students will need to know things like "what random variables are", "what expectation is", and "what a variance of a random variable is". \n'}
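The second question only makes sense because of the first one ("those prerequisites"). As far as I understand the chain, it handles this by first rewriting a follow-up into a standalone question using the chat history, then retrieving and answering as before. A sketch of one turn under that assumption; every function below is a stand-in written for this example, not a LangChain API:

```python
def chat_turn(question, history, condense, retrieve, answer):
    """Run one turn: rewrite the question using the history, retrieve, answer, record the turn."""
    standalone = condense(question, history) if history else question
    documents = retrieve(standalone)
    reply = answer(standalone, documents)
    history.append((question, reply))
    return reply


# Stand-ins for the LLM and the retriever, so the flow can be followed without any API call.
def fake_condense(question, history):
    previous_question, _ = history[-1]
    return f"{question} (follow-up to: {previous_question})"


def fake_retrieve(query):
    return [f"chunk matching '{query}'"]


def fake_answer(query, documents):
    return f"answer to '{query}' using {len(documents)} chunk(s)"


history = []
chat_turn("Is probability a class topic?", history, fake_condense, fake_retrieve, fake_answer)
second = chat_turn("why are those prerequisites needed?", history, fake_condense, fake_retrieve, fake_answer)
print(second)
print(len(history))
```

Note that retrieval still runs on every turn; the memory supplies the conversational context, not a cache of earlier results.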