# ***RAG System with a Self-Querying Retriever in LangChain for Movie Search***

This Colab is based on an article by Ed Izaguirre published on Medium.

[Click here to read the full article](https://towardsdatascience.com/how-to-build-a-rag-system-with-a-self-querying-retriever-in-langchain-16b4fa23e9ad)


## Introduction

In this presentation, we explore how to build a movie search system using RAG (Retrieval-Augmented Generation) with a Self-Querying Retriever, implemented with the LangChain library. The idea grew out of a common frustration: how hard it is to find movies that match exactly what we are looking for when all we can search by is a title or an actor's name. The goal is to let users run more natural and specific searches, such as “recommend comedy films with zombies” or “find short dramas in English featuring pets”, without relying on recommendations based on viewing history.

The system we are going to build combines the simplicity and privacy of natural-language search with the efficiency of metadata filtering. It first filters the movies by the criteria the user specifies, such as genre or release date, and then runs a similarity search to find titles that match the desired tone or content. Throughout this Colab we build the system step by step, with all the code you need to implement and customize this solution in your own projects.
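
The two-stage idea described above (filter by metadata, then rank by similarity) can be sketched in plain Python. This is only an illustration: the movie titles and the three-dimensional vectors are made up, and `search` is a hypothetical helper, not part of LangChain or Pinecone.

```python
import math

# Toy catalogue: titles and vectors are invented for illustration, not real embeddings.
MOVIES = [
    {"title": "Zombie Laughs", "genre": "Comedy", "year": 2019, "vector": [0.9, 0.1, 0.0]},
    {"title": "Quiet Harbor", "genre": "Drama", "year": 2021, "vector": [0.1, 0.9, 0.2]},
    {"title": "Undead Office", "genre": "Comedy", "year": 2022, "vector": [0.8, 0.2, 0.1]},
    {"title": "Space Clowns", "genre": "Comedy", "year": 2016, "vector": [0.2, 0.1, 0.9]},
]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, genre, min_year, k=2):
    # Stage 1: keep only the movies that satisfy the metadata filter.
    candidates = [m for m in MOVIES if m["genre"] == genre and m["year"] >= min_year]
    # Stage 2: rank the survivors by similarity to the query vector.
    ranked = sorted(
        candidates,
        key=lambda m: cosine_similarity(query_vector, m["vector"]),
        reverse=True,
    )
    return [m["title"] for m in ranked[:k]]


results = search([1.0, 0.0, 0.0], genre="Comedy", min_year=2018)
print(results)
```

Because the filter runs first, a movie that is very similar to the query but fails the metadata criteria never reaches the ranking step.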

## A technical overview of the project

We use LangChain, a library for building Retrieval-Augmented Generation (RAG) systems, together with Pinecone, a vector database that supports fast similarity search. The core technique is a Self-Querying Retriever, which filters the movies by metadata before the similarity search runs. We also use OpenAI's GPT-4o and GPT-4o-mini models to build a recommendation system driven by natural-language queries. The data comes from The Movie Database (TMDB) API, which provides a catalogue of movies with the varied information needed to make searches specific. All of these components are integrated into a pipeline that offers movie search without requiring any user history.

# **Installing dependencies**

## Writing the requirements.txt file

In [1]:
%%writefile requirements.txt
aiohttp==3.9.3
aiosignal==1.3.1
altair==5.2.0
annotated-types==0.6.0
anyio==4.3.0
appnope==0.1.4
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asgiref==3.8.0
asttokens==2.4.1
async-lru==2.0.4
attrs==23.2.0
Babel==2.14.0
backoff==2.2.1
bcrypt==4.1.2
beautifulsoup4==4.12.3
bleach==6.1.0
blinker==1.7.0
build==1.1.1
cachetools==5.3.3
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
chroma-hnswlib==0.7.3
click==8.1.7
coloredlogs==15.0.1
comm==0.2.1
dataclasses-json==0.6.4
debugpy==1.8.1
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.14
distro==1.9.0
executing==2.0.1
fastapi==0.110.0
fastjsonschema==2.19.1
filelock==3.13.1
flatbuffers==24.3.7
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.2.0
gitdb==4.0.11
GitPython==3.1.42
google-auth==2.29.0
googleapis-common-protos==1.63.0
grpcio==1.62.1
h11==0.14.0
httpcore==1.0.4
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.21.4
humanfriendly==10.0
idna==3.6
importlib-metadata==6.11.0
importlib_resources==6.4.0
ipykernel==6.29.3
ipython==8.22.2
ipywidgets==8.1.2
iso-639==0.4.5
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.3
json5==0.9.22
jsonpatch==1.33
jsonpointer==2.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.9.0
jupyter-lsp==2.2.4
jupyter_client==8.6.0
jupyter_core==5.7.1
jupyter_server==2.13.0
jupyter_server_terminals==0.5.2
jupyterlab==4.1.4
jupyterlab_pygments==0.3.0
jupyterlab_server==2.25.3
jupyterlab_widgets==3.0.10
kubernetes==29.0.0
langchain==0.1.13
langchain-community==0.0.29
langchain-core==0.1.33
langchain-experimental==0.0.54
langchain-openai==0.0.8
langchain-pinecone==0.0.3
langchain-text-splitters==0.0.1
langsmith==0.1.23
lark==1.1.9
markdown-it-py==3.0.0
MarkupSafe==2.1.5
marshmallow==3.21.1
matplotlib-inline==0.1.6
mdurl==0.1.2
mistune==3.0.2
mmh3==4.1.0
monotonic==1.6
mpmath==1.3.0
multidict==6.0.5
mypy-extensions==1.0.0
nbclient==0.9.0
nbconvert==7.16.2
nbformat==5.9.2
neo4j==5.18.0
nest-asyncio==1.6.0
notebook==7.1.1
notebook_shim==0.2.4
numpy==1.26.4
oauthlib==3.2.2
onnxruntime==1.17.1
openai==1.13.3
opentelemetry-api==1.23.0
opentelemetry-exporter-otlp-proto-common==1.23.0
opentelemetry-exporter-otlp-proto-grpc==1.23.0
opentelemetry-instrumentation==0.44b0
opentelemetry-instrumentation-asgi==0.44b0
opentelemetry-instrumentation-fastapi==0.44b0
opentelemetry-proto==1.23.0
opentelemetry-sdk==1.23.0
opentelemetry-semantic-conventions==0.44b0
opentelemetry-util-http==0.44b0
orjson==3.9.15
overrides==7.7.0
packaging==23.2
pandas==2.2.1
pandocfilters==1.5.1
parso==0.8.3
pexpect==4.9.0
pillow==10.2.0
pinecone-client==3.1.0
platformdirs==4.2.0
posthog==3.5.0
prometheus_client==0.20.0
prompt-toolkit==3.0.43
protobuf==4.25.3
psutil==5.9.8
ptyprocess==0.7.0
pulsar-client==3.4.0
pure-eval==0.2.2
pyarrow==15.0.2
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==2.6.3
pydantic_core==2.16.3
pydeck==0.8.1b0
Pygments==2.17.2
PyPika==0.48.9
pyproject_hooks==1.0.0
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-json-logger==2.0.7
pytz==2024.1
PyYAML==6.0.1
pyzmq==25.1.2
qtconsole==5.5.1
QtPy==2.4.1
referencing==0.33.0
regex==2023.12.25
requests==2.31.0
requests-oauthlib==1.4.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.7.1
rpds-py==0.18.0
rsa==4.9
safetensors==0.4.2
Send2Trash==1.8.2
setuptools==68.2.2
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
soupsieve==2.5
SQLAlchemy==2.0.28
stack-data==0.6.3
starlette==0.36.3
streamlit==1.32.2
sympy==1.12
tabulate==0.9.0
tenacity==8.2.3
terminado==0.18.0
tiktoken==0.6.0
tinycss2==1.2.1
tokenizers==0.15.2
toml==0.10.2
toolz==0.12.1
tornado==6.4
tqdm==4.66.2
traitlets==5.14.1
transformers==4.38.2
typer==0.9.0
types-python-dateutil==2.8.19.20240106
typing-inspect==0.9.0
typing_extensions==4.10.0
tzdata==2024.1
uri-template==1.3.0
urllib3==2.2.1
uvicorn==0.29.0
uvloop==0.19.0
watchfiles==0.21.0
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.7.0
websockets==12.0
wheel==0.41.2
widgetsnbextension==4.0.10
wrapt==1.16.0
yarl==1.9.4
zipp==3.18.1

Writing requirements.txt


## Installation

In [2]:
!pip install -r requirements.txt

Collecting aiohttp==3.9.3 (from -r requirements.txt (line 1))
  Downloading aiohttp-3.9.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.4 kB)
Collecting altair==5.2.0 (from -r requirements.txt (line 3))
  Downloading altair-5.2.0-py3-none-any.whl.metadata (8.7 kB)
Collecting annotated-types==0.6.0 (from -r requirements.txt (line 4))
  Downloading annotated_types-0.6.0-py3-none-any.whl.metadata (12 kB)
Collecting anyio==4.3.0 (from -r requirements.txt (line 5))
  Downloading anyio-4.3.0-py3-none-any.whl.metadata (4.6 kB)
Collecting appnope==0.1.4 (from -r requirements.txt (line 6))
  Downloading appnope-0.1.4-py2.py3-none-any.whl.metadata (908 bytes)
Collecting arrow==1.3.0 (from -r requirements.txt (line 9))
  Downloading arrow-1.3.0-py3-none-any.whl.metadata (7.5 kB)
Collecting asgiref==3.8.0 (from -r requirements.txt (line 10))
  Downloading asgiref-3.8.0-py3-none-any.whl.metadata (9.3 kB)
Collecting asttokens==2.4.1 (from -r requirements.txt (line 11))
  Dow

# **Requesting the API keys**

In [17]:
from getpass import getpass

TMBD_API_KEY = getpass("Enter the TMDB key: ")
PINECONE_KEY = getpass("Enter the Pinecone key: ")
OPENAI_API_KEY = getpass("Enter the OpenAI key: ")
PINECONE_INDEX_NAME = "seminario"

Enter the TMDB key: ··········
Enter the Pinecone key: ··········
Enter the OpenAI key: ··········


# **utils.py**

The utils.py file contains helper functions that call the TMDB API to fetch movie IDs and movie details, and that write the results to CSV files.

The following information was extracted for each movie:

- Title
- Runtime (minutes)
- Language
- Overview
- Release Year
- Genre
- Keywords describing the film
- Actors
- Directors
- Places to stream
- Places to buy
- Places to rent
- List of Production Companies

In [4]:
import requests
import os
import csv
import time
from iso639 import languages

def get_id_list(api_key, year, max_retries=5):
    """
    Function to get list of IDs for all films made in {year}.

    parameters:
    api_key (str): API key for TMDB
    year (int): Year of interest

    returns:
    list of str: List of all movie ids in {year}
    """
    url = f'https://api.themoviedb.org/3/discover/movie?api_key={api_key}&primary_release_year={year}&include_video=false&language=en-US&sort_by=popularity.desc'

    movie_ids = []

    total_pages = 5  # 5 pages of ids = 100 movies
    for page in range(1, total_pages + 1):
        for i in range(max_retries):
            response = requests.get(url + f'&page={page}')
            if response.status_code == 429:
                # If the response was a 429, wait and then try again
                print(
                    f"Request limit reached. Waiting and retrying ({i+1}/{max_retries})")
                time.sleep(2 ** i)  # Exponential backoff

            else:
                # If the response was not a 429, continue
                response_dict = response.json()
                for film in response_dict['results']:
                    movie_ids.append(str(film['id']))
                break

    return movie_ids


def get_data(API_key, Movie_ID, max_retries=5):
    """
    Function to pull details of your film of interest in JSON format.

    parameters:
    API_key (str): Your API key for TMDB
    Movie_ID (str): TMDB id for film of interest

    returns:
    dict: JSON formatted dictionary containing all details of your film of
    interest
    """

    query = 'https://api.themoviedb.org/3/movie/' + Movie_ID + \
        '?api_key='+API_key + '&append_to_response=keywords,' + \
            'watch/providers,credits&language=en-US'
    for i in range(max_retries):
        response = requests.get(query)
        if response.status_code == 429:
            # If the response was a 429, wait and then try again
            print(
                f"Request limit reached. Waiting and retrying ({i+1}/{max_retries})")
            time.sleep(2 ** i)  # Exponential backoff
        else:
            response_dict = response.json()
            return response_dict


def write_file(filename, data):
    """
    Appends one row to the CSV file 'filename' with the details of a
    single film: title, runtime, language, overview, release year,
    genres, keywords, actors, directors, watch providers (stream, buy,
    rent) and production companies.

    parameters:
    filename (str): Path of the CSV file to append to
    data (dict): Python dictionary with JSON formatted details of film

    returns:
    None
    """
    csvFile = open(filename, 'a')
    csvwriter = csv.writer(csvFile)
    # Unpack the top-level fields of the response
    title = data['title']
    runtime = data['runtime']
    language_code = data['original_language']
    release_date = data['release_date']
    overview = data['overview']
    all_genres = data['genres']
    prod_companies = data['production_companies']

    # Parsing release date
    release_year = release_date.split('-')[0]

    # Converting language
    try:
        language = languages.get(alpha2=language_code).name
    except KeyError:
        language = 'Unknown'

    # Parsing genres
    genre_str = ""
    for genre in all_genres:
        genre_str += genre['name'] + ", "
    genre_str = genre_str[:-2]

    # Parsing keywords (remove non-English words)
    all_keywords = data['keywords']['keywords']
    keyword_str = ""
    for keyword in all_keywords:
        if is_english(keyword['name']):
            keyword_str += keyword['name'] + ", "
    if keyword_str == "":
        keyword_str = "None"
    else:
        keyword_str = keyword_str[:-2]

    # Parsing watch providers
    watch_providers = data['watch/providers']['results']
    stream_str, buy_str, rent_str = "", "", ""
    if 'US' in watch_providers:
        watch_providers = watch_providers['US']
        provider_strings = ['flatrate', 'buy', 'rent']
        for string in provider_strings:
            if string not in watch_providers:
                continue

            _str = ""

            for element in watch_providers[string]:
                _str += element['provider_name'] + ", "
            _str = _str[:-2] + " "

            if string == 'flatrate':
                stream_str += _str
            elif string == 'buy':
                buy_str += _str
            else:
                rent_str += _str

    credits = data['credits']
    actor_list, director_list = [], []

    # Parsing cast
    cast = credits['cast']
    NUM_ACTORS = 5
    for member in cast[:NUM_ACTORS]:
        actor_list.append(member["name"])

    # Parsing crew
    crew = credits['crew']
    for member in crew:
        if member['job'] == 'Director':
            director_list.append(member["name"])

    actor_str = ', '.join(list(set(actor_list)))
    director_str = ', '.join(list(set(director_list)))

    # Parsing production companies
    prod_str = ""
    for company in prod_companies:
        prod_str += company['name'] + ", "
    prod_str = prod_str[:-2]

    # # Adding Wikipedia summaries if available
    # wiki_wiki = wikipediaapi.Wikipedia(
    #     user_agent='FilmBot (ed.izaguirre@pm.me)',
    #     language='en',
    #     extract_format=wikipediaapi.ExtractFormat.WIKI
    # )

    # p_wiki = wiki_wiki.page(title)

    # if p_wiki.exists():
    #     # If wiki exists, append text
    #     wiki_summary = p_wiki.text
    # else:
    #     # Otherwise, append a blank string
    #     wiki_summary = ""

    result = [title, runtime, language, overview,
              release_year, genre_str, keyword_str,
              actor_str, director_str, stream_str,
              buy_str, rent_str, prod_str]

    # write data
    csvwriter.writerow(result)
    csvFile.close()


def is_english(s):
    # Heuristic: treats a string as English when it contains only ASCII characters
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
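
The retry loops in `get_id_list` and `get_data` above wait 1, 2, 4, 8 and 16 seconds between attempts and fall through silently once the retries run out. The sketch below isolates that exponential backoff pattern in a reusable form and raises an error when every attempt is rate limited. `call_with_backoff` and its `request_fn` argument are hypothetical names used for illustration; the API is simulated so the example runs without network access.

```python
import time


def call_with_backoff(request_fn, max_retries=5, sleep=time.sleep):
    """Call request_fn() until it returns a status other than 429.

    request_fn is a hypothetical zero-argument callable that returns
    (status_code, payload). Raises RuntimeError when retries run out.
    """
    for attempt in range(max_retries):
        status_code, payload = request_fn()
        if status_code != 429:
            return payload
        sleep(2 ** attempt)  # waits 1, 2, 4, 8, ... seconds
    raise RuntimeError(f"Still rate limited after {max_retries} attempts")


# Simulated API: rate limited twice, then succeeds.
responses = iter([(429, None), (429, None), (200, {"id": 157336})])
waits = []  # records the requested delays instead of actually sleeping
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Passing the sleep function as a parameter keeps the example fast and makes the delay schedule easy to inspect.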


# **pull_data.ipynb**

## Getting all movie IDs

In the article, the author pulls the top 100 movies of each year from 1920 to 2023. However, downloading the data from the TMDB API turned out to be slow, so for practical purposes we narrowed the range of years to 2014-2024.
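
To see how much the narrower range saves, the size of the download can be estimated with simple arithmetic. The sketch assumes 20 movies per page, which is implied by the "5 pages of ids = 100 movies" comment in `get_id_list`, and one details request per movie; `estimate` is a hypothetical helper.

```python
PAGES_PER_YEAR = 5     # total_pages in get_id_list
MOVIES_PER_PAGE = 20   # assumption implied by "5 pages of ids = 100 movies"


def estimate(first_year, last_year):
    """Return (years, movies, requests) for an inclusive range of years."""
    num_years = last_year - first_year + 1
    movies = num_years * PAGES_PER_YEAR * MOVIES_PER_PAGE
    # One discover request per page plus one details request per movie.
    requests_needed = num_years * PAGES_PER_YEAR + movies
    return num_years, movies, requests_needed


print(estimate(2014, 2024))  # range used in this Colab
print(estimate(1920, 2023))  # range used in the article
```

Under these assumptions the reduced range yields 11 files and 1100 movies, roughly a tenth of the requests the article's range would need.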

In [5]:
years = [2014, 2024]

YEARS = range(years[0], years[-1]+1)
CSV_HEADER = ['Title', 'Runtime (minutes)', 'Language', 'Overview',
              'Release Year', 'Genre', 'Keywords',
              'Actors', 'Directors', 'Stream', 'Buy', 'Rent',
              'Production Companies']

## Writing the movie data to CSV files

The functions from utils.py are called here.

In [6]:
import os

for year in YEARS:
    # Grab list of ids for all films made in {YEAR}
    movie_list = list(set(get_id_list(TMBD_API_KEY, year)))

    FILE_PATH = './data/'
    FILE_NAME = f'{FILE_PATH}{year}_movie_collection_data.csv'

    if not os.path.exists(FILE_PATH):
        os.makedirs(FILE_PATH)

    # Creating file
    with open(FILE_NAME, 'w') as f:
        writer = csv.writer(f)
        writer.writerow(CSV_HEADER)

    # Iterate through list of ids to get data
    for id in movie_list:
        data_dict = get_data(TMBD_API_KEY, id)
        write_file(FILE_NAME, data_dict)

{'page': 1, 'results': [{'adult': False, 'backdrop_path': '/xJHokMbljvjADYdit5fK5VQsXEG.jpg', 'genre_ids': [12, 18, 878], 'id': 157336, 'original_language': 'en', 'original_title': 'Interstellar', 'overview': 'The adventures of a group of explorers who make use of a newly discovered wormhole to surpass the limitations on human space travel and conquer the vast distances involved in an interstellar voyage.', 'popularity': 184.317, 'poster_path': '/gEU2QniE6E77NI6lCU6MxlNBvIx.jpg', 'release_date': '2014-11-05', 'title': 'Interstellar', 'video': False, 'vote_average': 8.44, 'vote_count': 35038}, {'adult': False, 'backdrop_path': '/9vAoubhoZE8aSkUZoSfxs3UWZhO.jpg', 'genre_ids': [53, 28, 80], 'id': 156022, 'original_language': 'en', 'original_title': 'The Equalizer', 'overview': 'McCall believes he has put his mysterious past behind him and dedicated himself to beginning a new, quiet life. But when he meets Teri, a young girl under the control of ultra-violent Russian gangsters, he can’t st

# **rag_self_query.ipynb**

## Installing LangChain

In [7]:
!pip install langchain_community langchain_openai langchain_pinecone

Collecting langchain_community
  Downloading langchain_community-0.2.14-py3-none-any.whl.metadata (2.7 kB)
Collecting langchain_openai
  Downloading langchain_openai-0.1.23-py3-none-any.whl.metadata (2.6 kB)
Collecting langchain_pinecone
  Downloading langchain_pinecone-0.1.3-py3-none-any.whl.metadata (1.7 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting langchain<0.3.0,>=0.2.15 (from langchain_community)
  Downloading langchain-0.2.15-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core==0.2.36 (from langchain_community)
  Downloading langchain_core-0.2.36-py3-none-any.whl.metadata (6.2 kB)
Collecting langsmith<0.2.0,>=0.1.0 (from langchain_community)
  Downloading langsmith-0.1.106-py3-none-any.whl.metadata (13 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain-core==0.2.36->langchain_community)
  Using cached jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collectin

## LangChain imports

In [8]:
# Pinecone
from pinecone import Pinecone, PodSpec

# Langchain
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain.chains.query_constructor.base import (
    StructuredQueryOutputParser,
    get_query_constructor_prompt,
    AttributeInfo
)
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.retrievers.self_query.pinecone import PineconeTranslator
from langchain_openai import (
    ChatOpenAI,
    OpenAIEmbeddings
)
from langchain_pinecone import PineconeVectorStore
from langchain.indexes import SQLRecordManager, index
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# General
import os
from dotenv import load_dotenv
load_dotenv()

False

## Converting CSVs to Documents

The author explains that chunking is usually an important step in RAG systems, but here each document is a single row of a CSV file, which makes chunking unnecessary.

LangChain's *DirectoryLoader* loads the CSV files into documents, and we need to decide what becomes the main content (*page_content*) and what becomes metadata (*metadata*). The *page_content* is used in the similarity search, while the metadata is used for filtering before the search. In this project the *overview* and *keywords* attributes were set as *page_content*, and the remaining attributes were kept as metadata.
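
The split between *page_content* and metadata can be tried on a single row without loading any files. The sketch assumes that the CSV loader renders each row as one `column: value` line per column; the sample row is abbreviated and its overview text is invented, and `split_row` is a hypothetical helper.

```python
# Assumed shape of one loaded CSV row: one "column: value" line per column.
raw_page_content = (
    "Title: Interstellar\n"
    "Runtime (minutes): 169\n"
    "Overview: Explorers travel through a wormhole: a last hope for humanity\n"
    "Release Year: 2014\n"
    "Genre: Adventure, Drama, Science Fiction\n"
    "Keywords: wormhole, space travel"
)


def split_row(page_content, content_fields=("Overview", "Keywords")):
    """Separate the text used for similarity search from the filterable metadata."""
    fields = {}
    for line in page_content.split("\n"):
        if ": " in line:
            # Split on the first separator only, so colons inside the value survive.
            key, value = line.split(": ", 1)
            fields[key] = value
    text = ". ".join(f"{name}: {fields.get(name, '')}" for name in content_fields)
    metadata = {k: v for k, v in fields.items() if k not in content_fields}
    return text, metadata


text, metadata = split_row(raw_page_content)
# Type conversions so that list and numeric filters can work on the metadata.
metadata["Genre"] = [g.strip() for g in metadata["Genre"].split(",")]
metadata["Release Year"] = int(metadata["Release Year"])
print(text)
print(metadata)
```

Using `fields.get(name, '')` keeps the sketch from failing when a column is missing from a row.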

In [9]:
# Loading in data from all csv files
loader = DirectoryLoader(
    path="./data",
    glob="*.csv",
    loader_cls=CSVLoader,
    show_progress=True)

docs = loader.load()

metadata_field_info = [
    AttributeInfo(
        name="Title", description="The title of the movie", type="string"),
    AttributeInfo(name="Runtime (minutes)",
                  description="The runtime of the movie in minutes", type="integer"),
    AttributeInfo(name="Language",
                  description="The language of the movie", type="string"),
    AttributeInfo(name="Release Year",
                  description="The release year of the movie as an integer", type="integer"),
    AttributeInfo(name="Genre", description="The genre of the movie",
                  type="string or list[string]"),
    AttributeInfo(name="Actors", description="The actors in the movie",
                  type="string or list[string]"),
    AttributeInfo(name="Directors", description="The directors of the movie",
                  type="string or list[string]"),
    AttributeInfo(name="Stream", description="The streaming platforms for the movie",
                  type="string or list[string]"),
    AttributeInfo(name="Buy", description="The platforms where the movie can be bought",
                  type="string or list[string]"),
    AttributeInfo(name="Rent", description="The platforms where the movie can be rented",
                  type="string or list[string]"),
    AttributeInfo(name="Production Companies",
                  description="The production companies of the movie", type="string or list[string]"),
]

def convert_to_list(doc, field):
    if field in doc.metadata and doc.metadata[field] is not None:
        doc.metadata[field] = [item.strip()
                               for item in doc.metadata[field].split(',')]

def convert_to_int(doc, field):
    if field in doc.metadata and doc.metadata[field] is not None:
        doc.metadata[field] = int(
            doc.metadata[field])

fields_to_convert_list = ['Genre', 'Actors', 'Directors',
                          'Production Companies', 'Stream', 'Buy', 'Rent']
fields_to_convert_int = ['Runtime (minutes)', 'Release Year']

# Set 'overview' and 'keywords' as 'page_content' and other fields as 'metadata'
for doc in docs:
    # Parse the page_content string into a dictionary
    page_content_dict = {}

    for line in doc.page_content.split("\n"):
        if ": " in line:
            key, value = line.split(": ", 1)
            page_content_dict[key] = value


    doc.page_content = 'Overview: ' + page_content_dict.get(
        'Overview') + '. Keywords: ' + page_content_dict.get('Keywords')
    doc.metadata = {field.name: page_content_dict.get(
        field.name) for field in metadata_field_info}

    # Convert fields from string to list of strings
    for field in fields_to_convert_list:
        convert_to_list(doc, field)

    # Convert fields from string to integers
    for field in fields_to_convert_int:
        convert_to_int(doc, field)

100%|██████████| 11/11 [00:00<00:00, 80.60it/s]


In [None]:
print(docs[5])

page_content='Overview: Up-and-coming sports reporter rescues a homeless man ("Champ") only to discover that he is, in fact, a boxing legend believed to have passed away. What begins as an opportunity to resurrect Champ's story and escape the shadow of his father's success becomes a personal journey as the ambitious reporter reexamines his own life and his relationship with his family.. Keywords: sports' metadata={'Title': 'Resurrecting the Champ', 'Runtime (minutes)': 112, 'Language': 'English', 'Release Year': 2007, 'Genre': ['Drama'], 'Actors': ['Teri Hatcher', 'Samuel L. Jackson', 'Kathryn Morris', 'Alan Alda', 'Josh Hartnett'], 'Directors': ['Rod Lurie'], 'Stream': ['Amazon Prime Video', 'Peacock Premium', 'Amazon Prime Video with Ads', 'Peacock Premium Plus'], 'Buy': ['Apple TV', 'Amazon Video', 'FlixFling'], 'Rent': ['Apple TV', 'Amazon Video', 'FlixFling'], 'Production Companies': ['Battleplan Productions', 'Yari Film Group', 'Phoenix Pictures', 'Alberta Film Entertainment'], '

## Creating the Pinecone index and uploading Documents

Pinecone stores the documents in the cloud.

In [18]:
# Create empty index
pc = Pinecone(api_key=PINECONE_KEY)

# Uncomment if index is not created already
# pc.create_index(
#     name=PINECONE_INDEX_NAME,
#     dimension=1536,
#     metric="cosine",
#     spec=PodSpec(
#         environment="gcp-starter"
#     )
# )

# Target index and check status
pc_index = pc.Index(PINECONE_INDEX_NAME)
print(pc_index.describe_index_stats())

embeddings = OpenAIEmbeddings(model='text-embedding-ada-002', openai_api_key=OPENAI_API_KEY)

vectorstore = PineconeVectorStore(
    pc_index, embeddings
)

# Create record manager
namespace = f"pinecone/{PINECONE_INDEX_NAME}"
record_manager = SQLRecordManager(
    namespace, db_url="sqlite:///record_manager_cache.sql"
)

record_manager.create_schema()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 3459}},
 'total_vector_count': 3459}


The `index` function uploads the documents.
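
The record manager exists so that repeated uploads do not insert the same document twice. The sketch below shows that idea in a deliberately simplified form: it hashes each document and skips hashes it has already seen. `ToyRecordManager` is a hypothetical class written for this illustration and is not how LangChain implements its record manager; the sample documents are invented.

```python
import hashlib


class ToyRecordManager:
    """Hypothetical stand-in that remembers hashes of documents already uploaded."""

    def __init__(self):
        self.seen = set()

    def index(self, docs):
        stats = {"num_added": 0, "num_skipped": 0}
        for text, metadata in docs:
            payload = f"{text}|{sorted(metadata.items())}"
            key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
            if key in self.seen:
                stats["num_skipped"] += 1  # unchanged document: nothing to upload
            else:
                self.seen.add(key)
                stats["num_added"] += 1    # new or changed document: would be embedded and uploaded
        return stats


manager = ToyRecordManager()
batch = [
    ("Overview: zombies in an office", {"Title": "Undead Office"}),
    ("Overview: a quiet harbor town", {"Title": "Quiet Harbor"}),
]
first = manager.index(batch)
second = manager.index(batch + [("Overview: clowns in space", {"Title": "Space Clowns"})])
print(first)
print(second)
```

On the second call only the new document counts as added, which mirrors the `num_added` and `num_skipped` counters that the real `index` call reports.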

In [19]:
def _clear():
    """
    Hacky helper method to clear content.
    """
    index([], record_manager, vectorstore,
          cleanup="full", source_id_key="Title")

# Clears the Pinecone vectorstore before uploading; comment this line out to keep existing vectors
_clear()

# Upload documents to Pinecone
index(docs, record_manager, vectorstore, cleanup="full", source_id_key="Title")

{'num_added': 1100, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

## Creating the Self-Querying Retriever

This step runs before the similarity search. Here the movies are filtered on more "objective" data, such as the release year.
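
To make the filtering step concrete, the sketch below evaluates a Pinecone-style filter dictionary against the metadata of a few invented movies. It is a simplified illustration, not Pinecone's implementation: `matches` is a hypothetical helper, and it assumes that a list-valued field such as `Genre` satisfies `$eq` when any of its elements equals the target.

```python
import operator

NUMERIC_COMPARATORS = {
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
}


def matches(metadata, flt):
    """Return True when metadata satisfies a Pinecone-style filter dict (simplified)."""
    if "$and" in flt:
        return all(matches(metadata, sub) for sub in flt["$and"])
    if "$or" in flt:
        return any(matches(metadata, sub) for sub in flt["$or"])
    (field, condition), = flt.items()   # e.g. "Genre", {"$eq": "Comedy"}
    (op, target), = condition.items()   # e.g. "$eq", "Comedy"
    value = metadata.get(field)
    values = value if isinstance(value, list) else [value]
    if op == "$eq":
        return target in values
    if op == "$ne":
        return target not in values
    if op == "$in":
        return any(v in target for v in values)
    if op == "$nin":
        return all(v not in target for v in values)
    return value is not None and NUMERIC_COMPARATORS[op](value, target)


# Invented sample metadata in the same shape as the documents in this Colab.
movies = [
    {"Title": "Film A", "Genre": ["Comedy", "Horror"], "Release Year": 2022},
    {"Title": "Film B", "Genre": ["Comedy"], "Release Year": 2019},
    {"Title": "Film C", "Genre": ["Drama"], "Release Year": 2023},
]
flt = {"$and": [{"Genre": {"$eq": "Comedy"}}, {"Release Year": {"$gt": 2021}}]}
kept = [m["Title"] for m in movies if matches(m, flt)]
print(kept)
```

Only the movies that pass this check would go on to the similarity search.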

We define the allowed comparators along with example queries and their corresponding filters (a technique known as few-shot learning).
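
The example filters are written in a function-call style such as `and(eq('Genre', 'Comedy'), gt('Release Year', 2021))`, while the vector store expects comparators such as `$eq` and `$gt`. The sketch below shows one way that translation could look, using nested tuples to stand in for the parsed filter. It is a simplified illustration under that assumption, not the code of `PineconeTranslator`; `to_pinecone_filter` is a hypothetical helper.

```python
def to_pinecone_filter(node):
    """Translate a nested-tuple filter into a Pinecone-style dict (simplified sketch).

    Logical nodes look like ("and", [child, ...]);
    comparison nodes look like ("eq", "Genre", "Comedy").
    """
    kind = node[0]
    if kind in ("and", "or"):
        return {f"${kind}": [to_pinecone_filter(child) for child in node[1]]}
    comparator, attribute, value = node
    return {attribute: {f"${comparator}": value}}


parsed = ("and", [("eq", "Genre", "Comedy"), ("gt", "Release Year", 2021)])
pinecone_filter = to_pinecone_filter(parsed)
print(pinecone_filter)
```

Keeping the model-facing syntax separate from the store-specific syntax is what lets the same few-shot examples work with different vector stores.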

In [20]:
document_content_description = "Brief overview of a movie, along with keywords"

# Define allowed comparators list
allowed_comparators = [
    "$eq",  # Equal to (number, string, boolean)
    "$ne",  # Not equal to (number, string, boolean)
    "$gt",  # Greater than (number)
    "$gte",  # Greater than or equal to (number)
    "$lt",  # Less than (number)
    "$lte",  # Less than or equal to (number)
    "$in",  # In array (string or number)
    "$nin",  # Not in array (string or number)
]

examples = [
    (
        "I'm looking for a sci-fi comedy released after 2021.",
        {
            "query": "sci-fi comedy",
            "filter": "and(eq('Genre', 'Science Fiction'), eq('Genre', 'Comedy'), gt('Release Year', 2021))",
        },
    ),
    (
        "Show me critically acclaimed dramas without Tom Hanks.",
        {
            "query": "critically acclaimed drama",
            "filter": "and(eq('Genre', 'Drama'), nin('Actors', ['Tom Hanks']))",
        },
    ),
    (
        "Recommend some films by Yorgos Lanthimos.",
        {
            "query": "Yorgos Lanthimos",
            "filter": 'in("Directors", ["Yorgos Lanthimos"])',
        },
    ),
    (
        "Films similar to Yorgos Lanthimos movies.",
        {
            "query": "Dark comedy, absurd, Greek Weird Wave",
            "filter": 'NO_FILTER',
        },
    ),
    (
        "Find me thrillers with a strong female lead released between 2015 and 2020.",
        {
            "query": "thriller strong female lead",
            "filter": "and(eq('Genre', 'Thriller'), gte('Release Year', 2015), lte('Release Year', 2020))",
        },
    ),
    (
        "Find me highly rated drama movies in English that are less than 2 hours long",
        {
            "query": "Highly rated drama English under 2 hours",
            "filter": 'and(eq("Genre", "Drama"), eq("Language", "English"), lt("Runtime (minutes)", 120))',
        },
    ),
]

constructor_prompt = get_query_constructor_prompt(
    document_content_description,
    metadata_field_info,
    allowed_comparators=allowed_comparators,
    examples=examples,
)

query_model = ChatOpenAI(
    # model='gpt-3.5-turbo-0125',
    model='gpt-4o',
    temperature=0,
    streaming=True,
    openai_api_key=OPENAI_API_KEY
)

output_parser = StructuredQueryOutputParser.from_components()
query_constructor = constructor_prompt | query_model | output_parser

In [21]:
question = "Comedy films"
# question = "Find me thrillers with a strong female lead released between 2015 and 2020."
# print(constructor_prompt.format(query=question))
# print(type(constructor_prompt))

In [22]:
query_constructor.invoke(
    {
        "query": question
    }
)

StructuredQuery(query='comedy', filter=None, limit=None)

In [23]:
retriever = SelfQueryRetriever(
    query_constructor=query_constructor,
    vectorstore=vectorstore,
    structured_query_translator=PineconeTranslator(),
    search_kwargs={'k': 10}
)

retriever.invoke(question)

[Document(metadata={'Actors': ['Kevin Hart', 'Ed Helms', 'Thomas Middleditch', 'Nick Kroll', 'Jordan Peele'], 'Buy': ['Apple TV', 'Amazon Video', 'Google Play Movies', 'YouTube', 'Fandango At Home', 'Microsoft Store', 'AMC on Demand'], 'Directors': ['David Soren'], 'Genre': ['Action', 'Animation', 'Comedy', 'Family'], 'Language': 'English', 'Production Companies': ['DreamWorks Animation', 'Scholastic Entertainment'], 'Release Year': 2017.0, 'Rent': ['Apple TV', 'Amazon Video', 'Google Play Movies', 'YouTube', 'Fandango At Home', 'Microsoft Store'], 'Runtime (minutes)': 89.0, 'Stream': ['Netflix', 'Netflix basic with Ads'], 'Title': 'Captain Underpants: The First Epic Movie'}, page_content="Overview: Based on the bestselling book series, this outrageous comedy tells the story of George and Harold,  two overly imaginative pranksters who hypnotize their principal into thinking he’s an enthusiastic, yet dimwitted, superhero named Captain Underpants.. Keywords: friendship, based on novel or

## Creating the RAG Chain

After building the *self-querying retriever*, the next step is to build a standard RAG chain on top of it. We start by defining a chat model that receives a context (the retrieved movies plus the system message) and responds with a summary of each recommendation. A key part of this setup is the system message, which states the bot's goal and imposes rules, such as never recommending films that are not in the context supplied by the *self-querying retriever*. This keeps the model from recommending nonexistent or out-of-scope films, which makes the answers more reliable.
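
The system message asks the model to stay within the retrieved context, but a prompt instruction is not a guarantee. One possible safeguard, sketched below, is to check the generated answer against the retrieved titles after the fact. This check is not part of the original notebook: `unknown_titles` is a hypothetical helper, the sample answer is invented, and the regular expression assumes the answer follows the `- [Title of Film](source link):` template from the prompt.

```python
import re


def unknown_titles(answer, context_titles):
    """Return titles cited as '- [Title](link):' that are missing from the context."""
    cited = re.findall(r"^\s*-\s*\[([^\]]+)\]", answer, flags=re.MULTILINE)
    allowed = {title.casefold() for title in context_titles}
    return [title for title in cited if title.casefold() not in allowed]


# Invented model answer that follows the output template from the prompt.
answer = """
- [Captain Underpants: The First Epic Movie](link):
    - Runtime: 89
    - Release Year: 2017
    - A playful animated comedy about two pranksters.
- [Made-Up Sequel](link):
    - Runtime: 100
    - Release Year: 2020
    - This title is not in the retrieved context.
"""
context_titles = ["Captain Underpants: The First Epic Movie"]
suspicious = unknown_titles(answer, context_titles)
print(suspicious)
```

An empty list means every cited film came from the retriever; anything else could be logged or stripped before the answer is shown.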

The system message is written as a prompt template that instructs the model to recommend at most five films based on the context and never to suggest films the retriever did not find. The `format_docs` function arranges the movie information for the model by combining the *page_content* and the metadata. *rag_chain_from_docs* is a chain that formats the retrieved documents and passes them to the model to generate the answer. `rag_chain_with_source` is then created as a *RunnableParallel* that retrieves the relevant documents and passes the query through at the same time, and then attaches the model's answer to them. The final part of the code streams the response to the user as it is generated, similar to the experience in ChatGPT.
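
The streaming loop is easier to follow with a stand-in stream that needs no API keys. The sketch assumes the chain emits dictionary chunks keyed by `question`, `context` and `answer`, with the answer arriving token by token, which matches the printed output of this notebook; `fake_stream` and `collect` are hypothetical helpers.

```python
def fake_stream():
    """Hypothetical stand-in for rag_chain_with_source.stream(question)."""
    yield {"question": "Comedy films"}
    yield {"context": ["doc-1", "doc-2"]}
    for token in ["Here ", "are ", "two ", "comedies."]:
        yield {"answer": token}


def collect(stream):
    """Merge streamed chunks into one dictionary, concatenating repeated keys."""
    output = {}
    for chunk in stream:
        for key, value in chunk.items():
            # The first chunk for a key stores it; later chunks are appended.
            output[key] = output[key] + value if key in output else value
    return output


result = collect(fake_stream())
print(result)
```

The same accumulation works for strings and lists because both support `+`, which is why the notebook's loop can treat every key the same way.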

In [24]:
def format_docs(docs):
    return "\n\n".join(f"{doc.page_content}\n\nMetadata: {doc.metadata}" for doc in docs)

chat_model = ChatOpenAI(
    model='gpt-4o-mini',
    # model='gpt-4-0125-preview',
    temperature=0,
    streaming=True,
    openai_api_key=OPENAI_API_KEY
)

prompt = ChatPromptTemplate.from_messages(
    [
        (
            'system',
            """
            Your goal is to recommend films to users based on their
            query and the retrieved context. If a retrieved film doesn't seem
            relevant, omit it from your response. Never refer to films that
            are not in your context. If you cannot recommend any
            films, suggest better queries to the user. You cannot
            recommend more than five films. Your recommendation should
            be relevant, original, and at least two to three sentences
            long.

            YOU CANNOT RECOMMEND A FILM IF IT DOES NOT APPEAR IN YOUR
            CONTEXT.

            # TEMPLATE FOR OUTPUT
            - [Title of Film](source link):
                - Runtime:
                - Release Year:
                - (Your reasoning for recommending this film)

            Question: {question}
            Context: {context}
            """
        ),
    ]
)

# Create a chatbot Question & Answer chain from the retriever
rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | chat_model
    | StrOutputParser()
)

rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)


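# Inspect the structured query that the constructor builds for `question`
# (the result is not displayed because it is not the cell's last expression)
query_constructor.invoke(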
    {
        "query": question
    }
)
# Only prints final answer
# for chunk in rag_chain_with_source.stream(question):
#     for key in chunk:
#         if key == 'answer':
#             print(chunk[key], end="", flush=True)

# Prints every streamed key (question, context, answer) and accumulates the chunks in `output`
output = {}
curr_key = None
for chunk in rag_chain_with_source.stream(question):
    for key in chunk:
        if key not in output:
            output[key] = chunk[key]
        else:
            output[key] += chunk[key]
        if key != curr_key:
            print(f"\n\n{key}: {chunk[key]}", end="", flush=True)
        else:
            print(chunk[key], end="", flush=True)
        curr_key = key
output



question: Comedy films

context: [Document(metadata={'Actors': ['Kevin Hart', 'Ed Helms', 'Thomas Middleditch', 'Nick Kroll', 'Jordan Peele'], 'Buy': ['Apple TV', 'Amazon Video', 'Google Play Movies', 'YouTube', 'Fandango At Home', 'Microsoft Store', 'AMC on Demand'], 'Directors': ['David Soren'], 'Genre': ['Action', 'Animation', 'Comedy', 'Family'], 'Language': 'English', 'Production Companies': ['DreamWorks Animation', 'Scholastic Entertainment'], 'Release Year': 2017.0, 'Rent': ['Apple TV', 'Amazon Video', 'Google Play Movies', 'YouTube', 'Fandango At Home', 'Microsoft Store'], 'Runtime (minutes)': 89.0, 'Stream': ['Netflix', 'Netflix basic with Ads'], 'Title': 'Captain Underpants: The First Epic Movie'}, page_content="Overview: Based on the bestselling book series, this outrageous comedy tells the story of George and Harold,  two overly imaginative pranksters who hypnotize their principal into thinking he’s an enthusiastic, yet dimwitted, superhero named Captain Underpants.. Keyw

{'question': 'Comedy films',
 'context': [Document(metadata={'Actors': ['Kevin Hart', 'Ed Helms', 'Thomas Middleditch', 'Nick Kroll', 'Jordan Peele'], 'Buy': ['Apple TV', 'Amazon Video', 'Google Play Movies', 'YouTube', 'Fandango At Home', 'Microsoft Store', 'AMC on Demand'], 'Directors': ['David Soren'], 'Genre': ['Action', 'Animation', 'Comedy', 'Family'], 'Language': 'English', 'Production Companies': ['DreamWorks Animation', 'Scholastic Entertainment'], 'Release Year': 2017.0, 'Rent': ['Apple TV', 'Amazon Video', 'Google Play Movies', 'YouTube', 'Fandango At Home', 'Microsoft Store'], 'Runtime (minutes)': 89.0, 'Stream': ['Netflix', 'Netflix basic with Ads'], 'Title': 'Captain Underpants: The First Epic Movie'}, page_content="Overview: Based on the bestselling book series, this outrageous comedy tells the story of George and Harold,  two overly imaginative pranksters who hypnotize their principal into thinking he’s an enthusiastic, yet dimwitted, superhero named Captain Underpants.
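
The accumulation loop in the cell above can be tried offline. The sketch below assumes that each streamed chunk is a small dict holding one key, which matches the output shown; `accumulate_chunks` and `fake_stream` are hypothetical names used only for this illustration.

```python
def accumulate_chunks(chunks):
    """Merge streamed dict chunks key by key, as the notebook loop does."""
    output = {}
    for chunk in chunks:
        for key, value in chunk.items():
            if key not in output:
                output[key] = value      # first chunk for this key
            else:
                output[key] += value     # later chunks are appended
    return output


# Hypothetical chunks imitating the shape of rag_chain_with_source.stream(...)
fake_stream = [
    {"question": "Comedy films"},
    {"context": ["doc-1", "doc-2"]},
    {"answer": "- **Film A**"},
    {"answer": ": a light"},
    {"answer": " comedy."},
]

result = accumulate_chunks(fake_stream)
print(result["answer"])  # - **Film A**: a light comedy.
```

Because `+=` works for both strings and lists, the same loop handles the answer tokens and the list of retrieved documents.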

# **chat_app.py**

This step reuses the earlier work: all of the RAG functions are gathered into a single class, `FilmSearch`, together with the functions that generate the responses.

In [25]:
# Langchain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.pinecone import PineconeTranslator
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain.chains.query_constructor.base import (
    StructuredQueryOutputParser,
    get_query_constructor_prompt,
)

# Pinecone
from pinecone import Pinecone

# General
import json
from dotenv import load_dotenv
import os


class FilmSearch:
    RETRIEVER_MODEL_NAME = "gpt-4o"
    SUMMARY_MODEL_NAME = "gpt-4o-mini"
    constructor_prompt = None
    vectorstore = None
    retriever = None
    rag_chain_with_source = None

    def __init__(self, openai_api_key, pinecone_api_key, pinecone_index_name):
        load_dotenv()
        self.initialize_query_constructor()
        self.initialize_vector_store(
            openai_api_key, pinecone_api_key, pinecone_index_name)
        self.initialize_retriever(openai_api_key)
        self.initialize_chat_model(openai_api_key)

    def initialize_query_constructor(self):
        document_content_description = "Brief overview of a movie, along with keywords"

        # Define allowed comparators list
        allowed_comparators = [
            "$eq",  # Equal to (number, string, boolean)
            "$ne",  # Not equal to (number, string, boolean)
            "$gt",  # Greater than (number)
            "$gte",  # Greater than or equal to (number)
            "$lt",  # Less than (number)
            "$lte",  # Less than or equal to (number)
            "$in",  # In array (string or number)
            "$nin",  # Not in array (string or number)
            "$exists",  # Has the specified metadata field (boolean)
        ]

        examples = [
            (
                "I'm looking for a sci-fi comedy released after 2021.",
                {
                    "query": "sci-fi comedy",
                    "filter": "and(eq('Genre', 'Science Fiction'), eq('Genre', 'Comedy'), gt('Release Year', 2021))",
                },
            ),
            (
                "Show me critically acclaimed dramas without Tom Hanks.",
                {
                    "query": "critically acclaimed drama",
                    "filter": "and(eq('Genre', 'Drama'), nin('Actors', ['Tom Hanks']))",
                },
            ),
            (
                "Recommend some films by Yorgos Lanthimos.",
                {
                    "query": "Yorgos Lanthimos",
                    "filter": 'in("Directors", ["Yorgos Lanthimos"])',
                },
            ),
            (
                "Films similar to Yorgos Lanthimos movies.",
                {
                    "query": "Dark comedy, absurd, Greek Weird Wave",
                    "filter": 'NO_FILTER',
                },
            ),
            (
                "Find me thrillers with a strong female lead released between 2015 and 2020.",
                {
                    "query": "thriller strong female lead",
                    "filter": "and(eq('Genre', 'Thriller'), gte('Release Year', 2015), lte('Release Year', 2020))",
                },
            ),
            (
                "Find me highly rated drama movies in English that are less than 2 hours long",
                {
                    "query": "Highly rated drama English under 2 hours",
                    "filter": 'and(eq("Genre", "Drama"), eq("Language", "English"), lt("Runtime (minutes)", 120))',
                },
            ),
        ]

        metadata_field_info = [
            AttributeInfo(
                name="Title", description="The title of the movie", type="string"),
            AttributeInfo(name="Runtime (minutes)",
                          description="The runtime of the movie in minutes", type="integer"),
            AttributeInfo(name="Language",
                          description="The language of the movie", type="string"),
            AttributeInfo(name="Release Year",
                          description="The release year of the movie", type="integer"),
            AttributeInfo(name="Genre", description="The genre of the movie",
                          type="string or list[string]"),
            AttributeInfo(name="Actors", description="The actors in the movie",
                          type="string or list[string]"),
            AttributeInfo(name="Directors", description="The directors of the movie",
                          type="string or list[string]"),
            AttributeInfo(name="Stream", description="The streaming platforms for the movie",
                          type="string or list[string]"),
            AttributeInfo(name="Buy", description="The platforms where the movie can be bought",
                          type="string or list[string]"),
            AttributeInfo(name="Rent", description="The platforms where the movie can be rented",
                          type="string or list[string]"),
            AttributeInfo(name="Production Companies",
                          description="The production companies of the movie", type="string or list[string]"),
        ]

        self.constructor_prompt = get_query_constructor_prompt(
            document_content_description,
            metadata_field_info,
            allowed_comparators=allowed_comparators,
            examples=examples,
        )

    def initialize_vector_store(self, open_ai_key, pinecone_api_key, pinecone_index_name):
        pc = Pinecone(api_key=pinecone_api_key)

        # Target index and check status
        pc_index = pc.Index(pinecone_index_name)

        embeddings = OpenAIEmbeddings(model='text-embedding-ada-002',
                                      api_key=open_ai_key)

        self.vectorstore = PineconeVectorStore(
            pc_index, embeddings
        )

    def initialize_retriever(self, open_ai_key):
        query_model = ChatOpenAI(
            model=self.RETRIEVER_MODEL_NAME,
            temperature=0,
            streaming=True,
            api_key=open_ai_key
        )

        output_parser = StructuredQueryOutputParser.from_components()
        query_constructor = self.constructor_prompt | query_model | output_parser

        self.retriever = SelfQueryRetriever(
            query_constructor=query_constructor,
            vectorstore=self.vectorstore,
            structured_query_translator=PineconeTranslator(),
            search_kwargs={'k': 10}
        )

    def initialize_chat_model(self, open_ai_key):
        def format_docs(docs):
            return "\n\n".join(f"{doc.page_content}\n\nMetadata: {doc.metadata}" for doc in docs)

        chat_model = ChatOpenAI(
            model=self.SUMMARY_MODEL_NAME,
            temperature=0,
            streaming=True,
            api_key=open_ai_key
        )

        prompt = ChatPromptTemplate.from_messages(
            [
                (
                    'system',
                    """
                    Your goal is to recommend films to users based on their
                    query and the retrieved context. If a retrieved film doesn't seem
                    relevant, omit it from your response. If your context is empty
                    or none of the retrieved films are relevant, do not recommend films, but instead
                    tell the user you couldn't find any films that match their query.
                    Aim for three to five film recommendations, as long as the films are relevant. You cannot
                    recommend more than five films. Your recommendation should
                    be relevant, original, and at least two to three sentences
                    long.

                    YOU CANNOT RECOMMEND A FILM IF IT DOES NOT APPEAR IN YOUR
                    CONTEXT.

                    # TEMPLATE FOR OUTPUT
                    - **Title of Film**:
                        - Runtime:
                        - Release Year:
                        - Streaming:
                        - (Your reasoning for recommending this film)

                    Question: {question}
                    Context: {context}
                    """
                ),
            ]
        )

        # Create a chatbot Question & Answer chain from the retriever
        rag_chain_from_docs = (
            RunnablePassthrough.assign(
                context=(lambda x: format_docs(x["context"])))
            | prompt
            | chat_model
            | StrOutputParser()
        )

        self.rag_chain_with_source = RunnableParallel(
            {"context": self.retriever, "question": RunnablePassthrough()}
        ).assign(answer=rag_chain_from_docs)

    def ask(self, query: str):
        try:
            for chunk in self.rag_chain_with_source.stream(query):
                for key in chunk:
                    if key == 'answer':
                        yield chunk[key]
        except Exception as e:
            print(f"An error occurred: {e}")


# **Client**

Here is an example of how a user could submit a query.

Example input: A movie like Interstellar

In [26]:
def generate_response(input_text, openai_api_key):
    # Use the key passed in; the Pinecone settings come from the notebook's globals
    chat = FilmSearch(openai_api_key, PINECONE_KEY, PINECONE_INDEX_NAME)

    # Start with an empty string to accumulate the chunks
    full_answer = ""

    # Iterate over the streamed chunks and append each one to the string
    for chunk in chat.ask(input_text):
        full_answer += chunk

    # Print the complete answer at the end
    print("Full Answer:", full_answer)

text = input("Which kind of movie are you looking for? ")

generate_response(text, openai_api_key=OPENAI_API_KEY)

Which kind of movie are you looking for? A movie like Interstellar
Full Answer: - **Title of Film**: Interstellar
    - Runtime: 169 minutes
    - Release Year: 2014
    - Streaming: Amazon Prime Video, Epix Amazon Channel, Paramount Plus, and more
    - Interstellar is a must-watch for fans of science fiction and space exploration. Directed by Christopher Nolan, it delves into themes of time manipulation, family relationships, and the vastness of the universe. The film's stunning visuals and thought-provoking narrative make it a perfect companion to your interest in space-themed films.

- **Title of Film**: Project 'Gemini'
    - Runtime: 98 minutes
    - Release Year: 2022
    - Streaming: Amazon Prime Video
    - This film explores the challenges faced by a crew stranded on an alien planet, echoing the themes of survival and exploration found in Interstellar. With its focus on the unknown and the dangers of space travel, Project 'Gemini' offers a thrilling experience for those intrig