# Cohere Document Search with LlamaIndex

This example shows how to use the Python [LlamaIndex](https://docs.llamaindex.ai/en/stable/) library to run a text-generation request against [Cohere's](https://cohere.com/) API, then augment that request using the text stored in a collection of local PDF documents.

**Requirements:**
- You will need a Cohere API key, which you can sign up for at https://dashboard.cohere.com/welcome/login. A free trial account will suffice, but it is limited to a small number of requests.
- After obtaining the key, store it as plain text in the `~/.cohere.key` file in your home directory.
- (Optional) Upload some PDF files into the `source_documents` subfolder under this notebook. We have already provided some sample PDFs, but feel free to replace these with your own.

## Install Dependencies

In [1]:
%pip install --quiet pymupdf

Note: you may need to restart the kernel to use updated packages.


Perform a fresh installation of `llama-index-core` and its integration packages. The new version of LlamaIndex introduces breaking changes relative to the version installed on the Vector cluster.

In [2]:
%pip uninstall --quiet llama-index llama-index-core llama-index-llms-cohere llama-index-llms-litellm llama-index-readers-file llama-index-embeddings-cohere -y
%pip install --quiet llama-index llama-index-core llama-index-llms-cohere llama-index-llms-litellm llama-index-readers-file llama-index-embeddings-cohere

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


## Preprocessing

In [3]:
# load source document
import os
import fitz # imports the pymupdf library

source_doc_path = './S1_PDFs/Facebook S-1.pdf'
source_doc = fitz.open(source_doc_path)

In [4]:
# Truncate it to the first 10 pages -- This greatly improves the speed & accuracy of the model when retrieving the table of contents

truncated_dir = './S1_PDFs/Truncated'       # dir to save truncated files to
os.makedirs(truncated_dir, exist_ok=True)   # create dir if it does not exist

truncated_doc = fitz.open()
truncated_doc.insert_pdf(source_doc, from_page=0, to_page=9)

In [5]:
filename = os.path.basename(source_doc_path)
filename, ext = os.path.splitext(filename)

truncated_filename = f"{filename} - Truncated - TOC{ext}"
truncated_out_path = os.path.join(truncated_dir, truncated_filename)  # path to save the truncated file

truncated_doc.save(truncated_out_path)

print(f"{len(source_doc)} pages - Original PDF doc")
print(f"{len(truncated_doc)} pages - Truncated PDF doc")
print(f"Saved truncated doc at: {truncated_out_path}")

198 pages - Original PDF doc
10 pages - Truncated PDF doc
Saved truncated doc at: ./S1_PDFs/Truncated/Facebook S-1 - Truncated - TOC.pdf


### Extract the Table of Contents from Truncated Doc using Cohere Command-R Model

In [6]:
# Load Cohere API Key
import os
from pathlib import Path
try:
    # Read the key once and expose it under both environment variable names
    cohere_api_key = (Path.home() / ".cohere.key").read_text().strip()
    os.environ["COHERE_API_KEY"] = cohere_api_key
    os.environ["CO_API_KEY"] = cohere_api_key
except OSError:
    print("ERROR: You must have a Cohere API key available in your home directory at ~/.cohere.key")

In [7]:
# llama_index.llms.cohere does not support command-r model -- Use LiteLLM instead
from llama_index.llms.litellm import LiteLLM
llm = LiteLLM("command-r")

In [10]:
from llama_index.core import SimpleDirectoryReader
reader = SimpleDirectoryReader(input_files=[truncated_out_path])
documents = reader.load_data()  # get truncated Facebook S-1 document

In [12]:
from llama_index.embeddings.cohere import CohereEmbedding
embed_model = CohereEmbedding(
    model_name="embed-english-v3.0",
    input_type="search_query"
)

In [13]:
from llama_index.core import ServiceContext
service_context = ServiceContext.from_defaults(
    embed_model=embed_model,
    llm=llm,
    chunk_size=200
)


In [15]:
from llama_index.core import VectorStoreIndex
truncated_index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)

Parsing nodes:   0%|          | 0/10 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/123 [00:00<?, ?it/s]
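
The progress output above suggests that the 10 pages were split into 123 smaller chunks before embedding, which is the effect of `chunk_size=200`. As a rough, non-authoritative sketch of that idea, the hypothetical `chunk_words` helper below splits text into fixed-size chunks of whitespace-separated words. LlamaIndex itself counts tokens and tries to respect sentence boundaries, so its real chunks will differ from these.

```python
def chunk_words(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words.

    Illustrative only: this is not how LlamaIndex splits documents.
    """
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# Made-up 450-word text standing in for one page of a PDF.
toy_text = " ".join(f"word{i}" for i in range(450))
toy_chunks = chunk_words(toy_text, chunk_size=200)

# Two full chunks of 200 words, then a final chunk with the remaining 50.
print([len(chunk.split()) for chunk in toy_chunks])
```

Smaller chunks make each retrieved passage more focused, at the cost of producing more embeddings per document.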

In [None]:
# Get the section after 'Risk Factors'

## Set up the RAG workflow environment

In [None]:
from getpass import getpass
import os
from pathlib import Path

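# Import paths follow the llama-index-core package layout used earlier in this notebook
from llama_index.core import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.llms.cohere import Cohere
# Requires the llama-index-postprocessor-cohere-rerank package
from llama_index.postprocessor.cohere_rerank import CohereRerank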

Set up some helper functions:

In [None]:
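def pretty_print_docs(docs):
    # Print each document or retrieved node, separated by a horizontal rule.
    # get_content() is the LlamaIndex text accessor (page_content is a LangChain attribute).
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.get_content() for i, d in enumerate(docs)]
        )
    )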

Make sure other necessary items are in place:

In [None]:
try:
    # Read the key once and expose it under both environment variable names
    cohere_api_key = (Path.home() / ".cohere.key").read_text().strip()
    os.environ["COHERE_API_KEY"] = cohere_api_key
    os.environ["CO_API_KEY"] = cohere_api_key
except OSError:
    print("ERROR: You must have a Cohere API key available in your home directory at ~/.cohere.key")

# Look for the source_documents folder and make sure it contains at least one PDF file
directory_path = "./source_documents"
if not os.path.isdir(directory_path):
    # Report the missing folder instead of letting os.listdir raise FileNotFoundError
    print(f"ERROR: The {directory_path} subfolder must exist under this notebook")
elif not any(name.lower().endswith(".pdf") for name in os.listdir(directory_path)):
    print(f"ERROR: The {directory_path} subfolder must contain at least one .pdf file")

## Start with a basic generation request without RAG augmentation

Let's start by asking the Cohere LLM a difficult, domain-specific question we don't expect it to have an answer to. A simple question like "*What is the capital of France?*" is not a good question here, because that's basic knowledge that we expect the LLM to know.

Instead, we want to ask a very domain-specific question that it won't know the answer to. A good example would be an obscure detail buried deep within a company's annual report. For example:

"*How many Vector scholarships in AI were awarded in 2022?*"

In [None]:
query = "How many Vector scholarships in AI were awarded in 2022?"

## Now send the query to Cohere

In [None]:
llm = Cohere(api_key=os.environ["COHERE_API_KEY"])
result = llm.complete(query)
print(f"Result: \n\n{result}")

Without additional information, Cohere is unable to answer the question correctly. **Vector in fact awarded 109 AI scholarships in 2022.** Fortunately, we do have that information available in Vector's 2021-22 Annual Report, which is available in the `source_documents` folder. Let's see how we can use RAG to augment our question with a document search and get the correct answer.
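
Before wiring this up with LlamaIndex, it may help to see what "augmenting" a question means in its simplest form: the retrieved text is placed in front of the question in a single prompt. The sketch below is only an illustration. The `build_augmented_prompt` helper, the prompt wording, and the placeholder chunks are all assumptions, not what LlamaIndex or Cohere use internally.

```python
def build_augmented_prompt(question, context_chunks):
    """Combine retrieved text chunks and the user question into one prompt."""
    # Number each chunk so the individual passages stay distinguishable.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder strings standing in for text retrieved from the PDFs.
example_chunks = ["<retrieved chunk 1>", "<retrieved chunk 2>"]
example_prompt = build_augmented_prompt(
    "How many Vector scholarships in AI were awarded in 2022?", example_chunks
)
print(example_prompt)
```

The remaining sections build the pieces that supply those chunks: loading the documents, embedding them, and retrieving the best matches for the query.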

## Ingestion: Load and store the documents from source_documents

Start by reading in all the PDF files from `source_documents`.

In [None]:
# Load the pdfs
pdf_folder_path = "./source_documents"
documents = SimpleDirectoryReader(pdf_folder_path).load_data()
print(f"Number of source materials: {len(documents)}\n")

## Define an embeddings model

This embeddings model will convert the textual data from our PDF files into vector embeddings. These vector embeddings will later enable us to quickly find the chunk of text that most closely corresponds to our original query.

In [None]:
embed_model = CohereEmbedding(
    model_name="embed-english-v3.0",
    input_type="search_query"
)
service_context = ServiceContext.from_defaults(
    embed_model=embed_model,
    llm=llm,
    chunk_size=200
)
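
To make the idea of "closest" embeddings concrete, here is a minimal sketch that compares made-up three-dimensional vectors using cosine similarity. Cosine similarity is one common choice of measure; whether the index in this notebook uses exactly this measure is an assumption, and real embeddings have far more dimensions than these toy vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of three text chunks.
toy_chunk_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.6, 0.8, 0.0],
}
# Made-up vector standing in for the embedded query.
toy_query_vector = [0.0, 1.0, 0.0]

scores = {
    name: cosine_similarity(vector, toy_query_vector)
    for name, vector in toy_chunk_vectors.items()
}
# The chunk whose vector points in the direction most similar to the query.
best_chunk = max(scores, key=scores.get)
print(best_chunk, scores)
```

Here `chunk_b` points in exactly the same direction as the query, so it scores highest, with `chunk_c` a close second.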

## Storage: Store the documents in a vector database

In [None]:
index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)

## Retrieval: Now do a search to retrieve the chunk of document text that most closely matches our original query

In [None]:
search_query_retriever = index.as_retriever(service_context=service_context)
search_query_retrieved_nodes = search_query_retriever.retrieve(query)
print(f"Search query retriever found {len(search_query_retrieved_nodes)} results")
print(f"First result example:\n{search_query_retrieved_nodes[0]}\n")

That first result doesn't look quite right, but it's close. Could it be that the retrieval did return the passage we wanted, just not in first place? Let's use a reranker to check which of our results is the closest match.

## Reranking: Improve the ordering of the document chunks

In [None]:
reranker = CohereRerank()
query_engine = index.as_query_engine(
    node_postprocessors = [reranker]
)
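
Conceptually, a reranker takes the candidates returned by the first search, scores each one against the query with a second relevance measure, and re-sorts them. The sketch below shows only that re-sorting step. The `rerank` and `word_overlap` helpers and the toy candidates are hypothetical, and the word-overlap score is a crude stand-in for the model that Cohere's reranker actually applies.

```python
def rerank(query, candidates, score_fn, top_n=2):
    """Re-order candidates by score_fn(query, text), best first, and keep top_n."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # Sort on the score only; ties keep their original retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def word_overlap(query, text):
    """Toy relevance score: number of distinct query words that appear in text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# Made-up candidates, listed in the order a first-pass retriever might return them.
toy_candidates = [
    "the annual report cover page",
    "scholarships awarded in 2022",
    "board of directors",
]
reranked = rerank("scholarships awarded 2022", toy_candidates, word_overlap)
print(reranked)
```

The second candidate moves to the front because it shares the most words with the query, which mirrors how a reranker can promote a relevant chunk that the initial search placed lower.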

## Final RAG-augmented query

In [None]:
result = query_engine.query(query)
print(f"Result: {result}\n\n")