# Section Retrieval On Single Sample Document

This notebook builds on the *Cohere Document Search with LlamaIndex* notebook (<code>document_search_cohere_llamaindex.ipynb</code>) to implement section extraction on a single sample document.

**High-level workflow:**
1. Use RAG to retrieve the document's Table Of Contents and extract a list of the document's sections.
2. Use the <code>PyMuPDF</code> library to extract the body of a section, given its title and the title of the section that follows it.
3. Chunk the section body into paragraphs and create embeddings.
4. Perform RAG (e.g., summarization) using the section body embeddings.

**Requirements:**
- A Cohere API key stored in plain text in the `~/.cohere.key` file in your home directory.
- A PDF file uploaded into the `S1_PDFs` subfolder next to this notebook, with its path assigned to the **source_doc_path** variable.

## Install Dependencies

In [1]:
%pip install --quiet pymupdf

Note: you may need to restart the kernel to use updated packages.


Perform a fresh installation of llama-index-core and its integration packages. The new version of LlamaIndex introduces breaking changes relative to the version on the Vector Cluster.

In [2]:
%pip uninstall --quiet llama-index llama-index-core llama-index-llms-cohere llama-index-llms-litellm llama-index-readers-file llama-index-embeddings-cohere llama-index-postprocessor-cohere-rerank -y
%pip install --quiet llama-index llama-index-core llama-index-llms-cohere llama-index-llms-litellm llama-index-readers-file llama-index-embeddings-cohere llama-index-postprocessor-cohere-rerank

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


## Preprocessing

In [3]:
# load source document
import os
import fitz # imports the pymupdf library

source_doc_path = './S1_PDFs/Facebook S-1.pdf'
source_doc = fitz.open(source_doc_path)

In [4]:
# Truncate it to the first 10 pages -- This greatly improves speed & accuracy of the model retrieving the table of contents

truncated_dir = './S1_PDFs/Truncated'                                        # dir to save truncated files to
os.makedirs(truncated_dir, exist_ok=True)                                    # create dir if it does not exist

truncated_doc = fitz.open()
truncated_doc.insert_pdf(source_doc, from_page=0, to_page=9)

In [5]:
filename = os.path.basename(source_doc_path)
filename, ext = os.path.splitext(filename)

truncated_filename = f"{filename} - Truncated - TOC{ext}"
truncated_out_path = os.path.join(truncated_dir, truncated_filename)  # path to save the truncated file

truncated_doc.save(truncated_out_path)

print(f"{len(source_doc)} pages - Original PDF doc")
print(f"{len(truncated_doc)} pages - Truncated PDF doc")
print(f"Saved truncated doc at: {truncated_out_path}")

198 pages - Original PDF doc
10 pages - Truncated PDF doc
Saved truncated doc at: ./S1_PDFs/Truncated/Facebook S-1 - Truncated - TOC.pdf


### Extract the Table of Contents from Truncated Doc using Cohere Command-R Model

In [6]:
# Load Cohere API Key
import os
from pathlib import Path
try:
    # read the key once and expose it under both environment variable names
    cohere_api_key = (Path.home() / ".cohere.key").read_text().strip()
    os.environ["COHERE_API_KEY"] = cohere_api_key
    os.environ["CO_API_KEY"] = cohere_api_key
except OSError:
    print("ERROR: You must have a Cohere API key available in your home directory at ~/.cohere.key")

In [7]:
# llama_index.llms.cohere does not support command-r model -- Use LiteLLM instead
from llama_index.llms.litellm import LiteLLM
llm = LiteLLM(
    model="command-r",
    temperature=0
)

In [8]:
from llama_index.core import SimpleDirectoryReader
reader = SimpleDirectoryReader(input_files=[truncated_out_path])
documents = reader.load_data()  # get truncated Facebook S-1 document

In [9]:
from llama_index.embeddings.cohere import CohereEmbedding
embed_model = CohereEmbedding(
    model_name="embed-english-v3.0",
    input_type="search_query"
)

In [10]:
from llama_index.core import ServiceContext
service_context = ServiceContext.from_defaults(
    embed_model=embed_model,
    llm=llm,
    chunk_size=500
)


In [11]:
from llama_index.core import VectorStoreIndex
truncated_index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)

Parsing nodes:   0%|          | 0/10 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/20 [00:00<?, ?it/s]

In [12]:
from llama_index.postprocessor.cohere_rerank import CohereRerank
cohere_rerank = CohereRerank()

In [13]:
query_engine = truncated_index.as_query_engine(
    node_postprocessors=[cohere_rerank],
)

In [14]:
response = query_engine.query(
    """
    What are the sections that constitute the 'table of contents' of the document?
    Create a Python list, where each element of the list is a section from the document's 'table of contents',
    in the order they occur within the table of contents.
    
    Your output should just be the Python list, nothing else.
    """
)

In [15]:
response.response

"['Prospectus Summary', 'Risk Factors', 'Special Note Regarding Forward-Looking Statements', 'Industry Data and User Metrics', 'Use of Proceeds', 'Dividend Policy', 'Capitalization', 'Dilution', 'Selected Consolidated Financial Data', 'Management’ s Discussion and Analysis of Financial Condition and Results of Operations', 'Letter from Mark Zuckerber g', 'Business', 'Management', 'Executive Compensation', 'Related Party Transactions', 'Principal and Selling Stockholders', 'Description of Capital Stock', 'Shares Eligible for Future Sale', 'Material U.S. Federal Tax Considerations for Non-U.S. Holders of Class A Common Stock', 'Underwriting', 'Legal Matters', 'Experts', 'Where You Can Find Additional Information', 'Index to Consolidated Financial Statements']"

In [16]:
import ast    # Built-in Abstract Syntax Trees module - can convert a string containing a list literal into a list object
document_sections = ast.literal_eval(response.response)

In [17]:
document_sections

['Prospectus Summary',
 'Risk Factors',
 'Special Note Regarding Forward-Looking Statements',
 'Industry Data and User Metrics',
 'Use of Proceeds',
 'Dividend Policy',
 'Capitalization',
 'Dilution',
 'Selected Consolidated Financial Data',
 'Management’ s Discussion and Analysis of Financial Condition and Results of Operations',
 'Letter from Mark Zuckerber g',
 'Business',
 'Management',
 'Executive Compensation',
 'Related Party Transactions',
 'Principal and Selling Stockholders',
 'Description of Capital Stock',
 'Shares Eligible for Future Sale',
 'Material U.S. Federal Tax Considerations for Non-U.S. Holders of Class A Common Stock',
 'Underwriting',
 'Legal Matters',
 'Experts',
 'Where You Can Find Additional Information',
 'Index to Consolidated Financial Statements']
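
The `ast.literal_eval` step above works because the model returned a bare list. Models sometimes wrap the list in extra text or code fences, so a slightly more defensive parser can help. The following is a minimal sketch; `parse_list_response` is a hypothetical helper that is not part of this notebook or of any library used here.

```python
import ast
import re


def parse_list_response(text: str) -> list[str]:
    """Extract a Python list of strings from an LLM reply (hypothetical helper)."""
    # Take everything from the first '[' to the last ']' so that surrounding
    # sentences or code fences are ignored.
    match = re.search(r"\[.*\]", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No list literal found in the model response")
    # literal_eval only evaluates literals, so it is safe on untrusted text.
    parsed = ast.literal_eval(match.group(0))
    if not isinstance(parsed, list) or not all(isinstance(s, str) for s in parsed):
        raise ValueError("Expected a list of strings")
    # Trim stray whitespace around each section title.
    return [s.strip() for s in parsed]
```

With this helper, `document_sections = parse_list_response(response.response)` would replace the direct `ast.literal_eval` call.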

### Get the Section Body

In [18]:
from utils import get_item_after    # helper functions defined in utils.py

target_section = 'Risk Factors'
next_section = get_item_after(document_sections, target_section)

if next_section is not None:
    print(f"Target section:\t{target_section}\nSection after:\t{next_section}")

Target section:	Risk Factors
Section after:	Special Note Regarding Forward-Looking Statements
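
The source of `get_item_after` lives in `utils.py` and is not shown here. Based only on how it is called above, it plausibly behaves like the sketch below; the name `get_item_after_sketch` and the choice to return `None` for a missing or final item are assumptions.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def get_item_after_sketch(items: Sequence[T], target: T) -> Optional[T]:
    """Return the element following the first occurrence of `target`.

    Returns None if `target` is absent or is the last element (assumed behavior).
    """
    try:
        idx = items.index(target)
    except ValueError:
        return None
    return items[idx + 1] if idx + 1 < len(items) else None
```

Note that the last section in the table of contents has no successor, so a caller would need another way to mark where that section ends, such as the end of the document.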


In [19]:
# Get words from PDF doc so that they can be parsed by the PyMuPDF package
# Set start_page=X to begin retrieving words at page X in the doc -- A higher start page means fewer words to retrieve & parse
# Get words in lowercase format to help with searching for substrings

from utils import get_words_from_PDF
lowercase_words = get_words_from_PDF(doc=source_doc, start_page=10, lowercase=True)

print(f"{len(lowercase_words)} total words retrieved from document")

101027 total words retrieved from document
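
`get_words_from_PDF` is also defined in `utils.py`, which is not reproduced in this notebook. A minimal sketch of how such a helper could be written with PyMuPDF is shown below. It assumes `start_page` is a 0-based page index and relies on `page.get_text("words")`, which returns one tuple per word with the word text at index 4. The function name `get_words_sketch` is hypothetical.

```python
def get_words_sketch(doc, start_page: int = 0, lowercase: bool = False) -> list[str]:
    """Collect the words of `doc` from page index `start_page` onward.

    `doc` is expected to behave like a PyMuPDF document: it supports len(doc)
    and doc[i], and each page offers get_text("words").
    """
    words = []
    for page_no in range(start_page, len(doc)):
        # Each entry has the form (x0, y0, x1, y1, word, block_no, line_no, word_no).
        for entry in doc[page_no].get_text("words"):
            word = entry[4]
            words.append(word.lower() if lowercase else word)
    return words
```

Starting past the first pages also keeps the table of contents itself out of the word list, so section titles listed there are not picked up when searching for section boundaries.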


In [20]:
# example of some of the retrieved words -- notice all lowercase
lowercase_words[14:60]

['according',
 'to',
 'an',
 'industry',
 'source,',
 'total',
 'worldwide',
 'advertising',
 'spending',
 'in',
 '2010',
 'was',
 '$588',
 'billion.',
 'our',
 'addressable',
 'market',
 'opportunity',
 'includes',
 'portions',
 'of',
 'many',
 'existing',
 'advertising',
 'markets,',
 'including',
 'the',
 'traditional',
 'offline',
 'branded',
 'advertising,',
 'online',
 'display',
 'advertising,',
 'online',
 'performance-based',
 'advertising,',
 'and',
 'mobile',
 'advertising',
 'markets.',
 'advertising',
 'on',
 'the',
 'social',
 'web']

In [21]:
from utils import get_section

# search for section titles in lowercase format -- helps handle formatting differences
start_str = target_section.lower()
end_str = next_section.lower()

body = get_section(
    start_str=start_str,
    end_str=end_str,
    words=lowercase_words,
)

if body:    # only runs if start_str & end_str were found
    print(f"Section: {target_section}")
    print(f"Body: ({len(body.split(' '))} words)\n{body}")

Section: Risk Factors
Body: (16822 words)
our business is subject to numerous risks described in the section entitled “risk factors” and elsewhere in this prospectus. you should carefully consider these risks before making an investment. some of these risks include: • if we fail to retain existing users or add new users, or if our users decrease their level of engagement with facebook, our revenue, financial results, and business may be significantly harmed; • we generate a substantial majority of our revenue from advertising. the loss of advertisers, or reduction in spending by advertisers with facebook, could seriously harm our business; • growth in use of facebook through our mobile products, where we do not currently display ads, as a substitute for use on personal computers may negatively affect our revenue and financial results; • facebook user growth and engagement on mobile devices depend upon effective operation with mobile operating systems, networks, and standards that we do
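
The implementation of `get_section` is in `utils.py` and is not shown. One way such a helper could work, given the call signature used above, is to locate the title words as consecutive tokens in the word list and return everything between the two titles. The sketch below is an assumption about the approach, not the actual helper; `find_phrase` and `get_section_sketch` are hypothetical names.

```python
from typing import Optional, Sequence


def find_phrase(words: Sequence[str], phrase: Sequence[str], start: int = 0) -> int:
    """Return the index where `phrase` first appears in `words` at or after `start`, or -1."""
    n = len(phrase)
    for i in range(start, len(words) - n + 1):
        if list(words[i:i + n]) == list(phrase):
            return i
    return -1


def get_section_sketch(start_str: str, end_str: str, words: Sequence[str]) -> Optional[str]:
    """Return the text between the first `start_str` and the next `end_str`, or None."""
    start_tokens = start_str.split()
    end_tokens = end_str.split()
    start_idx = find_phrase(words, start_tokens)
    if start_idx == -1:
        return None
    body_start = start_idx + len(start_tokens)
    # Only look for the closing title after the opening title.
    end_idx = find_phrase(words, end_tokens, body_start)
    if end_idx == -1:
        return None
    return " ".join(words[body_start:end_idx])
```

A first-match strategy like this is sensitive to titles that also appear in running text or page headers, so the extracted body should be checked against the source document.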

### Chunk Section Body into Paragraphs and Create Embeddings
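
As a starting point for this step, the sketch below splits the section body into overlapping fixed-size word windows. This is a stand-in for paragraph chunking: the body extracted above is a single space-joined string with no paragraph breaks, so true paragraph boundaries would need layout information from the PDF. The function name `chunk_words` and the default sizes are assumptions chosen for illustration.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split `text` into chunks of up to `chunk_size` words, sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("Require chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk could then be wrapped in a LlamaIndex `Document` and indexed with `VectorStoreIndex.from_documents`, in the same way the truncated document was indexed earlier in this notebook.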

### Perform RAG using the Section Body Embeddings