# Financial Document Parsers
Financial documents are often distributed as `.pdf` files. This notebook explores how best to parse them.

Some methodologies to try:
- Simple Python PDF parsers
- Unstructured.io
- LlamaIndex (using Unstructured.io)
- LLM (sending the PDF directly to an LLM)

## Simple Python PDF Parser
https://pypdf.readthedocs.io/en/stable/index.html

**Warning**: this didn't work on financial-statement PDFs, possibly because they are too large. As an alternative, use `yfinance` for numerical information.

In [78]:
from pypdf import PdfReader
dir_path = "../../data/NFLX/equity-research-report"
# file_name = "2024-11-20-NFLX.OQ-Pivotal Research Gro-NFLX Raising YE 25 Target $175 To Street High $1,100-111692279.pdf"
file_name = "2022-11-15-NFLX.OQ-BofA Global Research-Netflix, Inc. Full stream ahead - reinstate Netflix at Buy ...-99226234.pdf"
# file_name = "2022-11-17-NFLX.OQ-EVERCORE ISI-Better With Ads – First Look At BWA-99248969.pdf"
# file_name = "2022-02-08-Credit Suisse-Global TMT Sector Metaverse A guide to the next gen intern...-95432537.PDF"
pdf_path = f"{dir_path}/{file_name}"
reader = PdfReader(pdf_path)

In [79]:
# metadata
meta = reader.metadata  # can itself be None when the PDF carries no document info

# Any of the following fields could be None!
if meta is not None:
    print("Title:", meta.title)
    print("Author:", meta.author)
    print("Subject:", meta.subject)
    print("Creator:", meta.creator)
    print("Producer:", meta.producer)
    print("Creation date:", meta.creation_date)
    print("Modification date:", meta.modification_date)

Title: Netflix, Inc.
Author: Jessica Reif Ehrlich
Subject: Reinstatement of Coverage
Creator: Aspose Ltd.
Producer: macOS 版本12.5（版号21G72） Quartz PDFContext, AppendMode 1.1
Creation date: 2022-11-15 10:54:50+00:00
Modification date: 2023-01-10 01:44:13+00:00


In [81]:
# extract text

# NOTE: use the default .extract_text() rather than "layout" mode. Layout mode extracts text
# line by line (literally), which interleaves paragraphs that sit side by side in financial documents.
pdf_pages = []
for page in reader.pages:
    extracted_text = page.extract_text()
    pdf_pages.append(extracted_text)
    # print("====NEW PAGE====")
    # print(extracted_text)
    # print(page.extract_text(extraction_mode="layout"))
    # print(page.extract_text(extraction_mode="layout", layout_mode_space_vertically=False)) # excludes inferred blank line
    # print(page.extract_text(extraction_mode="layout", layout_mode_scale_weight=1.0))
    # print(page.extract_text(extraction_mode="layout", layout_mode_strip_rotated=False))
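Because the warning above notes that extraction failed on some financial-statement PDFs, it can help to find out which page breaks instead of losing the whole document. The following is a minimal sketch, assuming only that `reader` exposes `.pages` and that each page has `.extract_text()`, as in the loop above. The helper name `extract_pages_safely` is hypothetical and not part of pypdf.

```python
from typing import Any, List, Optional, Tuple


def extract_pages_safely(reader: Any) -> Tuple[List[Optional[str]], List[Tuple[int, str]]]:
    """Extract text page by page, recording failures instead of aborting.

    `reader` is assumed to expose `.pages`, each with `.extract_text()`.
    This helper is hypothetical and not part of pypdf.
    """
    texts: List[Optional[str]] = []
    failures: List[Tuple[int, str]] = []
    for index, page in enumerate(reader.pages):
        try:
            texts.append(page.extract_text())
        except Exception as exc:  # broad on purpose: the goal is only to locate the bad page
            texts.append(None)  # keep page positions aligned with the source PDF
            failures.append((index, repr(exc)))
    return texts, failures
```

Calling `pdf_pages, failures = extract_pages_safely(reader)` would then give the same page list as above, with `None` in place of any page that raised, plus the zero-based indices and error messages of the failing pages.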

### Post processing
Post-process the extracted text either with plain `strip` or with an LLM.

#### Using `strip`

In [83]:
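for page in pdf_pages:
    processed_lines = []
    for line in page.split("\n"):
        new_line = line.strip()
        # skip lines that are empty after stripping
        if not new_line:
            continue
        processed_lines.append(new_line)
    # join the remaining lines of the page into a single string
    print("Pages: ", " ".join(processed_lines))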


Pages:  BofA Securities does and seeks to do business with issuers covered in its research reports. As a result, investors should be aware that the firm may have a conflict of interest that could affect the objectivity of this report. Investors should consider this report as only a single factor in making their investment decision. Refer to important disclosures on page 35 to 37. Analyst Certification on page 33. Price Objective Basis/Risk on page 33.  12487356 Netflix, Inc. Full stream ahead - reinstate Netflix at Buy with a $370 PO Reinstating Coverage: BUY | PO: 370.00 USD | Price: 299.27 USD Reinstating on Netflix at Buy with a $3 70 PO We are reinstating coverage of Netflix (NFLX) with a Buy rating and a 12-month price objective (PO) of $370 (+24% upside potential). We value the shares using 20.5x EV/EBITDA multiple to our CY24E EBITDA estimate, which is supported by our DCF. Our valuation accounts for Netflix’s leading position within the still burgeoning shift towards non-linear
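The output above still contains doubled spaces (for example before `12487356`), since `strip` only trims the ends of each line. A small reusable variant of the loop, sketched here with a hypothetical helper name `clean_page_text`, also collapses whitespace runs inside lines:

```python
import re


def clean_page_text(page_text: str) -> str:
    """Collapse one extracted page into a single whitespace-normalized string."""
    # Strip each line and drop the empty ones, as in the loop above.
    lines = [line.strip() for line in page_text.split("\n")]
    joined = " ".join(line for line in lines if line)
    # Collapse runs of spaces or tabs that the extractor left inside lines.
    return re.sub(r"\s+", " ", joined)
```

It could be applied as `[clean_page_text(page) for page in pdf_pages]`. It does not repair tokens that the extractor split apart, such as `$3 70`.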

#### Using LLM

In [None]:
# ...
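One possible shape for the LLM route, offered as a sketch rather than the approach this notebook settles on: build a cleanup prompt per page and pass it to a model. Everything here is an assumption: the prompt wording, the helper name `clean_pages_with_llm`, and the `call_llm` parameter, which stands in for whatever client function takes a prompt string and returns the model's reply. No specific LLM library is assumed.

```python
from typing import Callable, List

# Hypothetical prompt wording; adjust to the model in use.
CLEANUP_PROMPT = (
    "The text below was extracted from a PDF page of a financial report. "
    "Fix broken words and spacing, keep all numbers unchanged, "
    "and return only the cleaned text.\n\n"
    "{page_text}"
)


def clean_pages_with_llm(pages: List[str], call_llm: Callable[[str], str]) -> List[str]:
    """Send each non-empty page through `call_llm` and collect the replies.

    `call_llm` is a placeholder for any function that maps a prompt string
    to the model's response string.
    """
    cleaned = []
    for page_text in pages:
        if not page_text.strip():
            cleaned.append("")  # nothing to clean, so skip the model call
            continue
        cleaned.append(call_llm(CLEANUP_PROMPT.format(page_text=page_text)))
    return cleaned
```

Keeping the model call behind a plain callable makes it easy to swap providers, or to test the loop with a stub function before spending tokens on real pages.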

## Unstructured.io / LlamaIndex

In [None]:
# ...