# USA Company Valuation with ChatGPT - Part 1

## Goal
We want to be able to evaluate USA companies for investing purposes.

The goal is to create a tool that will help us retrieve all necessary measures to be used in the Valuation model and give us a summary of the most important information that an investor should be aware of.

Our Valuation model is built upon the principles taught by Prof. Damodaran in his Valuation Course, available for free on YouTube.
https://www.youtube.com/watch?v=LYGYvN5LUbA&list=PLUkh9m2BorqnhWfkEP2rRdhgpYKLS-NOJ

We extract data from the U.S. Securities and Exchange Commission (SEC) website. The SEC's Electronic Data Gathering, Analysis and Retrieval (EDGAR) database provides free public access to U.S. corporate information, allowing us to quickly research a company's financial and operational information.
https://www.sec.gov/edgar/search-and-access

We also gather data from Prof. Damodaran's website: https://pages.stern.nyu.edu/~adamodar/

## <a class="anchor" id="toc">Table of Contents:</a>
This project is structured in the following way:
1. [Data Collection](#1-bullet)
2. [Qualitative Analysis](#2-bullet) (Leverage OpenAI models to make sense of reports information)
3. [Quantitative Analysis](#3-bullet) (Valuation model based on financial data)
4. [Visualization](#4-bullet) (Company valuation Visualizations in Tableau and PowerBI)
5. [Conclusions and Next steps](#5-bullet)

## <a class="anchor" id="1-bullet" href="#toc">1. Data Collection</a>

### MongoDB
We are going to use MongoDB to store annual reports, financial data, and our processed data.

We have the following MongoDB collections:
- **cik_ticker**: contains a single document with the mapping between CIK (Central Index Key, the id of a company on EDGAR) and ticker on the exchange.
- **submissions**: contains multiple documents, 1 for each company, with the list of all submissions the company has made.
- **documents**: contains multiple documents, 1 for each SEC filing. Each document contains the raw HTML of the report.
- **financial_data**: contains multiple documents, 1 for each company. Each document contains the whole history of financial data of a single company.
- **parsed_documents**: contains multiple documents, 1 for each filing. Each document contains a parsed version of the filing, where the text is split into sections corresponding to the SEC filing items.
- **items_summary**: contains multiple documents, 1 for each filing. Each document contains a summary of the most important sections of an SEC filing.

### PostgreSQL
In PostgreSQL we are going to store data from Damodaran's website and Yahoo Finance.

Here is a brief list of the files we use from Damodaran:
- Damodaran
    - country_stats:
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/countrystats.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/ctrypremJan23.xlsx
        - https://www.stern.nyu.edu/~adamodar/pc/datasets/countrytaxrates.xls
        - https://pages.stern.nyu.edu/~adamodar/New_Home_Page/datafile/ctryprem.html
    - erp:
        - https://pages.stern.nyu.edu/~adamodar/pc/implprem/ERPbymonth.xlsx
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/histimpl.xls
    - bond_spread:
        - https://pages.stern.nyu.edu/~adamodar/New_Home_Page/datafile/ratings.html
    - industry:
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pedata.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/peEurope.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/peJapan.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/peemerg.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/peChina.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/peIndia.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/peGlobal.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/peRest.xls

        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pbvdata.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pbvEurope.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pbvJapan.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pbvemerg.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pbvChina.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pbvIndia.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pbvGlobal.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pbvRest.xls

        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/psdata.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/psEurope.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/psJapan.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/psemerg.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/psChina.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/psIndia.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/psGlobal.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/psRest.xls

        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/vebitda.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/vebitdaEurope.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/vebitdaJapan.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/vebitdaemerg.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/vebitdaChina.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/vebitdaIndia.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/vebitdaGlobal.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/vebitdaRest.xls

        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/betas.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/betaEurope.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/betaJapan.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/betaemerg.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/betaChina.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/betaIndia.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/betaGlobal.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/betaRest.xls

        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/capex.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/capexEurope.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/capexJapan.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/capexemerg.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/capexChina.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/capexIndia.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/capexGlobal.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/capexRest.xls

        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/divfcfe.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/divfcfeEurope.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/divfcfeJapan.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/divfcfeemerg.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/divfcfeChina.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/divfcfeIndia.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/divfcfeGlobal.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/divfcfeRest.xls

        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/margin.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/marginEurope.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/marginJapan.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/marginemerg.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/marginChina.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/marginIndia.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/marginGlobal.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/marginRest.xls

Since the process of retrieving, transforming and storing this data in PostgreSQL is out of the scope of this project, we are not going to describe it in this notebook.

### Project Setup
We are going to use various dependencies to collect data from EDGAR APIs (https://www.sec.gov/edgar/sec-api-documentation).
Here we are going to import everything we use in this notebook.

In [1]:
# move to root to simplify imports
%cd ..

C:\Users\matte\repo\tests\company_valuation


In [2]:
import requests
import pandas as pd
import datetime
import time
from dateutil.relativedelta import relativedelta
from pymongo.errors import DocumentTooLarge
import mongodb # a utility script containing interface methods to a MongoDB instance

Then we define various utility methods used in our project.

In [3]:
def make_edgar_request(url):
    """
    Make a request to EDGAR (Electronic Data Gathering, Analysis and Retrieval)
    :param url: request URL
    :return: response
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
        "Accept-Encoding": "gzip, deflate, br",
    }
    return requests.get(url, headers=headers)

def download_cik_ticker_map():
    """
    Get a mapping between cik (Central Index Key, id of the company on EDGAR) and ticker on the exchange.
    Upserts the mapping into the MongoDB collection cik_ticker.
    """
    CIK_TICKER_URL = "https://www.sec.gov/files/company_tickers_exchange.json"
    response = make_edgar_request(CIK_TICKER_URL)
    r = response.json()
    r["_id"] = "cik_ticker"
    mongodb.upsert_document("cik_ticker", r)
    
def get_df_cik_ticker_map():
    """
    Create a DataFrame from cik ticker document on MongoDB.
    :return: DataFrame
    """
    try:
        cik_ticker = mongodb.get_collection_documents("cik_ticker").next()
    except StopIteration:
        print("cik ticker document not found")
        return
    df = pd.DataFrame(cik_ticker["data"], columns=cik_ticker["fields"])
    
    # add leading 0s to cik (always 10 digits)
    df["cik"] = df.apply(lambda x: add_trailing_to_cik(x["cik"]), axis=1)
    
    return df

def company_from_cik(cik):
    """
    Get company info from cik
    :param cik: company id on EDGAR
    :return: DataFrame row with company information (name, ticker, exchange)
    """
    df = get_df_cik_ticker_map()
    try:
        return df[df["cik"] == cik].iloc[0]
    except IndexError:
        return None
    
def cik_from_ticker(ticker):
    """
    Get company cik from ticker
    :param ticker: company ticker
    :return: cik (company id on EDGAR)
    """
    df = get_df_cik_ticker_map()
    try:
        cik = df[df["ticker"] == ticker]["cik"].iloc[0]
    except (IndexError, TypeError):  # ticker not found, or mapping not available
        cik = -1
    return cik

def download_all_cik_submissions(cik):
    """
    Get list of submissions for a single company.
    Upsert this list on MongoDB (each download contains all the submissions).
    :param cik: cik of the company
    :return:
    """
    url = f"https://data.sec.gov/submissions/CIK{cik}.json"
    response = make_edgar_request(url)
    r = response.json()
    r["_id"] = cik
    mongodb.upsert_document("submissions", r)
    
def download_submissions_documents(cik, forms_to_download=("10-Q", "10-K", "8-K"), years=5):
    """
    Download all documents for submissions forms 'forms_to_download' for the past 'years'.
    Insert them on mongodb.
    :param cik: company cik
    :param forms_to_download: a tuple containing the form types to download
    :param years: the max number of years to download
    :return:
    """
    try:
        submissions = mongodb.get_document("submissions", cik)
    except StopIteration:
        print(f"submissions file not found in mongodb for {cik}")
        return
    
    cik_no_trailing = submissions["cik"]
    filings = submissions["filings"]["recent"]
    
    for i in range(len(filings["filingDate"])):
        filing_date = filings['filingDate'][i]
        difference_in_years = relativedelta(datetime.date.today(),
                                            datetime.datetime.strptime(filing_date, "%Y-%m-%d")).years
        
        # as the documents are ordered chronologically (newest first), we can return once we reach the max history
        if difference_in_years > years:
            return
        
        form_type = filings['form'][i]
        if form_type not in forms_to_download:
            continue
            
        accession_no_symbols = filings["accessionNumber"][i].replace("-","")
        primary_document = filings["primaryDocument"][i]
        url = f"https://www.sec.gov/Archives/edgar/data/{cik_no_trailing}/{accession_no_symbols}/{primary_document}"
        
        # if we already have the document, we don't download it again
        if mongodb.check_document_exists("documents", url):
            continue
        
        print(f"{filing_date} ({form_type}): {url}")
        download_document(url, cik, form_type, filing_date)
        
        # insert a quick sleep to avoid reaching edgar rate limit
        time.sleep(0.2)

def download_document(url, cik, form_type, filing_date, updated_at=None):
    """
    Download and insert submission document
    :param url: document URL
    :param cik: company cik
    :param form_type: document form type
    :param filing_date: document filing date
    :param updated_at: optional value stored in the document's "updated_at" field
    :return:
    """
    response = make_edgar_request(url)
    r = response.text
    
    doc = {"html": r, "cik": cik, "form_type": form_type, "filing_date": filing_date, "updated_at": updated_at, "_id": url}
    
    try:
        mongodb.insert_document("documents", doc)
    except DocumentTooLarge:
        # DocumentTooLarge is raised by pymongo when inserting a document larger than 16MB.
        # To avoid this, it is better to save this kind of file in a separate storage like S3 and retrieve it when needed.
        # Another option could be using mongofiles: https://www.mongodb.com/docs/database-tools/mongofiles/#mongodb-binary-bin.mongofiles
        # to manage large files saved in MongoDB.
        print("Document too Large (over 16MB)", url)

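def add_trailing_to_cik(cik_no_trailing):
    """
    Pad a numeric cik with leading zeros to 10 digits (despite the name, the zeros are added in front).
    :param cik_no_trailing: cik as an integer
    :return: cik as a 10-character string
    """
    return "{:010d}".format(cik_no_trailing)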

These methods allow us to download data for all the companies tracked by the SEC, that is, those present in the *cik_ticker* collection.

With this in place, we can download all SEC filings for all these companies and save them on MongoDB in the *submissions* and *documents* collections.
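
As a quick illustration of the two string manipulations the download helpers rely on, the sketch below pads a CIK to 10 digits and assembles the archive URL of a filing. The function names `pad_cik` and `build_filing_url` are hypothetical (they are not part of the notebook), and the dashed accession number is inferred from the Alphabet 10-K URL used in this notebook; the URL layout is the one used in `download_submissions_documents`.

```python
def pad_cik(cik: int) -> str:
    """Left-pad a numeric CIK with zeros to the 10-digit form."""
    return f"{cik:010d}"


def build_filing_url(cik: int, accession_number: str, primary_document: str) -> str:
    """Assemble the EDGAR archive URL of a filing's primary document."""
    # the archive path uses the unpadded CIK and the accession number without dashes
    accession_no_symbols = accession_number.replace("-", "")
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{cik}/{accession_no_symbols}/{primary_document}"
    )


print(pad_cik(1652044))
print(build_filing_url(1652044, "0001652044-23-000016", "goog-20221231.htm"))
```

Note that the padded CIK is used for the `submissions` endpoint, while the archive URL takes the unpadded one.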

### Example: download Alphabet Inc. (Google) data
As an example, we are going to collect data for Alphabet Inc.

In [4]:
# First we download the cik_ticker map
download_cik_ticker_map()

In [5]:
mongodb.get_collection('cik_ticker').find({}).next()

{'_id': 'cik_ticker',
 'fields': ['cik', 'name', 'ticker', 'exchange'],
 'data': [[320193, 'Apple Inc.', 'AAPL', 'Nasdaq'],
  [789019, 'MICROSOFT CORP', 'MSFT', 'Nasdaq'],
  [1652044, 'Alphabet Inc.', 'GOOGL', 'Nasdaq'],
  [1018724, 'AMAZON COM INC', 'AMZN', 'Nasdaq'],
  [1045810, 'NVIDIA CORP', 'NVDA', 'Nasdaq'],
  [1318605, 'Tesla, Inc.', 'TSLA', 'Nasdaq'],
  [1067983, 'BERKSHIRE HATHAWAY INC', 'BRK-B', 'NYSE'],
  [1326801, 'Meta Platforms, Inc.', 'META', 'Nasdaq'],
  [1046179, 'TAIWAN SEMICONDUCTOR MANUFACTURING CO LTD', 'TSM', 'NYSE'],
  [1403161, 'VISA INC.', 'V', 'NYSE'],
  [824046, 'LVMH MOET HENNESSY LOUIS VUITTON', 'LVMUY', 'OTC'],
  [731766, 'UNITEDHEALTH GROUP INC', 'UNH', 'NYSE'],
  [59478, 'ELI LILLY & Co', 'LLY', 'NYSE'],
  [34088, 'EXXON MOBIL CORP', 'XOM', 'NYSE'],
  [19617, 'JPMORGAN CHASE & CO', 'JPM', 'NYSE'],
  [104169, 'Walmart Inc.', 'WMT', 'NYSE'],
  [200406, 'JOHNSON & JOHNSON', 'JNJ', 'NYSE'],
  [884394, 'SPDR S&P 500 ETF TRUST', 'SPY', 'NYSE'],
  [1141391, 'Ma

In [6]:
# Retrieve the CIK from the Alphabet Inc. ticker.
alphabet_ticker = "GOOGL"
cik = cik_from_ticker(alphabet_ticker)
cik

'0001652044'
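
The lookup performed by `cik_from_ticker` can also be sketched without pandas. The snippet below is a stdlib-only illustration, not the notebook's implementation: it zips the `fields` and `data` arrays of a small hand-copied excerpt of the mapping into dictionaries and indexes them by ticker. The names `sample_map`, `ticker_index` and `padded_cik` are made up for this example.

```python
sample_map = {
    "fields": ["cik", "name", "ticker", "exchange"],
    "data": [
        [320193, "Apple Inc.", "AAPL", "Nasdaq"],
        [789019, "MICROSOFT CORP", "MSFT", "Nasdaq"],
        [1652044, "Alphabet Inc.", "GOOGL", "Nasdaq"],
    ],
}

# turn each row into a dict keyed by field name, then index the rows by ticker
rows = [dict(zip(sample_map["fields"], row)) for row in sample_map["data"]]
ticker_index = {row["ticker"]: row for row in rows}

company = ticker_index.get("GOOGL")
# pad the numeric cik to 10 digits, or keep None when the ticker is unknown
padded_cik = f"{company['cik']:010d}" if company else None
print(padded_cik)
```

Building the index once and reusing it avoids rebuilding the whole mapping on every lookup.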

In [7]:
# Get list of submissions for Alphabet Inc.
download_all_cik_submissions(cik)

In [8]:
# Here we can see all submissions collected in MongoDB 'submissions' collection.
mongodb.get_collection('submissions').find({"_id":cik}).next()

{'_id': '0001652044',
 'cik': '1652044',
 'entityType': 'operating',
 'sic': '7370',
 'sicDescription': 'Services-Computer Programming, Data Processing, Etc.',
 'insiderTransactionForOwnerExists': 1,
 'insiderTransactionForIssuerExists': 1,
 'name': 'Alphabet Inc.',
 'tickers': ['GOOGL', 'GOOG'],
 'exchanges': ['Nasdaq', 'Nasdaq'],
 'ein': '611767919',
 'description': '',
 'website': '',
 'investorWebsite': '',
 'category': 'Large accelerated filer',
 'fiscalYearEnd': '1231',
 'stateOfIncorporation': 'DE',
 'stateOfIncorporationDescription': 'DE',
 'addresses': {'mailing': {'street1': '1600 AMPHITHEATRE PARKWAY',
   'street2': None,
   'city': 'MOUNTAIN VIEW',
   'stateOrCountry': 'CA',
   'zipCode': '94043',
   'stateOrCountryDescription': 'CA'},
  'business': {'street1': '1600 AMPHITHEATRE PARKWAY',
   'street2': None,
   'city': 'MOUNTAIN VIEW',
   'stateOrCountry': 'CA',
   'zipCode': '94043',
   'stateOrCountryDescription': 'CA'}},
 'phone': '650-253-0000',
 'flags': '',
 'formerN

In [9]:
# This downloads all 10-K documents filed in the past 5 years.
# Note the trailing comma: ("10-K") is just a string, while ("10-K",) is a one-element tuple.
download_submissions_documents(cik, ("10-K",), 5)

In [10]:
# Here we can see a document collected in MongoDB 'documents' collection.
mongodb.get_collection('documents').find({"cik":cik}).next()

{'_id': 'https://www.sec.gov/Archives/edgar/data/1652044/000165204423000016/goog-20221231.htm',
 'html': '<?xml version="1.0" ?><!--XBRL Document Created with Wdesk from Workiva--><!--Copyright 2023 Workiva--><!--r:94db13ab-d0fb-433a-a7d1-96ca74a2a87d,g:b8c6572a-40d9-4f2f-b3df-2328dc788b5b,d:a96e4fb0476549c99dc3a2b2368f643f--><html xmlns:country="http://xbrl.sec.gov/country/2022" xmlns:iso4217="http://www.xbrl.org/2003/iso4217" xmlns="http://www.w3.org/1999/xhtml" xmlns:ixt="http://www.xbrl.org/inlineXBRL/transformation/2020-02-12" xmlns:goog="http://www.google.com/20221231" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dei="http://xbrl.sec.gov/dei/2022" xmlns:ix="http://www.xbrl.org/2013/inlineXBRL" xmlns:srt="http://fasb.org/srt/2022" xmlns:xbrli="http://www.xbrl.org/2003/instance" xmlns:ixt-sec="http://www.sec.gov/inlineXBRL/transformation/2015-08-31" xmlns:us-gaap="http://fasb.org/us-gaap/2022" xmlns:link="http://www.xbrl.org
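
The selection logic inside `download_submissions_documents` can be illustrated in isolation. The `filings.recent` part of a submissions document is column-oriented: `filingDate`, `form` and `accessionNumber` are parallel lists, and index `i` in each describes the same filing. The sketch below is a simplified stand-in rather than the notebook's code: `select_filings` is a hypothetical name, the sample data is invented, and it assumes (as the notebook does) that filings are listed newest first.

```python
import datetime


def select_filings(recent, forms=("10-K",), years=5, today=None):
    """Return (filing_date, form, accession_number) tuples for recent filings of the given forms."""
    today = today or datetime.date.today()
    selected = []
    # the lists are parallel, so zip walks them one filing at a time
    for filing_date, form, accession in zip(
        recent["filingDate"], recent["form"], recent["accessionNumber"]
    ):
        filed = datetime.date.fromisoformat(filing_date)
        # number of full years elapsed since the filing date
        age = today.year - filed.year - ((today.month, today.day) < (filed.month, filed.day))
        if age > years:
            break  # everything after this point is older still
        if form in forms:
            selected.append((filing_date, form, accession))
    return selected


# invented sample data, shaped like submissions["filings"]["recent"]
recent = {
    "filingDate": ["2023-02-03", "2022-10-26", "2022-02-02", "2015-02-06"],
    "form": ["10-K", "10-Q", "10-K", "10-K"],
    "accessionNumber": [
        "0000000000-23-000001",
        "0000000000-22-000002",
        "0000000000-22-000003",
        "0000000000-15-000004",
    ],
}
selected = select_filings(recent, today=datetime.date(2023, 6, 1))
print(selected)
```

A one-element tuple needs its trailing comma: `("10-K",)` is a tuple, whereas `("10-K")` is a plain string, on which `in` performs a substring test.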

## <a class="anchor" id="2-bullet" href="#toc">2. Qualitative Analysis</a>

In this step we are going to leverage OpenAI's **gpt-3.5-turbo** (the model that powers ChatGPT) to create summaries that help us evaluate a company, starting from its SEC filings.

Qualitative analysis for financial and investing purposes involves evaluating non-numerical factors such as the company's management team, competitive positioning, brand reputation, industry trends, regulatory environment, corporate governance, and strategic initiatives. This analysis aims to gain insights into the company's qualitative strengths, weaknesses, risks, and opportunities to make informed investment decisions and assess the company's long-term potential.

In particular, we will read a specific filing, the 2022 10-K Annual report for Google (https://www.sec.gov/Archives/edgar/data/1652044/000165204423000016/goog-20221231.htm).

Let's now extract the sections and identify critical information.

### Parse Documents
First we are going to parse the raw HTML collected in the "documents" collection to extract the text contained in the various sections (business, risk, MD&A, and so on).

Then we will pass these sections to **gpt-3.5-turbo**, asking it to provide a summary.

We split the document into sections beforehand because **gpt-3.5-turbo** has a limit on the number of tokens (words or word fragments) it can process in a single request.

For this step we can use BeautifulSoup and some heuristics.
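
A single section can still exceed the model's limit on its own. One possible way to handle that is to cut the section text into chunks that fit a token budget. The sketch below is an illustration under stated assumptions, not the notebook's code: it approximates token counts with a rough rule of four characters per token (an exact count would need the model's tokenizer), and the names `estimate_tokens` and `split_into_chunks` are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def split_into_chunks(text: str, max_tokens: int) -> list[str]:
    """Greedily pack whole lines into chunks whose estimated size fits max_tokens."""
    chunks, current = [], ""
    for paragraph in text.split("\n"):
        candidate = f"{current}\n{paragraph}" if current else paragraph
        if current and estimate_tokens(candidate) > max_tokens:
            # the paragraph does not fit: close the current chunk and start a new one
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# five lines of 40 characters each, roughly 10 estimated tokens per line
sample_text = "\n".join(["a" * 40] * 5)
chunks = split_into_chunks(sample_text, max_tokens=25)
print([estimate_tokens(c) for c in chunks])
```

In this sketch, a single line longer than the budget ends up as its own oversized chunk, so very long paragraphs would need a further split (for example by sentence).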

### Problem:
Annual reports do not have a standard structure. The layout can differ from one filing to the next, even for filings from the same company.

Luckily, we identified common patterns in the HTML markup of a filing.

Most reports have a table of contents at the beginning of the document, and we can use it to make sense of the document's structure.

When there is a table of contents, we can look at its &lt;a&gt; tags (links) and retrieve the hrefs, which identify the tag elements where each section starts inside the document.

With this, we can split the document into multiple sections, each maintaining a common context.

However, not all tables of contents have hrefs, and some documents don't have a table of contents at all. Thus, we wrote an algorithm that also handles these cases.
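
To make the href-based idea concrete, here is a small stdlib-only sketch. It uses `html.parser` rather than BeautifulSoup so that it runs without extra dependencies, and the HTML snippet, the class name `AnchorCollector` and the ids are invented for the example. It collects the fragment links found inside tables and checks which of them point to an element that exists in the document, either by `id` or by `name`.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect fragment links found inside tables and every id/name defined in the page."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0
        self.toc_targets = []  # href fragments, in table order
        self.defined = set()   # ids and names that can be link targets

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            self.table_depth += 1
        href = attrs.get("href") or ""
        if tag == "a" and self.table_depth > 0 and href.startswith("#"):
            self.toc_targets.append(href[1:])  # drop the leading '#'
        for key in ("id", "name"):
            if attrs.get(key):
                self.defined.add(attrs[key])

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1


sample_html = """
<table>
  <tr><td><a href="#item1">Item 1. Business</a></td></tr>
  <tr><td><a href="#item1a">Item 1A. Risk Factors</a></td></tr>
  <tr><td><a href="#missing">Item 1B. Unresolved Staff Comments</a></td></tr>
</table>
<div id="item1">Item 1. Business</div><p>We build things.</p>
<a name="item1a"></a><p>Risks ...</p>
"""

collector = AnchorCollector()
collector.feed(sample_html)
# keep only the links whose target actually exists in the document
resolved = [t for t in collector.toc_targets if t in collector.defined]
print(resolved)
```

Links that do not resolve (like `#missing` here) are the cases where a fallback, such as matching section titles in the text, becomes necessary.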

In [11]:
import copy
from datetime import datetime
import Levenshtein
from bs4 import BeautifulSoup, NavigableString
from unidecode import unidecode
import mongodb
import string
import re

In [12]:
# This is a list of strings that we use to look for the table of contents in a 10-K filing
list_10k_items = [
    "business",
    "risk factors",
    "unresolved staff comments",
    "properties",
    "legal proceedings",
    "mine safety disclosures",
    "market for registrant’s common equity, related stockholder matters and issuer purchases of equity securities",
    "reserved",
    "management’s discussion and analysis of financial condition and results of operations",
    "quantitative and qualitative disclosures about market risk",
    "financial statements and supplementary data",
    "changes in and disagreements with accountants on accounting and financial disclosure",
    "controls and procedures",
    "other information",
    "disclosure regarding foreign jurisdictions that prevent inspection",
    "directors, executive officers, and corporate governance",
    "executive compensation",
    "security ownership of certain beneficial owners and management and related stockholder matters",
    "certain relationships and related transactions, and director independence",
    "principal accountant fees and services",
    "exhibits and financial statement schedules",
]

# This is a dictionary of default sections that a 10-K filing, annual report, could contain
default_10k_sections = {
     1: {'item': 'item 1', 'title': ['business']},
     2: {'item': 'item 1a', 'title': ['risk factor']},
     3: {'item': 'item 1b', 'title': ['unresolved staff']},
     4: {'item': 'item 2', 'title': ['propert']},
     5: {'item': 'item 3', 'title': ['legal proceeding']},
     6: {'item': 'item 4', 'title': ['mine safety disclosure', 'submission of matters to a vote of security holders']},
     7: {'item': 'item 5', 'title': ["market for registrant's common equity, related stockholder matters and issuer purchases of equity securities"]},
     8: {'item': 'item 6', 'title': ['reserved', 'selected financial data']},
     9: {'item': 'item 7', 'title': ["management's discussion and analysis of financial condition and results of operations"]},
     10: {'item': 'item 7a', 'title': ['quantitative and qualitative disclosures about market risk']},
     11: {'item': 'item 8', 'title': ['financial statements and supplementary data']},
     12: {'item': 'item 9', 'title': ['changes in and disagreements with accountants on accounting and financial disclosure']},
     13: {'item': 'item 9a', 'title': ['controls and procedures']},
     14: {'item': 'item 9b', 'title': ['other information']},
     15: {'item': 'item 9c', 'title': ['disclosure regarding foreign jurisdictions that prevent inspections']},
     16: {'item': 'item 10', 'title': ['directors, executive officers and corporate governance','directors and executive officers of the registrant']},
     17: {'item': 'item 11', 'title': ['executive compensation']},
     18: {'item': 'item 12', 'title': ['security ownership of certain beneficial owners and management and related stockholder matters']},
     19: {'item': 'item 13', 'title': ['certain relationships and related transactions']},
     20: {'item': 'item 14', 'title': ['principal accountant fees and services']},
     21: {'item': 'item 15', 'title': ['exhibits, financial statement schedules', 'exhibits and financial statement schedules']},
}

# This is a list of strings that we use to look for the table of contents in a 10-Q filing
list_10q_items = [
    "financial statement",
    "risk factor",
    "legal proceeding",
    "mine safety disclosure",
    "management’s discussion and analysis of financial condition and results of operations",
    "quantitative and qualitative disclosures about market risk",
    "controls and procedures",
    "other information",
    "unregistered sales of equity securities and use of proceeds",
    "defaults upon senior securities",
    "exhibits"
]

# This is a dictionary of default sections that a 10-Q filing, quarterly report, could contain
default_10q_sections = {
    1: {'item': 'item 1', 'title': ['financial statement']},
    2: {'item': 'item 2', 'title': ["management's discussion and analysis of financial condition and results of operations"]},
    3: {'item': 'item 3', 'title': ['quantitative and qualitative disclosures about market risk']},
    4: {'item': 'item 4', 'title': ['controls and procedures']},
    5: {'item': 'item 1', 'title': ['legal proceeding']},
    6: {'item': 'item 1a', 'title': ['risk factor']},
    7: {'item': 'item 2', 'title': ["unregistered sales of equity securities and use of proceeds"]},
    8: {'item': 'item 3', 'title': ["defaults upon senior securities"]},
    9: {'item': 'item 4', 'title': ["mine safety disclosure"]},
    10: {'item': 'item 5', 'title': ["other information"]},
    11: {'item': 'item 6', 'title': ["exhibits"]},
}

# This is a dictionary of default sections that a 8-K filing, current report, could contain
default_8k_sections = {
    1: {'item': 'item 1.01', 'title': ["entry into a material definitive agreement"]},
    2: {'item': 'item 1.02', 'title': ["termination of a material definitive agreement"]},
    3: {'item': 'item 1.03', 'title': ["bankruptcy or receivership"]},
    4: {'item': 'item 1.04', 'title': ["mine safety"]},
    5: {'item': 'item 2.01', 'title': ["completion of acquisition or disposition of asset"]},
    6: {'item': 'item 2.02', 'title': ['results of operations and financial condition']},
    7: {'item': 'item 2.03', 'title': ["creation of a direct financial obligation"]},
    8: {'item': 'item 2.04', 'title': ["triggering events that accelerate or increase a direct financial obligation"]},
    9: {'item': 'item 2.05', 'title': ["costs associated with exit or disposal activities"]},
    10: {'item': 'item 2.06', 'title': ["material impairments"]},
    11: {'item': 'item 3.01', 'title': ["notice of delisting or failure to satisfy a continued listing"]},
    12: {'item': 'item 3.02', 'title': ["unregistered sales of equity securities"]},
    13: {'item': 'item 3.03', 'title': ["material modification to rights of security holders"]},
    14: {'item': 'item 4.01', 'title': ["changes in registrant's certifying accountant"]},
    15: {'item': 'item 4.02', 'title': ["non-reliance on previously issued financial statements"]},
    16: {'item': 'item 5.01', 'title': ["changes in control of registrant"]},
    17: {'item': 'item 5.02', 'title': ['departure of directors or certain officers']},
    18: {'item': 'item 5.03', 'title': ['amendments to articles of incorporation or bylaws']},
    19: {'item': 'item 5.04', 'title': ["temporary suspension of trading under registrant"]},
    20: {'item': 'item 5.05', 'title': ["amendment to registrant's code of ethics"]},
    21: {'item': 'item 5.06', 'title': ["change in shell company status"]},
    22: {'item': 'item 5.07', 'title': ['submission of matters to a vote of security holders']},
    23: {'item': 'item 5.08', 'title': ["shareholder director nominations"]},
    24: {'item': 'item 6.01', 'title': ["abs informational and computational material"]},
    25: {'item': 'item 6.02', 'title': ['change of servicer or trustee']},
    26: {'item': 'item 6.03', 'title': ['change in credit enhancement or other external support']},
    27: {'item': 'item 6.04', 'title': ["failure to make a required distribution"]},
    28: {'item': 'item 6.05', 'title': ["securities act updating disclosure"]},
    29: {'item': 'item 7.01', 'title': ["regulation fd disclosure"]},
    30: {'item': 'item 8.01', 'title': ['other events']},
    31: {'item': 'item 9.01', 'title': ["financial statements and exhibits"]},
}

def identify_table_of_contents(soup, list_items):
    """
    Given a soup object and a list of items, this method looks for a table of contents.
    :param soup: soup object of the document
    :param list_items: an array of strings related to sections titles.
    :return: the table of contents PageElement object or None if not found.
    """
    if list_items is None:
        return None
    max_table = 0
    chosen_table = None
    tables = soup.body.find_all("table")
    
    # for each table in the document
    for t in tables:
        
        # count how many elements of list_items are present in the table
        count = 0
        for s in list_items:
            r = t.find(string=re.compile(re.escape(s), re.IGNORECASE))
            if r is not None:
                count += 1

        # choose the table that has the maximum number of elements
        if count > max_table:
            chosen_table = t
            max_table = count
                   
    # we return the chosen table only if it matches more than 3 items
    if max_table > 3:
        return chosen_table
    
    return None

def get_sections_text_with_hrefs(soup, sections):
    """
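    Walk the document body and collect the text of each section.
    A section starts at its 'start_el' element and ends where the next section starts.
    :param soup: a soup object
    :param sections: a dictionary containing data about sections (each entry needs a 'start_el')
    :return: the same sections dictionary, with a 'text' entry added to every section found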
    """
    next_section = 1
    current_section = None
    text = ""
    last_was_new_line = False
    
    # for each element in body
    for el in soup.body.descendants:
        
        # if we find the start element of a section
        if next_section in sections and el == sections[next_section]['start_el']:
            
            # store the text collected so far in the section we are leaving, then reset it
            if current_section is not None:
                sections[current_section]["text"] = text
                text = ""
                last_was_new_line = False

            # change section
            current_section = next_section
            next_section += 1

        # if we are currently in a section
        if current_section is not None and isinstance(el, NavigableString):
            
            if last_was_new_line and el.text == "\n":
                continue
            elif el.text == "\n":
                last_was_new_line = True
            else:
                last_was_new_line = False
            found_text = unidecode(el.get_text(separator=" "))
            
            # append to text
            if len(text) > 0 and text[-1] != " " and len(found_text) > 0 and found_text[0] != " ":
                text += "\n"
            text += found_text.replace('\n', ' ')

    # we reached the end of the document: store the remaining text in the last section
    if current_section is not None:
        sections[current_section]["text"] = text

    return sections

def clean_section_title(title):
    """
    Clean the title string by removing the item label, numbering and punctuation that make it harder to recognize.
    :param title: a string
    :return: a cleaned string, lowercase
    """
    
    # lower case
    title = title.lower()
    
    # remove special html characters
    title = unidecode(title)
    
    # remove item
    title = title.replace("item ", "")
    
    # remove '1.' etc
    for idx in range(20, 0, -1):
        for let in ['', 'a', 'b', 'c']:
            title = title.replace(f"{idx}{let}.", "")
    for idx in range(10, 0, -1):
        title = title.replace(f"f-{idx}", "")
    
    # remove parentheses (with their content) and strip punctuation and whitespace
    title = re.sub(r'\([^)]*\)', '', title).strip(string.punctuation + string.whitespace)
    
    return title

def get_sections_using_hrefs(soup, table_of_contents):
    """
    Scan the table_of_contents and identify all hrefs, if present.
    The method creates a dictionary of sections by finding the tag elements in soup that are referenced by those hrefs.
    :param soup: soup object of the document.
    :param table_of_contents: a PageElement from soup that represents the table of contents
    :return: a dictionary with the following structure:
        {1:
            {
                'start_el': tag element where the section starts,
                'idx': an integer index of start element inside soup, used for ordering
                'title': a string representing the section title,
                'title_candidates': a list of title candidates. If there is a single candidate that becomes the title
                'end_el': tag element where the section ends,
                'text': the text of the section
            },
        ...
        }
        Sections are ordered by their 'idx' value
    """
    
    # get all html elements
    all_elements = soup.find_all()
    hrefs = {}
    sections = {}
    
    # for each row in table of contents
    for tr in table_of_contents.find_all("tr"):
        
        # get all <a> tags and their links
        try:
            aa = tr.find_all("a")
            tr_hrefs = [a['href'][1:] for a in aa]
            
        except Exception as e:
            continue

        # for each element in the table row
        for el in tr.children:
            
            text = el.text
            text = clean_section_title(text)
            
            # check if there is a title
            if is_title_valid(text):
                
                
                for tr_href in tr_hrefs:
                    if tr_href not in hrefs:
                        
                        # find a document related to that title
                        h_tag = soup.find(id=tr_href)
                        if h_tag is None:
                            h_tag = soup.find(attrs={"name": tr_href})
                            
                        # if we find one, we store the information in our hrefs dictionary
                        if h_tag:
                            hrefs[tr_href] = {
                                'start_el': h_tag,
                                'idx': all_elements.index(h_tag),
                                'title': None,
                                'title_candidates': set([text])}
                    else:
                        hrefs[tr_href]['title_candidates'].add(text)
            else:
                continue

    # for each element in our hrefs dictionary (title information)
    for h in hrefs:
        hrefs[h]['title_candidates'] = list(hrefs[h]['title_candidates'])
        if len(hrefs[h]['title_candidates']) == 1:
            hrefs[h]['title'] = hrefs[h]['title_candidates'][0]
        else:
            hrefs[h]['title'] = "+++".join(hrefs[h]['title_candidates'])

    # let's sort the titles based on where we found the corresponding element in the document.
    # It can happen (seldom) that an element that comes before in the table of contents, 
    # actually comes after in the document
    temp_s = sorted(hrefs.items(), key=lambda x: x[1]["idx"])
    for i, s in enumerate(temp_s):
        sections[i + 1] = s[1]
        if i > 0:
            sections[i]["end_el"] = sections[i + 1]["start_el"]

    # retrieve sections text
    sections = get_sections_text_with_hrefs(soup, sections)
    return sections

def string_similarity_percentage(string1, string2):
    """
    Compute the Levenshtein distance between the two strings (ignoring spaces) and return the percentage similarity.
    :param string1: first string
    :param string2: second string
    :return: a float representing the percentage of similarity
    """
    # compare the strings without spaces, and normalize by the same space-free lengths
    s1 = string1.replace(" ", "")
    s2 = string2.replace(" ", "")
    max_length = max(len(s1), len(s2))
    if max_length == 0:
        # two empty strings are identical
        return 100.0
    distance = Levenshtein.distance(s1, s2)
    similarity_percentage = (1 - (distance / max_length)) * 100
    return similarity_percentage

def is_title_valid(text):
    """
    Check if the title is valid, meaning that it does not start with keywords like item, part, signature or page,
    is not made only of digits, and has more than 2 characters.
    :param text: a string representing the title
    :return: True if all conditions are satisfied else False
    """
    valid = not (
            text.startswith("item") or
            text.startswith("part") or
            text.startswith("signature") or
            text.startswith("page") or
            text.isdigit() or
            len(text) <= 2)
    return valid

def select_best_match(string_to_match, matches, start_index):
    """
    Identify the best match, in terms of string similarity, between string_to_match and a list of matches.
    start_index is used to avoid cases where string_to_match is matched with the first occurrence in matches.
    :param string_to_match: a string
    :param matches: a list of regular expression matches
    :param start_index: an integer representing the index to start from
    :return: the regular expression match with the highest similarity
    """
    match = None

    # start_index == 0 means the first occurrence is the table of contents entry: drop it
    if start_index == 0:
        del matches[0]

    # if there is only one possibility, take it regardless of its position
    if len(matches) == 1:
        match = matches[0]
            
    # else search for the most similar option
    elif len(matches) > 1:
        max_similarity = -1
        for i, m in enumerate(matches):
            if m.start() > start_index:
                sim = string_similarity_percentage(string_to_match, m.group().lower().replace("\n", " "))
                if sim > max_similarity:
                    max_similarity = sim
                    match = m
    return match

def get_sections_using_strings(soup, table_of_contents, default_sections):
    """
        Scan the table_of_contents and identify possible section text using strings that match default_sections.
        Retrieve sections strings in soup.body.text.
        :param soup: the soup object
        :param table_of_contents: a PageElement from soup that represents the table of contents
        :param default_sections: a dictionary that contains prefilled data about default sections that could be found in the document
        :return: a dictionary with the following structure, representing the sections:
            {1:
                {
                    'start_index': the start index of the section inside soup.body.text,
                    'end_index': the end index of the section inside soup.body.text,
                    'title': a string representing the section title,
                    'text': the text of the section
                },
            ...
            }
            Sections are ordered by their 'start_index' value
        """

    # Clean soup.body.text removing consecutive \n and spaces
    body_text = unidecode(soup.body.get_text(separator=" "))
    body_text = re.sub('\n', ' ', body_text)
    body_text = re.sub(' +', ' ', body_text)

    # If there is a table_of_contents, look for item strings and check their validity
    sections = {}
    if table_of_contents:
        num_section = 1
        for tr in table_of_contents.find_all("tr"):
            section = {}
            for el in tr.children:
                text = el.text
                
                # remove special html characters
                item = unidecode(text.lower()).replace("\n", " ").strip(string.punctuation + string.whitespace)

                if 'item' in item:
                    section["item"] = item

                text = clean_section_title(text)
                if 'item' in section and is_title_valid(text):
                    section['title'] = text
                    sections[num_section] = section
                    num_section += 1
    
    # Different behaviour if there is a table_of_contents and sections is already populated.
    if len(sections) == 0:
        # no usable table_of_contents sections, we use a prefilled default_sections dictionary
        sections = copy.deepcopy(default_sections)
        start_index = 1
    else:
        # skip the first occurrence in the text since it is also present in table_of_contents
        start_index = 0
    
    # Loop through all sections to identify a possible item and title for a section.
    # If multiple values are found we select best match based on string similarity.
    for si in sections:
        s = sections[si]
        if 'item' in s:
            match = None
            if isinstance(s['title'], list):
                for t in s['title']:
                    matches = list(re.finditer(fr"{s['item']}. *{t}", body_text, re.IGNORECASE + re.DOTALL))
                    if matches:
                        match = select_best_match(f"{s['item']} {t}", matches, start_index)
                        break
            else:
                matches = list(re.finditer(fr"{s['item']}. *{s['title']}", body_text, re.IGNORECASE + re.DOTALL))
                if matches:
                    match = select_best_match(f"{s['item']} {s['title']}", matches, start_index)

            if match is None:
                matches = list(re.finditer(fr"{s['item']}", body_text, re.IGNORECASE + re.DOTALL))
                if matches:
                    match = select_best_match(f"{s['item']}", matches, start_index)

            if match:
                s['title'] = match.group()
                s["start_index"] = match.start()
                start_index = match.start()
            else:
                s['remove'] = True

    sections_temp = {}
    for si in sections:
        if "remove" not in sections[si]:
            sections_temp[si] = sections[si]

    # Finally we populate each section in the dictionary with its text taken from body_text
    temp_s = sorted(sections_temp.items(), key=lambda x: x[1]["start_index"])
    sections = {}
    last_section = 0
    for i, s in enumerate(temp_s):
        sections[i + 1] = s[1]
        if i > 0:
            sections[i]["end_index"] = sections[i + 1]["start_index"]
            sections[i]["text"] = body_text[sections[i]["start_index"]:sections[i]["end_index"]]
        last_section = i + 1
    if last_section > 0:
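        # the last section runs to the end of the text (an end index of -1 would drop the final character)
        sections[last_section]["end_index"] = len(body_text)
        sections[last_section]["text"] = body_text[sections[last_section]["start_index"]:sections[last_section]["end_index"]]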

    return sections
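
The string-based fallback above comes down to locating each heading in the flattened text and slicing between consecutive start positions. The block below is a minimal sketch of that idea only, using the standard library. The name `split_by_headings`, its `skip_first` flag and the sample text are hypothetical, and it takes the first remaining match instead of ranking candidates by similarity as `select_best_match` does.

```python
import re

def split_by_headings(body_text, headings, skip_first=True):
    """Return {heading: text} by slicing body_text between heading matches."""
    starts = []
    for heading in headings:
        # re.escape keeps characters such as '.' literal in the pattern
        matches = list(re.finditer(re.escape(heading), body_text, re.IGNORECASE))
        if skip_first and len(matches) > 1:
            # assumption: the first hit is the table-of-contents entry
            matches = matches[1:]
        if matches:
            starts.append((matches[0].start(), heading))

    # order the headings by where they actually appear in the text
    starts.sort()
    sections = {}
    for pos, (start, heading) in enumerate(starts):
        # a section ends where the next one starts; the last one runs to the end
        end = starts[pos + 1][0] if pos + 1 < len(starts) else len(body_text)
        sections[heading] = body_text[start:end].strip()
    return sections

sample = (
    "Item 1. Business Item 1A. Risk Factors "
    "Item 1. Business We sell widgets. "
    "Item 1A. Risk Factors Demand may fall."
)
result = split_by_headings(sample, ["Item 1. Business", "Item 1A. Risk Factors"])
for heading, text in result.items():
    print(heading, "->", text)
```

Note that the last section is sliced up to `len(body_text)`, so its final character is kept.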


We have defined many helper functions that will be used to parse a document.
Below we define the parse_document function, which takes a document, strips the HTML tags to obtain plain text, and splits the document into distinct sections.

In [13]:
def parse_document(doc):
    """
    Take a document, SEC filing, parse the content and retrieve the sections.
    Save the result in MongoDB under parsed_documents collection.
    :param doc: document from "documents" collection of mongoDB
    :return:
    """

    url = doc["_id"]
    form_type = doc["form_type"]
    filing_date = doc["filing_date"]
    sections = {}
    cik = doc["cik"]
    html = doc["html"]

    # Supported form type are 10-K, 10-K/A, 10-Q, 10-Q/A, 8-K
    if form_type in ["10-K", "10-K/A"]:
        include_forms = ["10-K", "10-K/A"]
        list_items = list_10k_items
        default_sections = default_10k_sections
    elif form_type in ["10-Q", "10-Q/A"]:
        include_forms = ["10-Q", "10-Q/A"]
        list_items = list_10q_items
        default_sections = default_10q_sections
    elif form_type == "8-K":
        include_forms = ["8-K"]
        list_items = None
        default_sections = default_8k_sections
    else:
        print(f"return because form_type {form_type} is not valid")
        return

    if form_type not in include_forms:
        print(f"return because form_type {form_type} is not in {include_forms}")
        return

    company_info = company_from_cik(cik)

    # no cik in cik_map
    if company_info is None:
        print("return because company info None")
        return

    print(f"form type: \t\t{form_type}")
    print(company_info)

    soup = BeautifulSoup(html, features="html.parser")

    if soup.body is None:
        print("return because soup.body None")
        return

    table_of_contents = identify_table_of_contents(soup, list_items)

    if table_of_contents:
        sections = get_sections_using_hrefs(soup, table_of_contents)

    if len(sections) == 0:
        sections = get_sections_using_strings(soup, table_of_contents, default_sections)

    result = {"_id": url, "cik": cik, "form_type":form_type, "filing_date": filing_date, "sections":{}}

    for s in sections:
        section = sections[s]
        if 'text' in section:
            text = section['text']
            text = re.sub('\n', ' ', text)
            text = re.sub(' +', ' ', text)

            result["sections"][section["title"]] = {"text":text, "link":section["link"] if "link" in section else None}

    try:
        mongodb.upsert_document("parsed_documents", result)
    except Exception:
        traceback.print_exc()
        print(result.keys())
        print(result["sections"].keys())

Let's parse our Google 10-K document!

In [14]:
filing_url = 'https://www.sec.gov/Archives/edgar/data/1652044/000165204423000016/goog-20221231.htm'

doc = mongodb.get_collection("documents").find({"_id":filing_url}).next()
parse_document(doc)

form type: 		10-K
cik            0001652044
name        Alphabet Inc.
ticker              GOOGL
exchange           Nasdaq
Name: 2, dtype: object


In [15]:
parsed_doc = mongodb.get_collection("parsed_documents").find({"_id":filing_url}).next()
parsed_doc

{'_id': 'https://www.sec.gov/Archives/edgar/data/1652044/000165204423000016/goog-20221231.htm',
 'cik': '0001652044',
 'form_type': '10-K',
 'filing_date': '2023-02-03',
 'sections': {'note about forward-looking statements': {'text': 'Table of Contents Alphabet Inc. Note About Forward-Looking Statements This Annual Report on Form 10-K contains forward-looking statements within the meaning of the Private Securities Litigation Reform Act of 1995. These include, among other things, statements regarding: * the growth of our business and revenues and our expectations about the factors that influence our success and trends in our business; * fluctuations in our revenues and margins and various factors contributing to such fluctuations; * our expectation that the continuing shift from an offline to online world will continue to benefit our business; * our expectation that the portion of our revenues that we derive from non-advertising revenues will continue to increase and may affect our marg

### Summarize the parsed document
Now that we have split the document into shorter sections, we can apply a summarization algorithm to extract valuable insights from it.

To do so, we are going to leverage the OpenAI API through langchain.

In [16]:
from typing import Any, List
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders.unstructured import UnstructuredBaseLoader
from langchain.callbacks import get_openai_callback
from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI
from configparser import ConfigParser
import os

parser = ConfigParser()
_ = parser.read(os.path.join("credentials.cfg"))

class UnstructuredStringLoader(UnstructuredBaseLoader):
    """
    Uses unstructured to load a string
    Source of the string, for metadata purposes, can be passed in by the caller
    """

    def __init__(
        self, content: str, source: str = None, mode: str = "single",
        **unstructured_kwargs: Any
    ):
        self.content = content
        self.source = source
        super().__init__(mode=mode, **unstructured_kwargs)

    def _get_elements(self) -> List:
        from unstructured.partition.text import partition_text

        return partition_text(text=self.content, **self.unstructured_kwargs)

    def _get_metadata(self) -> dict:
        return {"source": self.source} if self.source else {}


def split_doc_in_chunks(doc, chunk_size=20000):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=100)
    chunks = text_splitter.split_documents(doc)
    return chunks

def compute_cost(tokens, model="gpt-3.5-turbo"):
    """
    Compute API cost from number of tokens
    :param tokens: the number of token
    :param model: the model name
    :return: cost in USD
    """
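    if model == "gpt-3.5-turbo":
        return round(tokens / 1000 * 0.002, 4)
    if model == "gpt-3.5-turbo-16k":
        return round(tokens / 1000 * 0.004, 4)
    # fail loudly instead of silently returning None for an unknown model
    raise ValueError(f"no pricing configured for model {model}")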
    
def create_summary(section_text, model, chain_type="map_reduce", verbose=False):
    """
    Instantiate an OpenAI chat model through langchain's ChatOpenAI,
    then run langchain's load_summarize_chain with the selected model and chain_type.
    :param section_text: text to be summarized
    :param model: language model
    :param chain_type: chain type for langchain.load_summarize_chain
    :param verbose: print langchain process
    :return: the model response, and the number of total tokens it took.
    """
    # load langchain language model
    llm = ChatOpenAI(model_name=model, openai_api_key=parser.get("open_ai", "api_key"))

    # prepare section_text string with a custom string loader to be ready for load_summarize_chain
    string_loader = UnstructuredStringLoader(section_text)

    # split the string in multiple chunks
    docs = split_doc_in_chunks(string_loader.load())

    # call model with the chain_type specified
    chain = load_summarize_chain(llm, chain_type=chain_type, verbose=verbose)

    # retrieve model response
    with get_openai_callback() as cb:
        res = chain.run(docs)

    return res, cb.total_tokens

def summarize_section(section_text, model="gpt-3.5-turbo", chain_type="map_reduce", verbose=False):
    """
    Create a summary for a document section and split it into bullet points.
    :param section_text: text input to be summarized
    :param model: the OpenAI model to use, default is gpt-3.5-turbo
    :param chain_type: the type of chain to use for summarization, default is "map_reduce",
     possible other values are "stuff" and "refine"
    :param verbose: passed to langchain to print details about the chain process
    :return: bullet points of the summary as a list of strings and the cost of the request
    """
    # call model to create the summary
    summary, tokens = create_summary(section_text, model, chain_type, verbose)

    # split summary in bullet points using ". " as separator, except after "inc"/"Inc" (e.g. "Alphabet Inc. ")
    bullets = [x.strip() for x in re.split(r'(?<!inc)(?<!Inc)\. ', summary)]

    # compute cost based on tokens of the response and the used model
    cost = compute_cost(tokens, model=model)

    return bullets, cost
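
To see what the bullet-splitting step in `summarize_section` does, the same regular expression can be run on a short string. This is a small sketch: the helper name `split_into_bullets` and the example sentence are made up for illustration.

```python
import re

# same pattern as in summarize_section: split on ". " unless it follows "inc" or "Inc"
BULLET_SPLIT = re.compile(r'(?<!inc)(?<!Inc)\. ')

def split_into_bullets(summary):
    """Split a summary into sentences, keeping "Inc. " inside a sentence."""
    return [part.strip() for part in BULLET_SPLIT.split(summary)]

example = (
    "Alphabet Inc. is a holding company. "
    "Its largest business is Google. "
    "Revenue comes mainly from ads."
)
bullets = split_into_bullets(example)
for bullet in bullets:
    print("-", bullet)
```

The separator is consumed by the split, so every bullet except the last loses its trailing period. The last one keeps it because no space follows.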

Next we need to prepare the sections: identify the sections of a filing and select the most important ones.

In [17]:
def restructure_parsed_10k(doc):
    """
    Look for and select only the sections specified in result dictionary.
    :param doc: mongo document from "parsed_documents" collection
    :return: a dictionary containing the parsed document sections titles and their text.
    """
    result = {
        "business": {"text":"", "links":[]},
        "risk": {"text":"", "links":[]},
        "unresolved": {"text":"", "links":[]},
        "property": {"text":"", "links":[]},
        "legal": {"text":"", "links":[]},
        "foreign": {"text":"", "links":[]},
        "other": {"text":"", "links":[]},
        
        # we are not going to summarize MD&A and financial notes sections of the document, while both extremely important,
        # because we didn't manage to obtain useful results from OpenAI models, without further pre-processing.
        
        # "MD&A": {"text":"", "links":[]},
        # "notes": {"text":"", "links":[]},
        
    }

    for s in doc["sections"]:

        found = None
        if ("business" in s.lower() or "overview" in s.lower() or "company" in s.lower() or "general" in s.lower() or "outlook" in s.lower())\
                and not "combination" in s.lower():
            found = "business"
        elif "propert" in s.lower() and not "plant" in s.lower() and not "business" in s.lower():
            found = "property"
        elif "foreign" in s.lower() and "jurisdiction" in s.lower():
            found = "foreign"
        elif "legal" in s.lower() and "proceeding" in s.lower():
            found = "legal"
        elif "information" in s.lower() and "other" in s.lower():
            found = "other"
        elif "unresolved" in s.lower():
            found = "unresolved"
        elif "risk" in s.lower():
            found = "risk"
        
        # we are not going to summarize MD&A and financial notes sections of the document, while both extremely important,
        # because we didn't manage to obtain useful results from OpenAI models, without further pre-processing.
        
        # elif "management" in s.lower() and "discussion" in s.lower():
        #     found = "MD&A"
        # elif "supplementa" in s.lower() or ("note" in s.lower() and "statement" not in s.lower()):
        #     found = "notes"

        if found is not None:
            result[found]["text"] += doc["sections"][s]["text"]
            result[found]["links"].append({
                "title": s,
                "link": doc["sections"][s]["link"] if "link" in doc["sections"][s] else None
            })

    return result

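def restructure_parsed_10q(doc):
    """
    Look for and select only the sections specified in result dictionary.
    :param doc: mongo document from "parsed_documents" collection
    :return: a dictionary containing the selected section categories, their text and links.
    """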
    result = {
        "risk": {"text":"", "links":[]},
        "MD&A": {"text":"", "links":[]},
        "legal": {"text":"", "links":[]},
        "other": {"text":"", "links":[]},
        "equity": {"text":"", "links":[]},
        "defaults": {"text":"", "links":[]},
    }

    for s in doc["sections"]:

        found = None
        if "legal" in s.lower() and "proceeding" in s.lower():
            found = "legal"
        elif "management" in s.lower() and "discussion" in s.lower():
            found = "MD&A"
        elif "information" in s.lower() and "other" in s.lower():
            found = "other"
        elif "risk" in s.lower():
            found = "risk"
        elif "sales" in s.lower() and "equity" in s.lower():
            found = "equity"
        elif "default" in s.lower():
            found = "defaults"

        if found is not None:
            result[found]["text"] += doc["sections"][s]["text"]
            result[found]["links"].append({
                "title": s,
                "link": doc["sections"][s]["link"] if "link" in doc["sections"][s] else None
            })

    return result

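def restructure_parsed_8k(doc):
    """
    Keep all sections of a parsed 8-K except "financial statements and exhibits".
    :param doc: mongo document from "parsed_documents" collection
    :return: a dictionary mapping each kept section title to its parsed content.
    """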

    result = {}

    for s in doc["sections"]:
        if "financial statements and exhibits" in s.lower():
            continue
        result[s] = doc["sections"][s]

    return result

def sections_summary(doc, verbose=False):
    """
    Summarize all sections of a document using the OpenAI API.
    Upsert the summary on MongoDB (overwriting the previous one, in case we make changes to openai_interface)

    This method is configured to use gpt-3.5-turbo. At the moment this model comes in two versions,
    one with a 4k-token context and one with a 16k-token context. Which one we use depends on the length of the section.

    :param doc: a parsed_document from MongoDB
    :param verbose: passed to langchain verbose
    :return:
    """

    company = company_from_cik(doc["cik"])
    result = {"_id": doc["_id"],
              "name": company["name"],
              "ticker": company["ticker"],
              "form_type": doc["form_type"],
              "filing_date": doc["filing_date"]}

    # keep track of duration and costs
    total_cost = 0
    total_start_time = time.time()

    if "10-K" in doc["form_type"]:
        new_doc = restructure_parsed_10k(doc)
    elif "10-Q" in doc["form_type"]:
        new_doc = restructure_parsed_10q(doc)
    elif doc["form_type"] == "8-K":
        new_doc = restructure_parsed_8k(doc)
    else:
        print(f"form_type {doc['form_type']} is not yet implemented")
        return

    # for each section
    for section_title, section in new_doc.items():

        section_links = section["links"] if "links" in section else None
        section_text = section["text"]

        start_time = time.time()
        
        # if the section text is too short we skip it, it's probably not material
        if len(section_text) < 250:
            continue

        # select chain_type and model (4k or 16k) based on the section and its length
        if section_title in ["business", "risk", "MD&A"]:
            chain_type = "refine"

            if len(section_text) > 25000:
                model = "gpt-3.5-turbo-16k"
            else:
                model = "gpt-3.5-turbo"
        else:
            if len(section_text) < 25000:
                chain_type = "refine"
                model = "gpt-3.5-turbo"
            elif len(section_text) < 50000:
                chain_type = "map_reduce"
                model = "gpt-3.5-turbo"
            else:
                chain_type = "map_reduce"
                model = "gpt-3.5-turbo-16k"

        original_len = len(section_text)

        # get summary from openAI model
        print(f"{section_title} original_len: {original_len} use {model} w/ chain {chain_type}")
        summary, cost = summarize_section(section_text, model, chain_type, verbose)

        result[section_title] = {"summary":summary, "links": section_links}

        summary_len = len(''.join(summary))
        reduction = 100 - round(summary_len / original_len * 100, 2)

        total_cost += cost
        duration = round(time.time() - start_time, 1)

        print(f"{section_title} original_len: {original_len} summary_len: {summary_len} reduction: {reduction}% "
              f"cost: {cost}$ duration:{duration}s used {model} w/ chain {chain_type}")

    mongodb.upsert_document("items_summary", result)

    total_duration = round(time.time() - total_start_time, 1)

    print(f"\nTotal Cost: {total_cost}$, Total duration: {total_duration}s")
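
The chain and model selection inside `sections_summary` is easier to check when pulled out into a pure function. The sketch below mirrors the thresholds used above. The function name `choose_chain_and_model` is hypothetical and is not part of the original code.

```python
def choose_chain_and_model(section_title, text_length):
    """Mirror the selection rules of sections_summary; returns (chain_type, model)."""
    if section_title in ("business", "risk", "MD&A"):
        # long-form sections always use refine; only the model depends on length
        model = "gpt-3.5-turbo-16k" if text_length > 25000 else "gpt-3.5-turbo"
        return "refine", model
    if text_length < 25000:
        return "refine", "gpt-3.5-turbo"
    if text_length < 50000:
        return "map_reduce", "gpt-3.5-turbo"
    return "map_reduce", "gpt-3.5-turbo-16k"

examples = [("business", 25020), ("legal", 3000), ("other", 30000), ("property", 80000)]
for title, length in examples:
    print(title, length, choose_chain_and_model(title, length))
```

Lengths are measured in characters, not tokens, so the thresholds are only a rough proxy for the model's context limit.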

### Langchain digression
LangChain is a framework for developing applications powered by language models.
Using an LLM in isolation is fine for simple applications, but more complex applications require chaining LLMs - either with each other or with other components.

LangChain provides the **Chain** interface for such "chained" applications. It defines a Chain very generically as a sequence of calls to components, which can include other chains.

A summarization chain can be used to summarize multiple documents. One way is to input multiple smaller documents, after they have been divided into chunks, and operate over them with a MapReduceDocumentsChain. You can also choose instead for the chain that does summarization to be a StuffDocumentsChain, or a RefineDocumentsChain.

If you want to go deeper, you can read about the langchain summarization chain here: https://python.langchain.com/docs/modules/chains/popular/summarize

In brief, the code above uses the load_summarize_chain function from langchain.chains.summarize, which summarizes a text using an LLM and a chain type. There are three chain types, suited to different situations:
- **stuff**: takes the entire text and performs a single summarization request to the LLM without splitting it into chunks. This preserves the full context of the text, but cannot be used for texts that exceed the model's token capacity.
- **map_reduce**: takes n split documents and summarizes each split in parallel, then takes the resulting summaries and produces a final summary that combines them. This is useful for large documents. It is fast, but can lose context since each chunk is processed independently of the others.
- **refine**: takes n split documents and summarizes the first split, then uses that summary as input when summarizing the next split, and so on. It is a cumulative way to compute the final summary, useful for summarizing large texts while maintaining context between splits.
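
The difference between the three chain types is mostly a difference in control flow, which can be sketched without calling any model. The block below does not use LangChain. `RecordingLLM` is a hypothetical stand-in that records each prompt and returns a numbered placeholder, and the three functions only mimic the order of calls. The real chains wrap each call in prompt templates.

```python
class RecordingLLM:
    """Hypothetical stand-in for a model: records prompts, returns a numbered placeholder."""

    def __init__(self):
        self.prompts = []

    def __call__(self, prompt):
        self.prompts.append(prompt)
        return f"<summary #{len(self.prompts)}>"

def stuff(chunks, llm):
    # one single call over the whole text
    return llm(" ".join(chunks))

def map_reduce(chunks, llm):
    # map: each chunk is summarized independently (this step could run in parallel)
    partial = [llm(chunk) for chunk in chunks]
    # reduce: one final call over the partial summaries
    return llm(" ".join(partial))

def refine(chunks, llm):
    # the running summary is fed back in together with the next chunk
    summary = llm(chunks[0])
    for chunk in chunks[1:]:
        summary = llm(f"{summary} {chunk}")
    return summary

chunks = ["chunk A", "chunk B", "chunk C"]
for strategy in (stuff, map_reduce, refine):
    llm = RecordingLLM()
    strategy(chunks, llm)
    print(strategy.__name__, "->", len(llm.prompts), "calls, last prompt:", llm.prompts[-1])
```

With three chunks, stuff makes one call, map_reduce makes four (three independent calls plus the combining one), and refine makes three sequential calls, each depending on the previous result.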

### Example: summarize a section
As an example to demonstrate how this code works, let's select a section, summarize it, and print the summary returned by the model.


In [18]:
restructured_doc = restructure_parsed_10k(parsed_doc)
restructured_doc

{'business': {'text': 'ITEM 1. BUSINESS Overview As our founders Larry and Sergey wrote in the original founders\' letter, "Google is not a conventional company. We do not intend to become one." That unconventional spirit has been a driving force throughout our history, inspiring us to tackle big problems and invest in moonshots, such as our long-term opportunities in artificial intelligence (AI). We continue this work under the leadership of Alphabet and Google CEO Sundar Pichai. Alphabet is a collection of businesses -- the largest of which is Google. We report Google in two segments, Google Services and Google Cloud; we also report all non-Google businesses collectively as Other Bets. Alphabet\'s structure is about helping each of our businesses prosper through strong leaders and independence. Access and technology for everyone The Internet is one of the world\'s most powerful equalizers; it propels ideas, people and businesses large and small. Our mission to organize the world\'s i

Next we want to summarize the business section. This section contains the company description as well as other useful information for understanding the company's business.

In [19]:
section_text = restructured_doc["business"]["text"]
section_text

'ITEM 1. BUSINESS Overview As our founders Larry and Sergey wrote in the original founders\' letter, "Google is not a conventional company. We do not intend to become one." That unconventional spirit has been a driving force throughout our history, inspiring us to tackle big problems and invest in moonshots, such as our long-term opportunities in artificial intelligence (AI). We continue this work under the leadership of Alphabet and Google CEO Sundar Pichai. Alphabet is a collection of businesses -- the largest of which is Google. We report Google in two segments, Google Services and Google Cloud; we also report all non-Google businesses collectively as Other Bets. Alphabet\'s structure is about helping each of our businesses prosper through strong leaders and independence. Access and technology for everyone The Internet is one of the world\'s most powerful equalizers; it propels ideas, people and businesses large and small. Our mission to organize the world\'s information and make it

In [20]:
len(section_text)

25020

At 25,020 characters the section sits right at the 25,000-character threshold used in sections_summary; for this example we use the default **gpt-3.5-turbo** model with the **refine** chain type.
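
A rough way to see how many chunks a section of this length produces is a fixed-window chunker with overlap. This is a simplified, hypothetical stand-in, not the algorithm of `RecursiveCharacterTextSplitter`, which splits on separators such as paragraph breaks and therefore yields different chunk sizes. The defaults copy the values from `split_doc_in_chunks`.

```python
def chunk_text(text, chunk_size=20000, chunk_overlap=100):
    """Cut text into windows of at most chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # this window already reached the end of the text
            break
        start += step
    return chunks

# placeholder text with the same length as the business section
placeholder = "x" * 25020
sizes = [len(chunk) for chunk in chunk_text(placeholder)]
print(sizes)
```

Under these assumptions the section is cut into two chunks, so a refine chain would make one initial summarization call followed by one refinement call.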

In [21]:
chain_type = "refine"
model = "gpt-3.5-turbo"
verbose = True

# get summary from openAI model
print(f"business original_len: {len(section_text)} use {model} w/ chain {chain_type}")
summary, cost = summarize_section(section_text, model, chain_type, verbose)

business original_len: 25020 use gpt-3.5-turbo w/ chain refine


> Entering new  chain...


> Entering new  chain...
Prompt after formatting:
Write a concise summary of the following:


"ITEM 1. BUSINESS Overview As our founders Larry and Sergey wrote in the original founders' letter, "Google is not a conventional company. We do not intend to become one." That unconventional spirit has been a driving force throughout our history, inspiring us to tackle big problems and invest in moonshots, such as our long-term opportunities in artificial intelligence (AI). We continue this work under the leadership of Alphabet and Google CEO Sundar Pichai. Alphabet is a collection of businesses -- the largest of which is Google. We report Google in two segments, Google Services and Google Cloud; we also report all non-Google businesses collectively as Other Bets. Alphabet's structure is about helping each of our businesses prosper through strong leaders and independence. A


> Finished chain.


> Entering new chain...
Prompt after formatting:
Your job is to produce a final summary
We have provided an existing summary up to a certain point: Google, a subsidiary of Alphabet Inc., is a technology company that aims to organize the world's information and make it universally accessible and useful. They offer a range of products and services, including Google Search, YouTube, Google Assistant, and Google Cloud. Google generates revenue primarily through advertising and also invests in other areas such as hardware and cloud computing. They are committed to sustainability and have set goals to achieve net-zero emissions and run on carbon-free energy. Google values its workforce and strives to create an inclusive and supportive environment for its employees.
We have the opportunity to refine the existing summary(only if needed) with some more context below.
------------
on Form 10-K or in any other report or document we file with the 

In [22]:
print("BULLET POINTS")
for el in summary:
    print(el)
print(f"cost: {cost} in USD")

BULLET POINTS
Google, a subsidiary of Alphabet Inc., is a technology company that aims to organize the world's information and make it universally accessible and useful
They offer a range of products and services, including Google Search, YouTube, Google Assistant, and Google Cloud
Google generates revenue primarily through advertising and also invests in other areas such as hardware and cloud computing
They are committed to sustainability and have set goals to achieve net-zero emissions and run on carbon-free energy
Google values its workforce and strives to create an inclusive and supportive environment for its employees
They have work councils and statutory employee representation obligations in certain countries, and they are committed to supporting protected labor rights and maintaining an open culture
Google also communicates information about the company through multiple internal channels to their employees
They work with external partners and staffing agencies to provide specia

### Alphabet Inc. items summary
Now that we have seen how to summarize a single section, we can create the summary for all the important sections of the latest Alphabet Inc. filing.

We can do this by calling `sections_summary` with `parsed_doc`. The result is saved in the **items_summary** collection.
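The log printed by this call shows the long sections (business, risk) being sent to **gpt-3.5-turbo-16k** and the short ones to **gpt-3.5-turbo**. A minimal sketch of that kind of routing follows. The `choose_model` helper and its `long_threshold` cut-off are hypothetical and only illustrate the idea; the actual rule inside `sections_summary` is not shown in this notebook.

```python
def choose_model(text_len: int, long_threshold: int = 10_000) -> str:
    """Pick a model name from the section length in characters.

    long_threshold is a hypothetical cut-off, not a value taken from the project.
    """
    if text_len > long_threshold:
        return "gpt-3.5-turbo-16k"
    return "gpt-3.5-turbo"


# Section lengths as reported in the log of the sections_summary call.
for name, length in [("business", 25020), ("risk", 82337), ("property", 328)]:
    print(name, choose_model(length))
```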

In [23]:
sections_summary(parsed_doc)

business original_len: 25020 use gpt-3.5-turbo-16k w/ chain refine
business original_len: 25020 summary_len: 1498 reduction: 94.01% cost: 0.0213$ duration:12.2s used gpt-3.5-turbo-16k w/ chain refine
risk original_len: 82337 use gpt-3.5-turbo-16k w/ chain refine
risk original_len: 82337 summary_len: 3478 reduction: 95.78% cost: 0.0721$ duration:43.4s used gpt-3.5-turbo-16k w/ chain refine
property original_len: 328 use gpt-3.5-turbo w/ chain refine
property original_len: 328 summary_len: 213 reduction: 35.06% cost: 0.0002$ duration:1.2s used gpt-3.5-turbo w/ chain refine
legal original_len: 272 use gpt-3.5-turbo w/ chain refine
legal original_len: 272 summary_len: 178 reduction: 34.56% cost: 0.0002$ duration:2.3s used gpt-3.5-turbo w/ chain refine
other original_len: 493 use gpt-3.5-turbo w/ chain refine
other original_len: 493 summary_len: 357 reduction: 27.590000000000003% cost: 0.0004$ duration:2.3s used gpt-3.5-turbo w/ chain refine

Total Cost: 0.0942$, Total duration: 61.3s
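The `reduction` and `Total Cost` figures in this log can be checked by hand. The sketch below recomputes them from the lengths and costs printed above; `reduction_pct` is an illustrative helper written for this check, not a function from the project.

```python
def reduction_pct(original_len: int, summary_len: int) -> float:
    """Percentage by which the summary is shorter than the original text."""
    return round((1 - summary_len / original_len) * 100, 2)


# (original_len, summary_len, cost in USD) copied from the log above
sections = {
    "business": (25020, 1498, 0.0213),
    "risk": (82337, 3478, 0.0721),
    "property": (328, 213, 0.0002),
    "legal": (272, 178, 0.0002),
    "other": (493, 357, 0.0004),
}

for name, (orig_len, summ_len, _) in sections.items():
    print(f"{name}: {reduction_pct(orig_len, summ_len):.2f}%")

total_cost = sum(cost for _, _, cost in sections.values())
print(f"Total Cost: {total_cost:.4f}$")
```

Formatting the percentage with `:.2f` also avoids floating-point artifacts such as the `27.590000000000003%` printed for the `other` section.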


In [24]:
import datetime

# Get the summarized document
summary_doc = mongodb.get_document("items_summary", parsed_doc["_id"])

for k, v in summary_doc.items():
    if isinstance(v, dict):
        print(f"=== {k} ===")

        for info in v["summary"]:
            print(info)

        print()

=== business ===
Alphabet Inc., the parent company of Google, is led by CEO Sundar Pichai and focuses on providing access and technology to everyone
Google offers services such as Google Search, YouTube, and Google Assistant, and has a strong presence in the cloud computing industry with Google Cloud
Alphabet's structure allows its businesses to thrive independently
The company is committed to investing in moonshot projects and advancing AI technologies while prioritizing privacy and security for its users and customers
Alphabet also has a portfolio of Other Bets aimed at solving industry problems
They face competition in multiple areas and prioritize developing innovative products and technologies
Sustainability is a core value, with ambitious goals to transition to a carbon-free and circular economy
Alphabet values its culture and workforce, providing a supportive environment and diverse benefits and programs
They are committed to diversity, equity, and inclusion in their workforce
W