# USA Company Valuation with ChatGPT

## Goal
We want to evaluate U.S. companies for investment purposes.

The goal is to create a tool that retrieves all the inputs needed by the valuation model and summarizes the most important information an investor should be aware of.

Our valuation model is built upon the principles taught by Prof. Damodaran in his Valuation course, available for free on YouTube.
https://www.youtube.com/watch?v=LYGYvN5LUbA&list=PLUkh9m2BorqnhWfkEP2rRdhgpYKLS-NOJ

We extract data from the U.S. Securities and Exchange Commission (SEC) website. The SEC's Electronic Data Gathering, Analysis and Retrieval (EDGAR) database provides free public access to U.S. corporate information, allowing us to quickly research a company's financial and operational information.
https://www.sec.gov/edgar/search-and-access

We also gather data from Prof. Damodaran's website: https://pages.stern.nyu.edu/~adamodar/.

## <a class="anchor" id="toc">Table of Contents:</a>
This project is structured in the following way:
1. [Data Collection](#1-bullet)
2. [Qualitative Analysis](#2-bullet) (Leverage OpenAI models to make sense of reports information)
3. [Quantitative Analysis](#3-bullet) (Valuation model based on financial data)
4. [Visualization](#4-bullet) (Company valuation Visualizations in Tableau and PowerBI)
5. [Conclusions and Next steps](#5-bullet)

## <a class="anchor" id="1-bullet" href="#toc">1. Data Collection</a>

### MongoDB
We use MongoDB to store annual reports, financial data, and our processed data.

We have the following MongoDB collections:
- **cik_ticker**: contains a single document with a mapping between the CIK (Central Index Key, the id of a company on EDGAR) and the company's ticker on the exchange.
- **submissions**: contains multiple documents, one for each company, with the list of all submissions the company has made.
- **documents**: contains multiple documents, one for each SEC filing. Each document contains the raw HTML of the report.
- **financial_data**: contains multiple documents, one for each company. Each document contains the whole history of financial data of a single company.
- **parsed_documents**: contains multiple documents, one for each filing. Each document contains a parsed version of the filing, where the text is split into sections corresponding to SEC filing items.
- **items_summary**: contains multiple documents, one for each filing. Each document contains a summary of the most important sections of an SEC filing.
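
To make the shape of the **cik_ticker** document concrete, here is a minimal sketch of how it can be turned into a ticker-to-CIK lookup. It assumes the stored document keeps the `fields` and `data` keys of the SEC JSON file, with `cik`, `name`, `ticker`, and `exchange` as fields; the sample row and the helper name `build_ticker_lookup` are illustrative only and not part of the project code.

```python
# Illustrative stand-in for the single document stored in the cik_ticker collection.
# Assumption: the document mirrors the SEC JSON layout ("fields" plus "data" rows).
sample_cik_ticker = {
    "_id": "cik_ticker",
    "fields": ["cik", "name", "ticker", "exchange"],
    "data": [
        [1652044, "Alphabet Inc.", "GOOGL", "Nasdaq"],  # sample row for illustration
    ],
}


def build_ticker_lookup(doc):
    """Map each ticker to its CIK, zero-padded to 10 digits (hypothetical helper)."""
    cik_pos = doc["fields"].index("cik")
    ticker_pos = doc["fields"].index("ticker")
    return {row[ticker_pos]: f"{row[cik_pos]:010d}" for row in doc["data"]}


lookup = build_ticker_lookup(sample_cik_ticker)
print(lookup["GOOGL"])
```

The zero-padding matters because the EDGAR submissions endpoint used later expects the 10-digit form of the CIK.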

### PostgreSQL
In PostgreSQL we store data from Prof. Damodaran's website and Yahoo Finance.

Here is a brief list of the files we use from Damodaran:
- Damodaran
    - country_stats:
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/countrystats.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/ctrypremJan23.xlsx
        - https://www.stern.nyu.edu/~adamodar/pc/datasets/countrytaxrates.xls
        - https://pages.stern.nyu.edu/~adamodar/New_Home_Page/datafile/ctryprem.html
    - erp:
        - https://pages.stern.nyu.edu/~adamodar/pc/implprem/ERPbymonth.xlsx
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/histimpl.xls
    - bond_spread:
        - https://pages.stern.nyu.edu/~adamodar/New_Home_Page/datafile/ratings.html
    - industry:
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pedata.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/peEurope.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/peJapan.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/peemerg.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/peChina.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/peIndia.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/peGlobal.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/peRest.xls

        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pbvdata.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pbvEurope.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pbvJapan.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pbvemerg.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pbvChina.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pbvIndia.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pbvGlobal.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/pbvRest.xls

        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/psdata.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/psEurope.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/psJapan.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/psemerg.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/psChina.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/psIndia.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/psGlobal.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/psRest.xls

        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/vebitda.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/vebitdaEurope.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/vebitdaJapan.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/vebitdaemerg.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/vebitdaChina.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/vebitdaIndia.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/vebitdaGlobal.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/vebitdaRest.xls

        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/betas.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/betaEurope.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/betaJapan.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/betaemerg.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/betaChina.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/betaIndia.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/betaGlobal.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/betaRest.xls

        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/capex.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/capexEurope.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/capexJapan.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/capexemerg.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/capexChina.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/capexIndia.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/capexGlobal.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/capexRest.xls

        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/divfcfe.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/divfcfeEurope.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/divfcfeJapan.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/divfcfeemerg.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/divfcfeChina.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/divfcfeIndia.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/divfcfeGlobal.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/divfcfeRest.xls

        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/margin.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/marginEurope.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/marginJapan.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/marginemerg.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/marginChina.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/marginIndia.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/marginGlobal.xls
        - https://pages.stern.nyu.edu/~adamodar/pc/datasets/marginRest.xls

Since retrieving, transforming, and storing this data in PostgreSQL is outside the scope of this project, we do not describe it in this notebook.

### Project Setup
We are going to use various dependencies to collect data from EDGAR APIs (https://www.sec.gov/edgar/sec-api-documentation).
Here we are going to import everything we use in this notebook.

In [1]:
# move to root to simplify imports
%cd ..

C:\Users\matte\repo\tests\company_valuation


In [2]:
import requests
import pandas as pd
import datetime
import time
from dateutil.relativedelta import relativedelta
from pymongo.errors import DocumentTooLarge
import mongodb # a utility script containing interface methods to a MongoDB instance

Then we define various utility methods used in our project.

In [3]:
def make_edgar_request(url):
    """
    Make a request to EDGAR (Electronic Data Gathering, Analysis and Retrieval)
    :param url: request URL
    :return: response
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
        "Accept-Encoding": "gzip, deflate, br",
    }
    return requests.get(url, headers=headers)

def download_cik_ticker_map():
    """
    Get a mapping between the CIK (Central Index Key, the id of a company on EDGAR) and the ticker on the exchange.
    Upserts the mapping in the MongoDB collection cik_ticker.
    """
    CIK_TICKER_URL = "https://www.sec.gov/files/company_tickers_exchange.json"
    response = make_edgar_request(CIK_TICKER_URL)
    r = response.json()
    r["_id"] = "cik_ticker"
    mongodb.upsert_document("cik_ticker", r)
    
def get_df_cik_ticker_map():
    """
    Create a DataFrame from cik ticker document on MongoDB.
    :return: DataFrame
    """
    try:
        cik_ticker = mongodb.get_collection_documents("cik_ticker").next()
    except StopIteration:
        print("cik ticker document not found")
        return
    df = pd.DataFrame(cik_ticker["data"], columns=cik_ticker["fields"])
    
    # add leading 0s to cik (always 10 digits)
    df["cik"] = df.apply(lambda x: add_trailing_to_cik(x["cik"]), axis=1)
    
    return df

def company_from_cik(cik):
    """
    Get company info from cik
    :param cik: company id on EDGAR
    :return: DataFrame row with company information (name, ticker, exchange)
    """
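    df = get_df_cik_ticker_map()
    # get_df_cik_ticker_map returns None when the mapping has not been downloaded yet
    if df is None:
        return None
    try:
        return df[df["cik"] == cik].iloc[0]
    except IndexError:
        return None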
    
def cik_from_ticker(ticker):
    """
    Get company cik from ticker
    :param ticker: company ticker
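    :return: cik (company id on EDGAR), or -1 if the ticker cannot be resolved
    """
    df = get_df_cik_ticker_map()
    try:
        cik = df[df["ticker"] == ticker]["cik"].iloc[0]
    except (IndexError, TypeError):
        # IndexError: ticker not found; TypeError: mapping not available (df is None)
        cik = -1
    return cik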

def download_all_cik_submissions(cik):
    """
    Get list of submissions for a single company.
    Upsert this list on MongoDB (each download contains all the submissions).
    :param cik: cik of the company
    :return:
    """
    url = f"https://data.sec.gov/submissions/CIK{cik}.json"
    response = make_edgar_request(url)
    r = response.json()
    r["_id"] = cik
    mongodb.upsert_document("submissions", r)
    
def download_submissions_documents(cik, forms_to_download=("10-Q", "10-K", "8-K"), years=5):
    """
    Download all documents for submissions forms 'forms_to_download' for the past 'years'.
    Insert them on mongodb.
    :param cik: company cik
    :param forms_to_download: a tuple containing the form types to download
    :param years: the max number of years to download
    :return:
    """
    try:
        submissions = mongodb.get_document("submissions", cik)
    except StopIteration:
        print(f"submissions file not found in mongodb for {cik}")
        return
    
    cik_no_trailing = submissions["cik"]
    filings = submissions["filings"]["recent"]
    
    for i in range(len(filings["filingDate"])):
        filing_date = filings['filingDate'][i]
        difference_in_years = relativedelta(datetime.date.today(),
                                            datetime.datetime.strptime(filing_date, "%Y-%m-%d")).years
        
        # filings are listed from most recent to oldest, so once we exceed the max history we can stop
        if difference_in_years > years:
            return
        
        form_type = filings['form'][i]
        if form_type not in forms_to_download:
            continue
            
        accession_no_symbols = filings["accessionNumber"][i].replace("-","")
        primary_document = filings["primaryDocument"][i]
        url = f"https://www.sec.gov/Archives/edgar/data/{cik_no_trailing}/{accession_no_symbols}/{primary_document}"
        
        # if we already have the document, we don't download it again
        if mongodb.check_document_exists("documents", url):
            continue
        
        print(f"{filing_date} ({form_type}): {url}")
        download_document(url, cik, form_type, filing_date)
        
        # insert a quick sleep to avoid reaching edgar rate limit
        time.sleep(0.2)

def download_document(url, cik, form_type, filing_date, updated_at=None):
    """
    Download and insert submission document
    :param url: document URL
    :param cik: company cik
    :param form_type: document form type
    :param filing_date: document filing date
    :param updated_at: optional value stored in the document's 'updated_at' field (defaults to None)
    :return:
    """
    response = make_edgar_request(url)
    r = response.text
    
    doc = {"html": r, "cik": cik, "form_type": form_type, "filing_date": filing_date, "updated_at": updated_at, "_id": url}
    
    try:
        mongodb.insert_document("documents", doc)
    except DocumentTooLarge:
        # DocumentTooLarge is raised by MongoDB when uploading documents larger than 16MB.
        # To avoid this, it is better to save such files in a separate storage like S3 and retrieve them when needed.
        # Another option could be using mongofiles: https://www.mongodb.com/docs/database-tools/mongofiles/#mongodb-binary-bin.mongofiles
        # for managing large files saved in MongoDB.
        print("Document too Large (over 16MB)", url)

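def add_trailing_to_cik(cik_no_trailing):
    """
    Left-pad an integer CIK with zeros to the 10-digit format used by EDGAR (e.g. 1652044 -> '0001652044').
    """
    return "{:010d}".format(cik_no_trailing)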

These methods allow us to download data for all companies tracked by the SEC, namely those present in the *cik_ticker* collection.

With this in place, we can download the SEC filings of all these companies and save them in MongoDB, in the *submissions* and *documents* collections.
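
The URL construction used by the download helpers can be checked in isolation, without calling EDGAR. The sketch below rebuilds the two URL patterns from the functions above; the helper names are hypothetical, and the accession number is inferred from the Alphabet 10-K URL referenced in the Qualitative Analysis section.

```python
def submissions_url(cik_padded):
    """URL of the JSON file listing all submissions of a company (10-digit CIK)."""
    return f"https://data.sec.gov/submissions/CIK{cik_padded}.json"


def filing_document_url(cik_unpadded, accession_number, primary_document):
    """URL of a filing's primary document; dashes are removed from the accession number."""
    accession_no_symbols = accession_number.replace("-", "")
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{cik_unpadded}/{accession_no_symbols}/{primary_document}"
    )


print(submissions_url("0001652044"))
print(filing_document_url("1652044", "0001652044-23-000016", "goog-20221231.htm"))
```

Note that the submissions endpoint takes the zero-padded CIK, while the archive path uses the CIK without leading zeros.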

### Example: download Alphabet Inc. (Google) data
As an example, we are going to collect data for Alphabet Inc.

In [4]:
# First we download the cik_ticker map
download_cik_ticker_map()

In [5]:
# Retrieve the CIK using the Alphabet Inc. ticker.
alphabet_ticker = "GOOGL"
cik = cik_from_ticker(alphabet_ticker)
cik

'0001652044'

In [6]:
# Get list of submissions for Alphabet Inc.
download_all_cik_submissions(cik)

In [7]:
# This downloads all documents for 10-K submission forms for the past 5 years.
# Note the trailing comma: ("10-K",) is a one-element tuple, while ("10-K") is just a string.
download_submissions_documents(cik, ("10-K",), 5)
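
One Python detail matters when passing a single form type: parentheses alone do not create a tuple, so `("10-K")` is just the string `"10-K"`, and the `in` test inside `download_submissions_documents` would become a substring check. A small standalone check, independent of EDGAR (the variable names are illustrative):

```python
not_a_tuple = ("10-K")    # just a string in parentheses
one_tuple = ("10-K",)     # the trailing comma makes it a tuple

print(type(not_a_tuple).__name__)  # str
print(type(one_tuple).__name__)    # tuple

# With a string, `in` tests for substrings, so unrelated values can match.
print("10" in not_a_tuple)   # True: "10" is a substring of "10-K"
print("10" in one_tuple)     # False: no element equals "10"
print("10-K" in one_tuple)   # True
```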

## <a class="anchor" id="2-bullet" href="#toc">2. Qualitative Analysis</a>

In this step we are going to leverage OpenAI **gpt-3.5-turbo** (the model that powers ChatGPT) to create useful summaries that help us evaluate a company, starting from its SEC filings.

Qualitative analysis for financial and investing purposes involves evaluating non-numerical factors such as the company's management team, competitive positioning, brand reputation, industry trends, regulatory environment, corporate governance, and strategic initiatives. This analysis aims to gain insights into the company's qualitative strengths, weaknesses, risks, and opportunities to make informed investment decisions and assess the company's long-term potential.

In particular, we will read a specific filing, the 2022 10-K Annual report for Google (https://www.sec.gov/Archives/edgar/data/1652044/000165204423000016/goog-20221231.htm).

Let's now extract the sections and identify the critical information.

### Parse Documents
First, we parse the raw HTML collected in the "documents" collection to extract the text contained in the various sections (business, risk, MD&A, and so on).

Then we pass these sections to **gpt-3.5-turbo** and ask it to provide a summary.

We split the document into sections as a pre-processing step because **gpt-3.5-turbo** has a limit on the number of tokens (words or word fragments) it can process in a single request.

For this step we can use BeautifulSoup and some heuristics.
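
As a rough illustration of why long text has to be split before summarization, the sketch below breaks a section into chunks that stay under a word budget. The word budget is only a stand-in for the model's token limit (the model's actual tokenization is not reproduced here), and `split_into_chunks` and `max_words` are hypothetical names, not part of the project code.

```python
def split_into_chunks(text, max_words=1500):
    """Split text into consecutive chunks of at most max_words words each.

    Assumption: words are a rough proxy for tokens; an exact token count
    would need the model's tokenizer.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


section_text = "word " * 3200  # placeholder for a long filing section
chunks = split_into_chunks(section_text, max_words=1500)
print([len(chunk.split()) for chunk in chunks])
```

Splitting along the filing's own sections, as done in this notebook, has the advantage that each piece keeps a coherent topic, which a fixed-size split like this one cannot guarantee.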

### Problem:
Annual reports do not follow a standard structure. The layout can differ from one filing to the next, even for the same company.

Luckily, we identified common patterns in the HTML markup of a filing.

Most reports have a table of contents at the beginning of the document, and we can use it to make sense of the document's structure.

When there is a table of contents, we can look at its `<a>` tags (links) and retrieve the hrefs, which identify the tag elements where each section starts inside the document.

With this, we can split the document into multiple sections while maintaining a common context.

However, not all documents have a table of contents with hrefs, and some documents don't have a table of contents at all. Thus, we wrote an algorithm that also handles these cases.
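
The href-based idea can be sketched on a toy document. To keep the example self-contained it uses the standard library's `html.parser` instead of BeautifulSoup, and the HTML snippet, the class names `TocLinkCollector` and `AnchorSplitter`, and the section ids are all made up for illustration; the notebook's actual implementation follows in the next cells.

```python
from html.parser import HTMLParser


class TocLinkCollector(HTMLParser):
    """First pass: collect the fragment targets of <a href="#..."> links."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href and href.startswith("#"):
            self.targets.append(href[1:])


class AnchorSplitter(HTMLParser):
    """Second pass: start a new section at every element whose id or name is a TOC target."""

    def __init__(self, targets):
        super().__init__()
        self.targets = set(targets)
        self.current = None
        self.sections = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        anchor = attrs.get("id") or attrs.get("name")
        if anchor in self.targets:
            self.current = anchor
            self.sections[anchor] = []

    def handle_data(self, data):
        # text before the first section start (e.g. the table of contents) is ignored
        if self.current is not None and data.strip():
            self.sections[self.current].append(data.strip())


toy_html = """
<body>
  <table>
    <tr><td><a href="#s1">Business</a></td></tr>
    <tr><td><a href="#s2">Risk Factors</a></td></tr>
  </table>
  <div id="s1">Item 1. Business</div><p>We build products.</p>
  <div id="s2">Item 1A. Risk Factors</div><p>Competition is intense.</p>
</body>
"""

collector = TocLinkCollector()
collector.feed(toy_html)

splitter = AnchorSplitter(collector.targets)
splitter.feed(toy_html)

sections = {key: " ".join(parts) for key, parts in splitter.sections.items()}
print(sections)
```

This toy version only covers the easy case where every link in the table of contents resolves to an element in the document; filings without such links need the fallback logic mentioned above.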

In [8]:
import copy
from datetime import datetime
import Levenshtein
from bs4 import BeautifulSoup, NavigableString
from unidecode import unidecode
import mongodb
import string
import re

In [9]:
# This is a list of strings that we use to look for the table of contents in a 10-K filing
list_10k_items = [
    "business",
    "risk factors",
    "unresolved staff comments",
    "properties",
    "legal proceedings",
    "mine safety disclosures",
    "market for registrant’s common equity, related stockholder matters and issuer purchases of equity securities",
    "reserved",
    "management’s discussion and analysis of financial condition and results of operations",
    "quantitative and qualitative disclosures about market risk",
    "financial statements and supplementary data",
    "changes in and disagreements with accountants on accounting and financial disclosure",
    "controls and procedures",
    "other information",
    "disclosure regarding foreign jurisdictions that prevent inspection",
    "directors, executive officers, and corporate governance",
    "executive compensation",
    "security ownership of certain beneficial owners and management and related stockholder matters",
    "certain relationships and related transactions, and director independence",
    "principal accountant fees and services",
    "exhibits and financial statement schedules",
]

# This is a dictionary of default sections that a 10-K filing, annual report, could contain
default_10k_sections = {
     1: {'item': 'item 1', 'title': ['business']},
     2: {'item': 'item 1a', 'title': ['risk factor']},
     3: {'item': 'item 1b', 'title': ['unresolved staff']},
     4: {'item': 'item 2', 'title': ['propert']},
     5: {'item': 'item 3', 'title': ['legal proceeding']},
     6: {'item': 'item 4', 'title': ['mine safety disclosure', 'submission of matters to a vote of security holders']},
     7: {'item': 'item 5', 'title': ["market for registrant's common equity, related stockholder matters and issuer purchases of equity securities"]},
     8: {'item': 'item 6', 'title': ['reserved', 'selected financial data']},
     9: {'item': 'item 7', 'title': ["management's discussion and analysis of financial condition and results of operations"]},
     10: {'item': 'item 7a', 'title': ['quantitative and qualitative disclosures about market risk']},
     11: {'item': 'item 8', 'title': ['financial statements and supplementary data']},
     12: {'item': 'item 9', 'title': ['changes in and disagreements with accountants on accounting and financial disclosure']},
     13: {'item': 'item 9a', 'title': ['controls and procedures']},
     14: {'item': 'item 9b', 'title': ['other information']},
     15: {'item': 'item 9c', 'title': ['disclosure regarding foreign jurisdictions that prevent inspections']},
     16: {'item': 'item 10', 'title': ['directors, executive officers and corporate governance','directors and executive officers of the registrant']},
     17: {'item': 'item 11', 'title': ['executive compensation']},
     18: {'item': 'item 12', 'title': ['security ownership of certain beneficial owners and management and related stockholder matters']},
     19: {'item': 'item 13', 'title': ['certain relationships and related transactions']},
     20: {'item': 'item 14', 'title': ['principal accountant fees and services']},
     21: {'item': 'item 15', 'title': ['exhibits, financial statement schedules', 'exhibits and financial statement schedules']},
}

# This is a list of strings that we use to look for the table of contents in a 10-Q filing
list_10q_items = [
    "financial statement",
    "risk factor",
    "legal proceeding",
    "mine safety disclosure",
    "management’s discussion and analysis of financial condition and results of operations",
    "quantitative and qualitative disclosures about market risk",
    "controls and procedures",
    "other information",
    "unregistered sales of equity securities and use of proceeds",
    "defaults upon senior securities",
    "exhibits"
]

# This is a dictionary of default sections that a 10-Q filing, quarterly report, could contain
default_10q_sections = {
    1: {'item': 'item 1', 'title': ['financial statement']},
    2: {'item': 'item 2', 'title': ["management's discussion and analysis of financial condition and results of operations"]},
    3: {'item': 'item 3', 'title': ['quantitative and qualitative disclosures about market risk']},
    4: {'item': 'item 4', 'title': ['controls and procedures']},
    5: {'item': 'item 1', 'title': ['legal proceeding']},
    6: {'item': 'item 1a', 'title': ['risk factor']},
    7: {'item': 'item 2', 'title': ["unregistered sales of equity securities and use of proceeds"]},
    8: {'item': 'item 3', 'title': ["defaults upon senior securities"]},
    9: {'item': 'item 4', 'title': ["mine safety disclosure"]},
    10: {'item': 'item 5', 'title': ["other information"]},
    11: {'item': 'item 6', 'title': ["exhibits"]},
}

# This is a dictionary of default sections that a 8-K filing, current report, could contain
default_8k_sections = {
    1: {'item': 'item 1.01', 'title': ["entry into a material definitive agreement"]},
    2: {'item': 'item 1.02', 'title': ["termination of a material definitive agreement"]},
    3: {'item': 'item 1.03', 'title': ["bankruptcy or receivership"]},
    4: {'item': 'item 1.04', 'title': ["mine safety"]},
    5: {'item': 'item 2.01', 'title': ["completion of acquisition or disposition of asset"]},
    6: {'item': 'item 2.02', 'title': ['results of operations and financial condition']},
    7: {'item': 'item 2.03', 'title': ["creation of a direct financial obligation"]},
    8: {'item': 'item 2.04', 'title': ["triggering events that accelerate or increase a direct financial obligation"]},
    9: {'item': 'item 2.05', 'title': ["costs associated with exit or disposal activities"]},
    10: {'item': 'item 2.06', 'title': ["material impairments"]},
    11: {'item': 'item 3.01', 'title': ["notice of delisting or failure to satisfy a continued listing"]},
    12: {'item': 'item 3.02', 'title': ["unregistered sales of equity securities"]},
    13: {'item': 'item 3.03', 'title': ["material modification to rights of security holders"]},
    14: {'item': 'item 4.01', 'title': ["changes in registrant's certifying accountant"]},
    15: {'item': 'item 4.02', 'title': ["non-reliance on previously issued financial statements"]},
    16: {'item': 'item 5.01', 'title': ["changes in control of registrant"]},
    17: {'item': 'item 5.02', 'title': ['departure of directors or certain officers']},
    18: {'item': 'item 5.03', 'title': ['amendments to articles of incorporation or bylaws']},
    19: {'item': 'item 5.04', 'title': ["temporary suspension of trading under registrant"]},
    20: {'item': 'item 5.05', 'title': ["amendment to registrant's code of ethics"]},
    21: {'item': 'item 5.06', 'title': ["change in shell company status"]},
    22: {'item': 'item 5.07', 'title': ['submission of matters to a vote of security holders']},
    23: {'item': 'item 5.08', 'title': ["shareholder director nominations"]},
    24: {'item': 'item 6.01', 'title': ["abs informational and computational material"]},
    25: {'item': 'item 6.02', 'title': ['change of servicer or trustee']},
    26: {'item': 'item 6.03', 'title': ['change in credit enhancement or other external support']},
    27: {'item': 'item 6.04', 'title': ["failure to make a required distribution"]},
    28: {'item': 'item 6.05', 'title': ["securities act updating disclosure"]},
    29: {'item': 'item 7.01', 'title': ["regulation fd disclosure"]},
    30: {'item': 'item 8.01', 'title': ['other events']},
    31: {'item': 'item 9.01', 'title': ["financial statements and exhibits"]},
}

def identify_table_of_contents(soup, list_items):
    """
    Given a soup object and a list of items, this method looks for a table of contents.
    :param soup: soup object of the document
    :param list_items: an array of strings related to sections titles.
    :return: the table of contents PageElement object or None if not found.
    """
    if list_items is None:
        return None
    max_table = 0
    chosen_table = None
    tables = soup.body.find_all("table")
    
    # for each table in the document
    for t in tables:
        
        # count how many elements of list_items are present in the table
        count = 0
        for s in list_items:
            # escape the item so it is always matched literally, not as a regex pattern
            r = t.find(string=re.compile(re.escape(s), re.IGNORECASE))
            if r is not None:
                count += 1

        # choose the table that has the maximum number of elements
        if count > max_table:
            chosen_table = t
            max_table = count
                   
    # we return the chosen table only if it matches more than 3 items
    if max_table > 3:
        return chosen_table
    
    return None

def get_sections_text_with_hrefs(soup, sections):
    """
    Walk the document body and collect the text that belongs to each section.
    :param soup: a soup object
    :param sections: a dictionary of sections, each with a 'start_el' element
    :return: the same sections dictionary, with a 'text' entry for each section found
    """
    next_section = 1
    current_section = None
    text = ""
    last_was_new_line = False
    
    # for each element in body
    for el in soup.body.descendants:
        
        # if we find the start element of a section
        if next_section in sections and el == sections[next_section]['start_el']:
            
            # store the accumulated text in the section we are leaving and reset text to an empty string
            if current_section is not None:
                sections[current_section]["text"] = text
                text = ""
                last_was_new_line = False

            # change section
            current_section = next_section
            next_section += 1

        # if we are currently in a section
        if current_section is not None and isinstance(el, NavigableString):
            
            if last_was_new_line and el.text == "\n":
                continue
            elif el.text == "\n":
                last_was_new_line = True
            else:
                last_was_new_line = False
            found_text = unidecode(el.get_text(separator=" "))
            
            # append to text
            if len(text) > 0 and text[-1] != " " and len(found_text) > 0 and found_text[0] != " ":
                text += "\n"
            text += found_text.replace('\n', ' ')

    # we reached the end of the document: store the accumulated text in the last section
    if current_section is not None:
        sections[current_section]["text"] = text

    return sections

def clean_section_title(title):
    """
    Clean the title string by removing special words and punctuation that make it harder to recognize.
    :param title: a string
    :return: a cleaned string, lowercase
    """
    
    # lower case
    title = title.lower()
    
    # remove special html characters
    title = unidecode(title)
    
    # remove item
    title = title.replace("item ", "")
    
    # remove '1.' etc
    for idx in range(20, 0, -1):
        for let in ['', 'a', 'b', 'c']:
            title = title.replace(f"{idx}{let}.", "")
    for idx in range(10, 0, -1):
        title = title.replace(f"f-{idx}", "")
    
    # remove parentheses (and their content) and strip punctuation and whitespace
    title = re.sub(r'\([^)]*\)', '', title).strip(string.punctuation + string.whitespace)
    
    return title

def get_sections_using_hrefs(soup, table_of_contents):
    """
    Scan the table_of_contents and identify all hrefs, if present.
    The method creates a dictionary of sections by finding the tag elements inside soup that the hrefs reference.
    :param soup: soup object of the document.
    :param table_of_contents: the table of contents element (e.g. as returned by identify_table_of_contents)
    :return: a dictionary with the following structure:
        {1:
            {
                'start_el': tag element where the section starts,
                'idx': an integer index of the start element inside soup, used for ordering,
                'title': a string representing the section title,
                'title_candidates': a list of title candidates. If there is a single candidate, it becomes the title;
                                    otherwise the candidates are joined with '+++',
                'end_el': tag element where the section ends (absent for the last section),
                'text': the text of the section
            },
        ...
        }
        Sections are ordered by their 'idx' value.
    """
    
    # get all html elements
    all_elements = soup.find_all()
    hrefs = {}
    sections = {}
    
    # for each row in table of contents
    for tr in table_of_contents.find_all("tr"):
        
        # get all <a> tags and their links (dropping the leading '#')
        try:
            aa = tr.find_all("a")
            tr_hrefs = [a['href'][1:] for a in aa]
            
        except KeyError:
            # an <a> tag without an href attribute: skip this row
            continue

        # for each element in the table row
        for el in tr.children:
            
            text = el.text
            text = clean_section_title(text)
            
            # check if there is a title
            if is_title_valid(text):
                
                
                for tr_href in tr_hrefs:
                    if tr_href not in hrefs:
                        
                        # find the element in the document that the link points to (by id, then by name)
                        h_tag = soup.find(id=tr_href)
                        if h_tag is None:
                            h_tag = soup.find(attrs={"name": tr_href})
                            
                        # if we find one, we store the information in our hrefs dictionary
                        if h_tag:
                            hrefs[tr_href] = {
                                'start_el': h_tag,
                                'idx': all_elements.index(h_tag),
                                'title': None,
                                'title_candidates': set([text])}
                    else:
                        hrefs[tr_href]['title_candidates'].add(text)
            else:
                continue

    # for each element in our hrefs dictionary (title information)
    for h in hrefs:
        hrefs[h]['title_candidates'] = list(hrefs[h]['title_candidates'])
        if len(hrefs[h]['title_candidates']) == 1:
            hrefs[h]['title'] = hrefs[h]['title_candidates'][0]
        else:
            hrefs[h]['title'] = "+++".join(hrefs[h]['title_candidates'])

    # let's sort the titles based on where we found the corresponding element in the document.
    # It can happen (seldom) that an element that comes before in the table of contents, 
    # actually comes after in the document
    temp_s = sorted(hrefs.items(), key=lambda x: x[1]["idx"])
    for i, s in enumerate(temp_s):
        sections[i + 1] = s[1]
        if i > 0:
            sections[i]["end_el"] = sections[i + 1]["start_el"]

    # retrieve sections text
    sections = get_sections_text_with_hrefs(soup, sections)
    return sections

def string_similarity_percentage(string1, string2):
    """
    Compute the Levenshtein distance between the two strings (ignoring spaces) and return the percentage similarity.
    :param string1: 
    :param string2: 
    :return: a float representing the percentage of similarity
    """
    distance = Levenshtein.distance(string1.replace(" ", ""), string2.replace(" ", ""))
    max_length = max(len(string1), len(string2))
    similarity_percentage = (1 - (distance / max_length)) * 100
    return similarity_percentage

def is_title_valid(text):
    """
    Check if the title is valid, meaning
    it does not start with keywords like item, part, signature, page, is not a digit string, and has more than 2 chars.
    :param text: a string representing the title
    :return: True if all conditions are satisfied else False
    """
    valid = not (
            text.startswith("item") or
            text.startswith("part") or
            text.startswith("signature") or
            text.startswith("page") or
            text.isdigit() or
            len(text) <= 2)
    return valid

def select_best_match(string_to_match, matches, start_index):
    """
    Identify the best match, in terms of string similarity, between string_to_match and a list of matches.
    start_index is used to avoid cases where the string_to_match is matched with the first occurrence in matches.
    :param string_to_match: a string
    :param matches: a list of regular expression matches
    :param start_index: an integer representing the index to start from
    :return: the regular expression match with the highest similarity
    """
    match = None

    if start_index == 0:
        del matches[0]

    # if there is only one possibility
    if len(matches) == 1:
        match = matches[0]
        if matches[0].start() > start_index:
            match = matches[0]
            
    # else search for the most similar option
    elif len(matches) > 1:
        max_similarity = -1
        for i, m in enumerate(matches):
            if m.start() > start_index:
                sim = string_similarity_percentage(string_to_match, m.group().lower().replace("\n", " "))
                if sim > max_similarity:
                    max_similarity = sim
                    match = m
    return match

def get_sections_using_strings(soup, table_of_contents, default_sections):
    """
        Scan the table_of_contents and identify possible section text using strings that match default_sections.
        Retrieve sections strings in soup.body.text.
        :param soup: the soup object
        :param table_of_contents: a PageElement from soup that represent the table of contents
        :param default_sections: a dictionary that contains prefilled data about default sections that could be found in the document
        :return: a dictionary with the following structure, representing the sections:
            {1:
                {
                    'start_index': the start index of the section inside soup.body.text,
                    'end_index': the end index of the section inside soup.body.text,
                    'title': a string representing the section title,
                    'text': the section text taken from soup.body.text
                },
            ...
            }
            Sections are ordered by their 'start_index' value
        """

    # Clean soup.body.text removing consecutive \n and spaces
    body_text = unidecode(soup.body.get_text(separator=" "))
    body_text = re.sub('\n', ' ', body_text)
    body_text = re.sub(' +', ' ', body_text)

    # If there is a table_of_contents, look for item strings and check their validity
    sections = {}
    if table_of_contents:
        num_section = 1
        for tr in table_of_contents.findAll("tr"):
            section = {}
            for el in tr.children:
                text = el.text
                
                # remove special html characters
                item = unidecode(text.lower()).replace("\n", " ").strip(string.punctuation + string.whitespace)

                if 'item' in item:
                    section["item"] = item

                text = clean_section_title(text)
                if 'item' in section and is_title_valid(text):
                    section['title'] = text
                    sections[num_section] = section
                    num_section += 1
    
    # Different behaviour if there is a table_of_contents and sections is already populated.
    if len(sections) == 0:
        # no usable table_of_contents sections, we use a prefilled default_sections dictionary
        sections = copy.deepcopy(default_sections)
        start_index = 1
    else:
        # skip the first occurrence in the text since it is also present in table_of_contents
        start_index = 0
    
    # Loop through all sections to identify a possible item and title for a section.
    # If multiple values are found we select best match based on string similarity.
    for si in sections:
        s = sections[si]
        if 'item' in s:
            match = None
            if isinstance(s['title'], list):
                for t in s['title']:
                    matches = list(re.finditer(fr"{s['item']}. *{t}", body_text, re.IGNORECASE + re.DOTALL))
                    if matches:
                        match = select_best_match(f"{s['item']} {t}", matches, start_index)
                        break
            else:
                matches = list(re.finditer(fr"{s['item']}. *{s['title']}", body_text, re.IGNORECASE + re.DOTALL))
                if matches:
                    match = select_best_match(f"{s['item']} {s['title']}", matches, start_index)

            if match is None:
                matches = list(re.finditer(fr"{s['item']}", body_text, re.IGNORECASE + re.DOTALL))
                if matches:
                    match = select_best_match(f"{s['item']}", matches, start_index)

            if match:
                s['title'] = match.group()
                s["start_index"] = match.start()
                start_index = match.start()
            else:
                s['remove'] = True

    sections_temp = {}
    for si in sections:
        if "remove" not in sections[si]:
            sections_temp[si] = sections[si]

    # Eventually we populate each section in the dictionary with its text taken from body_text
    temp_s = sorted(sections_temp.items(), key=lambda x: x[1]["start_index"])
    sections = {}
    last_section = 0
    for i, s in enumerate(temp_s):
        sections[i + 1] = s[1]
        if i > 0:
            sections[i]["end_index"] = sections[i + 1]["start_index"]
            sections[i]["text"] = body_text[sections[i]["start_index"]:sections[i]["end_index"]]
        last_section = i + 1
    if last_section > 0:
        sections[last_section]["end_index"] = -1
        sections[last_section]["text"] = body_text[sections[last_section]["start_index"]:sections[last_section]["end_index"]]

    return sections
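
To make the matching step above more concrete, here is a minimal, self-contained sketch of how a similarity percentage can pick the heading closest to a table-of-contents title. It is only an illustration under some assumptions: edit_distance is a hypothetical pure-Python stand-in for Levenshtein.distance (so the sketch runs without extra packages), best_candidate and the sample headings are made up for the example, and the similarity is normalized by the length of the space-stripped strings, which may differ slightly from string_similarity_percentage above.

```python
def edit_distance(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance (stand-in for Levenshtein.distance)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete a character from a
                current[j - 1] + 1,            # insert a character into a
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity_percentage(s1: str, s2: str) -> float:
    """Similarity in [0, 100], ignoring spaces in both strings."""
    s1, s2 = s1.replace(" ", ""), s2.replace(" ", "")
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 100.0
    return (1 - edit_distance(s1, s2) / longest) * 100


def best_candidate(target: str, candidates: list) -> str:
    """Return the candidate most similar to target (case-insensitive)."""
    return max(candidates, key=lambda c: similarity_percentage(target.lower(), c.lower()))


# Hypothetical headings as they could appear in the body of a 10-K
headings = ["Item 1A. Risk Factors", "Item 1. Business", "Item 1B. Unresolved Staff Comments"]
print(best_candidate("item 1 business", headings))
```

The same idea is what select_best_match applies to regular expression matches: when several occurrences of a heading are found, the one whose text is closest to the expected "item + title" string wins.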


We have defined a lot of methods that will be used to parse a document.
Below we define the parse_document method, which takes a document and strips the html tags to obtain plain text. It also splits the document into distinct sections.

In [10]:
def parse_document(doc):
    """
    Take a document, SEC filing, parse the content and retrieve the sections.
    Save the result in MongoDB under parsed_documents collection.
    :param doc: document from "documents" collection of mongoDB
    :return:
    """

    url = doc["_id"]
    form_type = doc["form_type"]
    filing_date = doc["filing_date"]
    sections = {}
    cik = doc["cik"]
    html = doc["html"]

    # Supported form type are 10-K, 10-K/A, 10-Q, 10-Q/A, 8-K
    if form_type in ["10-K", "10-K/A"]:
        include_forms = ["10-K", "10-K/A"]
        list_items = list_10k_items
        default_sections = default_10k_sections
    elif form_type in ["10-Q", "10-Q/A"]:
        include_forms = ["10-Q", "10-Q/A"]
        list_items = list_10q_items
        default_sections = default_10q_sections
    elif form_type == "8-K":
        include_forms = ["8-K"]
        list_items = None
        default_sections = default_8k_sections
    else:
        print(f"return because form_type {form_type} is not valid")
        return

    if form_type not in include_forms:
        print(f"return because form_type != {form_type}")
        return

    company_info = company_from_cik(cik)

    # no cik in cik_map
    if company_info is None:
        print("return because company info None")
        return

    print(f"form type: \t\t{form_type}")
    print(company_info)

    soup = BeautifulSoup(html, features="html.parser")

    if soup.body is None:
        print("return because soup.body None")
        return

    table_of_contents = identify_table_of_contents(soup, list_items)

    if table_of_contents:
        sections = get_sections_using_hrefs(soup, table_of_contents)

    if len(sections) == 0:
        sections = get_sections_using_strings(soup, table_of_contents, default_sections)

    result = {"_id": url, "cik": cik, "form_type":form_type, "filing_date": filing_date, "sections":{}}

    for s in sections:
        section = sections[s]
        if 'text' in section:
            text = section['text']
            text = re.sub('\n', ' ', text)
            text = re.sub(' +', ' ', text)

            result["sections"][section["title"]] = {"text":text, "link":section["link"] if "link" in section else None}

    try:
        mongodb.upsert_document("parsed_documents", result)
    except:
        traceback.print_exc()
        print(result.keys())
        print(result["sections"].keys())

Let's parse our Google 10-K document!

In [11]:
filing_url = 'https://www.sec.gov/Archives/edgar/data/1652044/000165204423000016/goog-20221231.htm'

doc = mongodb.get_collection("documents").find({"_id":filing_url}).next()
parse_document(doc)

form type: 		10-K
cik            0001652044
name        Alphabet Inc.
ticker              GOOGL
exchange           Nasdaq
Name: 2, dtype: object


In [12]:
parsed_doc = mongodb.get_collection("parsed_documents").find({"_id":filing_url}).next()

### Summarize the parsed document
Now that we have split the document into shorter sections, we can apply a summarization algorithm to extract valuable insights from the document.

To do so we are going to leverage the OpenAI API through langchain.

In [13]:
from typing import Any, List
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders.unstructured import UnstructuredBaseLoader
from langchain.callbacks import get_openai_callback
from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI
from configparser import ConfigParser
import os

parser = ConfigParser()
_ = parser.read(os.path.join("credentials.cfg"))

class UnstructuredStringLoader(UnstructuredBaseLoader):
    """
    Uses unstructured to load a string
    Source of the string, for metadata purposes, can be passed in by the caller
    """

    def __init__(
        self, content: str, source: str = None, mode: str = "single",
        **unstructured_kwargs: Any
    ):
        self.content = content
        self.source = source
        super().__init__(mode=mode, **unstructured_kwargs)

    def _get_elements(self) -> List:
        from unstructured.partition.text import partition_text

        return partition_text(text=self.content, **self.unstructured_kwargs)

    def _get_metadata(self) -> dict:
        return {"source": self.source} if self.source else {}


def split_doc_in_chunks(doc, chunk_size=20000):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=100)
    chunks = text_splitter.split_documents(doc)
    return chunks

def compute_cost(tokens, model="gpt-3.5-turbo"):
    """
    Compute API cost from number of tokens
    :param tokens: the number of token
    :param model: the model name
    :return: cost in USD
    """
    if model == "gpt-3.5-turbo":
        return round(tokens / 1000 * 0.002, 4)
    if model == "gpt-3.5-turbo-16k":
        return round(tokens / 1000 * 0.004, 4)
    
def create_summary(section_text, model, chain_type="map_reduce", verbose=False):
    """
    Call OpenAI model with langchain library using ChatOpenAI.
    then call langchain.load_summarize_chain with the selected model and the chain_type
    :param section_text: text to be summarized
    :param model: language model
    :param chain_type: chain type for langchain.load_summarize_chain
    :param verbose: print langchain process
    :return: the model response, and the number of total tokens it took.
    """
    # load langchain language model
    llm = ChatOpenAI(model_name=model, openai_api_key=parser.get("open_ai", "api_key"))

    # prepare section_text string with a custom string loader to be ready for load_summarize_chain
    string_loader = UnstructuredStringLoader(section_text)

    # split the string in multiple chunks
    docs = split_doc_in_chunks(string_loader.load())

    # call model with the chain_type specified
    chain = load_summarize_chain(llm, chain_type=chain_type, verbose=verbose)

    # retrieve model response
    with get_openai_callback() as cb:
        res = chain.run(docs)

    return res, cb.total_tokens

def summarize_section(section_text, model="gpt-3.5-turbo", chain_type="map_reduce", verbose=False):
    """
    Create a summary for a document section.
    Output is a list of strings ["info1", "info2", ..., "infoN"] plus the cost of the request.
    :param section_text: text input to be summarized
    :param model:the OpenAI model to use, default is gpt-3.5-turbo
    :param chain_type: the type of chain to use for summarization, default is "map_reduce",
     possible other values are "stuff" and "refine"
    :param verbose: passed to langchain to print details about the chain process
    :return: bullet points of the summary as an array of strings and the cost of the request
    """
    # call model to create the summary
    summary, tokens = create_summary(section_text, model, chain_type, verbose)

    # split summary in bullet points using "." as separator
    bullets = [x.strip() for x in re.split(r'(?<!inc)(?<!Inc)\. ', summary)]

    # compute cost based on tokens of the response and the used model
    cost = compute_cost(tokens, model=model)

    return bullets, cost
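
The bullet splitting in summarize_section relies on a regular expression with two negative lookbehinds. The short sketch below shows the same pattern on a made-up summary string (the sample text is an assumption, not model output): the summary is split on ". " unless the period directly follows "inc" or "Inc".

```python
import re

# Same pattern as in summarize_section: split on ". " unless preceded by "inc" or "Inc"
BULLET_SEPARATOR = re.compile(r'(?<!inc)(?<!Inc)\. ')

# Made-up summary text, used only to illustrate the splitting behaviour
summary = ("Alphabet Inc. is the parent company of Google. "
           "Google reports two segments. "
           "Other Bets are reported separately.")

bullets = [part.strip() for part in BULLET_SEPARATOR.split(summary)]

for bullet in bullets:
    print(bullet)
```

Note that only the "Inc." abbreviation is protected: other abbreviations followed by a space (for example "U.S. ") would still be split, and the final bullet keeps its trailing period because it is not followed by a space.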

Next we need to prepare the input by identifying the sections of a filing and selecting the most important ones.

In [14]:
def restructure_parsed_10k(doc):
    """
    Look for and select only the sections specified in result dictionary.
    :param doc: mongo document from "parsed_documents" collection
    :return: a dictionary containing the parsed document sections titles and their text.
    """
    result = {
        "business": {"text":"", "links":[]},
        "risk": {"text":"", "links":[]},
        "unresolved": {"text":"", "links":[]},
        "property": {"text":"", "links":[]},
        "legal": {"text":"", "links":[]},
        "foreign": {"text":"", "links":[]},
        "other": {"text":"", "links":[]},
        
        # we are not going to summarize MD&A and financial notes sections of the document, while both extremely important,
        # because we didn't manage to obtain useful results from OpenAI models, without further pre-processing.
        
        # "MD&A": {"text":"", "links":[]},
        # "notes": {"text":"", "links":[]},
        
    }

    for s in doc["sections"]:

        found = None
        if ("business" in s.lower() or "overview" in s.lower() or "company" in s.lower() or "general" in s.lower() or "outlook" in s.lower())\
                and not "combination" in s.lower():
            found = "business"
        elif "propert" in s.lower() and not "plant" in s.lower() and not "business" in s.lower():
            found = "property"
        elif "foreign" in s.lower() and "jurisdiction" in s.lower():
            found = "foreign"
        elif "legal" in s.lower() and "proceeding" in s.lower():
            found = "legal"
        elif "information" in s.lower() and "other" in s.lower():
            found = "other"
        elif "unresolved" in s.lower():
            found = "unresolved"
        elif "risk" in s.lower():
            found = "risk"
        
        # we are not going to summarize MD&A and financial notes sections of the document, while both extremely important,
        # because we didn't manage to obtain useful results from OpenAI models, without further pre-processing.
        
        # elif "management" in s.lower() and "discussion" in s.lower():
        #     found = "MD&A"
        # elif "supplementa" in s.lower() or ("note" in s.lower() and "statement" not in s.lower()):
        #     found = "notes"

        if found is not None:
            result[found]["text"] += doc["sections"][s]["text"]
            result[found]["links"].append({
                "title": s,
                "link": doc["sections"][s]["link"] if "link" in doc["sections"][s] else None
            })

    return result

def restructure_parsed_10q(doc):
    result = {
        "risk": {"text":"", "links":[]},
        "MD&A": {"text":"", "links":[]},
        "legal": {"text":"", "links":[]},
        "other": {"text":"", "links":[]},
        "equity": {"text":"", "links":[]},
        "defaults": {"text":"", "links":[]},
    }

    for s in doc["sections"]:

        found = None
        if "legal" in s.lower() and "proceeding" in s.lower():
            found = "legal"
        elif "management" in s.lower() and "discussion" in s.lower():
            found = "MD&A"
        elif "information" in s.lower() and "other" in s.lower():
            found = "other"
        elif "risk" in s.lower():
            found = "risk"
        elif "sales" in s.lower() and "equity" in s.lower():
            found = "equity"
        elif "default" in s.lower():
            found = "defaults"

        if found is not None:
            result[found]["text"] += doc["sections"][s]["text"]
            result[found]["links"].append({
                "title": s,
                "link": doc["sections"][s]["link"] if "link" in doc["sections"][s] else None
            })

    return result

def restructure_parsed_8k(doc):

    result = {}

    for s in doc["sections"]:
        if "financial statements and exhibits" in s.lower():
            continue
        result[s] = doc["sections"][s]

    return result

def sections_summary(doc, verbose=False):
    """
    Summarize all sections of a document using openAI API.
    Upsert summary on MongoDB (overwrite previous one, in case we make changes to openai_interface)

    This method is configured to use gpt-3.5-turbo. At the moment this model has two different versions,
    one with a 4k-token context and one with a 16k-token context. Which one we use depends on the length of a section.

    :param doc: a parsed_document from MongoDB
    :param verbose: passed to langchain verbose
    :return:
    """

    company = company_from_cik(doc["cik"])
    result = {"_id": doc["_id"],
              "name": company["name"],
              "ticker": company["ticker"],
              "form_type": doc["form_type"],
              "filing_date": doc["filing_date"]}

    # keep track of duration and costs
    total_cost = 0
    total_start_time = time.time()

    if "10-K" in doc["form_type"]:
        new_doc = restructure_parsed_10k(doc)
    elif "10-Q" in doc["form_type"]:
        new_doc = restructure_parsed_10q(doc)
    elif doc["form_type"] == "8-K":
        new_doc = restructure_parsed_8k(doc)
    else:
        print(f"form_type {doc['form_type']} is not yet implemented")
        return

    # for each section
    for section_title, section in new_doc.items():

        section_links = section["links"] if "links" in section else None
        section_text = section["text"]

        start_time = time.time()
        
        # if the section text is too small we skip it, it's probably not material
        if len(section_text) < 250:
            continue

        # select chain_type and model (4k or 16k) based on the section and its length
        if section_title in ["business", "risk", "MD&A"]:
            chain_type = "refine"

            if len(section_text) > 25000:
                model = "gpt-3.5-turbo-16k"
            else:
                model = "gpt-3.5-turbo"
        else:
            if len(section_text) < 25000:
                chain_type = "refine"
                model = "gpt-3.5-turbo"
            elif len(section_text) < 50000:
                chain_type = "map_reduce"
                model = "gpt-3.5-turbo"
            else:
                chain_type = "map_reduce"
                model = "gpt-3.5-turbo-16k"

        original_len = len(section_text)

        # get summary from openAI model
        print(f"{section_title} original_len: {original_len} use {model} w/ chain {chain_type}")
        summary, cost = summarize_section(section_text, model, chain_type, verbose)

        result[section_title] = {"summary":summary, "links": section_links}

        summary_len = len(''.join(summary))
        reduction = 100 - round(summary_len / original_len * 100, 2)

        total_cost += cost
        duration = round(time.time() - start_time, 1)

        print(f"{section_title} original_len: {original_len} summary_len: {summary_len} reduction: {reduction}% "
              f"cost: {cost}$ duration:{duration}s used {model} w/ chain {chain_type}")

    mongodb.upsert_document("items_summary", result)

    total_duration = round(time.time() - total_start_time, 1)

    print(f"\nTotal Cost: {total_cost}$, Total duration: {total_duration}s")
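
The routing rules inside sections_summary can be easier to read when isolated in a small pure function. The sketch below mirrors the thresholds used above (lengths are in characters, not tokens); select_chain_and_model is a hypothetical helper name that does not exist in the project, and the sample lengths are only illustrative.

```python
def select_chain_and_model(section_title: str, text_length: int) -> tuple:
    """Mirror the routing rules of sections_summary (text_length is in characters)."""
    if section_title in ["business", "risk", "MD&A"]:
        # key sections always use "refine" to keep context between chunks
        chain_type = "refine"
        model = "gpt-3.5-turbo-16k" if text_length > 25000 else "gpt-3.5-turbo"
    elif text_length < 25000:
        chain_type, model = "refine", "gpt-3.5-turbo"
    elif text_length < 50000:
        chain_type, model = "map_reduce", "gpt-3.5-turbo"
    else:
        chain_type, model = "map_reduce", "gpt-3.5-turbo-16k"
    return chain_type, model


for title, length in [("business", 25020), ("risk", 82337), ("property", 328), ("other", 60000)]:
    print(title, length, select_chain_and_model(title, length))
```

Keeping the decision in one place like this would also make the thresholds easy to unit test when new models or context sizes become available.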

### Langchain digression
LangChain is a framework for developing applications powered by language models.
Using an LLM in isolation is fine for simple applications, but more complex applications require chaining LLMs - either with each other or with other components.

LangChain provides the **Chain** interface for such "chained" applications. They define a Chain very generically as a sequence of calls to components, which can include other chains.

A summarization chain can be used to summarize multiple documents. One way is to input multiple smaller documents, after they have been divided into chunks, and operate over them with a MapReduceDocumentsChain. You can also choose instead for the chain that does summarization to be a StuffDocumentsChain, or a RefineDocumentsChain.

If you want to go deeper in this discussion feel free to read about langchain summarization chain here: https://python.langchain.com/docs/modules/chains/popular/summarize

In brief, in the code above we use the langchain.load_summarize_chain method. This method allows us to summarize a text using an LLM and a chain type. There are three chain types that can be used in different situations:
- **stuff**: takes the entire text and performs a single summarization request to the LLM without splitting it into chunks. This is useful for maintaining the context of the text, but it cannot be used for text that exceeds the model token capacity.
- **map_reduce**: takes n split documents and summarizes each split in parallel, then takes the resulting summaries and performs a final summary that combines them all together. This is useful for large documents. It is fast but can lose context, since each chunk is independent of the others.
- **refine**: takes n split documents and starts by summarizing the first split, then uses this summary as input for computing the next summary. It is a cumulative way to compute the final summary. It is useful for summarizing large texts while maintaining the context between splits.
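
The three strategies can be sketched without any API call. The code below is a simplified illustration, not langchain's implementation: toy_summarizer is a hypothetical stand-in for the LLM call, and the real chains use dedicated prompts for each step. It only shows how the chunks flow through each strategy and how many model calls each one needs.

```python
from typing import Callable, List

Summarizer = Callable[[str], str]


def stuff(chunks: List[str], summarize: Summarizer) -> str:
    """One call over the whole text."""
    return summarize(" ".join(chunks))


def map_reduce(chunks: List[str], summarize: Summarizer) -> str:
    """Summarize each chunk independently, then summarize the combined summaries."""
    partial_summaries = [summarize(chunk) for chunk in chunks]
    return summarize(" ".join(partial_summaries))


def refine(chunks: List[str], summarize: Summarizer) -> str:
    """Carry a running summary forward and update it with each new chunk."""
    running = summarize(chunks[0])
    for chunk in chunks[1:]:
        running = summarize(running + " " + chunk)
    return running


def toy_summarizer(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keep the first 8 words."""
    return " ".join(text.split()[:8])


def count_calls(strategy, chunks: List[str]) -> int:
    """Run a strategy and count how many times the summarizer is called."""
    calls = []

    def summarize(text: str) -> str:
        calls.append(text)
        return toy_summarizer(text)

    strategy(chunks, summarize)
    return len(calls)


chunks = ["a b c", "d e f", "g h i"]
for strategy in (stuff, map_reduce, refine):
    print(strategy.__name__, count_calls(strategy, chunks))
```

With n chunks this sketch makes 1 call for stuff, n + 1 calls for map_reduce and n calls for refine. The map calls are independent and can run in parallel, while each refine call has to wait for the previous one.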

### Example: summarize a section
As an example to demonstrate how this code works, let's select a section to summarize and print its summary after the model response.


In [15]:
restructured_doc = restructure_parsed_10k(parsed_doc)

Then we want to summarize the business section. This section contains the company description as well as other useful information to understand the company's business.

In [16]:
section_text = restructured_doc["business"]["text"]
section_text

'ITEM 1. BUSINESS Overview As our founders Larry and Sergey wrote in the original founders\' letter, "Google is not a conventional company. We do not intend to become one." That unconventional spirit has been a driving force throughout our history, inspiring us to tackle big problems and invest in moonshots, such as our long-term opportunities in artificial intelligence (AI). We continue this work under the leadership of Alphabet and Google CEO Sundar Pichai. Alphabet is a collection of businesses -- the largest of which is Google. We report Google in two segments, Google Services and Google Cloud; we also report all non-Google businesses collectively as Other Bets. Alphabet\'s structure is about helping each of our businesses prosper through strong leaders and independence. Access and technology for everyone The Internet is one of the world\'s most powerful equalizers; it propels ideas, people and businesses large and small. Our mission to organize the world\'s information and make it

In [17]:
len(section_text)

25020

The section_text is about 25,000 characters, right at the threshold that sections_summary uses to switch to the 16k model, so for this example we use the default **gpt-3.5-turbo** model with the **refine** chain type.

In [18]:
chain_type = "refine"
model = "gpt-3.5-turbo"
verbose = True

# get summary from openAI model
print(f"business original_len: {len(section_text)} use {model} w/ chain {chain_type}")
summary, cost = summarize_section(section_text, model, chain_type, verbose)

business original_len: 25020 use gpt-3.5-turbo w/ chain refine


> Entering new chain...


> Entering new chain...
Prompt after formatting:
Write a concise summary of the following:


"ITEM 1. BUSINESS Overview As our founders Larry and Sergey wrote in the original founders' letter, "Google is not a conventional company. We do not intend to become one." That unconventional spirit has been a driving force throughout our history, inspiring us to tackle big problems and invest in moonshots, such as our long-term opportunities in artificial intelligence (AI). We continue this work under the leadership of Alphabet and Google CEO Sundar Pichai. Alphabet is a collection of businesses -- the largest of which is Google. We report Google in two segments, Google Services and Google Cloud; we also report all non-Google businesses collectively as Other Bets. Alphabet's structure is about helping each of our businesses prosper through strong leaders and independence. A


> Finished chain.


> Entering new chain...
Prompt after formatting:
Your job is to produce a final summary
We have provided an existing summary up to a certain point: Alphabet Inc., the parent company of Google, operates under the leadership of CEO Sundar Pichai. Google is divided into two segments: Google Services and Google Cloud, while other businesses are collectively referred to as Other Bets. The company's mission is to make information universally accessible and useful, and it continues to invest in artificial intelligence (AI) and innovative products. Google generates revenue primarily through advertising, but also through Google Play, hardware sales, and YouTube subscriptions. The company faces competition in various aspects of its business and is committed to sustainability and promoting diversity and inclusion.
We have the opportunity to refine the existing summary(only if needed) with some more context below.
------------
on Form 10-K or in a

In [19]:
print(f"BULLET POINTS")
for el in summary:
    print(el)
print(f"cost: {cost} in USD")

BULLET POINTS
Alphabet Inc., the parent company of Google, operates under the leadership of CEO Sundar Pichai
Google is divided into two segments: Google Services and Google Cloud, while other businesses are collectively referred to as Other Bets
The company's mission is to make information universally accessible and useful, and it continues to invest in artificial intelligence (AI) and innovative products
Google generates revenue primarily through advertising, but also through Google Play, hardware sales, and YouTube subscriptions
The company faces competition in various aspects of its business and is committed to sustainability and promoting diversity and inclusion
Alphabet had 190,234 employees as of December 31, 2022, and is committed to supporting protected labor rights and maintaining an open culture
The company contracts with businesses around the world for specialized services, and reviews their compliance with Google's Supplier Code of Conduct
Alphabet is subject to numerous l

### Alphabet Inc. items summary
Now that we have seen how to summarize a section we can run the algorithm to create the summary for all the important sections of the last filing for Alphabet Inc.

We can do this by calling the sections_summary method and passing it the parsed_doc. The result will be saved in the items_summary collection.

In [20]:
sections_summary(parsed_doc)

business original_len: 25020 use gpt-3.5-turbo-16k w/ chain refine
business original_len: 25020 summary_len: 1452 reduction: 94.2% cost: 0.0212$ duration:11.7s used gpt-3.5-turbo-16k w/ chain refine
risk original_len: 82337 use gpt-3.5-turbo-16k w/ chain refine
risk original_len: 82337 summary_len: 2662 reduction: 96.77% cost: 0.0715$ duration:50.1s used gpt-3.5-turbo-16k w/ chain refine
property original_len: 328 use gpt-3.5-turbo w/ chain refine
property original_len: 328 summary_len: 226 reduction: 31.099999999999994% cost: 0.0002$ duration:1.5s used gpt-3.5-turbo w/ chain refine
legal original_len: 272 use gpt-3.5-turbo w/ chain refine
legal original_len: 272 summary_len: 147 reduction: 45.96% cost: 0.0002$ duration:1.5s used gpt-3.5-turbo w/ chain refine
other original_len: 493 use gpt-3.5-turbo w/ chain refine
other original_len: 493 summary_len: 311 reduction: 36.92% cost: 0.0003$ duration:2.4s used gpt-3.5-turbo w/ chain refine

Total Cost: 0.0934$, Total duration: 67.3s


In [21]:
import datetime

# Get the summarized document
summary_doc = mongodb.get_document("items_summary", parsed_doc["_id"])

for k, v in summary_doc.items():
    if isinstance(v, dict):
        print(f"=== {k} ===")

        for info in v["summary"]:
            print(info)

        print()

=== business ===
Alphabet Inc., the parent company of Google, is committed to innovation and solving big problems
They strive to make information universally accessible and provide tools for knowledge, health, happiness, and success
Google Services, including search, YouTube, and Google Assistant, offer intuitive experiences
Google Cloud helps businesses overcome challenges and drive growth
Alphabet also invests in Other Bets to address various industry problems
They focus on AI to assist and inspire people in different fields
Privacy and security are top priorities, with a commitment to building secure products and giving users control over their data
The company aims for sustainability and has ambitious goals for net-zero emissions and a circular economy
Alphabet values a diverse and inclusive workforce, offering competitive compensation, benefits, and career development opportunities
They have work councils and statutory employee representation obligations in certain countries, and 

## <a class="anchor" id="3-bullet" href="#toc">3. Quantitative Analysis</a>

In this step we are going to collect and process financial data.

Let's see step by step the algorithm for our valuation model (quantitative).

Valuation is done following the principles taught by Prof. Damodaran in his Valuation Course.

We build 4 different scenarios for both FCFF and Dividends Valuation:
1. Earnings TTM & Historical Growth
2. Earnings Normalized & Historical Growth
3. Earnings TTM & Growth TTM
4. Earnings Normalized & Growth Normalized

Each scenario is also run under a recession hypothesis.
We compute a median value for each of FCFF, Recession FCFF, Dividends and Recession Dividends, and then compute 2 Expected Values (one for FCFF, one for Dividends) weighted by the recession_probability.

These 2 values are then used to compute the final valuation (value/share), skewing the result towards the lower value to be conservative.
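
As a rough illustration of the expected-value step, the sketch below takes the median of the scenario values and weights the normal and recession medians by the recession probability. The function name, the example numbers and the simple probability-weighted formula are assumptions made for illustration; the project's own code may combine the values differently, and the final conservative skew is not shown.

```python
from statistics import median


def expected_value(normal_values, recession_values, recession_probability):
    """Probability-weighted value across normal and recession scenarios.

    Hypothetical helper: assumes the expected value is a plain weighted
    average of the two scenario medians.
    """
    normal_median = median(normal_values)
    recession_median = median(recession_values)
    return (recession_probability * recession_median
            + (1 - recession_probability) * normal_median)


# Made-up value/share figures for the 4 scenarios (illustration only)
fcff = [100.0, 110.0, 120.0, 130.0]
fcff_recession = [80.0, 85.0, 90.0, 95.0]
print(expected_value(fcff, fcff_recession, recession_probability=0.2))
```

The same helper would be called a second time with the Dividends scenarios to obtain the second Expected Value.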

But let's start from the beginning.

First we are going to download the financial data for our company and save it in MongoDB.

In [22]:
cik = '0001652044'

In [None]:
def download_financial_data(cik):
    """
    Download financial data for a company.
    Upsert document on mongodb (each request returns the entire history)
    :param cik: company cik
    :return:
    """
    url = f"https://data.sec.gov/api/xbrl/companyfacts/CIK{cik}.json"
    response = make_edgar_request(url)
    
    try:
        r = response.json()
        r["_id"] = cik
        r["url"] = url
        mongodb.upsert_document("financial_data", r)
        
    # ETFs, funds, trusts do not have financial information
    except Exception:
        print(f"ERROR {cik} - {response} - {url}")
        print(company_from_cik(cik))
        
print(cik)
download_financial_data(cik)

doc = mongodb.get_document("financial_data", cik)
doc

0001652044


For the following steps we are going to import methods from our quantitative_analysis module.

We will describe what each function does, but we are not going to delve into much detail here because it is pretty menial work of parsing and processing data, which would distract you from the main high-level goal of valuing the company and estimating its risks.

If you are interested in the details, you are welcome to take a look at the code of the project.

In [None]:
from quantitative_analysis import *

### General Information

The goal of this first phase is to extract the financial data and the general information for our company.

Let's extract the financial data required for valuation from the company's financial document. We have 3 different kinds of measures in the result dictionary:

- mr_ measures: these are the most recent values in the financial data (they come from the most recent 10-K or 10-Q)
- ttm_ measures: these are the Trailing 12 Months measures, meaning the value over the 4 most recent quarters. If the most recent filing is a 10-K, the value it reports is already the TTM value. Otherwise it needs to be calculated.
- yearly measures: simply the yearly values coming from the 10-Ks

The format for mr and ttm measures is {"date": date, "value": value}

The format for yearly measures is {"dates": [], "values": []}
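
When the most recent filing is a 10-Q, a common way to obtain a TTM figure for a flow measure (revenue, EBIT, net income, ...) is to take the last full fiscal year, add the current year-to-date value and subtract the year-to-date value of the same period one year earlier. The sketch below shows that arithmetic and wraps the result in the mr/ttm format described above. The function names, the date string and the numbers are hypothetical, and whether the project computes TTM exactly this way is an assumption.

```python
def trailing_twelve_months(last_annual, current_ytd, prior_year_ytd):
    """TTM = last full fiscal year + current year-to-date - same period a year earlier."""
    return last_annual + current_ytd - prior_year_ytd


def to_measure(date, value):
    """Wrap a value in the {"date": ..., "value": ...} format used for mr_ and ttm_ measures."""
    return {"date": date, "value": value}


# Made-up figures: full year 1000, first quarter this year 280, first quarter last year 250
ttm_revenue_example = to_measure("2023-03-31", trailing_twelve_months(1000.0, 280.0, 250.0))
print(ttm_revenue_example)
```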

In [None]:
data = extract_company_financial_information(cik)

Then we look for company revenues to check if there is financial data.

We want to consider the last 5 years in the valuation.

In [None]:
# how many financial years to consider in the valuation 
years = 5
final_year = data["revenue"]["dates"][-1]
initial_year = final_year - years + 1
print(f"Initial Year: {initial_year} - Final Year: {final_year}")

We are going to retrieve the Equity Risk Premium from Damodaran's data, which we saved in our PostgreSQL DB.

The Equity Risk Premium is the extra return an investor can expect from investing in the stock market compared to a risk-free government bond investment.

In [None]:
erp = get_df_from_table("damodaran_erp")
erp = erp[erp["date"] == erp["date"].max()]["value"].iloc[0]
print(f"ERP {erp}")

Let's retrieve the company info and real time price per share (via Yahoo Finance API).

In [28]:
company_info = company_from_cik(cik)
print(company_info)
ticker = company_info["ticker"]
price_per_share = get_current_price_from_yahoo(ticker)
print(f"Price per share {price_per_share}")

Price per share 120.97


Let's retrieve some additional info using the company ticker from data we have stored in PostgreSQL (which we collected from Yahoo Finance). 

In [29]:
company_name, country, industry, region = get_generic_info(ticker)
print(f"Company Name: {company_name}")
print(f"Country: {country}")
print(f"Industry: {industry}")
print(f"Region: {region}")

yahoo_equity_ticker = get_df_from_table("yahoo_equity_tickers", f"where symbol = '{ticker}'", most_recent=True).iloc[0]
db_curr = yahoo_equity_ticker["currency"]
db_financial_curr = yahoo_equity_ticker["financial_currency"]

Company Name: Alphabet Inc.
Country: United States
Industry: Information Services
Region: US


If db_curr and db_financial_curr are different, the quote currency (the currency in which shares are priced) differs from the financial currency (the currency in which financial statements are presented). This is not the case for our company.

When they do differ, we need the forex rate to convert between the two.

We also need the forex rate between the financial currency and USD (in case they are different) to be able to compute the market cap in USD (which we use to estimate the company size).

In [30]:
fx_rate = None

# they are different
if db_curr != db_financial_curr:
    fx_rate = convert_currencies(db_curr, db_financial_curr)

fx_rate_financial_USD = 1

if db_financial_curr != "USD":
    fx_rate_financial_USD = convert_currencies("USD", db_financial_curr)

print(f"FX rate: {fx_rate} - {fx_rate_financial_USD}")

FX rate: None - 1


Retrieve the bond_spread DataFrame from the Damodaran data that we uploaded to our PostgreSQL DB.

It is used to estimate the company default spread from its interest coverage ratio (EBIT / Interest Expenses).

In [31]:
damodaran_bond_spread = get_df_from_table("damodaran_bond_spread", most_recent=True)
damodaran_bond_spread["greater_than"] = pd.to_numeric(damodaran_bond_spread["greater_than"])
damodaran_bond_spread["less_than"] = pd.to_numeric(damodaran_bond_spread["less_than"])
damodaran_bond_spread

Unnamed: 0,greater_than,less_than,rating,spread,created_at
0,-100000.0,0.199999,D2/D,0.2,2023-06-23
1,0.2,0.649999,C2/C,0.175,2023-06-23
2,0.65,0.799999,Ca2/CC,0.1578,2023-06-23
3,0.8,1.249999,Caa/CCC,0.1157,2023-06-23
4,1.25,1.499999,B3/B-,0.0737,2023-06-23
5,1.5,1.749999,B2/B,0.0526,2023-06-23
6,1.75,1.999999,B1/B+,0.0455,2023-06-23
7,2.0,2.25,Ba2/BB,0.0313,2023-06-23
8,2.25,2.49999,Ba1/BB+,0.0242,2023-06-23
9,2.5,2.999999,Baa2/BBB,0.02,2023-06-23
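
To make the lookup concrete, here is a minimal sketch that maps an interest coverage ratio to a rating and spread using only the ten rows displayed above. The function name and the handling of ratios outside the displayed rows are assumptions; the project's own lookup works on the full DataFrame and may behave differently.

```python
# (greater_than, less_than, rating, spread) rows copied from the table above
BOND_SPREAD_ROWS = [
    (-100000.0, 0.199999, "D2/D", 0.2),
    (0.2, 0.649999, "C2/C", 0.175),
    (0.65, 0.799999, "Ca2/CC", 0.1578),
    (0.8, 1.249999, "Caa/CCC", 0.1157),
    (1.25, 1.499999, "B3/B-", 0.0737),
    (1.5, 1.749999, "B2/B", 0.0526),
    (1.75, 1.999999, "B1/B+", 0.0455),
    (2.0, 2.25, "Ba2/BB", 0.0313),
    (2.25, 2.49999, "Ba1/BB+", 0.0242),
    (2.5, 2.999999, "Baa2/BBB", 0.02),
]


def spread_from_interest_coverage(ebit, interest_expense, rows=BOND_SPREAD_ROWS):
    """Return (rating, spread) for the bracket containing EBIT / interest expense.

    Returns None when the ratio is undefined or falls outside the given rows.
    """
    if interest_expense <= 0:
        return None  # no interest expense: the ratio is not defined
    ratio = ebit / interest_expense
    for greater_than, less_than, rating, spread in rows:
        if greater_than <= ratio <= less_than:
            return rating, spread
    return None


print(spread_from_interest_coverage(ebit=160.0, interest_expense=100.0))
```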


Retrieve the most recent annual report (the 10-K filed with the SEC).

In [32]:
doc = get_last_document(cik, "10-K")

Extract business segments and compute geographic distributions.

From the most recent 10-K we are going to extract information about where the company is doing business.
We want to consider the business risk not just based on the country in which the company is incorporated, but more importantly based on the countries where it's conducting business.

In [33]:
segments = extract_segments(doc)
geo_segments_df = geography_distribution(segments, ticker)
geo_segments_df

Unnamed: 0,value,country,country_area,region
0,0.479977,UnitedStates,UnitedStates,US
1,0.204515,Germany,Western Europe,Europe
2,0.043825,Saudi Arabia,Middle East,emerg
3,0.043825,SouthAfrica,Africa,emerg
4,0.167419,China,Asia,emerg
5,0.060439,Mexico,Central and South America,emerg


Retrieve country statistics from the Damodaran data in our PostgreSQL DB.

Here we have average metrics for the companies in each country (such as PE, PEG, EV/Sales, ...) as well as country-specific metrics such as Tax Rate, Country Risk Premium, Moody's rating for the country, ...

The country-specific metrics are also available at region level (North America, Western Europe, Asia, ...).

In [34]:
country_stats = get_df_from_table("damodaran_country_stats", most_recent=True)
country_stats

Unnamed: 0,country,pe,peg,pbv,ps,ev_ebitda,ev_sales,moody_rating,adjusted_default_spread,country_risk_premium,created_at,alpha_2_code,alpha_3_code,currency,power,tax_rate
0,Liechtenstein,15.03076923076923,,0.744286439817166,3.615542763157895,,,Aaa,0.0,0.0,2023-06-23,LI,LIE,CHF,,0.125
1,Argentina,11.5045871559633,0.5843443469282394,1.26813880126183,0.9422632794457275,9.06578947368421,1.167701863354037,Ca,0.14682729357798166,0.20711819215360386,2023-06-23,AR,ARG,ARS,0.6091,0.35
2,Australia,19.30132450331126,2.386433103564577,2.664233576642336,6.227410144440713,13.43498452012384,6.753153153153153,Aaa,0.0,0.0,2023-06-23,AU,AUS,AUD,0.2377,0.3
3,Austria,20.12588792423047,1.273790374951295,1.115204495785201,1.293035437144335,10.44125326370757,1.606434604078204,Aa1,0.004889449541284403,0.0068971777994344076,2023-06-23,AT,AUT,EUR,0.8924,0.24
4,Azerbaijan,7.51293103448276,,1.407915993537965,1.707149853085211,3.147747747747748,1.368854064642508,Ba1,0.030630963302752296,0.04320879033175085,2023-06-23,AZ,AZE,AZN,1.0251,0.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
118,EasternEurope&Russia,,,,,,,,0.05523863838321001,0.07792098212912112,2023-06-23,,-,,,0.18347474471087855
119,MiddleEast,,,,,,,,0.017766934990427034,0.025062475549707256,2023-06-23,,-,,,0.151813371674751
120,NorthAmerica,,,,,,,,0.0,0.0,2023-06-23,,-,,,0.25
121,WesternEurope,,,,,,,,0.010682352271974054,0.015068788892060588,2023-06-23,,-,,,0.24828912576026474


Compute the tax rate, country default spread and country risk premium as weighted averages over the areas where the company does business, using the geographic distribution computed above.

In [35]:
tax_rate = 0
country_default_spread = 0
country_risk_premium = 0

if geo_segments_df is None or geo_segments_df.empty:
    # no geographic breakdown: use the country of incorporation
    try:
        filter_df = country_stats[country_stats["country"] == country.replace(" ", "")].iloc[0]
    except Exception:
        # country not found: fall back to the Global row
        filter_df = country_stats[country_stats["country"] == "Global"].iloc[0]
    tax_rate = float(filter_df["tax_rate"])
    country_default_spread = float(filter_df["adjusted_default_spread"])
    country_risk_premium = float(filter_df["country_risk_premium"])
else:
    # weight each area's metrics by its share in the geographic distribution
    for _, row in geo_segments_df.iterrows():
        percent = row["value"]
        search_key = row["country_area"]
        try:
            filter_df = country_stats[country_stats["country"] == search_key.replace(" ", "")].iloc[0]
        except Exception:
            # area not found: fall back to the Global row
            filter_df = country_stats[country_stats["country"] == "Global"].iloc[0]
        t = float(filter_df["tax_rate"])
        cds = float(filter_df["adjusted_default_spread"])
        crp = float(filter_df["country_risk_premium"])

        tax_rate += t * percent
        country_default_spread += cds * percent
        country_risk_premium += crp * percent
        
print(f"Tax rate: {tax_rate}")
print(f"Country default spread: {country_default_spread}")
print(f"Country risk premium: {country_risk_premium}")

Tax rate: 0.24944222192426488
Country default spread: 0.0110584953817104
Country risk premium: 0.015599385615470526


Adding country risk premium to our base ERP, we get the final Equity Risk Premium (this means that to invest in a company that is doing business in a risky country, the investor will expect a higher return).

In [36]:
final_erp = float(erp) + country_risk_premium
print(f"Final ERP: {final_erp}")

Final ERP: 0.06809938561547052


Select alpha_3_code from company country. These are three-letter country codes defined in ISO 3166-1. 

This will be used for computing the riskfree rate.

In [37]:
alpha_3_code = country_stats[country_stats["country"] == country.replace(" ", "")].iloc[0]["alpha_3_code"]
print(f"Alpha 3 code: {alpha_3_code}")

Alpha 3 code: USA


To get the riskfree rate, our first attempt is to get the 10y bond yield for the financial currency.

We obtain it by scraping it from investing.com.

As the 10y bond yield is influenced by the country risk, to get the riskfree rate we need to subtract from it the country default spread, as taken from Damodaran.

In case we cannot find the currency on investing.com (this is the case for particularly exotic currencies):
- we compute the riskfree for USA
- we take the inflation rate for USA and for the company country of incorporation
- riskfree = riskfree USA * inflation of the country / inflation in USA
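
The two routes described above boil down to the following arithmetic. This is only a sketch: the function names and the example inputs are made up, and the scraping of the bond yield and the retrieval of inflation rates done by currency_bond_yield are not reproduced.

```python
def riskfree_from_bond_yield(bond_yield_10y, country_default_spread):
    """Strip the country default spread out of the 10y government bond yield."""
    return bond_yield_10y - country_default_spread


def riskfree_from_inflation(riskfree_usa, inflation_country, inflation_usa):
    """Fallback: scale the USA risk-free rate by the ratio of inflation rates."""
    return riskfree_usa * inflation_country / inflation_usa


# Made-up inputs, for illustration only
print(riskfree_from_bond_yield(bond_yield_10y=0.07, country_default_spread=0.03))
print(riskfree_from_inflation(riskfree_usa=0.04, inflation_country=0.06, inflation_usa=0.03))
```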

In [38]:
riskfree = currency_bond_yield(db_financial_curr, alpha_3_code, country_stats)
print(f"Risk Free: {riskfree}")

Risk Free: 0.03944


To recap we collected the following general information.

In [39]:
print("===== GENERAL INFORMATION =====\n")
print("ticker", ticker)
print("cik", cik)
print("company_name", company_name)
print("country", country)
print("region", region)
print("industry", industry)
print("financial currency", db_financial_curr)
print("riskfree", riskfree)
print("erp", erp)
print("\n\n")

===== GENERAL INFORMATION =====

ticker GOOGL
cik 0001652044
company_name Alphabet Inc.
country United States
region US
industry Information Services
financial currency USD
riskfree 0.03944
erp 0.0525





### Last Available and Historical Financial Data

The goal of this phase is to extract financial data from our data dictionary into variables that will make subsequent computations easier.

Now let's retrieve the number of shares for the last years.

get_selected_years extracts the measure we want for the years we specify.

If one of the years in the period is not available in our data dictionary, it will insert a 0 in the list.
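
A minimal sketch of that behaviour, assuming the yearly format {"dates": [], "values": []} with integer years in "dates". This is not the project's implementation (the function name carries a _sketch suffix to make that clear), and any unit scaling the real helper applies is ignored.

```python
def get_selected_years_sketch(data, measure, initial_year, final_year):
    """Return the yearly values of `measure` from initial_year to final_year (inclusive).

    Years that are missing from the data are filled with 0.
    """
    series = data.get(measure, {"dates": [], "values": []})
    value_by_year = dict(zip(series["dates"], series["values"]))
    return [value_by_year.get(year, 0) for year in range(initial_year, final_year + 1)]


# Made-up data: 2021 is missing, so its slot is filled with 0
example_data = {"revenue": {"dates": [2019, 2020, 2022], "values": [10.0, 12.0, 15.0]}}
print(get_selected_years_sketch(example_data, "revenue", 2019, 2022))
```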

In [40]:
mr_shares = data["mr_shares"]["value"] / 1000
shares = get_selected_years(data, "shares", initial_year, final_year)
print(f"MR shares: {mr_shares}")   
print(f"Shares: {shares}")    

MR shares: 12722000.0
Shares: [695556.0, 688335.0, 675222.0, 13242000.0, 12849000.0]


Now we are going to retrieve some more financial data from our data dictionary.

In [41]:
ttm_revenue = data["ttm_revenue"]["value"] / 1000
ttm_ebit = data["ttm_ebit"]["value"] / 1000
ttm_net_income = data["ttm_net_income"]["value"] / 1000
ttm_dividends = data["ttm_dividends"]["value"] / 1000
ttm_interest_expense = data["ttm_interest_expenses"]["value"] / 1000
mr_cash = data["mr_cash"]["value"] / 1000
mr_securities = data["mr_securities"]["value"] / 1000
mr_debt = data["mr_debt"]["value"] / 1000
mr_equity = data["mr_equity"]["value"] / 1000
ebit = get_selected_years(data, "ebit", initial_year, final_year)
net_income = get_selected_years(data, "net_income", initial_year, final_year)
dividends = get_selected_years(data, "dividends", initial_year, final_year)
capex = get_selected_years(data, "capex", initial_year, final_year)
depreciation = get_selected_years(data, "depreciation", initial_year, final_year)
equity_bv = get_selected_years(data, "equity", initial_year, final_year)
cash = get_selected_years(data, "cash", initial_year, final_year)
securities = get_selected_years(data, "securities", initial_year, final_year)
debt_bv = get_selected_years(data, "debt", initial_year, final_year)
revenue = get_selected_years(data, "revenue", initial_year-1, final_year)

mr_cash_and_securities = mr_cash + mr_securities
cash_and_securities = [sum(x) for x in zip(cash, securities)]

Let's compute the revenue growth and revenue delta, which we will use later on.

In [42]:
# Compute revenue growth
revenue_growth = []
revenue_delta = []
for i in range(len(revenue) - 1):
    revenue_delta.append(revenue[i + 1] - revenue[i])
    try:
        revenue_growth.append(revenue[i + 1] / revenue[i] - 1)
    except ZeroDivisionError:  # previous year's revenue is 0 (e.g. missing data)
        revenue_growth.append(0)

# drop 1st element we don't need
revenue = revenue[1:]
revenue_growth = revenue_growth[1:]
print(f"Revenue: {revenue}")
print(f"Revenue Growth: {revenue_growth}")

Revenue: [136819000.0, 161857000.0, 182527000.0, 257637000.0, 282836000.0]
Revenue Growth: [0.18300089899794614, 0.1277053201282614, 0.41150076427049154, 0.09780815643715779]


To get adjusted values for our valuation we are going to perform 2 different tasks, as explained by Prof. Damodaran:
1. Capitalize Research and Development
2. Account for Operating Leases

Let's start by capitalizing R&D.

We need to capitalize R&D because accounting standards treat it as an operating expense, while it really is a capital expense: the benefits of R&D are reaped over a number of years, not only in the year in which the expense takes place.

By capitalizing R&D:
- we choose a number of amortization years based on the industry the company is operating in
- we calculate the amount of unamortized R&D (which will be added to our Equity book value)
- we compute the current year R&D amortization
- we adjust EBIT and Net income by adding back the current year R&D expenses and subtracting the current year R&D amortization

This will usually increase our EBIT and Net income values, but it will also increase our Equity book value, thus reducing the ROE (return on equity) and ROC (return on capital).
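
The steps above can be sketched as follows, assuming straight-line amortization over the chosen number of years. The function name and the example figures are hypothetical, and the project's capitalize_rd function (which also returns a tax benefit and per-year lists) may differ in its details.

```python
def capitalize_rd_sketch(rd_history, amortization_years):
    """Capitalize R&D with straight-line amortization.

    rd_history: R&D expenses ordered oldest first, the last entry being the current year.
    Returns (unamortized_rd, current_year_amortization, ebit_adjustment).
    """
    n = amortization_years
    # Keep the current year plus the n previous years, padding missing years with 0
    recent = list(rd_history[-(n + 1):])
    padded = [0.0] * (n + 1 - len(recent)) + recent
    current_rd = padded[-1]
    # padded[-1 - i] is the expense made i years ago; (n - i) / n of it is still unamortized
    unamortized = sum(padded[-1 - i] * (n - i) / n for i in range(n))
    # Each of the n previous years contributes 1/n of its expense to this year's amortization
    amortization = sum(padded[-1 - i] / n for i in range(1, n + 1))
    # Add back the current year expense and subtract the current year amortization
    ebit_adjustment = current_rd - amortization
    return unamortized, amortization, ebit_adjustment


# Made-up example: constant R&D of 100 per year, amortized over 3 years
print(capitalize_rd_sketch([100.0, 100.0, 100.0, 100.0], amortization_years=3))
```

With constant R&D the adjustment to EBIT is zero, while a company whose R&D grows over time gets a positive adjustment, which is the usual case mentioned above.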

In [43]:
try:
    r_and_d_amortization_years = r_and_d_amortization[industry]
except KeyError:
    print(f"\n#######\nCould not find industry: {industry} mapping. "
          f"Check r_and_d_amortization dictionary.\n#######\n")
    r_and_d_amortization_years = 5

r_and_d = get_selected_years(data, "rd", final_year - r_and_d_amortization_years, final_year)
while len(r_and_d) < years:
    r_and_d.insert(0, 0)

ebit_r_and_d_adj, tax_benefit, r_and_d_unamortized, r_and_d_amortization_cy = capitalize_rd(r_and_d, r_and_d_amortization_years, tax_rate, years)


Let's compute R&D adjusted values

In [44]:
ttm_ebit_adj = ttm_ebit + ebit_r_and_d_adj[-1]
ebit_adj = [sum(x) for x in zip(ebit, ebit_r_and_d_adj)]
ttm_net_income_adj = ttm_net_income + ebit_r_and_d_adj[-1]
net_income_adj = [sum(x) for x in zip(net_income, ebit_r_and_d_adj)]
mr_equity_adj = mr_equity + r_and_d_unamortized[-1]
equity_bv_adj = [sum(x) for x in zip(equity_bv, r_and_d_unamortized)]
capex_adj = [sum(x) for x in zip(capex, r_and_d[-years:])]
depreciation_adj = [sum(x) for x in zip(depreciation, r_and_d_amortization_cy)]
ebit_after_tax = [sum(x) for x in zip([x * (1 - tax_rate) for x in ebit_adj], tax_benefit)]
ttm_eps_adj = ttm_net_income_adj / mr_shares

And now the second step, accounting for operating leases.

We need to account for operating leases because accounting standards treat them as an operating expense (like rent), while they really are debt!

This is because operating leases are a long-term commitment and require a minimum payment for many subsequent years. It's not like a rent that the company can cancel whenever it wants. It's more like a long-term loan.

By accounting for operating leases:
- we add the following years' commitments to debt (discounting them at the cost of debt, as the payments will happen in the future)
- we adjust EBIT based on the current year payment compared to an equal amortization payment
- we add the interest part of the current year payment to interest expenses
- we recalculate the interest coverage ratio and cost of debt based on these new values

This will usually increase our Debt book value, thus reducing the ROC (return on capital), and potentially increase the company default spread and the cost of debt.
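
As a simplified illustration of the first three steps, the sketch below discounts a list of yearly lease commitments at the cost of debt, compares the current year payment with a straight-line amortization of the resulting debt, and approximates the interest part as debt times cost of debt. All names and numbers are hypothetical, the straight-line and interest approximations are assumptions, and the project's debtize_op_leases function (which also handles the "after 5 years" bucket and re-estimates the default spread) is more involved.

```python
def present_value_of_leases(commitments, cost_of_debt):
    """Discount yearly lease commitments (next year first) at the cost of debt."""
    return sum(payment / (1 + cost_of_debt) ** year
               for year, payment in enumerate(commitments, start=1))


def lease_adjustments(current_year_expense, commitments, cost_of_debt):
    """Return (debt_adjustment, ebit_adjustment, interest_adjustment)."""
    if not commitments:
        return 0.0, 0.0, 0.0
    debt = present_value_of_leases(commitments, cost_of_debt)
    # Straight-line amortization of the leased asset over the commitment period
    amortization = debt / len(commitments)
    ebit_adjustment = current_year_expense - amortization
    # Interest part of the payment, approximated as debt * cost of debt
    interest = debt * cost_of_debt
    return debt, ebit_adjustment, interest


# Made-up example: two future payments, 10% cost of debt
print(lease_adjustments(current_year_expense=120.0, commitments=[110.0, 121.0], cost_of_debt=0.10))
```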

In [45]:
leases = [
    data["mr_op_leases_expense"]["value"] / 1000,
    data["mr_op_leases_next_year"]["value"] / 1000,
    data["mr_op_leases_next_2year"]["value"] / 1000,
    data["mr_op_leases_next_3year"]["value"] / 1000,
    data["mr_op_leases_next_4year"]["value"] / 1000,
    data["mr_op_leases_next_5year"]["value"] / 1000,
    data["mr_op_leases_after_5year"]["value"] / 1000,
]
last_year_leases = max([i for i, x in enumerate(leases) if x != 0], default=-1)
if last_year_leases != -1:
    ebit_op_adj, int_exp_op_adj, debt_adj, tax_benefit_op, company_default_spread = \
        debtize_op_leases(ttm_interest_expense, ttm_ebit_adj, damodaran_bond_spread, riskfree, country_default_spread,
                      leases, last_year_leases, tax_rate, revenue_growth)
    ttm_ebit_adj += ebit_op_adj[-1]
    ttm_interest_expense_adj = ttm_interest_expense + int_exp_op_adj
    mr_debt_adj = mr_debt + debt_adj[-1]
    ebit_adj = [sum(x) for x in zip(ebit_adj, ebit_op_adj)]
    debt_bv_adj = [sum(x) for x in zip(debt_bv, debt_adj)]
    ebit_after_tax = [sum(x) for x in zip(ebit_after_tax, tax_benefit_op)]

    ttm_ebit_after_tax = ttm_ebit_adj * (1 - tax_rate) + tax_benefit[-1] + tax_benefit_op[-1]
# no leases
else:
    ttm_interest_expense_adj = ttm_interest_expense
    mr_debt_adj = mr_debt
    debt_bv_adj = debt_bv
    company_default_spread = get_spread_from_dscr(12.5, damodaran_bond_spread)
    ttm_ebit_after_tax = ttm_ebit_adj * (1 - tax_rate) + tax_benefit[-1]

Let's compute cost of debt.

In [46]:
cost_of_debt = riskfree + country_default_spread + company_default_spread
print(f"Cost of debt: {cost_of_debt}")

Cost of debt: 0.05739849538171041


Let's compute cash and securities.

These are the most liquid assets the company owns, and we are going to consider them both as Cash (usually this is referred to as Cash and Cash equivalents).

In [47]:
mr_cash_and_securities = mr_cash + mr_securities
cash_and_securities = [sum(x) for x in zip(cash, securities)]
print(f"MR Cash and securities {mr_cash_and_securities}")
print(f"Cash and securities {cash_and_securities}")

MR Cash and securities 110880000.0
Cash and securities [107918000.0, 116379000.0, 131265000.0, 133301000.0, 109410000.0]


We compute both earnings per share and dividends per share using the most recent number of shares (to account for splits, dilution, and buybacks).

In [48]:
eps = [x / mr_shares for x in net_income]
eps_adj = [x/mr_shares for x in net_income_adj]
dividends = [x/mr_shares for x in dividends]
print(f"EPS {eps}")
print(f"EPS Adjusted {eps_adj}")
print(f"Dividends {dividends}")

EPS [2.415972331394435, 2.69949693444427, 3.1653041974532306, 5.976497406068228, 4.7140386731645965]
EPS Adjusted [2.991487715555697, 3.275012318605532, 3.775216015712839, 6.6746454787447504, 5.587774458942515]
Dividends [0.0, 0.0, 0.0, 0.0, 0.0]


Now we are going to compute Working Capital as inventory + receivables + other assets - payables - due to affiliates - due to related.

As explained by Damodaran we want to compute WC = non-cash current assets - non-debt current liabilities

In [49]:
wc = {}
for i in ["inventory", "receivables", "other_assets", "account_payable", "due_to_affiliates", "due_to_related_parties"]:
    val = get_selected_years(data, i, initial_year-1, final_year)
    wc[i] = val
    
df = pd.DataFrame(wc)
df["wc"] = df["inventory"] + df["receivables"] + df["other_assets"] - df["account_payable"] \
           - df["due_to_affiliates"] - df["due_to_related_parties"]

# this computes the difference from the previous row
df["delta_wc"] = df["wc"].diff(1)
df = df.dropna()

working_capital = df["wc"].to_list()
delta_wc = df["delta_wc"].to_list()

print(f"Working capital {working_capital}")
print(f"Delta WC {delta_wc}")

Working capital [21803000.0, 25176000.0, 31559000.0, 42457000.0, 45905000.0]
Delta WC [2872000.0, 3373000.0, 6383000.0, 10898000.0, 3448000.0]


Compute reinvestment as CAPEX + delta Working capital - Depreciation

In [50]:
reinvestment = []
for i in range(len(capex)):
    reinvestment.append(capex_adj[i] + delta_wc[i] - depreciation_adj[i])
print(f"Reinvestment {reinvestment}")

Reinvestment [9314706.717299584, 34242706.71729958, 36423298.15189874, 44419839.78059073, 46048666.66666667]


Compute equity market value (which is the company Market Cap)

In [51]:
equity_mkt = mr_shares * price_per_share
if fx_rate is not None:
    equity_mkt /= fx_rate
print(f"Equity market {equity_mkt}")

Equity market 1538980340.0


Compute debt market value.

In [52]:
debt_mkt = ttm_interest_expense_adj * (1 - (1 + cost_of_debt) ** -6) / cost_of_debt + mr_debt_adj / (1 + cost_of_debt) ** 6
print(f"Debt Market {debt_mkt}")

Debt Market 23519939.485163145
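
The expression in the cell above can be read as pricing all of the company's debt as a single coupon bond: the interest expense is treated as an annual coupon paid for 6 years, and the book value of debt as the face value repaid at the end, both discounted at the cost of debt. The sketch below restates that reading. The function name is made up, and interpreting the hard-coded 6 as an average maturity in years is an assumption.

```python
def debt_market_value(interest_expense, debt_book_value, cost_of_debt, maturity_years=6):
    """Value debt as a bond: annuity of interest payments plus discounted face value."""
    annuity_factor = (1 - (1 + cost_of_debt) ** -maturity_years) / cost_of_debt
    pv_interest = interest_expense * annuity_factor
    pv_principal = debt_book_value / (1 + cost_of_debt) ** maturity_years
    return pv_interest + pv_principal


# Made-up example: when interest / debt equals the cost of debt, the debt is worth its book value
print(debt_market_value(interest_expense=5.0, debt_book_value=100.0, cost_of_debt=0.05))
```

When the interest paid is low relative to the cost of debt, as for the company analyzed here, the market value comes out below the adjusted book value.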


Get company/industry data for sales to capital (reinvestment needs), dividends payout, unlevered beta, operating margin and debt to equity.

Sales to capital, operating margin and debt to equity are computed taking into consideration both company values and industry values.

PBV, dividends payout and unlevered beta only take industry values into consideration.

As usual, these industry values are taken from data provided by Prof. Damodaran (they represent the average values for the companies operating in a specific industry).

In [53]:
target_sales_capital, industry_payout, pbv, unlevered_beta, target_operating_margin, target_debt_equity = \
get_industry_data(industry, region, geo_segments_df, revenue, ebit_adj, revenue_delta, reinvestment,
                  equity_mkt, debt_mkt, equity_bv_adj, debt_bv_adj, mr_equity_adj, mr_debt_adj)

value not found for  Information Services Europe cash_return
searching now in region  US


Retrieve the minority interest and compute its market value by multiplying the book value by the PBV (price to book value) derived from the industry data.

In [54]:
mr_original_min_interest = data["mr_minority_interest"]["value"] / 1000
mr_minority_interest = mr_original_min_interest * pbv
print(f"Minority interest {mr_minority_interest}")

Minority interest 0.0


Retrieve tax benefits and stock-based compensation (SBC).

In [55]:
mr_tax_benefits = data["mr_tax_benefits"]["value"] / 1000
mr_sbc = data["mr_sbc"]["value"] / 1000
print(f"MR Tax benefits {mr_tax_benefits}")
print(f"MR SBC {mr_sbc}")

MR Tax benefits 7500000.0
MR SBC 2138000.0


We have now gathered pretty much all the data we are going to need in our valuation model!

But we still need to compute many more measures starting from these, so stay with us.

In [56]:
print("===== Last Available Data =====\n")
print("Outstanding Shares", mr_shares)
print("Price/Share (price currency)", price_per_share)
print("FX Rate:", 1 if fx_rate is None else fx_rate)
print("FX Rate USD:", fx_rate_financial_USD)
print("ttm_revenue", ttm_revenue)
print("ttm_ebit", ttm_ebit, "=>", ttm_ebit_adj)
print("ttm_net_income", ttm_net_income, "=>", ttm_net_income_adj)
print("ttm_dividends", ttm_dividends)
print("ttm_interest_expense", ttm_interest_expense, "=>", ttm_interest_expense_adj)
print("tax_credit", mr_tax_benefits)
print("\n\n")
print("===== Historical Data =====\n")
print("initial_year", initial_year)
print("revenue", revenue)
print("revenue_delta", revenue_delta)
print("ebit", ebit, "=>", ebit_adj)
print("net_income", net_income, "=>", net_income_adj)
print("dividends", dividends)
print("working_capital", working_capital)
print("delta_WC", delta_wc)
print("capex", capex, "=>", capex_adj)
print("depreciation", depreciation, "=>", depreciation_adj)
print("shares_outstanding", shares)
print("equity_bv", equity_bv, "=>", equity_bv_adj)
print("cash&securities", cash_and_securities)
print("debt_bv", debt_bv, "=>", debt_bv_adj)
print("\n\n")
print("===== R&D =====")
print("r_and_d", r_and_d)
print("amortization_years", r_and_d_amortization_years)
print("\n===== Operating Leases =====")
print("leases", leases)
print("\n===== Segments =====\n")
if geo_segments_df is None:
    print("10-K not found. Check annual report on company website.")
else:
    print(geo_segments_df.to_markdown())
print("\n===== Options =====")
print("mr_sbc", mr_sbc)
print("\n\n")

===== Last Available Data =====

Outstanding Shares 12722000.0
Price/Share (price currency) 120.97
FX Rate: 1
FX Rate USD: 1
ttm_revenue 284612000.0
ttm_ebit 72163000.0 => 83584648.3032728
ttm_net_income 58587000.0 => 69702666.66666667
ttm_dividends 0.0
ttm_interest_expense 354000.0 => 501757.63830892835
tax_credit 7500000.0



===== Historical Data =====

initial_year 2018
revenue [136819000.0, 161857000.0, 182527000.0, 257637000.0, 282836000.0]
revenue_delta [25964000.0, 25038000.0, 20670000.0, 75110000.0, 25199000.0]
ebit [27524000.0, 34231000.0, 41224000.0, 78714000.0, 74842000.0] => [34993722.180461325, 41727809.14328552, 49180762.08925044, 87874560.27907851, 86263648.3032728]
net_income [30736000.0, 34343000.0, 40269000.0, 76033000.0, 59972000.0] => [38057706.71729958, 41664706.71729958, 48028298.151898734, 84914839.78059071, 71087666.66666667]
dividends [0.0, 0.0, 0.0, 0.0, 0.0]
working_capital [21803000.0, 25176000.0, 31559000.0, 42457000.0, 45905000.0]
delta_WC [2872000.0, 337

### Computations

The goal of this phase is to crunch the financial data we extracted before and compute all the additional derived metrics that we are going to need in our valuation.

Now we can start to calculate the company expected growth.

We estimate growth in 3 different ways:
- bottom up growth estimate with TTM values
- bottom up growth estimate with 5 years normalized values
- historical growth trend

Here we compute the first one (bottom up growth estimate with TTM values).

The formula is growth = return on capital * reinvestment rate, where reinvestment rate = (CAPEX + delta WC) / EBIT after tax.

As we are computing growth_TTM, we will use TTM ROC and TTM reinvestment rate.

The same 3 growth estimates are also calculated for growth in EPS (for our dividends valuation). The difference is that growth in EPS is computed as return on equity * reinvestment rate, where the reinvestment rate is 1 - payout ratio.
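
Both bottom-up formulas are short enough to sketch directly. The function names and inputs below are hypothetical; the `reinvestment` argument stands for whatever reinvestment amount is used for the period, and get_growth_ttm may apply additional checks that are not shown here.

```python
def growth_from_fundamentals(return_on_capital, reinvestment, ebit_after_tax):
    """Operating income growth = return on capital * reinvestment rate."""
    reinvestment_rate = reinvestment / ebit_after_tax
    return return_on_capital * reinvestment_rate


def eps_growth_from_fundamentals(return_on_equity, payout_ratio):
    """EPS growth = return on equity * (1 - payout ratio)."""
    return return_on_equity * (1 - payout_ratio)


# Made-up inputs, for illustration only
print(growth_from_fundamentals(return_on_capital=0.20, reinvestment=50.0, ebit_after_tax=100.0))
print(eps_growth_from_fundamentals(return_on_equity=0.15, payout_ratio=0.40))
```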

In [57]:
roc_last, reinvestment_last, growth_last, roe_last, reinvestment_eps_last, growth_eps_last = \
get_growth_ttm(ttm_ebit_after_tax, ttm_net_income_adj, mr_equity_adj, mr_debt_adj, mr_cash_and_securities,
               reinvestment, ttm_dividends, industry_payout)

Compute ROE and ROC to be used in following methods.

ROC = EBIT after tax / (debt + equity - cash)

ROE = Net Income / equity
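
Applied year by year to the historical lists, the two ratios can be sketched as below. The function name is made up and the sketch does not guard against a zero denominator, which the project's get_roe_roc may well handle.

```python
def roe_roc_history(equity_bv, debt_bv, cash_and_securities, ebit_after_tax, net_income):
    """Return (roe, roc) lists computed year by year from the historical values."""
    roe = [ni / equity for ni, equity in zip(net_income, equity_bv)]
    roc = [
        ebit / (debt + equity - cash)
        for ebit, debt, equity, cash in zip(ebit_after_tax, debt_bv, equity_bv, cash_and_securities)
    ]
    return roe, roc


# Made-up two-year history, for illustration only
roe_example, roc_example = roe_roc_history(
    equity_bv=[100.0, 200.0],
    debt_bv=[50.0, 50.0],
    cash_and_securities=[30.0, 50.0],
    ebit_after_tax=[12.0, 20.0],
    net_income=[10.0, 30.0],
)
print(roe_example, roc_example)
```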

In [58]:
roe, roc = get_roe_roc(equity_bv_adj, debt_bv_adj, cash_and_securities, ebit_after_tax, net_income_adj)

Now we compute another estimate of growth (historical growth trend), plus other industry values we'll need later on.

Here we use some heuristics on the historical CAGR obtained by the company to compute a conservative historical growth estimate.

In [59]:
cagr, target_levered_beta, target_cost_of_equity, target_cost_of_debt, target_cost_of_capital = \
get_target_info(revenue, ttm_revenue, country_default_spread, tax_rate, final_erp, riskfree,
                unlevered_beta, damodaran_bond_spread, company_default_spread, target_debt_equity)

Let's now compute normalized values for revenue, EBIT, margin and so on, while also computing the last estimate of growth (bottom up with normalized values).

To normalize values we use a weighted average of the values in the 5 years we are considering. More recent years get a higher weight, while older years get a lower weight.

The formulas we use to estimate normalized growth are the same ones we saw for the TTM estimate, just using normalized values instead of TTM ones for return on capital, reinvestment rate, return on equity and payout ratio.
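
The exact weights are not stated here, so the sketch below assumes linearly increasing weights (1 for the oldest year up to n for the most recent one) purely to illustrate the idea of a recency-weighted average. The function name and the default weights are assumptions, not the scheme used by get_normalized_info.

```python
def weighted_average(values, weights=None):
    """Recency-weighted average of values ordered oldest first.

    Default weights are 1, 2, ..., n (an assumption made for this sketch).
    """
    if not values:
        raise ValueError("values must not be empty")
    if weights is None:
        weights = list(range(1, len(values) + 1))
    if len(weights) != len(values):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Revenue history printed earlier in this section (in thousands)
revenue_history = [136819000.0, 161857000.0, 182527000.0, 257637000.0, 282836000.0]
print(weighted_average(revenue_history))
```

Because revenue has been growing, the weighted figure sits above the plain 5-year mean and closer to the most recent years.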

In [60]:
revenue_5y, ebit_5y, operating_margin_5y, sales_capital_5y, roc_5y, reinvestment_5y, growth_5y, \
net_income_5y, roe_5y, reinvestment_eps_5y, growth_eps_5y = \
get_normalized_info(revenue, ebit_adj, revenue_delta, reinvestment, target_sales_capital,
                ebit_after_tax, industry_payout, cagr, net_income_adj, roe, dividends, eps_adj, roc)

Compute normalized EPS and payout ratio.

In [61]:
eps_5y, payout_5y = get_dividends_info(eps_adj, dividends)

Here we compute cost of capital based on debt to equity, cost of equity and cost of debt.

In this method we also compute the survival probability which estimates the probability that the firm will still be up and running in 10 years.

This is estimated as 1 - company default spread ^ 10
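
The cost-of-capital part of this step can be sketched with the standard textbook formulas: relever the beta with the debt to equity ratio, price equity with the CAPM, and weight the cost of equity and the after-tax cost of debt by market values. The function names and inputs are hypothetical, whether get_final_info uses exactly these formulas is an assumption, and the survival probability is left out.

```python
def levered_beta(unlevered_beta, tax_rate, debt_equity):
    """Relever the industry beta using the company's debt to equity ratio."""
    return unlevered_beta * (1 + (1 - tax_rate) * debt_equity)


def cost_of_equity(riskfree, beta, equity_risk_premium):
    """CAPM: risk-free rate plus beta times the equity risk premium."""
    return riskfree + beta * equity_risk_premium


def cost_of_capital(equity_mkt, debt_mkt, cost_of_equity_value, cost_of_debt, tax_rate):
    """Market-value weighted average of cost of equity and after-tax cost of debt."""
    total = equity_mkt + debt_mkt
    equity_weight = equity_mkt / total
    debt_weight = debt_mkt / total
    return equity_weight * cost_of_equity_value + debt_weight * cost_of_debt * (1 - tax_rate)


# Made-up inputs, for illustration only
beta = levered_beta(unlevered_beta=1.0, tax_rate=0.25, debt_equity=0.5)
ke = cost_of_equity(riskfree=0.04, beta=beta, equity_risk_premium=0.06)
print(beta, ke, cost_of_capital(80.0, 20.0, 0.10, 0.05, 0.20))
```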

In [62]:
survival_prob, debt_equity, levered_beta, cost_of_equity, equity_weight, debt_weight, cost_of_capital = \
get_final_info(riskfree, cost_of_debt, equity_mkt, debt_mkt, unlevered_beta,
           tax_rate, final_erp, company_default_spread)

Let's now compute the liquidation value, meaning the value of the firm in case of liquidation.

This is estimated by taking:
- cash, securities and real estate properties at book value
- inventory, account receivables and PP&E at 75% of book value
- equity investments at 50% of book value
- all liabilities at book value

In [63]:
mr_receivables = data["mr_receivables"]["value"] / 1000
mr_inventory = data["mr_inventory"]["value"] / 1000
mr_other_current_assets = data["mr_other_assets"]["value"] / 1000
mr_ppe = data["mr_ppe"]["value"] / 1000
mr_property = data["mr_investment_property"]["value"] / 1000
mr_equity_investments = data["mr_equity_investments"]["value"] / 1000
mr_total_liabilities = data["mr_liabilities"]["value"] / 1000

debug = True
try:
    liquidation_value = calculate_liquidation_value(mr_cash, mr_receivables, mr_inventory, mr_securities,
                                                    mr_other_current_assets, mr_property,
                                                    mr_ppe, mr_equity_investments, mr_total_liabilities, equity_mkt,
                                                    mr_debt, mr_equity, mr_original_min_interest,
                                                    mr_minority_interest, debug=debug)
except Exception:
    print(traceback.format_exc())
    liquidation_value = 0

===== Liquidation Value =====

cash 25924000.0
securities 84956000.0
receivables 36036000.0
inventory 2315000.0
other_current_assets_ms 8532000.0
property 0.0
ppe 117560000.0
equity_investments 1600000.0

total_liabilities 108597000.0
debt_bv 15208000.0
equity_bv 260894000.0
minority_interest 0.0 => 0.0
damodaran_liquidation 126415250.0
net_net_wc_liquidation 49166000.0
liquidation_value 74915750.0
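
The two intermediate figures in this debug output can be reproduced by hand. The sketch below is not the project's `calculate_liquidation_value`; it is a hypothetical helper that applies the haircuts listed above, with other current assets also taken at 75% (an assumption, made because that is what reproduces the printed `damodaran_liquidation`), and that computes the net-net figure as current assets minus all liabilities.

```python
def liquidation_estimates(cash, securities, receivables, inventory,
                          other_current_assets, investment_property, ppe,
                          equity_investments, total_liabilities):
    """Return (haircut_estimate, net_net_working_capital)."""
    haircut_estimate = (
        cash + securities + investment_property            # 100% of book value
        + 0.75 * (receivables + inventory
                  + other_current_assets + ppe)             # 75% of book value
        + 0.50 * equity_investments                         # 50% of book value
        - total_liabilities                                 # 100% of book value
    )
    net_net = (cash + securities + receivables + inventory
               + other_current_assets - total_liabilities)
    return haircut_estimate, net_net


# Values copied from the debug output above.
haircut_estimate, net_net = liquidation_estimates(
    cash=25924000.0, securities=84956000.0, receivables=36036000.0,
    inventory=2315000.0, other_current_assets=8532000.0,
    investment_property=0.0, ppe=117560000.0,
    equity_investments=1600000.0, total_liabilities=108597000.0)

print(haircut_estimate, net_net)
```

For these inputs the printed `liquidation_value` equals `(haircut_estimate + 2 * net_net) / 3`. That is only an arithmetic observation about this one output; how the function actually blends the two estimates is not shown in this section.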





Now we finally have all the data we need to start building our valuation model scenarios.

In [64]:
print("===== Growth =====\n")
print("cagr", round(cagr,4))
print("riskfree", round(riskfree,4))
print("\n\n")
print("===== Model Helper Calculation =====\n")
print("roc_last", round(roc_last,4))
print("reinvestment_last", round(reinvestment_last,4))
print("growth_last", round(growth_last,4))
print("ROC history", roc)
print("roc_5y", round(roc_5y,4))
print("Reinvestment history", reinvestment)
print("reinvestment_5y", round(reinvestment_5y,4))
print("growth_5y", round(growth_5y,4))
print("revenue_5y", revenue_5y)
print("ebit_5y", ebit_5y)
print("roe_last", round(roe_last,4))
print("reinvestment_eps_last", round(reinvestment_eps_last,4))
print("growth_eps_last", round(growth_eps_last,4))
print("sales_capital_5y", round(sales_capital_5y,4))
print("roe_5y", round(roe_5y,4))
print("reinvestment_eps_5y", round(reinvestment_eps_5y,4))
print("growth_eps_5y", round(growth_eps_5y,4))
print("eps_5y", round(eps_5y,4))
print("payout_5y", round(payout_5y,4))
print("industry_payout", round(industry_payout,4))
print("target_sales_capital", round(target_sales_capital,4))
print("\n\n")
print("===== Recap Info =====\n")
print("country_default_spread", round(country_default_spread,4))
print("country_risk_premium", round(country_risk_premium,4))
print("riskfree", round(riskfree,4))
print("final_erp", round(final_erp,4))
print("unlevered_beta", round(unlevered_beta,4))
print("tax_rate", round(tax_rate,4))
print("levered_beta", round(levered_beta,4))
print("cost_of_equity", round(cost_of_equity,4))
print("cost_of_debt", round(cost_of_debt,4))
print("equity_weight", round(equity_weight,4))
print("debt_weight", round(debt_weight,4))
print("cost_of_capital", round(cost_of_capital,4))
print("equity_mkt", round(equity_mkt,2))
print("debt_mkt", round(debt_mkt,2))
print("debt_equity", round(debt_equity,4))
print("equity_bv_adj", round(mr_equity_adj,2))
print("debt_bv_adj", round(mr_debt_adj,2))
print("ebit_adj", round(ttm_ebit_adj,2))
print("company_default_spread", round(company_default_spread,4))
print("survival_prob", round(survival_prob,4))
print("liquidation value", round(liquidation_value, 2))
print("\n\n")
print("===== Other Model inputs =====\n")
print("operating_margin_5y", round(operating_margin_5y,4))
print("target_operating_margin", round(target_operating_margin,4))
print("target_debt_equity", round(target_debt_equity,4))
print("target_levered_beta", round(target_levered_beta,4))
print("target_cost_of_equity", round(target_cost_of_equity,4))
print("target_cost_of_debt", round(target_cost_of_debt,4))
print("target_cost_of_capital", round(target_cost_of_capital,4))
print("\n\n")

===== Growth =====

cagr 0.0979
riskfree 0.0394



===== Model Helper Calculation =====

roc_last 0.2632
reinvestment_last 0.7021
growth_last 0.1848
ROC history [0.22135934300621093, 0.22984520957423168, 0.23582586817601423, 0.33667822475994597, 0.2740669576134261]
roc_5y 0.2807
Reinvestment history [9314706.717299584, 34242706.71729958, 36423298.15189874, 44419839.78059073, 46048666.66666667]
reinvestment_5y 0.7246
growth_5y 0.2034
revenue_5y 250874612.9032258
ebit_5y 77367330.4486783
roe_last 0.2108
reinvestment_eps_last 1.0
growth_eps_last 0.2108
sales_capital_5y 1.009
roe_5y 0.2231
reinvestment_eps_5y 1.0
growth_eps_5y 0.2231
eps_5y 5.4014
payout_5y 0.0
industry_payout 0.9536
target_sales_capital 1.5995



===== Recap Info =====

country_default_spread 0.0111
country_risk_premium 0.0156
riskfree 0.0394
final_erp 0.0681
unlevered_beta 1.3355
tax_rate 0.2494
levered_beta 1.3509
cost_of_equity 0.1314
cost_of_debt 0.0574
equity_weight 0.9849
debt_weight 0.0151
cost_of_capital 0.1301
eq

### Valuation

And now, having everything we need, we can run our valuations for each of the 8 scenarios we presented at the beginning, i.e. the following 4 combinations, each with and without a recession:
- Earnings TTM & Historical Growth
- Earnings Normalized & Historical Growth
- Earnings TTM & Growth TTM
- Earnings Normalized & Growth Normalized

**Dividends valuation**:
1. Compute EPS for the next 10 years based on growth estimates
2. Compute payout ratio and Dividends/Share for the next 10 years
3. Compute terminal value as year-11 dividends per share / (cost of equity - growth)
4. Discount everything to today using the cost of equity
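
The four steps can be sketched as follows. This is not the project's `dividends_valuation`: `dividend_discount_value` is a hypothetical helper that takes the yearly paths as given, here the rounded arrays printed for the first scenario further down. The starting EPS is backed out from the printed first-year EPS, which is an assumption. The terminal value uses the year-11 dividend per share and is discounted with the 11-year cumulative factor, because that is what reproduces the printed 129.66 and 33.12.

```python
def dividend_discount_value(eps0, growth, payout, cost_of_equity):
    """Each list has 11 entries: years 1-10 plus the terminal year 11."""
    eps, cumulative = eps0, 1.0
    dividends, discount = [], []
    for g, p, ke in zip(growth, payout, cost_of_equity):
        eps *= 1 + g                   # step 1: grow EPS
        dividends.append(eps * p)      # step 2: dividends per share
        cumulative *= 1 + ke           # compounded cost of equity
        discount.append(cumulative)
    # Step 3: terminal value from the year-11 dividend per share.
    terminal_value = dividends[-1] / (cost_of_equity[-1] - growth[-1])
    # Step 4: discount the 10 yearly dividends and the terminal value.
    pv_dividends = sum(d / c for d, c in zip(dividends[:10], discount[:10]))
    pv_terminal = terminal_value / discount[-1]
    return terminal_value, pv_terminal, pv_dividends + pv_terminal


growth = [0.1074, 0.101, 0.0946, 0.0881, 0.0817, 0.0753,
          0.0689, 0.0625, 0.0561, 0.0497, 0.0433]
payout = [0.0867, 0.1734, 0.2601, 0.3468, 0.4335, 0.5202,
          0.6069, 0.6935, 0.7802, 0.8669, 0.9536]
cost_of_equity_path = [0.1315, 0.1317, 0.1318, 0.1319, 0.132, 0.1321,
                       0.1322, 0.1323, 0.1324, 0.1325, 0.1327]
eps0 = 6.0673 / 1.1074  # assumption: implied by printed year-1 EPS and growth

terminal_value, pv_terminal, value = dividend_discount_value(
    eps0, growth, payout, cost_of_equity_path)
print(round(terminal_value, 2), round(pv_terminal, 2), round(value, 2))
```

Up to rounding of the inputs, the three results line up with the printed `terminal_value`, `PV of terminal_value` and the first `stock value` line (52.77). The second printed stock value (49.63) is not reproduced by this sketch.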

**Free Cash Flow valuation**:
1. Compute revenue for the next 10 years based on growth estimates
2. Compute operating margin, EBIT, taxes, reinvestment and FCFF for the next 10 years
3. Compute terminal value as year-11 FCFF / (cost of capital - growth)
4. Discount everything to today using the cost of capital
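
Steps 1 and 2 can be sketched as below; the terminal value and discounting then follow the same pattern as in the dividends case, with the cost of capital instead of the cost of equity. This is a simplified illustration, not the project's `fcff_valuation`: `fcff_path` is a hypothetical helper, the inputs are made-up round numbers, and reinvestment is assumed to be the change in revenue divided by the sales-to-capital ratio.

```python
def fcff_path(revenue0, growth, operating_margin, tax_rate, sales_to_capital):
    """Project FCFF year by year; the three lists have one entry per year."""
    revenue_prev, fcff = revenue0, []
    for g, margin, s2c in zip(growth, operating_margin, sales_to_capital):
        revenue = revenue_prev * (1 + g)
        ebit_after_tax = revenue * margin * (1 - tax_rate)
        # Capital that must be invested to support the additional sales.
        reinvestment = (revenue - revenue_prev) / s2c
        fcff.append(ebit_after_tax - reinvestment)
        revenue_prev = revenue
    return fcff


# Illustrative inputs only: 10% growth, 20% margin, 25% tax, sales/capital of 2.
example = fcff_path(100.0, [0.10, 0.10], [0.20, 0.20], 0.25, [2.0, 2.0])
print([round(x, 2) for x in example])
```

With these inputs, year 1 has revenue 110, after-tax EBIT 16.5 and reinvestment 5, so FCFF is about 11.5; year 2 gives about 12.65.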

Finally, compute the liquidation value per share.

In [65]:
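liquidation_per_share = liquidation_value / mr_shares
if fx_rate is not None:
    # Convert to the price currency. fcff_value and div_value are only defined
    # in a later cell, so converting them here would raise a NameError.
    liquidation_per_share *= fx_rate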

In [66]:
dict_values_for_bi = {}

stock_value_div_ttm_fixed = dividends_valuation(EARNINGS_TTM, GROWTH_FIXED, cagr, growth_eps_5y, growth_5y,
                                                riskfree, industry_payout, cost_of_equity,
                                                target_cost_of_equity, growth_eps_last, eps_5y, payout_5y, ttm_eps_adj,
                                                reinvestment_eps_last, fx_rate, survival_prob, liquidation_per_share, debug=debug, dict_values_for_bi=dict_values_for_bi)
stock_value_div_norm_fixed = dividends_valuation(EARNINGS_NORM, GROWTH_FIXED, cagr, growth_eps_5y, growth_5y,
                                                 riskfree, industry_payout, cost_of_equity,
                                                 target_cost_of_equity, growth_eps_last, eps_5y, payout_5y, ttm_eps_adj,
                                                reinvestment_eps_last, fx_rate, survival_prob, liquidation_per_share, debug=debug, dict_values_for_bi=dict_values_for_bi)
stock_value_div_ttm_ttm = dividends_valuation(EARNINGS_TTM, GROWTH_TTM, cagr, growth_eps_5y, growth_5y, riskfree,
                                              industry_payout, cost_of_equity, target_cost_of_equity,
                                              growth_eps_last, eps_5y, payout_5y, ttm_eps_adj,
                                                reinvestment_eps_last, fx_rate, survival_prob, liquidation_per_share, debug=debug, dict_values_for_bi=dict_values_for_bi)
stock_value_div_norm_norm = dividends_valuation(EARNINGS_NORM, GROWTH_NORM, cagr, growth_eps_5y, growth_5y, riskfree,
                                                industry_payout, cost_of_equity,
                                                target_cost_of_equity, growth_eps_last, eps_5y, payout_5y, ttm_eps_adj,
                                                reinvestment_eps_last, fx_rate, survival_prob, liquidation_per_share, debug=debug, dict_values_for_bi=dict_values_for_bi)
stock_value_div_ttm_fixed_recession = dividends_valuation(EARNINGS_TTM, GROWTH_FIXED, cagr, growth_eps_5y, growth_5y,
                                                riskfree, industry_payout, cost_of_equity,
                                                target_cost_of_equity, growth_eps_last, eps_5y, payout_5y, ttm_eps_adj,
                                                reinvestment_eps_last, fx_rate, survival_prob, liquidation_per_share, debug=debug, recession=True, dict_values_for_bi=dict_values_for_bi)
stock_value_div_norm_fixed_recession = dividends_valuation(EARNINGS_NORM, GROWTH_FIXED, cagr, growth_eps_5y, growth_5y,
                                                 riskfree, industry_payout, cost_of_equity,
                                                 target_cost_of_equity, growth_eps_last, eps_5y, payout_5y, ttm_eps_adj,
                                                reinvestment_eps_last, fx_rate, survival_prob, liquidation_per_share, debug=debug, recession=True, dict_values_for_bi=dict_values_for_bi)
stock_value_div_ttm_ttm_recession = dividends_valuation(EARNINGS_TTM, GROWTH_TTM, cagr, growth_eps_5y, growth_5y, riskfree,
                                              industry_payout, cost_of_equity, target_cost_of_equity,
                                              growth_eps_last, eps_5y, payout_5y, ttm_eps_adj,
                                                reinvestment_eps_last, fx_rate, survival_prob, liquidation_per_share, debug=debug, recession=True, dict_values_for_bi=dict_values_for_bi)
stock_value_div_norm_norm_recession = dividends_valuation(EARNINGS_NORM, GROWTH_NORM, cagr, growth_eps_5y, growth_5y, riskfree,
                                                industry_payout, cost_of_equity,
                                                target_cost_of_equity, growth_eps_last, eps_5y, payout_5y, ttm_eps_adj,
                                                reinvestment_eps_last, fx_rate, survival_prob, liquidation_per_share, debug=debug, recession=True, dict_values_for_bi=dict_values_for_bi)

stock_value_fcff_ttm_fixed = fcff_valuation(EARNINGS_TTM, GROWTH_FIXED, cagr, riskfree, ttm_revenue, ttm_ebit_adj,
                                            target_operating_margin, mr_tax_benefits, tax_rate, sales_capital_5y, target_sales_capital,
                                            debt_equity, target_debt_equity, unlevered_beta, final_erp, cost_of_debt,
                                            target_cost_of_debt, mr_cash, mr_securities, debt_mkt, mr_minority_interest, survival_prob, mr_shares,
                                            liquidation_value, growth_last, growth_5y, revenue_5y, ebit_5y, fx_rate, mr_property, mr_sbc, debug=debug, dict_values_for_bi=dict_values_for_bi)
stock_value_fcff_norm_fixed = fcff_valuation(EARNINGS_NORM, GROWTH_FIXED, cagr, riskfree, ttm_revenue, ttm_ebit_adj,
                                             target_operating_margin, mr_tax_benefits, tax_rate, sales_capital_5y, target_sales_capital,
                                             debt_equity, target_debt_equity, unlevered_beta, final_erp, cost_of_debt,
                                             target_cost_of_debt, mr_cash, mr_securities, debt_mkt, mr_minority_interest, survival_prob, mr_shares,
                                             liquidation_value, growth_last, growth_5y, revenue_5y, ebit_5y, fx_rate, mr_property, mr_sbc, debug=debug, dict_values_for_bi=dict_values_for_bi)
stock_value_fcff_ttm_ttm = fcff_valuation(EARNINGS_TTM, GROWTH_TTM, cagr, riskfree, ttm_revenue, ttm_ebit_adj,
                                          target_operating_margin, mr_tax_benefits, tax_rate, sales_capital_5y, target_sales_capital,
                                          debt_equity, target_debt_equity, unlevered_beta, final_erp, cost_of_debt,
                                          target_cost_of_debt, mr_cash, mr_securities, debt_mkt, mr_minority_interest, survival_prob, mr_shares,
                                          liquidation_value, growth_last, growth_5y, revenue_5y, ebit_5y, fx_rate, mr_property, mr_sbc, debug=debug, dict_values_for_bi=dict_values_for_bi)
stock_value_fcff_norm_norm = fcff_valuation(EARNINGS_NORM, GROWTH_NORM, cagr, riskfree, ttm_revenue, ttm_ebit_adj,
                                            target_operating_margin, mr_tax_benefits, tax_rate, sales_capital_5y, target_sales_capital,
                                            debt_equity, target_debt_equity, unlevered_beta, final_erp, cost_of_debt,
                                            target_cost_of_debt, mr_cash, mr_securities, debt_mkt, mr_minority_interest, survival_prob, mr_shares,
                                            liquidation_value, growth_last, growth_5y, revenue_5y, ebit_5y, fx_rate, mr_property, mr_sbc, debug=debug, dict_values_for_bi=dict_values_for_bi)
stock_value_fcff_ttm_fixed_recession = fcff_valuation(EARNINGS_TTM, GROWTH_FIXED, cagr, riskfree, ttm_revenue, ttm_ebit_adj,
                                                      target_operating_margin, mr_tax_benefits, tax_rate, sales_capital_5y, target_sales_capital,
                                                      debt_equity, target_debt_equity, unlevered_beta, final_erp, cost_of_debt,
                                                      target_cost_of_debt, mr_cash, mr_securities, debt_mkt, mr_minority_interest, survival_prob, mr_shares,
                                                      liquidation_value, growth_last, growth_5y, revenue_5y, ebit_5y, fx_rate, mr_property, mr_sbc, debug=debug, recession=True, dict_values_for_bi=dict_values_for_bi)
stock_value_fcff_norm_fixed_recession = fcff_valuation(EARNINGS_NORM, GROWTH_FIXED, cagr, riskfree, ttm_revenue, ttm_ebit_adj,
                                                       target_operating_margin, mr_tax_benefits, tax_rate, sales_capital_5y, target_sales_capital,
                                                       debt_equity, target_debt_equity, unlevered_beta, final_erp, cost_of_debt,
                                                       target_cost_of_debt, mr_cash, mr_securities, debt_mkt, mr_minority_interest, survival_prob, mr_shares,
                                                       liquidation_value, growth_last, growth_5y, revenue_5y, ebit_5y, fx_rate, mr_property, mr_sbc, debug=debug, recession=True, dict_values_for_bi=dict_values_for_bi)
stock_value_fcff_ttm_ttm_recession = fcff_valuation(EARNINGS_TTM, GROWTH_TTM, cagr, riskfree, ttm_revenue, ttm_ebit_adj,
                                                    target_operating_margin, mr_tax_benefits, tax_rate, sales_capital_5y, target_sales_capital,
                                                    debt_equity, target_debt_equity, unlevered_beta, final_erp, cost_of_debt,
                                                    target_cost_of_debt, mr_cash, mr_securities, debt_mkt, mr_minority_interest, survival_prob, mr_shares,
                                                    liquidation_value, growth_last, growth_5y, revenue_5y, ebit_5y, fx_rate, mr_property, mr_sbc, debug=debug, recession=True, dict_values_for_bi=dict_values_for_bi)
stock_value_fcff_norm_norm_recession = fcff_valuation(EARNINGS_NORM, GROWTH_NORM, cagr, riskfree, ttm_revenue, ttm_ebit_adj,
                                                      target_operating_margin, mr_tax_benefits, tax_rate, sales_capital_5y, target_sales_capital,
                                                      debt_equity, target_debt_equity, unlevered_beta, final_erp, cost_of_debt,
                                                      target_cost_of_debt, mr_cash, mr_securities, debt_mkt, mr_minority_interest, survival_prob, mr_shares,
                                                      liquidation_value, growth_last, growth_5y, revenue_5y, ebit_5y, fx_rate, mr_property, mr_sbc, debug=debug, recession=True, dict_values_for_bi=dict_values_for_bi)

===== Dividends Valuation - EARNINGS_TTM + GROWTH_FIXED + recession:False =====

expected_growth [0.1074 0.101  0.0946 0.0881 0.0817 0.0753 0.0689 0.0625 0.0561 0.0497
 0.0433]
earnings_per_share [6.0673, 6.6799, 7.3116, 7.9561, 8.6064, 9.2546, 9.8924, 10.5106, 11.1, 11.6513, 12.1552]
payout_ratio [0.0867 0.1734 0.2601 0.3468 0.4335 0.5202 0.6069 0.6935 0.7802 0.8669
 0.9536]
dividends_per_share [0.526, 1.1582, 1.9016, 2.759, 3.7306, 4.8139, 6.0032, 7.2895, 8.6606, 10.1008, 11.5915]
cost_of_equity [0.1315 0.1317 0.1318 0.1319 0.132  0.1321 0.1322 0.1323 0.1324 0.1325
 0.1327]
cumulative_cost_equity [1.1315, 1.2805, 1.4492, 1.6404, 1.8569, 2.1022, 2.3801, 2.695, 3.0519, 3.4564, 3.915]
present_value [0.4648, 0.9045, 1.3121, 1.6819, 2.0091, 2.29, 2.5222, 2.7048, 2.8377, 2.9223]
terminal_value 129.66
PV of terminal_value 33.12
stock value (price curr) 52.77
stock value (fin curr) 49.63



===== Dividends Valuation - EARNINGS_NORM + GROWTH_FIXED + recession:False =====

expected_growth [0.1

Now let's compute the Expected values for our valuation.

We take the median value for the 4 FCFF no recession scenarios, and the median value for the 4 FCFF recession scenarios.

We then take the weighted average of these 2 values based on the recession probability.

And of course we do the same for the Dividends scenarios.

In [67]:
fcff_values_list = [stock_value_fcff_ttm_fixed, stock_value_fcff_norm_fixed, stock_value_fcff_ttm_ttm,
                       stock_value_fcff_norm_norm]
fcff_recession_values_list = [stock_value_fcff_ttm_fixed_recession, stock_value_fcff_norm_fixed_recession,
                                          stock_value_fcff_ttm_ttm_recession, stock_value_fcff_norm_norm_recession]
div_values_list = [stock_value_div_ttm_fixed, stock_value_div_norm_fixed, stock_value_div_ttm_ttm,
                   stock_value_div_norm_norm]
div_recession_values_list = [stock_value_div_ttm_fixed_recession, stock_value_div_norm_fixed_recession,
                                         stock_value_div_ttm_ttm_recession, stock_value_div_norm_norm_recession]

recession_probability = 0.5
fcff_value = summary_valuation(fcff_values_list)
fcff_recession_value = summary_valuation(fcff_recession_values_list)
ev_fcff = fcff_value * (1 - recession_probability) + fcff_recession_value * recession_probability
div_value = summary_valuation(div_values_list)
div_recession_value = summary_valuation(div_recession_values_list)
ev_dividends = div_value * (1 - recession_probability) + div_recession_value * recession_probability

The delta variables are the percentage difference between the current stock price and the value our model assigns to one share of the company (price / value - 1).

A negative value means the stock is undervalued according to our model.

A positive value means the stock is overvalued according to our model. If the model value is not positive, the delta is set to 10 as a placeholder.

In [68]:
# Guard on the expected values themselves, since they are the denominators.
fcff_delta = price_per_share / ev_fcff - 1 if ev_fcff > 0 else 10
div_delta = price_per_share / ev_dividends - 1 if ev_dividends > 0 else 10
liquidation_delta = price_per_share / liquidation_per_share - 1 if liquidation_per_share > 0 else 10

In [69]:
print("FCFF values")
print([round(x, 2) for x in fcff_values_list])
print("\nFCFF values w/ Recession")
print([round(x, 2) for x in fcff_recession_values_list])
print("\n\nDiv values")
print([round(x, 2) for x in div_values_list])
print("\nDiv values w/ Recession")
print([round(x, 2) for x in div_recession_values_list])

print("\n\n\n")

print("Price per Share", price_per_share)
print("FCFF Result", ev_fcff)
print("FCFF Deviation", fcff_delta)
print("Dividends Result", ev_dividends)
print("Dividends Deviation", div_delta)


FCFF values
[59.31, 54.0, 75.09, 71.83]

FCFF values w/ Recession
[42.92, 39.34, 46.93, 43.88]


Div values
[49.63, 48.94, 79.39, 82.64]

Div values w/ Recession
[35.41, 34.92, 48.82, 49.95]




Price per Share 120.97
FCFF Result 54.48574881278017
FCFF Deviation 1.2202135904503764
Dividends Result 53.31465660538761
Dividends Deviation 1.2689820717662768
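
The printed results can be cross-checked from the rounded scenario values. The sketch below assumes that `summary_valuation` returns the median of its list, as described in the text (the function itself is not shown here); `expected_value` is a hypothetical helper.

```python
from statistics import median


def expected_value(no_recession, recession, recession_probability):
    """Probability-weighted average of the two scenario medians."""
    return (median(no_recession) * (1 - recession_probability)
            + median(recession) * recession_probability)


# Rounded scenario values and price copied from the output above.
fcff_values = [59.31, 54.0, 75.09, 71.83]
fcff_recession_values = [42.92, 39.34, 46.93, 43.88]
div_values = [49.63, 48.94, 79.39, 82.64]
div_recession_values = [35.41, 34.92, 48.82, 49.95]
price_per_share = 120.97

ev_fcff = expected_value(fcff_values, fcff_recession_values, 0.5)
ev_dividends = expected_value(div_values, div_recession_values, 0.5)
fcff_delta = price_per_share / ev_fcff - 1
div_delta = price_per_share / ev_dividends - 1

print(round(ev_fcff, 2), round(fcff_delta, 2))
print(round(ev_dividends, 2), round(div_delta, 2))
```

Both expected values and both deviations agree with the printed ones to about two decimals; the small differences come from using the rounded scenario values as inputs. A deviation of about 1.2 means the price is a bit more than twice the model value.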


### Company Risk Assessment

This last phase is all about estimating risks in investing in the company.

We are going to consider:
- Company size
- Company complexity
- Share dilution in the last 5 years
- Changes in auditor
- Company type
- Consistent growth of inventory and receivables compared to revenue
- Qualitative information we extracted previously using ChatGPT model

In [70]:
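# equity_mkt is expressed in thousands, so the thresholds below are in
# thousands of USD (e.g. 50 * 10 ** 3 means 50 million USD).
market_cap_USD = equity_mkt * fx_rate_financial_USD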
if market_cap_USD < 50 * 10 ** 3:
    company_size = "Nano"
elif market_cap_USD < 300 * 10 ** 3:
    company_size = "Micro"
elif market_cap_USD < 2 * 10 ** 6:
    company_size = "Small"
elif market_cap_USD < 10 * 10 ** 6:
    company_size = "Medium"
elif market_cap_USD < 200 * 10 ** 6:
    company_size = "Large"
else:
    company_size = "Mega"
print(f"Company Size {company_size}")

Company Size Mega


For company type we have 7 different possibilities, where a single company can match one or more of them.

The categories are built around the concepts explained by Peter Lynch in his book "One Up on Wall Street".

- fast grower (high growth)
- stalwart (moderate growth)
- slow grower (low growth)
- declining (negative growth)
- turnaround (money-losing or debt-afflicted company with a clear way to turn around)
- asset play (liquidation value higher than market cap)
- cyclical (results affected by business cycle)

In [71]:
complexity = company_complexity(doc, industry, company_size)
dilution = company_share_diluition(shares)
inventory = get_selected_years(data, "inventory", initial_year-1, final_year)
receivables = get_selected_years(data, "receivables", initial_year-1, final_year)
company_type = get_company_type(revenue_growth, mr_debt_adj, equity_mkt, liquidation_value, operating_margin_5y, industry)

In [72]:
auditor = find_auditor(doc)
print(f"Auditor {auditor}")

Auditor Ernst & Young LLP We have served as the Company's auditor since 1999


Here is the result of the risk assessment.

In [73]:
print("===== Risk Assessment =====\n")
print("MKT CAP USD: ", market_cap_USD)
print("company_size", company_size)
print("company complexity", complexity)
print("share dilution", round(dilution, 4))
print("revenue", revenue)
print("inventory", inventory)
print("receivables", receivables)
print("company_type", company_type)
print("Auditor", auditor)
print()

===== Risk Assessment =====

MKT CAP USD:  1538980340.0
company_size Mega
company complexity 4
share dilution 1.0732
revenue [136819000.0, 161857000.0, 182527000.0, 257637000.0, 282836000.0]
inventory [749000.0, 1107000.0, 999000.0, 728000.0, 1170000.0, 2670000.0]
receivables [18336000.0, 20838000.0, 25326000.0, 30930000.0, 39304000.0, 40258000.0]
company_type {'fast_grower': True, 'stalward': False, 'slow_grower': False, 'declining': False, 'turn_around': False, 'asset_play': False, 'cyclical': True}
Auditor Ernst & Young LLP We have served as the Company's auditor since 1999



To conclude we can show the qualitative information from the most recent company filings (as shown in the qualitative analysis sections).

This can inform the investor about the company business, the company specific risks, the management view on the future of the company and the industry, and much more.

With the summaries of the most recent 10-Qs and 8-Ks we can stay up to date with the latest developments of the company, which is extremely useful when deciding whether to invest.

In [74]:
recent_docs = get_recent_docs(cik, doc["filing_date"])
for d in recent_docs:

    print("##############")
    print(d["form_type"], d["filing_date"], d["_id"])
    print("##############\n")

    if not mongodb.check_document_exists("parsed_documents", d["_id"]):
        parse_document(d)

    parsed_doc = mongodb.get_document("parsed_documents", d["_id"])

    if not mongodb.check_document_exists("items_summary", d["_id"]):
        sections_summary(parsed_doc)

    summary_doc = mongodb.get_document("items_summary", d["_id"])

    for k, v in summary_doc.items():
        if isinstance(v, dict):

            print(f"=== {k} ===")

            for info in v["summary"]:
                print(info)
                
            print()

    print("\n")

##############
10-K 2023-02-03 https://www.sec.gov/Archives/edgar/data/1652044/000165204423000016/goog-20221231.htm
##############

=== business ===
Alphabet Inc., the parent company of Google, is committed to innovation and solving big problems
They strive to make information universally accessible and provide tools for knowledge, health, happiness, and success
Google Services, including search, YouTube, and Google Assistant, offer intuitive experiences
Google Cloud helps businesses overcome challenges and drive growth
Alphabet also invests in Other Bets to address various industry problems
They focus on AI to assist and inspire people in different fields
Privacy and security are top priorities, with a commitment to building secure products and giving users control over their data
The company aims for sustainability and has ambitious goals for net-zero emissions and a circular economy
Alphabet values a diverse and inclusive workforce, offering competitive compensation, benefits, and c

=== Item 5.02. Departure of Directors or Certain Officers ===
Alphabet Inc.'s stockholders approved the amendment and restatement of the company's 2021 Stock Plan at the 2023 Annual Meeting
The amendment increases the share reserve by 170,000,000 shares of Class C capital stock
More details can be found in the 2023 Proxy Statement and the full text of the amended stock plan is included in the filing.

=== Item 5.07. Submission of Matters to a Vote of Security Holders ===
At Alphabet's 2023 Annual Meeting, stockholders voted on nineteen proposals
The individuals listed were elected as directors
The appointment of Ernst & Young LLP as the independent registered public accounting firm was ratified
The amendment and restatement of the 2021 Stock Plan was approved
The compensation awarded to named executive officers was approved
Stockholders voted for a frequency of advisory votes on executive compensation once every three years
Various stockholder proposals regarding lobbying, congruency, 

## <a class="anchor" id="4-bullet" href="#toc">4. Visualization</a>

We built some dashboards using the 2 main Business Intelligence tools, Tableau and PowerBI.

Using these dashboards we can understand every company we value at a glance using a visual format.

### Tableau Overview

A ticker filter in the top left lets us choose the company we want to visualize. We can see the current price/share and general information at the top.

Then we have a geographic segmentation where we can see the distribution of the countries/regions where the company operates.

And finally at the bottom we have two tables where we can explore the summaries of the sections of the most recent filings of the company (the ones we built before using the OpenAI API).

<img src="../images/TB_overview1.png">

Clicking the desired section on the left, we can instantly see the summary of that section.

This is pretty awesome, and can save hours and hours of research time.

(We know because we do that for the companies we invest in)

<img src="../images/TB_overview2.PNG">

### PowerBI Overview

We built the same dashboard also in PowerBI, just for fun.

The functionality is pretty much the same.

<img src="../images/PBI_overview1.PNG">

### Tableau Valuation

In this dashboard we can quickly visualize the different valuations we got from our 8 scenarios, both for FCFF valuations and dividends valuations.

Then we have a chart with the estimated revenue, EBIT, FCFF and dividends in the next 10 years.

We can change the scenario we are visualizing by changing the parameters on the right.

At the bottom we can see the components that make up the company value we estimate, and the estimated value/share.

<img src="../images/TB_valuation1.PNG">

### PowerBI Valuation

We also built this dashboard in PowerBI.

We have some different visual elements here to better leverage the features of the tool, like a waterfall chart instead of a simple bar chart (we could also build that in Tableau, but it would have required a little data crunching to get it to the correct format).

<img src="../images/PBI_valuation2.PNG">

## <a class="anchor" id="5-bullet" href="#toc">5. Conclusions</a>

We have completed this project from data collection to data visualization with the goal of building a tool to help us value USA companies.

### Qualitative analysis results
So far we have seen the **qualitative analysis**, in which we extracted text from company reports, structured it in sections, and then summarized the most important sections into key insights to better understand the company and help us value it.

As we have seen for *Alphabet Inc.*, we got various insights that can give us a better understanding of the company business, its mission and how its revenue is structured.

We can further increase our knowledge of what happened in recent periods by reading the quarterly reports (10-Q) and the major-event announcements (8-K).

### Quantitative analysis results
Then we extracted the company's financial data and computed all the measures and values needed to perform the **quantitative analysis**.

Regarding *Alphabet Inc.* financial data, we can see that the model correctly classifies it as a *fast grower* *Mega* company.

However, the model values a share at roughly 54 (both FCFF and dividends) against a current price of 120.97, so according to the model the company is currently overpriced and we would have to wait for a better time to invest in it.


### Next Steps

At the moment our project still has various limitations that need to be addressed in order to reach a production-ready product. Here is a non-comprehensive list:
- Manage files larger than 16MB.
    - Due to the MongoDB document size limitation we excluded files that are larger than 16MB. This could be solved by using a storage system like S3 to save raw files instead of storing them directly in MongoDB.
- Industry segmentation.
    - Understanding the segmentation of company revenue by industry is not trivial. Every company has the freedom to report operating segments as it sees fit, so there is no standardization of industries. In our project we just assumed that 100% of the revenue is generated in the main industry in which the company is categorized. This is usually not the case.
- Better report sections identification.
    - There are some limitations in finding sections inside the filings' documents. At the moment we rely on the table of contents or a default items list, so sections can be missed in filings that follow neither.
- Improve qualitative analysis including sentiment analysis of the company products.
    - Sentiment analysis is a known practice when valuing companies. A possible way to implement this could be using Twitter APIs or similar to integrate what people say about a company, or examining Google Trends to find out the level of interest in the company and its products.
- Industry and industry CAGR to compare with company growth.
    - Retrieving the industry market size and industry growth could be useful to compare the market growth with the company growth, to understand if it's sustainable in the long term.
- Aggregate similar risks and define a peer voting system to define risk impact for a specific kind of risk.
    - A possible way to manage the subjectivity of assessing risks in a company could be a peer voting system, where people vote on defined risks based on their knowledge and share their opinion.
- Include shareholders composition.
    - Another useful measure we didn't include is the shareholder composition, to see if there is a majority shareholder that could add an additional layer of risk.
- Valuation for financial companies, which requires different metrics.
    - Our valuation does not work for financial companies at the moment, as they report different measures. Also, for financial companies we would probably just compute the Dividends valuation, as explained by Prof. Damodaran.
- Include MD&A and financial notes sections analysis of 10-K and 10-Q, and other long sections.
    - Due to the long text in MD&A and other sections we excluded their summaries. These sections need further text processing before they can be used as input to an LLM. While the 16k context model can handle them, the results it gave us were pretty much meaningless.