Reproducible steps for getting 10-K filings from [EDGAR](https://www.sec.gov/edgar).

You will need the latest versions of the following modules:

1. [sec-downloader](https://github.com/Elijas/sec-downloader)
2. [sec-parser](https://github.com/alphanome-ai/sec-parser)


In [None]:
!pip install --upgrade pip

In [None]:
!pip install sec-downloader
!pip install sec-parser
!pip install lxml

# Example

1. Navigate to the [EDGAR company search](https://www.sec.gov/edgar/searchedgar/companysearch).
2. Enter "ENV" into the search box.
3. On the right-hand side, expand "10-K (annual reports) and 10-Q (quarterly reports)".
4. Open the most recent 10-K in the list.
5. Scroll to "Item 7".

As of 2024/01/03, these steps lead to the following filings.

"ENV" is:

* https://www.sec.gov/edgar/browse/?CIK=1337619
* https://www.sec.gov/ix?doc=/Archives/edgar/data/1337619/000133761923000012/env-20221231.htm
* pages 38-65

"MSFT" is:

* https://www.sec.gov/edgar/browse/?CIK=789019
* https://www.sec.gov/ix?doc=/Archives/edgar/data/789019/000095017023035122/msft-20230630.htm
* pages 40-57
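
Both viewer links above share one layout: the `doc` query parameter holds the archive path `/Archives/edgar/data/<CIK>/<accession number without dashes>/<primary document>`. A minimal sketch of splitting a link into those parts, using only the standard library. The `FilingRef` and `parse_viewer_url` names are hypothetical and not part of sec-downloader.

```python
from typing import NamedTuple
from urllib.parse import parse_qs, urlparse


class FilingRef(NamedTuple):
    cik: str        # SEC Central Index Key
    accession: str  # accession number, re-dashed as NNNNNNNNNN-YY-NNNNNN
    document: str   # primary document file name


def parse_viewer_url(url: str) -> FilingRef:
    """Split an EDGAR inline-viewer link (.../ix?doc=/Archives/...) into parts."""
    doc_path = parse_qs(urlparse(url).query)['doc'][0]
    # doc_path looks like /Archives/edgar/data/<cik>/<accession>/<document>
    parts = doc_path.strip('/').split('/')
    cik, raw_accession, document = parts[3], parts[4], parts[5]
    # the archive path drops the dashes; put them back (10-2-6 digits)
    accession = f'{raw_accession[:10]}-{raw_accession[10:12]}-{raw_accession[12:]}'
    return FilingRef(cik, accession, document)


ref = parse_viewer_url(
    'https://www.sec.gov/ix?doc=/Archives/edgar/data/789019/'
    '000095017023035122/msft-20230630.htm')
print(ref)
```

The helper assumes the link follows the layout shown; a link with a different path depth would raise an `IndexError` or `KeyError` rather than return a wrong answer.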

# Python

1. Get all the `symbols` in _symbols.csv_
2. For each `symbol` in `symbols`
   1. If _corpus/raw_ is missing the xhtml associated with the `symbol`, download it
3. For each xhtml file in _corpus/raw_
   1. Get the element IDs associated with Item 7 and 7A
   2. TODO: Collect all the text inside [Item 7, Item 7A), save it to _corpus/text_
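
_symbols.csv_ itself is not shown in this notebook. The loading cell further down skips one header row, reads the ticker from the first column, and ignores tickers that start with `#`, so a file shaped like the sample here should work. This is a sketch under those assumptions: the `symbol` column name, the `XYZ` placeholder ticker, and the `read_symbols` helper are hypothetical.

```python
import csv
import io

# Hypothetical contents of symbols.csv: a header row, then one ticker per row.
# '#XYZ' is a placeholder showing how a ticker can be commented out.
SAMPLE = """symbol
ENV
#XYZ

MSFT
"""


def read_symbols(fp) -> list[str]:
    """Return tickers from the first column, skipping blank and '#' rows."""
    reader = csv.reader(fp)
    next(reader, None)  # skip the header row, if there is one
    symbols = []
    for row in reader:
        symbol = row[0].strip() if row else ''
        if symbol and not symbol.startswith('#'):
            symbols.append(symbol)
    return symbols


print(read_symbols(io.StringIO(SAMPLE)))
```

Filtering inside the reader keeps the download loop free of special cases, at the cost of hiding which tickers were skipped.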


In [None]:
import csv
from sec_downloader import Downloader
from pathlib import Path
from lxml import etree

In [None]:
symbol_file = Path('./symbols.csv')
raw_files = Path('./corpus/raw')
company_name = 'SokpheanalHuynh'
email_address = 'sokpheanal.huynh@gmail.com'

# create the download folder on first run; no-op if it already exists
raw_files.mkdir(parents = True, exist_ok = True)

In [None]:
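# newline = '' is what the csv module expects for files it reads
with open(symbol_file, mode = 'r', encoding = 'utf-8', newline = '') as fp:
    reader = csv.reader(fp)
    next(reader, None)  # skip the header row
    # the ticker is in the first column; ignore blank rows
    symbols = [row[0] for row in reader if row]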

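# these values identify the caller to EDGAR
dl = Downloader(company_name, email_address)
for symbol in symbols:
    # a leading '#' comments a symbol out of symbols.csv
    if symbol.startswith('#'):
        continue
    # skip filings that were already downloaded
    xhtml_file = raw_files.joinpath(f'{symbol}.xhtml')
    if xhtml_file.exists():
        continue
    print(f'Retrieving {symbol}...')
    # fetch the 10-K for this ticker
    xhtml = dl.get_filing_html(ticker = symbol, form = "10-K")
    with open(xhtml_file, mode = 'wb') as fp:
        fp.write(xhtml)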

In [None]:
def get_ref_ids(node: etree.Element) -> tuple[str, str]:
    """Return the table-of-contents hrefs for Item 7 and the entry after it.

    Raises IndexError if the table of contents is not laid out as expected.
    """
    # every step of the XPath must carry the namespace prefix
    # https://stackoverflow.com/questions/38936185/etree-xpath-return-entire-html-instead-of-text
    ns_map = {'x':'http://www.w3.org/1999/xhtml'}
    xpaths = [
        # table-of-contents cell holding the Item 7 title
        ".//x:table/x:tr/x:td//*[contains(text(),'Financial Condition and Results of Operations')]",
        # the link that is, or wraps, that title
        "ancestor-or-self::x:a[contains(@href, '#')]",
        # links in the rows after it; the first is normally Item 7A
        "ancestor::x:tr/following-sibling::x:tr/x:td//x:a[contains(@href, '#')]"]
    t1: etree.Element = node.xpath(xpaths[0], namespaces = ns_map)[0]
    t1_href: str = t1.xpath(xpaths[1], namespaces = ns_map)[0].attrib['href']
    t2_href: str = t1.xpath(xpaths[2], namespaces = ns_map)[0].attrib['href']
    return t1_href, t2_href

In [None]:
parser = etree.XMLParser(encoding = 'utf-8', recover = True, ns_clean = True)
# only the downloaded filings; iterdir() would also pick up stray files
for xhtml_file in sorted(raw_files.glob('*.xhtml')):
    print(f'Parsing {xhtml_file.name}...')
    xhtml = xhtml_file.read_bytes()
    root: etree.Element = etree.fromstring(xhtml, parser)
    ids = get_ref_ids(root)
    print(ids[0])
    print(ids[1])
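
The remaining TODO, collecting the text inside [Item 7, Item 7A), could be sketched as a document-order walk that starts collecting at the first anchor and stops at the second. This is an illustration under assumptions, not a tested solution for real filings: it assumes each href points at an element whose `id` attribute matches, it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs on its own, and the `text_between` helper and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def text_between(root: ET.Element, start_href: str, end_href: str) -> str:
    """Collect text from the start anchor up to, not including, the end anchor."""
    start_id = start_href.lstrip('#')
    end_id = end_href.lstrip('#')
    chunks: list[str] = []
    collecting = False

    def walk(el: ET.Element) -> bool:
        """Visit el in document order; return True once the end anchor is hit."""
        nonlocal collecting
        el_id = el.get('id')
        if el_id == end_id:
            return True
        if el_id == start_id:
            collecting = True
        if collecting and el.text:
            chunks.append(el.text)
        for child in el:
            if walk(child):
                return True
            # tail text belongs after the child, so it is added on the way out
            if collecting and child.tail:
                chunks.append(child.tail)
        return False

    walk(root)
    # collapse the whitespace runs left over from the markup
    return ' '.join(''.join(chunks).split())


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p>Table of contents</p>
<div id="item_7"><p>Item 7. MD&amp;A</p></div>
<p>Revenue <b>grew</b> this year.</p>
<div id="item_7a"><p>Item 7A. Market risk</p></div>
<p>Not collected.</p>
</body></html>"""

root = ET.fromstring(SAMPLE)
print(text_between(root, '#item_7', '#item_7a'))
```

The same walk should carry over to the lxml tree parsed above, though lxml also yields comment nodes when iterating children, and those would need filtering. Writing the returned string to a per-symbol file under _corpus/text_ would complete the step.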