Reproducible steps for getting 10-K filings from [EDGAR](https://www.sec.gov/edgar).


# Example

1. Navigate to the [EDGAR company search](https://www.sec.gov/edgar/searchedgar/companysearch).
2. Enter "ENV" into the search box.
3. On the right-hand side, expand "10-K (annual reports) and 10-Q (quarterly reports)".
4. Take the filing at the top of the list.

As of 2024/01/03

"ENV" is:

* https://www.sec.gov/edgar/browse/?CIK=1337619
* https://www.sec.gov/ix?doc=/Archives/edgar/data/1337619/000133761923000012/env-20221231.htm

"MSFT" is:

* https://www.sec.gov/edgar/browse/?CIK=789019
* https://www.sec.gov/ix?doc=/Archives/edgar/data/789019/000095017023035122/msft-20230630.htm
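
Both pairs of links follow the same pattern, so they can be rebuilt from a CIK, an accession number, and the primary document's file name. The sketch below is inferred only from the two examples above. The helper names `browse_url` and `viewer_url` are hypothetical and do not come from any library.

```python
def browse_url(cik: int) -> str:
    """Company landing page, as in the first bullet of each example."""
    return f'https://www.sec.gov/edgar/browse/?CIK={cik}'


def viewer_url(cik: int, accession: str, doc_name: str) -> str:
    """Inline viewer link, as in the second bullet of each example.

    `accession` is the filing's accession number, with or without dashes.
    The folder name in the URL is that number with the dashes removed.
    """
    folder = accession.replace('-', '')
    return f'https://www.sec.gov/ix?doc=/Archives/edgar/data/{cik}/{folder}/{doc_name}'


print(browse_url(1337619))
print(viewer_url(1337619, '0001337619-23-000012', 'env-20221231.htm'))
```

The two printed lines should match the "ENV" links listed above.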


In [None]:
import csv
import requests
import sec_downloader.types as sec_t #type: ignore
import typing as t
from sec_downloader import Downloader #type: ignore
from pathlib import Path
from tqdm.notebook import tqdm
from datetime import datetime

In [None]:
data_folder = Path('./data/10-k/raw')
tickers_file = Path('./data/tickers.csv')
bad_cik_file = Path('./data/bad_cik.csv')
user_agent = 'TextCorpusLabs/EDGAR'
limit = 20
form_type = '10-K'

_tickers.csv_ is a required file.
If you don't have it, please run _Get Tickers.ipynb_ before proceeding.

In [None]:
if not tickers_file.exists():
    print('Run "Get Tickers.ipynb" to generate the tickers file')
    exit(1)

We need to pull the companies to process from _tickers.csv_.
However, the SEC's data pull contains effective duplicates (AACI vs AACIU).
Both share the same CIK number (1844817), which is how the SEC identifies the company.
This means we load the distinct CIKs from _tickers.csv_, not the ticker symbols.

In [None]:
with open(tickers_file, mode = 'r', encoding = 'utf-8') as fp:
    reader = csv.reader(fp)
    next(reader)  # skip the header row
    ciks = list(set(row[0] for row in reader))
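
As a small illustration of why the cell above de-duplicates on CIK, the sketch below runs the same logic on an in-memory stand-in for _tickers.csv_. The sample rows and the `CIK,Ticker` column layout are assumptions made for the example; only the first column being the CIK is implied by the cell above.

```python
import csv
import io

# Made-up stand-in for tickers.csv; the CIK is assumed to be the first column.
sample = io.StringIO(
    'CIK,Ticker\n'
    '1844817,AACI\n'
    '1844817,AACIU\n'
    '789019,MSFT\n'
)

reader = csv.reader(sample)
next(reader)  # skip the header row
ciks = sorted(set(row[0] for row in reader))
print(ciks)  # ['1844817', '789019']
```

AACI and AACIU collapse into a single entry. Note that the CIKs are strings here, so the sort is lexicographic rather than numeric.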

We need to filter out CIKs that have proven hard to pull in the past (`bad_cik_file`).
As we process, any CIK that has an issue will be added to this list.

In [None]:
if bad_cik_file.exists():
    with open(bad_cik_file, mode = 'r', encoding = 'utf-8') as fp:
        reader = csv.reader(fp)
        next(reader)
        bad_ciks = [row[0] for row in reader]
else:
    bad_ciks = []

ciks = [cik for cik in sorted(ciks) if cik not in bad_ciks]

In [None]:
def get_filing_metadata(cik: str, form_type: str, limit: int) -> t.Union[None, t.List[sec_t.FilingMetadata]]:
    downloader = Downloader(user_agent, '')
    try:
        return downloader.get_filing_metadatas(sec_t.RequestedFilings(ticker_or_cik  = cik, form_type = form_type, limit = limit))
    except ValueError:
        return None

In [None]:
def get_xhtml_file_path(data_folder: Path, metadata: sec_t.FilingMetadata) -> Path:
    filing_date = datetime.strptime(metadata.filing_date, '%Y-%m-%d')
    filing_date = datetime.strftime(filing_date, '%Y%m%d')
    report_year = datetime.strptime(metadata.report_date, '%Y-%m-%d')
    report_year = datetime.strftime(report_year, '%Y')
    return data_folder.joinpath(f'{report_year}/{metadata.cik}.{filing_date}.xhtml')
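
To make the resulting folder layout concrete, here is a minimal sketch of the same path logic applied to a stand-in metadata object. The dates are invented for illustration, and the exact format of `cik` in real `FilingMetadata` objects (for example, whether it is zero-padded) is an assumption.

```python
from datetime import datetime
from pathlib import Path
from types import SimpleNamespace

# Stand-in for sec_t.FilingMetadata with only the fields the path needs.
# All three values are made up for this example.
meta = SimpleNamespace(cik='1337619', filing_date='2023-02-27', report_date='2022-12-31')

# Same steps as get_xhtml_file_path: compact filing date, four-digit report year.
filing_date = datetime.strptime(meta.filing_date, '%Y-%m-%d').strftime('%Y%m%d')
report_year = datetime.strptime(meta.report_date, '%Y-%m-%d').strftime('%Y')
path = Path('./data/10-k/raw').joinpath(f'{report_year}/{meta.cik}.{filing_date}.xhtml')

print(path.as_posix())  # data/10-k/raw/2022/1337619.20230227.xhtml
```

Files are grouped by the year the report covers, while the file name carries the date the filing was submitted.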

In [None]:
def get_filing_xhtml(session: requests.Session, metadata: sec_t.FilingMetadata) -> t.Union[None, str]:
    doc_url = metadata.primary_doc_url
    with session.get(doc_url) as response:
        if response.status_code == 200:
            return response.text
    return None

Get the metadata for each `cik`'s `form_type` filings, up to `limit` of them (roughly the past `limit` years for an annual form such as the 10-K).
It is possible for this to return fewer than `limit` filings.
If no metadata is returned, add the `cik` to the list of `bad_ciks` so we don't keep retrying it.

In [None]:
with requests.Session() as session:
    session.headers['User-Agent'] = user_agent
    for cik in tqdm(ciks):
        print(f'Working on cik ({cik}): ', end = '')
        metadata = get_filing_metadata(cik, form_type, limit)
        if metadata is None:
            bad_ciks.append(cik)
            print(f'no {form_type} filings')
        else:
            for meta in metadata:
                print(f'{meta.report_date} ', end = '')
                xhtml_file = get_xhtml_file_path(data_folder, meta)
                xhtml_file.parent.mkdir(parents = True, exist_ok = True)
                if xhtml_file.exists():
                    continue
                xhtml = get_filing_xhtml(session, meta)
                if xhtml is None:
                    print('(failed) ', end = '')
                else:
                    with open(xhtml_file, mode = 'w', encoding = 'utf-8') as fp:
                        fp.write(xhtml)
            print('')

In [None]:
with open(bad_cik_file, mode = 'w', encoding = 'utf-8', newline = '') as fp:
    writer = csv.writer(fp, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_ALL)
    writer.writerow(['CIK'])
    for cik in bad_ciks:
        writer.writerow([cik])
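
The file written above is what the earlier loading cell reads on the next run. The sketch below checks that round trip in memory, using the same `csv.writer` settings and the same header-skipping read. The CIK values are made up, and `io.StringIO` stands in for the real file.

```python
import csv
import io

bad_ciks = ['1844817', '1337619']  # made-up values for illustration

# Write with the same csv.writer settings as the cell above.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
writer.writerow(['CIK'])
for cik in bad_ciks:
    writer.writerow([cik])

# Read it back the way the loading cell does: skip the header, take column 0.
buffer.seek(0)
reader = csv.reader(buffer)
next(reader)
restored = [row[0] for row in reader]

print(restored == bad_ciks)  # True
```

Because `QUOTE_ALL` wraps every field in quotes, the CIKs come back as the same strings that were written.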