ABANDONED

sec_scraping

This repository aims to make it easy to parse SEC filings and to build a model from them that can be used with a database. It was created for dilutionscout.com.

how to set up dilution_db:

Fork or clone the repository.
Create an empty database in PostgreSQL.
Make a copy of the public.env file as .env, or add public.env to .gitignore.
Fill the .env file with the database info.
Add the paths for files to the .env file.
Add the Polygon API key to the .env file. Get a free Polygon API key if you haven't got one (see the Polygon pricing page).
Specify which companies and which SEC forms to track in main/configs.py (class AppConfig).
Then run one of the following (in a file above the main directory):

if you want to parse the filings

    from dilution_db import DilutionDB
    from main.configs import FactoryConfig, GlobalConfig
    from boot import bootstrap_dilution_db
    
    config = FactoryConfig(GlobalConfig(ENV_STATE="prod").ENV_STATE)()

    db = bootstrap_dilution_db(
        start_orm=True,
        config=config
    )
    db.inital_setup()
    # this will download at least 20GB of bulk data, the base info for the
    # company and all filings of the specified forms,
    # and will try to parse the filings and write to the database

if you only want the base and xbrl companyfacts data (cashburn, outstanding common shares)

    from dilution_db import DilutionDB
    from main.configs import FactoryConfig, GlobalConfig
    from boot import bootstrap_dilution_db
    
    config = FactoryConfig(GlobalConfig(ENV_STATE="prod").ENV_STATE)()

    db = bootstrap_dilution_db(
        start_orm=True,
        config=config
    )

    db.util.inital_company_setup(
            config.DOWNLOADER_ROOT_PATH,
            config.POLYGON_OVERVIEW_FILES_PATH,
            config.POLYGON_API_KEY,
            config.APP_CONFIG.TRACKED_FORMS,
            config.APP_CONFIG.TRACKED_TICKERS,
            after="2010-1-1",
            before=None
        )
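Both setup paths read their settings from the .env file, so it can help to check that file before starting a long download. The following is a minimal sketch, assuming the .env keys share the names of the config attributes used above (DOWNLOADER_ROOT_PATH, POLYGON_OVERVIEW_FILES_PATH, POLYGON_API_KEY); the real key names in public.env may differ, and the helper functions are hypothetical and not part of this repository.

```python
# Assumed key names: they mirror the config attributes used above,
# but the actual names in public.env may differ.
REQUIRED_KEYS = (
    "DOWNLOADER_ROOT_PATH",
    "POLYGON_OVERVIEW_FILES_PATH",
    "POLYGON_API_KEY",
)


def read_env_text(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Invented sample content; in practice, pass Path(".env").read_text().
sample = """
# database settings omitted in this sample
DOWNLOADER_ROOT_PATH=/data/sec/downloads
POLYGON_API_KEY=
"""
print(missing_keys(read_env_text(sample)))
```

An empty list means every assumed key is present and non-empty.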

accessing company data through the model

Let's assume we already have some data relating to "AAPL" in the database. A company object containing that data can be retrieved like so:

    from dilution_db import DilutionDB
    from main.configs import FactoryConfig, GlobalConfig
    from boot import bootstrap_dilution_db
    
    config = FactoryConfig(GlobalConfig(ENV_STATE="prod").ENV_STATE)()

    db = bootstrap_dilution_db(
        start_orm=True,
        config=config
    )
    
    with db.uow as uow:
        company = uow.company.get(symbol="AAPL")
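The `with db.uow as uow` block follows the unit-of-work pattern: a context manager exposes a repository for the duration of the block. The toy below only illustrates that pattern with in-memory data. All class names are hypothetical stand-ins, and it is not how this repository implements its unit of work.

```python
from dataclasses import dataclass


@dataclass
class ToyCompany:
    # Hypothetical stand-in for the repository's Company model.
    symbol: str
    name: str


class ToyCompanyRepository:
    """In-memory repository keyed by ticker symbol."""

    def __init__(self, companies):
        self._by_symbol = {c.symbol: c for c in companies}

    def get(self, symbol):
        # This toy returns None for unknown symbols.
        return self._by_symbol.get(symbol)


class ToyUnitOfWork:
    """Context manager that exposes a repository inside a with-block."""

    def __init__(self, companies):
        self.company = ToyCompanyRepository(companies)
        self.committed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # A database-backed unit of work would roll back uncommitted work here.
        return False

    def commit(self):
        self.committed = True


uow = ToyUnitOfWork([ToyCompany(symbol="AAPL", name="Apple Inc.")])
with uow as active:
    company = active.company.get(symbol="AAPL")
    missing = active.company.get(symbol="ZZZZ")

print(company)
print(missing)
```

Whether the real `uow.company.get` returns None or raises for an unknown symbol is not stated here, so it is worth checking the result before using it.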

Parsers

using parser directly:

from main.parser import parsers

filing_path = r"absolute_path/to/filing.htm"
parser = parsers.Parser8K()
filing_sections = parser.parse(filing_path)

using parser through FilingFactory

from main.parser.parsers import Parser8K, ParserEFFECT, BaseHTMFiling, BaseFiling, FilingFactory, ParserFactory

# register parser for form type and file extension
parser_factory_default = [
    (".htm", "8-K", Parser8K),
    (".xml", "EFFECT", ParserEFFECT)
]
parser_factory = ParserFactory(defaults=parser_factory_default)

# register object which handles parsing
filing_factory_defaults = [
    ("8-K", ".htm", BaseHTMFiling),
    ("EFFECT", ".xml", BaseFiling)
]
filing_factory = FilingFactory(defaults=filing_factory_defaults)

filing_path_8k = r"absolute_path/to/filing_8k.htm"

parsed_filing = filing_factory.create_filing("8-K", ".htm", path=filing_path_8k)
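The registration lists above suggest a lookup keyed by form type and file extension. As a rough illustration of that idea only, here is a self-contained toy registry. `ToyFilingFactory` and `ToyHTMFiling` are hypothetical names, and the real FilingFactory and ParserFactory may work differently.

```python
from pathlib import Path


class ToyFilingFactory:
    """Maps (form_type, extension) pairs to the callable that builds a filing."""

    def __init__(self, defaults=()):
        self._builders = {}
        for form_type, extension, builder in defaults:
            self.register(form_type, extension, builder)

    def register(self, form_type, extension, builder):
        # A later registration for the same pair replaces the earlier one.
        self._builders[(form_type, extension)] = builder

    def create_filing(self, form_type, extension, **kwargs):
        try:
            builder = self._builders[(form_type, extension)]
        except KeyError:
            raise ValueError(
                f"nothing registered for {form_type!r} with {extension!r}"
            ) from None
        return builder(**kwargs)


class ToyHTMFiling:
    # Hypothetical filing object that only remembers its path.
    def __init__(self, path):
        self.path = Path(path)


factory = ToyFilingFactory(defaults=[("8-K", ".htm", ToyHTMFiling)])
filing = factory.create_filing("8-K", ".htm", path="filings/example_8k.htm")
print(type(filing).__name__, filing.path.name)
```

Keying on the pair rather than on the form type alone lets the same form be handled differently depending on whether it arrives as .htm or .xml.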

Extractors

To use an extractor independently of a database:

from main.domain.model import Company
from main.services.messagebus import MessageBus
from main.services.unit_of_work import FakeCompanyUnitOfWork
from main.parser import extractors
from main.parser import parsers

import datetime
from pathlib import Path

# create fake messagebus
mbus = MessageBus(FakeCompanyUnitOfWork, dict())

# create empty company to hold extracted values
company = Company(
    name="some_company_name",
    cik="0000000000",
    sic=9999,
    symbol="ANY"
    )

filing_path = r"path/cik/form_type/accession_number/filing_s3.htm"
filing = parsers.filing_factory(
        path=filing_path,
        filing_date=datetime.date(2018, 9, 28),
        # accession number and cik are read from the folder names in the path
        accession_number=Path(filing_path).parents[0].name,
        cik=Path(filing_path).parents[2].name,
        file_number=None,
        form_type="S-3",
        extension=".htm"
)
extractor = extractors.HTMS3Extractor()
extractor.extract_form_values(filing, company, mbus)
# this way of accessing what happened during extraction 
# will likely change in the future and be made available 
# as some kind of result object 
commands_issued_during_extraction = mbus.collect_command_history()
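The example above reads the accession number and CIK from the folder names of a path laid out as path/cik/form_type/accession_number/file. A small sketch of that lookup, using only the standard library, is shown below. The helper name `ids_from_filing_path` and the sample path are made up for illustration.

```python
from pathlib import PurePosixPath


def ids_from_filing_path(path: str) -> dict:
    """Read cik, form type and accession number from the folder names."""
    p = PurePosixPath(path)
    return {
        "accession_number": p.parents[0].name,  # folder containing the file
        "form_type": p.parents[1].name,         # one level up
        "cik": p.parents[2].name,               # two levels up
        "extension": p.suffix,
    }


# Invented example path following the cik/form_type/accession_number layout.
example = "downloads/0000000000/S-3/0000000000-18-000001/filing_s3.htm"
print(ids_from_filing_path(example))
```

`PurePosixPath` only splits on forward slashes. For paths with backslashes, `pathlib.PureWindowsPath` would be the counterpart.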

filing_nlp

Uses spaCy to find financial instruments and their related attributes in unstructured text. It can be used independently:

from main.parser.filing_nlp import SpacyFilingTextSearch

# instantiate the search helper (assumes the constructor takes no arguments)
search = SpacyFilingTextSearch()

some_str = "This is a text about 500000 shares of common stock (The 'Shares')."

doc = search.nlp(some_str)

# this will mark:
#   'common stock' as a SECU entity
#   '500000' as a SECUQUANTITY entity
#   It will also detect 'Shares' as an alias and
#   therefore as a SECU entity as well, with a relation
#   to the 'common stock' SECU entity.
#   This relation will be saved in doc._.single_secu_alias_tuples
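To make the quantity, security and alias relation concrete without installing spaCy, here is a toy that swaps in a single regular expression for the spaCy pipeline. It only handles the exact phrasing of the example sentence and is not how filing_nlp works. The pattern and the function name `find_secu_aliases` are invented for illustration.

```python
import re

# Matches "<number> shares of <security> (the '<alias>')" and nothing more general.
PATTERN = re.compile(
    r"(?P<quantity>\d[\d,]*)\s+shares\s+of\s+"
    r"(?P<secu>[A-Za-z ]+?)\s*"
    r"\(\s*the\s+['\"](?P<alias>[^'\"]+)['\"]\s*\)",
    re.IGNORECASE,
)


def find_secu_aliases(text):
    """Return one dict per match with the quantity, security name and alias."""
    results = []
    for match in PATTERN.finditer(text):
        results.append(
            {
                "quantity": int(match.group("quantity").replace(",", "")),
                "secu": match.group("secu").strip(),
                "alias": match.group("alias"),
            }
        )
    return results


some_str = "This is a text about 500000 shares of common stock (The 'Shares')."
print(find_secu_aliases(some_str))
```

A regex like this breaks as soon as the wording varies, which is the kind of variation an NLP pipeline is meant to handle.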

further examples WIP

XBRL Fact search

If you want to extract XBRL facts from SEC companyfacts files, check the data_aggregation folder:

1. Use the pysec-downloader package to retrieve companyfacts, or run bulk_files.py to retrieve the companyfacts and submissions files (around 25GB of files).
2. Use the _get_fact_data function from fact_extractor.py to get a specific fact, or all facts matching a regex pattern.
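For orientation, the sketch below shows what matching facts by a regex pattern can look like on a companyfacts-style dictionary (facts, then taxonomy, then fact name, then units, then a list of entries). It is not the repository's `_get_fact_data`, whose signature is not shown here. `find_facts` is a hypothetical helper and the sample values are invented.

```python
import re


def find_facts(companyfacts: dict, name_pattern: str, taxonomy: str = "us-gaap"):
    """Yield (fact_name, unit, entry) for every fact whose name matches the regex."""
    regex = re.compile(name_pattern)
    facts = companyfacts.get("facts", {}).get(taxonomy, {})
    for fact_name, fact in facts.items():
        if not regex.search(fact_name):
            continue
        for unit, entries in fact.get("units", {}).items():
            for entry in entries:
                yield fact_name, unit, entry


# Tiny hand-made sample in the companyfacts layout; all values are invented.
sample = {
    "entityName": "Example Corp",
    "facts": {
        "us-gaap": {
            "CommonStockSharesOutstanding": {
                "units": {
                    "shares": [
                        {"end": "2020-12-31", "val": 1000, "form": "10-K"},
                        {"end": "2021-12-31", "val": 1200, "form": "10-K"},
                    ]
                }
            },
            "NetCashProvidedByUsedInOperatingActivities": {
                "units": {
                    "USD": [{"end": "2021-12-31", "val": -500, "form": "10-K"}]
                }
            },
        }
    },
}

for name, unit, entry in find_facts(sample, r"SharesOutstanding$"):
    print(name, unit, entry["end"], entry["val"])
```

With real data, the dictionary would come from `json.load` on a downloaded companyfacts file.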
