10-K-scraper

This script downloads 10-K filing text (.htm) through the SEC EDGAR API, scrapes specific sections, and saves them to .txt files. You are welcome to modify the script.

Platform & Dependency:

Python 3.6 and standard libraries
BeautifulSoup

Introduction:

TenKDownloader(CIK, start_time, end_time) returns a TenKDownloader object. CIK can be a single string or a list of strings. Check https://www.sec.gov/Archives/edgar/cik-lookup-data.txt to find the CIKs of the companies you are looking for. In some cases a ticker symbol also works for this argument. start_time and end_time are strings in the format %Y%m%d (for example, '20150101').
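As a rough illustration of the argument formats described above, the sketch below builds and checks %Y%m%d strings and wraps a single CIK in a list. The helper names format_date, is_valid_date_string, and as_cik_list are hypothetical and are not part of TenK.py.

```python
from datetime import date, datetime


def format_date(d):
    """Format a date object as a %Y%m%d string."""
    return d.strftime("%Y%m%d")


def is_valid_date_string(s):
    """Return True if s is an 8-digit string that parses as %Y%m%d."""
    if len(s) != 8 or not s.isdigit():
        return False
    try:
        datetime.strptime(s, "%Y%m%d")
    except ValueError:
        return False
    return True


def as_cik_list(cik):
    """Accept one CIK string or an iterable of CIK strings; always return a list."""
    return [cik] if isinstance(cik, str) else list(cik)
```

The resulting strings can then be passed as start_time and end_time, for example format_date(date(2015, 1, 1)) for the start of 2015.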

Attributes:

Method:
- download(path='./data') downloads the corresponding 10-K filings to ./data/<CIK>/<date>.htm. It uses BeautifulSoup to scrape the web page and retrieve the .htm file.

Data:
- all_url is a Python dictionary (key: CIK; value: list of (date, filing url) tuples).
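
To make the shape of all_url concrete, here is a small sketch that walks a dictionary of that form and maps each entry to the ./data/<CIK>/<date>.htm layout used by download(). The sample data and the helper planned_paths are hypothetical, and the URLs are placeholders rather than real filing links.

```python
# Hypothetical sample in the documented shape: {CIK: [(date, filing url), ...]}.
# The URLs below are placeholders, not real EDGAR links.
sample_all_url = {
    "6281": [("20171122", "https://example.com/6281/20171122.htm")],
    "6769": [
        ("20180223", "https://example.com/6769/20180223.htm"),
        ("20170224", "https://example.com/6769/20170224.htm"),
    ],
}


def planned_paths(all_url, path="./data"):
    """Return (local_path, url) pairs following the <path>/<CIK>/<date>.htm layout."""
    pairs = []
    for cik, filings in all_url.items():
        for filing_date, url in filings:
            pairs.append(("{}/{}/{}.htm".format(path, cik, filing_date), url))
    return pairs


for local_path, url in planned_paths(sample_all_url):
    print(local_path, "<-", url)
```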

TenKScraper(section, next_section) returns a TenKScraper object. section is the heading of the section to extract, such as 'item 1', and next_section is the heading at which extraction stops. For example, to scrape section 'item 2', create TenKScraper('item 2', 'item 3').

Attributes:

Method:
- scrape(htm_file, txt_file) scrapes the section text, writes it to txt_file, and also returns the text as a string. The implementation is based on the work at http://community.mis.temple.edu/zuyinzheng/pythonworkshop/ and uses regular expressions to recognize bold tags. You can customize the patterns, which are named p1-p13 in the code. Note that the output directory must exist, but the .txt file itself does not need to exist beforehand.
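
The general idea of matching a section between two bold headings can be sketched as follows. This is a simplified illustration with a single pattern and is not the p1-p13 patterns from TenK.py; the function name extract_section is hypothetical. It assumes headings are wrapped in <b> or <strong> tags and returns the first match, which in a real filing may be the table of contents rather than the section body.

```python
import re
from html import unescape


def extract_section(html, section, next_section):
    """Return the plain text between a bold `section` heading and the next
    bold `next_section` heading, or None if no such pair is found."""
    # Allow any run of whitespace between the words of each heading.
    start = r"\s+".join(re.escape(word) for word in section.split())
    end = r"\s+".join(re.escape(word) for word in next_section.split())
    pattern = (
        r"<(?:b|strong)\b[^>]*>\s*" + start + r"\b.*?</(?:b|strong)>"
        r"(.*?)"
        r"<(?:b|strong)\b[^>]*>\s*" + end + r"\b"
    )
    match = re.search(pattern, html, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    text = re.sub(r"<[^>]+>", " ", match.group(1))  # drop remaining tags
    text = unescape(text)                           # decode entities such as &amp;
    return " ".join(text.split())                   # collapse whitespace
```

Real filings mark headings in many different ways (inline styles, nested tags, anchors), which is presumably why the script carries several alternative patterns instead of one.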

Example

from TenK import TenKDownloader, TenKScraper

company_CIK = ['6281', '6769']
downloader = TenKDownloader(company_CIK, '20150101', '20181101')
downloader.download()

scraper = TenKScraper('Item 1A', 'Item 1B')  # scrape text starting at Item 1A and stopping at Item 1B
scraper2 = TenKScraper('Item 7', 'Item 8')
scraper.scrape('./data/6281/20171122.htm', './data/txt/test.txt') # make sure ./data/txt exists
scraper2.scrape('./data/6769/20180223.htm', './data/txt/test2.txt')
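
Because scrape() expects the output directory to exist already, one option is to create it before calling scrape(). The sketch below uses only the standard library; ensure_parent_dir is a hypothetical helper and is not part of TenK.py.

```python
import os


def ensure_parent_dir(file_path):
    """Create the directory that will contain file_path if it is missing,
    and return that directory path (empty string for a bare file name)."""
    parent = os.path.dirname(file_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return parent
```

For instance, calling ensure_parent_dir('./data/txt/test.txt') before scraper.scrape(...) would create ./data/txt when it does not exist yet.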
