
In [1]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'sec-edgar-cik-ticker-exchange:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F2471230%2F7005648%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240702%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240702T011448Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D04c52c559beb7aac2e949b0f66cf231b5fcb6cc9c71812fe10c3f4b4acdc9e98d402062b94d9f7a2096fa8060c7d1e3adb1327c098ff4e775f0a52b5f6d3d627061d8021b254b65f3619c6ad75dfcd8197d87a05ca555a8af8008d5eedd52710706355734fe0441cb15a6f65535c978fe349ccc1d9760e2f4eba52def96a09588a70a47bf48b98f994c0e6f9e2ee08a7d5748207f5ece2f3271c2a36bb3ef7a0e9770dbd113b5bfcc90842001868e0457f69550a10feca3bda8f18aa50923c9933ecfa8a666f3fe7fe52bf3a481533f24762401d7033c88c2bb369d5fb2dab7b4b71e8aeb2e608cd13179e586554c01841484bf41f70e0450bbec9215f42f640'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              # use a distinct name so the tarfile module is not shadowed
              with tarfile.open(tfile.name) as tar_archive:
                tar_archive.extractall(destination_path)
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


Downloading sec-edgar-cik-ticker-exchange, 189367 bytes compressed
Downloaded and uncompressed: sec-edgar-cik-ticker-exchange
Data source import complete.


# Using SEC EDGAR RESTful data APIs

This notebook shows how to retrieve information reported by regulated entities to the U.S. Securities and Exchange Commission (SEC).

The SEC maintains the EDGAR system with information about all regulated entities (companies, funds, individuals). Accessing the data is free and there are a number of [ways to access the data](https://www.sec.gov/os/accessing-edgar-data).

"data.sec.gov" was created to host RESTful data Application Programming Interfaces (APIs) delivering JSON-formatted data to external customers and to web pages on SEC.gov. These APIs do not require any authentication or API keys to access.

Currently included in the APIs are the submissions history by filer and the XBRL data from financial statements (forms 10-Q, 10-K, 8-K, 20-F, 40-F, 6-K, and their variants).

The JSON structures are updated throughout the day, in real time, as submissions are disseminated.

In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/sec-edgar-cik-ticker-exchange/company_tickers_exchange.json


In [None]:
# data are published in JSON format so we will need json library
import json


# Finding CIK of company

EDGAR assigns to filers a unique numerical identifier, known as a Central Index Key (CIK), when they sign up to make filings to the SEC. CIK numbers remain unique to the filer; they are not recycled.

A list of all CIKs matched with entity names is available for download [(13 MB, text file)](https://www.sec.gov/Archives/edgar/cik-lookup-data.txt). Note that this list includes funds and individuals and is historically cumulative for company names. Thus a given CIK may be associated with multiple names in the case of company or fund name changes, and the list contains some entities that no longer file with the SEC.

We will be using a smaller (611 kB) JSON [Kaggle dataset](https://www.kaggle.com/datasets/svendaj/sec-edgar-cik-ticker-exchange), which sources its data directly from EDGAR and is the input for this notebook. This dataset contains only company names, CIKs, tickers and the associated stock exchanges.

In [None]:
# Let's convert CIK JSON to pandas DataFrame
# First load the data into python dictionary

with open("../input/sec-edgar-cik-ticker-exchange/company_tickers_exchange.json", "r") as f:
    CIK_dict = json.load(f)

In [None]:
# dataset contains two sections
CIK_dict.keys()

In [None]:
# fields specifies the meaning and order of company data
# we will use it as column names
CIK_dict["fields"]

In [None]:
# data section is list of records/lists for each company
# we will use it as DataFrame rows
print("Number of company records:", len(CIK_dict["data"]))
CIK_dict["data"][:5]    # first 5 records

In [None]:
# convert CIK_dict to pandas
CIK_df = pd.DataFrame(CIK_dict["data"], columns=CIK_dict["fields"])
CIK_df

## Select the ticker of company used in this example

Subsequent information retrieval will use the selected `ticker` and its associated CIK.

In [None]:
# finding company row with given ticker
ticker = "AMMX"
CIK_df[CIK_df["ticker"] == ticker]

In [None]:
CIK = CIK_df[CIK_df["ticker"] == ticker].cik.values[0]

In [None]:
# finding companies containing substring in company name
substring = "oil"
CIK_df[CIK_df["name"].str.contains(substring, case=False)]

# Entity’s current filing history

Each entity’s current filing history is available at the following URL:

* https://data.sec.gov/submissions/CIK##########.json

Where the ########## is the entity’s 10-digit Central Index Key (CIK), including leading zeros.

This JSON data structure contains metadata such as current name, former names, stock exchanges and ticker symbols of publicly traded companies. The `filings` → `recent` property contains at least one year of filings or 1,000 of the most recent filings (whichever is more) in a compact columnar data array. If the entity has additional filings, `files` will contain an array of additional JSON files and the date range of the filings each one contains.
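
The columnar layout described above stores one list per field, and entries at the same index belong to the same filing. A minimal sketch of turning such a structure into one record per filing follows. `columnar_to_records` is a hypothetical helper name, and the sample values are invented for illustration rather than taken from a real response:

```python
def columnar_to_records(columns):
    """Convert {"field": [v0, v1, ...], ...} into [{"field": v0, ...}, ...]."""
    fields = list(columns)
    lengths = {len(columns[name]) for name in fields}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    # zip(*columns) walks the lists in parallel, yielding one row per filing
    return [dict(zip(fields, row)) for row in zip(*(columns[name] for name in fields))]


# Illustrative sample shaped like company_filings["filings"]["recent"]
# (values are invented, not real filings).
sample_recent = {
    "accessionNumber": ["0000000000-24-000002", "0000000000-24-000001"],
    "form": ["10-K", "8-K"],
    "filingDate": ["2024-03-01", "2024-01-15"],
}

records = columnar_to_records(sample_recent)
print(records[0]["form"], records[0]["filingDate"])
```

Passing the same dictionary to `pd.DataFrame` gives the equivalent table directly. The explicit version is only meant to show how the columns line up.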

In [None]:
# preparation of input data, using ticker and CIK set earlier
url = f"https://data.sec.gov/submissions/CIK{str(CIK).zfill(10)}.json"
url

# Reading from RESTful API

EDGAR requires that HTTP requests are identified with a proper [User-Agent in the header and comply with the fair use policy (currently max. 10 requests per second)](https://www.sec.gov/os/accessing-edgar-data). At minimum you need to supply your own e-mail address in the User-Agent field (otherwise you will get a 403/Forbidden error). If you provide the Host field, be sure to use the data.sec.gov server and not www.sec.gov as mentioned in the SEC example (this would result in a 404/Not Found error).
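
When looping over many CIKs it is easy to exceed the request limit quoted above. A minimal sketch of a client-side throttle is shown here. `MinIntervalThrottle` is a hypothetical helper, not part of `requests` or of any SEC tooling, and the clock and sleep functions are injectable only so the logic can be checked without actually waiting:

```python
import time


class MinIntervalThrottle:
    """Enforce a minimum spacing between consecutive calls to wait()."""

    def __init__(self, max_per_second=10, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect the minimum interval; return the delay."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        # remember when this request is effectively sent
        self._last = now + delay
        return delay
```

A possible use is to create one `throttle = MinIntervalThrottle()` and call `throttle.wait()` immediately before each `requests.get(url, headers=header)`.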

In [None]:
# read response from REST API with `requests` library and format it as python dict

import requests
header = {
  "User-Agent": "your.email@email.com"#, # remaining fields are optional
#    "Accept-Encoding": "gzip, deflate",
#    "Host": "data.sec.gov"
}

company_filings = requests.get(url, headers=header).json()
company_filings.keys()

In [None]:
company_filings["addresses"]

In [None]:
company_filings["filings"]["recent"].keys()

# Creating DataFrame with submitted filings

`company_filings["filings"]["recent"]` contains the most recent filings (at least one year's worth or 1,000, whichever is more), sorted from latest to oldest.

In [None]:
company_filings_df = pd.DataFrame(company_filings["filings"]["recent"])
company_filings_df

In [None]:
# filter only Annual reports
company_filings_df[company_filings_df.form == "10-K"]

# Accessing specific filing document

Let's download the latest Annual Report (10-K). Files are stored in a browsable directory structure by CIK and accession number:
* https://www.sec.gov/Archives/edgar/data/{CIK}/{accession-number}/
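
The path pattern above can be wrapped in a small function. This is a sketch under the same conventions the notebook uses (CIK without zero padding, accession number with dashes removed). `filing_document_url` is a hypothetical name, and the accession number and document name in the example are placeholders rather than a real filing:

```python
def filing_document_url(cik, accession_number, document):
    """Build the Archives URL of one document inside a filing."""
    cik_part = str(int(cik))  # drop any leading zeros, as in the pattern above
    accession_part = accession_number.replace("-", "")  # dashes are not used in the path
    return f"https://www.sec.gov/Archives/edgar/data/{cik_part}/{accession_part}/{document}"


# Placeholder values for illustration only.
example_url = filing_document_url("0000320193", "0000320193-24-000001", "example-10k.htm")
print(example_url)
```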

In [None]:
access_number = company_filings_df[company_filings_df.form == "10-K"].accessionNumber.values[0].replace("-", "")

file_name = company_filings_df[company_filings_df.form == "10-K"].primaryDocument.values[0]

url = f"https://www.sec.gov/Archives/edgar/data/{CIK}/{access_number}/{file_name}"
url

In [None]:
# downloading and saving requested document to working directory
req_content = requests.get(url, headers=header).content.decode("utf-8")

with open(file_name, "w") as f:
    f.write(req_content)

## Saving the document as PDF


In [None]:
pip install weasyprint

In [None]:
from weasyprint import HTML

HTML(string=req_content, base_url="").write_pdf(file_name + ".pdf")

In [None]:
!ls -al .

# XBRL data APIs

eXtensible Business Reporting Language (XBRL) is an XML-based format for reporting financial statements used by the SEC and financial regulatory agencies across the world. XBRL, in a separate XML file or more recently embedded in quarterly and annual HTML reports as inline XBRL, was first required by the SEC in 2009. XBRL facts must be associated with a standard US-GAAP or IFRS taxonomy. Companies can also extend standard taxonomies with their own custom taxonomies.

The following XBRL APIs aggregate facts from across submissions that
1. Use a non-custom taxonomy (e.g. us-gaap, ifrs-full, dei, or srt)
1. Apply to the entire filing entity

This ensures that facts have a consistent context and meaning across companies and between filings and are comparable between companies and across time.

## All company concepts data
## data.sec.gov/api/xbrl/companyfacts/

This API returns all the company concepts data for a company in a single API call:

* https://data.sec.gov/api/xbrl/companyfacts/CIK##########.json
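
Because the response bundles every concept, it helps to see what is available before picking one. The sketch below walks the nesting that the notebook relies on (`facts` → taxonomy → tag → `units` → unit). `list_concepts` is a hypothetical helper, and the sample dictionary is invented and much smaller than a real response:

```python
def list_concepts(company_facts):
    """Yield (taxonomy, tag, unit, number_of_datapoints) for every reported concept."""
    for taxonomy, concepts in company_facts.get("facts", {}).items():
        for tag, concept in concepts.items():
            for unit, datapoints in concept.get("units", {}).items():
                yield taxonomy, tag, unit, len(datapoints)


# Invented sample with the same nesting as a companyfacts response.
sample_facts = {
    "entityName": "Example Corp",
    "facts": {
        "dei": {"EntityCommonStockSharesOutstanding": {"units": {"shares": [{"val": 1000}]}}},
        "us-gaap": {"AssetsCurrent": {"units": {"USD": [{"val": 5}, {"val": 7}]}}},
    },
}

for row in list_concepts(sample_facts):
    print(row)
```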

In [None]:
url = f"https://data.sec.gov/api/xbrl/companyfacts/CIK{str(CIK).zfill(10)}.json"
url

In [None]:
company_facts = requests.get(url, headers=header).json()

# get the current assets values as reported over time and make it pandas DataFrame
curr_assets_df = pd.DataFrame(company_facts["facts"]["us-gaap"]["AssetsCurrent"]["units"]["USD"])
curr_assets_df

In [None]:
# keep only the values that are assigned to a frame
curr_assets_df[curr_assets_df.frame.notna()]


In [None]:
import plotly.express as px
pd.options.plotting.backend = "plotly"

curr_assets_df.plot(x="end", y="val",
                    title=f"{company_filings['name']}, {ticker}: Current Assets",
                   labels= {
                       "val": "Value ($)",
                       "end": "Quarter End"
                   })

## Getting datapoints of single concept
## data.sec.gov/api/xbrl/companyconcept/

The company-concept API returns all the XBRL disclosures from a single company (CIK) and concept (a taxonomy and tag) in a single JSON file, with a separate array of facts for each unit of measure that the company has chosen to disclose (e.g. net profits reported in U.S. dollars and in Canadian dollars).

* https://data.sec.gov/api/xbrl/companyconcept/CIK##########/us-gaap/AccountsPayableCurrent.json
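
Since facts arrive as one array per unit of measure, a concept disclosed in several units needs to be combined before it can be viewed as one table. A minimal sketch, assuming the `units` layout used elsewhere in this notebook, is given below. `flatten_units` is a hypothetical helper and the sample values are invented:

```python
def flatten_units(concept):
    """Return one list of datapoints, each tagged with its unit of measure."""
    rows = []
    for unit, datapoints in concept.get("units", {}).items():
        for point in datapoints:
            rows.append({"unit": unit, **point})
    return rows


# Invented sample: the same concept disclosed in two currencies.
sample_concept = {
    "tag": "ProfitLoss",
    "units": {
        "USD": [{"end": "2021-12-31", "val": 100}],
        "CAD": [{"end": "2021-12-31", "val": 127}],
    },
}

rows = flatten_units(sample_concept)
print(rows)
```

The resulting list can be passed to `pd.DataFrame(rows)` to get a single table with a `unit` column.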


In [None]:
# let's retrieve current assets for comparison with company facts API
url = f"https://data.sec.gov/api/xbrl/companyconcept/CIK{str(CIK).zfill(10)}/us-gaap/AssetsCurrent.json"
url

In [None]:
curr_assets_dict = requests.get(url, headers=header).json()
curr_assets_dict.keys()

In [None]:
curr_assets_dict["tag"]

In [None]:
# first 5 datapoints
curr_assets_dict["units"]["USD"][:5]

In [None]:
# this should result in the same DataFrame as the one retrieved through the companyfacts API
# and selected by taxonomy us-gaap, concept/tag AssetsCurrent and units USD
curr_assets_df = pd.DataFrame(curr_assets_dict["units"]["USD"])
curr_assets_df

## Getting one fact from requested period/frame
## data.sec.gov/api/xbrl/frames/

The xbrl/frames API aggregates one fact for **each** reporting entity: the last one filed that most closely fits the calendrical period requested. This API supports annual, quarterly and instantaneous data:

* https://data.sec.gov/api/xbrl/frames/us-gaap/AccountsPayableCurrent/USD/CY2019Q1I.json

Where the units of measure specified in the XBRL contains a numerator and a denominator, these are separated by “-per-” such as “USD-per-shares”. Note that the default unit in XBRL is “pure”.

The period format is CY#### for annual data (duration 365 days +/- 30 days), CY####Q# for quarterly data (duration 91 days +/- 30 days), and CY####Q#I for instantaneous data. Because company financial calendars can start and end on any month or day and even change in length from quarter to quarter according to the day of the week, the frame data is assembled by the dates that best align with a calendar quarter or year. Data users should be mindful of different reporting start and end dates for facts contained in a frame.
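
The period and unit conventions above are simple enough to encode in a few helpers. This is a sketch based only on the three period formats and the "-per-" rule listed above. The function names are hypothetical and the helpers do not check whether a frame actually exists on the server:

```python
import re

# CY#### (annual), CY####Q# (quarterly), CY####Q#I (instantaneous)
FRAME_RE = re.compile(r"^CY(\d{4})(?:Q([1-4])(I)?)?$")


def frame_period(year, quarter=None, instantaneous=False):
    """Build a period string such as CY2019, CY2019Q1 or CY2019Q1I."""
    if quarter is None:
        if instantaneous:
            raise ValueError("this sketch builds instantaneous frames only for a quarter")
        return f"CY{year:04d}"
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3 or 4")
    return f"CY{year:04d}Q{quarter}{'I' if instantaneous else ''}"


def parse_frame_period(period):
    """Split a period string into (year, quarter or None, instantaneous flag)."""
    match = FRAME_RE.match(period)
    if match is None:
        raise ValueError(f"not a recognised frame period: {period!r}")
    year, quarter, inst = match.groups()
    return int(year), int(quarter) if quarter else None, inst is not None


def unit_path(numerator, denominator=None):
    """Join numerator and denominator with "-per-" when both are present."""
    return numerator if denominator is None else f"{numerator}-per-{denominator}"


print(frame_period(2019, 1, instantaneous=True), unit_path("USD", "shares"))
```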

In [None]:
# Let's retrieve current assets reported by all entities for the instantaneous frame at the end of Q1 2021
fact = "AssetsCurrent"
year = 2021
quarter = "Q1I"

url = f"https://data.sec.gov/api/xbrl/frames/us-gaap/{fact}/USD/CY{year}{quarter}.json"
url

In [None]:
curr_assets_dict = requests.get(url, headers=header).json()
curr_assets_dict.keys()

In [None]:
# let's convert all data of requested period to pandas dataframe
curr_assets_df = pd.DataFrame(curr_assets_dict["data"])
curr_assets_df.sort_values("val", ascending=False)

In [None]:
company_facts["facts"].keys()

In [None]:
company_facts["entityName"]

In [None]:
CIK = 320193
url = f"https://data.sec.gov/api/xbrl/companyconcept/CIK{str(CIK).zfill(10)}/dei/EntityRegistrantName.json"
url

![SEC logo](https://www.pngkit.com/png/detail/177-1773725_seal-of-the-united-states-securities-and-exchange.png)