# SEC Filing Data Extraction

This notebook demonstrates how to download SEC filings, extract specific sections, and save them in JSON format. The output is intended as input for a GraphRAG (graph-based retrieval-augmented generation) system over financial documents: structured text extracted from the filings can populate the knowledge graph with detailed financial information and support downstream analytical and generative tasks.

## Import Necessary Libraries

We begin by importing the necessary libraries. `os` and `json` handle file operations and serialization, `Downloader` from the `sec_edgar_downloader` package fetches filings from SEC EDGAR, and `BeautifulSoup` from the `bs4` library parses the HTML/XBRL content of each filing so that its plain text can be extracted.

In [1]:
import os
import json
from sec_edgar_downloader import Downloader
from bs4 import BeautifulSoup

## Set Up Data Directory

Before processing the data, we need to ensure that the directory for saving the collected data exists. This step creates a directory named `data` in the current working directory, which will be used to store the JSON files generated from the SEC filings.

In [2]:
# Set up the data directory
data_directory = 'data'
os.makedirs(data_directory, exist_ok=True)  # no error if the directory already exists
print(f"Data directory set up at: {os.path.abspath(data_directory)}")

Data directory set up at: /workspace/data


## Initialize the Downloader

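The `Downloader` instance is initialized with a company name, an email address, and a download directory. The name and email identify the requester to SEC EDGAR, which expects automated clients to declare who they are, so replace them with your own details. Downloading filings does not itself require an API key; keys from [NVIDIA's AI portal](https://build.nvidia.com/) matter only if later stages of the pipeline call NVIDIA-hosted models.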

In [3]:
# Initialize the downloader instance
dl = Downloader("NVIDIA", "bhall@nvidia.com", "data/sec_data")

## Define Functions for Data Extraction

### Extract Sections
This function strips the HTML markup from a filing and extracts specific sections from the resulting text. For a GraphRAG system, well-defined sections matter because each one can become a node or entity in the graph, covering a key aspect of the filing such as the business overview or risk factors. Form types other than 10-K and 10-Q yield an empty dictionary.

In [4]:
def extract_sections(text, form_type):
    soup = BeautifulSoup(text, 'html.parser')
    text = soup.get_text(separator=' ')
    
    if form_type == "10-K":
        sections = {
            "Business Overview": extract_section(text, "Item 1. Business", "Item 1A. Risk Factors"),
            "Risk Factors": extract_section(text, "Item 1A. Risk Factors", "Item 1B. Unresolved Staff Comments"),
            "Management's Discussion and Analysis (MD&A)": extract_section(text, "Item 7. Management's Discussion and Analysis", "Item 7A. Quantitative and Qualitative Disclosures About Market Risk"),
            "Financial Statements and Supplementary Data": extract_section(text, "Item 8. Financial Statements and Supplementary Data", "Item 9. Changes in and Disagreements with Accountants on Accounting and Financial Disclosure")
        }
    elif form_type == "10-Q":
        sections = {
            "Management's Discussion and Analysis (MD&A)": extract_section(text, "Item 2. Management's Discussion and Analysis of Financial Condition and Results of Operations", "Item 3. Quantitative and Qualitative Disclosures About Market Risk"),
            "Financial Statements": extract_section(text, "Item 1. Financial Statements", "Item 2. Management's Discussion and Analysis of Financial Condition and Results of Operations"),
            "Risk Factors": extract_section(text, "Item 1A. Risk Factors", "Item 2. Unregistered Sales of Equity Securities and Use of Proceeds"),
            # In a 10-Q, Part I Item 4 is followed by Part II Item 1 (Legal Proceedings)
            "Controls and Procedures": extract_section(text, "Item 4. Controls and Procedures", "Item 1. Legal Proceedings")
        }
    else:
        sections = {}
    
    return sections
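
The heading pairs above can also be kept in a lookup table, so that adding a form type or a section does not require touching the control flow. The sketch below is one possible arrangement, not the notebook's implementation: `SECTION_BOUNDS`, `slice_between`, and `split_by_bounds` are hypothetical names, only two 10-K sections are listed for brevity, and the input is assumed to be plain text that has already been stripped of HTML.

```python
# Hypothetical lookup table: form type -> {section name: (start heading, end heading)}.
SECTION_BOUNDS = {
    "10-K": {
        "Business Overview": ("Item 1. Business", "Item 1A. Risk Factors"),
        "Risk Factors": ("Item 1A. Risk Factors", "Item 1B. Unresolved Staff Comments"),
    },
}


def slice_between(text, start_heading, end_heading):
    """Return the text from start_heading up to (not including) end_heading."""
    start = text.find(start_heading)
    if start == -1:
        return ""
    end = text.find(end_heading, start)
    return text[start:end].strip() if end != -1 else text[start:].strip()


def split_by_bounds(text, form_type):
    """Build {section name: section text}; unknown form types give {}."""
    bounds = SECTION_BOUNDS.get(form_type, {})
    return {name: slice_between(text, start, end) for name, (start, end) in bounds.items()}


# Made-up sample text, not taken from a real filing.
sample = (
    "Item 1. Business We design chips. "
    "Item 1A. Risk Factors Demand may fall. "
    "Item 1B. Unresolved Staff Comments None."
)
result = split_by_bounds(sample, "10-K")
unknown = split_by_bounds(sample, "8-K")
print(result["Business Overview"])
print(result["Risk Factors"])
print(unknown)
```

Keeping the boundaries as data also makes it easy to list which sections a given form type is expected to produce.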

### Extract Section
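This helper function returns the text between two headings, starting at the first occurrence of `start_heading` and stopping just before the next occurrence of `end_heading`. If the end heading is not found, it returns everything from the start heading to the end of the text; if the start heading is not found, it returns an empty string. The isolated content can then be used to populate nodes in the knowledge graph.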

In [5]:
def extract_section(text, start_heading, end_heading):
    start_index = text.find(start_heading)
    end_index = text.find(end_heading, start_index)
    if start_index != -1 and end_index != -1:
        return text[start_index:end_index].strip()
    elif start_index != -1:
        return text[start_index:].strip()
    else:
        return ""
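
Exact `str.find` matching has two practical limitations. The first hit for a heading is often its table-of-contents entry rather than the section itself, and headings in real filings vary in case, spacing, and apostrophe style. The sketch below is one way to soften both, offered as an alternative under stated assumptions rather than as part of the notebook. `heading_pattern` and `extract_longest_section` are hypothetical helpers: they match headings case-insensitively with flexible whitespace, and they keep the longest span between any start match and the next end match, on the assumption that the real section is longer than its table-of-contents entry.

```python
import re


def heading_pattern(heading):
    """Compile a case-insensitive regex that tolerates variable whitespace
    and either straight or curly apostrophes in the heading."""
    words = []
    for word in heading.split():
        # Escape each piece, then allow either apostrophe style between pieces.
        pieces = [re.escape(piece) for piece in word.split("'")]
        words.append("['\u2019]".join(pieces))
    return re.compile(r"\s+".join(words), re.IGNORECASE)


def extract_longest_section(text, start_heading, end_heading):
    """Return the longest span that begins at a start-heading match and stops
    at the next end-heading match (or at the end of the text)."""
    start_re = heading_pattern(start_heading)
    end_re = heading_pattern(end_heading)
    best = ""
    for start in start_re.finditer(text):
        end = end_re.search(text, start.end())
        stop = end.start() if end else len(text)
        candidate = text[start.start():stop].strip()
        if len(candidate) > len(best):
            best = candidate
    return best


# Made-up sample with a table-of-contents entry before the real section.
sample = (
    "Table of Contents Item 1. Business 3 Item 1A. Risk Factors 9 "
    "ITEM 1.  BUSINESS We design chips and software. "
    "ITEM 1A.  RISK FACTORS Demand may fall."
)

# Exact matching stops at the table-of-contents entry.
naive_start = sample.find("Item 1. Business")
naive = sample[naive_start:sample.find("Item 1A. Risk Factors", naive_start)].strip()

longest = extract_longest_section(sample, "Item 1. Business", "Item 1A. Risk Factors")
print(naive)
print(longest)
```

The longest-span heuristic can still pick the wrong span, for example when the end heading is missing after the last start match, so the result is worth spot-checking on a few real filings.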

## Download and Save SEC Filings

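This function downloads the 10-K and 10-Q filings for one ticker and year, reads each downloaded filing, extracts the relevant sections, and writes the result to a JSON file named after the ticker and year. Errors are caught and reported per ticker, so one failure does not stop a batch run. The structured output allows for easy integration and retrieval of financial information when building the knowledge graph.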

In [6]:
def download_and_save_filings(ticker, year, output_dir):
    try:
        # Download 10-K and 10-Q filings
        dl.get("10-K", ticker, download_details=True, after=f"{year}-01-01", before=f"{year}-12-31")
        dl.get("10-Q", ticker, after=f"{year}-01-01", before=f"{year}-12-31")
        
        # Define the directory where the filings are saved. Assumption: recent
        # sec-edgar-downloader releases write each filing to
        # <download_folder>/sec-edgar-filings/<ticker>/<form>/<accession>/full-submission.txt
        filings_dir = os.path.join(dl.download_folder, "sec-edgar-filings", ticker)
        
        # Read and save the filings as JSON
        filings_data = {}
        for form_type in ["10-K", "10-Q"]:
            form_dir = os.path.join(filings_dir, form_type)
            if os.path.exists(form_dir):
                filings_data[form_type] = []
                # Walk the per-filing subdirectories to find the submission text files
                for root, _dirs, filenames in os.walk(form_dir):
                    for filename in filenames:
                        if filename.endswith(".txt"):
                            with open(os.path.join(root, filename), 'r', encoding='utf-8', errors='ignore') as file:
                                filing_text = file.read()
                            sections = extract_sections(filing_text, form_type)
                            filings_data[form_type].append(sections)
        
        # Save the filings data to a JSON file
        json_filename = os.path.join(output_dir, f"{ticker}_{year}.json")
        with open(json_filename, 'w') as json_file:
            json.dump(filings_data, json_file, indent=4)
        
        print(f"Processed and saved data for ticker: {ticker}")
    
    except Exception as e:
        print(f"Failed to process ticker {ticker}: {e}")
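
Because exact heading matches can fail silently and leave a section empty, it may help to inspect a saved file before ingesting it into the graph. The following sketch assumes the JSON layout written above (form type mapped to a list of section dictionaries). `summarize_filings` is a hypothetical helper, and the demo writes a small stand-in file to a temporary directory rather than reading a real filing.

```python
import json
import os
import tempfile


def summarize_filings(json_path):
    """Return one (form type, filing index, section name, character count)
    tuple per extracted section in a saved filings file."""
    with open(json_path, "r", encoding="utf-8") as fh:
        filings_data = json.load(fh)
    rows = []
    for form_type, filings in filings_data.items():
        for index, sections in enumerate(filings):
            for name, body in sections.items():
                rows.append((form_type, index, name, len(body)))
    return rows


# Stand-in data with the same shape as the notebook's output (not a real filing).
demo = {"10-K": [{"Business Overview": "Item 1. Business ...", "Risk Factors": ""}]}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "DEMO_2024.json")
    with open(demo_path, "w", encoding="utf-8") as fh:
        json.dump(demo, fh)
    rows = summarize_filings(demo_path)

# Sections with zero characters are the ones whose headings were not found.
empty = [row for row in rows if row[3] == 0]
print(rows)
print(empty)
```

A file whose form types are missing entirely, or whose sections are all empty, usually points to a path or heading mismatch rather than to a filing without content.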

## Process Tickers
This function reads the list of tickers from a JSON file, creates the output directory if it does not exist, and calls `download_and_save_filings` for each ticker. The structured JSON output is suitable for ingestion into a knowledge graph, where each section can be linked to other related data points.

In [7]:
def process_tickers(ticker_file, output_dir, year):
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    
    with open(ticker_file, 'r') as file:
        tickers = json.load(file)
    # Expected file format: a JSON object with a top-level "tickers" list
    
    for ticker in tickers['tickers']:
        ticker = ticker.strip()
        download_and_save_filings(ticker, year, output_dir)
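
`process_tickers` expects the ticker file to be a JSON object with a top-level `tickers` list, as implied by the `tickers['tickers']` lookup. A minimal sketch of creating and reading such a file follows. The ticker symbols are placeholders, the file is written to a temporary directory, and `load_tickers` is a hypothetical helper that also drops blank entries, which the notebook's loop does not do.

```python
import json
import os
import tempfile


def load_tickers(ticker_file):
    """Read {"tickers": [...]} and return the stripped, non-empty symbols."""
    with open(ticker_file, "r", encoding="utf-8") as fh:
        payload = json.load(fh)
    return [t.strip() for t in payload["tickers"] if t.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stock_tickers.json")
    # Placeholder content in the shape process_tickers reads.
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"tickers": ["AAPL", " ABBV ", ""]}, fh)
    tickers = load_tickers(path)

print(tickers)
```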

## Example Usage
Set up the paths for the ticker file and output directory, and then call the `process_tickers` function to start processing. This step is crucial for preparing the data for integration into a GraphRAG system.

In [8]:
# Example usage
ticker_file = 'stock_tickers.json'  # Path to the JSON file containing the list of tickers
output_dir = 'data/sec_data'  # Directory to save the JSON files
year = '2024'  # Year for the filings

process_tickers(ticker_file, output_dir, year)

Processed and saved data for ticker: A
Processed and saved data for ticker: AA
Processed and saved data for ticker: AAAU
Processed and saved data for ticker: AAL
Processed and saved data for ticker: AAOI
Processed and saved data for ticker: AAP
Processed and saved data for ticker: AAPL
Processed and saved data for ticker: ABBV
Processed and saved data for ticker: ABCL
Processed and saved data for ticker: ABEV
Processed and saved data for ticker: ABNB
Processed and saved data for ticker: ABR
Processed and saved data for ticker: ABSI
Processed and saved data for ticker: ABT
Processed and saved data for ticker: ACAD
Processed and saved data for ticker: ACB
Processed and saved data for ticker: ACGL
Processed and saved data for ticker: ACHR
Processed and saved data for ticker: ACI
Processed and saved data for ticker: ACMR
Processed and saved data for ticker: ACN
Processed and saved data for ticker: ADAP
Processed and saved data for ticker: ADBE
Processed and saved data for ticker: ADC
Proce