# SEC Filing Data Extraction for GraphRAG

This notebook processes SEC filings in inline XBRL format (10-K and 10-Q), extracts selected sections along with identifying metadata, and saves the result as JSON. The output is intended as input for a GraphRAG (graph-based retrieval-augmented generation) system over financial documents: each extracted section and metadata field can populate a node or property in the knowledge graph, which can then support analytical and generative tasks.

## Import Necessary Libraries

We begin by importing the necessary libraries. `os` and `json` are used for file operations and data handling, while `BeautifulSoup` from the `bs4` library is used for parsing HTML/XBRL content. Parsing is crucial for extracting structured information from the filings.

In [1]:
import os
import json
from bs4 import BeautifulSoup

## Define Functions for Data Extraction

### Extract Sections
This function strips the markup from a filing and extracts a fixed set of sections for the given form type (10-K or 10-Q); any other form type yields an empty dictionary. For a GraphRAG, it's important to have well-defined sections, as they represent different nodes or entities in the graph. Each section corresponds to a key aspect of the financial document, such as the business overview or risk factors.

In [2]:
def extract_sections(text, form_type):
    soup = BeautifulSoup(text, 'html.parser')
    text = soup.get_text(separator=' ')
    
    if form_type == "10-K":
        sections = {
            "Item 1. Business": extract_section(text, "Item 1. Business", "Item 1A. Risk Factors"),
            "Item 1A. Risk Factors": extract_section(text, "Item 1A. Risk Factors", "Item 1B. Unresolved Staff Comments"),
            "Item 7. Management's Discussion and Analysis (MD&A)": extract_section(text, "Item 7. Management's Discussion and Analysis", "Item 7A. Quantitative and Qualitative Disclosures About Market Risk"),
            "Item 8. Financial Statements and Supplementary Data": extract_section(text, "Item 8. Financial Statements and Supplementary Data", "Item 9. Changes in and Disagreements with Accountants on Accounting and Financial Disclosure")
        }
    elif form_type == "10-Q":
        sections = {
            "Item 1. Financial Statements": extract_section(text, "Item 1. Financial Statements", "Item 2. Management's Discussion and Analysis of Financial Condition and Results of Operations"),
            "Item 2. Management's Discussion and Analysis (MD&A)": extract_section(text, "Item 2. Management's Discussion and Analysis of Financial Condition and Results of Operations", "Item 3. Quantitative and Qualitative Disclosures About Market Risk"),
            # In Form 10-Q, Part I Item 4 is followed by Part II Item 1 (Legal Proceedings)
            "Item 4. Controls and Procedures": extract_section(text, "Item 4. Controls and Procedures", "Item 1. Legal Proceedings"),
            "Item 1A. Risk Factors": extract_section(text, "Item 1A. Risk Factors", "Item 2. Unregistered Sales of Equity Securities and Use of Proceeds")
        }
    else:
        sections = {}
    
    return sections

### Extract Section
This helper function extracts a section of text between two headings. It's essential for isolating the content of interest, which can then be used to populate nodes in the knowledge graph.

In [3]:
def extract_section(text, start_heading, end_heading):
    start_index = text.find(start_heading)
    end_index = text.find(end_heading, start_index)
    if start_index != -1 and end_index != -1:
        return text[start_index:end_index].strip()
    elif start_index != -1:
        return text[start_index:].strip()
    else:
        return ""
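
Because `str.find` matches headings exactly, a filing that writes a heading in upper case or with extra or non-breaking spaces will not be found. The following is a minimal sketch of a more tolerant lookup, not part of the original notebook; `heading_pattern` and `extract_section_tolerant` are hypothetical names introduced here for illustration.

```python
import re

def heading_pattern(heading):
    """Build a regex that matches `heading` regardless of case and spacing."""
    # Escape each word, then allow any run of whitespace between words.
    # For str patterns, \s also matches non-breaking spaces.
    words = [re.escape(word) for word in heading.split()]
    return re.compile(r"\s+".join(words), re.IGNORECASE)

def extract_section_tolerant(text, start_heading, end_heading):
    """Hypothetical variant of extract_section with tolerant heading matching."""
    start = heading_pattern(start_heading).search(text)
    if start is None:
        return ""
    # Look for the end heading only after the start heading.
    end = heading_pattern(end_heading).search(text, start.end())
    stop = end.start() if end else len(text)
    return text[start.start():stop].strip()

# Made-up sample text with upper-case headings and irregular spacing.
sample = "ITEM 1.\u00a0 BUSINESS We make widgets. ITEM 1A.  RISK FACTORS Widgets may break."
print(extract_section_tolerant(sample, "Item 1. Business", "Item 1A. Risk Factors"))
```

Like the original helper, this sketch returns the first occurrence, so a heading repeated in a table of contents would still be matched first.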

### Extract CIK
The Central Index Key (CIK) is a unique identifier for companies in the SEC's EDGAR database. Extracting this allows us to link the filing to the correct entity in our knowledge graph.

In [4]:
def extract_cik(text):
    soup = BeautifulSoup(text, 'html.parser')
    # html.parser lowercases tag names, so the lookup must use 'ix:nonnumeric'
    cik_tag = soup.find('ix:nonnumeric', {'name': 'dei:EntityCentralIndexKey'})
    if cik_tag:
        return cik_tag.text.strip()
    return None
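
CIKs are written both zero-padded to ten digits and without padding, so using the raw string as a graph key can split one company across two nodes. The sketch below shows one way to normalize the value before using it as a key; `normalize_cik` is a hypothetical helper, and the sample value is made up.

```python
def normalize_cik(raw_cik):
    """Return the CIK zero-padded to 10 digits, or None if it is not numeric."""
    if raw_cik is None:
        return None
    digits = raw_cik.strip()
    # Reject anything that is not 1 to 10 ASCII digits.
    if not (digits.isascii() and digits.isdigit()) or len(digits) > 10:
        return None
    return digits.zfill(10)

# Both spellings of the same (made-up) CIK map to one key.
print(normalize_cik("123456"))
print(normalize_cik(" 0000123456 "))
```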

### Extract Fiscal Year
Extracting the fiscal year helps in organizing the data temporally within the knowledge graph, allowing for time-based queries and analyses.

In [5]:
def extract_fiscal_year(text):
    soup = BeautifulSoup(text, 'html.parser')
    # html.parser lowercases tag names, so the lookup must use 'ix:nonnumeric'
    fiscal_year_tag = soup.find('ix:nonnumeric', {'name': 'dei:DocumentFiscalYearFocus'})
    if fiscal_year_tag:
        return fiscal_year_tag.text.strip()
    return None

### Extract Fiscal Quarter
Similar to the fiscal year, the fiscal period (for example `Q1`, `Q2`, `Q3`, or `FY` for a full-year filing) provides finer granularity for temporal data organization in the graph.

In [6]:
def extract_fiscal_quarter(text):
    soup = BeautifulSoup(text, 'html.parser')
    # html.parser lowercases tag names, so the lookup must use 'ix:nonnumeric'
    fiscal_quarter_tag = soup.find('ix:nonnumeric', {'name': 'dei:DocumentFiscalPeriodFocus'})
    if fiscal_quarter_tag:
        return fiscal_quarter_tag.text.strip()
    return None
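
The three extractors above each build a full parse tree of the same filing. As an alternative, the sketch below collects all three values in a single pass using only the standard library's `html.parser`, which reports tag names in lowercase. `DeiFactCollector` is a hypothetical class, the sample markup is made up, and the sketch assumes each fact's value is the plain text inside its `ix:nonNumeric` element.

```python
from html.parser import HTMLParser

class DeiFactCollector(HTMLParser):
    """Collect the text of selected ix:nonNumeric facts in a single pass."""

    def __init__(self, wanted_names):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted_names)
        self.facts = {}
        self._current = None  # name of the fact currently being read, if any
        self._depth = 0       # nesting depth of ix:nonnumeric within that fact

    def handle_starttag(self, tag, attrs):
        if tag != "ix:nonnumeric":  # tag names arrive lowercased
            return
        if self._current is not None:
            self._depth += 1  # nested fact inside the one being read
            return
        name = dict(attrs).get("name")
        if name in self.wanted and name not in self.facts:
            self._current = name
            self._depth = 1
            self.facts[name] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.facts[self._current] += data

    def handle_endtag(self, tag):
        if tag == "ix:nonnumeric" and self._current is not None:
            self._depth -= 1
            if self._depth == 0:
                self.facts[self._current] = self.facts[self._current].strip()
                self._current = None

# Made-up inline XBRL fragment for illustration.
sample = (
    '<ix:nonNumeric name="dei:EntityCentralIndexKey" contextRef="c1">0000123456</ix:nonNumeric>'
    '<ix:nonNumeric name="dei:DocumentFiscalYearFocus" contextRef="c1">2023</ix:nonNumeric>'
    '<ix:nonNumeric name="dei:DocumentFiscalPeriodFocus" contextRef="c1"><span>Q2</span></ix:nonNumeric>'
)
collector = DeiFactCollector([
    "dei:EntityCentralIndexKey",
    "dei:DocumentFiscalYearFocus",
    "dei:DocumentFiscalPeriodFocus",
])
collector.feed(sample)
collector.close()
print(collector.facts)
```

Only the first occurrence of each fact is kept, which mirrors the behavior of `soup.find`.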

## Process Tickers
This function processes each ticker, extracts relevant data, and saves it in JSON format. The structured JSON output is suitable for ingestion into a knowledge graph, where each section can be linked to other related data points.

In [7]:
def process_tickers(ticker_file, input_dir, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    
    with open(ticker_file, 'r') as file:
        tickers = json.load(file)
    
    for ticker in tickers:
        ticker_dir = os.path.join(input_dir, ticker)
        if os.path.isdir(ticker_dir):
            for root, dirs, files in os.walk(ticker_dir):
                for filename in files:
                    if filename.endswith(".txt"):
                        file_path = os.path.join(root, filename)
                        # Any file without "10-K" in its name is treated as a 10-Q
                        form_type = "10-K" if "10-K" in filename else "10-Q"
                        try:
                            with open(file_path, 'r') as f:
                                filing_text = f.read()
                                cik = extract_cik(filing_text)
                                fiscal_year = extract_fiscal_year(filing_text)
                                fiscal_quarter = extract_fiscal_quarter(filing_text)
                                sections = extract_sections(filing_text, form_type)
                                
                                if sections:
                                    # Create the JSON structure
                                    data = {
                                        "filing": {
                                            "cik": cik,
                                            "ticker": ticker,
                                            "year": fiscal_year,
                                            "quarter": fiscal_quarter,
                                            "sections": sections
                                        }
                                    }
                                    
                                    # Save the data to a JSON file
                                    # fiscal_quarter already carries its own prefix (e.g. "Q2" or "FY")
                                    json_filename = os.path.join(output_dir, f"{ticker}_{form_type}_{fiscal_year}_{fiscal_quarter}.json")
                                    with open(json_filename, 'w') as json_file:
                                        json.dump(data, json_file, indent=4)
                                    
                                    print(f"Processed and saved data for ticker: {ticker}")
                        except Exception as e:
                            print(f"Error processing file {file_path}: {e}")
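
To illustrate how a saved record could feed a knowledge graph, the sketch below flattens the JSON structure written by `process_tickers` into (subject, relation, object) triples. This is only one possible mapping: `filing_to_triples`, the identifier scheme, and the relation names are all hypothetical, and the sample record is made up.

```python
def filing_to_triples(data):
    """Flatten one saved filing record into (subject, relation, object) triples."""
    filing = data["filing"]
    # Fall back to the ticker when no CIK was extracted.
    company_id = f"company:{filing['cik'] or filing['ticker']}"
    filing_id = f"filing:{filing['ticker']}:{filing['year']}:{filing['quarter']}"
    triples = [
        (company_id, "HAS_TICKER", filing["ticker"]),
        (company_id, "FILED", filing_id),
        (filing_id, "FISCAL_YEAR", filing["year"]),
        (filing_id, "FISCAL_PERIOD", filing["quarter"]),
    ]
    for title, body in filing["sections"].items():
        if body:  # skip sections whose headings were not found
            triples.append((filing_id, "HAS_SECTION", f"{filing_id}:{title}"))
    return triples

# Made-up record in the shape produced by process_tickers.
record = {
    "filing": {
        "cik": "0000123456",
        "ticker": "ACME",
        "year": "2023",
        "quarter": "Q2",
        "sections": {
            "Item 1. Financial Statements": "Item 1. Financial Statements ...",
            "Item 1A. Risk Factors": "",
        },
    }
}
for triple in filing_to_triples(record):
    print(triple)
```

The empty-body check matters because `extract_sections` returns a key for every configured section even when the heading is missing, so a record can contain sections with empty text.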

## Example Usage
Set the paths for the ticker file, input directory, and output directory, then call `process_tickers`. The ticker file is expected to contain a JSON list of ticker symbols, and `input_dir` is expected to hold one subdirectory per ticker containing the downloaded `.txt` filings.

In [8]:
# Example usage
ticker_file = 'company_tickers.json'  # Path to the JSON file containing the list of tickers
input_dir = 'path_to_downloaded_files'  # Directory containing the downloaded XBRL files
output_dir = 'output_directory'  # Directory to save the JSON files

process_tickers(ticker_file, input_dir, output_dir)
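
The input layout that `process_tickers` walks can be sketched with a throwaway directory. Everything below is illustrative: the ticker `ACME`, the file names, and the helper `build_sample_layout` are assumptions, chosen only so that one file name contains `10-K` and the other does not.

```python
import json
import os
import tempfile

def build_sample_layout(base_dir):
    """Create a ticker file and one ticker subdirectory with two placeholder filings."""
    ticker_file = os.path.join(base_dir, "company_tickers.json")
    with open(ticker_file, "w", encoding="utf-8") as handle:
        json.dump(["ACME"], handle)  # a plain JSON list of ticker symbols

    input_dir = os.path.join(base_dir, "filings")
    ticker_dir = os.path.join(input_dir, "ACME")
    os.makedirs(ticker_dir, exist_ok=True)
    for name in ("ACME_10-K_2023.txt", "ACME_10-Q_2023_Q2.txt"):
        with open(os.path.join(ticker_dir, name), "w", encoding="utf-8") as handle:
            handle.write("<html><body>placeholder filing text</body></html>")
    return ticker_file, input_dir

with tempfile.TemporaryDirectory() as tmp:
    ticker_file, input_dir = build_sample_layout(tmp)
    # List the .txt files the same way the processing loop discovers them.
    found = sorted(
        name
        for _root, _dirs, files in os.walk(os.path.join(input_dir, "ACME"))
        for name in files
        if name.endswith(".txt")
    )
    print(found)
```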