# Risk Factor Title Extraction Task

Yichun Sarah Fan <br>
May 16, 2025

---

**Project Summary**

This Python script extracts risk factor titles from the ***Item 1A. Risk Factors*** section of SEC 10-K filings for a panel of 10 unique firms over three years. For each filing, it captures every visually emphasized risk factor title and writes one row per title to a CSV file with the columns cik, filingyear, filingdate, reportingdate, and RFDTitle.

The code consists of two main parts.

**1. Initialize pyedgar environment:**
   - Prepares the environment and necessary libraries for accessing and parsing SEC EDGAR filings.
   - Handles data index setup and imports.

**2. EDGAR data extraction:**
   - Matches each firm-year with the correct 10-K filing.
   - Extracts all accentuated risk factor headings (bold, underlined, or italic) from the Item 1A section.
   - Writes all results to a structured CSV file for further analysis.

This workflow automates and standardizes the extraction of regulatory risk disclosures, enabling efficient and reproducible data collection for research or business purposes.

## Initialize pyedgar environment

In [1]:
import os
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Define and Create pyedgar Directory Structure
## Base directory
base_dir = '/content/drive/MyDrive/RA/Tech test/pyedgar'

## Define subdirectories for config, index, and filings
conf_dir = os.path.join(base_dir, 'config')
index_dir = os.path.join(base_dir, 'indices')
filing_dir = os.path.join(base_dir, 'filings')

## Create the directories if they don't exist
os.makedirs(conf_dir, exist_ok=True)
os.makedirs(index_dir, exist_ok=True)
os.makedirs(filing_dir, exist_ok=True)

In [3]:
# Create config file
conf_path = os.path.join(conf_dir, 'hades.colab.pyedgar.conf')

with open(conf_path, 'w') as config_file:
    config_file.write(f"""
[DEFAULT]
SEC_BASE_URL = https://www.sec.gov
HEADERS = Sarah Fan (yichun.fan@gwmail.gwu.edu)

[Paths]
INDEX_ROOT = {index_dir}
FILING_ROOT = {filing_dir}

[Index]
INDEX_DELIMITER = |
INDEX_EXTENSION = idx

[Downloader]
KEEP_ALL = False
""")

# Set Environment Variable
os.environ['PYEDGAR_CONF'] = conf_path
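
One way to sanity-check a config file like the one written above is to parse it back with the standard-library `configparser`. This is a minimal sketch, not part of pyedgar: it assumes the file is plain INI, and the temporary directory and the helper name `read_paths` are hypothetical stand-ins for the Drive paths used in this notebook.

```python
import configparser
import os
import tempfile

def read_paths(conf_path):
    """Return the [Paths] section of an INI-style config as a plain dict."""
    # interpolation=None keeps any '%' in a path from being treated as a placeholder
    cfg = configparser.ConfigParser(interpolation=None)
    with open(conf_path) as fh:
        cfg.read_file(fh)
    # Skip keys inherited from [DEFAULT]; keep only those set in [Paths]
    return {k: v for k, v in cfg['Paths'].items() if k not in cfg.defaults()}

# Hypothetical stand-in for the Drive directories (assumption for this demo)
tmp_dir = tempfile.mkdtemp()
demo_conf = os.path.join(tmp_dir, 'demo.pyedgar.conf')
with open(demo_conf, 'w') as fh:
    fh.write(f"""
[DEFAULT]
SEC_BASE_URL = https://www.sec.gov

[Paths]
INDEX_ROOT = {tmp_dir}/indices
FILING_ROOT = {tmp_dir}/filings
""")

paths = read_paths(demo_conf)
print(paths)
```

Note that `configparser` lowercases option names by default, so the keys come back as `index_root` and `filing_root`.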

In [4]:
from pyedgar import config, Filing, EDGARIndex
print("✅ pyedgar is using config file from:", config.CONFIG_FILE)

✅ pyedgar is using config file from: /content/drive/MyDrive/RA/Tech test/pyedgar/config/hades.colab.pyedgar.conf


## EDGAR data extraction

In [5]:
import pandas as pd
import re
from time import sleep
from datetime import datetime
from bs4 import BeautifulSoup
from dateutil import parser

In [6]:
# Load input file
df_input = pd.read_csv("/content/drive/MyDrive/RA/Tech test/rasamplemini_rfdtitle.csv")

# Ensure CIK is 10-digit zero-padded (required by EDGARIndex)
df_input['cik'] = df_input['cik'].astype(str).str.zfill(10)

# Load EDGAR index (use cached data if available)
idx = EDGARIndex(force_download=False)
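
The zero-padding step above works when the `cik` column is read as integers or strings. If a CSV reader infers floats instead (for example because of missing values), `str()` yields `'1750.0'` and padding produces `'00001750.0'`. The sketch below shows one defensive way to normalize a single value; `normalize_cik` is a hypothetical helper, not part of pyedgar or pandas.

```python
def normalize_cik(value):
    """Return a CIK as a 10-character, zero-padded string."""
    text = str(value).strip()
    # A CIK read as float shows up as e.g. '1750.0'; drop the trailing '.0'
    if text.endswith('.0'):
        text = text[:-2]
    if not text.isdigit():
        raise ValueError(f"not a numeric CIK: {value!r}")
    return text.zfill(10)

print(normalize_cik(1750))
print(normalize_cik('20'))
print(normalize_cik(1750.0))
```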

In [7]:
idx.indices

{'form_all.idx': '/content/drive/MyDrive/RA/Tech test/pyedgar/indices/form_all.idx',
 'form_DEF14A.idx': '/content/drive/MyDrive/RA/Tech test/pyedgar/indices/form_DEF14A.idx',
 'form_10-K.idx': '/content/drive/MyDrive/RA/Tech test/pyedgar/indices/form_10-K.idx',
 'form_10-Q.idx': '/content/drive/MyDrive/RA/Tech test/pyedgar/indices/form_10-Q.idx',
 'form_8-K.idx': '/content/drive/MyDrive/RA/Tech test/pyedgar/indices/form_8-K.idx'}

In [8]:
index_10k = pd.read_csv('/content/drive/MyDrive/RA/Tech test/pyedgar/indices/form_10-K.idx', sep='|', dtype=str, low_memory=False)
index_10k.head()

| | CIK | Company Name | Form Type | Date Filed | Accession |
|---|---|---|---|---|---|
| 0 | 20 | K TRON INTERNATIONAL INC | 10-K | 1996-03-28 | 0000893220-96-000500 |
| 1 | 20 | K TRON INTERNATIONAL INC | 10-K | 1997-03-19 | 0000893220-97-000572 |
| 2 | 20 | K TRON INTERNATIONAL INC | 10-K405 | 1998-03-18 | 0000893220-98-000560 |
| 3 | 20 | K TRON INTERNATIONAL INC | 10-K | 1999-03-23 | 0000893220-99-000357 |
| 4 | 20 | K TRON INTERNATIONAL INC | 10-K405 | 2000-03-30 | 0000893220-00-000394 |


In [14]:
def extract_item_1a_titles_from_html(txt):
    # 1. Extract Item 1A block
    pattern = r'(ITEM\s*1A.*?RISK FACTORS?.*?)(ITEM\s*1B|ITEM\s*2|SIGNATURES)'
    match = re.search(pattern, txt, flags=re.IGNORECASE | re.DOTALL)
    if not match:
        print("Item 1A not found")
        return []
    risk_html = match.group(1)

    soup = BeautifulSoup(risk_html, "html.parser")

    # 2. Find all <b> or <strong> tags as candidate subtitles (excluding major headings)
    candidates = []
    for tag in soup.find_all(['b', 'strong']):
        sub = tag.get_text(separator=' ', strip=True)
        if (
            sub
            and not re.search(r'item\s*1a', sub, re.IGNORECASE)
            and not re.search(r'risk factors?', sub, re.IGNORECASE)
            and not sub.endswith(':')
            and not sub.isupper()
        ):
            candidates.append(sub)
    return candidates
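
Because `re.search` returns the first match, the pattern above can stop at a table-of-contents entry ("Item 1A. Risk Factors ... Item 1B") when a filing has one, instead of reaching the section body. One possible mitigation, sketched here under that assumption, is to collect every candidate block and keep the longest. The helper name `longest_item_1a_block` and the sample text are hypothetical. The terminator is written as a lookahead so that it is not consumed between matches.

```python
import re

ITEM_1A_PATTERN = re.compile(
    r'(ITEM\s*1A.*?RISK FACTORS?.*?)(?=ITEM\s*1B|ITEM\s*2|SIGNATURES)',
    flags=re.IGNORECASE | re.DOTALL,
)

def longest_item_1a_block(txt):
    """Return the longest candidate Item 1A block, or '' if none is found."""
    blocks = [m.group(1) for m in ITEM_1A_PATTERN.finditer(txt)]
    return max(blocks, key=len, default='')

# Made-up filing text: a table of contents followed by the actual section
toc = ("Item 1A. Risk Factors 12\n"
       "Item 1B. Unresolved Staff Comments 20\n"
       "Item 2. Properties 21\n")
body = ("Item 1A. Risk Factors\n"
        "<b>We face intense competition.</b> Details follow.\n"
        "Item 1B. Unresolved Staff Comments\nNone.\n")

block = longest_item_1a_block(toc + body)
print(block)
```

This heuristic still truncates the block if the section body itself mentions "Item 2" or "Signatures", so it is a starting point rather than a complete fix.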

In [17]:
def extract_item_1a_titles_from_html(txt):
    """
    Extracts individual risk factor headings from the 'Item 1A: Risk Factors' section
    of an HTML filing. Headings are accentuated (bold, italic, underline), and either
    on their own line or at the start of a paragraph.
    """
    # 1. Extract Item 1A block
    pattern = r'(ITEM\s*1A.*?RISK FACTORS?.*?)(ITEM\s*1B|ITEM\s*2|SIGNATURES)'
    match = re.search(pattern, txt, flags=re.IGNORECASE | re.DOTALL)
    if not match:
        print("Item 1A not found")
        return []
    risk_html = match.group(1)

    soup = BeautifulSoup(risk_html, "html.parser")

    # 2. Find all accentuated tags (bold, italic, underline, or any combo)
    candidates = []
    for tag in soup.find_all(['b', 'strong', 'i', 'em', 'u']):
        # Only take the outermost accentuated tag to avoid nested duplicates
        if tag.find_parent(['b', 'strong', 'i', 'em', 'u']) is not None:
            continue
        # Join the tag's text fragments with single spaces
        sub = tag.get_text(separator=' ', strip=True)
        if sub:
            # Skip the Item 1A header, "Risk Factors" labels, lead-in labels
            # ending in ':', and all-caps section headings
            if (
                not re.search(r'item\s*1a', sub, re.IGNORECASE)
                and not re.search(r'risk factors?', sub, re.IGNORECASE)
                and not sub.endswith(':')
                and not sub.isupper()
            ):
                # Position check: keep the tag if it sits directly inside a <p>
                # (at any position) or is the first child of its parent element
                parent = tag.parent
                if (parent.name == "p" or tag is parent.contents[0]):
                    candidates.append(sub)
    return candidates
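
The tag-name search above only sees emphasis expressed with `<b>`, `<strong>`, `<i>`, `<em>`, or `<u>`. Some filings instead express it through inline CSS (for example `<span style="font-weight:bold">`), and those headings would be missed. The sketch below is one way to test a `style` attribute string for emphasis; `style_is_emphasized` is a hypothetical helper, and the numeric weight threshold of 600 is an assumption.

```python
def style_is_emphasized(style):
    """Return True if an inline CSS style string asks for bold, italic, or underline."""
    if not style:
        return False
    # Split "a: b; c: d" into a {property: value} dict, lowercased and stripped
    decls = {}
    for part in style.split(';'):
        if ':' in part:
            prop, value = part.split(':', 1)
            decls[prop.strip().lower()] = value.strip().lower()
    weight = decls.get('font-weight', '')
    # Assumption: treat numeric weights of 600 and above as bold
    bold = weight in ('bold', 'bolder') or (weight.isdigit() and int(weight) >= 600)
    italic = decls.get('font-style', '') in ('italic', 'oblique')
    underline = 'underline' in decls.get('text-decoration', '')
    return bold or italic or underline

print(style_is_emphasized("font-weight:bold; font-family:Times"))
print(style_is_emphasized("font-weight:400; color:#000"))
```

Because the helper accepts `None`, it can be passed to BeautifulSoup as an attribute filter, for example `soup.find_all(style=style_is_emphasized)`, alongside the tag-name search.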

In [11]:
def extract_reporting_date_from_html(html, maxlen=6000):
    """
    Extracts the reporting date from the HTML content of a 10-K filing.
    Only the first `maxlen` characters are processed for efficiency.

    Extraction steps:
    1. Parse the first part of the HTML to find all <b> and <strong> tags.
    2. Look for text patterns like "For the fiscal year ended [date]" in these tags.
    3. If not found, search for <PERIOD> or PERIOD= tags as a fallback.
    Returns date in 'YYYY/MM/DD' format or an empty string if not found.
    """

    # Only process the first `maxlen` characters for speed
    html_head = html[:maxlen]

    # Step 1: Parse all <b> and <strong> tag text content
    soup = BeautifulSoup(html_head, "html.parser")
    candidate_texts = []
    for tag in soup.find_all(['b', 'strong']):
        txt = tag.get_text(" ", strip=True)
        candidate_texts.append(txt)

    # Step 2: Look for "For the fiscal year ended ..." patterns
    for txt in candidate_texts:
        m = re.search(
            r"for\s+the\s+fiscal\s+year\s+ended\s+([A-Za-z]+\s+\d{1,2},?\s+\d{4})",
            txt, re.IGNORECASE
        )
        if m:
            date_str = m.group(1).replace('\xa0', ' ')  # Replace non-breaking spaces with regular spaces
            try:
                dt = parser.parse(date_str)
                return dt.strftime("%Y/%m/%d")
            except Exception:
                continue

    # Step 3: Fallback - search for <PERIOD> tags or "PERIOD=" patterns
    m = re.search(r"<PERIOD>(\d{8})</PERIOD>", html_head, re.IGNORECASE)
    if m:
        date = m.group(1)
        return f"{date[:4]}/{date[4:6]}/{date[6:]}"

    m = re.search(r"PERIOD[=:\s]+(\d{8})", html_head, re.IGNORECASE)
    if m:
        date = m.group(1)
        return f"{date[:4]}/{date[4:6]}/{date[6:]}"

    # If not found, return empty string
    return ""
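
The `PERIOD` fallbacks above expect the digits to follow the word `PERIOD` almost immediately. When the text is the full-submission file, its SGML header normally carries a `CONFORMED PERIOD OF REPORT:` line, which those patterns do not match. The sketch below shows an additional fallback under that assumption; `period_from_sec_header` is a hypothetical helper and the sample header is made up.

```python
import re
from datetime import datetime

def period_from_sec_header(text, maxlen=6000):
    """Return the header's period of report as 'YYYY/MM/DD', or '' if absent or invalid."""
    m = re.search(r"CONFORMED PERIOD OF REPORT:\s*(\d{8})", text[:maxlen], re.IGNORECASE)
    if not m:
        return ""
    try:
        # strptime rejects impossible dates such as 20180230
        return datetime.strptime(m.group(1), "%Y%m%d").strftime("%Y/%m/%d")
    except ValueError:
        return ""

# Made-up header for illustration; the accession number is a placeholder
sample_header = (
    "ACCESSION NUMBER:\t\t0000000000-18-000000\n"
    "CONFORMED SUBMISSION TYPE:\t10-K\n"
    "CONFORMED PERIOD OF REPORT:\t20180531\n"
    "FILED AS OF DATE:\t\t20180711\n"
)
print(period_from_sec_header(sample_header))
```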

In [18]:
output_records = []

for row in df_input.itertuples(index=False):
    cik = str(row.cik).zfill(10)
    year = int(row.filingyear)

    df_cik_match = index_10k[
        (index_10k['CIK'].astype(str).str.zfill(10) == cik) &
        (pd.to_datetime(index_10k['Date Filed']).dt.year == year)
    ]
    if df_cik_match.empty:
        continue

    filing_meta = df_cik_match.iloc[0]
    accession = filing_meta['Accession'].split('/')[-1].replace('.txt', '')
    filingdate = filing_meta['Date Filed']

    try:
        filing = Filing(cik=cik, accession=accession)
        text = filing.full_text
        reportingdate = extract_reporting_date_from_html(text)
        risk_titles = extract_item_1a_titles_from_html(text)
        if not risk_titles:
            risk_titles = ['']
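    except Exception as e:
        # Report the failure instead of silently dropping it, then emit a blank row
        print(f"Failed to process CIK {cik}, accession {accession}: {e}")
        reportingdate = ''
        risk_titles = ['']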

    for title in risk_titles:
        output_records.append({
            'cik': cik,
            'filingyear': year,
            'filingdate': filingdate,
            'reportingdate': reportingdate,
            'RFDTitle': title.strip()
        })

df_output = pd.DataFrame(output_records)
df_output.head(10)

Item 1A not found
Item 1A not found
Item 1A not found
Item 1A not found
Item 1A not found
Item 1A not found


| | cik | filingyear | filingdate | reportingdate | RFDTitle |
|---|---|---|---|---|---|
| 0 | 1750 | 2018 | 2018-07-11 | 2018/05/31 | We are affected by factors that adversely impa... |
| 1 | 1750 | 2018 | 2018-07-11 | 2018/05/31 | Our U.S. government contracts may not continue... |
| 2 | 1750 | 2018 | 2018-07-11 | 2018/05/31 | We face risks of cost overruns and losses on f... |
| 3 | 1750 | 2018 | 2018-07-11 | 2018/05/31 | Success at our airframe maintenance facilities... |
| 4 | 1750 | 2018 | 2018-07-11 | 2018/05/31 | We operate in highly competitive markets, and ... |
| 5 | 1750 | 2018 | 2018-07-11 | 2018/05/31 | We are subject to significant government regul... |
| 6 | 1750 | 2018 | 2018-07-11 | 2018/05/31 | If we fail to comply with government procureme... |
| 7 | 1750 | 2018 | 2018-07-11 | 2018/05/31 | We are exposed to risks associated with operat... |
| 8 | 1750 | 2018 | 2018-07-11 | 2018/05/31 | Acquisitions expose us to risks, including the... |
| 9 | 1750 | 2018 | 2018-07-11 | 2018/05/31 | Market values for our aviation products fluctu... |
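
The loop above takes the first index row that matches a firm-year. Since the index also lists variants such as 10-K405, a firm-year can in principle match more than one row. The sketch below shows one possible selection rule (prefer an exact `10-K`, then the earliest filing date) on plain dictionaries. `select_filing` and the sample rows are hypothetical; the keys follow the index columns shown earlier.

```python
def select_filing(index_rows, cik, year):
    """Pick one index row for a firm-year, preferring an exact 10-K; None if no match."""
    cik = str(cik).zfill(10)
    matches = [
        row for row in index_rows
        if str(row['CIK']).zfill(10) == cik and row['Date Filed'][:4] == str(year)
    ]
    if not matches:
        return None
    # Exact '10-K' forms sort first (False < True), then the earliest filing date
    matches.sort(key=lambda row: (row['Form Type'] != '10-K', row['Date Filed']))
    return matches[0]

# Made-up rows for illustration; accession values are placeholders
rows = [
    {'CIK': '20', 'Form Type': '10-K/A', 'Date Filed': '1999-02-01', 'Accession': 'demo-1'},
    {'CIK': '20', 'Form Type': '10-K', 'Date Filed': '1999-03-23', 'Accession': 'demo-2'},
    {'CIK': '20', 'Form Type': '10-K405', 'Date Filed': '2000-03-30', 'Accession': 'demo-3'},
]
print(select_filing(rows, 20, 1999))
```

With a DataFrame such as `index_10k`, the same function could be applied to `index_10k.to_dict('records')`, though filtering in pandas first would be faster on a large index.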


In [19]:
df_output.to_csv('rasamplemini_rfdtitle_output_Sarah.csv', index=False)
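
When the output file is read back, type inference can turn the zero-padded `cik` strings into integers and drop the leading zeros. With pandas, passing `dtype={'cik': str}` to `read_csv` avoids this. The sketch below illustrates the round trip with the standard-library `csv` module, which always returns strings. The record and the in-memory buffer are made up for the demo.

```python
import csv
import io

# Made-up record; the title contains a comma to exercise CSV quoting
records = [
    {'cik': '0000001750', 'filingyear': 2018, 'RFDTitle': 'Example title, with a comma'},
]

# Write to an in-memory buffer instead of a file on Drive
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['cik', 'filingyear', 'RFDTitle'])
writer.writeheader()
writer.writerows(records)

# csv.DictReader returns every field as a string, so the padding survives
buffer.seek(0)
rows_back = list(csv.DictReader(buffer))
print(rows_back[0]['cik'])
```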

## Reference

Campbell, J. L., Chen, H., Dhaliwal, D. S., Lu, H. M., & Steele, L. B. (2014). The information content of mandatory risk factor disclosures in corporate filings. *Review of Accounting Studies, 19*(1), 396–455.

Gaulin, M. P. (2017). *Risk fact or fiction: The information content of risk factor disclosures* (Doctoral dissertation, Rice University).
