# Risk Factor Title Extraction Task

Yichun Sarah Fan <br>
May 17, 2025

---

**Project Summary**

This Python script extracts risk factors from the ***Item 1A. Risk Factors*** section of SEC 10-K filings for a panel of 10 unique firms over three years. For each filing, the script identifies and captures all visually emphasized risk factor titles, outputting them in a CSV file with the following columns: CIK, filing year, filing date, reporting date, and RFDTitle.

The code consists of two main parts.

**1. Initialize pyedgar environment:**
   - Prepares the environment and necessary libraries for accessing and parsing SEC EDGAR filings.
   - Handles data index setup and imports.

**2. EDGAR data extraction:**
   - Matches each firm-year with the correct 10-K filing.
   - Extracts all accentuated risk factor headings (bold, underlined, or italic) from the Item 1A section.
   - Writes all results to a structured CSV file for further analysis.

This workflow automates and standardizes the extraction of regulatory risk disclosures, enabling efficient and reproducible data collection for research or business purposes.

## Git setup

In [83]:
!git config --global user.name "SarahrahFan"
!git config --global user.email "fyc6373@gmail.com"

In [109]:
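import os

github_username = "SarahrahFan"
# Read the personal access token from the environment instead of hard-coding it
# (the variable name GITHUB_TOKEN is an assumption; set it before running this cell)
github_token = os.environ["GITHUB_TOKEN"]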

with open("/root/.netrc", "w") as f:
    f.write(f"machine github.com\nlogin {github_username}\npassword {github_token}\n")

!chmod 600 /root/.netrc

In [133]:
%cd /content/drive/MyDrive/RA/Tech\ test

/content/drive/MyDrive/RA/Tech test


In [135]:
!git init

hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint:
hint: 	git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:
hint: 	git branch -m <name>
Initialized empty Git repository in /content/drive/MyDrive/RA/Tech test/.git/


In [136]:
!git remote add origin https://github.com/SarahrahFan/EDGAR_riskfactors.git

In [127]:
!git add .

In [128]:
!git commit -m "Initial commit from RA/Tech test"

[main 999f0cc] Initial commit from RA/Tech test
 2 files changed, 1 insertion(+), 466727 deletions(-)
 rewrite V2 Sarah Fan: Risk Factor Title Extraction Task for Form 10-K.ipynb (98%)
 delete mode 100644 pyedgar/indices/form_DEF14A.idx


In [129]:
!git remote remove origin
!git remote add origin https://github.com/SarahrahFan/EDGAR_riskfactors.git

In [131]:
!git rm --cached pyedgar/indices/form_all.idx
!git rm --cached pyedgar/indices/form_8-K.idx
!git rm --cached pyedgar/indices/form_10-Q.idx

!echo "pyedgar/indices/form_all.idx" >> .gitignore
!echo "pyedgar/indices/form_8-K.idx" >> .gitignore
!echo "pyedgar/indices/form_10-Q.idx" >> .gitignore

!git add .gitignore
!git commit -m "Remove large .idx files and ignore them in future"

fatal: pathspec 'pyedgar/indices/form_all.idx' did not match any files
fatal: pathspec 'pyedgar/indices/form_8-K.idx' did not match any files
fatal: pathspec 'pyedgar/indices/form_10-Q.idx' did not match any files
[main c3c0f72] Remove large .idx files and ignore them in future
 1 file changed, 3 insertions(+)


In [132]:
!git push -u origin main

Enumerating objects: 31, done.
Counting objects: 100% (31/31)

## Initialize pyedgar environment

In [2]:
import os
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Define and Create pyedgar Directory Structure
## Base directory
base_dir = '/content/drive/MyDrive/RA/Tech test/pyedgar'

## Define subdirectories for config, index, and filings
conf_dir = os.path.join(base_dir, 'config')
index_dir = os.path.join(base_dir, 'indices')
filing_dir = os.path.join(base_dir, 'filings')

## Create the directories if they don't exist
os.makedirs(conf_dir, exist_ok=True)
os.makedirs(index_dir, exist_ok=True)
os.makedirs(filing_dir, exist_ok=True)

In [4]:
# Create config file
conf_path = os.path.join(conf_dir, 'hades.colab.pyedgar.conf')

with open(conf_path, 'w') as config_file:
    config_file.write(f"""
[DEFAULT]
SEC_BASE_URL = https://www.sec.gov
HEADERS = Sarah Fan (yichun.fan@gwmail.gwu.edu)

[Paths]
INDEX_ROOT = {index_dir}
FILING_ROOT = {filing_dir}

[Index]
INDEX_DELIMITER = |
INDEX_EXTENSION = idx

[Downloader]
KEEP_ALL = False
""")

# Set Environment Variable
os.environ['PYEDGAR_CONF'] = conf_path
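
Before pointing `PYEDGAR_CONF` at the file, it can help to confirm that the text is well-formed INI. The sketch below does this with the standard-library `configparser`. It uses hypothetical `/tmp` paths and a temporary file name (`example.pyedgar.conf`) in place of the Drive paths above, and it checks only the file syntax; how pyedgar itself consumes these keys is not shown here.

```python
import configparser
import os
import tempfile

# Hypothetical paths for illustration; the notebook uses Google Drive paths.
index_dir = "/tmp/pyedgar/indices"
filing_dir = "/tmp/pyedgar/filings"

config_text = f"""
[DEFAULT]
SEC_BASE_URL = https://www.sec.gov

[Paths]
INDEX_ROOT = {index_dir}
FILING_ROOT = {filing_dir}

[Index]
INDEX_DELIMITER = |
INDEX_EXTENSION = idx
"""

with tempfile.TemporaryDirectory() as tmp:
    conf_path = os.path.join(tmp, "example.pyedgar.conf")
    with open(conf_path, "w") as fh:
        fh.write(config_text)

    # Read the file back while the temporary directory still exists
    parsed = configparser.ConfigParser()
    parsed.read(conf_path)

index_root = parsed["Paths"]["INDEX_ROOT"]
delimiter = parsed["Index"]["INDEX_DELIMITER"]
base_url = parsed["Paths"]["SEC_BASE_URL"]  # inherited from [DEFAULT]
```

If any of these lookups raises `KeyError`, the f-string template was probably written with a missing section header or a misspelled key.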

In [5]:
!pip install pyedgar

Collecting pyedgar
  Downloading pyedgar-0.1.13-py3-none-any.whl.metadata (8.9 kB)
Downloading pyedgar-0.1.13-py3-none-any.whl (52 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.9/52.9 kB 1.9 MB/s eta 0:00:00
Installing collected packages: pyedgar
Successfully installed pyedgar-0.1.13


In [6]:
from pyedgar import config, Filing, EDGARIndex
print("✅ pyedgar is using config file from:", config.CONFIG_FILE)

✅ pyedgar is using config file from: /content/drive/MyDrive/RA/Tech test/pyedgar/config/hades.colab.pyedgar.conf


## EDGAR data extraction

In [7]:
import pandas as pd
import re
from time import sleep
from datetime import datetime
from bs4 import BeautifulSoup
from dateutil import parser

In [8]:
# Load input file
df_input = pd.read_csv("/content/drive/MyDrive/RA/Tech test/rasamplemini_rfdtitle.csv")

# Ensure CIK is 10-digit zero-padded (required by EDGARIndex)
df_input['cik'] = df_input['cik'].astype(str).str.zfill(10)
df_input['filingyear'] = df_input['filingyear'].astype(int)

# Load EDGAR index (use cached data if available)
idx = EDGARIndex(force_download=False)

In [122]:
# Delete every cached index file except the 10-K index to save Drive space
[os.remove(v) for k, v in idx.indices.items() if k != 'form_10-K.idx']

[None, None, None, None]

In [123]:
idx.indices

{'form_10-K.idx': '/content/drive/MyDrive/RA/Tech test/pyedgar/indices/form_10-K.idx'}

In [10]:
index_10k = pd.read_csv('/content/drive/MyDrive/RA/Tech test/pyedgar/indices/form_10-K.idx',
                        sep='|', dtype=str, low_memory=False)
index_10k['cik'] = index_10k['CIK'].astype(str).str.zfill(10)
index_10k['filingyear'] = pd.to_datetime(index_10k['Date Filed'], errors='coerce').dt.year
index_10k.head()

Unnamed: 0,CIK,Company Name,Form Type,Date Filed,Accession,cik,filingyear
0,20,K TRON INTERNATIONAL INC,10-K,1996-03-28,0000893220-96-000500,20,1996
1,20,K TRON INTERNATIONAL INC,10-K,1997-03-19,0000893220-97-000572,20,1997
2,20,K TRON INTERNATIONAL INC,10-K405,1998-03-18,0000893220-98-000560,20,1998
3,20,K TRON INTERNATIONAL INC,10-K,1999-03-23,0000893220-99-000357,20,1999
4,20,K TRON INTERNATIONAL INC,10-K405,2000-03-30,0000893220-00-000394,20,2000


### 10-K index check

In [11]:
# Merge to check which targets are matched in the 10-K index
check = df_input.merge(
    index_10k[['cik', 'filingyear', 'Accession']],
    on=['cik', 'filingyear'],
    how='left',
    indicator=True
)

# Print summary
n_total = len(df_input)
n_matched = (check['_merge'] == 'both').sum()
n_missing = (check['_merge'] == 'left_only').sum()

print(f"Total CIK-year pairs: {n_total}")
print(f"Matched in 10-K index: {n_matched}")
print(f"Missing in 10-K index: {n_missing}")

Total CIK-year pairs: 30
Matched in 10-K index: 36
Missing in 10-K index: 0


In [12]:
# Count how many unique 10-K accessions exist for each (cik, filingyear) pair
dupes = check[check['_merge'] == 'both'] \
    .groupby(['cik', 'filingyear'])['Accession'] \
    .nunique() \
    .reset_index()

# Keep only those cik+filingyear pairs with more than one 10-K accession
dupes = dupes[dupes['Accession'] > 1]

print(f"Number of company/year pairs with multiple 10-Ks: {len(dupes)}")
print(dupes)

# Optionally: List all accessions for these cik+year pairs for inspection
if not dupes.empty:
    # Merge with original data to display all matching accessions
    details = check.merge(dupes[['cik', 'filingyear']], on=['cik', 'filingyear'], how='inner')
    print("\nDetailed duplicate records:")
    print(details[['cik', 'filingyear', 'Accession']].sort_values(['cik', 'filingyear']))

Number of company/year pairs with multiple 10-Ks: 6
           cik  filingyear  Accession
8   0000002034        2017          2
12  0000003116        2013          2
14  0000003116        2015          2
17  0000004962        2012          2
25  0000006201        2006          2
26  0000006201        2007          2

Detailed duplicate records:
           cik  filingyear             Accession
0   0000002034        2017  0001144204-17-045100
1   0000002034        2017  0001144204-17-057835
4   0000003116        2013  0001157523-13-001183
5   0000003116        2013  0001157523-13-001479
2   0000003116        2015  0001171843-15-001465
3   0000003116        2015  0001171843-15-002393
6   0000004962        2012  0001193125-12-077400
7   0000004962        2012  0001140361-12-011832
10  0000006201        2006  0000950134-06-003715
11  0000006201        2006  0000006201-06-000049
8   0000006201        2007  0000950134-07-003888
9   0000006201        2007  0000950134-07-004263


The index check reports 36 matches for 30 CIK-year pairs because six pairs each have two filings in the 10-K index (30 + 6 = 36). Manual inspection confirms that every duplicate is due to an amended filing (a revised or supplemental report). For each CIK-year pair with duplicates, the subsequent analysis uses the most recently filed version, specifically the “FORM 10-K/A” filing.
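
The row inflation and the "keep the latest filing" rule can be reproduced on a few rows without pandas. This is only a sketch in plain Python: the tuples mirror values shown in the notebook outputs for CIK 0000002034 and 0000001750, and the names `index_rows`, `targets`, and `latest` are illustrative rather than part of the notebook.

```python
from datetime import date

# Index rows as (cik, filing_year, date_filed, accession).
# Values mirror rows that appear in the outputs of this notebook.
index_rows = [
    ("0000002034", 2017, date(2017, 8, 25), "0001144204-17-045100"),
    ("0000002034", 2017, date(2017, 11, 9), "0001144204-17-057835"),
    ("0000001750", 2016, date(2016, 7, 13), "0001047469-16-014299"),
]

targets = {("0000002034", 2017), ("0000001750", 2016)}

# One-to-many join: every index row whose key is a target pair is kept,
# so a pair with an amended filing contributes two rows.
matched = [row for row in index_rows if (row[0], row[1]) in targets]

# Keep the latest filing per (cik, year): sort ascending by date and accession,
# then let later rows overwrite earlier ones.
latest = {}
for cik, year, filed, accession in sorted(matched, key=lambda r: (r[0], r[1], r[2], r[3])):
    latest[(cik, year)] = (filed, accession)
```

Two target pairs produce three matched rows, and the selection step brings the count back to two, keeping the November 2017 accession for CIK 0000002034.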

In [13]:
# Keep only target cik + filingyear pairs from your input list
targets = df_input[['cik', 'filingyear']].drop_duplicates()

# Merge all 10-K index entries with the target cik + filingyear pairs
merged = index_10k.merge(targets, on=['cik', 'filingyear'], how='inner')

# Sort by cik, filingyear, Date Filed (ascending), and Accession (ascending)
merged['Date Filed'] = pd.to_datetime(merged['Date Filed'], errors='coerce')
merged = merged.sort_values(['cik', 'filingyear', 'Date Filed', 'Accession'])

# For each (cik, filingyear) group, keep only the latest record (last one in the sorted group)
latest = merged.groupby(['cik', 'filingyear'], as_index=False).last()

# Display the selected records: one most recent 10-K per cik and year
print(latest[['cik', 'filingyear', 'Accession', 'Date Filed', 'Form Type']])

           cik  filingyear             Accession Date Filed Form Type
0   0000001750        2016  0001047469-16-014299 2016-07-13      10-K
1   0000001750        2017  0001047469-17-004528 2017-07-12      10-K
2   0000001750        2018  0001047469-18-004978 2018-07-11      10-K
3   0000001800        2015  0001047469-15-001377 2015-02-27      10-K
4   0000001800        2016  0001047469-16-010246 2016-02-19      10-K
5   0000001800        2017  0001047469-17-000744 2017-02-17      10-K
6   0000002034        2015  0001571049-15-007509 2015-09-11      10-K
7   0000002034        2016  0001571049-16-017785 2016-08-26      10-K
8   0000002034        2017  0001144204-17-057835 2017-11-09    10-K/A
9   0000002488        2008  0001193125-08-038588 2008-02-26      10-K
10  0000002488        2009  0001193125-09-036235 2009-02-24      10-K
11  0000002488        2010  0001193125-10-035218 2010-02-19      10-K
12  0000003116        2013  0001157523-13-001479 2013-03-20    10-K/A
13  0000003116      

### Functions

#### Extraction for reporting date


In [46]:
def extract_reporting_date_from_html(html, maxlen=6000):
    html_head = html[:maxlen]
    html_head = html_head.replace('&nbsp;', ' ').replace('&#160;', ' ').replace('\xa0', ' ')

    # Step 1: Look directly in HTML for the date pattern
    m = re.search(
        r'(?i)for\s+(?:the\s+)?(?:fiscal\s+)?year\s+ended\s+([A-Za-z]{3,10})\s*\.?\s*(\d{1,2}),?\s*(\d{4})',
        html_head
    )
    if m:
        # Combine month, day, year
        date_str = f"{m.group(1)} {m.group(2)}, {m.group(3)}"
        try:
            dt = parser.parse(date_str)
            return dt.strftime("%Y/%m/%d")
        except Exception as e:
            print("❌ Failed to parse:", date_str, "| Error:", e)

    # Step 2: Fallback to <PERIOD> tag
    m = re.search(r"<PERIOD>(\d{8})</PERIOD>", html_head, re.IGNORECASE)
    if m:
        date = m.group(1)
        print("📄 Found <PERIOD>:", date)
        return f"{date[:4]}/{date[4:6]}/{date[6:]}"

    m = re.search(r"PERIOD[=:\s]+(\d{8})", html_head, re.IGNORECASE)
    if m:
        date = m.group(1)
        print("📄 Found PERIOD=:", date)
        return f"{date[:4]}/{date[4:6]}/{date[6:]}"

    print(f"⚠️ No reporting date found in the first {maxlen} characters of the filing")
    return ""
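
The first step of the function can be tried in isolation on a short string. The sketch below reuses the same regular expression but swaps `dateutil.parser` for the standard-library `datetime.strptime`, so it accepts only full or three-letter English month names and depends on an English locale. `parse_period` and the sample strings are hypothetical.

```python
import re
from datetime import datetime

# Same pattern as Step 1 above: "for the (fiscal) year ended <Month> <day>, <year>"
PERIOD_RE = re.compile(
    r'(?i)for\s+(?:the\s+)?(?:fiscal\s+)?year\s+ended\s+([A-Za-z]{3,10})\s*\.?\s*(\d{1,2}),?\s*(\d{4})'
)

def parse_period(text):
    """Return the fiscal year end as YYYY/MM/DD, or "" if nothing matches."""
    text = text.replace('&nbsp;', ' ').replace('\xa0', ' ')
    m = PERIOD_RE.search(text)
    if not m:
        return ""
    month, day, year = m.groups()
    # Try the full month name first, then the three-letter abbreviation
    for fmt in ("%B %d %Y", "%b %d %Y"):
        try:
            parsed = datetime.strptime(f"{month} {day} {year}", fmt)
            return parsed.strftime("%Y/%m/%d")
        except ValueError:
            continue
    return ""

full_name = parse_period("ANNUAL REPORT For the fiscal year ended&nbsp;May 31, 2018")
abbreviated = parse_period("for the year ended Dec. 31, 2016")
missing = parse_period("no date here")
```

Spellings such as "Sept." match the regular expression but fail both `strptime` formats in this sketch, which is one reason the notebook relies on the more permissive `dateutil` parser.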

#### Extraction for risk factors section

In [70]:
def extract_item_1a_section_from_html(txt, cik=None, year=None):
    txt = txt.replace('&nbsp;', ' ').replace('&#160;', ' ').replace('\xa0', ' ')

    # Step 1: Try direct match: "Item 1A ... Risk Factors ... Item 1B/2"
    pattern = r'(item\s*1a[\.\:]*.*?risk\s*factors.*?)(item\s*1b|item\s*2|signatures)'
    matches = re.findall(pattern, txt, flags=re.IGNORECASE | re.DOTALL | re.MULTILINE)
    if matches:
        print(f"✅ Found Item 1A section for CIK={cik}, Year={year} (direct match)")
        return max(matches, key=lambda x: len(x[0]))[0]

    # Step 2: Fallback - check <tr> with split "Item 1A" and "Risk Factors"
    tr_blocks = re.findall(r'<tr.*?>.*?</tr>', txt, flags=re.IGNORECASE | re.DOTALL)
    for block in tr_blocks:
        if re.search(r'item\s*1a', block, re.IGNORECASE) and re.search(r'risk\s*factors', block, re.IGNORECASE):
            print(f"🔎 Found split-tag header (Item 1A and Risk Factors in separate TDs) for CIK={cik}, Year={year}")
            start_index = txt.find(block)
            txt_trimmed = txt[start_index:]
            matches = re.findall(pattern, txt_trimmed, flags=re.IGNORECASE | re.DOTALL | re.MULTILINE)
            if matches:
                print(f"✅ Found Item 1A section for CIK={cik}, Year={year} (via fallback)")
                return max(matches, key=lambda x: len(x[0]))[0]
            else:
                print(f"⚠️ Split header found but no full Item 1A block matched for CIK={cik}, Year={year}")
            break

    # Step 3: Fallback - <B>Risk Factors</B> ... until ITEM 7 or ITEM 7A
    alt_patterns = [
        r'<b>\s*risk\s*factors\s*</b>',
        r'<b>\s*factors\s+that\s+could\s+affect\s+future\s+results\s*</b>'
    ]
    for alt_start in alt_patterns:
        m_start = re.search(alt_start, txt, flags=re.IGNORECASE)
        if m_start:
            start = m_start.start()
            m_end = re.search(r'item\s*7\s*(a)?[\.\:]', txt[start:], flags=re.IGNORECASE)
            end = start + m_end.start() if m_end else len(txt)
            print(f"🟡 Found alternate Risk Factor section for CIK={cik}, Year={year}")
            return txt[start:end]

    print(f"❌ Item 1A section not found for CIK={cik}, Year={year}")
    return None

In [76]:
def extract_item_1a_section_from_html(txt, cik=None, year=None):
    txt = txt.replace('&nbsp;', ' ').replace('&#160;', ' ').replace('\xa0', ' ')

    # Step 1: Direct match
    pattern = r'(item\s*1a[\.\:]*.*?risk\s*factors.*?)(item\s*1b|item\s*2|signatures)'
    matches = re.findall(pattern, txt, flags=re.IGNORECASE | re.DOTALL | re.MULTILINE)
    if matches:
        print(f"✅ Found Item 1A section for CIK={cik}, Year={year} (direct match)")
        return max(matches, key=lambda x: len(x[0]))[0], "direct"

    # Step 2: split <tr> fallback
    tr_blocks = re.findall(r'<tr.*?>.*?</tr>', txt, flags=re.IGNORECASE | re.DOTALL)
    for block in tr_blocks:
        if re.search(r'item\s*1a', block, re.IGNORECASE) and re.search(r'risk\s*factors', block, re.IGNORECASE):
            print(f"🔎 Found split-tag header for CIK={cik}, Year={year}")
            start_index = txt.find(block)
            txt_trimmed = txt[start_index:]
            matches = re.findall(pattern, txt_trimmed, flags=re.IGNORECASE | re.DOTALL | re.MULTILINE)
            if matches:
                print(f"✅ Found Item 1A section via fallback for CIK={cik}, Year={year}")
                return max(matches, key=lambda x: len(x[0]))[0], "split-tr"
            break

    # Step 3: fallback to Risk Factors heading before Item 7
    alt_patterns = [
        r'<b>\s*risk\s*factors\s*</b>',
        r'<b>\s*factors\s+that\s+could\s+affect\s+future\s+results\s*</b>'
    ]
    for alt_start in alt_patterns:
        m_start = re.search(alt_start, txt, flags=re.IGNORECASE)
        if m_start:
            start = m_start.start()
            m_end = re.search(r'item\s*7\s*(a)?[\.\:]', txt[start:], flags=re.IGNORECASE)
            end = start + m_end.start() if m_end else len(txt)
            print(f"🟡 Fallback: Risk Factors before Item 7 for CIK={cik}, Year={year}")
            return txt[start:end], "item7-fallback"

    print(f"❌ Item 1A section not found for CIK={cik}, Year={year}")
    return None, None
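
Choosing the longest match in Step 1 matters because "Item 1A. Risk Factors" usually appears twice: once in the table of contents and once as the real section header. The toy string below is invented for illustration; it shows `re.findall` returning both hits and `max(..., key=...)` picking the body section.

```python
import re

# Same pattern as Step 1 above: capture from "Item 1A ... Risk Factors" up to the next item header
PATTERN = r'(item\s*1a[\.\:]*.*?risk\s*factors.*?)(item\s*1b|item\s*2|signatures)'

# Invented filing text: a table-of-contents line followed by the actual sections
toy = (
    "Item 1A. Risk Factors 12 Item 1B. Unresolved Staff Comments 20 "
    "Item 1. Business ... "
    "Item 1A. Risk Factors We face intense competition. Our costs may rise. "
    "Item 1B. Unresolved Staff Comments None. Item 2. Properties"
)

matches = re.findall(PATTERN, toy, flags=re.IGNORECASE | re.DOTALL | re.MULTILINE)

# Each match is a (section_text, end_marker) tuple; keep the longest section text
section = max(matches, key=lambda m: len(m[0]))[0]
```

The heuristic is not foolproof: the lazy `.*?` stops at the first later occurrence of "Item 1B", "Item 2", or "signatures", so a cross-reference inside the risk factor text can cut a section short.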

#### Extraction for titles

In [66]:
def extract_risk_titles_from_item_1a_html(risk_html):
    if not risk_html:
        return []

    soup = BeautifulSoup(risk_html, "html.parser")
    candidates = []

    for tag in soup.find_all(['b', 'i']):
        sub = tag.get_text(separator=' ', strip=True)

        # Remove non-breaking spaces left after HTML parsing
        sub = sub.replace('\xa0', ' ')

        if (
            sub
            and not re.search(r'item\s*1a', sub, re.IGNORECASE)
            and not re.search(r'risk factors?', sub, re.IGNORECASE)
            and not sub.endswith(':')
            and not sub.isupper()
        ):
            candidates.append(sub)

    # Deduplicate while preserving order
    deduped = list(dict.fromkeys(candidates))

    # Print number of titles found
    print(f"✅ Total risk factor titles found: {len(deduped)}")

    return deduped
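
The filter and the order-preserving de-duplication can be checked without parsing any HTML. In this sketch, `keep_title` is a hypothetical helper that mirrors the conditions in the function above, and the candidate strings are invented examples of text that might sit inside `<b>` or `<i>` tags.

```python
import re

def keep_title(text):
    """Mirror the filter above: drop section headers, labels ending in ':', and all-caps text."""
    return bool(
        text
        and not re.search(r'item\s*1a', text, re.IGNORECASE)
        and not re.search(r'risk factors?', text, re.IGNORECASE)
        and not text.endswith(':')
        and not text.isupper()
    )

# Invented examples of emphasized text found inside an Item 1A section
candidates = [
    "Item 1A. Risk Factors",
    "We depend on a small number of customers.",
    "RISKS RELATED TO OUR BUSINESS",
    "Note:",
    "We depend on a small number of customers.",
    "Our international operations expose us to currency risk.",
]

# dict.fromkeys keeps the first occurrence of each title and preserves order
titles = list(dict.fromkeys(t for t in candidates if keep_title(t)))
```

Only the two sentence-style headings survive, in their original order. Note that the all-caps rule also removes group headings such as "RISKS RELATED TO OUR BUSINESS", which may or may not be desired depending on how a filer formats its titles.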

### Extraction output

#### Reportingdate output check

In [38]:
reporting_records = []

for row in df_input.itertuples(index=False):
    cik = str(row.cik).zfill(10)
    year = int(row.filingyear)

    df_cik_match = index_10k[
        (index_10k['CIK'].astype(str).str.zfill(10) == cik) &
        (pd.to_datetime(index_10k['Date Filed']).dt.year == year)
    ]
    if df_cik_match.empty:
        continue

    filing_meta = df_cik_match.iloc[0]
    accession = filing_meta['Accession'].split('/')[-1].replace('.txt', '')
    filingdate = filing_meta['Date Filed']

    try:
        filing = Filing(cik=cik, accession=accession)
        text = filing.full_text
    except Exception as e:
        print(f"⚠️ Failed to retrieve filing text for {cik}, {accession}: {e}")
        continue

    reportingdate = extract_reporting_date_from_html(text)

    reporting_records.append({
        'cik': cik,
        'filingyear': year,
        'filingdate': filingdate,
        'reportingdate': reportingdate
    })

df_reporting = pd.DataFrame(reporting_records)
print(df_reporting)

           cik  filingyear  filingdate reportingdate
0   0000001750        2018  2018-07-11    2018/05/31
1   0000001750        2017  2017-07-12    2017/05/31
2   0000001750        2016  2016-07-13    2016/05/31
3   0000001800        2017  2017-02-17    2016/12/31
4   0000001800        2016  2016-02-19    2015/12/31
5   0000001800        2015  2015-02-27    2014/12/31
6   0000002034        2017  2017-08-25    2017/06/30
7   0000002034        2016  2016-08-26    2016/06/30
8   0000002034        2015  2015-09-11    2015/06/30
9   0000002488        2010  2010-02-19    2009/12/26
10  0000002488        2009  2009-02-24    2008/12/27
11  0000002488        2008  2008-02-26    2007/12/29
12  0000003116        2015  2015-03-17    2014/12/31
13  0000003116        2014  2014-03-14    2013/12/31
14  0000003116        2013  2013-03-01    2012/12/31
15  0000004962        2012  2012-02-24    2011/12/31
16  0000004962        2011  2011-02-28    2010/12/31
17  0000004962        2010  2010-02-26    2009

#### Risk factors section check

In [77]:
item1a_check = []

for row in df_reporting.itertuples(index=False):
    cik = str(row.cik).zfill(10)
    year = int(row.filingyear)
    filingdate = row.filingdate

    df_cik_match = index_10k[
        (index_10k['CIK'].astype(str).str.zfill(10) == cik) &
        (pd.to_datetime(index_10k['Date Filed']).dt.year == year)
    ]
    if df_cik_match.empty:
        continue

    accession = df_cik_match.iloc[0]['Accession'].split('/')[-1].replace('.txt', '')

    try:
        filing = Filing(cik=cik, accession=accession)
        text = filing.full_text
    except Exception as e:
        print(f"⚠️ Failed to retrieve filing text for {cik}, {accession}: {e}")
        continue

    # Step 1A: check for presence of Item 1A block
    # (the function returns a (section, source) tuple, so unpack it before testing)
    risk_html, _source = extract_item_1a_section_from_html(text, cik=cik, year=year)
    status = "✅ Found" if risk_html else "❌ Not found"

    item1a_check.append({
        'cik': cik,
        'filingyear': year,
        'filingdate': filingdate,
        'item_1a_found': status
    })

df_check = pd.DataFrame(item1a_check)
print(df_check.head(10))

✅ Found Item 1A section for CIK=0000001750, Year=2018 (direct match)
✅ Found Item 1A section for CIK=0000001750, Year=2017 (direct match)
✅ Found Item 1A section for CIK=0000001750, Year=2016 (direct match)
✅ Found Item 1A section for CIK=0000001800, Year=2017 (direct match)
✅ Found Item 1A section for CIK=0000001800, Year=2016 (direct match)
✅ Found Item 1A section for CIK=0000001800, Year=2015 (direct match)
✅ Found Item 1A section for CIK=0000002034, Year=2017 (direct match)
✅ Found Item 1A section for CIK=0000002034, Year=2016 (direct match)
✅ Found Item 1A section for CIK=0000002034, Year=2015 (direct match)
✅ Found Item 1A section for CIK=0000002488, Year=2010 (direct match)
✅ Found Item 1A section for CIK=0000002488, Year=2009 (direct match)
✅ Found Item 1A section for CIK=0000002488, Year=2008 (direct match)
✅ Found Item 1A section for CIK=0000003116, Year=2015 (direct match)
✅ Found Item 1A section for CIK=0000003116, Year=2014 (direct match)
✅ Found Item 1A section for CIK=00

#### Risk titles output

In [79]:
from itertools import islice

rfd_records = []

for row in islice(df_reporting.itertuples(index=False), 6, 7):  # take only the 7th record (index 6)
    cik = str(row.cik).zfill(10)
    year = int(row.filingyear)
    filingdate = row.filingdate

    df_cik_match = index_10k[
        (index_10k['CIK'].astype(str).str.zfill(10) == cik) &
        (pd.to_datetime(index_10k['Date Filed']).dt.year == year)
    ]
    if df_cik_match.empty:
        continue

    accession = df_cik_match.iloc[0]['Accession'].split('/')[-1].replace('.txt', '')

    try:
        filing = Filing(cik=cik, accession=accession)
        text = filing.full_text
    except Exception:
        print(f"⚠️ Failed to load filing for CIK={cik}, Accession={accession}")
        continue

    # Extract the Item 1A content plus a label for how it was found
    risk_html, source = extract_item_1a_section_from_html(text, cik=cik, year=year)

    if not risk_html:
        continue

    # Extract the risk factor sub-headings
    risk_titles = extract_risk_titles_from_item_1a_html(risk_html)

    if not risk_titles:
        risk_titles = ['']

    for title in risk_titles:
        rfd_records.append({
            'cik': cik,
            'filingyear': year,
            'filingdate': filingdate,
            'source': source,
            'RFDTitle': title.strip()
        })

# Build the DataFrame
df_rfd = pd.DataFrame(rfd_records)
print(df_rfd)

✅ Found Item 1A section for CIK=0000002034, Year=2017 (direct match)
✅ Total risk factor titles found: 77
           cik  filingyear  filingdate  source  \
0   0000002034        2017  2017-08-25  direct   
1   0000002034        2017  2017-08-25  direct   
2   0000002034        2017  2017-08-25  direct   
3   0000002034        2017  2017-08-25  direct   
4   0000002034        2017  2017-08-25  direct   
..         ...         ...         ...     ...   
72  0000002034        2017  2017-08-25  direct   
73  0000002034        2017  2017-08-25  direct   
74  0000002034        2017  2017-08-25  direct   
75  0000002034        2017  2017-08-25  direct   
76  0000002034        2017  2017-08-25  direct   

                                             RFDTitle  
0                                    Item 1. Business  
1                                        Human Health  
2                          Pharmaceutical Ingredients  
3   Outlook for Global Medicines Through 2021: Bal...  
4            

### Final output

In [80]:
# Write the extracted titles to CSV (test run on a single filing)
df_rfd.to_csv('risk_titles.csv', index=False)
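
The project summary lists reporting date among the output columns, while the title records built above carry `source` instead. One way to assemble the stated layout is to join the titles with the reporting-date records on `(cik, filingyear)`. The sketch below does this in plain Python with the `csv` module and writes to an in-memory buffer. The rows are copied from the outputs above, and `reporting`, `titles`, and `dates_by_key` are illustrative names; in the notebook itself the equivalent would be a pandas merge of `df_rfd` with `df_reporting` on the same two keys.

```python
import csv
import io

# Rows copied from the notebook outputs above
reporting = [
    {"cik": "0000002034", "filingyear": 2017,
     "filingdate": "2017-08-25", "reportingdate": "2017/06/30"},
]
titles = [
    {"cik": "0000002034", "filingyear": 2017, "RFDTitle": "Human Health"},
    {"cik": "0000002034", "filingyear": 2017, "RFDTitle": "Pharmaceutical Ingredients"},
]

# Index the reporting records by (cik, filingyear) for the join
dates_by_key = {(r["cik"], r["filingyear"]): r for r in reporting}
columns = ["cik", "filingyear", "filingdate", "reportingdate", "RFDTitle"]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
writer.writeheader()
for t in titles:
    # Titles without a matching reporting record get empty date fields
    meta = dates_by_key.get((t["cik"], t["filingyear"]), {})
    writer.writerow({
        "cik": t["cik"],
        "filingyear": t["filingyear"],
        "filingdate": meta.get("filingdate", ""),
        "reportingdate": meta.get("reportingdate", ""),
        "RFDTitle": t["RFDTitle"],
    })

csv_text = buffer.getvalue()
```

Keeping `cik` as a string throughout preserves the leading zeros in the written file.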

## Reference

Campbell, J. L., Chen, H., Dhaliwal, D. S., Lu, H. M., & Steele, L. B. (2014). The information content of mandatory risk factor disclosures in corporate filings. *Review of Accounting Studies, 19*(1), 396–455.

Gaulin, M. P. (2017). *Risk fact or fiction: The information content of risk factor disclosures* (Doctoral dissertation, Rice University).
