# Risk Factor Title Extraction Task

Yichun Sarah Fan <br>
May 19, 2025

---

**Project Summary**

This Python script extracts risk factors from the ***Item 1A. Risk Factors*** section of SEC 10-K filings for a panel of 10 unique firms over three years. For each filing, the script identifies and captures all visually emphasized risk factor titles, outputting them in a CSV file with the following columns: CIK, filing year, filing date, reporting date, and RFDTitle.

The code consists of two main parts.

**1. Initialize pyedgar environment:**
   - Prepares the environment and necessary libraries for accessing and parsing SEC EDGAR filings.
   - Handles data index setup and imports.

**2. EDGAR data extraction:**
   - Matches each firm-year with the correct 10-K filing.
   - Extracts all accentuated risk factor headings (bold, underlined, or italic) from the Item 1A section.
   - Writes all results to a structured CSV file for further analysis.

This workflow automates and standardizes the extraction of regulatory risk disclosures, enabling efficient and reproducible data collection for research or business purposes.
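
As a rough illustration of the output layout described above, the sketch below writes one CSV row per risk factor title. The column names follow the notebook's final DataFrame; the helper names and the two titles are hypothetical placeholders, not extracted data.

```python
import csv
import io

# Column names mirror the notebook's final output; titles below are hypothetical.
COLUMNS = ["cik", "filingyear", "filingdate", "reportingdate", "RFDTitle"]


def rows_for_filing(cik, filingyear, filingdate, reportingdate, titles):
    """Return one output row per risk factor title (long format)."""
    return [
        {
            "cik": cik,
            "filingyear": filingyear,
            "filingdate": filingdate,
            "reportingdate": reportingdate,
            "RFDTitle": title,
        }
        for title in titles
    ]


def to_csv_text(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


example_rows = rows_for_filing(
    "0000001750", 2018, "2018-07-11", "2018/05/31",
    ["Hypothetical title A", "Hypothetical title B"],
)
csv_text = to_csv_text(example_rows)
```

Keeping the data in long format (filing metadata repeated on every title row) means each title can be filtered or counted without parsing a nested field.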

## Git setup

In [None]:
!git config --global user.name "SarahrahFan"
!git config --global user.email "fyc6373@gmail.com"

In [None]:
import os

github_username = "SarahrahFan"
# Read the personal access token from the environment instead of hard-coding it.
# GITHUB_TOKEN is an assumed variable name; set it before running this cell.
github_token = os.environ["GITHUB_TOKEN"]

with open("/root/.netrc", "w") as f:
    f.write(f"machine github.com\nlogin {github_username}\npassword {github_token}\n")

!chmod 600 /root/.netrc

In [None]:
%cd /content/drive/MyDrive/RA/Tech\ test

/content/drive/MyDrive/RA/Tech test


In [None]:
!git init

hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: 
hint: 	git config --global init.defaultBranch <name>
hint: 
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint: 
hint: 	git branch -m <name>
Initialized empty Git repository in /content/drive/MyDrive/RA/Tech test/.git/


In [None]:
!git remote add origin https://github.com/SarahrahFan/EDGAR_riskfactors.git

error: remote origin already exists.


In [None]:
!echo "pyedgar/indices/form_all.idx" >> .gitignore
!echo "pyedgar/indices/form_8-K.idx" >> .gitignore
!echo "pyedgar/indices/form_10-Q.idx" >> .gitignore

!git add .
!git commit -m "Clean initial commit without large files"

[master (root-commit) ef1c736] Clean initial commit without large files
 7 files changed, 355923 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 Sarah Fan: Risk Factor Title Extraction Task for Form 10-K.ipynb
 create mode 100644 V2 Sarah Fan: Risk Factor Title Extraction Task for Form 10-K.ipynb
 create mode 100644 pyedgar/config/hades.colab.pyedgar.conf
 create mode 100644 pyedgar/indices/form_10-K.idx
 create mode 100644 rasamplemini_rfdtitle.csv
 create mode 100644 submit_Sarah_Fan_Risk_Factor_Title_Extraction_Task_for_Form_10_K.ipynb


In [None]:
!git branch -M main
!git push -u --force origin main

Enumerating objects: 12, done.
Counting objects: 100% (12/12), done.
Delta compression using up to 2 threads
Compressing objects: 100% (11/11), done.
Writing objects: 100% (12/12), 4.33 MiB | 2.34 MiB/s, done.
Total 12 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), done.
To https://github.com/SarahrahFan/EDGAR_riskfactors.git
 * [new branch]      main -> main
Branch 'main' set up to track remote branch 'main' from 'origin'.


## Git update

In [23]:
!git status

fatal: not a git repository (or any of the parent directories): .git


In [None]:
!git add .

[main 1649c66] Update [file/module]: improved Item 1A extraction and handling notes
 1 file changed, 1 insertion(+), 1 deletion(-)
 rewrite V2 Sarah Fan: Risk Factor Title Extraction Task for Form 10-K.ipynb (87%)


In [None]:
!git commit -m "Optimize extraction logic: added character limit, special character cleanup, footnote handling, and table of contents distinction"

In [None]:
!git push

In [None]:
# !git push origin main

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 6.23 KiB | 455.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To github.com:SarahrahFan/EDGAR_riskfactors.git
   ef1c736..1649c66  main -> main


## Initialize pyedgar environment

In [1]:
import os
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Define and Create pyedgar Directory Structure
## Base directory
base_dir = '/content/drive/MyDrive/RA/Tech test/pyedgar'

## Define subdirectories for config, index, and filings
conf_dir = os.path.join(base_dir, 'config')
index_dir = os.path.join(base_dir, 'indices')
filing_dir = os.path.join(base_dir, 'filings')

## Create the directories if they don't exist
os.makedirs(conf_dir, exist_ok=True)
os.makedirs(index_dir, exist_ok=True)
os.makedirs(filing_dir, exist_ok=True)

In [3]:
# Create config file
conf_path = os.path.join(conf_dir, 'hades.colab.pyedgar.conf')

with open(conf_path, 'w') as config_file:
    config_file.write(f"""
[DEFAULT]
SEC_BASE_URL = https://www.sec.gov
HEADERS = Sarah Fan (yichun.fan@gwmail.gwu.edu)

[Paths]
INDEX_ROOT = {index_dir}
FILING_ROOT = {filing_dir}

[Index]
INDEX_DELIMITER = |
INDEX_EXTENSION = idx

[Downloader]
KEEP_ALL = False
""")

# Set Environment Variable
os.environ['PYEDGAR_CONF'] = conf_path
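
Before pointing `PYEDGAR_CONF` at the file, it can help to check that the INI text parses as intended. The sketch below uses the standard library's `configparser` for that check; whether pyedgar itself reads the file this way is an assumption, and the `/tmp` paths are hypothetical stand-ins for the Drive directories.

```python
import configparser

# Hypothetical paths standing in for the notebook's Drive directories.
index_dir = "/tmp/pyedgar/indices"
filing_dir = "/tmp/pyedgar/filings"

conf_text = f"""
[DEFAULT]
SEC_BASE_URL = https://www.sec.gov

[Paths]
INDEX_ROOT = {index_dir}
FILING_ROOT = {filing_dir}

[Index]
INDEX_DELIMITER = |
INDEX_EXTENSION = idx
"""

cfg = configparser.ConfigParser()
cfg.read_string(conf_text)

# Option lookups are case-insensitive; [DEFAULT] values are visible from every section.
index_root = cfg.get("Paths", "INDEX_ROOT")
delimiter = cfg.get("Index", "INDEX_DELIMITER")
base_url = cfg.get("Index", "SEC_BASE_URL")
```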

In [4]:
# !pip install pyedgar

In [5]:
from pyedgar import config, Filing, EDGARIndex
print("✅ pyedgar is using config file from:", config.CONFIG_FILE)

✅ pyedgar is using config file from: /content/drive/MyDrive/RA/Tech test/pyedgar/config/hades.colab.pyedgar.conf


## EDGAR data extraction

### 10-K index input

In [6]:
import pandas as pd
import re
from time import sleep
from datetime import datetime
from bs4 import BeautifulSoup
from dateutil import parser

In [7]:
# Load input file
df_input = pd.read_csv("/content/drive/MyDrive/RA/Tech test/rasamplemini_rfdtitle.csv")

# Ensure CIK is 10-digit zero-padded (required by EDGARIndex)
df_input['cik'] = df_input['cik'].astype(str).str.zfill(10)
df_input['filingyear'] = df_input['filingyear'].astype(int)

# Load EDGAR index (use cached data if available)
idx = EDGARIndex(force_download=False)

In [8]:
idx.indices

{'form_10-K.idx': '/content/drive/MyDrive/RA/Tech test/pyedgar/indices/form_10-K.idx'}

In [9]:
index_10k = pd.read_csv('/content/drive/MyDrive/RA/Tech test/pyedgar/indices/form_10-K.idx',
                        sep='|', dtype=str, low_memory=False)
index_10k['cik'] = index_10k['CIK'].astype(str).str.zfill(10)
index_10k['filingyear'] = pd.to_datetime(index_10k['Date Filed'], errors='coerce').dt.year
index_10k.head()

Unnamed: 0,CIK,Company Name,Form Type,Date Filed,Accession,cik,filingyear
0,20,K TRON INTERNATIONAL INC,10-K,1996-03-28,0000893220-96-000500,20,1996
1,20,K TRON INTERNATIONAL INC,10-K,1997-03-19,0000893220-97-000572,20,1997
2,20,K TRON INTERNATIONAL INC,10-K405,1998-03-18,0000893220-98-000560,20,1998
3,20,K TRON INTERNATIONAL INC,10-K,1999-03-23,0000893220-99-000357,20,1999
4,20,K TRON INTERNATIONAL INC,10-K405,2000-03-30,0000893220-00-000394,20,2000


### 10-K index check

In [10]:
# Merge to check which targets are matched in the 10-K index
check = df_input.merge(
    index_10k[['cik', 'filingyear', 'Accession']],
    on=['cik', 'filingyear'],
    how='left',
    indicator=True
)

# Print summary
n_total = len(df_input)
n_matched = (check['_merge'] == 'both').sum()
n_missing = (check['_merge'] == 'left_only').sum()

print(f"Total CIK-year pairs: {n_total}")
print(f"Matched in 10-K index: {n_matched}")
print(f"Missing in 10-K index: {n_missing}")

Total CIK-year pairs: 30
Matched in 10-K index: 36
Missing in 10-K index: 0


In [11]:
# Count how many unique 10-K accessions exist for each (cik, filingyear) pair
dupes = check[check['_merge'] == 'both'] \
    .groupby(['cik', 'filingyear'])['Accession'] \
    .nunique() \
    .reset_index()

# Keep only those cik+filingyear pairs with more than one 10-K accession
dupes = dupes[dupes['Accession'] > 1]

print(f"Number of company/year pairs with multiple 10-Ks: {len(dupes)}")
print(dupes)

# Optionally: List all accessions for these cik+year pairs for inspection
if not dupes.empty:
    # Merge with original data to display all matching accessions
    details = check.merge(dupes[['cik', 'filingyear']], on=['cik', 'filingyear'], how='inner')
    print("\nDetailed duplicate records:")
    print(details[['cik', 'filingyear', 'Accession']].sort_values(['cik', 'filingyear']))

Number of company/year pairs with multiple 10-Ks: 6
           cik  filingyear  Accession
8   0000002034        2017          2
12  0000003116        2013          2
14  0000003116        2015          2
17  0000004962        2012          2
25  0000006201        2006          2
26  0000006201        2007          2

Detailed duplicate records:
           cik  filingyear             Accession
0   0000002034        2017  0001144204-17-045100
1   0000002034        2017  0001144204-17-057835
4   0000003116        2013  0001157523-13-001183
5   0000003116        2013  0001157523-13-001479
2   0000003116        2015  0001171843-15-001465
3   0000003116        2015  0001171843-15-002393
6   0000004962        2012  0001193125-12-077400
7   0000004962        2012  0001140361-12-011832
10  0000006201        2006  0000950134-06-003715
11  0000006201        2006  0000006201-06-000049
8   0000006201        2007  0000950134-07-003888
9   0000006201        2007  0000950134-07-004263


Manual inspection of the results above confirms that every duplicate record stems from an amended filing (a revised or supplemental report). The initial plan was to use the most recent version, the “FORM 10-K/A” filing, for each CIK-year pair with duplicates.

**[Revised notes]** During extraction, I observed that most 10-K/A filings do not modify the Item 1A Risk Factors section. I therefore extract this section from the original 10-K filings instead, which streamlines processing and keeps the documents consistent.
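
The merge check above reports 36 matched rows for only 30 CIK-year pairs. A left join emits one row per matching index entry, so the six pairs with two accessions each contribute an extra row (24 × 1 + 6 × 2 = 36). The toy sketch below reproduces that effect with plain Python; the CIKs and accession labels are hypothetical.

```python
from collections import Counter

# Hypothetical toy index: (cik, filingyear, accession), with one pair filed twice.
index_rows = [
    ("0000000001", 2015, "acc-1"),
    ("0000000001", 2016, "acc-2"),
    ("0000000001", 2016, "acc-3"),  # second filing in the same year
    ("0000000002", 2015, "acc-4"),
]
targets = [("0000000001", 2015), ("0000000001", 2016), ("0000000002", 2015)]

# Emulate the left join: one output row per matching index entry.
joined = [
    (cik, year, accession)
    for (cik, year) in targets
    for (c, y, accession) in index_rows
    if (c, y) == (cik, year)
]
per_pair = Counter((cik, year) for cik, year, _ in joined)

matched_rows = len(joined)          # rows, inflated by duplicates
matched_pairs = len(per_pair)       # distinct CIK-year pairs
pairs_with_duplicates = sum(1 for n in per_pair.values() if n > 1)
```

Counting distinct pairs rather than rows gives the figure that is directly comparable to the number of target pairs.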

In [12]:
# Keep only target cik + filingyear pairs from your input list
targets = df_input[['cik', 'filingyear']].drop_duplicates()

# Merge all 10-K index entries with the target cik + filingyear pairs
merged = index_10k.merge(targets, on=['cik', 'filingyear'], how='inner')

# Convert Date Filed to datetime format
merged['Date Filed'] = pd.to_datetime(merged['Date Filed'], errors='coerce')

# Filter to keep only original 10-K filings (ignore 10-K/A)
merged = merged[merged['Form Type'] == '10-K']

# Sort by cik, filingyear, Date Filed (ascending), and Accession (ascending)
merged = merged.sort_values(['cik', 'filingyear', 'Date Filed', 'Accession'])

# For each (cik, filingyear) group, keep only the latest record (last one in the sorted group)
latest_10k = merged.groupby(['cik', 'filingyear'], as_index=False).last()

# Display the selected records: one most recent 10-K per cik and year
print(latest_10k[['cik', 'filingyear', 'Accession', 'Date Filed', 'Form Type']])

           cik  filingyear             Accession Date Filed Form Type
0   0000001750        2016  0001047469-16-014299 2016-07-13      10-K
1   0000001750        2017  0001047469-17-004528 2017-07-12      10-K
2   0000001750        2018  0001047469-18-004978 2018-07-11      10-K
3   0000001800        2015  0001047469-15-001377 2015-02-27      10-K
4   0000001800        2016  0001047469-16-010246 2016-02-19      10-K
5   0000001800        2017  0001047469-17-000744 2017-02-17      10-K
6   0000002034        2015  0001571049-15-007509 2015-09-11      10-K
7   0000002034        2016  0001571049-16-017785 2016-08-26      10-K
8   0000002034        2017  0001144204-17-045100 2017-08-25      10-K
9   0000002488        2008  0001193125-08-038588 2008-02-26      10-K
10  0000002488        2009  0001193125-09-036235 2009-02-24      10-K
11  0000002488        2010  0001193125-10-035218 2010-02-19      10-K
12  0000003116        2013  0001157523-13-001183 2013-03-01      10-K
13  0000003116      

### Functions

#### Extraction for reporting date


In [13]:
def extract_reporting_date_from_html(html, maxlen=6000):
    html_head = html[:maxlen]
    html_head = html_head.replace('&nbsp;', ' ').replace('&#160;', ' ').replace('\xa0', ' ')

    # Step 1: Look directly in HTML for the date pattern
    m = re.search(
        r'(?i)for\s+(?:the\s+)?(?:fiscal\s+)?year\s+ended\s+([A-Za-z]{3,10})\s*\.?\s*(\d{1,2}),?\s*(\d{4})',
        html_head
    )
    if m:
        # Combine month, day, year
        date_str = f"{m.group(1)} {m.group(2)}, {m.group(3)}"
        try:
            dt = parser.parse(date_str)
            return dt.strftime("%Y/%m/%d")
        except Exception as e:
            print("❌ Failed to parse:", date_str, "| Error:", e)

    # Step 2: Fallback to <PERIOD> tag
    m = re.search(r"<PERIOD>(\d{8})</PERIOD>", html_head, re.IGNORECASE)
    if m:
        date = m.group(1)
        print("📄 Found <PERIOD>:", date)
        return f"{date[:4]}/{date[4:6]}/{date[6:]}"

    m = re.search(r"PERIOD[=:\s]+(\d{8})", html_head, re.IGNORECASE)
    if m:
        date = m.group(1)
        print("📄 Found PERIOD=:", date)
        return f"{date[:4]}/{date[4:6]}/{date[6:]}"

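    # cik and year are not parameters of this function, so avoid relying on notebook globals here
    print(f"⚠️ No reporting date found in the first {maxlen} characters of the filing")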
    return ""
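
To see what the "year ended" pattern in the function above captures, the sketch below applies the same regular expression to a few sample header strings. It swaps `dateutil` for `datetime.strptime` so that it runs with the standard library alone; that substitution, the helper name, and the sample strings are assumptions for illustration, and month-name parsing assumes an English locale.

```python
import re
from datetime import datetime

# Same pattern as in extract_reporting_date_from_html.
PATTERN = re.compile(
    r'(?i)for\s+(?:the\s+)?(?:fiscal\s+)?year\s+ended\s+'
    r'([A-Za-z]{3,10})\s*\.?\s*(\d{1,2}),?\s*(\d{4})'
)


def reporting_date(header_text):
    """Return the fiscal year-end as YYYY/MM/DD, or '' if no date is found."""
    m = PATTERN.search(header_text)
    if not m:
        return ""
    month, day, year = m.group(1), m.group(2), m.group(3)
    # Try the full month name first, then the three-letter abbreviation.
    for fmt in ("%B %d %Y", "%b %d %Y"):
        try:
            parsed = datetime.strptime(f"{month} {day} {year}", fmt)
            return parsed.strftime("%Y/%m/%d")
        except ValueError:
            continue
    return ""


full_name = reporting_date("For the fiscal year ended May 31, 2018")
abbreviated = reporting_date("for the year ended Dec. 31, 2016")
missing = reporting_date("Annual report of the registrant")
```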

#### Extraction for risk factors section

The `extract_item_1a_section_from_html` function locates the `Item 1A` section in two steps: a structural match on a bold (`<b>`) heading, followed by a regex fallback. Each step prints a one-line status message. If neither step finds the section, the function returns `(None, None)` without attempting additional heuristics.

This is appropriate because the extraction process has already successfully identified the `Item 1A` section in `93.3%` of the filings (28 out of 30), leaving only two unmatched cases. Upon closer inspection, these two filings place their risk factor discussions under **ITEM 7. MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS**, using inconsistent headings such as “*Risk Factors*” and “*Factors That Could Affect Future Results*”.


Due to the variation in formatting and the added complexity it would introduce to robustly detect such edge cases, this version of the function intentionally **does not handle them**. Addressing those special cases would require additional effort in title normalization or context-aware parsing, which falls outside the scope of the current streamlined extraction logic.
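
The boundary logic can be sketched on plain text as follows: start at an "Item 1A ... Risk Factors" heading, stop at the next "Item 1B" or "Item 2", and skip candidates shorter than 100 characters (typically table-of-contents entries). This is a simplified stand-in rather than the notebook's function; the patterns, helper name, and sample text are assumptions for illustration.

```python
import re

START = re.compile(r'item\s*1a\.?\s*risk\s*factors', re.IGNORECASE)
END = re.compile(r'item\s+1b|item\s+2', re.IGNORECASE)


def find_item_1a(text, min_len=100):
    """Return the first Item 1A candidate of at least min_len characters, else None."""
    for start in START.finditer(text):
        rest = text[start.start():]
        end = END.search(rest)
        section = rest[:end.start()] if end else rest
        if len(section.strip()) >= min_len:
            return section
    return None


# Hypothetical filing text: a table-of-contents entry followed by the real section.
toc = "Item 1A. Risk Factors 12\nItem 1B. Unresolved Staff Comments 20\n"
body = (
    "Item 1A. Risk Factors\n"
    + "We face hypothetical competition risk. " * 5
    + "\nItem 1B. Unresolved Staff Comments\nNone.\n"
)
section = find_item_1a(toc + body)

# A filing that discusses risks under Item 7 with a different heading is not matched.
unmatched = find_item_1a("ITEM 7. Factors That Could Affect Future Results")
```

The length check is what separates the table-of-contents line from the real section, and the second call shows why the two unmatched filings fall through: their headings never mention Item 1A.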

In [14]:
def clean_html_tables(txt):
    soup = BeautifulSoup(txt, 'html.parser')

    for table in soup.find_all('table'):
        table_text = table.get_text(" ", strip=True).lower()

        has_item_1a = bool(table.find('b', string=lambda s: s and 'item 1a' in s.lower()))
        has_risk_factors = bool(table.find('b', string=lambda s: s and 'risk factors' in s.lower()))

        if has_item_1a and has_risk_factors:
            continue

        if any(keyword in table_text for keyword in [
            'forward-looking statements', 'you are cautioned', 'undue reliance'
        ]):
            table.decompose()

    return str(soup)

In [15]:
def extract_item_1a_section_from_html(txt, cik=None, year=None):
    txt = txt.replace('&nbsp;', ' ').replace('&#160;', ' ').replace('\xa0', ' ')
    txt = txt.replace('&#151;', 'nnn')
    txt = re.sub(r'<!-- XBRL Footnotes Begin -->(.*?)<!-- XBRL Footnotes End -->', '', txt, flags=re.DOTALL)
    txt = clean_html_tables(txt)

    soup = BeautifulSoup(txt, 'html.parser')

    found_header = None

    for tag in soup.find_all('b'):
        text = tag.get_text(" ", strip=True).lower()

        if (
            "item 1a" in text
            and "risk factors" in text
            and not any(exclude in text for exclude in ["see", "refer", "page", "discussion", "discussed"])
        ):
            found_header = tag
            break

    if found_header:
        print(f"✅ Found structural header tag for CIK={cik}, Year={year}")

        header_text = found_header.get_text(" ", strip=True)
        header_text_clean = re.escape(header_text[:80])
        header_match = re.search(header_text_clean, txt, flags=re.IGNORECASE)

        if header_match:
            start_index = header_match.start()
            txt_trimmed = txt[start_index:]

            end_match = re.search(r'item\s+1b|item\s+2', txt_trimmed, flags=re.IGNORECASE)
            end_index = end_match.start() if end_match else len(txt_trimmed)

            section_text = txt_trimmed[:end_index]

            if len(section_text.strip()) < 100:
                print(f"⚠️ Ignored short Item 1A section (<100 chars) for CIK={cik}, Year={year}")
                return None, None

            return section_text, "parsed-structural"

        else:
            print(f"⚠️ Header found structurally but not located in raw HTML for CIK={cik}")

    # Fallback regex
    pattern = r'((?:^|[\n>])\s*item\s*1a[\.\:]*((?!nnn).)*?risk\s*factors.*?)(item\s*1b|item\s*2)'
    matches = re.findall(pattern, txt, flags=re.IGNORECASE | re.DOTALL | re.MULTILINE)
    if matches:
        print(f"✅ Found Item 1A section for CIK={cik}, Year={year} (regex fallback)")
        return max(matches, key=lambda x: len(x[0]))[0], "regex-fallback"

    print(f"❌ Item 1A section not found for CIK={cik}, Year={year}")
    return None, None

In [16]:
def extract_item_1a_section_from_html(txt, cik=None, year=None):
    # Step 0: Normalize text and remove footnotes/tables
    txt = txt.replace('&nbsp;', ' ').replace('&#160;', ' ').replace('\xa0', ' ')
    txt = txt.replace('&#151;', 'nnn')
    txt = re.sub(r'<!-- XBRL Footnotes Begin -->(.*?)<!-- XBRL Footnotes End -->', '', txt, flags=re.DOTALL)
    txt = clean_html_tables(txt)

    soup = BeautifulSoup(txt, 'html.parser')

    # Step 1: Structural match using <b> tags
    for tag in soup.find_all('b'):
        text = tag.get_text(" ", strip=True).lower()

        if (
            "item 1a" in text
            and "risk factors" in text
            and not any(exclude in text for exclude in ["see", "refer", "page", "discussion", "discussed"])
        ):
            header_text = tag.get_text(" ", strip=True)
            header_text_clean = re.escape(header_text[:80])
            header_match = re.search(header_text_clean, txt, flags=re.IGNORECASE)

            if header_match:
                start_index = header_match.start()
                txt_trimmed = txt[start_index:]

                end_match = re.search(r'item\s+1b|item\s+2', txt_trimmed, flags=re.IGNORECASE)
                end_index = end_match.start() if end_match else len(txt_trimmed)

                section_text = txt_trimmed[:end_index]

                if len(section_text.strip()) >= 100:
                    print(f"✅ Found valid structural Item 1A section for CIK={cik}, Year={year}")
                    return section_text, "parsed-structural"
                else:
                    print(f"⚠️ Skipped short Item 1A section (<100 chars) for CIK={cik}, Year={year}")
                    continue  # try next candidate

    # Step 2: Fallback regex
    pattern = r'((?:^|[\n>])\s*item\s*1a[\.\:]*((?!nnn).)*?risk\s*factors.*?)(item\s*1b|item\s*2)'
    matches = re.findall(pattern, txt, flags=re.IGNORECASE | re.DOTALL | re.MULTILINE)

    if matches:
        best = max(matches, key=lambda x: len(x[0]))[0]
        if len(best.strip()) >= 100:
            print(f"✅ Found Item 1A section for CIK={cik}, Year={year} (regex fallback)")
            return best, "regex-fallback"
        else:
            print(f"⚠️ Skipped short fallback Item 1A section (<100 chars) for CIK={cik}, Year={year}")

    print(f"❌ Item 1A section not found for CIK={cik}, Year={year}")
    return None, None

#### Extraction for titles

In [17]:
def extract_risk_titles_from_item_1a_html(risk_html):
    if not risk_html:
        return []

    soup = BeautifulSoup(risk_html, "html.parser")
    candidates = []

    # Step 1: Tags with semantic emphasis
    for tag in soup.find_all(['b', 'i', 'u']):
        sub = tag.get_text(separator=' ', strip=True).replace('\xa0', ' ').strip()
        if (
            sub
            and not re.search(r'item\s*1a', sub, re.IGNORECASE)
            and not re.search(r'risk\s*factors?', sub, re.IGNORECASE)
            and not sub.endswith(':')
            and not sub.isupper()
        ):
            candidates.append(sub)

    # Step 2: Additional support for <font> or <div> with bold styles
    for tag in soup.find_all(['font', 'div']):
        style = tag.get('style', '').lower()
        if 'font-weight: bold' in style:
            sub = tag.get_text(separator=' ', strip=True).replace('\xa0', ' ').strip()
            if (
                sub
                and not re.search(r'item\s*1a', sub, re.IGNORECASE)
                and not re.search(r'risk\s*factors?', sub, re.IGNORECASE)
                and not sub.endswith(':')
                and not sub.isupper()
            ):
                candidates.append(sub)

    # Deduplicate while preserving order
    deduped = list(dict.fromkeys(candidates))

    # Filter out very short titles (less than 10 characters)
    filtered = [title for title in deduped if len(title) >= 10]

    print(f"✅ Total risk factor titles found: {len(filtered)}")

    return filtered
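
The filtering steps in the function above (drop labels ending in a colon, drop all-caps headings, deduplicate in order, drop very short strings) can be tried on a small HTML snippet. The sketch below uses the standard library's `HTMLParser` instead of BeautifulSoup so it runs without extra packages; unlike the notebook's function, it collects only the outermost emphasized run and omits the "Item 1A" / "Risk Factors" exclusions. The class name, helper name, and sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class EmphasisCollector(HTMLParser):
    """Collect text that appears inside <b>, <i>, or <u> tags."""

    EMPHASIS = {"b", "i", "u"}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.buffer = []
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag in self.EMPHASIS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.EMPHASIS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse whitespace and store the finished emphasized run.
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.candidates.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def emphasized_titles(html, min_len=10):
    collector = EmphasisCollector()
    collector.feed(html)
    collector.close()
    kept = [t for t in collector.candidates
            if not t.endswith(":") and not t.isupper()]
    deduped = list(dict.fromkeys(kept))  # preserves first-seen order
    return [t for t in deduped if len(t) >= min_len]


sample = (
    "<p><b>Our hypothetical sales may decline.</b> Body text.</p>"
    "<p><b>Note:</b> <i>See below</i></p>"
    "<p><b>OVERVIEW OF RISKS</b></p>"
    "<p><b>Our hypothetical sales may decline.</b></p>"
)
titles = emphasized_titles(sample)
```

`dict.fromkeys` keeps the first occurrence of each title, so the output order follows the order of appearance in the filing.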

### Extraction output

#### Reporting date output check

In [18]:
reporting_records = []

for row in df_input.itertuples(index=False):
    cik = str(row.cik).zfill(10)
    year = int(row.filingyear)

    df_cik_match = latest_10k[
        (latest_10k['CIK'].astype(str).str.zfill(10) == cik) &
        (pd.to_datetime(latest_10k['Date Filed']).dt.year == year)
    ]
    if df_cik_match.empty:
        continue

    filing_meta = df_cik_match.iloc[0]
    accession = filing_meta['Accession'].split('/')[-1].replace('.txt', '')
    filingdate = filing_meta['Date Filed']

    try:
        filing = Filing(cik=cik, accession=accession)
        text = filing.full_text
    except Exception as e:
        print(f"⚠️ Failed to retrieve filing text for {cik}, {accession}: {e}")
        continue

    reportingdate = extract_reporting_date_from_html(text)

    reporting_records.append({
        'cik': cik,
        'filingyear': year,
        'filingdate': filingdate,
        'reportingdate': reportingdate
    })

df_reporting = pd.DataFrame(reporting_records)
print(df_reporting)

           cik  filingyear filingdate reportingdate
0   0000001750        2018 2018-07-11    2018/05/31
1   0000001750        2017 2017-07-12    2017/05/31
2   0000001750        2016 2016-07-13    2016/05/31
3   0000001800        2017 2017-02-17    2016/12/31
4   0000001800        2016 2016-02-19    2015/12/31
5   0000001800        2015 2015-02-27    2014/12/31
6   0000002034        2017 2017-08-25    2017/06/30
7   0000002034        2016 2016-08-26    2016/06/30
8   0000002034        2015 2015-09-11    2015/06/30
9   0000002488        2010 2010-02-19    2009/12/26
10  0000002488        2009 2009-02-24    2008/12/27
11  0000002488        2008 2008-02-26    2007/12/29
12  0000003116        2015 2015-03-17    2014/12/31
13  0000003116        2014 2014-03-14    2013/12/31
14  0000003116        2013 2013-03-01    2012/12/31
15  0000004962        2012 2012-02-24    2011/12/31
16  0000004962        2011 2011-02-28    2010/12/31
17  0000004962        2010 2010-02-26    2009/12/31
18  00000049

#### Risk factors section check

In [19]:
item1a_check = []

for row in df_reporting.itertuples(index=False):
    cik = str(row.cik).zfill(10)
    year = int(row.filingyear)
    filingdate = row.filingdate

    df_cik_match = latest_10k[
        (latest_10k['CIK'].astype(str).str.zfill(10) == cik) &
        (pd.to_datetime(latest_10k['Date Filed']).dt.year == year)
    ]
    if df_cik_match.empty:
        continue

    accession = df_cik_match.iloc[0]['Accession'].split('/')[-1].replace('.txt', '')

    try:
        filing = Filing(cik=cik, accession=accession)
        text = filing.full_text
    except Exception as e:
        print(f"⚠️ Failed to retrieve filing text for {cik}, {accession}: {e}")
        continue

    # Step 1A: check for presence of Item 1A block
    risk_html, source = extract_item_1a_section_from_html(text, cik=cik, year=year)
    status = "✅ Found" if risk_html else "❌ Not found"

    item1a_check.append({
        'cik': cik,
        'filingyear': year,
        'filingdate': filingdate,
        'item_1a_found': status
    })

df_check = pd.DataFrame(item1a_check)

✅ Found valid structural Item 1A section for CIK=0000001750, Year=2018
✅ Found valid structural Item 1A section for CIK=0000001750, Year=2017
✅ Found valid structural Item 1A section for CIK=0000001750, Year=2016
✅ Found valid structural Item 1A section for CIK=0000001800, Year=2017
✅ Found valid structural Item 1A section for CIK=0000001800, Year=2016
✅ Found valid structural Item 1A section for CIK=0000001800, Year=2015
✅ Found valid structural Item 1A section for CIK=0000002034, Year=2017
✅ Found valid structural Item 1A section for CIK=0000002034, Year=2016
✅ Found Item 1A section for CIK=0000002034, Year=2015 (regex fallback)
✅ Found Item 1A section for CIK=0000002488, Year=2010 (regex fallback)
✅ Found Item 1A section for CIK=0000002488, Year=2009 (regex fallback)
✅ Found Item 1A section for CIK=0000002488, Year=2008 (regex fallback)
✅ Found Item 1A section for CIK=0000003116, Year=2015 (regex fallback)
✅ Found Item 1A section for CIK=0000003116, Year=2014 (regex fallback)
✅ Foun

In [None]:
"""Risk factors content check

from itertools import islice

for row in islice(df_reporting.itertuples(index=False), 24,25):
    cik = str(row.cik).zfill(10)
    year = int(row.filingyear)

    df_cik_match = latest_10k[
        (latest_10k['CIK'].astype(str).str.zfill(10) == cik) &
        (pd.to_datetime(latest_10k['Date Filed']).dt.year == year)
    ]
    if df_cik_match.empty:
        continue

    accession = df_cik_match.iloc[0]['Accession'].split('/')[-1].replace('.txt', '')
    try:
        filing = Filing(cik=cik, accession=accession)
        text = filing.full_text
    except Exception as e:
        print(f"⚠️ Failed to retrieve filing for {cik}: {e}")
        continue

    section, source = extract_item_1a_section_from_html(text, cik=cik, year=year)

    if section:
        print(f"\n🎯 First matched Item 1A section for CIK={cik}, Year={year}")
        print("📌 Source:", source)
        print("📄 Section Preview:\n")
        print(section[:1500])
        break
"""

#### Risk titles output

In [20]:
rfd_records = []

for row in df_reporting.itertuples(index=False):
    cik = str(row.cik).zfill(10)
    year = int(row.filingyear)
    filingdate = row.filingdate

    df_cik_match = latest_10k[
        (latest_10k['CIK'].astype(str).str.zfill(10) == cik) &
        (pd.to_datetime(latest_10k['Date Filed']).dt.year == year)
    ]
    if df_cik_match.empty:
        continue

    accession = df_cik_match.iloc[0]['Accession'].split('/')[-1].replace('.txt', '')

    try:
        filing = Filing(cik=cik, accession=accession)
        text = filing.full_text
    except Exception:
        print(f"⚠️ Failed to load filing for CIK={cik}, Accession={accession}")
        continue

    risk_html, source = extract_item_1a_section_from_html(text, cik=cik, year=year)

    if not risk_html:
        continue

    risk_titles = extract_risk_titles_from_item_1a_html(risk_html)

    if not risk_titles:
        risk_titles = ['']

    for title in risk_titles:
        rfd_records.append({
            'cik': cik,
            'filingyear': year,
            'filingdate': filingdate,
            'RFDTitle': title.strip()
        })

df_rfd = pd.DataFrame(rfd_records)
print(df_rfd)

✅ Found valid structural Item 1A section for CIK=0000001750, Year=2018
✅ Total risk factor titles found: 18
✅ Found valid structural Item 1A section for CIK=0000001750, Year=2017
✅ Total risk factor titles found: 19
✅ Found valid structural Item 1A section for CIK=0000001750, Year=2016
✅ Total risk factor titles found: 19
✅ Found valid structural Item 1A section for CIK=0000001800, Year=2017
✅ Total risk factor titles found: 20
✅ Found valid structural Item 1A section for CIK=0000001800, Year=2016
✅ Total risk factor titles found: 19
✅ Found valid structural Item 1A section for CIK=0000001800, Year=2015
✅ Total risk factor titles found: 19
✅ Found valid structural Item 1A section for CIK=0000002034, Year=2017
✅ Total risk factor titles found: 64
✅ Found valid structural Item 1A section for CIK=0000002034, Year=2016
✅ Total risk factor titles found: 57
✅ Found Item 1A section for CIK=0000002034, Year=2015 (regex fallback)
✅ Total risk factor titles found: 49
✅ Found Item 1A section for 

In [None]:
"""
from itertools import islice

rfd_records = []

for row in islice(df_reporting.itertuples(index=False), 25,26):
    cik = str(row.cik).zfill(10)
    year = int(row.filingyear)
    filingdate = row.filingdate

    df_cik_match = latest_10k[
        (latest_10k['CIK'].astype(str).str.zfill(10) == cik) &
        (pd.to_datetime(latest_10k['Date Filed']).dt.year == year)
    ]
    if df_cik_match.empty:
        continue

    accession = df_cik_match.iloc[0]['Accession'].split('/')[-1].replace('.txt', '')

    try:
        filing = Filing(cik=cik, accession=accession)
        text = filing.full_text
    except Exception:
        print(f"⚠️ Failed to load filing for CIK={cik}, Accession={accession}")
        continue

    risk_html, source = extract_item_1a_section_from_html(text, cik=cik, year=year)

    if not risk_html:
        continue

    risk_titles = extract_risk_titles_from_item_1a_html(risk_html)

    if not risk_titles:
        risk_titles = ['']

    for title in risk_titles:
        rfd_records.append({
            'cik': cik,
            'filingyear': year,
            'filingdate': filingdate,
            'RFDTitle': title.strip()
        })

df_rfd = pd.DataFrame(rfd_records)
print(df_rfd)
"""

### Final output

In [21]:
df_output = df_reporting.merge(
    df_rfd[['cik', 'filingyear', 'RFDTitle']],
    on=['cik', 'filingyear'],
    how='left'
)

print(df_output.head())

          cik  filingyear filingdate reportingdate  \
0  0000001750        2018 2018-07-11    2018/05/31   
1  0000001750        2018 2018-07-11    2018/05/31   
2  0000001750        2018 2018-07-11    2018/05/31   
3  0000001750        2018 2018-07-11    2018/05/31   
4  0000001750        2018 2018-07-11    2018/05/31   

                                            RFDTitle  
0  We are affected by factors that adversely impa...  
1  Our U.S. government contracts may not continue...  
2  We face risks of cost overruns and losses on f...  
3  Success at our airframe maintenance facilities...  
4  We operate in highly competitive markets, and ...  


In [22]:
df_output.to_csv('rasamplemini_rfdtitle_Sarahoutput.csv', index=False)

## Reference

Campbell, J. L., Chen, H., Dhaliwal, D. S., Lu, H. M., & Steele, L. B. (2014). The information content of mandatory risk factor disclosures in corporate filings. *Review of Accounting Studies, 19*(1), 396–455.

Gaulin, M. P. (2017). *Risk fact or fiction: The information content of risk factor disclosures* (Doctoral dissertation, Rice University).
