# LLM Document Analysis

**Name:** Jose Ayala<br>
**Class:** CAP5619 - Artificial Intelligence for FinTech<br>
**Assignment:** Text Extraction<br>
**Date:** March 11, 2025  

### Imports

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import json
import time
import sys
import subprocess
import csv
import os

### Fetching Company Data from the SEC API

This block of code fetches company data from the SECâ€™s public API, which provides company ticker symbols and related information in JSON format. The JSON data is then parsed and converted into a Pandas DataFrame for further analysis. We'll use this data to extract the necessary company information for processing SEC 8-K filings.


In [2]:
# Fetch the company data from the SEC API
url = "https://www.sec.gov/files/company_tickers.json"
headers = {'User-Agent': 'Your Name (jo647937@ucf.edu)'}
response = requests.get(url, headers=headers)
company_data = response.json()
df = pd.DataFrame.from_dict(company_data, orient='index')
print(df.tail())

      cik_str ticker                        title
9715  1949257  RDPTF  Radiopharm Theranostics Ltd
9716  1506721  BLFBY       Balfour Beatty plc/ADR
9717  1506721  BAFBF       Balfour Beatty plc/ADR
9718  1991946  CGBSW       Crown LNG Holdings Ltd
9719  1449664  NWSZF             CTF Services Ltd


### Defining Variables for Company Processing

In this block of code, we define some important variables that will be used throughout the process of handling SEC 8-K filings. These variables will guide the flow of data extraction and help manage the processing of a large number of company filings.


In [3]:
# Define the number of companies to process
number_of_companies_to_process = 10000
keywords_found = 0
final_content_with_keyword = []

### Processing SEC 8-K Filings

This block of code fetches and processes SEC 8-K filings for a specified number of companies. It starts by retrieving company data and iterates through each company to extract relevant filing information. For each filing, it checks for specific keywords related to new or updated products. The content is cleaned and stored if any keywords are found. The script also tracks the processing time and stops once the desired number of companies has been processed.


In [4]:
# Start the timer
start_time = time.time()

# Process each of the companies
company_count = 0
for key, company in company_data.items():

    # End processing when reaching the specified number of companies
    if company_count >= number_of_companies_to_process:
        break

    cik = company.get('cik_str')
    company_name = company.get('title')
    company_ticker = company.get('ticker') 

    # Construct the 8-K URL for each CIK
    url_8k = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={cik}&type=8-K&count=10&output=atom"
    response_8k = requests.get(url_8k, headers=headers)

    # Parse the response as XML
    soup = BeautifulSoup(response_8k.text, 'xml')

    # Find all 8-K entries and extract the url
    entries = soup.find_all('entry')
    for entry in entries:
        filing_href = entry.find('content').find('filing-href').text

        # Send a GET request to retrieve the content of the 8-K filing
        response_html = requests.get(filing_href, headers=headers)
        soup_html = BeautifulSoup(response_html.text, 'html.parser')

        # Find all rows in the table and loop through the rows to find the 8-K document
        rows = soup_html.find_all('tr')
        base_url = "https://www.sec.gov"

        for row in rows:
            description_cell = row.find_all('td')

            # Check if the second column is '8-K' (Form Type)
            if description_cell and len(description_cell) > 1 and description_cell[1].text.strip() == '8-K':
                
                # Extract the link from the third column
                link_tag = description_cell[2].find('a', href=True)
                
                if link_tag:
                    partial_url = link_tag['href']

                    # Remove the '/ix?doc=' part of the url if it exists, if not use as is
                    if partial_url.startswith('/ix?doc='):
                        document_url = partial_url.replace('/ix?doc=', '')
                        full_url = base_url + document_url
                    else:
                        full_url = base_url + partial_url

                    # Send a GET request to retrieve the content of the document
                    response = requests.get(full_url, headers=headers)

                    # Parse the content
                    soup = BeautifulSoup(response.text, 'html.parser')

                    # Extract the 'dei:DocumentPeriodEndDate'
                    date_tag = soup.find(attrs={"name": "dei:DocumentPeriodEndDate"})
                    if date_tag:
                        document_period_end_date = date_tag.text.strip()

                    # Find all occurrences of 'Item' in the text
                    item_tags = soup.find_all(string=lambda text: text and text.startswith("Item"))

                    # Process the "Item" sections
                    collecting = False
                    final_content = ''
                    content_between = []
                    
                    for item_tag in item_tags:
                        # Start collecting when we find "Item"
                        if "Item" in item_tag and not collecting:
                            collecting = True
                            content_between.append(item_tag.strip())

                            # Get the next sibling or the tag that contains the item details
                            next_tag = item_tag.find_parent().find_next(
                                'div')

                            while next_tag:
                                # Stop if we reach "Item 9.01"  or "SIGNATURE" and exclude its content
                                if "Item 9.01" in next_tag.get_text() or "SIGNATURE" in next_tag.get_text():
                                    break

                                content_between.append(next_tag.get_text(strip=True))
                                next_tag = next_tag.find_next('div')

                    # Process the extracted content
                    for content in content_between:
                        cleaned_content = re.sub(r'\s+', ' ', content)  # Remove extra spaces and line breaks
                        cleaned_content = re.sub(r'[^A-Za-z0-9.,;: ]+', '', cleaned_content)  # Remove special characters
                        final_content += cleaned_content + " "  # Append cleaned content with space separator

                    # Remove any trailing spaces and change to all lower case
                    final_content = final_content.strip()
                    final_content = final_content.lower()
                    final_content = f"Company_Name: {company_name} Ticker: {company_ticker} FilingTime: {document_period_end_date} SEC 8-k: {final_content}"

                    # Keywords that indicate new/updated products
                    keywords = [
                        "new product", "product launch", "introduced a new", "debut of",
                        "unveiled", "first-of-its-kind", "groundbreaking product", "updated product",
                        "enhanced version", "improved", "new feature",
                        "upgrade to", "next-generation", "redesigned", "rebranded", "expansion of",
                        "available for purchase", "began shipping", "now available", "pre-order",
                        "commercial launch", "market release"
                    ]

                    for keyword in keywords:
                        if keyword in final_content:
                            keywords_found += 1
                            final_content_with_keyword .append(final_content)
                            break

    company_count += 1

    # Calculate elapsed time
    elapsed_time = time.time() - start_time
    elapsed_seconds = int(elapsed_time)
    
    # Print the refreshed time every second
    print(f"\rProcessing URLs - Time elapsed: {elapsed_seconds}s", end='', flush=True)

print("The number of keywords found were ", keywords_found)
print(f"Number of strings with matching keywords: {len(final_content_with_keyword)}\n")

Processing URLs - Time elapsed: 25016sThe number of keywords found were  268
Number of strings with matching keywords: 268



### Processing Content with Ollama

This code processes extracted content using the Ollama API to analyze SEC 8-K filings for new product information. It sends prompts containing filing details to Ollama and retrieves structured information such as company name, ticker, filing time, new product name, and a brief description. The response is parsed, and the relevant data is extracted using regular expressions. The extracted data is then appended to a CSV file for further analysis, with the company name, ticker, filing time, new product, and description being recorded. If no new product information is found, placeholders ("none") are used.


In [5]:
# Process the extracted content using Ollama
def run_ollama(question: str) -> str:
    try:
        result = subprocess.run(
            ["ollama", "run", "llama3.2:1b", question],
            capture_output=True, text=True, encoding='utf-8'
        )
        return result.stdout.strip()
    except Exception as e:
        return f"Error: {e}"
        

for content in final_content_with_keyword:
    # Truncate the content
    if len(content) > 5000:
        content = content[:5000] + "..."

    # Create the prompt for Ollama
    question = f"""
    Review the following SEC 8-k filing and extract the following information about new products.
    Information to Extract:
    Company Name: Name of the company
    Ticker: Ticker of the company
    Filing_Time: The document period end date of the 8-k
    New Product: The name of the new product
    Description: A description of the new product in less than 180 characters.
    Here is the 8-k filing - {content}
    Remember you must return the company_name, ticker, filing_time, New Product (if any), and Description (if any).
    Please respond with all 5 pieces of information.
    If no new product information if found, just respond with the company_name, ticker, filing_time, and for new_product and description indicate none.
    """

    ollama_response = run_ollama(question)

    # Patterns to match in response
    patterns = {
        'company_name': r'Company\s*Name:\s*(.*)',
        'ticker': r'Ticker:\s*(.*)',
        'filing_time': r'Filing\s*Time:\s*(.*)',
        'new_product': r'New\s*Product:\s*(.*)',
        'description': r'Description:\s*(.*)'
    }

    # Method to Extract information from response
    def extract_information(response: str) -> dict:
        extracted = {}
        for key, pattern in patterns.items():
            match = re.search(pattern, response, re.IGNORECASE)
            extracted[key] = match.group(1).strip() if match else 'None'
        return extracted

    # Method to append information to CSV
    def append_to_csv(file_name: str, data: dict):
        fieldnames = ['company_name', 'ticker', 'filing_time', 'new_product', 'description']
        file_exists = os.path.isfile(file_name)
        
        with open(file_name, 'a', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, fieldnames=fieldnames, delimiter="|") 
            
            if not file_exists:
                writer.writeheader()
            
            writer.writerow(data)

    # Extract the information and put in CSV
    extracted_data = extract_information(ollama_response)
    append_to_csv("SEC 8-K Analysis", extracted_data)

print("Completed processing the extracted information using Ollama")
    

Completed processing the extracted information using Ollama


### Filtering and Saving Processed SEC 8-K Data

This code reads a CSV file containing SEC 8-K analysis data and filters out rows where any column contains the value 'none'. It processes the data by checking for 'none' in the `company_name`, `ticker`, `filing_time`, `new_product`, and `description` columns. If any of these fields contains 'none', the row is excluded. The filtered data is then written to a new CSV file called "SEC 8-K Analysis Filtered", retaining the same structure with pipe delimiters.


In [6]:
# Read the CSV file into a list of dictionaries
with open('SEC 8-K Analysis', 'r', encoding='utf-8') as file:
    reader = csv.DictReader(file, delimiter='|')  # Assuming the pipe delimiter
    data = [row for row in reader]

# Filter out rows where any column has 'none'
filtered_data = [
    row for row in data
    if row['company_name'].strip().lower() != 'none' and row['ticker'].strip().lower() != 'none' 
    and row['filing_time'].strip().lower() != 'none'
    and 'none' not in row['new_product'].strip().lower()
    and 'none' not in row['description'].strip().lower()
]

# Define the fieldnames (headers)
fieldnames = ['company_name', 'ticker', 'filing_time', 'new_product', 'description']

# Write the filtered data to a new CSV file
with open('SEC 8-K Analysis Filtered', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames, delimiter='|')
    
    # Write the header
    writer.writeheader()
    
    # Write the filtered rows
    for row in filtered_data:
        writer.writerow(row)
