# 10-K Filings Analysis Workflow

This notebook is dedicated to the analysis of 10-K filings obtained from the [SEC EDGAR database](https://www.sec.gov/edgar.shtml). To enhance efficiency, we're leveraging pre-cleaned 10-K documents provided by the [Notre Dame Software Repository for Accounting and Finance (SRAF)](https://sraf.nd.edu/sec-edgar-data/cleaned-10x-files/). These pre-processed documents reduce the initial data cleaning burden and facilitate a more focused analysis.

## Step 1: Acquisition and Initial Filtering of 10-K Documents

The initial phase involves downloading zip files from the SRAF, which contain a mix of 10-Q and 10-K filings for publicly traded companies in the U.S. Our goal in this step is to sift through these files to retain only the 10-K filings for subsequent analysis. The steps include:

1. Decompressing all zip files to access the contained documents.
2. Scanning each `.txt` document to identify 10-K filings, utilizing their uniform naming convention for identification.
3. Segregating and relocating any files that are not 10-Ks to a distinct directory, ensuring our primary working directory contains solely 10-K documents for each entity and corresponding fiscal year.

Following this, a random selection of 10-K filings is drawn to form the sample for detailed analysis.
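The steps above can be sketched roughly as follows. This is a minimal sketch, not the code that was actually used to build the sample: the function names, the directory arguments, and the fixed random seed are all hypothetical. It also assumes that the form type is the second underscore-separated field of the filename (as in `20230331_10-K_edgar_data_89140_0001410578-23-000504.txt`), and it deliberately treats only an exact `10-K` as a match, so variants such as amended filings are moved out as well.

```python
import random
import shutil
import zipfile
from pathlib import Path


def extract_archives(zip_dir: Path, work_dir: Path) -> None:
    """Decompress every .zip archive found in zip_dir into work_dir."""
    work_dir.mkdir(parents=True, exist_ok=True)
    for archive in sorted(zip_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(work_dir)


def is_10k(filename: str) -> bool:
    """Assumption: the form type is the second underscore-separated field."""
    parts = filename.split("_")
    return len(parts) > 1 and parts[1] == "10-K"


def separate_non_10k(work_dir: Path, other_dir: Path) -> int:
    """Move every .txt file that is not a 10-K into other_dir; return the number moved."""
    other_dir.mkdir(parents=True, exist_ok=True)
    moved = 0
    # Materialize the listing first, because files are moved while iterating
    for path in list(work_dir.rglob("*.txt")):
        if other_dir in path.parents:
            continue  # already sorted out
        if not is_10k(path.name):
            shutil.move(str(path), str(other_dir / path.name))
            moved += 1
    return moved


def draw_sample(work_dir: Path, other_dir: Path, sample_size: int, seed: int = 42) -> list:
    """Draw a reproducible random sample of the remaining 10-K files."""
    candidates = sorted(p for p in work_dir.rglob("*.txt") if other_dir not in p.parents)
    rng = random.Random(seed)
    return rng.sample(candidates, min(sample_size, len(candidates)))
```

Fixing the seed makes the sample reproducible between runs; drop the `seed` argument if a fresh sample is wanted each time.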

## Step 2: Extraction of Specific Content from 10-K Filings

Our analysis is particularly focused on the "Risk Factors" segment, identified as Item 1A in 10-K filings. This section offers insights into the potential risks and challenges companies may face. To isolate this information, the following steps will be undertaken:

1. Thorough examination of each 10-K document to locate the "Risk Factors" section.
2. Extraction of the "Risk Factors" content from each document and saving it separately for in-depth analysis. This may involve saving the information in a new file or a database, depending on the requirements of the subsequent analysis.
3. Optionally, the original 10-K documents, post-extraction of the relevant sections, can be moved to a different directory for archiving purposes.

This focused approach on Item 1A aims to elucidate the risk landscapes of various companies, offering valuable insights into their operational and strategic vulnerabilities.
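One possible way to locate the section is sketched below. The heading patterns and the function name `extract_item_1a` are assumptions made for illustration; real filings vary in spacing and punctuation, so the patterns will likely need tuning. Because "Item 1A" usually also appears in the table of contents, the sketch keeps the longest candidate span rather than the first one, which is a heuristic and not a guarantee.

```python
import re
from typing import Optional

# Assumed heading shapes: a line that starts with "Item 1A" (optionally followed by
# "Risk Factors"), and a later line that starts with "Item 1B" or "Item 2".
ITEM_1A_HEADING = re.compile(
    r"^[ \t]*ITEM\s+1A\b[.:\-\s]*(RISK\s+FACTORS)?", re.IGNORECASE | re.MULTILINE
)
NEXT_HEADING = re.compile(r"^[ \t]*ITEM\s+(1B|2)\b", re.IGNORECASE | re.MULTILINE)


def extract_item_1a(text: str) -> Optional[str]:
    """Return the longest span from an Item 1A heading to the next Item 1B / Item 2 heading."""
    best = None
    for start in ITEM_1A_HEADING.finditer(text):
        end = NEXT_HEADING.search(text, start.end())
        stop = end.start() if end else len(text)
        candidate = text[start.start():stop].strip()
        # The table-of-contents entry is short; the real section is normally much longer
        if best is None or len(candidate) > len(best):
            best = candidate
    return best
```

The function returns `None` when no heading is found, which makes it easy to log such filings separately instead of silently writing empty sections.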



Next step, as mentioned before: check whether Item 1A is present in each file.


In [1]:
import os
import shutil
import re  # Import the regular expressions library

def contains_keyword(file_path, pattern):
    """Check if the file contains the given pattern using regex."""
    try:
        # errors='ignore' skips undecodable bytes, so a single bad character
        # does not cause the file to be treated as lacking the pattern
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
            file_content = file.read()
        # Use re.search to look for the pattern in the file content
        return re.search(pattern, file_content, re.IGNORECASE) is not None
    except Exception as e:
        print(f"Error reading file {file_path}: {e}")
    return False

def move_files_without_keyword(source_dir, dest_dir, pattern):
    """Move files that do not contain the regex pattern to the destination directory."""
    # Ensure the destination directory exists
    os.makedirs(dest_dir, exist_ok=True)

    # Counters for reporting
    total_files = 0
    moved_files = 0

    # Iterate over all .txt files in the source directory
    for filename in os.listdir(source_dir):
        if filename.endswith('.txt'):
            total_files += 1
            file_path = os.path.join(source_dir, filename)

            # If the file does not contain the keyword pattern, move it to the destination directory
            if not contains_keyword(file_path, pattern):
                shutil.move(file_path, os.path.join(dest_dir, filename))
                moved_files += 1

    print(f"Total .txt files processed: {total_files}")
    print(f"Files moved to '{dest_dir}': {moved_files}")

# Get the current working directory
project_root_dir = os.getcwd()

# Define the source and destination directories relative to the current working directory
source_dir = os.path.join(project_root_dir, 'SAMPLE_10Ks')
dest_dir = os.path.join(source_dir, '10K without item 1A')

# Define the regex pattern for "Item 1A" accounting for common variations
pattern = r'Item\s+1[Aa]'

# Move files that do not contain the pattern
move_files_without_keyword(source_dir, dest_dir, pattern)


Total .txt files processed: 500
Files moved to '/Users/christiannikolov/Downloads/New_Version/FS-Finance-Management/SAMPLE_10Ks/10K without item 1A': 173


Next, rename the files.

To rename each `.txt` file in the `SAMPLE_10Ks` directory to the pattern `year_name_cik`, the year is taken from the filename, and the company name and CIK number are taken from the file's content:

1. Iterate through each `.txt` file in the specified directory.
2. Extract the year from the first 4 characters of the filename.
3. Read the file's content to find the company name and CIK number using regex.
4. Construct the new filename using the `year_name_cik` pattern.
5. Rename the file to the new filename.

The following Python script implements these steps:

In [2]:
import os
import re

def extract_info_from_content(file_content):
    """Extract the company name and CIK number from the file's content."""
    # Regex patterns for company name and CIK
    company_pattern = r'COMPANY CONFORMED NAME:\s+(.*)\s'
    cik_pattern = r'CENTRAL INDEX KEY:\s+(\d+)\s'

    # Find company name
    company_match = re.search(company_pattern, file_content)
    company_name = company_match.group(1) if company_match else None

    # Normalize company name for filename (remove disallowed characters and shorten)
    if company_name:
        company_name = re.sub(r'[^\w\s]', '', company_name)  # Remove non-alphanumeric characters
        company_name = re.sub(r'\s+', '_', company_name)  # Replace spaces with underscores
        company_name = company_name[:50]  # Limit length for simplicity

    # Find CIK
    cik_match = re.search(cik_pattern, file_content)
    cik = cik_match.group(1) if cik_match else None

    return company_name, cik

def rename_files_in_directory(directory):
    """Rename files in the specified directory based on the year, company name, and CIK."""
    failed_files = []  # List to store files that failed to rename

    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            # Extract year from the filename
            year = filename[:4]

            # Construct the full path to the file
            file_path = os.path.join(directory, filename)

            try:
                # Read the file's content; the file is closed before renaming,
                # because renaming an open file fails on Windows
                with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
                    content = file.read()

                # Extract company name and CIK from the content
                company_name, cik = extract_info_from_content(content)

                if company_name and cik:
                    # Construct the new filename
                    new_filename = f"{year}_{company_name}_{cik}.txt"
                    new_file_path = os.path.join(directory, new_filename)

                    if new_filename != filename and os.path.exists(new_file_path):
                        # Do not overwrite another filing of the same company and year
                        failed_files.append(filename)
                        print(f"Failed to rename '{filename}': '{new_filename}' already exists")
                    else:
                        # Rename the file
                        os.rename(file_path, new_file_path)
                        print(f"Renamed '{filename}' to '{new_filename}'")
                else:
                    # Log the failure and add the file to the failed_files list
                    failed_files.append(filename)
                    print(f"Failed to rename '{filename}': Missing company name or CIK")
            except Exception as e:
                failed_files.append(filename)
                print(f"Error processing '{filename}': {e}")

    # Return the list of files that failed to rename for further investigation
    return failed_files

# Specify the directory containing the 10K files
project_root_dir = os.getcwd()
directory = os.path.join(project_root_dir, 'SAMPLE_10Ks')

# Rename the files in the directory and get the list of files that failed to rename
failed_files = rename_files_in_directory(directory)

if failed_files:
    print(f"\nFiles that could not be renamed: {len(failed_files)}")
    for file in failed_files:
        print(file)
else:
    print("\nAll files were successfully renamed.")


Renamed '20230331_10-K_edgar_data_89140_0001410578-23-000504.txt' to '2023_SERVOTRONICS_INC_DE_0000089140.txt'
Renamed '20170223_10-K_edgar_data_909108_0000909108-17-000023.txt' to '2017_DIAMOND_HILL_INVESTMENT_GROUP_INC_0000909108.txt'
Renamed '20070515_10-K_edgar_data_918580_0001104659-07-040164.txt' to '2007_Gaming_Partners_International_CORP_0000918580.txt'
Renamed '20200228_10-K_edgar_data_1129155_0001104659-20-027034.txt' to '2020_MARINE_PRODUCTS_CORP_0001129155.txt'
Renamed '20210304_10-K_edgar_data_1001385_0001437749-21-004994.txt' to '2021_NORTHWEST_PIPE_CO_0001001385.txt'
Renamed '20100219_10-K_edgar_data_34782_0000034782-10-000007.txt' to '2010_1ST_SOURCE_CORP_0000034782.txt'
Renamed '20100614_10-K_edgar_data_1271057_0001144204-10-033291.txt' to '2010_CHINABIOTICS_INC_0001271057.txt'
Renamed '20090302_10-K_edgar_data_104894_0001193125-09-042475.txt' to '2009_WASHINGTON_REAL_ESTATE_INVESTMENT_TRUST_0000104894.txt'
Renamed '20170224_10-K_edgar_data_1022344_0001558370-17-000934

Test whether all the files have been renamed successfully.


In [3]:
failed_files

[]

Placeholder: handling for the files listed above, in case the list is not empty.

Cut out the Item 1A section of each filing and collect the sections in a new `.txt` file; a new output file is started for every 5000 files processed.

Note: this combined output does not make sense and needs to be changed, because later we need to analyze Item 1A individually per company and per year.


In [None]:
from pathlib import Path
import re
import pandas as pd

# Initialize the source and output directories
project_root_dir = Path.cwd()
source_dir = project_root_dir / 'SAMPLE_10Ks'
output_dir = project_root_dir / 'SAMPLE_10Ks/Item_1A_Estimations'
output_dir.mkdir(parents=True, exist_ok=True)

# Define the target strings for "Item 1A" and "Item 1B"
targets_1a = ["ITEM 1A. RISK FACTORS", "ITEM 1A RISK FACTORS", "ITEM 1A.", "1A. RISK FACTORS", "1A RISK FACTORS"]
targets_1b = ["ITEM 1B. UNRESOLVED STAFF COMMENTS", "ITEM 1B UNRESOLVED STAFF COMMENTS", "ITEM 1B.", "1B. UNRESOLVED STAFF COMMENTS", "1B UNRESOLVED STAFF COMMENTS", "1B.", "ITEM 2"]

# DataFrame to log files that could not be processed or do not contain "Item 1A"
issues_df = pd.DataFrame(columns=['Filename', 'Reason'])

def find_section_end(content, targets):
    """Find the end of the section by locating the nearest subsequent target."""
    positions = [content.find(target) for target in targets if content.find(target) != -1]
    return min(positions) if positions else len(content)

def process_files():
    file_counter = 0  # Counter for processed files
    output_file_index = 0  # Index for output files

    for file_path in source_dir.glob("*.txt"):
        try:
            content = file_path.read_text(encoding='utf-8', errors='ignore').upper()
            start_positions = [content.find(target) for target in targets_1a if content.find(target) != -1]
            start_pos = min(start_positions) if start_positions else -1

            if start_pos != -1:
                end_pos = find_section_end(content[start_pos:], targets_1b)
                section = content[start_pos:start_pos + end_pos]

                # Start a new output file for every 5000 files processed; deriving the index
                # from the counter keeps it correct even when a batch starts with a file
                # that has no Item 1A section
                output_file_index = file_counter // 5000 + 1
                with (output_dir / f"combined_item_1a_{output_file_index}.txt").open('a', encoding='utf-8') as output_file:
                    output_file.write(f"--- {file_path.name} ---\n{section}\n\n")

            else:
                issues_df.loc[len(issues_df)] = [file_path.name, "No 'Item 1A' section found"]

            file_counter += 1

        except Exception as e:
            issues_df.loc[len(issues_df)] = [file_path.name, f"Error processing file: {e}"]

    print(f"Processed {file_counter} files. Combined 'Item 1A' sections into {output_file_index} file(s).")

process_files()

# Display or save the DataFrame of issues
if not issues_df.empty:
    print("Files with issues:")
    print(issues_df)


Now the keyword analysis can begin.


The code below has not been adjusted yet. The logic stays the same, but first the Item 1A extraction needs to be individualized.
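One way the extraction could be individualized is to write one output file per filing and keep the `year_name_cik` filename, so that company and year stay recoverable later. The sketch below is only an illustration under that assumption: `save_sections_individually` and `simple_extract` are hypothetical names, and `simple_extract` is a crude first-match stand-in that can be swapped for any function that takes the filing text and returns the section or `None`.

```python
import re
from pathlib import Path
from typing import Callable, Optional

ITEM_1A = re.compile(r"ITEM\s+1A", re.IGNORECASE)
ITEM_1B = re.compile(r"ITEM\s+1B", re.IGNORECASE)


def simple_extract(text: str) -> Optional[str]:
    """Stand-in extractor: text from the first 'Item 1A' up to the next 'Item 1B'."""
    start = ITEM_1A.search(text)
    if start is None:
        return None
    end = ITEM_1B.search(text, start.end())
    stop = end.start() if end else len(text)
    return text[start.start():stop].strip()


def save_sections_individually(
    source_dir: Path,
    output_dir: Path,
    extract: Callable[[str], Optional[str]] = simple_extract,
) -> list:
    """Write one Item 1A file per filing (same filename); return filings without a section."""
    output_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for file_path in sorted(source_dir.glob("*.txt")):
        text = file_path.read_text(encoding="utf-8", errors="ignore")
        section = extract(text)
        if section:
            # Keeping the original name preserves the year_name_cik information
            (output_dir / file_path.name).write_text(section, encoding="utf-8")
        else:
            missing.append(file_path.name)
    return missing
```

Unlike the combined output, this keeps the original casing of the text and yields exactly one row per company and year in the later keyword table.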






In [None]:
import pandas as pd
from collections import Counter
import os
import random
import re
from pathlib import Path  # used below for the directory paths
from concurrent.futures import ThreadPoolExecutor, as_completed

keywords = [
    "access control", "cybersecurity posture", "information", "legal liability", "data exfiltration",
            "security awareness training", "authorization", "phishing", "APT", "secure sockets layer",
            "threat intelligence", "zero trust architecture", "smishing", "whaling", "supply chain attack",
            "cryptojacking", "reputation", "identity management", "trojan", "security architecture", "firewall",
            "financial fraud", "botnet attack", "result", "patch management", "IoT security", "ransomware",
            "technology", "privacy breach", "SOC", "secure coding", "security information management", "NIST",
            "network access control", "operation", "breach", "security audit", "hack", "public key infrastructure",
            "DDoS", "malvertising", "CSRF", "endpoint security", "CIS Controls", "data privacy", "SIM",
            "social engineering", "cyber warfare", "computer", "disruption", "spear phishing", "application security",
            "cybersecurity strategy", "zero-day", "identity theft", "hardware security", "insider threat",
            "blockchain security", "PKI", "damage", "ransomware-as-a-service", "threat hunting",
            "intellectual property theft", "financial", "antivirus", "exploit", "MFA", "CCPA",
            "cybersecurity insurance", "threat landscape", "service", "phishing attack", "cybersecurity regulation",
            "RCE", "brand damage", "NAC", "cross-site request forgery", "cyber resilience", "system",
            "risk management", "biometric security", "cybersecurity audit", "cyber hygiene", "SSL", "trust erosion",
            "cyber law", "data", "business", "failure", "network security", "regulatory fines", "FISMA",
            "vulnerability", "security operations center", "IoT attack", "spyware", "cyber espionage",
            "quantum cryptography", "PCI DSS", "encryption", "include", "shadow IT", "cloud security", "malware",
            "penetration testing", "cybersecurity framework", "GDPR", "cyber threat intelligence", "ISO 27001",
            "incident response", "HIPAA", "mobile security", "security by design", "unauthorized", "loss",
            "customer", "transport layer security", "security", "espionage", "secure shell", "RaaS", "digital forensics",
            "security policy", "risk assessment", "remote code execution", "compliance violation", "cybersecurity policy",
            "vishing", "SSH", "authentication", "TLS", "VPN", "fileless malware", "intrusion"
]

project_root_dir = Path.cwd()
source_dir = project_root_dir / 'SAMPLE_10Ks/Item_1A_Estimations'
output_dir = project_root_dir / 'SAMPLE_10Ks/Keyword_Analysis'
output_dir.mkdir(parents=True, exist_ok=True)  # Ensure the output directory exists

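def count_keywords(text, keywords):
    """Count case-insensitive, whole-phrase occurrences of each keyword in the text."""
    # Splitting on whitespace can never match multi-word keywords such as
    # "access control", so each keyword is searched as a phrase instead
    lowered = text.lower()
    counts = Counter()
    for keyword in keywords:
        # Lookarounds act as word boundaries, so "hack" does not match inside "hackers"
        pattern = r'(?<!\w)' + re.escape(keyword.lower()) + r'(?!\w)'
        hits = len(re.findall(pattern, lowered))
        if hits:
            counts[keyword] = hits
    return counts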

def process_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()

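    # The extraction step upper-cases the text, so the check must be case-insensitive
    if "ITEM 1A" in text.upper():
        filename = os.path.basename(file_path)
        # Filenames are expected to follow the year_name_cik pattern from the renaming step
        parts = os.path.splitext(filename)[0].split('_')
        if len(parts) >= 3:
            year = parts[0]
            company = '_'.join(parts[1:-1])
        else:
            company = 'Unknown'
            year = 'Unknown'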

        keyword_counts = count_keywords(text, keywords)
        row_data = {'Company': company, 'Year': year}
        row_data.update(keyword_counts)
        return row_data
    else:
        return None

def process_files(base_directory, max_files):
    rows = []
    all_files = []

    for root, dirs, files in os.walk(base_directory):
        for filename in files:
            if filename.endswith('.txt'):
                all_files.append(os.path.join(root, filename))

    random.shuffle(all_files)
    all_files = all_files[:max_files]

    with ThreadPoolExecutor() as executor:
        futures = [executor.submit(process_file, file) for file in all_files]
        for future in as_completed(futures):
            result = future.result()
            if result:
                rows.append(result)
                print(f"File processed: {result['Company']} {result['Year']}")

    if not rows:
        print("No relevant files found.")
    else:
        print(f"Total files processed: {len(rows)}")

    # Keywords that never occur in a document are missing from its row; count them as 0
    return pd.DataFrame(rows, columns=['Company', 'Year'] + keywords).fillna(0)

# Paths and parameters as specified
base_directory = source_dir
max_files = 1000

# Process the files and get the results DataFrame
results_df = process_files(base_directory, max_files)

# Saving the DataFrame to a CSV file
resulted_csv_path = output_dir / 'Keyword_Analysis_Results.csv'
results_df.to_csv(resulted_csv_path, index=False)

print(f"Analysis results saved to {resulted_csv_path}.")





Use the Excel file that I shared in the WhatsApp group to get the logic, then create the normalization and the logistic regression.


In [None]:
import pandas as pd
import numpy as np
from scipy.stats import zscore
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# `results_df` and `keywords` are expected to exist already from the keyword analysis cell above.
# If they do not, load the saved CSV into `results_df` and define the `keywords` list first.

# Calculate total keyword frequencies across all documents
keyword_frequencies = results_df[keywords].sum()

# Calculate the total of all keyword frequencies to determine weights
total_keyword_frequency = keyword_frequencies.sum()

# Determine weights for each keyword
keyword_weights = keyword_frequencies / total_keyword_frequency

# Calculate weighted counts for each document
for keyword in keywords:
    weighted_column = f'{keyword}_Weighted'
    results_df[weighted_column] = results_df[keyword] * keyword_weights[keyword]

# Sum the weighted counts for each document to get a total weighted count
results_df['Total_Weighted_Count'] = results_df[[f'{kw}_Weighted' for kw in keywords]].sum(axis=1)

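# Normalize the total weighted counts to [0, 1] (min-max scaling) to get the cybersecurity score
score_min = results_df['Total_Weighted_Count'].min()
score_range = results_df['Total_Weighted_Count'].max() - score_min
# Guard against a zero range (all documents identical), which would divide by zero
results_df['Cybersecurity_Score'] = (results_df['Total_Weighted_Count'] - score_min) / score_range if score_range else 0.0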

# Calculate Z-Scores for Cybersecurity Scores
results_df['Cybersecurity_Score_Z'] = zscore(results_df['Cybersecurity_Score'])

# Calculate the mean of the total weighted count
mean_total_weighted_count = results_df['Total_Weighted_Count'].mean()

# Apply the logistic function for each keyword's weighted frequency
for keyword in keywords:
    weighted_column = f'{keyword}_Weighted'
    # Calculate the logistic function value for each document
    results_df[f'{keyword}_Logistic'] = 1 / (1 + np.exp(-(results_df[weighted_column] - mean_total_weighted_count)))

# Visualization of Cybersecurity Scores
num_bins = max(10, int(len(results_df['Cybersecurity_Score'].unique()) / 10))
plt.figure(figsize=(12, 6))
sns.histplot(results_df['Cybersecurity_Score'], bins=num_bins, kde=True, color='skyblue')
plt.title('Distribution of Normalized Cybersecurity Scores', fontsize=16)
plt.xlabel('Normalized Cybersecurity Score', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
sns.despine()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

# Define the output directory for saving the analysis results
project_root_dir = Path.cwd()  # Gets the current working directory
output_dir = project_root_dir / 'Analysis_Results'  # Adjust this path as needed
output_dir.mkdir(parents=True, exist_ok=True)  # Creates the directory if it doesn't exist

# Save the enhanced DataFrame with cybersecurity scores and logistic calculations to a CSV file
enhanced_with_logistic_path = output_dir / "Enhanced_Keywords_with_Logistic_Scores.csv"
results_df.to_csv(enhanced_with_logistic_path, index=False)
print(f"Enhanced analysis results with logistic scores saved to {enhanced_with_logistic_path}.")
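
To make the arithmetic of the scoring step easier to check by hand, here is a small worked sketch in plain Python. The three filings and their counts are made-up toy numbers, and the helper names are hypothetical; the formulas are meant to mirror the steps in the cell above (frequency-based weights, weighted totals, min-max normalization, z-scores with population standard deviation, and a logistic function), not to replace it.

```python
import math
from statistics import mean, pstdev

# Toy keyword counts for three hypothetical filings (made-up numbers)
keywords = ["breach", "malware"]
counts = {
    "filing_a": {"breach": 4, "malware": 0},
    "filing_b": {"breach": 1, "malware": 1},
    "filing_c": {"breach": 0, "malware": 0},
}


def keyword_weights(counts, keywords):
    """Weight of a keyword = its total frequency / total frequency of all keywords."""
    totals = {kw: sum(doc[kw] for doc in counts.values()) for kw in keywords}
    grand_total = sum(totals.values())
    return {kw: (totals[kw] / grand_total if grand_total else 0.0) for kw in keywords}


def weighted_totals(counts, weights):
    """Sum of count * weight over all keywords, per filing."""
    return {name: sum(doc[kw] * w for kw, w in weights.items()) for name, doc in counts.items()}


def min_max(values):
    """Scale values to [0, 1]; a zero range maps everything to 0.0."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {name: ((v - lo) / span if span else 0.0) for name, v in values.items()}


def z_scores(values):
    """Standardize with the population standard deviation (ddof=0)."""
    mu, sigma = mean(values.values()), pstdev(values.values())
    return {name: ((v - mu) / sigma if sigma else 0.0) for name, v in values.items()}


def logistic(x, midpoint):
    """Logistic function centered on midpoint."""
    return 1 / (1 + math.exp(-(x - midpoint)))


weights = keyword_weights(counts, keywords)   # breach: 5/6, malware: 1/6
totals = weighted_totals(counts, weights)
scores = min_max(totals)
standardized = z_scores(scores)
```

In this toy table, "breach" accounts for 5 of the 6 keyword hits, so it dominates the weighted totals; this shows how frequent generic keywords can drive the score under frequency-based weighting.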
