# DevOps Engineer Interview Project
**Author: Felix Yuzhou Sun**

## 0. Introduction

In the context of financial analysis, accurately extracting Earnings Per Share (EPS) data from regulatory filings is essential for investors, analysts, and financial professionals. This task, often complicated by the diverse formats and structures of financial documents, requires a robust and adaptable parsing solution.

This project involves developing a parser to extract EPS data from SEC EDGAR filings in HTML format, as part of an assessment for a DevOps Engineer Interview at Trexquant Investment LP. The primary objective is to create a versatile parser that can handle various filing formats, systematically extract the latest quarterly EPS for each company, and present this data in a structured CSV format.

The development and testing of this parser are grounded in the analysis of 50 provided HTML files, focusing on identifying common patterns, handling variations, and ensuring the parser's adaptability to unseen document formats. The outcome of this project will be a reliable tool capable of facilitating informed financial decision-making through accurate data extraction.

## 1. Understanding the data

To build a robust parser, we will meticulously analyze the 50 training filings, identifying the common patterns and locations where EPS data is typically reported. This analysis will encompass recognizing keywords, table structures, and contextual cues that signal the presence of relevant information. We will also pay close attention to potential inconsistencies and edge cases, ensuring our parser's accuracy and adaptability across different EDGAR filings.
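
As a first, hedged illustration of the keyword recognition described above, the sketch below tests whether a row label plausibly refers to earnings per share. The pattern list and the helper name `looks_like_eps_label` are assumptions made for illustration; the wording used in the actual filings may differ and would need to be confirmed against the training set.

```python
import re

# Hypothetical label patterns (an assumption, not taken from the filings).
EPS_LABEL = re.compile(
    r"(earnings|income|loss)\s*(\(loss\)\s*)?per\s+(common\s+)?share"
    r"|\bEPS\b",
    re.IGNORECASE,
)


def looks_like_eps_label(text: str) -> bool:
    """Return True when a row label plausibly refers to earnings per share."""
    # Replace non-breaking spaces, then collapse runs of whitespace.
    normalized = re.sub(r"\s+", " ", text.replace("\xa0", " ")).strip()
    return bool(EPS_LABEL.search(normalized))


for label in ["Diluted earnings per share", "Net income (loss) per common share", "Total revenue"]:
    print(label, "->", looks_like_eps_label(label))
```

A regular expression keeps the variations (basic or diluted, income or loss, with or without "common") in one place, so new wordings found during the analysis can be added without touching the rest of the parser.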

In [None]:
# Set Up Environment
%pip install beautifulsoup4 pandas

In [38]:

import os
from bs4 import BeautifulSoup
import pandas as pd
import re
from collections import defaultdict

In [40]:
# Check the html data
current_directory = os.getcwd()
data_directory = 'Training_Filings'
data_path = os.path.join(current_directory, data_directory)
print(f"Current working directory: {current_directory}")
print(data_path)

# Verify the data size
html_files = [f for f in os.listdir(data_path) if f.endswith('.html')]
# Total number of files
total_files = len(html_files)
print(f"Found {total_files} HTML files.")

Current working directory: /home/bbharbinger/Projects/EPS_Paser
/home/bbharbinger/Projects/EPS_Paser/Training_Filings
Found 50 HTML files.


### Check the data structure: Tag Analysis

In [None]:
# Dictionary to hold the number of files in which each tag appears
tag_file_count = defaultdict(int)

# Iterate through all files and count tags
for file in html_files:
    filepath = os.path.join(data_path, file)
    
    # errors='replace' keeps a stray non-UTF-8 byte from aborting the whole scan
    with open(filepath, 'r', encoding='utf-8', errors='replace') as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')
        
        # Extract unique tags from the current file
        unique_tags_in_file = set(tag.name for tag in soup.find_all(True))
        
        # Update the count of each tag, but only once per file
        for tag in unique_tags_in_file:
            tag_file_count[tag] += 1

# Calculate the percentage of files in which each tag appears
tag_percentage = {tag: (count / total_files) * 100 for tag, count in tag_file_count.items()}

# Sort the tags by percentage in descending order
sorted_tags = sorted(tag_percentage.items(), key=lambda item: item[1], reverse=True)

# Print results
print(f"{'Tag':<15}{'Files':<10}{'Percentage':<10}")
for tag, percentage in sorted_tags:
    print(f"{tag:<15}{tag_file_count[tag]:<10}{percentage:.2f}%")

### Tag analysis results

| Tag           | Files | Percentage | Tag           | Files | Percentage |
|---------------|-------|------------|---------------|-------|------------|
| table         | 50    | 100.00%    | sup           | 38    | 76.00%     |
| document      | 50    | 100.00%    | img           | 36    | 72.00%     |
| type          | 50    | 100.00%    | a             | 26    | 52.00%     |
| html          | 50    | 100.00%    | p             | 23    | 46.00%     |
| font          | 50    | 100.00%    | b             | 14    | 28.00%     |
| td            | 50    | 100.00%    | u             | 13    | 26.00%     |
| sequence      | 50    | 100.00%    | i             | 8     | 16.00%     |
| filename      | 50    | 100.00%    | ul            | 4     | 8.00%      |
| text          | 50    | 100.00%    | li            | 4     | 8.00%      |
| tr            | 50    | 100.00%    | center        | 4     | 8.00%      |
| body          | 50    | 100.00%    | h1            | 3     | 6.00%      |
| title         | 49    | 98.00%     | h2            | 3     | 6.00%      |
| description   | 49    | 98.00%     | meta          | 2     | 4.00%      |
| div           | 49    | 98.00%     | em            | 1     | 2.00%      |
| br            | 46    | 92.00%     | strike        | 1     | 2.00%      |
| hr            | 42    | 84.00%     | strong        | 1     | 2.00%      |
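
The `document`, `type`, `sequence`, `filename`, `description`, and `text` rows in the table are not HTML elements. They look like the SGML wrapper that EDGAR places around each submitted document, which `html.parser` reports as ordinary tags. Assuming that is the case here, the wrapper fields can be read with a regular expression before the HTML is parsed. The helper name `read_sgml_header` and the sample text are hypothetical.

```python
import re


def read_sgml_header(raw: str) -> dict:
    """Collect the single-line wrapper fields found in the raw file text."""
    header = {}
    for field in ("TYPE", "SEQUENCE", "FILENAME", "DESCRIPTION"):
        # Each field is assumed to sit on its own line with no closing tag.
        match = re.search(rf"<{field}>([^\r\n<]*)", raw, flags=re.IGNORECASE)
        if match:
            header[field.lower()] = match.group(1).strip()
    return header


# Made-up sample for illustration; not taken from the training filings.
sample = """<DOCUMENT>
<TYPE>EX-99.1
<SEQUENCE>2
<FILENAME>ex99_1.htm
<DESCRIPTION>PRESS RELEASE
<TEXT>
<html><body><table><tr><td>EPS</td><td>1.23</td></tr></table></body></html>
</TEXT>
</DOCUMENT>
"""
print(read_sgml_header(sample))
```

Fields that are missing from a file are simply left out of the result, which matches the table: `description` appears in 49 of the 50 files.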


### Insights from the Tag Analysis


1. **Core Tags (`table`, `td`, `tr`)**:
   - **High Occurrence (100%)**: `<table>`, `<tr>`, and `<td>` appear in every document. Because EPS figures are typically reported in tabular form, the parser should search within `<table>` elements first. Note that 100% occurrence shows only that every filing contains tables, not that the EPS figure always sits inside one, so a fallback for EPS stated in running text may still be needed.

2. **Document Structure (`html`, `body`, `title`)**:
   - **High Occurrence (98% to 100%)**: `<html>` and `<body>` are present in every file and `<title>` in 49 of 50, so the documents share a consistent HTML skeleton. Where it exists, the `<title>` text might give context (e.g., the company name or report type) that could help identify relevant sections of the document, but the parser cannot depend on it because one file lacks it.

3. **Footnote Tags (`sup`)**:
   - **Moderate Occurrence (76%)**: `<sup>` (superscript, often used for footnote markers) appears in 38 of 50 files and might sit right beside financial figures, including EPS. `<sub>` is not listed in the table above. These markers matter for two reasons: a footnote can alter the interpretation of the EPS data (e.g., adjustments for extraordinary items), and a marker left in the cell text can corrupt the parsed number.
   
4. **Formatting Tags (`font`, `br`, `hr`, `b`)**:
   - **High to Low Occurrence (100%, 92%, 84%, 28%)**: `<font>`, `<br>`, and `<hr>` appear in most files, which shows that the documents rely on presentational markup to structure the data. `<b>` appears in only 28% of files. Bold text often emphasizes key financial metrics, including EPS, but because most files do not use `<b>` it can serve only as a supporting hint.

5. **Section Identifiers (`h1`, `h2`)**:
   - **Low Occurrence (6%)**: `<h1>` and `<h2>` each appear in only 3 of 50 files, and `<h3>` is not listed in the table above. When header tags do appear, they likely indicate section titles that could guide the parser to the relevant parts of the document (e.g., "Financial Statements" or "Earnings Per Share"), but they are too rare to be a primary signal.

6. **Special Cases (`meta`, `strike`, `strong`)**:
   - **Low Occurrence (4% and below)**: Tags such as `<meta>`, `<strike>`, and `<strong>` are rare, suggesting that specific documents might have unique formatting or metadata that could affect how EPS is presented. These tags might be used for more complex document structures or for signaling deprecated or emphasized content.


### Conclusion

1. **Focus on Tables**: Since tables are the most consistent element across all documents, we should prioritize searching within `<table>` tags, specifically looking for rows (`<tr>`) and cells (`<td>`) that likely contain EPS data.
  
2. **Use Document Structure**: Rely on the shared skeleton (`<html>`, `<body>`) to locate the document body, and read `<title>`, where present (49 of 50 files), for context such as the company name or report type.

3. **Handle Footnotes and References**: Pay attention to superscripts and other formatting tags that might modify or provide additional context to EPS data, ensuring that all extracted figures are interpreted correctly.

4. **Edge Cases**: While most documents are structured similarly, there may be less common tags and special cases. We need to make the parser flexible to ensure robust performance across all files.
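
To make the footnote point concrete, here is a minimal sketch of turning one table cell into a number. It assumes that footnote markers live in `<sup>` tags and that negative values are written either with a leading minus sign or in accounting-style parentheses; both are assumptions to verify against the training filings. The helper name `cell_to_float` is hypothetical.

```python
import re
from bs4 import BeautifulSoup


def cell_to_float(cell_html: str):
    """Parse one table cell into a float, or return None if it holds no number."""
    cell = BeautifulSoup(cell_html, "html.parser")
    # Remove footnote markers so "0.45<sup>1</sup>" is not read as 0.451.
    for sup in cell.find_all("sup"):
        sup.decompose()
    text = cell.get_text(strip=True)
    match = re.search(r"\d[\d,]*(?:\.\d+)?|\.\d+", text)
    if match is None:
        return None
    value = float(match.group().replace(",", ""))
    # Treat "(0.12)", "$(0.12)", and "-0.12" as negative values.
    prefix = text[:match.start()]
    if re.search(r"[(\-]\s*\$?\s*$", prefix):
        value = -value
    return value


print(cell_to_float("<td>0.45<sup>1</sup></td>"))
print(cell_to_float("<td>$(0.12)</td>"))
```

Stripping `<sup>` before reading the text is the important step: without it the footnote digit is concatenated onto the EPS value.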