---
title: "Data Collection"
format:
    html: 
        code-fold: false
---

{{< include instructions.qmd >}} 


{{< include overview.qmd >}} 

{{< include methods.qmd >}} 

# Code 

Provide the source code used for this section of the project here.

If you're using a package for code organization, you can import it at this point. However, make sure that the **actual workflow steps**—including data processing, analysis, and other key tasks—are conducted and clearly demonstrated on this page. The goal is to show the technical flow of your project, highlighting how the code is executed to achieve your results.

Ensure that the code is well-commented to enhance readability and understanding for others who may review or use it. If relevant, link to additional documentation or external references that explain any complex components. This section should give readers a clear view of how the project is implemented from a technical perspective.

This page is a technical narrative, not just a notebook with a collection of code cells. Include in-line prose to describe what is going on.

## Collect Company Financial Data

In this section, we collect financial statement data for each company in our company list. We use each company's CIK code to request its data from the SEC API at data.sec.gov.
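As a minimal sketch of how a request URL is formed: the company-facts endpoint used below takes an identifier of the form `CIK` followed by a ten-digit, zero-padded number, which matches the values logged further down (for example `CIK0000066740`). The helper name `build_companyfacts_url` is hypothetical and not part of the project code. It only illustrates normalizing a raw CIK before building the URL, in case the company list ever holds bare numbers.

```python
def build_companyfacts_url(cik) -> str:
    """Build the company-facts URL for a CIK given as an int or a string.

    Hypothetical helper for illustration; accepts 66740, "66740" or "CIK0000066740".
    """
    text = str(cik).strip().upper()
    # Drop an existing "CIK" prefix so only the numeric part remains
    if text.startswith("CIK"):
        text = text[3:]
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"Not a valid CIK: {cik!r}")
    # Zero-pad the number to ten digits and restore the prefix
    return f"https://data.sec.gov/api/xbrl/companyfacts/CIK{int(text):010d}.json"
```

Values that are already in the `CIK##########` form pass through unchanged, so the helper could be applied to every entry in the list without special cases.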

In [16]:
import requests
import pandas as pd
import os
import time

In [17]:
# Load the list of CIK codes from company_data.csv
cik_codes = "../../data/processed-data/company_data.csv"

# Directory to save the JSON files
output_company_data_dir = "../../data/raw-data/company_data"
# Create the output directory if it doesn't exist
os.makedirs(output_company_data_dir, exist_ok=True)

In [18]:
# Load CSV and extract the "CIK-code" column
try:
    df = pd.read_csv(cik_codes)
    cik_list = df["CIK-code"].dropna().astype(str).tolist()  # Drop NaN values, then convert to string
except Exception as e:
    # Stop here: the remaining steps cannot run without the CIK list
    raise SystemExit(f"Error reading CSV file: {e}")

In [19]:
# Request headers are the same for every call, so build them once.
# The User-Agent identifies the requester to the SEC.
headers = {
    "User-Agent": "kl1160@georgetown.edu",
    "Accept-Encoding": "gzip, deflate"
}

# Loop through each CIK code and fetch data
for cik in cik_list:
    # Construct the API URL dynamically for each CIK code
    url = f"https://data.sec.gov/api/xbrl/companyfacts/{cik}.json"

    try:
        # Make a GET request to fetch data from the API (time out rather than hang)
        response = requests.get(url, headers=headers, timeout=30)

        if response.status_code == 200:
            # Save the raw response body to a JSON file
            output_file = os.path.join(output_company_data_dir, f"{cik}.json")
            with open(output_file, "w") as f:
                f.write(response.text)

            print(f"Data for CIK {cik} saved to {output_file}")
        else:
            # Handle unsuccessful requests with appropriate logging
            print(f"Failed to fetch data for CIK {cik}. HTTP status: {response.status_code}")
    except Exception as e:
        # Log any errors that occur during the data fetching process
        print(f"Error fetching data for CIK {cik}: {e}")
    finally:
        # Pause after every attempt, including failed ones, to avoid overloading the server
        time.sleep(1)

print("Data fetching completed.")

Data for CIK CIK0000066740 saved to ../../data/raw-data/company_data/CIK0000066740.json
Data for CIK CIK0001551152 saved to ../../data/raw-data/company_data/CIK0001551152.json
Data for CIK CIK0001018840 saved to ../../data/raw-data/company_data/CIK0001018840.json
Failed to fetch data for CIK CIK0000008680. HTTP status: 404
Data for CIK CIK0000002488 saved to ../../data/raw-data/company_data/CIK0000002488.json
Data for CIK CIK0000874761 saved to ../../data/raw-data/company_data/CIK0000874761.json
Data for CIK CIK0000002969 saved to ../../data/raw-data/company_data/CIK0000002969.json
Data for CIK CIK0001559720 saved to ../../data/raw-data/company_data/CIK0001559720.json
Data for CIK CIK0001086222 saved to ../../data/raw-data/company_data/CIK0001086222.json
Data for CIK CIK0001646972 saved to ../../data/raw-data/company_data/CIK0001646972.json
Data for CIK CIK0001652044 saved to ../../data/raw-data/company_data/CIK0001652044.json
Data for CIK CIK0001018724 saved to ../../data/raw-data/com
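
The log above shows that not every request succeeds (one CIK returned HTTP 404). As a rough sketch, the companies with no downloaded file can be listed by comparing the CIK list with the files on disk. The function name `find_missing_ciks` is hypothetical, and the check assumes each file is named `<CIK-code>.json`, as in the download loop.

```python
import os

def find_missing_ciks(cik_list, data_dir):
    """Return the CIK codes that have no matching <cik>.json file in data_dir."""
    saved = {
        os.path.splitext(name)[0]
        for name in os.listdir(data_dir)
        if name.endswith(".json")
    }
    # Keep the order of the original company list
    return [cik for cik in cik_list if cik not in saved]

# Example call with the variables defined earlier on this page:
# missing = find_missing_ciks(cik_list, output_company_data_dir)
```

Keeping this list makes it easy to document which companies drop out of the later analysis because no financial data was available.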

Next, we list every unique financial statement indicator found in a company's JSON file, using one company as an example. From that list we manually select the indicators that are most important and useful, based on our accounting knowledge, to serve as dependent variables in the subsequent analysis.

In [20]:
import json

# Read an example JSON file to extract unique financial statement indicators
example_json_path = "../../data/raw-data/company_data/CIK0000066740.json"
with open(example_json_path, "r") as file:
    data = json.load(file)

# Initialize a list to store the extracted indicators (one dictionary per indicator)
financial_indicators = []

# Traverse the JSON structure to find financial metrics
for taxonomy, metrics in data.get("facts", {}).items():
    for metric_key, metric_data in metrics.items():
        indicator = {
            "taxonomy": taxonomy,
            "key": metric_key,
            "label": metric_data.get("label", ""),
            "description": metric_data.get("description", ""),
            "units": list(metric_data.get("units", {}).keys())
        }
        financial_indicators.append(indicator)

# Output the extracted indicators
for indicator in financial_indicators:
    print(f"Taxonomy: {indicator['taxonomy']}")
    print(f"Key: {indicator['key']}")
    print(f"Label: {indicator['label']}")
    print(f"Description: {indicator['description']}")
    print(f"Units: {indicator['units']}")
    print("-" * 80)

# Save to a file
output_fina_idc_json = "../../data/raw-data/financial_indicators.json"
with open(output_fina_idc_json, "w") as f:
    json.dump(financial_indicators, f, indent=4)
print(f"Extracted financial indicators saved to {output_fina_idc_json}.")

Taxonomy: dei
Key: EntityCommonStockSharesOutstanding
Label: Entity Common Stock, Shares Outstanding
Description: Indicate number of shares or other units outstanding of each of registrant's classes of capital or common stock or other ownership interests, if and as stated on cover of related periodic report. Where multiple classes or units exist define each class/interest by adding class of stock items such as Common Class A [Member], Common Class B [Member] or Partnership Interest [Member] onto the Instrument [Domain] of the Entity Listings, Instrument.
Units: ['shares']
--------------------------------------------------------------------------------
Taxonomy: dei
Key: EntityPublicFloat
Label: Entity Public Float
Description: The aggregate market value of the voting and non-voting common equity held by non-affiliates computed by reference to the price at which the common equity was last sold, or the average bid and asked price of such common equity, as of the last business day of the 
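
The cell above reads a single example file, and other companies may report a different set of indicators. A possible extension, sketched here under the assumption that every file follows the same `facts -> taxonomy -> key` layout, counts how many company files contain each indicator. The function name `count_indicator_coverage` is hypothetical.

```python
import json
import os
from collections import Counter

def count_indicator_coverage(data_dir):
    """Count, for each (taxonomy, key) pair, how many company files report it."""
    coverage = Counter()
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(data_dir, name), "r") as f:
            facts = json.load(f).get("facts", {})
        # Each indicator is counted at most once per company file
        for taxonomy, metrics in facts.items():
            for key in metrics:
                coverage[(taxonomy, key)] += 1
    return coverage

# Example call:
# coverage = count_indicator_coverage("../../data/raw-data/company_data")
# coverage.most_common(20) lists the most widely reported indicators
```

Indicators reported by most companies are safer choices as dependent variables, because they leave fewer missing values in the combined table.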

We manually selected 10 important and useful financial statement indicators.
Here are the common names, descriptions, and their potential relationships to gender ratios for these 10 financial statement indicators:

1. Net Income (NetIncomeLoss)
- Description: Represents the company’s total profit or loss after deducting all expenses, taxes, and costs from revenue.
- Relation to Gender Ratio: Gender diversity in leadership may influence financial decision-making, operational efficiency, and profitability, potentially impacting net income.

2. Operating Income (OperatingIncomeLoss)
- Description: Measures the profit from core business operations, excluding taxes and interest expenses.
- Relation to Gender Ratio: Leadership styles associated with gender diversity could affect operational strategies, efficiency, and cost management, influencing operating income.

3. Gross Profit (GrossProfit)
- Description: Revenue minus the cost of goods sold, reflecting the company's ability to produce and sell goods at a profit.
- Relation to Gender Ratio: A diverse leadership team might bring varied approaches to product pricing and production efficiency, impacting gross profit margins.

4. Comprehensive Income (ComprehensiveIncomeNetOfTax)
- Description: The total change in equity from transactions and other events, including net income and other comprehensive income.
- Relation to Gender Ratio: Gender diversity could influence long-term strategies and risk management, potentially affecting comprehensive income outcomes.

5. Earnings Per Share (Basic) (EarningsPerShareBasic)
- Description: The portion of a company's profit allocated to each outstanding share of common stock, calculated on a basic level.
- Relation to Gender Ratio: Policies implemented by diverse leadership could affect investor confidence, profitability, and share value, impacting EPS.

6. Revenue from Contracts with Customers (RevenueFromContractWithCustomerExcludingAssessedTax)
- Description: Revenue generated from the sale of goods or services to customers, excluding assessed taxes.
- Relation to Gender Ratio: Gender-diverse teams may bring innovative approaches to customer relationships and sales strategies, affecting revenue.

7. Entity Public Float (EntityPublicFloat)
- Description: The total market value of the company’s shares held by public investors, excluding insiders.
- Relation to Gender Ratio: Diverse leadership might enhance market perception and investor confidence, influencing the public float.

8. Share-Based Compensation Expense (AllocatedShareBasedCompensationExpense)
- Description: The cost of share-based payments for employee compensation, such as stock options or equity awards.
- Relation to Gender Ratio: A gender-diverse leadership team may adopt different employee retention and incentive strategies, affecting this expense.

9. Cash and Cash Equivalents (CashAndCashEquivalentsAtCarryingValue)
- Description: The amount of liquid cash or cash equivalents readily available for use by the company.
- Relation to Gender Ratio: Leadership diversity could impact cash management practices, risk preferences, and liquidity policies.

10. Accounts Receivable (Net) (AccountsReceivableNetCurrent)
- Description: The net amount of money owed to the company by customers, after deducting allowances for doubtful accounts.
- Relation to Gender Ratio: Gender diversity may influence credit policies and collection efficiency, affecting accounts receivable performance.

These indicators cover profitability, liquidity, operational efficiency, and equity valuation, offering insights into how gender diversity in leadership might affect various aspects of a company's financial performance.

Another issue is that some of these indicators are absolute amounts rather than ratios, so they may be heavily influenced by company size. To address this, we can normalize them by dividing each one by a size-related metric, such as the company's total assets.

Here are some advantages of using total assets for standardization:

1. Reflects Company Size:

Total assets represent the scale of a company, encompassing its resources, investments, and financial standing. Standardizing financial indicators by total assets provides a size-adjusted metric, enabling fairer comparisons across companies of varying sizes.

2. Widely Used in Financial Ratios:

Many common financial ratios use total assets as a denominator (e.g., Return on Assets (ROA) = Net Income / Total Assets). This makes it a well-accepted practice in accounting and finance.

3. Cross-Indicator Applicability:

Almost all the financial indicators we would like to use (e.g., Net Income, Gross Profit, Accounts Receivable) can be effectively standardized using total assets, ensuring consistency in this approach.
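
As an illustration of this standardization, the sketch below divides selected indicators by total assets, working on one dictionary per company. The function name `scale_by_assets` and the `_to_assets` suffix are hypothetical choices, not part of the project code. Which indicators to scale is also left open, since a per-share figure such as `EarningsPerShareBasic` may not need it.

```python
def scale_by_assets(rows, indicators, assets_key="Assets"):
    """Return new row dicts with '<indicator>_to_assets' ratios added.

    A ratio is None when the indicator or total assets is missing,
    or when total assets is zero.
    """
    scaled_rows = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        assets = row.get(assets_key)
        for name in indicators:
            value = row.get(name)
            if value is None or not assets:
                new_row[f"{name}_to_assets"] = None
            else:
                new_row[f"{name}_to_assets"] = value / assets
        scaled_rows.append(new_row)
    return scaled_rows

# Example with made-up numbers: net income of 50 on total assets of 1000
example = scale_by_assets(
    [{"CIK-code": "EXAMPLE", "NetIncomeLoss": 50, "Assets": 1000}],
    ["NetIncomeLoss"],
)
```

For `NetIncomeLoss`, the resulting ratio is the Return on Assets mentioned above.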

In [21]:
# Extract 10 important and valuable financial statement indicators for subsequent analysis
# Extract "Assets" to standardize financial statement indicators

# List of target financial indicators
target_indicators = [
    "NetIncomeLoss",
    "OperatingIncomeLoss",
    "GrossProfit",
    "ComprehensiveIncomeNetOfTax",
    "EarningsPerShareBasic",
    "RevenueFromContractWithCustomerExcludingAssessedTax",
    "EntityPublicFloat",
    "AllocatedShareBasedCompensationExpense",
    "CashAndCashEquivalentsAtCarryingValue",
    "AccountsReceivableNetCurrent",
    "Assets"
]

In [22]:
# Directory containing JSON files
company_data_dir = "../../data/raw-data/company_data"
output_fina_data_csv = "../../data/processed-data/financial_data.csv"

In [23]:
# Initialize a list to collect one row of financial data per company
combined_data = []

# Loop through each JSON file in the directory
for file_name in os.listdir(company_data_dir):
    if file_name.endswith(".json"):
        file_path = os.path.join(company_data_dir, file_name)
        cik_code = file_name.split(".")[0]  # Extract CIK-code from file name

        with open(file_path, "r") as file:
            data = json.load(file)

        # Extract relevant financial indicators
        row = {"CIK-code": cik_code}
        for indicator in target_indicators:
            value = None
            for taxonomy, metrics in data.get("facts", {}).items():
                if indicator in metrics:
                    units_data = metrics[indicator].get("units", {})
                    # Search across all units for the target fiscal year
                    for unit, records in units_data.items():
                        # Filter records for the target fiscal year
                        filtered_records = [
                            record for record in records if record.get("fy") == 2023
                        ]
                        if filtered_records:
                            # Take the most recently filed entry; break ties on the
                            # period end date so the latest reporting period wins
                            latest_record = max(
                                filtered_records,
                                key=lambda x: (x.get("filed", ""), x.get("end", "")),
                            )
                            value = latest_record.get("val")
            row[indicator] = value

        combined_data.append(row)

# Convert the combined data into a DataFrame
fina_df = pd.DataFrame(combined_data)

In [24]:
# Save the DataFrame to a CSV file
fina_df.to_csv(output_fina_data_csv, index=False)
print(f"Combined financial indicators saved to {output_fina_data_csv}")

Combined financial indicators saved to ../../data/processed-data/financial_data.csv
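
One caveat about the selection step: as far as we understand the company-facts format, `fy` describes the filing a value was reported in, so records tagged with fiscal year 2023 can also include comparative figures for earlier periods and quarterly values. The sketch below shows one possible way to narrow the choice to annual-report values. It assumes each record carries the `form`, `fp`, `end`, and `filed` fields, and the function name `pick_annual_value` is hypothetical.

```python
def pick_annual_value(records, fiscal_year):
    """Return the value from the latest annual-report record, or None if there is none."""
    annual = [
        r for r in records
        if r.get("fy") == fiscal_year
        and r.get("fp") == "FY"       # full fiscal year, not a quarter
        and r.get("form") == "10-K"   # annual report
    ]
    if not annual:
        return None
    # Prefer the latest period end date, then the latest filing date
    best = max(annual, key=lambda r: (r.get("end", ""), r.get("filed", "")))
    return best.get("val")
```

This is only a sketch: for income-statement items, several records can share the same end date but cover different durations, so an additional check on the period start may still be needed.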


{{< include closing.qmd >}} 