# Obtain 10-K file size

### Summary of Workflow:
1. Import Libraries: Import the necessary libraries (requests, os, time, csv, pandas) to handle HTTP requests, file operations, and data processing.
2. Define SEC Base URL: Set the base URL for the SEC's EDGAR archive.
3. Read 10-K URLs from CSV: Define a function to read the input CSV file, construct the URLs for the 10-K filings, and gather the related filing information.
4. Get File Sizes: Define a function that sends HEAD requests to read file sizes for the 10-K filings without downloading them.
5. Specify Input Path: Set the path for the input CSV file.
6. Extract URLs: Call the function to read and extract URLs from the input CSV.
7. Specify Output Path: Set the path for the output CSV file.
8. Save Results: Call the function to get file sizes and save the results to the output CSV file.

# Step 1: Import Required Libraries

In [18]:
import requests
import os
import time
import csv
import pandas as pd

- We import the libraries needed for HTTP requests, file operations, timing, and data processing.
- The requests library is used to interact with the SEC's website.
- The pandas library is used to read and manipulate CSV files.

# Step 2: Define the SEC Base URL


In [19]:
# Define the base URL for SEC archives
base_url = "https://www.sec.gov/Archives/"

- We define the base URL of the SEC EDGAR archive, where the 10-K forms are located.
- This base URL will be combined with the file paths to construct complete URLs for each 10-K form.
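
As a small illustration of how the base URL and a relative path combine, here is a minimal sketch. The helper name `build_filing_url` is hypothetical and not part of the notebook; it uses `urllib.parse.urljoin` instead of the f-string used in Step 3, and it assumes `File_Name` holds a path relative to the archive root, such as `edgar/data/...`.

```python
from urllib.parse import urljoin

base_url = "https://www.sec.gov/Archives/"

def build_filing_url(file_name, base=base_url):
    """Join the archive base URL with a relative File_Name path."""
    # Strip whitespace and any leading slash: a leading "/" would make
    # urljoin replace the "/Archives/" part of the base path.
    relative_path = str(file_name).strip().lstrip("/")
    return urljoin(base, relative_path)

url = build_filing_url("edgar/data/320193/0000320193-23-000106.txt")
print(url)
```

The trailing slash on `base_url` matters here: without it, `urljoin` would treat `Archives` as a file name and drop it from the result.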

# Step 3: Function to Read CSV and Construct URLs


In [20]:
# Function to read the CSV file and construct the URLs
def read_10k_urls_from_csv(csv_file):
    data = []
    df = pd.read_csv(csv_file)  # Read the CSV file into a DataFrame
    
    # Iterate over each row in the DataFrame and construct the full URL
    for _, row in df.iterrows():
        url = f"{base_url}{row['File_Name']}"  # Construct the full URL
        data.append([row['Form_Type'], row['Company_Name'], row['CIK'], row['Date_Filed'], row['File_Name'], url])
    
    return data

- This function reads the input CSV containing 10-K filing information. It uses pandas to load the CSV into a DataFrame.
- For each row in the DataFrame, we construct a full URL by combining the base URL with the File_Name from the CSV.
- We store the full URL along with other relevant data (Form_Type, Company_Name, etc.) in a list for further processing.
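
To make the row-to-URL mapping concrete, the sketch below runs the same logic on a tiny in-memory CSV. It is an illustration under assumptions, not the notebook's method: it uses the standard-library `csv.DictReader` instead of pandas so it runs without extra dependencies, the function name `rows_with_urls` is hypothetical, and the sample row is placeholder data rather than a real filing.

```python
import csv
import io

BASE_URL = "https://www.sec.gov/Archives/"

def rows_with_urls(csv_stream):
    """Yield one list per CSV row, with the full URL appended as the last item."""
    reader = csv.DictReader(csv_stream)
    for row in reader:
        url = f"{BASE_URL}{row['File_Name']}"
        yield [row["Form_Type"], row["Company_Name"], row["CIK"],
               row["Date_Filed"], row["File_Name"], url]

# Placeholder data for illustration only; not a real filing.
sample = io.StringIO(
    "Form_Type,Company_Name,CIK,Date_Filed,File_Name\n"
    "10-K,Example Corp,1234567,2023-01-31,edgar/data/1234567/placeholder.txt\n"
)
records = list(rows_with_urls(sample))
print(records[0][-1])
```

One difference from the pandas version: `csv.DictReader` returns every field as a string, whereas `pd.read_csv` would typically infer `CIK` as an integer.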

# Step 4: Function to Get File Sizes Without Downloading the Files


In [21]:
# Function to get 10-K file sizes without downloading the files
def get_10k_file_sizes(data, output_csv):
    # Define request headers
    headers = {
        'User-Agent': 'TestBot +spoo@test.ai',
        'Accept-Encoding': 'gzip, deflate',
        'Host': 'www.sec.gov'
    }

    # Rate limit: 10 requests per second
    request_interval = 1.0 / 10  # Interval between requests, in seconds

    # Prepare CSV file to store the results
    with open(output_csv, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Form_Type', 'Company_Name', 'CIK', 'Date_Filed', 'File_Name', 'URL', 'File_Size (Bytes)']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        for i, row in enumerate(data, start=1):
            form_type, company_name, cik, date_filed, file_name, url = row
            start_time = time.time()

            # Send HEAD request to get file size from headers
            response = requests.head(url, headers=headers)
            if response.status_code == 200:
                # Get the file size from the headers
                file_size = response.headers.get('Content-Length', 0)

                # Write data and file size to the CSV file
                writer.writerow({
                    'Form_Type': form_type,
                    'Company_Name': company_name,
                    'CIK': cik,
                    'Date_Filed': date_filed,
                    'File_Name': file_name,
                    'URL': url,
                    'File_Size (Bytes)': file_size
                })
                print(f"Processed {url} ({file_size} bytes)")
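            else:
                # Include the status code so failures (e.g. 403 or 404) can be told apart
                print(f"Failed to get file size for {url} (HTTP {response.status_code})")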

            # Rate limiting to avoid overwhelming the SEC's servers
            elapsed_time = time.time() - start_time
            if elapsed_time < request_interval:
                time.sleep(request_interval - elapsed_time)
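
The pacing logic at the end of the loop can also be isolated into a reusable generator. The following is a minimal sketch, not the notebook's implementation: the name `paced` and its `clock` and `sleep` parameters are hypothetical, and it uses `time.monotonic` rather than `time.time` so that system clock adjustments do not affect the interval.

```python
import time

def paced(items, max_per_second=10, clock=time.monotonic, sleep=time.sleep):
    """Yield items no faster than max_per_second, sleeping only when needed."""
    interval = 1.0 / max_per_second
    last_start = None
    for item in items:
        if last_start is not None:
            # Time already spent on the previous item counts toward the interval
            remaining = interval - (clock() - last_start)
            if remaining > 0:
                sleep(remaining)
        last_start = clock()
        yield item

# Usage: each loop body starts at least 0.1 s after the previous one started
for url in paced(["url-1", "url-2", "url-3"], max_per_second=10):
    pass  # a HEAD request would go here
```

Passing the clock and sleep functions as parameters makes the pacing testable without real delays.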

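- This function sends HEAD requests to the constructed URLs to get the file sizes without downloading the files themselves.
- The requests.head() method retrieves only the response headers, including Content-Length, which reports the size of the response body as it would be transferred.
- Because the request advertises `Accept-Encoding: gzip, deflate`, Content-Length may describe a compressed body rather than the original file. A value of 20 bytes for every filing, as in the run recorded under Step 8, matches the size of an empty gzip stream and most likely does not reflect the real document size.
- Requests are spaced at least 0.1 seconds apart to stay within the limit of 10 requests per second.
- The file sizes and related data are written to an output CSV file for storage and analysis.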
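
Since Content-Length arrives as a string and may be missing or misleading, it can help to parse and sanity-check it before writing it to the CSV. The sketch below is an illustration only: the helper names (`identity_headers`, `parse_content_length`, `looks_like_empty_gzip`) are hypothetical, and whether the SEC server returns the uncompressed size when sent `Accept-Encoding: identity` is an assumption that would need to be verified against a real response.

```python
import gzip

# Size of a gzip stream with no payload (header and trailer only)
EMPTY_GZIP_SIZE = len(gzip.compress(b""))

def identity_headers(user_agent):
    """Build request headers that ask the server not to compress the body."""
    return {
        "User-Agent": user_agent,
        "Accept-Encoding": "identity",  # assumption: server honors this
    }

def parse_content_length(response_headers):
    """Return Content-Length as an int, or None if missing or not numeric."""
    raw = response_headers.get("Content-Length")
    if raw is None:
        return None
    try:
        return int(raw)
    except (TypeError, ValueError):
        return None

def looks_like_empty_gzip(size):
    """Flag sizes that match an empty gzip stream rather than a real filing."""
    return size == EMPTY_GZIP_SIZE

size = parse_content_length({"Content-Length": "20"})
print(size, looks_like_empty_gzip(size))
```

With the requests library, `response.headers` can be passed directly to `parse_content_length`, since it supports `.get()` and matches header names case-insensitively.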

# Step 5: Specify the Input File Path


In [22]:
# Path to the CSV file with the 10-K URLs and other data
csv_file_path = '/Users/spoorthy/Projects/Accounting/combined_filtered.csv'  # Update with actual path to your CSV file


# Step 6: Extract URLs from the Input CSV


In [23]:
# Read URLs and other relevant information from the CSV file
data_10k = read_10k_urls_from_csv(csv_file_path)

# Step 7: Specify the Output CSV Path


In [24]:
# Output CSV file to store the results with file sizes
output_csv = "/Users/spoorthy/Projects/Accounting/10K_file_sizes.csv"  # Update this path


# Step 8: Get File Sizes and Save to CSV


In [25]:

# Get the 10-K file sizes and save them to CSV
get_10k_file_sizes(data_10k, output_csv)

Processed https://www.sec.gov/Archives/edgar/data/1467373/0001467373-23-000324.txt (20 bytes)
Processed https://www.sec.gov/Archives/edgar/data/320193/0000320193-23-000106.txt (20 bytes)
Processed https://www.sec.gov/Archives/edgar/data/1730168/0001730168-23-000096.txt (20 bytes)
Processed https://www.sec.gov/Archives/edgar/data/909832/0000909832-23-000042.txt (20 bytes)
Processed https://www.sec.gov/Archives/edgar/data/315189/0001558370-23-019812.txt (20 bytes)
Processed https://www.sec.gov/Archives/edgar/data/32604/0000032604-23-000044.txt (20 bytes)
Processed https://www.sec.gov/Archives/edgar/data/804328/0000804328-23-000055.txt (20 bytes)
Processed https://www.sec.gov/Archives/edgar/data/829224/0000829224-23-000058.txt (20 bytes)
Processed https://www.sec.gov/Archives/edgar/data/1403161/0001403161-23-000099.txt (20 bytes)
Processed https://www.sec.gov/Archives/edgar/data/1744489/0001744489-23-000216.txt (20 bytes)
Processed https://www.sec.gov/Archives/edgar/data/66740/0000066740-