# Read Form.idx and select only 10-K filings for given firms

This code reads every .idx file in a directory, keeps the rows whose form type is in a given list (e.g., "10-K") and whose CIK is in a given list, and saves the combined result to a CSV file. The sections below explain each step and what to change for your own data:

# 1. Import Libraries:

In [11]:
import pandas as pd
import re
import os
import chardet

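+ pandas is used for handling tabular data.
+ re is used to locate the separator row and to split each line into fields.
+ os is used to navigate the file system.
+ chardet can detect the encoding of a file so that files with different encodings are read correctly. The code below imports it but does not call it; parse_idx_file opens every file as UTF-8.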
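
Since the parser further down opens files as UTF-8, here is a minimal sketch of how encoding detection could be wired in. The helper name `read_idx_text` is hypothetical and not part of the original notebook, and the sample line is made up. The sketch assumes chardet may or may not be installed, and falls back to latin-1 when no guess is available.

```python
import os
import tempfile

try:
    import chardet  # third-party and optional in this sketch
except ImportError:
    chardet = None


def read_idx_text(file_path):
    """Return the file's text, trying UTF-8 first (hypothetical helper)."""
    with open(file_path, 'rb') as file:
        raw = file.read()
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        pass
    # Ask chardet for a guess if it is installed; the guess may be None
    guess = chardet.detect(raw)['encoding'] if chardet else None
    try:
        return raw.decode(guess or 'latin-1', errors='replace')
    except LookupError:
        # Unknown codec name: latin-1 maps every byte to a character
        return raw.decode('latin-1')


# Demo on a made-up line containing a byte that is not valid UTF-8
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.idx')
    with open(sample_path, 'wb') as file:
        file.write(b'10-K  Caf\xe9 Holdings  1234\n')
    text = read_idx_text(sample_path)

print(text.startswith('10-K'))
```

The returned string could then be split with `text.splitlines()` in place of `file.readlines()`.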

# 2. List of Form Types and CIK Numbers to Filter:

In [12]:
def extract_ciks(csv_file):
    # Read the firm list and return its 'cik' column as a plain Python list
    df = pd.read_csv(csv_file)
    return df['cik'].tolist()

In [13]:
cik_list = extract_ciks('/Users/spoorthy/Projects/Accounting/sp100_list.csv')
print(cik_list)

[1800, 2488, 773840, 4962, 5272, 318154, 320193, 1390777, 732712, 1067983, 12927, 14272, 18230, 19617, 93410, 21344, 21665, 1166691, 831001, 313616, 27419, 315189, 1744489, 1326160, 32604, 34088, 753308, 1048911, 36104, 37996, 40533, 40545, 1467858, 1637459, 354950, 50863, 51143, 200406, 59478, 936468, 60667, 63908, 1613103, 64803, 310158, 66740, 70858, 320187, 72971, 77476]


In [14]:
form_types = ['10-K']
# cik_list = [1770787, 1690080, 1847360]  # Example CIKs, replace with actual values


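+ form_types lists the form types to keep.
+ To filter on a hand-picked set of firms instead of the CSV, uncomment the cik_list line and replace the example CIKs with the actual CIK numbers you're interested in.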

# 3. Convert CIKs to a Consistent 10-Digit Format:

In [15]:
# Zero-pad each CIK to 10 digits; a set keeps the membership test in the parser fast
ciks = {str(cik).zfill(10) for cik in cik_list}
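
If the CIK column in your CSV is not read as clean integers (for example, as strings with leading zeros or as floats), the padding step can produce values that never match. The sketch below shows one way to normalize first. `normalize_cik` is a hypothetical helper and the mixed input list is illustrative only.

```python
def normalize_cik(value):
    """Return a CIK as a 10-character, zero-padded string (hypothetical helper)."""
    # int() accepts ints, whole-number floats, and digit strings with padding
    return str(int(value)).zfill(10)


raw_values = [320193, '1800', ' 0001744489 ', 66740.0]
ciks = {normalize_cik(value) for value in raw_values}
print(sorted(ciks))
# ['0000001800', '0000066740', '0000320193', '0001744489']
```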

# 4. Function to Read and Filter a File:

In [16]:
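def parse_idx_file(file_path, form_types, ciks):
    """Return the rows of one .idx file that match form_types and ciks."""
    records = []
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # The data table starts on the line after the row of dashes
    separator = next((i for i, line in enumerate(lines) if re.match(r'-{3,}', line)), None)
    if separator is None:
        return records  # no separator row, so there is nothing to parse

    for line in lines[separator + 1:]:
        # Fields are separated by runs of two or more whitespace characters
        parts = re.split(r'\s{2,}', line.strip())
        if len(parts) != 5:
            continue  # skip blank or malformed lines
        form_type, company_name, cik, date_filed, file_name = parts
        cik_padded = cik.zfill(10)

        # Debugging output: printed for every matching form type, before the CIK filter
        if form_type in form_types:
            print(f"Found matching form type: {form_type}, CIK: {cik_padded}")

        if form_type in form_types and cik_padded in ciks:
            records.append([form_type, company_name, cik_padded, date_filed, file_name])

    return records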

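+ parse_idx_file: This function reads a single .idx file as UTF-8 text and returns the matching rows as a list of records. It does not detect the encoding or build a DataFrame itself.
+ It finds the first line that starts with a row of dashes (e.g., -------), which marks the start of the data table in the .idx file. The actual data starts on the next line.
+ It splits each line on runs of two or more whitespace characters and keeps only lines that yield exactly five fields (parts): form_type, company_name, cik, date_filed, file_name.
+ It keeps a row only if its form type is in form_types (e.g., "10-K") and its zero-padded CIK is in ciks. The "Found matching form type" messages are printed before the CIK filter, so they also list firms that are not in your CIK list.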
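
To make the splitting step concrete, here is a small sketch that applies the same regular expression to a single line. The sample line reuses values from the output shown later in this notebook, but its padding widths are an assumption and may differ from a real .idx file.

```python
import re

# Illustrative line in the layout the parser expects (padding widths are an assumption)
sample_line = (
    "10-K        Apple Inc.                          320193      "
    "2023-11-03  edgar/data/320193/0000320193-23-000106.txt\n"
)

parts = re.split(r'\s{2,}', sample_line.strip())
form_type, company_name, cik, date_filed, file_name = parts
print(parts)

# A name containing a double space yields six fields, so the parser would skip the line
odd_line = sample_line.replace("Apple Inc.", "Apple  Inc.")
odd_parts = re.split(r'\s{2,}', odd_line.strip())
print(len(odd_parts))
```

The second case shows a limit of this approach: any line that does not split into exactly five fields is dropped without a warning.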

# 5. Define Directory Containing .idx Files:



In [17]:
idx_directory = "/Users/spoorthy/Projects/Accounting/index_files"  # Update this path


+ Update this path to the directory where your .idx files are stored.


# 6. Accumulate Filtered Data:



In [18]:
all_records = []
for file_name in os.listdir(idx_directory):
    if file_name.endswith('.idx'):
        file_path = os.path.join(idx_directory, file_name)
        print(f"Parsing file: {file_path}")
        all_records.extend(parse_idx_file(file_path, form_types, ciks))

accumulated_df = pd.DataFrame(all_records, columns=['Form_Type', 'Company_Name', 'CIK', 'Date_Filed', 'File_Name'])
print(accumulated_df.head())

Parsing file: /Users/spoorthy/Projects/Accounting/index_files/2023QTR4.idx
Found matching form type: 10-K, CIK: 0001605331
Found matching form type: 10-K, CIK: 0000771497
Found matching form type: 10-K, CIK: 0001144215
Found matching form type: 10-K, CIK: 0000868857
Found matching form type: 10-K, CIK: 0001090872
Found matching form type: 10-K, CIK: 0000003545
Found matching form type: 10-K, CIK: 0001823584
Found matching form type: 10-K, CIK: 0000775057
Found matching form type: 10-K, CIK: 0000928465
Found matching form type: 10-K, CIK: 0000720500
Found matching form type: 10-K, CIK: 0000006281
Found matching form type: 10-K, CIK: 0001314052
Found matching form type: 10-K, CIK: 0000744452
Found matching form type: 10-K, CIK: 0000006951
Found matching form type: 10-K, CIK: 0001755101
Found matching form type: 10-K, CIK: 0000779544
Found matching form type: 10-K, CIK: 0000879407
Found matching form type: 10-K, CIK: 0001674862
Found matching form type: 10-K, CIK: 0001755058
Found matchin

In [19]:
print(accumulated_df)

    Form_Type                Company_Name         CIK  Date_Filed  \
0        10-K                  Apple Inc.  0000320193  2023-11-03   
1        10-K                  DEERE & CO  0000315189  2023-12-15   
2        10-K         EMERSON ELECTRIC CO  0000032604  2023-11-13   
3        10-K              Walt Disney Co  0001744489  2023-11-21   
4        10-K                       3M CO  0000066740  2022-02-09   
..        ...                         ...         ...         ...   
245      10-K  VERIZON COMMUNICATIONS INC  0000732712  2024-02-09   
246      10-K    WELLS FARGO & COMPANY/MN  0000072971  2024-02-20   
247      10-K                  FEDEX CORP  0001048911  2024-07-15   
248      10-K                  NIKE, Inc.  0000320187  2024-07-25   
249      10-K               Medtronic plc  0001613103  2024-06-20   

                                       File_Name  
0     edgar/data/320193/0000320193-23-000106.txt  
1     edgar/data/315189/0001558370-23-019812.txt  
2      edgar/data/

+ This loop iterates through each .idx file in the directory, applies the parse_idx_file function, and collects the matching rows in all_records, which is then converted to the DataFrame accumulated_df.
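
After accumulating, it can be useful to check which requested firms produced no rows at all. The sketch below is one way to do that. `missing_ciks` is a hypothetical helper, and the two records are copied from the output above purely for illustration.

```python
def missing_ciks(requested_ciks, records):
    """Return requested CIKs that never appear in records (hypothetical helper)."""
    found = {record[2] for record in records}  # index 2 holds the padded CIK
    return sorted(set(requested_ciks) - found)


requested = ['0000320193', '0000315189', '0000001800']
records = [
    ['10-K', 'Apple Inc.', '0000320193', '2023-11-03',
     'edgar/data/320193/0000320193-23-000106.txt'],
    ['10-K', 'DEERE & CO', '0000315189', '2023-12-15',
     'edgar/data/315189/0001558370-23-019812.txt'],
]
print(missing_ciks(requested, records))  # ['0000001800']
```

In the notebook this would be called as `missing_ciks(ciks, all_records)`.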


# 7. Save the Filtered Data to a CSV File:



In [20]:
output_file = "/Users/spoorthy/Projects/Accounting/combined_filtered.csv"  # Update this path
accumulated_df.to_csv(output_file, index=False)

+ The accumulated DataFrame is saved as a CSV file. Update the path to the location where you want to save the output file.
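
One thing to watch when loading the saved file again: by default pandas parses an all-digit column as integers, which drops the leading zeros from the CIK. The sketch below shows the effect on a one-row frame written to an in-memory buffer (an assumption made here so that no file is created).

```python
import io

import pandas as pd

df = pd.DataFrame({'Form_Type': ['10-K'], 'CIK': ['0000320193']})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default parsing turns the CIK into an integer
as_default = pd.read_csv(io.StringIO(csv_text))
# Forcing a string dtype keeps the 10-character form
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'CIK': str})

print(as_default['CIK'].iloc[0])
print(as_text['CIK'].iloc[0])
```

Passing `dtype={'CIK': str}` when reading the output file keeps the values comparable with the padded strings in `ciks`.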

# 8. How to Customize:
+ Adjust Field Parsing: parse_idx_file splits each line on runs of two or more whitespace characters and expects exactly five fields. If the structure of your .idx files differs, modify the split pattern and the field count check to match your file format.
+ CIK Numbers: Point extract_ciks at your own CSV (it must have a cik column), or set cik_list directly to the CIK numbers you're interested in.
+ File Directory: Set idx_directory to the location of your .idx files.
+ Form Type: To filter by a different form type (e.g., "10-Q"), edit the form_types list; it is passed to parse_idx_file.
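
If splitting on whitespace proves fragile for your files, one alternative is to slice each line at fixed column positions. The sketch below derives the positions from the header row. It assumes that data fields start at the same offsets as the header labels, which you should verify against your own files. The helper names and the column widths used to build the sample lines are hypothetical.

```python
HEADER_LABELS = ["Form Type", "Company Name", "CIK", "Date Filed", "File Name"]


def column_offsets(header_line):
    """Return the start offset of each label in the header line."""
    return [header_line.index(label) for label in HEADER_LABELS]


def slice_fields(line, offsets):
    """Cut a data line at the given offsets and strip the padding."""
    bounds = offsets + [None]  # None lets the last slice run to the end of the line
    return [line[start:end].strip() for start, end in zip(bounds, bounds[1:])]


# Sample lines built with assumed widths; the double space in the name is deliberate
widths = [12, 62, 12, 12]
header = "".join(label.ljust(w) for label, w in zip(HEADER_LABELS, widths)) + "File Name"
row = "".join(
    value.ljust(w)
    for value, w in zip(["10-K", "DEERE  &  CO", "315189", "2023-12-15"], widths)
) + "edgar/data/315189/0001558370-23-019812.txt"

offsets = column_offsets(header)
fields = slice_fields(row, offsets)
print(offsets)
print(fields)
```

Unlike the whitespace split, this approach still returns five fields when a company name contains a double space.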