### Company General Data
This script uses the company tickers API (https://www.sec.gov/files/company_tickers.json) and the submissions API (https://data.sec.gov/submissions/CIK{cik}.json) to retrieve general information about the companies in the EDGAR database.

In [2]:
import requests
import pandas as pd
import numpy as np

In [3]:
headers = {'User-Agent': "jose.trindade@bts.tech"}
company_tickers = requests.get("https://www.sec.gov/files/company_tickers.json",
                              headers = headers)

company_tickers_json = company_tickers.json()

In [10]:
list_dicts = [company_tickers_json[company] for company in company_tickers_json]
companyData = pd.DataFrame(list_dicts)
companyData

Unnamed: 0,cik_str,ticker,title
0,320193,AAPL,Apple Inc.
1,789019,MSFT,MICROSOFT CORP
2,1652044,GOOGL,Alphabet Inc.
3,1018724,AMZN,AMAZON COM INC
4,1045810,NVDA,NVIDIA CORP
...,...,...,...
10563,1392694,SURGW,"SurgePays, Inc."
10564,1498382,DMPWW,"Kintara Therapeutics, Inc."
10565,1848821,GTACU,Global Technology Acquisition Corp. I
10566,1848821,GTACW,Global Technology Acquisition Corp. I
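
The JSON returned by `company_tickers.json` is a dictionary keyed by row number, where each value is a record with `cik_str`, `ticker` and `title`. As an alternative to the list comprehension, `pd.DataFrame.from_dict` with `orient="index"` should build the same table directly. This is only a sketch: `sample` is a hand-written stand-in for the real response, reusing the first two rows shown above.

```python
import pandas as pd

# Hand-written stand-in for company_tickers.json() (two rows only).
sample = {
    "0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."},
    "1": {"cik_str": 789019, "ticker": "MSFT", "title": "MICROSOFT CORP"},
}

# Each outer key becomes a row label and each inner key becomes a column.
# reset_index(drop=True) replaces the string labels "0", "1" with integers.
company_sample = pd.DataFrame.from_dict(sample, orient="index").reset_index(drop=True)
print(company_sample)
```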


In [11]:
# The submissions API expects a 10-digit CIK, so shorter CIKs are left-padded with zeros.

companyData['cik_str'] = companyData['cik_str'].astype(str).str.zfill(10)
companyData

Unnamed: 0,cik_str,ticker,title
0,0000320193,AAPL,Apple Inc.
1,0000789019,MSFT,MICROSOFT CORP
2,0001652044,GOOGL,Alphabet Inc.
3,0001018724,AMZN,AMAZON COM INC
4,0001045810,NVDA,NVIDIA CORP
...,...,...,...
10563,0001392694,SURGW,"SurgePays, Inc."
10564,0001498382,DMPWW,"Kintara Therapeutics, Inc."
10565,0001848821,GTACU,Global Technology Acquisition Corp. I
10566,0001848821,GTACW,Global Technology Acquisition Corp. I
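
To make the padding step concrete, here is a small sketch that pads a single CIK and builds the submissions URL from it. The helper name `submissions_url` is hypothetical and not part of the notebook.

```python
def submissions_url(cik) -> str:
    """Build the submissions API URL for a CIK given as int or str."""
    # zfill left-pads with zeros up to 10 characters.
    padded = str(cik).strip().zfill(10)
    if not padded.isdigit() or len(padded) != 10:
        raise ValueError(f"not a valid CIK: {cik!r}")
    return f"https://data.sec.gov/submissions/CIK{padded}.json"


print(submissions_url(320193))
# https://data.sec.gov/submissions/CIK0000320193.json
```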


**Next step**: To retrieve more data about each company, we query the submissions API and combine the results with the table above. We keep the following fields: sic, sicDescription, category, entityType, exchanges, fiscalYearEnd and stateOfIncorporation, and add them as columns to the DataFrame.
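
As a rough illustration of that extraction, the sketch below picks the wanted fields out of one response. The `payload` dictionary is a hand-written stand-in, not a real API response, and it deliberately contains only some of the fields. The helper `pick_fields` is hypothetical.

```python
wanted = ['sic', 'sicDescription', 'category', 'entityType',
          'exchanges', 'fiscalYearEnd', 'stateOfIncorporation']

# Stand-in for one submissions response; most wanted keys are left out on purpose.
payload = {"name": "Apple Inc.", "sic": "3571", "sicDescription": "Electronic Computers"}


def pick_fields(payload, wanted):
    """Return one value per wanted key, using None for anything missing."""
    if not payload:
        return {key: None for key in wanted}
    # dict.get returns None instead of raising KeyError for absent keys.
    return {key: payload.get(key) for key in wanted}


row = pick_fields(payload, wanted)
print(row)
```

Using `.get` means a company with an incomplete record still yields a full row, with `None` in the gaps.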

In [12]:
# List of columns to create based on API data
columns_to_create = ['sic', 'sicDescription', 'category', 'entityType', 'exchanges', 'fiscalYearEnd', 'stateOfIncorporation']

In [13]:
# Function to create columns based on json fetched
def fetch_and_create_columns(df, cik_column, fetch_function, new_columns):
    
    #Copy df
    df = df.copy()
    
    # Apply the fetch_function to each row in the DataFrame
    df['ApiData'] = df[cik_column].apply(fetch_function)

    # Create new columns based on the API data using .loc for assignment.
    # .get() returns None instead of raising KeyError when a field is missing.
    for column in new_columns:
        df.loc[:, column] = df['ApiData'].apply(lambda x: x.get(column) if x else None)

    # Drop the 'ApiData' column if no longer needed
    df = df.drop('ApiData', axis=1)

    return df


# Function to fetch additional data from the API
def fetch_data_from_api(cik):
    # The SEC asks every request to carry a User-Agent that identifies the requester
    headers = {'User-Agent': "jose.trindade@bts.tech"}
    api_url = f"https://data.sec.gov/submissions/CIK{cik}.json"
    response = requests.get(api_url, headers = headers)

    if response.status_code == 200:
        # The submissions API returns JSON
        return response.json()
    else:
        # Any other status (e.g. unknown CIK, throttling) is treated as "no data"
        return None
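
Fetching more than ten thousand CIKs in a tight loop can run into the SEC's fair-access limit, which its published guidance puts at 10 requests per second. The sketch below spaces out the calls. It is an assumption about how one might throttle, not something the notebook does. `fetch_all`, `min_interval` and `fake_fetch` are hypothetical names, and the fetch function is passed in so the demo runs without network access.

```python
import time


def fetch_all(ciks, fetch, min_interval=0.11, sleep=time.sleep):
    """Call fetch(cik) for each CIK, at least min_interval seconds apart."""
    results = {}
    last_call = None
    for cik in ciks:
        if last_call is not None:
            # Wait only for the part of the interval that has not elapsed yet.
            wait = min_interval - (time.monotonic() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = time.monotonic()
        results[cik] = fetch(cik)
    return results


# Offline demo with a stub in place of the real API call.
def fake_fetch(cik):
    return {"sic": "3571"} if cik == "0000320193" else None


demo = fetch_all(["0000320193", "0000000000"], fake_fetch, min_interval=0.0)
print(demo)
```

With the real function this would be called as `fetch_all(companyData['cik_str'], fetch_data_from_api)`, and the resulting dictionary could then be mapped back onto the DataFrame.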

In [None]:
# DON'T RUN THIS UNLESS YOU DON'T HAVE THE DATA ALREADY
# Get df with 10568 companies with new accounting columns
#complete_CompanyData = fetch_and_create_columns(companyData, 'cik_str', fetch_data_from_api, columns_to_create)

In [None]:
# DON'T RUN THIS UNLESS YOU DON'T HAVE THE DATA ALREADY
# Export df with 10568 companies to csv
#complete_CompanyData.to_csv("Company_Data.csv")
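
Instead of commenting the two cells in and out by hand, the "only fetch if the file is missing" rule could be written as a small helper. This is a sketch under assumptions: `load_or_build` is a hypothetical name, and `build_stub` stands in for the slow API download so the example runs offline in a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_build(csv_path, build):
    """Read csv_path if it exists; otherwise call build(), save and return it."""
    path = Path(csv_path)
    if path.exists():
        # Read cik_str as text so leading zeros are preserved.
        return pd.read_csv(path, index_col=0, dtype={"cik_str": str})
    frame = build()
    frame.to_csv(path)
    return frame


calls = []


def build_stub():
    # Stand-in for the expensive fetch_and_create_columns(...) call.
    calls.append(1)
    return pd.DataFrame({"cik_str": ["0000320193"], "ticker": ["AAPL"]})


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "Company_Data.csv"
    first = load_or_build(target, build_stub)   # file missing: builds and writes
    second = load_or_build(target, build_stub)  # file present: reads from disk

print(len(calls))
```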

In [75]:
# Importing the data generated above
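# Read cik_str and fiscalYearEnd as text so leading zeros are not lost
df = pd.read_csv("Company_Data.csv", index_col=0,
                 dtype={"cik_str": str, "fiscalYearEnd": str})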
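
One thing to watch when reloading the file: `pd.read_csv` infers an integer type for all-digit columns, which drops the zero padding added to `cik_str` earlier. The sketch below shows the difference using an in-memory CSV, where `csv_text` is a made-up two-row stand-in for `Company_Data.csv`.

```python
import io

import pandas as pd

# Two-row stand-in for Company_Data.csv.
csv_text = ",cik_str,ticker\n0,0000320193,AAPL\n1,0000789019,MSFT\n"

# Default type inference turns the padded CIKs into integers.
inferred = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Declaring the column as str keeps the 10-character form.
as_text = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype={"cik_str": str})

print(inferred["cik_str"].tolist())  # [320193, 789019]
print(as_text["cik_str"].tolist())   # ['0000320193', '0000789019']
```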

In [76]:
# Some reformatting

# Format the fiscalYearEnd
df['fiscalYearEnd'] = pd.to_datetime(df['fiscalYearEnd'], errors='coerce', format='%m%d')
df['fiscalYearEnd'] = df['fiscalYearEnd'].dt.strftime('%d-%m')

# Filling NaN with the value 0 and changing the type from float to integer
df["sic"] = df["sic"].fillna(0).astype(int)

# Exporting it back to csv
df.to_csv("Company_Data.csv")
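
Parsing `MMDD` through `pd.to_datetime` has to assume a year, so a fiscal year end of `0229` is likely to be coerced to `NaT` when the assumed year is not a leap year. If that matters, one possible alternative is to rearrange the string directly. This is a sketch only: `mmdd_to_ddmm` is a hypothetical helper, and it assumes the values are text such as `"0930"`.

```python
import pandas as pd


def mmdd_to_ddmm(values: pd.Series) -> pd.Series:
    """Turn 'MMDD' strings into 'DD-MM'; anything else becomes missing."""
    text = values.astype("string").str.zfill(4)
    # Accept months 01-12 and days 01-31; no per-month day check is done.
    is_valid = (
        text.str.fullmatch(r"(0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01])")
        .fillna(False)
        .astype(bool)
    )
    return (text.str[2:] + "-" + text.str[:2]).where(is_valid)


raw = pd.Series(["0930", "1231", "0229", None, "630", "1340"])
converted = mmdd_to_ddmm(raw)
print(converted)
```

The trade-off is that this check is looser than real date parsing: a value such as `0231` passes even though it is not a calendar date.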

In [3]:
# Read cik_str as text so the leading zeros are kept
company_data = pd.read_csv("Company_Data.csv", index_col=0, dtype={"cik_str": str})

In [22]:
# Number of companies per SIC
count_sic = (
    company_data.groupby(["sic", "sicDescription"])
    .agg("count")
    .sort_values(by="cik_str", ascending=False)[["cik_str"]]
    .rename(columns={"cik_str": "count"})
    .reset_index()
)

In [24]:
# Number of companies with the same SIC as Apple
count_sic[count_sic["sic"]==3571]

Unnamed: 0,sic,sicDescription,count
223,3571,Electronic Computers,7
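
The same per-SIC count can also be expressed with `groupby(...).size()`, which counts rows per group without going through every column. The sketch below runs on a made-up three-row table; the CIKs in `toy` are placeholders, not real companies.

```python
import pandas as pd

# Made-up rows with the same column names as company_data.
toy = pd.DataFrame({
    "cik_str": ["0000000001", "0000000002", "0000000003"],
    "sic": [3571, 3571, 7372],
    "sicDescription": [
        "Electronic Computers",
        "Electronic Computers",
        "Services-Prepackaged Software",
    ],
})

counts = (
    toy.groupby(["sic", "sicDescription"])
    .size()                      # number of rows in each group
    .reset_index(name="count")   # back to a flat table with a 'count' column
    .sort_values("count", ascending=False, ignore_index=True)
)
print(counts)
```

Note that `size()` counts every row, while `agg("count")` counts non-null values per column, so the two only agree when `cik_str` has no missing values.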


In [25]:
count_sic.to_csv("companies_per_sic.csv")

### Output
This script produces:
- a CSV file (Company_Data.csv) with general information about the companies registered with the SEC.
- a CSV file with the number of companies per SIC code.