# Case study

## Extract companies' financial information

# Data extraction method

To retrieve the required information I used edgartools, a Python library whose methods send requests to the SEC EDGAR API and return the data directly.

Compared with other available methods, this approach makes data retrieval faster and easier: filings can be converted directly into dataframes, and the data of interest can be extracted directly from them.

# Data extraction process

After creating an empty dataframe to hold the retrieved financial information, I wrote a for loop that retrieves the data for each company:


- Using the CIK, I make an API request through the `Company` class of edgartools.
- From the returned object I can access the company information (address, company name, industry description) as well as the 10-K filings of the latest 5 years.
- From each 10-K I access the financial statements and extract the last occurrence of revenue. A statement sometimes contains a single revenue line and sometimes several; when there are several, the last one is the total revenue, which is the value I want.
- Once all the data are gathered, I append a new row to the dataframe with the company name, the year of the financial statement, the revenue, the address and the industry classification.
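
The "last occurrence of revenue" rule from the list above can be illustrated on a toy table. This is only a sketch under stated assumptions: the column names `concept` and `value`, the sample rows and the helper name `last_revenue_value` are hypothetical, and the real layout of the statement returned by edgartools may differ.

```python
from typing import Optional

import pandas as pd


def last_revenue_value(statement: pd.DataFrame) -> Optional[int]:
    """Return the value of the last row whose concept starts a revenue tag."""
    mask = statement["concept"].str.contains("us-gaap_Revenue", case=False, na=False)
    matches = statement[mask]
    if matches.empty:
        return None  # no revenue line in this statement
    return abs(int(matches.iloc[-1]["value"]))


# Hypothetical statement: one revenue component followed by the total.
toy_statement = pd.DataFrame(
    {
        "concept": [
            "us-gaap_RevenueFromContractWithCustomerExcludingAssessedTax",
            "us-gaap_Revenues",
            "us-gaap_CostOfRevenue",  # not matched: "Revenue" does not follow the prefix
            "us-gaap_NetIncomeLoss",
        ],
        "value": [600, 1000, 700, 300],
    }
)

print(last_revenue_value(toy_statement))
```

Matching on the full `us-gaap_Revenue` prefix, rather than on the word "Revenue" alone, keeps lines such as cost of revenue out of the selection.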

# Final step

After retrieving all the information, some adjustments are needed to get the correct format for 'geonameen'. Because of conflicts between dependencies, I completed this step in a separate notebook named 'Cleaning'.

In [1]:
pip install edgartools

Collecting edgartools
  Downloading edgartools-3.11.5-py3-none-any.whl.metadata (17 kB)
Collecting rank-bm25>=0.2.1 (from edgartools)
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Collecting rapidfuzz>=3.5.0 (from edgartools)
  Downloading rapidfuzz-3.12.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting stamina>=24.2.0 (from edgartools)
  Downloading stamina-24.3.0-py3-none-any.whl.metadata (5.5 kB)
Collecting textdistance>=4.5.0 (from edgartools)
  Downloading textdistance-4.6.3-py3-none-any.whl.metadata (18 kB)
Collecting unidecode>=1.2.0 (from edgartools)
  Downloading Unidecode-1.3.8-py3-none-any.whl.metadata (13 kB)
Downloading edgartools-3.11.5-py3-none-any.whl (1.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 18.9 MB/s eta 0:00:00
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Downloading rapidfuzz-3.12.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86

In [2]:
import pandas as pd
from edgar import Company, get_filings, set_identity  # only the names used in this notebook

In [3]:
us_states = ['AL','AK','AZ','AR','CA','CO','CT','DE','DC','FL','GA',
             'HI','ID','IL','IN','IA','KS','KY','LA','ME','MD','MA',
             'MI','MN','MS','MO','MT','NE','NV','NH','NJ','NM','NY',
             'NC','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX',
             'UT','VT','VA','WA','WV','WI','WY']
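
The list above is used later to decide whether a business address is in the United States. As a minimal sketch of that mapping (the helper name `to_country` and the sample values are hypothetical, and the non-US value shown is only a placeholder, not an actual EDGAR code):

```python
def to_country(code, us_state_codes):
    """Return 'United States' for a US state code, else the input unchanged."""
    return "United States" if code in us_state_codes else code


# Small sample for illustration; the notebook's full list is `us_states`.
sample_states = ["CA", "NY", "OH"]

print(to_country("CA", sample_states))
print(to_country("Canada", sample_states))  # placeholder non-US value
```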


In [4]:
set_identity("Sara Chiarelli sarachiarelli@outlook.it")  # required before edgartools can send requests to EDGAR

In [5]:
filings = get_filings(2024, form="10-K")  # 10-K filings submitted in 2024

In [6]:
df = filings.to_pandas()  # convert the filings index to a pandas dataframe so each row can be accessed

In [7]:
df1 = df[:1000]  # keep the first 1000 filings (one row per filing); they are used to retrieve each company's financial data

In [8]:
df1 = df1.sort_values(["company", "filing_date"])
df1

Unnamed: 0,form,company,cik,filing_date,accession_number
666,10-K,1 800 FLOWERS COM INC,1084869,2024-09-06,0001437749-24-028591
812,10-K,3D SYSTEMS CORP,910638,2024-08-13,0000910638-24-000030
663,10-K,"5E Advanced Materials, Inc.",1888654,2024-09-09,0000950170-24-104782
625,10-K,"A-Mark Precious Metals, Inc.",1591588,2024-09-13,0000950170-24-106317
943,10-K,AAR CORP,1750,2024-07-19,0001104659-24-080890
...,...,...,...,...,...
225,10-K,"i3 Verticals, Inc.",1728688,2024-11-25,0001728688-24-000102
605,10-K,"iBio, Inc.",1420720,2024-09-20,0001420720-24-000038
590,10-K/A,"iBio, Inc.",1420720,2024-09-24,0001420720-24-000041
606,10-K,iPower Inc.,1830072,2024-09-20,0001683168-24-006560
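
The listing above contains two rows for iBio, Inc. (a 10-K and a 10-K/A with the same CIK), so a loop over every row would process that company twice. One possible way to keep a single row per CIK is sketched below on made-up data; the toy frame and the choice to keep the earliest filing are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy index with the same columns as the listing above; all values are made up.
toy_index = pd.DataFrame(
    {
        "form": ["10-K", "10-K/A", "10-K"],
        "company": ["Example A", "Example A", "Example B"],
        "cik": [111, 111, 222],
        "filing_date": ["2024-09-20", "2024-09-24", "2024-09-20"],
    }
)

# Keep one row per CIK, preferring the earliest filing date.
unique_companies = (
    toy_index.sort_values(["cik", "filing_date"])
    .drop_duplicates(subset="cik", keep="first")
    .reset_index(drop=True)
)

print(unique_companies)
```

Resetting the index keeps the row labels contiguous, so label-based access such as `.loc[i]` for `i in range(len(unique_companies))` still works.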


In [23]:
# Empty result table; "Address" is declared here because every appended row carries it
dfn = pd.DataFrame(columns=["timevalue", "companyname", "industryclassification", "Country", "revenue", "revenue_unit", "Address"])

In [24]:
for i in range(1000):
  try:
    cik = df1.loc[i, "cik"]
    company = Company(f"{cik}")
    name = company.name
    # state or country of the business address
    address = str(company.business_address.state_or_country_desc)
    country = "United States" if address in us_states else address
    sic = company.sic_description
    latest_10k = company.latest("10-K", 5)  # the 5 latest 10-K filings
    for tenk in latest_10k:
      ten = tenk.obj()
      x = ten.financials.income.data  # income statement as a dataframe
      years = x.columns[0]
      try:
        # Take the last revenue row in the statement: sometimes there is a single
        # occurrence, sometimes several followed by their sum, so the last one is the total revenue
        revenue_rows = x[x["concept"].str.contains("us-gaap_Revenue", case=False, na=False)]
        rev = revenue_rows.iloc[-1].iloc[1]
        revenue = abs(int(rev))
      except Exception:
        continue  # no usable revenue figure in this filing
      unit = "USD"
      new_row = pd.DataFrame({"timevalue": [years], "companyname": [name], "industryclassification": [sic], "Address": [address], "Country": [country], "revenue": [revenue], "revenue_unit": [unit]})
      dfn = pd.concat([dfn, new_row], ignore_index=True)
  except Exception:
    continue  # skip companies whose data cannot be retrieved
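
The loop above grows `dfn` with `pd.concat` on every iteration, which builds a new dataframe each time. A common alternative is to collect the rows as dicts in a list and build the dataframe once at the end. The sketch below shows the idea on made-up rows; the names `COLUMNS`, `build_frame` and `sample_rows` are hypothetical and the values are placeholders, not extracted data.

```python
import pandas as pd

COLUMNS = ["timevalue", "companyname", "industryclassification",
           "Country", "revenue", "revenue_unit", "Address"]


def build_frame(records):
    """Build the result table in one step from an iterable of row dicts."""
    return pd.DataFrame(list(records), columns=COLUMNS)


# Placeholder rows standing in for values extracted from the filings.
sample_rows = [
    {"timevalue": "2024", "companyname": "Example Corp", "industryclassification": "Example industry",
     "Country": "United States", "revenue": 100, "revenue_unit": "USD", "Address": "CA"},
    {"timevalue": "2023", "companyname": "Example Corp", "industryclassification": "Example industry",
     "Country": "United States", "revenue": 90, "revenue_unit": "USD", "Address": "CA"},
]

result = build_frame(sample_rows)
print(result)
```

Inside the extraction loop, this would mean appending one dict per filing to a list instead of calling `pd.concat`.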




In [27]:
len(dfn)

2995

In [28]:
dfn

Unnamed: 0,timevalue,companyname,industryclassification,Country,revenue,revenue_unit,Address
0,2024,DAILY JOURNAL CORP,Newspapers: Publishing or Publishing & Printing,United States,67709000,USD,CA
1,2023,DAILY JOURNAL CORP,Newspapers: Publishing or Publishing & Printing,United States,54009000,USD,CA
2,2024,"EXP OldCo Winddown, Inc.",Retail-Apparel & Accessory Stores,United States,1864182000,USD,OH
3,2023,"EXP OldCo Winddown, Inc.",Retail-Apparel & Accessory Stores,United States,1870296000,USD,OH
4,2022,"EXP OldCo Winddown, Inc.",Retail-Apparel & Accessory Stores,United States,1208374000,USD,OH
...,...,...,...,...,...,...,...
2990,2021,"Global Arena Holding, Inc.",Services-Prepackaged Software,United States,641629,USD,NY
2991,2020,"Global Arena Holding, Inc.",Services-Prepackaged Software,United States,477773,USD,NY
2992,2019,"Global Arena Holding, Inc.",Services-Prepackaged Software,United States,716517,USD,NY
2993,2023,"Optimus Healthcare Services, Inc.",Services-Commercial Physical & Biological Rese...,United States,1218882,USD,NY


In [25]:
dfn.to_excel("Case_Study_to_clean.xlsx")

In [26]:
from google.colab import files
files.download('Case_Study_to_clean.xlsx')
