In [11]:
import pandas as pd
# ensure that all columns are shown and that column content is not cut
pd.set_option('display.max_rows', 500) # show up to 500 rows
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width',1000)

# Analyze Segments Information

In this notebook, we analyze what information is available in the segments filter.

To do that, we use the joined databag containing all the filtered and joined data (as created in the automation example in the 08_00_automation_basics notebook). Loading it uses quite a lot of memory and takes about a minute.

As an alternative, you could also use only the data of 2024:
<pre>
# As an alternative, using the data of a single year
from secfsdstools.d_container.databagmodel import JoinedDataBag
from secfsdstools.e_collector.zipcollecting import ZipCollector
from secfsdstools.u_usecases.bulk_loading import default_postloadfilter

collector = ZipCollector.get_zip_by_names(names=["2024q1.zip", "2024q2.zip", "2024q3.zip", "2024q4.zip"], 
                                          forms_filter=["10-K", "10-Q"],                                        
                                          post_load_filter=default_postloadfilter)

all_joined_bag: JoinedDataBag = collector.collect().join()
pre_num_df = all_joined_bag.pre_num_df
print(len(pre_num_df))
</pre>

In [2]:
from secfsdstools.d_container.databagmodel import JoinedDataBag

path_to_all = "C:/data/sec/automated/_4_single_bag/all"
all_joined_bag = JoinedDataBag.load(path_to_all)
pre_num_df = all_joined_bag.pre_num_df

2025-02-23 07:02:59,156 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2025-02-23 07:02:59,848 [INFO] updateprocess  Launching data update process ...
2025-02-23 07:02:59,876 [INFO] task_framework  Starting process SecDownloadingProcess
2025-02-23 07:02:59,878 [INFO] secdownloading_process  reading table in main page: https://www.sec.gov/dera/data/financial-statement-data-sets.html
2025-02-23 07:03:00,759 [INFO] task_framework  Starting process ToParquetTransformerProcess
2025-02-23 07:03:00,762 [INFO] task_framework  Starting process ReportParquetIndexerProcess


## Basic information

In [3]:
print(len(pre_num_df))

62187005


The whole dataset (as of February 2025) has over **62 million** rows in the joined pre_num_df dataframe. Now, let's see how many rows have information inside the `segments` column:

In [4]:
print((pre_num_df.segments != '').sum())

26381721


About **42%** of the data points have segments information.
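
The share follows directly from the two counts above: 26,381,721 / 62,187,005 ≈ 42.4%. As a minimal sketch, the same ratio can be computed in one step by averaging a boolean mask. The `toy_df` below is a made-up stand-in for `pre_num_df`, not real data.

```python
import pandas as pd

# Hypothetical toy stand-in for pre_num_df; only the `segments` column matters here.
toy_df = pd.DataFrame({
    "segments": [
        "",
        "BusinessSegments=EuropeSegment;",
        "",
        "ProductOrService=Service;",
        "",
    ],
})

# A boolean mask averages to the fraction of True values.
has_segments = toy_df.segments != ""
share = has_segments.mean()

print(f"{has_segments.sum()} of {len(toy_df)} rows ({share:.1%}) carry segments information")
```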

Now let us see how many different values we have in the `segments` column:

In [5]:
print(pre_num_df.segments.nunique(dropna=True))

844868


There appear to be many different values in the `segments` column. So it will be interesting to see whether certain values are more frequent, and therefore more important, than others.

## Category/Axis

### Basics

Usually, entries with segments information "belong" to an entry with the same `tag` whose `segments` column is empty.

As an example, let us look at Apple's 10-Q report for the second quarter of 2024, which has the adsh "0000320193-24-000069". We will also filter for the revenue tag `RevenueFromContractWithCustomerExcludingAssessedTax` and for values covering only a single quarter (qtrs==1).

In [6]:
example_segments = pre_num_df[(pre_num_df.adsh=="0000320193-24-000069") & (pre_num_df.tag=="RevenueFromContractWithCustomerExcludingAssessedTax") & (pre_num_df.qtrs==1)]
example_segments

Unnamed: 0,adsh,tag,version,ddate,qtrs,uom,segments,coreg,value,footnote,report,line,stmt,inpth,rfile,plabel,negating
61209002,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,BusinessSegments=GreaterChinaSegment;,,16372000000.0,,2,7,IS,0,H,Net sales,0
61209005,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,ProductOrService=IPad;,,5559000000.0,,2,7,IS,0,H,Net sales,0
61209006,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,,,90753000000.0,,2,7,IS,0,H,Net sales,0
61209011,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,BusinessSegments=RestOfAsiaPacificSegment;,,6723000000.0,,2,7,IS,0,H,Net sales,0
61209012,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,BusinessSegments=EuropeSegment;,,24123000000.0,,2,7,IS,0,H,Net sales,0
61209015,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,BusinessSegments=JapanSegment;,,6262000000.0,,2,7,IS,0,H,Net sales,0
61209016,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,ProductOrService=Service;,,23867000000.0,,2,7,IS,0,H,Net sales,0
61209018,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,ProductOrService=Mac;,,7451000000.0,,2,7,IS,0,H,Net sales,0
61209019,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,ProductOrService=WearablesHomeandAccessories;,,7913000000.0,,2,7,IS,0,H,Net sales,0
61209020,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,BusinessSegments=AmericasSegment;,,37273000000.0,,2,7,IS,0,H,Net sales,0


Usually, entries in the `segments` column have the format `<category/axis>=<value>`, and in the above example we mainly see two axes: `BusinessSegments` and `ProductOrService`. The first one gives a more detailed view of the revenues made in the different regions. We would also expect its values to sum to the total shown in the entry without `segments` information, 9.0753e+10. And indeed they do: 1.6372e+10 + 0.6723e+10 + 2.4123e+10 + 0.6262e+10 + 3.7273e+10 = 9.0753e+10.

In [7]:
sum(example_segments[example_segments.segments.str.startswith('BusinessSegments')].value)

90753000000.0
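
The check above can be wrapped in a small helper that compares the members of one axis against the entry without segments. This is only a sketch: `axis_matches_total` is a hypothetical helper name, and the toy frame simply repeats the Apple values from the table above.

```python
import pandas as pd

# Values copied from the Apple 10-Q example above (total plus the BusinessSegments rows).
rows = [
    ("", 90753000000.0),
    ("BusinessSegments=AmericasSegment;", 37273000000.0),
    ("BusinessSegments=EuropeSegment;", 24123000000.0),
    ("BusinessSegments=GreaterChinaSegment;", 16372000000.0),
    ("BusinessSegments=JapanSegment;", 6262000000.0),
    ("BusinessSegments=RestOfAsiaPacificSegment;", 6723000000.0),
]
toy_df = pd.DataFrame(rows, columns=["segments", "value"])


def axis_matches_total(df: pd.DataFrame, axis: str, rel_tol: float = 1e-6) -> bool:
    """Return True if the members of `axis` add up to the entry without segments."""
    total = df.loc[df.segments == "", "value"].sum()
    members = df.loc[df.segments.str.startswith(axis + "="), "value"].sum()
    return bool(abs(members - total) <= rel_tol * abs(total))


print(axis_matches_total(toy_df, "BusinessSegments"))
```

The helper assumes that the frame holds exactly one tag and one period, and that the axis has a single level of members. It would not give a meaningful answer for an axis that mixes subtotals and details.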

The second axis `ProductOrService` is a little trickier, since it has two levels. First, there is the split into Product (`ProductOrService=Product`) and Service (`ProductOrService=Service`). These two values also sum to the total of 9.0753e+10: 6.6886e+10 + 2.3867e+10. But we also have the revenues for individual products: `ProductOrService=IPad`, `ProductOrService=IPhone`, and so on. We expect the values of the individual products to sum to, or at least come close to, the total product value `ProductOrService=Product`: 6.6886e+10.

In [8]:
sum(example_segments[example_segments.segments.isin(['ProductOrService=IPad;','ProductOrService=Mac;','ProductOrService=WearablesHomeandAccessories;','ProductOrService=IPhone;'])].value)

66886000000.0

## Overview on Categories/Axes

Since the format of the `segments` column is `<category/axis>=<value>`, let's create a category column, so that we can investigate how many different categories we have and how often they appear. We simply split the string inside the segments column at the = sign and use the first part as `category`.

In [9]:
pre_num_df['category'] = pre_num_df.segments.str.split("=", n=1, expand=True)[0]
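
As a sketch of what the split does, the following toy example also keeps the part after the `=` sign. The `member` column name is an assumption made for illustration; the notebook itself only creates `category`.

```python
import pandas as pd

# Hypothetical toy stand-in for pre_num_df with one empty and two filled entries.
toy_df = pd.DataFrame({
    "segments": ["", "BusinessSegments=EuropeSegment;", "ProductOrService=IPad;"],
})

# Split once at the first "=": column 0 is the axis, column 1 the remainder.
parts = toy_df.segments.str.split("=", n=1, expand=True)
toy_df["category"] = parts[0]

# Rows without "=" have no second part; fill them with "" and drop the trailing ";".
toy_df["member"] = parts[1].fillna("").str.rstrip(";")

print(toy_df)
```

If an entry contained more than one `axis=value;` pair, this sketch would keep everything after the first `=` in `member`, so such entries would need separate handling.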

Let's see how many different categories we have:

In [10]:
print(pre_num_df.category.nunique(dropna=True))

6050


There are around 6000 "main" categories, or axes.

To find out which categories are the most important ones, let's display the top 10 for every financial statement (BS, IS, CF):

In [12]:
def get_value_counts(stmt: str) -> pd.Series:
  print("Results for: ", stmt)
  p_n_stmt_df = pre_num_df[(pre_num_df.stmt==stmt) & ~(pre_num_df.segments=='')]
  categories_stmt =  p_n_stmt_df.category.value_counts()
  print("different categories in", "stmt", len(categories_stmt))
  print("top ten\n")
  print(categories_stmt[:10])
  print("-------------------------------------\n\n")
  return categories_stmt

bs_categories = get_value_counts("BS")
is_categories = get_value_counts("IS")
cf_categories = get_value_counts("CF")

Results for:  BS
different categories in stmt 3457
top ten

                                       11365908
EquityComponents                        1334326
ClassOfStock                             800987
FairValueByFairValueHierarchyLevel       702379
InvestmentIdentifier                     496133
BusinessSegments                         481380
ConsolidatedEntities                     474227
ConsolidationItems                       354201
FinancingReceivablePortfolioSegment      250581
FinancialInstrument                      224834
Name: category, dtype: int64
-------------------------------------


Results for:  IS
different categories in stmt 2663
top ten

                                                            10345352
BusinessSegments                                             2315314
ConsolidationItems                                            957705
EquityComponents                                              738475
ProductOrService                                       

## An Example Deep Dive into Apple's 10-K

Let us have a look at Apple's 10-K reports. 

To do so, we load the data filtered by Apple's CIK 320193 and the form 10-K. Moreover, we are only interested in IS reports.

Of course, we could also use the CompanyCollector instead to get all 10-K reports for Apple.

In [2]:
from secfsdstools.d_container.databagmodel import JoinedDataBag

path_to_all = "C:/data/sec/automated/_4_single_bag/all"
apple_10k_joined_bag = JoinedDataBag.load(path_to_all, ciks_filter=[320193], forms_filter=['10-K'], stmt_filter=['IS'])
apple_10k_pre_num_df = apple_10k_joined_bag.pre_num_df
print(apple_10k_pre_num_df.shape)

2025-02-24 06:48:51,067 [INFO] databagmodel  apply sub_df filter: [('cik', 'in', [320193]), ('form', 'in', ['10-K'])]
2025-02-24 06:48:51,235 [INFO] databagmodel  apply pre_num_df filter: ["('adsh', 'in', ['0001193125-09-214859', '0001193125-10-238044', '0001193125-11-282113', '0001193125-...)", "('stmt', 'in', ['IS'])"]


(703, 17)


In [9]:
tags = apple_10k_pre_num_df.tag.unique()
revenue_tags = [t for t in tags if 'revenue' in t.lower()]
print(tags)
print(revenue_tags)

['IncomeLossFromContinuingOperationsBeforeIncomeTaxesMinorityInterestAndIncomeLossFromEquityMethodInvestments'
 'SellingGeneralAndAdministrativeExpense' 'EarningsPerShareDiluted'
 'CostOfGoodsAndServicesSold'
 'WeightedAverageNumberOfSharesOutstandingBasic' 'IncomeTaxExpenseBenefit'
 'SalesRevenueNet' 'NetIncomeLoss' 'GrossProfit'
 'WeightedAverageNumberOfDilutedSharesOutstanding'
 'ResearchAndDevelopmentExpense' 'NonoperatingIncomeExpense'
 'EarningsPerShareBasic' 'OperatingIncomeLoss' 'OperatingExpenses'
 'CommonStockDividendsPerShareDeclared'
 'IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest'
 'Revenues' 'RevenueFromContractWithCustomerExcludingAssessedTax']
['SalesRevenueNet', 'Revenues', 'RevenueFromContractWithCustomerExcludingAssessedTax']


In [22]:
apple_10k_pre_num_df[(apple_10k_pre_num_df.qtrs==4) & apple_10k_pre_num_df.tag.isin(revenue_tags) & apple_10k_pre_num_df.segments.isin(['', 'ProductOrService=Service;', 'ProductOrService=Product;'])]

Unnamed: 0,adsh,tag,version,ddate,qtrs,uom,segments,coreg,value,footnote,report,line,stmt,inpth,rfile,plabel,negating
6,0001193125-09-214859,SalesRevenueNet,us-gaap/2009,20090930,4,USD,,,36537000000.0,,3,3,IS,0,X,Net sales,0
35,0001193125-10-238044,SalesRevenueNet,us-gaap/2009,20100930,4,USD,,,65225000000.0,,2,3,IS,0,X,Net sales,0
69,0001193125-11-282113,SalesRevenueNet,us-gaap/2011,20110930,4,USD,,,108249000000.0,,2,3,IS,0,H,Net sales,0
114,0001193125-12-444068,SalesRevenueNet,us-gaap/2012,20120930,4,USD,,,156508000000.0,,2,3,IS,0,H,Net sales,0
157,0001193125-13-416534,SalesRevenueNet,us-gaap/2013,20130930,4,USD,,,170910000000.0,,2,3,IS,0,H,Net sales,0
221,0001193125-14-383437,SalesRevenueNet,us-gaap/2014,20140930,4,USD,,,182795000000.0,,2,3,IS,0,H,Net sales,0
253,0001193125-15-356351,SalesRevenueNet,us-gaap/2015,20150930,4,USD,,,233715000000.0,,2,3,IS,0,H,Net sales,0
316,0001628280-16-020309,SalesRevenueNet,us-gaap/2015,20160930,4,USD,,,215639000000.0,,2,1,IS,0,H,Net sales,0
356,0000320193-17-000070,SalesRevenueNet,us-gaap/2017,20170930,4,USD,,,229234000000.0,,2,1,IS,0,H,Net sales,0
419,0000320193-18-000145,Revenues,us-gaap/2018,20180930,4,USD,,,265595000000.0,,2,1,IS,0,H,Net sales,0
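
The output above shows that the tag used for the total changes from `SalesRevenueNet` to `Revenues` in the 2018 report. One way to make such a switch visible at a glance is to pivot the values by year and tag. This is a minimal sketch on three rows copied from the table above; the `year` column and the `tag_by_year` name are assumptions made for illustration.

```python
import pandas as pd

# Three rows copied from the output above (ddate, tag, value).
rows = [
    (20160930, "SalesRevenueNet", 215639000000.0),
    (20170930, "SalesRevenueNet", 229234000000.0),
    (20180930, "Revenues", 265595000000.0),
]
toy_df = pd.DataFrame(rows, columns=["ddate", "tag", "value"])

# ddate is stored as an integer in the form YYYYMMDD; integer division yields the year.
toy_df["year"] = toy_df.ddate // 10000

# One row per year, one column per tag; NaN marks years in which a tag was not used.
tag_by_year = toy_df.pivot_table(index="year", columns="tag", values="value", aggfunc="first")

print(tag_by_year)
```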


In [None]:
Example of how this has changed over time, using Apple -> ProductOrService first directly in the Revenues and then only as segments -> filter for Apple and 10-K

In [17]:
pre_num_df[pre_num_df.category=="ProductOrService"].segments.value_counts()[:20]

ProductOrService=Product;                       30670
ProductOrService=Service;                       23430
ProductOrService=ProductAndServiceOther;         9342
ProductOrService=License;                        4801
ProductOrService=DepositAccount;                 4360
ProductOrService=FoodAndBeverage;                4052
ProductOrService=ServiceOther;                   3147
ProductOrService=Royalty;                        3009
ProductOrService=Other;                          2941
ProductOrService=SubscriptionAndCirculation;     2878
ProductOrService=OtherRevenue;                   2803
ProductOrService=ProfessionalServices;           2701
ProductOrService=TechnologyService;              2505
ProductOrService=ShippingAndHandling;            2453
ProductOrService=LicenseAndService;              2366
ProductOrService=Occupancy;                      2306
ProductOrService=FinancialServiceOther;          2290
ProductOrService=FinancialService;               1995
ProductOrService=Advertising

In [18]:
pre_num_df[pre_num_df.category=="EquityComponents"].segments.value_counts()[:20]

EquityComponents=CommonStock;                                                                                                                                                     1843463
EquityComponents=RetainedEarnings;                                                                                                                                                1789650
EquityComponents=AdditionalPaidInCapital;                                                                                                                                         1491660
EquityComponents=AccumulatedOtherComprehensiveIncome;                                                                                                                              999438
EquityComponents=NoncontrollingInterest;                                                                                                                                           568396
EquityComponents=TreasuryStock;                                       

In [19]:
pre_num_df[pre_num_df.segments=="EquityComponents=RetainedEarnings;"].tag.value_counts()[:20]

NetIncomeLoss                                                                                       537582
StockholdersEquity                                                                                  457010
StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest                              202800
ProfitLoss                                                                                          189271
DividendsCommonStockCash                                                                             43819
AdjustmentsToAdditionalPaidInCapitalSharebasedCompensationRequisiteServicePeriodRecognitionValue     20601
NetIncomeLossAvailableToCommonStockholdersBasic                                                      18245
DividendsCommonStock                                                                                 17630
StockIssuedDuringPeriodValueNewIssues                                                                13444
OtherComprehensiveIncomeLossNetOfTax 

Question: what can I use to improve standardization? Which combinations make sense? How does EquityComponents=RetainedEarnings make sense together with NetIncomeLoss?

In [21]:
pre_num_df[(pre_num_df.tag=="NetIncomeLoss") & (pre_num_df.segments=="EquityComponents=RetainedEarnings;")][:10]

Unnamed: 0,adsh,tag,version,ddate,qtrs,uom,segments,coreg,value,footnote,report,line,stmt,inpth,rfile,plabel,negating,category
2420067,0000083246-13-000015,NetIncomeLoss,us-gaap/2012,20130331,1,USD,EquityComponents=RetainedEarnings;,,183000000.0,,4,41,BS,0,H,Net income,0,EquityComponents
6664980,0001472595-16-000203,NetIncomeLoss,us-gaap/2016,20160331,1,USD,EquityComponents=RetainedEarnings;,,167403000.0,,2,1,BS,0,H,Net income (loss),0,EquityComponents
13066115,0001493152-20-020829,NetIncomeLoss,us-gaap/2020,20200930,3,USD,EquityComponents=RetainedEarnings;,,-6920000.0,,2,46,BS,0,H,Net loss,0,EquityComponents
13066116,0001493152-20-020829,NetIncomeLoss,us-gaap/2020,20200930,1,USD,EquityComponents=RetainedEarnings;,,-2590000.0,,2,46,BS,0,H,Net loss,0,EquityComponents
15235759,0001520138-22-000169,NetIncomeLoss,us-gaap/2021,20211231,4,USD,EquityComponents=RetainedEarnings;,,2211309.0,,2,50,BS,0,H,Net Income ( Loss),0,EquityComponents
15336500,0001697884-22-000006,NetIncomeLoss,us-gaap/2022,20220131,4,USD,EquityComponents=RetainedEarnings;,,1264002.0,,2,38,BS,0,H,Net income (loss),1,EquityComponents
15911082,0001683168-22-008482,NetIncomeLoss,us-gaap/2022,20221031,1,USD,EquityComponents=RetainedEarnings;,,-517.0,,2,18,BS,0,H,NET LOSS,0,EquityComponents
17704972,0001903596-23-000855,NetIncomeLoss,us-gaap/2022,20230930,1,USD,EquityComponents=RetainedEarnings;,,-1814215.0,,2,46,BS,0,H,Net loss,0,EquityComponents
19657303,0000038074-09-000029,NetIncomeLoss,us-gaap/2008,20090331,4,USD,EquityComponents=RetainedEarnings;,,767743000.0,,5,2,CF,0,X,Net income,0,EquityComponents
19657565,0001193125-09-170002,NetIncomeLoss,us-gaap/2008,20090630,2,USD,EquityComponents=RetainedEarnings;,,482600000.0,,5,4,CF,0,X,Net income,0,EquityComponents
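
The ten rows above are all attached to the BS or CF statement, none to IS. To explore which combinations occur where, a cross-tabulation of `segments` against `stmt` for a single tag may help. This is only a sketch: the first three toy rows mirror the pattern above, while the last row (an IS entry without segments) is a made-up addition for contrast.

```python
import pandas as pd

# Toy rows for the NetIncomeLoss tag: (segments, stmt).
rows = [
    ("EquityComponents=RetainedEarnings;", "BS"),
    ("EquityComponents=RetainedEarnings;", "BS"),
    ("EquityComponents=RetainedEarnings;", "CF"),
    ("", "IS"),  # hypothetical row, not taken from the output above
]
toy_df = pd.DataFrame(rows, columns=["segments", "stmt"])

# Count how often each segments value appears in each statement type.
counts = pd.crosstab(toy_df.segments, toy_df.stmt)

print(counts)
```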
