In [1]:
import pandas as pd
# ensure that all columns are shown and that column content is not cut
pd.set_option('display.max_rows', 500) # ensure that all rows are shown
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width',1000)

# Analyze Segments Information

In this notebook, we analyze what information is available in the `segments` column.

To do that, we use the joined databag containing all the filtered and joined data (as created in the automation example in the 08_00_automation_basics notebook). This uses quite a lot of memory and takes a minute or so to load.

As an alternative, you could also use only the data of 2024:
<pre>
# As an alternative, using the data of a single year
from secfsdstools.d_container.databagmodel import JoinedDataBag
from secfsdstools.e_collector.zipcollecting import ZipCollector
from secfsdstools.u_usecases.bulk_loading import default_postloadfilter

collector = ZipCollector.get_zip_by_names(names=["2024q1.zip", "2024q2.zip", "2024q3.zip", "2024q4.zip"], 
                                          forms_filter=["10-K", "10-Q"],                                        
                                          post_load_filter=default_postloadfilter)

all_joined_bag: JoinedDataBag = collector.collect().join()
pre_num_df = all_joined_bag.pre_num_df
</pre>

In [7]:
from secfsdstools.d_container.databagmodel import JoinedDataBag

path_to_all = "C:/data/sec/automated/_4_single_bag/all"
all_joined_bag = JoinedDataBag.load(path_to_all)
pre_num_df = all_joined_bag.pre_num_df

## Basic information

In [8]:
print(len(pre_num_df))

62187005


The whole dataset (as of February 2025) has over **62 million** rows in the joined pre_num_df dataframe. Now, let's see how many rows have information inside the `segments` column:

In [9]:
print(sum(~(pre_num_df.segments=='')))

26381721


Around **42%** of the datapoints (26,381,721 of 62,187,005) have segments information.
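As a small aside, the share of rows with segments information can be computed directly from a boolean mask instead of by hand. The following is only a sketch on a toy dataframe; `toy_df` is a hypothetical stand-in for `pre_num_df`, and the same two lines should work on the real dataframe.

```python
import pandas as pd

# Hypothetical toy stand-in for pre_num_df; only the `segments` column matters here.
toy_df = pd.DataFrame({
    "segments": ["", "BusinessSegments=EuropeSegment;", "", "ProductOrService=Service;", ""],
})

has_segments = toy_df.segments != ""      # vectorized boolean mask
n_with_segments = int(has_segments.sum()) # number of rows with segments information
share = has_segments.mean()               # mean of a boolean mask = fraction of True values

print(n_with_segments, f"{share:.1%}")
```

Using the `.sum()` and `.mean()` methods of the Series keeps the computation vectorized, which matters on a dataframe with tens of millions of rows.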

Now let us see how many different values we have in the `segments` column:

In [10]:
print(pre_num_df.segments.nunique(dropna=True))

844868


There are many different values in the `segments` column, so it will be interesting to see whether certain values are more frequent, and therefore more important, than others.

## Category/Axis

### Basics

Usually, entries with segments information "belong" to an entry with the same `tag` whose `segments` column is empty.

As an example, let us look at the Apple 10-Q report for the second quarter of 2024, whose adsh is "0000320193-24-000069". We also filter for the revenue tag `RevenueFromContractWithCustomerExcludingAssessedTax` and for the values of the second quarter only (qtrs==1), not for the combined values of quarters 1 and 2 (qtrs==2).

In [11]:
example_segments = pre_num_df[(pre_num_df.adsh=="0000320193-24-000069") & (pre_num_df.tag=="RevenueFromContractWithCustomerExcludingAssessedTax") & (pre_num_df.qtrs==1)]
example_segments

Unnamed: 0,adsh,tag,version,ddate,qtrs,uom,segments,coreg,value,footnote,report,line,stmt,inpth,rfile,plabel,negating
61209002,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,BusinessSegments=GreaterChinaSegment;,,16372000000.0,,2,7,IS,0,H,Net sales,0
61209005,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,ProductOrService=IPad;,,5559000000.0,,2,7,IS,0,H,Net sales,0
61209006,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,,,90753000000.0,,2,7,IS,0,H,Net sales,0
61209011,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,BusinessSegments=RestOfAsiaPacificSegment;,,6723000000.0,,2,7,IS,0,H,Net sales,0
61209012,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,BusinessSegments=EuropeSegment;,,24123000000.0,,2,7,IS,0,H,Net sales,0
61209015,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,BusinessSegments=JapanSegment;,,6262000000.0,,2,7,IS,0,H,Net sales,0
61209016,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,ProductOrService=Service;,,23867000000.0,,2,7,IS,0,H,Net sales,0
61209018,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,ProductOrService=Mac;,,7451000000.0,,2,7,IS,0,H,Net sales,0
61209019,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,ProductOrService=WearablesHomeandAccessories;,,7913000000.0,,2,7,IS,0,H,Net sales,0
61209020,0000320193-24-000069,RevenueFromContractWithCustomerExcludingAssessedTax,us-gaap/2023,20240331,1,USD,BusinessSegments=AmericasSegment;,,37273000000.0,,2,7,IS,0,H,Net sales,0


Usually, entries in the segments column have the format `<category/axis>=<value>`. In the above example, we mainly see two axes: `BusinessSegments` and `ProductOrService`. The first one gives a more detailed view of the revenues made in different regions. We would also expect the values to sum up to the total shown in the entry without `segments` information: 9.0753e+10. And indeed, they do: 1.6372e+10, 0.6723e+10, 2.4123e+10, 0.6262e+10, and 3.7273e+10 sum up to 9.0753e+10.

In [14]:
sum(example_segments[example_segments.segments.str.startswith('BusinessSegments')].value)

90753000000.0
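The `<category/axis>=<value>` format described above can be taken apart with a small helper. This is a minimal sketch: `parse_segments` is a hypothetical function, not part of secfsdstools. It assumes that each pair is terminated by `;` and that several pairs could be chained in one string, which the trailing semicolon suggests but which is not confirmed here.

```python
from typing import List, Tuple


def parse_segments(segments: str) -> List[Tuple[str, str]]:
    """Split a segments string into (axis, member) pairs.

    Assumption: pairs are terminated by ';' and axis and member are
    separated by the first '='. Entries without '=' are kept with an
    empty member so that nothing is silently dropped.
    """
    pairs = []
    for part in segments.split(";"):
        part = part.strip()
        if not part:
            continue  # skip the empty piece after the trailing ';'
        axis, _, member = part.partition("=")
        pairs.append((axis, member))
    return pairs


print(parse_segments("BusinessSegments=GreaterChinaSegment;"))
print(parse_segments(""))
```

An empty `segments` value yields an empty list, so the "main" entries without segments information are easy to tell apart from the detailed ones.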

The second axis, `ProductOrService`, is a little trickier, since it contains two levels. First, there is the separation into Product (`ProductOrService=Product`) and Service (`ProductOrService=Service`). These two values also sum up to the total of 9.0753e+10: 6.6886e+10 + 2.3867e+10. But we also have the revenues for individual products: `ProductOrService=IPad`, `ProductOrService=IPhone`, ... . We expect the values of the individual products to sum up to, or at least come close to, the total product value `ProductOrService=Product`: 6.6886e+10.

In [15]:
sum(example_segments[example_segments.segments.isin(['ProductOrService=IPad;','ProductOrService=Mac;','ProductOrService=WearablesHomeandAccessories;','ProductOrService=IPhone;'])].value)

66886000000.0

**Conclusion**: we cannot simply expect that all values of a certain axis add up directly to the total value, since an axis can contain more than one level of detail.
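To illustrate the point, here is a naive per-axis sum over the Apple example, written as a sketch in plain Python. The values are in million USD and come from the example above. The IPhone value is not shown in the table and is derived here as the product total minus the three other products, which is an assumption based on the sum printed above.

```python
from collections import defaultdict

# Values in million USD from the Apple 10-Q example.
# Assumption: IPhone = 66886 - 5559 - 7451 - 7913 (derived, not read from the table).
total = 90753
rows = [
    ("BusinessSegments", "AmericasSegment", 37273),
    ("BusinessSegments", "EuropeSegment", 24123),
    ("BusinessSegments", "GreaterChinaSegment", 16372),
    ("BusinessSegments", "JapanSegment", 6262),
    ("BusinessSegments", "RestOfAsiaPacificSegment", 6723),
    ("ProductOrService", "Product", 66886),
    ("ProductOrService", "Service", 23867),
    ("ProductOrService", "IPhone", 66886 - 5559 - 7451 - 7913),
    ("ProductOrService", "IPad", 5559),
    ("ProductOrService", "Mac", 7451),
    ("ProductOrService", "WearablesHomeandAccessories", 7913),
]

# Naively add up every member of an axis, ignoring hierarchy levels.
sums = defaultdict(int)
for axis, member, value in rows:
    sums[axis] += value

for axis, axis_sum in sums.items():
    verdict = "matches total" if axis_sum == total else "does NOT match total"
    print(axis, axis_sum, verdict)
```

The `BusinessSegments` members add up to the total, while the naive `ProductOrService` sum counts the individual products twice: once directly and once inside `Product`.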

## Overview on Categories/Axes

Since the format of the `segments` column is `<category/axis>=<value>`, let's create a category column, so that we can investigate how many different categories we have and how often they appear. We simply split the string inside the segments column at the = sign and use the first part as `category`.

In [16]:
pre_num_df['category'] = pre_num_df.segments.str.split("=", n=1, expand=True)[0]
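The effect of the split is easier to see on a few rows. The sketch below uses a hypothetical `toy_df` and, as a variation on the cell above, takes the first element with `.str[0]` instead of `expand=True`. The additional `member` column is an illustrative assumption and is not used in the rest of the notebook.

```python
import pandas as pd

# Hypothetical toy rows; the last one has no segments information.
toy_df = pd.DataFrame({
    "segments": ["BusinessSegments=EuropeSegment;", "ProductOrService=IPad;", ""],
})

parts = toy_df.segments.str.split("=", n=1)   # list of 1 or 2 pieces per row
toy_df["category"] = parts.str[0]             # text before the first '='
toy_df["member"] = parts.str[1].str.rstrip(";")  # text after '=', trailing ';' removed

print(toy_df)
```

Note that a row with an empty `segments` value ends up with an empty string as `category` and a missing `member`, so such rows need to be filtered out before counting categories.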

Let's see how many different categories we have:

In [18]:
print(pre_num_df.category.nunique(dropna=True))

6050


There are around 6000 "main" categories (or axes).

To find out which categories are the most important ones, let's display the top 10 for every financial statement (BS, IS, CF):

In [12]:
def get_value_counts(stmt: str) -> pd.Series:
  print("Results for: ", stmt)
  p_n_stmt_df = pre_num_df[(pre_num_df.stmt==stmt) & ~(pre_num_df.segments=='')]
  categories_stmt =  p_n_stmt_df.category.value_counts()
  print("different categories in", "stmt", len(categories_stmt))
  print("top ten\n")
  print(categories_stmt[:10])
  print("-------------------------------------\n\n")
  return categories_stmt

bs_categories = get_value_counts("BS")
is_categories = get_value_counts("IS")
cf_categories = get_value_counts("CF")

Results for:  BS
different categories in stmt 3457
top ten

                                       11365908
EquityComponents                        1334326
ClassOfStock                             800987
FairValueByFairValueHierarchyLevel       702379
InvestmentIdentifier                     496133
BusinessSegments                         481380
ConsolidatedEntities                     474227
ConsolidationItems                       354201
FinancingReceivablePortfolioSegment      250581
FinancialInstrument                      224834
Name: category, dtype: int64
-------------------------------------


Results for:  IS
different categories in stmt 2663
top ten

                                                            10345352
BusinessSegments                                             2315314
ConsolidationItems                                            957705
EquityComponents                                              738475
ProductOrService                                       

**Conclusion**: EquityComponents, BusinessSegments, ClassOfStock, LegalEntity, ConsolidationItems, and ConsolidatedEntities are among the top 10 of all statements.
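Which categories appear in the top N of every statement can also be determined programmatically rather than by reading the lists. The following is a sketch on invented toy data: `top_categories` is a hypothetical helper, and the counts bear no relation to the real dataset. On the real data, one would presumably pass the filtered `pre_num_df` and `n=10`.

```python
import pandas as pd

# Invented toy data, only to demonstrate the mechanics.
toy_df = pd.DataFrame({
    "stmt": ["BS", "BS", "BS", "IS", "IS", "IS", "CF", "CF", "CF"],
    "category": [
        "EquityComponents", "EquityComponents", "ClassOfStock",
        "BusinessSegments", "EquityComponents", "EquityComponents",
        "EquityComponents", "LegalEntity", "EquityComponents",
    ],
})


def top_categories(df: pd.DataFrame, stmt: str, n: int) -> set:
    """Return the n most frequent categories of one statement as a set."""
    counts = df[df.stmt == stmt].category.value_counts()  # sorted by frequency, descending
    return set(counts.index[:n])


top_n = 1  # use 10 on the real data
common = set.intersection(*(top_categories(toy_df, s, top_n) for s in ["BS", "IS", "CF"]))
print(common)
```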

## An Example Deep Dive into Apple's 10-K

Let us have a look at Apple's 10-K reports. 

To do so, we load the data by Apple's cik 320193 and the form 10-K. Moreover, we are only interested in IS reports. We use predicate pushdown on the "big single bag", but of course we could also use the CompanyCollector to get all 10-K reports for Apple, or apply the filters to the already loaded dataset.

In [3]:
from secfsdstools.d_container.databagmodel import JoinedDataBag

path_to_all = "C:/data/sec/automated/_4_single_bag/all"
apple_10k_joined_bag = JoinedDataBag.load(path_to_all, ciks_filter=[320193], forms_filter=['10-K'], stmt_filter=['IS'])
apple_10k_pre_num_df = apple_10k_joined_bag.pre_num_df
print(apple_10k_pre_num_df.shape)

2025-02-26 06:17:13,669 [INFO] databagmodel  apply sub_df filter: [('cik', 'in', [320193]), ('form', 'in', ['10-K'])]
2025-02-26 06:17:13,856 [INFO] databagmodel  apply pre_num_df filter: ["('adsh', 'in', ['0001193125-09-214859', '0001193125-10-238044', '0001193125-11-282113', '0001193125-...)", "('stmt', 'in', ['IS'])"]


(703, 17)


In [6]:
tags = apple_10k_pre_num_df.tag.unique()
revenue_tags = [t for t in tags if 'revenue' in t.lower()]
print("Tags used in the IS reports of Apple's 10-K:\n", tags, "\n")
print("Tags containing 'revenue':\n", revenue_tags)

Tags used in the IS reports of Apple's 10-K:
 ['IncomeLossFromContinuingOperationsBeforeIncomeTaxesMinorityInterestAndIncomeLossFromEquityMethodInvestments'
 'SellingGeneralAndAdministrativeExpense' 'EarningsPerShareDiluted'
 'CostOfGoodsAndServicesSold'
 'WeightedAverageNumberOfSharesOutstandingBasic' 'IncomeTaxExpenseBenefit'
 'SalesRevenueNet' 'NetIncomeLoss' 'GrossProfit'
 'WeightedAverageNumberOfDilutedSharesOutstanding'
 'ResearchAndDevelopmentExpense' 'NonoperatingIncomeExpense'
 'EarningsPerShareBasic' 'OperatingIncomeLoss' 'OperatingExpenses'
 'CommonStockDividendsPerShareDeclared'
 'IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest'
 'Revenues' 'RevenueFromContractWithCustomerExcludingAssessedTax'] 

Tags containing 'revenue':
 ['SalesRevenueNet', 'Revenues', 'RevenueFromContractWithCustomerExcludingAssessedTax']


Let us look only at Tags containing 'revenue' for Apple's 10-K reports. We only want the data for the whole year, so we also filter for qtrs==4. Furthermore, we want to see the "main" value (meaning segments is empty) and segments values for "ProductOrService=Service;" and "ProductOrService=Product;"

In [22]:
apple_10k_pre_num_df[(apple_10k_pre_num_df.qtrs==4) & apple_10k_pre_num_df.tag.isin(revenue_tags) & apple_10k_pre_num_df.segments.isin(['', 'ProductOrService=Service;', 'ProductOrService=Product;'])]

Unnamed: 0,adsh,tag,version,ddate,qtrs,uom,segments,coreg,value,footnote,report,line,stmt,inpth,rfile,plabel,negating
6,0001193125-09-214859,SalesRevenueNet,us-gaap/2009,20090930,4,USD,,,36537000000.0,,3,3,IS,0,X,Net sales,0
35,0001193125-10-238044,SalesRevenueNet,us-gaap/2009,20100930,4,USD,,,65225000000.0,,2,3,IS,0,X,Net sales,0
69,0001193125-11-282113,SalesRevenueNet,us-gaap/2011,20110930,4,USD,,,108249000000.0,,2,3,IS,0,H,Net sales,0
114,0001193125-12-444068,SalesRevenueNet,us-gaap/2012,20120930,4,USD,,,156508000000.0,,2,3,IS,0,H,Net sales,0
157,0001193125-13-416534,SalesRevenueNet,us-gaap/2013,20130930,4,USD,,,170910000000.0,,2,3,IS,0,H,Net sales,0
221,0001193125-14-383437,SalesRevenueNet,us-gaap/2014,20140930,4,USD,,,182795000000.0,,2,3,IS,0,H,Net sales,0
253,0001193125-15-356351,SalesRevenueNet,us-gaap/2015,20150930,4,USD,,,233715000000.0,,2,3,IS,0,H,Net sales,0
316,0001628280-16-020309,SalesRevenueNet,us-gaap/2015,20160930,4,USD,,,215639000000.0,,2,1,IS,0,H,Net sales,0
356,0000320193-17-000070,SalesRevenueNet,us-gaap/2017,20170930,4,USD,,,229234000000.0,,2,1,IS,0,H,Net sales,0
419,0000320193-18-000145,Revenues,us-gaap/2018,20180930,4,USD,,,265595000000.0,,2,1,IS,0,H,Net sales,0


We see a few interesting points. Over the years, Apple used different tags to report the overall revenue. First it was **SalesRevenueNet**, then just **Revenues** in 2018, and from 2019 on it was **RevenueFromContractWithCustomerExcludingAssessedTax**. 
Moreover, reporting individual values for services and products sold only started in 2018. Before that, Apple didn't report these more fine-grained numbers.

Note that there are also the tags **SalesRevenueGoodsNet** and **SalesRevenueServicesNet**. So it is very likely that we will find reports using those instead of the "ProductOrService=Service;" and "ProductOrService=Product;" segments.

**Conclusion**: The same value can be reported with different tags and, depending on the tag, there may also be different ways to report the same value using segments. Moreover, even the same company can use different approaches over the years. So being able to standardize the information is crucial.
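As a very rough illustration of what such a standardization step could look like, the sketch below maps the three revenue tags seen above to one common name. The column name `std_tag` and the target name are hypothetical choices, and the two rows are copied from the Apple output above. A real standardization would need many more rules, for example for values that are reported through segments.

```python
import pandas as pd

# The three revenue tags Apple used over the years (see the output above).
REVENUE_TAGS = [
    "SalesRevenueNet",
    "Revenues",
    "RevenueFromContractWithCustomerExcludingAssessedTax",
]

# Two rows copied from the Apple 10-K output above.
toy_df = pd.DataFrame({
    "adsh": ["0000320193-17-000070", "0000320193-18-000145"],
    "tag": ["SalesRevenueNet", "Revenues"],
    "ddate": [20170930, 20180930],
    "segments": ["", ""],
    "value": [229234000000.0, 265595000000.0],
})

# Keep only the "main" revenue entries (empty segments) and give them one common name.
is_total_revenue = toy_df.tag.isin(REVENUE_TAGS) & (toy_df.segments == "")
standardized = toy_df[is_total_revenue].assign(std_tag="Revenues")  # std_tag is a hypothetical column

print(standardized[["ddate", "tag", "std_tag", "value"]])
```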