In [None]:
import pandas as pd

from util.plot import plot_interactive_lines

## BDS (CSV Data)

### Download

https://www.census.gov/data/datasets/time-series/econ/bds/bds-datasets.html

In [None]:
base_url = "https://www2.census.gov/programs-surveys/bds/tables/time-series/"

In [None]:
year = 2019

In [None]:
dfe_s = pd.read_csv(base_url+f"bds{year}_sec.csv")

In [None]:
dfe_az = pd.read_csv(base_url+f"bds{year}_ea_ez.csv")

In [None]:
dfe_sa = pd.read_csv(base_url+f"bds{year}_sec_ea.csv")

### Categories

https://www.census.gov/content/dam/Census/programs-surveys/business-dynamics-statistics/codebook-glossary.pdf

In [None]:
# BDS 2019 publishes the glossary as a PDF, but the HTML glossary
# is still available through the web archive.
# Not thoroughly checked, but the category codes appear unchanged.
# cat_tables = pd.read_html("https://www.census.gov/programs-surveys/bds/documentation/glossary.html")
cat_tables = pd.read_html("https://web.archive.org/web/20210319075210/https://www.census.gov/programs-surveys/bds/documentation/glossary.html")

cat_e_age, cat_e_size, cat_f_age, cat_f_size, cat_nonmetro, cat_sector, cat_state = [
    i.dropna(axis=1, how="all") for i in cat_tables[2:]
]

cat_sector = cat_sector.set_index(0).iloc[1:,0].to_dict()

# eage and fage have same categories 
cat_age = cat_e_age.set_index(0).iloc[2:,0].to_dict()
cat_agecoarse = cat_e_age.set_index(3).iloc[2:2+5,3].to_dict()

cat_esize = cat_e_size.set_index(0).iloc[2:,0].to_dict()
cat_fsize = cat_f_size.set_index(0).iloc[2:,0].to_dict()
# sizecoarse is same for esize and fsize
cat_sizecoarse = cat_e_size.set_index(3).iloc[2:2+3,3].to_dict()

# the state table holds (code, name) pairs side by side in 4 column pairs;
# stack the pairs into one two-column table
cat_state_ = pd.concat(
    [cat_state.loc[1:,i:i+1].set_axis([0,1],axis=1) for i in range(0, 8, 2)],
    axis=0, ignore_index=True)
cat_state = cat_state_.set_index(0).iloc[:-4,0].to_dict()

# TO-DO: msa; ...

BDS_cats = {"cat_sector":cat_sector,
            "cat_age":cat_age, "cat_agecoarse":cat_agecoarse,
            "cat_esize":cat_esize, "cat_fsize":cat_fsize, "cat_sizecoarse":cat_sizecoarse,
            "cat_state":cat_state,
           }

# # save cats for other use
# from util.io import write_file
# write_file(BDS_cats, "util/BDS_cats.pkl")

### BDS Methodology

https://www.census.gov/programs-surveys/bds/documentation/methodology.html

#### Concepts and Methodology

The BDS data measure the **net change in employment at the establishment level**. These changes come about in one of four ways. 
- A net increase in employment can come from either opening establishments or expanding establishments. (sum of all jobs added as gross job gains)
- A net decrease in employment can come from either closing establishments or contracting establishments. (sum of all jobs lost as gross job losses)
- The net change in employment is the difference between gross job gains and gross job losses.

The **formal definitions of employment changes** are as follows:
- Job Creation (JC) – Job creation is the sum of all employment gains from expanding establishments from year t–1 to year t including establishment startups. Note that the contribution of firm births can be measured by using the job creation from establishments with firm age equal to zero.
- Job Destruction (JD) – Job destruction is the sum of all employment losses from contracting establishments from year t–1 to year t including establishments shutting down.

Some **simple identities** are useful to note to interpret and use these statistics. 
- Let $E_{i t}$ be employment in year $t$ for establishment $i$. Define **establishment-level employment growth** as follows:
$$
g_{i t}=\left(E_{i t}-E_{i t-1}\right) / X_{i t}
$$
where
$$
X_{i t}=0.5 *\left(E_{i t}+E_{i t-1}\right)
$$
- This growth rate measure has become *standard in analysis of establishment and firm dynamics because it shares some useful properties of log differences but also accommodates entry and exit* (see Davis, Haltiwanger and Schuh [1996], and Tornquist, Vartia, and Vartia [1985]). 
- The above definitions of **JC and JD** for establishments classified in **group $s$** (e.g., a firm size, firm age category) are given by:
$$
J C_{s t}=\sum_{i \in s, g_{i t} \geq 0}\left(E_{i t}-E_{i t-1}\right) \quad J D_{s t}=\sum_{i \in s, g_{i t}<0}\left|E_{i t}-E_{i t-1}\right|
$$
- The **net change in employment** for establishments in group $s$ satisfies the following identity:
$$
N E T_{s t}=J C_{s t}-J D_{s t}=\sum_{i \in s, g_{i t} \geq 0}\left(E_{i t}-E_{i t-1}\right)-\sum_{i \in s, g_{i t}<0}\left|E_{i t}-E_{i t-1}\right|
$$
- For **growth rates**, the analogous relationships are given by:
$$
\begin{gathered}
J C R_{s t}=\sum_{i \in s, g_{i t} \geq 0}\left(X_{i t} / X_{s t}\right) g_{i t}=J C_{s t} / X_{s t} \quad J D R_{s t}=\sum_{i \in s, g_{i t}<0}\left(X_{i t} / X_{s t}\right)\left|g_{i t}\right|=J D_{s t} / X_{s t} \\
N E T R_{s t}=\sum_{i \in s}\left(X_{i t} / X_{s t}\right) g_{i t}=N E T_{s t} / X_{s t}=\left(J C_{s t}-J D_{s t}\right) / X_{s t}
\end{gathered}
$$
where $X_{s t}=\sum_{i \in s} X_{i t}$
  - The variable $X_{s t}$ is the sum of establishment-level average employment over two consecutive years; as the formulas above show, dividing a level measure by it converts that measure to a rate. Note that, in general, $X_{s t}$ for a particular classification is not equal to the simple average of the current-year and prior-year employment totals, since establishments are assigned the characteristics of the firm that owns them in year $t$, and these may have changed from year $t-1$ to year $t$.
- The *employment measure used for the tabulations is the number of employees at the establishment for the payroll period including March 12. As such, all growth rates are based on March–to–March changes and the tabulations for a given year are the changes from the prior year to the current year.* 
  - An establishment opening or entrant is an establishment with positive employment in the current year and zero employment in the prior year. 
  - An establishment closing or exit is an establishment with zero employment in the current year and positive employment in the prior year. 
  - The vast majority of establishment openings are true greenfield entrants. Similarly, the vast majority of establishment closings are true establishment exits (i.e., operations ceased at this physical location). However, *there are a small number of establishments that temporarily shut down (i.e., have a year with zero employment) and these are excluded from the counts of establishment openings and closings*.
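
A minimal sketch of the identities above on made-up data. The four establishments and their employment counts are invented for illustration, and the column names (`emp_prev`, `emp_curr`, `x`, `g`) are not BDS variables. Job destruction is computed as a positive number so that net = JC - JD.

```python
import pandas as pd

# Hypothetical March employment for four establishments in one group s.
toy = pd.DataFrame(
    {
        "emp_prev": [10, 0, 20, 5],  # E_{i,t-1}
        "emp_curr": [15, 4, 10, 0],  # E_{i,t}
    },
    index=["expander", "opening", "contractor", "closing"],
)

# X_it: average of current and prior employment.
toy["x"] = 0.5 * (toy["emp_curr"] + toy["emp_prev"])
# g_it: equals +2 for openings and -2 for closings.
toy["g"] = (toy["emp_curr"] - toy["emp_prev"]) / toy["x"]

change = toy["emp_curr"] - toy["emp_prev"]
jc = change[toy["g"] >= 0].sum()       # gross job creation
jd = change[toy["g"] < 0].abs().sum()  # gross job destruction (positive)
net = jc - jd

# Convert levels to rates with the group denominator X_st.
x_s = toy["x"].sum()
jcr, jdr, netr = jc / x_s, jd / x_s, net / x_s

print(toy)
print(f"JC={jc}, JD={jd}, NET={net}, X_s={x_s}")
print(f"JCR={jcr:.4f}, JDR={jdr:.4f}, NETR={netr:.4f}")
```

The employment-weighted average of `g` equals `netr`, which is the last identity in the growth-rate block.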

In the released series, the job flow measures are provided both as level changes (e.g., the number of jobs) and as rates, using the denominator described above to convert level changes to rates. In addition, the number of establishments in each category of change (openings, closings, continuers) and each classification (e.g., firm size, firm age) is provided, which permits tracking the gross and net flows of the number of establishments. The split into openings, closings, and continuers permits decomposing gross job creation into the components from expanding continuing establishments and from establishment openings, and gross job destruction into the components from contracting continuing establishments and from establishment closings.

It is critical to emphasize that the BDS contains measures of net and gross flows of establishments and jobs at the establishment level. 
- All establishments are, however, linked to their parent firm so that the net and gross flows of establishments and jobs can be categorized by the characteristics of the parent firm. In particular, *establishments are classified by both the size of the parent firm and age of the parent firm* as defined below. This *enables quantifying the contribution of firms by firm size and firm age in terms of establishment and job net and gross flows*. 
- For example, and of particular interest, the contribution of firm startups to the net and gross flows of establishments and jobs can be ascertained by using the tabulations of firm age zero. As described in detail in the BDS documentation, establishments are assigned a firm age based upon the age of the parent firm. *The age of the parent firm is based on the age of the oldest establishment in the firm. A firm age of zero represents a firm where all establishments in the firm are entrants in that year, hence a new firm.* By construction, tabulations of firm age zero represent establishment entrants that are part of a new firm. *Most new firms are single-unit firms.*
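
A rough sketch of isolating the startup contribution by filtering on firm age zero. The numbers are made up, and the `fage` labels and `job_creation` column name are assumed to follow the same conventions as the `eage` column used later in this notebook; check them against the actual firm-age table before relying on this.

```python
import pandas as pd

# Invented job creation by firm age for one year (not real BDS values).
toy = pd.DataFrame(
    {
        "year": [2019, 2019, 2019],
        "fage": ["a) 0", "b) 1", "c) 2"],
        "job_creation": [300, 120, 80],
    }
)

startup_age = "a) 0"  # assumed label for firm age zero

# Total job creation per year across all firm ages.
total = toy.groupby("year")["job_creation"].sum()
# Job creation from establishments of new firms only.
startups = toy.query("fage == @startup_age").set_index("year")["job_creation"]

startup_share = startups / total
print(startup_share)
```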


### BDS FAQ

How do I **compute the establishment entry and exit rate** using variables in the BDS tables?
- The establishment entry rate and establishment exit rates can be computed using establishment counts provided in each BDS table. 
- *Establishment entry (exit) rates are defined as the count of establishment entrants (exits) in year t divided by the average count of employment active establishments in year t and year t-1.* 
- However, this computation cannot be done by simply taking estab_entry in year t divided by the average estab count in year t and year t-1. *Due to year-to-year scope changes in estabs, the estab count in year t is not longitudinally consistent with the estab count in year t-1. To ensure that the count of employment active establishments in year t is longitudinally consistent with the count of employment active estabs in year t-1, we need to apply the same scope of establishments for the pair of years.* For year t, the longitudinally consistent count of employment active establishments in year t-1 can be derived as: estabs_year t-1 = estabs year t + estabs_exit year t – estabs_entry year t
- We can then compute the **estabs_entry_rate** as: estabs_entry_rate year t = 100*(estabs_entry year t/(0.5*(estabs year t + estabs_year t-1)))
- **Establishment exit rate** can be computed in a similar fashion as: estabs_exit_rate year t = 100*(estabs_exit year t/(0.5*(estabs year t + estabs_year t-1)))
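
The FAQ recipe above can be sketched as follows. The counts are invented; `estabs`, `estabs_entry` and `estabs_exit` are assumed to match the BDS CSV column names, while `add_entry_exit_rates` and the `*_check` columns are hypothetical names introduced here.

```python
import pandas as pd

# Invented counts using BDS-style column names.
toy = pd.DataFrame(
    {
        "year": [2018, 2019],
        "estabs": [1000, 1040],
        "estabs_entry": [100, 110],
        "estabs_exit": [90, 80],
    }
)


def add_entry_exit_rates(df):
    """Return a copy with entry and exit rates (percent) recomputed per row."""
    # Longitudinally consistent prior-year count, derived within each row
    # rather than taken from the previous row's reported estabs.
    estabs_prev = df["estabs"] + df["estabs_exit"] - df["estabs_entry"]
    denom = 0.5 * (df["estabs"] + estabs_prev)
    return df.assign(
        estabs_prev_consistent=estabs_prev,
        entry_rate_check=100 * df["estabs_entry"] / denom,
        exit_rate_check=100 * df["estabs_exit"] / denom,
    )


rates = add_entry_exit_rates(toy)
print(rates)
```

In this toy data the derived prior-year count for 2019 is 1010, not the 1000 reported for 2018, which is the kind of scope discrepancy the FAQ describes. On real tables, `entry_rate_check` could be compared against the published `estabs_entry_rate`.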

How often is the BDS updated?
- The BDS is updated once a year. The timing of the BDS release is dependent upon the availability of the source files from the Business Register and the CBP program. The final BR files are typically available in October, one year after the reference year. The CBP program then completes its edits of the BR data files and publishes in April, about 6 months later. Finally, the BDS uses the BR and CBP files and completes its processing about 5 months later. So BDS data are typically released in September, 2 years after the reference year.



Adding net flows and current employment does not yield the following year's employment. Why?
- The discrepancies result from processing decisions required when working with longitudinal administrative data (or in this case linked cross sections).  A rolling window of establishment attributes is used to determine whether a case should be considered in scope for the BDS tabulations. Establishments can and do switch from being in scope to out of scope.  For example, an establishment may switch from an in-scope industry to an out of scope industry such as 52592 Trusts, Estates, and Agency Accounts.  In this case, the establishment would contribute to the employment stock in one year but not the next.  Moreover, as continuing establishments enter and exit the scope of the BDS, their employment is not registered as job creation and destruction.
- In addition to scope changes, establishments may also move between by-variable categories across years. For instance, a continuing establishment with no employment change may be classified as firm size “1 to 4” in one year and then “5 to 9” in the next. Since firm size is the average of firm size in the current and prior year, an establishment that had 4 employees in one year and 5 in the following two years would meet this criterion. When establishments switch between categories, the employment stocks in one cell will rise while the other declines without necessarily involving changes in employment flows (job creation or destruction).
- Despite these nuances, the longitudinally consistent measure of employment in t-1 for the t-1 to t flows can be retrieved by the variable Denom (which is the average of employment in t-1 and t defined in a longitudinally consistent fashion at the establishment level), along with reported period t employment. It can also be retrieved by using the reported Net Job Creation level for t-1 and t along with reported employment in period t. Overall, then, reported employment in year t is relevant for the flows between t-1 and t (and so reported employment in t-1 is relevant for the flows between t-2 and t-1).
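
A small sketch of the two ways to back out the longitudinally consistent t-1 employment described above. The numbers are invented so that both routes agree exactly; `emp`, `denom` and `net_job_creation` are assumed to match the BDS column names, and on real data noise infusion and rounding may leave small gaps between the two results.

```python
import pandas as pd

# Invented values using BDS-style column names.
toy = pd.DataFrame(
    {
        "year": [2018, 2019],
        "emp": [5000, 5150],
        "denom": [4950.0, 5050.0],
        "net_job_creation": [100, 200],
    }
)

# Route 1: denom = 0.5 * (emp_t + emp_{t-1})  =>  emp_{t-1} = 2 * denom - emp_t
toy["emp_prev_from_denom"] = 2 * toy["denom"] - toy["emp"]
# Route 2: net_job_creation = emp_t - emp_{t-1}
toy["emp_prev_from_net"] = toy["emp"] - toy["net_job_creation"]

print(toy)
```

Here the consistent 2018 employment implied by the 2019 row is 4950, while the 2018 row reports 5000; the gap stands in for establishments moving in or out of scope.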



What is the **difference between metro and metropolitan statistical area (MSA)**?
- *Metro is a designation applied at the county level, while an MSA is a grouping of one or more metro-designated counties.*  In addition, the statistics for metro areas pool all metro counties in the U.S.

How are startups defined? How can I identify them in the data?
- Startups are firms with an age of 0. No previous employment is associated with these firms, and all of their establishments are de novo establishments.

How can I identify **establishment entry** in the data?
- **Establishment entry (“estab_entry”)** is one of the variables tabulated in the BDS.  It is *defined as the count of establishments born within the cell during the last 12 months, where “born” is defined as going from zero March 12 employment in year t-1 to positive March 12 employment in year t.*

Why are there **entrants with establishment age greater than zero**?
- Age 0 establishments are de novo establishments in the economy based on the first year of positive employment in the week that includes March 12. They are establishments classified under Establishment Age = 0.  Establishments can be entrants with age greater than zero in year t if, for example, they have positive March 12 employment in year t-3, have zero March 12 employment in both year t-2 and year t-1, and then have positive March 12 employment again in year t.
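
The example in this answer can be traced on a single made-up establishment. The employment series is invented, and measuring age as years since the first year of positive employment is a simplification for illustration only.

```python
import pandas as pd

# Hypothetical March 12 employment for one establishment, indexed by year.
emp = pd.Series([3, 0, 0, 2], index=[2016, 2017, 2018, 2019])

# First year with positive employment (when the establishment was age 0).
first_positive_year = emp[emp > 0].index.min()

# Entrant in year t: positive employment in t and zero employment in t-1.
# The first year has no prior observation, so it is not flagged here.
prev = emp.shift(1)
is_entrant = (emp > 0) & (prev == 0)

age_2019 = 2019 - first_positive_year
print(is_entrant)
print("age in 2019:", age_2019)
```

The establishment is flagged as an entrant in 2019 even though its age, counted from 2016, is greater than zero.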

What is a **firm death**?
- Firm death *identifies events where all the establishments associated with a particular firm and the firm itself cease all operations.* Note that *firm legal entities that cease to exist because of merger and acquisition activity are not classified as firm exits in these data.*

What are **"Left Censored" Firms**?
- These are *firms born before 1976 and for which we do not know their true age.*

Can the number of left censored firms increase?
- It is not a given that the number of left censored firms and the number of establishments owned by such firms in a given year will be lower than those in the previous years. This is not hard to see once we understand that firms are legal units that can be rearranged, merged, or split into separate legal entities. Put simply, a left censored firm can be split into two left censored firms. Similarly, the acquisition or creation of new establishments by left censored firms can result in a higher number of establishments belonging to left censored firms. Merger and acquisition activity can in principle then increase the number of left censored firms and the number of establishments owned by such firms. Note that this is not true for statistics on the number of establishments using establishment age. That is, the number of left censored establishments using establishment age must decline over time, and this is true in the BDS data.

Why do I get different numbers for the same statistic in different tables?
- There are a number of possible reasons for discrepancies between statistics in different tables and specifically between tables at higher levels of aggregation (e.g. economy-wide) versus tables at lower levels of aggregation (e.g., state, industry, etc.).
- 1  Noise infusion:  Noise is added at the cell level for certain variables (job creation, job creation_births, job creation_continuers, job destruction, job destruction_deaths, job destruction_continuers, and net job creation), which can cause discrepancies in these statistics across tables.
- 2  Rounding:  Rounding can cause some small discrepancies across tables.
- 3  For the variables ‘firm’ and ‘firmdeath_firms’, these statistics are often smaller in the economy-wide table than as summed across tables at the lower geographic and industrial aggregations.  This is due to the fact that in the BDS, geography and industry are defined at the establishment level, and not the firm level.  So firms with multiple establishments operating in different geographies or industries will show up in the total firm data for each geography or industry in which they operate at least one establishment.  The same holds true for the variable ‘firmdeath_firms’.  This discrepancy does not apply to the other statistics in the BDS, however, because these statistics all represent tabulations of establishment-level data. // Related to this, it should be noted that in contrast to firm geography and industry—which as noted are assigned at the establishment level—firm size and age are assigned at the firm level and therefore represent the size or age of the entire firm, not just the part that operates in a given geography or industry.  A given firm, therefore, will generally only show up in the total firms data in one age or size category in a given table.  There are some exceptions to this rule for the firm size and initial size categories, as noted in the cases outlined in #4 below.
- 4  For the statistics ‘firm’ and ‘firmdeath_firms’, there can be discrepancies between the economy-wide table and the firm size and initial firm size tables.  This difference is due to the fact that establishments are assigned a firm size based on the average size of their associated firm in (t) and (t-1). Initial firm size is based upon the size of the establishment's associated firm in (t-1) or, in the case of entrants, their size in (t). Importantly, an establishment's associated firm may change between periods. This means that multi-unit firms, due to merger and acquisition activity, may have establishments with different combinations of (t) and (t-1) firm identifiers and thus may be assigned to different firm size and initial firm size categories. For example, suppose a large multi-unit firm acquires a small continuing single-unit firm. The acquired single-unit establishment's initial firm size in (t) will be assigned based upon the size of that single-unit in (t-1) whereas the multi-unit's other establishments will be assigned an initial firm size based on the (t-1) size of the multi-unit firm.
- 5  Cells in tables at lower levels of aggregation are sometimes suppressed with (D) or (S), so discrepancies can exist between these statistics at lower versus higher levels of aggregation.
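
Reason 3 above (firms counted once per sector in which they operate) can be illustrated with made-up establishment records. The `firm_id` column and the six rows are hypothetical; the BDS public tables do not expose firm identifiers.

```python
import pandas as pd

# Hypothetical establishment-level records: owning firm and establishment sector.
estabs = pd.DataFrame(
    {
        "firm_id": ["A", "A", "B", "C", "C", "C"],
        "sector": ["31-33", "44-45", "44-45", "31-33", "44-45", "72"],
    }
)

# Economy-wide: each firm is counted once.
firms_economy_wide = estabs["firm_id"].nunique()
# By sector: a firm is counted in every sector where it has an establishment.
firms_by_sector = estabs.groupby("sector")["firm_id"].nunique()
# Establishment counts, by contrast, add up across sectors.
estabs_by_sector = estabs.groupby("sector").size()

print("firms, economy-wide:", firms_economy_wide)
print("firms, summed over sectors:", firms_by_sector.sum())
print("establishments, summed over sectors:", estabs_by_sector.sum())
```

Summing firm counts over sectors gives 6 against 3 economy-wide, while the establishment total is 6 either way.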

### Release Note

#### 2019 BDS Release Note

https://www2.census.gov/programs-surveys/bds/updates/bds2019-release-note.pdf

### Explore

(experimental explore // should create clean functions here and make another notebook for exploration)

#### Entry/Exit Rate by Sector

In [None]:
dt = dfe_s.assign(sector_ = dfe_s.sector.map(cat_sector))

In [None]:
plot_interactive_lines(dt, "year", "estabs_entry_rate", "sector_")

In [None]:
plot_interactive_lines(dt, "year", "estabs_exit_rate", "sector_")

#### Average Size by Sector 

In [None]:
dt = (dfe_s
         .eval("avgsize_e = emp / estabs")
         .eval("avgsize_f = emp / firms")
         .assign(sector_ = dfe_s.sector.map(cat_sector))
        )

In [None]:
plot_interactive_lines(dt, "year", "avgsize_e", "sector_")

In [None]:
plot_interactive_lines(dt, "year", "avgsize_f", "sector_")

#### Entry Size By Sector

In [None]:
entry_age = "a) 0"
dt = (dfe_sa
         .eval("avgsize_e = emp / estabs")
         .eval("avgsize_f = emp / firms")
         .assign(sector_ = dfe_sa.sector.map(cat_sector))
         .query('eage == @entry_age')
        )

In [None]:
plot_interactive_lines(dt, "year", "avgsize_e", "sector_")

In [None]:
plot_interactive_lines(dt, "year", "avgsize_f", "sector_")