## Setup 

In [None]:
base_url = "https://www2.census.gov/programs-surveys/bds/tables/time-series/"

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# https://www.census.gov/data/datasets/time-series/econ/bds/bds-datasets.html
dff_a = pd.read_csv(base_url+"bds2019_fa.csv")
dff_sa = pd.read_csv(base_url+"bds2019_sec_fa.csv")
# dfe_sa = pd.read_csv(base_url+"bds2019_sec_ea.csv")

In [None]:
from util.io import read_file
BDS_cats = read_file("https://github.com/Alalalalaki/Econ-Data-Notes/raw/master/ByData/BDS/util/BDS_cats.pkl")


In [None]:
BDS_cats["cat_sector"]
BDS_cats["cat_age"]

In [None]:
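def to_numeric_if_possible(col):
    # convert a column to numeric, leaving it unchanged if it cannot be parsed
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

def clean(df):
    # treat "(X)" cells as missing, then convert columns to numeric where possible
    df = df.replace("(X)", np.nan).apply(to_numeric_if_possible)
    df = df.query("year >= 1979") # following literature
    return df

dff_a = dff_a.pipe(clean)
dff_sa = dff_sa.pipe(clean)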


## Empirical Facts

### Fact 1

**Note**:
- The paper uses "cumulative net job creation since birth"; its appendix repeats the exercise with "emp" and finds similar results. The paper says cumulative NJC does not exactly match "emp" because "net job creation is cleaned from not true startups", yet in the data the two variables are exactly equal for age 0. I have not checked whether NJC cumulated over a cohort's historical records equals "emp" for the age>0 categories, but I expect the gap to be small, so we use "emp" directly here.
- The employment deviations differ noticeably from the paper's. This is not due to the sample period, nor to using job creation versus "emp" (as discussed above). The likely cause is a data revision: the data in their replication package differ from the 2019 data, probably because of the redesign in the 2018 version, see [here](https://twitter.com/ngoldschlag/status/1311360848741445638). I have not pinned down the exact source of the differences found here, but the redesign seems large enough to change the picture of some long-run trends, e.g. the establishment exit rate. Since the paper uses data from well before the 2018 version, there may be further differences.
- In the paper's Figure 1, the recession shading appears to be extended to start one year before each recession. The basic pattern still holds here.
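
The unchecked step in the first note (whether NJC cumulated along a cohort reproduces "emp" for age>0) could be tested roughly as follows. This is only a sketch on made-up numbers: the column name `net_job_creation`, the `AGE_OF` mapping and the helper `cumulative_njc` are assumptions for illustration, not taken from the paper or verified against the BDS files.

```python
import pandas as pd

# Toy panel with made-up numbers, mimicking the firm-age table layout.
# Assumed columns: year, fage, emp, net_job_creation.
toy = pd.DataFrame({
    "year": [2000, 2001, 2001, 2002, 2002, 2002],
    "fage": ["a) 0", "a) 0", "b) 1", "a) 0", "b) 1", "c) 2"],
    "emp": [100, 120, 90, 110, 105, 85],
    "net_job_creation": [100, 120, -10, 110, -15, -5],
})

# Hypothetical mapping from single-year fage labels to integer ages
AGE_OF = {"a) 0": 0, "b) 1": 1, "c) 2": 2, "d) 3": 3, "e) 4": 4, "f) 5": 5}

def cumulative_njc(df):
    """Cumulate net job creation along each birth cohort (birth year = year - age)."""
    out = df[df.fage.isin(list(AGE_OF))].copy()
    out["age"] = out.fage.map(AGE_OF)
    out["cohort"] = out.year - out.age
    out = out.sort_values(["cohort", "age"])
    out["cum_njc"] = out.groupby("cohort")["net_job_creation"].cumsum()
    return out

res = cumulative_njc(toy)
# Largest absolute gap between cumulated NJC and reported employment
gap = (res.cum_njc - res.emp).abs().max()
print(gap)
```

On real data the comparison is only meaningful for cohorts whose age-0 observation falls inside the sample, since older cohorts have an incomplete NJC history.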

In [None]:
# use the NBER recession indicator
bc = pd.read_html("https://www.nber.org/research/data/us-business-cycle-expansions-and-contractions")[0]

def plot_recession(ax):
    from matplotlib.patches import Rectangle
    for _, i in bc.iloc[-6:,:].droplevel(0,axis=1).iterrows():
        ax.add_patch(Rectangle((int(i["Peak Year"]), ax.get_ylim()[0]),
                               int(i["Trough Year"])-int(i["Peak Year"])+1,
                                ax.get_ylim()[1]-ax.get_ylim()[0], 
                               color="grey",alpha=0.5))
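
To mimic the paper's shading, which seems to start one year before each recession, the span arithmetic used in `plot_recession` can be separated into a small helper with a lead parameter. This is a sketch: `recession_spans` and `lead` are hypothetical names, and the peak and trough years are typed in by hand for the last three NBER recessions before 2019 rather than scraped.

```python
def recession_spans(peak_years, trough_years, lead=0):
    """Return (start_year, width_in_years) per recession; `lead` moves the start earlier."""
    spans = []
    for peak, trough in zip(peak_years, trough_years):
        start = int(peak) - lead
        width = int(trough) - start + 1  # +1 so the trough year itself is covered
        spans.append((start, width))
    return spans

# NBER peaks (1990, 2001, 2007) and troughs (1991, 2001, 2009)
print(recession_spans([1990, 2001, 2007], [1991, 2001, 2009]))
print(recession_spans([1990, 2001, 2007], [1991, 2001, 2009], lead=1))
```

Each `(start, width)` pair can be passed as the x-position and width of a `Rectangle`, with `lead=0` reproducing the shading used in this notebook.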

In [None]:
fig, ax = plt.subplots()
ages = ["a) 0", "f) 5"]
for a in ages:
    (dff_a.query("fage == @a")
        # .assign(emp_deviation = lambda x: x.emp.pct_change() ) # alternatively simply see growth rate, ok not very informative
        .assign(emp_deviation = lambda x: x.emp / x.emp.mean() -1 )
        # .assign(emp_deviation = lambda x: np.log(x.emp) - np.log(x.emp.mean())) # alternatively use log, same result 
        .assign(year = lambda x: x.year - int(a[-1:]))
        .plot("year", "emp_deviation", 
              label=f"fage=={a} (shift back {int(a[-1:])}ys)", 
              ax=ax)
    )
(dff_a
    .groupby("year")["emp"].sum().pct_change()
    .plot(label="Aggregate Employment Growth Rate", ax=ax)
)
plt.legend()
plot_recession(ax)

ax.set(title="Employment Deviations From Period Mean (EMP)");

**@Directly looking at the entrant average size** (the paper studies this in Appendix A.7; this part should be moved)

In [None]:
age0 = "a) 0"
(dff_a.query("fage == @age0")
    .assign(avgsize = lambda x: x.emp / x.firms)
    .plot("year", "avgsize", label=f"fage=={age0}", title="Entrant Avgsize")
)
plot_recession(plt.gca());

In [None]:
fig, ax = plt.subplots(figsize=(6, 12))
# one line per sector: average size of age-0 firms
for sector, g in dff_sa.query("fage == @age0").groupby("sector"):
    (g.assign(avgsize = lambda x: x.emp / x.firms)
      .plot("year", "avgsize", label=f"sector=={sector}", ax=ax)
    )

### Fact 2

**Note**:
- The paper here uses "log deviations from an HP trend taken across cohorts of the same age". We instead use the simple deviations from the period mean defined above. The resulting correlations are weaker than those the paper shows. This is partly due to the HP filtering, as the appendix of the paper shows, but the further decline might be related to the data revision noted above.
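
For reference, the paper's transformation could be approximated as below. This is a minimal sketch that implements the standard HP filter with NumPy only; the smoothing parameter `lamb=100.0` is a common choice for annual data and is an assumption here, not the paper's documented value, and the employment series is made up.

```python
import numpy as np

def hp_trend(y, lamb):
    """HP trend: minimizes sum((y - t)**2) + lamb * sum(second differences of t squared)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Second-difference operator of shape (n - 2, n), rows of the form [1, -2, 1]
    D = np.diff(np.eye(n), n=2, axis=0)
    return np.linalg.solve(np.eye(n) + lamb * D.T @ D, y)

def log_hp_deviation(emp, lamb=100.0):
    """Log deviation of a positive series from its HP trend."""
    log_emp = np.log(np.asarray(emp, dtype=float))
    return log_emp - hp_trend(log_emp, lamb)

# Made-up employment of one age group, ordered by cohort birth year
emp_age0 = [100, 104, 97, 110, 115, 108, 120, 126]
dev = log_hp_deviation(emp_age0)
print(dev.round(3))
```

Applying `log_hp_deviation` to each age group's series (ordered by cohort) in place of `x.emp / x.emp.mean() - 1` would give a version closer to the paper's definition.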

In [None]:
emp_dev_age0 = (dff_a.query("fage == @age0")
     .assign(emp_deviation = lambda x: x.emp / x.emp.mean() -1)
     .set_index("year")["emp_deviation"].rename("a0")
)
ages = ['b) 1','c) 2','d) 3','e) 4','f) 5',]
for a in ages:
    emp_dev_agea = (dff_a.query("fage == @a")
     .assign(emp_deviation = lambda x: x.emp / x.emp.mean() -1)
     .assign(year = lambda x: x.year - int(a[-1:]))
     .set_index("year")["emp_deviation"].rename("aa")
    )
    temp = pd.concat([emp_dev_age0,emp_dev_agea], join="inner",axis=1)
    # Pearson correlation between age-0 and age-a deviations of the same cohort
    print(f"age=={a}:", temp.a0.corr(temp.aa))

### Fact 3