## Comparability of CPS Classifications over time

Is there any guidance from the NLSY or anyone else on how to ensure comparability between the different CPS classifications in the NLSY dataset? Or do we simply have to assign occupations to white- and blue-collar categories ourselves? Is there a well-published paper that thoroughly documents how its authors handled this?

Currently, I am simply relying on footnote 18 of Keane & Wolpin (1997) for the classification: 

> Occupational categories are based on one-digit census codes. Blue-collar occupations
are (i) craftsmen, foremen, and kindred; (ii) operatives and kindred; (iii)
laborers, except farm; (iv) farm laborers and foremen; and (v) service workers.
White-collar occupations are (i) professional, technical, and kindred; (ii) managers,
officials, and proprietors; (iii) sales workers; (iv) farmers and farm managers; and
(v) clerical and kindred.

This works well until the CPS 70 codes are no longer provided in the NLSY and are substituted with the 2000 codes.
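
The grouping in the footnote can be written down as a small lookup table. This is a minimal sketch: the group labels are copied from the footnote, while the names `COLLAR_BY_GROUP` and `classify_collar` are hypothetical and only serve as illustration. Assigning actual census codes to the one-digit groups is not covered here.

```python
# Major occupation groups as listed in footnote 18 of Keane & Wolpin (1997).
BLUE_COLLAR = [
    "craftsmen, foremen, and kindred",
    "operatives and kindred",
    "laborers, except farm",
    "farm laborers and foremen",
    "service workers",
]
WHITE_COLLAR = [
    "professional, technical, and kindred",
    "managers, officials, and proprietors",
    "sales workers",
    "farmers and farm managers",
    "clerical and kindred",
]

# Hypothetical helper table: group label -> collar type.
COLLAR_BY_GROUP = {group: "blue" for group in BLUE_COLLAR}
COLLAR_BY_GROUP.update({group: "white" for group in WHITE_COLLAR})


def classify_collar(group):
    """Return 'blue' or 'white' for a group label, or None if it is unknown."""
    return COLLAR_BY_GROUP.get(group.strip().lower())
```

Labels that are not part of the footnote return `None`, so unclassified occupations stay visible instead of being assigned silently.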

### Solution 1
The Integrated Public Use Microdata Series (IPUMS-USA) has put a lot of effort into creating crosswalks between the decennial occupation classification systems of the CPS. To bridge the two major classification changes in 1980 and 2000, they created a variable (OCC1990) that researchers can use to analyse longer periods of occupational data.

**Literature**:

- Here is an [essay](https://usa.ipums.org/usa/chapter4/chapter4.shtml) on the subject
- [BLS working paper](../documents/literature/proposed_category_system_1960_2000_census_occupations.pdf) on a consistent category system from 1960 to 2000

A crosswalk over all decennial systems from 1950 to 2000 can be found on this [page](https://usa.ipums.org/usa/volii/occ_ind.shtml) and downloaded [here](https://usa.ipums.org/usa/resources/volii/documents/occ1990_xwalk.xls).

**From here, there are two possible ways to construct consistent occupation codes:**

1. The crosswalk can be used to map as many categories as possible from the 2000 system to the 1970 system. Then, footnote 18 of Keane & Wolpin (1997) can be applied to the codes.

    The benefit would be that the definition of blue- and white-collar workers stays identical and does not have to be adjusted.

    The disadvantage is that 143 categories of the 2000_1 system and 124 categories of the 2000_5 system will not be mapped. (For completeness, 116 and 130 categories of the 1970 system remain unmapped to the 2000_1 and 2000_5 systems, respectively.)
2. One could use the IPUMS OCC1990 variable to map all categories between all systems, but then the definition of blue- and white-collar workers has to be adjusted.

    Some numbers:
    - OCC1990 maps every category from the 1970, 2000 1%, and 2000 5% systems
    - it comprises 396 categories, so it is not as inflated as the 2000 systems with ca. 500 items and is closer to the 1970 system with 440 items
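
Both routes boil down to a lookup in the crosswalk. The sketch below uses a tiny, made-up crosswalk: the codes are placeholders, not real entries of `occ1990_xwalk.xls`, and the real file may need extra cleaning before its codes are unique. The column names follow the ones used later in this notebook, while `crosswalk_toy`, `nlsy`, and the column `OCC_2000` are hypothetical.

```python
import pandas as pd

# Toy crosswalk with placeholder codes (NOT real crosswalk entries).
crosswalk_toy = pd.DataFrame(
    {
        "OCC1990": [10, 20, 30],
        "CPS_1970": [101.0, None, 303.0],  # second category has no 1970 code
        "CPS_2000_1": [1, 2, 3],
    }
)

# Hypothetical NLSY extract that only carries 2000 codes.
nlsy = pd.DataFrame({"OCC_2000": [1, 2, 3, 2]})

# Index the crosswalk by the 2000 1% code so it can serve as a lookup table.
lookup = crosswalk_toy.dropna(subset=["CPS_2000_1"]).set_index("CPS_2000_1")

# Route 1: 2000 -> 1970. Codes without a 1970 counterpart become NaN.
nlsy["OCC_1970"] = nlsy["OCC_2000"].map(lookup["CPS_1970"])

# Route 2: 2000 -> OCC1990. Every code in the toy crosswalk has a match.
nlsy["OCC1990"] = nlsy["OCC_2000"].map(lookup["OCC1990"])

# Number of observations lost under route 1.
n_unmapped = int(nlsy["OCC_1970"].isna().sum())
```

Counting the `NaN` values per route makes the trade-off concrete: route 1 keeps the original blue/white-collar definition but loses observations, whereas route 2 keeps all observations but needs a new definition on top of OCC1990.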
    
**Then there is also the question of which 2000 category system from the crosswalk should be used.** The IPUMS website explains the difference ([source](https://usa.ipums.org/usa/volii/occ2000.shtml)):

> The 2000 5% sample contains less detail than the 2000 1% sample. In the 5% sample, any category representing fewer than 10,000 people was combined with a larger, more generalized category.

Since the NLSY page does not offer more information on the categories, I compared the categories of both systems and checked whether they are detailed in the [NLSY documentation on the 2000 system](https://www.nlsinfo.org/sites/nlsinfo.org/files/attachments/121217/att300.pdf). The NLSY 2000 codes correspond to the 2000 1% codes.
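
One way to run such a comparison is a set check: codes that exist only in the 1% scheme should appear in the NLSY documentation if the NLSY follows that scheme. The sketch below illustrates the idea with invented placeholder codes. All three sets are assumptions. Real values would come from the crosswalk columns and from the NLSY attachment.

```python
# Placeholder code sets (invented for illustration only).
codes_2000_1 = {1, 2, 3, 4}  # codes of the 2000 1% scheme
codes_2000_5 = {1, 2, 4}     # codes of the 2000 5% scheme
nlsy_codes = {1, 2, 3, 4}    # codes listed in the NLSY documentation

# Codes that were merged into broader categories in the 5% sample.
only_in_1pct = codes_2000_1 - codes_2000_5

# Which of those detailed codes also appear in the NLSY documentation?
found_in_nlsy = only_in_1pct & nlsy_codes

# If every 1%-only code is documented, the NLSY plausibly uses the 1% scheme.
uses_1pct_scheme = only_in_1pct <= nlsy_codes
```

With real data the check is unlikely to be this clean, so the share `len(found_in_nlsy) / len(only_in_1pct)` may be more informative than the boolean.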

In [None]:
import pandas as pd

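# Load the IPUMS crosswalk between the decennial occupation classification systems
crosswalk = pd.read_excel('../data/external/occ_crosswalks/occ1990_xwalk.xls')
# Keep OCC1990, its description, and the 1970, 2000 1%, and 2000 5% code columns
crosswalk = crosswalk.iloc[:, [0, 1, 4, 7, 8]]
crosswalk.columns = ['OCC1990', 'DESCRIPTION', 'CPS_1970', 'CPS_2000_1', 'CPS_2000_5']
# Drop the section-heading rows, which are marked by '#' in the OCC1990 column
crosswalk = crosswalk.loc[crosswalk.OCC1990 != '#']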

In [None]:
# Look for missing connections between the systems
# crosswalk.loc[crosswalk.CPS_2000_1.notnull()].isnull().sum()

In [None]:
# View 2000 1% codes that have no counterpart in the 2000 5% scheme
# crosswalk[crosswalk.CPS_2000_1.notnull() & crosswalk.CPS_2000_5.isnull()]