# The Career Decisions of Young Men

This notebook processes and explores the estimation sample used by Michael Keane and Kenneth Wolpin to study the career decisions of young men. 

> Keane, M. P. and Wolpin, K. I. (1997). [The career decisions of young men](http://www.journals.uchicago.edu/doi/10.1086/262080). *Journal of Political Economy*, 105(3), 473-522.

The sample is based on the [National Longitudinal Survey of Youth 1979 (NLSY79)](https://www.bls.gov/nls/nlsy79.htm), and the estimation sample itself is available for download [here](https://github.com/structDataset/career_decisions_data/blob/master/KW_97.raw).

## Preparations

We first perform some basic preparations to ease further processing.

In [21]:
import pandas as pd
import numpy as np

# We ensure a proper formatting of the wage variable.
pd.options.display.float_format = '{:,.2f}'.format

# We label and format the different columns.
columns = ['Identifier', 'Age', 'Schooling', 'Choice', 'Wage']
# The built-in int replaces np.int, which was removed in NumPy 1.24.
dtype = {'Identifier': int, 'Age': int, 'Schooling': int, 'Choice': 'category'}

# We read the original data file.
df = pd.DataFrame(np.genfromtxt('KW_97.raw'), columns=columns).astype(dtype)

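# We label the different choice categories. The raw codes are sorted in ascending
# order, so the labels are assigned in that order. We use rename_categories because
# assigning to .cat.categories is no longer supported in recent pandas versions.
df['Choice'] = df['Choice'].cat.rename_categories(['Schooling', 'Home', 'White', 'Blue', 'Military'])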

# We set the index for easier interpretability.
df['Period'] = df['Age'] - 16
df.set_index(['Identifier', 'Period'], inplace=True, drop=True)

## Basic Structure

First, we explore the basic structure of the dataset. All individuals enter the model at the same age (16, which corresponds to period 0) and are then observed for a varying number of consecutive years. Each year, we record the individual's decision to work in a white-collar or blue-collar occupation, attend school, enroll in the military, or remain at home. If the individual is working, the dataset may also contain that year's wage as a full-time equivalent.

In [24]:
df.head(15)

Identifier,Period,Age,Schooling,Choice,Wage
6,0,16,11,Schooling,
6,1,17,12,Schooling,
6,2,18,13,Schooling,
6,3,19,14,Schooling,
6,4,20,15,Schooling,
6,5,21,16,Home,
6,6,22,16,White,14062.67
6,7,23,16,White,15921.17
6,8,24,16,White,18602.73
6,9,25,16,White,19693.95
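
The panel structure described above (a varying number of consecutive years per individual) can be checked directly. The following is a minimal sketch and not part of the original analysis: the `toy` frame and the helper names `years_observed` and `has_consecutive_periods` are hypothetical and only mimic the layout of `df`. The same functions could be applied to `df` itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel with the same index layout as `df` (values are made up).
toy = pd.DataFrame(
    {
        "Identifier": [1, 1, 1, 2, 2],
        "Period": [0, 1, 2, 0, 1],
        "Choice": ["Schooling", "Schooling", "Blue", "Home", "White"],
        "Wage": [np.nan, np.nan, 9500.0, np.nan, 12000.0],
    }
).set_index(["Identifier", "Period"])


def years_observed(panel):
    """Return the number of observed periods for each individual."""
    return panel.groupby(level="Identifier").size()


def has_consecutive_periods(panel):
    """Check that each individual's periods run 0, 1, 2, ... without gaps.

    Assumes the rows are sorted by period within each individual.
    """
    periods = panel.reset_index()[["Identifier", "Period"]]
    expected = periods.groupby("Identifier").cumcount()
    return bool((periods["Period"] == expected).all())


print(years_observed(toy))
print(has_consecutive_periods(toy))
```

Calling `years_observed(df).describe()` would then summarize how long individuals remain in the sample.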


## Basic Descriptives

Now we are ready to reproduce some descriptive statistics from the paper.

### Choices

We reproduce the distribution of individuals across the different alternatives as reported in Table 1.

In [23]:
pd.crosstab(index=df['Age'], columns=df['Choice'], margins=True)

Age,Schooling,Home,White,Blue,Military,All
16,1178,145,4,45,1,1373
17,1014,197,15,113,20,1359
18,561,296,92,331,70,1350
19,420,293,115,406,107,1341
20,341,273,149,454,113,1330
21,275,257,170,498,106,1306
22,169,212,256,559,90,1286
23,105,185,336,546,68,1240
24,65,112,284,416,44,921
25,24,61,215,267,24,591
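
Since the number of observed individuals shrinks with age, the raw counts are easier to compare across ages once they are converted to row shares. The sketch below uses the `normalize='index'` option of `pd.crosstab` on a small, made-up frame. The `toy` data and the helper name `choice_shares` are hypothetical, and the function is meant to be applied to `df` in the same way.

```python
import pandas as pd

# Hypothetical toy data with the two columns used in the cross-tabulation.
toy = pd.DataFrame(
    {
        "Age": [16, 16, 16, 16, 17, 17],
        "Choice": ["Schooling", "Schooling", "Schooling", "Home", "Schooling", "Blue"],
    }
)


def choice_shares(data):
    """Return the percentage of individuals choosing each alternative, by age."""
    shares = pd.crosstab(index=data["Age"], columns=data["Choice"], normalize="index")
    return shares * 100


shares = choice_shares(toy)
print(shares)
```

Each row of the result sums to 100, so the shares at different ages are directly comparable even though the sample size differs.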


### Wages

We reproduce the average real wages by occupation as reported in Table 4.

In [9]:
tab = pd.crosstab(index=df['Age'], columns=df['Choice'], values=df['Wage'], aggfunc='mean', margins=True)
tab[['All', 'White', 'Blue', 'Military']]

Age,All,White,Blue,Military
16,10217.74,9320.76,10286.74,
17,11036.6,10049.76,11572.89,9005.36
18,12060.75,11775.34,12603.82,10171.87
19,12246.68,12376.42,12949.84,9714.6
20,13635.87,13824.01,14363.66,10852.51
21,14977.0,15578.14,15313.45,12619.37
22,17561.28,20236.08,16947.9,13771.56
23,18719.84,20745.56,17884.95,14868.65
24,20942.42,24066.64,19245.19,15910.84
25,22754.54,24899.23,21473.31,17134.46
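
Because a wage is not necessarily recorded for every working individual, each mean above is computed only over the non-missing wages in its cell (pandas skips `NaN` values when averaging). A minimal sketch of how to report the mean alongside the number of wage observations behind it follows. The `toy` data and the helper name `wage_summary` are hypothetical, and the function could be applied to `df.reset_index()` in the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: some working individuals have no recorded wage.
toy = pd.DataFrame(
    {
        "Age": [16, 16, 16, 17, 17],
        "Choice": ["Blue", "Blue", "White", "Blue", "White"],
        "Wage": [9000.0, np.nan, 11000.0, 10000.0, np.nan],
    }
)


def wage_summary(data):
    """Return the mean wage and the count of non-missing wages by age and choice."""
    return data.groupby(["Age", "Choice"])["Wage"].agg(["mean", "count"])


summary = wage_summary(toy)
print(summary)
```

Cells with few observations, such as the military at young ages, are the ones where the reported averages should be read with the most caution.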
