# Example Usage of IPUMS Data

## Brief Background

> Disclaimer: Section co-written with ChatGPT

**IPUMS** (Integrated Public Use Microdata Series) is a project that harmonizes and centralizes survey data from across the globe to allow for easier comparison and analysis.
The data is free to access, but a separate account must be created for each individual **IPUMS** project (see https://www.ipums.org).

The **IPUMS CPS** dataset (https://cps.ipums.org/cps/) is derived from the U.S. Current Population Survey (CPS), a monthly survey conducted by the U.S. Census Bureau for the Bureau of Labor Statistics (BLS).
The CPS collects data on employment, unemployment, earnings, demographics, and other social and economic factors.
The main categories of data within the CPS are:

1. **Household-Level Data**: Contains information about the composition of households, housing characteristics, and household-level variables such as total income.
2. **Person-Level Data**: Focuses on individual respondents, providing detailed information on demographic factors such as age, race, gender, and educational attainment.
3. **Labor Force Data**: Records employment status, occupation, hours worked, and wage data for individuals. 

## Dependency Set-Up

In [1]:
using DrWatson
@quickactivate "CompositionalMLStudy"

using DataFrames

import DrWatson:
  datadir

import IPUMS:
  load_ipums_extract,
  parse_ddi

## Setting Constants

In [2]:
# IPUMS Data Directory
IPUMS_DIR = datadir("exp_raw", "IPUMS")

#=

  IPUMS DDI and DAT file to be used.
  When one file name is changed, the 
  other should be updated as well.

=#

# DDI Data Dictionary
DDI_FILE = "cps_00097.xml"

# IPUMS CPS Example Data 
DAT_FILE = "cps_00097.dat"

"cps_00097.dat"
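
Because the DDI and DAT file names must stay in sync, one option is to derive both from a single extract name. The following is a minimal sketch of that idea; the helper `extract_files` is a hypothetical name introduced here for illustration and is not part of `IPUMS` or `DrWatson`. It assumes the two files share the same stem and differ only in extension, as they do above.

```julia
# Hypothetical helper: build both file names from one extract stem,
# assuming the DDI and DAT files differ only by extension.
function extract_files(stem::AbstractString)
  return (ddi = stem * ".xml", dat = stem * ".dat")
end

files = extract_files("cps_00097")
println(files.ddi)
println(files.dat)
```

With this approach, switching to a different extract only requires changing the stem in one place.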

## Basic Exploration of IPUMS Data

### Loading Data

Load the DDI data dictionary (XML) and the [IPUMS CPS](https://cps.ipums.org/cps/) data file (DAT):

In [3]:
ddi = parse_ddi(joinpath(IPUMS_DIR, DDI_FILE));
df = load_ipums_extract(ddi, joinpath(IPUMS_DIR, DAT_FILE));
first(df, 5)

| Row | YEAR | SERIAL | MONTH | CPSID | ASECFLAG | ASECWTH | FOODSTMP | PERNUM | CPSIDP | ASECWT | AGE | EMPSTAT | AHRSWORKT | HEALTH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Float64 | Int64 | Int64 | Int64 | Float64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 2011 | 33 | 3 | 20100302841800 | 1 | 308.26 | 1 | 1 | 20100302841801 | 308.26 | 46 | 10 | 50 | 2 |
| 2 | 2011 | 33 | 3 | 20100302841800 | 1 | 308.26 | 1 | 2 | 20100302841803 | 216.84 | 14 | 0 | 999 | 1 |
| 3 | 2011 | 33 | 3 | 20100302841800 | 1 | 308.26 | 1 | 3 | 20100302841802 | 249.14 | 10 | 0 | 999 | 1 |
| 4 | 2011 | 46 | 3 | 20110102848200 | 1 | 265.55 | 1 | 1 | 20110102848201 | 265.55 | 63 | 10 | 47 | 3 |
| 5 | 2011 | 46 | 3 | 20110102848200 | 1 | 265.55 | 1 | 2 | 20110102848202 | 265.55 | 62 | 32 | 999 | 5 |
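
The value `999` in `AHRSWORKT` above looks like a placeholder rather than a real number of hours. The sketch below shows one way such codes might be handled before computing a weighted statistic. It rests on two assumptions that should be checked against the DDI codebook for the extract: that `999` is the "not in universe" code for `AHRSWORKT`, and that `ASECWT` is the appropriate person-level weight for these records. The `toy` table is a hand-copied subset of the five rows shown above, not the full extract.

```julia
using DataFrames

# Toy copy of three columns from the first five rows shown above.
toy = DataFrame(
  AGE = [46, 14, 10, 63, 62],
  ASECWT = [308.26, 216.84, 249.14, 265.55, 265.55],
  AHRSWORKT = [50, 999, 999, 47, 999],
)

# Assumption: 999 marks "not in universe" for AHRSWORKT (verify in the DDI).
ahrsworkt_niu = 999

# Keep only rows with a real hours value.
workers = subset(toy, :AHRSWORKT => ByRow(!=(ahrsworkt_niu)))

# Weighted mean of hours worked, using the person weight ASECWT.
weighted_hours = sum(workers.AHRSWORKT .* workers.ASECWT) / sum(workers.ASECWT)

println("Rows kept: ", nrow(workers))
println("Weighted mean hours: ", round(weighted_hours, digits = 2))
```

Leaving the placeholder codes in place would pull any average of `AHRSWORKT` sharply upward, so filtering (or recoding to `missing`) is usually worth doing before analysis.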


### Examining Metadata

Metadata attached to the `DataFrame` as a whole:

In [4]:
meta_info = metadata(df)
for md in keys(meta_info)
  println("$(md):\n----------------\n\n $(meta_info[md])\n\n")
end

extract_notes:
----------------

 User-provided description:  Reproducing cps00011 example data


citation:
----------------

 Publications and research reports based on the IPUMS-CPS database must cite it appropriately. The citation should include the following:

Sarah Flood, Miriam King, Renae Rodgers, Steven Ruggles, J. Robert Warren and Michael Westberry. Integrated Public Use Microdata Series, Current Population Survey: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2022. https://doi.org/10.18128/D030.V10.0

The licensing agreement for use of IPUMS-CPS data requires that users supply us with the title and full citation for any publications, research reports, or educational materials making use of the data or documentation. Please add your citation to the IPUMS bibliography: http://bibliography.ipums.org/


extract_date:
----------------

 2023-03-20


conditions:
----------------

 Users of IPUMS-CPS data must agree to abide by the conditions of use. A user's license is valid for

Metadata attached to each column of the `DataFrame`:

In [5]:
for colname in names(df)
  println("$(colname):\n----------------\n\n $(colmetadata(df, colname, "description"))\n\n")
end

YEAR:
----------------

 YEAR reports the year in which the survey was conducted.  YEARP is repeated on person records.


SERIAL:
----------------

 SERIAL is an identifying number unique to each household in a given survey month and year.  All person records are assigned the same serial number as the household record they follow.  A combination of YEAR, MONTH, and SERIAL provides a within-sample unique identifier for every household in IPUMS-CPS; YEAR, MONTH, SERIAL, and PERNUM uniquely identify every person within a single sample.

SERIAL is a new value generated for IPUMS-CPS and should not be confused with the household serial number created by the Census Bureau and included in the original CPS data.


MONTH:
----------------

 MONTH indicates the calendar month of the CPS interview.


CPSID:
----------------

 CPSID is an IPUMS-CPS defined variable that uniquely identifies households across CPS samples. The first six digits of CPSID index the four-digit year and two-digit month th