# People per tax unit

Compare `XTOT` with the UBI variables (`nu18`, `n1820`, `n21`) in the PUF and CPS files.

## Setup

In [1]:
import pandas as pd
import numpy as np

## Load

In [2]:
# Counts of people in the tax unit under 18, aged 18 to 20, and aged 21 or older.
UBI_COLS = ['nu18', 'n1820', 'n21']

In [3]:
puf = pd.read_csv('~/puf.csv', usecols=['XTOT', 'RECID'] + UBI_COLS)

## Preprocess

In [4]:
# People per tax unit implied by the UBI age-group counts.
puf['persons'] = puf[UBI_COLS].sum(axis=1)

In [5]:
# Treat XTOT == 0 as one person, i.e. max(XTOT, 1).
puf['XTOT1'] = np.where(puf.XTOT == 0, 1, puf.XTOT)

In [6]:
# Positive values mean the age-group counts sum to more people than XTOT1.
puf['persons_minus_XTOT1'] = puf.persons - puf.XTOT1
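
The three derived columns can be checked on a small hand-built frame. This is a minimal sketch: the `toy` frame and its values are made up for illustration and are not PUF data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records (not PUF data), one row per tax unit.
toy = pd.DataFrame({
    'XTOT':  [0, 1, 2, 2],
    'nu18':  [0, 0, 1, 3],
    'n1820': [0, 0, 0, 0],
    'n21':   [1, 1, 1, 2],
})
ubi_cols = ['nu18', 'n1820', 'n21']

# Same three steps as the cells above, applied to the toy frame.
toy['persons'] = toy[ubi_cols].sum(axis=1)
toy['XTOT1'] = np.where(toy.XTOT == 0, 1, toy.XTOT)
toy['persons_minus_XTOT1'] = toy.persons - toy.XTOT1

print(toy)
```

For non-negative `XTOT`, `toy.XTOT.clip(lower=1)` should give the same `XTOT1` column as the `np.where` call.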

## Analyze

In [7]:
puf.groupby('XTOT').size()

XTOT
0     9583
1    86180
2    75820
3    30557
4    30015
5    16417
6       16
7        3
dtype: int64

In [8]:
puf.groupby('persons').size()

persons
1     91851
2     78208
3     31964
4     30084
5     16371
6        69
7        23
8        17
10        4
dtype: int64

### Difference between `sum(nu18, n1820, n21)` and `max(XTOT, 1)`

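Limit the comparison to records with `persons <= 5`, because `XTOT` appears to be top-coded at 5 for nearly all records (only 19 records above have an `XTOT` of 6 or 7).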

In [9]:
puf[puf.persons <= 5].groupby('persons_minus_XTOT1').size()

persons_minus_XTOT1
0    242959
1      5459
2        53
3         6
4         1
dtype: int64
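
A cross-tabulation is another way to see where the two counts disagree, since every off-diagonal cell is a mismatch. The sketch below uses a made-up `toy` frame rather than the PUF, and `pd.crosstab` is swapped in here as an alternative to the `groupby` on the difference column.

```python
import pandas as pd

# Hypothetical toy records (not PUF data).
toy = pd.DataFrame({
    'persons': [1, 2, 2, 5, 5, 3],
    'XTOT1':   [1, 2, 1, 2, 5, 3],
})

# Rows are persons, columns are XTOT1; cells count records.
xtab = pd.crosstab(toy.persons, toy.XTOT1)
print(xtab)
```

On the real data, the same call would be `pd.crosstab(puf.persons, puf.XTOT1)`, assuming the Preprocess cells have been run.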

Number of records with `persons <= 5` and `persons > XTOT1`.

In [10]:
puf[(puf.persons <= 5) & (puf.persons_minus_XTOT1 > 0)].shape[0]

5519

Share of records with `persons <= 5` where `persons > XTOT1`, expressed as a fraction (about 2.2%).

In [11]:
(puf[(puf.persons <= 5) & (puf.persons_minus_XTOT1 > 0)].shape[0] /
 puf[(puf.persons <= 5)].shape[0])

0.022211221919043133
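
The count and the share can also be read off a boolean mask, since the mean of a boolean Series is the fraction of `True` values. A minimal sketch on made-up data (the `toy` frame is hypothetical, not the PUF):

```python
import pandas as pd

# Hypothetical toy records (not PUF data).
toy = pd.DataFrame({
    'persons':             [1, 2, 3, 5, 6],
    'persons_minus_XTOT1': [0, 1, 0, 3, 1],
})

# Keep only records at or below the top-code, then flag mismatches.
subset = toy[toy.persons <= 5]
mismatch = subset.persons_minus_XTOT1 > 0

count = int(mismatch.sum())  # number of mismatching records
share = mismatch.mean()      # fraction of mismatching records
print(count, share)
```

This gives the same result as dividing the two `shape[0]` values, while filtering the frame only once.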

In [12]:
puf[puf.persons <= 5].sort_values('persons_minus_XTOT1', ascending=False).head()

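| | n1820 | n21 | nu18 | XTOT | RECID | persons | XTOT1 | persons_minus_XTOT1 |
|---|---|---|---|---|---|---|---|---|
| 247185 | 0 | 3 | 2 | 1 | 247186 | 5 | 1 | 4 |
| 245705 | 0 | 4 | 1 | 2 | 245706 | 5 | 2 | 3 |
| 244875 | 0 | 5 | 0 | 2 | 244876 | 5 | 2 | 3 |
| 244447 | 0 | 2 | 3 | 2 | 244448 | 5 | 2 | 3 |
| 248020 | 3 | 1 | 1 | 2 | 248021 | 5 | 2 | 3 |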
