# Issue with the NTIA supplement to the CPS, for San Francisco

## Sources
* Data: https://www.ntia.doc.gov/page/download-digital-nation-datasets
* Map (useful for checks!): https://www.ntia.doc.gov/data/digital-nation-data-explorer#sel=internetAtHome&disp=map
* Docs: https://www.ntia.doc.gov/files/ntia/publications/november-2019-techdocs.pdf
* Universes & code examples: https://www.ntia.doc.gov/files/ntia/data_central_downloads/code/create-ntia-tables-stata.zip


## The issue with San Francisco

Let's check households in CA and SF (the problem is similar for person-based estimates).  We compare our own calculation of "internet use by anyone in the household" (`internetAtHome`) against the NTIA estimate for CA, which is 79.9%, as can be seen here:
* https://www.ntia.doc.gov/data/digital-nation-data-explorer#sel=internetAtHome&disp=map

In [1]:
import numpy as np
import pandas as pd

cps_test = pd.read_csv("data/nov19-cps.csv")

# Householders
cps_test["isHouseholder"] = (cps_test.perrp > 0) & (cps_test.perrp < 3) & \
                            (cps_test.hrhtype > 0) & (cps_test.hrhtype < 9)

# Households in California, and those in SF county/city.
ca_households = cps_test.query("(gestfips == 6) & isHouseholder")
sf_households = ca_households.query("gtco == 75")

# Get the "official" estimate and compare to our calculated quantity for CA,
#   and then SF (no official estimate available)
ca_ntia_official = 0.799
ca_internet = ca_households.query("heinhome == 1").hwhhwgt.sum() / ca_households.hwhhwgt.sum()
sf_internet = sf_households.query("heinhome == 1").hwhhwgt.sum() / sf_households.hwhhwgt.sum()

ca_ntia_official, ca_internet, sf_internet

(0.799, 0.7992982448757221, 0.6540205571069156)

San Francisco is 15 percentage points worse than California as a whole!? Are the errors just enormous?  No: the standard error is just 5 points.

In [2]:
# Raw strings keep "\d" as a regex digit class rather than a string escape.
hh               = sf_households.filter(regex = r"hhwgt[1-9]\d*", axis = 1).sum()
hh_with_internet = sf_households.query("heinhome == 1").filter(regex = r"hhwgt[1-9]\d*", axis = 1).sum()

sf_internet_reps = hh_with_internet / hh

np.sqrt((4 / 160) * ((sf_internet_reps - sf_internet)**2).sum())

0.049732365088150265
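
The calculation in the cell above recomputes the share once per replicate weight, then takes the square root of 4/160 times the summed squared deviations from the full-sample estimate. Here is a minimal sketch of that pattern on invented numbers. `replicate_se` is a hypothetical helper name, and the toy arrays only stand in for `hwhhwgt` and its 160 replicate columns.

```python
import numpy as np

def replicate_se(flag, weight, rep_weights, factor=4 / 160):
    """Weighted share and its replicate-weight standard error.

    flag        : boolean array, True where the household has the trait
    weight      : full-sample weights, shape (n,)
    rep_weights : replicate weights, shape (n, R)
    factor      : multiplier on the summed squared deviations (4/160 above)
    """
    estimate = weight[flag].sum() / weight.sum()
    # One share per replicate column.
    rep_estimates = rep_weights[flag].sum(axis=0) / rep_weights.sum(axis=0)
    se = np.sqrt(factor * ((rep_estimates - estimate) ** 2).sum())
    return estimate, se

# Toy data, invented for illustration (not CPS values).
rng = np.random.default_rng(0)
n, n_reps = 200, 160
flag = rng.random(n) < 0.65
weight = rng.uniform(2000, 6000, size=n)
rep_weights = weight[:, None] * rng.uniform(0.5, 1.5, size=(n, n_reps))

estimate, se = replicate_se(flag, weight, rep_weights)
print(estimate, se)
```

If every replicate column equals the full-sample weight, the deviations vanish and the standard error collapses to zero, which is a handy sanity check.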

So I reproduce the official CA figure exactly, but SF county shows substantially lower home internet use than CA as a whole.
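
One rough way to gauge how far SF sits below California is to express the gap in units of the SF standard error. The sketch below hard-codes the values printed above. As a simplifying assumption, it treats the California share as exact and ignores that SF households are themselves part of the CA total.

```python
# Values copied from the outputs above.
ca_internet = 0.7992982448757221
sf_internet = 0.6540205571069156
sf_se = 0.049732365088150265

gap = ca_internet - sf_internet  # difference in shares
z = gap / sf_se                  # gap in units of the SF standard error

print(f"gap = {gap:.3f}, z = {z:.2f}")
```

Under these assumptions the gap is a little under three standard errors, so sampling noise alone is an unlikely explanation.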

This simply does not feel correct to me, and indeed, we get a different impression from the ACS.

Dallas is the other outlier in the ACS / CPS comparison.

Is my definition of SF county simply wrong?  (Note that the "usual" CPS NTIA universe is restricted to persons aged 3 and over (`prtage >= 3`) who are not Armed Forces (`prpertyp != 3`).  I do not apply that restriction here.)
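
For reference, the universe restriction mentioned above would look roughly like this if it were applied. This is a minimal sketch on an invented toy frame. Only the column names `prtage` and `prpertyp` and the two conditions come from the text.

```python
import pandas as pd

# Toy person records, invented for illustration (not CPS values).
people = pd.DataFrame({
    "prtage":   [1, 5, 34, 40, 67],
    "prpertyp": [1, 1, 2, 3, 2],
})

# Keep persons aged 3+ whose person type is not 3 (Armed Forces).
in_universe = people.query("(prtage >= 3) & (prpertyp != 3)")
print(len(in_universe))
```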

In [3]:
cps_test.query("(gestfips == 6) & (gtco == 75) & (pwsswgt > 0)").agg({"pwsswgt" : ["sum", "count"]}).to_dict()

{'pwsswgt': {'sum': 804606.1993000001, 'count': 200.0}}

This isn't perfect: the current SF population is around 880k, but the weighted sum of about 805k is not far off.  The outliers by this metric are Dallas and Fort Worth (are they swapped?) and San Jose (also low, so not swapped with SF within the CBSA).
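
The same check can be run for every identified county at once, which is one way such outliers would surface. This is a sketch on invented rows. The column names `gestfips`, `gtco` and `pwsswgt` are as in the cell above, and all numbers are made up.

```python
import pandas as pd

# Toy person records, invented for illustration (not CPS values).
toy = pd.DataFrame({
    "gestfips": [6, 6, 6, 48, 48],
    "gtco":     [75, 75, 85, 113, 113],
    "pwsswgt":  [4000.0, 4200.0, 3900.0, 0.0, 5100.0],
})

# Weighted population and respondent count per (state, county).
county_pop = (
    toy.query("pwsswgt > 0")
       .groupby(["gestfips", "gtco"])["pwsswgt"]
       .agg(["sum", "count"])
)
print(county_pop)
```

The resulting sums could then be compared against an external population table to flag counties that are far off.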

Are there negative weights?  (This isn't quantum mechanics, but...)

In [4]:
(cps_test.pwsswgt < 0).mean()

0.0

My replicate-weight errors, following the procedures linked below, are *half* of Rafi's... but he assures me that his factor is 1.96, not 2: he quotes 95% CIs and I quote SEs.

* https://cps.ipums.org/cps/repwt.shtml
* https://www2.census.gov/programs-surveys/cps/datasets/2020/march/2020_ASEC_Replicate_Weight_Usage_Instructions.docx

In [5]:
ca_internet      = ca_households.query("heinhome == 1").hwhhwgt.sum() / ca_households.hwhhwgt.sum()
hh               = ca_households.filter(regex = r"hhwgt[1-9]\d*", axis = 1).sum()
hh_with_internet = ca_households.query("heinhome == 1").filter(regex = r"hhwgt[1-9]\d*", axis = 1).sum()

ca_internet_reps = hh_with_internet / hh

np.sqrt((4 / 160) * ((ca_internet_reps - ca_internet)**2).sum())

0.006545476572600432

In [6]:
ar_households    = cps_test.query("(gestfips == 5) & isHouseholder")

ar_internet      = ar_households.query("heinhome == 1").hwhhwgt.sum() / ar_households.hwhhwgt.sum()
hh               = ar_households.filter(regex = r"hhwgt[1-9]\d*", axis = 1).sum()
hh_with_internet = ar_households.query("heinhome == 1").filter(regex = r"hhwgt[1-9]\d*", axis = 1).sum()

ar_internet_reps = hh_with_internet / hh

np.sqrt((4 / 160) * ((ar_internet_reps - ar_internet)**2).sum())

0.023113497617789587

As 95% margins of error (1.96 × SE), the CA and AR errors come to about 1.3% and 4.5%.
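
These two percentages follow from scaling the standard errors printed above by 1.96, the normal-approximation multiplier for a 95% interval. A quick check with the values copied from the outputs:

```python
# Standard errors copied from the outputs above.
ca_se = 0.006545476572600432
ar_se = 0.023113497617789587

Z_95 = 1.96  # normal-approximation multiplier for a 95% interval

ca_moe = Z_95 * ca_se
ar_moe = Z_95 * ar_se
print(f"CA: {ca_moe:.1%}, AR: {ar_moe:.1%}")
```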