# Behavioral Risk Factor Surveillance System (BRFSS), 2013



In [1]:
from ambry import get_library
l = get_library()
b = l.bundle('cdc.gov-brfss-2013-0.0.1')
p = l.partition('cdc.gov-brfss-2013-brfss')
df = p.dataframe()

First, let's check basic data integrity by calculating the crude and weighted percentages for one of the variables. Here are the value counts for the ``_PRACE1`` variable, "Preferred race category" (``_prace1`` in the dataframe), expressed as a fraction of all records. These values should match the percentages [listed on page 112 in the codebook](http://www.cdc.gov/brfss/annual_data/2013/pdf/CODEBOOK13_LLCP.pdf).

In [2]:
(df._prace1.value_counts()/len(df))

1     0.824368
2     0.085419
6     0.022244
4     0.021685
3     0.020339
99    0.010678
5     0.006312
77    0.006265
7     0.002302
8     0.000329
dtype: float64

Now, perform the same calculation for the weighted values. The values of the ``_llcpwt`` variable can be interpreted as the number of people that a record represents. 

Since the BRFSS is a survey, it only asks questions of a subset of people, and it does not get responses from all races and genders in proportion to their share of the population. So the survey administrators calculate, for each important demographic group, how many members of that group each response represents. For instance, if the weight for a Hispanic female in San Mateo County is 115, that record represents not one person but 115 people. If the survey got more responses from women than from men, then the weight for a Hispanic male in San Mateo County may be larger than 115.
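
To make the idea of a weight concrete, here is a minimal sketch using made-up numbers rather than BRFSS data. It assumes the simplest possible scheme, where a record's weight is its group's population divided by the number of respondents in that group; the real BRFSS weighting procedure is more involved. The names `population`, `responses`, and `weight` are hypothetical.

```python
import pandas as pd

# Hypothetical population counts per group (made-up numbers, not BRFSS data).
population = pd.Series({"female": 600, "male": 400})

# Hypothetical survey responses: women responded more often than men.
responses = pd.DataFrame({"group": ["female"] * 6 + ["male"] * 2})

# Number of respondents in each group.
respondents = responses["group"].value_counts()

# Assumed simplified scheme: each record stands for
# (group population) / (group respondents) people.
responses["weight"] = responses["group"].map(population / respondents)

# One weight per group, and the total number of people represented.
weight_by_group = responses.groupby("group")["weight"].first()
total_weight = responses["weight"].sum()

print(weight_by_group)
print(total_weight)
```

In this toy case the under-represented group gets the larger weight, and the weights add up to the full population of 1,000.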

Because each weight is the number of people that a record represents, the sum of all of the weights should equal the size of the whole population, which for the BRFSS is adults aged 18 and older.


In [25]:
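# Adult population: total US population times the share that is not under 18.
us_actual = 314100000 * (1 - .231)  # US pop in 2012, from Google; .231 is taken as the share under 18
brfss_est = df._llcpwt.sum()  # sum of the weights = population implied by the BRFSS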

brfss_est, us_actual, (us_actual - brfss_est), (us_actual - brfss_est)/us_actual

(246024363, 239697300.0, -6327063.0, -0.026396054523768104)

The estimate of the population from the BRFSS weights is within 2.6% of the adult population figure computed above; the BRFSS estimate is the larger of the two.

Now, we can group the weights by preferred race and compare the results to the weighted frequencies in the codebook.


In [26]:
df[['_prace1', '_llcpwt']].groupby('_prace1').sum()/ df._llcpwt.sum()


          _llcpwt
_prace1
1        0.745429
2        0.124899
3        0.019216
4        0.048525
5        0.004314
6        0.027491
7        0.002146
8        0.000488
77       0.012695
99       0.014443
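
The crude and weighted shares are easier to compare side by side. The sketch below shows one way to do that; the helper `crude_and_weighted_shares` is a hypothetical name, and the small `toy` frame stands in for `df` so the example runs on its own. Note that `value_counts(normalize=True)` divides by the number of non-null values, which differs from dividing by `len(df)` if the column contains missing values.

```python
import pandas as pd

def crude_and_weighted_shares(frame, category, weight):
    """Return crude and weighted shares of each category value, side by side."""
    # Crude share: fraction of (non-null) records with each value.
    crude = frame[category].value_counts(normalize=True)
    # Weighted share: sum of weights per value over the total weight.
    weighted = frame.groupby(category)[weight].sum() / frame[weight].sum()
    return pd.DataFrame({"crude": crude, "weighted": weighted}).sort_index()

# Toy stand-in for the BRFSS dataframe (made-up values).
toy = pd.DataFrame({
    "_prace1": [1, 1, 1, 2],
    "_llcpwt": [100.0, 100.0, 100.0, 300.0],
})

shares = crude_and_weighted_shares(toy, "_prace1", "_llcpwt")
print(shares)
```

With the real data, the same call would be `crude_and_weighted_shares(df, '_prace1', '_llcpwt')`, assuming `df` is the dataframe loaded in the first cell.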
