# Example Analysis Code

What follows are some basic analyses of the Duke Forge/SSRI COVID19 Digital Lab Social Distancing Survey Week 1. This data includes survey responses from March 29 through March 31. 

Because this is a *weighted* survey, note that simple (unweighted) summary statistics will not accurately reflect population statistics. This notebook includes code that can be easily modified for calculating basic statistics using these weights. 
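To see why the weights matter, here is a minimal sketch on made-up numbers (not survey data). A weighted mean is the sum of each response times its weight, divided by the sum of the weights. The arrays `responses` and `weights` are hypothetical and exist only for illustration.

```python
import numpy as np

# Hypothetical toy data (not from the survey): 1 = "yes", 0 = "no".
responses = np.array([1, 1, 0, 0, 0])
# Hypothetical weights: the two "yes" respondents stand in for more people.
weights = np.array([3.0, 2.0, 1.0, 1.0, 1.0])

# Simple mean treats every respondent equally.
unweighted = responses.mean()

# Weighted mean: sum(response * weight) / sum(weight).
weighted = (responses * weights).sum() / weights.sum()

# np.average with weights computes the same quantity.
assert np.isclose(weighted, np.average(responses, weights=weights))

print(f"Unweighted share: {unweighted:.3f}")
print(f"Weighted share:   {weighted:.3f}")
```

In this toy case the two shares differ noticeably, which is the reason the rest of this notebook always passes the `weight` column along with the data.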

Also note that many variables have top-codes of `9`. This is because surveys were conducted with automated calling, so respondents had to punch in values on their phones. Be careful when interpreting values at the top-code: a `9` may mean "9 or more". 

The code that follows is written in Python and requires both the `pandas` and `statsmodels` packages, both of which can easily be installed using `conda` or `pip`. 

R users who wish to do their own analyses will likely find the `survey` package to be of use. 

## Replicating Number of Children Per Household

Before analysis, we begin by replicating one of the summary statistics provided by the firm that conducted this survey. In particular, the survey firm has reported that, after adjusting for weights, the proportions of households with No Children, One Child, or Two Children are 60\%, 15\%, and 17\% respectively. We replicate those estimates here. 

In [10]:
# Load the cleaned and labeled version
# of the survey.
# The code that generates this cleaned data
# can be found in `10_import_and_format_week1.py`. 

import pandas as pd
import numpy as np

svy = pd.read_csv('../00_raw_data/'
                  '20200401_duke_covid_survey/'
                  'raw_survey_data_202004_CLEANED.csv')
svy.columns

Index(['Unnamed: 0', 'uniqueID', 'Date', 'Voter File Match',
       'Registered Voter (of Voter File Matches)', 'weight',
       'Q1. Health Quality', 'age', 'DEMOGRAPHICS - GENDER',
       'Q4. Number of People in HH', 'Q5. Children in HH',
       'Q6. Non-HH Face to Face Count', 'Q7. Six Feet Away? (If Q6 > 0)',
       'Q8. HH Member Going to Work',
       'Q9. Children Interacting with Other Children ',
       'Q10. Times in Group > 20 in Last Week', 'Family', 'Friends',
       'Co-workers', 'Clients, patients, or patrons',
       'Any other type of person not already mentioned',
       'Q12. Handwashing Count',
       'Q13. Currently Practicing Social Distancing?',
       'Q14. Currently Experiencing Symptoms?',
       'Q15. Likelihood of getting Coronavirus',
       'Q16. NC Response to Coronavirus', 'Q17. Changes to Routine ',
       'Q18. College Degree', 'Q19. Latino', 'Q20. Race',
       'Q21. Panel Willingness', 'Q19-20. Race + Ethnicity', 'Survey Mode',
       'DEMOGRAPHICS 

Note that all variables up to Q21 are questions asked in this survey. Subsequent variables come largely from Clarity Campaigns, which has done its best to match survey respondents with an internal database built off other sources. Information on these variables can be found in `00_raw_data/20200401_duke_covid_survey/`. 

In [11]:
len(svy)

1274

Now we can replicate results from Clarity. For this we will rely on the `DescrStatsW` class from `statsmodels`. [Documentation can be found here](https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.DescrStatsW.html), but the basic idea is that it takes a vector of data and a vector of weights, and then exposes various weighted statistics (such as `.mean`) as attributes. 

In [12]:
from statsmodels.stats.weightstats import DescrStatsW

for num in ['None', 'One', 'Two']:
    svy[num] = svy['Q5. Children in HH'] == num
    w = DescrStatsW(svy[num], svy['weight'])
    print(f"Share of households with {num} kids is {w.mean:.3f}")

Share of households with None kids is 0.599
Share of households with One kids is 0.151
Share of households with Two kids is 0.171


Of course, calculating overall averages is only kinda interesting. Generally, we want to know statistics for sub-populations. Normally, we'd just do this using the `groupby` operator, but things are a little more complicated with weighting. 

To begin, let's again calculate the proportion of households with different numbers of children using groupby for the trivial case where everyone is in the same group. Once we know that works, we can start looking at more interesting sub-populations. 

In [13]:
def get_group_mean(data, question):
    temp = data[[question, 'weight']]
    temp = temp[pd.notnull(temp[question])]
    wsvy = DescrStatsW(temp[question], temp['weight'])
    return wsvy.mean

# Constant grouping column so that every respondent falls in one group
svy['dummy'] = 1

for num in ['None', 'One', 'Two']:
    # Indicator for households reporting this number of children
    svy[num] = svy['Q5. Children in HH'] == num

    w = svy.groupby('dummy').apply(lambda x: get_group_mean(x, num))
    w = w.iloc[0]

    print(f'Share households with {num} kids: {w:.3f}')

Share households with None kids: 0.599
Share households with One kids: 0.151
Share households with Two kids: 0.171


Great! Now let's break down number of children by race. To do this, we'll first re-code race since most categories are too small to be statistically valuable:

In [14]:
# Values before grouping
race = 'Q19-20. Race + Ethnicity'
svy[race].value_counts()

White                 928
Black                 236
Hispanic or Latino     51
Another race           39
Asian                  20
Name: Q19-20. Race + Ethnicity, dtype: int64

In [15]:
svy[race] = svy[race].replace({'Asian': 'Other',
                               'Hispanic or Latino': 'Other',
                               'Another race': 'Other'})
svy[race].value_counts()

White    928
Black    236
Other    110
Name: Q19-20. Race + Ethnicity, dtype: int64

Now we can look at number of children by racial group:

In [16]:
svy['None'] = svy['Q5. Children in HH'] == 'None'
svy.groupby(race).apply(lambda x: get_group_mean(x, 'None'))

Q19-20. Race + Ethnicity
Black    0.632850
Other    0.539235
White    0.601377
dtype: float64
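As a cross-check on the `groupby` approach, the same kind of weighted group mean can be computed without `statsmodels` by dividing the weighted sum in each group by the sum of weights in that group. The sketch below uses a small made-up DataFrame. The column names `group`, `flag`, and `weight` are hypothetical, and this is an alternative illustration rather than the method used in the cells above.

```python
import pandas as pd

# Hypothetical toy data (not from the survey); the last row has a missing answer.
toy = pd.DataFrame({
    'group':  ['A', 'A', 'A', 'B', 'B', 'B'],
    'flag':   [1, 0, 0, 1, 1, float('nan')],
    'weight': [2.0, 1.0, 1.0, 1.0, 3.0, 5.0],
})

# Drop missing answers first so their weights do not enter the denominator.
toy = toy[toy['flag'].notna()]

# Weighted mean per group = sum(flag * weight) / sum(weight).
weighted_sum = (toy['flag'] * toy['weight']).groupby(toy['group']).sum()
weight_total = toy['weight'].groupby(toy['group']).sum()
group_means = weighted_sum / weight_total

print(group_means)
```

Dropping missing answers before summing mirrors what `get_group_mean` does with `pd.notnull`.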

## Analyzing COVID-Related Outcomes

Now that we've gotten the basics of working with weighted survey data out of the way (and by comparing our calculated values to known outcomes from the survey firm, we know we've done it correctly), let's start looking at some COVID-related variables!

### Large Groups

A key question in our analysis is whether people have been in large groups in the last week. Note that because this survey was conducted on March 29th, 30th, and 31st, most North Carolinians were not yet under a "shelter-in-place" order during the week preceding this survey, so we wouldn't expect people's answers to "How many times have you been in a group of > 20 people in the last week" to be all zeros!

Note that `9` is a top-code here, so values of `9` mean "9 or greater".
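Because of the top-code, a mean of this variable would treat "9 or more" as exactly 9 and so can only be a lower bound, while shares such as "at least once" are unaffected. A minimal sketch on made-up data (the column names `times` and `weight` are hypothetical):

```python
import pandas as pd

TOP_CODE = 9  # responses of 9 mean "9 or greater"

# Hypothetical toy data (not from the survey).
toy = pd.DataFrame({
    'times':  [0, 0, 1, 3, 9],
    'weight': [1.0, 2.0, 1.0, 1.0, 1.0],
})

total_weight = toy['weight'].sum()

# Weighted share with at least one large-group event: unaffected by top-coding.
share_any = toy.loc[toy['times'] >= 1, 'weight'].sum() / total_weight

# Weighted share sitting at the top-code: the true values are unknown (9 or more).
share_top = toy.loc[toy['times'] == TOP_CODE, 'weight'].sum() / total_weight

# Weighted mean counts "9 or more" as exactly 9, so it is only a lower bound.
mean_lower_bound = (toy['times'] * toy['weight']).sum() / total_weight

print(f"Share with any:       {share_any:.3f}")
print(f"Share at top-code:    {share_top:.3f}")
print(f"Mean (lower bound):   {mean_lower_bound:.3f}")
```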

In [19]:
# Get weighted proportions in each category
big_group = 'Q10. Times in Group > 20 in Last Week'

def get_group_sumweights(data, question):
    temp = data[[question, 'weight']]
    temp = temp[pd.notnull(temp[question])]
    wsvy = DescrStatsW(temp[question], temp['weight'])
    return wsvy.sum_weights

sums = svy.groupby(big_group).apply(lambda x: get_group_sumweights(x, big_group))
proportions = sums / sums.sum()
proportions

Q10. Times in Group > 20 in Last Week
0.0    0.795307
1.0    0.085695
2.0    0.039488
3.0    0.019599
4.0    0.003209
5.0    0.011107
6.0    0.005033
7.0    0.002025
9.0    0.038537
dtype: float64

So clearly the *vast* majority of people haven't been in big groups (or won't admit to it). So let's just look at the share of people who've been in a big group at least once in the last week, and see how it breaks down by sub-population. 

In [23]:
svy['any_group']= (svy[big_group] > 0) & pd.notnull(svy[big_group])
svy.loc[pd.isnull(svy[big_group]), 'any_group'] = np.nan

race = 'Q19-20. Race + Ethnicity'
avg_num = svy.groupby(race).apply(lambda x: get_group_mean(x, 'any_group'))
avg_num

Q19-20. Race + Ethnicity
Black    0.232377
Other    0.250658
White    0.189627
dtype: float64

Some small differences. Clarity provides a "likelihood people attend church" variable, so we can check whether that's driving things. According to the Clarity data, Black North Carolinians have a slightly higher likelihood of attending church:

In [27]:
svy.groupby(race).apply(
    lambda x: get_group_mean(x, 'CHURCH ATTENDANCE'))

Q19-20. Race + Ethnicity
Black    3.478403
Other    2.414467
White    2.922348
dtype: float64

We can also look for variation in likelihood by education. It appears that those without a college degree are more likely to have been in large groups. 

In [29]:
educ = 'Q18. College Degree'
avg_num = svy.groupby(educ).apply(
    lambda x: get_group_mean(x, 'any_group'))
avg_num

Q18. College Degree
No     0.268921
Yes    0.176835
dtype: float64

### Distancing in Last 24 Hours

The survey also asks about the number of people with whom the respondent has had face-to-face interactions in the last 24 hours, and then in how many of those interactions the respondent was able to maintain social distance. The difference is the number of people with whom they weren't able to keep their distance. 

However, note top-codes make interpreting this a little tricky...
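One way to see the difficulty: when the face-to-face count is top-coded at 9, the difference between the two answers is no longer the true number of close interactions. The sketch below, on made-up data with hypothetical column names `q6` and `q7`, walks through the same kind of adjustments on a few rows and flags answers that are inconsistent with each other.

```python
import numpy as np
import pandas as pd

TOP_CODE = 9

# Hypothetical toy data (not from the survey).
# q6: people met face to face; q7: of those, how many stayed six feet away.
toy = pd.DataFrame({
    'q6': [0, 3, 9, 2, 5],
    'q7': [np.nan, 1, 4, 6, 5],
})

# Close interactions = face-to-face count minus distanced count.
diff = toy['q6'] - toy['q7']

# Nobody met face to face means no close interactions.
diff.loc[toy['q6'] == 0] = 0

# The difference is unreliable when the face-to-face count is top-coded.
diff.loc[toy['q6'] == TOP_CODE] = np.nan

# A negative difference means the two answers contradict each other.
inconsistent = diff < 0

print(diff.tolist())
print(f"Inconsistent rows: {inconsistent.sum()}")
```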

In [39]:
svy['close_interacts'] = (svy['Q6. Non-HH Face to Face Count'] - 
                          svy['Q7. Six Feet Away? (If Q6 > 0)'])
svy.loc[svy['Q6. Non-HH Face to Face Count'] == 0, 'close_interacts'] = 0
svy.loc[svy['Q6. Non-HH Face to Face Count'] == 9, 'close_interacts'] = np.nan

In [40]:
svy.close_interacts.value_counts()

 0.0    795
 1.0    125
 2.0     58
 3.0     25
 4.0     17
-1.0     13
 5.0      8
 7.0      6
 6.0      6
-3.0      5
-4.0      4
-6.0      4
-2.0      4
-7.0      3
-8.0      2
 8.0      2
-5.0      1
Name: close_interacts, dtype: int64

OK, negatives (where the Q7 answer is larger than the Q6 answer) are clearly junk, so we set them to zero...

In [41]:
svy.loc[svy['close_interacts'] < 0, 'close_interacts'] = 0

In [43]:
sums = svy.groupby('close_interacts').apply(
           lambda x: get_group_sumweights(x, 'close_interacts'))
proportions = sums / sums.sum()
proportions

close_interacts
0.0    0.774122
1.0    0.106074
2.0    0.058568
3.0    0.019922
4.0    0.017027
5.0    0.005028
6.0    0.005419
7.0    0.009362
8.0    0.004479
dtype: float64

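So about 23% said they had at least one interaction without distancing (roughly 11% reported exactly one). Note that respondents whose Q6 answer was top-coded at `9` are excluded from these proportions, and negative differences were set to zero. 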
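The share with at least one close interaction can be read off the proportions printed above by summing every category from 1 upward (equivalently, one minus the share at 0), while the value for exactly one is the single entry at 1. A quick arithmetic check using the printed values:

```python
# Weighted proportions copied from the output above.
proportions = {
    0: 0.774122, 1: 0.106074, 2: 0.058568, 3: 0.019922, 4: 0.017027,
    5: 0.005028, 6: 0.005419, 7: 0.009362, 8: 0.004479,
}

# Share reporting one or more close interactions.
at_least_one = sum(share for count, share in proportions.items() if count >= 1)

# Share reporting exactly one close interaction.
exactly_one = proportions[1]

print(f"At least one: {at_least_one:.3f}")
print(f"Exactly one:  {exactly_one:.3f}")
```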