# Example Analysis Code

What follows are some basic analyses of the Duke Forge/SSRI COVID19 Digital Lab Social Distancing Survey Weeks 1, 2 and 3.

Because this is a *weighted* survey, note that simple summary statistics will not accurately reflect population statistics. This notebook includes code that can be easily modified to calculate basic statistics using these weights.
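To see why the weights matter, here is a minimal sketch on made-up data (the values and weights below are illustrative assumptions, not drawn from the survey). It compares an unweighted share with a weighted one using `numpy.average`.

```python
import numpy as np

# Hypothetical toy data: 1 = household has no children, 0 = has children.
no_children = np.array([1, 1, 1, 0, 0])
# Hypothetical survey weights: the last two respondents stand in for
# more people in the population than the first three.
weight = np.array([0.5, 0.5, 0.5, 2.0, 2.0])

unweighted_share = no_children.mean()
weighted_share = np.average(no_children, weights=weight)

print(f"Unweighted share: {unweighted_share:.1%}")
print(f"Weighted share:   {weighted_share:.1%}")
```

The two shares differ because respondents with large weights count for more of the population than those with small weights.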

Also note that many variables have top-codes of `9`. This is because the surveys were conducted by automated calling, so respondents had to punch in values on their phones. Be careful when interpreting these top values. Pre-processing of many variables can be found in `10_code/20_add_analysis_vars.py`.
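A minimal sketch of how one might flag top-coded answers before summarizing, using made-up responses (the column name `times_in_group` is hypothetical). Because `9` stands for "9 or more", a mean that includes those rows is only a lower bound.

```python
import pandas as pd

# Hypothetical keypad responses; 9 means "9 or more".
responses = pd.DataFrame({"times_in_group": [0, 1, 0, 9, 2, 9, 0]})

TOP_CODE = 9
responses["top_coded"] = responses["times_in_group"] == TOP_CODE

share_top_coded = responses["top_coded"].mean()
# Treating 9 as exactly 9 can only understate the true mean.
lower_bound_mean = responses["times_in_group"].mean()

print(f"Share top-coded: {share_top_coded:.1%}")
print(f"Mean (lower bound): {lower_bound_mean:.2f}")
```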

The code that follows is written in Python and requires the `pandas` and `statsmodels` packages, both of which can easily be installed using `conda` or `pip`.

R users who wish to do their own analyses will likely find the `survey` package to be of use.

Note this data includes ALL responses, including partials. As a result, the composition of your sample may change if you start using demographic controls that occur later in the survey (and thus are less likely to be included for a given respondent). 
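One way to check how much the sample shrinks is to count non-missing responses per question. The sketch below uses a made-up frame with two hypothetical columns, `early_q` and `late_q`; with the real data you would substitute the actual question names.

```python
import numpy as np
import pandas as pd

# Hypothetical partial responses: some respondents hang up before `late_q`.
toy = pd.DataFrame({
    "early_q": [1, 0, 1, 1, 0, 1],
    "late_q": ["Yes", "No", np.nan, "Yes", np.nan, np.nan],
})

# Number of usable responses for each question.
n_by_question = toy.notnull().sum()
print(n_by_question)

# Rows that survive once `late_q` is used as a control.
complete = toy.dropna(subset=["early_q", "late_q"])
print(f"{len(complete)} of {len(toy)} respondents remain")
```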

## Replicating Number of Children Per Household

Before analysis, we begin by replicating one of the summary statistics provided by the firm that conducted this survey. Top-line summary statistics from the survey firm can be found in `40_reports/Survey_Summaries`.

In particular, we'll replicate the share of households with No Children, One Child, and Two Children. The survey firm has reported that, after adjusting for weights, the proportions of households of each type are:

- No Children: 61% (Week 1), 63% (Week 2), 64% (Week 3)
- One Child: 14% (Week 1), 14% (Week 2), 15% (Week 3)
- Two Children: 17% (Week 1), 14% (Week 2), 12% (Week 3)

In [1]:
# Load the cleaned and labeled version of the survey.
# The code that generates this cleaned data
# can be found in `10_import_and_format_week1.py`.

import pandas as pd
import numpy as np

svy = pd.read_csv('../20_analysis_datasets/'
                  'merged_surveys_w_analysis_vars.csv')
svy.columns

Index(['Unnamed: 0', 'uniqueID', 'Date', 'Voter File Match',
       'Registered Voter (of Voter File Matches)', 'weight',
       'Q1. Health Quality', 'age', 'DEMOGRAPHICS - GENDER',
       'Q4. Number of People in HH', 'Q5. Children in HH',
       'Q6. Non-HH Face to Face Count', 'Q7. Six Feet Away? (If Q6 > 0)',
       'Q8. HH Member Going to Work',
       'Q9. Children Interacting with Other Children ',
       'Q10. Times in Group > 20 in Last Week', 'Family', 'Friends',
       'Co-workers', 'Clients, patients, or patrons',
       'Any other type of person not already mentioned',
       'Q12. Handwashing Count',
       'Q13. Currently Practicing Social Distancing?',
       'Q14. Currently Experiencing Symptoms?',
       'Q15. Likelihood of getting Coronavirus',
       'Q16. NC Response to Coronavirus', 'Q17. Changes to Routine ',
       'Q18. College Degree', 'Q19. Latino', 'Q20. Race',
       'Q21. Panel Willingness', 'Q19-20. Race + Ethnicity', 'Survey Mode',
       'DEMOGRAPHICS 

Note that all variables up to Q21 are questions asked in this survey. Subsequent variables come largely from Clarity Campaigns, which has done its best to match survey respondents with an internal database built off other sources. Information on these variables can be found in `40_reports`.

Now let's add a few convenience variables:

In [2]:
# Make split samples
svy1 = svy[svy['week1']].copy()
svy2 = svy[svy['week2']].copy()
svy3 = svy[svy['week3']].copy()

In [3]:
len(svy)

5228

Now we can replicate results from Clarity. For this we will rely on the `DescrStatsW` class from `statsmodels`. [Documentation can be found here](https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.DescrStatsW.html), but the basic idea is that it takes a vector of data and a vector of weights, and then returns various weighted statistics.
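As a quick illustration on made-up numbers (not survey data), the sketch below builds a `DescrStatsW` object and reads off its weighted mean and total weight. The weighted mean should agree with `numpy.average`.

```python
import numpy as np
from statsmodels.stats.weightstats import DescrStatsW

# Hypothetical data and weights, for illustration only.
values = np.array([1.0, 0.0, 1.0, 0.0])
weights = np.array([1.0, 3.0, 1.0, 3.0])

wstats = DescrStatsW(values, weights=weights)

print(wstats.mean)         # weighted mean
print(wstats.sum_weights)  # total of the weights
```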

In [4]:
from statsmodels.stats.weightstats import DescrStatsW

for week in ['week1', 'week2', 'week3']:
    
    oneweek = svy[svy[week]].copy()
    for num in ['None', 'One', 'Two']:
        oneweek[num] = oneweek['Q5. Children in HH'] == num
        temp = oneweek[pd.notnull(oneweek['Q5. Children in HH'])]    
        w = DescrStatsW(temp[num], temp['weight'])
        print(f"Share of households with {num} kids in {week} is {w.mean:.1%}")

Share of households with None kids in week1 is 60.8%
Share of households with One kids in week1 is 14.3%
Share of households with Two kids in week1 is 16.9%
Share of households with None kids in week2 is 62.3%
Share of households with One kids in week2 is 13.7%
Share of households with Two kids in week2 is 14.1%
Share of households with None kids in week3 is 63.3%
Share of households with One kids in week3 is 15.1%
Share of households with Two kids in week3 is 11.7%


Of course, calculating overall averages is only kinda interesting. Generally, we want to know statistics for sub-populations. Normally, we'd just do this using the `groupby` operator, but things are a little more complicated with weighting. 

To begin, let's again calculate the proportion of households with different numbers of children using groupby for the trivial case where everyone is in the same group. Once we know that works, we can start looking at more interesting sub-populations. 

In [5]:
def get_group_mean(data, question):
    temp = data[[question, 'weight']]
    temp = temp[pd.notnull(temp[question])]
    wsvy = DescrStatsW(temp[question], temp['weight'])
    return wsvy.mean

for num in ['None', 'One', 'Two']:
    week1 = svy[svy['week1']].copy()
    week1[num] = (week1['Q5. Children in HH'] == num)
    week1 = week1[pd.notnull(week1['Q5. Children in HH'])]    

    raw = week1[num].mean()

    w = week1.groupby('full_sample').apply(lambda x: get_group_mean(x, num))
    w = w.iloc[0]
    
    print(f'Share households with {num} kids in Week 1: {w:.1%}')

Share households with None kids in Week 1: 60.8%
Share households with One kids in Week 1: 14.3%
Share households with Two kids in Week 1: 16.9%
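As a cross-check on the `groupby` approach, here is a sketch on made-up data (the `group`, `flag`, and `weight` columns are hypothetical). It computes a weighted mean per group directly as sum(weight × value) / sum(weight), which is the definition of a weighted mean, without going through `DescrStatsW`.

```python
import pandas as pd

toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "flag": [1, 0, 1, 1, 0],
    "weight": [1.0, 3.0, 2.0, 1.0, 1.0],
})

# Weighted mean per group = sum(w * x) / sum(w); drop missing answers first.
valid = toy.dropna(subset=["flag"])
weighted_sum = (valid["flag"] * valid["weight"]).groupby(valid["group"]).sum()
total_weight = valid.groupby("group")["weight"].sum()
group_means = weighted_sum / total_weight

print(group_means)
```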


Great! Now let's break down number of children by race. To do this, we'll first re-code race since most categories are too small to be statistically valuable:

Now we can look at number of children by racial group:

In [6]:
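race = 'Q19-20. Race + Ethnicity'
svy['None'] = svy['Q5. Children in HH'] == 'None'
# NOTE: `temp` drops respondents who skipped Q5, but the groupby below
# still runs on `svy`, so skipped answers count as "not None" here.
# Use `temp.groupby(race)` instead to exclude them.
temp = svy[pd.notnull(svy['Q5. Children in HH'])]
svy.groupby(race).apply(lambda x: get_group_mean(x, 'None'))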

Q19-20. Race + Ethnicity
Black    0.592688
Other    0.460425
White    0.596882
dtype: float64

## Analyzing COVID-Related Outcomes

Now that we've gotten the basics of working with weighted survey data out of the way (and by comparing our calculated values to known outcomes from the survey firm, we know we've done it correctly), let's start looking at some COVID-related variables!

### Large Groups

A key question in our analysis is whether people have been in large groups in the last week. Note that because this survey was conducted on the 29th, 30th, and 31st, most North Carolinians were not yet under a "shelter-in-place" order during the week preceding this survey, so we wouldn't expect people's answers to "How many times have you been in a group of > 20 people in the last week?" to be all zeros!

Note that `9` is a top-code here, so values of `9` mean "9 or greater".

In [7]:
# Get weighted proportions in each category
big_group = 'Q10. Times in Group > 20 in Last Week'

def get_group_sumweights(data, question):
    # Sum of weights over non-missing responses to `question`
    temp = data[[question, 'weight']]
    temp = temp[pd.notnull(temp[question])]
    wsvy = DescrStatsW(temp[question], temp['weight'])
    return wsvy.sum_weights

for week in ['week1', 'week2', 'week3']:
    temp = svy[svy['week'] == week]
    sums = temp.groupby(big_group).apply(lambda x: get_group_sumweights(x, big_group))
    proportions = sums / sums.sum()
    print(f'In {week}, the number of (reported) times people had been in groups > 20 was:')
    print(proportions)
    print('\n')

In week1, the number of (reported) times people had been in groups > 20 was:
Q10. Times in Group > 20 in Last Week
0.0    0.746319
1.0    0.114472
2.0    0.038606
3.0    0.016420
4.0    0.008605
5.0    0.013843
6.0    0.004725
7.0    0.005151
8.0    0.001168
9.0    0.050691
dtype: float64


In week2, the number of (reported) times people had been in groups > 20 was:
Q10. Times in Group > 20 in Last Week
0.0    0.786261
1.0    0.097194
2.0    0.046486
3.0    0.017138
4.0    0.004666
5.0    0.009612
6.0    0.008975
7.0    0.001183
8.0    0.000541
9.0    0.027944
dtype: float64


In week3, the number of (reported) times people had been in groups > 20 was:
Q10. Times in Group > 20 in Last Week
0.0    0.786749
1.0    0.095527
2.0    0.041902
3.0    0.012926
4.0    0.005501
5.0    0.017040
6.0    0.003151
7.0    0.006315
8.0    0.002234
9.0    0.028654
dtype: float64




So clearly the *vast* majority of people haven't been in big groups (or won't admit to it). So let's just look at the share of people who've been in a big group at all in the last week, and see how it breaks down by sub-population.
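The `ever_in_group` variable used below is created in the pre-processing script, and its exact definition is not shown in this notebook. One plausible construction, stated here as an assumption, is an indicator for a count above zero that keeps missing answers missing:

```python
import numpy as np
import pandas as pd

# Hypothetical responses to the group-count question (NaN = not answered).
counts = pd.Series([0.0, 2.0, np.nan, 9.0, 0.0])

# Assumed definition: 1 if the respondent reported any large group, else 0.
ever = (counts > 0).astype(float)
# Comparisons with NaN evaluate to False, so restore the missing values.
ever[counts.isnull()] = np.nan

print(ever.tolist())
```

Keeping the missing values matters because `get_group_mean` drops nulls before averaging; coding skipped answers as 0 would pull the share down.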

In [8]:
for week in ['week1', 'week2', 'week3']:
    print(f'share ever a large group in week {week}')
    print(f"{svy[svy[week]].groupby('full_sample').apply(lambda x: get_group_mean(x, 'ever_in_group')).iloc[0]:.1%}")
    print('\n')

share ever a large group in week week1
25.4%


share ever a large group in week week2
21.4%


share ever a large group in week week3
21.3%




In [9]:
# And by Age
for week in ['week1', 'week2', 'week3']:
    print(f'share ever a large group in week {week}')
    print(f"{svy[svy.week == week].groupby('age_ranges').apply(lambda x: get_group_mean(x, 'ever_in_group'))}")
    print('\n')

share ever a large group in week week1
age_ranges
35 - 55    0.216389
55 - 65    0.283741
< 35       0.240695
> 65       0.306654
dtype: float64


share ever a large group in week week2
age_ranges
35 - 55    0.168526
55 - 65    0.252119
< 35       0.174079
> 65       0.325449
dtype: float64


share ever a large group in week week3
age_ranges
35 - 55    0.189855
55 - 65    0.275588
< 35       0.135362
> 65       0.287055
dtype: float64




In [10]:
avg_num = svy.groupby('age_ranges').apply(lambda x: get_group_mean(x, 'ever_in_group'))
avg_num

age_ranges
35 - 55    0.191836
55 - 65    0.269285
< 35       0.179659
> 65       0.306619
dtype: float64

In [11]:
svy1.sample().T

Unnamed: 0,378
uniqueID,379
Date,2020-03-30
Voter File Match,Yes
Registered Voter (of Voter File Matches),Yes
...,...
any_close_interactions,0
ever_in_group,1
someone_working,0
race,Black


Some small differences. Clarity provides a "likelihood people attend church" score, which we can check to see if that's driving things (Black North Carolinians have a slightly higher likelihood of attending church, according to Clarity data):

In [12]:
svy.groupby(race).apply(
    lambda x: get_group_mean(x, 'CHURCH ATTENDANCE'))

Q19-20. Race + Ethnicity
Black    3.352280
Other    2.628803
White    3.104852
dtype: float64

We can also look for variation in likelihood by education. It appears that those without a college degree are more likely to have been in groups.

In [13]:
educ = 'Q18. College Degree'
avg_num = svy1.groupby(educ).apply(
    lambda x: get_group_mean(x, 'ever_in_group'))
avg_num

Q18. College Degree
4         [0.0]
No     0.266648
Yes    0.178007
dtype: object

## Distancing in last 24 hours

The survey also asks how many people the respondent has had face-to-face interactions with in the last 24 hours, and then in how many of those interactions the respondent was able to maintain social distance. The difference is the number of people with whom they weren't able to keep their distance.

However, note top-codes make interpreting this a little tricky...
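A sketch of the difference described above, on made-up answers. The column names `face_to_face` and `kept_distance` are hypothetical stand-ins for Q6 and Q7, and the actual construction of `close_interactions` lives in the pre-processing script, so treat this as an assumption. Rows where either answer is top-coded are flagged because their difference is only approximate.

```python
import pandas as pd

TOP_CODE = 9  # "9 or more" on the phone keypad

toy = pd.DataFrame({
    "face_to_face": [0, 3, 9, 5],   # stand-in for Q6
    "kept_distance": [0, 1, 9, 5],  # stand-in for Q7
})

# People the respondent could not keep distance from.
toy["close"] = toy["face_to_face"] - toy["kept_distance"]
# If either answer is top-coded, the true difference is unknown.
toy["uncertain"] = (toy[["face_to_face", "kept_distance"]] == TOP_CODE).any(axis=1)

print(toy)
```

In the third row both answers are `9`, so the computed difference of 0 could hide any number of close interactions.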

In [14]:
sums = svy1.groupby('close_interactions').apply(
           lambda x: get_group_sumweights(x, 'close_interactions'))
proportions = sums / sums.sum()
proportions

close_interactions
0.0    0.757763
1.0    0.106005
2.0    0.069115
3.0    0.030573
4.0    0.012364
5.0    0.007400
6.0    0.006449
7.0    0.008483
8.0    0.001847
dtype: float64

So about 24% said they had at least one interaction without distancing in week 1 (about 11% had exactly one), and...

In [15]:
sums = svy2.groupby('close_interactions').apply(
           lambda x: get_group_sumweights(x, 'close_interactions'))
proportions = sums / sums.sum()
proportions

close_interactions
0.0    0.767109
1.0    0.087397
2.0    0.086595
3.0    0.034958
4.0    0.006986
5.0    0.011499
6.0    0.002079
7.0    0.002146
8.0    0.001231
dtype: float64

Nearly the same in week 2...

In [35]:
for i in ['< 35', '35 - 55', '55 - 65', '> 65']:
    sums = svy[svy['age_ranges'] == i].groupby('close_interactions').apply(
                                              lambda x: get_group_sumweights(x, 'close_interactions'))
    proportions = sums / sums.sum()
    print(f'for age range {i}:')
    print(proportions.iloc[0:4])

for age range < 35:
close_interactions
0.0    0.734960
1.0    0.096784
2.0    0.091326
3.0    0.034632
dtype: float64
for age range 35 - 55:
close_interactions
0.0    0.797518
1.0    0.080136
2.0    0.068961
3.0    0.024890
dtype: float64
for age range 55 - 65:
close_interactions
0.0    0.789022
1.0    0.090469
2.0    0.057130
3.0    0.032189
dtype: float64
for age range > 65:
close_interactions
0.0    0.779158
1.0    0.115527
2.0    0.051339
3.0    0.028885
dtype: float64


In [36]:
for i in ['Yes', 'No']:
    sums = svy[svy['Q18. College Degree'] == i].groupby('close_interactions').apply(
                                              lambda x: get_group_sumweights(x, 'close_interactions'))
    proportions = sums / sums.sum()
    print(f'for college degree: {i}:')
    print(proportions.iloc[0:4])

for college degree: Yes:
close_interactions
0.0    0.815381
1.0    0.087411
2.0    0.048914
3.0    0.021707
dtype: float64
for college degree: No:
close_interactions
0.0    0.761081
1.0    0.093688
2.0    0.076615
3.0    0.033916
dtype: float64


## Reasons for being near people


In [29]:
for i in ['Family', 'Friends', 'Co-workers', 'Clients, patients, or patrons']:
    svy[f'near_{i}'] = (svy[i] == 'Yes').astype('int')
    svy.loc[pd.isnull(svy[i]), f'near_{i}'] = np.nan
    avgs = get_group_mean(svy, f'near_{i}')
    print(f"Within last 24 hours have you been within 10 feet of a {i}:")
    print(f'{avgs:.2%}')

Within last 24 hours have you been within 10 feet of a Family:
69.42%
Within last 24 hours have you been within 10 feet of a Friends:
20.00%
Within last 24 hours have you been within 10 feet of a Co-workers:
21.21%
Within last 24 hours have you been within 10 feet of a Clients, patients, or patrons:
12.92%


In [31]:
work = 'Q8. HH Member Going to Work'
svy[work].value_counts(normalize=True)

No        0.602773
Yes       0.374860
Unsure    0.022366
Name: Q8. HH Member Going to Work, dtype: float64

In [34]:
sums = svy.groupby(work).apply(lambda x: 
                               get_group_sumweights(x, work))
proportions = sums / sums.sum()
print(f'Share of households with someone going to work:')
print(proportions)

Share of households with someone going to work:
Q8. HH Member Going to Work
No        0.489973
Unsure    0.018572
Yes       0.491455
dtype: float64
