# Political Alignment Case Study

Allen Downey

[MIT License](https://en.wikipedia.org/wiki/MIT_License)

This is the first in a series of notebooks that make up a case study in exploratory data analysis.

In this notebook, we:

1. Read data from the General Social Survey (GSS),

2. Clean the data, particularly dealing with special codes that indicate missing data,

3. Validate the data by comparing the values in the dataset with values documented in the codebook,

4. Generate "resampled" datasets that correct for deliberate oversampling in the dataset, and

5. Store the resampled data in a binary format (HDF5) that makes it easier to work with in the notebooks that follow this one.

If you are running this notebook in Colab, the following cell downloads files and installs some software we need.

If you are running in another environment, it is up to you to download data and install packages.

In [1]:
# If we're running in Colab, set up the environment

import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install empiricaldist
    !git clone --depth 1 https://github.com/AllenDowney/PoliticalAlignmentCaseStudy
    %cd PoliticalAlignmentCaseStudy

The following cells load the packages we need.  If everything works, there should be no error messages.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import utils

### Reading the extract

The data we'll use is from the General Social Survey (GSS).  Using the GSS Data Explorer, I have selected a subset of the variables in the GSS and made it available as an extract.  You can view my project page to see the variables in the extract and their documentation:

https://gssdataexplorer.norc.org/projects/52787/extracts

The following function reads the data files and returns a Pandas DataFrame.

In [3]:
gss = utils.read_gss('gss_eda')
print(gss.shape)
gss.head()

(64814, 105)


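| | year | id_ | agewed | divorce | sibs | childs | age | educ | paeduc | maeduc | ... | ballot | wtssall | adults | compuse | databank | wtssnr | spkrac | spkcom | spkmil | spkmslm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1972 | 1 | 0 | 0 | 3 | 0 | 23 | 16 | 10 | 97 | ... | 0 | 0.4446 | 1 | 0 | 0 | 1.0 | 0 | 1 | 0 | 0 |
| 1 | 1972 | 2 | 21 | 2 | 4 | 5 | 70 | 10 | 8 | 8 | ... | 0 | 0.8893 | 2 | 0 | 0 | 1.0 | 0 | 2 | 0 | 0 |
| 2 | 1972 | 3 | 20 | 2 | 5 | 4 | 48 | 12 | 8 | 8 | ... | 0 | 0.8893 | 2 | 0 | 0 | 1.0 | 0 | 2 | 0 | 0 |
| 3 | 1972 | 4 | 24 | 2 | 5 | 0 | 27 | 17 | 16 | 12 | ... | 0 | 0.8893 | 2 | 0 | 0 | 1.0 | 0 | 1 | 0 | 0 |
| 4 | 1972 | 5 | 22 | 2 | 2 | 2 | 61 | 12 | 8 | 8 | ... | 0 | 0.8893 | 2 | 0 | 0 | 1.0 | 0 | 1 | 0 | 0 |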


### Missing data

For many variables, missing values are encoded with numerical codes that we need to replace before we do any analysis.

For example, for `polviews`, the values 8, 9, and 0 represent "Don't know", "No answer", and "Not applicable".

"Not applicable" usually means the respondent was not asked a particular question.

To keep things simple, we'll treat all of these values as equivalent, but we should keep in mind that we lose some information by doing that.  For example, if a respondent refuses to answer a question, that might suggest something about their answer.  If so, treating their response as missing data might bias the results.

Fortunately, for most questions the number of respondents who refused to answer is small.

The following function replaces invalid data for many of the variables in the GSS dataset.  You'll see how it works in an exercise at the end of this notebook.

In [4]:
utils.gss_replace_invalid(gss)
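
To give a rough idea of what this kind of replacement involves, here is a minimal sketch on a small made-up DataFrame. It is not the code in `utils.gss_replace_invalid`; the toy values, the column `toy_var`, and its codes 98 and 99 are assumptions for illustration. Only the `polviews` codes 0, 8, and 9 come from the text above.

```python
import numpy as np
import pandas as pd

# Toy data; the values are made up for illustration
toy = pd.DataFrame({
    'polviews': [4, 8, 1, 0, 7, 9, 4],
    'toy_var': [30, 98, 45, 99, 62, 51, 98],   # hypothetical variable
})

# Missing-data codes per column; the toy_var codes are an assumption
invalid_codes = {'polviews': [0, 8, 9], 'toy_var': [98, 99]}

# Replace each column's special codes with NaN so pandas treats them as missing
for column, codes in invalid_codes.items():
    toy[column] = toy[column].replace(codes, np.nan)

# Number of missing values in each column
print(toy.isna().sum())
```

After the replacement, methods like `mean` skip the NaN values by default, so the special codes no longer distort summary statistics.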

### Resampling

The GSS uses stratified sampling, which means that some groups are deliberately oversampled to help with statistical validity.

As a result, each respondent has a sampling weight which is proportional to the number of people in the population they represent.

Before running any analysis, we can compensate for stratified sampling by "resampling", that is, by drawing a random sample from the dataset, where each respondent's chance of appearing in the sample is proportional to their sampling weight.

`utils` provides a function to do this resampling.

In [5]:
np.random.seed(19)
sample = utils.resample_by_year(gss, 'wtssall')
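
The idea behind weighted resampling can be sketched with `DataFrame.sample`, which accepts a `weights` argument. The following is only an illustration under the assumption that resampling is done separately within each year, with replacement, keeping the number of rows per year the same. The function names and the toy data are hypothetical, and the actual implementation in `utils` may differ.

```python
import numpy as np
import pandas as pd

def resample_rows_weighted(df, column, rng):
    """Draw len(df) rows with replacement, where each row's chance of
    being chosen is proportional to its weight in `column`."""
    return df.sample(n=len(df), replace=True,
                     weights=df[column], random_state=rng)

def resample_by_year_sketch(df, column, seed=None):
    """Resample the rows within each year separately (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    samples = [resample_rows_weighted(group, column, rng)
               for _, group in df.groupby('year')]
    return pd.concat(samples, ignore_index=True)

# Toy data; the weights and responses are made up
toy = pd.DataFrame({
    'year': [1972, 1972, 1972, 1974, 1974],
    'wtssall': [0.5, 1.0, 2.0, 1.0, 1.0],
    'polviews': [1, 4, 7, 3, 5],
})

toy_sample = resample_by_year_sketch(toy, 'wtssall', seed=19)
print(toy_sample['year'].value_counts().sort_index())
```

In this sketch the resampled dataset has the same number of rows per year as the original, but respondents with larger weights tend to appear more often.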

### Saving the results

I'll save the results to an HDF5 file, which is a binary format that makes it much faster to read the data back.

This file contains three random resamplings of the original dataset.

In [6]:
# if the file already exists, remove it
import os

if os.path.isfile('eds.gss.hdf5'):
    os.remove('eds.gss.hdf5')

In [7]:
for i in range(3):
    np.random.seed(i)
    sample = utils.resample_by_year(gss, 'wtssall')

    key = f'gss{i}'
    sample.to_hdf('eds.gss.hdf5', key=key)

In [8]:
%time gss0 = pd.read_hdf('eds.gss.hdf5', 'gss0')
gss0.shape

CPU times: user 4 ms, sys: 24 ms, total: 28 ms
Wall time: 53.4 ms


(64814, 105)

### Validation

Before working with any dataset, it is important to validate it, which means checking for errors.

The kinds of errors you have to check for depend on the nature of the data, the collection process, how the data is stored and transmitted, etc.

For this dataset, there are a few kinds of validation we'll think about:

1) We need to check the integrity of the dataset; that is, whether the data were corrupted or changed during transmission, storage, or conversion from one format to another.

2) We need to check our interpretation of the data; for example, whether we replaced the right codes for missing data.

3) We will also keep an eye out for data, or patterns, that might indicate problems with the survey process and the recording of the data.  For example, in a different dataset I worked with, I found a surprising number of respondents whose height was supposedly 62 centimeters.  After investigating, I concluded that they were probably 6 feet, 2 inches, and their heights were recorded incorrectly.
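
As a small illustration of the third kind of check, here is a sketch that flags values outside a plausible range. The heights are made up, and the bounds of 100 and 250 centimeters are an assumption chosen for this example, not a rule from any codebook.

```python
import pandas as pd

# Hypothetical heights in centimeters; 62 is implausible for an adult
heights = pd.Series([170, 62, 181, 158, 62, 175])

# Assumed plausible range for adult height; the bounds are illustrative
low, high = 100, 250
suspicious = heights[(heights < low) | (heights > high)]

# The same implausible value appearing repeatedly suggests a recording error
print(suspicious.value_counts())
```

A check like this does not tell you what went wrong, but it points to the rows that deserve a closer look.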

Validating data can be a tedious process, but it is important.  If you interpret data incorrectly and publish invalid results, you will be embarrassed in the best case, and in the worst case you might do serious harm.

However, we don't expect you to validate every variable in this dataset.  Instead, we will demonstrate the process, and then ask you to validate one additional variable as an exercise.

For purposes of validation, we need to use the original dataset before replacing missing data and before resampling.  So I'll load it again:

In [9]:
gss = utils.read_gss('gss_eda')
gss.shape

(64814, 105)

The first variable we'll validate is called `polviews`.  It records responses to the following question:

>We hear a lot of talk these days about liberals and conservatives. 
I'm going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal--point 1--to extremely conservative--point 7. Where would you place yourself on this scale?

You can read the documentation of this variable in the GSS codebook: https://gssdataexplorer.norc.org/projects/52787/variables/178/vshow

The responses are encoded like this:

```
1	Extremely liberal
2	Liberal
3	Slightly liberal
4	Moderate
5	Slghtly conservative
6	Conservative
7	Extrmly conservative
8	Don't know
9	No answer
0	Not applicable
```

Earlier we replaced 8, 9, and 0 with the special value NaN, which represents missing data.  But we just reloaded the original dataset, so here we expect to see the codes 0, 8, and 9 along with the numbers from 1 to 7.

The following function, `values`, takes a series that represents a single variable, and returns the values in the series and their frequencies.

In [10]:
def values(series):
    """Count the values and sort.
    
    series: pd.Series
    
    returns: series mapping from values to frequencies
    """
    return series.value_counts().sort_index()

Here are the values for the variable `polviews`.

In [11]:
values(gss['polviews'])

0     6777
1     1682
2     6514
3     7010
4    21370
5     8690
6     8230
7     1832
8     2326
9      383
Name: polviews, dtype: int64

We can select values from a single year like this:

In [12]:
one_year = (gss['year']==1974)
polviews_one_year = gss.loc[one_year, 'polviews']

And look at the values and their frequencies:

In [13]:
values(polviews_one_year)

1     22
2    201
3    207
4    564
5    221
6    160
7     35
8     70
9      4
Name: polviews, dtype: int64

If you compare these results to the values in the codebook, you should see that they agree.
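
If you prefer not to compare the numbers by eye, one option is to type the documented counts into a Series and compare them programmatically. The following is a minimal sketch; the responses and the "documented" counts are made up, so substitute the real values from the codebook.

```python
import pandas as pd

def values(series):
    """Count the values and sort by value."""
    return series.value_counts().sort_index()

# Made-up responses standing in for one year of one variable
responses = pd.Series([1, 2, 2, 4, 4, 4, 8])

# Counts as they might be typed in by hand from a codebook (hypothetical)
documented = pd.Series({1: 1, 2: 2, 4: 3, 8: 1})

observed = values(responses)

# Line up the two sets of counts by value and flag any disagreement
comparison = pd.DataFrame({'observed': observed, 'documented': documented})
comparison['match'] = comparison['observed'] == comparison['documented']
print(comparison)
```

A value that appears in only one of the two Series shows up as NaN in the other column, and its `match` entry is `False`.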

Exercise: Go back and change 1974 to another year, and compare the results to the codebook.

Now let's replace the numeric codes that represent missing data.

In [14]:
replaced = gss['polviews'].replace([0, 8, 9], np.nan)

We can count the number of valid responses:

In [15]:
replaced.notna().sum()

55328

And the number of missing responses:

In [16]:
replaced.isna().sum()

9486

If you compare these results to the numbers in the codebook, they don't match!

It turns out that the codebook has not been updated yet with the 2018 data.

If we subtract the missing cases from 2018, we get the number that's in the codebook.

In [17]:
one_year = (gss['year']==2018)
polviews_one_year = gss.loc[one_year, 'polviews'].replace([0, 8, 9], np.nan)

replaced.isna().sum() - polviews_one_year.isna().sum()

9385
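
A related check is to count missing responses year by year, which makes it easier to see which survey year accounts for a discrepancy like this one. Here is a sketch on made-up data; the years and responses are hypothetical, and the codes 0, 8, and 9 are the `polviews` missing-data codes used above.

```python
import numpy as np
import pandas as pd

# Toy data; the responses are made up
toy = pd.DataFrame({
    'year': [2016, 2016, 2016, 2018, 2018, 2018],
    'polviews': [4, 8, 1, 0, 9, 5],
})

replaced_toy = toy['polviews'].replace([0, 8, 9], np.nan)

# Number of missing responses in each year
missing_by_year = replaced_toy.isna().groupby(toy['year']).sum()
print(missing_by_year)

# Total missing responses, excluding the most recent year
total_missing = replaced_toy.isna().sum()
print(total_missing - missing_by_year.loc[2018])
```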

Exercise: In order to validate the other variables, we ask each person who works with this notebook to validate one variable.

If you run the following cell, it will choose one of the columns from the dataset at random.  That's the variable you will check.

If you get `year` or `id_`, run the cell again to get a different variable name.

In [18]:
np.random.seed(None)
np.random.choice(gss.columns)

'pres08'

Go back through the cells in this section and replace `polviews` with your random variable.  Then run the cells again and go to this online survey to report the results: https://forms.gle/tmST8YCu4qLc414F7

Note: Not all questions were asked during all years, so you might have to choose a different year to check.
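
To find a year when your variable was asked, you can count the valid responses in each year. This sketch uses made-up data and assumes that 0, 8, and 9 are the missing-data codes for the hypothetical column `toy_var`; check the codebook for the codes that apply to your variable.

```python
import numpy as np
import pandas as pd

# Toy data; in 1972 and 2008 nobody gave a valid response
toy = pd.DataFrame({
    'year': [1972, 1972, 1974, 1974, 2008, 2008],
    'toy_var': [0, 0, 1, 2, 0, 9],   # hypothetical variable
})

# Assumed missing-data codes for toy_var
valid = toy['toy_var'].replace([0, 8, 9], np.nan).notna()

# Count valid responses per year and keep the years with at least one
valid_by_year = valid.groupby(toy['year']).sum()
years_asked = valid_by_year[valid_by_year > 0].index.tolist()
print(years_asked)
```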