# Political Alignment Case Study

Allen Downey

[MIT License](https://en.wikipedia.org/wiki/MIT_License)

This is the third in a series of notebooks that make up a case study in exploratory data analysis.

In this notebook, we explore the relationship between political alignment and three variables that reflect "outlook", specifically, beliefs about whether other people are generally fair, helpful, and trustworthy.

1. We use the Pandas function [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) to plot the average response to several questions as a function of time.

2. We use the Pandas function [`pivot_table`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html) to compute the average response as a function of time, grouped by political alignment.

3. And we use resampling to see whether the features we see in the figures might be due to random sampling, or whether they are likely to reflect actual changes in the world.

If you are running this notebook in Colab, the following cell downloads files and installs some software we need.

If you are running in another environment, it is up to you to download data and install packages.

In [1]:
# If we're running in Colab, set up the environment

import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install empiricaldist
    !git clone --depth 1 https://github.com/AllenDowney/ExploratoryDataAnalysis
    %cd ExploratoryDataAnalysis

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from utils import decorate

# import two functions we wrote in the previous notebook
from utils import plot_series_lowess, plot_columns_lowess

### Loading the GSS data

If the HDF5 file doesn't exist, create it.

In [3]:
import os

# the cell below calls functions as utils.<name>, so import the module itself
import utils

filename = 'eds.gss.hdf5'
if os.path.isfile(filename):
    print('File already exists; no need to create it.')
else:
    # read and clean the data
    gss = utils.read_gss('gss_eda')
    utils.gss_replace_invalid(gss)
    
    # resample and save the resampled DataFrames
    for i in range(3):
        np.random.seed(i)
        sample = utils.resample_by_year(gss, 'wtssall')

        key = f'gss{i}'
        sample.to_hdf(filename, key=key)

Load one of the resampled DataFrames.

In [4]:
gss = pd.read_hdf('eds.gss.hdf5', 'gss0')
gss.shape

Make a color dictionary.

In [5]:
muted = sns.color_palette('muted', 5)
sns.palplot(muted)

In [6]:
colors = {'Conservative': muted[3], 
              'Moderate': muted[4], 
               'Liberal': muted[0]}

### 3-point scale

To make it easier to visualize groups, I'm going to lump the 7-point scale into a 3-point scale.

In [7]:
def make_polviews3(df):
    """Replace 7 point scale with 3 point scale.
    
    df: DataFrame
    """
    d = {1:'Liberal', 
         2:'Liberal', 
         3:'Liberal', 
         4:'Moderate', 
         5:'Conservative', 
         6:'Conservative', 
         7:'Conservative'}
    
    df['polviews3'] = df.polviews.replace(d)
    
make_polviews3(gss)

With this scale, there are roughly the same number of people in each group.

In [8]:
def values(series):
    return series.value_counts().sort_index()

values(gss['polviews3'])

## Fair

In this notebook, we explore relationships between political alignment (conservative, moderate, liberal) and "outlook", which means the way people perceive the world.  Specifically, we'll look at their responses to questions about

1. Whether people generally try to be fair.

2. Whether people can generally be trusted.

3. Whether people generally try to be helpful.

Before we look at results, you might want to guess: do you think conservatives or liberals tend to give more positive responses to these questions?

Let's see if the data are consistent with your expectations.

The [first question](https://gssdataexplorer.norc.org/projects/52787/variables/440/vshow) we'll look at is:

> Do you think most people would try to take advantage of you if they got a chance, or would they try to be fair?

The possible responses are:

```
1	Take advantage
2	Fair
3	Depends
```

As always, we'll start by looking at the distribution of responses, that is, how many people give each response:

In [9]:
values(gss['fair'])

The plurality think people try to be fair, so that's promising.

These responses are categorical, but we can put them on a numerical scale, representing the most positive response (Fair) with 1, the most negative response (Take advantage) with 0, and "depends" with 0.5.

This scale is arbitrary, but it lets us quantify changes over time.

In [10]:
d = {1:0, 2:1, 3:0.5}
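# assign the result back; in-place replace on a selected column
# is not reliable in recent versions of Pandas
gss['fair'] = gss['fair'].replace(d)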
values(gss['fair'])
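
To see what the mean of the recoded variable represents, here is a small sketch with made-up responses (the Series `responses` is hypothetical, not GSS data). Because "Depends" counts as 0.5, the mean should equal the fraction saying "Fair" plus half the fraction saying "Depends", computed over the people who answered.

```python
import numpy as np
import pandas as pd

# Hypothetical responses: 1=Take advantage, 2=Fair, 3=Depends, NaN=not asked
responses = pd.Series([1, 2, 2, 3, 2, 1, np.nan, 2])

# Same recoding as above: most negative -> 0, most positive -> 1
recoded = responses.replace({1: 0, 2: 1, 3: 0.5})

# Fractions among the people who answered
valid = responses.dropna()
frac_fair = (valid == 2).mean()
frac_depends = (valid == 3).mean()

# mean() skips NaN by default, so the two quantities should agree
print(recoded.mean(), frac_fair + 0.5 * frac_depends)
```

So when few people answer "Depends", the mean is close to the fraction giving the positive response.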

We saw the `groupby` function in the previous notebook; now we wrap it in a function that takes a DataFrame, groups by year, and then computes the mean of the given variable.

In [11]:
def group_by_year(df, varname):
    """Group by year and compute mean of `varname`.
    
    df: DataFrame
    varname: string variable name
    
    returns: Series
    """
    grouped = df.groupby('year')
    return grouped[varname].mean().dropna()
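
As a rough illustration of why `group_by_year` ends with `dropna`, here is the same sequence of steps applied to a tiny, made-up DataFrame (the name `toy` and its values are hypothetical). A year in which the question was not asked has no valid responses, so its mean is `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the question was not asked in 1973
toy = pd.DataFrame({
    'year': [1972, 1972, 1973, 1973, 1974, 1974],
    'fair': [1.0, 0.0, np.nan, np.nan, 0.5, 1.0],
})

# Same steps as group_by_year: group, select the column, average
means = toy.groupby('year')['fair'].mean()

# 1973 has no valid responses, so its mean is NaN and dropna removes it
print(means.dropna())
```

Dropping those years keeps gaps in the survey from showing up as missing points when we plot the series.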

We also saw `decorate` in the previous notebook; now we wrap it in a function that lets us avoid repeating the elements that are common across figures.

In [12]:
def decorate_by_year(**options):
    """Label the axes.
    
    options: keyword arguments passed to `decorate`.
    """
    decorate(xlabel='Year',
             ylabel='Fraction saying yes',
             xlim=[1970, 2020],
             **options)

Here's the fraction of people who say people try to be fair, plotted over time.  As in the previous notebook, we plot the data points themselves with circles and a local regression model as a line.

In [13]:
mean_by_year = group_by_year(gss, 'fair')
plot_series_lowess(mean_by_year, 'C1')

title = 'Would most people try to be fair?'
decorate_by_year(title=title)

The following function uses the Pandas function `pivot_table` to make a DataFrame with one row for each year and one column for each value of `polviews3`, where each element is the mean response to `fair` for one year and one political alignment.

In [14]:
def group_by_polviews(df, varname):
    """Group by polviews and year, and compute mean of varname.
    
    df: DataFrame
    varname: string variable name
    
    returns: DataFrame with one row per year,
             one column per value in 'polviews3'
    """
    return df.pivot_table(values=varname, 
                          index='year', 
                          columns='polviews3', 
                          aggfunc='mean')

Here's how we call it, along with the first few rows of the resulting table.

In [15]:
mean_by_polviews = group_by_polviews(gss, 'fair')
mean_by_polviews.head()
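
If the shape of the pivot table is hard to picture, the following sketch builds one from a few made-up rows (the DataFrame `toy` is hypothetical) and compares it with a table built using `groupby` and `unstack`, which is one way to think about what `pivot_table` computes here.

```python
import pandas as pd

# Hypothetical data with the same column names used in this notebook
toy = pd.DataFrame({
    'year': [1972, 1972, 1972, 1974, 1974, 1974],
    'polviews3': ['Liberal', 'Liberal', 'Conservative',
                  'Liberal', 'Conservative', 'Conservative'],
    'fair': [1.0, 0.5, 0.0, 1.0, 1.0, 0.0],
})

# One row per year, one column per group, mean of `fair` in each cell
table = toy.pivot_table(values='fair', index='year',
                        columns='polviews3', aggfunc='mean')

# A similar table built with groupby: group by both keys, then
# move the inner index level (polviews3) into the columns
alt = toy.groupby(['year', 'polviews3'])['fair'].mean().unstack()

print(table)
print(alt)
```

Both approaches give one row per year and one column per group; `pivot_table` expresses that in a single call.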

Now we can use `plot_columns_lowess` to see the results.

In [16]:
columns = ['Conservative', 'Liberal', 'Moderate']
plot_columns_lowess(mean_by_polviews, columns, colors)
decorate_by_year(title=title)

In the following two sections, you can do the same analysis with the other two outlook variables, `trust` and `helpful`.

## Trust

**Exercise:**  Follow the instructions below to generate similar graphs for responses to [this question](https://gssdataexplorer.norc.org/projects/52787/variables/441/vshow):

> Generally speaking, would you say that most people can be trusted or that you can't be too careful in dealing with people?

```
1	Can trust
2	Cannot trust
3	Depends
```

Use `values` to display the distribution of responses.

In [17]:
# Solution goes here

Use `replace` to recode the responses so the most positive response is 1, the most negative response is 0, and "Depends" is 0.5.

In [18]:
# Solution goes here

Use `group_by_year` and `plot_series_lowess` to plot the fraction of people giving the positive response over time.

In [19]:
# Solution goes here

Use `group_by_polviews` to make a pivot table with the average response as a function of time, grouped by `polviews3`.

In [20]:
# Solution goes here

Use `plot_columns_lowess` to plot the average response over time, grouped by `polviews3`.

In [21]:
# Solution goes here

## Helpful

**Exercise:** Generate similar figures for responses to [this question](https://gssdataexplorer.norc.org/projects/52787/variables/439/vshow):

> Would you say that most of the time people try to be helpful, or that they are mostly just looking out for themselves?

```
1	Helpful
2	Lookout for self
3	Depends
```

Hint: You can select a set of cells by clicking on the first and shift-clicking on the last.  Then you can copy and paste them.

In [22]:
# Solution goes here

## Simulating possible datasets

The figures we have generated so far in this notebook are based on a single resampling of the GSS data.  Some of the features we see in these figures might be due to random sampling, not necessarily actual changes in the world.

By generating the same figures again using additional resampled datasets, we can get a sense of how much variation there is due to random sampling.

To make that easier, the following function contains the code from the previous analysis, all in one place.

In [23]:
def plot_by_polviews(df, varname, title):
    """Plot mean response by polviews and year.
    
    df: DataFrame
    varname: name of the variable to plot
    title: string title for the figure
    """
    make_polviews3(df)
    mean_by_polviews = group_by_polviews(df, varname)
    plot_columns_lowess(mean_by_polviews, columns, colors)
    decorate_by_year(title=title)

Now we can loop through the three resampled datasets in the HDF5 file and generate a figure for each one.

In [24]:
varname = 'fair'
title = 'Would most people try to be fair?'

for i in range(3):
    key = f'gss{i}'

    plt.figure()
    df = pd.read_hdf('eds.gss.hdf5', key)
    d = {1:0, 2:1, 3:0.5}
    # assign the result back rather than replacing in place
    df[varname] = df[varname].replace(d)
    plot_by_polviews(df, varname, title)

Features that are the same in all three figures are more likely to reflect things actually happening in the world.  Features that differ substantially between the figures are more likely to be artifacts of random sampling.

In this context, "artifact" has the sense of ["something observed in a scientific investigation or experiment that is not naturally present but occurs as a result of the preparative or investigative procedure"](https://www.lexico.com/en/definition/artifact).
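
To make the idea of resampling more concrete, here is a minimal sketch of weighted resampling within each year, using the Pandas method `sample` with `replace=True`. It is not necessarily how `utils.resample_by_year` is implemented; the helper name `resample_by_year_sketch` and the data in `toy` are made up for illustration.

```python
import numpy as np
import pandas as pd

def resample_by_year_sketch(df, weight_col, seed):
    """Draw a weighted sample with replacement within each year.

    Hypothetical helper, for illustration only.
    """
    rng = np.random.default_rng(seed)
    samples = []
    for _, group in df.groupby('year'):
        # Draw as many rows as the group has, with replacement,
        # with probability proportional to the sampling weight
        sample = group.sample(n=len(group), replace=True,
                              weights=group[weight_col],
                              random_state=rng)
        samples.append(sample)
    return pd.concat(samples, ignore_index=True)

# Made-up data: 200 respondents per year with unequal weights
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'year': np.repeat([1972, 1974], 200),
    'fair': rng.choice([0.0, 0.5, 1.0], size=400),
    'wtssall': rng.uniform(0.5, 2.0, size=400),
})

# Compare the yearly means across three resamplings; differences
# between them come only from random sampling
for seed in range(3):
    sample = resample_by_year_sketch(toy, 'wtssall', seed)
    print(sample.groupby('year')['fair'].mean().round(3).to_dict())
```

The underlying data are the same in every iteration, so any spread in the printed means gives a sense of how much a feature in a figure could move from one resampling to the next.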