# Political Views

In [1]:
try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from empiricaldist import Pmf

In [3]:
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

download('https://github.com/AllenDowney/PoliticalAlignmentCaseStudy/' +
         'raw/update2021/gss_eds.3.hdf5')

In [4]:
datafile = 'gss_eds.3.hdf5'
gss = pd.read_hdf(datafile, 'gss0')
gss.shape

## Political alignment

The people surveyed as part of the GSS were asked about their "political alignment", that is, where they place themselves on a spectrum from liberal to conservative.

The variable `polviews` contains responses to the following question (see <https://gssdataexplorer.norc.org/projects/52787/variables/178/vshow>):

> We hear a lot of talk these days about liberals and conservatives. 
I'm going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal--point 1--to extremely conservative--point 7. Where would you place yourself on this scale?

Here are the valid responses:

```
1	Extremely liberal
2	Liberal
3	Slightly liberal
4	Moderate
5	Slightly conservative
6	Conservative
7	Extremely conservative
```

To see how the responses have changed over time, we'll inspect them at the beginning and end of the observation period.
First I'll select the column.

In [9]:
def values(series):
    """Count the values and sort.
    
    series: pd.Series
    
    returns: series mapping from values to frequencies
    """
    return series.value_counts().sort_index()
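
As a minimal sketch of how a Boolean `Series` and `values` fit together, the cell below uses a small made-up `DataFrame` in place of the GSS data. The name `toy` and all of its numbers are assumptions for illustration only.

```python
import pandas as pd

# Made-up responses, NOT GSS data: one row per respondent.
toy = pd.DataFrame({
    'year':     [1974, 1974, 1974, 1974, 2021, 2021, 2021],
    'polviews': [4.0, 4.0, 6.0, 2.0, 4.0, 1.0, 7.0],
})

def values(series):
    """Count the values and sort (same helper as in the notebook)."""
    return series.value_counts().sort_index()

# Boolean Series that is True for rows from one survey year
in_1974 = (toy['year'] == 1974)

# Use it to select the matching responses, then count them
counts_1974 = values(toy.loc[in_1974, 'polviews'])
print(counts_1974)
```

The result is a `Series` whose index holds the response codes and whose values hold the number of times each code appears.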

**Exercise:** Create a Boolean `Series` that is `True` when `year` is 2021, then use it to select corresponding values from `polviews`. Use `values` to display the selected values and how many times each one appears. 

## PMFs

To visualize these distributions, we'll use a Probability Mass Function (PMF). A PMF is similar to a histogram, with two differences:

* In a histogram, values are often put in bins, with more than one value in each bin. In a PMF each value gets its own bin.

* A histogram computes a count, that is, how many times each value appears; a PMF computes a probability, that is, what fraction of the time each value appears. 

I'll use the `Pmf` class from `empiricaldist` to compute a PMF.
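
To make the difference between counts and probabilities concrete, here is a rough sketch that uses only pandas. It relies on `value_counts(normalize=True)` as a stand-in and is not the `Pmf` class itself; the responses are made up.

```python
import pandas as pd

# Made-up responses on the seven-point scale, NOT GSS data
responses = pd.Series([4, 4, 4, 2, 6, 6, 1, 4, 5, 6])

# Histogram-style counts: how many times each value appears
counts = responses.value_counts().sort_index()

# PMF-style probabilities: what fraction of the time each value appears
probs = responses.value_counts(normalize=True).sort_index()

print(counts)
print(probs)
```

Each value keeps its own entry rather than being binned, and the probabilities add up to 1.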

In [13]:
def decorate(**options):
    """Decorate the current axes.
    
    Call decorate with keyword arguments like
    decorate(title='Title',
             xlabel='x',
             ylabel='y')
             
    The keyword arguments can be any of the axis properties
    https://matplotlib.org/api/axes_api.html
    """
    ax = plt.gca()
    ax.set(**options)
    
    handles, labels = ax.get_legend_handles_labels()
    if handles:
        ax.legend(handles, labels)

    plt.tight_layout()

## Groupby



**Exercise:** The standard deviation quantifies the spread of a distribution, which is one way to measure polarization.
Plot the standard deviation of `polviews` for each year of the survey from 1972 to 2021.
Does it show evidence of increasing polarization?
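
The grouping pattern behind a per-year summary can be sketched on made-up data. The example below computes the mean of `polviews` for each year; the `toy` frame and its values are assumptions, not GSS data.

```python
import pandas as pd

# Made-up responses, NOT GSS data
toy = pd.DataFrame({
    'year':     [1972, 1972, 1972, 1974, 1974, 1974],
    'polviews': [3.0, 4.0, 5.0, 4.0, 5.0, 6.0],
})

# Split the rows into one group per survey year
by_year = toy.groupby('year')

# Select one column from each group and summarize it
mean_by_year = by_year['polviews'].mean()
print(mean_by_year)
```

The result is a `Series` indexed by year, so it can be plotted directly with its `plot` method. Other summary methods on the grouped column follow the same pattern.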

## Local Regression

In the previous section we plotted the mean and standard deviation of `polviews` over time. Both plots are quite noisy.
We can use **local regression** to compute a smooth line through these data points (see <https://en.wikipedia.org/wiki/Local_regression>).  

The following function takes a Pandas Series and uses an algorithm called LOWESS to compute a smooth line.  LOWESS stands for "locally weighted scatterplot smoothing".

In [26]:
from statsmodels.nonparametric.smoothers_lowess import lowess

def make_lowess(series):
    """Use LOWESS to compute a smooth line.
    
    series: pd.Series
    
    returns: pd.Series
    """
    # take the data out of the Series
    y = series.values
    x = series.index.values

    # compute the local regression (LOWESS)
    smooth = lowess(y, x)
    
    # put the result into a Series
    index, data = np.transpose(smooth)
    return pd.Series(data, index=index) 
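
To give a feel for what local regression does, here is a simplified NumPy sketch. It is not the statsmodels implementation: it assumes a tricube kernel and a nearest-neighbor bandwidth, and it skips the robustness iterations that `lowess` runs by default. The name `local_linear_smooth` and the sample data are hypothetical.

```python
import numpy as np

def local_linear_smooth(x, y, frac=2/3):
    """Simplified local regression (illustrative sketch only).

    At each x[i], fit a weighted straight line to the nearest
    fraction `frac` of the points and evaluate it at x[i].
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(3, int(np.ceil(frac * n)))  # points in each neighborhood
    smoothed = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        h = np.sort(dist)[k - 1]        # distance to the k-th nearest point
        # tricube weights: near points count more, far points get zero
        w = np.clip(1 - (dist / h) ** 3, 0, 1) ** 3
        # weighted least squares for a line centered at x[i]
        sw = np.sqrt(w)
        A = np.column_stack([np.ones(n), x - x[i]]) * sw[:, None]
        coef, *_ = np.linalg.lstsq(A, y * sw, rcond=None)
        smoothed[i] = coef[0]           # fitted value at x[i]
    return smoothed

# Made-up noisy series, NOT GSS data
rng = np.random.default_rng(0)
x = np.arange(1972, 1992)
y = 4.0 + 0.01 * (x - 1972) + rng.normal(0, 0.05, size=len(x))
smooth = local_linear_smooth(x, y)
```

Because each fitted value comes from a line through nearby points only, the curve can bend to follow slow trends while averaging out year-to-year noise.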

In [27]:
def plot_series_lowess(series, color):
    """Plot a series of data points and a smooth line.
    
    series: pd.Series
    color: string or tuple
    """
    # plot the data with circles
    series.plot(linewidth=0, marker='o', color=color, alpha=0.5)
    
    # plot the local regression with a line
    smooth = make_lowess(series)
    smooth.plot(label='', color=color)

**Exercise:** Use `plot_series_lowess` to plot the standard deviation of `polviews` with a smooth line.

## Cross Tabulation

In the previous sections, we treated `polviews` as a numerical quantity, so we were able to compute means and standard deviations.
But the responses are really categorical, which means that each value represents a discrete category, like "liberal" or "conservative".
In this section, we'll treat `polviews` as a categorical variable.  Specifically, we'll compute the number of respondents in each category for each year, and plot changes over time.

Pandas provides a function called `crosstab` that computes a **cross tabulation** (see <https://en.wikipedia.org/wiki/Contingency_table>).
It takes two `Series` objects as arguments and returns a `DataFrame`.
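
As a small sketch of what `crosstab` returns, the example below tabulates made-up responses by year. The `toy` frame and its string labels are assumptions for illustration, not GSS data.

```python
import pandas as pd

# Made-up responses, NOT GSS data
toy = pd.DataFrame({
    'year':     [1974, 1974, 1974, 2021, 2021, 2021, 2021],
    'polviews': ['Liberal', 'Moderate', 'Moderate',
                 'Conservative', 'Liberal', 'Moderate', 'Conservative'],
})

# Rows are years, columns are categories, cells are respondent counts
xtab = pd.crosstab(toy['year'], toy['polviews'])
print(xtab)

# With normalize='index', each row holds fractions that sum to 1
xtab_norm = pd.crosstab(toy['year'], toy['polviews'], normalize='index')
print(xtab_norm)
```

Combinations that never occur, such as "Conservative" in 1974 here, appear as 0 rather than being dropped.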

## Plotting

Political Alignment Case Study

Copyright 2020 Allen B. Downey

License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)