# Quiz 3

BEFORE YOU START THIS QUIZ:

1. Click on "Copy to Drive" to make a copy of the quiz.

2. Click on "Share".

3. Click on "Change" and select "Anyone with this link can edit".

4. Click "Copy link".

5. Paste the link into [this Canvas assignment](https://canvas.olin.edu/courses/390/assignments/6149).

DO THIS BEFORE YOU START THE QUIZ.

The following cells define functions we have used before.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
def values(series):
    """Count the values and sort.

    series: pd.Series

    returns: series mapping from values to frequencies
    """
    return series.value_counts().sort_index()
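
As a quick illustration of what `values` returns, here is a minimal sketch on a made-up `Series` (the data are hypothetical, not from the GSS). Note that `value_counts` drops `NaN` by default, so missing values do not appear in the result.

```python
import numpy as np
import pandas as pd


def values(series):
    """Count the values and sort by value (same as the function above)."""
    return series.value_counts().sort_index()


# Hypothetical coded responses, including one missing value
toy = pd.Series([4, 1, 2, 1, np.nan, 2, 1])

counts = values(toy)
print(counts)
```

The toy `Series` has seven entries, but the counts add up to six because the `NaN` is left out.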

In [3]:
def decorate(**options):
    """Decorate the current axes.

    Call decorate with keyword arguments like
    decorate(title='Title',
             xlabel='x',
             ylabel='y')

    The keyword arguments can be any of the axis properties
    https://matplotlib.org/api/axes_api.html
    """
    ax = plt.gca()
    ax.set(**options)

    handles, labels = ax.get_legend_handles_labels()
    if handles:
        ax.legend(handles, labels)

    plt.tight_layout()

In [4]:
from statsmodels.nonparametric.smoothers_lowess import lowess


def make_lowess(series):
    """Use LOWESS to compute a smooth line.

    series: pd.Series

    returns: pd.Series
    """
    y = series.values
    x = series.index.values

    smooth = lowess(y, x)
    index, data = np.transpose(smooth)

    return pd.Series(data, index=index)

In [5]:
def plot_series_lowess(series, color="C0"):
    """Plots a series of data points and a smooth line.

    series: pd.Series
    color: string or tuple
    """
    series.plot(linewidth=0, marker="o", color=color, alpha=0.5)
    smooth = make_lowess(series)
    smooth.plot(label="_", color=color)

The following cell installs `empiricaldist` if needed.

In [6]:
try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

The following cell downloads the GSS data.

In [7]:
from os.path import basename, exists


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download(
    "https://github.com/AllenDowney/PoliticalAlignmentCaseStudy/"
    + "raw/update2021/gss_eds.hdf5"
)

This data file contains the original data, which uses stratified sampling, so it is not a representative sample.

In [8]:
datafile = "gss_eds.hdf5"
gss = pd.read_hdf(datafile, "gss")
gss.shape

## The "Rise of the Nones"

During the last 20 years or so, the number of people who say they have no specific religious affiliation has increased substantially.
These people are often called "Nones" because when they are asked their religious preference, they choose "None".
In this notebook, we'll explore this so-called "Rise of the Nones".

In the GSS dataset, the variable `relig` records responses to the following question:

> What is your religious preference?  Is it Protestant, Catholic, Jewish, some other religion, or no religion?

You can read the [documentation of this variable here](https://gssdataexplorer.norc.org/variables/287/vshow).

The following cell selects this column from the GSS `DataFrame` and displays the distribution of the responses.

In [9]:
relig = gss["relig"]
values(relig)

**Question 1:** Before we go on, it is a good idea to check how many values are missing.
Use `isna` to check which values in `relig` are `NaN`, and display the total number of missing values.

**Question 2:** Let's also spot-check the data to make sure it is consistent with the code book. Select only the rows from 1972 and use `values` to display the values of `relig` and how many times each one appears. 

Add a comment to indicate whether the results are [consistent with the code book](https://gssdataexplorer.norc.org/variables/287/vshow).

**Question 3:** Use the `empiricaldist` library to create a `Pmf` object that represents the distribution of religious preferences.

**Question 4:** Make a bar chart that shows the quantities in the `Pmf` on the x-axis and the fraction of people who reported each preference on the y-axis. 

**Question 5:** On the previous graph, add appropriate labels for the x- and y-axis, and give the whole graph an appropriate title.
You can use my `decorate` function or call Pyplot functions directly, whichever you prefer.

**Question 6:** One problem with the previous figure is that it is hard to see some of the small bars. In the next cell, make a version of the same figure where the y-axis is on a logarithmic scale.

You can ignore the following empty cells.

## Fraction of "Nones" over time

Now let's see how the fraction of "Nones" has changed over time.

**Question 7:** Add a new column named `none` to the GSS `DataFrame`; it should be a Boolean Series that is `True` where `relig` equals 4 and `False` otherwise.

You can use the following cell to check whether it worked.
The value `True` should appear 8918 times. 

In [19]:
values(gss["none"])

**Question 8:** Use `groupby` to group the GSS DataFrame by year.
Then select the `none` column from the resulting `GroupBy` object and compute its mean.
The result should be a `Series` that contains years as labels in the index and the fraction of "Nones" as values.

You can use the following cell to plot the results.

In [22]:
plot_series_lowess(none_by_year)

decorate(
    xlabel="Year",
    ylabel="Fraction reporting none",
    title="Religious preference vs year of interview",
)

## Group by Age

The following cells create a new column called `age10` that puts people into 10-year age groups. You don't need to understand all of the details.

In [23]:
bins = np.arange(17, 100, 10)
bins

In [24]:
labels = bins[:-1] + 5
labels

In [25]:
gss["age10"] = pd.cut(gss["age"], bins, labels=labels).astype(float)
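
To see what `pd.cut` does here, this sketch applies the same bins and labels to a few made-up ages (hypothetical values, not GSS data). Each bin is open on the left and closed on the right, which is the `pd.cut` default, so 27 falls in the group labeled 22 and 28 falls in the group labeled 32.

```python
import numpy as np
import pandas as pd

bins = np.arange(17, 100, 10)   # bin edges: 17, 27, ..., 97
labels = bins[:-1] + 5          # bin midpoints: 22, 32, ..., 92

# Hypothetical ages chosen to hit bin edges and out-of-range values
ages = pd.Series([18, 27, 28, 45, 89, 17, 99])

# Same call as above, applied to the toy Series instead of gss["age"]
age10 = pd.cut(ages, bins, labels=labels).astype(float)
print(age10.tolist())
```

The first five ages map to 22, 22, 32, 42 and 92. The last two (17 and 99) fall outside the bins, so they come back as `NaN`.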

Here are the values of `age10` and the number of times each one appears.

In [26]:
values(gss["age10"])

The following figure shows the fraction of Nones in each age group.

In [27]:
gss.groupby("age10")["none"].mean().plot(style="-o", color="C1")

decorate(
    xlabel="Age", ylabel="Fraction reporting None", title="Religious preference vs age"
)
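
The computation above works because the mean of a Boolean column is the fraction of `True` values (`True` counts as 1 and `False` as 0). Here is a minimal sketch with a made-up `DataFrame`; the rows are hypothetical, and only the column names match the GSS data.

```python
import pandas as pd

# Hypothetical respondents: age group and whether they reported no religion
toy = pd.DataFrame(
    {
        "age10": [22.0, 22.0, 22.0, 22.0, 62.0, 62.0],
        "none": [True, True, False, False, False, False],
    }
)

# Mean of a Boolean column within each group = fraction of True in that group
frac = toy.groupby("age10")["none"].mean()
print(frac)
```

In this toy data, two of the four respondents in the youngest group are `True`, so the result for that group is 0.5, and the older group comes out as 0.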

## Group by Cohort

The column `cohort` contains the year each respondent was born. It is called `cohort` because all of the people born in the same period of time belong to the same "birth cohort".

The following cells create a new column called `cohort10` that puts people into 10-year birth cohorts.

In [28]:
bins = np.arange(1885, 2021, 10)
bins

In [29]:
labels = bins[:-1] + 5
labels

In [30]:
gss["cohort10"] = pd.cut(gss["cohort"], bins, labels=labels).astype(float)

Here are the values of `cohort10` and the number of times each one appears.

In [31]:
values(gss["cohort10"])

The following figure shows the fraction of Nones in each birth cohort.

In [32]:
gss.groupby("cohort10")["none"].mean().plot(style="-s", color="C2")

decorate(
    xlabel="Birth cohort",
    ylabel="Fraction reporting none",
    title="Religious preference vs cohort",
)

**Question 9:** Make a version of the previous figure that plots the line and markers using a color selected from any Seaborn palette. You can [read about palettes here](https://seaborn.pydata.org/tutorial/color_palettes.html).

## Cross Tabulation

**Question 10:** Make a cross tabulation that has values of `cohort10` as labels in the index and values of `age10` as column names.
The values in the table should be the number of people in each cohort who were interviewed at each age.

You can use the following cell to plot the results (assuming you assign the cross tabulation to `xtab`).

In [36]:
plt.pcolormesh(xtab.columns, xtab.index, xtab, shading="nearest")
plt.gca().invert_yaxis()

decorate(xlabel="Age", ylabel="Cohort", title="Cross Tabulation of Cohort and Age")

## Pivot Table

The following question is just for fun -- I won't grade it because we learned about pivot tables too recently.

**Question 11 Just for Fun:** Make a pivot table that has values of `age10` as labels in the index and values of `cohort10` as column names. The values in the pivot table should be the fraction of "Nones" in each age group and birth cohort.

You can use the following cell to plot the results.

In [39]:
table.plot(style="-o")

decorate(
    xlabel="Age",
    ylabel="Fraction reporting none",
    title="Religious preference vs age, grouped by birth cohort",
)

Ignore the following empty cells.