In [None]:
import pandas as pd
import numpy as np
import os

import matplotlib.pyplot as plt
# plt.rcParams['figure.figsize'] = (10, 5)
plt.style.use('seaborn-v0_8-white')   # seaborn custom plot style (called 'seaborn-white' before matplotlib 3.6)
plt.rc('figure', dpi=100, figsize=(10, 5))   # set default size/resolution
plt.rc('font', size=12)   # font size

# Lecture 5 – Unfaithful Data, Hypothesis Testing

## DSC 80, Spring 2022

### Announcements

- Project 1 is released!
    - The Checkpoint is due **tomorrow at 11:59PM**.
    - The whole project is due on **Thursday, April 14th at 11:59PM**.
    - Use [this sheet](https://docs.google.com/spreadsheets/d/1PMtGpd4U6rYBn6Ut6eHQzSo4PdBwluU-ppx87ROy_N8/edit#gid=0) to find a pair programming partner.
- Lab 2 is due on **Monday, April 11th at 11:59PM**.
- There is only one discussion section now, on Wednesdays from **7-8:30PM**.

### Agenda

- Unfaithful data.
- Missing values.
- Hypothesis testing.

## Unfaithful data

### Is the data "faithful" to the data generating process (DGP)?

- In other words, how well does the data represent reality?

- Does the data contain unrealistic or "incorrect" values?
    - Dates in the future for events in the past.
    - Locations that don't exist.
    - Negative counts.
    - Misspellings of names.
    - Large outliers.

### Is the data "faithful" to the DGP?
    
- Does the data violate obvious dependencies?
    - Age and birthday don't match. 
- Was the data entered by hand?
    - Spelling errors.
    - Fields shifted.
    - Did the form require fields or provide default values?
- Are there obvious signs of data falsification (aka "curbstoning")?
    - Repeated names.
    - Fake looking email addresses.
    - Repeated use of uncommon names or fields.

<center><img src='imgs/data-sd.png' width=70%></center>

### Example: Police vehicle stops 🚔

The dataset we're working with contains all of the vehicle stops that the San Diego Police Department made in 2016.

<center><img src="imgs/image_5.png"/></center>

### General questions

1. Check the data types. Notice any issues?
2. Do string fields have consistent values?
3. Are there missing values that we don't understand?
4. Are all values within a reasonable range?
5. How do we deal with the messiness we find?

In [None]:
stops = pd.read_csv('data/vehicle_stops_2016_datasd.csv')
stops.head()

### Data types
* Are the data types correct?
* If not, are they easily fixable?

In [None]:
stops.head(1)

In [None]:
stops.info()

### Unfaithfulness
* Are there suspicious values?
* If a value is suspicious, can we trust the observation?
* For example, consider `'subject_age'` – some are too high to be true, some are too low to be true.

In [None]:
stops['subject_age'].unique()

In [None]:
ages = pd.to_numeric(stops['subject_age'], errors='coerce')
ages.describe()

Ages range all over the place, from 0 to 220. Was a 220-year-old really pulled over?
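
To see what `errors='coerce'` is doing in the cell above, here is a minimal sketch on a made-up Series (the values in `raw_ages` are hypothetical, chosen only to mimic a column that mixes numeric strings with placeholders).

```python
import pandas as pd

# Hypothetical raw values: numeric strings mixed with a text placeholder
raw_ages = pd.Series(['25', '220', 'No Age', '0', '14'])

# errors='coerce' turns anything that can't be parsed as a number into NaN
toy_ages = pd.to_numeric(raw_ages, errors='coerce')

print(toy_ages)
print(toy_ages.isna().sum())   # how many values failed to parse
```

Note that coercion only catches values that aren't numbers at all. Numeric placeholders, like `'0'`, survive and still need to be handled separately.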

In [None]:
stops.loc[ages > 100]

In [None]:
ages.loc[(ages >= 0) & (ages < 16)].value_counts()

In [None]:
stops.loc[(ages >= 0) & (ages < 16)]

### Unfaithful `'subject_age'`

* Ages of `'No Age'` and `0` are likely explicit null values.
* What do we do about the exceptionally small and large ages?
    - Do we throw the entire row away, even if the rest of the row is well-formed?
* What about the 14 and 15 year olds?
    - Each has more than one occurrence – these could be real entries!
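
One alternative to throwing rows away is to flag them. The sketch below uses a small made-up DataFrame (`toy`) and an assumed "plausible" range of 10 to 100; both are illustrative choices, not a rule from the dataset's documentation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, standing in for a few rows of `stops`
toy = pd.DataFrame({
    'subject_age': [0, 14, 15, 34, 220, np.nan],
    'stop_cause': ['Moving Violation'] * 6,
})

# Treat 0 as an explicit null, but keep the row
cleaned = toy.assign(subject_age=toy['subject_age'].replace(0, np.nan))

# Flag (rather than drop) ages outside an assumed plausible range of 10 to 100
in_range = cleaned['subject_age'].between(10, 100)
cleaned['age_suspicious'] = cleaned['subject_age'].notna() & ~in_range

print(cleaned)
```

Every row is still there, so the other columns remain usable, and the flag lets us decide later how to treat the suspicious ages.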

### Human-entered data
* Which fields were likely entered by a human?
* Which fields were likely generated by code?
    - What was the original source?

Let's look at all unique stop causes. Notice that there are three different causes related to bicycles, which should probably all fall under the same cause.

In [None]:
stops['stop_cause'].value_counts()
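
If we decide that several labels mean the same thing, we can collapse them into one. This is only a sketch: the labels in `causes` below are hypothetical stand-ins, not the actual values in `stops['stop_cause']`.

```python
import pandas as pd

# Hypothetical labels; the real column's spellings may differ
causes = pd.Series([
    'Moving Violation', 'Bicycle', 'Bicycle Bicycle', 'BICYCLE', 'Equipment Violation'
])

# Find every label that mentions bicycles, ignoring case
is_bike = causes.str.lower().str.contains('bicycle')

# Replace all of those labels with a single consistent one
unified = causes.mask(is_bike, 'Bicycle')

print(unified.value_counts())
```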

Let's plot the distribution of ages within a reasonable range (older than 15, up to 85). What do you notice?

In [None]:
# DSC 10 review: what does density=True do?
ages.loc[(ages > 15) & (ages <= 85)].plot(kind='hist', density=True, bins=70, ec='w');

Now let's look at the first few and last few rows of `stops`.

In [None]:
stops[['timestamp', 'stop_date', 'stop_time']].head()

In [None]:
stops[['timestamp', 'stop_date', 'stop_time']].tail(10)

Do you think `'-0:81'` is a time that a computer would record?
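
One way to find entries like this systematically is to try parsing each string as a time and see which ones fail. The sketch below assumes times are written as `HH:MM`; the values in `times` are made up, apart from `'-0:81'`.

```python
import pandas as pd

# Hypothetical time strings, plus the suspicious value from above
times = pd.Series(['19:30', '00:05', '-0:81', '08:15'])

# Strings that don't match the assumed HH:MM format become NaT (the datetime version of NaN)
parsed = pd.to_datetime(times, format='%H:%M', errors='coerce')

# The original strings that could not be parsed
invalid = times[parsed.isna()]
print(invalid)
```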

### Unfaithful data vs. outliers

* Unfaithful data are data that don't accurately represent the data generating process.
* Outliers are "unusual" observations, unlike the rest of the data. They may be real, or they may be unfaithful.
    - For instance, it's possible that a 102-year-old was pulled over for speeding.
* The two are hard to tell apart; doing so often requires research and domain knowledge.

### Outliers

* **Consistently "incorrect" values**.
    - Example: Recorded ages of -1 or 99.
    - These are often "default" values, often used when a value is missing.
    - Solution: Change the value to the correct one if it is known!
    
* **Abnormal artifacts from the data collection process**.
    - Example: Spikes in recorded ages at round numbers (25, 30, 35, 40), or spikes in recorded COVID cases on Mondays.
    - Solution: Try "smoothing", e.g. binning the ages.
        
* **Unreasonable outliers**.
    - Example: Age of 200.
    - Solution: Not sure. Could remove the row. Could be indicative of a bug in the data collection process. Could be real!
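
As a rough illustration of "smoothing" by binning, the sketch below groups made-up ages (with spikes at 25, 30, and 35) into 10-year bins using `pd.cut`. The bin edges are an arbitrary choice.

```python
import pandas as pd

# Hypothetical ages with spikes at round numbers
toy_ages = pd.Series([24, 25, 25, 25, 26, 29, 30, 30, 30, 31, 34, 35, 35])

# right=False makes each bin include its left edge: [20, 30) and [30, 40)
binned = pd.cut(toy_ages, bins=[20, 30, 40], right=False)

# Counts per bin; the spikes at individual ages are no longer visible
counts = binned.value_counts().sort_index()
print(counts)
```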

### Reminder: tools 🛠

You'll use the following methods regularly when initially exploring a dataset.

- `.describe()`: see basic numerical information about a Series/DataFrame.
- `.info()`: see data types and the number of non-null values in each column of a DataFrame (from which you can infer the number of missing values).
- `.value_counts()`: see the distribution of a categorical variable.
- `.plot(kind='hist')`: plot the distribution of a numerical variable.

## Missing values

### Where'd you go?

* Missing values in a dataset can occur from:
    - Intentional logic, where a value doesn't make sense.
    - A non-response in the measurement process.
    - Mistakes in the data recording process.
    - ...
* Another term for "missing" is "null".
    
* Missing values are most often encoded with `NULL`, `None`, `NaN`, `''`, etc.

### Common representations of "null"

- All forms of `0` (e.g. `0`, `'0'`, `'zero'`) are common substitutes for null.
- -1 is common if a column must be non-negative.
- 1900 and 1970 are common if a non-null date is required.
    - Remember, Unix time starts counting from January 1, 1970.
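
To see why 1970 shows up, here is a small sketch: a stored timestamp of `0` seconds converts to January 1, 1970. The Series `seconds` is made up, and treating the epoch as "missing" is an assumption we'd want to verify for any real dataset.

```python
import pandas as pd

# A Unix timestamp of 0 seconds is the start of Unix time
epoch = pd.to_datetime(0, unit='s')
print(epoch)

# Hypothetical column where a missing date was stored as 0
seconds = pd.Series([0, 1649376000])
dates = pd.to_datetime(seconds, unit='s')

# Assumption: a date exactly equal to the epoch is a placeholder, so replace it with NaT
dates_clean = dates.mask(dates == epoch)
print(dates_clean)
```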

### Common representations of "null"

- Some common representations for "null" are also real values themselves!
- For instance, the point 0°00'00.0"N+0°00'00.0"E in the South Atlantic Ocean is called "Null Island."

<center><img src='imgs/null.png' width=60%></center>

- [This person's name is Mr. Null!](https://www.wired.com/2015/11/null/)

### Missing values in the stops dataset

What are the non-`np.NaN` null values in the stops dataset?
- Service Area: `'Unknown'`.
- Subject Age: `0`, `'No Age'`.
- Others?
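
Once we've identified placeholder values like these, one option is to convert them to `NaN` so that `pandas` treats them as missing. A sketch on a made-up DataFrame (the column names mirror the ones above, but the rows are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical rows using the placeholder values listed above
toy = pd.DataFrame({
    'service_area': ['110', 'Unknown', '320'],
    'subject_age': ['24', 'No Age', '0'],
})

# Map each column's placeholders to NaN
toy_clean = toy.replace({
    'service_area': {'Unknown': np.nan},
    'subject_age': {'No Age': np.nan, '0': np.nan},
})

# With the text placeholder gone, the ages can be stored as numbers
toy_clean['subject_age'] = pd.to_numeric(toy_clean['subject_age'])

print(toy_clean.isna().sum())
```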

In [None]:
stops

### Finding null values in `pandas`

* Null values are encoded using NumPy's `NaN` value, which is of type `float`.
* The `isna` method for DataFrame/Series detects missing values.
    - It returns a Boolean DataFrame/Series.
    - `isnull` is equivalent to `isna`.

In [None]:
type(np.nan)

In [None]:
# All of the rows where the subject age is missing
stops[stops['subject_age'].isna()]

In [None]:
# Proportion of values missing in the subject_age column
stops['subject_age'].isna().mean()

In [None]:
# Proportion of missing values in all columns
stops.isna().mean()

### Dropping observations with null values
- The `dropna` method:
    - when used on a Series, returns a new Series with all null entries removed.
    - when used on a DataFrame, returns a new DataFrame where all rows with at least one null value are removed.
- Don't drop rows unless absolutely necessary!
    - Usually, there is still useful information in the other columns.

In [None]:
stops.head()

In [None]:
stops.dropna().head()

In [None]:
stops.shape

In [None]:
stops.dropna().shape

### Dropping observations with null values

When used on a DataFrame:

* `.dropna()` drops **rows** containing **at least one** null value.
* `.dropna(how='all')` drops **rows** containing **only** null values.
* `.dropna(axis=1)` drops **columns** containing at least one null value.
* Other keyword arguments: `thresh`, `subset`.

In [None]:
nans = pd.DataFrame([[0, 1, np.nan], [np.nan, np.nan, np.nan], [1, 2, 3]], columns='A B C'.split())
nans

In [None]:
nans.dropna(how='any')

In [None]:
nans.dropna(how='all')

In [None]:
nans.dropna(axis=1)

In [None]:
nans.dropna(subset=['A', 'B'])
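
The `thresh` argument wasn't shown above. As a quick sketch, `thresh=k` keeps only the rows with at least `k` non-null values. The DataFrame is redefined here so that the example is self-contained.

```python
import numpy as np
import pandas as pd

nans = pd.DataFrame(
    [[0, 1, np.nan], [np.nan, np.nan, np.nan], [1, 2, 3]],
    columns=['A', 'B', 'C'],
)

# Keep rows with at least 2 non-null values:
# row 0 has 2, row 1 has 0, row 2 has 3
kept = nans.dropna(thresh=2)
print(kept)
```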

### Filling null values

The `fillna` method replaces all null values. Specifically:

* `.fillna(val)` fills null entries with the value `val`.
* `.fillna(dict)` fills null entries using a dictionary `dict` of column/row values.
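* `.bfill()` and `.ffill()` fill null entries using neighboring non-null entries. Older versions of `pandas` also accept `.fillna(method='bfill')` and `.fillna(method='ffill')`, but the `method` argument is deprecated in recent versions.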

In [None]:
nans

In [None]:
# Filling all NaNs with the same value
nans.fillna('billy')

In [None]:
# Filling NaNs differently for each column
nans.fillna({'A': 'f0', 'B': 'f1', 'C': 'f2'})

In [None]:
# Dictionary of column means
# Note that most numerical methods ignore null values
means = {c: nans[c].mean() for c in nans.columns}
means

In [None]:
# Filling NaNs with column means
nans.fillna(means)

In [None]:
# Another way of doing the same thing
nans.apply(lambda x: x.fillna(x.mean()), axis=0)

In [None]:
nans

In [None]:
# bfill stands for "backfill"; same result as the deprecated fillna(method='bfill')
nans.bfill()

In [None]:
# ffill stands for "forward fill"; same result as the deprecated fillna(method='ffill')
nans.ffill()

### Data types and `np.NaN`

* Comparing anything to `np.NaN` with `==`, `<`, or `>` evaluates to `False`, even `np.NaN == np.NaN`. The one exception is `!=`, which evaluates to `True`.
* Instead, use the function `pd.isna`, which returns whether the argument is `np.NaN` or `None`.
    - Can also use `pd.isnull`.
* Remember, `NaN` is of type `float` – watch out for type coercion!
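
A short sketch of these points. Note that `!=` is the one comparison with `NaN` that evaluates to `True`, which is why `==` can't be used to find missing values.

```python
import numpy as np
import pandas as pd

eq = np.nan == np.nan    # False
ne = np.nan != np.nan    # True
lt = np.nan < 1          # False
print(eq, ne, lt)

s = pd.Series([0, 1, np.nan])
print(s.dtype)           # the ints were coerced to floats

# Comparing with == finds nothing; isna finds the missing value
found_with_eq = (s == np.nan).sum()
found_with_isna = s.isna().sum()
print(found_with_eq, found_with_isna)
```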

In [None]:
nans

In [None]:
np.nan == np.nan

In [None]:
pd.isna(np.nan)

In [None]:
nans.isna()

In [None]:
nans.isnull()

In [None]:
# Since np.nan is a float, the Series is of type float despite the two ints
pd.Series([0, 1, np.nan])

### More soon...

- That's all we'll discuss regarding missing values for now.
- However, once we recap hypothesis and permutation testing, we will introduce the idea of **imputation**, in which we will learn how to fill missing values using other information in the DataFrame.
- Stay tuned!

## Hypothesis testing

### Answering questions with confidence 💪

Now our data is clean and we're confident that it's faithful to the data generating process.

How do we ask questions and draw conclusions about the data generating process, using our observed data?

Run the following cell to set things up.

In [None]:
np.random.seed(42)

flips = pd.DataFrame(np.random.choice(['H', 'T'], p=[0.55, 0.45], size=(114,1)), columns=['result'])

### Was the coin fair? 🪙

* Given a dataset of coin flips, we want to try and answer the question, "was the coin fair?"
* Do we "trust" the dataset?
    * Maybe whoever kept track of the coin flips made some typos.
* What is "fair"? 
    - Ideally, we see the exact same number of heads and tails. But how often will that happen exactly?
    - What is a reasonable deviation?

In [None]:
flips.head()

In [None]:
flips.value_counts()

In [None]:
# The to_frame method converts a Series to a DataFrame
flips['result'].value_counts().to_frame()

In [None]:
# Normalized
flips['result'].value_counts(normalize=True).to_frame()

### Null hypothesis

- We start with an initial belief as to how the data was generated, which is called a **null hypothesis**.
    - In our example, it is that the coin was fair.
    - The null hypothesis must be a **probability model**, i.e. something that we can simulate under.
- Somehow, we need to decide whether our observation (e.g. 68 heads and 46 tails) is consistent with that belief.
- To make this decision, we will:
    - Assume the belief is true.
    - Consider all possible outcomes under that assumption, along with their probabilities.
        - e.g. if the coin truly was fair, what's the probability of seeing 40% heads? 61% heads? 49% heads?
    - See how likely our observation was, under this assumption.
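
For the coin example, "all possible outcomes, along with their probabilities" can be written down directly, since the number of heads in 114 flips of a fair coin follows a binomial distribution. The sketch below is one way to compute it; the name `null_dist` is just for illustration.

```python
from math import comb

import pandas as pd

n_flips = 114   # number of flips in the observed data

# Under the null (fair coin), P(k heads) = C(114, k) / 2**114
null_dist = pd.Series(
    {k: comb(n_flips, k) / 2**n_flips for k in range(n_flips + 1)},
    name='probability',
)

print(null_dist.loc[[46, 57, 68]])   # probabilities of a few specific outcomes
print(null_dist.sum())               # the probabilities of all outcomes add up to 1
```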

### Test statistics

- A **test statistic** is a number that we compute in each repetition of an experiment, to help us make a decision.

- Suppose a coin was flipped $N$ times, and $N_H$ flips were heads. Then, each of the following is a test statistic we could choose:

    * $N_H$ (number of heads).
    * $\frac{N_H}{N}$ (proportion of heads).
    * $N_H - \frac{N}{2}$ (difference from expected number of heads).
    * $|N_H - \frac{N}{2}|$ (absolute difference from expected number of heads).

- The first three would be helpful for the alternative hypothesis "the coin was biased in favor of heads" (or tails).
- The last would be helpful for the alternative hypothesis "the coin was biased."
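
To make these concrete, here is a sketch that computes all four test statistics for a made-up sequence of 10 flips (`flips_toy` is hypothetical, not the lecture's data).

```python
import numpy as np

# Hypothetical sequence of 10 flips
flips_toy = np.array(['H', 'T', 'H', 'H', 'T', 'H', 'T', 'H', 'H', 'T'])

n = len(flips_toy)                      # N
n_heads = (flips_toy == 'H').sum()      # N_H

num_heads = n_heads                     # number of heads
prop_heads = n_heads / n                # proportion of heads
diff = n_heads - n / 2                  # difference from expected number of heads
abs_diff = abs(n_heads - n / 2)         # absolute difference from expected number of heads

print(num_heads, prop_heads, diff, abs_diff)
```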

### Making decisions

- After choosing a test statistic, we need to compute the **distribution of the test statistic, under the assumption that the null hypothesis is true** ("under the null").
    - In DSC 10 and 80, we do this through simulation, which means our calculations are approximate.
    - In other courses, you may do this by hand (e.g. for the coin example you could use the binomial distribution).
- Once we have this distribution, we can compute **the probability of seeing an observation as or more extreme than our observation**, under this assumption.
    - This is called a **p-value**.
- If that probability is very small, it means that the null hypothesis is unlikely to explain our observation, and we should reject it.
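
As a sketch of the "by hand" approach for the coin example: under a fair coin, the probability of 68 or more heads in 114 flips is a sum of binomial probabilities. Here, "as or more extreme" is taken to mean "at least as many heads," which matches an alternative hypothesis biased towards heads.

```python
from math import comb

n_flips = 114
observed_heads = 68

# P(N_H >= 68) under a fair coin: add up C(114, k) / 2**114 for k = 68, ..., 114
p_value_exact = sum(comb(n_flips, k) for k in range(observed_heads, n_flips + 1)) / 2**n_flips
print(p_value_exact)
```

A simulation-based p-value should land close to this number, but won't match it exactly.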

### Running a hypothesis test, DSC 10 style

Let's use the number of heads ($N_H$) as our test statistic. We need to:
1. Compute the **observed value** of the test statistic, i.e. the observed number of heads.
2. Simulate values of the test statistic under the null, i.e. under the assumption that the coin was fair.
3. Use the resulting distribution to calculate the (approximate) probability of seeing 68 or more heads, under the assumption the coin was fair.

In [None]:
# This DataFrame contains our "observed data"
flips.head()

In [None]:
# Number of coin flips
flips.shape

In [None]:
# Observed statistic
obs = (flips['result'] == 'H').sum()
obs

In [None]:
# Number of simulations
N = 10000

# 10000 times, we want to flip a coin 114 times
results = []
for _ in range(N):
    simulation = np.random.choice(['H', 'T'], p=[0.5, 0.5], size=114)
    sim_heads = (simulation == 'H').sum()  # Test statistic
    results.append(sim_heads)

Each entry in `results` is the number of heads in 114 simulated coin flips.

In [None]:
results[:10]

### Plotting the empirical distribution of the test statistic

In [None]:
pd.Series(results).plot(kind='hist', 
                        density=True,
                        bins=np.arange(35, 76, 1),
                        ec='w',
                        title='Number of Heads in 114 Flips of a Fair Coin');
plt.axvline(x=obs, color='red', linewidth=4);

**Question:** Do you think the coin was fair?

In [None]:
(np.array(results) >= obs).mean()

- Under the assumption the coin is fair, the probability of seeing 68 or more heads is ~2.5%.
    - This is called a **p-value**.
- So either:
    - The coin is fair and we saw a really rare event, or
    - The coin is not fair.
- We need a **cutoff** to determine whether to reject the null hypothesis, given this probability.
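
The sketch below puts the pieces together without a `for`-loop and then applies a cutoff. The cutoff of 0.05 is an assumption (a common convention, not something dictated by the data), and `np.random.default_rng` with `binomial` is just one way to simulate the number of heads, so the resulting p-value will differ slightly from the one above.

```python
import numpy as np

rng = np.random.default_rng(42)

n_flips = 114
observed_heads = 68
n_sims = 10_000

# Each entry is the number of heads in 114 flips of a fair coin
sim_heads = rng.binomial(n=n_flips, p=0.5, size=n_sims)

# Proportion of simulations with at least as many heads as we observed
p_value = (sim_heads >= observed_heads).mean()

cutoff = 0.05   # assumed cutoff
reject_null = p_value < cutoff
print(p_value, reject_null)
```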

## Summary, next time

### Summary

- Data cleaning is the process of transforming data so that it is an accurate representation of the data generating process.
- Unfaithful data is data that is not representative of the data generating process. When working with messy data, we must look for:
    - Missing values (i.e. "null" values).
    - Incorrect values.
- Useful methods to be aware of: `fillna`, `isna`/`isnull`, `dropna`.
- Hypothesis testing allows us to make confident conclusions regarding the data generating process, given some observed data.
- **Next time:** how to perform a "faster" hypothesis test. More test statistics and examples.