In [1]:
import pandas as pd
import numpy as np
import os

# Lecture 4 – Messy Data

## DSC 80, Spring 2022

### Announcements

- Lab 1 is due **tonight at 11:59PM**.
    - Watch [this video 🎥](https://youtu.be/FpTo4AM9B30) for setup instructions.
    - Please submit the [Welcome + Alternate Exams Form](https://docs.google.com/forms/d/e/1FAIpQLSdBKLcPs4Xi0plaIw0MVZ0DyGcvnSZyHxKVC7S7LwEiCchepQ/viewform) **by tonight**.
- Project 1 is released!
    - Watch [this video 🎥](https://www.youtube.com/watch?v=Os-BT0FTzVg) to get an overview of the project, and use [this sheet](https://docs.google.com/spreadsheets/d/1PMtGpd4U6rYBn6Ut6eHQzSo4PdBwluU-ppx87ROy_N8/edit#gid=0) to find a pair programming partner.
    - The Checkpoint is due on **Thursday, April 7th at 11:59PM**.
    - The whole project is due on **Thursday, April 14th at 11:59PM**.

### Agenda

- Recap: adding columns.
- Introduction to messy data and data cleaning.
- Kinds of variables.
- Unfaithful data.

## Recap: adding columns

In [2]:
elections_fp = os.path.join('data', 'elections.csv')
elections = pd.read_csv(elections_fp)
elections.head(10)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
9,Perot,Independent,18.9,1992,loss


### Adding and modifying columns, using a copy

* To add a new column to a DataFrame, use the `assign` method.
* To add a new row to a DataFrame, use the `append` method (deprecated in pandas 1.4 and removed in pandas 2.0; in newer versions, use `pd.concat` instead).
* Both `assign` and `append` return a copy of the DataFrame, **which is a great feature!**
* To change the values in a column, re-assign its name to a sequence of the desired values.
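A minimal sketch of the copy behavior described above, using a small made-up DataFrame (`df` and its values are illustrative, not the `elections` data). Since `DataFrame.append` is no longer available in pandas 2.0 and later, the row is added with `pd.concat` here, which also returns a new object.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({'Candidate': ['A', 'B'], '%': [60.0, 40.0]})

# assign returns a new DataFrame; df itself is unchanged.
with_prop = df.assign(prop=df['%'] / 100)

# pd.concat also returns a new DataFrame that includes the extra row.
new_row = pd.DataFrame({'Candidate': ['C'], '%': [0.0]})
with_row = pd.concat([df, new_row], ignore_index=True)

print(df.columns.tolist())
print(with_prop.columns.tolist())
print(len(df), len(with_row))
```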

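As an aside, you should try your best to write **chained** `pandas` code, in which each method is called directly on the result of the previous one. The cells below first build the new column step by step, and then show the chained version: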

In [4]:
elections['%'] / 100

0     0.507
1     0.410
2     0.066
3     0.588
4     0.376
5     0.534
6     0.456
7     0.430
8     0.374
9     0.189
10    0.492
11    0.407
12    0.084
13    0.484
14    0.479
15    0.483
16    0.507
17    0.529
18    0.457
19    0.511
20    0.472
21    0.482
22    0.461
23    0.513
24    0.469
Name: %, dtype: float64

In [9]:
elections = elections.assign(prop_of_vote = elections['%']/100)

In [10]:
elections

Unnamed: 0,Candidate,Party,%,Year,Result,prop_of_vote
0,Reagan,Republican,50.7,1980,win,0.507
1,Carter,Democratic,41.0,1980,loss,0.41
2,Anderson,Independent,6.6,1980,loss,0.066
3,Reagan,Republican,58.8,1984,win,0.588
4,Mondale,Democratic,37.6,1984,loss,0.376
5,Bush,Republican,53.4,1988,win,0.534
6,Dukakis,Democratic,45.6,1988,loss,0.456
7,Clinton,Democratic,43.0,1992,win,0.43
8,Bush,Republican,37.4,1992,loss,0.374
9,Perot,Independent,18.9,1992,loss,0.189


In [None]:
(
    elections
    .assign(proportion_of_vote=(elections['%'] / 100))
    .head()
)

You can chain together several steps at a time:

In [None]:
(
    elections
    .assign(proportion_of_vote=(elections['%'] / 100))
    .assign(Result=elections['Result'].str.upper())
    .head()
)
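One caveat with the chain above: `elections['Result']` refers to the original DataFrame, not to the intermediate result of the chain. When a later step needs a column created earlier in the same chain, `assign` also accepts a function that receives the intermediate DataFrame. A small sketch with made-up data (`df` and the `majority` column are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({'%': [50.7, 41.0], 'Result': ['win', 'loss']})

out = (
    df
    .assign(prop=lambda d: d['%'] / 100)
    # 'prop' exists only on the intermediate DataFrame, so a function is needed here.
    .assign(majority=lambda d: d['prop'] > 0.5)
    .assign(Result=lambda d: d['Result'].str.upper())
)
print(out)
```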

You can also use `assign` when the desired column name has spaces, by unpacking a dictionary into keyword arguments with `**`.

In [None]:
(
    elections
    .assign(**{'Proportion of Vote': elections['%'] / 100})
    .head()
)

### Adding and modifying columns, in-place

* You can assign a new row or column to a DataFrame **in-place** using `loc` or `[]`.
    - Works like dictionary assignment.
    - Unlike `assign`, this **modifies** the underlying DataFrame rather than a copy of it.
* This is the more "common" way of adding/modifying columns. 
    - ⚠️ Warning: Exercise caution with this approach, since it changes the values of existing variables.
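To see why the warning matters, here is a small sketch with made-up data (`df`, `alias`, and `safe` are illustrative names). Plain assignment with `=` does not copy a DataFrame, so an in-place change made through one name is visible through the other.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({'%': [50.7, 41.0]})

alias = df          # same object, not a copy
safe = df.copy()    # independent copy

# In-place column assignment through alias also changes df.
alias['prop'] = alias['%'] / 100

print('prop' in df.columns)    # True
print('prop' in safe.columns)  # False
```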

In [None]:
# By default, .copy() returns a deep copy of the object it is called on,
# meaning that if you change the copy the original remains unmodified.
mod_elec = elections.copy()
mod_elec.head()

In [None]:
mod_elec['Proportion of Vote'] = mod_elec['%'] / 100
mod_elec.head()

In [None]:
mod_elec['Result'] = mod_elec['Result'].str.upper()
mod_elec.head()

In [None]:
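# 🤔 With .loc, -1 and -2 are new index *labels*, not positions from the end.
# Each list needs exactly one value per column of mod_elec.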
mod_elec.loc[-1, :] = ['Carter', 'Democratic', 50.1, 1976, 'WIN', 0.501]
mod_elec.loc[-2, :] = ['Ford', 'Republican', 48.0, 1976, 'LOSS', 0.48]
mod_elec

In [None]:
mod_elec = mod_elec.sort_index()
mod_elec.head()

In [None]:
# df.reset_index(drop=True) drops the current index 
# of the DataFrame and replaces it with an index of increasing integers
mod_elec.reset_index(drop=True)

## Introduction to messy data

### There is no such thing as "raw data"!

* Data are the result of measurements that must be recorded.
* Humans design the measurements and record the results.
* Data is **always** an imperfect record of the underlying process being measured.

### Data generating process

* A **data generating process** is the underlying, real-world (probabilistic) mechanism that generates observed data. 
* Observed data is an incomplete artifact of the data generating process.
* **A data generating process is what a statistical model attempts to describe.**
    - From DSC 10: a model is a set of assumptions about how data were generated.
    - More on this later in the quarter.
* Data cleaning requires an understanding of the data generating process.

### Example: COVID case counts 🦠

Suppose our **goal** is to determine the number of COVID cases in the US **yesterday**.
- What are we really asking for – the number of people who tested positive yesterday, or the number of people who contracted COVID yesterday?
- Tested positive on what type of test? How accurate is that type of test?
- How often are test results reported? Is there a delay in when test results are reported?

<center><img src='imgs/christmas.png' width=70%></center>

Why do you think so few cases were reported on Christmas Day – is it because COVID was less prevalent on Christmas Day as compared to the days before and after, or is it likely for some other reason? 🎅

### Data provenance

- As data scientists, we often need to work with datasets that others collected, for a purpose that is different from our current interest.
- As such, it's important to understand the "story" of how a dataset came to be, or the **provenance** of the data. Specifically, we need to be aware of:
    1. Assumptions about the data generating process.
    2. How the initial values in the dataset came to be.  
    3. How any data processing or storage decisions affected the values in the dataset.

The bigger picture question we're asking here is, **can we trust our data?**

### Data cleaning 🧹

- Data cleaning is the process of transforming data so that it best represents the underlying data generating process.

- In practice, data cleaning is often detective work to understand data provenance.
    - **Always be skeptical of your data!**

### Keys to data cleaning

Data cleaning often addresses: 

* The **structure** of the recorded data.
    - Is the data stored in a tabular format (e.g. CSV, SQL, Google Sheets) or in another format (JSON, XML)?
    - Are the individuals properly represented as rows?
* The **encoding** and **format** of the values in the data.
    - Are the data types of all columns reflective of the **kinds of data** they contain?
* Corrupt and "**incorrect**" data, and missing values.
    - Were there flaws in the data recording process? In other words, is our data **faithful** to the data generating process?
    
Let's focus on the latter two.

## Kinds of data

### Kinds of data

<center><img src='imgs/data-types.png' width=90%></center>

### Discussion Question

Determine the kind of each of the following variables.
- Fuel economy in miles per gallon.
- Number of quarters at UCSD.
- Class standing (freshman, sophomore, etc.).
- Income bracket (low, medium, high).
- Bank account number.

### Example: DSC 80 students

In the next cell, we'll load in an example dataset containing information about past DSC 80 students.

- `'PID'` and `'Student Name'`: student PID and name.
- `'Month'`, `'Day'`, `'Year'`: date when the student was accepted to UCSD.
- `'2021 tuition'` and `'2022 tuition'`: amount paid in tuition in 2021 and 2022, respectively.
- `'Percent Growth'`: growth between the two aforementioned columns.
- `'Paid'`: whether or not the student has paid tuition for this quarter yet.
- `'DSC 80 Final Grade'`: either `'Pass'`, `'Fail'`, or a number.

What needs to be changed in the DataFrame to compute statistics?

In [None]:
students = pd.read_csv(os.path.join('data', 'students.csv'))
students

### How much has each student paid in total tuition in 2021 and 2022?

In [None]:
students

In [None]:
total = students['2021 tuition'] + students['2022 tuition']
total

### Check the data types of `students`!

* What kinds of data should each column have?
    - Qualitative or quantitative?
    - Discrete or continuous?
    - Ordinal or nominal?
* What data type *should* each column have?

* Use the `dtypes` attribute (or the `info` method) to peek at the data types.

In [None]:
students.dtypes

### Cleaning `'2021 tuition'` and `'2022 tuition'`

* `'2021 tuition'` and `'2022 tuition'` are stored as `object`s (strings), not numerical values.
* The `'$'` character causes the entries to be interpreted as strings.
* We can use `str` methods to strip the dollar sign.

In [None]:
# This won't work. Why?
students['2021 tuition'].astype(float)

In [None]:
# That's better!
students['2021 tuition'].str.strip('$').astype(float)

We can loop through the columns of `students` to apply the above procedure. (Looping through columns is fine, just avoid looping through rows.)

In [None]:
for col in students.columns:
    if 'tuition' in col:
        students[col] = students[col].str.strip('$').astype(float)
        
students

Alternatively, we can do this without a loop by using `str.contains` to find only the columns that contain tuition information.

In [None]:
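# Boolean array marking the columns whose names contain 'tuition'.
cols = students.columns.str.contains('tuition')
# Note: this works here only because the loop above already stripped the '$'.
# On the raw strings, astype(float) alone would raise an error.
students.loc[:, cols] = students.loc[:, cols].astype(float)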
students

### Cleaning `'Paid'`

* Currently, `'Paid'` contains the strings `'Y'` and `'N'`.
    * `'Y'`s and `'N'`s typically result from manual data entry.
* The `'Paid'` column should contain `True`s and `False`s, or `1`s and `0`s.
* Solutions:
    - Use the `replace` Series method.
    - Create a Boolean Series through comparison.

In [None]:
students['Paid'].replace({'Y': True, 'N': False})

In [None]:
students['Paid'].value_counts()

In [None]:
students['Paid'] = students['Paid'] == 'Y'
students

### Cleaning `'Month'`, `'Day'`, and `'Year'`
* Currently, these are stored separately using the `int64` data type. This could be *fine* for certain purposes, but ideally they would be stored in a single column (e.g. so that rows can be sorted chronologically).
* Solutions:
    * Store dates as strings of the form `'YYYY-MM-DD'`.
    * Store dates as `datetime64` objects (later).

In [None]:
(
    students['Year'].astype(str) + '-' + 
    students['Month'].astype(str).str.zfill(2) + '-' + 
    students['Day'].astype(str).str.zfill(2)
)

Note:
- Due to **broadcasting**, we were able to add a Series to a string.
- The `zfill` string method adds zeroes to the start of a string until it reaches the specified length.
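As a preview of the `datetime64` option, here is a sketch using `pd.to_datetime`, which can assemble dates from a DataFrame whose columns are named `year`, `month`, and `day`. The data below is made up (`dates_df` is a hypothetical stand-in for the three columns of `students`).

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
dates_df = pd.DataFrame({'Month': [9, 12], 'Day': [5, 25], 'Year': [2019, 2018]})

# Rename to the lowercase names that pd.to_datetime looks for.
parts = dates_df.rename(columns={'Year': 'year', 'Month': 'month', 'Day': 'day'})
dates = pd.to_datetime(parts)

print(dates)
# Unlike three separate integer columns, a datetime Series sorts chronologically.
print(dates.sort_values())
```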

### Cleaning `'DSC 80 Final Grade'`

* Currently, `'DSC 80 Final Grade'`s are stored as `object`s (strings).
* Unless we convert this column to a numeric type, we can't do any arithmetic with it.
* However, due to the existence of strings like `'Pass'`, we can't use `astype` to convert it.
* Solution: use `pd.to_numeric(s, errors='coerce')`, where `s` is a Series.
    - ⚠️ Be careful with this!
    - `errors='coerce'` silently replaces every value it can't parse with `NaN`, which can cause uninformed destruction of data.
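One way to make the coercion less blind is to list the values that would become missing before overwriting anything. This is a sketch on a made-up Series (`grades` is hypothetical), not necessarily the procedure used in this lecture.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
grades = pd.Series(['89.5', 'Pass', '72', 'Fail', '95'])

as_numbers = pd.to_numeric(grades, errors='coerce')

# Entries that were present but could not be parsed as numbers.
lost = grades[as_numbers.isna() & grades.notna()]
print(lost.value_counts())
```

If `lost` contains anything unexpected, that is a sign to handle those values explicitly rather than let them turn into `NaN`.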

In [None]:
# Won't work!
students['DSC 80 Final Grade'].astype(int)

In [None]:
pd.to_numeric(students['DSC 80 Final Grade'], errors='coerce')

In [None]:
students['DSC 80 Final Grade'] = pd.to_numeric(students['DSC 80 Final Grade'], errors='coerce')
students

In [None]:
pd.to_numeric?

### Cleaning `'Student Name'`
* We want names to be formatted as `'Last Name, First Name'`, a common format.
* One solution: use the Series `apply` method.
    - If `s` is a Series, `s.apply(func)` applies the function `func` to each entry of `s`.

In [None]:
students['Student Name']

In [None]:
def transpose_name(name):
    # Assumes the name has exactly two whitespace-separated parts.
    firstname, lastname = name.split()
    return lastname + ', ' + firstname

transpose_name('King Triton')

In [None]:
students['Student Name'].apply(transpose_name)

### Aside: string methods

`str` methods are useful – use them!
- To use them, access the `str` attribute of Series.
- Then, whatever method/operator comes immediately after will be applied to each element of the Series individually, rather than the Series as a whole.

In [None]:
parts = students['Student Name'].str.split()
parts

In [None]:
parts.str[1] + ', ' + parts.str[0]

### More data type ambiguities

- 1649043031 looks like a number, but is probably a date.
    - [Unix timestamps](https://www.unixtimestamp.com) count the number of seconds since midnight UTC on January 1st, 1970.

- "USD 1,000,000" looks like a string, but is actually a number **and** a unit.
    
- 92093 looks like a number, but is really a zip code (and isn't equal to 92,093).
    
- Sometimes, `False` appears in a column of country codes. Why might this be? 🤔
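A short sketch of how the first three ambiguities might be handled in `pandas`. The variable names are illustrative, and the second zip code is made up to show a leading zero.

```python
import pandas as pd

# A Unix timestamp is a count of seconds, so tell pandas the unit.
ts = pd.to_datetime(1649043031, unit='s')
print(ts)  # 2022-04-04 03:30:31

# Split "USD 1,000,000" into a unit and a number.
unit, amount_str = 'USD 1,000,000'.split()
amount = int(amount_str.replace(',', ''))
print(unit, amount)

# Zip codes are labels, not quantities; storing them as strings keeps leading zeros.
zips = pd.Series(['92093', '02139'])
print(zips.str.len().tolist())
```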

### Example: the Norway problem 🇳🇴

In [None]:
import yaml

player = '''
name: Magnus Carlsen
age: 31
country: NO
'''

In [None]:
yaml.safe_load(player)
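PyYAML follows YAML 1.1, in which an unquoted `NO` is read as the boolean `False` rather than the string `'NO'`. A sketch of the usual workaround, quoting the value so that it stays a string (`unquoted` and `quoted` are illustrative names):

```python
import yaml

unquoted = yaml.safe_load('country: NO')
quoted = yaml.safe_load('country: "NO"')

print(unquoted)  # {'country': False}
print(quoted)    # {'country': 'NO'}
```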

## Unfaithful data

### Is the data "faithful" to the DGP?

- In other words, how well does the data represent reality?

- Does the data contain unrealistic or "incorrect" values?
    - Dates in the future for events in the past.
    - Locations that don't exist.
    - Negative counts.
    - Misspellings of names.
    - Large outliers.

### Is the data "faithful" to the DGP?
    
- Does the data violate obvious dependencies?
    - Age and birthday don't match. 
- Was the data entered by hand?
     - Spelling errors.
     - Fields shifted.
     - Did the form require fields or provide default values?  
- Are there obvious signs of data falsification (aka "curbstoning")?
    - Repeated names.
    - Fake looking email addresses.
    - Repeated use of uncommon names or fields.

<center><img src='imgs/data-sd.png' width=70%></center>

### Example: Police vehicle stops 🚔

The dataset we're working with contains all of the vehicle stops that the San Diego Police Department made in 2016.

<center><img src="imgs/image_5.png"/></center>

### General questions

1. Check the data types. Notice any issues?
2. Do string fields have consistent values?
3. Are there missing values that we don't understand?
4. Are all values within a reasonable range?
5. How do we deal with the messiness we find?
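The first four questions each map onto a quick `pandas` check. The sketch below runs them on a tiny made-up table (`toy`, its values, and the `subject_sex` column are assumptions for illustration, not the real stops data).

```python
import pandas as pd

# Hypothetical miniature of a stops-like table; all values are made up.
toy = pd.DataFrame({
    'subject_age': ['25', '31', 'No Age', '0', '220'],
    'subject_sex': ['M', 'F', 'f', 'M', None],
})

# 1. Data types: ages stored as strings show up as object.
print(toy.dtypes)

# 2. Consistency of string values: 'F' and 'f' appear as separate categories.
print(toy['subject_sex'].value_counts(dropna=False))

# 3. Missing values per column.
print(toy.isna().sum())

# 4. Range of values, after coercing to numbers.
ages_toy = pd.to_numeric(toy['subject_age'], errors='coerce')
print(ages_toy.describe())
```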

In [None]:
stops = pd.read_csv('data/vehicle_stops_2016_datasd.csv')
stops.head()

### Data types
* Are the data types correct?
* If not, are they easily fixable?

In [None]:
stops.head(1)

In [None]:
stops.info()

### Unfaithfulness
* Are there suspicious values?
* If a value is suspicious, can we trust the observation?
* For example, consider `'subject_age'` – some values are too high to be plausible, and some are too low.

In [None]:
stops['subject_age'].unique()

In [None]:
ages = pd.to_numeric(stops['subject_age'], errors='coerce')
ages.describe()

Ages range all over the place, from 0 to 220. Was a 220 year old really pulled over?

In [None]:
stops.loc[ages > 100]

In [None]:
ages.loc[(ages >= 0) & (ages < 16)].value_counts()

In [None]:
stops.loc[(ages >= 0) & (ages < 16)]

### Unfaithful `'subject_age'`

* Ages of `'No Age'` and `0` are likely explicit null values.
* What do we do about the exceptionally small and large ages?
    - Do we throw the entire row away, even if the rest of the row is well-formed?
* What about the 14 and 15 year olds?
    - Each has more than one occurrence – these could be real entries!
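One possible way to act on these observations, shown on a made-up Series (`ages_raw` is hypothetical): treat `'No Age'` and `0` as missing, and flag implausible ages instead of dropping their rows. The bounds 10 and 100 are arbitrary assumptions for this sketch, not values from the dataset's documentation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only.
ages_raw = pd.Series(['25', 'No Age', '0', '15', '220', '40'])

ages_num = pd.to_numeric(ages_raw, errors='coerce')  # 'No Age' becomes NaN
ages_num = ages_num.replace(0, np.nan)               # treat 0 as missing too

# Flag, rather than drop, known ages outside the assumed plausible range.
suspicious = ages_num.notna() & ~ages_num.between(10, 100)

print(ages_num.tolist())
print(suspicious.tolist())
```

Flagging keeps the rest of each row available for analyses that don't depend on age.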

In the coming weeks, we'll cover more solutions to these problems.

## Summary, next time

### Summary

- Data provenance describes the "origin story" of a dataset, from the data generating process to its storage.
- Data cleaning is the process of transforming data so that it best represents the underlying data generating process.
- We must ensure that each column in a DataFrame uses the correct data type for the **kind** of data in the column.
- We must also ensure that our data is **faithful** to the data generating process, by looking for missing or strange values.
- **Next time:** finish discussing unfaithful data, and (re)introduce hypothesis testing.