In [27]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display
import myst_nb

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
# Use the full option name; the bare 'precision' key is ambiguous in newer pandas
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)
        

# businesses
bus = pd.read_csv('data/businesses.csv', encoding='ISO-8859-1')

# inspections
insp = pd.read_csv("data/inspections.csv")

# violations
viol = pd.read_csv("data/violations.csv")

# DAWN
colspecs = [(0,6), (14,29), (33,35), (35, 37), (37, 39), (1213, 1214)]
varNames = ["id", "wt", "age", "sex", "race","type"]
dawn = pd.read_fwf('data/DAWN-Data.txt', colspecs=colspecs, header=None, index_col=0,
                   names = varNames)

(ch:wrangling_checks)=
# Quality Checks

Once your data are in a table and you understand their scope and granularity, it's time to inspect their quality. You may have come across errors in the data as you examined them and wrangled them into a data frame. In this section, we continue the inspection and carry out a more thorough assessment of the quality of the features and their values. We consider quality from four vantage points:

**Quality based on scope**. Earlier, in {numref}`Chapter %s <ch:data_scope>`, we addressed whether or not the data that have been collected can adequately address the problem at hand. We introduced a ***framework*** to help us consider possible limitations in the way the data were collected that might impact the generalizability of our findings. These considerations are especially important as we deliberate on our findings, but in this section we limit ourselves to checking the quality of values using information about the scope of the data. For example, with San Francisco restaurant inspections, the first three digits of a restaurant's zip code should be 941.


In [28]:
bus['postal_code'].value_counts().tail(10)

95105    1
92672    1
94602    1
        ..
94621    1
Ca       1
64110    1
Name: postal_code, Length: 10, dtype: int64

Other initial digits can point to a problem with the data. 
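
A check of this kind can be written as a string test on the leading digits. The following is a minimal sketch: the small data frame is made up for illustration and only borrows the `postal_code` column name from `bus`, and the name `suspect` is our own.

```python
import pandas as pd

# Hypothetical postal codes standing in for bus['postal_code']
zips = pd.DataFrame({
    'postal_code': ['94110', '94133', '92672', 'Ca', None, '94602'],
})

# True when the code begins with the expected San Francisco prefix;
# na=False counts missing codes as not matching
in_sf = zips['postal_code'].str.startswith('941', na=False)

# Records whose postal code deserves a closer look
suspect = zips[~in_sf]
print(suspect)
```

Whether a flagged code is an error or a legitimate business address outside the city is a separate question that the check alone cannot answer.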

As another example, background reading on atmospheric CO2 uncovered typical measurements to be about 400 ppm worldwide, and we saw that the monthly averages of CO2 at Mauna Loa fall in the 300 to 400 ppm range.

**Quality of measurements and recorded values**. We can use our knowledge of the scope of the data to check quality, as mentioned already, but we can also check the quality of measurements by considering what might be a reasonable value for a feature. For example, what seems like a reasonable range for the number of violations in a restaurant inspection? Possibly, 0 to 5? Other checks can be based on common knowledge of ranges: a restaurant inspection score must be between 0 and 100; months run between 1 and 12. We can use documentation to tell us the expected values for a feature. For example, the type of emergency room visit in the DAWN survey has been coded as 1, 2, ..., 8 (see {numref}`Figure %s <DAWN_codebook>` below for a screenshot of the codebook explanation of the field), so we can confirm that all values for the type of visit are indeed either a 1 or 2 or ... 8.

```{figure} figures/DAWN_codebook.png
---
name: DAWN_codebook
---

Screenshot of the description of the CASETYPE variable in the DAWN survey. Notice that there are eight possible values for this feature. And to help in figuring out if we have properly read the data, we can check the counts for these eight values. 
```

More generic checks can confirm, say, that a proportion ranges between 0 and 1. We want to ensure that the data type matches our expectations. For example, we expect a price to be a number, whether or not it's stored as integer, floating point, or string.  Confirming that the units of measurement match what is expected can be another useful quality check to perform (for example weight recorded in pounds, not kilograms). We can devise checks for all of these situations.  
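
The checks on codes, ranges, and data types described above might look like the following sketch. The records are invented for illustration (they are not the DAWN or restaurant data), and the column names `score` and `price` are assumptions.

```python
import pandas as pd

# Made-up records for illustration only
df = pd.DataFrame({
    'type': [1, 3, 8, 9],                    # codebook allows 1 through 8
    'score': [100, 92, 105, 77],             # scores must lie in [0, 100]
    'price': ['12.50', '8', 'n/a', '3.25'],  # numbers stored as strings
})

# Values not among the documented codes
bad_type = ~df['type'].isin(list(range(1, 9)))

# Values outside the allowed range (between is inclusive at both ends)
bad_score = ~df['score'].between(0, 100)

# errors='coerce' turns strings that are not numbers into NaN
price_num = pd.to_numeric(df['price'], errors='coerce')
bad_price = price_num.isna()

print(bad_type.sum(), bad_score.sum(), bad_price.sum())
```

Each check produces a Boolean series, so summing it gives the number of records that fail.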

**Quality of relationships.** In addition to checking individual values in a column, we also want to cross-check values between features. To do this cross-checking, we use built-in conditions on the relationship of values between two (or more) features. For example, according to the documentation for the DAWN study, alcohol consumption is only considered a type of visit for patients under 21, so we can check that all instances that record alcohol for the type of visit also have age recorded as under 21. Alcohol is coded as a 3, and ages under 21 are coded as 1, 2, 3, and 4.

In [29]:
pd.crosstab(dawn['age'], dawn['type'])

| type | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|------|---|---|---|---|---|---|---|---|
| **age** | | | | | | | | |
| -8 | 2 | 2 | 0 | 21 | 5 | 1 | 1 | 36 |
| 1 | 0 | 6 | 20 | 6231 | 313 | 4 | 2101 | 69 |
| 2 | 8 | 2 | 15 | 1774 | 119 | 4 | 119 | 61 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9 | 1616 | 3770 | 0 | 12404 | 3407 | 75 | 150 | 18381 |
| 10 | 616 | 1207 | 0 | 12291 | 2412 | 31 | 169 | 7109 |
| 11 | 205 | 163 | 0 | 24085 | 2218 | 12 | 308 | 1537 |


The cross tabulation shows that all of the alcohol cases are for patients under 21. The data values are as expected.       
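
The same cross-check can be expressed as a logical condition that counts the records breaking the rule. This is a sketch on hypothetical records that use the DAWN codes described above; the name `violations` is our own.

```python
import pandas as pd

# Hypothetical records: type 3 is alcohol, ages 1-4 are under 21
visits = pd.DataFrame({
    'age':  [1, 3, 4, 9, 11, 2],
    'type': [3, 4, 3, 8, 4, 3],
})

is_alcohol = visits['type'] == 3
is_under_21 = visits['age'].isin([1, 2, 3, 4])

# Records that break the rule: an alcohol visit with age not under 21
violations = visits[is_alcohol & ~is_under_21]
print(len(violations))
```

A count of zero means every alcohol visit in these records has an age code under 21.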

**Quality for analysis**. Even when data pass your quality checks, problems can arise with their usefulness. For example, if all but a handful of values for a feature are identical, then that feature adds little to the understanding of underlying patterns and relationships. Or, if there are too many missing values, especially if there is a discernible pattern in the missing values, our findings may be limited. We dedicate a section to the topic of missing values ({numref}`Section %s <ch:wrangling_missing>`). If a feature has many bad or corrupted values, then we might question the accuracy of even those values that fall in the appropriate range.

We see that the type of restaurant inspection can be either routine or a complaint. The San Francisco restaurant inspections include only one inspection for a complaint; all of the rest are routine.

In [30]:
insp['type'].value_counts()

routine      14221
complaint        1
Name: type, dtype: int64

This feature adds nothing to the data and we can simply drop it from the data frame.
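
One way to spot such a lopsided feature, and then drop it, is sketched below. The data frame is hypothetical, and the 0.99 cutoff is an arbitrary choice for this illustration, not a recommended threshold.

```python
import pandas as pd

# Hypothetical inspections with the same lopsided pattern in 'type'
inspections = pd.DataFrame({
    'score': list(range(1, 101)),
    'type': ['routine'] * 99 + ['complaint'],
})

# Proportion of records taken up by the most common value
top_share = inspections['type'].value_counts(normalize=True).iloc[0]
print(top_share)

# The 0.99 cutoff is an assumption made for this sketch
nearly_constant = top_share >= 0.99

# Drop the feature if it carries almost no information
if nearly_constant:
    inspections = inspections.drop(columns='type')
```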

Two questions that might have come to your mind at this point are: How do we look for garbled, anomalous, and inconsistently coded values? And, once we find them, what do we do? We address these questions generally here, and show some of the practicalities in the example in {numref}`Section %s <ch:wrangling_restaurants>`. 

**How to find bad values and features?** 

- Check summary statistics, distributions, and value counts. {numref}`Chapter %s <ch:eda>` provides examples and guidance on how to go about checking the quality of your data using visualizations and summary statistics. We briefly mention a few approaches here. A table of counts of unique values in a feature can uncover unexpected encodings and lopsided distributions, where one option is a rare occurrence. Percentiles can help reveal the proportion of records with unusually high (or low) values.
- Logical expressions can identify records with values out of range or relationships that are out of whack. Simply computing the number of records that do not pass the quality check can quickly reveal the size of the problem.
- Examine the whole record for those records with problematic values for a particular feature. At times, an entire record is garbled when, for example, a comma is misplaced in a CSV formatted file. Or, the record(s) might represent an unusual situation (such as ranches being included in data on house sales), and you will need to decide whether they should be included in your analysis or not.
- Refer to an external source to figure out if there's a reason for the anomaly.
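
The first three approaches in the list above can be sketched in a few lines. The scores below are invented, with one deliberately bad value, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical inspection scores; the 450 is a deliberately bad value
scores = pd.DataFrame({
    'business_id': [10, 11, 12, 13, 14, 15],
    'score': [96, 88, 100, 450, 72, 91],
})

# Percentiles give a quick sense of how extreme the tails are
print(scores['score'].quantile([0.05, 0.5, 0.95]))

# A logical expression marks records outside the valid range
out_of_range = (scores['score'] < 0) | (scores['score'] > 100)

# Number and proportion of records that fail the check
print(out_of_range.sum(), out_of_range.mean())

# Look at the full records, not just the offending column
print(scores[out_of_range])
```

Printing the whole record helps in judging whether only one field is off or the entire row is garbled.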

**What to do with your discoveries?** You have essentially four options: leave the data as is; modify values; remove features; or drop records. Not every unusual aspect of the data needs to be fixed. You might have discovered a characteristic of your data that will inform your analysis, but otherwise does not need any correcting. Or, you might find that the problem is relatively minor and most likely will not impact your analysis, so you can leave the data as is.

On the other hand, you might want to replace corrupted values with NaN ({numref}`Section %s <ch:wrangling_missing>` describes the potential impact of missing values on an analysis). Or, if you have figured out what went wrong, you can correct the value. Other possibilities for modifying records are covered in the examples of {numref}`Chapter %s <ch:eda>`. If you plan to change the values of a variable, then it's good practice to create a new feature with the modified value and preserve the original feature, or at a minimum, create a new feature that indicates which values in the original feature have been modified. These approaches give you some flexibility in checking the influence of the modified values on your analysis.
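
As a minimal sketch of this practice, the code below keeps the original feature, adds a cleaned copy with out-of-range values set to NaN, and adds an indicator of which values were changed. The data and the names `score_clean` and `score_modified` are hypothetical.

```python
import pandas as pd

# Hypothetical scores; valid values lie between 0 and 100
df = pd.DataFrame({'score': [96, 450, 88, -1]})

valid = df['score'].between(0, 100)

# New feature: keep valid values; where() fills the rest with NaN
df['score_clean'] = df['score'].where(valid)

# Indicator of which values in the original feature were modified
df['score_modified'] = ~valid

print(df)
```

Because `score` is untouched, the analysis can be rerun with and without the modified values to see how much they matter.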

If you find yourself modifying many values for a feature then you might consider eliminating the feature from the dataset. Either way, you will want to study the possible impact of excluding the feature from your analysis.  In particular, you will want to determine whether the records with corrupted values are similar to each other, and different from the rest of the data. This would indicate that you may be unable to capture the impact of a potentially useful feature in your analysis. Rather than exclude the feature entirely, there may be a transformation that allows you to keep the feature while reducing the level of detail recorded.
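
To see whether records with corrupted values differ from the rest, one option is to compare another feature across the two groups. The sketch below uses made-up data; the pattern for a valid zip code and the name `bad_zip` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical businesses with some corrupted postal codes
df = pd.DataFrame({
    'postal_code': ['94110', 'Ca', '94133', None, '94110', '00000'],
    'score': [95, 70, 90, 68, 98, 72],
})

# Flag codes that are not five digits beginning with 941
is_bad = ~df['postal_code'].str.fullmatch(r'941\d{2}', na=False)
is_bad = is_bad.rename('bad_zip')

# Compare the scores of records with and without a corrupted code
summary = df.groupby(is_bad)['score'].agg(['count', 'mean'])
print(summary)
```

A large gap between the two group means, as in this toy example, would suggest that the corrupted records are not a random subset of the data.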

At times, you will want to eliminate the problematic records. In general, we do not want to drop a large number of observations from a dataset without good reason. We may want to scale back our investigation to a particular subgroup of the data, but that's a different situation than dropping records because of a corrupted value in a field (see Section xx).  When you discover that an unusual value is in fact correct,  you still might decide to exclude the record from your analysis because it's so different from the rest of your data and you do not want it to overly influence your analysis. 

We suggested that there may be times when you want to replace corrupted data values with NaN, and hence treat them as missing. At other times, data might arrive missing. What to do with missing data is an important topic and there is a lot of research on this problem; we touch on some of the possible ways to address missing data in the next section.