Cleaning Data for Effective Data Science
========================================

Doing the Other 80% of the Work
-------------------------------

* A course by David Mertz, Ph.D.
* Sponsored by Juniper Insights
* 5-6 October, 2022, Singapore

Day 1
-----

08:30 **Registration**

08:45 **Tabular Data Formats**

**Tidying up**: Tidy data carefully separates variables
(columns/features) from observations (rows/samples). At the intersection
of these two, we find individual values (data points). Unfortunately, the data we
encounter is often not arranged in this useful way, and it requires
*normalization*.
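
As a rough illustration of tidying, the sketch below reshapes a small "wide" table into one row per observation. The table contents and column names are hypothetical and are not taken from the course materials; `DataFrame.melt` is one common Pandas tool for this, not necessarily the one used in the lesson.

```python
import pandas as pd

# Hypothetical untidy table: one column per year rather than one row per observation.
wide = pd.DataFrame({
    "student": ["Ana", "Ben"],
    "2021": [88, 75],
    "2022": [92, 81],
})

# melt() turns the year columns into (year, score) pairs, one row per observation.
tidy = wide.melt(id_vars="student", var_name="year", value_name="score")
tidy["year"] = tidy["year"].astype(int)  # the year was a column label, so it starts as text
tidy = tidy.sort_values(["student", "year"]).reset_index(drop=True)
print(tidy)
```

After reshaping, `student`, `year`, and `score` are each a variable, and every row is a single observation.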

**Comma separated values**: Delimited text files are ubiquitous. These
text files put multiple values on each line, and separate those values
with some character, such as a comma. They are almost always the
exchange format used to transport data between other tabular
representations, and much data both starts and ends life as CSV.
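
A minimal sketch of why delimited text needs a real parser rather than a plain string split: quoted fields may themselves contain the delimiter. The sample records are invented for illustration; only Python's standard `csv` module is used.

```python
import csv
import io

# Invented sample: the first record has a comma inside a quoted field.
raw = 'name,city,score\n"Lee, Min",Singapore,91\nOmar,"Kuala Lumpur",85\n'

# A naive split on commas breaks the quoted name into two pieces.
naive = raw.splitlines()[1].split(",")
print(len(naive), naive)

# csv.DictReader honors the quoting rules and keeps the field whole.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["name"], "|", rows[0]["city"])
```

Note that every parsed value is still a string; converting `score` to a number is a separate cleaning step.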

**Exercise**: Clean up the toy tabular data in students.csv. Working "by hand"
in a text editor, using Python, or both, fix the formatting and data
issues.

10:15 **Coffee break**

10:30 **Working with Data**

**Spreadsheets considered harmful**: Spreadsheets in general, and Excel
in particular, are anathema to effective data science. A great share of
the world's data lives in Excel spreadsheets, which engender special
types of corruption.

**Relational database management systems** enforce strict column typing and
frequently use formal foreign keys and constraints, and are hence a
boon for data science. Informally assembled databases that are not well
normalized still have many desirable properties. Not all
relational databases are *tidy*, but they all take you several large
steps in that direction.

**Dataframes**: Many libraries across almost as many programming
languages support the *data frame* abstraction. Most data scientists
prefer this style of processing data. Data frames allow an easy
expression of filtering, grouping, aggregation, sorting, and vectorized
function application.
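
The operations named above can be sketched in Pandas as follows. The station readings are made up for illustration, and other data frame libraries express the same ideas with different syntax.

```python
import pandas as pd

# Made-up readings from two hypothetical weather stations.
df = pd.DataFrame({
    "station": ["A", "A", "B", "B", "B"],
    "temp_c": [21.0, 23.0, 30.0, 28.0, 35.0],
})

# Vectorized function application: convert the whole column at once.
df["temp_f"] = df["temp_c"] * 9 / 5 + 32

# Filtering: keep only rows warmer than 22 degrees Celsius.
warm = df[df["temp_c"] > 22]

# Grouping, aggregation, and sorting: mean Fahrenheit reading per station.
summary = warm.groupby("station")["temp_f"].mean().sort_values(ascending=False)
print(summary)
```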

**Introduction to anomaly detection**: Data goes bad in the ordinary
course of its collection, collation, transmission, and transcription.
Perhaps an instrument gives a bad reading. Perhaps some values are
systematically altered in the course of re-encoding to a different data
format. Perhaps the wrong units of measure were used for a subset of the
data.

**Exercise**: Clean up the sample data in Film\_Awards.xlsx to produce a tidy
Pandas DataFrame. Even though the data set is small enough to type by
hand, use Python/Pandas programming techniques to perform the cleanup
and reorganization.

13:00 **Lunch break**

14:00 **Anomaly Detection**

**Missing data and sentinels**: In textual data formats missing data is
indicated either by absence or by a sentinel. Use of sentinels is not
limited to text formats. Often in SQL, for example, TEXT or CHAR columns
that could use NULL to indicate missing values instead contain
sentinels.
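
A small sketch of handling sentinels when reading text data with Pandas. The sentinel values `-999` and `unknown` are assumptions chosen for the example; real data sets document (or fail to document) their own conventions.

```python
import io
import pandas as pd

# Invented sample: missing ages appear as -999 or as an empty field,
# and missing cities appear as "N/A" or "unknown".
raw = "id,age,city\n1,34,Oslo\n2,-999,N/A\n3,,Lima\n4,51,unknown\n"

# na_values adds per-column sentinels to the markers Pandas already
# treats as missing by default (which include the empty string and "N/A").
df = pd.read_csv(
    io.StringIO(raw),
    na_values={"age": ["-999"], "city": ["unknown"]},
)
print(df)
print(df.isna().sum())
```

Whether a value such as `-999` is a sentinel or a genuine measurement is a domain question, so sentinels are best declared per column rather than globally.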

15:30 **Coffee break**

15:45 **Anomaly Detection**

**Miscoded data** is often categorical data. Ordinal data might be
included too inasmuch as it has known bounds. For example, if a ranking
scale is specified as ranging from 1 to 10, any values outside of that
numeric range must be miscoded in some manner.

**Fixed bounds**: Based on our domain knowledge of the problem and data
set, we may know of fixed bounds for particular variables. The tallest
human who has lived was Robert Pershing Wadlow at 271㎝; the shortest
adult was Chandra Bahadur Dangi at 55㎝. Values outside this range are
probably unreasonable.
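
A minimal sketch of a fixed-bounds check, using the two heights quoted above as limits. The list of measurements is hypothetical; it includes the kind of slips (a misplaced decimal point, a wrong unit) that such a check tends to catch.

```python
# Bounds in centimeters, taken from the record heights quoted in the text.
MIN_CM, MAX_CM = 55, 271

# Hypothetical height measurements, some with data entry problems.
heights = [172.4, 17.2, 168.0, 1683.0, 181.5]

# Flag (position, value) pairs that fall outside the plausible range.
suspect = [(i, h) for i, h in enumerate(heights) if not MIN_CM <= h <= MAX_CM]
print(suspect)
```

A flagged value is a candidate for review, not automatically an error; deciding whether to correct, drop, or keep it is a separate step.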

**Exercise**: Use the sample data in humans-names.csv and attempt to
identify likely data entry errors among the 25,000 rows of data.

17:00 **Day ends**

Day 2
-----

08:30 **Registration**

08:45 **Anomaly Detection**

**Outliers**: Values may fall within normative ranges yet nonetheless be
strongly uncharacteristic. The standard ways to characterize the
expectedness of a value are measures called z-score and inter-quartile
range.
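
Both measures can be sketched with Python's standard `statistics` module. The sample values and the cutoffs (a z-score magnitude above 2.5, and 1.5 times the inter-quartile range beyond the quartiles) are assumptions for illustration; appropriate thresholds depend on the data and the sample size.

```python
import statistics

# Hypothetical measurements with one uncharacteristic value.
values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 14.5]

# z-score: distance from the mean in units of standard deviation.
mean = statistics.mean(values)
stdev = statistics.stdev(values)
z_scores = [(v - mean) / stdev for v in values]
z_outliers = [v for v, z in zip(values, z_scores) if abs(z) > 2.5]

# Inter-quartile range: flag values far outside the middle 50% of the data.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [v for v in values if v < low or v > high]

print(z_outliers, iqr_outliers)
```

In a sample this small a single extreme value inflates the standard deviation, so the largest attainable z-score is limited; that is one reason the inter-quartile range is often the more robust choice.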

**Multivariate outliers**: Sometimes univariate features can fall within
relatively moderate z-score boundaries, and yet combinations of those
features are unlikely or unreasonable.

**Exercise**: Use the small and famous Michelson-Morley data set morley.dat
to identify outliers that are likely to be inaccurate or unreliable
data. As with many real-world problems, you will first need to determine
how to parse the data at all before doing statistical analysis.

10:15 **Coffee break**

10:30 **Data Quality**

**Missing Data**: When a record has missing data, you have three
choices.

1.  Discard that particular record.
2.  Impute some value(s) for the missing field.
3.  Decide that because of the amount or distribution of missing data,
    the data is simply not usable for the purpose at hand.

**Biasing trends**: When you detect a sample bias within your data you
need to make a domain-area judgment about its significance. Such bias
typically shows up in one of two ways.

1.  The distribution of observations is unrepresentative of the
    underlying domain.
2.  The data themselves may reveal a bias by trends that exist between
    multiple variables. This could be a phenomenon you have detected in
    the data, or could be a collection or curation artifact.

**Normalization and scaling**: Normalization of data is simply bringing
all the features being utilized in a data set into comparable numeric
ranges. This often improves the quality of models based on that data.
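
Two common scaling schemes, sketched with the standard library only. The feature values are invented, and the helper names `min_max` and `standardize` are hypothetical, not part of any library.

```python
import statistics

def min_max(values):
    """Rescale values linearly so they span the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

# Invented features measured on very different numeric scales.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
incomes = [20000.0, 40000.0, 60000.0, 80000.0, 100000.0]

print(min_max(heights_cm))
print(min_max(incomes))
print(standardize(heights_cm))
```

After min-max scaling, both features occupy the same 0 to 1 range, so neither dominates a distance-based model merely because of its units.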

**Cyclicity and autocorrelation**: Sometimes you expect your data to
have periodic behavior. In such cases---especially when multiple
overlapping cyclicities exist within sequential data---the deviations
from the cyclical patterns can be more informative than the raw values.
Most frequently we see this in association with time series data.

**Exercise**: Read all of the "Brad's House" temperature data used in the
lesson into a Pandas DataFrame, then characterize each
observation as "regular," "interesting," "missing," or "data error."

13:00 **Lunch break**

14:00 **Value imputation**

**Typical value imputation**: Pretty much the simplest thing we can do
is assume a missing value is similar to the general trend for that same
feature. In some cases, domain knowledge informs us as to what a
reasonable default is. The existing data can also provide guidance for
imputation.
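
A minimal sketch of typical-value imputation, filling gaps with the median of the observed values. The ages are hypothetical, and the median is only one choice of "typical"; a mean, a mode, or a domain-specific default may suit other features better.

```python
import statistics

# Hypothetical ages, with None marking missing values.
ages = [34, None, 29, 41, None, 38]

# The typical value is computed from the observed data only.
observed = [a for a in ages if a is not None]
typical = statistics.median(observed)

# Replace each missing entry with that typical value.
imputed = [typical if a is None else a for a in ages]
print(typical, imputed)
```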

**Locality imputation**: In a time series, the measurement taken at one
particular minute is "local" to the measurement taken at the next
minute. Locality in general, however, is not specifically about
sequence. In a parameter space, including physical space, locality might
be "closeness" in that space. Imputing values based on nearby values
is often a reasonable way of filling in data we do not actually have.

**Sampling** is modification of a data set in order to rebalance it in
some manner. An imbalance can reflect **either** the data collection
techniques used **or** the underlying pattern of the phenomenon you are
measuring.
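
One simple rebalancing approach is to oversample the minority class with replacement, sketched below. The class labels and counts are invented, and oversampling is only one option among several (undersampling the majority class is another).

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Invented imbalanced data set: (row id, class label) pairs.
rows = list(enumerate(["ok"] * 8 + ["fraud"] * 2))

majority = [r for r in rows if r[1] == "ok"]
minority = [r for r in rows if r[1] == "fraud"]

# Draw extra minority rows, with replacement, until the classes are equal in size.
extra = rng.choices(minority, k=len(majority) - len(minority))
balanced = rows + extra

print(Counter(label for _, label in balanced))
```

Oversampling duplicates existing observations rather than adding information, so it should be applied only to training data and never before a train/test split.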

**Exercise**: Use the data set height-weight-color.csv that purports to
observe the named features of numerous humans, and generate several
synthetic features that may be useful for future predictive models.

15:30 **Coffee break**

15:45 **Value imputation**

**Trend imputation**: Data scientists most commonly use trends for
imputation in time series data, where missing observations are
likely similar to those nearby. Approaches to trend imputation include
forward fill, backward fill, local regression, time-sensitive
regression, non-local regression, and correlational imputation.
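
Three of the simpler approaches can be compared on a short series in Pandas. The dates and readings are hypothetical; the regression-based approaches listed above need more machinery and are not shown.

```python
import pandas as pd

# Hypothetical daily readings with two consecutive missing days.
s = pd.Series(
    [10.0, float("nan"), float("nan"), 16.0, 18.0],
    index=pd.date_range("2022-10-05", periods=5, freq="D"),
)

forward = s.ffill()                      # carry the last known value forward
backward = s.bfill()                     # pull the next known value backward
linear = s.interpolate(method="linear")  # straight line between known neighbors

print(pd.DataFrame({"raw": s, "ffill": forward, "bfill": backward, "linear": linear}))
```

Forward and backward fill assume the value stays flat across the gap, while linear interpolation assumes a steady change; which assumption is more plausible depends on the phenomenon being measured.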

**Exercise**: Use several different value imputation techniques on the
fictional excited-kryptonite.fwf data set used in lessons, and analyze
how much the imputations made by these techniques differ.

17:00 **Day ends**