# SLU15: Working With Real Data
---

In this notebook we will cover the following:
* Tidy data
* Numerical data
* Scaling
* Ordinal data
* Label encoding
* Categorical data
* Categorical dtype
* Get dummies.

> Happy datasets are all alike; every unhappy dataset is unhappy in its own way.

(Shamelessly adapted from [Tolstoy's Anna Karenina](https://en.wikipedia.org/wiki/Anna_Karenina_principle).)

# 1 Tidy data principles

At the beginning of any project, it is critical to structure datasets in a way that facilitates work.

Most datasets are dataframes made up of rows and columns, containing values that belong to a variable and an observation:
* *Variables* contain all values that measure the same thing across observations
* *Observations* contain all values measured on the same unit (e.g., same person) across variables.

The ideas of *tidy data* ([Wickham, 2014](http://vita.had.co.nz/papers/tidy-data.html)) provide a standardized framework to organize and structure datasets, making them easy to manipulate, model and visualize. They boil down to three principles:
1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table (or dataframe).
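
To make these principles concrete, here is a minimal sketch that reshapes a small "wide" table into tidy form with `DataFrame.melt`. The table and its column names (`name`, `year`, `score`) are hypothetical and are not part of the `avengers` data.

```python
import pandas as pd

# Hypothetical "wide" table: one row per person, one column per year.
wide = pd.DataFrame({
    'name': ['Ana', 'Rui'],
    '2019': [10, 7],
    '2020': [12, 9],
})

# Melt so that each variable (name, year, score) forms a column
# and each observation (one person in one year) forms a row.
tidy = wide.melt(id_vars='name', var_name='year', value_name='score')
print(tidy)
```

In the wide table the year is hidden in the column headers; after melting, it becomes a variable of its own.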

We will be using a preprocessed version of the `avengers` dataset, by [FiveThirtyEight](http://fivethirtyeight.com/).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

avengers = pd.read_csv('data/avengers.csv')
avengers.head(3)

# 2 Types of data in Pandas

As stated above, a dataset is a collection of values, usually either numbers (quantitative) or strings (qualitative).

In [None]:
avengers.dtypes

The main pandas data types are:
* Numeric (`int`, `float`)
* Datetime (`datetime`, `timedelta`)
* Object (for strings).

The convenient `DataFrame.select_dtypes` allows us to select variables (columns in our dataframe) by data type.

In [None]:
(avengers.select_dtypes(include='object')
         .head(3))

# 3 Apply functions over variables (or columns)

Pandas provides us with a convenient `df.apply` method that enables us to apply a function over entire columns.

Let's use it to compute the mode, which is defined for numeric and non-numeric values alike.

In [None]:
from scipy import stats

avengers.apply(stats.mode)

Let's use `df.select_dtypes` and `df.apply` together to compute the mean for numerical columns.

In [None]:
(avengers.select_dtypes(include='int64')
         .apply(np.mean))
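
Selecting on `'int64'` works for this dataset, but it would silently skip any float columns. As a hedged alternative, the sketch below uses `include='number'` on a small made-up dataframe (the columns `appearances`, `rating` and `gender` are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical toy data with an integer, a float and a string column.
toy = pd.DataFrame({
    'appearances': [1269, 1165, 3068],
    'rating': [7.5, 8.1, 9.0],
    'gender': ['MALE', 'FEMALE', 'MALE'],
})

# include='number' keeps every numeric dtype (ints and floats alike).
numeric_means = toy.select_dtypes(include='number').mean()
print(numeric_means)
```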

# 4 Apply functions over observations (or rows)

Alternatively, we can use `df.apply` to apply functions over rows with a little adjustment, by setting `axis=1`.

Let's use it to compute the norm of our row vectors (sort of, since we are considering only the numerical columns for now).

In [None]:
from numpy.linalg import norm

(avengers.select_dtypes(include='int64')
         .apply(norm, axis=1)
         .sample())

As an experiment, and so that you see two different use cases, let's try to scale each row to a unit vector:
1. We will use `df.apply` to divide *each value or cell* by the norm of the row vector
2. We will use `df.apply` to compute the norm of the *entire row*, just like we did above, to see if we succeeded.

In [None]:
def normalize(row):
    """
    Takes a vector of values and transforms it into a unit vector with length 1.
    This is achieved by computing v / ||v|| for each value in the row vector.
    """
    return row / norm(row)

(avengers.select_dtypes(include='int64')
         .apply(normalize, axis=1)
         .apply(norm, axis=1)
         .sample())
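
Row-wise `df.apply` is convenient but can be slow on large dataframes. The following is a sketch of an equivalent vectorized approach on hypothetical toy data; it assumes no row is all zeros, since that would lead to a division by zero.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data: each row is treated as a vector.
toy = pd.DataFrame({'a': [3, 1, 0], 'b': [4, 1, 2]})

# Compute the Euclidean norm of every row at once.
row_norms = np.linalg.norm(toy.to_numpy(), axis=1)

# Divide each row by its own norm (axis=0 aligns the norms with the rows).
unit_rows = toy.div(row_norms, axis=0)

# Every row should now have length 1.
print(np.linalg.norm(unit_rows.to_numpy(), axis=1))
```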

# 5 Types of statistical data

There are three main types of statistical data:
1. Numerical
2. Categorical
3. Ordinal (which is a little bit of both, as you will see).

## 5.1 Numerical data

Numerical data is information that is measurable. It's always collected in number form, although not all data in number form is numerical.

Things we can do with numerical data:
* Mathematical operations (e.g., addition, distances and the normalization above)
* Sort it in ascending or descending order.

**Discrete data**

Discrete data can only take on certain values (e.g., counts), although the set of possible values may be finite or infinite.

Some data is continuous in nature, but measured in a discrete way (e.g., age in years).

In our `avengers` data, `TotalDeaths` and `TotalReturns` are discrete variables.

**Continuous data**

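Continuous data can take any value within a range. `Appearances` is the closest example in our data: strictly speaking it is a count, but it spans such a wide range of values that we treat it as continuous.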

### 5.1.1 Scaling numerical values

Often, the numeric variables in our dataset have very different scales, that is, they take on different ranges of values.

It's usually good practice to scale them during preprocessing. Typically, you will do one of two things:
1. Scale variables to a given range
2. Standardize all variables.

These transformations change the values themselves, but not the shape of their distribution. Scaling the data is important because:
* When predictor values have different ranges, particular features can dominate the algorithm (e.g., think [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance))
* Different scales can make estimators unable to learn correctly from certain features in smaller ranges
* You don't want your features to depend on the units of the measurement involved
* Optimization methods (e.g., gradient descent) will converge faster, and otherwise they may not converge at all.

A notable exception is decision tree-based estimators, which are robust to arbitrary scaling of the data.

**Scale all variables to a given range**

We would transform all variables so that the minimum and the maximum of the transformed data take certain values, e.g., 0 and 1:

$$ x_i' = \frac{x_i - x_{min}}{x_{max} - x_{min}} $$

In [None]:
from sklearn.preprocessing import MinMaxScaler

def scale_data(df, scaler, plot=True):
    df = df.copy()
    cols = df.select_dtypes(include='int64').columns
    df[cols] = scaler.fit_transform(df[cols])
    if plot:
        plot_scaled_data(df, cols)
    return df

def plot_scaled_data(df, cols):
    plt.figure(figsize=(10, 8))
    for col in cols:
        # histplot with kde=True replaces the deprecated sns.distplot
        sns.histplot(df[col], kde=True, stat='density')
    plt.title('Distribution of numerical variables (after scaling)')
    plt.show()

min_max_scaler = MinMaxScaler()
(avengers.pipe(scale_data, min_max_scaler)
         .describe())
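
To connect the formula with the scaler, here is a small sketch that applies the min-max formula by hand to a made-up column and compares it with `MinMaxScaler`. The values are arbitrary and chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature with three observations.
x = np.array([[2.0], [4.0], [10.0]])

# x' = (x - x_min) / (x_max - x_min), computed per column.
by_hand = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same transformation using the scikit-learn scaler.
by_sklearn = MinMaxScaler().fit_transform(x)

print(by_hand.ravel())
print(by_sklearn.ravel())
```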

**Standardize all variables**

Standardization means both centering the data around 0 (by removing the mean) and scaling it to unit variance:

$$ z_i =  \frac{x_i - \mu}{\sigma}$$

In [None]:
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()
(avengers.pipe(scale_data, standard_scaler)
         .describe())
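
As a sanity check on the formula, the sketch below standardizes a made-up column by hand and compares it with `StandardScaler`. Note that the scaler uses the population standard deviation (`ddof=0`), whereas pandas uses the sample standard deviation (`ddof=1`) by default, so the `std` row reported by `describe()` will not be exactly 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature with four observations.
x = np.array([[1.0], [2.0], [3.0], [6.0]])

mu = x.mean(axis=0)
sigma = x.std(axis=0)  # population standard deviation (ddof=0)

# z = (x - mu) / sigma, computed per column.
by_hand = (x - mu) / sigma
by_sklearn = StandardScaler().fit_transform(x)

print(by_sklearn.mean(), by_sklearn.std())
```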

## 5.2 Categorical data

Categorical data represents categories (e.g., gender, marital status, hometown).

Categorical variables can take on a limited, and usually fixed, number of possible values, also known as levels.

The categories can also take on numerical values (e.g., ids), but those numbers have no mathematical meaning:
* Mathematical operations on them are meaningless, even if the computer allows them
* Sorting them in ascending or descending order is meaningless too.

A limitation of categorical data in the form of strings is that estimators, in general, don't know how to deal with it.

**Binary data**

A binary variable is a variable with only two possible values, like `Active` and `Gender` in our `avengers` dataset.

Since our algorithms can't deal with data in the form of strings, we need to transform such variables to a numerical form.

The method `Series.map` allows us to easily deal with these cases, mapping inputs to outputs.

In [None]:
(avengers['Active'].map({'YES': 1, 'NO': 0})
                   .sample())

Let's use it to convert both columns to either 0 or 1.

In [None]:
(avengers.assign(Active=avengers['Active'].map({'YES': 1, 'NO': 0}),
                 Gender=avengers['Gender'].map({'MALE': 1, 'FEMALE': 0}))
         .sample())
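
One thing to keep in mind with `Series.map` and a dictionary: values that are missing from the mapping become `NaN`. The sketch below illustrates this on a hypothetical series containing a lowercase `'yes'`, which is an assumption for illustration and not something observed in the `avengers` data.

```python
import pandas as pd

# Hypothetical column with one inconsistently spelled value.
active = pd.Series(['YES', 'NO', 'yes', 'YES'])

# Keys that are not in the dictionary are mapped to NaN.
mapped = active.map({'YES': 1, 'NO': 0})

# Counting missing values is a cheap way to spot unmapped categories.
n_unmapped = mapped.isna().sum()
print(mapped)
print(n_unmapped)
```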

Pandas provides us with a `category` dtype for categorical data, which has several benefits:
* It lets us easily identify and signal categorical columns for processing and for other Python libraries
* Converting a string variable with few distinct values to a categorical variable saves memory
* By converting to a categorical, we can specify an order on the categories (more on this later).
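
The memory claim is easy to check. This is a rough sketch on a made-up series with few distinct values repeated many times; the labels are placeholders and the exact numbers will vary with the pandas version.

```python
import pandas as pd

# Hypothetical string column: 3 labels repeated 10,000 times.
universe = pd.Series(['Earth-A', 'Earth-B', 'Earth-A'] * 10_000)

# deep=True also counts the memory used by the strings themselves.
as_strings = universe.memory_usage(deep=True)
as_category = universe.astype('category').memory_usage(deep=True)

print(as_strings, as_category)
```

The categorical version stores each distinct label once, plus a small integer code per row.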

Let's consider a categorical feature: `Universe`.

In [None]:
avengers_cat = avengers.copy()
avengers_cat = avengers_cat.assign(Universe=avengers['Universe'].astype('category'))

avengers_cat.describe(include='category')

Categorical data has a `categories` and an `ordered` property:
* `Series.cat.categories` returns the different values (or levels) the variable can take on
* `Series.cat.ordered` returns whether the categorical variable has a natural order or not (hint: if it has, it's not purely categorical).

In [None]:
avengers_cat['Universe'].cat.categories

In [None]:
avengers_cat['Universe'].cat.ordered

**Dummy encoding**

Dummy encoding allows us to use categorical predictor variables in our models.

In [None]:
categorical_features = avengers_cat.select_dtypes(include='category').columns
(pd.get_dummies(avengers_cat, columns=categorical_features, drop_first=True).sample())
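
To see what `drop_first=True` does, here is a minimal sketch on a hypothetical column with three levels (`A`, `B`, `C`). With `drop_first=True` the first level is dropped and becomes the implicit baseline: rows belonging to it have all dummy columns set to zero.

```python
import pandas as pd

# Hypothetical categorical column with three levels.
toy = pd.DataFrame({'universe': pd.Categorical(['A', 'B', 'C', 'A'])})

# One column per level.
full = pd.get_dummies(toy, columns=['universe'])

# One column fewer: level 'A' becomes the baseline.
reduced = pd.get_dummies(toy, columns=['universe'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level avoids perfectly collinear columns, which matters for some linear models.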

**High cardinality data**

The column `Name` is an example of a high cardinality categorical variable: if we try to dummify it, the *dimensionality* of the dataset will explode (more on this later).

In fact, in this particular case, since it works as an identifier, we should simply drop it, as it carries no relevant information.

In [None]:
avengers = avengers.drop('Name', axis=1)
avengers.head(3)

An alternative way to deal with high cardinality would be to keep only the most frequent values, and encode the remaining ones as a special case (e.g., "others").
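
One possible way to implement this idea is sketched below, using `value_counts` to find the most frequent values and `Series.where` to replace everything else. The `city` data and the cut-off of two values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical high cardinality column.
city = pd.Series(['a', 'a', 'a', 'b', 'b', 'c', 'd'])

# Keep the 2 most frequent values (an arbitrary cut-off).
top = city.value_counts().nlargest(2).index

# Keep values that are in `top`, replace the rest with 'others'.
lumped = city.where(city.isin(top), 'others')

print(lumped.value_counts())
```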

## 5.3 Ordinal data

Ordinal statistical data refers to categories that have a natural order, but the distance between them is not known.

We will use the `Membership` variable as an example, since there appears to be an order in the degree of commitment of our avengers.

We can also use the `category` dtype.

In [None]:
avengers_ord = avengers.copy()
avengers_ord = avengers_ord.assign(Membership=avengers['Membership'].astype('category'))

avengers_ord['Membership'].cat.categories

However, this time we need to set the order for our categories, since there is one! The `category` dtype is flexible enough to accommodate this.

In [None]:
ordered_cats = ['Honorary', 'Academy', 'Probationary', 'Full']
avengers_ord = avengers_ord.assign(Membership=avengers_ord['Membership'].cat.set_categories(ordered_cats, ordered=True))

avengers_ord['Membership'].min(), avengers_ord['Membership'].max()

Again, remember that our models need variables in numeric form, in order to be able to make sense of them.

The `category` dtype deals with this gracefully for us.

In [None]:
(avengers_ord.assign(Membership=avengers_ord['Membership'].cat.codes)
         .sample(n=5))

However, and as usual, there is a trade-off here:
* If we assign integer values to our ordinal categories we are imposing the assumption that they are equally spaced
* If we convert them to dummy variables, we lose the information about their order.
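
The trade-off can be seen side by side in the sketch below. It reuses the membership levels from above on a small made-up series (the sample values are hypothetical).

```python
import pandas as pd

# Levels in increasing order of commitment, as defined earlier.
levels = ['Honorary', 'Academy', 'Probationary', 'Full']
dtype = pd.CategoricalDtype(categories=levels, ordered=True)

# Hypothetical sample of memberships.
membership = pd.Series(['Full', 'Academy', 'Honorary', 'Full']).astype(dtype)

# Option 1: integer codes keep the order, but imply equal spacing.
codes = membership.cat.codes

# Option 2: dummies make no spacing assumption, but drop the order.
dummies = pd.get_dummies(membership)

print(list(codes))
print(dummies)
```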

# 6 Bonus (not required for exercises)

## 6.1 Scaling with outliers

Scalers differ from each other in the way they estimate the parameters used to shift and scale each feature.

In the presence of some very large outliers, using the scalers above leads to the compression of inliers: since outliers influence the minimum, maximum, mean and standard deviation, these scalers will *shrink* the range of the remaining feature values.

The alternative is to scale the features in a way that is robust to outliers: using the median (instead of the mean) and the interquartile range.

In [None]:
from sklearn.preprocessing import RobustScaler

robust_scaler = RobustScaler()
(avengers.pipe(scale_data, robust_scaler)
         .describe())
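
As a sketch of what the robust scaler computes with its default settings, the code below subtracts the median and divides by the interquartile range by hand, on a made-up column with one large outlier, and compares the result with `RobustScaler`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with one very large outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)

# (x - median) / IQR, computed per column.
by_hand = (x - median) / (q3 - q1)
by_sklearn = RobustScaler().fit_transform(x)

print(by_hand.ravel())
```

The inliers stay spread out around 0, while the outlier remains far away instead of squeezing everything else together.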

## 6.2 Other ways to encode

When you are not processing the entire dataset at once, the `sklearn` encoders below work well for keeping the encoding consistent:
* The `.fit()` method assigns our categories or labels to a specific output (e.g., a numerical value)
* Then `.transform()` transforms the data using this mapping, failing gracefully when strange things happen (e.g., unseen categories).

Also, they can be used in very convenient ways with other `sklearn` utilities in a typical workflow.

On the other hand, none of them is an option to deal with ordinal data (unless you decide to go with dummy encoding).

### 6.2.1 Label encoder

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit_transform(avengers['Universe'])
le.classes_

If we try to transform categories previously unseen by the encoder, it will raise an error (which is a good thing).

In [None]:
# this is supposed to go wrong :)
le.transform(avengers['Membership'])
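
If you want to observe the failure without interrupting the notebook, one option is to catch the exception. This is a minimal sketch with hypothetical labels (`Earth-1`, `Earth-2`, `Earth-3`).

```python
from sklearn.preprocessing import LabelEncoder

# Fit on two hypothetical labels.
le = LabelEncoder()
le.fit(['Earth-1', 'Earth-2'])

try:
    # 'Earth-3' was never seen during fit.
    le.transform(['Earth-3'])
    raised = False
except ValueError as err:
    raised = True
    print(f'Unseen label rejected: {err}')
```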

### 6.2.2 One-hot encoder

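In older versions of `sklearn` (before 0.20), this encoder only accepted inputs in numerical form, so it was typically used after the label encoder. Recent versions also accept string categories directly, but we keep the two-step approach here.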

In [None]:
from sklearn.preprocessing import OneHotEncoder

le = LabelEncoder()
universe_numeric = le.fit_transform(avengers['Universe'])
# `sparse_output` requires scikit-learn >= 1.2 (use `sparse=False` on older versions)
he = OneHotEncoder(sparse_output=False, handle_unknown='error')
he.fit_transform(universe_numeric.reshape(-1, 1))
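
For comparison, here is a sketch of the one-step approach, assuming `sklearn` 0.20 or later, where the encoder takes string categories directly. It uses hypothetical labels and `handle_unknown='ignore'`, which encodes unseen categories as all zeros instead of raising an error. Whether that is preferable to failing depends on the use case.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical string feature, shaped as a single column.
universe = np.array([['A'], ['B'], ['A']])

# The default output is a sparse matrix; .toarray() makes it dense.
ohe = OneHotEncoder(handle_unknown='ignore')
encoded = ohe.fit_transform(universe).toarray()

# 'C' was not seen during fit, so it becomes a row of zeros.
unseen = ohe.transform(np.array([['C']])).toarray()

print(encoded)
print(unseen)
```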