# Data Transformation
*Curtis Miller*

In this notebook I describe further data transformation topics.

## Binning Quantitative Data

We may refer to data that can take any number within a certain range (or any finite number at all) as quantitative data, and data that falls into one of a fixed set of categories as categorical data.

We cannot go from categorical data to quantitative data without imputation (and trouble), but we can turn quantitative data into categorical data by binning. That is, we don't report the number itself; we only report which of several bins it fell into. Collectively, the bins account for every value taken in the dataset.

Let's demonstrate. First, load in a dataset.

In [None]:
import pandas as pd
%matplotlib inline

In [None]:
pop_pyramids = pd.read_csv("PopPyramids.csv", index_col=["Country", "Year", "Age"])
pop_pyramids = pop_pyramids.loc[:, ["Male Population", "Female Population"]]    # Only want two columns, for illustration
pop_pyramids.columns = pd.Index(["Male", "Female"])
pop_pyramids.head()

To bin data, use the **pandas** function `cut()`, which, when given a `Series`, returns a `Series` of binned (categorical) data.

We use this to bin population counts.

In [None]:
srs = pd.cut(pop_pyramids.Male, 10)    # Give number of bins; result is a categorical variable, in a Series
srs

In [None]:
srs.value_counts()

In [None]:
srs.value_counts().plot(kind="bar")    # Recent pandas versions require kind to be passed as a keyword

In [None]:
srs = pd.cut(pop_pyramids.Male, [0, 1000, 10000, 100000, 1000000, 10000000,    # Give bin edges
                                 100000000])
srs

In [None]:
srs.value_counts()

In [None]:
srs.value_counts().plot(kind="bar")    # Recent pandas versions require kind to be passed as a keyword
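The binning calls above can be checked on a tiny example. The sketch below uses made-up counts (not values from `PopPyramids.csv`) to show that intervals built from explicit edges are closed on the right, so a value equal to the lowest edge falls outside every bin unless `include_lowest=True` is passed. The names `counts`, `binned`, `binned_with_zero`, and `labeled` are illustrative only.

```python
import pandas as pd

# Hypothetical toy counts, not taken from PopPyramids.csv
counts = pd.Series([0, 5, 1000, 1001, 250000])

# Explicit edges give right-closed intervals: (0, 1000] and (1000, 1000000]
binned = pd.cut(counts, [0, 1000, 1000000])

# include_lowest=True also closes the first interval on the left, so 0 is kept
binned_with_zero = pd.cut(counts, [0, 1000, 1000000], include_lowest=True)

# Optional labels replace the interval notation with readable names
labeled = pd.cut(counts, [0, 1000, 1000000], labels=["small", "large"],
                 include_lowest=True)

print(binned)
print(binned_with_zero)
print(labeled)
```

If the data may contain values equal to the smallest edge, checking `binned.isna().sum()` after binning is a quick way to see whether anything fell outside the bins.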

## Clamping Quantitative Data

Clamping restricts quantitative data to a certain range. Data falling outside this range is replaced with the nearest endpoint of the range.

Why clamp? If you're clamping to prevent "impossible" values, be careful; there may be a better approach. (For example, in a data set of people and their ages, one reported age might be 220. This is clearly wrong, and you might be tempted to clamp age so it cannot be less than 0 or more than 100. On the other hand, 100 might be an inappropriate correction for this error; perhaps the data entry professional typed "220" instead of "20" or "22".)

I won't ask why you want to clamp; let's assume you have a good reason. The pandas method `clip()` can be used for clamping.

In [None]:
pop_pyramids.Male["China"]

In [None]:
srs = pop_pyramids.Male.clip(lower=0, upper=1000000)    # All data now within range [0, 1000000]
srs["China"]
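To make the caution about "impossible" values concrete, here is a minimal sketch on made-up ages (the series `ages` is hypothetical). It clamps with `clip()` and also records which entries were altered, so they can be inspected later rather than silently overwritten.

```python
import pandas as pd

# Hypothetical ages containing two implausible entries
ages = pd.Series([-3, 20, 45, 220])

# Clamp to the range [0, 100]
clamped = ages.clip(lower=0, upper=100)

# Flag the entries that clamping changed, so they can be reviewed by hand
changed = ages != clamped

print(pd.DataFrame({"original": ages, "clamped": clamped, "changed": changed}))
```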

## Recoding and Replacing

Some data sets encode certain data values in certain ways. Male/female in a survey could be recorded as:

* Strings: `"male"` and `"female"`, or `"M"` and `"F"`
* Numbers: `0` and `1`, or `1` and `0`, or `1` and `2`

Missing data is sometimes recorded with a special code; for example, a missing age could be coded as `999` (obviously not a real age). We may wish to replace such codes with our preferred encoding.

I generate a fictitious dataset below, censor it by giving "missing" data the value `999`, then replace `999` with `NaN`.

In [None]:
import numpy as np
from numpy.random import randn

In [None]:
vec = (randn(12) * 10).round()
vec[[1, 2, 5, 6]] = 999
df = pd.DataFrame(vec.reshape(4, 3))
df

In [None]:
df.replace({999: np.nan}, inplace=True)    # Replacement scheme, done in place
df

In [None]:
df2 = pd.DataFrame({"Sex": ['m', 'f', 'f', 'f', 'm', 'f'],
                    "HoursSlept": [6, 6, 9, 8, 5, 8]})
df2

In [None]:
df2["Sex"] = df2["Sex"].map({'m': 0, 'f': 1})    # Recode by assignment; an in-place replace on a selection may not modify df2
df2

In [None]:
df2.mean()    # Interpretable value for Sex: it's the proportion of the sample that is female
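The two ideas in this section, recoding categories and replacing a missing-value code, can be combined in one small sketch. The `survey` table and its values are invented for illustration. It assumes `999` is the only sentinel in the `Age` column and uses `map()` for the category recode, which is one of several ways to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical survey; 999 marks a missing age
survey = pd.DataFrame({"Sex": ["m", "f", "f", "f", "m", "f"],
                       "Age": [34, 999, 27, 41, 999, 58]})

recoded = survey.copy()                                   # Leave the raw data untouched
recoded["Sex"] = recoded["Sex"].map({"m": 0, "f": 1})     # 0/1 indicator for female
recoded["Age"] = recoded["Age"].replace(999, np.nan)      # Sentinel becomes NaN

prop_female = recoded["Sex"].mean()    # Mean of a 0/1 indicator is a proportion
mean_age = recoded["Age"].mean()       # NaN values are skipped by default

print(prop_female, mean_age)
```

Had the `999` codes been left in place, they would have been averaged in as if they were real ages, which is the main reason to replace them before computing statistics.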

## Derivative Values

Here we calculate useful statistics from existing data. For example, we can reconstruct the columns excluded from the `pop_pyramids` dataset.

In [None]:
pop_pyramids["Total"] = pop_pyramids.Male + pop_pyramids.Female    # Total population
pop_pyramids.head()

In [None]:
pop_pyramids["MalePercentage"] = pop_pyramids.Male / pop_pyramids.Total    # A proportion between 0 and 1, despite the column name
pop_pyramids.head()

In [None]:
pop_pyramids["FemalePercentage"] = pop_pyramids.Female / pop_pyramids.Total    # A proportion between 0 and 1, despite the column name
pop_pyramids.head()

In [None]:
pop_pyramids["MaleFemaleRatio"] = pop_pyramids.Male / pop_pyramids.Female
pop_pyramids.head()

In [None]:
# Which countries have the most men relative to women?
pop_pyramids.sort_index(inplace=True)    # Cannot do slicing without sorting first
pop_pyramids.loc[(slice(None), 2017, "Total"), "MaleFemaleRatio"].sort_values(ascending=False)
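The selection above relies on slicing a sorted `MultiIndex`. As a minimal sketch of the same pattern, the toy frame below (invented values, with the same three level names) is sorted and then sliced using `pd.IndexSlice`, which is an alternative spelling of the `slice(None)` tuples.

```python
import pandas as pd

# Hypothetical toy data with the same index levels as pop_pyramids
idx = pd.MultiIndex.from_tuples(
    [("B", 2017, "Total"), ("A", 2017, "0-4"), ("A", 2017, "Total"),
     ("B", 2016, "Total"), ("A", 2016, "Total"), ("B", 2017, "0-4")],
    names=["Country", "Year", "Age"])
toy = pd.DataFrame({"Ratio": [1.10, 1.05, 0.98, 1.08, 0.97, 1.06]}, index=idx)

toy = toy.sort_index()    # Sort before slicing across index levels

ix = pd.IndexSlice        # ix[:, 2017, "Total"] means (slice(None), 2017, "Total")
totals_2017 = toy.loc[ix[:, 2017, "Total"], "Ratio"].sort_values(ascending=False)

print(totals_2017)
```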

## Mathematical/Statistical Transformations

Sometimes we want statistically transformed versions of the data. That is, we apply a mathematical function to the data and use the resulting values for analysis. For example:

* We may use $z_i = \frac{x_i - \bar{x}}{s_x}$ to rescale/reshape data ($\bar{x}$ is the data's mean, $s_x$ the data's standard deviation)
* In time series, we may be interested in log-differences, where $r_t = \log{x_t} - \log{x_{t - 1}}$; this is often applied to, say, stock prices

This type of transformation can be done easily.

In [None]:
xbar = pop_pyramids.loc[pop_pyramids.index.get_level_values(2) != "Total",    # Exclude "Total" rows
                        :].mean()    # Get mean population count
xbar

In [None]:
stdev = pop_pyramids.loc[pop_pyramids.index.get_level_values(2) != "Total",    # Exclude "Total" rows
                         :].std()    # Get standard deviation of population counts
stdev

In [None]:
# Centering at 0/scaling to 1
pop_pyramids["ScaledCenteredTotal"] = (pop_pyramids["Total"] - xbar["Total"]) / stdev["Total"]
pop_pyramids.loc[(slice(None), slice(None), "Total"), "ScaledCenteredTotal"] = np.nan    # Missing because nonsense
pop_pyramids.loc[("Afghanistan", 2016), :]
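As a quick sanity check of the centering and scaling formula, the sketch below standardizes a small made-up series and confirms that the result has mean 0 and standard deviation 1. Note that `Series.std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical values, chosen only for illustration
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# z_i = (x_i - xbar) / s_x, with s_x the sample standard deviation
z = (x - x.mean()) / x.std()

print(z.mean(), z.std())    # Approximately 0 and 1, up to floating-point error
```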

In [None]:
# log populations
pop_pyramids["LogMale"] = np.log10(pop_pyramids.Male + 1)    # Base-10 log; add 1 first so a count of 0 maps to 0, not -inf
pop_pyramids["LogFemale"] = np.log10(pop_pyramids.Female + 1)
pop_pyramids["LogTotal"] = np.log10(pop_pyramids.Total + 1)
pop_pyramids.head()
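The log-difference transformation $r_t = \log{x_t} - \log{x_{t - 1}}$ mentioned earlier is not demonstrated on the population data, so here is a minimal sketch on a made-up price series (the values in `prices` are hypothetical). It takes the natural log and then differences consecutive observations with `diff()`.

```python
import numpy as np
import pandas as pd

# Hypothetical prices observed at four consecutive times
prices = pd.Series([100.0, 110.0, 99.0, 99.0])

# r_t = log(x_t) - log(x_{t-1}); the first entry has no predecessor, so it is NaN
log_returns = np.log(prices).diff()

print(log_returns)
```

One convenient property is that log-differences add up over time: the sum of the non-missing entries equals `np.log(prices.iloc[-1] / prices.iloc[0])`.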