# BLU02 - Learning Notebook - Data wrangling workflows - Part 1 of 3

In [2]:
import numpy as np
import pandas as pd
import os

from sklearn.datasets import load_iris

# About the BLU 

## Data wrangling workflows

A typical data science workflow goes as follows: you get data from a source, you clean it, and then you continuously iterate on it.

![data_transformation_workflow](./media/data_processing_workflow.png)

In the previous learning unit in this specialization, we focused mostly on getting and cleaning data (the blue boxes above).

At this point, we have a dataset and have performed the necessary data cleaning. Our data is, therefore, in an interim state:
* You have a *tidy dataset* (observations as rows and features as columns), comprising one or more tables
* You know how to import such tables into Pandas, regardless of the format they are stored in.

Now, to explore, visualize and model the data, we have to perform transformations on it with agility, balancing:
* Speed of iteration, testing different hypotheses quickly and easily
* Consistency, ensuring our pipeline doesn't collapse along the way.

In this first part, we'll go into how to transform the dataset and explore it in Pandas.

Then, we'll zoom in on how to combine dataframes.

Finally, we'll move to scikit-learn to build efficient pipelines for modeling.

# About the data

The New York Philharmonic played its first concert on December 7, 1842.

The data documents all known concerts, amounting to more than 20,000 performances. Some considerations:
* The Program is the top-most level element in the dataset
* A Program is defined as performances in which the repertoire, conductors, and soloists are the same
* A Program is associated with an Orchestra (e.g., New York Philharmonic) and a Season (e.g., 1842-43)
* A Program may have multiple Concerts with different dates, times and locations
* A Program's repertoire may contain various Works (e.g., two different symphonies by Beethoven)
* A Work can have multiple Soloists (e.g., Mahler on the harpsichord, Strauss or Bernstein on the piano).

**For more information about the dataset, including the data dictionary, please head to the README.**

In this unit, we will be using Works and Concerts, imported as follows.

In [3]:
works = pd.read_csv('./data/works.csv')
concerts = pd.read_csv('./data/concerts.csv')

In [None]:
works.head()

In [None]:
concerts.head()

# 1 Data transformation

## 1.1 Transformations as functions 

Most data transformations operate on dataframes: they receive a local dataframe, transform it and return a new one.

In its simplest form, this is the signature of a generic data transformer.

In [None]:
def data_transformer(df):
    df = df.copy()
    # df = ...
    return df

Such transformations have no side effects and operate as functions on immutable data (i.e., they keep the original dataframe unchanged).

Since the output depends only on the arguments, calling them with the same arguments always produces the same result.

Confusing? Not really.

In [None]:
def rename_column(df, new_name, old_name):
    df = df.copy()
    df[new_name] = df[old_name]
    df = df.drop(columns=old_name)
    return df

def test_dataframe():
    data = np.random.randn(6, 4)
    columns = ['A', 'B', 'C', 'D']
    return pd.DataFrame(data=data, columns=columns)

df = test_dataframe()

rename_column(df, 'Z', 'A')

In [None]:
rename_column(df, 'Z', 'A')

The same result, see! The program (or Notebook) remembers nothing but the original data and the function itself: a white canvas!

What about the original dataframe?

In [None]:
df

After each call, the program *state* is the same as it was before (no new objects, no changes, no nothing!), as if nothing happened.

This property is valid for as long as we don't explicitly overwrite the original dataframe outside the function, using an assignment.

In [None]:
df = test_dataframe()
df = rename_column(df, 'Z', 'A')

try:
    rename_column(df, 'Z', 'A')
except KeyError:
    print("For some reason this doesn't work. Why is that?")

Mutable data is dangerous because it makes programs unpredictable. And this is why you should avoid modifying objects after creation.

This pitfall is common in Notebooks, especially when you re-run cells, run them in a different order or restart the Kernel. (Am I right?)
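To make the danger concrete, here is a minimal sketch contrasting a transformer that mutates its input with one that works on a copy. Both function names are hypothetical and exist only for this illustration.

```python
import pandas as pd

def rename_in_place(df, new_name, old_name):
    # Hypothetical counter-example: this mutates the caller's dataframe.
    df[new_name] = df[old_name]
    del df[old_name]
    return df

def rename_pure(df, new_name, old_name):
    # Works on a copy, so the caller's dataframe is left untouched.
    df = df.copy()
    df[new_name] = df[old_name]
    return df.drop(columns=old_name)

original = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

pure_result = rename_pure(original, 'Z', 'A')
columns_after_pure = list(original.columns)      # still ['A', 'B']

rename_in_place(original, 'Z', 'A')
columns_after_in_place = list(original.columns)  # now ['B', 'Z']
```

Calling `rename_in_place()` a second time on `original` would fail, because column `A` is gone, whereas `rename_pure()` can be called on its unchanged input as often as we like.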

**Data transformation is a *pipeline***

Another problem is that data transformation is about applying multiple, sequential changes to the data (i.e., a multistep process).

![data_transformation_pipeline](./media/data_transformation_pipeline.png)

*Fig 2. - A data transformation pipeline is a multistep process.*

And once we realize this, how do we go about it?

In [None]:
df = test_dataframe()

df_renamed = rename_column(df, 'Z', 'A')
# Code happens. Ideas are tested, hours go by.
df_renamed_without_b = df_renamed.drop(columns='B')
# More code happens. We keep on testing ideas, days go by.
df_renamed_without_b_positive = df_renamed_without_b[df_renamed_without_b > 0]
# There's a lot of code. Ideas come and go, we've been doing this for a week.
df_renamed_without_b_positive_no_nans = df_renamed_without_b_positive.dropna(how='all')
# Can we honestly trace back how to get from df to here? Probably not.
df_renamed_without_b_positive_no_nans

Using functions instead, you concisely encapsulate everything. 

(Also, you spend less time naming things, unless you want to.)

In [None]:
def data_transformer(df, how_to_dropna):
    df = df.copy()
    df = rename_column(df, 'Z', 'A')
    df = df.drop(columns='B')
    df = df[df > 0]
    df = df.dropna(how=how_to_dropna)
    return df

data_transformer(df, how_to_dropna='all')

This function is a bad one: names are not explicit, and there are no apparent blocks of logic.

Functions should organize and document our codebase (*what* you are doing and how).

Using functions, immutable data and avoiding side effects is a smart choice to manage complexity and keep things understandable.

Alternatively, we could structure our functions more like this.

In [None]:
def preprocess_data(df):
    df = df.copy()
    # df = rename_misspelled_columns(df)
    # df = drop_unnecessary_columns(df)
    # df = keep_only_positive_values(df)
    # df = remove_missing_values(df)
    return df
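As a rough sketch of how this structure could be filled in, the helpers below are hypothetical implementations of the commented-out steps, written against the toy `A`/`B`/`C`/`D` columns used earlier. They are one possible reading of those steps, not a prescribed implementation.

```python
import pandas as pd

# Hypothetical helpers: the names mirror the commented-out steps above.
def rename_misspelled_columns(df):
    return df.rename(columns={'A': 'Z'})

def drop_unnecessary_columns(df):
    return df.drop(columns='B')

def keep_only_positive_values(df):
    # Values that are not strictly positive become NaN.
    return df[df > 0]

def remove_missing_values(df, how='all'):
    return df.dropna(how=how)

def preprocess_data_sketch(df):
    df = df.copy()
    df = rename_misspelled_columns(df)
    df = drop_unnecessary_columns(df)
    df = keep_only_positive_values(df)
    df = remove_missing_values(df)
    return df

raw = pd.DataFrame({
    'A': [1.0, -1.0, -2.0],
    'B': [0.5, 0.5, 0.5],
    'C': [2.0, -3.0, 4.0],
    'D': [-1.0, -1.0, 5.0],
})
clean = preprocess_data_sketch(raw)
```

Each helper has one job and a name that says what it does, so the body of `preprocess_data_sketch()` reads like a table of contents for the pipeline.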

## 1.2 Data transformation in Pandas

Pandas provides convenient methods for most data transformation tasks, with a unified, well-known syntax and consistent interfaces.

For example, we don't need to create a `rename_column()` function, since Pandas already provides a `df.rename()` method for us.

In [None]:
df = test_dataframe()

df.rename({'A': 'Z'}, axis=1)

As a recap: `df.rename()` follows our transformer signature:
* It takes a dataframe as input 
* And returns a new one as output.

This predictable input/output is what we mean by consistent interfaces! 

This looks very promising for building multistep pipelines, no? What transformations can we perform this way?

### 1.2.1 Subsetting columns or the index

#### Take a subset of indexes or columns

Pandas implements this functionality, somewhat counterintuitively, as `df.filter()`.

Imagine that we want only the columns related to the work itself, excluding IDs.

In [None]:
work_related_columns = ['ComposerName', 'WorkTitle', 'Movement']
# Select columns by name.
works.filter(items=work_related_columns).head()

We can also use it to subset our dataframe based on the index.

In [None]:
# Select rows containing 'Glass' on the index.
works.set_index('ComposerName').filter(like='Glass', axis=0).reset_index().head()
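Note that `df.filter()` matches on labels (column names or index values), never on the data itself. Besides `items`, it accepts `like` and `regex`. The sketch below uses a small made-up dataframe, so the values and the `WorkID` column are assumptions for illustration only.

```python
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame({
    'ProgramID': [1, 2, 3],
    'WorkID': [10, 20, 30],
    'ComposerName': ['Glass, Philip', 'Beethoven, Ludwig van', 'Glass, Philip'],
    'WorkTitle': ['A', 'B', 'C'],
})

# Columns whose name ends in 'ID'.
id_columns = toy.filter(regex='ID$')

# Columns whose name contains 'Title'.
title_columns = toy.filter(like='Title')

# Rows whose index label contains 'Glass'.
glass_rows = toy.set_index('ComposerName').filter(like='Glass', axis=0)
```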

### 1.2.2 Drop columns

Additionally, `GUID` and `ProgramID` are pretty redundant. We can get rid of `GUID`.

In [None]:
works.drop(columns='GUID').head()

### 1.2.3 Group By

This is a very extensive topic, and we'll just scratch its surface here, so that you know it exists and can explore it further on your own.

In case you've worked with SQL before, you'll find this very familiar :)

So, in Pandas there is a process of three chained steps called split-apply-combine:

- split: split the DataFrame into groups (this is the `groupby`); we need to specify the column or columns used to form the groups
- apply: apply a function to each group (aggregation, transformation or filtration)
- combine: combine the results into a new object, such as a DataFrame or a Series
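To see the three steps separately, here is a minimal sketch that performs them by hand on a toy dataframe and compares the outcome with the equivalent one-liner. The `Attendance` column is made up for the example and is not part of the concerts data.

```python
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame({
    'ProgramID': [1, 1, 2, 2, 2],
    'Attendance': [100, 150, 80, 90, 70],
})

# Split: one sub-dataframe per ProgramID.
groups = {key: group for key, group in toy.groupby('ProgramID')}

# Apply: reduce each group to a single number.
totals = {key: group['Attendance'].sum() for key, group in groups.items()}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(totals, name='Attendance').rename_axis('ProgramID')

# The usual one-liner performs all three steps at once.
builtin = toy.groupby('ProgramID')['Attendance'].sum()
```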

With this out of the way, it's time to focus on three types of functions you may want to apply:
* Aggregation (e.g., sum, count, mean)
* Transformation (e.g., filling missing values)
* Filtration (e.g., discard data from underrepresented groups).

We will drill down into each one of them. For this next part, we will be using the `concerts` data.



**Aggregation** (e.g., sum, count, mean)

We want to start by uncovering the most popular programs, by the number of performances.

So the first step is to group our data by `ProgramID`. This returns a DataFrameGroupBy object that by itself doesn't tell us much.

However, we can use the `groups` property of the DataFrameGroupBy object to inspect the groups.


In [None]:
concerts_grouped_by_ProgramID = concerts.groupby('ProgramID')
concerts_grouped_by_ProgramID

In [None]:
concerts_grouped_by_ProgramID.groups

Now, if we want to know the number of performances per program, we can simply call `DataFrameGroupBy.size()`.

In [None]:
concerts_grouped_by_ProgramID.size()

We get the same result by chaining all the operations together.

In [None]:
concerts.groupby('ProgramID').size()

The operation above returns a Series with the number of performances per `ProgramID`. To find the most popular programs, we can use `Series.nlargest()`, which returns the five largest values by default.

In [None]:
concerts.groupby('ProgramID').size().nlargest()

We use `GroupBy.size()` to count the number of elements in each group. A list of available methods:
* `mean()`
* `sum()`
* `size()`
* `count()`
* `std()`
* `var()`
* `sem()`
* `describe()`
* `first()`
* `last()`
* `nth()`
* `min()`
* `max()`.

Alternatively, we can use `GroupBy.agg()` as a more general method. Check out the [docs](http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.agg.html).

You can even define your own aggregation function, for instance as a lambda; we will see an example of a lambda in the transformation topic.
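As a small sketch of `GroupBy.agg()`, the example below combines built-in aggregations (referenced by name) with a custom lambda, using keyword arguments to label the output columns. The toy data and the `Year` column are assumptions made for illustration.

```python
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame({
    'ProgramID': [1, 1, 2, 2, 2],
    'Year': [1900, 1907, 1950, 1950, 1954],
})

summary = toy.groupby('ProgramID')['Year'].agg(
    first='min',                                   # built-in, by name
    last='max',                                    # built-in, by name
    n_concerts='count',                            # built-in, by name
    span=lambda years: years.max() - years.min(),  # custom aggregation
)
```

Each keyword becomes a column in the result, so `summary` has one row per `ProgramID` and the columns `first`, `last`, `n_concerts` and `span`.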

Let's now imagine that we want to know when the first performance of each program took place.
We start by grouping the concerts by `ProgramID` and then, for each group, we take the `min` of the `Date` column.

In [None]:
concerts.groupby('ProgramID').Date.min()

And what if we want to know when the last performance took place?

In [None]:
concerts.groupby('ProgramID').Date.max()

**Transformation**

The inherent difference between aggregation and transformation is that the latter returns an object the same size as the original input.

We don't have an excellent example in our dataset, so let's use the [Iris](https://archive.ics.uci.edu/ml/datasets/iris) dataset to exemplify a possible use case.

In [None]:
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df = df.assign(Target=iris.target)
df.head()

Now, imagine that we want to standardize the data, but using the statistics for each group:
* We take the mean and the standard deviation *inside each group*
* We want to standardize each value according to that.

We implement it as a lambda function, that will be applied to each group separately.

In [None]:
zscore = lambda x: (x - x.mean()) / x.std()

df.groupby('Target').transform(zscore).head()

The downside is that `transform()` returns an object indexed like the original dataframe, not by group, so the grouping column `Target` is not part of the output.

Let's make sure we got this right.

In [None]:
df_transformed = df.groupby('Target').transform(zscore)

df_transformed.groupby(df['Target']).agg(['mean', 'std'])

We did it! Zero mean and standard deviation equal to one *inside each group*.

Now, if this use case doesn't ring a bell, think about replacing missing values with the group mean, for example. Useful, right?
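Here is a minimal sketch of that missing-value use case, on made-up data with a single feature column. Selecting the column before calling `transform()` and assigning the result back is one way to keep `Target` in the dataframe.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame({
    'Target': [0, 0, 0, 1, 1, 1],
    'petal_length': [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
})

# Replace missing values with the mean of the group they belong to.
fill_with_group_mean = lambda x: x.fillna(x.mean())

filled = toy.assign(
    petal_length=toy.groupby('Target')['petal_length'].transform(fill_with_group_mean)
)
```

The missing value in group 0 becomes 2.0 (the mean of 1.0 and 3.0), and the one in group 1 becomes 15.0 (the mean of 10.0 and 20.0).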

**Filtration**

The method `DataFrameGroupBy.filter()` provides a convenient way to filter out elements that belong to underrepresented groups.

In [None]:
concerts.groupby('ProgramID').filter(lambda x: x.shape[0] > 15).head()

It returns a subset of the original dataframe, depending on a function applied to the group as a whole.
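The sketch below shows the same idea on a toy dataframe small enough to check by eye. The data and the `Location` values are made up.

```python
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame({
    'ProgramID': [1, 1, 1, 2, 3, 3],
    'Location': ['A', 'B', 'C', 'D', 'E', 'F'],
})

# Keep only the rows that belong to programs with at least two concerts.
kept = toy.groupby('ProgramID').filter(lambda group: len(group) >= 2)
```

The function is evaluated once per group and must return a single boolean. Rows from groups that return `False` (here, program 2) are dropped, and the remaining rows keep their original order and index.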

### 1.2.4 Method chaining

Now we know about some of the most common individual transformations we can use. But how can we combine them?

Data transformation is a pipeline, i.e., some sequential transformations, *chained* together.

This chaining means that each transformation returns an object that will be consumed by the next one, and so on, in a pre-defined order.

As we've seen, a familiar syntax is to declare intermediate variables for each output, used as input to the next function.

As an example, let us define a simple function that takes a subset of a dataframe based on a mask.

In [None]:
def subset(df, mask):
    return df[mask]

In [None]:
mask = works['Interval'].isnull()
df_no_intervals = subset(works,mask)

df_exclude_minor_composers = df_no_intervals.groupby('ComposerName').filter(lambda x: x.shape[0] > 10)

df_work_related = df_exclude_minor_composers.filter(items=work_related_columns)
df_work_related_no_movement = df_work_related.drop(columns='Movement')
df_work_related_no_movement_unique = df_work_related_no_movement.drop_duplicates()

works_per_composer = df_work_related_no_movement_unique.groupby('ComposerName').size()
works_per_composer_sorted = works_per_composer.nlargest()
works_per_composer_sorted

These declarations are mostly a convenience: naming each step can make a confusing data pipeline easier to read and express. Some downsides:
* We need to create an extra variable per intermediate step
* Cognitive burden of naming each variable and keeping them in mind
* They make the code less fluid
* They make it harder to visualize the whole picture of what your program (or Notebook) is doing
* They are error-prone and heavily reliant on the state, which is dangerous as we've seen.

What if there was an alternative?

Method chaining allows invoking multiple method calls chained together in a single statement, each receiving and returning an object.

This syntax has always been possible with Pandas, but more and more methods are being added that (try to guess it!):
* Receive a dataframe
* Return a transformed dataframe.

In [None]:
no_intervals = works['Interval'].isnull()
df_no_intervals = subset(works, no_intervals)

df_work_related = df_no_intervals.filter(items=work_related_columns)

(df_work_related.groupby('ComposerName').filter(lambda x: x.shape[0] > 10)
                .drop(columns='Movement')
                .drop_duplicates()
                .groupby('ComposerName').size()
                .nlargest()
                .plot(kind='bar'));

Code flows from top to bottom, and the function parameters are always near the function. 

Also, you eliminate an extra variable for each intermediate step.

Now, explicitly naming things is good. Ideally, you want to chain functions that make sense together and encapsulate them logically.

In [None]:
def get_top_5_composers(df):
    df = df.copy()
    
    no_intervals = df['Interval'].isnull()
    df = subset(df, no_intervals)
    
    work_related_columns = ['ComposerName', 'WorkTitle', 'Movement']
    
    df = (df.filter(work_related_columns)
            .groupby('ComposerName').filter(lambda x: x.shape[0] > 10)
            .drop(columns='Movement')
            .drop_duplicates()
            .groupby('ComposerName').size()
            .nlargest())
    return df

top_5_composers = get_top_5_composers(works)
top_5_composers

Another drawback to excessively long chains is that debugging is harder, as there are no intermediate values to inspect.

**Segregate your code so as to avoid long chains and keep together only what belongs together.**

(In case of doubt, read [The Zen of Python](https://www.python.org/dev/peps/pep-0020/) out loud ten times.)

### 1.2.5 Custom methods and pipes

Now, for the final trick.

The function `subset()` has the exact signature we want, again:
* It receives a dataframe
* It returns a transformed dataframe.

What if Pandas had a way to include such functions in pipelines? Meet `df.pipe()`!

The method `df.pipe()` allows us to include user-defined functions in method chains (aka pipelines).

It works like this.

In [None]:
(works.pipe(subset, no_intervals)
      .filter(items=work_related_columns)
      .groupby('ComposerName').filter(lambda x: x.shape[0] > 10)
      .drop(columns='Movement')
      .drop_duplicates()
      .groupby('ComposerName').size()
      .nlargest()
      .plot(kind='bar'));

So, this should work.

In [None]:
def get_top_5_composers(df):
    no_intervals = df['Interval'].isnull()
    work_related_columns = ['ComposerName', 'WorkTitle', 'Movement']
    
    df = df.copy()
    df = (df.pipe(subset, no_intervals)
            .filter(items=work_related_columns)
            .groupby('ComposerName').filter(lambda x: x.shape[0] > 10)
            .drop(columns='Movement')
            .drop_duplicates()
            .groupby('ComposerName').size()
            .nlargest())
    
    return df

top_5_composers = get_top_5_composers(works)
top_5_composers

And it does!
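Since `df.pipe()` forwards any extra positional and keyword arguments to the function, it can also help with the debugging drawback mentioned earlier. The sketch below uses a hypothetical pass-through helper, `log_shape()`, on made-up data; it is one possible approach, not an established Pandas idiom.

```python
import pandas as pd

def subset(df, mask):
    return df[mask]

def log_shape(df, label):
    # Hypothetical helper: prints the shape and passes the data through unchanged.
    print(f'{label}: {df.shape}')
    return df

# Toy data, invented for this example.
toy = pd.DataFrame({
    'ComposerName': ['Glass', 'Glass', 'Bach', 'Bach', 'Bach'],
    'Interval': [None, 'Intermission', None, None, None],
})

result = (toy.pipe(subset, toy['Interval'].isnull())
             .pipe(log_shape, label='after subset')
             .groupby('ComposerName').size()
             .nlargest(1))
```

Because `log_shape()` returns its input untouched, it can be dropped into a chain (or removed from it) without changing the result.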

Now, we have all the tools we need to build robust data transformation pipelines in Pandas.

In the next Notebook, you will learn how to combine dataframes.