# BLU02 - Learning Notebook - Part 1 of 3 - Transformations

In [1]:
import numpy as np
import pandas as pd
import os

from sklearn.datasets import load_iris

A typical data science workflow goes as follows: you get data from a source, you clean it, and then you iterate your model on it.

![data_transformation_workflow](./media/data_processing_workflow.png)

In the previous BLU, we focused on getting and cleaning data (the blue boxes in the image above). 

At this point, we have a dataset and have performed the necessary data cleaning. Our data is tidy (observations as rows and features as columns) and is contained in one or more tables.

Now, to explore, visualize and model the data, we might need to transform the data - remove parts of the data or perform calculations on existing columns to create additional features. We might also want to organize the tables in a different way.

In this first part, we'll go into dataset transformations.

Next we'll learn how to combine the dataframes to organize the data more conveniently and use the relationships between the tables.

Finally, we'll move to scikit-learn to build efficient transformation pipelines.

## 1. About the data

The New York Philharmonic played its first concert on December 7, 1842.

The data documents all known concerts, amounting to more than 20,000 performances. Some considerations:
* The Program is the top-most level element in the dataset
* A Program is defined as performances in which the repertoire, conductors, and soloists are the same
* A Program is associated with an Orchestra (e.g., New York Philharmonic) and a Season (e.g., 1842-43)
* A Program may have multiple Concerts with different dates, times and locations
* A Program's repertoire may contain various Works (e.g., two different symphonies by Beethoven)
* A Work can have multiple Soloists (e.g., Mahler on the harpsichord, Strauss or Bernstein on the piano).

**For more information about the dataset, including the data dictionary, please head to the README.**

In this unit, we will be using the Works and Concerts tables, imported as follows.

In [2]:
works = pd.read_csv('./data/works.csv')
concerts = pd.read_csv('./data/concerts.csv')

In [3]:
works.head()

Unnamed: 0,GUID,ProgramID,WorkID,MovementID,ComposerName,WorkTitle,Movement,ConductorName,Interval,isInterval
0,38e072a7-8fc9-4f9a-8eac-3957905c0002,3853,52446,,"Beethoven, Ludwig van","SYMPHONY NO. 5 IN C MINOR, OP.67",,"Hill, Ureli Corelli",,False
1,c7b2b95c-5e0b-431c-a340-5b37fc860b34,5178,52437,,"Beethoven, Ludwig van","SYMPHONY NO. 3 IN E FLAT MAJOR, OP. 55 (EROICA)",,"Hill, Ureli Corelli",,False
2,894e1a52-1ae5-4fa7-aec0-b99997555a37,10785,52364,1.0,"Beethoven, Ludwig van","EGMONT, OP.84",Overture,"Hill, Ureli Corelli",,False
3,34ec2c2b-3297-4716-9831-b538310462b7,5887,52434,,"Beethoven, Ludwig van","SYMPHONY NO. 2 IN D MAJOR, OP.36",,"Boucher, Alfred",,False
4,610a4acc-94e4-4cd6-bdc1-8ad020edc7e9,305,52453,,"Beethoven, Ludwig van","SYMPHONY NO. 7 IN A MAJOR, OP.92",,"Hill, Ureli Corelli",,False


In [4]:
concerts.head()

Unnamed: 0,GUID,ProgramID,ConcertID,EventType,Location,Venue,Date,Time
0,38e072a7-8fc9-4f9a-8eac-3957905c0002,3853,0,Subscription Season,"Manhattan, NY",Apollo Rooms,1842-12-07T05:00:00+00:00,8:00PM
1,c7b2b95c-5e0b-431c-a340-5b37fc860b34,5178,0,Subscription Season,"Manhattan, NY",Apollo Rooms,1843-02-18T05:00:00+00:00,8:00PM
2,894e1a52-1ae5-4fa7-aec0-b99997555a37,10785,0,Special,"Manhattan, NY",Apollo Rooms,1843-04-07T05:00:00+00:00,8:00PM
3,34ec2c2b-3297-4716-9831-b538310462b7,5887,0,Subscription Season,"Manhattan, NY",Apollo Rooms,1843-04-22T05:00:00+00:00,8:00PM
4,610a4acc-94e4-4cd6-bdc1-8ad020edc7e9,305,0,Subscription Season,"Manhattan, NY",Apollo Rooms,1843-11-18T05:00:00+00:00,


## 2. Functions as data transformations

Most data transformations operate on dataframes: they receive a local dataframe, transform it and return a new one.

In its simplest form, this is the signature of a generic data transformer.

In [5]:
def data_transformer(df):
    df = df.copy()
    # df = ...
    return df

Such transformations have no side effects and operate as functions on immutable data (i.e., they keep the original dataframe unchanged).

Since the output depends only on the arguments, calling the transformations with the same arguments always produces the same result.

For example:

In [6]:
def rename_column(df, new_name, old_name):
    df = df.copy()
    df[new_name] = df[old_name]
    df = df.drop(columns=old_name)
    return df

def test_dataframe():
    data = np.random.randn(6, 4)
    columns = ['A', 'B', 'C', 'D']
    return pd.DataFrame(data=data, columns=columns)

df = test_dataframe()

rename_column(df, 'Z', 'A')

Unnamed: 0,B,C,D,Z
0,1.031843,-1.534005,-0.290498,-2.154591
1,-1.258981,-1.101263,0.255491,1.996724
2,-1.059374,-1.053498,0.249926,1.501269
3,1.310611,-1.05488,1.205724,1.76855
4,-1.443973,-0.18372,-0.748904,0.493956
5,-0.131269,0.370565,1.699878,0.766185


Now let's repeat the transformations:

In [7]:
rename_column(df, 'Z', 'A')

Unnamed: 0,B,C,D,Z
0,1.031843,-1.534005,-0.290498,-2.154591
1,-1.258981,-1.101263,0.255491,1.996724
2,-1.059374,-1.053498,0.249926,1.501269
3,1.310611,-1.05488,1.205724,1.76855
4,-1.443973,-0.18372,-0.748904,0.493956
5,-0.131269,0.370565,1.699878,0.766185


The same result, see! The program (or Notebook) remembers nothing but the original data and the function itself: a blank canvas!

What about the original dataframe?

In [8]:
df

Unnamed: 0,A,B,C,D
0,-2.154591,1.031843,-1.534005,-0.290498
1,1.996724,-1.258981,-1.101263,0.255491
2,1.501269,-1.059374,-1.053498,0.249926
3,1.76855,1.310611,-1.05488,1.205724
4,0.493956,-1.443973,-0.18372,-0.748904
5,0.766185,-0.131269,0.370565,1.699878


After each call, the *state* of the program (or Notebook) is the same as it was before (no new objects, no changes, no nothing!), as if nothing had happened.

This property is valid for as long as we don't explicitly overwrite the original dataframe outside the function, using an assignment.

In [9]:
df = test_dataframe()
df = rename_column(df, 'Z', 'A')

try:
    rename_column(df, 'Z', 'A')
except KeyError:
    print("For some reason this doesn't work. Why is that?")

For some reason this doesn't work. Why is that?


Mutable data is dangerous because it makes programs unpredictable. And this is why you should avoid modifying objects after creation.

This pitfall is common in Notebooks, especially when you re-run cells, run them in a different order or restart the Kernel. (Am I right?)
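
To make the pitfall concrete, here is a minimal sketch that contrasts an in-place rename with a copying one. The names `rename_column_inplace` and `rename_column_pure` and the toy dataframe are hypothetical, introduced only for this illustration.

```python
import pandas as pd


def rename_column_inplace(df, new_name, old_name):
    # Hypothetical "bad" version: mutates the caller's dataframe.
    df[new_name] = df[old_name]
    del df[old_name]
    return df


def rename_column_pure(df, new_name, old_name):
    # Works on a copy, so the caller's dataframe is left untouched.
    df = df.copy()
    df[new_name] = df[old_name]
    return df.drop(columns=old_name)


original = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# The pure version can be re-run any number of times.
first = rename_column_pure(original, 'Z', 'A')
second = rename_column_pure(original, 'Z', 'A')
pure_is_repeatable = first.equals(second)

# The in-place version succeeds once, then fails because 'A' is gone.
scratch = original.copy()
rename_column_inplace(scratch, 'Z', 'A')
try:
    rename_column_inplace(scratch, 'Z', 'A')
    second_call_failed = False
except KeyError:
    second_call_failed = True
```

Re-running a cell that calls the in-place version is exactly the situation where a Notebook starts to behave differently from one run to the next.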

**Data transformation is a *pipeline***

Another problem is that data transformation is about applying multiple, sequential changes to the data (i.e., a multistep process).

![data_transformation_pipeline](./media/data_transformation_pipeline.png)

And once we realize this, how do we go about it?

In [10]:
df = test_dataframe()

df_renamed = rename_column(df, 'Z', 'A')
# Code happens. Ideas are tested, hours go by.
df_renamed_without_b = df_renamed.drop(columns='B')
# More code happens. We keep on testing ideas, days go by.
df_renamed_without_b_positive = df_renamed_without_b[df_renamed_without_b > 0]
# There's a lot of code. Ideas come and go, we've been doing this for a week.
df_renamed_without_b_positive_no_nans = df_renamed_without_b_positive.dropna(how='all')
# Can we honestly trace back how to get from df to here? Probably not.
df_renamed_without_b_positive_no_nans

Unnamed: 0,C,D,Z
0,0.823425,0.425429,0.802966
1,1.246824,,
2,0.842495,,1.64894
3,2.86559,1.165503,
5,0.165768,,


This is clearly not the best way to do it. Using functions instead, we can concisely encapsulate everything. 

(Also, we spend less time naming things, unless we want to.)

In [11]:
def data_transformer(df, how_to_dropna):
    df = df.copy()
    df = rename_column(df, 'Z', 'A')
    df = df.drop(columns='B')
    df = df[df > 0]
    df = df.dropna(how=how_to_dropna)
    return df

data_transformer(df, how_to_dropna='all')

Unnamed: 0,C,D,Z
0,0.823425,0.425429,0.802966
1,1.246824,,
2,0.842495,,1.64894
3,2.86559,1.165503,
5,0.165768,,


This is better, but the function is not well designed: the name is not explicit, it performs multiple processing steps, and there are no apparent blocks of logic.

Functions should organize and document our codebase (*what* you are doing and how). Ideally, each function should perform just one thing and its name should indicate what it does (one of the principles of clean code).

Using functions, immutable data, and avoiding side effects is a smart choice to manage complexity and keep things understandable.

It would be better to structure our functions like this.

In [12]:
def preprocess_data(df):
    df = df.copy()
    # df = rename_misspelled_columns(df)
    # df = drop_unnecessary_columns(df)
    # df = keep_only_positive_values(df)
    # df = remove_missing_values(df)
    return df
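
One possible way to fill in the commented-out steps is sketched below. The column choices (`'A'` as the misspelled name, `'B'` as the unnecessary column) and the toy data are assumptions that mirror the earlier `data_transformer()` example, not a prescribed design.

```python
import pandas as pd


def rename_misspelled_columns(df):
    # Assumption: 'A' is the misspelled name and 'Z' the intended one.
    return df.rename(columns={'A': 'Z'})


def drop_unnecessary_columns(df):
    # Assumption: 'B' is the column we do not need.
    return df.drop(columns='B')


def keep_only_positive_values(df):
    # Non-positive entries become NaN; the shape is preserved.
    return df[df > 0]


def remove_missing_values(df, how='all'):
    # Drop rows where all (or any) values are missing.
    return df.dropna(how=how)


def preprocess_data(df):
    df = df.copy()
    df = rename_misspelled_columns(df)
    df = drop_unnecessary_columns(df)
    df = keep_only_positive_values(df)
    df = remove_missing_values(df)
    return df


raw = pd.DataFrame({'A': [1.0, -1.0], 'B': [5.0, 6.0],
                    'C': [2.0, -2.0], 'D': [-3.0, -4.0]})
clean = preprocess_data(raw)
```

Each helper does one thing, and `preprocess_data()` reads like a table of contents for the pipeline.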

## 3. Data transformations in Pandas

Luckily, pandas provides convenient methods for most data transformation tasks, with a unified syntax and consistent interfaces.

For example, we don't need to create a `rename_column()` function, since Pandas already provides a `df.rename()` method for us.

In [13]:
df = test_dataframe()

df.rename({'A': 'Z'}, axis=1)

Unnamed: 0,Z,B,C,D
0,-0.768736,-1.208567,-0.537524,-0.132812
1,0.883555,-2.00562,-1.347699,0.189216
2,0.953356,0.343798,0.959446,0.481693
3,-0.193715,0.830766,-1.229731,-0.097751
4,0.415734,1.559572,-0.752685,1.60764
5,-0.538279,-0.500252,0.920608,0.210399


As a recap: `df.rename()` follows our transformer signature:
* It takes a dataframe as input 
* And returns a new one as output.

This predictable input/output is what we mean by consistent interfaces! 
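
As a small sketch of that interface, the snippet below renames columns on a toy dataframe (`df_example` is a hypothetical name) and shows two other ways to call `df.rename()`: the `columns=` keyword and a callable applied to every label.

```python
import pandas as pd

df_example = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# columns= is an equivalent, more explicit spelling of axis=1.
renamed = df_example.rename(columns={'A': 'Z'})

# A callable is applied to every column label.
lowercased = df_example.rename(columns=str.lower)

# The input dataframe keeps its original labels.
original_labels = list(df_example.columns)
```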

It seems like a very promising way to build multistep pipelines. What transformations can we perform this way?

### 3.1 Subsetting columns or the index

Pandas implements this functionality, somewhat counterintuitively, as `df.filter()`.

Let's use it on the `works` dataframe. Imagine that we want only the columns related to the work itself, excluding IDs.

In [14]:
work_related_columns = ['ComposerName', 'WorkTitle', 'Movement']
# Select columns by name.
works.filter(items=work_related_columns).head()

Unnamed: 0,ComposerName,WorkTitle,Movement
0,"Beethoven, Ludwig van","SYMPHONY NO. 5 IN C MINOR, OP.67",
1,"Beethoven, Ludwig van","SYMPHONY NO. 3 IN E FLAT MAJOR, OP. 55 (EROICA)",
2,"Beethoven, Ludwig van","EGMONT, OP.84",Overture
3,"Beethoven, Ludwig van","SYMPHONY NO. 2 IN D MAJOR, OP.36",
4,"Beethoven, Ludwig van","SYMPHONY NO. 7 IN A MAJOR, OP.92",


We can also use it to subset our dataframe based on the index.

In [15]:
# Select rows containing 'Glass' in the index.
works.set_index('ComposerName').filter(like='Glass', axis=0).reset_index().head()

Unnamed: 0,ComposerName,GUID,ProgramID,WorkID,MovementID,WorkTitle,Movement,ConductorName,Interval,isInterval
0,"Glass, Philip",cf230066-2cd2-4093-8b78-b91b8dda3cbf,11639,5729,,KOYAANISQATSI,,"Riesman, Michael",,False
1,"Glass, Philip",adf68bf5-db9d-4b24-aac6-c9b7c398cf06,14047,12401,,"""FATHER DEATH BLUES"" FROM HYDROGEN JUKEBOX",,"Sainte-Agathe, Valérie",,False
2,"Glass, Philip",990a8f66-cd5f-466e-b3b6-800baa6c0b47-0.1,14189,12547,,"QUARTET, STRING, NO. 3 (MISHIMA)",,,,False
3,"Glass, Philip",3d50968c-2e8b-405f-969c-ba36f941f393,14189,12547,,"QUARTET, STRING, NO. 3 (MISHIMA)",,,,False
4,"Glass, Philip",fb8e7125-7ef9-492e-a820-914467475701,14025,12327,,SARABANDE IN COMMON TIME (SOLO VIOLIN),,,,False
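
Note that `df.filter()` matches on labels (column names or index values), never on the values inside the cells. The sketch below uses a small made-up dataframe to show the `like=` and `regex=` options alongside `axis=0`.

```python
import pandas as pd

toy = pd.DataFrame(
    {'WorkID': [1, 2], 'WorkTitle': ['X', 'Y'], 'ProgramID': [10, 20]},
    index=['Glass, Philip', 'Bach, Johann Sebastian'],
)

# like= keeps labels containing the substring (column labels by default).
work_columns = toy.filter(like='Work')

# regex= keeps labels matching the pattern; here, columns ending in 'ID'.
id_columns = toy.filter(regex='ID$')

# axis=0 applies the same matching to the index labels instead.
bach_rows = toy.filter(like='Bach', axis=0)
```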


### 3.2 Group By

This is a very extensive topic, and we'll just touch the surface here so that you know that it exists and can explore it further on your own.

In case you've worked with SQL before, you'll find this very familiar :)

So, in pandas there is a process of three chained steps called **split-apply-combine**:

- split: splitting the DataFrame into groups (this is the group by) based on column values
- apply: apply a function to each group (aggregation, transformation, or filtration)
- combine: create a DataFrame with the results

The function applied in the apply step may be:
* Aggregation like sum, count, mean
* Transformation like filling missing values
* Filtration like discarding data from underrepresented groups.

We will look into each one of them in the sections below. We will be working on the `concerts` data.
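
Before using the built-in methods, it can help to see the three steps spelled out by hand. This is only an illustrative sketch on made-up data; in practice pandas performs all three steps for us.

```python
import pandas as pd

toy = pd.DataFrame({
    'ProgramID': [1, 1, 2, 3, 3, 3],
    'ConcertID': [0, 1, 0, 0, 1, 2],
})

# Split: iterating over a groupby yields (key, sub-dataframe) pairs.
# Apply: compute one number per group, here the row count.
counts = {key: len(group) for key, group in toy.groupby('ProgramID')}

# Combine: assemble the per-group results into a single Series.
manual = pd.Series(counts, name='n_concerts')
manual.index.name = 'ProgramID'

# The built-in aggregation does all three steps at once.
built_in = toy.groupby('ProgramID').size()
```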

#### 3.2.1 Aggregation

We want to identify the most popular programs, that is, the ones with the highest number of performances.

The first step is to group our data by `ProgramID`. This returns a [DataFrameGroupBy](https://pandas.pydata.org/docs/reference/groupby.html) object that by itself doesn't tell us much.

In [16]:
concerts_grouped_by_ProgramID = concerts.groupby('ProgramID')
concerts_grouped_by_ProgramID

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fba95986c00>

However, we can use the `groups` property of the `DataFrameGroupBy` object to inspect the groups. It is a dictionary where the keys are the `ProgramID` values and the values are the row index labels of the concerts that belong to each group.

In [17]:
concerts_grouped_by_ProgramID.groups

{1: [474, 14097], 2: [1578], 3: [1579], 4: [1580, 14382], 5: [10075], 6: [1581], 7: [10076, 16815, 19224, 20893], 8: [10066], 9: [9471], 10: [7565], 11: [1582], 12: [10077, 16816, 19225, 20894], 13: [10058], 14: [9473], 15: [7566], 16: [154], 17: [3797, 14812], 18: [1583], 19: [10065], 20: [10078, 16817, 19226], 21: [9475], 22: [7567, 15738], 24: [1584], 25: [10063], 26: [10433], 27: [7568], 28: [1585, 14383], 29: [3799, 14813], 30: [10434], 31: [8604, 16117, 18674], 32: [7570, 15739], 33: [3800, 14814], 34: [4466], 35: [1589], 36: [10244], 37: [10435, 16978, 19355, 20978], 38: [8605], 39: [7571, 15740], 40: [4467, 14985, 18285], 41: [3801, 14815], 42: [334], 43: [1164], 44: [10436, 16979, 19356, 20979], 45: [8606, 16118], 46: [7572, 15741], 47: [19], 48: [4470, 14986, 18286], 49: [3803, 14816], 50: [1165, 14308], 51: [9549], 52: [10437, 16980, 19357], 53: [7573], 54: [1168], 55: [4472, 14987], 56: [3805, 14817], 57: [5101, 15174], 58: [9331, 16496, 18968, 20712], 59: [10439, 16981, 19

Now, if we want to know the number of performances per program we can simply call `DataFrameGroupBy.size()`. It shows us the number of elements per group.

In [18]:
concerts_grouped_by_ProgramID.size()

ProgramID
1        2
2        1
3        1
4        2
5        1
        ..
14191    2
14192    1
14193    1
14194    1
14195    1
Length: 13932, dtype: int64

We get the same result by chaining all the operations together:

In [19]:
concerts.groupby('ProgramID').size()

ProgramID
1        2
2        1
3        1
4        2
5        1
        ..
14191    2
14192    1
14193    1
14194    1
14195    1
Length: 13932, dtype: int64

With the operation above we obtain a Series with the number of performances per `ProgramID`. To find the most popular programs, we can use the `.nlargest()` method, which returns the n largest elements, with n defaulting to 5.

In [20]:
concerts.groupby('ProgramID').size().nlargest()

ProgramID
3128     16
3139     16
10700    16
10702    16
3134     12
dtype: int64

There are a number of other aggregation methods that we can use after `groupby`:
* `mean()`
* `sum()`
* `size()`
* `count()`
* `std()`
* `var()`
* `sem()`
* `describe()`
* `first()`
* `last()`
* `nth()`
* `min()`
* `max()`.

There is also the general method `aggregate()` where we can specify our own function. Check out the [docs](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html).
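
As a minimal sketch of `aggregate()` (via its `agg` alias), the snippet below mixes a built-in aggregation with a user-defined one, using keyword arguments to name the output columns. The function `value_range` and the toy data are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'ProgramID': [1, 1, 2, 3, 3, 3],
    'ConcertID': [0, 1, 0, 0, 1, 2],
})


def value_range(series):
    # Custom aggregation: difference between largest and smallest value.
    return series.max() - series.min()


summary = toy.groupby('ProgramID')['ConcertID'].agg(
    n_concerts='count',      # built-in aggregation, referenced by name
    id_range=value_range,    # user-defined function
)
```

The result is a dataframe with one row per `ProgramID` and one column per named aggregation.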

Let's now imagine that we want to know when each program was performed for the first time.
We start by grouping the shows by `ProgramID` and then we take the `min` of the `Date` column for each group. 

In [21]:
concerts.groupby('ProgramID').Date.min()

ProgramID
1        1897-02-05T05:00:00+00:00
2        1916-12-03T05:00:00+00:00
3        1916-12-06T05:00:00+00:00
4        1916-12-07T05:00:00+00:00
5        1983-09-14T04:00:00+00:00
                   ...            
14191    2016-12-11T05:00:00+00:00
14192    2017-01-15T05:00:00+00:00
14193    2017-02-03T05:00:00+00:00
14194    2017-03-13T04:00:00+00:00
14195    2017-04-18T04:00:00+00:00
Name: Date, Length: 13932, dtype: object

And what if we want to know when the last performance was?

In [22]:
concerts.groupby('ProgramID').Date.max()

ProgramID
1        1897-02-06T05:00:00+00:00
2        1916-12-03T05:00:00+00:00
3        1916-12-06T05:00:00+00:00
4        1916-12-08T05:00:00+00:00
5        1983-09-14T04:00:00+00:00
                   ...            
14191    2016-12-11T05:00:00+00:00
14192    2017-01-15T05:00:00+00:00
14193    2017-02-03T05:00:00+00:00
14194    2017-03-13T04:00:00+00:00
14195    2017-04-18T04:00:00+00:00
Name: Date, Length: 13932, dtype: object
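
Note that the outputs above have `dtype: object`: the dates are still strings, so `min()` and `max()` compare them as text. That works here because the values shown are uniformly formatted ISO 8601 strings, which sort chronologically. A hedged sketch of a more robust variant, on a small sample copied from the outputs above, parses the dates first and computes both ends in one call:

```python
import pandas as pd

toy = pd.DataFrame({
    'ProgramID': [1, 1, 4, 4],
    'Date': [
        '1897-02-05T05:00:00+00:00',
        '1897-02-06T05:00:00+00:00',
        '1916-12-07T05:00:00+00:00',
        '1916-12-08T05:00:00+00:00',
    ],
})

# Parse the ISO 8601 strings into timezone-aware timestamps.
toy['Date'] = pd.to_datetime(toy['Date'], utc=True)

# Compute first and last performance per program in a single call.
span = toy.groupby('ProgramID')['Date'].agg(['min', 'max'])

# With real datetimes we can also do arithmetic on the result.
span['days_between'] = (span['max'] - span['min']).dt.days
```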

#### 3.2.2 Transformation

The inherent difference between aggregation and transformation is that the latter returns an object the same size as the original input.

The transformation is done with the [transform](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html) method.

We don't have a good example in our `works` dataset, so we'll use the [Iris](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html) dataset to exemplify a possible use case. The dataset has three classes.

In [23]:
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['Target'] = iris.target
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


Now imagine that we want to standardize the data, but we want to do so separately for each of the three classes:
* We take the mean and the standard deviation *inside each class*
* Then we standardize each class accordingly.

We implement it as a lambda function that will be applied to each class separately.

In [24]:
zscore = lambda x: (x - x.mean()) / x.std()

df_transformed = df.groupby('Target').transform(zscore)
df_transformed.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,0.266674,0.189941,-0.357011,-0.436492
1,-0.300718,-1.129096,-0.357011,-0.436492
2,-0.868111,-0.601481,-0.932836,-0.436492
3,-1.151807,-0.865288,0.218813,-0.436492
4,-0.017022,0.453749,-0.357011,-0.436492


The downside is that the returned dataframe no longer has the `Target` column, but we can easily paste it back because the order of the rows is not affected.

In [25]:
df_transformed['Target'] = df['Target']
df_transformed.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Target
0,0.266674,0.189941,-0.357011,-0.436492,0
1,-0.300718,-1.129096,-0.357011,-0.436492,0
2,-0.868111,-0.601481,-0.932836,-0.436492,0
3,-1.151807,-0.865288,0.218813,-0.436492,0
4,-0.017022,0.453749,-0.357011,-0.436492,0


Let's check that the standardization worked: we compute the mean and standard deviation for each class using the aggregate method (we can use the `agg` alias).

In [26]:
df_transformed.groupby(df['Target']).agg(['mean', 'std'])

Unnamed: 0_level_0,sepal length (cm),sepal length (cm),sepal width (cm),sepal width (cm),petal length (cm),petal length (cm),petal width (cm),petal width (cm),Target,Target
Unnamed: 0_level_1,mean,std,mean,std,mean,std,mean,std,mean,std
Target,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
0,-6.917383e-16,1.0,1.576517e-16,1.0,-1.159073e-15,1.0,3.796963e-16,1.0,0.0,0.0
1,5.4400930000000005e-17,1.0,-1.513234e-15,1.0,4.52971e-16,1.0,8.032464e-16,1.0,1.0,0.0
2,2.753353e-15,1.0,-6.178391e-16,1.0,-9.880985e-16,1.0,-9.547918e-16,1.0,2.0,0.0


We did it! Zero mean and standard deviation equal to one *inside each class*.

Now, if this use case doesn't seem useful, think about replacing missing values with the group mean for different populations, for example the mean height for females and males. Not bad, right?
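
The missing-value idea can be sketched as follows. The `people` dataframe and its numbers are invented purely for illustration.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'sex': ['F', 'F', 'F', 'M', 'M', 'M'],
    'height': [160.0, np.nan, 170.0, 175.0, 185.0, np.nan],
})

# For each group, replace missing heights with that group's mean.
people['height_filled'] = (
    people.groupby('sex')['height']
          .transform(lambda s: s.fillna(s.mean()))
)
```

Because `transform` returns an object aligned with the original rows, the result can be assigned straight back as a new column.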

#### 3.2.3 Filtration

The `DataFrameGroupBy.filter()` method provides a convenient way to filter out elements that belong to underrepresented groups. Here we keep programs that were played more than 15 times.

In [27]:
concerts.groupby('ProgramID').filter(lambda x: x.shape[0] > 15).head()

Unnamed: 0,GUID,ProgramID,ConcertID,EventType,Location,Venue,Date,Time
6608,8ad0bfa4-09b9-4b18-889b-d0c426410cbb,3128,0,Special,"Manhattan, NY",Roxy Theatre,1950-09-01T04:00:00+00:00,12:00PM
6610,1e1114aa-7152-4305-a357-7aac149b8599,3139,0,Special,"Manhattan, NY",Roxy Theatre,1950-09-08T04:00:00+00:00,
6689,b37d1833-3252-41a6-9f3e-fbd596b215b0,10700,0,Special,"Manhattan, NY",Roxy Theatre,1951-05-09T04:00:00+00:00,12:40PM
6691,35fda061-f4c4-423c-a8ab-feb792caee34,10702,0,Special,"Manhattan, NY",Roxy Theatre,1951-05-16T04:00:00+00:00,
15517,8ad0bfa4-09b9-4b18-889b-d0c426410cbb,3128,1,Special,"Manhattan, NY",Roxy Theatre,1950-09-01T04:00:00+00:00,
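
To see what the group filter does on a small scale, the sketch below applies it to made-up data and compares it with a boolean mask built from the per-row group size. The mask version is offered only as an alternative formulation, not as the approach this unit relies on.

```python
import pandas as pd

toy = pd.DataFrame({
    'ProgramID': [1, 1, 2, 3, 3, 3],
    'ConcertID': [0, 1, 0, 0, 1, 2],
})

# filter() calls the function once per group and keeps every row
# of the groups for which it returns True.
kept = toy.groupby('ProgramID').filter(lambda g: len(g) > 1)

# Alternative: broadcast each group's size to its rows, then mask.
group_sizes = toy.groupby('ProgramID')['ConcertID'].transform('size')
kept_via_mask = toy[group_sizes > 1]
```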


### 3.3 Method chaining

Now we know about some of the most common individual transformations. But how can we combine them? Usually we want to perform a bunch of transformations in sequence, or *chained*.

This chaining means that each transformation returns an object that will be consumed by the next one, and so on, in a pre-defined order.

We could do it in the not-so-great way shown above, using variables for partial outputs.

First, we define a function that we want to use in the transformation sequence:

In [28]:
def subset(df, mask):
    return df.loc[mask]

Now comes the transformation sequence using partial outputs. The goal of the transformations is to get composers with the most works.

In [29]:
mask = works['Interval'].isnull()
df_no_intervals = subset(works,mask)

df_exclude_minor_composers = df_no_intervals.groupby('ComposerName').filter(lambda x: x.shape[0] > 10)

df_work_related = df_exclude_minor_composers.filter(items=work_related_columns)
df_work_related_no_movement = df_work_related.drop(columns='Movement')
df_work_related_no_movement_unique = df_work_related_no_movement.drop_duplicates()

works_per_composer = df_work_related_no_movement_unique.groupby('ComposerName').size()
works_per_composer_sorted = works_per_composer.nlargest()
works_per_composer_sorted

ComposerName
Traditional,                  640
Bach,  Johann  Sebastian      306
Mozart,  Wolfgang  Amadeus    242
Schubert,  Franz              158
Beethoven,  Ludwig  van       144
dtype: int64

Naming every intermediate result makes each step explicit, which can help when expressing something as confusing as a data pipeline. But this approach has some downsides:
* We need to create an extra variable per intermediate step
* Cognitive burden of naming each variable and keeping them in mind
* They make the code less fluid
* They make it harder to visualize the whole picture of what your program (or Notebook) is doing
* They are error-prone and heavily reliant on the state, which is dangerous as we've seen.

What if there was an alternative?

Method chaining allows invoking multiple method calls chained together in a single statement, each receiving and returning an object.

This syntax is possible with Pandas thanks to methods that:
* Receive a dataframe
* Return a transformed dataframe.

Let's see our transformation sequence with method chaining (notice that we need parentheses for it to work on multiple lines):

In [30]:
no_intervals = works['Interval'].isnull()
df_no_intervals = subset(works, no_intervals)

df_work_related = df_no_intervals.filter(items=work_related_columns)

(df_work_related.groupby('ComposerName').filter(lambda x: x.shape[0] > 10)
                .drop(columns='Movement')
                .drop_duplicates()
                .groupby('ComposerName').size()
                .nlargest())

ComposerName
Traditional,                  640
Bach,  Johann  Sebastian      306
Mozart,  Wolfgang  Amadeus    242
Schubert,  Franz              158
Beethoven,  Ludwig  van       144
dtype: int64

Code flows from top to bottom, and the function parameters are always near the function. 

Also, you eliminate an extra variable for each intermediate step.

Now, explicitly naming things is good. Ideally, you want to chain functions that make sense together and encapsulate them into a function.

In [31]:
def get_top_5_composers(df):
    df = df.copy()
    
    no_intervals = df['Interval'].isnull()
    df = subset(df, no_intervals)
    
    work_related_columns = ['ComposerName', 'WorkTitle', 'Movement']
    
    df = (df.filter(items=work_related_columns)
            .groupby('ComposerName').filter(lambda x: x.shape[0] > 10)
            .drop(columns='Movement')
            .drop_duplicates()
            .groupby('ComposerName').size()
            .nlargest())
    return df

top_5_composers = get_top_5_composers(works)
top_5_composers

ComposerName
Traditional,                  640
Bach,  Johann  Sebastian      306
Mozart,  Wolfgang  Amadeus    242
Schubert,  Franz              158
Beethoven,  Ludwig  van       144
dtype: int64

A drawback to excessively long chains is that debugging is harder, as there are no intermediate values to inspect.

**Segregate your code to avoid long chains and keep together only what belongs together.**

(In case of doubt, read [The Zen of Python](https://www.python.org/dev/peps/pep-0020/) out loud ten times.)

### 3.4 Custom methods and pipes

Now, for the final trick.

The function `subset()` cannot be included in the chaining as is. However, it has the exact signature we want, again:
* It receives a dataframe
* It returns a transformed dataframe.

What if pandas had a way to include such functions in pipelines? Meet [df.pipe()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pipe.html)!

The method `df.pipe()` allows us to include user-defined functions in method chains.

It works like this.

In [32]:
(works.pipe(subset, no_intervals)
      .filter(items=work_related_columns)
      .groupby('ComposerName').filter(lambda x: x.shape[0] > 10)
      .drop(columns='Movement')
      .drop_duplicates()
      .groupby('ComposerName').size()
      .nlargest())

ComposerName
Traditional,                  640
Bach,  Johann  Sebastian      306
Mozart,  Wolfgang  Amadeus    242
Schubert,  Franz              158
Beethoven,  Ludwig  van       144
dtype: int64

We can use this new improved chain in our function.

In [33]:
def get_top_5_composers(df):
    no_intervals = df['Interval'].isnull()
    work_related_columns = ['ComposerName', 'WorkTitle', 'Movement']
    
    df = df.copy()
    df = (df.pipe(subset, no_intervals)
            .filter(items=work_related_columns)
            .groupby('ComposerName').filter(lambda x: x.shape[0] > 10)
            .drop(columns='Movement')
            .drop_duplicates()
            .groupby('ComposerName').size()
            .nlargest())
    
    return df

top_5_composers = get_top_5_composers(works)
top_5_composers

ComposerName
Traditional,                  640
Bach,  Johann  Sebastian      306
Mozart,  Wolfgang  Amadeus    242
Schubert,  Franz              158
Beethoven,  Ludwig  van       144
dtype: int64

And it works!

Now, we have all the tools we need to build robust data transformation pipelines in Pandas.
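
One way to soften the debugging drawback of long chains mentioned earlier is to pipe the data through a small pass-through function that records what it sees. The sketch below is only an illustration: `log_shape` is a hypothetical helper, and the toy data is invented.

```python
import pandas as pd


def log_shape(df, label, log):
    # Hypothetical helper: record the shape, then pass the data on unchanged.
    log.append((label, df.shape))
    return df


def subset(df, mask):
    return df.loc[mask]


toy = pd.DataFrame({
    'ComposerName': ['Bach', 'Bach', 'Glass', 'Glass'],
    'WorkTitle': ['Mass', 'Mass', 'Koyaanisqatsi', None],
    'Interval': [None, None, None, 'Intermission'],
})

shapes = []
result = (toy.pipe(subset, toy['Interval'].isnull())
             .pipe(log_shape, 'after subset', log=shapes)
             .filter(items=['ComposerName', 'WorkTitle'])
             .drop_duplicates()
             .pipe(log_shape, 'after drop_duplicates', log=shapes)
             .groupby('ComposerName').size())
```

Extra positional and keyword arguments given to `df.pipe()` are forwarded to the function, which is how `label` and `log` reach `log_shape`.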

In the next Notebook, you will learn how to combine dataframes.