Skip to content
Reiner Krämer edited this page Aug 4, 2016 · 21 revisions

Introduction

pandas unto itself is a very powerful Python data wrangling library. This documentation focuses on how to use pandas in the best way possible for the VIS-framework.

Greater knowledge of pandas will set you on an expedient path to VIS-ardry. It is recommended that you work through the following paths:

There are a lot of other uses for pandas as well, so it would only enhance your knowledge to read more about pandas at the following sites (especially if the terseness of the pandas documentation is not your "thing"):

Top

What is pandas?

Pandas is a python package designed for easily structuring, displaying and manipulating data. It stores data in table-like structures called Series or DataFrames. The first thing you need to know about VIS is the difference between Series and DataFrames. Series are tables that only contain a single column, while DataFrames contain many Series (or columns). The distinction might not seem that big, but when we get into it more, you'll see that they often function quite differently.

Top

How does VIS use pandas?

We use pandas in VIS to create the output of the indexers. It not only provides a nice way of visualizing said output, but also makes the internal work of the indexers much easier. Another benefit is that we can directly output pandas DataFrames to csv files, which can be viewed in Excel.

Pandas also has a lot of its own features that can be useful for making your analyses more powerful without having to do a lot of your own scripting. Here, we will walk you through some of the most applicable features in this documentation and demonstrate how using pandas directly is easier and faster than many of the other work-arounds.

Top

Terminology

  • DataFrame If you've ever used VIS before, you have probably seen a DataFrame at some point in the process, whether that was printing something in your terminal, or looking at the csv files downloaded from the client. Either way, the indexer output will hopefully not look to foreign to you. This is what the noterest indexer DataFrame looks like:
Indexer noterest.NoteRestIndexer                 
Parts                          0     1     2    3
0                             F4    F3  Rest   F3
8                             F4    F3   NaN   D3
10                            E4    G3   NaN   C3
12                            C4    A3  Rest   F3
15                           NaN   NaN   NaN   E3
16                          Rest    G3   NaN   C3
20                            G4   NaN   NaN  NaN
24                            A4    F3    F3   C4
27                           NaN   NaN   NaN   B3
28                           NaN  Rest   NaN   A3
29                           NaN   NaN   NaN   G3
30                           NaN   NaN   NaN   A3
32                            A4    C4   NaN  NaN
33                           NaN   NaN   NaN   G3
34                           NaN   NaN   NaN   F3
35                           NaN   NaN   NaN   E3
36                            A4    D4    F3   D3
40                            G4    B3    G3   E3
  • Series In the above example, each part of the piece itself is a Series of data. Individually, the first part as a Series would look like this:
0        F4
8        F4
10       E4
12       C4
15      NaN
16     Rest
20       G4
24       A4
27      NaN
28      NaN
29      NaN
30      NaN
32       A4
33      NaN
34      NaN
35      NaN
36       A4
40       G4
  • NaN NaN stands for "Not a Number", or "Not a Note" for our purposes. In either case, it means there is no value for that offset. Instead of leaving cells in the DataFrame blank, pandas fills in data-less spots with this. In the case of VIS, NaN indicates that neither a note nor a rest is being attacked, which means that a note or rest is being held.

  • Index The index is the location of the data, in VIS also known as the offset. This is the column of numbers on the far left of the table. Here it is as a list:

[0.0, 8.0, 10.0, 12.0, 15.0, 16.0, 20.0, 24.0, 27.0, 28.0, 29.0, 30.0, 32.0, 33.0, 34.0, 35.0, 36.0, 40.0]
  • MultiIndex Normally, a DataFrame has an index and column labels. VIS uses something on top of that, called MultiIndexing, with which you can add labels to the column labels. For example, a single-indexed DataFrame would look something like this:
Parts                          0     1     2    3
0                             F4    F3  Rest   F3
8                             F4    F3   NaN   D3

And the MultiIndexed DataFrame looks like this:

Indexer noterest.NoteRestIndexer                 
Parts                          0     1     2    3
0                             F4    F3  Rest   F3
8                             F4    F3   NaN   D3

There is an extra layer of abstraction here, but also additional information that is important for us to see what data we're actually looking at.

These are just some of the basic terms that you will come across later on. As long as you understand the basics now, more will become clear as you learn to use them.

Top

Pandas Basics

Rotation

Since we're looking at music, you might prefer to rotate your pandas DataFrames from their original orientation in order for it to look more like a musical score:

dataframe.T

gives this output:

                                 0    8    10    12   15    16   20  24   27  
Indexer                  Parts                                                  
noterest.NoteRestIndexer 0        F4   F4   E4    C4  NaN  Rest   G4  A4  NaN   
                         1        F3   F3   G3    A3  NaN    G3  NaN  F3  NaN   
                         2      Rest  NaN  NaN  Rest  NaN   NaN  NaN  F3  NaN   
                         3        F3   D3   C3    F3   E3    C3  NaN  C4   B3   

Data as Lists

To get the list of offsets present in the piece, simply get the index of the DataFrame and (optionally) convert it to a list:

dataframe.index.tolist()

You can also do the same thing with the data itself. The values() function will get all the values of the DataFrame or Series, but in different ways. For DataFrames, calling values() on it:

dataframe.values()

will give you a list of lists of the data in each row:

[[u'F4' u'F3' 'Rest' u'F3']
 [u'F4' u'F3' nan u'D3']
 [u'E4' u'G3' nan u'C3']

Calling values() on a series will just give you a plain list of the data:

dataframe['noterest.NoteRestIndexer', '0'].values()

Equality

There are many ways of comparing DataFrames, a lot of them using the methods you saw above, but pandas also has a built in function that checks for equality between two DataFrames. Important: '==' does not work!

dataframe1.equals(dataframe2)

Cropping

If you want to look at only a subset of the full DataFrame, there are a number of ways to approach this. First, there are the head and tail functions. They show the first or last few events, respectively, of the DataFrame. To use this, call the function and pass it the number of events you want to see from the beginning or end:

dataframe.head(3)

gives:

Indexer noterest.NoteRestIndexer              
Parts                          0   1     2   3
0                             F4  F3  Rest  F3
8                             F4  F3   NaN  D3
10                            E4  G3   NaN  C3

or,

dataframe.tail(6)

gives:

Indexer noterest.NoteRestIndexer               
Parts                          0    1    2    3
672                           G4  NaN  NaN  NaN
673                          NaN   F4  NaN   D3
674                           F4   D4  NaN  B-2
676                          NaN   C4  NaN   C3
678                           E4  NaN  NaN  NaN
680                           F4  NaN   F3  NaN

You can also use the pandas location function to slice a range of the dataframe:

dataframe.loc[8:24]

gives us:

Indexer noterest.NoteRestIndexer                
Parts                          0    1     2    3
8                             F4   F3   NaN   D3
10                            E4   G3   NaN   C3
12                            C4   A3  Rest   F3
15                           NaN  NaN   NaN   E3
16                          Rest   G3   NaN   C3
20                            G4  NaN   NaN  NaN
24                            A4   F3    F3   C4

Similarly, you might also want to look at what is happening at a particular offset. For this, you would use the location function in pandas:

dataframe.loc[8]

and if you want to see that offset in a particular part:

dataframe[('noterest.NoteRestIndexer', '2')].loc[8]

Parts

Though splitting up your dataframe horizontally, it might be more useful to split it vertically - by part rather than by measure, for example. To get the Series of a single part, you have to index into both parts of the MultiIndex, meaning you have to include both the indexer name and the part name:

dataframe['noterest.NoterestIndexer', '0']

The above code will give you the entire length of the piece, but only looking at one part. Make sure you have the right indexer name ('noterest.NoteRestIndexer' in this case) and that the part name ('0' in this case) actually exists in the DataFrame.

You might also want to look at all the parts, but one at a time, in which case you can use a for loop:

for part in dataframe:
    dataframe[part]

Concatenation

Just as you can pull DataFrames apart, so can you push them together into one, or concatenate them. The simple way of calling this function:

pandas.concat([dataframe1, dataframe2])

returns a DataFrame that contains 2 DataFrames in a row. If you were to look at the index of this new DataFrame, it would be the indices of both DataFrames one after another. In other words, this way of concatenating DataFrames is fine for specific situations; it's especially good when you're looking at a lot of different pieces and don't want the data getting confused.

Indexer noterest.NoteRestIndexer                
Parts                          0    1     2    3
8                             F4   F3   NaN   D3
10                            E4   G3   NaN   C3
12                            C4   A3  Rest   F3
15                           NaN  NaN   NaN   E3
16                          Rest   G3   NaN   C3
Indexer interval.HorizontalIntervalIndexer                 
Parts                          0     1     2    3
8                             -2     2   NaN   -2
10                            -3     2   NaN    4
12                          Rest    -2  Rest   -2
15                           NaN   NaN   NaN   -3
16                          Rest    -2   NaN    8

However, if you want to look at the notes next to the intervals within a single piece, you probably want the index to be the same for both DataFrames, so you can see much more easily:

Indexer noterest.NoteRestIndexer                   interval.HorizontalIntervalIndexer
Parts                          0    1     2    3              0     1     2    3
8                             F4   F3   NaN   D3             -2     2   NaN   -2 
10                            E4   G3   NaN   C3             -3     2   NaN    4
12                            C4   A3  Rest   F3           Rest    -2  Rest   -2
15                           NaN  NaN   NaN   E3            NaN   NaN   NaN   -3
16                          Rest   G3   NaN   C3           Rest    -2   NaN    8

This can be done with as many DataFrames as necessary!

Top

One Step Further

Dealing with NaN's

Although NaN's are generally quite a useful part of VIS output, they can lead to problems every now and then. There will come a time when you want to have a DataFrame or Series that doesn't contain any NaN's. For example, you might want to check what note is occurring at a given offset, but don't want NaN to be returned. You could, of course, look at the last valid piece of data in the DataFrame, but why code that yourself when pandas can do it for you! The two main things you need to know to manipulate NaN's is dropna() and fillna(). dropna(), as the name suggests, is a function that drops the NaN's from your dataframe. You will probably want to be careful when using this one, because, by default, it removes all rows that contain a NaN. That means, even if there are notes in other parts in the row, they will be removed as well. This can have its uses, but it is best to beware of this. Consequently, I generally use dropna() on Series. First, I separate out each part from the DataFrame:

0        F4
8        F4
10       E4
12       C4
15      NaN
16     Rest
20       G4
24       A4
27      NaN
28      NaN
29      NaN
30      NaN
32       A4
33      NaN
34      NaN
35      NaN
36       A4
40       G4

and then call dropna() on it:

0        F4
8        F4
10       E4
12       C4
16     Rest
20       G4
24       A4
32       A4
36       A4
40       G4

One of the issues with dropna() is that you can easily remove way too much of the dataframe than you meant to. dropna() has a setting called 'how' that has two options: 'all' and 'any'. 'All' only removes rows that contain all NaN's. You should find any rows like this in the standard VIS output, but if you've done other post-processing at this point, this can be a very useful tool. The 'any' option will remove all rows that have at least one NaN in them. Since it does this by default, it can be dangerous to call dropna() on a DataFrame, which is why I tend to call it on Series. Either way, you might find your own use for this tool in your analyses.

fillna() is another very useful DataFrame manipulation. It simply fills in the NaN's with other values, that you can specify when you call the function. The most useful for our purposes (because music is linear) is forward fill. This means that instead of only showing the note name when it is attacked, we can show it at every offset that it is sounding. For example, take the normal indexer results:

Indexer noterest.NoteRestIndexer                 
Parts                          0     1     2    3
0                             F4    F3  Rest   F3
8                             F4    F3   NaN   D3
10                            E4    G3   NaN   C3
12                            C4    A3  Rest   F3
15                           NaN   NaN   NaN   E3
16                          Rest    G3   NaN   C3
20                            G4   NaN   NaN  NaN
24                            A4    F3    F3   C4
27                           NaN   NaN   NaN   B3
28                           NaN  Rest   NaN   A3
29                           NaN   NaN   NaN   G3
30                           NaN   NaN   NaN   A3
32                            A4    C4   NaN  NaN
33                           NaN   NaN   NaN   G3
34                           NaN   NaN   NaN   F3
35                           NaN   NaN   NaN   E3
36                            A4    D4    F3   D3
40                            G4    B3    G3   E3

This is what it would look like after calling fillna(method='ffill') on it:

Indexer noterest.NoteRestIndexer                 
Parts                          0     1     2    3
0                             F4    F3  Rest   F3
8                             F4    F3  Rest   D3
10                            E4    G3  Rest   C3
12                            C4    A3  Rest   F3
15                            C4    A3  Rest   E3
16                          Rest    G3  Rest   C3
20                            G4    G3  Rest   C3
24                            A4    F3    F3   C4
27                            A4    F3    F3   B3
28                            A4  Rest    F3   A3
29                            A4  Rest    F3   G3
30                            A4  Rest    F3   A3
32                            A4    C4    F3   A3
33                            A4    C4    F3   G3
34                            A4    C4    F3   F3
35                            A4    C4    F3   E3
36                            A4    D4    F3   D3
40                            G4    B3    G3   E3

Notice that though, like dropna() all the NaN's have been removed, this time there are no offsets missing, and there are a lot more repetitions in the notes being shown.

Duplicates

There are a few different ways of dealing with duplicate values in pandas. The pandas function drop_duplicates() takes one setting that has 3 different options. Depending on the option of this one setting, this function then does 3 different things:

dataframe.drop_duplicates(keep=False)

Deletes all duplicated values in the entire DataFrame

dataframe.drop_duplicates(keep='first')

Deletes all duplicated values except for the first occurence

dataframe.drop_duplicates(keep='last')

Deletes all duplicated values except for the last occurence

All these dropping of duplicates can, of course, be useful. However, the most common reason to drop duplicates that we, as VIS developers, have found is to drop only consecutive duplicates. Unfortunately, pandas doesn't have a built-in function to do this for us. Luckily, though, it only takes one line of code to get there yourself:

dataframe = dataframe.loc[dataframe.shift() != dataframe]

Mapping

A common function for many data structures is the map function. Pandas, too, has equivalent functions for DataFrames and Series. With this, you can apply a function to every row of data in the DataFrame with only a single line of code, rather than looping through everything yourself.

The basic version is for Series and is called 'map':

series.map(my_func())

For DataFrames, there are two options, however. The first applies the function to each row - the row as a whole. This version is called 'apply':

dataframe.apply(my_func())

The other option is 'applymap', which applies the function to every individual element in each row in the DataFrame:

dataframe.applymap(my_func())

Describe

Pandas has a cool function called describe() that spits out a new DataFrame of data about your DataFrame. That data includes a number of different things, that also depend on what is relevent for the given DataFrame. Generally for VIS output, the most common data is: *count: the number of occurences *unique: the number of distinct elements *top: the most frequently occuring element *freq: the frequency of the top element

Indexer noterest.NoteRestIndexer              
Parts                          0    1   2    3
count                        186  208  84  170
unique                        13   16  11   13
top                           A4   C4  D4   F3
freq                          41   30  13   31

Top

Filtering

This method is something we do all the time for music analysis, and pandas makes it that much easier. It is a way of cross-filtering results of various indexers to get specific and useful information about the music represented in the DataFrames.

The first step in this method is to think of a question you want to ask about the music. To make it simple, let's ask what intervals occur against the bass voice when the bass voice is singing an F3.

First, we have to get the relevant indexers (noterest and interval, in this case) and concatenate them:

notes = noterest.NoteRestIndexer(piece).run()
intervals = interval.IntervalIndexer(notes).run()
piece = pandas.concat([notes, intervals], axis=1)

Next, indicate the condition you want to filter your results by. In this case, we want to filter for when the bass voice sings an F3:

condition = piece.loc[:, ('noterest.NoteRestIndexer', '3')] == 'F3'

Then, specify in which indexer you want to find your results. In this case, we also have to take into consideration that we only want to look at columns that have intervals occuring against the bass. This means the column label with definitely have the number '3' in it:

location = piece['interval.IntervalIndexer'].loc[:, [part for part in piece['interval.IntervalIndexer'] if '3' in part]]

Now, we simple have to apply the condition to the location and we get our results!

location[condition]

The resulting DataFrame:

Parts   0,3   1,3   2,3
0         8     1  Rest
12        5     3  Rest
34       10     5     1
56       10     5     3
86       10     3  Rest
98       10    10     6
133       9     6     4

Top