![Erudio logo](img/erudio-logo-small.png)
---
![Pandas logo](img/pandas-logo-small.png)

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from src.training import *

# Reshaping DataFrames

In the basic form, Pandas DataFrames are 2-D "arrays" with labels for the rows (the index) and labels for the columns.  That abstraction is very powerful in itself, as we have seen.  But there are times when looking at the same data in a somewhat different way allows expressing calculations much more easily.

For example, let us recall our aggregated Olympic medals data.  Using descriptive row labels rather than the default sequential numbers, we get something where every intersection of row × column gives us some meaningful information about how those aspects of the dataset interact.

In [None]:
# How many Bronze medals did France win?
medals = pd.read_csv('data/olympic-medals.csv', index_col='Abbrev')
medals.loc['FRA', 'Bronze']

## Reversing Rows and Columns

First off, there is a common concept in data science, and in science generally, of [*tidy data*](https://vita.had.co.nz/papers/tidy-data.html) (the R programming language community emphasizes this especially strongly).  Each row should represent an observation or entity, and each column should represent a feature or variable.

Generally we work with no more than tens of features, but we may have thousands or millions of observations.  Pandas works best with that arrangement, and it is clearer both conceptually and visually.  Moreover, if every measurement of a common feature has the same datatype, vectorized operations will be much faster.

In [None]:
patients = pd.read_csv('data/patient-records.csv', index_col="name")
patients

Occasionally, it is possible to think of rows and columns under a reversed conceptual frame.  More often, we simply get the data in a format that does not follow the tidy model.  One of the simplest transforms we can make is transposing the rows and columns.

In [None]:
# The .T for transpose is borrowed from NumPy
patients.T
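
One consequence of transposing is worth a quick sketch.  The small frame below is made-up data, not the patient records; the names `toy`, `flipped`, and `roundtrip` are purely illustrative.  It shows that when columns hold different datatypes, the transposed columns fall back to the generic `object` dtype, which is part of why the tidy arrangement is faster.

```python
import pandas as pd

# Hypothetical stand-in data (not the patient-records file)
toy = pd.DataFrame(
    {"age": [34, 58], "weight_kg": [70.5, 82.0], "smoker": ["no", "yes"]},
    index=["Ana", "Bo"],
)

print(toy.dtypes)      # one dtype per column
flipped = toy.T
print(flipped.dtypes)  # each column now mixes numbers and strings

# Transposing twice restores the labels, but not the original dtypes
roundtrip = flipped.T
print(roundtrip.dtypes)
```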

## Hierarchical Indexing

We already saw several hierarchical indices, but did not comment on them, when we did `.groupby()` operations.  The idea is that rows (and columns) can be indexed at multiple levels for better organization and more convenient access patterns.

Let us use the Olympic medal data to explore this idea.

We might organize our data better simply by structuring the index to reflect different elements that identify the underlying data.  Notice that we still have the same number of rows, and no aggregation has occurred.

In [None]:
# Index by both Continent and Abbrev
medals.reset_index(inplace=True)
medals.set_index(['Continent', 'Abbrev'])

In [None]:
# Index by Level, Continent, and Abbrev
df = medals.set_index(['Level', 'Continent', 'Abbrev'])
df

### Simulating higher dimensions

One of the things hierarchical indexing gets us is a kind of higher dimensionality.  However, in the general case it is "ragged dimensionality."  That is, not everything at one level of the index will correspond to everything at another level, unlike in, e.g., an L×M×N×O 4-D array.  In this conceptualization, the columns make up the final dimension.

In this particular example, each `Abbrev` value belongs to exactly one `Continent`.  But that degree of containment is not general to hierarchical indices.  For example, `Level` and `Continent` cut across each other: various countries are in the `High` / `Europe` combination.

In [None]:
# Pick two "dimensions"
df.loc['High', 'Europe']

In [None]:
# Pick four "dimensions" (have to group the index dimensions before column)
df.loc[('Medium', 'Asia', 'JPN'), 'Silver']

In [None]:
# Combine other indexers, i.e. lists
with show_all_rows():
    print(df.loc[(['Low', 'Medium'], 'Asia'), ['Bronze', 'Total']])

In [None]:
medals.Level.unique()
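
The access patterns above can be tried on a frame small enough to check by eye.  This is only a sketch: the rows and country codes are hypothetical, and only the column names mirror the medals data.  It also uses `.xs()`, a standard Pandas method not otherwise covered in this lesson, to select on an inner level.

```python
import pandas as pd

# Hypothetical rows; the column names mirror the medals data, the values do not
toy = pd.DataFrame({
    "Level": ["High", "High", "Medium", "Low"],
    "Continent": ["Europe", "Europe", "Asia", "Asia"],
    "Abbrev": ["AAA", "BBB", "CCC", "DDD"],
    "Silver": [10, 7, 4, 1],
    "Total": [30, 20, 12, 2],
})

nested = toy.set_index(["Level", "Continent", "Abbrev"]).sort_index()
assert len(nested) == len(toy)  # re-indexing does not aggregate

# Partial key: two outer levels select a sub-frame indexed by Abbrev
print(nested.loc[("High", "Europe"), :])

# Full key plus a column label selects a single value
print(nested.loc[("Medium", "Asia", "CCC"), "Silver"])

# .xs() selects on an inner level without naming the outer ones
print(nested.xs("Asia", level="Continent"))
```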

## Stacking and Unstacking

A particular transform is available which takes all (or some) of the columns that are not in the index, and turns them into a DataFrame or Series of labeled values.  Actually, what we get is a Series-of-Series.  Each hierarchical index entry in the overall Series contains a Series as its value.

It is more sensible in these "dimensional" transformations to use features that are independent of each other (unlike `Abbrev`, which was used to illustrate more dimensions).  So, for example, `Continent` and `Level` might occur in any combination, although in the actual data not all 8×5 (=40) combinations occur among the 138 countries.

In [None]:
series = medals.set_index(['Continent', 'Level']).stack()
series

We can do a similar pseudo-dimensional indexing on this stacked Series.

In [None]:
# Use a slice to select all Levels (i.e. second level of index)
series['South America', :]

The method `.unstack()` is the inverse of stacking, but it is a bit picky since it fails in the face of duplicate labels (which one will almost always have when there are (partially) independent "dimensions" represented).
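
A minimal sketch of that pickiness, using hypothetical data in which the `("Asia", "Low")` combination deliberately appears twice.  Dropping the duplicate, as done at the end, is just one way to make the round trip work; it is an assumption of this example, not a recommendation for the medals data.

```python
import pandas as pd

# Hypothetical data: ("Asia", "Low") appears twice on purpose
toy = pd.DataFrame({
    "Continent": ["Asia", "Asia", "Europe"],
    "Level": ["Low", "Low", "High"],
    "Gold": [1, 2, 9],
    "Total": [3, 5, 20],
})

stacked = toy.set_index(["Continent", "Level"]).stack()
print(stacked)  # a Series with a three-level index

# Duplicate index entries make the reshape ambiguous
try:
    stacked.unstack()
except ValueError as err:
    print("unstack failed:", err)

# With unique index entries the round trip succeeds
unique = (toy.drop_duplicates(["Continent", "Level"])
             .set_index(["Continent", "Level"]))
restored = unique.stack().unstack()
print(restored)
```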

## Aggregating New Rows and Columns

A kind of generalization of stacking is pivoting.  The methods `.pivot()` and `.pivot_table()` do this, with the latter being more general in not failing on duplicate values.

The idea of pivoting is to take categorical values within a column, and treat those as either row or column labels themselves.  Because multiple rows in the original DataFrame may share a categorical label, we need to perform an aggregation of the values.  

Counting the number of things that match multiple categoricals is an obvious task.  But in other cases, finding the maximum, or the mean, or simply the first, among the items can be relevant.

In [None]:
medals.pivot_table(index='Continent', columns='Level', aggfunc="count").loc[:, 'Total']
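
To see why the aggregation matters, here is a small sketch on hypothetical data where two rows share the same `Continent` and `Level`.  The frame `toy` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: two European rows share the "High" level
toy = pd.DataFrame({
    "Continent": ["Europe", "Europe", "Asia", "Asia"],
    "Level": ["High", "High", "High", "Low"],
    "Total": [40, 25, 30, 2],
})

# .pivot() refuses, because (Europe, High) does not identify a single row
try:
    toy.pivot(index="Continent", columns="Level", values="Total")
except ValueError as err:
    print("pivot failed:", err)

# .pivot_table() resolves the collision with an aggregation
counts = toy.pivot_table(index="Continent", columns="Level",
                         values="Total", aggfunc="count")
largest = toy.pivot_table(index="Continent", columns="Level",
                          values="Total", aggfunc="max")
print(counts)
print(largest)
```

Combinations that never occur, such as Europe with `Low` here, come back as NaN rather than zero.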

This next one is a little bit convoluted, but we create a DataFrame with hierarchical *columns* and just one level of index.  The index is the number of gold medals, and the values in the table are the number of total medals.  Moreover, we break down the countries by continent in the nested columns.

In [None]:
df = (medals
         .pivot_table(index='Gold', 
                      columns=['Continent', 'Abbrev'], 
                      aggfunc="sum")
         .loc[:, 'Total']
         .fillna(0)
         .astype(int)
)
df

This data with 136 columns would lend itself better to a hierarchical index, and `.stack()` conveniently gets us that. The initial version puts NaNs almost everywhere, listing every country abbreviation by every Gold medal count.  We can drop the rows that contain only zeros.

In [None]:
stack = df.stack().fillna(0).astype(int)
stack = stack.loc[(stack.T != 0).any()]
stack

This sparse DataFrame does let us answer some questions.  For example, among European countries that won 50-200 Gold medals, how many total medals did they win?

In [None]:
# Zeros are either European countries with no medals, or non-European
europe = stack.loc[(slice(50, 200), slice(None)), 'Europe']
# Drop the zeros
europe[europe != 0]

There is an extra trick above.  We need slices into both levels of the hierarchical index, hence have to make a tuple.  But the colon notation for a slice does not work inside a tuple, neither for an empty slice nor for one with ends, so we used the `slice()` constructor.  Pandas also provides `IndexSlice` as a convenience "accessor" that accepts the colon notation:

In [None]:
from pandas import IndexSlice as ndx
europe = stack.loc[ndx[50:200, :], 'Europe']
europe[europe != 0]
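
The two spellings are interchangeable, which a tiny hypothetical frame can confirm.  The index values and column contents below are made up; only the shape (a `Gold` level over an `Abbrev` level, with continents as columns) imitates the stacked data.

```python
import pandas as pd

# Hypothetical frame indexed by (Gold, Abbrev); values are made up
index = pd.MultiIndex.from_tuples(
    [(3, "AAA"), (60, "BBB"), (120, "CCC"), (250, "DDD")],
    names=["Gold", "Abbrev"],
)
toy = pd.DataFrame({"Europe": [0, 150, 300, 0], "Asia": [9, 0, 0, 600]},
                   index=index)

# Label slices include both endpoints; the index here is already sorted
with_slice = toy.loc[(slice(50, 200), slice(None)), "Europe"]
with_ndx = toy.loc[pd.IndexSlice[50:200, :], "Europe"]

print(with_ndx)
print(with_slice.equals(with_ndx))
```

`IndexSlice` does nothing more than build the same tuple of `slice` objects, so the choice between the two forms is purely about readability.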

# Exercises

Many Pandas users use hierarchical indexing a great deal, and the more exotic tools of stack, unstack, and pivot much less often.  These exercises will focus on the more common need.

We utilize a small dataset called "Best Artworks of All Time" that was published on Kaggle under CC BY-NC-SA 4.0.  We are not endorsing or disputing the selection of these 50 artists, nor vouching for any of the details provided.  It simply contains a number of categorical fields.

Three small notes: 

1. The script used to provide some functionality to these lessons massages the raw dataset slightly, but the original is in the repository.
2. The numeric count of paintings is not the total output of each artist, but simply the number of image files accompanying the dataset (we do not provide or utilize the actual images).  Nonetheless, it is data which we will pretend is meaningful.
3. The start and end dates are datetime datatypes, but only the year was provided.  All the dates are therefore simply the stroke of midnight on January 1 of a year, not any calendar date relevant to the artist.

In [None]:
from src.pandas_exercises import *
artists.head()

---
Question: Which nationalities in the dataset are represented by more than 3, but fewer than 10 artists?

The answer happens to be Italian, Spanish, Russian, and Dutch; but obviously the point is to write code that will provide the answer generically for a new dataset structured similarly.

In [None]:
# Identify nationalities for 3 < N < 10
...

---
Question: What is the range of years during which Russian artists in the dataset were working?

The answer is 1884-1944.

In [None]:
# Identify working years for Russian artists
...

---
Report the maximum number of paintings from artists of each nationality, and which painter(s) had that maximum (from the collection in this dataset; probably not in their career).

In [None]:
ex8_1.result

In [None]:
# Most prolific artist by nationality
...

---

Create a DataFrame with a nested index of end of professional career and nationality, with rows sorted by end date and secondarily by start date.  The non-index columns of the DataFrame should include only name and start date.

*Hint*: Edvard Munch, Vasiliy Kandinsky, and Piet Mondrian all ended work in 1944, but started in different years.

In [None]:
with show_all_rows():
    print(ex8_2.result)

In [None]:
# Create nested index and correct columns
...

---
If you have looked at examples closely, you will have noticed that there are two attributes of artists that are non-unique.  One artist may have multiple nationalities and/or may have worked in multiple genres.

These non-unique attributes are encoded differently.  For different nationalities, duplicate rows are present that differ only in that one feature.  For different genres, three different columns each hold one of them (in many cases some of these columns are filled with `None` to indicate a missing value).

Create a Series containing each artist listed only once, and the number of paintings of theirs in the collection.

In [None]:
# Uniquify artists and give count of paintings
...
ex8_3.result[["El Greco", "Marc Chagall", "Alfred Sisley"]]

In contrast, this approach contains duplicates of the information we want.

In [None]:
artists.set_index("name").paintings[["El Greco", "Marc Chagall", "Alfred Sisley"]]

---
This last exercise is **a bit more challenging**, and there are many ways to get there.  We would like to create a Series with a non-unique index containing artist name, and values for each corresponding genre that the artist worked in.  Sort the artists by name.

In [None]:
with show_all_rows():
    print(ex8_4.result)

In [None]:
# Show all artist/genre combinations
...


---

Materials licensed under [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) by the authors