# Table of Contents 

- **[Series](#Series)**
- **[DataFrame](#DataFrame)**
    - [Accessing a DataFrame](#Accessing-a-DataFrame)
    - [Boolean Indexing](#Boolean-Indexing)
    - [Adding columns and rows](#Adding-columns-and-rows)
    - [Deleting columns and rows](#Deleting-columns-and-rows)
    - [Reading and Writing DataFrames](#Reading-and-Writing-DataFrames)
    - [Missing Data](#Missing-data)
- **[DataFrame Operations](#DataFrame-Operations)**
    - [Matrix operations](#Matrix-operations)
    - [Column operations](#Column-operations)

- **[Data Splitting](#Data-Splitting)**
    - [Grouping](#Grouping)    

- **[Matplotlib: plotting examples](#Matplotlib:-plotting-examples)**
    - [Plotting with pandas](#Plotting-with-pandas)


### The Data Mining Process 

Image from: 

    Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy. 
    Advances in Knowledge Discovery and Data Mining. MIT Press, Menlo Park, CA, 1996

![datamining](images/DataMiningProcess.PNG)


**Pandas** is designed to make **data pre-processing and data analysis fast and easy in Python**. Pandas adopts many coding idioms from NumPy, such as avoiding explicit `for` loops, but it is designed for working with heterogeneous data represented in tabular format.

To use Pandas, you need to import the `pandas` module, using for example:

In [1]:
import pandas as pd
import numpy as np # we will also need numpy

This import style is quite standard; all objects and functions of the `pandas` package will now be invoked with the `pd.` prefix.


## Aside: NumPy
NumPy (**Num**erical **Py**thon) is the fundamental package for scientific computing with Python. It contains, among other things:

- a powerful N-dimensional array object
- sophisticated functions that support broadcasting (i.e. arithmetic operations between arrays with different shapes)
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities
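The broadcasting mentioned in the list above can be illustrated with a minimal sketch. The array names `col`, `row` and `table` are only illustrative: a column of shape `(3, 1)` and a row of shape `(1, 4)` are combined into a `(3, 4)` result without any explicit loop.

```python
import numpy as np

# A (3, 1) column and a (1, 4) row are broadcast to a common (3, 4) shape.
col = np.arange(3).reshape(3, 1)   # [[0], [1], [2]]
row = np.arange(4).reshape(1, 4)   # [[0, 1, 2, 3]]
table = col * 10 + row             # element [i, j] equals 10*i + j

print(table.shape)
print(table)
```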

The core object of NumPy is the **ndarray** (N-dimensional array). It represents a multidimensional, homogeneous array of fixed-size items.

In [2]:
# example of 1-dimensional array
np.arange(0, 1, 0.1)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

In [3]:
# example of 2-dimensional array
my_ndarray = np.zeros((3,5))
my_ndarray

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [4]:
np.asarray([1,2,3])

array([1, 2, 3])

In [5]:
type(np.asarray([1,2,3]))

numpy.ndarray

In [6]:
print(my_ndarray)
print(my_ndarray.shape)
print(my_ndarray.ndim)
print(my_ndarray.size)
print(my_ndarray.dtype)

[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
(3, 5)
2
15
float64


There are several NumPy functions for [creating arrays](https://docs.scipy.org/doc/numpy/user/quickstart.html#array-creation):

| Function | Description |
| ---: | :--- |
| `np.array(a)` | Create $n$-dimensional NumPy array from sequence `a` |
| `np.linspace(a,b,N)` | Create 1D NumPy array with `N` equally spaced values from `a` to `b` (inclusively)|
| `np.arange(a,b,step)` | Create 1D NumPy array with values from `a` to `b` (exclusively) incremented by `step`|
| `np.zeros(N)` | Create 1D NumPy array of zeros of length $N$ |
| `np.zeros((n,m))` | Create 2D NumPy array of zeros with $n$ rows and $m$ columns |
| `np.ones(N)` | Create 1D NumPy array of ones of length $N$ |
| `np.ones((n,m))` | Create 2D NumPy array of ones with $n$ rows and $m$ columns |
| `np.eye(N)` | Create 2D NumPy array with $N$ rows and $N$ columns with ones on the diagonal (ie. the identity matrix of size $N$) |
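As a small illustration of the table above, the following sketch contrasts `np.linspace` (end point included) with `np.arange` (end point excluded). The variable names `lin` and `rng` are illustrative only.

```python
import numpy as np

# linspace includes the end point; arange excludes it.
lin = np.linspace(0, 1, 5)     # 5 equally spaced values from 0 to 1
rng = np.arange(0, 1, 0.25)    # step 0.25, the value 1 is not included

print(lin)
print(rng)
print(np.eye(2))               # 2x2 identity matrix
```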

### Mathematical Functions

[Mathematical functions](http://docs.scipy.org/doc/numpy/reference/routines.math.html) in NumPy are called [**universal functions**](https://docs.scipy.org/doc/numpy/user/quickstart.html#universal-functions) and are *vectorized*. Vectorized functions operate *element-wise* on arrays producing arrays as output and are built to compute values across arrays *very* quickly. 

The following table contains a list of the most important unary ufuncs.

|Function| Description |
|:-------|:---------|
|`np.abs`|Compute the absolute value element-wise for integer, floating-point, or complex values|
|`np.sqrt`|Compute the square root of each element|
|`np.exp`|Compute the exponential $e^x$ of each element|
|`np.log`, `np.log10`, `np.log2`, `np.log1p`|Natural logarithm (base e), log base 10, log base 2, and log(1 + x), respectively|
|`np.sign`|Compute the sign of each element: 1 (positive), 0 (zero), or –1 (negative)|
|`np.ceil`|Compute the ceiling of each element|
|`np.floor`|Compute the floor of each element|
|`np.modf`|Return the fractional and integral parts of each element as two separate arrays|
|`np.isnan`|Return boolean array indicating whether each value is `NaN` (Not a Number)|
|`np.cos`, `np.cosh`, `np.sin`, `np.sinh`, `np.tan`, `np.tanh`|Regular and hyperbolic trigonometric functions|
|`np.arccos`, `np.arccosh`, `np.arcsin`, `np.arcsinh`, `np.arctan`, `np.arctanh`|Inverse trigonometric and inverse hyperbolic functions|

The following table contains a list of the most important binary ufuncs.

|Function| Description |
|:-------|:---------|
|`np.add`|Element-wise addition|
|`np.subtract`|Element-wise subtraction|
|`np.multiply`|Element-wise multiplication|
|`np.divide`|Element-wise division|
|`np.mod`|Element-wise modulus|
|`np.power`|Raise elements in first array to powers indicated in second array |
|`np.maximum`, `np.fmax`|Element-wise maximum; `np.fmax` ignores `NaN`|
|`np.minimum`, `np.fmin`|Element-wise minimum; `np.fmin` ignores `NaN`|
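To make the two tables more concrete, here is a short sketch using one unary ufunc and two binary ufuncs on small arrays (the names `x`, `y`, `roots`, `largest` and `largest_skip` are illustrative). It also shows the difference between `np.maximum` and `np.fmax` when a `NaN` is present.

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0])
y = np.array([2.0, np.nan, 3.0])

roots = np.sqrt(x)             # unary ufunc, applied element-wise
largest = np.maximum(x, y)     # binary ufunc, NaN propagates to the result
largest_skip = np.fmax(x, y)   # same comparison, but NaN is ignored

print(roots)
print(largest)
print(largest_skip)
```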


Pandas has two main data structures, **Series** and **DataFrame**.


# Series

Series are the Pandas version of 1-D Numpy arrays. 

An instance of Series is a one-dimensional array-like object containing:
- a *sequence of values*,
- an array of *data labels*, namely its **index**.

A Series can be created easily from a Python list:

In [7]:
ts = pd.Series([4, 8, 1, 3])
print(ts)

0    4
1    8
2    1
3    3
dtype: int64


The string representation of a Series displays two columns: the first shows the index, the second shows the values. Since no index was specified, the default index consists of increasing integers starting from 0.

The underlying structure can be recovered with the `values` attribute:

In [8]:
print(ts.values)
print(type(ts.values))

[4 8 1 3]
<class 'numpy.ndarray'>


To create a Series with its own index, you can write:

In [9]:
ts = pd.Series([4, 8, 1, 3], index=['first', 'second', 'third', 'fourth'])
print(ts)

first     4
second    8
third     1
fourth    3
dtype: int64


The labels in the index can be used to select values in the Series:

In [10]:
print(ts['first'])

4


In [11]:
print(ts[['second', 'fourth']])

second    8
fourth    3
dtype: int64


You can think of a Series as a kind of fixed-length, ordered Python `dict`, mapping index values to data values. In fact, it is possible to create a Series directly from a Python `dict`:

In [12]:
my_dict = {'Pisa': 80, 'London': 300, 'Paris': 1}
ts = pd.Series(my_dict)
print(ts)

Pisa       80
London    300
Paris       1
dtype: int64


Sorting a series:

- sort by values

In [13]:
ts.sort_values()

Paris       1
Pisa       80
London    300
dtype: int64

- sort by index

In [14]:
ts.sort_index()

London    300
Paris       1
Pisa       80
dtype: int64

In [15]:
ts

Pisa       80
London    300
Paris       1
dtype: int64

In [16]:
ts.sort_values?

Signature:
ts.sort_values(
    *,
    axis: 'Axis' = 0,
    ascending: 'bool | Sequence[bool]' = True,
    inplace: 'bool' = False,
    kind: 'SortKind' = 'quicksort',
    na_position: 'NaPosition' = 'last',
    ignore_index: 'bool' = False,
    key: 'ValueKeyFunc | None' = None,
)

Note: **by default, almost every pandas method returns a new object and leaves the original untouched. Your data is modified only if you do so explicitly, for example by reassigning the result or by passing `inplace=True` where a method supports it.**
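A quick way to check this behaviour is sketched below, using the `inplace` parameter that appears in the `sort_values` signature. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series({'Pisa': 80, 'London': 300, 'Paris': 1})

sorted_copy = s.sort_values()    # returns a new, sorted Series
order_before = list(s.index)     # the original order is unchanged

s.sort_values(inplace=True)      # explicit request to modify s itself
order_after = list(s.index)

print(order_before)
print(order_after)
```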

**Arithmetic operations** on Series are automatically aligned on the index labels:

In [17]:
ts1 = pd.Series([4, 8, 1, 3], index=['first', 'second', 'third', 'fourth'])
ts2 = pd.Series([4, 8, 1], index=['first', 'second', 'pisa'])

In [18]:
ts1

first     4
second    8
third     1
fourth    3
dtype: int64

In [19]:
ts2

first     4
second    8
pisa      1
dtype: int64

In [20]:
ts_sum = ts1 + ts2
print(ts_sum)

first      8.0
fourth     NaN
pisa       NaN
second    16.0
third      NaN
dtype: float64


Here two values are actually computed (those for the labels `first` and `second`, which appear in both Series). The labels `third` and `fourth` of `ts1` are missing in `ts2`, and the label `pisa` of `ts2` is missing in `ts1`. Hence, for each of these labels a `NaN` value (*not a number*) appears, which Pandas treats as a **missing value**. Since `NaN` is a floating-point value, the result has dtype `float64`.
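If the `NaN` values are not wanted, one possible alternative is the `add` method with its `fill_value` parameter, which treats a label missing from one of the two Series as the given value. The sketch below rebuilds the two Series so that it can run on its own; `filled_sum` is an illustrative name.

```python
import pandas as pd

ts1 = pd.Series([4, 8, 1, 3], index=['first', 'second', 'third', 'fourth'])
ts2 = pd.Series([4, 8, 1], index=['first', 'second', 'pisa'])

# Labels missing from one Series are treated as 0 instead of producing NaN.
filled_sum = ts1.add(ts2, fill_value=0)
print(filled_sum)
```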

The `pd.isnull` (or `pd.isna`) and `pd.notnull` (or `pd.notna`) functions detect missing data. There are also corresponding **instance methods**.

In [21]:
pd.isnull(ts_sum)

first     False
fourth     True
pisa       True
second    False
third      True
dtype: bool

In [22]:
ts_sum.isnull()

first     False
fourth     True
pisa       True
second    False
third      True
dtype: bool

In [23]:
pd.notnull(ts_sum)

first      True
fourth    False
pisa      False
second     True
third     False
dtype: bool

In [24]:
ts_sum.notnull()

first      True
fourth    False
pisa      False
second     True
third     False
dtype: bool

# DataFrame

A DataFrame is a **rectangular table of data**. It contains an ordered list of columns. Every column can be of a different type. 

A DataFrame has both a *row index* and a *column index*. It can be thought of as a *dictionary of Series* (one per column) all sharing the same index labels.
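The "dictionary of Series" view can be taken literally, as in the following minimal sketch (the Series `price` and `stock` and their labels are made up for illustration). When the Series do not share exactly the same labels, the columns are aligned on the union of the labels and the gaps are filled with `NaN`.

```python
import pandas as pd

price = pd.Series({'a': 1.0, 'b': 2.0})
stock = pd.Series({'b': 10, 'c': 20})

# Columns are aligned on the union of the index labels; gaps become NaN.
frame = pd.DataFrame({'price': price, 'stock': stock})
print(frame)
```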

There are many ways to construct a DataFrame: one of the most common is using a dictionary of Python lists (or NumPy arrays):

In [25]:
cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [22000, 25000, 27000, 35000],
        'Wheels': 4} # a scalar is broadcast to every row

df = pd.DataFrame(cars)
print(df)

            Brand  Price  Wheels
0     Honda Civic  22000       4
1  Toyota Corolla  25000       4
2      Ford Focus  27000       4
3         Audi A4  35000       4


The resulting DataFrame will receive its index automatically as with Series.

To pretty-print a DataFrame in a Jupyter notebook, it is enough to write its name as the last line of a cell or to call `display` (use the `head()` instance method for very long DataFrames):

In [26]:
display(df)

Unnamed: 0,Brand,Price,Wheels
0,Honda Civic,22000,4
1,Toyota Corolla,25000,4
2,Ford Focus,27000,4
3,Audi A4,35000,4


In [27]:
df

Unnamed: 0,Brand,Price,Wheels
0,Honda Civic,22000,4
1,Toyota Corolla,25000,4
2,Ford Focus,27000,4
3,Audi A4,35000,4


To transpose a DataFrame, access its `T` attribute:

In [28]:
df.T 

Unnamed: 0,0,1,2,3
Brand,Honda Civic,Toyota Corolla,Ford Focus,Audi A4
Price,22000,25000,27000,35000
Wheels,4,4,4,4


In [29]:
df.head(2)

Unnamed: 0,Brand,Price,Wheels
0,Honda Civic,22000,4
1,Toyota Corolla,25000,4


In [30]:
df.tail(2)

Unnamed: 0,Brand,Price,Wheels
2,Ford Focus,27000,4
3,Audi A4,35000,4


A summary of the *numerical* data is provided by `describe`:

In [31]:
df.describe()

Unnamed: 0,Price,Wheels
count,4.0,4.0
mean,27250.0,4.0
std,5560.275773,0.0
min,22000.0,4.0
25%,24250.0,4.0
50%,26000.0,4.0
75%,29000.0,4.0
max,35000.0,4.0


In [32]:
df.describe(include = 'object')

Unnamed: 0,Brand
count,4
unique,4
top,Honda Civic
freq,1


In [33]:
df.describe(include = 'all')

Unnamed: 0,Brand,Price,Wheels
count,4,4.0,4.0
unique,4,,
top,Honda Civic,,
freq,1,,
mean,,27250.0,4.0
std,,5560.275773,0.0
min,,22000.0,4.0
25%,,24250.0,4.0
50%,,26000.0,4.0
75%,,29000.0,4.0
max,,35000.0,4.0


When working with a large table, it can be useful to list all the column names. These are given by the `keys()` method or, equivalently, by the `columns` attribute:

In [34]:
print(df.keys())

Index(['Brand', 'Price', 'Wheels'], dtype='object')


In [35]:
print(df.columns)

Index(['Brand', 'Price', 'Wheels'], dtype='object')


Many features of the NumPy package can be used directly with Pandas DataFrames:

In [36]:
print(df.values)
print()
print(type(df.values))

[['Honda Civic' 22000 4]
 ['Toyota Corolla' 25000 4]
 ['Ford Focus' 27000 4]
 ['Audi A4' 35000 4]]

<class 'numpy.ndarray'>


In [37]:
print(df.shape)

(4, 3)


Another common way to create a DataFrame is to use a *nested dict of dicts*:

In [38]:
population = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

If this nested dict is passed to the DataFrame, the **outer dict keys are interpreted as column labels**, and the **inner keys are interpreted as row labels**:

In [39]:
df = pd.DataFrame(population)
df

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


Sorting a dataframe:

In [40]:
df.sort_values(by = 'Ohio',ascending = False) # by can be a string or list of strings

Unnamed: 0,Nevada,Ohio
2002,2.9,3.6
2001,2.4,1.7
2000,,1.5


## Accessing a DataFrame

Let's create a brand new DataFrame:

In [41]:
dict_of_list = {'birth': [1860, 1770, 1858, 1906], 
                'death':[1911, 1827, 1924, 1975], 
                'city':['Kaliste', 'Bonn', 'Lucques', 'Saint-Petersburg']}
composers_df = pd.DataFrame(dict_of_list, index=['Mahler', 'Beethoven', 'Puccini', 'Shostakovich'])
composers_df

Unnamed: 0,birth,death,city
Mahler,1860,1911,Kaliste
Beethoven,1770,1827,Bonn
Puccini,1858,1924,Lucques
Shostakovich,1906,1975,Saint-Petersburg


There are multiple ways of accessing values or series of values in a DataFrame. Unlike in a Series, plain brackets select a column, not a row label. For example:

In [42]:
composers_df['city']

Mahler                   Kaliste
Beethoven                   Bonn
Puccini                  Lucques
Shostakovich    Saint-Petersburg
Name: city, dtype: object

returns a Series. Alternatively, one can use the attribute syntax to access a column:

In [43]:
composers_df.city

Mahler                   Kaliste
Beethoven                   Bonn
Puccini                  Lucques
Shostakovich    Saint-Petersburg
Name: city, dtype: object

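The attribute syntax has some limitations (for example, it fails for column names that contain spaces or that clash with existing DataFrame attributes), so if something does not work as expected, revert to the bracket notation.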
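Two typical cases where attribute access does not return a column are sketched below. The DataFrame `df_demo` and its column names are invented for this illustration: one column name clashes with the existing `DataFrame.count` method, the other contains a space.

```python
import pandas as pd

df_demo = pd.DataFrame({'count': [1, 2], 'birth year': [1860, 1770]})

# 'count' clashes with the DataFrame.count method, so attribute access
# returns the method rather than the column.
print(callable(df_demo.count))

# Bracket notation always returns the column, even with a space in the name.
print(df_demo['count'].tolist())
print(df_demo['birth year'].tolist())
```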

When specifying multiple columns, a DataFrame is returned:

In [44]:
composers_df[['city', 'birth']]

Unnamed: 0,city,birth
Mahler,Kaliste,1860
Beethoven,Bonn,1770
Puccini,Lucques,1858
Shostakovich,Saint-Petersburg,1906


With a slice, the standard indexing operator selects rows instead:

In [45]:
composers_df[0:2]

Unnamed: 0,birth,death,city
Mahler,1860,1911,Kaliste
Beethoven,1770,1827,Bonn


From the [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html):
>The Python and NumPy indexing operators [$\cdot$] and attribute operator  `.` provide quick and easy access to pandas data structures across a wide range of use cases. This makes interactive work intuitive, as there’s little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be accessed isn’t known in advance, directly using standard operators has some optimization limits. For production code, we recommended that you take advantage of the **optimized pandas data access methods**.



Pandas provides two optimized data access attributes: `iloc` and `loc`.

**Remember that `loc` and `iloc` are attributes, not methods, hence they use brackets `[]` and not parentheses `()`.**

The `loc` attribute selects elements by index labels, while the `iloc` attribute selects them by integer position, as in regular Python and NumPy indexing:

In [46]:
composers_df.iloc[0:2,:]

Unnamed: 0,birth,death,city
Mahler,1860,1911,Kaliste
Beethoven,1770,1827,Bonn


In [47]:
composers_df.loc[['Mahler','Beethoven'], 'death']


Mahler       1911
Beethoven    1827
Name: death, dtype: int64

In [48]:
composers_df.loc['Beethoven', 'death']

1827
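One difference between the two attributes that is easy to miss concerns slices: with `iloc` the end position is excluded, as in Python lists, while with `loc` both end labels are included. A minimal sketch on a small, self-contained DataFrame (named `demo` here for illustration):

```python
import pandas as pd

demo = pd.DataFrame({'birth': [1860, 1770, 1858]},
                    index=['Mahler', 'Beethoven', 'Puccini'])

by_position = demo.iloc[0:2]                 # rows 0 and 1; the end is excluded
by_label = demo.loc['Mahler':'Beethoven']    # both end labels are included
by_label_all = demo.loc['Mahler':'Puccini']  # all three rows

print(by_position)
print(by_label_all)
```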

## Boolean Indexing

Just like with NumPy, it is possible to select parts of a DataFrame using boolean indexing.
A boolean Series can be used as an index to select rows of the DataFrame.

In [49]:
composers_df

Unnamed: 0,birth,death,city
Mahler,1860,1911,Kaliste
Beethoven,1770,1827,Bonn
Puccini,1858,1924,Lucques
Shostakovich,1906,1975,Saint-Petersburg


In [50]:
mask = composers_df['death'] > 1859
print(mask)
composers_df[mask]

Mahler           True
Beethoven       False
Puccini          True
Shostakovich     True
Name: death, dtype: bool


Unnamed: 0,birth,death,city
Mahler,1860,1911,Kaliste
Puccini,1858,1924,Lucques
Shostakovich,1906,1975,Saint-Petersburg


More compact:

In [51]:
composers_df[composers_df['birth'] > 1900]

Unnamed: 0,birth,death,city
Shostakovich,1906,1975,Saint-Petersburg
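Several conditions can be combined in one mask. The sketch below, on a self-contained copy of the data (called `demo` for illustration), uses the element-wise operators `&` and `|`; each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

demo = pd.DataFrame({'birth': [1860, 1770, 1858, 1906],
                     'death': [1911, 1827, 1924, 1975]},
                    index=['Mahler', 'Beethoven', 'Puccini', 'Shostakovich'])

# Use & (and), | (or), ~ (not); each condition needs its own parentheses.
both = demo[(demo['birth'] > 1800) & (demo['death'] < 1950)]
either = demo[(demo['birth'] < 1800) | (demo['death'] > 1970)]

print(both)
print(either)
```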


To sum up, the basics of indexing:

| Operation | Syntax | Result |
| :---: | :---: | :---: |
| Select column | `df[col]` (or `df.col`, where possible) | Series |
| Select row by label | `df.loc[label]` | Series |
| Select row by integer location | `df.iloc[loc]` | Series |
| Slice rows | `df[5:10]` | DataFrame |
| Select rows by boolean vector | `df[bool_vect]` | DataFrame |


## Adding columns and rows

It is very simple to add a column to a Dataframe:

In [52]:
composers_df['country'] = '???'
composers_df

Unnamed: 0,birth,death,city,country
Mahler,1860,1911,Kaliste,???
Beethoven,1770,1827,Bonn,???
Puccini,1858,1924,Lucques,???
Shostakovich,1906,1975,Saint-Petersburg,???


Alternatively, an existing list can be used:

In [53]:
composers_df['country2'] = ['Austria','Germany','Italy','Russia']
composers_df

Unnamed: 0,birth,death,city,country,country2
Mahler,1860,1911,Kaliste,???,Austria
Beethoven,1770,1827,Bonn,???,Germany
Puccini,1858,1924,Lucques,???,Italy
Shostakovich,1906,1975,Saint-Petersburg,???,Russia


A DataFrame or a Series can be "appended" to another DataFrame through `pd.concat`:

In [54]:
new_row = pd.DataFrame({'Sibelius':{'birth':None,'death':1900, 'city': None, 'country':None}}).T
new_row

Unnamed: 0,birth,city,country,death
Sibelius,,,,1900.0


In [55]:
pd.concat((composers_df,new_row))

Unnamed: 0,birth,death,city,country,country2
Mahler,1860.0,1911.0,Kaliste,???,Austria
Beethoven,1770.0,1827.0,Bonn,???,Germany
Puccini,1858.0,1924.0,Lucques,???,Italy
Shostakovich,1906.0,1975.0,Saint-Petersburg,???,Russia
Sibelius,,1900.0,,,


More on [Database-style DataFrame or named Series joining/merging](https://pandas.pydata.org/docs/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging)

## Deleting columns and rows

In [56]:
composers_df

Unnamed: 0,birth,death,city,country,country2
Mahler,1860,1911,Kaliste,???,Austria
Beethoven,1770,1827,Bonn,???,Germany
Puccini,1858,1924,Lucques,???,Italy
Shostakovich,1906,1975,Saint-Petersburg,???,Russia


In [57]:
composers_df.drop(columns = ['country2'])

Unnamed: 0,birth,death,city,country
Mahler,1860,1911,Kaliste,???
Beethoven,1770,1827,Bonn,???
Puccini,1858,1924,Lucques,???
Shostakovich,1906,1975,Saint-Petersburg,???


In [58]:
tmp_df = composers_df.drop('Puccini')
tmp_df

Unnamed: 0,birth,death,city,country,country2
Mahler,1860,1911,Kaliste,???,Austria
Beethoven,1770,1827,Bonn,???,Germany
Shostakovich,1906,1975,Saint-Petersburg,???,Russia


In [59]:
composers_df #note that, by default, drop does not operate in-place

Unnamed: 0,birth,death,city,country,country2
Mahler,1860,1911,Kaliste,???,Austria
Beethoven,1770,1827,Bonn,???,Germany
Puccini,1858,1924,Lucques,???,Italy
Shostakovich,1906,1975,Saint-Petersburg,???,Russia


In [60]:
composers_df.drop?

Signature:
composers_df.drop(
    labels: 'IndexLabel | None' = None,
    *,
    axis: 'Axis' = 0,
    index: 'IndexLabel | None' = None,
    columns: 'IndexLabel | None' = None,
    level: 'Level | None' = None,
    inplace: 'bool' = False,
    errors: 'IgnoreRaise' = 'raise',
)

## Reading and Writing DataFrames

A common way of "creating" a Pandas Dataframe is by importing a table from another format like CSV (comma separated values) or Excel. 

### CSV format

In [61]:
df

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [62]:
df.to_csv('out/foo.csv')

In [63]:
df_read = pd.read_csv('out/foo.csv')
df_read

Unnamed: 0.1,Unnamed: 0,Nevada,Ohio
0,2001,2.4,1.7
1,2002,2.9,3.6
2,2000,,1.5


In [64]:
df_read = pd.read_csv('out/foo.csv',index_col = 0)
df_read

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5
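The same round trip can be tried without touching the file system. The following sketch assumes an in-memory text buffer (`io.StringIO`) in place of the `out/foo.csv` file; the names `frame`, `buffer` and `restored` are illustrative.

```python
import io
import pandas as pd

frame = pd.DataFrame({'Nevada': [2.4, 2.9], 'Ohio': [1.7, 3.6]},
                     index=[2001, 2002])

buffer = io.StringIO()
frame.to_csv(buffer)     # the index is written as the first, unnamed column
buffer.seek(0)           # rewind the buffer before reading it back

# index_col=0 turns the first column back into the index.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```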


### Importing Excel files

An Excel table is provided in the [composers.xlsx](dataset/composers.xlsx) file and can be read with the `pd.read_excel` function.

You may need to install the `openpyxl` package with `pip install openpyxl`.

In [65]:
#!pip install openpyxl

In [66]:
composers_df = pd.read_excel('dataset/composers.xlsx')
composers_df

Unnamed: 0,composer,birth,death,city
0,Mahler,1860,1911,Kaliste
1,Beethoven,1770,1827,Bonn
2,Puccini,1858,1924,Lucques
3,Shostakovich,1906,1975,Saint-Petersburg


The reader automatically recognized the headers of the file. However, it created a new index. If needed, we can specify which column to use as the index through the `index_col` argument:

In [67]:
pd.read_excel?

Signature:
pd.read_excel(
    io,
    sheet_name: 'str | int | list[IntStrT] | None' = 0,
    *,
    header: 'int | Sequence[int] | None' = 0,
    names: 'list[str] | None' = None,
    index_col: 'int | Sequence[int] | None' = None,
    usecols: 'int | str | Sequence[int] | Sequence[str] | Callable[[str], bool] | None' = None,
    dtype: 'DtypeArg | None' = None
    ...

In [68]:
composers_df = pd.read_excel('dataset/composers.xlsx', index_col = 'composer')
composers_df

composer,birth,death,city
Mahler,1860,1911,Kaliste
Beethoven,1770,1827,Bonn
Puccini,1858,1924,Lucques
Shostakovich,1906,1975,Saint-Petersburg


If we open the file in Excel, we see that it is composed of more than one sheet. When nothing is specified, the reader only reads the first sheet. However, we can specify a sheet:

In [69]:
composers_df = pd.read_excel('dataset/composers.xlsx', index_col = 'composer', sheet_name='Sheet2')
composers_df

composer,birth,death,city
Mahler,1860,1911,Kaliste
Beethoven,1770,1827,Bonn
Puccini,1858,1924,Lucques
Shostakovich,1906,1975,Saint-Petersburg
Sibelius,unknown,unknown,unknown
Haydn,,,Rohrau


In [70]:
composers_df.describe()

Unnamed: 0,birth,death,city
count,5,5,6
unique,5,5,6
top,1860,1911,Kaliste
freq,1,1,1


As you can see above, some information is missing. Some missing values are marked as "`unknown`" while others are `NaN`. `NaN` is the standard marker for unknown/missing values and is understood by Pandas, while "`unknown`" is just seen as text. 
This is impractical: the columns now mix numbers and text (`info()` reports them with dtype `object`), which will make later computations difficult. What we would like to do is to replace all such placeholder values with the standard `NaN` marker that says "*no information*".
For this we can use the `na_values` argument to specify what should be read as `NaN`:

In [71]:
composers_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, Mahler to Haydn
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   birth   5 non-null      object
 1   death   5 non-null      object
 2   city    6 non-null      object
dtypes: object(3)
memory usage: 192.0+ bytes


In [72]:
composers_df = pd.read_excel('dataset/composers.xlsx', index_col = 'composer', sheet_name='Sheet2', 
                             na_values=['unknown'])
composers_df

composer,birth,death,city
Mahler,1860.0,1911.0,Kaliste
Beethoven,1770.0,1827.0,Bonn
Puccini,1858.0,1924.0,Lucques
Shostakovich,1906.0,1975.0,Saint-Petersburg
Sibelius,,,
Haydn,,,Rohrau


In [73]:
composers_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, Mahler to Haydn
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   birth   4 non-null      float64
 1   death   4 non-null      float64
 2   city    5 non-null      object 
dtypes: float64(2), object(1)
memory usage: 192.0+ bytes
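The effect of `na_values` can be reproduced without the Excel file. The sketch below assumes a small CSV text invented for the purpose and uses `pd.read_csv`, which accepts the same `na_values` argument; it is not the content of `composers.xlsx`.

```python
import io
import pandas as pd

# Hypothetical data, only for illustration.
csv_text = """composer,birth,city
Mahler,1860,Kaliste
Sibelius,unknown,unknown
"""

raw = pd.read_csv(io.StringIO(csv_text))
clean = pd.read_csv(io.StringIO(csv_text), na_values=['unknown'])

print(raw['birth'].tolist())     # strings, because of the 'unknown' entry
print(clean['birth'].tolist())   # numbers, with NaN for the missing entry
```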


### Read / Write SQL database

From the [docs](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html):

Read data from SQL via either a SQL query or a SQL table name. 
Note that when using a SQLite database only SQL queries are accepted; providing only the SQL table name will result in an error.

In [74]:
from sqlite3 import connect
conn = connect(':memory:') # most common way to force an SQLite database to exist purely in memory 
df = pd.DataFrame(data=[[0, '10/11/12'], [1, '12/11/10']],
                  columns=['int_column', 'date_column'])
df.to_sql('test_data', conn) # Returns number of rows affected by to_sql, or None if the callable passed into method does not return an integer number of rows.

2

In [75]:
df

Unnamed: 0,int_column,date_column
0,0,10/11/12
1,1,12/11/10


In [76]:
pd.read_sql('SELECT int_column, date_column FROM test_data', conn)

Unnamed: 0,int_column,date_column
0,0,10/11/12
1,1,12/11/10
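By default `to_sql` also stores the DataFrame index as a column of the table, which is why the query above selects the two data columns by name. A possible variation, sketched here on a fresh in-memory database, is to pass `index=False` so that a `SELECT *` returns only the data columns. The names `frame` and `all_columns` are illustrative.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
frame = pd.DataFrame({'int_column': [0, 1],
                      'date_column': ['10/11/12', '12/11/10']})

# index=False keeps the DataFrame index out of the SQL table.
frame.to_sql('test_data', conn, index=False)

all_columns = pd.read_sql('SELECT * FROM test_data', conn)
conn.close()
print(all_columns)
```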


## Missing data
Pandas primarily uses the value `np.nan` to represent missing data. By default, missing values are excluded from computations such as `sum` or `mean`.
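A minimal sketch of this default, on a small Series made up for illustration: the `skipna` parameter of the reduction methods controls whether `NaN` is skipped.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

mean_default = s.mean()              # NaN is skipped: (1 + 3) / 2
mean_strict = s.mean(skipna=False)   # NaN is kept, so the result is NaN
n_valid = s.count()                  # number of non-missing values

print(mean_default, mean_strict, n_valid)
```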

In [77]:
df_new = composers_df.copy()

In [78]:
df_new

composer,birth,death,city
Mahler,1860.0,1911.0,Kaliste
Beethoven,1770.0,1827.0,Bonn
Puccini,1858.0,1924.0,Lucques
Shostakovich,1906.0,1975.0,Saint-Petersburg
Sibelius,,,
Haydn,,,Rohrau


Get a boolean mask where values are `np.nan` (as for Series).

In [79]:
pd.isna(df_new)

composer,birth,death,city
Mahler,False,False,False
Beethoven,False,False,False
Puccini,False,False,False
Shostakovich,False,False,False
Sibelius,True,True,True
Haydn,True,True,False


In [80]:
df_new.isna().values.any()

True

In [81]:
df_new.isna().sum()

birth    2
death    2
city     1
dtype: int64

To drop missing data, use `dropna`:

In [82]:
df_new.dropna(how = 'any') # drop every row that contains at least one NaN (rows are the default axis)

composer,birth,death,city
Mahler,1860.0,1911.0,Kaliste
Beethoven,1770.0,1827.0,Bonn
Puccini,1858.0,1924.0,Lucques
Shostakovich,1906.0,1975.0,Saint-Petersburg


In [83]:
df_new.dropna(how = 'all') # drop only the rows in which every value is NaN

composer,birth,death,city
Mahler,1860.0,1911.0,Kaliste
Beethoven,1770.0,1827.0,Bonn
Puccini,1858.0,1924.0,Lucques
Shostakovich,1906.0,1975.0,Saint-Petersburg
Haydn,,,Rohrau


To fill missing data, use `fillna`:

In [84]:
df_new.fillna(value=5)

composer,birth,death,city
Mahler,1860.0,1911.0,Kaliste
Beethoven,1770.0,1827.0,Bonn
Puccini,1858.0,1924.0,Lucques
Shostakovich,1906.0,1975.0,Saint-Petersburg
Sibelius,5.0,5.0,5
Haydn,5.0,5.0,Rohrau
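Filling every column with the same value, as above, is rarely meaningful for mixed data. One possible refinement, sketched on a small DataFrame invented for illustration, is to pass a dictionary to `fillna` so that each column gets its own replacement value.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'birth': [1860.0, np.nan],
                     'city': ['Kaliste', None]},
                    index=['Mahler', 'Sibelius'])

# A dict maps each column name to its own replacement value.
filled = demo.fillna({'birth': demo['birth'].mean(), 'city': 'unknown'})
print(filled)
```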


# DataFrame Operations

One of the great advantages of using Pandas to handle tabular data is how simple it is to extract valuable information from it. Here we are going to see various types of operations available for this.


## Matrix operations

The strength of NumPy is its natural way of handling matrix operations, and Pandas reuses a lot of these features. For example, one can use simple mathematical operations to operate at the cell level:

In [85]:
df = pd.read_excel('dataset/composers.xlsx')
df

Unnamed: 0,composer,birth,death,city
0,Mahler,1860,1911,Kaliste
1,Beethoven,1770,1827,Bonn
2,Puccini,1858,1924,Lucques
3,Shostakovich,1906,1975,Saint-Petersburg


In [86]:
df['birth'] * 2

0    3720
1    3540
2    3716
3    3812
Name: birth, dtype: int64

In [87]:
np.log(df['birth'])

0    7.528332
1    7.478735
2    7.527256
3    7.552762
Name: birth, dtype: float64

We can directly use an operation's output to create a new column:

In [88]:
df['age'] = df['death'] - df['birth']
df

Unnamed: 0,composer,birth,death,city,age
0,Mahler,1860,1911,Kaliste,51
1,Beethoven,1770,1827,Bonn,57
2,Puccini,1858,1924,Lucques,66
3,Shostakovich,1906,1975,Saint-Petersburg,69


Here we applied functions only to Series. Indeed, since our DataFrame also contains strings, numerical functions such as `np.log` cannot be applied to it as a whole. If, however, we have a homogeneous DataFrame, this is possible:

In [89]:
df[['birth', 'death']] * 2

Unnamed: 0,birth,death
0,3720,3822
1,3540,3654
2,3716,3848
3,3812,3950


## Column operations

There are other types of functions whose purpose is to summarize the data, for example the mean or the standard deviation. By default, Pandas applies such functions column-wise and returns a Series containing e.g. the mean of each column:

In [90]:
df[['birth','death','age']].mean()

birth    1848.50
death    1909.25
age        60.75
dtype: float64
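The direction of the summary can be changed with the `axis` parameter. The sketch below uses a two-row DataFrame built for illustration (`demo`) and compares the default column-wise mean with the row-wise mean obtained with `axis=1`.

```python
import pandas as pd

demo = pd.DataFrame({'birth': [1860, 1770], 'death': [1911, 1827]},
                    index=['Mahler', 'Beethoven'])

per_column = demo.mean()       # one value per column (axis=0, the default)
per_row = demo.mean(axis=1)    # one value per row

print(per_column)
print(per_row)
```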

Sometimes one needs to apply to a column a very specific function that is not provided by default. In that case we can use one of the different `apply` methods of Pandas.

The simplest case is to apply a function to a column, i.e. a Series of a DataFrame. Let's say for example that we want to label an age above 60 as 'old' and an age of 60 or below as 'young'. We can define the following general function:

In [91]:
define_age = lambda x: 'old' if x > 60 else 'young'

We can now apply this function on an entire Series:

In [92]:
df['categorical age'] = df.age.apply(define_age)

In [93]:
df

Unnamed: 0,composer,birth,death,city,age,categorical age
0,Mahler,1860,1911,Kaliste,51,young
1,Beethoven,1770,1827,Bonn,57,young
2,Puccini,1858,1924,Lucques,66,old
3,Shostakovich,1906,1975,Saint-Petersburg,69,old


In [94]:
df['compact categorical age'] = df.age.apply(lambda x: 'old' if x > 60 else 'young') # as before, but more compact
df

Unnamed: 0,composer,birth,death,city,age,categorical age,compact categorical age
0,Mahler,1860,1911,Kaliste,51,young,young
1,Beethoven,1770,1827,Bonn,57,young,young
2,Puccini,1858,1924,Lucques,66,old,old
3,Shostakovich,1906,1975,Saint-Petersburg,69,old,old
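
For a simple two-way rule like this one, a vectorized alternative to `apply` is `numpy.where`, which evaluates the condition on the whole Series at once. This is only a sketch of that alternative, not the approach used in the notebook; `ages` and `labels` are illustrative names.

```python
import numpy as np
import pandas as pd

# Ages copied from the table above.
ages = pd.Series([51, 57, 66, 69], name='age')

# The comparison yields a boolean Series; np.where picks a label per element.
labels = np.where(ages > 60, 'old', 'young')
print(labels)
```

`apply` remains the more general tool when the function cannot be expressed as an element-wise condition.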


### Value Counting

In [95]:
df['categorical age'].value_counts()

categorical age
young    2
old      2
Name: count, dtype: int64

In [96]:
df['categorical age'].unique()

array(['young', 'old'], dtype=object)
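
Two closely related calls are sometimes useful here: `value_counts(normalize=True)` returns fractions instead of raw counts, and `nunique()` returns the number of distinct values. A short sketch on a stand-in Series (`categories` is a hypothetical name holding the same labels as the column above):

```python
import pandas as pd

categories = pd.Series(['young', 'young', 'old', 'old'], name='categorical age')

counts = categories.value_counts()                 # absolute counts per label
shares = categories.value_counts(normalize=True)   # fraction of rows per label
n_distinct = categories.nunique()                  # number of distinct labels

print(counts)
print(shares)
print(n_distinct)
```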

# Data Splitting

Pandas tables often mix numerical variables (e.g. the size of cells in microscopy images) with categorical variables (e.g. the type of cell to which they belong). In that case, it is common to split the data by category before doing computations. Pandas makes this very easy.

## Grouping

In [97]:
composers_df = pd.read_excel('dataset/composers.xlsx', index_col = 'composer', sheet_name='Sheet5')

In [98]:
composers_df

Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Mahler,1860,1911.0,post-romantic,Austria
Beethoven,1770,1827.0,romantic,Germany
Puccini,1858,1924.0,post-romantic,Italy
Shostakovich,1906,1975.0,modern,Russia
Verdi,1813,1901.0,romantic,Italy
Dvorak,1841,1904.0,romantic,Czechia
Schumann,1810,1856.0,romantic,Germany
Stravinsky,1882,1971.0,modern,Russia
Sibelius,1865,1957.0,post-romantic,Finland
Haydn,1732,1809.0,classic,Austria


In [99]:
composers_df.head()

Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Mahler,1860,1911.0,post-romantic,Austria
Beethoven,1770,1827.0,romantic,Germany
Puccini,1858,1924.0,post-romantic,Italy
Shostakovich,1906,1975.0,modern,Russia
Verdi,1813,1901.0,romantic,Italy


What if we now want to count how many composers we have in each category?

Pandas simplifies this with the `groupby()` method, which groups elements by a given criterion, e.g. a categorical variable like the period:

In [100]:
composer_grouped = composers_df.groupby('period')
composer_grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10f72de10>

The output is a bit cryptic. What we actually have is a new object, a `DataFrameGroupBy`, which has a lot of handy properties. First let's see what the groups actually are. As with the DataFrame, let's look at a summary of the object:

In [101]:
composer_grouped.describe()

Unnamed: 0_level_0,birth,birth,birth,birth,birth,birth,birth,birth,death,death,death,death,death,death,death,death
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
period,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
baroque,10.0,1663.3,36.009412,1587.0,1647.0,1676.5,1685.0,1710.0,10.0,1720.2,43.460838,1640.0,1697.25,1736.0,1755.25,1764.0
classic,5.0,1744.4,12.054045,1731.0,1732.0,1749.0,1754.0,1756.0,5.0,1801.2,6.942622,1791.0,1799.0,1801.0,1806.0,1809.0
modern,13.0,1905.692308,28.595992,1854.0,1891.0,1902.0,1918.0,1971.0,11.0,1974.090909,26.139834,1928.0,1962.0,1982.0,1990.0,2016.0
post-romantic,5.0,1854.2,17.123084,1824.0,1858.0,1860.0,1864.0,1865.0,5.0,1927.4,25.540164,1896.0,1911.0,1924.0,1949.0,1957.0
renaissance,7.0,1527.142857,59.881629,1397.0,1528.5,1540.0,1564.5,1567.0,7.0,1595.285714,56.295986,1474.0,1594.0,1613.0,1624.5,1643.0
romantic,17.0,1824.823529,25.468695,1770.0,1810.0,1824.0,1841.0,1867.0,17.0,1883.588235,28.026904,1827.0,1869.0,1887.0,1904.0,1919.0


So we have a DataFrame with a statistical summary of the contents. The group names form the index of this DataFrame; they are simply the distinct categories present in the column we used for grouping. We can now retrieve a single group:

In [102]:
composer_grouped.get_group('baroque')

Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Haendel,1685,1759.0,baroque,Germany
Purcell,1659,1695.0,baroque,England
Charpentier,1643,1704.0,baroque,France
Couperin,1626,1661.0,baroque,France
Rameau,1683,1764.0,baroque,France
Caldara,1670,1736.0,baroque,Italy
Pergolesi,1710,1736.0,baroque,Italy
Scarlatti,1685,1757.0,baroque,Italy
Caccini,1587,1640.0,baroque,Italy
Bach,1685,1750.0,baroque,Germany
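
To answer the counting question directly, the number of rows in each group can be obtained with `size()`. The sketch below does not read the Excel file; it uses a hypothetical `sample` table containing only the ten rows displayed in the first output of this section, so its counts differ from those of the full dataset.

```python
import pandas as pd

# Only the ten rows displayed earlier, not the full dataset.
sample = pd.DataFrame({
    'composer': ['Mahler', 'Beethoven', 'Puccini', 'Shostakovich', 'Verdi',
                 'Dvorak', 'Schumann', 'Stravinsky', 'Sibelius', 'Haydn'],
    'period': ['post-romantic', 'romantic', 'post-romantic', 'modern', 'romantic',
               'romantic', 'romantic', 'modern', 'post-romantic', 'classic'],
}).set_index('composer')

# size() returns a Series indexed by group name, with one row count per group.
per_period = sample.groupby('period').size()
print(per_period)
```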


If one has multiple categorical variables, one can also group on several levels. For example, here we want to classify composers by both period and country. For this we just pass a list of two column names to `groupby()`:


In [103]:
composer_grouped = composers_df.groupby(['period','country'])
composer_grouped.get_group(('baroque','Germany'))

Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Haendel,1685,1759.0,baroque,Germany
Bach,1685,1750.0,baroque,Germany
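
With two grouping keys, aggregated results carry a two-level index, which `unstack()` can pivot into a period-by-country table. A minimal sketch on a hypothetical six-row `sample` (a handful of rows taken from the dataset, not the whole file):

```python
import pandas as pd

sample = pd.DataFrame({
    'composer': ['Haendel', 'Bach', 'Purcell', 'Rameau', 'Haydn', 'Mozart'],
    'period': ['baroque', 'baroque', 'baroque', 'baroque', 'classic', 'classic'],
    'country': ['Germany', 'Germany', 'England', 'France', 'Austria', 'Austria'],
})

# Series indexed by (period, country) pairs.
counts = sample.groupby(['period', 'country']).size()

# Move the 'country' level to the columns; absent combinations become 0.
table = counts.unstack(fill_value=0)

print(counts)
print(table)
```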


In [104]:
for k,v in composer_grouped:
    print(k)
    display(v)

('baroque', 'England')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Purcell,1659,1695.0,baroque,England


('baroque', 'France')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Charpentier,1643,1704.0,baroque,France
Couperin,1626,1661.0,baroque,France
Rameau,1683,1764.0,baroque,France


('baroque', 'Germany')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Haendel,1685,1759.0,baroque,Germany
Bach,1685,1750.0,baroque,Germany


('baroque', 'Italy')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Caldara,1670,1736.0,baroque,Italy
Pergolesi,1710,1736.0,baroque,Italy
Scarlatti,1685,1757.0,baroque,Italy
Caccini,1587,1640.0,baroque,Italy


('classic', 'Austria')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Haydn,1732,1809.0,classic,Austria
Mozart,1756,1791.0,classic,Austria


('classic', 'Czechia')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Dusek,1731,1799.0,classic,Czechia


('classic', 'Italy')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Cimarosa,1749,1801.0,classic,Italy


('classic', 'Spain')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Soler,1754,1806.0,classic,Spain


('modern', 'Austria')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Berg,1885,1935.0,modern,Austria


('modern', 'Czechia')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Janacek,1854,1928.0,modern,Czechia


('modern', 'England')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Walton,1902,1983.0,modern,England
Adès,1971,,modern,England


('modern', 'France')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Messiaen,1908,1992.0,modern,France
Boulez,1925,2016.0,modern,France


('modern', 'Germany')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Orff,1895,1982.0,modern,Germany


('modern', 'RUssia')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Prokofiev,1891,1953.0,modern,RUssia


('modern', 'Russia')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Shostakovich,1906,1975.0,modern,Russia
Stravinsky,1882,1971.0,modern,Russia


('modern', 'USA')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Copland,1900,1990.0,modern,USA
Bernstein,1918,1990.0,modern,USA
Glass,1937,,modern,USA


('post-romantic', 'Austria')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Mahler,1860,1911.0,post-romantic,Austria
Bruckner,1824,1896.0,post-romantic,Austria


('post-romantic', 'Finland')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Sibelius,1865,1957.0,post-romantic,Finland


('post-romantic', 'Germany')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Strauss,1864,1949.0,post-romantic,Germany


('post-romantic', 'Italy')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Puccini,1858,1924.0,post-romantic,Italy


('renaissance', 'Belgium')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Dufay,1397,1474.0,renaissance,Belgium
Lassus,1532,1594.0,renaissance,Belgium


('renaissance', 'England')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Dowland,1563,1626.0,renaissance,England
Byrd,1540,1623.0,renaissance,England


('renaissance', 'Italy')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Monteverdi,1567,1643.0,renaissance,Italy
Palestrina,1525,1594.0,renaissance,Italy
Gesualdo,1566,1613.0,renaissance,Italy


('romantic', 'Czechia')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Dvorak,1841,1904.0,romantic,Czechia
Smetana,1824,1884.0,romantic,Czechia


('romantic', 'France')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Berlioz,1803,1869.0,romantic,France
Gounod,1818,1893.0,romantic,France
Massenet,1842,1912.0,romantic,France


('romantic', 'Germany')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Beethoven,1770,1827.0,romantic,Germany
Schumann,1810,1856.0,romantic,Germany
Brahms,1833,1897.0,romantic,Germany
Wagner,1813,1883.0,romantic,Germany


('romantic', 'Italy')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Verdi,1813,1901.0,romantic,Italy
Donizetti,1797,1848.0,romantic,Italy
Leoncavallo,1858,1919.0,romantic,Italy
Bellini,1801,1835.0,romantic,Italy


('romantic', 'Russia')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Borodin,1833,1887.0,romantic,Russia
Mussorsgsky,1839,1881.0,romantic,Russia


('romantic', 'Spain')


Unnamed: 0_level_0,birth,death,period,country
composer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Albeniz,1860,1909.0,romantic,Spain
Granados,1867,1916.0,romantic,Spain


The main advantage of this GroupBy object is that it lets us do both computations and plotting very quickly, without having to loop through the different categories. Pandas does all the work for us: it applies a function to each group and then reassembles the results into a DataFrame (or a Series, depending on the output).
For example, most of the functions we used on DataFrames (mean, sum, etc.) can be applied to groups as well.
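
As an illustration of this split-apply-combine behaviour, the sketch below computes per-period summaries on a hypothetical four-row `sample` (values copied from rows shown above; the `age` column is derived as in the earlier section).

```python
import pandas as pd

sample = pd.DataFrame({
    'period': ['romantic', 'romantic', 'modern', 'modern'],
    'birth': [1770, 1813, 1906, 1882],
    'death': [1827, 1901, 1975, 1971],
})
sample['age'] = sample['death'] - sample['birth']

grouped = sample.groupby('period')

# One row per group, one column per selected variable.
mean_years = grouped[['birth', 'death']].mean()

# Several summaries of a single column at once.
age_stats = grouped['age'].agg(['mean', 'max'])

print(mean_years)
print(age_stats)
```

No explicit loop over the periods is needed: each call is applied to every group and the results are reassembled with the group names as index.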

### Grouping on index

In [105]:
df1 = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
df1

Unnamed: 0,a,b
0,1,4
1,2,5
2,3,6


In [106]:
df2 = pd.DataFrame({'a':[1,5], 'b':[8,0]})
df2

Unnamed: 0,a,b
0,1,8
1,5,0


In [107]:
df_concat = pd.concat([df1,df2])
df_concat

Unnamed: 0,a,b
0,1,4
1,2,5
2,3,6
0,1,8
1,5,0


In [108]:
dir(df_concat)

['T',
 '_AXIS_LEN',
 '_AXIS_ORDERS',
 '_AXIS_TO_AXIS_NUMBER',
 '_HANDLED_TYPES',
 ...]

In [109]:
by_row_index = df_concat.groupby(df_concat.index)
for index_value, group in by_row_index:
    display(group)

Unnamed: 0,a,b
0,1,4
0,1,8


Unnamed: 0,a,b
1,2,5
1,5,0


Unnamed: 0,a,b
2,3,6


In [110]:
df_avg = by_row_index.mean()
df_avg

Unnamed: 0,a,b
0,1.0,6.0
1,3.5,2.5
2,3.0,6.0
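
After concatenation the row labels 0 and 1 each appear twice, so grouping on the index averages the rows that share a label. An equivalent way to write this, shown here as a sketch, is `groupby(level=0)`, which refers to the index by level number instead of passing the index values explicitly (`via_index` and `via_level` are illustrative names):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [1, 5], 'b': [8, 0]})
df_concat = pd.concat([df1, df2])

# Group by the index values themselves ...
via_index = df_concat.groupby(df_concat.index).mean()

# ... or by index level; both give the same result here.
via_level = df_concat.groupby(level=0).mean()

print(via_level)
```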
