# Introduction to the `pandas` data analysis library

We will use a Jupyter notebook for the final Python session of the Vorkurs, and introduce the `pandas` data analysis library.

Jupyter notebooks allow you to write Python code interactively in your browser, and display the code's intermediate outputs (text, tables or plots) in-place. Jupyter notebooks are structured into three kinds of cells:
1. Markdown text (like the one you're currently reading)
2. Code
3. Output

Code cells can be run one-by-one, as opposed to executing an entire Python program in one go. Code cells that you execute (with the `Run` button or `Ctrl+Enter`) have lasting effects: all the variables set by previously executed code cells will be available for later cells.

You can edit and re-run cells freely, but keep in mind that your program's internal state depends on what you have previously run, which isn't necessarily consistent with what you see on your screen. For example, if you execute a cell containing `a = 1`, then change it to `b = 1` and re-run it, the variable `a` won't disappear: it remains available to later cells until you restart your session.

When a cell is executed, the value of its last line is displayed below the cell by default, even without a `print` call. This is a useful feature for inspecting intermediate results. Try it by running this cell:

In [None]:
print("I can check stuff without explicit print statements like this")
a = 5
b = 3

a + b

You saw that running the code cell created an output cell below it. This is how we will use Jupyter notebooks and look at our intermediate outputs.

You can take a look at [Jupyter keyboard shortcuts](https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/) to help you format and interact with your notebooks efficiently. The important parts:

You can insert a new cell anywhere by clicking on the left margin of a cell and pressing `b` (for below) or `a` (for above). If the left border of a cell is blue, you are in navigation mode. If it's green, you are in editing mode. You can switch between them with `Escape` and `Enter`. You can run a code cell with `Ctrl+Enter`. In navigation mode, you can move between cells using the `Up` or `Down` keys.

In [None]:
# Try navigating, inserting, editing and executing code cells.

## Pandas

The purpose of this notebook is to give an introduction to the `pandas` data analysis library. `pandas` provides data structures and various kinds of operations for handling tabular data.

The basic object types are `Series` (1-dimensional data, like a list) and `DataFrame` (2-dimensional data, like a table). These objects have a lot of fancy features that plain old Python lists, or even `numpy` arrays don't. They offer label-based indexing, slicing, grouping, aggregation, merging, reshaping, pivoting, and a host of other manipulation methods, as well as some visualization capabilities using `matplotlib` or `seaborn`. DataFrames also support pretty HTML table outputs inside Jupyter notebooks by default.

Pandas has extensive online documentation. You may find the [10 minutes to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html) introduction and the [tutorials](http://pandas.pydata.org/pandas-docs/stable/tutorials.html) especially useful.



## Import pandas

Conventionally `pandas` is imported under the name `pd`. There are a few other conventions such as `numpy as np` or the visualization library `seaborn as sns`.

In [None]:
import pandas as pd

## Series objects

Create a pandas `Series` object named `s` with the following data:

|     |
|-----|
| 3.0 |
| 1.5 |
| 4.3 |
| 8.2 |
| 0.9 |

Now
* Access its first value
* Access its last value
* Calculate its mean and standard deviation
* Sort it in ascending and descending order

Use a separate code block for each operation. 

Store the sorted `Series` in variable `s2`. Notice how the index labels moved together with the values. Positional (`.iloc[]`) and label-based (`.loc[]`) access now behave very differently. Try them out!
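The difference between positional and label-based access can be sketched on a small made-up `Series`. The name `toy` and its values are hypothetical and are not the exercise data.

```python
import pandas as pd

# Hypothetical toy data, not the values from the exercise.
toy = pd.Series([10.0, 30.0, 20.0])

# Sorting reorders the rows, but every value keeps its original label.
toy_sorted = toy.sort_values(ascending=False)

first_by_position = toy_sorted.iloc[0]  # the first row in the new order
first_by_label = toy_sorted.loc[0]      # the row whose label is 0

print(toy_sorted.index.tolist())          # [1, 2, 0]
print(first_by_position, first_by_label)  # 30.0 10.0
```

After sorting, position 0 and label 0 refer to different rows, which is why the two accessors return different values.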

## DataFrame objects, accessing values

Create a `DataFrame` named `df_species` with the following data:

| |species |weight  |legs |
|--------|--|---|----|
|<b>0</b>|human |62  |2 |
|<b>1</b>|cat   |4  |4 |
|<b>2</b>|mouse |0.02  |4 |
|<b>3</b>|dog   |20 |4 |
|<b>4</b>|mole  |0.09  |4 |
|<b>5</b>|train |200000 |0 |
|<b>6</b>|bee   |0.0001 |6  |
|<b>7</b>|elephant |3000 |4 |

You can pass the data to the `DataFrame(...)` constructor column-by-column as a dictionary (keys: column labels, values: list of column values), or row-by-row using nested lists. In the latter case, you have to additionally specify the column labels using the `columns=` keyword argument.
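As a minimal sketch, the two construction styles look like this on a tiny made-up table. The names `df_cols` and `df_rows` and the fruit data are hypothetical.

```python
import pandas as pd

# Column-by-column: a dictionary mapping column labels to lists of values.
df_cols = pd.DataFrame({
    "fruit": ["apple", "kiwi"],
    "price": [0.5, 0.3],
})

# Row-by-row: nested lists, plus the column labels via `columns=`.
df_rows = pd.DataFrame(
    [["apple", 0.5], ["kiwi", 0.3]],
    columns=["fruit", "price"],
)

# Both approaches should produce the same table.
print(df_cols.equals(df_rows))
```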

Now, in separate cells
* access the 3rd row of `df_species`. What type is the resulting object?
* access the row with index `3`
* access the column `weight`
* access a single cell (e.g. dog's number of legs)
* print the index and column labels
* extract rows 2-5 into a new DataFrame (no need to save it)
* obtain a summary table of the DataFrame with the means, standard deviations, min/max/percentiles of the numerical columns
* add a new column `weight_in_lb` to the DataFrame containing the weights in pounds instead of kilograms

## Data Import

Import the `df_characters.csv` file into a DataFrame named `df`, and display it:

Now import the file `df_countries.tsv` into a pandas DataFrame as `df_countries`. Do you need to change any parameters of the importing function?
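The effect of the separator setting can be illustrated without touching the course files. The sketch below reads a made-up tab-separated string through `io.StringIO`. The text and the names `wrong` and `right` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical tab-separated content standing in for a file on disk.
tsv_text = "name\tpopulation\nAtlantis\t1000\nElDorado\t2500\n"

# With the default comma separator, each line ends up in a single column.
wrong = pd.read_csv(io.StringIO(tsv_text))

# Passing sep="\t" splits the lines at tab characters instead.
right = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 2)
```

Checking `.shape` or `.columns` right after an import is a quick way to spot a wrong separator.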

## DataFrame filtering

The DataFrame access method `.loc[]` can take:
* individual index labels (as seen previously)
* lists of index labels
* slices in the `beginning:end` format (also seen previously)
* boolean valued `Series` objects, where only labels with `True` value are kept

For the upcoming task, the last method will be especially useful. Using the cartoon character DataFrame `df`, filter them for the following criteria (separately) and display the results:

* German characters
* Characters born after 1970
* Human characters weighing <60 kg

Tips: You can perform boolean comparisons on an entire `Series` with the usual `>`, `<`, `==`, `!=`, etc. operators. You can do elementwise logical operations between two conforming `Series` objects using the `&` or `|` operators.
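Here is a small sketch of boolean filtering on a made-up table. The `pets` DataFrame and its columns are hypothetical and unrelated to the character data.

```python
import pandas as pd

# Hypothetical toy table.
pets = pd.DataFrame({
    "name": ["Rex", "Tom", "Fido", "Kitty"],
    "kind": ["dog", "cat", "dog", "cat"],
    "age": [3, 7, 10, 2],
})

# Each comparison yields a boolean Series aligned with the rows of `pets`.
is_dog = pets["kind"] == "dog"
is_old = pets["age"] > 5

# `&` is an elementwise "and", `|` is an elementwise "or".
old_dogs = pets.loc[is_dog & is_old]
dogs_or_old = pets.loc[is_dog | is_old]

print(old_dogs["name"].tolist())     # ['Fido']
print(dogs_or_old["name"].tolist())  # ['Rex', 'Tom', 'Fido']
```

If you write the comparisons inline, wrap each one in parentheses, e.g. `(pets["kind"] == "dog") & (pets["age"] > 5)`, because `&` binds more tightly than `==` and `>`.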

Introduce a new column named `age` based on the characters' birthdates. Answer the following questions:

* What is the average age of the characters?
* Who is the youngest German character?

## DataFrame aggregation

Using the same DataFrame `df`
* Count the number of characters from each country
* Calculate the average weight and age of characters by species
* List the heaviest characters per country
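Aggregations like these are typically built on `groupby`. The sketch below shows three common patterns on a made-up table. The `pets` DataFrame is hypothetical, and this is only one of several possible approaches.

```python
import pandas as pd

# Hypothetical toy table.
pets = pd.DataFrame({
    "name": ["Rex", "Tom", "Fido", "Kitty"],
    "kind": ["dog", "cat", "dog", "cat"],
    "weight": [30.0, 4.0, 10.0, 5.0],
})

# Number of rows in each group.
counts = pets.groupby("kind").size()

# Mean of one column within each group.
mean_weight = pets.groupby("kind")["weight"].mean()

# Index label of the largest weight in each group, then look up those rows.
heaviest = pets.loc[pets.groupby("kind")["weight"].idxmax()]

print(counts)
print(mean_weight)
print(heaviest)
```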

## Merging DataFrames

By combining information from the DataFrames above we can try to answer more complex queries. For example, the number of legs for a given character isn't stored in the `df` DataFrame, but it is available in the `df_species` DataFrame if we look up the corresponding entry of the character's species. If we could combine the two into a single DataFrame, we could create queries using character-specific **and** species-specific data at the same time.

Combining tables based on shared keys is called merging. Pandas allows you to do it with either the function `pd.merge` or as a DataFrame method `df.merge`. All you have to do is specify the two DataFrames that you want to merge (the first one is already implied if you use it as a method) and the column label(s) that you want to merge them on.
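The function and method forms can be sketched on two small made-up tables. The `orders` and `customers` DataFrames are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables sharing the key column "customer".
orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann"],
    "item": ["pen", "ink", "pad"],
})
customers = pd.DataFrame({
    "customer": ["ann", "bob"],
    "city": ["Oslo", "Rome"],
})

# Function form: both DataFrames are passed explicitly.
merged_fn = pd.merge(orders, customers, on="customer")

# Method form: the left DataFrame is the object the method is called on.
merged_method = orders.merge(customers, on="customer")

print(merged_fn)
print(merged_fn.equals(merged_method))
```

Each row of `orders` is extended with the matching `city` from `customers`, so a key that appears twice on the left yields two rows in the result.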

Merge `df` and `df_species` into a new DataFrame named `dfm` and answer the following questions:

* What is the average age of four-legged characters?

* What is the average weight of four-legged characters?

Although the question sounds nearly identical to the previous one, you may notice that something happened to your original character `weight` column. Since the column name appeared in both DataFrames, both columns had to be renamed during the merge to avoid a collision. You can control how they are renamed with the `suffixes` keyword argument. We suggest that you set the suffixes to `_indiv` and `_species` so you can distinguish `weight_indiv` (the character's weight) from `weight_species` (their species' typical weight).
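The renaming behaviour can be seen in isolation on two made-up tables that both contain a `value` column. The table names and the suffixes `_left` and `_right` are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical toy tables with a clashing non-key column "value".
left = pd.DataFrame({"key": ["a", "b"], "value": [1, 2]})
right = pd.DataFrame({"key": ["a", "b"], "value": [10, 20]})

# By default pandas appends "_x" and "_y" to the clashing column names.
default = left.merge(right, on="key")

# `suffixes` takes a pair: (suffix for the left table, suffix for the right).
named = left.merge(right, on="key", suffixes=("_left", "_right"))

print(default.columns.tolist())  # ['key', 'value_x', 'value_y']
print(named.columns.tolist())    # ['key', 'value_left', 'value_right']
```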

* Has any character gone missing during the merge?

Why is that? How could you ensure that they aren't thrown out? Hint: the `merge` function/method has a keyword argument `how`. You can either go back to the cell where you created `dfm` and update it, or you can overwrite `dfm` here.

* Compute the characters' relative weight as a ratio of the typical weight of their species, and add it as a new column `rel_weight`.

Of course, you won't be able to calculate that for characters whose species data wasn't available. That's okay: `pandas` propagates missing values (`NaN`) through arithmetic, and aggregations such as `mean` skip them by default.
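The following sketch shows, on made-up data, how the `how` argument affects unmatched rows and how missing values behave in later calculations. The tables `left` and `right` and the `ratio` column are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables: key "c" has no match in `right`.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"key": ["a", "b"], "y": [10.0, 20.0]})

# The default (how="inner") keeps only keys present in both tables.
inner = left.merge(right, on="key")

# how="left" keeps every row of the left table; missing data becomes NaN.
kept = left.merge(right, on="key", how="left")

# Arithmetic with NaN gives NaN for that row only.
kept["ratio"] = kept["x"] / kept["y"]

# Aggregations such as mean() skip NaN by default.
mean_ratio = kept["ratio"].mean()

print(len(inner), len(kept))  # 2 3
print(kept)
```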

Now let's answer some queries based on the characters' countries of origin:
* How many characters are there from each continent?
* Which character comes from the smallest country?

For this you have to merge `df` and `df_countries` (let's call the result `dfm2`). There's one little issue: the shared key is labelled differently in the two DataFrames. Thankfully the `merge` function/method allows you to specify them independently for the left and right DataFrame if necessary.
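A minimal sketch of merging on differently labelled keys, using the `left_on` and `right_on` arguments of `merge`. The tables `people` and `places` and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables: the shared key is called "home" on the left
# and "code" on the right.
people = pd.DataFrame({"name": ["ann", "bob"], "home": ["NO", "IT"]})
places = pd.DataFrame({"code": ["NO", "IT"], "capital": ["Oslo", "Rome"]})

# Name the key column separately for each side.
joined = people.merge(places, left_on="home", right_on="code")

print(joined.columns.tolist())  # ['name', 'home', 'code', 'capital']
print(joined)
```

Note that both key columns are kept in the result when their labels differ.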

And finally, a question that requires information from all three DataFrames:

* Who is the most overweight European character?

You can't directly merge three DataFrames with a single operation, but you can merge `dfm` with `df_countries` easily enough. Let's call the resulting "mega-merge" `df_all`.

### Group filtering

You can continue working with either `df` or `df_all`; it won't matter for the next task:

* List characters that are the sole entries of their country

To answer this question, you can use the `filter` method of `GroupBy` objects. It is similar to the aggregation methods like `mean`, `size` or `first` that we used before, but instead of returning a single row per group, it returns either all rows of the group or none of them.

It expects a simple function as an argument. The function's input will be a DataFrame containing all rows of a group, and it should return a boolean `True/False` value. If the returned value is `True`, the group will be kept, otherwise it will be discarded.
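As a sketch of the mechanism, the example below keeps only the groups whose total weight exceeds a threshold. The `pets` table and the threshold are hypothetical, and the criterion is deliberately different from the one in the task.

```python
import pandas as pd

# Hypothetical toy table.
pets = pd.DataFrame({
    "name": ["Rex", "Tom", "Fido", "Nemo"],
    "kind": ["dog", "cat", "dog", "fish"],
    "weight": [30.0, 4.0, 10.0, 0.1],
})

# The function receives each group as a DataFrame and returns True or False.
# All rows of a group are kept if it returns True, and dropped otherwise.
heavy_kinds = pets.groupby("kind").filter(lambda g: g["weight"].sum() > 10)

print(heavy_kinds["name"].tolist())  # ['Rex', 'Fido']
```

The result keeps the original rows and their index labels instead of collapsing each group to one row.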

Of course you can experiment with alternative solutions too!

### Grouping using derived keys

* Count characters by decade of birth -- without inserting a new column!

So far, we have always used a column label as `groupby`'s argument. If we wanted to group rows based on data that isn't explicitly contained in the DataFrame, we would have to insert a new column with the data, and group by that new column.

Thankfully, `groupby` also accepts a `Series` object as an argument, and groups the DataFrame according to the values of that Series, without having to insert that Series into the DataFrame.

Hint: you can round a number down to the nearest ten with the floor division operator (`//`): `x // 10 * 10`.
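Grouping by a derived `Series` can be sketched as follows. The `events` table and its `year` column are hypothetical stand-ins, so you will need to adapt the idea to the character data.

```python
import pandas as pd

# Hypothetical toy table.
events = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "year": [1961, 1969, 1975, 1983],
})

# A derived Series with the same index as `events`.
decade = events["year"] // 10 * 10

# Group by the Series directly; `events` itself is left unchanged.
per_decade = events.groupby(decade).size()

print(per_decade)
print(events.columns.tolist())  # ['title', 'year']
```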

## Data export

Export `df_all` into `characters_merged.tsv` in tab-separated format with question marks in place of N/A values, and omit the index column.
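The relevant `to_csv` arguments can be tried out without writing a file by exporting into an in-memory buffer. The `table` DataFrame, the `;` separator and the `-` placeholder below are hypothetical choices for illustration, so substitute the values that the task asks for.

```python
import io

import pandas as pd

# Hypothetical toy table with one missing value.
table = pd.DataFrame({"name": ["ann", "bob"], "score": [1.5, None]})

# Write into a text buffer instead of a file so the result can be inspected.
buffer = io.StringIO()
table.to_csv(
    buffer,
    sep=";",      # field separator
    na_rep="-",   # text written in place of missing values
    index=False,  # leave out the index column
)

lines = buffer.getvalue().splitlines()
print(lines)  # ['name;score', 'ann;1.5', 'bob;-']
```

Passing a file name instead of `buffer` writes the same content to disk.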