<img src="../figures/HeaDS_logo_large_withTitle.png" width="300">

<img src="../figures/tsunami_logo.PNG" width="600">

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Center-for-Health-Data-Science/PythonTsunami/blob/fall2021/Pandas/Pandas.ipynb)

# Pandas

*Prepared by [Rita Colaço](https://www.cpr.ku.dk/staff/?id=621366&vis=medarbejder) and [Henry Webel](https://twitter.com/Henrywebel)*

[Pandas cheat sheet](https://github.com/Center-for-Health-Data-Science/PythonTsunami/blob/fall2021/cheat_sheets/Pandas_Cheat_Sheet.pdf)

## Introduction

Pandas is built on top of `numpy` and can plot directly through its interface to `matplotlib`.

Pandas is well suited for tabular data with heterogeneously-typed columns, as in an Excel spreadsheet.

Pandas is a library for data analysis, and its central tool is the **`DataFrame`**.

## Two main classes (types/objects)

1. `pandas.Series`
2. `pandas.DataFrame`

- a `Series` is a one-dimensional `numpy.ndarray` with an `Index`.
- a column in a `DataFrame` is a `Series`.
- columns in a `DataFrame` share an `Index`
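
The three statements above can be checked on a small table. This is a minimal sketch; the `toy` DataFrame and its values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table, only for illustration
toy = pd.DataFrame(
    {"height": [1.62, 1.80, 1.75], "age": [22, 55, 25]},
    index=["nick", "jenn", "joe"],
)

height = toy["height"]                  # selecting one column returns a Series
print(type(height).__name__)            # Series
print(type(height.to_numpy()))          # the underlying data is a numpy.ndarray
print(height.index.equals(toy.index))   # the column carries the DataFrame's Index
```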

In [None]:
import pandas as pd
import numpy as np

## Creating an instance of a `pandas.Series`

There are many ways. But a Series is an object holding some data.

Let's create a Series from a built-in `range` and a `numpy.arange` object:

In [None]:
series_range = pd.Series(range(3, 13))
series_range

In [None]:
series_np_arange = pd.Series(np.arange(3, 13))
series_np_arange

In [None]:
print("Series from range, underlying data:\t", series_range.values)
print("Series from np.arange underlying data:\t", series_np_arange.values)

In [None]:
series_range.values

In [None]:
series_np_arange.values

## What is a DataFrame?

A DataFrame is basically a **table** of data (a tabular data structure) with labeled rows and columns. The rows are labeled by a special data structure called an Index, which permits fast look-up and powerful relational operations.
For example:

|  | Name | Age | Height | LikesIceCream |
| :---: | :--: | :--: | :--: | :--: |
| 0     | "Nick" | 22 | 3.4 | True |
| 1     | "Jenn" | 55 | 1.2 | True |
| 2     | "Joe"  | 25 | 2.2 | True |

## Create a DataFrame directly

### From a `list` of `list`s

In [None]:
data = [
    [2.23, 1, "test"],
    [3.45, 2, "train"],
    [4.5, 3, "test"],
    [6.0, 4, "train"]
]

df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df

### From a `list` of `dict`s

In [None]:
data = [
    {'A': 2.23, 'B': 1, 'C': "test"},
    {'A': 3.45, 'B': 2, 'C': "train"},
    {'A': 4.5, 'B': 3, 'C': "test"},
    {'A': 6.0, 'B': 4, 'C': "train"}
]

df = pd.DataFrame(data)
df

### From a `dict` of `list`s

In [None]:
df = pd.DataFrame({
    'A': [2.23, 3.45, 4.5, 6.0],
    'B': [1, 2, 3, 4],
    'C': ["test", "train", "test", "train"]
})

df

### From a `dict` of `dict`s

In [None]:
df = pd.DataFrame.from_dict(
    {
        'row1': {'A': 2.23, 'B': 1, 'C': "test"},
        'row3': {'A': 3.45, 'B': 2, 'C': "train"},
        'row2': {'A': 4.5, 'B': 3, 'C': "test"},
        'row4': {'A': 6.0, 'B': 4, 'C': "train"}
    },
    orient='index'  # default is columns. pd.DataFrame also works, but you have to transpose the data
)
df
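
The comment in the cell above mentions that `pd.DataFrame` plus a transpose also works. A small sketch of that alternative, using a shortened, hypothetical `rows` dictionary, suggests one practical difference: the transposed version tends to end up with generic `object` columns, because each intermediate column mixed numbers and strings.

```python
import pandas as pd

# Shortened, hypothetical version of the nested dictionary used above
rows = {
    "row1": {"A": 2.23, "B": 1, "C": "test"},
    "row2": {"A": 3.45, "B": 2, "C": "train"},
}

by_index = pd.DataFrame.from_dict(rows, orient="index")
transposed = pd.DataFrame(rows).T   # outer keys become columns first, then transpose

print(by_index.dtypes)     # numeric columns keep a numeric dtype
print(transposed.dtypes)   # every column is object after the transpose
```

If you go the transpose route, you may need to convert column types afterwards, for example with `transposed.astype({"A": float, "B": int})`.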

### From an empty `DataFrame`

In [None]:
df = pd.DataFrame()
df['A'] = [2.23, 3.45, 4.5, 6.0]
df['B'] = [1, 2, 3, 4]
df['C'] = ["test", "train", "test", "train"]

In [None]:
df

### Exercise 1
Please recreate the table below as a Dataframe using one of the approaches detailed above:

|  | Year | Product | Cost |
| ---| :--: | :----:  | :--: |
| 0  | 2015 | Apples  | 0.35 |
| 1  | 2016 | Apples  | 0.45 |
| 2  | 2015 | Bananas | 0.75 |
| 3  | 2016 | Bananas | 1.10 |

Which approach did you prefer? Why?

## Loading data into `DataFrame`s from a file

Pandas has functions that can make DataFrames from a wide variety of file types. To do this, use one of the functions in Pandas that start with `read_`. Here is a non-exhaustive list of examples:

| File Type | Function Name |
| :----:    |  :---:  |
| Excel | `pd.read_excel` |
| CSV, TSV | `pd.read_csv` |
| H5, HDF, HDF5 | `pd.read_hdf` |
| JSON  | `pd.read_json` |
| SQL | `pd.read_sql_table` |

> These are all functions, which can be called, i.e. `pd.read_csv()`

### Loading the Data

The file can be local or **hosted**. The `read_*` functions have many options and are very general (in the sense of broad or comprehensive) functions.

In [None]:
url_ecdc_daily_cases = "https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_eueea_daily_ei/csv/data.csv"
df = pd.read_csv(url_ecdc_daily_cases)
df
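
To give an impression of the options mentioned above, here is a small sketch using three common `pd.read_csv` parameters (`sep`, `parse_dates`, `usecols`). The in-memory text `raw` is a made-up stand-in for a file path or URL, so the example runs without network access.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content, standing in for a file or URL
raw = io.StringIO(
    "date;country;cases\n"
    "2021-01-01;Denmark;10\n"
    "2021-01-02;Denmark;12\n"
)

small = pd.read_csv(
    raw,
    sep=";",                     # field separator (the default is ",")
    parse_dates=["date"],        # convert this column to a datetime type
    usecols=["date", "cases"],   # load only a subset of the columns
)
print(small.dtypes)
```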

## Examining the Dataset

Sometimes, we might just want to quickly inspect the DataFrame:

### Attributes
```python
df.shape    # Shape of the object (2D)
df.dtypes   # Data types in each column
df.index    # Index range
df.columns  # Column names
```

### Methods

```python
df.describe()   # Descriptive statistics of columns
df.info()       # DataFrame information

```




### Shape

The first dimension is the number of rows (the `len`gth of the `DataFrame`), the second dimension is the number of features or columns. The direction going down the rows is `axis=0` or `axis='index'`, and going over the columns is `axis=1` or `axis='columns'`.

axis | descriptions
---  | ---
0    | index
1    | columns
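
A small sketch of what the two axes mean for a reduction such as `sum`; the `toy` table is hypothetical. A useful way to read the argument: `axis` names the dimension that is collapsed.

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_totals = toy.sum(axis=0)   # collapse the rows: one value per column
row_totals = toy.sum(axis=1)   # collapse the columns: one value per row

print(col_totals)
print(row_totals)
```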

In [None]:
df.shape  # (number of rows, number of columns)

### Data types

In [None]:
df.dtypes

### Index and Columns

In [None]:
df.index

In [None]:
df.columns

You can set the index using `set_index`

In [None]:
df.set_index('dateRep') 
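
Note that `set_index` returns a new DataFrame and leaves the original untouched unless you assign the result. A minimal sketch with a made-up `toy` table:

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({"day": ["mon", "tue"], "cases": [5, 7]})

indexed = toy.set_index("day")       # returns a new DataFrame
print(toy.index)                     # the original still has its default index
print(indexed.loc["tue", "cases"])   # rows can now be looked up by day
```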

### Info and describe

In [None]:
_ = df.info()  # returns None, only prints

In [None]:
df.describe()  # returns a new DataFrame

## Selecting Data

Pandas has a lot of flexibility in the number of syntaxes it supports.  For example, to select columns in a DataFrame:

```python
df['Column1']
df.Column1  # attribute access: only works if the name is a valid Python identifier (e.g. no whitespace)
```

Multiple Columns can also be selected by providing a list:

```python
df[['Column1', 'Column2']]
```

Rows are selected with the **iloc** and **loc** attributes:

```python
df.iloc[5]  # Select a row by its integer position.
df.loc['Row6']  # Select a row by its index label (if rows are named).
```
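
The difference between position-based and label-based selection is easiest to see on a table with named rows. The following is a sketch on a hypothetical `toy` DataFrame:

```python
import pandas as pd

# Hypothetical toy table with named rows
toy = pd.DataFrame(
    {"cases": [5, 7, 9]},
    index=["row_a", "row_b", "row_c"],
)

first_by_position = toy.iloc[0]          # first row, whatever its label is
first_by_label = toy.loc["row_a"]        # the row labeled "row_a"
subset = toy.iloc[[0, 2]]                # a list of positions returns a DataFrame
label_slice = toy.loc["row_a":"row_b"]   # label slices include the end label
```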

However, with large DataFrames, we often just want to see the first or last rows, or even just a sample of the rows.

| Method | Description |
| ---  | --- |
| `df.head(5)` | the first 5 rows |
| `df.tail(5)` | the last 5 rows |
| `df.sample(5)` | a random 5 rows |


### Exercise 2

Back to the Covid19 data

Display the first 5 lines of the dataset.

Show the last 15 lines

Check 10 random lines of the dataset

Make a new dataframe containing just the date, population data and cases per capita

Make a new dataframe containing just the 10th, 15th and 16th lines of the dataset

## Query/Filtering Data

To get rows based on their value, Pandas supports both Numpy's logical indexing:

```python
select_rows = df[df['Column1'] > 0]
```

and an SQL-like query string:
    
```python
df.query('Column1 > 0')
```

One can also filter based on multiple conditions, using the element-wise ("bit-wise") logical operators **`&`** for the intersection, or **`|`** for the union of the conditions.

```python
select_rows = df[(df['Column1'] > 0) & (df['Column2'] > 2)]
```

```python
select_rows = df[(df['Column1'] > 0) | (df['Column2'] > 2)]
```

Consider first creating the mask (a `Series` of `True` and `False` values indicating whether a row is selected):

```python
mask = (df['Column1'] > 0) | (df['Column2'] > 2)
select_rows = df[mask]
```

Check out the methods [`pandas.Series.isin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html) and [`pandas.Series.between`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.between.html?highlight=between#pandas.Series.between).
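
Both methods return a boolean mask that can be used for filtering. A short sketch on a made-up `toy` table:

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({
    "country": ["Denmark", "Norway", "Sweden", "Denmark"],
    "cases": [5, 12, 30, 8],
})

# Rows whose country is one of the listed values
nordic_pair = toy[toy["country"].isin(["Denmark", "Norway"])]

# Rows with 8 <= cases <= 12 (both bounds are inclusive by default)
mid_range = toy[toy["cases"].between(8, 12)]
```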

In [None]:
df[df['countriesAndTerritories'] == 'Denmark']  # the result is displayed, but not stored in a variable

### Exercise 3

What is the average number of daily deaths for Norway?

## Summarizing/Statistics in DataFrames

Pandas' Series and DataFrames are iterables, and can be given to any function that expects a list or NumPy array, which makes them usable with many different libraries' functions. For example, to compute basic statistics for a column (`Series`):

```python
df.describe() # describe numeric columns
df['Column1'].describe() # describe a particular column/series
df['Column1'].count()
df['Column1'].nunique()
df['Column1'].value_counts()

df['Column1'].max()
df['Column1'].mean()
df['Column2'][df['Column1'] == 'string'].sum()
```

or for a row:

```python
df.loc['row_index_label'].sum() # count, std, mean, etc
```

or for all columns

```python
df.mean() # default by column (= over all index)
```

or for all rows

```python
df.mean(axis=1)
df.mean(axis='columns') # columns axis is axis 1
```

> The default axis varies between methods (or operations).

Example documentation: [`value_counts`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html)

In [None]:
df.mean(numeric_only=True)  # restrict to numeric columns (recent pandas versions raise an error otherwise)

You can also use the [`pipe`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pipe.html?highlight=pipe) method to call a function on a whole `Series` or `DataFrame`:

```python
df['Column1'].pipe(np.mean)
```

method | description
--- | ---
[`DataFrame.apply`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply) | Apply a function along input axis of DataFrame.
[`DataFrame.applymap`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html#pandas.DataFrame.applymap) | Apply a function elementwise on a whole DataFrame (renamed to `DataFrame.map` in pandas 2.1).
[`Series.map`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html#pandas.Series.map) | Apply a mapping correspondence on a Series.
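
A minimal sketch of `DataFrame.apply` and `Series.map` on a hypothetical `toy` table; the lambda functions and the mapping dictionary are chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# apply: the function receives one column at a time (default axis=0) ...
col_range = toy.apply(lambda col: col.max() - col.min())
# ... or one row at a time with axis=1
row_range = toy.apply(lambda row: row.max() - row.min(), axis=1)

# map: translate each value of a Series through a dictionary (or a function)
labels = toy["a"].map({1: "low", 2: "mid", 3: "high"})
```

For simple arithmetic like this, vectorized expressions are usually faster than `apply`; `apply` is mostly useful when no vectorized equivalent exists.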

### Exercise 4
Which country has the maximum number of deaths reported on one day?

How many countries does Europe have in this data set?

How many unique days are in this data set?

## Transforming/Modifying Data

Any transformation function can be performed on each element of a column or on the entire DataFrame. For example:


```python
df['Column1'] * 5

np.sqrt(df['Column1'])

df['Column1'].str.upper()

del df['B']

df['Column1'] = [3, 9, 27, 81]  # Replace the entire column with other values (length must match)
```

> Column: a `pandas.Series` with the index of the rows

In [None]:
col_population = 'popData2020'
df['cases_per_capita'] = df['cases'] / df[col_population] * 100_000
mask_denmark = df['countriesAndTerritories'] == 'Denmark'
df[mask_denmark]

For more complicated operations, where you want to combine `DataFrame`s with `Series`, you can have a look at [how broadcasting works](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#flexible-binary-operations) in pandas.
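
As a small sketch of that broadcasting behavior: when a `Series` is combined with a `DataFrame` using an operator, pandas matches the `Series` index against the column labels by default. To match against the row index instead, the method form (here `sub`) with an explicit `axis` can be used. The `toy` table is hypothetical.

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# toy.mean() is a Series indexed by column name, so it aligns on the columns
centered_cols = toy - toy.mean()

# Row means are indexed like the rows, so alignment must be requested explicitly
row_means = toy.mean(axis=1)
centered_rows = toy.sub(row_means, axis="index")
```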

### Exercise 5

Which country has most daily deaths per capita?

What is the median daily infection rate in Europe? 

Make one of the country code columns lower-cased.

Make a column called "survived", as the opposite of the deaths column.

## GroupBy Operations

In most of our tasks, getting single metrics from a dataset is not enough, and we often actually want to compare metrics between groups or conditions.

The [**`groupby`**](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)
method essentially splits the data into different groups depending on a variable of your choice, and allows you to apply summary functions on each group. For example, if you wanted to calculate the mean temperature by month from a given data frame:

```python
df.groupby('month').temperature.mean()
```
where "month" and "temperature" are column names from the [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html).
 
You can also group by multiple columns, by providing a list of column names:
 
```python
df.groupby(['year', 'month']).temperature.mean()
```

The `groupby` method returns a GroupBy object, where the `.groups` attribute is a dictionary whose keys are the computed unique groups.


[`GroupBy`](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) objects are **lazy**, meaning they don't start calculating anything until they know the full pipeline.  This approach is called the **"Split-Apply-Combine"** workflow.  You can get more info on it [in the UserGuide](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html).
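
The split and apply steps can be looked at separately. The following sketch uses a made-up `weather` table with the `month` and `temperature` columns from the example above.

```python
import pandas as pd

# Hypothetical data matching the month/temperature example
weather = pd.DataFrame({
    "month": ["jan", "jan", "feb", "feb"],
    "temperature": [1.0, 3.0, 4.0, 8.0],
})

grouped = weather.groupby("month")   # split: nothing is computed yet
print(grouped.groups)                # group label -> row labels in that group

monthly_mean = grouped["temperature"].mean()   # apply and combine
print(monthly_mean)
```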

In [None]:
df.groupby(by='countriesAndTerritories').cases.sum().loc['Denmark']

In [None]:
df.loc[mask_denmark, 'cases'].sum()

### Exercise 6

What proportion of days were without deaths?

How many days were there without deaths in each country?

What was the median number of daily cases per country?

What was the infection rate for each country for the whole period?

How many were infected daily on average for each country in each month?

> `MultiIndex` ([guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)):
> You can access indices with more than one entry using tuples: `df.loc[(first_index, second_index)]`

### Multiple Statistics per Group

Another piece of syntax we are going to look at is the `aggregate` method of pandas `GroupBy` objects, also available as `agg`.

The aggregation functionality provided by this method allows multiple statistics to be calculated per group in one calculation.

The instructions to the method `agg` are provided in the form of a dictionary, where the keys specify the columns upon which to apply the operation(s), and the values specify the function(s) to run:

```python
df.groupby(['year', 'month']).agg({'duration':'sum',
                                   'network_type':'count',
                                   'date':'first'})
```

You can also apply multiple functions to one column in groups:

```python
df.groupby(['year', 'month']).agg({'duration': ['min', 'max', 'sum'],  # function names as strings
                                   'network_type': 'count',
                                   'date': ['min', 'first', 'nunique']})
```

> Grouping by multiple columns will create a [`MultiIndex`](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html) row index.  
> You can access indices with more than one entry using tuples: `df.loc[(first_index, second_index)]`
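
A short sketch of tuple-based access on a `MultiIndex`, using a made-up `weather` table grouped by two columns:

```python
import pandas as pd

# Hypothetical data with two grouping columns
weather = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021],
    "month": [1, 2, 1, 1],
    "temperature": [1.0, 3.0, 4.0, 8.0],
})

# Grouping by two columns gives a result indexed by (year, month)
mean_temp = weather.groupby(["year", "month"])["temperature"].mean()

one_group = mean_temp.loc[(2021, 1)]   # a full tuple selects a single entry
one_year = mean_temp.loc[2020]         # the first level alone selects all its months
```

If the `MultiIndex` gets in the way, `mean_temp.reset_index()` turns the index levels back into ordinary columns.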

In [None]:
df.groupby(by='countriesAndTerritories').agg(
    {'cases': 'sum', 'deaths': 'sum'})

### Exercise 7

How many were infected and how many survived daily on average all over Europe?

How many were infected and how many survived daily on average for each country in each month?

## Handling Missing Values

Missing values are often a concern in data science, for example in proteomics, and can be indicated with a **`None`** or **`NaN`** (`np.nan` in NumPy). Pandas DataFrames have several methods for detecting, removing and replacing these values:

| method | description |
| ---:  | :---- |
| **`isna()`** | Returns True for each NaN |
| **`notna()`** | Returns False for each NaN |
| **`dropna()`** | Returns just the rows without any NaNs |
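
A minimal sketch of the three methods on a hypothetical `toy` table that contains both `NaN` and `None`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one missing value per column
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

missing_per_column = toy.isna().sum()   # True counts as 1, so this counts NaNs
present_mask = toy.notna()              # the element-wise opposite of isna()
complete_rows = toy.dropna()            # keep only rows without any missing value
```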

## Imputation

Imputation means replacing the missing values with real values. 

| method | description |
| ----: |  :---- |
| **`fillna()`** | Replaces the NaNs with values (provides lots of options) |
| **`ffill()`** | Replaces the NaNs with the previous non-NaN value (equivalent to `df.fillna(method='ffill')`, a form that is deprecated in recent pandas versions) |
| **`bfill()`** | Replaces the NaNs with the following non-NaN value (equivalent to `df.fillna(method='bfill')`, a form that is deprecated in recent pandas versions) |
| **`interpolate()`** | Interpolates NaNs from the previous and following values |
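
To compare the four methods side by side, here is a sketch on a short, made-up `Series` with a gap of two missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap in the middle
s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled_zero = s.fillna(0)       # constant replacement
filled_fwd = s.ffill()          # carry the last seen value forward
filled_bwd = s.bfill()          # pull the next seen value backward
filled_lin = s.interpolate()    # linear interpolation between neighbors (default)

print(pd.DataFrame({
    "fillna(0)": filled_zero,
    "ffill": filled_fwd,
    "bfill": filled_bwd,
    "interpolate": filled_lin,
}))
```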


### Exercise 8

Here we will use the Titanic data, which contains some missing values.

In [None]:
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
titanic = pd.read_csv(url)
titanic

What proportion of the "deck" column is missing data?

How many rows don't contain any missing data at all?

Make a dataframe with only the rows containing no missing data.

### Exercise 9

Using the following DataFrame, solve the exercises below.

> Recreate it in every exercise or copy it to avoid confusion: `data_type_filled = data.copy()`  
> Can you explain the problem?

In [None]:
data = pd.DataFrame({
    'time': [0.5, 1., 1.5, None, 2.5, 3., 3.5, None],
    'value': [6, 4, 5, 8, None, 10, 11, None]
})
data

Replace all the missing "value" rows with zeros.

Replace the missing "time" rows with the previous value.

Replace all of the missing values with the data from the next row. What do you notice when you do this with this dataset?

Linearly interpolate the missing data. What is the result for this dataset?

---------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------

# Optional: Redo-exercises from numpy 

> What is different?

In all of these exercises, do not loop over `Series` you create. All exercises can be solved using only vectorized operations on a `Series` or `DataFrame`.

## Simple vectorized operations

You want to plot the mathematical function

$f(x) = \log(-1.3x^2 + 1.4^x + 7x + 50)$

for the numbers in $[0, 20]$. To do this, you need to create a vector `xs` with lots of numbers between 0 and 20, and a vector `ys` with $f$ evaluated at every element of `xs`. A vector is a `1d-ndarray`.

To get a hang of vectorized operations, solve the problem *without using any loops*:

### Create a `pandas.Series` `xs` with 1000 evenly spaced points between 0 and 20

### Create a Python function $f$ as seen above

### Evaluate `ys` = $f(x)$, i.e. $f$ of every element of `xs`.

### What is the mean and standard deviation of `ys`?

### How many elements of `ys` are below 0? Between 1 and 2, both exclusive?

> Hint: You can use a comparison operator to get an array of dtype `bool`. To get the number of elements that are `True`, you can exploit the fact that `True` behaves like the number 1, and `False` like the number 0.

### What is the minimum and maximum value of `ys`?

### Create a series `non_negatives`, which contains all the values of `ys` that are nonnegative

### *Extra*: Use `matplotlib` to plot `xs` vs `ys` directly from your `Series` object

## Species depth matrix

Load in the data [`depths.csv`](https://drive.google.com/file/d/1d5694Ggnc-wq-ta0njlA9cz0_AEVQLoN/view). As you can see in the Drive preview, there are 11 columns, with columns 2-11 each representing a sample from a human gut microbiome. Each row represents a genome of a micro-organism, a so-called "operational taxonomic unit at 97% sequence identity" (OTU_97). The first column gives the name of the genome. The values in the matrix represent the relative abundance (or depth) of that micro-organism in that sample, i.e. how much of the micro-organism there is.

### Load in the matrix in a `pandas.DataFrame`

In [None]:
url = 'https://raw.githubusercontent.com/pythontsunami/teaching/fall2021/data/depths.csv'
depths = pd.read_csv(url)
depths

### How many OTUs are there? Show how you figured it out.

### Find the OTU "OTU_97.41189.0". What is the mean and standard deviation of the depths across the 10 samples of this OTU?

### How many samples have 0 depth of that OTU? (or rather, below detection limit?)

### What is the mean and standard deviation if you exclude those samples?

### Extra: How would you get all the means and std. deviations in one go?

### We are not interested in OTUs present in fewer than 4 samples. Remove all those OTUs.

### How many OTUs did you remove?

### How many OTUs have a depth of > 5 in all 10 samples? (hint: `np.all`)

### Filtering and Normalization

After discarding all OTUs present in fewer than 4 samples, do the following:

- Calculate the mean depth across samples for each remaining OTU.
- Normalize the remaining OTUs such that each row has a mean of 0 and a standard deviation of 1 (so-called z-score normalization).
- Print the remaining OTUs to a new file in descending order by their mean depth, with a 12th column giving the mean depth, and columns 1-11 being the normalized depth. Make sure that your file looks like the input file (except with the 12th column).

> Do the results match with what you computed before using `numpy`?