# Pandas `DataFrame`s

## 1 `DataFrame`s: Basics and Creation

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### 1.1 Pandas `DataFrame` Objects

The `pd.DataFrame` class provides a data structure to handle 2-dimensional tabular data. `DataFrame`  objects are *size-mutable* and can contain mixed datatypes (e.g. `float`, `int` or `str`). All data columns inside a `DataFrame` share the same `index`.

#### 1.1.1 Creating `DataFrame`s

In [None]:
name = ["person 1", "person 2", "person 3"]
age = [23, 27, 34]

Create the `DataFrame` from an iterable of rows (here produced by `zip`) and pass the column names:

In [None]:
df = pd.DataFrame(data=zip(name, age), columns=["Name", "Age"])
df

Create with `dict`. Keys will be column names:

In [None]:
df = pd.DataFrame(data={"Name": name, "Age": age})
df

Create from ndarray (then all columns of same type):

In [None]:
df = pd.DataFrame(data=np.random.random((4, 3)), columns=list("abc"))
df

Let's create two dicts:

In [None]:
math_grades_dict = {
    'student1': 15,
    'student2': 11,
    'student3': 9,
    'student4': 13,
    'student5': 12,
    'student6': 7,
    'student7': 14,
}
chemistry_grades_dict = {
    'student1': 10,
    'student2': 14,
    'student3': 12,
    'student4': 8,
    'student5': 11,
    'student6': 10,
    'student7': 12,
    "student8": 5,  # <-- note the additional entry here
}

Convert them to `Series` objects:

In [None]:
series_math = pd.Series(math_grades_dict)
series_chemistry = pd.Series(chemistry_grades_dict)
print(series_chemistry.index)

Now use the two `Series` objects to create a `DataFrame`:

In [None]:
df = pd.DataFrame(
    data={'math grades': series_math, 'chemistry grades': series_chemistry}
)
df
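
To make the index alignment explicit, here is a minimal sketch with two short, made-up `Series` (the names `s_math`, `s_chem` and `df_demo` are illustrative only). The resulting index is the union of both indices, the missing entry becomes `NaN`, and the affected column is upcast to `float64`.

```python
import pandas as pd

# Two Series whose indices only partially overlap (made-up values).
s_math = pd.Series({"student1": 15, "student2": 11})
s_chem = pd.Series({"student1": 10, "student2": 14, "student3": 5})

df_demo = pd.DataFrame({"math grades": s_math, "chemistry grades": s_chem})

print(df_demo)         # "student3" has no math grade, so that entry is NaN
print(df_demo.dtypes)  # the column holding NaN is float64
```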

#### 1.1.2 What characterizes a `DataFrame` object?

In [None]:
# Attribute giving us the `shape` of the `DataFrame`. Similar to `np.array`.
df.shape

In [None]:
# A method providing information on the `DataFrame` and the data contained inside.
df.info()

In [None]:
# Get some statistics.
df.describe()

`DataFrame`s are essentially composed of three components. These components can be accessed through dedicated attributes.

- Index (`df.index`)
- Columns (`df.columns`)
- Body (`df.values`)

The index and the body behave like those of a `Series`.

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.values.nbytes, df.index.nbytes

In [None]:
df.dtypes

After loading data into a `DataFrame`, it is a good idea to call `head()` (or `tail()`) first.
This is useful for checking whether all data columns were loaded successfully.
It returns the first (last) 5 rows of the `DataFrame`.

In [None]:
df.head()

# Compare the `tail()` method
# df.tail()
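
As a small sketch on a made-up frame (the name `df_demo` is illustrative), both methods select rows, and both accept the number of rows as an optional argument:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 8 rows and 2 columns.
df_demo = pd.DataFrame({"a": np.arange(8), "b": np.arange(8) * 10})

first = df_demo.head()   # default n=5: the first five rows
last = df_demo.tail(3)   # the last three rows

print(first.shape, last.shape)
print(list(last.index))
```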

#### 1.1.3 Data Indexing and Selection

Load the IRIS dataset from a local file (the commented-out call downloads it from the UCI repository instead):

In [None]:
#df = pd.read_csv(
#    'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
#    names=["sepal length", "sepal width", "petal length", "petal width", "Name"],
#)

df = pd.read_csv('iris.data', names=["sepal length", "sepal width", "petal length", "petal width", "Name"])

Quick check if data looks alright (petals are the inner flower leaves, sepals the outer leaves of the calyx):

In [None]:
df.head()
#df.info()

Each column is a `Series` object and can be accessed by its label, just like a value in a Python dictionary:

In [None]:
df.columns

In [None]:
df['Name']

In [None]:
type(df["Name"])

Accessing multiple columns at once:

In [None]:
df_sepal = df[["sepal length", 'sepal width']]
df_sepal.head()

Select rows with boolean mask. In this case all columns will be returned:

In [None]:
df[ df["Name"] == "Iris-setosa" ].head()

In [None]:
df["Name"] == "Iris-setosa"  # the boolean expression inside the [] is a `Series` object

Let's try to visualize the data. First specify color encoding of the different flower species:

In [None]:
name_to_color = {
    'Iris-setosa': "lightblue",
    'Iris-versicolor': "darkred",
    'Iris-virginica': "orange",
}

And make a plot:

In [None]:
def structured_multigrid_plot(df, axes):
    from itertools import permutations

    features = df.columns[:-1].values.tolist()
    feature_perms = tuple(permutations(features, r=2))
    feature_perms_index = tuple(permutations(range(len(features)), r=2))

    # Along the diagonal we plot the histogram
    for idx, f in enumerate(features):
        for name in df["Name"].unique():
            df[df["Name"] == name][f].plot.hist(
                ax=axes[idx, idx],
                label=name,
                color=name_to_color.get(name),
                alpha=0.5,
                bins=10,
            )
        axes[idx, idx].set_xlabel("")
        axes[idx, idx].set_ylabel("")
        axes[idx, idx].legend()

    # Scatter plot showing correlations between feature pairs.
    for perm, (row, col) in zip(feature_perms, feature_perms_index):
        colx, coly = perm
        for name in df["Name"].unique():
            df[df["Name"] == name].plot.scatter(
                x=colx,
                y=coly,
                ax=axes[col, row],  # use transpose to have same x-scale in each column
                xlabel="",
                ylabel="",
                c=name_to_color.get(name),
            )

    for idx, f in enumerate(features):
        label = f + " / cm"
        axes[idx, 0].set_ylabel(label)
        axes[-1, idx].set_xlabel(label)

In [None]:
fig, axes = plt.subplots(
    4, 4, figsize=(20, 15)
)  # petal: inner flower leaf; sepal: outer leaf of the calyx
structured_multigrid_plot(df, axes)

Get row with specific *index value*. Remember that index values must be contained in `df.index`.

In [None]:
df.loc[0]

Get rows with slicing:

In [None]:
df.loc[0::4].head()

Get rows with fancy indexing:

In [None]:
df.loc[[0, 4, 8, 12, 16]]

We can also access whole columns with the `.loc` indexer:

In [None]:
df.loc[ :, ['petal length', 'petal width']].head()
# Column labels do not work with `iloc`, which expects integer positions.

In [None]:
# This is the same as
df[["petal length", "petal width"]].head()
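
For comparison, a minimal sketch of the same selection by position. The frame `df_demo` is made up for illustration; `Index.get_indexer` translates labels into the integer positions that `iloc` expects.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"sepal length": [5.1, 4.9], "petal length": [1.4, 1.3], "petal width": [0.2, 0.2]}
)

# Label-based selection.
by_label = df_demo.loc[:, ["petal length", "petal width"]]

# Position-based selection: look up the integer positions of the labels first.
positions = df_demo.columns.get_indexer(["petal length", "petal width"])
by_position = df_demo.iloc[:, positions]

print(by_label.equals(by_position))
```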

Boolean masks can also be used with `.loc`:

In [None]:
df.loc[df["Name"] == "Iris-setosa"].head()
# This is the same as
# df[df["Name"] == "Iris-setosa"].head()

There are situations where using `[]` and `loc[]` actually are semantically different.

Suppose we want to alter some values inside a `DataFrame`. We want to limit the changes to certain rows and columns.

In [None]:
# This is a working copy.
df_tmp = df.copy()
df_tmp.head()

The following is an example of "chained indexing".
This calls `df_tmp.__getitem__(slice(1, 3)).__setitem__("sepal length", -1000.0)`, i.e. two separate calls.

This will yield a warning. We have *no guarantee* that the first call returns a view, so the assignment may act on a temporary copy and be lost. Whether we obtain a copy or a view cannot easily be predicted: it depends on the memory layout of the data inside the `DataFrame`, and pandas makes no guarantees about that layout. With Copy-on-Write enabled (the default from pandas 3.0 on), chained assignment never modifies the original object.

In [None]:
df_tmp[1:3]["sepal length"] = -1000.0
df_tmp.head()

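This is a better way of handling the assignment. It calls `df_tmp.loc.__setitem__((slice(1, 3), "sepal length"), -1000.0)` and thus performs the assignment in a single step (just *one* function call instead of two). Note that `.loc` slices by label and includes the end label, so rows `1`, `2` and `3` are modified, whereas the positional slice `[1:3]` above selects only rows `1` and `2`.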

In [None]:
df_tmp.loc[1:3, "sepal length"] = -1000.0
df_tmp.head()
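
The difference between the positional slice and the label-based `.loc` slice can be checked on a tiny made-up frame (`df_demo` is an illustrative name, not part of the notebook's data):

```python
import pandas as pd

df_demo = pd.DataFrame({"value": [0.0, 1.0, 2.0, 3.0, 4.0]})

# Positional slice: rows at positions 1 and 2 (end excluded).
n_positional = len(df_demo[1:3])

# Label-based slice: rows with labels 1, 2 and 3 (end included).
df_demo.loc[1:3, "value"] = -1000.0
n_changed = int((df_demo["value"] == -1000.0).sum())

print(n_positional, n_changed)
```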

For more details on the difference between using `[]` and the `loc` method see [this link](https://stackoverflow.com/questions/48409128/what-is-the-difference-between-using-loc-and-using-just-square-brackets-to-filte#52919794).

We can also use combined boolean masks:

In [None]:
boolean_mask = (df["sepal length"] > 6.0) & (df["petal length"] > 1.0)
df.loc[boolean_mask].head()
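
Masks can be combined with `&` (and), `|` (or) and `~` (not). The parentheses around each comparison are required because these operators bind more tightly than `>` or `==`. A short sketch on made-up values:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"sepal length": [5.0, 6.5, 7.1, 5.9], "petal length": [1.4, 4.6, 5.9, 0.9]}
)

long_sepal = df_demo["sepal length"] > 6.0
long_petal = df_demo["petal length"] > 1.0

both = df_demo.loc[long_sepal & long_petal]    # logical AND
either = df_demo.loc[long_sepal | long_petal]  # logical OR
short_sepal = df_demo.loc[~long_sepal]         # logical NOT

print(list(both.index), list(either.index), list(short_sepal.index))
```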

#### 1.1.4  A word on views

Let's generate a DataFrame for some experiments:

In [None]:
df = pd.DataFrame(np.random.randint(0, 20, (4, 2)), columns=list("AB"))
df

Get a view on a slice of data:

In [None]:
df_slice = df.loc[1:3, :]
df_slice

Change a value:

In [None]:
df.loc[1, "A"] = -1000
df

Is this change visible from the slice?

In [None]:
df_slice

Make another change. But this time we change the dtype of the value:

In [None]:
df.loc[1, "A"] = -999.99
df  # Note how *all* values in column "A" are now `float`s.

In [None]:
df.info()

What about the slice? Can we see this change as well?

In [None]:
df_slice

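The column `"A"` of the original `DataFrame` is a pandas `Series` object with `dtype = int64`. When we replace a value from this `Series` with a value which is also of type `int64` (or another integer type), the change will also be visible from the view.

If we try to place a `float` value inside this `Series`, the value is not converted into an `int64`; instead, a new `float` array is generated. The `float` array contains all of the original values as `float`s and the new value. The `float` array replaces the array inside the `Series` with column index `"A"`, so the slice no longer shares its data with that column.

This describes pandas without Copy-on-Write. With Copy-on-Write enabled (the default from pandas 3.0 on), `df_slice` behaves like an independent copy and sees neither change. Recent versions (pandas 2.1 and later) also warn about the implicit upcast from `int64` to `float64`.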
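
Since the view-or-copy behaviour depends on dtypes and on the pandas version, an explicit `.copy()` is the safe choice whenever independent data is needed. A minimal sketch (`df_demo` and `df_snapshot` are illustrative names):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(8).reshape(4, 2), columns=list("AB"))

# An explicit copy never shares data with the original.
df_snapshot = df_demo.loc[1:3, :].copy()

df_demo.loc[1, "A"] = -1000
print(df_snapshot.loc[1, "A"])  # still the original value
```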

### 1.2 Reading data into a `DataFrame`

Pandas can import several common file formats:

- `pd.read_csv`: Read in CSV spreadsheets (`.csv` suffix)
- `pd.read_excel`: Read in MS Office spreadsheets (`.xls` and `.xlsx` suffix)
- `pd.read_stata`: Read Stata datasets (`.dta` suffix)
- `pd.read_hdf`: Read HDF datasets (`.hdf` suffix)
- `pd.read_sql`: Read from SQL database

Other file formats are [supported](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) as well.

#### 1.2.1 Reading from a CSV file

A very common way to generate a `DataFrame` is to read data from an external file. CSV files can be parsed with the pandas convenience function [`pd.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

Read the file with pandas and specify the delimiter symbol as well as the symbol that introduces comments:

In [None]:
pd.read_csv("iris-data.csv", delimiter=";", comment='#').head()

# original CSV file from https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
# reformatted for our use

We can limit the number of imported columns by specifying those that we explicitly want to have.

In [None]:
df_iris = pd.read_csv(
    "iris-data.csv",
    delimiter=";",
    comment="#",
    usecols=["Name", "sepal length", "sepal width"],
)
df_iris.head()
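
The same options can be tried without any file on disk by parsing an in-memory string. The content of `csv_text` below is made up for illustration and is not the actual content of `iris-data.csv`.

```python
import io

import pandas as pd

# Made-up file content: ';' as delimiter and '#' introducing a comment line.
csv_text = """# measurements in cm
sepal length;sepal width;Name
5.1;3.5;Iris-setosa
4.9;3.0;Iris-setosa
7.0;3.2;Iris-versicolor
"""

df_demo = pd.read_csv(
    io.StringIO(csv_text),
    delimiter=";",
    comment="#",
    usecols=["Name", "sepal length"],  # "sepal width" is skipped
)
print(df_demo)
```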

#### 1.2.2 Playing with the index

`DataFrame`s offer multiple methods for altering the index. Some of them are:

- [`df.reset_index()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html): Reset the index and use default index.
- [`df.set_index()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html): Set the index  using an existing column.
- [`df.reindex()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html): Change current index with additional filling logic.

In [None]:
# Load the IRIS dataset from a local file
df = pd.read_csv('iris.data', names=["sepal length", "sepal width", "petal length", "petal width", "Name"])

# Quick check if data looks alright
df.head()
#df.index

Discard the current index and use default indexing scheme. The index will be made a regular column:

In [None]:
df.reset_index().head()  # By default this returns a new object (inplace=False)

We can select another column as our index:

In [None]:
df_new = df.set_index("Name")  # By default this returns a new object (inplace=False).
df_new.head()

In [None]:
df_new.loc["Iris-versicolor"].head()

In [None]:
# the same inplace
df.set_index("Name", inplace=True)
df.head()
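
A compact sketch of how `set_index()` and `reset_index()` undo each other, on a made-up frame (all names are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["a", "b", "c"], "value": [1, 2, 3]})

indexed = df_demo.set_index("Name")       # "Name" becomes the index
restored = indexed.reset_index()          # the index becomes a regular column again
dropped = indexed.reset_index(drop=True)  # discard the old index instead

print(indexed.loc["b", "value"])
print(list(restored.columns), list(dropped.columns))
```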

Example with student grades:

In [None]:
math_grades = {
    'stud1': 15,
    'stud2': 11,
    'stud3': 9,
    'stud4': 13,
    'stud5': 12,
    'stud6': 7,
    'stud7': 14,
}
chemistry_grades = {
    'stud1': 10,
    'stud2': 14,
    'stud3': 12,
    'stud4': 8,
    'stud5': 11,
    'stud6': 10,
    'stud7': 12,
}

df_grades = pd.DataFrame(
    {"math": pd.Series(math_grades), "chemistry": pd.Series(chemistry_grades)}
)

df_grades

We can change the index of the `DataFrame` to add additional rows:

In [None]:
new_index = list(math_grades.keys()) + ["stud8", "stud9"]

df_grades.reindex(new_index)
# `reindex` returns a new object; `df_grades` itself is not modified.

In [None]:
# We can also choose a specific value to fill into places that originate from introducing a new index.
df_grades.reindex(new_index, fill_value="missing")
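
The choice of fill value also affects the dtype of the result. In the following sketch (made-up grades, illustrative names) the default `NaN` forces a `float64` column, while a numeric `fill_value` keeps the integer dtype:

```python
import pandas as pd

grades = pd.DataFrame({"math": [15, 11]}, index=["stud1", "stud2"])

with_nan = grades.reindex(["stud1", "stud2", "stud3"])
with_zero = grades.reindex(["stud1", "stud2", "stud3"], fill_value=0)

print(with_nan.dtypes["math"], with_zero.dtypes["math"])
```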

#### 1.2.3 Performance implications of the `inplace` argument

Create a huge field of data for testing:

In [None]:
import string

column_names = list(string.ascii_lowercase)
N_rows, N_columns = 500_000, len(column_names)
data = np.ones((N_rows, N_columns))
index = range(N_rows)

pd.DataFrame(data=data, index=index, columns=column_names)


Test the performance difference between `inplace=True` and `inplace=False`:

In [None]:
def reset_index_of_DataFrame(index, data, colnames, inplace=False):
    df = pd.DataFrame(data=data, index=index, columns=colnames)
    df.reset_index(inplace=inplace)
    del df

In [None]:
%timeit reset_index_of_DataFrame(index, data, column_names, inplace=False)
%timeit reset_index_of_DataFrame(index, data, column_names, inplace=True)

The `inplace` argument is available for many methods that operate on `DataFrame`s. For performance and memory efficiency reasons, it may be a good idea to pass `inplace=True` to these methods, although the gain depends on the method and the pandas version, so it is worth measuring as above.

Please be aware that this change will persist and will possibly influence future calls to other functions and methods.

Please always refer to the documentation of the method of interest and check the availability and the relevance of the `inplace` argument.
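
The persistence of an in-place change can be seen in a small sketch (made-up data, illustrative names): without `inplace` the original keeps its index, with `inplace=True` the object itself is altered.

```python
import pandas as pd

df_demo = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "b", "c"])

# Without `inplace`, the result is a new object and `df_demo` is left untouched.
df_new = df_demo.reset_index()
columns_before = list(df_demo.columns)

# With `inplace=True`, `df_demo` itself is modified.
df_demo.reset_index(inplace=True)
columns_after = list(df_demo.columns)

print(columns_before, columns_after)
```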

### 1.3 Task

##### **1.**  Create a `pd.DataFrame` from the following two lists in several different ways. The column names should be `"Zufallszahlen"` (for `values1`) and `"Countdown"` (for `values2`).

In [None]:
values1 = np.random.randint(-10, 10, size=5)
values2 = range(5, 0, -1)

##### **2.**  The following two `pd.Series` are given. Construct a `pd.DataFrame` from them with the column names `"alles"` (for `s1`) and `"gerade Zahlen"` (for `s2`). Replace the resulting `NaN` values with `0`.

In [None]:
s1 = pd.Series(data=range(5), index=list('abcde'))
s2 = pd.Series(data=range(0, 10, 2), index=list('acegi'))

##### **3.**  We load the Iris dataset, which contains measurements of different plants:

In [None]:
#df_iris = pd.read_csv(
#    'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
#    names=["sepal length", "sepal width", "petal length", "petal width", "Name"],
#)

df_iris = pd.read_csv('iris.data', names=["sepal length", "sepal width", "petal length", "petal width", "Name"])
df_iris.head()

##### **4.**  Access the columns `"sepal length"`, `"petal width"` and `"Name"` simultaneously in two different ways.

##### **5.** Unfortunately, the measurements for `"Iris-setosa"` are unusable. In `df_tmp1` and `df_tmp2`, set all measurement entries belonging to the species `"Iris-setosa"` to `nan` (but not the column containing the entry `"Iris-setosa"`). Use sequential indexing (`[cols][rows]`) once and the `.loc` indexer once. Then check whether the changes have taken effect in each case.

In [None]:
df_tmp1 = df_iris.copy(deep=True)
df_tmp2 = df_iris.copy(deep=True)

##### **6.** Plot the distribution of each measured quantity in a separate histogram. Pay attention to the axis labels and in particular to stating the units. How do the figures change when you vary the number of `bins`?

## 2  `DataFrame`s: Operations

### 2.1  Arithmetic operations on `DataFrame`s

#### Mapping between Python operators and Pandas methods

| Python operator | Pandas methods                   |
|:---------------:|----------------------------------|
|       `+`       | `add()`                          |
|       `-`       | `sub()`, `subtract()`            |
|       `*`       | `mul()`, `multiply()`            |
|       `/`       | `truediv()`, `div()`, `divide()` |
|       `//`      | `floordiv()`                     |
|       `%`       | `mod()`                          |
|       `**`      | `pow()`                          |

In [None]:
A = pd.DataFrame(np.random.randint(0, 20, (3, 2)), columns=list("AB"))
B = pd.DataFrame(np.random.randint(0, 20, (3, 3)), columns=list("BAC"))

In [None]:
A

In [None]:
B

Indices are aligned, no matter what the order is in both `DataFrame`s.

In [None]:
A + B

The number of columns does not match. We pass `fill_value` to specify the value used in place of the missing entries:

In [None]:
A.add(B, fill_value=0)
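
Because `A` and `B` hold random numbers, the effect is easier to verify with fixed values. In this sketch (`A_demo` and `B_demo` are made-up frames) column `"C"` exists only in the second operand:

```python
import pandas as pd

A_demo = pd.DataFrame({"A": [1, 2], "B": [10, 20]})
B_demo = pd.DataFrame({"B": [1, 1], "A": [5, 5], "C": [7, 8]})

plain = A_demo + B_demo                    # no partner for "C", so the result is NaN
filled = A_demo.add(B_demo, fill_value=0)  # missing entries of A_demo count as 0

print(plain)
print(filled)
```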

NumPy broadcasting rules apply for `DataFrame`s as well.

In [None]:
df = pd.DataFrame(np.random.randint(10, size=(3, 4)), columns=list("wxyz"))
df

Row-wise operations are the default.

In [None]:
df - df.loc[0]

We can use the `axis` argument if we want to operate on the columns.

In [None]:
df.sub(df["x"], axis=0)

Indices are aligned for these kinds of operations. This means that the data context is maintained, which helps to avoid unnecessary errors.

In [None]:
df_slice = df.loc[0, ::2]
df_slice

In [None]:
df - df_slice
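
The role of the `axis` argument can be checked with fixed numbers. In this sketch (made-up frame `df_demo`) the default matches the index of the `Series` against the column labels, while `axis=0` matches it against the row labels:

```python
import pandas as pd

df_demo = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# Default: the first row is subtracted from every row.
minus_first_row = df_demo - df_demo.loc[0]

# axis=0: column "x" is subtracted from every column.
minus_x = df_demo.sub(df_demo["x"], axis=0)

print(minus_first_row)
print(minus_x)
```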

We can apply NumPy Ufuncs to a `DataFrame` object as well:

In [None]:
np.exp(df)

Adding columns based on arithmetic with existing columns:

In [None]:
df["asdf"] = np.sin(df["x"] * df["y"])
df

### 2.2 `agg()`, `apply()`, `applymap()` and `transform()`

Pandas `DataFrame` and `Series` objects have several built-in methods to operate on the data.

- `agg()`: available for *both* `Series` and `DataFrame` objects
- `apply()`: available for *both* `Series` and `DataFrame` objects
- `transform()`: available for *both* `Series` and `DataFrame` objects
- `applymap()`: *only* available for `DataFrame` objects (deprecated since pandas 2.1 in favour of the equivalent `DataFrame.map()`)
- `map()`: *only* available for `Series` objects

*Note*: In what follows we will only deal with `agg()`, `apply()`, `applymap()` and `transform()`.

In [None]:
#df_iris = pd.read_csv(
#    'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
#    names=["sepal length", "sepal width", "petal length", "petal width", "Name"],
#)

df_iris = pd.read_csv('iris.data', names=["sepal length", "sepal width", "petal length", "petal width", "Name"])
df_iris.head()

In [None]:
# Get a subset of the data columns
data_columns = df_iris.columns[:-1]
data_columns

#### [`agg()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html)

```python
DataFrame.agg(func=None, axis=0, *args, **kwds)
```
- *applies* a function (callable) along an `axis` of the `DataFrame`
    - `axis=0`: `func` is applied to each column (a `Series` object). This is the default!
    - `axis=1`: `func` is applied to each row
- return type is inferred from `func`

This method performs aggregation operations along a specified axis of a `DataFrame`. It can be passed multiple functions, e.g. in a `list`.

The return can be:
 - scalar : when Series.agg is called with single function
 - Series : when DataFrame.agg is called with a single function
 - DataFrame : when DataFrame.agg is called with several functions

In [None]:
df_iris[data_columns].agg(['sum', 'max', 'min'])

In [None]:
df_iris[data_columns].agg(['sum', 'max', 'min'], axis=1)

In [None]:
df_iris[data_columns].agg([np.mean, np.std], axis=0)
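
The three return types listed above can be observed directly on a small made-up frame (`df_demo` is an illustrative name):

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

scalar = df_demo["a"].agg("sum")      # Series with one function: scalar
series = df_demo.agg("sum")           # DataFrame with one function: Series
frame = df_demo.agg(["sum", "max"])   # DataFrame with several functions: DataFrame

print(type(scalar), type(series), type(frame))
```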

#### [`apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)

```python
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds)
```
- *applies* a function (callable) along an `axis` of the `DataFrame`
    - `axis=0`: `func` is applied to each column (a `Series` object). This is the default!
    - `axis=1`: `func` is applied to each row
- return type is inferred from `func`

The return type of `func` determines the form of the result.

`func` can operate on `Series` objects and perform operations that are supported by these types of objects (e.g. by means of the methods `.min()`, `.max()` or `.mean()`).
- result can be a scalar value (e.g. `.sum()` which is an aggregation operation)
- result can be another `Series` object

`func` does not have to be an aggregation.

In [None]:
# Operate on columns (axis=0): the `x` inside the `lambda` function is a `Series` object!
result = df_iris[data_columns].apply(lambda x: x.mean(), axis=0)
print(f"The type of the output is {type(result)}")
result

In [None]:
# Operate elementwise along all values in a row (axis=1).  The return type is another `DataFrame`.

# This converts the units of all measured values from cm to mm.
df_iris[data_columns].apply(lambda x: x * 10, axis=1).head()
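
How the return type of `func` shapes the result can be sketched on a tiny made-up frame: a scalar per column or row yields a `Series`, while a `Series` per column yields a `DataFrame`.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1.0, 4.0], "b": [3.0, 2.0]})

# Scalar result per column: one entry for each column label.
col_range = df_demo.apply(lambda col: col.max() - col.min(), axis=0)

# Scalar result per row: one entry for each row label.
row_range = df_demo.apply(lambda row: row.max() - row.min(), axis=1)

# Series in, Series out: the result is a DataFrame of the same shape.
scaled = df_demo.apply(lambda col: col * 10, axis=0)

print(col_range.tolist(), row_range.tolist(), scaled.shape)
```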

As an example for using the `agg()` and `apply()` functions we normalise the features (data_columns) of the IRIS dataset and plot the distributions.

In [None]:
# Specify color encoding of the different flower species.
name_to_color = {
    'Iris-setosa': "lightblue",
    'Iris-versicolor': "darkred",
    'Iris-virginica': "orange",
}

In [None]:
def plot_raw(axes, df, data_columns):
    for idx, (ax, f) in enumerate(zip(axes, data_columns)):
        for name in df["Name"].unique():
            df[df["Name"] == name][f].plot.kde(ax=ax, color=name_to_color.get(name))
        ax.set_xlabel(f + " / cm")


def plot_normalised(axes, df, data_columns):
    df_agg = df[data_columns].agg(["mean", "std"])
    df_normalised = df[data_columns].apply(
        lambda x: (x - df_agg.loc["mean"]) / df_agg.loc["std"], axis=1
    )
    #     print(df_normalised.describe())
    for idx, (ax, f) in enumerate(zip(axes, data_columns)):
        for name in df["Name"].unique():
            df_normalised[df["Name"] == name][f].plot.kde(
                ax=ax, color=name_to_color.get(name)
            )
        ax.set_xlabel(f + " (normalised)")


def adjust_xscale(axes):
    from functools import reduce
    from math import ceil, floor

    xmin, xmax = reduce(
        lambda a, b: (min(a[0], b[0]), max(a[1], b[1])),
        (ax.get_xlim() for ax in axes),
        (1000, -1000),
    )
    xmin, xmax = floor(xmin), ceil(xmax)
    for ax in axes:
        ax.set_xticks(range(xmin, xmax + 1, 2))
        ax.set_xticklabels(range(xmin, xmax + 1, 2))
        ax.set_xlim((0.9 * xmin, 1.05 * xmax))

In [None]:
def plot_distributions(df):
    fig, axes = plt.subplots(2, len(data_columns), figsize=(20, 10))

    # plot data as-is (use the argument `df` instead of the global `df_iris`)
    plot_raw(axes[0, :], df, data_columns)
    # plot normalised data
    plot_normalised(axes[1, :], df, data_columns)
    # adjust the x scale for the raw data
    adjust_xscale(axes[0, :])

In [None]:
plot_distributions(df_iris)

#### Experiment: What happens when operating with `apply`?

In [None]:
N_rows, N_cols = 10_000, 500
df = pd.DataFrame(
    np.random.random((N_rows, N_cols)), columns=[f"col{idx}" for idx in range(N_cols)]
)

In [None]:
# Benchmark: Operate along the columns (axis=0) vs operating along the rows (axis=1)
%timeit df.apply(lambda x: x ** 2, axis=0)
%timeit df.apply(lambda x: x ** 2, axis=1)

In [None]:
def dummy(x):
    """Dummy function to showcase how `apply` operates."""
    # Some code is needed here.
    print(type(x), x.shape)
    return x - x.mean()


df = pd.DataFrame(np.random.random((5, 3)))
df

In [None]:
df.apply(dummy, axis=0)  # apply along the columns

In [None]:
df.apply(dummy, axis=1)  # apply along the rows

#### `applymap()`

```python
DataFrame.applymap(func, na_action=None)
```

- `func` is applied to each element in the `DataFrame`
- `func` receives a scalar value and is supposed to return a scalar value as well
- return type of `applymap()` is another (modified) `DataFrame`

In [None]:
# We repeat the simple example of changing the units of the measured data from cm to mm.
df_iris[data_columns].applymap(lambda x: x * 10).head()
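
Newer pandas versions (2.1 and later) offer the same element-wise operation under the name `DataFrame.map()`. The following sketch picks whichever method is available; the fallback logic and the name `elementwise` are only an illustration, not something the notebook prescribes.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Use `DataFrame.map` where it exists, otherwise fall back to `applymap`.
elementwise = df_demo.map if hasattr(df_demo, "map") else df_demo.applymap

in_mm = elementwise(lambda value: value * 10)
print(in_mm)
```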

#### `transform()`

```python
DataFrame.transform(func, axis=0, *args, **kwargs)
```

`func` can either be
- callable, e.g. `np.exp`
- list-like, e.g. `[np.sin, np.cos]`
- dict-like, e.g. `{"sepal length": np.sin,  "petal length": np.cos}`. Application is limited to columns names passed as keys to `dict`.
- string, e.g. `"sqrt"`

*Note*: This function *transforms*, i.e., when the input is a `Series`, another (transformed) `Series` of the same length is returned. Returning a scalar value is not valid (the resulting error message is `ValueError: Function did not transform`).

In [None]:
df_iris[data_columns].transform({"sepal length": np.cos, "petal length": np.sin}).head()
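
A short sketch of both cases on made-up data: an element-wise function keeps the shape of the input, whereas an aggregating function is rejected with a `ValueError`.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"a": [1.0, 4.0, 9.0], "b": [16.0, 25.0, 36.0]})

roots = df_demo.transform(np.sqrt)  # same shape and index as the input

# An aggregating function returns one value per column and is rejected.
try:
    df_demo.transform(lambda col: col.sum())
    rejected = False
except ValueError:
    rejected = True

print(roots.shape, rejected)
```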

### 2.3 Performance considerations

When operating on columns of a `DataFrame` or on a `DataFrame` as a whole, it is oftentimes faster to use vectorised operations instead of column-/row-wise operations.

In [None]:
df = pd.DataFrame(np.random.randn(100000, 3), columns=list("abc"))

In [None]:
%timeit df.apply(lambda x: x ** 2)
%timeit df.applymap(lambda x: x ** 2)
%timeit df ** 2
%timeit (df.values ** 2)

### 2.4 Grouping data

Oftentimes items in a dataset can be grouped in a certain manner (e.g., if a column contains a value multiple times). The IRIS dataset, for instance, can be grouped according to the species of each flower.

```python
my_dataframe.groupby(by=["<column label>"])
```
The `DataFrame` is split and entries are grouped according to the values in the column with `"<column label>"`. Once the data has been grouped, operations can be conducted on the items of each group.

*Note*: `DataFrame`s can be [grouped](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) not only according to the entries of a column.

The return type of `groupby()` is *not* another `DataFrame` but rather a `DataFrameGroupBy` object. We can imagine this object to be a grouping of multiple `DataFrame`s.

It is important to understand that such an object essentially is a special *view* on the original `DataFrame`. No computations have been carried out when generating it (lazy evaluation).

#### 2.4.1 `GroupBy` objects

In [None]:
#df_iris = pd.read_csv(
#    'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
#    names=["sepal length", "sepal width", "petal length", "petal width", "Name"],
#)

df_iris = pd.read_csv('iris.data', names=["sepal length", "sepal width", "petal length", "petal width", "Name"])
df_iris.head()

We group the data according to the species of the flowers:

In [None]:
# A scalar label (instead of a one-element list) yields plain strings as group keys.
grouped_by_species = df_iris.groupby(by="Name")

The output of the `DataFrame.groupby()` method is *not* another `DataFrame`:

In [None]:
print(type(grouped_by_species))
print(grouped_by_species)

This data structure still knows about the `columns` that were present in the original `DataFrame`. We can use the `[<column-name>]` operation to access the columns with the corresponding label in each of the group members (subframes).

In [None]:
# This does *not* return a `DataFrame`
grouped_by_species["sepal length"]

We can perform several types of aggregations on this data structure. Pandas will access the corresponding column of all subframes and apply the functions passed to the `agg()` method.

In [None]:
grouped_by_species["sepal length"].agg([np.min, np.mean, np.max])

#### Access the groups contained inside `DataFrameGroupBy`

We can iterate over the `DataFrameGroupBy` object, where each subframe is returned as a `Series` or a `DataFrame`.

In [None]:
for (species, subframe) in grouped_by_species:
    print(f"{species} subframe has shape = {subframe.shape}")

With `get_group` we can choose the subframe to obtain a `DataFrame`.

In [None]:
grouped_by_species.get_group("Iris-setosa").head()
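
Iteration and `get_group()` can be tried on a tiny made-up frame. This sketch groups by a single scalar label, so the group keys are plain strings; all names are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Name": ["setosa", "setosa", "virginica"], "sepal length": [5.1, 4.9, 6.3]}
)
grouped = df_demo.groupby("Name")

# Each iteration step yields the group key and the matching subframe.
rows_per_group = {species: len(subframe) for species, subframe in grouped}

# Pick one subframe directly by its key.
setosa = grouped.get_group("setosa")

print(rows_per_group)
print(setosa["sepal length"].tolist())
```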

#### Dispatch

Methods that are not directly implemented for the `DataFrameGroupBy` object are passed to the subframes and executed on these.

In [None]:
grouped_by_species["sepal length"].describe()  # The return type is a `DataFrame`

The `describe()` method can also be called on the full object but the output would be rather hard to view.

In [None]:
grouped_by_species.describe()

Single methods are available as well, e.g. `mean()`, `std()` or `sum()`:

In [None]:
grouped_by_species.mean()  # The return type is a `DataFrame`

#### Plotting

A `GroupBy` object also provides a convenient way to plot data for comparison.

In [None]:
_, ax = plt.subplots()
ax.set_xlabel("sepal length / cm")
grouped_by_species["sepal length"].plot.hist(alpha=0.5, ax=ax, legend=True)

### 2.5 Aggregate, filter, transform, apply

`DataFrameGroupBy` objects support `aggregate()`, `filter()`, `transform()` and `apply()` operations.

These methods can be efficiently used to implement a great variety of operations on grouped data.

#### [`aggregate()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html) (or simply `agg()`)

```python
DataFrameGroupBy.aggregate(func=None, *args, engine=None, 
                           engine_kwargs=None, **kwargs)
```

`func` can for example be ...
- ... a function (Python callable),
- ... a string specifying a function name (e.g. `"mean"`),
- ... a list of functions or strings (e.g. `["std", np.mean]`),
- ... a `dict` of column labels and functions to apply (e.g. `{'data1': np.mean}`).

Perform some common aggregations within each subframe. The output of this method is another `DataFrame`.

In [None]:
group_agg = grouped_by_species.agg([np.min, np.max, np.mean, np.std])
group_agg

To understand this a bit better consider the following. Note that we limit the output to only one species.

In [None]:
df_iris[df_iris.columns[:-1]].loc[df_iris["Name"] == "Iris-setosa"].agg(
    [np.min, np.max, np.mean, np.std]
)

The resulting output looks somewhat more complicated than what we are used to from `DataFrame`s so far. The column labels are now hierarchical due to the grouping.

In [None]:
group_agg.columns

We can also select which operations to apply on specific columns.

In [None]:
grouped_by_species.agg({"sepal length": np.mean, "petal length": np.median})
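
If flat column labels are preferred, pandas also supports "named aggregation", where each keyword names an output column and maps to a `(column, function)` pair. A minimal sketch on made-up data (the output names `sepal_mean` and `petal_max` are chosen freely):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "Name": ["setosa", "setosa", "virginica", "virginica"],
        "sepal length": [5.0, 5.2, 6.0, 7.0],
        "petal length": [1.4, 1.6, 5.0, 6.0],
    }
)

summary = df_demo.groupby("Name").agg(
    sepal_mean=("sepal length", "mean"),
    petal_max=("petal length", "max"),
)
print(summary)
```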

#### [`filter()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.filter.html)

A filtering operation allows us to select/drop data based on certain criteria.

```python
DataFrameGroupBy.filter(func, dropna=True, *args, **kwargs)
```

- `func` must be applicable to a `DataFrame`.
- `func` should have a `bool`ean return type and hence should either return `True` or `False`

The argument of the callable passed to `filter` can be treated like a regular `DataFrame` object.

From all subframes we select only those with mean value of 'sepal length' > threshold. The return type of the `filter` function is a `DataFrame` object. The grouping is dropped.

In [None]:
grouped_by_species.filter(lambda x: x["sepal length"].mean() > 6).head()
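
With fixed numbers it is easy to see which rows survive. In this sketch (made-up data) only the group whose mean sepal length exceeds the threshold is kept, and the rows retain their original index labels:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "Name": ["setosa", "setosa", "virginica", "virginica"],
        "sepal length": [5.0, 5.2, 6.0, 7.0],
    }
)

# Group means: setosa 5.1, virginica 6.5. Only virginica passes the test.
kept = df_demo.groupby("Name").filter(lambda sub: sub["sepal length"].mean() > 6)

print(kept)
```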

#### [`transform()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html)

```python
DataFrameGroupBy.transform(func, *args, engine=None, engine_kwargs=None, **kwargs)
```

Transformations return a modified version of the original `DataFrame` with transformed values.

`func` is applied to each subframe (operating on one `Series` at a time).

As an example, we center the data of each group on the group-wise mean value.

In [None]:
def center_on_mean(x):
    return x - x.mean()


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
ax1.set_xlabel("measured value / cm")
df_iris.plot.kde(ax=ax1)
(grouped_by_species.transform(center_on_mean)).plot.kde(ax=ax2)
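
The effect of the group-wise centering can be verified numerically on a small made-up frame: the result has the same index as the input, and every group mean becomes zero.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "Name": ["setosa", "setosa", "virginica", "virginica"],
        "sepal length": [5.0, 5.2, 6.0, 7.0],
    }
)

centered = df_demo.groupby("Name")[["sepal length"]].transform(
    lambda col: col - col.mean()
)

# Group the centered values again to check the new group means.
group_means = centered.groupby(df_demo["Name"]).mean()
print(centered)
print(group_means)
```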

#### [`apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.apply.html)

```python
GroupBy.apply(func, *args, **kwargs)
```

`func` must take a `DataFrame` as argument and return a `DataFrame`, a `Series` or a scalar. The final result will be combined into a `DataFrame` or a `Series` object.

In [None]:
def compute_df_mean(x):
    print(type(x))  # The input datatype is a `DataFrame`
    # `numeric_only=True` skips non-numeric columns such as "Name", if present.
    x = x.mean(numeric_only=True)
    print(type(x))  # Returns a `Series` object
    x = x.mean()
    print(type(x))  # Returns a scalar
    return x


species_all_mean = grouped_by_species.apply(compute_df_mean)
print(f"The output type of the `apply()` operation is: {type(species_all_mean)}")
species_all_mean
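
Whether the grouping column is handed to `func` has varied between pandas versions. One way to sidestep the question is to select the numeric columns explicitly before calling `apply()`, as in this sketch (made-up data; `overall_mean` is a hypothetical helper):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "Name": ["setosa", "setosa", "virginica", "virginica"],
        "sepal length": [5.0, 5.2, 6.0, 7.0],
        "petal length": [1.4, 1.6, 5.0, 6.0],
    }
)


def overall_mean(subframe):
    """Mean over all entries of one group's subframe (returns a scalar)."""
    return subframe.to_numpy().mean()


per_species = df_demo.groupby("Name")[["sepal length", "petal length"]].apply(
    overall_mean
)
print(per_species)
```

Since `func` returns a scalar for each group, the combined result is a `Series` indexed by the group keys.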

### 2.6 Task

##### **1.** Have a look at the "Titanic" dataset. Import it into a `pd.DataFrame` named `df_titanic`. Import only the columns `"class"`, `"age"`, `"sex"` and `"survived"`.

##### **2.** Which entries occur in the columns `"class"` and `"age"`?

##### **3.** What was the survival rate of adult passengers in the first, second and third class, respectively? Use different approaches, namely:

- Selecting the matching values with boolean masks

- Grouping `df_titanic` by the two relevant columns

- Applying the `apply` method with a self-written function `survival_rate` to the values grouped by `"class"`