# Brief Introduction to pandas

pandas is an open-source library providing intuitive data structures for data analysis, data transformation, cleaning, and, with the help of matplotlib, simple data visualizations.

Most of pandas' functionality is built on the numpy library, an optimized library providing efficient data structures and mathematical functions.

Note that this Jupyter notebook merely gives an overview and is by no means complete. It introduces pandas' central data structures (`Series` and `DataFrame`), how to index and select data from a `DataFrame`, as well as operations for cleaning, sorting, aggregating, and grouping data.
For detailed explanations and further topics like time series and date functionality, refer to the extensive [user guide](https://pandas.pydata.org/docs/user_guide/index.html) and [API reference](https://pandas.pydata.org/docs/reference/index.html).

To use pandas, simply import it as in the following code cell. By convention, the library is imported under the alias `pd`, which has become the de-facto standard.

In [None]:
import pandas as pd

## Content

1. [Central Data Structures: Series and DataFrame](#data_structures)
2. [Indexing and Selection](#indexing_selection)
3. [Sort a `DataFrame` by Index or Column](#sort)
4. [Statistics, Aggregation, and Groups](#stats_aggregation_groups)
5. [Content Modification](#modification)
6. [Data Cleaning Operations](#cleaning)
7. [Read and Write Data](#read_write)
8. [Database Access](#database_access)

## 1. Central Data Structures: Series and DataFrame <a id='data_structures'></a>


pandas represents data in a table format called `DataFrame` in which every column is a `Series`. A `Series` holds data of a specific type such as string or integer. Under the hood, a `Series` is essentially a one-dimensional numpy array with a name (the column name) and an index (the row labels). Thus, a `DataFrame` is a table of several named columns (`Series`) that share the same row index.

![DataFrame](img/01_table_dataframe.svg)

<div class="alert alert-danger" role="alert">

### Difference: pandas vs. numpy

- pandas may have trouble with extremely large `DataFrame`s. It can become sluggish compared to plain numpy arrays.
- numpy arrays differ from Python lists. A numpy array can only hold values of the same type, whereas a Python list can hold mixed data types.
- numpy arrays are optimized and thus consume less memory.
- The core of the numpy library is written in C and provides an API for Python. Thus, its code is already compiled, which makes access and computations faster. Python code is interpreted and thus slower.
- Mathematical operations on numpy arrays work on whole arrays at once, similar to operations on vectors and matrices (e.g. addition). Note that `*` multiplies element-wise; matrix multiplication uses the `@` operator.

</div>
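
The differences listed above can be illustrated with a small sketch. All variable names and values are made up for this illustration.

```python
import numpy as np

# A Python list may mix types; a numpy array has a single dtype.
mixed_list = [1, "two", 3.0]
int_array = np.array([1, 2, 3])
float_array = np.array([1, 2.5, 3])  # the integers are upcast to float

# Arithmetic is applied element-wise to the whole array at once.
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
elementwise_sum = a + b
elementwise_product = a * b
dot_product = a @ b  # 1*10 + 2*20 + 3*30

# The same operator on Python lists concatenates instead of adding.
concatenated = [1, 2, 3] + [10, 20, 30]

print(elementwise_sum, elementwise_product, dot_product, concatenated)
```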

### Object Creation

Different ways exist to create a pandas `DataFrame`. These include, for instance, creation from a list, a numpy array, a dictionary, and a list of dictionaries.

1. `DataFrame` Creation From a List

In [None]:
df = pd.DataFrame(
    data=[
        [1, "Course Introduction", True],
        [2, "KDD Introduction", True],
        [3, "Getting To Know Your Data", False],
        [4, "Data Preprocessing", False],
        [5, "OLAP", False],
        [6, "Frequent Pattern", False],
        [7, "Classification", False],
        [8, "Cluster", False],
        [9, "Outlier", False],
    ]
)
df

2. `DataFrame` Creation From a Dictionary

In [None]:
df = pd.DataFrame(
    data={
        "Number": [1, 2, 3, 4, 5, 6, 7, 8, 9],
        "Lecture Name": [
            "Course Introduction",
            "KDD Introduction",
            "Getting To Know Your Data",
            "Data Preprocessing",
            "OLAP",
            "Frequent Pattern",
            "Classification",
            "Cluster",
            "Outlier",
        ],
        "Done": [True, True, False, False, False, False, False, False, False],
    }
)
df

Another way is to create a `DataFrame` from a list of dictionaries, where each dictionary represents one row:

In [None]:
df = pd.DataFrame(
    data=[
        {"Number": 1, "Lecture Name": "Course Introduction", "Done": True},
        {
            "Number": 2,
            "Lecture Name": "KDD Introduction",
            "Done": True,
        },
        {
            "Number": 3,
            "Lecture Name": "Getting To Know Your Data",
            "Done": False,
        },
        {
            "Number": 4,
            "Lecture Name": "Data Preprocessing",
            "Done": False,
        },
        {
            "Number": 5,
            "Lecture Name": "OLAP",
            "Done": False,
        },
        {
            "Number": 6,
            "Lecture Name": "Frequent Pattern",
            "Done": False,
        },
        {
            "Number": 7,
            "Lecture Name": "Classification",
            "Done": False,
        },
        {
            "Number": 8,
            "Lecture Name": "Cluster",
            "Done": False,
        },
        {
            "Number": 9,
            "Lecture Name": "Outlier",
            "Done": False,
        },
    ]
)
df

In [None]:
# we copy this DataFrame for later use
df_original = df.copy()

pandas automatically derives data types. You can view them with the attribute `dtypes`:

In [None]:
df.dtypes

Note that column "Lecture Name" has the dtype `object`. The [documentation of `dtypes`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html) reveals that "[c]olumns with mixed types are stored with the `object` dtype". Generally, pandas has two ways to store strings: in an `object` dtype capable of holding any Python object, or in a `StringDtype`. It is, however, recommended to use `StringDtype` for strings. Conversion of a dtype can be achieved with the method `astype`, which is available for both `DataFrame` and `Series`.

Changing dtype of a specific column of a `DataFrame`:

In [None]:
df = df.astype({"Lecture Name": "string"})
df.dtypes
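
Since `astype` is also available on a `Series`, a single column can be converted on its own. The following is a minimal sketch; the two `Series` and their names are made up for illustration.

```python
import pandas as pd

# Hypothetical example Series
lecture_names = pd.Series(["OLAP", "Cluster", "Outlier"], name="lecture")
numbers = pd.Series([1, 2, 3], name="Number")

# Convert the strings to the dedicated string dtype
lecture_names_as_string = lecture_names.astype("string")

# Convert integers to floating point numbers
numbers_as_float = numbers.astype("float64")

print(lecture_names_as_string.dtype, numbers_as_float.dtype)
```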

Similarly, a `Series` can be created. 

<div class="alert alert-success" role="alert">

### TODO Exercise: Create a Series.

</div>

In [None]:
lectures = [
    "Course Introduction",
    "KDD Introduction",
    "Getting To Know Your Data",
    "Data Preprocessing",
    "OLAP",
    "Frequent Pattern",
    "Classification",
    "Cluster",
    "Outlier",
]

In [None]:
# TODO create a pandas Series with column name "lecture"

<div class="alert alert-success" role="alert">

### TODO Exercise: Create a pandas DataFrame From This Series. 

</div>

In [None]:
# TODO create a pandas DataFrame from your Series

### Accessing First or Last Rows of a DataFrame

A `DataFrame` may hold many rows. Viewing the whole `DataFrame` at once might not always be the best idea. However, it is possible to view only the first or the last couple of rows with `head` and `tail`. By default, each displays five rows.

In [None]:
df = df_original.copy()
df.head()

In [None]:
df.head(2)

In [None]:
df.tail()

### Get Number of Rows and Columns

In [None]:
df.shape

In [None]:
print(f"Number of rows: {df.shape[0]}", f"Number of columns: {df.shape[1]}", sep="\n")

### Get Number of Cells in a DataFrame or Series

In [None]:
df.size

### Get Memory Consumption in Bytes

In [None]:
df.memory_usage()
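
For columns of dtype `object`, the default call only counts the references to the Python objects, not the objects themselves. As a hedged sketch (the small `DataFrame` is made up for illustration), the parameter `deep=True` asks pandas to inspect the actual contents as well:

```python
import pandas as pd

small = pd.DataFrame(
    {"Number": [1, 2, 3], "Lecture Name": ["OLAP", "Cluster", "Outlier"]}
)

# Bytes per column (plus the index), summed up
shallow_total = small.memory_usage().sum()

# deep=True additionally inspects the contents of object columns
deep_total = small.memory_usage(deep=True).sum()

print(shallow_total, deep_total)
```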

### Get Useful DataFrame Information

In [None]:
df.info()

## 2. Indexing and Selection <a id='indexing_selection'></a>

A `DataFrame` consists of one or more columns with one or more rows. pandas provides the data type `Index` and automatically creates a row index (called `index`) and a column index (called `columns`) for you when creating a `DataFrame`. Many pandas functions operate on these indices via the parameter `axis`, where `axis=0` refers to the row index (`index`) and `axis=1` refers to the column index (`columns`).

The first example of `DataFrame` creation used a simple Python list without any column names, whereas later dictionary examples set column names based on dictionary keys. It is also possible to explicitly define column names at `DataFrame` creation with the parameter `columns`:

In [None]:
df
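
The cell above only displays the existing `df`. As a minimal sketch of the `columns` parameter, the following uses a shortened version of the nested list from the first creation example; the variable names are chosen for this illustration only.

```python
import pandas as pd

# Shortened version of the nested list from the first creation example
lecture_rows = [
    [1, "Course Introduction", True],
    [2, "KDD Introduction", True],
    [3, "Getting To Know Your Data", False],
]

# Without `columns`, pandas numbers the columns 0, 1, 2
df_unnamed = pd.DataFrame(data=lecture_rows)

# With `columns`, each position gets an explicit name
df_named = pd.DataFrame(
    data=lecture_rows, columns=["Number", "Lecture Name", "Done"]
)

print(df_unnamed.columns.tolist())
print(df_named.columns.tolist())
```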

Of course, the indices `columns` and `index` of a `DataFrame` are accessible via their respective attributes:

In [None]:
df.columns

In [None]:
df.index

Note that the row index (`index`) is not stored as a list but as a `RangeIndex`.

Both indices are accessible at once via `axes`:

In [None]:
df.axes

### Access Specific Columns

It is possible to select one column or specific columns.

In [None]:
df["Lecture Name"]

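When a column name contains no whitespace (more precisely, when it is a valid Python identifier that does not clash with an existing `DataFrame` attribute or method), it is also possible to access the column in the following way: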

In [None]:
df.Number

Select specific columns by a list of column names:

In [None]:
df[["Number", "Lecture Name"]]

Note that selecting a single column by its name returns a `Series`, whereas selecting with a list of column names returns a `DataFrame`.

In [None]:
type(df.Number)

In [None]:
type(df[["Number", "Lecture Name"]])

### Label-based vs. Integer-based Indexing

pandas provides two ways of indexing and selection: label-based and integer-based.

**Label-based Indexing** 

Label-based indexing operates on the names (labels) of rows and columns. `DataFrame` provides the indexer `loc` that, according to the documentation, primarily operates on labels but also works with a boolean array.

The following inputs are allowed:
1. Single label for row/column.
2. List of labels for row/column.
3. A slice object with labels for row/column. A label slice in pandas includes both start and end labels, whereas typical Python slices exclude the end position!
4. Boolean array for row/column. Here, value `False` does not include a row/column.
5. A callable that takes the `DataFrame` as its argument and returns one of the above.

The syntax of `loc` differs slightly from usual function calls: instead of parentheses it expects square brackets: `loc[row_name, col_name]`.

In [None]:
df

Let's display the name of lecture 3:

In [None]:
df.loc[2, "Lecture Name"]

Select multiple rows with a slice object:

In [None]:
df.loc[2:, "Lecture Name"]

Select all lecture names that have already been held with a callable:

In [None]:
df.loc[lambda df: df.Done, "Lecture Name"]
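
The remaining two input kinds of `loc`, a list of labels and a boolean array, might look as follows. This is a minimal sketch on a shortened lecture table; `lectures_df` and the other variable names are made up for illustration.

```python
import pandas as pd

lectures_df = pd.DataFrame(
    {
        "Number": [1, 2, 3, 4],
        "Lecture Name": [
            "Course Introduction",
            "KDD Introduction",
            "Getting To Know Your Data",
            "Data Preprocessing",
        ],
        "Done": [True, True, False, False],
    }
)

# List of labels: rows 0 and 3, columns "Number" and "Done"
by_list = lectures_df.loc[[0, 3], ["Number", "Done"]]

# Boolean array: one entry per row, True keeps the row
mask = [False, True, True, False]
by_mask = lectures_df.loc[mask, "Lecture Name"]

# Boolean Series derived from a column: lectures not yet held
not_done = lectures_df.loc[~lectures_df["Done"], "Lecture Name"]

print(by_list, by_mask, not_done, sep="\n\n")
```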

**Integer-based Indexing**

Integer-based indexing (also referred to as index-based indexing) is similar to its label-based counterpart, yet operates on integer positions for both rows and columns. `DataFrame` provides the indexer `iloc` for this purpose. The allowed inputs mirror those of `loc`:

1. Single integer for row/column.
2. List of integers for row/column.
3. A slice object with integers for row/column. In contrast to `loc`, such a slice behaves like a typical Python slice: the start position is included, the end position is excluded.
4. Boolean array for row/column. Here, value `False` does not include a row/column.
5. A callable that takes the `DataFrame` as its argument and returns one of the above.
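
The different treatment of the slice end in label-based and integer-based indexing is easy to overlook. The following sketch uses a made-up table with string row labels so that labels and positions cannot be confused; all names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with string row labels
prices = pd.DataFrame(
    {"PRICE": [0.3, 0.75, 0.6, 0.5], "UNIT": [4, 3, 11, 13]},
    index=["a", "b", "c", "d"],
)

# Label-based slice: the end label "c" is included
label_slice = prices.loc["a":"c", "PRICE"]

# Integer-based slice: the end position 2 is excluded
position_slice = prices.iloc[0:2, 0]

print(label_slice.index.tolist())
print(position_slice.index.tolist())
```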


<div class="alert alert-success" role="alert">

### TODO Exercise: Play with Index-based Indexing.

</div>

In [None]:
# TODO Select a single cell

In [None]:
# TODO Select multiple cells at once

In [None]:
# TODO Select cells with a callable

## 3. Sort a `DataFrame` by Index or Column <a id='sort'></a>

A `DataFrame` can be sorted by its index as well as by column values.

### 3.1. Sort By Index

Function `sort_index` sorts a `DataFrame` by its index alphanumerically. The following parameters are available (among others, refer to the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html)):

- `axis`: Default `0`, meaning it sorts rows.
- `ascending`: Default `True`, set to `False` for descending sort.
- `inplace`: Default `False`.
- `na_position`: Default `last`. To place NaNs (Not a Number, `NULL` in SQL) first, set to `first`.
- `kind`: Sorting algorithm, default `quicksort`. Choose between: `quicksort`, `mergesort`, `heapsort`, `stable`.

In [None]:
# let's get our untouched DataFrame
df = df_original.copy()
df

Sort descending:

In [None]:
df.sort_index(ascending=False)

### 3.2. Sort By Column

To sort a `DataFrame` by column, use function `sort_values`. It has the following parameters (refer to [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)):
- `by`: Column name or list of column names to sort by.
- `axis`: Default `0`, meaning it sorts rows.
- `inplace`: Default `False`.
- `na_position`: Default `last`. To place NaNs (Not a Number, `NULL` in SQL) first, set to `first`.
- `kind`: Sorting algorithm, default `quicksort`. Choose between: `quicksort`, `mergesort`, `heapsort`, `stable`.
- `ignore_index`: Default `False` to retain current index. Set to `True` to generate a new index.

In [None]:
df.sort_values(by="Lecture Name")

In [None]:
df.sort_values(by="Lecture Name", ignore_index=True)
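
Since `by` also accepts a list of column names, ties in the first column can be broken by a second one. The following is a minimal sketch with made-up data; passing a list to `ascending` sets the sort direction per column.

```python
import pandas as pd

items = pd.DataFrame(
    {
        "NAME": ["Kiwi", "Apple", "Onion", "Garlic"],
        "PRICE": [0.6, 0.3, 0.3, 0.3],
    }
)

# Sort by price (descending) first; equal prices are ordered by name (ascending)
sorted_items = items.sort_values(
    by=["PRICE", "NAME"], ascending=[False, True], ignore_index=True
)
print(sorted_items)
```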

## 4. Statistics, Aggregation, and Groups <a id='stats_aggregation_groups'></a>

pandas is based on numpy and thus provides an extensive set of mathematical functions out of the box.

In [None]:
articles = pd.DataFrame(
    data={
        "AID": range(1, 16),
        "NAME": [
            "Apple",
            "Banana",
            "Kiwi",
            "Clementine",
            "Strawberry",
            "Cherry",
            "Carrot",
            "Bell Pepper",
            "Onion",
            "Salad",
            "Tomato",
            "Cucumber",
            "Spinach",
            "Water Melon",
            "Garlic",
        ],
        "PRICE": [
            0.3,
            0.75,
            0.6,
            0.5,
            0.45,
            0.4,
            0.5,
            0.7,
            0.3,
            0.4,
            0.45,
            0.3,
            0.6,
            1.5,
            0.3,
        ],
        "TYPE": [
            "Fruit",
            "Fruit",
            "Fruit",
            "Fruit",
            "Fruit",
            "Fruit",
            "Vegetable",
            "Vegetable",
            "Vegetable",
            "Vegetable",
            "Vegetable",
            "Vegetable",
            "Vegetable",
            "Fruit",
            "Vegetable",
        ],
    }
)
articles_original = articles.copy()  # store it for later
articles

pandas provides functions to describe a distribution:

In [None]:
# Mean value of column "PRICE"
articles.PRICE.mean()

In [None]:
# Median value of column "PRICE"
articles.PRICE.median()

In [None]:
# Max value of column "PRICE"
articles.PRICE.max()

In [None]:
# Likewise, min value of column "PRICE"
articles.PRICE.min()

In [None]:
# Variance of column "PRICE"
articles.PRICE.var()

In [None]:
# Standard Deviation of column "PRICE"
articles.PRICE.std()

These statistics are automatically calculated at once by `describe`:

In [None]:
articles.PRICE.describe()

In [None]:
# We can also apply this function on the whole DataFrame
articles.describe()

In [None]:
# Or transpose the output
articles.describe().transpose()

In [None]:
# Sum of all article prices
articles.PRICE.sum()

In [None]:
# Cumulative sum of article prices
articles.PRICE.cumsum()

In [None]:
# Get unique values
articles.PRICE.unique()

In [None]:
# get count of unique values
articles.PRICE.nunique()

It is also possible to calculate several functions in one statement:

In [None]:
articles.agg({"PRICE": ["min", "max", "nunique", "mean"]})

In [None]:
# We can also group by some column and apply a function based on this group
articles.groupby("TYPE").count().loc[:, "NAME"]

In [None]:
# Get list of all article names in a group
articles.groupby("TYPE").agg({"NAME": list})

In [None]:
# Get article names in a group as a comma separated string
articles.groupby("TYPE").agg({"NAME": ", ".join})

In [None]:
# Mean price per group; selecting "PRICE" first avoids averaging non-numeric columns
articles.groupby("TYPE")["PRICE"].mean()
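
Several aggregations per group can also be computed in one statement. The sketch below uses pandas' named aggregation on a shortened, made-up version of the articles table; the result column names (`n_articles`, `min_price`, `max_price`) are chosen for illustration only.

```python
import pandas as pd

fruit_and_veg = pd.DataFrame(
    {
        "NAME": ["Apple", "Banana", "Carrot", "Onion"],
        "PRICE": [0.3, 0.75, 0.5, 0.3],
        "TYPE": ["Fruit", "Fruit", "Vegetable", "Vegetable"],
    }
)

# Each keyword defines one result column as (source column, aggregation)
summary = fruit_and_veg.groupby("TYPE").agg(
    n_articles=("NAME", "count"),
    min_price=("PRICE", "min"),
    max_price=("PRICE", "max"),
)
print(summary)
```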

## 5. Content Modification <a id='modification'></a>

More often than not, it is desirable to change a `DataFrame`'s content. This could include adding a new row/column or modifying a single cell or multiple cells at once. Additionally, when having two `DataFrame`s, it may be desirable to merge, join, or concatenate them.

Thus, we take a look at the following:

1. Add a new row/column.
2. Modify a specific cell or multiple cells.
3. Delete row or column.
4. Merge two `DataFrame`s.
5. Join two `DataFrame`s.
6. Concatenate two or more `DataFrame`s.


In [None]:
df = df_original.copy()

### 1. Add a New Row or Column

Inserting new rows or columns is possible in different ways:
1. Select a row or column that does not yet exist and simply assign a value to it.
2. Insert a column at a specific position with `insert(loc, column, value)`.
3. Append new rows or columns with `concat`. More on this method later.

Insert a new row or column by selecting an index that does not yet exist:

In [None]:
df.columns

In [None]:
df["Exam Relevant"] = [False] + [True] * (df.shape[0] - 1)
df

In [None]:
df.loc[10, "Lecture Name"] = "Unnamed Lecture"
df

Add a column with `insert`:

In [None]:
new_values = [i for i in range(42, 42 + df.shape[0])]
df.insert(0, "Better Number", new_values)
df

### 2. Modify a Specific Cell or Multiple Cells

Modifying a single cell or multiple cells at once can be achieved with indexing (label-based or index-based), boolean masking, or a callable. For instance, set the "Done" column of all lectures to `True` with boolean masking:

In [None]:
df.loc[df["Done"] == False, "Done"] = True
df

TODO `apply`, `applymap`
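
As a hedged sketch of `apply`: `Series.apply` calls a function once per value, while `DataFrame.apply` with `axis=1` calls it once per row. The table and the new column names below are made up for illustration. `applymap` works element-wise on a whole `DataFrame`; recent pandas versions (2.1 and later) deprecate it in favor of `DataFrame.map`, so it is left out of this example.

```python
import pandas as pd

lectures_df = pd.DataFrame(
    {"Number": [1, 2, 3], "Lecture Name": ["OLAP", "Cluster", "Outlier"]}
)

# Series.apply: call a function once per value of a column
lectures_df["Name Length"] = lectures_df["Lecture Name"].apply(len)

# DataFrame.apply with axis=1: call a function once per row
lectures_df["Label"] = lectures_df.apply(
    lambda row: f"{row['Number']}: {row['Lecture Name']}", axis=1
)
print(lectures_df)
```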

### 3. Delete Row or Column

Deleting rows or columns can be done in different ways:
1. Selecting everything you want to keep and assign it to the same variable.
2. Using function `pop`. This function removes a column from a `DataFrame` and returns the removed column as a `Series`.
3. Using function `drop` to remove a row or column.

In [None]:
df = df.loc[df["Exam Relevant"] == True, :]
df

In [None]:
better_numbers = df.pop("Better Number")
df

In [None]:
better_numbers

In [None]:
df_wo_exam_relevant = df.drop(
    labels=["Exam Relevant"],
    axis=1,  # to drop a column. To drop a row set axis=0
    inplace=False,  # No inplace drop, this returns the new DataFrame
)
df_wo_exam_relevant

### 4. Merge Two `DataFrame`s

`merge` is a function to join two `DataFrame`s based on some keys (columns). It is similar to `JOIN` in SQL. pandas has another function for this purpose, called `join`. The differences are the following:
- `merge` merges or "joins" two `DataFrame`s based on columns or indexes.
- `merge` can sort join keys lexicographically and add suffixes to overlapping column names in the resulting `DataFrame`.
- By default, `join` performs a join based on the indices of the two `DataFrame`s.
- `join` uses `merge` internally when joining/merging index-on-index or column(s)-on-index.

Thus, `join` saves typing time when you want to join or merge two `DataFrame`s by their index.

Both `merge` and `join` support all SQL `JOIN` operations:

| **Method** | **SQL**            | **Description**                                       |
|------------|--------------------|-------------------------------------------------------|
| `left`     | `LEFT OUTER JOIN`  | Use keys from left `DataFrame` only                   |
| `right`    | `RIGHT OUTER JOIN` | Use keys from right `DataFrame` only                  |
| `outer`    | `FULL OUTER JOIN`  | Use union of keys on both `DataFrame`s                |
| `inner`    | `INNER JOIN`       | Use intersection of keys from both `DataFrame`s       |
| `cross`    | `CROSS JOIN`       | Create cartesian product of rows of both `DataFrame`s |
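
The difference between `merge` and `join` can be sketched on two tiny tables. The tables `left` and `right` and their column names are hypothetical and only serve this illustration.

```python
import pandas as pd

# Hypothetical toy tables sharing the column "key"
left = pd.DataFrame({"key": [1, 2, 3], "left_value": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_value": ["x", "y", "z"]})

# merge joins on a column; `how` selects the join type
inner = pd.merge(left, right, on="key", how="inner")  # keys present in both
outer = pd.merge(left, right, on="key", how="outer")  # union of all keys

# join works on the index by default, so the key is moved into the index first
joined = left.set_index("key").join(right.set_index("key"), how="left")

print(inner, outer, joined, sep="\n\n")
```

In the left join, key 1 has no partner in `right`, so its `right_value` is missing (`NaN`).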


In [None]:
orders = pd.DataFrame(
    data={
        "OID": list(range(1, 11)),
        "CID": [2, 2, 2, 4, 1, 9, 7, 7, 7, 7],
        "DATE": pd.date_range(start="2022-03-25", periods=10),
    }
)
orders

In [None]:
order_positions = pd.DataFrame(
    data={
        "OID": [
            1,
            1,
            1,
            2,
            2,
            2,
            2,
            2,
            3,
            3,
            3,
            3,
            4,
            5,
            5,
            6,
            6,
            6,
            6,
            6,
            7,
            7,
            7,
            7,
            7,
            8,
            8,
            9,
            10,
            10,
        ],
        "AID": [
            15,
            4,
            3,
            9,
            13,
            10,
            6,
            15,
            2,
            2,
            1,
            6,
            3,
            12,
            5,
            2,
            9,
            14,
            3,
            5,
            4,
            1,
            5,
            14,
            2,
            12,
            14,
            1,
            5,
            14,
        ],
        "UNIT": [
            4,
            3,
            11,
            13,
            11,
            1,
            8,
            8,
            6,
            2,
            9,
            17,
            15,
            2,
            18,
            6,
            19,
            14,
            19,
            16,
            6,
            11,
            6,
            9,
            11,
            1,
            7,
            15,
            2,
            9,
        ],
    }
)
order_positions

In [None]:
customers = pd.DataFrame(
    data={
        "CID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "NAME": [
            "Alice",
            "Bob",
            "Carol",
            "Dan",
            "Erin",
            "Frank",
            "Grace",
            "Heidi",
            "Ivan",
            "Judy",
        ],
    }
)
customers

In [None]:
articles = pd.DataFrame(
    data={
        "AID": range(1, 16),
        "NAME": [
            "Apple",
            "Banana",
            "Kiwi",
            "Clementine",
            "Strawberry",
            "Cherry",
            "Carrot",
            "Bell Pepper",
            "Onion",
            "Salad",
            "Tomato",
            "Cucumber",
            "Spinach",
            "Water Melon",
            "Garlic",
        ],
        "PRICE": [
            0.3,
            0.75,
            0.6,
            0.5,
            0.45,
            0.4,
            0.5,
            0.7,
            0.3,
            0.4,
            0.45,
            0.3,
            0.6,
            1.5,
            0.3,
        ],
    }
)
articles

Recap:

| **Method** | **SQL**            | **Description**                                       |
|------------|--------------------|-------------------------------------------------------|
| `left`     | `LEFT OUTER JOIN`  | Use keys from left `DataFrame` only                   |
| `right`    | `RIGHT OUTER JOIN` | Use keys from right `DataFrame` only                  |
| `outer`    | `FULL OUTER JOIN`  | Use union of keys on both `DataFrame`s                |
| `inner`    | `INNER JOIN`       | Use intersection of keys from both `DataFrame`s       |
| `cross`    | `CROSS JOIN`       | Create cartesian product of rows of both `DataFrame`s |


We can easily merge/join the `DataFrame`s `orders` and `customers` to add the names to each order:

In [None]:
pd.merge(orders, customers, on="CID")

Do we have customers who have not bought anything yet?

In [None]:
outer_join = pd.merge(orders, customers, on="CID", how="outer")
customers_no_buy = outer_join[outer_join["OID"].isnull()]
customers_no_buy[["CID", "NAME"]]

<div class="alert alert-success" role="alert">

### TODO Exercise: Join `DataFrame`s `orders` and `order_positions`.

</div>

In [None]:
# TODO

<div class="alert alert-success" role="alert">

### TODO Exercise: Reuse Previous `DataFrame` to Calculate Sum of each Order.

</div>

In [None]:
# TODO

<div class="alert alert-success" role="alert">

### TODO Exercise: Reuse Previous `DataFrame` and Join Customers.

</div>

In [None]:
# TODO

<div class="alert alert-success" role="alert">

### TODO Exercise: What Articles Have Not Been Sold Yet?

</div>

In [None]:
# TODO

<div class="alert alert-success" role="alert">

### TODO Exercise: What Articles Have Been Sold the Most (Units)?

</div>

In [None]:
# TODO

<div class="alert alert-success" role="alert">

### TODO Exercise: What Articles Have the Most Revenue?

</div>

In [None]:
# TODO

### 6. Concatenate Two or More `DataFrame`s
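
As a minimal sketch of `concat`, assuming small made-up tables: `pd.concat` takes a list of `DataFrame`s and stacks them along `axis=0` (rows, the default) or `axis=1` (columns).

```python
import pandas as pd

first_batch = pd.DataFrame({"CID": [1, 2], "NAME": ["Alice", "Bob"]})
second_batch = pd.DataFrame({"CID": [3, 4], "NAME": ["Carol", "Dan"]})

# Stack rows; ignore_index=True renumbers the rows from 0
all_rows = pd.concat([first_batch, second_batch], ignore_index=True)

# Place DataFrames side by side; rows are aligned on the index
extra_column = pd.DataFrame({"ACTIVE": [True, False]})
side_by_side = pd.concat([first_batch, extra_column], axis=1)

print(all_rows, side_by_side, sep="\n\n")
```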

## 6. Data Cleaning Operations <a id='cleaning'></a>

1. `fillna` (fill in missing values, e.g. with a constant; `interpolate` fills them by interpolation)
2. `dropna` (remove rows or columns with missing values)
3. `drop_duplicates` (remove duplicate rows)
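
The three operations can be sketched on a small table with one duplicated row and one missing value. The data is made up, and filling with the column mean is just one possible choice of fill value.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame(
    {
        "NAME": ["Apple", "Banana", "Banana", "Kiwi"],
        "PRICE": [0.3, 0.75, 0.75, np.nan],
    }
)

# Remove the repeated "Banana" row
without_duplicates = raw.drop_duplicates()

# Remove rows containing a missing value (here: "Kiwi")
without_missing = raw.dropna()

# Replace the missing price with the mean of the known prices
filled = raw.fillna({"PRICE": raw["PRICE"].mean()})

print(without_duplicates, without_missing, filled, sep="\n\n")
```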

## 7. Read and Write Data <a id='read_write'></a>

### Other Data Adapters

1. `to_dict` (Python dict)
2. `to_records` (numpy record array)
3. `to_pickle` (Python pickle file)
4. `to_parquet` (binary Parquet format)
5. `to_feather` (binary Feather format)
6. `to_hdf` (HDF5 format)
7. `to_stata` (Stata dta format)
8. `to_gbq` (Google BigQuery table)
9. `to_clipboard` (copies the object to the clipboard)
10. `to_html` (render as an HTML table)
11. `to_markdown` (render as a Markdown table)
12. `to_latex` (render as a LaTeX table)
13. `to_string` (render as a table for the terminal)
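
A minimal sketch of writing and reading data, assuming CSV as the exchange format: `to_csv` and `read_csv` are not part of the list above but follow the same naming pattern. An in-memory text buffer stands in for a file path so that the example leaves no files behind; the table is made up.

```python
import io

import pandas as pd

customers_small = pd.DataFrame({"CID": [1, 2], "NAME": ["Alice", "Bob"]})

# Write CSV text into an in-memory buffer (a file path works the same way)
buffer = io.StringIO()
customers_small.to_csv(buffer, index=False)

# Rewind the buffer and read the text back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)

# Two of the adapters from the list above
as_records = customers_small.to_dict(orient="records")
as_text = customers_small.to_string(index=False)

print(restored, as_records, as_text, sep="\n\n")
```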

## 8. Database Access <a id='database_access'></a>
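
As a minimal sketch of database access, assuming an in-memory SQLite database from Python's standard library: `to_sql` writes a `DataFrame` into a table and `read_sql` returns the result of a query as a `DataFrame`. The table name `customers` and the data are hypothetical.

```python
import sqlite3

import pandas as pd

customers_small = pd.DataFrame(
    {"CID": [1, 2, 3], "NAME": ["Alice", "Bob", "Carol"]}
)

# In-memory SQLite database, used here only to keep the sketch self-contained
connection = sqlite3.connect(":memory:")

# Write the DataFrame into a table called "customers"
customers_small.to_sql("customers", connection, index=False)

# Run a SQL query and get the result back as a DataFrame
result = pd.read_sql(
    "SELECT CID, NAME FROM customers WHERE CID >= 2 ORDER BY CID", connection
)

connection.close()
print(result)
```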

## What's More?
Check out the extensive [user guide](https://pandas.pydata.org/docs/user_guide/index.html#user-guide).