# Pandas and Matplotlib (Part 2)

In [None]:
# Required packages
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
# from matplotlib import pyplot as plt  # also possible

# only in jupyter notebooks: Embed plots inside Jupyter notebooks
%matplotlib inline

f'{pd.__version__=}, {np.__version__=}'

## 4 Statistics on Series
* Series have a number of methods for performing basic statistics
* in this lecture:
    * `[.size, .count(), .sum()]`
    * `[.mean(), .median(), .std()]`
    * `[.max(), .min(), .quantile(), .describe()]`
    * `[.head(), .tail(), .sample()]`
    * `[.value_counts(), .unique(), .duplicated()]`

### 4.1 `.size`, `.count()`, and `.sum()`
* `.size` is an attribute of the series
* `.count()` and `.sum()` depend on the *contents*
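
As a minimal sketch of this difference (the values below are made up for illustration): `.size` counts every entry, while `.count()` and `.sum()` skip missing values.

```python
import numpy as np
import pandas as pd

# Illustrative values, not taken from the lecture data
with_gap = pd.Series([1.0, np.nan, 3.0])

n_entries = with_gap.size    # number of entries, including NaN
n_valid = with_gap.count()   # number of non-missing entries
total = with_gap.sum()       # NaN is skipped by default

print(n_entries, n_valid, total)
```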

In [None]:
integers = pd.Series(np.random.randint(0, 101, 5))
integers

In [None]:
integers.size

In [None]:
integers.count()

In [None]:
integers.loc[0] = np.nan  # the alias `np.NaN` was removed in NumPy 2.0
integers

In [None]:
integers.count(), integers.size

In [None]:
integers.loc[0] = np.random.randint(0, 101)  # a single random integer, not an array
integers.sum()

In [None]:
sum(integers)

In [None]:
help(pd.Series.sum)

### 4.2 `.mean()`, `.median()`, `.std()`
* Series provide methods for simple statistics
    * arithmetic `mean` $\mu$ (average value of numeric Series)
    * `median` value (half of the values are above, half are below)
    * standard deviation $\sigma$ (measure of the "spread")
        * with *degrees of freedom* $\Delta_\text{dof}$: default 1, **differing from [numpy.std](https://numpy.org/doc/stable/reference/generated/numpy.std.html)** (where `ddof=0` by default)
        * pass `ddof=0` for the "uncorrected" standard deviation
    $$\mu = \frac{1}{N} \sum_{i=1}^N s_i~~~~~~~~;~~~~~~~~\sigma = \sqrt{\frac{1}{N-\Delta_{\text{dof}}} \sum_{i=1}^N (s_i - \mu)^2}$$
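
The formula can be checked against both libraries. The sketch below uses made-up values and a hypothetical helper `std_by_hand`. It is meant as an illustration of the `ddof` parameter, not as a reference implementation.

```python
import numpy as np
import pandas as pd

sample = pd.Series([1.0, 2.0, 3.0, 4.0])  # illustrative values


def std_by_hand(s: pd.Series, ddof: int) -> float:
    """Standard deviation following the formula above."""
    mu = s.sum() / s.size
    return float(np.sqrt(((s - mu) ** 2).sum() / (s.size - ddof)))


corrected = std_by_hand(sample, ddof=1)    # matches the pandas default
uncorrected = std_by_hand(sample, ddof=0)  # matches the numpy default

print(corrected, sample.std())
print(uncorrected, np.std(sample.to_numpy()))
```

The corrected value is always the larger one, since the sum of squares is divided by a smaller number.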

In [None]:
int_range = pd.Series(np.arange(0, 101, 1))
int_range

In [None]:
f'{int_range.mean()=}, {int_range.median()=}, {int_range.std()=}'

In [None]:
# The mean is the sum divided by the size
f'mean = {int_range.sum() / int_range.count()};   {int_range.mean() = }'

In [None]:
# The median is the "halfway point" of the sorted values
int_range.loc[100] = 1000000
int_range.mean(), int_range.median()

In [None]:
# The Standard Deviation is tricky!
# By default, Pandas uses the "sample standard deviation"
pd.Series((0,)).std(), pd.Series((0, 0)).std(), pd.Series((0, 2)).std()  # note: `(0)` is just the int 0, `(0,)` is a tuple

In [None]:
# If we want the uncorrected ("numpy") standard deviation, we have to pass ddof=0
pd.Series((0,)).std(ddof=0), pd.Series((0, 0)).std(ddof=0), pd.Series((0, 2)).std(ddof=0)

In [None]:
help(pd.Series.std)

### 4.3 `.max()`, `.min()`, `.quantile()`, `.describe()`
* `.max()`, `.min()` retrieve the maximum and minimum numeric elements
* `.quantile(q=0.5)` returns the value at a given quantile $q$ (in $[0, 1]$)
    * a proportion of $q$ (and $1-q$, respectively) of series values is below (above) the result
    * minimum, maximum values at q = 0, 1
* `.describe()` gives summarizing statistics
    * count and type of values
    * mean and std
    * min, max
    * median and 1st, 3rd quartiles
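
When a quantile falls between two data points, pandas interpolates linearly by default. The following sketch reproduces this by hand for $q = 0.25$. The variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

points = pd.Series([0, 10, 20, 30, 40, 50])
q = 0.25

# Fractional position of the quantile among the sorted values
position = q * (points.size - 1)
lower = int(np.floor(position))   # index of the neighbour below
upper = int(np.ceil(position))    # index of the neighbour above
fraction = position - lower

sorted_values = points.sort_values().to_numpy()
by_hand = sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

print(by_hand, points.quantile(q))
```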


In [None]:
int_range

In [None]:
int_range.min(), int_range.max()

In [None]:
int_range.describe()

In [None]:
points = pd.Series((0, 10, 20, 30, 40, 50))
points.quantile(0.6)

In [None]:
points.quantile(0.5)

In [None]:
help(pd.Series.quantile)

### 4.4 `.head()`, `.tail()`, `.sample(n)`
* To preview a series, it's often helpful to look at the first and last few values
    * do I have the correct data?
    * was the series read in or processed correctly?
    * have headers and footers been removed?
* `.head()` and `.tail()` are not random: to draw a sample from your data, use `.sample()` instead
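
A short sketch of the difference between previewing and sampling, on made-up sorted data. The seed `random_state=42` is an arbitrary choice that makes the draw reproducible.

```python
import numpy as np
import pandas as pd

ordered = pd.Series(np.arange(100))  # illustrative, sorted data

first_rows = ordered.head(5)                       # always the first five entries
random_rows = ordered.sample(5, random_state=42)   # five randomly chosen entries
same_rows = ordered.sample(5, random_state=42)     # same seed, same draw

print(first_rows.tolist())
print(random_rows.tolist())
```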

In [None]:
int_range.head()

In [None]:
int_range.tail(2)

In [None]:
int_range.sample(4)

### 4.5 `.value_counts()`, `.unique()`, `.duplicated()`
* These methods are useful with *non-numeric* data
    * e.g. a corpus of words
* `.value_counts()` returns a Series with the frequency of each distinct value

In [None]:
words = pd.Series('In Ulm und um Ulm und um Ulm herum'.split())
words

In [None]:
words.describe()

In [None]:
words.value_counts(normalize=True, ascending=True)

In [None]:
words.value_counts()[words.value_counts() == 2]

In [None]:
words.value_counts()[words.value_counts() > 1]

In [None]:
words.duplicated()

In [None]:
words[~words.duplicated()]

In [None]:
words[~words.duplicated()].values

In [None]:
words.unique()  # watch out: numpy array!

### 4.6 Plots for statistical insight
* *histograms* visualize the distribution of points, separated into bins
    * displays distribution of one-dimensional data points
    * quick diagnostic tool: range of values, most common value
* *box plots* display the most important statistical parameters of a series
    * median, quartiles, range of values, and outliers
    * useful e.g. to compare across measurements

* may be applied even in scatterplots: show *marginal* distribution of data
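
The quantities shown in a box plot can also be computed directly. The sketch below uses made-up measurements and assumes matplotlib's default whisker length of 1.5 times the interquartile range. Points beyond these fences are drawn as outliers.

```python
import pandas as pd

measurements = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # illustrative values

q1 = measurements.quantile(0.25)
q3 = measurements.quantile(0.75)
iqr = q3 - q1  # interquartile range: the height of the box

# Assumption: default whisker length of 1.5 * IQR
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = measurements[(measurements < lower_fence) | (measurements > upper_fence)]
print(q1, measurements.median(), q3, outliers.tolist())
```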

In [None]:
number_of_samples = 100
normal_sample = pd.Series(np.random.randn(number_of_samples))
normal_sample.describe()

In [None]:
normal_sample.plot.hist(
    title='random sample of the normal distribution',
    bins=20,
    density=False,
    label='y',
    legend=True,
)

In [None]:
help(plt.hist)

In [None]:
normal_sample.plot.box(
    legend=True,
    title='random sample of the normal distribution',
    ylabel='(y - μ) / σ',
    label=f'{number_of_samples} values',
)

In [None]:
help(plt.boxplot)

### 4.7 Tasks

##### 1. Standard deviation
Compute the standard deviation of `Series((0, 2))` by hand, with numpy, and with pandas. What do you notice? What do you have to do to obtain consistent results?

The standard deviation of a set of values $\{x_i\}_{i = 1,\dots, N}$ is computed as follows:
$$
\sigma = \sqrt{\frac{1}{N - \Delta_\text{dof}}\sum_{i = 1}^N (x_i - \mu)^2},
$$
where $\mu$ is the mean value:
$$
\mu = \frac{1}{N}\sum_{i=1}^N x_i
$$

##### 2. Key figures of a Series
Create a Series with 400 random integers between 0 and 100.
* First, get an overview with the `.describe` method.
* Which values are the most frequent? Which values occur least often?
* How often do the "full tens" occur, i.e. 0, 10, 20, ...?
* Are there values between 0 and 100 that do not occur at all? If so, which ones?

## 5 Data types and Missing Values

### 5.1 Series data types
* internally, Series (and indices) use numpy datatypes
* important implications for "big data":
    * storage requirement differs widely
    * overflow, precision
* at creation, pandas determines a "fitting" dtype
    * by default a numeric type, `bool`, `datetime64`/`timedelta64`, or "object"; other types (e.g. `category`) on request
* Series are "flexible"
    * assignment can change the Series data type
    * easy typecasting with `.astype`
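
The trade-off between storage and overflow can be sketched with a small example. The values are chosen so that one product exceeds the `int8` range of -128 to 127.

```python
import numpy as np
import pandas as pd

small = pd.Series([20, 30, 40], dtype=np.int8)  # 1 byte per value
large = small.astype(np.int64)                  # 8 bytes per value

print(small.to_numpy().nbytes, large.to_numpy().nbytes)

wrapped = small * 4   # 160 does not fit into int8 and wraps around silently
correct = large * 4

print(wrapped.tolist(), correct.tolist())
```

No error or warning is raised for the wrapped value, so the dtype has to be chosen with the expected value range in mind.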

In [None]:
integers = pd.Series([20, 30, 40])
integers

In [None]:
integers.dtype

In [None]:
integers = pd.Series([20, 30, 40], dtype=np.int32)
integers

In [None]:
integers = integers.astype(np.int8)
integers

In [None]:
integers * 4

In [None]:
floats = pd.Series([4, 5, 6], dtype=np.float64)
floats, floats.values

In [None]:
objects = pd.Series('using a string yields dtype "object"'.split())
objects

In [None]:
integers

In [None]:
integers.loc[0] = 1.23  # Careful: assignment with `.loc` can change the Series dtype (recent pandas versions warn about this upcast)
integers

In [None]:
integers.loc[3] = '56'
integers

In [None]:
integers.sum()  # raises TypeError: Python cannot add "str" and "float"

In [None]:
integers = integers.astype(np.float32)
integers

In [None]:
integers.sum()

In [None]:
integers.astype(np.int8)

In [None]:
strings = pd.Series('12 34 56 78'.split())
strings.sum()
strings

In [None]:
strings.mean()

In [None]:
strings = strings.astype(np.int64)  # need to assign back. Not in-place!
strings.mean()

In [None]:
boolean_mask = integers < 50
boolean_mask

In [None]:
pd.Series(
    ['big', 'small', 'big', 'big', 'big', 'small', 'big', 'small', 'small'],
    dtype='category',
)

In [None]:
from datetime import datetime, timedelta

dates = pd.Series([datetime.fromisoformat(f'2021-04-{day:02}') for day in range(1, 31)])
times = pd.Series([timedelta(hours=day) for day in range(0, 30)])
dates.head(), times.head()

In [None]:
(dates + times).head()

### 5.2 Missing Values
* `np.nan` is a special `float` value (IEEE 754 "not a number")
* designates missing or undefined data
    * "no response" in survey data
    * result of certain invalid operations (e.g. `0 / 0`)
* pandas provides functions and parameters for skipping `NaN` values


* Experimental: pandas provides its own "missing value": `pd.NA`
    * can be used with nullable dtypes such as `Int64`, so integers with gaps stay integers
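
A brief sketch of both kinds of missing values, with made-up data. It also shows why missing values should be detected with `isna` rather than with `==`.

```python
import numpy as np
import pandas as pd

# With the default numpy dtype, a missing value forces integers to become floats
as_float = pd.Series([1, 2, np.nan])

# With the nullable extension dtype 'Int64', the values stay integers
as_nullable_int = pd.Series([1, 2, pd.NA], dtype='Int64')

print(as_float.dtype, as_nullable_int.dtype)

# `==` cannot detect NaN, since NaN is not equal to itself; use `isna` instead
print((as_float == np.nan).any(), as_float.isna().any())
```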

In [None]:
np.nan, pd.NA

In [None]:
miss = pd.Series([1, 2, np.nan, 4, 5])  # , dtype='Int64')
miss

In [None]:
miss.size, miss.count()

In [None]:
type(np.nan)

In [None]:
np.nan + 10

In [None]:
np.nan == np.nan

In [None]:
np.nan is np.nan

#### 5.2.1 Dealing with NaN
* numpy typically does *not* deal with `nan` itself: it propagates through computations
* pandas tries to accommodate missing data as best it can
    * in `sum`, `nan` is skipped (i.e. effectively treated as 0) and does not add to `count`
* different methods are applicable in different situations:
    * `dropna` removes missing values
    * `fillna`, `interpolate` fill in missing values
    * `fill_value` prevents creation of missing values
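
The following sketch contrasts these behaviours on a small made-up Series. It also hints at a side effect of imputation: filling with the mean keeps the mean but shrinks the standard deviation.

```python
import numpy as np
import pandas as pd

gaps = pd.Series([1.0, np.nan, 3.0])

numpy_sum = gaps.to_numpy().sum()     # numpy propagates the missing value
pandas_sum = gaps.sum()               # pandas skips it by default
strict_sum = gaps.sum(skipna=False)   # opt out of skipping

dropped = gaps.dropna()               # remove the gap
filled = gaps.fillna(gaps.mean())     # fill the gap with the mean

print(numpy_sum, pandas_sum, strict_sum)
print(dropped.tolist(), filled.tolist())
print(gaps.std(), filled.std())
```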

In [None]:
miss.values.sum()  # this sums the numpy array

In [None]:
miss.sum()  # this skips na by default!

In [None]:
help(pd.Series.sum)

In [None]:
many_missing_values = pd.Series(
    [-15, np.nan, 4, 9, np.nan, 1, -10, -12, np.nan, np.nan, 3, 13, 25, np.nan]
)
many_missing_values

In [None]:
many_missing_values.dropna()  # This gets rid of `nan` values

In [None]:
many_missing_values.fillna(many_missing_values.mean())

####  5.2.2 fill_value
* numeric operations have optional parameter `fill_value`
* prevents creation of `nan` values when index is not available
* which value is appropriate depends on the operation and context
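
As a sketch of how the appropriate `fill_value` depends on the operation, consider a made-up stock example: 0 is neutral for addition, 1 for multiplication.

```python
import pandas as pd

stock = pd.Series([5, 7], index=['apples', 'pears'])
factor = pd.Series([3], index=['pears'])

plain = stock + factor                     # 'apples' has no partner and becomes NaN
added = stock.add(factor, fill_value=0)    # a missing summand counts as 0
scaled = stock.mul(factor, fill_value=1)   # a missing factor counts as 1

print(plain.to_dict(), added.to_dict(), scaled.to_dict())
```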

In [None]:
revenue = pd.Series([1, 2, 3, 4, 5], index=[2017, 2018, 2019, 2020, 2021])
expenses = pd.Series([2, 3, 3, 1], index=[2017, 2018, 2019, 2020])
revenue - expenses  # value for 2021 is missing!

In [None]:
revenue.sub(expenses, fill_value=0)

#### 5.2.3 interpolate
* pandas can interpolate missing values
* various methods available
    * e.g. linear, splines (for smooth plotting), or time-aware interpolation
* warning: this "invents" data
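
A minimal sketch of linear interpolation on made-up measurements. Keeping the mask of originally missing positions is one way to remember which values were measured and which were "invented".

```python
import numpy as np
import pandas as pd

measured = pd.Series([0.0, np.nan, np.nan, 3.0, np.nan, 7.0])

# Remember which positions were missing before filling them
invented = measured.isna()

# Linear interpolation places missing points on a straight line between neighbours
filled = measured.interpolate(method='linear')

print(filled.tolist())
print(filled[invented].tolist())
```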

In [None]:
# note: the methods 'akima' and 'spline' rely on SciPy being installed
many_missing_values.interpolate(method='akima').plot.line(label='akima', legend=True)
many_missing_values.interpolate(method='linear').plot.line(label='linear', legend=True)
many_missing_values.interpolate(method='spline', order=3).plot.line(
    label='spline', legend=True
)

many_missing_values.plot.line(style='o', legend=True, label='measured')

### 5.3 Tasks

##### 1. Occurrence in operations
Create three Series with normally distributed random numbers:
* a Series with 4 values and the indices 'abcd',
* a Series with 5 values and the indices 'abcde', and
* a Series with 6 values and the indices 'abcabc'.

Multiply and add the Series pairwise.
* How can you sensibly handle `nan` values in each case, or avoid them?
* What influence do these approaches have on the mean and standard deviation?

##### 2. `size` and imputing missing values
Create a `Series` with 1000 random `float`s between 0 and 1000.
* Replace all values < 100 and all values > 900 with `np.nan`.
    * What are the sum, the standard deviation, and the mean of the Series?
    * What is its `size`, and how many values remain?
* Replace all values that are now missing with the mean of the remaining values.
    * How do the sum, mean, and standard deviation change as a result?

## 6 Transformations
* Series values and indices are mutable
    * can easily be re-assigned
    * typical operations still create new instances
    * `inplace=True` is discouraged

* more comprehensive transformations need dedicated methods
    * replace
        * `Series.replace` leaves values without a replacement rule *unchanged*
        * `Series.map` turns values without a replacement rule into `NaN`
        
    * condense
        * `Series.cumsum` adds progressively
        * `Series.aggregate` (or `Series.agg`) returns a scalar value
        
    * sort
        * `Series.sort_values` sorts by series *values*
        * `Series.sort_index` sorts by series *index*
    
    * manipulate
        * `Series.apply` uses a single function
        * `Series.transform` uses one or more functions, "string functions", or dicts

### 6.1 Replace and map
* Replace values with different values according to a replacement rule
* for the difference, see also https://stackoverflow.com/a/62947436
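
The difference shows up as soon as the replacement rule is incomplete. A minimal sketch with made-up codes and a hypothetical `rule` dict:

```python
import pandas as pd

codes = pd.Series([0, 10, 20, 30])
rule = {10: 100, 20: 200}   # no entry for 0 and 30

replaced = codes.replace(rule)   # values without a rule stay as they are
mapped = codes.map(rule)         # values without a rule become NaN

# One way to combine both: map, then fall back to the original values
mapped_with_fallback = codes.map(rule).fillna(codes).astype(codes.dtype)

print(replaced.tolist(), mapped.tolist(), mapped_with_fallback.tolist())
```

Because of the `NaN` entries, the mapped Series is converted to floats, which is why the fallback variant casts back to the original dtype.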

#### `Series.replace`
- can utilize strings or regular expressions
- may give two positional arguments: replace first with second
- may also give a mapping (dict or Series)
- all values not explicitly given are left unchanged

In [None]:
help(pd.Series.replace)

In [None]:
strings = pd.Series('Er sah das Wasser as'.split())
strings

In [None]:
strings.replace('as', 'an')

In [None]:
strings.replace('^.s$', 'an', regex=True)  # with a regular expression

In [None]:
integers = pd.Series((0, 10, 20, 30))
integers.replace(0, 1000)  # with two values

In [None]:
integers.replace({10: 100, 20: 200, 50: 10})  # with a dict

In [None]:
replacement = pd.Series(data=(5, 6, 7), index=(0, 10, 20))
print(integers, replacement)
integers.replace(replacement)  # with a Series

#### `Series.map`
- accepts a Series, dict, or function
    - `Series` with old values in the index
    - `dict` with old values: new values as key-value pairs
    - function with a single argument: similar to `apply` (see below)
- if a value is not found, it is replaced with `NaN`

In [None]:
help(pd.Series.map)

In [None]:
integers

In [None]:
replacement

In [None]:
integers.map(replacement)

In [None]:
integers.map({10: 100, 20: 200, 50: 10})

### 6.2 Condense
- `Series.cumsum` cumulates values
- `Series.mean`, `Series.std` for statistics
- `Series.all`, `Series.any` for truthiness
- `Series.agg` with arbitrary functions

#### `Series.cumsum`
- adds up values cumulatively
- sometimes useful in statistics
- returns a Series of the sum up to each index

In [None]:
help(pd.Series.cumsum)

In [None]:
errors = pd.Series(
    (1, 1, 0, 0, 2, 2, 1), index=pd.date_range(start='2021-04-01', periods=7)
)
errors

In [None]:
errors.cumsum()

#### `Series.aggregate`
* applies a function to a Series
    * returns a single value
* applies a *list of* functions
    * returns a *Series of* values

In [None]:
integers.agg('max')

In [None]:
integers.agg('std', ddof=1)

In [None]:
random_normal = pd.Series(1.87 * np.random.randn(1000))
random_normal.agg(['product', 'sum'])

In [None]:
random_normal.agg(
    [
        pd.Series.count,
        pd.Series.mean,
        pd.Series.std,
        pd.Series.min,
        pd.Series.max,
        pd.Series.quantile,
        pd.Series.quantile,
    ],
    q=0.25,
)

* note: relatively trivial when applied to Series
* more interesting with DataFrames
    * precise control over axis of aggregation
    * different methods for different columns, at once

In [None]:
help(pd.Series.agg)

### 6.3 Sort
* a very basic way to manipulate data
* small values first, big values later
    * may pass custom key to sort data by
* may introduce bias and destroy information
    * order of items may convey "hidden" information
    * instead, we suggest a new order which was never there
* sorting the underlying numpy arrays is faster

In [None]:
help(pd.Series.sort_values)

#### `Series.sort_values`
- returns a new Series, sorted by value
    - sorting order depends on datatype
- may sort in `ascending` order (or not)
- may `ignore_index` to create a new numeric index
- may sort missing values first or last
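
A short sketch of these options on made-up data. It also shows how the sorting order depends on the datatype: strings are compared character by character, numbers by value.

```python
import numpy as np
import pandas as pd

as_strings = pd.Series(['100', '11', '2'])
as_numbers = as_strings.astype(np.int64)

print(as_strings.sort_values().tolist())   # lexicographic order
print(as_numbers.sort_values().tolist())   # numeric order

with_gap = pd.Series([3.0, np.nan, 1.0])
descending = with_gap.sort_values(ascending=False, na_position='first', ignore_index=True)
print(descending.tolist())
```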

In [None]:
pd.Series((100, 11, 1)).sort_values()

In [None]:
pd.Series(('100', '11', '1')).sort_values()

In [None]:
pd.Series(('100', '11', '1')).sort_values(ignore_index=True)

In [None]:
random_normal.sort_values(na_position='first')

In [None]:
random_normal.sort_values(ascending=False)

In [None]:
%%timeit
np.random.seed(1)
values = pd.Series(np.random.randn(10_000_000)).sort_values()

In [None]:
%%timeit
np.random.seed(1)
values = np.random.randn(10_000_000)
values.sort()

#### `Series.sort_index`
- we may also sort by index
- when a Series has more than one index level (a `MultiIndex`), levels are sorted sequentially
    - may give indices to sort first by
    - may ignore other indices

In [None]:
random_normal.sort_index()

In [None]:
help(pd.Series.sort_index)

#### sort functions
- so far only sorted "from small to big" and "from big to small"
- arbitrary functions can be given as "key" to sort by
- series are then sorted by function output
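
Note that the key function is applied to the whole Series at once and has to return one sort value per entry. A minimal sketch, with a hypothetical key function `magnitude` that sorts by absolute value:

```python
import pandas as pd

deviations = pd.Series([-3, 1, -2, 4])


def magnitude(values: pd.Series) -> pd.Series:
    """Sort key: receives the whole Series, returns one sort value per entry."""
    return values.abs()


by_magnitude = deviations.sort_values(key=magnitude)
print(by_magnitude.tolist())
```

The result keeps the original values and index labels. Only their order is determined by the key.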

In [None]:
def sort_function(s: pd.Series) -> pd.Series:
    # the key function receives the whole Series and returns one of the same length
    return np.sin(abs(s))

In [None]:
random_normal.sort_values(key=sort_function)

### 6.4 Apply and Transform
- invoke a function on the values
    - operates on *one row at a time*
    - may provide additional keyword args
- for the difference, see https://towardsdatascience.com/difference-between-apply-and-transform-in-pandas-242e5cf32705

In [None]:
help(pd.Series.apply)

In [None]:
help(pd.Series.transform)

#### `Series.transform`
(single Series $\rightarrow$ multiple results)
- may use a (numpy or python) function, a 'string function', a list of functions, or a dict
- cannot use to aggregate Series (result has same length as input)
- may only use a single Series at a time
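
The length-preserving behaviour can be sketched as follows, with made-up numbers. The `try`/`except` block assumes that pandas rejects a reducing function with a `ValueError`.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1.0, 4.0, 9.0])

roots = numbers.transform(np.sqrt)               # one function -> Series
table = numbers.transform([np.sqrt, np.square])  # list of functions -> DataFrame

print(roots.tolist())
print(table.columns.tolist(), table.shape)

# A reducing function changes the length, so `transform` refuses it
try:
    numbers.transform('sum')
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```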

In [None]:
integers = pd.Series((10, 20, 30))

In [None]:
integers.transform(np.exp)  # numpy ufunc

In [None]:
def loglin(x, base=np.e):
    return x * np.log(x) / np.log(base)


integers.transform(loglin, base=10)  # Python function

In [None]:
integers.transform('sqrt')  # "string function": Pandas looks these up

In [None]:
integers.transform(['sqrt', np.square, loglin])  # output is a `pd.DataFrame`!

#### `Series.apply`
(multiple Series $\rightarrow$ single result)
- may *only* use a numpy ufunc, string function, or a Python function (no list or dict!)
- may use multiple Series (of a DataFrame) at a time
- may produce aggregated results
- may automatically convert the data type

In [None]:
integers.apply(np.sqrt)  # `Series` result

In [None]:
integers.apply('prod')  # scalar result, does not work with Series.transform

In [None]:
integers.apply(loglin, base=10, convert_dtype=False)  # note: `convert_dtype` is deprecated since pandas 2.1

In [None]:
transformed_ints = integers.transform(['sqrt', np.square, loglin])
transformed_ints

In [None]:
transformed_ints.apply('sum')  # returns a Series

In [None]:
transformed_ints.apply('sum', axis=1)

In [None]:
transformed_ints

In [None]:
def weigh_by_index(x: pd.Series) -> pd.Series:  # with axis=0, `x` is one column
    return x.values * x.index


transformed_ints.apply(weigh_by_index, axis=0)

In [None]:
def my_condensation(x: pd.Series) -> float:  # with axis=1, `x` is one row
    return x['sqrt'] + (x['square'] * x['loglin'])


transformed_ints.apply(my_condensation, axis=1)

In [None]:
transformed_ints.transform(my_condensation, axis=1)

### 6.5 Tasks

Consider the `Series` named `ints` with a manually set index and randomly chosen values.

In [None]:
np.random.seed(1)
ints = pd.Series(np.random.randint(0, 10, 10), index=range(0, 20, 2))
ints

##### 1. `replace`, `map`, and `aggregate`
- `replace`, `map`
    - What do you get with `ints.replace(ints)`? Why do you get many odd values?
    - What do you get with `ints.map(ints)`? Why does the result have missing values?
- aggregate
    - Write functions `sum_odd` and `sum_even` that add up the odd and even values of a Series, respectively. Use `Series.aggregate` to create a new Series containing the sum of the odd values, of the even values, and of all values.

##### 2. `sort`, `apply`, and `transform`
- `sort`
    - Sort the values. How do `kind='heapsort'` and `kind='quicksort'` differ?
- `apply`, `transform`
  - Apply (with `Series.apply`) the functions `np.log`, `np.exp`, `'sqrt'`, and `'square'` to the values so that you obtain a `DataFrame`.
  - Apply the function `'sum'` to this `DataFrame` such that you obtain a Series with the same index whose values are the sums across all columns.
  - How can you obtain the same final result with `Series.transform`?

## 7 Conclusion: Pandas Series
- `Series` are a powerful class for one-dimensional data
    - numpy arrays with generalized indexing capabilities
    - a plenitude of methods are available
        * easily construct, plot, and transform data
        * do simple statistics and deal with missing values
        * much more than shown here: read the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/series.html)


- `Series` cannot deal well with multi-dimensional data
    - scatterplots are a pain
    - we need to (ab-)use the index
    - we cannot hold all our data (e.g. measurements) in one object


- up one level: `DataFrame`s!