# Summarising and Plotting Statistics

After a successful run of `run_topostats` you will have an `all_statistics.csv` file that contains a summary of various
statistics about the detected molecules across all image files that were processed. The class
`topostats.plotting.TopoSum` uses this file to generate plots automatically, and the convenience command
`toposum` provides an entry point to re-run the plotting at the command line.

Inevitably there will be a point where you want to tweak plots, for publication or otherwise, in ways that this
scripted approach does not support, because making every single option from
[Seaborn](https://seaborn.pydata.org/) and [Matplotlib](https://matplotlib.org/) accessible via this class would
require a considerable amount of [boilerplate code](https://en.wikipedia.org/wiki/Boilerplate_code). Instead, such
plots should be generated and tweaked interactively in a notebook. This notebook serves as a sample showing how to use
the `TopoSum` class and gives some examples of creating plots directly using [Pandas](https://pandas.pydata.org/).

If you are unfamiliar with these packages it is recommended that you read their documentation. It is worth bearing in
mind that both Pandas and Seaborn build on the basic functionality that Matplotlib provides, offering easier methods
for generating plots. If you are stuck doing something with either of them, refer to the Matplotlib documentation for
how to achieve what you are trying to do.

* [Pandas](https://pandas.pydata.org/docs/)
* [10 minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html)
* [Chart visualization — pandas](https://pandas.pydata.org/docs/user_guide/visualization.html?highlight=plotting)
* [seaborn: statistical data visualization](https://seaborn.pydata.org/index.html)
* [An introduction to seaborn](https://seaborn.pydata.org/tutorial/introduction.html)
* [Matplotlib — Visualization with Python](https://matplotlib.org/)
* [Tutorials — Matplotlib](https://matplotlib.org/stable/tutorials/index)
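
Because Pandas plotting builds on Matplotlib, the object returned by a Pandas plotting call can be adjusted with ordinary Matplotlib methods. The following is a minimal sketch of that idea; the data frame `toy`, its values, and the axis labels are made up for illustration and are not taken from a real `all_statistics.csv`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data standing in for one column of all_statistics.csv (values are invented).
toy = pd.DataFrame({"contour_lengths": [90.1, 95.4, 101.2, 88.7, 110.3, 97.8]})

# Pandas plotting methods return a Matplotlib Axes object...
ax = toy["contour_lengths"].plot.hist(bins=5, alpha=0.5)

# ...so anything Pandas does not expose directly can be set with Matplotlib calls.
ax.set_xlabel("Contour length / nm")  # illustrative label
ax.set_ylabel("Count")
ax.set_title("Contour Lengths (toy data)")
```

The same pattern applies to Seaborn's axes-level functions, which also return a Matplotlib `Axes`.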



## Load Libraries

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd

from topostats.plotting import TopoSum

## Load `all_statistics.csv`

You need to load your data to be able to work with it; this is best achieved by importing it using
[Pandas](https://pandas.pydata.org/). Here we use the `tests/resources/minicircle_default_all_statistics.csv` that is
part of the TopoStats repository and load it into the object called `df` (short for "Data Frame"). You will need to
change this path to reflect your output.

Because `molecule_number` is only unique within a given `image` and `threshold`, we set a multi-level index of these
three columns.

In [None]:
df = pd.read_csv("../tests/resources/minicircle_default_all_statistics.csv")
df.set_index(["image", "threshold", "molecule_number"], inplace=True)
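
A multi-level index makes it straightforward to pull out all molecules from one image, or a single molecule. The sketch below illustrates this on a small invented data frame (`toy`) with the same three index levels; the image names, threshold values and contour lengths are assumptions for illustration only.

```python
import pandas as pd

# Invented data with the same index levels as the real data frame.
toy = pd.DataFrame(
    {
        "image": ["img_a", "img_a", "img_b", "img_b"],
        "threshold": ["above", "above", "above", "above"],
        "molecule_number": [0, 1, 0, 1],
        "contour_lengths": [90.0, 110.0, 95.0, 105.0],
    }
).set_index(["image", "threshold", "molecule_number"])

# Each (image, threshold, molecule_number) combination should occur only once.
index_is_unique = toy.index.is_unique

# All molecules from a single image (cross-section on the "image" level).
img_a = toy.xs("img_a", level="image")

# A single molecule, addressed by all three index levels.
one_molecule = toy.loc[("img_a", "above", 1)]
```

Checking `index.is_unique` after setting the index is a quick way to confirm that the three columns together identify every row.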


## Plotting with Pandas

### Plotting Contour Lengths

In [None]:
df["contour_lengths"].plot.hist(figsize=(16,9),
                                bins=20,
                                title="Contour Lengths",
                                alpha=0.5)

### Plotting End to End Distance of non-Circular grains

In [None]:
df[df["circular"] == False]["end_to_end_distance"].plot.hist(figsize=(16,9),
                                                             bins=20,
                                                             title="End to End Distance",
                                                             alpha=0.5)

### Multiple Images

Often you will have processed multiple images and you will want to plot the distributions of metrics for each image
separately.

For this example we make two copies of the data, scale their numerical values slightly, and append them to the original.

In [None]:
def scale_df(df: pd.DataFrame, scale: float, image: str) -> pd.DataFrame:
    """Scale the numerical values of a data frame. Retains string variables and the index.

    Parameters
    ----------
    df: pd.DataFrame
        Pandas Dataframe to scale.
    scale: float
        Factor by which to scale the data.
    image: str
        Name for new (dummy) image.

    Returns
    -------
    pd.DataFrame
        Scaled data frame
    """
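    # Scale the numeric columns only
    scaled = df.select_dtypes(include=["number"]) * scale
    # Re-attach the non-numeric columns while both frames still share the same index, so rows align
    _df = pd.concat([scaled, df[["circular", "basename"]]], axis=1)
    # Replace the image name with that of the dummy image, then restore the multi-level index
    _df.reset_index(inplace=True)
    _df["image"] = image
    _df.set_index(["image", "threshold", "molecule_number"], inplace=True)
    return _df


smaller = scale_df(df, scale=0.9, image="smaller")
larger = scale_df(df, scale=1.25, image="larger")
df_three_images = pd.concat([smaller, df, larger])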
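
Before plotting, it can help to check a combined data frame numerically. The following sketch summarises one metric per image by grouping on the `image` index level; the data frame `toy` and its values are invented for illustration.

```python
import pandas as pd

# Invented data for three images, indexed like the combined data frame.
toy = pd.DataFrame(
    {
        "image": ["smaller", "smaller", "original", "original", "larger", "larger"],
        "threshold": ["above"] * 6,
        "molecule_number": [0, 1, 0, 1, 0, 1],
        "contour_lengths": [90.0, 99.0, 100.0, 110.0, 125.0, 137.5],
    }
).set_index(["image", "threshold", "molecule_number"])

# Group on the "image" index level and compute a few summary statistics.
summary = toy.groupby(level="image")["contour_lengths"].agg(["count", "mean", "min", "max"])
print(summary)
```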

### Contour Length from Three Processed Images

In [None]:
df_three_images["contour_lengths"].groupby("image").plot.hist(figsize=(16,9),
                                                              bins=20,
                                                              title="Contour Lengths",
                                                              alpha=0.5)
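
When histograms from several images are overlaid, a legend makes it clear which distribution belongs to which image. One possible way to add one, sketched here on invented data (`toy`), is to loop over the groups and draw each onto a shared Matplotlib `Axes` with its own label.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented data for three images, indexed like the combined data frame.
toy = pd.DataFrame(
    {
        "image": ["smaller"] * 3 + ["original"] * 3 + ["larger"] * 3,
        "threshold": ["above"] * 9,
        "molecule_number": [0, 1, 2] * 3,
        "contour_lengths": [90.0, 94.5, 99.0, 100.0, 105.0, 110.0, 125.0, 131.0, 137.5],
    }
).set_index(["image", "threshold", "molecule_number"])

fig, ax = plt.subplots(figsize=(16, 9))
# Draw one histogram per image onto the same axes, labelled with the image name.
for image, group in toy.groupby(level="image"):
    group["contour_lengths"].plot.hist(ax=ax, bins=5, alpha=0.5, label=image)
ax.set_title("Contour Lengths (toy data)")
ax.legend(title="image")
```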

### Violin Plot of `max_feret` using Seaborn

Pandas does not have built-in support for Violin Plots so we switch to using Seaborn.

In [None]:
# Reset dataframe index to make `image` readily available
df_three_images.reset_index(inplace=True)
fig, ax = plt.subplots(1, 1, figsize=(16,9))
sns.violinplot(data=df_three_images, x="image", y="max_feret", hue="image", alpha=0.5)
plt.title("Maximum Feret")
plt.ylabel("Maximum Feret / nm")
# Restore the multi-level index
df_three_images.set_index(["image", "threshold", "molecule_number"], inplace=True)
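
The cell above resets the index in place and then restores it. An alternative, sketched below on invented data (`toy`), is to pass a temporary copy with the index reset to Seaborn, which leaves the original data frame untouched. The values and the axis label are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented data for two images, indexed like the combined data frame.
toy = pd.DataFrame(
    {
        "image": ["img_a"] * 4 + ["img_b"] * 4,
        "threshold": ["above"] * 8,
        "molecule_number": [0, 1, 2, 3] * 2,
        "max_feret": [30.0, 32.5, 35.0, 31.0, 40.0, 42.5, 45.0, 41.0],
    }
).set_index(["image", "threshold", "molecule_number"])

fig, ax = plt.subplots(1, 1, figsize=(16, 9))
# reset_index() without inplace=True returns a new frame, so `toy` keeps its index.
sns.violinplot(data=toy.reset_index(), x="image", y="max_feret", ax=ax)
ax.set_title("Maximum Feret (toy data)")
ax.set_ylabel("Maximum Feret / nm")
```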

In [None]:
df_three_images.index
