# Statistics

During the last few days, we learned fundamental programming.
While this knowledge is useful, it usually needs to be paired with some basic knowledge about mathematics.
Therefore, we will now investigate our signals a little bit deeper.

## Inter-spike interval

Between two spikes a neuron needs time to recover.
This [refractory period](https://en.wikipedia.org/wiki/Refractory_period_(physiology)) has a minimal length.
The units we showed you were filtered with multiple methods.
One of them was the minimal refractory period.
We will now investigate it.

## Simulation

Since we have no "incorrect unit", we have to create one, but how?
The answer to this is simulation.
Especially in physics, simulations are an often-used tool to answer questions that cannot be answered with simple experiments.
If we want to know how galaxies form we can neither make one in our own backyard nor can we observe it in our lifetimes,
so we build a mathematical model in a computer and investigate it.

In biology, computer simulations are more difficult to perform, because we lack a sufficiently advanced mathematical understanding of the problems we investigate.
Expressed in a simpler way: "It is easier to calculate how two galaxies collide than how two cells interact with each other."
You can see this in the way we "simulated" our two neurons.

We take a starting point and then roll a random number afterwards.
Adding the random number to the last time gives us the new spike time.
Thereby we have a randomly firing neuron.

We cover simulations because I believe that during your career you may encounter questions that can be answered by writing and running a short program instead of using a plant or animal, and that the use of simulations will slowly proliferate within biology.
For the latter case, always remember that a simulation is a simplified mathematical model and therefore flawed, so if you use one, always ask which corners were cut and how this will influence your research.

So if we now simulate a neuron, our code could look like this:

```python
import random

def create_random_neuron(start_time, minimal_delay, maximal_delay, end_recording):
    """!
    @brief Creates random neuron data
    @details This creates a uniformly distributed neuron signal

    @param start_time the beginning of the firing
    @param minimal_delay the minimal distance between spikes (refractory period)
    @param maximal_delay the maximal distance between spikes (no biological meaning)
    @param end_recording the largest permitted spike time
    @return a list with spike times
    """
    spike_times = list()
    # Note that the start time is not a spike otherwise the results would not be random enough
    last_time = start_time
    while True:  # This is Python's version of a do-while loop: https://en.wikipedia.org/wiki/Do_while_loop
        time_difference = random.uniform(minimal_delay, maximal_delay)
        last_time += time_difference
        if last_time < end_recording:
            spike_times.append(last_time)
        else:
            break
    return spike_times

# To ensure that the random module produces repeatable results, we have to seed it.
# This makes sure that the same algorithm produces equal results on two machines.
# If you do not specify a seed, one is chosen automatically from an operating system randomness source or the system time.
random.seed(42)
print(create_random_neuron(0, 0.2, 5.0, 100))
```

Please use the code above to create a neuron.
Then calculate the time difference between the spikes (inter-spike-interval) and visualize the results.

In [None]:
# Your code should be added here

<details>
<summary> Show suggested solution </summary>

```Python
import random
import matplotlib.pyplot

def create_random_neuron(start_time, minimal_delay, maximal_delay, end_recording):
    """!
    @brief Creates random neuron data
    @details This creates a uniformly distributed neuron signal

    @param start_time the beginning of the firing
    @param minimal_delay the minimal distance between spikes (refractory period)
    @param maximal_delay the maximal distance between spikes (no biological meaning)
    @param end_recording the largest permitted spike time
    @return a list with spike times
    """
    spike_times = list()
    # Note that the start time is not a spike otherwise the results would not be random enough
    last_time = start_time
    while True:  # This is Python's version of a do-while loop: https://en.wikipedia.org/wiki/Do_while_loop
        time_difference = random.uniform(minimal_delay, maximal_delay)
        last_time += time_difference
        if last_time < end_recording:
            spike_times.append(last_time)
        else:
            break
    return spike_times

def get_inter_spike_intervals(spike_times):
    differences = list()
    for index in range(0, len(spike_times) - 1):
        difference = spike_times[index + 1] - spike_times[index]
        differences.append(difference)
    return differences

# To ensure that the random module produces repeatable results, we have to seed it.
# This makes sure that the same algorithm produces equal results on two machines.
# If you do not specify a seed, one is chosen automatically from an operating system randomness source or the system time.
random.seed(42)
random_spikes = create_random_neuron(0, 0.2, 5.0, 100)
interval_random_spikes = get_inter_spike_intervals(random_spikes)

matplotlib.pyplot.hist(interval_random_spikes)
matplotlib.pyplot.show()
```

</details>
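
As a compact alternative to the loop in the suggested solution, the inter-spike intervals can also be computed with `numpy.diff`, which returns the differences between consecutive elements. The following is only a minimal sketch: the spike times are made-up example values and not measured data, and the variable names are chosen for illustration.

```python
import numpy

# Hypothetical spike times in seconds, chosen only for illustration
example_spike_times = [0.5, 1.25, 3.0, 3.5]

# numpy.diff subtracts each element from its successor,
# so the result has one element fewer than the input
example_intervals = numpy.diff(example_spike_times)
print(example_intervals)
```

The result is a NumPy array and can be passed directly to `matplotlib.pyplot.hist`.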

## Compare model to reality

You have seen a rough distribution, but to correctly fine tune your model you should compare it to reality.
Please read in the contents of ```./data_neuron/session_2023111501010_units.csv``` as we did last time and plot the inter-spike-interval of it.

In [None]:
# Your code should be added here

<details>
<summary> Show suggested solution </summary>

```Python
import random
import csv
import pathlib
import matplotlib.pyplot

def create_random_neuron(start_time, minimal_delay, maximal_delay, end_recording):
    """!
    @brief Creates random neuron data
    @details This creates a uniformly distributed neuron signal

    @param start_time the beginning of the firing
    @param minimal_delay the minimal distance between spikes (refractory period)
    @param maximal_delay the maximal distance between spikes (no biological meaning)
    @param end_recording the largest permitted spike time
    @return a list with spike times
    """
    spike_times = list()
    # Note that the start time is not a spike otherwise the results would not be random enough
    last_time = start_time
    while True:  # This is Python's version of a do-while loop: https://en.wikipedia.org/wiki/Do_while_loop
        time_difference = random.uniform(minimal_delay, maximal_delay)
        last_time += time_difference
        if last_time < end_recording:
            spike_times.append(last_time)
        else:
            break
    return spike_times

def get_spike_times_for_all_units(units_file_path):
    unit_spike_times = dict()
    with open(units_file_path, "r") as units_file:
        reader = csv.DictReader(units_file)
        for row in reader:
            unit_id = int(row["unitID"])
            if unit_id not in unit_spike_times.keys():
                unit_spike_times[unit_id] = list()
            spike_time = float(row["spikeTimes"])
            unit_spike_times[unit_id].append(spike_time)
    return unit_spike_times

def get_inter_spike_intervals(spike_times):
    differences = list()
    for index in range(0, len(spike_times) - 1):
        difference = spike_times[index + 1] - spike_times[index]
        differences.append(difference)
    return differences

data_path = pathlib.Path("./data_neuron/")
units_file_path = data_path / "session_2023111501010_units.csv"

spikes_times_units = get_spike_times_for_all_units(units_file_path)
max_duration = 0
for unit_id in spikes_times_units.keys():
    max_time_unit = max(spikes_times_units[unit_id])
    max_duration = max(max_time_unit, max_duration)

for unit in spikes_times_units.keys():
    matplotlib.pyplot.title(f"Unit {unit}")
    matplotlib.pyplot.hist(get_inter_spike_intervals(spikes_times_units[unit]))
    matplotlib.pyplot.xlabel("Interval in seconds")
    matplotlib.pyplot.ylabel("Number of spikes")
    matplotlib.pyplot.ylim((0, 130))
    matplotlib.pyplot.show()

random.seed(42)
random_spikes = create_random_neuron(0, 0.2, 5.0, max_duration)
interval_random_spikes = get_inter_spike_intervals(random_spikes)

matplotlib.pyplot.title("Simulation")
matplotlib.pyplot.hist(interval_random_spikes)
matplotlib.pyplot.xlabel("Interval in seconds")
matplotlib.pyplot.ylabel("Number of spikes")
matplotlib.pyplot.show()
```

</details>

## Mean, median and standard deviation

As you have observed, our simulated neuron does not look like any of the units we measured.
To answer the question of why, we have to do a little more statistics.

# TODO

Let us now return to our growth rate. 
We wish to obtain the bigger picture or figure out how the growth performs in general.
So far, we used our brains to do this visually, but we do not have numbers.
How do we get a growth number from our five dishes?

The first step would probably be to combine all the dishes into one dish. 
In other words we need an abstract measure for the growth rate of all dishes.
We have a few mathematical tools to obtain such a measure.

The simplest one is the average or mean: the sum of all elements divided by the number of elements.
It is available as ```numpy.mean```.

The disadvantage is its sensitivity to outliers.
While the average PhD takes 51 months, computer scientists take 60 months and veterinarians may take less time ([DFG](https://www.dfg.de/de/service/presse/pressemitteilungen/2021/pressemitteilung-nr-09)).
It becomes more fun if you are doing your PhD in philosophy, where the average is around 55 months.
You may already plan your life accordingly, just to realize that the person sitting at the desk across from you has already been working on their PhD for seven years and is far from completion ([DFG](https://www.dfg.de/de/service/presse/pressemitteilungen/2021/pressemitteilung-nr-09)).

This issue means we need better ways to describe our data.
The next one we can use is the middle value or the median,
which we can obtain via ```numpy.median```.
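
To illustrate how differently the two measures react to an outlier, here is a minimal sketch. The durations are invented values for illustration only and are not taken from the DFG statistics.

```python
import numpy

# Hypothetical PhD durations in months; the last value is an outlier
durations = [48, 50, 51, 52, 54, 84]

duration_mean = numpy.mean(durations)      # pulled upwards by the outlier
duration_median = numpy.median(durations)  # stays close to the typical values

print(duration_mean)
print(duration_median)
```

With these example values the mean is larger than five of the six durations, while the median lies between the two middle values.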

The last measure is the [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation), measuring how strongly values deviate from the mean.
It can be calculated for any data set, but it is easiest to interpret if the values are normally distributed, similar in shape to the following graph from [Wikipedia](https://commons.wikimedia.org/wiki/File:Standard_deviation_diagram.svg):

![A bell shaped normal distribution](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Standard_deviation_diagram.svg/640px-Standard_deviation_diagram.svg.png)

This is often the case because many observed variables are the result of numerous independent random influences,
and the sum of such influences approaches a normal distribution according to the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem).
Whether your future data are normally distributed can be checked with a few statistical tests.
Those tests are not simple to apply or interpret, so I advise you to either attend a statistics lecture or cooperate with someone who did.
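
The central limit theorem can be made tangible with a small simulation. The sketch below is an illustration under stated assumptions, not a statistical test: it averages uniformly distributed random numbers, and the number of experiments (10000) and values per experiment (30) are arbitrary choices.

```python
import numpy

# Seeded generator so the sketch is repeatable
generator = numpy.random.default_rng(42)

# 10000 hypothetical experiments, each consisting of 30 uniform random numbers
samples = generator.uniform(0.0, 1.0, size=(10000, 30))

# One mean per experiment; the single values are uniformly distributed,
# but their means cluster around 0.5
sample_means = samples.mean(axis=1)

mean_of_means = numpy.mean(sample_means)
std_of_means = numpy.std(sample_means)
print(mean_of_means)
print(std_of_means)
```

Plotting `sample_means` with `matplotlib.pyplot.hist` should show a roughly bell-shaped histogram, even though each individual random number comes from a flat distribution.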

Now back to our problem.
Since the standard deviation, which we can calculate using ```numpy.std```, can serve as an error measure, we wish to include it.
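
As a small sketch of how `numpy.std` behaves, consider the following invented values, chosen so that the result is easy to check by hand. By default `numpy.std` divides by the number of values; passing `ddof=1` divides by one less, which is the usual choice when the values are only a sample of a larger population.

```python
import numpy

# Hypothetical measurements, chosen only for illustration
values = [2, 4, 4, 4, 5, 5, 7, 9]

population_std = numpy.std(values)      # default ddof=0: divides by N
sample_std = numpy.std(values, ddof=1)  # divides by N - 1

print(population_std)
print(sample_std)
```

With only a handful of dishes the difference between both variants is noticeable, so it is worth stating which one was used.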

Please plot the median and mean of the growth rates. The latter should be plotted with error bars using the standard deviation. Please enter your code below.

In [None]:
# Write your solution here

<details>
<summary> Show suggested solution </summary>

```Python
import csv
import numpy
import pathlib
import matplotlib.pyplot

def process_csv(csv_file, dishes):
    with open(csv_file, "r") as csv_file_handle:
        _, day, _ , dish_number = str(csv_file.stem).split("_")
        day = int(day)
        dish_number = int(dish_number)
        cell_counter = 0
        cell_area_counter = 0
        reader = csv.DictReader(csv_file_handle)
        for row in reader:
            cell_counter += 1
            cell_area_counter += int(row[" Cell Area"])
        if dish_number not in dishes.keys():
            dishes[dish_number] = {}
        dishes[dish_number][day] = {
            "cell_count": cell_counter,
            "area": cell_area_counter
        } 
    return

csv_files = list()
data_folder = pathlib.Path("./data")
for csv_file in data_folder.iterdir():
    if "dish_" in csv_file.stem:
        csv_files.append(csv_file)


dishes = {}
for csv_file in csv_files:
    process_csv(csv_file, dishes)

area = []
count = []
cells = {"area": area, "count": count}
# We know that the dishes are numbered, so we iterate over them
for dish_number in range(1, len(dishes) + 1, 1):
    dish = dishes[dish_number]
    dish_area = []
    dish_count = []
    # We know that the days in the dishes are numbered
    for day_number in range(1, len(dish) + 1, 1):
        value_pair = dish[day_number]
        day_area = value_pair["area"]
        day_count = value_pair["cell_count"]
        dish_area.append(day_area)
        dish_count.append(day_count)
    area.append(dish_area)
    count.append(dish_count)

def get_forward_derivative(series_x, series_y):
    """ Gets the forward derivative

    We chose the forward derivative, because it needs fewer values and still gives us some insight.
    We can also calculate it for the first element in the series, which lacks a predecessor
    """
    if len(series_x) != len(series_y):
        raise ValueError(f"{series_x} did not have the same number of elements as {series_y}")
    return_x = list()
    return_y = list()
    # Remember that the last element has no successor, so we cannot get a derivative for it
    for index in range(0, len(series_x) - 1, 1):
        delta_x = series_x[index + 1] - series_x[index]
        delta_y = series_y[index + 1] - series_y[index]
        slope = delta_y / delta_x
        return_x.append(series_x[index])
        return_y.append(slope)
    return return_x, return_y
    
days = [day for day in range(0, len(cells["count"][0]))]
days_derived = days[0:-1]
growth_number_of_cells = [list() for day in days]
growth_area = [list() for day in days]

for dish in range(0, len(cells["count"])):
    current_growth_number_of_cells = get_forward_derivative(days, cells["count"][dish])[1]
    current_growth_area = get_forward_derivative(days, cells["area"][dish])[1]
    for day in range(0, len(days_derived)):
        growth_number_of_cells[day].append(current_growth_number_of_cells[day])
        growth_area[day].append(current_growth_area[day])
        
growth_number_of_cells_median = list()
growth_number_of_cells_mean = list()
growth_number_of_cells_std = list()
growth_area_median = list()
growth_area_mean = list()
growth_area_std = list()
for day in range(0, len(days_derived)):
    current_values_count = growth_number_of_cells[day]
    growth_number_of_cells_median.append(numpy.median(current_values_count))
    growth_number_of_cells_mean.append(numpy.mean(current_values_count))
    growth_number_of_cells_std.append(numpy.std(current_values_count))
    current_values_area = growth_area[day]
    growth_area_median.append(numpy.median(current_values_area))
    growth_area_mean.append(numpy.mean(current_values_area))
    growth_area_std.append(numpy.std(current_values_area))

figure, axes = matplotlib.pyplot.subplots(2,1)
figure.suptitle("Cell growth")
# Formatting the plots to use circles "o" and crosses "x".
# For further information consult: https://matplotlib.org/stable/api/markers_api.html#module-matplotlib.markers
axes[0].errorbar(days_derived, growth_number_of_cells_mean, yerr=growth_number_of_cells_std, fmt="o", label = "mean")
axes[0].plot(days_derived, growth_number_of_cells_median, "x", label = "median")
axes[0].set_title("Cell count")
axes[0].set_xlabel("Days")
axes[0].set_ylabel("Change in the number of cells")
axes[1].errorbar(days_derived, growth_area_mean, yerr=growth_area_std, fmt="o", label = "mean")
axes[1].plot(days_derived, growth_area_median, "x", label = "median")
axes[1].set_title("Cell area")
axes[1].set_xlabel("Days")
axes[1].set_ylabel("Change in the area covered by cells")
axes[0].legend()
axes[1].legend()
matplotlib.pyplot.show()
```

</details>

## Numerical differentiation

After we have seen our two plots, it seems prudent to look at the growth rates now.
As you may recall from your math studies, the growth of the cells is the change in the number of cells, or the first derivative of the numbers you see in front of you.
Since we lack the underlying function, we have to solve this problem numerically,
which means letting a computer handle it on a number-by-number basis.
So how do we do this?

During my high-school time the derivative was introduced via the [slope](https://en.wikipedia.org/wiki/Slope) of a function.
We counted how much we went up or down on one axis compared to the other,
as shown in the following [illustration from Wikipedia](https://commons.wikimedia.org/wiki/File:Wiki_slope_in_2d.svg):

![A slope illustrated with the interval steps on the x- and y-axis](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c1/Wiki_slope_in_2d.svg/445px-Wiki_slope_in_2d.svg.png)

Luckily for us, [numerical differentiation](https://en.wikipedia.org/wiki/Numerical_differentiation) did not progress much further, so we get the first derivative with the original formula:

$$
m = \frac{y_{2} - y_{1}}{x_{2} - x_{1}}
$$

We can get the forward slope, an approximation of the first derivative, by looking a certain distance ahead.
In our case, the time between two samples determines this distance, which is often called h.
Our formula now takes the following form:

$$
f^{'}(x) = \frac{f(x + h) - f(x)}{h}
$$

If you need a symmetric derivative, you may choose to look around the point of interest and use:

$$
f^{'}(x) = \frac{f(x + h) - f(x - h)}{2 h}
$$
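
To compare the two formulas, the following sketch applies both to a function whose derivative is known. The function names, the test function and the step size are assumptions made for illustration; they are not part of the course code.

```python
def forward_difference(function, x, h):
    """Approximates the derivative by looking a distance h ahead."""
    return (function(x + h) - function(x)) / h

def central_difference(function, x, h):
    """Approximates the derivative by looking a distance h ahead and behind."""
    return (function(x + h) - function(x - h)) / (2 * h)

def square(x):
    # Test function with the known derivative 2 * x
    return x * x

# The exact derivative of x * x at x = 3 is 6
forward = forward_difference(square, 3.0, 0.1)
central = central_difference(square, 3.0, 0.1)
print(forward)
print(central)
```

For this example the forward difference overshoots the exact value by roughly h, while the central difference matches it up to floating-point rounding. For measured data, the symmetric variant needs both a predecessor and a successor, so it cannot be evaluated at the first or last sample.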

Should you need to calculate derivatives of mathematical functions you should look up [Taylor expansion](https://en.wikipedia.org/wiki/Taylor_series) and [finite difference coefficients](https://en.wikipedia.org/wiki/Finite_difference_coefficient).
In this case you should also pay attention to the size of h.
You can also reverse the process and numerically integrate, but there are special lectures for this topic that will warn you about the pitfalls in the different techniques.

To round off this topic, I would ask you to calculate the derivatives of the cell growth and plot them. Please use the cell below.

In [None]:
# Write your code here
# Feel free to copy from above what you need

<details>
<summary> Show suggested solution </summary>

```Python
import csv
import pathlib
import matplotlib.pyplot

def process_csv(csv_file, dishes):
    with open(csv_file, "r") as csv_file_handle:
        _, day, _ , dish_number = str(csv_file.stem).split("_")
        day = int(day)
        dish_number = int(dish_number)
        cell_counter = 0
        cell_area_counter = 0
        reader = csv.DictReader(csv_file_handle)
        for row in reader:
            cell_counter += 1
            cell_area_counter += int(row[" Cell Area"])
        if dish_number not in dishes.keys():
            dishes[dish_number] = {}
        dishes[dish_number][day] = {
            "cell_count": cell_counter,
            "area": cell_area_counter
        } 
    return

csv_files = list()
data_folder = pathlib.Path("./data")
for csv_file in data_folder.iterdir():
    if "dish_" in csv_file.stem:
        csv_files.append(csv_file)


dishes = {}
for csv_file in csv_files:
    process_csv(csv_file, dishes)

area = []
count = []
cells = {"area": area, "count": count}
# We know that the dishes are numbered, so we iterate over them
for dish_number in range(1, len(dishes) + 1, 1):
    dish = dishes[dish_number]
    dish_area = []
    dish_count = []
    # We know that the days in the dishes are numbered
    for day_number in range(1, len(dish) + 1, 1):
        value_pair = dish[day_number]
        day_area = value_pair["area"]
        day_count = value_pair["cell_count"]
        dish_area.append(day_area)
        dish_count.append(day_count)
    area.append(dish_area)
    count.append(dish_count)

def get_forward_derivative(series_x, series_y):
    """ Gets the forward derivative

    We chose the forward derivative, because it needs fewer values and still gives us some insight.
    We can also calculate it for the first element in the series, which lacks a predecessor
    """
    if len(series_x) != len(series_y):
        raise ValueError(f"{series_x} did not have the same number of elements as {series_y}")
    return_x = list()
    return_y = list()
    # Remember that the last element has no successor, so we cannot get a derivative for it
    for index in range(0, len(series_x) - 1, 1):
        delta_x = series_x[index + 1] - series_x[index]
        delta_y = series_y[index + 1] - series_y[index]
        slope = delta_y / delta_x
        return_x.append(series_x[index])
        return_y.append(slope)
    return return_x, return_y
    

figure, axes = matplotlib.pyplot.subplots(2,1)
days = [day for day in range(0, len(cells["count"][0]), 1)]
for dish in range(0, len(cells["count"])):
    # * unpacks the returned tuple into two values
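    # The list index starts at 0, but the dishes are numbered from 1, so we add 1 for the label
    axes[0].plot(*get_forward_derivative(days, cells["count"][dish]), label=f"Dish {dish + 1}")
    axes[1].plot(*get_forward_derivative(days, cells["area"][dish]), label=f"Dish {dish + 1}")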
figure.suptitle("Cell growth")
axes[0].set_title("Cell count")
axes[0].set_xlabel("Days")
axes[0].set_ylabel("Change in the number of cells")
axes[1].set_title("Cell area")
axes[1].set_xlabel("Days")
axes[1].set_ylabel("Change in the area covered by cells")
axes[0].legend()
axes[1].legend()
matplotlib.pyplot.show()
```

</details>

## Rubber duck debugging

During this course, you were able to work with your partners.
Working on the course together, you were able to engage naturally in [pair programming](https://en.wikipedia.org/wiki/Pair_programming) and [code review](https://en.wikipedia.org/wiki/Code_review). 
These methods greatly improve your understanding of the problem and thereby your code.
Unfortunately, it is difficult to communicate the advantages of efficient work to superiors and colleagues.
The preferred standard is to let two people at two desks solve in eight days what two people at one desk could have solved in four.
Therefore, you will most probably find yourself working on your problems alone.
Luckily, some of the benefits of working with a partner can be gained without one.

A good example of this is [rubber duck debugging](https://en.wikipedia.org/wiki/Rubber_duck_debugging).
It is especially useful for novice programmers like yourself or for problems with many steps.
The general idea is that explaining something requires you to think more deeply about it,
so you improve your understanding by explaining.
So, you need to explain your code to figure out why it does not do what it is supposed to.
While a live human can contribute his or her own insights, a rubber duck is sufficient to gain your own.
So if you struggle with a problem take a rubber duck and explain the problem to the duck.
If you are lucky, you will realize that a small misunderstanding or mistake was the problem and fix it.

![Picture of a rubber duck in front of a monitor](https://upload.wikimedia.org/wikipedia/commons/d/d5/Rubber_duck_assisting_with_debugging.jpg)

## Possibilities 
After you are done with the exercise above take a look at the [matplotlib gallery](https://matplotlib.org/stable/gallery/index.html), [biopython](https://biopython.org/), [seaborn](https://seaborn.pydata.org/tutorial/introduction.html) and plan your future adventures with your rubber ducky.

