# MuonDataLib Tutorial 5: One versus many filters

In this tutorial we will investigate the filters in more detail. This includes:

- The performance of one filter compared with many filters.
- Why a small amount of filtering can make the histogram calculation slower.

The first bit of code sets up the problem. The frame start times will be used to create a series of filters that cover different proportions of the full data set (e.g. 10%). This works best with large data sets; the current example is not really big enough to demonstrate the differences in the time taken to calculate the histograms.

In [None]:
from MuonDataLib.data.loader.load_events import load_events
import time
import numpy as np
import os

file_name = 'HIFI00193325.nxs'
input_file = os.path.join('..', 'Data_files', file_name)

output_name = 'HIFI193325.nxs'
output_file = os.path.join('..', 'Output_files', output_name)

data = load_events(input_file, 64)
frame_start_times = data.get_frame_start_times()

To get good statistics for the timing we will repeat the calculations `N_loops` times, which is currently set to 10. We will filter our data in increments of $10\%$ (`percentages`). Finally, the frame interval, `dt`, should be approximately constant between any two adjacent frames. A small amount is subtracted from `dt` so that a filter of length `dt` stays within a single frame.

In [None]:
N_loops = 10
percentages = np.asarray([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
dt = frame_start_times[1] - frame_start_times[0] - 0.01
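
The assumption that the frame interval is roughly constant can be checked with `np.diff`. The sketch below uses synthetic frame start times and a margin of 0.01. Both are illustrative assumptions rather than values read from the HIFI file, and the `_demo` names are hypothetical.

```python
import numpy as np

# Synthetic frame start times: nominally 0.02 apart with small jitter.
# These values are illustrative assumptions, not read from a file.
rng = np.random.default_rng(0)
frame_start_times_demo = np.arange(100) * 0.02 + rng.uniform(-1e-4, 1e-4, 100)

# Interval between each pair of adjacent frames.
intervals = np.diff(frame_start_times_demo)
margin = 0.01
dt_demo = intervals[0] - margin

# A filter [start, start + dt_demo] stays inside one frame only if
# dt_demo is positive and shorter than every frame interval.
spread = intervals.max() - intervals.min()
stays_in_frame = bool(0 < dt_demo < intervals.min())
print(f"interval spread: {spread:.2e}, stays in one frame: {stays_in_frame}")
```

If the spread were comparable to the margin, a single value of `dt` would not be safe for every frame.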

To keep the code clean, a function will be used to calculate the timings for saving the histograms. The `clear_time_filters` call is included to make sure that the `save_histograms` command does not use cached values. Hence, the average will be for the same calculation every time. The function returns the mean and standard deviation of the time taken to save the histogram NeXus files.

In [None]:
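def calc_stats(data):
    """Return the mean and standard deviation of the time to save histograms."""
    times = np.zeros(N_loops)
    for j in range(N_loops):
        start = time.time()
        data.save_histograms(output_file)
        # Stop the next save from reusing cached values
        data.clear_time_filters()
        times[j] = time.time() - start
        # Remove the file so that every save writes a fresh one
        os.remove(output_file)
    return np.mean(times), np.std(times)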
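
The same repeat-and-average pattern works for any callable. The sketch below is a generic variant that uses `time.perf_counter`, the standard-library clock intended for measuring short intervals. The `time_call` helper and the sorting workload are hypothetical stand-ins and are not part of MuonDataLib.

```python
import time
import numpy as np


def time_call(func, n_loops=10):
    """Return the mean and standard deviation of the wall time of func()."""
    times = np.zeros(n_loops)
    for j in range(n_loops):
        start = time.perf_counter()
        func()
        times[j] = time.perf_counter() - start
    return np.mean(times), np.std(times)


# Stand-in workload (an assumption for illustration only).
mean_demo, std_demo = time_call(lambda: np.sort(np.random.rand(10_000)))
print(mean_demo, std_demo)
```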
    

We will now create some arrays to store the average time taken for the filtering. We will consider three cases:

1. There is one large filter applied across all of the filtered data
2. There are numerous filters, each covering a single frame
3. There are numerous overlapping filters, all of which start at the first frame start time

Then we will add the mean time for saving the whole data set.
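
The three cases can be written down as lists of (start, end) pairs. The sketch below uses synthetic frame start times and plain tuples, both assumptions for illustration. In the tutorial itself each pair is passed to `add_time_filter` together with a name.

```python
import numpy as np

# Synthetic, evenly spaced frame start times (illustrative assumption).
frames = np.arange(10) * 0.02
dt_demo = frames[1] - frames[0] - 0.01
fraction = 0.5
n = int(len(frames) * fraction)  # number of frames to filter

# Case 1: one filter spanning the whole filtered range.
one = [(frames[0], frames[-1] * fraction)]
# Case 2: one short filter per frame.
many = [(frames[k], frames[k] + dt_demo) for k in range(n)]
# Case 3: filters that all start at the first frame and overlap.
overlap = [(frames[0], frames[k] + dt_demo) for k in range(n)]

print(len(one), len(many), len(overlap))
```

Cases 2 and 3 contain the same number of filters, and only the start times differ.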


In [None]:
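# One slot per percentage, plus a final slot for the unfiltered data
times_many = np.zeros(len(percentages) + 1)
times_one = np.zeros(len(percentages) + 1)
times_overlap = np.zeros(len(percentages) + 1)

mean, std = calc_stats(data)

# The last entry holds the time for the full (unfiltered) data set
times_many[-1] = mean
times_one[-1] = mean
times_overlap[-1] = mean
print('full data', mean, std)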

This code loops over the percentages of the data to be filtered and calculates the mean time for each. The first block uses a single filter, the second uses lots of single-frame filters and the last block uses the overlapping filters. Each time, `clear_time_filters` is used to make sure that only the filters that we want to consider are included. The final bit of code prints and stores the results.

In [None]:
for j, percent in enumerate(percentages):
    # Case 1: a single filter covering the whole filtered range
    data.clear_time_filters()
    data.add_time_filter('one', frame_start_times[0], frame_start_times[-1]*percent)
    one_mean, one_std = calc_stats(data)

    # Case 2: one short filter per frame
    data.clear_time_filters()
    for k in range(int(len(frame_start_times)*percent)):
        data.add_time_filter(f'filter {k}', frame_start_times[k], frame_start_times[k] + dt)
    many_mean, many_std = calc_stats(data)

    # Case 3: overlapping filters that all start at the first frame
    data.clear_time_filters()
    for k in range(int(len(frame_start_times)*percent)):
        data.add_time_filter(f'filter {k}', frame_start_times[0], frame_start_times[k] + dt)
    overlap_mean, overlap_std = calc_stats(data)

    # Store in reverse order so that the index follows the fraction retained
    pos = len(percentages) - j - 1
    print(percent, many_mean, many_std, one_mean, one_std, overlap_mean, overlap_std)
    times_many[pos] = many_mean
    times_one[pos] = one_mean
    times_overlap[pos] = overlap_mean



The plot tools expect a histogram, so we write the percentages as bin edges.

In [None]:
bins = np.asarray([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95, 1.05])
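
Eleven edges written by hand are easy to mistype, so they can also be generated from the bin centres. A minimal sketch, assuming a bin centred on each retained fraction with a half width of 0.05 (`bin_edges_demo` is a hypothetical name):

```python
import numpy as np

# Bin centres: the nine filtered fractions plus the full data set at 1.0.
centres = np.linspace(0.1, 1.0, 10)
half_width = 0.05

# Each edge sits half a bin width either side of a centre.
bin_edges_demo = np.append(centres - half_width, centres[-1] + half_width)
print(bin_edges_demo)
```

This gives one more edge than there are timing values, which is what a histogram-style plot expects.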

Finally we plot the results.

In [None]:
from MuonDataLib.plot.basic import Figure

fig = Figure(y_label='time to save (s)', x_label='fraction of data retained')
fig.plot(bins, times_one, label='one filter')
fig.plot(bins, times_many, label='many filters')
fig.plot(bins, times_overlap, label='overlapping filters')
fig.show()

There are a few interesting features in this data. The first is that filtering a small amount of data is slower than using all of it. This seems counterintuitive: there is less data, so it should be quicker. However, the code has to work out which data to filter out (remove) before calculating the histogram. Hence, there is an up-front time cost for the filtering, which is not recovered if only a small amount of data is removed.

The second interesting thing is that using multiple filters makes the code slower. This is because each new filter causes a search of the frame times to identify which data to remove. Hence, as the number of searches increases the calculation gets slower.
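
The effect of one search per filter can be illustrated with `np.searchsorted` on synthetic frame times. This is a sketch of the general idea under stated assumptions and is not MuonDataLib's implementation. The `frames_removed` helper and all of the values are hypothetical.

```python
import time
import numpy as np

# Synthetic sorted frame start times (illustrative assumption).
frames = np.arange(100_000) * 0.02
dt_demo = 0.01


def frames_removed(filters):
    """Mark frames whose start time lies inside any (start, end) filter."""
    removed = np.zeros(len(frames), dtype=bool)
    for start, end in filters:
        # Two binary searches per filter locate the affected frames.
        lo = np.searchsorted(frames, start, side='left')
        hi = np.searchsorted(frames, end, side='right')
        removed[lo:hi] = True
    return removed


n = 10_000
one = [(frames[0], frames[n - 1] + dt_demo)]
many = [(frames[k], frames[k] + dt_demo) for k in range(n)]

for label, filters in (('one', one), ('many', many)):
    t0 = time.perf_counter()
    mask = frames_removed(filters)
    print(label, mask.sum(), time.perf_counter() - t0)
```

Both sets of filters remove the same frames, but the second one performs `n` times as many searches.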

The final interesting point is that the `overlapping filters` and `many filters` methods take about the same amount of time for each fraction of data. This is because the time taken scales with the number of filters and, as a consequence, the number of searches.
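
Because the cost follows the number of filters rather than the amount of data they cover, overlapping filters could in principle be combined before they are passed to `add_time_filter`. A minimal sketch of such a pre-processing step, using a hypothetical `merge_intervals` helper that is not part of MuonDataLib:

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval, so extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Three overlapping filters that all start at 0, plus one separate filter.
demo = [(0.0, 0.01), (0.0, 0.03), (0.0, 0.05), (0.10, 0.11)]
print(merge_intervals(demo))
```

Here the three overlapping filters collapse into one, so only two searches would be needed instead of four.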