In [None]:
from pathlib import Path

import pandas as pd
import geopandas as gpd

import matsim

# Usage example notebook

In this notebook we'll walk through some typical workflows built with this library. We'll start with the more basic components, then use the provided convenience methods and classes to automate part of the process, and finally write our own custom classes to extend the library's functionality.

## Basic usage

### Data prep

Since our goal is to compare the simulation against observed data, we must load both of those datasets, as well as a way to know exactly which element of one set should be compared to which element of the other. In our case, this means providing something like a lookup table relating the links in the simulation's network to the city's detectors.
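To make the role of the lookup table concrete, here is a minimal sketch on toy data. The detector IDs, link IDs, and counts are made up for illustration; only the `MSID` and `link_id` column names come from the datasets used below.

```python
import pandas as pd

# Hypothetical detector-to-link mapping and detector counts, for illustration only
toy_lookup = pd.DataFrame({"MSID": ["Z001", "Z002"], "link_id": [101, 205]})
toy_counts = pd.DataFrame({"MSID": ["Z001", "Z002", "Z003"], "count": [120, 80, 45]})

# Attach a link_id to every detector that appears in the lookup table;
# detectors without a matching link (Z003 here) are dropped by the inner merge
toy_matched = toy_counts.merge(toy_lookup, on="MSID", how="inner")
print(toy_matched)
```

Once every observed count carries a `link_id`, it can be lined up with the simulated events for the same link.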

In [None]:
data_folder = Path.cwd() / './data/zurich'

In [None]:
detectors_filename = data_folder / 'counts.csv'
events_filename = data_folder / 'events.csv'
lookup_table_filename = data_folder / 'lookup_table.csv'

First, we'll import the data from the loop detectors

In [None]:
detector_data = (
    pd.read_csv(detectors_filename, engine='pyarrow')[["MSID", "MessungDatZeit", "AnzFahrzeuge"]]
    .rename(columns={"MessungDatZeit": "time", "AnzFahrzeuge": "count"})
)
detector_data

In case the MATSim events file is unprocessed, we can run the following cells to extract the relevant events and save the result to a csv (which we'll subsequently read).

In [None]:
# unprocessed_events_filename = data_folder / 'output_events.xml.gz'

In [None]:
# import utils
# import subprocess
# subprocess.Popen([utils.get_entered_link_filename, unprocessed_events_filename, data_folder / 'processed_events.csv']).wait()
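If the helper script isn't at hand, the extraction step could be approximated in pure Python. The sketch below is not the library's own tool: it assumes the events file is XML made of `<event>` elements carrying `type`, `time`, `link`, and `vehicle` attributes, and the output column layout is a guess that may need adapting to what the rest of the notebook expects.

```python
import csv
import io
import xml.etree.ElementTree as ET

def extract_entered_link_events(xml_source, csv_target):
    """Stream an events XML and write its 'entered link' events as CSV rows."""
    writer = csv.writer(csv_target)
    writer.writerow(["time", "link", "vehicle"])
    n_written = 0
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "event" and elem.get("type") == "entered link":
            writer.writerow([elem.get("time"), elem.get("link"), elem.get("vehicle")])
            n_written += 1
        elem.clear()  # drop the attributes of processed elements to limit memory use
    return n_written

# Tiny in-memory demo with made-up events
demo_xml = io.BytesIO(
    b'<events version="1.0">'
    b'<event time="10.0" type="entered link" link="5" vehicle="v1"/>'
    b'<event time="11.0" type="left link" link="5" vehicle="v1"/>'
    b'</events>'
)
demo_csv = io.StringIO()
print(extract_entered_link_events(demo_xml, demo_csv))
print(demo_csv.getvalue())
```

For a gzipped file, the same function can be fed `gzip.open(path, "rb")` as the source and a text file opened with `newline=""` as the target.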

We'll now import the processed MATSim events

In [None]:
events_df = pd.read_csv(events_filename, engine='c', nrows=10_000_000)

In [None]:
events_df['count'] = 1
events_df.rename(columns={'link': 'link_id'}, inplace=True)

Now that we've loaded our base datasets, we'll load the lookup table, which will allow us to compare them.

In [None]:
lookup_table = pd.read_csv(lookup_table_filename, index_col=0)
lookup_table


### Analysis

The core of this library is made up of Analysis objects. These objects' sole purpose is to generate their respective analysis (duh) for a given input: a pandas dataframe with specific column names, or something that inherits from it (like a geodataframe). To use one of these objects, instantiate it and then call its `generate_analysis` method with the required dataframe.


Let's do this for the CountComparison Analysis object. As we'll see, what this Analysis does is calculate some link-based metrics comparing the simulated and observed counts. We'll start by reading in an appropriate dataframe. In this case, the dataframe needs to have at least the following columns:

* link_id: integer that identifies the individual link
* count_sim: the simulated vehicle counts for that link
* count_obs: the observed vehicle counts for that link

In [None]:
detector_data = detector_data.merge(lookup_table, on="MSID", how='right')
detector_data = detector_data.astype({'link_id': 'int64'})
detector_data

In [None]:
from diagnostic.report import CreateComparisonDF

comparison = CreateComparisonDF.link_comp(events_df, detector_data)
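The internals of `link_comp` aren't shown here. As a rough mental model only, and assuming the comparison simply totals the counts per link on each side, a frame with the three required columns could be built like this (toy values, hypothetical helper name):

```python
import pandas as pd

# Toy inputs (made-up values): one row per simulated link-entry event,
# and one row per detector measurement already mapped to a link
toy_events = pd.DataFrame({"link_id": [1, 1, 2, 2, 2], "count": [1, 1, 1, 1, 1]})
toy_detectors = pd.DataFrame({"link_id": [1, 2, 2], "count": [3, 1, 1]})

def toy_link_comparison(sim, obs):
    """Sum counts per link on each side, then join the totals on link_id."""
    sim_total = sim.groupby("link_id", as_index=False)["count"].sum()
    obs_total = obs.groupby("link_id", as_index=False)["count"].sum()
    # The shared 'count' column becomes count_sim / count_obs via the suffixes
    return sim_total.merge(obs_total, on="link_id", suffixes=("_sim", "_obs"))

toy_comparison = toy_link_comparison(toy_events, toy_detectors)
print(toy_comparison)
```

The real method may aggregate differently (for example per time interval), so treat this purely as an illustration of the expected column layout.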

#### Instantiating the object

To create/instantiate a basic Analysis object, all you need to do is import the desired class and call it as though it were a regular Python function.

In [None]:
from diagnostic.analyses import CountComparison

In [None]:
cc = CountComparison()

#### Generating the analysis

To actually generate the analysis, simply call the _generate\_analysis_ method from the Analysis object while providing an appropriate dataframe as the argument.

In [None]:
cc.generate_analysis(comparison)

Now, if we want to inspect the result of the analysis, we must simply get the 'result' attribute from our Analysis object, as follows. Keep in mind that, depending on the Analysis, this object can be of different types (a pandas DataFrame for CountComparison but a list of matplotlib Figures for CountVisualization).

In [None]:
cc.result

#### Getting the result in a specific format

You can also output the generated result in a specific format (csv, latex, png, shp) depending on the analysis being used. To do that, call the object's _to\_\<format\>_ method. For example, to get the result from the CountComparison Analysis as a latex table, we execute

In [None]:
cc.to_latex()

#### Summing up

There are many default analyses already implemented, and the process of creating one, generating its analysis, and getting back the result is the same for each of them. The one thing to keep in mind when calling the _generate\_analysis_ method on these objects is that the dataframe column requirements can vary between them. For example, besides the columns already mentioned for the CountComparison analysis, the CountVisualization analysis also requires that the input be a geodataframe from the geopandas library (meaning it should have an active geometry column) in order for the generated plots to make any sense. The required columns are all listed in the respective object's docstring, so if you're unsure, all you need to do is read it.
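A quick pre-flight check can catch a missing column before an analysis fails halfway through. The helper below is a hypothetical convenience, not part of the library; the required list is the one given earlier for CountComparison. The docstring itself can be read with Python's built-in `help()`.

```python
import pandas as pd

def missing_columns(df, required):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]

# Columns listed earlier for the CountComparison analysis
required = ["link_id", "count_sim", "count_obs"]

# Toy frame that lacks the observed counts
toy_df = pd.DataFrame({"link_id": [1], "count_sim": [10]})
print(missing_columns(toy_df, required))
```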

#### Going beyond

Besides this most elementary use, Analysis objects can also be passed two types of objects when they are instantiated: Filters and Options. Each of these objects has a specific purpose:

* Options: determine what is computed (such as what statistics)
* Filters: determine what is saved to the `result` attribute (such as what values)

By default, each different type of Analysis is instantiated with a specific Options object and the 'identity' Filter (meaning nothing is filtered), however that is easily changed.

Suppose we want to only keep the 10 largest entries/links in terms of their calculated SQV and GEH when generating the CountComparison analysis. We can then use one of the already implemented Filter classes in the following way:

In [None]:
from diagnostic.analyses import FilterByLargest

# We first instantiate the filter
sqv_10 = FilterByLargest((10, ['SQV', 'GEH']))

# Then apply to the Analysis object upon creation
cc = CountComparison(sqv_10)

Thus, when we generate the analysis and access the result attribute, we will find there are only 10 entries in the DataFrame, those with the highest SQV and GEH values

In [None]:
cc.generate_analysis(comparison)
cc.result
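For readers unfamiliar with the two metrics used in the filter: GEH and SQV are common measures of agreement between a simulated and an observed traffic volume. A low GEH or an SQV close to 1 indicates a good match. The sketch below uses their usual textbook definitions; whether the library computes them exactly this way, and with which scaling factor for SQV, is an assumption.

```python
import math

def geh(sim, obs):
    """GEH statistic for a pair of volumes (0 means a perfect match)."""
    if sim + obs == 0:
        return 0.0
    return math.sqrt(2 * (sim - obs) ** 2 / (sim + obs))

def sqv(sim, obs, scale=1000):
    """Scalable Quality Value in (0, 1]; 1 means a perfect match. Requires obs > 0.

    The default scale of 1000 is an assumed value for this illustration.
    """
    return 1 / (1 + math.sqrt((sim - obs) ** 2 / (scale * obs)))

# Made-up volumes: 110 simulated vehicles against 100 observed
print(round(geh(110, 100), 3))
print(round(sqv(110, 100), 3))
```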

## Automated report generation

Manually instantiating each desired Analysis, then calling their respective _generate\_analysis_ and _to\_\<format\>_, and then joining all those outputs together every time can be a bit of a chore. For this reason we also provide a Report class which automates most of this stuff for you.

Suppose, for example, we want to generate a latex document with the CountComparison, CountSummaryStats, and CountVisualization analyses, all neatly formatted and divided into their own sections. Doing that is very straightforward.

### Defining the desired analyses

First, we'll instantiate the analyses we want and add them to a list.

In [None]:
from diagnostic.analyses import CountComparison, CountSummaryStats, CountVisualization

In [None]:
cc = CountComparison()
cs = CountSummaryStats()
cv = CountVisualization()

analyses = [cc, cs, cv]

#### Adding geometry to events

Since we want to generate the CountVisualization Analysis, which relies on the passed object being a `GeoDataFrame`, we must add the link geometries to our events.

In [None]:
network = gpd.read_file(data_folder / 'network/network.shp')

In [None]:
events_df = gpd.GeoDataFrame(events_df.merge(network, on='link_id'))
events_df

### Instantiating the Report object

Second, we'll create an instance of the Report class, giving it a title and our list of analyses. The simulated and observed dataframes are passed in later, when the analyses are generated; they should have the same columns as those produced by this module's parsers.

In [None]:
from diagnostic.report import Report

In [None]:
link_count_report = Report('Link count report', analyses)

### Generating the analyses

Just as with the individual Analysis objects, to get the Report to generate all of our analyses we call its `generate_analyses()` method, passing in the simulated and observed dataframes.

In [None]:
link_count_report.generate_analyses(events_df, detector_data)

### Accessing individual results

Accessing the analyses' individual results when they have been generated through the Report is done exactly the same way as before: just access the _result_ attribute on the desired Analysis object.

In [None]:
print(f"Access directly through the object: {cc.result}")
print(f"Access directly through the object: {cs.result}")
print(f"Access directly through the object: {cv.result}")

In [None]:
for analysis in analyses:
    print(f"Access through the list defined earlier: {analysis.result}")

In [None]:
for analysis in link_count_report.analyses:
    print(f"Access through the report's analyses attribute: {analysis.result}")

### Creating the output file

To create our aforementioned latex report, we call the Report object's _to\_latex_ method and pass in where to save it

In [None]:
# link_count_report.to_latex('/path')

### Going Beyond

#### Specifying analysis dependence

Suppose we want the result from one of our analyses, say CountComparison, to be fed automatically as the input to another analysis, CountSummaryStats, while we're generating a report. To do that, we can pass an additional argument called 'analysis_dependence_dict' to our Report object when instantiating it. This is a dictionary in which each key is a dependent analysis and each value is the analysis it depends on. For this example, we would have:

In [None]:
cc = CountComparison()
cs = CountSummaryStats()

analyses = [cc, cs]

# This is the analysis dependence dictionary
add = {cs: cc}

report = Report('title', analyses, add)
report.generate_analyses(events_df, detector_data)

So now, if we take a look at _cs_'s result attribute, we should see that it has a lot more columns than the one we previously created

In [None]:
cs.result
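To illustrate the idea behind the dependence dictionary without relying on the library, here is a minimal sketch in plain Python. `ToyAnalysis` and `run_with_dependencies` are hypothetical stand-ins, not the Report class's actual implementation; they only show how a dependent analysis can receive its parent's result instead of the base input.

```python
class ToyAnalysis:
    """Hypothetical stand-in for an Analysis: applies a function and stores the result."""

    def __init__(self, func):
        self.func = func
        self.result = None

    def generate_analysis(self, data):
        self.result = self.func(data)


def run_with_dependencies(analyses, base_input, depends_on=None):
    """Run analyses in list order; a dependent analysis gets its parent's result as input."""
    depends_on = depends_on or {}
    for analysis in analyses:
        parent = depends_on.get(analysis)
        if parent is not None and parent.result is None:
            raise ValueError("a dependency must appear before its dependent in the list")
        analysis.generate_analysis(base_input if parent is None else parent.result)


doubler = ToyAnalysis(lambda xs: [2 * x for x in xs])
total = ToyAnalysis(sum)

# total depends on doubler, mirroring {cs: cc} in the cell above
run_with_dependencies([doubler, total], [1, 2, 3], {total: doubler})
print(doubler.result, total.result)
```

In this sketch the order of the list matters: the parent has to run before the analysis that depends on it.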

## Extending the library

It should be fairly easy to add new capabilities to the library, mainly in the form of new analyses, options, and filters, while keeping everything compatible with the higher level convenience objects and methods (such as the Report class).

Implementing a new Analysis is as simple as creating a class that extends/inherits from Analysis and defines at least the following method:

* _generate\_analysis_(self, comp: pd.DataFrame)

For example:

In [None]:
from diagnostic.analyses import Analysis

In [None]:
class MyAnalysis(Analysis):
    def generate_analysis(self, comp: pd.DataFrame):
        # Our analysis is halving the given input
        comp = comp/2
        self._save_result(comp)

In [None]:
cc.result