
[Feature] Create a Data Model for Documentation, Auditing, and Consensus Building #617

Open
DavidOry opened this issue Oct 10, 2022 · 6 comments


@DavidOry

1. User Stories

User Story One

As an owner of a travel demand model, I would like to transition to ActivitySim. As a first step, I would like to understand:
(a) what variables need to be input into each of the available prototype model sets;
(b) what variables are derived from the input variables, e.g., what variables are used in density calculations or person type rules?
(c) what variables are created by each of the prototype models; and,
(d) what are the relationships between these variables, e.g., does an automobile have a primary driver? Does each individual have a value of time?

To do this now, a model owner needs to be an expert in ActivitySim. It requires inspecting the input socioeconomic data,
the input synthetic population files, and skim matrices. It requires examining the output trip lists, person files, and household files. It requires examining the annotate files to understand the derived variable calculations. And it may require looking at the code itself to understand other details.

User Story Two

As a model developer, I am transferring utility expressions written in Java CT-RAMP syntax to ActivitySim. To do this, I need to understand the variable names used in ActivitySim and where they are created (or if they need to be created). I also need to understand the syntax of Python's eval and pandas.DataFrame.eval. I then need to iteratively craft expressions and run them through ActivitySim to determine if they are valid. This is tedious and inefficient.
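As a concrete illustration of the pandas.DataFrame.eval syntax mentioned above (the column names here are hypothetical, not ActivitySim's actual variable names):

```python
import pandas as pd

# Hypothetical person table; whether the real columns are named
# "age" and "is_worker" is exactly what a data model would document.
persons = pd.DataFrame({
    "age": [34, 16, 71],
    "is_worker": [True, False, False],
})

# A utility-style expression evaluated via pandas.DataFrame.eval.
persons["adult_worker"] = persons.eval("(age >= 18) & is_worker")
print(persons["adult_worker"].tolist())  # [True, False, False]
```

Today, discovering whether the variable names in such an expression are valid requires running it through ActivitySim; a data model could answer that question up front.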

2. Resolution Ideas

Create a complete (i.e., defines input, derived, intermediate, and output variables) data model for ActivitySim in something like a Protocol Buffer. Creating a data model would:
(a) Document the variables used in an ActivitySim model, including inputs, derived variables, and outputs;
(b) Specify the data type for each variable;
(c) Specify and document the relationships between each variable;
(d) Facilitate the specification of methods used to compute derived variables, such as density and person type, in a single location;
(e) Be an avenue towards reaching consensus on variable names and definitions, which can lead to greater standardization and avoid arbitrary differences (e.g., hh_density versus household_density); and,
(f) Set the stage for the next-generation ActivitySim, which would presumably be agent-based and start with a forward-thinking data model.

(There are a large number of resources describing data models online, e.g., here and here.)

The existing write_data_dictionary component is helpful, but making it more complete (identified in #528) falls short of satisfying these use cases.
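For a rough sense of what such a data model could look like, here is a minimal Pydantic sketch. All field names, types, and relationships are hypothetical, chosen only to illustrate the idea, not taken from ActivitySim's actual schema:

```python
from typing import Optional
from pydantic import BaseModel, Field

class Household(BaseModel):
    """Hypothetical input table definition: one record per household."""
    household_id: int
    income: float = Field(ge=0, description="Annual income in dollars")
    home_zone_id: int

class Person(BaseModel):
    """Hypothetical person record, related to Household via household_id."""
    person_id: int
    household_id: int  # relationship: foreign key into Household
    age: int = Field(ge=0, le=120)
    value_of_time: Optional[float] = None  # derived later in the pipeline

p = Person(person_id=1, household_id=42, age=34)
print(p.age)  # 34
```

Even this tiny sketch covers points (a) through (c) above: variables are enumerated, typed, and tied to their parent table and relationships in one place.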

With a data model in place, I see two pathways for integrating with ActivitySim (other ideas?), as follows:

Resolution Pathway A

A data model represented in something like a Protocol Buffer could be used to audit input files, annotation files,
utility expressions, and output files. This would allow model users to use a data model as a means of documenting
model inputs and outputs, which addresses User Story One. The auditing could also assist with User Story Two, in that draft utility expressions could be run through the auditing software rather than ActivitySim itself. (An auditing tool could also address #616).
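A minimal sketch of the auditing idea, assuming a Pydantic-style data model and input rows read from a file (shown here as plain dicts; the Person schema is hypothetical):

```python
from pydantic import BaseModel, Field, ValidationError

class Person(BaseModel):
    """Hypothetical schema for one record of a person input file."""
    person_id: int
    age: int = Field(ge=0, le=120)

def audit(records, model):
    """Return (record_index, error message) pairs for records that
    fail the data model -- no ActivitySim run required."""
    problems = []
    for i, rec in enumerate(records):
        try:
            model(**rec)
        except ValidationError as err:
            problems.append((i, str(err)))
    return problems

records = [{"person_id": 1, "age": 34}, {"person_id": 2, "age": -5}]
print(audit(records, Person)[0][0])  # 1 -- the record with the negative age
```

The same pattern could be pointed at annotation files and draft utility expressions, which is what makes it useful for User Story Two as well.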

Resolution Pathway B

Ideally, the data model would be used to replace the existing annotation and utility expression formulation. This would be a significant effort that would only make sense as part of a broader refactoring of ActivitySim or part of ActivitySim 2.0. The benefit of this approach is that it would allow for interactive validation of utility expressions, which addresses User Story Two. It would also allow utility expressions to be more verbose and readable (e.g., person.age rather than df.age).
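One piece of that interactive validation can be sketched with today's tools: checking a draft expression's variable names against the data model's declared fields before running anything. The Person class and its fields are hypothetical; the fallback handles both Pydantic v1 and v2:

```python
import ast
from pydantic import BaseModel

class Person(BaseModel):
    """Hypothetical data-model fields for the person table."""
    age: int
    value_of_time: float

def unknown_names(expr: str, model) -> set:
    """Names in a draft expression that the data model does not define."""
    # model_fields is Pydantic v2; __fields__ is the v1 equivalent.
    fields = set(getattr(model, "model_fields", None) or model.__fields__)
    names = {n.id for n in ast.walk(ast.parse(expr)) if isinstance(n, ast.Name)}
    return names - fields

# 'income' is not on Person, so it is flagged without any model run.
print(unknown_names("(age >= 18) & (income > 0)", Person))  # {'income'}
```

This is far short of full expression validation, but it is the kind of fast feedback loop that replacing trial-and-error ActivitySim runs would require.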

3. Priority

TBD by ActivitySim Consortium

4. Level of Effort

Medium. Here's a guess for Resolution Pathway A:
-- 4 to 6 months of consensus building on a standard data model;
-- 4 to 6 months of developing the data model code and associated auditing code;
-- 2 to 4 months of testing and review.

5. Project

Is there a funder or project associated with this feature?
No

6. Risk

Will this potentially break anything?
Not for Resolution Pathway A, which calls for the data model to exist independently from ActivitySim and
be used as an optional auditing mechanism -- for data inputs, annotation expressions, and utility expressions.

Resolution Pathway B is sufficiently risky to be ill advised outside a broader refactoring.

7. Tests

What are relevant tests or what tests need to be created in order to determine that this issue is complete?
For Resolution Pathway A, tests can be conducted on existing input, annotation, and utility expressions and compared to human-derived definitions of variable names and relationships.

@DavidOry

@lmz: per your comment last week, I roughed out a data model template that could be used as an alternative pathway for the input checker. It's very basic and rough at the moment, but should get the broader idea across.

@dhensle, @aletzdy: per our chat, please have a look. Functional contributions to the template are very welcome. The documentation build is failing at the moment -- I'll look into that next week -- but you can build locally.

@joecastiglione: see the input.py file for an example of what a Pydantic data model could look like.

fyi @jpn--, @jfdman, @i-am-sijia

@DavidOry

Docs are fixed.

@bettinardi

You don't have to write paragraphs to this question - but hoping you can bring me up to speed.
I can understand the high value of a data model, but I don't quite understand how it replaces an input checker. It seems like an input checker still needs to be built into a data model... correct?

In my mind the point of an input checker is not to build internal model function consistency and improved understanding. The point of an input checker is to have the model bomb immediately with an easy to understand message, so the user knows they messed up and knows how to fix it.

I don't see how a data model (in itself, without additional checks), is going to cover the non-fatal warnings that we have in our current input checker. Things like -

"You're not wrong, you're just an A-hole" [Big Lebowski] - attempt at humor
But things like... uhh... your employment in this sector totals to 500,000 and you only have 300,000 persons with an occupation code that matches...

Or

The model doesn't really care if you have 10,000 veh/lane capacity - just that it's a positive number - but you might want to look at that 10,000 value - we're thinking you meant 1,000.

The point is - I don't think a data model replaces a system to check against user error. I think a data model can have a built-in warning system for fatal errors, but there is still another wrapper to protect against user error that isn't technically wrong - where the coder was just an A-hole.

@DavidOry


For an input checker, it would be good to know:

  1. What the inputs are, e.g., roadway capacity is an input.
  2. What the variable names are, e.g., is it capacity or capacity_per_hour_per_lane?
  3. What the context of each variable is, e.g., capacity belongs to something called a network.
  4. What the data type is, e.g., float or int or something else.
  5. As you suggest, reasonable values for each variable, e.g., hourly capacity should be less than 2500 vehicles per lane.

Pydantic has built-in tools for each of these, and does 1-4 more or less automatically. For 5, you still have to write Python code, but I would argue that it's easier to write that code in the data model (here's an example) than in an ActivitySim expression file.
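A sketch of what point 5 could look like as a Pydantic validator -- the Network class, field names, and the 2500 threshold are illustrative only, and the import fallback covers both Pydantic v1 and v2:

```python
import warnings
from pydantic import BaseModel, Field

try:  # Pydantic v2
    from pydantic import field_validator
except ImportError:  # Pydantic v1
    from pydantic import validator as field_validator

class Network(BaseModel):
    """Hypothetical network link record."""
    capacity: float = Field(gt=0)  # fatal: must be positive (points 1-4)

    @field_validator("capacity")
    @classmethod
    def plausible_capacity(cls, v):
        # Non-fatal plausibility check (point 5): warn, don't fail.
        if v > 2500:
            warnings.warn(f"hourly capacity {v} per lane looks high; check the input")
        return v

link = Network(capacity=10000)  # constructs successfully, but emits a warning
print(link.capacity)  # 10000.0
```

The gt=0 constraint makes the model "bomb immediately" on a non-positive value, while the validator only warns -- which maps onto the fatal versus non-fatal distinction discussed above.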

@bettinardi

@i-am-sijia showcased your parking costs example in the meeting today. It feels like we are entering a semantics argument, but the Python code to check whether parking costs are reasonable doesn't feel like a data model any more... it feels like an input checker.

But I won't fight semantics - I'll discuss the issue.

The issue for Oregon DOT... Our input checker is not a "one and done" file. Every time we stub our toe on an input mistake - we add to what is currently a CSV list of input checks - with the hope that we will never stub our toe on that issue again.

If the input checks are embedded in Python - I'm concerned they might become obscured from the average user. If the group were to move forward with the "data model" (part of which is actually an input checker hiding in Python code) - we would need to ensure that the Python code was an input file that the user had easy access to, and not code in the repo in some ActivitySim package to install...

@DavidOry

Right @bettinardi. One key difference is that if you used the data model, it would be more logical to write validation methods in Python directly, rather than via an ActivitySim CSV file. You would keep the data model separate from ActivitySim, importing it to use as part of the input checker, and, as you say, adding to it when you find an error.

I personally think the ActivitySim CSV files are a necessary evil for computing utilities, which often have long and complex expressions. But using them for simple calculations seems an unnecessary evil. Python methods are much easier to write and check, as you have the entirety of the internet to help you (e.g., ChatGPT can write Pydantic validator methods for you).
