New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Centralize schedule metadata and make it editable by Data Administrators #2018

Merged
merged 12 commits into from Jul 3, 2018

Conversation

Projects
None yet
3 participants
@toolness
Contributor

toolness commented Jun 20, 2018

Future improvements to CALC point to a need for Data Administrators to be able to add new schedule data to CALC without requiring changes to the codebase.

For example, Kelly would like to augment his Region 10 bulk upload with rows containing data about Region 6 schedules. For his part, this would just require setting the schedule column on his bulk upload spreadsheet for the relevant price rate rows. But we also want to make sure that CALC starts listing the new schedule(s) as being available in all the places where it lists schedules.

Currently, the list of schedules available in CALC is surfaced in the following places:

  • A list of the schedules and their descriptions is shown in CALC's about page.
  • The "SIN / Schedule" drop-down filter in the Data Explorer lists the schedules and their SIN numbers.
  • The developer API documentation lists the possible values that the schedule field may have.

Currently these locations have hard-coded lists, and changing the schedule list requires visiting each list and making the relevant changes.

This PR changes the list to be stored in the database and be editable in the admin UI by Data Administrators, with an initial data migration that pre-populates the list to its current settings.

Examples

Data Administrators will see this in the admin UI:

screen shot 2018-06-20 at 11 53 03 am

Once they click on it, they will see a list of schedules:

screen shot 2018-06-20 at 11 53 21 am

If they click on one of them, they can edit it:

screen shot 2018-06-20 at 11 53 36 am

The "Schedule" field above may be a bit confusing: put more concretely, it's the value the schedule column in Kelly's bulk upload has that corresponds to the name of the schedule. So for our Region 6 example, this might be "Region 6", if that's the schedule column value Kelly gave to all the rows in his bulk upload that were for Region 6. But the "name" might be "Schedule 03FAC" since that might be a more famliar name for end-users to see in CALC.

Notes

  • The API docs now link to /api/schedules/ instead of hard-coding their list of schedules/SIN numbers. This is because dynamically generating the API docs is hard; I documented my efforts to do this at https://gist.github.com/toolness/9673aee0b3c7a5fc37d7a00b2c157fa8#gistcomment-2626291, but eventually gave up because it was just too much work and my solution might be too clever/hard to maintain.

  • I added a function to the Google Analytics helpers to track exceptions and used it to track failures when pulling schedule metadata via the API.

To do

  • Ensure that only Data Administrators and superusers can modify the schedule metadata.
  • Make the schedule/SIN list on the Data Explorer front-end be dynamically generated.
  • Make the schedule list in the API documentation be dynamically generated (or alternatively, if this is hard, link to a separate page or API endpoint that lists them).
  • Move the schedule metadata stuff out of constants.js since it's not really a constant anymore.
  • Consider adding a test that scrapes all relative URLs from /api/docs/ and makes sure that they exist.
  • Consider using bleach or a similar HTML sanitizer to ensure that the description field can't be abused to trigger XSS.
  • Consider replacing <p class="schedule"> in about.html with a header tag (also remove .schedule CSS rules if they are unused).
  • File an issue that proposes connecting this functionality with the display of schedule statistics on the about page (see #1993).
  • File an issue that proposes adding a data quality report showing the number of labor rates that have a schedule with no associated ScheduleMetadata record.

@toolness toolness changed the title from [WIP] Make the schedule list editable by Data Administrators to [WIP] Centralize schedule metadata and make it editable by Data Administrators Jun 20, 2018

innovative solutions to their information technology needs.
*Note: IT Schedule 70 is newly added so available results
in CALC is currently limited.*

This comment has been minimized.

@hbillings

hbillings Jun 20, 2018

Member

We should revisit whether we even still need this note.

This comment has been minimized.

@toolness

toolness Jun 20, 2018

Contributor

Yeah I agree. I think that #1993 / #1836 can help with this too, as it's basically about making freshness/quantity information more dynamic.

@hbillings

This comment has been minimized.

Member

hbillings commented Jun 20, 2018

This approach makes sense to me. I like the flexibility it builds in, and the logic is straightforward. I think the only part that feels a bit odd is having Yet Another Django App (YADA) just for contracts, but I'm not sure it makes sense to put this anywhere else, either.

@toolness

This comment has been minimized.

Contributor

toolness commented Jun 20, 2018

Cool!

I think the only part that feels a bit odd is having Yet Another Django App (YADA) just for contracts, but I'm not sure it makes sense to put this anywhere else, either.

Oh, contracts is a pre-existing Django app from CALC 1 days (it's what has the main Contract model), so I just added to it instead of making YADA for schedule metadata... but because contracts doesn't already expose any models to django admin, we have to have it under its own separate section in the admin UI, which I guess makes it look like YADA, even though it's not?

@hbillings

This comment has been minimized.

Member

hbillings commented Jun 20, 2018

Derp, you're totally right -- for some reason I was remembering contracts being inside of the calc app! (Then again, I can't even remember where my glasses are when they're on my face.)

'Text that appears in the "schedule" column of Contract data '
'to identify a labor rate as being in this schedule (e.g. "MOBIS").'
),
unique=True,

This comment has been minimized.

@tbaxter-18f

tbaxter-18f Jun 20, 2018

Member

I think unique=True on an open CharField could lead to some annoying user interactions when they get errors about the text they put in not being unique. Is there another way we can resolve this? Does this text have to be unique, or could we maybe "unique together"?

This comment has been minimized.

@toolness

toolness Jun 20, 2018

Contributor

I suppose it doesn't have to be unique, but it doesn't make much sense for it to be non-unique. Maybe we can leave it as unique and then change things if there's evidence that Data Administrators are running into confusing errors?

Another thing to note is that very few people will even have the ability to edit these records--really just Kelly and a small handful of other people who will have a relatively high level of technical expertise.

max_length=128
)
name = models.CharField(

This comment has been minimized.

@tbaxter-18f

tbaxter-18f Jun 20, 2018

Member

I'm sure I'm missing context, but why charFields instead of Foreignkeys?

This comment has been minimized.

@toolness

toolness Jun 20, 2018

Contributor

Sorry, a ForeignKey from which field to which field? I was thinking that the schedule field on the Contract model might actually be a good foreign key to the same field on the ScheduleMetadata model, but the problem there is that we can't actually guarantee integrity, as the Contract model is essentially just a spreadsheet that's been dumped into a database, and unfortunately contains all kinds of messiness (this was the way the table was structured when we started working on the project).

This comment has been minimized.

@tbaxter-18f

tbaxter-18f Jun 20, 2018

Member

I was thinking most of the fields here felt like FKs, but I see your point about not knowing the other objects (or even if they exist).

@property
def description_html(self):
return mark_safe(markdown.markdown(self.description)) # nosec

This comment has been minimized.

@tbaxter-18f

tbaxter-18f Jun 20, 2018

Member

I think this may be slightly unsafe. I would instead suggest sanitizing/bleaching the user input and then storing in a separate non-editable description_html field. That will also save re-translating it on every use for a nominal performance gain.

This comment has been minimized.

@toolness

toolness Jun 20, 2018

Contributor

Yeah, I was on the fence about sanitization since trusted admins are editing the code and might even get frustrated if certain markup isn't allowed, but it's a good idea.

That said, I'm not sure about having a separate db field for it, as the two could get out-of-sync, especially if the sanitization algorithm (e.g. bleach's tag whitelist) changed. For now I'd prefer to just dynamically generate the HTML and move to a db field later if it ends up slowing things down (though even before doing that I might try just using Django's caching layer).

This comment has been minimized.

@tbaxter-18f

tbaxter-18f Jun 20, 2018

Member

It's pretty straightforward. You just save the formatted version on save(). In my own code, I've actually abstracted it out to reusable code for all user inputs. (I have to do a LOT of sanitation)

This comment has been minimized.

@hbillings

hbillings Jun 20, 2018

Member

@tbaxter-18f Please tell me you refer to the abstracted version as SanitizationWorker.

Hello yes welcome to CALC you will be dealing with my terrible jokes from here on

This comment has been minimized.

@toolness

toolness Jun 20, 2018

Contributor

Yes, but the problem with saving the formatted version on save() is that then if we later realize that we need to add or remove a tag to be sanitized (for example, perhaps someone wants to use a <summary> element and we have no problem with it), all the pre-rendered fields will still have it filtered out, which means that we have to allow for some way to manually re-render records. (I've done a fair amount of work with sanitization as well and found that being able to quickly iterate on the sanitization strategy without having to worry about stale caches was beneficial.)

This comment has been minimized.

@tbaxter-18f

tbaxter-18f Jun 20, 2018

Member

In that unlikely occurence, you hop on the command line, iterate through the objects and resave the single field. Not an issue. It's worth it for the safety at least. Even with trusted users, mistakes happen.

This comment has been minimized.

@toolness

toolness Jun 20, 2018

Contributor

I’m not arguing against the use of sanitization here, I’m arguing against the premature optimization of caching the sanitized value.

This comment has been minimized.

@tbaxter-18f

tbaxter-18f Jun 20, 2018

Member

@hbillings I did not, but now I deeply regret not thinking of that. If anyone's interested, here's the full battery of cleaning and transforming I put all user inputs through: https://github.com/tBaxter/tango-shared-core/blob/master/tango_shared/utils/sanetize.py

<p>IT Schedule 70 offers federal, state and local governments innovative solutions to their information technology needs.<br />
<i>Note: IT Schedule 70 is newly added so available results in CALC is currently limited.</i></p>
{% for schedule in schedules %}
<p class="schedule">{{ schedule.full_name }}</p>

This comment has been minimized.

@tbaxter-18f

tbaxter-18f Jun 20, 2018

Member

would a header (or maybe a DL) be more appropriate?

This comment has been minimized.

@toolness

toolness Jun 20, 2018

Contributor

Yeah, I think it would be actually, I'm not sure why this wasn't a header to begin with... but also in the spirit of just making this a refactor, I didn't change it. But it seems like a simple enough change that I might as well!

@@ -3,10 +3,13 @@
from django.utils.safestring import mark_safe
from django.utils.crypto import get_random_string
from contracts.models import ScheduleMetadata
def about(request):

This comment has been minimized.

@tbaxter-18f

tbaxter-18f Jun 20, 2018

Member

I'm not sure how much value there is in swapping it, but we may want to consider moving to a generic ListView, since that's the pattern we're following. But again... there's not a whole lotta extra value in doing it, so if we don't want to, I'm not gonna argue.

This comment has been minimized.

@toolness

toolness Jun 20, 2018

Contributor

Yeah, I think the structure of this particular page will potentially change in the near future as the data explorer gets redesigned, so it's probably best to leave things as-is for now...

@tbaxter-18f

This comment has been minimized.

Member

tbaxter-18f commented Jun 20, 2018

I had the same notion re YADA but eventually figured out @toolness was leveraging the existing app and then I 👏 .

# This is SUPER weird. Previous TransactionTestCase runs (usually
# the result of a LiveServerTestCase) could have removed the
# schedule metadata populated by our migration, so we'll forcibly
# wipe our schedule metadata and re-run the migration just in case.

This comment has been minimized.

@toolness

toolness Jun 20, 2018

Contributor

For the record, figuring out how to work around this was super frustrating. I ended up learning a bit about rollback emulation and TEST_NON_SERIALIZED_APPS but still couldn't get all of CALC's tests to pass as a whole. I find it deeply disturbing that it's so easy to accidentally create Django tests that aren't properly isolated from one another, and I always seem to run into weird problems when attempting to create pre-populated tables--either through data migrations or other means--which is unfortunate.

I guess we might want to file an issue about this to attempt to deal with it later in a cleaner way, though I have no idea what that cleaner way might be. Update: filed #2023 to potentially address this later.

col_header = find_column_header(driver, 'schedule')
self.assertFalse(has_class(col_header, 'collapsed'))

This comment has been minimized.

@toolness

toolness Jun 22, 2018

Contributor

This was failing, and I think it was actually due to the problem I mentioned in #2018 (comment). Running this test on its own works fine, but running it in the context of the whole CALC test suite does not, and I think it has to do with the lack of isolation between tests with respect to data migrations. Update: if we ever fix #2023 we can probably bring back this test.

schedules.forEach((schedule) => {
scheduleLabels[schedule.schedule] = schedule.full_name;
});
}

This comment has been minimized.

@toolness

toolness Jun 23, 2018

Contributor

I'm not thrilled by the fact that this module stores state (albeit constant-ish state) in an exported object, but this PR was getting big and I didn't want to change too much front-end code. I think a future PR could potentially make a cleaner solution out of this, maybe.

sin: '',
schedule: 'Consolidated',
full_name: 'Consolidated'
}]);

This comment has been minimized.

@toolness

toolness Jun 23, 2018

Contributor

I wanted to make sure the front-end code still worked the same way it did before, so during test setup I moved the old hard-coded SCHEDULE_LABELS over here and made them look like they were coming from the API.

return self.driver.find_element_by_id('search')
wait = WebDriverWait(self.driver, 10)
element = wait.until(EC.element_to_be_clickable((By.ID, 'search')))
return element

This comment has been minimized.

@toolness

toolness Jun 23, 2018

Contributor

We could previously assume that the React UI was visible on page load because we were able to initialize it on DOMContentLoaded. However, now that we make an API request before initializing the React app, we need to wait for the app to be visible.

@@ -56,7 +56,7 @@
"redux": "^3.6.0",
"redux-logger": "^2.7.4",
"sinon": "2.1.0",
"typescript": "^2.7.2",
"typescript": "^2.9.2",

This comment has been minimized.

@toolness

toolness Jun 23, 2018

Contributor

I had to upgrade typescript so that importing JSDoc-based type definitions from other modules would work.

* @property {string} schedule - The schedule identifier
* @property {string} sin - The SIN number (can be an empty string)
* @property {string} full_name - The user-facing name for the schedule
*/

This comment has been minimized.

@toolness

toolness Jun 23, 2018

Contributor

(Note that typedefs are basically a way for TypeScript to ensure that we're using a data structure the right way, e.g. not accidentally making typos in property names and such. It also gives us intellisense (auto code completion) on editors with the proper plugins installed, which is nice.)

I wasn't really sure where to put this typedef... I could have put it in the new schedule-metadata.js file, but because the type is really coming from the API, I thought it was more appropriate to put it here, so all the API-specific code and constants and type definitions are in the same place.

store.dispatch(invalidateRates());
store.dispatch(invalidateRates());

This comment has been minimized.

@toolness

toolness Jun 23, 2018

Contributor

Moving some of the initialization after the populating of schedule metadata is crucial, because some of the initialization validates the querystring, which means ensuring that the schedule querystring argument is valid. If this initialization was run before schedule metadata was populated, then the schedule querystring argument would always be perceived as invalid, since the app would know nothing about schedules at that time.

}
populateScheduleLabels(schedules || []);
startApp();
});

This comment has been minimized.

@toolness

toolness Jun 23, 2018

Contributor

I would kind of rather this be promise-based, but we don't seem to be using promises anywhere else in this code right now (I guess due to the fact that it's been evolving from legacy code circa 2014), and I wasn't sure if promises would be available via polyfills on older browsers, so I stuck to good ol' callbacks for now.

if (err) {
trackException(err, true);
}
populateScheduleLabels(schedules || []);

This comment has been minimized.

@toolness

toolness Jun 23, 2018

Contributor

Note that an error doesn't fatally crash the application; I figured we could "recover" by just assuming the schedule list is empty and moving on. This would effectively mean that CALC would be usable, but the "Schedule" dropdown would be empty, and any schedule querystring argument in the URL would be ignored.

My one worry about this approach has to do with the latter limitation: if someone were to share a link to a CALC search that filtered by schedule, and then another person loaded that link in their browser but this api.getSchedules() call failed, then the user would see what appeared to be the shared search, only without the schedule filter, which could be misleading.

@toolness toolness changed the title from [WIP] Centralize schedule metadata and make it editable by Data Administrators to Centralize schedule metadata and make it editable by Data Administrators Jun 23, 2018

@toolness

This comment has been minimized.

Contributor

toolness commented Jun 23, 2018

Okay, I think this PR is ready to go, as an initial implementation at least. There are still items on the TODO list in the description of this PR, but I think they should be pushed out to separate issues because this PR is already quite large.

@hbillings

This comment has been minimized.

Member

hbillings commented Jul 3, 2018

This seems good to go to me if @tbaxter-18f feels the same!

@tbaxter-18f

This comment has been minimized.

Member

tbaxter-18f commented Jul 3, 2018

👍

@hbillings hbillings merged commit 22af161 into develop Jul 3, 2018

3 checks passed

ci/circleci: build Your tests passed on CircleCI!
Details
security/snyk - package.json (CALC) No new issues
Details
security/snyk - requirements.txt (CALC) No new issues
Details

@hbillings hbillings deleted the dynamic-schedule-list-2 branch Jul 3, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment