Add experimental price list analysis #997
This adds some experimental price list analysis functionality so we can at least figure out what our outliers are. A lot of the early work was done in a Jupyter notebook.
As we improve the algorithm, we can potentially use this to provide a report that COs can use to guide their negotiations (this would involve having them upload proposed price lists to CALC, not awarded ones).
The hard part is "broadening" the search from the labor category of a price list row to be more generic, while still being useful and not requiring administrators to constantly manage some kind of hand-crafted classification system. Here's how it currently works:
Just modified the analysis text so it's a bit less jargony:
In particular, we're only saying whether a price is above or below one standard deviation from the mean, rather than specifying exactly how many standard deviations away it is. However, the number of a's in "way" is equal to the number of standard deviations, so saying "waaay below" actually means it's three standard deviations below.
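As a rough illustration, the wording rule described above could be sketched like this (a hypothetical helper for this comment; the actual CALC code may differ):

```python
def describe_deviation(z_score):
    """Turn a z-score into the informal wording used in the analysis text.

    Hypothetical sketch: one 'a' in "way" per standard deviation, so a
    z-score of -3 reads as "waaay below". Not the actual CALC implementation.
    """
    n = int(abs(z_score))
    if n < 1:
        return "about average"
    direction = "above" if z_score > 0 else "below"
    return f"w{'a' * n}y {direction} average"

print(describe_deviation(-3.2))  # "waaay below average"
print(describe_deviation(1.4))   # "way above average"
```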
Here's a quick summary review....
There are a handful of issues that should be tackled before release, specifically preventing the analysis from failing with out-of-memory errors and making the results of the matching clear for the price analysis.
Beyond that, there remains the potential for a lengthy or otherwise complex incoming price list to take too long to analyze. This is mitigated somewhat by the limited set of users with access to the tool, so long as they're made aware of the experimental nature of this feature and its limitations. We can work towards bringing the runtime down with additional caching of intermediate work and by not writing to the database.
Additional logging throughout the analysis/matching, specifically the intermediate broadening steps, would also be helpful going forward to assist in the evolution of the algorithm by future developers.
Additionally, there's a good amount of housekeeping that can be done in the form of refactoring, removal of dead code, and outstanding TODOs throughout the PR. I'll submit new issues for the latter.
Definitely address before release:
Nice to have:
These earlier concerns from the world of stats are worth resurfacing and addressing as a separate issue once this is merged.
Awesome, looking great @tadhg-ohiggins!
For 7cf045b, I think it would be better to tackle this by cutting the `ContractsQuerySet` methods. It would likely be confusing down the line if we add `get_queryset` to a `models.QuerySet`, as it's a `models.Manager` method in Django. That would also limit the tweaks here to code within this PR and avoid the associated test changes.
Specifically, I think it can be done like this...
Change this call in

```diff
 for i, phrase in enumerate(broaden_query(cursor, vocab, labor_category, cache, min_count)):
-    phrase_qs = Contract.objects.all().multi_phrase_search(phrase)
+    phrase_qs = Contract.objects.multi_phrase_search(phrase)
```
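For context, the reason the shorter call can work is Django's `Manager.from_queryset()`, which copies a queryset's methods onto the manager class so callers don't need the intermediate `.all()`. Here's a toy, Django-free sketch of that delegation pattern (all names are illustrative, not CALC's real code):

```python
# Toy sketch of the pattern behind Django's Manager.from_queryset():
# queryset methods are copied onto the manager class, delegating through
# get_queryset(), so callers can write objects.multi_phrase_search(...)
# instead of objects.all().multi_phrase_search(...).

class ToyQuerySet:
    def __init__(self, rows):
        self.rows = rows

    def multi_phrase_search(self, phrase):
        # stand-in for the real phrase search against Contract rows
        return [row for row in self.rows if phrase in row]


def manager_from_queryset(queryset_class, method_names):
    """Build a manager class whose listed methods delegate to the queryset."""

    class ToyManager:
        def __init__(self, rows):
            self._rows = rows

        def get_queryset(self):
            return queryset_class(self._rows)

    def make_delegate(name):
        def delegate(self, *args, **kwargs):
            return getattr(self.get_queryset(), name)(*args, **kwargs)
        return delegate

    for name in method_names:
        setattr(ToyManager, name, make_delegate(name))
    return ToyManager


ContractManager = manager_from_queryset(ToyQuerySet, ["multi_phrase_search"])
objects = ContractManager(["senior engineer", "junior analyst"])
print(objects.multi_phrase_search("engineer"))  # ['senior engineer']
```

In real Django code this would just be `objects = models.Manager.from_queryset(ContractsQuerySet)()` on the model, which keeps the search methods on the queryset where they belong.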