Add experimental price list analysis #997

Merged

merged 211 commits into develop from price-list-analysis on Sep 26, 2018

Conversation

@toolness
Contributor

toolness commented Nov 2, 2016

This adds some experimental price list analysis functionality so we can at least figure out what our outliers are. A lot of the early work was done in a Jupyter notebook.

As we improve the algorithm, we can potentially use this to provide a report that COs can use to guide their negotiations (this would involve having them upload proposed price lists to CALC, not awarded ones).

[Screenshot: analysis results, taken 2016-11-02 at 5:26 PM]

The hard part is "broadening" the search from the labor category of a price list row to be more generic, while still being useful and not requiring administrators to constantly manage some kind of hand-crafted classification system. Here's how it currently works (a rough code sketch follows the list):

  1. First, we ask the database for a list of words in the corpus of labor categories that occur at least a certain number of times (currently 100).
  2. We filter against this list to find the most salient parts of a given row's labor category. For instance, it's how we toss out words like "Ridiculously", "Awesome", and "Doom" in the above screenshot.
  3. Finally, we permute that filtered list of words until we find a combination that a decent number of existing contracts match against. If no contracts match, then we don't have any advice to give, which is the case with "Hamburger Chef" in the above screenshot.
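
Roughly, in code (a sketch only: build_vocab, broaden, and count_matches are illustrative names, and I'm using unordered combinations for simplicity rather than the exact permutation logic):

    # Illustrative sketch of the three steps above; the real code asks the
    # database for word counts and match counts, so count_matches stands in.
    from itertools import combinations

    MIN_WORD_COUNT = 100  # step 1: threshold for entering the vocabulary
    MIN_COMPARABLES = 4   # step 3: matches needed for a phrase to be useful

    def build_vocab(labor_categories):
        """Step 1: words occurring at least MIN_WORD_COUNT times in the corpus."""
        counts = {}
        for category in labor_categories:
            for word in category.lower().split():
                counts[word] = counts.get(word, 0) + 1
        return {word for word, n in counts.items() if n >= MIN_WORD_COUNT}

    def broaden(labor_category, vocab, count_matches):
        """Steps 2 and 3: keep the salient words, then try progressively
        smaller combinations until enough existing contracts match."""
        words = [w for w in labor_category.lower().split() if w in vocab]
        for size in range(len(words), 0, -1):  # prefer less-broadened phrases
            for combo in combinations(words, size):
                if count_matches(combo) >= MIN_COMPARABLES:
                    return combo
        return None  # no matches at all: no advice to give

So "Ridiculously Awesome Engineer of Doom" filters down to something like ("engineer",), while "Hamburger Chef" filters down to nothing and returns None.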

Notes:

  • The links in the screenshot point to CALC searches for the given labor category and education level. The min/max years of experience are set to the two years on either side of the price list row's minimum experience level: if the row requires a minimum of 7 years' experience, the linked CALC search has a minimum experience of 5 years and a maximum of 9. We first search for labor categories matching the exact same education and experience level; if that doesn't yield a minimum number of comparables, we fall back to a search for contracts at or above the education and experience level of the source row (see the sketch after this list).
  • Currently a labor category like junior administrative engineer would result in a CALC search for engineer rather than junior engineer due to the way the algorithm works. Fixed in c673afd.
  • The "vocabulary" of frequently-occurring words described above is actually quite small and can fit in memory pretty easily. Update: Because the minimum number of comparables is now really low (4 versus the old value of 100), we need to have a much more granular vocabulary too. I'm not sure how big it is, though it does still seem to fit into memory.
  • Trying out different permutations of words can take a little while, as each permutation requires a separate database query to determine if it yields a significant number of results. This could get slow under load, but I suppose one advantage is that it's somewhat scalable, if we have multiple database replicas. We can also use caching to speed things up, some of which is already being done. I've also added a word co-occurrence matrix to speed things up.
  • Because we're not explicitly designing a classification system, some of the classifications the system comes up with can be a bit weird. For instance, it might compare Junior hamburger chef to Junior, which isn't really a labor category (using part-of-speech tagging wouldn't be very useful here, since Junior is both an adjective and a noun). (That said, who knows, actually doing a search on Junior constrained by experience and education could actually be useful!) Many "weird" classifications are eliminated if nltk is installed, in which case we use part-of-speech tagging to make sure that the broadened query we've come up with actually has nouns in it.
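
Here's a rough sketch of the search-link logic from the first note; find_contracts and EDU_LEVELS are hypothetical stand-ins for the real query code:

    # Hypothetical sketch; find_contracts stands in for the real CALC search.
    EDU_LEVELS = ['High School', 'Associates', 'Bachelors', 'Masters', 'Ph.D']
    MIN_COMPARABLES = 4

    def comparable_contracts(find_contracts, labor_category, education,
                             min_experience):
        # Two years on either side of the row's minimum experience, so a
        # 7-year minimum becomes a search for 5 to 9 years.
        min_years = max(min_experience - 2, 0)
        max_years = min_experience + 2

        # First pass: the exact education level within the experience window.
        exact = find_contracts(labor_category, [education], min_years, max_years)
        if len(exact) >= MIN_COMPARABLES:
            return exact

        # Fallback: education and experience at or above the source row's.
        at_or_above = EDU_LEVELS[EDU_LEVELS.index(education):]
        return find_contracts(labor_category, at_or_above, min_experience, None)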

To do:

  • If a labor category is particularly long and contains lots of frequently-used words, the permutation phase could take a really long time. We need to limit the maximum number of permutations. (I added something to cap this in 09b746b.)
if (row['price'].errors or row['years_experience'].errors or
        row['education'].errors or row['service'].errors):
    return None


@toolness

toolness Nov 2, 2016

Contributor

The indentation here is super weird to me, but the style that's natural to me violates E129 ("visually indented line with same indent as next logical line"). From the conversation on that issue, though, it looks like the advice is to just globally disable that check, since PEP 8 specifically gives no guidance on it.
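
(If we do disable it globally, it should just be a one-liner in the linter config; assuming flake8, something like:)

    [flake8]
    # extend-ignore requires flake8 >= 3.7; older versions would use "ignore"
    extend-ignore = E129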


@jseppi

jseppi Nov 2, 2016

Contributor

That looks totally normal to me o_O But you can turn off E129 if you want, I don't really mind either way.

Or maybe

has_errors = (row['price'].errors or row['years_experience'].errors or
              row['education'].errors or row['service'].errors)
if has_errors:
    return None


@toolness

toolness Nov 2, 2016

Contributor

Haha, if it looks totally normal to you, I'm fine with it then :)

@toolness


Contributor

toolness commented Nov 16, 2016

Just modified the analysis text so it's a bit less jargony:

[Screenshot: revised analysis text, taken 2016-11-16 at 1:37 PM]

In particular, we're only saying whether it's above or below +1/-1 standard deviation from the mean, rather than specifying how many standard deviations from the mean it is. However, the number of a's in "way" is equal to the number of standard deviations, so "waaay below" actually means it's three standard deviations below.

@@ -290,6 +290,9 @@ def describe(cursor, vocab, labor_category, min_years_experience,
    result['avg'] = avg
    result['stddev'] = stddev
    result['stddevs'] = math.ceil(price_delta / stddev)
    result['preposition'] = \
        'w' + ('a' * result['stddevs']) + 'y ' + \
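
Since the diff is cut off above, here's the rule as a self-contained sketch (hypothetical; the real describe() carries more context and presumably only phrases prices beyond one standard deviation):

    import math

    def preposition(price_delta, stddev):
        """One 'a' per standard deviation from the mean."""
        stddevs = math.ceil(abs(price_delta) / stddev)
        direction = 'below' if price_delta < 0 else 'above'
        return 'w' + ('a' * stddevs) + 'y ' + direction

    # preposition(-30, 10) -> 'waaay below' (three standard deviations)
    # preposition(15, 10)  -> 'waay above'  (two standard deviations)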


@jseppi

jseppi Nov 16, 2016

Contributor

😆

@toolness


Contributor

toolness commented Jan 12, 2017

Here's a screenshot of what the analysis looks like now:

[Screenshot: analysis UI, taken 2017-01-12 at 5:01 PM]

James Seppi and others added some commits Jan 25, 2017

@Jkrzy


Jkrzy commented Sep 18, 2018

@tadhg-ohiggins, @hbillings

Here's a quick summary review.

There are a handful of issues that should be tackled before release, specifically toward preventing the analysis from failing with out-of-memory errors and making the results of the price analysis matching clear.

Beyond that, there remains the potential for a lengthy or otherwise complex incoming price list to take too long to analyze. This is mitigated a bit by the limited set of users with access to the tool, so long as they're made aware of the experimental nature of this feature and its limitations. We can work toward bringing the runtime down with additional caching of intermediate work and by not writing to the database.

Additional logging throughout the analysis/matching, specifically the intermediate broadening steps, would also help future developers evolve the algorithm.

Additionally, there's a good amount of housekeeping that can be done in the form of refactoring, removing dead code, and addressing outstanding TODOs throughout the PR. I'll submit new issues for the latter.

Definitely address before release:

Nice to have:

@Jkrzy


Jkrzy commented Sep 18, 2018

These earlier concerns from the world of stats are worth resurfacing and addressing as a separate issue once this is merged.
#997 (comment):

Aside from that, though, I know there are also some statistical things that @EricSchles had concerns about: if I recall correctly, he thinks that terms like "median" and "standard deviation" only really make sense in a normal distribution, and when we're making up our own search criteria from scratch, we're quite likely to get something that isn't a normal distribution. But since the current algorithm is a reflection of how people are already using CALC manually, it might be a bigger problem than just this algorithm...

@ericronne ericronne referenced this pull request Sep 20, 2018

Closed

[WIP] Add Schedule 736 price list analysis #2123


tadhg-ohiggins added some commits Sep 25, 2018

@tadhg-ohiggins


Collaborator

tadhg-ohiggins commented Sep 25, 2018

I addressed a number of @Jkrzy's comments in my last several commits, and I created new issues (#2179 and #2180) for the ones we ran out of time for. This should be ready for final review and (hopefully!) approval for merging into develop.

# Cap on number of lexemes that get_best_permutations will process.
# This is a crude attempt to stop the CPU/RAM consumption of the function from
# getting too large.
MAX_LEXEMES_FOR_PERMUTATIONS = 4


@Jkrzy

Jkrzy Sep 26, 2018

Recommend bumping this up to 8 so as to avoid handicapping the broadening algorithm's attempts to identify the "best" permutation while still preventing the process from exploding on weird input.

At 8, we'd avoid trimming lexemes for ~99.4% of the contracts we've seen thus far. At 4, that drops to ~85%. Explored by binning the number of lexemes matched to the vocab on a test dataset of ~55k labor categories.
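
For a sense of scale, here's a quick back-of-the-envelope, assuming for simplicity that every non-empty subset or ordered arrangement of the matched lexemes is a candidate (the real get_best_permutations may prune differently):

    # Rough upper bounds on candidate phrases for n matched lexemes.
    from math import perm  # Python 3.8+

    def candidate_counts(n):
        subsets = 2 ** n - 1  # unordered, non-empty subsets
        arrangements = sum(perm(n, k) for k in range(1, n + 1))  # ordered
        return subsets, arrangements

    for n in (4, 8, 12):
        print(n, candidate_counts(n))
    # 4  -> (15, 64)
    # 8  -> (255, 109600)
    # 12 -> (4095, 1302061344)

Either way the growth is steep enough that some cap is needed; 8 keeps the worst case tractable while, per the binning above, almost never trimming real input.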

@Jkrzy

Awesome, looking great @tadhg-ohiggins!

For 7cf045b, I think it would be better to tackle this by cutting the ContractsQuerySet methods. It's likely to be confusing down the line if we add get_queryset to a models.QuerySet, since it's a models.Manager method in Django.

That would also limit the tweaks here to code within this PR and avoid the associated test changes.

Specifically, I think it can be done like this:

  • Cut ContractsQuerySet.multi_phrase_search and ContractsQuerySet.get_queryset
  • Change this call in data_capture.analysis.core.find_comparable_contracts:

    for i, phrase in enumerate(broaden_query(cursor, vocab, labor_category, cache,
                                min_count)):
-        phrase_qs = Contract.objects.all().multi_phrase_search(phrase)
+        phrase_qs = Contract.objects.multi_phrase_search(phrase)
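
For context, this assumes the usual custom-QuerySet idiom (sketch only; field names are illustrative): Django's QuerySet.as_manager() already exposes multi_phrase_search on Contract.objects, which is why the intermediate .all() is redundant.

    from django.db import models

    class ContractsQuerySet(models.QuerySet):
        def multi_phrase_search(self, phrase):
            # Illustrative filter only; the real implementation lives in CALC.
            return self.filter(labor_category__icontains=phrase)

    class Contract(models.Model):
        labor_category = models.CharField(max_length=255)

        # as_manager() makes multi_phrase_search available directly on
        # Contract.objects, with no need for .all() first.
        objects = ContractsQuerySet.as_manager()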

@tadhg-ohiggins tadhg-ohiggins changed the title from [WIP] [Experiment] Add experimental price list analysis to [Experiment] Add experimental price list analysis Sep 26, 2018

@hbillings hbillings changed the title from [Experiment] Add experimental price list analysis to Add experimental price list analysis Sep 26, 2018

@hbillings

MERGE THIS SUCKER

@tadhg-ohiggins tadhg-ohiggins merged commit 0affe38 into develop Sep 26, 2018

3 checks passed:

  • ci/circleci: build: Your tests passed on CircleCI!
  • security/snyk - package.json (CALC): No new issues
  • security/snyk - requirements.txt (CALC): No new issues

@hbillings hbillings deleted the price-list-analysis branch Sep 29, 2018
