Removing Bad Data #1432
Currently, CALC appears to have some bad data (e.g. hourly rates well over $1,000 an hour), so we want a strategy for removing such data.
`f("secretary") -> "low paying"`
We'll probably use an SVM to do the text classification. We may also want to include some data from the prices we trust to reinforce this learning algorithm. Bringing in mixed data can be tricky, so we might need to do some preprocessing/encoding of the text in question.
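As a rough sketch of what that classifier could look like (the labels, training titles, and choice of scikit-learn's `TfidfVectorizer` + `LinearSVC` are all assumptions, not anything decided in this issue):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data: labor-category titles -> coarse wage band.
titles = ["secretary", "junior analyst", "senior engineer",
          "program director", "clerk", "principal consultant"]
bands = ["low paying", "low paying", "high paying",
         "high paying", "low paying", "high paying"]

# Character n-grams cope reasonably well with short, noisy job titles.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
model.fit(titles, bands)

print(model.predict(["executive secretary"]))
```

The TF-IDF step is one way to do the "preprocess encoding" mentioned above: it turns free-text titles into numeric vectors that can later be concatenated with trusted price features.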
As for the other data, I've made a general measure of spread and center. We can say that if a price falls more than two spread-lengths outside the center, that data point is likely an outlier. Another anomaly detection technique we might consider is isolation forests.
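The spread/center rule could be sketched like this. The issue doesn't pin down which statistics "spread" and "center" are, so median and median absolute deviation are used here as an assumed robust choice:

```python
import numpy as np

def is_outlier(prices, k=2.0):
    """Flag prices more than k spread-lengths from the center."""
    prices = np.asarray(prices, dtype=float)
    center = np.median(prices)
    spread = np.median(np.abs(prices - center))  # median absolute deviation
    return np.abs(prices - center) > k * spread

rates = [45, 52, 60, 38, 75, 1200]  # one suspicious hourly rate
print(is_outlier(rates))
```

A robust center/spread matters here: a plain mean and standard deviation would themselves be dragged upward by a $1,200/hr entry, which could mask the very outlier we're trying to catch.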
Anomaly detection seems like a great way to also detect if an uploaded price list (through Data Capture) has some weird pricing in it, and present it as a warning to the user!
"Whoa, your price list has a 'Project Manager' with an hourly rate of $1200. Is that correct?"
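The upload-time warning could look something like this, using scikit-learn's `IsolationForest` (one of the detectors mentioned above); the trusted-rates data and the exact message wording are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical trusted hourly rates for a labor category.
trusted = np.array([[40], [55], [62], [48], [70], [85], [58], [66]])

detector = IsolationForest(contamination=0.05, random_state=0).fit(trusted)

# Hypothetical uploaded price list from Data Capture.
uploaded = {"Project Manager": 1200, "Analyst": 52}
for title, rate in uploaded.items():
    if detector.predict([[rate]])[0] == -1:  # -1 means anomaly
        print(f"Whoa, your price list has a '{title}' with an hourly "
              f"rate of ${rate}. Is that correct?")
```

Since the detector is fit on trusted prices only, it's cheap to run per upload and the result is just a warning prompt, so false positives cost the user only a confirmation click.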
Whoa, I just noticed this issue existed. FWIW, I did some exploration of anomaly detection recently on the calc-analysis repo.