Agency Identification V2 #16
Suggestion: also check URLs of data sources for that agency. Some agencies have data about them on multiple municipal websites. We might have to be more careful here: we're going to get similar URL patterns for different agencies that use the same data portal, so it might not be worth it.
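The risk described above, that agencies sharing a data portal produce near-identical URL patterns, could be surfaced by grouping data-source URLs by hostname. This is a hypothetical sketch using only the Python standard library; the agency names and URLs are invented, not from the actual database:

```python
from urllib.parse import urlparse
from collections import defaultdict

def group_by_host(source_urls):
    """Group (agency, URL) pairs by hostname so shared portals stand out."""
    hosts = defaultdict(list)
    for agency, url in source_urls:
        hosts[urlparse(url).netloc].append(agency)
    return hosts

# Hypothetical example: two agencies publish on the same data portal.
sources = [
    ("Springfield PD", "https://data.exampleportal.gov/springfield/arrests"),
    ("Shelbyville PD", "https://data.exampleportal.gov/shelbyville/arrests"),
    ("Ogdenville PD", "https://ogdenville.example.gov/police/records"),
]
shared = {h: a for h, a in group_by_host(sources).items() if len(a) > 1}
print(shared)  # hosts claimed by multiple agencies are a weak identification signal
```

Any host claimed by more than one agency would be a weak signal for identification, which is the caution the comment raises.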
So, to make sure I understand: we'd be adding this information to the same JSON file the HTML Tag Collector is appending to, right? I figure the best bet is to go to the home page of the website and grab information there. My reasoning is:
The implementation would likely be something of the following:
Currently working on this in the linked branch.
I've created the relevant code and linked it in the above pull request. As designed, it follows the implementation I mentioned above and includes relevant unit and integration tests; it makes no other modifications. Before converting this PR to a draft, I'll need to figure out how to run the entire workflow and compare my results against the current suite of results to see whether they differ. Because I'm not yet fully familiar with the workflow, other components may need to change to ensure my changes are accounted for in URL identification.
If I want to see whether I can make changes that improve the error rate, I first need a representative sample of the new URLs that are regularly applied to it (and knowledge of the source of that sample), so I can get a sense of how often it succeeds and fails and then see how to tweak it. @mbodeantor @josh-chamberlain, is such a representative sample readily available, or would I need to create one?
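Once a labeled sample exists, measuring the error rate described above could be as simple as the following hypothetical sketch; the function and the sample data are illustrative, not part of the existing codebase:

```python
def match_rate(predictions, labels):
    """Fraction of URLs whose predicted agency matches the labeled agency."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must align one-to-one")
    if not labels:
        return 0.0
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical labeled sample: URL -> true agency, alongside model output.
labels      = ["Springfield PD", "Shelbyville PD", "Ogdenville PD"]
predictions = ["Springfield PD", "Shelbyville PD", "Springfield PD"]
print(match_rate(predictions, labels))  # error rate is 1 - match rate
```

Tracking this number before and after each tweak gives the success/failure sense the comment asks for.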
@maxachis in the past we used Common Crawl to get a list of URLs. One approach would be to limit the URLs to those containing "police", though it's your choice. Since we'll need to fetch URLs for both annotation and identification, please consider saving a script for reuse. Another option: use the sitemap scraper on our existing agencies database.
@josh-chamberlain Are there any old scripts from when Common Crawl was previously used that I could reference as a starting point? I've no problem starting from scratch, but precedent is always helpful.
@maxachis no, I think someone just curl'd their API. We don't need automation or anything complicated at this moment.
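Querying the Common Crawl CDX index is indeed a plain HTTP GET, so a curl-style approach fits. The sketch below builds such a query and filters the JSON-lines records it would return for URLs containing "police", as suggested above; the snapshot name follows the public CC-MAIN-YYYY-WW pattern but is an assumption, and the helper names are hypothetical:

```python
import json
from urllib.parse import urlencode

# Hypothetical snapshot; real ones follow the CC-MAIN-YYYY-WW naming scheme.
CDX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

def build_cdx_query(url_pattern: str) -> str:
    """Build a CDX index query URL returning one JSON record per line."""
    return f"{CDX_ENDPOINT}?{urlencode({'url': url_pattern, 'output': 'json'})}"

def filter_police_urls(cdx_lines):
    """Keep only captured URLs containing 'police'."""
    urls = []
    for line in cdx_lines:
        record = json.loads(line)
        if "police" in record.get("url", ""):
            urls.append(record["url"])
    return urls
```

The query URL from `build_cdx_query("*.gov")` could then be fetched with curl or `urllib`, and the result piped through `filter_police_urls` to produce the sample.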
@josh-chamberlain Got it. This consequently overlaps with #40. Because #40 is upstream of this problem, I'm going to pause work on this issue until I finish #40, which shouldn't take too long. I don't want to tie too much code to a single branch, and I can't validate performance with new data until I have a well-defined source of that data. So while I am playing fast and loose with how many issues I tackle at a time, this issue depends on #40 and hence must be subordinate to it.
@josh-chamberlain I've created a PR adding a component that appends additional information which should, in theory, improve match rates, but I would likely need @EvilDrPurple or others to verify that it actually suits its intended purpose. We may need to try several forms of training that include or exclude particular information. I also don't know whether this issue should necessarily be closed by that PR; rather, it can be one additional possible way to improve the match rate.