Agency Identification V2 #16

Closed · mbodeantor opened this issue Sep 20, 2023 · 10 comments · Fixed by #36
Assignees: maxachis
Labels: enhancement (New feature or request)

Comments

@mbodeantor (Contributor)

Fixes

  • Improving the match rate of new URLs against those in the Agency table

Description

  • Try to find a reliable source of geographic data in the header, footer, or other tags.
  • Use it to narrow down the set of agencies to match against (a rough sketch of this idea follows below).
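A minimal sketch of what that narrowing could look like, assuming a BeautifulSoup-parsed page and an in-memory list of agency records; the `agencies` structure, its fields, and `extract_geographic_hints`/`narrow_agencies` are hypothetical placeholders, not the actual Agency table schema or pipeline code:

```python
import re

import requests
from bs4 import BeautifulSoup

# Hypothetical agency records; the real Agency table has its own schema.
agencies = [
    {"name": "Baltimore County Police Department", "state": "MD", "county": "Baltimore"},
    {"name": "Allegheny County Sheriff", "state": "PA", "county": "Allegheny"},
]

def extract_geographic_hints(html: str) -> set[str]:
    """Collect candidate place names from the title, meta, header, and footer tags."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    if soup.title and soup.title.string:
        texts.append(soup.title.string)
    for meta in soup.find_all("meta", attrs={"name": ["description", "keywords"]}):
        texts.append(meta.get("content", ""))
    for tag in soup.find_all(["header", "footer"]):
        texts.append(tag.get_text(" ", strip=True))
    # Very rough tokenization; a real version would match against a gazetteer.
    tokens = set()
    for text in texts:
        tokens.update(re.findall(r"[A-Z][a-z]+", text or ""))
    return tokens

def narrow_agencies(html: str) -> list[dict]:
    """Keep only agencies whose county or state appears in the page's geographic hints."""
    hints = extract_geographic_hints(html)
    return [a for a in agencies if a["county"] in hints or a["state"] in hints]

if __name__ == "__main__":
    resp = requests.get("https://www.baltimorecountymd.gov", timeout=10)  # example URL
    print(narrow_agencies(resp.text))
```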
@josh-chamberlain (Contributor)

Suggestion: also check the URLs of data sources for that agency. Some agencies have data about them on multiple municipal websites. We might have to be more careful, though: different agencies using the same data portal will produce similar URL patterns, so this might not be worth it.
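A hedged sketch of what that check might look like, assuming a hypothetical mapping of agencies to their known data-source URLs and a hand-maintained list of shared data-portal hosts; none of these names come from the actual codebase:

```python
from urllib.parse import urlparse

# Hypothetical examples; real data would come from the data sources table.
agency_source_urls = {
    "Baltimore County Police Department": [
        "https://www.baltimorecountymd.gov/departments/police/crime-stats",
    ],
}

# Hosts shared by many agencies, which would otherwise cause false matches.
SHARED_PORTAL_HOSTS = {"data.census.gov", "muckrock.com"}

def candidate_agencies(new_url: str) -> list[str]:
    """Return agencies whose known data-source hosts match the new URL's host."""
    host = urlparse(new_url).netloc.lower()
    if host in SHARED_PORTAL_HOSTS:
        return []  # a shared portal host carries no agency-specific signal
    matches = []
    for agency, urls in agency_source_urls.items():
        if any(urlparse(u).netloc.lower() == host for u in urls):
            matches.append(agency)
    return matches

print(candidate_agencies("https://www.baltimorecountymd.gov/departments/police/news"))
```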

josh-chamberlain added the enhancement (New feature or request) label Oct 26, 2023
maxachis self-assigned this Feb 21, 2024
@maxachis (Collaborator) commented Feb 22, 2024

So to make sure I understand, we'd be adding this information to the same JSON file the HTML Tag Collector is appending to, right?

I figure the best bet is to go to the home page of the website and grab information there. My reasoning is:

  1. The home page will most likely contain geographic information fairly well-described in a number of different tags. Take for example the Baltimore County home page, which contains "Baltimore" 43 times in the HTML, most prominently in the title.
  2. It's highly unlikely that a sub-page will refer to a different geographic location than the home page of the website, so data taken from the home page (where it is more reliably present) should work just as well as data taken from a subordinate page.

The implementation would likely be something like the following (a rough sketch follows the list):

  1. Get the root URL of the web page (see here for a probable regex pattern)
  2. Look up that root URL in the lookup table
  3. If that root URL already exists in the table (and its entry has not expired, in the case where we want to re-scrape after a certain amount of time has elapsed), retrieve the associated title info as “home_title_tag”
  4. Otherwise, fetch the home page, populate the lookup table, and then return the “home_title_tag”.
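A minimal sketch of that flow, assuming the tag collector already has the page's full URL; the cache structure, the `get_home_title_tag` name, and the expiry interval are placeholders rather than anything from the linked PR:

```python
import time
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

# Hypothetical in-memory lookup table: root URL -> (home_title_tag, fetched_at).
_home_title_cache: dict[str, tuple[str, float]] = {}
CACHE_TTL_SECONDS = 7 * 24 * 3600  # re-scrape after a week; placeholder value

def get_root_url(url: str) -> str:
    """Reduce a page URL to scheme + host, e.g. https://www.baltimorecountymd.gov."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"

def get_home_title_tag(url: str) -> str:
    """Return the home page <title> for a URL, caching results per root URL."""
    root = get_root_url(url)
    cached = _home_title_cache.get(root)
    if cached and time.time() - cached[1] < CACHE_TTL_SECONDS:
        return cached[0]
    resp = requests.get(root, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    _home_title_cache[root] = (title, time.time())
    return title

# Each scraped record could then carry a "home_title_tag" field alongside the other tags.
```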

@maxachis (Collaborator)

Currently working on this in the linked branch.

@maxachis (Collaborator) commented Feb 24, 2024

I've created the relevant code and linked it in the above pull request. This code, as designed, follows the implementation I described above and includes relevant unit and integration tests. It makes no other modifications.

Before converting this PR to a draft, I'll need to figure out how to run the entire workflow and compare my results against the current suite of results to see whether there are any differences. Because I'm not yet fully familiar with the workflow, it's possible other components will need to be changed to ensure my changes are accounted for in URL identification.

@maxachis (Collaborator) commented Feb 24, 2024

If I want to see whether I can make changes that improve the error rate, I first need a representative sample of the new URLs that are regularly run through identification (and knowledge of where that sample comes from), so I can get a sense of how often it succeeds and fails and then see how to tweak it.

@mbodeantor @josh-chamberlain , is such a representative sample readily available, or would I need to find a way to create it?
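For reference, the kind of evaluation being described might look roughly like this, assuming a hand-labeled sample file; the `sample.csv` layout, its column names, and the `identify_agency` function are all hypothetical stand-ins for whatever the pipeline actually produces:

```python
import csv

def identify_agency(url: str) -> str:
    """Placeholder for the real identification step; returns a predicted agency name."""
    raise NotImplementedError

def match_rate(sample_path: str) -> float:
    """Compare predicted agencies against hand-labeled ground truth and report the hit rate."""
    total = hits = 0
    with open(sample_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # expects columns: url, true_agency
            total += 1
            if identify_agency(row["url"]) == row["true_agency"]:
                hits += 1
    return hits / total if total else 0.0

# print(f"match rate: {match_rate('sample.csv'):.1%}")
```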

maxachis mentioned this issue Feb 25, 2024
@josh-chamberlain (Contributor)

@maxachis in the past we used common crawl to get a list of URLs. One approach would be to limit the URLs to those containing "police", though it's your choice. Since we'll need to fetch URLs for both annotation and identification, please consider saving a script in a common_crawl directory in this repo! Let me know if that doesn't work.

Another option: using the sitemap scraper on our existing agencies database.
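For the Common Crawl route, a hedged sketch of what such a common_crawl script might start from, querying the public Common Crawl index server and filtering for "police" client-side; the crawl collection name (CC-MAIN-2024-10), the *.gov URL pattern, and the limit are only example placeholders, not anything prescribed here:

```python
import json

import requests

# Example query against the public Common Crawl index server.
# The collection name below is a placeholder; pick a current crawl.
INDEX_URL = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

def fetch_police_urls(url_pattern: str = "*.gov", limit: int = 100) -> list[str]:
    """Pull index records matching url_pattern and keep URLs containing 'police'."""
    resp = requests.get(
        INDEX_URL,
        params={"url": url_pattern, "output": "json"},
        timeout=60,
    )
    resp.raise_for_status()
    urls = []
    for line in resp.text.splitlines():
        record = json.loads(line)
        if "police" in record.get("url", "").lower():
            urls.append(record["url"])
        if len(urls) >= limit:
            break
    return urls

if __name__ == "__main__":
    for url in fetch_police_urls():
        print(url)
```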

@maxachis (Collaborator)

@josh-chamberlain Are there any old scripts from when common crawl was previously used that I could reference as a starting point? I've no problem with starting from scratch, but precedent is always helpful.

@josh-chamberlain (Contributor)

@maxachis no, I think someone just curl'd their API. We don't need automation or anything complicated at this moment.

@maxachis (Collaborator)

@josh-chamberlain Got it. This consequently overlaps with #40. Because #40 is upstream of this problem, I'm going to pause work on this issue until I can finish #40, which shouldn't take too long.

I don't want to tie too much code to a single branch, and I can't validate performance with new data until I have a well-defined source of that new data. So while I am playing fast and loose with how many issues I'm tackling at a time, I do think this issue depends on #40 and hence must be subordinate to it.

@maxachis (Collaborator)

@josh-chamberlain I've created a PR that adds a component supplying additional information, which should in theory help improve match rates, but I would likely need @EvilDrPurple or others to verify whether it actually suits its intended purpose. We may need to try several different forms of training that include or exclude particular information.

I also don't know whether this issue should necessarily be closed because of that PR; rather, the PR can be one additional possible way to improve the match rate.
