Agency Identification V2 #16

Closed · mbodeantor opened this issue Sep 20, 2023 · 10 comments · Fixed by #36
Assignees: maxachis
Labels: enhancement (New feature or request)

Comments

@mbodeantor (Contributor)

Fixes

  • Improving the match rate of new URLs against those in the Agency table

Description

  • Try to find a reliable source of geographic data in the header, footer, or other tags.
  • Use it to narrow down the set of agencies to match against (a rough sketch of this idea follows below).
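A minimal sketch of what that narrowing could look like, assuming a BeautifulSoup-parsed page and an in-memory list of agency records; the `agencies` structure, its fields, and `extract_geographic_hints`/`narrow_agencies` are hypothetical placeholders, not the actual Agency table schema or pipeline code:

```python
import re

import requests
from bs4 import BeautifulSoup

# Hypothetical agency records; the real Agency table has its own schema.
agencies = [
    {"name": "Baltimore County Police Department", "state": "MD", "county": "Baltimore"},
    {"name": "Allegheny County Sheriff", "state": "PA", "county": "Allegheny"},
]

def extract_geographic_hints(html: str) -> set[str]:
    """Collect candidate place names from the title, meta, header, and footer tags."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    if soup.title and soup.title.string:
        texts.append(soup.title.string)
    for meta in soup.find_all("meta", attrs={"name": ["description", "keywords"]}):
        texts.append(meta.get("content", ""))
    for tag in soup.find_all(["header", "footer"]):
        texts.append(tag.get_text(" ", strip=True))
    # Very rough tokenization; a real version would match against a gazetteer.
    tokens = set()
    for text in texts:
        tokens.update(re.findall(r"[A-Z][a-z]+", text or ""))
    return tokens

def narrow_agencies(html: str) -> list[dict]:
    """Keep only agencies whose county or state appears in the page's geographic hints."""
    hints = extract_geographic_hints(html)
    return [a for a in agencies if a["county"] in hints or a["state"] in hints]

if __name__ == "__main__":
    resp = requests.get("https://www.baltimorecountymd.gov", timeout=10)  # example URL
    print(narrow_agencies(resp.text))
```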
@josh-chamberlain (Contributor)

Suggestion: also check the URLs of data sources for that agency. Some agencies have data about them on multiple municipal websites. We might have to be more careful, though: different agencies using the same data portal will produce similar URL patterns, so this might not be worth it.
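A hedged sketch of what that check might look like, assuming a hypothetical mapping of agencies to their known data-source URLs and a hand-maintained list of shared data-portal hosts; none of these names come from the actual codebase:

```python
from urllib.parse import urlparse

# Hypothetical examples; real data would come from the data sources table.
agency_source_urls = {
    "Baltimore County Police Department": [
        "https://www.baltimorecountymd.gov/departments/police/crime-stats",
    ],
}

# Hosts shared by many agencies, which would otherwise cause false matches.
SHARED_PORTAL_HOSTS = {"data.census.gov", "muckrock.com"}

def candidate_agencies(new_url: str) -> list[str]:
    """Return agencies whose known data-source hosts match the new URL's host."""
    host = urlparse(new_url).netloc.lower()
    if host in SHARED_PORTAL_HOSTS:
        return []  # a shared portal host carries no agency-specific signal
    matches = []
    for agency, urls in agency_source_urls.items():
        if any(urlparse(u).netloc.lower() == host for u in urls):
            matches.append(agency)
    return matches

print(candidate_agencies("https://www.baltimorecountymd.gov/departments/police/news"))
```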

josh-chamberlain added the enhancement (New feature or request) label Oct 26, 2023
maxachis self-assigned this Feb 21, 2024
@maxachis (Collaborator) commented Feb 22, 2024

So to make sure I understand, we'd be adding this information to the same JSON file the HTML Tag Collector is appending to, right?

I figure the best bet is to go to the home page of the website and grab information there. My reasoning is:

  1. The home page will most likely contain geographic information fairly well-described in a number of different tags. Take for example the Baltimore County home page, which contains "Baltimore" 43 times in the HTML, most prominently in the title.
  2. It's highly unlikely that a sub-page will refer to a different geographic location than the home page of the website, so data taken from the home page (where it is more reliably present) should work just as well as data taken from a subordinate page.

The implementation would likely be something like the following (a rough sketch follows the list):

  1. Get the root URL of the web page (see here for a probable regex pattern)
  2. Look up that root URL in the lookup table
  3. If that root URL already exists in the table (and its entry has not expired, in the case where we want to re-scrape after a certain amount of time has elapsed), retrieve the associated title info as “home_title_tag”
  4. Otherwise, fetch the home page, populate the lookup table, and then return the “home_title_tag”.
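A minimal sketch of that flow, assuming the tag collector already has the page's full URL; the cache structure, the `get_home_title_tag` name, and the expiry interval are placeholders rather than anything from the linked PR:

```python
import time
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

# Hypothetical in-memory lookup table: root URL -> (home_title_tag, fetched_at).
_home_title_cache: dict[str, tuple[str, float]] = {}
CACHE_TTL_SECONDS = 7 * 24 * 3600  # re-scrape after a week; placeholder value

def get_root_url(url: str) -> str:
    """Reduce a page URL to scheme + host, e.g. https://www.baltimorecountymd.gov."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"

def get_home_title_tag(url: str) -> str:
    """Return the home page <title> for a URL, caching results per root URL."""
    root = get_root_url(url)
    cached = _home_title_cache.get(root)
    if cached and time.time() - cached[1] < CACHE_TTL_SECONDS:
        return cached[0]
    resp = requests.get(root, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    _home_title_cache[root] = (title, time.time())
    return title

# Each scraped record could then carry a "home_title_tag" field alongside the other tags.
```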

@maxachis (Collaborator)

Currently working on this in the linked branch.

@maxachis (Collaborator) commented Feb 24, 2024

I've created the relevant code and linked it in the above pull request. This code, as designed, follows the implementation I described above and includes relevant unit and integration tests. It makes no other modifications.

Before converting this PR to a draft, I'll need to figure out how to run the entire workflow and compare my results against the current suite of results to see whether there are any differences. Because I'm not yet fully familiar with the workflow, it's possible other components will need to be changed to ensure my changes are accounted for in URL identification.

@maxachis (Collaborator) commented Feb 24, 2024

If I want to see whether I can make changes that improve the error rate, I first need a representative sample of the new URLs that are regularly run through identification (and knowledge of where that sample comes from), so I can get a sense of how often it succeeds and fails and then see how to tweak it.

@mbodeantor @josh-chamberlain , is such a representative sample readily available, or would I need to find a way to create it?
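For reference, the kind of evaluation being described might look roughly like this, assuming a hand-labeled sample file; the `sample.csv` layout, its column names, and the `identify_agency` function are all hypothetical stand-ins for whatever the pipeline actually produces:

```python
import csv

def identify_agency(url: str) -> str:
    """Placeholder for the real identification step; returns a predicted agency name."""
    raise NotImplementedError

def match_rate(sample_path: str) -> float:
    """Compare predicted agencies against hand-labeled ground truth and report the hit rate."""
    total = hits = 0
    with open(sample_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # expects columns: url, true_agency
            total += 1
            if identify_agency(row["url"]) == row["true_agency"]:
                hits += 1
    return hits / total if total else 0.0

# print(f"match rate: {match_rate('sample.csv'):.1%}")
```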

maxachis mentioned this issue Feb 25, 2024
@josh-chamberlain (Contributor)

@maxachis in the past we used common crawl to get a list of URLs. One approach would be to limit the URLs to those containing "police", though it's your choice. Since we'll need to fetch URLs for both annotation and identification, please consider saving a script in a common_crawl directory in this repo! Let me know if that doesn't work.

Another option: using the sitemap scraper on our existing agencies database.
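For the Common Crawl route, a hedged sketch of what such a common_crawl script might start from, querying the public Common Crawl index server and filtering for "police" client-side; the crawl collection name (CC-MAIN-2024-10), the *.gov URL pattern, and the limit are only example placeholders, not anything prescribed here:

```python
import json

import requests

# Example query against the public Common Crawl index server.
# The collection name below is a placeholder; pick a current crawl.
INDEX_URL = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

def fetch_police_urls(url_pattern: str = "*.gov", limit: int = 100) -> list[str]:
    """Pull index records matching url_pattern and keep URLs containing 'police'."""
    resp = requests.get(
        INDEX_URL,
        params={"url": url_pattern, "output": "json"},
        timeout=60,
    )
    resp.raise_for_status()
    urls = []
    for line in resp.text.splitlines():
        record = json.loads(line)
        if "police" in record.get("url", "").lower():
            urls.append(record["url"])
        if len(urls) >= limit:
            break
    return urls

if __name__ == "__main__":
    for url in fetch_police_urls():
        print(url)
```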

@maxachis (Collaborator)

@josh-chamberlain Are there any old scripts from when common crawl was previously used that I could reference as a starting point? I've no problem with starting from scratch, but precedent is always helpful.

@josh-chamberlain (Contributor)

@maxachis no, I think someone just curl'd their API. We don't need automation or anything complicated at this moment.

@maxachis (Collaborator)

@josh-chamberlain Got it. This consequently overlaps with #40. Because #40 is upstream of this problem, I'm going to pause work on this issue until I can finish #40, which shouldn't take too long.

I don't want to tie too much code to a single branch, and I can't validate performance with new data until I have a well-defined source of that new data. So while I am playing fast and loose with how many issues I'm tackling at a time, I do think this issue depends on #40 and hence must be subordinate to it.

@maxachis (Collaborator)

@josh-chamberlain I've created a PR that adds a component supplying additional information, which should in theory help improve match rates, but I would likely need @EvilDrPurple or others to verify whether it actually suits its intended purpose. We may need to try several different forms of training that include or exclude particular information.

I also don't know whether this issue should necessarily be closed because of that PR; rather, the PR can be one additional possible way to improve the match rate.
