Test URL Filtering from the command line #1081

Merged
jnioche merged 3 commits into master from 1079 on Jun 29, 2023

@jnioche jnioche commented Jun 27, 2023

Implements #1079

For instance, the following command tests the filtering pipeline present in the jar:

storm local target/xxx.jar com.digitalpebble.stormcrawler.filtering.URLFilters https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed | grep -v main

… a bit of a problem

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
…fy source (although doesn't seem to be working)

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
@jnioche jnioche added this to the 2.9 milestone Jun 27, 2023

jnioche commented Jun 28, 2023

@rzo1 I couldn't get the -s option to work for some reason. Any chance you could have a look at it?


rzo1 commented Jun 28, 2023

I am running this in IntelliJ with the following arguments:

"https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed" -s "https://pubmed.ncbi.nlm.nih.gov/18926286/"

and the -s option is parsed. For my tests with

{
	"com.digitalpebble.stormcrawler.filtering.URLFilters": [
		{
			"class": "com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter",
			"name": "BasicURLFilter",
			"params": {
				"maxPathRepetition": 3,
				"maxLength": 1024
			}
		},
		{
			"class": "com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter",
			"name": "MaxDepthFilter",
			"params": {
				"maxDepth": -1
			}
		},
		{
			"class": "com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer",
			"name": "BasicURLNormalizer",
			"params": {
				"removeAnchorPart": true,
				"unmangleQueryString": true,
				"checkValidURI": true,
				"removeHashes": true,
				"hostIDNtoASCII": true
			}
		},
		{
			"class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
			"name": "HostURLFilter",
			"params": {
				"ignoreOutsideHost": false,
				"ignoreOutsideDomain": true
			}
		},
		{
			"class": "com.digitalpebble.stormcrawler.filtering.basic.SelfURLFilter",
			"name": "SelfURLFilter"
		},
		{
			"class": "com.digitalpebble.stormcrawler.filtering.sitemap.SitemapFilter",
			"name": "SitemapFilter"
		}
	]
}

it outputs:

	[com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter] 1786msec => https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed
	[com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter] 0msec => https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed
	[com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer] 118msec => https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed
	[com.digitalpebble.stormcrawler.filtering.host.HostURLFilter] 1889msec => https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed
	[com.digitalpebble.stormcrawler.filtering.basic.SelfURLFilter] 0msec => https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed
	[com.digitalpebble.stormcrawler.filtering.sitemap.SitemapFilter] 0msec => https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed
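
For readers unfamiliar with this kind of output, here is a minimal sketch of how a chain of filters could be applied in order while printing one timed line per filter, in the same shape as the lines above. It is not the StormCrawler implementation: the class `FilterChainSketch`, the method `applyAll`, and the two example filters are hypothetical, and the only assumption taken from the output is that each filter receives the URL produced by the previous one.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a URL filtering pipeline; NOT the StormCrawler API.
public class FilterChainSketch {

    // Applies each named filter in insertion order and prints its timing and result.
    // A filter returns the (possibly rewritten) URL, or null to reject it.
    static String applyAll(Map<String, UnaryOperator<String>> filters, String url) {
        String current = url;
        for (Map.Entry<String, UnaryOperator<String>> entry : filters.entrySet()) {
            long start = System.currentTimeMillis();
            current = entry.getValue().apply(current);
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("[" + entry.getKey() + "] " + elapsed + "msec => " + current);
            if (current == null) {
                break; // the URL was rejected, so later filters are skipped
            }
        }
        return current;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the filters in the order they were declared.
        Map<String, UnaryOperator<String>> filters = new LinkedHashMap<>();
        // Illustrative filters only: a length limit and a fragment remover.
        filters.put("MaxLength", u -> u.length() > 1024 ? null : u);
        filters.put("RemoveFragment", u -> {
            int i = u.indexOf('#');
            return i < 0 ? u : u.substring(0, i);
        });
        applyAll(filters, "https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed#top");
    }
}
```

Printing the intermediate URL after every step makes it easy to see which filter rewrote or rejected a URL, which is the point of testing the pipeline from the command line.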


rzo1 commented Jun 28, 2023

Next test will be via the Storm CLI :-) (most likely in the next few days)


rzo1 commented Jun 28, 2023

Looks like Storm cli messes around with CLI options (as it allows them as well). Maybe we'd better switch to a custom format like key=val

Signed-off-by: Julien Nioche <julien@digitalpebble.com>

jnioche commented Jun 29, 2023

> Looks like Storm cli messes around with CLI options (as it allows them as well). Maybe we'd better switch to a custom format like key=val

I went for the simpler option of just passing the source as an optional second argument. Works for now.
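
As an illustration of that approach, the sketch below reads the URL as the first positional argument and the source URL as an optional second one, so that no dash-prefixed option is left for the Storm CLI to intercept. This is only an assumption about how such parsing could look, not the code merged in this PR; the class `ArgsSketch` and the method `parse` are hypothetical names.

```java
// Hypothetical sketch of positional argument handling; NOT the merged code.
public class ArgsSketch {

    // Returns a two-element array: {url, sourceUrlOrNull}.
    static String[] parse(String[] args) {
        if (args.length < 1 || args.length > 2) {
            throw new IllegalArgumentException("Usage: <url> [sourceurl]");
        }
        // The source URL is optional; null means it was not provided.
        String source = args.length == 2 ? args[1] : null;
        return new String[] { args[0], source };
    }

    public static void main(String[] args) {
        String[] parsed = parse(args);
        System.out.println("url=" + parsed[0] + " source=" + parsed[1]);
    }
}
```

Positional arguments avoid any clash with the launcher's own option parsing, at the cost of being less self-describing than a `-s` flag or a key=val pair.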


jnioche commented Jun 29, 2023

Thanks for the review, @rzo1

@jnioche jnioche merged commit d818874 into master Jun 29, 2023
6 of 7 checks passed
@jnioche jnioche deleted the 1079 branch June 29, 2023 07:29