Test URL Filtering from the command line #1081
… a bit of a problem Signed-off-by: Julien Nioche <julien@digitalpebble.com>
…fy source (although doesn't seem to be working) Signed-off-by: Julien Nioche <julien@digitalpebble.com>
core/src/main/java/com/digitalpebble/stormcrawler/filtering/URLFilters.java
@rzo1 I couldn't get the -s option to work for some reason. Any chance you could have a look at it?
I am running this in IntelliJ with the arguments "https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed" -s "https://pubmed.ncbi.nlm.nih.gov/18926286/" and the following configuration:
{
"com.digitalpebble.stormcrawler.filtering.URLFilters": [
{
"class": "com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter",
"name": "BasicURLFilter",
"params": {
"maxPathRepetition": 3,
"maxLength": 1024
}
},
{
"class": "com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter",
"name": "MaxDepthFilter",
"params": {
"maxDepth": -1
}
},
{
"class": "com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer",
"name": "BasicURLNormalizer",
"params": {
"removeAnchorPart": true,
"unmangleQueryString": true,
"checkValidURI": true,
"removeHashes": true,
"hostIDNtoASCII": true
}
},
{
"class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
"name": "HostURLFilter",
"params": {
"ignoreOutsideHost": false,
"ignoreOutsideDomain": true
}
},
{
"class": "com.digitalpebble.stormcrawler.filtering.basic.SelfURLFilter",
"name": "SelfURLFilter"
},
{
"class": "com.digitalpebble.stormcrawler.filtering.sitemap.SitemapFilter",
"name": "SitemapFilter"
}
]
}
It outputs:
[com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter] 1786msec => https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed
[com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter] 0msec => https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed
[com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer] 118msec => https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed
[com.digitalpebble.stormcrawler.filtering.host.HostURLFilter] 1889msec => https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed
[com.digitalpebble.stormcrawler.filtering.basic.SelfURLFilter] 0msec => https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed
[com.digitalpebble.stormcrawler.filtering.sitemap.SitemapFilter] 0msec => https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed
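For readers following the output above, here is a rough sketch of how a chain of filters could be applied in order while printing the time each one takes. It is not the StormCrawler implementation: the class `FilterChainSketch`, the helper `demoFilters`, and the two toy filters are hypothetical names made up for illustration, and each filter is reduced to a plain function from URL string to URL string, with `null` assumed to mean "rejected".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical, simplified stand-in for a URL filter chain; not the StormCrawler API. */
public class FilterChainSketch {

    /** Applies each filter in insertion order; a null result means the URL was rejected. */
    public static String apply(Map<String, UnaryOperator<String>> filters, String url) {
        String current = url;
        for (Map.Entry<String, UnaryOperator<String>> entry : filters.entrySet()) {
            long start = System.currentTimeMillis();
            current = entry.getValue().apply(current);
            long elapsed = System.currentTimeMillis() - start;
            // Same shape as the output lines quoted in this thread.
            System.out.println("[" + entry.getKey() + "] " + elapsed + "msec => " + current);
            if (current == null) {
                break; // rejected: the remaining filters are skipped
            }
        }
        return current;
    }

    /** Two toy filters, assumed for illustration only. */
    public static Map<String, UnaryOperator<String>> demoFilters() {
        Map<String, UnaryOperator<String>> filters = new LinkedHashMap<>();
        // Toy length check, loosely modelled on the maxLength parameter in the config.
        filters.put("MaxLengthSketch", u -> u.length() > 1024 ? null : u);
        // Toy normaliser that drops the fragment part of the URL.
        filters.put("RemoveAnchorSketch", u -> {
            int hash = u.indexOf('#');
            return hash >= 0 ? u.substring(0, hash) : u;
        });
        return filters;
    }

    public static void main(String[] args) {
        apply(demoFilters(), "https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed#top");
    }
}
```

An insertion-ordered map keeps the filters in the order they were declared, which matches how the output lists one line per filter in configuration order.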
The next test will be via the Storm CLI :-) (most likely in the next few days)
It looks like the Storm CLI interferes with command-line options, since it accepts options of its own. Maybe we should switch to a custom format like key=val.
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
I went for the simpler option of just passing the source as an optional 2nd argument. Works for now.
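To illustrate the positional approach, a minimal sketch of reading a URL plus an optional source as the 2nd argument might look like the following. This is not the code from this PR: the class `PositionalArgsSketch` and its `parse` method are hypothetical, and the usage message is an assumption.

```java
/** Hypothetical sketch of positional argument handling; not the code from this PR. */
public class PositionalArgsSketch {

    /** Returns {url, source}; source is null when the optional 2nd argument is absent. */
    public static String[] parse(String[] args) {
        if (args.length < 1 || args.length > 2) {
            throw new IllegalArgumentException("Usage: <url> [sourceUrl]");
        }
        String source = args.length == 2 ? args[1] : null;
        return new String[] {args[0], source};
    }

    public static void main(String[] args) {
        String[] parsed = parse(new String[] {
            "https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed",
            "https://pubmed.ncbi.nlm.nih.gov/18926286/"
        });
        System.out.println("url=" + parsed[0] + " source=" + parsed[1]);
    }
}
```

Because no argument starts with a dash, nothing here can be mistaken for an option by a wrapper that parses options of its own.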
Thanks for the review @rzo1
Implements #1079
For instance, run
storm local target/xxx.jar com.digitalpebble.stormcrawler.filtering.URLFilters https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed | grep -v main
to test the filtering pipeline present in the jar.