How to store Start URLs #514
By start URL, do you mean the domain? If it is not already available as an extracted field, you can derive it from the URL using a regular expression. Have a look at ReplaceTagger. To keep only the fields you want, you may want to use the KeepOnlyTagger. If you first want to rename them to names of your choice, have a look at the RenameTagger. Does that answer your question?
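To illustrate how those taggers combine, here is a minimal pre-parse sketch (untested; the field names `MyDomainField` and `site` are my own examples, and element syntax may vary slightly by Importer version):

```xml
<preParseHandlers>
  <!-- Keep only the fields listed; all other metadata is dropped. -->
  <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
    <fields>document.reference,title,MyDomainField</fields>
  </tagger>
  <!-- Rename a field to a name of your choice. -->
  <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
    <rename fromField="MyDomainField" toField="site" overwrite="true"/>
  </tagger>
</preParseHandlers>
```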
By start URLs I mean the URLs in the startURLs tag. They can be domains or any other start URLs used to extract web pages. For example: since we are going to load hundreds of start URLs from a text file, we need to link the start URLs to the indexed URLs. I tried checking the ReplaceTagger, but I don't get startURL in the fromField.
That feature is not present. It could prove challenging in some cases, for example if a few different start URLs point to the same page. One workaround would be to somehow automate launching multiple crawlers instead (admittedly perhaps less practical) and use the ConstantTagger to define which one is which. Also, do you know if you will always have different domains? If so, the domain approach suggested before can be used to help you identify each. How many levels deep do you crawl? Since you have hundreds of start URLs, if you go only one level deep, you will get the start URL in the document's referrer field. If you crawl deeper and you don't mind getting the start URLs via a post-index SQL query, you can use that field to reconstruct the full crawl path of a document (including the start URL). If none of the above works for you, we can make this a feature request.
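For the multiple-crawlers workaround, a hedged sketch of one crawler follows (untested; the crawler id, start URL, and constant name `startURL` are all illustrative choices of mine, not prescribed names):

```xml
<crawler id="crawler-xyz">
  <startURLs>
    <url>http://www.xyz.com/</url>
  </startURLs>
  <importer>
    <preParseHandlers>
      <!-- Stamp every document from this crawler with its start URL. -->
      <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
        <constant name="startURL">http://www.xyz.com/</constant>
      </tagger>
    </preParseHandlers>
  </importer>
</crawler>
```

One such crawler block per start URL would let each indexed document carry its own start URL, at the cost of maintaining many crawler definitions.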
All the URLs will be of different domains. Two levels deep is sufficient for crawling. It would be a great feature if we could store the startURL in MySQL against each indexed page. Thank you
I am marking this as a feature request to store the start URL or the full crawl path to the document (as opposed to just the direct parent). That said, if none of your URLs share the same domain and you have "stayOnDomain=true", I would look at extracting the domain from the URL using the ReplaceTagger, as suggested earlier.
Can you please provide me the configuration using the ReplaceTagger?
Here is an example (untested):

```xml
<importer>
  ...
  <preParseHandlers>
    ...
    <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
      <replace fromField="document.reference" toField="MyCustomDomainField"
               regex="true" wholeMatch="true">
        <fromValue>https?://(.*?)(/.*|:.*|$)</fromValue>
        <toValue>$1</toValue>
      </replace>
    </tagger>
    ...
  </preParseHandlers>
  ...
</importer>
```
Thank you and I will try and get back to you |
I tried it. It is working if the start URL is http://www.xyz.com/; however, if the start URL is http://www.xyz.com/city/, it is extracted as http://www.xyz.com/. Can this also be taken care of in the above config? Thanks in advance.
I tried the exact config snippet and it worked for me. I am getting this value:
You can try adding a DebugTagger just after it to print all fields (or just "MyCustomDomainField"). That may help you troubleshoot. If you can't resolve it, please share your config.
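A DebugTagger placed right after the ReplaceTagger could look like this (a sketch only; adjust the log level to taste):

```xml
<preParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
    ...
  </tagger>
  <!-- Log the field value for each document to help troubleshooting. -->
  <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
      logFields="MyCustomDomainField" logLevel="INFO" />
</preParseHandlers>
```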
Sorry for not communicating clearly earlier. The data in the table must be like this:
Ha... it is just a matter of adjusting your regular expression to match exactly what you want, then. Regular expressions are a very popular way to match text, and you can find plenty of good documentation online if you are not too familiar with them. You can also find various online regular-expression testers, where you can try your different text-matching use cases before trying them in the Collector. To help you with this one, you can try expanding the first match group by changing the regular expression to:
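As an illustration only (assuming the goal is to keep the domain plus the first path segment, e.g. `www.xyz.com/city` from `http://www.xyz.com/city/` — the exact expected output was not shown, so this regex is my own guess, untested in the Collector), the replace rule could become:

```xml
<replace fromField="document.reference" toField="MyCustomDomainField"
         regex="true" wholeMatch="true">
  <!-- Capture host plus an optional first path segment into group 1. -->
  <fromValue>https?://([^/:]+(?:/[^/?#]+)?).*</fromValue>
  <toValue>$1</toValue>
</replace>
```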
Hi,
I am using the HTTP Collector and MySQL Committer along with a document fetcher to crawl and index web pages. Everything is working fine; however, I have one requirement where I need to store the start URLs along with other fields for each page, as below:
Thank you