How to store Start URLs #514

Closed · HappyCustomers opened this issue Aug 24, 2018 · 12 comments

@HappyCustomers commented Aug 24, 2018

Hi,

I am using the HTTP Collector and the MySQL Committer, along with the document fetcher, to crawl and index web pages. Everything is working fine. However, I have one requirement: I need to store the start URL along with the other fields for each page, as below:

id, content, title, metaDescription, keywords, imagePath, startURL
www.xyz.com/aboutus, aboutus content, xyz company, xyz description, xyz keywords, //imagepath, www.xyz.com
www.xyz.com/careers, careers content, xyz title, xyz description, xyz keywords, //imagepath, www.xyz.com

Thank you

@essiembre (Contributor) commented Aug 26, 2018

By start URL, do you mean the domain? If it is not already available as an extracted field, you can derive it from the URL using a regular expression. Have a look at the ReplaceTagger.

To only have the fields you want, you may want to use the KeepOnlyTagger.

If you first want to rename them to names of your choice, have a look at the RenameTagger.
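
For example, here is a minimal untested sketch combining the two inside the importer's postParseHandlers; the field names are placeholders, and you should check each tagger's documentation for your version's exact syntax:

    <postParseHandlers>
        <!-- Rename a field to a column name of your choice. -->
        <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
            <rename fromField="description" toField="metaDescription" overwrite="true"/>
        </tagger>
        <!-- Keep only the fields you want sent to the committer. -->
        <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title, keywords, metaDescription, document.reference</fields>
        </tagger>
    </postParseHandlers>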

Does that answer your question?

@HappyCustomers (Author) commented Aug 26, 2018

By start URLs, I mean the URLs in the <startURLs> tag. They can be domains or any other start URLs used to extract web pages. Example:

<startURLs>
 <url>www.xyz.com</url>
 <url>www.abc.com/city</url>
</startURLs>

As we are going to load hundreds of start URLs from a text file, we need to link the start URLs to the indexed URLs.

I tried checking the ReplaceTagger, but I don't get the start URL in the fromField.

@essiembre (Contributor)

That feature is not present. It could prove challenging in some cases, for example if a few different start URLs point to the same page.

One workaround would be to somehow automate the launching of multiple crawlers instead (admittedly maybe less practical) and use the ConstantTagger to define which one is which.
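
For instance, each crawler's configuration could stamp its documents with its own start URL. An untested sketch, where the field name and value are placeholders:

    <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
        <!-- A distinct value per crawler, e.g. that crawler's start URL. -->
        <constant name="startURL">http://www.xyz.com/city/</constant>
    </tagger>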

Also, do you know if you will always have different domains? If so, the domain approach suggested before can be used to help you identify each.

How many levels deep do you crawl? Since you have hundreds of start URLs, if you go only 1 level deep, you will get the start URL in a collector.referrer-reference field.

If you crawl deeper and you don't mind getting the start URLs from a post-index SQL query, you can use that field to reconstruct the full crawl path of a document (including the start URL).
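
As a rough illustration of such a post-index query, here is an untested sketch. It assumes MySQL 8.0+ (for recursive CTEs) and a hypothetical documents table with reference and referrer columns populated from document.reference and collector.referrer-reference; adapt it to your actual schema:

    -- Walk up the referrer chain to find each document's start URL.
    WITH RECURSIVE crawl_path AS (
        SELECT reference, referrer, reference AS start_url
        FROM documents
        WHERE referrer IS NULL              -- start URLs have no referrer
        UNION ALL
        SELECT d.reference, d.referrer, p.start_url
        FROM documents d
        JOIN crawl_path p ON d.referrer = p.reference
    )
    SELECT reference, start_url
    FROM crawl_path;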

If none of the above can work for you, we can make this a feature request.

@HappyCustomers (Author)

All the URLs will be of different domains. Crawling 2 levels deep is sufficient. It would be a great feature if we could store the start URL in MySQL against each indexed page. Thank you

@essiembre (Contributor)

I am marking this as a feature request to store the start URL or the full crawl path to the document (as opposed to just the direct parent).

That said, if none of your URLs share the same domain and you have stayOnDomain="true", I would look at extracting the domain from the URL using the ReplaceTagger and storing it in a new field, as suggested before. You will then be able to filter by domain quickly.

@HappyCustomers (Author)

Can you please provide me with a sample configuration using the ReplaceTagger?

@essiembre (Contributor)

Here is an example (untested):

...
<importer>
    ...
    <preParseHandlers>
        ...
        <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
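            <!-- With wholeMatch, the whole reference value is replaced by match
                 group 1 ($1), i.e. the host portion of the URL:
                 http://www.xyz.com/page.html becomes www.xyz.com -->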
            <replace fromField="document.reference" toField="MyCustomDomainField"
                        regex="true" wholeMatch="true">
                <fromValue>https?://(.*?)(/.*|:.*|$)</fromValue>
                <toValue>$1</toValue>
            </replace>
        </tagger>
        ...
    </preParseHandlers>
    ...
</importer>
...

@HappyCustomers (Author)

Thank you, I will try it and get back to you.

@HappyCustomers (Author)

I tried it. It is working if the start URL is http://www.xyz.com/.

However, if the start URL is http://www.xyz.com/city/, it is extracting it as http://www.xyz.com/.

Can this also be taken care of in the above config?

Thanks in advance

@essiembre (Contributor) commented Sep 6, 2018

I tried the exact config snippet and it worked for me. I am getting this value:

MyCustomDomainField = www.xyz.com

You can try adding a DebugTagger just after it to print all fields (or just "MyCustomDomainField"). That may help you troubleshoot.
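
Something like this untested sketch, placed right after the ReplaceTagger, should log the field for each document (check the DebugTagger documentation for the exact attributes):

    <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
            logFields="MyCustomDomainField" logLevel="INFO"/>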

If you can't resolve it, please share your config.

@HappyCustomers (Author)

Sorry for not communicating clearly earlier.
If the start URL is http://www.xyz.com/city/, then the ReplaceTagger extracts only www.xyz.com and not the full start URL, http://www.xyz.com/city/.

The data in the table must be like this:

Page URL                                    Start URL
http://www.xyz.com/city/aboutus.html        http://www.xyz.com/city/
http://www.xyz.com/city/product.html        http://www.xyz.com/city/
http://www.xyz.com/city/contactus.html      http://www.xyz.com/city/

@essiembre (Contributor)

Ha... it is just a matter of adjusting your regular expression to match exactly what you want then.

Regular expressions are a very popular way to match text, and you can find plenty of good documentation online if you are not too familiar with them. You can also find various regular expression testers online.

You can try your different text matching use cases there before trying them in the Collector.

To help you with this one, you can try expanding the first match group by changing the regular expression to:

(https?://.*?)(/.*|:.*|$)
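
Slotted into the earlier example, the replace block would become (again untested):

    <replace fromField="document.reference" toField="MyCustomDomainField"
                regex="true" wholeMatch="true">
        <fromValue>(https?://.*?)(/.*|:.*|$)</fromValue>
        <toValue>$1</toValue>
    </replace>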
