How to store Start URLs #514

Closed · HappyCustomers opened this issue Aug 24, 2018 · 12 comments

@HappyCustomers commented Aug 24, 2018

Hi,

I am using the HTTP Collector and the MySQL Committer, along with the document fetcher, to crawl and index web pages. Everything is working fine. However, I have one requirement: I need to store the start URL along with the other fields for each page, as below:

id, content, title, metaDescription, keywords, imagePath, startURL
www.xyz.com/aboutus, aboutus content, xyz company, xyz description, xyz keywords, //imagepath, www.xyz.com
www.xyz.com/careers, careers content, xyz title, xyz description, xyz keywords, //imagepath, www.xyz.com

Thank you

@essiembre (Contributor) commented Aug 26, 2018

By start URL, do you mean the domain? If it is not already available as an extracted field, you can derive it from the URL using a regular expression. Have a look at the ReplaceTagger.

To only have the fields you want, you may want to use the KeepOnlyTagger.

If you first want to rename them to names of your choice, have a look at the RenameTagger.
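
For example, here is a minimal untested sketch combining the two inside the importer's postParseHandlers; the field names are placeholders, and you should check each tagger's documentation for your version's exact syntax:

    <postParseHandlers>
        <!-- Rename a field to a column name of your choice. -->
        <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
            <rename fromField="description" toField="metaDescription" overwrite="true"/>
        </tagger>
        <!-- Keep only the fields you want sent to the committer. -->
        <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title, keywords, metaDescription, document.reference</fields>
        </tagger>
    </postParseHandlers>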

Does that answer your question?

@HappyCustomers (Author) commented Aug 26, 2018

By start URLs, I mean the URLs in the <startURLs> tag. They can be domains or any other start URLs used to extract web pages. Example:

<startURLs>
 <url>www.xyz.com</url>
 <url>www.abc.com/city</url>
</startURLs>

As we are going to load hundreds of start URLs from a text file, we need to link the start URLs to the indexed URLs.

I tried checking the ReplaceTagger, but I don't get the start URL in the fromField.

@essiembre (Contributor)

That feature is not present. It could prove challenging in some cases, for example if a few different start URLs point to the same page.

One workaround would be to somehow automate the launching of multiple crawlers instead (admittedly maybe less practical) and use the ConstantTagger to define which one is which.
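
For instance, each crawler's configuration could stamp its documents with its own start URL. An untested sketch, where the field name and value are placeholders:

    <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
        <!-- A distinct value per crawler, e.g. that crawler's start URL. -->
        <constant name="startURL">http://www.xyz.com/city/</constant>
    </tagger>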

Also, do you know if you will always have different domains? If so, the domain approach suggested before can be used to help you identify each.

How many levels deep do you crawl? Since you have hundreds of start URLs, if you go only 1 level deep, you will get the start URL in a collector.referrer-reference field.

If you crawl deeper and you don't mind getting the start URLs from a post-index SQL query, you can use that field to reconstruct the full crawl path of a document (including the start URL).
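
As a rough illustration of such a post-index query, here is an untested sketch. It assumes MySQL 8.0+ (for recursive CTEs) and a hypothetical documents table with reference and referrer columns populated from document.reference and collector.referrer-reference; adapt it to your actual schema:

    -- Walk up the referrer chain to find each document's start URL.
    WITH RECURSIVE crawl_path AS (
        SELECT reference, referrer, reference AS start_url
        FROM documents
        WHERE referrer IS NULL              -- start URLs have no referrer
        UNION ALL
        SELECT d.reference, d.referrer, p.start_url
        FROM documents d
        JOIN crawl_path p ON d.referrer = p.reference
    )
    SELECT reference, start_url
    FROM crawl_path;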

If none of the above can work for you, we can make this a feature request.

@HappyCustomers (Author)

All the URLs will be of different domains. Crawling 2 levels deep is sufficient. It would be a great feature if we could store the start URL in MySQL against each indexed page. Thank you

@essiembre (Contributor)

I am marking this as a feature request to store the start URL or the full crawl path to the document (as opposed to just the direct parent).

That said, if none of your URLs share the same domain and you have stayOnDomain="true", I would look at extracting the domain from the URL using the ReplaceTagger and storing it in a new field, as suggested before. You will then be able to filter by domain quickly.

@HappyCustomers (Author)

Can you please provide me with a sample configuration using the ReplaceTagger?

@essiembre (Contributor)

Here is an example (untested):

...
<importer>
    ...
    <preParseHandlers>
        ...
        <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
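            <!-- With wholeMatch, the whole reference value is replaced by match
                 group 1 ($1), i.e. the host portion of the URL:
                 http://www.xyz.com/page.html becomes www.xyz.com -->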
            <replace fromField="document.reference" toField="MyCustomDomainField"
                        regex="true" wholeMatch="true">
                <fromValue>https?://(.*?)(/.*|:.*|$)</fromValue>
                <toValue>$1</toValue>
            </replace>
        </tagger>
        ...
    </preParseHandlers>
    ...
</importer>
...

@HappyCustomers (Author)

Thank you, I will try it and get back to you.

@HappyCustomers (Author)

I tried it. It is working if the start URL is http://www.xyz.com/.

However, if the start URL is http://www.xyz.com/city/, it is extracting it as http://www.xyz.com/.

Can this also be taken care of in the above config?

Thanks in advance

@essiembre (Contributor) commented Sep 6, 2018

I tried the exact config snippet and it worked for me. I am getting this value:

MyCustomDomainField = www.xyz.com

You can try adding a DebugTagger just after it to print all fields (or just "MyCustomDomainField"). That may help you troubleshoot.
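
Something like this untested sketch, placed right after the ReplaceTagger, should log the field for each document (check the DebugTagger documentation for the exact attributes):

    <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
            logFields="MyCustomDomainField" logLevel="INFO"/>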

If you can't resolve it, please share your config.

@HappyCustomers (Author)

Sorry for not communicating clearly earlier.
If the start URL is http://www.xyz.com/city/, then the ReplaceTagger extracts only www.xyz.com and not the full start URL, http://www.xyz.com/city/.

The data in the table must be like this:

Page URL                                    Start URL
http://www.xyz.com/city/aboutus.html        http://www.xyz.com/city/
http://www.xyz.com/city/product.html        http://www.xyz.com/city/
http://www.xyz.com/city/contactus.html      http://www.xyz.com/city/

@essiembre (Contributor)

Ha... it is just a matter of adjusting your regular expression to match exactly what you want then.

Regular expressions are a very popular way to match text, and you can find plenty of good documentation online if you are not too familiar with them. You can also find various regular expression testers online.

You can try your different text matching use cases there before trying them in the Collector.

To help you with this one, you can try expanding the first match group by changing the regular expression to:

(https?://.*?)(/.*|:.*|$)
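
Slotted into the earlier example, the replace block would become (again untested):

    <replace fromField="document.reference" toField="MyCustomDomainField"
                regex="true" wholeMatch="true">
        <fromValue>(https?://.*?)(/.*|:.*|$)</fromValue>
        <toValue>$1</toValue>
    </replace>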
