
Implement initial basic crawler for extracting structured relations from PO.DAAC #1

Closed
lewismc opened this issue Jan 28, 2017 · 6 comments

lewismc commented Jan 28, 2017

As discussed at our first meeting, we should use the Any23 basic-crawler to implement a lightweight crawler which will run Any23 extractions over webpages.
I will link a WIP implementation which we can build on.
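
For concreteness, here is a minimal sketch of running an Any23 extraction over a single page, adapted from the Any23 developer documentation; the seed URL and class name are placeholders, not the actual ESKG code. The crawler would do something along these lines per fetched page:

    import java.io.ByteArrayOutputStream;

    import org.apache.any23.Any23;
    import org.apache.any23.http.HTTPClient;
    import org.apache.any23.source.DocumentSource;
    import org.apache.any23.source.HTTPDocumentSource;
    import org.apache.any23.writer.NTriplesWriter;
    import org.apache.any23.writer.TripleHandler;

    public class Any23ExtractionSketch {

        public static void main(String[] args) throws Exception {
            Any23 runner = new Any23();
            runner.setHTTPUserAgent("eskg-crawler");

            // Fetch a single page over HTTP; the URL is just a placeholder.
            HTTPClient httpClient = runner.getHTTPClient();
            DocumentSource source =
                new HTTPDocumentSource(httpClient, "https://podaac.jpl.nasa.gov/");

            // Serialize whatever triples the extractors find as N-Triples.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            TripleHandler handler = new NTriplesWriter(out);
            try {
                runner.extract(source, handler);
            } finally {
                handler.close();
            }
            System.out.println(out.toString("UTF-8"));
        }
    }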

lewismc self-assigned this Jan 28, 2017
lewismc added this to the 0.1 milestone Jan 28, 2017
Yongyao commented Jan 29, 2017

@lewismc, it just came to mind that the results Any23 will produce look like "DOI:XXX, shortName:xxx, ...", which, if I understand correctly, is pretty much the same as the metadata. People will ask why we don't just use the PO.DAAC web service to do this.

lewismc commented Jan 29, 2017

The reason is that these initial, first-order extractions are just the beginning for ESKG :)
This is only a start, @Yongyao; there is plenty more work to do.

lewismc commented Jan 29, 2017

Let's walk before we run!

lewismc commented Feb 1, 2017

Another commit has been made and the new crawler command line now looks as follows:

usage: ESKGCrawler
    --maxDepth <mDepth>          Max allowed crawler depth.
    --maxPages <mPages>          Max number of pages before interrupting
                                 crawl.
    --numCrawlers <nCrawlers>    Sets the number of crawlers.
    --pageFilter <filter>        Regex used to filter out page URLs during
                                 crawling.
    --politenessDelay <pDelay>   Politeness delay in milliseconds.
    --seedUrl <seed>             An individual seed URL used to bootstrap
                                 the crawl.
    --storageFolder <storage>    Folder used to store crawler temporary
                                 data.

The only mandatory parameter is --seedUrl
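
For context on how those flags map onto the underlying library, here is a rough sketch of the equivalent crawler4j wiring; the EskgWebCrawler class, storage path, and seed URL are illustrative placeholders rather than the actual ESKG code:

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class CrawlerBootstrapSketch {

        // Placeholder crawler; the real one would hand page content to Any23 in visit().
        public static class EskgWebCrawler extends WebCrawler {

            @Override
            public boolean shouldVisit(Page referringPage, WebURL url) {
                // --pageFilter would be applied here as a regex over url.getURL().
                return url.getURL().startsWith("https://podaac.jpl.nasa.gov/");
            }

            @Override
            public void visit(Page page) {
                System.out.println("Visited: " + page.getWebURL().getURL());
            }
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/eskg-crawl"); // --storageFolder
            config.setMaxDepthOfCrawling(2);                 // --maxDepth
            config.setMaxPagesToFetch(100);                  // --maxPages
            config.setPolitenessDelay(500);                  // --politenessDelay (ms)

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
            CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

            controller.addSeed("https://podaac.jpl.nasa.gov/"); // --seedUrl
            controller.start(EskgWebCrawler.class, 4);          // --numCrawlers
        }
    }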

Yongyao commented Feb 1, 2017

Great. I just realized the basic crawler is based on crawler4j.

lewismc commented Feb 1, 2017

ack
