RimmaSkorn/struct-scraper

StructScraper

Real-time structured data scraper

StructScraper is a tool for extracting structured data from web resources in real time and placing it on web pages. It consists of a REST API built on .NET Framework and a set of jQuery plugins. The REST API extracts data from web resources. The plugins send requests to the REST API and insert the received data into the page.

Currently, StructScraper supports:

  • Microdata, JSON-LD and meta-tag-based markup in HTML documents
  • Embedded and custom properties in Word and PDF documents

The microdata parser grew out of the preliminary implementation of the Chapleau.MicrodataParser library, extended with bug fixes and changes that make the code easier to maintain.

The project uses

Usage

To incorporate semantic data from external resources into your HTML page while the page is loading, add special markup and include a piece of JavaScript code, as in the following example.

<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.3.1/jquery.min.js"> </script>
<script src="https://struct-scraper.keldysh.ru/Scripts/fill-struct.js"> </script>
<script type="text/javascript">
    $(document).ready(function () {
        $(document).fillRefStruct({
            apiMultiUri: 'https://struct-scraper.keldysh.ru/api/struct/multi-uri',
            import_schema: "LocalBusiness, Store, Organization"
        });
    });
</script>

<li class="import-struct">
    <div>
        <a class="struct-url" href="https://www.thesparrowsgr.com/" itemprop="name">The Sparrows Coffee & Tea & Newsstand</a>
        <div itemprop="telephone"></div>
        <div itemprop="email"></div>
        <div itemprop="address">
            <span itemprop="postalCode"></span>
            <span itemprop="addressLocality"></span>
            <span itemprop="streetAddress"></span>
        </div>
    </div>
</li>

Use the .import-struct class for the container that receives external data and the .struct-url class for the hyperlink to the external resource. Add an [itemprop] attribute for each Schema.org property to be included.

The jQuery plugin fillRefStruct, defined in the fill-struct.js file, fills in the external data. The apiMultiUri parameter contains the address of the REST API request, and the import_schema parameter lists the Schema.org types from which data should be extracted.
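
To make the role of import_schema and [itemprop] more concrete, here is a minimal sketch of how a type list and property names could be matched against scraped data. It is not the code of fill-struct.js: the function names (parseImportSchema, selectItem, findProperty), the shape of the scraped items, the priority order of types, and the sample values are all assumptions made for illustration, since the REST API response format is not described here.

```javascript
// Illustrative sketch only; not taken from fill-struct.js.

// Split an import_schema string such as "LocalBusiness, Store, Organization"
// into a clean list of Schema.org type names.
function parseImportSchema(importSchema) {
  return importSchema
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
}

// Pick the first scraped item whose type appears in import_schema.
// Assumption: earlier types in the list take priority, and each item
// has the hypothetical shape { type: string, properties: object }.
function selectItem(items, importSchema) {
  for (const typeName of parseImportSchema(importSchema)) {
    const match = items.find((item) => item.type === typeName);
    if (match) return match;
  }
  return null;
}

// Find the value for an itemprop name, searching nested objects
// (for example, postalCode inside address). Returns undefined if absent.
function findProperty(properties, name) {
  for (const [key, value] of Object.entries(properties)) {
    if (key === name && (value === null || typeof value !== "object")) {
      return value;
    }
    if (value !== null && typeof value === "object") {
      const nested = findProperty(value, name);
      if (nested !== undefined) return nested;
    }
  }
  return undefined;
}

// Made-up sample data standing in for what a scraper might return.
const sampleItems = [
  { type: "Organization", properties: { name: "Example Org" } },
  {
    type: "LocalBusiness",
    properties: {
      name: "Example Cafe",
      telephone: "555-0100",
      address: { postalCode: "00000", addressLocality: "Exampletown" },
    },
  },
];

const chosen = selectItem(sampleItems, "LocalBusiness, Store, Organization");
console.log(chosen.type);
console.log(findProperty(chosen.properties, "postalCode"));
```

In a page, the names passed to findProperty would come from the [itemprop] attributes inside each .import-struct container, and a property with no match (such as email in the sample data) would simply be left empty.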