Skip to content

Desvelao/lua-webscraper

Repository files navigation

webscraper

Description

This library provides a way to scrap data of webpages.

Features

  • Define configurations for sites with URL matchers and field selectors.
  • Use CSS selectors to extract content from HTML elements.
  • Apply a string-based pipeline to filter and validate the data.
  • Supports extensible filters and validators for custom transformations.

Installation

To install the library, use LuaRocks:

luarocks install webscraper

Usage

Basic example

local webscraper = require("webscraper")

-- Create a new WebScraper instance
local scraper = webscraper.WebScraper:new()
local builtin_scrapper = webscraper -- This is the built-in scrapper that has all the built-in filters and valitors pre-loaded

-- Define a site configuration
scraper.sites:register("example", {
    urls_match = { "https://example.com/.*" },
    fields = {
        title = {
            selector = { "h1" },
            transform = "trim | uppercase",
            validate = "is_string",
        },
        subtitle = {
            selector = { "h2", "h2.subtitle" }, -- Support multiple selectors to retrieve data for the same field
            transform = "trim | uppercase",
            validate = "is_string",
        },
        date = {
            selector = { ".date" },
            transform = "trim | parse_date('%d/%m/%Y')",
            validate = "is_string",
        },
    },
})

-- Run the scraper
local result = scraper:run("https://example.com", {}, { logger = print })
print(result)

Filters

Filters are used to transform data. Some built-in filters include:

  • lowercase: Converts a string to lowercase.
  • uppercase: Converts a string to uppercase.
  • trim: Removes leading and trailing whitespace.
  • to_number: Converts a string to a number.
  • parse_date: Parses a date string into a specific format.

Validators

Validators ensure the data meets specific criteria. Some built-in validators include:

  • is_boolean: Checks if the value is a boolean.
  • is_number: Checks if the value is a number.
  • is_string: Checks if the value is a string.

CUstom filters and validators

-- Register a custom filter
webscraper.filters:register("reverse", function(v)
    return v:reverse()
end)

-- Register a custom validator
webscraper.validators:register("is_positive", function(v)
    if tonumber(v) > 0 then
        return nil
    else
        return tostring(v) .. " is not positive"
    end
end)

Development

Run environment

docker compose -f docker-compose.dev.yml up -d
docker compose -f docker-compose.dev.yml exec sh /app/setup/setup.sh
docker compose -f docker-compose.dev.yml exec dev sh

Run tests

Run the test suite using Busted:

/usr/local/bin/busted

Format code

Run the test suite using StyLua:

/usr/local/bin/stylua webscraper

License

This library is licensed under the MIT License. See the LICENSE file for details.

About

Create custom scrappers for websites

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published