This library provides a way to scrape data from web pages.
- Define configurations for sites with URL matchers and field selectors.
- Use CSS selectors to extract content from HTML elements.
- Apply a string-based pipeline to filter and validate the data.
- Extend the built-in filters and validators with custom transformations.
To install the library, use LuaRocks:
luarocks install webscraper
local webscraper = require("webscraper")
-- Create a new WebScraper instance
local scraper = webscraper.WebScraper:new()
local builtin_scrapper = webscraper -- The built-in scraper, with all built-in filters and validators pre-loaded
-- Define a site configuration
scraper.sites:register("example", {
  urls_match = { "https://example.com/.*" },
  fields = {
    title = {
      selector = { "h1" },
      transform = "trim | uppercase",
      validate = "is_string",
    },
    subtitle = {
      selector = { "h2", "h2.subtitle" }, -- Multiple selectors can be listed for the same field
      transform = "trim | uppercase",
      validate = "is_string",
    },
    date = {
      selector = { ".date" },
      transform = "trim | parse_date('%d/%m/%Y')",
      validate = "is_string",
    },
  },
})
-- Run the scraper
local result = scraper:run("https://example.com", {}, { logger = print })
print(result)
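The `urls_match` entries decide which site configuration applies to a URL. How the library matches them is not stated here, so the following is only a sketch in plain Lua under one assumption: each entry is a Lua pattern that must match the whole URL. The `sites` table and the `find_site` function are hypothetical names and are not part of the library.

```lua
-- Hypothetical registry shaped like the configuration above.
local sites = {
  example = { urls_match = { "https://example.com/.*" } },
}

-- Return the name of the first site with a pattern that matches the whole
-- URL, or nil when no site matches.
-- Assumption: entries are Lua patterns, anchored at both ends.
local function find_site(url)
  for name, config in pairs(sites) do
    for _, pattern in ipairs(config.urls_match) do
      if url:find("^" .. pattern .. "$") then
        return name
      end
    end
  end
  return nil
end

print(find_site("https://example.com/articles/1")) -- example
print(find_site("https://other.org/"))             -- nil
```

Under this assumption a URL must contain the trailing `/` after the host to match `https://example.com/.*`; the library itself may be more lenient.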
Filters are used to transform data. Some built-in filters include:
lowercase
: Converts a string to lowercase.
uppercase
: Converts a string to uppercase.
trim
: Removes leading and trailing whitespace.
to_number
: Converts a string to a number.
parse_date
: Parses a date string into a specific format.
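To make the string-based pipeline concrete, here is a minimal sketch in plain Lua of how an expression such as `trim | uppercase` could be split on `|` and applied left to right. It does not use the library: the `filters` table and the `apply_pipeline` function are hypothetical stand-ins, and filters with arguments, such as `parse_date('%d/%m/%Y')`, are deliberately left out.

```lua
-- Hypothetical stand-ins for a few built-in filters (not the library's code).
local filters = {
  trim = function(v) return (v:match("^%s*(.-)%s*$")) end,
  lowercase = function(v) return v:lower() end,
  uppercase = function(v) return v:upper() end,
  to_number = function(v) return tonumber(v) end,
}

-- Split the expression on "|" and feed the value through each named filter.
local function apply_pipeline(expr, value)
  for raw in expr:gmatch("[^|]+") do
    local name = raw:match("^%s*(.-)%s*$") -- strip spaces around the name
    local filter = filters[name]
    if not filter then
      error("unknown filter: " .. name)
    end
    value = filter(value)
  end
  return value
end

print(apply_pipeline("trim | uppercase", "  hello world  ")) -- HELLO WORLD
```

Each filter receives the output of the previous one, so order matters: `trim | to_number` trims before converting.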
Validators ensure the data meets specific criteria. Some built-in validators include:
is_boolean
: Checks if the value is a boolean.
is_number
: Checks if the value is a number.
is_string
: Checks if the value is a string.
-- Register a custom filter
webscraper.filters:register("reverse", function(v)
  return v:reverse()
end)
-- Register a custom validator: return nil when valid, an error message otherwise
webscraper.validators:register("is_positive", function(v)
  local n = tonumber(v) -- nil when v is not numeric
  if n and n > 0 then
    return nil
  else
    return tostring(v) .. " is not positive"
  end
end)
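Judging from its body, the custom validator returns `nil` when the value is acceptable and an error message otherwise. The sketch below exercises the same two functions as plain Lua locals, without the library, so their behaviour can be checked in isolation. The local names are only for illustration, and this version of `is_positive` also guards against non-numeric input.

```lua
-- Same logic as the custom filter, written as a plain function.
local function reverse(v)
  return v:reverse()
end

-- Same logic as the custom validator: nil means valid, a string is the error.
local function is_positive(v)
  local n = tonumber(v) -- nil when v cannot be converted to a number
  if n and n > 0 then
    return nil
  end
  return tostring(v) .. " is not positive"
end

print(reverse("abc"))     -- cba
print(is_positive("5"))   -- nil
print(is_positive(-3))    -- -3 is not positive
print(is_positive("abc")) -- abc is not positive
```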
docker compose -f docker-compose.dev.yml up -d
docker compose -f docker-compose.dev.yml exec dev sh /app/setup/setup.sh
docker compose -f docker-compose.dev.yml exec dev sh
Run the test suite using Busted:
/usr/local/bin/busted
Format the code using StyLua:
/usr/local/bin/stylua webscraper
This library is licensed under the MIT License. See the LICENSE file for details.