goScraper is a small web-scraping library for Go.
The package can be installed manually using
go get github.com/keinberger/goScraper
Alternatively, it can simply be imported when using Go modules
import "github.com/keinberger/goScraper"
The package exports several functions, each of which can be used on its own.
However, the main scrape functions
func (w Website) Scrape(funcs map[string]interface{}, vars ...interface{}) (string, error)
func (el lookUpElement) ScrapeTreeForElement(node *html.Node) (string, error)
func (e *Element) GetElementNodes(doc *html.Node) ([]*html.Node, error)
should be the preferred way to use the scraper library.
Because these functions build on the other exported functions, they bundle all the features of the library
while requiring only a minimal amount of input. For the main Scrape()
function, the user only has to provide a custom Website variable.
This example shows how to scrape a website for specific HTML elements. The contents of the elements are returned as one string, joined by a custom separator.
The example defines a custom Website variable and calls Scrape()
on it. The arguments of the Scrape()
function are optional and are not needed in this example.
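package main

import (
	"fmt"

	// The import path ends in goScraper, but the package is referred to as scraper.
	scraper "github.com/keinberger/goScraper"
)

func main() {
	website := scraper.Website{
		URL: "https://wikipedia.org/wiki/wikipedia",
		Elements: []scraper.Element{
			{
				// The <h1> element with id="firstHeading".
				HtmlElement: scraper.HtmlElement{
					Typ: "h1",
					Tags: []scraper.Tag{
						{
							Typ:   "id",
							Value: "firstHeading",
						},
					},
				},
			},
			{
				// The <td> element with class="infobox-data" at index 0.
				HtmlElement: scraper.HtmlElement{
					Typ: "td",
					Tags: []scraper.Tag{
						{
							Typ:   "class",
							Value: "infobox-data",
						},
					},
				},
				Index: 0,
			},
		},
		// Placed between the scraped elements in the returned string.
		Separator: ", ",
	}

	// The optional arguments are not needed here, so nil is passed.
	scraped, err := website.Scrape(nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(scraped)
}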
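The chaining described for Scrape() can be pictured with a small stand-alone sketch. It does not call the library: joinScraped is a hypothetical helper, and it assumes that the texts of the matched elements are simply joined with the Separator string.

```go
package main

import (
	"fmt"
	"strings"
)

// joinScraped is a hypothetical helper (not part of goScraper). It mimics
// the assumed behaviour of Scrape(): the text of every matched element is
// chained together with the configured separator.
func joinScraped(texts []string, separator string) string {
	return strings.Join(texts, separator)
}

func main() {
	// Placeholder values standing in for the text of two scraped elements.
	texts := []string{"first element", "second element"}
	fmt.Println(joinScraped(texts, ", ")) // prints "first element, second element"
}
```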
This example uses ScrapeTreeForElement, which returns the content of a single HTML element (*html.Node) inside a larger node tree. The function is especially useful if only one HTML element of a website is needed, but control over the formatting should be retained.
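package main

import (
	"fmt"

	// The import path ends in goScraper, but the package is referred to as scraper.
	scraper "github.com/keinberger/goScraper"
)

func main() {
	// GetHTML fetches the raw HTML of the URL ...
	htmlData, err := scraper.GetHTML("https://wikipedia.org/wiki/wikipedia")
	if err != nil {
		panic(err)
	}

	// ... and GetHTMLNode parses that HTML string into a node tree.
	htmlNode, err := scraper.GetHTMLNode(htmlData)
	if err != nil {
		panic(err)
	}

	// The <li> element with id="ca-viewsource".
	element := scraper.Element{
		HtmlElement: scraper.HtmlElement{
			Typ: "li",
			Tags: []scraper.Tag{
				{
					Typ:   "id",
					Value: "ca-viewsource",
				},
			},
		},
	}

	content, err := element.ScrapeTreeForElement(htmlNode)
	if err != nil {
		panic(err)
	}
	fmt.Println(content)
}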
GetElementNodes returns all HTML elements ([]*html.Node) found in the node tree htmlNode (*html.Node) that have the same properties as e (*Element)
func (e *Element) GetElementNodes(htmlNode *html.Node) ([]*html.Node, error)
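To illustrate the idea of collecting all matching elements from a node tree, here is a minimal sketch. It is not the library's implementation: the node type and the findAll function are hypothetical stand-ins for *html.Node and GetElementNodes, chosen so that the example runs with the standard library only.

```go
package main

import "fmt"

// node is a simplified, hypothetical stand-in for *html.Node.
type node struct {
	tag      string
	attrs    map[string]string
	children []*node
}

// findAll walks the tree depth-first and collects every node whose tag
// matches and whose attribute attrKey has the value attrVal.
func findAll(root *node, tag, attrKey, attrVal string) []*node {
	var matches []*node
	var walk func(n *node)
	walk = func(n *node) {
		if n == nil {
			return
		}
		if n.tag == tag && n.attrs[attrKey] == attrVal {
			matches = append(matches, n)
		}
		for _, c := range n.children {
			walk(c)
		}
	}
	walk(root)
	return matches
}

func main() {
	// Roughly: <table><td class="infobox-data"/><td class="other"/><td class="infobox-data"/></table>
	tree := &node{tag: "table", children: []*node{
		{tag: "td", attrs: map[string]string{"class": "infobox-data"}},
		{tag: "td", attrs: map[string]string{"class": "other"}},
		{tag: "td", attrs: map[string]string{"class": "infobox-data"}},
	}}
	found := findAll(tree, "td", "class", "infobox-data")
	fmt.Println(len(found)) // prints 2
}
```

Returning every match and leaving the selection to the caller keeps the traversal independent of how a single element is later picked from the result.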
GetTextOfNode returns the text content of the HTML element node (*html.Node)
func GetTextOfNode(node *html.Node, notRecursive bool) (text string)
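The notRecursive parameter is not explained further here. Judging by its name, it presumably controls whether text inside nested elements is included. The following sketch shows that assumed behaviour on a hypothetical, simplified node type; textOf is not part of the library.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical, simplified stand-in for *html.Node: it is either
// a text node (text set, no children) or an element node (children set).
type node struct {
	text     string
	children []*node
}

// textOf gathers the text of the direct text children of n. Unless
// notRecursive is true, it also descends into nested elements.
func textOf(n *node, notRecursive bool) string {
	var sb strings.Builder
	for _, c := range n.children {
		switch {
		case len(c.children) == 0:
			// Text node: always included.
			sb.WriteString(c.text)
		case !notRecursive:
			// Nested element: only included in recursive mode.
			sb.WriteString(textOf(c, false))
		}
	}
	return sb.String()
}

func main() {
	// Roughly: <p>Hello <b>big</b> world</p>
	p := &node{children: []*node{
		{text: "Hello "},
		{children: []*node{{text: "big"}}},
		{text: " world"},
	}}
	fmt.Println(textOf(p, false)) // prints "Hello big world"
	fmt.Println(textOf(p, true))  // prints "Hello  world" (nested <b> skipped)
}
```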
RenderNode returns the string representation of the node (*html.Node)
func RenderNode(node *html.Node) string
GetHTMLNode parses the HTML string data and returns its node tree (*html.Node)
func GetHTMLNode(data string) (*html.Node, error)
GetHTML returns the HTML data found at URL
func GetHTML(URL string) (string, error)
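For readers curious what a function of this shape typically does, here is a minimal sketch using only the standard library. It is an assumption about the general approach, not the library's actual code: fetchHTML and serveStatic are hypothetical names, and the local test server exists only so that the example runs without internet access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchHTML is a hypothetical equivalent of GetHTML: it requests url and
// returns the response body as a string.
func fetchHTML(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// serveStatic starts a local test server that always answers with body.
// It returns the server URL and a function that shuts the server down.
func serveStatic(body string) (string, func()) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, body)
	}))
	return srv.URL, srv.Close
}

func main() {
	url, stop := serveStatic("<html><body><h1>Hi</h1></body></html>")
	defer stop()

	html, err := fetchHTML(url)
	if err != nil {
		panic(err)
	}
	fmt.Println(html) // prints the served markup
}
```

The returned string could then be handed to a parser such as GetHTMLNode to obtain a node tree.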
I created this project as a side project alongside my regular work. Contributions are very welcome: just open an issue or create a pull request if you want to contribute.