Skip to content

Golang package to extract useful text from a HTML document

Notifications You must be signed in to change notification settings

Gelembjuk/articletext

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ArticleText

Golang package with a function to extract useful text from a HTML document.

A function analyses a html code and drops everything related to navigation, advertising etc. Extracts only useful contents of a document, text of a central element.

Installation

go get github.com/gelembjuk/articletext

Manual

There are 3 types of exported functions.

  1. Functions to get a text from a HTML document. From 3 different types of sources

GetArticleText(input io.Reader)

GetArticleTextFromFile(filepath string)

GetArticleTextFromUrl(url string)

  1. Functions to return a path (signature) for a text location block. The path is a JQuery style selector - tags with classes.

Also 3 functions for input form different sources

GetArticleSignature(input io.Reader)

GetArticleSignatureFromFile(filepath string)

GetArticleSignatureFromUrl(url string)

Result of these functions is somethign like "body div div div.content div.article div.text" . And then this path can be used to get a text with one of following functions

  1. Functions to get a text from a HTML document using a path (signature) in a JQuery style. A path can be get by using one of functions from blcok 2, or prepared manually

GetArticleTextByPath(input io.Reader, path string)

GetArticleTextFromFileByPath(filepath string, path string)

GetArticleTextFromUrlByPath(url string, path string)

Example

package main

import (
	"fmt"
	"os"
	"github.com/gelembjuk/articletext"
)

func main() {

	url := os.Args[1]
	text, err := articletext.GetArticleTextFromUrl(url)
	
	fmt.Println(text)
}

Author

Roman Gelembjuk (@gelembjuk)

About

Golang package to extract useful text from a HTML document

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages