Webpage_Text_Extraction

Extract the main text content from an HTML page.

Requirement

Python 3

Installation

pip install pextract

Example

import requests
from bs4 import BeautifulSoup
import pextract as pe
from urllib.parse import urljoin

url = 'https://allaboutstevejobs.com/bio/short_bio'
r = requests.get(url)
r.raise_for_status()  # stop early on an HTTP error response
soup = BeautifulSoup(r.content, 'lxml')

# Make image URLs absolute so they still resolve in the saved HTML.
# src=True skips <img> tags that have no src attribute (avoids a KeyError).
for img in soup.find_all('img', src=True):
    img['src'] = urljoin(url, img['src'])

# HTML output with images kept
html, pval = pe.extract(soup, text_only=False, remove_img=False)
# Plain-text output (default arguments)
text, pval = pe.extract(soup)
print(pval)  # This is a strong feature for web page classification

with open('out.html', 'w', encoding='utf-8') as f:
    f.write(html)
with open('out.txt', 'w', encoding='utf-8') as f:
    f.write(text)
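
The example rewrites every image src with urljoin before extraction, so that images in out.html still load when the file is opened outside the original site. The sketch below uses only the standard library to show how that resolution behaves for common kinds of src values. The helper name resolve_img_src and the image paths are hypothetical and are not part of pextract.

```python
from urllib.parse import urljoin

PAGE_URL = 'https://allaboutstevejobs.com/bio/short_bio'


def resolve_img_src(page_url, src):
    """Return an absolute URL for an <img> src found on page_url.

    Hypothetical helper for illustration; it only wraps urljoin.
    """
    return urljoin(page_url, src)


# Illustrative src values (made up, not taken from the real page).
examples = [
    'images/steve.jpg',               # relative to the page's directory
    '/static/logo.png',               # relative to the site root
    '../pics/a.png',                  # parent-directory reference
    '//cdn.example.com/a.png',        # scheme-relative, inherits https
    'https://cdn.example.com/b.png',  # already absolute, left unchanged
]

for src in examples:
    print(src, '->', resolve_img_src(PAGE_URL, src))
```

Because urljoin leaves absolute URLs untouched, it is safe to apply it to every src value without first checking whether it is relative.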

1049451037/Webpage_Text_Extraction
