What?

The goal of this Python module is to extract the main content of a web page, as Readability does.

Ideally it works on news websites and blogs, and it should work for the vast majority of languages.

Reference

The main algorithm comes from the paper "CoreEx: Content Extraction from Online News Articles" by Jyotika Prasad and Andreas Paepcke.

Note that this is an experiment, so I may try different heuristics to improve the results and therefore may not follow the algorithms described in the paper exactly.
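To give a feel for the kind of heuristic involved, here is a minimal sketch of a text-versus-link score. It is not the code in coreex.py, and it simplifies the paper's method: every name below (Node, TreeBuilder, best_node) is hypothetical, and the two weights are illustrative assumptions. Each node is scored by how much of its text is free of links, plus a small bonus for holding a large share of the page's words, and the top-scoring node is taken as the main content.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a very small DOM-like tree (hypothetical helper)."""

    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.own_words = 0   # words found directly inside this element
        self.text_cnt = 0    # words in the whole subtree
        self.link_cnt = 0    # <a> elements in the whole subtree


class TreeBuilder(HTMLParser):
    """Builds the Node tree; tolerant of stray closing tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag in ("script", "style"):
            return  # code and CSS are not article text
        self.current.own_words += len(data.split())


def count(node):
    """Fill in text_cnt and link_cnt for node and everything below it."""
    text = node.own_words
    links = 1 if node.tag == "a" else 0
    for child in node.children:
        child_text, child_links = count(child)
        text += child_text
        links += child_links
    node.text_cnt, node.link_cnt = text, links
    return text, links


def best_node(html, weight_ratio=0.99, weight_text=0.01):
    """Return the highest-scoring node, or None if the page has no text.

    The weights are assumptions chosen for illustration only.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    page_text, _ = count(builder.root)
    if page_text == 0:
        return None
    best, best_score = None, -1.0
    stack = [builder.root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if node.text_cnt == 0:
            continue
        ratio = max(node.text_cnt - node.link_cnt, 0) / node.text_cnt
        score = weight_ratio * ratio + weight_text * node.text_cnt / page_text
        if score > best_score:
            best, best_score = node, score
    return best
```

With these weights, a link-heavy navigation list scores close to zero, while a container holding several link-free paragraphs beats both the individual paragraphs (it holds more of the page's text) and the page body (whose ratio is dragged down by the navigation links).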

Try it

Drop some articles from your favorite news websites into a directory called tests (each file should have the .in.html extension) and run the script. The content of the articles is written to *.out.html files.
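The file naming convention above can be sketched as follows. This is only an illustration, not the script's actual code: the helper names are hypothetical, and it assumes each output file sits next to its input and shares its base name, which the description above suggests but does not state.

```python
from pathlib import Path

IN_SUFFIX = ".in.html"
OUT_SUFFIX = ".out.html"


def output_path(input_path):
    """Map tests/foo.in.html to tests/foo.out.html (assumed convention)."""
    input_path = Path(input_path)
    if not input_path.name.endswith(IN_SUFFIX):
        raise ValueError(f"expected a {IN_SUFFIX} file, got {input_path.name}")
    stem = input_path.name[: -len(IN_SUFFIX)]
    return input_path.with_name(stem + OUT_SUFFIX)


def find_inputs(directory="tests"):
    """List the article files waiting to be processed, in a stable order."""
    return sorted(Path(directory).glob("*" + IN_SUFFIX))
```

Checking the suffix explicitly avoids silently overwriting a file that does not follow the .in.html convention.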

You can also call summary from a Python REPL with a URL.

Alexis-D/Coreex
