Skip to content

Alexis-D/Coreex

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commits
 
 
 
 
 
 
 
 

Repository files navigation

What?

The goal of this Python module is to extract the main content of a webpage (like Readability for instance).

Ideally it would work on news websites & blogs, and it should work for a vast majority of languages.

Reference

The main algorithm is from the paper CoreEx: Content Extraction from Online News Articles by Jyotika Prasad & Andreas Paepcke.

You should note that this is an experiment so I may try different heuristics to improve the results (and so do not fully respect the algorithms described in the paper).

Try it

Just drop some articles from your favorite news websites in a directory called tests, (your articles should have the .in.html extension) and run the script, you'll get the content of the articles in *.out.html files.

You can also simply call summary from a Python REPL with an URL if you wish...

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages