Web page content extractor
.settings
bin
lib
meaningfulweb-app
meaningfulweb-core
meaningfulweb-opengraph
meaningfulweb-parent
.classpath
.gitignore
.project
LICENSE
README.md
pom.xml

README.md

What is Meaningful Web?

We aim to extract structured information from a web resource:

url --> meaningfulweb engine --> structured information

Homepage:

http://www.meaningfulweb.org

Artifacts:

  1. meaningfulweb-opengraph.jar <-- Open Graph parser
  2. meaningfulweb-core.jar <-- core engine
  3. meaningfulweb-app.war <-- web application

Build:

Build and release are managed via Maven: http://maven.apache.org/

  1. run the script bin/mvn-install.sh to install the .jar files in jars/ into the local Maven repository
  2. build opengraph: under meaningfulweb-opengraph/, run: mvn install
  3. build core: under meaningfulweb-core/, run: mvn install
  4. start webapp: under meaningfulweb-app/, run: mvn jetty:run

The application should be running at: http://localhost:8080/

The REST service should be running at: http://localhost:8080/get-meaning?url=xxx

Example:

http://localhost:8080/get-meaning?url=http://www.google.com
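When the target URL carries its own query string or other reserved characters, passing it raw as the `url` parameter can be ambiguous. The following is a minimal sketch of building the request URL with the target percent-encoded. The class `GetMeaningUrl` and the method `buildRequestUrl` are hypothetical names introduced here for illustration, and the source does not state whether the service requires the parameter to be encoded; only the `/get-meaning?url=` path comes from the README. It assumes Java 10 or later for `URLEncoder.encode(String, Charset)`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of meaningfulweb itself.
public class GetMeaningUrl {

    // Builds a request URL for the get-meaning REST service,
    // percent-encoding the target so it survives as one query parameter.
    static String buildRequestUrl(String base, String target) {
        String encoded = URLEncoder.encode(target, StandardCharsets.UTF_8);
        return base + "/get-meaning?url=" + encoded;
    }

    public static void main(String[] args) {
        // Assumes the webapp was started locally with mvn jetty:run.
        String request = buildRequestUrl("http://localhost:8080", "http://www.google.com");
        System.out.println(request);
        // prints: http://localhost:8080/get-meaning?url=http%3A%2F%2Fwww.google.com
    }
}
```

The resulting string can be opened in a browser or fetched with any HTTP client while the Jetty instance is running.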

Sample Code:

// extract the best image representing a URL

String url = "http://www.google.com";

MetaContentExtractor extractor = new MetaContentExtractor();
MeaningfulWebObject obj = extractor.extractFromUrl(url);

String bestImageURL = obj.getImage();
String title = obj.getTitle();
String description = obj.getDescription();
String domain = obj.getDomain();

...

Bugs:

File bugs in our JIRA system at: http://snaprojects.jira.com/browse/MWEB

Wikis:

Wiki Home: http://snaprojects.jira.com/wiki/display/MWEB/Home
