This repository has been archived by the owner on Feb 17, 2020. It is now read-only.
Code example showing how to use Elastic Search as a vector provider for a Mahout classifier (and for data exploration). It will also contain sample code for moving from the sequential setup to a distributed Hadoop implementation. The example data is the StackOverflow dump published on Kaggle, among other places.

MaineC/sofia

Sofia - Mahout/Elastic Search test bed

Single machine training

Text - Lucene analyzers - Mahout classifiers: The goal of the project is to get familiar with both Elastic Search and the Mahout logistic regression implementation, and to figure out to what extent ES can help in getting an overview of a set of unknown text documents (e.g. through faceting).
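To make the faceting idea concrete, here is a minimal sketch of a search request body that asks ES only for the most frequent terms of one field. It assumes the legacy `facets` API of the ES versions this project targets (later ES releases replaced facets with aggregations). The field name `Tag1` and the facet name `top_terms` are assumptions for illustration, not taken from the repository.

```python
import json


def terms_facet_request(field, size=10):
    """Build a search body that returns no hits, only the top `size` terms of `field`."""
    return {
        "size": 0,                    # skip the hits, we only want the facet counts
        "query": {"match_all": {}},   # facet over every indexed document
        "facets": {
            # "top_terms" is a hypothetical facet name chosen for this sketch
            "top_terms": {"terms": {"field": field, "size": size}}
        },
    }


# "Tag1" is an assumed field name; use whatever field the index actually contains.
body = terms_facet_request("Tag1", size=5)
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of a search request against the index; the response then lists each term with its document count, which gives a quick picture of what an unknown document set is about.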

The project needs (a sample of) the StackOverflow (SOF) dump as its text corpus. It reads the dump and posts it to a running ES instance using pyes. The documents can then be retrieved, parsed, and used as training input for the Mahout logistic regression classifier.
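The posting step can be sketched without any client library. The example below does not use pyes and is not the repository's `sof_es_poster.py`; it only shows, under stated assumptions, how CSV rows could be turned into the newline-delimited payload that the ES `_bulk` endpoint accepts. The index name `sof`, the type name `post`, and the column names in the sample are assumptions.

```python
import csv
import io
import json


def bulk_lines(rows, index="sof", doc_type="post", id_field="PostId"):
    """Yield alternating action and source lines in the ES bulk format."""
    for row in rows:
        # Action line: tells ES where to index the document that follows.
        action = {"index": {"_index": index, "_type": doc_type, "_id": row[id_field]}}
        yield json.dumps(action)
        # Source line: the document itself, one JSON object per line.
        yield json.dumps(row)


def bulk_payload(csv_text, **kwargs):
    """Parse CSV text and return a bulk request body (must end with a newline)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(bulk_lines(reader, **kwargs)) + "\n"


# Tiny made-up sample; the real dump has more columns and rows.
sample = "PostId,Title,OpenStatus\n1,How do I parse CSV?,open\n"
payload = bulk_payload(sample)
print(payload)
```

The payload string would then be sent with an HTTP POST to the `_bulk` endpoint of the running ES instance. For a large dump, the rows would be sent in batches rather than as a single request.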

Ultimately, training and classification should most likely happen as part of the indexing process. This is work for a future project.

Scaling out

As a second stage, the project should show what classifier training would look like in a pure Hadoop setup without ES.

Components

  • sof_es_poster.py ... Python script to read the SOF dump and post docs to ES
  • config ... tracks the ES config used for the project
  • sof_trainer ... Maven project for querying ES and training the classifier
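To give a feel for what the training component does conceptually, the following pure-Python sketch turns text into hashed feature vectors and trains a logistic regression model with stochastic gradient descent. It illustrates the general technique only; it is not the Mahout API and not the code in `sof_trainer`. The vector size, learning rate, tokenizer, and toy examples are all assumptions.

```python
import math
import re
import zlib

NUM_FEATURES = 2 ** 10  # assumed vector size for this sketch


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hashed_features(text, num_features=NUM_FEATURES):
    """Map tokens to a sparse {bucket: count} vector via a stable hash."""
    vec = {}
    for token in tokenize(text):
        idx = zlib.crc32(token.encode("utf-8")) % num_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


def predict(weights, bias, vec):
    """Return the estimated probability of the positive class."""
    z = bias + sum(weights[i] * v for i, v in vec.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=50, rate=0.1, num_features=NUM_FEATURES):
    """Plain SGD on the logistic loss, one example at a time."""
    weights = [0.0] * num_features
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            vec = hashed_features(text, num_features)
            error = label - predict(weights, bias, vec)
            bias += rate * error
            for i, v in vec.items():
                weights[i] += rate * error * v
    return weights, bias


# Made-up toy data: 1 = technical question, 0 = off-topic question.
examples = [
    ("how do i read a file in java", 1),
    ("java exception when parsing xml", 1),
    ("what is your favorite editor", 0),
    ("favorite programming cartoon", 0),
]
weights, bias = train(examples)
print(predict(weights, bias, hashed_features("how do i read a file in java")))
```

Hashing tokens into a fixed number of buckets keeps the vector size independent of the vocabulary, which is convenient when documents are streamed out of a search index one at a time.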

System Requirements

  • JDK 1.6 for building and running the code
  • Maven2 for building the sof_trainer module
  • Elastic Search (>0.20)
  • Python for posting documents to ES
  • Project Lombok installed into your IDE to build the code (integrated automatically for the Maven-based build)

Data:

Get from Kaggle: https://www.kaggle.com/c/predict-closed-questions-on-stack-overflow/data
