TU-Berlin-DIMA/IMPRO-3.SS14

IMPRO-3 (SS14)

This project contains a set of machine learning algorithms implemented on top of Scala, Stratosphere, and Spark by master students attending the "Big Data Analytics Project" at FG DIMA, TU Berlin in the 2014 spring term.

Instructions for Contributors

Master students contributing to the project should follow the instructions below:

1. Clone the code with Git

Use your group repository as «origin» and the main repository as «upstream»:

export GROUP=GXX # configure your group number, e.g. G07
git clone git@github.com:TU-Berlin-DIMA/IMPRO-3.SS14.${GROUP}.git
cd IMPRO-3.SS14.${GROUP}
git remote add upstream git@github.com:TU-Berlin-DIMA/IMPRO-3.SS14.git
git fetch upstream

Set up and push an appropriate branch structure (this should be done only once per group):

git checkout -b dev_scala
git checkout -b dev_stratosphere
git checkout -b dev_spark
git push origin dev_scala
git push origin dev_stratosphere
git push origin dev_spark

Each of the other group members can then simply check out the existing branches:

git checkout -b dev_scala origin/dev_scala
git checkout -b dev_stratosphere origin/dev_stratosphere
git checkout -b dev_spark origin/dev_spark

2. Import the project into your IDE

We recommend using either IntelliJ or Eclipse. To enable auto-completion and syntax highlighting for the Scala code in your project, make sure you have the appropriate Scala plugin installed.

Project dependencies and the build lifecycle are configured via Maven, so the easiest way to set up the project in your IDE is to point the Maven importer to the local Git clone location.

3. Contribute code

In the course of the spring term, each group should provide unit-tested implementations of one machine learning algorithm for Scala, Stratosphere, and Spark.

When you develop your code, please follow the workflow below:

  1. Collaborate within the group. Create small commits in the dev_{system} branches and exchange them via push/pull to «origin»/dev_{system}.
  2. Make sure you frequently fetch «upstream»/master and rebase the dev_{system} branch onto it.
  3. Once the algorithm is unit-tested and works, squash all small commits from the dev_{system} branch into one or two commits (e.g. one for the algorithm and one for the unit test) and push them to «origin»/dev_{system}.
  4. Create a pull request from «origin»/dev_{system} to «upstream»/master.
  5. If everything is fine, we will merge your code into «upstream»/master. You can then pull from «upstream»/master and push the merged version into «origin»/master.
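
Steps 2 and 3 of the workflow above can be sketched as two shell functions. This is a minimal sketch rather than official course tooling: the function names `sync_with_upstream` and `squash_commits` are hypothetical, and the remotes «origin» and «upstream» are assumed to be configured as in section 1. The squash is done with `git reset --soft`, which always yields a single commit; an interactive rebase (`git rebase -i upstream/master`) is an alternative if you want to keep two commits.

```shell
# Hypothetical helpers. The first argument is the system name:
# scala, stratosphere, or spark.

# Step 2: replay the dev branch on top of the latest upstream master.
sync_with_upstream() {
  system="$1"
  git fetch upstream &&
  git checkout "dev_${system}" &&
  git rebase upstream/master
}

# Step 3: collapse all commits made since upstream/master into one commit.
# Assumes the branch has already been rebased onto upstream/master.
squash_commits() {
  system="$1"
  message="$2"
  git checkout "dev_${system}" &&
  git reset --soft upstream/master &&
  git commit -m "${message}"
}

# Example usage (commit message is a placeholder):
#   sync_with_upstream spark
#   squash_commits spark "Add algorithm implementation for Spark"
#   git push --force-with-lease origin dev_spark
```

Because rebasing and squashing rewrite history, a plain `git push` to «origin»/dev_{system} will likely be rejected once the branch has been pushed before. `--force-with-lease` overwrites the remote branch only if nobody else has pushed to it in the meantime, so coordinate within your group before using it.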

4. Contribute your project presentations

Each group should also prepare a presentation of its algorithm and keep it up to date. Please contribute updates to your slide sets using pull requests.

The current set of presentations can be found below:

  1. Introduction
  2. Scrum Introduction
  3. Scala Introduction
  4. Algorithm Presentations
    1. Logistic Regression
    2. Random Forest
    3. Canopy Clustering
    4. Hierarchical Agglomerative Clustering
    5. K-Means++