Source code for PACKT Book 'Programming MapReduce With Scalding'
Find more information at http://scalding.io/
The book consists of 9 chapters
-
Introduction to Map-Reduce - Introduction to Hadoop, Map Reduce, Pipelining, Cascading, Pig and Hive. Chapter presents benefits of higher level abstractions of Map Reduce (concepts and capabilities).
-
Get ready for Scalding - Theory about Scalding - the Scala Domain Specific Language utilising Cascading. Development environment setup including local hadoop cluster for development. Execute the first
Hello World
Scalding example. -
Scalding by example - The core capabilities of scalding: i) Map-like functions, ii) Grouping/reducing functions iii) Join operations
-
Intermediate examples - A Scalding log processing flow for a News company, aggregating multiple sources will be presented. Through an example with multiple pipe-lines some more advanced concepts are presented.
-
Scalding Design Patterns - Interesting design patterns applicable to Scalding data processing applications. Using the 'External Operations' patters will enable us performing unit testing and structuring our applications in a modular way.
-
Testing & TDD - Best practices of first defining behaviour (Behaviour Driven Development) then tests (Test Driven Development) and then completing the implementation. How to write unit, integration tests and also apply Black-box testing methodologies in the context of Big Data.
-
Running Scalding in Production - Tips and tricks on how to execute and schedule jobs. Also how to co-ordinate the execution of Scalding/Scala/Java and even external system processes. Finally how to configure Scalding jobs using property files or Hadoop parameters, how to monitor and optimize jobs and other usefull tips.
-
Using external data stores - Interaction with external external SQL, NOSQL and in-memory applications like HBase, SQL, ElasticSearch etc.
-
Matrix Calculations and Machine Learning - Matrix calculations using the Matrix API and algebird to calculate text similarity (TF-IDF) and set similarity (Jaccard). Then another example on Mahout K-Means clustering and outlier detection.