scalding-io/ProgrammingWithScalding

Source code for the Packt book 'Programming MapReduce With Scalding'

Find more information at http://scalding.io/

The book consists of 9 chapters:

  • Introduction to MapReduce - Introduction to Hadoop, MapReduce, pipelining, Cascading, Pig and Hive. The chapter presents the benefits of higher-level abstractions over MapReduce (concepts and capabilities).

  • Get ready for Scalding - Theory about Scalding, the Scala domain-specific language built on Cascading. Development environment setup, including a local Hadoop cluster for development. Executing the first Hello World Scalding example.

  • Scalding by example - The core capabilities of Scalding: i) map-like functions, ii) grouping/reducing functions, iii) join operations.

  • Intermediate examples - A Scalding log-processing flow for a news company that aggregates multiple sources. An example with multiple pipelines introduces some more advanced concepts.

  • Scalding Design Patterns - Interesting design patterns applicable to Scalding data processing applications. Using the 'External Operations' pattern enables us to unit test our applications and structure them in a modular way.

  • Testing & TDD - Best practices of first defining behaviour (Behaviour Driven Development), then tests (Test Driven Development), and then completing the implementation. How to write unit and integration tests, and how to apply black-box testing methodologies in the context of Big Data.

  • Running Scalding in Production - Tips and tricks on how to execute and schedule jobs, and how to coordinate the execution of Scalding/Scala/Java and even external system processes. Finally, how to configure Scalding jobs using property files or Hadoop parameters, how to monitor and optimize jobs, and other useful tips.

  • Using external data stores - Interaction with external SQL, NoSQL and in-memory applications such as HBase, SQL databases, ElasticSearch etc.

  • Matrix Calculations and Machine Learning - Matrix calculations using the Matrix API and Algebird to calculate text similarity (TF-IDF) and set similarity (Jaccard). Then another example on Mahout K-Means clustering and outlier detection.
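
The three core capabilities listed under 'Scalding by example' (map-like functions, grouping/reducing functions and join operations) can be sketched with plain Scala collections. This is only an illustration of the concepts: it does not use the Scalding API, it is not code from the book, and the object name and sample data are hypothetical.

```scala
// Plain Scala collections only; NOT the Scalding API and NOT code from the book.
// All names and sample records below are hypothetical, chosen for illustration.
object CoreOperationsSketch {
  // Sample data: (userId, page) visits and a userId -> country lookup.
  val visits: Seq[(Int, String)] = Seq((1, "/home"), (2, "/news"), (1, "/news"))
  val countries: Map[Int, String] = Map(1 -> "UK", 2 -> "GR")

  // i) Map-like: transform every record independently of the others.
  val upperPages: Seq[(Int, String)] =
    visits.map { case (user, page) => (user, page.toUpperCase) }

  // ii) Grouping/reducing: group by user, then reduce each group to a count.
  val visitsPerUser: Map[Int, Int] =
    visits.groupBy(_._1).map { case (user, vs) => user -> vs.size }

  // iii) Join: inner join of visits with countries on userId.
  val joined: Seq[(Int, String, String)] =
    for {
      (user, page) <- visits
      country <- countries.get(user).toSeq
    } yield (user, page, country)

  def main(args: Array[String]): Unit = {
    println(upperPages)
    println(visitsPerUser)
    println(joined)
  }
}
```

In a Scalding job the same three steps would run as pipe operations over a distributed data set instead of in-memory collections; the chapter examples in this repository show the actual API.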