added driven examples to the Impatient series

Cascading · Jun 10, 2014 · d7384fe · d7384fe
1 parent 5dca4dd
commit d7384fe
Show file tree

Hide file tree

Showing 22 changed files with 297 additions and 76 deletions.
diff --git a/impatient-docs/src/asciidoc/cascading_state.adoc b/impatient-docs/src/asciidoc/cascading_state.adoc
@@ -0,0 +1,31 @@
+Let's understand better how to read the timeline view. To do that, let's go deeper
+into understanding the different states for the Cascading application.
+
+image:cascading-state.png[]
+
+As the Cascading framework composes the Flows, they enter the "Pending" state. Since
+Flows within a Casacade can have dependencies, the Flows stay in the "Pending" state
+until all the dependencies predicating its execution have successfully completed.
+
+Similary, as the query planner maps the assembly of steps into Mappers and Reducers (or other
+execution primitives associated with the underlying fabric), they now enter the "Submitted" 
+state waiting the work unit to be executed, at which time they enter the "Running" state
+
+Measuring how much time the application, flow, or a Cascading step spends in a particular state 
+can give valuable insights about how to improve the performance of your application. 
+
+For example, a large time interval between the submitted and running time can allude to
+concurrency issues in running MapReduce jobs, and suggesting that increasing the number
+of mapper or reducer slots can significantly improve the application performance. 
+
+Similarly, if the application spends a large proportion of its total run time in 
+pending state, it may be cause to review your application architecture and see how 
+some dependencies between the Flows be eliminated. 
+
+You can visualize this information related to latencies for each state from two places.
+Hover your mouse over the timelime bars, and you can see the state associated with 
+each color band. For a more detailed and an analytical view, select "Add Columns", and select 
+the following counters: "Pending Time", "Start Time", "Submit Time", "Run Time", "Finished Time"
+
+Now with these detailed insights, you are ready to take your application tuning to the 
+next level!
diff --git a/impatient-docs/src/asciidoc/impatient1.adoc b/impatient-docs/src/asciidoc/impatient1.adoc
@@ -138,23 +138,69 @@ https://groups.google.com/forum/#!forum/cascading-user[cascading-user] email
 forum. Plenty of experienced Cascading users are discussing taps and pipes and
 flows there, and eager to help.
 
-For those who are familiar with http://pig.apache.org[Apache Pig], we have
-included a comparable script:
-
-[source]
-----
-copyPipe = LOAD '$inPath' USING PigStorage('\t', 'tagsource');
-STORE copyPipe INTO '$outPath' using PigStorage('\t', 'tagsource');
-----
-
-To run that, use:
-
-    rm -rf output
-    pig -p inPath=./data/rain.txt -p outPath=./output/rain ./src/scripts/copy.pig
-
 That's it in a nutshell, our simplest app possible in Cascading. Not quite a
 ``Hello World'', but more like a `Hi there, bus stop''. Or something.
 
+Driven
+~~~~~~
+
+Let us get started accessing your *Driven* environment. Driven is an 
+application-performance management application that helps you visualize 
+you application development, debugging, performance tuning, monitoring and 
+other operational aspects for data-driven applications built on Cascading.
+
+_Driven is free for developer use on cascading.io._
+
+To get started, http://cascading.io/try/[download the Driven plugin]
+
+NOTE: *How does Driven work?*
+When you run your Cascading application, the framework underneath instruments
+your code and sends these telemetry signals to your Driven server, which can 
+reside in the Cloud (cascading.io)  or be self-hosted. The Driven plugin, which resides close 
+your application, only sends the operational meta-data to the server for 
+analysis and visualization -- _no application data is ever transmitted_. The operational
+meta-data includes the graph it constructs to run the application on the native
+computational fabric (MapReduce, local..) as well as the counter information it
+collects from the run. 
+
+Since this is a very elementary application, we will not have much to analyze — we 
+will wait for later parts in the tutorial for that. However, it is important at 
+this stage to learn how to setup and access your Driven environment.
+
+When you execute your application, you will see the URL in your console prompt. 
+Your URL, will of course be different than the one in this screenshot. 
+
+image:driven-part1-b.png[]
+
+This URL will give you insights about your application, some of 
+which we will cover as part of the tutorial. If you have not installed the Driven plugin,
+you can still see how Driven will help you visualize such an application by following the 
+ https://driven.cascading.io/driven/4750100B4D434B70BFAD0BA7543FB99A[link]. 
+
+NOTE: To take advantage of additional features such as comparing your application statistics 
+with historical runs, collaborating with teams, and managing multiple applications, we suggest 
+that you get a free API key at http://cascading.io/register. If you have registered and 
+finished the instructions to install the key, log in at https://driven.cascading.io 
+to visualize your application flow. We will cover cover various features of the Driven 
+application in the following tutorials.
+
+image:driven-part1-a.png[] 
+
+We will get additional insights in later parts as we create more complex applications. 
+From the screenshot, you will see two key components as part of the application developer
+view. The top half will help you visualize the graph associated with your application, showing
+you all the dependencies between different Cascading steps and flows. Clicking on the two
+taps (green circles) will give you additional attribute information, including reference to
+the source code where the Tap was defined. 
+
+The bottom half of the screen contains the 'Timeline View', which will give details associated
+with each flow run. You can click on the 'Add Columns' to explore other counters too. As your
+applications get more complex, these counters will help you gain insights if a particular
+run-time behavior is caused by code, the infrastructure, or the network. 
+
+To understand how best to understand the timing counters, read 
+link:cascading_state.html[the following note on timing durations]
+
 Next
 ----
 Learn how to implement the classical word count with Cascading in

diff --git a/impatient-docs/src/asciidoc/impatient2.adoc b/impatient-docs/src/asciidoc/impatient2.adoc
@@ -10,14 +10,14 @@ yet, it’s probably best to start there.
 Today’s lesson takes the same app and stretches it a bit further. Undoubtedly
 you’ve seen Word Count before. We’d feel remiss if we did not provide a Word
 Count example. It’s the ``Hello World'' of MapReduce apps. Fortunately, this
-code is one of the basic steps toward developing a TF-IDF implementation. How
-convenient. We’ll also show how to use Cascading to generate a visualization of
-your MapReduce app. 
+code is one of the basic steps toward developing a implementation of a word
+scoring algorithm such as TF-IDF. How convenient. We’ll also show how to use 
+Cascading to generate a visualization of your MapReduce app. 
 
 Theory
 ~~~~~~
 
-Our example code in link:impatient1.html[Part 1] of this series showed how to
+Our example code in link:impatient1.html[Part 1] of this series, we showed how to
 move data from point A to point B. It was simply a distributed file copy -
 loading data via distributed tasks, an instance of the ``L'' in
 link:http://en.wikipedia.org/wiki/Extract,\_transform,_load[ETL].
@@ -43,13 +43,13 @@ of spare bicycle parts which is robust enough that strangers will let their
 kids ride. Let’s hope those welds were made using best practices and good
 materials, to avoid catastrophes.
 
-That’s a key reason why Cascading was created. When you need to move a few Gb
+That’s a key reason why Cascading was created. When you need to move a few GB
 from point A to point B, it’s probably simple enough to write a Bash script, or
 just use a single command line copy. When your work requires some reshaping of
 the data, then a few lines of Python will probably work fine. Run that Python
 code from your Bash script and you’re done. I’ve used that approach many times,
 when it fit the use case requirements. However, suppose you’re not moving just
-Gb around? Suppose you’re moving Tb, or Pb? Bash scripts won’t get you very
+GB around? Suppose you’re moving TB, or PB? Bash scripts won’t get you very
 far. Also think about this: suppose your app not only needs to move data from
 point A to point B, but it must run within the constraints of an enterprise IT
 shop. Millions of dollars and potentially even some jobs ride on the fact that
@@ -150,11 +150,6 @@ wcFlow.complete();
 Place those source lines all into a `Main` method, then build a JAR file. You
 should be good to go.
 
-image:wc.png[]
-
-The diagram for the Cascading flow will be in the `dot/` subdirectory. Here we
-have annotated it to show where the mapper and reducer phases are running:
-
 If you want to read in more detail about the classes in the Cascading API which
 were used, see the Cascading 2.5
 http://docs.cascading.org/cascading/2.5/userguide/html/[User Guide] and
@@ -170,7 +165,7 @@ To build the sample app from the command line use:
 Run
 ~~~
 
-lear the output directory (Apache Hadoop insists, if you’re running in
+Clear the output directory (Apache Hadoop insists, if you’re running in
 standalone mode) and run the app:
 
     rm -rf output
@@ -188,30 +183,58 @@ probably not set up correctly. Drop us a line on the
 https://groups.google.com/forum/#!forum/cascading-user[cascading-user] email
 forum.
 
-For those who are familiar with Apache Pig, we have included a comparable script:
-
-[source]
-----
-docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource') AS (doc_id, text);
-docPipe = FILTER docPipe BY doc_id != 'doc_id';
--- specify a regex operation to split the "document" text lines into a token stream
-tokenPipe = FOREACH docPipe GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
-tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';
--- determine the word counts
-tokenGroups = GROUP tokenPipe BY token;
-wcPipe = FOREACH tokenGroups GENERATE group AS token, COUNT(tokenPipe) AS count;
--- output
-STORE wcPipe INTO '$wcPath' using PigStorage('\t', 'tagsource');
-EXPLAIN -out dot/wc_pig.dot -dot wcPipe;
-----
-
-To run that, use:
+So that is our Word Count example. Twenty lines of yummy goodness.
 
-    rm -rf output
-    mkdir -p dot
-    pig -p docPath=./data/rain.txt -p wcPath=./output/wc ./src/scripts/wc.pig
+Driven
+~~~~~~
 
-So that is our Word Count example. Twenty lines of yummy goodness.
+Now, we will start viewing some interesting details and getting some application insights 
+through Driven. 
+
+*If you have not installed the Driven plugin, you can still explore how Driven
+will visualize Part 2 by visiting this 
+https://driven.cascading.io/driven/56AB59A8C83E4ABAB50A617B2512600F[link]*
+
+You can inspect the application run either by following the URL provided in system
+command, or visiting http://driven.cascading.io if your registered your key.
+
+image:driven-part2.png[]
+
+1. The first thing you will see is a graph -- Directed Acyclic Graph (DAG) in 
+formal parlance -- that shows all the steps in your code, and the dependencies.
+The circles represent the Tap, and you can now inspect the function, Group by, 
+and the count function used by your code by clicking on each step.
+2. Click on each step of the DAG. You will see additional details about the specific
+operator, and the reference to  the line of the code where the that step was 
+invoked.
+3. The timeline chart visualizes how the application executed in your environment. You 
+can see details about the time taken by the flow to execute, and get additional 
+insights by clicking on "Add Columns" button.
+4. If you executed your application on the Hadoop cluster in a distributed mode,
+you will get additional insights regarding how your Cascading flows mapped into mappers
+and reducers. Note, that the 'Performance View' is only available if you ran your 
+application on Hadoop (distributed mode)
+5. In the timeline view, click on the your flow name link ("wc"). You will see how
+ your application logic got decomposed into the mapper (regex function and Group By),
+and what part of your application logic became part of the Reducer (of course, the count). 
+You can also see how many slices were created for the shard. 
+
+image:driven-part2-c.png[]
+
+As your applications become more complex, the 'Performance View' becomes seminal in 
+understanding the behavior of your application. 
+
+*If you registered and configured the Driven API key*, you will also have an 
+“All Application” view, where we can see all the applications that are 
+running, or have run in the Hadoop cluster for a given user. You can customize 
+the view to display additional attributes such as application name, ID, 
+owner. In addition, you can customize the view to filter the information 
+based on status and dates.
+
+image:driven-part2-b.png[]
+
+To understand how best to understand the timing counters, read 
+link:cascading_state.html[Understanding Timing Counters]
 
 Next
 ----

diff --git a/impatient-docs/src/asciidoc/impatient3.adoc b/impatient-docs/src/asciidoc/impatient3.adoc
@@ -130,7 +130,6 @@ are running, and also the section which was added since Part 2:
 
 image:wc1.png[]
 
-
 Build
 ~~~~~
 
@@ -146,6 +145,39 @@ Run the app like this:
     rm -rf output
     hadoop jar ./build/libs/impatient.jar data/rain.txt output/wc
 
+Driven
+~~~~~~
+Explore with *Driven* how your application got composed into Cascading. 
+
+*If you did not install the Driven plugin, you can still explore Part 3 
+through Driven by following this https://driven.cascading.io/driven/A6C97AEB171449F4945FF90651C08E74[link]*
+
+image:driven-part3.png[]
+
+*If you ran your application on a distributed Hadoop cluster*, your flow name ("wc") 
+in the timeline view will have a link into how the steps got composed into Mappers
+and Reducers.
+
+image:driven-part3-b.png[]
+
+Click on the step link ("(1/1) output/wc") in the Performance view if you want to 
+explore how your application executed from the Hadoop job dashboard.
+
+[NOTE]
+===============================
+Driven exposes three views of the DAG -- Logical, Physical and Contracted. The Logical View lets 
+you inspect and visualize the Tap, Pipe and other Cascading constructs that you 
+specified in your code. With Physical View, you can also inspect intermediate Tap 
+and Pipe Assemblies embedded in your code. In our case, we can see that the Retain 
+function was used in the subassembly.  
+===============================
+
+The bottom half of the screen contains the 'Timeline View', which will give details associated
+with each flow run. You can click on the 'Add Columns' to visualize other signals too. 
+
+To understand how best to understand the timing counters, read 
+link:cascading_state.html[Understanding Timing Counters]
+
 Output text gets stored in the partition file `output/wc1` which you can then
 verify:
 
@@ -157,12 +189,6 @@ up correctly. Drop us a line on the
 https://groups.google.com/forum/#!forum/cascading-user[cascading-user] email
 forum.
 
-For those familiar with Apache Pig, we have included a comparable script, and to run that:
-
-    rm -rf output
-    mkdir -p dot
-    pig -p docPath=./data/rain.txt -p wcPath=./output/wc -p stopPath=./data/en.stop ./src/scripts/wc.pig
-
 Next
 ----
 In link:impatient4.html[Part 4] of Cascading for the Impatient you will

diff --git a/impatient-docs/src/asciidoc/impatient4.adoc b/impatient-docs/src/asciidoc/impatient4.adoc
@@ -5,7 +5,7 @@ Part 4 - Implementing a Stop Word Filter
 
 In the previous installement we showed how to write a custom
 http://docs.cascading.org/cascading/2.5/javadoc/cascading/operation/package-summary.html[Operation]
-for a Cascading 2.0 application. If you haven’t read that yet, it’s probably
+for a Cascading application. If you haven’t read that yet, it’s probably
 best to start there.
 
 Today’s lesson takes that same Word Count app and expands on it to implement a
@@ -197,6 +197,32 @@ To run the app, do the following:
     rm -rf output
     hadoop jar ./build/libs/impatient.jar data/rain.txt output/wc data/en.stop
 
+Driven
+~~~~~~
+Open your *Driven-enabled* app to track the progress of your application in real-time. Make 
+sure that you have set the Refresh feature to ON. By default, the Driven updates the visualization
+every 30 seconds. 
+
+*If you have not installed the Driven plugin, you can still explore Part 4 through
+Driven by following this https://driven.cascading.io/driven/1AE69BD0641146EFB926DE7AC83B94B0[link]*
+
+[NOTE]
+===============================
+If you registered at http://cascading.io and installed the Driven API key, you will 
+have accces to the “All Applications” view that tracks all your historical application
+runs. This view starts becoming interesting over a period of time when you want to 
+track trending, identify outlier behavior, or monitor applications based on their 
+termination status
+===============================
+
+image:driven-part4.png[]
+
+You can get detailed insights 
+into how your Cascading steps translated into Map Reduce by clicking on your Flow name 
+in the timeline view ("wc"). 
+
+image:driven-part4-b.png[]
+
 Output text gets stored in the partition file `output/wc` which you can then verify:
 
     more output/wc/part-00000
@@ -206,15 +232,6 @@ sample app, part 4. If your run looks terribly different, something is probably
 not set up correctly. Drop us a line on the
 https://groups.google.com/forum/#!forum/cascading-user[cascading-user] email forum.
 
-For those familiar with Apache Pig, we have included a comparable script, and to run that:
-
-[source,java]
-----
-rm -rf output
-mkdir -p dot
-pig -p docPath=./data/rain.txt -p wcPath=./output/wc -p stopPath=./data/en.stop ./src/scripts/wc.pig
-----
-
 Next
 ----
 link:impatient5.html[Part 5] of Cascading for the Impatient implements the