added driven examples to the Impatient series
supreetoberoi authored and fs111 committed Jun 10, 2014
1 parent 5dca4dd commit d7384fe
Showing 22 changed files with 297 additions and 76 deletions.
31 changes: 31 additions & 0 deletions impatient-docs/src/asciidoc/cascading_state.adoc
@@ -0,0 +1,31 @@
Let's understand better how to read the timeline view. To do that, let's go deeper
into understanding the different states for the Cascading application.

image:cascading-state.png[]

As the Cascading framework composes the Flows, they enter the "Pending" state. Since
Flows within a Cascade can have dependencies, each Flow stays in the "Pending" state
until all the dependencies gating its execution have successfully completed.
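
The gating rule can be sketched in plain Java. This is only an illustration and does not use the Cascading API: the class name `PendingFlows`, the method `readyToRun`, and the three flow names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PendingFlows {
    // Returns the flows that may leave "Pending": flows that have not completed
    // yet and whose dependencies are all in the completed set.
    static List<String> readyToRun(Map<String, List<String>> dependsOn, Set<String> completed) {
        List<String> ready = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : dependsOn.entrySet()) {
            boolean alreadyDone = completed.contains(entry.getKey());
            if (!alreadyDone && completed.containsAll(entry.getValue())) {
                ready.add(entry.getKey());
            }
        }
        return ready;
    }

    public static void main(String[] args) {
        // Hypothetical Cascade of three flows: "join" depends on both imports.
        Map<String, List<String>> dependsOn = new LinkedHashMap<>();
        dependsOn.put("importA", List.of());
        dependsOn.put("importB", List.of());
        dependsOn.put("join", List.of("importA", "importB"));

        System.out.println(readyToRun(dependsOn, Set.of()));                     // [importA, importB]
        System.out.println(readyToRun(dependsOn, Set.of("importA")));            // [importB]
        System.out.println(readyToRun(dependsOn, Set.of("importA", "importB"))); // [join]
    }
}
```

In this sketch "join" stays pending for as long as either import is unfinished, which is the waiting time that shows up as the pending band in the timeline.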

Similarly, as the query planner maps the assembly of steps into Mappers and Reducers (or other
execution primitives associated with the underlying fabric), the steps enter the "Submitted"
state, waiting for the work unit to be executed, at which time they enter the "Running" state.

Measuring how much time the application, flow, or a Cascading step spends in a particular state
can give valuable insights about how to improve the performance of your application.

For example, a large time interval between the submitted and running times can point to
concurrency issues in running MapReduce jobs, and suggests that increasing the number
of mapper or reducer slots could significantly improve application performance.

Similarly, if the application spends a large proportion of its total run time in
the pending state, it may be cause to review your application architecture and see whether
some dependencies between the Flows can be eliminated.
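
As a worked example of that arithmetic, here is a minimal plain-Java sketch that turns state timestamps into durations and shares of the total. It does not call Driven or Cascading. The class name, the helper `percent`, and the timestamp values are made up for illustration, and it assumes each counter is an epoch timestamp in milliseconds.

```java
public class StateDurations {
    // Share of the total, rounded to a whole percent.
    static long percent(long part, long total) {
        return total == 0 ? 0 : Math.round(100.0 * part / total);
    }

    public static void main(String[] args) {
        // Hypothetical timestamps (milliseconds) at which the unit of work
        // entered each state; real values would come from your own run.
        long pendingAt = 1_000L;
        long submittedAt = 4_000L;
        long runningAt = 9_000L;
        long finishedAt = 21_000L;

        long total = finishedAt - pendingAt;        // 20000 ms overall
        long pending = submittedAt - pendingAt;     // 3000 ms waiting on dependencies
        long submitted = runningAt - submittedAt;   // 5000 ms waiting for a slot
        long running = finishedAt - runningAt;      // 12000 ms doing work

        System.out.println("pending:   " + pending + " ms, " + percent(pending, total) + "%");
        System.out.println("submitted: " + submitted + " ms, " + percent(submitted, total) + "%");
        System.out.println("running:   " + running + " ms, " + percent(running, total) + "%");
    }
}
```

With these made-up numbers, a quarter of the elapsed time passes between submission and running, which is the kind of gap the slot-related advice above refers to.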

You can view the latencies for each state in two places.
Hover your mouse over the timeline bars to see the state associated with
each color band. For a more detailed, analytical view, select "Add Columns", and then select
the following counters: "Pending Time", "Start Time", "Submit Time", "Run Time", "Finished Time".

Now with these detailed insights, you are ready to take your application tuning to the
next level!
74 changes: 60 additions & 14 deletions impatient-docs/src/asciidoc/impatient1.adoc
@@ -138,23 +138,69 @@ https://groups.google.com/forum/#!forum/cascading-user[cascading-user] email
forum. Plenty of experienced Cascading users are discussing taps and pipes and
flows there, and eager to help.

For those who are familiar with http://pig.apache.org[Apache Pig], we have
included a comparable script:

[source]
----
copyPipe = LOAD '$inPath' USING PigStorage('\t', 'tagsource');
STORE copyPipe INTO '$outPath' using PigStorage('\t', 'tagsource');
----

To run that, use:

rm -rf output
pig -p inPath=./data/rain.txt -p outPath=./output/rain ./src/scripts/copy.pig

That's it in a nutshell, our simplest app possible in Cascading. Not quite a
``Hello World'', but more like a ``Hi there, bus stop''. Or something.

Driven
~~~~~~

Let us get started accessing your *Driven* environment. Driven is an
application-performance management application that helps you visualize
your application development, debugging, performance tuning, monitoring and
other operational aspects for data-driven applications built on Cascading.

_Driven is free for developer use on cascading.io._

To get started, http://cascading.io/try/[download the Driven plugin].

NOTE: *How does Driven work?*
When you run your Cascading application, the framework underneath instruments
your code and sends telemetry signals to your Driven server, which can
reside in the Cloud (cascading.io) or be self-hosted. The Driven plugin, which resides close to
your application, only sends the operational meta-data to the server for
analysis and visualization; _no application data is ever transmitted_. The operational
meta-data includes the graph it constructs to run the application on the native
computational fabric (MapReduce, local, and so on) as well as the counter information it
collects from the run.

Since this is a very elementary application, we will not have much to analyze; we
will wait for later parts of the tutorial for that. However, it is important at
this stage to learn how to set up and access your Driven environment.

When you execute your application, you will see the URL in your console output.
Your URL will, of course, be different from the one in this screenshot.

image:driven-part1-b.png[]

This URL will give you insights about your application, some of
which we will cover as part of the tutorial. If you have not installed the Driven plugin,
you can still see how Driven will help you visualize such an application by following the
https://driven.cascading.io/driven/4750100B4D434B70BFAD0BA7543FB99A[link].

NOTE: To take advantage of additional features such as comparing your application statistics
with historical runs, collaborating with teams, and managing multiple applications, we suggest
that you get a free API key at http://cascading.io/register. If you have registered and
finished the instructions to install the key, log in at https://driven.cascading.io
to visualize your application flow. We will cover various features of the Driven
application in the following tutorials.

image:driven-part1-a.png[]

We will get additional insights in later parts as we create more complex applications.
From the screenshot, you will see two key components as part of the application developer
view. The top half will help you visualize the graph associated with your application, showing
you all the dependencies between different Cascading steps and flows. Clicking on the two
taps (green circles) will give you additional attribute information, including reference to
the source code where the Tap was defined.

The bottom half of the screen contains the 'Timeline View', which will give details associated
with each flow run. You can click on 'Add Columns' to explore other counters too. As your
applications get more complex, these counters will help you determine whether a particular
run-time behavior is caused by code, the infrastructure, or the network.

To learn how best to interpret the timing counters, read
link:cascading_state.html[the following note on timing durations].

Next
----
Learn how to implement the classical word count with Cascading in
91 changes: 57 additions & 34 deletions impatient-docs/src/asciidoc/impatient2.adoc
@@ -10,14 +10,14 @@ yet, it’s probably best to start there.
Today’s lesson takes the same app and stretches it a bit further. Undoubtedly
you’ve seen Word Count before. We’d feel remiss if we did not provide a Word
Count example. It’s the ``Hello World'' of MapReduce apps. Fortunately, this
code is one of the basic steps toward developing a TF-IDF implementation. How
convenient. We’ll also show how to use Cascading to generate a visualization of
your MapReduce app.
code is one of the basic steps toward developing an implementation of a word
scoring algorithm such as TF-IDF. How convenient. We’ll also show how to use
Cascading to generate a visualization of your MapReduce app.

Theory
~~~~~~

Our example code in link:impatient1.html[Part 1] of this series showed how to
In link:impatient1.html[Part 1] of this series, we showed how to
move data from point A to point B. It was simply a distributed file copy -
loading data via distributed tasks, an instance of the ``L'' in
link:http://en.wikipedia.org/wiki/Extract,\_transform,_load[ETL].
@@ -43,13 +43,13 @@ of spare bicycle parts which is robust enough that strangers will let their
kids ride. Let’s hope those welds were made using best practices and good
materials, to avoid catastrophes.

That’s a key reason why Cascading was created. When you need to move a few Gb
That’s a key reason why Cascading was created. When you need to move a few GB
from point A to point B, it’s probably simple enough to write a Bash script, or
just use a single command line copy. When your work requires some reshaping of
the data, then a few lines of Python will probably work fine. Run that Python
code from your Bash script and you’re done. I’ve used that approach many times,
when it fit the use case requirements. However, suppose you’re not moving just
Gb around? Suppose you’re moving Tb, or Pb? Bash scripts won’t get you very
GB around? Suppose you’re moving TB, or PB? Bash scripts won’t get you very
far. Also think about this: suppose your app not only needs to move data from
point A to point B, but it must run within the constraints of an enterprise IT
shop. Millions of dollars and potentially even some jobs ride on the fact that
@@ -150,11 +150,6 @@ wcFlow.complete();
Place those source lines all into a `Main` method, then build a JAR file. You
should be good to go.

image:wc.png[]

The diagram for the Cascading flow will be in the `dot/` subdirectory. Here we
have annotated it to show where the mapper and reducer phases are running:

If you want to read in more detail about the classes in the Cascading API which
were used, see the Cascading 2.5
http://docs.cascading.org/cascading/2.5/userguide/html/[User Guide] and
@@ -170,7 +165,7 @@ To build the sample app from the command line use:
Run
~~~

lear the output directory (Apache Hadoop insists, if you’re running in
Clear the output directory (Apache Hadoop insists, if you’re running in
standalone mode) and run the app:

rm -rf output
@@ -188,30 +183,58 @@ probably not set up correctly. Drop us a line on the
https://groups.google.com/forum/#!forum/cascading-user[cascading-user] email
forum.

For those who are familiar with Apache Pig, we have included a comparable script:

[source]
----
docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource') AS (doc_id, text);
docPipe = FILTER docPipe BY doc_id != 'doc_id';
-- specify a regex operation to split the "document" text lines into a token stream
tokenPipe = FOREACH docPipe GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';
-- determine the word counts
tokenGroups = GROUP tokenPipe BY token;
wcPipe = FOREACH tokenGroups GENERATE group AS token, COUNT(tokenPipe) AS count;
-- output
STORE wcPipe INTO '$wcPath' using PigStorage('\t', 'tagsource');
EXPLAIN -out dot/wc_pig.dot -dot wcPipe;
----

To run that, use:
So that is our Word Count example. Twenty lines of yummy goodness.

rm -rf output
mkdir -p dot
pig -p docPath=./data/rain.txt -p wcPath=./output/wc ./src/scripts/wc.pig
Driven
~~~~~~

So that is our Word Count example. Twenty lines of yummy goodness.
Now, we will start viewing some interesting details and getting some application insights
through Driven.

*If you have not installed the Driven plugin, you can still explore how Driven
will visualize Part 2 by visiting this
https://driven.cascading.io/driven/56AB59A8C83E4ABAB50A617B2512600F[link]*

You can inspect the application run either by following the URL printed in your
console, or by visiting http://driven.cascading.io if you registered your key.

image:driven-part2.png[]

1. The first thing you will see is a graph (a Directed Acyclic Graph, or DAG, in
formal parlance) that shows all the steps in your code and their dependencies.
The circles represent the Taps, and you can now inspect the function, the Group By,
and the count function used by your code by clicking on each step.
2. Click on each step of the DAG. You will see additional details about the specific
operator, and a reference to the line of code where that step was
invoked.
3. The timeline chart visualizes how the application executed in your environment. You
can see details about the time taken by the flow to execute, and get additional
insights by clicking on the "Add Columns" button.
4. If you executed your application on a Hadoop cluster in distributed mode,
you will get additional insights into how your Cascading flows mapped onto mappers
and reducers. Note that the 'Performance View' is only available if you ran your
application on Hadoop (distributed mode).
5. In the timeline view, click on your flow name link ("wc"). You will see which part of
your application logic was placed in the mapper (regex function and Group By),
and which part became part of the reducer (of course, the count).
You can also see how many slices were created for the shard.
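
To make the mapper/reducer split in item 5 concrete, here is a rough plain-Java sketch of the same two phases. It is not Cascading code and not what the planner generates. The class and method names are hypothetical, and the delimiter set is assumed from the Pig `TOKENIZE` call shown earlier on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountPhases {
    // Map side: split one line of text into tokens on space, brackets,
    // parentheses, comma and period, dropping empty strings.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.split("[ \\[\\]\\(\\),.]")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Group By plus reduce side: bring equal tokens together and count them.
    static Map<String, Long> countTokens(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : tokenize(line)) {
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countTokens(List.of("rain shadow, rain."))); // {rain=2, shadow=1}
    }
}
```

On a cluster, the tokenizing runs in many mappers in parallel and the grouping routes each token to a reducer; the sketch collapses all of that into one process so the division of labor is easier to see.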

image:driven-part2-c.png[]

As your applications become more complex, the 'Performance View' becomes essential for
understanding the behavior of your application.

*If you registered and configured the Driven API key*, you will also have an
“All Applications” view, where you can see all the applications that are
running, or have run, in the Hadoop cluster for a given user. You can customize
the view to display additional attributes such as application name, ID, and
owner. In addition, you can customize the view to filter the information
based on status and dates.

image:driven-part2-b.png[]

To learn how best to interpret the timing counters, read
link:cascading_state.html[Understanding Timing Counters].

Next
----
40 changes: 33 additions & 7 deletions impatient-docs/src/asciidoc/impatient3.adoc
@@ -130,7 +130,6 @@ are running, and also the section which was added since Part 2:

image:wc1.png[]


Build
~~~~~

@@ -146,6 +145,39 @@ Run the app like this:
rm -rf output
hadoop jar ./build/libs/impatient.jar data/rain.txt output/wc

Driven
~~~~~~
Explore with *Driven* how your application got composed into Cascading.

*If you did not install the Driven plugin, you can still explore Part 3
through Driven by following this https://driven.cascading.io/driven/A6C97AEB171449F4945FF90651C08E74[link]*

image:driven-part3.png[]

*If you ran your application on a distributed Hadoop cluster*, your flow name ("wc")
in the timeline view will have a link into how the steps got composed into Mappers
and Reducers.

image:driven-part3-b.png[]

Click on the step link ("(1/1) output/wc") in the Performance view if you want to
explore how your application executed from the Hadoop job dashboard.

[NOTE]
===============================
Driven exposes three views of the DAG -- Logical, Physical and Contracted. The Logical View lets
you inspect and visualize the Tap, Pipe and other Cascading constructs that you
specified in your code. With Physical View, you can also inspect intermediate Tap
and Pipe Assemblies embedded in your code. In our case, we can see that the Retain
function was used in the subassembly.
===============================

The bottom half of the screen contains the 'Timeline View', which will give details associated
with each flow run. You can click on 'Add Columns' to visualize other signals too.

To learn how best to interpret the timing counters, read
link:cascading_state.html[Understanding Timing Counters].

Output text gets stored in the partition file `output/wc1` which you can then
verify:

@@ -157,12 +189,6 @@ up correctly. Drop us a line on the
https://groups.google.com/forum/#!forum/cascading-user[cascading-user] email
forum.

For those familiar with Apache Pig, we have included a comparable script, and to run that:

rm -rf output
mkdir -p dot
pig -p docPath=./data/rain.txt -p wcPath=./output/wc -p stopPath=./data/en.stop ./src/scripts/wc.pig

Next
----
In link:impatient4.html[Part 4] of Cascading for the Impatient you will
37 changes: 27 additions & 10 deletions impatient-docs/src/asciidoc/impatient4.adoc
@@ -5,7 +5,7 @@ Part 4 - Implementing a Stop Word Filter

In the previous installment we showed how to write a custom
http://docs.cascading.org/cascading/2.5/javadoc/cascading/operation/package-summary.html[Operation]
for a Cascading 2.0 application. If you haven’t read that yet, it’s probably
for a Cascading application. If you haven’t read that yet, it’s probably
best to start there.

Today’s lesson takes that same Word Count app and expands on it to implement a
@@ -197,6 +197,32 @@ To run the app, do the following:
rm -rf output
hadoop jar ./build/libs/impatient.jar data/rain.txt output/wc data/en.stop

Driven
~~~~~~
Open your *Driven-enabled* app to track the progress of your application in real time. Make
sure that you have set the Refresh feature to ON. By default, Driven updates the visualization
every 30 seconds.

*If you have not installed the Driven plugin, you can still explore Part 4 through
Driven by following this https://driven.cascading.io/driven/1AE69BD0641146EFB926DE7AC83B94B0[link]*

[NOTE]
===============================
If you registered at http://cascading.io and installed the Driven API key, you will
have access to the “All Applications” view that tracks all your historical application
runs. This view becomes more interesting over time, when you want to
track trends, identify outlier behavior, or monitor applications based on their
termination status.
===============================

image:driven-part4.png[]

You can get detailed insights
into how your Cascading steps translated into MapReduce by clicking on your Flow name
in the timeline view ("wc").

image:driven-part4-b.png[]

Output text gets stored in the partition file `output/wc` which you can then verify:

more output/wc/part-00000
@@ -206,15 +232,6 @@ sample app, part 4. If your run looks terribly different, something is probably
not set up correctly. Drop us a line on the
https://groups.google.com/forum/#!forum/cascading-user[cascading-user] email forum.

For those familiar with Apache Pig, we have included a comparable script, and to run that:

[source,java]
----
rm -rf output
mkdir -p dot
pig -p docPath=./data/rain.txt -p wcPath=./output/wc -p stopPath=./data/en.stop ./src/scripts/wc.pig
----

Next
----
link:impatient5.html[Part 5] of Cascading for the Impatient implements the