Cascading plus City of Palo Alto open data


CMU Workshop on Cascading + City of Palo Alto Open Data

We have built an example app in Cascading and Apache Hadoop, based on the City of Palo Alto open data provided via Junar: http://paloalto.opendata.junar.com/dashboards/7576/geographic-information/

Students can extend the example workflow to build derivative apps, or use it as a starting point for other ways to leverage this data.

We will also draw some introductory material from two previous talks.

For more details, please read the accompanying wiki page.

Build Instructions

To build the sample app from the command line use:

gradle clean jar

Note that this depends on Gradle 1.3+, JVM 1.6, and Apache Hadoop 1.x.

Before running this sample app, be sure to set your HADOOP_HOME environment variable. Then clear the out directory. To run on a desktop/laptop with Apache Hadoop in standalone mode:

rm -rf out
hadoop jar ./build/libs/copa.jar data/copa.csv data/meta_tree.tsv data/meta_road.tsv data/gps.csv \
  out/trap out/tsv out/tree out/road out/park out/shade out/reco
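A run that starts without `HADOOP_HOME` or with a missing input file tends to fail late and noisily, so a small pre-flight check can help. The following is a minimal sketch, not part of the repository: `copa_preflight` and `COPA_INPUTS` are hypothetical names, and the file list simply mirrors the input arguments of the `hadoop jar` command above.

```shell
# Input files named in the hadoop command above.
COPA_INPUTS="data/copa.csv data/meta_tree.tsv data/meta_road.tsv data/gps.csv"

# Hypothetical helper: returns 0 when the environment looks ready,
# 1 when HADOOP_HOME is unset or empty, 2 when an input file is missing.
copa_preflight() {
  if [ -z "${HADOOP_HOME:-}" ]; then
    echo "HADOOP_HOME is not set" >&2
    return 1
  fi
  for f in $COPA_INPUTS; do
    if [ ! -f "$f" ]; then
      echo "missing input: $f" >&2
      return 2
    fi
  done
  return 0
}
```

From the project root, it could be chained in front of the run, for example `copa_preflight && rm -rf out` followed by the `hadoop jar` command.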

To view the results, for example the output recommendations in reco:

ls out
more out/reco/part-00000
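To skim every output at once rather than one file at a time, a short loop over the part files may be convenient. This is a sketch under assumptions: `copa_preview` is a hypothetical helper, and the `part-*` pattern assumes each output directory holds files named like the `part-00000` shown above.

```shell
# Hypothetical helper: print the first N lines of every part file
# found one level below an output directory.
# Usage: copa_preview [dir] [lines]   (defaults: out, 5)
copa_preview() {
  dir="${1:-out}"
  n="${2:-5}"
  for part in "$dir"/*/part-*; do
    # Skip the literal pattern when the glob matches nothing.
    [ -f "$part" ] || continue
    echo "== $part"
    head -n "$n" "$part"
  done
}
```

For example, `copa_preview out 3` would show the first three lines of each part file under `out`.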

An example log captured from a successful build and run is at https://gist.github.com/3660888

To run the R script, load src/scripts/copa.R into RStudio or from the command line run:

R --vanilla --slave < src/scripts/copa.R

Then check the output in the file Rplots.pdf.

Cascalog Build

See the Leiningen build script in project.clj and Cascalog source in the src/main/clj/copa directory.

Note that this depends on Cascalog 1.9 or later, Leiningen 2.0 or later, JVM 1.6, and Apache Hadoop 1.x.

To build and run:

lein clean
lein uberjar
rm -rf out/
hadoop jar ./target/copa.jar data/copa.csv data/meta_tree.tsv data/meta_road.tsv data/gps.csv \
  out/trap out/park out/tree out/road out/shade out/gps out/reco
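After the job finishes, it can be useful to confirm that the expected output directories were actually written. The sketch below assumes the directory names passed on the command line above; `copa_missing_outputs` is a hypothetical helper, not something the repository provides.

```shell
# Hypothetical helper: report which expected output directories are absent.
# Usage: copa_missing_outputs <base-dir> <name>...
# Returns 0 only if every named directory exists under the base directory.
copa_missing_outputs() {
  base="$1"
  shift
  missing=0
  for name in "$@"; do
    if [ ! -d "$base/$name" ]; then
      echo "missing output: $base/$name" >&2
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

For the Cascalog run this might be called as `copa_missing_outputs out park tree road shade gps reco`. The `trap` directory is left out of that call on the assumption that it may only appear when tuples are actually trapped.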

About Cascading

There is a tutorial about getting started with Cascading in the blog post series called Cascading for the Impatient. Other documentation is available at http://www.cascading.org/documentation/.

For more discussion, see the cascading-user email forum or check out one of our meetups.
