Home
We have built an example app in Cascading and Apache Hadoop, based on the City of Palo Alto (CoPA) open data provided via Junar: http://paloalto.opendata.junar.com/dashboards/7576/geographic-information/
Students can extend the example workflow to build derivative apps, or use it as a starting point for other ways to leverage this data.
We will also draw some introductory material from these related talks:
A chapter in the O'Reilly book "Enterprise Data Workflows with Cascading" (June 2013) also describes this app in more detail.
We used some of the CoPA open data for parks, roads, trees, etc., and have shown how to use Cascading and Hadoop to clean up the raw, unstructured download. Based on that initial ETL workflow, we get geolocation + metadata for each item of interest:
- trees w/ species
- road pavement w/ traffic conditions
- parks
One use case could be “Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sippin’ a latte or enjoying some fro-yo.” In other words, we could determine estimates for albedo vs. relative shade. Perhaps as the starting point for a mobile killer app. Or something.
Additional data is included here, to be joined with the cleaned-up CoPA data about trees and roads. We will also use log data collected using GPS Tracks.
[Figure: conceptual diagram for this app in Cascading]
- Clean up the raw, unstructured data from the CoPA download… aka ETL
- Before modeling, perform visualization and summary statistics in RStudio
- Ideation and research for potential use cases
- Iterate on business process for the app workflow
- Apply best practices and TDD at scale
- Integrate with end use cases represented by the workflow endpoints
- …
- PROFIT!
- Data Quality: some species names have spelling errors or misclassifications -- could be cleaned up and provided back to CoPA
- Assumptions have been made about missing data -- were these appropriate for the intended use case?
- The resulting data still needs: common names for trees, photos, natives vs. invasives, toxicity, etc.
- There are much better ways to handle the geospatial work, e.g., k-d trees
- Arguably, this is not a large data set; however, it’s early for the open data initiative, and besides, Palo Alto has a population of only about 65K.
- This provides a good area for a POC, prior to deploying in other, larger metro areas.
- This example helps illustrate how in terms of “Big Data”, complexity is more important to consider than bigness.
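To make the k-d tree suggestion in the list above more concrete, here is a minimal sketch of a 2-d tree with nearest-neighbor lookup. It is not taken from the sample app. The coordinates and labels are hypothetical, and treating latitude/longitude as planar coordinates is an assumption that is only reasonable over an area as small as one city.

```python
from math import hypot

def build_kdtree(points, depth=0):
    """Recursively build a 2-d tree from (lat, lon, label) tuples."""
    if not points:
        return None
    axis = depth % 2  # alternate between latitude (0) and longitude (1)
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    """Return the stored point closest to target = (lat, lon)."""
    if node is None:
        return best
    point, axis = node["point"], node["axis"]

    def dist(p):
        # Planar distance in degrees (assumption: small area).
        return hypot(p[0] - target[0], p[1] - target[1])

    if best is None or dist(point) < dist(best):
        best = point
    diff = target[axis] - point[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Search the far side only if the splitting line is closer than the best match so far.
    if abs(diff) < dist(best):
        best = nearest(far, target, best)
    return best

# Hypothetical tree locations, for illustration only.
trees = [
    (37.4419, -122.1430, "oak"),
    (37.4275, -122.1697, "redwood"),
    (37.4530, -122.1817, "elm"),
    (37.4443, -122.1598, "maple"),
]
closest = nearest(build_kdtree(trees), (37.4440, -122.1600))
```

Unlike a fixed grid of cells, the tree adapts its splits to where the points actually are, so a lookup near a cell edge does not miss a close neighbor on the other side.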
Other relevant data science aspects... some extensions could improve results:
- Bayesian point estimates for identifying "most frequented" paths and locations from the GPS logs
- Kriging to smooth the geo distribution of estimated metrics
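One hedged reading of the first extension above: treat each GPS log point as a visit to a grid cell, and use the posterior mean under a symmetric Dirichlet prior as the point estimate of how often each cell is frequented. This is a sketch under that assumption, not the method used in the sample app. The cell names, the `alpha` parameter, and the function name are hypothetical. Kriging is not shown.

```python
from collections import Counter

def posterior_mean_visit_rates(cell_visits, all_cells, alpha=1.0):
    """Posterior mean visit rate per cell under a symmetric Dirichlet(alpha) prior.

    cell_visits: one cell id per GPS log point.
    all_cells:   the distinct cell ids under consideration.
    """
    counts = Counter(cell_visits)
    total = sum(counts[cell] for cell in all_cells) + alpha * len(all_cells)
    return {cell: (counts[cell] + alpha) / total for cell in all_cells}

# Hypothetical cells and visits, for illustration only.
cells = ["cell_a", "cell_b", "cell_c"]
visits = ["cell_a", "cell_a", "cell_a", "cell_b"]
rates = posterior_mean_visit_rates(visits, cells)
most_frequented = max(rates, key=rates.get)
```

The prior keeps a cell with no logged visits from getting an estimate of exactly zero, which matters when the GPS logs are sparse.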
The use of geohash is arguably a hack, but it works fine for this case. In a larger geographic area there might be discontinuities. A more robust approach to geospatial indexing would be to use k-d trees.
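For readers unfamiliar with geohash, the following sketch shows the standard encoding and one example of the discontinuity mentioned above. It is an illustration only, not the code used in the sample app, and the default precision of 6 characters is an assumption.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)

# Two points a few meters apart on either side of the equator
# share no common prefix, even though they are neighbors.
north = geohash_encode(0.0001, 10.0)
south = geohash_encode(-0.0001, 10.0)
```

Nearby points usually share a long prefix, which is what makes geohash convenient as a join key. The exceptions sit on cell boundaries, so a prefix match alone can miss close neighbors.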
Note that this example illustrates some key elements of a good data product:
- ETL of unstructured data (CoPA GIS export)
- curated metadata: tree species dataset, road albedo dataset
- log files: iPhone personalized mobile coordinates
- calibration and testing based on R
- algorithms: geospatial indexing, replicated joins
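The replicated join named in the last item can be sketched as follows. The idea is that the smaller side of the join (for example, curated metadata) fits in memory, so each task holds a full copy of it and streams the larger side past it. Cascading provides this kind of join as `HashJoin`. The version below is a plain-Python illustration with hypothetical field names and values, not the app's code.

```python
def replicated_join(large_rows, small_rows, key):
    """Inner join: hold the small side in memory, stream the large side."""
    lookup = {row[key]: row for row in small_rows}  # the replicated copy
    for row in large_rows:
        match = lookup.get(row[key])
        if match is not None:
            # Fields from the large side win if names collide.
            yield {**match, **row}

# Hypothetical records, for illustration only.
tree_meta = [
    {"species": "species_a", "shade_height": 20.0},
    {"species": "species_b", "shade_height": 8.5},
]
tree_rows = [
    {"tree_id": 1, "species": "species_a"},
    {"tree_id": 2, "species": "species_c"},  # no metadata, so it is dropped
    {"tree_id": 3, "species": "species_b"},
]
joined = list(replicated_join(tree_rows, tree_meta, "species"))
```

Because the large side is never shuffled or sorted, this avoids the reduce step that a general-purpose join would need. The trade-off is that the small side must fit in memory on every task.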
We could combine this CoPA open data with access to external APIs:
- Factual: local business (FB Places, etc.) [uses Cascading]
- CommonCrawl: open source full web crawl
- Data.gov: US federal open data
- Trulia: neighborhood data, housing prices [uses Cascading]
- Google: geocoding
- Wunderground: local weather data
- WalkScore: neighborhood data, walkability
- Beer: need we say more?
- Data.NASA.gov: NASA open data
- DBpedia: datasets derived from Wikipedia
- GeoWordNet: semantic knowledge base about localized terminology
- CityData: US city profiles
- Geolytics: demographics, GIS, etc.
- Foursquare, Yelp, CityGrid, Localeze, YP
- Programmable Web: API mashup directory
- various photo sharing
- estimate allergy zones, for real estate preferences
- optimize sales leads: target sites for conversion to residential solar
- optimize sales leads: target sites for an urban agriculture venture
- report observations of natives on endangered species list
- report new observations of invasives / toxicology
- infer regions of affinity for beneficial insects
- premium payment / bid system for an open parking spot in the shade
- welcome services for visitors (ecotourism, translated park info, etc.)
- city planning: expected rates for tree replanting, natives vs. invasives, etc.
- liabilities: e.g., oleander (common, highly toxic) near day care centers
- epidemiology, e.g. there are outbreaks of disastrous tree diseases -- with big impact on property values
community organizations:
- volunteer events: harvest edibles to donate to shelters
start-ups:
- some of the invasive species are valuable in Chinese medicine while others can be converted to biodiesel -- potential win-win for targeted harvest services
Looks like this data would be even more valuable if it included ambient noise levels. Somehow.
Question: How could your new business obtain data for ambient noise levels in Palo Alto?
- infer from road data
- infer from bus lines, rail schedule
- sample/aggregate from mobile devices in exchange for micropayments
- buy/aggregate data from home security networks
- fly nano quadrotors, DIY "Street View" for audio
- fly micro aerostats, with Arduino-based accelerometer and positioned parabolic mic
- partner with City of Palo Alto to deploy a simple audio sensor grid
To generate an IntelliJ project use:
gradle ideaModule
To build the sample app from the command line use:
gradle clean jar
Before running this sample app, be sure to set your HADOOP_HOME environment variable. Then clear the output directory. To run on a desktop/laptop with Apache Hadoop in standalone mode:
rm -rf output
hadoop jar ./build/libs/copa.jar data/copa.csv data/meta_tree.tsv data/meta_road.tsv data/gps.csv output/trap output/tsv output/tree output/road output/park output/shade output/reco
To view the results, for example the output recommendations in reco:
ls output
more output/reco/part-00000
An example log captured from a successful build+run is at https://gist.github.com/3660888
To run the R script, load src/scripts/copa.R into RStudio or from the command line run:
R --vanilla --slave < src/scripts/copa.R
...and then check output in the file Rplots.pdf
There is a tutorial about getting started with Cascading in the blog post series called Cascading for the Impatient. Other documentation is available at http://www.cascading.org/documentation/.
For more discussion, see the cascading-user email forum. We also have a meetup started.