%%javascript
$.getScript('http://asimjalis.github.io/ipyn-ext/js/ipyn-present.js')

<h1 class="tocheading">Data Engineering in Review</h1>
<div id="toc"></div>

## Defining data systems

$$
\text{query} = \text{function}(\text{all data})
$$

* *Latency*—The time it takes to run a query. May be milliseconds to hours.
* *Timeliness*—How up-to-date the query results are. 
* *Accuracy*—How exact the query results are. In many cases, approximations may be necessary.

> ...mutability—and associated concepts like CRUD—are fundamentally not human-fault tolerant...  
> ...solution is to make your core data *immutable*...  
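
As a minimal sketch of these two ideas together, the snippet below treats the master dataset as an append-only tuple of immutable facts and a query as a pure function over all of it. The `AgeSample` record, `append_fact`, and `latest_age` are hypothetical names chosen for illustration, not part of any library.

```python
from typing import NamedTuple, Optional, Tuple


class AgeSample(NamedTuple):
    """One immutable fact: a person reported an age at a point in time."""
    person: str
    age: int
    timestamp: int  # e.g. seconds since the epoch (an assumption for this sketch)


def append_fact(all_data: Tuple[AgeSample, ...], fact: AgeSample) -> Tuple[AgeSample, ...]:
    """Return a new dataset with the fact added; existing facts are never changed."""
    return all_data + (fact,)


def latest_age(all_data: Tuple[AgeSample, ...], person: str) -> Optional[int]:
    """A query is just a function of all the data: here, the most recent age seen."""
    samples = [fact for fact in all_data if fact.person == person]
    if not samples:
        return None
    return max(samples, key=lambda fact: fact.timestamp).age


data: Tuple[AgeSample, ...] = ()
data = append_fact(data, AgeSample("alice", 30, 100))
data = append_fact(data, AgeSample("alice", 31, 200))
print(latest_age(data, "alice"))
```

Because facts are only ever appended, a bad write can be corrected by removing the offending facts and recomputing the query, rather than by trying to undo an in-place update.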


## Batch and serving layers

**Basic birthday-inference algorithm**

![Figure 18.1](images/18fig01_alt.jpg)

### Partial Recomputation
> 1. For the new batch of data, find all people who have a new age sample.
> 2. Retrieve all age samples from the master dataset for all people in step 1.
> 3. Recompute the birthdays for all people in step 1 using the age samples from step 2 and the age samples in the new batch.
> 4. Merge the newly computed birthdays into the existing serving layer views.
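
The four steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the algorithm from the figure: it assumes each age sample is a `(sample_date, age)` pair, that an age of `a` on date `d` means the birthday lies after `d` minus `a + 1` years and on or before `d` minus `a` years, and that the serving layer view is a dictionary from person to a `(earliest, latest)` birthday range. All function names are hypothetical.

```python
from datetime import date, timedelta


def years_before(d, years):
    """Shift a date back by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        return d.replace(year=d.year - years, day=28)


def infer_birthday_range(samples):
    """Intersect the birthday ranges implied by each (sample_date, age) pair."""
    earliest = max(years_before(d, age + 1) + timedelta(days=1) for d, age in samples)
    latest = min(years_before(d, age) for d, age in samples)
    return earliest, latest


def partial_recompute(master, new_batch, views):
    """master: {person: [(date, age)]}; new_batch: [(person, date, age)]; views: {person: range}."""
    # Step 1: find everyone with a new age sample.
    affected = {person for person, _, _ in new_batch}
    # Step 2: retrieve their existing samples from the master dataset.
    samples = {person: list(master.get(person, [])) for person in affected}
    for person, sample_date, age in new_batch:
        samples[person].append((sample_date, age))
    # Step 3: recompute birthdays for the affected people only.
    recomputed = {person: infer_birthday_range(s) for person, s in samples.items()}
    # Step 4: merge the new results into a copy of the existing views.
    merged = dict(views)
    merged.update(recomputed)
    return merged


master = {"alice": [(date(2015, 1, 1), 30)], "bob": [(date(2015, 1, 1), 40)]}
views = {person: infer_birthday_range(s) for person, s in master.items()}
new_batch = [("alice", date(2015, 6, 1), 31)]
print(partial_recompute(master, new_batch, views)["alice"])
```

Only Alice's range is recomputed here; Bob's entry in the view is carried over untouched, which is the point of partial recomputation.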

**Bloom join**

![Figure 18.2](images/18fig02.jpg)
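
The figure is not reproduced in text, so the following is only a sketch of the general idea under an assumption about what it shows: build a Bloom filter over the people in the new batch, use it to discard most master dataset records cheaply, and then run the exact join on the few records that survive. The `BloomFilter` class and `bloom_join` function are hypothetical and written for illustration; a production system would use a tuned library implementation.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_join(new_batch_people, master_records):
    """Return master records (person, ...) whose person appears in the new batch."""
    bloom = BloomFilter()
    for person in new_batch_people:
        bloom.add(person)
    # Cheap pre-filter: drops most non-matching records without a full join.
    candidates = [rec for rec in master_records if bloom.might_contain(rec[0])]
    # Exact join on the survivors removes any Bloom filter false positives.
    wanted = set(new_batch_people)
    return [rec for rec in candidates if rec[0] in wanted]


master_records = [("alice", 1, 30), ("bob", 2, 40), ("alice", 3, 31)]
print(bloom_join(["alice"], master_records))
```

In a distributed setting the benefit comes from shipping the compact filter to every node that scans the master dataset, so that only candidate records need to be shuffled for the real join.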

### Measuring and optimizing batch layer resource usage
> Consider these examples, which are based on real-world cases:
* After doubling the cluster size, the latency of a batch layer went down from 30 hours to 6 hours, an 80% improvement.
* An improper reconfiguration caused a Hadoop cluster to have a 10% higher task failure rate than before. This caused a batch workflow’s runtime to go from 8 hours to 72 hours, a 9x degradation in performance.


* $T$—The runtime of the workflow in hours.
* $O$—The overhead of the workflow in hours (_things like setting up processes, copying code, etc._)
* $H$—The amount of data being processed, measured in hours of incoming data (_it’s assumed that the rate of incoming data is fairly constant_).
* $P$—The dynamic processing time (_i.e., the number of hours each unit of $H$ adds to the runtime_).

$$T = O + P \times H $$
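
One way to connect this formula to the two examples above is to assume the workflow runs continuously, so each run processes the data that arrived during the previous run. Once the runtime stabilizes, $H = T$, which gives $T = O + P \times T$ and therefore $T = O / (1 - P)$ for $P < 1$. The sketch below rests on that assumption; the function names are illustrative only.

```python
def steady_state_runtime(overhead, p):
    """Stable runtime T = O / (1 - P), assuming H = T at steady state."""
    if not 0 <= p < 1:
        raise ValueError("P must be in [0, 1) for the runtime to stabilize")
    return overhead / (1 - p)


def runtime_ratio(p, factor):
    """New stable runtime divided by the old one when P is multiplied by `factor`."""
    return steady_state_runtime(1.0, factor * p) / steady_state_runtime(1.0, p)


def simulate(overhead, p, first_run_hours, runs):
    """Iterate T_next = O + P * T_prev to watch the runtime converge."""
    t = first_run_hours
    for _ in range(runs):
        t = overhead + p * t
    return t


p = 8 / 9
# Doubling the cluster roughly halves P.
print(round(runtime_ratio(p, 0.5), 3))
# A 10% higher failure rate means roughly 1 / 0.9 times as much work per unit of data.
print(round(runtime_ratio(p, 10 / 9), 3))
```

With $P = 8/9$, halving $P$ gives a ratio of about 0.2 (30 hours down to 6) and raising $P$ by a factor of $10/9$ gives a ratio of about 9 (8 hours up to 72), so both examples are consistent with a workflow whose runtime is dominated by dynamic processing time.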

<!--
#### Performance effect of doubling cluster size

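```python
import numpy as np
import matplotlib.pyplot as plt

p = np.arange(0, .92, 0.01)  # P must stay below 1 for the runtime to be stable
dt = 1 - p / (2 - p)  # Relative runtime (new / old) after doubling cluster size, i.e. halving P
_ = plt.plot(p, dt)
```
* If $P \times H \gg O$ (i.e. $P$ is close to 1) then $T$ will be very sensitive to cluster size
* If $P \times H \ll O$ (i.e. $P$ is close to 0) then $T$ will be largely unaffected by cluster size
```python
# Relative runtime (new / old) when a 10% higher task failure rate raises P by about 11%
stable = 1.11 * p < 1  # beyond this point the workflow never catches up
dt = (1 - p[stable]) / (1 - 1.11 * p[stable])
_ = plt.plot(p[stable], dt)
```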
-->

---
## Technologies We Did Not Cover

* [YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html), [Mesos](http://mesos.apache.org/) and other resource management systems
* [Ambari](https://ambari.apache.org/), [Cloudera Manager](http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-enterprise/cloudera-manager.html) and other cluster management systems
* [Oozie](http://oozie.apache.org/), [Azkaban](http://data.linkedin.com/opensource/azkaban), and other workflow scheduling tools
* [Chef](https://www.chef.io/chef/), [Puppet](https://puppetlabs.com/), [Ansible](http://www.ansible.com/) and other automated deployment systems
* [Docker](https://www.docker.com/) and other container and virtual machine environments
* [Redshift](https://aws.amazon.com/redshift/), [Greenplum](http://pivotal.io/big-data/pivotal-greenplum-database), [Teradata Aster](http://www.teradata.com/Teradata-Aster/overview/?LangType=1033&LangSelect=true) and other Massively Parallel Processing (MPP) Relational Database Management Systems
* [Impala](http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html), [Presto](https://prestodb.io/) and other distributed, in-memory, SQL query engines
* [Sqoop](https://sqoop.apache.org/) - for transferring data between relational databases and Hadoop
* [Drill](http://drill.apache.org/) and [Dremel](http://research.google.com/pubs/pub36632.html) - for providing a SQL-like interface to non-relational data
* [Cassandra](http://cassandra.apache.org/), [Riak](http://docs.basho.com/riak/latest/) and many other distributed NoSQL databases
* [RabbitMQ](https://www.rabbitmq.com/), [Flume](https://flume.apache.org/), [Logstash](https://www.elastic.co/products/logstash), [SQS](https://aws.amazon.com/sqs/) and other queueing, messaging, and logging systems
* [Storm](https://storm.apache.org/), [Flink](https://flink.apache.org/), [Kinesis](https://aws.amazon.com/kinesis/) and other stream processing systems
* [Disco](http://discoproject.org/), [Manta](https://www.joyent.com/object-storage), and other alternatives to Hadoop
* [Pregel](https://kowshik.github.io/JPregel/pregel_paper.pdf), [GraphX](http://spark.apache.org/graphx/), [Giraph](http://giraph.apache.org/), [Dryad](http://research.microsoft.com/en-us/projects/dryad/) and other graph processing systems 
* [H2O](http://h2o.ai/), [GraphLab](https://dato.com/home/), and other scalable machine learning platforms
* [PMML](http://dmg.org/pmml/v4-2-1/GeneralStructure.html) and other ways of deploying machine learning models
* [Solr](http://lucene.apache.org/solr/), [Elasticsearch](https://www.elastic.co/products/elasticsearch), [Indri](http://www.lemurproject.org/indri/) and other information retrieval systems
* [Pig](https://pig.apache.org/), an alternative to Hive for querying Hadoop
* [Kerberos](http://web.mit.edu/kerberos/) and other security measures
* [Cascading](http://www.cascading.org/), [Scalding](http://www.cascading.org/projects/scalding/) and so on for Hadoop application development
* [Celery](http://www.celeryproject.org/) and other distributed task queues
* _...and many others..._