# Polyglot Persistence with Blaze  

Our world is complex and no single approach exists that solves all problems. Likewise, in the data world one cannot solve all problems with one piece of technology.  

Nowadays, any big technology company uses (in one form or another) a MapReduce paradigm to sift through terabytes (or even petabytes) of data collected daily. On the other hand, it is much easier to store, retrieve, extend, and update information about products in a document-type database (such as MongoDB) than it is in a relational database. Yet, persisting transaction records in a relational database aids later data summarizing and reporting.  

Even these simple examples show that solving a vast array of business problems requires adapting to different technologies. This means that you, as a database manager, data scientist, or data engineer, would have to learn each of them separately if you wanted to solve every problem with the tool designed for it. This, however, does not make your company agile: it is error-prone and leaves your system needing a lot of tweaking and hacking.  

Blaze abstracts most of the technologies and exposes a simple and elegant data structure and API.  

In this chapter, you will learn:  
• How to install Blaze  
• What polyglot persistence is about  
• How to abstract data stored in files, pandas DataFrames, or NumPy arrays  
• How to work with archives (GZip)  
• How to connect to SQL (PostgreSQL and SQLite) and NoSQL (MongoDB) databases with Blaze  
• How to query, join, sort, and transform the data, and perform simple summary statistics  

All that is now left to do is to import Blaze itself in our notebook:

## Polyglot persistence

Neal Ford introduced the somewhat similar term polyglot programming in 2006.
He used it to illustrate the fact that there is no such thing as a one-size-fits-all
solution, and advocated using multiple programming languages, each better
suited to certain problems.  

In the parallel world of data, any business that wants to remain competitive needs to
adopt a range of technologies that allows it to solve its problems in minimal time,
thus minimizing costs.  

Storing transactional data in Hadoop files is possible, but makes little sense. On
the other hand, processing petabytes of Internet logs using a Relational Database
Management System (RDBMS) would also be ill-advised. These tools were
designed to tackle specific types of tasks; even though they can be co-opted to solve
other problems, the cost of adapting the tools to do so would be enormous. It is a
virtual equivalent of trying to fit a square peg in a round hole.  

For example, consider a company that sells musical instruments and accessories
online (and in a network of shops). At a high level, there are a number of problems
that the company needs to solve to be successful:  

1. Attract customers to its stores (both virtual and physical).  
2. Present them with relevant products (you would not try to sell a drum kit to a pianist, would you?!).  
3. Once they decide to buy, process the payment and organize shipping.  

To solve these problems, the company might choose from a number of available
technologies, each designed for a particular kind of task:  

1. Store all the products in a document-based database such as MongoDB,
DynamoDB, or DocumentDB (or in a wide-column store such as Cassandra).
Such databases offer multiple advantages: flexible schema, sharding
(breaking bigger databases into a set of smaller, more manageable ones),
high availability, and replication, among others.  

2. Model the recommendations using a graph-based technology (such as Neo4j,
TinkerPop/Gremlin, or GraphFrames for Spark): such tools reflect the
factual and abstract relationships between customers and their preferences.
Mining such a graph is invaluable and can produce a more tailored offering
for a customer.  

3. For searching, a company might use a search-tailored solution such as
Apache Solr or Elasticsearch. Such a solution provides fast, indexed text
searching capabilities.  

4. Once a product is sold, the transaction normally has a well-structured
schema (such as product name, price, and so on). Relational databases are
best suited to storing such data and later processing and reporting on it.  

With polyglot persistence, a company always chooses the right tool for the right job
instead of trying to coerce a single technology into solving all of its problems.  
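
The division of labor described above can be sketched with nothing but the Python standard library. This is only an illustration, not a production design: a plain dictionary of JSON strings stands in for a document database, an in-memory SQLite database plays the relational store, and every name and value (`product_store`, the `transactions` table, the prices) is made up for the example.

```python
import json
import sqlite3

# Document-style store: each product is a free-form JSON document,
# so different products can carry different fields.
# (A dict is only a stand-in for a real document database.)
product_store = {
    "p1": json.dumps({"name": "Drum kit", "pieces": 5, "tags": ["percussion"]}),
    "p2": json.dumps({"name": "Digital piano", "keys": 88}),
}

# Relational store: every transaction shares one fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (product_id TEXT, price REAL, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [("p1", 499.0, 1), ("p2", 799.0, 2), ("p1", 499.0, 3)],
)

# Reporting is a natural fit for SQL: total revenue per product.
revenue = dict(
    conn.execute(
        "SELECT product_id, SUM(price * quantity) "
        "FROM transactions GROUP BY product_id"
    ).fetchall()
)
conn.close()

# Application code has to stitch the two stores together by hand.
report = {
    json.loads(product_store[pid])["name"]: total
    for pid, total in revenue.items()
}
print(report)
```

Note that the final step talks to each store through a different interface (JSON parsing for one, SQL for the other). That duplication of effort is exactly what grows as more specialized stores are added.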

Blaze can abstract many different data structures and expose a single, easy-to-use
API. This helps to get consistent behavior and reduces the need to learn multiple
interfaces to handle data. If you know pandas, there is not really that much to learn,
as the differences in the syntax are subtle. We will go through some examples to
illustrate this.

In [4]:
import numpy as np

# A 2x3 NumPy array: two rows, three columns
simpleArray = np.array([
    [1, 2, 3],
    [4, 5, 6]
])

simpleArray


array([[1, 2, 3],
       [4, 5, 6]])

In [6]:
# Transpose the array, then take its first row (the first column of the original)
simpleArray.T[0]

array([1, 4])

In [7]:
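import pandas as pd

# The same data as a DataFrame with named columns
simpleDf = pd.DataFrame([
    [1, 2, 3],
    [4, 5, 6]
], columns=['a', 'b', 'c'])

# Columns are selected by label rather than by position
simpleDf['a']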

0    1
1    4
Name: a, dtype: int64
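
The two cells above fetch the same column through two different interfaces. As a rough illustration of what "a single API over many containers" means, the sketch below hides that difference behind one function. Note that `get_column` is a hypothetical helper written for this example with `functools.singledispatch`; it is not part of Blaze, NumPy, or pandas, and it says nothing about how Blaze is implemented internally.

```python
from functools import singledispatch

import numpy as np
import pandas as pd


@singledispatch
def get_column(data, key):
    """Hypothetical helper: return one column of a supported container as a list."""
    raise TypeError(f"unsupported container: {type(data).__name__}")


@get_column.register
def _(data: np.ndarray, key):
    # NumPy arrays have no column names, so `key` is a positional index.
    return data[:, key].tolist()


@get_column.register
def _(data: pd.DataFrame, key):
    # DataFrames are addressed by column label.
    return data[key].tolist()


simple_array = np.array([[1, 2, 3], [4, 5, 6]])
simple_df = pd.DataFrame(simple_array, columns=['a', 'b', 'c'])

# One call shape, two very different containers underneath.
first_from_array = get_column(simple_array, 0)
first_from_df = get_column(simple_df, 'a')
print(first_from_array, first_from_df)
```

The caller no longer needs to remember whether to transpose or to index by label; supporting another container would only mean registering one more implementation.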

## Summary
The concepts presented in this chapter are just the beginning of the road to using
Blaze. There are many other ways it can be used and data sources it can connect
with. Treat this as a starting point to build your understanding of polyglot
persistence.  

Note, however, that these days most of the concepts explained in this chapter can be
attained natively within Spark, as you can use SQLAlchemy directly within Spark,
making it easy to work with a variety of data sources. The advantage of doing so,
despite the initial investment of learning the SQLAlchemy API, is that the data
returned will be stored in a Spark DataFrame and you will have access to everything
that PySpark has to offer. This by no means implies that you should never use
Blaze: the choice, as always, is yours.  

In the next chapter, you will learn about streaming and how to do it with Spark.
Streaming has become an increasingly important topic these days, as, daily (true
as of 2016), the world produces roughly 2.5 exabytes of data
(source: http://www.northeastern.edu/levelblog/2016/05/13/how-much-data-produced-everyday/)
that need to be ingested, processed, and made sense of.