# Big Data Architectures 

Let's take an example: a company wants to start building a **Big Data architecture**.

**What can be a reason behind this shift?**

Companies nowadays collect a tremendous amount of data on their customers. From purchase history to social media commentary, customer insights may be collected across multiple touchpoints. In addition, contact center metrics such as average handling time and first contact resolution provide data on how the customer experience is affected by service practices. The task here is to take the company's website logs and predict how customers consume its services. For example, we want to answer questions such as: how much time does a customer spend on the site on average, which products get the most hits, and where do customers' searches end?

**What tasks do you think are involved in building an architecture for handling this Big Data problem?**

The best way to propose a solution to a Big Data problem is to divide it into layers of operations. This course describes how to design a Big Data pipeline, which consists of the following layers:

1. Data Ingestion Layer
2. Data Collector Layer
3. Data Processing Layer
4. Data Storage Layer
5. Data Query Layer
6. Data Visualization Layer

Let's take a quick look at all the layers so that we can start building a pipeline of our own. A detailed explanation of each layer is provided in its respective section.

![Big Data layered architecture](images/layers.png "Big Data layered architecture")

**1. Data Ingestion Layer**

Big Data ingestion involves connecting to various data sources, extracting the data, and detecting changed data. It is about moving data, especially unstructured data, from where it originates into a system where it can be stored and analyzed.

**2. Data Collector Layer**

The Data Collector layer comes into play when multiple sources are ingesting data into your pipeline. Before starting any analysis, it is necessary to bring all data sources together and convert them to a format that the rest of the pipeline can consume easily, without making the architecture complex. This is only possible when all ingested data is finally collected in one place. Essentially, the focus is on transporting data from the ingestion layer to the rest of the data pipeline.
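
As a rough illustration of the collector idea, the sketch below merges records from two hypothetical sources (a web server log line and a JSON event from a mobile app) into one common format. The field names, the log layout, and the helper names are assumptions made for this example and do not come from any particular tool.

```python
import json

def from_web_log(line):
    """Parse a hypothetical 'timestamp customer_id url' log line."""
    timestamp, customer_id, url = line.split()
    return {"ts": timestamp, "customer": customer_id, "page": url, "source": "web"}

def from_app_event(payload):
    """Parse a hypothetical JSON event sent by a mobile app."""
    event = json.loads(payload)
    return {"ts": event["time"], "customer": event["user"],
            "page": event["screen"], "source": "app"}

def collect(web_lines, app_payloads):
    """Bring both sources into one list of records with identical keys, ordered by time."""
    records = [from_web_log(line) for line in web_lines]
    records += [from_app_event(payload) for payload in app_payloads]
    return sorted(records, key=lambda record: record["ts"])

web_lines = ["2024-01-01T10:00:05 c1 /home", "2024-01-01T10:02:00 c2 /cart"]
app_payloads = ['{"time": "2024-01-01T10:01:00", "user": "c1", "screen": "/search"}']

unified = collect(web_lines, app_payloads)
for record in unified:
    print(record)
```

Once every record has the same keys, the later layers only need to understand one format, regardless of how many sources feed the pipeline.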

**3. Data Processing Layer**

This is the layer where a Data Scientist can work with the data as much as desired. With this in mind, the course covers this layer from a basic to an advanced level. At this point we have all the data in the required format, so the next step is to design parallel processing over the data and produce output that serves the purpose of our analysis. The workflow is very simple: Input --> Process --> Output.
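
The Input --> Process --> Output workflow can be sketched on a single machine, leaving parallelism aside for now. The visit records and the `process` function below are hypothetical and only illustrate the kind of questions raised earlier (average time spent, most-hit products).

```python
from collections import Counter

# Input: hypothetical, already-collected visit records (customer, product, seconds on page).
visits = [
    ("c1", "laptop", 120),
    ("c2", "phone", 60),
    ("c1", "phone", 30),
    ("c3", "laptop", 90),
]

def process(records):
    """Process: compute the average time per visit and the hit count per product."""
    total_seconds = sum(seconds for _, _, seconds in records)
    average_seconds = total_seconds / len(records)
    product_hits = Counter(product for _, product, _ in records)
    return average_seconds, product_hits

# Output: values that the later layers can store, query, or visualize.
average_seconds, product_hits = process(visits)
print(average_seconds)   # 75.0
print(dict(product_hits))
```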

**4. Data Storage Layer**

Storage becomes a challenge when the size of the data you are dealing with becomes large, and several solutions exist to address this problem. This layer focuses on **"where to store such large data in an efficient, scalable, easily accessible and secure manner"**.

**5. Data Query Layer**

This is the layer where heavy analytic processing takes place. Data analytics is an
essential step that addresses the inefficiencies of traditional data platforms in handling
large amounts of data for interactive queries, ETL, storage and processing.
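
To give a feel for an interactive query, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for a real Big Data query engine. The `page_hits` table and its contents are made up for illustration.

```python
import sqlite3

# An in-memory database stands in for the storage layer in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_hits (customer TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO page_hits VALUES (?, ?)",
    [("c1", "laptop"), ("c2", "phone"), ("c3", "laptop"), ("c1", "tablet")],
)

# Interactive question: which two products are hit most often?
top_products = conn.execute(
    "SELECT product, COUNT(*) AS hits FROM page_hits "
    "GROUP BY product ORDER BY hits DESC, product LIMIT 2"
).fetchall()
conn.close()

print(top_products)  # [('laptop', 2), ('phone', 1)]
```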

**6. Data Visualization Layer**

This layer focuses on Big Data visualization. We need something that will grab
people’s attention, pull them in, and make our findings well understood. This is
where the value of the data is perceived by the user.

• **Dashboards** - Save, share, and communicate insights. They help users generate
questions by revealing the depth, range, and content of their data stores.

• **Recommenders** - Recommender systems focus on the task of information
filtering, which deals with the delivery of items selected from a large collection
that the user is likely to find interesting or useful.





# Programming Models for Big Data

## Why do we need a Programming model?



Given the large volume of data, applications that work on big data need to distribute the data across a cluster of processors, and processing has to be carried out in parallel for the computation to complete in a reasonable amount of time.

## What is the challenge for developers?


Distributed applications require a developer to orchestrate concurrent computation and communication across machines, in a manner that is robust to delays and failures. This adds the overhead of maintaining the infrastructure, so a considerable amount of time goes into deploying the application and not just developing it.

## What is the solution?

- A programmer is provided with high-level primitives (an API) to express computation tasks.
- The model automatically parallelizes the tasks and executes them on a cluster of shared-nothing commodity compute nodes.
- The programmer can focus on solving the problem rather than on the mundane tasks of parallelization.
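
The idea of a high-level primitive can be sketched as follows. The `parallel_map` helper is hypothetical, and it uses threads on one machine as a stand-in for the nodes of a cluster; a real framework would also handle data distribution, scheduling, and failures.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(func, partitions, workers=4):
    """Hypothetical high-level primitive: apply func to every partition in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input partitions.
        return list(pool.map(func, partitions))

# The developer only writes the per-partition logic...
def count_words(lines):
    return sum(len(line.split()) for line in lines)

# ...and hands the partitions to the primitive, which takes care of the parallelism.
partitions = [["big data pipeline"], ["map reduce", "functional programming"], ["spark"]]
partial_counts = parallel_map(count_words, partitions)
total_words = sum(partial_counts)
print(partial_counts, total_words)  # [3, 4, 1] 8
```

Note that `count_words` contains no threading or communication code at all; that separation is the point of the programming model.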

### Define: Programming Model for Big Data

- A programming model is an abstraction of the existing machinery or infrastructure. It is a set of abstract runtime libraries and programming languages that form a model of computation.

- If the enabling infrastructure for big data analysis is distributed file systems, then the programming model for big data should enable the programmability of the operations within distributed file systems.

- It should enable a developer to write computer programs that work efficiently on top of distributed file systems using big data.

- In big data programming, users focus on writing data-driven parallel programs that can be executed in large-scale, distributed environments.


Here we'll see three major programming models for writing big data applications:

**1. Distributed File System**

**2. MapReduce**

**3. Functional Programming**

Let's dig into the details.

### 1. Distributed File System

- In a distributed file system, data sets, or parts of a data set, can be replicated across multiple nodes of a cluster.

- Distributed file systems replicate the data between racks, and also across computers distributed over geographical regions.

- Since the data is already on these nodes, when parts of the data need to be analyzed in a data-parallel fashion, the computation can be moved to these nodes.

- Data replication makes the system more fault tolerant.

- That means that if some nodes or a rack go down, the same data can still be found and analyzed on other parts of the system.

- Data replication also helps with scaling access to this data for many users.

- HDFS (the Hadoop Distributed File System) has a master/slave architecture. An HDFS cluster consists of a **single NameNode**, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are **a number of DataNodes**, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
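
The fault-tolerance argument above can be made concrete with a toy model. This is a sketch, not HDFS's actual placement logic: the round-robin policy, the node names, and both helper functions are assumptions for illustration (real HDFS placement is rack-aware and is decided by the NameNode).

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Toy placement: put each block on `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes) so that the chosen nodes are distinct.
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + i) % len(nodes)] for i in range(replication)]
    return placement

def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return [block for block, holders in placement.items()
            if any(node not in failed_nodes for node in holders)]

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
placement = place_blocks(num_blocks=10, nodes=nodes)

# With three replicas per block, losing any two nodes leaves every block readable.
survivors = readable_blocks(placement, failed_nodes={"dn1", "dn2"})
print(len(survivors))  # 10
```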

#### A simple use case

- Suppose you have an enormous amount of data generated on a regular basis, so you need to store it somewhere. There are two ways to approach this problem.
    
    1. The first is the straightforward approach of storing it on a single node with a large storage capacity, also called **vertical scaling**.
    2. The second is to store it on a collection of nodes, also called **horizontal scaling**.
    
    **What if we want to store more data?** 
    
    
    1.  Buy more storage. But remember, nothing is infinite. For instance, Facebook was storing 300 petabytes of data in 2014. Can you imagine scaling up to that size with a single hard drive?
    
    2.  Here horizontal scaling and a distributed file system can help us together. Need to store more data? Let's take a cluster of 1,000 nodes. However, there is a good probability that each node will go out of service once every three years on average. Thus, with a cluster of 1,000 nodes you will get approximately one failure each day, which may lead to data loss. That is where the replication property of a distributed file system saves us from data loss. Each scaling approach has its own pros and cons. When accessing data, you usually get lower latency with vertical scaling. You get higher latency with horizontal scaling, but you can build a bigger storage solution.
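
The "one failure each day" estimate follows from simple arithmetic, shown here as a short sketch. It assumes, as the text does, that each node fails once every three years on average and that failures are spread evenly over time.

```python
nodes = 1000
years_between_failures = 3   # assumption from the text: one failure per node every three years
days_between_failures = years_between_failures * 365

# Expected failures per day = number of nodes * (failures per node per day)
failures_per_day = nodes / days_between_failures
print(round(failures_per_day, 2))  # 0.91
```

So a cluster of this size should expect a node failure almost every day, which is why replication is treated as a core feature rather than an optional extra.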
        
        
<img src="images/scaling.png">

***Fig: Vertical vs. Horizontal Scaling***



***Or we can think of this in a non-technical fashion as shown below:***


<img src="images/scaling.jpg">

### 2. MapReduce
- Enables writing data-centric parallel applications.
- MapReduce is inspired by the commonly used functions Map and Reduce, combined with the **divide-and-conquer parallel paradigm**.
- For a single MapReduce job, users implement two basic procedure objects, a **Mapper and a Reducer**, for the different processing stages, as shown in the figure below.
- The MapReduce program is then automatically interpreted by the execution engine and executed in parallel in a distributed environment.
- **Key-Value based**: In MapReduce, both input and output data are treated as key-value pairs of different types.

#### A simple use case

- Suppose you have a dataset of videos uploaded to YouTube and you want to find the top 5 categories with the maximum number of videos uploaded. How will you proceed?
 
 
- In the mapper, we extract the category from each video and emit a key-value pair with the video category as the key and ‘1’ as the value. These pairs are passed on to the reducer step.


- The reducer then aggregates the pairs in one place and sums the values to get the count for each video category. From these counts, the five categories with the highest totals can be selected.
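
The mapper and reducer steps described above can be simulated in plain Python on one machine. This is a minimal sketch: the video records are made up, and the `shuffle` function stands in for the grouping that a MapReduce framework performs between the two stages.

```python
from collections import defaultdict

def mapper(video):
    """Emit a (category, 1) pair for one video record."""
    yield (video["category"], 1)

def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(category, counts):
    """Sum the 1s emitted for a category."""
    return (category, sum(counts))

# Hypothetical input records.
videos = [
    {"id": "v1", "category": "Music"},
    {"id": "v2", "category": "Gaming"},
    {"id": "v3", "category": "Music"},
    {"id": "v4", "category": "News"},
    {"id": "v5", "category": "Music"},
    {"id": "v6", "category": "Gaming"},
]

mapped = [pair for video in videos for pair in mapper(video)]
reduced = [reducer(category, counts) for category, counts in shuffle(mapped).items()]

# Keep the five categories with the highest counts.
top_categories = sorted(reduced, key=lambda pair: pair[1], reverse=True)[:5]
print(top_categories)  # [('Music', 3), ('Gaming', 2), ('News', 1)]
```

In a real cluster, many mappers would run on different parts of the dataset at the same time, and the framework would route all pairs with the same key to the same reducer.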

**Another Map Reduce example**

<img src="images/MapReduce.jpg">

### 3. Functional Programming

- In functional programming, programming interfaces are specified as functions that are applied to input data sources.

- The computation is treated as the evaluation of functions.

- Functional programming is declarative and avoids sharing mutable state. Compared to object-oriented programming, it is more compact and intuitive for representing data-driven transformations and applications.

- We will see it applied widely when working with Apache Spark.
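
A small sketch of this style in plain Python is shown below. The `durations` list is hypothetical; the point is that each step is a function applied to the previous result and no data is modified in place. Operations with the same names (`map`, `filter`, `reduce`) also appear in Spark's RDD API.

```python
from functools import reduce

durations = [120, 45, 300, 15, 90]  # hypothetical visit durations in seconds

# Each step is a pure function applied to the previous result; nothing is mutated.
long_visits = filter(lambda seconds: seconds >= 60, durations)
in_minutes = map(lambda seconds: seconds / 60, long_visits)
total_minutes = reduce(lambda acc, minutes: acc + minutes, in_minutes, 0.0)

print(total_minutes)  # 8.5
```

Because the functions share no mutable state, the same transformations could be applied to separate partitions of the data independently, which is what makes this style a good fit for parallel execution.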


<img src = "images/funcProg.jpg">