## Big Data Ecosystem, Data Lakes and Spark

### Introduction to Big Data Ecosystem, Spark, and Data Lakes

In this lesson, you’ll learn about big data ecosystems. The modern big data ecosystem evolved from data processing on distributed architectures, which became necessary to handle the sheer volume of data.

As businesses began gathering and processing ever larger amounts of data, the field of data science arose around the need to thoughtfully ask questions of data and answer them using scientific methods of discovery. In addition, ever-increasing amounts of data along with greater processing power led to a surge in artificial intelligence research.

Data science and AI are immensely important to the modern business, but without a big data ecosystem to move, store, clean, merge, and tidy up data, these tools are not effective. This is why the work of a data engineer is so critical. It is you, the data engineer, who will skillfully use modern tools and techniques to create the big data ecosystem.

https://www.youtube.com/watch?v=8o7h5tYvEXE

![image.png](attachment:78e65fe2-20b6-494d-8a0f-b99a05abe12f.png)

## Lesson Outline

In this lesson, you’ll get an overview of the big data engineering ecosystem that includes distributed file systems and processing, Hadoop, MapReduce, Spark, data lakes, and lakehouse architecture.

https://www.youtube.com/watch?v=GT_KGuUpgL0

![image.png](attachment:6abaec1c-4eab-4f58-99e0-f1dbf626a7da.png)

## Insight: From Hadoop to Data Lakehouse

Hadoop and Spark enabled the evolution of the data warehouse to the data lake.

https://www.youtube.com/watch?v=KEWBNwYmmns

Data warehouses are based on specific and explicit data structures that allow for highly performant business intelligence and analytics, but they do not perform well with unstructured data.

Data lakes are capable of ingesting massive amounts of both structured and unstructured data with Hadoop and Spark providing processing on top of these datasets.

Data lakes have several shortcomings that grew out of their flexibility. They are unable to support transactions and perform poorly with changing datasets, and their unstructured nature makes data governance difficult.

Modern lakehouse architectures seek to combine the strengths of data warehouses and data lakes into a single, powerful architecture.

## The Hadoop Ecosystem

https://www.youtube.com/watch?v=OWkChITt21U

![image.png](attachment:c6988c24-efb0-4e8d-83a6-b65f803860f3.png)

https://www.youtube.com/watch?v=7ooOPYpgf8s

### Streaming Data

Data streaming is a specialized topic in big data. It applies when you want to store and analyze data in real time, such as Facebook posts or tweets.

Spark has a streaming library called Spark Streaming. Other popular streaming libraries include Storm and Flink. Streaming won't be covered in this course, but you can follow the links below to learn more about these technologies.

https://spark.apache.org/docs/latest/streaming-programming-guide.html

https://flink.apache.org/

http://storm.apache.org/

## MapReduce

https://www.youtube.com/watch?v=C8FiZB5eoLs

MapReduce is a programming technique for manipulating large data sets. "Hadoop MapReduce" is a specific implementation of this programming technique.

The technique works by first dividing up a large dataset and distributing the data across a cluster. In the map step, each record is analyzed and converted into a (key, value) pair. Then these key-value pairs are shuffled across the cluster so that all pairs with the same key end up on the same machine. In the reduce step, the values with the same key are combined.
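The map, shuffle, and reduce steps described above can be sketched in plain Python as a word count. This is a minimal single-machine illustration, not Hadoop MapReduce itself: the nested lists stand in for data distributed across a cluster, and the function names (`map_step`, `shuffle_step`, `reduce_step`, `word_count`) and the sample text are assumptions made up for this example.

```python
from collections import defaultdict


def map_step(line):
    """Map: convert one input record into a list of (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_step(pairs):
    """Shuffle: group values by key, as if routing each key to one machine."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(key, values):
    """Reduce: combine all values that share the same key."""
    return key, sum(values)


def word_count(partitions):
    """Run map, shuffle, and reduce over a list of partitions."""
    mapped = [
        pair
        for partition in partitions  # each partition mimics one node's data
        for line in partition
        for pair in map_step(line)
    ]
    grouped = shuffle_step(mapped)
    return dict(reduce_step(key, values) for key, values in grouped.items())


# Hypothetical sample data, split into two "partitions".
partitions = [
    ["spark makes big data simple", "big data needs big clusters"],
    ["hadoop stores big data"],
]

counts = word_count(partitions)
print(counts["big"], counts["data"])  # prints "4 3"
```

In a real cluster the map and reduce functions run in parallel on different machines, and the shuffle moves data over the network; here all three steps run sequentially in one process.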

While Spark doesn't implement MapReduce, you can write Spark programs that behave in a similar way to the map-reduce paradigm. In the next section, you will run through a code example.

![image.png](attachment:81c0ed5f-735a-49f0-9981-6ac4a33675db.png)

## Hadoop MapReduce Exercise

![image.png](attachment:c2e87e2a-5243-41fd-99de-0d9f0711e368.png)

## Why Spark?

https://www.youtube.com/watch?v=gHqwh5ARYps

Spark is currently one of the most popular tools for big data analytics. Hadoop is a slightly older technology, although it is still in use at some companies. Spark is generally faster than Hadoop MapReduce, largely because it keeps intermediate results in memory instead of writing them to disk between steps, which is why Spark has become more popular over the last few years.

There are many other big data tools and systems, each with its own use case. For example, there are database systems like Apache Cassandra and SQL query engines like Presto. But Spark is still one of the most popular tools for analyzing large data sets.

http://cassandra.apache.org/

https://prestodb.io/

## The Spark Cluster

https://www.youtube.com/watch?v=DpPD5hhvspg

![image.png](attachment:5eb6a672-ca59-45ee-bbe6-dac2a2d79202.png)

![image.png](attachment:751efcc3-74d6-453c-a3cd-f272fe1966a3.png)

Spark clusters in local mode and standalone mode

## Spark Use Cases

https://www.youtube.com/watch?v=2oq2uzU69T4

![image.png](attachment:8c02abb3-7b27-4be0-accd-17467c34772b.png)

https://spark.apache.org/sql/

https://spark.apache.org/mllib/

https://spark.apache.org/streaming/

https://spark.apache.org/graphx/

## Data Lakes

Data lakes are an evolution beyond data warehouses and allow an organization to ingest massive amounts of both structured and unstructured data into storage.

https://www.youtube.com/watch?v=zM1MsPgt8pE

![image.png](attachment:f3ade146-37b4-48e7-9b7d-c033fa34d226.png)

## Data Lakes, Lakehouse, and Spark

https://www.youtube.com/watch?v=ZT0ObCroNVQ

![image.png](attachment:fef791fc-ba98-4b13-8445-84a7d8ffadef.png)

## Apache Spark, Data Lakes, and Lakehouse
Apache Spark can be used to perform data engineering tasks for building both data lakes and lakehouse architectures.

Most often, these data engineering tasks consist of ETL or ELT tasks. As a reminder, E stands for Extract, T stands for Transform, and L stands for Load.

Apache Spark allows the data engineer to perform these tasks, along with raw data ingestion, in the language of their choice, with support for Python, R, SQL, Scala, and Java.
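To make the Extract, Transform, and Load steps concrete, here is a minimal sketch in plain Python using only the standard library. It is not Spark code: the sample CSV, the column names, and the `extract`, `transform`, and `load` functions are all hypothetical, and an in-memory buffer stands in for the storage a real pipeline would write to.

```python
import csv
import io
import json

# Hypothetical raw input; the second row is missing its duration.
RAW_CSV = """user_id,song,duration_sec
1,Song A,215
2,Song B,
1,Song C,187
"""


def extract(raw_text):
    """Extract: read raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop incomplete rows and convert types and units."""
    cleaned = []
    for row in rows:
        if not row["duration_sec"]:
            continue  # skip rows with a missing duration
        cleaned.append({
            "user_id": int(row["user_id"]),
            "song": row["song"],
            "duration_min": round(int(row["duration_sec"]) / 60, 2),
        })
    return cleaned


def load(rows, sink):
    """Load: write each row as one JSON line to the target sink."""
    for row in rows:
        sink.write(json.dumps(row) + "\n")
    return len(rows)


sink = io.StringIO()  # stands in for a file or object store
loaded = load(transform(extract(RAW_CSV)), sink)
print(loaded)  # prints 2
```

An ELT pipeline would reorder these calls: the raw rows are loaded into storage first and transformed later, once they are already in the lake.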

## Data Lakehouse

https://www.youtube.com/watch?v=dI-ogbLE_S4

![image.png](attachment:69908af3-9a1d-4a7f-9eb3-32e82f8b66c9.png)

![image.png](attachment:b0f34547-100e-4287-94b2-063309b37531.png)

## Lesson Summary

https://www.youtube.com/watch?v=oJ8cMU5iMA4

![image.png](attachment:0fee5bbe-7dcd-4f95-8112-21c9b4d00edd.png)

https://www.youtube.com/watch?v=4fALDooXjtQ