### Learning Objectives ###

- Identify the basic data structure of Spark, also known as a DataFrame.
- Use the collaborative Databricks workspace.
- Write SQL code that executes against a cluster of machines.
- Discuss the core concepts of distributed computing.
- Recognize when and where distributed computing is useful.

#### Data Lakes vs. Data Warehouses ####

Data lakes and data warehouses are two major trends in data science. The lakehouse architecture is the best way to store your data: it provides massive scalability and performance, offers transactional support, maintains an open standard, and supports diverse data formats and workloads. All of this sets the foundation for successful data projects, because the data is well organized, optimized, and open for all of your downstream tasks.

There's great value in advanced analytics like artificial intelligence, but the majority of data projects fail. Why do they fail? Often it has to do with the data. In the early days, working with data meant building your own data center and then buying proprietary software to store your data. That meant databases, but also data warehouses. While databases are used for online transactions, data warehouses are common for all sorts of business intelligence (BI) applications, like nightly reports on various business outcomes. There are clear downsides to data warehouses, though, as you can see below.

![Screenshot from 2023-04-24 14-45-20.png](attachment:2be755ac-6b2e-494d-9347-286bbdd83829.png)


Data warehouses were really created for your own custom data center. Then the cloud came along, and anybody could rent resources in a flexible way from companies like Amazon and Microsoft. This means, among other things, scalable and cheap storage. That enabled data lakes, where you can land data in any format you need. They support machine learning workloads because they're so flexible, and they integrate better with open formats like Parquet and Delta. But there are downsides too, as shown below.

![Screenshot from 2023-04-24 14-52-28.png](attachment:e96e1ad3-a1b4-4d1f-90c3-5278d9151ce8.png)

We can talk about more than just data warehousing and data science workloads. On the bottom left-hand side of the diagram, we have a normal pipeline for working with data warehouses. You start with structured data, load it into your data warehouse, and then create data marts to serve different business use cases. Meanwhile, your data engineers are working on their ETL pipelines: they take some raw data, transform it, and load it into some target databases. If you wanted to do streaming, you'd need a different technology stack to land your data and make it available. Finally, for data science, you would need separate data lakes to enable those workloads.

Now, try to stitch all of this together. Each of these data stacks has its own technologies, and integrating them all is no easy task, to put it lightly. Each comes with its own protocols and limitations, and keeping everything up to date is truly a nightmare. In brief, most enterprises struggle with data because these various data systems are siloed from each other. The lakehouse paradigm solves many of these problems. Lakehouses join the robustness and guarantees of data warehouses with the flexibility, scalability, and cheap storage of data lakes.

![Screenshot from 2023-04-24 14-56-05.png](attachment:0a32bfcc-6994-4f8c-bfee-5336aed01bb0.png)

#### Lakehouse ####

A lakehouse combines the best of both worlds of data warehouses and data lakes: there is no need to manage multiple systems or put up with stale, inconsistent, redundant data. Using a lakehouse, you can have a single source of truth for your data, allowing you to move fast without breaking things. Lakehouses work with any sort of data: structured, semi-structured, and unstructured. We can land that data in our data lake with appropriate metadata management, caching, and indexing. This makes our data lake reliable enough to build many different applications, whether you're doing business intelligence, machine learning, or anything else. What's the big difference between this and other approaches? Let's look at some common data engineering problems, as shown below.

![Screenshot from 2023-04-24 15-26-46.png](attachment:d0261926-38e9-4795-87f0-8b5c90276268.png)

The lakehouse is unique in a number of ways. First, it's a simple way to manage data: the data only needs to exist once to support all of your workloads, and it's not siloed based on the type of workload you're performing. It's also open, meaning that it's based on open source software and open standards, so it's easy to work with without relying on expensive proprietary formats. Finally, it's collaborative, meaning engineers, analysts, and data scientists can work together easily to serve a number of different workloads.

The backbone of the lakehouse is Delta Lake, software originally developed at Databricks and later open-sourced and donated to the Linux Foundation. That means anybody can download the source code and use this framework to manage their data applications more efficiently. At a high level, Delta Lake enhances reliability by allowing database-style (ACID) transactions against your data; more on this in the Delta Lake section below. It also increases performance with indexing, partitioning, and a number of related optimizations. There's improved governance with table ACLs; an ACL (access control list) is a common way of handling permissions. Finally, you have better trust in your data through schema enforcement and other expectations. This is the lakehouse: simple, open, and collaborative.
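To make schema enforcement concrete, here is a minimal pure-Python sketch of the idea: a write is checked against the table's declared schema and rejected as a whole if any row does not match. This is not how Delta Lake implements it; the `EVENTS_SCHEMA` table, the `SchemaMismatchError` class, and the `validate_record` and `append_rows` helpers are hypothetical names made up for illustration.

```python
from typing import Any, Dict, List

# Hypothetical schema for an illustrative "events" table:
# column name -> expected Python type.
EVENTS_SCHEMA: Dict[str, type] = {"user_id": int, "event": str, "amount": float}


class SchemaMismatchError(ValueError):
    """Raised when a record does not match the table schema."""


def validate_record(record: Dict[str, Any], schema: Dict[str, type]) -> None:
    """Reject records with missing columns, extra columns, or wrong types."""
    missing = schema.keys() - record.keys()
    extra = record.keys() - schema.keys()
    if missing or extra:
        raise SchemaMismatchError(f"missing={sorted(missing)}, extra={sorted(extra)}")
    for column, expected_type in schema.items():
        if not isinstance(record[column], expected_type):
            raise SchemaMismatchError(
                f"column {column!r} expected {expected_type.__name__}, "
                f"got {type(record[column]).__name__}"
            )


def append_rows(table: List[Dict[str, Any]], rows: List[Dict[str, Any]],
                schema: Dict[str, type]) -> None:
    """Validate every row first, then append, so a bad batch changes nothing."""
    for row in rows:
        validate_record(row, schema)
    table.extend(rows)


events: List[Dict[str, Any]] = []
append_rows(events, [{"user_id": 1, "event": "purchase", "amount": 9.99}], EVENTS_SCHEMA)

try:
    # "amount" is a string here, so the whole batch is rejected.
    append_rows(events, [{"user_id": 2, "event": "refund", "amount": "oops"}], EVENTS_SCHEMA)
except SchemaMismatchError as err:
    print(f"rejected: {err}")

print(len(events))  # only the valid row was stored
```

Validating the whole batch before appending anything is a deliberate choice in this sketch: a bad write leaves the table exactly as it was, which is the kind of trust in your data that schema enforcement is meant to provide.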

![Screenshot from 2023-04-24 15-31-43.png](attachment:d5be8d6a-b150-4c67-bbe4-d8377cac3e22.png)

#### Delta Lake ####

To recap, lakehouses add reliability, quality, and performance to data lakes. The lakehouse is backed by Delta Lake, an open source technology developed initially at Databricks before being donated to the Linux Foundation. It's built on top of Apache Parquet, the scalable file format you saw in earlier lessons. Parquet is great, but its features won't quite get us to the lakehouse vision. We need a new tool, which we find in Delta. Delta adds a transaction log on top of Parquet files, which means that we can update and delete rows in those files. The specific term for this is ACID transactions. ACID is an acronym:

- A stands for atomicity: the guarantee that if you're adding data to one table and subtracting it from another, you can make that a single transaction where both steps either succeed or fail together.
- C stands for consistency: the database is always in a valid state. If I start writing to the table while somebody else is reading from it, the reader still sees a consistent view.
- I stands for isolation: we can run concurrent queries against our data without them interfering with one another.
- D stands for durability: if the lights go out, we won't lose our data. This also means that if we take down our Spark cluster, the data will persist.

In summary, you get data versioning, reliable and fault-tolerant transactions, and a fast query engine, all while maintaining open standards. This ensures your teams can access timely, reliable, high-quality data.
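The idea of a transaction log on top of immutable files can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Delta Lake's actual implementation or file layout: in-memory lists stand in for Parquet files, and the `ToyDeltaTable` class and its methods are hypothetical names. Data files are never modified, each commit appends one entry of add/remove actions to the log, and any version of the table can be rebuilt by replaying the log.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]
Action = Tuple[str, str]  # ("add" or "remove", file_name)


class ToyDeltaTable:
    """Illustrative only: immutable data files plus an append-only commit log."""

    def __init__(self) -> None:
        self._files: Dict[str, List[Row]] = {}  # stand-in for Parquet files
        self._log: List[List[Action]] = []      # one entry per committed version
        self._next_file_id = 0

    def _write_file(self, rows: List[Row]) -> str:
        """Write a new immutable file and return its name."""
        name = f"part-{self._next_file_id:05d}"
        self._next_file_id += 1
        self._files[name] = list(rows)
        return name

    def _commit(self, actions: List[Action]) -> int:
        # Appending a single log entry is the atomic step in this toy model:
        # readers see either all of a commit's actions or none of them.
        self._log.append(actions)
        return len(self._log) - 1

    def _live_files(self, version: Optional[int] = None) -> List[str]:
        """Replay the log up to `version` to find the files that make up the table."""
        last = len(self._log) - 1 if version is None else version
        live: List[str] = []
        for actions in self._log[: last + 1]:
            for kind, name in actions:
                if kind == "add":
                    live.append(name)
                else:
                    live.remove(name)
        return live

    def append(self, rows: List[Row]) -> int:
        return self._commit([("add", self._write_file(rows))])

    def delete_where(self, predicate: Callable[[Row], bool]) -> int:
        """Rewrite only the files containing matching rows, in one commit."""
        actions: List[Action] = []
        for name in self._live_files():
            rows = self._files[name]
            kept = [row for row in rows if not predicate(row)]
            if len(kept) != len(rows):
                actions.append(("remove", name))
                if kept:
                    actions.append(("add", self._write_file(kept)))
        return self._commit(actions)

    def read(self, version: Optional[int] = None) -> List[Row]:
        return [row for name in self._live_files(version) for row in self._files[name]]


table = ToyDeltaTable()
v0 = table.append([{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}])
v1 = table.append([{"id": 3, "name": "cy"}])
v2 = table.delete_where(lambda row: row["id"] == 2)

print(table.read())            # latest version: the row with id 2 is gone
print(table.read(version=v1))  # the earlier version is still readable
```

Because old files are kept and only the log changes, the sketch gets data versioning for free: reading `version=v1` replays the log up to that commit and still returns all three rows.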

![Screenshot from 2023-04-24 15-49-29.png](attachment:5bc3caba-0688-452d-ad51-a39ffd4c8918.png)
![Screenshot from 2023-04-24 15-50-22.png](attachment:3bdbfb7c-db47-4d3e-84a0-2b2b953d922e.png)

With ACID transactions, you can always delete parts of your data if need be. You couldn't really do this with Parquet, CSV, or other file types, because you'd have to read all of the data back in and write it all back out without the particular rows you wanted to delete.

You can also easily unify streaming and batch workloads. You can run a nightly batch process to propagate data through this architecture, or you can set up a streaming job between each of these tables. This is the so-called medallion architecture, in which data moves from bronze to silver to gold tables.

You can follow standard practices for retention and corrections, including inserting, updating, merging, and so on. This is really important when it comes to GDPR and other data protection rules that address privacy and data ownership. In summary, Delta allows you to do advanced database-like operations in a data lake, so that you can have scalable, reliable, and optimized queries.
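As a rough illustration of the bronze to silver to gold flow, the sketch below uses plain Python lists in place of Delta tables. It assumes the common reading of the three layers (bronze holds raw records as they landed, silver holds cleaned and de-duplicated records, gold holds business-level aggregates); the sample orders and the `to_silver` and `to_gold` helpers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Bronze: raw records landed as-is (made-up sample data, including
# one duplicate and one malformed row).
bronze: List[Dict[str, str]] = [
    {"order_id": "1", "customer": "ada", "amount": "10.50"},
    {"order_id": "2", "customer": "bob", "amount": "4.00"},
    {"order_id": "2", "customer": "bob", "amount": "4.00"},  # duplicate
    {"order_id": "3", "customer": "ada", "amount": "n/a"},   # malformed amount
    {"order_id": "4", "customer": "ada", "amount": "2.25"},
]


def to_silver(raw_rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Cast types, drop malformed rows, and de-duplicate on order_id."""
    seen = set()
    silver_rows: List[Dict[str, object]] = []
    for row in raw_rows:
        try:
            order_id = int(row["order_id"])
            customer = row["customer"]
            amount = float(row["amount"])
        except (KeyError, ValueError):
            continue  # skip rows that cannot be parsed
        if order_id in seen:
            continue  # skip duplicates
        seen.add(order_id)
        silver_rows.append({"order_id": order_id, "customer": customer, "amount": amount})
    return silver_rows


def to_gold(silver_rows: List[Dict[str, object]]) -> Dict[str, float]:
    """Aggregate cleaned orders into total spend per customer."""
    totals: Dict[str, float] = defaultdict(float)
    for row in silver_rows:
        totals[str(row["customer"])] += float(row["amount"])
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

In a real pipeline, each step could run either as a nightly batch job or as a streaming job between tables; here both steps are ordinary function calls so the flow is easy to follow.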

Below is the link to the Databricks notebook that I was given to practice with data lakes. You can try it too.

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6302813296822450/1157944569646621/4155951448646284/latest.html