# Hands-On Exercise: Implementing a Data Lake Using Apache HDFS

**Objective**: By the end of this exercise, students will have a clear understanding of data lakes, their benefits, design patterns, and best practices. They will also set up a basic data lake using Apache HDFS on the previously created Hadoop cluster, and gain an introduction to the Delta Lake concept.

## Step 1: Understanding Data Lakes and Their Benefits

**What is a Data Lake?**
A data lake is a centralized repository that lets you store all your structured and unstructured data at any scale. You can store your data as-is, without having to structure it first, and run different types of analytics on it, from dashboards and visualizations to big data processing, real-time analytics, and machine learning.

**Benefits of a Data Lake:**
- Scalability: Handle massive volumes of structured, semi-structured, and unstructured data.
- Cost-Effective Storage: Store raw data in its native format on commodity storage, without designing or enforcing a schema up front.
- Data Variety: Allows for ingestion of a wide variety of data formats.
- Real-Time Analytics: Supports real-time or near-real-time analysis with tools like Apache Spark.

## Step 2: Data Lake Design Patterns and Best Practices

**Design Patterns:**
- Raw Layer (Landing Zone): Where data enters the lake. It is stored in its raw format (JSON, CSV, Parquet, etc.) without processing or transformations.
- Cleansed Layer (Curated Zone): Cleaned and transformed data ready for querying or analysis. Often stored in more structured formats.
- Enriched Layer (Consumption Zone): Data optimized for specific use cases, such as machine learning models or reporting.
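
The three layers above are often mapped to sibling directories in HDFS. The following is a minimal sketch of that idea, not a prescribed layout: the base directory `LAKE_ROOT`, the helper `zone_paths`, and the directory names `raw`, `cleansed`, and `enriched` are all assumptions chosen for illustration.

```shell
# Hypothetical base directory for the lake; adjust to your own cluster.
LAKE_ROOT="/user/datatech-labs/lake"

# Print one HDFS path per zone (zone names are illustrative assumptions).
zone_paths() {
  local root="$1"
  local zone
  for zone in raw cleansed enriched; do
    echo "${root}/${zone}"
  done
}

# Dry run: show the directories that would be created.
zone_paths "$LAKE_ROOT"
```

To actually create the directories on a running cluster, the printed paths can be passed to `hdfs dfs -mkdir -p`, for example with `zone_paths "$LAKE_ROOT" | xargs hdfs dfs -mkdir -p`.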


**Best Practices:**
- Partitioning: Divide large datasets into partitions (e.g., by date, region) for faster access.
- Metadata Management: Catalog datasets (location, schema, partitions) using tools like the Apache Hive Metastore or the AWS Glue Data Catalog so they are easy to find and query.
- Governance: Implement strict access controls, data lineage tracking, and logging to manage data effectively.
- Schema-on-Read: Apply structure only when reading data, giving flexibility to store unstructured data without schema constraints.
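
Partitioning is commonly expressed in HDFS as nested `key=value` directories (the Hive-style convention), so that query engines can skip directories that do not match a filter. The sketch below only builds such a path; the helper name `partition_path`, the partition keys `date` and `region`, and the `sales` dataset are hypothetical.

```shell
# Build a Hive-style partition directory: <base>/date=<day>/region=<region>.
partition_path() {
  local base="$1"
  local day="$2"
  local region="$3"
  printf '%s/date=%s/region=%s\n' "$base" "$day" "$region"
}

# Example: target directory for one day of data from one region (illustrative values).
target="$(partition_path /user/datatech-labs/lake/raw/sales 2024-01-15 eu)"
echo "$target"
```

On a running cluster, the resulting directory could then be created with `hdfs dfs -mkdir -p "$target"` before uploading that partition's files into it with `hdfs dfs -put`.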


## Step 3: Introduction to Delta Lake and Data Lakehouse Concept

**Delta Lake:**
Delta Lake is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to data lakes. It ensures data reliability during streaming and batch operations by providing capabilities like time travel, upserts, and deletes.

**Data Lakehouse:**
A data lakehouse combines the best features of data lakes (ability to store structured and unstructured data) and data warehouses (ability to run SQL queries efficiently) into a single architecture.

## Step 4: Hands-On Exercise: Setting up a Data Lake Using Apache HDFS

**Pre-requisites:**
- A Hadoop cluster running on your local machine (from the previous setup).
- HDFS up and running.

**Steps:**
- Open the NameNode web interface (http://localhost:9870) and navigate to the "Utilities" tab, where "Browse the file system" shows the HDFS directory tree.

- Check HDFS health

```shell
$ hdfs fsck /
```

- List help for the HDFS shell commands

```shell
$ hdfs dfs -help
```

- List files, sorted by modification time (`-t`) in reverse order (`-r`)

```shell
$ hdfs dfs -ls -t -r /user/datatech-labs/
```

- Make a new directory

```shell
$ hdfs dfs -mkdir /user/datatech-labs/first_data
```

- Remove an empty directory

```shell
$ hdfs dfs -rmdir /user/datatech-labs/first_data
```

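- Copy files from the local file system to HDFS

```shell
$ echo "hello from file 1" > ~/Desktop/file_1.txt

# Run only one of the two; the second fails with "File exists" if the first already ran.
$ hdfs dfs -put ~/Desktop/file* /user/datatech-labs/
# or
$ hdfs dfs -copyFromLocal ~/Desktop/file* /user/datatech-labs/
```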

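- Copy files from HDFS to the local file system

```shell
# The local target directory must exist before copying into it.
$ mkdir -p ~/Desktop/test_data
$ hdfs dfs -get /user/datatech-labs/file* ~/Desktop/test_data/
# or (run only one of the two)
$ hdfs dfs -copyToLocal /user/datatech-labs/file* ~/Desktop/test_data/
```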

- Copy files within HDFS

```shell
$ hdfs dfs -mkdir /user/datatech-labs/test_data/
$ hdfs dfs -cp /user/datatech-labs/file* /user/datatech-labs/test_data/
```

- Delete a directory and its contents

```shell
$ hdfs dfs -rm -r /user/datatech-labs/test_data
```

- Move files within HDFS

```shell
$ hdfs dfs -mkdir /user/datatech-labs/test_data/
$ hdfs dfs -mv /user/datatech-labs/file* /user/datatech-labs/test_data/
```

- View file content on HDFS (the file is in `test_data` after the move in the previous step)

```shell
$ hdfs dfs -cat /user/datatech-labs/test_data/file_1.txt
```

- Show block-level information about a specific file (again using its location after the move)

```shell
$ hdfs fsck /user/datatech-labs/test_data/file_1.txt -files -blocks -locations
```

- Show HDFS capacity and usage for the whole file system

```shell
$ hdfs dfs -df -h  # -h prints sizes in human-readable units
```

- Show storage used by each file and subdirectory

```shell
$ hdfs dfs -du -h /user/datatech-labs/
```

- Show total storage used by a directory

```shell
$ hdfs dfs -du -s -h /user/datatech-labs/  # -s prints a single summarized total
```
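
After the copy, move, and delete steps above, it can help to confirm that a path ended up where you expect. One possible way, sketched here, uses `hdfs dfs -test -e`, which exits with status 0 when the path exists. The helper names `hdfs_exists` and `report_path` and the `HDFS_BIN` override variable are hypothetical and are not part of Hadoop.

```shell
# Return success when the given HDFS path exists.
# HDFS_BIN is a hypothetical override so the helper can be tried without a cluster.
hdfs_exists() {
  "${HDFS_BIN:-hdfs}" dfs -test -e "$1"
}

# Print a one-line status for a path.
report_path() {
  if hdfs_exists "$1"; then
    echo "present: $1"
  else
    echo "missing: $1"
  fi
}

# Example usage on a running cluster (paths from the steps above):
#   report_path /user/datatech-labs/test_data/file_1.txt
#   report_path /user/datatech-labs/file_1.txt
```

With the commands in this exercise run in order, the first path should be reported as present and the second as missing, because the file was moved into `test_data`.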

--------------------------------------------