# Exercise 1: HDFS Basics

## Learning Objectives
- Understand HDFS architecture (NameNode, DataNodes, blocks)
- Perform file operations: upload, list, download, delete
- Explore replication factor and block concepts
- Use the HDFS Web UI to visualize the filesystem

## Prerequisites
- Cluster is running (`./scripts/start-lab.sh`)
- Sanity checks passed

---

## Part 1: Exploring HDFS Commands

HDFS commands mirror familiar Linux filesystem commands, but each one is prefixed with `hdfs dfs -` (for example, `hdfs dfs -ls` instead of `ls`).

In [1]:
# List the root directory of HDFS
!hdfs dfs -ls /

Found 3 items
drwxrwxrwt   - root supergroup          0 2026-01-16 01:08 /spark-logs
drwxr-xr-x   - root supergroup          0 2026-01-16 01:08 /tmp
drwxr-xr-x   - root supergroup          0 2026-01-16 01:08 /user


In [2]:
# Create a directory in HDFS for our exercises
!hdfs dfs -mkdir -p /user/student/data

In [3]:
# Verify the directory was created
!hdfs dfs -ls /user/student

Found 1 items
drwxr-xr-x   - jovyan supergroup          0 2026-01-10 22:08 /user/student/data


## Part 2: Uploading Files to HDFS

Let's upload our sample data to HDFS.

In [4]:
# Check what files are available locally
!ls -la /home/jovyan/data/sales/

total 8832
drwxrwxrwx 1 root root    4096 Jan  8 23:58 .
drwxrwxrwx 1 root root    4096 Jan  8 23:58 ..
-rwxrwxrwx 1 root root 9040461 Jan  8 23:58 transactions.csv


In [6]:
# Upload the transactions file to HDFS
!hdfs dfs -put /home/jovyan/data/sales/transactions.csv /user/student/data/

In [7]:
# Upload the product catalog (CSV and JSON versions)
!hdfs dfs -put /home/jovyan/data/products/catalog.csv /user/student/data/
!hdfs dfs -put /home/jovyan/data/products/catalog.json /user/student/data/

In [8]:
# Verify uploads
!hdfs dfs -ls -h /user/student/data/

Found 3 items
-rw-r--r--   3 jovyan supergroup     34.8 K 2026-01-10 22:09 /user/student/data/catalog.csv
-rw-r--r--   3 jovyan supergroup     96.7 K 2026-01-10 22:09 /user/student/data/catalog.json
-rw-r--r--   3 jovyan supergroup      8.6 M 2026-01-10 22:09 /user/student/data/transactions.csv


## Part 3: Understanding Blocks and Replication

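HDFS splits files into fixed-size blocks (128 MB by default; 16 MB in our lab).
A file smaller than one block occupies a single block and uses only as much disk space as its actual size.
Each block is replicated across multiple DataNodes for fault tolerance.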
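
The block arithmetic described above can be sketched in a few lines of Python. This is an illustration only, not part of the lab scripts: the helper name `block_layout` is hypothetical, and the 100 MiB file in the second call is an invented example. The 9,040,461-byte size is the `transactions.csv` size reported by `ls -la` earlier.

```python
MIB = 1024 * 1024

def block_layout(file_size, block_size, replication):
    """Return (number of blocks, raw bytes stored across all replicas).

    Hypothetical helper for illustration; HDFS computes this internally.
    """
    # Ceiling division: a partially filled last block still counts as a block.
    num_blocks = -(-file_size // block_size)
    # The last block only stores the bytes it actually holds,
    # so raw usage is the file size times the replication factor.
    raw_bytes = file_size * replication
    return num_blocks, raw_bytes

# transactions.csv with the lab's 16 MiB block size and replication 3
print(block_layout(9040461, 16 * MIB, 3))

# An invented 100 MiB file with the same settings
print(block_layout(100 * MIB, 16 * MIB, 3))
```

Because `transactions.csv` is smaller than one 16 MB block, it should fit in a single block, which is what the `fsck` output in the next cell lets you confirm.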

In [9]:
# Check block information for our file
!hdfs fsck /user/student/data/transactions.csv -files -blocks -locations

Connecting to namenode via http://namenode:9870/fsck?ugi=jovyan&files=1&blocks=1&locations=1&path=%2Fuser%2Fstudent%2Fdata%2Ftransactions.csv
FSCK started by jovyan (auth:SIMPLE) from /172.28.0.13 for path /user/student/data/transactions.csv at Sat Jan 10 22:09:52 UTC 2026

/user/student/data/transactions.csv 9040461 bytes, replicated: replication=3, 1 block(s):  OK
0. BP-528258548-172.28.0.3-1768082688920:blk_1073741825_1001 len=9040461 Live_repl=3  [DatanodeInfoWithStorage[172.28.0.6:9866,DS-89344feb-57d4-48bb-afb5-852412929d1a,DISK], DatanodeInfoWithStorage[172.28.0.8:9866,DS-163cc9d4-5be8-48fe-93ca-6f3139fabc8b,DISK], DatanodeInfoWithStorage[172.28.0.4:9866,DS-40d953a5-006d-4de7-a723-c727328b5b2e,DISK]]


Status: HEALTHY
 Number of data-nodes:	3
 Number of racks:		1
 Total dirs:			0
 Total symlinks:		0

Replicated Blocks:
 Total size:	9040461 B
 Total files:	1
 Total blocks (validated):	1 (avg. block size 9040461 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:
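
The block line in the `fsck` output lists one `DatanodeInfoWithStorage[...]` entry per replica. As a rough sketch, assuming the line format shown above stays the same, the DataNode addresses can be pulled out with a regular expression. The function name `replica_locations` is hypothetical, and the sample line is copied from the output of the cell above.

```python
import re

# Matches the "ip:port" at the start of each DatanodeInfoWithStorage[...] entry.
DATANODE_RE = re.compile(r"DatanodeInfoWithStorage\[([\d.]+):(\d+),")

def replica_locations(fsck_block_line):
    """Return the DataNode IP addresses listed on one fsck block line."""
    return [ip for ip, _port in DATANODE_RE.findall(fsck_block_line)]

# Block line copied from the fsck output above
sample_line = (
    "0. BP-528258548-172.28.0.3-1768082688920:blk_1073741825_1001 "
    "len=9040461 Live_repl=3  "
    "[DatanodeInfoWithStorage[172.28.0.6:9866,DS-89344feb-57d4-48bb-afb5-852412929d1a,DISK], "
    "DatanodeInfoWithStorage[172.28.0.8:9866,DS-163cc9d4-5be8-48fe-93ca-6f3139fabc8b,DISK], "
    "DatanodeInfoWithStorage[172.28.0.4:9866,DS-40d953a5-006d-4de7-a723-c727328b5b2e,DISK]]"
)

print(replica_locations(sample_line))
```

The number of addresses returned should match the `Live_repl` value on the same line.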

In [11]:
# Check the current replication factor
!hdfs dfs -stat '%r' /user/student/data/transactions.csv

3


In [12]:
# Change replication factor to 2
!hdfs dfs -setrep 2 /user/student/data/transactions.csv

Replication 2 set: /user/student/data/transactions.csv


In [23]:
# Verify the new replication factor
!hdfs fsck /user/student/data/transactions.csv -files -blocks -locations

Connecting to namenode via http://namenode:9870/fsck?ugi=jovyan&files=1&blocks=1&locations=1&path=%2Fuser%2Fstudent%2Fdata%2Ftransactions.csv
FSCK started by jovyan (auth:SIMPLE) from /172.28.0.11 for path /user/student/data/transactions.csv at Sat Jan 10 21:14:24 UTC 2026

/user/student/data/transactions.csv 9040461 bytes, replicated: replication=2, 1 block(s):  OK
0. BP-1325781439-172.28.0.2-1767962107483:blk_1073741841_1017 len=9040461 Live_repl=2  [DatanodeInfoWithStorage[172.28.0.3:9866,DS-526900cf-c0fc-45b6-896c-bfb736c4af98,DISK], DatanodeInfoWithStorage[172.28.0.5:9866,DS-a4b88363-9556-4743-bfa2-bb6a13741051,DISK], DatanodeInfoWithStorage[172.28.0.4:9866,DS-6704a6a5-4cf5-48d4-8ee8-5a87e7c7a28c,DISK]]


Status: HEALTHY
 Number of data-nodes:	3
 Number of racks:		1
 Total dirs:			0
 Total symlinks:		0

Replicated Blocks:
 Total size:	9040461 B
 Total files:	1
 Total blocks (validated):	1 (avg. block size 9040461 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks

### 🔍 Checkpoint Question 1
Open the HDFS NameNode UI at http://localhost:9870 and navigate to:
**Utilities → Browse the file system**

Find the `transactions.csv` file.
- How many blocks does it have?
- What is the block size?
- On which DataNodes are the blocks stored?

**Your Answer:**

(Write your observations here)

## Part 4: Reading Files from HDFS

In [14]:
# Preview the first kilobyte of the file (the output may end mid-line)
!hdfs dfs -head /user/student/data/transactions.csv

transaction_id,transaction_date,transaction_time,customer_id,product_id,quantity,unit_price,total_amount,payment_method,store_region,is_online
TXN00000001,2024-06-03,11:56:54,CUST000481,PROD00352,1,259.06,259.06,PayPal,North,True
TXN00000002,2023-11-22,14:25:51,CUST001581,PROD00288,6,442.7,2656.2,PayPal,Central,False
TXN00000003,2023-08-13,13:40:52,CUST000534,PROD00205,1,116.0,116.0,Credit Card,Central,False
TXN00000004,2023-06-20,16:13:49,CUST000840,PROD00230,4,187.53,750.12,Credit Card,North,False
TXN00000005,2024-01-24,09:49:45,CUST001239,PROD00067,9,5.14,46.28,Cash,South,False
TXN00000006,2023-08-04,13:34:12,CUST001569,PROD00393,9,36.53,328.74,Debit Card,East,False
TXN00000007,2023-06-15,13:21:26,CUST001471,PROD00441,4,197.29,789.16,Debit Card,Central,True
TXN00000008,2024-06-23,09:32:25,CUST000028,PROD00482,1,477.45,477.45,PayPal,North,True
TXN00000009,2024-12-05,11:02:17,CUST000580,PROD00081,5,193.65,968.26,Credit Card,West,False
TXN00000010,2023-10-04,21:05:04,CUST000821,PROD003

In [15]:
# Count total lines (the count includes the header row)
!hdfs dfs -cat /user/student/data/transactions.csv | wc -l

100001


In [16]:
# Show disk usage: file size, then space consumed across all replicas
!hdfs dfs -du -h /user/student/data/

34.8 K  104.4 K  /user/student/data/catalog.csv
96.7 K  290.2 K  /user/student/data/catalog.json
8.6 M   17.2 M   /user/student/data/transactions.csv
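
In the `-du` output, dividing the second column by the first gives an estimate of each file's replication factor. The sketch below parses the three lines shown above; the helper names are hypothetical, and the parser assumes the `-h` format seen here (a number, an optional unit letter, and a path without spaces). Because the human-readable values are rounded, the result is only an estimate; `hdfs dfs -stat '%r'` reports the exact value.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_du_line(line):
    """Parse one `hdfs dfs -du -h` line into (path, size, consumed) in bytes."""
    tokens = line.split()
    path = tokens[-1]
    values = []
    i = 0
    while i < len(tokens) - 1:
        number = float(tokens[i])
        # A unit letter, if present, follows the number as its own token.
        has_unit = i + 1 < len(tokens) - 1 and tokens[i + 1] in UNITS
        values.append(number * (UNITS[tokens[i + 1]] if has_unit else 1))
        i += 2 if has_unit else 1
    size, consumed = values
    return path, size, consumed

def estimated_replication(line):
    """Estimate the replication factor as consumed space / file size."""
    _path, size, consumed = parse_du_line(line)
    return round(consumed / size)

# Lines copied from the -du output above
du_output = [
    "34.8 K  104.4 K  /user/student/data/catalog.csv",
    "96.7 K  290.2 K  /user/student/data/catalog.json",
    "8.6 M   17.2 M   /user/student/data/transactions.csv",
]

for line in du_output:
    print(line.split()[-1], estimated_replication(line))
```

The catalog files still use three replicas, while `transactions.csv` uses two because of the earlier `-setrep 2` command.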


## Part 5: Cluster Health Check

In [17]:
# Check DataNode status
!hdfs dfsadmin -report

Configured Capacity: 12001952649216 (10.92 TB)
Present Capacity: 109698896723 (102.17 GB)
DFS Remaining: 109680267264 (102.15 GB)
DFS Used: 18629459 (17.77 MB)
DFS Used%: 0.02%
Replicated Blocks:
	Under replicated blocks: 0
	Blocks with corrupt replicas: 0
	Missing blocks: 0
	Missing blocks (with replication factor 1): 0
	Low redundancy blocks with highest priority to recover: 0
	Pending deletion blocks: 0
Erasure Coded Block Groups: 
	Low redundancy block groups: 0
	Block groups with corrupt internal blocks: 0
	Missing block groups: 0
	Low redundancy blocks with highest priority to recover: 0
	Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (3):

Name: 172.28.0.4:9866 (datanode1.hadoop_hadoop-network)
Hostname: datanode1
Decommission Status : Normal
Configured Capacity: 4000650883072 (3.64 TB)
DFS Used: 9246853 (8.82 MB)
Non DFS Used: 3964081547131 (3.61 TB)
DFS Remaining: 36560089088 (34.05 GB)
DFS Used%: 0.00%
DFS Remaining%: 0.91%
Config