## Getting File Metadata

Let us see how to get metadata for files stored in HDFS using the `hdfs fsck` command.

* We have files copied under the HDFS location `/user/${USER}/retail_db`, and some sample large files under the HDFS location `/public/randomtextwriter`. We can inspect both with the `hdfs fsck` command.
* We will first see how to get the metadata of these files and then interpret it in subsequent topics.
* HDFS stands for Hadoop Distributed File System. Files are stored in a distributed fashion across the nodes of the cluster.
* Our cluster has master nodes and worker nodes. The files are physically stored on the worker nodes, where the data node process is running. We will cover this as part of the HDFS architecture.
* Here are the details of the worker nodes along with their private IPs.

|Private IP|Full DNS|Short DNS|
|---|---|---|
|172.16.1.102|wn01.itversity.com|wn01|
|172.16.1.103|wn02.itversity.com|wn02|
|172.16.1.104|wn03.itversity.com|wn03|
|172.16.1.107|wn04.itversity.com|wn04|
|172.16.1.108|wn05.itversity.com|wn05|
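
The table can also be kept as a small lookup structure, which is handy when translating the IP addresses reported by `hdfs fsck` into host names. This is a minimal sketch: the names `WORKER_NODES` and `resolve_worker` are illustrative, and the data is copied from the table above.

```python
# Worker nodes copied from the table above: private IP -> (full DNS, short DNS).
WORKER_NODES = {
    "172.16.1.102": ("wn01.itversity.com", "wn01"),
    "172.16.1.103": ("wn02.itversity.com", "wn02"),
    "172.16.1.104": ("wn03.itversity.com", "wn03"),
    "172.16.1.107": ("wn04.itversity.com", "wn04"),
    "172.16.1.108": ("wn05.itversity.com", "wn05"),
}


def resolve_worker(ip):
    """Return (full DNS, short DNS) for a private IP, or None if it is not in the table."""
    return WORKER_NODES.get(ip)


print(resolve_worker("172.16.1.104"))  # ('wn03.itversity.com', 'wn03')
print(resolve_worker("127.0.0.1"))     # None, a loopback address is not a listed worker
```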

In [1]:
%%sh

hdfs fsck -help

Usage: hdfs fsck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks | -replicaDetails | -upgradedomains]]]] [-includeSnapshots] [-showprogress] [-storagepolicies] [-maintenance] [-blockId <blk_Id>] [-replicate]
	<path>	start checking from this path
	-move	move corrupted files to /lost+found
	-delete	delete corrupted files
	-files	print out files being checked
	-openforwrite	print out files opened for write
	-includeSnapshots	include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it
	-list-corruptfileblocks	print out list of missing blocks and files they belong to
	-files -blocks	print out block report
	-files -blocks -locations	print out locations for every block
	-files -blocks -racks	print out network topology for data-node locations
	-files -blocks -replicaDetails	print out each replica details 
	-files -blocks -upgradedomains	print out upgrade domains for every b
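
The usage text nests the reporting options: `-blocks` appears inside `-files`, and `-locations` inside `-blocks`. The sketch below builds the argument list accordingly. The helper name `build_fsck_command` is hypothetical and the function only assembles the command; it does not run it.

```python
import shlex


def build_fsck_command(path, files=False, blocks=False, locations=False):
    """Build the argument list for `hdfs fsck`.

    Following the nesting in the usage text, asking for a deeper level of
    detail also turns on the options that enclose it.
    """
    cmd = ["hdfs", "fsck", path]
    if files or blocks or locations:
        cmd.append("-files")
    if blocks or locations:
        cmd.append("-blocks")
    if locations:
        cmd.append("-locations")
    return cmd


cmd = build_fsck_command("/user/itversity/retail_db", locations=True)
print(shlex.join(cmd))  # hdfs fsck /user/itversity/retail_db -files -blocks -locations
```

On a machine where the `hdfs` client is installed, the resulting list could be passed to `subprocess.run(cmd, capture_output=True, text=True)` to capture the report as text.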

* We can get a high-level overview of the `retail_db` folder by running `hdfs fsck` on its path.

In [1]:
%%sh

hdfs dfs -ls /user/`whoami`/retail_db
hdfs fsck /user/`whoami`/retail_db

Found 9 items
drwxr-xr-x   - itversity supergroup          0 2022-11-07 03:11 /user/itversity/retail_db/categories
-rw-r--r--   1 itversity supergroup   10303297 2022-11-07 03:11 /user/itversity/retail_db/create_db.sql
-rw-r--r--   1 itversity supergroup       1748 2022-11-07 03:11 /user/itversity/retail_db/create_db_tables_pg.sql
drwxr-xr-x   - itversity supergroup          0 2022-11-07 03:11 /user/itversity/retail_db/customers
drwxr-xr-x   - itversity supergroup          0 2022-11-07 03:11 /user/itversity/retail_db/departments
-rw-r--r--   1 itversity supergroup   10297372 2022-11-07 03:11 /user/itversity/retail_db/load_db_tables_pg.sql
drwxrwxr-x   - itversity supergroup          0 2022-11-07 03:11 /user/itversity/retail_db/order_items
dr-xr-xr-x   - itversity supergroup          0 2022-11-07 03:11 /user/itversity/retail_db/orders
drwxr-xr-x   - itversity supergroup          0 2022-11-07 03:11 /user/itversity/retail_db/products
FSCK started by itversity (auth:SIMPLE) from /127.0.0.1

Connecting to namenode via http://localhost:9870/fsck?ugi=itversity&path=%2Fuser%2Fitversity%2Fretail_db
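
The last line of the output shows the URL that the `fsck` client requests from the NameNode. Decoding its query string makes the request easier to read. This is a small sketch using only the Python standard library; the function name `fsck_request_params` is illustrative, and the URL is copied from the output above.

```python
from urllib.parse import parse_qs, urlsplit

# URL copied from the "Connecting to namenode via ..." line above.
URL = "http://localhost:9870/fsck?ugi=itversity&path=%2Fuser%2Fitversity%2Fretail_db"


def fsck_request_params(url):
    """Return (host, port, query parameters) for an fsck URL."""
    parts = urlsplit(url)
    # parse_qs decodes percent-encoding and returns a list per key; keep the first value.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.hostname, parts.port, query


host, port, params = fsck_request_params(URL)
print(host, port)      # localhost 9870
print(params["path"])  # /user/itversity/retail_db
```

In the runs that follow, the same URL carries extra parameters such as `files=1` and `blocks=1`, which appear to mirror the options passed on the command line.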


* We can get details about the individual files using the `-files` option.

In [5]:
%%sh

hdfs fsck /user/`whoami`/retail_db -files

FSCK started by itversity (auth:SIMPLE) from /127.0.0.1 for path /user/itversity/retail_db at Mon Nov 07 02:40:58 GMT 2022

/user/itversity/retail_db <dir>
/user/itversity/retail_db/categories <dir>
/user/itversity/retail_db/categories/part-00000 1029 bytes, replicated: replication=1, 1 block(s):  OK
/user/itversity/retail_db/create_db.sql 10303297 bytes, replicated: replication=1, 1 block(s):  OK
/user/itversity/retail_db/create_db_tables_pg.sql 1748 bytes, replicated: replication=1, 1 block(s):  OK
/user/itversity/retail_db/customers <dir>
/user/itversity/retail_db/customers/part-00000 953719 bytes, replicated: replication=1, 1 block(s):  OK
/user/itversity/retail_db/departments <dir>
/user/itversity/retail_db/departments/part-00000 60 bytes, replicated: replication=1, 1 block(s):  OK
/user/itversity/retail_db/load_db_tables_pg.sql 10297372 bytes, replicated: replication=1, 1 block(s):  OK
/user/itversity/retail_db/order_items <dir>
/user/itversity/retail_db/order_items/part-00000 54

Connecting to namenode via http://localhost:9870/fsck?ugi=itversity&files=1&path=%2Fuser%2Fitversity%2Fretail_db
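
Each file line in the `-files` report carries the path, size, replication factor, block count and status. The sketch below pulls these fields out with a regular expression. It is based only on the line shapes visible in this notebook, so other Hadoop versions or paths containing spaces may need a different pattern; `parse_fsck_files` is an illustrative name.

```python
import re

# Pattern derived from the file lines shown above. The "replicated: replication=N"
# part is optional because older output in this notebook omits it.
FILE_LINE = re.compile(
    r"^(?P<path>\S+) (?P<size>\d+) bytes, "
    r"(?:replicated: replication=(?P<replication>\d+), )?"
    r"(?P<blocks>\d+) block\(s\):\s+(?P<status>\S+)"
)

SAMPLE = """\
/user/itversity/retail_db <dir>
/user/itversity/retail_db/categories <dir>
/user/itversity/retail_db/categories/part-00000 1029 bytes, replicated: replication=1, 1 block(s):  OK
/user/itversity/retail_db/create_db.sql 10303297 bytes, replicated: replication=1, 1 block(s):  OK
"""


def parse_fsck_files(output):
    """Return one dict per file line; directory and banner lines are skipped."""
    files = []
    for line in output.splitlines():
        match = FILE_LINE.match(line)
        if match is None:
            continue
        replication = match.group("replication")
        files.append({
            "path": match.group("path"),
            "size": int(match.group("size")),
            "replication": int(replication) if replication else None,
            "blocks": int(match.group("blocks")),
            "status": match.group("status"),
        })
    return files


files = parse_fsck_files(SAMPLE)
print(len(files))                      # 2
print(sum(f["size"] for f in files))   # 10304326
```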


* Files in HDFS are physically stored on the worker nodes as blocks. We can get the details of the blocks associated with each file using the `-blocks` option.

In [6]:
%%sh

hdfs fsck /user/`whoami`/retail_db -files -blocks

FSCK started by itversity (auth:SIMPLE) from /127.0.0.1 for path /user/itversity/retail_db at Mon Nov 07 02:42:15 GMT 2022

/user/itversity/retail_db <dir>
/user/itversity/retail_db/categories <dir>
/user/itversity/retail_db/categories/part-00000 1029 bytes, replicated: replication=1, 1 block(s):  OK
0. BP-702067429-172.19.0.3-1667184293231:blk_1073742382_1558 len=1029 Live_repl=1

/user/itversity/retail_db/create_db.sql 10303297 bytes, replicated: replication=1, 1 block(s):  OK
0. BP-702067429-172.19.0.3-1667184293231:blk_1073742385_1561 len=10303297 Live_repl=1

/user/itversity/retail_db/create_db_tables_pg.sql 1748 bytes, replicated: replication=1, 1 block(s):  OK
0. BP-702067429-172.19.0.3-1667184293231:blk_1073742386_1562 len=1748 Live_repl=1

/user/itversity/retail_db/customers <dir>
/user/itversity/retail_db/customers/part-00000 953719 bytes, replicated: replication=1, 1 block(s):  OK
0. BP-702067429-172.19.0.3-1667184293231:blk_1073742381_1557 len=953719 Live_repl=1

/user/itve

Connecting to namenode via http://localhost:9870/fsck?ugi=itversity&files=1&blocks=1&path=%2Fuser%2Fitversity%2Fretail_db
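
Each block line lists an index, an identifier starting with `BP-` (the block pool, to my understanding), the block name, its length in bytes and the number of live replicas. Here is a hedged sketch of a parser for such a line. The pattern is inferred from the output in this notebook, where the replica count appears as either `Live_repl=` or `repl=`; `parse_block_line` is an illustrative name.

```python
import re

# Pattern inferred from the block lines shown in this notebook.
BLOCK_LINE = re.compile(
    r"^(?P<index>\d+)\. (?P<block_pool>BP-[^:\s]+):(?P<block>blk_\d+_\d+) "
    r"len=(?P<length>\d+) (?:Live_repl|repl)=(?P<replicas>\d+)"
)


def parse_block_line(line):
    """Return the fields of one block line, or None if the line is not a block line."""
    match = BLOCK_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "index": int(match.group("index")),
        "block_pool": match.group("block_pool"),
        "block": match.group("block"),
        "length": int(match.group("length")),
        "replicas": int(match.group("replicas")),
    }


line = "0. BP-702067429-172.19.0.3-1667184293231:blk_1073742382_1558 len=1029 Live_repl=1"
info = parse_block_line(line)
print(info["block"], info["length"], info["replicas"])  # blk_1073742382_1558 1029 1
```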


* `-blocks` only provides the names of the blocks. We also need to use `-locations` to get the details of the worker nodes where the blocks are physically stored.
* A block is stored as a physical file on a worker node. We will understand more about blocks in subsequent topics.
* To find out where a block is physically stored, look at the **DatanodeInfoWithStorage** part of the output. It contains the IP address of the worker node, and we can get the corresponding DNS name from the table above.

In [2]:
%%sh

hdfs fsck /user/`whoami`/retail_db -files -blocks -locations

FSCK started by itversity (auth:SIMPLE) from /127.0.0.1 for path /user/itversity/retail_db at Tue Nov 22 19:49:57 GMT 2022

/user/itversity/retail_db <dir>
/user/itversity/retail_db/categories <dir>
/user/itversity/retail_db/categories/part-00000 1029 bytes, replicated: replication=1, 1 block(s):  OK
0. BP-702067429-172.19.0.3-1667184293231:blk_1073742391_1567 len=1029 Live_repl=1  [DatanodeInfoWithStorage[127.0.0.1:9866,DS-c85e7a5b-db90-4b8b-ad8f-822ff99b3299,DISK]]

/user/itversity/retail_db/create_db.sql 10303297 bytes, replicated: replication=1, 1 block(s):  OK
0. BP-702067429-172.19.0.3-1667184293231:blk_1073742394_1570 len=10303297 Live_repl=1  [DatanodeInfoWithStorage[127.0.0.1:9866,DS-c85e7a5b-db90-4b8b-ad8f-822ff99b3299,DISK]]

/user/itversity/retail_db/create_db_tables_pg.sql 1748 bytes, replicated: replication=1, 1 block(s):  OK
0. BP-702067429-172.19.0.3-1667184293231:blk_1073742395_1571 len=1748 Live_repl=1  [DatanodeInfoWithStorage[127.0.0.1:9866,DS-c85e7a5b-db90-4b8b-ad8

Connecting to namenode via http://localhost:9870/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fuser%2Fitversity%2Fretail_db
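
The **DatanodeInfoWithStorage** entries can be extracted programmatically. The following sketch assumes each entry has the shape `ip:port,storage-id,storage-type` seen in the output above; the function name `datanode_locations` is illustrative.

```python
import re

# One entry per replica: DatanodeInfoWithStorage[ip:port,storage id,storage type]
DATANODE = re.compile(
    r"DatanodeInfoWithStorage\["
    r"(?P<ip>\d+(?:\.\d+){3}):(?P<port>\d+),"
    r"(?P<storage_id>DS-[0-9a-fA-F-]+),"
    r"(?P<storage_type>\w+)\]"
)


def datanode_locations(line):
    """Return a list of (ip, port, storage id, storage type), one per replica on the line."""
    return [
        (m.group("ip"), int(m.group("port")), m.group("storage_id"), m.group("storage_type"))
        for m in DATANODE.finditer(line)
    ]


line = (
    "0. BP-702067429-172.19.0.3-1667184293231:blk_1073742391_1567 len=1029 Live_repl=1  "
    "[DatanodeInfoWithStorage[127.0.0.1:9866,DS-c85e7a5b-db90-4b8b-ad8f-822ff99b3299,DISK]]"
)
print(datanode_locations(line))
# [('127.0.0.1', 9866, 'DS-c85e7a5b-db90-4b8b-ad8f-822ff99b3299', 'DISK')]
```

On a multi-node cluster, the extracted IP addresses can then be matched against the worker node table to find the host names.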


In [9]:
%%sh

hdfs dfs -ls -h /public/randomtextwriter/part-m-00000

ls: `/public/randomtextwriter/part-m-00000': No such file or directory


CalledProcessError: Command 'b'\nhdfs dfs -ls -h /public/randomtextwriter/part-m-00000\n'' returned non-zero exit status 1.

In [9]:
%%sh

hdfs fsck /public/randomtextwriter/part-m-00000 -files -blocks -locations

FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /public/randomtextwriter/part-m-00000 at Thu Jan 21 05:39:53 EST 2021
/public/randomtextwriter/part-m-00000 1102230331 bytes, 9 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1074171511_431441 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
1. BP-292116404-172.16.1.101-1479167821718:blk_1074171524_431454 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
2. BP-292116404-172.16.1.101-1479167821718:blk_1074171559_431489 len=1342177

Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fpublic%2Frandomtextwriter%2Fpart-m-00000
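
The block count in this report can be checked with simple arithmetic. The sketch below assumes a block size of 134217728 bytes (128 MiB), inferred from the `len=` value of the full blocks listed above. The size of the last block is computed here, not read from the truncated output.

```python
file_size = 1102230331   # bytes, from the fsck summary line
block_size = 134217728   # bytes (128 MiB), assumed from len= of the full blocks
replication = 3          # from repl=3 on each block line

full_blocks, remainder = divmod(file_size, block_size)
total_blocks = full_blocks + (1 if remainder else 0)

print(full_blocks)               # 8
print(remainder)                 # 28488507, the expected size of the last block
print(total_blocks)              # 9, matching "9 block(s)" in the report
print(total_blocks * replication)  # 27 block replicas across the worker nodes
print(file_size * replication)     # 3306690993 bytes of raw storage
```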
