# Note: This notebook simulates a situation where one DataNode becomes unavailable for unknown reasons. Before running it, remember to manually kill one of the DataNode containers with a **docker kill** command!

## If the report still shows 2 live DataNodes, wait a moment and run it again. The NameNode needs some time to detect that one of the DataNodes is down.
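The waiting step can be automated with a small polling loop. This is only a sketch: `wait_for_live_datanodes`, its parameters, and the default timeout and interval are hypothetical and not part of the original notebook. The `get_live_count` callable is assumed to return the current number of live DataNodes, for example by parsing the `hdfs dfsadmin -report` output.

```python
import time


def wait_for_live_datanodes(get_live_count, expected, timeout=900.0,
                            interval=10.0, sleep=time.sleep):
    """Poll get_live_count() until it returns `expected`.

    Hypothetical helper: the timeout and interval defaults are arbitrary
    placeholders, not values taken from the cluster configuration.
    """
    waited = 0.0
    while True:
        live = get_live_count()
        if live == expected:
            return live
        if waited >= timeout:
            raise TimeoutError(
                f"still {live} live DataNodes after {waited:.0f}s, expected {expected}"
            )
        sleep(interval)
        waited += interval


# Demo with fake readings standing in for successive dfsadmin reports.
readings = iter([2, 2, 1])
print(wait_for_live_datanodes(lambda: next(readings), expected=1, interval=0))
```

Passing the reading function in as an argument keeps the loop independent of how the count is obtained, so it can be tried out without a running cluster.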

In [1]:
import requests
import pyarrow as pa
import pyarrow.fs
import io
import re

In [2]:
# Q8: how many live DataNodes are in the cluster?
!hdfs dfsadmin -fs hdfs://boss:9000 -report

Configured Capacity: 51642105856 (48.10 GB)
Present Capacity: 13594352735 (12.66 GB)
DFS Remaining: 12504010752 (11.65 GB)
DFS Used: 1090341983 (1.02 GB)
DFS Used%: 8.02%
Replicated Blocks:
	Under replicated blocks: 0
	Blocks with corrupt replicas: 0
	Missing blocks: 0
	Missing blocks (with replication factor 1): 0
	Low redundancy blocks with highest priority to recover: 0
	Pending deletion blocks: 0
Erasure Coded Block Groups: 
	Low redundancy block groups: 0
	Block groups with corrupt internal blocks: 0
	Missing block groups: 0
	Low redundancy blocks with highest priority to recover: 0
	Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (1):

Name: 172.19.0.5:9866 (project_hdfs-dn-1.project_hdfs_default)
Hostname: dcf2a6d93ee2
Decommission Status : Normal
Configured Capacity: 25821052928 (24.05 GB)
DFS Used: 568368848 (542.04 MB)
Non DFS Used: 18983915824 (17.68 GB)
DFS Remaining: 6251991040 (5.82 GB)
DFS Used%: 2.20%
DFS Remaining%: 24.21%
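The answer to Q8 is the number in the `Live datanodes (N):` line of the report. A minimal sketch for extracting it programmatically is shown below; `live_datanode_count` is a hypothetical helper, and `sample_report` is an excerpt of the output above rather than a live call to the cluster.

```python
import re


def live_datanode_count(report: str) -> int:
    """Return N from the 'Live datanodes (N):' line of a dfsadmin report."""
    match = re.search(r"^Live datanodes \((\d+)\):", report, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Live datanodes (N):' line found in report")
    return int(match.group(1))


# Excerpt of the report shown above, used here in place of a live cluster.
sample_report = """\
DFS Used%: 8.02%

-------------------------------------------------
Live datanodes (1):

Name: 172.19.0.5:9866 (project_hdfs-dn-1.project_hdfs_default)
Hostname: dcf2a6d93ee2
"""

print(live_datanode_count(sample_report))
```

In the notebook, the report text could be captured with `subprocess.run([...], capture_output=True, text=True).stdout` instead of the hard-coded excerpt.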

In [3]:
# Q9: how are the blocks of single.csv distributed across the DataNode containers?

# Fetch the block-location metadata for single.csv from the WebHDFS REST API
r = requests.get("http://boss:9870/webhdfs/v1/single.csv?op=GETFILEBLOCKLOCATIONS")
r.raise_for_status()
info = r.json()["BlockLocations"]["BlockLocation"]
answer = {}
for block in info:  # named `block` to avoid shadowing the built-in `dict`
    # Count the block under its first listed host; blocks with an empty
    # hosts list have no reachable replica and are counted as 'lost'.
    host = block["hosts"][0] if block["hosts"] else "lost"
    answer[host] = answer.get(host, 0) + 1

answer

{'1427c7fc7f7b': 150, 'dcf2a6d93ee2': 194}

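##### Lost Blocks: Blocks of single.csv whose DataNode is no longer available are returned with an empty hosts list and counted under 'lost'; they are no longer accessible. In the output above, 150 blocks are still attributed to 1427c7fc7f7b, which does not appear among the live DataNodes in the Q8 report.
##### Remaining Blocks: The surviving DataNode (dcf2a6d93ee2, the only live DataNode in the Q8 report) holds 194 blocks in the output above.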
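The per-host tally from Q9 can also be expressed with `collections.Counter`. The sketch below uses a made-up `sample` list in the same shape as the `BlockLocation` entries, so it runs without a cluster; `blocks_per_host` and the host names `dn-a` and `dn-b` are hypothetical.

```python
from collections import Counter


def blocks_per_host(block_locations):
    """Tally blocks by their first listed host; empty host lists count as 'lost'."""
    tally = Counter()
    for block in block_locations:
        hosts = block.get("hosts", [])
        tally[hosts[0] if hosts else "lost"] += 1
    return dict(tally)


# Made-up entries shaped like the WebHDFS BlockLocation records.
sample = [
    {"hosts": ["dn-a"], "offset": 0, "length": 4},
    {"hosts": [], "offset": 4, "length": 4},
    {"hosts": ["dn-b", "dn-a"], "offset": 8, "length": 4},
    {"hosts": ["dn-a"], "offset": 12, "length": 2},
]

print(blocks_per_host(sample))
```

As in the cell above, only the first host of each block is counted, so a block with several replicas is attributed to a single DataNode.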

In [4]:
hdfs = pa.fs.HadoopFileSystem("boss", 9000)

2024-12-25 04:22:53,950 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [None]:
# Q10: how many times does the text "Single Family" appear in the remaining blocks of single.csv?
r = requests.get("http://boss:9870/webhdfs/v1/single.csv?op=GETFILEBLOCKLOCATIONS")
info = r.json()["BlockLocations"]["BlockLocation"]
pattern = b"Single Family"
count = 0  # initialized once, so the total accumulates across all blocks
with hdfs.open_input_file("hdfs://boss:9000/single.csv") as f:
    for block in info:
        # A block is accessible only if its hosts list is non-empty;
        # blocks with an empty hosts list are lost and are skipped.
        if len(block["hosts"]) != 0:
            # Read exactly this block: block["length"] bytes starting at block["offset"].
            blk = f.read_at(block["length"], block["offset"])
            # Count on the raw bytes rather than on str(blk), which is the bytes repr.
            count += blk.count(pattern)

count

##### The count covers only the blocks that still have a live replica, so despite some blocks being inaccessible, the surviving blocks still yield a partial answer: a lower bound on the number of occurrences in the full file.
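Counting block by block has one side effect worth seeing on a small example: an occurrence that straddles a block boundary is not counted, in addition to occurrences inside lost blocks. The sketch below works on an in-memory byte string instead of HDFS; `count_in_blocks`, `data`, and the tiny block sizes are made up for illustration.

```python
def count_in_blocks(data: bytes, blocks, pattern: bytes) -> int:
    """Sum occurrences of `pattern` inside each block that still has a host."""
    total = 0
    for block in blocks:
        if not block["hosts"]:
            continue  # lost block: nothing can be read from it
        start = block["offset"]
        chunk = data[start:start + block["length"]]
        total += chunk.count(pattern)
    return total


# Three 16-byte rows; the block sizes are deliberately tiny and uneven.
data = b"Single Family,1\nSingle Family,2\nSingle Family,3\n"
blocks = [
    {"hosts": ["dn-a"], "offset": 0, "length": 20},
    {"hosts": [], "offset": 20, "length": 12},  # lost block
    {"hosts": ["dn-a"], "offset": 32, "length": 16},
]

print(count_in_blocks(data, blocks, b"Single Family"))
print(data.count(b"Single Family"))
```

Here the second occurrence starts in the first block and ends in the lost one, so the per-block count is smaller than the count over the whole byte string.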