# 1. Write a Python program to read a Hadoop configuration file and display the core components of Hadoop.

To display Hadoop's core components from a configuration file, parse the XML and map each property name to the daemon it configures. The relevant properties are spread across several files: core-site.xml (fs.defaultFS), hdfs-site.xml (dfs.*), and yarn-site.xml (yarn.*), so call the function once per file to cover all of them. Here's a Python program that does this using the xml.etree.ElementTree module:

In [1]:
import xml.etree.ElementTree as ET

def read_hadoop_configuration(file_path):
    components = set()

    try:
        tree = ET.parse(file_path)
        root = tree.getroot()

        for prop in root.findall(".//property"):
            name_elem = prop.find("name")
            value_elem = prop.find("value")

            if name_elem is not None and value_elem is not None:
                # .text is None for empty elements such as <value/>, so guard before stripping
                name = (name_elem.text or "").strip()
                value = (value_elem.text or "").strip()

                # Check for core components configuration entries
                if name == "fs.defaultFS":
                    components.add("HDFS NameNode")
                elif name.startswith("dfs.namenode"):
                    components.add("HDFS NameNode")
                elif name.startswith("dfs.datanode"):
                    components.add("HDFS DataNode")
                elif name.startswith("dfs.journalnode"):
                    components.add("HDFS JournalNode")
                elif name.startswith("dfs.zkfc"):
                    components.add("HDFS ZKFailoverController")
                elif name.startswith("dfs.ha"):
                    components.add("HDFS High-Availability")
                elif name.startswith("yarn.resourcemanager"):
                    components.add("YARN ResourceManager")
                elif name.startswith("yarn.nodemanager"):
                    components.add("YARN NodeManager")

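    except ET.ParseError as e:
        print("Error parsing the configuration file:", e)
    except OSError as e:
        # Covers a missing or unreadable file (e.g. FileNotFoundError)
        print("Error reading the configuration file:", e)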

    return components

if __name__ == "__main__":
    config_file_path = "/path/to/hadoop_configuration.xml"
    core_components = read_hadoop_configuration(config_file_path)

    if core_components:
        print("Core Components of Hadoop:")
        for component in sorted(core_components):  # sets are unordered; sort for stable output
            print("-", component)
    else:
        print("No core components found in the Hadoop configuration.")
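
As a rough illustration of the file layout the parser above expects, the sketch below parses an inline sample instead of a file on disk. The sample XML, the hostname, and the helper name `parse_properties` are assumptions made up for this example; only the configuration/property/name/value nesting reflects Hadoop's real format.

```python
import xml.etree.ElementTree as ET

# Hypothetical core-site.xml content; the values are illustrative only
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
  </property>
</configuration>
"""

def parse_properties(xml_text):
    """Return a {name: value} dict for every <property> element in the XML text."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        # findtext returns None when the child is missing, so fall back to ""
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props

if __name__ == "__main__":
    for name, value in parse_properties(SAMPLE_CORE_SITE).items():
        print(f"{name} = {value}")
```

Returning a dictionary keeps the values available, which the component-detection function above reads but does not use.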



# 2. Implement a Python function that calculates the total file size in a Hadoop Distributed File System (HDFS) directory.

To calculate the total file size of an HDFS directory, you can call the HDFS shell from Python. The subprocess module runs hadoop fs -du -s, which prints a one-line summary whose first field is the directory's total size in bytes, and the function below parses that field. The hadoop executable must be on the PATH of the machine running the script. Here's a Python function to achieve this:

In [2]:
import subprocess

def get_hdfs_directory_size(directory_path):
    try:
        # Execute 'hadoop fs -du' command to get size info
        cmd = ["hadoop", "fs", "-du", "-s", directory_path]
        result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=True)

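        # The summary line starts with the total size, already expressed in bytes
        # (the logical size, before replication), so only an int conversion is needed
        total_size_in_bytes = int(result.stdout.strip().split()[0])

        return total_size_in_bytes

    except (subprocess.CalledProcessError, FileNotFoundError, ValueError, IndexError) as e:
        # FileNotFoundError: 'hadoop' is not on PATH; ValueError/IndexError: unexpected output
        print("Error executing HDFS command:", e)
        return None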

if __name__ == "__main__":
    hdfs_directory = "/user/hadoop/example_directory"  # Replace this with the HDFS directory path you want to calculate the size for
    total_size = get_hdfs_directory_size(hdfs_directory)

    if total_size is not None:
        print(f"Total file size in {hdfs_directory}: {total_size} bytes")
    else:
        print("Failed to calculate the total file size.")
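
The parsing step can be tried without a cluster by separating it from the subprocess call. The sketch below is a minimal illustration under that assumption: `parse_du_summary` and `format_bytes` are hypothetical helper names, and the sample line only imitates the size / disk-space-consumed / path layout that recent Hadoop releases print for `-du -s`.

```python
def parse_du_summary(stdout):
    """Return the first field of a 'hadoop fs -du -s' summary line as an int (bytes)."""
    fields = stdout.strip().split()
    if not fields:
        raise ValueError("empty -du output")
    return int(fields[0])

def format_bytes(num_bytes):
    """Render a byte count with binary units for easier reading."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        # Stop at the first unit that fits, or at the largest unit available
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024

if __name__ == "__main__":
    # Illustrative output line, not captured from a real cluster
    sample_output = "1536  4608  /user/hadoop/example_directory\n"
    total = parse_du_summary(sample_output)
    print(f"{total} bytes ({format_bytes(total)})")
```

Keeping the parser free of I/O makes it straightforward to unit-test against saved command output.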



# 3. Create a Python program that extracts and displays the top N most frequent words from a large text file using the MapReduce approach.


To extract and display the top N most frequent words from a large text file, we'll simulate the MapReduce process in a single Python program. MapReduce has two user-defined steps: the Mapper, which emits (key, value) pairs from the input, and the Reducer, which aggregates the values for each key. Between them, a real framework groups the pairs by key (the shuffle); here that grouping happens implicitly while counting.

For this task, we'll use the collections.Counter class to count the occurrences of words. Note that this is a simplified implementation, and in a real distributed MapReduce system, the data would be distributed among different nodes for parallel processing.

Here's the Python program:

In [3]:
import re
from collections import Counter

def mapper(text):
    # Split the text into words using regex
    words = re.findall(r'\w+', text.lower())
    # Emit each word with count 1
    for word in words:
        yield word, 1

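def reducer(word_counts, n):
    # Sum the emitted counts per word. Passing the pairs straight to Counter()
    # would count the (word, 1) tuples themselves rather than the words.
    word_counter = Counter()
    for word, count in word_counts:
        word_counter[word] += count
    # Get the top N most common words
    top_n_words = word_counter.most_common(n)
    return top_n_words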

def map_reduce(file_path, n):
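    # Stream the file line by line so a large file is never held in memory at once
    # (assumes the file is UTF-8 text)
    with open(file_path, 'r', encoding='utf-8') as file:
        # Map step: tokenize each line and emit word count pairs lazily
        mapped_data = (pair for line in file for pair in mapper(line))

        # Reduce step: aggregate word counts while the file is still open
        top_n_words = reducer(mapped_data, n)

    return top_n_words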

if __name__ == "__main__":
    text_file_path = "large_text_file.txt"  # Replace with the path to your large text file
    N = 10  # Replace N with the desired number of top frequent words

    top_words = map_reduce(text_file_path, N)

    if top_words:
        print(f"Top {N} most frequent words:")
        for word, count in top_words:
            print(f"{word}: {count}")
    else:
        print("No data or error occurred during the MapReduce process.")
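
To make the three phases of the paradigm visible, the sketch below separates map, shuffle, and reduce into their own functions and runs them on a few in-memory lines. It is a single-process illustration only; the function names and the sample lines are assumptions for this example, and `heapq.nlargest` stands in for the final top-N selection.

```python
import heapq
import re
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in re.findall(r"\w+", line.lower())]

def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

def top_n(lines, n):
    """Run map -> shuffle -> reduce over the lines and return the n most frequent words."""
    pairs = (pair for line in lines for pair in map_phase(line))
    totals = reduce_phase(shuffle_phase(pairs))
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])

if __name__ == "__main__":
    sample_lines = ["the quick brown fox", "the lazy dog", "The fox"]
    print(top_n(sample_lines, 2))  # [('the', 3), ('fox', 2)]
```

In a distributed run, each mapper would handle its own input split and the shuffle would route every key to one reducer; the data flow is otherwise the same.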



# 4. Write a Python script that checks the health status of the NameNode and DataNodes in a Hadoop cluster using Hadoop's REST API.


To check the health status of the NameNode and DataNodes in a Hadoop cluster, we'll query the JMX JSON endpoint (/jmx) that the NameNode's web server exposes over HTTP, using the requests library. The NameNode tracks the heartbeat of every DataNode, so both checks can be answered from the NameNode alone.

This script assumes that the NameNode's web UI is reachable from the machine where the script runs.

Here's the Python script:

In [4]:
import requests

def check_namenode_health(hdfs_web_ui_url):
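    # LiveNodes, DeadNodes and Safemode are attributes of the NameNodeInfo bean;
    # the NameNodeStatus bean reports the HA state and role instead
    namenode_health_url = f"{hdfs_web_ui_url}/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"
    try:
        response = requests.get(namenode_health_url, timeout=10)
        response.raise_for_status()
        bean = response.json()['beans'][0]

        # LiveNodes and DeadNodes arrive as JSON-encoded strings keyed by DataNode
        live_nodes = bean['LiveNodes']
        dead_nodes = bean['DeadNodes']
        safemode_status = bean['Safemode']

        print("NameNode Health Status:")
        print(f"Live Nodes: {live_nodes}")
        print(f"Dead Nodes: {dead_nodes}")
        print(f"Safemode Status: {safemode_status}")

    except (requests.RequestException, KeyError, IndexError, ValueError) as e:
        # KeyError/IndexError/ValueError cover a missing bean or a non-JSON response
        print("Error checking NameNode health:", e)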

def check_datanode_health(hdfs_web_ui_url):
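    # Cluster-wide DataNode counts come from the NameNode's FSNamesystemState bean;
    # the DataNodeInfo bean is served only by each DataNode's own web server
    datanode_health_url = f"{hdfs_web_ui_url}/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"
    try:
        response = requests.get(datanode_health_url, timeout=10)
        response.raise_for_status()
        bean = response.json()['beans'][0]

        datanode_live_count = bean['NumLiveDataNodes']
        datanode_dead_count = bean['NumDeadDataNodes']
        datanode_count = datanode_live_count + datanode_dead_count

        print("\nDataNode Health Status:")
        print(f"Total DataNodes: {datanode_count}")
        print(f"Live DataNodes: {datanode_live_count}")
        print(f"Dead DataNodes: {datanode_dead_count}")

    except (requests.RequestException, KeyError, IndexError, ValueError) as e:
        # KeyError/IndexError/ValueError cover a missing bean or a non-JSON response
        print("Error checking DataNode health:", e)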

if __name__ == "__main__":
    hdfs_web_ui_url = "http://your_namenode_hostname:50070"  # Replace with your NameNode web UI URL (default port: 50070 on Hadoop 2.x, 9870 on Hadoop 3.x)

    check_namenode_health(hdfs_web_ui_url)
    check_datanode_health(hdfs_web_ui_url)
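
The shape of a /jmx response can be explored offline before pointing the script at a cluster. The sketch below builds a trimmed, made-up payload and summarizes it; the hostnames, the `summarize_namenode_info` helper, and the assumption that Safemode is an empty string when safe mode is off are illustrative, so check them against a real response from your Hadoop version.

```python
import json

# Hypothetical, heavily trimmed /jmx payload; LiveNodes and DeadNodes are
# themselves JSON documents stored as strings inside the bean
SAMPLE_JMX = json.dumps({
    "beans": [{
        "name": "Hadoop:service=NameNode,name=NameNodeInfo",
        "Safemode": "",
        "LiveNodes": json.dumps({
            "dn1.example.com:9866": {"lastContact": 1},
            "dn2.example.com:9866": {"lastContact": 2},
        }),
        "DeadNodes": json.dumps({}),
    }]
})

def summarize_namenode_info(payload_text):
    """Extract live/dead DataNode names and the safe mode flag from a /jmx payload."""
    beans = json.loads(payload_text).get("beans", [])
    if not beans:
        raise ValueError("no beans in JMX response")
    bean = beans[0]
    # Decode the nested JSON strings; treat a missing attribute as an empty map
    live = json.loads(bean.get("LiveNodes") or "{}")
    dead = json.loads(bean.get("DeadNodes") or "{}")
    return {
        "live": sorted(live),
        "dead": sorted(dead),
        "safemode": bool(bean.get("Safemode")),  # assumed empty when safe mode is off
    }

if __name__ == "__main__":
    summary = summarize_namenode_info(SAMPLE_JMX)
    print(f"Live DataNodes: {len(summary['live'])}")
    print(f"Dead DataNodes: {len(summary['dead'])}")
    print(f"Safe mode on: {summary['safemode']}")
```

Decoding the nested strings gives per-node detail, which the printed raw values in the script above do not break out.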




# 5. Develop a Python program that lists all the files and directories in a specific HDFS path.

To list the files and directories under a specific HDFS path, you can run the hadoop fs -ls command through the subprocess module and take the last column of each output line, which holds the entry's path. Here's a Python program to achieve this:

In [5]:
import subprocess

def list_hdfs_path(hdfs_path):
    try:
        # Execute 'hadoop fs -ls' command to list files and directories in the specified path
        cmd = ["hadoop", "fs", "-ls", hdfs_path]
        result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=True)

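        # Extract file and directory names from the command output
        output_lines = result.stdout.strip().splitlines()

        # Skip the "Found N items" header, which is printed only for non-empty directories.
        # Each entry has 8 columns (permissions, replication, owner, group, size, date,
        # time, path); limiting the split keeps paths that contain spaces intact.
        names = [line.split(None, 7)[-1] for line in output_lines
                 if line.strip() and not line.startswith("Found ")]

        return names

    except (subprocess.CalledProcessError, FileNotFoundError) as e:
        # FileNotFoundError is raised when the 'hadoop' executable is not on PATH
        print("Error executing HDFS command:", e)
        return None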

if __name__ == "__main__":
    hdfs_path = "/user/hadoop/example_directory"  # Replace this with the HDFS path you want to list

    files_and_directories = list_hdfs_path(hdfs_path)

    if files_and_directories is not None:
        if not files_and_directories:
            print(f"No files or directories found in {hdfs_path}.")
        else:
            print(f"Files and directories in {hdfs_path}:")
            for item in files_and_directories:
                print(item)
    else:
        print("Failed to list files and directories in the HDFS path.")
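
The listing can also be parsed into structured entries rather than bare names. The sketch below works on a hand-written sample that imitates the eight-column `-ls` layout; the sample lines and the `parse_ls_output` helper are assumptions for illustration, not output captured from a cluster.

```python
# Illustrative 'hadoop fs -ls' output; the second path contains a space on purpose
SAMPLE_LS = """Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2023-07-01 10:15 /user/hadoop/example_directory/logs
-rw-r--r--   3 hadoop supergroup       1024 2023-07-01 10:20 /user/hadoop/example_directory/my data.csv
"""

def parse_ls_output(stdout):
    """Turn 'hadoop fs -ls' text into a list of dicts, skipping header and blank lines."""
    entries = []
    for line in stdout.splitlines():
        # Split into at most 8 fields so a path with spaces stays in one piece
        parts = line.split(None, 7)
        if len(parts) < 8:
            continue  # "Found N items" header or an empty line
        perms, _replication, owner, _group, size, _date, _time, path = parts
        entries.append({
            "path": path,
            "is_dir": perms.startswith("d"),
            "size": int(size),
            "owner": owner,
        })
    return entries

if __name__ == "__main__":
    for entry in parse_ls_output(SAMPLE_LS):
        kind = "dir " if entry["is_dir"] else "file"
        print(f"{kind} {entry['size']:>8} {entry['path']}")
```

Reading the permission string's first character is enough to tell directories from files, which the name-only listing above cannot do.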

