YAH – Yet Another Hadoop


This project is an attempt to simulate a miniature HDFS capable of performing some of the important tasks of a distributed file system: running HDFS commands and scheduling Hadoop jobs.

Execution steps


https://humble-reason-194.notion.site/HDFS-Simulation-project-c866dee84b874d97a650f4131a490eda

Design and implementation details


Creating/Loading DFS : The DFS is created or loaded based on the given configuration file. If the DFS does not exist, a new one with a Namenode, Datanodes and a Secondary Namenode is created; otherwise the previously created DFS is loaded.

The Namenode tracks the file-to-block mapping, the location of each block and its replicas, and the file system directories.

The Secondary Namenode is used to back up the Namenode's information.

Each Datanode stores the file blocks created by the user. A sketch of how the DFS could be created or loaded is shown below.
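
The following is a minimal sketch of creating or loading the DFS from a configuration file. The config keys (dfs_path, num_datanodes, block_size, replication) and the metadata layout are illustrative assumptions, not the project's exact format.

```python
# Hypothetical sketch of DFS creation/loading; field names are assumptions.
import json
import os

def create_or_load_dfs(config_path):
    with open(config_path) as f:
        config = json.load(f)

    dfs_path = config["dfs_path"]                       # root of the simulated DFS
    namenode_file = os.path.join(dfs_path, "namenode.json")

    if os.path.exists(namenode_file):
        # DFS already exists: load the previously created metadata.
        with open(namenode_file) as f:
            return json.load(f)

    # Fresh DFS: one directory per datanode plus secondary namenode backup.
    for i in range(config["num_datanodes"]):
        os.makedirs(os.path.join(dfs_path, f"datanode_{i}"), exist_ok=True)
    os.makedirs(os.path.join(dfs_path, "secondary_namenode"), exist_ok=True)

    namenode = {
        "block_size": config["block_size"],     # bytes per block
        "replication": config["replication"],   # replicas per block
        "files": {},          # filename -> ordered list of block ids
        "blocks": {},         # block id -> datanode directories holding it
        "directories": ["/"], # simulated file system tree
    }
    with open(namenode_file, "w") as f:
        json.dump(namenode, f, indent=2)
    return namenode
```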

Command line interface : A command line interface is built with the argparse library. The user can execute commands such as put, ls, cat, rm, mkdir, rmdir and mapreduce. The Namenode and Datanodes are updated appropriately after each command; a sketch of the interface follows.
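
A minimal sketch of the command line interface using argparse. The subcommand names follow the commands listed above; the dispatch is a hypothetical stand-in for the functions in commands.py.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="yah", description="Mini HDFS commands")
    sub = parser.add_subparsers(dest="command", required=True)

    put = sub.add_parser("put", help="copy a local file into the DFS")
    put.add_argument("local_path")
    put.add_argument("dfs_path")

    for name in ("ls", "cat", "rm", "mkdir", "rmdir"):
        p = sub.add_parser(name, help=f"{name} on a DFS path")
        p.add_argument("dfs_path")

    mr = sub.add_parser("mapreduce", help="run a map-reduce job on a DFS file")
    mr.add_argument("input_path")
    mr.add_argument("mapper")
    mr.add_argument("reducer")
    mr.add_argument("output_path")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)  # dispatch to the matching function in commands.py here
```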

Block distribution : When a file is submitted to the DFS it is divided into multiple blocks and replicated. The file blocks are distributed to the datanodes in a round-robin fashion such that each replica goes to a different datanode.
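
A minimal sketch of block splitting and round-robin placement. The block id scheme, datanode directory list and metadata dictionaries are assumptions that mirror the namenode layout sketched earlier.

```python
import os

def put_file(local_path, namenode, datanode_dirs):
    block_size = namenode["block_size"]
    replication = namenode["replication"]
    blocks = []

    with open(local_path, "rb") as f:
        index = 0
        while True:
            data = f.read(block_size)
            if not data:
                break
            block_id = f"{os.path.basename(local_path)}_blk_{index}"

            # Round-robin: each replica of a block goes to a different datanode.
            targets = [
                datanode_dirs[(index + r) % len(datanode_dirs)]
                for r in range(replication)
            ]
            for d in targets:
                with open(os.path.join(d, block_id), "wb") as out:
                    out.write(data)

            namenode["blocks"][block_id] = targets
            blocks.append(block_id)
            index += 1

    namenode["files"][os.path.basename(local_path)] = blocks
```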

Fault tolerance : The Namenode periodically sends a heartbeat signal to each Datanode to check that its file blocks still exist. If a file block or a Datanode is missing, it is regenerated from the replicas, as in the sketch below.
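
A minimal sketch of the heartbeat check: periodically verify that every block replica still exists on disk and recreate missing copies from a surviving replica. The loop interval and metadata shape are assumptions.

```python
import os
import shutil
import time

def heartbeat(namenode, interval=5):
    while True:
        for block_id, locations in namenode["blocks"].items():
            present = [d for d in locations if os.path.exists(os.path.join(d, block_id))]
            missing = [d for d in locations if d not in present]

            if missing and present:
                source = os.path.join(present[0], block_id)
                for d in missing:
                    os.makedirs(d, exist_ok=True)  # recreate a lost datanode directory
                    shutil.copy(source, os.path.join(d, block_id))
        time.sleep(interval)
```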

Namenode failure : If the Namenode fails, the data backed up in the Secondary Namenode is used to restore it.
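
A minimal sketch of this recovery, assuming the Namenode metadata lives in a single JSON file that is periodically copied to the Secondary Namenode; the paths are illustrative.

```python
import os
import shutil

def backup_namenode(namenode_file, secondary_file):
    # Run periodically so the secondary copy stays close to the primary.
    shutil.copy(namenode_file, secondary_file)

def restore_namenode(namenode_file, secondary_file):
    # If the primary metadata is gone, bring it back from the backup.
    if not os.path.exists(namenode_file):
        shutil.copy(secondary_file, namenode_file)
```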

Running Hadoop jobs : This functionality is implemented using the subprocess library. The cat command is used to read the input file from the DFS, and its output is passed to the mapper submitted by the user. The mapper output is then sorted and fed to the reducer. The reducer output is stored temporarily in a file, which the put command uses to write the result back to the DFS.
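
A minimal sketch of that pipeline with the subprocess module: the DFS file contents are piped into the user's mapper, the mapper output is sorted, fed to the reducer, and the result is written to a temporary file ready to be put back into the DFS. Obtaining the input via cat and writing it back via put are assumed to happen around this function.

```python
import subprocess
import tempfile

def run_mapreduce(input_text, mapper_path, reducer_path):
    # Feed the file contents (obtained via the cat command) to the mapper.
    mapped = subprocess.run(
        ["python", mapper_path], input=input_text,
        capture_output=True, text=True, check=True,
    ).stdout

    # Sort the mapper output so identical keys are adjacent for the reducer.
    sorted_output = subprocess.run(
        ["sort"], input=mapped, capture_output=True, text=True, check=True,
    ).stdout

    # Run the reducer on the sorted key-value pairs.
    reduced = subprocess.run(
        ["python", reducer_path], input=sorted_output,
        capture_output=True, text=True, check=True,
    ).stdout

    # Store the reducer output in a temporary file for the put command.
    with tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt") as tmp:
        tmp.write(reduced)
        return tmp.name
```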

Implementation Files


setup.py : Used to create the DFS

load.py : Loads the DFS

commands.py : Functions for all commands, namely put, ls, cat, rm, mkdir, rmdir and mapreduce.

heartbeat.py : Periodically checks the Datanodes and recreates blocks in case of failure (fault tolerance).

zookeeper.py : Periodically checks for namenode failure and takes suitable action.

utilities.py : Utility functions such as file splitting and JSON metadata updates.

main.py : Entry point that executes all of the above functionality.
