To read a data set and perform specific tasks using the Map-Reduce framework of the Hadoop Ecosystem.
Language used: Python 3.8.x
Dataset: the 5% and 15% samples of the dataset
To run the code locally on your system without using Hadoop, use the following command:
cat <path-to-json-file-dataset> | python3 mapper.py [command_line_arguments] | sort -k 1,1 | python3 reducer.py [command_line_arguments]
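The local pipeline above mimics Hadoop Streaming: the mapper emits tab-separated key/value lines, `sort` groups them by key, and the reducer aggregates consecutive lines with the same key. A minimal sketch of such a mapper/reducer pair is shown below; the `"category"` field and the per-key counting task are hypothetical placeholders, since the actual task depends on the command-line arguments passed to the scripts.

```python
import json

# Hypothetical mapper: emit "key\t1" for each JSON record,
# keyed here by an assumed "category" field.
def mapper(lines):
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        key = record.get("category", "unknown")
        yield f"{key}\t1"

# Hypothetical reducer: sum the counts for each key.
# Input must already be sorted by key (sort -k 1,1 in the pipeline).
def reducer(pairs):
    current_key, total = None, 0
    for pair in pairs:
        key, _, count = pair.partition("\t")
        if key != current_key:
            if current_key is not None:
                yield f"{current_key}\t{total}"
            current_key, total = key, 0
        total += int(count)
    if current_key is not None:
        yield f"{current_key}\t{total}"

# Simulate: cat dataset | mapper | sort | reducer
sample = ['{"category": "a"}', '{"category": "b"}', '{"category": "a"}']
result = list(reducer(sorted(mapper(sample))))
print(result)  # ['a\t2', 'b\t1']
```

In the real scripts, `mapper()` and `reducer()` would read from `sys.stdin` and print each emitted line, which is exactly what the piped command simulates.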
To run the code on Hadoop HDFS on your local system:
- Start the Hadoop daemons on your local system (for example, with start-dfs.sh).
- Create a directory within HDFS to store the dataset file.
hdfs dfs -mkdir /<folder-name>
- The command below creates a folder called input inside it to hold the dataset.
hdfs dfs -mkdir /<folder-name>/input
- Add the JSON dataset file to the input directory created in the previous step.
hdfs dfs -put <path-to-json-file> /<folder-name>/input
- To verify that the JSON file was successfully added:
hdfs dfs -ls /<folder-name>/input
- To run the job on Hadoop, use the command below. Note that the output folder must NOT exist when running this command; Hadoop creates it itself.
hadoop jar <path-to-streaming-jar-file> -input /<folder-name>/input -output /<folder-name>/output -file <path-to-mapper-file> -file <path-to-reducer-file> -mapper "python3 mapper.py [command_line_arguments]" -reducer "python3 reducer.py"
- Once the job completes, view the output with the following command.
hdfs dfs -cat /<folder-name>/output/part-00000
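The HDFS steps above can be collected into one sketch script. The placeholder paths must be replaced with values for your installation, and the `hdfs dfs -rm` line is an optional convenience (not part of the steps above) that clears a stale output directory before rerunning.

```shell
#!/usr/bin/env bash
# Sketch of the workflow above; fill in the placeholders before running.
set -e

FOLDER=/<folder-name>                       # HDFS working directory (placeholder)
DATASET=<path-to-json-file>                 # local dataset file (placeholder)
STREAMING_JAR=<path-to-streaming-jar-file>  # Hadoop Streaming jar (placeholder)

hdfs dfs -mkdir -p "$FOLDER/input"          # -p creates parent directories too
hdfs dfs -put -f "$DATASET" "$FOLDER/input" # -f overwrites an existing copy
hdfs dfs -rm -r -f "$FOLDER/output"         # output dir must not exist for the job

hadoop jar "$STREAMING_JAR" \
    -input "$FOLDER/input" -output "$FOLDER/output" \
    -file mapper.py -file reducer.py \
    -mapper "python3 mapper.py" -reducer "python3 reducer.py"

hdfs dfs -cat "$FOLDER/output/part-00000"   # print the job output
```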