To read a data set and perform specific tasks using the Map-Reduce framework of the Hadoop Ecosystem.
Language used: Python 3.8.x
Dataset: the 5% and 15% samples of the dataset
To run the code locally on your system without using Hadoop, use the following command:
cat <path-to-json-file-dataset> | python3 mapper.py [command_line_arguments] | sort -k 1,1 | python3 reducer.py [command_line_arguments]
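The local pipeline above mimics Hadoop Streaming: the mapper emits tab-separated key/value lines, `sort` groups them by key, and the reducer aggregates consecutive lines with the same key. A minimal sketch of such a mapper/reducer pair is shown below; the `"category"` field and the per-key counting task are hypothetical placeholders, since the actual task depends on the command-line arguments passed to the scripts.

```python
import json

# Hypothetical mapper: emit "key\t1" for each JSON record,
# keyed here by an assumed "category" field.
def mapper(lines):
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        key = record.get("category", "unknown")
        yield f"{key}\t1"

# Hypothetical reducer: sum the counts for each key.
# Input must already be sorted by key (sort -k 1,1 in the pipeline).
def reducer(pairs):
    current_key, total = None, 0
    for pair in pairs:
        key, _, count = pair.partition("\t")
        if key != current_key:
            if current_key is not None:
                yield f"{current_key}\t{total}"
            current_key, total = key, 0
        total += int(count)
    if current_key is not None:
        yield f"{current_key}\t{total}"

# Simulate: cat dataset | mapper | sort | reducer
sample = ['{"category": "a"}', '{"category": "b"}', '{"category": "a"}']
result = list(reducer(sorted(mapper(sample))))
print(result)  # ['a\t2', 'b\t1']
```

In the real scripts, `mapper()` and `reducer()` would read from `sys.stdin` and print each emitted line, which is exactly what the piped command simulates.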
To run the code on Hadoop HDFS on your local system:
- Start the Hadoop daemons on your local system (for example, with start-dfs.sh).
- Create a directory within HDFS to store the dataset file.
hdfs dfs -mkdir /<folder-name>
- The command below creates a folder called input inside it to hold the dataset.
hdfs dfs -mkdir /<folder-name>/input
- Add the JSON dataset file to the input directory created in the previous step.
hdfs dfs -put <path-to-json-file> /<folder-name>/input
- To verify that the JSON file was successfully added:
hdfs dfs -ls /<folder-name>/input
- To run the job on Hadoop, use the command below. Note that the output folder must NOT exist when running this command; Hadoop creates it itself.
hadoop jar <path-to-streaming-jar-file> -input /<folder-name>/input -output /<folder-name>/output -file <path-to-mapper-file> -file <path-to-reducer-file> -mapper "python3 mapper.py [command_line_arguments]" -reducer "python3 reducer.py"
- Once the job completes, view the output with the following command.
hdfs dfs -cat /<folder-name>/output/part-00000
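The HDFS steps above can be collected into one sketch script. The placeholder paths must be replaced with values for your installation, and the `hdfs dfs -rm` line is an optional convenience (not part of the steps above) that clears a stale output directory before rerunning.

```shell
#!/usr/bin/env bash
# Sketch of the workflow above; fill in the placeholders before running.
set -e

FOLDER=/<folder-name>                       # HDFS working directory (placeholder)
DATASET=<path-to-json-file>                 # local dataset file (placeholder)
STREAMING_JAR=<path-to-streaming-jar-file>  # Hadoop Streaming jar (placeholder)

hdfs dfs -mkdir -p "$FOLDER/input"          # -p creates parent directories too
hdfs dfs -put -f "$DATASET" "$FOLDER/input" # -f overwrites an existing copy
hdfs dfs -rm -r -f "$FOLDER/output"         # output dir must not exist for the job

hadoop jar "$STREAMING_JAR" \
    -input "$FOLDER/input" -output "$FOLDER/output" \
    -file mapper.py -file reducer.py \
    -mapper "python3 mapper.py" -reducer "python3 reducer.py"

hdfs dfs -cat "$FOLDER/output/part-00000"   # print the job output
```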