HariRaagavTR/big-data-analysis


UE19CS322 : Big Data - Assignment 1

Aim of the assignment:

To read a data set and perform specific tasks using the Map-Reduce framework of the Hadoop Ecosystem.

Language used: Python 3.8.x
Datasets: 5% and 15%

Steps to run tasks:

To run the code locally on your system without using Hadoop, use the following command:

cat <path-to-json-file-dataset> | python3 mapper.py [command_line_arguments] | sort -k 1,1 | python3 reducer.py [command_line_arguments]
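As a rough illustration of what the mapper half of that pipeline does (not the actual mapper.py, which parses the assignment's JSON dataset according to its own command-line arguments), a hypothetical mapper might emit one tab-separated key/count pair per JSON record. The field name "subreddit" below is purely an assumed placeholder for whatever field the task keys on:

```python
import json
import sys

def map_line(line):
    # Parse one JSON record per line and emit tab-separated key/value
    # pairs. The field name "subreddit" is hypothetical -- substitute
    # whatever field the assignment task actually keys on.
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return  # skip malformed lines rather than crash the job
    yield f"{record.get('subreddit', 'UNKNOWN')}\t1"

if __name__ == "__main__":
    for line in sys.stdin:
        for pair in map_line(line):
            print(pair)
```

The `sort -k 1,1` stage then groups these pairs by key, which is exactly the guarantee the reducer relies on.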

To run the code on Hadoop HDFS on your local system:

  1. Start Hadoop on your local system.

  2. Create a directory within HDFS to store the dataset file:
hdfs dfs -mkdir /<folder-name>

  3. Create a folder called input inside it to hold the dataset:
hdfs dfs -mkdir /<folder-name>/input

  4. Add the JSON dataset file to the input directory created in the previous step:
hdfs dfs -put <path-to-json-file> /<folder-name>/input

  5. Verify that the JSON file was added successfully:
hdfs dfs -ls /<folder-name>/input

  6. Run the streaming job with the command below. Note that the output folder must NOT exist when running this command; Hadoop creates it internally.
hadoop jar <path-to-streaming-jar-file> -input /<folder-name>/input -output /<folder-name>/output -file <path-to-mapper-file> -file <path-to-reducer-file> -mapper "python3 mapper.py [command_line_arguments]" -reducer "python3 reducer.py"

  7. Once the job finishes, view the output using the following command:
hdfs dfs -cat /<folder-name>/output/part-00000
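The streaming job above feeds the reducer its input already sorted by key (Hadoop's shuffle provides this; locally, `sort -k 1,1` plays the same role). A minimal sketch of the classic streaming-reducer pattern that exploits this guarantee, assuming tab-separated key/value lines and integer counts (again an illustration, not the assignment's reducer.py):

```python
import sys

def reduce_stream(lines):
    # Input arrives sorted by key, so all lines for a key are
    # contiguous: keep a running total and flush it whenever the
    # key changes, then flush the final key at end of input.
    current_key, total = None, 0
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                yield f"{current_key}\t{total}"
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        yield f"{current_key}\t{total}"

if __name__ == "__main__":
    for out in reduce_stream(sys.stdin):
        print(out)
```

This single-pass pattern is why the output folder's part-00000 file is itself sorted by key.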

Contributors:

Hari Raagav T R
Manasa S M
Lakshmi Narayan P

About

Assignment for "Big Data" course in PESU.
