Word Relatedness - Assignment 2 in Distributed System Programming course at BGU

The assignment in the course website: https://www.cs.bgu.ac.il/~dsp162/Assignments/Assignment_2 Briefly, the assignment was to calculate the PMI of each 2 words in google ngram.

To develop comfortably, install local hadoop. a good tutorial for mac: http://zhongyaonan.com/hadoop-tutorial/setting-up-hadoop-2-6-on-mac-osx-yosemite.html
A nice tutorial to run your first mapreduce program can be found in hadoop's website: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

Example application

The app is separated to 2 modules: the local module and the mapreduce module. They have a mutual parent. In order to run the app (currently a wordcount app):

Install the maven project (mvn install)
Locate the jar that was created by maven
Run hadoop: hadoop jar <jar-path> <wordcount-hdfs-input-path> <wordcount-hdfs-output-path> For example: hadoop jar /Users/dsp-assignment-2/dsp-assignment-2-mapreduce/target/dsp-assignment-2-mapreduce-1.0-SNAPSHOT-job.jar wordcount/input wordcount/output

We are currently based on hadoop official tutorial and a tikal tutorial for maven + hadoop.

Unit tests

Use MRUnit (tutorials).

Map-reduce steps

The map reduce steps are as follows:

Split the n-grams into 2-grams. In addition a simple word count app to count how many times each word and each 2-gram appears in each decade.
Grouping of each 2-gram count with its first word count, i.e. for the 2-gram map reduce, we will write the result map,<map-count>,map reduce,<map-reduce-count> with the key reduce. In addition, in this step we will write each single word with its own count, i.e. for the word map we will write map,<map-count> with map as a key.
Final calculation: Now we will get in the reducer, for the 2gram map reduce the count of map, the count of reduce and the count of map reduce, so we are ready to caculate the PMI, and this is what we will do in this part.

Additional info

We have used hadoop counters in order to count how many words appear in each decade (this is required for the calculation of the PMI)
Before emitting a pair of words, we sort them lexicographically in order to prevent duplications.
The method that we have introduced above, in which for the pair w1,w2, we write the count of w1 and w1,w2 with w2 as a key saves us from performing a "join", which means that our application runs with runtime, instead of

Name		Name	Last commit message	Last commit date
Latest commit History 96 Commits
dsp-assignment-2-commons		dsp-assignment-2-commons
dsp-assignment-2-local		dsp-assignment-2-local
dsp-assignment-2-mapreduce-all		dsp-assignment-2-mapreduce-all
dsp-assignment-2-mapreduce-find-top		dsp-assignment-2-mapreduce-find-top
dsp-assignment-2-mapreduce-stage-2		dsp-assignment-2-mapreduce-stage-2
dsp-assignment-2-mapreduce-stage-3		dsp-assignment-2-mapreduce-stage-3
dsp-assignment-2-mapreduce		dsp-assignment-2-mapreduce
dsp-assignment-2-part-b		dsp-assignment-2-part-b
src		src
top10		top10
.gitignore		.gitignore
README.md		README.md
circle.yml		circle.yml
pom.xml		pom.xml

17davidavid17/dsp-assignment-2

Folders and files

Latest commit

History

Repository files navigation

Word Relatedness - Assignment 2 in Distributed System Programming course at BGU

Example application

Unit tests

Map-reduce steps

Additional info

About

Resources

Stars

Watchers

Forks

Languages