An implementation of TF-IDF on Hadoop using Python, organized as three MapReduce phases. The key/value pairs emitted at each step are:
Phase one:
Mapper: ((word, doc_id), 1)
Reducer: ((word, doc_id), word_count_in_doc)
Phase two:
Mapper: (doc_id, (word, word_count_in_doc))
Reducer: ((word, doc_id), (word_count_in_doc, words_in_doc))
Phase three:
Mapper: (word, (doc_id, word_count_in_doc, words_in_doc, 1))
Reducer: ((word, doc_id), tf-idf)
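The flow of keys and values through the three phases above can be sketched in plain Python, outside Hadoop. This is an illustrative in-memory sketch, not the repository's code: the function names, the whitespace tokenizer, and the weighting (tf = word_count_in_doc / words_in_doc, idf = log(total_docs / docs_containing_word)) are all assumptions, and the actual scripts may tokenize or weight differently.

```python
import math
from collections import defaultdict


def phase_one(docs):
    """Emit ((word, doc_id), 1) per token and sum to word_count_in_doc."""
    counts = defaultdict(int)
    for doc_id, text in docs.items():
        for word in text.lower().split():  # assumed tokenizer: lowercase + whitespace
            counts[(word, doc_id)] += 1
    return dict(counts)


def phase_two(word_counts):
    """Regroup by doc_id so each (word, doc_id) also carries words_in_doc."""
    by_doc = defaultdict(list)
    for (word, doc_id), count in word_counts.items():
        by_doc[doc_id].append((word, count))
    stats = {}
    for doc_id, pairs in by_doc.items():
        words_in_doc = sum(count for _, count in pairs)
        for word, count in pairs:
            stats[(word, doc_id)] = (count, words_in_doc)
    return stats


def phase_three(doc_stats, total_docs):
    """Regroup by word, count the documents containing it, and score."""
    by_word = defaultdict(list)
    for (word, doc_id), (count, words_in_doc) in doc_stats.items():
        by_word[word].append((doc_id, count, words_in_doc, 1))
    scores = {}
    for word, rows in by_word.items():
        docs_with_word = sum(row[3] for row in rows)  # sum of the trailing 1s
        idf = math.log(total_docs / docs_with_word)   # assumed idf definition
        for doc_id, count, words_in_doc, _ in rows:
            scores[(word, doc_id)] = (count / words_in_doc) * idf
    return scores


def tf_idf(docs):
    """Chain the three phases over a {doc_id: text} mapping."""
    return phase_three(phase_two(phase_one(docs)), len(docs))


# Hypothetical two-document corpus, for illustration only.
scores = tf_idf({"d1": "a b a", "d2": "b c"})
```

Under these assumptions, a word that appears in every document (here "b") scores 0, because its idf is log(1). Note that the total number of documents is passed in separately; the phase outputs listed above do not carry it.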
The mapper and reducer programs for each phase, along with the output files, are available in the repository.
To test via pipes: cat Data/* | ./MapperPhaseOne.py | sort | ./ReducerPhaseOne.py
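The pipe test works because `sort` stands in for Hadoop's shuffle: it places all lines with the same key next to each other before the reducer reads them. A minimal sketch of a phase-one mapper/reducer pair written in that style follows. It is hypothetical and is not the repository's MapperPhaseOne.py or ReducerPhaseOne.py: the tab-separated line layout, the function names, and passing doc_id as a parameter are assumptions, since the way the real scripts obtain the document id is not described here.

```python
from itertools import groupby


def map_phase_one(lines, doc_id):
    """Yield one 'word<TAB>doc_id<TAB>1' line per token (assumed layout)."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t{doc_id}\t1"


def reduce_phase_one(sorted_lines):
    """Sum the counts of consecutive lines that share a (word, doc_id) key.

    Relies on the input being sorted, as it is after `sort` or a Hadoop shuffle.
    """
    rows = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for (word, doc_id), group in groupby(rows, key=lambda r: (r[0], r[1])):
        total = sum(int(r[2]) for r in group)
        yield f"{word}\t{doc_id}\t{total}"


# sorted() plays the role of `sort` in the pipe test.
mapped = sorted(map_phase_one(["to be or not to be"], "doc1"))
reduced = list(reduce_phase_one(mapped))
```

In a real streaming script the same two functions would read from sys.stdin and print each yielded line to stdout.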
To run phase one on a cluster (supply the input and output paths after -input and -output): hadoop jar /root/hadoop-2.7.1/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar -input -output -file *.py -mapper MapperPhaseOne.py -reducer ReducerPhaseOne.py