TFIDF-Hadoop

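An implementation of TF-IDF on Hadoop using Python, organized as three MapReduce phases. For each phase, the key-value pairs emitted by the mapper and the reducer are listed below.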

Phase 1:

Mapper: ((word, doc_id), 1)

Reducer: ((word, doc_id), word_count_in_doc)
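As a minimal sketch of Phase 1 (not the repository's actual scripts), the two functions below emit and then sum the `((word, doc_id), 1)` pairs. The function names, the tab-separated record layout, and the tokenizing regex are assumptions made for illustration; how the real mapper obtains `doc_id` is not stated here, so it is passed in as a parameter.

```python
import re
from itertools import groupby


def map_phase_one(lines, doc_id):
    """Emit 'word<TAB>doc_id<TAB>1' for every token (hypothetical record layout)."""
    for line in lines:
        # Assumed tokenization: lowercase runs of letters and digits.
        for word in re.findall(r"[a-z0-9]+", line.lower()):
            yield f"{word}\t{doc_id}\t1"


def reduce_phase_one(sorted_lines):
    """Sum the 1s per (word, doc_id) key; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for (word, doc_id), group in groupby(parsed, key=lambda f: (f[0], f[1])):
        word_count_in_doc = sum(int(fields[2]) for fields in group)
        yield f"{word}\t{doc_id}\t{word_count_in_doc}"
```

In a streaming script, `lines` would typically be `sys.stdin`, and each yielded record would be printed to standard output.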

Phase 2:

Mapper: (doc_id, (word, word_count_in_doc))

Reducer: ((word, doc_id), (word_count_in_doc, words_in_doc))
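Phase 2 can be sketched in the same style: the mapper re-keys each Phase 1 record by `doc_id`, and the reducer adds up the counts of one document to obtain `words_in_doc`. The function names and the tab-separated layout are again assumptions, and this sketch buffers all words of one document in memory, which may not match what the repository's reducer does.

```python
from itertools import groupby


def map_phase_two(lines):
    """Turn 'word, doc_id, count' records into 'doc_id, word, count' records."""
    for line in lines:
        word, doc_id, count = line.rstrip("\n").split("\t")
        yield f"{doc_id}\t{word}\t{count}"


def reduce_phase_two(sorted_lines):
    """Attach the total number of words in each document to every record."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for doc_id, group in groupby(parsed, key=lambda f: f[0]):
        # Buffer one document's rows so the total is known before emitting.
        rows = [(word, int(count)) for _, word, count in group]
        words_in_doc = sum(count for _, count in rows)
        for word, count in rows:
            yield f"{word}\t{doc_id}\t{count}\t{words_in_doc}"
```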

Phase 3:

Mapper: (word, (doc_id, word_count_in_doc, words_in_doc, 1))

Reducer: ((word, doc_id), tf-idf)
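A rough sketch of Phase 3 follows. It assumes the common definitions tf = word_count_in_doc / words_in_doc and idf = log(total_docs / docs_with_word), where docs_with_word is the sum of the trailing 1s emitted by the mapper. The natural logarithm, the `total_docs` parameter, the function names, and the tab-separated layout are all assumptions; the repository's scripts may use a different weighting or obtain the document count differently.

```python
import math
from itertools import groupby


def map_phase_three(lines):
    """Re-key by word and append a 1 so the reducer can count documents per word."""
    for line in lines:
        word, doc_id, count, words_in_doc = line.rstrip("\n").split("\t")
        yield f"{word}\t{doc_id}\t{count}\t{words_in_doc}\t1"


def reduce_phase_three(sorted_lines, total_docs):
    """Yield (word, doc_id, tf_idf); total_docs is supplied by the caller (assumption)."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda f: f[0]):
        rows = list(group)
        docs_with_word = sum(int(fields[4]) for fields in rows)
        idf = math.log(total_docs / docs_with_word)  # natural log, an assumption
        for _, doc_id, count, words_in_doc, _ in rows:
            tf = int(count) / int(words_in_doc)
            yield word, doc_id, tf * idf
```

With this weighting, a word that occurs in every document gets a score of 0, since log(1) = 0.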

The mapper and reducer programs associated with each phase, and the output files are available in the repository.

Hadoop commands:

To test via pipes: cat Data/* | ./MapperPhaseOne.py | sort | ./ReducerPhaseOne.py

To run on a cluster (replace the placeholders with HDFS input and output paths): hadoop jar /root/hadoop-2.7.1/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar -input <input_path> -output <output_path> -file *.py -mapper MapperPhaseOne.py -reducer ReducerPhaseOne.py
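The pipe test above (mapper, then sort, then reducer) can also be emulated inside Python when debugging. The helper below is only a sketch: `run_locally`, `toy_mapper`, and `toy_reducer` are hypothetical names, and the toy word count stands in for the real phase scripts. Python's string sort may order records differently from the shell's `sort` under some locales, but records with equal keys still end up adjacent, which is all a streaming reducer relies on.

```python
def run_locally(mapper, reducer, lines):
    """Emulate `cat input | mapper | sort | reducer` in-process."""
    mapped = list(mapper(lines))
    mapped.sort()  # stands in for the shuffle/sort step between map and reduce
    return list(reducer(mapped))


def toy_mapper(lines):
    """Emit 'word<TAB>1' for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def toy_reducer(sorted_lines):
    """Sum counts per word in the streaming style: one pass, tracking the current key."""
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"
```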

SatishUC15/TFIDF-HadoopMapReduce
