# Real-World Applications: TF-IDF

In this task, Hadoop Streaming is used to process a dump of Wikipedia articles (/data/wiki/en_articles_part).

The purpose of this task is to calculate tf*idf for each (word, article) pair from the Wikipedia dump. Apply the stop-word filter to speed up the calculations. Term frequency (tf) is a function of a term (word) and a document (article):

tf(term, doc_id) = Nt/N,

where Nt is the number of occurrences of the term in the document and N is the total number of terms in the document (excluding stop words).

Inverse document frequency (idf) is a function of a term:

idf(term) = 1/log(1 + Dt),

where Dt is the number of documents in the dataset that contain the term.
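As a rough illustration of the two formulas, the sketch below evaluates them on a tiny made-up corpus. The `docs` dictionary and the helper names `tf`, `idf`, and `tf_idf` are hypothetical and are not part of the assignment. The natural logarithm is assumed because the reducer in this notebook uses `math.log`.

```python
import math
from collections import Counter

# Hypothetical toy corpus: article_id -> tokens (stop words already removed).
docs = {
    1: ["labor", "union", "labor", "strike"],
    2: ["union", "history"],
    3: ["labor", "market"],
}

def tf(term, doc_id):
    """tf(term, doc_id) = Nt / N for one document."""
    tokens = docs[doc_id]
    return Counter(tokens)[term] / len(tokens)

def idf(term):
    """idf(term) = 1 / log(1 + Dt); assumes the term occurs in at least one document."""
    dt = sum(1 for tokens in docs.values() if term in tokens)
    return 1.0 / math.log(1 + dt)

def tf_idf(term, doc_id):
    return tf(term, doc_id) * idf(term)

# "labor" occurs 2 times among the 4 terms of article 1 and appears in 2 documents.
print(round(tf_idf("labor", 1), 5))
```

Note that a term appearing in no document would give log(1) = 0 and a division by zero. In the actual job this cannot happen, because a word only reaches the reducer if some article contains it.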

You can find more information here: https://en.wikipedia.org/wiki/Tf–idf, but use only the formulas given above.

Dataset location: /data/wiki/en_articles_part

The stop-word list is in the ‘/datasets/stop_words_en.txt’ file.

Format: article_id <tab> article_text

When parsing the articles, remember to handle Unicode (even though this is an English Wikipedia dump, it contains many characters from other languages), remove punctuation marks, and convert words to lowercase to get the correct counts. To cope with Unicode, we recommend the tokenizer used in mapper.py below: re.split(r"\W*\s+\W*", text, flags=re.UNICODE).
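A minimal sketch of how this kind of tokenizer behaves, based on the `re.split` call that appears in mapper.py in this notebook. The `tokenize` wrapper, the sample sentence, and the filtering of empty strings are assumptions made for illustration only.

```python
import re

def tokenize(text):
    # Split on runs of whitespace together with any adjacent non-word characters.
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    # Lowercase and drop empty strings (the empty-string filter is an assumption).
    return [w.lower() for w in words if w]

tokens = tokenize("Labor, in Zürich (1912): the «unions» grew.")
print(tokens)
```

Punctuation that is not next to whitespace, such as the final period here, stays attached to its token. This is one reason the mapper later skips tokens for which `isalpha()` is false.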

Output: tf*idf for term=’labor’ and article_id=12

The result on the sample dataset:

0.00351

Hint: all Wikipedia article_ids are greater than 0, so you can use a dummy article_id=0 to count the number of documents containing each term.
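One possible reading of this hint, sketched as a local simulation rather than a real Hadoop job: the mapper would emit an extra record with article_id 0 for every (word, article) pair, and the reducer would sum those records to obtain Dt before it sees any real article. The record layout, the sample values, and the `reduce_sorted` name are hypothetical. In a real job the sort order would have to guarantee that records with article_id 0 arrive first for each word, and that configuration is not shown here.

```python
import math
from itertools import groupby

# Hypothetical mapper output: (word, article_id, value).
# For article_id 0 the value is a document count of 1; otherwise it is tf.
records = [
    ("labor", 12, 0.02), ("labor", 0, 1),
    ("labor", 7, 0.01), ("labor", 0, 1),
    ("union", 7, 0.05), ("union", 0, 1),
]

def reduce_sorted(records):
    """Yield (word, article_id, tf*idf) from records sorted by (word, article_id)."""
    for word, group in groupby(sorted(records), key=lambda r: r[0]):
        doc_count = 0
        for _, article_id, value in group:
            if article_id == 0:
                # Dummy records sort first, so Dt is complete before real articles.
                doc_count += value
            else:
                idf = 1.0 / math.log(1 + doc_count)
                yield word, article_id, value * idf

results = list(reduce_sorted(records))
for row in results:
    print(*row, sep="\t")
```

Compared with buffering every article of a word in a dictionary, this layout lets the reducer stream its output, at the cost of emitting twice as many records from the mapper.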

Please make sure that you call the right interpreter (python2 or python3). Do not write just "python" without the major version, because we do not guarantee that one particular version is always set as the default in the grading system.

If you want to deploy the environment on your own machine, please use the bigdatateam/yarn-notebook Docker container. New image: bigdatateam/hysh-full:py3-c1

In [None]:
%%writefile mapper.py

import sys
import re
import collections

# Load stop words into a set for fast membership tests.
with open('stop_words_en.txt') as f:
    stop_words = set(f.read().split())

for line in sys.stdin:
    try:
        article_id, text = line.strip().split('\t', 1)
    except ValueError:
        # Skip malformed lines that lack the article_id <tab> text structure.
        continue

    # The raw string avoids invalid-escape warnings for \W and \s.
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    words = [word.lower() for word in words]
    words = [word for word in words if word not in stop_words]

    # N: total number of terms in the article, stop words excluded.
    words_counter = collections.Counter(words)
    words_total = sum(words_counter.values())

    for word, count in sorted(words_counter.items()):
        if not word.isalpha():
            continue
        tf = count / words_total
        print(word, article_id, tf, sep="\t")

In [22]:
%%writefile reducer.py

import sys
import math

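def emit(word, articles):
    """Print tf*idf for every article that contains the word."""
    # idf(term) = 1 / log(1 + Dt), where Dt is the number of articles with the term.
    idf = 1.0 / math.log(1 + len(articles))
    # A separate loop variable keeps the tf parsed from the input line intact.
    for article, article_tf in articles.items():
        print(word, article, article_tf * idf, sep="\t")


current_word = None
articles = {}  # article_id -> tf for the current word

for line in sys.stdin:
    try:
        word, article_id, tf = line.strip().split('\t', 2)
        tf = float(tf)
    except ValueError:
        continue
    if current_word != word:
        # Input is sorted by word, so a new word closes the previous group.
        if current_word:
            emit(current_word, articles)
        current_word = word
        articles = {}
    articles[article_id] = tf

if current_word:
    emit(current_word, articles)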


In [23]:
%%bash
HADOOP_STREAMING_JAR="/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar"
OUT_DIR="TFIDF_"$(date +"%s%6N")
LOGS="stderr_logs.txt"

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

yarn jar $HADOOP_STREAMING_JAR \
    -D mapreduce.job.name="TFIDF" \
    -D mapreduce.job.reduces=4 \
    -D mapreduce.partition.keypartitioner.options=-k1,1 \
    -files mapper.py,reducer.py,/datasets/stop_words_en.txt \
    -mapper "python3 mapper.py" \
    -reducer "python3 reducer.py" \
    -numReduceTasks 4 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input /data/wiki/en_articles_part \
    -output ${OUT_DIR} > /dev/null 2> $LOGS 

hdfs dfs -cat ${OUT_DIR}/part* | awk -F'\t' '$1 == "labor" && $2 == "12" { print $3 }'
cat $LOGS >&2

hdfs dfs -rm -r -skipTrash ${OUT_DIR} > /dev/null

