# Classical Word Count

Counting words is the "Hello World" of Big Data. As trivial as the example seems, it already contains many of the principles and operations that are commonly used in practice. So let's first build a word counter step by step, following the classical "map-reduce" approach.
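Before turning to Spark, it may help to see what the individual steps do on a tiny in-memory example. The following is only a plain-Python sketch of the idea, not how Spark implements it; the sample `lines` and the helper `reduce_by_key` are made up for illustration.

```python
from itertools import chain

# Hypothetical sample input standing in for the lines of a text file
lines = ["the cat sat", "the dog sat down"]

# "flatMap": split every line and flatten the per-line lists into one sequence
words = list(chain.from_iterable(line.split() for line in lines))

# "map": turn every word into a (word, 1) pair
pairs = [(word, 1) for word in words]


# "reduceByKey": combine all values that share the same key with func
# (illustrative helper, not part of any library)
def reduce_by_key(pairs, func):
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


counts = reduce_by_key(pairs, lambda x, y: x + y)

# "sortBy": order the (word, count) pairs by their count, ascending
sorted_counts = sorted(counts.items(), key=lambda key_value: key_value[1])
```

Each step consumes the output of the previous one, which is the same chain of transformations the Spark cells build up on a distributed data set.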

In [1]:
# Load the text files from S3 (every line becomes one record)
text = sc.textFile('s3://dimajix-training/data/alice/')

In [2]:
# Split every line into words
def split_line(line):
    return line.split()


words = text.flatMap(split_line)

In [4]:
# Pair every word with an initial count of 1
def count_word(word):
    return (word, 1)


words_one = words.map(count_word)

In [7]:
# Sum up the counters of each word
def add_counts(x, y):
    return x + y


words_count = words_one.reduceByKey(add_counts)

In [8]:
# Sort by word frequency (sortBy sorts in ascending order by default)
def extract_counter(key_value):
    return key_value[1]


sorted_words_count = words_count.sortBy(extract_counter)

In [11]:
# Format every (word, count) pair as a tab-separated line
def make_tsv(key_value):
    return key_value[0] + '\t' + str(key_value[1])


tsv_result = sorted_words_count.map(make_tsv)

In [13]:
# Save Result
tsv_result.saveAsTextFile("alice_counts")

# Concise Formulation

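The same pipeline can be written much more compactly using Python lambda expressions. Note two small differences from the step-by-step version: the result is sorted in descending order, and word and count are joined with ':' instead of a tab. Since both versions write to the same alice_counts directory, the earlier output has to be removed first, because saveAsTextFile fails if the target already exists.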
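As a quick sanity check that a lambda can stand in for one of the named helper functions, the following plain-Python sketch compares the two on a small list. The `sample` words are invented for illustration, and Python's built-in `map` is used here in place of the Spark method of the same name.

```python
# Named helper, as in the step-by-step version
def count_word(word):
    return (word, 1)


# Hypothetical sample words
sample = ["alice", "rabbit", "alice"]

# Applying the named function and the equivalent lambda gives the same pairs
named_result = list(map(count_word, sample))
lambda_result = list(map(lambda word: (word, 1), sample))
```

The lambda simply inlines the function body at the place where it is used, which is why the chained version needs no separate definitions.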

In [None]:
text = sc.textFile('s3://dimajix-training/data/alice/')
words = (
    text.flatMap(lambda x: x.split())
    .map(lambda x: (x, 1))
    .reduceByKey(lambda x, y: x + y)
    .sortBy(lambda x: x[1], ascending=False)
    .map(lambda p: p[0] + ':' + str(p[1]))
)

words.saveAsTextFile('alice_counts')