In [1]:
from pyspark.sql import SparkSession
from operator import add

# New API
spark_session = SparkSession\
        .builder\
        .master("spark://192.168.2.119:7077") \
        .appName("LingkaiZhu")\
        .config("spark.executor.cores",2)\
        .config("spark.dynamicAllocation.enabled", True)\
        .config("spark.dynamicAllocation.shuffleTracking.enabled", True)\
        .config("spark.shuffle.service.enabled", False)\
        .config("spark.dynamicAllocation.executorIdleTimeout","30s")\
        .config("spark.driver.port",9998)\
        .config("spark.blockManager.port",10005)\
        .getOrCreate()
spark_context = spark_session.sparkContext

spark_context.setLogLevel("ERROR")

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/02/17 13:38:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/02/17 13:38:35 WARN ExecutorAllocationManager: Dynamic allocation without a shuffle service is an experimental feature.


# Part A - Working with the RDD API

## Question A.1

A.1.1 Read the English transcripts with Spark, and count the number of lines.

In [2]:
lines_english = spark_context.textFile("hdfs://192.168.2.119:9000/europarl/europarl-v7.de-en.en")
print(lines_english.first())
lines_english1 = lines_english.map(lambda line: line.split('\n'))
line_english_counts = lines_english1.map(lambda w: len(w))
total_english_lines = line_english_counts.reduce(add)
print(f'total number of lines = {total_english_lines}')

                                                                                

Resumption of the session




total number of lines = 1920209
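
`textFile` yields one record per line, so a record never contains a newline. `line.split('\n')` therefore always returns a one-element list, and summing the lengths gives the number of records. Below is a minimal pure-Python sketch of that reasoning. The sample records are made up for illustration, and a plain list stands in for the RDD.

```python
from functools import reduce
from operator import add

# Hypothetical stand-in for an RDD produced by textFile: one string per line,
# with the trailing newline already removed.
sample_records = ["Resumption of the session", "Please rise, then.", ""]

# Same steps as the cell above, applied to a plain Python list.
split_records = [line.split('\n') for line in sample_records]
lengths = [len(parts) for parts in split_records]
total_lines = reduce(add, lengths)

# Every record splits into exactly one piece, so the total equals the record count.
print(lengths)
print(total_lines == len(sample_records))
```

On the cluster, `lines_english.count()` should therefore return the same number without the intermediate `map` steps.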


                                                                                

A.1.2 Do the same with the other language (so that you have a separate lineage of RDDs for each).

In [3]:
lines_de = spark_context.textFile("hdfs://192.168.2.119:9000/europarl/europarl-v7.de-en.de")
print(lines_de.first())
lines_de1 = lines_de.map(lambda line: line.split('\n'))
line_de_counts = lines_de1.map(lambda w: len(w))
total_de_lines = line_de_counts.reduce(add)
print(f'total number of lines = {total_de_lines}')

Wiederaufnahme der Sitzungsperiode




total number of lines = 1920209


                                                                                

A.1.3 Verify that the line counts are the same for the two languages.

Both the English and the German transcripts contain 1920209 lines, so the line counts match.

A.1.4 Count the number of partitions.

In [4]:
print("number of partitions of the english:", lines_english.getNumPartitions())
print("number of partitions of the original:", lines_de.getNumPartitions())

number of partitions of the english: 3
number of partitions of the original: 3


## Question A.2

A.2.1 Pre-process the text from both RDDs by doing the following:

● Lowercase the text

● Tokenize the text (split on space)

Hint: define a function to run in your driver application to avoid writing this code twice.

In [9]:
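def preprocess(lines):
    """Lowercase every line of an RDD and split it into words.

    Returns (lowercase_lines, words) so callers can use either RDD.
    """
    lowercase_lines = lines.map(lambda line: line.lower())
    # Tokenize on single spaces; the extra newline split is only a safeguard,
    # since textFile already yields one record per line.
    words = lowercase_lines\
        .flatMap(lambda line: line.split(' '))\
        .flatMap(lambda line: line.split('\n'))
    return lowercase_lines, words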
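
To see what the tokenization does to a single line, here is a small pure-Python sketch that applies the same two string operations outside Spark. The helper name `preprocess_line` and the sample sentence are hypothetical and only serve as illustration.

```python
def preprocess_line(line):
    """Lowercase one line and split it on single spaces (hypothetical helper)."""
    return line.lower().split(' ')

# Sample line with a double space after the comma, made up for illustration.
sample = "Madam President,  on a point of order."
tokens = preprocess_line(sample)
print(tokens)

# Dropping empty tokens is an optional extra step, not part of the function above.
non_empty = [t for t in tokens if t]
print(non_empty)
```

Splitting on a single space keeps punctuation attached to words and produces an empty string wherever two spaces are adjacent, which is worth keeping in mind when interpreting the word counts later on.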

A.2.2 Inspect 10 entries from each of your RDDs to verify your pre-processing.

In [6]:
# english
[english_lowercase_lines, _] = preprocess(lines_english)
print(english_lowercase_lines.take(10))
print("----------------------------------------------")
# original language
[de_lowercase_lines, _] = preprocess(lines_de)
print(de_lowercase_lines.take(10))

['resumption of the session', 'i declare resumed the session of the european parliament adjourned on friday 17 december 1999, and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.', "although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.", 'you have requested a debate on this subject in the course of the next few days, during this part-session.', "in the meantime, i should like to observe a minute' s silence, as a number of members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the european union.", "please rise, then, for this minute' s silence.", "(the house rose and observed a minute' s silence)", 'madam president, on a point of order.', 'you will be aware from the press and television that there have been a num

A.2.3 Verify that the line counts still match after the pre-processing.

In [7]:
# english
lines_english1 = english_lowercase_lines.map(lambda line: line.split('\n'))
line_english_counts = lines_english1.map(lambda w: len(w))
total_english_lines = line_english_counts.reduce(add)
print(f'total number of lines = {total_english_lines}')



total number of lines = 1920209


                                                                                

In [8]:
# original 
lines_de1 = de_lowercase_lines.map(lambda line: line.split('\n'))
line_de_counts = lines_de1.map(lambda w: len(w))
total_de_lines = line_de_counts.reduce(add)
print(f'total number of lines = {total_de_lines}')



total number of lines = 1920209


                                                                                

After pre-processing, the line counts are exactly the same as before: both RDDs still contain 1920209 lines.

## Question A.3

A.3.1 Use Spark to compute the 10 most frequently occurring words in the English language corpus. Repeat for the other language.

In [17]:
# english
[_, english_words] = preprocess(lines_english)
english_word_key = english_words.map(lambda w: w.strip()).map(lambda w: (w, 1))
english_word_counts = english_word_key.reduceByKey(add)
print(english_word_counts.takeOrdered(10, key=lambda x: -x[1]))



[('the', 3663193), ('of', 1736975), ('to', 1611788), ('and', 1345073), ('in', 1134026), ('that', 835874), ('a', 810540), ('is', 792564), ('for', 557349), ('we', 551244)]


                                                                                

In [18]:
# original 
[_, de_words] = preprocess(lines_de)
de_word_key = de_words.map(lambda w: w.strip()).map(lambda w: (w, 1))
de_word_counts = de_word_key.reduceByKey(add)
print(de_word_counts.takeOrdered(10, key=lambda x: -x[1]))



[('die', 1980477), ('der', 1710353), ('und', 1337721), ('in', 781362), ('zu', 618872), ('den', 577654), ('wir', 489036), ('für', 478326), ('ich', 469025), ('das', 466127)]


                                                                                

In [24]:
english_words.take(3)
english_word_key.take(3)

[('resumption', 1), ('of', 1), ('the', 1), ('session', 1), ('i', 1)]

A.3.2 Verify that your results are reasonable.

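The results look reasonable: both top-10 lists consist almost entirely of common function words (articles, prepositions, conjunctions and pronouns), which is what one would expect from natural-language text. The pipeline used to obtain the 10 most frequently occurring words:

1. Get the split words using the 'preprocess' function, e.g. ['resumption', 'of', 'the'].
2. Map step: strip surrounding whitespace and turn each word into a key-value pair, e.g. [('resumption', 1), ('of', 1), ('the', 1)].
3. Reduce step: combine the pairs that share a key by adding up their values, e.g. ('of', 1), ('of', 1) --> ('of', 2).
4. Output the 10 pairs with the largest counts.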
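
The four steps above can be sketched in plain Python on a tiny made-up corpus. This is only a local analogue under stated assumptions: a `Counter` stands in for `reduceByKey(add)`, `heapq.nsmallest` with a negated count stands in for `takeOrdered`, and the sample lines are invented for illustration.

```python
from collections import Counter
import heapq

# Made-up sample corpus; plain lists stand in for the RDDs.
sample_lines = [
    "The session of the European Parliament",
    "The president of the session of the house",
]

# Step 1: lowercase and tokenize on single spaces.
words = [w for line in sample_lines for w in line.lower().split(' ')]

# Step 2 (map): strip whitespace and build (word, 1) pairs.
pairs = [(w.strip(), 1) for w in words]

# Step 3 (reduce): add up the values of pairs that share a key.
counts = Counter()
for word, one in pairs:
    counts[word] += one

# Step 4: take the n entries with the largest counts by ordering on the negated count.
top3 = heapq.nsmallest(3, counts.items(), key=lambda x: -x[1])
print(top3)
```

Negating the count in the key turns an ascending ordering into a descending one, which is the same trick the Spark cells use with `takeOrdered(10, key=lambda x: -x[1])`.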

In [None]:
## Example #1 - Filter by top-level domain and compute most common words ##
import re

# Try .ac.uk, .ru, .se, .com
p = re.compile(r'WARC-Target-URI: \S+\.ac\.uk', re.IGNORECASE)

# NOTE: `rdd` is not defined in this notebook; the code below expects an RDD of
# pairs whose second element (index 1) is the document text.
# Note: .partition(..) returns a 3-tuple: the string before the separator (index 0),
# the separator (index 1), and the part of the string afterwards (index 2) -- which is the part we want.
all_words = rdd\
    .filter(lambda doc: bool(p.search(doc[1])))\
    .map(lambda web_text: web_text[1].partition('\r\n\r\n')[2])\
    .flatMap(lambda t: t.split(' '))\
    .flatMap(lambda w: w.split('\n'))

all_words_and_count = all_words.map(lambda w: w.strip())\
    .map(lambda w: (w, 1))

word_counts = all_words_and_count.reduceByKey(add)

print(word_counts.takeOrdered(60, key=lambda x: -x[1]))
