# Prerequisites



*   Bellouch Ayoub
*   Mafkoud Khaoula
*   Hamid Hiba
*   Berkani Mohammed Adam



In [None]:
# Update packages and install required java version
!apt-get update
!apt-get install openjdk-21-jdk-headless -qq > /dev/null

# download and unzip spark
!wget -nc -q https://downloads.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz
!tar xf spark-4.0.0-bin-hadoop3.tgz

# get data for labs
!wget -nc -O around_the_world_in_80_days.txt https://www.gutenberg.org/ebooks/103.txt.utf-8

# install findspark
!pip install -q findspark

Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:3 https://cli.github.com/packages stable InRelease [3,917 B]
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Hit:6 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:9 https://cli.github.com/packages stable/main amd64 Packages [346 B]
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Get:11 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [9,299 kB]
Hit:12 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:13 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:14 https://r2u.stat

In [None]:
import os
import findspark

# set env vars for java and spark
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-21-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-4.0.0-bin-hadoop3"

# start findspark so notebook can interact with spark
findspark.init()


In [None]:
# what does findspark do? use the ?? magic command to find out
# It locates the Spark installation (via SPARK_HOME) and adds pyspark to sys.path
# Note 1: in colab, this may open in a side panel
# Note 2: this magic command is often helpful when encountering an object in a
# notebook that is unfamiliar. More information will be displayed if it exists
?? findspark

# 1. Word Count

Instructions:  
For each cell marked "double-click and add explanation here", please answer the question in your own words.  
In the section where you complete the code to perform basic NLP text cleaning and exploration tasks, the goal is to chain all of the transformations together in a single function. For learning and exploration purposes, it is acceptable to keep each step separate, but the last cell in this section should be one function with all transformations chained together.  
For steps c and f, it is acceptable to use your favorite chatbot to generate a list of common stop words (c) and punctuation (f) for use in the code. As these are common steps in NLP/text-processing tasks, there are plenty of libraries to help with this, such as nltk, but there is no need to import extra dependencies for this lab unless you are already familiar with them.
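
As a dependency-free starting point for steps c and f, the standard library already ships a punctuation list. The sketch below is only an illustration: the stop-word set is a deliberately tiny assumption, and the helper name `clean` is hypothetical. Note that `string.punctuation` covers ASCII characters only, so typographic quotes such as “ and ’ in this book would need separate handling.

```python
import string

# Assumption: a deliberately small stop-word set, for illustration only.
stop_words = {'the', 'is', 'and', 'a', 'an', 'in', 'on', 'of', 'to'}

# With three arguments, str.maketrans maps every character in the
# third argument to None, i.e. deletes it during translate().
strip_table = str.maketrans('', '', string.punctuation)


def clean(word):
    """Lower-case a token and delete ASCII punctuation from it."""
    return word.lower().translate(strip_table)


tokens = ['The', 'States,', 'eBook.', 're-use', 'of']
cleaned = [clean(t) for t in tokens if clean(t) not in stop_words]
print(cleaned)
```

The same `clean` function could be passed to `rdd.map` later in the lab instead of a regular expression.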

In [None]:
# start a spark session and create spark context for making rdd
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("word_count") \
    .getOrCreate()

sc = spark.sparkContext

In [None]:
# Define the RDD
rdd = sc.textFile('/content/around_the_world_in_80_days.txt')

In [None]:
# view the first x lines of the rdd
rdd.take(20)

['The Project Gutenberg eBook of Around the World in Eighty Days',
 '    ',
 'This ebook is for the use of anyone anywhere in the United States and',
 'most other parts of the world at no cost and with almost no restrictions',
 'whatsoever. You may copy it, give it away or re-use it under the terms',
 'of the Project Gutenberg License included with this ebook or online',
 'at www.gutenberg.org. If you are not located in the United States,',
 'you will have to check the laws of the country where you are located',
 'before using this eBook.',
 '',
 'Title: Around the World in Eighty Days',
 '',
 'Author: Jules Verne',
 '',
 'Release date: January 1, 1994 [eBook #103]',
 '                Most recently updated: October 29, 2024',
 '',
 'Language: English',
 '',
 '']

In [None]:
# example lambda function
words = rdd.flatMap(lambda lines: lines.split(' '))

In [None]:
# Note and explain the output of the below command
words

PythonRDD[3] at RDD at PythonRDD.scala:56


words is not the data itself but an RDD object, which is why we do not see the contents as we did with rdd.take(20).
flatMap is a transformation, so Spark has only recorded how to build the RDD; nothing has been executed yet.
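
The lazy behaviour can be illustrated without Spark using a plain Python generator. This is only an analogy under that assumption, not how Spark is implemented, and `split_lines` is a hypothetical helper name.

```python
def split_lines(lines):
    """Lazily yield words; nothing runs until the generator is consumed."""
    for line in lines:
        for word in line.split(' '):
            yield word


lines = ['Around the World', 'in Eighty Days']

# Like a transformation: this only builds a recipe, no splitting happens yet.
lazy_words = split_lines(lines)
print(lazy_words)  # shows a generator object, not the words

# Like an action such as collect(): the work is done here.
materialized = list(lazy_words)
print(materialized)
```

One difference worth keeping in mind: a generator can be consumed only once, whereas an RDD can be recomputed each time an action is called on it.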

In [None]:
# Note and explain the output of the following command, focusing on the difference with the
# above command
words.collect()

['The',
 'Project',
 'Gutenberg',
 'eBook',
 'of',
 'Around',
 'the',
 'World',
 'in',
 'Eighty',
 'Days',
 '',
 '',
 '',
 '',
 '',
 'This',
 'ebook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'in',
 'the',
 'United',
 'States',
 'and',
 'most',
 'other',
 'parts',
 'of',
 'the',
 'world',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever.',
 'You',
 'may',
 'copy',
 'it,',
 'give',
 'it',
 'away',
 'or',
 're-use',
 'it',
 'under',
 'the',
 'terms',
 'of',
 'the',
 'Project',
 'Gutenberg',
 'License',
 'included',
 'with',
 'this',
 'ebook',
 'or',
 'online',
 'at',
 'www.gutenberg.org.',
 'If',
 'you',
 'are',
 'not',
 'located',
 'in',
 'the',
 'United',
 'States,',
 'you',
 'will',
 'have',
 'to',
 'check',
 'the',
 'laws',
 'of',
 'the',
 'country',
 'where',
 'you',
 'are',
 'located',
 'before',
 'using',
 'this',
 'eBook.',
 '',
 'Title:',
 'Around',
 'the',
 'World',
 'in',
 'Eighty',
 'Days',
 '',
 'Author:',
 'Jules'

words.collect() is an action: it forces Spark to run the computation and return every element to the driver as a Python list. words on its own is only a description of the RDD.

In [None]:
# nicer print
for w in words.collect():
    print(w)

The output stream was truncated and contains only the last 5000 lines.
was
drawing
near
his
last
turning-point.
The
bonds
were
quoted,
no
longer
at
a
hundred
below
par,
but
at
twenty,
at
ten,
and
at
five;
and
paralytic
old
Lord
Albemarle
bet
even
in
his
favour.

A
great
crowd
was
collected
in
Pall
Mall
and
the
neighbouring
streets
on
Saturday
evening;
it
seemed
like
a
multitude
of
brokers
permanently
established
around
the
Reform
Club.
Circulation
was
impeded,
and
everywhere
disputes,
discussions,
and
financial
transactions
were
going
on.
The
police
had
great
difficulty
in
keeping
back
the
crowd,
and
as
the
hour
when
Phileas
Fogg
was
due
approached,
the
excitement
rose
to
its
highest
pitch.

The
five
antagonists
of
Phileas
Fogg
had
met
in
the
great
saloon
of
the
club.
John
Sullivan
and
Samuel
Fallentin,
the
bankers,
Andrew
Stuart,
the
engineer,
Gauthier
Ralph,
the
director
of
the
Bank
of
England,
and
Thomas
Flanagan,
the
brewer,
one
and
all
waited
anxiously.

When


This produces the same result as words.collect(), but prints one word per line instead of displaying a Python list.

In [None]:
# Print first x words
words.take(20)

['The',
 'Project',
 'Gutenberg',
 'eBook',
 'of',
 'Around',
 'the',
 'World',
 'in',
 'Eighty',
 'Days',
 '',
 '',
 '',
 '',
 '',
 'This',
 'ebook',
 'is',
 'for']

Here, words.take(20) returns only the first 20 words instead of the whole RDD.

In [None]:
# Use cell magic command to help understand what the rdd.flatMap function is doing in the next cell.
# Insert a text/markdown cell and explain in your own words.

In [None]:
?? rdd.flatMap


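flatMap applies a function to each element of the RDD and then flattens the results, so every item produced by the function becomes its own element in a single RDD. Here, each line yields a list of words, and the output is one RDD of words rather than an RDD of lists.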
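
The difference between map and flatMap can be sketched in plain Python, with no Spark required. The list comprehension stands in for rdd.map and `itertools.chain.from_iterable` stands in for the flattening step; both are analogies rather than Spark calls.

```python
from itertools import chain

lines = ['Around the World', 'in Eighty Days']

# Analogue of rdd.map: one output element per input line (a list of lists).
mapped = [line.split(' ') for line in lines]
print(mapped)

# Analogue of rdd.flatMap: the per-line lists are flattened into one sequence.
flat_mapped = list(chain.from_iterable(mapped))
print(flat_mapped)
```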

In [None]:
# Initialize a word counter by creating a tuple with the word and a count of 1
words = rdd.flatMap(lambda lines: lines.split(' ')) \
                    .map(lambda word: (word, 1))

for w in words.collect():
    print(w)

The output stream was truncated and contains only the last 5000 lines.
('was', 1)
('drawing', 1)
('near', 1)
('his', 1)
('last', 1)
('turning-point.', 1)
('The', 1)
('bonds', 1)
('were', 1)
('quoted,', 1)
('no', 1)
('longer', 1)
('at', 1)
('a', 1)
('hundred', 1)
('below', 1)
('par,', 1)
('but', 1)
('at', 1)
('twenty,', 1)
('at', 1)
('ten,', 1)
('and', 1)
('at', 1)
('five;', 1)
('and', 1)
('paralytic', 1)
('old', 1)
('Lord', 1)
('Albemarle', 1)
('bet', 1)
('even', 1)
('in', 1)
('his', 1)
('favour.', 1)
('', 1)
('A', 1)
('great', 1)
('crowd', 1)
('was', 1)
('collected', 1)
('in', 1)
('Pall', 1)
('Mall', 1)
('and', 1)
('the', 1)
('neighbouring', 1)
('streets', 1)
('on', 1)
('Saturday', 1)
('evening;', 1)
('it', 1)
('seemed', 1)
('like', 1)
('a', 1)
('multitude', 1)
('of', 1)
('brokers', 1)
('permanently', 1)
('established', 1)
('around', 1)
('the', 1)
('Reform', 1)
('Club.', 1)
('Circulation', 1)
('was', 1)
('impeded,', 1)
('and', 1)
('everywhere', 1)
('disputes,', 1)

In [None]:
# a. count the occurrences of each word
word_counts = words.reduceByKey(lambda a, b: a + b)
for w in word_counts.collect():
    print(w)


('Gutenberg', 22)
('eBook', 4)
('of', 1875)
('Around', 4)
('', 2182)
('for', 407)
('use', 16)
('anyone', 6)
('United', 23)
('States', 10)
('and', 1792)
('most', 43)
('other', 59)
('world', 30)
('at', 576)
('no', 124)
('cost', 9)
('with', 550)
('almost', 19)
('restrictions', 2)
('give', 17)
('it', 322)
('re-use', 2)
('under', 41)
('License', 8)
('this', 292)
('online', 4)
('www.gutenberg.org.', 4)
('If', 31)
('you', 243)
('are', 169)
('States,', 6)
('will', 108)
('have', 259)
('to', 1690)
('country', 18)
('where', 65)
('before', 77)
('using', 6)
('Author:', 1)
('date:', 1)
('1,', 1)
('[eBook', 1)
('#103]', 1)
('October', 11)
('29,', 1)
('***', 4)
('START', 1)
('PROJECT', 4)
('EBOOK', 2)
('AROUND', 4)
('IN', 81)
('EIGHTY', 2)
('[Illustration]', 1)
('by', 377)
('I.', 2)
('FOGG', 34)
('EACH', 4)
('MASTER,', 4)
('OTHER', 3)
('MAN', 2)
('II.', 2)
('THAT', 8)
('HAS', 2)
('AT', 12)
('HIS', 18)
('IDEAL', 2)
('III.', 2)
('A', 89)
('SEEMS', 2)
('IV.', 2)
('ASTOUNDS', 2)
('SERVANT', 2)
('V.', 2)
(
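
To see what reduceByKey does with the (word, 1) pairs, here is a plain-Python stand-in. It is a sketch for intuition only: `reduce_by_key` is a hypothetical helper, and real Spark performs the reduction per partition and then merges partial results across the cluster.

```python
from functools import reduce

pairs = [('the', 1), ('fogg', 1), ('the', 1), ('the', 1)]


def reduce_by_key(pairs, func):
    """Group values by key, then fold each group pairwise with func."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)
```

Because the partial results may be combined in any order, the function passed to reduceByKey should be associative and commutative, as addition is here.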

In [None]:
# b. a common first step in text analysis, change all capital letters to lower case
words = rdd.flatMap(lambda line: line.split(' ')) \
           .map(lambda word: (word.lower(), 1))
for w in words.collect():
    print(w)

The output stream was truncated and contains only the last 5000 lines.
('was', 1)
('drawing', 1)
('near', 1)
('his', 1)
('last', 1)
('turning-point.', 1)
('the', 1)
('bonds', 1)
('were', 1)
('quoted,', 1)
('no', 1)
('longer', 1)
('at', 1)
('a', 1)
('hundred', 1)
('below', 1)
('par,', 1)
('but', 1)
('at', 1)
('twenty,', 1)
('at', 1)
('ten,', 1)
('and', 1)
('at', 1)
('five;', 1)
('and', 1)
('paralytic', 1)
('old', 1)
('lord', 1)
('albemarle', 1)
('bet', 1)
('even', 1)
('in', 1)
('his', 1)
('favour.', 1)
('', 1)
('a', 1)
('great', 1)
('crowd', 1)
('was', 1)
('collected', 1)
('in', 1)
('pall', 1)
('mall', 1)
('and', 1)
('the', 1)
('neighbouring', 1)
('streets', 1)
('on', 1)
('saturday', 1)
('evening;', 1)
('it', 1)
('seemed', 1)
('like', 1)
('a', 1)
('multitude', 1)
('of', 1)
('brokers', 1)
('permanently', 1)
('established', 1)
('around', 1)
('the', 1)
('reform', 1)
('club.', 1)
('circulation', 1)
('was', 1)
('impeded,', 1)
('and', 1)
('everywhere', 1)
('disputes,', 1)

In [None]:
# c. eliminate the stop words.
stop_words = {'the', 'is', 'and', 'a', 'an', 'in', 'on', 'of', 'to'}

words = rdd.flatMap(lambda line: line.split(' ')) \
           .map(lambda word: word.lower()) \
           .filter(lambda word: word not in stop_words) \
           .map(lambda word: (word, 1))
for w in words.collect():
    print(w)

The output stream was truncated and contains only the last 5000 lines.
('he', 1)
('thought', 1)
('fix,', 1)
('but', 1)
('no', 1)
('longer', 1)
('anger.', 1)
('fix,', 1)
('like', 1)
('all', 1)
('world,', 1)
('had', 1)
('been', 1)
('mistaken', 1)
('phileas', 1)
('fogg,', 1)
('had', 1)
('only', 1)
('done', 1)
('his', 1)
('duty', 1)
('tracking', 1)
('arresting', 1)
('him;', 1)
('while', 1)
('he,', 1)
('passepartout.', 1)
('.', 1)
('.', 1)
('.', 1)
('this', 1)
('thought', 1)
('haunted', 1)
('him,', 1)
('he', 1)
('never', 1)
('ceased', 1)
('cursing', 1)
('his', 1)
('miserable', 1)
('folly.', 1)
('', 1)
('finding', 1)
('himself', 1)
('too', 1)
('wretched', 1)
('remain', 1)
('alone,', 1)
('he', 1)
('knocked', 1)
('at', 1)
('aouda’s', 1)
('door,', 1)
('went', 1)
('into', 1)
('her', 1)
('room,', 1)
('seated', 1)
('himself,', 1)
('without', 1)
('speaking,', 1)
('corner,', 1)
('looked', 1)
('ruefully', 1)
('at', 1)
('young', 1)
('woman.', 1)
('aouda', 1)
('was', 1)
('still', 1

In [None]:
# d. sort in alphabetical order
words = word_counts.sortBy(lambda x: x[0])

for w in words.collect():
    print(w)

('', 2182)
('#103]', 1)
('$5,000)', 1)
('&c.,', 1)
('($1', 1)
('(801)', 1)
('(Japan),', 1)
('(Saturday,', 1)
('(Sunday)', 1)
('(a)', 1)
('(and', 1)
('(any', 1)
('(b)', 1)
('(c)', 1)
('(does', 1)
('(if', 1)
('(or', 3)
('(sort', 1)
('(trademark/copyright)', 1)
('(www.gutenberg.org),', 1)
('(“the', 1)
('***', 4)
('-', 3)
('-------', 1)
('.', 3)
('.....', 1)
('........', 1)
('.........', 1)
('.............', 2)
('.................', 1)
('...................', 1)
('....................', 1)
('............................................', 1)
('1,', 1)
('1,000.', 1)
('1,170,', 1)
('1.', 1)
('1.A.', 1)
('1.B.', 1)
('1.C', 1)
('1.C.', 1)
('1.D.', 1)
('1.E', 1)
('1.E.', 1)
('1.E.1', 3)
('1.E.1.', 2)
('1.E.2.', 1)
('1.E.3.', 1)
('1.E.4.', 1)
('1.E.5.', 1)
('1.E.6.', 1)
('1.E.7', 2)
('1.E.7.', 1)
('1.E.8', 2)
('1.E.8.', 2)
('1.E.9.', 3)
('1.F.', 1)
('1.F.1.', 1)
('1.F.2.', 1)
('1.F.3,', 3)
('1.F.3.', 2)
('1.F.4.', 1)
('1.F.5.', 1)
('1.F.6.', 1)
('10th,', 1)
('11', 1)
('11.40', 1)
('117,', 1)
('11

In [None]:
# e. sort descending by word frequency
words = word_counts.sortBy(lambda x: x[1], ascending=False)

for w in words.collect():
    print(w)

('the', 4316)
('', 2182)
('of', 1875)
('and', 1792)
('to', 1690)
('a', 1261)
('in', 991)
('was', 974)
('his', 804)
('he', 720)
('at', 576)
('with', 550)
('not', 500)
('had', 484)
('The', 482)
('on', 482)
('that', 460)
('which', 414)
('for', 407)
('as', 396)
('by', 377)
('Mr.', 371)
('Fogg', 331)
('it', 322)
('be', 307)
('from', 302)
('were', 300)
('this', 292)
('is', 288)
('would', 266)
('have', 259)
('you', 243)
('an', 222)
('Passepartout', 219)
('but', 216)
('Phileas', 214)
('He', 208)
('I', 207)
('or', 185)
('they', 183)
('him', 181)
('who', 179)
('are', 169)
('been', 167)
('said', 155)
('her', 146)
('their', 145)
('It', 139)
('if', 138)
('could', 136)
('its', 133)
('all', 131)
('Fogg,', 130)
('one', 125)
('no', 124)
('Fix', 122)
('did', 119)
('Passepartout,', 117)
('“I', 115)
('so', 112)
('upon', 112)
('more', 109)
('will', 108)
('about', 102)
('only', 96)
('when', 96)
('up', 91)
('two', 91)
('out', 90)
('my', 90)
('hundred', 90)
('some', 90)
('A', 89)
('replied', 89)
('now', 87)
(

In [None]:
# f. remove punctuation and blank tokens
import re


words = rdd.flatMap(lambda line: line.split(' ')) \
           .map(lambda word: re.sub(r'[^\w\s]', '', word)) \
           .filter(lambda word: word.strip() != '')

for w in words.collect():
    print(w)


The output stream was truncated and contains only the last 5000 lines.
asked
him
if
it
was
not
too
late
to
notify
the
Reverend
Samuel
Wilson
of
Marylebone
parish
that
evening
Passepartout
smiled
his
most
genial
smile
and
said
Never
too
late
It
was
five
minutes
past
eight
Will
it
be
for
tomorrow
Monday
For
tomorrow
Monday
said
Mr
Fogg
turning
to
Aouda
Yes
for
tomorrow
Monday
she
replied
Passepartout
hurried
off
as
fast
as
his
legs
could
carry
him
CHAPTER
XXXVI
IN
WHICH
PHILEAS
FOGGS
NAME
IS
ONCE
MORE
AT
A
PREMIUM
ON
CHANGE
It
is
time
to
relate
what
a
change
took
place
in
English
public
opinion
when
it
transpired
that
the
real
bankrobber
a
certain
James
Strand
had
been
arrested
on
the
17th
day
of
December
at
Edinburgh
Three
days
before
Phileas
Fogg
had
been
a
criminal
who
was
being
desperately
followed
up
by
the
police
now
he
was
an
honourable
gentleman
mathematically
pursuing
his
eccentric
journey
round
the
world
The
papers
resumed
their
discussion
about
the
wager
a

ALL IN ONE CODE
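
One possible way to chain steps a to f into a single function is sketched below. The function name `count_words`, the `STOP_WORDS` constant, and the ordering of the steps are assumptions made for illustration, not the only valid answer. Steps d and e are combined into one sort: by descending frequency, with ties broken alphabetically.

```python
import re

# Assumption: the small hand-written stop-word set from step c.
STOP_WORDS = {'the', 'is', 'and', 'a', 'an', 'in', 'on', 'of', 'to'}


def count_words(lines_rdd, stop_words=STOP_WORDS):
    """Chain steps a-f on an RDD of lines; return an RDD of (word, count)."""
    return (
        lines_rdd
        .flatMap(lambda line: line.split(' '))           # split into tokens
        .map(lambda word: word.lower())                  # b. lower-case
        .map(lambda word: re.sub(r'[^\w\s]', '', word))  # f. strip punctuation
        .filter(lambda word: word.strip() != '')         # f. drop blank tokens
        .filter(lambda word: word not in stop_words)     # c. drop stop words
        .map(lambda word: (word, 1))                     # initialise counters
        .reduceByKey(lambda a, b: a + b)                 # a. count per word
        .sortBy(lambda pair: (-pair[1], pair[0]))        # e. then d. for ties
    )


# Usage in this notebook, with rdd as defined earlier:
# count_words(rdd).take(20)
```

Punctuation is stripped before the stop-word filter so that tokens such as 'the,' are reduced to 'the' and then removed.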

# 2. What does the following cell block do?
Comment the code below line by line after the provided hash sign (#). You should be able to explain each line while respecting the PEP 8 style guide limit of 79 characters or fewer per line!

In [None]:
# Create an RDD of (name, age) tuples
dataRDD = sc.parallelize([("Brooke", 20), ("Denny", 31), ("Jules", 30),
                          ("TD", 35), ("Brooke", 25)])

# Try to understand what this code does (line by line)
agesRDD = (dataRDD
           # Change to (name, (age, 1)) for counting
           .map(lambda x: (x[0], (x[1], 1)))
           # Add up ages and counts for each name
           .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
           # Divide total age by count to get average
           .map(lambda x: (x[0], x[1][0] / x[1][1])))
```
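
The (sum, count) trick used above can be traced by hand with the same five records. The following plain-Python sketch mirrors the three steps without Spark; the dictionary-based accumulation is an assumption standing in for reduceByKey.

```python
data = [("Brooke", 20), ("Denny", 31), ("Jules", 30),
        ("TD", 35), ("Brooke", 25)]

# Accumulate a running (age_sum, count) pair per name,
# mirroring the map to (age, 1) followed by reduceByKey.
totals = {}
for name, age in data:
    age_sum, count = totals.get(name, (0, 0))
    totals[name] = (age_sum + age, count + 1)

# Final map step: divide the total age by the count.
averages = {name: age_sum / count
            for name, (age_sum, count) in totals.items()}
print(averages)
# {'Brooke': 22.5, 'Denny': 31.0, 'Jules': 30.0, 'TD': 35.0}
```

Only "Brooke" appears twice, so it is the only key for which the reduce step actually combines two values: (20, 1) and (25, 1) become (45, 2).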

# Where to go from here

Further exploration for students who complete the lab before the end of the session or want to go further.

- perform EDA on the original French version of the [book](https://www.gutenberg.org/ebooks/46541.txt.utf-8) and compare the two
- redo the exercises using the Docker install
- install Java and Spark directly onto the host machine and either re-explore this notebook or perform EDA on other data sets
- write a simple Python timer function to see how quickly your RDD runs as written, then change the order of the steps to make it run as efficiently as possible
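
For the timer suggested in the last bullet, one possible shape is a small decorator built on `time.perf_counter`. This is a minimal sketch: the names `timed` and `run_job` are hypothetical, and the workload is a placeholder rather than a Spark job.

```python
import time
from functools import wraps


def timed(func):
    """Print how long each call to func takes, then return its result."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.3f} s")
        return result
    return wrapper


@timed
def run_job():
    # Placeholder workload. In the notebook this would end with an action
    # such as collect() or count(), because transformations alone are lazy
    # and would return almost instantly.
    return sum(range(1_000_000))


total = run_job()
```

When comparing orderings of the steps, time the same action each time, and run it more than once, since the first run may include start-up overhead.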