# Exercise #1 - Getting Started with PySpark
In this exercise you will play with some basic Spark functions.

# Part 1: PySpark Basics

## Python vs PySpark: Map & Reduce
The code below creates a list of numbers, squares each number, and sums the results.
The same computation is shown first in plain Python and then in PySpark so you can compare the two:

In [4]:
from operator import add
l = [1,2,3,4,5]

In [5]:
# Python Example
from functools import reduce

reduce(add, map(lambda i: i*i, l))

In [6]:
# PySpark example
sc.parallelize(l)\
  .map(lambda i: i*i)\
  .reduce(add)
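
To see what `reduce(add, ...)` does with the squared values, the running totals can be traced in plain Python. This is only an illustrative sketch: `squares`, `running_totals` and `total` are names introduced here, and Spark may combine partial results from different partitions in another order, which gives the same answer because addition is associative and commutative.

```python
from functools import reduce
from itertools import accumulate
from operator import add

l = [1, 2, 3, 4, 5]

# map step: square every element
squares = list(map(lambda i: i * i, l))

# reduce step, traced: each entry is the sum of the squares seen so far
running_totals = list(accumulate(squares, add))

# the last running total is exactly what reduce returns
total = reduce(add, squares)

print(squares)         # [1, 4, 9, 16, 25]
print(running_totals)  # [1, 5, 14, 30, 55]
print(total)           # 55
```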

## Task 1 - Read a file
For your first task you'll need to do the following:
1. Upload the `wordcount.txt` file to DBFS
2. Read it with Spark (Hint: the path is `/FileStore/tables/wordcount.txt`)
3. Print the first three lines

In [8]:
# Write your code here
rdd = sc.textFile("/FileStore/tables/wordcount.txt")
rdd.take(3)

## Task 2 - Filter & Count
1. Count how many lines the RDD has
2. Filter the text so it only contains lines with the word 'Hadoop' (case insensitive)
3. Count how many lines remain in the RDD after filtering

In [10]:
# Write your code here
print("Lines before filtering: {}".format(rdd.count()))
rdd_filtered = rdd.filter(lambda line: 'hadoop' in line.lower())
print("Lines after filtering: {}".format(rdd_filtered.count()))
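
The filter predicate can be tried out on a few in-memory strings before running it on the RDD. The sketch below is plain Python; `sample_lines` is made-up data standing in for the file contents, and `mentions_hadoop` is a hypothetical helper that mirrors the lambda used above.

```python
# Hypothetical sample lines standing in for the contents of wordcount.txt
sample_lines = [
    "Hadoop is an open-source framework",
    "Spark can run on top of HADOOP clusters",
    "This line mentions neither",
    "hadoop-based pipelines are common",
]

def mentions_hadoop(line):
    # Lower-casing the line first makes the substring test case insensitive
    return "hadoop" in line.lower()

filtered = [line for line in sample_lines if mentions_hadoop(line)]

print(len(sample_lines))  # 4
print(len(filtered))      # 3
```

Note that this is a substring test, so a line containing `hadoop-based` also matches. If only whole words should count, the predicate would need to tokenize the line first.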

## Task 3 - Word Count
1. Write a word count application using the same input as Task 1 (not the filtered RDD)
2. Print the results (Word -> Count)
3. Sort the output in descending order (the most frequent word comes first) and print the top 20 words with their counts

In [12]:
# Write your code here
from operator import add
word_count = rdd\
              .flatMap(lambda line: line.split(" "))\
              .map(lambda word: (word, 1))\
              .reduceByKey(add)\
              .collect()
# .reduceByKey(add) is equivalent to .reduceByKey(lambda a,b: a+b)
print("Word count output: {}\n".format(word_count))

sorted_words = sorted(word_count, key=lambda x: x[1], reverse=True)
top_20_words = sorted_words[:20]
print("The top 20 words are: {}".format(top_20_words))
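
One detail of the solution above is worth checking on a small string: `split(" ")` splits on every single space, so repeated or trailing spaces produce empty-string "words" that are then counted like any other key. The plain-Python sketch below uses a made-up `line` to compare it with `split()`, which splits on runs of whitespace; `Counter` is used here only as a local stand-in for the `map` plus `reduceByKey` steps.

```python
from collections import Counter

# Hypothetical sample line with a double space and a trailing space
line = "spark  and hadoop and spark "

tokens_single_space = line.split(" ")  # keeps empty strings
tokens_whitespace = line.split()       # drops them

print(tokens_single_space)  # ['spark', '', 'and', 'hadoop', 'and', 'spark', '']
print(tokens_whitespace)    # ['spark', 'and', 'hadoop', 'and', 'spark']

# Local equivalent of (word, 1) pairs summed per key
counts = Counter(tokens_whitespace)
print(counts["spark"], counts["and"], counts["hadoop"])  # 2 2 1
```

If the empty-string key shows up in your top 20, using `line.split()` inside `flatMap` is one way to avoid it.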

## Task 4 - Factorial
1. Write a Spark application that receives a number N and returns the factorial of N (N!). You can assume that N > 0.



#### Definition of factorial
In mathematics, the factorial of a positive integer n, denoted by n!, is the product of all positive integers less than or equal to n:

`n! = n x (n-1) x (n-2) x ... x 3 x 2 x 1`

For example:

`5! = 5 x 4 x 3 x 2 x 1 = 120`

The value of 0! is 1.

In [14]:
N = 5 # Replace N with a small number to test the code
rdd = sc.parallelize(range(1, N+1))

# Write your code here
# Multiply all elements of the RDD together: 1 * 2 * ... * N
rdd.reduce(lambda a,b: a*b)
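
The result can be cross-checked locally against `math.factorial`. The sketch below is plain Python, and `factorial_by_reduce` is a hypothetical helper name. It passes an initial value of 1 to `reduce`, which also covers the `0! = 1` case from the definition; on the Spark side, `rdd.fold(1, mul)` would be a comparable way to supply a neutral starting value, although the task only requires N > 0.

```python
from functools import reduce
from math import factorial
from operator import mul

def factorial_by_reduce(n):
    # Multiply 1 * 2 * ... * n; the initial value 1 makes n == 0 return 1
    return reduce(mul, range(1, n + 1), 1)

# Compare against the standard library for a few small inputs
for n in (0, 1, 5, 10):
    assert factorial_by_reduce(n) == factorial(n)

print(factorial_by_reduce(5))  # 120
```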

## Task 5 - Modulo
1. Write a Spark application that receives natural numbers X1 and X2, such that `X2 > X1 > 0`, and returns all the numbers between X1 and X2 that are divisible by 3 (`x%3 == 0`).
2. Same as the previous step, but receive Y1 and Y2 and return all the numbers between Y1 and Y2 that are divisible by 4.
3. Create an RDD that contains both RDDs from steps 1 and 2, without duplicates and sorted from lowest to highest.
4. Print the sorted RDD from step 3.

In [16]:
X1 = 1
X2 = 50
# Note: range(X1, X2) stops at X2 - 1, so X2 itself is not included
rddX = sc.parallelize(range(X1,X2))

Y1 = 1
Y2 = 50
rddY = sc.parallelize(range(Y1,Y2))

# 1. Write your code here
rddX = rddX.filter(lambda x: x%3 == 0)

# 2. Write your code here
rddY = rddY.filter(lambda y: y%4 == 0)

# 3. Write your code here
rdd = sc.union([rddX, rddY]).distinct().sortBy(lambda x: x)

# 4. Write your code here
rdd.collect()
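
The expected output can be computed locally with Python sets as a cross-check. This is a sketch under the same bounds as the cell above (`range` excludes the upper bound); `multiples_of_3`, `multiples_of_4` and `expected` are names introduced here. The set union plays the role of `union` followed by `distinct`, and `sorted` plays the role of `sortBy`.

```python
X1, X2 = 1, 50
Y1, Y2 = 1, 50

# Same predicates as the two filter steps
multiples_of_3 = {x for x in range(X1, X2) if x % 3 == 0}
multiples_of_4 = {y for y in range(Y1, Y2) if y % 4 == 0}

# Set union removes duplicates such as 12, 24, 36 and 48
expected = sorted(multiples_of_3 | multiples_of_4)

print(len(expected))  # 24
print(expected[:6])   # [3, 4, 6, 8, 9, 12]
```

Comparing `expected` with the list returned by `rdd.collect()` is a quick way to confirm that deduplication and ordering behaved as intended.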