# First steps with PySpark

## Learning objectives

- Get familiar with PySpark RDDs
- Get comfortable with the concept of laziness

Let's start by loading the file from its URI: `s3://full-stack-bigdata-datasets/Big_Data/tears_in_rain.txt`

In [None]:
# spark
# sc = spark.sparkContext

text_rdd = sc.textFile('s3://full-stack-bigdata-datasets/Big_Data/tears_in_rain.txt')

Print out the first line to make sure everything went well.

In [None]:
text_rdd.take(1)

Out[2]: ["I've seen things you people wouldn't believe. "]

Good, you remember how to load a file from a URI. Most of the time, however, you will need to access files stored on distributed file systems such as Amazon S3. Here we show how to access a file in an S3 bucket using specific credentials.

In [None]:
# If you try to load a non-public file without valid credentials, you will get an error

text_rdd_1 = sc.textFile("s3://full-stack-bigdata-datasets/Big_Data/tears_in_rain_not_public.txt")
text_rdd_1.take(1)

# Note that the error only appears when you perform an action: that is the moment Spark actually tries to access the data.

Out[3]: ["I've seen things you people wouldn't believe. "]
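
The note in the cell above describes lazy evaluation: `sc.textFile` only records where the data lives, and nothing is read until an action runs. The sketch below is a plain-Python analogy built on generators, not PySpark code; `read_lines` and the missing path are hypothetical names chosen for the illustration.

```python
import os
import tempfile


def read_lines(path):
    """Generator function: its body runs only once a consumer asks for a value."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line


# A path that is guaranteed not to exist (hypothetical, for illustration only).
missing_path = os.path.join(tempfile.mkdtemp(), "missing.txt")

# Building the pipeline succeeds: nothing has been opened yet.
lines = read_lines(missing_path)
lengths = map(len, lines)

# Requesting the first value plays the role of an action; the failure shows up here.
try:
    first_length = next(lengths)
    error_name = None
except FileNotFoundError as exc:
    first_length = None
    error_name = type(exc).__name__

print(error_name)
```

The two lines that build `lines` and `lengths` never touch the disk, which mirrors how `sc.textFile(...)` and `.map(...)` return immediately while `.take(1)` is the call that can fail.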

In [None]:
# Let's set up everything to easily load the file

FILENAME = 's3://full-stack-bigdata-datasets/Big_Data/tears_in_rain_not_public.txt'

# ACCESS_KEY_ID = "your access key ID" # jedha student account access key
# SECRET_ACCESS_KEY = "your secret access key" # student account secret key

# hadoop_conf = spark._jsc.hadoopConfiguration() # Hadoop configuration that Spark uses to reach S3
# hadoop_conf.set("fs.s3a.access.key", ACCESS_KEY_ID)
# hadoop_conf.set("fs.s3a.secret.key", SECRET_ACCESS_KEY)
# hadoop_conf.set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
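
Hard-coding keys in a notebook makes them easy to leak. One possible alternative, sketched below, reads them from the standard AWS environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`. The helper `configure_s3a` and the stand-in class `DictConf` are hypothetical names introduced for this example; only the three `fs.s3a.*` keys come from the cell above.

```python
import os


def configure_s3a(hadoop_conf, env=None):
    """Copy AWS credentials from environment variables into a Hadoop configuration.

    `hadoop_conf` only needs a `set(key, value)` method, like the object
    returned by spark._jsc.hadoopConfiguration().
    """
    env = os.environ if env is None else env
    hadoop_conf.set("fs.s3a.access.key", env["AWS_ACCESS_KEY_ID"])
    hadoop_conf.set("fs.s3a.secret.key", env["AWS_SECRET_ACCESS_KEY"])
    hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")


class DictConf:
    """Stand-in for the Hadoop configuration so the sketch runs without Spark."""

    def __init__(self):
        self.values = {}

    def set(self, key, value):
        self.values[key] = value


# Placeholder credentials, not real keys.
fake_env = {"AWS_ACCESS_KEY_ID": "my-key-id", "AWS_SECRET_ACCESS_KEY": "my-secret"}
conf = DictConf()
configure_s3a(conf, fake_env)
print(sorted(conf.values))
```

In a notebook with a live session, the call would be `configure_s3a(spark._jsc.hadoopConfiguration())`, assuming both variables are set in the driver's environment.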

1. Load the file at `FILENAME` into a PySpark RDD and store it in a `text_file` variable.

In [None]:
text_file = sc.textFile(FILENAME)
# text_file = sc.textFile("s3://full-stack-bigdata-datasets/Big_Data/tears_in_rain_not_public.txt")

2. Print out the first element of `text_file`.

In [None]:
text_file.take(1)

Out[6]: ["I've seen things you people wouldn't believe. "]

3. That doesn't tell us much. How would you see the first 3 elements of this RDD? Print out the first 3 elements of `text_file`.

In [None]:
text_file.take(3)

Out[7]: ["I've seen things you people wouldn't believe. ",
 'Attack ships on fire off the shoulder of Orion. ',
 'I watched C-beams glitter in the dark near the Tannhäuser Gate. ']

4. What's the type of `text_file`?

In [None]:
type(text_file)

Out[8]: pyspark.rdd.RDD

It's a PySpark `RDD`, which means we can call **actions** on it to get results back.

We want the result to contain every element of the RDD.

5. Collect all elements of `text_file`.

In [None]:
text_file.collect()

Out[9]: ["I've seen things you people wouldn't believe. ",
 'Attack ships on fire off the shoulder of Orion. ',
 'I watched C-beams glitter in the dark near the Tannhäuser Gate. ',
 'All those moments will be lost in time, like tears in rain. ',
 'Time to die.']

6. How many lines are there in `text_file`? Count the number of lines in `text_file`.

In [None]:
print(f"Number of lines = {text_file.count()}")

Number of lines = 5


7. What's the length of each sentence? Call `.map(...)` on your RDD and pass it a function that computes the length of a string. Store the result in `lineLengths`.

*NOTE: `lineLengths` is how you should name your result variable*

In [None]:
# lineLengths = text_file.map(lambda s: len(s))
lineLengths = text_file.map(len)

8. Take the first 3 elements of `lineLengths`.

In [None]:
lineLengths.take(3)

Out[20]: [46, 48, 64]

9. Collect all elements of `lineLengths`.

In [None]:
lineLengths.collect()

Out[21]: [46, 48, 64, 60, 12]
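
For readers who want to check these numbers without a cluster, here is a plain-Python sketch of the same map, take and collect steps. It is an analogy rather than PySpark code: the built-in `map` stands in for `rdd.map`, `itertools.islice` for `take`, and `list` for `collect`. The variable names are illustrative.

```python
from itertools import islice

# The five lines of the file, copied from the collect() output of step 5.
lines = [
    "I've seen things you people wouldn't believe. ",
    "Attack ships on fire off the shoulder of Orion. ",
    "I watched C-beams glitter in the dark near the Tannhäuser Gate. ",
    "All those moments will be lost in time, like tears in rain. ",
    "Time to die.",
]

# Like rdd.map(len): builds a lazy iterator, no length is computed yet.
lazy_lengths = map(len, lines)

# Like take(3): pulls only the first three results.
first_three = list(islice(lazy_lengths, 3))

# Like collect(): materializes every result (from a fresh iterator here,
# because a Python iterator, unlike an RDD, can be consumed only once).
all_lengths = list(map(len, lines))

print(first_three)
print(all_lengths)
```

The trailing space on the first four lines is counted, which is why the first sentence measures 46 characters rather than 45.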

10. What's the average length? Compute the average value of `lineLengths`: `avgLength`

In [None]:
avgLength = lineLengths.mean()

11. What's the type of `avgLength`? Print it out.

In [None]:
type(avgLength)

Out[23]: float

12. Print out `avgLength`

In [None]:
print(f"Average string length: {avgLength}")

Average string length: 46.0


13. Now we want to compute the total length of the document. Compute the sum of all `lineLengths`: `totalLength`

In [None]:
totalLength = lineLengths.sum()

14. What's the type of `totalLength`?

In [None]:
type(totalLength)

Out[27]: int

15. Print out `totalLength`

In [None]:
print(f"Total length of the text: {totalLength}")

Total length of the text: 230
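
The count, the mean and the sum are tied together by mean = sum / count, which gives a quick consistency check on the three results. The sketch below redoes the arithmetic in plain Python from the collected lengths; the snake_case names are illustrative and not part of the exercise.

```python
# Line lengths as returned by lineLengths.collect() in step 9.
line_lengths = [46, 48, 64, 60, 12]

total_length = sum(line_lengths)        # counterpart of lineLengths.sum()
line_count = len(line_lengths)          # counterpart of text_file.count()
avg_length = total_length / line_count  # counterpart of lineLengths.mean()

print(total_length, line_count, avg_length)
print(type(total_length).__name__, type(avg_length).__name__)
```

As in the Spark results, the sum of integers stays an `int` while the division produces a `float`.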


## Bonus: another way to compute the sum would be to use a `reducer`
This is a stepping-stone exercise to prepare you for the next (optional) assignment.

Your goal is to compute the sum of `lineLengths`, just like we did, but this time using `.reduce(...)`.  
Here is the link to the [documentation](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions).

16. Try to compute the total sum, but this time using `.reduce(...)`

In [None]:
totalLength2 = (
    text_file
    .map(len)
    .reduce(lambda a, b: a + b)
)

In [None]:
print(f"Total length of the text: {totalLength2}")

Total length of the text: 230
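
The Spark programming guide asks that the function passed to `reduce` be commutative and associative, so that partial results computed on separate partitions can be merged in any order. The plain-Python sketch below illustrates why addition qualifies, using `functools.reduce` as a stand-in for the RDD method. The two-way split is a hypothetical partitioning chosen for the example; Spark decides the real one.

```python
from functools import reduce

# Line lengths as returned by lineLengths.collect() in step 9.
line_lengths = [46, 48, 64, 60, 12]


def add(a, b):
    """Combine two partial results; addition is commutative and associative."""
    return a + b


# Single pass over all values: (((46 + 48) + 64) + 60) + 12
total_single = reduce(add, line_lengths)

# Hypothetical split into two partitions, each reduced on its own,
# followed by a final reduce over the partial sums.
partitions = [line_lengths[:2], line_lengths[2:]]
partial_sums = [reduce(add, part) for part in partitions]
total_partitioned = reduce(add, partial_sums)

print(total_single, partial_sums, total_partitioned)
```

Both routes give the same total. A function such as subtraction would not, because its result depends on how the values are grouped.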
