These lessons continue from https://realpython.com/pyspark-intro/

In [1]:
# What follows is the equivalent of a Hello World in PySpark
import pyspark
# sc = pyspark.SparkContext("local[*]")

# If you have set up Spark to run separately, your SparkContext
# has likely already been started and is running.
# You can check by uncommenting and running the following:

# sc

In [2]:
# An additional file has been added to this repo

# Here the number of lines in the file is counted.
text_file = "PythonCopyright.txt"
txt = sc.textFile(text_file)
print(f"The number of lines in {text_file} is: {txt.count()}")

The number of lines in PythonCopyright.txt is: 8


In [3]:
python_lines = txt.filter(lambda line: 'python' in line.lower())
print(f"The number of lines containing the word 'python' is: {python_lines.count()}")

The number of lines containing the word 'python' is: 1


To work with `PySpark` properly, you will need to consult the following documentation from time to time: [PySpark and its API](http://spark.apache.org/docs/latest/api/python/index.html) and the [Scala documentation](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/api/index.html). Spark is written in Scala, a language that supports the functional programming paradigm. Code written in a functional style is much easier to parallelize, which is why this approach is common in systems built for cluster-based computation.
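
The functional style mentioned above can be illustrated with Python's built-in `map`, `filter`, and `functools.reduce`, which share their names with the corresponding RDD methods. This is a minimal plain-Python sketch rather than Spark code; the helper names `square` and `is_even` are made up for illustration.

```python
from functools import reduce

numbers = list(range(1, 11))

# Pure functions: the result depends only on the input,
# and no shared state is read or modified.
def square(x):
    return x * x

def is_even(x):
    return x % 2 == 0

squares = list(map(square, numbers))           # transform every element
even_squares = list(filter(is_even, squares))  # keep only matching elements
total = reduce(lambda a, b: a + b, even_squares, 0)  # combine into one value

print(even_squares)
print(total)
```

Because `square` and `is_even` never touch anything outside their arguments, each element could be processed on a different worker without coordination, which is the property that makes this style a good fit for a cluster.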

In [4]:
# Create an RDD by parallelizing a local collection into 2 slices
big_list = range(100000)
rdd = sc.parallelize(big_list, 2)
odds = rdd.filter(lambda x: x % 2 != 0)
odds.take(10)

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
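
To give a rough idea of what the second argument to `sc.parallelize` does, the sketch below splits the same range into two slices in plain Python and filters each slice independently. It is only an illustration of the concept under simplifying assumptions, not Spark's actual implementation; `split_into_slices` is a hypothetical helper.

```python
def split_into_slices(data, num_slices):
    """Split a sequence into num_slices contiguous chunks of near-equal size."""
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

big_list = range(100000)
slices = split_into_slices(big_list, 2)

# Each slice is filtered on its own; no slice needs data from another one.
odds_per_slice = [[x for x in chunk if x % 2 != 0] for chunk in slices]

# Concatenate the partial results in slice order and look at the first ten.
odds = [x for part in odds_per_slice for x in part]

print([len(s) for s in slices])
print(odds[:10])
```

In Spark the per-slice work would be handed to separate workers, and an action such as `take(10)` would trigger the computation.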

In [5]:
# One way to time how long a process takes:

import time

t0 = time.time()
time.sleep(3)
t1 = time.time()

time_dif = t1 - t0
time_dif

3.0033297538757324
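
The same pattern can be wrapped in a small reusable helper. The sketch below uses `time.perf_counter()`, a monotonic high-resolution clock from the standard library that is generally better suited to measuring durations than `time.time()`. The name `timed` is a hypothetical helper, not part of PySpark.

```python
import time

def timed(func, *args, **kwargs):
    """Call func with the given arguments and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Example: time a plain Python computation
result, elapsed = timed(sum, range(1_000_000))
print(f"sum took {elapsed:.4f} seconds")
```

A single measurement can be noisy, so it may be worth repeating the call a few times before drawing conclusions from small differences.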

In [6]:
# Run the same RDD filter again, but time it this time.
# collect() brings the whole result back into the driver's memory.
import time

bigger_list = range(10000000)

# The runs differ only in the number of slices (partitions)
# passed to parallelize
for dist in (2, 6, 10):
    t0 = time.time()
    rdd_n = sc.parallelize(bigger_list, dist)
    odds = rdd_n.filter(lambda x: x % 2 != 0)
    collector = odds.collect()
    t1 = time.time()
    time_dif = round(t1 - t0, 2)
    print(f"With {dist} distributed processes it took {time_dif} seconds")

With 2 distributed processes it took 3.77 seconds
With 6 distributed processes it took 3.54 seconds
With 10 distributed processes it took 2.75 seconds
