# Let's try to understand Spark a little bit using PySpark<br> and the classic *Word Count* example

# Some references:

## [Holden Karau](http://www.bigdataspain.org/2017/speakers/holden-karau/)

- http://youtu.be/Wg2boMqLjCg 
- https://www.youtube.com/watch?v=4xsBQYdHgn8
- https://www.youtube.com/watch?v=V6DkTVvy9vk
- https://www.youtube.com/watch?v=vfiJQ7wg81Y
- https://robertovitillo.com/2015/06/30/spark-best-practices/
- https://www.slideshare.net/SparkSummit/getting-the-best-performance-with-pyspark

#### [HandySpark: bringing pandas-like capabilities to Spark DataFrames](https://towardsdatascience.com/handyspark-bringing-pandas-like-capabilities-to-spark-dataframes-5f1bcea9039e)

![Distributed Spark](http://www.bogotobogo.com/Hadoop/images/PySpark/ComponentsForDistributedExecutionInSpark.png)

![PySpark Python](https://www.packtpub.com/graphics/9781786463708/graphics/B05793_03_01.jpg)

## Word Count Example
- ### <font color=  2e5f54 size=6 face="verdana">Spark’s simplicity makes it all too easy to ignore its execution model and still write jobs that eventually complete.
- ### With larger datasets, understanding what happens under the hood becomes critical to reduce run time and avoid out-of-memory errors.</font>

### RDD operations are compiled into a Directed Acyclic Graph (DAG) of RDD objects, where each RDD points to the parent it depends on:

![DAG](https://raw.githubusercontent.com/MasterMSTC/PySpark_DataFrames_MLib/master/images/image1.jpg)

![Directed Acyclic Graph of RDD objects](https://ravitillo.files.wordpress.com/2015/06/dag1.png)
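
The parent-pointer idea can be sketched in plain Python, without Spark. `ToyRDD` and its methods are hypothetical names made up for this illustration, not part of the PySpark API. Each transformation only records its parent and a function; nothing runs until `collect()` walks the chain back to the source.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: a name, a parent pointer and a function."""

    def __init__(self, name, parent=None, compute=None, source=None):
        self.name = name
        self.parent = parent        # the RDD this one depends on (None for a source)
        self._compute = compute     # how to derive our data from the parent's data
        self._source = source       # raw data, only set on source nodes

    def map(self, fn):
        # Transformation: build a new node, do not touch the data yet.
        return ToyRDD("map", parent=self,
                      compute=lambda data: [fn(x) for x in data])

    def flat_map(self, fn):
        return ToyRDD("flatMap", parent=self,
                      compute=lambda data: [y for x in data for y in fn(x)])

    def lineage(self):
        # Follow the parent pointers from this node back to the source.
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain

    def collect(self):
        # Action: only now is the chain of transformations executed.
        if self.parent is None:
            return list(self._source)
        return self._compute(self.parent.collect())


calls = []  # records every word seen by the map function

def to_pair(word):
    calls.append(word)
    return (word, 1)

lines = ToyRDD("textFile", source=["a b", "b c"])
pairs = lines.flat_map(str.split).map(to_pair)

print(pairs.lineage())   # ['map', 'flatMap', 'textFile']
print(calls)             # [] : nothing has been computed so far
print(pairs.collect())   # [('a', 1), ('b', 1), ('b', 1), ('c', 1)]
```

In real PySpark, a comparable view of the lineage is available through `rdd.toDebugString()`.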

## Best practices

## Spark UI

- ### <font color=red size=4 face="verdana">Running Spark jobs without the **Spark UI** is like flying blind.
- ### The UI lets you monitor and inspect the execution of jobs.
- ### To access it remotely, a SOCKS proxy is needed because the UI also connects to the worker nodes.</font>

## Let's try: Word Count Example

## <font color= 187b1a>Word count using RDD: reduceByKey? groupByKey?</font>
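
Before trying both operators, here is a plain-Python model of why the choice matters. It does not use Spark: the two functions and the `partitions` data are hypothetical and only mimic the behaviour. `groupByKey` sends every `(key, value)` pair through the shuffle, while `reduceByKey` first combines values inside each partition and shuffles at most one pair per key per partition.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """Model of groupByKey: every pair crosses the shuffle unchanged."""
    shuffled = [pair for part in partitions for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    # The sum happens only after all values for a key were moved.
    counts = {key: sum(values) for key, values in groups.items()}
    return counts, len(shuffled)


def shuffle_reduce_by_key(partitions, fn):
    """Model of reduceByKey: combine per partition, then shuffle and merge."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = fn(local[key], value) if key in local else value
        shuffled.extend(local.items())   # one pair per key per partition
    merged = {}
    for key, value in shuffled:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged, len(shuffled)


# Two made-up partitions of (word, 1) pairs.
partitions = [
    [("spark", 1), ("spark", 1), ("rdd", 1)],
    [("spark", 1), ("rdd", 1), ("rdd", 1)],
]

grouped, n_grouped = shuffle_group_by_key(partitions)
reduced, n_reduced = shuffle_reduce_by_key(partitions, lambda a, b: a + b)

print(grouped == reduced)     # True: both give the same word counts
print(n_grouped, n_reduced)   # 6 4: fewer records shuffled with reduceByKey
```

The gap grows with the data: for word counts, the shuffled volume of the second approach is bounded by the number of distinct words per partition, not by the number of words.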

In [9]:
%sh ls /dbfs/databricks-datasets

In [10]:
%sh cat /dbfs/databricks-datasets/README.md

## <font color= 187b1a>Word count using RDD: reduceByKey</font>

- ## TO DO: create `rdd_lines` from the text file `file:/dbfs/databricks-datasets/README.md`

In [13]:
rdd_lines = ???

In [14]:
rdd_lines.take(5)

- ## TO DO: count how many times each line of text occurs

In [16]:
???

- ## TO DO: to count how many times each WORD occurs, we need an `rdd_words` (the lines split into words)

In [18]:
rdd_words = ???

In [19]:
rdd_words.take(5)

- ## TO DO (Group 1): Word count using RDD: reduceByKey

In [21]:
???

## <font color= 187b1a>Word count using RDD: groupByKey</font>

In [23]:
???

## <font color=C70039>We can download some larger text files</font>

In [25]:
import os
import urllib.request

#file_url = 'http://www.gutenberg.org/cache/epub/2000/pg2000.txt'
#file_name = '/resources/data/MSTC/cervantes.txt'

#file_url = 'https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt'
#file_name = '/resources/data/MSTC/t8.shakespeare.txt'

# Note: Spark can read gzip-compressed text files directly; no extra step is needed.
file_url = 'http://ftp.sunet.se/mirror/archive/ftp.sunet.se/pub/tv+movies/imdb/producers.list.gz'
file_name = 'producers.list.gz'

if not os.path.exists(file_name):
    urllib.request.urlretrieve(file_url, file_name)

In [26]:
%sh
gunzip producers.list.gz
ls -al

## <font color=  #7b1864 >Word count using DataFrames:</font>
### without ordering the results...

In [29]:
from pyspark.sql import Row
import pyspark.sql.functions as f

eDF = spark.createDataFrame([Row(a=1, intlist=[1,2,3], mapfield={"a": "b"})])

eDF.select(f.explode(eDF.intlist).alias("LL")).show()
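
To see why `explode` is useful for word counting, here is a plain-Python model of what it does to rows. The `explode` helper below is a hypothetical illustration, not the PySpark function: it emits one output row per element of an array column, which is the step that turns one row per line into one row per word.

```python
from collections import Counter


def explode(rows, column, alias):
    """Yield one single-column row per element of row[column]."""
    for row in rows:
        for item in row[column]:
            yield {alias: item}


# Same shape as the eDF example: one row holding a list of three integers.
rows = [{"a": 1, "intlist": [1, 2, 3]}]
exploded = list(explode(rows, "intlist", "LL"))
print(exploded)   # [{'LL': 1}, {'LL': 2}, {'LL': 3}]

# Applied to text: split each line into a list of words, then explode it.
lines = [{"value": "spark is fast"}, {"value": "spark is lazy"}]
split_rows = [{"words": row["value"].split()} for row in lines]
word_rows = list(explode(split_rows, "words", "word"))

# Grouping by word and counting is then a simple tally.
word_count = Counter(row["word"] for row in word_rows)
print(word_count["spark"], word_count["fast"])   # 2 1
```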

In [30]:
import pyspark
import pyspark.sql.functions as f

# Use the `spark` session (as in the previous cell) instead of the legacy sqlContext.
#df = spark.read.text("file:/databricks/driver/producers.list")
df = spark.read.text("file:/dbfs/databricks-datasets/README.md")

In [31]:
???

## <font color= 187b1a>Word count using RDD</font>
### TO DO: NOW ordering the results...

In [34]:
???

In [35]:
rdd_word_count[0:5]

## <font color=  #7b1864 >Word count using DataFrames:</font>
### TO DO: Now ordering the results...

In [37]:
???

In [38]:
word_count[0:5]