# Let's try to understand Spark a little bit using PySpark<br> and the classic *Word Count* example

# Some references:

## [Holden Karau](http://www.bigdataspain.org/2017/speakers/holden-karau/)

- http://youtu.be/Wg2boMqLjCg 
- https://www.youtube.com/watch?v=4xsBQYdHgn8
- https://www.youtube.com/watch?v=V6DkTVvy9vk
- https://www.youtube.com/watch?v=vfiJQ7wg81Y
- https://robertovitillo.com/2015/06/30/spark-best-practices/ <br>
- https://www.slideshare.net/SparkSummit/getting-the-best-performance-with-pyspark

#### [HandySpark: bringing pandas-like capabilities to Spark DataFrames](https://towardsdatascience.com/handyspark-bringing-pandas-like-capabilities-to-spark-dataframes-5f1bcea9039e)

![Distributed Spark](http://www.bogotobogo.com/Hadoop/images/PySpark/ComponentsForDistributedExecutionInSpark.png)

![PySpark Python](https://www.packtpub.com/graphics/9781786463708/graphics/B05793_03_01.jpg )

## Word Count Example
- ### <font color=  2e5f54 size=6 face="verdana">Spark’s simplicity makes it all too easy to ignore its execution model and still manage to write jobs that eventually complete.
- ### With larger datasets, understanding what happens under the hood becomes critical for reducing run time and avoiding out-of-memory errors.</font>

### RDD operations are compiled into a Directed Acyclic Graph (DAG) of RDD objects, where each RDD points to the parent it depends on:

![DAG](https://raw.githubusercontent.com/MasterMSTC/PySpark_DataFrames_MLib/master/images/image1.jpg)

![Directed Acyclic Graph of RDD objects](https://ravitillo.files.wordpress.com/2015/06/dag1.png)
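
To make the parent-pointer idea concrete, here is a minimal pure-Python sketch. `ToyRDD` is a hypothetical class invented for illustration and is not how Spark is implemented; it only shows that transformations record a pointer to their parent and that nothing is computed until an action such as `collect()` walks the chain.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: a parent pointer plus a lazy transformation."""

    def __init__(self, name, parent=None, transform=None, source=None):
        self.name = name            # label of the operation that produced this node
        self.parent = parent        # the node this one depends on (None for a source)
        self.transform = transform  # function applied to the parent's rows, lazily
        self.source = source        # raw data, only set on source nodes

    def map(self, fn):
        return ToyRDD("map", parent=self,
                      transform=lambda rows: [fn(r) for r in rows])

    def flat_map(self, fn):
        return ToyRDD("flatMap", parent=self,
                      transform=lambda rows: [x for r in rows for x in fn(r)])

    def lineage(self):
        """Walk the parent pointers back to the source."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names

    def collect(self):
        """Action: only now is the chain of transformations evaluated."""
        if self.parent is None:
            return list(self.source)
        return self.transform(self.parent.collect())


lines = ToyRDD("textFile", source=["a b", "b c"])
pairs = lines.flat_map(str.split).map(lambda w: (w, 1))

print(pairs.lineage())  # ['map', 'flatMap', 'textFile']
print(pairs.collect())  # [('a', 1), ('b', 1), ('b', 1), ('c', 1)]
```

On a real RDD, `rdd.toDebugString()` prints the recorded lineage, which is a quick way to compare this toy picture with what Spark actually tracks.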

## Best practices

## Spark UI

- ### <font color=red size=4 face="verdana">Running Spark jobs without the **Spark UI** is like flying blind.
- ### The UI lets you monitor and inspect the execution of jobs.
- ### To access it remotely, a SOCKS proxy is needed because the UI also connects to the worker nodes.</font>

## Let's try: Word Count Example

## <font color= 187b1a>Word count using RDD: reduceByKey? groupByKey?</font>

In [9]:
%sh ls /dbfs/databricks-datasets

In [10]:
%sh cat /dbfs/databricks-datasets/README.md

## <font color= 187b1a>Word count using RDD: reduceByKey</font>

- ## TO DO: create rdd_lines from a text file `file:/dbfs/databricks-datasets/README.md`

In [13]:
# rdd_lines = ???

In [14]:
rdd_lines = sc.textFile("file:/dbfs/databricks-datasets/README.md")

In [15]:
rdd_lines.count()

- ## TO DO: count how many times each line of text occurs

In [17]:
# ???

In [18]:
# NOT THIS: count() returns the total number of lines, not how often each line occurs

rdd_lines.count()

In [19]:
pairs = rdd_lines.map(lambda s: (s, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

#pairs.take(5)
counts.collect()
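
As a rough mental model of what `reduceByKey` computes, the sketch below imitates it on a single machine. `reduce_by_key_local` is a hypothetical helper, not a Spark API, and the sample lines are made up; the point is only that the values of each key are folded pairwise with the supplied function.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key_local(pairs, fn):
    """Local imitation of reduceByKey (hypothetical helper, not a Spark API)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Fold the values of each key pairwise: fn(fn(v1, v2), v3) ...
    return {key: reduce(fn, values) for key, values in grouped.items()}


# Made-up sample: repeated and empty lines, as often found in a README.
sample_lines = ["spark", "", "hadoop", "spark", ""]
line_counts = reduce_by_key_local([(s, 1) for s in sample_lines],
                                  lambda a, b: a + b)

print(line_counts)  # {'spark': 2, '': 2, 'hadoop': 1}
```

Because the function is applied pairwise and in no guaranteed order, it should be associative and commutative, as addition is here.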

- ## TO DO: to count how many times each WORD occurs we need an rdd_words (the lines split into words)

In [21]:
# rdd_words = ???

In [22]:


rdd_words = rdd_lines.flatMap(lambda line: line.split())

In [23]:
rdd_words.take(5)
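
The reason `flatMap` is used here rather than `map` can be seen with plain Python lists. This is a local analogy on made-up sample lines, not Spark code: `map` would keep one list of words per line, while `flatMap` concatenates those lists into a single sequence of words.

```python
from itertools import chain

sample_lines = ["to be or", "not to be"]

# map-like: one output element per input line, so the result is a list of lists
mapped = [line.split() for line in sample_lines]

# flatMap-like: the per-line lists are concatenated into one flat sequence
flat_mapped = list(chain.from_iterable(line.split() for line in sample_lines))

print(mapped)       # [['to', 'be', 'or'], ['not', 'to', 'be']]
print(flat_mapped)  # ['to', 'be', 'or', 'not', 'to', 'be']
```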

## <font color=C70039>We can download some larger text files</font>

In [25]:
import os

from six.moves import urllib

#file_url = 'http://www.gutenberg.org/cache/epub/2000/pg2000.txt'
#file_name = '/resources/data/MSTC/cervantes.txt'

#file_url = 'https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt'
#file_name = '/resources/data/MSTC/t8.shakespeare.txt'

# NOTE: Spark can read gzip-compressed text files directly; no explicit decompression is needed.
file_url='http://ftp.sunet.se/mirror/archive/ftp.sunet.se/pub/tv+movies/imdb/producers.list.gz'
file_name = 'producers.list.gz'
    
if not os.path.exists(file_name):
    urllib.request.urlretrieve(file_url, file_name)
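
The note about compressed files has a close analogue in plain Python, sketched below with a tiny made-up file written to a temporary directory. The file name `sample.list.gz` is an assumption for illustration only. One caveat worth keeping in mind: gzip is not a splittable format, so Spark typically reads a single `.gz` file into one partition.

```python
import gzip
import os
import tempfile

# Write a tiny gzip file to a temporary directory (hypothetical sample data).
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "sample.list.gz")
with gzip.open(gz_path, "wt", encoding="utf-8") as fh:
    fh.write("first line\nsecond line\n")

# Read it back as plain text lines, decompressing on the fly.
with gzip.open(gz_path, "rt", encoding="utf-8") as fh:
    lines = [line.rstrip("\n") for line in fh]

print(lines)  # ['first line', 'second line']
```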

In [26]:
%sh
gunzip producers.list.gz
ls -al

- ## TO DO (Group 1): Word count using RDD: reduceByKey

In [28]:
#rdd_lines = sc.textFile("file:/dbfs/databricks-datasets/README.md")
rdd_lines = sc.textFile("file:/databricks/driver/producers.list")
#rdd_lines = sc.textFile("file:/dbfs/databricks-datasets/README.md")
# rdd_lines = sc.textFile("file:/databricks/driver/producers.list.gz")

rdd_words = rdd_lines.flatMap(lambda line: line.split())

rdd_word_pairs = rdd_words.map(lambda x: (x, 1))

In [29]:
rdd_word_pairs.take(10)

In [30]:
#rdd_lines = sc.textFile("file:/dbfs/databricks-datasets/README.md")
rdd_lines = sc.textFile("file:/databricks/driver/producers.list")
#rdd_lines = sc.textFile("file:/dbfs/databricks-datasets/README.md")
# rdd_lines = sc.textFile("file:/databricks/driver/producers.list.gz")

#rdd_words = rdd_lines.flatMap(lambda line: line.split())

#rdd_word_pairs = rdd_words.map(lambda x: (x, 1))

#word_count = rdd_word_pairs.reduceByKey(lambda x, y : x + y)

word_count = (rdd_lines.flatMap(lambda line: line.split())
                       .map(lambda x: (x, 1))
                       .reduceByKey(lambda x, y: x + y))

word_count.collect()



## <font color= 187b1a>Word count using RDD: groupByKey</font>

In [32]:
# ???

In [33]:
rdd_lines = sc.textFile("file:/databricks/driver/producers.list")
#rdd_lines = sc.textFile("file:/dbfs/databricks-datasets/README.md")

rdd_words = rdd_lines.flatMap(lambda line: line.split())

rdd_word_pairs = rdd_words.map(lambda x: (x, 1))

rdd_groups = rdd_word_pairs.groupByKey()

# Tuple-unpacking lambdas such as `lambda (w, counts): ...` are a syntax error in Python 3,
# so sum the values of each group with mapValues instead.
rdd_counted_words = rdd_groups.mapValues(lambda counts: sum(counts))

rdd_counted_words.collect()
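
Both versions return the same counts, so why prefer one? A commonly cited difference is that `reduceByKey` combines values inside each partition before the shuffle, whereas `groupByKey` sends every pair across the network. The pure-Python simulation below illustrates that idea with two made-up partitions; the function names are hypothetical and this is a simplified model, not Spark's actual shuffle.

```python
from collections import Counter, defaultdict

# Hypothetical data: two partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]


def shuffle_group_by_key(parts):
    """groupByKey-like: every pair crosses the shuffle, then values are summed."""
    shuffled = [pair for part in parts for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return {k: sum(v) for k, v in groups.items()}, len(shuffled)


def shuffle_reduce_by_key(parts):
    """reduceByKey-like: combine within each partition, then merge partial sums."""
    shuffled = []
    for part in parts:
        partial = Counter()
        for key, value in part:
            partial[key] += value
        shuffled.extend(partial.items())  # one record per key per partition
    totals = Counter()
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


grouped_counts, grouped_shuffled = shuffle_group_by_key(partitions)
reduced_counts, reduced_shuffled = shuffle_reduce_by_key(partitions)

print(grouped_counts == reduced_counts)    # True
print(grouped_shuffled, reduced_shuffled)  # 7 4
```

With only two distinct keys the saving is small, but for a word count over a large file the number of shuffled records drops from one per word occurrence to at most one per distinct word per partition.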

## <font color=  #7b1864 >Word count using DataFrames:</font>
### without ordering the results...
### BUT first see the PySpark DataFrames notebook

In [36]:
# ???

In [37]:
# http://wpcertification.blogspot.com/2016/07/wordcount-program-using-spark-dataframe.html?utm_source=twitterfeed&utm_medium=twitter

import pyspark
import pyspark.sql.functions as f

df = sqlContext.read.text("file:/databricks/driver/producers.list")
#df = sqlContext.read.text("file:/dbfs/databricks-datasets/README.md")

#df.show()

wordDF = df.select(f.explode(f.split(df['value'], ' ')).alias("words"))

wordCountDF = wordDF.groupBy("words").count()

wordCountDF.show()
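
Note that this DataFrame version splits on a single space, while the RDD versions use `line.split()`, which splits on any run of whitespace. The counts may therefore differ, for example through an empty-string "word". The plain-Python comparison below shows the effect on a made-up line; whether it matters for a given file is an assumption to verify on the real data.

```python
import re

line = "Smith,  John   (I)"  # made-up line with repeated spaces

single_space = line.split(" ")               # literal single-space delimiter
whitespace = line.split()                    # any run of whitespace
regex_runs = re.split(r"\s+", line.strip())  # regex equivalent of the line above

print(single_space)  # ['Smith,', '', 'John', '', '', '(I)']
print(whitespace)    # ['Smith,', 'John', '(I)']
print(regex_runs)    # ['Smith,', 'John', '(I)']
```

Since the second argument of `f.split` is a regular expression, passing a whitespace-run pattern such as `\\s+` and filtering out empty strings is one possible way to bring the DataFrame result closer to the RDD one.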

### Create and cache a DataFrame with words

In [40]:
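from pyspark.sql import Row

# producers.list.gz was decompressed by gunzip above, so read the plain-text file.
df = sqlContext.read.text("file:/databricks/driver/producers.list")

# DataFrames have no flatMap in Spark 2.x and later; go through the underlying RDD.
words = df.rdd.flatMap(lambda line: line.value.split())\
    .map(lambda x: Row(word=x, cnt=1)).toDF()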
    
words.cache()

In [41]:
words.limit(5).toPandas()

In [42]:
words.count()

In [43]:
from time import time

t0 = time()

word_count=words.groupBy('word').count()\
    .collect()

tt = time() - t0
print("Task completed in {} seconds".format(round(tt,3)))
    

In [44]:
word_count[0:5]

## <font color= 187b1a>Word count using RDD</font>
### NOW ordering the results...

In [46]:
t0 = time()


rdd_word_count = rdd_words.map(lambda word: (word,1))\
    .reduceByKey(lambda x,y: x + y)\
    .map(lambda x: (x[1],x[0])) \
    .sortByKey(ascending=False) \
    .collect()
    
tt = time() - t0
print("Task completed in {} seconds".format(round(tt,3)))

In [47]:
rdd_word_count[0:5]
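
Only the first five entries are inspected here, so a full sort followed by `collect()` does more work than strictly needed. The local sketch below contrasts a full sort with a top-N selection on made-up text. It is an analogy in plain Python, not Spark code.

```python
import heapq
from collections import Counter

words = "to be or not to be that is the question to be".split()
counts = Counter(words)  # local stand-in for the (word, count) pairs

# Full sort by count, descending (analogous to swapping the pair and using sortByKey).
full_sort = sorted(((c, w) for w, c in counts.items()), reverse=True)

# Top-N only: keeps a small heap instead of sorting everything.
top3 = heapq.nlargest(3, counts.items(), key=lambda kv: kv[1])

print(full_sort[:2])  # [(3, 'to'), (3, 'be')]
print(top3)
```

In PySpark, `RDD.takeOrdered` and `RDD.top` accept a `key` function and return only the requested number of elements to the driver, so they may be worth trying when just the most frequent words are needed.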

## <font color=  #7b1864 >Word count using DataFrames:</font>
### Now ordering the results...

In [49]:
t0 = time()

word_count=words.groupBy('word').count()\
    .orderBy('count',ascending=False).collect()

tt = time() - t0
print("Task completed in {} seconds".format(round(tt,3)))
    

In [50]:
word_count[0:5]
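
The `t0 = time()` pattern is repeated in several cells above. A small context manager can remove that repetition and keep the measurements in one place for comparison. `timed` is a hypothetical helper written for this sketch, and the workload is a made-up local word count rather than a Spark job.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results=None):
    """Hypothetical helper: time the enclosed block and optionally record the duration."""
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print("{} completed in {} seconds".format(label, round(elapsed, 3)))


timings = {}
with timed("local word count", timings):
    # Made-up workload standing in for a Spark action such as collect().
    total = sum(len(line.split()) for line in ["a b c", "d e"] * 1000)
```

When comparing the RDD and DataFrame timings, keep in mind that `words` was cached and already materialized by `words.count()`, while `rdd_words` was not cached, so the two measurements do not start from the same conditions.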