<font size=6> Spark Data Frames and SQL</font><br><br>
# **MSTC MLlab**

## Sources:
* [Introduction to Spark with Python, by Jose A. Dianes](http://jadianes.github.io/spark-py-notebooks)
* [Complete Guide on DataFrame Operations in PySpark](https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/)
* [Understanding-DataFrames](https://github.com/awantik/pyspark-tutorial/wiki/Understanding-DataFrames)
* [From Pandas to Spark Dataframes](https://github.com/awantik/pyspark-tutorial/wiki/Migrating-from-Pandas-to-Apache-Spark%E2%80%99s-DataFrame)
* [Also ML](https://www.analyticsvidhya.com/blog/2016/09/comprehensive-introduction-to-apache-spark-rdds-dataframes-using-pyspark/)

<font size=5 color=brown> This notebook introduces Spark's capabilities for working with structured data. Essentially, everything revolves around the concept of the *DataFrame* and using the *SQL language* to query it.</font>
<br><br>

<font size=5> In Apache Spark, a DataFrame is a **distributed collection of rows under named columns**. In simple terms, it is the same as a table in a relational database or an Excel sheet with column headers. It also shares some common characteristics with RDDs:</font>

*    <font size=5 color=red>Immutable</font> <font size=4>in nature: a DataFrame / RDD cannot be changed once it is created. Applying a transformation returns a new DataFrame / RDD instead of modifying the original.
*    **Lazy evaluation**: a task is not executed until an action is performed.
*    **Distributed**: both RDDs and DataFrames are distributed in nature.</font>
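
The first two properties can be illustrated without Spark. The sketch below is only a plain-Python analogy (the names `traced_square` and `evaluated` are made up for this illustration): a tuple stands in for an immutable dataset, `map` plays the role of a lazy transformation, and `list` plays the role of an action.

```python
evaluated = []  # records which elements were actually computed


def traced_square(x):
    """Square x and remember that the computation really happened."""
    evaluated.append(x)
    return x * x


data = (1, 2, 3, 4)                       # a tuple: immutable, like an RDD / DataFrame
squares = map(traced_square, data)        # "transformation": nothing runs yet
computed_before_action = list(evaluated)  # snapshot taken before any action

result = list(squares)                    # "action": forces the computation

print(computed_before_action)  # []
print(result)                  # [1, 4, 9, 16]
print(data)                    # (1, 2, 3, 4), the source is unchanged
```

Spark behaves in a similar spirit: transformations only describe the work, and an action such as `count()` or `show()` triggers it.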
 

### PERFORMANCE:

![How to create a DataFrame](https://camo.githubusercontent.com/cc93c064c6fd754df0209d42ec054998edd81fa0/68747470733a2f2f7777772e736166617269626f6f6b736f6e6c696e652e636f6d2f6c6962726172792f766965772f6c6561726e696e672d7079737061726b2f393738313738363436333730382f67726170686963732f4230353739335f30335f30332e6a7067)

## How to create a DataFrame?
 
 ![How to create a DataFrame](https://www.analyticsvidhya.com/wp-content/uploads/2016/10/DataFrame-in-Spark.png)

* ### A Spark `DataFrame` is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Pandas. DataFrames can be constructed from a wide array of sources, such as an existing RDD in our case.

* ### The entry point into all SQL functionality in Spark is the `SQLContext` class (since Spark 2.0, `SparkSession` is the preferred entry point and `SQLContext` is kept for backward compatibility). To create a basic instance, all we need is a `SparkContext` reference. Since we are running Spark in shell mode (using PySpark), we can use the global context object `sc` for this purpose.

In [4]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

## <font color=#AA1B5A>DataFrame as an RDD of Row objects</font>

From: http://www.cs.sfu.ca/CourseCentral/732/ggbaker/content/spark-sql.html

### Think of a DataFrame as being implemented with an RDD of Row objects.
### <font color=#F01B5A>Nicest way to create Rows: create a custom subclass for your data:</font>

In [5]:
from pyspark.sql import Row

NameAge = Row('fname', 'lname', 'age')  # build a Row "class" with fixed field names
data_rows = [
    NameAge('John', 'Smith', 47),
    NameAge('Jane', 'Smith', 22),
    NameAge('Frank', 'Jones', 28),
]
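
For readers who have not used `Row` this way, the pattern is similar to `collections.namedtuple` from the standard library. The sketch below is a plain-Python analogy, not Spark code: `NameAge` here is a namedtuple standing in for the `Row` factory above, and it shows the benefit of named fields.

```python
from collections import namedtuple

# Stand-in for Row('fname', 'lname', 'age'): a record type with named fields
NameAge = namedtuple('NameAge', ['fname', 'lname', 'age'])

people = [
    NameAge('John', 'Smith', 47),
    NameAge('Jane', 'Smith', 22),
    NameAge('Frank', 'Jones', 28),
]

# Fields are accessed by name instead of by position
oldest = max(people, key=lambda p: p.age)
print(oldest.fname)       # John
print(NameAge._fields)    # ('fname', 'lname', 'age')
```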

In [6]:
# create a DataFrame from an RDD of Rows
data_rdd = sc.parallelize(data_rows)
data = sqlContext.createDataFrame(data_rdd)

In [7]:
type(data)

pyspark.sql.dataframe.DataFrame

In [8]:
# ... or from a list (equivalent for small data)
data = sqlContext.createDataFrame(data_rows)

In [9]:
type(data)

pyspark.sql.dataframe.DataFrame

In [10]:
data.show()

+-----+-----+---+
|fname|lname|age|
+-----+-----+---+
| John|Smith| 47|
| Jane|Smith| 22|
|Frank|Jones| 28|
+-----+-----+---+



### To use Spark SQL, our data needs a schema.

In [11]:
data.printSchema()

root
 |-- fname: string (nullable = true)
 |-- lname: string (nullable = true)
 |-- age: long (nullable = true)



## <font color=#AA1B5A>Creating a DataFrame from a CSV file</font>

## <font color=#F01B5A>We will read our Orange Churn dataset</font>

In [12]:
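# Read the CSV file, using the first line as the header and inferring column types.
# The external spark-csv package is used here; Spark 2.0+ also accepts format='csv'.
CV_data = sqlContext.read.load('/resources/data/MSTC/churn-bigml-80.csv',
                               format='com.databricks.spark.csv',
                               header='true',
                               inferSchema='true')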


In [13]:
type(CV_data)

pyspark.sql.dataframe.DataFrame

In [14]:
CV_data.count()

2666

### Spark SQL schema

To use Spark SQL, our data needs a schema.

In [15]:
CV_data.printSchema()

root
 |-- State: string (nullable = true)
 |-- Account length: integer (nullable = true)
 |-- Area code: integer (nullable = true)
 |-- International plan: string (nullable = true)
 |-- Voice mail plan: string (nullable = true)
 |-- Number vmail messages: integer (nullable = true)
 |-- Total day minutes: double (nullable = true)
 |-- Total day calls: integer (nullable = true)
 |-- Total day charge: double (nullable = true)
 |-- Total eve minutes: double (nullable = true)
 |-- Total eve calls: integer (nullable = true)
 |-- Total eve charge: double (nullable = true)
 |-- Total night minutes: double (nullable = true)
 |-- Total night calls: integer (nullable = true)
 |-- Total night charge: double (nullable = true)
 |-- Total intl minutes: double (nullable = true)
 |-- Total intl calls: integer (nullable = true)
 |-- Total intl charge: double (nullable = true)
 |-- Customer service calls: integer (nullable = true)
 |-- Churn: string (nullable = true)
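
To give an intuition for what `inferSchema='true'` does, here is a deliberately simplified sketch in plain Python. It is an assumption-laden illustration, not Spark's actual inference algorithm: the helper `infer_type` is hypothetical and only tries integer, then double, then falls back to string. The sample values are taken from the first rows of the dataset.

```python
def infer_type(values):
    """Return 'integer', 'double' or 'string' for a list of raw CSV strings."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return 'integer'
    if all_parse(float):
        return 'double'
    return 'string'


# Raw text values as they would appear in the CSV file
columns = {
    'State': ['KS', 'OH', 'NJ'],
    'Account length': ['128', '107', '137'],
    'Total day minutes': ['265.1', '161.6', '243.4'],
    'Churn': ['False', 'False', 'False'],
}

schema = {name: infer_type(values) for name, values in columns.items()}
for name, type_name in schema.items():
    print(name, '->', type_name)
```

With these rules `Churn` stays a string, which matches the schema printed above.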



## COLUMNS?

## <font color=#F81B5A>...worth mentioning PARQUET</font>

![Parquet](https://parquet.apache.org/assets/img/parquet_logo.png)
https://parquet.apache.org/

### Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

In [16]:
CV_data.columns

['State',
 'Account length',
 'Area code',
 'International plan',
 'Voice mail plan',
 'Number vmail messages',
 'Total day minutes',
 'Total day calls',
 'Total day charge',
 'Total eve minutes',
 'Total eve calls',
 'Total eve charge',
 'Total night minutes',
 'Total night calls',
 'Total night charge',
 'Total intl minutes',
 'Total intl calls',
 'Total intl charge',
 'Customer service calls',
 'Churn']

In [17]:
CV_data.head(5)

[Row(State='KS', Account length=128, Area code=415, International plan='No', Voice mail plan='Yes', Number vmail messages=25, Total day minutes=265.1, Total day calls=110, Total day charge=45.07, Total eve minutes=197.4, Total eve calls=99, Total eve charge=16.78, Total night minutes=244.7, Total night calls=91, Total night charge=11.01, Total intl minutes=10.0, Total intl calls=3, Total intl charge=2.7, Customer service calls=1, Churn='False'),
 Row(State='OH', Account length=107, Area code=415, International plan='No', Voice mail plan='Yes', Number vmail messages=26, Total day minutes=161.6, Total day calls=123, Total day charge=27.47, Total eve minutes=195.5, Total eve calls=103, Total eve charge=16.62, Total night minutes=254.4, Total night calls=103, Total night charge=11.45, Total intl minutes=13.7, Total intl calls=3, Total intl charge=3.7, Customer service calls=1, Churn='False'),
 Row(State='NJ', Account length=137, Area code=415, International plan='No', Voice mail plan='No',

## <font color=#AA1B5A> In Python, you can also convert freely between Pandas DataFrame and Spark DataFrame</font>

In [18]:
import pandas as pd

In [19]:
pd.DataFrame(CV_data.take(5), columns=CV_data.columns)

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,No,No,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,Yes,No,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,Yes,No,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


## or... 

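<font color=red size=6>BUT consider efficiency: `toPandas()` collects the entire DataFrame into driver memory before `head(5)` runs, whereas `take(5)` only brings five rows to the driver.</font>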

In [20]:
CV_data.toPandas().head(5)

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,No,No,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,Yes,No,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,Yes,No,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False
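
The difference between the two conversions can be sketched in plain Python. This is only an analogy under simplifying assumptions (the generator `rows` and its counters are invented for the illustration, and Spark's `take` may in practice read more than five rows of a partition): taking a few items from a lazy source touches few rows, while materializing everything first touches all of them.

```python
from itertools import islice


def rows(n, counter):
    """Yield n fake rows, counting how many were actually produced."""
    for i in range(n):
        counter['produced'] += 1
        yield {'id': i}


take_counter = {'produced': 0}
first_five = list(islice(rows(2666, take_counter), 5))   # like take(5)

collect_counter = {'produced': 0}
all_then_head = list(rows(2666, collect_counter))[:5]    # like toPandas().head(5)

print(take_counter['produced'])     # 5
print(collect_counter['produced'])  # 2666
print(first_five == all_then_head)  # True
```

Both approaches return the same five rows; the cost of getting them is what differs.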


## Spark DataFrames include some built-in functions, for example summary statistics

## `describe`:
* ### get the summary statistics (count, mean, standard deviation, min, max) of the numerical columns in a DataFrame


In [21]:
CV_data.describe().show()

+-------+------------------+------------------+---------------------+------------------+------------------+------------------+------------------+------------------+-----------------+-------------------+------------------+------------------+------------------+-----------------+------------------+----------------------+
|summary|    Account length|         Area code|Number vmail messages| Total day minutes|   Total day calls|  Total day charge| Total eve minutes|   Total eve calls| Total eve charge|Total night minutes| Total night calls|Total night charge|Total intl minutes| Total intl calls| Total intl charge|Customer service calls|
+-------+------------------+------------------+---------------------+------------------+------------------+------------------+------------------+------------------+-----------------+-------------------+------------------+------------------+------------------+-----------------+------------------+----------------------+
|  count|              2666|            

In [22]:
CV_data.describe().toPandas().transpose()

Unnamed: 0,0,1,2,3,4
summary,count,mean,stddev,min,max
Account length,2666,100.62040510127532,39.56397365334986,1,243
Area code,2666,437.43885971492875,42.52101801942723,408,510
Number vmail messages,2666,8.021755438859715,13.612277018291945,0,50
Total day minutes,2666,179.48162040510107,54.21035022086984,0.0,350.8
Total day calls,2666,100.31020255063765,19.988162186059505,0,160
Total day charge,2666,30.512404351087763,9.215732907163499,0.0,59.64
Total eve minutes,2666,200.38615903975995,50.95151511764594,0.0,363.7
Total eve calls,2666,100.02363090772693,20.161445115318898,0,170
Total eve charge,2666,17.03307201800451,4.330864176799865,0.0,30.91
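
As a quick sanity check of what these five statistics mean, the sketch below computes them in plain Python with the standard `statistics` module. The input is the 'Customer service calls' value of the five rows displayed earlier, so the numbers are not comparable to the full-dataset table. It assumes that the `stddev` reported by `describe` is the sample standard deviation, which is what `statistics.stdev` computes.

```python
import statistics

# 'Customer service calls' for the five rows shown earlier in the notebook
calls = [1, 1, 0, 2, 3]

summary = {
    'count': len(calls),
    'mean': statistics.mean(calls),
    'stddev': statistics.stdev(calls),  # sample standard deviation (n - 1)
    'min': min(calls),
    'max': max(calls),
}

for name, value in summary.items():
    print(name, value)
```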


## <font color=#F81B5A>Methods on DataFrames feel very SQL-like:</font>
http://www.cs.sfu.ca/CourseCentral/732/ggbaker/content/spark-sql.html

In [23]:
CV_data.select('Customer service calls','Churn').toPandas().head(5)

Unnamed: 0,Customer service calls,Churn
0,1,False
1,1,False
2,0,False
3,2,False
4,3,False


## Number of distinct states in the dataset?

In [24]:
CV_data.select('State').distinct().count()

51

## Crosstab: contingency table

In [25]:
CV_data.crosstab('State', 'Churn').show()

+-----------+-----+----+
|State_Churn|False|True|
+-----------+-----+----+
|         MA|   44|   8|
|         IN|   48|   6|
|         ID|   51|   5|
|         NM|   40|   4|
|         OR|   55|   7|
|         IA|   35|   3|
|         IL|   41|   4|
|         TN|   36|   5|
|         MO|   46|   5|
|         ME|   38|  11|
|         AZ|   42|   3|
|         AK|   40|   3|
|         VT|   51|   6|
|         WA|   38|  10|
|         SD|   43|   6|
|         KY|   37|   6|
|         NJ|   36|  14|
|         TX|   39|  16|
|         MI|   45|  13|
|         MD|   46|  14|
+-----------+-----+----+
only showing top 20 rows
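
A contingency table simply counts how often each pair of values occurs. The following plain-Python sketch shows the idea with `collections.Counter`; it is an analogy rather than Spark code and uses only the (State, Churn) pairs of the five rows displayed earlier.

```python
from collections import Counter

# (State, Churn) pairs of the five rows shown earlier in the notebook
pairs = [
    ('KS', 'False'),
    ('OH', 'False'),
    ('NJ', 'False'),
    ('OH', 'False'),
    ('OK', 'False'),
]

table = Counter(pairs)                    # counts each (State, Churn) combination
states = sorted({state for state, _ in pairs})
labels = ['False', 'True']

print('State', labels)
for state in states:
    # A Counter returns 0 for combinations that never occur
    print(state, [table[(state, label)] for label in labels])
```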



### Filter and count

In [26]:
CV_data.filter(CV_data['Customer service calls'] > 3).count()

210

## `groupby`:
* ### How many Churn vs. non-Churn cases are there?

In [27]:
CV_data.groupby('Churn').count().show()

+-----+-----+
|Churn|count|
+-----+-----+
| True|  388|
|False| 2278|
+-----+-----+



In [28]:
CV_data.groupby('Churn').agg({'Customer service calls': 'mean'}).show()

+-----+---------------------------+
|Churn|avg(Customer service calls)|
+-----+---------------------------+
| True|         2.2061855670103094|
|False|         1.4530289727831431|
+-----+---------------------------+
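
Conceptually, `groupby(...).agg(...)` splits the rows into groups by key and then reduces each group to a single value. A minimal plain-Python sketch of that split-then-aggregate idea follows. It groups by State instead of Churn, because the five rows displayed earlier all share the same Churn value and would form a single group.

```python
from collections import defaultdict

# (State, Customer service calls) for the five rows shown earlier
rows = [('KS', 1), ('OH', 1), ('NJ', 0), ('OH', 2), ('OK', 3)]

# Split: collect the values of each group
groups = defaultdict(list)
for state, calls in rows:
    groups[state].append(calls)

# Aggregate: reduce each group to its mean
means = {state: sum(values) / len(values) for state, values in groups.items()}

print(means['OH'])  # 1.5
```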



* ### <font color=#F81BA0 size=5>TO DO:</font>

<font color=#F81B5A size=5>How to find the mean of 'Customer service calls' in the Churn vs. non-Churn groups of the dataset?</font>

In [2]:
CV_data.groupby('Churn').agg({'Customer service calls': 'mean'}).toPandas()

<font color=#F81B5A size=5>And the mean of 'Total day minutes' and 'Customer service calls' for each State in the dataset?</font>

In [None]:
CV_data.groupby('State').agg({'Total day minutes': 'mean', 'Customer service calls': 'mean'}).toPandas()

# <font color=#F81B5A>SQL Syntax</font>

## There is also a `sql` function (`sqlContext.sql` in this notebook, `spark.sql` on a `SparkSession`) that lets you do the same things with SQL query syntax.

### Apply SQL Queries on DataFrame

* ### <font color=brown>To apply SQL queries to a DataFrame, we first need to register it as a table. Let’s register the `CV_data` DataFrame as a table.</font>

In [30]:
CV_data.registerTempTable('CV_data_table')

In [31]:
Day_min = sqlContext.sql("""
    SELECT State, MEAN(`Total day minutes`), MEAN(`Customer service calls`) 
    FROM CV_data_table GROUP BY State
""")

In [32]:
Day_min.toPandas()

Unnamed: 0,State,_c1,_c2
0,MS,173.447917,1.6875
1,MT,173.524528,1.584906
2,TN,182.846341,1.317073
3,NC,185.492857,1.535714
4,ND,186.236364,1.5
5,NE,178.355556,1.622222
6,NH,177.523256,1.627907
7,AK,180.132558,1.511628
8,AL,188.372727,1.666667
9,NJ,199.086,1.7


### <font color=red>...NOW order by average day minutes, descending</font>

In [33]:
Day_min = sqlContext.sql("""
    SELECT State, MEAN(`Total day minutes`) as average_DayMin, MEAN(`Customer service calls`) 
    FROM CV_data_table GROUP BY State order by average_DayMin desc
""")

In [34]:
pd.DataFrame(Day_min.take(5))

Unnamed: 0,0,1,2
0,NJ,199.086,1.7
1,MD,198.615,1.65
2,IN,196.92963,1.611111
3,PA,196.211111,1.305556
4,KS,196.178846,1.480769


## <font color=#F81B5A>... same as before, but using SQL-like methods:</font>

In [35]:
import pyspark.sql.functions as fn

# Same aggregation and ordering as the SQL query, expressed with DataFrame methods
Day_min2 = (CV_data.groupby('State')
            .agg(fn.mean('Total day minutes').alias('average_DayMin'),
                 fn.mean('Customer service calls'))
            .orderBy(fn.desc('average_DayMin')))

In [36]:
pd.DataFrame(Day_min2.take(5))

Unnamed: 0,0,1,2
0,NJ,199.086,1.7
1,MD,198.615,1.65
2,IN,196.92963,1.611111
3,PA,196.211111,1.305556
4,KS,196.178846,1.480769


### <font color=brown>UDFs: we can register a user defined function (UDF) from Python</font>

In [37]:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import udf

# Map the binary string labels to numeric values
binary_map = {'Yes':1.0, 'No':0.0, 'True':1.0, 'False':0.0}

# Wrap the dictionary lookup in a Spark UDF that returns a double
toNum = udf(lambda k: binary_map[k], DoubleType())

In [38]:
pd.DataFrame(CV_data.take(5), columns=CV_data.columns)

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,No,No,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,Yes,No,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,Yes,No,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [39]:
CV_data = CV_data.withColumn('Churn', toNum(CV_data['Churn'])) \
    .withColumn('International plan', toNum(CV_data['International plan'])) \
    .withColumn('Voice mail plan', toNum(CV_data['Voice mail plan']))

### <font color=red>...NOTE that DataFrames are immutable: `withColumn` returns a NEW DataFrame, so you MUST assign the result (here back to `CV_data`)</font>

In [40]:
# drop() is a no-op when the named column does not exist, as is the case here
CV_data=CV_data.drop('Voice mail plan2')

In [41]:
CV_data.columns

['State',
 'Account length',
 'Area code',
 'International plan',
 'Voice mail plan',
 'Number vmail messages',
 'Total day minutes',
 'Total day calls',
 'Total day charge',
 'Total eve minutes',
 'Total eve calls',
 'Total eve charge',
 'Total night minutes',
 'Total night calls',
 'Total night charge',
 'Total intl minutes',
 'Total intl calls',
 'Total intl charge',
 'Customer service calls',
 'Churn']

In [42]:
pd.DataFrame(CV_data.take(5), columns=CV_data.columns)

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,KS,128,415,0.0,1.0,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0.0
1,OH,107,415,0.0,1.0,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0.0
2,NJ,137,415,0.0,0.0,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,0.0
3,OH,84,408,1.0,0.0,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,0.0
4,OK,75,415,1.0,0.0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,0.0
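
One caveat about the lookup used in the UDF: `binary_map[k]` raises a `KeyError` for any value that is not in the dictionary, including a missing value (`None`). A more defensive variant is sketched below in plain Python. The function name `to_num` is hypothetical, and the note that a `None` returned from a UDF ends up as a null in the resulting column is stated here as an assumption to verify on your Spark version.

```python
binary_map = {'Yes': 1.0, 'No': 0.0, 'True': 1.0, 'False': 0.0}


def to_num(value):
    """Map a binary label to 1.0 / 0.0; return None for missing or unknown labels."""
    return binary_map.get(value)


print(to_num('Yes'))    # 1.0
print(to_num('False'))  # 0.0
print(to_num(None))     # None
print(to_num('maybe'))  # None
```

A function like this could be wrapped in a UDF in the same way as the lambda above.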


## `sample`:
* ### How to create a sample DataFrame from the base DataFrame?

The `sample` method on a DataFrame returns a new DataFrame containing a sample of the base DataFrame. It takes 3 parameters:

* `withReplacement`: True or False, to select observations with or without replacement.
* `fraction`: x, where x = .5 means that we want about 50% of the data in the sample DataFrame.
* `seed`: to reproduce the result.

Let’s create a DataFrame t1 holding a 50% sample of CV_data and count its rows. The count is close to, but not exactly, half of the 2666 rows, because `fraction` is an expected proportion rather than an exact size.

In [43]:
t1 = CV_data.sample(False, 0.5, 42)

In [44]:
t1.count()

1294
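
Why the count is not exactly half of the rows can be illustrated with a small plain-Python sketch. It assumes that sampling without replacement amounts to an independent keep-or-drop decision per row, each with probability `fraction`; the helper `bernoulli_sample` is written for this illustration and will not reproduce Spark's exact row selection.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(2666))  # stand-in for the 2666 rows of the dataset

s1 = bernoulli_sample(rows, 0.5, 42)
s2 = bernoulli_sample(rows, 0.5, 42)

print(len(s1))   # close to 1333, but usually not exactly 1333
print(s1 == s2)  # True: the same seed gives the same sample
```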

## `map`: apply a map operation on DataFrame columns

We can apply a function to each row of a DataFrame using a map operation. The result is an RDD rather than a DataFrame. For example, mapping each value x of a column to the pair (x, 1) with a lambda function is the first step of the word count example that follows.

## Word Count Example

https://www.youtube.com/watch?v=V6DkTVvy9vk
https://www.youtube.com/watch?v=vfiJQ7wg81Y

In [45]:
import time
import os

from six.moves import urllib

#file_url = 'http://www.gutenberg.org/cache/epub/2000/pg2000.txt'
#file_name = '/resources/data/MSTC/cervantes.txt'

file_url = 'https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt'
file_name = '/resources/data/MSTC/t8.shakespeare.txt'
    
    
#if not os.path.exists(file_name):
urllib.request.urlretrieve(file_url, file_name)

('/resources/data/MSTC/t8.shakespeare.txt',
 <http.client.HTTPMessage at 0x7fa42ae31940>)

In [52]:
from time import time

In [53]:
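# RDD word count: split lines into words, pair each word with 1, sum per word.
# collect() brings the result to the driver as a Python list.
rdd = sc.textFile(file_name) \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1))\
    .reduceByKey(lambda x, y: x + y)\
    .collect()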

In [139]:
t0 = time()


# Same word count, then swap each pair to (count, word) and sort by count, descending
rdd = sc.textFile(file_name) \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word,1))\
    .reduceByKey(lambda x,y: x + y)\
    .map(lambda x: (x[1],x[0])) \
    .sortByKey(ascending=False) \
    .collect()
    
tt = time() - t0
print("Task completed in {} seconds".format(round(tt,3)))

Task completed in 1.444 seconds


In [55]:
rdd[0:20]

[(23242, 'the'),
 (19540, 'I'),
 (18297, 'and'),
 (15623, 'to'),
 (15544, 'of'),
 (12532, 'a'),
 (10824, 'my'),
 (9576, 'in'),
 (9081, 'you'),
 (7851, 'is'),
 (7531, 'that'),
 (7068, 'And'),
 (6948, 'not'),
 (6722, 'with'),
 (6218, 'his'),
 (6009, 'your'),
 (6002, 'be'),
 (5616, 'for'),
 (5236, 'have'),
 (4912, 'it')]
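
The individual steps of the RDD pipeline can be followed on a toy input in plain Python. This is a local sketch of the same logic, not Spark code, and the two input lines are a made-up miniature corpus.

```python
from collections import defaultdict

lines = ["to be or not to be", "that is the question"]

# flatMap: split every line and flatten into a single list of words
words = [word for line in lines for word in line.split()]

# map: pair each word with a count of 1
pairs = [(word, 1) for word in words]

# reduceByKey: add up the counts of identical words
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

# map + sortByKey(ascending=False): swap to (count, word) and sort descending
ranked = sorted(((count, word) for word, count in counts.items()), reverse=True)

print(ranked[:2])  # [(2, 'to'), (2, 'be')]
```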

## DataFrame

In [165]:
t0 = time()

df = sqlContext.read.text(file_name)

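# On Spark 2.0+, DataFrame.flatMap was removed; use df.rdd.flatMap(...) instead
words=df.flatMap(lambda line: line.value.split())\
    .map(lambda x:Row(word=x, cnt=1)).toDF()

# show() prints the table and returns None, so its result is not assigned
words.groupBy("word").sum('cnt')\
    .orderBy('sum(cnt)',ascending=False).show()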

tt = time() - t0
print("Task completed in {} seconds".format(round(tt,3)))
    

+----+--------+
|word|sum(cnt)|
+----+--------+
| the|   23242|
|   I|   19540|
| and|   18297|
|  to|   15623|
|  of|   15544|
|   a|   12532|
|  my|   10824|
|  in|    9576|
| you|    9081|
|  is|    7851|
|that|    7531|
| And|    7068|
| not|    6948|
|with|    6722|
| his|    6218|
|your|    6009|
|  be|    6002|
| for|    5616|
|have|    5236|
|  it|    4912|
+----+--------+
only showing top 20 rows

Task completed in 3.961 seconds


In [168]:
t0 = time()

df = sqlContext.read.text(file_name)

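# On Spark 2.0+, DataFrame.flatMap was removed; use df.rdd.flatMap(...) instead
words=df.flatMap(lambda line: line.value.split())\
    .map(lambda x:Row(word=x, cnt=1)).toDF()

# show() prints the table and returns None, so its result is not assigned
words.groupBy('word').count()\
    .orderBy('count',ascending=False).show()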

tt = time() - t0
print("Task completed in {} seconds".format(round(tt,3)))

+----+-----+
|word|count|
+----+-----+
| the|23242|
|   I|19540|
| and|18297|
|  to|15623|
|  of|15544|
|   a|12532|
|  my|10824|
|  in| 9576|
| you| 9081|
|  is| 7851|
|that| 7531|
| And| 7068|
| not| 6948|
|with| 6722|
| his| 6218|
|your| 6009|
|  be| 6002|
| for| 5616|
|have| 5236|
|  it| 4912|
+----+-----+
only showing top 20 rows

Task completed in 3.756 seconds


In [171]:
t0 = time()

words_df = sc.textFile(file_name) \
    .flatMap(lambda line: line.split())\
    .map(lambda x:Row(word=x, cnt=1)).toDF()
    
# show() prints the table and returns None, so its result is not assigned
words_df.groupBy('word').count()\
    .orderBy('count',ascending=False).show()
    
tt = time() - t0
print("Task completed in {} seconds".format(round(tt,3)))

+----+-----+
|word|count|
+----+-----+
| the|23242|
|   I|19540|
| and|18297|
|  to|15623|
|  of|15544|
|   a|12532|
|  my|10824|
|  in| 9576|
| you| 9081|
|  is| 7851|
|that| 7531|
| And| 7068|
| not| 6948|
|with| 6722|
| his| 6218|
|your| 6009|
|  be| 6002|
| for| 5616|
|have| 5236|
|  it| 4912|
+----+-----+
only showing top 20 rows

Task completed in 3.357 seconds


https://robertovitillo.com/2015/06/30/spark-best-practices/
https://www.slideshare.net/SparkSummit/getting-the-best-performance-with-pyspark

result = rdd.flatMap(lambda x: [(y, 1) for y in x[1]] ).reduceByKey(lambda x,y: x+y)
