#### Grouping Records in PySpark:
* Various methods from GroupedData object can be used to aggregate data based on groups. 
* Grouping data into groups allows us to apply aggregation functions on these each group formed.
* groupby() method can be used to form groups by passing the columns we want to group as parameters.
* groupby() method returns a GroupedData object and waits further instructions (cue Lazy eval)
* **_If you need to create groups based on the values of multiple columns, you can pass multiple columns as parameters to `groupby()`_**

In [2]:

import os
import pandas as pd

from pyspark.sql import SparkSession
from pyspark import SparkConf
from pyspark.sql.column import isin 

In [3]:
spark = (SparkSession
         .builder
         .appName("Analyzing the vocabulary of Pride and Prejudice")
         #.master("local[8]")
         .getOrCreate())

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


24/06/09 12:47:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [22]:
import pyspark.sql.functions as F

#dir(F)

['Any',
 'ArrayType',
 'Callable',
 'Column',
 'DataFrame',
 'DataType',
 'Dict',
 'Iterable',
 'List',
 'Optional',
 'PandasUDFType',
 'PythonEvalType',
 'SparkContext',
 'StringType',
 'StructType',
 'TYPE_CHECKING',
 'Tuple',
 'Union',
 'UserDefinedFunction',
 'ValuesView',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_create_column_from_literal',
 '_create_lambda',
 '_create_udf',
 '_get_jvm_function',
 '_get_lambda_parameters',
 '_invoke_binary_math_function',
 '_invoke_function',
 '_invoke_function_over_columns',
 '_invoke_function_over_seq_of_columns',
 '_invoke_higher_order_function',
 '_options_to_str',
 '_test',
 '_to_java_column',
 '_to_seq',
 '_unresolved_named_lambda_variable',
 'abs',
 'acos',
 'acosh',
 'add_months',
 'aggregate',
 'approxCountDistinct',
 'approx_count_distinct',
 'array',
 'array_contains',
 'array_distinct',
 'array_except',
 'array_intersect',
 'array_join',
 'array_max',
 'array_m

In [3]:
import pyspark.sql.functions as F
groups = (spark.read.text("/Users/u354769/Desktop/Ameya_Learning/DataAnalysisWithPythonAndPySpark/Book_Materials/Data/DataAnalysisWithPythonAndPySpark-Data/gutenberg_books/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word_lower"))
    .select(F.regexp_extract(F.col("word_lower"), "[a-z]+",0).alias("word_clean"))
    .filter(F.col("word_clean") != "")
    .groupby(F.col("word_clean"))
    )

In [4]:
print(groups)

<pyspark.sql.group.GroupedData object at 0x7fe731d39ee0>


In [11]:
result = (spark.read.text("/Users/u354769/Desktop/Ameya_Learning/DataAnalysisWithPythonAndPySpark/Book_Materials/Data/DataAnalysisWithPythonAndPySpark-Data/gutenberg_books/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word_lower"))
    .select(F.regexp_extract(F.col("word_lower"), "[a-z]+",0).alias("word_clean"))
    .filter(F.col("word_clean") != "")
    .groupby(F.col("word_clean"))
    .count()
    )

In [7]:
print(result)

DataFrame[word_clean: string, count: bigint]


#### Ordering Records in PySpark:
* As spark is lazy it doesn't care about order of records unless explicitly asked. 
* `OrderBy()` method is used to order a data frame by the values of one or more columns. 
* Column name(s) on which to order data can be provided as parameters.
* Data is ordered in _ascending_ order by default. By setting `ascending = False`, order can be reversed.
* PySpark orders the data frame using each column, one at a time. If multiple columns are passed, PySpark uses the first column's values to order the data frame, then the second (and third etc,) when there are identical values. 

**_Why is `groupby()` in lowercase and `orderBy()` in lowerCamelCase?_**
* **_`groupby()` is an alias for `groupBy()` similar to `where()` is an alias to `filter()`. Reason for this incoherence is that Scala, original language in which Spark was built, prefers camelCase. On other hand, `regexp_extract()` uses Python's snake_case, since PySpark is basically a python wrapper around Spark API._**

#### Writing data from a data frame:
* Spark uses `write()` and the `SparkWriter` object to write data frame to disk. 
* Unless specified, data will be written across multiples files, in distributed manner. 
* Writing data in distributed manner is useful, because when working across large cluster of nodes, it makes it easy to logically distribute reading and writing the data, making it way faster than having a single massive file. 
* PySpark gives you one file per partition by default. 
* To reduce number of files, use `coelesce()` method with desired number of partitions.

In [12]:
# Writing results in multiple CSV files (one per partition)
result.write.csv("/Users/u354769/Desktop/Ameya_Learning/DataAnalysisWithPythonAndPySpark/Book_Materials/Chapters/dump/simple_count.csv")

                                                                                

In [None]:
# Writing results in single CSV file under a single partition
result.coalesce(1).write.csv("/Users/u354769/Desktop/Ameya_Learning/DataAnalysisWithPythonAndPySpark/Book_Materials/Chapters/dump/simple_count_single_partition.csv")

In [14]:
# Scaling data input by including more books
#Using a glob pattern to select many files at once
result_scale = (spark.read.text("/Users/u354769/Desktop/Ameya_Learning/DataAnalysisWithPythonAndPySpark/Book_Materials/Data/DataAnalysisWithPythonAndPySpark-Data/gutenberg_books/*.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word_lower"))
    .select(F.regexp_extract(F.col("word_lower"), "[a-z]*",0).alias("word_clean"))
    .filter(F.col("word_clean") != "")
    .groupby(F.col("word_clean"))
    .count()
    .orderBy("count", ascending=False)
    .show(15)
    )

[Stage 9:>                                                          (0 + 6) / 6]

+----------+-----+
|word_clean|count|
+----------+-----+
|       the|38895|
|       and|23919|
|        of|21199|
|        to|20526|
|         a|14464|
|         i|13974|
|        in|12777|
|      that| 9623|
|        it| 9099|
|       was| 8920|
|       her| 7923|
|        my| 7385|
|       his| 6642|
|      with| 6575|
|        he| 6444|
+----------+-----+
only showing top 15 rows



                                                                                

#### Additional Exercises
 * For these exercises, you’ll need the word_count_submit.py program we worked on in this chapter. You can pick it from the book’s code repository (Code/Ch03/word_ count_submit.py).

In [6]:
#Exercise 3.3
# 1. By modifying the word_count_submit.py program, return the number of distinct words in Jane Austen’s Pride and Prejudice.
# (Hint: results contains one record for each unique word.)
result_3_3_1 = (spark.read.text("/Users/u354769/Desktop/Ameya_Learning/DataAnalysisWithPythonAndPySpark/Book_Materials/Data/DataAnalysisWithPythonAndPySpark-Data/gutenberg_books/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word_lower"))
    .select(F.regexp_extract(F.col("word_lower"), "[a-z]*",0).alias("word_clean"))
    .filter(F.col("word_clean") != "")
    .count()
    )


In [7]:
result_3_3_1

735603

In [14]:
#. 2 (Challenge) Wrap your program in a function that takes a file name as a parameter. 
#  It should return the number of distinct words.

def distinct_words(filename):
    file = "/Users/u354769/Desktop/Ameya_Learning/DataAnalysisWithPythonAndPySpark/Book_Materials/Data/DataAnalysisWithPythonAndPySpark-Data/gutenberg_books/"+filename
    result_3_3_2 = (spark.read.text(file)
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word_lower"))
    .select(F.regexp_extract(F.col("word_lower"), "[a-z]*",0).alias("word_clean"))
    .filter(F.col("word_clean") != "")
    .count()
    )
    
#distinct_words()

In [15]:
distinct_words("1342-0.txt")

NameError: name 'filename' is not defined

In [19]:
#Exercise 3.4
# Taking word_count_submit.py, modify the script to return a sample of five words that appear only once 
# in Jane Austen’s Pride and Prejudice.

result_3_4 = (spark.read.text("/Users/u354769/Desktop/Ameya_Learning/DataAnalysisWithPythonAndPySpark/Book_Materials/Data/DataAnalysisWithPythonAndPySpark-Data/gutenberg_books/*.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word_lower"))
    .select(F.regexp_extract(F.col("word_lower"), "[a-z]*",0).alias("word_clean"))
    .filter(F.col("word_clean") != "")
    .groupby(F.col("word_clean"))
    .count().alias("count")
    .orderBy("count", ascending=True)
    .show(5)
    )

[Stage 15:>                                                         (0 + 6) / 6]

+----------+-----+
|word_clean|count|
+----------+-----+
| quakerism|    1|
|       zig|    1|
|  spoiling|    1|
|  blackish|    1|
|    wields|    1|
+----------+-----+
only showing top 5 rows



                                                                                

In [21]:
#Exercise 3.5
# 1. Using the substring function (refer to PySpark’s API or the pyspark shell if needed), 
# return the top five most popular first letters (keep only the first letter of each word).
result_3_5_1 = (spark.read.text("/Users/u354769/Desktop/Ameya_Learning/DataAnalysisWithPythonAndPySpark/Book_Materials/Data/DataAnalysisWithPythonAndPySpark-Data/gutenberg_books/*.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word_lower"))
    .select(F.regexp_extract(F.col("word_lower"), "[a-z]*",0).alias("word_clean"))
    .filter(F.col("word_clean") != "")
    .select(F.substring(F.col("word_clean"),1,1).alias("first_letter"))
    .groupby(F.col("first_letter"))
    .count().alias("count")
    .orderBy("count", ascending=False)
    .show(5)
    )


[Stage 18:>                                                         (0 + 6) / 6]

+------------+------+
|first_letter| count|
+------------+------+
|           t|106621|
|           a| 83299|
|           s| 55709|
|           i| 54002|
|           h| 52509|
+------------+------+
only showing top 5 rows



                                                                                

In [27]:
# 2. Compute the number of words starting with a consonant or a vowel. 
# (Hint: The isin() function might be useful.)
result_3_5_2_1 = (spark.read.text("/Users/u354769/Desktop/Ameya_Learning/DataAnalysisWithPythonAndPySpark/Book_Materials/Data/DataAnalysisWithPythonAndPySpark-Data/gutenberg_books/*.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word_lower"))
    .select(F.regexp_extract(F.col("word_lower"), "[a-z]*",0).alias("word_clean"))
    .filter(F.col("word_clean") != "")
    .select(F.substring(F.col("word_clean"),1,1).alias("first_letter"))
    .filter(F.col("first_letter").isin(["a","e","i","o","u"]))
    .groupby(F.col("first_letter"))
    .count().alias("count")
    .orderBy("count", ascending=False)
    .show(5)
    )





+------------+-----+
|first_letter|count|
+------------+-----+
|           a|83299|
|           i|54002|
|           o|45295|
|           e|17046|
|           u| 9270|
+------------+-----+



                                                                                

In [28]:
#Consonants only
result_3_5_2_1 = (spark.read.text("/Users/u354769/Desktop/Ameya_Learning/DataAnalysisWithPythonAndPySpark/Book_Materials/Data/DataAnalysisWithPythonAndPySpark-Data/gutenberg_books/*.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word_lower"))
    .select(F.regexp_extract(F.col("word_lower"), "[a-z]*",0).alias("word_clean"))
    .filter(F.col("word_clean") != "")
    .select(F.substring(F.col("word_clean"),1,1).alias("first_letter"))
    .filter(~(F.col("first_letter").isin(["a","e","i","o","u"])))
    .groupby(F.col("first_letter"))
    .count().alias("count")
    .orderBy("count", ascending=False)
    .show(5)
    )



+------------+------+
|first_letter| count|
+------------+------+
|           t|106621|
|           s| 55709|
|           h| 52509|
|           w| 52337|
|           m| 38434|
+------------+------+
only showing top 5 rows



                                                                                

In [30]:
# Exercise 3.6
# Let’s say you want to get both the count() and sum() of a GroupedData object. Why doesn’t this code work? 
# Map the inputs and outputs of each method.
# my_data_frame.groupby("my_column").count().sum()
# Multiple aggregate function applications will be covered in chapter 4.

result_3_6 = (spark.read.text("/Users/u354769/Desktop/Ameya_Learning/DataAnalysisWithPythonAndPySpark/Book_Materials/Data/DataAnalysisWithPythonAndPySpark-Data/gutenberg_books/*.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word_lower"))
    .select(F.regexp_extract(F.col("word_lower"), "[a-z]*",0).alias("word_clean"))
    .filter(F.col("word_clean") != "")
    .groupby(F.col("word_clean"))
    .count()
    .sum()
    )

AttributeError: 'DataFrame' object has no attribute 'sum'

24/06/09 18:00:57 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 990376 ms exceeds timeout 120000 ms
24/06/09 18:00:57 WARN SparkContext: Killing executors is not supported by current scheduler.
24/06/09 18:16:44 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exception thrown in awaitResult: 
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:110)
	at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:36)
	at org.apache.spark.storage.BlockManagerMasterEndpoint.driverEndpoint$lzycompute(BlockManagerMasterEndpoint.scala:117)
	at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$driverEndpoint(BlockManagerMasterEndpoint.scala:116)
	at org.apache.spark.storage.B