###### Chapter 2: Step by step analysis

In [1]:
import os
import pandas as pd

from pyspark.sql import SparkSession
from pyspark import SparkConf

In [2]:
spark = (SparkSession
         .builder
         .appName("Analyzing the vocabulary of Pride and Prejudice")
         .master("local[8]")
         .getOrCreate())

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


24/06/04 22:37:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
spark.sparkContext

In [4]:
book = spark.read.text("/Users/u354769/Desktop/Ameya_Learning/DataAnalysisWithPythonAndPySpark/Book_Materials/Data/DataAnalysisWithPythonAndPySpark-Data/gutenberg_books/1342-0.txt")
#book

DataFrame[value: string]

PySpark provides printSchema() to display the schema in tree form.

In [5]:
book.printSchema()

root
 |-- value: string (nullable = true)



If you need to work with the schema programmatically (for example, to filter columns by type), `printSchema()` is not useful because it only prints. Instead, use the `dtypes` attribute of the data frame, which returns a list of tuples. Ex: [(`column_name1`, `column_type1`), (`column_name2`, `column_type2`)].
The schema can also be accessed through the `schema` attribute.

In [6]:
print(book.dtypes)

[('value', 'string')]
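
Because `dtypes` is an ordinary Python list of `(name, type)` tuples, regular list tools work on it. The following is a minimal sketch that groups column names by type. It uses a hand-written list in the same shape as `dtypes` output, so no Spark session is needed; the column names and the helper `columns_by_type` are made up for illustration.

```python
from collections import defaultdict

# Hand-written stand-in for a data frame's `dtypes` output (hypothetical columns).
sample_dtypes = [("title", "string"), ("author", "string"), ("pages", "int")]


def columns_by_type(dtypes):
    """Group column names by their type string."""
    grouped = defaultdict(list)
    for name, dtype in dtypes:
        grouped[dtype].append(name)
    return dict(grouped)


print(columns_by_type(sample_dtypes))
```

With a live session, the same helper should accept `book.dtypes` directly.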


###### *Using PySpark documentation directly in REPL*

In [7]:
print(spark.__doc__)

The entry point to programming Spark with the Dataset and DataFrame API.

    A SparkSession can be used create :class:`DataFrame`, register :class:`DataFrame` as
    tables, execute SQL over tables, cache tables, and read parquet files.
    To create a :class:`SparkSession`, use the following builder pattern:

    .. autoattribute:: builder
       :annotation:

    Examples
    --------
    >>> spark = SparkSession.builder \
    ...     .master("local") \
    ...     .appName("Word Count") \
    ...     .config("spark.some.config.option", "some-value") \
    ...     .getOrCreate()

    >>> from datetime import datetime
    >>> from pyspark.sql import Row
    >>> spark = SparkSession(sc)
    >>> allTypes = sc.parallelize([Row(i=1, s="string", d=1.0, l=1,
    ...     b=True, list=[1, 2, 3], dict={"s": 0}, row=Row(a=1),
    ...     time=datetime(2014, 8, 1, 14, 1, 5))])
    >>> df = allTypes.toDF()
    >>> df.createOrReplaceTempView("allTypes")
    >>> spark.sql('select i+1, d+1, not b, 

###### Exploring data with show() method
The `show()` method takes three optional parameters:
  * `n` can be set to any positive integer and controls the number of rows displayed (20 by default).
  * `truncate`, if set to `True` (the default), truncates each column to 20 characters. Set to `False`, it displays the full content; set to a positive integer, it truncates to that number of characters.
  * `vertical` takes a Boolean value and, when set to `True`, displays each record as a small table. This is a very useful option when you need to check records in detail.

In [8]:
book.show()

+--------------------+
|               value|
+--------------------+
|The Project Guten...|
|                    |
|This eBook is for...|
|almost no restric...|
|re-use it under t...|
|with this eBook o...|
|                    |
|                    |
|Title: Pride and ...|
|                    |
| Author: Jane Austen|
|                    |
|Posting Date: Aug...|
|Release Date: Jun...|
|Last Updated: Mar...|
|                    |
|   Language: English|
|                    |
|Character set enc...|
|                    |
+--------------------+
only showing top 20 rows



                                                                                

In [9]:
book.show(10, truncate=50)

+--------------------------------------------------+
|                                             value|
+--------------------------------------------------+
|The Project Gutenberg EBook of Pride and Prejud...|
|                                                  |
|This eBook is for the use of anyone anywhere at...|
|almost no restrictions whatsoever.  You may cop...|
|re-use it under the terms of the Project Gutenb...|
|    with this eBook or online at www.gutenberg.org|
|                                                  |
|                                                  |
|                        Title: Pride and Prejudice|
|                                                  |
+--------------------------------------------------+
only showing top 10 rows



###### Column Transformations

 * Splitting a string into words: the `split()` function splits a longer string into a **list** of shorter strings. Its most popular use case is splitting a sentence into words.
 * Exploding a list into rows: the `explode()` function, when applied to a column containing an array-like data structure, takes each element of the array and gives it its own row.
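
The two transformations can be sketched in plain Python as an analogy. This is not how Spark runs them; it only shows the shape of the data at each step. The sample lines are made up.

```python
# Plain-Python analogy of split() followed by explode(); sample lines are made up.
sample_lines = ["The Project Gutenberg EBook", "", "This eBook is"]

# split(): each line becomes a list of words, still one list per row.
split_lines = [line.split(" ") for line in sample_lines]

# explode(): every element of every list gets its own row.
exploded = [word for words in split_lines for word in words]

print(split_lines)
print(exploded)
```

Note that the empty line yields a list holding one empty string, so it still contributes one (blank) row after exploding. That blank row is what the cleaning steps later have to remove.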


In [25]:
from pyspark.sql.functions import col, split

lines = book.select(split(col("value"), " ").alias("line"))

In [26]:
lines

DataFrame[line: array<string>]

In [27]:
lines.printSchema()

root
 |-- line: array (nullable = true)
 |    |-- element: string (containsNull = false)



In [28]:
lines.show(5)

+--------------------+
|                line|
+--------------------+
|[The, Project, Gu...|
|                  []|
|[This, eBook, is,...|
|[almost, no, rest...|
|[re-use, it, unde...|
+--------------------+
only showing top 5 rows



In [29]:
from pyspark.sql.functions import explode, col

words = lines.select(explode(col("line")).alias("word"))

words.show()

+----------+
|      word|
+----------+
|       The|
|   Project|
| Gutenberg|
|     EBook|
|        of|
|     Pride|
|       and|
|Prejudice,|
|        by|
|      Jane|
|    Austen|
|          |
|      This|
|     eBook|
|        is|
|       for|
|       the|
|       use|
|        of|
|    anyone|
+----------+
only showing top 20 rows



###### Cleaning dataframe
 * Converting all words to lowercase.
 * Removing punctuation and empty strings from the `word` column.
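
As a rough plain-Python analogy of these cleaning steps (lowercase, keep the first run of letters, drop empty results), the sketch below uses Python's `re` module as a stand-in for Spark's column functions. The tokens and the helper `first_letters` are made up for illustration.

```python
import re

# Made-up tokens resembling the exploded `word` column.
tokens = ["Pride", "and", "Prejudice,", "", "Austen"]

# Step 1: lowercase every token.
lowered = [token.lower() for token in tokens]


def first_letters(token):
    """Return the first run of a-z letters in token, or "" if there is none."""
    match = re.search(r"[a-z]+", token)
    return match.group(0) if match else ""


# Step 2: strip punctuation by keeping only the first run of letters.
cleaned = [first_letters(token) for token in lowered]

# Step 3: drop the empty strings left behind by blank tokens.
non_empty = [token for token in cleaned if token != ""]

print(non_empty)
```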


In [31]:
from pyspark.sql.functions import lower
words_lower = words.select(lower(col("word")).alias("word_lower"))
words_lower.show()

+----------+
|word_lower|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|     pride|
|       and|
|prejudice,|
|        by|
|      jane|
|    austen|
|          |
|      this|
|     ebook|
|        is|
|       for|
|       the|
|       use|
|        of|
|    anyone|
+----------+
only showing top 20 rows



In [34]:
from pyspark.sql.functions import regexp_extract
words_clean = words_lower.select(regexp_extract(col("word_lower"), "[a-z]+",0).alias("word_clean"))
words_clean.show()

+----------+
|word_clean|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|     pride|
|       and|
| prejudice|
|        by|
|      jane|
|    austen|
|          |
|      this|
|     ebook|
|        is|
|       for|
|       the|
|       use|
|        of|
|    anyone|
+----------+
only showing top 20 rows



In [35]:
# `filter` below is a DataFrame method, so nothing needs to be imported from pyspark.sql.functions
words_nonull = words_clean.filter(col("word_clean") != "")
words_nonull.show()

+----------+
|word_clean|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|     pride|
|       and|
| prejudice|
|        by|
|      jane|
|    austen|
|      this|
|     ebook|
|        is|
|       for|
|       the|
|       use|
|        of|
|    anyone|
|  anywhere|
+----------+
only showing top 20 rows



In [3]:
#Final Analysis in one step
import pyspark.sql.functions as F


In [5]:
#Enable logging
spark.sparkContext.setLogLevel("INFO")

In [4]:
fin_result = (
    spark.read.text("/Users/u354769/Desktop/Ameya_Learning/DataAnalysisWithPythonAndPySpark/Book_Materials/Data/DataAnalysisWithPythonAndPySpark-Data/gutenberg_books/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word_lower"))
    .select(F.regexp_extract(F.col("word_lower"), "[a-z]*",0).alias("word_clean"))
    .filter(F.col("word_clean") != "")
    .groupby(F.col("word_clean"))
    .count()  
)

In [5]:
fin_result.orderBy("count", ascending=False).show(15)

+----------+-----+
|word_clean|count|
+----------+-----+
|       the| 4496|
|        to| 4235|
|        of| 3719|
|       and| 3602|
|       her| 2223|
|         i| 2052|
|         a| 1997|
|        in| 1920|
|       was| 1844|
|       she| 1703|
|      that| 1582|
|        it| 1542|
|       not| 1447|
|       you| 1426|
|        he| 1334|
+----------+-----+
only showing top 15 rows
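
The step-by-step cell extracts words with the pattern `[a-z]+`, while the chained version uses `[a-z]*`. The two are not equivalent when a token starts with a non-letter, because `*` is satisfied by an empty match at the very first position. The sketch below illustrates this with Python's `re.search` as a stand-in for `regexp_extract`; Spark uses Java regular expressions, and the two engines are assumed to agree for a pattern this simple. The sample tokens and the helper `extract` are made up.

```python
import re


def extract(pattern, token):
    """Return the first match of pattern in token, or "" when nothing matches."""
    match = re.search(pattern, token)
    return match.group(0) if match else ""


samples = ["prejudice,", '"my', "_she_"]

for token in samples:
    plus_result = extract("[a-z]+", token)
    star_result = extract("[a-z]*", token)
    print(repr(token), repr(plus_result), repr(star_result))
```

Under this assumption, tokens such as `"my` would become empty strings with `[a-z]*` and be dropped by the subsequent filter, so the counts from the two patterns may differ slightly.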



                                                                                


### Exercise 2.2
##### Given the following data frame, programmatically count the number of columns that aren't strings.

In [4]:
exo2_2_df = spark.createDataFrame(
    [["test","more test", 10_000_000_000]],["one","two","three"]
)

In [5]:
exo2_2_df.printSchema()

root
 |-- one: string (nullable = true)
 |-- two: string (nullable = true)
 |-- three: long (nullable = true)



In [6]:
exo2_2_df.show(
)

                                                                                

+----+---------+-----------+
| one|      two|      three|
+----+---------+-----------+
|test|more test|10000000000|
+----+---------+-----------+
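
One possible approach is to count the entries of `dtypes` whose type is not `string`. The sketch below works on a hand-written copy of what `exo2_2_df.dtypes` should return, so it runs without Spark; a `long` column is expected to appear there as `bigint`. The helper name `count_non_string_columns` is made up for illustration.

```python
# Stand-in for `exo2_2_df.dtypes`; a long column is expected to show up as 'bigint'.
exo2_2_dtypes = [("one", "string"), ("two", "string"), ("three", "bigint")]


def count_non_string_columns(dtypes):
    """Count the columns whose type string is anything other than 'string'."""
    return sum(1 for _name, dtype in dtypes if dtype != "string")


print(count_non_string_columns(exo2_2_dtypes))  # 1
```

With a live session, calling `count_non_string_columns(exo2_2_df.dtypes)` should give the same answer.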



### Exercise 2.3
##### Rewrite the following code snippet, removing the `withColumnRenamed` method. Which version is clearer and easier to read?

```pySpark
from pyspark.sql.functions import length, col
exo2_3_df = (
    spark.read.text("/Users/u354769/Desktop/Ameya_Learning/DataAnalysisWithPythonAndPySpark/Book_Materials/Data/DataAnalysisWithPythonAndPySpark-Data/gutenberg_books/1342-0.txt")
    .select(length(col("value")))
    .withColumnRenamed("length(value)", "number_of_char")
)
```

In [None]:
# `alias` must be applied to the column inside `select`, not to the resulting data frame
from pyspark.sql.functions import length, col
exo2_3_df = (
    spark.read.text("/Users/u354769/Desktop/Ameya_Learning/DataAnalysisWithPythonAndPySpark/Book_Materials/Data/DataAnalysisWithPythonAndPySpark-Data/gutenberg_books/1342-0.txt")
    .select(length(col("value")).alias("number_of_char"))
)

### Exercise 2.4
##### Assume a data frame `exo2_4_df`. The following code block gives an error. What is the problem and how can you solve it?

```pySpark
from pyspark.sql.functions import greatest, col
exo2_4_df = spark.createDataFrame([["key", 10_000, 20_000]], ["key", "value1", "value2"]
)
```

In [13]:
from pyspark.sql.functions import greatest, col
exo2_4_df = spark.createDataFrame([["key", 10_000, 20_000]], ["key", "value1", "value2"]
)

exo2_4_df.printSchema()

root
 |-- key: string (nullable = true)
 |-- value1: long (nullable = true)
 |-- value2: long (nullable = true)



In [16]:
# `greatest` returns the greatest value across the listed columns, skipping null values
# The following statement will raise an error

from pyspark.sql.utils import AnalysisException
try:
    exo2_4_mod = exo2_4_df.select(
        greatest(col("value1"), col("value2")).alias("maximum_value")).select("key", "maximum_value")
except AnalysisException as err:
    print(err)


Column 'key' does not exist. Did you mean one of the following? [maximum_value];
'Project ['key, maximum_value#47L]
+- Project [greatest(value1#37L, value2#38L) AS maximum_value#47L]
   +- LogicalRDD [key#36, value1#37L, value2#38L], false




In [None]:
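# Corrected code:
# The first `select` keeps only `maximum_value`, so `key` no longer exists when the second `select` asks for it.
# Selecting only `maximum_value` avoids the error; to keep `key`, add col("key") to the first `select` instead.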
from pyspark.sql.utils import AnalysisException
try:
    exo2_4_mod = exo2_4_df.select(
        greatest(col("value1"), col("value2")).alias("maximum_value")).select("maximum_value")
except AnalysisException as err:
    print(err)