# Data Analysis with PySpark

## What is PySpark?

As you know, Python is a dynamic, general-purpose language. PySpark is the Python API for Spark: it gives Python programs access to Spark's computational model. PySpark is open source, and it is fast, expressive, and versatile.

Spark is written in Scala. However, you can also use it from Python, Java, and R.

All three major cloud providers (Amazon Web Services, Google Cloud Platform, and Microsoft Azure) allow you to work with PySpark.

## Hands-on Data Analysis with PySpark

Let's import `SparkSession`, the entry point located in the `pyspark.sql` package. This entry point provides the functionality for data transformation.

In [1]:
from pyspark.sql import SparkSession 

Let's create a session through the `SparkSession.builder` object, which follows the builder pattern. Please keep in mind that a builder pattern offers a set of methods to create a highly configurable object without having multiple constructors. Here, the `.getOrCreate()` method allows you to work in both interactive and batch mode by avoiding the creation of a new `SparkSession` if one already exists.

In [2]:
spark = (SparkSession
         .builder
         .appName("Analyzing the vocabulary of Pride and Prejudice.")
         .getOrCreate())
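
To make the builder idea concrete, here is a minimal plain-Python sketch. It is not PySpark's implementation: the `Session` and `SessionBuilder` classes and their method names are hypothetical, made up only to show chainable configuration methods and a "get or create" step that reuses an existing object.

```python
class Session:
    """A toy stand-in for a configurable session object (not a PySpark class)."""

    def __init__(self, app_name, options):
        self.app_name = app_name
        self.options = options


class SessionBuilder:
    """Hypothetical builder: collects settings, then creates or reuses a Session."""

    _active = None  # the single shared Session, if one was already created

    def __init__(self):
        self._app_name = "default"
        self._options = {}

    def app_name(self, name):
        self._app_name = name
        return self  # returning self is what makes the calls chainable

    def config(self, key, value):
        self._options[key] = value
        return self

    def get_or_create(self):
        # Reuse the existing session instead of building a second one.
        if SessionBuilder._active is None:
            SessionBuilder._active = Session(self._app_name, dict(self._options))
        return SessionBuilder._active


first = SessionBuilder().app_name("vocabulary").config("workers", 4).get_or_create()
second = SessionBuilder().app_name("ignored").get_or_create()

print(first is second)  # True: the second call returned the existing session
print(second.app_name)  # vocabulary
```

In this sketch, the second builder's settings are ignored because a session already exists.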

The original Spark entry point is `SparkContext`, and `SparkSession` is a superset of it. Let's take a look at the `sparkContext` attribute.

In [3]:
spark.sparkContext

You can usually perform data analysis in three major steps: reading, transforming, and exporting. Let me show these steps.

## Data Ingestion

Let's read the dataset with the `spark.read.text` method.

In [4]:
book = spark.read.text("./data/gutenberg_books/1342-0.txt")

Let me look at this variable. 

In [5]:
book

DataFrame[value: string]

You can see the names of the columns and their types. Please note that PySpark data frames consist of a collection of columns. Let's now use the `printSchema` method to display the schema in a tree form.

In [6]:
book.printSchema()

root
 |-- value: string (nullable = true)



You can see the same information with the data frame’s `dtypes` attribute. Let me show you this.

In [7]:
print(book.dtypes)

[('value', 'string')]


## Exploring Data

Let's take a look at our data with the `show` method. By default, it shows 20 rows and truncates long values.

In [8]:
book.show()

+--------------------+
|               value|
+--------------------+
|The Project Guten...|
|                    |
|This eBook is for...|
|almost no restric...|
|re-use it under t...|
|with this eBook o...|
|                    |
|                    |
|Title: Pride and ...|
|                    |
| Author: Jane Austen|
|                    |
|Posting Date: Aug...|
|Release Date: Jun...|
|Last Updated: Mar...|
|                    |
|   Language: English|
|                    |
|Character set enc...|
|                    |
+--------------------+
only showing top 20 rows



You can set the number of rows and the truncation width. Let me show you this.

In [9]:
book.show(10, truncate=60)

+------------------------------------------------------------+
|                                                       value|
+------------------------------------------------------------+
|The Project Gutenberg EBook of Pride and Prejudice, by Ja...|
|                                                            |
|This eBook is for the use of anyone anywhere at no cost a...|
|almost no restrictions whatsoever.  You may copy it, give...|
|re-use it under the terms of the Project Gutenberg Licens...|
|              with this eBook or online at www.gutenberg.org|
|                                                            |
|                                                            |
|                                  Title: Pride and Prejudice|
|                                                            |
+------------------------------------------------------------+
only showing top 10 rows



## Column transformations

Let's split our lines of text into words with the `select` method and then count them. Let me first split the values with the `split` function and rename the transformed column with the `alias` method. Please keep in mind that PySpark’s data preprocessing functions that operate on columns are located in `pyspark.sql.functions`.

In [10]:
from pyspark.sql.functions import split
lines = book.select(split(book.value, " ").alias("line"))
lines.show(5)

+--------------------+
|                line|
+--------------------+
|[The, Project, Gu...|
|                  []|
|[This, eBook, is,...|
|[almost, no, rest...|
|[re-use, it, unde...|
+--------------------+
only showing top 5 rows



### Using the select method

You can select columns or transformed columns via the `select` method. There are many ways to refer to a column inside `select`. Let's take a look at these ways.

In [11]:
from pyspark.sql.functions import col
print(book.select(book.value))
print(book.select(book["value"]))
print(book.select(col("value")))
print(book.select("value"))

DataFrame[value: string]
DataFrame[value: string]
DataFrame[value: string]
DataFrame[value: string]


### Renaming the columns

When performing a transformation on your columns, you can rename the column with the `alias` and `withColumnRenamed` methods. I showed you how to use the `alias` method. Let's take a look at the `withColumnRenamed` method.

In [12]:
lines = book.select(split(book.value, " "))
lines=lines.withColumnRenamed("split(value,  , -1)", "line")
lines.show(10)

+--------------------+
|                line|
+--------------------+
|[The, Project, Gu...|
|                  []|
|[This, eBook, is,...|
|[almost, no, rest...|
|[re-use, it, unde...|
|[with, this, eBoo...|
|                  []|
|                  []|
|[Title:, Pride, a...|
|                  []|
+--------------------+
only showing top 10 rows



### Reshaping your data

PySpark data frames can have columns of nested values, like arrays of elements. The `explode` function allows you to extract the elements into distinct records. Let's create one record for each word with the `explode` function.

In [13]:
from pyspark.sql.functions import explode, col
words = lines.select(explode(col("line")).alias("word"))
words.show(10)

+----------+
|      word|
+----------+
|       The|
|   Project|
| Gutenberg|
|     EBook|
|        of|
|     Pride|
|       and|
|Prejudice,|
|        by|
|      Jane|
+----------+
only showing top 10 rows
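
As a rough mental model, here is what `split` followed by `explode` does, written in plain Python rather than PySpark. The sample lines are made up for illustration; the point is the shape change from one list per line to one record per word.

```python
# Made-up sample lines standing in for the rows of the `value` column.
sample_lines = ["Title: Pride and Prejudice", "", "Author: Jane Austen"]

# "split": one list of tokens per line (an empty line yields ['']).
tokens_per_line = [line.split(" ") for line in sample_lines]

# "explode": one record per element of each list.
words = [token for tokens in tokens_per_line for token in tokens]

print(tokens_per_line)
print(words)
```

Note that the empty line still contributes one empty string after the flattening step, which is why empty values need to be filtered out before counting.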



### Finding the number of characters in each row

Let's take a look at the number of characters in each row via the `length` function.

In [14]:
from pyspark.sql.functions import length
number_of_char = book.select(length(col("value"))).withColumnRenamed("length(value)", "number_of_char")
number_of_char.show(10)

+--------------+
|number_of_char|
+--------------+
|            66|
|             0|
|            64|
|            68|
|            67|
|            46|
|             0|
|             0|
|            26|
|             0|
+--------------+
only showing top 10 rows



## Working with words

Let's lower the case of all the words in the dataframe with the `lower` function.

In [15]:
from pyspark.sql.functions import lower
words_lower = words.select(lower(col("word")).alias("word_lower"))
words_lower.show(10)

+----------+
|word_lower|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|     pride|
|       and|
|prejudice,|
|        by|
|      jane|
+----------+
only showing top 10 rows



Let's clean our words of any punctuation and other non-useful characters with the `regexp_extract` function.

In [16]:
from pyspark.sql.functions import regexp_extract
words_clean = words_lower.select(regexp_extract(col("word_lower"), "[a-z]+", 0).alias("word"))
words_clean.show(10)

+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|    pride|
|      and|
|prejudice|
|       by|
|     jane|
+---------+
only showing top 10 rows
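
To see what the pattern `[a-z]+` keeps, here is a small plain-Python sketch using the standard `re` module. The helper `extract_first` is a hypothetical name, and it only approximates the behavior used above, on the assumption that the first match is returned and that a non-matching value becomes an empty string.

```python
import re


def extract_first(text, pattern):
    """Return the first match of pattern in text, or "" when nothing matches."""
    match = re.search(pattern, text)
    return match.group(0) if match else ""


for word in ["prejudice,", "re-use", "1342", ""]:
    print(repr(word), "->", repr(extract_first(word, "[a-z]+")))
```

Two things are worth noticing in this sketch: a hyphenated word such as `re-use` is cut at the hyphen, and values without any lowercase letter turn into empty strings.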



## Filtering rows

You can filter rows with the `where` or `filter` methods. These methods allow you to provide a test that returns `True` or `False`, and only the records returning `True` are kept. Let me show you how to filter records from a dataframe. To do this, I'm going to use the `filter` method. Here I want to keep only the non-empty words using a comparison operator.

In [17]:
words_nonull = words_clean.filter(col("word") != "")
words_nonull.show(10)

+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|    pride|
|      and|
|prejudice|
|       by|
|     jane|
+---------+
only showing top 10 rows



Nice, the empty values are gone. Let's now find only the words with more than five characters.

In [18]:
words_longer_than_five = words_clean.filter(length(col("word")) > 5)
words_longer_than_five.show(10)

+------------+
|        word|
+------------+
|     project|
|   gutenberg|
|   prejudice|
|      austen|
|      anyone|
|    anywhere|
|      almost|
|restrictions|
|  whatsoever|
|     project|
+------------+
only showing top 10 rows



Let’s remove the words "of", "the", and "by" from our list of words with the `where` and `isin` methods.

In [19]:
words_no_of_the_by = words_nonull.where(~col("word").isin(["of", "the", "by"]))
words_no_of_the_by.show(10)

+---------+
|     word|
+---------+
|  project|
|gutenberg|
|    ebook|
|    pride|
|      and|
|prejudice|
|     jane|
|   austen|
|     this|
|    ebook|
+---------+
only showing top 10 rows
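
The three filters above can be read as simple per-record tests. The following plain-Python sketch mirrors them on a made-up list of words; it only illustrates the logic and is not how PySpark evaluates filters.

```python
# Made-up sample of cleaned words, including one empty string.
words = ["the", "project", "", "of", "pride", "by", "jane"]

# Keep only non-empty words (the test `word != ""`).
non_empty = [w for w in words if w != ""]

# Keep only words with more than five characters.
longer_than_five = [w for w in non_empty if len(w) > 5]

# Drop the words "of", "the", and "by" (a negated membership test).
excluded = {"of", "the", "by"}
without_excluded = [w for w in non_empty if w not in excluded]

print(non_empty)
print(longer_than_five)
print(without_excluded)
```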



## Grouping records

The easiest way to group records is to use the `groupby` method. Let's first create only the grouped object with the `groupby` method.

In [20]:
groups = words_nonull.groupby(col("word"))
print(groups)

<pyspark.sql.group.GroupedData object at 0x000002D05FA0B040>


The `groupby` method returns a `GroupedData` object that waits for an aggregation method. Using this `GroupedData` object, let me calculate the frequency of the words with the `count` method.

In [21]:
results = words_nonull.groupby(col("word")).count()
print(results)

DataFrame[word: string, count: bigint]


Awesome. We counted the frequency of the words. Let's take a look at some of these counts.

In [22]:
results.show(10)

+---------+-----+
|     word|count|
+---------+-----+
|   online|    4|
|     some|  209|
|    still|   72|
|      few|   72|
|     hope|  122|
|    those|   60|
| cautious|    4|
|imitation|    1|
|      art|    3|
|  solaced|    1|
+---------+-----+
only showing top 10 rows



## Ordering the results with orderBy

The `orderBy()` method allows you to order a dataframe by the values of one or many columns. There are two ways to order. You can pass column names as parameters, or you can pass column objects created with the `col` function. Let me show you.

In [23]:
results.orderBy("count", ascending=False).show(10)

+----+-----+
|word|count|
+----+-----+
| the| 4496|
|  to| 4235|
|  of| 3719|
| and| 3602|
| her| 2223|
|   a| 1997|
|  in| 1910|
| was| 1844|
|   i| 1752|
| she| 1703|
+----+-----+
only showing top 10 rows



To order values, you can also call the `desc` method on a column created with the `col` function.

In [24]:
results.orderBy(col("count").desc()).show(10)

+----+-----+
|word|count|
+----+-----+
| the| 4496|
|  to| 4235|
|  of| 3719|
| and| 3602|
| her| 2223|
|   a| 1997|
|  in| 1910|
| was| 1844|
|   i| 1752|
| she| 1703|
+----+-----+
only showing top 10 rows
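
For comparison, the same group, count, and order logic can be sketched in plain Python with `collections.Counter`. The word list is made up for illustration; on a dataset that fits in memory this gives the same kind of result, while PySpark does the work across partitions.

```python
from collections import Counter

# Made-up sample of cleaned, non-empty words.
words = ["the", "to", "the", "of", "the", "to"]

# Group identical words and count each group.
counts = Counter(words)

# Order by count, descending.
ordered = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

# Keep only the two most frequent words.
top_two = counts.most_common(2)

print(ordered)
print(top_two)
```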



## Writing data

You can also export your results. To do this, you can use the dataframe's `write` attribute. Let's export our results to a comma-separated values (CSV) file, which is a human-readable format.

In [27]:
results.write.csv("simple_count.csv")

Note that PySpark creates one file per partition. This is because PySpark distributes the work across multiple workers, so the data is split into many partitions. Let's take a look at these partitions.

In [38]:
ls simple_count.csv

 Volume in drive D is Yeni Birim
 Volume Serial Number is B6A7-CDC9

 Directory of D:\My-Works\Github-Book\BigData\DataAnalysisWithPythonAndPySpark-trunk\simple_count.csv

08/17/2022  09:41 AM    <DIR>          .
08/17/2022  09:41 AM    <DIR>          ..
08/17/2022  09:41 AM                 8 ._SUCCESS.crc
08/17/2022  09:41 AM                12 .part-00000-7f1cfd1e-936a-4ad7-b196-63c15e3025c5-c000.csv.crc
08/17/2022  09:41 AM                12 .part-00001-7f1cfd1e-936a-4ad7-b196-63c15e3025c5-c000.csv.crc
08/17/2022  09:41 AM                12 .part-00002-7f1cfd1e-936a-4ad7-b196-63c15e3025c5-c000.csv.crc
08/17/2022  09:41 AM                12 .part-00003-7f1cfd1e-936a-4ad7-b196-63c15e3025c5-c000.csv.crc
08/17/2022  09:41 AM                12 .part-00004-7f1cfd1e-936a-4ad7-b196-63c15e3025c5-c000.csv.crc
08/17/2022  09:41 AM                12 .part-00005-7f1cfd1e-936a-4ad7-b196-63c15e3025c5-c000.csv.crc
08/17/2022  09:41 AM                12 .part-00006-7f1cfd1e-936a-4ad7-b196-63c15e3025c

08/17/2022  09:41 AM               320 part-00191-7f1cfd1e-936a-4ad7-b196-63c15e3025c5-c000.csv
08/17/2022  09:41 AM               277 part-00192-7f1cfd1e-936a-4ad7-b196-63c15e3025c5-c000.csv
08/17/2022  09:41 AM               529 part-00193-7f1cfd1e-936a-4ad7-b196-63c15e3025c5-c000.csv
08/17/2022  09:41 AM               448 part-00194-7f1cfd1e-936a-4ad7-b196-63c15e3025c5-c000.csv
08/17/2022  09:41 AM               339 part-00195-7f1cfd1e-936a-4ad7-b196-63c15e3025c5-c000.csv
08/17/2022  09:41 AM               562 part-00196-7f1cfd1e-936a-4ad7-b196-63c15e3025c5-c000.csv
08/17/2022  09:41 AM               355 part-00197-7f1cfd1e-936a-4ad7-b196-63c15e3025c5-c000.csv
08/17/2022  09:41 AM               473 part-00198-7f1cfd1e-936a-4ad7-b196-63c15e3025c5-c000.csv
08/17/2022  09:41 AM               351 part-00199-7f1cfd1e-936a-4ad7-b196-63c15e3025c5-c000.csv
             402 File(s)         78,795 bytes
               2 Dir(s)  88,106,213,376 bytes free
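
Because the output is a directory of part files rather than a single file, reading it back outside of Spark means collecting every `part-*.csv` file. Here is a minimal plain-Python sketch of that idea. The helper `read_partitioned_csv`, the directory name, and the part file names are all hypothetical; the demo builds a tiny stand-in directory instead of using the real output.

```python
import csv
import tempfile
from pathlib import Path


def read_partitioned_csv(directory):
    """Collect the rows of every part-*.csv file found in a directory."""
    rows = []
    for part in sorted(Path(directory).glob("part-*.csv")):
        with part.open(newline="", encoding="utf-8") as handle:
            rows.extend(tuple(row) for row in csv.reader(handle))
    return rows


# Build a small stand-in for an output directory (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "demo_count.csv"
    out_dir.mkdir()
    (out_dir / "part-00000-demo.csv").write_text("the,4496\nto,4235\n", encoding="utf-8")
    (out_dir / "part-00001-demo.csv").write_text("of,3719\n", encoding="utf-8")
    (out_dir / "_SUCCESS").write_text("", encoding="utf-8")
    rows = read_partitioned_csv(out_dir)

print(rows)
```

The marker file `_SUCCESS` and the `.crc` checksum files are skipped because they do not match the `part-*.csv` pattern. All values come back as strings, since CSV carries no type information.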


You can reduce the number of partitions with the `coalesce` method by passing your desired number of partitions. I'm going to use a single partition, which produces a single data file.

In [39]:
results.coalesce(1).write.csv("my_single_partition.csv")

Let's take a look at this file.

In [52]:
ls my_single_partition.csv

 Volume in drive D is Yeni Birim
 Volume Serial Number is B6A7-CDC9

 Directory of D:\My-Works\Github-Book\BigData\DataAnalysisWithPythonAndPySpark-trunk\my_single_partition.csv

08/17/2022  10:03 AM    <DIR>          .
08/17/2022  10:03 AM    <DIR>          ..
08/17/2022  10:03 AM                 8 ._SUCCESS.crc
08/17/2022  10:03 AM               608 .part-00000-9ded3cad-e2ff-4335-a199-682b6213590b-c000.csv.crc
08/17/2022  10:03 AM                 0 _SUCCESS
08/17/2022  10:03 AM            76,351 part-00000-9ded3cad-e2ff-4335-a199-682b6213590b-c000.csv
               4 File(s)         76,967 bytes
               2 Dir(s)  88,106,090,496 bytes free


Nice, we wrote our records to a single CSV file.