## Week 04

This week we analyze the vocabulary of the novel Pride and Prejudice
by Jane Austen.

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

In [2]:
spark = (SparkSession
  .builder
  .master("local[*]") 
  .appName("Analyzing the vocabulary of Pride and Prejudice.")
  .getOrCreate())

In [3]:
spark

The full program is divided into five steps:
1. Read
2. Tokenization
3. Cleaning
4. Count
5. Presenting

### 1. Read

In [4]:
book = spark.read.text(
  "./data/gutenberg_books/1342-0.txt")
book

DataFrame[value: string]

In [5]:
book.printSchema()

root
 |-- value: string (nullable = true)



Show the first rows of the data. By default, `show()` prints 20 rows and truncates long values.

In [6]:
book.show()

+--------------------+
|               value|
+--------------------+
|The Project Guten...|
|                    |
|This eBook is for...|
|almost no restric...|
|re-use it under t...|
|with this eBook o...|
|                    |
|                    |
|Title: Pride and ...|
|                    |
| Author: Jane Austen|
|                    |
|Posting Date: Aug...|
|Release Date: Jun...|
|Last Updated: Mar...|
|                    |
|   Language: English|
|                    |
|Character set enc...|
|                    |
+--------------------+
only showing top 20 rows



In [7]:
book.show(10, truncate=50)

+--------------------------------------------------+
|                                             value|
+--------------------------------------------------+
|The Project Gutenberg EBook of Pride and Prejud...|
|                                                  |
|This eBook is for the use of anyone anywhere at...|
|almost no restrictions whatsoever.  You may cop...|
|re-use it under the terms of the Project Gutenb...|
|    with this eBook or online at www.gutenberg.org|
|                                                  |
|                                                  |
|                        Title: Pride and Prejudice|
|                                                  |
+--------------------------------------------------+
only showing top 10 rows



### 2. Tokenization

In [8]:
lines = book.select(
  F.split(book.value, " ")
   .alias("line"))
lines

DataFrame[line: array<string>]

In [9]:
lines.printSchema()

root
 |-- line: array (nullable = true)
 |    |-- element: string (containsNull = false)



In [10]:
lines.show(5, truncate=50)

+--------------------------------------------------+
|                                              line|
+--------------------------------------------------+
|[The, Project, Gutenberg, EBook, of, Pride, and...|
|                                                []|
|[This, eBook, is, for, the, use, of, anyone, an...|
|[almost, no, restrictions, whatsoever., , You, ...|
|[re-use, it, under, the, terms, of, the, Projec...|
+--------------------------------------------------+
only showing top 5 rows
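
The empty entries in the output above come from splitting on a single space: two consecutive spaces leave an empty token between them, and an empty line yields an array holding one empty string. A minimal plain-Python sketch of that behaviour, using `str.split(" ")` as a stand-in for `F.split` (the sample strings are illustrative and this is not part of the Spark pipeline):

```python
# Plain-Python stand-in for F.split(book.value, " ").
# The sample lines are illustrative, not read from the book file.
sample_lines = [
    "restrictions whatsoever.  You",  # note the double space
    "",                               # an empty line
]

# Split each line on a single space, as the notebook does.
tokenized = [line.split(" ") for line in sample_lines]

for tokens in tokenized:
    print(tokens)
```

Those empty tokens are why the cleaning step later filters out empty words.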



Create a column `word` by exploding each array into one row per token.

In [12]:
words = lines.select(
  F.explode(F.col("line"))
   .alias("word"))
words

DataFrame[word: string]

In [13]:
words.printSchema()


root
 |-- word: string (nullable = false)



In [16]:
words.show(15, truncate=10)

+----------+
|      word|
+----------+
|       The|
|   Project|
| Gutenberg|
|     EBook|
|        of|
|     Pride|
|       and|
|Prejudice,|
|        by|
|      Jane|
|    Austen|
|          |
|      This|
|     eBook|
|        is|
+----------+
only showing top 15 rows
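
Conceptually, `explode` flattens the arrays: every element of every array becomes its own row, including empty strings such as the one after "Austen" above. A small plain-Python sketch of the same idea, with hypothetical sample data standing in for the `line` column:

```python
# Plain-Python stand-in for F.explode: one output item per array element.
# sample_arrays is hypothetical data shaped like the `line` column.
sample_arrays = [
    ["The", "Project", "Gutenberg"],
    [""],                      # an empty line produces one empty token
    ["This", "eBook"],
]

# Flatten the nested lists while preserving order.
flat_words = [token for tokens in sample_arrays for token in tokens]
print(flat_words)
```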



### 3. Cleaning

Convert every word to lowercase.

In [18]:
words_lower = words.select(
  F.lower(F.col("word")).alias("word_lower"))
words_lower

DataFrame[word_lower: string]

In [19]:
words_lower.printSchema()

root
 |-- word_lower: string (nullable = false)



In [20]:
words_lower.show()

+----------+
|word_lower|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|     pride|
|       and|
|prejudice,|
|        by|
|      jane|
|    austen|
|          |
|      this|
|     ebook|
|        is|
|       for|
|       the|
|       use|
|        of|
|    anyone|
+----------+
only showing top 20 rows



Remove punctuation by keeping only the first run of lowercase letters in each word.

In [23]:
words_clean = words_lower.select(
  F.regexp_extract(F.col("word_lower"), 
                   "[a-z]+", 0)
   .alias("word"))
words_clean.show()

+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|    pride|
|      and|
|prejudice|
|       by|
|     jane|
|   austen|
|         |
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
+---------+
only showing top 20 rows
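
Note that `regexp_extract` with group index 0 returns only the first match of the pattern and returns an empty string when nothing matches. The sketch below mimics that behaviour in plain Python with `re.search`. The helper name `first_letter_run` and the sample words are assumptions made for illustration.

```python
import re


def first_letter_run(word: str) -> str:
    """Hypothetical helper mimicking regexp_extract(word, "[a-z]+", 0)."""
    match = re.search(r"[a-z]+", word)
    # Like regexp_extract, fall back to an empty string when nothing matches.
    return match.group(0) if match else ""


samples = ["prejudice,", "re-use", "whatsoever.", "1342", ""]
cleaned = [first_letter_run(w) for w in samples]
print(cleaned)
```

One consequence worth keeping in mind: a hyphenated word such as "re-use" is reduced to its first part, and tokens without any letters become empty strings.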



Remove empty words (words with no characters).

In [24]:
words_nonull = words_clean.filter(
    F.col("word") != "")
words_nonull.show()

+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|    pride|
|      and|
|prejudice|
|       by|
|     jane|
|   austen|
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
| anywhere|
+---------+
only showing top 20 rows



### 4. Count
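
The notebook leaves this step empty. One plausible approach, stated here as an assumption rather than the author's code, is to group the cleaned words and count the rows in each group; in PySpark this is commonly written as `groupby("word").count()`. The plain-Python sketch below shows the same aggregation with `collections.Counter` on a hypothetical sample.

```python
from collections import Counter

# Hypothetical sample of cleaned words; not read from the book.
sample_words = ["the", "project", "gutenberg", "ebook", "of", "the", "ebook", "the"]

# Count how many times each distinct word occurs.
word_counts = Counter(sample_words)

print(word_counts["the"])
print(len(word_counts))  # number of distinct words
```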

### 5. Presenting
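
This step is also empty in the notebook. A reasonable way to present word frequencies, again an assumption, is to sort by count in descending order and show the top few entries; in PySpark that is typically `orderBy("count", ascending=False)` followed by `show(n)`. The sketch below does the equivalent in plain Python with hypothetical counts.

```python
from collections import Counter

# Hypothetical word frequencies; real values would come from the count step.
sample_counts = Counter({"the": 3, "ebook": 2, "of": 1})

# most_common(n) returns the n most frequent (word, count) pairs, highest first.
top_two = sample_counts.most_common(2)

for word, count in top_two:
    print(f"{word:>10} {count}")
```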
