## Week 05

This is the second part of the vocabulary counter for Jane Austen's novel
"Pride and Prejudice".
We also explore scaling up to count words across several novels.

Let us copy all the necessary code from Week 04.


In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

In [2]:
spark = (SparkSession
  .builder
  .master("local[*]") 
  .appName("Analyzing the vocabulary of Pride and Prejudice.")
  .getOrCreate())


### 1. Read

In [3]:
book = spark.read.text(
  "./data/gutenberg_books/1342-0.txt")

### 2. Tokenization

In [4]:
lines = book.select(
  F.split(book.value, " ")
   .alias("line"))

In [5]:
words = lines.select(
  F.explode(F.col("line"))
   .alias("word"))

### 3. Cleaning

In [6]:
words_lower = words.select(
  F.lower(F.col("word")).alias("word_lower"))

In [7]:
words_clean = words_lower.select(
  F.regexp_extract(F.col("word_lower"), 
                   "[a-z]+", 0)
   .alias("word"))

In [8]:
words_nonull = words_clean.filter(
    F.col("word") != "")
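
To see what the cleaning step keeps and drops, here is a minimal plain-Python sketch of the same logic applied to a few tokens. It is not Spark code: `clean_token` and the sample tokens are hypothetical, and it assumes Python's `re` module behaves like Spark's regex engine for the simple pattern `[a-z]+`.

```python
import re


def clean_token(token: str) -> str:
    """Mimic F.lower followed by F.regexp_extract(..., "[a-z]+", 0) on one token."""
    match = re.search(r"[a-z]+", token.lower())
    # regexp_extract returns an empty string when the pattern does not match
    return match.group(0) if match else ""


# Hypothetical sample tokens, as they might come out of the split on " "
samples = ["Elizabeth,", "don't", "well-bred", "1813", ""]

cleaned = [clean_token(t) for t in samples]
# Same role as the filter on word != "" above
kept = [w for w in cleaned if w != ""]

print(cleaned)
print(kept)
```

Only the first run of letters survives, so "don't" becomes "don" and "well-bred" becomes "well". Tokens without any letters become empty strings and are removed by the filter.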

### 4. Count

In [9]:
groups = words_nonull.groupby(F.col("word"))
groups

GroupedData[grouping expressions: [word], value: [word: string], type: GroupBy]

In [12]:
results = words_nonull.groupby(F.col("word")).count()
results

DataFrame[word: string, count: bigint]

In [13]:
results.printSchema()

root
 |-- word: string (nullable = false)
 |-- count: long (nullable = false)



In [14]:
results.show()

+-------------+-----+
|         word|count|
+-------------+-----+
|       online|    4|
|         some|  209|
|        still|   72|
|          few|   72|
|         hope|  122|
|        those|   60|
|     cautious|    4|
|    imitation|    1|
|          art|    3|
|      solaced|    1|
|       poetry|    2|
|    arguments|    5|
| premeditated|    1|
|      elevate|    1|
|       doubts|    2|
|    destitute|    1|
|    solemnity|    5|
|   lieutenant|    1|
|gratification|    1|
|    connected|   14|
+-------------+-----+
only showing top 20 rows



### 5. Presentation

In [16]:
ordered_results = results.orderBy("count", ascending=False)
ordered_results.show(10)


+----+-----+
|word|count|
+----+-----+
| the| 4496|
|  to| 4235|
|  of| 3719|
| and| 3602|
| her| 2223|
|   i| 2052|
|   a| 1997|
|  in| 1920|
| was| 1844|
| she| 1703|
+----+-----+
only showing top 10 rows
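
As a rough cross-check of the count and ordering steps, the same idea can be sketched in plain Python on a tiny list of already cleaned words. The word list is made up for illustration, and `collections.Counter` only stands in for `groupby(...).count()` followed by `orderBy("count", ascending=False)`; it is not how Spark computes the result.

```python
from collections import Counter

# Hypothetical, already cleaned words (lowercase, letters only, no empty strings)
words = ["the", "to", "the", "of", "the", "to"]

# Equivalent in spirit to groupby("word").count()
counts = Counter(words)

# Equivalent in spirit to orderBy("count", ascending=False) followed by show(2)
top = counts.most_common(2)

print(top)  # [('the', 3), ('to', 2)]
```

Unlike `Counter.most_common`, Spark does not promise any particular order among words with the same count.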



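Write the results to .csv. Spark writes a directory named `vocab_count.csv` containing one part file per partition, and with the default save mode the call fails if that directory already exists.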

In [17]:
results.write.csv("./data/vocab_count.csv")

We will discuss:
1. Turning this notebook into a Python script. See `word_count_submit.py`.