## Week 06

You can also run `word_count_submit.py` from the Anaconda Prompt (miniconda3) with
```bash
spark-submit word_count_submit.py
```

This week, we measure how quickly PySpark counts the words in eight books and compare the result with a plain Python and pandas approach.
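The timings quoted in the cell comments below distinguish a first run from later runs, so it helps to time each workload more than once. Here is a minimal sketch of such a helper; the name `timed`, the `repeats` parameter, and the stand-in workload are assumptions for illustration and are not part of the original notebook. Because Spark evaluates lazily, a timed Spark function should include an action such as `show()`, otherwise only the construction of the query plan is measured.

```python
import time

def timed(fn, repeats=3):
    """Call fn() `repeats` times and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

# Hypothetical stand-in workload. In the notebook this would be a function
# that runs one of the word-count cells, including the final show() or print().
durations = timed(lambda: sum(range(100_000)))
print([f"{d:.4f}s" for d in durations])
```

Reporting every run rather than an average keeps the slower first run visible instead of hiding it.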

In [21]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

import glob
import pandas as pd
import re

from collections import Counter

In [22]:
spark = (SparkSession
  .builder
  .master("local[*]")   # optional
  .appName("Word Counts Program.")
  .getOrCreate())

A single novel: Jane Austen (1813) - Pride and Prejudice

In [24]:
# Computation time: 4.6 s on the first run, 0.2-0.4 s on subsequent runs
results = (
  spark.read.text("./data/gutenberg_books/1342-0.txt")                 # one row per line, in column "value"
  .select(F.split(F.col("value"), " ").alias("line"))                  # split each line on single spaces
  .select(F.explode(F.col("line")).alias("word"))                      # one row per token
  .select(F.lower(F.col("word")).alias("word"))                        # lowercase every token
  .select(F.regexp_extract(F.col("word"), "[a-z]+", 0).alias("word"))  # keep the first run of letters in each token
  .where(F.col("word") != "")                                          # drop tokens that contain no letters
  .groupby("word")
  .count()
)

# Show the 10 most frequent words in Jane Austen's Pride and Prejudice
results.orderBy("count", ascending=False).show(10)

+----+-----+
|word|count|
+----+-----+
| the| 4496|
|  to| 4235|
|  of| 3719|
| and| 3602|
| her| 2223|
|   i| 2052|
|   a| 1997|
|  in| 1920|
| was| 1844|
| she| 1703|
+----+-----+
only showing top 10 rows



Eight classic books:
- `11-0.txt`: Lewis Carroll (1865) - Alice's Adventures in Wonderland
- `84-0.txt`: Mary Shelley (1818) - Frankenstein; or, The Modern Prometheus
- `1342-0.txt`: Jane Austen (1813) - Pride and Prejudice
- `1661-0.txt`: Arthur Conan Doyle (1892) - The Adventures of Sherlock Holmes
- `2701-0.txt`: Herman Melville (1851) - Moby-Dick; or, The Whale
- `pg132.txt`: 孫子/Sun Tzu (5th century BC) - 孫子兵法 (The Art of War / Sun Tzu's Military Method) 
- `pg514.txt`: Louisa May Alcott (1868-1869) - Little Women
- `pg1399.txt`: Лев Толстой/Leo Tolstoy (1878) - Анна Каренина (Anna Karenina)

In [25]:
# Computation time: 5.1 s on the first run, 0.5-0.7 s on subsequent runs
results = (
  spark.read.text("./data/gutenberg_books/*.txt")
  .select(F.split(F.col("value"), " ").alias("line"))
  .select(F.explode(F.col("line")).alias("word"))
  .select(F.lower(F.col("word")).alias("word"))
  .select(F.regexp_extract(F.col("word"), "[a-z]+", 0).alias("word"))
  .where(F.col("word") != "")
  .groupby("word")
  .count()
)

# Show the 10 most frequent words across all books in `./data/gutenberg_books/`
results.orderBy("count", ascending=False).show(10)

+----+-----+
|word|count|
+----+-----+
| the|60331|
| and|39571|
|  to|31793|
|  of|30994|
|   a|23322|
|  in|19247|
|   i|19189|
|that|15784|
|  he|15253|
|  it|14744|
+----+-----+
only showing top 10 rows



Let us compare this with a plain Python and pandas implementation.

In [26]:
# Read all text files from the folder
files = glob.glob("./data/gutenberg_books/*.txt")

# Initialize an empty list to store words
words = []

# Process each file
for file in files:
    with open(file, 'r', encoding='utf-8') as f:
        # Read the file line by line
        for line in f:
            # Lowercase the line and extract every run of the letters a-z
            line_words = re.findall(r"[a-z]+", line.lower())
            words.extend(line_words)

# Count the occurrence of each word
word_counts = Counter(words)

# Convert the result into a Pandas DataFrame
df = pd.DataFrame(word_counts.items(), columns=['word', 'count'])

# Sort the DataFrame by 'count' in descending order
df = df.sort_values(by='count', ascending=False)

# Show the top 10 most frequent words
print(df.head(10))

# Write the DataFrame to a CSV file
df.to_csv("./data/vocab_count.csv", index=False)

     word  count
14    the  60555
22    and  39862
69     to  31868
16     of  31060
93      a  23499
5      in  19427
63      i  19390
131  that  15954
701    he  15348
30     it  14819
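
The counts here are slightly higher than the Spark counts (for example 60555 versus 60331 for "the"). One likely explanation is that the two pipelines tokenize differently: the Spark version splits on single spaces and keeps only the first run of letters in each token, while `re.findall` keeps every run of letters in the line. The sketch below mimics both behaviours in plain Python. The function names and the sample sentence are hypothetical, and the Spark side is an approximation of the DataFrame pipeline, not the pipeline itself.

```python
import re
from collections import Counter

def spark_style_tokens(line):
    """Approximate the Spark pipeline: split on single spaces, lowercase,
    keep only the first run of a-z in each token, drop tokens without letters."""
    tokens = []
    for raw in line.split(" "):
        match = re.search(r"[a-z]+", raw.lower())
        if match:
            tokens.append(match.group(0))
    return tokens

def findall_tokens(line):
    """Reproduce the plain Python loop: every run of a-z in the lowercased line."""
    return re.findall(r"[a-z]+", line.lower())

# Hypothetical sample line, not taken from the books
sample = "The well-known man--the one I met--didn't stop."

spark_counts = Counter(spark_style_tokens(sample))
python_counts = Counter(findall_tokens(sample))

print(spark_counts["the"], python_counts["the"])                # → 1 2
print(sum(spark_counts.values()), sum(python_counts.values()))  # → 7 11
```

In the sample, "man--the" counts as one Spark token and yields only "man", while `re.findall` also picks up the second "the". Hyphenated words and contractions behave the same way, so the two top-10 tables agree on which words are most frequent but not on the exact counts.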
