## Counting letter frequencies in a novel

See https://towardsdatascience.com/a-hands-on-demo-of-analyzing-big-data-with-spark-68cb6600a295

Project Gutenberg is an online repository of books in the public domain, so we can pull a book from there to analyze. Let’s use Leo Tolstoy’s War and Peace: I’ve always wanted to read it, or at least know how frequently each letter of the alphabet appears! [3]

Below, we download the HTML of the novel’s webpage with requests, parse it with the Beautiful Soup Python library, tidy up the paragraphs, and append them to a list. The original article then removes the first 383 paragraphs, which are just the table of contents, leaving 11,186 paragraphs. In this run that step is skipped, so we keep 11,334 paragraphs ranging from 4 to 4,381 characters. (The string kind of characters, but with War and Peace, maybe novel characters too.)

In [11]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [12]:
# Pull book
book_url = "https://www.gutenberg.org/files/2600/2600-h/2600-h.htm"
response = requests.get(book_url)
soup = BeautifulSoup(response.text, 'html')

In [13]:
# Create list to store clean paragraphs
pars = []

# Set minimum str length for paragraphs
MIN_PAR_LEN = 2

# Iterate through paragraphs
for par in soup.find_all('p'):

    # Remove newlines and carriage returns
    par_clean = par.text.replace('\n', '').replace('\r', '')

    if par_clean == '':
        continue

    # Remove extra spaces
    par_clean = par_clean.split('  ')
    par_clean = ' '.join([p for p in par_clean if p != ''])

    # Add cleaned paragraph to list
    if len(par_clean) > MIN_PAR_LEN:
        pars.append(par_clean)

In [16]:
for ipar, par in enumerate(pars[:5]):
    print(ipar, par)

0 “Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”  
1 It was in July, 1805, and the speaker was the well-known Anna PÃ¡vlovna SchÃ©rer, maid of honor and favorite of the Empress MÃ¡rya FÃ«dorovna. With these words she greeted Prince VasÃ­li KurÃ¡gin, a man of high rank and importance, who was the first to arrive at her reception. Anna PÃ¡vlovna had had a cough for some days. She was, as she said, suffering from la grippe; grippe being then a new word in St. Petersburg, used only by the elite.
2 All her invitations without exception, written in French, and delivered by a scarl
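
The accented names in paragraph 1 (for example `PÃ¡vlovna`) look like UTF-8 bytes that were decoded as Latin-1. The sketch below reproduces that effect on a sample string and then reverses it. It is an illustration under that assumption, not something the original notebook does.

```python
# Assumption: the garbling comes from UTF-8 bytes decoded as Latin-1.
original = "Anna Pávlovna Schérer"

# Reproduce the garbling: encode as UTF-8, then decode with the wrong codec.
garbled = original.encode("utf-8").decode("latin-1")

# Reverse it: recover the raw bytes, then decode with the right codec.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(repaired)
```

With `requests`, setting `response.encoding = 'utf-8'` before reading `response.text` is one way to avoid the problem at the source, assuming the page really is UTF-8. Doing so would likely shift the character counts shown later by a small amount, since each garbled letter currently counts as two characters.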

In [21]:
for ipar, par in enumerate(pars[-5:]):
    print(len(pars)-5+ipar, par)

11329 Just so it now seems as if we have only to admit the law of inevitability, to destroy the conception of the soul, of good and evil, and all the institutions of state and church that have been built up on those conceptions.
11330 So too, like Voltaire in his time, uninvited defenders of the law of inevitability today use that law as a weapon against religion, though the law of inevitability in history, like the law of Copernicus in astronomy, far from destroying, even strengthens the foundation on which the institutions of state and church are erected.
11331 As in the question of astronomy then, so in the question of history now, the whole difference of opinion is based on the recognition or nonrecognition of something absolute, serving as the measure of visible phenomena. In astronomy it was the immovability of the earth, in history it is the independence of personality—free will.
11332 As with astronomy the difficulty of recognizing the motion of the earth lay in abandoning the 

In [22]:
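# The original article drops the table of contents here. That step is skipped
# in this run because pars already starts at the first paragraph of the novel
# and ends at the last one.
# pars = pars[383:]

# Summarize paragraph lengths
pd.Series([len(par) for par in pars]).describe().astype(int)
# Output reported in the linked article (table of contents removed):
# count    11186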
# mean       284
# std        337
# min          4
# 25%         84
# 50%        170
# 75%        347
# max       4381

count    11334
mean       281
std        336
min          4
25%         83
50%        167
75%        343
max       4381
dtype: int64

Despite War and Peace being a massive novel, pandas still computes these high-level metrics without a problem: the line `pd.Series([len(par) for par in pars]).describe().astype(int)` runs nearly instantly on my laptop.
But Spark can start to pull ahead when we ask tougher questions, like the frequency of each letter throughout the book. This is because the paragraphs can be processed independently of one another: Spark can process several paragraphs at a time across its partitions, whereas plain Python and pandas process them one by one.
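
To make the independence argument concrete, here is a minimal plain-Python sketch (no Spark involved). The `paragraphs` list and the `count_chunk` helper are hypothetical stand-ins, and the chunks are only loosely analogous to Spark partitions.

```python
from collections import Counter

# Hypothetical stand-in for the list of paragraphs
paragraphs = ["war and peace", "peace and war", "and so on", "the end"]

def count_chunk(chunk):
    """Count characters in one chunk of paragraphs, independently of the others."""
    counts = Counter()
    for par in chunk:
        counts.update(par)
    return counts

# Split the list into chunks, loosely analogous to Spark partitions
chunk_size = 2
chunks = [paragraphs[i:i + chunk_size] for i in range(0, len(paragraphs), chunk_size)]

# Each chunk could be counted on a different core; here we simply loop
partial_counts = [count_chunk(chunk) for chunk in chunks]

# Merging the partial results gives the same answer as one sequential pass
merged = sum(partial_counts, Counter())
sequential = Counter("".join(paragraphs))

print(merged == sequential)
```

Because no chunk needs to know about any other, the per-chunk work can run in any order or at the same time, and only the final merge has to see all the partial results.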

As before, we start our Spark session and create an RDD of our paragraphs. We also load `Counter`, a class from Python’s standard-library `collections` module that is built for counting, and `reduce` from `functools`, which we’ll use to demo the base Python approach later.

In [25]:
from functools import reduce
from collections import Counter
from pyspark.sql import SparkSession

In [26]:
# Create Spark session
spark = SparkSession.builder.getOrCreate()

# Partition novel into RDD
rdd = spark.sparkContext.parallelize(pars)

In [27]:
rdd.take(3)

['“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”  ',
 'It was in July, 1805, and the speaker was the well-known Anna PÃ¡vlovna SchÃ©rer, maid of honor and favorite of the Empress MÃ¡rya FÃ«dorovna. With these words she greeted Prince VasÃ\xadli KurÃ¡gin, a man of high rank and importance, who was the first to arrive at her reception. Anna PÃ¡vlovna had had a cough for some days. She was, as she said, suffering from la grippe; grippe being then a new word in St. Petersburg, used only by the elite.',
 'All her invitations without exception, written in French, and delivered by 

In [28]:
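# map is a lazy transformation: nothing is computed yet,
# Spark just returns a handle to the new RDD

rdd.map(Counter)

# Output reported in the linked article: PythonRDD[12] at RDD at PythonRDD.scala:53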


PythonRDD[2] at RDD at PythonRDD.scala:53
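
Python's built-in `map` is lazy in a similar spirit, which makes for a small analogy that runs without Spark. This is only an analogy: Spark's scheduling works differently, and the `count_chars` helper and sample words are hypothetical.

```python
calls = []

def count_chars(text):
    """Record that the function ran, then return the length of the text."""
    calls.append(text)
    return len(text)

# Building the map does not call count_chars yet
lazy = map(count_chars, ["war", "and", "peace"])
calls_before = len(calls)

# Consuming two elements runs the function exactly twice
first_two = [next(lazy), next(lazy)]
calls_after = len(calls)

print(calls_before, first_two, calls_after)
```

In Spark terms, building the `map` plays the role of a transformation, and pulling values out of it plays the role of an action such as `take`.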

In [29]:
# take is an action: asking for the first two results forces execution
rdd.map(Counter).take(2)

# [Counter({'“': 1,
#           'W': 1,
#           'e': 40,
#           ...
#           '!': 1,
#           '?': 1,
#           '”': 1}),
#  Counter({'I': 1,
#           't': 19,
#           ' ': 79,
#           ...
#           'K': 1,
#           ';': 1,
#           'b': 3})]

[Counter({'“': 1,
          'W': 1,
          'e': 40,
          'l': 23,
          ',': 6,
          ' ': 89,
          'P': 1,
          'r': 21,
          'i': 21,
          'n': 25,
          'c': 6,
          's': 19,
          'o': 30,
          'G': 1,
          'a': 27,
          'd': 14,
          'L': 1,
          'u': 15,
          'w': 8,
          'j': 1,
          't': 32,
          'f': 11,
          'm': 8,
          'y': 15,
          'h': 17,
          'B': 3,
          'p': 3,
          '.': 2,
          'I': 5,
          '’': 2,
          'b': 2,
          'A': 2,
          '—': 3,
          'v': 4,
          'g': 4,
          '‘': 1,
          '!': 1,
          '?': 1,
          '”': 1}),
 Counter({'I': 1,
          't': 19,
          ' ': 79,
          'w': 10,
          'a': 31,
          's': 20,
          'i': 19,
          'n': 26,
          'J': 1,
          'u': 6,
          'l': 9,
          'y': 5,
          ',': 8,
          '1': 1,
          '8': 1,
    

`rdd.map(Counter)` gives us a new RDD with the character frequencies for each paragraph, but we actually want the frequencies for the entire book. Fortunately, we can get them by simply adding the `Counter` objects together.
We perform this reduction from a multi-element RDD to a single output with the `.reduce` method, passing in an anonymous addition function to specify how to collapse the RDD. [4] The result is a single `Counter` object. We then finish our analysis by using its `.most_common` method to print out the ten most common characters.

In [30]:
rdd.map(Counter).reduce(lambda x, y: x + y).most_common(10)
# Output reported in the linked article (table of contents removed):
# [(' ', 554621),
#  ('e', 308958),
#  ('t', 217658),
#  ('a', 194793),
#  ('o', 186623),
#  ('n', 179202),
#  ('i', 162925),
#  ('h', 162216),
#  ('s', 158811),
#  ('r', 143914)]

[(' ', 556171),
 ('e', 309614),
 ('t', 218172),
 ('a', 195208),
 ('o', 187149),
 ('n', 179581),
 ('i', 163298),
 ('h', 162543),
 ('s', 159207),
 ('r', 144268)]
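
The top entry above is the space character, and upper and lower case are counted separately. If only letters are of interest, one possible variation (not part of the original notebook) is to filter and lower-case each paragraph before counting. The `letter_counts` helper and the sample strings below are hypothetical.

```python
from collections import Counter

def letter_counts(paragraph):
    """Count alphabetic characters only, ignoring case."""
    return Counter(ch for ch in paragraph.lower() if ch.isalpha())

# Hypothetical sample; with Spark, this function could be passed to rdd.map
# in place of Counter
sample = ["Well, Prince!", "War and Peace"]
total = sum((letter_counts(p) for p in sample), Counter())

print(total.most_common(3))
```

Note that `str.isalpha` also accepts accented letters, so this keeps characters such as `á` rather than restricting the count to the 26 letters of the English alphabet.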

In [31]:

%%timeit
rdd.map(Counter).reduce(lambda x, y: x + y).most_common(5)
# Reported in the linked article: 272 ms ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

274 ms ± 51.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [32]:

%%timeit
reduce(lambda x, y: x + y, (Counter(val) for val in pars)).most_common(5)
# Reported in the linked article: 806 ms ± 9.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

304 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
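
The base Python version above builds a brand-new `Counter` for every addition. A variation that updates a single `Counter` in place avoids those intermediate objects and may run faster. It was not timed in this notebook, so treat the performance claim as an assumption; the sketch below only checks that both approaches give the same counts on a hypothetical sample.

```python
from collections import Counter
from functools import reduce

# Hypothetical stand-in for the list of paragraphs
paragraphs = ["war and peace", "peace and war", "the end"]

# Approach timed above: every + creates a new Counter
by_addition = reduce(lambda x, y: x + y, (Counter(p) for p in paragraphs))

# Alternative: update one Counter in place
in_place = Counter()
for par in paragraphs:
    in_place.update(par)

print(by_addition == in_place)
```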
