# Prerequisites

In [1]:
# Update packages and install required java version
!apt-get update
!apt-get install openjdk-21-jdk-headless -qq > /dev/null

# download and unzip spark
!wget -nc -q https://downloads.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz
!tar xf spark-4.0.0-bin-hadoop3.tgz

# get data for labs
!wget -nc -O around_the_world_in_80_days.txt https://www.gutenberg.org/ebooks/103.txt.utf-8

# install findspark
!pip install -q findspark

Get:1 https://cli.github.com/packages stable InRelease [3,917 B]
Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:6 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:7 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:12 https://r2u.stat.illinois.edu/ubuntu jammy/main amd64 Packages [2,798 kB]
Get:13 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [3,638 kB]
Get:14 htt

In [2]:
import os
import findspark

# set env vars for java and spark
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-21-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-4.0.0-bin-hadoop3"

# start findspark so notebook can interact with spark
findspark.init()


In [3]:
# what does findspark do? use the ?? magic command to find out
# Note 1: in colab, this may open in a side panel
# Note 2: this magic command is often helpful when encountering an object in a
# notebook that is unfamiliar. More information will be displayed if it exists
?? findspark

# 1. Word Count

Instructions:  
For each cell marked "double-click and add explanation here", please answer the question in your own words.  
In the section where you complete the code to perform basic NLP text cleaning and exploration tasks, the goal is to chain all of the transformations together in a single function. For learning and exploration purposes, it is acceptable to keep each step separate, but the last cell in this section should be one function with all transformations chained together.  
For steps c and f, it is acceptable to use your favorite chatbot to generate a list of common stop words (c) and punctuation (f) for use in the code. As these are common steps in NLP/text processing tasks, there are plenty of libraries to help with this, such as nltk, but there is no need to import extra dependencies for this lab unless you are already familiar with them.

In [4]:
# start a spark session and create spark context for making rdd
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("word_count") \
    .getOrCreate()

sc = spark.sparkContext

In [5]:
# Define the RDD
rdd = sc.textFile('/content/around_the_world_in_80_days.txt')

In [6]:
# view the first x lines of the rdd
rdd.take(20)

['The Project Gutenberg eBook of Around the World in Eighty Days',
 '    ',
 'This ebook is for the use of anyone anywhere in the United States and',
 'most other parts of the world at no cost and with almost no restrictions',
 'whatsoever. You may copy it, give it away or re-use it under the terms',
 'of the Project Gutenberg License included with this ebook or online',
 'at www.gutenberg.org. If you are not located in the United States,',
 'you will have to check the laws of the country where you are located',
 'before using this eBook.',
 '',
 'Title: Around the World in Eighty Days',
 '',
 'Author: Jules Verne',
 '',
 'Release date: January 1, 1994 [eBook #103]',
 '                Most recently updated: October 29, 2024',
 '',
 'Language: English',
 '',
 '']

In [7]:
# example lambda function
words = rdd.flatMap(lambda lines: lines.split(' '))

In [8]:
# Note and explain the output of the below command
words

PythonRDD[3] at RDD at PythonRDD.scala:56

Each line is split on every space character, and flatMap flattens the per-line lists into a single collection of words.
Because flatMap is a transformation applied to an RDD, words is itself an RDD.
Evaluating words on its own does not ask Spark to compute anything, so the notebook only shows a description of the RDD object.
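
The lazy behavior described above can be imitated in plain Python with a generator expression. This is only an analogy and not how Spark works internally; `split_line` and `calls` are illustrative names that do not appear in the lab.

```python
# Plain-Python analogy for lazy transformations (this is not Spark code).
calls = []  # records each time the split function actually runs


def split_line(line):
    calls.append(line)
    return line.split(' ')


lines = ['The Project Gutenberg eBook', 'Title: Around the World']

# Building the pipeline runs nothing, like rdd.flatMap(...) on its own.
lazy_words = (word for line in lines for word in split_line(line))
calls_before = len(calls)

# Materializing the result forces the work, like calling collect().
words_list = list(lazy_words)
calls_after = len(calls)

print(calls_before, calls_after)
print(words_list)
```

The split function is not called until the generator is consumed, in the same way that Spark defers the split until an action such as collect() or take() is invoked.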

In [9]:
# Note and explain the output of the following command, focusing on the difference with the
# above command
words.collect()

['The',
 'Project',
 'Gutenberg',
 'eBook',
 'of',
 'Around',
 'the',
 'World',
 'in',
 'Eighty',
 'Days',
 '',
 '',
 '',
 '',
 '',
 'This',
 'ebook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'in',
 'the',
 'United',
 'States',
 'and',
 'most',
 'other',
 'parts',
 'of',
 'the',
 'world',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever.',
 'You',
 'may',
 'copy',
 'it,',
 'give',
 'it',
 'away',
 'or',
 're-use',
 'it',
 'under',
 'the',
 'terms',
 'of',
 'the',
 'Project',
 'Gutenberg',
 'License',
 'included',
 'with',
 'this',
 'ebook',
 'or',
 'online',
 'at',
 'www.gutenberg.org.',
 'If',
 'you',
 'are',
 'not',
 'located',
 'in',
 'the',
 'United',
 'States,',
 'you',
 'will',
 'have',
 'to',
 'check',
 'the',
 'laws',
 'of',
 'the',
 'country',
 'where',
 'you',
 'are',
 'located',
 'before',
 'using',
 'this',
 'eBook.',
 '',
 'Title:',
 'Around',
 'the',
 'World',
 'in',
 'Eighty',
 'Days',
 '',
 'Author:',
 'Jules'

collect() is an action: it triggers the computation and returns all elements of the RDD to the driver as a Python list, whereas evaluating words on its own only displayed a reference to the RDD.

In [10]:
# nicer print
for w in words.collect():
    print(w)

Streaming output truncated to the last 5000 lines.
was
drawing
near
his
last
turning-point.
The
bonds
were
quoted,
no
longer
at
a
hundred
below
par,
but
at
twenty,
at
ten,
and
at
five;
and
paralytic
old
Lord
Albemarle
bet
even
in
his
favour.

A
great
crowd
was
collected
in
Pall
Mall
and
the
neighbouring
streets
on
Saturday
evening;
it
seemed
like
a
multitude
of
brokers
permanently
established
around
the
Reform
Club.
Circulation
was
impeded,
and
everywhere
disputes,
discussions,
and
financial
transactions
were
going
on.
The
police
had
great
difficulty
in
keeping
back
the
crowd,
and
as
the
hour
when
Phileas
Fogg
was
due
approached,
the
excitement
rose
to
its
highest
pitch.

The
five
antagonists
of
Phileas
Fogg
had
met
in
the
great
saloon
of
the
club.
John
Sullivan
and
Samuel
Fallentin,
the
bankers,
Andrew
Stuart,
the
engineer,
Gauthier
Ralph,
the
director
of
the
Bank
of
England,
and
Thomas
Flanagan,
the
brewer,
one
and
all
waited
anxiously.

When


In [11]:
# Print first x words
words.take(20)

['The',
 'Project',
 'Gutenberg',
 'eBook',
 'of',
 'Around',
 'the',
 'World',
 'in',
 'Eighty',
 'Days',
 '',
 '',
 '',
 '',
 '',
 'This',
 'ebook',
 'is',
 'for']

In [12]:
# Use cell magic command to help understand what the rdd.flatMap function is doing in the next cell.
# Insert a text/markdown cell and explain in your own words.
rdd.flatMap

rdd.flatMap() applies the given function to every element of the RDD, flattens the results, and returns a new RDD.
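
The difference between a map-style and a flatMap-style result can be sketched in plain Python. The snippet below is an illustration using `itertools.chain`, not the Spark implementation, and the sample lines are made up to resemble the book header.

```python
from itertools import chain

lines = ['Title: Around the World', '', 'Author: Jules Verne']

# map-like: one output element per input element (a list per line).
mapped = [line.split(' ') for line in lines]

# flatMap-like: the per-line lists are concatenated into one sequence.
flat_mapped = list(chain.from_iterable(line.split(' ') for line in lines))

print(mapped)
print(flat_mapped)
```

Note that splitting an empty line on a space yields one empty string, which is one reason empty strings show up among the words in this notebook.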

In [13]:
# Initialize a word counter by creating a tuple with word and count of 1
words = rdd.flatMap(lambda lines: lines.split(' ')) \
                    .map(lambda word: (word, 1))

for w in words.collect():
    print(w)

Streaming output truncated to the last 5000 lines.
('was', 1)
('drawing', 1)
('near', 1)
('his', 1)
('last', 1)
('turning-point.', 1)
('The', 1)
('bonds', 1)
('were', 1)
('quoted,', 1)
('no', 1)
('longer', 1)
('at', 1)
('a', 1)
('hundred', 1)
('below', 1)
('par,', 1)
('but', 1)
('at', 1)
('twenty,', 1)
('at', 1)
('ten,', 1)
('and', 1)
('at', 1)
('five;', 1)
('and', 1)
('paralytic', 1)
('old', 1)
('Lord', 1)
('Albemarle', 1)
('bet', 1)
('even', 1)
('in', 1)
('his', 1)
('favour.', 1)
('', 1)
('A', 1)
('great', 1)
('crowd', 1)
('was', 1)
('collected', 1)
('in', 1)
('Pall', 1)
('Mall', 1)
('and', 1)
('the', 1)
('neighbouring', 1)
('streets', 1)
('on', 1)
('Saturday', 1)
('evening;', 1)
('it', 1)
('seemed', 1)
('like', 1)
('a', 1)
('multitude', 1)
('of', 1)
('brokers', 1)
('permanently', 1)
('established', 1)
('around', 1)
('the', 1)
('Reform', 1)
('Club.', 1)
('Circulation', 1)
('was', 1)
('impeded,', 1)
('and', 1)
('everywhere', 1)
('disputes,', 1)

In [14]:
# a. count the occurrences of each word
occurence = {}

for key, value in words.collect():
    if key in occurence:
        occurence[key] += value
    else:
        occurence[key] = value

occurence

{'The': 482,
 'Project': 79,
 'Gutenberg': 22,
 'eBook': 4,
 'of': 1875,
 'Around': 4,
 'the': 4316,
 'World': 3,
 'in': 991,
 'Eighty': 3,
 'Days': 3,
 '': 2182,
 'This': 46,
 'ebook': 2,
 'is': 288,
 'for': 407,
 'use': 16,
 'anyone': 6,
 'anywhere': 4,
 'United': 23,
 'States': 10,
 'and': 1792,
 'most': 43,
 'other': 59,
 'parts': 4,
 'world': 30,
 'at': 576,
 'no': 124,
 'cost': 9,
 'with': 550,
 'almost': 19,
 'restrictions': 2,
 'whatsoever.': 2,
 'You': 31,
 'may': 38,
 'copy': 9,
 'it,': 37,
 'give': 17,
 'it': 322,
 'away': 16,
 'or': 185,
 're-use': 2,
 'under': 41,
 'terms': 21,
 'License': 8,
 'included': 3,
 'this': 292,
 'online': 4,
 'www.gutenberg.org.': 4,
 'If': 31,
 'you': 243,
 'are': 169,
 'not': 500,
 'located': 7,
 'States,': 6,
 'will': 108,
 'have': 259,
 'to': 1690,
 'check': 4,
 'laws': 11,
 'country': 18,
 'where': 65,
 'before': 77,
 'using': 6,
 'eBook.': 2,
 'Title:': 1,
 'Author:': 1,
 'Jules': 2,
 'Verne': 2,
 'Release': 1,
 'date:': 1,
 'January': 1,


In [15]:
# b. a common first step in text analysis, change all capital letters to lower case
words_lower_case = rdd.flatMap(lambda lines: lines.split(' ')).map(lambda x: x.lower())
words_lower_case.take(20)


['the',
 'project',
 'gutenberg',
 'ebook',
 'of',
 'around',
 'the',
 'world',
 'in',
 'eighty',
 'days',
 '',
 '',
 '',
 '',
 '',
 'this',
 'ebook',
 'is',
 'for']

In [16]:
# c. eliminate the stop words.
stopwords = [
    "a", "an", "and", "are", "as", "at", "be", "but",
    "by", "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they",
    "this", "to", "was", "will", "with"
]

words_stopwords = words_lower_case.filter(lambda y: y not in stopwords)

words_stopwords.take(20)

['project',
 'gutenberg',
 'ebook',
 'around',
 'world',
 'eighty',
 'days',
 '',
 '',
 '',
 '',
 '',
 'ebook',
 'use',
 'anyone',
 'anywhere',
 'united',
 'states',
 'most',
 'other']
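
One caveat with filtering stop words at this point: the tokens still carry punctuation, so a token such as "and," is not equal to "and" and passes the filter. This is likely why stop words such as 'and', 'or' and 'if' reappear in the output of step f. A small plain-Python sketch of the ordering effect, using a made-up token list and a reduced stop word set:

```python
import re

stopwords = {"and", "or", "if", "the"}  # small illustrative subset
tokens = ["and,", "the", "or", "japan,", "if"]

# Filter first, strip punctuation second: "and," slips through the filter.
filter_then_strip = [re.sub(r"[^a-z0-9]", "", t)
                     for t in tokens if t not in stopwords]

# Strip punctuation first, filter second: "and," becomes "and" and is dropped.
stripped = [re.sub(r"[^a-z0-9]", "", t) for t in tokens]
strip_then_filter = [t for t in stripped if t not in stopwords]

print(filter_then_strip)
print(strip_then_filter)
```

The combined function at the end of this section cleans punctuation before removing stop words, which avoids the problem.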

In [17]:
# d. sort in alphabetical order
words_sorted = words_stopwords.sortBy(lambda z: z)

words_sorted.take(20)

['',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '']

In [18]:
# e. sort descending by word frequency

# each element is already a single word, so no further split is needed:
# map every word to (word, 1) and sum the counts per word
words_maped_reduced = words_sorted.map(lambda word: (word, 1)).reduceByKey(lambda x,y: x + y)

words_sorted_frequency = words_maped_reduced.sortBy(lambda z: z[1], ascending=False)

words_sorted_frequency.take(20)


[('', 2182),
 ('he', 930),
 ('his', 855),
 ('had', 503),
 ('which', 490),
 ('mr.', 373),
 ('fogg', 365),
 ('from', 323),
 ('were', 303),
 ('you', 280),
 ('would', 274),
 ('have', 267),
 ('phileas', 250),
 ('passepartout', 239),
 ('i', 207),
 ('him', 183),
 ('who', 182),
 ('been', 167),
 ('her', 157),
 ('said', 157)]

In [19]:
# f. remove punctuation and blank spaces
import re, unicodedata

to_remove = [
    " ", "", ".", ",", ";", ":", "!", "?", "'", '"', "«", "»", "_",
    "-", "–", "—", "(", ")", "[", "]", "{", "}", "/", "\\",
    "*", "&", "#", "@", "%", "+", "=", "~", "`", "^",
    "<", ">", "|", "¿", "¡", "§", "¶", "†", "‡", "•", "…",
    "´", "¨", "°", "©", "®", "™", "€", "$", "£", "¥", "₹", "”", "“"
]

words_clean = words_sorted.map(lambda y: unicodedata.normalize("NFKD", y)) \
    .map(lambda y: y.replace("_", "")) \
    .map(lambda y: re.sub(r"[^A-Za-z0-9\s]", "", y)) \
    .map(lambda y: re.sub(r'[\u2000-\u200F\u2028-\u202F\u2066-\u2069\uFEFF]', '', y)) \
    .filter(lambda y: y not in to_remove) \
    .map(lambda y: re.sub(r"\s+", "", y))

words_clean.take(300)

['103',
 '5000',
 'c',
 '1',
 '801',
 'a',
 'and',
 'any',
 'b',
 'c',
 'does',
 'if',
 'japan',
 'or',
 'or',
 'or',
 'saturday',
 'sort',
 'sunday',
 'trademarkcopyright',
 'wwwgutenbergorg',
 'the',
 '1',
 '1000',
 '1170',
 '1',
 '1a',
 '1b',
 '1c',
 '1c',
 '1d',
 '1e',
 '1e',
 '1e1',
 '1e1',
 '1e1',
 '1e1',
 '1e1',
 '1e2',
 '1e3',
 '1e4',
 '1e5',
 '1e6',
 '1e7',
 '1e7',
 '1e7',
 '1e8',
 '1e8',
 '1e8',
 '1e8',
 '1e9',
 '1e9',
 '1e9',
 '1f',
 '1f1',
 '1f2',
 '1f3',
 '1f3',
 '1f3',
 '1f3',
 '1f3',
 '1f4',
 '1f5',
 '1f6',
 '10th',
 '11',
 '1140',
 '117',
 '117',
 '11th',
 '11th',
 '11th',
 '11th',
 '11th',
 '11th',
 '11th',
 '12th',
 '12th',
 '12th',
 '13',
 '13',
 '13th',
 '13th',
 '13th',
 '14th',
 '14th',
 '14th',
 '14th',
 '1500',
 '15812',
 '15th',
 '16th',
 '1756',
 '17th',
 '17th',
 '1814',
 '1825',
 '1837',
 '1839',
 '1842',
 '1843',
 '1845',
 '1849a',
 '1853',
 '1862',
 '1867',
 '1872',
 '1872',
 '18th',
 '1994',
 '19th',
 '2',
 '20',
 '2001',
 '2024',
 '20th',
 '20th',
 '20th

In [20]:
#Full cleaning/splitting function
def clean_split(rdd):
  #split
  words = rdd.flatMap(lambda lines: lines.split(' '))

  #lowercasing
  words_lower_case = words.map(lambda x: x.lower())

  # remove punctuation, whitespace, and non-alphanumeric characters

  to_remove = [
      " ", "", ".", ",", ";", ":", "!", "?", "'", '"', "«", "»", "_",
      "-", "–", "—", "(", ")", "[", "]", "{", "}", "/", "\\",
      "*", "&", "#", "@", "%", "+", "=", "~", "`", "^",
      "<", ">", "|", "¿", "¡", "§", "¶", "†", "‡", "•", "…",
      "´", "¨", "°", "©", "®", "™", "€", "$", "£", "¥", "₹", "”", "“"
  ]

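  # 1. decompose Unicode characters with unicodedata (NFKD)
  # 2. remove "_" and all other non-alphanumeric characters
  # 3. remove invisible Unicode formatting characters
  # 4. drop tokens that appear in to_remove (e.g. empty strings)
  # 5. remove any remaining whitespace characters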


  words_clean = words_lower_case.map(lambda y: unicodedata.normalize("NFKD", y)) \
      .map(lambda y: y.replace("_", "")) \
      .map(lambda y: re.sub(r"[^A-Za-z0-9\s]", "", y)) \
      .map(lambda y: re.sub(r'[\u2000-\u200F\u2028-\u202F\u2066-\u2069\uFEFF]', '', y)) \
      .filter(lambda y: y not in to_remove) \
      .map(lambda y: re.sub(r"\s+", "", y))

  #stopword
  stopwords = [
      "a", "an", "and", "are", "as", "at", "be", "but",
      "by", "for", "if", "in", "into", "is", "it",
      "no", "not", "of", "on", "or", "such", "that",
      "the", "their", "then", "there", "these", "they",
      "this", "to", "was", "will", "with"
  ]

  words_stopwords = words_clean.filter(lambda y: y not in stopwords)

  #mapReduce
  words_maped_reduced = words_stopwords.map(lambda word: (word, 1)).reduceByKey(lambda x,y: x + y)


  return words_maped_reduced


# 2. What does the following cell block do?
Comment the code below line by line after each provided hash mark (#). You should be able to explain each line while respecting the PEP 8 limit of 79 characters per line.

In [21]:
 # Create an RDD of tuples (name, age)
dataRDD = sc.parallelize([("Brooke", 20), ("Denny", 31), ("Jules", 30),
("TD", 35), ("Brooke", 25)])

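# Try to understand what this code does (line by line)
agesRDD = (dataRDD
  # pair each age with a count of 1: (name, (age, 1))
  .map(lambda x: (x[0], (x[1], 1)))
  # per name, sum the ages and the counts: (name, (age_sum, n))
  .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
  # divide the summed age by the count: (name, average_age)
  .map(lambda x: (x[0], x[1][0]/x[1][1])))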

agesRDD.take(4)

[('Brooke', 22.5), ('Denny', 31.0), ('Jules', 30.0), ('TD', 35.0)]
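
The three chained steps compute an average age per name by carrying a (sum, count) pair through the reduction. The same arithmetic can be traced in plain Python; this is a sketch of the logic only, with dictionaries standing in for the RDD.

```python
data = [("Brooke", 20), ("Denny", 31), ("Jules", 30),
        ("TD", 35), ("Brooke", 25)]

# Step 1 (map): pair each age with a count of 1 -> (name, (age, 1)).
pairs = [(name, (age, 1)) for name, age in data]

# Step 2 (reduceByKey): per name, add ages and counts element-wise.
totals = {}
for name, (age, count) in pairs:
    age_sum, n = totals.get(name, (0, 0))
    totals[name] = (age_sum + age, n + count)

# Step 3 (map): divide the summed age by the count to get the mean.
averages = {name: age_sum / n for name, (age_sum, n) in totals.items()}
print(averages)
```

For "Brooke" the intermediate pair is (45, 2), which gives the 22.5 shown in the notebook output.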

# Where to go from here

Further exploration for students who complete the lab before the end of the session or want to go further.

- perform EDA on the original French version of the [book](https://www.gutenberg.org/ebooks/46541.txt.utf-8) and compare the two
- redo the exercises using the Docker install
- install Java and Spark directly on the host machine and either re-explore this notebook or perform EDA on other data sets
- write a simple Python timer function to see how quickly your RDD pipeline runs as written, then change the order of the steps to make it run as efficiently as possible
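
For the timer suggestion in the last bullet, one possible starting point is a small wrapper around `time.perf_counter`. The helper name `timed` and the stand-in workload are assumptions made for this sketch.

```python
import time


def timed(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the notebook this could wrap an action such as
# lambda: clean_split(rdd).count(), since only actions trigger computation.
result, elapsed = timed(sum, range(1_000_000))
print(f"result={result}, elapsed={elapsed:.4f}s")
```

Because transformations are lazy, timing a call that only builds an RDD would measure almost nothing; the timed function needs to end in an action.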

In [42]:
def clean_split_fr(rdd):
  words = rdd.flatMap(lambda lines: lines.split(' '))
  words_lower_case = words.map(lambda x: x.lower())

  to_remove = [
      " ", "", ".", ",", ";", ":", "!", "?", "'", '"', "«", "»", "_",
      "-", "–", "—", "(", ")", "[", "]", "{", "}", "/", "\\",
      "*", "&", "#", "@", "%", "+", "=", "~", "`", "^",
      "<", ">", "|", "¿", "¡", "§", "¶", "†", "‡", "•", "…",
      "´", "¨", "°", "©", "®", "™", "€", "$", "£", "¥", "₹", "”", "“"
  ]

  words_clean = words_lower_case.map(lambda y: unicodedata.normalize("NFKD", y)) \
      .map(lambda y: y.replace("_", "")) \
      .map(lambda y: re.sub(r"[^A-Za-z0-9àâäéèêëïîôùûüÿœæç\s]", "", y)) \
      .map(lambda y: re.sub(r'[\u2000-\u200F\u2028-\u202F\u2066-\u2069\uFEFF]', '', y)) \
      .filter(lambda y: y not in to_remove) \
      .map(lambda y: re.sub(r"\s+", "", y))

  stopwords_fr = [
      "le", "la", "les", "un", "une", "des", "de", "du", "et",
      "est", "sont", "dans", "pour", "par", "sur", "avec", "sans",
      "sous", "que", "qui", "dont", "où", "ce", "ces", "son", "sa",
      "ses", "leur", "leurs", "il", "elle", "ils", "elles", "je",
      "tu", "nous", "vous", "au", "aux", "mais", "ou", "si", "ne",
      "pas", "plus", "tout", "tous", "toute", "toutes","a","à","en"
  ]

  words_stopwords = words_clean.filter(lambda y: y not in stopwords_fr)
  words_maped_reduced = words_stopwords.map(lambda word: (word, 1)).reduceByKey(lambda x,y: x + y)

  return words_maped_reduced
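
A detail worth checking in this function: NFKD normalization splits an accented letter into a base letter plus a combining mark, so the accented characters listed in the regex may never be seen in precomposed form. The plain-Python sketch below compares NFKD with NFC on one word; `strip_chars` is a hypothetical helper for this illustration.

```python
import re
import unicodedata

pattern = r"[^A-Za-z0-9àâäéèêëïîôùûüÿœæç\s]"


def strip_chars(word, form):
    """Normalize with the given Unicode form, then apply the regex."""
    return re.sub(pattern, "", unicodedata.normalize(form, word))


word = "\u00e9tait"  # "était" written with a precomposed é

# NFKD splits "é" into "e" plus a combining accent, which the regex removes.
nfkd_result = strip_chars(word, "NFKD")
# NFC keeps "é" as one code point, which the character class allows.
nfc_result = strip_chars(word, "NFC")

print(nfkd_result, nfc_result)
```

This appears to explain why the French results contain unaccented forms such as "etait" and "repondit", and it suggests that accented stop words such as "où" and "à" cannot match after cleaning. Switching to NFC would be one way to keep the accents, at the cost of changing the counts reported in this notebook.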

In [22]:
# assumes the French text linked above was saved as /content/french_book.txt
rdd_francais = sc.textFile('/content/french_book.txt')

In [43]:
word_counts_french = clean_split_fr(rdd_francais)

In [44]:
top_20_fr = word_counts_french.sortBy(lambda x: x[1], ascending=False).take(20)
print("Top 20 fr after cleaning")
for word, count in top_20_fr:
    print(f"{word}: {count}")

Top 20 fr after cleaning
fogg: 688
se: 592
passepartout: 453
lui: 333
phileas: 331
etait: 320
mr: 287
fix: 286
avait: 280
cette: 279
heures: 242
on: 238
repondit: 215
quil: 194
bien: 194
the: 193
comme: 173
deux: 155
fut: 150
monsieur: 145


In [45]:
word_counts_english = clean_split(rdd)


In [46]:
top_20_en = word_counts_english.sortBy(lambda x: x[1], ascending=False).take(20)
print("\nTop 20 en after cleaning")
for word, count in top_20_en:
    print(f"{word}: {count}")


Top 20 en after cleaning
he: 978
his: 858
fogg: 601
which: 515
had: 513
passepartout: 402
you: 395
mr: 389
i: 333
from: 325
him: 320
were: 308
would: 283
have: 276
phileas: 255
fix: 240
who: 201
said: 194
her: 183
all: 176


In [47]:
english_counts = clean_split(rdd)
french_counts = clean_split_fr(rdd_francais)

In [48]:
def get_top_n(word_counts, n=20):
    return word_counts.sortBy(lambda x: x[1], ascending=False).take(n)

In [49]:
def compare(en, fr):
    # Top 10 most frequent words
    print("TOP 10 WORDS:")
    for w, c in en.sortBy(lambda x: -x[1]).take(10):
        print(f"EN: {w}: {c}")
    print("---")
    for w, c in fr.sortBy(lambda x: -x[1]).take(10):
        print(f"FR: {w}: {c}")

    # Vocabulary diversity
    unique_en = en.count()
    unique_fr = fr.count()
    total_en = en.map(lambda x: x[1]).sum()
    total_fr = fr.map(lambda x: x[1]).sum()

    print(f"\nVOCABULARY DIVERSITY:")
    print(f"Unique words: EN={unique_en}, FR={unique_fr}")
    print(f"Total words: EN={total_en}, FR={total_fr}")
    print(f"Lexical richness: EN={unique_en/total_en:.3f}, FR={unique_fr/total_fr:.3f}")
    print("(Higher ratio = more diverse vocabulary)")



compare(english_counts, french_counts)

TOP 10 WORDS:
EN: he: 978
EN: his: 858
EN: fogg: 601
EN: which: 515
EN: had: 513
EN: passepartout: 402
EN: you: 395
EN: mr: 389
EN: i: 333
EN: from: 325
---
FR: fogg: 688
FR: se: 592
FR: passepartout: 453
FR: lui: 333
FR: phileas: 331
FR: etait: 320
FR: mr: 287
FR: fix: 286
FR: avait: 280
FR: cette: 279

VOCABULARY DIVERSITY:
Unique words: EN=7433, FR=10396
Total words: EN=44253, FR=46148
Lexical richness: EN=0.168, FR=0.225
(Higher ratio = more diverse vocabulary)
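
The lexical richness figure is a type-token ratio: the number of distinct words divided by the total number of words. A minimal sketch that recomputes it from the counts printed above (the function name is an assumption for this example):

```python
def lexical_richness(unique_words, total_words):
    """Type-token ratio: distinct words divided by total words."""
    return unique_words / total_words


# Counts taken from the comparison output in this notebook.
en = lexical_richness(7433, 44253)
fr = lexical_richness(10396, 46148)
print(f"EN={en:.3f}, FR={fr:.3f}")
```

When reading these ratios, keep in mind that both totals are counted after stop word removal with two different stop word lists, and that the French file seems to include some English text (for example "the" appears 193 times in the French top 20), so the comparison is indicative only.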
