# Chapter 04 Final Exercise:
# Text mining Shakespeare's _Hamlet_

In this chapter, we have explored a number of interesting aspects of strings, the primary data type within Julia for storing alphanumeric data. We have discussed string composition, manipulation, searching and matching, including regular expressions. Our sojourn into the land of strings ended with a discussion of streams and loading a string stream.

This exercise is intended as a practical demonstration of some of the features discussed in Chapter 04.

## Loading Hamlet

For the purposes of this chapter, I have downloaded a text of Shakespeare's _Hamlet_ made publicly available by textfiles.org. We will begin by loading this text for reading only.

For the sake of brevity, we'll only be reading in a limited number of lines. In this section, we are employing the handler function syntax discussed in _Listing 4.4_. The handler function reads the stream line by line, splitting it wherever a newline (`\n`) occurs, keeps the lines we want and joins them back into a single string.

We start reading at line 70, skipping the preceding lines, which consist mainly of the _dramatis personae_.

In [34]:
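function read_and_split(f::IOStream)
    # readlines() splits the stream into lines; each line keeps its trailing
    # newline, so joining with an empty separator restores the original text.
    return join(readlines(f)[70:2000], "")
end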

hamlet = open(read_and_split, "hamlet.txt")
print(hamlet)


SCENE I	Elsinore. A platform before the castle.

	[FRANCISCO at his post. Enter to him BERNARDO]

BERNARDO	Who's there?

FRANCISCO	Nay, answer me: stand, and unfold yourself.

BERNARDO	Long live the king!

FRANCISCO	Bernardo?

BERNARDO	He.

FRANCISCO	You come most carefully upon your hour.

BERNARDO	'Tis now struck twelve; get thee to bed, Francisco.

FRANCISCO	For this relief much thanks: 'tis bitter cold,
	And I am sick at heart.

BERNARDO	Have you had quiet guard?

FRANCISCO	Not a mouse stirring.

BERNARDO	Well, good night.
	If you do meet Horatio and Marcellus,
	The rivals of my watch, bid them make haste.

FRANCISCO	I think I hear them. Stand, ho! Who's there?

	[Enter HORATIO and MARCELLUS]

HORATIO	Friends to this ground.

MARCELLUS	And liegemen to the Dane.

FRANCISCO	Give you good night.

MARCELLUS	O, farewell, honest soldier:
	Who hath relieved you?

FRANCISCO	Bernardo has my place.
	Give you good night.

	[Exit]

MARCELLUS	Holla! Bernardo!

BERNARDO	Say,
	What, is Horatio t
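
The fixed range `70:2000` assumes that the file has at least 2,000 lines; on a shorter file, the indexing raises a `BoundsError`. The following is a minimal sketch of a more defensive variant, assuming Julia 1.x, where `readlines` drops the trailing newlines unless `keep = true` is passed. The name `read_and_join` and its keyword arguments are hypothetical, and an in-memory `IOBuffer` stands in for `hamlet.txt`.

```julia
# Hypothetical helper (not part of the chapter's listings), written for Julia 1.x.
function read_and_join(f::IO; first_line::Int = 1, last_line::Int = typemax(Int))
    # keep = true preserves each line's trailing newline, so that join()
    # reproduces the original text.
    lines = readlines(f; keep = true)
    # Clamp the upper bound so that short inputs do not raise a BoundsError.
    stop = min(last_line, length(lines))
    return join(lines[first_line:stop])
end

# An in-memory stream standing in for the file on disk.
sample = "line one\nline two\nline three\n"
excerpt = read_and_join(IOBuffer(sample); first_line = 2, last_line = 2000)
print(excerpt)
```

With a file on disk, the same function can be handed to `open` through an anonymous function, for example `open(f -> read_and_join(f; first_line = 70, last_line = 2000), "hamlet.txt")`.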

## All about Ophelia

As a little diversion, we would like to find out what the play tells us about Ophelia. In general, adjectives that describe a person either immediately precede that person's name or follow it, separated from the name by 'is' or 'was'. So we can create a function that gives us all the words that are in those significant positions when considering a particular person.

One of the tricky issues posed by this is the need to create a regex that includes the name of the character we are searching for - perhaps we are not just interested in Ophelia! Because a regex literal (`r"..."`) is string-like but not a string as such, classic string interpolation does not work:

In [35]:
cats_name = "River"
regexed_cat_sentence = r"My cat's name is $cats_name."

r"My cat's name is $cats_name."

This is clearly not what we wanted. However, using the `Regex()` constructor allows interpolation, since it constructs the regex from a string supplied to it, and strings _do_ allow interpolation.

In [36]:
regexed_cat_sentence = Regex("My cat's name is $cats_name.")

r"My cat's name is River."
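
The difference between the two forms can be checked directly. The snippet below is a small sketch rather than part of the chapter's listings: the sample sentence and the variable names `literal_regex` and `built_regex` are made up for illustration. It relies only on `match()` and on the `pattern` field of a `Regex` object.

```julia
cats_name = "River"

# In an r"..." literal, `$` reaches the regex engine unchanged, where it acts
# as an end-of-line anchor rather than as an interpolation marker.
literal_regex = r"name is $cats_name"

# Regex() receives an ordinary string, which is interpolated first.
built_regex = Regex("name is $cats_name")

sentence = "My cat's name is River."

println(built_regex.pattern)                          # name is River
println(match(built_regex, sentence) !== nothing)     # true
println(match(literal_regex, sentence) === nothing)   # true
```

One caveat: whatever is interpolated becomes part of the pattern, so characters such as `.` or `(` in the interpolated value are interpreted as regex syntax rather than as literal text.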

With that in mind, we can construct two regexes: one for the word preceding the name of a character (in the function, the name is represented by the argument `persona`) and one for the word that follows the name after `is` or `was`. We then simply execute the regexes and print the results.

In [37]:
function get_descriptors(persona::ASCIIString, play::ASCIIString)
    prior_regex = Regex("([a-zA-Z]*) $persona")
    posterior_regex = Regex("$persona (i|wa)s ([a-zA-Z]*)")
    
    prior_matches = eachmatch(prior_regex, play)
    posterior_matches = eachmatch(posterior_regex, play)
    
    println("All about $persona:")
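    # The loop variable is called `m` so that it does not shadow `match()`.
    for m in prior_matches
        # Skip empty and single-letter captures.
        if length(m[1]) > 1
            println("- $(m[1])")
        end
    end
    # In the posterior regex, group 1 is `i` or `wa`; the word is group 2.
    for m in posterior_matches
        if length(m[2]) > 1
            println("- $(m[2])")
        end
    end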
end

get_descriptors (generic function with 1 method)

In [38]:
get_descriptors("Ophelia", hamlet)

All about Ophelia:
- beautified
- dear


Of course, our method is not perfect. It performs a lot worse with Hamlet: many of the words it finds are prepositions or conjunctions rather than adjectives, and he is frequently addressed as `Lord Hamlet` - even though `Lord` is not his title!

In [39]:
get_descriptors("Hamlet", hamlet)

All about Hamlet:
- valiant
- to
- young
- of
- cousin
- Good
- of
- For
- Lord
- Lord
- Lord
- thee
- Lord
- as
- Lord
- Of
- where
- of
- from
- Lord
- Lord
- Lord
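
One way to cut down on this kind of noise is to collect the candidate words in an array and drop the obvious non-descriptors. The sketch below is a hypothetical variant of `get_descriptors`, written for Julia 1.x (hence `AbstractString` rather than `ASCIIString`). The function name, the `ignored` keyword and its default word list are assumptions chosen for illustration, not a proper stop-word list, and the sample sentence is made up.

```julia
# Hypothetical variant that returns the words instead of printing them.
function collect_descriptors(persona::AbstractString, play::AbstractString;
                             ignored = ["lord", "of", "to", "for", "as", "from"])
    prior_regex = Regex("([a-zA-Z]+) $persona")
    # (?:...) groups without capturing, so the word is capture 1 in both regexes.
    posterior_regex = Regex("$persona (?:is|was) ([a-zA-Z]+)")

    descriptors = String[]
    for regex in (prior_regex, posterior_regex)
        for m in eachmatch(regex, play)
            word = m[1]
            # Keep words longer than one letter that are not on the ignore list.
            if length(word) > 1 && !(lowercase(word) in ignored)
                push!(descriptors, word)
            end
        end
    end
    return descriptors
end

sample = "I saw the fair Ophelia. Ophelia is gentle. Speak to Ophelia, my lord."
println(collect_descriptors("Ophelia", sample))
```

Returning an array rather than printing makes the result easier to filter, count or test later on. A fixed ignore list is crude, though: it only removes the words we thought of in advance.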


## Who has the most speaking parts?

An interesting question in analysing a play is who has the most speaking parts. There are various clever methods to calculate this, and we will encounter far more sophisticated methods when discussing control flow and arrays. For now, we can make use of a simple regex-based counter that iterates through the lines of the play. We are making use of a feature of the text, namely that every speaking part begins with the name of the character, in capitals, separated by a tab (`\t`) from the rest of the part. 

We can build a `Regex` object to isolate that part, then use a dictionary to store values. A dictionary is an associative array in Julia. We have not yet encountered them in detail, but for now, it suffices to think of them as a map that assigns numbers (of occurrences) to characters.

We'll be reloading the play, because we need it in an array-of-lines format.

In [40]:
function read_as_arr(f::IOStream)
    return readlines(f)[70:200]
end

hamlet_arr = open(read_as_arr, "hamlet.txt")

131-element Array{Union{UTF8String,ASCIIString},1}:
 "\n"                                                      
 "SCENE I\tElsinore. A platform before the castle.\n"      
 "\n"                                                      
 "\t[FRANCISCO at his post. Enter to him BERNARDO]\n"      
 "\n"                                                      
 "BERNARDO\tWho's there?\n"                                
 "\n"                                                      
 "FRANCISCO\tNay, answer me: stand, and unfold yourself.\n"
 "\n"                                                      
 "BERNARDO\tLong live the king!\n"                         
 "\n"                                                      
 "FRANCISCO\tBernardo?\n"                                  
 "\n"                                                      
 ⋮                                                         
 "\n"                                                      
 "HORATIO\tBefore my God, I might not this belie

Great - we've got the line-by-line load. We will now iterate through the lines using a `for` construct, looking for the right formatting. A lot of text mining consists of exploiting various quirks and patterns in language or, in this case, in the way something is formatted. A feature that effectively gives us 'information for free', such as the capitalisation of character names before speaking parts, is sometimes referred to as a _crib_.

We're creating an associative array (a dictionary, known as the `Dict` data type). A `Dict` maps keys to values: each key is associated with exactly one value. We can specify the data types of both. We want a dictionary that maps character names (strings) to their frequencies (integers).

In [69]:
character_frequency = Dict{ASCIIString,Int}()

Dict{ASCIIString,Int64} with 0 entries

This `Dict` is empty for now, but we will soon fill it with values.
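
As a brief illustration of the operations we will need, here is a minimal sketch using a separate, throwaway dictionary. It assumes Julia 1.x, where the concrete string type is `String` rather than `ASCIIString`; the variable name `counts` and the keys are made up.

```julia
# A dictionary mapping names (strings) to counts (integers).
counts = Dict{String,Int}()

counts["HAMLET"] = 1     # insert a new key
counts["HAMLET"] += 1    # update the value of an existing key

println(haskey(counts, "HAMLET"))    # true
println(haskey(counts, "OPHELIA"))   # false
println(get(counts, "OPHELIA", 0))   # 0 - the default; nothing is inserted
println(length(counts))              # 1
```

Note that `get()` with a default value only reads from the dictionary, which is why the dictionary still has a single entry at the end.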

Next, we will nail down a `Regex` object that grabs any run of capital letters followed by a `\t`, which is how character names are formatted. Sometimes, other values, such as scene numbers, may be formatted similarly, and for this reason, we are only interested in results of 5 characters or more - the length of the shortest-named character (namely, `Osric`).

In [70]:
speaker_regex = r"([A-Z]{5,})\t"

r"([A-Z]{5,})\t"
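
To see what this pattern does and does not pick up, we can try it on a few lines. The first three sample lines below follow the format of the text loaded above; the last one is a made-up line with a two-word label, included only to show that the pattern is not anchored to the start of the line.

```julia
speaker_regex = r"([A-Z]{5,})\t"

# A speaking part: the name is captured as group 1.
m = match(speaker_regex, "BERNARDO\tWho's there?\n")
println(m[1])    # BERNARDO

# A scene heading: `SCENE` is followed by a space, and `I` is too short.
println(match(speaker_regex, "SCENE I\tElsinore.\n") === nothing)          # true

# A continuation line starting with a tab: no capital letters precede the tab.
println(match(speaker_regex, "\tAnd I am sick at heart.\n") === nothing)   # true

# Hypothetical two-word label: only the word directly before the tab is captured.
println(match(speaker_regex, "LORD POLONIUS\tA made-up line.\n")[1])       # POLONIUS
```

Whether capturing only the last word of a multi-word label is acceptable depends on how the names are written in the text being analysed.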

We will now iterate over the lines we have loaded above. For each line, we check whether there is a regex match, and if there is one, we check whether the matched name already exists in the dict. If it does, we increment its count by one, and if not, we simply set its count to one.

In [71]:
for line in hamlet_arr
    speaker = match(speaker_regex, line)
    
    # We are only interested in results that are of the type RegexMatch -
    # for lines without a match, e.g. empty lines, match() returns `nothing`,
    # which is of the type `Void`.
    if isa(speaker, RegexMatch)

        # Here, we are checking whether the character frequency dict contains
        # the matched character.
        if haskey(character_frequency, speaker[1])
            # If the character is already in the character frequency dict, we
            # increment its value by one.
            character_frequency[speaker[1]] += 1
        else
            # If the character is not in the dict yet, we set its value to one.
            character_frequency[speaker[1]] = 1
        end
    end
end 

Our character frequency dictionary now has the characters and occurrences neatly assigned:

In [72]:
character_frequency

Dict{ASCIIString,Int64} with 4 entries:
  "FRANCISCO" => 8
  "BERNARDO"  => 16
  "HORATIO"   => 9
  "MARCELLUS" => 11

For some bonus points, we can output this nicely:

In [86]:
# We iterate through the dict we have created:
for character in character_frequency
    # When iterating over a dict, we receive each entry as a (key, value) pair.
    # As such, if we're interested in the key (the character name), we must
    # index it with [1] and if we're interested in the value (the number of
    # speaking parts), we must index it with [2].
    println("$(character[1]) has $(character[2]) speaking parts within the play.")
end

FRANCISCO has 8 speaking parts within the play.
BERNARDO has 16 speaking parts within the play.
HORATIO has 9 speaking parts within the play.
MARCELLUS has 11 speaking parts within the play.
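
The counting loop and the question in this section's title can also be combined into a small function. The following is a sketch under a few assumptions: it targets Julia 1.x, the names `count_speakers` and `sample_lines` are hypothetical, and a handful of sample lines stand in for the play. It uses `get()` with a default of `0` instead of the `haskey` branch, then sorts the entries by count to find the most frequent speaker.

```julia
# Hypothetical helper: count how often each speaker label occurs.
function count_speakers(lines; pattern = r"([A-Z]{5,})\t")
    counts = Dict{String,Int}()
    for line in lines
        m = match(pattern, line)
        # Lines without a speaker label yield `nothing` and are skipped.
        m === nothing && continue
        name = String(m[1])
        # get() returns 0 for names we have not seen yet.
        counts[name] = get(counts, name, 0) + 1
    end
    return counts
end

sample_lines = [
    "BERNARDO\tWho's there?\n",
    "\n",
    "FRANCISCO\tNay, answer me: stand, and unfold yourself.\n",
    "BERNARDO\tLong live the king!\n",
    "\tAnd I am sick at heart.\n",
]

counts = count_speakers(sample_lines)

# Sort the (name => count) pairs by count, largest first.
ranking = sort(collect(counts), by = last, rev = true)
top = first(ranking)
println("$(first(top)) has the most speaking parts: $(last(top))")
```

For the sample lines, this prints `BERNARDO has the most speaking parts: 2`. If two characters are tied for first place, which of them ends up on top is not guaranteed.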


Wouldn't it be cool, finally, if we could somehow turn these into properly capitalised names? Let's write a function that takes a string, puts it all into lower case, then capitalises the first character. There are multiple ways to accomplish this - regex substitution being one. For now, we will use the `uppercase()` and `lowercase()` functions instead, since we haven't used them yet:

In [87]:
function capitalise_correctly(name::ASCIIString)
    first_char = name[1]
    rest = name[2:end]
    return "$(uppercase(first_char))$(lowercase(rest))"
end

capitalise_correctly (generic function with 1 method)

Now, we can use our neat function to rewrite the code we used above to display the speaking part frequencies:

In [90]:
for character in character_frequency
    println("$(capitalise_correctly(character[1])) has $(character[2]) speaking parts within the play.")
end

Francisco has 8 speaking parts within the play.
Bernardo has 16 speaking parts within the play.
Horatio has 9 speaking parts within the play.
Marcellus has 11 speaking parts within the play.
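
Two limitations of `capitalise_correctly` are worth keeping in mind: `name[1]` raises an error for an empty string, and slicing with `name[2:end]` assumes that the first character occupies a single byte, which holds for ASCII names only. A minimal alternative sketch, assuming Julia 1.x where `uppercasefirst` is part of Base, avoids both; the name `capitalise_name` is hypothetical.

```julia
# Hypothetical one-line alternative for Julia 1.x: lower-case everything,
# then upper-case the first character.
capitalise_name(name::AbstractString) = uppercasefirst(lowercase(name))

println(capitalise_name("BERNARDO"))     # Bernardo
println(capitalise_name("osric"))        # Osric
println(isempty(capitalise_name("")))    # true - no error on empty input
```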
