# Analysing Text in terms of Letter Frequency

This notebook illustrates the use of a vector feature representation
of data samples in order to calculate a measure of similarity, which
could be used for classification or clustering of text samples.

In order to run the functions yourself, you will need to download
and unzip the `books.zip` file and also `myhtml.py`. Ensure
that the `books` folder and `myhtml.py` are in the same folder
as this notebook file.

Let us start by writing a function that will, given any string, 
return a dictionary giving the count of the number of occurrences
of each letter of the alphabet in that string. This is done by
the code below. Non-letter characters are ignored, and
upper and lower case characters are counted together (i.e.
the count is case insensitive).

In [None]:
import math

alphabet = "abcdefghijklmnopqrstuvwxyz"

def get_string_letter_counts( string ):
    string = string.lower()
    counts = {}
    for letter in alphabet:
        counts[letter] = 0
    for letter in string:
        if letter in counts:
            counts[letter] += 1  
    return counts

We can test this on an example string:

In [None]:
get_string_letter_counts("Hello, I'm a string.")
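
As an aside, the same counts can be obtained with `collections.Counter` from the standard library. The following is only a sketch of an alternative, not a replacement for the function above; the name `get_string_letter_counts_with_counter` is made up for this illustration.

```python
from collections import Counter
import string

def get_string_letter_counts_with_counter(text):
    """Count the letters a-z in text, ignoring case and non-letter characters."""
    found = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    # A Counter gives 0 for keys it has not seen, so every letter gets an entry.
    return {letter: found[letter] for letter in string.ascii_lowercase}

counts = get_string_letter_counts_with_counter("Hello, I'm a string.")
print(counts["l"], counts["i"], counts["z"])   # 2 2 0
```

The example string contains 14 letters in total, so the 26 counts sum to 14.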

To analyse text files, it will be useful to code up a similar function
where the argument is a file name and the output is a dictionary of
letter counts for the contents of the file:

In [None]:
def get_file_letter_counts( file_name ):
    with open( file_name) as f:
        contents = f.read()
    return get_string_letter_counts(contents)

get_file_letter_counts("books/Gadsby.txt")

It may be useful to visualise the relative frequency of each letter.
One could, of course, use `matplotlib`, but here I illustrate some more
basic Python programming that will display a simple text-based chart
showing the letter frequencies in decreasing order:

In [None]:
def sort_keys_by_value( dictionary ):
    # Keys ordered from the largest value to the smallest
    return sorted( dictionary.keys(),
                   key = lambda x: dictionary[x],
                   reverse = True )

def display_file_letter_counts( filename ):
    percent_per_star = 0.2
    counts = get_file_letter_counts( filename)
    total = sum( counts.values() )             
    for letter in sort_keys_by_value(counts):
        count = counts[letter]
        percent = 100*(count/total)
        num_stars = int( percent / percent_per_star)
        percent_str = "({}%)".format(round(percent,2)) 
        print( letter, "*" * num_stars, count, percent_str )
        
display_file_letter_counts("books/Dracula.txt")

# Letter Frequency Vectors

When comparing different text samples, it is likely that we will want to
compare samples of different lengths. In this case, we will probably want
to characterise the samples in terms of letter frequency rather than
total letter counts.

I define the letter frequency of a text string as a vector with 26 components,
where each of the components is the frequency of that letter given as
a decimal fraction of the total number of letters in the text. We can easily
calculate this from the letter count dictionary that we computed above:

In [None]:
def letter_frequency_vector( string ):
    counts = get_string_letter_counts( string )
    total = sum( counts.values() )
    return [ counts[letter]/total for letter in alphabet ]

And again, for analysing text files, it will be convenient to define a version
of this function which takes a file name argument rather than a string:

In [None]:
def file_letter_frequency_vector( file_name ):
    with open(file_name) as f:
        contents = f.read()
    return letter_frequency_vector( contents )
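
To see why frequencies rather than raw counts suit samples of different lengths, here is a small self-contained sketch. It restates the frequency calculation so that it runs on its own, and adds one assumption that is not in the functions above: it raises a `ValueError` when the text contains no letters, since the division would otherwise fail.

```python
alphabet = "abcdefghijklmnopqrstuvwxyz"

def letter_frequency_vector(string):
    # Count letters case-insensitively, ignoring everything else.
    counts = {letter: 0 for letter in alphabet}
    for ch in string.lower():
        if ch in counts:
            counts[ch] += 1
    total = sum(counts.values())
    if total == 0:
        # Assumption for this sketch: refuse text with no letters at all.
        raise ValueError("text contains no letters")
    return [counts[letter] / total for letter in alphabet]

short = letter_frequency_vector("abba")
long = letter_frequency_vector("ABBA " * 50)
print(short[:3])       # [0.5, 0.5, 0.0]
print(short == long)   # True
```

The two strings differ in length by a factor of 50, yet they give the same vector, and the components of each vector sum to 1.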

#### 'Distance' between letter frequency vectors
A measure of 'distance' between two vectors may be defined using Pythagoras'
theorem.
The frequency of each letter may be considered as a distance in a particular
dimension of a 'space' of letter frequencies. To calculate the 'distance',
$\delta(f_1, f_2)$, between
letter frequencies $f_1$ and $f_2$, we can use the following equation:

$$
\delta(f_1, f_2) = \sqrt{\sum_{l \in \textbf{Alphabet}} (f_1(l) - f_2(l))^2}
$$

Using Python, we can calculate this as follows:

In [None]:
def vector_distance( V1, V2 ): 
    diffs = [ x-y for (x,y) in zip(V1,V2)]
    sum_of_squares = sum( [ x**2 for x in diffs ] )
    return math.sqrt(sum_of_squares)
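
A quick sanity check of the formula is the familiar 3-4-5 right-angled triangle. The sketch below restates the distance function in a compact form so that it runs on its own.

```python
import math

def vector_distance(v1, v2):
    # Square root of the sum of squared component differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(v1, v2)))

d = vector_distance([0, 0], [3, 4])
print(d)   # 5.0
```

Note that `zip` stops at the end of the shorter sequence, so vectors of unequal length are silently truncated rather than rejected. In Python 3.8 and later, `math.dist` computes the same Euclidean distance and raises an error if the two points have different dimensions.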

## Systematic Comparison of Texts

Suppose we have a set of samples and can compute some measure of 'distance' between
each pair of samples. A simple and systematic way to investigate this distance measure
is to compute the distance between every pair of samples and record and/or display
this in the form of a table.

Since Python allows us to have functions as the values of variables, and hence to
pass functions as parameters to functions, we can address this requirement in
a very general way. The function defined below can
be given any binary function and a list of input values. From these it computes
a matrix (a list of lists) giving the value of that function for every pair
of inputs taken from the input list:

In [None]:
def matrix_of_binary_function( function, inputs ):
       return [  [ function(i1, i2) for i2 in inputs]
                 for i1 in inputs
              ]

The use of `matrix_of_binary_function` may be illustrated as follows:

In [None]:
def f(x,y):
    return x + y**2

matrix_of_binary_function( f, [0,1,2,3] )
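
For reference, the following self-contained sketch repeats the example and prints the matrix one row at a time. Each row corresponds to the first argument and each column to the second.

```python
def matrix_of_binary_function(function, inputs):
    return [[function(i1, i2) for i2 in inputs] for i1 in inputs]

def f(x, y):
    return x + y**2

matrix = matrix_of_binary_function(f, [0, 1, 2, 3])
for row in matrix:
    print(row)
# [0, 1, 4, 9]
# [1, 2, 5, 10]
# [2, 3, 6, 11]
# [3, 4, 7, 12]
```

Because `f` is not symmetric, neither is the matrix: `matrix[1][2]` is `f(1, 2) = 5`, whereas `matrix[2][1]` is `f(2, 1) = 3`.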

Building on the above, we can augment the matrix obtained by adding the input values
as indices for the rows and columns of the matrix, so we get a more readable table
view of the matrix. The code also requires a label parameter which will go in the top left
of the table to describe the table content.

In [None]:
def table_of_binary_function( label, function, inputs ):
    matrix = matrix_of_binary_function( function, inputs )
    table = [ [label] + inputs ] # header row
    for i in range(len(inputs)):
        table.append( [inputs[i]] + matrix[i] )
    return table
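
A small example may make the layout of the table clearer. The sketch below is a compact, self-contained variant of the function above, used here with a multiplication table. It differs in one respect, which is a choice made for this sketch only: it copies `inputs` into a list first, so that a tuple can be passed as well.

```python
def table_of_binary_function(label, function, inputs):
    inputs = list(inputs)          # sketch-only choice: accept any sequence
    header = [label] + inputs      # label in the top-left cell, then column names
    rows = [[i1] + [function(i1, i2) for i2 in inputs] for i1 in inputs]
    return [header] + rows

table = table_of_binary_function("x*y", lambda x, y: x * y, (1, 2, 3))
for row in table:
    print(row)
# ['x*y', 1, 2, 3]
# [1, 1, 2, 3]
# [2, 2, 4, 6]
# [3, 3, 6, 9]
```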

### Now we make a letter frequency vector distance table for some books
If we have a selection of text files (e.g. books from 
[Project Gutenberg](https://www.gutenberg.org/)), we can now calculate
the letter frequency vectors for each of these, with code such as the following:

In [None]:
BOOK_LIST = [ "Romeo+Juliet",
              "Midsummer-Nights-Dream",
              "Wuthering-Heights",
              "Jane-Eyre",
              "Dracula",
              "Jewel-of-Seven-Stars",
              "Gadsby"
            ]

## Calculate the Letter Frequency Vectors of all the books
## and store in a dictionary (key is book name)
LFVs = {}
for b in BOOK_LIST:
    LFVs[b] = file_letter_frequency_vector( "books/" + b + ".txt")

Given any two book names in `BOOK_LIST` we can now calculate the 'distance' between them
by the following function:

In [None]:
def book_distance(b1,b2):
    return vector_distance( LFVs[b1], LFVs[b2] )

Or if we want to round the distance measure to, say, 4 decimal places (so
it is more easily displayed and read), we could define:

In [None]:
def rounded_book_distance(b1,b2):
    return round( vector_distance( LFVs[b1], LFVs[b2] ), 4 ) 

So, now that we have defined a suitable distance function between book samples,
we can use `table_of_binary_function` to get a table giving the distances for
all pairs:

In [None]:
DIST_TABLE = table_of_binary_function( "LFV distance", 
                                       rounded_book_distance, 
                                       BOOK_LIST )

### Displaying a List of Lists using Pandas DataFrame
Using pandas, we can convert the table created above from a list of lists into a DataFrame, which is easy to display.
Notice that in the resulting DataFrame the row and column names that were in the list of lists representation are not treated as index labels
but as data within the DataFrame. We could reset the column names and index labels to get a nicer table. Also, the name of the function now appears in the (0,0) position of the DataFrame and there is no obvious way to attach this to the DataFrame object without it being considered as data. Hence, even going from the list of lists format to a DataFrame presents some small problems in preserving the content of the stored information.

In [None]:
from pandas import DataFrame
df = DataFrame(DIST_TABLE)
display(df)

#### Displaying a Table using HTML
Another way to display a table nicely is by encoding it into HTML format.
This does have some advantages, in that we have more control over the format of the table. But it does require somewhat complex encoding. Hence, I have created a function `display_datalist_as_html_table` in my own module `myhtml`, which I import in order to produce an HTML version of the table. Using that, we can display the table as follows:

In [None]:
import myhtml
myhtml.display_datalist_as_html_table( DIST_TABLE )
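
To give an idea of the kind of encoding involved, here is a minimal sketch that turns a list of lists into an HTML table string. It is not the code in `myhtml`; the function name `datalist_to_html_table` and the choice to render the first row and first column as header cells are assumptions made for this illustration.

```python
from html import escape

def datalist_to_html_table(datalist):
    """Hypothetical helper: encode a list of lists as an HTML table string."""
    html_rows = []
    for r, row in enumerate(datalist):
        cells = []
        for c, value in enumerate(row):
            # First row and first column hold labels, so use header cells.
            tag = "th" if r == 0 or c == 0 else "td"
            cells.append("<{0}>{1}</{0}>".format(tag, escape(str(value))))
        html_rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>\n" + "\n".join(html_rows) + "\n</table>"

html_table = datalist_to_html_table([["d", "A", "B"],
                                     ["A", 0.0, 0.5],
                                     ["B", 0.5, 0.0]])
print(html_table)
```

In a notebook, a string like this can be rendered by passing it to `IPython.display.HTML`. The call to `escape` ensures that characters such as `<` or `&` in a label are shown literally rather than interpreted as markup.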

This table shows, of course, that the distance between each book and itself is 0, and that the distance measure is symmetrical. We also see that the closest two books are _Romeo and Juliet_ and _A Midsummer Night's Dream_, which are both plays by William Shakespeare. The next closest pairs are: _Dracula_ and _Jewel of Seven Stars_, both written by Bram Stoker; and
_Wuthering Heights_ and _Jane Eyre_, written by sisters Emily and Charlotte Brontë.
We also see that the distance from the book _Gadsby_ to every other book is greater than the distance between any other pair. Why is this?

## Using letter frequencies to categorise text

I conclude with a small experiment to test whether we may be able to identify the
source of some text, just by comparing the letter frequency of small samples of text.
The following code defines a function that makes a random selection of lines from
a file. Following that, a recognition test is defined. It picks a random book from
`BOOK_LIST`, selects a number of random lines from that book (20 by default) and calculates the letter frequency
vector for those lines. It then calculates the distance of that letter frequency vector
from the vectors that have been previously calculated from each of the books.
Based on these distances it selects the closest book.

Following that, 100 tests are carried out, and the percentage of those for which the
correct book source was found is printed.

In [None]:
import random

def random_lines_from_file( filename, n ):
    with open(filename) as f:
        lines = f.readlines()
    return [ random.choice(lines) for _ in range(n) ]
        
def random_recognition_test(num_sample_lines = 20):   
    random_book = random.choice(BOOK_LIST)
    print("Testing", num_sample_lines, "random lines from", random_book)
    lines = random_lines_from_file( "books/" + 
                                    random_book + ".txt",
                                    num_sample_lines )
    text = "\n".join(lines)
    line_lfv = letter_frequency_vector(text)
    distances = [ (vector_distance(line_lfv, LFVs[book]), book) 
                  for book in BOOK_LIST ]
    distances.sort()
    nearest_book = distances[0][1]
    print("Nearest book:", nearest_book )
    if random_book == nearest_book:
        print("Correct")
        return True
    else:
        print("Wrong")
        return False

correct = 0
for _ in range(100):
    if random_recognition_test():
        correct += 1
        
print( "Percentage of correct tests:", correct)
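
The nearest-book step of the test can also be looked at in isolation. The sketch below uses made-up two-component vectors in place of real letter frequency vectors; the function name `nearest_label` and the toy data are hypothetical.

```python
import math

def vector_distance(v1, v2):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(v1, v2)))

def nearest_label(sample_vector, reference_vectors):
    """Return the key of the reference vector closest to sample_vector."""
    return min(reference_vectors,
               key=lambda name: vector_distance(sample_vector,
                                                reference_vectors[name]))

# Toy reference vectors standing in for the per-book vectors.
toy_vectors = {"book_a": [0.6, 0.4], "book_b": [0.2, 0.8]}
result = nearest_label([0.5, 0.5], toy_vectors)
print(result)   # book_a
```

Using `min` with a key function avoids sorting the whole list when only the closest item is needed. The two approaches differ slightly on exact ties: sorting `(distance, book)` pairs breaks a tie by book name, whereas `min` returns the first of the tied keys in dictionary order.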

### Evaluating the Test
Having conducted an analysis experiment such as the one carried out by the previous code cell, we need to carefully consider both the results and the nature of the test that has been performed. We should try to form a clear view not only of what the results may tell us, but also of what are the limitations of the experiment. Could it be unreliable, misleading or incomplete? We will then be in a position to draw reasonable conclusions and also to envisage further tests that might yield more interesting or useful findings.

To sum up, you should consider the following:
* What does the test recognition experiment tell us?
* Might the test be unreliable or misleading?
* What are the limitations of the test?
* What similar tests might be more revealing or useful?
* And, why is text from the book _Gadsby_ so easy to identify?