### Navigation Reminder

- **Grey cells** are **code cells**. Click inside them and type to edit.
- **Run**  code cells by pressing $ \triangleright $  in the toolbar above, or press ``` shift + enter```.
-  **Stop** a running process by clicking &#9634; in the toolbar above.
- You can **add new cells** by clicking to the left of a cell and pressing ```A``` (for above), or ```B``` (for below). 
- **Delete cells** by pressing ```X```.
- Run all code cells that import objects (such as the one below) to ensure that you can follow exercises and examples.
- Feel free to edit and experiment - you will not corrupt the original files.

# 06B Putting it all together. Simple Text Analysis (Lessons 1-6)

In lessons 1-5, we learned about many of the building blocks that make Python such a powerful language, including basic data types (strings, integers, floats, lists and dictionaries), conditional structures, and loops. 

In this notebook, we will run through a long-form exercise that allows us to put all this knowledge together. 

We will be working with a corpus of sonnets by Shakespeare, counting the frequency of terms used in each individual poem and then creating a simple algorithm that uses this information to measure similarity between poems. In the future, you can import such algorithms from modules created by others. Creating one yourself, however, is a useful way to reinforce what you learned in the past lessons and to practice working through a small project. It also helps you avoid the habit of 'outsourcing' algorithms without thinking about your needs and objectives, or about what these algorithms are actually doing.

Several Python modules provide advanced tools for text analysis. In this lesson, however, we will use the basic building blocks of the Python language to construct simple tools for term counting and similarity analysis. 

Please note that there are multiple paths to the objectives we outline below. We do provide a 'solution' notebook with our approach towards this project as guidance if you get stuck, but if you find alternative solutions, it is more useful to think critically about your code and whether it achieves its goals than to try to make it conform to the sample solution.

---

**Lesson Objectives**
- Practice:
    - Using basic data structures (strings and numbers) and their operators
    - Using collections (lists and dictionaries), and discerning when to select one or the other
    - Creating loops
    - Using conditional statements
- Develop good habits for projects, including: 
    - Taking time to understand the source data and its particularities before committing to an approach
    - Thinking critically about algorithms and problem-solving, instead of immediately delegating to solutions developed by others
---

**1.** The file we will be using is saved in 'Other_files/Shakespeare_sonnets.txt'. Using the 'with xxx as file' notation, create a file handle with the open() function, then read the file as one block of text and assign it to the variable text. The with block ensures that the file is closed once we have finished reading from it. 

In [4]:
with open('Other_files/Shakespeare_sonnets.txt','r',encoding='utf-8') as file:
    text = file.read()

In [None]:
# Alternatives: poems by Wilde, Walrus poem from Alice in Wonderland. 
# Also have the lives of Vasari and we could use John Donne's poems from Gutenberg (this could be
# interesting if we compared the similarity of poems by JD and WS, same time period but different authors)

**2.** Take a first look at the text file by printing the first 1600 characters of the text file.

**Hint:** the object is one long string, so you can index into it as you usually would with a string.

In [5]:
print(text[:1600])

From fairest creatures we desire increase,
That thereby beauty’s rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou contracted to thine own bright eyes,
Feed’st thy light’s flame with self-substantial fuel,
Making a famine where abundance lies,
Thy self thy foe, to thy sweet self too cruel:
Thou that art now the world’s fresh ornament,
And only herald to the gaudy spring,
Within thine own bud buriest thy content,
And, tender churl, mak’st waste in niggarding:
Pity the world, or else this glutton be,
To eat the world’s due, by the grave and thee.


When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty’s field,
Thy youth’s proud livery so gazed on now,
Will be a tattered weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv’d th

**3.** It is also helpful to display the text without formatting, so we can see invisible characters such as those used for whitespace. Index into the whole text as before, but without using the print() function.

Start thinking about the **characteristics of the file**. How are the sonnets separated and structured?  How might we use whitespace to divide it into individual poems, and these poems into terms (also known as 'tokens')? What actions might we have to perform in order to extract these tokens for counting?

In [6]:
text[0:1600]

'From fairest creatures we desire increase,\nThat thereby beauty’s rose might never die,\nBut as the riper should by time decease,\nHis tender heir might bear his memory:\nBut thou contracted to thine own bright eyes,\nFeed’st thy light’s flame with self-substantial fuel,\nMaking a famine where abundance lies,\nThy self thy foe, to thy sweet self too cruel:\nThou that art now the world’s fresh ornament,\nAnd only herald to the gaudy spring,\nWithin thine own bud buriest thy content,\nAnd, tender churl, mak’st waste in niggarding:\nPity the world, or else this glutton be,\nTo eat the world’s due, by the grave and thee.\n\n\nWhen forty winters shall besiege thy brow,\nAnd dig deep trenches in thy beauty’s field,\nThy youth’s proud livery so gazed on now,\nWill be a tattered weed of small worth held:\nThen being asked, where all thy beauty lies,\nWhere all the treasure of thy lusty days;\nTo say, within thine own deep sunken eyes,\nWere an all-eating shame, and thriftless praise.\nHow muc

Here are some characteristics of the text that will affect our analysis:

- Each sonnet (14 lines) is separated by three instances of the newline character \n (one to end the final line of the poem, and two to generate blank lines after).
- Words are separated by spaces or the newline character \n (the last word of a line and the first of the next).
- Some words begin with a capital letter, others are fully lowercase.
- Some words have punctuation around them.
    
We can use the first two characteristics to split the block of text into poems, and then tokens. The other characteristics will have to be edited out to get a clean count of the tokens (to ensure, for instance, that 'Now' and 'now' or 'now.' and 'now' are counted as the same token).

# 1. Text Pre-Processing

How we choose to clean a text depends on our project's goal. In this case, our desired output is a list of terms in each of our poems, which we will then count and compare in our analysis.

To end up with a set of clean tokens, we will have to pre-process the text in several phases:
1. Strip the text of punctuation and cases (capitalization).
1. Split the text into poems using triple newline characters (\n\n\n).
1. Split the text into tokens using whitespace.

Note that step 1 could actually be undertaken at any point: we choose to do it first for the sake of efficiency, applying the changes to the whole text and avoiding creating loops that would otherwise have to re-iterate the step for each poem or for each token. 
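The three phases above can be sketched on a tiny stand-in text before applying them to the sonnets. The sample string and the variable names below (`sample`, `sample_clean`, `sample_poems`, `sample_tokens`) are made up for illustration and are not part of the lesson's files.

```python
# A tiny stand-in for the sonnet file: two "poems" separated by '\n\n\n'.
sample = "Now is the time,\nNow.\n\n\nThe time is now!"

# Phase 1: lowercase the whole text, then strip punctuation from it.
sample_clean = sample.lower()
for mark in ",.!":
    sample_clean = sample_clean.replace(mark, "")

# Phase 2: split the text into poems on the triple newline.
sample_poems = sample_clean.split("\n\n\n")

# Phase 3: split each poem into tokens on whitespace.
sample_tokens = []
for poem in sample_poems:
    sample_tokens.append(poem.split())

print(sample_tokens)
```

Because the cleaning happens once on the whole text, 'Now', 'Now.' and 'now!' all end up as the same token, 'now'.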

## 1.a. Removing Punctuation and Cases

In this case, we want to very simply count term frequency, so we make the decision to remove capitals and any punctuation outside of tokens. This way, words will be counted as the same term regardless of how they were capitalized or if they had any adjacent punctuation marks. 

In the code cell below, we are creating a string object that contains common punctuation marks. 

In [10]:
punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~’‘'

Recall that strings can be thought of as a series of characters, and a for-loop will iterate through each character in a string if used in the for-statement.

1. Using variable assignment and a string method, create a variable called 'text_clean' which is an all-lowercase version of the text variable.
2. Create a loop that iterates through the punctuation marks above and uses another string method to replace each mark in text_clean with no character (this could be written as a string with nothing in between the quotes, or ''). Remember that string methods do not act in place, so you will have to reassign the text_clean variable within the loop.

In [11]:
text_clean = text.lower()
for mark in punctuation:
    text_clean = text_clean.replace(mark,'')
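As an aside, the same cleaning can probably be done without a loop by using the built-in `str.maketrans` and `str.translate` methods. This is only an alternative sketch, not the approach the lesson asks for; the names `punctuation_marks`, `removal_table`, `line` and `cleaned` are chosen here for illustration.

```python
# Same punctuation marks as in the lesson, under a different name.
punctuation_marks = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~’‘'

# Build a translation table that maps every punctuation mark to "delete".
removal_table = str.maketrans('', '', punctuation_marks)

# One line from the first sonnet, cleaned in a single pass.
line = "Feed’st thy light’s flame with self-substantial fuel,"
cleaned = line.lower().translate(removal_table)
print(cleaned)
```

The loop version is easier to reason about with the tools from lessons 1-5, which is why it is the one used in this notebook.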

In [12]:
print(text_clean[0:1600])

from fairest creatures we desire increase
that thereby beautys rose might never die
but as the riper should by time decease
his tender heir might bear his memory
but thou contracted to thine own bright eyes
feedst thy lights flame with selfsubstantial fuel
making a famine where abundance lies
thy self thy foe to thy sweet self too cruel
thou that art now the worlds fresh ornament
and only herald to the gaudy spring
within thine own bud buriest thy content
and tender churl makst waste in niggarding
pity the world or else this glutton be
to eat the worlds due by the grave and thee


when forty winters shall besiege thy brow
and dig deep trenches in thy beautys field
thy youths proud livery so gazed on now
will be a tattered weed of small worth held
then being asked where all thy beauty lies
where all the treasure of thy lusty days
to say within thine own deep sunken eyes
were an alleating shame and thriftless praise
how much more praise deservd thy beautys use
if thou couldst answer this

# 2. Splitting the Block of Text into Poems

Previously, we observed that the poems in the text are separated by three newline characters (\n). Use this information and a [string method](https://docs.python.org/3/library/stdtypes.html#string-methods) to split the clean text into a list of poems, assigned to a variable called 'poems'.

In [13]:
poems = text_clean.split('\n\n\n')

How many poems do we have? Check the length of the list, which should be 154.

In [14]:
len(poems)

154

Finally, retrieve and print an item from the list, to examine its contents:

In [15]:
print(poems[0])

from fairest creatures we desire increase
that thereby beautys rose might never die
but as the riper should by time decease
his tender heir might bear his memory
but thou contracted to thine own bright eyes
feedst thy lights flame with selfsubstantial fuel
making a famine where abundance lies
thy self thy foe to thy sweet self too cruel
thou that art now the worlds fresh ornament
and only herald to the gaudy spring
within thine own bud buriest thy content
and tender churl makst waste in niggarding
pity the world or else this glutton be
to eat the worlds due by the grave and thee


# 3. Tokenization: Splitting each Poem into a List of Terms

Thus far, we have transformed one long text file into a collection of poem objects. To count the frequency of terms in each poem, we have to split each poem into a collection of terms or tokens. 

> Tokenization: The process of breaking a text document apart into individual units of meaning, most frequently words.

One issue we noticed before is that some words are separated by spaces, but others might be separated by newline characters (\n). Since we don't care about poem structure in this exercise, the simplest solution is to replace line break characters with spaces. This will allow us to use the string.split() method afterwards using spaces to divide the poems up into tokens.

Note that we didn't do this in the first text-preprocessing step, where we argued we should make these changes in a way that minimized the number of iterations necessary in our code. The reason is simple: up to this point, we needed line breaks to split the text into individual poems. Now, the newline characters have served their purpose, and we can remove them before splitting further.

In the code cell below, create an empty list called poems_clean. Then build a loop that iterates through each item in our poems list, replaces all \n characters with spaces, and appends the modified poem to our poems_clean list.

In [17]:
poems_clean = []
for poem in poems:
    poems_clean.append(poem.replace('\n',' '))

Next, create an empty list called poems_tokens. Then, construct a loop that uses a string method to split each cleaned poem by spaces, and appends the results to the poems_tokens list.

In [18]:
poems_tokens = []
for poem in poems_clean:
    poems_tokens.append(poem.split(' '))
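One detail worth knowing about `split`: called with `' '` it cuts at every single space, so a doubled space leaves an empty string in the result, whereas called with no argument it splits on any run of whitespace (spaces and newlines alike). The short sketch below uses a made-up string, `stanza`, to show the difference.

```python
# A made-up fragment containing a newline and an accidental double space.
stanza = "thy self thy foe\nto thy  sweet self"

# The two-step approach used in this lesson: replace newlines, then split on ' '.
by_space = stanza.replace("\n", " ").split(" ")

# split() with no argument treats any run of whitespace as one separator.
by_whitespace = stanza.split()

print(by_space)
print(by_whitespace)
```

If your token lists ever contain empty strings, a doubled space in the source text is a likely cause.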

We started with a list of poems, and have now split each of those poems into a list of terms. It is useful to bear in mind that poems_tokens is therefore a list of lists. Each item in poems_tokens, indexable by position, is a list of the terms included in a particular poem.  

The poems have no title or other useful ID, so accessing them by position is sufficient for our purposes. That position was preserved when we transformed each individual poem into a list of tokens. In other cases, where we had a unique identifier and did not care about order, a dictionary would have been a more appropriate container.

Let's examine one item from our new poems_tokens list. What it should contain is a list of each of the terms in our poem. 

In [20]:
print(poems_tokens[0])

['from', 'fairest', 'creatures', 'we', 'desire', 'increase', 'that', 'thereby', 'beautys', 'rose', 'might', 'never', 'die', 'but', 'as', 'the', 'riper', 'should', 'by', 'time', 'decease', 'his', 'tender', 'heir', 'might', 'bear', 'his', 'memory', 'but', 'thou', 'contracted', 'to', 'thine', 'own', 'bright', 'eyes', 'feedst', 'thy', 'lights', 'flame', 'with', 'selfsubstantial', 'fuel', 'making', 'a', 'famine', 'where', 'abundance', 'lies', 'thy', 'self', 'thy', 'foe', 'to', 'thy', 'sweet', 'self', 'too', 'cruel', 'thou', 'that', 'art', 'now', 'the', 'worlds', 'fresh', 'ornament', 'and', 'only', 'herald', 'to', 'the', 'gaudy', 'spring', 'within', 'thine', 'own', 'bud', 'buriest', 'thy', 'content', 'and', 'tender', 'churl', 'makst', 'waste', 'in', 'niggarding', 'pity', 'the', 'world', 'or', 'else', 'this', 'glutton', 'be', 'to', 'eat', 'the', 'worlds', 'due', 'by', 'the', 'grave', 'and', 'thee']


At this stage, we have the raw information we need in a format appropriate for our analysis. 

# 4. Term Frequency Analysis. Counting Terms

Next, we will count the terms in each poem. We should think about the nature of our problem. 

We have a data structure which is a list in which each item is a list of tokens. For each list of tokens, we will have to identify a term, and then count the number of times it appears in the list. Then we should store this information in some way. 

In other words, we will have a number of **unique terms** and an associated count **value** for each. This should ring a bell about the best data structure. What data structure would you use?

Ideally, we would want to be creating dictionaries of word counts. In each dictionary, each term would be a key, and its count would be the value. There would be one dictionary per poem. 

## Solving the Problem: Designing our Algorithm

What we have to design now is a loop that can generate the count for each word. We can work it out for one poem, before creating a more general loop that works through the 154 sonnets.

## A loop for one poem


First, assign the first poem in our list to the variable 'poem1'. Remember that in a list, the first position is 0. We will use this poem to think about how to work through an individual case before generalizing and applying the method to other poems.

In [24]:
poem1 =poems_tokens[0]

In [25]:
print(poem1)

['from', 'fairest', 'creatures', 'we', 'desire', 'increase', 'that', 'thereby', 'beautys', 'rose', 'might', 'never', 'die', 'but', 'as', 'the', 'riper', 'should', 'by', 'time', 'decease', 'his', 'tender', 'heir', 'might', 'bear', 'his', 'memory', 'but', 'thou', 'contracted', 'to', 'thine', 'own', 'bright', 'eyes', 'feedst', 'thy', 'lights', 'flame', 'with', 'selfsubstantial', 'fuel', 'making', 'a', 'famine', 'where', 'abundance', 'lies', 'thy', 'self', 'thy', 'foe', 'to', 'thy', 'sweet', 'self', 'too', 'cruel', 'thou', 'that', 'art', 'now', 'the', 'worlds', 'fresh', 'ornament', 'and', 'only', 'herald', 'to', 'the', 'gaudy', 'spring', 'within', 'thine', 'own', 'bud', 'buriest', 'thy', 'content', 'and', 'tender', 'churl', 'makst', 'waste', 'in', 'niggarding', 'pity', 'the', 'world', 'or', 'else', 'this', 'glutton', 'be', 'to', 'eat', 'the', 'worlds', 'due', 'by', 'the', 'grave', 'and', 'thee']


Let's also create an empty dictionary titled 'term_counts' to store our terms. 

In [26]:
term_counts={}

Within our problem, we have to think of two scenarios:

For each word in our poem,
1. If we are encountering a new word, that is, if the word is not in our data structure, create an item where the key is the token, and the value is 1. 
2. If the word already has an item in our data structure, retrieve the value and update it by adding 1 to the count.

Remember how to access, create and modify information in a dictionary. To retrieve a value from a dictionary, you call the dictionary name and the key in brackets. This method also works to create a new dictionary item, or to update an item's value. Also remember that you can iterate through the keys in a dictionary as you would iterate through the items in a list.
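As a quick refresher, the dictionary operations described above look roughly like this. The dictionary `inventory` and its keys are hypothetical and unrelated to the sonnets.

```python
inventory = {}                                # start with an empty dictionary

inventory['quill'] = 1                        # create a new item
inventory['ink'] = 1                          # create another item
inventory['quill'] = inventory['quill'] + 1   # update an existing value

print(inventory['quill'])                     # retrieve a value by its key

# Iterating through a dictionary iterates through its keys.
for item in inventory:
    print(item, inventory[item])

# 'in' checks whether a key exists.
print('paper' in inventory)
```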

Now, construct a loop that will populate the term_counts dictionary with unique terms and the number of times each one appears in the poem.  Within the dictionary, each item will be comprised of a key (the token) and a value (the count).

In [27]:
for token in poem1:
    if token not in term_counts:
        term_counts[token]=1
    else:
        term_counts[token]=term_counts[token]+1    
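A more compact variant of the same counting loop uses the dictionary method `get`, which returns a default value when a key is missing. This is an optional alternative rather than the expected solution; `sample_tokens` and `sample_counts` are illustrative names with made-up contents.

```python
# A short, made-up list of tokens.
sample_tokens = ['thy', 'self', 'thy', 'foe', 'to', 'thy', 'sweet', 'self']

sample_counts = {}
for token in sample_tokens:
    # get(token, 0) returns the current count, or 0 if the token is new.
    sample_counts[token] = sample_counts.get(token, 0) + 1

print(sample_counts)
```

Both versions handle the same two scenarios (new word, word already seen); the if/else form simply makes them explicit.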

In [28]:
term_counts

{'from': 1,
 'fairest': 1,
 'creatures': 1,
 'we': 1,
 'desire': 1,
 'increase': 1,
 'that': 2,
 'thereby': 1,
 'beautys': 1,
 'rose': 1,
 'might': 2,
 'never': 1,
 'die': 1,
 'but': 2,
 'as': 1,
 'the': 6,
 'riper': 1,
 'should': 1,
 'by': 2,
 'time': 1,
 'decease': 1,
 'his': 2,
 'tender': 2,
 'heir': 1,
 'bear': 1,
 'memory': 1,
 'thou': 2,
 'contracted': 1,
 'to': 4,
 'thine': 2,
 'own': 2,
 'bright': 1,
 'eyes': 1,
 'feedst': 1,
 'thy': 5,
 'lights': 1,
 'flame': 1,
 'with': 1,
 'selfsubstantial': 1,
 'fuel': 1,
 'making': 1,
 'a': 1,
 'famine': 1,
 'where': 1,
 'abundance': 1,
 'lies': 1,
 'self': 2,
 'foe': 1,
 'sweet': 1,
 'too': 1,
 'cruel': 1,
 'art': 1,
 'now': 1,
 'worlds': 2,
 'fresh': 1,
 'ornament': 1,
 'and': 3,
 'only': 1,
 'herald': 1,
 'gaudy': 1,
 'spring': 1,
 'within': 1,
 'bud': 1,
 'buriest': 1,
 'content': 1,
 'churl': 1,
 'makst': 1,
 'waste': 1,
 'in': 1,
 'niggarding': 1,
 'pity': 1,
 'world': 1,
 'or': 1,
 'else': 1,
 'this': 1,
 'glutton': 1,
 'be': 1,
 'e

# Counting terms: Applying the loop to all our poems

The loop above works well for a poem, but we have 154 poems to analyze. 

That means we have to create 154 dictionaries, and store them in a way that makes them easy to retrieve. Once again, because position is all we are using to identify each poem, a list (which maintains order and is indexable by position) is the best data structure for this case. 

Create an empty list called poems_counts.  Then, construct a nested for-loop. First, the loop should iterate through each item in our poems_tokens list.  Within the loop, for each list of poem tokens, the loop should generate a dictionary of word counts. In other words, the loop you created above can be modified and nested into the new loop to work for all our poems.

In [29]:
poems_counts = []

for poem in poems_tokens:
    term_counts={}
    for token in poem:
        if token not in term_counts:
            term_counts[token]=1
        else:
            term_counts[token]=term_counts[token]+1
    poems_counts.append(term_counts)

In [31]:
print(poems_counts)



# Visualization 1: Most common terms in Shakespearean Sonnets

Before we proceed with comparing similarity of poems, we can already do some cool 'distant reading' of Shakespearean sonnets. For instance, we could visualize the most common terms in the corpus.
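One possible way to get there, sketched with a small made-up stand-in for poems_counts, is to add the per-poem dictionaries together into one corpus-wide dictionary and then sort its items by count. The names `sample_poems_counts`, `corpus_counts` and `top_terms` are assumptions for this sketch, and the `sorted` call with a `key` function goes slightly beyond what lessons 1-5 cover.

```python
# Stand-in for poems_counts: one dictionary of term counts per poem.
sample_poems_counts = [
    {'thy': 5, 'the': 6, 'and': 3},
    {'thy': 4, 'and': 2, 'beauty': 2},
]

# Add up the counts of each term across all poems.
corpus_counts = {}
for counts in sample_poems_counts:
    for term, count in counts.items():
        if term not in corpus_counts:
            corpus_counts[term] = count
        else:
            corpus_counts[term] = corpus_counts[term] + count

# Sort the (term, count) pairs by count, largest first.
top_terms = sorted(corpus_counts.items(), key=lambda pair: pair[1], reverse=True)

for term, count in top_terms[:3]:
    print(term, count)
```

The resulting list of pairs could then be passed to a plotting library to draw a bar chart of the most common terms.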

# Creating a similarity score between two sonnets

We will now create a simple algorithm for comparing how similar two sonnets are to one another. Using the term counts, we will compare the number of times a token appears in both sonnets.

With our approach, two poems that are exactly the same should have a value of 100.
Terms that appear in both poems the same number of times are taken to be equal, so do not discount from the value.
Terms that appear in one poem but not the other contribute to dissimilarity.

In [33]:
poem1 = poems_counts[0]

In [None]:
poem1

In [34]:
poem2 = poems_counts[1]

In [None]:
#poem1.keys

In [35]:
tokens = set(poem1)|set(poem2) # create a set of all tokens in both poems
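Converting a dictionary to a set keeps only its keys, and the `|` operator then returns the union of two sets. The sketch below, using two small made-up dictionaries (`terms_a` and `terms_b`), also shows the related `&` (intersection) and `-` (difference) operators.

```python
# Two made-up term-count dictionaries.
terms_a = {'thy': 3, 'self': 1}
terms_b = {'thy': 1, 'foe': 2}

all_terms = set(terms_a) | set(terms_b)   # terms in either poem
shared = set(terms_a) & set(terms_b)      # terms in both poems
only_in_a = set(terms_a) - set(terms_b)   # terms in the first poem only

# Sets are unordered, so sort them to get a stable display.
print(sorted(all_terms))
print(sorted(shared))
print(sorted(only_in_a))
```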

In [None]:
#tokens

In [None]:
# similarity = 1
# dif = 0
# for token in tokens:
 #   try:
  #      a=poem1[token]/sum(poem1.values())
#    except: 
 #       a=0
 #   try:
  #      b= poem2[token]/sum(poem2.values())
#    except:
 #       b=0
#    dis = abs(a-b)
 #   dif = dif + dis

In [None]:
#similarity

In [None]:
# dif

In [None]:
# poem1.values

# Euclidean Distance Loop (2 Poems)

In [36]:
termsum=0
tokens = set(poem1)|set(poem2) # create a set of all tokens in both poems
for token in tokens:
    if token in poem1:
        a=poem1[token]
    else:
        a=0
    if token in poem2:
        b= poem2[token]
    else:
        b=0
    term = (a-b)**2
    termsum= termsum+term
euc_similarity = termsum**0.5

In [37]:
euc_similarity

15.066519173319364
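To see what this number means, it may help to work through a case small enough to check by hand. For every term, the loop takes the difference between the two counts, squares it, adds the squares together, and finally takes the square root. A result of 0 means the two poems have identical counts, and larger values mean they are further apart. The dictionaries `counts_a` and `counts_b` below are made up for this sketch, which uses `get` with a default of 0 in place of the if/else blocks.

```python
# Two made-up term-count dictionaries.
counts_a = {'thy': 3, 'self': 1}
counts_b = {'thy': 1, 'foe': 2}

squared_sum = 0
for token in set(counts_a) | set(counts_b):
    a = counts_a.get(token, 0)   # 0 if the token is missing from counts_a
    b = counts_b.get(token, 0)   # 0 if the token is missing from counts_b
    squared_sum = squared_sum + (a - b) ** 2

# thy: (3-1)**2 = 4, self: (1-0)**2 = 1, foe: (0-2)**2 = 4, total 9
distance = squared_sum ** 0.5
print(distance)
```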

# Cosine Similarity Loop (2 poems)

In [38]:
termsum=0
for token in tokens:
    
    if token in poem1:
        a=poem1[token]
    else:
        a=0
    if token in poem2:
        b= poem2[token]
    else:
        b=0
    term = a*b
    termsum= termsum+term
p1sum =0
for key, value in poem1.items():
    sq = value**2
    p1sum= p1sum+sq
p1 = p1sum**0.5
p2sum=0
for key, value in poem2.items():
    sq = value**2
    p2sum= p2sum+sq
p2 = p2sum**0.5

cos_similarity = termsum/(p1*p2)

In [39]:
cos_similarity

0.4481677087948826
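Cosine similarity works the other way around from Euclidean distance: with term counts, which are never negative, it ranges from 0 (no terms in common) to 1 (counts in exactly the same proportions). It multiplies the two counts for each term, adds the products together, and divides by the product of the two poems' vector lengths. The small made-up dictionaries below (`counts_a`, `counts_b`) give a result that can be checked by hand.

```python
# Two made-up term-count dictionaries.
counts_a = {'thy': 3, 'self': 4}
counts_b = {'thy': 4, 'foe': 3}

# Numerator: sum of products of the counts (only shared terms contribute).
dot_product = 0
for token in set(counts_a) | set(counts_b):
    dot_product = dot_product + counts_a.get(token, 0) * counts_b.get(token, 0)

# Denominator: the length of each count vector.
length_a = 0
for value in counts_a.values():
    length_a = length_a + value ** 2
length_a = length_a ** 0.5       # (9 + 16) ** 0.5 = 5.0

length_b = 0
for value in counts_b.values():
    length_b = length_b + value ** 2
length_b = length_b ** 0.5       # (16 + 9) ** 0.5 = 5.0

cosine = dot_product / (length_a * length_b)   # 12 / 25
print(cosine)
```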

# Euclidean Distance - Comparing Poem 1 to all other poems

In [45]:
c0 = poems_counts[0]
c0_sims = []

for c1 in poems_counts:
    tokens = set(c0)|set(c1) # create a set of all tokens in both poems
    termsum=0
    for token in tokens:
        if token in c0:
            a=c0[token]
        else:
            a=0
        if token in c1:
            b= c1[token]
        else:
            b=0
        term = (a-b)**2
        termsum= termsum+term
    euc_similarity = termsum**0.5
    c0_sims.append(round(euc_similarity,3))
print(c0_sims)

[0.0, 15.067, 14.799, 15.492, 15.492, 16.852, 14.866, 16.062, 15.033, 16.462, 17.378, 16.733, 19.287, 16.583, 16.217, 17.436, 17.607, 15.492, 14.318, 15.362, 17.944, 18.138, 16.31, 15.748, 16.733, 18.166, 18.276, 16.432, 18.221, 17.607, 17.407, 16.248, 16.0, 14.765, 16.0, 17.521, 17.776, 14.283, 15.524, 20.298, 15.556, 22.136, 17.776, 17.0, 15.588, 16.401, 19.105, 17.349, 16.062, 15.811, 18.894, 15.652, 19.519, 16.432, 16.643, 15.362, 18.547, 21.354, 16.093, 14.353, 16.279, 20.469, 16.852, 16.371, 17.493, 17.059, 17.692, 16.279, 15.46, 15.362, 19.0, 19.57, 16.823, 16.763, 16.401, 18.493, 16.583, 15.556, 18.439, 17.321, 18.083, 15.264, 19.105, 18.358, 16.523, 16.673, 15.556, 16.34, 17.234, 17.972, 20.494, 16.523, 17.205, 15.524, 14.56, 15.362, 14.933, 17.493, 15.906, 14.9, 16.673, 17.464, 18.193, 17.607, 19.519, 18.055, 15.524, 14.526, 18.841, 17.176, 17.607, 19.287, 17.464, 16.432, 17.176, 16.155, 17.776, 16.432, 17.464, 18.028, 17.635, 15.264, 15.166, 16.763, 15.133, 15.684, 17.464, 1

In [47]:
max(c0_sims)

22.136

In [48]:
min(c0_sims)

0.0

In [None]:
# One reason we might want to store this info as a dictionary: 
# if we sort on value, we lose order and therefore reference to the original poem
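One way to sort the distances without losing track of which poem each one belongs to is to pair every value with its position before sorting. The sketch below uses the first four values from the output above; `sample_distances` and `indexed` are illustrative names, and `enumerate` is a built-in that yields (position, item) pairs.

```python
# The first four distances from the printed list above.
sample_distances = [0.0, 15.067, 14.799, 15.492]

# Pair each distance with the position of the poem it belongs to.
indexed = []
for position, distance in enumerate(sample_distances):
    indexed.append((distance, position))

# Tuples sort by their first element, so this orders the pairs by distance.
indexed.sort()
print(indexed)
```

A dictionary keyed by poem position, as suggested in the comment above, would serve the same purpose.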

In [49]:
c0_sims.sort()

In [50]:
c0_sims

[0.0,
 14.283,
 14.318,
 14.353,
 14.526,
 14.56,
 14.765,
 14.765,
 14.799,
 14.799,
 14.866,
 14.9,
 14.933,
 15.033,
 15.067,
 15.133,
 15.166,
 15.264,
 15.264,
 15.362,
 15.362,
 15.362,
 15.362,
 15.46,
 15.492,
 15.492,
 15.492,
 15.524,
 15.524,
 15.524,
 15.556,
 15.556,
 15.556,
 15.588,
 15.652,
 15.684,
 15.748,
 15.811,
 15.906,
 15.906,
 16.0,
 16.0,
 16.062,
 16.062,
 16.062,
 16.093,
 16.155,
 16.217,
 16.248,
 16.279,
 16.279,
 16.31,
 16.31,
 16.34,
 16.371,
 16.371,
 16.401,
 16.401,
 16.432,
 16.432,
 16.432,
 16.432,
 16.432,
 16.462,
 16.523,
 16.523,
 16.583,
 16.583,
 16.583,
 16.643,
 16.673,
 16.673,
 16.733,
 16.733,
 16.763,
 16.763,
 16.793,
 16.823,
 16.852,
 16.852,
 16.882,
 17.0,
 17.0,
 17.029,
 17.059,
 17.117,
 17.146,
 17.176,
 17.176,
 17.205,
 17.205,
 17.234,
 17.321,
 17.349,
 17.378,
 17.407,
 17.436,
 17.464,
 17.464,
 17.464,
 17.464,
 17.493,
 17.493,
 17.521,
 17.607,
 17.607,
 17.607,
 17.607,
 17.635,
 17.692,
 17.776,
 17.776,
 17.776,
 

In [None]:
## I think up to here, with some sort of visualization (word cloud? ugh) Would be good for the first review lesson.
## After the other lessons, we could bring in functions and pandas and look at refactoring and applying this to all the texts.

# Visually looking at the most similar & most dissimilar poems

In [None]:
print(poems[0])

In [None]:
print(poems[59])

In [None]:
print(poems[41])

# Comparing Overlapping Words in Two Most Similar Poems

In [None]:
# Eric's Code
def overlap_terms(index_a, index_b):
    # Term-count dictionaries for the two poems (poems_counts is the list built earlier in this notebook)
    c_a = poems_counts[index_a]
    c_b = poems_counts[index_b]
    only_a = []
    only_b = []
    both_ab = []
    for term,count in c_a.items():
        if term in c_b:
            both_ab.append(((term,count),(term,c_b[term])))
        else:
            only_a.append((term,count))
    for term,count in c_b.items():
        if term not in c_a:
            only_b.append((term,count))
    # sorting overlapped terms by the sum of the frequencies
    both_sorted = sorted(both_ab, key=lambda x:x[0][1]+x[1][1], reverse=True)
    a_sorted = sorted(only_a, key=lambda x:x[1], reverse=True)
    b_sorted = sorted(only_b, key=lambda x:x[1], reverse=True)
    # Just returning the first 15 of each non-overlap list
    return [('a','b')] + both_sorted + [('a','b')] + list(zip(a_sorted,b_sorted))[:15]

In [None]:
# Using Sets

overlap = set(poems_counts[0].keys()) & set(poems_counts[59].keys())
only_a = set(poems_counts[0])-set(poems_counts[59])
only_b = set(poems_counts[59])-set(poems_counts[0])

In [None]:
for item in overlap:
    print(item, poems_counts[0][item],poems_counts[59][item])

In [None]:
for item in only_a:
    print(item, poems_counts[0][item])

In [None]:
for item in only_b:
    print(item, poems_counts[59][item])

In [None]:
# Below I am playing with the two algorithms. I like the simplicity of the code for Euclidean distance, 
# but it gives low numerical results and thus is hard to graph. The cosine distance graphs much more nicely,
# but I think it will be harder to prompt the students to get there. 

# For all pairs

In [None]:
c0 = poems_counts[0]
c0_sims = {}

for poem, c1 in enumerate(poems_counts):
    tokens = set(c0)|set(c1) # create a set of all tokens in both poems
    termsum=0
    for token in tokens:
        if token in c0:
            a=c0[token]
        else:
            a=0
        if token in c1:
            b= c1[token]
        else:
            b=0
        term = (a-b)**2
        termsum= termsum+term
    euc_similarity = 1/(1+termsum**0.5)
    key= '0'+'-'+str(poem)
    c0_sims[key]= round(euc_similarity,3)
print(c0_sims)

In [None]:
def similarity_matrix(all_counts_dict):
    all_sims_list = []
    for text, c0 in all_counts_dict.items():
        sims_list = []
        for text, tc in all_counts_dict.items():
            tokens = set(c0)|set(tc)
            termsum = 0
            for token in tokens:
                if token in c0:
                    a=c0[token]
                else:
                    a=0
                if token in tc:
                    b= tc[token]
                else:
                    b=0
                term = (a-b)**2
                termsum =termsum+term
            euc_similarity = termsum**0.5
            sims_list.append(euc_similarity)
        # max_sim = max(sims_list)
        norm_sims_id_list = [round(xx,3) for xx in sims_list]
        all_sims_list.append(norm_sims_id_list)
    return all_sims_list

In [None]:
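# similarity_matrix expects a dictionary, so key each poem's counts by its position
only_sims_list = similarity_matrix(dict(enumerate(poems_counts)))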

In [None]:
only_sims_list

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
sim_ns_array = np.array(only_sims_list)

In [None]:
f, ax = plt.subplots(figsize=(8,8))
ax = sns.heatmap(sim_ns_array, square=True)

In [None]:
g = sns.clustermap(sim_ns_array)

# Using Cosine Similarity

In [None]:
termsum=0
for token in tokens:
    
    if token in poem1:
        a=poem1[token]
    else:
        a=0
    if token in poem2:
        b= poem2[token]
    else:
        b=0
    term = a*b
    termsum= termsum+term
p1sum =0
for key, value in poem1.items():
    sq = value**2
    p1sum= p1sum+sq
p1 = p1sum**0.5
p2sum=0
for key, value in poem2.items():
    sq = value**2
    p2sum= p2sum+sq
p2 = p2sum**0.5

cos_similarity = termsum/(p1*p2)

In [None]:
def similarity_matrix(all_counts_dict):
    all_sims_list = []
    for text, c0 in all_counts_dict.items():
        sims_list = []
        for text, tc in all_counts_dict.items():
            tokens = set(c0)|set(tc)
            termsum = 0
            for token in tokens:
                if token in c0:
                    a=c0[token]
                else:
                    a=0
                if token in tc:
                    b= tc[token]
                else:
                    b=0
                term = a*b
                termsum= termsum+term
            p1sum =0
            for key, value in c0.items():
                sq = value**2
                p1sum= p1sum+sq
            p1 = p1sum**0.5
            p2sum=0
            for key, value in tc.items():
                sq = value**2
                p2sum= p2sum+sq
            p2 = p2sum**0.5
            cos_similarity = termsum/(p1*p2)
            sims_list.append(cos_similarity)
        # max_sim = max(sims_list)
        norm_sims_id_list = [round(xx,3) for xx in sims_list]
        all_sims_list.append(norm_sims_id_list)
    return all_sims_list

In [None]:
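# similarity_matrix expects a dictionary, so key each poem's counts by its position
only_sims_list = similarity_matrix(dict(enumerate(poems_counts)))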

In [None]:
only_sims_list

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
sim_ns_array = np.array(only_sims_list)

In [None]:
f, ax = plt.subplots(figsize=(8,8))
ax = sns.heatmap(sim_ns_array, square=True)

In [None]:
g = sns.clustermap(sim_ns_array)

<div style="text-align:center">    
  <a href="06%20Loops.ipynb">Previous Lesson: Loops</a>|
   <a href="07%20Accessing%20Files.ipynb">Next Lesson: Accessing Files</a>
</div>