### Let's play with language!

We know that computers are really good at number crunching; because of this, they're very often associated with areas like Maths and Physics.



**But** computers can also be used to play with *language* in cool ways. You've probably encountered a lot of language-based technologies already -- chat interfaces, like ChatGPT, and even more basic things like the predictive text function on your phone. In the world of Literature, scholars have developed a whole other way of thinking about texts using computation called *distant reading* (where they look at high-level trends as opposed to doing a *close reading*, which is a small or highly-targeted in-depth analysis).

We're going to try our hand at using the computer to transform and even *generate* text!

</br></br>


#### Getting some texts

We're going to grab some text to work with from the internet. In text analysis, you often work with groups of related documents which together form a *corpus*. Our corpus today is going to be collections of plays of different genres written by Shakespeare.

The below code fetches them and stores the data into a list of dictionaries. The keys of each dictionary are the play's title, its genre, and its full text. **You don't need to know how this code works**, but you are welcome to poke away at it to try to figure it out.



In [None]:
import os
!wget https://lexically.net/downloads/corpus_linguistics/ShakespearePlaysPlus.zip
!unzip ShakespearePlaysPlus.zip

corpus = []
for genre in ["comedies", "tragedies", "historical"]:
  prefix = genre + "/"
  for f in os.listdir(genre):
    if f.endswith(".txt"):
      #using 'with' makes sure each file gets closed once we've read it
      with open(prefix + f, encoding="utf-16-le") as fi:
        corpus.append({"title":f[:-4], "genre":genre, "text":fi.read()})


#we expect this to be 37
print(len(corpus))
#and the titles of all of the plays
for doc in corpus:
  print(doc["title"])


Now if we take a look at one of the documents we just read in, we see that it is full of *tags* used to distinguish things like acts, character names, and stage directions.




In [None]:
#you can change the number in the first set of square brackets if you want to see different ones
print(corpus[17]["text"][:1000])

### Tokenization


*We* know what all this means but the computer, dumb rock that it is, has no idea. Computers see everything as a series of 1s and 0s. The computer doesn't *know* what a stage direction is, what a character name is, or even what a *word* is. We have to teach it what collections of 1s and 0s matter to us.

The process of dividing up the text into those meaningful collections -- which we'll call **tokens** -- is called *tokenization*. For today, we're going to use the **word** as the smallest unit we care about, so our tokens will be individual words. We're only going to care about words spoken as part of *dialogue*. This means that we're going to discard any of the text enclosed in angled brackets.

**Your first task is to write a function to tokenize a text**.

Currently, the smallest unit python 'understands' is an individual character. That means that tokenization looks like looping over each character and deciding to either stick it to the end of a sequence of characters that will eventually make up a new word, or knowing that a given character indicates the end of a word and stopping the adding process (and adding the completed word to a running list of tokens).

This means we have to brainstorm all of the scenarios that indicate the end of a word. We should probably start a new token if:

- We hit a piece of punctuation that *isn't* an apostrophe (sometimes you might do this differently, but for now we don't want possessives to be split into different tokens)
- We hit any sort of whitespace character (newlines, spaces)


In our particular case, we also need to account for angled brackets. Finally, we want to make sure that all of our tokens are only lowercase.
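One possible way to deal with the angled brackets is to keep a flag that records whether we are currently inside a tag. The sketch below is only an illustration on a made-up string: the helper name `strip_tags`, the flag `inside_tag`, and the tag names in the example are ours and not part of the notebook. It only removes tags and doesn't do any tokenizing.

```python
def strip_tags(text):
  #hypothetical helper: returns the text with everything between < and > removed
  kept = ""
  inside_tag = False
  for c in text:
    if c == "<":
      inside_tag = True
    elif c == ">":
      inside_tag = False
      #a tag also separates words, so leave a space behind
      kept += " "
    elif not inside_tag:
      kept += c
  return kept

#made-up example; the tags in the real files may look different
example = "<HAMLET>To be, or not to be<EXIT>"
print(strip_tags(example))
```

In your own tokenizer you could fold the same flag into the character loop instead of building a cleaned-up string first.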

</br>

#### A few helpful things to know

* `string.punctuation` is a string that includes all types of punctuation.
* `string.whitespace` is a string that includes all types of whitespace characters.
* The `.lower()` string function makes any string (including one-character strings) lowercase.
* You can use a simple for loop to loop over each character in a string




*Check out the code below for some examples. Make sure to read the comments!*




In [None]:

#getting string punctuation
import string
punct = string.punctuation

#added the emdash because it was missing
punct += "—"
#keeping the apostrophe because we're not doing advanced tokenization so we'll want to keep contractions
punct = punct.replace("'", "")


s = 'We are the knights who say "Ni"!'
#This does very basic tokenization (printing out each token)
#you'll need to add some things to account for angled brackets, and add all the tokens to a list instead of printing them
running = ""
new_token = False
for c in s:
  if c in punct:
    new_token = True
  elif c in string.whitespace:
    new_token = True
  if new_token:
    if len(running)>0:
      print(running.lower())
    new_token = False
    running =""
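  else:
    running += c
#a word at the very end of the string has nothing after it to trigger the check above, so print it here
if len(running)>0:
  print(running.lower())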






### Your turn!

Fill in the tokenization function below. If you do your tokenization about the same way we have, the test code should output `16601`

Don't just rely on that, though! Make sure you print out some part of your list of tokens to make sure things are working as you expect!

In [None]:
#This is a copy of one of the texts from the corpus, for us to play around with
sample = corpus[17]["text"][:]


def tokenize(text):
  #YOUR CODE HERE


sample_tokens = tokenize(sample)
print(len(sample_tokens))


#compare the start of the text to your tokenization
#print(sample[:1000])
print(sample_tokens)

### Talking like Shakespeare


Now that we've sort of taught the computer what an individual word is, we're going to teach it how to string them together into sequences that (more or less) make sense.

</br>

**To start, your first task is to build a massive list of all the tokens used in ONLY THE COMEDIES**.

You should use your tokenization code to help you do this. You should end up with a total number of tokens around `349361`

In [None]:
comedy_tokens = []
#YOUR CODE HERE

Language is really complicated. There are all sorts of rules (and many *many* exceptions) that we've all internalised about what kinds of words go together.

We could *try* to teach the computer all of those rules, but all the exceptions would make it tricky (and that's not even taking into account the fact that we're using antiquated language)! There's a lot of nuance that it would be hard to make concrete.

Instead, we're going to find a way of 'teaching' the computer about which words go together through something computers are really good at: maths!

We're going to take a huge number of examples (that massive token list we just created) and count up the number of times pairs of words appear beside each other. These pairs are called **bigrams**.
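To make the idea concrete, here is a small sketch on a made-up list of six tokens (`toy_tokens` is just for illustration). It uses `zip` and `collections.Counter` from the standard library, which is only one of several ways to count pairs; a plain loop and dictionary would work just as well for your own function.

```python
from collections import Counter

toy_tokens = ["to", "be", "or", "not", "to", "be"]

#zip pairs each token with the one that comes right after it
pairs = list(zip(toy_tokens, toy_tokens[1:]))
print(pairs)

#Counter is a dictionary subclass that counts how often each item appears
toy_counts = Counter(pairs)
print(toy_counts[("to", "be")])
```

Six tokens give five pairs, and `("to", "be")` is the only pair that shows up twice.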

**Your task is to write a function that creates a dictionary where the keys are bigrams, and the values are the number of times they appear.**

If you do it like we did, your top ten (we give you some code to calculate and print this) should look something like:

```
('i', 'am') 932
('i', 'will') 805
('i', 'have') 753
('in', 'the') 702
('to', 'the') 575
('of', 'the') 570
('it', 'is') 535
('my', 'lord') 500
('to', 'be') 453
('that', 'i') 430
```



In [None]:
#this is a helper function for you to use -> for this question and others later on
def top_ten(d):
  keys = list(d.keys())
  keys.sort(key=lambda x:d[x], reverse=True)
  #slicing means this still works if there are fewer than ten keys
  for k in keys[:10]:
    print(k, d[k])



def bigram_freq(tokens):
  #YOUR CODE HERE

bigrams = bigram_freq(comedy_tokens)
top_ten(bigrams)

These are pretty neat! Let's make things a little more complex, and try doing the same thing, but with sets of *three* words this time; these are called **trigrams**.

*Your code will be pretty similar to the bigram code.*

If you do it correctly, your top ten should look something like:

```
('i', 'pray', 'you') 157
('i', 'will', 'not') 106
('i', 'know', 'not') 75
('i', 'am', 'not') 75
('i', 'am', 'a') 72
('i', 'do', 'not') 69
('it', 'is', 'a') 66
('and', 'i', 'will') 65
('there', 'is', 'no') 64
('i', 'would', 'not') 61

```

In [None]:
def trigram_freq(tokens):
  #YOUR CODE HERE

trigrams = trigram_freq(comedy_tokens)
top_ten(trigrams)

A cool thing that we can do with these bigram and trigram counts is calculate the probability of any single token appearing after two other tokens.

To calculate this probability for the token `t[i]` at position `i`, we simply do:

```
trigram_frequency of ( t[i-2], t[i-1], t[i] ) / bigram frequency of (t[i-2], t[i-1])
```

**Let's write a function that, for each DISTINCT token (in our comedy only corpus), will calculate the probability of it appearing following two other given tokens.**

This function should return a dictionary where the keys are the distinct words, and each key's value is the probability value for that word.
As parameters, you need to give it two previous tokens that at some point appear beside each other.

</br>

*A useful thing to know*: You can use Python sets to get all the distinct elements in a list. For example, the following code:

```
l = [1,2,2,3,4]
dist = list(set(l))
print(dist)
```
outputs

```
[1, 2, 3, 4]
```

If you test your distribution function with the previous tokens `"i"` and `"will"`, your top ten should look like:

```
not 0.13167701863354037
be 0.06459627329192547
go 0.03354037267080745
tell 0.02732919254658385
do 0.024844720496894408
have 0.02236024844720497
make 0.01987577639751553
give 0.013664596273291925
never 0.013664596273291925
no 0.009937888198757764

```
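As a sanity check, the first value in that list can be worked out by hand from the counts shown in the expected bigram and trigram outputs earlier: `('i', 'will', 'not')` appears 106 times and `('i', 'will')` appears 805 times. The variable names below are ours, chosen just for this check.

```python
#counts copied from the expected top-ten lists earlier in this notebook
trigram_count = 106   #('i', 'will', 'not')
bigram_count = 805    #('i', 'will')

#probability of "not" following "i will"
print(trigram_count / bigram_count)
```

The result should come out at roughly 0.1317, in line with the value listed for `not`.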

In [None]:
def trigram_freq_dist(distinct_tokens, bigram_freq, trigram_freq, prev_prev_token, prev_token):
  #YOUR CODE HERE


d_tokens = list(set(comedy_tokens))
#testing with 'i will'
dist = trigram_freq_dist(d_tokens, bigrams, trigrams, 'i', 'will')
top_ten(dist)

The really cool thing that we can do now that we have this probability distribution is *hallucinate* text that sounds a bit like Shakespeare! Starting with two tokens, we can predict -- like we did above -- what token might come next. If we pick one of those tokens, we can now use that new token and the previous one to generate *another* prediction... and so on and so forth. This is similar to how predictive text works on your phone.

To make things more interesting, instead of always picking the most probable token (which would give us the same text every time), we'll pseudorandomly pick in a way that takes the probability into account. The function below does just that, given the trigram distribution. Notice how the two different calls (likely) yield different results.

In [None]:
import random

#this function depends on at least one of the trigram probabilities being non-zero
#so you should make sure that's true before calling it
def pick_from_dist(tri_dist):
  vals = list(tri_dist.keys())
  choice = random.choices(vals, weights=tri_dist.values(), k=1)
  return choice[0]


trigram_dist = trigram_freq_dist(d_tokens, bigrams, trigrams, "i", "have" )
print(pick_from_dist(trigram_dist))
print(pick_from_dist(trigram_dist))
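
If you want to convince yourself that `random.choices` really does respect the weights, one rough way is to draw many times from a small made-up distribution and tally the results. The names `toy_dist` and `toy_tally` are ours and the probabilities are invented for the illustration.

```python
import random
from collections import Counter

#made-up distribution: "not" is three times as likely as "be"
toy_dist = {"not": 0.75, "be": 0.25}

random.seed(0)  #fixed seed so the run is repeatable
draws = random.choices(list(toy_dist.keys()), weights=list(toy_dist.values()), k=1000)

#count how many times each token was drawn
toy_tally = Counter(draws)
print(toy_tally)
```

With 1000 draws you would expect `"not"` to come up about three times as often as `"be"`, though the exact counts depend on the seed.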

**Your task is to write a function that takes two starting tokens (as a list), the tokens, bigram and trigram frequencies, and an integer `n` and generates a text of length `n` that starts with those tokens.**

For example, if you pass in the tokens `'i'`, `'will'` and the integer `20` you might end up with something like:

```
i will not come fair princess he is to conjure tears up in a riot take your leave good madam
```
or
```
i will not trust you not hear her lamb when it bites and you shall see it in a quarrel

```

Before you start writing code, think carefully about the process. You might want to try writing down individual steps as comments and then gradually transforming those into code.

In [None]:
def hallucinate(tokens,start_tokens, bi_f, tri_f, n):
  #YOUR CODE HERE


print(hallucinate(d_tokens, ["i", "will"], bigrams, trigrams, 20))

Not entirely coherent, but *vaguely* Shakespearean!

This is operating uniquely on the text from the comedies. **If you wanted, you could try the same thing on the *entire* corpus, or on just the collections of tragedies or historical plays, to see if you get different results.**

## Extension task: looking for meaning

In the first part, we used the documents to *generate* text. Another thing we sometimes use computers for is trying to *understand* things about text, without reading it all. In literature, this use of computers and statistical methods to identify high-level trends is sometimes called *distant reading* (as a counterpoint to *close reading*).

One basic technique that we can use to start to get an idea of what a document is talking about, is to (after tokenization) **count how many times each distinct word occurs**.


**Your first task is to write a function (starting with the given header) that will take a document's tokens and return a dictionary where the keys are distinct tokens and the values are integers corresponding to how many times each token appeared**. This is called a *frequency vector*.
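
For a feel of what a frequency vector looks like, here is a minimal sketch on five made-up tokens (`toy_tokens` and `toy_vector` are illustrative names). It uses the dictionary method `.get`, which is just one way of handling a token the first time it is seen.

```python
toy_tokens = ["double", "double", "toil", "and", "trouble"]

toy_vector = {}
for tok in toy_tokens:
  #.get returns 0 the first time we meet a token
  toy_vector[tok] = toy_vector.get(tok, 0) + 1

print(toy_vector)
```

Here `double` ends up with a count of 2 and every other token with a count of 1.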


If you do it correctly, your top 10 for *Macbeth* should look something like:

```
the 700
and 515
to 398
of 333
i 312
that 229
a 217
you 203
in 198
my 192
```

In [None]:
def freq_vec(tokens):
  #YOUR CODE HERE


m_tokens = tokenize(corpus[17]["text"])
vector = freq_vec(m_tokens)
top_ten(vector)

**Now, try using the code you've written to print out the top ten for every TRAGEDY in the corpus**.

In [None]:
#YOUR CODE HERE

**Well that doesn't tell us much!**

Looking at the frequency vectors for each play, we can see that they're mostly made up of words like `and`, `the` and `you` -- words that are used in every kind of text. In the world of natural language processing, these are called *stopwords*. In many problem-contexts, we *remove* them from token lists in order to get more meaningful results.
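
Removing stopwords usually comes down to keeping only the tokens that are *not* in the stopword collection. The sketch below uses a tiny made-up stopword set (`toy_stopwords`) rather than a real list, just to show the filtering step.

```python
#tiny made-up stopword set, just for this illustration
toy_stopwords = {"the", "and", "you"}
toy_tokens = ["the", "king", "and", "you", "shall", "dine"]

#keep only the tokens that are not stopwords
toy_filtered = [t for t in toy_tokens if t not in toy_stopwords]
print(toy_filtered)
```

Storing the stopwords in a set rather than a list keeps the `not in` check fast, which starts to matter once the token list has hundreds of thousands of entries.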

</br>

The below code creates a list of stopwords for you. You'll notice that it includes some modern variations that won't show up in Shakespeare, but it'll do. If you happen to notice antiquated forms popping up in our texts, you can try adding them in.

In [None]:
#NOTE: could use the NLTK stopwords instead, or read them in from a text file
#the reasons to do it this way would be 1) no extra imports and 2) they can see what the words are/play with them

# Source is Table 2.1 of Chapter 2 of Information Retrieval by C.J. Van Rijsbergen
# http://www.dcs.gla.ac.uk/Keith/Chapter.2/Ch.2.html

stopword_text = """
A               CANNOT          INTO            OUR             THUS
ABOUT           CO              IS              OURS            TO
ABOVE           COULD           IT              OURSELVES       TOGETHER
ACROSS          DOWN            ITS             OUT             TOO
AFTER           DURING          ITSELF          OVER            TOWARD
AFTERWARDS      EACH            LAST            OWN             TOWARDS
AGAIN           EG              LATTER          PER             UNDER
AGAINST         EITHER          LATTERLY        PERHAPS         UNTIL
ALL             ELSE            LEAST           RATHER          UP
ALMOST          ELSEWHERE       LESS            SAME            UPON
ALONE           ENOUGH          LTD             SEEM            US
ALONG           ETC             MANY            SEEMED          VERY
ALREADY         EVEN            MAY             SEEMING         VIA
ALSO            EVER            ME              SEEMS           WAS
ALTHOUGH        EVERY           MEANWHILE       SEVERAL         WE
ALWAYS          EVERYONE        MIGHT           SHE             WELL
AMONG           EVERYTHING      MORE            SHOULD          WERE
AMONGST         EVERYWHERE      MOREOVER        SINCE           WHAT
AN              EXCEPT          MOST            SO              WHATEVER
AND             FEW             MOSTLY          SOME            WHEN
ANOTHER         FIRST           MUCH            SOMEHOW         WHENCE
ANY             FOR             MUST            SOMEONE         WHENEVER
ANYHOW          FORMER          MY              SOMETHING       WHERE
ANYONE          FORMERLY        MYSELF          SOMETIME        WHEREAFTER
ANYTHING        FROM            NAMELY          SOMETIMES       WHEREAS
ANYWHERE        FURTHER         NEITHER         SOMEWHERE       WHEREBY
ARE             HAD             NEVER           STILL           WHEREIN
AROUND          HAS             NEVERTHELESS    SUCH            WHEREUPON
AS              HAVE            NEXT            THAN            WHEREVER
AT              HE              NO              THAT            WHETHER
BE              HENCE           NOBODY          THE             WHITHER
BECAME          HER             NONE            THEIR           WHICH
BECAUSE         HERE            NOONE           THEM            WHILE
BECOME          HEREAFTER       NOR             THEMSELVES      WHO
BECOMES         HEREBY          NOT             THEN            WHOEVER
BECOMING        HEREIN          NOTHING         THENCE          WHOLE
BEEN            HEREUPON        NOW             THERE           WHOM
BEFORE          HERS            NOWHERE         THEREAFTER      WHOSE
BEFOREHAND      HERSELF         OF              THEREBY         WHY
BEHIND          HIM             OFF             THEREFORE       WILL
BEING           HIMSELF         OFTEN           THEREIN         WITH
BELOW           HIS             ON              THEREUPON       WITHIN
BESIDE          HOW             ONCE            THESE           WITHOUT
BESIDES         HOWEVER         ONE             THEY            WOULD
BETWEEN         I               ONLY            THIS            YET
BEYOND          IE              ONTO            THOSE           YOU
BOTH            IF              OR              THOUGH          YOUR
BUT             IN              OTHER           THROUGH         YOURS
BY              INC             OTHERS          THROUGHOUT      YOURSELF
CAN             INDEED          OTHERWISE       THRU            YOURSELVES
"""

stopwords = set( [ s.lower().strip() for s in stopword_text.strip().split() ] )
print(stopwords)

Try running the code you wrote to calculate and display the frequency vectors for each tragedy but, this time, **remove all stopwords from the token list first**.

Your new top 10 for *Romeo and Juliet* should look like:

```
thou 277
thy 166
o 150
love 137
thee 137
romeo 113
shall 110
come 97
do 89
good 86
```

In [None]:
#YOUR CODE HERE


We're still seeing some common oldey-timey words in there, but this now gives us a better sense of the topics of each document.

That said, a lot of the documents share most frequent words. This means that it can be hard to tell which words are the most important to a *specific* document.

To address this challenge, we can use something called **TF-IDF** which stands for **Term Frequency - Inverse Document Frequency**. What TF-IDF lets us do is figure out which words matter most to an individual document: words that are frequent in that document but used in few of the other documents in the corpus.

We calculate the individual TF-IDF value of a token using the following formula:

```
TF * IDF
```

Where **TF** is

```
TF(t,d) = number of times token t appears in document d/total number of tokens in document d
```
and **IDF** is

```
IDF(t) = log(number of documents in the corpus/number of corpus documents which use token t)
```
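
To see how the two parts combine, here is a worked example with made-up numbers (none of them come from the plays). It uses the base 10 log from Python's `math` module; the variable names are ours.

```python
import math

#made-up numbers, purely for illustration
times_in_doc = 3       #the token appears 3 times in the document
tokens_in_doc = 100    #the document has 100 tokens in total
num_docs = 10          #documents in the corpus
docs_with_token = 2    #documents that use the token at least once

tf = times_in_doc / tokens_in_doc
idf = math.log10(num_docs / docs_with_token)
print(tf * idf)   #roughly 0.021

#a token used in every document gets an IDF of log10(1), which is 0
print(math.log10(num_docs / num_docs))
```

The second print shows why very common words drop out: if every document uses a token, its IDF is 0, so its TF-IDF is 0 no matter how often it appears.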


**Your job is to calculate the TF-IDF vectors for each of the documents in the corpus**. This means calculating the individual values for each of its unique tokens, like you did for frequency vectors. You should be using your (stopword-free) frequency vectors as part of this process. Note that you can use `math.log10(x)` to take the base 10 log of a value `x`.


To make your life easier, the code below creates a dictionary where the keys are all the (non-stopword) tokens used in the tragedies, which are the corpus we're working with here, and the values are the number of those documents in which each word appears.

In [None]:
def create_corp_counts():
  usage = {}
  for c in corpus:
    if c["genre"]=='tragedies':
      no_stop = [x for x in tokenize(c["text"]) if x not in stopwords]
      unique = set(no_stop)
      for word in unique:
        if word not in usage:
          usage[word]=0
        usage[word] +=1
  return usage

doc_usage = create_corp_counts()
print(doc_usage)


We also know that there are **10** documents in our corpus (if we're only looking at tragedies). But don't take our word for it, run the code below to confirm!

In [None]:
num =0
for c in corpus:
  if c["genre"]=="tragedies":
    num +=1

print(num)

Now you should have all the pieces you need to write the tf-idf vectorization code. **Write your code in the cell below**.

If you do it correctly, your updated top 10 for *Romeo and Juliet* should look something like:

```
romeo 0.00916240979485932
tybalt 0.0037298305359604314
juliet 0.0032433309008351578
friar 0.0017838319954593368
montague 0.0017027487229384578
paris 0.0013601946082919365
county 0.0012973323603340632
mercutio 0.0011351658152923053
thursday 0.0011351658152923053
capulet 0.0010540825427714264
```

In [None]:
import math
def tfidf_vectorize(freq_vec, num_docs, doc_use):
  #YOUR CODE HERE


no_stop = [x for x in tokenize(corpus[18]["text"]) if x not in stopwords]
f = freq_vec(no_stop)
tfidf = tfidf_vectorize(f, 10, doc_usage)
top_ten(tfidf)

**Now try calculating and outputting the tf-idf top 10 for each tragedy, just like you did for the frequency vectors**.

In [None]:
#YOUR CODE HERE

As you can see, it mostly ends up being character names and locations, but that certainly tells us a lot more about each specific document than normal frequency vectors!

If you want to get a sense of play topics and themes, you could try also designating the character names as stopwords, so that they'll be removed (or, more simply, just look beyond the top 10)!