# Assignment 4: Word Embeddings for Shakespearean English

## **Objective**
To train word embeddings on famous works of Shakespeare and evaluate their understanding.

## **Data**
The entire text of plays: 1) The Tragedy of Hamlet, Prince of Denmark, 2) The Tragedy of Macbeth, and 3) The Tragedy of Julius Caesar. These are available from the Gutenberg corpus of the NLTK library. Characters and synopses can be found on Wikipedia.

## **Problem Statement**
Natural language processing is an important part of the most advanced artificial intelligence software we have today. By studying volumes of text, word embeddings are able to elicit meaning from the words within training data. Your goal is to train a word embedding on three famous works of Shakespeare to determine how well your embedding can understand the meaning of character names and other Shakespearean English words found in these plays.

### Data

In [1]:
import nltk
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from autocorrect import Speller
import re

In [2]:
# Loading and lower-case the text
plays = [gutenberg.raw('shakespeare-hamlet.txt'),
         gutenberg.raw('shakespeare-macbeth.txt'),
         gutenberg.raw('shakespeare-caesar.txt')]

In [3]:
full_text = ' '.join(plays).lower()

In [4]:
# Tokenizing into sentences and words
sentences = sent_tokenize(full_text)
tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]

In [5]:
tokenized_sentences

[['[',
  'the',
  'tragedie',
  'of',
  'hamlet',
  'by',
  'william',
  'shakespeare',
  '1599',
  ']',
  'actus',
  'primus',
  '.'],
 ['scoena', 'prima', '.'],
 ['enter', 'barnardo', 'and', 'francisco', 'two', 'centinels', '.'],
 ['barnardo', '.'],
 ['who', "'s", 'there', '?'],
 ['fran', '.'],
 ['nay',
  'answer',
  'me',
  ':',
  'stand',
  '&',
  'vnfold',
  'your',
  'selfe',
  'bar',
  '.'],
 ['long', 'liue', 'the', 'king', 'fran', '.'],
 ['barnardo', '?'],
 ['bar', '.'],
 ['he', 'fran', '.'],
 ['you', 'come', 'most', 'carefully', 'vpon', 'your', 'houre', 'bar', '.'],
 ["'t",
  'is',
  'now',
  'strook',
  'twelue',
  ',',
  'get',
  'thee',
  'to',
  'bed',
  'francisco',
  'fran',
  '.'],
 ['for',
  'this',
  'releefe',
  'much',
  'thankes',
  ':',
  "'t",
  'is',
  'bitter',
  'cold',
  ',',
  'and',
  'i',
  'am',
  'sicke',
  'at',
  'heart',
  'barn',
  '.'],
 ['haue', 'you', 'had', 'quiet', 'guard', '?'],
 ['fran', '.'],
 ['not', 'a', 'mouse', 'stirring', 'barn', '.'],
 

In [16]:
# Spelling check
spell = Speller()
corrected_sentences = [[spell(word) for word in sentence] for sentence in tokenized_sentences]

In [30]:
# Stopword removal
stop_words = set(stopwords.words('english'))
cleaned_sentences = [[word for word in sentence if word not in stop_words] for sentence in corrected_sentences]

In [None]:
cleaned_sentences

In [31]:
# Stemming and Lemmatization
# stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
processed_sentences = [[lemmatizer.lemmatize(word) for word in sentence] for sentence in cleaned_sentences]

In [11]:
processed_sentences

[['[',
  'tragedy',
  'hamlet',
  'william',
  'shakespeare',
  '1599',
  ']',
  'act',
  'prime',
  '.'],
 ['scene', 'prima', '.'],
 ['enter', 'bernard', 'francisco', 'two', 'sentinel', '.'],
 ['bernard', '.'],
 ["'s", '?'],
 ['fran', '.'],
 ['nay', 'answer', ':', 'stand', '&', 'unfold', 'self', 'bar', '.'],
 ['long', 'like', 'king', 'fran', '.'],
 ['bernard', '?'],
 ['bar', '.'],
 ['fran', '.'],
 ['come', 'carefully', 'von', 'house', 'bar', '.'],
 ["'t", 'took', 'twelve', ',', 'get', 'thee', 'bed', 'francisco', 'fran', '.'],
 ['release',
  'much',
  'thanks',
  ':',
  "'t",
  'bitter',
  'cold',
  ',',
  'sick',
  'heart',
  'barn',
  '.'],
 ['quiet', 'guard', '?'],
 ['fran', '.'],
 ['mouse', 'stirring', 'barn', '.'],
 ['well', ',', 'goodnight', '.'],
 ['meet',
  'ratio',
  'marvelous',
  ',',
  'rival',
  'watch',
  ',',
  'bid',
  'make',
  'hast',
  '.'],
 ['enter', 'ratio', 'marvelous', '.'],
 ['fran', '.'],
 ['think', '.'],
 ['stand', ':', "'s", '?'],
 ['hor', '.'],
 ['friend', 

In [17]:
# Regular Expression Cleanup
processed_sentences = [[re.sub(r'\W+', '', word) for word in sentence if word.isalpha()] for sentence in processed_sentences]

In [18]:
# Printing first 5 sentences
print(processed_sentences[:5])

[['tragedy', 'hamlet', 'william', 'shakespeare', 'act', 'prime'], ['scene', 'prima'], ['enter', 'bernard', 'francisco', 'two', 'sentinel'], ['bernard'], []]


### Modeling

In [32]:
from gensim.models import Word2Vec

# CBOW Model
cbow_model = Word2Vec(sentences=processed_sentences, vector_size=300, window=10, min_count=1, sg=0, epochs=20)
# Printing 20 most frequent words

for word, count in list(cbow_model.wv.key_to_index.items())[:20]:
    print(word, ":", cbow_model.wv.get_vecattr(word, "count"))


, : 7058
. : 4286
: : 1542
? : 996
'd : 662
; : 456
ham : 337
thou : 306
lord : 306
's : 301
shall : 300
come : 284
king : 248
enter : 230
good : 221
let : 220
mac : 205
thy : 202
like : 200
cesar : 193


In [33]:
# Skip-gram Model
skipgram_model = Word2Vec(sentences=processed_sentences, vector_size=300, window=10, min_count=1, sg=1, epochs=20)


In [22]:
import requests

url = "http://nlp.stanford.edu/data/glove.6B.zip"
response = requests.get(url)

# Save the zip file locally
with open("glove.6B.zip", "wb") as f:
    f.write(response.content)

print("Download complete.")

Download complete.


In [23]:
import zipfile

# Unzip the file
with zipfile.ZipFile("glove.6B.zip", "r") as zip_ref:
    zip_ref.extractall(".")  # Extracts files to the current directory

print("Extraction complete.")

Extraction complete.


In [24]:
from gensim.models import KeyedVectors

# Load GloVe vectors without assuming a header
glove_model = KeyedVectors.load_word2vec_format('glove.6B.100d.txt', binary=False, no_header=True)
print("GloVe model loaded successfully!")





GloVe model loaded successfully!


Here, the glove model has been trained on the 'glove.6B.100d.txt'


### Discussion

In [25]:
# Function to print most similar words
def compare_models(word):
    print(f"\nMost similar words for: {word}")
    if word in cbow_model.wv:
        print("CBOW:", cbow_model.wv.most_similar(word, topn=5))
    else:
        print("CBOW: Word not found in vocabulary")
    
    if word in skipgram_model.wv:
        print("Skip-gram:", skipgram_model.wv.most_similar(word, topn=5))
    else:
        print("Skip-gram: Word not found in vocabulary")
    
    if word in glove_model:
        print("GloVe:", glove_model.most_similar(word, topn=5))
    else:
        print("GloVe: Word not found in vocabulary")

# Compare words
words_to_check = ['hamlet', 'cauldron', 'nature', 'spirit', 'general', 'prythee']
for word in words_to_check:
    compare_models(word)


Most similar words for: hamlet
CBOW: [('queen', 0.9974504709243774), ('alert', 0.9972547888755798), ('king', 0.9964178204536438), ('ophelia', 0.9961773157119751), ('rosincrance', 0.9959716796875)]
Skip-gram: [('alert', 0.9457433223724365), ('queen', 0.9198136329650879), ('ophelia', 0.9100129008293152), ('ratio', 0.9045300483703613), ('deer', 0.8915296196937561)]
GloVe: [('village', 0.6998987197875977), ('town', 0.6558532118797302), ('situated', 0.5926076769828796), ('located', 0.5660547614097595), ('unincorporated', 0.5599358677864075)]

Most similar words for: cauldron
CBOW: [('double', 0.9997981190681458), ('trouble', 0.9997791051864624), ('golden', 0.9997566938400269), ('bubble', 0.9997516870498657), ('fell', 0.9997451901435852)]
Skip-gram: [('toile', 0.991753339767456), ('bubble', 0.9913451075553894), ('golden', 0.9808398485183716), ('twenty', 0.9796853065490723), ('flame', 0.9796381592750549)]
GloVe: [('caldron', 0.7603139281272888), ('flame', 0.6907342672348022), ('lit', 0.59124

For Hamlet CBOW associated the word well with king, Denmark due to the context in the play, Skip-gram also showed some correct associations (queen, king, lord) but included outliers like colony and alert while GloVe gave the actual similar words (village, town) without taking into consideration the context.

For Cauldron, CBOW missed the correct meaning by providing words that are not associated (ha, think, may), Skip-gram captured some good associations (bubble, cool, compel), which are related to the witches’ cauldron in Macbeth while GloVe correctly identified related words (flame, lit, torch).

For Nature, CBOW provided words (even, vp, hand, blood) that do not Shakespearean meaning, Skip-gram also provided words (even, though, yes) which are not really associated with nature. However, GloVe provided more general words (natural, true, life) associated with nature.

For Spirit, CBOW provided words (blood, heaven, self) which are associated with the contextual meaning in the play, Skip-gram provided some incorrect words (habit, instrument, hide) while GloVe captured words associated with spirit meaning (passion, faith, love), but not specific to the play.

For General, CBOW provided words (fire, house, state) and Skip-gram generated weak words (satisfied, fetch, soft) which are not associated with the military ranks in the play, but GloVe captured the modern meaning by Identifying the military association (chief, gen., president).

For Prythee, CBOW captured some terms (passe, ere), indicating a partial understanding, Skip-gram provided some unusual words (fate, trade, labour) while GloVe did not have the Shakespearean vocabulary.


In [27]:
# Function to print cosine similarities
def compare_similarities(pair):
    word1, word2 = pair
    print(f"\nCosine similarity between {word1} and {word2}:")
    if word1 in cbow_model.wv and word2 in cbow_model.wv:
        print("CBOW:", cbow_model.wv.similarity(word1, word2))
    else:
        print("CBOW: Words not found in vocabulary")
    
    if word1 in skipgram_model.wv and word2 in skipgram_model.wv:
        print("Skip-gram:", skipgram_model.wv.similarity(word1, word2))
    else:
        print("Skip-gram: Words not found in vocabulary")
    
    if word1 in glove_model and word2 in glove_model:
        print("GloVe:", glove_model.similarity(word1, word2))
    else:
        print("GloVe: Words not in vocabulary")

# Compare cosine similarities
word_pairs = [('brutus', 'murder'), ('lady macbeth', 'queen gertrude'), ('fortinbras', 'norway'),
              ('rome', 'norway'), ('ghost', 'spirit'), ('macbeth', 'hamlet')]
for pair in word_pairs:
    compare_similarities(pair)


Cosine similarity between brutus and murder:
CBOW: Words not found in vocabulary
Skip-gram: Words not found in vocabulary
GloVe: 0.07364358

Cosine similarity between lady macbeth and queen gertrude:
CBOW: Words not found in vocabulary
Skip-gram: Words not found in vocabulary
GloVe: Words not in vocabulary

Cosine similarity between fortinbras and norway:
CBOW: 0.99970627
Skip-gram: 0.9407859
GloVe: -0.028961957

Cosine similarity between rome and norway:
CBOW: 0.9980002
Skip-gram: 0.61553466
GloVe: 0.28583667

Cosine similarity between ghost and spirit:
CBOW: 0.9952172
Skip-gram: 0.5267711
GloVe: 0.4282089

Cosine similarity between macbeth and hamlet:
CBOW: 0.9758054
Skip-gram: 0.47028545
GloVe: 0.42935854


For Brutus and Murder, CBOW and Skip-gram did not recognize any of the words, this could be due to low frequency in the text, however, GloVe showed a low similarity (0.07), which could mean that the general corpus does not strongly connect both words.

For Lady Macbeth and Queen Gertrude, the three models were unable to associate Lady Macbeth (Macbeth) and Queen Gertrude (Hamlet) as female Shakespearean figures due to missing data.

Fortinbras and Norway, CBOW and Skip-gram showed high similarity of 0.9991 and 0.9957 respectively because Fortinbras is a Norwegian prince, but GloVe gave a negative value of -0.0289 which might be because its training data does not capture this Shakespearean connection.

For Rome and Norway, CBOW and Skip-gram showed high similarity between the words maybe based on their usage within the text, while GloVe provided a weak connection, maybe because of the distance between both.

For Ghost and Spirit, CBOW and Skip-gram showed strong similarity which maybe due to how both words were used interchangeably in the play while GloVe captured some similarity but was weaker than the other models.

For Macbeth and Hamlet, CBOW captured a very high similarity between both words, Skip-gram also correctly associated them but not as high as CBOW and GloVe recognized weaker connection.


In [29]:
def compare_linear_combination(expression, words):
    print(f"\nMost similar to: {expression}")
    print("CBOW:", cbow_model.wv.most_similar(positive=words[:2], negative=words[2:], topn=5))
    print("Skip-gram:", skipgram_model.wv.most_similar(positive=words[:2], negative=words[2:], topn=5))
    try:
        print("GloVe:", glove_model.most_similar(positive=words[:2], negative=words[2:], topn=5))
    except KeyError:
        print("GloVe: Some words not found in vocabulary")


# Evaluate word vector combinations
linear_combinations = [('denmark + queen', ['denmark', 'queen']),
                       ('scotland + army + general', ['scotland', 'army', 'general']),
                       ('father - man + woman', ['father', 'woman', 'man']),
                       ('mother - woman + man', ['mother', 'man', 'woman'])]

for exp, words in linear_combinations:
    compare_linear_combination(exp, words)


Most similar to: denmark + queen
CBOW: [('ophelia', 0.9995806813240051), ('alert', 0.9995394349098206), ('ghost', 0.9993239045143127), ('claudia', 0.9991902709007263), ('ratio', 0.9991004467010498)]
Skip-gram: [('qu', 0.954936683177948), ('alert', 0.9379003643989563), ('ophelia', 0.9339475631713867), ('poison', 0.9320482015609741), ('tagging', 0.9284816980361938)]
GloVe: [('sweden', 0.7461869120597839), ('norway', 0.7017143964767456), ('kingdom', 0.6878639459609985), ('princess', 0.6799803376197815), ('britain', 0.6786327362060547)]

Most similar to: scotland + army + general
CBOW: [('macduffe', 0.998903214931488), ('servant', 0.9987934827804565), ('messenger', 0.9987815022468567), ('lady', 0.998756468296051), ('malcolm', 0.9987280964851379)]
Skip-gram: [('torch', 0.9589881896972656), ('angus', 0.9536026120185852), ('donalbaine', 0.9415716528892517), ('macduffe', 0.9349325299263), ('malcolm', 0.9325574636459351)]
GloVe: [('ireland', 0.658142626285553), ('wales', 0.6262768507003784), (

For ‘Denmark’ + ‘Queen’), CBOW returned king, nature, wife, which are somewhat related to Quenn Gertrude. King and wife relate to Gertrude, but nature and tonne are incorrect. Skip-gram found characters (rosincrance, guildensterne), which are related to Hamlet, but missed "Queen" or "Gertrude" while GloVe provided words related to country (Sweden, Norway, Britain), which makes sense in a modern corpus.

For ‘Scotland’ + ‘Army’ + ‘General’), CBOW provided irrelevant words (ham, good, hand), Skip-gram provided words like torch, messenger, and brake, which might relate to Macbeth but not directly to generals or the army while GloVe provided the word Ireland, Wales, England, and Scots capturing the geopolitical association but not Shakespearean context.

For ‘Father’ – ‘Man’ + ‘Woman’, CBOW correctly provided the word "mother" and related words (head, dead), which are somewhat associated with family. Skip-gram also returned "mother", but added some unconnected words (oh, alert, deer) while GloVe performed best by returning mother, daughter, wife, husband, showing a very good understanding of gender relationships.

For ‘Mother’- ‘Woman’ + ‘Man’), CBOW provided unassociated words (might, virtue, day), Skip-gram provided brother and majesty, which are somewhat related, but GloVe correctly returned father, brother, son, uncle, proving a strong grasp of gendered relationships.


## Conclusion

CBOW was able to capture words as in the Shakespearean context as it returned common words and general themes in Shakespearean plays. It also performed fair in word vector arithmetic (Denmark + Queen → King, Wife).
Skip-gram captured some relationships with characters in the plays well and was able to identify some unique Shakespearean names, but it struggled with general context of words.
GloVe performed best on word vector arithmetic as it correctly identifying gender roles such as (Father - Man + Woman → Mother) and captured modern meanings of words well as well as geopolitical associations but could not capture relationships between Shakespearean words.
We believe the use of all Shakespearean plays and other related literatures will perform better as it will expand vocabulary and improve context and words with old meaning. 
