# Greek Mythology Project: Code File

In this document you will find the following: <br>
- [Necessary imports](#Necessary-imports)
- [Data Acquisition](#Data-Acquisition)
- [Preprocessing](#Preprocessing)
- [Data Exploration](#Data-Exploration) 
- [Model](#Model) + [Word Embeddings](#Word-Embeddings)
- [Data Visualisation](#Data-Visualisation)

## Necessary imports

In [1]:
import numpy as np
import nltk
import json
import urllib
import requests
import matplotlib.pyplot as plt
import gensim.downloader as api
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from gensim.models import Word2Vec
from collections import Counter
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [2]:
with open('mortaldict.json', 'r') as f:
    mortaldict = json.load(f)
    
with open('goddict.json', 'r') as f:
    goddict = json.load(f)

## Data Acquisition

All our texts are extracted from the website [Theoi](https://www.theoi.com/), which hosts English translations of almost all of the original texts of Greek mythology. For this project we extract four of the more popular stories: [Homer's Odyssey](#Homer-Odysee), [The Iliad](#The-Iliad), [The Fall of Troy](#The-Fall-of-Troy), and [Ovid's Metamorphoses](#Ovid-Metamorphosis).

#### Homer Odysee

In [3]:
Odysee_URL = "https://www.theoi.com/Text/HomerOdyssey1.html"
Odysee_page = requests.get(Odysee_URL)
Odysee_soup = BeautifulSoup(Odysee_page.content, "html.parser")

# show a sample of the text
Odysee_soup.get_text()[4000:4500]

"ather of gods and men was first to speak, for in his heart he thought of noble Aegisthus, whom far-famed Orestes, Agamemnon's son, had slain. Thinking on him he spoke among the immortals, and said: “Look you now, how ready mortals are to blame the gods. It is from us, they say, that evils come, but they even of themselves, through their own blind folly, have sorrows beyond that which is ordained. Even as now Aegisthus, beyond that which was ordained, took to himself the wedded wife of the son of"

#### The Iliad

In [4]:
Iliad_URL = "https://www.theoi.com/Text/HomerIliad1.html"
Iliad_page = requests.get(Iliad_URL)
Iliad_soup = BeautifulSoup(Iliad_page.content, "html.parser")

# show a sample of the text
Iliad_soup.get_text()[4000:4500]

'not protect you. Her I will not set free. Sooner shall old age come upon her in our house, in Argos, far from her native land, as she walks to and fro before the loom and serves my bed. But go, do not anger me, that you may return the safer."\n[33] So he spoke, and the old man was seized with fear and obeyed his word. He went forth in silence along the shore of the loud-resounding sea, and earnestly then, when he had gone apart, the old man prayed\xa0to the lord Apollo, whom fair-haired Leto bore: "'

#### The Fall of Troy

In [5]:
Troy_URL = "https://www.theoi.com/Text/QuintusSmyrnaeus1.html"
Troy_page = requests.get(Troy_URL)
Troy_soup = BeautifulSoup(Troy_page.content, "html.parser")

# show a sample of the text
Troy_soup.get_text()[4000:4500]

" she mid that charging host. Clonie was there, Polemusa, Derinoe, Evandre, and Antandre, and Bremusa, Hippothoe, dark-eyed Harmothoe, Alcibie, Derimacheia, Antibrote, and Thermodosa glorying with the spear. All these to battle fared with warrior-souled Penthesileia: even as when descends Dawn from Olympus' crest of adamant, Dawn, heart-exultant in her radiant steeds amidst the bright-haired Hours; and o'er them all, how flawless-fair soever these may be, her splendour of beauty glows pre-eminent"

#### Ovid Metamorphosis

In [6]:
OvidM_URL = "https://www.theoi.com/Text/OvidMetamorphoses1.html"
OvidM_page = requests.get(OvidM_URL)
OvidM_soup = BeautifulSoup(OvidM_page.content, "html.parser")

# show a sample of the text
OvidM_soup.get_text()[4000:4500]

'aste. It was a rude and undeveloped mass, that nothing made except a ponderous weight; and all discordant elements confused, were there congested in a shapeless heap. As yet the sun afforded earth no light, nor did the moon renew her crescent horns; the earth was not suspended in the air exactly balanced by her heavy weight. Not far along the margin of the shores had Amphitrite stretched her lengthened arms,—for all the land was mixed with sea and air. The land was soft, the sea unfit to sail, t'

Now that we have the first page of each of the 4 stories, we need a way to extract the other chapters, which are linked from these pages. As the output of the code below shows, the links to the other chapters (called BOOKS) always have the following format: <br> 
< a href="HomerOdyssey1.html">BOOK 1</a >, <br>
< a href="HomerOdyssey2.html">BOOK 2</a >,


In [7]:
Odysee_soup.find_all('a')

[<a class="navbar-brand" href="https://www.theoi.com/Library.html">Theoi Project - Classical Texts Library</a>,
 <a href="../Library.html">LIBRARY HOME</a>,
 <a href="https://www.theoi.com/">GREEK MYTHOLOGY</a>,
 <a aria-expanded="false" aria-haspopup="true" class="dropdown-toggle" data-toggle="dropdown" href="#" role="button">GREEK GODS<span class="caret"></span></a>,
 <a href="../greek-mythology/olympian-gods.html">Olympian Gods</a>,
 <a href="../greek-mythology/primeval-gods.html">Primordial Gods</a>,
 <a href="../greek-mythology/titans.html">Titan Gods</a>,
 <a href="../greek-mythology/sky-gods.html">Sky Gods</a>,
 <a href="../greek-mythology/sea-gods.html">Sea Gods</a>,
 <a href="../greek-mythology/rustic-gods.html">Rustic Gods</a>,
 <a href="../greek-mythology/underworld-gods.html">Underworld Gods</a>,
 <a href="../greek-mythology/personifications.html">Daemones-Spirits</a>,
 <a href="../greek-mythology/nymphs.html">Nymphs</a>,
 <a href="../greek-mythology/greek-gods.html">more &

Now that we know what we are looking for, we will create a list which has all the links to the different chapters which we can then loop through later.  

In [8]:
Websites = []

#Odysee
for link in Odysee_soup.find_all('a'):
    href = link.get('href')
    # skip <a> tags without an href, which would otherwise raise a TypeError
    if href is not None and "HomerOdyssey" in href:
        Websites.append("https://www.theoi.com/Text/" + href)
        
#Iliad
for link in Iliad_soup.find_all('a'):
    href = link.get('href')
    if href is not None and "HomerIliad" in href:
        Websites.append("https://www.theoi.com/Text/" + href)
        
#Troy
for link in Troy_soup.find_all('a'):
    href = link.get('href')
    if href is not None and "QuintusSmyrnaeus" in href:
        Websites.append("https://www.theoi.com/Text/" + href)
        
#Ovid-Metamorphosis
for link in OvidM_soup.find_all('a'):
    href = link.get('href')
    # also skip in-page links (containing '#') to avoid duplicate chapters
    if href is not None and "OvidMetamorphoses" in href and '#' not in href:
        Websites.append("https://www.theoi.com/Text/" + href)

#print all the chapter links
for web in Websites:
    print(web)

https://www.theoi.com/Text/HomerOdyssey1.html
https://www.theoi.com/Text/HomerOdyssey2.html
https://www.theoi.com/Text/HomerOdyssey3.html
https://www.theoi.com/Text/HomerOdyssey4.html
https://www.theoi.com/Text/HomerOdyssey5.html
https://www.theoi.com/Text/HomerOdyssey6.html
https://www.theoi.com/Text/HomerOdyssey7.html
https://www.theoi.com/Text/HomerOdyssey8.html
https://www.theoi.com/Text/HomerOdyssey9.html
https://www.theoi.com/Text/HomerOdyssey10.html
https://www.theoi.com/Text/HomerOdyssey11.html
https://www.theoi.com/Text/HomerOdyssey12.html
https://www.theoi.com/Text/HomerOdyssey13.html
https://www.theoi.com/Text/HomerOdyssey14.html
https://www.theoi.com/Text/HomerOdyssey15.html
https://www.theoi.com/Text/HomerOdyssey16.html
https://www.theoi.com/Text/HomerOdyssey17.html
https://www.theoi.com/Text/HomerOdyssey18.html
https://www.theoi.com/Text/HomerOdyssey19.html
https://www.theoi.com/Text/HomerOdyssey20.html
https://www.theoi.com/Text/HomerOdyssey21.html
https://www.theoi.com/
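The same link filtering can be sketched without a network request. The snippet below is a minimal illustration and not the method used in this notebook: it parses a hand-written HTML fragment (an assumption standing in for a Theoi page) with the standard-library `html.parser`, keeps the `href` values that contain a given prefix, and resolves them against the page URL with `urllib.parse.urljoin` instead of string concatenation. The names `LinkCollector` and `chapter_urls` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def chapter_urls(html, page_url, prefix):
    """Return absolute URLs of links whose href contains prefix, skipping in-page anchors."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(page_url, h) for h in collector.hrefs
            if prefix in h and "#" not in h]


# Hand-written fragment for illustration only
sample_html = (
    '<a name="top">no href here</a>'
    '<a href="../Library.html">LIBRARY HOME</a>'
    '<a href="HomerOdyssey1.html">BOOK 1</a>'
    '<a href="HomerOdyssey2.html">BOOK 2</a>'
    '<a href="HomerOdyssey2.html#5">section link</a>'
)
urls = chapter_urls(sample_html,
                    "https://www.theoi.com/Text/HomerOdyssey1.html",
                    "HomerOdyssey")
print(urls)
```

Resolving relative links against the page URL keeps the code correct even if a page links to a chapter in another directory.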

Now that we have all the html links to the chapters of the books we want to use, we will scrape the text off them and combine it all into one list.

In [15]:
full_text = []
for link in Websites:
    p = requests.get(link)
    s = BeautifulSoup(p.content, "html.parser")
    full_text.append(s.get_text())
    print(s.title) #check which books have been scraped


<title>HOMER, ODYSSEY BOOK 1 - Theoi Classical Texts Library</title>
<title>HOMER, ODYSSEY BOOK 2 - Theoi Classical Texts Library</title>
<title>HOMER, ODYSSEY BOOK 3 - Theoi Classical Texts Library</title>
<title>HOMER, ODYSSEY BOOK 4 - Theoi Classical Texts Library</title>
<title>HOMER, ODYSSEY BOOK 5 - Theoi Classical Texts Library</title>
<title>HOMER, ODYSSEY BOOK 6 - Theoi Classical Texts Library</title>
<title>HOMER, ODYSSEY BOOK 7 - Theoi Classical Texts Library</title>
<title>HOMER, ODYSSEY BOOK 8 - Theoi Classical Texts Library</title>
<title>HOMER, ODYSSEY BOOK 9 - Theoi Classical Texts Library</title>
<title>HOMER, ODYSSEY BOOK 10 - Theoi Classical Texts Library</title>
<title>HOMER, ODYSSEY BOOK 11 - Theoi Classical Texts Library</title>
<title>HOMER, ODYSSEY BOOK 12 - Theoi Classical Texts Library</title>
<title>HOMER, ODYSSEY BOOK 13 - Theoi Classical Texts Library</title>
<title>HOMER, ODYSSEY BOOK 14 - Theoi Classical Texts Library</title>
<title>HOMER, ODYSSEY BOOK 15

Finally, we save the list containing the full text of all 4 stories to a txt file, so that the other group members can use it.

In [16]:
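#save full_text to txt file to import to preprocessing 

# no encoding is specified here, so Python uses the platform default;
# the with-statement closes the file even if writing fails
with open("text_file.txt", "w") as textfile:
    for element in full_text:
        textfile.write(element + "\n")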

## Preprocessing

First, let's check which encoding Python uses to open the text file, so that we can import the texts properly.

In [18]:
with open('text_file.txt') as f:
    print(f)

<_io.TextIOWrapper name='text_file.txt' mode='r' encoding='cp1252'>


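As can be seen above, Python opens this file with the platform default encoding cp1252 rather than utf-8 (the file object reports the encoding used to open the file, not one detected from its contents). The file was written with that same default, so we pass cp1252 explicitly when reading it in: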

In [20]:
with open("text_file.txt","r",encoding= 'cp1252') as f:
    data = f.readlines()
    
print(data[700])

[313] “So do not thou, my friend, wander long far from home, leaving thy wealth behind thee and men in thy house so insolent, lest they divide and devour all thy wealth, and thou shalt have gone on a fruitless journey. But to Menelaus I bid and command thee to go, for he has but lately come from a strange land, from a folk whence no one would hope in his heart to return, whom the storms had once driven astray into a sea so great, whence the very birds do not fare in the space of a year, so great is it and terrible. But now go thy way with thy ship and thy comrades, or, if thou wilt go by land, here are chariot and horses at hand for thee, and here at thy service are my sons, who will be thy guides to goodly Lacedaemon, where lives fair-haired Menelaus. And do thou beseech him thyself that he may tell thee the very truth. A lie will be not utter, for he is wise indeed.”
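To illustrate why the encoding used for reading has to match the one used for writing, here is a small self-contained sketch. It writes a sentence with curly quotes to a temporary file (the file name `demo.txt` and the choice of utf-8 are assumptions for the example, not what this notebook does) and reads it back once with the matching encoding and once with a different one.

```python
import os
import tempfile

text = "“So do not thou, my friend, wander long far from home.”"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # Write and read with the same explicit encoding
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        reported = f.encoding      # the encoding we asked for, not a detected one
        round_trip = f.read()

    # Reading the same bytes with a different encoding garbles the curly quotes
    with open(path, "r", encoding="cp1252", errors="replace") as f:
        mismatched = f.read()

print(reported)
print(round_trip == text)
print(mismatched == text)
```

Passing the same `encoding` argument to both `open` calls makes the file portable across machines, whatever their default encoding is.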



We split the data into sentences.

In [None]:
sentences = []
# iterate over the lines read from the text file
for par in data:
    # skip empty lines
    if par == '\n':
        continue
    senttemp = sent_tokenize(par)
    sentences.extend(senttemp)

print(sentences[100:105])

Now, we tokenize the sentences. If a sentence consists of two tokens or fewer, we remove it, since it won't contain much useful information. This also filters out many of the chapter titles.

In [None]:
words = []
for sent in sentences:
    wordstemp = word_tokenize(sent)
    if len(wordstemp) <= 2:
        continue
    words.append(wordstemp)
    
print(words[1000])
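The effect of the length filter can be tried out on a few hand-written sentences. The sketch below uses a simple regular-expression tokenizer as a rough stand-in for `word_tokenize` (an assumption made so that the example runs without NLTK data); `simple_tokenize` and `keep_long_sentences` are hypothetical helper names.

```python
import re


def simple_tokenize(sentence):
    """Rough stand-in for word_tokenize: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def keep_long_sentences(sentences, min_tokens=3):
    """Tokenize each sentence and drop those with fewer than min_tokens tokens."""
    tokenized = [simple_tokenize(s) for s in sentences]
    return [tokens for tokens in tokenized if len(tokens) >= min_tokens]


demo = ["BOOK 1", "Tell me, O Muse, of the man of many devices.", "Yes."]
kept = keep_long_sentences(demo)
print(kept)
```

Note that punctuation marks count as tokens here, so a one-word sentence with a full stop has two tokens and is still removed.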

Now, we PoS-tag the data, which makes lemmatization easier and helps to identify the names of characters.

In [None]:
tagged = []
for sent in words:
    senttagged = nltk.pos_tag(sent)
    tagged.append(senttagged)

tagged_filtered = []
for sent in tagged:
    sentls = []
    for wordtag in sent:
        if wordtag[1] in ['.', ',', ':', '--', '$', '(', ')'] or wordtag[0] in ['“', '”', '<', '>']:
            continue
        else: sentls.append(wordtag)
    tagged_filtered.append(sentls)
        
print(tagged_filtered[3000:3010])

Next, we lemmatize the dataset, passing each word's PoS tag to the WordNet lemmatizer whenever a matching WordNet tag exists.

In [None]:
# Reference for lemmatizing: https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python

lemmatizer = WordNetLemmatizer()

lemmatized = []
for sent in tagged_filtered:
    
    sentls = []
    for wordtag in sent:
        tag = ''
        if wordtag[1][0:2] ==  'NN':
            tag =  'n'
        elif wordtag[1][0:2] ==  'VB':
            tag =  'v'
        elif wordtag[1][0:2] ==  'JJ':
            tag =  'a'
        elif wordtag[1][0:2] ==  'RB':
            tag =  'r'
        
        # use postag if it exists in the wordnetlemmatizer
        if tag == '': lemma = lemmatizer.lemmatize(wordtag[0])
        else: lemma = lemmatizer.lemmatize(wordtag[0], pos=tag)
        
        sentls.append([lemma, wordtag[1]])
    lemmatized.append(sentls)
    
print(lemmatized[3000:3010])
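The tag conversion in the cell above can also be written as a small lookup function. This is a sketch of the same mapping from Penn Treebank tag prefixes to the single-letter WordNet PoS codes; the function name `penn_to_wordnet` is hypothetical.

```python
def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet PoS letter, or None if there is no match."""
    prefix_to_pos = {"NN": "n", "VB": "v", "JJ": "a", "RB": "r"}
    return prefix_to_pos.get(penn_tag[:2])


for tag in ["NNS", "VBD", "JJR", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

A return value of `None` corresponds to the case where the lemmatizer is called without a `pos` argument.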

Next, we go through the data and collect all proper nouns that are not in the dictionaries. Whenever such a noun is the name of an Olympian god or a mortal, we add it to the dictionaries in the file characterdicts.ipynb. We do this only for names that occur at least 10 times in the texts, since adding every name is not realistic. The code below prints the proper nouns (or words that were classified as such) that occur at least 10 times and that we did not add to the dictionaries. All of these are either not names, names of people who are not part of mythology, or children of gods (whom we left out, since they are somewhere between god and mortal).

In [None]:
# Reference for sorting dictionary: https://www.edureka.co/blog/sort-dictionary-by-value-in-python/

non_occuring = {}
non_occuring_min10 = {}

mortals = mortaldict
gods = goddict

for sent in lemmatized:
    for wordtag in sent:
    
        # Only look at proper nouns
        if wordtag[1] == 'NNP':

            # Binary to indicate if the word has been found
            filled_in = 0

            # Go through mortaldict
            for idx, vals in enumerate(mortals.values()):
                if wordtag[0] in vals[0]:
                    filled_in = 1
                    break

            # If word hasn't been found in mortaldict, go through goddict
            if filled_in == 0:
                for idx, vals in enumerate(gods.values()):
                    if wordtag[0] in vals[0]:
                        filled_in = 1
                        break

            # Count the word if it hasn't been found in either dictionary
            if filled_in == 0:
                if wordtag[0] in non_occuring:
                    non_occuring[wordtag[0]] += 1
                else:
                    non_occuring[wordtag[0]] = 1

# Keep only the words that occur at least 10 times
for key in non_occuring:
    if non_occuring[key] >= 10:
        non_occuring_min10[key] = non_occuring[key]

# Print these words in alphabetical order, together with their counts
for key in sorted(non_occuring_min10):
    print("%s: %s" % (key, non_occuring_min10[key]))
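Searching every dictionary entry for every proper noun works, but a reverse index from name to entry makes each lookup a single dictionary access. The sketch below assumes that both dictionaries map a key to a pair of name variants and a mention count, as the code above suggests; the miniature dictionaries and the helper `build_name_index` are made up for illustration.

```python
# Hypothetical miniature versions of mortaldict and goddict
# (assumed layout: key -> [list of name variants, mention count])
mortals_demo = {"0": [["Odysseus", "Ulysses"], 0], "1": [["Achilles"], 0]}
gods_demo = {"0": [["Zeus", "Jove"], 0], "1": [["Athena", "Athene", "Pallas"], 0]}


def build_name_index(mortals, gods):
    """Map every name variant to ('mortal' or 'god', dictionary key)."""
    index = {}
    for kind, table in (("mortal", mortals), ("god", gods)):
        for key, (names, _count) in table.items():
            for name in names:
                # keep the first match, mirroring the mortal-before-god search order
                index.setdefault(name, (kind, key))
    return index


name_index = build_name_index(mortals_demo, gods_demo)
print(name_index.get("Ulysses"))
print(name_index.get("Pallas"))
print(name_index.get("Troy"))
```

A name that is missing from the index, such as `"Troy"` here, would be counted as a proper noun that is not in either dictionary.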

## Data Exploration

First, we go through the texts and count how many times each name is mentioned. Every proper noun that matches a dictionary entry is also retagged as NNPm (mortal) or NNPg (god).

In [None]:
i = 0  # total number of name mentions found
for idx, sent in enumerate(lemmatized):
    for idx2, wordtag in enumerate(sent):
        word = wordtag[0]
        tag = wordtag[1]

        if tag == 'NNP':

            # Binary to indicate if the word has been found
            filled_in = 0

            # Go through mortaldict
            for vals in mortaldict.values():
                if word in vals[0]:
                    filled_in = 1
                    # update the count in place, without relying on the key format
                    vals[1] += 1
                    lemmatized[idx][idx2][1] = 'NNPm'
                    i += 1
                    break

            # If word hasn't been found in mortaldict, go through goddict
            if filled_in == 0:
                for vals in goddict.values():
                    if word in vals[0]:
                        filled_in = 1
                        vals[1] += 1
                        lemmatized[idx][idx2][1] = 'NNPg'
                        i += 1
                        break

Below we print the number of times mortals and gods are mentioned. Note that the true numbers are somewhat higher, because our dictionaries do not cover all mortals and all gods. While mortals are named more often, both groups are named often enough for a meaningful analysis.

In [None]:
summortals = 0
sumgods = 0

for mval in mortaldict.values():
    summortals += mval[1]
    
for gval in goddict.values():
    sumgods += gval[1]
    
print('Total amount of mortal names: ', summortals)
print('Total amount of god names: ', sumgods)

Next, we plot the 20 most common names for both mortals and gods. 

In [None]:
# Reference for sorting dict: https://www.geeksforgeeks.org/python-sort-list-according-second-element-sublist/

sortm = list(mortaldict.values())
sortm.sort(key = lambda x: x[1])
sortm.reverse()

sortg = list(goddict.values())
sortg.sort(key = lambda x: x[1])
sortg.reverse()

In [None]:
mnames = [name[0][0] for name in sortm]
mcounts = [count[1] for count in sortm]

gnames = [name[0][0] for name in sortg]
gcounts = [count[1] for count in sortg]

As the graphs below show, the most common names make sense considering the original texts. The main characters of, for example, the Iliad and the Odyssey occur most often, and the most important gods are named the most.

In [None]:
plt.figure(figsize=(16, 10))
plt.bar(mnames[0:20], mcounts[0:20])
plt.title('20 most common mortals and their frequencies')

In [None]:
plt.figure(figsize=(16, 10))
plt.bar(gnames[0:20], gcounts[0:20])
plt.title('20 most common gods and their frequencies')

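Next, we take the context of each name: up to five words before and five words after it within the same sentence.

In [None]:
name_cons = []
for idx, sent in enumerate(lemmatized):
    for idx2, wordtag in enumerate(sent):    
        word = wordtag[0]
        tag = wordtag[1]

        if tag == 'NNPm' or tag == 'NNPg':
            # up to 5 words on either side of the name, excluding the name itself
            # (slicing past the end of a list is safe, so no upper bound check is needed)
            context = sent[max(0, idx2-5):idx2] + sent[idx2+1:idx2+6]
            name_cons.append([wordtag, context])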

print(name_cons[100])
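The context extraction can be isolated in a small helper to make the window boundaries explicit. The sketch below assumes a symmetric window of `size` words on either side of the name; the function name `context_window` and the example sentence are hypothetical.

```python
def context_window(sent, position, size=5):
    """Return up to `size` items before and after sent[position], excluding the item itself."""
    before = sent[max(0, position - size):position]
    after = sent[position + 1:position + 1 + size]
    return before + after


demo_sent = [["wise", "JJ"], ["Odysseus", "NNPm"], ["speak", "VBD"], ["boldly", "RB"]]
print(context_window(demo_sent, 1, size=1))
print(context_window(demo_sent, 0, size=5))
```

The `max(0, ...)` guard matters only at the start of a sentence: a negative start index would otherwise count from the end of the list.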

We check whether there are enough adjectives and adverbs in the context of the names to be able to perform the analysis. 

In [None]:
advcountm = 0
adjcountm = 0
advcountg = 0
adjcountg = 0
adjcounter = Counter()

for namecon in name_cons:
    name = namecon[0]
    con = namecon[1]
    
    if name[1] == 'NNPm':
        for wordtag in con:
            if wordtag[1][0:2] == 'RB':
                advcountm += 1
                adjcounter[wordtag[0]] += 1
                
            elif wordtag[1][0:2] == 'JJ':
                adjcountm += 1
                adjcounter[wordtag[0]] += 1
                
    elif name[1] == 'NNPg':
        for wordtag in con:
            if wordtag[1][0:2] == 'RB':
                advcountg += 1
                adjcounter[wordtag[0]] += 1
                
            elif wordtag[1][0:2] == 'JJ':
                adjcountg += 1
                adjcounter[wordtag[0]] += 1
                
print('There are ', advcountm, ' adverbs, and ', adjcountm, ' adjectives around names of mortals.')
print('There are ', advcountg, ' adverbs, and ', adjcountg, ' adjectives around names of gods.')

We consider this enough adjectives and adverbs to perform the analysis. We also look at the most common adjectives and adverbs to see whether some of them are not useful for this analysis.

In [None]:
print(adjcounter.most_common(30))

## Model

First, we implement the Word2Vec model to see whether the words around mortals differ from those around gods. lemmatized_replace is a list of the PoS-tagged data in which each name of a mortal or god is replaced by MORTALNAME or GODNAME. This way, we can clearly see the difference between gods and mortals, without having to go through every individual name and check which words are most similar to it. 

In [None]:
lemmatized_replace = []

for sent in lemmatized:
    sentls = []
    for wordtag in sent:
        if wordtag[1] == 'NNPm': 
            sentls.append(['MORTALNAME', 'NNPm'])
        elif wordtag[1] == 'NNPg':
            sentls.append(['GODNAME', 'NNPg'])
        else: sentls.append(wordtag)
    lemmatized_replace.append(sentls)

print(lemmatized_replace[100])

Below, we make two lists that can be used in the model. One contains only names, adverbs, and adjectives. The other contains all words. We were debating which one to use, which is why we have kept both for now. Right now, we use all words as input for the model. If we filter out everything except adverbs and adjectives, the windows around the target words overlap heavily, resulting in almost identical similarity scores for mortals and gods. We will most likely select the adjectives and adverbs only after retrieving the words with the highest similarity scores, so that the filtering has less of an effect on the model. <br>
We do, however, already remove a few words that would otherwise feature prominently in the visualisation of the data in later stages. We only do this when we are certain that the word adds no value to the results of the model. 

In [None]:
advadjfilt = []
nofilt = []

#remove words that add no value
remove = ["Then", "then", "thou", "own"] 

for sent in lemmatized_replace:
    sentls = []
    sentlsfilt = []
    for word in sent:
        if word[0] in remove:
            continue
        sentls.append(word[0]) 
        
        # the filtered list keeps only adverbs, adjectives and the name placeholders
        if word[1][0:2] in ['RB', 'JJ'] or word[1] in ['NNPg', 'NNPm']:
            sentlsfilt.append(word[0])
                
    advadjfilt.append(sentlsfilt)
    nofilt.append(sentls)

The parameters below were mostly taken from the notebook from week 5. We lowered min_count, since we also want rare adjectives and adverbs to be taken into account. We also increased the number of epochs to prevent underfitting, but kept it moderate, because overfitting should be prevented as well. 

In [None]:
params = {
    'vector_size': 100, # dimension of embeddings
    'window': 4, # maximum distance between the focus word and a context word, on each side
    'epochs': 10, # number of iterations over the corpus
    'min_count': 2, # ignore words whose frequency is below this count
    'sg': 0, # use the skip-gram (1) or the CBOW (0) mode. In class, we presented the CBOW (predict focus word given context). See optional materials for the skip-gram (predict context given focus word)
    'negative': 5, # how many negative samples to use (see optional class contents too)
    'workers': 4, # how many worker threads to use
    'alpha': 0.05 # initial learning rate for SGD. This is lambda in the class notes
}
model = Word2Vec(nofilt, **params)

As can be seen below, the words that are most similar to gods are different from those that are most similar to mortals, but they contain many names and words that are not adjectives or adverbs. We now make lists with specifically the most similar adverbs and adjectives to gods and mortals.

In [None]:
for word in list(model.wv.most_similar('GODNAME', topn=10)):
    print(word)

In [None]:
for word in list(model.wv.most_similar('MORTALNAME', topn=10)):
    print(word)
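The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that ranking, the code below computes cosine similarity by hand on made-up three-dimensional vectors; the vectors, the dictionary `toy_vectors` and the helper `rank_by_similarity` are assumptions for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, not taken from the trained model
toy_vectors = {
    "GODNAME": [1.0, 0.2, 0.0],
    "immortal": [0.9, 0.3, 0.1],
    "weary": [0.0, 0.1, 1.0],
}


def rank_by_similarity(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = rank_by_similarity("GODNAME", toy_vectors)
for entry in ranking:
    print(entry)
```

Because cosine similarity depends only on the direction of the vectors, a frequent and a rare word can still score as very similar.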

In [None]:
# PoS tag of the first occurrence of each word in the corpus
first_tag = {}
for sent in lemmatized_replace:
    for word in sent:
        first_tag.setdefault(word[0], word[1])

similarg = []
similarm = []

# keep only the adverbs and adjectives among the most similar words
for word, score in model.wv.most_similar('MORTALNAME', topn=3000):
    if first_tag[word][0:2] in ['RB', 'JJ']:
        similarm.append([word, score])

for word, score in model.wv.most_similar('GODNAME', topn=3000):
    if first_tag[word][0:2] in ['RB', 'JJ']:
        similarg.append([word, score])

Below are the lists of the most similar adverbs and adjectives. These lists seem to make sense: adjectives that apply only to gods (like Olympian) are most similar to gods, and the other way around (godlike is much more similar to mortals, which makes sense because a god won't be described as being 'godlike').

In [None]:
for word in similarg[:15]:
    print(word)
print('\n')
for word in similarm[:15]:
    print(word)

The above results can be visualised in the following way to make interpretation easier. 

## Word Embeddings

In [None]:
coocu = Counter()

# uninformative adverbs, adjectives and tokens to leave out of the counts
remove = {"Then", "then", "[", "]", "thy", "thou", "many", "even", "not", "now", "So", "so", 
          "ever", "other", "unto", "yet", "such", "here", "again", "there", "more", 
          "too", "thus", "as", "ye", "also", "once", "thee", "far", "back", "Now", 
          "indeed", "most", "no", "away", "near", "Most", "already", "only"}

for sent in name_cons:
    target = sent[0]
    con = sent[1]

    for conword in con:
        if conword[0] in remove:
            continue
        # count each (name, adverb/adjective) pair
        if conword[1][0:2] in ['RB', 'JJ']:
            coocu[(target[0], conword[0])] += 1

most_common_list = coocu.most_common(500)

In [None]:
from tabulate import tabulate

# define header names
col_names = ["Character description", "Occurrences"]

print(tabulate(most_common_list, headers=col_names))
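The co-occurrence counting can be checked on a tiny hand-made example. The sketch below uses three (name, context) pairs in the same layout as `name_cons`; the data and the two-word `stopwords` set are made up for illustration.

```python
from collections import Counter

# Hand-made (name, context) pairs in the same layout as name_cons
demo_name_cons = [
    [["Zeus", "NNPg"], [["mighty", "JJ"], ["speak", "VBD"], ["then", "RB"]]],
    [["Zeus", "NNPg"], [["mighty", "JJ"], ["father", "NN"]]],
    [["Odysseus", "NNPm"], [["wise", "JJ"], ["far", "RB"]]],
]
stopwords = {"then", "far"}  # hypothetical subset of the removal list

pair_counts = Counter()
for (name, _tag), context in demo_name_cons:
    for word, tag in context:
        # count only adverbs and adjectives that are not in the removal list
        if word not in stopwords and tag[:2] in ("RB", "JJ"):
            pair_counts[(name, word)] += 1

print(pair_counts.most_common(2))
```

Keying the counter on the pair rather than on the word alone is what keeps the descriptions of different characters apart.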

## Data Visualisation

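We will use PCA to reduce the word embeddings to two dimensions and make scatter plots of the adverbs and adjectives most similar to mortals, to gods, and to both together.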

In [None]:
from sklearn.decomposition import PCA

In [None]:
#MORTAL
def display_pca_scatterplot_m(model, words=None, sample=0):
    if words is None:
        # gensim 4 stores the vocabulary in key_to_index (the old .vocab attribute was removed)
        if sample > 0:
            words = np.random.choice(list(model.key_to_index), sample)
        else:
            words = list(model.key_to_index)
        
    word_vectors = np.array([model[w] for w in words])

    # do PCA on the selected embeddings
    twodim = PCA().fit_transform(word_vectors)[:,:2]
    
    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='b')
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)
        
display_pca_scatterplot_m(model.wv, 
                        ['MORTALNAME']+[x[0] for x in similarm[:10]])
plt.title("Most Similar Adverbs and Adjectives in Mortals")

In [None]:
#GOD
def display_pca_scatterplot_g(model, words=None, sample=0):
    if words is None:
        # gensim 4 stores the vocabulary in key_to_index (the old .vocab attribute was removed)
        if sample > 0:
            words = np.random.choice(list(model.key_to_index), sample)
        else:
            words = list(model.key_to_index)
        
    word_vectors = np.array([model[w] for w in words])

    # do PCA on the selected embeddings
    twodim = PCA().fit_transform(word_vectors)[:,:2]
    
    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='y')
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)
        
display_pca_scatterplot_g(model.wv, 
                        ['GODNAME']+[x[0] for x in similarg[:10]])

In [None]:
#words taken from most_similar above
god_mortal = ['GODNAME', 'MORTALNAME']
mortal_words = [x[0] for x in similarm[:10]]
god_words = [x[0] for x in similarg[:10]]

def display_pca_scatterplot(model, words=None, colors='b', sample=0):
    if words is None:
        # gensim 4 stores the vocabulary in key_to_index (the old .vocab attribute was removed)
        if sample > 0:
            words = np.random.choice(list(model.key_to_index), sample)
        else:
            words = list(model.key_to_index)
        
    word_vectors = np.array([model[w] for w in words])

    # do PCA on the selected embeddings
    twodim = PCA().fit_transform(word_vectors)[:,:2]
    
    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c=colors)
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)

# one colour per word: yellow for god-related words, blue for mortal-related words
colors = ['y', 'b'] + ['b'] * len(mortal_words) + ['y'] * len(god_words)
display_pca_scatterplot(model.wv, god_mortal + mortal_words + god_words, colors)
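To make clear what the PCA step in these plots does, the sketch below performs the same kind of projection with NumPy alone: it centres the vectors and projects them onto the two directions of largest variance, obtained from a singular value decomposition. The random vectors stand in for word embeddings and the function name `pca_2d` is hypothetical. The result should agree with the first two columns of scikit-learn's `PCA().fit_transform` up to the sign of each axis.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    centered = X - X.mean(axis=0)  # PCA works on mean-centred data
    # rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(12, 100))  # random stand-ins for 12 word vectors
coords = pca_2d(toy_embeddings)
print(coords.shape)
```

Since only two of the 100 dimensions are kept, distances in the scatter plots are an approximation of the distances between the original embeddings.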