Documentation for the Master's thesis "Fanfiction Semantics - A Quantitative Analysis of Sensitive Topics in German Fanfiction" by Julian Jacopo Häußler, date of submission: September 19, 2022.

# 7.1 Creation of Word Embedding Models

## Overview:
### - load libraries and read in data
### - preprocessing (merging, cleaning, sentence tokenizing, lemmatizing, removing punctuation, lowercasing)
### - save corpus
### - create and save word (token) list
### - train and save models
### - evaluate models
### - summarize info
### - test models

# LOAD LIBRARIES AND READ IN DATA

In [1]:
# load libraries

# read in data

import glob
import os

In [2]:
# preprocessing

import nltk
import pickle
import string

In [3]:
# lemmatizing

import spacy
!python -m spacy download de_core_news_lg
nlp = spacy.load('de_core_news_lg',exclude=["ner"],disable=["tagger","parser"])

Collecting de-core-news-lg==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_lg-3.2.0/de_core_news_lg-3.2.0-py3-none-any.whl (572.3 MB)
[+] Download and installation successful
You can now load the package via spacy.load('de_core_news_lg')




In [4]:
# training models

from gensim.models import Word2Vec



In [5]:
# evaluation

from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import matplotlib.pyplot as plt
import pandas as pd

In [12]:
# Potter

fileList = glob.glob(os.path.join(os.getcwd(), r"D:\JH\Masterarbeit\2021\Harry Potter\Harry Potter - FFs\Texte_txt", "*.txt"))
 
Potter2021_raw = []

for file_path in fileList:
    with open(file_path, encoding="utf8") as file:
        Potter2021_raw.append(file.read())

In [13]:
# Biss

fileList = glob.glob(os.path.join(os.getcwd(), r"D:\JH\Masterarbeit\2021\Bis(s)\Texte_txt", "*.txt"))
 
Biss2021_raw = []

for file_path in fileList:
    with open(file_path, encoding="utf8") as file:
        Biss2021_raw.append(file.read())

In [14]:
# WarriorCats

fileList = glob.glob(os.path.join(os.getcwd(), r"D:\JH\Masterarbeit\2021\Warrior Cats\Texte_txt", "*.txt"))
 
WarriorCats2021_raw = []

for file_path in fileList:
    with open(file_path, encoding="utf8") as file:
        WarriorCats2021_raw.append(file.read())

In [15]:
# DFFF

fileList = glob.glob(os.path.join(os.getcwd(), r"D:\JH\Masterarbeit\2021\Die drei FFF\Texte_txt", "*.txt"))
 
DFFF2021_raw = []

for file_path in fileList:
    with open(file_path, encoding="utf8") as file:
        DFFF2021_raw.append(file.read())

In [16]:
# Mittelerde

fileList = glob.glob(os.path.join(os.getcwd(), r"D:\JH\Masterarbeit\2021\J.R.R. Tolkien\Mittelerde\Texte_txt", "*.txt"))
 
Mittelerde2021_raw = []

for file_path in fileList:
    with open(file_path, encoding="utf8") as file:
        Mittelerde2021_raw.append(file.read())

In [17]:
# Jackson

fileList = glob.glob(os.path.join(os.getcwd(), r"D:\JH\Masterarbeit\2021\Rick Riordan\Percy Jackson\Texte_txt", "*.txt"))
 
Jackson2021_raw = []

for file_path in fileList:
    with open(file_path, encoding="utf8") as file:
        Jackson2021_raw.append(file.read())

In [18]:
# Panem

fileList = glob.glob(os.path.join(os.getcwd(), r"D:\JH\Masterarbeit\2021\Die Tribute von Panem\FFs\Texte_txt", "*.txt"))
 
Panem2021_raw = []

for file_path in fileList:
    with open(file_path, encoding="utf8") as file:
        Panem2021_raw.append(file.read())

In [19]:
# Potter Originals

fileList = glob.glob(os.path.join(os.getcwd(), r"D:\JH\Masterarbeit\originals", "*.txt"))
 
PotterOriginals_raw = []

for file_path in fileList:
    with open(file_path, encoding="utf8") as file:
        PotterOriginals_raw.append(file.read())
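
The loading cells above differ only in the folder and in the target variable. As a minimal sketch, assuming every corpus is a flat folder of UTF-8 `.txt` files, the repeated pattern could be wrapped in one helper. The name `read_corpus` is hypothetical and not part of the original notebook; the `sorted()` call is an addition that makes the file order reproducible.

```python
import glob
import os


def read_corpus(folder):
    """Return the contents of every .txt file in `folder` as a list of strings."""
    texts = []
    # sorted() is an addition: glob itself gives no guarantee about file order
    for file_path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        with open(file_path, encoding="utf8") as file:
            texts.append(file.read())
    return texts
```

With this helper, a loading cell would reduce to a single call such as `Potter2021_raw = read_corpus(r"D:\JH\Masterarbeit\2021\Harry Potter\Harry Potter - FFs\Texte_txt")`.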

# PREPROCESSING (MERGING, CLEANING, SENTENCE TOKENIZING, LEMMATIZING, REMOVING PUNCTUATION, LOWERCASING)

The following code blocks are taken from Brottrager et al.'s "Character Shifts in Harry Potter Fanfictions". The relevant Jupyter notebook is available at https://github.com/jbrottrager/character-shifts-HPFFS/blob/main/scripts/05_Word2Vec.ipynb (last viewed: 2022/09/18).

In [20]:
Potter2021_raw[:1]

['\n                                \n                                NeujahrEs war sehr früh am Morgen des ersten Januar 1978, als James Potter sich aus dem Jungenschlafsaal schlich und vorsichtig die Treppen in den Gemeinschaftsraum hinunter stieg. In wenigen Stunden würde es Frühstück geben, doch er war schon jetzt wach. Er hatte nicht mehr schlafen können, wollte seine besten Freunde jedoch nicht so früh wecken. Außerdem musste er nachdenken. Gestern bei der Silvesterparty hätte er beinahe mit Lily getanzt. Mit seiner so wunderschönen Schulsprecherkollegin. Sie hatte ihn sogar gefragt, doch er hatte abgelehnt. Vielleicht war das dumm gewesen, doch seit Weihnachten wo er zufällig ein kurzes Gespräch zwischen Lily und Alice mitangehört hatte, war er ziemlich verwirrt.Nach dem Festmahl hatte er die beiden Mädchen in der Eingangshalle im Vorbeigehen reden gehört. Alice hatte ihre Freundin gerade gefragt: „Warum hast du nicht mit James getanzt?“ Natürlich war er sofort langsamer geworde

In [21]:
# PREPROCESSING

# merging

Potter2021_merged = ' '.join(Potter2021_raw)

In [23]:
Potter2021_merged[:1000]

'\n                                \n                                NeujahrEs war sehr früh am Morgen des ersten Januar 1978, als James Potter sich aus dem Jungenschlafsaal schlich und vorsichtig die Treppen in den Gemeinschaftsraum hinunter stieg. In wenigen Stunden würde es Frühstück geben, doch er war schon jetzt wach. Er hatte nicht mehr schlafen können, wollte seine besten Freunde jedoch nicht so früh wecken. Außerdem musste er nachdenken. Gestern bei der Silvesterparty hätte er beinahe mit Lily getanzt. Mit seiner so wunderschönen Schulsprecherkollegin. Sie hatte ihn sogar gefragt, doch er hatte abgelehnt. Vielleicht war das dumm gewesen, doch seit Weihnachten wo er zufällig ein kurzes Gespräch zwischen Lily und Alice mitangehört hatte, war er ziemlich verwirrt.Nach dem Festmahl hatte er die beiden Mädchen in der Eingangshalle im Vorbeigehen reden gehört. Alice hatte ihre Freundin gerade gefragt: „Warum hast du nicht mit James getanzt?“ Natürlich war er sofort langsamer geworden

In [24]:
Biss2021_merged = ' '.join(Biss2021_raw)

In [25]:
WarriorCats2021_merged = ' '.join(WarriorCats2021_raw)

In [26]:
DFFF2021_merged = ' '.join(DFFF2021_raw)

In [27]:
Mittelerde2021_merged = ' '.join(Mittelerde2021_raw)

In [28]:
Jackson2021_merged = ' '.join(Jackson2021_raw)

In [29]:
Panem2021_merged = ' '.join(Panem2021_raw)

In [30]:
PotterOriginals_merged = ' '.join(PotterOriginals_raw)

In [31]:
# cleaning

Potter2021_clean1 = Potter2021_merged.replace('\n', '')
Potter2021_clean2 = Potter2021_clean1.replace('\xa0', '')

In [32]:
Potter2021_clean2[:2000]

'                                                                NeujahrEs war sehr früh am Morgen des ersten Januar 1978, als James Potter sich aus dem Jungenschlafsaal schlich und vorsichtig die Treppen in den Gemeinschaftsraum hinunter stieg. In wenigen Stunden würde es Frühstück geben, doch er war schon jetzt wach. Er hatte nicht mehr schlafen können, wollte seine besten Freunde jedoch nicht so früh wecken. Außerdem musste er nachdenken. Gestern bei der Silvesterparty hätte er beinahe mit Lily getanzt. Mit seiner so wunderschönen Schulsprecherkollegin. Sie hatte ihn sogar gefragt, doch er hatte abgelehnt. Vielleicht war das dumm gewesen, doch seit Weihnachten wo er zufällig ein kurzes Gespräch zwischen Lily und Alice mitangehört hatte, war er ziemlich verwirrt.Nach dem Festmahl hatte er die beiden Mädchen in der Eingangshalle im Vorbeigehen reden gehört. Alice hatte ihre Freundin gerade gefragt: „Warum hast du nicht mit James getanzt?“ Natürlich war er sofort langsamer geworden um 

In [33]:
Biss2021_clean1 = Biss2021_merged.replace('\n', '')
Biss2021_clean2 = Biss2021_clean1.replace('\xa0', '')

In [34]:
WarriorCats2021_clean1 = WarriorCats2021_merged.replace('\n', '')
WarriorCats2021_clean2 = WarriorCats2021_clean1.replace('\xa0', '')

In [35]:
DFFF2021_clean1 = DFFF2021_merged.replace('\n', '')
DFFF2021_clean2 = DFFF2021_clean1.replace('\xa0', '')

In [36]:
Mittelerde2021_clean1 = Mittelerde2021_merged.replace('\n', '')
Mittelerde2021_clean2 = Mittelerde2021_clean1.replace('\xa0', '')

In [37]:
Jackson2021_clean1 = Jackson2021_merged.replace('\n', '')
Jackson2021_clean2 = Jackson2021_clean1.replace('\xa0', '')

In [38]:
Panem2021_clean1 = Panem2021_merged.replace('\n', '')
Panem2021_clean2 = Panem2021_clean1.replace('\xa0', '')

In [39]:
PotterOriginals_clean1 = PotterOriginals_merged.replace('\n', '')
PotterOriginals_clean2 = PotterOriginals_clean1.replace('\xa0', '')
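
The cleaning cells delete `'\n'` and `'\xa0'` without inserting anything in their place, so two words that are separated only by one of these characters end up fused. The sketch below wraps the two `replace` calls in a hypothetical helper, `clean_text`, and shows the effect on a made-up sample string. The `replacement=" "` variant is an alternative that the original notebook does not use.

```python
def clean_text(text, replacement=""):
    """Remove newlines and non-breaking spaces from `text`.

    replacement="" reproduces the cleaning cells above;
    replacement=" " is an alternative (not used in the notebook)
    that keeps a word boundary where the character was.
    """
    return text.replace("\n", replacement).replace("\xa0", replacement)


sample = "Ende.\nAnfang\xa0hier"   # made-up sample, not corpus text
print(clean_text(sample))          # Ende.Anfanghier
print(clean_text(sample, " "))     # Ende. Anfang hier
```

Which variant is preferable depends on how the source files use line breaks; fused tokens at sentence boundaries can affect the sentence tokenizer that runs next.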

In [41]:
# tokenize sentences

Potter2021_sentences = nltk.sent_tokenize(Potter2021_clean2, language='german')

In [42]:
Potter2021_sentences[:5]

['                                                                NeujahrEs war sehr früh am Morgen des ersten Januar 1978, als James Potter sich aus dem Jungenschlafsaal schlich und vorsichtig die Treppen in den Gemeinschaftsraum hinunter stieg.',
 'In wenigen Stunden würde es Frühstück geben, doch er war schon jetzt wach.',
 'Er hatte nicht mehr schlafen können, wollte seine besten Freunde jedoch nicht so früh wecken.',
 'Außerdem musste er nachdenken.',
 'Gestern bei der Silvesterparty hätte er beinahe mit Lily getanzt.']

In [43]:
Biss2021_sentences = nltk.sent_tokenize(Biss2021_clean2, language='german')

In [44]:
WarriorCats2021_sentences = nltk.sent_tokenize(WarriorCats2021_clean2, language='german')

In [45]:
DFFF2021_sentences = nltk.sent_tokenize(DFFF2021_clean2, language='german')

In [46]:
Mittelerde2021_sentences = nltk.sent_tokenize(Mittelerde2021_clean2, language='german')

In [47]:
Jackson2021_sentences = nltk.sent_tokenize(Jackson2021_clean2, language='german')

In [48]:
Panem2021_sentences = nltk.sent_tokenize(Panem2021_clean2, language='german')

In [49]:
PotterOriginals_sentences = nltk.sent_tokenize(PotterOriginals_clean2, language='german')

In [50]:
# lemmatizing

Potter2021_lemmatized = [0]*len(Potter2021_sentences)

for i in range(0, len(Potter2021_sentences)):
    words = nlp(Potter2021_sentences[i])
    interim = [0]*len(words)
    for j in range(0, len(interim)):
        interim[j] = words[j].lemma_
    Potter2021_lemmatized[i] = interim 

In [51]:
Potter2021_lemmatized[:5]

[['                                                                ',
  'NeujahrEs',
  'sein',
  'sehr',
  'früh',
  'am',
  'Morgen',
  'der',
  'erst',
  'Januar',
  '1978',
  ',',
  'als',
  'James',
  'Potter',
  'sich',
  'aus',
  'der',
  'Jungenschlafsaal',
  'schleichen',
  'und',
  'vorsichtig',
  'der',
  'Treppe',
  'in',
  'der',
  'Gemeinschaftsraum',
  'hinunter',
  'steigen',
  '.'],
 ['In',
  'wenig',
  'Stunde',
  'werden',
  'ich',
  'Frühstück',
  'geben',
  ',',
  'doch',
  'ich',
  'sein',
  'schon',
  'jetzt',
  'wach',
  '.'],
 ['ich',
  'haben',
  'nicht',
  'mehr',
  'schlafen',
  'können',
  ',',
  'wollen',
  'mein',
  'gut',
  'Freund',
  'jedoch',
  'nicht',
  'so',
  'früh',
  'wecken',
  '.'],
 ['Außerdem', 'musste', 'ich', 'nachdenken', '.'],
 ['Gestern',
  'bei',
  'der',
  'Silvesterparty',
  'haben',
  'ich',
  'beinahe',
  'mit',
  'Lily',
  'tanzen',
  '.']]

In [52]:
Biss2021_lemmatized = [0]*len(Biss2021_sentences)

for i in range(0, len(Biss2021_sentences)):
    words = nlp(Biss2021_sentences[i])
    interim = [0]*len(words)
    for j in range(0, len(interim)):
        interim[j] = words[j].lemma_
    Biss2021_lemmatized[i] = interim 

In [53]:
WarriorCats2021_lemmatized = [0]*len(WarriorCats2021_sentences)

for i in range(0, len(WarriorCats2021_sentences)):
    words = nlp(WarriorCats2021_sentences[i])
    interim = [0]*len(words)
    for j in range(0, len(interim)):
        interim[j] = words[j].lemma_
    WarriorCats2021_lemmatized[i] = interim 

In [54]:
DFFF2021_lemmatized = [0]*len(DFFF2021_sentences)

for i in range(0, len(DFFF2021_sentences)):
    words = nlp(DFFF2021_sentences[i])
    interim = [0]*len(words)
    for j in range(0, len(interim)):
        interim[j] = words[j].lemma_
    DFFF2021_lemmatized[i] = interim 

In [55]:
Mittelerde2021_lemmatized = [0]*len(Mittelerde2021_sentences)

for i in range(0, len(Mittelerde2021_sentences)):
    words = nlp(Mittelerde2021_sentences[i])
    interim = [0]*len(words)
    for j in range(0, len(interim)):
        interim[j] = words[j].lemma_
    Mittelerde2021_lemmatized[i] = interim 

In [56]:
Jackson2021_lemmatized = [0]*len(Jackson2021_sentences)

for i in range(0, len(Jackson2021_sentences)):
    words = nlp(Jackson2021_sentences[i])
    interim = [0]*len(words)
    for j in range(0, len(interim)):
        interim[j] = words[j].lemma_
    Jackson2021_lemmatized[i] = interim 

In [57]:
Panem2021_lemmatized = [0]*len(Panem2021_sentences)

for i in range(0, len(Panem2021_sentences)):
    words = nlp(Panem2021_sentences[i])
    interim = [0]*len(words)
    for j in range(0, len(interim)):
        interim[j] = words[j].lemma_
    Panem2021_lemmatized[i] = interim 

In [58]:
PotterOriginals_lemmatized = [0]*len(PotterOriginals_sentences)

for i in range(0, len(PotterOriginals_sentences)):
    words = nlp(PotterOriginals_sentences[i])
    interim = [0]*len(words)
    for j in range(0, len(interim)):
        interim[j] = words[j].lemma_
    PotterOriginals_lemmatized[i] = interim 
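
The lemmatizing cells above all follow the same pattern: run the pipeline on each sentence and collect `token.lemma_`. A compact sketch of that pattern is given below. The helper name `lemmatize_sentences` and the `batch_size` default are assumptions for illustration. `nlp.pipe` is spaCy's batched way of processing many texts, and it should yield the same lemmas as calling `nlp` on each sentence separately.

```python
def lemmatize_sentences(sentences, nlp, batch_size=1000):
    """Return one list of lemmas per sentence.

    `nlp` is a loaded spaCy pipeline, e.g. the object created in In [3].
    `batch_size` is an illustrative default, not a tuned value.
    """
    return [
        [token.lemma_ for token in doc]
        for doc in nlp.pipe(sentences, batch_size=batch_size)
    ]
```

A call such as `Potter2021_lemmatized = lemmatize_sentences(Potter2021_sentences, nlp)` would then replace one of the loop cells.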

In [59]:
# removing punctuation (the underscore "_" is deliberately kept)

# raw string, so that the backslash is taken literally
punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^`{|}~«»"""

In [60]:
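punct_chars = set(punctuation)
for sent in Potter2021_lemmatized:
    # rebuild each sentence in place; calling remove() while iterating would skip tokens
    sent[:] = [word for word in sent if word not in punct_chars]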

In [116]:
Potter2021_lemmatized[:5]

[['                                                                ',
  'NeujahrEs',
  'sein',
  'sehr',
  'früh',
  'am',
  'Morgen',
  'der',
  'erst',
  'Januar',
  '1978',
  'als',
  'James',
  'Potter',
  'sich',
  'aus',
  'der',
  'Jungenschlafsaal',
  'schleichen',
  'und',
  'vorsichtig',
  'der',
  'Treppe',
  'in',
  'der',
  'Gemeinschaftsraum',
  'hinunter',
  'steigen'],
 ['In',
  'wenig',
  'Stunde',
  'werden',
  'ich',
  'Frühstück',
  'geben',
  'doch',
  'ich',
  'sein',
  'schon',
  'jetzt',
  'wach'],
 ['ich',
  'haben',
  'nicht',
  'mehr',
  'schlafen',
  'können',
  'wollen',
  'mein',
  'gut',
  'Freund',
  'jedoch',
  'nicht',
  'so',
  'früh',
  'wecken'],
 ['Außerdem', 'musste', 'ich', 'nachdenken'],
 ['Gestern',
  'bei',
  'der',
  'Silvesterparty',
  'haben',
  'ich',
  'beinahe',
  'mit',
  'Lily',
  'tanzen']]

In [61]:
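punct_chars = set(punctuation)
for sent in Biss2021_lemmatized:
    # rebuild each sentence in place; calling remove() while iterating would skip tokens
    sent[:] = [word for word in sent if word not in punct_chars]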

In [62]:
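punct_chars = set(punctuation)
for sent in WarriorCats2021_lemmatized:
    # rebuild each sentence in place; calling remove() while iterating would skip tokens
    sent[:] = [word for word in sent if word not in punct_chars]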

In [63]:
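punct_chars = set(punctuation)
for sent in DFFF2021_lemmatized:
    # rebuild each sentence in place; calling remove() while iterating would skip tokens
    sent[:] = [word for word in sent if word not in punct_chars]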

In [64]:
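punct_chars = set(punctuation)
for sent in Mittelerde2021_lemmatized:
    # rebuild each sentence in place; calling remove() while iterating would skip tokens
    sent[:] = [word for word in sent if word not in punct_chars]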

In [65]:
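punct_chars = set(punctuation)
for sent in Jackson2021_lemmatized:
    # rebuild each sentence in place; calling remove() while iterating would skip tokens
    sent[:] = [word for word in sent if word not in punct_chars]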

In [66]:
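punct_chars = set(punctuation)
for sent in Panem2021_lemmatized:
    # rebuild each sentence in place; calling remove() while iterating would skip tokens
    sent[:] = [word for word in sent if word not in punct_chars]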

In [67]:
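punct_chars = set(punctuation)
for sent in PotterOriginals_lemmatized:
    # rebuild each sentence in place; calling remove() while iterating would skip tokens
    sent[:] = [word for word in sent if word not in punct_chars]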
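
A note on the punctuation filter: calling `remove()` on a list while a `for` loop is iterating over it makes the loop skip the element directly after each removal, so two adjacent punctuation tokens are not both removed. The sketch below contrasts that pattern with a filtering list comprehension on a made-up token list; the variable names are illustrative only.

```python
punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^`{|}~«»"""
punct_chars = set(punctuation)  # exact membership test on single characters

tokens = ["hallo", ",", ".", "welt", "!", "?"]  # made-up example sentence

# Variant 1: remove while iterating. After each removal the remaining items
# shift left, and the loop moves on past the item that took the freed slot.
in_place = list(tokens)
for word in in_place:
    if word in punct_chars:
        in_place.remove(word)

# Variant 2: build a new, filtered list. Every token is checked exactly once.
filtered = [word for word in tokens if word not in punct_chars]

print(in_place)  # ['hallo', '.', 'welt', '?']
print(filtered)  # ['hallo', 'welt']
```

Using a set instead of the string also avoids substring matching: with a plain string, `"" in punctuation` is `True`, whereas a set only matches the listed characters themselves.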

In [68]:
# lowercasing

Potter2021_sents_final = [[word.lower() for word in sent] for sent in Potter2021_lemmatized]

In [117]:
Potter2021_sents_final[:5]

[['                                                                ',
  'neujahres',
  'sein',
  'sehr',
  'früh',
  'am',
  'morgen',
  'der',
  'erst',
  'januar',
  '1978',
  'als',
  'james',
  'potter',
  'sich',
  'aus',
  'der',
  'jungenschlafsaal',
  'schleichen',
  'und',
  'vorsichtig',
  'der',
  'treppe',
  'in',
  'der',
  'gemeinschaftsraum',
  'hinunter',
  'steigen'],
 ['in',
  'wenig',
  'stunde',
  'werden',
  'ich',
  'frühstück',
  'geben',
  'doch',
  'ich',
  'sein',
  'schon',
  'jetzt',
  'wach'],
 ['ich',
  'haben',
  'nicht',
  'mehr',
  'schlafen',
  'können',
  'wollen',
  'mein',
  'gut',
  'freund',
  'jedoch',
  'nicht',
  'so',
  'früh',
  'wecken'],
 ['außerdem', 'musste', 'ich', 'nachdenken'],
 ['gestern',
  'bei',
  'der',
  'silvesterparty',
  'haben',
  'ich',
  'beinahe',
  'mit',
  'lily',
  'tanzen']]

In [69]:
Biss2021_sents_final = [[word.lower() for word in sent] for sent in Biss2021_lemmatized]

In [70]:
WarriorCats2021_sents_final = [[word.lower() for word in sent] for sent in WarriorCats2021_lemmatized]

In [71]:
DFFF2021_sents_final = [[word.lower() for word in sent] for sent in DFFF2021_lemmatized]

In [72]:
Mittelerde2021_sents_final = [[word.lower() for word in sent] for sent in Mittelerde2021_lemmatized]

In [73]:
Jackson2021_sents_final = [[word.lower() for word in sent] for sent in Jackson2021_lemmatized]

In [74]:
Panem2021_sents_final = [[word.lower() for word in sent] for sent in Panem2021_lemmatized]

In [75]:
PotterOriginals_sents_final = [[word.lower() for word in sent] for sent in PotterOriginals_lemmatized]

# SAVE CORPORA

In [76]:
# save corpus

path_pickled = r'D:\JH\Masterarbeit\corpora_3.0'

In [77]:
with open(path_pickled + '\\Potter2021_sents_final.pkl', 'wb') as f:
    pickle.dump(Potter2021_sents_final, f)

In [78]:
with open(path_pickled + '\\Biss2021_sents_final.pkl', 'wb') as f:
    pickle.dump(Biss2021_sents_final, f)

In [79]:
with open(path_pickled + '\\WarriorCats2021_sents_final.pkl', 'wb') as f:
    pickle.dump(WarriorCats2021_sents_final, f)

In [81]:
with open(path_pickled + '\\DFFF2021_sents_final.pkl', 'wb') as f:
    pickle.dump(DFFF2021_sents_final, f)

In [82]:
with open(path_pickled + '\\Mittelerde2021_sents_final.pkl', 'wb') as f:
    pickle.dump(Mittelerde2021_sents_final, f)

In [83]:
with open(path_pickled + '\\Jackson2021_sents_final.pkl', 'wb') as f:
    pickle.dump(Jackson2021_sents_final, f)

In [84]:
with open(path_pickled + '\\Panem2021_sents_final.pkl', 'wb') as f:
    pickle.dump(Panem2021_sents_final, f)

In [85]:
with open(path_pickled + '\\PotterOriginals_sents_final.pkl', 'wb') as f:
    pickle.dump(PotterOriginals_sents_final, f)
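
The save cells above share one pattern, and the pickled corpora are only useful if they can be read back unchanged. A minimal sketch of a save/load pair follows; the helper names `save_pickle` and `load_pickle` are hypothetical, and `os.path.join` is used instead of string concatenation as one possible way to build the path.

```python
import os
import pickle


def save_pickle(obj, folder, filename):
    """Pickle `obj` to folder/filename and return the full path."""
    path = os.path.join(folder, filename)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path


def load_pickle(path):
    """Load a previously pickled object from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

For example, `save_pickle(Potter2021_sents_final, path_pickled, 'Potter2021_sents_final.pkl')` writes the corpus, and `load_pickle` on the returned path restores the same list of token lists.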

# CREATE AND SAVE WORD (TOKEN) LIST

In [86]:
# create and save word (token) list

Potter2021_words = [word for sent in Potter2021_sents_final for word in sent]

In [87]:
Biss2021_words = [word for sent in Biss2021_sents_final for word in sent]

In [88]:
WarriorCats2021_words = [word for sent in WarriorCats2021_sents_final for word in sent]

In [89]:
DFFF2021_words = [word for sent in DFFF2021_sents_final for word in sent]

In [90]:
Mittelerde2021_words = [word for sent in Mittelerde2021_sents_final for word in sent]

In [91]:
Jackson2021_words = [word for sent in Jackson2021_sents_final for word in sent]

In [92]:
Panem2021_words = [word for sent in Panem2021_sents_final for word in sent]

In [93]:
PotterOriginals_words = [word for sent in PotterOriginals_sents_final for word in sent]

In [94]:
with open(path_pickled + '\\Potter2021_words.pkl', 'wb') as f:
    pickle.dump(Potter2021_words, f)

In [95]:
with open(path_pickled + '\\Biss2021_words.pkl', 'wb') as f:
    pickle.dump(Biss2021_words, f)

In [96]:
with open(path_pickled + '\\WarriorCats2021_words.pkl', 'wb') as f:
    pickle.dump(WarriorCats2021_words, f)

In [98]:
with open(path_pickled + '\\DFFF2021_words.pkl', 'wb') as f:
    pickle.dump(DFFF2021_words, f)

In [99]:
with open(path_pickled + '\\Mittelerde2021_words.pkl', 'wb') as f:
    pickle.dump(Mittelerde2021_words, f)

In [100]:
with open(path_pickled + '\\Jackson2021_words.pkl', 'wb') as f:
    pickle.dump(Jackson2021_words, f)

In [101]:
with open(path_pickled + '\\Panem2021_words.pkl', 'wb') as f:
    pickle.dump(Panem2021_words, f)

In [102]:
with open(path_pickled + '\\PotterOriginals_words.pkl', 'wb') as f:
    pickle.dump(PotterOriginals_words, f)
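
The flat word lists make it possible to estimate, before training, which word types meet the `min_count=25` threshold used in the training cells below. The sketch uses `collections.Counter`; the function name `vocab_at_min_count` is hypothetical and the check is not part of the original notebook.

```python
from collections import Counter


def vocab_at_min_count(words, min_count=25):
    """Return the set of word types that occur at least `min_count` times."""
    counts = Counter(words)
    return {word for word, n in counts.items() if n >= min_count}
```

`len(vocab_at_min_count(Potter2021_words))` would then give the approximate vocabulary size to expect for the Potter models, since Word2Vec ignores words whose total frequency is below `min_count`.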

# TRAIN AND SAVE MODELS

In [118]:
# train and save models

modelPotter2021A = Word2Vec(Potter2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=5)

modelPotter2021B = Word2Vec(Potter2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=10)

modelPotter2021C = Word2Vec(Potter2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=15)

modelPotter2021D = Word2Vec(Potter2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=20)

modelPotter2021E = Word2Vec(Potter2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=25)

modelPotter2021F = Word2Vec(Potter2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=5)

modelPotter2021G = Word2Vec(Potter2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=10)

modelPotter2021H = Word2Vec(Potter2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=15)

modelPotter2021I = Word2Vec(Potter2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=20)

modelPotter2021J = Word2Vec(Potter2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=25)

In [115]:
modelBiss2021A = Word2Vec(Biss2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=5)

modelBiss2021B = Word2Vec(Biss2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=10)

modelBiss2021C = Word2Vec(Biss2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=15)

modelBiss2021D = Word2Vec(Biss2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=20)

modelBiss2021E = Word2Vec(Biss2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=25)

modelBiss2021F = Word2Vec(Biss2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=5)

modelBiss2021G = Word2Vec(Biss2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=10)

modelBiss2021H = Word2Vec(Biss2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=15)

modelBiss2021I = Word2Vec(Biss2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=20)

modelBiss2021J = Word2Vec(Biss2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=25)

In [114]:
modelWarriorCats2021A = Word2Vec(WarriorCats2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=5)

modelWarriorCats2021B = Word2Vec(WarriorCats2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=10)

modelWarriorCats2021C = Word2Vec(WarriorCats2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=15)

modelWarriorCats2021D = Word2Vec(WarriorCats2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=20)

modelWarriorCats2021E = Word2Vec(WarriorCats2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=25)

modelWarriorCats2021F = Word2Vec(WarriorCats2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=5)

modelWarriorCats2021G = Word2Vec(WarriorCats2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=10)

modelWarriorCats2021H = Word2Vec(WarriorCats2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=15)

modelWarriorCats2021I = Word2Vec(WarriorCats2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=20)

modelWarriorCats2021J = Word2Vec(WarriorCats2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=25)

In [113]:
modelDFFF2021A = Word2Vec(DFFF2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=5)

modelDFFF2021B = Word2Vec(DFFF2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=10)

modelDFFF2021C = Word2Vec(DFFF2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=15)

modelDFFF2021D = Word2Vec(DFFF2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=20)

modelDFFF2021E = Word2Vec(DFFF2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=25)

modelDFFF2021F = Word2Vec(DFFF2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=5)

modelDFFF2021G = Word2Vec(DFFF2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=10)

modelDFFF2021H = Word2Vec(DFFF2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=15)

modelDFFF2021I = Word2Vec(DFFF2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=20)

modelDFFF2021J = Word2Vec(DFFF2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=25)

In [112]:
modelMittelerde2021A = Word2Vec(Mittelerde2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=5)

modelMittelerde2021B = Word2Vec(Mittelerde2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=10)

modelMittelerde2021C = Word2Vec(Mittelerde2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=15)

modelMittelerde2021D = Word2Vec(Mittelerde2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=20)

modelMittelerde2021E = Word2Vec(Mittelerde2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=25)

modelMittelerde2021F = Word2Vec(Mittelerde2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=5)

modelMittelerde2021G = Word2Vec(Mittelerde2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=10)

modelMittelerde2021H = Word2Vec(Mittelerde2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=15)

modelMittelerde2021I = Word2Vec(Mittelerde2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=20)

modelMittelerde2021J = Word2Vec(Mittelerde2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=25)

In [111]:
modelJackson2021A = Word2Vec(Jackson2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=5)

modelJackson2021B = Word2Vec(Jackson2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=10)

modelJackson2021C = Word2Vec(Jackson2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=15)

modelJackson2021D = Word2Vec(Jackson2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=20)

modelJackson2021E = Word2Vec(Jackson2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=25)

modelJackson2021F = Word2Vec(Jackson2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=5)

modelJackson2021G = Word2Vec(Jackson2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=10)

modelJackson2021H = Word2Vec(Jackson2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=15)

modelJackson2021I = Word2Vec(Jackson2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=20)

modelJackson2021J = Word2Vec(Jackson2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=25)

In [110]:
modelPanem2021A = Word2Vec(Panem2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=5)

modelPanem2021B = Word2Vec(Panem2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=10)

modelPanem2021C = Word2Vec(Panem2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=15)

modelPanem2021D = Word2Vec(Panem2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=20)

modelPanem2021E = Word2Vec(Panem2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=25)

modelPanem2021F = Word2Vec(Panem2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=5)

modelPanem2021G = Word2Vec(Panem2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=10)

modelPanem2021H = Word2Vec(Panem2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=15)

modelPanem2021I = Word2Vec(Panem2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=20)

modelPanem2021J = Word2Vec(Panem2021_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=25)

In [109]:
modelPotterOriginalsA = Word2Vec(PotterOriginals_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=5)

modelPotterOriginalsB = Word2Vec(PotterOriginals_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=10)

modelPotterOriginalsC = Word2Vec(PotterOriginals_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=15)

modelPotterOriginalsD = Word2Vec(PotterOriginals_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=20)

modelPotterOriginalsE = Word2Vec(PotterOriginals_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=0, epochs=25)

modelPotterOriginalsF = Word2Vec(PotterOriginals_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=5)

modelPotterOriginalsG = Word2Vec(PotterOriginals_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=10)

modelPotterOriginalsH = Word2Vec(PotterOriginals_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=15)

modelPotterOriginalsI = Word2Vec(PotterOriginals_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=20)

modelPotterOriginalsJ = Word2Vec(PotterOriginals_sents_final, vector_size=300, window=5, workers=6, min_count=25, sg=1, epochs=25)
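
The training cells above follow a fixed scheme: suffixes A to E are CBOW models (`sg=0`) and F to J are skip-gram models (`sg=1`), each with 5, 10, 15, 20 and 25 epochs. The sketch below writes that scheme down as a grid and trains all ten configurations for one corpus in a loop. The names `GRID` and `train_grid` are hypothetical; the hyperparameters are the ones used in the cells above.

```python
from itertools import product
from string import ascii_uppercase

# A-E: sg=0 (CBOW), F-J: sg=1 (skip-gram); epochs 5..25 within each group
GRID = [
    {"suffix": letter, "sg": sg, "epochs": epochs}
    for letter, (sg, epochs) in zip(
        ascii_uppercase, product((0, 1), (5, 10, 15, 20, 25))
    )
]


def train_grid(sentences, name):
    """Train the ten configurations for one corpus.

    Returns a dict mapping names such as 'modelPotter2021A' to trained models.
    """
    # imported here so that the grid itself can be inspected without gensim
    from gensim.models import Word2Vec

    return {
        f"model{name}{cfg['suffix']}": Word2Vec(
            sentences, vector_size=300, window=5, workers=6,
            min_count=25, sg=cfg["sg"], epochs=cfg["epochs"],
        )
        for cfg in GRID
    }
```

A call such as `models = train_grid(Potter2021_sents_final, "Potter2021")` would produce the same ten configurations as the first training cell, and the resulting dict can be looped over when saving the vectors with `model.wv.save(...)`.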

In [10]:
# save models

path_models = r'D:\JH\Masterarbeit\models_3.0'

In [121]:
modelPotter2021A.wv.save(path_models + '\\modelPotter2021A.kv')

modelPotter2021B.wv.save(path_models + '\\modelPotter2021B.kv')

modelPotter2021C.wv.save(path_models + '\\modelPotter2021C.kv')

modelPotter2021D.wv.save(path_models + '\\modelPotter2021D.kv')

modelPotter2021E.wv.save(path_models + '\\modelPotter2021E.kv')

modelPotter2021F.wv.save(path_models + '\\modelPotter2021F.kv')

modelPotter2021G.wv.save(path_models + '\\modelPotter2021G.kv')

modelPotter2021H.wv.save(path_models + '\\modelPotter2021H.kv')

modelPotter2021I.wv.save(path_models + '\\modelPotter2021I.kv')

modelPotter2021J.wv.save(path_models + '\\modelPotter2021J.kv')

In [122]:
modelBiss2021A.wv.save(path_models + '\\modelBiss2021A.kv')

modelBiss2021B.wv.save(path_models + '\\modelBiss2021B.kv')

modelBiss2021C.wv.save(path_models + '\\modelBiss2021C.kv')

modelBiss2021D.wv.save(path_models + '\\modelBiss2021D.kv')

modelBiss2021E.wv.save(path_models + '\\modelBiss2021E.kv')

modelBiss2021F.wv.save(path_models + '\\modelBiss2021F.kv')

modelBiss2021G.wv.save(path_models + '\\modelBiss2021G.kv')

modelBiss2021H.wv.save(path_models + '\\modelBiss2021H.kv')

modelBiss2021I.wv.save(path_models + '\\modelBiss2021I.kv')

modelBiss2021J.wv.save(path_models + '\\modelBiss2021J.kv')

In [123]:
modelWarriorCats2021A.wv.save(path_models + '\\modelWarriorCats2021A.kv')

modelWarriorCats2021B.wv.save(path_models + '\\modelWarriorCats2021B.kv')

modelWarriorCats2021C.wv.save(path_models + '\\modelWarriorCats2021C.kv')

modelWarriorCats2021D.wv.save(path_models + '\\modelWarriorCats2021D.kv')

modelWarriorCats2021E.wv.save(path_models + '\\modelWarriorCats2021E.kv')

modelWarriorCats2021F.wv.save(path_models + '\\modelWarriorCats2021F.kv')

modelWarriorCats2021G.wv.save(path_models + '\\modelWarriorCats2021G.kv')

modelWarriorCats2021H.wv.save(path_models + '\\modelWarriorCats2021H.kv')

modelWarriorCats2021I.wv.save(path_models + '\\modelWarriorCats2021I.kv')

modelWarriorCats2021J.wv.save(path_models + '\\modelWarriorCats2021J.kv')

In [124]:
modelDFFF2021A.wv.save(path_models + '\\modelDFFF2021A.kv')

modelDFFF2021B.wv.save(path_models + '\\modelDFFF2021B.kv')

modelDFFF2021C.wv.save(path_models + '\\modelDFFF2021C.kv')

modelDFFF2021D.wv.save(path_models + '\\modelDFFF2021D.kv')

modelDFFF2021E.wv.save(path_models + '\\modelDFFF2021E.kv')

modelDFFF2021F.wv.save(path_models + '\\modelDFFF2021F.kv')

modelDFFF2021G.wv.save(path_models + '\\modelDFFF2021G.kv')

modelDFFF2021H.wv.save(path_models + '\\modelDFFF2021H.kv')

modelDFFF2021I.wv.save(path_models + '\\modelDFFF2021I.kv')

modelDFFF2021J.wv.save(path_models + '\\modelDFFF2021J.kv')

In [125]:
modelMittelerde2021A.wv.save(path_models + '\\modelMittelerde2021A.kv')

modelMittelerde2021B.wv.save(path_models + '\\modelMittelerde2021B.kv')

modelMittelerde2021C.wv.save(path_models + '\\modelMittelerde2021C.kv')

modelMittelerde2021D.wv.save(path_models + '\\modelMittelerde2021D.kv')

modelMittelerde2021E.wv.save(path_models + '\\modelMittelerde2021E.kv')

modelMittelerde2021F.wv.save(path_models + '\\modelMittelerde2021F.kv')

modelMittelerde2021G.wv.save(path_models + '\\modelMittelerde2021G.kv')

modelMittelerde2021H.wv.save(path_models + '\\modelMittelerde2021H.kv')

modelMittelerde2021I.wv.save(path_models + '\\modelMittelerde2021I.kv')

modelMittelerde2021J.wv.save(path_models + '\\modelMittelerde2021J.kv')

In [126]:
modelJackson2021A.wv.save(path_models + '\\modelJackson2021A.kv')

modelJackson2021B.wv.save(path_models + '\\modelJackson2021B.kv')

modelJackson2021C.wv.save(path_models + '\\modelJackson2021C.kv')

modelJackson2021D.wv.save(path_models + '\\modelJackson2021D.kv')

modelJackson2021E.wv.save(path_models + '\\modelJackson2021E.kv')

modelJackson2021F.wv.save(path_models + '\\modelJackson2021F.kv')

modelJackson2021G.wv.save(path_models + '\\modelJackson2021G.kv')

modelJackson2021H.wv.save(path_models + '\\modelJackson2021H.kv')

modelJackson2021I.wv.save(path_models + '\\modelJackson2021I.kv')

modelJackson2021J.wv.save(path_models + '\\modelJackson2021J.kv')

In [127]:
modelPanem2021A.wv.save(path_models + '\\modelPanem2021A.kv')

modelPanem2021B.wv.save(path_models + '\\modelPanem2021B.kv')

modelPanem2021C.wv.save(path_models + '\\modelPanem2021C.kv')

modelPanem2021D.wv.save(path_models + '\\modelPanem2021D.kv')

modelPanem2021E.wv.save(path_models + '\\modelPanem2021E.kv')

modelPanem2021F.wv.save(path_models + '\\modelPanem2021F.kv')

modelPanem2021G.wv.save(path_models + '\\modelPanem2021G.kv')

modelPanem2021H.wv.save(path_models + '\\modelPanem2021H.kv')

modelPanem2021I.wv.save(path_models + '\\modelPanem2021I.kv')

modelPanem2021J.wv.save(path_models + '\\modelPanem2021J.kv')

In [128]:
modelPotterOriginalsA.wv.save(path_models + '\\modelPotterOriginalsA.kv')

modelPotterOriginalsB.wv.save(path_models + '\\modelPotterOriginalsB.kv')

modelPotterOriginalsC.wv.save(path_models + '\\modelPotterOriginalsC.kv')

modelPotterOriginalsD.wv.save(path_models + '\\modelPotterOriginalsD.kv')

modelPotterOriginalsE.wv.save(path_models + '\\modelPotterOriginalsE.kv')

modelPotterOriginalsF.wv.save(path_models + '\\modelPotterOriginalsF.kv')

modelPotterOriginalsG.wv.save(path_models + '\\modelPotterOriginalsG.kv')

modelPotterOriginalsH.wv.save(path_models + '\\modelPotterOriginalsH.kv')

modelPotterOriginalsI.wv.save(path_models + '\\modelPotterOriginalsI.kv')

modelPotterOriginalsJ.wv.save(path_models + '\\modelPotterOriginalsJ.kv')
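
The save calls in the cells above all follow one pattern (`<model>.wv.save(path_models + '\\<name>.kv')`), so they could also be written as a loop. The following is a minimal sketch and not the code that was run for the thesis; the helper name `save_keyed_vectors` and the `models` dictionary are assumptions introduced for illustration. Unlike the cells above, the sketch also creates the target directory if it is missing.

```python
import os


def save_keyed_vectors(models, out_dir):
    """Save the word vectors of each model as <name>.kv inside out_dir.

    `models` maps a name such as "modelPotter2021A" to a trained model
    that exposes `.wv.save(path)`, as gensim's Word2Vec does.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved_paths = []
    for name, model in models.items():
        target = os.path.join(out_dir, name + ".kv")
        model.wv.save(target)
        saved_paths.append(target)
    return saved_paths


# Hypothetical usage inside the notebook:
# save_keyed_vectors({"modelPotter2021A": modelPotter2021A,
#                     "modelPotter2021B": modelPotter2021B}, path_models)
```

Using `os.path.join` instead of string concatenation with `'\\'` keeps the paths independent of the operating system.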

# EVALUATE MODELS

The following code blocks are taken from Brottrager et al.'s "Character Shifts in Harry Potter Fanfictions". The relevant Jupyter notebook is available at https://github.com/jbrottrager/character-shifts-HPFFS/blob/main/scripts/06_modelEvaluation.ipynb (last viewed: 2022/09/18).

In [7]:
# evaluate models (code taken from Brottrager et al. 2022)

google_analogies = r'D:\JH\Masterarbeit\model_evaluation\de_trans_Google_analogies.txt'
word_pairs = r'D:\JH\Masterarbeit\model_evaluation\de_re-rated_Schm280.txt'

In [8]:
path_results = r'D:\JH\Masterarbeit\results\evaluation'

In [33]:
model_name = "modelPotter2021"

# create model list
models = []
directory = path_models
for filename in sorted(os.listdir(directory)):
    if filename.endswith(".kv") and filename.startswith(model_name):
         models.append(("model " + filename[-4:-3], KeyedVectors.load(os.path.join(directory, filename))))
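
The cell above relies on the file naming scheme `<model_name><letter>.kv`: `filename[-4:-3]` picks the single character directly before the `.kv` extension. As a sketch under that assumption, the same selection can be written so that the label is derived from the prefix rather than from a fixed position. The function name `list_model_files` is hypothetical, and the sketch only collects paths; loading with `KeyedVectors.load` would happen afterwards, as in the cell above.

```python
import os


def list_model_files(directory, model_name):
    """Return (label, path) pairs for all <model_name><letter>.kv files."""
    pairs = []
    for filename in sorted(os.listdir(directory)):
        # Skip everything that is not a .kv file of the requested corpus.
        if filename.startswith(model_name) and filename.endswith(".kv"):
            stem = filename[:-len(".kv")]    # e.g. "modelPotter2021A"
            letter = stem[len(model_name):]  # e.g. "A"
            pairs.append(("model " + letter, os.path.join(directory, filename)))
    return pairs
```

Because the directory listing is sorted, the labels come back in the order `model A`, `model B`, and so on, which is also the order of the bars in the plots.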

In [35]:
#evaluation
ticks = ["google_acc", "word_pairs"]
data = {}
for tick in ticks:
    data[tick] = [tick]
for (_, model) in models:
    google_acc = model.evaluate_word_analogies(datapath(google_analogies))[0]
    # evaluate_word_pairs returns (pearson, spearman, oov_ratio)
    pearson, spear, _ = model.evaluate_word_pairs(datapath(word_pairs), delimiter="\t")
    
    data["google_acc"].append(google_acc)
    data["word_pairs"].append(spear.correlation)
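
The `word_pairs` score collected above is a Spearman rank correlation between the human similarity ratings in the word-pair file and the similarity scores the model assigns to the same pairs. As a rough illustration of what that number measures, the sketch below computes a Spearman correlation in plain Python. The function names and the toy numbers are invented for this example; they are not part of gensim or of the thesis data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy data: human ratings vs. model similarities for four word pairs.
human = [4.0, 3.5, 1.0, 0.5]
model = [0.81, 0.62, 0.30, 0.35]
print(spearman(human, model))
```

A value close to 1 means the model orders the word pairs almost the same way the human raters do; only the ordering matters, not the absolute similarity values.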

In [38]:
# visualization I

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

df_acc = pd.DataFrame([data["google_acc"]],
                   columns=['Test'] + [modelname for (modelname, _) in models])
df_acc.plot(x='Test',
        kind='bar',
        stacked=False,
        title='',
        colormap="Spectral",
        ax=ax,
        legend=False,
        rot=30,
        xlabel="",
        ylabel="accuracy")

plt.legend(loc='center right', bbox_to_anchor=(1.0, 0.5))

fig.savefig(path_results + '\\model_eval_acc_' + model_name + '.png', dpi=300, bbox_inches = 'tight')

# visualization II

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

df_spear = pd.DataFrame([data["word_pairs"]],
                   columns=['Test'] + [modelname for (modelname, _) in models])
df_spear.plot(x='Test',
        kind='bar',
        stacked=False,
        title='',
        colormap="Spectral",
        ax=ax,
        legend=False,
        rot=30,
        xlabel="",
        ylabel="correlation score")

plt.legend(loc='center right', bbox_to_anchor=(1.0, 0.5))

fig.savefig(path_results + '\\model_eval_corr_' + model_name + '.png', dpi=300, bbox_inches = 'tight')

In [39]:
model_name = "modelBiss2021"

# create model list
models = []
directory = path_models
for filename in sorted(os.listdir(directory)):
    if filename.endswith(".kv") and filename.startswith(model_name):
         models.append(("model " + filename[-4:-3], KeyedVectors.load(os.path.join(directory, filename))))
            
#evaluation
ticks = ["google_acc", "word_pairs"]
data = {}
for tick in ticks:
    data[tick] = [tick]
for (_, model) in models:
    google_acc = model.evaluate_word_analogies(datapath(google_analogies))[0]
    pearson, spear, _ = model.evaluate_word_pairs(datapath(word_pairs), delimiter="\t")
    
    data["google_acc"].append(google_acc)
    data["word_pairs"].append(spear.correlation)
    
# visualization I

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

df_acc = pd.DataFrame([data["google_acc"]],
                   columns=['Test'] + [modelname for (modelname, _) in models])
df_acc.plot(x='Test',
        kind='bar',
        stacked=False,
        title='',
        colormap="Spectral",
        ax=ax,
        legend=False,
        rot=30,
        xlabel="",
        ylabel="accuracy")

plt.legend(loc='center right', bbox_to_anchor=(1.0, 0.5))

fig.savefig(path_results + '\\model_eval_acc_' + model_name + '.png', dpi=300, bbox_inches = 'tight')

# visualization II

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

df_spear = pd.DataFrame([data["word_pairs"]],
                   columns=['Test'] + [modelname for (modelname, _) in models])
df_spear.plot(x='Test',
        kind='bar',
        stacked=False,
        title='',
        colormap="Spectral",
        ax=ax,
        legend=False,
        rot=30,
        xlabel="",
        ylabel="correlation score")

plt.legend(loc='center right', bbox_to_anchor=(1.0, 0.5))

fig.savefig(path_results + '\\model_eval_corr_' + model_name + '.png', dpi=300, bbox_inches = 'tight')

In [40]:
model_name = "modelWarriorCats2021"

# create model list
models = []
directory = path_models
for filename in sorted(os.listdir(directory)):
    if filename.endswith(".kv") and filename.startswith(model_name):
         models.append(("model " + filename[-4:-3], KeyedVectors.load(os.path.join(directory, filename))))
            
#evaluation
ticks = ["google_acc", "word_pairs"]
data = {}
for tick in ticks:
    data[tick] = [tick]
for (_, model) in models:
    google_acc = model.evaluate_word_analogies(datapath(google_analogies))[0]
    pearson, spear, _ = model.evaluate_word_pairs(datapath(word_pairs), delimiter="\t")
    
    data["google_acc"].append(google_acc)
    data["word_pairs"].append(spear.correlation)
    
# visualization I

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

df_acc = pd.DataFrame([data["google_acc"]],
                   columns=['Test'] + [modelname for (modelname, _) in models])
df_acc.plot(x='Test',
        kind='bar',
        stacked=False,
        title='',
        colormap="Spectral",
        ax=ax,
        legend=False,
        rot=30,
        xlabel="",
        ylabel="accuracy")

plt.legend(loc='center right', bbox_to_anchor=(1.0, 0.5))

fig.savefig(path_results + '\\model_eval_acc_' + model_name + '.png', dpi=300, bbox_inches = 'tight')

# visualization II

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

df_spear = pd.DataFrame([data["word_pairs"]],
                   columns=['Test'] + [modelname for (modelname, _) in models])
df_spear.plot(x='Test',
        kind='bar',
        stacked=False,
        title='',
        colormap="Spectral",
        ax=ax,
        legend=False,
        rot=30,
        xlabel="",
        ylabel="correlation score")

plt.legend(loc='center right', bbox_to_anchor=(1.0, 0.5))

fig.savefig(path_results + '\\model_eval_corr_' + model_name + '.png', dpi=300, bbox_inches = 'tight')

In [41]:
model_name = "modelDFFF2021"

# create model list
models = []
directory = path_models
for filename in sorted(os.listdir(directory)):
    if filename.endswith(".kv") and filename.startswith(model_name):
         models.append(("model " + filename[-4:-3], KeyedVectors.load(os.path.join(directory, filename))))
            
#evaluation
ticks = ["google_acc", "word_pairs"]
data = {}
for tick in ticks:
    data[tick] = [tick]
for (_, model) in models:
    google_acc = model.evaluate_word_analogies(datapath(google_analogies))[0]
    pearson, spear, _ = model.evaluate_word_pairs(datapath(word_pairs), delimiter="\t")
    
    data["google_acc"].append(google_acc)
    data["word_pairs"].append(spear.correlation)
    
# visualization I

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

df_acc = pd.DataFrame([data["google_acc"]],
                   columns=['Test'] + [modelname for (modelname, _) in models])
df_acc.plot(x='Test',
        kind='bar',
        stacked=False,
        title='',
        colormap="Spectral",
        ax=ax,
        legend=False,
        rot=30,
        xlabel="",
        ylabel="accuracy")

plt.legend(loc='center right', bbox_to_anchor=(1.0, 0.5))

fig.savefig(path_results + '\\model_eval_acc_' + model_name + '.png', dpi=300, bbox_inches = 'tight')

# visualization II

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

df_spear = pd.DataFrame([data["word_pairs"]],
                   columns=['Test'] + [modelname for (modelname, _) in models])
df_spear.plot(x='Test',
        kind='bar',
        stacked=False,
        title='',
        colormap="Spectral",
        ax=ax,
        legend=False,
        rot=30,
        xlabel="",
        ylabel="correlation score")

plt.legend(loc='center right', bbox_to_anchor=(1.0, 0.5))

fig.savefig(path_results + '\\model_eval_corr_' + model_name + '.png', dpi=300, bbox_inches = 'tight')

In [42]:
model_name = "modelMittelerde2021"

# create model list
models = []
directory = path_models
for filename in sorted(os.listdir(directory)):
    if filename.endswith(".kv") and filename.startswith(model_name):
         models.append(("model " + filename[-4:-3], KeyedVectors.load(os.path.join(directory, filename))))
            
#evaluation
ticks = ["google_acc", "word_pairs"]
data = {}
for tick in ticks:
    data[tick] = [tick]
for (_, model) in models:
    google_acc = model.evaluate_word_analogies(datapath(google_analogies))[0]
    pearson, spear, _ = model.evaluate_word_pairs(datapath(word_pairs), delimiter="\t")
    
    data["google_acc"].append(google_acc)
    data["word_pairs"].append(spear.correlation)
    
# visualization I

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

df_acc = pd.DataFrame([data["google_acc"]],
                   columns=['Test'] + [modelname for (modelname, _) in models])
df_acc.plot(x='Test',
        kind='bar',
        stacked=False,
        title='',
        colormap="Spectral",
        ax=ax,
        legend=False,
        rot=30,
        xlabel="",
        ylabel="accuracy")

plt.legend(loc='center right', bbox_to_anchor=(1.0, 0.5))

fig.savefig(path_results + '\\model_eval_acc_' + model_name + '.png', dpi=300, bbox_inches = 'tight')

# visualization II

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

df_spear = pd.DataFrame([data["word_pairs"]],
                   columns=['Test'] + [modelname for (modelname, _) in models])
df_spear.plot(x='Test',
        kind='bar',
        stacked=False,
        title='',
        colormap="Spectral",
        ax=ax,
        legend=False,
        rot=30,
        xlabel="",
        ylabel="correlation score")

plt.legend(loc='center right', bbox_to_anchor=(1.0, 0.5))

fig.savefig(path_results + '\\model_eval_corr_' + model_name + '.png', dpi=300, bbox_inches = 'tight')

In [43]:
model_name = "modelJackson2021"

# create model list
models = []
directory = path_models
for filename in sorted(os.listdir(directory)):
    if filename.endswith(".kv") and filename.startswith(model_name):
         models.append(("model " + filename[-4:-3], KeyedVectors.load(os.path.join(directory, filename))))
            
#evaluation
ticks = ["google_acc", "word_pairs"]
data = {}
for tick in ticks:
    data[tick] = [tick]
for (_, model) in models:
    google_acc = model.evaluate_word_analogies(datapath(google_analogies))[0]
    pearson, spear, _ = model.evaluate_word_pairs(datapath(word_pairs), delimiter="\t")
    
    data["google_acc"].append(google_acc)
    data["word_pairs"].append(spear.correlation)
    
# visualization I

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

df_acc = pd.DataFrame([data["google_acc"]],
                   columns=['Test'] + [modelname for (modelname, _) in models])
df_acc.plot(x='Test',
        kind='bar',
        stacked=False,
        title='',
        colormap="Spectral",
        ax=ax,
        legend=False,
        rot=30,
        xlabel="",
        ylabel="accuracy")

plt.legend(loc='center right', bbox_to_anchor=(1.0, 0.5))

fig.savefig(path_results + '\\model_eval_acc_' + model_name + '.png', dpi=300, bbox_inches = 'tight')

# visualization II

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

df_spear = pd.DataFrame([data["word_pairs"]],
                   columns=['Test'] + [modelname for (modelname, _) in models])
df_spear.plot(x='Test',
        kind='bar',
        stacked=False,
        title='',
        colormap="Spectral",
        ax=ax,
        legend=False,
        rot=30,
        xlabel="",
        ylabel="correlation score")

plt.legend(loc='center right', bbox_to_anchor=(1.0, 0.5))

fig.savefig(path_results + '\\model_eval_corr_' + model_name + '.png', dpi=300, bbox_inches = 'tight')

In [44]:
model_name = "modelPanem2021"

# create model list
models = []
directory = path_models
for filename in sorted(os.listdir(directory)):
    if filename.endswith(".kv") and filename.startswith(model_name):
         models.append(("model " + filename[-4:-3], KeyedVectors.load(os.path.join(directory, filename))))
            
#evaluation
ticks = ["google_acc", "word_pairs"]
data = {}
for tick in ticks:
    data[tick] = [tick]
for (_, model) in models:
    google_acc = model.evaluate_word_analogies(datapath(google_analogies))[0]
    pearson, spear, _ = model.evaluate_word_pairs(datapath(word_pairs), delimiter="\t")
    
    data["google_acc"].append(google_acc)
    data["word_pairs"].append(spear.correlation)
    
# visualization I

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

df_acc = pd.DataFrame([data["google_acc"]],
                   columns=['Test'] + [modelname for (modelname, _) in models])
df_acc.plot(x='Test',
        kind='bar',
        stacked=False,
        title='',
        colormap="Spectral",
        ax=ax,
        legend=False,
        rot=30,
        xlabel="",
        ylabel="accuracy")

plt.legend(loc='center right', bbox_to_anchor=(1.0, 0.5))

fig.savefig(path_results + '\\model_eval_acc_' + model_name + '.png', dpi=300, bbox_inches = 'tight')

# visualization II

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

df_spear = pd.DataFrame([data["word_pairs"]],
                   columns=['Test'] + [modelname for (modelname, _) in models])
df_spear.plot(x='Test',
        kind='bar',
        stacked=False,
        title='',
        colormap="Spectral",
        ax=ax,
        legend=False,
        rot=30,
        xlabel="",
        ylabel="correlation score")

plt.legend(loc='center right', bbox_to_anchor=(1.0, 0.5))

fig.savefig(path_results + '\\model_eval_corr_' + model_name + '.png', dpi=300, bbox_inches = 'tight')

In [45]:
model_name = "modelPotterOriginals"

# create model list
models = []
directory = path_models
for filename in sorted(os.listdir(directory)):
    if filename.endswith(".kv") and filename.startswith(model_name):
         models.append(("model " + filename[-4:-3], KeyedVectors.load(os.path.join(directory, filename))))
            
#evaluation
ticks = ["google_acc", "word_pairs"]
data = {}
for tick in ticks:
    data[tick] = [tick]
for (_, model) in models:
    google_acc = model.evaluate_word_analogies(datapath(google_analogies))[0]
    pearson, spear, _ = model.evaluate_word_pairs(datapath(word_pairs), delimiter="\t")
    
    data["google_acc"].append(google_acc)
    data["word_pairs"].append(spear.correlation)
    
# visualization I

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

df_acc = pd.DataFrame([data["google_acc"]],
                   columns=['Test'] + [modelname for (modelname, _) in models])
df_acc.plot(x='Test',
        kind='bar',
        stacked=False,
        title='',
        colormap="Spectral",
        ax=ax,
        legend=False,
        rot=30,
        xlabel="",
        ylabel="accuracy")

plt.legend(loc='center right', bbox_to_anchor=(1.0, 0.5))

fig.savefig(path_results + '\\model_eval_acc_' + model_name + '.png', dpi=300, bbox_inches = 'tight')

# visualization II

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

df_spear = pd.DataFrame([data["word_pairs"]],
                   columns=['Test'] + [modelname for (modelname, _) in models])
df_spear.plot(x='Test',
        kind='bar',
        stacked=False,
        title='',
        colormap="Spectral",
        ax=ax,
        legend=False,
        rot=30,
        xlabel="",
        ylabel="correlation score")

plt.legend(loc='center right', bbox_to_anchor=(1.0, 0.5))

fig.savefig(path_results + '\\model_eval_corr_' + model_name + '.png', dpi=300, bbox_inches = 'tight')
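
The eight evaluation cells above differ only in the value of `model_name`. One possible way to avoid the repetition is to move the scoring step into a function and loop over the prefixes. The sketch below is an assumption about how that could look, not the code used for the thesis: `MODEL_NAMES` and `collect_scores` are hypothetical names, and the function expects objects that offer `evaluate_word_analogies` and `evaluate_word_pairs` in the way gensim's `KeyedVectors` does.

```python
# The eight file-name prefixes used in the cells above.
MODEL_NAMES = [
    "modelPotter2021", "modelBiss2021", "modelWarriorCats2021",
    "modelDFFF2021", "modelMittelerde2021", "modelJackson2021",
    "modelPanem2021", "modelPotterOriginals",
]


def collect_scores(models, analogies_path, pairs_path):
    """Build the `data` dict used for plotting from (label, vectors) pairs."""
    data = {"google_acc": ["google_acc"], "word_pairs": ["word_pairs"]}
    for _, vectors in models:
        # First element of the result is the overall analogy accuracy.
        accuracy = vectors.evaluate_word_analogies(analogies_path)[0]
        # The result is (pearson, spearman, oov_ratio); only spearman is kept.
        _, spearman, _ = vectors.evaluate_word_pairs(pairs_path, delimiter="\t")
        data["google_acc"].append(accuracy)
        data["word_pairs"].append(spearman.correlation)
    return data


# Hypothetical usage inside the notebook:
# for model_name in MODEL_NAMES:
#     models = ...  # load the matching .kv files as in the cells above
#     data = collect_scores(models, google_analogies, word_pairs)
#     ...           # plot and save the two figures as in the cells above
```

The returned dictionary has the same layout as `data` in the cells above, so the plotting code can be reused unchanged.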

# SUMMARIZE INFO

In [136]:
# summarize info

# word (token) count

len(Potter2021_words)

92615705

In [137]:
len(Biss2021_words)

7879972

In [138]:
len(WarriorCats2021_words)

2323804

In [139]:
len(DFFF2021_words)

4213801

In [140]:
len(Mittelerde2021_words)

10630831

In [141]:
len(Jackson2021_words)

3082836

In [142]:
len(Panem2021_words)

1971107

In [143]:
len(PotterOriginals_words)

1125168
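
For an overview, the eight token counts printed above can be collected in one table. The sketch below hard-codes the numbers from the cell outputs; in the notebook they would come from `len(..._words)` directly. The names `token_counts` and `format_counts` are assumptions made for this example.

```python
# Token counts copied from the cell outputs above.
token_counts = {
    "Potter2021": 92615705,
    "Biss2021": 7879972,
    "WarriorCats2021": 2323804,
    "DFFF2021": 4213801,
    "Mittelerde2021": 10630831,
    "Jackson2021": 3082836,
    "Panem2021": 1971107,
    "PotterOriginals": 1125168,
}


def format_counts(counts):
    """Return a text table of the corpora, largest first, plus a total row."""
    lines = []
    by_size = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    for name, count in by_size:
        lines.append(f"{name:<16}{count:>12,}")
    lines.append(f"{'total':<16}{sum(counts.values()):>12,}")
    return "\n".join(lines)


print(format_counts(token_counts))
```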

In [None]:
# info model B

print(modelPotter2021B)

In [None]:
print(modelBiss2021B)

In [None]:
print(modelWarriorCats2021B)

In [None]:
print(modelDFFF2021B)

In [None]:
print(modelMittelerde2021B)

In [None]:
print(modelJackson2021B)

In [None]:
print(modelPanem2021B)

In [None]:
print(modelPotterOriginalsB)

# TEST MODELS

In [None]:
# test models

modelPotter2021B.wv.most_similar("held")

In [None]:
modelBiss2021B.wv.most_similar("held")

In [None]:
modelWarriorCats2021B.wv.most_similar("held")

In [None]:
modelDFFF2021B.wv.most_similar("held")

In [None]:
modelMittelerde2021B.wv.most_similar("held")

In [None]:
modelJackson2021B.wv.most_similar("held")

In [None]:
modelPanem2021B.wv.most_similar("held")

In [None]:
modelPotterOriginalsB.wv.most_similar("held")
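
`most_similar` ranks all other words in the vocabulary by the cosine similarity of their vectors to the query vector and returns the highest-scoring ones (ten by default). The sketch below reproduces that idea in plain Python on a handful of three-dimensional toy vectors. The vectors are invented for illustration and have nothing to do with the trained 300-dimensional models; the function names are likewise assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=10):
    """Return up to topn (word, similarity) pairs, most similar first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Invented toy vectors (lowercased, like the preprocessed corpus).
toy_vectors = {
    "held": [0.9, 0.1, 0.0],
    "heldin": [0.8, 0.2, 0.1],
    "retter": [0.7, 0.0, 0.3],
    "feind": [-0.6, 0.5, 0.2],
}

for neighbour, score in most_similar("held", toy_vectors):
    print(neighbour, round(score, 3))
```

Comparing the neighbours of the same query word across the corpora, as the cells above do with "held", shows how differently a word is used in each fandom.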