Documentation for the Master's thesis "Fanfiction Semantics - A Quantitative Analysis of Sensitive Topics in German Fanfiction" by Julian Jacopo Häußler, date of submission: September 19, 2022.

# 7.2 Keyword Analysis

## Overview

- Read in data
- Visualization of keywords and overlap

# READ IN DATA

In [1]:
# load libraries

# loading data 
from gensim.models import word2vec
import pickle

#visualizations
%matplotlib notebook
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import nltk
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

In [2]:
# define path

path_results = r'C:\Users\Public\Data\Masterarbeit\results\7.2 Keyword Analyses'

In [3]:
#read in models

modelPotter = word2vec.KeyedVectors.load('modelPotter2021H.kv')

modelBiss = word2vec.KeyedVectors.load('modelBiss2021H.kv')

modelWarriorCats = word2vec.KeyedVectors.load('modelWarriorCats2021H.kv')

modelDFFF = word2vec.KeyedVectors.load('modelDFFF2021H.kv')

modelMittelerde = word2vec.KeyedVectors.load('modelMittelerde2021H.kv')

modelJackson = word2vec.KeyedVectors.load('modelJackson2021H.kv')

modelPanem = word2vec.KeyedVectors.load('modelPanem2021H.kv')

modelPotterOriginals = word2vec.KeyedVectors.load('modelPotterOriginalsH.kv')
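
The eight load calls above differ only in the file name, so they can be collected in one mapping. The following is a minimal sketch, assuming the `.kv` files sit in the working directory as in the cell above. The names `MODEL_FILES` and `load_models` are hypothetical and not part of the original notebook, and the loader is passed in as an argument so that the sketch runs without the model files.

```python
from typing import Any, Callable, Dict

# File names as used in the cell above; the dictionary keys are illustrative labels.
MODEL_FILES: Dict[str, str] = {
    "Potter": "modelPotter2021H.kv",
    "Biss": "modelBiss2021H.kv",
    "WarriorCats": "modelWarriorCats2021H.kv",
    "DFFF": "modelDFFF2021H.kv",
    "Mittelerde": "modelMittelerde2021H.kv",
    "Jackson": "modelJackson2021H.kv",
    "Panem": "modelPanem2021H.kv",
    "PotterOriginals": "modelPotterOriginalsH.kv",
}


def load_models(files: Dict[str, str], loader: Callable[[str], Any]) -> Dict[str, Any]:
    """Apply `loader` to every file name and return the results keyed by label."""
    return {label: loader(filename) for label, filename in files.items()}


# In the notebook this would be called with the gensim loader used above:
#   models = load_models(MODEL_FILES, word2vec.KeyedVectors.load)
```

With such a dictionary, the per-model cells further down could iterate over `models.items()` instead of repeating the same code for each model.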

# VISUALIZATION OF KEYWORDS AND OVERLAP

The following code blocks are taken from https://github.com/sismetanin/word2vec-tsne/blob/master/Visualizing%20Word2Vec%20Word%20Embeddings%20using%20t-SNE.ipynb (last viewed: 2022/09/18)

In [20]:
# visualization sensitive topics and overlap

sensitive_topics = [
                    # violence
                    "angreifen",
                    # death
                    "töten",
                    # intimacy
                    "küssen",
                    # sex
                    "erregen"
                    ]

In [5]:
embedding_clusters = []
word_clusters = []
for word in sensitive_topics:
    embeddings = []
    words = []
    for similar_word, _ in modelPotter.most_similar(word, topn=30):
        words.append(similar_word)
        embeddings.append(modelPotter[similar_word])
    embedding_clusters.append(embeddings)
    word_clusters.append(words)
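
The loop above is repeated unchanged for every model later in this notebook. As a sketch of how it could be wrapped in a helper: `collect_clusters` and `_ToyModel` are hypothetical names, and the toy class only imitates the two operations the loop uses (`most_similar` and indexing by word), so the example runs without a trained model.

```python
from typing import Any, List, Sequence, Tuple


def collect_clusters(model: Any, keywords: Sequence[str],
                     topn: int = 30) -> Tuple[List[list], List[List[str]]]:
    """For each keyword, gather its `topn` nearest neighbours and their vectors."""
    embedding_clusters = []
    word_clusters = []
    for keyword in keywords:
        neighbours = [word for word, _ in model.most_similar(keyword, topn=topn)]
        word_clusters.append(neighbours)
        embedding_clusters.append([model[word] for word in neighbours])
    return embedding_clusters, word_clusters


class _ToyModel:
    """Stand-in with the two methods the helper needs; not a real gensim model."""
    _vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}

    def most_similar(self, word, topn=10):
        # Returns every other word with a dummy similarity score of 1.0.
        others = [w for w in self._vectors if w != word]
        return [(w, 1.0) for w in others[:topn]]

    def __getitem__(self, word):
        return self._vectors[word]


emb, words = collect_clusters(_ToyModel(), ["a"], topn=2)
```

In the notebook, the call would look like `collect_clusters(modelPotter, sensitive_topics)`.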

In [6]:
embedding_clusters = np.array(embedding_clusters)
n, m, k = embedding_clusters.shape
tsne_model_en_2d = TSNE(perplexity=15, n_components=2, init='pca', n_iter=3500, random_state=32)
embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)
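
The two reshapes in the cell above flatten the `n` clusters of `m` word vectors into a single `(n * m, k)` matrix for t-SNE and then regroup the 2D result per keyword. The sketch below illustrates only this shape bookkeeping. The sizes are illustrative (the vector dimension `k` is an assumption, not taken from the thesis), and a simple column slice stands in for `TSNE(...).fit_transform` so that the example needs nothing but NumPy.

```python
import numpy as np

# Illustrative sizes: 4 keywords, 30 neighbours each, 100-dimensional vectors.
n, m, k = 4, 30, 100
rng = np.random.default_rng(0)
clusters = rng.normal(size=(n, m, k))

flat = clusters.reshape(n * m, k)        # one row per word, as a 2D reducer expects
projected = flat[:, :2]                  # stand-in for TSNE(...).fit_transform(flat)
regrouped = projected.reshape(n, m, 2)   # back to one block of m points per keyword

# Row-major reshaping keeps the cluster order: word j of keyword i is row i * m + j.
i, j = 2, 7
same_point = np.array_equal(regrouped[i, j], projected[i * m + j])
```

This only works if every keyword contributes the same number of neighbours, because otherwise `np.array` cannot build a regular three-dimensional array.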

In [7]:
def tsne_plot_similar_words(title, labels, embedding_clusters, word_clusters, a, filename=None):
    plt.figure(figsize=(16, 9))
    colors = cm.rainbow(np.linspace(0, 1, len(labels)))
    for label, embeddings, words, color in zip(labels, embedding_clusters, word_clusters, colors):
        x = embeddings[:, 0]
        y = embeddings[:, 1]
        # pass a single RGBA value via `color`; `c` triggers a matplotlib value-mapping warning
        plt.scatter(x, y, color=color, alpha=a, label=label)
        for i, word in enumerate(words):
            plt.annotate(word, alpha=0.5, xy=(x[i], y[i]), xytext=(5, 2),
                         textcoords='offset points', ha='right', va='bottom', size=8)
    plt.legend(loc=4)
    plt.title(title)
    plt.grid(True)
    if filename:
        plt.savefig(path_results +'\\' + filename, format='png', dpi=150, bbox_inches='tight')
    plt.show()
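
The function above builds the output path by concatenating strings with a literal backslash. A possible alternative, shown here as a sketch, is to let `pathlib` join the parts. `figure_path` is a hypothetical helper, and `PureWindowsPath` is used so that the example produces the same Windows-style result on any operating system.

```python
from pathlib import PureWindowsPath

# Same directory as defined in the "define path" cell.
path_results = r'C:\Users\Public\Data\Masterarbeit\results\7.2 Keyword Analyses'


def figure_path(directory: str, filename: str) -> str:
    """Join directory and file name without hand-written separators."""
    return str(PureWindowsPath(directory) / filename)


target = figure_path(path_results, 'sensitive_topics_Potter2021.png')
```

On a Windows machine, `pathlib.Path` could be used instead and its result passed directly to `plt.savefig`.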

In [8]:
tsne_plot_similar_words('Sensitive Topics Potter 2021', sensitive_topics, embeddings_en_2d, word_clusters, 0.7,
                        'sensitive_topics_Potter2021.png')

[Figure: t-SNE plot for modelPotter, saved as sensitive_topics_Potter2021.png]

In [9]:
embedding_clusters = []
word_clusters = []
for word in sensitive_topics:
    embeddings = []
    words = []
    for similar_word, _ in modelBiss.most_similar(word, topn=30):
        words.append(similar_word)
        embeddings.append(modelBiss[similar_word])
    embedding_clusters.append(embeddings)
    word_clusters.append(words)


embedding_clusters = np.array(embedding_clusters)
n, m, k = embedding_clusters.shape
tsne_model_en_2d = TSNE(perplexity=15, n_components=2, init='pca', n_iter=3500, random_state=32)
embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)


tsne_plot_similar_words('Sensitive Topics Twilight 2021', sensitive_topics, embeddings_en_2d, word_clusters, 0.7,
                        'sensitive_topics_Biss2021.png')

[Figure: t-SNE plot for modelBiss, saved as sensitive_topics_Biss2021.png]

In [11]:
# visualization sensitive topics and overlap

sensitive_topics = [
                    # violence
                    "angreifen",
                    # death
                    "töten",
                    # intimacy
                    #"küssen",
                    # sex
                    "erregen"
                    ]
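
The notebook does not state why "küssen" is commented out for this model. One possible reason, and this is only an assumption, is that the word is missing from the model's vocabulary, in which case `most_similar` would raise a `KeyError`. The sketch below shows how a keyword list could be split by vocabulary membership before the lookup. `split_by_vocabulary` is a hypothetical helper and the vocabulary is made up. With a gensim `KeyedVectors` object, the same `in` test can be applied to the model itself.

```python
from typing import Container, List, Sequence, Tuple


def split_by_vocabulary(keywords: Sequence[str],
                        vocabulary: Container[str]) -> Tuple[List[str], List[str]]:
    """Return (known, missing) keywords, preserving the original order."""
    known = [word for word in keywords if word in vocabulary]
    missing = [word for word in keywords if word not in vocabulary]
    return known, missing


# Made-up vocabulary for illustration; not the content of any trained model.
toy_vocabulary = {"angreifen", "töten", "erregen"}
known, missing = split_by_vocabulary(
    ["angreifen", "töten", "küssen", "erregen"], toy_vocabulary
)
```

Checking membership first would let a single keyword list serve all models, with missing words reported instead of being commented out by hand.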

In [12]:
embedding_clusters = []
word_clusters = []
for word in sensitive_topics:
    embeddings = []
    words = []
    for similar_word, _ in modelWarriorCats.most_similar(word, topn=30):
        words.append(similar_word)
        embeddings.append(modelWarriorCats[similar_word])
    embedding_clusters.append(embeddings)
    word_clusters.append(words)


embedding_clusters = np.array(embedding_clusters)
n, m, k = embedding_clusters.shape
tsne_model_en_2d = TSNE(perplexity=15, n_components=2, init='pca', n_iter=3500, random_state=32)
embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)


tsne_plot_similar_words('Sensitive Topics Warriors 2021', sensitive_topics, embeddings_en_2d, word_clusters, 0.7,
                        'sensitive_topics_WarriorCats2021.png')

[Figure: t-SNE plot for modelWarriorCats, saved as sensitive_topics_WarriorCats2021.png]


In [13]:
# visualization sensitive topics and overlap

sensitive_topics = [
                    # violence
                    "angreifen",
                    # death
                    "töten",
                    # intimacy
                    "küssen",
                    # sex
                    "erregen"
                    ]

In [14]:
embedding_clusters = []
word_clusters = []
for word in sensitive_topics:
    embeddings = []
    words = []
    for similar_word, _ in modelDFFF.most_similar(word, topn=30):
        words.append(similar_word)
        embeddings.append(modelDFFF[similar_word])
    embedding_clusters.append(embeddings)
    word_clusters.append(words)


embedding_clusters = np.array(embedding_clusters)
n, m, k = embedding_clusters.shape
tsne_model_en_2d = TSNE(perplexity=15, n_components=2, init='pca', n_iter=3500, random_state=32)
embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)


tsne_plot_similar_words('Sensitive Topics Three Investigators 2021', sensitive_topics, embeddings_en_2d, word_clusters, 0.7,
                        'sensitive_topics_DFFF2021.png')

[Figure: t-SNE plot for modelDFFF, saved as sensitive_topics_DFFF2021.png]

In [15]:
embedding_clusters = []
word_clusters = []
for word in sensitive_topics:
    embeddings = []
    words = []
    for similar_word, _ in modelMittelerde.most_similar(word, topn=30):
        words.append(similar_word)
        embeddings.append(modelMittelerde[similar_word])
    embedding_clusters.append(embeddings)
    word_clusters.append(words)


embedding_clusters = np.array(embedding_clusters)
n, m, k = embedding_clusters.shape
tsne_model_en_2d = TSNE(perplexity=15, n_components=2, init='pca', n_iter=3500, random_state=32)
embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)


tsne_plot_similar_words('Sensitive Topics Middle-earth 2021', sensitive_topics, embeddings_en_2d, word_clusters, 0.7,
                        'sensitive_topics_Mittelerde2021.png')

[Figure: t-SNE plot for modelMittelerde, saved as sensitive_topics_Mittelerde2021.png]

In [21]:
embedding_clusters = []
word_clusters = []
for word in sensitive_topics:
    embeddings = []
    words = []
    for similar_word, _ in modelJackson.most_similar(word, topn=30):
        words.append(similar_word)
        embeddings.append(modelJackson[similar_word])
    embedding_clusters.append(embeddings)
    word_clusters.append(words)


embedding_clusters = np.array(embedding_clusters)
n, m, k = embedding_clusters.shape
tsne_model_en_2d = TSNE(perplexity=15, n_components=2, init='pca', n_iter=3500, random_state=32)
embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)


tsne_plot_similar_words('Sensitive Topics Percy Jackson 2021', sensitive_topics, embeddings_en_2d, word_clusters, 0.7,
                        'sensitive_topics_Jackson2021.png')

[Figure: t-SNE plot for modelJackson, saved as sensitive_topics_Jackson2021.png]

In [17]:
embedding_clusters = []
word_clusters = []
for word in sensitive_topics:
    embeddings = []
    words = []
    for similar_word, _ in modelPanem.most_similar(word, topn=30):
        words.append(similar_word)
        embeddings.append(modelPanem[similar_word])
    embedding_clusters.append(embeddings)
    word_clusters.append(words)


embedding_clusters = np.array(embedding_clusters)
n, m, k = embedding_clusters.shape
tsne_model_en_2d = TSNE(perplexity=15, n_components=2, init='pca', n_iter=3500, random_state=32)
embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)


tsne_plot_similar_words('Sensitive Topics Hunger Games 2021', sensitive_topics, embeddings_en_2d, word_clusters, 0.7,
                        'sensitive_topics_Panem2021.png')

[Figure: t-SNE plot for modelPanem, saved as sensitive_topics_Panem2021.png]

In [18]:
# visualization sensitive topics and overlap

sensitive_topics = [
                    # violence
                    "angreifen",
                    # death
                    "töten",
                    # intimacy
                    #"küssen",
                    # sex
                    "erregen"
                    ]

In [19]:
embedding_clusters = []
word_clusters = []
for word in sensitive_topics:
    embeddings = []
    words = []
    for similar_word, _ in modelPotterOriginals.most_similar(word, topn=30):
        words.append(similar_word)
        embeddings.append(modelPotterOriginals[similar_word])
    embedding_clusters.append(embeddings)
    word_clusters.append(words)


embedding_clusters = np.array(embedding_clusters)
n, m, k = embedding_clusters.shape
tsne_model_en_2d = TSNE(perplexity=15, n_components=2, init='pca', n_iter=3500, random_state=32)
embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)


tsne_plot_similar_words('Sensitive Topics Potter Originals 2021', sensitive_topics, embeddings_en_2d, word_clusters, 0.7,
                        'sensitive_topics_PotterOriginals2021.png')

[Figure: t-SNE plot for modelPotterOriginals, saved as sensitive_topics_PotterOriginals2021.png]
