Before we apply the LSTM model, it's important to identify words in our MediaTenor training set (2011-2020) that do not have corresponding representations in our pre-trained embeddings. The embeddings were trained on a separate set of data from Süddeutsche Zeitung, Welt, dpa, and Handelsblatt from 1991 to 2010. The words that are not represented in the embeddings might fall into one of three categories:

1. They could be rare words that seldom appeared in the text corpus used to train the embeddings.
2. They could be words that entered common usage after 2010, so they did not appear in the corpus used for the embeddings.
3. They could be misspelled or incorrect words in the MediaTenor training set, which is why they did not appear in the embeddings.

We will create a list of these words and later exclude them from our texts. This is important to ensure the compatibility of our training set with the pre-trained embeddings.

In [1]:
import numpy as np

# Open and read articles from the 'articles.txt' file 
with open('MediaTenor_data/articles.txt', 'r', encoding = 'utf-8') as f:
    articles = f.read()

# Open and read labels from the 'labels_binary.txt' file    
with open('MediaTenor_data/labels_binary.txt', 'r', encoding = 'utf-8') as f:
    labels = f.read()

In [2]:
# Print the first 1000 characters of the articles content
print(articles[:1000])
print()

# Print the first 20 characters of the labels content
print(labels[:20])

SPIEGEL-Gespräch mit Expräsident Lula da Silva über Korruptionsvorwürfe und seine Nachfolgerin Dilma Rousseff. Der ehemalige brasilianische Präsident Luiz Inácio Lula da Silva, 70, spricht über die Krise in seinem Land, wehrt sich gegen Korruptionsvorwürfe und verteidigt seine Nachfolgerin Dilma Rousseff. SPIEGEL: Herr Präsident, in einem Monat beginnen in Rio de Janeiro die Olympischen Spiele, aber das Land befindet sich in einer schweren politischen und wirtschaftlichen Krise. Als Brasilien den Zuschlag bekam, galt es als kommender Star unter den Schwellenländern. Wie konnte es zu diesem Absturz kommen? Lula: Als ich die Olympischen Spiele und die Fußballweltmeisterschaft nach Brasilien geholt habe, glaubte ich tatsächlich, wir würden uns bis 2016 im Kreis der fünf oder sechs größten Wirtschaftsmächte etablieren. Aber wir leiden noch immer unter einer weltweiten Finanzkrise. Die USA haben sich nicht wirklich erholt, Europa steckt weiter in der Krise, Chinas Wirtschaftsleistung geht z

To correctly identify words in the MediaTenor training set that do not have a corresponding embedding representation, we must ensure that we pre-process MediaTenor data using the same steps as were applied to the main corpus during the embeddings training. These steps include transforming all characters to lowercase, removing URLs, removing all punctuation and non-alphabetic characters, consolidating multiple whitespaces into a single one, and eliminating single-letter tokens from the text. However, an additional step specific to MediaTenor data is the exclusion of metadata. These words or phrases are not part of the actual content we wish to analyze. 

In [3]:
import re
from string import punctuation

def remove_multiple_spaces(text):
    """
    This function removes multiple spaces in a string. 
    It uses a regular expression to match 2 or more spaces and replaces them with a single space.
    """
    text = re.sub(r'\s{2,}', ' ', text)
    return text

def remove_short_words(text):
    """
    This function removes words of length 1 from a string.
    """
    text = ' '.join([word for word in text.split() if len(word) > 1])
    return text

def remove_metadata(text, meta_list):
    """
    This function removes metadata from a text.
    Metadata is a list of phrases. If any of these phrases are found in the text,
    everything from the phrase and onwards is cut off.
    """
    for phrase in meta_list:
        if phrase in text:
            text = text.split('dokument', 1)[0]
    return text

# List of metadata phrases
metadata_phrases = ['dokument bihann', 'dokument bid', 'dokument welt', 'dokument bberbr', 'dokument focus']

# Convert articles to lowercase
articles = articles.lower()

# Remove URLs
articles = re.sub(r'https\S+|http\S+|www.\S+', '', articles)

# Remove punctuation
articles = articles.replace('.', ' ').replace('-', ' ').replace('/', ' ')
articles = ''.join([c for c in articles if c not in punctuation and c not in ['»', '«']])

# Remove non-alphabetic characters from the text
articles = ''.join([c for c in articles if (c.isalpha() or c in [' ', '\n'])])

# Split articles by new lines
articles_split = articles.split('\n')

# Remove multiple spaces, short words, and metadata
articles_split = list(map(remove_multiple_spaces, articles_split))
articles_split = list(map(remove_short_words, articles_split))
articles_split = list(map(lambda text: remove_metadata(text, metadata_phrases), articles_split))

# Join all articles into a single string
all_text = ' '.join(articles_split)

# Create a list of all words in the MediaTenor data
words = all_text.split()

Next, we identify words in the vocabulary that don't have a pre-trained vector.

In [4]:
from collections import Counter
import csv
import numpy as np
import os

# Count the occurrences of each word in the articles
word_counts = Counter(words)

# Sort words by their count, in descending order
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)

# Create a dictionary that maps each word to a unique integer
vocab_to_int = {word: idx for idx, word in enumerate(sorted_vocab)}

# Set the path variable to point to the 'word_embeddings' directory.
path = os.getcwd().replace('\\sentiment', '') + '\\word_embeddings'

# Identify words in the vocabulary that do have a pre-trained vector
with open(path + '\\news_word2vec.txt', 'r', encoding='utf-8', newline='\n', errors='ignore') as f:
    pretrained_vectors_found = 0
    words_with_pretrained_vector = []
    for line in f:
        tokens = line.rstrip().split(' ')
        word = tokens[0]
        if word in vocab_to_int:
            pretrained_vectors_found += 1
            words_with_pretrained_vector.append(word)
    print(f"There are {pretrained_vectors_found} / {len(vocab_to_int)} pretrained vectors found.")
    
# Create a list of words in the vocabulary that don't have a pre-trained vector
words_without_pretrained_vector = list((Counter(list(vocab_to_int.keys())) - Counter(words_with_pretrained_vector)).elements())

# Save words without pre-trained vector to a csv file
with open("words_without_pretrained_vector.csv", 'w', encoding='utf-8-sig') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL,  delimiter='\n')
    writer.writerow(words_without_pretrained_vector)

There are 116935 / 141222 pretrained vectors found.
