# INFO 4271 - Exercise 2 - Text Representation

Issued: April 22, 2025

Due: April 28, 2025

Please submit this filled sheet via Ilias by the due date.

---

# 1. Bag-of-Words Models
In class we discussed BOW vectorization models under which documents are represented via term frequency counts.

a) Construct term frequency BOW representations for the following sentences:

- "The government is open."
- "The government is closed."
- "Long live Mickey Mouse, emperor of all!"
- "Darn! This will break."

In [1]:
import string

corpus = [['The government is open.'], ['The government is closed.'], ['Long live Mickey Mouse, emperor of all!'], ['Darn! This will break.']]

#Turn a corpus of arbitrary texts into term-frequency weighted BOW vectors.
def TF(corpus):
    vecs = []
    wordlists = []
    for doc in corpus:
        vec = {}
        sntc = doc[0].lower().translate(str.maketrans('', '', string.punctuation))
        wordlists.append(sntc.split())
    lex = set([word for words in wordlists for word in words])

    for words in wordlists:
        vec = {key: 0 for key in lex}
        for word in words:
            if word not in vec:
                vec[word] = 1
            else:
                vec[word] += 1
        vecs.append(vec)
        # print(vec)
    return vecs

TF(corpus)

[{'government': 1,
  'closed': 0,
  'will': 0,
  'long': 0,
  'all': 0,
  'mickey': 0,
  'open': 1,
  'of': 0,
  'live': 0,
  'emperor': 0,
  'the': 1,
  'this': 0,
  'break': 0,
  'mouse': 0,
  'darn': 0,
  'is': 1},
 {'government': 1,
  'closed': 1,
  'will': 0,
  'long': 0,
  'all': 0,
  'mickey': 0,
  'open': 0,
  'of': 0,
  'live': 0,
  'emperor': 0,
  'the': 1,
  'this': 0,
  'break': 0,
  'mouse': 0,
  'darn': 0,
  'is': 1},
 {'government': 0,
  'closed': 0,
  'will': 0,
  'long': 1,
  'all': 1,
  'mickey': 1,
  'open': 0,
  'of': 1,
  'live': 1,
  'emperor': 1,
  'the': 0,
  'this': 0,
  'break': 0,
  'mouse': 1,
  'darn': 0,
  'is': 0},
 {'government': 0,
  'closed': 0,
  'will': 1,
  'long': 0,
  'all': 0,
  'mickey': 0,
  'open': 0,
  'of': 0,
  'live': 0,
  'emperor': 0,
  'the': 0,
  'this': 1,
  'break': 1,
  'mouse': 0,
  'darn': 1,
  'is': 0}]

b) Extend the term frequency model by an inverse document frequency (IDF) component. Estimate IDFs based on the Reuters 21578 collection.

In [1]:
import nltk
from nltk.corpus import reuters
import string
import math
from collections import Counter

#Download the documents
nltk.download("reuters")
documents = reuters.fileids()

docs = list(filter(lambda doc: doc.startswith("train"), documents))
print(str(len(docs)) + " total train documents")

#To access the content of a news article, we can use the reuters.words() function
print("The first document contains "+str(len(reuters.words(docs[0])))+" words.\nHere they are:")
# for word in reuters.words(docs[0]): print(word)
print(reuters.words(docs[0]))

#Estimate inverse document frequencies based on a corpus of documents.
def IDF(corpus):
    idfs = {}
    filtered = []
    for doc in corpus:
        f = reuters.words(doc)
        f = [word.lower() for word in f if word not in string.punctuation]
        filtered.extend(set(f))

    counts = Counter(filtered)
    for i, word in enumerate(set(filtered)):
        idfs[word] = math.log(len(corpus) / counts[word])
    print("Processing IDF complete. unique words:", len(idfs), "documents:", len(corpus))
    return idfs

#Turn a corpus of arbitrary texts into TF-IDF weighted BOW vectors.
def TFIDF(corpus):
    vecs = []
    idf = IDF(corpus)
    filtered = []
    for doc in corpus:
        f = reuters.words(doc)
        f = [word.lower() for word in f if word not in string.punctuation]
        filtered.append(f)

    vec_base = {key: 0 for key in idf.keys()}
    for i, doc in enumerate(filtered):
        vec = vec_base.copy()
        counts = Counter(doc)
        for word in set(doc):
            vec[word] = counts[word] * idf[word]
        vecs.append(vec)
        if i % 1000 == 0: print(i, "of", len(corpus), "documents processed.", end="\r")
    print("Generating TFIDF completed. Generated", len(vecs), "vectors.")
    return vecs

idf = IDF(docs)
print(list(idf.items())[:5])
tfidf = TFIDF(docs)
print(list(tfidf[0].items())[:5])
# for i in tfidf:
#     if list(i.items())[0][1] > 0:
#         print(list(i.items())[:5])

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\simon\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!


7769 total train documents
The first document contains 633 words.
Here they are:
['BAHIA', 'COCOA', 'REVIEW', 'Showers', 'continued', ...]
Processing IDF complete. unique words: 26410 documents: 7769
[('adventist', 8.95789673495042), ('fundamental', 5.662059868946092), ('midmississippi', 7.5716023738305305), ('chance', 4.738389029774314), ('schoufour', 8.264749554390475)]
Processing IDF complete. unique words: 26410 documents: 7769
Generating TFIDF completed. Generated 7769 vectors.
[('adventist', 0), ('fundamental', 0), ('midmississippi', 0), ('chance', 0), ('schoufour', 0)]


c) Bag-of-words models are order invariant. They do not retain the ordering in which terms occur in the document. Is there any way to include term order information in these models? Justify your answer below.

Retaining order could be somewhat possible when saving the vectors in dicts. This would make it possible to track which words occur first before other words by changing the order of the vector values in the dict.

# 2. Topic Models
Topic models represent textual documents in terms of their distribution of latent topics. Imagine you have trained a 10-topic LDA model. Each topic is a frequency distribution over thousands of terms. Is there a good way of illustrating the meaning of the learned topics to a human? Discuss the advantages and disadvantages of some of the possible options below.

When stopwords are removed from the classification it could be very helpful to look at the most frequent words for each topic (e.g. word cloud). These words should have high importance for the generated categories and could give a good overview over the topic.