Text Analytics I HWS 23/24

# Home Assignment 4 (38pts)

**There is a shortcut to get the 17pts from Task 2 without implementing Task 1, see Task 2e)**


Submit your solution via Ilias by 23:59 on Friday, November 24th. Late submissions are **not possible**.

Submit your solutions in teams of 3-4 students. Unless explicitly agreed otherwise in advance, **submissions from teams with more or fewer members will NOT be graded**.
List all members of the team with their student ID in the cell below, and submit only one notebook per team. Only submit a notebook, do not submit the dataset(s) you used. Also, do NOT compress/zip your submission!

You may use the code from the exercises and basic functionalities that are explained in official documentation of Python packages without citing, __all other sources must be cited__. In case of plagiarism (copying solutions from other teams or from the internet) ALL team members may be expelled from the course without warning.

#### General guidelines:
* Make sure that your code is executable; any task for which the code does not run directly on our machine will be graded with 0 points.
* If you use packages that are not available on the default or conda-forge channel, list them below. Also add a link to installation instructions. 
* Ensure that the notebook does not rely on the current notebook or system state!
  * Use `Kernel --> Restart & Run All` to see if you are using any definitions, variables etc. that 
    are not in scope anymore.
  * Do not rename any of the datasets you use, and load them from the same directory in which your ipynb notebook is located, i.e., your working directory.
* Make sure you clean up your code before submission, e.g., properly align your code, and delete every line of code that you do not need anymore, even if you may have experimented with it. Minimize usage of global variables. Avoid reusing variable names multiple times!
* Ensure your code/notebook terminates in reasonable time.
* Feel free to use comments in the code. While we do not require them to get full marks, they may help us in case your code has minor errors.
* For questions that require a textual answer, please do not write the answer as a comment in a code cell, but in a Markdown cell below the code. Always remember to provide sufficient justification for all answers.
* You may create as many additional cells as you want, just make sure that the solutions to the individual tasks can be found near the corresponding assignment.
* If you have any general question regarding the understanding of some task, do not hesitate to post in the student forum in Ilias, so we can clear up such questions for all students in the course.

In [1]:
# studentIDs of all team members
team_members = [12345,67899,880800,234242]

Additional packages (if any):
 - Example: `powerlaw`, https://github.com/jeffalstott/powerlaw

In [2]:
from typing import List, Union, Dict, Set, Tuple
from numpy.typing import NDArray
import numpy as np
from nltk.tokenize import word_tokenize

### Task 1: Term Frequency - Inverse Document Frequency (21 pts)

In this task we want to use the term frequency - inverse document frequency (tf-idf) weighting method to compare documents with each other and with queries. In the end, we will apply our method to a subset of Wikipedia pages (more specifically: only the introduction sections) that are linked to from the English Wikipedia page of Mannheim.

In case you need to tokenize any sentences in the following tasks, please use a tokenizer from NLTK and not the ``string.split`` function.

__a)__ To test your implementation throughout this task, you are given the example from exercise 8. Start by implementing a function ``process_docs`` that takes the provided dictionary of documents and returns the following data structures. __(4 pts)__

- ``word2index``: a dictionary that maps each word that appears in any document to a unique integer identifier starting at 0 
- ``doc2index``: a dictionary that maps each document name (here given as the dictionary keys) to a unique integer identifier starting at 0
- ``index2doc``: a dictionary that maps each document identifier to the corresponding document name (reverse to ``doc2index``)
- ``doc_word_vectors``: a dictionary that maps each document name to a list of word ids that indicate which words appeared in the document in their order of appearance. Words that appear multiple times must also be included multiple times.

In [3]:
# example from exercise 8
d1 = "cold beer beach"
d2 = "ice cream beer beer"
d3 = "beach cold ice cream"
d4 = "cold beer frozen yogurt frozen beer"
d5 = "frozen ice ice beer ice cream"
d6 = "yogurt ice cream ice cream"

docs = {"d1": d1, "d2": d2, "d3": d3, "d4": d4, "d5": d5, "d6": d6}

In [4]:
def process_docs(docs: Dict[str, str]) -> (Dict[str, int], Dict[str, int], Dict[int, str], Dict[str, List[int]]):
    """
    :params docs: dict that maps each document name to the document content
    :returns:
        - word2index: dict that maps each word to a unique id
        - doc2index: dict that maps each document name to a unique id
        - index2doc: dict that maps ids to their associated document name
        - doc_word_vectors: dict that maps each document name to a list of word ids that appear in it
    """
    word2index = {}
    doc2index = {}
    index2doc = {}
    doc_word_vectors = {}

    word_index = 0
    doc_index = 0

    for doc_name, doc_content in docs.items():
        # Tokenize the document content using NLTK tokenizer
        tokens = word_tokenize(doc_content.lower())  # Convert to lowercase for consistency

        doc_word_vectors[doc_name] = []

        for token in tokens:
            if token not in word2index:
                word2index[token] = word_index
                word_index += 1

            doc_word_vectors[doc_name].append(word2index[token])

        doc2index[doc_name] = doc_index
        index2doc[doc_index] = doc_name
        doc_index += 1

    return word2index, doc2index, index2doc, doc_word_vectors

word2index, doc2index, index2doc, doc_word_vectors = process_docs(docs)

# Print the results
print("word2index:", word2index)
print("doc2index:", doc2index)
print("index2doc:", index2doc)
print("doc_word_vectors:", doc_word_vectors)

word2index: {'cold': 0, 'beer': 1, 'beach': 2, 'ice': 3, 'cream': 4, 'frozen': 5, 'yogurt': 6}
doc2index: {'d1': 0, 'd2': 1, 'd3': 2, 'd4': 3, 'd5': 4, 'd6': 5}
index2doc: {0: 'd1', 1: 'd2', 2: 'd3', 3: 'd4', 4: 'd5', 5: 'd6'}
doc_word_vectors: {'d1': [0, 1, 2], 'd2': [3, 4, 1, 1], 'd3': [2, 0, 3, 4], 'd4': [0, 1, 5, 6, 5, 1], 'd5': [5, 3, 3, 1, 3, 4], 'd6': [6, 3, 4, 3, 4]}


In [5]:
# The output for the provided example could look like this:

# word2index:
# {'cold': 0, 'beer': 1, 'beach': 2, 'ice': 3, 'cream': 4, 'frozen': 5, 'yogurt': 6}

# doc2index:
# {'d1': 0, 'd2': 1, 'd3': 2, 'd4': 3, 'd5': 4, 'd6': 5}

# index2doc
# {0: 'd1', 1: 'd2', 2: 'd3', 3: 'd4', 4: 'd5', 5: 'd6'}

# doc_word_vectors:
# {'d1': [0, 1, 2],
#  'd2': [3, 4, 1, 1],
#  'd3': [2, 0, 3, 4],
#  'd4': [0, 1, 5, 6, 5, 1],
#  'd5': [5, 3, 3, 1, 3, 4],
#  'd6': [6, 3, 4, 3, 4]}

__b)__ Set up a term-document matrix where each column corresponds to a document and each row corresponds to a word that was observed in any of the documents. The row/column indices should correspond to the word/document ids that are set in the input dicts ``word2index`` and ``doc2index``. Count how often each word appears in each document and fill the term-document matrix. __(3 pts)__

_Example: The word "beer" with the word id_ ``1`` _appears two times in the document "d4" that has the document id_ ``3``. _Therefore the entry at position_ ``[1, 3]`` _in the term-document matrix is_ ``2``.
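
As a quick sanity check of the example above, the counts of a single document can also be accumulated in one vectorized call. The sketch below is only an illustration, not the required solution: it hard-codes the word ids of "d4" from the sample output of part a), and the variable names `d4_word_ids`, `d4_doc_id` and `counts` are assumptions.

```python
import numpy as np

# Word ids of document "d4" ("cold beer frozen yogurt frozen beer"),
# copied from the sample output of part a); "d4" has document id 3.
d4_word_ids = [0, 1, 5, 6, 5, 1]
d4_doc_id = 3

# 7 vocabulary words (rows) x 6 documents (columns)
counts = np.zeros((7, 6))

# np.add.at accumulates repeated indices, so a word id that occurs twice adds 2
np.add.at(counts, (d4_word_ids, d4_doc_id), 1)

print(counts[1, 3])  # count of "beer" (id 1) in "d4" (id 3)
```

A plain loop over the word ids, incrementing one cell at a time, gives the same matrix; `np.add.at` is used here only because ordinary fancy-index assignment would count repeated ids once.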



In [5]:
def term_document_matrix(doc_word_v: Dict[str, List[int]], doc2index: Dict[str, int], word2index: Dict[str, int]) -> NDArray[NDArray[float]]:
    """
    :param doc_word_v: dict that maps each document to the list of word ids that appear in it
    :param doc2index: dict that maps each document name to a unique id
    :param word2index: dict that maps each word to a unique id
    :return: term-document matrix (each word is a row, each document is a column) that indicates the count of each word in each document 
    """
    # your code here
    num_words = len(word2index)
    num_docs = len(doc2index)

    term_doc_matrix = np.zeros((num_words, num_docs))

    for doc_name, word_ids in doc_word_v.items():
        doc_index = doc2index[doc_name]

        for word_id in word_ids:
            term_doc_matrix[word_id, doc_index] += 1

    return term_doc_matrix

# Example usage:
term_doc_matrix = term_document_matrix(doc_word_vectors, doc2index, word2index)

# Print the term-document matrix
print(term_doc_matrix)

[[1. 0. 1. 1. 0. 0.]
 [1. 2. 0. 2. 1. 0.]
 [1. 0. 1. 0. 0. 0.]
 [0. 1. 1. 0. 3. 2.]
 [0. 1. 1. 0. 1. 2.]
 [0. 0. 0. 2. 1. 0.]
 [0. 0. 0. 1. 0. 1.]]


__c)__ Implement the function ``to_tf_idf_matrix`` that takes a term-document matrix and returns the corresponding term frequency (tf) matrix. If the parameter ``idf`` is set to ``True``, the tf-matrix should further be transformed to a tf-idf matrix (i.e. every entry corresponds to the tf-idf value of the associated word and document). Your implementation should leave the input term-document matrix unchanged. __(3 pts)__

Recall the following formulas:

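\begin{equation}
  \text{tf}_{t,d} =
    \begin{cases}
      1+\log_{10}\text{count}(t,d) & \text{if count}(t, d) > 0\\
      0 & \text{otherwise}
    \end{cases}
\end{equation}

$$\text{idf}_t = \log_{10}\left(\frac{N}{\text{df}_t}\right)$$

$$\text{tf-idf}_{t,d} = \text{tf}_{t,d} \cdot \text{idf}_t$$

Here, $N$ is the number of documents and $\text{df}_t$ is the number of documents that contain the term $t$.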
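
To make the formulas concrete, the following minimal sketch computes the three quantities by hand for the word "beer" in document "d4" of the running example. The numbers (6 documents, count 2, document frequency 4) are read off the example documents; the variable names are assumptions made for this illustration.

```python
import numpy as np

n_docs = 6          # documents d1..d6 in the running example
count_beer_d4 = 2   # "beer" occurs twice in d4
df_beer = 4         # "beer" occurs in d1, d2, d4 and d5

# tf is 1 + log10(count) for non-zero counts and 0 otherwise
tf = 1 + np.log10(count_beer_d4) if count_beer_d4 > 0 else 0.0
idf = np.log10(n_docs / df_beer)
tf_idf = tf * idf

print(f"tf = {tf:.4f}, idf = {idf:.4f}, tf-idf = {tf_idf:.4f}")
# tf = 1.3010, idf = 0.1761, tf-idf = 0.2291
```

The resulting value should reappear at position ``[1, 3]`` of the tf-idf matrix computed for the example.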

In [6]:
def to_tf_idf_matrix(td_matrix: NDArray[NDArray[float]], idf: bool=True) -> NDArray[NDArray[float]]:
    """
    :param td_matrix: term-document matrix 
    :param idf: computes the tf-idf matrix if True, otherwise computes only the tf matrix
    :return: matrix with tf(-idf) values for each word-document pair 
    """
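    # Apply 1 + log10(count) only to non-zero counts; this avoids log10(0) warnings
    tf_matrix = np.zeros(td_matrix.shape)
    nonzero = td_matrix > 0
    tf_matrix[nonzero] = 1 + np.log10(td_matrix[nonzero])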

    if idf:
        N = td_matrix.shape[1]  # Number of documents
        df = np.count_nonzero(td_matrix, axis=1)  # Document frequency for each word

        idf_matrix = np.log10(N / df)
        tf_idf_matrix = tf_matrix * idf_matrix[:, np.newaxis]  # Broadcasting idf values to match the matrix dimensions

        return tf_idf_matrix
    else:
        return tf_matrix

# Example usage:
tf_matrix = to_tf_idf_matrix(term_doc_matrix, idf=False)
tf_idf_matrix = to_tf_idf_matrix(term_doc_matrix, idf=True)

# Print the results
print("TF Matrix:")
print(tf_matrix)

print("\nTF-IDF Matrix:")
print(tf_idf_matrix)

TF Matrix:
[[1.         0.         1.         1.         0.         0.        ]
 [1.         1.30103    0.         1.30103    1.         0.        ]
 [1.         0.         1.         0.         0.         0.        ]
 [0.         1.         1.         0.         1.47712125 1.30103   ]
 [0.         1.         1.         0.         1.         1.30103   ]
 [0.         0.         0.         1.30103    1.         0.        ]
 [0.         0.         0.         1.         0.         1.        ]]

TF-IDF Matrix:
[[0.30103    0.         0.30103    0.30103    0.         0.        ]
 [0.17609126 0.22910001 0.         0.22910001 0.17609126 0.        ]
 [0.47712125 0.         0.47712125 0.         0.         0.        ]
 [0.         0.17609126 0.17609126 0.         0.26010814 0.22910001]
 [0.         0.17609126 0.17609126 0.         0.17609126 0.22910001]
 [0.         0.         0.         0.62074906 0.47712125 0.        ]
 [0.         0.         0.         0.47712125 0.         0.47712125]]




__d)__ We want to test our implementation on our running example. First, print the tf-idf for each word of the query ``ice beer`` with respect to each document. Second, find the two most similar documents from ``d1, d2, d3`` according to cosine similarity and print all similarity values.  __(3 pts)__

In [7]:
from sklearn.metrics.pairwise import cosine_similarity

# tf-idf matrix of the running example (rows: words, columns: documents)
tf_idf_matrix = to_tf_idf_matrix(term_doc_matrix, idf=True)

# Tokenize the query with NLTK
query = "ice beer"
query_tokens = word_tokenize(query.lower())

# Print the tf-idf value of each query word with respect to each document
print(f"tf-idf values for the query '{query}':")
for token in query_tokens:
    for doc_name, doc_index in doc2index.items():
        print(f"{token}, {doc_name}: {tf_idf_matrix[word2index[token], doc_index]:.3f}")

# Pairwise cosine similarities between d1, d2 and d3
# (documents are columns of the matrix, hence the transpose)
doc_names = ["d1", "d2", "d3"]
doc_vectors = tf_idf_matrix[:, [doc2index[name] for name in doc_names]].T
similarities = cosine_similarity(doc_vectors)

print("\nCosine similarity values:")
best_pair, best_similarity = None, -1.0
for i in range(len(doc_names)):
    for j in range(i + 1, len(doc_names)):
        print(f"{doc_names[i]} - {doc_names[j]}: {similarities[i, j]:.3f}")
        if similarities[i, j] > best_similarity:
            best_pair, best_similarity = (doc_names[i], doc_names[j]), similarities[i, j]

print(f"\nMost similar documents: {best_pair[0]} and {best_pair[1]} (similarity {best_similarity:.3f})")


tf-idf values for the query 'ice beer':
ice, d1: 0.000
ice, d2: 0.176
ice, d3: 0.176
ice, d4: 0.000
ice, d5: 0.260
ice, d6: 0.229
beer, d1: 0.176
beer, d2: 0.229
beer, d3: 0.000
beer, d4: 0.229
beer, d5: 0.176
beer, d6: 0.000

Cosine similarity values:
d1 - d2: 0.202
d1 - d3: 0.873
d2 - d3: 0.297

Most similar documents: d1 and d3 (similarity 0.873)




__e)__ In a second step we want to find the documents that are most similar to a provided query. Therefore, implement the function ``process_query`` that creates a vector representation of the query. __(5 pts)__

Create a vector that has an entry for each vocabulary word (words that appeared in any document), where the entry at position ``i`` indicates how often the word with id ``i`` (as indicated by ``word2index``) appears in the query.

If ``tf`` is set to ``True``, you should transform all entries to tf-values. Similarly, if ``idf`` is set to ``True``, return a vector with tf-idf values (cf. task __c)__). The computation of the idf values is based on the corresponding input term-document matrix.

In case the query contains words that are not in any of the documents, print an appropriate error message and stop the computation.

In [11]:
def process_query(query: List[str], word2index: Dict[str, int], td_matrix: NDArray[NDArray[float]], tf: bool=True, idf: bool=True) -> NDArray[float]:
    """
    :param query: list of query tokens
    :param word2index: dict that maps each word to a unique id
    :param td_matrix: term-document matrix
    :param tf: computes the tf vector of the query if True
    :param idf: computes the tf-idf vector of the query if True, ignored if tf=False
    :return: vector representation of the input query (using tf(-idf))    
    """
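    # Count how often each vocabulary word appears in the query
    query_vector = np.zeros(len(word2index))
    for token in query:
        if token not in word2index:
            # Out-of-vocabulary word: report it and stop the computation
            print(f"Error: query word '{token}' does not appear in any document.")
            return None
        query_vector[word2index[token]] += 1

    if tf:
        # tf = 1 + log10(count) for non-zero counts, 0 otherwise
        nonzero = query_vector > 0
        query_vector[nonzero] = 1 + np.log10(query_vector[nonzero])

        if idf:  # idf is only applied on top of tf values
            n_docs = td_matrix.shape[1]  # Number of documents
            df = np.count_nonzero(td_matrix, axis=1)  # Document frequency for each word
            query_vector *= np.log10(n_docs / df)  # Element-wise multiplication for tf-idf values

    return query_vector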

# Example usage:
query = ["ice", "beer"]
query_vector_tf_idf = process_query(query, word2index, term_doc_matrix, tf=True, idf=True)
print("TF-IDF Vector for Query 'ice beer':", query_vector_tf_idf)
            

TF-IDF Vector for Query 'ice beer': [0.         0.17609126 0.         0.17609126 0.         0.
 0.        ]




__f)__ Implement a function ``most_similar_docs`` that gets the vector representation of a query (in terms of counts, tf values or tf-idf values) and a term-document matrix that can either contain counts, tf-values or tf-idf values.  Compute the cosine similarity between the query and all documents and return the document names and the cosine similarity values of the top-``k`` documents that are most similar to the query. The value of ``k`` should be specified by the user. __(3 pts)__
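
For reference, the cosine similarity between a query vector $q$ and a document vector $d$ is $\frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}$. The sketch below shows one possible way to compute it for all documents at once with plain NumPy and to pick the top-``k`` document ids. The helper name `cosine_to_all_docs` and the toy data are assumptions made for this illustration, and the code assumes that neither the query nor any document column is an all-zero vector.

```python
import numpy as np

def cosine_to_all_docs(query_v, td_matrix):
    """Hypothetical helper: cosine similarity of a query to every column."""
    dots = query_v @ td_matrix  # one dot product per document
    norms = np.linalg.norm(query_v) * np.linalg.norm(td_matrix, axis=0)
    return dots / norms         # assumes no zero vectors

# Toy data: 3 words (rows) x 3 documents (columns)
toy_td = np.array([[1.0, 0.0, 2.0],
                   [0.0, 1.0, 2.0],
                   [0.0, 1.0, 0.0]])
toy_query = np.array([1.0, 1.0, 0.0])

sims = cosine_to_all_docs(toy_query, toy_td)
k = 2
top_k = np.argsort(sims)[::-1][:k]  # document ids, most similar first
print(top_k, sims[top_k])
```

In this toy example the third document points in the same direction as the query, so it ranks first with a similarity of 1.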

In [12]:
def most_similar_docs(query_v: NDArray[float], td_matrix: NDArray[NDArray[float]], index2doc: Dict[int, str], k: int) -> (List[str], List[float]):
    """
    :param query_v: vector representation of the input query
    :param td_matrix: term-document matrix, possibly with tf-(idf) values 
    :param index2doc: dict that maps each document id to its name
    :param k: number of documents to return
    :returns:
        - list with names of the top-k most similar documents to the query, ordered by descending similarity
        - list with cosine similarities of the top-k most similar docs, ordered by descending similarity
    """
    # your code here
    cosine_similarities = cosine_similarity(query_v.reshape(1, -1), td_matrix.T)

    # Get the indices of the top-k most similar documents
    top_k_indices = np.argsort(cosine_similarities[0])[-k:][::-1]

    # Get the document names and similarity values
    top_k_docs = [index2doc[i] for i in top_k_indices]
    top_k_similarities = [cosine_similarities[0, i] for i in top_k_indices]

    return top_k_docs, top_k_similarities

# Example usage:
query_vector_tf_idf = process_query(["ice", "beer"], word2index, term_doc_matrix, tf=True, idf=True)
top_k_docs, top_k_similarities = most_similar_docs(query_vector_tf_idf, term_doc_matrix, index2doc, k=2)

# Print the results
print(f"\nTop 2 most similar documents to the query:")
for i, doc in enumerate(top_k_docs):
    print(f"{i + 1}. {doc} - Similarity: {top_k_similarities[i]}")


Top 2 most similar documents to the query:
1. d2 - Similarity: 0.8660254037844388
2. d5 - Similarity: 0.8164965809277261




## Task 2: Text Classification (17pts)
In this task, we want to build a logistic regression classifier to classify 20newsgroups posts. As feature representation, we want to use tf-idf vectors as just implemented.

### Logistic Regression
Implement a logistic regression classifier, similar to exercise 7. Again, you don't need to add a bias weight/feature.

__a)__ Implement the `predict_proba` function in the `LogisticRegression` class below. Your function should return the output of a logistic regression classifier according to the current assignments of weights $\mathbf{w}$, i.e., 
$$
\text{expit}(\mathbf{w}^T\mathbf{x})
$$
You can assume that model weights are stored in a variable `self.w`. __(3pts)__

__b)__ Implement the `predict` function in the `LogisticRegression` class below. The prediction should return class `1` if the classifier output is above 0.5, otherwise `0`. __(3pts)__

__c)__ Implement the `fit` function to learn the model parameters `w` with stochastic gradient descent for one epoch, i.e., going over the training data once. Store the learned parameters in a variable `self.w`. Only initialize the parameters randomly in the first training iteration and continue with the learned parameters in later iterations. Make sure that you iterate over the instances in random order in each epoch. __(5pts)__
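
For the cross-entropy loss, the gradient with respect to $\mathbf{w}$ for a single instance $(\mathbf{x}, y)$ is $(\text{expit}(\mathbf{w}^T\mathbf{x}) - y)\,\mathbf{x}$, so one stochastic gradient step is $\mathbf{w} \leftarrow \mathbf{w} - \eta\,(\text{expit}(\mathbf{w}^T\mathbf{x}) - y)\,\mathbf{x}$. The sketch below illustrates this update on a tiny toy dataset. It is not the class required above: the names `sigmoid`, `x_toy` and `y_toy` are assumptions, the weights start at zero instead of random values to keep the run deterministic, and `sigmoid` is written out with NumPy although it computes the same function as `scipy.special.expit`.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), the function expit computes."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only

# Toy data: 4 instances, 2 features; the class depends on which feature dominates
x_toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y_toy = np.array([1, 1, 0, 0])

w = np.zeros(2)  # zeros here; the task asks for random initialization
eta = 0.5

for epoch in range(50):
    for i in rng.permutation(len(x_toy)):  # new random order in every epoch
        error = sigmoid(x_toy[i] @ w) - y_toy[i]
        w -= eta * error * x_toy[i]        # one stochastic gradient step

predictions = (sigmoid(x_toy @ w) > 0.5).astype(int)
print(predictions)
```

Because the weights are kept between epochs, running the inner loop repeatedly corresponds to calling a one-epoch `fit` several times on the same model.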


In [14]:
from scipy.special import expit

class LogisticRegression():
    '''Logistic Regression Classifier.'''
    def __init__(self):
        self.w = None
    
    def fit(self, x: NDArray[NDArray[float]], y: NDArray[int], eta: float=0.1):
        '''
        :param x: 2D numpy array where each row is an instance
        :param y: 1D numpy array with target classes for instances in x
        :param eta: learning rate, default is 0.1
        '''
        # c)
        # Initialize weights randomly if not already initialized
        if self.w is None:
            self.w = np.random.randn(x.shape[1])
        
        # Shuffle the data
        indices = np.arange(x.shape[0])
        np.random.shuffle(indices)
        x = x[indices]
        y = y[indices]
        
        for i in range(x.shape[0]):
            xi = x[i]
            yi = y[i]
            
            # Predict
            y_pred = self.predict_proba(xi)
            
            # Update weights
            gradient = xi * (y_pred - yi)
            self.w -= eta * gradient
        
    def predict_proba(self, x):
        # a)
        return expit(np.dot(x, self.w))
        
    def predict(self, x):
        # b)
        return (self.predict_proba(x) > 0.5).astype(int)
    

__e)__ Apply your model to the two categories 'comp.windows.x' and 'rec.motorcycles' from the 20newsgroups data. To this end, first transform the training data to a tf-idf representation with the functions `process_docs`, `term_document_matrix` and `to_tf_idf_matrix`. Transform the test documents with `process_query`. Fit your model on the training data for 10 epochs. Calculate the accuracy on the test data. __(6pts)__

**Shortcut**: use the `TfidfVectorizer` from scikit-learn (you may need to transform its output to a dense (array) representation).
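
A minimal sketch of the shortcut on toy documents is shown below; the names `toy_train`, `toy_test`, `x_train` and `x_test` are assumptions made for this illustration. Note that `TfidfVectorizer` returns a documents-by-terms matrix (the transpose of the layout in Task 1), and that with its default settings it uses a smoothed, natural-log idf and L2-normalizes each row, so its values will not match the formulas from Task 1c).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ["cold beer beach", "ice cream beer beer", "beach cold ice cream"]
toy_test = ["frozen ice beer"]  # "frozen" is not in the training vocabulary

vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and idf weights from the training texts only
x_train = vectorizer.fit_transform(toy_train).toarray()  # sparse -> dense
# transform reuses them; unseen words such as "frozen" are ignored
x_test = vectorizer.transform(toy_test).toarray()

print(x_train.shape, x_test.shape)  # rows are documents, columns are terms
```

Since each row is an instance, arrays of this shape can be passed directly to a `fit(x, y)` method that expects one instance per row.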

In [15]:
from sklearn.datasets import fetch_20newsgroups
import math
import re

train = fetch_20newsgroups(subset='train', categories=['comp.windows.x','rec.motorcycles'])
test = fetch_20newsgroups(subset='test', categories=['comp.windows.x','rec.motorcycles'])