## Set operators cheat sheet
| Operator         | Name     | Description |
|--------------|-----------|------------|
| $A \cup B$ | Union | All the elements in $A$ and all the elements in $B$ |
| $A \cap B$ | Intersection | All the elements that are in $A$ and in $B$ |
| $A \setminus B$ | Difference | All the elements in $A$ without the ones in $B$ |
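
Python's built-in `set` type supports the same three operators, which makes it easy to check an answer by hand. A minimal sketch; the sets `A` and `B` are made up for illustration and are not taken from the exercise:

```python
# Illustrative sets; the values are made up for this example.
A = {1, 2, 3, 4}
B = {3, 4, 5}

union = A | B          # A ∪ B
intersection = A & B   # A ∩ B
difference = A - B     # A \ B

print(union, intersection, difference)
```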


# Q8
## (a)
### (i)
True positives are the values that are positive and were identified as positive.

<!-- Solution header so it isn't immediately visible when the field is collapsed -->
$D_1 \cap D_2$

### (ii)
False negatives are the positive values that were excluded even though they should have been included.

<!-- Solution header so it isn't immediately visible when the field is collapsed -->
$(D \setminus D_2) \cap D_1$
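
Both answers can be sanity-checked with Python sets. This is only a sketch: it assumes that $D$ is the whole corpus, $D_1$ the set of actually positive documents and $D_2$ the set of documents identified as positive, which is how the answers above read. The document ids are invented.

```python
# Hypothetical document ids. Assumed reading of the exercise:
# D is the corpus, D1 the positive documents, D2 the identified ones.
D = {"d1", "d2", "d3", "d4", "d5", "d6"}
D1 = {"d1", "d2", "d3"}
D2 = {"d2", "d3", "d4"}

true_positives = D1 & D2          # positive and identified
false_negatives = (D - D2) & D1   # positive but not identified

# Because D1 is a subset of D, the second expression equals D1 \ D2.
assert false_negatives == D1 - D2

print(sorted(true_positives))   # ['d2', 'd3']
print(sorted(false_negatives))  # ['d1']
```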

## (b)

In [None]:
# Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
import math, base64

D = ["Process mining is a discipline that performs analysis on process data. The starting point for process mining are event data in the form of a so-called event log.",
     "Business owners can gain insights on their process by analyzing their event data. Process mining software can help organizations capture information from enterprise transaction systems and provides detailed information about how key processes are performing."
]

Q = "event log"

# If we take the given corpus literally, without stripping punctuation,
# the first question already produces a different answer because "log" and "log."
# are two different words. I got you now RWTH, you made a boo-boo and I deserve a 1.0
filterPunctuation = True
if filterPunctuation:
    D = [
        "".join([c for c in d if c.isalnum() or c.isspace()])
        for d in D
    ]
    Q = "".join([c for c in Q if c.isalnum() or c.isspace()])

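# Count whitespace-separated tokens per document. The identity preprocessor
# skips lowercasing, so "Process" and "process" are counted separately.
def term_counts(doc):
    # token_pattern=None avoids the "token_pattern will not be used" warning.
    cv = CountVectorizer(preprocessor=lambda x: x, tokenizer=lambda x: x.split(), token_pattern=None)
    counts = cv.fit_transform([doc]).toarray()[0]
    return dict(zip(cv.get_feature_names_out(), counts))

DV = [term_counts(d) for d in D]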
display(DV[0])

In [None]:
def fscore(query, doc, func):
    # doc is one document's term-count dict (an entry of DV); query words
    # that do not occur in doc are skipped.
    scores = [func(word, doc) for word in query.split() if word in doc]
    if any(s == "NaN" for s in scores):
        return "NaN"
    return sum(scores)

In [None]:
## i) Term Frequency (TF) Q D1
tf = lambda word, doc: doc[word]  # raw count of word in the document
print(fscore(Q,DV[0],tf))

In [None]:
## ii) Term Set (TS) Q D2
# Side note: can someone tell me where the term set score was introduced?
# The first mention I found of it was in the solution for instruction 10.

# In case someone else was confused: ts gives 1 if the term is present,
# regardless of its count, and 0 otherwise.

# The lambda can always return 1 because fscore already checks for membership.
ts = lambda word, doc: 1
print(fscore(Q,DV[1],ts))
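
To make the difference between the two scores concrete, here is a small standalone sketch that does not depend on the cells above. It uses `collections.Counter` on a made-up sentence; the helper names `tf_score` and `ts_score` exist only for this illustration.

```python
from collections import Counter

# Made-up document, tokenized on whitespace.
doc = Counter("the event log stores every event".split())

def tf_score(query, doc):
    # Sum of the raw counts of the query words in the document.
    return sum(doc[w] for w in query.split())

def ts_score(query, doc):
    # Number of query words that occur in the document at least once.
    return sum(1 for w in query.split() if doc[w] > 0)

print(tf_score("event log", doc))  # 3: "event" twice, "log" once
print(ts_score("event log", doc))  # 2: both words are present
```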

In [None]:
## iii)
Qiii = ""

# My answer, base64-encoded for your pleasure
# Qiii = base64.b64decode(b'aW5zaWdodHMgYW5hbHl6aW5n').decode("utf-8")

print(f"tf score for '{Qiii}' in D1: {fscore(Qiii,DV[0],tf)}\ntf score for '{Qiii}' in D2: {fscore(Qiii,DV[1],tf)}")

In [None]:
## iv) tfidf
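# w(ord), c(orpus), d(ocument)
def idf(w, c):
    # Number of documents in the corpus that contain the word.
    df = sum(int(w in d) for d in c)
    return math.log2(len(c) / df) if df > 0 else "NaN"

def tfidf(w, d):
    # Uses the global corpus DV for the document frequencies.
    i = idf(w, DV)
    return tf(w, d) * i if i != "NaN" else "NaN"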

Qiv = ""

# My answer, base64-encoded for your pleasure
# Qiv = base64.b64decode(b'cHJvY2VzcyBtaW5pbmc=').decode("utf-8")

print(f"Query '{Qiv}'\ntf-score: {fscore(Qiv,DV[0],tf)}\ntfidf-score: {fscore(Qiv,DV[0],tfidf)}")
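
For reference, the `idf` and `tfidf` definitions in the cell above correspond to $\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \log_2 \frac{N}{\mathrm{df}(w)}$, where $N$ is the number of documents and $\mathrm{df}(w)$ is the number of documents containing $w$. The standalone sketch below evaluates this on a made-up three-document corpus; the corpus and the function names are illustrative and not part of the exercise.

```python
import math
from collections import Counter

# Made-up corpus, tokenized on whitespace.
docs = [Counter(text.split()) for text in (
    "event log event data",
    "process data",
    "process model",
)]

def idf(word, docs):
    # log2(N / df); undefined (NaN) if the word occurs in no document.
    df = sum(1 for d in docs if word in d)
    return math.log2(len(docs) / df) if df else float("nan")

def tfidf(word, doc, docs):
    return doc[word] * idf(word, docs)

print(tfidf("event", docs[0], docs))  # 2 * log2(3/1), about 3.17
print(tfidf("data", docs[0], docs))   # 1 * log2(3/2), about 0.58
```

A word that occurs in every document gets an idf of $\log_2 1 = 0$, so in a two-document corpus only words unique to one document contribute to the tf-idf score.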