# Tutorial 5: Vector Space Model, Scoring

### Scoring: query-document similarity

# Exercise 1: The Mechanics of tf-idf

Part A: Definitions and Formulas

a) Define $tf_{t,d}$: What does this represent in the context of a document and a term? 

b) Define $df_t$: How is this different from $tf$? Why do we prefer using $df$ over collection frequency ($cf$) for discrimination? 

c) Define $idf_t$: Write the formula for Inverse Document Frequency. Explain what happens to the $idf$ value as a term becomes more rare. 

d) Define $tf\text{-}idf_{t,d}$: Write the full formula combining the parts above. 


Part B: Calculating Inverse Document Frequency (idf)

Consider a collection size $N = 1,000,000$. Calculate the $idf$ for the following two terms using base-10 logarithms ($\log_{10}$).

Term A: Appears in $1,000$ documents ($df_A = 1,000$).

Term B: Appears in $500,000$ documents ($df_B = 500,000$).



Part C: Calculating Final Weights

Now, consider a specific Document $d_1$.

Term A appears $5$ times in $d_1$ ($tf_{A,d1} = 5$).

Term B appears $20$ times in $d_1$ ($tf_{B,d1} = 20$).

Using your results from Part B, calculate the final $tf\text{-}idf$ score for both Term A and Term B in Document $d_1$.


Part D: Analysis

Compare the two scores from Part C.

Which term has the higher weight? 

Explain why Term B, despite appearing 4 times more often than Term A in this specific document, has a lower weight. Relate this to the concept of discriminating power. 



### Answer: Part A and B

Part A:

$tf_{t,d}$: The number of occurrences of term $t$ in document $d$. 

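$df_t$: The number of documents in the collection that contain term $t$. Unlike $tf$, it is a property of the whole collection rather than of a single document. We prefer it over collection frequency because $cf$ counts total occurrences and can be inflated by a few documents that repeat the term many times, whereas $df$ reflects how many documents the term can actually help tell apart.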

$idf_t$: Formula is $\log(N/df_t)$. As a term becomes rarer ($df_t$ decreases), the $idf$ value increases (high discriminating power). 

$tf\text{-}idf_{t,d}$: Formula is $tf_{t,d} \times \log(N/df_t)$. 


Part B:

Term A: $idf_A = \log_{10}(1,000,000 / 1,000) = \log_{10}(1,000) = 3$.

Term B: $idf_B = \log_{10}(1,000,000 / 500,000) = \log_{10}(2) \approx 0.301$.



# Answer: Part C and D

Part C:

Weight A: $5 \times 3 = 15$.


Weight B: $20 \times 0.301 = 6.02$.


Part D:

Result: Term A has the higher weight ($15$ vs $6.02$).

Reasoning: Term B occurs in half the documents in the collection ($idf \approx 0.3$), meaning it has very little discriminating power. Term A is much rarer ($idf = 3$), so its occurrences are weighted much more heavily. This mechanism attenuates the effect of terms that occur too often to be meaningful. 
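The arithmetic in Parts B and C can be checked with a short script. This is a minimal sketch using the raw-tf formula from Part A; the function names `idf` and `tf_idf` are illustrative and not taken from any library.

```python
import math

def idf(n_docs: int, df: int) -> float:
    """Inverse document frequency with a base-10 logarithm."""
    return math.log10(n_docs / df)

def tf_idf(tf: int, n_docs: int, df: int) -> float:
    """Raw term frequency multiplied by idf, as in Part A (d)."""
    return tf * idf(n_docs, df)

N = 1_000_000  # collection size from Part B

weight_a = tf_idf(tf=5, n_docs=N, df=1_000)      # Term A: rare, 5 occurrences in d1
weight_b = tf_idf(tf=20, n_docs=N, df=500_000)   # Term B: common, 20 occurrences in d1

print(f"idf_A = {idf(N, 1_000):.3f}, weight_A = {weight_a:.2f}")
print(f"idf_B = {idf(N, 500_000):.3f}, weight_B = {weight_b:.2f}")
```

Term A ends up with the larger weight even though its term frequency is four times smaller, which matches the analysis in Part D.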

# Exercise 2: Cosine Similarity vs. Euclidean Distance

Part A: The Problem with Euclidean Distance

In the vector space model, we represent documents as vectors of term weights. Explain intuitively why using Euclidean distance (the straight-line distance between two points) might be a poor metric for determining if two documents are similar in topic, especially if one document is very long and the other is very short.

How does Cosine Similarity solve this specific problem? (Hint: Think about what the angle represents versus the magnitude).

Part B: 
Let $\vec{q}$ be a query vector and $\vec{d}$ be a document vector. 
Write the formula for the Euclidean distance ($|\vec{q} - \vec{d}|$) between these two vectors. Write the formula for the Cosine similarity between these two vectors. 
What happens to the magnitude (length) of these vectors if we normalize them to be unit vectors?

Part C: Assume that the query vector $\vec{q}$ and the document vector $\vec{d}$ have both been normalized to unit length (i.e., $|\vec{q}| = 1$ and $|\vec{d}| = 1$). Expand the square of the Euclidean distance formula: $|\vec{q} - \vec{d}|^2 = (\vec{q} - \vec{d}) \cdot (\vec{q} - \vec{d})$. Simplify this expression using the dot product properties. Substitute the known unit lengths. Show how the result relates directly to the Cosine similarity formula.

Part D: Based on your derivation in Part C: If we rank documents by increasing Euclidean distance (smallest distance first), will we get a different order than if we rank them by decreasing Cosine similarity (largest similarity first)? Explain your reasoning.

# Answer: Part A and B

Part A: Euclidean distance is sensitive to the magnitude (length) of the vectors. A long document (e.g., a book) and a short document (e.g., a tweet) might use the exact same distribution of words, but the "book" vector will be very far away from the "tweet" vector simply because it has higher term frequencies. We want to measure content similarity, not length.

Cosine Solution: Cosine similarity measures the angle between the vectors and ignores their magnitudes. If two documents have the same relative distribution of terms, their vectors point in the same direction (the angle is zero, so the cosine is 1), regardless of how long the documents are.
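A small numeric illustration of the length problem, as a sketch: the vectors `tweet`, `book`, and `other` are made-up term-frequency vectors, chosen so that `book` has exactly the same term distribution as `tweet` but is 1000 times longer.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

tweet = [1, 2, 1]                     # hypothetical short document
book = [1000 * w for w in tweet]      # same distribution, 1000x the length
other = [2, 0, 1]                     # different distribution, similar length

# Euclidean distance says the unrelated short document is far closer to the tweet.
print(euclidean(tweet, book), euclidean(tweet, other))

# Cosine similarity recognizes that tweet and book point in the same direction.
print(cosine(tweet, book), cosine(tweet, other))
```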


Part B: $|\vec{q} - \vec{d}| = \sqrt{\sum (q_i - d_i)^2}$.

Cosine: $sim(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}||\vec{d}|}$.

Normalization: If normalized, $|\vec{q}| = 1$ and $|\vec{d}| = 1$.



# Answer: Part C and D

### Part C

Expanding the squared Euclidean distance for unit vectors:

$$
|\vec{q} - \vec{d}|^2 = (\vec{q} - \vec{d}) \cdot (\vec{q} - \vec{d})
$$

Expanding the dot product:

$$
(\vec{q} \cdot \vec{q}) - 2(\vec{q} \cdot \vec{d}) + (\vec{d} \cdot \vec{d})
$$

So,

$$
|\vec{q}|^2 - 2(\vec{q} \cdot \vec{d}) + |\vec{d}|^2
$$

Since they are unit vectors, $|\vec{q}|^2 = 1$ and $|\vec{d}|^2 = 1$:

$$
1 - 2(\vec{q} \cdot \vec{d}) + 1
$$

Thus:

$$
2 - 2(\vec{q} \cdot \vec{d})
$$

Since $|\vec{q}| = |\vec{d}| = 1$, the denominator of the cosine formula equals 1, so the term $(\vec{q} \cdot \vec{d})$ is exactly the cosine similarity.

**We have:**

$$
|\vec{q} - \vec{d}|^2 = 2 - 2 \times \text{CosineSimilarity}(\vec{q}, \vec{d})
$$



Part D: The right-hand side is a strictly decreasing linear function of cosine similarity. So, for any two documents $d_i$ and $d_j$: 

$\text{If } \text{CosSim}(q, d_i) > \text{CosSim}(q, d_j)$, $\text{then } \|q - d_i\| < \|q - d_j\|$

Therefore, the ranking order is identical.

Elaborating: The equation derived in Part C shows that, for unit vectors, the squared Euclidean distance is a decreasing linear function of Cosine similarity. As Cosine similarity increases (gets closer to 1), the term $(2 - 2 \times \text{Cosine})$ decreases. Because distances are non-negative and the square root is an increasing function, the Euclidean distance itself decreases as well. Therefore, maximizing similarity is mathematically equivalent to minimizing distance for unit vectors.
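The identity and the ranking claim can be verified numerically. The sketch below uses a made-up query and three made-up documents; only the normalization step and the two formulas come from the exercise.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical query and documents, normalized to unit vectors.
q = normalize([2.0, 1.3, 0.0])
docs = {"d1": [1, 2, 1], "d2": [0, 1, 5], "d3": [3, 1, 0]}
unit_docs = {name: normalize(v) for name, v in docs.items()}

# Check |q - d|^2 = 2 - 2 * cos(q, d) for every document.
for name, d in unit_docs.items():
    assert math.isclose(distance(q, d) ** 2, 2 - 2 * dot(q, d), abs_tol=1e-12)

# Rank by decreasing cosine similarity and by increasing Euclidean distance.
by_cosine = sorted(unit_docs, key=lambda n: dot(q, unit_docs[n]), reverse=True)
by_distance = sorted(unit_docs, key=lambda n: distance(q, unit_docs[n]))
print(by_cosine, by_distance)
```

Both sorts produce the same order, as Part D predicts. Without the normalization step the two rankings are not guaranteed to agree.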


# Exercise 3: Vector Space Scoring (lnc.ltc weighting variant)

As described in the textbook, lnc.ltc is a common weighting strategy where one puts the $idf$ weight on the query vector only, while the document vector uses only term frequency. This avoids the "Double Counting" of IDF when computing the (normalized) dot product.

Calculate the Cosine Similarity score between a query vector and a document vector using the following data:

Query: "digital cameras"

Document: "digital cameras and video cameras"

N (Total Docs): 10,000

Assumptions: The term "and" is a stop word (weight = 0).

$df_{\text{digital}} = 100$

$df_{\text{video}} = 1,000$

$df_{\text{cameras}} = 500$

Use logarithmic weighting for the query vector: $w_{t,q} = (1 + \log(tf_{t,q})) \times idf_t$. (Note: Use log with base 10)


Use only raw $tf$ for the document vector (no $idf$, no log).

Normalize both vectors before taking the dot product.

### Answer



Vocabulary: {digital, cameras, video} ("and" is removed).

IDF Calculations:  $idf_{\text{digital}} = \log(10,000/100) = 2.0$.  $idf_{\text{video}} = \log(10,000/1,000) = 1.0$.  $idf_{\text{cameras}} = \log(10,000/500) = \log(20) \approx 1.3$.

Query Vector

Terms in Query: "digital" (1), "cameras" (1).

Weights ($w_{t,q}$): 

digital: $(1 + \log(1)) \times 2.0 = (1 + 0) \times 2.0 = 2.0$.

cameras: $(1 + \log(1)) \times 1.3 = 1.3$.

video: $0$ (does not appear in query).

Vector $\vec{q}$: $[2.0, 1.3, 0]$.

Terms in Doc: 

"digital" (1), "cameras" (2), "video" (1).

Weights ($w_{t,d}$): (Raw tf only) digital: $1$. cameras: $2$. video: $1$.

Vector $\vec{d}$: $[1, 2, 1]$.

Calculation 

Length $|\vec{q}|$: $\sqrt{2.0^2 + 1.3^2 + 0^2} = \sqrt{4 + 1.69} = \sqrt{5.69} \approx 2.385$.

Length $|\vec{d}|$: $\sqrt{1^2 + 2^2 + 1^2} = \sqrt{6} \approx 2.449$.

Dot Product: $(2.0 \times 1) + (1.3 \times 2) + (0 \times 1) = 2.0 + 2.6 = 4.6$.


Cosine Similarity:  $$\frac{4.6}{2.385 \times 2.449} = \frac{4.6}{5.84} \approx \mathbf{0.788}$$
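The whole calculation can be reproduced in a few lines. This is a sketch under the exercise's assumptions (base-10 logs, log-tf times idf for the query, raw tf for the document, "and" dropped as a stop word); the helper names are illustrative. Because the script keeps $idf_{\text{cameras}}$ unrounded ($\approx 1.301$ instead of $1.3$), its result is about $0.787$, slightly below the hand-computed $0.788$.

```python
import math

N = 10_000
df = {"digital": 100, "cameras": 500, "video": 1_000}
vocab = ["digital", "cameras", "video"]          # "and" removed as a stop word

query_tf = {"digital": 1, "cameras": 1}
doc_tf = {"digital": 1, "cameras": 2, "video": 1}

def query_weight(term):
    """(1 + log10 tf) * idf for terms in the query, 0 otherwise."""
    tf = query_tf.get(term, 0)
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df[term])

def norm(v):
    return math.sqrt(sum(x * x for x in v))

q = [query_weight(t) for t in vocab]             # idf-weighted query vector
d = [float(doc_tf.get(t, 0)) for t in vocab]     # raw-tf document vector

dot_product = sum(a * b for a, b in zip(q, d))
score = dot_product / (norm(q) * norm(d))        # cosine similarity

print(f"q = {q}, d = {d}, score = {score:.4f}")
```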




Remark: In modern IR practice, removing stop words is generally not recommended for vector space scoring, for the following reasons:

IDF naturally handles them: In vector space scoring, the Inverse Document Frequency ($idf$) component naturally assigns a very low weight to common words. Because words like "the" or "and" appear in almost every document, their $idf$ score is near zero, meaning they have very little impact on the final ranking even if they are left in the index.

Phrase queries rely on them: Removing stop words breaks the ability to search for exact phrases that contain them. The text gives examples like "President of the United States" or "flights to London". If you remove stop words, these specific meanings are lost or become difficult to reconstruct.

Compression handles the cost: The text notes that modern index compression techniques allow systems to store the frequent postings lists for stop words efficiently, so the storage cost is no longer a major reason to remove them.
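To make the first point of the remark concrete, the sketch below evaluates $idf$ for a few document frequencies. The collection size is borrowed from Exercise 3; the document frequencies are hypothetical, with 9,900 standing in for a stop-word-like term.

```python
import math

N = 10_000  # collection size from Exercise 3

# Hypothetical document frequencies: near-ubiquitous, common, and rare terms.
idf_values = {df: math.log10(N / df) for df in (9_900, 5_000, 100)}

for df, value in idf_values.items():
    print(f"df = {df:>5}: idf = {value:.4f}")
```

A term that appears in 99% of the documents gets an $idf$ below $0.01$, so even a large term frequency contributes very little to the score.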