# COURSE OVERVIEW

The core topics covered in the course include:
- Information retrieval, search, and ranking
- Neural networks
- Language models
- Recommender systems
- Network modelling and embedding

# LECTURE 02/09/2025
--- 
## What is the Web?
The Web is a global, decentralized, hypertext-based information system.
- Hyperlinks are used to navigate from one document to another.
- This information space is built on a set of technical standards for the identification, retrieval, and representation of content.
---
## World Wide Web (WWW)
- The World Wide Web (referred to as WWW, W3, or simply the Web) was developed by Tim Berners-Lee in 1989 at CERN, Switzerland.
- The project document described a "hypertext project" called "WorldWideWeb" in which a "web" of "hypertext documents" could be viewed by "browsers".[1]
- First website in 1991: info.cern.ch
- First webpage address: http://info.cern.ch/hypertext/WWW/TheProject.html
- First Browser
- In 1993 and 1994: graphical web browsers such as Mosaic (1993) and Netscape Navigator (1994) became accessible outside of academia.
- In 1994: W3C (World Wide Web Consortium) was founded by Tim Berners-Lee. W3C publishes recommendations that are considered web standards.
- Web standards are blueprints, or building blocks, of a consistent and harmonious digitally connected world. They are implemented in browsers, blogs, search engines, and other software that power our experience on the web.
---
## World Wide Web - Data Model

W3 data model enables:
- Information need only be represented once, as a reference may be made instead of making a copy.
- Links allow the topology of the information to evolve, so modelling the state of human knowledge at any time is without constraint.
- The web stretches seamlessly from small personal notes on the local workstation to large databases on other continents.
- Indexes such as phone books are presented as documents, and so may themselves be found by searches and/or following links.
- The documents in the web do not have to exist as files; they can be "virtual" documents generated by a server in response to a query or document name. They can therefore represent views of databases or snapshots of changing data (such as weather forecasts, financial information, etc.).

Advantages:
- Information access doesn’t require expert knowledge
- Information Retrieval via search engines
---
## World Wide Web - Components
The World Wide Web (WWW) comprises several components:
- Identification: Uniform Resource Identifiers (URIs), the address system that provides globally unique identification of web resources. For example, the URI of the main page of the first WWW project is http://info.cern.ch/hypertext/WWW/TheProject.html
- Interaction: Hypertext Transfer Protocol (HTTP), the network protocol used for transferring information between web clients and servers. The data transferred can be plain text, hypertext, images, etc.
- Content format: Hypertext Markup Language (HTML), a markup language used to define the structure and content of a web page. HTML supports various content types, including text, images, video, audio, scripts, and hyperlinks for easy access to web resources.
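
To make the identification component concrete, the sketch below splits the URI of the first web page into its parts using Python's standard `urllib.parse` module. The variable name `parts` is only illustrative.

```python
from urllib.parse import urlparse

# Split the first web page's URI into its components.
parts = urlparse("http://info.cern.ch/hypertext/WWW/TheProject.html")

print(parts.scheme)  # protocol used for interaction: http
print(parts.netloc)  # host that serves the resource: info.cern.ch
print(parts.path)    # location of the document on that host
```

The scheme names the interaction protocol (HTTP), while the path ends in `.html`, hinting at the content format.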
---
## Topology of the Web

The web's structure can be broken down into three overlapping layers:
1. Classic Document Web (Web 1.0): A network of static pages connected by hyperlinks. Its topology is a simple web of linked documents.
2. Social & Application Web (Web 2.0): A platform for dynamic applications where the connections are user interactions, social ties, and data exchange via APIs.
3. Web of Data (Semantic Web): A network of machine-readable data, not pages. The connections are defined relationships between concepts, forming a global knowledge graph.

These layers coexist and intersect. A modern website is a document (1), that hosts an interactive application (2), and contains structured data for machines (3).

---

## What is Web Intelligence?
- Intelligent ways to extract information and knowledge from the web:
  - finding relevant information available on the web
  - obtaining new knowledge by analyzing web data: the web itself, but also how it evolves and how users interact on and with the web

Some applications:
- Intelligent Search
- Recommender Systems
- Business Analytics
- Crowd Sourcing
- Not so nice ones: advertising, manipulation, surveillance
---
## Intelligent Search

- Keyword search: the words in the query appear frequently in the document, in any order (bag of words).
- Disadvantages:
  - May not retrieve relevant documents that use synonymous terms (e.g., a query for “restaurant” does not match a document that only mentions “café”)
  - May retrieve irrelevant documents that contain ambiguous terms (e.g., cannot distinguish between “bat” the mammal and “bat” in baseball)
- Beyond keywords:
  - Considering the meaning of the words used
  - Adapting to user feedback (direct or indirect)
  - Considering the authority of the source
---
## Recommender Systems
Kinds of recommendations:
- Product-based (collaborative filtering): similar books
- User-based (content-based filtering): based on search history
- Hybrid
---
## Node classification
We can represent a domain as a network of nodes, where each node belongs to a category.
- Example: suppose aau.dk is a domain made up of many connected nodes. Coloring each node by its category then distinguishes, for example, research projects from educational programs.
---
## Levels of the Web
We can distinguish 3 levels of modeling and analytics:
1. Web Content
2. Web Structure
3. Web Dynamics
---

# LECTURE 09/09/2025
---
# NLP BASICS

Natural Language Processing (NLP) is a subfield of Artificial Intelligence and Computational Linguistics comprising computational methods for understanding or generating natural languages.
The goals of NLP are to:
- read human language and convert it into a machine-understandable form
- decipher human languages
- understand and make sense of the text

Natural Languages consist of:
- phonology: sounds of the words
- semantics: meaning of the words
- syntax: grammatical rules according to which words are put together

Text is difficult to analyze because natural language is ambiguous: a word can have different meanings depending on the context, the position of a word can change the meaning of a sentence, and so on.

Language data is unstructured. The first step toward making it machine-understandable is to tokenize it; however, this process is highly language-dependent.

---
## Tokenization

Given the text:

J.K. Rowling’s book, Harry Potter and the Philosopher's Stone was published in 1997. It’s still a bestseller and truly amazing how much people enjoy it!

- Word-level tokenization:
  Tokenized words: ["J.K.", "Rowling’s", "book", ",", "Harry", "Potter", "and", "the", "Philosopher's", "Stone", "was", "published", "in", "1997", ".", "It’s", "still", "a", "bestseller", "and", "truly", "amazing", "how", "much", "people", "enjoy", "it", "!"]
- Handling contractions:
  Some tokenizers split the contraction "It’s" into two tokens, ["It", "'s"], instead of keeping it as a single token as above.
- Punctuation handling:
  Commas, sentence-final periods, and exclamation marks are separated from words so that they become individual tokens. Periods inside abbreviations such as "J.K." stay attached.
- Named entity recognition (NER):
  The book title and the author are multi-word entities, which ideally should be handled as a single token or entity for proper semantic analysis.
  - Harry Potter and the Philosopher's Stone
  - J.K. Rowling
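
The word-level behavior described above can be approximated with a single regular expression. This is a minimal sketch, not a production tokenizer: the pattern `TOKEN_RE` and the function `tokenize` are illustrative names, and the rule for initials (single letters followed by periods) is an assumption that would misfire on a one-letter word at the end of a sentence.

```python
import re

# Three alternatives, tried in order at each position:
#   1. initials such as "J.K." (single letters each followed by a period)
#   2. words, optionally containing an apostrophe ("Rowling’s", "It’s")
#   3. any single punctuation character
TOKEN_RE = re.compile(r"(?:[A-Za-z]\.)+|\w+(?:[’']\w+)*|[^\w\s]")

def tokenize(text):
    """Return the list of word-level tokens found in text."""
    return TOKEN_RE.findall(text)

print(tokenize("J.K. Rowling’s book, published in 1997. It’s amazing!"))
```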

Most tokenizers are rule-based and follow different conventions. For example, for the word "don't":
- Penn Treebank: splits the word into "do" and "n't"
- Moses: splits the word into "don" and "'t"

Moses is one example of a rule-based tokenizer.

We also need to take compound words and multi-word names into account, since tokenization can remove white space but does not add it.
Another important factor is the character encoding of the text, since the same symbol can be stored as different numeric byte values depending on the encoding.
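
As a small illustration of the encoding issue, the sketch below encodes the same character with two encodings and prints the resulting byte values in decimal. The choice of the character "é" and of UTF-8 versus Latin-1 is only an example.

```python
# The same symbol, "é" (U+00E9), under two different encodings.
symbol = "\u00e9"

utf8_bytes = list(symbol.encode("utf-8"))      # two bytes in UTF-8
latin1_bytes = list(symbol.encode("latin-1"))  # one byte in Latin-1

print(utf8_bytes)    # [195, 169]
print(latin1_bytes)  # [233]
```

Reading text with the wrong encoding therefore changes the symbols a tokenizer sees.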

### Tokenization in Encoding

**Tokenization** in encoding assigns a unique value (token ID) to each distinct word so that the system can recognize and process words based on their assigned values.
Example:

I live in Copenhagen = ["I", "live", "in", "Copenhagen"]\
I = 1\
live = 2\
in = 3\
Copenhagen = 4

We live in Copenhagen = ["We", "live", "in", "Copenhagen"]\
We = 5\
live = 2\
in = 3\
Copenhagen = 4
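
The ID assignment in this example can be sketched as a dictionary that grows whenever an unseen token appears. The function name `encode` and the choice to start IDs at 1 are assumptions made to match the example.

```python
def encode(tokens, vocab):
    """Map tokens to integer IDs, adding unseen tokens to vocab."""
    ids = []
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab) + 1  # next free ID, starting at 1
        ids.append(vocab[token])
    return ids

vocab = {}
first = encode(["I", "live", "in", "Copenhagen"], vocab)
second = encode(["We", "live", "in", "Copenhagen"], vocab)
print(first)   # [1, 2, 3, 4]
print(second)  # [5, 2, 3, 4]
```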

---

## Normalization

This process transforms distinct but 'equivalent' tokens into one normalized form, e.g.:
- lower-casing: This -> this
- removing hyphens: anti-discriminatory -> antidiscriminatory
- deleting periods: U.S.A. -> USA or usa

However, the amount of normalization must be balanced, because it changes which results are returned:
- more normalization: a search for U.S.A. also returns results for usa or USA (more matches, but some may be unwanted)
- less normalization: a search for C.A.T. does not return results for cat (fewer false matches, but some relevant documents may be missed)
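
The three normalization steps listed above can be sketched in a few lines. The function name `normalize` is illustrative, and applying all three steps unconditionally corresponds to the "more normalization" end of the trade-off.

```python
def normalize(token):
    """Lower-case a token and remove hyphens and periods."""
    return token.lower().replace("-", "").replace(".", "")

print(normalize("This"))                  # this
print(normalize("anti-discriminatory"))   # antidiscriminatory
print(normalize("U.S.A."))                # usa
print(normalize("C.A.T."))                # cat (now indistinguishable from "cat")
```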

---

## Stop words
These are common words that usually do not add significant meaning to a sentence and can be filtered out to reduce noise in text analysis.
However, removing stop words can cause problems.
For example, the phrase "to be or not to be" consists entirely of stop words, yet each of them is an essential part of the expression.
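
A minimal sketch of stop word removal shows the problem. The stop word set `STOP_WORDS` below is a small hypothetical list chosen for illustration, not a standard one.

```python
# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"a", "the", "is", "in", "to", "be", "or", "not"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the dog is in the garden".split()))  # ['dog', 'garden']
print(remove_stop_words("to be or not to be".split()))        # [] (nothing is left)
```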

---

## Stemming
Stemming is similar to normalization: it strips the grammatical variants of a word and reduces them to a common underlying stem.
Here are a few examples:
- learn, learns, learned -> learn
- organize, organizer, organization -> organ

The terms that result from stemming will be included in the index.

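### Porter's algorithm for the English language (1980)

- A rule-based stemmer with rules mostly for suffix stripping, such as:
  - “ing” → “” (connecting → connect)
  - “sses” → “ss” (caresses → caress)
  - “ies” → “i” (ponies → poni)
  - “s” → “” (cats → cat)
- Every word or stem is written in the form `[C](VC){m}[V]`, where C is a sequence of one or more consonants, V is a sequence of one or more vowels, and m ≥ 0 (the measure) is the number of times (VC) occurs. The vowels are A, E, I, O, U, and Y when it is preceded by a consonant; every other letter is a consonant.
  - m = 0: TREE, BY
  - m = 1: TROUBLE, OATS, TREES
  - m = 2: TROUBLES, OATEN, PRIVATE
- The rules for removing a suffix have the form (condition) S1 -> S2: the suffix S1 is replaced by S2 if the stem before S1 satisfies the condition (typically a constraint on m).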
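
The four example rules can be sketched as an ordered list of (suffix, replacement) pairs where the first matching rule wins. This is a simplified illustration, not the full Porter algorithm: the names `RULES` and `strip_suffix` are assumptions, the conditions on the stem are omitted, and the extra `ss` → `ss` rule is included only so that a word like "caress" keeps its final "s".

```python
# Ordered rules: the first suffix that matches is applied.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),   # protects words such as "caress" from the "s" rule
    ("s", ""),
    ("ing", ""),
]

def strip_suffix(word):
    """Apply the first matching suffix rule, or return word unchanged."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["connecting", "caresses", "ponies", "cats", "caress"]:
    print(w, "->", strip_suffix(w))
```

Without the stem conditions this sketch would also turn "sing" into "s", which is why the real algorithm checks the remaining stem before stripping.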

Example with Porter’s stemming algorithm:
- Stem the word “REPLACEMENT”
- Applicable rule: (m > 1) EMENT → “”
- The stem before the suffix, REPLAC, has the form `[C](VC)(VC)`, so m = 2 and the suffix is removed
- This generates the stem “replac”
- Advantage:
  - It generally produces good output compared with other stemmers and has a low error rate
- Disadvantage:
  - The stems produced are not always real words

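Use the algorithm to stem the word MULTIDIMENSIONAL:
- Check the common suffixes: the word does not end in ing, sses, ies, s, ation, ization, etc.
- The candidate suffix is AL, which is removed if the remaining stem has m > 1:
  - `M U L T I D I M E N S I O N` (stem without AL)
  - `C V C C V C V C V C C V V C`
  - Collapsing runs of consonants and vowels gives `C VC VC VC VC VC`
  - m = 5
- Since m > 1, AL is removed and the stem is “multidimension”.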
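
The measure m can be computed mechanically by labelling each letter as consonant or vowel, collapsing runs, and counting the VC pairs. The sketch below follows the definition of the `[C](VC){m}[V]` form; the function names `is_consonant` and `measure` are illustrative.

```python
import re

VOWELS = set("aeiou")

def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u,
    or a y that is preceded by a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y at the start or after a vowel is a consonant;
        # y after a consonant acts as a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Return m, the number of VC sequences in the stem."""
    stem = stem.lower()
    pattern = "".join(
        "C" if is_consonant(stem, i) else "V" for i in range(len(stem))
    )
    # Collapse runs so that "CCVVC" becomes "CVC".
    collapsed = re.sub(r"V+", "V", re.sub(r"C+", "C", pattern))
    return collapsed.count("VC")

print(measure("tree"), measure("trouble"), measure("private"))  # 0 1 2
print(measure("multidimension"))                                 # 5
```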

There are three criteria for evaluating stemmers:
- Correctness
- Efficiency of the task
- Compression performance

However, stemming is also language-dependent (for example, it does not work for Chinese), and it can introduce errors depending on how aggressively it is applied:
- overstemming: too much of the word is removed, so words with different meanings are reduced to the same stem.
- understemming: too little is removed, so related words that should share a stem end up with different stems.

---

## Zipf's law
Zipf's law in NLP describes the power-law distribution of word frequencies in a corpus: the frequency of a word is inversely proportional to its rank in the frequency table.

Example:
“The cat sat on the mat, and the dog sat on the mat.”

Tokenize:
["The", "cat", "sat", "on", "the", "mat", ",", "and", "the", "dog", "sat", "on", "the", "mat", "."] -> 13 word tokens (ignoring punctuation)

- Type: a unique word
- Token: an instance of a type in a corpus (including repetitions)

Word frequencies (after lower-casing):
- the: 4
- sat: 2
- on: 2
- mat: 2
- cat: 1
- dog: 1
- and: 1

So the type/token ratio is 7/13. As a corpus grows, frequent words keep repeating, so more data generally means a lower type/token ratio.
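
The counts in this example can be reproduced with `collections.Counter`. This is a small sketch that assumes lower-casing and ignores punctuation; the variable names are illustrative.

```python
import re
from collections import Counter

text = "The cat sat on the mat, and the dog sat on the mat."

# Lower-case the text and keep only alphabetic word tokens.
tokens = re.findall(r"[a-z]+", text.lower())
freq = Counter(tokens)

# Rank words by frequency, most frequent first.
for rank, (word, count) in enumerate(freq.most_common(), start=1):
    print(rank, word, count)

type_token_ratio = len(freq) / len(tokens)
print(len(freq), "types /", len(tokens), "tokens =", round(type_token_ratio, 3))
```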

Positive implications of Zipf's law:
- Any document/text will contain a number of words that are very common.
- These help us understand the structure (and possibly the meaning) of the text.

Negative implications:
- Any document/text will contain a large number of rare words, many of which are out-of-vocabulary (OOV) words.

Handling OOV words:
- Replace the set of OOV words in the training data with an unknown-word token (UNK)
- Use statistical models
- Use sub-string-based representations such as Byte-Pair Encoding (BPE), which learns which character sequences are common in the vocabulary of the language and treats those common sequences as atomic units of the vocabulary
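
The idea behind Byte-Pair Encoding can be sketched as a loop that repeatedly merges the most frequent pair of adjacent symbols. This is a simplified illustration under stated assumptions: the toy word frequencies are invented, the function names are hypothetical, and details of real implementations (end-of-word markers, tie-breaking) are left out.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of pair with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Start from characters and apply num_merges merge steps."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Toy corpus (invented): word -> frequency.
merges, vocab = learn_bpe({"low": 5, "lower": 2, "new": 3}, num_merges=2)
print(merges)
print(vocab)
```

After two merges the frequent sequence "low" has become a single unit, so the rarer word "lower" is represented as "low" plus the characters "e" and "r".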

---

## Preprocessing
Preprocessing is the process in which raw text is broken down and cleaned through:
- Tokenization
- Normalization
- Stop word removal
- Stemming

Definitions: 
- Corpus: collection of documents (e.g., all web-pages that have been crawled)
- Raw corpora have only minimal (or no) processing:\
Sentence boundaries may or may not be identified.\
There may or may not be metadata.\
Typos (written text) or disfluencies (spoken language) may or may not be corrected.
- Annotated corpora contain some labels (e.g., POS tags, sentiment labels) or linguistic structures (e.g., syntax trees, semantic interpretations), etc.
- Vocabulary: all terms that appear in the corpus

---

## Part of Speech (POS)
Part-of-speech tagging, or POS tagging, is the task of classifying words in a text according to their grammatical categories (such as noun, verb, and adjective).

Advantages:
- Helps identify the syntactic structure
- Helps disambiguate words that have several meanings, based on their role in the sentence
- Aids Named Entity Recognition

---

## String Similarity
String similarity can be measured through:
- Hamming distance: for two equal-length strings of symbols, the number of positions at which the corresponding symbols differ.
- Levenshtein distance: the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another.
- Jaccard similarity: a statistic that measures the overlap between two sets. Jaccard similarity = $\frac{|A \cap B|}{|A \cup B|}$
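
The two edit-based measures can be sketched as follows. The function names are illustrative, and the Levenshtein version uses the usual dynamic-programming formulation with two rows.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))

def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

print(hamming("karolin", "kathrin"))     # 3
print(levenshtein("kitten", "sitting"))  # 3
```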

---

# LECTURE 16/09/2025

# Text Processing


## Scoring with Jaccard Coefficient

The Jaccard Coefficient measures similarity by dividing the **size of the overlap** between two sets by the **size of their combined total**. It essentially asks: "Of all the items present, what fraction are in both sets?"

The formula is:
$$Jaccard(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

* **$|A \cap B|$ (Intersection):** The number of items found in **both** set A and set B.
* **$|A \cup B|$ (Union):** The total number of **unique** items across both sets combined.



[Image of a Venn diagram showing intersection and union]


The score is always between 0 and 1:
* **1** means the sets are a **perfect match** (identical).
* **0** means the sets have **no items in common**.


### Worked Example

Let's find the similarity between a query and two sentences. Terms are compared after lower-casing, so "March" in a sentence matches the query term "march".

* **Query (Q):** `{"march"}`
* **Sentence 1 (S1):** `{"Memories", "in", "March", "is", "a", "good", "movie"}`
* **Sentence 2 (S2):** `{"Flowers", "start", "blooming", "in", "March"}`

**1. Jaccard(Q, S1)**
* **Overlap:** `{"march"}` (Size = **1**)
* **Total Unique Words:** `{"march", "Memories", "in", "is", "a", "good", "movie"}` (Size = **7**)
* **Score:** $\frac{1}{7}$

**2. Jaccard(Q, S2)**
* **Overlap:** `{"march"}` (Size = **1**)
* **Total Unique Words:** `{"march", "Flowers", "start", "blooming", "in"}` (Size = **5**)
* **Score:** $\frac{1}{5}$

**Conclusion:** Sentence S2 is considered more similar to the query, as its score ($\frac{1}{5}$) is higher than S1's ($\frac{1}{7}$).
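
The worked example can be checked with Python sets. This sketch assumes the terms are lower-cased before comparison; the function name `jaccard` is illustrative.

```python
def jaccard(a, b):
    """Jaccard coefficient of two term collections, case-insensitive."""
    set_a = {t.lower() for t in a}
    set_b = {t.lower() for t in b}
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(set_a & set_b) / len(union)

q = ["march"]
s1 = ["Memories", "in", "March", "is", "a", "good", "movie"]
s2 = ["Flowers", "start", "blooming", "in", "March"]

print(jaccard(q, s1))  # 1/7
print(jaccard(q, s2))  # 1/5
```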

### Limitations of Jaccard's coefficient
- Term frequency (the number of occurrences of a term) is not considered
- Rare terms in a collection are more informative than frequent terms, but Jaccard does not use this information
  - For example, common adjectives such as good, beautiful, etc. count as much as rare terms
- The order of the terms is not considered

---

## Text Similarity from Unstructured Data
It's a technique used by computers to measure how alike two pieces of raw text are. The goal is to take text from sources like articles, emails, or social media posts and produce a similarity score that quantifies their relationship.

---

## Ranked Retrieval Model
- In a ranked retrieval model, the system returns an ordering over the (top) documents in the collection for a query
- More relevant results are ranked higher than less relevant ones
- Queries are one or more words of natural-language text (human language), not expressions in a query language
- A large result set is not an issue in ranked retrieval systems; usually only the top k ≈ 10 results are returned
- The output documents (results) are ranked according to how relevant they are to a query, i.e., how well document and query “match”
- A score, say in [0, 1], is assigned to each document. For example, if the query term is “Aalborg”:
  - Score = 0 if “Aalborg” does not occur in the document
  - The more frequently “Aalborg” occurs in the document, the higher the score

---

## Binary Term-Document Incidence Matrix
A binary term-document incidence matrix is a simple way to represent the presence or absence of terms in documents. Each row corresponds to a term, and each column corresponds to a document. A cell contains a 1 if the term is present in the document, and a 0 if it is absent.
Each document is represented by a binary vector $\in \{0,1\}^{|V|}$, where $|V|$ is the size of the vocabulary.

A count matrix has the same layout, but each cell holds the number of occurrences of the term in the document instead of 0 or 1.
Each document (i.e., each column) is then a count vector $\in \mathbb{N}^{|V|}$.

---

## Bag of Words model
The Bag of Words (BoW) model is a simple and widely used method for representing text data in natural language processing. In this model, a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and word order but keeping multiplicity.

The vector representation does not consider the ordering of words in a document.
Let's consider a small dataset of three sentences:
  1. Sentence 1: "I like programming."
  2. Sentence 2: "Programming is fun."
  3. Sentence 3: "I like to code."

Vocabulary: ["I", "like", "programming", "is", "fun", "to", "code"]
  - Sentence 1: "I like programming." Vector: [1, 1, 1, 0, 0, 0, 0]
  - Sentence 2: "Programming is fun." Vector: [0, 0, 1, 1, 1, 0, 0]
  - Sentence 3: "I like to code." Vector: [1, 1, 0, 0, 0, 1, 1]

The model represents text as a collection of word counts, enabling the analysis and processing of text data while ignoring the order of words.

Another example:

  - Sentence 1: John runs faster than Mary.
  - Sentence 2: Mary runs faster than John.

Generate the vectors.
Vocabulary: ["John", "runs", "faster", "than", "Mary"]
  - Sentence 1: "John runs faster than Mary."
    • Vector: [1, 1, 1, 1, 1]
  - Sentence 2: "Mary runs faster than John."
    • Vector: [1, 1, 1, 1, 1]

The two sentences have opposite meanings but identical vectors, which shows what is lost by ignoring word order.
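
Both examples can be reproduced with a short sketch that builds the vocabulary in order of first appearance and counts each word per sentence. The function names are illustrative, and lower-casing is an assumption made so that "Programming" and "programming" share one dimension.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and keep alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bow_vectors(sentences):
    """Return the vocabulary and one count vector per sentence."""
    tokenized = [tokenize(s) for s in sentences]
    vocab = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocab:
                vocab.append(token)  # keep order of first appearance
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

vocab, vectors = bow_vectors(
    ["I like programming.", "Programming is fun.", "I like to code."]
)
print(vocab)
print(vectors)

_, jm = bow_vectors(["John runs faster than Mary.", "Mary runs faster than John."])
print(jm[0] == jm[1])  # True: word order is lost
```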

---

## Term Frequency

The term frequency $tf_{t,d}$ of term t in document d is defined as the number of times that t occurs in d.
Raw term frequency is not what we want:
  - A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
  - But it is not 10 times more relevant.

Relevance does not increase proportionally with term frequency.

A length-normalized variant divides the raw count by the document length:

$$tf(t,d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$$


---

## Log-Frequency Weighting

Log-frequency weighting is a method to score a document's relevance to a query. It improves on simple word counting by using a logarithm to reduce the impact of highly frequent words, reflecting the idea of **diminishing returns**.

### Formulas

The weight $w_{t,d}$ for a term $t$ with term frequency $tf_{t,d}$ in a document $d$ is:

$$w_{t,d} = 1 + \log_{10}(tf_{t,d})$$
(This applies only if the term is present in the document, i.e., $tf_{t,d} > 0$; otherwise the weight is 0.)



The final **relevance score** for the document is the sum of these weights for all terms that appear in **both** the query `q` and the document `d`:

$$Score(q, d) = \sum_{t \in q \cap d} (1 + \log_{10}(tf_{t,d}))$$

Example:

Step 1: Calculate the Term Weights

First, we process the document below and find the term frequencies (tf). We ignore punctuation and common stop words such as "The".

* **Document (D):** "The dog barks loudly."
* **Terms:** `{"dog", "barks", "loudly"}`
* **Term Frequencies:**
    * `tf` of "dog" = 1
    * `tf` of "barks" = 1
    * `tf` of "loudly" = 1

Next, we calculate the log-frequency weight for each term using the formula $w = 1 + \log_{10}(tf)$:

* Weight of "dog": $1 + \log_{10}(1) = 1 + 0 = 1.0$
* Weight of "barks": $1 + \log_{10}(1) = 1 + 0 = 1.0$
* Weight of "loudly": $1 + \log_{10}(1) = 1 + 0 = 1.0$

Step 2: Calculate Scores for Assumed Queries

The final score is the sum of the weights for the terms that appear in **both** the query and the document.

#### **Query 1: "barks"**
The only matching term is "barks".
* **Score** = Weight("barks") = **1.0**

#### **Query 2: "loud dog"**
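Assuming normalization (e.g., lemmatization) maps "loud" and "loudly" to the same term, the matching terms are "loudly" and "dog".
* **Score** = Weight("loudly") + Weight("dog") = `1.0 + 1.0` = **2.0**

Without such normalization, only "dog" matches and the score is 1.0.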

#### **Query 3: "the cat"**
There are no matching terms.
* **Score** = **0.0**
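
The scoring rule can be sketched directly from the formula. The function names are illustrative, and the document is assumed to be already tokenized with stop words removed; no stemming is applied here, so the query must use the exact terms.

```python
import math
from collections import Counter

def log_tf_weight(tf):
    """1 + log10(tf) for tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_terms):
    """Sum the log-frequency weights of terms shared by query and document."""
    tf = Counter(doc_terms)
    return sum(log_tf_weight(tf[t]) for t in set(query_terms) if t in tf)

doc = ["dog", "barks", "loudly"]
print(score(["barks"], doc))          # 1.0
print(score(["loudly", "dog"], doc))  # 2.0
print(score(["the", "cat"], doc))     # 0

# Diminishing returns: 10 occurrences weigh 2.0, not 10 times as much as 1.
print(log_tf_weight(1), log_tf_weight(10))
```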

---

## Document Frequency (df)

The core idea in search relevance is that not all words are equally important. **Frequent terms are less informative than rare terms.**

Think of it like searching for a name in a phone book. The name "Smith" appears in thousands of entries (high frequency), so a match is not very informative. The name "Pendleton" is rare (low frequency), so a match is a very strong signal you've found the right person.

To quantify this, we use **document frequency ($df_t$)**, which is the number of documents in a collection that contain a specific term `t`. A high $df_t$ means the term is common; a low $df_t$ means it's rare.

---

## Inverse Document Frequency (idf)

While `df` tells us how common a term is, for scoring we want a value that is **high** for rare, informative terms and **low** for common ones. This is exactly what **Inverse Document Frequency (idf)** provides. It's a measure of a term's importance or informativeness.

The formula is:
$$idf_t = \log_{10}\left(\frac{N}{df_t}\right)$$

Let's break it down:
* **$N$** is the total number of documents in the collection.
* **$\frac{N}{df_t}$** is the "inverse" part. This fraction will be large for rare terms (e.g., a term in 10 documents out of 1,000,000) and small for common terms (e.g., a term in 500,000 documents out of 1,000,000).
* **$\log_{10}(...)$** is used to "dampen" the effect. This ensures that very rare terms don't completely dominate the score: the weight grows only logarithmically, not linearly, with $\frac{N}{df_t}$.

The result is that terms with a high `idf` are considered significant and are given more weight in relevance calculations.
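
The numbers used in the breakdown above can be plugged into the formula with a short sketch. The function name `idf` is illustrative.

```python
import math

def idf(n_docs, df):
    """Inverse document frequency: log10(N / df_t)."""
    if df <= 0:
        raise ValueError("df must be positive: the term occurs in no document")
    return math.log10(n_docs / df)

N = 1_000_000
print(idf(N, 10))       # rare term: about 5.0
print(idf(N, 500_000))  # common term: about 0.301
print(idf(N, N))        # term in every document: 0.0
```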

---

## TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is one of the most important and widely used weighting schemes in information retrieval. It's a score that evaluates how relevant a word is to a document within a collection of documents.

The core idea is to find a balance: a word is important if it appears **frequently in one document** (high TF), but **rarely across all other documents** (high IDF).



---

### Components of TF-IDF

#### **Term Frequency (TF)**
This measures how often a term appears in a **single document**. The more a term appears, the more important it is *to that specific document*. We often use the log-weighted formula to dampen the effect of very high counts.
$$\text{tf-weight}_{t,d} = 1 + \log_{10}(tf_{t,d})$$

#### **Inverse Document Frequency (IDF)**
This measures how **rare and informative** a term is across the entire collection of documents. Common words like "the" or "is" will have a very low IDF score, while specific, rare terms will have a high IDF score.
$$idf_t = \log_{10}\left(\frac{N}{df_t}\right)$$


#### **The TF-IDF Weight**

To get the final TF-IDF weight for a term `t` in a document `d`, you simply multiply its TF weight by its IDF weight.

$$w_{t,d} = \text{tf-weight}_{t,d} \times idf_t$$

Or, written out in full:
$$w_{t,d} = (1 + \log_{10}(tf_{t,d})) \times \log_{10}\left(\frac{N}{df_t}\right)$$

A high TF-IDF score indicates that a term is a strong keyword or topic for that particular document.


#### **Scoring a Document for a Query**

To calculate the total relevance score of a document for a given query, you sum the TF-IDF weights of all terms that appear in **both** the query and the document.

$$Score(q, d) = \sum_{t \in q \cap d} w_{t,d}$$
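
The weighting and scoring formulas can be combined in a small sketch. The three toy documents are invented for illustration, the function names are hypothetical, and the documents are assumed to be already tokenized.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Return one {term: tf-idf weight} dictionary per document."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each document counts once per term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (1 + math.log10(count)) * math.log10(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

def score(query_terms, doc_weights):
    """Sum the weights of the query terms that occur in the document."""
    return sum(doc_weights.get(t, 0.0) for t in set(query_terms))

# Toy collection (invented).
docs = [
    ["aalborg", "harbour", "aalborg"],
    ["copenhagen", "harbour"],
    ["aalborg", "university"],
]
weights = tf_idf_weights(docs)
scores = [score(["aalborg", "university"], w) for w in weights]
print(scores)
```

The third document ranks first because it contains the rare term "university"; the second shares no term with the query and scores 0.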

### TF-IDF Variants
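One variant replaces the log-weighted term frequency with a length-normalized term frequency:

$$tf(t,d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$$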

---

## Vector Similarity in Information Retrieval – Use Case
- Threshold:
  - For query q, retrieve all documents with similarity above a threshold, e.g., similarity > 0.50.
- Ranking:
  - For query q, return the n most similar documents ranked in order of similarity.

---

## Documents and Queries as Vectors
- Documents
  - So we have a |V|-dimensional vector space
  - Terms are axes of the space
  - Documents are points or vectors in this space
  - Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
  - These are very sparse vectors - most entries are zero.
- Queries
  - Key idea 1: Do the same for queries: represent them as vectors in the space
  - Key idea 2: Rank documents according to their proximity to the query in this space
  - proximity = similarity of vectors
  - proximity ≈ inverse of distance
  - Recall: We do this because we want to get away from the you’re-either-in-or-out Boolean model.
  - Instead: rank more relevant documents higher than less relevant documents

---

## Comparing Vectors: Euclidean Distance vs. Cosine Similarity

When comparing two vectors (e.g., representing a query and a document), we can use different metrics to measure their relationship. Two common methods are Euclidean distance and Cosine similarity.

---

### Euclidean Distance

**Euclidean distance** measures the straight-line distance between the endpoints of two vectors. It calculates how "far apart" the two vectors are in a multi-dimensional space.


$$
d(q,p) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
$$

A major drawback of this method for document analysis is that **it is highly sensitive to vector length (magnitude)**. For example, a very long document and a short document on the same topic will have a large Euclidean distance, suggesting they are dissimilar even though they are thematically related.
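
The length sensitivity can be seen with two count vectors that point in the same direction. This is a minimal sketch; the vectors are invented, with the long document simply containing ten times the counts of the short one.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two equal-length vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

short_doc = [1, 2, 0]    # term counts in a short document
long_doc = [10, 20, 0]   # same topic mix, ten times longer

print(euclidean(short_doc, short_doc))  # 0.0
print(euclidean(short_doc, long_doc))   # about 20.12, despite the same topic
```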

---

### Cosine Similarity

**Cosine similarity** measures the cosine of the angle between two vectors. Instead of measuring distance, it measures orientation. This makes it an excellent choice for tasks like document similarity where the topic (direction) is more important than the document's length (magnitude).

The formula to find the cosine of the angle ($\theta$) between a query vector ($\vec{q}$) and a document vector ($\vec{d}$) is:

$$
\text{similarity} = \cos(\theta) = \frac{\vec{q} \cdot \vec{d}}{||\vec{q}|| \ ||\vec{d}||} = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2} \sqrt{\sum_{i=1}^{n} d_i^2}}
$$

The angle between the vectors indicates their similarity:
* An angle of **0°** corresponds to a cosine value of **1**, representing **maximum similarity**. The vectors are pointing in the same direction.
* An angle of **90°** corresponds to a cosine value of **0**, representing **no similarity**. The vectors are orthogonal.

#### Ranking Results
Since the cosine function is monotonically decreasing on the interval $[0^\circ, 180^\circ]$, a smaller angle implies a higher cosine value (and thus higher similarity). Therefore, to find the most relevant documents for a query:
* You rank documents in **decreasing order** of the cosine similarity score (from 1 down to 0).
* This is equivalent to ranking documents in **increasing order** of the angle between the query and document (from 0° up to 90°).
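
Cosine similarity and the resulting ranking can be sketched as follows. The query and document vectors are invented, the names `cosine` and `rank` are illustrative, and returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math

def cosine(q, d):
    """Cosine of the angle between vectors q and d."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention for empty (all-zero) vectors
    return dot / (norm_q * norm_d)

def rank(query, docs):
    """Return document names in decreasing order of cosine similarity."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)

query = [1, 2, 0]
docs = {"d1": [10, 20, 0], "d2": [0, 1, 5], "d3": [3, 0, 0]}

for name in rank(query, docs):
    print(name, round(cosine(query, docs[name]), 3))
```

Here `d1` is ten times longer than the query but points in the same direction, so it ranks first with a similarity of about 1.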

---