# COURSE OVERVIEW

# The core topics that will be covered in the course include:
- Information retrieval, search, and ranking
- Neural networks
- Language models
- Recommender systems
- Network modelling and embedding

# LECTURE 02/09/2025
--- 
## What is the Web?
The Web is a global, decentralized, hypertext-based information system.
- Hyperlinks are used to navigate from one document to another.
- This information space is built on a set of technical standards for
the identification, retrieval, and representation of content.
---
## World Wide Web (WWW)
- The World Wide Web (referred to as WWW, W3, or
simply the Web) was developed by Tim Berners-Lee in 1989 at
CERN, Switzerland
- The project document described a "hypertext project"
called "WorldWideWeb" in which a "web" of
"hypertext documents" could be viewed by
“browsers”.[1]
- First Website in 1991: info.cern.ch
- First Webpage address:
http://info.cern.ch/hypertext/WWW/TheProject.html
- First Browser
- In 1993: Graphical web browsers such as
Mosaic and Netscape Navigator were made
accessible outside of academia.
- In 1994: W3C (World Wide Web Consortium)
was founded by Tim Berners-Lee. W3C
publishes recommendations that are
considered web standards.
- Web standards are blueprints –or building
blocks– of a consistent and harmonious
digitally connected world. They are
implemented in browsers, blogs, search
engines, and other software that power our
experience on the web.
---
## World Wide Web - Data Model

W3 data model enables:
- Information need only be represented once, as a reference may be made instead of making a copy.
- Links allow the topology of the information to evolve, so modelling the state of human knowledge at any time is without constraint.
- The web stretches seamlessly from small personal notes on the local workstation to large databases on other continents.
- Indexes such as phone books are presented as documents, and so may themselves be found by searches and/or following links.
- The documents in the web do not have to exist as files; they can be "virtual" documents generated by a server in response to a query
or document name. They can therefore represent views of databases, or snapshots of changing data (such as weather forecasts,
financial information, etc.).

Advantages:
- Information access doesn’t require expert knowledge
- Information Retrieval via search engines
---
## World Wide Web - Components
The World Wide Web (WWW) comprises several components:
- Identification: Uniform Resource Identifiers (URIs) - address system;
globally unique identification of web resources.
For example, the URI of the main page for the first WWW project is
http://info.cern.ch/hypertext/WWW/TheProject.html
- Interaction: Hypertext Transfer Protocol (HTTP) - network protocol used
for transferring information/interacting between the web resources. The
data transferred can be plain text, hypertext, images, etc.
- Content Format: Hypertext Markup Language (HTML) - a markup
language, used to define the structure and the content of the webpage.
HTML supports various content types, including text, images, video,
audio, scripts, and hyperlinks for easy web resource access.
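
As a small illustration of the identification component, the sketch below splits the URI from the notes into its parts using Python's standard `urllib.parse` module. The variable names are chosen here for illustration only.

```python
from urllib.parse import urlsplit

# URI of the first WWW project page, taken from the notes above.
uri = "http://info.cern.ch/hypertext/WWW/TheProject.html"

parts = urlsplit(uri)
print(parts.scheme)  # protocol used for the interaction (HTTP)
print(parts.netloc)  # host that serves the resource
print(parts.path)    # location of the HTML document on that host
```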
---
## Topology of the Web

The web's structure can be broken down into three overlapping layers:
1. Classic Document Web (Web 1.0): A network of static pages connected by hyperlinks. Its topology is a simple web of linked documents.
2. Social & Application Web (Web 2.0): A platform for dynamic applications where the connections are user interactions, social ties, and data exchange via APIs.
3. Web of Data (Semantic Web): A network of machine-readable data, not pages. The connections are defined relationships between concepts, forming a global knowledge graph.

These layers coexist and intersect. A modern website is a document (1) that hosts an interactive application (2) and contains structured data for machines (3).

---

## What is Web Intelligence?
Intelligent ways to extract information and knowledge from the web:
- finding relevant information available on the web
- obtaining new knowledge by analyzing web data: the web itself, but also how it evolves, and how users interact on and with the web

Some applications:
- Intelligent Search
- Recommender Systems
- Business Analytics
- Crowd Sourcing
- Not so nice ones: advertising, manipulation, surveillance
---
## Intelligent Search

- Keyword Search: the words in the query appear frequently in
the document, in any order (bag of words).
- Disadvantages:
  - May not retrieve relevant documents that use
synonymous terms (e.g., a query for “restaurant” does not
match a document that only says “café”)
  - May retrieve irrelevant documents that include
ambiguous terms (e.g., cannot distinguish between “bat”
the mammal and “bat” in baseball)
- Beyond Keywords:
  - Considering the meaning of the words used
  - Adapting to user feedback (direct or indirect)
  - Considering the authority of the source
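
The two disadvantages of keyword search can be reproduced with a minimal bag-of-words scorer. This is only a sketch: the example documents, the `tokenize` helper, and the `keyword_score` function are made up for illustration and are not part of the lecture material.

```python
import re
from collections import Counter

def tokenize(text):
    # Lower-case the text and keep runs of word characters only.
    return re.findall(r"\w+", text.lower())

def keyword_score(query, document):
    # Bag of words: count how often each query term occurs, ignoring order.
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in tokenize(query))

# Hypothetical mini-collection.
documents = [
    "The best restaurant in town, a restaurant for every budget.",
    "A cozy café serving lunch and dinner.",
    "The bat is a mammal that flies at night.",
    "He swung the bat and hit the ball out of the park.",
]

# Synonym problem: the café document scores 0 for "restaurant".
restaurant_scores = [keyword_score("restaurant", d) for d in documents]
# Ambiguity problem: both "bat" documents get the same score.
bat_scores = [keyword_score("bat", d) for d in documents]
print(restaurant_scores, bat_scores)
```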
---
## Recommender Systems
Kinds of recommendations:
- Product-based (collaborative filtering):
similar books
- User-based (content-based filtering):
based on search history
- Hybrid
---
## Node classification
We can represent a domain as a network of nodes, where each node belongs to a category.
- Example: suppose aau.dk is a domain made up of many connected nodes; each color then marks a different category, such as research projects, educational programs, etc.
---
## Levels of the Web
We can distinguish 3 levels of modeling and analytics:
1. Web Content
2. Web Structure
3. Web Dynamics
---

# LECTURE 09/09/2025
---
# NLP BASICS

Natural Language Processing (NLP) is a subfield of Artificial Intelligence and Computational Linguistics, comprising computational methods for understanding or generating natural languages.
The goals of this process are to:
- convert human language into a machine-understandable form
- decipher human languages
- understand and make sense of the text

Natural Languages consist of:
- phonology: sounds of the words
- semantics: meaning of the words
- syntax: grammatical rules according to which words are put together

Text is difficult to analyze because language is ambiguous: for example, a word can have different meanings depending on the context, and the position of a word can change the meaning of a sentence.

Language data is unstructured. To make it machine understandable we tokenize it; however, this process is highly language dependent.

---
## Tokenization

Given the phrase:

J.K. Rowling’s book, Harry Potter and the Philosopher's Stone was published in 1997. It’s still a bestseller and truly amazing
how much people enjoy it!

- Word level tokenization:
Tokenized words: ["J.K.", "Rowling’s", "book", ",", "Harry", "Potter", "and", "the", "Philosopher's", "Stone", "was", "published", "in",
"1997", ".", "It’s", "still", "a", "bestseller", "and", "truly", "amazing", "how", "much", "people", "enjoy", "it", "!"]
- Handling contractions:
The contraction “It’s” may need to be split into two tokens: ['It', "'s"]. Some tokenizers do this, while others keep it as a single token.
- Punctuation handling:
Commas, apostrophes, and periods are separated from words, ensuring they are individual tokens
- Named entity recognition (NER):
The book title and the author are multi-word entities, which ideally should be handled as a single token or entity for proper semantic
analysis.
  - Harry Potter and the Philosopher's Stone
  - J.K. Rowling
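
The effect of different tokenization rules can be seen in a short regex-based sketch. Both patterns below are illustrative assumptions, not the rules of any particular tokenizer, and the example uses a straight apostrophe for simplicity.

```python
import re

text = "It's still a bestseller and truly amazing how much people enjoy it!"

def naive_tokenize(text):
    # Every run of word characters or single punctuation mark is a token,
    # so the contraction is broken into three pieces.
    return re.findall(r"\w+|[^\w\s]", text)

def word_tokenize(text):
    # Keep word-internal apostrophes so that "It's" stays one token,
    # while other punctuation is still separated from words.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

print(text.split())          # whitespace only: "it!" stays glued together
print(naive_tokenize(text))
print(word_tokenize(text))
```

Neither pattern handles abbreviations such as "J.K." (the periods would be split off), which is one reason practical tokenizers need many more rules.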

Most tokenizers are rule-based and follow different conventions, for example for the word "don't":
- Penn Treebank: it will divide the word into "do" and "n't"
- Moses: it will divide the word into "don" and "'t"

Both of these tokenizers are rule-based.

We also need to take compound words and names into consideration, since the process removes white space but does not add it.
Another important factor is the encoding of the text, since the same symbol can be stored as different numeric (byte) values depending on the encoding.
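
To see why the encoding matters, the sketch below encodes the same symbol with two encodings. The choice of "é" and of UTF-8 versus Latin-1 is just an example.

```python
symbol = "\u00e9"  # the character é

utf8_bytes = symbol.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = symbol.encode("latin-1")  # one byte in Latin-1

print(list(utf8_bytes))
print(list(latin1_bytes))

# Decoding with the wrong encoding does not fail here; it silently
# produces different text, which then tokenizes differently.
misread = utf8_bytes.decode("latin-1")
print(misread)
```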

### Tokenization in Encoding

**Tokenization** in encoding assigns a unique numeric value (token ID) to each distinct word so that the system can recognize and process words based on their assigned values.
Example:

I live in Copenhagen = ["I", "live", "in", "Copenhagen"]\
I = 1\
live = 2\
in = 3\
Copenhagen = 4

We live in Copenhagen = ["We", "live", "in", "Copenhagen"]\
We = 5\
live = 2\
in = 3\
Copenhagen = 4
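
The ID assignment in this example can be sketched as a vocabulary that grows whenever an unseen word appears. The function name `build_encoder` and the choice to start IDs at 1 follow the example above and are otherwise assumptions.

```python
def build_encoder():
    vocabulary = {}  # maps each distinct word to its ID

    def encode(sentence):
        ids = []
        for word in sentence.split():
            if word not in vocabulary:
                # New word: give it the next free ID (IDs start at 1).
                vocabulary[word] = len(vocabulary) + 1
            ids.append(vocabulary[word])
        return ids

    return encode, vocabulary

encode, vocabulary = build_encoder()
first = encode("I live in Copenhagen")
second = encode("We live in Copenhagen")
print(first, second)
```

Words that were already seen ("live", "in", "Copenhagen") keep their IDs in the second sentence; only "We" receives a new one.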

---

## Normalization

This process transforms distinct 'equivalent' tokens into one normalized form, e.g.:
- lower-casing: This -> this
- use non-hyphenated words: anti-discriminatory -> antidiscriminatory
- delete periods: U.S.A. -> USA or usa

However, normalization has to be balanced, because the search results change with the amount of normalization applied:
- more normalization: a search for U.S.A. also returns the results for usa or USA, but a search for C.A.T. then also returns the results for cat
- less normalization: a search for C.A.T. does not return the results for cat, but a search for U.S.A. also misses usa and USA
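
A minimal sketch of these normalization steps, with each step behind a switch so the amount of normalization can be varied. The function and parameter names are hypothetical.

```python
def normalize(token, lowercase=True, strip_periods=True, strip_hyphens=True):
    if strip_periods:
        token = token.replace(".", "")   # U.S.A. -> USA
    if strip_hyphens:
        token = token.replace("-", "")   # anti-discriminatory -> antidiscriminatory
    if lowercase:
        token = token.lower()            # This -> this
    return token

print(normalize("U.S.A."))
print(normalize("U.S.A.", lowercase=False))
print(normalize("anti-discriminatory"))
# With full normalization, C.A.T. and cat become the same term.
print(normalize("C.A.T.") == normalize("cat"))
```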

---

## Stop words
These are common words that usually do not add significant meaning to a sentence and can be filtered out to reduce noise in text analysis.
However, the removal of stop words can cause issues.
For example, the phrase "to be or not to be" consists entirely of stop words, yet each of them is an important part of the expression.
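
A short sketch of stop word filtering, including the failure case. The stop list here is a small hand-picked assumption; real stop lists are longer and differ between tools.

```python
# Hypothetical, deliberately tiny stop list.
STOP_WORDS = {"a", "an", "and", "be", "in", "is", "not", "of", "on", "or", "the", "to"}

def remove_stop_words(tokens):
    return [token for token in tokens if token.lower() not in STOP_WORDS]

kept = remove_stop_words("the cat sat on the mat".split())
lost = remove_stop_words("to be or not to be".split())
print(kept)  # content words survive
print(lost)  # nothing is left of the phrase
```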

---

## Stemming
Stemming is similar to normalization: it strips the grammatical variants of a word and reduces them to a common underlying form (the stem).
Here are a few examples:
- learn, learns, learned -> learn
- organize, organizer, organization -> organ

The terms that result from stemming will be included in the index.

### Porter introduced an algorithm for the English language (1980):

- A rule-based stemmer with rules for mostly suffix-stripping such as:
  - “ing” → “” (connecting → connect)
  - “sses” → “ss” (caresses → caress)
  - “ies” → “i” (ponies → poni)
  - “s” → “” (cats → cat)
- Every word or word part has the form `[C](VC){m}[V]`, where C is a sequence of one or more consonants, V is a sequence of one or more vowels, and m ≥ 0 (the measure) is the number of times (VC) occurs. A consonant is any letter other than A, E, I, O, U, and other than Y preceded by a consonant; every other letter is a vowel.
  - m = 0: TREE, BY
  - m = 1: TROUBLE, OATS, TREES
  - m = 2: TROUBLES, OATEN, PRIVATE
- The rules for removing a suffix have the form: (condition) S1 -> S2
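
The four suffix rules listed above can be sketched as an ordered rule table. This is a simplified illustration, not the full Porter algorithm: the extra rule "ss" → "ss" and the vowel check for "ing" are assumptions added so that words like "caress" and "sing" are left alone.

```python
# (suffix, replacement) pairs, tried in order; the first match wins.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),   # assumption: protects words ending in "ss" from the "s" rule
    ("s", ""),
    ("ing", ""),
]

def strip_suffix(word):
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            # Assumed condition: only strip "ing" if the stem contains a vowel.
            if suffix == "ing" and not any(ch in "aeiou" for ch in stem):
                return word
            return stem + replacement
    return word

for example in ["connecting", "caresses", "ponies", "cats", "caress", "sing"]:
    print(example, "->", strip_suffix(example))
```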

Porter’s Stemming Algorithm
- `[C](VC){m}[V]`
- Stem the word “REPLACEMENT”
- The rule is (m > 1) EMENT → “”: the remaining stem REPLAC has m = 2, so EMENT is removed
- Generates a stem “replac”
- Advantage:
  - It is generally reported to produce good output compared to other stemmers, with a low error rate
- Disadvantage:
  - Morphological variants produced are not always real words (produces stems)

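Use the algorithm to stem the word MULTIDIMENSIONAL:
- Check the common suffixes: the word does not end in ing, sses, ies, s, ation, ization, etc.
- The word ends in AL, which is removed if m > 1 for the remaining stem MULTIDIMENSION:
  - Letters: M U L T I D I M E N S I O N
  - Types: C V C C V C V C V C C V V C
  - Grouped: [C] (VC) (VC) (VC) (VC) (VC)
  - m = 5
- Since m = 5 > 1, AL is removed and the stem is “multidimension”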
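
The measure m and the (m > 1) condition used for REPLACEMENT and MULTIDIMENSIONAL can be checked with a small sketch. The helper names are hypothetical, and only this one kind of rule is implemented, not the whole algorithm.

```python
import re

def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a consonant unless it is preceded by a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    stem = stem.lower()
    # Label every letter, e.g. "tree" -> "CCVV".
    pattern = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    # Collapse runs so each C or V stands for a whole sequence: "CCVV" -> "CV".
    collapsed = re.sub(r"(.)\1+", r"\1", pattern)
    # m is the number of VC pairs in [C](VC){m}[V].
    return collapsed.count("VC")

def strip_if_measure_gt_1(word, suffix):
    # Rule of the form (m > 1) SUFFIX -> "".
    word = word.lower()
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if measure(stem) > 1:
            return stem
    return word

print(measure("multidimension"))
print(strip_if_measure_gt_1("REPLACEMENT", "ement"))
print(strip_if_measure_gt_1("MULTIDIMENSIONAL", "al"))
```

A word such as "cement" is left unchanged by the EMENT rule, because the remaining stem "c" has m = 0.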

There are three criteria for evaluating stemmers:
- Correctness
- Efficiency of the task
- Compression performance

However, stemming is also language dependent (for example, it does not work with Chinese), and it can introduce errors depending on how much is removed:
- overstemming: removes too much from a word, changing its meaning (unrelated words can end up with the same stem).
- understemming: removes too little, so that variants of the same word are reduced to different stems.

---

## Zipf's law
Zipf's law in NLP describes the power-law distribution of word frequencies in a corpus: the frequency of a word is inversely proportional to its rank in the frequency table.
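
Written as a formula, this is commonly stated as follows. The symbols are notation introduced here, not taken from the lecture: $f(r)$ is the frequency of the word with rank $r$, and $C$ is a corpus-dependent constant.

$$
f(r) \approx \frac{C}{r}
\qquad\Longrightarrow\qquad
\frac{f(1)}{f(2)} \approx 2, \quad \frac{f(1)}{f(3)} \approx 3
$$

Under this idealized form, the most frequent word occurs about twice as often as the second most frequent word and three times as often as the third. Real corpora follow it only approximately.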

Example:
“The cat sat on the mat, and the dog sat on the mat.”

Tokenize:
[“The”, “cat”, “sat”, “on”, “the”, “mat”, “,”, “and”, “the”, “dog”, “sat”, “on”, “the”, “mat”, “.”] -> results in 13 word tokens (15 tokens including punctuation)

- Type: a unique word
- Token: an instance of a type in a corpus (including repetitions)

Word frequencies (after lower-casing):
- the: 4
- sat: 2
- on: 2
- mat: 2
- cat: 1
- dog: 1
- and: 1

So there are 7 types and 13 tokens, and the type/token ratio is 7/13 ≈ 0.54. In general, more data -> lower type/token ratio.
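
The counts for this example can be reproduced with a few lines of Python. The sketch lower-cases the text first, so that "The" and "the" count as the same type, and ignores punctuation; both choices are assumptions of this sketch.

```python
import re
from collections import Counter

text = "The cat sat on the mat, and the dog sat on the mat."

# Word tokens only: lower-cased, punctuation dropped.
tokens = re.findall(r"[a-z]+", text.lower())
frequencies = Counter(tokens)

# Types are the distinct words; tokens include repetitions.
type_token_ratio = len(frequencies) / len(tokens)

# Words ranked by frequency, most frequent first.
ranked = frequencies.most_common()
print(len(tokens), len(frequencies), round(type_token_ratio, 2))
print(ranked)
```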

Positive Implications of Zipf's law:
- Any document/text will contain a number of words that are very common.
- Helps us understand the structure (and possibly meaning) of the text

Negative Implications:
- Any document/text will have a large number of rare words, and new text will contain words that are missing from the vocabulary, known as out-of-vocabulary (OOV) words

Handling OOV words:
- Replace the set of OOV words in the training data with an unknown word token (UNK)
- Use of statistical models
- Use sub-string based representations such as Byte-Pair Encoding

Byte-Pair Encoding: learn which character sequences are common in the vocabulary of the language, and treat those common sequences
as atomic units of the vocabulary
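
A minimal sketch of how Byte-Pair Encoding merges can be learned, assuming the common formulation in which the most frequent adjacent symbol pair is merged repeatedly. The word counts are a made-up toy corpus, and the function names are hypothetical.

```python
from collections import Counter

def merge_pair(symbols, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_counts, num_merges):
    # Start with every word split into single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges, vocab

# Toy corpus: word -> number of occurrences (illustrative values).
word_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, vocab = learn_bpe(word_counts, num_merges=3)
print(merges)
print(vocab)
```

After three merges the frequent sequence "hug" has become a single unit, while a rarer word such as "pug" is still represented by smaller pieces ("p" + "ug").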

---

## Preprocessing
Preprocessing is the process in which text is broken down and cleaned up through:
- Tokenization
- Normalization
- Stop word removal
- Stemming

Definitions: 
- Corpus: collection of documents (e.g., all web-pages that have been crawled)
- Raw corpora have only minimal (or no) processing:\
Sentence boundaries may or may not be identified.\
There may or may not be metadata.\
Typos (written text) or disfluencies (spoken language) may or may not be corrected.
- Annotated corpora contain some labels (e.g., POS tags, sentiment labels) or linguistic structures (e.g., syntax trees,
semantic interpretations), etc.
- Vocabulary: all terms that appear in the corpus

---

## Part of Speech (POS)
Part-of-speech tagging, or POS tagging, is a task that entails
classifying words in a text according to their grammatical
categories (such as noun, verb, and adjective).\
Advantages:
- Helps in identifying the syntactic structure
- POS tagging often helps disambiguate the meaning of ambiguous
words based on their role in the sentence.
- Aids Named Entity Recognition

---

## String Similarity
String similarity can be measured through:
- Hamming Distance: the number of positions at which the corresponding symbols of two equal-length strings differ.
- Levenshtein Distance: the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another.
- Jaccard Similarity: a statistic that measures the similarity or overlap between two sets. Jaccard similarity = $\frac{|A \cap B|}{|A \cup B|}$
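
The three measures can be sketched in plain Python as follows. The function names are chosen here for illustration, and returning 1.0 for two empty sets in `jaccard` is an assumption, since the ratio is undefined in that case.

```python
def hamming(a, b):
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    # Count the positions where the symbols differ.
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    # Dynamic programming over prefixes, keeping only the previous row.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if ca == cb else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

def jaccard(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)

print(hamming("karolin", "kathrin"))
print(levenshtein("kitten", "sitting"))
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))
```

Hamming distance only applies to equal-length strings, Levenshtein distance also handles insertions and deletions, and Jaccard similarity ignores order entirely because it works on sets (for example, sets of tokens).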

---