# Semantic Text Similarity

## Applications of semantic similarity

- Grouping similar words into semantic concepts
- As a building block in natural language understanding tasks
    - Textual entailment
    - Paraphrasing

## WordNet
- Semantic dictionary of (mostly) English words, interlinked by semantic relations
- Includes rich linguistic information
    - Part of speech, word senses, synonyms, hypernyms/hyponyms, meronyms, derivationally related forms, ...
- Machine-readable, freely available

<img src="resources/semantic_text_similarity_1.png" width = "400">

### Similarity measures based on the hierarchy

- Path similarity

<img src="resources/semantic_text_similarity_2.png" width = "400">

- Lowest common subsumer (LCS)

<img src="resources/semantic_text_similarity_3.png" width = "400">

- Lin Similarity

<img src="resources/semantic_text_similarity_4.png" width = "400">
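
The three measures above can be sketched on a toy hierarchy. This is a minimal illustration, not WordNet itself: the `PARENT` taxonomy and the `PROB` concept probabilities are made up, and the function names are hypothetical. NLTK's WordNet interface offers comparable measures on real synsets.

```python
import math

# Hypothetical toy taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "deer": "ungulate",
    "elk": "deer",
    "horse": "ungulate",
    "giraffe": "ungulate",
    "ungulate": "mammal",
    "mammal": "animal",
}

# Made-up concept probabilities; in practice they are estimated from corpus
# counts, and a parent is at least as probable as any of its children.
PROB = {
    "animal": 1.0, "mammal": 0.5, "ungulate": 0.2,
    "deer": 0.05, "elk": 0.01, "horse": 0.08, "giraffe": 0.02,
}


def ancestors(concept):
    """Return the chain from a concept up to the root, starting with the concept."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def lowest_common_subsumer(u, v):
    """Closest ancestor shared by u and v (a concept counts as its own ancestor)."""
    v_chain = set(ancestors(v))
    for node in ancestors(u):
        if node in v_chain:
            return node
    raise ValueError("concepts share no ancestor")


def path_similarity(u, v):
    """1 / (number of edges on the shortest path between u and v + 1)."""
    lcs = lowest_common_subsumer(u, v)
    distance = ancestors(u).index(lcs) + ancestors(v).index(lcs)
    return 1 / (distance + 1)


def lin_similarity(u, v):
    """2 * log P(LCS) / (log P(u) + log P(v))."""
    lcs = lowest_common_subsumer(u, v)
    return 2 * math.log(PROB[lcs]) / (math.log(PROB[u]) + math.log(PROB[v]))
```

In this toy tree, `deer` and `elk` are one edge apart, while `deer` and `horse` only meet at `ungulate`, so the first pair scores higher under path similarity.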

### Collocations and Distributional similarity
- Two words that frequently appear in similar contexts are more likely to be semantically related
- **Context**
    - Words before, after, within a small window
    - Parts of speech of words before, after, in a small window
    - Specific syntactic relation to the target word
    - Words in the same sentence, same document, ...
- **Strength of association** between words
    - How frequently do these two words appear together?
    - How frequently does each of these words appear individually? If each appears very often on its own, they are probably unrelated even if they co-occur a few times
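
One common way to weigh co-occurrence against individual frequency is pointwise mutual information (PMI). The sketch below assumes each sentence is one context and uses a tiny made-up corpus; the variable and function names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

# Tiny made-up corpus; each sentence is treated as one context.
sentences = [
    "strong tea with milk",
    "strong tea and biscuits",
    "powerful computer with strong fan",
    "powerful computer and screen",
]

word_counts = Counter()
pair_counts = Counter()
for sentence in sentences:
    words = set(sentence.split())  # count each word once per context
    word_counts.update(words)
    pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))

n_contexts = len(sentences)


def pmi(w1, w2):
    """PMI = log( P(w1, w2) / (P(w1) * P(w2)) ), with probabilities over contexts."""
    joint = pair_counts[frozenset((w1, w2))] / n_contexts
    if joint == 0:
        return float("-inf")  # the two words were never seen together
    p1 = word_counts[w1] / n_contexts
    p2 = word_counts[w2] / n_contexts
    return math.log(joint / (p1 * p2))
```

A positive score means the pair co-occurs more often than the individual frequencies alone would predict; a negative score means less often. Here `pmi("powerful", "computer")` is positive, while `pmi("strong", "computer")` is negative because both words are frequent yet share only one context.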


# Topic Modeling

Documents are a mixture of topics
- Topic Modelling is a coarse-level analysis of what's in a text collection
- Topic: the subject (theme) of a discourse
- Topics are represented as a **word distribution**

What's known:
- The text collection or corpus
- Number of topics

What's not known:
- The actual topics
- Topic distribution for each document

Topic Modelling is
- Essentially a **text clustering** problem
    - Documents and words are clustered together

## Topic modelling approaches
- Probabilistic Latent Semantic Analysis (PLSA)
- Latent Dirichlet Allocation (LDA)

# Generative Models and LDA

<img src="resources/generative_models_1.png" width = "400">

Sometimes we have more than one chest to pull words from.

<img src="resources/generative_models_2.png" width = "400">

## Latent Dirichlet Allocation (LDA)

**Generative model** for a document d
- Choose length of document d
- Choose a mixture of topics for document d
- Use a topic's multinomial distribution to output words to fill that topic's quota
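
The generative steps above can be sketched with the standard library alone. The two topics and their word probabilities are made up, the document length is passed in rather than sampled, and a topic is drawn for every word, so each topic's share of the document matches the mixture only in expectation.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical topics: each is a multinomial distribution over a small vocabulary.
TOPICS = {
    "sports": {"game": 0.4, "team": 0.3, "score": 0.2, "data": 0.1},
    "tech": {"data": 0.4, "code": 0.3, "model": 0.2, "game": 0.1},
}


def sample_dirichlet(alphas):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(length, alpha=0.5):
    """Generate one document: pick a topic mixture, then draw words from topics."""
    names = list(TOPICS)
    # Choose a mixture of topics for this document.
    mixture = sample_dirichlet([alpha] * len(names))
    words = []
    for _ in range(length):
        # Pick a topic according to the mixture, then a word from that topic.
        topic = rng.choices(names, weights=mixture)[0]
        vocab = TOPICS[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return mixture, words


mixture, words = generate_document(length=8)
```

Fitting LDA runs this story in reverse: given only the words, it infers the topics and each document's mixture.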

## Topic Modeling in Practice

- How many topics?
    - Finding or even guessing the number of topics is hard
- Interpreting topics
    - Topics are just word distributions
    - Making sense of words / generating labels is subjective

## Working with LDA in Python
- Pre-processing text
    - Tokenize, normalize (lowercase)
    - Stop word removal
    - Stemming
- Convert tokenized documents to a document-term matrix
- Build LDA models on the doc-term matrix
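
A minimal sketch of this pipeline, assuming scikit-learn is installed (other libraries such as gensim would work as well). The four documents are made up, stemming is skipped for brevity, and the choice of two topics is a guess for this toy corpus.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus for illustration only.
docs = [
    "The team won the game with a late goal",
    "The coach praised the team after the game",
    "The new phone has a faster chip and a better camera",
    "The laptop chip is faster than the old phone chip",
]

# Tokenize, lowercase, and drop English stop words; stemming is skipped here.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
doc_term = vectorizer.fit_transform(docs)  # sparse document-term matrix

# The number of topics must be chosen up front; 2 is a guess for this corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(doc_term)  # one topic mixture per document

# Show the highest-weighted words of each topic.
index_to_term = {i: t for t, i in vectorizer.vocabulary_.items()}
for topic_id, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]
    print(topic_id, [index_to_term[i] for i in top])
```

Each row of `doc_topic` sums to 1 and gives that document's topic mixture; the printed word lists still need a human to label them.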

# Information extraction

- Abundance of unstructured, freeform text
    - How to convert this to structured form?
- Goal: Identify and extract fields of interest from free text
- Fields of interest
    - **Named entities**
    - **Relations**

## Named Entity Recognition
- **Named entities**: Noun phrases that are of a specific type and refer to specific individuals, places, organizations, ...
- **Named Entity Recognition**: Technique(s) to identify all mentions of pre-defined named entities in text
    - Identify the mention/phrase: Boundary detection
    - Identify the type: Tagging/classification
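
Both sub-tasks are often encoded together as one tag per token. The sketch below assumes the widely used BIO scheme (`B-` begins an entity, `I-` continues it, `O` is outside any entity), which these notes do not prescribe; `decode_bio` is a hypothetical helper that turns such tags back into typed phrases.

```python
def decode_bio(tokens, tags):
    """Turn per-token BIO tags into (phrase, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current_type
        # Close the open entity when the current tag does not extend it.
        if current_tokens and not continues:
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if prefix == "B" or (prefix == "I" and not current_tokens):
            current_tokens, current_type = [token], entity_type
        elif continues:
            current_tokens.append(token)
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["Barack", "Obama", "visited", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
entities = decode_bio(tokens, tags)
```

The `B-`/`I-` prefixes carry the boundary decision and the suffix (`PER`, `LOC`) carries the type, so a single sequence tagger can solve both sub-tasks at once.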

## Approaches to identify named entities
- For well-formatted fields like dates and phone numbers: Regular expressions
- For other fields: Typically a machine learning approach
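
A small sketch of the regular-expression approach. The two patterns are illustrative assumptions: they cover numeric dates separated by `/` or `-` and US-style phone numbers, not every format found in real text.

```python
import re

# Illustrative patterns only; real-world dates and phone numbers vary far more.
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")

text = "Call 555-123-4567 before 04/20/2019 or (555) 987 6543 after 5-1-20."

dates = DATE_RE.findall(text)    # all date-like substrings, in order
phones = PHONE_RE.findall(text)  # all phone-like substrings, in order
```

Patterns like these are quick to write and precise for rigid formats, but they do not generalize to names of people or organizations, which is where learned models take over.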

## Relation extraction
- Identify relationships between named entities
- Co-reference resolution
    - Disambiguate mentions and group mentions together
- Question Answering
    - Given a question, find the most appropriate answer from the text
    - Builds on named entity recognition, relation extraction and co-reference resolution