# text similarity

## types of text similarity

Morphological similarity (e.g., respect-respectful)

Spelling similarity (e.g., theater-theatre)

Homophony (e.g., raise-raze-rays)

Semantic similarity

- Synonymy, including across languages (e.g., talkative-chatty, English-英语)

- Sentence similarity (e.g., paraphrases)

- Document similarity (e.g., two news stories on the same event)

## applications

Information retrieval: Enhances search engine results by considering semantically related words when matching user queries to documents.

Text summarization: Identifies and groups similar words or phrases to extract key points and reduce redundancy in summaries.

Semantic clustering: Groups words or documents based on their semantic similarity, aiding in topic modeling and categorization tasks.

Spell checking: Suggests corrections for misspelled words by identifying closely related words in the vocabulary.

Machine translation: Improves translation quality by mapping similar words across languages and preserving context.

Word sense disambiguation: Resolves ambiguities in word meanings by considering the similarity between words in a given context.

Sentiment analysis: Detects sentiment in text by identifying words with similar emotional connotations.

Recommendation systems: Recommends relevant items (e.g., news articles or products) by analyzing the similarity between words in descriptions or user preferences.

Paraphrase detection: Identifies paraphrases or alternative expressions by measuring the similarity between word sets in different sentences.

Analogical reasoning: Solves analogies by comparing the semantic relationships between word pairs.

## similarity measures

| Similarity measure | Formula | Description | Suitable data | Unsuitable data |
|--------------------|:-------------------------------------------------------:|----------------------------------------------------------|---------------------------------------------------------|--------------------------------------|
| Jaccard similarity | $$J(A, B) = \frac{\|A \cap B\|}{\|A \cup B\|}$$ | overlap between two sets (A and B) | discrete sets (binary, categorical) | continuous data |
| **Cosine similarity** | $$\cos(A, B) = \frac{\phi(A) \cdot \phi(B)}{\|\|\phi(A)\|\| \, \|\|\phi(B)\|\|} \in [-1, 1]$$ | cosine of the angle between two non-zero feature vectors $\phi(A)$ and $\phi(B)$ | high-dimensional, sparse data (**text**) | data where vector magnitude matters (cosine ignores length) |
| Euclidean distance | $$d(A, B) = \|\|A-B\|\|=\sqrt{\sum_{i=1}^n (A_i - B_i)^2}$$ | "straight line" distance between two points (A and B) in n-D space; smaller means more similar | continuous, low-dimensional data (numerical) | high-dimensional, sparse data |
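
The three measures in the table can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names and the example sentences are made up for this note, vectors are assumed to be equal-length sequences of numbers, and returning 1.0 for two empty sets is a convention chosen here.

```python
import math
from typing import Sequence


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|. Two empty sets are treated as identical (assumed convention)."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance; 0 means the two points coincide."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Jaccard on word sets: 3 shared words out of 7 distinct words.
tokens_a = set("the cat sat on the mat".split())
tokens_b = set("the cat lay on the rug".split())
print(jaccard(tokens_a, tokens_b))

# Cosine ignores magnitude: these vectors point in the same direction.
print(cosine([1, 2, 0], [2, 4, 0]))

# Euclidean distance of the classic 3-4-5 triangle.
print(euclidean([0, 0], [3, 4]))
```

Note that `[1, 2, 0]` and `[2, 4, 0]` have cosine similarity of about 1 but a non-zero Euclidean distance, which is why the two measures suit different kinds of data.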


## text kernel



A text kernel is a function that measures the similarity between two text sequences based on specific properties, such as shared substrings, edit distance, or parse-tree structure.

A text kernel implicitly maps text sequences into a high-dimensional feature space, where similarity is computed as the inner product of the two sequences' feature vectors: $K(s, t) = \langle \phi(s), \phi(t) \rangle$.
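
To make the "inner product in feature space" idea concrete, here is a small sketch that uses counts of contiguous character n-grams as the feature map. The choice of n-grams and the names `ngram_features` and `ngram_kernel` are assumptions made for illustration; other feature maps give other kernels.

```python
from collections import Counter


def ngram_features(text: str, n: int = 2) -> Counter:
    """Explicit feature map: counts of every contiguous character n-gram."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_kernel(s: str, t: str, n: int = 2) -> int:
    """Inner product of the two n-gram count vectors.

    Only n-grams present in both strings contribute, so the sum runs
    over the features of s and looks each one up in the features of t.
    """
    phi_s = ngram_features(s, n)
    phi_t = ngram_features(t, n)
    return sum(count * phi_t[gram] for gram, count in phi_s.items())


# "theater" and "theatre" share the bigrams th, he, ea, at.
print(ngram_kernel("theater", "theatre"))  # 4
print(ngram_kernel("theater", "theater"))  # 6
```

The feature space has one dimension per possible n-gram, which is very large, but the kernel never builds the full vector: it only touches the n-grams that actually occur.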

Text kernels are often used in kernel-based machine learning algorithms (SVMs, kernel PCA, Gaussian processes) for NLP tasks such as text classification, sentiment analysis, document clustering, and information retrieval.

### types

String kernels: Measure similarity based on the common substrings or subsequences between two text sequences.

- substring: a contiguous run of characters taken from a string, in their original order and with no gaps.

    e.g., substrings of string "hello" can be "he", "ell", and "llo".

- subsequence: a sequence of characters obtained by deleting zero or more characters from the original string without changing the order of the rest, so gaps are allowed.

    e.g., subsequences of string "hello" can be "hlo", "elo", and "hl".
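
The difference between the two notions is easy to check in code. The helpers below are a minimal sketch with hypothetical names; `substrings` enumerates every distinct substring, which is only practical for short strings.

```python
def is_substring(part: str, text: str) -> bool:
    """True if `part` occurs in `text` as one contiguous block."""
    return part in text


def is_subsequence(part: str, text: str) -> bool:
    """True if `part` can be obtained from `text` by deleting characters.

    `ch in remaining` advances the iterator until it finds `ch`, so each
    character of `part` must appear after the previous match.
    """
    remaining = iter(text)
    return all(ch in remaining for ch in part)


def substrings(text: str) -> set:
    """All distinct non-empty substrings of `text`."""
    return {text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)}


print(is_substring("ell", "hello"))    # True
print(is_substring("hlo", "hello"))    # False: the characters are not contiguous
print(is_subsequence("hlo", "hello"))  # True
print(is_subsequence("oh", "hello"))   # False: the order is reversed

# Substrings shared by two strings, the raw material of a simple string kernel.
print(sorted(substrings("hello") & substrings("help")))
```

Every substring is also a subsequence, but not the other way around.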
    

Tree kernels: Measure similarity between two text sequences by comparing common subtrees between their syntactic parse trees.

Feature space kernels: Measure similarity between two text sequences by mapping them to fixed-length vectors, such as Bag-of-Words or TF-IDF, and comparing those vectors.
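
As a rough sketch of a feature space kernel, the code below builds TF-IDF vectors and compares them with a cosine-normalized inner product. Several details are assumptions made for this example: whitespace tokenization, raw term counts as TF, the smoothed IDF variant `log((1 + N) / (1 + df)) + 1`, and the function names. The three documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed: whitespace tokens
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed smoothed IDF variant; other definitions are also in use.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine_kernel(u: dict, v: dict) -> float:
    """Inner product of two sparse vectors, normalized by their lengths."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)


docs = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock markets fell sharply today",
]
vectors = tfidf_vectors(docs)
print(cosine_kernel(vectors[0], vectors[1]))  # high: most terms are shared
print(cosine_kernel(vectors[0], vectors[2]))  # 0.0: no terms in common
```

Storing each vector as a dict keeps only the non-zero entries, which matters because Bag-of-Words and TF-IDF vectors have one dimension per vocabulary term and are mostly zeros.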
