# word representation

|Method	|Pros	|Cons      |
|-------|-------|----------|
|Image	|Visual representation|High-dimensional data, may not capture abstract concepts|
|Denotational Semantics	|Formal representation, based on logical relations	|Requires complex, formal knowledge representation|
|Dictionary Definition	|Explicit meaning, captures semantic relationships	|Verbose; synonym lists are incomplete; becomes outdated; subjective; manually created; hard to build for different languages or domains; hard to compute word similarity|
|One-hot Encoding vector	|Simple, easy to implement	|High-dimensional ($\|\mathcal{V}\|$); sparse; lacks semantic relationships; all vectors are mutually orthogonal, so they say nothing about word similarity|
|Distributed Embedding	|Captures semantic relationships, lower-dimensional, dense	| Requires large training data, may not capture rare words|
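
The contrast between the last two rows can be sketched in a few lines. This is a minimal illustration, assuming a three-word toy vocabulary and hand-picked dense vectors (neither comes from a trained model): one-hot vectors of different words always have cosine similarity 0, while dense vectors can place related words close together.

```python
import math

vocab = ["cat", "kitten", "car"]  # toy vocabulary (assumed for illustration)

def one_hot(word, vocab):
    """Return the |V|-dimensional indicator vector for `word`."""
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-picked dense vectors, purely illustrative (not learned from data).
dense = {
    "cat": [0.9, 0.8],
    "kitten": [0.8, 0.9],
    "car": [-0.7, 0.1],
}

# Different one-hot vectors never overlap, so their similarity is always 0.
onehot_sim = cosine(one_hot("cat", vocab), one_hot("kitten", vocab))

# Dense vectors can encode that "cat" is closer to "kitten" than to "car".
dense_sim = cosine(dense["cat"], dense["kitten"])
dense_far = cosine(dense["cat"], dense["car"])

print(onehot_sim, round(dense_sim, 3), round(dense_far, 3))
```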

# word embedding algorithm

## definition

word embedding algorithm: a function $f: \mathcal{V} \rightarrow \mathbb{R}^n$ that maps each word in the vocabulary to a **continuous, low-dimensional, dense** vector space (typically n = 50-300) and captures semantic and syntactic relationships between words

each word embedding $f(w)$ is a hidden feature vector, learned from data, that represents the meaning of the word $w$
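
One common way to realize $f$ is a lookup table: an embedding matrix with one row per vocabulary word. The sketch below assumes a toy vocabulary and made-up matrix values (the names `E`, `index`, and `one_hot_times_E` are illustrative, not from the notes). It also shows that the lookup gives the same result as multiplying the word's one-hot vector by the matrix.

```python
vocab = ["king", "queen", "apple"]            # toy vocabulary (assumed)
index = {w: i for i, w in enumerate(vocab)}   # word -> row number

# Embedding matrix E with |V| rows and n = 3 columns; values are made up.
E = [
    [0.5, 0.1, 0.3],
    [0.4, 0.2, 0.3],
    [-0.2, 0.9, 0.0],
]

def f(word):
    """Embedding function f: V -> R^n implemented as a row lookup."""
    return E[index[word]]

def one_hot_times_E(word):
    """Same vector, computed as one_hot(word)^T E."""
    x = [1.0 if w == word else 0.0 for w in vocab]
    n = len(E[0])
    return [sum(x[i] * E[i][j] for i in range(len(vocab))) for j in range(n)]

print(f("queen"))
print(one_hot_times_E("queen"))
```

In practice the rows of `E` are parameters learned during training rather than values written by hand.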

## types

<table>
  <tr>
    <th colspan='2'>Type</th>
    <th>Model</th>
    <th>Short Description</th>
    <th>Pros</th>
    <th>Cons</th>
  </tr>
  <tr>
    <td colspan='2'>Count-based</td>
    <td>LSA</td>
    <td>Co-occurrence Matrix factorization by SVD</td>
    <td>Simple, captures semantic relationships</td>
    <td>High memory usage, can't capture polysemy</td>
  </tr>
  <tr>
    <td rowspan="5">Prediction-based</td>
    <td rowspan="3">Distributed</td>
    <td >GloVe</td>
    <td >combines count-based and prediction-based </td>
    <td >captures semantic relationships, low memory usage</td>
    <td >Less efficient with rare words</td>
  </tr>
  <tr>
    <td>Word2Vec</td>
    <td>Predicts words in a context, Skip-gram or CBOW </td>
    <td>captures semantic and syntactic relationships, low memory usage</td>
    <td>can't capture polysemy</td>
  </tr>
  <tr>
    <td>FastText</td>
    <td>Word2Vec Extension, subword embedding</td>
    <td>can learn embeddings of rare words and OOV words</td>
    <td>Higher memory usage compared to Word2Vec</td>
  </tr>
  <tr>
    <td rowspan="2">Contextualized</td>
    <td>ELMo</td>
    <td>bidirectional LSTM</td>
    <td>Context-aware, captures polysemy</td>
    <td>Higher memory usage and computational complexity</td>
  </tr>
  <tr>
    <td>BERT</td>
    <td>Encoder-only Transformers</td>
    <td>Context-aware, powerful, supports transfer learning</td>
    <td>High memory usage and computational complexity</td>
  </tr>
</table>

## other types: Syntactic, Multilingual, Multisense

<table>
  <tr>
    <td rowspan="2">Syntactic</td>
    <td>Dependency-based</td>
    <td>dependency parsing</td>
    <td>Captures syntactic relationships</td>
    <td>Requires accurate dependency parsing, doesn't capture polysemy</td>
  </tr>
  <tr>
    <td>POS-based</td>
    <td>part-of-speech tags</td>
    <td>Simple, captures syntactic relationships</td>
    <td>Requires accurate POS tagging, doesn't capture polysemy</td>
  </tr>
  <tr>
    <td rowspan="3">Multilingual</td>
    <td>MUSE</td>
    <td>Unsupervised alignment of monolingual embeddings in a shared space</td>
    <td>Supports multiple languages, allows for cross-lingual tasks</td>
    <td>Quality depends on source embeddings</td>
  </tr>
  <tr>
    <td>Aligning FastText</td>
    <td>Aligns FastText embeddings in a shared multilingual space</td>
    <td>Supports multiple languages, allows for cross-lingual tasks</td>
    <td>Quality depends on source embeddings</td>
  </tr>
  <tr>
    <td>XLM</td>
    <td>Cross-lingual Language Model, based on BERT </td>
    <td>Powerful, supports multiple languages, allows for transfer learning</td>
    <td>High memory usage and computational complexity</td>
  </tr>
  <tr>
    <td rowspan="2">Sense</td>
    <td>Sense2Vec</td>
    <td>Adaptation of Word2Vec, generating embeddings for word senses</td>
    <td>Disambiguates word senses, low memory usage</td>
    <td>Requires accurate sense annotation</td>
  </tr>
  <tr>
    <td>DeConf</td>
    <td>De-conflates pre-trained word embeddings (e.g., Word2Vec) into sense embeddings using a lexical resource such as WordNet</td>
    <td>Disambiguates word senses, low memory usage</td>
    <td>Requires a sense inventory that covers the target vocabulary</td>
  </tr>
</table>


## Distributional semantics hypothesis

distributed word embeddings are based on the distributional hypothesis

**distributional hypothesis** (Harris, 1954), popularized by Firth (1957): "You shall know a word by the company it keeps"

interpretation: 
    
- linguistic items with similar distributions have similar meanings.

- **words that occur in similar contexts have similar meanings**

    here, the contexts are themselves words: those that surround the target word

- synonyms (words of similar meaning) should have similar embeddings

- Different senses of the same word should have different embeddings

- Relations of words should be preserved: cat-kitten should be similar to dog-puppy

various context features

- The word **before/after** the target word

- Any word **within n words** of the target word

- Any word **within a specific syntactic relationship** with the target word (e.g., the head of the dependency or the subject of the sentence)

- Any word within the **same sentence/document**
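
The "within n words" context feature can be sketched as a sliding window that counts co-occurrences. This is a minimal illustration on a toy sentence; the function name `window_contexts` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter, defaultdict

def window_contexts(tokens, n=2):
    """Map each target word to a Counter of the words within n positions of it."""
    contexts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo = max(0, i - n)
        hi = min(len(tokens), i + n + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target position itself
                contexts[target][tokens[j]] += 1
    return contexts

tokens = "the cat sat on the mat".split()
ctx = window_contexts(tokens, n=1)  # n=1: only the word before/after

print(dict(ctx["cat"]))
print(dict(ctx["the"]))
```

With `n=1` this reduces to the before/after feature; larger windows tend to capture more topical than syntactic similarity.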

## evaluation

how well a word embedding algorithm captures semantic and syntactic relationships can be evaluated with a combination of intrinsic and extrinsic methods.

- Intrinsic Evaluation: nearest neighbors, information retrieval, word similarity, word analogy

- Extrinsic Evaluation: performance on downstream NLP tasks that take the word embeddings as input features.

### Word Similarity

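Compute the cosine similarity between the embeddings of each word pair and compare it with human-annotated similarity scores from benchmark datasets such as WordSim-353, SimLex-999, or MEN. Agreement is usually reported as a rank correlation (e.g., Spearman's $\rho$).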
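
A word-similarity evaluation can be sketched as follows. The embeddings and the human scores below are invented for illustration (they are not taken from WordSim-353 or any other benchmark), and Spearman's rank correlation is used as the agreement measure, which is a common choice for this task.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ranks(values):
    """Rank values from 1..n (ascending), averaging the ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy embeddings and hypothetical human ratings on a 0-10 scale.
emb = {"cat": [0.9, 0.8], "kitten": [0.8, 0.9], "dog": [0.9, 0.4], "car": [-0.7, 0.1]}
human = [(("cat", "kitten"), 9.0), (("cat", "dog"), 7.0), (("cat", "car"), 1.5)]

model_scores = [cosine(emb[a], emb[b]) for (a, b), _ in human]
human_scores = [score for _, score in human]
rho = spearman(model_scores, human_scores)
print(round(rho, 3))
```

Because only the ordering of the pairs matters for a rank correlation, the model's cosine values do not need to be on the same scale as the human ratings.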

### Word Analogy

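solve the analogy task "A is to A' as B is to B'" (e.g., "man" is to "woman" as "king" is to "queen"): given A, A', and B, predict B' from the vocabulary. Analogies can be semantic (as in this example) or syntactic (e.g., "walk" is to "walked" as "go" is to "went").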

- A popular benchmark is Google Word Analogy dataset.


- equation:

    $$
    \phi(king)-\phi(man)\approx \phi(?)-\phi(woman)
    $$

- objective: minimize the squared $\ell_2$ distance

    $$
    \begin{aligned}
    \hat w &= \arg \min_w \left \| \phi(king)-\phi(man)+ \phi(woman)-\phi(w) \right \|^2\\[1em]
    \Rightarrow \hat w &= queen
    \end{aligned}
    $$

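- alternative objective (for unit-norm embeddings, with $a$ = man, $a'$ = woman, $b$ = king): maximize two similarities and penalize one

    $$
    \begin{aligned}
    \hat{b'} &= \arg \max_{b'} \cos(b', b-a+a') = \arg \max_{b'} \frac{(b-a+a')^\top b'}{\|b-a+a'\|\,\|b'\|}\\[1em]
    &= \arg \max_{b'} \left[\cos(b', b)-\cos(b', a)+\cos(b', a')\right]
    \end{aligned}
    $$

    the last step holds because $\|b-a+a'\|$ does not depend on $b'$ and, for unit-norm vectors, each dot product equals a cosine.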

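- metric: accuracy, the fraction of analogy questions for which the predicted B' equals the true B'

    the prediction can also be computed with Levy and Goldberg's multiplicative method (3CosMul), where $\epsilon$ is a small constant that prevents division by zero

    $$
    \hat{B'} = \arg \max_{B'} \frac{\cos(B', A')\cos(B',B)}{\cos(B',A)+\epsilon}
    $$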
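
Both prediction rules and the accuracy metric can be sketched on toy data. The three-dimensional vectors and the two analogy questions below are invented for illustration. The multiplicative variant shifts cosines to [0, 1] before multiplying, which is how that method is usually applied so that the scores stay non-negative. The question words themselves are excluded from the candidates, a common convention that is assumed here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy_l2(emb, a, a2, b):
    """argmin_w || phi(b) - phi(a) + phi(a2) - phi(w) ||^2 over the other words."""
    target = [x - y + z for x, y, z in zip(emb[b], emb[a], emb[a2])]
    candidates = [w for w in emb if w not in (a, a2, b)]
    return min(candidates, key=lambda w: sum((t - v) ** 2 for t, v in zip(target, emb[w])))

def analogy_3cosmul(emb, a, a2, b, eps=1e-3):
    """argmax_w cos(w, a2) * cos(w, b) / (cos(w, a) + eps), cosines shifted to [0, 1]."""
    def s(u, v):
        return (cosine(u, v) + 1) / 2
    candidates = [w for w in emb if w not in (a, a2, b)]
    return max(candidates,
               key=lambda w: s(emb[w], emb[a2]) * s(emb[w], emb[b]) / (s(emb[w], emb[a]) + eps))

def accuracy(emb, questions, solver):
    """Fraction of (A, A', B, B') questions where the solver returns the true B'."""
    hits = sum(solver(emb, a, a2, b) == gold for a, a2, b, gold in questions)
    return hits / len(questions)

# Toy embeddings: dimensions loosely stand for (royalty, female, bias).
emb = {
    "man": [0.0, 0.0, 1.0],
    "woman": [0.0, 1.0, 1.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [-1.0, -1.0, 1.0],
}
questions = [("man", "woman", "king", "queen"), ("king", "queen", "man", "woman")]

acc_l2 = accuracy(emb, questions, analogy_l2)
acc_mul = accuracy(emb, questions, analogy_3cosmul)
print(analogy_l2(emb, "man", "woman", "king"), acc_l2, acc_mul)
```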

## Word Distribution

words are not distributed evenly across texts.

- Stop Words

    250-300 most common words in English account for >50% of a given text.

    e.g., "the" and "of" represent 10% of tokens, while "and", "to", "a", and "in" make up another 10%. 

- token/type ratio

    In the first chapter of Moby Dick, there are 859 unique words (types) and 2256 word occurrences (tokens). 
    
    The top 65 types cover 1132 tokens, which is >50% of the total tokens.


- Pareto Principle (80/20 Rule)

    uneven distributions in various domains: letters of the alphabet (ETAOIN SHRDLU), city sizes, wealth, etc.

    e.g., 80% of the wealth goes to 20% of the people, or 20% of the effort builds the easier 80% of a system.


- Power-law Distribution

    $$
    \begin{aligned}
    &p(x) = c x^{-\alpha}\\[1em]
    &\text{log-log scale: } \ln p(x) = \ln c - \alpha \ln x
    \end{aligned}
    $$

    The probability of observing an item of size $x$ is proportional to $x^{-\alpha}$.

    An item can be almost anything, e.g., word frequency, citations, web hits, books sold, telephone calls received, earthquake magnitude.
    
    $c$ is the normalization constant ensuring that the probabilities over all items sum to 1.
    
    $\alpha$ is the scaling exponent (power-law exponent).

    Power-law distributions are highly skewed (asymmetric) on a linear scale and form a straight line with slope $-\alpha$ on a log-log scale: many items have a small frequency of occurrence and a few items have a very large frequency.
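
The log-log linearity suggests a quick way to estimate $\alpha$: fit a straight line through the points $(\ln x, \ln p(x))$ and read off the slope. The sketch below does this with ordinary least squares on synthetic, noise-free data (the values $c = 0.3$ and $\alpha = 1.5$ are arbitrary and the data are not normalized). On real data this kind of fit is only a rough diagnostic, not a rigorous estimator.

```python
import math

def fit_power_law(xs, ps):
    """Least-squares line through (ln x, ln p); returns (alpha, c)."""
    lx = [math.log(x) for x in xs]
    lp = [math.log(p) for p in ps]
    mx = sum(lx) / len(lx)
    mp = sum(lp) / len(lp)
    slope = (sum((a - mx) * (b - mp) for a, b in zip(lx, lp))
             / sum((a - mx) ** 2 for a in lx))
    intercept = mp - slope * mx      # intercept equals ln c
    return -slope, math.exp(intercept)

# Synthetic data generated from p(x) = 0.3 * x^(-1.5), chosen for illustration.
xs = list(range(1, 51))
ps = [0.3 * x ** -1.5 for x in xs]

alpha, c = fit_power_law(xs, ps)
print(round(alpha, 6), round(c, 6))
```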



**Zipf's Law in Natural Language**

$$\text{rank} \times \text{frequency} \approx 0.1 \times \text{len(text)}$$

- the frequency of a word is inversely proportional to its rank in the frequency table.

- the nth most frequent word has a frequency approximately proportional to 1/n.

- the 2nd most frequent word has about 1/2 the frequency of the most frequent word

- the 3rd most frequent word has about 1/3 the frequency of the most frequent word

- Although Zipf's Law is not accurate at the tails of the distribution, it is accurate enough for practical purposes in NLP.
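
The type and token counts, the coverage of the top-k types, and the rank × frequency product can all be computed from a frequency table. The sketch below uses a synthetic token list whose counts follow 1/rank exactly (12, 6, 4, 3), so the product is constant. With such a tiny vocabulary the constant is not 0.1 × len(text), and real text only follows the pattern approximately.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return [(rank, word, freq)] sorted by decreasing frequency."""
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ordered, start=1)]

def coverage(table, k, n_tokens):
    """Fraction of all tokens accounted for by the k most frequent types."""
    return sum(freq for _, _, freq in table[:k]) / n_tokens

# Synthetic "text": counts 12, 6, 4, 3 are proportional to 1/rank.
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3

table = rank_frequency(tokens)
n_types, n_tokens = len(table), len(tokens)
products = [rank * freq for rank, _, freq in table]   # rank x frequency per type
top1 = coverage(table, 1, n_tokens)

print(n_types, n_tokens, products, top1)
```

Running the same functions on a real tokenized text would give the type/token counts and top-k coverage figures of the kind quoted for Moby Dick above.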