# word embedding algorithm

## definition

word embedding algorithm: a function $f: V \rightarrow \mathbb{R}^n$ that maps each word in a vocabulary $V$ to a vector in an $n$-dimensional real vector space

- Any algorithm that performs dimensionality reduction can be used to construct a word embedding

- each word embedding $f(w)$ is the vector in $\mathbb{R}^n$ that one word $w \in V$ is mapped to

    i.e. a latent feature vector, learned from data, that represents the word's meaning

## importance of word embedding

- representing words as 1-hot vectors carries no semantic information, and the dimension (the vocabulary size) is too large.

- manually defining **features for words** is hard

- word embeddings represent words as **semantically and syntactically meaningful** vectors in a low-dimensional feature space

- Word embeddings are widely used in various NLP tasks as input features, to help models understand and process natural language.
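
The contrast between 1-hot vectors and dense embeddings can be illustrated with a minimal sketch. The three-word vocabulary and the 2-D `dense` vectors are hypothetical values picked by hand for illustration, not learned embeddings.

```python
import math

vocab = ["cat", "dog", "car"]

def one_hot(word):
    """Return the 1-hot vector of `word`: length |V| with a single 1."""
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-D embeddings, hand-picked for illustration only.
dense = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}

# Two different 1-hot vectors never share a non-zero coordinate,
# so their similarity is always 0, whatever the words mean.
one_hot_sim = cosine(one_hot("cat"), one_hot("dog"))

# Dense vectors can place related words closer than unrelated ones.
dense_cat_dog = cosine(dense["cat"], dense["dog"])
dense_cat_car = cosine(dense["cat"], dense["car"])
```

With 1-hot vectors every pair of distinct words is equally dissimilar, while the dense vectors above make `cat` closer to `dog` than to `car`.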

## types

word embedding algorithms can be divided into 2 types:

- count-based: matrix factorization

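    - SVD: $X = USV^T$, $X$ is the word-context matrix, embedding is $US$ (usually truncated to the top $d$ singular values).
    
        variants of $X$: PMI, PPMI, SPPMI. does not use a neural network

    - GloVe: $\log X \approx WC^T$ (plus bias terms), $X$ is the co-occurrence count matrix, embedding is $W$. no closed-form solution: $W$ and $C$ are learned by gradient-based optimization of a weighted least-squares loss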

- prediction-based: embedding is output of hidden layer

    - word2vec: skip-gram (with negative sampling: SGNS), CBOW. shallow network with a single linear hidden layer

    - FastText: word2vec extension. unit: character n-grams. can build embeddings for out-of-vocabulary words.

    - ELMo: bidirectional LSTM

    - BERT: encoder-only Transformer

| Aspect                     | Count-based                          | Prediction-based                      |
|----------------------------|--------------------------------------|---------------------------------------|
| Methodology                | Co-occurrence statistics             | Predictive models                     |
| Example algorithms         | Co-occurrence matrix, GloVe          | Word2Vec, FastText, ELMo, BERT        |
| Computational complexity   | Offline processing, matrix factorization | Online processing, iterative training |
| Scalability                | Can be less scalable                  | More scalable                         |
| Handling rare words        | Dependent on co-occurrence frequency | Subword information can help (FastText)|
| Context sensitivity        | Limited, fixed-size context window    | Better, especially with ELMo, BERT    |
| Training data efficiency   | Can be more efficient                 | May require larger datasets           |


## Distributional semantics hypothesis

- all the word embedding algorithms above rely on the distributional hypothesis to learn meaningful representations of words.

- distributional semantics: measures semantic similarity between linguistic items based on their distributional properties in large samples of language data.

- the **distributional hypothesis** is the basic idea of distributional semantics

    proposed by Harris (1954) and popularized by Firth (1957): "a word is characterized by the company it keeps"

    interpretation: 
    
    - linguistic items with similar distributions have similar meanings.

    - **words that occur in the same contexts have similar meanings**

        here, the contexts of a target word are the words that surround it
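
To make "contexts are the words that surround the target word" concrete, here is a minimal sketch that counts (word, context) pairs inside a symmetric window. The function name `cooccurrence_counts`, the `window` parameter and the toy sentence are assumptions made for illustration.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count (word, context) pairs; a context is any token at most
    `window` positions to the left or right of the target word."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                counts[(word, tokens[j])] += 1
    return counts

# Toy corpus (hypothetical).
tokens = "the cat sat on the mat the dog sat on the rug".split()
counts = cooccurrence_counts(tokens, window=1)

# The set of contexts observed for each target word.
cat_contexts = {c for (w, c) in counts if w == "cat"}
dog_contexts = {c for (w, c) in counts if w == "dog"}
```

In this toy corpus `cat` and `dog` occur in exactly the same contexts (`the`, `sat`), which is the kind of evidence the distributional hypothesis treats as similarity of meaning.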

## evaluation

how well a word embedding algorithm captures semantic and syntactic relationships in language can be evaluated with a combination of intrinsic and extrinsic methods.


Intrinsic Evaluation:

- Word Similarity: compare the cosine similarity between the embeddings of two words with human-annotated similarity scores from benchmark datasets like WordSim-353, SimLex-999, or MEN (typically via rank correlation).

    $$
    \text{cosine similarity}(A, B) = \frac{\phi(A) \cdot \phi(B)}{\|\phi(A)\| \, \|\phi(B)\|}
    $$

- Word Analogy: Testing embeddings' ability to solve analogy tasks, such as "A is to B as C is to D" (e.g., "man" is to "woman" as "king" is to "queen"). 

    A popular benchmark for this task is the Google Word Analogy dataset.
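
The word-similarity evaluation can be sketched as follows. Spearman rank correlation is used as the agreement measure, which is common practice for these benchmarks but is an assumption here rather than something stated above. The embeddings and the human scores are hypothetical toy values, and `ranks` assumes there are no ties.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman rho via the no-ties formula 1 - 6*sum(d^2) / (n(n^2-1))."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical embeddings and human similarity judgements.
emb = {"cat": [0.9, 0.1], "dog": [0.8, 0.2],
       "car": [0.1, 0.9], "truck": [0.3, 0.7]}
human = {("cat", "dog"): 9.0, ("car", "truck"): 8.5,
         ("cat", "car"): 3.0, ("dog", "truck"): 2.0}

pairs = list(human)
model_scores = [cosine(emb[a], emb[b]) for a, b in pairs]
human_scores = [human[p] for p in pairs]
rho = spearman(model_scores, human_scores)
```

A value of `rho` close to 1 means the embedding ranks word pairs in nearly the same order as the human annotators.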


Extrinsic Evaluation: performance on downstream NLP tasks that take word embeddings as input features.

- Text classification (e.g., sentiment analysis, topic categorization)

- Named Entity Recognition (NER)

- Part-of-Speech (POS) Tagging

- Semantic Role Labeling (SRL)

- Machine Translation

- Question Answering

### analogies

- analogies (semantic or syntactic): A is to A' as B is to B'

- task: given A, A' and B, predict B' from the vocabulary

e.g.

- `man` is to `king` as `woman` is to `?`

- `Microsoft` is to `Bill Gates` as `Google` is to `?`


- equation:

    $$
    \phi(king)-\phi(man)\approx \phi(?)-\phi(woman)
    $$

- objective: minimize the squared $\ell_2$ distance

    $$
    \begin{aligned}
    \hat w &= \arg \min_w \left \| \phi(king)-\phi(man)+ \phi(woman)-\phi(w) \right \|^2
    \\[1em]
    \Rightarrow \hat w &= queen
    \end{aligned}
    $$

- alternative objective (equivalent for unit-length embeddings): maximize two similarities and minimize one

    $$
    \arg \max_{b'} \cos(b', b-a+a') = \arg \max_{b'} \left( \cos(b', b)-\cos(b', a)+\cos(b', a') \right)
    $$

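- metric: accuracy, i.e. the fraction of analogy questions for which the predicted word equals the true B'

    the prediction can also be made with Levy and Goldberg's similarity multiplication method, where $\epsilon$ is a small constant that prevents division by zero
    
    $$
    \hat B' = \arg \max_{w} \frac{\cos(w, A')\cos(w,B)}{\cos(w,A)+\epsilon}
    $$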
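
The additive and multiplicative objectives can be compared on a toy vocabulary. This is a minimal sketch: the 3-D vectors are hypothetical and hand-built so that `king - man + woman` lands on `queen`, the function names `three_cos_add` and `three_cos_mul` are illustrative, and the three query words are excluded from the candidates, which is an assumption of this sketch. All toy coordinates are non-negative, so every cosine here is non-negative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings; axes loosely mean (royal, male, female).
emb = {
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def candidates(exclude):
    """All vocabulary words except the query words."""
    return [w for w in emb if w not in exclude]

def three_cos_add(a, a_prime, b):
    """argmax over w of cos(w, b - a + a')."""
    target = [vb - va + vap
              for va, vap, vb in zip(emb[a], emb[a_prime], emb[b])]
    return max(candidates({a, a_prime, b}),
               key=lambda w: cosine(emb[w], target))

def three_cos_mul(a, a_prime, b, eps=1e-3):
    """argmax over w of cos(w, a') * cos(w, b) / (cos(w, a) + eps)."""
    def score(w):
        v = emb[w]
        return (cosine(v, emb[a_prime]) * cosine(v, emb[b])
                / (cosine(v, emb[a]) + eps))
    return max(candidates({a, a_prime, b}), key=score)

add_answer = three_cos_add("man", "king", "woman")
mul_answer = three_cos_mul("man", "king", "woman")
```

Accuracy on a benchmark would then be the share of questions for which the returned word equals the gold answer.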

gender bias can occur in word embeddings, e.g. in the occupations most strongly associated with each gender

- extreme **female** occupations

    homemaker, nurse, receptionist, librarian, socialite, hairdresser, nanny, bookkeeper, stylist, housekeeper, interior designer, guidance counselor

- extreme **male** occupations

    maestro, skipper, protege, philosopher, captain, architect, financier, warrior, broadcaster, magician, fighter pilot, boss
    
    
    

# paper: [*Neural word embedding as implicit matrix factorization*](https://proceedings.neurips.cc/paper/2014/file/feab05aa91085b7a8012516bc3533958-Paper.pdf)

compares 5 word embedding algorithms (SGNS, NCE, SPMI = shifted PMI, SPPMI = shifted positive PMI, rank-d SVD) 

these algorithms can all be viewed as implicit matrix factorization

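1) sparse or dense (the matrix that is stored or factorized)

    sparse: PPMI, SPPMI, and the input matrix of SVD

    dense: SGNS (its implicit matrix $M = \text{PMI} - \log k$)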

2) weighted or unweighted

    weighted: SGNS, SPMI, SPPMI. gives more weight to frequent pairs

    unweighted: SVD

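- GloVe ($X$ is the co-occurrence count matrix, $b_w$ and $b_c$ are learned bias terms)

$$
\log X_{wc} \approx \vec w \cdot \vec c + b_w + b_c
$$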

- SGNS/SPMI

$$
M = \text{PMI}(w,c)-\log k
$$


- SPPMI: 

$$\text{SPPMI}_k(w,c) = \max(\text{PMI}(w,c)-\log k,\ 0)$$


- SVD: 

$$\text{SPPMI}_d = U_d \Sigma_d V_d^T$$

- NCE

$$
M = \log P(w|c)-\log k
$$
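
The PMI-based matrices above can be computed directly from aggregated counts. The sketch below is illustrative only: `pair_counts` holds hypothetical $\#(w,c)$ values, and PMI is computed as $\log \frac{\#(w,c)\,|D|}{\#(w)\,\#(c)}$, where $|D|$ is the total number of observed pairs. Unobserved pairs get PMI $-\infty$, which the positive variant clips to 0.

```python
import math
from collections import Counter

# Hypothetical (word, context) counts #(w, c).
pair_counts = Counter({
    ("cat", "purr"): 6, ("cat", "the"): 4,
    ("dog", "bark"): 6, ("dog", "the"): 4,
    ("fish", "swim"): 6, ("fish", "the"): 4,
})

total = sum(pair_counts.values())   # |D|
word_counts = Counter()             # #(w)
context_counts = Counter()          # #(c)
for (w, c), n in pair_counts.items():
    word_counts[w] += n
    context_counts[c] += n

def pmi(w, c):
    """PMI(w, c) = log(#(w,c) * |D| / (#(w) * #(c))); -inf if unobserved."""
    n_wc = pair_counts[(w, c)]
    if n_wc == 0:
        return float("-inf")
    return math.log(n_wc * total / (word_counts[w] * context_counts[c]))

def spmi(w, c, k):
    """Shifted PMI: PMI(w, c) - log k."""
    return pmi(w, c) - math.log(k)

def sppmi(w, c, k):
    """Shifted positive PMI: max(PMI(w, c) - log k, 0)."""
    return max(spmi(w, c, k), 0.0)

informative = sppmi("cat", "purr", k=2)   # specific context, stays positive
uninformative = sppmi("cat", "the", k=2)  # shifted PMI is negative, clipped
unobserved = sppmi("cat", "bark", k=2)    # PMI is -inf, clipped
```

Clipping at 0 is what keeps the SPPMI matrix sparse: every unobserved or weakly associated pair becomes an exact zero.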

advantages of SVD (dense low-dimensional vectors)
    
1) SVD is exact, and doesn't need learning rates or hyperparameter tuning

2) SVD is easy to train on count-aggregated data, i.e., $\left\{(w,c,\#(w,c)) \right\}$ triplets

    thus scalable to larger corpora

    while SGNS needs each observed $(w,c)$ pair to be presented separately
    
3) improved computational efficiency compared with sparse high-dimensional vectors
    
4) arguably better generalization
       
     
disadvantages of SVD

1) SVD suffers from unobserved values, which are very common in word-context matrices

    SGNS can distinguish between observed and unobserved word-context pairs

2) exact weighted SVD is a hard computational problem

    while SGNS's training objective weighs word-context pairs with different frequencies differently,

    assigning more accurate values to frequent pairs while allowing more error for infrequent pairs

3) SVD needs matrix $M$ to be sparse to be efficient

    while SGNS doesn't require that because it only cares about **observed and sampled word-context pairs**

    SGNS can optimize a **dense matrix**, e.g. $M=\text{PMI}-\log k$
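
Where $M=\text{PMI}-\log k$ comes from can be sketched by following the paper's argument. The sketch assumes the embedding dimension is large enough that each product $x = \vec w \cdot \vec c$ can be set independently of the others. The symbols $\sigma$ (the logistic function) and $|D|$ (the total number of observed pairs) are not defined in the notes above and are introduced here. The SGNS objective restricted to one pair $(w,c)$, with $k$ negative samples drawn in proportion to $\#(c)/|D|$, is maximized as follows:

$$
\begin{aligned}
\ell(w,c) &= \#(w,c)\log\sigma(x) + k\,\#(w)\frac{\#(c)}{|D|}\log\sigma(-x), \qquad x = \vec w\cdot\vec c \\
\frac{\partial \ell}{\partial x} &= \#(w,c)\,\sigma(-x) - k\,\#(w)\frac{\#(c)}{|D|}\,\sigma(x) = 0 \\
\Rightarrow e^{x} &= \frac{\#(w,c)\,|D|}{k\,\#(w)\,\#(c)} \\
\Rightarrow x &= \log\frac{\#(w,c)\,|D|}{\#(w)\,\#(c)} - \log k = \text{PMI}(w,c) - \log k
\end{aligned}
$$

The third line uses $\sigma(x)/\sigma(-x) = e^{x}$. The optimum of every entry is therefore the corresponding entry of the shifted PMI matrix, which is dense because even unobserved pairs have a (very negative) target value.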