
### What Does *Semantic* Mean?

**Semantics** refers to the **meaning** or **interpretation of words, phrases, and sentences** in a language.

In the context of **NLP (Natural Language Processing)**:

- **Semantic similarity** measures **how close the meanings of two words or texts are**, not just whether the words are the same.
- Two different words (e.g., *good* and *great*) can have **high semantic similarity** if they mean similar things.
- Conversely, words like *good* and *terrible* have **low semantic similarity** even if they appear in similar contexts.

#### Example:
- Lexical (word-based) view:  
  *“good” ≠ “great”* (not the same word)
- Semantic view:  
  *“good” ≈ “great”* (very similar meaning)

#### Semantic vectors capture:
- Word meaning
- Contextual usage
- Relationships like synonymy (similar) and antonymy (opposite)


## Notes on Document and Word Vector Representations

### **Corpus and Word Representation**

- A **corpus** is a collection of documents:
  
  $$
  \begin{aligned}
  d_1 &\rightarrow w_3\ w_6\ w_5\ w_3\ \dots \\
  d_2 &\rightarrow w_1\ w_{12}\ w_6\ w_3\ \dots \\
  &\ \vdots \\
  d_n &\rightarrow \text{(other sequences of words)}
  \end{aligned}
  $$

- Each document $ d_i $ is a sequence of **words** $ w_j $.
- Each word $ w_j $ is mapped to a **vector representation** in $ \mathbb{R}^d $, i.e., a **d-dimensional vector**.

#### Interpretation:
- This mapping can be learned through models like **Word2Vec, GloVe, or FastText**.
- The goal is to encode semantic meaning in vectors.

### **Word Embeddings and Semantic Representations**

Let’s look at examples of how words are embedded as vectors and how we measure similarity:

#### Example 1:
- **good** → $ v_1 $
- **great** → $ v_2 $

Both are semantically similar. So,
$$
\text{CosineSimilarity}(v_1, v_2) \approx 1
$$
This means they point in nearly the same direction in vector space.

#### Example 2:
- **terrible** → $ v_3 $
- Now compare:
$$
\text{CosineSimilarity}(v_1, v_3) \ll 1 \quad \text{(very low)}
$$

#### Insight:
- High cosine similarity → semantically similar
- Low cosine similarity → semantically different or opposite


### **What Does Cosine Similarity Measure?**

- Measures the **angle** between two vectors, not their magnitude.
- Formula:
  $$
  \text{CosineSimilarity}(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \|v_2\|}
  $$
- Value ranges from:
  - **+1**: very similar
  - **0**: orthogonal, unrelated
  - **-1**: opposite meanings
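
The formula and the three reference values above can be checked with a minimal sketch. The 2-d vectors here are hypothetical and chosen by hand purely for illustration; they are not learned embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical vectors (assumed for this example only).
v_a = [1.0, 2.0]
v_same_direction = [2.0, 4.0]  # same direction as v_a, twice the length
v_orthogonal = [-2.0, 1.0]     # perpendicular to v_a
v_opposite = [-1.0, -2.0]      # points the opposite way

sim_same = cosine_similarity(v_a, v_same_direction)  # close to +1
sim_orth = cosine_similarity(v_a, v_orthogonal)      # close to 0
sim_opp = cosine_similarity(v_a, v_opposite)         # close to -1
```

Note that `v_same_direction` is twice as long as `v_a` yet still scores close to +1, which is the sense in which cosine similarity ignores magnitude.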


### Summary

| Concept               | Description |
|-|-|
| Document $ d_i $     | A sequence of word tokens |
| Word $ w_j $         | Each token in the document |
| Vector $ v \in \mathbb{R}^d $ | d-dimensional vector capturing meaning |
| Cosine Similarity     | Measures semantic closeness between two words |
| High Similarity       | Indicates semantic similarity (e.g., *good*, *great*) |
| Low Similarity        | Indicates semantic difference (e.g., *good*, *terrible*) |





#### **1. Corpus and Vocabulary**

- Given a **corpus** of $ n $ documents:  
  $$
  \mathcal{D} = \{d_1, d_2, \dots, d_n\}
  $$
- Vocabulary extracted:  
  $$
  \mathcal{V} = \{w_1, w_2, \dots, w_N\}
  $$
  where $ N = |\mathcal{V}| $ is the total number of unique words.

#### **2. Document-Term Matrix $ A \in \mathbb{R}^{n \times N} $**

This is a basic way to represent text data numerically.

$$
A =
\begin{pmatrix}
& w_1 & w_2 & \dots & w_N \\
d_1 & 1 & 2 & \dots & \cdots \\
d_2 & 0 & 2 & \dots & \cdots \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
d_n & \cdots & \cdots & \cdots & \cdots
\end{pmatrix}
$$

- **Rows** $ \rightarrow $ Documents $ d_i $
- **Columns** $ \rightarrow $ Words $ w_j $
- **Entries** $ A_{ij} $ = occurrence count of $ w_j $ in $ d_i $
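
As a rough illustration of the definition above, the sketch below builds $ A $ for a two-document toy corpus. The corpus, the whitespace tokenizer, and the function name `document_term_matrix` are assumptions made for this example.

```python
from collections import Counter

# Toy corpus (assumed for illustration).
docs = [
    "the cat sat on the mat",
    "the dog sat",
]

# Naive whitespace tokenization; real pipelines usually do more cleaning.
tokenized = [doc.split() for doc in docs]

# Vocabulary: the sorted set of unique words across all documents.
vocab = sorted({word for doc in tokenized for word in doc})

def document_term_matrix(tokenized_docs, vocabulary):
    """Return A as a list of rows, where A[i][j] counts vocabulary[j] in document i."""
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        rows.append([counts[word] for word in vocabulary])
    return rows

A = document_term_matrix(tokenized, vocab)
```

Each row has one entry per vocabulary word, so with a realistic vocabulary most entries would be zero.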

#### **3. Challenges with BoW Representation**

- $ N $ is large $ \rightarrow $ **High dimensionality**
- Each document uses only a small subset of words $ \rightarrow $ **Very sparse**
- **No order information** is retained $ \rightarrow $ **Loss of sequence/context**

#### **4. Word Co-Occurrence Matrix $ \widetilde{W}_{N \times N} $**

To model **word relationships**, a **word-word co-occurrence matrix** is used.

$$
\widetilde{W} =
\begin{pmatrix}
\cdots & \cdots & w_j & \cdots \\
\cdots & \cdots & \cdots & \cdots \\
w_i & \cdots & 1 & \cdots \\
\cdots & \cdots & \cdots & \cdots \\
\end{pmatrix}
\quad \text{(Window-based counts)}
$$

- Rows and columns: Words in vocabulary
- Entry $ \widetilde{W}_{ij} $: Number of times word $ w_j $ appeared in the **context window** of word $ w_i $

This matrix partially encodes **sequential context**, using a window (e.g., size = 5) around a central word.
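
A minimal sketch of window-based counting follows. It assumes that `window` means the number of positions taken on each side of the central word; the notes do not say whether the window size is one-sided or total, so treat that as an assumption. The function name and toy sentence are also illustrative.

```python
def cooccurrence_matrix(tokens, vocab, window=2):
    """Count how often each word appears within `window` positions of each center word."""
    index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)
    W = [[0] * size for _ in range(size)]
    for pos, center in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:  # a word is not its own context
                W[index[center]][index[tokens[ctx_pos]]] += 1
    return W

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))  # ['cat', 'mat', 'on', 'sat', 'the']
W = cooccurrence_matrix(tokens, vocab, window=1)
```

With a symmetric window like this one, the resulting matrix is symmetric.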

#### **5. Matrix Factorization of Document-Term Matrix**

We approximate the document-term matrix $ A \in \mathbb{R}^{n \times N} $ using **low-rank matrix factorization**:

$$
A \approx B C^\top
$$

where:
- $ B \in \mathbb{R}^{n \times d} $: document representations (dense)
- $ C \in \mathbb{R}^{N \times d} $: word representations (dense)
- $ d \ll N $: dimension of latent semantic space

In expanded form:
$$
A_{n \times N} \approx B_{n \times d} \cdot C_{N \times d}^\top
$$

##### 5.1 **Interpretation of Factors**

###### Document Matrix $ B $:
$$
B =
\begin{pmatrix}
\vdots \\
\rule{1cm}{0.15mm} \leftarrow i \\
\vdots \\
\end{pmatrix}_{n \times d}
\quad \text{= vector representation of document } d_i
$$

###### Word Matrix $ C $:
$$
C =
\begin{pmatrix}
\vdots & \uparrow \\
\rule{1cm}{0.15mm} & j \\
\vdots & \downarrow \\
\end{pmatrix}_{N \times d}
\quad \text{= vector representation of word } w_j
$$

#### **6. Objective Function for Matrix Factorization**

The aim is to minimize the reconstruction error over the **non-zero entries** of $ A $:

$$
\mathcal{L} = \sum_{\substack{i,j \\ A_{ij} \neq 0}} (A_{ij} - B_i^\top C_j)^2 + \text{regularization terms}
$$

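- This objective penalizes the squared difference between the observed count $ A_{ij} $ and its prediction $ B_i^\top C_j $, where $ B_i $ and $ C_j $ are the $ i $-th row of $ B $ and the $ j $-th row of $ C $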
- Can be extended to include L2 regularization on $ B $ and $ C $ for generalization
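
One way this objective could be minimized is plain stochastic gradient descent over the non-zero entries. The sketch below is only an illustration under assumed hyperparameters (`d`, `lr`, `reg`, `epochs`) and a made-up 3×3 matrix; it is not the specific solver used in these notes.

```python
import random

def factorize(A, d=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Fit A ~ B C^T by SGD on the non-zero entries, with L2 regularization."""
    rng = random.Random(seed)
    n, N = len(A), len(A[0])
    B = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(n)]
    C = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(N)]
    observed = [(i, j) for i in range(n) for j in range(N) if A[i][j] != 0]
    for _ in range(epochs):
        rng.shuffle(observed)
        for i, j in observed:
            pred = sum(B[i][k] * C[j][k] for k in range(d))
            err = A[i][j] - pred
            for k in range(d):
                b, c = B[i][k], C[j][k]
                # Gradient step; the constant factor 2 is absorbed into lr.
                B[i][k] += lr * (err * c - reg * b)
                C[j][k] += lr * (err * b - reg * c)
    return B, C

def masked_squared_error(A, B, C):
    """Sum of squared errors over the non-zero entries of A (no regularization term)."""
    total = 0.0
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            if a != 0:
                pred = sum(bk * ck for bk, ck in zip(B[i], C[j]))
                total += (a - pred) ** 2
    return total

# Made-up document-term counts (3 documents, 3 words).
A = [[1, 2, 0], [0, 2, 1], [1, 0, 1]]
B0, C0 = factorize(A, epochs=0)  # untrained factors (same seed, same initialization)
B, C = factorize(A)
initial_error = masked_squared_error(A, B0, C0)
final_error = masked_squared_error(A, B, C)
```

Comparing `initial_error` with `final_error` gives a quick sanity check that training reduces the reconstruction error on the observed entries.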

#### **7. Key Insight**

Matrix factorization achieves two key goals:
- Compresses large sparse matrix $ A $ into lower-dimensional dense matrices $ B, C $
- Captures **semantic similarity**: similar words and documents end up with nearby vectors in $ \mathbb{R}^d $

### From Corpus to Word Vectors — Skip-Gram with Negative Sampling

#### **1. Problem Setup**

We are given a **corpus**:
$$
\mathcal{D} = [w_1, w_2, \dots, w_T]
$$
where each $ w_t \in \mathcal{V} $, and $ \mathcal{V} $ is a vocabulary of size $ N $.

Our goal:  
Map each word $ w_i \in \mathcal{V} $ to a **dense, low-dimensional** vector:
$$
w_i \mapsto v_i \in \mathbb{R}^d \quad \text{(where } d \ll N\text{)}
$$
so that similar words have similar vectors.

#### **2. Context and Co-Occurrence**

We define **context** using a sliding window of size $ k $.  
For a center word $ w_i $, context is:
$$
\text{Context}(w_i) = \{ w_{i-k}, \dots, w_{i-1}, w_{i+1}, \dots, w_{i+k} \}
$$

From this, we generate **co-occurrence pairs**:
$$
(w_i, w_j) \quad \text{if } w_j \in \text{Context}(w_i)
$$
This gives us a binary co-occurrence matrix $ C \in \{0, 1\}^{N \times N} $ with entries:
$$
C_{ij} = \begin{cases}
1 & \text{if } (w_i, w_j) \text{ co-occur in context} \\
0 & \text{otherwise}
\end{cases}
$$
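
The pair generation can be sketched as follows, with $ k $ read as the number of positions on each side of the center word. The function name `skipgram_pairs` and the toy sentence are assumptions for illustration.

```python
def skipgram_pairs(tokens, k=2):
    """Return (center, context) pairs for every position, using k words on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["we", "like", "word", "vectors"]
pairs = skipgram_pairs(tokens, k=1)

# The binary co-occurrence relation is simply the set of distinct pairs.
cooccurring = set(pairs)
```

These pairs are the positive training examples for the model described in the following sections.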

#### **3. Neural Architecture (SGNS)**

Each word is first represented as a **One-Hot Encoding (OHE)**:

- For word $ w_j \in \mathcal{V} $,  
$$
\text{OHE}(w_j) = \begin{pmatrix}
0 \\
\vdots \\
1 \leftarrow \text{j-th position} \\
\vdots \\
0
\end{pmatrix} \in \mathbb{R}^N
$$

The architecture:

1. **Input**: OHE vector of target word $ w_i $
   $$
   \text{OHE}(w_i) \rightarrow W_{\text{in}} \in \mathbb{R}^{N \times d}
   $$
   Produces:
   $$
   v_i = W_{\text{in}}^\top \text{OHE}(w_i) \in \mathbb{R}^d
   $$

2. **Output**: OHE vector of context word $ w_j $
   $$
   \text{OHE}(w_j) \rightarrow W_{\text{out}} \in \mathbb{R}^{N \times d}
   $$
   Produces:
   $$
   v_j = W_{\text{out}}^\top \text{OHE}(w_j) \in \mathbb{R}^d
   $$

Thus we get:
$$
v_i = \text{embedding of center word}, \quad v_j = \text{embedding of context word}
$$
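
Multiplying $ W_{\text{in}}^\top $ by a one-hot vector just selects one row of $ W_{\text{in}} $, so in practice the product is usually replaced by a table lookup. A small sketch with a made-up 3×2 matrix (the values are assumptions, not trained weights):

```python
def one_hot(j, N):
    """One-hot vector of length N with a 1 at (0-based) position j."""
    return [1 if i == j else 0 for i in range(N)]

def transpose_matvec(W, x):
    """Compute W^T x for a matrix W with N rows and d columns."""
    d = len(W[0])
    return [sum(W[row][col] * x[row] for row in range(len(W))) for col in range(d)]

# Made-up embedding matrix with N = 3 words and d = 2 dimensions.
W_in = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

v_1 = transpose_matvec(W_in, one_hot(1, 3))  # equals row 1 of W_in
```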

#### **4. Prediction**

To predict whether $ w_j $ is a true context word for $ w_i $, compute:
$$
s_{ij} = v_i^\top v_j \quad \text{(dot product)}
$$
and apply:
$$
\hat{y}_{ij} = \sigma(v_i^\top v_j) = \frac{1}{1 + e^{-v_i^\top v_j}} \in (0, 1)
$$

Interpretation:
- If $ \hat{y}_{ij} \approx 1 $, model believes $ w_j $ is a likely context for $ w_i $
- If $ \hat{y}_{ij} \approx 0 $, model believes it is not
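
The prediction step is a dot product followed by a sigmoid. In this sketch the function names and the hand-picked vectors are illustrative assumptions.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def predict_context_probability(v_center, v_context):
    """Estimated probability that the context word truly co-occurs with the center word."""
    score = sum(a * b for a, b in zip(v_center, v_context))
    return sigmoid(score)

# Hand-picked vectors (assumed for illustration).
p_aligned = predict_context_probability([2.0, 1.0], [2.0, 1.0])    # score = 5
p_neutral = predict_context_probability([0.0, 0.0], [1.0, 1.0])    # score = 0
p_opposed = predict_context_probability([2.0, 1.0], [-2.0, -1.0])  # score = -5
```

A large positive score pushes the probability toward 1, a score of 0 gives exactly 0.5, and a large negative score pushes it toward 0.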

#### **5. Training Objective: Negative Sampling**

Training uses:
- **Positive samples**: actual context pairs $(w_i, w_j)$
- **Negative samples**: randomly sampled $ w_k \notin \text{Context}(w_i) $

For each center-context pair $ (w_i, w_j) $, optimize:
$$
\mathcal{L}_{ij} = -\log \sigma(v_i^\top v_j) - \sum_{k=1}^K \mathbb{E}_{w_k \sim P_n(w)} \left[ \log \sigma(-v_i^\top v_k) \right]
$$

Where:
- $ \sigma(\cdot) $ is the sigmoid
- $ K $ is number of negative samples
- $ P_n(w) $ is the noise distribution (typically proportional to $ f(w)^{3/4} $)
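
In practice the expectation is usually approximated by drawing $ K $ negatives from $ P_n(w) $ and summing their terms. The sketch below shows the noise distribution and the per-pair loss under that approximation; the word frequencies, the vectors, and the function names are all assumptions for illustration.

```python
import math
import random

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def noise_distribution(freqs, power=0.75):
    """P_n(w) proportional to f(w)^power, normalized to sum to 1."""
    weights = {w: f ** power for w, f in freqs.items()}
    total = sum(weights.values())
    return {w: weight / total for w, weight in weights.items()}

def sgns_loss(v_center, v_context, negative_vectors):
    """Loss for one positive pair plus K sampled negatives (Monte Carlo estimate)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    loss = -log_sigmoid(dot(v_center, v_context))
    for v_neg in negative_vectors:
        loss -= log_sigmoid(-dot(v_center, v_neg))
    return loss

# Made-up corpus frequencies.
freqs = {"the": 16, "cat": 1}
P_n = noise_distribution(freqs)  # "the" gets 8/9 instead of its raw share 16/17

# Draw K = 3 negative words from the noise distribution (seeded for repeatability).
rng = random.Random(0)
words = list(P_n)
negatives = rng.choices(words, weights=[P_n[w] for w in words], k=3)

loss_good = sgns_loss([1.0, 1.0], [1.0, 1.0], [[-1.0, -1.0]])
loss_bad = sgns_loss([1.0, 1.0], [-1.0, -1.0], [[1.0, 1.0]])
```

Raising frequencies to the power 3/4 flattens the distribution, so rare words are sampled as negatives somewhat more often than their raw frequency would suggest.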

#### **6. Learning**

The model is trained using SGD (or Adam). The learned matrices:
- $ W_{\text{in}} \rightarrow $ word vectors (input embeddings)
- $ W_{\text{out}} \rightarrow $ context vectors (output embeddings)

You can use either matrix as the final embeddings, or average the two:
$$
v_w = \frac{W_{\text{in}}(w) + W_{\text{out}}(w)}{2}
$$

#### **7. Outcome**

After training:
- Each word $ w $ gets a dense vector $ v_w \in \mathbb{R}^d $
- Words that **co-occur in similar contexts** end up **closer** in embedding space
- Enables:
  - Analogy tasks (e.g., *king - man + woman ≈ queen*)
  - Semantic clustering
  - Downstream NLP applications

- **2013**:
  - Word2Vec introduced by Mikolov et al.
  - **CBOW**: predict a word from its context
  - **Skip-Gram**: predict the context from a word
  - **SGNS**: Skip-Gram + Negative Sampling (computationally efficient)

#### With over 50,000 publications and research papers on the coronavirus family to date, it has become difficult for medical practitioners to search across them and extract useful insights.

As a Data Scientist at Google, you are tasked with solving this problem with the help of Machine Learning.

Background reading:

* *Efficient Estimation of Word Representations in Vector Space* (the Word2Vec paper)
* *GloVe: Global Vectors for Word Representation*

How can we solve this problem?

* Can we match keywords from user queries against the words present in each abstract?
* If we rely on keyword matching alone, will we be able to understand the user's intent? For example, 'origin' and 'discovery' are different keywords but can express a similar intent.
* Should we consider the context of the words, then?

Let us build a search engine using Word Embeddings
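
Before turning to the real dataset, the core idea can be sketched on toy data: represent the query and each abstract as the average of their word vectors, then rank abstracts by cosine similarity. Everything below is an assumption for illustration, including the hand-made 2-d embeddings, the two "abstracts", and the function names. A real system would load trained vectors and the actual abstracts.

```python
import math

# Hypothetical 2-d embeddings, hand-made so that related words point the same way.
embeddings = {
    "origin": [0.9, 0.1],
    "discovery": [0.8, 0.2],
    "virus": [0.5, 0.5],
    "vaccine": [0.1, 0.9],
    "trial": [0.2, 0.8],
}

def embed_text(text, embeddings):
    """Average the vectors of known words; return None if no word is known."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def search(query, abstracts, embeddings):
    """Return abstract indices ordered from most to least similar to the query."""
    q = embed_text(query, embeddings)
    if q is None:
        return []
    scored = []
    for idx, text in enumerate(abstracts):
        vec = embed_text(text, embeddings)
        if vec is not None:
            scored.append((cosine(q, vec), idx))
    return [idx for _, idx in sorted(scored, reverse=True)]

abstracts = ["vaccine trial", "virus discovery"]
ranking = search("origin", abstracts, embeddings)
```

The query shares no keyword with either abstract, yet the second abstract ranks first because *origin* and *discovery* have nearby vectors in this toy space.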

Dataset 

* COVID-19 Open Research Dataset, consisting of publications and research papers related to COVID-19.
* The dataset contains Title, Abstract, and DOI, among other identifiers.
* To download: [https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge)

### All remaining work is in Colab