# TP1 Discussion and Analysis of Word Embeddings
 
**Lab:** TP1 Word Embeddings Training  
**Students:** Ilyes sais, Wenchong PAN, Muhammed Qaisar  
**Year:** 2025–2026  

This notebook provides a detailed discussion and qualitative analysis of the word embeddings trained in TP1.  
We compare different embedding approaches (Word2Vec CBOW, Word2Vec Skip-gram, FastText CBOW) trained on two corpora:

- **Medical domain corpus (QUAERO_FrenchMed)**
- **General press corpus (QUAERO_FrenchPress)**

The analysis is based on semantic similarity results obtained in Notebook 06.


## 1. Experimental Setup Recap

In TP1, we trained six word embedding models using the following configurations:

### Embedding Models
- **Word2Vec CBOW**
- **Word2Vec Skip-gram**
- **FastText CBOW**

### Training Corpora
- **Medical corpus:** QUAERO_FrenchMed (≈ 3,021 sentences)
- **Press corpus:** QUAERO_FrenchPress (≈ 38,548 sentences)

### Shared Hyperparameters
- Embedding dimension: **100**
- Window size: **5**
- Minimum word count: **1**
- Epochs: **10**
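
The six models are the three architectures crossed with the two corpora. The sketch below spells out that grid. It assumes the models were trained with gensim 4.x (this report does not name the library) and that each corpus is available as a list of tokenized sentences. The names `SHARED`, `ARCHITECTURES`, `build_grid`, and `train_all` are illustrative, not taken from the TP1 code.

```python
from itertools import product

# Shared hyperparameters from the list above (gensim 4.x argument names, assumed).
SHARED = {"vector_size": 100, "window": 5, "min_count": 1, "epochs": 10}

# (model class name, sg flag): sg=0 selects CBOW, sg=1 selects Skip-gram.
ARCHITECTURES = {
    "w2v_cbow": ("Word2Vec", 0),
    "w2v_skipgram": ("Word2Vec", 1),
    "ft_cbow": ("FastText", 0),
}

CORPORA = ("medical", "press")


def build_grid():
    """Return one configuration per (architecture, corpus) pair."""
    grid = {}
    for (arch, (cls_name, sg)), corpus in product(ARCHITECTURES.items(), CORPORA):
        grid[f"{arch}_{corpus}"] = {"class": cls_name, "corpus": corpus, "sg": sg, **SHARED}
    return grid


def train_all(sentences_by_corpus):
    """Train every configuration; requires gensim (imported lazily).

    sentences_by_corpus maps "medical" / "press" to a list of token lists.
    """
    from gensim.models import FastText, Word2Vec

    classes = {"Word2Vec": Word2Vec, "FastText": FastText}
    models = {}
    for name, cfg in build_grid().items():
        kwargs = {k: v for k, v in cfg.items() if k not in ("class", "corpus")}
        models[name] = classes[cfg["class"]](
            sentences=sentences_by_corpus[cfg["corpus"]], **kwargs
        )
    return models


print(sorted(build_grid()))
```

Keeping the hyperparameters in one shared dictionary makes it easier to check that differences between models come from the architecture or the corpus, not from an accidental configuration mismatch.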

The embeddings were evaluated qualitatively by inspecting the nearest neighbors, ranked by cosine similarity, of the following target words:

> **patient, traitement, maladie, solution, jaune**
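
For reference, the nearest-neighbor ranking used in this evaluation can be sketched in plain Python. The function names and the toy 3-dimensional vectors are invented for illustration. The trained models use 100 dimensions and their own lookup routines.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def nearest_neighbors(target, vectors, k=3):
    """Rank every other word in `vectors` by cosine similarity to `target`."""
    scores = [
        (word, cosine(vectors[target], vec))
        for word, vec in vectors.items()
        if word != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional vectors, invented for illustration only.
toy = {
    "patient": [1.0, 0.2, 0.0],
    "patiente": [0.9, 0.3, 0.1],
    "jaune": [0.0, 0.1, 1.0],
}
print(nearest_neighbors("patient", toy, k=2))
```

If the models are gensim objects, `model.wv.most_similar(word, topn=k)` returns the same kind of ranking.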


## 2. Word2Vec Models on the Medical Corpus

### 2.1 CBOW on the Medical Corpus

The Word2Vec CBOW model trained on the medical corpus shows strong domain-specific behavior.

Examples:
- **"traitement"** → *médecin, TYSABRI, devra*
- **"solution"** → *poudre, flacon, diluer*
- **"maladie"** → *administration, évolution*

These neighbors are clearly related to:
- medical procedures,
- pharmaceutical products,
- treatment instructions.

However, some high-frequency function words (e.g. *que, la, du*) also appear among the nearest neighbors. This suggests that, on a corpus this small, the averaged context used by CBOW is dominated by frequent words.


### 2.2 Skip-gram on the Medical Corpus

The Skip-gram model provides more semantically focused representations.

Notable observations:
- **"traitement"** → *cancer, Parkinson, expérimenté*
- **"solution"** → *injectable, perfusion, intraveineuse*
- **"maladie"** → *Parkinson, charge, liée*

Compared to CBOW:
- Skip-gram captures **rarer but more informative medical terms**
- It better models **specialized medical vocabulary**

This behavior is expected, as Skip-gram is known to perform better on infrequent words.
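
One common explanation for this difference is the way the two architectures build training examples. CBOW makes a single prediction per position from the combined context, whereas Skip-gram makes a separate prediction for each (center, context) pair. The simplified sketch below illustrates this. It ignores subsampling, negative sampling, and the random window shrinking used by real implementations. The example sentence and function names are invented for illustration.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """One example per position: (all context words) -> center word."""
    return [(tuple(context_window(tokens, i, window)), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """One example per (center word, context word) pair."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


# Invented sentence, for illustration only.
sentence = ["la", "maladie", "de", "Parkinson", "évolue"]
cbow = cbow_examples(sentence)
skip = skipgram_examples(sentence)

# Count the examples in which "Parkinson" is the center word.
print(sum(1 for _, center in cbow if center == "Parkinson"))
print(sum(1 for center, _ in skip if center == "Parkinson"))
```

A word that occurs only a few times therefore takes part in more individual predictions under Skip-gram than under CBOW.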


## 3. Word2Vec Models on the Press Corpus

### 3.1 CBOW on the Press Corpus

On the press corpus, semantic similarity is more general and abstract.

Examples:
- **"traitement"** → *coût, financement, système*
- **"solution"** → *recette, règle, alternative*
- **"maladie"** → *population, mondialisation*

This reflects:
- political,
- economic,
- societal discourse.

Medical meaning is largely diluted due to the general-domain nature of the corpus.


### 3.2 Skip-gram on the Press Corpus

Skip-gram on the press corpus still improves semantic precision compared to CBOW.

Examples:
- **"maladie"** → *Alzheimer, grippe, pneumopathie*
- **"solution"** → *pacifique, consensuelle, vitale*

Nevertheless, the representations remain less specialized than those obtained from the medical corpus.

This highlights the strong influence of **training data domain** on embedding quality.


## 4. FastText CBOW on the Medical Corpus

FastText shows very strong performance on the medical corpus.

Key observations:
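- Near-perfect cosine similarity scores (≈ 1.0), which on a corpus this small may partly reflect shared character n-grams rather than semantic proximity alone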
- Strong morphological awareness

Examples:
- **"traitement"** → *Traitement, traitements, Allaitement*
- **"patient"** → *patiente, Patient*
- **"solution"** → *dilution, dissolution*

FastText benefits from:
- character n-grams,
- robustness to capitalization,
- handling of rare or misspelled words.

This makes it particularly well-suited for medical texts.
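
To illustrate the character n-gram mechanism, the sketch below extracts n-grams of length 3 to 6 (the FastText defaults) from a word wrapped in `<` and `>` boundary markers, then measures how many n-grams two words share. It is a simplification: the actual library also keeps a vector for the whole word and hashes n-grams into a fixed number of buckets. `char_ngrams` and `shared_ngram_ratio` are illustrative names.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word` wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


def shared_ngram_ratio(a, b):
    """Jaccard overlap between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


for other in ("traitements", "Traitement", "médecin"):
    print(other, round(shared_ngram_ratio("traitement", other), 2))
```

Words that differ only by a plural ending or an initial capital share most of their n-grams, which is consistent with neighbors such as *Traitement* and *traitements* appearing for *traitement*.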


## 5. FastText CBOW on the Press Corpus

On the press corpus, FastText mainly captures **morphological similarity** rather than semantic specialization.

Examples:
- **"patient"** → *impatient, patientent*
- **"solution"** → *révolution, résolution*
- **"jaune"** → *lune, brune, Jeune*

While linguistically valid, these neighbors are:
- less semantically meaningful for NER,
- often based on suffix or character overlap.

This confirms that FastText excels at form-level similarity, especially in large general corpora.
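
A rough way to separate form-driven neighbors from other neighbors is to compare the strings themselves. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a surface-similarity heuristic. The 0.7 threshold and the function names are arbitrary choices for illustration, not part of the TP1 evaluation.

```python
from difflib import SequenceMatcher


def surface_ratio(a, b):
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def flag_form_level(target, neighbors, threshold=0.7):
    """Split neighbors into likely form-driven and other neighbors."""
    form, other = [], []
    for word in neighbors:
        (form if surface_ratio(target, word) >= threshold else other).append(word)
    return form, other


# Neighbors of "solution" reported above: FastText press, then Word2Vec CBOW press.
print(flag_form_level("solution", ["révolution", "résolution"]))
print(flag_form_level("solution", ["recette", "règle", "alternative"]))
```

On these examples, both FastText neighbors fall in the form-driven group, while none of the Word2Vec CBOW neighbors do.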


## 6. Global Comparison of Embedding Approaches

| Model | Medical Corpus | Press Corpus |
|------|---------------|--------------|
| Word2Vec CBOW | Good contextual coverage, but noisy | General, abstract semantics |
| Word2Vec Skip-gram | Strong medical semantics | More precise than CBOW, but less specialized than on medical data |
| FastText CBOW | Excellent domain coverage and morphology | Mainly morphological similarity |

### Key Findings
- **Domain matters more than model choice**
- Skip-gram outperforms CBOW for specialized vocabulary
- FastText is best for rare words and morphology
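
The comparison can also be made a little more concrete by measuring how much the neighbor lists of two models overlap. The sketch below computes the Jaccard overlap for the neighbors of *solution* listed in the sections above. Only the two or three neighbors quoted in this report are used, so the numbers are indicative at best. `jaccard` and the model keys are illustrative names.

```python
def jaccard(a, b):
    """Jaccard overlap between two neighbor lists, treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Neighbors of "solution" as quoted in Sections 2 to 5 of this report.
solution_neighbors = {
    "w2v_cbow_medical": ["poudre", "flacon", "diluer"],
    "w2v_skipgram_medical": ["injectable", "perfusion", "intraveineuse"],
    "w2v_cbow_press": ["recette", "règle", "alternative"],
    "w2v_skipgram_press": ["pacifique", "consensuelle", "vitale"],
    "ft_cbow_medical": ["dilution", "dissolution"],
    "ft_cbow_press": ["révolution", "résolution"],
}

names = list(solution_neighbors)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        score = jaccard(solution_neighbors[first], solution_neighbors[second])
        print(f"{first} vs {second}: {score:.2f}")
```

None of the quoted lists share a neighbor, so every pair scores zero. Running the same computation on longer neighbor lists (for example the top 10 or top 20) would give a finer picture.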


## 7. Implications for Named Entity Recognition (TP2)

Based on these observations, we expect:

- **Medical embeddings** to outperform **press embeddings** for NER in the medical domain
- **FastText medical embeddings** to perform best for:
  - rare entities,
  - complex terminology,
  - spelling variations
- **Skip-gram medical embeddings** to provide strong semantic consistency

These hypotheses will be validated quantitatively in TP2.
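
One simple check that could accompany the TP2 experiments is vocabulary coverage: a Word2Vec model has no vector for a token it never saw during training, whereas FastText can still compose one from character n-grams. The sketch below measures how many tokens of a NER dataset fall inside an embedding vocabulary. The vocabulary, the tokens, and the function names are hypothetical and only illustrate the computation.

```python
def vocabulary_coverage(tokens, vocabulary):
    """Fraction of tokens that have a vector in the embedding vocabulary."""
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if tok in vocabulary)
    return known / len(tokens)


def oov_tokens(tokens, vocabulary):
    """Distinct out-of-vocabulary tokens, in order of first appearance."""
    seen, missing = set(), []
    for tok in tokens:
        if tok not in vocabulary and tok not in seen:
            seen.add(tok)
            missing.append(tok)
    return missing


# Hypothetical vocabulary and NER tokens, invented for illustration.
vocab = {"le", "patient", "traitement", "par", "perfusion"}
ner_tokens = ["le", "patient", "reçoit", "TYSABRI", "par", "perfusion"]

print(vocabulary_coverage(ner_tokens, vocab))
print(oov_tokens(ner_tokens, vocab))
```

A low coverage for a given embedding would be one possible reason to prefer a subword-based model for that dataset.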


## 8. Conclusion

This qualitative evaluation suggests that:

- Word embeddings strongly depend on the **training corpus**
- Medical-domain embeddings capture precise biomedical semantics
- FastText offers superior robustness through subword modeling
- Skip-gram is particularly effective for rare and technical terms

Overall, **FastText and Skip-gram trained on medical data** are the most promising candidates for the downstream NER task in TP2.
