# TP1: Word Embeddings
## Semantic Similarity Analysis

In this notebook, we perform a **semantic similarity evaluation** of the word
embeddings trained in the previous notebooks.

We compare:
- **Word2Vec (CBOW / Skip-gram)**
- **FastText (CBOW)**

trained on:
- a **medical-domain corpus** (QUAERO_FrenchMed)
- a **general-domain press corpus** (QUAERO_FrenchPress)

The objective is to analyze how the semantic neighbors of selected target
words are affected by:
- the **training corpus domain**,
- the **embedding method**.

This qualitative evaluation follows the instructions of TP1 and prepares
the embeddings for downstream evaluation in TP2 (NER).


## Imports

In [1]:
import os
from gensim.models import Word2Vec, FastText


In [2]:
CURRENT_DIR = os.getcwd()
PROJECT_ROOT = os.path.abspath(os.path.join(CURRENT_DIR, "../.."))

EMBEDDINGS_DIR = os.path.join(PROJECT_ROOT, "embeddings")

print("Embeddings directory:", EMBEDDINGS_DIR)


Embeddings directory: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/embeddings


## Load All Trained Models

In [3]:
# Word2Vec Medical
w2v_med_cbow = Word2Vec.load(
    os.path.join(EMBEDDINGS_DIR, "word2vec_medical_cbow.model")
)
w2v_med_sg = Word2Vec.load(
    os.path.join(EMBEDDINGS_DIR, "word2vec_medical_skipgram.model")
)

# Word2Vec Press
w2v_press_cbow = Word2Vec.load(
    os.path.join(EMBEDDINGS_DIR, "word2vec_press_cbow.model")
)
w2v_press_sg = Word2Vec.load(
    os.path.join(EMBEDDINGS_DIR, "word2vec_press_skipgram.model")
)

# FastText Medical
ft_med = FastText.load(
    os.path.join(EMBEDDINGS_DIR, "fasttext_medical_cbow.model")
)

# FastText Press
ft_press = FastText.load(
    os.path.join(EMBEDDINGS_DIR, "fasttext_press_cbow.model")
)

print("All models successfully loaded.")


All models successfully loaded.


## Candidate Words

In [4]:
candidate_words = [
    "patient",
    "traitement",
    "maladie",
    "solution",
    "jaune"
]
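
Before comparing neighbor lists, it can help to check which candidate words each model actually has in its vocabulary. The sketch below is one possible way to do this, assuming gensim 4 or later, where `model.wv.key_to_index` maps in-vocabulary words to their indices. The names `vocabulary_coverage` and `toy_model` are hypothetical and introduced here only for illustration.

```python
from types import SimpleNamespace


def vocabulary_coverage(models, words):
    """Return {model_name: {word: True/False}} for in-vocabulary membership.

    Assumes each model exposes `model.wv.key_to_index` (gensim >= 4).
    """
    coverage = {}
    for name, model in models.items():
        vocab = model.wv.key_to_index
        coverage[name] = {word: (word in vocab) for word in words}
    return coverage


# Stand-in object that mimics the attribute layout of a gensim model,
# so the sketch runs without any trained model on disk.
toy_model = SimpleNamespace(
    wv=SimpleNamespace(key_to_index={"patient": 0, "maladie": 1})
)

demo = vocabulary_coverage({"toy": toy_model}, ["patient", "jaune"])
print(demo)
```

With the models loaded above, a call such as `vocabulary_coverage({"w2v_med_cbow": w2v_med_cbow, "ft_press": ft_press}, candidate_words)` should work the same way. The explicit lookup matters mostly for FastText, which can build a vector from character n-grams even for a word it never saw during training.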


## Helper Function for Similarity

In [5]:
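def show_similarities(model, model_name, words, topn=5):
    """Print the `topn` nearest neighbors of each word for one model."""
    print("=" * 80)
    print(model_name)
    print("=" * 80)

    for word in words:
        print(f"\nMost similar words to '{word}':")
        try:
            # most_similar returns (neighbor, cosine similarity) pairs,
            # sorted from most to least similar.
            for w, s in model.wv.most_similar(word, topn=topn):
                print(f"  {w:<15} {s:.3f}")
        except KeyError:
            # Word2Vec raises KeyError for words outside its vocabulary.
            print("  Word not in vocabulary.")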
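
The scores printed by this helper are cosine similarities between word vectors. As a rough illustration of what `most_similar` computes, here is a minimal pure-Python sketch on made-up two-dimensional vectors. The toy vectors and the function names are assumptions for illustration, not gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, query, topn=5):
    """Rank every other word by cosine similarity to `query`."""
    query_vec = vectors[query]
    scored = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in vectors.items()
        if word != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up vectors: 'malade' points almost the same way as 'patient',
# while 'maillot' is orthogonal to it.
toy_vectors = {
    "patient": [1.0, 0.0],
    "malade": [0.9, 0.1],
    "maillot": [0.0, 1.0],
}

ranked = most_similar(toy_vectors, "patient", topn=2)
for word, score in ranked:
    print(f"  {word:<15} {score:.3f}")
```

Because cosine similarity ignores vector length, scores close to 1.0 for many unrelated words usually mean the vectors all point in nearly the same direction.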


## Medical Corpus Word2Vec

In [6]:
show_similarities(
    w2v_med_cbow,
    "Word2Vec CBOW Medical Corpus",
    candidate_words
)

show_similarities(
    w2v_med_sg,
    "Word2Vec Skip-gram Medical Corpus",
    candidate_words
)


Word2Vec CBOW Medical Corpus

Most similar words to 'patient':
  cette           0.999
  Le              0.999
  produit         0.999
  plus            0.999
  qui             0.999

Most similar words to 'traitement':
  que             0.995
  devra           0.995
  médecin         0.995
  TYSABRI         0.994
  qu              0.993

Most similar words to 'maladie':
  évolution       0.998
  du              0.998
  administration  0.998
  souris          0.998
  la              0.998

Most similar words to 'solution':
  poudre          0.998
  contient        0.997
  Chaque          0.997
  diluer          0.996
  flacon          0.996

Most similar words to 'jaune':
  entre           0.999
  –               0.999
  24              0.999
  contenant       0.998
  unique          0.998

Word2Vec Skip-gram Medical Corpus

Most similar words to 'patient':
  carte           0.982
  interrompre     0.978
  avoir           0.974
  allaiter        0.974
  interrompu      0.974

[output truncated]

## Press Corpus Word2Vec

In [8]:
show_similarities(
    w2v_press_cbow,
    "Word2Vec CBOW Press Corpus",
    candidate_words
)

show_similarities(
    w2v_press_sg,
    "Word2Vec Skip-gram Press Corpus",
    candidate_words
)


Word2Vec CBOW Press Corpus

Most similar words to 'patient':
  représentaient  0.803
  gras            0.797
  gossip          0.785
  onomatopéique   0.782
  sciemment       0.781

Most similar words to 'traitement':
  coût            0.830
  collectif       0.811
  financement     0.791
  système         0.788
  renforcement    0.788

Most similar words to 'maladie':
  puissance       0.799
  population      0.788
  garantie        0.787
  douleur         0.783
  mondialisation  0.777

Most similar words to 'solution':
  recette         0.814
  règle           0.805
  catastrophe     0.799
  coïncidence     0.791
  alternative     0.789

Most similar words to 'jaune':
  maillot         0.864
  Bou             0.864
  cavalier        0.864
  Perrot          0.853
  Wen             0.852

Word2Vec Skip-gram Press Corpus

Most similar words to 'patient':
  épouvantable    0.872
  mollah          0.866
  provocation     0.863
  admirable       0.857
  coq             0.855

[output truncated]

## Medical Corpus FastText

In [10]:
show_similarities(
    ft_med,
    "FastText CBOW Medical Corpus",
    candidate_words
)


FastText CBOW Medical Corpus

Most similar words to 'patient':
  Patient         0.999
  tremblements    0.999
  pansements      0.999
  patiente        0.999
  Tremblements    0.999

Most similar words to 'traitement':
  Traitement      1.000
  Taaitement      1.000
  Allaitement     0.999
  allaitement     0.999
  traitements     0.999

Most similar words to 'maladie':
  Maladie         1.000
  malade          1.000
  professionnelle 1.000
  professionnel   1.000
  hyrgathione     1.000

Most similar words to 'solution':
  Dissolution     1.000
  évolution       0.999
  dilution        0.999
  Solution        0.999
  Evolution       0.999

Most similar words to 'jaune':
  zone            1.000
  Une             1.000
  hexane          1.000
  Rhône           1.000
  crâne           1.000


## Press Corpus FastText

In [12]:
show_similarities(
    ft_press,
    "FastText CBOW Press Corpus",
    candidate_words
)


FastText CBOW Press Corpus

Most similar words to 'patient':
  patientent      0.985
  impatient       0.981
  détient         0.975
  ratifient       0.973
  trient          0.972

Most similar words to 'traitement':
  promptement     0.973
  recrutement     0.967
  concrètement    0.965
  farouchement    0.962
  plafonnement    0.961

Most similar words to 'maladie':
  malnutrie       0.925
  trilogie        0.893
  folie           0.889
  magie           0.888
  pie             0.887

Most similar words to 'solution':
  révolution      0.983
  résolution      0.982
  évolution       0.978
  dissolution     0.976
  caution         0.976

Most similar words to 'jaune':
  Neptune         0.970
  brune           0.968
  lune            0.963
  Jeune           0.959
  Saâdoune        0.946


## Discussion

### Impact of the Training Corpus
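Embeddings trained on the **medical corpus** reflect domain-specific usage most
clearly for *solution*, whose Word2Vec CBOW neighbors (*poudre*, *diluer*,
*flacon*) point to the pharmaceutical sense, and for *traitement* (*médecin*,
*TYSABRI*). For *patient* and *maladie*, however, the Word2Vec CBOW neighbors
are mostly function words (*cette*, *qui*, *du*, *la*) with similarities near
0.999, which suggests that these vectors are not well separated, possibly
because of limited training data.

In contrast, embeddings trained on the **press corpus** provide more generic
associations that reflect journalistic language: *solution* is close to
*recette* and *alternative*, *traitement* to *coût* and *financement*, and
*jaune* to *maillot*.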
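
One quick way to compare how sharply each model separates neighbors is to look at the mean and spread of the top-5 scores. The sketch below uses the scores printed above for *patient* (Word2Vec CBOW, both corpora). `score_profile` is a hypothetical helper, and a flat profile near 1.0 is only a hint that vectors may be poorly separated, not proof.

```python
from statistics import mean


def score_profile(scores):
    """Return (mean, spread) of a list of similarity scores."""
    return mean(scores), max(scores) - min(scores)


# Scores copied from the printed neighbor lists for 'patient' (Word2Vec CBOW).
patient_scores = {
    "medical": [0.999, 0.999, 0.999, 0.999, 0.999],
    "press": [0.803, 0.797, 0.785, 0.782, 0.781],
}

for corpus, scores in patient_scores.items():
    avg, spread = score_profile(scores)
    print(f"{corpus:<8} mean={avg:.3f} spread={spread:.3f}")
```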

### Word2Vec vs FastText
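In these results, FastText mostly returns neighbors that share character
n-grams with the query word:
- case and inflection variants (*Patient*, *patiente*, *traitements*),
- spelling variants and typos (*Taaitement*),
- words with a similar ending but an unrelated meaning (*révolution* and
  *caution* for *solution*, *lune* and *Neptune* for *jaune*).

Because it composes vectors from subwords, FastText can also represent rare or
unseen words, although this notebook does not test that. Word2Vec learns one
vector per exact word form, so it is more sensitive to how often each form
occurs and to corpus size.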
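
The surface-form effect follows from how FastText builds a word vector out of character n-grams. The sketch below extracts the n-grams of a word and measures how many two words share. It assumes the usual default n-gram lengths (3 to 6), which may differ from the settings used to train these models, and it ignores the hashing of n-grams into buckets. `char_ngrams` and `ngram_overlap` are hypothetical helpers, not gensim functions.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word`, padded with '<' and '>' as in FastText."""
    padded = f"<{word}>"
    return {
        padded[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    }


def ngram_overlap(word_a, word_b, min_n=3, max_n=6):
    """Jaccard overlap between the n-gram sets of two words."""
    grams_a = char_ngrams(word_a, min_n, max_n)
    grams_b = char_ngrams(word_b, min_n, max_n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# 'dissolution' shares many n-grams with 'solution'; 'flacon' shares few,
# even though it is the semantically closer word in the medical corpus.
for other in ["dissolution", "révolution", "flacon"]:
    print(f"solution / {other:<12} {ngram_overlap('solution', other):.3f}")
```

Words that share many n-grams start from similar subword components, which would explain why *dissolution* ranks above *flacon* for FastText while the reverse holds for Word2Vec.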

### Domain vs Model
The results suggest that:
- the **training data domain** has a stronger impact on the neighbors than the
  embedding algorithm,
- FastText improves robustness to surface variation, but cannot fully
  compensate for domain mismatch.
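
A simple way to put numbers on such comparisons is the Jaccard overlap between two neighbor lists. The sketch below applies it to the top-5 neighbors of *solution* printed above. `neighbor_overlap` is a hypothetical helper, case folding is a choice made here, and one word with five neighbors is far too little to be conclusive.

```python
def neighbor_overlap(neighbors_a, neighbors_b, casefold=True):
    """Jaccard overlap between two neighbor lists (0.0 = disjoint)."""
    if casefold:
        neighbors_a = [w.casefold() for w in neighbors_a]
        neighbors_b = [w.casefold() for w in neighbors_b]
    set_a, set_b = set(neighbors_a), set(neighbors_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Top-5 neighbors of 'solution', copied from the outputs above.
solution_neighbors = {
    "w2v_med_cbow": ["poudre", "contient", "Chaque", "diluer", "flacon"],
    "w2v_press_cbow": ["recette", "règle", "catastrophe", "coïncidence", "alternative"],
    "ft_med": ["Dissolution", "évolution", "dilution", "Solution", "Evolution"],
    "ft_press": ["révolution", "résolution", "évolution", "dissolution", "caution"],
}

pairs = [
    ("w2v_med_cbow", "w2v_press_cbow"),  # same algorithm, different domain
    ("ft_med", "ft_press"),              # same algorithm, different domain
    ("w2v_med_cbow", "ft_med"),          # same domain, different algorithm
]
for a, b in pairs:
    score = neighbor_overlap(solution_neighbors[a], solution_neighbors[b])
    print(f"{a:<15} vs {b:<15} {score:.2f}")
```

Running the same comparison over all candidate words and a larger `topn` would give a steadier picture than this single example.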

These observations motivate the evaluation of embeddings on a downstream task
(NER) in TP2.
