# TP1: Word Embeddings
## Training Word2Vec on the Medical Corpus (QUAERO_FrenchMed)

In this notebook, we train **Word2Vec embeddings** on the **medical corpus**
(QUAERO_FrenchMed).

We follow the lab instructions:
- Library: `gensim`
- Models: **CBOW** and **Skip-gram**
- Embedding dimension: **100**
- `min_count`: **1** (every token is kept, including words seen only once)

These embeddings will later be:
- qualitatively evaluated using semantic similarity
- reused in TP2 for Named Entity Recognition 


## Imports and Environment Setup

In [2]:
import os
from gensim.models import Word2Vec
from tqdm import tqdm
import multiprocessing


In [4]:
CURRENT_DIR = os.getcwd()
PROJECT_ROOT = os.path.abspath(os.path.join(CURRENT_DIR, "../.."))

# Corpus path
MEDICAL_CORPUS_PATH = os.path.join(
    PROJECT_ROOT,
    "data",
    "embeddings_corpus",
    "QUAERO_FrenchMed",
    "QUAERO_FrenchMed_traindev.ospl"
)

# Output directory for embeddings
EMBEDDINGS_DIR = os.path.join(PROJECT_ROOT, "embeddings")
os.makedirs(EMBEDDINGS_DIR, exist_ok=True)

print("Medical corpus path:", MEDICAL_CORPUS_PATH)
print("Embeddings will be saved in:", EMBEDDINGS_DIR)


Medical corpus path: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/data/embeddings_corpus/QUAERO_FrenchMed/QUAERO_FrenchMed_traindev.ospl
Embeddings will be saved in: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/embeddings


## Load and Tokenize the Medical Corpus

In [5]:
def load_tokenized_corpus(path):
    """
    Load a corpus in OSPL format (one sentence per line, tokens separated
    by spaces) and return it as a list of token lists. Blank lines are skipped.
    """
    sentences = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                tokens = line.split()
                sentences.append(tokens)
    return sentences


medical_sentences = load_tokenized_corpus(MEDICAL_CORPUS_PATH)

print(f"Number of medical sentences: {len(medical_sentences)}")
print("Example tokenized sentence:")
medical_sentences[0][:20]


Number of medical sentences: 3021
Example tokenized sentence:


['EMEA', '/', 'H', '/', 'C', '/', '551']
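
Because `min_count` is 1 in this lab, every distinct token receives a vector, including words that occur only once. A minimal sketch of a few corpus statistics that make this visible; `corpus_stats` and the toy sentences are illustrative names and data, not part of the lab code:

```python
from collections import Counter


def corpus_stats(sentences):
    """Return basic counts for a corpus given as a list of token lists."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {
        "sentences": len(sentences),
        "tokens": sum(counts.values()),
        "vocabulary": len(counts),
        # Words seen exactly once: they get a vector but very little training signal.
        "hapaxes": sum(1 for c in counts.values() if c == 1),
    }


# Toy corpus standing in for `medical_sentences` (illustrative data only).
toy_sentences = [
    ["le", "patient", "prend", "le", "traitement"],
    ["le", "traitement", "est", "efficace"],
]
stats = corpus_stats(toy_sentences)
print(stats)
```

Calling `corpus_stats(medical_sentences)` would report the same quantities for the real corpus. With `min_count = 1`, its `vocabulary` entry should match the vocabulary size of the trained models.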

## Training Configuration

In [7]:
# Hyperparameters (embedding dimension and min_count are set by the lab instructions)
EMBEDDING_DIM = 100
MIN_COUNT = 1
WINDOW_SIZE = 5
EPOCHS = 10

N_WORKERS = multiprocessing.cpu_count()

print("Embedding dimension:", EMBEDDING_DIM)
print("Min count:", MIN_COUNT)
print("Window size:", WINDOW_SIZE)
print("Epochs:", EPOCHS)
print("Workers:", N_WORKERS)


Embedding dimension: 100
Min count: 1
Window size: 5
Epochs: 10
Workers: 12
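
Training with several worker threads is not deterministic, so two runs can produce slightly different vectors. If reproducible embeddings are needed, gensim's documentation suggests fixing `seed` and using a single worker (and, across interpreter launches, setting the `PYTHONHASHSEED` environment variable). A sketch of such a configuration; the seed value 42 and the name `REPRODUCIBLE_PARAMS` are arbitrary choices, not part of the lab:

```python
# Keyword arguments for gensim's Word2Vec, mirroring the hyperparameters above.
# The fixed seed and the single worker are assumptions made for reproducibility;
# they trade training speed for repeatable results.
REPRODUCIBLE_PARAMS = {
    "vector_size": 100,
    "window": 5,
    "min_count": 1,
    "epochs": 10,
    "seed": 42,    # arbitrary fixed seed
    "workers": 1,  # a single thread avoids scheduling-dependent updates
}

# Intended use (requires gensim and the tokenized corpus):
# model = Word2Vec(sentences=medical_sentences, sg=0, **REPRODUCIBLE_PARAMS)
print(REPRODUCIBLE_PARAMS)
```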


## Train Word2Vec CBOW Model

In [8]:
print("Training Word2Vec CBOW model on medical corpus...")

w2v_medical_cbow = Word2Vec(
    sentences=medical_sentences,
    vector_size=EMBEDDING_DIM,
    window=WINDOW_SIZE,
    min_count=MIN_COUNT,
    sg=0,  # 0 = CBOW
    workers=N_WORKERS,
    epochs=EPOCHS
)

print("CBOW training completed.")


Training Word2Vec CBOW model on medical corpus...
CBOW training completed.


## Save CBOW Model and Embeddings

In [9]:
cbow_model_path = os.path.join(EMBEDDINGS_DIR, "word2vec_medical_cbow.model")
cbow_vectors_path = os.path.join(EMBEDDINGS_DIR, "word2vec_medical_cbow.vec")

w2v_medical_cbow.save(cbow_model_path)
w2v_medical_cbow.wv.save_word2vec_format(cbow_vectors_path)

print("CBOW model saved to:", cbow_model_path)
print("CBOW vectors saved to:", cbow_vectors_path)


CBOW model saved to: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/embeddings/word2vec_medical_cbow.model
CBOW vectors saved to: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/embeddings/word2vec_medical_cbow.vec
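
The two files serve different purposes. The `.model` file keeps the full training state and can be reloaded with `Word2Vec.load`, while the `.vec` file holds only the vectors in the plain-text word2vec format: a header line with the vocabulary size and the dimension, then one line per word with its components. gensim reads that file back with `KeyedVectors.load_word2vec_format`. The standard-library sketch below only illustrates the layout; `read_word2vec_text` is a hypothetical helper, and it assumes the text format (not binary) and tokens that contain no whitespace:

```python
import os
import tempfile


def read_word2vec_text(path):
    """Parse a text-format word2vec file into a {word: [floats]} dict."""
    vectors = {}
    with open(path, "r", encoding="utf-8") as f:
        # Header line: "<vocabulary size> <vector dimension>"
        n_words, dim = map(int, f.readline().split())
        for line in f:
            parts = line.split()
            if not parts:
                continue  # ignore blank lines
            word, values = parts[0], [float(x) for x in parts[1:]]
            if len(values) != dim:
                raise ValueError(f"Expected {dim} values for {word!r}, got {len(values)}")
            vectors[word] = values
    if len(vectors) != n_words:
        raise ValueError(f"Header announces {n_words} words, found {len(vectors)}")
    return vectors


# Tiny made-up file in the same layout, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.vec")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("2 3\npatient 0.1 0.2 0.3\ntraitement 0.4 0.5 0.6\n")
    demo_vectors = read_word2vec_text(demo_path)

print(demo_vectors["patient"])
```

In TP2, the gensim loader is the more robust choice; a reader like this one is mainly useful where gensim is not installed.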


## Train Word2Vec Skip-gram Model

In [10]:
print("Training Word2Vec Skip-gram model on medical corpus...")

w2v_medical_sg = Word2Vec(
    sentences=medical_sentences,
    vector_size=EMBEDDING_DIM,
    window=WINDOW_SIZE,
    min_count=MIN_COUNT,
    sg=1,  # 1 = Skip-gram
    workers=N_WORKERS,
    epochs=EPOCHS
)

print("Skip-gram training completed.")


Training Word2Vec Skip-gram model on medical corpus...
Skip-gram training completed.
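
The two training cells differ only in the `sg` flag. CBOW (`sg=0`) predicts a word from its surrounding context, whereas Skip-gram (`sg=1`) predicts each context word from the centre word. The sketch below shows how the training examples differ for a fixed window. It is a simplification: it ignores the random window shrinking, subsampling and negative sampling that gensim applies, and the helper names are illustrative:

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW: one example per position, (all context words) -> target word."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window):
    """Skip-gram: one example per (centre word, context word) pair."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]


sentence = ["le", "patient", "prend", "son", "traitement"]
print(cbow_examples(sentence, window=1))
print(skipgram_examples(sentence, window=1))
```

Skip-gram produces more, finer-grained examples from the same text, which is one reason it is often slower to train than CBOW.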


## Save Skip-gram Model and Embeddings

In [11]:
sg_model_path = os.path.join(EMBEDDINGS_DIR, "word2vec_medical_skipgram.model")
sg_vectors_path = os.path.join(EMBEDDINGS_DIR, "word2vec_medical_skipgram.vec")

w2v_medical_sg.save(sg_model_path)
w2v_medical_sg.wv.save_word2vec_format(sg_vectors_path)

print("Skip-gram model saved to:", sg_model_path)
print("Skip-gram vectors saved to:", sg_vectors_path)


Skip-gram model saved to: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/embeddings/word2vec_medical_skipgram.model
Skip-gram vectors saved to: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/embeddings/word2vec_medical_skipgram.vec


## Quick Sanity Check: Vocabulary Size

In [12]:
print("CBOW vocabulary size:", len(w2v_medical_cbow.wv))
print("Skip-gram vocabulary size:", len(w2v_medical_sg.wv))


CBOW vocabulary size: 9104
Skip-gram vocabulary size: 9104


## Quick Semantic Check (Medical Words)

In [13]:
test_words = ["patient", "traitement", "maladie"]

for word in test_words:
    if word in w2v_medical_cbow.wv:
        print(f"\nMost similar words to '{word}' (CBOW):")
        for w, s in w2v_medical_cbow.wv.most_similar(word, topn=5):
            print(f"  {w:<15} {s:.3f}")
    else:
        print(f"\nWord '{word}' not found in CBOW vocabulary.")



Most similar words to 'patient' (CBOW):
  cette           0.999
  Le              0.999
  produit         0.999
  plus            0.999
  qui             0.999

Most similar words to 'traitement' (CBOW):
  que             0.995
  devra           0.995
  médecin         0.995
  TYSABRI         0.994
  qu              0.993

Most similar words to 'maladie' (CBOW):
  évolution       0.998
  du              0.998
  administration  0.998
  souris          0.998
  la              0.998
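
The scores printed by `most_similar` are cosine similarities between word vectors, so a value close to 1 means two vectors point in almost the same direction. A minimal pure-Python sketch of that ranking, using made-up vectors; `cosine_similarity`, `nearest_words` and `toy_vectors` are illustrative names, not gensim APIs:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(query, vectors, topn=5):
    """Rank every other word in `vectors` by cosine similarity to `query`."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 2-dimensional vectors, for illustration only.
toy_vectors = {
    "patient": [1.0, 0.0],
    "malade": [0.9, 0.1],
    "souris": [0.0, 1.0],
}
neighbours = nearest_words("patient", toy_vectors, topn=2)
print(neighbours)
```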


## Conclusion

In this notebook, we successfully trained two **Word2Vec models**
on the **medical corpus (QUAERO_FrenchMed)**:

- Word2Vec **CBOW**
- Word2Vec **Skip-gram**

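Both models:
- use 100-dimensional embeddings
- include every word of the corpus (`min_count = 1`), for a vocabulary of 9104 words
- cover the domain-specific vocabulary of the corpus (e.g. TYSABRI)

The quick semantic check is not conclusive, however: the nearest CBOW neighbours are mostly function words with similarities above 0.99, which suggests that the vectors are still poorly differentiated on this small corpus (3021 sentences).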

These embeddings will be:
- compared with press-domain embeddings
- evaluated using semantic similarity
- reused as pretrained embeddings in TP2 for NER
