# TP1 Word Embeddings  
## Training Word2Vec on the Press Corpus (QUAERO_FrenchPress)

In this notebook, we train **Word2Vec embeddings** on the **press-domain corpus**
(QUAERO_FrenchPress).

The goal is to compare two sets of embeddings:
- those learned from a **large, general-domain corpus** (the press corpus used here)
- those learned from a **small, medical-domain corpus**

We follow the lab constraints:
- Library: `gensim`
- Models: **CBOW** and **Skip-gram**
- Embedding dimension: **100**
- `min_count`: **1** (every token is kept, including words seen only once)


## Imports and Environment Setup

In [1]:
import os
from gensim.models import Word2Vec
from tqdm import tqdm
import multiprocessing


In [2]:
CURRENT_DIR = os.getcwd()
PROJECT_ROOT = os.path.abspath(os.path.join(CURRENT_DIR, "../.."))

# Press corpus path
PRESS_CORPUS_PATH = os.path.join(
    PROJECT_ROOT,
    "data",
    "embeddings_corpus",
    "QUAERO_FrenchPress",
    "QUAERO_FrenchPress_traindev.ospl"
)

# Output directory
EMBEDDINGS_DIR = os.path.join(PROJECT_ROOT, "embeddings")
os.makedirs(EMBEDDINGS_DIR, exist_ok=True)

print("Press corpus path:", PRESS_CORPUS_PATH)
print("Embeddings will be saved in:", EMBEDDINGS_DIR)


Press corpus path: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/data/embeddings_corpus/QUAERO_FrenchPress/QUAERO_FrenchPress_traindev.ospl
Embeddings will be saved in: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/embeddings


## Load and Tokenize the Press Corpus

In [3]:
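def load_tokenized_corpus(path):
    """Read a one-sentence-per-line file and return a list of token lists."""
    sentences = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip empty lines
                # The corpus is already tokenized, so whitespace splitting suffices
                tokens = line.split()
                sentences.append(tokens)
    return sentences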


press_sentences = load_tokenized_corpus(PRESS_CORPUS_PATH)

print(f"Number of press sentences: {len(press_sentences)}")
print("Example tokenized sentence:")
press_sentences[0][:20]


Number of press sentences: 38548
Example tokenized sentence:


['Patricia',
 'Martin',
 ',',
 'que',
 'voici',
 ',',
 'que',
 'voilà',
 '!',
 'oh',
 ',',
 'bonjour',
 'Nicolas',
 'Stoufflet',
 '.']
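
Because `min_count=1` keeps every token, it can help to know how many word types occur only once before training. The sketch below is one possible way to compute such counts; the helper name `corpus_stats` and the toy sentences are illustrative assumptions, not part of the lab code.

```python
from collections import Counter


def corpus_stats(sentences):
    """Return basic counts for a list of tokenized sentences."""
    counts = Counter(token for sent in sentences for token in sent)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    # Hapax legomena: word types that occur exactly once in the corpus.
    n_hapax = sum(1 for c in counts.values() if c == 1)
    return {"tokens": n_tokens, "types": n_types, "hapax": n_hapax}


# Toy input for illustration; in the notebook one would pass press_sentences.
toy_sentences = [["le", "chat", "dort"], ["le", "chien", "dort", "."]]
stats = corpus_stats(toy_sentences)
print(stats)
```

Words seen only once get very few updates during training, so their vectors are likely to be less reliable than those of frequent words.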

## Training Configuration

In [4]:
# Hyperparameters (same as medical corpus)
EMBEDDING_DIM = 100
MIN_COUNT = 1
WINDOW_SIZE = 5
EPOCHS = 10

N_WORKERS = multiprocessing.cpu_count()

print("Embedding dimension:", EMBEDDING_DIM)
print("Min count:", MIN_COUNT)
print("Window size:", WINDOW_SIZE)
print("Epochs:", EPOCHS)
print("Workers:", N_WORKERS)


Embedding dimension: 100
Min count: 1
Window size: 5
Epochs: 10
Workers: 12
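
With several worker threads, two training runs generally do not produce identical vectors. The gensim documentation notes that a fully reproducible run needs a fixed `seed`, a single worker (`workers=1`) and, across interpreter launches, a fixed `PYTHONHASHSEED`. The sketch below only assembles constructor arguments under that assumption; the helper `build_w2v_kwargs` and the seed value 42 are hypothetical choices, not part of the lab constraints.

```python
def build_w2v_kwargs(sg, seed=42, deterministic=False, n_workers=4):
    """Assemble keyword arguments for gensim's Word2Vec constructor."""
    return {
        "vector_size": 100,
        "window": 5,
        "min_count": 1,
        "sg": sg,          # 0 = CBOW, 1 = Skip-gram
        "epochs": 10,
        "seed": seed,      # arbitrary value, chosen for illustration
        # A single worker removes thread-scheduling jitter, at the cost of speed.
        "workers": 1 if deterministic else n_workers,
    }


cbow_kwargs = build_w2v_kwargs(sg=0, deterministic=True)
print(cbow_kwargs)
# Intended use (not executed here):
#   Word2Vec(sentences=press_sentences, **cbow_kwargs)
```

The trade-off is training time: with `workers=1` the model no longer benefits from the available cores.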


## Train Word2Vec CBOW Model (Press)

In [5]:
print("Training Word2Vec CBOW model on press corpus...")

w2v_press_cbow = Word2Vec(
    sentences=press_sentences,
    vector_size=EMBEDDING_DIM,
    window=WINDOW_SIZE,
    min_count=MIN_COUNT,
    sg=0,  # 0 selects CBOW (1 would select Skip-gram)
    workers=N_WORKERS,
    epochs=EPOCHS
)

print("CBOW training completed.")


Training Word2Vec CBOW model on press corpus...


Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'


CBOW training completed.


Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
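
The `sg` flag passed to `Word2Vec` selects the architecture: CBOW predicts a centre word from its surrounding context, while Skip-gram predicts each context word from the centre word. The sketch below only illustrates how the two kinds of training examples are formed from a window. It is a simplification with hypothetical helper names: the real implementation also shrinks the window randomly and uses negative sampling, both omitted here.

```python
def context_window(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW: one example per position, (all context words) -> centre word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window):
    """Skip-gram: one example per (centre word, single context word) pair."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]


sentence = ["bonjour", "Nicolas", "Stoufflet", "."]
print(cbow_examples(sentence, window=1))
print(skipgram_examples(sentence, window=1))
```

For the same sentence, Skip-gram yields more (and finer-grained) training examples than CBOW, which is one reason it is usually slower to train.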


## Save CBOW Model and Embeddings

In [6]:
cbow_model_path = os.path.join(EMBEDDINGS_DIR, "word2vec_press_cbow.model")
cbow_vectors_path = os.path.join(EMBEDDINGS_DIR, "word2vec_press_cbow.vec")

w2v_press_cbow.save(cbow_model_path)
w2v_press_cbow.wv.save_word2vec_format(cbow_vectors_path)

print("CBOW model saved to:", cbow_model_path)
print("CBOW vectors saved to:", cbow_vectors_path)


CBOW model saved to: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/embeddings/word2vec_press_cbow.model
CBOW vectors saved to: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/embeddings/word2vec_press_cbow.vec
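
By default, `save_word2vec_format` writes a plain-text file: a header line with the vocabulary size and the vector dimension, then one line per word followed by its values. In practice the file would be reloaded with gensim's `KeyedVectors.load_word2vec_format`; the minimal reader below is only a sketch to show the layout, using a hypothetical function name and a hand-written sample instead of the real `.vec` file.

```python
import io


def read_word2vec_text(handle):
    """Parse the plain-text word2vec format into a {word: [floats]} dict."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


# Tiny hand-written sample standing in for a real .vec file.
sample = io.StringIO("2 3\nmaillot 0.1 0.2 0.3\njaune 0.4 0.5 0.6\n")
vocab_size, dim, vectors = read_word2vec_text(sample)
print(vocab_size, dim, vectors["jaune"])
```

The `.model` file, by contrast, stores the full training state and is the one to keep if training might be resumed later.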


## Train Word2Vec Skip-gram Model (Press)

In [7]:
print("Training Word2Vec Skip-gram model on press corpus...")

w2v_press_sg = Word2Vec(
    sentences=press_sentences,
    vector_size=EMBEDDING_DIM,
    window=WINDOW_SIZE,
    min_count=MIN_COUNT,
    sg=1,  # 1 selects Skip-gram (0 would select CBOW)
    workers=N_WORKERS,
    epochs=EPOCHS
)

print("Skip-gram training completed.")


Training Word2Vec Skip-gram model on press corpus...


Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'


Skip-gram training completed.


## Save Skip-gram Model and Embeddings

In [8]:
sg_model_path = os.path.join(EMBEDDINGS_DIR, "word2vec_press_skipgram.model")
sg_vectors_path = os.path.join(EMBEDDINGS_DIR, "word2vec_press_skipgram.vec")

w2v_press_sg.save(sg_model_path)
w2v_press_sg.wv.save_word2vec_format(sg_vectors_path)

print("Skip-gram model saved to:", sg_model_path)
print("Skip-gram vectors saved to:", sg_vectors_path)


Skip-gram model saved to: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/embeddings/word2vec_press_skipgram.model
Skip-gram vectors saved to: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/embeddings/word2vec_press_skipgram.vec


## Vocabulary Size Comparison

In [9]:
print("CBOW vocabulary size:", len(w2v_press_cbow.wv))
print("Skip-gram vocabulary size:", len(w2v_press_sg.wv))


CBOW vocabulary size: 39654
Skip-gram vocabulary size: 39654


## Semantic Sanity Check (Press Words)

In [10]:
test_words = ["patient", "traitement", "maladie", "solution", "jaune"]

for word in test_words:
    if word in w2v_press_cbow.wv:
        print(f"\nMost similar words to '{word}' (CBOW – Press):")
        for w, s in w2v_press_cbow.wv.most_similar(word, topn=5):
            print(f"  {w:<15} {s:.3f}")
    else:
        print(f"\nWord '{word}' not found in press vocabulary.")



Most similar words to 'patient' (CBOW – Press):
  représentaient  0.803
  gras            0.797
  gossip          0.785
  onomatopéique   0.782
  sciemment       0.781

Most similar words to 'traitement' (CBOW – Press):
  coût            0.830
  collectif       0.811
  financement     0.791
  système         0.788
  renforcement    0.788

Most similar words to 'maladie' (CBOW – Press):
  puissance       0.799
  population      0.788
  garantie        0.787
  douleur         0.783
  mondialisation  0.777

Most similar words to 'solution' (CBOW – Press):
  recette         0.814
  règle           0.805
  catastrophe     0.799
  coïncidence     0.791
  alternative     0.789

Most similar words to 'jaune' (CBOW – Press):
  maillot         0.864
  Bou             0.864
  cavalier        0.864
  Perrot          0.853
  Wen             0.852
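
The scores printed by `most_similar` are cosine similarities between word vectors. The sketch below reproduces that ranking on a toy dictionary of two-dimensional vectors; the vectors are invented for illustration and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=5):
    """Rank every other word by cosine similarity to the query word."""
    scores = [(word, cosine_similarity(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented 2-D vectors, not taken from the trained model.
toy_vectors = {
    "jaune": [1.0, 0.0],
    "maillot": [0.9, 0.1],
    "coût": [0.0, 1.0],
}
ranking = most_similar("jaune", toy_vectors, topn=2)
print(ranking)
```

A high score therefore means only that two vectors point in a similar direction, which for rare words can reflect a handful of shared contexts rather than a real semantic link.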


## Conclusion

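In this notebook, we trained **Word2Vec CBOW and Skip-gram models**
on the **press-domain corpus (QUAERO_FrenchPress)**.

Compared to the medical corpus:
- the press corpus is **much larger** (38,548 sentences)
- the vocabulary is **significantly richer** (39,654 entries with `min_count=1`)
- embeddings tend to capture **general-language semantics**: with CBOW, `traitement` is closest to `coût` and `financement`, and `jaune` to `maillot`, whereas `patient` only gets unrelated neighbours such as `gras` and `gossip`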

These embeddings will be:
- compared with medical embeddings in the semantic similarity notebook
- evaluated for their impact on downstream NER performance in TP2
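
The semantic similarity notebook is not shown here. As one possible way to quantify such a comparison, the sketch below measures the Jaccard overlap between the nearest-neighbour lists that two models return for the same word. The press list comes from the CBOW output for `traitement` above; the medical list is entirely made up for illustration, and the function name is hypothetical.

```python
def neighbour_overlap(neighbours_a, neighbours_b):
    """Jaccard overlap between two lists of nearest-neighbour words."""
    set_a, set_b = set(neighbours_a), set(neighbours_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


press_neighbours = ["coût", "collectif", "financement", "système", "renforcement"]
# Made-up list: NOT an actual result from the medical embeddings.
hypothetical_medical_neighbours = ["thérapie", "coût", "cure", "dose", "système"]

overlap = neighbour_overlap(press_neighbours, hypothetical_medical_neighbours)
print(overlap)
```

A value near 0 would suggest that the two corpora place the word in very different neighbourhoods, and a value near 1 that they largely agree.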
