# TP1 Word Embeddings  
## Training FastText on the Press Corpus (QUAERO_FrenchPress)

In this notebook, we train **FastText word embeddings** on the
general-domain press corpus **QUAERO_FrenchPress** (38,548 sentences once loaded).

FastText extends Word2Vec by representing each word as the sum of the
vectors of its **character n-grams**, which allows:
- better modeling of morphology,
- vectors for rare and out-of-vocabulary words, built from their n-grams,
- parameter sharing between words that have substrings in common.
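
To make the n-gram decomposition concrete, here is a minimal sketch of how a word can be split into character n-grams. The helper `char_ngrams` is hypothetical (it is not part of gensim), and the range of 3 to 6 characters mirrors the gensim defaults `min_n=3` and `max_n=6`. The actual implementation additionally hashes n-grams into a fixed number of buckets and keeps a vector for the whole word, which this sketch leaves out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, with boundary markers.

    Hypothetical helper for illustration only; it works on Unicode
    characters and ignores the hashing step used by FastText.
    """
    wrapped = f"<{word}>"  # "<" and ">" mark the start and end of the word
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


jaune_grams = char_ngrams("jaune")
print(jaune_grams)
```

The boundary markers let the model distinguish a prefix such as `<ja` from the same characters inside a word, and a suffix such as `ne>` from a word-internal `ne`.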




## Imports and Paths

In [1]:
import os
import multiprocessing
from gensim.models import FastText


In [2]:
CURRENT_DIR = os.getcwd()
PROJECT_ROOT = os.path.abspath(os.path.join(CURRENT_DIR, "../.."))

PRESS_CORPUS_PATH = os.path.join(
    PROJECT_ROOT,
    "data",
    "embeddings_corpus",
    "QUAERO_FrenchPress",
    "QUAERO_FrenchPress_traindev.ospl"
)

EMBEDDINGS_DIR = os.path.join(PROJECT_ROOT, "embeddings")
os.makedirs(EMBEDDINGS_DIR, exist_ok=True)

print("Press corpus path:", PRESS_CORPUS_PATH)
print("Embeddings directory:", EMBEDDINGS_DIR)


Press corpus path: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/data/embeddings_corpus/QUAERO_FrenchPress/QUAERO_FrenchPress_traindev.ospl
Embeddings directory: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/embeddings


## Load and Tokenize Press Corpus

In [3]:
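def load_tokenized_corpus(path):
    """Read a one-sentence-per-line file and split each line on whitespace.

    The corpus is already tokenized, so no further tokenization is applied.
    """
    sentences = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip empty lines
                sentences.append(line.split())
    return sentences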


press_sentences = load_tokenized_corpus(PRESS_CORPUS_PATH)

print(f"Number of press sentences: {len(press_sentences)}")
print("Example tokenized sentence:")
press_sentences[0][:20]


Number of press sentences: 38548
Example tokenized sentence:


['Patricia',
 'Martin',
 ',',
 'que',
 'voici',
 ',',
 'que',
 'voilà',
 '!',
 'oh',
 ',',
 'bonjour',
 'Nicolas',
 'Stoufflet',
 '.']
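
Loading every sentence into a list works at this corpus size. For larger files, one option is to stream the corpus from disk instead. The sketch below assumes the same one-sentence-per-line layout; the class name `SentencePerLineCorpus` and the temporary demo file are hypothetical. It is written as a restartable iterable rather than a generator because gensim iterates over the corpus several times (once to build the vocabulary, then once per epoch).

```python
import os
import tempfile


class SentencePerLineCorpus:
    """Restartable iterable over a pre-tokenized, one-sentence-per-line file.

    Hypothetical helper: each call to __iter__ reopens the file, so the
    corpus can be traversed more than once without being held in memory.
    """

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        with open(self.path, "r", encoding=self.encoding) as f:
            for line in f:
                tokens = line.split()
                if tokens:  # skip empty lines
                    yield tokens


# Demo on a tiny temporary file (illustrative data, not the real corpus).
with tempfile.NamedTemporaryFile(
    "w", suffix=".ospl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("bonjour Nicolas Stoufflet .\n\nque voici , que voilà !\n")
    demo_path = tmp.name

demo_corpus = SentencePerLineCorpus(demo_path)
first_pass = list(demo_corpus)
second_pass = list(demo_corpus)  # a second pass yields the same sentences
os.remove(demo_path)

print(first_pass)
```

An instance of such a class could be passed as `sentences=` in place of the in-memory list. gensim also ships `gensim.models.word2vec.LineSentence`, which serves the same purpose.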

## Training Configuration

In [4]:
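EMBEDDING_DIM = 100   # dimensionality of the word vectors
MIN_COUNT = 1         # keep every token, including words seen only once
WINDOW_SIZE = 5       # maximum distance between target and context word
EPOCHS = 10           # number of passes over the corpus
N_WORKERS = multiprocessing.cpu_count()  # one training thread per logical CPU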

print("Embedding dimension:", EMBEDDING_DIM)
print("Min count:", MIN_COUNT)
print("Window size:", WINDOW_SIZE)
print("Epochs:", EPOCHS)
print("Workers:", N_WORKERS)


Embedding dimension: 100
Min count: 1
Window size: 5
Epochs: 10
Workers: 12
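
With `MIN_COUNT = 1`, every token enters the vocabulary, whereas the gensim default (`min_count=5`) would discard rare words. The sketch below is one way to check how much the vocabulary would shrink at different thresholds. The function `vocab_size_by_min_count` and the toy sentences are hypothetical and only serve as an illustration.

```python
from collections import Counter


def vocab_size_by_min_count(sentences, thresholds=(1, 2, 5)):
    """Return {threshold: number of distinct tokens seen at least that often}."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {
        t: sum(1 for c in counts.values() if c >= t)
        for t in thresholds
    }


# Toy data for illustration (not taken from the press corpus).
toy_sentences = [
    ["le", "patient", "va", "bien"],
    ["le", "traitement", "du", "patient"],
    ["le", "médecin"],
]

sizes = vocab_size_by_min_count(toy_sentences)
print(sizes)
```

Calling the same function on `press_sentences` would show how many of the vocabulary entries are rare words whose vectors rest on very few training examples.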


## Train FastText CBOW (Press)

In [5]:
print("Training FastText CBOW model on press corpus...")

fasttext_press = FastText(
    sentences=press_sentences,
    vector_size=EMBEDDING_DIM,
    window=WINDOW_SIZE,
    min_count=MIN_COUNT,
    sg=0,             # CBOW
    workers=N_WORKERS,
    epochs=EPOCHS
)

print("FastText training completed.")


Training FastText CBOW model on press corpus...
FastText training completed.


## Save Model and Vectors

In [6]:
model_path = os.path.join(EMBEDDINGS_DIR, "fasttext_press_cbow.model")
vectors_path = os.path.join(EMBEDDINGS_DIR, "fasttext_press_cbow.vec")

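# Full model, including subword n-gram vectors; reload with FastText.load()
fasttext_press.save(model_path)
# Plain-text word2vec format: full-word vectors only, no subword information
fasttext_press.wv.save_word2vec_format(vectors_path)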

print("FastText model saved to:", model_path)
print("FastText vectors saved to:", vectors_path)


FastText model saved to: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/embeddings/fasttext_press_cbow.model
FastText vectors saved to: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/embeddings/fasttext_press_cbow.vec
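
The `.vec` file is written in the plain-text word2vec format: a header line with the vocabulary size and the dimension, then one line per word with its vector components separated by spaces. As a rough sketch of that layout, the hypothetical reader below parses such a file without gensim. The demo file and its values are made up for illustration.

```python
import os
import tempfile


def read_word2vec_text(path, encoding="utf-8"):
    """Parse a text-format word2vec file into ({word: [float, ...]}, dim)."""
    vectors = {}
    with open(path, "r", encoding=encoding) as f:
        # Header line: "<number of words> <vector dimension>"
        n_words, dim = (int(x) for x in f.readline().split())
        for line in f:
            if not line.strip():
                continue  # tolerate a trailing blank line
            word, *values = line.rstrip("\n").split(" ")
            if len(values) != dim:
                raise ValueError(
                    f"expected {dim} values for {word!r}, got {len(values)}"
                )
            vectors[word] = [float(v) for v in values]
    if len(vectors) != n_words:
        raise ValueError(f"header announces {n_words} words, read {len(vectors)}")
    return vectors, dim


# Demo file with made-up values (not real embeddings).
with tempfile.NamedTemporaryFile(
    "w", suffix=".vec", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("2 3\njaune 0.1 0.2 0.3\nlune 0.4 0.5 0.6\n")
    demo_vec_path = tmp.name

demo_vectors, demo_dim = read_word2vec_text(demo_vec_path)
os.remove(demo_vec_path)

print(demo_dim, sorted(demo_vectors))
```

Because this format stores only full-word vectors, a vector for an unseen word can only be built from the `.model` file, which keeps the n-gram vectors.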


## Vocabulary Size

In [7]:
print("FastText press vocabulary size:", len(fasttext_press.wv))


FastText press vocabulary size: 39654


## Semantic Similarity (Press Corpus)

In [8]:
test_words = ["patient", "traitement", "maladie", "solution", "jaune"]

for word in test_words:
    print(f"\nMost similar words to '{word}' (FastText – Press):")
    for w, s in fasttext_press.wv.most_similar(word, topn=5):
        print(f"  {w:<15} {s:.3f}")



Most similar words to 'patient' (FastText – Press):
  patientent      0.985
  impatient       0.981
  détient         0.975
  ratifient       0.973
  trient          0.972

Most similar words to 'traitement' (FastText – Press):
  promptement     0.973
  recrutement     0.967
  concrètement    0.965
  farouchement    0.962
  plafonnement    0.961

Most similar words to 'maladie' (FastText – Press):
  malnutrie       0.925
  trilogie        0.893
  folie           0.889
  magie           0.888
  pie             0.887

Most similar words to 'solution' (FastText – Press):
  révolution      0.983
  résolution      0.982
  évolution       0.978
  dissolution     0.976
  caution         0.976

Most similar words to 'jaune' (FastText – Press):
  Neptune         0.970
  brune           0.968
  lune            0.963
  Jeune           0.959
  Saâdoune        0.946
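
Many of the neighbours listed above share a suffix with the query word. One way to quantify this is to measure how many character n-grams two words have in common. The sketch below computes a Jaccard overlap between n-gram sets; the helpers `char_ngram_set` and `ngram_jaccard` are hypothetical, and the 3 to 6 range is assumed to match the gensim defaults. It is a diagnostic on spelling only and does not use the trained model.

```python
def char_ngram_set(word, min_n=3, max_n=6):
    """Set of character n-grams of `word`, with "<" and ">" boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(wrapped) - n + 1)
    }


def ngram_jaccard(word_a, word_b, min_n=3, max_n=6):
    """Share of n-grams common to both words (0.0 = none, 1.0 = identical)."""
    grams_a = char_ngram_set(word_a, min_n, max_n)
    grams_b = char_ngram_set(word_b, min_n, max_n)
    union = grams_a | grams_b
    if not union:  # both words too short to produce any n-gram
        return 0.0
    return len(grams_a & grams_b) / len(union)


for query, neighbour in [("solution", "révolution"), ("solution", "jaune")]:
    print(f"{query:<10} {neighbour:<12} {ngram_jaccard(query, neighbour):.3f}")
```

A high overlap for a neighbour with an unrelated meaning suggests that the similarity score is driven by shared subword vectors rather than by shared contexts.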


## Conclusion

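FastText embeddings trained on the press corpus:
- are dominated by character-level information: in the similarity test, most nearest neighbours share a suffix with the query word (`révolution` and `résolution` for `solution`, `promptement` and `recrutement` for `traitement`) rather than its meaning,
- capture morphological relatedness in some cases (`impatient` for `patient`),
- provide a vector for each of the 39,654 vocabulary entries, including any word that occurs only once (`MIN_COUNT = 1`).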

Compared to the medical corpus:
- press embeddings are more generic,
- medical embeddings are more specialized.

These results illustrate the impact of the **training data domain**
on word embeddings and motivate their evaluation on downstream tasks,
such as Named Entity Recognition in TP2.
