# TP1 Word Embeddings  
## Training FastText on the Medical Corpus (QUAERO_FrenchMed)

In this notebook, we train **FastText word embeddings** on the
medical-domain corpus **QUAERO_FrenchMed**.

Unlike Word2Vec, which learns a single vector per word, FastText also learns vectors for **character n-grams**, allowing:
- better handling of rare words
- generation of embeddings for unseen words
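
To make the n-gram idea concrete, here is a minimal sketch of how a word can be split into character n-grams. The helper name `char_ngrams` is hypothetical, the `<` and `>` boundary markers follow the usual FastText convention, and the 3-to-6 range mirrors gensim's default `min_n` and `max_n`. It is an illustration only, not gensim's internal implementation.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"  # mark the start and end of the word
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Trigrams only, to keep the output short
trigrams = char_ngrams("jaune", min_n=3, max_n=3)
print(trigrams)

# Two words with a common suffix share many n-grams
shared = set(char_ngrams("traitement")) & set(char_ngrams("allaitement"))
print(sorted(shared))
```

Words that share many n-grams also share many of the vectors they are built from, which is why spelling-related words tend to end up close together.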

We follow the lab constraints:
- Library: `gensim`
- Model: **FastText (CBOW)**
- Embedding dimension: **100**
- min_count: **1**


## Imports

In [1]:
import os
import multiprocessing
from gensim.models import FastText


In [2]:
CURRENT_DIR = os.getcwd()
PROJECT_ROOT = os.path.abspath(os.path.join(CURRENT_DIR, "../.."))

MEDICAL_CORPUS_PATH = os.path.join(
    PROJECT_ROOT,
    "data",
    "embeddings_corpus",
    "QUAERO_FrenchMed",
    "QUAERO_FrenchMed_traindev.ospl"
)

EMBEDDINGS_DIR = os.path.join(PROJECT_ROOT, "embeddings")
os.makedirs(EMBEDDINGS_DIR, exist_ok=True)

print("Medical corpus path:", MEDICAL_CORPUS_PATH)
print("Embeddings directory:", EMBEDDINGS_DIR)


Medical corpus path: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/data/embeddings_corpus/QUAERO_FrenchMed/QUAERO_FrenchMed_traindev.ospl
Embeddings directory: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/embeddings


## Load and Tokenize Medical Corpus

In [3]:
def load_tokenized_corpus(path):
    """Read a file with one sentence per line and split each line on whitespace."""
    sentences = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip empty lines
                sentences.append(line.split())
    return sentences


medical_sentences = load_tokenized_corpus(MEDICAL_CORPUS_PATH)

print(f"Number of medical sentences: {len(medical_sentences)}")
print("Example tokenized sentence:")
medical_sentences[0][:20]


Number of medical sentences: 3021
Example tokenized sentence:


['EMEA', '/', 'H', '/', 'C', '/', '551']
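
Because the lab fixes `min_count=1`, every token gets its own vocabulary entry, including tokens seen only once. A quick frequency count can show how many such tokens a corpus contains. The sketch below is a hedged illustration: it uses a small made-up `toy_sentences` list in place of `medical_sentences`, and the helper name `frequency_summary` is hypothetical.

```python
from collections import Counter


def frequency_summary(sentences):
    """Count tokens, distinct types, and types occurring exactly once."""
    counts = Counter(token for sentence in sentences for token in sentence)
    hapaxes = [tok for tok, c in counts.items() if c == 1]
    return {
        "tokens": sum(counts.values()),
        "types": len(counts),
        "hapaxes": len(hapaxes),
    }


# Toy data standing in for the real corpus (assumption for illustration)
toy_sentences = [
    ["le", "patient", "reçoit", "un", "traitement"],
    ["le", "traitement", "est", "efficace"],
]
summary = frequency_summary(toy_sentences)
print(summary)
```

Calling `frequency_summary(medical_sentences)` in the notebook would give the corresponding figures for the real corpus.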

## Training Configuration

In [4]:
EMBEDDING_DIM = 100
MIN_COUNT = 1
WINDOW_SIZE = 5
EPOCHS = 10
N_WORKERS = multiprocessing.cpu_count()

print("Embedding dimension:", EMBEDDING_DIM)
print("Min count:", MIN_COUNT)
print("Window size:", WINDOW_SIZE)
print("Epochs:", EPOCHS)
print("Workers:", N_WORKERS)


Embedding dimension: 100
Min count: 1
Window size: 5
Epochs: 10
Workers: 12


## Train FastText (CBOW)

In [5]:
print("Training FastText CBOW model on medical corpus...")

fasttext_medical = FastText(
    sentences=medical_sentences,
    vector_size=EMBEDDING_DIM,
    window=WINDOW_SIZE,
    min_count=MIN_COUNT,
    sg=0,             # CBOW
    workers=N_WORKERS,
    epochs=EPOCHS
)

print("FastText training completed.")


Training FastText CBOW model on medical corpus...
FastText training completed.


## Save Model and Vectors

In [6]:
model_path = os.path.join(EMBEDDINGS_DIR, "fasttext_medical_cbow.model")
vectors_path = os.path.join(EMBEDDINGS_DIR, "fasttext_medical_cbow.vec")

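# Full gensim model, including the n-gram vectors needed to embed unseen words
fasttext_medical.save(model_path)
# Word2Vec text format: one vector per vocabulary word, without the n-gram vectors
fasttext_medical.wv.save_word2vec_format(vectors_path)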

print("FastText model saved to:", model_path)
print("FastText vectors saved to:", vectors_path)


FastText model saved to: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/embeddings/fasttext_medical_cbow.model
FastText vectors saved to: /Users/ilyessais/Documents/Lab M2/Text Mining & Chatbots/text-mining-chatbots-lab3/TP_ISD2020/embeddings/fasttext_medical_cbow.vec


## Vocabulary Size

In [7]:
print("FastText medical vocabulary size:", len(fasttext_medical.wv))


FastText medical vocabulary size: 9104


## Semantic Check (Medical Words)

In [8]:
test_words = ["patient", "traitement", "maladie", "solution", "jaune"]

for word in test_words:
    print(f"\nMost similar words to '{word}' (FastText – Medical):")
    for w, s in fasttext_medical.wv.most_similar(word, topn=5):
        print(f"  {w:<15} {s:.3f}")



Most similar words to 'patient' (FastText – Medical):
  Patient         0.999
  tremblements    0.999
  pansements      0.999
  patiente        0.999
  Tremblements    0.999

Most similar words to 'traitement' (FastText – Medical):
  Traitement      1.000
  Taaitement      1.000
  Allaitement     0.999
  allaitement     0.999
  traitements     0.999

Most similar words to 'maladie' (FastText – Medical):
  Maladie         1.000
  malade          1.000
  professionnelle 1.000
  professionnel   1.000
  hyrgathione     1.000

Most similar words to 'solution' (FastText – Medical):
  Dissolution     1.000
  évolution       0.999
  dilution        0.999
  Solution        0.999
  Evolution       0.999

Most similar words to 'jaune' (FastText – Medical):
  zone            1.000
  Une             1.000
  hexane          1.000
  Rhône           1.000
  crâne           1.000
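
The scores printed above are cosine similarities between word vectors. As a reminder of what that number measures, here is a minimal pure-Python sketch. The function name `cosine_similarity` and the toy vectors are illustrative assumptions; gensim computes the same quantity on the trained vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, different length
c = [3.0, 0.0, -1.0]  # orthogonal to a

print(round(cosine_similarity(a, b), 3))
print(round(cosine_similarity(a, c), 3))
```

Cosine similarity ignores vector length and depends only on direction. When almost every neighbor scores close to 1.000, as in the output above, the vectors point in nearly the same direction and the ranking carries little information.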


## Conclusion

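FastText embeddings trained on the medical corpus:
- cover the full **domain-specific vocabulary** (9,104 entries, since `min_count=1` keeps every token)
- rely on character n-grams, so they can produce a vector for rare or unseen words, whereas Word2Vec has no vector for out-of-vocabulary tokens
- currently reflect spelling more than meaning: the nearest neighbors in the semantic check are mostly case variants or words sharing a suffix (for `solution`: `Dissolution`, `dilution`; for `jaune`: `zone`, `crâne`), and nearly all scores fall between 0.999 and 1.000, which suggests the vectors are poorly separated on this small corpus (3,021 sentences)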
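
The ability to embed unseen words comes from composing n-gram vectors. The sketch below shows the idea in simplified form: an unseen word is split into n-grams, each n-gram is hashed to a bucket, and the bucket vectors are averaged. Everything here is an assumption for illustration (`bucket_vector` returns deterministic toy values instead of learned weights, and `DIM`, `BUCKETS`, and the CRC32 hash are arbitrary choices), so it is not how gensim stores or hashes its vectors.

```python
import zlib

DIM = 4         # toy embedding dimension (the notebook uses 100)
BUCKETS = 1000  # toy number of hash buckets


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def bucket_vector(ngram):
    """Toy stand-in for a learned n-gram vector, derived from a CRC32 hash."""
    bucket = zlib.crc32(ngram.encode("utf-8")) % BUCKETS
    return [((bucket * (i + 1)) % 7) / 7.0 for i in range(DIM)]


def oov_vector(word):
    """Average the vectors of all n-grams of a word absent from the vocabulary."""
    vectors = [bucket_vector(g) for g in char_ngrams(word)]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


vec = oov_vector("hépatotoxicité")
print(len(vec), vec)
```

With the trained model, the equivalent lookup is simply `fasttext_medical.wv[word]`, which returns a vector even when `word` is not in the vocabulary, provided the full model (not only the `.vec` file) is loaded.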

These embeddings will be:
- compared with Word2Vec embeddings
- evaluated on semantic similarity
- used in TP2 for NER experiments
