# TP1 Corpus Exploration
## Semantic Representations: Word Embeddings

**Objective**

The goal of this notebook is to explore and analyze the corpora provided for TP1:
- **QUAERO_FrenchMed** (medical domain, small corpus)
- **QUAERO_FrenchPress** (general domain, large corpus)

This exploration aims to:
- understand corpus size and structure,
- analyze vocabulary statistics,
- compare medical vs non-medical language,
- prepare the data for training word embeddings.



## Imports

In [9]:
import os
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np


## Paths Configuration

In [26]:
BASE_DATA_DIR = "../../data/embeddings_corpus"

MEDICAL_CORPUS = os.path.join(
    BASE_DATA_DIR,
    "QUAERO_FrenchMed",
    "QUAERO_FrenchMed_traindev.ospl"
)

PRESS_CORPUS = os.path.join(
    BASE_DATA_DIR,
    "QUAERO_FrenchPress",
    "QUAERO_FrenchPress_traindev.ospl"
)

print("Medical corpus found:", os.path.exists(MEDICAL_CORPUS))
print("Press corpus found:", os.path.exists(PRESS_CORPUS))


Medical corpus found: True
Press corpus found: True


## Utility Functions

In [27]:
def load_corpus(path):
    """
    Load a corpus with one sentence per line and tokens separated by spaces.
    Blank lines are skipped.
    """
    if not os.path.exists(path):
        raise FileNotFoundError(f"File not found: {path}")

    with open(path, "r", encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]

    return sentences


## Load Corpora

In [28]:
medical_sentences = load_corpus(MEDICAL_CORPUS)
press_sentences = load_corpus(PRESS_CORPUS)

print(f"Medical corpus: {len(medical_sentences)} sentences")
print(f"Press corpus: {len(press_sentences)} sentences")


Medical corpus: 3021 sentences
Press corpus: 38548 sentences


## Example sentences

In [29]:
print("Example sentence from the medical corpus:\n")
print(medical_sentences[0])

print("\nExample sentence from the press corpus:\n")
print(press_sentences[0])


Example sentence from the medical corpus:

EMEA / H / C / 551

Example sentence from the press corpus:

Patricia Martin , que voici , que voilà ! oh , bonjour Nicolas Stoufflet .


## Sentence length analysis

In [30]:
def sentence_lengths(sentences):
    """Return the number of whitespace-separated tokens in each sentence."""
    return [len(sentence.split()) for sentence in sentences]

med_lengths = sentence_lengths(medical_sentences)
press_lengths = sentence_lengths(press_sentences)

print("Medical corpus statistics:")
print("  Average sentence length:", np.mean(med_lengths))
print("  Maximum sentence length:", np.max(med_lengths))

print("\nPress corpus statistics:")
print("  Average sentence length:", np.mean(press_lengths))
print("  Maximum sentence length:", np.max(press_lengths))


Medical corpus statistics:
  Average sentence length: 17.17841774246938
  Maximum sentence length: 121

Press corpus statistics:
  Average sentence length: 32.467598837812595
  Maximum sentence length: 617
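
The mean and the maximum alone can be misleading when a few very long lines are present (the press maximum of 617 tokens is far above its mean of about 32). A minimal sketch of a more robust summary follows. The helper name `length_summary` and the nearest-rank 95th percentile are assumptions made for illustration and are not part of the original notebook. It expects a non-empty list of sentences.

```python
import statistics


def length_summary(sentences):
    """Summarize sentence lengths, measured in whitespace-separated tokens.

    Hypothetical helper: assumes `sentences` is a non-empty list of strings.
    """
    lengths = sorted(len(sentence.split()) for sentence in sentences)
    n = len(lengths)
    # Nearest-rank 95th percentile: index ceil(0.95 * n) - 1, in integer arithmetic.
    p95_index = (95 * n + 99) // 100 - 1
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "p95": lengths[p95_index],
        "max": lengths[-1],
    }


# Toy data for illustration only, not taken from the QUAERO corpora.
toy_sentences = ["a b c", "a", "a b", "a b c d e f"]
print(length_summary(toy_sentences))
```

In the notebook, this could be called as `length_summary(medical_sentences)` and `length_summary(press_sentences)` to compare medians and upper percentiles side by side.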


## Vocabulary construction

In [31]:
def build_vocabulary(sentences):
    """Count whitespace-separated tokens; punctuation is kept and case is preserved."""
    vocab = Counter()
    for sentence in sentences:
        vocab.update(sentence.split())
    return vocab

medical_vocab = build_vocabulary(medical_sentences)
press_vocab = build_vocabulary(press_sentences)

print("Medical vocabulary size:", len(medical_vocab))
print("Press vocabulary size:", len(press_vocab))


Medical vocabulary size: 9104
Press vocabulary size: 39654
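
Raw vocabulary sizes are hard to compare across corpora of very different sizes, because a larger corpus almost always yields more distinct tokens. The sketch below adds the total token count, the type/token ratio, and the number of tokens that occur only once. The helper name `lexical_stats` is an assumption introduced for illustration. It expects a non-empty `Counter` such as the ones returned by `build_vocabulary`.

```python
from collections import Counter


def lexical_stats(vocab):
    """Return basic lexical statistics for a token-frequency Counter.

    Hypothetical helper: assumes `vocab` contains at least one token.
    """
    n_tokens = sum(vocab.values())  # total number of running tokens
    n_types = len(vocab)            # number of distinct tokens
    hapax = sum(1 for count in vocab.values() if count == 1)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": n_types / n_tokens,
        "hapax": hapax,
    }


# Toy data for illustration only.
toy_vocab = Counter("le patient et le traitement".split())
print(lexical_stats(toy_vocab))
```

Note that the type/token ratio itself decreases as a corpus grows, so it is best read alongside the token counts rather than as a standalone measure.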


## Most frequent words

In [32]:
print("Most frequent words in the medical corpus:")
for word, freq in medical_vocab.most_common(10):
    print(word, freq)

print("\nMost frequent words in the press corpus:")
for word, freq in press_vocab.most_common(10):
    print(word, freq)


Most frequent words in the medical corpus:
. 2905
de 2506
, 1159
la 1103
' 1082
et 892
des 869
l 849
’ 815
d 757

Most frequent words in the press corpus:
, 70810
de 54090
. 38523
la 32421
le 27993
l' 23717
à 23219
et 21367
les 21308
est 17170
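
The two lists suggest that the corpora follow different tokenization conventions: the medical corpus contains both the straight apostrophe `'` and the typographic apostrophe `’` as standalone tokens, next to bare `l` and `d`, while the press corpus keeps elided forms such as `l'` as a single token. A minimal sketch of one possible normalization step is shown below. The helper name `normalize_apostrophes` is hypothetical, and the sketch only merges the two apostrophe characters; it does not reattach elided articles to the apostrophe.

```python
from collections import Counter


def normalize_apostrophes(vocab):
    """Merge counts of tokens that differ only by apostrophe style.

    Hypothetical helper: maps the typographic apostrophe (U+2019)
    to the ASCII apostrophe and sums the frequencies.
    """
    normalized = Counter()
    for token, freq in vocab.items():
        normalized[token.replace("\u2019", "'")] += freq
    return normalized


# Toy data for illustration only.
toy_vocab = Counter({"'": 3, "\u2019": 2, "l": 1})
print(normalize_apostrophes(toy_vocab).most_common())
```

Whether such a normalization is desirable depends on how the embeddings will be used downstream, so this is an optional preprocessing choice rather than a required step.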


## Target words analysis

In [33]:
candidate_words = ["patient", "traitement", "maladie", "solution", "jaune"]

print("Frequency of candidate words:\n")
for word in candidate_words:
    print(f"{word:12s} | Medical: {medical_vocab[word]:6d} | Press: {press_vocab[word]:6d}")


Frequency of candidate words:

patient      | Medical:     33 | Press:     11
traitement   | Medical:    251 | Press:     53
maladie      | Medical:     65 | Press:    114
solution     | Medical:     67 | Press:    100
jaune        | Medical:      9 | Press:     23
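
Because the press corpus is much larger than the medical one, raw counts understate how characteristic a word is of the medical domain. Frequencies per million tokens make the two columns comparable. The following is a minimal sketch; the helper name `per_million` is an assumption introduced for illustration, and it expects a non-empty frequency mapping.

```python
from collections import Counter


def per_million(vocab, word):
    """Relative frequency of `word`, expressed per million tokens.

    Hypothetical helper: `vocab` maps tokens to counts and must not be empty.
    Words absent from `vocab` get a frequency of 0.
    """
    total = sum(vocab.values())
    return 1_000_000 * vocab.get(word, 0) / total


# Toy data for illustration only.
toy_vocab = Counter({"patient": 1, "de": 3})
for word in ["patient", "jaune"]:
    print(f"{word:12s} | {per_million(toy_vocab, word):12.1f}")
```

In the notebook, the same loop over `candidate_words` could print `per_million(medical_vocab, word)` next to `per_million(press_vocab, word)`.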


## Vocabulary overlap & OOV analysis

In [34]:
# Dictionary key views support set operations directly.
medical_only = medical_vocab.keys() - press_vocab.keys()
press_only = press_vocab.keys() - medical_vocab.keys()

print("Words only in medical corpus:", len(medical_only))
print("Words only in press corpus:", len(press_only))


Words only in medical corpus: 5351
Words only in press corpus: 35901
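
The counts above imply a shared vocabulary of 3,753 types (9,104 - 5,351, or equivalently 39,654 - 35,901). To turn this into an out-of-vocabulary (OOV) rate, one can measure which share of one corpus is missing from the other, both by distinct token and by running token. The sketch below is one way to do this; the helper name `overlap_report` is hypothetical and it assumes a non-empty source vocabulary.

```python
from collections import Counter


def overlap_report(source_vocab, reference_vocab):
    """Measure how much of `source_vocab` is missing from `reference_vocab`.

    Hypothetical helper: both arguments map tokens to counts,
    and `source_vocab` must not be empty.
    """
    shared = source_vocab.keys() & reference_vocab.keys()
    oov_types = source_vocab.keys() - reference_vocab.keys()
    oov_tokens = sum(source_vocab[word] for word in oov_types)
    return {
        "shared_types": len(shared),
        # Share of distinct source tokens unseen in the reference corpus.
        "oov_type_rate": len(oov_types) / len(source_vocab),
        # Share of running source tokens unseen in the reference corpus.
        "oov_token_rate": oov_tokens / sum(source_vocab.values()),
    }


# Toy data for illustration only.
toy_source = Counter({"patient": 3, "posologie": 1})
toy_reference = Counter({"patient": 1, "journal": 1})
print(overlap_report(toy_source, toy_reference))
```

The token-level rate is usually the more informative figure: many OOV types are rare, so they may account for a much smaller share of the running text than the type-level rate suggests.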


## Conclusion

This corpus exploration highlights strong contrasts between the two datasets:

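- The **medical corpus** is smaller (3,021 sentences, 9,104 distinct tokens) and highly specialized: 5,351 of its distinct tokens never appear in the press corpus.
- The **press corpus** is much larger (38,548 sentences, 39,654 distinct tokens), more diverse, and covers general language usage.
- Sentences are shorter in the medical corpus (about 17 tokens on average, with a maximum of 121) than in the press corpus (about 32 tokens on average, with a maximum of 617).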

These observations motivate:
- training **separate word embeddings** for each corpus,
- comparing **Word2Vec (CBOW, Skip-gram)** and **FastText**,
- evaluating how corpus domain and size impact semantic similarity,
- and later assessing the impact of these embeddings on **NER performance in TP2**.
