# TP2 Named Entity Recognition
## LSTM with pretrained Word2Vec embeddings

In this notebook, we train a Named Entity Recognition (NER) model using an LSTM
architecture initialized with **pretrained Word2Vec embeddings** learned during TP1.

The objective is to evaluate the impact of domain-specific pretrained embeddings
compared to random embeddings on the NER task.

We rely strictly on the **official `cnn_classification.py` script (LSTM/CNN) provided
by the instructor**; the only step we adapt is the embedding initialization.

The script is executed as an external program and **is not modified**.
All experiments are launched by passing the arguments it expects.
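Launching the script as an external program can also be sketched with `subprocess` instead of the `!python` shell escape. The helper name `build_command` is hypothetical; the flags (`--model`, `--train`, `--valid`, `--test`, `--epochs`) are only those that appear in the commands run later in this notebook, and no other option of the script is assumed.

```python
import subprocess
import sys


def build_command(script, model, train, valid, test, epochs):
    """Return the argument list for one run of the training script.

    Hypothetical helper: it only assembles the flags already used in
    this notebook and does not know about any other script option.
    """
    return [
        sys.executable, script,
        "--model", model,
        "--train", train,
        "--valid", valid,
        "--test", test,
        "--epochs", str(epochs),
    ]


cmd = build_command(
    script="../../scripts/cnn_classification.py",
    model="lstm",
    train="../../data/ner_processed/final/emea_train.csv",
    valid="../../data/ner_processed/final/emea_dev.csv",
    test="../../data/ner_processed/final/emea_test.csv",
    epochs=25,
)
print(" ".join(cmd))

# To actually launch the run from the notebook, uncomment:
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` would raise if the script exits with an error.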


## Imports

In [26]:
import os
import numpy as np
import pandas as pd
from gensim.models import Word2Vec, KeyedVectors


In [27]:
DATA_DIR = "../../data/ner_processed/final"

TRAIN_PATH = os.path.join(DATA_DIR, "emea_train.csv")
DEV_PATH   = os.path.join(DATA_DIR, "emea_dev.csv")
TEST_PATH  = os.path.join(DATA_DIR, "emea_test.csv")

# Word2Vec model trained in TP1 (medical corpus)
W2V_MODEL_PATH = "../../embeddings/word2vec_medical_cbow.model"

EMBEDDING_DIM = 100


## Load NER data

In [28]:
train_df = pd.read_csv(TRAIN_PATH)
dev_df   = pd.read_csv(DEV_PATH)
test_df  = pd.read_csv(TEST_PATH)

print("Train size:", train_df.shape)
print("Dev size:", dev_df.shape)
print("Test size:", test_df.shape)

train_df.head()


Train size: (706, 2)
Dev size: (649, 2)
Test size: (578, 2)


|   | review | label |
|---|--------|-------|
| 0 | PRIALT | 1 |
| 1 | EMEA / H / C / 551 | 0 |
| 2 | Qu ’ est ce que Prialt ? | 1 |
| 3 | Prialt est une solution pour perfusion contena... | 1 |
| 4 | Dans quel cas Prialt est - il utilisé ? | 1 |


## Load Word2Vec model

In [29]:
w2v_model = Word2Vec.load(W2V_MODEL_PATH)
w2v = w2v_model.wv

print("Word2Vec vocabulary size:", len(w2v))
print("Embedding dimension:", w2v.vector_size)


Word2Vec vocabulary size: 9104
Embedding dimension: 100


## Build vocabulary from NER data

In [30]:
def get_vocab_from_df(df):
    vocab = set()
    for sent in df["review"]:
        for w in sent.split():
            vocab.add(w)
    return vocab

train_vocab = get_vocab_from_df(train_df)
covered = [w for w in train_vocab if w in w2v]
oov = [w for w in train_vocab if w not in w2v]

print(f"Vocabulary size (NER train): {len(train_vocab)}")
print(f"Covered by Word2Vec: {len(covered)}")
print(f"OOV words: {len(oov)}")
print(f"Coverage ratio: {len(covered) / len(train_vocab):.2%}")


Vocabulary size (NER train): 2599
Covered by Word2Vec: 2599
OOV words: 0
Coverage ratio: 100.00%


## Inspect nearest neighbours in the Word2Vec space

In [31]:
medical_words = ["patient", "traitement", "maladie", "solution"]

for word in medical_words:
    if word in w2v:
        print(f"\nMost similar words to '{word}':")
        for w, s in w2v.most_similar(word, topn=5):
            print(f"  {w:15s} {s:.3f}")
    else:
        print(f"\n'{word}' not in vocabulary")



Most similar words to 'patient':
  cette           0.999
  Le              0.999
  produit         0.999
  plus            0.999
  qui             0.999

Most similar words to 'traitement':
  que             0.995
  devra           0.995
  médecin         0.995
  TYSABRI         0.994
  qu              0.993

Most similar words to 'maladie':
  évolution       0.998
  du              0.998
  administration  0.998
  souris          0.998
  la              0.998

Most similar words to 'solution':
  poudre          0.998
  contient        0.997
  Chaque          0.997
  diluer          0.996
  flacon          0.996
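The save step that follows expects an `embedding_matrix`, but no cell shown here constructs it. A minimal sketch of one way to build such a matrix is given below, under stated assumptions: `build_embedding_matrix` and `word_to_index` are hypothetical names, and the word-to-index mapping would have to match the one used by the training script (its log mentions a saved vocabulary pickle, whose format is not shown here). In the notebook, `vectors` would be `w2v` and `dim` would be `EMBEDDING_DIM`; the toy dictionary only stands in for them.

```python
import numpy as np


def build_embedding_matrix(word_to_index, vectors, dim, seed=0):
    """Build a (vocab_size, dim) matrix from pretrained vectors.

    `vectors` is anything supporting `word in vectors` and `vectors[word]`
    (a gensim KeyedVectors object does, and so does a plain dict).
    Words without a pretrained vector keep a small random initialization.
    """
    rng = np.random.default_rng(seed)
    matrix = rng.normal(0.0, 0.1, size=(len(word_to_index), dim)).astype(np.float32)
    for word, idx in word_to_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# Toy stand-ins (assumptions, not the notebook's real data).
toy_vectors = {
    "patient": np.ones(4, dtype=np.float32),
    "solution": np.full(4, 2.0, dtype=np.float32),
}
toy_index = {"patient": 0, "solution": 1, "inconnu": 2}

toy_matrix = build_embedding_matrix(toy_index, toy_vectors, dim=4)
print(toy_matrix.shape)
```

Random initialization for missing words is a design choice of this sketch; with the 100% coverage reported above it would not be triggered on the training vocabulary.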


## Save embedding matrix for LSTM script

In [32]:
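OUTPUT_DIR = "../../embeddings/ner"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# NOTE: `embedding_matrix` is not defined in any cell shown above; it must
# already exist in the kernel (expected shape: vocabulary size x EMBEDDING_DIM).
np.save(
    os.path.join(OUTPUT_DIR, "word2vec_medical_lstm_embeddings.npy"),
    embedding_matrix
)

print("Embedding matrix saved successfully.")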


Embedding matrix saved successfully.
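Before handing the file to another program, it can be worth reloading it to confirm that shape and dtype survived the round trip. The sketch below uses a toy array and a temporary directory as stand-ins; in the notebook the array would be `embedding_matrix` and the path the one used in the cell above.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for the real embedding matrix (assumption).
matrix = np.arange(12, dtype=np.float32).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_embeddings.npy")
    np.save(path, matrix)
    reloaded = np.load(path)  # read fully into memory before the directory is removed

# The reloaded array should match the original exactly.
print(reloaded.shape, reloaded.dtype)
print(np.array_equal(matrix, reloaded))
```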


## File check

In [33]:
import os

TRAIN = "../../data/ner_processed/final/emea_train.csv"
DEV   = "../../data/ner_processed/final/emea_dev.csv"
TEST  = "../../data/ner_processed/final/emea_test.csv"

SCRIPT = "../../scripts/cnn_classification.py"

print("Train exists:", os.path.exists(TRAIN))
print("Dev exists:", os.path.exists(DEV))
print("Test exists:", os.path.exists(TEST))
print("Script exists:", os.path.exists(SCRIPT))


Train exists: True
Dev exists: True
Test exists: True
Script exists: True


## Run the medical LSTM with random embeddings

In [42]:
!python ../../scripts/cnn_classification.py \
    --model lstm \
    --train ../../data/ner_processed/final/emea_train.csv \
    --valid ../../data/ner_processed/final/emea_dev.csv \
    --test ../../data/ner_processed/final/emea_test.csv \
    --epochs 25


loading files...
Merging files...
Building vocab...
Encoding reviews...
100%|█████████████████████████████████████| 706/706 [00:00<00:00, 435531.49it/s]
100%|█████████████████████████████████████| 578/578 [00:00<00:00, 553621.31it/s]
100%|█████████████████████████████████████| 649/649 [00:00<00:00, 538177.80it/s]
[OK] Vocabulary saved to ../../data/ner_processed/final/emea_train.csv_vocab.pkl
Vocabulary size: 4590
Feature Shapes:
Train set: (706, 128)
Validation set: (578, 128)
Test set: (649, 128)
Taille vocabulaire 4590
SentimentModelLSTM(
  (embedding): Embedding(4590, 100)
  (lstm): LSTM(100, 128, batch_first=True, dropout=0.25)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=128, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
Epoch 1/25 | Train Loss: 0.654 Train Acc: 0.787 | Val Loss: 0.614 Val Acc: 0.893
Epoch 2/25 | Train Loss: 0.584 Train Acc: 0.902 | Val Loss: 0.548 Val Acc: 0.891
Epoch 3/25 | Train Loss: 0.520 Train Acc: 0.898 | Val Loss: 0.492 Val 
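Since the goal is to compare runs, the per-epoch lines of the log can be turned into numbers. The sketch below assumes the line format printed above (`Epoch i/N | Train Loss: ... Train Acc: ... | Val Loss: ... Val Acc: ...`); `parse_epoch_lines` is a hypothetical helper, and incomplete lines are simply skipped.

```python
import re

# Pattern for one complete epoch line, as printed in the log above.
EPOCH_RE = re.compile(
    r"Epoch (\d+)/(\d+) \| "
    r"Train Loss: ([\d.]+) Train Acc: ([\d.]+) \| "
    r"Val Loss: ([\d.]+) Val Acc: ([\d.]+)"
)


def parse_epoch_lines(log_text):
    """Return one dict per complete epoch line found in `log_text`."""
    rows = []
    for match in EPOCH_RE.finditer(log_text):
        epoch, _total, tr_loss, tr_acc, val_loss, val_acc = match.groups()
        rows.append({
            "epoch": int(epoch),
            "train_loss": float(tr_loss),
            "train_acc": float(tr_acc),
            "val_loss": float(val_loss),
            "val_acc": float(val_acc),
        })
    return rows


# Lines copied from the log above; the last one is cut off and is ignored.
sample_log = """Epoch 1/25 | Train Loss: 0.654 Train Acc: 0.787 | Val Loss: 0.614 Val Acc: 0.893
Epoch 2/25 | Train Loss: 0.584 Train Acc: 0.902 | Val Loss: 0.548 Val Acc: 0.891
Epoch 3/25 | Train Loss: 0.520 Train Acc: 0.898 | Val Loss: 0.492 Val """

history = parse_epoch_lines(sample_log)
best = max(history, key=lambda row: row["val_acc"])
print(len(history), best["epoch"], best["val_acc"])
```

Running the same parser on the output of each experiment gives comparable validation curves for the random and pretrained settings.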

## Press LSTM with Word2Vec embeddings

In [43]:
!python ../../scripts/cnn_classification.py \
    --model lstm \
    --train ../../data/ner_processed/final/press_train_final.csv \
    --valid ../../data/ner_processed/final/press_dev_final.csv \
    --test ../../data/ner_processed/final/press_test_final.csv \
    --epochs 25


loading files...
Merging files...
Building vocab...
Encoding reviews...
100%|█████████████████████████████████| 35723/35723 [00:00<00:00, 232914.38it/s]
100%|███████████████████████████████████| 2880/2880 [00:00<00:00, 271469.89it/s]
100%|███████████████████████████████████| 2825/2825 [00:00<00:00, 277368.59it/s]
[OK] Vocabulary saved to ../../data/ner_processed/final/press_train_final.csv_vocab.pkl
Vocabulary size: 32002
Feature Shapes:
Train set: (35723, 128)
Validation set: (2880, 128)
Test set: (2825, 128)
Taille vocabulaire 32002
SentimentModelLSTM(
  (embedding): Embedding(32002, 100)
  (lstm): LSTM(100, 128, batch_first=True, dropout=0.25)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=128, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
Epoch 1/25 | Train Loss: 0.580 Train Acc: 0.732 | Val Loss: 0.801 Val Acc: 0.537
Epoch 2/25 | Train Loss: 0.573 Train Acc: 0.745 | Val Loss: 0.790 Val Acc: 0.541
Epoch 3/25 | Train Loss: 0.571 Train Acc: 0.746 | Val Lo