# TP2 Named Entity Recognition
## LSTM model with random embeddings

In this notebook, we train a Named Entity Recognition (NER) model using a **BiLSTM-based architecture** with **randomly initialized embeddings**.

The goal is to establish a **baseline model**, which will later be compared with:
- models using pretrained embeddings from TP1,
- Transformer-based models (CamemBERT / BERT).

We strictly rely on the scripts provided by the instructor and only adapt the input data.


## Imports & environment

In [16]:
import ast  # needed by parse_labels (ast.literal_eval)
import os
import pandas as pd
import numpy as np


In [23]:
DATA_DIR = "../../data/ner_processed"

train_path = os.path.join(DATA_DIR, "emea_train.csv")
dev_path   = os.path.join(DATA_DIR, "emea_dev.csv")
test_path  = os.path.join(DATA_DIR, "emea_test.csv")

train_df = pd.read_csv(train_path)
dev_df   = pd.read_csv(dev_path)
test_df  = pd.read_csv(test_path)

print("Train:", train_df.shape)
print("Dev:", dev_df.shape)
print("Test:", test_df.shape)

train_df.head()


Train: (706, 2)
Dev: (649, 2)
Test: (578, 2)


Unnamed: 0,review,label
0,PRIALT,['B-CHEM']
1,EMEA / H / C / 551,"['O', 'O', 'O', 'O', 'O', 'O', 'O']"
2,Qu ’ est ce que Prialt ?,"['O', 'O', 'O', 'O', 'O', 'B-CHEM', 'O']"
3,Prialt est une solution pour perfusion contena...,"['B-CHEM', 'O', 'O', 'B-CHEM', 'O', 'B-PROC', ..."
4,Dans quel cas Prialt est - il utilisé ?,"['O', 'O', 'O', 'B-CHEM', 'O', 'O', 'O', 'O', ..."


## Load CSV files

In [24]:
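def parse_labels(df):
    """Return a copy of df whose "label" column holds Python lists of tags."""
    df = df.copy()

    def safe_parse(x):
        # Already a list: nothing to do.
        if isinstance(x, list):
            return x
        if isinstance(x, str):
            try:
                # Labels are stored as stringified lists, e.g. "['O', 'B-CHEM']".
                return ast.literal_eval(x)
            except (ValueError, SyntaxError):
                # Not a valid Python literal: treat the raw string as a single tag.
                return [x]
        return [x]

    df["label"] = df["label"].apply(safe_parse)
    return df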
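
As background for this parsing step: a list written to a CSV file is stored as its Python string representation, so it comes back as a plain string when the file is read. The sketch below illustrates this round trip with the standard library only; the sample row is copied from the preview above, and the variable names are illustrative assumptions rather than part of the instructor's scripts.

```python
import ast
import csv
import io

# Sample row taken from the data preview: a sentence and its tag list.
row = ["Qu ’ est ce que Prialt ?", ["O", "O", "O", "O", "O", "B-CHEM", "O"]]

# Writing to CSV stringifies the list with str().
buffer = io.StringIO()
csv.writer(buffer).writerow(row)

# Reading it back yields a string, not a list.
buffer.seek(0)
review, raw_label = next(csv.reader(buffer))
print(type(raw_label).__name__)

# ast.literal_eval safely turns the string back into a Python list.
parsed_label = ast.literal_eval(raw_label)
print(parsed_label)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for data read from disk.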


In [25]:
train_df = parse_labels(train_df)
dev_df   = parse_labels(dev_df)
test_df  = parse_labels(test_df)


print("Train size:", len(train_df))
print("Dev size:", len(dev_df))
print("Test size:", len(test_df))

train_df.head()


Train size: 706
Dev size: 649
Test size: 578


Unnamed: 0,review,label
0,PRIALT,[B-CHEM]
1,EMEA / H / C / 551,"[O, O, O, O, O, O, O]"
2,Qu ’ est ce que Prialt ?,"[O, O, O, O, O, B-CHEM, O]"
3,Prialt est une solution pour perfusion contena...,"[B-CHEM, O, O, B-CHEM, O, B-PROC, O, O, B-CHEM..."
4,Dans quel cas Prialt est - il utilisé ?,"[O, O, O, B-CHEM, O, O, O, O, O]"


## Convert labels from string to list

## Verify token-label alignment

In [27]:
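def check_alignment(df, n=5):
    """Print tokens and labels of the first n rows so their lengths can be compared."""
    # min() avoids an IndexError when the frame has fewer than n rows.
    for i in range(min(n, len(df))):
        # Sentences are stored pre-tokenized, with tokens separated by whitespace.
        tokens = df.iloc[i]["review"].split()
        labels = df.iloc[i]["label"]

        print(f"Sentence {i}")
        print("Tokens:", tokens)
        print("Labels:", labels)
        print("Lengths:", len(tokens), len(labels))
        print("-" * 40)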
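
`check_alignment` only prints the first few sentences for visual inspection. A possible complement is a check over every row that returns the indices where the token count and the label count differ. The helper `find_misaligned` and the sample data below are hypothetical and only meant as a sketch.

```python
def find_misaligned(sentences, label_lists):
    """Return the indices where the number of tokens and labels differ."""
    bad = []
    for i, (sentence, labels) in enumerate(zip(sentences, label_lists)):
        # Same whitespace tokenization as check_alignment.
        if len(sentence.split()) != len(labels):
            bad.append(i)
    return bad


# Hypothetical sample: the last entry is deliberately misaligned.
sentences = ["PRIALT", "EMEA / H / C / 551", "Prialt est utilisé"]
label_lists = [["B-CHEM"], ["O"] * 7, ["B-CHEM", "O"]]

misaligned = find_misaligned(sentences, label_lists)
print(misaligned)
```

On the notebook's data this could be called as `find_misaligned(train_df["review"], train_df["label"])`, provided it runs while the `label` column still holds tag lists.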


## Save files in final format

In [12]:
OUTPUT_DIR = "../../data/ner_processed/final"
os.makedirs(OUTPUT_DIR, exist_ok=True)

train_df.to_csv(os.path.join(OUTPUT_DIR, "emea_train.csv"), index=False)
dev_df.to_csv(os.path.join(OUTPUT_DIR, "emea_dev.csv"), index=False)
test_df.to_csv(os.path.join(OUTPUT_DIR, "emea_test.csv"), index=False)

print("Final CSV files saved.")


Final CSV files saved.


In [28]:
check_alignment(train_df)


Sentence 0
Tokens: ['PRIALT']
Labels: ['B-CHEM']
Lengths: 1 1
----------------------------------------
Sentence 1
Tokens: ['EMEA', '/', 'H', '/', 'C', '/', '551']
Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O']
Lengths: 7 7
----------------------------------------
Sentence 2
Tokens: ['Qu', '’', 'est', 'ce', 'que', 'Prialt', '?']
Labels: ['O', 'O', 'O', 'O', 'O', 'B-CHEM', 'O']
Lengths: 7 7
----------------------------------------
Sentence 3
Tokens: ['Prialt', 'est', 'une', 'solution', 'pour', 'perfusion', 'contenant', 'le', 'principe', 'actif', 'ziconotide', ',', 'à', 'des', 'concentrations', 'de', '100', 'ou', '25', 'microgrammes', 'par', 'millilitre', '.']
Labels: ['B-CHEM', 'O', 'O', 'B-CHEM', 'O', 'B-PROC', 'O', 'O', 'B-CHEM', 'I-CHEM', 'B-CHEM', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Lengths: 23 23
----------------------------------------
Sentence 4
Tokens: ['Dans', 'quel', 'cas', 'Prialt', 'est', '-', 'il', 'utilisé', '?']
Labels: ['O', 'O', 'O', 'B-CHEM', 'O', 'O', 'O', 'O', 'O']
Lengths: 9 9
----------------------------------------

In [29]:
def ner_to_sentence_label(ner_labels):
    """Collapse a tag sequence to a binary label: 1 if any tag is an entity, else 0."""
    for tag in ner_labels:
        if tag != "O":
            return 1
    return 0
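
To make the effect of this conversion concrete, here is a small self-contained illustration. The function is repeated so the snippet runs on its own, and the tag sequences are taken from the previews earlier in the notebook.

```python
def ner_to_sentence_label(ner_labels):
    # 1 as soon as one tag marks an entity, 0 if every tag is "O".
    for tag in ner_labels:
        if tag != "O":
            return 1
    return 0


examples = [
    ["B-CHEM"],                                   # "PRIALT"
    ["O", "O", "O", "O", "O", "O", "O"],          # "EMEA / H / C / 551"
    ["O", "O", "O", "O", "O", "B-CHEM", "O"],     # "Qu ’ est ce que Prialt ?"
]

sentence_labels = [ner_to_sentence_label(tags) for tags in examples]
print(sentence_labels)
```

Note that the conversion discards the entity type and position: after it, each sentence only records whether it contains at least one entity.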


In [30]:
for df in [train_df, dev_df, test_df]:
    df["label"] = df["label"].apply(ner_to_sentence_label)


In [31]:
train_df.head()


Unnamed: 0,review,label
0,PRIALT,1
1,EMEA / H / C / 551,0
2,Qu ’ est ce que Prialt ?,1
3,Prialt est une solution pour perfusion contena...,1
4,Dans quel cas Prialt est - il utilisé ?,1


In [33]:
FINAL_DIR = "../../data/ner_processed/final"
os.makedirs(FINAL_DIR, exist_ok=True)

train_df.to_csv(os.path.join(FINAL_DIR, "emea_train.csv"), index=False)
dev_df.to_csv(os.path.join(FINAL_DIR, "emea_dev.csv"), index=False)
test_df.to_csv(os.path.join(FINAL_DIR, "emea_test.csv"), index=False)

print("Final datasets saved.")


Final datasets saved.


## Ready for model training

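The token-level NER tags have been collapsed into one binary label per sentence (1 if the sentence contains at least one entity, 0 otherwise), so the data is now compatible with the LSTM/CNN script provided by the instructor.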

Next steps:
- Run the script `cnn_classification.py` with the LSTM option
- Use **random embeddings** (baseline)
- Evaluate results using precision, recall, and F1-score

No modification has been applied to the original training script.
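
For reference, the metrics named in the next steps can be written out for binary labels as below. This is only a sketch of the standard definitions, with the entity-bearing class (label 1) as the positive class. The helper `binary_prf` and the gold and predicted labels are hypothetical, and the instructor's script may compute or average its scores differently.

```python
def binary_prf(gold, pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical gold labels and predictions, for illustration only.
gold = [1, 0, 1, 1, 0]
pred = [1, 1, 1, 0, 0]

precision, recall, f1 = binary_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```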
