# 🧠 Perfume Feature Extraction NER Model
This notebook trains a Named Entity Recognition (NER) model using **spaCy** to extract 8 structured features from English customer sentences about perfume preferences.

### Extracted Features:
- `AGE`: Age (normalized 0–1)
- `GENDER`: Gender (Male, Female, Unisex)
- `PERSONALITY`: Style type (e.g., Elegant, Sporty)
- `PREFERRED_ACCORD`: Preferred scent families (e.g., floral, woody)
- `USAGE_SITUATION`: Usage scenarios (e.g., Work, Date Night)
- `SILLAGE`: Scent trail (Short, Medium, Long)
- `LONGEVITY`: How long it lasts (Short, Medium, Long)
- `PRICE`: Price category (Affordable, Average, High-end)
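
The conversion cell in this notebook iterates over `(text, annotations)` pairs and reads `annotations["entities"]` as `(start, end, label)` triples, so a record in `ner_training_data.json` presumably has the shape sketched below. The example sentence and the helper `make_record` are hypothetical; only the record shape is inferred from the notebook.

```python
import json

def make_record(text, labelled_phrases):
    """Build one [text, {"entities": [...]}] record from (phrase, label) pairs.

    Offsets are character positions; the end offset is exclusive.
    """
    entities = []
    for phrase, label in labelled_phrases:
        start = text.find(phrase)  # first occurrence only
        if start == -1:
            raise ValueError(f"{phrase!r} not found in text")
        entities.append([start, start + len(phrase), label])
    return [text, {"entities": entities}]

# Hypothetical sentence, labelled with three of the labels listed above.
record = make_record(
    "I am 25 and I like woody scents for work",
    [("25", "AGE"), ("woody", "PREFERRED_ACCORD"), ("work", "USAGE_SITUATION")],
)
print(json.dumps(record))
```

Slicing the text with each `(start, end)` pair should give back the annotated phrase, which is a quick sanity check to run before training.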


In [1]:
!pip install -U spacy
!pip install -U "spacy[transformers]"




[notice] A new release of pip is available: 24.0 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip







In [2]:
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans
from spacy.training import Example
import json

In [3]:
# Load the training data from the JSON file
with open("ner_training_data.json", "r", encoding="utf-8") as f:
    TRAIN_DATA = json.load(f)
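
The training command in this notebook passes `train.spacy` as both `--paths.train` and `--paths.dev`, so the reported scores are measured on sentences the model has already seen. A minimal sketch of holding out a dev split before conversion follows; the helper `split_records`, the 80/20 ratio, and the seed are assumptions, not part of the original notebook.

```python
import random

def split_records(records, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the records and return (train, dev) lists."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]

# Hypothetical stand-in for TRAIN_DATA loaded from ner_training_data.json.
demo = [[f"sentence {i}", {"entities": []}] for i in range(10)]
train_records, dev_records = split_records(demo)
print(len(train_records), len(dev_records))
```

Each list could then be written to its own `DocBin` file with the same conversion loop, and the second file passed as `--paths.dev`.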

In [4]:
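# Only the tokenizer is used below (nlp.make_doc), so the transformer weights
# are never run; spacy.blank("en") would likely be enough here.
nlp = spacy.load("en_core_web_trf")
db = DocBin()

for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)  # tokenize only, no pipeline components
    ents = []
    for start, end, label in annotations["entities"]:
        # char_span returns None when the offsets do not align with token
        # boundaries; such annotations are silently skipped.
        span = doc.char_span(start, end, label=label)
        if span is not None:
            ents.append(span)
    # Drop overlapping spans, since doc.ents cannot contain overlaps.
    doc.ents = filter_spans(ents)
    db.add(doc)

# Serialize the annotated docs in spaCy's binary training format.
db.to_disk("train.spacy")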
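
The conversion loop relies on `filter_spans` to resolve overlapping annotations, preferring the longest span. The sketch below is a plain-Python approximation of that rule on character offsets, for illustration only: spaCy's function operates on token-level `Span` objects, and the helper `drop_overlaps` and the sample offsets are hypothetical.

```python
def drop_overlaps(entities):
    """Keep the longest non-overlapping (start, end, label) triples.

    Longer spans win; ties go to the span that starts first.
    """
    by_length = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    kept = []
    for start, end, label in by_length:
        # Keep a candidate only if it lies entirely before or after
        # every span kept so far (ranges are half-open).
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

# Hypothetical annotations: the span 6-10 sits inside the span 0-10.
spans = [(6, 10, "LONGEVITY"), (0, 10, "LONGEVITY"), (15, 20, "SILLAGE")]
print(drop_overlaps(spans))
```

Running a check like this over the raw JSON can show how many annotations would be discarded before they ever reach training.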



In [5]:
!python -m spacy init config config_trf.cfg --lang en --pipeline ner --optimize accuracy 


✘ The provided output file already exists. To force overwriting the
config file, set the --force or -F flag.



In [6]:
!python -m spacy train config.cfg --output ./output --paths.train train.spacy --paths.dev train.spacy --training.max_steps=300

ℹ Saving to output directory: output
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     38.12    0.00    0.00    0.00    0.00
  0     200        462.16   4088.65   87.46   89.04   85.93    0.87
✔ Saved pipeline to output directory
output\model-last
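
As a quick consistency check on the log above, `ENTS_F` should be the harmonic mean of `ENTS_P` and `ENTS_R`. The sketch below recomputes it from the step-200 row; the function name `f_score` is just an illustrative helper.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the step-200 row of the training log.
ents_p, ents_r = 89.04, 85.93
print(round(f_score(ents_p, ents_r), 2))  # 87.46, matching ENTS_F
```

The `SCORE` column of 0.87 appears to be the same figure expressed on a 0–1 scale.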


In [8]:
from ner_normalize import normalize_entity

nlp_ner = spacy.load("./output/model-last")
test_text = "I'm 20 years old Im male I want something that has aquatic scent Something that lasts long and has light sillage Something that is cheap also"

doc = nlp_ner(test_text)

for ent in doc.ents:
    raw_text = ent.text
    label = ent.label_
    normalized = normalize_entity(label, raw_text)
    print(f"{raw_text} → {normalized} ({label})")

20 → 20 (AGE)
male → men (GENDER)
aquatic → aquatic (PREFERRED_ACCORD)
lasts long → Long (LONGEVITY)
light sillage → Light (SILLAGE)
cheap → Affordable (PRICE)
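
The module `ner_normalize` is not shown in this notebook. The sketch below is a hypothetical lookup-based normalizer that reproduces the six mappings printed above; the function name `normalize_entity_sketch` and the keyword table are assumptions for illustration, not the author's implementation.

```python
import re

# Keyword -> canonical value, per label. Only mappings visible in the
# printed output are included; anything else falls back to the raw text.
KEYWORDS = {
    "GENDER": {"male": "men"},
    "LONGEVITY": {"long": "Long"},
    "SILLAGE": {"light": "Light"},
    "PRICE": {"cheap": "Affordable"},
}

def normalize_entity_sketch(label, raw_text):
    """Map an entity's raw text to a canonical value (hypothetical rules)."""
    text = raw_text.strip().lower()
    if label == "AGE":
        # Keep only the first run of digits, e.g. "20 years" -> "20".
        match = re.search(r"\d+", text)
        return match.group() if match else text
    for keyword, canonical in KEYWORDS.get(label, {}).items():
        # Whole-word match so that "male" does not fire inside "female".
        if re.search(rf"\b{re.escape(keyword)}\b", text):
            return canonical
    return text

samples = [
    ("AGE", "20"), ("GENDER", "male"), ("PREFERRED_ACCORD", "aquatic"),
    ("LONGEVITY", "lasts long"), ("SILLAGE", "light sillage"), ("PRICE", "cheap"),
]
for label, raw in samples:
    print(f"{raw} → {normalize_entity_sketch(label, raw)} ({label})")
```

A real normalizer would need fuller tables for each label, including the value sets listed at the top of the notebook.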
