**Imports**

In [1]:
import pandas as pd
import numpy as np
# spaCy import
import spacy

# Hugging Face imports
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

**Load the dataset for analysis**  

In [2]:
# The aggregated projects file is semicolon-separated
aggregatedProjectsDF = pd.read_csv('projectsAgg2.csv', sep=";")

### Testing a spaCy pre-built model (no additional training)

**Sample of a description**

In [16]:
print(aggregatedProjectsDF.iloc[88].description)

Dés les 1er mois, les enfants sont sensibles à beaucoup de choses. Un tel lieu serait un formidable espace de pédagogie et un véritable lieu d’exploration. Il permettrait de développer leur sensibilité à l’environnement. Ce serait aussi un espace où les professionnels de la petite enfance pourraient s’exprimer, initier des activités créatives. Ce lieu serait accessible aux professionnel(les) de la petite enfance mais aussi aux parents et grands-parents qui le souhaitent. Cet environnement serait un levier indispensable pour répondre aux besoins fondamentaux des bébés (cognitifs, émotionnels, psychologiques et d’expression par le langage). Le contact avec la nature conditionne le développement et le bien-être d'un enfant et d'autant plus pour celui d'un enfant citadin. Un tel endroit favoriserait sa curiosité, sa construction et son épanouissement. Au-delà même du jardinage, le jardin et la nature sont de formidables espaces de créativité. Un tel environnement permet tout à fait l’éveil

**Load the downloaded model**  
For reference, the package documentation is available at https://spacy.io/  
spaCy also provides Polish models, so the same entity recognition/extraction approach can be applied to the Polish datasets.

**This test checks the out-of-the-box entity recognition performance of spaCy**

In [19]:
# Load the small French pipeline
# (installed beforehand with: python -m spacy download fr_core_news_sm)
nlp = spacy.load("fr_core_news_sm")

In [20]:
text = aggregatedProjectsDF.iloc[88].description
doc = nlp(text)

In [21]:
for entity in doc.ents:
    print(entity.label_, "|", entity.text)

LOC | s’exprimer
MISC | d’
MISC | qu’
PER | Modalités
ORG | Avis


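This example shows that the default model performs poorly on this text: it labels elided fragments such as "d’" and "qu’" as entities and tags ordinary words as locations, persons and organizations. The model would need additional training (fine-tuning) before its entity recognition becomes usable.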
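
To put a number on this impression rather than judging by eye, the predicted entities can be compared against a small hand-annotated reference. The following is a minimal sketch and not part of the original notebook: the function name `score_entities` and the `gold` annotation are hypothetical, and exact matching on (label, text) pairs is an assumed, fairly strict criterion.

```python
def score_entities(predicted, gold):
    """Exact-match precision, recall and F1 over (label, text) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Entities printed by the spaCy cell above.
spacy_predictions = [
    ("LOC", "s’exprimer"),
    ("MISC", "d’"),
    ("MISC", "qu’"),
    ("PER", "Modalités"),
    ("ORG", "Avis"),
]

# Hypothetical hand annotation for the same description (illustration only).
gold = [("LOC", "parc de la Vache")]

print(score_entities(spacy_predictions, gold))
```

In the notebook, the predictions could be built directly from the spaCy document with `[(ent.label_, ent.text) for ent in doc.ents]`, and the same function can score any other model as long as its labels are mapped to the same tag set.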

### Testing with a Hugging Face pre-trained model

**Loading the model**

In [22]:
tokenizer = AutoTokenizer.from_pretrained("Babelscape/wikineural-multilingual-ner")
model = AutoModelForTokenClassification.from_pretrained("Babelscape/wikineural-multilingual-ner")

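# aggregation_strategy="simple" merges sub-word tokens into whole entities;
# it replaces the deprecated grouped_entities=True flag.
# Note: this rebinds nlp, which previously held the spaCy pipeline.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")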


Device set to use cuda:0


In [23]:
ner_results = nlp(text)
for result in ner_results:
    print(f"Entity group | {result['entity_group']} | Score--> {result['score']} | Word--> {result['word']} \n")

Entity group | LOC | Score--> 0.9806381464004517 | Word--> parc de la Vache 
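
The aggregated pipeline output is a list of dictionaries with the keys used in the print loop above (`entity_group`, `score`, `word`). As a minimal sketch, assuming a confidence cut-off is wanted before storing the results, the entities could be filtered and flattened into plain rows. The helper name `filter_entities` and the 0.9 threshold are illustrative assumptions, not values from the notebook.

```python
def filter_entities(ner_results, min_score=0.9):
    """Keep aggregated entities whose score is at least min_score."""
    return [
        {"label": r["entity_group"], "text": r["word"], "score": float(r["score"])}
        for r in ner_results
        if r["score"] >= min_score
    ]


# Stand-in for the pipeline output shown above (same keys as in the print loop).
sample_results = [
    {"entity_group": "LOC", "score": 0.9806381464004517, "word": "parc de la Vache"},
]

rows = filter_entities(sample_results, min_score=0.9)
print(rows)
```

The resulting rows can be passed to `pd.DataFrame(rows)` to inspect the entities of many descriptions side by side.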



Without clear instructions, the entities we care about are not detected: the model returns a single location for the whole description, so good-quality entity extraction is not possible this way. This raises a question, and a task to verify: if we define the entities we want and guide an LLM with a well-structured prompt, can we extract more relevant features?  
The idea is inspired by these articles:  
https://medium.com/@lucasmassucci/entity-recognition-with-llms-and-the-importance-of-prompt-engineering-all-languages-ceda8a7ff3e2  
https://medium.com/@manoranjan.rajguru/extracting-entities-from-unstructured-documents-using-large-language-models-f7f2c4d203ee  
With this approach we can try more capable models, for example DeepSeek or models served through Ollama.
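
One possible shape for such a prompt-driven extraction is sketched below. It is only an illustration under stated assumptions: the entity schema `ENTITY_TYPES`, the helper names `build_ner_prompt` and `parse_entities`, and the JSON answer format are all hypothetical choices, and the handling of `<think>` tags assumes a reasoning model that wraps its intermediate reasoning that way.

```python
import json
import re

# Hypothetical entity schema; adapt the types to the project descriptions.
ENTITY_TYPES = {
    "LOCATION": "named places such as parks, streets or districts",
    "ORGANIZATION": "associations, institutions or companies",
    "BENEFICIARY": "groups of people the project is aimed at",
}


def build_ner_prompt(text, entity_types=ENTITY_TYPES):
    """Build an instruction prompt that asks for entities as a JSON list."""
    type_lines = "\n".join(f"- {name}: {desc}" for name, desc in entity_types.items())
    return (
        "Extract the entities from the French text below.\n"
        f"Allowed entity types:\n{type_lines}\n"
        'Answer only with a JSON list of objects like {"type": "...", "text": "..."}.\n'
        "Return [] if there are none.\n\n"
        f"Text:\n{text}\n"
    )


def parse_entities(raw_output, entity_types=ENTITY_TYPES):
    """Parse the model answer, tolerating extra text around the JSON list."""
    # Assumption: reasoning models may wrap their reasoning in <think> tags.
    cleaned = re.sub(r"<think>.*?</think>", "", raw_output, flags=re.DOTALL)
    match = re.search(r"\[.*\]", cleaned, flags=re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Drop anything that does not follow the requested schema.
    return [
        item for item in items
        if isinstance(item, dict) and item.get("type") in entity_types and "text" in item
    ]


# Hand-written stand-in for a model answer, used only to exercise the parser.
fake_answer = '<think>...</think>[{"type": "LOCATION", "text": "parc de la Vache"}]'
print(parse_entities(fake_answer))
```

The string returned by `build_ner_prompt(text)` would be sent to a text-generation model, and the generated text passed to `parse_entities`. Validating the answer against the schema keeps malformed or off-schema output from reaching the rest of the analysis.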


**Using text generation with DeepSeek (plus prompt engineering)**  

In [None]:
# Use a separate name so the imported pipeline() function is not overwritten
generator = pipeline(task="text-generation", model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

config.json:   0%|          | 0.00/679 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.55G [00:00<?, ?B/s]

Error while downloading from https://cdn-lfs-us-1.hf.co/repos/80/3b/803bf3a0c970ed5c554d34586697cf1613396ec4158c96d5290742c7421a5d88/58858233513d76b8703e72eed6ce16807b523328188e13329257fb9594462945 (signed query string omitted)

model.safetensors:  24%|##3       | 1.11G/4.67G [00:00<?, ?B/s]