**Imports**

In [10]:
import pandas as pd
import numpy as np
import torch
# spaCy import
import spacy

# Hugging Face imports
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

**Load the dataset for analysis**  

In [2]:
aggregatedProjectsDF = pd.read_csv('projectsAgg2.csv',sep=";")

### Testing a spaCy pre-built model (no additional training)

**Sample of a description**

In [3]:
print(aggregatedProjectsDF.iloc[88].description)

Dés les 1er mois, les enfants sont sensibles à beaucoup de choses. Un tel lieu serait un formidable espace de pédagogie et un véritable lieu d’exploration. Il permettrait de développer leur sensibilité à l’environnement. Ce serait aussi un espace où les professionnels de la petite enfance pourraient s’exprimer, initier des activités créatives. Ce lieu serait accessible aux professionnel(les) de la petite enfance mais aussi aux parents et grands-parents qui le souhaitent. Cet environnement serait un levier indispensable pour répondre aux besoins fondamentaux des bébés (cognitifs, émotionnels, psychologiques et d’expression par le langage). Le contact avec la nature conditionne le développement et le bien-être d'un enfant et d'autant plus pour celui d'un enfant citadin. Un tel endroit favoriserait sa curiosité, sa construction et son épanouissement. Au-delà même du jardinage, le jardin et la nature sont de formidables espaces de créativité. Un tel environnement permet tout à fait l’éveil

**Load the downloaded model**  
For reference, the package documentation and structures can be found here: https://spacy.io/  
Note that spaCy also provides a Polish model, so the same entity recognition/extraction can be applied to the Polish datasets.

**This tests the out-of-the-box entity recognition performance of the spaCy package**

In [4]:
# Load the spaCy model for the French language
nlp = spacy.load("fr_core_news_sm")

In [5]:
text = aggregatedProjectsDF.iloc[88].description
doc = nlp(text)

In [6]:
for entity in doc.ents:
    print(entity.label_, "|", entity.text)

LOC | s’exprimer
MISC | d’
MISC | qu’
PER | Modalités
ORG | Avis


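In this example none of the detected entities is meaningful (for instance, the verb "s’exprimer" is tagged as a location), so the pre-built model would need additional training to give usable entity recognition.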
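
One way to put a number on this observation is to compare the predicted entities with a small hand-labelled reference. The sketch below is only an illustration: the `precision_recall` helper and the `gold` annotation are hypothetical and not part of the original analysis, and matching is done by exact (label, text) pairs.

```python
def precision_recall(predicted, gold):
    """Compare two collections of (label, text) pairs by exact match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Entities printed by the spaCy model in the cell above.
predicted = [
    ("LOC", "s’exprimer"),
    ("MISC", "d’"),
    ("MISC", "qu’"),
    ("PER", "Modalités"),
    ("ORG", "Avis"),
]

# HYPOTHETICAL hand-made annotation, for illustration only.
gold = [("LOC", "parc de la Vache")]

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With a larger annotated sample, the same two numbers could be tracked before and after any additional training.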

### Testing with a Hugging Face pre-trained model

**Loading the model**

In [7]:
tokenizer = AutoTokenizer.from_pretrained("Babelscape/wikineural-multilingual-ner")
model = AutoModelForTokenClassification.from_pretrained("Babelscape/wikineural-multilingual-ner")

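# Note: this rebinds `nlp`, replacing the spaCy pipeline loaded earlier.
# `aggregation_strategy="simple"` replaces the deprecated `grouped_entities=True`.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")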


Device set to use cuda:0


In [8]:
ner_results = nlp(text)
for result in ner_results:
    print(f"Entity group | {result['entity_group']} | Score--> {result['score']} | Word--> {result['word']} \n")

Entity group | LOC | Score--> 0.9806381464004517 | Word--> parc de la Vache 



We can see that, without clear instructions, the entities we care about are not detected, so good-quality entity extraction is not possible with this model as is. This raises a question, and a task to verify: if we define the entities we want and guide an LLM with a well-structured prompt, can we extract more relevant features?  
The inspiration for this comes from these articles:  
https://medium.com/@lucasmassucci/entity-recognition-with-llms-and-the-importance-of-prompt-engineering-all-languages-ceda8a7ff3e2  
https://medium.com/@manoranjan.rajguru/extracting-entities-from-unstructured-documents-using-large-language-models-f7f2c4d203ee  
With this approach we can try more capable models, for example through Ollama, DeepSeek, etc.
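
A minimal sketch of what such a structured prompt could look like is given below. It is not taken from the articles above: the entity types in `ENTITY_TYPES` and the helper names `build_ner_prompt` and `parse_entities` are hypothetical, and the JSON answer format is just one possible convention. The code only builds the prompt and parses an answer, so it works with whichever text-generation model ends up being used.

```python
import json

# HYPOTHETICAL entity types; replace with the categories relevant to the projects.
ENTITY_TYPES = ["LOCATION", "ORGANIZATION", "TARGET_AUDIENCE"]


def build_ner_prompt(text, entity_types=ENTITY_TYPES):
    """Build a prompt asking the model to return entities as a JSON list."""
    types = ", ".join(entity_types)
    return (
        "Extract the named entities from the text below.\n"
        f"Allowed entity types: {types}.\n"
        "Answer only with a JSON list of objects of the form "
        '{"type": "<entity type>", "text": "<exact span>"}.\n'
        "If there are no entities, answer with [].\n\n"
        f"Text:\n{text}"
    )


def parse_entities(response, entity_types=ENTITY_TYPES):
    """Parse the model's answer; return [] if it is not the expected JSON."""
    start, end = response.find("["), response.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        items = json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return []
    # Keep only well-formed items with an allowed entity type.
    return [
        item for item in items
        if isinstance(item, dict)
        and item.get("type") in entity_types
        and isinstance(item.get("text"), str)
    ]


# Hand-written answer used to exercise the parser (not real model output).
example_answer = 'Here you go: [{"type": "LOCATION", "text": "parc de la Vache"}]'
print(parse_entities(example_answer))
```

Restricting the answer to a fixed list of types and a strict JSON shape makes it possible to discard malformed generations instead of trying to interpret free text.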


**Using text generation with DeepSeek (plus prompt engineering)**  

In [18]:
torch.cuda.empty_cache()
print(type(torch.cuda.memory_stats()))
for k,v in torch.cuda.memory_stats().items():
    print(f"Key ->{k} || value ->{v}")

<class 'collections.OrderedDict'>
Key ->active.all.allocated || value ->624
Key ->active.all.current || value ->344
Key ->active.all.freed || value ->280
Key ->active.all.peak || value ->344
Key ->active.large_pool.allocated || value ->336
Key ->active.large_pool.current || value ->187
Key ->active.large_pool.freed || value ->149
Key ->active.large_pool.peak || value ->187
Key ->active.small_pool.allocated || value ->288
Key ->active.small_pool.current || value ->157
Key ->active.small_pool.freed || value ->131
Key ->active.small_pool.peak || value ->157
Key ->active_bytes.all.allocated || value ->4242541056
Key ->active_bytes.all.current || value ->3752709632
Key ->active_bytes.all.freed || value ->489831424
Key ->active_bytes.all.peak || value ->3752709632
Key ->active_bytes.large_pool.allocated || value ->4165074944
Key ->active_bytes.large_pool.current || value ->3751936000
Key ->active_bytes.large_pool.freed || value ->413138944
Key ->active_bytes.large_pool.peak || value ->375193
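
The raw counters are hard to read at a glance. In these statistics the `active_bytes.*` entries are byte counts, while the `active.*` entries count memory blocks. As a small sketch, the byte counts can be converted to GiB; the `bytes_to_gib` helper is a hypothetical name, and the values below are copied from the output above rather than read from the GPU.

```python
def bytes_to_gib(n_bytes):
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return n_bytes / 1024**3


# Values copied from the memory statistics printed above.
stats = {
    "active_bytes.all.current": 3752709632,
    "active_bytes.all.peak": 3752709632,
    "active_bytes.all.freed": 489831424,
}

for key, value in stats.items():
    print(f"{key}: {bytes_to_gib(value):.2f} GiB")
```

In the notebook, the dictionary returned by `torch.cuda.memory_stats()` could be filtered on the `active_bytes.` prefix and passed through the same helper.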

In [12]:
# Use a separate name so the imported `pipeline` function is not overwritten.
generator = pipeline(task="text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

Device set to use cuda:0


OutOfMemoryError: CUDA out of memory. Tried to allocate 250.00 MiB. GPU 0 has a total capacity of 3.63 GiB of which 21.38 MiB is free. Including non-PyTorch memory, this process has 3.57 GiB memory in use. Of the allocated memory 3.49 GiB is allocated by PyTorch, and 5.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
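
A back-of-the-envelope estimate helps explain this error. The sketch below only counts the model weights and assumes roughly 1.1 billion parameters (taken from the "1.1B" in the model name); activations, the CUDA context and any model already on the GPU come on top. The `weights_gib` helper is a hypothetical name.

```python
GIB = 1024**3


def weights_gib(n_params, bytes_per_param):
    """Approximate memory needed for the model weights alone, in GiB."""
    return n_params * bytes_per_param / GIB


N_PARAMS = 1.1e9       # approximate, taken from the "1.1B" in the model name
GPU_TOTAL_GIB = 3.63   # total capacity reported in the error message

for dtype, size in [("float32", 4), ("float16", 2)]:
    needed = weights_gib(N_PARAMS, size)
    verdict = "fits" if needed < GPU_TOTAL_GIB else "does not fit"
    print(f"{dtype}: ~{needed:.2f} GiB of weights, {verdict} in {GPU_TOTAL_GIB} GiB")
```

Under these assumptions the weights alone exceed the GPU capacity in 32-bit precision, while in 16-bit precision they would fit only if most of the GPU memory is free. Possible follow-ups, not tested here, are to release the NER model before loading the generator, or to load the generator in half precision if the installed transformers version supports it.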