This notebook naturalizes the "docstring" column of a dataset intended for conversational fine-tuning of LLMs for code generation: each docstring is rewritten as a natural-language request, phrased the way a user might ask a developer for the function.

# 2 - Dataset Naturalization

In [1]:
%pip install tqdm

Note: you may need to restart the kernel to use updated packages.





In [2]:
import os
import pandas as pd
import torch
from transformers import pipeline
from tqdm import tqdm

# Output folder and checkpoint settings; ./datasets/checkpoints is created below if it does not exist
output_dir = './datasets'
checkpoint_interval = 50
checkpoint_dir = f'{output_dir}/checkpoints'
os.makedirs(checkpoint_dir, exist_ok=True)

# Load the filtered DataFrames from the JSON Lines files
df_train_filtered = pd.read_json(f'{output_dir}/train_filtered.json', orient='records', lines=True)
df_validation_filtered = pd.read_json(f'{output_dir}/validation_filtered.json', orient='records', lines=True)
df_test_filtered = pd.read_json(f'{output_dir}/test_filtered.json', orient='records', lines=True)

# Load the text-generation pipeline with LLaMA 3 8B Instruct (device_map="auto" handles device placement)
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

text_generation_pipeline = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto"
)

# Turn a docstring into a contextual description using the LLaMA 3 8B Instruct pipeline
def transformar_docstring_con_llama(docstring, code):

    messages = [
        {"role": "system", "content": "Imagine you're a user working on a project and you need to request a specific function. Describe your need in a detailed and natural way, as if you were asking a developer to implement this function for you. Do not include any code."},
        {"role": "user", "content": f"I'm working on a project and I need help with the following function:\n\nFunction:\n{code}\n\nCurrent docstring:\n'{docstring}'\n\nPlease describe in a natural way what you need this function to do, including the context or scenario where it will be used."},
    ]

    response = text_generation_pipeline(
        messages,
        max_new_tokens=250,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    
    # Extract only the "assistant" reply
    generated_text = ""
    if isinstance(response, list) and isinstance(response[0], dict):
        for part in response[0].get("generated_text", []):
            if part.get("role") == "assistant" and "content" in part:
                generated_text += part["content"] + " "
    
    cleaned_text = generated_text.strip()
    
    print("Cleaned text:", cleaned_text)  # Debug line to inspect the cleaned generated text
    return cleaned_text

# Process a DataFrame row by row, saving periodic checkpoints
def procesar_y_guardar(df, df_name):
    contextual_docstrings = []
    start_idx = 0
    
    checkpoint_path = os.path.join(checkpoint_dir, f'{df_name}_contextual_checkpoint.json')
    if os.path.exists(checkpoint_path):
        df_checkpoint = pd.read_json(checkpoint_path, orient='records', lines=True)
        contextual_docstrings = df_checkpoint['docstring_contextual'].tolist()
        start_idx = len(contextual_docstrings)
    
    for i, (_, row) in enumerate(tqdm(df.iterrows(), total=len(df), desc=f"Processing {df_name} dataset")):
        if i < start_idx:
            continue  # Skip rows that were already processed
        
        contextual_docstrings.append(transformar_docstring_con_llama(row['docstring'], row['code']))
        
        if (i + 1) % checkpoint_interval == 0:
            df_checkpoint = pd.DataFrame({'docstring_contextual': contextual_docstrings})
            df_checkpoint.to_json(checkpoint_path, orient='records', lines=True)
            print(f"Checkpoint saved to {checkpoint_path} up to row {i + 1}.")
    
    df['docstring_contextual'] = contextual_docstrings
    df.to_json(f'{output_dir}/{df_name}_filtered_contextual.json', orient='records', lines=True)
    print(f"{df_name.capitalize()} dataset saved successfully to {output_dir}.")

    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)

# Process each dataset with checkpoints
procesar_y_guardar(df_train_filtered, "train")
procesar_y_guardar(df_validation_filtered, "validation")
procesar_y_guardar(df_test_filtered, "test")
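
The checkpoint-and-resume logic in `procesar_y_guardar` can be exercised without loading the model. The following is a minimal sketch under stated assumptions, not the notebook's own code: it uses only the standard library instead of pandas, `process_with_checkpoints` and `fake_transform` are hypothetical names, and the write-to-temporary-file-then-rename step is an added precaution against a half-written checkpoint.

```python
import json
import os
import tempfile


def process_with_checkpoints(rows, transform, checkpoint_path, interval=50):
    """Apply `transform` to each row, resuming from `checkpoint_path` if it exists."""
    results = []
    if os.path.exists(checkpoint_path):
        # Reload previously generated texts (one JSON object per line)
        with open(checkpoint_path, encoding="utf-8") as fh:
            results = [json.loads(line)["docstring_contextual"]
                       for line in fh if line.strip()]

    # Start at the first row that has no saved result yet
    for i in range(len(results), len(rows)):
        results.append(transform(rows[i]))
        if (i + 1) % interval == 0:
            # Write to a temporary file first, then rename, so an interruption
            # during the write cannot corrupt the existing checkpoint
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w", encoding="utf-8") as fh:
                for text in results:
                    fh.write(json.dumps({"docstring_contextual": text}) + "\n")
            os.replace(tmp_path, checkpoint_path)
    return results


def fake_transform(row):
    # Hypothetical stand-in for transformar_docstring_con_llama
    return f"request for: {row}"


rows = [f"doc {n}" for n in range(5)]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_checkpoint.json")
    # First run covers only four rows, as if the job had been interrupted
    first = process_with_checkpoints(rows[:4], fake_transform, path, interval=2)
    # Second run reloads the four saved rows and transforms only the last one
    resumed = process_with_checkpoints(rows, fake_transform, path, interval=2)
```

As in the original function, rows processed after the most recent checkpoint are lost on interruption and are regenerated on the next run.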


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Processing train dataset:   0%|          | 0/251820 [00:00<?, ?it/s]
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
  attn_output = torch.nn.functional.scaled_dot_product_attention(
Processing train dataset:   0%|          | 1/251820 [00:07<537:01:09,  7.68s/it]


Cleaned text: I'm building a web scraping tool that extracts data from a specific website. As part of this project, I need to be able to set the text content of certain elements on the webpage. Specifically, I want to be able to set the text content of an element with a specific class.

The function I'm requesting should allow me to set the text content of an element with a given class. For example, if I have an element with the class "heading", I want to be able to set its text content to "New Heading". The function should also allow me to specify the class of the text content, in case there are multiple elements with different classes that need to have their text content set.

In the context of my project, this function will be used to modify the text content of elements on the webpage before extracting the data. This is necessary because the website uses certain elements to display dynamic content that I don't want to include in my data extraction.

I would like the function to be f
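
The reply above stops mid-word, which is consistent with generation reaching the `max_new_tokens=250` limit. One possible post-processing step, sketched here as an assumption rather than something the notebook does, is to trim each reply back to its last complete sentence. `trim_to_last_sentence` is a hypothetical helper and the regular expression is only a rough heuristic for sentence boundaries.

```python
import re

# A sentence end: ., ! or ?, optionally followed by closing quotes, brackets
# or backticks, and then whitespace or the end of the text
_SENTENCE_END = re.compile(r'[.!?]["\')\]`]*(?=\s|$)')


def trim_to_last_sentence(text):
    """Cut `text` after its last complete sentence; return it unchanged if none is found."""
    stripped = text.strip()
    last_end = None
    for match in _SENTENCE_END.finditer(stripped):
        last_end = match.end()
    return stripped[:last_end] if last_end else stripped


sample = "This is necessary for my data extraction.\n\nI would like the function to be f"
print(trim_to_last_sentence(sample))
```

Raising `max_new_tokens` is the other obvious option, at the cost of longer generation time per row.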

Processing train dataset:   0%|          | 2/251820 [00:14<515:16:16,  7.37s/it]


Cleaned text: I'm working on a project that involves processing and analyzing large amounts of text data, specifically using the FoLiA (Format for Language Annotation) format. In this format, each piece of text is represented as an element, which can have various attributes and child elements.

I need a function that allows me to associate a specific document with each element in the FoLiA tree structure. This document contains metadata and additional information about the text data. For example, it might include information about the author, date of creation, or the language in which the text is written.

The function should take an element and a document as input, and then link the element to the document. If the element is part of a larger structure (i.e., it has child elements), the function should recursively apply the same association to all child elements.

Here's an example scenario: let's say I have a FoLiA document that contains a text about a person's life. The document has 

Processing train dataset:   0%|          | 3/251820 [00:21<508:38:32,  7.27s/it]


Cleaned text: I'm working on a project that involves managing annotations on various elements. Annotations are essentially additional pieces of information that can be attached to specific elements. The project uses a hierarchical structure, where elements can have child elements, and these child elements can also have annotations.

The function I'm requesting, `addable`, is used to determine whether a new annotation of a specific class can be added to a given parent element. This is an important check, as some elements may have limitations on the number of annotations they can have.

The function takes three main parameters: `parent`, which is the element that the annotation is being added to; `set`, which is the set that the annotation belongs to (if applicable); and `raiseexceptions`, which determines whether an exception should be raised if the annotation cannot be added.

The function should return `True` if the annotation can be added to the parent, and `False` otherwise. If `rai

Processing train dataset:   0%|          | 4/251820 [00:29<506:24:23,  7.24s/it]


Cleaned text: I'm working on a project that involves creating and manipulating elements within a document. These elements can be added to each other, creating a hierarchical structure. I need a function that will be called after an element is added to another element, and it will perform some checks to ensure that the addition is valid.

The function should check if the element being added was not previously associated with a document, and if so, associate it with the document of its parent element. This ensures that all elements are linked to a valid document.

Additionally, if the document associated with the element has deep validation enabled, the function should call the deep validation method to validate the element and its children.

The function should also have the ability to perform extra checks and raise exceptions if necessary, to prevent invalid additions from occurring.

This function will be used internally within the project, so it doesn't need to be publicly exposed. H

Processing train dataset:   0%|          | 5/251820 [00:35<482:45:18,  6.90s/it]


Cleaned text: I'm building a project that involves parsing and processing text data, and I need a function that can update the textual value of an element based on the text content of its children. This function will be used in a specific scenario where I have a hierarchical structure of elements, and each element can contain other elements or plain text.

The function should iterate through all the children of the element, and if a child is another element, it should call the same function recursively to update its textual value. If a child is a plain string, it should simply add that string to the overall textual value. The function should only be applicable to elements that are instances of a specific class, let's call it `TEXTCONTAINER`.

The ultimate goal is to generate a single, unified textual value that represents the combined text content of all the children. This textual value will then be stored in the `data` attribute of the element.

I'd like the function to be flexible en

Processing train dataset:   0%|          | 6/251820 [00:42<492:02:02,  7.03s/it]


Cleaned text: I'm building a project that involves working with a hierarchical structure of elements, where each element has a parent-child relationship. Think of it like a family tree, where each element is a person and their parent is their parent. 

The function I need is to generate all the ancestors of a given element, starting from the element itself and moving up the hierarchy until it reaches the root element. This means it should yield each element's parent, grandparent, great-grandparent, and so on, until it reaches the top of the tree.

Here's the catch: I might not always know the exact class of the root element. Sometimes, I'll need to find the ancestors of an element regardless of its class, and sometimes I'll need to specify a specific class or a tuple of classes to filter the results.

For example, let's say I have a project where I'm working with different types of nodes, like "Person", "Organization", and "Event". Each node has a parent node, and I need to find the an

Processing train dataset:   0%|          | 7/251820 [00:50<497:20:07,  7.11s/it]


Cleaned text: Here's a natural description of what I need the function to do:

I'm building a project that involves working with different types of elements, such as paragraphs, sentences, and words. Each of these elements has its own specific characteristics and behaviors. In order to make my code more flexible and reusable, I want to be able to find the most immediate ancestor of a given type of element.

For example, let's say I have a word object and I want to find the paragraph that contains it. I know that the word is part of a sentence, which is part of a paragraph. I want the function to return the paragraph object, rather than the sentence or the word.

I'll be calling this function from various parts of my code, and I'll be passing in different classes of elements as arguments. I might call it like this: `paragraph = word.ancestor(folia.Paragraph)`, where `word` is an instance of a word object and `folia.Paragraph` is the class of paragraph objects that I'm interested in.

Th

Processing train dataset:   0%|          | 8/251820 [00:57<501:23:30,  7.17s/it]


Cleaned text: I'm working on a project that involves processing and analyzing linguistic data, specifically FoLiA (Feature-based Linguistic Annotation) elements. These elements are used to represent various linguistic structures, such as words, sentences, and paragraphs, along with their corresponding annotations.

I need a function that can take a FoLiA element and convert it into a Python dictionary, which can then be serialized to JSON. This dictionary should contain all the relevant information about the FoLiA element, including its type, ID, set, class, annotator, and other attributes.

The function should also be able to recursively traverse the FoLiA element's children, if any, and include them in the resulting dictionary. This is important because FoLiA elements can have nested structures, and I need to be able to capture this hierarchy in the JSON output.

Additionally, I'd like the function to allow me to specify a list of attributes to ignore when serializing the FoLiA eleme

Processing train dataset:   0%|          | 9/251820 [01:04<503:18:55,  7.20s/it]


Cleaned text: I'm working on a project that involves processing and analyzing text data using the FoLiA (Format for Linguistic Annotation) format. FoLiA is a markup language used to represent linguistic annotations, such as part-of-speech tags, named entities, and dependencies.

The function I'm asking for is responsible for serializing a FoLiA element and all its contents into an XML string. This function will be used in various parts of my project, including data import, export, and processing.

The function should take an optional parameter `pretty_print`, which determines whether the resulting XML string is formatted with indentation and line breaks for readability or not. If `pretty_print` is `True`, the function should use the `ElementTree` module's `pretty_print` option to format the XML string. If `pretty_print` is `False` (or not provided), the function should use the default `ElementTree` behavior, which is to produce a compact, unformatted XML string.

When serializing the F

Processing train dataset:   0%|          | 10/251820 [01:11<504:40:03,  7.21s/it]
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
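
The warning about sequential pipeline use matters here because of the dataset size: at roughly 7.2 s per row, 251820 rows take about three weeks. The sketch below shows the arithmetic and one way to group prompts into batches. It is an illustration under assumptions: `batched`, `estimate_hours`, and `generate_in_batches` are hypothetical helpers, and the generation function is passed in so the grouping logic can be checked without a GPU.

```python
def estimate_hours(total_rows, seconds_per_row):
    """Total wall-clock time in hours for a strictly sequential run."""
    return total_rows * seconds_per_row / 3600


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def generate_in_batches(conversations, generate_fn, batch_size=8):
    """Call `generate_fn` on each batch and return the results in input order."""
    outputs = []
    for batch in batched(conversations, batch_size):
        outputs.extend(generate_fn(batch))
    return outputs


# 251820 rows at about 7.2 s per row, as reported by the progress bar
print(f"{estimate_hours(251820, 7.2):.0f} hours")

# Assumption, not verified in this notebook: with the real pipeline, generate_fn
# could wrap a call such as
#   text_generation_pipeline(batch, batch_size=len(batch), max_new_tokens=250)
# which needs a pad token set on the tokenizer and a transformers version that
# accepts a list of chat conversations.
```

Whether batching actually helps depends on available GPU memory and on how much the prompt lengths vary, so the batch size is something to measure rather than assume.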


Cleaned text: I'm working on a project that involves parsing and analyzing a large corpus of text data. The text data is organized into a hierarchical structure, where each element in the structure represents a specific type of text component, such as a sentence, phrase, or word.

I need to write a function that can traverse this hierarchical structure and select specific elements that match a certain criteria. Specifically, I want to be able to select all child elements of a certain class (e.g. all sentence elements), and also filter the results based on a specific set (e.g. only select elements that are part of a specific domain or topic).

The function should also allow me to recursively traverse the hierarchy, so that I can select elements that are nested within other elements. For example, if I want to select all sentence elements that are part of a specific topic, I should be able to recursively traverse the hierarchy to find all sentence elements that are nested within elements 

Processing train dataset:   0%|          | 11/251820 [01:19<506:50:44,  7.25s/it]


Cleaned text: I'm working on a project that involves parsing and analyzing documents, and I need a function that can retrieve metadata associated with a specific element within the document. This element can be a part of a larger structure, and I want the function to automatically inherit metadata from parent elements if it's not defined at the current level.

For example, let's say I have a document that contains multiple sections, and each section has its own metadata, such as a title and a date. Within each section, I have sub-elements like paragraphs, headings, and images. I want to be able to retrieve the metadata associated with a specific paragraph or image, but if that metadata isn't defined, I want the function to look up the parent section and retrieve its metadata instead.

The function should take an optional key parameter, which allows me to specify a specific piece of metadata to retrieve. If I don't provide a key, the function should return the entire metadata dictionary

Processing train dataset:   0%|          | 12/251820 [01:26<500:38:57,  7.16s/it]


Cleaned text: I'm working on a project that involves managing a hierarchical structure of elements, where each element can have its own children. I need a function that can help me find the index of a specific child element within this structure. 

The function should be able to find the index of the child element by traversing the hierarchy of elements in a breadth-first manner. This means it should start by checking the immediate children of the current element, then move on to the children of those children, and so on, until it finds the target child element.

However, I also need the function to be able to ignore certain elements in the hierarchy while searching for the target child. For example, I might want to ignore elements that don't have a specific attribute or belong to a certain class. 

The function should return the index of the target child element if it's found, and -1 if it's not. I'd like the function to be recursive by default, but I should also be able to turn off t

Processing train dataset:   0%|          | 13/251820 [01:33<506:37:33,  7.24s/it]


Cleaned text: I'm working on a project that involves creating a hierarchical structure of elements, where each element has a parent-child relationship with other elements. I need to implement a function that determines whether one element comes before another in this hierarchy. In other words, I want to know if one element is a direct or indirect ancestor of the other.

The function should take two elements as input and return a boolean indicating whether the first element precedes the second. If the elements don't share a common ancestor, the function should raise an exception.

Here's a scenario where this function would be used: imagine I have a project with multiple tasks, and each task has a dependency on other tasks. I want to determine whether task A is dependent on task B, or vice versa. In this case, I would call the `precedes` function with task A and task B as arguments, and it would tell me whether task A comes before task B in the dependency hierarchy.

The function should

Processing train dataset:   0%|          | 14/251820 [01:40<510:46:50,  7.30s/it]


Cleaned text: I'm working on a project that involves traversing a complex data structure, specifically a tree-like structure where each node has a list of child nodes. The goal is to perform a depth-first search (DFS) on this structure, which means visiting each node and its children as deeply as possible before backtracking.

To make the search more flexible and reusable, I'd like to be able to pass a callback function that will be executed on each node as we visit it. This callback function can perform any necessary operations on the node, such as processing its data, modifying its state, or returning a result.

The function I'm requesting should start at the root node of the tree (which is represented by the `self` object) and recursively visit each node in a depth-first manner. For each node, it should call the provided callback function and pass the node as an argument. If the callback function returns a result (i.e., a non-None value), the function should immediately return that 

Processing train dataset:   0%|          | 15/251820 [01:48<512:29:58,  7.33s/it]


Cleaned text: I'm working on a project that involves processing and analyzing text data, and I need a function that can help me navigate through the hierarchical structure of the text elements. Specifically, I want to be able to move to the next element in the structure, while also considering certain constraints.

The function should take three arguments: `Class`, `scope`, and `reverse`. `Class` determines the type of element I'm looking for - it can be a specific class, a tuple of classes, or `True` to match the same class as the current element. `scope` is a list of classes that I don't want to cross when searching for the next element - it can be `True` to use a default list of structure elements, or `None` to ignore this constraint. `reverse` is a boolean that determines the direction of the search - if `True`, I want to move to the previous element, otherwise I want to move to the next one.

The function should return the next element that matches the specified class and scope, o

Processing train dataset:   0%|          | 16/251820 [01:55<512:17:50,  7.32s/it]


Cleaned text: I'm working on a project that involves processing and analyzing text data, and I need a function that allows me to retrieve the previous element in a sequence of elements. For example, if I'm working with a text document, I might have a sequence of sentences, paragraphs, and sections, and I want to be able to access the previous sentence or paragraph in the sequence.

The function should take two optional arguments: `Class` and `scope`. The `Class` argument allows me to specify the type of element I'm interested in (e.g., a sentence, a paragraph, etc.). If I set `Class` to `True`, the function should return the previous element of the same class as the current element. If I set `Class` to `None`, the function should return the previous element of any class.

The `scope` argument allows me to specify a list of classes that I don't want to cross when looking for the previous element. For example, if I'm working with a text document and I'm currently in a paragraph, I might 

Processing train dataset:   0%|          | 17/251820 [02:02<508:23:05,  7.27s/it]


Cleaned text: Here's a natural description of what I need the function to do:

I'm building a document processing system that allows users to create and manipulate documents. In this system, documents can contain multiple elements, such as text, images, and other types of content. These elements are represented by objects that inherit from an `AbstractElement` class.

I need a function that allows me to remove a specific child element from a document. This function should take the child element as an argument and ensure that it's removed from the document's internal data structure.

Here's the scenario: when a user wants to delete a specific piece of content from a document, I need to remove the corresponding element from the document's data structure. This function should do the following:

* Check if the child element is an instance of `AbstractElement`. If it's not, raise a `ValueError`.
* Check if the child element is currently associated with the document (i.e., its `parent` attri

Processing train dataset:   0%|          | 18/251820 [02:09<506:09:57,  7.24s/it]


Cleaned text: I'm building a system that allows users to manage and analyze annotations on a specific class of objects. These annotations can be added to individual objects or to the class itself, and they provide additional information about the object's properties or behavior.

The function I need is called `hasannotation` and it should take three parameters: `self`, `Class`, and `set` (which is optional). The `self` parameter is a reference to the object that owns the function, `Class` is the class of objects that we're interested in, and `set` is an optional parameter that allows us to filter the results.

What I want the `hasannotation` function to do is to return an integer that indicates whether any annotations exist for the specified `Class`, and if so, how many. If there are no annotations, the function should return 0. If there are annotations, the function should return the number of annotations.

The function should use the `select` method to retrieve the annotations, and i

Processing train dataset:   0%|          | 19/251820 [02:17<505:08:06,  7.22s/it]


Cleaned text: I'm working on a project that involves processing and analyzing text data, and I need a function that can retrieve a specific type of annotation from a collection of annotations. The function should take two arguments: the type of annotation to retrieve, and an optional set parameter that allows me to filter the results based on a specific set.

For example, let's say I have a collection of annotations that represent different senses of a word, and each sense is associated with a specific set (e.g., "http://some/path/cornetto"). I want to write a function that can retrieve a single sense annotation for a given word, and I want to be able to filter the results based on the set.

The function should return a single annotation element (an instance derived from AbstractElement) that matches the specified type and set (if provided). If no such annotation exists, it should raise a NoSuchAnnotation exception.

In the context of my project, I'll be using this function to retrieve

Processing train dataset:   0%|          | 20/251820 [02:24<503:42:35,  7.20s/it]


Cleaned text: Here's a description of the function in a natural way:

I'm building a project that involves working with annotation layers, which are essentially a way to categorize and organize data. In this project, I need to be able to check if a specific annotation layer exists. 

The function should take in two optional parameters: `annotationtype` and `set`. `annotationtype` would specify the type of annotation layer I'm looking for (e.g. " bounding box", "polygon", etc.), and `set` would specify the set of layers I want to check (e.g. "all layers", "only layers with a certain property", etc.).

The function should return a boolean value indicating whether the specified annotation layer exists or not. If it does exist, the function would return `True`; otherwise, it would return `False`.

For example, if I'm working with a set of layers that contain bounding box annotations, and I want to check if a specific bounding box annotation layer exists, I would call the function like this

Processing train dataset:   0%|          | 21/251820 [02:31<504:37:32,  7.21s/it]


Cleaned text: Here's a natural description of what I need this function to do:

I'm working on a project that involves processing and validating text content from a specific source. The text content is stored in a hierarchical structure, and each piece of content has a unique reference that identifies it. The reference is a key part of the content's metadata, and it's used to look up the actual text data in the source.

The `getreference` function is crucial to my project because it retrieves the reference for a given piece of text content and validates its integrity. The function should first check if the reference is present and valid. If it's not, it should raise an error.

If the reference is valid, the function should then check if the reference points to the correct text content. It should do this by comparing the actual text content with the expected text content at a specific offset. If the texts don't match, it should raise an error.

If the reference and text content match, t

Processing train dataset:   0%|          | 22/251820 [02:38<508:05:39,  7.26s/it]


Cleaned text: Here's a natural description of what I need the function to do:

I'm working on a project that involves processing and analyzing phonetic content, such as speech or audio data. This content is stored in a document and referenced by a unique identifier. I need a function that can retrieve and validate this reference, ensuring that it's correct and points to the right phonetic content.

The function should first check if the reference is available, and if not, it should return None. If the reference is available, it should then check if it contains the correct phonetic content, specified by a class (e.g., "phoneme" or "syllable"). If the reference doesn't contain the correct phonetic content, it should raise an exception.

If the reference is valid, the function should then validate the phonetic content at a specific offset within the reference. This offset corresponds to a specific point in the audio data, and the function should compare the phonetic content at this offset

Processing train dataset:   0%|          | 23/251820 [02:46<510:12:25,  7.29s/it]


Cleaned text: I'm working on a project that involves processing and analyzing linguistic annotations in a specific domain. I have a word-level annotation system where each word is annotated with various types of information, such as part-of-speech tags, named entities, and semantic roles.

I need a function that allows me to retrieve all the span annotations of a specific type that contain a given word. For example, if I have a word "apple" and I want to find all the chunks (a type of annotation) that contain this word, I would use this function to get a list of chunk annotations.

The function should take two arguments: the type of annotation I'm interested in (e.g., "Chunk"), and an optional set constraint (e.g., "food"). This means that if I only care about chunks that are related to food, I can pass "food" as the set constraint.

The function should yield a list of span annotation instances that match the specified type and contain the given word. In the example I provided, the out

Processing train dataset:   0%|          | 24/251820 [02:53<510:28:53,  7.30s/it]


Cleaned text: I'm working on a project that involves building a linguistic analysis tool, specifically a span element that can hold different types of linguistic units such as words, morphemes, and phonemes. 

The function I'm asking for, `setspan`, is used to reset the span element and populate it with new linguistic units. This function should take in any number of arguments, which can be instances of `Word`, `Morpheme`, or `Phoneme`. 

When called, `setspan` should clear out any existing data within the span element and then add the provided linguistic units to the element. This allows me to easily update the span element with new data without having to manually manage the existing data.

For example, if I have a span element that currently contains a word and a morpheme, and I want to add a new phoneme to the span, I would call `setspan` with the word, morpheme, and phoneme as arguments. The function would clear out the existing data and then add the new phoneme to the span, effect

Processing train dataset:   0%|          | 25/251820 [03:01<512:28:14,  7.33s/it]


Cleaned text: I'm working on a project that involves analyzing and processing linguistic data, specifically word references and their relationships with other linguistic elements. The function I'm requesting is a crucial part of this process.

The `_helper_wrefs` function should be an internal helper function that takes two parameters: `targets` and `recurse`. The `targets` parameter is a list that will store the results of the function, and the `recurse` parameter is a boolean that determines whether the function should recursively explore the relationships between linguistic elements.

The function should iterate over each element in the current object (`self`). If an element is a `Word`, `Morpheme`, or `Phoneme`, it should be added to the `targets` list. If an element is a `WordReference`, the function should try to resolve the reference by looking up the corresponding word in the document (`self.doc`) using the `id` attribute. If the reference is resolved, the corresponding word sh