In this script we obtain answers for the clickbait headlines that have content, using pretrained question-answering models.

In [None]:
# Install dependencies
!pip install transformers accelerate --quiet
!pip install sentencepiece --quiet  # Required by some models

# Import libraries
import pandas as pd
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from tqdm.notebook import tqdm

# Define the dataset path
DATASET_PATH = "dataset_clickbait_QA_limpio.csv"

# Load the dataset from file
df = pd.read_csv(DATASET_PATH)
# Drop rows missing a title or content, since both are needed to build a QA query
df = df.dropna(subset=["title", "content"]).reset_index(drop=True)

# Function to apply question answering with a given model and save the results
def generate_answers(model_id, model_name, df):
    print(f"\nProcesando con modelo: {model_id}")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForQuestionAnswering.from_pretrained(model_id)

    qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

    answers = []
    for _, row in tqdm(df.iterrows(), total=len(df)):
        try:
            # The headline acts as the question and the article body as the context
            result = qa_pipeline({
                "question": row["title"],
                "context": row["content"][:4000],  # Avoid overly long contexts
            })
            answer = result.get("answer", "")
        except Exception as e:
            # Record the error text so that a failed row does not stop the run
            answer = f"ERROR: {e}"
        answers.append(answer)

    df_copy = df.copy()
    df_copy["answer"] = answers
    output_filename = f"qa_output_{model_name}.csv"
    df_copy.to_csv(output_filename, index=False)
    print(f"Archivo guardado: {output_filename}")
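
The function above cuts each context at 4,000 characters, so text beyond that point is never seen by the model. One possible alternative is to split the article into overlapping windows and query each window separately. The sketch below illustrates the idea at the character level with a hypothetical helper, `split_into_windows`; the helper name, window size and overlap are assumptions chosen for illustration and are not part of the original script. The question-answering pipeline also applies its own token-level chunking to long inputs, so this only shows the principle.

```python
def split_into_windows(text, window_size=4000, overlap=500):
    """Split text into overlapping character windows (hypothetical helper)."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")

    step = window_size - overlap  # how far the window start advances each time
    windows = []
    start = 0
    while True:
        windows.append(text[start:start + window_size])
        # Stop once the current window reaches the end of the text.
        if start + window_size >= len(text):
            break
        start += step
    return windows


# Toy example: a 10-character text, windows of 4 characters sharing 1 character.
example_windows = split_into_windows("abcdefghij", window_size=4, overlap=1)
print(example_windows)
```

The overlap matters because an answer span that straddles a window boundary would otherwise be cut in two and missed by both windows.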

We use the DeBERTa model:

In [None]:
generate_answers("timpal0l/mdeberta-v3-base-squad2", "mdeberta", df)


Procesando con modelo: timpal0l/mdeberta-v3-base-squad2


Device set to use cpu


  0%|          | 0/205 [00:00<?, ?it/s]



Archivo guardado: qa_output_mdeberta.csv
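
Because `generate_answers` writes any exception into the `answer` column as a string starting with `ERROR:`, it can be worth checking how many rows failed before using a saved file. The following is a minimal sketch using only the standard library; the helper name `count_failed_rows` and the in-memory sample rows are made up for illustration and do not come from the real dataset.

```python
import csv
import io


def count_failed_rows(csv_file, answer_column="answer", error_prefix="ERROR:"):
    """Return (failed, total) for an open CSV file object (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    total = 0
    failed = 0
    for row in reader:
        total += 1
        # Empty or missing answers are treated as empty strings, not as errors.
        if (row.get(answer_column) or "").startswith(error_prefix):
            failed += 1
    return failed, total


# Invented sample standing in for a saved output file.
sample = io.StringIO(
    "title,content,answer\n"
    "t1,c1,some answer\n"
    "t2,c2,ERROR: example failure\n"
)
failed, total = count_failed_rows(sample)
print(f"{failed} of {total} rows failed")
```

For a real output file, the same helper could be called on `open("qa_output_mdeberta.csv", newline="", encoding="utf-8")`.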


We use the DistilBERT model:

In [None]:
generate_answers("mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es", "bert_spanish_distilled", df)


Procesando con modelo: mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es


Some weights of the model checkpoint at mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


  0%|          | 0/205 [00:00<?, ?it/s]



Archivo guardado: qa_output_bert_spanish_distilled.csv


We use the BERT model:

In [None]:
generate_answers("mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es", "bert_spanish_alt", df)


Procesando con modelo: mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es


Some weights of the model checkpoint at mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


  0%|          | 0/205 [00:00<?, ?it/s]



Archivo guardado: qa_output_bert_spanish_alt.csv
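
With one output file per model, a natural next step is to see how often the models give the same answer for the same headline. The sketch below computes pairwise exact-match agreement after lowercasing and collapsing whitespace. The helper names and the toy answers are assumptions for illustration only; they are not results from the notebook, and exact match is just one possible comparison criterion.

```python
from itertools import combinations


def normalize(answer):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(answer.lower().split())


def pairwise_agreement(answers_by_model):
    """Return {(model_a, model_b): fraction of rows with identical answers}."""
    lengths = {len(answers) for answers in answers_by_model.values()}
    if len(lengths) > 1:
        raise ValueError("all models must have answers for the same rows")

    agreement = {}
    for (name_a, ans_a), (name_b, ans_b) in combinations(answers_by_model.items(), 2):
        if not ans_a:
            agreement[(name_a, name_b)] = 0.0
            continue
        matches = sum(normalize(a) == normalize(b) for a, b in zip(ans_a, ans_b))
        agreement[(name_a, name_b)] = matches / len(ans_a)
    return agreement


# Invented answers, keyed by the model_name values used in the calls above.
toy_answers = {
    "mdeberta": ["Madrid", "in 2020", "the president"],
    "bert_spanish_distilled": ["madrid", "2020", "the president"],
    "bert_spanish_alt": ["Madrid ", "in 2020", "a minister"],
}
scores = pairwise_agreement(toy_answers)
for pair, score in scores.items():
    print(pair, round(score, 2))
```

In practice the answer lists would come from the `answer` column of each `qa_output_*.csv` file, read in the same row order.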
