<a href="https://colab.research.google.com/github/AdaliaFlores/DetectorSesgo/blob/main/Detector_de_sesgo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers

from transformers import pipeline

# Load a pretrained sentiment-analysis model (as a smoke test).
# No model is named, so the pipeline falls back to its default English checkpoint (see the log below).
classifier = pipeline("sentiment-analysis")

# Try it on one sentence. The text is Spanish ("All politicians are corrupt and only
# think about stealing"), so the score from an English-only model should be read with caution.
texto = "Todos los políticos son corruptos y solo piensan en robar"
resultado = classifier(texto)

print("Result:", resultado)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Device set to use cpu


Result: [{'label': 'NEGATIVE', 'score': 0.9867910146713257}]


In [2]:
import pandas as pd
import csv

# Load the files, skipping malformed lines and disabling quote handling.
# Note: with quoting=csv.QUOTE_NONE, commas inside quoted text act as column separators,
# which is likely why the preview below shows text spread over several columns and many empty rows.
fake_df = pd.read_csv("DataSet_Misinfo_FAKE.csv", on_bad_lines='skip', quoting=csv.QUOTE_NONE, encoding='utf-8', engine='python')
true_df = pd.read_csv("DataSet_Misinfo_TRUE.csv", on_bad_lines='skip', quoting=csv.QUOTE_NONE, encoding='utf-8', engine='python')

# Labels: 1 = fake, 0 = true
fake_df["label"] = 1
true_df["label"] = 0

# Concatenate the two datasets
df = pd.concat([fake_df, true_df], ignore_index=True)

# Show a summary
print(df["label"].value_counts())
df.head()


label
0    10585
1     7566
Name: count, dtype: int64


Unnamed: 0.1,Unnamed: 0,text,label
0,to all the people who voted for this a hole t...,you were wrong! 70-year-old men don t change ...,1
1,,,1
2,,,1
3,,,1
4,,,1
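
The empty rows and the text split across columns in the preview above are consistent with quote handling being switched off. The sketch below reproduces the effect with Python's standard `csv` module on a small made-up sample (the `raw` string is hypothetical, not taken from the dataset): one quoted field that contains a comma and a line break is parsed once with default quoting and once with `csv.QUOTE_NONE`.

```python
import csv
import io

# Hypothetical two-column sample: an index and one quoted text field
# that contains both a comma and a line break.
raw = ',text\n0,"to all the people who voted, you were wrong!\nFeatured image via Example"\n'

def parse(quoting):
    """Parse `raw` with the given csv quoting mode and return the rows."""
    return list(csv.reader(io.StringIO(raw, newline=""), quoting=quoting))

rows_minimal = parse(csv.QUOTE_MINIMAL)  # default: quotes protect commas and newlines
rows_none = parse(csv.QUOTE_NONE)        # quotes are treated as ordinary characters

print("QUOTE_MINIMAL:", len(rows_minimal), "rows, fields per row:", [len(r) for r in rows_minimal])
print("QUOTE_NONE:   ", len(rows_none), "rows, fields per row:", [len(r) for r in rows_none])
```

With default quoting the sample stays one header row plus one record. With `QUOTE_NONE` the same record breaks into rows with differing field counts, which is the kind of input that `on_bad_lines='skip'` may then drop or fill with missing values.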


In [3]:
import pandas as pd

# Re-read the file with default quote handling and a single named column, "text".
# Because only one name is given, pandas appears to use the file's leading column as the index.
true_df = pd.read_csv("DataSet_Misinfo_TRUE.csv", header=None, names=["text"], engine="python", on_bad_lines='skip')

# Drop empty and very short rows
true_df = true_df[true_df["text"].notnull()]
true_df = true_df[true_df["text"].str.len() > 10]

# Add the class label
true_df["label"] = 0

# Show how many valid texts remain
print(f"Number of valid true texts: {len(true_df)}")
true_df.head()


Number of valid true texts: 22960


Unnamed: 0,text,label
0.0,The head of a conservative Republican faction ...,0
1.0,Transgender people will be allowed for the fir...,0
2.0,The special counsel investigation of links bet...,0
3.0,Trump campaign adviser George Papadopoulos tol...,0
4.0,President Donald Trump called on the U.S. Post...,0


In [4]:
# Build a single "texto" column (fillna avoids errors on missing values).
# Note: the second operand is the label, not a text column, so the class label ends up
# inside "texto". This column is not used for training, since df is rebuilt from
# fake_df and true_df in a later cell, but it should not be fed to a model as is.
df["texto"] = df["text"].fillna('') + " " + df["label"].fillna('').astype(str)

# Keep only rows with real text (more than 10 characters)
df = df[df["texto"].str.len() > 10]

# Drop duplicates (optional)
df = df.drop_duplicates(subset="texto")

# Show a summary
print(f"Number of valid texts: {len(df)}")
df[["texto", "label"]].head()


Number of valid texts: 81


Unnamed: 0,texto,label
0,you were wrong! 70-year-old men don t change ...,1
165,look at me! I m violating the U.S. flag code ...,1
274,she finishes.The whole thing sounds like an ...,1
509,honor the fact that he never wanted to see yo...,1
523,but at least he now admits it.Featured image ...,1


In [5]:
# Clean fake_df: drop missing, very short and duplicate texts
fake_df = fake_df[fake_df["text"].notnull()]
fake_df = fake_df[fake_df["text"].str.len() > 10]
fake_df = fake_df.drop_duplicates(subset="text").copy()
fake_df["label"] = 1

# Combine both datasets
df = pd.concat([fake_df, true_df], ignore_index=True)

# Shuffle the rows so the classes are mixed
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Show the result
print("Final class distribution:")
print(df["label"].value_counts())
df.head()


Final class distribution:
label
0    22960
1       50
Name: count, dtype: int64


Unnamed: 0.1,Unnamed: 0,text,label
0,,Donald Trump’s choice to run the U.S. Transpor...,0
1,,A man accused of being part of a plot to kidna...,0
2,,Wisconsin’s Republican-controlled state Assemb...,0
3,,"Matt Latimer, a former Bush speechwriter and c...",0
4,,"With the U.S. election just days away, America...",0
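
The class counts printed above are heavily skewed. As a quick worked check, the sketch below computes the accuracy of a trivial classifier that always predicts the majority class, using only the two counts from that output. It needs no model and is only meant to show why raw accuracy is uninformative at this imbalance.

```python
# Class counts as printed by the cell above (0 = true, 1 = fake).
counts = {0: 22960, 1: 50}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a classifier that ignores the text and always answers `majority_label`.
baseline_accuracy = counts[majority_label] / total
imbalance_ratio = counts[0] / counts[1]

print(f"total samples: {total}")
print(f"always predicting {majority_label} gives accuracy {baseline_accuracy:.4f}")
print(f"class 0 outnumbers class 1 by roughly {imbalance_ratio:.0f} to 1")
```

Any model trained on this split has to beat that baseline on the minority class, not just on overall accuracy.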


In [6]:
# Keep every sample of class 1 (fake)
fake_sample = fake_df

# Undersample class 0 (true) to the same size as class 1
true_sample = true_df.sample(n=len(fake_sample), random_state=42)

# Combine the two balanced subsets
df_balanced = pd.concat([fake_sample, true_sample], ignore_index=True)

# Shuffle
df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

# Check the class distribution
print(df_balanced["label"].value_counts())
df_balanced.head()


label
0    50
1    50
Name: count, dtype: int64


Unnamed: 0.1,Unnamed: 0,text,label
0,,"Merrick Garland, President Barack Obama’s U.S....",0
1,,British foreign minister Boris Johnson said on...,0
2,,The head of Russia s central election commissi...,0
3,it s sad,and it s unnecessary.Featured image via Timot...,1
4,free,"and hopeful America.Featured image via Gawker""",1
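
The balancing step above is random undersampling: every class is cut down to the size of the rarest one. The sketch below shows the same idea without pandas, using the standard library's `random.sample` on plain dictionaries. The `undersample` helper and the toy rows are hypothetical and exist only for illustration.

```python
import random
from collections import defaultdict

def undersample(rows, label_key="label", seed=42):
    """Return a shuffled list in which every class has as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    n = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n))  # sample without replacement
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy data: 12 rows of class 0 and 3 rows of class 1.
rows = [{"text": f"true story {i}", "label": 0} for i in range(12)]
rows += [{"text": f"fake story {i}", "label": 1} for i in range(3)]

balanced = undersample(rows)
counts = {label: sum(r["label"] == label for r in balanced) for label in (0, 1)}
print(counts)
```

Undersampling discards most of the majority class, so the balanced set is only as large as twice the minority class.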


In [7]:
from sklearn.model_selection import train_test_split

# Split into train and test sets (80/20), keeping the class balance in both
train_df, test_df = train_test_split(df_balanced, test_size=0.2, random_state=42, stratify=df_balanced["label"])

print(f"Training: {len(train_df)} samples")
print(f"Test: {len(test_df)} samples")


Training: 80 samples
Test: 20 samples


In [8]:
!pip install transformers datasets torch



In [9]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    # Helper that pads every text to exactly 128 tokens; it is not used below
    return tokenizer(examples, padding='max_length', truncation=True, max_length=128)

# Pull the texts and labels out of the train and test splits
train_texts = train_df['text'].tolist()
train_labels = train_df['label'].tolist()

test_texts = test_df['text'].tolist()
test_labels = test_df['label'].tolist()

# Tokenize both splits: truncate to 128 tokens and pad to the longest text in each split
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=128)


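
The tokenizer calls in this cell use `padding=True`, while `tokenize_function` uses `padding='max_length'`. The sketch below illustrates the difference on plain lists of token ids, without loading a tokenizer. It is a simplified stand-in: `pad_and_truncate` is a hypothetical helper, the example ids are made up, and a real BERT tokenizer also keeps the final `[SEP]` token when it truncates.

```python
PAD_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary

def pad_and_truncate(batch, max_length, pad_to_max):
    """Truncate each id list to max_length, then pad to max_length or to the longest item."""
    truncated = [ids[:max_length] for ids in batch]  # simplification: plain slice
    target = max_length if pad_to_max else max(len(ids) for ids in truncated)
    input_ids = [ids + [PAD_ID] * (target - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for two short texts (101 = [CLS], 102 = [SEP]).
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]

dynamic = pad_and_truncate(batch, max_length=8, pad_to_max=False)  # like padding=True
fixed = pad_and_truncate(batch, max_length=8, pad_to_max=True)     # like padding='max_length'

print(len(dynamic["input_ids"][0]), dynamic["attention_mask"][0])
print(len(fixed["input_ids"][0]), fixed["attention_mask"][0])
```

Padding to the longest item keeps tensors smaller when texts are short, while padding to a fixed length gives every batch the same shape.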

In [10]:
import torch

class FakeNewsDataset(torch.utils.data.Dataset):
    """Wraps tokenizer encodings and labels so the Trainer can index them as tensors."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        # One example: every encoding field (input_ids, attention_mask, ...) plus its label
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

train_dataset = FakeNewsDataset(train_encodings, train_labels)
test_dataset = FakeNewsDataset(test_encodings, test_labels)


In [18]:
import os
os.environ["WANDB_DISABLED"] = "true"  # Deprecated switch; report_to="none" below also disables W&B logging

from transformers import BertForSequenceClassification, TrainingArguments, Trainer

# Load BERT with a classification head on top (2 output classes)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Training arguments (only those compatible with 4.5.3)
training_args = TrainingArguments(
    output_dir='./results',               # Folder for checkpoints and results
    num_train_epochs=3,                   # Number of epochs
    per_device_train_batch_size=8,        # Batch size for training
    per_device_eval_batch_size=8,         # Batch size for evaluation
    logging_dir='./logs',                 # Folder for logs
    logging_steps=10,                     # Log every 10 steps
    seed=42,                              # Seed for reproducibility
    report_to="none"                      # Do not report to external trackers
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset  # Used by trainer.evaluate(); no evaluation strategy is set, so nothing is evaluated during training
)

trainer.train()


Step,Training Loss
10,0.3824
20,0.0567
30,0.0178


TrainOutput(global_step=30, training_loss=0.15229013115167617, metrics={'train_runtime': 402.66, 'train_samples_per_second': 0.596, 'train_steps_per_second': 0.075, 'total_flos': 15786663321600.0, 'train_loss': 0.15229013115167617, 'epoch': 3.0})
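
The figures in `TrainOutput` can be checked by hand from the training setup. The sketch below redoes the arithmetic with the values reported in this notebook (80 training samples, batch size 8, 3 epochs, 402.66 s runtime), assuming a single device and no gradient accumulation, which are the defaults when neither is configured.

```python
import math

train_samples = 80        # size of train_df
batch_size = 8            # per_device_train_batch_size
epochs = 3                # num_train_epochs
train_runtime_s = 402.66  # 'train_runtime' from TrainOutput

# One optimizer step per batch (assumes one device, no gradient accumulation).
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = train_samples * epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.3f}")
```

The total of 30 steps matches `global_step=30`, and with `logging_steps=10` that gives exactly the three loss rows shown in the table.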

In [20]:
eval_result = trainer.evaluate()
print(eval_result)


{'eval_loss': 0.017220476642251015, 'eval_runtime': 7.5383, 'eval_samples_per_second': 2.653, 'eval_steps_per_second': 0.398, 'epoch': 3.0}


In [21]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Get the model's predictions on the test set
def evaluar_modelo(model, test_dataset):
    trainer = Trainer(model=model)
    raw_pred, _, _ = trainer.predict(test_dataset)
    # Pick the class with the highest logit for each sample
    pred_labels = torch.argmax(torch.tensor(raw_pred), dim=1)
    return pred_labels.numpy()

# Evaluate
predictions = evaluar_modelo(model, test_dataset)
accuracy = accuracy_score(test_labels, predictions)
precision, recall, f1, _ = precision_recall_fscore_support(test_labels, predictions, average='binary')

print(f"🎯 Accuracy: {accuracy:.4f}")
print(f"✅ Precision: {precision:.4f}")
print(f"🔁 Recall: {recall:.4f}")
print(f"📊 F1-score: {f1:.4f}")


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


🎯 Accuracy: 1.0000
✅ Precision: 1.0000
🔁 Recall: 1.0000
📊 F1-score: 1.0000
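
A perfect score on 20 test samples is weaker evidence than it looks. The sketch below recomputes the four metrics by hand for a stratified 20-sample test set (assumed to hold 10 samples per class), shows how much a single mistake would move them, and gives an exact (Clopper-Pearson) 95% lower bound on the true accuracy when all 20 predictions are correct. The helper names are hypothetical and the code does not use the trained model.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision, "recall": recall, "f1": f1}

def lower_bound_all_correct(n, confidence=0.95):
    """Exact two-sided lower bound on accuracy when all n test samples are correct."""
    alpha = 1.0 - confidence
    return (alpha / 2.0) ** (1.0 / n)

# Assumed layout of the stratified test set: 10 true (0) and 10 fake (1) samples.
y_true = [0] * 10 + [1] * 10

perfect = binary_metrics(y_true, y_true)
one_miss = binary_metrics(y_true, [0] * 10 + [1] * 9 + [0])  # one fake text missed
bound = lower_bound_all_correct(20)

print("perfect: ", perfect)
print("one miss:", one_miss)
print(f"95% lower bound on accuracy with 20/20 correct: {bound:.4f}")
```

Each test sample is worth five percentage points of accuracy here, so a larger held-out set would be needed before reading the result as a reliable estimate.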


In [25]:
def predecir_noticia(texto):
    """Classify one text with the fine-tuned model: label 0 is true news, label 1 is fake."""
    model.eval()  # Make sure dropout is disabled at inference time
    inputs = tokenizer(texto, return_tensors='pt', truncation=True, padding=True, max_length=128)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}  # Keep inputs on the model's device
    with torch.no_grad():
        outputs = model(**inputs)
    pred = torch.argmax(outputs.logits, dim=1).item()
    return "TRUE" if pred == 0 else "FAKE"

# Example headlines to try
ejemplos = [
    "NASA confirms discovery of water on the Moon.",
    "Scientists say drinking bleach cures COVID-19.",
    "Government announces new economic stimulus package.",
    "Aliens landed in New York last night.",
    "Study shows that chocolate improves brain function.",
    "Vaccines cause autism, according to a new report.",
    "The president signed a bill to improve education.",
    "Secret society controls the world’s governments.",
    "Local man wins lottery twice in one week.",
    "Eating carrots cures all types of cancer."
]

for texto in ejemplos:
    resultado = predecir_noticia(texto)
    print(f"📰 Text: {texto}\n🧠 Prediction: {resultado}\n{'-'*50}")


📰 Text: NASA confirms discovery of water on the Moon.
🧠 Prediction: TRUE
--------------------------------------------------
📰 Text: Scientists say drinking bleach cures COVID-19.
🧠 Prediction: FAKE
--------------------------------------------------
📰 Text: Government announces new economic stimulus package.
🧠 Prediction: TRUE
--------------------------------------------------
📰 Text: Aliens landed in New York last night.
🧠 Prediction: FAKE
--------------------------------------------------
📰 Text: Study shows that chocolate improves brain function.
🧠 Prediction: FAKE
--------------------------------------------------
📰 Text: Vaccines cause autism, according to a new report.
🧠 Prediction: TRUE
--------------------------------------------------
📰 Text: The president signed a bill to improve education.
🧠 Prediction: TRUE
--------------------------------------------------
📰 Text: Secret society controls the world’s governments.
🧠 Prediction: FAKE
-----------
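
The prediction function above returns only a hard label, which hides how confident the model is. One possible extension is to turn the two logits into probabilities with a softmax and report the probability next to the label. The sketch below does this in plain Python on made-up logits; the `softmax` and `describe` helpers and the logit values are hypothetical and are not taken from the trained model.

```python
import math

LABELS = {0: "true", 1: "fake"}  # same convention as the notebook: 0 = true news, 1 = fake news

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def describe(logits):
    """Return the predicted label name and its probability."""
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[pred], probs[pred]

# Hypothetical logits for two headlines.
confident = describe([3.0, -3.0])
borderline = describe([0.1, -0.1])

print(f"confident:  {confident[0]} (p={confident[1]:.4f})")
print(f"borderline: {borderline[0]} (p={borderline[1]:.4f})")
```

Both examples get the same label, but the second one is close to a coin flip, which is useful to know for short headlines that look nothing like the training articles.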

Unnamed: 0.1,Unnamed: 0,text,label
36,,"After months of internal discord, Republicans ...",0
5,without penalty,from anyone who hides behind the Bible.Featur...,1
18,Trump should be removed from office as soon a...,and the military would never voluntarily make...,1
28,and crooks quite a group assembled by Trump,the man who only hires the best people.Featur...,1
31,as well as elsewhere in the OPT. To get an id...,see the video below:[youtube https://www.yout...,1
