# Final project of the NLP group - Panda
## Cycle 2024/1
### Topic: Transformers

Author: Leticia Bossatto Marchez

Task: Fake news detection based on the text of English-language news articles, by fine-tuning the BERT model

Dataset available at: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset?select=True.csv

In [1]:
!pip install transformers datasets torch

In [2]:
import pandas as pd
import torch

In [21]:
url = "https://drive.google.com/file/d/1Hz6rL2qHRW-jnuf6kvVCYAijV6B_5YTC/view?usp=sharing"
url_a = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
fake_df = pd.read_csv(url_a)

In [22]:
url = "https://drive.google.com/file/d/1LReKbkBlvdzXiMoTezd026PTl_EVZnfc/view?usp=sharing"
url_b = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
true_df = pd.read_csv(url_b)
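
The two cells above turn a Google Drive sharing link into a direct-download link by extracting the file id from the URL path. A minimal sketch of that step as a reusable helper follows; the function name `drive_share_to_direct` is hypothetical, and the sketch assumes the sharing link always has the `.../file/d/<id>/view?...` shape used in this notebook.

```python
def drive_share_to_direct(share_url: str) -> str:
    """Turn a Google Drive sharing link into a direct-download link."""
    # A sharing link looks like .../file/d/<FILE_ID>/view?usp=sharing,
    # so the file id is the second-to-last segment when splitting on "/".
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?id=" + file_id


share_url = "https://drive.google.com/file/d/1Hz6rL2qHRW-jnuf6kvVCYAijV6B_5YTC/view?usp=sharing"
direct_url = drive_share_to_direct(share_url)
print(direct_url)
```

The resulting string is what `pd.read_csv` receives in the cells above.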

Binary classification: 1 or 0
1 -> Positive (fake news)
0 -> Negative (non-fake news)

In [23]:
fake_df["class"] = 1
true_df["class"] = 0

In [24]:
df = pd.concat([fake_df,true_df]).reset_index(drop=True)

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   class    44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 1.7+ MB


In [9]:
df.head(10)

| | title | text | subject | date | class |
|---|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | 1 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | 1 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | 1 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 | 1 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 | 1 |
| 5 | Racist Alabama Cops Brutalize Black Boy While... | The number of cases of cops brutalizing and ki... | News | December 25, 2017 | 1 |
| 6 | Fresh Off The Golf Course, Trump Lashes Out A... | Donald Trump spent a good portion of his day a... | News | December 23, 2017 | 1 |
| 7 | Trump Said Some INSANELY Racist Stuff Inside ... | In the wake of yet another court decision that... | News | December 23, 2017 | 1 |
| 8 | Former CIA Director Slams Trump Over UN Bully... | Many people have raised the alarm regarding th... | News | December 22, 2017 | 1 |
| 9 | WATCH: Brand-New Pro-Trump Ad Features So Muc... | Just when you might have thought we d get a br... | News | December 21, 2017 | 1 |


In [10]:
df["class"].value_counts()

| class | count |
|---|---|
| 1 | 23481 |
| 0 | 21417 |
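
The class counts above show a fairly balanced dataset. As a rough reference point, the sketch below computes the share of the majority class, which is the accuracy a classifier would reach by always predicting that class. The counts are copied from the output above; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 23481, 0: 21417}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
majority_share = counts[majority_label] / total

print(f"total={total}, majority class={majority_label}, share={majority_share:.3f}")
```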


In [11]:
df["subject"].value_counts()

| subject | count |
|---|---|
| politicsNews | 11272 |
| worldnews | 10145 |
| News | 9050 |
| politics | 6841 |
| left-news | 4459 |
| Government News | 1570 |
| US_News | 783 |
| Middle-east | 778 |


In [12]:
import numpy as np

Average text length (in characters):

In [13]:
np.mean(df["text"].apply(lambda x: len(x)))

2469.1096930820972

Average title length (in characters):

In [14]:
np.mean(df["title"].apply(lambda x: len(x)))

80.11171989843646

## Visualizing embeddings with Gensim
Similar words

In [15]:
import gensim

In [16]:
gensim.utils.simple_preprocess

Example of preprocessing with the gensim library

In [17]:
df["title"][0]

' Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing'

In [18]:
gensim.utils.simple_preprocess(df["title"][0])

['donald',
 'trump',
 'sends',
 'out',
 'embarrassing',
 'new',
 'year',
 'eve',
 'message',
 'this',
 'is',
 'disturbing']
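
To make the transformation above more concrete, here is a rough standard-library approximation of what `simple_preprocess` appears to do with its default settings: lowercase the text, keep runs of non-digit word characters, and drop tokens shorter than 2 or longer than 15 characters. The function name `simple_preprocess_sketch` is hypothetical, and this is not gensim's actual implementation (for example, it ignores the optional accent removal).

```python
import re


def simple_preprocess_sketch(text, min_len=2, max_len=15):
    # Lowercase, then keep runs of word characters that are not digits.
    tokens = re.findall(r"[^\W\d]+", text.lower())
    # Drop tokens that are too short or too long, or that start with "_".
    return [
        t for t in tokens
        if min_len <= len(t) <= max_len and not t.startswith("_")
    ]


title = " Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing"
tokens = simple_preprocess_sketch(title)
print(tokens)
```

On this title the sketch drops the lone "s" left over from "Year’s", which is why "year" and "eve" appear as neighbours in the output above.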

In [None]:
texts_preprocessed = df["text"].apply(lambda x: gensim.utils.simple_preprocess(x))

In [20]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4
)

In [21]:
model.build_vocab(texts_preprocessed, progress_per=1000)
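
The `min_count=2` setting used when creating the model means that words seen fewer than two times in the corpus are left out of the vocabulary. A minimal sketch of that filtering idea, using only the standard library and a made-up toy corpus (the helper `build_vocab_sketch` is hypothetical and much simpler than gensim's vocabulary builder):

```python
from collections import Counter


def build_vocab_sketch(tokenized_docs, min_count=2):
    # Count every token across all documents.
    counts = Counter(token for doc in tokenized_docs for token in doc)
    # Keep only tokens that occur at least min_count times.
    return {token: n for token, n in counts.items() if n >= min_count}


docs = [
    ["trump", "sends", "message"],
    ["trump", "message", "disturbing"],
]
vocab = build_vocab_sketch(docs)
print(vocab)
```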

In [22]:
model.epochs

5

In [23]:
model.corpus_count

44898

Training the embeddings:

In [24]:
model.train(texts_preprocessed, total_examples=model.corpus_count, epochs=model.epochs)

(70032509, 87559630)
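
The tuple printed above can be unpacked with a little arithmetic. This sketch assumes, based on gensim's documented behaviour, that `train` returns the number of effective words (after rare-word filtering and frequent-word downsampling) followed by the number of raw words, both summed over all epochs. Treat that reading as an assumption rather than something the notebook verifies.

```python
# Values copied from the train() output above; 5 is the value of model.epochs.
effective_words, raw_words = 70032509, 87559630
epochs = 5

# Raw tokens seen in a single pass over the corpus.
raw_words_per_epoch = raw_words // epochs
# Fraction of raw tokens that actually contributed to training.
kept_fraction = effective_words / raw_words

print(raw_words_per_epoch, round(kept_fraction, 3))
```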

Visualizing the most similar words:

'trump' is close to masculine pronouns (he, his, and him)

In [25]:
model.wv.most_similar("trump")

[('he', 0.5560417771339417),
 ('his', 0.537224292755127),
 ('cruz', 0.5362269878387451),
 ('hagling', 0.5249783396720886),
 ('him', 0.5098312497138977),
 ('obama', 0.4926687180995941),
 ('rumsfeld', 0.48771145939826965),
 ('rubio', 0.4829549491405487),
 ('elect', 0.47865116596221924),
 ('abe', 0.43096035718917847)]
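
The scores returned by `most_similar` are cosine similarities between word vectors. The following sketch ranks words by cosine similarity using only the standard library; the three-dimensional vectors are invented for illustration and have nothing to do with the trained model, and `most_similar_sketch` is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_sketch(query, vectors, topn=3):
    # Score every other word against the query vector, highest first.
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, invented for this example.
toy_vectors = {
    "trump": [0.9, 0.1, 0.3],
    "he": [0.8, 0.2, 0.4],
    "cruz": [0.5, -0.3, 0.6],
    "banana": [-0.2, 0.9, -0.5],
}
ranking = most_similar_sketch("trump", toy_vectors)
print(ranking)
```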

'black' is frequently used to refer to Black people, which yields higher similarity with other racial and ethnic terms (whites, hispanic, colored, latino, african)

In [26]:
model.wv.most_similar("black")

[('color', 0.6409909129142761),
 ('whites', 0.5921503305435181),
 ('blacks', 0.5818833112716675),
 ('hispanic', 0.5266276001930237),
 ('cop', 0.5130081176757812),
 ('young', 0.5094086527824402),
 ('colored', 0.5007658004760742),
 ('latino', 0.4931713342666626),
 ('hooded', 0.4922417402267456),
 ('african', 0.4894956052303314)]

The acronym LGBTQ shows high similarity with minority groups, including those within the acronym itself (gay, transgender, bisexual), but also with 'indigenous' and the word 'marginalized'

In [27]:
model.wv.most_similar("lgbtq")

[('lgbt', 0.8329174518585205),
 ('gay', 0.6541381478309631),
 ('transgender', 0.6330890655517578),
 ('gays', 0.6258832812309265),
 ('advancement', 0.6153060793876648),
 ('marginalized', 0.6013498902320862),
 ('indigenous', 0.598896861076355),
 ('religious', 0.5725387334823608),
 ('bisexual', 0.565801203250885),
 ('jewish', 0.5535891652107239)]

In [28]:
model.save("word2vec-fake-news.model")

## Preprocessing

In [31]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch
import numpy as np
from datasets import Dataset
from sklearn.model_selection import train_test_split

Removing the extra columns

In [None]:
df = df.drop(columns=["title","subject","date"])

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    44898 non-null  object
 1   class   44898 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 701.7+ KB


Splitting the data into training and validation sets

**Train/validation split:** 70% / 30%

In [None]:
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42, stratify=df["class"])
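
The `stratify=df["class"]` argument keeps the class proportions roughly equal in both splits. The sketch below illustrates the idea with the standard library on a toy label list. It is not scikit-learn's algorithm (the per-class rounding and the shuffling differ), and `stratified_split_sketch` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split_sketch(labels, test_size=0.3, seed=42):
    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then take the same fraction from each.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 60 examples of class 1 and 40 of class 0.
labels = [1] * 60 + [0] * 40
train_idx, test_idx = stratified_split_sketch(labels)
test_labels = [labels[i] for i in test_idx]
print(len(train_idx), len(test_idx), test_labels.count(1), test_labels.count(0))
```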

Tokenizing the data with the BERT model's own tokenizer

In [None]:
# Loading the tokenizer: https://huggingface.co/google-bert/bert-base-uncased
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    # Tokenizing the text
    model_inputs = tokenizer(examples['text'], truncation=True, padding="max_length", max_length=128)
    # Adding labels
    model_inputs['labels'] = examples['class']
    return model_inputs
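
With `truncation=True`, `padding="max_length"` and `max_length=128`, every example ends up as exactly 128 token ids. Given the average article length of about 2,469 characters reported earlier, most articles are probably cut off well before their end. The sketch below shows the pad-or-truncate idea on plain lists. It ignores the special tokens a real BERT tokenizer adds, the token ids are arbitrary, `pad_id=0` is an assumption, and `pad_or_truncate` is a hypothetical helper.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    # Keep at most max_length ids.
    truncated = token_ids[:max_length]
    # Real tokens get mask 1, padding positions get mask 0.
    attention_mask = [1] * len(truncated)
    padding = max_length - len(truncated)
    return truncated + [pad_id] * padding, attention_mask + [0] * padding


# A short sequence is padded, a long one is truncated (max_length=8 for readability).
short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1, 301)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```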

In [None]:
# Converting the dataframes into Dataset objects
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

# Applying tokenization
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/31428 [00:00<?, ? examples/s]

Map:   0%|          | 0/13470 [00:00<?, ? examples/s]

In [None]:
print(train_dataset[0])

{'text': 'After GOP Rep. Steve Scalise was shot during a congressional baseball practice, provocateur and has been rock star Ted Nugent made a promise to dial down his own violent hate-filled rhetoric.During an appearance on the WABC Radio Show  Curtis & Eboni,  Nugent said that  my wife has convinced me that I just can t use those harsh terms. I cannot, and I will not, and I encourage even my friends-slash-enemies on the left in the Democrat and liberal world that we have got to be civil to each other. Source: SalonFor the man who threatened to kill both Barack Obama and Hillary Clinton, that pledge lasted just over a month.During a concert in Bonner Springs, Kansas on Friday night, Nugent became unhinged over the President   no, not Trump, the one who s no longer in office and has absolutely zero to do with Nugent s life. He also insulted country music. I was going to play a country song but I still have a (penis) so I can t do that.  And before  Dog Eat Dog,  he praised the presiden

Loading the model:

In [None]:
# Model source: https://huggingface.co/google-bert/bert-base-uncased
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=df['class'].nunique())

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training arguments for the model:

In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',   # Evaluate at the end of each epoch
    save_strategy='epoch',         # Save at the end of each epoch to match evaluation strategy
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,   # Load the best model at the end of training based on metric
    metric_for_best_model='accuracy',
)
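
The batch size and epoch count above determine how many optimizer steps the run takes. A quick check, assuming a single device and no gradient accumulation (neither is stated in the notebook), with the training-set size of 31,428 taken from the tokenization output:

```python
import math

train_examples = 31428   # size of the training split
batch_size = 8           # per_device_train_batch_size
epochs = 3               # num_train_epochs

# The last batch of each epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```

Under these assumptions the total agrees with the `global_step=11787` reported by `trainer.train()`.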



In [None]:
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Convert logits to a tensor if they are not already
    if isinstance(logits, np.ndarray):
        logits = torch.tensor(logits)
    predictions = torch.argmax(logits, dim=-1)
    return metric.compute(predictions=predictions.numpy(), references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)
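
The `compute_metrics` function above takes the highest-scoring logit per example as the predicted class and compares it with the label. The same logic can be sketched without torch or the metric loader; the logits and labels below are invented, and `argmax_accuracy` is a hypothetical helper.

```python
def argmax_accuracy(logits, labels):
    # Index of the largest logit in each row is the predicted class.
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy logits for four examples, two classes each.
toy_logits = [[-4.2, 4.7], [5.4, -5.9], [0.3, 0.1], [-1.0, 2.0]]
toy_labels = [1, 0, 1, 1]
result = argmax_accuracy(toy_logits, toy_labels)
print(result)
```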

Fine-tuning

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.0041 | 0.000799 | 0.999926 |
| 2 | 0.0015 | 0.002786 | 0.999777 |
| 3 | 0.0 | 0.00196 | 0.999852 |


TrainOutput(global_step=11787, training_loss=0.0036224082190515127, metrics={'train_runtime': 2790.6502, 'train_samples_per_second': 33.786, 'train_steps_per_second': 4.224, 'total_flos': 6201790685890560.0, 'train_loss': 0.0036224082190515127, 'epoch': 3.0})

In [None]:
trainer.evaluate(test_dataset)

In [None]:
model.save_pretrained('./saved_model')
tokenizer.save_pretrained('./saved_model')

('./saved_model/tokenizer_config.json',
 './saved_model/special_tokens_map.json',
 './saved_model/vocab.txt',
 './saved_model/added_tokens.json')

In [None]:
res = trainer.predict(test_dataset)

In [None]:
len(test_dataset)

13470

In [None]:
len(res[0])

13470

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!ls "/content/drive/My Drive/Colab Notebooks/Panda/"

In [None]:
path = "/content/drive/My Drive/Colab Notebooks/Panda/"

In [None]:
model.save_pretrained(path+'saved_model')
tokenizer.save_pretrained(path+'saved_model')

('/content/drive/My Drive/Colab Notebooks/Panda/saved_model/tokenizer_config.json',
 '/content/drive/My Drive/Colab Notebooks/Panda/saved_model/special_tokens_map.json',
 '/content/drive/My Drive/Colab Notebooks/Panda/saved_model/vocab.txt',
 '/content/drive/My Drive/Colab Notebooks/Panda/saved_model/added_tokens.json')

In [None]:
import pandas as pd
df_results = pd.DataFrame(res[0])
df_results.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13470 entries, 0 to 13469
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       13470 non-null  float32
 1   1       13470 non-null  float32
dtypes: float32(2)
memory usage: 105.4 KB


In [None]:
df_results.to_csv(path+"Pred_bert_fakenews.csv")

In [26]:
df_results = pd.read_csv("Pred_bert_fakenews.csv")

In [27]:
df_results.head()

| | Unnamed: 0 | 0 | 1 |
|---|---|---|---|
| 0 | 0 | -4.1599 | 4.680323 |
| 1 | 1 | 5.433088 | -5.893195 |
| 2 | 2 | 5.427989 | -5.892473 |
| 3 | 3 | -4.115668 | 4.65182 |
| 4 | 4 | -4.1136 | 4.638439 |


In [35]:
pred_y = [0 if row["0"] > row["1"] else 1 for idx,row in df_results.iterrows()]
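
The columns `0` and `1` hold raw logits, so comparing them is enough to pick a class. If probabilities are wanted, a softmax can be applied to each row. The sketch below does this with the standard library for the first two rows shown above; it is an illustration, not part of the original pipeline.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# First two rows of the prediction table above: (logit for class 0, logit for class 1).
rows = [(-4.1599, 4.680323), (5.433088, -5.893195)]

predictions = []
probabilities = []
for logit_0, logit_1 in rows:
    probabilities.append(softmax([logit_0, logit_1]))
    predictions.append(0 if logit_0 > logit_1 else 1)

print(predictions)
print([[round(p, 4) for p in probs] for probs in probabilities])
```

Because the two logits in each row are far apart, the resulting probabilities are very close to 0 or 1.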

In [38]:
true_y = test_df["class"]

In [42]:
len(test_df)

13470

In [44]:
print("Fraction of correctly classified samples:")
sum([1 if i==j else 0 for i,j in zip(test_df["class"],pred_y)])/len(test_df)

Fraction of correctly classified samples:


0.9999257609502599
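
An accuracy this close to 1 is easier to read as a count. Converting the value above back into a number of misclassified samples, using the validation-set size of 13,470:

```python
n_test = 13470
accuracy = 0.9999257609502599

# Recover the integer number of correct predictions from the reported ratio.
n_correct = round(accuracy * n_test)
n_errors = n_test - n_correct

print(n_correct, n_errors)
```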

Classification report for the predictions:

In [41]:
from sklearn.metrics import classification_report
print(classification_report(true_y, pred_y))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6425
           1       1.00      1.00      1.00      7045

    accuracy                           1.00     13470
   macro avg       1.00      1.00      1.00     13470
weighted avg       1.00      1.00      1.00     13470
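
For reference, the precision, recall and F1 columns in the report follow from the counts of true positives, false positives and false negatives. A minimal sketch for one class, on invented labels rather than the notebook's predictions (`binary_report` is a hypothetical helper):

```python
def binary_report(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
precision, recall, f1 = binary_report(y_true, y_pred)
print(precision, recall, f1)
```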

