# BERT Credibility Model Training and Testing
This notebook trains a BERT-family model (DistilBERT, `distilbert-base-uncased`) to predict the credibility of news articles and then evaluates the trained model.

### Requirement Installation and Data Preparation
In this first section we install our dependencies, import the necessary packages, and prepare the data for training. The dependencies are listed in the requirements.txt file; we primarily use pandas for data manipulation, the Hugging Face transformers library for training, and scikit-learn for model evaluation. The data is stored in three CSV files that have already been split into train/dev/test. Each file has three columns: an article's title, its text, and a binary credibility label (1 = credible, 0 = fake).

#### Installs and Imports 

In [2]:
!pip install -r /kaggle/input/requirements/requirements.txt
!pip install scikit-learn

Collecting langdetect (from -r /kaggle/input/requirements/requirements.txt (line 3))
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Collecting seqeval (from -r /kaggle/input/requirements/requirements.txt (line 7))
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting evaluate (from -r /kaggle/input/requirements/requirements.txt (line 9))
  Downloading evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting responses<0.19 (from evaluate->-r /kaggle/input/requirements/requirements.txt (line 9))
  Downloading responses-0.18.0-py3-none-any.whl.metadata (29 kB)
Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)

In [1]:
import pandas as pd

from transformers import BertTokenizer, BertForSequenceClassification
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
from transformers import Trainer, TrainingArguments

import torch
from torch.utils.data import DataLoader, Dataset
from torch.nn import functional as F
from torch.nn.functional import softmax

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import os
import shutil
import zipfile

from IPython.display import HTML
from IPython.display import FileLink

2024-04-12 15:01:33.021942: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-12 15:01:33.022055: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-12 15:01:33.158123: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


#### Data Preparation
After the data is read in, it is not yet ready for training. First, we concatenate each article's title and text with the proper separator token. Then we build a custom dataset object that tokenizes every example into the tensors BERT expects.

In [2]:
train_data = pd.read_csv("/kaggle/input/credibility-data/full_data_train.csv")
test_data = pd.read_csv("/kaggle/input/credibility-data/full_data_test.csv")
dev_data = pd.read_csv("/kaggle/input/credibility-data/full_data_dev.csv")

train_data.head()

Unnamed: 0,title,text,label
0,'protests in paris ahead of putin visit to fre...,'paris (ap) - human rights activists are ga...,1
1,"'donald trump gives $10,000 to pastor's family'",'chilling: what netanyahu is bracing for obama...,0
2,'california democrats propose in-state tuition...,'california democrats have proposed a law to g...,1
3,'inner earth glows like in the movie avatar','can there be light below the surface of the e...,0
4,'clinton foundation ceo goes missing after tru...,' another astonishing security council (sc) re...,0


In [3]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

def build_model_input(df):
    # Join title and text as "[CLS]title[SEP]text[SEP]", then drop the title column.
    # Note: CustomDataset below also tokenizes with add_special_tokens=True,
    # so the tokenizer is likely to add its own [CLS]/[SEP] around this string.
    df["text"] = tokenizer.cls_token + df['title'] + tokenizer.sep_token + df['text'] + tokenizer.sep_token
    return df.drop(['title'], axis=1)

train_df = build_model_input(train_data)
dev_df = build_model_input(dev_data)
test_df = build_model_input(test_data)

train_df.head()

Unnamed: 0,text,label
0,[CLS]'protests in paris ahead of putin visit t...,1
1,"[CLS]'donald trump gives $10,000 to pastor's f...",0
2,[CLS]'california democrats propose in-state tu...,1
3,[CLS]'inner earth glows like in the movie avat...,0
4,[CLS]'clinton foundation ceo goes missing afte...,0


In [4]:
class CustomDataset(Dataset):
    def __init__(self, data, tokenizer, max_length):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data.iloc[idx]['text']
        label = self.data.iloc[idx]['label']

        # Tokenize text
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        input_ids = encoding['input_ids'].squeeze(0)
        attention_mask = encoding['attention_mask'].squeeze(0)

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'label': torch.tensor(label, dtype=torch.long)
        }
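To make the `padding='max_length'` and `truncation=True` settings more concrete, here is a minimal pure-Python sketch of what they do to a list of token ids. The helper name `pad_or_truncate` and the example ids are only illustrative, and the sketch is simplified: the real tokenizer also keeps the closing `[SEP]` token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Simplified illustration only: the real tokenizer preserves the final
    [SEP] token on truncation, which this sketch does not.
    """
    ids = list(token_ids[:max_length])        # cut off anything past max_length
    n_pad = max_length - len(ids)             # how many padding slots are needed
    attention_mask = [1] * len(ids) + [0] * n_pad
    input_ids = ids + [pad_id] * n_pad
    return input_ids, attention_mask


# Illustrative ids for a short sequence, padded to length 6
short_ids, short_mask = pad_or_truncate([101, 7592, 2088, 102], max_length=6)
print(short_ids)   # padded with pad_id at the end
print(short_mask)  # 1 for real tokens, 0 for padding

# A sequence longer than max_length is cut down to max_length
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)
print(long_ids, long_mask)
```

Because every example is padded to the same length (512 in this notebook), the default collation can stack the examples into a batch without any extra logic.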

In [5]:
train_dataset = CustomDataset(train_df, tokenizer, 512)
dev_dataset = CustomDataset(dev_df, tokenizer, 512)
test_dataset = CustomDataset(test_df, tokenizer, 512)

### Model Training
Now we can train the model. We define a metrics function, set the training arguments, and train the model. We then save the final checkpoint so the model can be reused later without retraining.

In [6]:
def get_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    
    # Calculate metrics using scikit-learn
    accuracy = accuracy_score(labels, preds)
    precision = precision_score(labels, preds, average='weighted')
    recall = recall_score(labels, preds, average='weighted')
    f1 = f1_score(labels, preds, average='weighted')

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
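The `average='weighted'` option averages the per-class scores, weighting each class by its number of true examples. The sketch below reimplements that idea in plain Python on a small made-up label set (the function name and the data are illustrative, not part of the notebook). One consequence is that weighted recall always equals accuracy, which is why the Accuracy and Recall columns in the training table are identical.

```python
from collections import Counter


def weighted_precision_recall(labels, preds):
    """Support-weighted precision and recall, in the spirit of average='weighted'."""
    support = Counter(labels)          # number of true examples per class
    total = len(labels)
    precision = 0.0
    recall = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        n_pred = sum(1 for p in preds if p == cls)
        cls_precision = tp / n_pred if n_pred else 0.0
        cls_recall = tp / n_true
        weight = n_true / total        # class weight = share of true examples
        precision += weight * cls_precision
        recall += weight * cls_recall
    return precision, recall


# Made-up labels and predictions for illustration
labels = [1, 1, 1, 0, 0, 1, 0, 1]
preds = [1, 0, 1, 0, 1, 1, 0, 1]

precision, recall = weighted_precision_recall(labels, preds)
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
print(f"precision={precision:.4f} recall={recall:.4f} accuracy={accuracy:.4f}")
```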

In [None]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

os.environ["WANDB_DISABLED"] = "true"

training_args = TrainingArguments(
    output_dir='/kaggle/working/models',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='/kaggle/working/logs',  # directory for storing logs
    logging_steps=10,                # log training loss every n steps
    evaluation_strategy="epoch",     # evaluate model at the end of each epoch
    save_strategy="epoch",           # save model checkpoint at the end of each epoch
    save_total_limit=3,              # limit the total number of saved checkpoints
    save_steps=500,                  # only used when save_strategy="steps"; ignored here
)

credibility_trainer = Trainer(
    model=model,                     # the instantiated 🤗 Transformers model to be trained
    args=training_args,              # training arguments
    train_dataset=train_dataset,     # training dataset
    eval_dataset=dev_dataset,        # evaluation dataset
    tokenizer=tokenizer,             # tokenizer for encoding input data
    compute_metrics=get_metrics
)
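The `warmup_steps=500` argument shapes the learning-rate schedule. As a rough sketch, assuming the Trainer's default linear schedule and default peak learning rate of 5e-5 (neither is set explicitly in this notebook), the learning rate rises linearly for 500 steps and then decays linearly to zero. The function name is hypothetical, and the total of 2949 steps is taken from this run's training output.

```python
def linear_warmup_decay_lr(step, total_steps, warmup_steps=500, peak_lr=5e-5):
    """Learning rate at a given optimizer step for linear warmup + linear decay.

    Assumes the default linear schedule; peak_lr=5e-5 is an assumed default,
    not a value set in the notebook.
    """
    if step < warmup_steps:
        # Warmup phase: ramp from 0 up to peak_lr
        return peak_lr * step / warmup_steps
    # Decay phase: ramp from peak_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 2949  # global_step reported by this training run
for step in (0, 250, 500, 1500, 2949):
    print(step, linear_warmup_decay_lr(step, total_steps))
```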

In [12]:
credibility_trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.0381,0.029144,0.990336,0.990338,0.990336,0.990334
2,0.014,0.020633,0.990844,0.990858,0.990844,0.990842
3,0.0301,0.030843,0.990844,0.990858,0.990844,0.990842


TrainOutput(global_step=2949, training_loss=0.05536111957614969, metrics={'train_runtime': 3024.9646, 'train_samples_per_second': 15.596, 'train_steps_per_second': 0.975, 'total_flos': 6249546933792768.0, 'train_loss': 0.05536111957614969, 'epoch': 3.0})
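The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, which is consistent with the reported samples-per-second to steps-per-second ratio of about 16. It derives the steps per epoch, the range of training-set sizes compatible with that step count, and the share of training spent in warmup.

```python
import math

global_step = 2949        # from TrainOutput
num_train_epochs = 3      # from TrainingArguments
batch_size = 16           # per_device_train_batch_size, assuming one device
warmup_steps = 500        # from TrainingArguments

steps_per_epoch = global_step // num_train_epochs

# ceil(n / batch_size) == steps_per_epoch bounds the number of training examples n
max_examples = steps_per_epoch * batch_size
min_examples = (steps_per_epoch - 1) * batch_size + 1

# Independent estimate from the reported throughput and runtime
estimated_examples = round(15.596 * 3024.9646 / num_train_epochs)

warmup_share = warmup_steps / global_step

print(f"steps per epoch: {steps_per_epoch}")
print(f"training examples: between {min_examples} and {max_examples}")
print(f"throughput-based estimate: {estimated_examples}")
print(f"warmup share of training: {warmup_share:.1%}")
```

The final step count also explains the name of the last checkpoint directory, `checkpoint-2949`.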

In [16]:
def zip_and_move_folder(source_folder, zip_name, destination_folder):
    # Ensure source_folder exists
    if not os.path.exists(source_folder):
        print(f"Error: Folder '{source_folder}' not found.")
        return

    # Ensure destination_folder exists
    if not os.path.exists(destination_folder):
        print(f"Error: Destination folder '{destination_folder}' not found.")
        return

    # Ensure zip_name has a .zip extension
    if not zip_name.endswith('.zip'):
        zip_name += '.zip'

    # Zip the source_folder
    zip_path = os.path.join(destination_folder, zip_name)
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(source_folder):
            for file in files:
                file_path = os.path.join(root, file)
                zipf.write(file_path, os.path.relpath(file_path, source_folder))

    # The archive is written directly into destination_folder above,
    # so no separate move step is needed.

    print(f"Folder '{source_folder}' zipped as '{zip_name}' and moved to '{destination_folder}'.")

source_folder = "/kaggle/working/models/checkpoint-2949"
zip_name = "model.zip"
destination_folder = '/kaggle/working'

zip_and_move_folder(source_folder, zip_name, destination_folder)

Folder '/kaggle/working/models/checkpoint-2949' zipped as 'model.zip' and moved to '/kaggle/working'.
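As an alternative to walking the directory by hand, the standard library's `shutil.make_archive` can produce the same kind of archive in one call. The sketch below is not the function used in this notebook; the helper name `zip_folder` and the temporary demo folders are made up for illustration.

```python
import os
import shutil
import tempfile
import zipfile


def zip_folder(source_folder, zip_name, destination_folder):
    """Zip the contents of source_folder into destination_folder/zip_name."""
    # make_archive appends ".zip" itself, so strip it from the name if present
    if zip_name.endswith(".zip"):
        zip_name = zip_name[:-4]
    base_name = os.path.join(destination_folder, zip_name)
    # root_dir makes archive entries relative to source_folder
    return shutil.make_archive(base_name, "zip", root_dir=source_folder)


# Demo on throwaway folders (stand-ins for the checkpoint and working dirs)
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    with open(os.path.join(src, "config.json"), "w") as f:
        f.write("{}")
    archive_path = zip_folder(src, "model.zip", dst)
    archive_name = os.path.basename(archive_path)
    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()

print(archive_name, names)
```

Keeping the destination folder outside the source folder avoids the archive trying to include itself.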


In [None]:
test_dataset = CustomDataset(test_df, tokenizer, 512)

results = credibility_trainer.evaluate(eval_dataset=test_dataset)

results_df = pd.DataFrame(results, index=[0])
results_df

In [7]:
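# Free up memory for the remaining tasks. Names that were never defined in this
# session (e.g. credibility_trainer when the training cells are skipped) are ignored.
for name in ["train_data", "test_data", "dev_data", "train_df",
             "train_dataset", "dev_df", "dev_dataset", "credibility_trainer"]:
    globals().pop(name, None)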

### Loading and Predicting
This final section shows how to load the saved model, make predictions on the test set, and store them in the test dataframe.

In [8]:
# Load the trained weights from the checkpoint file
checkpoint_path = "/kaggle/input/modelzip"
model = DistilBertForSequenceClassification.from_pretrained(checkpoint_path)

In this next part we evaluate the newly loaded model on the test data, verifying that the saved and reloaded model performs the same as the previously trained model.

In [10]:
os.environ["WANDB_DISABLED"] = "true"

# Instantiate Trainer with the trained model and evaluation arguments
trainer = Trainer(
    model=model,  # The trained model
    tokenizer=tokenizer,  # The tokenizer associated with the model
    compute_metrics=get_metrics,  # Function to compute evaluation metrics
)

# Evaluate the model
results = trainer.evaluate(test_dataset)

# Print the evaluation results
results_df = pd.DataFrame(results, index=[0])
results_df

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Unnamed: 0,eval_loss,eval_accuracy,eval_precision,eval_recall,eval_f1,eval_runtime,eval_samples_per_second,eval_steps_per_second
0,0.039523,0.99084,0.990843,0.99084,0.99084,1209.5748,1.625,0.203


In [12]:
# Collect the raw test texts; they are tokenized batch by batch below
data_to_predict = test_df["text"].tolist()

In [13]:
# Move the model to the GPU for inference
model = model.to('cuda')

In [17]:
probs = []
batch_size = 4
for i in range(0, len(data_to_predict), batch_size):
    batch_data = data_to_predict[i:i+batch_size]

    # Tokenize the batch, padding to its longest sequence and truncating overly long inputs
    tokenized_data = tokenizer(batch_data, truncation=True, padding=True, return_tensors="pt")

    with torch.no_grad():
        # Forward pass on the GPU (the model was moved to 'cuda' above)
        outputs = model(**tokenized_data.to('cuda'))
        logits = outputs.logits
        # Convert logits to class probabilities and move them off the GPU
        probabilities = torch.softmax(logits, dim=1).cpu()

        probs.append(probabilities)
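The softmax step turns the two raw logits for each article into probabilities that sum to 1. A minimal pure-Python sketch of the same computation, using made-up logits for a single example, looks like this:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)                            # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one article: [fake, credible]
example_logits = [-4.5, 4.6]
example_probs = softmax(example_logits)
print(example_probs)
```

Index 1 of the result corresponds to label 1 (credible), which is the column extracted in the next cell.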

In [21]:
# Concatenate the per-batch probability tensors into one (num_examples, 2) tensor
all_probs = torch.cat(probs, dim=0)

# Column 1 holds the probability of class 1 (credible)
class1_probs = all_probs[:, 1].tolist()

In [23]:
test_df['predicted_prob'] = class1_probs
test_df.head(10)

Unnamed: 0,text,label,predicted_prob
0,[CLS]'why the truth might get you fired'[SEP]'...,0,0.00011
1,"[CLS]'monica lewinsky, clinton sex scandal set...",1,0.999976
2,[CLS]'humiliated hillary tries to hide what ca...,0,0.00011
3,[CLS]'study: more than half of car crashes inv...,1,0.999977
4,[CLS]'mindful eating as way to fight bingeing ...,1,0.999965
5,"[CLS]'massive anti-trump protests, union squar...",0,0.000117
6,[CLS]'turkey threatens to open migrant 'land p...,1,0.999975
7,[CLS]'mike birbiglia's 6 tips for making it sm...,1,0.999977
8,[CLS]''chapo trap house': new left-wing podcas...,0,0.000106
9,[CLS]'the beautiful prehistoric world: is eart...,0,0.000125
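To turn these probabilities into hard labels, one simple option is to threshold them. The sketch below uses the ten rows shown above and an assumed threshold of 0.5, which for a two-class softmax matches taking the argmax. The variable names are illustrative.

```python
# Labels and predicted probabilities copied from the ten rows displayed above
labels = [0, 1, 0, 1, 1, 0, 1, 1, 0, 0]
predicted_prob = [0.00011, 0.999976, 0.00011, 0.999977, 0.999965,
                  0.000117, 0.999975, 0.999977, 0.000106, 0.000125]

threshold = 0.5  # assumed cut-off; not specified in the notebook
predicted_label = [int(p >= threshold) for p in predicted_prob]

matches = sum(y == p for y, p in zip(labels, predicted_label))
print(f"{matches}/{len(labels)} of the displayed rows are classified correctly")
```

On these ten rows every prediction agrees with the label, in line with the roughly 99% test accuracy reported earlier.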
