<a href="https://colab.research.google.com/github/Naomiball/notebooks/blob/master/Sercero_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This notebook shows how to build a text classifier on the health_fact dataset from the Hugging Face Hub by fine-tuning a pre-trained transformer.

Install transformers, datasets, scipy and scikit-learn

In [None]:
! pip install git+https://github.com/huggingface/transformers.git
! pip install git+https://github.com/huggingface/datasets.git
! pip install scipy scikit-learn

In [None]:
!pip install torch==1.5.1+cpu torchvision==0.6.1+cpu -f https://download.pytorch.org/whl/torch_stable.html -vvv
import torch
print(torch.__version__)

Install Git-LFS and set up Git

In [None]:
!apt install git-lfs
!git config --global user.email "totoshuang@yahoo.com"
!git config --global user.name "Naomiball"

Check version of Transformers:

In [4]:
import transformers

print(transformers.__version__)

4.12.0.dev0


## Loading the dataset

In [5]:
import datasets
ds_list = datasets.list_datasets()
[ds for ds in ds_list if "health_fact" in ds]

['health_fact']

In [7]:
dataset = datasets.load_dataset("health_fact")

Using custom data configuration default


Downloading and preparing dataset health_fact/default (download: 23.74 MiB, generated: 64.34 MiB, post-processed: Unknown size, total: 88.08 MiB) to /root/.cache/huggingface/datasets/health_fact/default/1.1.0/99503637e4255bd805f84d57031c18fe4dd88298f00299d56c94fc59ed68ec19...


Dataset health_fact downloaded and prepared to /root/.cache/huggingface/datasets/health_fact/default/1.1.0/99503637e4255bd805f84d57031c18fe4dd88298f00299d56c94fc59ed68ec19. Subsequent calls will reuse this data.


## Exploring the dataset

In [8]:
#type(dataset)
type(dataset['train'])

datasets.arrow_dataset.Dataset

In [9]:

dataset['train'].features

{'claim': Value(dtype='string', id=None),
 'claim_id': Value(dtype='string', id=None),
 'date_published': Value(dtype='string', id=None),
 'explanation': Value(dtype='string', id=None),
 'fact_checkers': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=4, names=['false', 'mixture', 'true', 'unproven'], names_file=None, id=None),
 'main_text': Value(dtype='string', id=None),
 'sources': Value(dtype='string', id=None),
 'subjects': Value(dtype='string', id=None)}
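
The `label` feature is a `ClassLabel`, so each example stores an integer id that indexes into the `names` list above. A minimal sketch of that mapping in plain Python, with the names copied from the output above (`id2label` and `label2id` are illustrative helper names, not part of the dataset):

```python
# Class names copied from the ClassLabel output above; list position = integer id.
names = ["false", "mixture", "true", "unproven"]

id2label = dict(enumerate(names))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[2])           # true
print(label2id["unproven"])  # 3
# Any id outside 0..3 has no class name.
print(id2label.get(-1))      # None
```

With the `datasets` library itself, `dataset['train'].features['label'].int2str(2)` performs the same lookup.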

In [10]:
# check the length (in characters) of the claims in the training split
from pandas import Series
text_len = [len(text) for text in dataset['train']['claim']]
Series(text_len).describe()

count    9832.000000
mean       87.992372
std        96.687607
min         0.000000
25%        57.000000
50%        68.000000
75%       105.000000
max      4925.000000
dtype: float64

Some observations have the label "-1", which is not one of the four classes and looks like a mistake to me, so some data cleaning needs to be done here.

In [11]:

new_dataset = dataset['train']
targets = set(new_dataset['label'])
print("targets:{}".format(targets))

# Sort by label so the '-1' rows come first, then use train_test_split without
# shuffling to drop the first 0.5% of rows; split again afterwards for training
new_dataset = new_dataset.sort('label')
new_dataset = new_dataset.train_test_split(test_size=0.995, shuffle=False)
print(set(new_dataset['test']['label']))

dataset = new_dataset['test'].train_test_split(test_size=0.1)

targets:{0, 1, 2, 3, -1}
{0, 1, 2, 3}
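
The sort-and-split step above works because `-1` sorts first and `train_test_split(test_size=0.995, shuffle=False)` discards the leading 0.5% of rows. The sketch below replays that arithmetic on a synthetic label list and compares it with an explicit filter. The count of 30 unlabeled rows is a made-up figure for illustration, not a statistic of health_fact, and the split sizes assume the usual ceil-then-remainder rounding.

```python
import math

n_rows = 9832       # size of the original train split
n_unlabeled = 30    # hypothetical count of "-1" rows, for illustration only
labels = [-1] * n_unlabeled + [i % 4 for i in range(n_rows - n_unlabeled)]

# Sort-and-split: keep the last ceil(0.995 * n) rows of the sorted labels.
n_keep = math.ceil(0.995 * n_rows)
kept_by_split = sorted(labels)[n_rows - n_keep:]

# Explicit filter: drop only the rows whose label is -1.
kept_by_filter = [y for y in labels if y != -1]

print(len(kept_by_split), -1 in kept_by_split)   # 9783 False
print(len(kept_by_filter))                       # 9802
print(len(kept_by_filter) - len(kept_by_split))  # 19 valid rows lost by the split
```

The kept count of 9783 matches the 8804 + 979 rows reported later in the notebook. With the `datasets` library, `new_dataset.filter(lambda example: example['label'] != -1)` expresses the explicit version directly and does not depend on the share of unlabeled rows staying below 0.5%.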


The claim lengths above show that most claims are short (the 75th percentile is 105 characters), so I set the maximum sequence length to 128 tokens; longer claims are truncated. I may increase this value later during tuning if it improves performance.

In [12]:
# define base model and parameters
model_checkpoint = "distilbert-base-uncased"
max_length = 128
batch_size = 16
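
Note that the length statistics above count characters, while `max_length` counts tokens. One way to sanity-check a cutoff is to measure what share of texts it would truncate. The sketch below uses whitespace splitting as a crude stand-in for the DistilBERT tokenizer (WordPiece usually yields more tokens than words); `truncated_share` is a hypothetical helper and the sample claims are invented for illustration.

```python
def truncated_share(texts, max_length, tokenize=str.split):
    """Return the fraction of texts whose token count exceeds max_length."""
    if not texts:
        return 0.0
    too_long = sum(1 for text in texts if len(tokenize(text)) > max_length)
    return too_long / len(texts)

# Invented sample claims; in the notebook this would be dataset["train"]["claim"].
sample_claims = [
    "Vitamin C cures the common cold.",          # 6 words
    "A new study links coffee to longer life.",  # 8 words
    "word " * 200,                               # an unusually long claim
    "",                                          # empty claims exist too
]

for cutoff in (5, 128, 256):
    print(cutoff, truncated_share(sample_claims, cutoff))
```

Once the tokenizer is loaded, passing `tokenize=tokenizer.tokenize` would give WordPiece counts instead; the special tokens added during encoding also count toward `max_length`.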

## Preprocessing the data

In [13]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_function(corpus):
    return tokenizer(
        corpus['claim'], padding="max_length", truncation=True, max_length=max_length
    )

tokenized_datasets = dataset.map(tokenize_function)
print(tokenized_datasets.column_names)



{'train': ['claim_id', 'claim', 'date_published', 'explanation', 'fact_checkers', 'main_text', 'sources', 'label', 'subjects', 'input_ids', 'attention_mask'], 'test': ['claim_id', 'claim', 'date_published', 'explanation', 'fact_checkers', 'main_text', 'sources', 'label', 'subjects', 'input_ids', 'attention_mask']}
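
With `padding="max_length"` and `truncation=True`, every claim comes out as exactly `max_length` ids plus an attention mask that marks real tokens with 1 and padding with 0. A simplified sketch of that behavior follows; `pad_or_truncate` is a hypothetical helper, the ids are made up, `pad_id=0` is an assumption about the padding id, and real tokenizers also preserve the closing special token when truncating, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut or right-pad token_ids to max_length and build the attention mask."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad           # pad_id=0 is an assumption
    attention_mask = [1] * len(kept) + [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}

short = pad_or_truncate([101, 7592, 2088, 102], max_length=6)
long = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short)  # {'input_ids': [101, 7592, 2088, 102, 0, 0], 'attention_mask': [1, 1, 1, 1, 0, 0]}
print(long)   # {'input_ids': [1, 2, 3, 4, 5, 6], 'attention_mask': [1, 1, 1, 1, 1, 1]}
```

Padding everything to a fixed length is simple but spends computation on padding for short claims; padding each batch only to its longest member is a common alternative.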


In [14]:
tokenized_datasets = tokenized_datasets.remove_columns(['date_published', 'explanation',
                                         'fact_checkers', 'main_text', 'sources', 'subjects'])
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')
tokenized_datasets = tokenized_datasets.with_format("torch")
tokenized_datasets["train"]

Dataset({
    features: ['claim_id', 'claim', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 8804
})

## Fine-tuning the model

In [15]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=4)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

In [16]:
# define args for trainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    "test-trainer",
    evaluation_strategy='epoch',
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    learning_rate=2e-5,
    weight_decay=0.01
)

In [17]:
from transformers import Trainer
import numpy as np
from datasets import load_metric

# Training on the whole dataset requires too much computing power and time, so I
# created smaller subsets for training and evaluation
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(2000))
small_test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(200))

# Define the evaluation metric
metric = load_metric("accuracy")

def compute_metrics(eval_preds):
    # eval_preds holds the model logits and the true labels
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Instantiate the Trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Run training
trainer.train()




The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: claim, claim_id.
***** Running training *****
  Num examples = 2000
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 625


The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: claim, claim_id.
***** Running Evaluation *****
  Num examples = 200
  Batch size = 16


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.788918 | 0.67 |
| 2 | No log | 0.759135 | 0.675 |
| 3 | No log | 0.759955 | 0.68 |
| 4 | 0.776300 | 0.830315 | 0.685 |
| 5 | 0.776300 | 0.821917 | 0.7 |


Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json

TrainOutput(global_step=625, training_loss=0.7272814208984375, metrics={'train_runtime': 6669.9403, 'train_samples_per_second': 1.499, 'train_steps_per_second': 0.094, 'total_flos': 331180308480000.0, 'train_loss': 0.7272814208984375, 'epoch': 5.0})
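
The numbers in the log and in `TrainOutput` can be cross-checked by hand. A small sketch of the arithmetic, using only values printed above:

```python
import math

num_examples = 2000
batch_size = 16
num_epochs = 5
train_runtime = 6669.9403   # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch)                 # 125
print(total_steps)                     # 625
print(round(samples_per_second, 3))    # 1.499
print(round(steps_per_second, 3))      # 0.094
print(round(train_runtime / 3600, 2))  # 1.85 (hours)
```

Step 500 falls at the end of epoch 4 (4 × 125), which is consistent with the table showing "No log" for training loss until epoch 4 and with the checkpoint saved at `checkpoint-500`.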

Conclusion: With a training subset of 2,000 claims, we built a model that classifies claims with 70% accuracy on a 200-example evaluation subset.
There are several things we can try to improve performance further:
1. Increase the training set size and computing power
2. Analyze metrics per target class and inspect the misclassified data points
3. Tune hyper-parameters
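
For point 2, per-class recall can be derived from a confusion matrix. The sketch below uses plain Python with made-up labels and predictions (not results from this model), and `per_class_recall` is a hypothetical helper; in the notebook the predictions would come from `np.argmax` over the logits returned by `trainer.predict`.

```python
from collections import Counter

names = ["false", "mixture", "true", "unproven"]

def per_class_recall(references, predictions, num_classes):
    """Return (confusion, recall); confusion counts (true, predicted) pairs."""
    confusion = Counter(zip(references, predictions))
    recall = []
    for c in range(num_classes):
        support = sum(confusion[(c, p)] for p in range(num_classes))
        # Recall is undefined (None) for a class with no reference examples.
        recall.append(confusion[(c, c)] / support if support else None)
    return confusion, recall

# Made-up example data, for illustration only.
references  = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
predictions = [0, 0, 2, 2, 1, 2, 2, 2, 0, 2]

confusion, recall = per_class_recall(references, predictions, len(names))
for name, value in zip(names, recall):
    print(f"{name:>9}: {value}")
```

In this toy data the overall accuracy is 60%, yet the "unproven" class is never predicted correctly, which a single accuracy figure hides.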