In [27]:
import pandas as pd

In [28]:
# train = pd.read_parquet('./data/train-00000-of-00001.parquet')
# test = pd.read_parquet('./data/train-00000-of-00001.parquet')

## Preprocessing

The code below preprocesses and cleans the dataset. The preprocessing steps are:

1. Removing the first three characters from each string in the list of strings in the 'text' column (to remove the initial numbers from each string).
2. Joining the list of strings into a single string.
3. Converting the text to lowercase.

After these steps, a function named `clean_text` is defined to further clean the text data. The `clean_text` function performs the following operations:

- Removes text enclosed in square brackets.
- Removes punctuation.
- Removes words containing numbers.

This function is then applied to the 'text' column of the `train` DataFrame to clean the text data.

The rows containing missing values are removed. Duplicate rows are also dropped.
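
As a rough illustration of these steps, the sketch below runs them on a made-up example. The `paragraphs` list is hypothetical (it is not taken from the dataset) and only mimics the shape of one 'text' cell: a list of numbered paragraphs.

```python
import re
import string

def clean_text(text):
    """Remove bracketed text, punctuation, and words containing digits."""
    text = re.sub(r'\[.*?\]', '', text)                              # text in square brackets
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\w*\d\w*', '', text)                             # words containing numbers
    return text

# Hypothetical input: one case as a list of numbered paragraphs.
paragraphs = ["5. The applicant was born in 1950 [see annex].",
              "6. He lives in Vaduz."]

stripped = [p[3:] for p in paragraphs]   # step 1: drop the first three characters
joined = ' '.join(stripped).lower()      # steps 2 and 3: join, then lowercase
cleaned = clean_text(joined)
print(repr(cleaned))
```

Removed tokens leave extra whitespace behind; `' '.join(cleaned.split())` would collapse it if needed.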

In [29]:
# # data preprocessing & cleaning

# train['text'] = train['text'].apply(lambda x: [i[3:] for i in x])
# train['text'] = train['text'].apply(lambda x: ' '.join(x))
# train['text'] = train['text'].apply(lambda x: x.lower())

# import re
# import string

# def clean_text(text):
#     '''Remove text in square brackets, remove punctuation and remove words containing numbers.'''
#     text = re.sub(r'\[.*?\]', '', text) # remove text in square brackets
#     text = re.sub('[%s]' % re.escape(string.punctuation), '', text) # remove punctuation
#     text = re.sub(r'\w*\d\w*', '', text) # remove words containing numbers
#     return text

# clean = lambda x: clean_text(x)

# train['text'] = train.text.apply(clean)

# # remove missing values and duplicate rows
# train = train.dropna()
# train = train.drop_duplicates()

# train.head()

In [30]:
# save the cleaned data
# from google.colab import drive
# drive.mount('/content/drive')

# file_path = '/content/drive/MyDrive/Augnito/train_clean.csv'
file_path = 'train_clean.csv'

train = pd.read_csv(file_path)
train.head()

Unnamed: 0,text,labels
0,at the beginning of the events relevant to t...,[4]
1,the applicant is the monarch of liechtenstein...,[8 3 9]
2,in june plots of agricultural land owned by ...,[3]
3,in mr dušan slobodník a research worker in t...,[6 8 5]
4,the applicant is an italian citizen born in ...,[8 3]


In [31]:
# test = pd.read_csv('/content/drive/MyDrive/Augnito/test_clean.csv')
test = pd.read_csv('test_clean.csv')
test.head()

Unnamed: 0,text,labels
0,the applicant is a journalist for dnno a norw...,[6]
1,the applicant was born in and lives in odesa...,[4]
2,the applicant was born in and lives in smědč...,[3]
3,the applicant was born in and lives in kyiv ...,[3]
4,the applicant was born in and lives in staro...,[1 3]


In [32]:
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertModel, BertConfig

In [33]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

### Hyperparameters 

In [34]:
MAX_LEN = 512
BATCH_SIZE = 8
EPOCHS = 5
LEARNING_RATE = 1e-05
N_CLASSES = 11
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

## Dataset

### Dataset Description

- The dataset comes from the European Court of Human Rights (ECtHR), which hears allegations that a state has breached human rights provisions of the European Convention on Human Rights (ECHR).
- For each case, the dataset provides a list of factual paragraphs (facts) from the case description. Each case is mapped to the ECHR articles that were allegedly violated (considered by the court).
- Each case (fact) can be matched to between 0 and 10 of the following articles: "Article 2", "Article 3", "Article 5", "Article 6", "Article 8", "Article 9", "Article 10", "Article 11", "Article 14" and "Article 1 of Protocol 1".
- The dataset contains 9000 rows in the train set and 1000 rows in the test set.

### Code Implementation

The code below defines a dataset class for our legal dataset, named `CustomDataset`. It inherits from PyTorch's `Dataset` class and turns each row of the dataframe into the tensors needed for the text classification task.

The `CustomDataset` class has the following methods:

- `__init__(self, dataframe, tokenizer, max_len)`: This is the constructor method for the `CustomDataset` class. It initializes the class with a dataframe containing the data, a tokenizer for text processing, and a maximum length for the tokenized text sequences. It also processes the labels from the dataframe into a suitable format.

- `__len__(self)`: This method returns the number of samples in the dataset.

- `__getitem__(self, index)`: This method is used to get the data sample corresponding to a given index. It tokenizes the text data, applies padding and truncation to ensure that all sequences have the same length, and returns a dictionary containing the tokenized inputs and the corresponding targets.
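
The padding and truncation step can be pictured with the minimal pure-Python sketch below. It is not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token IDs are illustrative, and special-token handling during truncation is ignored.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (ids, mask), each with exactly max_len entries."""
    ids = token_ids[:max_len]        # truncate sequences that are too long
    mask = [1] * len(ids)            # 1 = real token the model should attend to
    n_pad = max_len - len(ids)       # number of padding positions still needed
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

# A short sequence is padded, a long one is cut.
print(pad_or_truncate([101, 7592, 2088, 102], max_len=6))
print(pad_or_truncate([1, 2, 3, 4, 5], max_len=3))
```

With `MAX_LEN = 512`, every sample therefore yields ID and mask vectors of length 512, which is what lets the `DataLoader` stack them into a batch.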

The dictionary returned by `__getitem__` contains:

- 'ids': The input IDs from the tokenizer.
- 'mask': The attention mask that indicates to the model which tokens should be attended to, and which should not.
- 'token_type_ids': The segment IDs, used by models that take a token type input (like BERT and related models).
- 'targets': The target labels for the text. Each label index is one-hot encoded and the resulting rows are summed over the first dimension, giving a single multi-hot vector of length 11 with a 1 at every class that applies to the case.
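
To make the target construction concrete, here is a small pure-Python sketch of the same idea. `parse_labels` and `multi_hot` are hypothetical helpers that mirror the `np.fromstring` and `one_hot(...).sum(dim=0)` calls in the class; the `'[]'` input for a case with no labels is an assumption about how such rows are stored.

```python
def parse_labels(label_str):
    """Parse a label cell such as '[8 3 9]' into a list of ints."""
    return [int(tok) for tok in label_str.strip('[]').split()]

def multi_hot(label_ids, num_classes=11):
    """Sum of one-hot rows: 1 at every listed class index, 0 elsewhere."""
    vec = [0] * num_classes
    for idx in label_ids:
        vec[idx] += 1
    return vec

print(multi_hot(parse_labels('[8 3 9]')))  # → [0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0]
print(multi_hot(parse_labels('[]')))       # → all zeros: no article applies
```

Because a label index appears at most once per case, the summed vector only contains 0s and 1s, which is the format `BCEWithLogitsLoss` expects once it is cast to float.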

In [35]:
import torch
import numpy as np
class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = dataframe.text
        self.targets = []
        for i in self.data.labels:
            i = np.fromstring(i[1:-1], dtype=int, sep=' ')
            self.targets.append(i)
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        # text = str(self.text[index])
        # text = " ".join(text.split())
        text = self.text[index]

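        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            truncation=True,       # explicitly truncate texts longer than max_len
            return_token_type_ids=True
        )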

        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]

        # convert the label indices into a multi-hot target vector
        targets = self.targets[index]
        targets = torch.nn.functional.one_hot(torch.tensor(targets, dtype=torch.long), num_classes=11)
        targets = targets.sum(dim=0)

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            # targets is already a tensor, so cast it instead of re-wrapping it in torch.tensor
            'targets': targets.long()
        }

In [36]:
training_set = CustomDataset(train, tokenizer, MAX_LEN)

In [37]:
training_loader = DataLoader(training_set, batch_size=BATCH_SIZE)

## Model

The code below defines a class named `LegalSentimentClassifier`, a custom classifier built on top of the pre-trained BERT model. It is a subclass of PyTorch's `torch.nn.Module`.

The `LegalSentimentClassifier` class has the following methods:

- `__init__(self, n_classes)`: This is the constructor method for the `LegalSentimentClassifier` class. It initializes the class with a specified number of output classes (`n_classes`). It also initializes a BERT model (`self.bert`) from the pre-trained 'bert-base-uncased' model and a linear layer (`self.l0`) that maps the output of the BERT model to the number of output classes.

- `forward(self, input_ids, attention_mask, token_type_ids)`: This method is used to perform a forward pass through the model. It takes as input the `input_ids`, `attention_mask`, and `token_type_ids` for a batch of text data. The BERT model processes this input and returns a pooled output. This pooled output is then passed through the linear layer to produce the final output logits.

Despite its name, the model is not a sentiment classifier: it performs multi-label classification of legal texts. Each of the `n_classes` output logits corresponds to one label, and at evaluation time a sigmoid is applied to each logit independently, so a case can be assigned several labels at once. The number of classes is 11 in our case.

In [38]:
class LegalSentimentClassifier(torch.nn.Module):
    def __init__(self, n_classes):
        super(LegalSentimentClassifier, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.l0 = torch.nn.Linear(768, n_classes)

    def forward(self, input_ids, attention_mask, token_type_ids):
        _, pooled_output = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            return_dict=False
        )
        logits = self.l0(pooled_output)
        return logits


In [39]:
model = LegalSentimentClassifier(N_CLASSES)
model.to(device)

LegalSentimentClassifier(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_

## Pretrained BERT

In [40]:
test_set = CustomDataset(test, tokenizer, MAX_LEN)
testing_loader = DataLoader(test_set, batch_size=BATCH_SIZE)

In [41]:
from tqdm import tqdm

def validation(epoch):
    model.eval()
    fin_targets = []
    fin_outputs = []

    with torch.no_grad():
        # Wrap the loader with tqdm for a progress bar
        pbar = tqdm(enumerate(testing_loader), total=len(testing_loader))
        for _, data in pbar:
            ids = data['ids'].to(device, dtype=torch.long)
            mask = data['mask'].to(device, dtype=torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
            targets = data['targets'].to(device, dtype=torch.float)

            outputs = model(ids, mask, token_type_ids)

            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
            pbar.set_description(f'Validation Epoch: {epoch}')

    return fin_outputs, fin_targets


In [42]:
from sklearn import metrics

for epoch in range(EPOCHS):
    outputs, targets = validation(epoch)
    outputs = np.array(outputs) >= 0.5
    accuracy = metrics.accuracy_score(targets, outputs)
    jaccard_score = metrics.jaccard_score(targets, outputs, average='samples')
    f1_score_micro = metrics.f1_score(targets, outputs, average='micro')
    f1_score_macro = metrics.f1_score(targets, outputs, average='macro')
    print(f"Accuracy Score = {accuracy}")
    print(f"Jaccard Score = {jaccard_score}")
    print(f"F1 Score (Micro) = {f1_score_micro}")
    print(f"F1 Score (Macro) = {f1_score_macro}")

  0%|          | 0/125 [00:00<?, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
  'targets': torch.tensor(targets, dtype=torch.long)
Validation Epoch: 0: 100%|██████████| 125/125 [00:44<00:00,  2.82it/s]


Accuracy Score = 0.0
Jaccard Score = 0.15581071428571425
F1 Score (Micro) = 0.2599364069952305
F1 Score (Macro) = 0.15876036291991663


  'targets': torch.tensor(targets, dtype=torch.long)
Validation Epoch: 1: 100%|██████████| 125/125 [00:43<00:00,  2.85it/s]


Accuracy Score = 0.0
Jaccard Score = 0.15581071428571425
F1 Score (Micro) = 0.2599364069952305
F1 Score (Macro) = 0.15876036291991663


  'targets': torch.tensor(targets, dtype=torch.long)
Validation Epoch: 2: 100%|██████████| 125/125 [00:42<00:00,  2.91it/s]


Accuracy Score = 0.0
Jaccard Score = 0.15581071428571425
F1 Score (Micro) = 0.2599364069952305
F1 Score (Macro) = 0.15876036291991663


  'targets': torch.tensor(targets, dtype=torch.long)
Validation Epoch: 3: 100%|██████████| 125/125 [00:43<00:00,  2.90it/s]


Accuracy Score = 0.0
Jaccard Score = 0.15581071428571425
F1 Score (Micro) = 0.2599364069952305
F1 Score (Macro) = 0.15876036291991663


  'targets': torch.tensor(targets, dtype=torch.long)
Validation Epoch: 4: 100%|██████████| 125/125 [00:43<00:00,  2.90it/s]

Accuracy Score = 0.0
Jaccard Score = 0.15581071428571425
F1 Score (Micro) = 0.2599364069952305
F1 Score (Macro) = 0.15876036291991663





## Finetuned BERT

In [43]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)
loss_fn = torch.nn.BCEWithLogitsLoss()
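
`BCEWithLogitsLoss` treats each of the 11 outputs as an independent binary decision: it applies a sigmoid to every logit and averages the binary cross-entropy over all labels. The pure-Python sketch below reproduces that computation for a single sample. `bce_with_logits` is a hypothetical helper written for illustration, not PyTorch's implementation.

```python
import math

def bce_with_logits(logits, targets):
    """Mean over labels of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Uninformative logits (all zero) cost log(2) per label.
print(bce_with_logits([0.0, 0.0], [1.0, 0.0]))
# Confident, correct logits cost almost nothing.
print(bce_with_logits([10.0, -10.0], [1.0, 0.0]))
```

This is why a sigmoid, rather than a softmax, is applied to the outputs at evaluation time: the labels are not mutually exclusive.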

In [44]:
import tqdm

In [45]:
from tqdm import tqdm

def train(epoch):
    model.train()
    pbar = tqdm(enumerate(training_loader), total=len(training_loader))
    avg_loss = 0
    for i, data in pbar:
        ids = data['ids'].to(device, dtype=torch.long)
        mask = data['mask'].to(device, dtype=torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
        targets = data['targets'].to(device, dtype=torch.float)

        outputs = model(ids, mask, token_type_ids)
        loss = loss_fn(outputs, targets)
        # if i % 5000 == 0:
        #     print(f'Epoch: {epoch}, Loss: {loss.item()}')

        # reset gradients, backpropagate, and update the weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        avg_loss += loss.item()
        pbar.set_description(f'Epoch: {epoch}, Loss: {loss.item()}')
    print("Average Loss: ", avg_loss/len(training_loader))

for epoch in range(EPOCHS):
    train(epoch)

  'targets': torch.tensor(targets, dtype=torch.long)
Epoch: 0, Loss: 0.15609687566757202: 100%|██████████| 1125/1125 [09:32<00:00,  1.97it/s]


Average Loss:  0.23738365139563877


Epoch: 1, Loss: 0.12361260503530502: 100%|██████████| 1125/1125 [09:29<00:00,  1.98it/s]


Average Loss:  0.16014755400684144


Epoch: 2, Loss: 0.10439779609441757: 100%|██████████| 1125/1125 [09:34<00:00,  1.96it/s]


Average Loss:  0.1339489597843753


Epoch: 3, Loss: 0.08795338869094849: 100%|██████████| 1125/1125 [09:33<00:00,  1.96it/s] 


Average Loss:  0.11342500474221176


Epoch: 4, Loss: 0.06652417033910751: 100%|██████████| 1125/1125 [09:30<00:00,  1.97it/s] 

Average Loss:  0.0955879259871112





In [46]:
# save model
save_path = 'trained_model.pt'
torch.save(model, save_path)

In [47]:
test_set = CustomDataset(test, tokenizer, MAX_LEN)
testing_loader = DataLoader(test_set, batch_size=BATCH_SIZE)

In [48]:
from tqdm import tqdm

def validation(epoch):
    model.eval()
    fin_targets = []
    fin_outputs = []

    with torch.no_grad():
        # Wrap the loader with tqdm for a progress bar
        pbar = tqdm(enumerate(testing_loader), total=len(testing_loader))
        for _, data in pbar:
            ids = data['ids'].to(device, dtype=torch.long)
            mask = data['mask'].to(device, dtype=torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
            targets = data['targets'].to(device, dtype=torch.float)

            outputs = model(ids, mask, token_type_ids)

            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
            pbar.set_description(f'Validation Epoch: {epoch}')

    return fin_outputs, fin_targets


In [49]:
from sklearn import metrics



for epoch in range(EPOCHS):
    outputs, targets = validation(epoch)
    outputs = np.array(outputs) >= 0.5
    accuracy = metrics.accuracy_score(targets, outputs)
    jaccard_score = metrics.jaccard_score(targets, outputs, average='samples')
    f1_score_micro = metrics.f1_score(targets, outputs, average='micro')
    f1_score_macro = metrics.f1_score(targets, outputs, average='macro')
    print(f"Accuracy Score = {accuracy}")
    print(f"Jaccard Score = {jaccard_score}")
    print(f"F1 Score (Micro) = {f1_score_micro}")
    print(f"F1 Score (Macro) = {f1_score_macro}")

  'targets': torch.tensor(targets, dtype=torch.long)
Validation Epoch: 0: 100%|██████████| 125/125 [00:42<00:00,  2.91it/s]
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


Accuracy Score = 0.506
Jaccard Score = 0.6681833333333332
F1 Score (Micro) = 0.718294051627385
F1 Score (Macro) = 0.5553858849871911


  'targets': torch.tensor(targets, dtype=torch.long)
Validation Epoch: 1: 100%|██████████| 125/125 [00:42<00:00,  2.91it/s]
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


Accuracy Score = 0.506
Jaccard Score = 0.6681833333333332
F1 Score (Micro) = 0.718294051627385
F1 Score (Macro) = 0.5553858849871911


  'targets': torch.tensor(targets, dtype=torch.long)
Validation Epoch: 2: 100%|██████████| 125/125 [00:42<00:00,  2.91it/s]
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


Accuracy Score = 0.506
Jaccard Score = 0.6681833333333332
F1 Score (Micro) = 0.718294051627385
F1 Score (Macro) = 0.5553858849871911


  'targets': torch.tensor(targets, dtype=torch.long)
Validation Epoch: 3: 100%|██████████| 125/125 [00:42<00:00,  2.91it/s]
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


Accuracy Score = 0.506
Jaccard Score = 0.6681833333333332
F1 Score (Micro) = 0.718294051627385
F1 Score (Macro) = 0.5553858849871911


  'targets': torch.tensor(targets, dtype=torch.long)
Validation Epoch: 4: 100%|██████████| 125/125 [00:42<00:00,  2.91it/s]

Accuracy Score = 0.506
Jaccard Score = 0.6681833333333332
F1 Score (Micro) = 0.718294051627385
F1 Score (Macro) = 0.5553858849871911



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


## Results

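When we run the pretrained BERT model directly on the test set, we observe far worse performance than with the same model fine-tuned on the ECtHR_b legal dataset. After five epochs of fine-tuning, accuracy rises from 0.0 to 0.506, the Jaccard score from 0.16 to 0.67, micro-F1 from 0.26 to 0.72, and macro-F1 from 0.16 to 0.56. Note that in the pretrained run the linear classification layer on top of BERT is randomly initialized and has never been trained, so those scores are a baseline rather than a measure of BERT's language knowledge. The fine-tuned model shows better results for the following reasons: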
- **Domain-Specific Adaptation:** Fine-tuning allows the model to adapt to the nuances of a specific domain or task by updating its parameters based on a smaller, domain-specific dataset. This is particularly useful for tasks with unique linguistic patterns or vocabulary.
- **Improved Task Performance:** Fine-tuning can lead to superior performance on domain-specific tasks, as the model becomes more attuned to the intricacies of the target domain.
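
The gap between an accuracy of 0.0 and a non-zero Jaccard or F1 score in the pretrained run follows from how these metrics treat multi-label predictions: accuracy only counts a sample when its whole label vector is exactly right, while Jaccard and F1 give partial credit. The sketch below re-implements the three metrics in plain Python on made-up predictions. The helper functions and the toy data are illustrative assumptions, not the sklearn code used above, and scoring an empty union as 0 is an assumption.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def jaccard_samples(y_true, y_pred):
    """Per-sample |intersection| / |union|, averaged over samples."""
    scores = []
    for t, p in zip(y_true, y_pred):
        inter = sum(1 for a, b in zip(t, p) if a == 1 and b == 1)
        union = sum(1 for a, b in zip(t, p) if a == 1 or b == 1)
        scores.append(inter / union if union else 0.0)  # empty union scored as 0
    return sum(scores) / len(scores)

def f1_micro(y_true, y_pred):
    """F1 computed from true/false positives pooled over all labels."""
    pairs = [(a, b) for t, p in zip(y_true, y_pred) for a, b in zip(t, p)]
    tp = sum(1 for a, b in pairs if a == 1 and b == 1)
    fp = sum(1 for a, b in pairs if a == 0 and b == 1)
    fn = sum(1 for a, b in pairs if a == 1 and b == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy data: two samples, four labels, sigmoid outputs thresholded at 0.5.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_prob = [[0.9, 0.6, 0.7, 0.1], [0.2, 0.8, 0.55, 0.3]]
y_pred = [[int(p >= 0.5) for p in row] for row in y_prob]

print(subset_accuracy(y_true, y_pred))   # no sample is exactly right
print(jaccard_samples(y_true, y_pred))   # partial overlap still scores
print(f1_micro(y_true, y_pred))
```

Here every true label is found, yet one extra label per sample is enough to drive accuracy to zero.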