#PET implementation for Basque corpora.
Although several semi-supervised methods have been tested in low-resource scenarios, most of them have been applied only to English. Basque corpora and models encounter these scenarios more often, so we propose applying one of those semi-supervised methods to Basque.

In this notebook we use the implementation of [PET](https://github.com/timoschick/pet) to test whether it can be applied to Basque corpora and models.

The procedure is tested on the [Basque Headlines Topic Classification (BHTC)](https://hizkuntzateknologiak.elhuyar.eus/es/recursos) dataset, a Basque corpus made up of news headlines from the Argia newspaper. These headlines are labeled with 12 different topics.

The pretrained language model used throughout the procedure is BERTeus. This model has already proven useful for topic classification in [Give your Text Representation Models some Love: the Case for Basque](https://arxiv.org/abs/2004.00033), so we check whether it still performs well when trained in a few-shot scenario.

The pretrained version (not fine-tuned on any downstream task) is used to compare the results obtained with PET (and iPET) against the usual supervised training in a few-shot scenario.

In [None]:
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)
%cd /content/drive/MyDrive/Colab Notebooks/PET_basque
# ! git clone https://github.com/timoschick/pet.git
! pip install -r pet/requirements.txt
#restart the runtime after installing the requirements

In [None]:
%cd /content/drive/MyDrive/Colab Notebooks/PET_basque/pet

/content/drive/MyDrive/Colab Notebooks/PET_basque/pet


In [None]:
import pandas as pd
from collections import Counter, defaultdict
import tensorflow as tf
import torch

In [None]:
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

'Tesla K80'

In [None]:
train = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/PET_basque/bhtc_corpus/train_original.tsv", sep="\t", header=None)
dev = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/PET_basque/bhtc_corpus/dev_original.tsv", sep="\t", header=None)
train.info()
train.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8682 entries, 0 to 8681
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       8682 non-null   object
 1   1       8682 non-null   object
dtypes: object(2)
memory usage: 135.8+ KB


Unnamed: 0,0,1
0,Euskara,म र च १९ क द न ब स क न गर कहर ल क न पन द शल आफ...
1,Politika,agiri baten bitartez adierazi dute mugimenduko...
2,Euskara,ekainaren 14an heldu den igandean behaskaneko ...
3,Politika,ekineko zuzendaritzako kide izatea leporatuta ...
4,Ingurumena,energi trantsizioa landuko dute larunbatean ir...


In [None]:
#There are repeated examples in the training dataset
print('Unique comments: ', train.iloc[:,1].nunique() == train.shape[0])
print('Null values: ', train.isnull().values.any())

Unique comments:  False
Null values:  False


In [None]:
#Headlines are usually short, but outliers can appear.
print('average sentence length: ', train.iloc[:,1].str.split().str.len().mean())
print('stdev sentence length: ', train.iloc[:,1].str.split().str.len().std())
print('max sentence length: ', train.iloc[:,1].str.split().str.len().max())

average sentence length:  33.744413729555404
stdev sentence length:  16.202055636557947
max sentence length:  200


In [None]:
#The labels are unbalanced.
Counter(train.iloc[:,0]).most_common()

[('Gizartea', 2438),
 ('Politika', 1349),
 ('Nazioartea', 1092),
 ('Ekonomia', 817),
 ('Ingurumena', 790),
 ('Kultura', 777),
 ('Euskara', 495),
 ('Historia', 330),
 ('Iritzia', 303),
 ('Komunikazioa', 119),
 ('Euskal_Herria', 109),
 ('Zientzia', 63)]
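
To put the imbalance in numbers, the counts above can be turned into label shares. The following is a small sketch that reuses the counts printed by the previous cell; the helper name `label_shares` is an illustrative choice, not part of the notebook.

```python
# Label counts copied from the output of the previous cell.
label_counts = {
    "Gizartea": 2438, "Politika": 1349, "Nazioartea": 1092, "Ekonomia": 817,
    "Ingurumena": 790, "Kultura": 777, "Euskara": 495, "Historia": 330,
    "Iritzia": 303, "Komunikazioa": 119, "Euskal_Herria": 109, "Zientzia": 63,
}

def label_shares(counts):
    """Return the fraction of examples that each label accounts for."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

shares = label_shares(label_counts)
# Ratio between the most and the least frequent label.
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())

for label, share in shares.items():
    print(f"{label:<14}{share:6.1%}")
print(f"most/least frequent ratio: {imbalance_ratio:.1f}")
```

The most frequent label covers a bit more than a quarter of the training set. Sampling a fixed number of examples per label, as the tests below do, yields a balanced few-shot set regardless of these shares.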

In [None]:
train[train.iloc[:,1].duplicated(keep=False).values].sort_values(1)

Unnamed: 0,0,1
2466,Gizartea,124 274 pertsona deituak zituen gure esku dago...
8118,Gizartea,124 274 pertsona deituak zituen gure esku dago...
5698,Ekonomia,157 kaleratze aurreikusten dituen espedientea ...
439,Ekonomia,157 kaleratze aurreikusten dituen espedientea ...
7394,Iritzia,17 28 ziren ordu penintsularra eta negua argit...
...,...,...
8141,Gizartea,teoria feministak praktikara eraman dituzte an...
406,Politika,unai rementeria bizkaiko ahaldun nagusiak aurk...
2981,Politika,unai rementeria bizkaiko ahaldun nagusiak aurk...
8613,Gizartea,usurbilgo geltokian mozal legearen aurkako pro...


In [None]:
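#Drop the duplicates from the training set: since we take only a few examples, we don't want redundant ones,
#and the same headline should not also appear in the unlabeled dataset.
#Note: drop_duplicates() compares whole rows, so a headline repeated under a different label is kept;
#passing subset=[1] would deduplicate on the headline text alone.
train_nodup = train.drop_duplicates()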

##First test

In the first test we took 2 examples per label (24 labeled examples in total), trained a single PET model for each pattern-verbalizer pair (PVP), and also downsampled the development dataset.
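
Each PVP combines a pattern, which rewrites the input as a cloze phrase containing a mask token, with a verbalizer, which maps every label to a word the language model can predict at the masked position. The sketch below only illustrates that idea. The patterns, the verbalizer words and the names `PATTERNS`, `VERBALIZER`, `apply_pattern` and `verbalize` are assumptions made for this example, not the definitions registered for the `basque-topic-classification` task, which this notebook does not show.

```python
MASK = "[MASK]"  # BERT-style mask token

# The 12 BHTC topics, as listed in the label counts above.
LABELS = [
    "Gizartea", "Politika", "Nazioartea", "Ekonomia", "Ingurumena", "Kultura",
    "Euskara", "Historia", "Iritzia", "Komunikazioa", "Euskal_Herria", "Zientzia",
]

# Hypothetical verbalizer: each label is mapped to its own name as the word to predict.
VERBALIZER = {label: label.replace("_", " ") for label in LABELS}

# Hypothetical patterns, indexed like the values passed to --pattern_ids.
PATTERNS = {
    0: "{mask}: {text}",
    1: "({mask}) {text}",
    2: "{text} ({mask})",
}

def apply_pattern(pattern_id, text):
    """Turn a headline into a cloze phrase with one masked position."""
    return PATTERNS[pattern_id].format(mask=MASK, text=text)

def verbalize(label):
    """Return the word that stands for a label at the masked position."""
    return VERBALIZER[label]

# Beginning of a headline from the training sample shown earlier.
headline = "energi trantsizioa landuko dute larunbatean"
for pattern_id in PATTERNS:
    print(pattern_id, apply_pattern(pattern_id, headline))
print(verbalize("Ingurumena"))
```

In practice the verbalizations would have to be checked against the BERTeus tokenizer, since a word that is split into several tokens (for instance a multiword label such as `Euskal_Herria`) may not fit a single masked position.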

In [None]:
new_train_list = []
unlabeled_list = []
new_dev_list = []

In [None]:
#We take 2 examples from each label, and the rest become the unlabeled dataset.
#Alternatively, the new training dataset could follow the proportion of examples per label.
for _, label_train in train_nodup.groupby(train_nodup.iloc[:,0]):
    new_train = label_train.sample(n=2)
    new_train_list.append(new_train)
    #The remaining rows of this label lose their label column and join the unlabeled pool.
    dropped = label_train.drop(new_train.index)
    dropped = dropped.drop([0], axis=1)
    unlabeled_list.append(dropped)
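
The alternative mentioned in the comment above, following the proportion of examples per label, could be sketched as below. This is a minimal illustration under stated assumptions: the function name `proportional_allocation`, the guarantee of at least one example per label and the largest-remainder rounding are choices made for this example, and the counts are the ones printed earlier for the training set before removing duplicates.

```python
# Label counts copied from the Counter output shown earlier.
label_counts = {
    "Gizartea": 2438, "Politika": 1349, "Nazioartea": 1092, "Ekonomia": 817,
    "Ingurumena": 790, "Kultura": 777, "Euskara": 495, "Historia": 330,
    "Iritzia": 303, "Komunikazioa": 119, "Euskal_Herria": 109, "Zientzia": 63,
}

def proportional_allocation(counts, budget, minimum=1):
    """Split `budget` examples across labels in proportion to their frequency.

    Every label first receives `minimum` examples; the rest of the budget is
    shared proportionally and rounded with the largest-remainder method.
    """
    labels = list(counts)
    if budget < minimum * len(labels):
        raise ValueError("budget is too small to give every label the minimum")
    total = sum(counts.values())
    remaining = budget - minimum * len(labels)
    quotas = {label: remaining * counts[label] / total for label in labels}
    allocation = {label: minimum + int(quotas[label]) for label in labels}
    # Hand out the examples lost to rounding, largest fractional part first.
    leftover = budget - sum(allocation.values())
    by_remainder = sorted(labels, key=lambda l: quotas[l] - int(quotas[l]), reverse=True)
    for label in by_remainder[:leftover]:
        allocation[label] += 1
    return allocation

# Same total budget as the first test: 12 labels x 2 examples.
allocation = proportional_allocation(label_counts, budget=24)
for label, n in allocation.items():
    print(f"{label:<14}{n}")
```

Each value of `allocation` could then replace the fixed `n=2` passed to `sample` for the corresponding label.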

In [None]:
#We also downsample the development dataset, since a dev set much bigger
# than the training set does not make sense in the scenario that we want to test.
for _, label_dev in dev.groupby(dev.iloc[:,0]):
    new_dev = label_dev.sample(n=3)
    new_dev_list.append(new_dev)

In [None]:
new_train_df = pd.concat(new_train_list)
unlabeled_df = pd.concat(unlabeled_list)
unlabeled_df.insert(0, 'unlabeled', 'unlabeled')
new_dev_df = pd.concat(new_dev_list)

In [None]:
new_train_df.to_csv("/content/drive/MyDrive/Colab Notebooks/PET_Basque/bhtc_corpus/train.tsv", sep="\t", index=False, header=False)
unlabeled_df.to_csv("/content/drive/MyDrive/Colab Notebooks/PET_Basque/bhtc_corpus/unlabeled.tsv", sep="\t", index=False, header=False)
new_dev_df.to_csv("/content/drive/MyDrive/Colab Notebooks/PET_Basque/bhtc_corpus/dev.tsv", sep="\t", index=False, header=False)

In [None]:
%cd /content/drive/My Drive/Colab Notebooks/PET_Basque/pet

/content/drive/My Drive/Colab Notebooks/PET_Basque/pet


In [None]:
#PET method with the BHTC Corpus using BERTeus as the pretrained model, 1 model per PVP.
!python cli.py \
--method pet \
--pattern_ids 0 1 2 \
--data_dir "/content/drive/MyDrive/Colab Notebooks/PET_Basque/bhtc_corpus" \
--model_type bert \
--model_name_or_path "ixa-ehu/berteus-base-cased" \
--task_name "basque-topic-classification" \
--output_dir "../model-PET_24" \
--do_train \
--do_eval \
--pet_repetitions 1

In [None]:
#Classification method with the BHTC Corpus using BERTeus as the pretrained model.
!python cli.py \
--method sequence_classifier \
--data_dir "/content/drive/MyDrive/Colab Notebooks/PET_Basque/bhtc_corpus" \
--model_type bert \
--model_name_or_path "ixa-ehu/berteus-base-cased" \
--task_name "basque-topic-classification" \
--output_dir "../model-sc_24" \
--do_train \
--do_eval

##Second test

In the second test we took 10 examples per label (120 labeled examples in total) and trained 3 models for each PVP.
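
Because this test repeats the sampling of the first one with a different size, the split can be wrapped in a helper. The sketch below assumes the two-column layout used throughout the notebook (label in column 0, headline in column 1). The name `few_shot_split` and its `seed` parameter are hypothetical. Fixing the seed makes the sampled subset reproducible, which the sampling cells in this notebook do not do.

```python
import pandas as pd

def few_shot_split(df, n_per_label, seed=0, label_col=0):
    """Sample n_per_label rows per label; return (labeled, unlabeled) frames.

    Hypothetical helper: assumes df has a unique index and stores the label
    in label_col.
    """
    sampled = [
        group.sample(n=n_per_label, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    labeled = pd.concat(sampled)
    # Everything that was not sampled becomes unlabeled data, without its label column.
    unlabeled = df.drop(labeled.index).drop(columns=[label_col])
    return labeled, unlabeled

# Toy data in the same layout as the corpus files (label, headline).
toy = pd.DataFrame({
    0: ["Kultura"] * 4 + ["Politika"] * 4,
    1: [f"headline {i}" for i in range(8)],
})
labeled, unlabeled = few_shot_split(toy, n_per_label=2, seed=42)
print(labeled[0].value_counts().to_dict())
print(len(unlabeled), list(unlabeled.columns))
```

With the notebook's data, the call would be `few_shot_split(train_nodup, n_per_label=10)`, followed by inserting the `unlabeled` placeholder column as in the cells below.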

In [None]:
new_train_list = []
unlabeled_list = []
new_dev_list = []

In [None]:
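#Same sampling as in the first test, now with 10 examples per label.
for _, label_train in train_nodup.groupby(train_nodup.iloc[:,0]):
    new_train = label_train.sample(n=10)
    new_train_list.append(new_train)
    #The remaining rows of this label lose their label column and join the unlabeled pool.
    dropped = label_train.drop(new_train.index)
    dropped = dropped.drop([0], axis=1)
    unlabeled_list.append(dropped)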

In [None]:
#The development dataset is downsampled to 5 examples per label.
for _, label_dev in dev.groupby(dev.iloc[:,0]):
    new_dev = label_dev.sample(n=5)
    new_dev_list.append(new_dev)

In [None]:
new_train_df = pd.concat(new_train_list)
unlabeled_df = pd.concat(unlabeled_list)
unlabeled_df.insert(0, 'unlabeled', 'unlabeled')
new_dev_df = pd.concat(new_dev_list)

In [None]:
new_train_df.to_csv("/content/drive/MyDrive/Colab Notebooks/PET_basque/bhtc_corpus/train.tsv", sep="\t", index=False, header=False)
unlabeled_df.to_csv("/content/drive/MyDrive/Colab Notebooks/PET_basque/bhtc_corpus/unlabeled.tsv", sep="\t", index=False, header=False)
new_dev_df.to_csv("/content/drive/MyDrive/Colab Notebooks/PET_basque/bhtc_corpus/dev.tsv", sep="\t", index=False, header=False)

In [None]:
%cd /content/drive/My Drive/Colab Notebooks/PET_basque/pet

/content/drive/My Drive/Colab Notebooks/PET_basque/pet


In [None]:
#Classification method with the BHTC Corpus using BERTeus as the pretrained model.
!python cli.py \
--method sequence_classifier \
--data_dir "/content/drive/MyDrive/Colab Notebooks/PET_Basque/bhtc_corpus" \
--model_type bert \
--model_name_or_path "ixa-ehu/berteus-base-cased" \
--task_name "basque-topic-classification" \
--output_dir "../model_sc" \
--do_train \
--do_eval

In [None]:
#PET method with the BHTC Corpus using BERTeus as the pretrained model, 3 models per PVP (--pet_repetitions is left at its default).
!python cli.py \
--method pet \
--pattern_ids 0 1 2 \
--data_dir "/content/drive/MyDrive/Colab Notebooks/PET_Basque/bhtc_corpus" \
--model_type bert \
--model_name_or_path "ixa-ehu/berteus-base-cased" \
--task_name "basque-topic-classification" \
--output_dir "../model-PET_120" \
--do_train \
--do_eval

In [None]:
#iPET method with the BHTC Corpus using BERTeus as the pretrained model, 3 models per PVP.
!python cli.py \
--method ipet \
--pattern_ids 0 1 2 \
--data_dir "/content/drive/MyDrive/Colab Notebooks/PET_basque/bhtc_corpus" \
--model_type bert \
--model_name_or_path "ixa-ehu/berteus-base-cased" \
--task_name "basque-topic-classification" \
--output_dir "../model-iPET_120" \
--do_train \
--do_eval \
--overwrite_output_dir #restarting the training from the last model created