**The Second Workshop on Speech and Language Technologies for Dravidian Languages at ACL 2022 (DravidianLangTech-2022)** 

Workshop link: https://dravidianlangtech.github.io/2022/

**2nd Shared Task**

Shared Task Topic: Emotion Analysis in Tamil

Link for the task: https://competitions.codalab.org/competitions/36403


**Section 1: Package Installation**

In [None]:
!pip install --upgrade transformers



In [None]:
!pip install simpletransformers



In [None]:
!pip install gputil
!pip install psutil
!pip install humanize



**Task A**

**Section 2: Importing datasets**

In [None]:
import pandas as pd

In [None]:
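# The files are tab-separated and have no header row, so column names are supplied explicitly.
tamil_train = pd.read_csv('ta-emotion10-train.csv', sep='\t', names=['labels', 'text'])
tamil_dev = pd.read_csv('ta-emotion10-dev.csv', sep='\t', names=['labels', 'text'])

# Without header=None, the first test sentence would be consumed as the column header.
tamil_test = pd.read_csv('test_without_labels_task_a.csv', sep='\t', header=None, names=['id', 'text'])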


In [None]:
tamil_train['labels'].value_counts()

Neutral         4841
Joy             2134
Ambiguous       1689
Trust           1254
Disguist         910
Anger            834
Anticipation     828
Sadness          695
Love             675
Surprise         248
Fear             100
Name: labels, dtype: int64

In [None]:
tamil_train

Unnamed: 0,labels,text
0,Neutral,நாளைக்கு அரிசிக்கு இந்த நிலமை வந்தா 🙂
1,Anger,மானம் கேட்ட அன்புமணி
2,Neutral,தவறு இஸ்ரேல் இருக்காது இதை நான் கூறவில்லை ஹமாஸ...
3,Joy,கொங்கு நாட்டு சிங்கம் உன்மையும் நேர்மையும் உலை...
4,Neutral,இவர் யார்? ஒவ்வொரு வார்த்தையும் முன்னுக்கு பின...
...,...,...
14203,Trust,பெ மணியரசன் கூறுவதைஉணர்ந்து. செயலாற்றுவதேஇன்ற...
14204,Ambiguous,இன்னும் எத்தன நாள் வச்சி செய்வீங்க.
14205,Anticipation,அடுத்த ஏதோ தயார்பன்னிட்டான்
14206,Ambiguous,தமிழ் மற்றும் சமஸ்கிருதம்


In [None]:
tamil_dev

Unnamed: 0,labels,text
0,Joy,அருமை அற்புதம் பிரமாதம் நண்பரே வாழ்த்துக்கள் ந...
1,Anticipation,வேல்ராஜ் வேலையா தான் இருக்கும்
2,Joy,அண்ணன் கிட்டுக்கு வாழ்த்துக்கள் 👍👍
3,Trust,ஆமா நானும் இதான் யோசித்தேன் 🤣🤣
4,Anticipation,மொத்த மக்களும் ஒன்னு சேர்ந்தாதான் இந்த அரசாங்க...
...,...,...
3547,Anticipation,ஐயா தூத்துக்குடி ல கட்டுப்பாடுகளை மீறி சில பேக...
3548,Joy,உங்கள் கருத்துப்படி மகிழ்ச்சி கோமதி
3549,Love,அறுமையான விலக்கம் நன்றி ஆன்டவர் உங்களை ஆசிர்வத...
3550,Joy,அப்ப மட்டும் இல்ல இப்பவும் கூட . அரை வயிற்றுக்...


In [None]:
# Standard library
import gc
import os
import random
import re
import warnings

# Third-party
import GPUtil as GPU
import humanize
import numpy as np
import pandas as pd
import psutil
import sklearn
import torch
from scipy.special import softmax
from simpletransformers.classification import ClassificationModel
from sklearn.metrics import *  # also makes the sklearn.metrics submodule available
from sklearn.model_selection import *
from tqdm import tqdm

warnings.simplefilter('ignore')


# Seed everything for reproducibility

def seed_all(seed_value):
    random.seed(seed_value)        # Python's built-in RNG
    np.random.seed(seed_value)     # NumPy RNG
    torch.manual_seed(seed_value)  # PyTorch CPU RNG

    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)     # PyTorch RNG on all GPUs
        torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
        torch.backends.cudnn.benchmark = False     # disable the cuDNN autotuner

SEED = 2
seed_all(SEED)

In [None]:
!pip install keras



In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
df_train = tamil_train
df_dev = tamil_dev
#df_test = tamil_test #whenever this is released

In [None]:
labels = list(df_train['labels'].unique())
labels

['Neutral',
 'Anger',
 'Joy',
 'Disguist',
 'Trust',
 'Anticipation',
 'Ambiguous',
 'Love',
 'Surprise',
 'Sadness',
 'Fear']

In [None]:
# Convert the string labels to the pandas categorical dtype
# (integer codes are extracted from it further below)
df_train['labels'] = df_train['labels'].astype('category')
df_dev['labels'] = df_dev['labels'].astype('category')



In [None]:
train_y = df_train['labels']
print(train_y.value_counts())
train_y = train_y.to_numpy()

Neutral         4841
Joy             2134
Ambiguous       1689
Trust           1254
Disguist         910
Anger            834
Anticipation     828
Sadness          695
Love             675
Surprise         248
Fear             100
Name: labels, dtype: int64


In [None]:
dev_y = df_dev['labels']
print(dev_y.value_counts())
dev_y = dev_y.to_numpy()

Neutral         1222
Joy              558
Ambiguous        437
Trust            272
Anticipation     213
Disguist         210
Sadness          191
Love             189
Anger            184
Surprise          53
Fear              23
Name: labels, dtype: int64


In [None]:
# One hot encode
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train_y.reshape(-1,1))
print(enc.categories_)
train_y = enc.transform(train_y.reshape(-1,1)).toarray()
dev_y = enc.transform(dev_y.reshape(-1,1)).toarray()
train_y

[array(['Ambiguous', 'Anger', 'Anticipation', 'Disguist', 'Fear', 'Joy',
       'Love', 'Neutral', 'Sadness', 'Surprise', 'Trust'], dtype=object)]


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]])
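To make the encoder output above easier to read, here is a minimal pure-Python sketch of the same idea. It assumes the alphabetical category order printed by `enc.categories_`; the helper `one_hot` is a hypothetical name used only for illustration and is not part of scikit-learn.

```python
# Category order copied from the `enc.categories_` output above.
CATEGORIES = ['Ambiguous', 'Anger', 'Anticipation', 'Disguist', 'Fear', 'Joy',
              'Love', 'Neutral', 'Sadness', 'Surprise', 'Trust']


def one_hot(label, categories=CATEGORIES):
    """Return a list with 1.0 at the label's position and 0.0 elsewhere.

    A label that is not in `categories` yields an all-zero row, which
    mirrors the behaviour of handle_unknown='ignore'.
    """
    return [1.0 if label == category else 0.0 for category in categories]


anger_row = one_hot('Anger')
print(anger_row)
```

The row for `'Anger'` has its single 1 in the second position, which matches the second row of the array printed above (the second training example is labelled `Anger`).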

In [None]:
labels_dic = dict(enumerate(df_train['labels'].cat.categories))

In [None]:
df_train['labels'] = df_train['labels'].cat.codes
df_dev['labels'] = df_dev['labels'].cat.codes

In [None]:
df_train.head()

Unnamed: 0,labels,text
0,7,நாளைக்கு அரிசிக்கு இந்த நிலமை வந்தா 🙂
1,1,மானம் கேட்ட அன்புமணி
2,7,தவறு இஸ்ரேல் இருக்காது இதை நான் கூறவில்லை ஹமாஸ...
3,5,கொங்கு நாட்டு சிங்கம் உன்மையும் நேர்மையும் உலை...
4,7,இவர் யார்? ஒவ்வொரு வார்த்தையும் முன்னுக்கு பின...


In [None]:
labels_dic

{0: 'Ambiguous',
 1: 'Anger',
 2: 'Anticipation',
 3: 'Disguist',
 4: 'Fear',
 5: 'Joy',
 6: 'Love',
 7: 'Neutral',
 8: 'Sadness',
 9: 'Surprise',
 10: 'Trust'}
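The integer codes follow the sorted order of the category names, not the order in which the labels first appear in the training file. The sketch below reproduces the mapping in plain Python under that assumption; `code_to_label`, `label_to_code`, and `decode` are illustrative names that do not appear in the notebook.

```python
# Labels in the order returned by df_train['labels'].unique() earlier in the notebook.
labels_seen = ['Neutral', 'Anger', 'Joy', 'Disguist', 'Trust', 'Anticipation',
               'Ambiguous', 'Love', 'Surprise', 'Sadness', 'Fear']

# Sorting the names and enumerating them gives the same code -> label table as labels_dic.
code_to_label = dict(enumerate(sorted(labels_seen)))
label_to_code = {label: code for code, label in code_to_label.items()}


def decode(codes):
    """Map a sequence of integer class ids back to label strings."""
    return [code_to_label[code] for code in codes]


print(label_to_code['Neutral'])
print(decode([3, 0, 7]))
```

Keeping the reverse mapping around is what later allows the model's integer predictions to be turned back into emotion names.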

In [None]:
# Deal with class imbalance
from sklearn.utils import class_weight

class_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                  classes=np.unique(df_train['labels']),
                                                  y=df_train['labels'])
# Manually override the weight of the last class (index 10, 'Trust')
class_weights[-1] = 0.5

class_weights = list(class_weights)
class_weights

[0.7647343775230099,
 1.5487246566383257,
 1.5599472990777339,
 1.4193806193806193,
 12.916363636363636,
 0.6052654000170401,
 1.9135353535353536,
 0.26681189085650975,
 1.858469587965991,
 5.2082111436950145,
 0.5]
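The `'balanced'` weights above follow the formula `n_samples / (n_classes * class_count)`. As a rough check, the following sketch recomputes them in plain Python from the training label counts printed earlier; the variable names are illustrative only.

```python
# Training label counts copied from the value_counts() output above.
counts = {
    'Neutral': 4841, 'Joy': 2134, 'Ambiguous': 1689, 'Trust': 1254,
    'Disguist': 910, 'Anger': 834, 'Anticipation': 828, 'Sadness': 695,
    'Love': 675, 'Surprise': 248, 'Fear': 100,
}

n_samples = sum(counts.values())  # total number of training examples
n_classes = len(counts)           # number of emotion classes

# Balanced weight: rare classes get large weights, frequent ones small weights.
balanced = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label in sorted(balanced):
    print(f"{label:<13}{balanced[label]:.4f}")
```

Under this formula `Fear` (100 examples) gets a weight of about 12.92 and `Neutral` (4841 examples) about 0.27, in line with the list above. `Trust` would come out at roughly 1.03; the value 0.5 in the list is the result of the manual override in the cell.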

In [None]:
sweep_config = {
    "method": "bayes",  # grid, random
    "metric": {"name": "eval_loss", "goal": "minimize"},
    "parameters": {
        "num_train_epochs": {"values": [2, 3, 5, 6]},
        "learning_rate": {"min": 5e-6, "max": 4e-3},
        'weight_decay':  {"values": [0, 0.1, 0.01, 0.001]}
    },
}

In [None]:
import logging
import sklearn
import wandb

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.INFO)

**XLM-RoBERTa model**

In [None]:
model = ClassificationModel(
    'xlmroberta', 'xlm-roberta-base',
    num_labels=11,
    use_cuda=True,
    # weight=class_weights,  # class weighting is left disabled in this run
    args={
        'train_batch_size': 20,
        'reprocess_input_data': True,
        'overwrite_output_dir': True,
        'fp16': False,
        'do_lower_case': False,
        'num_train_epochs': 6,
        'max_seq_length': 256,
        'regression': False,
        'manual_seed': SEED,
        'learning_rate': 1e-5,
        'weight_decay': 0,
        'save_eval_checkpoints': False,
        'save_model_every_epoch': False,
        'silent': False,
        'verbose': True,
        'dataloader_num_workers': 0,
        'evaluate_during_training': True,
        'use_early_stopping': True,
        'use_multiprocessing': False,
        'output_dir': 'outputs/tamil/',
    },
)


In [None]:
from pprint import pprint
pprint(model.weight)

None


In [None]:
import shutil

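# ignore_errors=True avoids a FileNotFoundError when no previous run left an outputs/ directory
shutil.rmtree('outputs', ignore_errors=True)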

In [None]:
model.train_model(df_train, eval_df = df_dev)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.
INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_train_xlmroberta_256_11_2


Epoch:   0%|          | 0/6 [00:00<?, ?it/s]

Running Epoch 0 of 6:   0%|          | 0/711 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_xlmroberta_256_11_2
Configuration saved in outputs/best_model/config.json
Model weights saved in outputs/best_model/pytorch_model.bin
tokenizer config file saved in outputs/best_model/tokenizer_config.json
Special tokens file saved in outputs/best_model/special_tokens_map.json


Running Epoch 1 of 6:   0%|          | 0/711 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_xlmroberta_256_11_2
Configuration saved in outputs/best_model/config.json
Model weights saved in outputs/best_model/pytorch_model.bin
tokenizer config file saved in outputs/best_model/tokenizer_config.json
Special tokens file saved in outputs/best_model/special_tokens_map.json


Running Epoch 2 of 6:   0%|          | 0/711 [00:00<?, ?it/s]

Configuration saved in outputs/tamil/checkpoint-2000/config.json
Model weights saved in outputs/tamil/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in outputs/tamil/checkpoint-2000/tokenizer_config.json
Special tokens file saved in outputs/tamil/checkpoint-2000/special_tokens_map.json
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_xlmroberta_256_11_2
INFO:simpletransformers.classification.classification_model: No improvement in eval_loss
INFO:simpletransformers.classification.classification_model: Current step: 1
INFO:simpletransformers.classification.classification_model: Early stopping patience: 3
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_xlmroberta_256_11_2


Running Epoch 3 of 6:   0%|          | 0/711 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_xlmroberta_256_11_2


Running Epoch 4 of 6:   0%|          | 0/711 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_xlmroberta_256_11_2


Running Epoch 5 of 6:   0%|          | 0/711 [00:00<?, ?it/s]

Configuration saved in outputs/tamil/checkpoint-4000/config.json
Model weights saved in outputs/tamil/checkpoint-4000/pytorch_model.bin
tokenizer config file saved in outputs/tamil/checkpoint-4000/tokenizer_config.json
Special tokens file saved in outputs/tamil/checkpoint-4000/special_tokens_map.json
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_xlmroberta_256_11_2
INFO:simpletransformers.classification.classification_model: No improvement in eval_loss
INFO:simpletransformers.classification.classification_model: Current step: 2
INFO:simpletransformers.classification.classification_model: Early stopping patience: 3
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_xlmroberta_256_11_2
Configuration saved in outputs/tamil/config.json
Model weights saved in outputs/tamil/pytorch_model.bin
tokenizer config file saved in outputs/tamil/tokenizer_config.json
Special tokens file saved in outputs/tamil/special_tokens_map.json
INFO:simpletransformers.classification.classification_model: Training of xlmroberta model complete. Saved to outputs/tamil/.


(4266,
 defaultdict(list,
             {'eval_loss': [1.5515856892541722,
               1.5501589403898866,
               1.5694093453857276,
               1.6070878762114156,
               1.6358419290265522,
               1.6601698002568237,
               1.6691478122327779,
               1.6716374787795651],
              'global_step': [711, 1422, 2000, 2133, 2844, 3555, 4000, 4266],
              'mcc': [0.34122126613685594,
               0.34274424123755926,
               0.3393286079155427,
               0.3395107966069171,
               0.324129889281061,
               0.3275965016845351,
               0.32687447717960155,
               0.32423825058634687],
              'train_loss': [1.9190318584442139,
               2.5880603790283203,
               1.3165628910064697,
               1.5532948970794678,
               0.9876564145088196,
               1.1034525632858276,
               1.1482433080673218,
               0.41929641366004944]}))
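The returned history shows the evaluation loss bottoming out early and then rising while the training loss keeps falling, which is the usual sign of overfitting. A small sketch for locating the best evaluation step from such a history dictionary follows; the values are copied from the output above, and the variable names are illustrative.

```python
# Evaluation history copied from the train_model() output above.
history = {
    'global_step': [711, 1422, 2000, 2133, 2844, 3555, 4000, 4266],
    'eval_loss': [1.5515856892541722, 1.5501589403898866, 1.5694093453857276,
                  1.6070878762114156, 1.6358419290265522, 1.6601698002568237,
                  1.6691478122327779, 1.6716374787795651],
    'mcc': [0.34122126613685594, 0.34274424123755926, 0.3393286079155427,
            0.3395107966069171, 0.324129889281061, 0.3275965016845351,
            0.32687447717960155, 0.32423825058634687],
}

# Index of the evaluation with the lowest loss.
best_idx = min(range(len(history['eval_loss'])), key=history['eval_loss'].__getitem__)
best_step = history['global_step'][best_idx]

print(f"best step: {best_step}")
print(f"eval_loss: {history['eval_loss'][best_idx]:.4f}, mcc: {history['mcc'][best_idx]:.4f}")
```

The minimum falls at step 1422, the end of the second epoch, which is also where MCC peaks. According to the log, the best checkpoint was written to `outputs/best_model`, whereas `outputs/tamil/` holds the weights from the final step (4266).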

In [None]:
result, model_outputs, wrong_predictions = model.eval_model(df_dev)
print(result)


INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_xlmroberta_256_11_2


Running Evaluation:   0%|          | 0/444 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model:{'mcc': 0.32423825058634687, 'eval_loss': 1.6716374787795651}


{'mcc': 0.32423825058634687, 'eval_loss': 1.6716374787795651}


In [None]:
raw_outputs_vals = softmax(model_outputs,axis=1)

In [None]:
raw_outputs_vals

array([[0.00263342, 0.00047898, 0.00170651, ..., 0.00103841, 0.00571006,
        0.02257604],
       [0.01947621, 0.00918585, 0.27521281, ..., 0.00424737, 0.00919271,
        0.07573426],
       [0.00154378, 0.00043582, 0.00090813, ..., 0.00078274, 0.00315756,
        0.00356561],
       ...,
       [0.00182788, 0.00045288, 0.00098333, ..., 0.00121275, 0.00463687,
        0.00790163],
       [0.05133177, 0.03619023, 0.01468058, ..., 0.08797547, 0.04243442,
        0.00842735],
       [0.03851034, 0.2845659 , 0.0024859 , ..., 0.0439739 , 0.01075376,
        0.00231652]])

In [None]:
# Index of the most probable class for each row
y_pred = np.argmax(raw_outputs_vals, axis=1)
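For readers unfamiliar with this step, here is a minimal pure-Python sketch of what softmax followed by argmax does to a single row of model outputs. The logits are made-up numbers and the helper names `softmax_row` and `argmax` are hypothetical; the notebook itself relies on `scipy.special.softmax` and `np.argmax`.

```python
import math


def softmax_row(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [0.5, 2.0, -1.0]  # made-up scores for three classes
probs = softmax_row(logits)
pred = argmax(probs)

print([round(p, 3) for p in probs])
print(pred)
```

Because softmax preserves the ordering of the scores, taking the argmax of the raw outputs gives the same predicted class; the softmax step only matters when the probabilities themselves are needed.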

In [None]:
sklearn.metrics.f1_score(df_dev['labels'].to_numpy(),y_pred,average = 'micro')

0.4467905405405405

In [None]:
print(sklearn.metrics.classification_report(df_dev['labels'].to_numpy(),y_pred))

              precision    recall  f1-score   support

           0       0.60      0.62      0.61       437
           1       0.24      0.26      0.25       184
           2       0.29      0.32      0.31       213
           3       0.22      0.19      0.21       210
           4       0.40      0.09      0.14        23
           5       0.52      0.66      0.58       558
           6       0.28      0.11      0.15       189
           7       0.51      0.49      0.50      1222
           8       0.38      0.39      0.39       191
           9       0.00      0.00      0.00        53
          10       0.30      0.37      0.33       272

    accuracy                           0.45      3552
   macro avg       0.34      0.32      0.31      3552
weighted avg       0.43      0.45      0.44      3552
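The report shows a gap between accuracy (0.45, identical to the micro-averaged F1 computed above) and the macro-averaged F1 (0.31). The sketch below illustrates on a toy example why that happens: micro F1 pools all decisions, while macro F1 gives every class equal weight, so rare classes that are never predicted correctly pull it down. The function `f1_summary` and the toy labels are assumptions made for illustration, not part of the notebook.

```python
def f1_summary(y_true, y_pred):
    """Return (per-class F1, macro F1, micro F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = {}
    tp_sum = fp_sum = fn_sum = 0
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
        tp_sum += tp
        fp_sum += fp
        fn_sum += fn
    macro = sum(per_class.values()) / len(per_class)       # unweighted mean over classes
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)    # pooled over all decisions
    return per_class, macro, micro


# Toy data: class 0 is frequent, class 2 is rare and never predicted.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 0]

per_class, macro, micro = f1_summary(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(per_class)
print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}, accuracy = {accuracy:.3f}")
```

In the report above, class 9 (`Surprise`, 53 examples) has an F1 of 0.00 and class 4 (`Fear`, 23 examples) an F1 of 0.14, which is why the macro average sits well below the accuracy.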



In [None]:
 !zip -r roberts_emotional.zip outputs/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
  adding: outputs/ (stored 0%)
  adding: outputs/best_model/ (stored 0%)
  adding: outputs/best_model/model_args.json (deflated 62%)
  adding: outputs/best_model/training_args.bin (deflated 49%)
  adding: outputs/best_model/eval_results.txt (stored 0%)
  adding: outputs/best_model/pytorch_model.bin (deflated 31%)
  adding: outputs/best_model/optimizer.pt (deflated 70%)
  adding: outputs/best_model/special_tokens_map.json (deflated 50%)
  adding: outputs/best_model/tokenizer_config.json (deflated 47%)
  adding: outputs/best_model/sentencepiece.bpe.model (deflated 49%)
  adding: outputs/best_model/config.json (deflated 58%)
  adding: outputs/best_model/tokenizer.json (deflated 61%)
  adding: outputs/best_model

**Testing the XLM-RoBERTa model**

In [None]:
import torch

In [None]:
model = ClassificationModel('xlmroberta', 'outputs/tamil/')

loading configuration file outputs/tamil/config.json
Model config XLMRobertaConfig {
  "_name_or_path": "xlm-roberta-base",
  "architectures": [
    "XLMRobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_10": 10,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8,
    "LABEL_9": 9
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_at

In [None]:
# The test file has no header row; without header=None the first sentence becomes the column name.
tamil_test = pd.read_csv('test_without_labels_task_a.csv', sep='\t', header=None, names=['id', 'text'])
df_test = tamil_test

In [None]:
df_test

Unnamed: 0,id,text
0,0,நம்மூரில் நம்மொழியில் வழிபாடு செய்ய இவ்வளவு இடையூறு ஏன்?
1,1,தமிழ் நாட்டிற்க்கு வெளியே போய் வாழ்ந்து பாருங்...
2,2,ஆழி ரொம்ப சொம்பு தூக்காத திமுகவிற்கு
3,3,நா என்ன சொன்னேன்.
4,4,மிக நல்ல அரசியல் கலாச்சாரம் நம்ம முதல்வர் 🙏🙏🙏🙏...
5,5,கார்த்திட ஆயிரத்தில் ஒருவன் படம் பாேடுங்க அண்ண...
...,...,...
4435,4435,குஜராத்தில் அதிகம் இறப்பது ஜெயின் சமூகம்தான் இ...
4436,4436,பணம் இருந்தால் மனம் இருந்தால் குணம் இருக்கனும்...
4437,4437,இருப்பவர்கள் இடம் வாங்கி இல்லாதவர்களுக்கு கொடு...
4438,4438,அருமை அண்ணா மிக்க மகிழ்ச்சி நன்றி


In [None]:
test_sents = list(df_test['text'])

In [None]:
predictions, raw_outputs = model.predict(test_sents)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/4439 [00:00<?, ?it/s]

  0%|          | 0/555 [00:00<?, ?it/s]

In [None]:
predictions[:10]  # preview the first ten predicted class ids

[3, 0, 7, 5, 7, 7, 5, 0, 7, 7]

In [None]:
labels_dic

{0: 'Ambiguous',
 1: 'Anger',
 2: 'Anticipation',
 3: 'Disguist',
 4: 'Fear',
 5: 'Joy',
 6: 'Love',
 7: 'Neutral',
 8: 'Sadness',
 9: 'Surprise',
 10: 'Trust'}

In [None]:
predictions = [labels_dic[i] for i in predictions]
predictions[:10]  # preview the first ten predicted labels

['Disguist',
 'Ambiguous',
 'Neutral',
 'Joy',
 'Neutral',
 'Neutral',
 'Joy',
 'Ambiguous',
 'Neutral',
 'Neutral']

In [None]:
len(predictions)

4439

In [None]:
df_test['label_pred'] = predictions

In [None]:
df_test.to_csv('EmotionAnalysis_Roberta_output.csv')


In [None]:
import shutil

# Clear the previous run's checkpoints before training the next model
shutil.rmtree('outputs', ignore_errors=True)

**DeBERTa model**

In [None]:
model2 = ClassificationModel(
    'deberta', 'microsoft/deberta-base',
    num_labels=11,
    use_cuda=True,
    # weight=class_weights,  # class weighting is left disabled in this run
    args={
        'train_batch_size': 8,
        'reprocess_input_data': True,
        'overwrite_output_dir': True,
        'fp16': False,
        'do_lower_case': False,
        'num_train_epochs': 10,
        'max_seq_length': 256,
        'regression': False,
        'manual_seed': SEED,
        'learning_rate': 1e-5,
        'weight_decay': 0,
        'save_eval_checkpoints': False,
        'save_model_every_epoch': False,
        'silent': False,
        'verbose': True,
        'dataloader_num_workers': 0,
        'evaluate_during_training': True,
        'use_early_stopping': True,
        'use_multiprocessing': False,
        'output_dir': 'outputs/tamil/',
    },
)

loadin

In [None]:
from pprint import pprint
pprint(model2.weight)

None
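
The `None` printed here suggests that `model2` was built without per-class loss weights, while the Task A training labels are heavily skewed (4841 `Neutral` against 100 `Fear`). A minimal sketch of "balanced" weights, computed from the `value_counts()` output shown earlier in this notebook, is given below. The formula `n_samples / (n_classes * count)` is a common heuristic and not something the original run used. Passing the resulting list as the `weight` argument of `ClassificationModel` is an assumption to check against the installed simpletransformers version.

```python
# Training-set label counts for Task A, copied from the value_counts() output.
counts = {
    "Neutral": 4841, "Joy": 2134, "Ambiguous": 1689, "Trust": 1254,
    "Disguist": 910, "Anger": 834, "Anticipation": 828, "Sadness": 695,
    "Love": 675, "Surprise": 248, "Fear": 100,
}

# Alphabetical order matches the integer codes in labels_dic (0 = Ambiguous, ...).
label_names = sorted(counts)
n_samples = sum(counts.values())
n_classes = len(label_names)

# "Balanced" heuristic: rare classes get proportionally larger weights.
class_weights = [n_samples / (n_classes * counts[name]) for name in label_names]

for name, weight in zip(label_names, class_weights):
    print(f"{name:<13s} {weight:6.3f}")
```

If the list is used, it would presumably be supplied when the model is constructed (hypothetically `weight=class_weights`), in the same order as the label codes.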


In [None]:
model2.train_model(df_train, eval_df = df_dev)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.
INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_train_deberta_256_11_2


Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 0 of 10:   0%|          | 0/1776 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2
Configuration saved in outputs/best_model/config.json
Model weights saved in outputs/best_model/pytorch_model.bin
tokenizer config file saved in outputs/best_model/tokenizer_config.json
Special tokens file saved in outputs/best_model/special_tokens_map.json


Running Epoch 1 of 10:   0%|          | 0/1776 [00:00<?, ?it/s]

Configuration saved in outputs/tamil/checkpoint-2000/config.json
Model weights saved in outputs/tamil/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in outputs/tamil/checkpoint-2000/tokenizer_config.json
Special tokens file saved in outputs/tamil/checkpoint-2000/special_tokens_map.json
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2
Configuration saved in outputs/best_model/config.json
Model weights saved in outputs/best_model/pytorch_model.bin
tokenizer config file saved in outputs/best_model/tokenizer_config.json
Special tokens file saved in outputs/best_model/special_tokens_map.json
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2
Configuration saved in outputs/best_model/config.json
Model weights saved in outputs/best_model/pytorch_model.bin
tokenizer config file saved in outputs/best_model/tokenizer_config.json
Special tokens file saved in outputs/best_model/special_tokens_map.json


Running Epoch 2 of 10:   0%|          | 0/1776 [00:00<?, ?it/s]

Configuration saved in outputs/tamil/checkpoint-4000/config.json
Model weights saved in outputs/tamil/checkpoint-4000/pytorch_model.bin
tokenizer config file saved in outputs/tamil/checkpoint-4000/tokenizer_config.json
Special tokens file saved in outputs/tamil/checkpoint-4000/special_tokens_map.json
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2
INFO:simpletransformers.classification.classification_model: No improvement in eval_loss
INFO:simpletransformers.classification.classification_model: Current step: 1
INFO:simpletransformers.classification.classification_model: Early stopping patience: 3
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2
Configuration saved in outputs/best_model/config.json
Model weights saved in outputs/best_model/pytorch_model.bin
tokenizer config file saved in outputs/best_model/tokenizer_config.json
Special tokens file saved in outputs/best_model/special_tokens_map.json


Running Epoch 3 of 10:   0%|          | 0/1776 [00:00<?, ?it/s]

Configuration saved in outputs/tamil/checkpoint-6000/config.json
Model weights saved in outputs/tamil/checkpoint-6000/pytorch_model.bin
tokenizer config file saved in outputs/tamil/checkpoint-6000/tokenizer_config.json
Special tokens file saved in outputs/tamil/checkpoint-6000/special_tokens_map.json
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2
Configuration saved in outputs/best_model/config.json
Model weights saved in outputs/best_model/pytorch_model.bin
tokenizer config file saved in outputs/best_model/tokenizer_config.json
Special tokens file saved in outputs/best_model/special_tokens_map.json
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2
Configuration saved in outputs/best_model/config.json
Model weights saved in outputs/best_model/pytorch_model.bin
tokenizer config file saved in outputs/best_model/tokenizer_config.json
Special tokens file saved in outputs/best_model/special_tokens_map.json


Running Epoch 4 of 10:   0%|          | 0/1776 [00:00<?, ?it/s]

Configuration saved in outputs/tamil/checkpoint-8000/config.json
Model weights saved in outputs/tamil/checkpoint-8000/pytorch_model.bin
tokenizer config file saved in outputs/tamil/checkpoint-8000/tokenizer_config.json
Special tokens file saved in outputs/tamil/checkpoint-8000/special_tokens_map.json
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2
Configuration saved in outputs/best_model/config.json
Model weights saved in outputs/best_model/pytorch_model.bin
tokenizer config file saved in outputs/best_model/tokenizer_config.json
Special tokens file saved in outputs/best_model/special_tokens_map.json
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2
Configuration saved in outputs/best_model/config.json
Model weights saved in outputs/best_model/pytorch_model.bin
tokenizer config file saved in outputs/best_model/tokenizer_config.json
Special tokens file saved in outputs/best_model/special_tokens_map.json


Running Epoch 5 of 10:   0%|          | 0/1776 [00:00<?, ?it/s]

Configuration saved in outputs/tamil/checkpoint-10000/config.json
Model weights saved in outputs/tamil/checkpoint-10000/pytorch_model.bin
tokenizer config file saved in outputs/tamil/checkpoint-10000/tokenizer_config.json
Special tokens file saved in outputs/tamil/checkpoint-10000/special_tokens_map.json
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2
Configuration saved in outputs/best_model/config.json
Model weights saved in outputs/best_model/pytorch_model.bin
tokenizer config file saved in outputs/best_model/tokenizer_config.json
Special tokens file saved in outputs/best_model/special_tokens_map.json
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2


Running Epoch 6 of 10:   0%|          | 0/1776 [00:00<?, ?it/s]

Configuration saved in outputs/tamil/checkpoint-12000/config.json
Model weights saved in outputs/tamil/checkpoint-12000/pytorch_model.bin
tokenizer config file saved in outputs/tamil/checkpoint-12000/tokenizer_config.json
Special tokens file saved in outputs/tamil/checkpoint-12000/special_tokens_map.json
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2
Configuration saved in outputs/best_model/config.json
Model weights saved in outputs/best_model/pytorch_model.bin
tokenizer config file saved in outputs/best_model/tokenizer_config.json
Special tokens file saved in outputs/best_model/special_tokens_map.json
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2


Running Epoch 7 of 10:   0%|          | 0/1776 [00:00<?, ?it/s]

Configuration saved in outputs/tamil/checkpoint-14000/config.json
Model weights saved in outputs/tamil/checkpoint-14000/pytorch_model.bin
tokenizer config file saved in outputs/tamil/checkpoint-14000/tokenizer_config.json
Special tokens file saved in outputs/tamil/checkpoint-14000/special_tokens_map.json
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2
INFO:simpletransformers.classification.classification_model: No improvement in eval_loss
INFO:simpletransformers.classification.classification_model: Current step: 1
INFO:simpletransformers.classification.classification_model: Early stopping patience: 3
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2


Running Epoch 8 of 10:   0%|          | 0/1776 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2


Running Epoch 9 of 10:   0%|          | 0/1776 [00:00<?, ?it/s]

Configuration saved in outputs/tamil/checkpoint-16000/config.json
Model weights saved in outputs/tamil/checkpoint-16000/pytorch_model.bin
tokenizer config file saved in outputs/tamil/checkpoint-16000/tokenizer_config.json
Special tokens file saved in outputs/tamil/checkpoint-16000/special_tokens_map.json
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2
INFO:simpletransformers.classification.classification_model: No improvement in eval_loss
INFO:simpletransformers.classification.classification_model: Current step: 2
INFO:simpletransformers.classification.classification_model: Early stopping patience: 3
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2
Configuration saved in outputs/tamil/config.json
Model weights saved in outputs/tamil/pytorch_model.bin
tokenizer config file saved in outputs/tamil/tokenizer_config.json
Special tokens file saved in outputs/tamil/special_tokens_map.json
INFO:simpletransformers.classification.classification_model: Training of deberta model complete. Saved to outputs/tamil/.


(17760,
 defaultdict(list,
             {'eval_loss': [1.9535042318674896,
               1.9317949355185569,
               1.8235470585457914,
               1.827587828040123,
               1.7854441311713811,
               1.7694157430449047,
               1.7483315871911005,
               1.7424361583617356,
               1.7376278340816498,
               1.7284192582508464,
               1.7436017171219662,
               1.721729813261075,
               1.7226479806610056,
               1.7309933955873456,
               1.7303239915285025,
               1.7262228626657177,
               1.727981347221512,
               1.7339461139730505],
              'global_step': [1776,
               2000,
               3552,
               4000,
               5328,
               6000,
               7104,
               8000,
               8880,
               10000,
               10656,
               12000,
               12432,
               14000,
               142
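
The history returned by `train_model` can be used to locate the evaluation step with the lowest `eval_loss`. The sketch below uses only the first 14 (step, loss) pairs, because the `global_step` list is cut off after that point in the output; the four remaining `eval_loss` values shown are all higher than the minimum found here, so the conclusion should not change.

```python
# First 14 eval_loss values and their global steps, copied from the output.
eval_loss = [
    1.9535042318674896, 1.9317949355185569, 1.8235470585457914,
    1.827587828040123, 1.7854441311713811, 1.7694157430449047,
    1.7483315871911005, 1.7424361583617356, 1.7376278340816498,
    1.7284192582508464, 1.7436017171219662, 1.721729813261075,
    1.7226479806610056, 1.7309933955873456,
]
global_step = [
    1776, 2000, 3552, 4000, 5328, 6000, 7104,
    8000, 8880, 10000, 10656, 12000, 12432, 14000,
]

# Index of the smallest loss, then the step at which it was recorded.
best_index = min(range(len(eval_loss)), key=eval_loss.__getitem__)
best_step = global_step[best_index]
best_loss = eval_loss[best_index]

print(f"lowest eval_loss {best_loss:.4f} at global step {best_step}")
```

This agrees with the training log, where `outputs/best_model` is last overwritten right after `checkpoint-12000`. The weights saved to `outputs/tamil/` at the end of training come from the final step, so loading `outputs/best_model` instead may give a slightly better dev loss; that is a suggestion, not what the original run did.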

In [None]:
result2, model_outputs2, wrong_predictions2 = model2.eval_model(df_dev)
print(result2)


INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3552 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_deberta_256_11_2


Running Evaluation:   0%|          | 0/444 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model:{'mcc': 0.26651079234963687, 'eval_loss': 1.7339461139730505}


{'mcc': 0.26651079234963687, 'eval_loss': 1.7339461139730505}


In [None]:
raw_outputs_vals = softmax(model_outputs2,axis=1)

In [None]:
import numpy as np
y_pred = [np.argmax(i) for i in  raw_outputs_vals]

In [None]:
sklearn.metrics.f1_score(df_dev['labels'].to_numpy(),y_pred,average = 'micro')

0.44031531531531537

In [None]:
print(sklearn.metrics.classification_report(df_dev['labels'].to_numpy(),y_pred))

              precision    recall  f1-score   support

           0       0.51      0.49      0.50       437
           1       0.21      0.07      0.11       184
           2       0.32      0.15      0.20       213
           3       0.00      0.00      0.00       210
           4       0.00      0.00      0.00        23
           5       0.50      0.65      0.57       558
           6       0.27      0.02      0.03       189
           7       0.43      0.72      0.54      1222
           8       0.42      0.13      0.20       191
           9       0.00      0.00      0.00        53
          10       0.31      0.12      0.18       272

    accuracy                           0.44      3552
   macro avg       0.27      0.21      0.21      3552
weighted avg       0.38      0.44      0.38      3552
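
For single-label multiclass data, micro-averaged F1 equals accuracy, which is why the 0.44 computed earlier matches the `accuracy` row. The macro average weights every class equally and is much lower. The sketch below recomputes the two summary rows from the rounded per-class values in the report and lists the classes whose F1 is zero; the class names come from the Task A `labels_dic` mapping used in this notebook.

```python
labels_dic = {
    0: "Ambiguous", 1: "Anger", 2: "Anticipation", 3: "Disguist",
    4: "Fear", 5: "Joy", 6: "Love", 7: "Neutral", 8: "Sadness",
    9: "Surprise", 10: "Trust",
}

# (f1-score, support) per class id, copied from the classification report.
report = {
    0: (0.50, 437), 1: (0.11, 184), 2: (0.20, 213), 3: (0.00, 210),
    4: (0.00, 23), 5: (0.57, 558), 6: (0.03, 189), 7: (0.54, 1222),
    8: (0.20, 191), 9: (0.00, 53), 10: (0.18, 272),
}

f1_scores = [report[i][0] for i in sorted(report)]
supports = [report[i][1] for i in sorted(report)]

# Macro: plain mean over classes. Weighted: mean weighted by class support.
macro_f1 = sum(f1_scores) / len(f1_scores)
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

# Classes for which the model produced no correct predictions on the dev set.
zero_f1_classes = [labels_dic[i] for i in sorted(report) if report[i][0] == 0.0]

print(f"macro F1    = {macro_f1:.2f}")
print(f"weighted F1 = {weighted_f1:.2f}")
print("zero-F1 classes:", zero_f1_classes)
```

Because the inputs are rounded to two decimals, the recomputed averages are approximate, but they reproduce the 0.21 and 0.38 in the report.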



In [None]:
!zip -r deberta.zip outputs/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
  adding: outputs/ (stored 0%)
  adding: outputs/best_model/ (stored 0%)
  adding: outputs/best_model/tokenizer_config.json (deflated 76%)
  adding: outputs/best_model/config.json (deflated 58%)
  adding: outputs/best_model/optimizer.pt (deflated 33%)
  adding: outputs/best_model/vocab.json (deflated 58%)
  adding: outputs/best_model/training_args.bin (deflated 49%)
  adding: outputs/best_model/merges.txt (deflated 53%)
  adding: outputs/best_model/special_tokens_map.json (deflated 81%)
  adding: outputs/best_model/model_args.json (deflated 62%)
  adding: outputs/best_model/scheduler.pt (deflated 49%)
  adding: outputs/best_model/pytorch_model.bin (deflated 7%)
  adding: outputs/best_model/eval_results.txt (

**Testing the DeBERTa model**

In [None]:
import torch
model = ClassificationModel('deberta', 'outputs/tamil/')

loading configuration file outputs/tamil/config.json
Model config DebertaConfig {
  "_name_or_path": "microsoft/deberta-base",
  "architectures": [
    "DebertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_10": 10,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8,
    "LABEL_9": 9
  },
  "layer_norm_eps": 1e-07,
  "max_position_embeddings": 512,
  "max_relative_positions": -1,
  "model_type": "deberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 1

In [None]:
tamil_test = pd.read_csv('test_without_labels_task_a.csv',sep='\t')
df_test = tamil_test

In [None]:
df_test

Unnamed: 0,0,நம்மூரில் நம்மொழியில் வழிபாடு செய்ய இவ்வளவு இடையூறு ஏன்?
0,1,தமிழ் நாட்டிற்க்கு வெளியே போய் வாழ்ந்து பாருங்...
1,2,ஆழி ரொம்ப சொம்பு தூக்காத திமுகவிற்கு
2,3,நா என்ன சொன்னேன்.
3,4,மிக நல்ல அரசியல் கலாச்சாரம் நம்ம முதல்வர் 🙏🙏🙏🙏...
4,5,கார்த்திட ஆயிரத்தில் ஒருவன் படம் பாேடுங்க அண்ண...
...,...,...
4434,4435,குஜராத்தில் அதிகம் இறப்பது ஜெயின் சமூகம்தான் இ...
4435,4436,பணம் இருந்தால் மனம் இருந்தால் குணம் இருக்கனும்...
4436,4437,இருப்பவர்கள் இடம் வாங்கி இல்லாதவர்களுக்கு கொடு...
4437,4438,அருமை அண்ணா மிக்க மகிழ்ச்சி நன்றி


In [None]:
test_sents = list(df_test['நம்மூரில் நம்மொழியில் வழிபாடு செய்ய இவ்வளவு இடையூறு ஏன்?'])
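
The text column is addressed by a Tamil sentence here because `pd.read_csv` treated the first line of the test file as a header. If the file really has no header row, as the displayed frame suggests, that first sentence never reaches the model. A minimal sketch of the difference, using an in-memory stand-in for the file (the column names `id` and `text` are illustrative assumptions):

```python
import io

import pandas as pd

# Stand-in for a headerless, tab-separated test file.
raw = "0\tfirst sentence\n1\tsecond sentence\n2\tthird sentence\n"

# Default behaviour: the first data row is consumed as column names.
df_default = pd.read_csv(io.StringIO(raw), sep="\t")

# header=None keeps every row as data; names= supplies readable column names.
df_fixed = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=["id", "text"])

test_sents = df_fixed["text"].tolist()

print(list(df_default.columns), len(df_default))
print(list(df_fixed.columns), len(df_fixed))
```

With the second form, the sentence list can be taken from `df_fixed["text"]` and contains one more row than the default read.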

In [None]:
predictions, raw_outputs = model.predict(test_sents)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/4439 [00:00<?, ?it/s]

  0%|          | 0/555 [00:00<?, ?it/s]

In [None]:
predictions

[7,
 7,
 7,
 5,
 7,
 0,
 5,
 0,
 7,
 5,
 7,
 7,
 10,
 7,
 0,
 0,
 0,
 7,
 7,
 1,
 7,
 7,
 7,
 7,
 0,
 7,
 7,
 7,
 7,
 7,
 7,
 0,
 7,
 6,
 0,
 7,
 7,
 10,
 0,
 7,
 7,
 5,
 7,
 7,
 5,
 5,
 5,
 7,
 0,
 7,
 7,
 2,
 7,
 7,
 7,
 7,
 0,
 7,
 5,
 5,
 5,
 7,
 7,
 7,
 7,
 5,
 2,
 10,
 7,
 0,
 0,
 7,
 7,
 7,
 7,
 7,
 7,
 5,
 0,
 10,
 7,
 5,
 5,
 0,
 7,
 7,
 5,
 7,
 0,
 7,
 7,
 7,
 7,
 7,
 10,
 5,
 5,
 7,
 7,
 7,
 7,
 7,
 5,
 7,
 0,
 5,
 7,
 5,
 7,
 5,
 7,
 0,
 7,
 5,
 7,
 7,
 0,
 5,
 7,
 5,
 7,
 7,
 0,
 7,
 7,
 7,
 7,
 7,
 8,
 5,
 0,
 7,
 7,
 7,
 7,
 7,
 0,
 7,
 0,
 10,
 7,
 7,
 0,
 7,
 7,
 7,
 7,
 0,
 5,
 5,
 0,
 7,
 5,
 7,
 7,
 7,
 0,
 5,
 7,
 7,
 0,
 7,
 7,
 0,
 7,
 0,
 7,
 2,
 7,
 8,
 7,
 7,
 7,
 5,
 5,
 0,
 0,
 7,
 7,
 5,
 7,
 7,
 7,
 7,
 7,
 2,
 7,
 7,
 7,
 7,
 8,
 7,
 7,
 5,
 7,
 1,
 7,
 0,
 7,
 7,
 8,
 5,
 7,
 7,
 7,
 7,
 7,
 0,
 5,
 5,
 7,
 7,
 7,
 7,
 7,
 7,
 0,
 0,
 7,
 8,
 0,
 5,
 2,
 7,
 7,
 5,
 7,
 2,
 7,
 5,
 0,
 7,
 7,
 7,
 7,
 5,
 2,
 7,
 1,
 7,
 7,
 7,
 7,
 7,
 7,
 5,
 7,
 5,
 2

In [None]:
labels_dic

{0: 'Ambiguous',
 1: 'Anger',
 2: 'Anticipation',
 3: 'Disguist',
 4: 'Fear',
 5: 'Joy',
 6: 'Love',
 7: 'Neutral',
 8: 'Sadness',
 9: 'Surprise',
 10: 'Trust'}

In [None]:
predictions = [labels_dic[i] for i in predictions]
predictions

['Neutral',
 'Neutral',
 'Neutral',
 'Joy',
 'Neutral',
 'Ambiguous',
 'Joy',
 'Ambiguous',
 'Neutral',
 'Joy',
 'Neutral',
 'Neutral',
 'Trust',
 'Neutral',
 'Ambiguous',
 'Ambiguous',
 'Ambiguous',
 'Neutral',
 'Neutral',
 'Anger',
 'Neutral',
 'Neutral',
 'Neutral',
 'Neutral',
 'Ambiguous',
 'Neutral',
 'Neutral',
 'Neutral',
 'Neutral',
 'Neutral',
 'Neutral',
 'Ambiguous',
 'Neutral',
 'Love',
 'Ambiguous',
 'Neutral',
 'Neutral',
 'Trust',
 'Ambiguous',
 'Neutral',
 'Neutral',
 'Joy',
 'Neutral',
 'Neutral',
 'Joy',
 'Joy',
 'Joy',
 'Neutral',
 'Ambiguous',
 'Neutral',
 'Neutral',
 'Anticipation',
 'Neutral',
 'Neutral',
 'Neutral',
 'Neutral',
 'Ambiguous',
 'Neutral',
 'Joy',
 'Joy',
 'Joy',
 'Neutral',
 'Neutral',
 'Neutral',
 'Neutral',
 'Joy',
 'Anticipation',
 'Trust',
 'Neutral',
 'Ambiguous',
 'Ambiguous',
 'Neutral',
 'Neutral',
 'Neutral',
 'Neutral',
 'Neutral',
 'Neutral',
 'Joy',
 'Ambiguous',
 'Trust',
 'Neutral',
 'Joy',
 'Joy',
 'Ambiguous',
 'Neutral',
 'Neutral

In [None]:
len(predictions)

4439

In [None]:
df_test['label_pred'] = predictions

In [None]:
df_test.head()

Unnamed: 0,0,நம்மூரில் நம்மொழியில் வழிபாடு செய்ய இவ்வளவு இடையூறு ஏன்?,label_pred
0,1,தமிழ் நாட்டிற்க்கு வெளியே போய் வாழ்ந்து பாருங்...,Neutral
1,2,ஆழி ரொம்ப சொம்பு தூக்காத திமுகவிற்கு,Neutral
2,3,நா என்ன சொன்னேன்.,Neutral
3,4,மிக நல்ல அரசியல் கலாச்சாரம் நம்ம முதல்வர் 🙏🙏🙏🙏...,Joy
4,5,கார்த்திட ஆயிரத்தில் ஒருவன் படம் பாேடுங்க அண்ண...,Neutral


In [None]:
df_test.to_csv('EmotionAnalysis_deBERTA_predictions.csv')

**Task B**

**Section 2: Importing datasets**

In [None]:
import pandas as pd

In [None]:
tamil_train = pd.read_csv('ta-train.tsv',sep='\t',names = ['text','labels'])
tamil_dev = pd.read_csv('ta-dev.tsv',sep='\t',names = ['text','labels'])

tamil_test = pd.read_csv('test_without_labels_task_b.tsv',sep='\t')


In [None]:
# Drop row 0 (presumably the file's own header line, read as data because names= was passed)
tamil_train = tamil_train.iloc[1:, :]

In [None]:
tamil_train

Unnamed: 0,text,labels
1,எந்த ஒரு மிகப் பெரிய விஷயமாக இருந்தாலும் அதோட ...,ஒப்புதல்
2,ராணி தேனி எப்படி கண்டுபிடிக்கிறது,எதிர்பார்ப்பு
3,இன்னும் நிறைய கண்டு பிடிப்புகள் சொல்லவில்லை. ப...,நடுநிலை
4,100% உண்மை..... வாழ்த்துக்கள் உண்மையை உரக்க சொ...,உண்மையை உணர்தல்
5,இது உண்மையாக இருக்கட்டும்,ஒப்புதல்
...,...,...
30175,"இப்டியாவது இந்தியாவையும், பாகிஸ்தானையும் சேத்த...",ஆசை
30176,எங்கடா இருந்திங்க நீங்கலாம் இவளவு நாளா தமிழ் ந...,கிண்டல்
30177,சுஷில்ஹரி பள்ளியை பற்றி தகவல் வருகிறது அதையும்...,எதிர்பார்ப்பு
30178,அருமை... அங்குள்ள தற்போதைய நிலவரத்தை கூற முடிய...,எதிர்பார்ப்பு


In [None]:
# Drop row 0 of the dev set for the same reason
tamil_dev = tamil_dev.iloc[1:, :]

In [None]:
tamil_dev

Unnamed: 0,text,labels
1,வைரஸ். வெட்டு கிளி வேர என்ன வச்சிட்டு இருக்கீங்கட,கேளிக்கை
2,நல்ல முயற்சி!!!! தொடர வாழ்த்துக்கள் !!!,போற்றுதல்
3,அக்கா சூப்பர் க்கா உங்க குழந்தைகள் நலமாக 100ஆண...,நம்பிக்கை
4,மிஸ்டர் தமிழன் வாய்ஸ் முதலிடத்தை பிடிக்கிறார்....,பெருமை
5,ஒரு இன்ஜினியரிங் பட்டாதாரி அந்த சமுகத்தில் வந்...,சோகம்
...,...,...
4265,வணக்கம் அண்ணா. சில நாட்களாக எனது முகநூல் கணக்க...,எதிர்பார்ப்பு
4266,கறுத்த அணில் குஞ்சு; கற்பப்பையை விட்டு வெளிவந்...,உண்மையை உணர்தல்
4267,ப்ரோ மிக்க நன்றி,போற்றுதல்
4268,இதை தேர்தலுக்கு முன் தெரிவித்திருந்தால் 234ம்...,உண்மையை உணர்தல்


In [None]:
tamil_train['labels'].value_counts()

போற்றுதல்                              4760
உண்மையை உணர்தல்                        3499
எதிர்பார்ப்பு                          2191
கிண்டல்                                2128
ஒப்புதல்                               1853
கோபம்                                  1738
எரிச்சல்                               1277
மகிழ்ச்சி                              1276
நடுநிலை                                1232
பெருமை                                  963
நன்றியறிதல்                             880
ஆர்வம்                                  782
நம்பிக்கை                               713
குழப்பம்                                709
கேளிக்கை                                625
உற்சாகம்                                548
அக்கறை                                  497
சங்கடம்                                 484
சோகம்                                   470
அன்பு                                   453
ஏமாற்றம்                                422
மறுப்பு                                 421
அருவருப்பு                      

In [None]:
import psutil
import humanize
import os
import GPUtil as GPU

import numpy as np
import pandas as pd
from tqdm import tqdm
import warnings
warnings.simplefilter('ignore')
import gc
from scipy.special import softmax
from simpletransformers.classification import ClassificationModel
from sklearn.model_selection import train_test_split, StratifiedKFold, KFold 
import sklearn
from sklearn.metrics import log_loss
from sklearn.metrics import *
from sklearn.model_selection import *
import re
import random
import torch


# Seed everything for reproducibility

def seed_all(seed_value):
    random.seed(seed_value) # Python
    np.random.seed(seed_value) # cpu vars
    torch.manual_seed(seed_value) # cpu  vars
    
    if torch.cuda.is_available(): 
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value) # gpu vars
        torch.backends.cudnn.deterministic = True  #needed
        torch.backends.cudnn.benchmark = False
SEED = 2
seed_all(SEED)

In [None]:
!pip install keras



In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
df_train = tamil_train
df_dev = tamil_dev
df_test = tamil_test #whenever this is released

In [None]:
labels = list(df_train['labels'].unique())
labels

['ஒப்புதல்',
 'எதிர்பார்ப்பு',
 'நடுநிலை',
 'உண்மையை உணர்தல்',
 'ஆசை',
 'நன்றியறிதல்',
 'கேளிக்கை',
 'அன்பு',
 'அக்கறை',
 'போற்றுதல்',
 'மகிழ்ச்சி',
 'அருவருப்பு',
 'கோபம்',
 'கிண்டல்',
 'ஏமாற்றம்',
 'எதிர்காலத்தைப் பற்றிய நம்பிக்கை',
 'எரிச்சல்',
 'சோகம்',
 'ஆர்வம்',
 'சங்கடம்',
 'குழப்பம்',
 'மறுப்பு',
 'நம்பிக்கை',
 'பெருமை',
 'துயர் நீக்கம்',
 'பதட்டம்',
 'ஆச்சரியம்',
 'குற்றமுணர்ந்ததால் ஏற்படும் வருத்தம்',
 'உற்சாகம்',
 'பயம்',
 'துக்கம்']

In [None]:
# Convert the label column to pandas categorical dtype (integer codes are taken from it later)
df_train['labels'] = df_train['labels'].astype('category')
df_dev['labels'] = df_dev['labels'].astype('category')

In [None]:
train_y = df_train['labels']
print(train_y.value_counts())
train_y = train_y.to_numpy()

போற்றுதல்                              4760
உண்மையை உணர்தல்                        3499
எதிர்பார்ப்பு                          2191
கிண்டல்                                2128
ஒப்புதல்                               1853
கோபம்                                  1738
எரிச்சல்                               1277
மகிழ்ச்சி                              1276
நடுநிலை                                1232
பெருமை                                  963
நன்றியறிதல்                             880
ஆர்வம்                                  782
நம்பிக்கை                               713
குழப்பம்                                709
கேளிக்கை                                625
உற்சாகம்                                548
அக்கறை                                  497
சங்கடம்                                 484
சோகம்                                   470
அன்பு                                   453
ஏமாற்றம்                                422
மறுப்பு                                 421
அருவருப்பு                      

In [None]:
dev_y = df_dev['labels']
print(dev_y.value_counts())
dev_y = dev_y.to_numpy()

போற்றுதல்                              673
உண்மையை உணர்தல்                        520
கிண்டல்                                299
எதிர்பார்ப்பு                          282
ஒப்புதல்                               263
கோபம்                                  259
மகிழ்ச்சி                              199
எரிச்சல்                               186
நடுநிலை                                156
பெருமை                                 141
நன்றியறிதல்                            116
ஆர்வம்                                 115
குழப்பம்                               101
உற்சாகம்                                94
நம்பிக்கை                               85
கேளிக்கை                                76
அக்கறை                                  72
அன்பு                                   66
சங்கடம்                                 64
மறுப்பு                                 64
சோகம்                                   63
ஏமாற்றம்                                54
எதிர்காலத்தைப் பற்றிய நம்பிக்கை         51
அருவருப்பு 

In [None]:
# One hot encode
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train_y.reshape(-1,1))
print(enc.categories_)
train_y = enc.transform(train_y.reshape(-1,1)).toarray()
dev_y = enc.transform(dev_y.reshape(-1,1)).toarray()
train_y

[array(['அக்கறை', 'அன்பு', 'அருவருப்பு', 'ஆசை', 'ஆச்சரியம்', 'ஆர்வம்',
       'உண்மையை உணர்தல்', 'உற்சாகம்', 'எதிர்காலத்தைப் பற்றிய நம்பிக்கை',
       'எதிர்பார்ப்பு', 'எரிச்சல்', 'ஏமாற்றம்', 'ஒப்புதல்', 'கிண்டல்',
       'குற்றமுணர்ந்ததால் ஏற்படும் வருத்தம்', 'குழப்பம்', 'கேளிக்கை',
       'கோபம்', 'சங்கடம்', 'சோகம்', 'துக்கம்', 'துயர் நீக்கம்', 'நடுநிலை',
       'நன்றியறிதல்', 'நம்பிக்கை', 'பதட்டம்', 'பயம்', 'பெருமை',
       'போற்றுதல்', 'மகிழ்ச்சி', 'மறுப்பு'], dtype=object)]


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
labels_dic = dict(enumerate(df_train['labels'].cat.categories))

In [None]:
df_train['labels'] = df_train['labels'].cat.codes
df_dev['labels'] = df_dev['labels'].cat.codes
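
Taking `.cat.codes` separately on the train and dev frames only yields matching integers when both frames contain exactly the same set of labels. With 31 classes, some of them rare, that is worth checking. The sketch below, on toy data, shows how the codes can drift and how a single shared `CategoricalDtype` built from the training labels keeps them aligned; the frames and label names are illustrative, not the task data.

```python
import pandas as pd

# Toy frames: the dev set is missing the "Anger" class.
train = pd.DataFrame({"labels": ["Joy", "Anger", "Fear", "Joy"]})
dev = pd.DataFrame({"labels": ["Joy", "Fear"]})

# Independent conversion: dev categories are [Fear, Joy], so Joy becomes 1.
naive_dev_codes = dev["labels"].astype("category").cat.codes.tolist()

# Shared dtype built once from the training labels.
label_dtype = pd.CategoricalDtype(categories=sorted(train["labels"].unique()))
train_codes = train["labels"].astype(label_dtype).cat.codes.tolist()
dev_codes = dev["labels"].astype(label_dtype).cat.codes.tolist()

# One id-to-name mapping that is valid for both frames.
labels_dic = dict(enumerate(label_dtype.categories))

print("naive dev codes :", naive_dev_codes)
print("shared dev codes:", dev_codes)
print("mapping         :", labels_dic)
```

A label that appears in dev but not in train would receive the code -1 under the shared dtype, which makes such cases easy to detect before training.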

In [None]:
df_train.head()

Unnamed: 0,text,labels
1,எந்த ஒரு மிகப் பெரிய விஷயமாக இருந்தாலும் அதோட ...,12
2,ராணி தேனி எப்படி கண்டுபிடிக்கிறது,9
3,இன்னும் நிறைய கண்டு பிடிப்புகள் சொல்லவில்லை. ப...,22
4,100% உண்மை..... வாழ்த்துக்கள் உண்மையை உரக்க சொ...,6
5,இது உண்மையாக இருக்கட்டும்,12


In [None]:
labels_dic

{0: 'அக்கறை',
 1: 'அன்பு',
 2: 'அருவருப்பு',
 3: 'ஆசை',
 4: 'ஆச்சரியம்',
 5: 'ஆர்வம்',
 6: 'உண்மையை உணர்தல்',
 7: 'உற்சாகம்',
 8: 'எதிர்காலத்தைப் பற்றிய நம்பிக்கை',
 9: 'எதிர்பார்ப்பு',
 10: 'எரிச்சல்',
 11: 'ஏமாற்றம்',
 12: 'ஒப்புதல்',
 13: 'கிண்டல்',
 14: 'குற்றமுணர்ந்ததால் ஏற்படும் வருத்தம்',
 15: 'குழப்பம்',
 16: 'கேளிக்கை',
 17: 'கோபம்',
 18: 'சங்கடம்',
 19: 'சோகம்',
 20: 'துக்கம்',
 21: 'துயர் நீக்கம்',
 22: 'நடுநிலை',
 23: 'நன்றியறிதல்',
 24: 'நம்பிக்கை',
 25: 'பதட்டம்',
 26: 'பயம்',
 27: 'பெருமை',
 28: 'போற்றுதல்',
 29: 'மகிழ்ச்சி',
 30: 'மறுப்பு'}

**XLM-RoBERTa Model**

In [None]:
model = ClassificationModel(
    'xlmroberta', 'xlm-roberta-base', use_cuda=True, num_labels=31,
    args={
        'train_batch_size': 32,
        'reprocess_input_data': True,
        # 'weight': class_weights,
        'overwrite_output_dir': True,
        'fp16': False,
        'do_lower_case': False,
        'num_train_epochs': 6,
        'max_seq_length': 256,
        'regression': False,
        'manual_seed': SEED,
        'learning_rate': 1e-5,
        'weight_decay': 0,
        'save_eval_checkpoints': False,
        'save_model_every_epoch': False,
        'silent': False,
        'verbose': True,
        'dataloader_num_workers': 0,
        'evaluate_during_training': True,
        'use_early_stopping': True,
        'use_multiprocessing': False,
        'output_dir': 'outputs/tamil/',
    },
)

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForSequenceClassification: ['lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.out_p
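
The commented-out `weight` entry in the arguments suggests class weighting was considered for this imbalanced label set. The sketch below shows one common way to derive such weights with scikit-learn; `y_demo` is a hypothetical stand-in for `df_train['labels']`. As far as I know, simpletransformers expects these weights through the `weight` argument of the `ClassificationModel` constructor rather than inside `args`, but that is an assumption to check against the installed version.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical integer labels standing in for df_train['labels'].
y_demo = np.array([0, 0, 0, 0, 1, 1, 2])

classes = np.unique(y_demo)

# "balanced" assigns n_samples / (n_classes * count_of_class) to each class,
# so rare classes receive larger weights.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
class_weights = weights.tolist()

print(class_weights)
```

The resulting list has one entry per class, ordered by class id, which matches the order of `labels_dic`.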

In [None]:
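import shutil

# Clear earlier checkpoints; ignore_errors avoids a failure when 'outputs' does not exist yet
shutil.rmtree('outputs', ignore_errors=True)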

In [None]:
model.train_model(df_train, eval_df = df_dev)

Epoch:   0%|          | 0/6 [00:00<?, ?it/s]

Running Epoch 0 of 6:   0%|          | 0/944 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  0%|          | 0/4269 [00:00<?, ?it/s]

Running Epoch 1 of 6:   0%|          | 0/944 [00:00<?, ?it/s]

  0%|          | 0/4269 [00:00<?, ?it/s]

Running Epoch 2 of 6:   0%|          | 0/944 [00:00<?, ?it/s]

  0%|          | 0/4269 [00:00<?, ?it/s]

  0%|          | 0/4269 [00:00<?, ?it/s]

Running Epoch 3 of 6:   0%|          | 0/944 [00:00<?, ?it/s]

  0%|          | 0/4269 [00:00<?, ?it/s]

Running Epoch 4 of 6:   0%|          | 0/944 [00:00<?, ?it/s]

  0%|          | 0/4269 [00:00<?, ?it/s]

  0%|          | 0/4269 [00:00<?, ?it/s]

Running Epoch 5 of 6:   0%|          | 0/944 [00:00<?, ?it/s]

  0%|          | 0/4269 [00:00<?, ?it/s]

(5664,
 defaultdict(list,
             {'global_step': [944, 1888, 2000, 2832, 3776, 4000, 4720, 5664],
              'train_loss': [0.8106401562690735,
               1.6649608612060547,
               2.2285993099212646,
               1.9723715782165527,
               2.2849647998809814,
               2.353518009185791,
               1.733634352684021,
               2.603182077407837],
              'mcc': [0.20893566566171073,
               0.2427244790020088,
               0.24580523261091416,
               0.24984184190574849,
               0.2547317254727216,
               0.25744162150894445,
               0.25382988343920165,
               0.25758475312743057],
              'eval_loss': [2.6574051322115495,
               2.521470093102044,
               2.525236466404204,
               2.4867102470290794,
               2.4867022515236217,
               2.4830132725086997,
               2.480380286661427,
               2.4827107480179507]}))
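
The returned history can be inspected to see which evaluation step did best. The sketch below copies the `global_step`, `mcc`, and `eval_loss` values from the output above and picks the best step under each metric.

```python
# Values copied from the training output above.
history = {
    "global_step": [944, 1888, 2000, 2832, 3776, 4000, 4720, 5664],
    "mcc": [
        0.20893566566171073, 0.2427244790020088, 0.24580523261091416,
        0.24984184190574849, 0.2547317254727216, 0.25744162150894445,
        0.25382988343920165, 0.25758475312743057,
    ],
    "eval_loss": [
        2.6574051322115495, 2.521470093102044, 2.525236466404204,
        2.4867102470290794, 2.4867022515236217, 2.4830132725086997,
        2.480380286661427, 2.4827107480179507,
    ],
}

# Index of the highest MCC and of the lowest evaluation loss.
best_mcc_idx = max(range(len(history["mcc"])), key=lambda i: history["mcc"][i])
best_loss_idx = min(range(len(history["eval_loss"])), key=lambda i: history["eval_loss"][i])

best_mcc_step = history["global_step"][best_mcc_idx]
best_loss_step = history["global_step"][best_loss_idx]

print("best MCC at step", best_mcc_step)
print("lowest eval loss at step", best_loss_step)
```

MCC peaks at the final step (5664) while the evaluation loss is lowest at step 4720, but both curves are nearly flat after step 2832, so additional epochs with these settings may bring little gain.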

In [None]:
result, model_outputs, wrong_predictions = model.eval_model(df_dev)
print(result)


  0%|          | 0/4269 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/534 [00:00<?, ?it/s]

{'mcc': 0.25758475312743057, 'eval_loss': 2.4827107480179507}
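
The `huggingface/tokenizers` fork warning in the training log names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it is placed in the first cell of the notebook so that it runs before any tokenizer is used:

```python
import os

# Must run before transformers/tokenizers is first used in the process.
# "false" disables tokenizer-level parallelism and silences the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

The warning is harmless for the results; setting the variable only keeps the logs readable.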


In [None]:
from scipy.special import softmax  # assumption: SciPy's softmax, if not already imported earlier

raw_outputs_vals = softmax(model_outputs,axis=1)

In [None]:
import numpy as np
y_pred = [np.argmax(i) for i in  raw_outputs_vals]
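
Softmax is monotonic within each row, so the predicted class is the same whether `argmax` is taken over the probabilities or over the raw model outputs. A small sketch with made-up logits (not the model's outputs), using the vectorized `axis=1` form instead of a Python loop:

```python
import numpy as np
from scipy.special import softmax

# Made-up logits for two samples and three classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])

probs = softmax(logits, axis=1)  # each row sums to 1

# Vectorized argmax over the class axis.
pred_from_probs = np.argmax(probs, axis=1)
pred_from_logits = np.argmax(logits, axis=1)

print(pred_from_probs, pred_from_logits)
```

The softmax step is therefore only needed when the probabilities themselves are of interest, for example for thresholding or ensembling.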

In [None]:
sklearn.metrics.f1_score(df_dev['labels'].to_numpy(),y_pred,average = 'micro')

0.3249004450691028
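
For single-label multiclass data, micro-averaged F1 equals plain accuracy, so it can hide poor performance on rare classes; macro-averaged F1 weights every class equally. A toy illustration with hypothetical labels (not the dev set):

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels and a classifier that always predicts class 0.
y_true = [0, 0, 0, 0, 1, 2]
y_hat = [0, 0, 0, 0, 0, 0]

micro = f1_score(y_true, y_hat, average="micro")
macro = f1_score(y_true, y_hat, average="macro", zero_division=0)
acc = accuracy_score(y_true, y_hat)

print(f"micro={micro:.3f} macro={macro:.3f} accuracy={acc:.3f}")
```

Reporting both averages on `df_dev` would show how much of the 0.325 micro score comes from the frequent classes.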

In [None]:
print(sklearn.metrics.classification_report(df_dev['labels'].to_numpy(),y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        72
           1       0.24      0.08      0.11        66
           2       0.00      0.00      0.00        49
           3       0.00      0.00      0.00        18
           4       0.00      0.00      0.00        27
           5       0.22      0.14      0.17       115
           6       0.27      0.49      0.35       520
           7       0.25      0.01      0.02        94
           8       0.00      0.00      0.00        51
           9       0.30      0.52      0.38       282
          10       0.17      0.02      0.04       186
          11       0.00      0.00      0.00        54
          12       0.21      0.09      0.13       263
          13       0.30      0.52      0.38       299
          14       0.00      0.00      0.00        34
          15       0.38      0.12      0.18       101
          16       0.00      0.00      0.00        76
          17       0.27    
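
The report is easier to read with label names instead of integer ids. A sketch, assuming an id-to-name mapping shaped like `labels_dic` (the three-class `id2label` here is hypothetical):

```python
from sklearn.metrics import classification_report

# Hypothetical mapping; in the notebook this role is played by labels_dic.
id2label = {0: "anger", 1: "joy", 2: "neutral"}
y_true = [0, 1, 2, 2, 1, 0]
y_hat = [0, 1, 2, 1, 1, 2]

ids = sorted(id2label)
report = classification_report(
    y_true,
    y_hat,
    labels=ids,                                # fix the row order explicitly
    target_names=[id2label[i] for i in ids],   # show names instead of ids
    zero_division=0,                           # classes never predicted score 0 without warnings
)
print(report)
```

Passing `labels` explicitly keeps rows and names aligned even if some class never occurs in the evaluated split.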

In [None]:
import torch

In [None]:
model = ClassificationModel('xlmroberta', 'outputs/tamil/')

In [None]:
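# Note: read_csv treats the first line as a header by default, so if this file has
# no header row, its first sample becomes the column names (see the output below).
tamil_test = pd.read_csv('test_without_labels_task_b.tsv',sep='\t')
df_test = tamil_test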

In [None]:
df_test

Unnamed: 0,0,ஹாஹா ஹாஹா ....வந்துடுச்சு 😂😂😂👍👍👍👍👍😉😉😉🙏🙏🙏
0,1,"உண்மைகள் வெளிவரும் தருணம் இது , தங்களுடைய தேவை..."
1,2,இதற்கு ஒரே தீர்வு...; டிஷ் ஷுக்கு பணம் கட்டுறத...
2,3,மோடி ஆதரவாளர்கள் செய்யும் அட்டூழியம் தாங்க ம...
3,4,முழுசா படிச்சிருக்கேன் அதில் எனக்கு மிகவும் பி...
4,5,வேர லெவல் பங்கம்
...,...,...
4263,4264,என் நாடு...தமிழ் நாடு....
4264,4265,இல்வாழ்க்கையில் கணவன் மனைவி இருவரும் அன்பு செல...
4265,4266,மொழியே தெய்வம் ....
4266,4267,எல்லாம் சரிதான். ஆனால் தலை முடியை பின்னிண்டு ப...


In [None]:
test_sents = list(df_test['ஹாஹா ஹாஹா ....வந்துடுச்சு 😂😂😂👍👍👍👍👍😉😉😉🙏🙏🙏'])
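
The column name used here is itself a test sentence, which indicates that the file was read without `header=None` and that its first row was consumed as the header. A sketch of the difference on a tiny in-memory TSV (the column names `id` and `text` are assumptions, not taken from the task files):

```python
import io
import pandas as pd

# Hypothetical two-row TSV without a header line: id <tab> text.
raw = "0\tfirst sentence\n1\tsecond sentence\n"

# Default behaviour: the first row is consumed as the header.
with_default = pd.read_csv(io.StringIO(raw), sep="\t")

# Explicit header=None keeps every row as data and assigns our own column names.
with_names = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=["id", "text"])

sents_demo = with_names["text"].tolist()

print(len(with_default), len(with_names))
print(sents_demo)
```

Reading the task file this way would keep the first sample in the prediction list, at the cost of changing the row count shown in the outputs here.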

In [None]:
predictions, raw_outputs = model.predict(test_sents)

  0%|          | 0/4268 [00:00<?, ?it/s]

  0%|          | 0/534 [00:00<?, ?it/s]

In [None]:
predictions = [labels_dic[i] for i in predictions]
predictions

['உண்மையை உணர்தல்',
 'உண்மையை உணர்தல்',
 'உண்மையை உணர்தல்',
 'உண்மையை உணர்தல்',
 'போற்றுதல்',
 'ஒப்புதல்',
 'போற்றுதல்',
 'போற்றுதல்',
 'கோபம்',
 'மகிழ்ச்சி',
 'போற்றுதல்',
 'போற்றுதல்',
 'போற்றுதல்',
 'போற்றுதல்',
 'எதிர்பார்ப்பு',
 'போற்றுதல்',
 'போற்றுதல்',
 'ஆர்வம்',
 'போற்றுதல்',
 'கிண்டல்',
 'போற்றுதல்',
 'உண்மையை உணர்தல்',
 'போற்றுதல்',
 'போற்றுதல்',
 'போற்றுதல்',
 'கிண்டல்',
 'கிண்டல்',
 'உண்மையை உணர்தல்',
 'அன்பு',
 'போற்றுதல்',
 'எரிச்சல்',
 'போற்றுதல்',
 'எதிர்பார்ப்பு',
 'எதிர்பார்ப்பு',
 'கோபம்',
 'கோபம்',
 'உண்மையை உணர்தல்',
 'கிண்டல்',
 'எதிர்பார்ப்பு',
 'கோபம்',
 'எதிர்பார்ப்பு',
 'போற்றுதல்',
 'உண்மையை உணர்தல்',
 'போற்றுதல்',
 'ஒப்புதல்',
 'உண்மையை உணர்தல்',
 'எதிர்பார்ப்பு',
 'போற்றுதல்',
 'உண்மையை உணர்தல்',
 'உண்மையை உணர்தல்',
 'உண்மையை உணர்தல்',
 'போற்றுதல்',
 'ஆர்வம்',
 'எரிச்சல்',
 'ஒப்புதல்',
 'போற்றுதல்',
 'போற்றுதல்',
 'சோகம்',
 'கோபம்',
 'போற்றுதல்',
 'உண்மையை உணர்தல்',
 'உண்மையை உணர்தல்',
 'போற்றுதல்',
 'கிண்டல்',
 'எதிர்பார்ப்பு',
 'கோபம்',
 'எதிர்பார்ப்பு',

In [None]:
len(predictions)

4268
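
Beyond the total count, the distribution of predicted labels gives a quick check on whether the model collapses onto a few frequent classes. A sketch using only the first ten predictions from the output above:

```python
from collections import Counter

# First ten predicted labels, copied from the output above.
sample_predictions = [
    'உண்மையை உணர்தல்', 'உண்மையை உணர்தல்', 'உண்மையை உணர்தல்', 'உண்மையை உணர்தல்',
    'போற்றுதல்', 'ஒப்புதல்', 'போற்றுதல்', 'போற்றுதல்', 'கோபம்', 'மகிழ்ச்சி',
]

label_counts = Counter(sample_predictions)

# Print labels from most to least frequent.
for label, count in label_counts.most_common():
    print(f"{label}\t{count}")
```

Running the same count on the full `predictions` list and comparing it with `labels_dic` would show which of the 31 classes are never predicted.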

In [None]:
df_test['label_pred'] = predictions

In [None]:
df_test.to_csv('EmotionAnalysis_taskB_roberta_predictions.csv')
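
By default `to_csv` also writes the DataFrame index as an unnamed first column. If the submission format does not expect it (the required format is not stated here, so this is an assumption), the index can be dropped; the sketch below uses a hypothetical two-row frame and an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Hypothetical predictions frame; column names are illustrative only.
demo = pd.DataFrame({"id": [0, 1], "label_pred": ["joy", "anger"]})

buf = io.StringIO()
# index=False omits the row index; sep="\t" writes tab-separated values.
demo.to_csv(buf, sep="\t", index=False)

written = buf.getvalue()
print(written)

# Round trip: reading the text back yields the same two columns.
restored = pd.read_csv(io.StringIO(written), sep="\t")
```

For a real file, the buffer would be replaced by a path such as the one passed to `to_csv` above.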