<a href="https://colab.research.google.com/github/Ishani-Mondal/EPOCH_Chat_Analysis/blob/main/Copy_of_Build_a_BERT_Text_classification_model_in_a_different_language_than_English.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial

We are going to use [Simple Transformers](https://github.com/ThilinaRajapakse/simpletransformers) - an NLP library based on the [Transformers](https://github.com/huggingface/transformers) library by HuggingFace. Simple Transformers allows us to fine-tune Transformer models in a few lines of code.  

As the dataset, we are going to use [Germeval 2019](https://projects.fzai.h-da.de/iggsa/projekt/), which consists of German tweets. The task is to detect and classify abusive tweets, which are categorized into four classes: `PROFANITY`, `INSULT`, `ABUSE`, and `OTHERS`. The highest score achieved on this dataset is `0.7361`.

### We are going to

- install the Simple Transformers library
- select a pre-trained monolingual model
- load the dataset
- train/fine-tune our model
- evaluate the results
- save and load the model
- test the loaded model on a real example

In [31]:
import pandas as pd

def load_split(text_path, labels_path):
  # Read one example per line and one integer label per line into a DataFrame
  with open(text_path) as f:
    text = f.read().split("\n")
  with open(labels_path) as f:
    labels = [int(line) for line in f.read().split("\n")]
  print(len(text), len(labels))
  return pd.DataFrame({'text': text, 'labels': labels})

# Training data
train_df = load_split('Combined_Train.txt', 'Combined_Train_Labels.txt')

# Alternative test set (disabled in this run)
# test_df = load_split('test_text.txt', 'test_labels.txt')

# English test set
test_df1 = load_split('Eng_test.txt', 'Eng_test_labels.txt')

1104 1104
870 870
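
The loading cell relies on the text file and the label file having exactly the same number of lines. A minimal sketch of a loader that checks this explicitly follows; the helper name `read_parallel_files` and the temporary demo files are assumptions made for illustration and are not part of the notebook.

```python
import os
import tempfile

def read_parallel_files(text_path, labels_path):
    """Hypothetical helper: read one example and one integer label per line."""
    with open(text_path, encoding="utf-8") as f:
        texts = f.read().split("\n")
    with open(labels_path, encoding="utf-8") as f:
        labels = [int(line) for line in f.read().split("\n")]
    if len(texts) != len(labels):
        raise ValueError(f"{len(texts)} texts but {len(labels)} labels")
    return texts, labels

# Demo on two tiny temporary files (made-up content, no trailing newline)
tmp_dir = tempfile.mkdtemp()
demo_text = os.path.join(tmp_dir, "demo_text.txt")
demo_labels = os.path.join(tmp_dir, "demo_labels.txt")
with open(demo_text, "w", encoding="utf-8") as f:
    f.write("first tweet\nsecond tweet\nthird tweet")
with open(demo_labels, "w", encoding="utf-8") as f:
    f.write("0\n2\n1")

texts, labels = read_parallel_files(demo_text, demo_labels)
print(len(texts), len(labels))
```

Failing early with a clear message is easier to debug than a misaligned DataFrame or an `IndexError` several cells later.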


In [32]:
import random

import numpy as np
import torch

# Ensure deterministic behavior.
# A fixed integer seed is used because hash() of a string changes between
# Python processes unless PYTHONHASHSEED is set.
SEED = 42
torch.backends.cudnn.deterministic = True
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
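
Seeding only makes runs repeatable if the seed value itself is stable: Python salts `str` hashes per interpreter process unless `PYTHONHASHSEED` is set, so a seed derived from `hash("...")` generally differs between sessions. The sketch below shows one way to seed from a plain integer; `seed_everything` is a hypothetical helper name, and NumPy and PyTorch are seeded only if they happen to be installed.

```python
import random

def seed_everything(seed):
    """Hypothetical helper: seed every random number generator that is available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed

# The same integer seed reproduces the same random draws
seed_everything(1234)
first_run = [random.random() for _ in range(3)]
seed_everything(1234)
second_run = [random.random() for _ in range(3)]
print(first_run == second_run)
```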

# Install Simple Transformers library 

In [33]:
# install simpletransformers
!pip install simpletransformers

# check installed version
!pip freeze | grep simpletransformers

simpletransformers==0.61.13


# Select a pre-trained monolingual model

As mentioned above, the Simple Transformers library is based on the Transformers library from HuggingFace. This enables us to use every pre-trained model provided in the [Transformers library](https://huggingface.co/transformers/pretrained_models.html) and all community-uploaded models. For a list that includes community-uploaded models, refer to [https://huggingface.co/models](https://huggingface.co/models).

This tutorial was written around the `distilbert-base-german-cased` model. [DistilBERT is a small, fast, cheaper version of BERT](https://huggingface.co/transformers/model_doc/distilbert.html). It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster while preserving over 95% of BERT's performance. Note that the code cells in this notebook load the multilingual `xlm-roberta-base` checkpoint with three labels instead.
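
As a quick check of the "40%" figure, the arithmetic can be written out. The parameter counts below (roughly 110 million for BERT-base and 66 million for DistilBERT) are the commonly reported values for the English models and are treated as assumptions here; they are not measured in this notebook.

```python
# Approximate parameter counts in millions (assumed values, see the text above)
bert_base_params_m = 110
distilbert_params_m = 66

# Relative reduction in parameter count
reduction = 1 - distilbert_params_m / bert_base_params_m
print(f"{reduction:.0%} fewer parameters")
```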

# Load the dataset

In [None]:
!wget https://projects.fzai.h-da.de/iggsa/wp-content/uploads/2019/08/germeval2019GoldLabelsSubtask1_2.txt
!wget https://projects.fzai.h-da.de/iggsa/wp-content/uploads/2019/09/germeval2019.training_subtask1_2_korrigiert.txt

--2021-09-04 17:39:46--  https://projects.fzai.h-da.de/iggsa/wp-content/uploads/2019/08/germeval2019GoldLabelsSubtask1_2.txt
Resolving projects.fzai.h-da.de (projects.fzai.h-da.de)... 141.100.60.75, 2001:67c:2184:82a:21a:4aff:fe16:1e6
Connecting to projects.fzai.h-da.de (projects.fzai.h-da.de)|141.100.60.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 543975 (531K) [text/plain]
Saving to: ‘germeval2019GoldLabelsSubtask1_2.txt’


2021-09-04 17:39:49 (414 KB/s) - ‘germeval2019GoldLabelsSubtask1_2.txt’ saved [543975/543975]

--2021-09-04 17:39:49--  https://projects.fzai.h-da.de/iggsa/wp-content/uploads/2019/09/germeval2019.training_subtask1_2_korrigiert.txt
Resolving projects.fzai.h-da.de (projects.fzai.h-da.de)... 141.100.60.75, 2001:67c:2184:82a:21a:4aff:fe16:1e6
Connecting to projects.fzai.h-da.de (projects.fzai.h-da.de)|141.100.60.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 697779 (681K) [text/plain]
Saving to: ‘germeval20

In [11]:
#from sklearn.model_selection import train_test_split

#train_df, test_df = train_test_split(df, test_size=0.10)

print('train shape: ',train_df.shape)
print('test shape: ',test_df1.shape)

test_df1.head()

train shape:  (1839, 2)
test shape:  (870, 2)


|   | text | labels |
|---|------|--------|
| 0 | Trying to have a conversation with my dad abou... | 0 |
| 1 | #latestnews 4 #newmexico #politics + #nativeam... | 1 |
| 2 | @user You are a stand up guy and a Gentleman V... | 2 |
| 3 | @user @user @user Looks like Flynn isn't too p... | 0 |
| 4 | perfect pussy clips #vanessa hudgens zac efron... | 1 |


# Load pre-trained model

In [12]:
from simpletransformers.classification import ClassificationModel

# define hyperparameter
train_args ={"reprocess_input_data": True,
             "overwrite_output_dir": True,
             "fp16":False,
             "num_train_epochs": 3}

# Create a ClassificationModel
model1 = ClassificationModel(
    "xlmroberta", "xlm-roberta-base",
    num_labels=3,
    args=train_args
)

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.dense.weig

# Train model

In [34]:
# Train the model
model1.train_model(train_df)



Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 0 of 3:   0%|          | 0/138 [00:00<?, ?it/s]

Running Epoch 1 of 3:   0%|          | 0/138 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/138 [00:00<?, ?it/s]

(414, 0.5904768971987249)
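
The tuple printed by `train_model` appears to be the number of optimizer steps followed by the average training loss. Assuming a batch size of 8 and no gradient accumulation (neither is set in `train_args`, so both are assumptions about the library defaults), the step counts in the progress bars can be reproduced with a little arithmetic.

```python
import math

def steps_per_epoch(num_examples, batch_size=8):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)

train_steps = steps_per_epoch(1104)   # 1104 training examples were loaded
total_steps = train_steps * 3         # num_train_epochs = 3
eval_steps = steps_per_epoch(870)     # 870 English test examples

print(train_steps, total_steps, eval_steps)  # → 138 414 109
```

These values agree with the `0/138` progress bars per epoch and the `414` in the returned tuple; `109` is the number of batches expected when the 870 test examples are evaluated.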

In [35]:
from sklearn.metrics import f1_score, accuracy_score

predict=[]
actual=[]

def f1_multiclass(labels, preds):
    for i in range(len(labels)):
      actual.append(labels[i])
      predict.append(preds[i])
    return f1_score(labels, preds, average='micro')

    
result, model_outputs, wrong_predictions = model1.eval_model(test_df1, f1=f1_multiclass, acc=accuracy_score)
result

  0%|          | 0/870 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/109 [00:00<?, ?it/s]

{'acc': 0.6632183908045977,
 'eval_loss': 1.0981042248393418,
 'f1': 0.6632183908045977,
 'mcc': 0.49912615747532696}
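
The `acc` and `f1` values are identical because `f1_multiclass` uses `average='micro'`: when every example has exactly one label, pooling true positives, false positives and false negatives over all classes gives the same number as accuracy. The sketch below illustrates this in plain Python on made-up labels and contrasts it with the macro average; the function name is hypothetical.

```python
def micro_and_macro_f1(y_true, y_pred):
    """Compute micro- and macro-averaged F1 for single-label classification."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class receives a false positive
            fn[t] += 1   # true class receives a false negative
    # Micro: pool the counts over all classes, then compute one F1
    tp_sum, fp_sum, fn_sum = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)
    # Macro: compute F1 per class, then take the unweighted mean
    per_class = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_class.append(2 * tp[c] / denom if denom else 0.0)
    macro = sum(per_class) / len(classes)
    return micro, macro

y_true = [0, 0, 0, 1, 1, 2, 2, 2]   # made-up gold labels
y_pred = [0, 0, 1, 1, 2, 2, 2, 0]   # made-up predictions
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
micro, macro = micro_and_macro_f1(y_true, y_pred)
print(accuracy, micro, round(macro, 3))
```

If a single score is meant to reflect performance on rare classes, the macro average is usually the more informative choice.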

In [36]:
from sklearn.metrics import classification_report
print(classification_report(actual, predict))

              precision    recall  f1-score   support

         0.0       0.65      0.80      0.72       290
         1.0       0.61      0.48      0.53       290
         2.0       0.73      0.71      0.72       290

    accuracy                           0.66       870
   macro avg       0.66      0.66      0.66       870
weighted avg       0.66      0.66      0.66       870



In [38]:
### NEW ADDITION

from sklearn.metrics import classification_report, f1_score, accuracy_score

# Load the G1 test set: one tweet per line, one integer label per line
with open('test_text_G1.txt') as f1:
  text = f1.read().split("\n")

with open('test_text_G1_labels.txt') as f2:
  labels = [int(line) for line in f2.read().split("\n")]

print(len(text), len(labels))
test_df2 = pd.DataFrame(
    {'text': text,
     'labels': labels
    })

# Filled by f1_multiclass so the per-class report can be printed afterwards
predict=[]
actual=[]

def f1_multiclass(labels, preds):
    actual.extend(labels)
    predict.extend(preds)
    return f1_score(labels, preds, average='micro')

result, model_outputs, wrong_predictions = model1.eval_model(test_df2, f1=f1_multiclass, acc=accuracy_score)

print(classification_report(actual, predict))

# Write one "text<TAB>gold<TAB>predicted" line per example; the with-block closes the file
with open('Test_predictions_XLMR_G1_Non-Zeroshot_Original.txt','w') as f3:
  for i in range(len(actual)):
    f3.write(str(text[i])+"\t"+str(actual[i])+"\t"+str(predict[i])+"\n")


1657 1657


  0%|          | 0/1657 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/208 [00:00<?, ?it/s]

              precision    recall  f1-score   support

         0.0       0.14      0.81      0.23        91
         1.0       0.98      0.51      0.67      1546
         2.0       0.05      0.85      0.10        20

    accuracy                           0.53      1657
   macro avg       0.39      0.72      0.33      1657
weighted avg       0.92      0.53      0.64      1657
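
The gap between the macro and weighted averages comes from class imbalance: class `1.0` accounts for 1546 of the 1657 examples, so it dominates the weighted figure. The two averages can be recomputed from the per-class F1 scores and supports in the report above, as in this small sketch.

```python
# Per-class F1 and support copied from the report above
f1_by_class = {0: 0.23, 1: 0.67, 2: 0.10}
support = {0: 91, 1: 1546, 2: 20}

# Macro average: every class counts equally
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

# Weighted average: every class counts in proportion to its support
total = sum(support.values())
weighted_f1 = sum(f1_by_class[c] * support[c] for c in f1_by_class) / total

print(round(macro_f1, 2), round(weighted_f1, 2))  # → 0.33 0.64
```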



In [40]:
### NEW ADDITION

from sklearn.metrics import classification_report, f1_score, accuracy_score

# Load the translated G1 test set; the labels are shared with the original G1 set
with open('Translated_test_G1.txt') as f1:
  text = f1.read().split("\n")

with open('test_text_G1_labels.txt') as f2:
  labels = [int(line) for line in f2.read().split("\n")]

print(len(text), len(labels))
test_df2 = pd.DataFrame(
    {'text': text,
     'labels': labels
    })

# Filled by f1_multiclass so the per-class report can be printed afterwards
predict=[]
actual=[]

def f1_multiclass(labels, preds):
    actual.extend(labels)
    predict.extend(preds)
    return f1_score(labels, preds, average='micro')

result, model_outputs, wrong_predictions = model1.eval_model(test_df2, f1=f1_multiclass, acc=accuracy_score)

print(classification_report(actual, predict))

# Write one "text<TAB>gold<TAB>predicted" line per example; the with-block closes the file
with open('Test_predictions_XLMR_G1_Non-Zeroshot_Translated.txt','w') as f3:
  for i in range(len(actual)):
    f3.write(str(text[i])+"\t"+str(actual[i])+"\t"+str(predict[i])+"\n")

1657 1657


  0%|          | 0/1657 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/208 [00:00<?, ?it/s]

              precision    recall  f1-score   support

         0.0       0.13      0.55      0.21        91
         1.0       0.96      0.53      0.68      1546
         2.0       0.04      0.90      0.08        20

    accuracy                           0.53      1657
   macro avg       0.38      0.66      0.32      1657
weighted avg       0.90      0.53      0.65      1657



In [None]:
# The with-block closes the file so that all rows are flushed to disk
with open('Test_predictions_XLMR_G1_Zeroshot.txt','w') as f3:
  for i in range(len(actual)):
    f3.write(str(text[i])+"\t"+str(actual[i])+"\t"+str(predict[i])+"\n")

In [None]:
from sklearn.metrics import f1_score, accuracy_score

predict=[]
actual=[]

def f1_multiclass(labels, preds):
    for i in range(len(labels)):
      actual.append(labels[i])
      predict.append(preds[i])
    return f1_score(labels, preds, average='micro')

    
result, model_outputs, wrong_predictions = model1.eval_model(test_df2, f1=f1_multiclass, acc=accuracy_score)

from sklearn.metrics import classification_report
print(classification_report(actual, predict))

  0%|          | 0/1655 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/207 [00:00<?, ?it/s]

              precision    recall  f1-score   support

         0.0       0.13      0.21      0.16        90
         1.0       0.94      0.58      0.72      1546
         2.0       0.03      0.89      0.06        19

    accuracy                           0.56      1655
   macro avg       0.37      0.56      0.31      1655
weighted avg       0.88      0.56      0.68      1655



# Save and load the model

Save the model files from `outputs/` as a `.tar.gz` archive.


In [None]:
import os
import tarfile

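def save_model(model_path='',file_name=''):
  # Keep only the files directly inside model_path; subdirectories are skipped
  files = [files for root, dirs, files in os.walk(model_path)][0]
  # Write them to <file_name>.tar.gz, stored under the model_path/ prefix
  with tarfile.open(file_name+ '.tar.gz', 'w:gz') as f:
    for file in files:
      f.add(f'{model_path}/{file}')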

In [None]:
save_model('outputs','our_model')
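
To see what such an archive contains without touching the real model, the same packing logic can be tried on a throwaway directory. This is a sketch under stated assumptions: the directory layout and file names below are made up, and only top-level files are packed, mirroring `save_model`.

```python
import os
import tarfile
import tempfile

work_dir = tempfile.mkdtemp()
model_dir = os.path.join(work_dir, "outputs")          # stand-in for the real outputs/ folder
os.makedirs(os.path.join(model_dir, "checkpoint-1"))   # nested folder that will be skipped
for name in ("config.json", "weights.bin"):            # made-up file names
    with open(os.path.join(model_dir, name), "w") as f:
        f.write("dummy")

# Pack only the files that sit directly inside model_dir
archive_path = os.path.join(work_dir, "demo_model.tar.gz")
top_level_files = next(os.walk(model_dir))[2]
with tarfile.open(archive_path, "w:gz") as tar:
    for name in top_level_files:
        tar.add(os.path.join(model_dir, name), arcname=f"outputs/{name}")

# List the archive members to confirm what was stored
with tarfile.open(archive_path, "r:gz") as tar:
    members = sorted(tar.getnames())
print(members)
```

Because the members keep the `outputs/` prefix, extracting the archive recreates an `outputs/` folder that can be passed back to `ClassificationModel` as the model path.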

In [None]:
!tar -zxvf ./germeval-distilbert-german.tar.gz

sample_data/README.md
sample_data/anscombe.json
sample_data/mnist_train_small.csv
sample_data/mnist_test.csv
sample_data/california_housing_test.csv
sample_data/california_housing_train.csv


In [None]:
!rm -rf outputs

# Test the loaded model on a real example

In [None]:
import os
import tarfile

def unpack_model(model_name=''):
  # Extract <model_name>.tar.gz into the current directory
  with tarfile.open(f"{model_name}.tar.gz", "r:gz") as tar:
    tar.extractall()

unpack_model('our_model')

In [None]:
class_list = ['Neg','Neu','Pos']

test_tweet = "I am happy for you"

predictions, raw_outputs = model.predict([test_tweet])

print(class_list[predictions[0]])

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Pos
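
`predict` returns the predicted class indices together with `raw_outputs`. Assuming each row of `raw_outputs` holds one unnormalized score (logit) per class, a softmax turns it into probabilities, which helps to judge how confident a prediction is. The scores in this sketch are made up and were not produced by the model.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]   # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

class_list = ['Neg', 'Neu', 'Pos']
example_raw_output = [-1.2, 0.3, 2.1]   # made-up scores for a single input

probs = softmax(example_raw_output)
best = max(range(len(probs)), key=probs.__getitem__)   # index of the highest probability
print(class_list[best], [round(p, 3) for p in probs])
```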


In [None]:
from simpletransformers.classification import ClassificationModel

# define hyperparameter
train_args ={"reprocess_input_data": True,
             "overwrite_output_dir": True,
             "fp16":False,
             "num_train_epochs": 2}

# Create a ClassificationModel
model = ClassificationModel(
    "xlmroberta", "outputs/",
    num_labels=3,
    args=train_args
)

  f"use_multiprocessing automatically disabled as {model_type}"


In [None]:
with open('Filtered_Train.txt') as f1:
  text = f1.read().split("\n")

with open('Filtered_Train_labels.txt') as f2:
  labels = [int(line) for line in f2.read().split("\n")]

print(len(text), len(labels))
train_df = pd.DataFrame(
    {'text': text,
     'labels': labels
    })

208 208


In [None]:
model.train_model(train_df)

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 0 of 2:   0%|          | 0/26 [00:00<?, ?it/s]

Running Epoch 1 of 2:   0%|          | 0/26 [00:00<?, ?it/s]

(52, 1.035253250828156)

In [None]:
from sklearn.metrics import f1_score, accuracy_score

predict=[]
actual=[]

def f1_multiclass(labels, preds):
    for i in range(len(labels)):
      actual.append(labels[i])
      predict.append(preds[i])
    return f1_score(labels, preds, average='micro')

    
result, model_outputs, wrong_predictions = model.eval_model(test_df2, f1=f1_multiclass, acc=accuracy_score)
result

from sklearn.metrics import classification_report
print(classification_report(actual, predict))

  0%|          | 0/1655 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/207 [00:00<?, ?it/s]

              precision    recall  f1-score   support

         0.0       0.10      0.91      0.18        90
         1.0       0.98      0.21      0.34      1546
         2.0       0.03      0.84      0.06        19

    accuracy                           0.25      1655
   macro avg       0.37      0.65      0.19      1655
weighted avg       0.93      0.25      0.33      1655



In [None]:
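# Take the tweets from test_df2 (the set that was just evaluated) so the lengths match
# actual/predict; indexing the global `text` here raised an IndexError because it can
# hold the lines of a different file.
test_text = test_df2['text'].tolist()
with open('Test_predictions_XLMR_G1_Non_Zeroshot.txt','w') as f3:
  for i in range(len(actual)):
    f3.write(str(test_text[i])+"\t"+str(actual[i])+"\t"+str(predict[i])+"\n")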
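
An `IndexError` in this kind of loop typically means that the list of texts and the lists of labels have different lengths. A minimal sketch of a writer that checks the lengths first and uses the standard `csv` module for the tab-separated output follows; `write_predictions` and the demo values are hypothetical.

```python
import csv
import os
import tempfile

def write_predictions(path, texts, gold, predicted):
    """Hypothetical helper: write text, gold label and predicted label as TSV rows."""
    if not (len(texts) == len(gold) == len(predicted)):
        raise ValueError(
            f"length mismatch: {len(texts)} texts, {len(gold)} gold, {len(predicted)} predicted"
        )
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerows(zip(texts, gold, predicted))

# Demo with made-up values
out_path = os.path.join(tempfile.mkdtemp(), "demo_predictions.tsv")
write_predictions(out_path, ["good day", "bad day"], [2, 0], [2, 1])
with open(out_path, encoding="utf-8") as f:
    rows = f.read().splitlines()
print(rows)
```

Using `csv.writer` also quotes texts that contain tabs or line breaks, which the manual string concatenation does not.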

In [None]:
import pandas as pd
with open('Filtered_Train.txt') as f1:
  text = f1.read().split("\n")

with open('Filtered_Train_labels.txt') as f2:
  labels = [int(line) for line in f2.read().split("\n")]

#print(len(text), len(labels))
train_df = pd.DataFrame(
    {'text': text,
     'labels': labels
    })

In [None]:
from simpletransformers.classification import ClassificationModel

# define hyperparameter
train_args ={"reprocess_input_data": True,
             "overwrite_output_dir": True,
             "fp16":False,
             "num_train_epochs": 3}

# Create a ClassificationModel
model = ClassificationModel(
    "xlmroberta", "outputs/",
    num_labels=3,
    args=train_args
)

model.train_model(train_df)

  f"use_multiprocessing automatically disabled as {model_type}"


Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 0 of 3:   0%|          | 0/26 [00:00<?, ?it/s]

Running Epoch 1 of 3:   0%|          | 0/26 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/26 [00:00<?, ?it/s]

(78, 0.6858285543246146)

In [None]:
with open('test_text_G1.txt') as f1:
  text = f1.read().split("\n")

with open('test_text_G1_labels.txt') as f2:
  labels = [int(line) for line in f2.read().split("\n")]

print(len(text), len(labels))
test_df2 = pd.DataFrame(
    {'text': text,
     'labels': labels
    })

1655 1655


In [None]:
from sklearn.metrics import f1_score, accuracy_score

predict=[]
actual=[]

def f1_multiclass(labels, preds):
    for i in range(len(labels)):
      actual.append(labels[i])
      predict.append(preds[i])
    return f1_score(labels, preds, average='micro')

    
result, model_outputs, wrong_predictions = model.eval_model(test_df2, f1=f1_multiclass, acc=accuracy_score)
result

from sklearn.metrics import classification_report
print(classification_report(actual, predict))

  0%|          | 0/1655 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/207 [00:00<?, ?it/s]

              precision    recall  f1-score   support

         0.0       0.10      0.87      0.18        90
         1.0       0.97      0.34      0.51      1546
         2.0       0.04      0.79      0.08        19

    accuracy                           0.38      1655
   macro avg       0.37      0.67      0.26      1655
weighted avg       0.92      0.38      0.49      1655



In [None]:
class_list = ['Neg','Neu','Pos']

test_tweet = "P"

predictions, raw_outputs = model.predict([test_tweet])

print(class_list[predictions[0]])

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Pos


In [None]:
# The with-block closes the file so that all rows are flushed to disk
with open('Test_predictions_XLMR_G1.txt','w') as f3:
  for i in range(len(actual)):
    f3.write(str(text[i])+"\t"+str(actual[i])+"\t"+str(predict[i])+"\n")