# Tutorial

We are going to use [Simple Transformers](https://github.com/ThilinaRajapakse/simpletransformers), an NLP library based on the [Transformers](https://github.com/huggingface/transformers) library by HuggingFace. Simple Transformers lets us fine-tune Transformer models in a few lines of code.

As the dataset, we are going to use [Germeval 2019](https://projects.fzai.h-da.de/iggsa/projekt/), which consists of German tweets. The task is to detect and classify tweets containing abusive language. The tweets are categorized into 4 classes: `PROFANITY`, `INSULT`, `ABUSE`, and `OTHER`. The highest score achieved on this dataset is `0.7361`.

### We are going to

- install Simple Transformers library
- select a pre-trained monolingual model
- load the dataset
- train/fine-tune our model
- evaluate the results
- save and load the model
- test the loaded model on a real example

# Install Simple Transformers library 

In [18]:
# install simpletransformers
!pip install simpletransformers

# check installed version
!pip freeze | grep simpletransformers
# simpletransformers==0.43.2

simpletransformers==0.43.2
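
The same check can also be done from Python, which is handy when a notebook should react to a missing or unexpected version. The sketch below uses only the standard library; the helper name `installed_version` is an assumption made for this example, not part of Simple Transformers.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if the package is not installed
print(installed_version("simpletransformers"))
```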


# Select a pre-trained monolingual model

As mentioned above, the Simple Transformers library is based on the Transformers library from HuggingFace. This enables us to use every pre-trained model provided in the [Transformers library](https://huggingface.co/transformers/pretrained_models.html) and all community-uploaded models. For a list that includes community-uploaded models, refer to [https://huggingface.co/models](https://huggingface.co/models).

We are going to use the `distilbert-base-german-cased` model. [DistilBERT is a small, fast, cheaper version of BERT](https://huggingface.co/transformers/model_doc/distilbert.html). It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster while preserving over 95% of BERT's performance.

# Load the dataset

In [19]:
!wget https://projects.fzai.h-da.de/iggsa/wp-content/uploads/2019/08/germeval2019GoldLabelsSubtask1_2.txt
!wget https://projects.fzai.h-da.de/iggsa/wp-content/uploads/2019/09/germeval2019.training_subtask1_2_korrigiert.txt

--2020-07-07 12:54:25--  https://projects.fzai.h-da.de/iggsa/wp-content/uploads/2019/08/germeval2019GoldLabelsSubtask1_2.txt
Resolving projects.fzai.h-da.de (projects.fzai.h-da.de)... 141.100.60.75, 2001:67c:2184:82a:21a:4aff:fe16:1e6
Connecting to projects.fzai.h-da.de (projects.fzai.h-da.de)|141.100.60.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 543975 (531K) [text/plain]
Saving to: ‘germeval2019GoldLabelsSubtask1_2.txt.1’


2020-07-07 12:54:25 (7.30 MB/s) - ‘germeval2019GoldLabelsSubtask1_2.txt.1’ saved [543975/543975]

--2020-07-07 12:54:27--  https://projects.fzai.h-da.de/iggsa/wp-content/uploads/2019/09/germeval2019.training_subtask1_2_korrigiert.txt
Resolving projects.fzai.h-da.de (projects.fzai.h-da.de)... 141.100.60.75, 2001:67c:2184:82a:21a:4aff:fe16:1e6
Connecting to projects.fzai.h-da.de (projects.fzai.h-da.de)|141.100.60.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 697779 (681K) [text/plain]
Saving to: ‘germe

In [20]:
import pandas as pd

class_list =['PersonalVerwaltung','FI','Zeitwirtschaft','Reisemanagement','HCM','Abrechnung','Fremdsprachen','SAP','ArchiveLink','BC ','Frühwarnsystem','Großprojektmanagement','Lotus','Mercury InteractiveTestdirector','Migration','MS Outlook','MS-Project','Qualitätssicherung','Prozessanalyse','Reviewdurchführung','SAP Anwenderschulungen',
             'Material Management','Schulungsentwicklung','Management',
 'Softwareentwicklung','Prozessmanagement','Präsentation','PSM',
 'Mediation und Konfliktmanagement','Projektmanagement','Anbieterauswahl',
 'Anforderungsanalyse','Anforderungsmanagement','Angebots Projektleitung',
 'Angebotsbewertung','Angebotsmanagement','Angebotsverhandlung',
 'Angebotsvorlage','Anlagemanagement','ARIS','Aufgabenanalyse',
 'Ausschreibung','Auswahl','Genehmigungsverf','Cash management',
 'Brainstorming','CO-OM OPA','Beschwerdemanagement','Controlling',
 'Kostenmanagement','Doppik','EC-PCA','EPC','Fachliche QSAngebot',
 'Finanzen','Forderungsmanagement','Fragebogen','GP Geschäftspartner',
 'GPO','Individuelle Migrationen','Interview','Kalkulation',
 'Kommunalverwaltung','Laufzettel','Mindmap','MM','MS Access',
 'Planung Projektverlauf','Prince2','PSCD','Qualitätsmanagement',
 'SAP LSMW','SAP HANA','Selbstaufschreibung','Funktionsbeschreibung',
 'Methoden & Verfahren','Testmanagement','V-Modell XT','VISIO','Workshop',
 'Zeitaufnahme','SAP ABAP','ABAP','ARS','Basis']

df = pd.read_csv('/content/merged_sample_data.csv')

# df1 = pd.read_csv('germeval2019GoldLabelsSubtask1_2.txt',sep='\t', lineterminator='\n',encoding='utf8',names=["tweet", "task1", "task2"])
# df2 = pd.read_csv('germeval2019.training_subtask1_2_korrigiert.txt',sep='\t', lineterminator='\n',encoding='utf8',names=["tweet", "task1", "task2"])

# df = pd.concat([df1,df2])
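# Strip carriage returns from the descriptions
df['project_description'] = df['project_description'].str.replace('\r', "")
# Encode each tag name as its index in class_list
df['tag_id'] = df['tag_id'].map(class_list.index)

# Keep the text column first and the label column second:
# Simple Transformers falls back to this column order when headers are not specified
df1 = df[['project_description','tag_id']]

print(df1.shape)
df1.head()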

(5000, 2)


| | project_description | tag_id |
|---|---|---|
| 0 | Tätigkeit: Sachbearbeiterin Entgeltabrechnung-... | 0 |
| 1 | Tätigkeit: Sachbearbeiterin Entgeltabrechnung-... | 1 |
| 2 | Tätigkeit: Sachbearbeiterin Entgeltabrechnung-... | 2 |
| 3 | Tätigkeit: Sachbearbeiterin Entgeltabrechnung-... | 3 |
| 4 | Tätigkeit: Sachbearbeiterin Entgeltabrechnung-... | 4 |
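
`class_list.index` scans the list on every call and raises a `ValueError` as soon as a tag is not in the list. One possible alternative, sketched here with a shortened, illustrative label list, is to build the lookup tables once. The names `label_to_id` and `id_to_label` are assumptions made for this example.

```python
# Shortened, illustrative label list; the tutorial uses the full class_list
labels = ["PersonalVerwaltung", "FI", "Zeitwirtschaft", "Reisemanagement"]

# Build both directions once: label -> id for training, id -> label for predictions
label_to_id = {label: idx for idx, label in enumerate(labels)}
id_to_label = dict(enumerate(labels))

encoded = [label_to_id[name] for name in ["FI", "Reisemanagement", "FI"]]
print(encoded)                  # [1, 3, 1]
print(id_to_label[encoded[0]])  # FI

# Unknown labels can be detected explicitly before encoding
unknown = [name for name in ["FI", "Unbekannt"] if name not in label_to_id]
print(unknown)                  # ['Unbekannt']
```

With pandas, the same dictionary can be passed to `Series.map`; tags missing from the dictionary then become `NaN` instead of raising an error.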


In [21]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df1, test_size=0.10)

print('train shape: ',train_df.shape)
print('test shape: ',test_df.shape)

train shape:  (4500, 2)
test shape:  (500, 2)
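
The split above is random and unseeded, so the rows in each part change between runs, and with 85 classes a rare class can end up entirely in one part. scikit-learn's `train_test_split` accepts `random_state` and `stratify` arguments for this. To show what a stratified, seeded split does, here is a minimal standard-library sketch on toy data; `stratified_split` is a hypothetical helper, not the scikit-learn implementation.

```python
import random
from collections import defaultdict


def stratified_split(rows, test_size=0.10, seed=42):
    """Split (text, label) rows so each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy data: 3 labels with 20 rows each
rows = [(f"text {i}", i % 3) for i in range(60)]
train_rows, test_rows = stratified_split(rows)
print(len(train_rows), len(test_rows))  # 54 6
```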


# Load pre-trained model

In [22]:
from simpletransformers.classification import ClassificationModel

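# Define hyperparameters
train_args ={"reprocess_input_data": True,
             "overwrite_output_dir": True,
             "fp16":False,
             "num_train_epochs": 4}

# Create a ClassificationModel
# The model type must match the checkpoint architecture: "distilbert" for
# distilbert-base-german-cased. Passing "bert" here would leave the DistilBERT
# weights unused (the "weights ... were not used" warning).
model = ClassificationModel(
    "distilbert", "distilbert-base-german-cased",
    num_labels=85,  # one label per entry in class_list
    args=train_args
)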

Some weights of the model checkpoint at distilbert-base-german-cased were not used when initializing BertForSequenceClassification: ['distilbert.embeddings.word_embeddings.weight', 'distilbert.embeddings.position_embeddings.weight', 'distilbert.embeddings.LayerNorm.weight', 'distilbert.embeddings.LayerNorm.bias', 'distilbert.transformer.layer.0.attention.q_lin.weight', 'distilbert.transformer.layer.0.attention.q_lin.bias', 'distilbert.transformer.layer.0.attention.k_lin.weight', 'distilbert.transformer.layer.0.attention.k_lin.bias', 'distilbert.transformer.layer.0.attention.v_lin.weight', 'distilbert.transformer.layer.0.attention.v_lin.bias', 'distilbert.transformer.layer.0.attention.out_lin.weight', 'distilbert.transformer.layer.0.attention.out_lin.bias', 'distilbert.transformer.layer.0.sa_layer_norm.weight', 'distilbert.transformer.layer.0.sa_layer_norm.bias', 'distilbert.transformer.layer.0.ffn.lin1.weight', 'distilbert.transformer.layer.0.ffn.lin1.bias', 'distilbert.transformer.lay
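
This warning is worth reading closely: every unused weight name starts with `distilbert.`, while the class being initialized is `BertForSequenceClassification`. A small helper such as the hypothetical `weight_prefixes` below (not part of Simple Transformers or Transformers) makes such a mismatch explicit by collecting the top-level prefixes of the reported names.

```python
def weight_prefixes(weight_names):
    """Collect the top-level module prefix of each weight name."""
    return {name.split(".", 1)[0] for name in weight_names}


# A few names copied from the warning above
unused = [
    "distilbert.embeddings.word_embeddings.weight",
    "distilbert.embeddings.LayerNorm.bias",
    "distilbert.transformer.layer.0.attention.q_lin.weight",
]

print(weight_prefixes(unused))  # {'distilbert'}
```

When the only prefix belongs to a different architecture than the model class, the pre-trained encoder weights were most likely not loaded, and fine-tuning starts from a largely untrained encoder.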

# Train model

In [23]:
# Train the model
model.train_model(train_df)



  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."



In [24]:
from sklearn.metrics import f1_score, accuracy_score


# Micro-averaged F1; for single-label multiclass data this equals accuracy
def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='micro')

result, model_outputs, wrong_predictions = model.eval_model(test_df, f1=f1_multiclass, acc=accuracy_score)

result

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."






{'acc': 0.092,
 'eval_loss': 3.9425552913120816,
 'f1': 0.092,
 'mcc': 0.0693072294428793}
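
Note that `acc` and `f1` are identical in this result. That is expected: when every example has exactly one label, micro-averaged F1 reduces to accuracy. Macro-averaged F1, which weights every class equally, is often more informative with many classes. The sketch below computes all three in plain Python on made-up labels to show the difference; `f1_scores` is a hypothetical helper written for this example, not the scikit-learn function.

```python
def f1_scores(labels, preds):
    """Return (accuracy, micro F1, macro F1) for single-label multiclass data."""
    classes = sorted(set(labels) | set(preds))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for y, p in zip(labels, preds):
        if y == p:
            tp[y] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[y] += 1  # true class gets a false negative

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    accuracy = sum(tp.values()) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return accuracy, micro, macro


# Made-up data: the classifier always predicts class 0
labels = [0, 0, 0, 0, 1, 2]
preds = [0, 0, 0, 0, 0, 0]
accuracy, micro, macro = f1_scores(labels, preds)
print(accuracy == micro)  # True
print(macro < micro)      # True
```

With scikit-learn, the macro variant is obtained by passing `average='macro'` to `f1_score`.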

# Save and load the model

Only the files at the top level of `outputs/` are packed into the archive; subdirectories are skipped.


In [25]:
import os
import tarfile

def save_model(model_path='', file_name=''):
  # os.walk yields the top-level directory first; keep only its files
  _, _, files = next(os.walk(model_path))
  # Write a gzip-compressed tar archive named <file_name>.tar.gz
  with tarfile.open(file_name + '.tar.gz', 'w:gz') as f:
    for file in files:
      f.add(os.path.join(model_path, file))

In [26]:
save_model('outputs','germeval-distilbert-german')

In [27]:
!tar -zxvf ./germeval-distilbert-german.tar.gz

outputs/pytorch_model.bin
outputs/eval_results.txt
outputs/config.json
outputs/vocab.txt
outputs/model_args.json
outputs/tokenizer_config.json
outputs/special_tokens_map.json
outputs/training_args.bin
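
Note that `tar -zxvf` extracts the archive while printing the file names; `tar -ztvf` would only list them. The listing can also be done from Python. The following self-contained sketch builds a throwaway archive in a temporary directory (the file and directory names are made up for illustration) and reads its member list without extracting anything.

```python
import os
import tarfile
import tempfile

# Build a throwaway directory and archive so the example runs on its own
with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "outputs")
    os.makedirs(os.path.join(model_dir, "checkpoint-1"))
    for name in ("config.json", "vocab.txt"):
        with open(os.path.join(model_dir, name), "w") as fh:
            fh.write("{}")
    with open(os.path.join(model_dir, "checkpoint-1", "config.json"), "w") as fh:
        fh.write("{}")

    archive_path = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        # Add only top-level files, mirroring save_model above
        _, _, files = next(os.walk(model_dir))
        for name in sorted(files):
            tar.add(os.path.join(model_dir, name), arcname=f"outputs/{name}")

    # getnames() reads the member list without extracting anything
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getnames()

print(members)  # ['outputs/config.json', 'outputs/vocab.txt']
```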


In [28]:
!rm -rf outputs

# Test the loaded model on a real example

In [29]:
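import os
import tarfile

def unpack_model(model_name=''):
  # Extract <model_name>.tar.gz into the current directory (recreates outputs/);
  # the with-block closes the archive even if extraction fails
  with tarfile.open(f"{model_name}.tar.gz", "r:gz") as tar:
    tar.extractall()

unpack_model('germeval-distilbert-german')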
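
`extractall()` trusts the paths stored in the archive, which is fine for an archive created in the same notebook. For archives from other sources, a common precaution is to check that no member would be written outside the target directory. The sketch below shows one way to do that; `safe_extract` is a hypothetical helper, and it covers only path traversal through member names, not symbolic links. Newer Python versions also offer a `filter` argument on `extractall` for this purpose.

```python
import os
import tarfile


def safe_extract(archive_path, target_dir="."):
    """Extract a .tar.gz archive, refusing members that would land outside target_dir."""
    target_root = os.path.realpath(target_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            destination = os.path.realpath(os.path.join(target_root, member.name))
            # commonpath equals target_root only if destination lies inside it
            if os.path.commonpath([target_root, destination]) != target_root:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        tar.extractall(target_dir)


# Usage with the archive from this tutorial:
# safe_extract("germeval-distilbert-german.tar.gz")
```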

In [31]:
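import json

from simpletransformers.classification import ClassificationModel

# Define hyperparameters
train_args ={"reprocess_input_data": True,
             "overwrite_output_dir": True,
             "fp16":False,
             "num_train_epochs": 4}

# Read the model type from the saved config so it matches the architecture that was trained
with open("outputs/config.json") as f:
    model_type = json.load(f)["model_type"]

# Create a ClassificationModel from the unpacked files
model = ClassificationModel(
    model_type, "outputs/",
    num_labels=85,
    args=train_args
)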

In [32]:
class_list =['PersonalVerwaltung','FI','Zeitwirtschaft','Reisemanagement','HCM','Abrechnung','Fremdsprachen','SAP','ArchiveLink','BC ','Frühwarnsystem','Großprojektmanagement','Lotus','Mercury InteractiveTestdirector','Migration','MS Outlook','MS-Project','Qualitätssicherung','Prozessanalyse','Reviewdurchführung','SAP Anwenderschulungen',
             'Material Management','Schulungsentwicklung','Management',
 'Softwareentwicklung','Prozessmanagement','Präsentation','PSM',
 'Mediation und Konfliktmanagement','Projektmanagement','Anbieterauswahl',
 'Anforderungsanalyse','Anforderungsmanagement','Angebots Projektleitung',
 'Angebotsbewertung','Angebotsmanagement','Angebotsverhandlung',
 'Angebotsvorlage','Anlagemanagement','ARIS','Aufgabenanalyse',
 'Ausschreibung','Auswahl','Genehmigungsverf','Cash management',
 'Brainstorming','CO-OM OPA','Beschwerdemanagement','Controlling',
 'Kostenmanagement','Doppik','EC-PCA','EPC','Fachliche QSAngebot',
 'Finanzen','Forderungsmanagement','Fragebogen','GP Geschäftspartner',
 'GPO','Individuelle Migrationen','Interview','Kalkulation',
 'Kommunalverwaltung','Laufzettel','Mindmap','MM','MS Access',
 'Planung Projektverlauf','Prince2','PSCD','Qualitätsmanagement',
 'SAP LSMW','SAP HANA','Selbstaufschreibung','Funktionsbeschreibung',
 'Methoden & Verfahren','Testmanagement','V-Modell XT','VISIO','Workshop',
 'Zeitaufnahme','SAP ABAP','ABAP','ARS','Basis']

test_tweet = "Erfüllung der regulatorischen Anforderungen aus der Basel II Richtlinie für 2008 mit Hilfe des SAP Bank Analyzers. Anbindung von 5 weiteren Liefersystemen für Geschäftsdaten, Berücksichtigung KSA; Erfüllung der Anforderungen GroMiKV zum 1.1.2008Tätigkeit: Qualitätsmanager und ChangemanagerAbstimmung, Definition und Umsetzung des projektbegleitenden Qualitätsmanagements für alle Projekte im Rahmen der Projektgruppe Basel II 2008.Leitung der Querschnittsfunktion Qualitätsmanagement mit 3 Mitarbeitern. Berücksichtigung der Anforderungen CMMI, Pilotierung neuer geänderter Prozesse im Projektmanagement und in der Qualitätssicherung.Management der Change-Request der Projektgruppe.Durchführung der Qualitätssicherungsmaßnahmen (i.d.R. Reviews der Ergebnisse).Entwicklungsumgebung: SAP BA, kundeneigene ETL-Schicht"

predictions, raw_outputs = model.predict([test_tweet])

print(class_list[predictions[0]])



FI
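
The prediction alone does not say how confident the model is. `model.predict` also returns `raw_outputs`; assuming each row holds one unnormalized score (logit) per class, a softmax turns a row into probabilities, which can then be ranked. The sketch below uses made-up scores and a shortened label list, and the helper names `softmax` and `top_k` are assumptions made for this example.

```python
import math


def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, names, k=3):
    """Return the k most probable (name, probability) pairs."""
    probs = softmax(scores)
    ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up scores for illustration only, not real model output
example_names = ["PersonalVerwaltung", "FI", "Zeitwirtschaft"]
example_scores = [0.2, 1.5, -0.3]
for name, prob in top_k(example_scores, example_names, k=2):
    print(f"{name}: {prob:.3f}")
```

With the objects from this tutorial, the call would look like `top_k(list(raw_outputs[0]), class_list)`, under the assumption stated above about the layout of `raw_outputs`.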


In [None]:
# Germeval labels; this cell assumes a model trained with num_labels=4 on the Germeval data
class_list = ['INSULT','ABUSE','PROFANITY','OTHER']

test_tweet = "Frau #Böttinger meine Meinung dazu ist sie sollten uns mit ihrem Pferdegebiss nicht weiter belästigen #WDR"

predictions, raw_outputs = model.predict([test_tweet])

print(class_list[predictions[0]])
# INSULT