# Notebook: Train Model

This notebook trains a sentiment classification model on a dataset of tweets. The evaluation metrics are saved as JSON, and the test set with its predictions as CSV.
<br>**Contributors:** [Nils Hellwig](https://github.com/NilsHellwig/) | [Markus Bink](https://github.com/MarkusBink/)

## Packages

In [1]:
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
from simpletransformers.classification import ClassificationModel
from get_germeval_2017_dataset import get_germeval_2017_dataset
import pandas as pd
import numpy as np
import random
import json
import os

## Parameters

In [2]:
SPLIT_ID = 0
TEST_DATASET_PATH = f'../Datasets/k_fold_splits/TRAIN_TEST_{SPLIT_ID}/test.csv'
N_TRAIN_EPOCHS = 4
TRAIN_BATCH_SIZE = 32
TEST_BATCH_SIZE = 32
USE_CUDA = False
SEED_VALUE = 0
MODEL_TYPE = "bert"
MODEL_NAME = "deepset/gbert-base"
MODEL_DIRECTORY_PATH = "output"
PATH_RESULT_DATA = f'../Models/Results/GermEval_and_Annotaded_it_{SPLIT_ID}'
SAVE_MODEL = False
N_LABELS = 3  # must equal the number of classes in LABEL_DEFINITION
EVALUATE_MODEL = True
LABEL_DEFINITION = {'negative': 1, 'positive': 0, 'neutral': 2}
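
`N_LABELS` is passed to the model as `num_labels`, so it has to cover every class id in `LABEL_DEFINITION`. A minimal sanity check, sketched here with a hypothetical helper `check_label_config` that is not part of the original notebook, could fail fast before any model is loaded:

```python
def check_label_config(n_labels, label_definition):
    """Raise ValueError if the label ids do not form the range 0..n_labels-1.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    ids = sorted(label_definition.values())
    expected = list(range(n_labels))
    if ids != expected:
        raise ValueError(f"label ids {ids} do not match expected ids {expected}")


# Three classes with ids 0, 1, 2 pass the check when n_labels is 3.
check_label_config(3, {'negative': 1, 'positive': 0, 'neutral': 2})
```

Calling the helper with `n_labels=2` and the same mapping raises a `ValueError`, because the id 2 has no matching output in a two-class head.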

## Code

### 1. Get Reproducible Results

In [3]:
os.environ['PYTHONHASHSEED'] = str(SEED_VALUE)
random.seed(SEED_VALUE)
np.random.seed(SEED_VALUE)
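
The cell above sets the hash seed environment variable and seeds the `random` module and NumPy. One caveat: assigning `PYTHONHASHSEED` inside a running interpreter does not change hashing in that process, only in subprocesses started afterwards. The effect of the other two seeds can be checked with a small sketch (the helper name `set_seeds` is an assumption, not part of the notebook):

```python
import os
import random

import numpy as np


def set_seeds(seed):
    """Set the hash-seed variable and seed `random` and NumPy's global RNG."""
    os.environ['PYTHONHASHSEED'] = str(seed)  # only affects subprocesses started later
    random.seed(seed)
    np.random.seed(seed)


# Re-seeding with the same value reproduces the same draws.
set_seeds(0)
first = (random.random(), np.random.rand(3).tolist())
set_seeds(0)
second = (random.random(), np.random.rand(3).tolist())
print(first == second)  # True
```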

### 2. Load Dataframes

#### Load Training Data
**Important:** Comment out the data frames that are not needed for the current run.

In [4]:
train_df_annotated_split = pd.read_csv(f'../Datasets/k_fold_splits/TRAIN_TEST_{SPLIT_ID}/train.csv', encoding="utf-8")[["tweet","sentiment_label"]].rename(columns={"tweet":"text"})
train_df_germeval = get_germeval_2017_dataset()
train_df_annotated_total = pd.read_csv("../Datasets/annotations.csv", encoding="utf-8")[["tweet","sentiment_label"]].rename(columns={"tweet":"text"})

In [5]:
train_df = pd.concat([train_df_annotated_split, train_df_germeval], axis=0).sample(frac=1, random_state=SEED_VALUE).reset_index(drop=True)
train_df['sentiment_label'] = train_df['sentiment_label'].str.lower()

Check Labels
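
One way to check the labels is to compare the distinct values in `sentiment_label` against the keys of `LABEL_DEFINITION` before they are replaced with numbers. The sketch below uses a small made-up data frame and a hypothetical helper `find_unknown_labels`; neither is part of the original notebook.

```python
import pandas as pd

LABEL_DEFINITION = {'negative': 1, 'positive': 0, 'neutral': 2}


def find_unknown_labels(df, label_definition, column="sentiment_label"):
    """Return the sorted label strings in `column` that have no numeric id."""
    present = set(df[column].dropna().unique())
    return sorted(present - set(label_definition))


# Made-up example rows; "mixed" has no entry in LABEL_DEFINITION.
toy_df = pd.DataFrame({
    "text": ["gut", "schlecht", "ok", "hm"],
    "sentiment_label": ["positive", "negative", "neutral", "mixed"],
})
print(find_unknown_labels(toy_df, LABEL_DEFINITION))  # ['mixed']
print(toy_df["sentiment_label"].value_counts().to_dict())
```

An empty list means every label can be mapped; anything else would otherwise survive the replacement step as a string.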

#### Load Test Data

In [6]:
if EVALUATE_MODEL:
    test_df = pd.read_csv(TEST_DATASET_PATH, encoding="utf-8")[["tweet","sentiment_label"]].rename(columns={"tweet":"text"})
    test_df['sentiment_label'] = test_df['sentiment_label'].str.lower()

#### Replace label strings with numbers

In [7]:
train_df['sentiment_label'] = train_df['sentiment_label'].replace(LABEL_DEFINITION)

In [8]:
if EVALUATE_MODEL:
    test_df['sentiment_label'] = test_df['sentiment_label'].replace(LABEL_DEFINITION)

### 3. Create Model

In [9]:
training_args = {
    "fp16":False,
    "num_train_epochs":N_TRAIN_EPOCHS,
    "overwrite_output_dir":True,
    "train_batch_size":TRAIN_BATCH_SIZE,
    "eval_batch_size":TEST_BATCH_SIZE,
    "manual_seed": SEED_VALUE,
    "reprocess_input_data":True,
    "no_save":True,
    "no_cache":True
}

In [10]:
model = ClassificationModel(model_type=MODEL_TYPE, model_name=MODEL_NAME, num_labels=N_LABELS, args=training_args, use_cuda=USE_CUDA)

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint a

### 4. Train Model

In [11]:
model.train_model(train_df)

### 5. Define Metrics

In [12]:
accuracy_metric = accuracy_score

def f1_metrics(labels, predictions):
    metrics = {
      "f1_macro": f1_score(labels, predictions, average='macro'),
      "f1_micro": f1_score(labels, predictions, average='micro'),
      "f1_weighted": f1_score(labels, predictions, average='weighted')
    }
    return metrics

def precision_metrics(labels, predictions):
    metrics = {
      "precision_macro": precision_score(labels, predictions, average='macro'),
      "precision_micro": precision_score(labels, predictions, average='micro'),
      "precision_weighted": precision_score(labels, predictions, average='weighted')
    }
    return metrics

def recall_metrics(labels, predictions):
    metrics = {
      "recall_macro": recall_score(labels, predictions, average='macro'),
      "recall_micro": recall_score(labels, predictions, average='micro'),
      "recall_weighted": recall_score(labels, predictions, average='weighted')
    }
    return metrics
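
The three averaging modes answer different questions: `macro` is the unweighted mean of the per-class scores, `weighted` weights each class by its number of true instances, and `micro` pools all decisions before computing the score. A minimal pure-Python sketch for precision, using made-up labels and a hypothetical helper `precision_averages` (the notebook itself relies on scikit-learn's `precision_score`), makes the difference concrete:

```python
from collections import Counter


def precision_averages(labels, predictions):
    """Compute macro, micro and weighted precision by hand (illustration only)."""
    support = Counter(labels)          # true instances per class
    predicted = Counter(predictions)   # predicted instances per class
    correct = Counter(y for y, p in zip(labels, predictions) if y == p)
    classes = sorted(set(labels) | set(predictions))
    # Per-class precision; 0 when a class was never predicted.
    per_class = {c: correct[c] / predicted[c] if predicted[c] else 0.0 for c in classes}
    total = len(labels)
    return {
        "macro": sum(per_class.values()) / len(classes),
        "micro": sum(correct.values()) / total,
        "weighted": sum(per_class[c] * support[c] for c in classes) / total,
    }


# Made-up data: class 1 is predicted once too often.
scores = precision_averages([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])
print({k: round(v, 4) for k, v in scores.items()})
# {'macro': 0.8889, 'micro': 0.8, 'weighted': 0.8667}
```

With one label per example, the micro average equals plain accuracy, which is why it is the least sensitive of the three to errors on rare classes.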

In [13]:
def precision_recall_each_class(labels, predictions):
    """Compute precision and recall separately for every class in `labels`."""
    precision_recall = {}
    for c in set(labels):
        # Row positions where the true label / the prediction equals class c
        label_idx = {i for i, x in enumerate(labels) if x == c}
        pred_idx = {i for i, x in enumerate(predictions) if x == c}
        true_positives = len(label_idx & pred_idx)
        precision = true_positives / len(pred_idx) if pred_idx else 0
        recall = true_positives / len(label_idx) if label_idx else 0
        # Cast the key to a plain int so the result stays JSON-serializable
        precision_recall[int(c)] = {"precision": precision, "recall": recall}
    return {"precision_recall_each_class": precision_recall}

### 6. Evaluate Model

In [14]:
if EVALUATE_MODEL:
    result, model_outputs, wrong_predictions = model.eval_model(test_df, acc=accuracy_metric, f1=f1_metrics, precision=precision_metrics, recall=recall_metrics, precision_recall_each_class=precision_recall_each_class)



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
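
The warning above is informational and names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the variable is set before any tokenizer is used and that disabling tokenizer parallelism is acceptable for this workload:

```python
import os

# Set this before the tokenizer is first used (for example in the Packages cell)
# so that the later process fork no longer triggers the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```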


In [15]:
if EVALUATE_MODEL:
    with open(PATH_RESULT_DATA+".json", 'w') as f:
        json.dump(result, f, default=str)
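
Opening the result file fails with `FileNotFoundError` if the target directory (here `../Models/Results`) does not exist yet. A small sketch, assuming a hypothetical helper `save_results` that is not part of the notebook, creates the directory first; `default=str` is kept so that values `json` cannot serialize natively are written as strings:

```python
import json
import os
import tempfile


def save_results(result, path):
    """Write `result` as JSON to `path`, creating parent directories if needed."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, 'w') as f:
        json.dump(result, f, default=str)


# Usage with a made-up result dict and a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "Results", "example.json")
    save_results({"acc": 0.5, "f1": {"f1_macro": 0.4}}, out_path)
    with open(out_path) as f:
        loaded = json.load(f)
print(loaded)  # {'acc': 0.5, 'f1': {'f1_macro': 0.4}}
```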

### 7. Save Evaluated Test Dataframe

In [16]:
if EVALUATE_MODEL:
    # test_df only exists when EVALUATE_MODEL is True; predict on the raw tweet texts
    texts = test_df["text"].tolist()
    predictions, raw_outputs = model.predict(texts)

In [17]:
if EVALUATE_MODEL:
    # Attach the predicted class ids as a new column and write the frame to CSV
    test_df_out = test_df.assign(pred=pd.Series(predictions))
    test_df_out.to_csv(PATH_RESULT_DATA+".csv")
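
`assign(pred=pd.Series(predictions))` aligns the new column on the index, which works here because `read_csv` gives `test_df` a default 0..n-1 index. If the frame were filtered or re-indexed beforehand, the alignment would silently produce `NaN`. The sketch below, with made-up data, shows how this differs from passing the values positionally:

```python
import pandas as pd

# Made-up frame whose index does not start at 0, e.g. after filtering rows.
df = pd.DataFrame({"text": ["a", "b", "c"]}, index=[10, 11, 12])
predictions = [0, 1, 2]

aligned = df.assign(pred=pd.Series(predictions))  # aligns on index labels 0, 1, 2
positional = df.assign(pred=predictions)          # assigns values in row order

print(aligned["pred"].isna().all())   # True
print(positional["pred"].tolist())    # [0, 1, 2]
```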

### 8. Save Model

In [18]:
if SAVE_MODEL:
    model.save_model(MODEL_DIRECTORY_PATH)