# Notebook: Train Model

This notebook fine-tunes a classification model on a dataset of tweets. Results of the training are saved in a CSV file.
<br>**Contributors:** [Nils Hellwig](https://github.com/NilsHellwig/) | [Markus Bink](https://github.com/MarkusBink/)

## Packages

In [17]:
from simpletransformers.classification import ClassificationModel
from sklearn.metrics import f1_score, accuracy_score
import pandas as pd
import numpy as np
import random
import os

## Parameters

In [18]:
TRAIN_DATASET_PATH = "../Datasets/k_fold_splits/TRAIN_TEST_0/train.csv"
TEST_DATASET_PATH = "../Datasets/k_fold_splits/TRAIN_TEST_0/test.csv"
N_TRAIN_EPOCHS = 4
TRAIN_BATCH_SIZE = 32
TEST_BATCH_SIZE = 32
USE_CUDA = False
SEED_VALUE = 0

MODEL_TYPE = "bert"
MODEL_NAME = "deepset/gbert-base"
MODEL_DIRECTORY_PATH = "output"

N_LABELS = 2

## Code

### 1. Get Reproducible Results

In [19]:
os.environ['PYTHONHASHSEED'] = str(SEED_VALUE)
random.seed(SEED_VALUE)
np.random.seed(SEED_VALUE)
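The seeding step above can be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: `set_seed` is a hypothetical name, and it only covers the standard-library generator. Note that setting `PYTHONHASHSEED` inside a running interpreter does not change hashing in that process; the value is only picked up by processes started afterwards.

```python
import os
import random


def set_seed(seed: int) -> None:
    """Seed the standard-library RNG and export PYTHONHASHSEED (hypothetical helper)."""
    # Only affects processes started later; the current interpreter
    # read this variable at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)


set_seed(0)
first_run = [random.random() for _ in range(3)]

set_seed(0)
second_run = [random.random() for _ in range(3)]

# Reseeding with the same value reproduces the same sequence.
print(first_run == second_run)
```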

### 2. Load Dataframes

In [20]:
train_df = pd.read_csv(TRAIN_DATASET_PATH, encoding="utf-8")[["tweet","sentiment_label"]]
test_df = pd.read_csv(TEST_DATASET_PATH, encoding="utf-8")[["tweet","sentiment_label"]]

Replace label strings with numbers

In [21]:
train_df['sentiment_label'] = train_df['sentiment_label'].replace({'Negative': 1, 'Positive': 0, 'Neutral': 2})
test_df['sentiment_label'] = test_df['sentiment_label'].replace({'Negative': 1, 'Positive': 0, 'Neutral': 2})
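The mapping above defines three label ids, while `N_LABELS` is set to 2. The sketch below shows one way to check that every label is known and that every id fits the configured number of classes. It works on plain lists rather than the DataFrame, and `LABEL_TO_ID` and `encode_labels` are hypothetical names introduced for illustration only.

```python
LABEL_TO_ID = {"Negative": 1, "Positive": 0, "Neutral": 2}  # mapping from the notebook
N_LABELS = 2  # value from the Parameters section


def encode_labels(labels, mapping, n_labels):
    """Map label strings to ids, failing loudly on unknown or out-of-range labels."""
    unknown = sorted(set(labels) - set(mapping))
    if unknown:
        raise ValueError(f"Unmapped labels: {unknown}")
    ids = [mapping[label] for label in labels]
    too_large = sorted({i for i in ids if i >= n_labels})
    if too_large:
        raise ValueError(f"Label ids {too_large} do not fit num_labels={n_labels}")
    return ids


print(encode_labels(["Negative", "Positive", "Negative"], LABEL_TO_ID, N_LABELS))
```

With this check, a stray `Neutral` row would raise an error before it reaches a two-class model.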

In [22]:
train_df

| | tweet | sentiment_label |
|---|---|---|
| 0 | @JuliaMaiano @EskenSaskia @NowaboFM @spdde @Ol... | 1 |
| 1 | @Sarayatennis @_FriedrichMerz @CDU Das ist ja ... | 0 |
| 2 | @LeBoomio @theNeo42 @InRi5555 @n_roettgen @CDU... | 1 |
| 3 | @MickyBeisenherz @n_roettgen @_FriedrichMerz @... | 0 |
| 4 | @MGrosseBroemer @JM_Luczak @jensspahn @cducsub... | 1 |
| ... | ... | ... |
| 1594 | Hallo @TwitterSupport @Ralf_Stegner hat recht... | 1 |
| 1595 | @SteveundJulian @minimalist_h @fdp @fdp MORGEN... | 0 |
| 1596 | @AlexWFotografie @Paul67M @JoanaCotar Wer verh... | 1 |
| 1597 | @MarcoBuschmann Ich habe mir gerade den Koalit... | 1 |


In [23]:
train_df.sentiment_label.value_counts(), test_df.sentiment_label.value_counts()

(0    803
 1    796
 Name: sentiment_label, dtype: int64,
 1    204
 0    196
 Name: sentiment_label, dtype: int64)
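The class counts above give a reference point for later evaluation: a classifier that always predicts the most frequent test label already reaches a certain accuracy. A minimal sketch of that majority-class baseline, rebuilding the test labels from the printed counts (the variable names are illustrative, not from the notebook):

```python
from collections import Counter

# Test-set label counts copied from the output above: 204 x label 1, 196 x label 0.
test_labels = [1] * 204 + [0] * 196

counts = Counter(test_labels)
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(test_labels)

print(majority_label, baseline_accuracy)
```

Any trained model should clearly exceed this baseline to be considered useful.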

### 3. Create Model

In [24]:
training_args = {
    "fp16": False,
    "num_train_epochs": N_TRAIN_EPOCHS,
    "overwrite_output_dir": True,
    "train_batch_size": TRAIN_BATCH_SIZE,
    "eval_batch_size": TEST_BATCH_SIZE,
    "manual_seed": SEED_VALUE,
    "reprocess_input_data": True,
}

In [25]:
model = ClassificationModel(
    model_type=MODEL_TYPE,
    model_name=MODEL_NAME,
    num_labels=N_LABELS,
    args=training_args,
    use_cuda=USE_CUDA,
)

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint a

### 4. Train Model

In [26]:
# Training is disabled in this run; uncomment the next line to fine-tune the model.
#model.train_model(train_df)

### 5. Define Metrics

In [None]:
accuracy_metric = accuracy_score
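Additional metrics are plain functions. To my understanding of the Simple Transformers documentation, `eval_model` accepts them as keyword arguments and calls each one with the true labels first and the predictions second, which matches the argument order of scikit-learn metrics such as the imported `f1_score`. The sketch below is a dependency-free stand-in with that same signature; `binary_f1` is a hypothetical name and the toy data is not taken from the notebook.

```python
def binary_f1(labels, preds, positive=1):
    """F1 score for one class, written without external dependencies."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data for illustration only.
print(binary_f1([1, 0, 1, 1], [1, 1, 0, 1]))
```

Under that assumption, such a function could be passed next to accuracy, for example as `f1=binary_f1`.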

### 6. Evaluate Model

In [27]:
result, model_outputs, wrong_predictions = model.eval_model(test_df, acc=accuracy_metric)




huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
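The warning above names its own remedy. A minimal sketch, assuming the variable is set before the tokenizer is first used (for example at the top of the Packages cell):

```python
import os

# Must run before the tokenizer is first used, otherwise the warning still appears.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```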



In [28]:
result

{'mcc': 0.0481014623999032,
 'tp': 118,
 'tn': 92,
 'fp': 104,
 'fn': 86,
 'auroc': 0.5450430172068828,
 'auprc': 0.5486292049753088,
 'eval_loss': 0.6913179296713609}
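The confusion-matrix counts in this result are enough to derive the usual summary metrics by hand. The sketch below recomputes accuracy, precision, recall, F1 and MCC from the four counts using the standard formulas. It is an illustration, not the library's internal code.

```python
import math

# Confusion-matrix counts copied from the result above.
tp, tn, fp, fn = 118, 92, 104, 86

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(
    f"accuracy={accuracy:.3f} precision={precision:.3f} "
    f"recall={recall:.3f} f1={f1:.3f} mcc={mcc:.4f}"
)
```

The recomputed MCC agrees with the reported `mcc` value. An accuracy of 0.525 on a roughly balanced binary test set is close to chance, which is consistent with the `train_model` call being commented out in this run.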

### 7. Save Model

In [29]:
model.save_model(MODEL_DIRECTORY_PATH)