# Notebook: Train Model

This notebook trains a sentiment classification model on a dataset of tweets. The results of the training are saved in a CSV file.
<br>**Contributors:** [Nils Hellwig](https://github.com/NilsHellwig/) | [Markus Bink](https://github.com/MarkusBink/)

## Packages

In [29]:
from simpletransformers.classification import ClassificationModel
import pandas as pd
import numpy as np

## Parameters

In [30]:
TRAIN_DATASET_PATH = "../Datasets/k_fold_splits/TRAIN_TEST_0/train.csv"
TEST_DATASET_PATH = "../Datasets/k_fold_splits/TRAIN_TEST_0/test.csv"
N_TRAIN_EPOCHS = 4
TRAIN_BATCH_SIZE = 32
TEST_BATCH_SIZE = 32
USE_CUDA = False

MODEL_TYPE = "bert"
MODEL_NAME = "deepset/gbert-base"
MODEL_DIRECTORY_PATH = "output"

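# Number of output classes. Note: the label mapping used in this notebook defines three ids (0, 1, 2).
N_LABELS = 4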
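The dataset paths above point at fold `TRAIN_TEST_0`. A minimal sketch of deriving the paths for a given fold, assuming the other folds follow the same `TRAIN_TEST_<k>` naming (an assumption; only fold 0 appears in this notebook). The helper `fold_paths` and the constant `SPLITS_ROOT` are hypothetical names:

```python
from pathlib import Path

# Root folder of the k-fold splits, taken from the paths used in this notebook.
SPLITS_ROOT = Path("../Datasets/k_fold_splits")


def fold_paths(fold_index, root=SPLITS_ROOT):
    """Return (train_path, test_path) for one fold.

    Assumes every fold lives in a directory named TRAIN_TEST_<fold_index>.
    """
    fold_dir = root / f"TRAIN_TEST_{fold_index}"
    return fold_dir / "train.csv", fold_dir / "test.csv"


train_path, test_path = fold_paths(0)
print(train_path.as_posix())
print(test_path.as_posix())
```

Building the paths from a fold index keeps the train and test files of one fold from being mixed up when the notebook is rerun for another fold.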

## Code

### 1. Load Dataframes

In [31]:
train_df = pd.read_csv(TRAIN_DATASET_PATH, encoding="utf-8")
test_df = pd.read_csv(TEST_DATASET_PATH, encoding="utf-8")

Replace the label strings with integer class ids:

In [32]:
# Shared mapping so that train and test labels are encoded identically
LABEL_TO_ID = {'Negative': 1, 'Positive': 0, 'Neutral': 2}
train_df['sentiment_label'] = train_df['sentiment_label'].replace(LABEL_TO_ID)
test_df['sentiment_label'] = test_df['sentiment_label'].replace(LABEL_TO_ID)
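With `replace`, a label string that is missing from the mapping is left unchanged and would reach the model as a string. A minimal sketch of a stricter encoding, using `Series.map` so that unmapped values become `NaN` and can be reported. The helper `encode_labels` and the toy DataFrame are hypothetical and only illustrate the idea:

```python
import pandas as pd

# Same mapping as in the cell above.
LABEL_TO_ID = {"Negative": 1, "Positive": 0, "Neutral": 2}


def encode_labels(df, column="sentiment_label"):
    """Return a copy of df with label strings replaced by integer ids.

    Raises ValueError if any value in the column is not in LABEL_TO_ID.
    """
    encoded = df[column].map(LABEL_TO_ID)  # unmapped values become NaN
    unmapped = df.loc[encoded.isna(), column].unique().tolist()
    if unmapped:
        raise ValueError(f"Unmapped labels: {unmapped}")
    out = df.copy()
    out[column] = encoded.astype(int)
    return out


# Toy data standing in for the real tweet DataFrame.
toy_df = pd.DataFrame(
    {
        "tweet": ["a", "b", "c"],
        "sentiment_label": ["Negative", "Positive", "Neutral"],
    }
)
encoded_df = encode_labels(toy_df)
print(encoded_df["sentiment_label"].tolist())
```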

In [33]:
train_df

| Unnamed: 0 | id | source_account | tweet | sentiment_label |
|---|---|---|---|---|
| 0 | 1345000784115552000 | larsklingbeil | @JuliaMaiano @EskenSaskia @NowaboFM @spdde @Ol... | 1 |
| 1 | 1345311824439348992 | CDU | @Sarayatennis @_FriedrichMerz @CDU Das ist ja ... | 0 |
| 2 | 1345330989275508992 | CDU | @LeBoomio @theNeo42 @InRi5555 @n_roettgen @CDU... | 1 |
| 3 | 1345336738328284928 | CDU | @MickyBeisenherz @n_roettgen @_FriedrichMerz @... | 0 |
| 4 | 1345432731321306880 | cducsubt | @MGrosseBroemer @JM_Luczak @jensspahn @cducsub... | 1 |
| ... | ... | ... | ... | ... |
| 1594 | 1475937148230586112 | Ralf_Stegner | Hallo @TwitterSupport @Ralf_Stegner hat recht... | 1 |
| 1595 | 1476584052987777024 | fdp | @SteveundJulian @minimalist_h @fdp @fdp MORGEN... | 0 |
| 1596 | 1476608087280786944 | JoanaCotar | @AlexWFotografie @Paul67M @JoanaCotar Wer verh... | 1 |
| 1597 | 1476841285873020928 | MarcoBuschmann | @MarcoBuschmann Ich habe mir gerade den Koalit... | 1 |


In [34]:
train_df.sentiment_label.value_counts(), test_df.sentiment_label.value_counts()

(0    803
 1    796
 Name: sentiment_label, dtype: int64,
 1    204
 0    196
 Name: sentiment_label, dtype: int64)
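The counts above show only the ids 0 and 1 in this fold, in nearly equal numbers. As a reference point for later evaluation, the following sketch computes the accuracy of a trivial classifier that always predicts the most frequent training label. The counts are copied from the output above; the variable names are illustrative:

```python
# Label counts copied from the value_counts() output above.
train_counts = {0: 803, 1: 796}
test_counts = {1: 204, 0: 196}

# The label that occurs most often in the training split.
majority_label = max(train_counts, key=train_counts.get)

# Accuracy on the test split when always predicting that label.
test_total = sum(test_counts.values())
baseline_accuracy = test_counts[majority_label] / test_total

print(majority_label, test_total, baseline_accuracy)
```

A trained model should clearly exceed this baseline of 0.49 on the test split before its results are taken as meaningful.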

### 2. Create Model

In [35]:
training_args = {
    "fp16": False,
    "num_train_epochs": N_TRAIN_EPOCHS,
    "overwrite_output_dir": True,
    "train_batch_size": TRAIN_BATCH_SIZE,
    "eval_batch_size": TEST_BATCH_SIZE,
}
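To get a feel for what these settings imply, the sketch below estimates the number of batches per epoch from the split sizes shown earlier (1599 training and 400 test rows). It assumes a single device, no gradient accumulation, and that the final partial batch is kept; these are assumptions, not settings confirmed by this notebook:

```python
import math

# Split sizes taken from the value counts above (803 + 796 and 204 + 196).
N_TRAIN_SAMPLES = 1599
N_TEST_SAMPLES = 400

# Values from the Parameters section.
N_TRAIN_EPOCHS = 4
TRAIN_BATCH_SIZE = 32
TEST_BATCH_SIZE = 32

# The last batch of an epoch may be smaller than the batch size, hence ceil.
steps_per_epoch = math.ceil(N_TRAIN_SAMPLES / TRAIN_BATCH_SIZE)
total_train_steps = steps_per_epoch * N_TRAIN_EPOCHS
eval_batches = math.ceil(N_TEST_SAMPLES / TEST_BATCH_SIZE)

print(steps_per_epoch, total_train_steps, eval_batches)
```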

In [36]:
model = ClassificationModel(
    model_type=MODEL_TYPE,
    model_name=MODEL_NAME,
    num_labels=N_LABELS,
    args=training_args,
    use_cuda=USE_CUDA,
)

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint a

### 3. Train Model

In [37]:
# The training call is commented out in this run; uncomment it to fine-tune the model.
# model.train_model(train_df)
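The Simple Transformers documentation describes the training DataFrame as needing a `text` column and a `labels` column when it has a header, whereas `train_df` here carries `tweet` and `sentiment_label` alongside several other columns. A minimal sketch of preparing the frame before training, under that assumption. The helper `to_training_frame` and the toy data are hypothetical:

```python
import pandas as pd


def to_training_frame(df, text_column="tweet", label_column="sentiment_label"):
    """Keep only the text and label columns and rename them to
    'text' and 'labels' (the column names assumed to be expected
    by Simple Transformers)."""
    return (
        df[[text_column, label_column]]
        .rename(columns={text_column: "text", label_column: "labels"})
        .reset_index(drop=True)
    )


# Toy data with the same column layout as train_df in this notebook.
toy_df = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1],
        "id": [1, 2],
        "source_account": ["CDU", "fdp"],
        "tweet": ["first tweet", "second tweet"],
        "sentiment_label": [1, 0],
    }
)
prepared_df = to_training_frame(toy_df)
print(list(prepared_df.columns))

# With the real data, the prepared frame would be passed to training:
# model.train_model(to_training_frame(train_df))
```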

### 4. Save Model

In [38]:
model.save_model(MODEL_DIRECTORY_PATH)