# Model Training

In [1]:
import pandas as pd

from training.model import Model

### 1. Creating model

In [2]:
model = Model(model_path="nlptown/bert-base-multilingual-uncased-sentiment")

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at nlptown/bert-base-multilingual-uncased-sentiment.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


### 2. Loading dataset

In [3]:
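# Load the preprocessed COVID records (JSON Lines: one record per line)
data_json = pd.read_json("data/preprocessed/covid", orient="records", lines=True)
# Tokenize the records and build the train and test datasets
tf_train, tf_test = model.prepare_train_test_data(data_json)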



INFO:tensorflow:Assets written to: ram://a92f5256-a82c-4254-823b-f3b289069d7a/assets




Map:   0%|          | 0/8160 [00:00<?, ? examples/s]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


INFO:tensorflow:Assets written to: ram://ce175290-3638-4562-ab35-b9000fd6658e/assets




Map:   0%|          | 0/2041 [00:00<?, ? examples/s]
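
The two progress bars above report 8160 and 2041 examples, which is consistent with an 80/20 train/test split of 10201 records. How `prepare_train_test_data` performs the split is not shown in this notebook. As a rough sketch of the loading and splitting step only, assuming a shuffled 80/20 split and hypothetical column names `text` and `label`, it could look like this:

```python
import io
import json
import math

import pandas as pd


def split_train_test(df: pd.DataFrame, test_frac: float = 0.2, seed: int = 42):
    """Shuffle the rows, then split them into train and test frames.

    Illustrative helper only; not the notebook's actual implementation.
    """
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    # Round the test size up, so 10201 rows would give 2041 test rows
    n_test = math.ceil(len(shuffled) * test_frac)
    test_df = shuffled.iloc[:n_test]
    train_df = shuffled.iloc[n_test:]
    return train_df, test_df


# Hypothetical JSON Lines input: one JSON record per line, which is the
# layout that orient="records", lines=True expects
records = [{"text": f"claim {i}", "label": i % 2} for i in range(10)]
buffer = io.StringIO("\n".join(json.dumps(r) for r in records))

df = pd.read_json(buffer, orient="records", lines=True)
train_df, test_df = split_train_test(df)
print(len(train_df), len(test_df))
```

Fixing the seed keeps the split reproducible between runs, which matters when the same test set is reused for the separate evaluation step.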

### 3. Training

In [4]:
# Compile the model, then fine-tune for one epoch, validating on the test split
model.compile()
model.fit(train_data=tf_train, epochs=1, validation_data=tf_test)



<keras.callbacks.History at 0x20b966c67f0>

### 4. Model saving

In [5]:
COVID_MODEL_PATH = "model/covid/english_v1.h5py"

In [6]:
model.save_model(COVID_MODEL_PATH)
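
Whether `save_model` creates missing directories is not shown here. A defensive sketch, assuming it does not, is to create the parent directory before saving. The helper name `ensure_parent_dir` is hypothetical, and the example runs inside a temporary directory so it leaves nothing behind:

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(path) -> Path:
    """Create the parent directory of `path` if needed and return the path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


with tempfile.TemporaryDirectory() as tmp:
    # Same relative layout as COVID_MODEL_PATH, rooted in a scratch directory
    target = ensure_parent_dir(Path(tmp) / "model" / "covid" / "english_v1.h5py")
    print(target.parent.is_dir())
```

In the notebook this would be called as `ensure_parent_dir(COVID_MODEL_PATH)` right before `model.save_model(COVID_MODEL_PATH)`.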

### 5. Separate evaluation

In [7]:
results = model.evaluate(dataset=tf_test)
test_loss = results[0]
test_accuracy = results[1]

print(f"Test loss: {test_loss}, Test accuracy: {test_accuracy}")

Test loss: 0.07972569018602371, Test accuracy: 0.976870059967041
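
For reference, the accuracy reported above is presumably the usual classification accuracy: the fraction of test examples whose highest-probability class matches the true label. The metric configured inside `model.compile()` is not shown, so the following is only a sketch of that definition on made-up values:

```python
import numpy as np


def accuracy(y_true, y_pred_probs) -> float:
    """Fraction of rows whose arg-max class equals the true label."""
    predicted = np.argmax(np.asarray(y_pred_probs), axis=1)
    return float(np.mean(predicted == np.asarray(y_true)))


# Made-up labels and class probabilities for four examples
y_true = [0, 1, 0, 0]
y_pred_probs = [
    [0.9, 0.1],  # predicted 0, correct
    [0.2, 0.8],  # predicted 1, correct
    [0.4, 0.6],  # predicted 1, wrong
    [0.7, 0.3],  # predicted 0, correct
]

print(accuracy(y_true, y_pred_probs))  # 0.75
```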


### 6. Testing

In [13]:
# Sample claims; the trailing comment on each line is the expected class
text = "WhatsApp censors the messages that circulate on its platform if it believes that they are hoaxes with the help of the media that verify false information in Spain."  # expected class: 0
# text = "Coronavirus: New Covid-19 tracing tool appears on smartphones"  # expected class: 1
# text = "Good oral hygiene destroys the coronavirus and prevents its spread"  # expected class: 0
# text = "Post about a video claims that it is a protest against confination in the town of Aranda de Duero (Burgos)"  # expected class: 0

prediction = model.classify_text(text=text)

Predicted class: 0
Probability distribution: [9.9570507e-01 4.1160374e-03 7.8641082e-05 3.8738748e-05 6.1628692e-05]
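
The predicted class above matches the index of the largest entry in the probability distribution. As a small check, assuming `classify_text` selects the class by arg-max (its implementation is not shown here), the printed values can be reproduced with NumPy:

```python
import numpy as np

# Probability distribution copied from the output above
probs = np.array([9.9570507e-01, 4.1160374e-03, 7.8641082e-05,
                  3.8738748e-05, 6.1628692e-05])

predicted_class = int(np.argmax(probs))     # index of the most likely class
confidence = float(probs[predicted_class])  # probability assigned to it

print(f"Predicted class: {predicted_class} (p = {confidence:.4f})")
```

The distribution has five entries because the `nlptown/bert-base-multilingual-uncased-sentiment` checkpoint has a five-class head (1 to 5 stars), so after fine-tuning on labels 0 and 1 nearly all of the probability mass sits on the first two entries.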
