# Model Training

In [1]:
import pandas as pd

from training.model import Model

### 1. Creating model

In [5]:
model = Model(model_path="cl-tohoku/bert-base-japanese")

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### 2. Loading dataset

In [8]:
data_json = pd.read_json("data/preprocessed/japanese", orient="records", lines=True)
tf_train, tf_test = model.prepare_train_test_data(data_json)

INFO:root:Start of preprocessing ...
INFO:root:Hashtags preprocessing ...
INFO:root:Emojis preprocessing ...
INFO:root:Analyzing sentiment ...
INFO:root:Saving preprocessed data ...


INFO:tensorflow:Assets written to: ram://7116b24a-a981-4ae5-b7b7-6a2bddaec59a/assets


Map:   0%|          | 0/1599 [00:00<?, ? examples/s]



INFO:tensorflow:Assets written to: ram://58f9c442-2dd2-4944-8ae7-f9afdc84de4b/assets


Map:   0%|          | 0/400 [00:00<?, ? examples/s]
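
The sketch below illustrates the input format that `pd.read_json(..., orient="records", lines=True)` expects (one JSON object per line) and one simple way to make a train/test split. The column names `text` and `label`, the sample rows, and the split logic are assumptions for illustration; the notebook does not show what `prepare_train_test_data` does internally. The progress bars above (1599 and 400 examples) are consistent with a roughly 80/20 split, which is the ratio used here.

```python
import io

import pandas as pd

# Hypothetical records: the real file's column names and contents are not shown.
json_lines = "\n".join([
    '{"text": "sample tweet one", "label": 0}',
    '{"text": "sample tweet two", "label": 1}',
    '{"text": "sample tweet three", "label": 0}',
    '{"text": "sample tweet four", "label": 2}',
    '{"text": "sample tweet five", "label": 1}',
])

# lines=True tells pandas to parse one JSON object per line (JSON Lines).
df = pd.read_json(io.StringIO(json_lines), orient="records", lines=True)

# Assumed 80/20 split: sample 80% of the rows for training, keep the rest for testing.
train_df = df.sample(frac=0.8, random_state=42)
test_df = df.drop(train_df.index)

print(f"train rows: {len(train_df)}, test rows: {len(test_df)}")
```

Fixing `random_state` keeps the split reproducible between runs, which makes the evaluation numbers comparable across training sessions.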

### 3. Training

In [9]:
model.compile()
model.fit(train_data=tf_train, epochs=3, validation_data=tf_test)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x2bea1ede4f0>

### 4. Model saving

In [10]:
COVID_MODEL_PATH = "model/japanese/cl-tohoku-bert-base-japanese"

In [11]:
model.save_model(COVID_MODEL_PATH)

### 5. Separate evaluation

In [None]:
# evaluate() returns the loss first, then the accuracy.
results = model.evaluate(dataset=tf_test)
test_loss = results[0]
test_accuracy = results[1]

print(f"Test loss: {test_loss}, Test accuracy: {test_accuracy}")

Test loss: 0.07972569018602371, Test accuracy: 0.976870059967041
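
To make the two reported numbers concrete, here is a minimal sketch of how a mean cross-entropy loss and an accuracy can be computed from per-example class probabilities. It assumes the model is compiled with a sparse categorical cross-entropy loss and an accuracy metric, which the notebook does not show; the helper name `evaluate_predictions` and the toy data are hypothetical.

```python
import math


def evaluate_predictions(probabilities, labels):
    """Return (mean cross-entropy loss, accuracy) for per-example class probabilities."""
    total_loss = 0.0
    correct = 0
    for probs, label in zip(probabilities, labels):
        # Cross-entropy for one example: negative log-probability of the true class.
        total_loss += -math.log(probs[label])
        # The predicted class is the index with the highest probability.
        predicted = max(range(len(probs)), key=probs.__getitem__)
        correct += int(predicted == label)
    n = len(labels)
    return total_loss / n, correct / n


# Toy data (hypothetical): three examples, two classes.
probabilities = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
]
labels = [0, 1, 1]

test_loss, test_accuracy = evaluate_predictions(probabilities, labels)
print(f"Test loss: {test_loss:.4f}, Test accuracy: {test_accuracy:.4f}")
```

A low loss together with a high accuracy, as in the output above, indicates that the model is both usually right and confident when it is right.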


### 6. Testing

In [None]:
# Sample inputs; the number after each line is the expected class label.
text = "WhatsApp censors the messages that circulate on its platform if it believes that they are hoaxes with the help of the media that verify false information in Spain."  # expected: 0
# text = "Coronavirus: New Covid-19 tracing tool appears on smartphones"  # expected: 1
# text = "Good oral hygiene destroys the coronavirus and prevents its spread"  # expected: 0
# text = "Post about a video claims that it is a protest against confination in the town of Aranda de Duero (Burgos)"  # expected: 0

prediction = model.classify_text(text=text)

Predicted class: 0
Probability distribution: [9.9570507e-01 4.1160374e-03 7.8641082e-05 3.8738748e-05 6.1628692e-05]
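
The predicted class is the index of the largest value in the probability distribution. The sketch below reproduces that step from the printed numbers; it assumes `classify_text` takes an argmax over the five class probabilities, which matches the output above but is not confirmed by the notebook.

```python
# Probability distribution copied from the cell output above.
probabilities = [9.9570507e-01, 4.1160374e-03, 7.8641082e-05, 3.8738748e-05, 6.1628692e-05]

# Argmax: the index of the largest probability is the predicted class.
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
confidence = probabilities[predicted_class]

print(f"Predicted class: {predicted_class} (confidence {confidence:.4f})")
# prints "Predicted class: 0 (confidence 0.9957)"
```

The five values sum to approximately 1, as expected for a probability distribution over five classes.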
