## Read me

This notebook shows an example of binary ADR (adverse drug reaction) sentence classification for the SMM4H 2020 Shared Task.

We use the EnRuDR-BERT model from:

https://github.com/cimm-kzn/RuDReC

In this notebook, we use Google's TensorFlow implementation of BERT.





In [None]:
%tensorflow_version 1.x

TensorFlow 1.x selected.


In [None]:
!nvidia-smi

Thu Dec  3 15:41:59 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8    10W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

For a batch size of 32, you need a GPU with 15+ GB of memory. If the runtime was assigned a GPU with less memory, you can reset it by clicking "Factory reset runtime" in the "Runtime" section of the Colab menu.
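
The memory figure can also be checked programmatically by parsing the `Memory-Usage` column of the `nvidia-smi` table. The sketch below is only an illustration: the helper name `total_gpu_memory_mib` and the 15000 MiB cutoff are assumptions chosen to match the "15+ GB" guideline and the Tesla T4 row shown above.

```python
import re


def total_gpu_memory_mib(smi_row: str) -> int:
    """Return the total GPU memory (MiB) found in one nvidia-smi table row.

    Hypothetical helper: it expects the 'used MiB / total MiB' pattern.
    """
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", smi_row)
    if match is None:
        raise ValueError("no 'usedMiB / totalMiB' field found")
    return int(match.group(2))


# Row copied from the nvidia-smi output above.
row = "| N/A   48C    P8    10W /  70W |      0MiB / 15079MiB |      0%      Default |"

MIN_MIB = 15000  # assumed cutoff for a batch size of 32
total = total_gpu_memory_mib(row)
enough = total >= MIN_MIB
print(total, enough)
```

If `enough` is false, resetting the runtime as described above may assign a different GPU.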

### Pulling necessary code

In [None]:
!git init
!git clone https://github.com/google-research/bert.git
!git clone https://github.com/Andoree/med_project.git
!git clone https://github.com/Andoree/smm4h_classification.git
!cp med_project/multilabel_code/bert_preprocessing.py ./bert
!cp med_project/multilabel_code/multilabel_bert.py ./bert
!cp /content/smm4h_classification/training/run_binary_classification.py ./bert
!cp /content/smm4h_classification/training/multilabel_pretraining.py ./bert

Initialized empty Git repository in /content/.git/
Cloning into 'bert'...
remote: Enumerating objects: 340, done.
remote: Total 340 (delta 0), reused 0 (delta 0), pack-reused 340
Receiving objects: 100% (340/340), 317.85 KiB | 17.66 MiB/s, done.
Resolving deltas: 100% (185/185), done.
Cloning into 'med_project'...
remote: Enumerating objects: 395, done.
remote: Counting objects: 100% (395/395), done.
remote: Compressing objects: 100% (253/253), done.
remote: Total 2488 (delta 251), reused 267 (delta 124), pack-reused 2093
Receiving objects: 100% (2488/2488), 21.22 MiB | 20.89 MiB/s, done.
Resolving deltas: 100% (961/961), done.
Cloning into 'smm4h_classification'...
remote: Enumerating objects: 118, done.
remote: Counting objects: 100% (118/118), done.
remote: Compressing objects: 100% (87/87), done.
remote: Total 118 (delta 55), reused 78 (delta 26), pack-reused 0
Receiving objects: 100% (118/118), 2.80 MiB | 22.61 MiB/s, done.
Resolving deltas: 100% (55/

#### Downloading EnRuDR-BERT model

In [None]:
!mkdir bert_models/
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1jCCcafZO6RMl324W8WinXTHBCX86lXnS' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1jCCcafZO6RMl324W8WinXTHBCX86lXnS" -O bert_models/EnRuDR_BERT.tar.gz && rm -rf /tmp/cookies.txt 
!tar -xvf bert_models/EnRuDR_BERT.tar.gz -C bert_models
!ls bert_models/multilingual_eng_rus_reviews_finetuned

--2020-12-03 15:42:05--  https://docs.google.com/uc?export=download&confirm=isLe&id=1jCCcafZO6RMl324W8WinXTHBCX86lXnS
Resolving docs.google.com (docs.google.com)... 172.217.12.238, 2607:f8b0:4004:809::200e
Connecting to docs.google.com (docs.google.com)|172.217.12.238|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-0g-2k-docs.googleusercontent.com/docs/securesc/6j9t4lec85dbv73nrnltij8454k62ajr/7u9krb1qn5bmorc1ia24gnrivar0mstn/1607010075000/06930042168325031160/14649094590169478743Z/1jCCcafZO6RMl324W8WinXTHBCX86lXnS?e=download [following]
--2020-12-03 15:42:05--  https://doc-0g-2k-docs.googleusercontent.com/docs/securesc/6j9t4lec85dbv73nrnltij8454k62ajr/7u9krb1qn5bmorc1ia24gnrivar0mstn/1607010075000/06930042168325031160/14649094590169478743Z/1jCCcafZO6RMl324W8WinXTHBCX86lXnS?e=download
Resolving doc-0g-2k-docs.googleusercontent.com (doc-0g-2k-docs.googleusercontent.com)... 172.217.15.97, 2607:f8b0:4004:811::2001
Connecting to doc-0g

#### Downloading the annotated part of the RuDReC corpus and splitting it into sentences

In [None]:
!mkdir -p data/rudrec_annotated
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1plApL6qmdHtNNP3OXgJQEmo7Lfp6MVeO' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1plApL6qmdHtNNP3OXgJQEmo7Lfp6MVeO" -O data/rudrec_annotated/rudrec.zip && rm -rf /tmp/cookies.txt

--2020-12-03 15:43:40--  https://docs.google.com/uc?export=download&confirm=&id=1plApL6qmdHtNNP3OXgJQEmo7Lfp6MVeO
Resolving docs.google.com (docs.google.com)... 172.217.12.238, 2607:f8b0:4004:809::200e
Connecting to docs.google.com (docs.google.com)|172.217.12.238|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-00-7o-docs.googleusercontent.com/docs/securesc/fi17g1ceqnjp8e2ecekcq9a0b9eeakj8/ulblg4pthv8lfuo63o6b66b893r3cfue/1607010225000/06930042168325031160/06409174565933036453Z/1plApL6qmdHtNNP3OXgJQEmo7Lfp6MVeO?e=download [following]
--2020-12-03 15:43:57--  https://doc-00-7o-docs.googleusercontent.com/docs/securesc/fi17g1ceqnjp8e2ecekcq9a0b9eeakj8/ulblg4pthv8lfuo63o6b66b893r3cfue/1607010225000/06930042168325031160/06409174565933036453Z/1plApL6qmdHtNNP3OXgJQEmo7Lfp6MVeO?e=download
Resolving doc-00-7o-docs.googleusercontent.com (doc-00-7o-docs.googleusercontent.com)... 172.217.15.97, 2607:f8b0:4004:811::2001
Connecting to doc-00-7o-

In [None]:
!unzip -q data/rudrec_annotated/rudrec.zip -d data/rudrec_annotated/
!rm data/rudrec_annotated/rudrec.zip

#### Downloading PsyTAR sentences

In [None]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1zBFgdsiWlvb3LsNp0s65MpDYyBGC-jih' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1zBFgdsiWlvb3LsNp0s65MpDYyBGC-jih" -O data/psytar_csvs.zip && rm -rf /tmp/cookies.txt

--2020-12-03 15:44:00--  https://docs.google.com/uc?export=download&confirm=&id=1zBFgdsiWlvb3LsNp0s65MpDYyBGC-jih
Resolving docs.google.com (docs.google.com)... 172.217.12.238, 2607:f8b0:4004:807::200e
Connecting to docs.google.com (docs.google.com)|172.217.12.238|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-00-bo-docs.googleusercontent.com/docs/securesc/79ninmtchkemtrvudqdfal1nepfgiij2/7dvjhk744osokg22upaepeo9uetpq70k/1607010225000/03082508243073582503/17384356599220833547Z/1zBFgdsiWlvb3LsNp0s65MpDYyBGC-jih?e=download [following]
--2020-12-03 15:44:00--  https://doc-00-bo-docs.googleusercontent.com/docs/securesc/79ninmtchkemtrvudqdfal1nepfgiij2/7dvjhk744osokg22upaepeo9uetpq70k/1607010225000/03082508243073582503/17384356599220833547Z/1zBFgdsiWlvb3LsNp0s65MpDYyBGC-jih?e=download
Resolving doc-00-bo-docs.googleusercontent.com (doc-00-bo-docs.googleusercontent.com)... 172.217.15.97, 2607:f8b0:4004:811::2001
Connecting to doc-00-bo-

In [None]:
!unzip -q data/psytar_csvs.zip -d data/
!rm data/psytar_csvs.zip

The "otzovik_reviews_formatting.py" script tokenizes reviews into sentences and splits the data into train, dev, and test sets.

**n_splits=1** produces a single split of the data into training, validation, and test sets. **n_splits > 1** produces cross-validation splits.
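
To make the two modes concrete, here is a minimal sketch of how such a splitter could behave. It is not the code of "otzovik_reviews_formatting.py": the function name `make_splits`, the 80/10/10 proportions, and the way the dev part is carved out in cross-validation mode are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Split = Tuple[List[str], List[str], List[str]]  # (train, dev, test)


def make_splits(items: Sequence[str], n_splits: int, seed: int = 0) -> List[Split]:
    """Return one (train, dev, test) split, or n_splits cross-validation splits.

    Assumed behaviour, not taken from the original script.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    if n_splits == 1:
        # Simple holdout: 80% train, 10% dev, remainder test (assumed ratios).
        n_train = int(0.8 * len(shuffled))
        n_dev = int(0.1 * len(shuffled))
        return [(shuffled[:n_train],
                 shuffled[n_train:n_train + n_dev],
                 shuffled[n_train + n_dev:])]

    # Cross-validation: every fold is used exactly once as the test set.
    folds = [shuffled[i::n_splits] for i in range(n_splits)]
    splits = []
    for i in range(n_splits):
        test = folds[i]
        rest = [s for j, fold in enumerate(folds) if j != i for s in fold]
        n_dev = max(1, len(rest) // 10)  # assumed: 10% of the rest is dev
        splits.append((rest[n_dev:], rest[:n_dev], test))
    return splits


sentences = [f"sentence {i}" for i in range(100)]
single = make_splits(sentences, n_splits=1)
cross_val = make_splits(sentences, n_splits=5)
print(len(single), len(cross_val))
```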

In [None]:
%cd /content/med_project/bert_multilabel/formatting/
!python otzovik_reviews_formatting.py \
--reviews_dir=/content/data/rudrec_annotated/annotation \
--output_dir=/content/data/rudrec_annotated/sentences \
--n_splits=1
!ls /content/data/rudrec_annotated/sentences

/content/med_project/bert_multilabel/formatting
dev.csv  test.csv  train.csv


In [None]:
!mv /content/data/rudrec_annotated/sentences /content/data/rudrec

In [None]:
!pip install emoji

Collecting emoji
  Downloading https://files.pythonhosted.org/packages/ff/1c/1f1457fe52d0b30cbeebfd578483cedb3e3619108d2d5a21380dfecf8ffd/emoji-0.6.0.tar.gz (51kB)
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-0.6.0-cp36-none-any.whl size=49716 sha256=d8da26a2c8fcacf35dade71d39f5d94d1a3f085a4ee0c509fe29fac7c9ab1221
  Stored in directory: /root/.cache/pip/wheels/46/2c/8b/9dcf5216ca68e14e0320e283692dce8ae321cdc01e73e17796
Successfully built emoji
Installing collected packages: emoji
Successfully installed emoji-0.6.0


Merging PsyTAR and RuDReC sentences.

In [None]:
%cd /content/
!ls smm4h_classification/preprocessing/
!python smm4h_classification/preprocessing/create_merged_reviews_corpus.py \
--psytar_dir /content/data/psytar_csvs/ \
--rudrec_dir /content/data/rudrec/ \
--output_dir /content/data/combined_psytar_rudrec

/content
create_merged_reviews_corpus.py  preprocessing_parameters.py
__init__.py			 preprocessing_utils.py
merge_tweets_sets.py		 preprocess_tweet_texts.py


## Tweets preprocessing

During preprocessing, we mask user mentions and URLs, replace emojis with words, and fix HTML-escaped ampersands ("&amp;").
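
The steps listed above can be sketched with the standard library alone. This is not the code of "preprocess_tweet_texts.py": the mask tokens `@USER` and `HTTPURL` are assumed placeholders, and emojis are replaced by their Unicode names via `unicodedata` as a stand-in for the `emoji` package installed earlier.

```python
import html
import re
import unicodedata

USER_MASK = "@USER"   # assumed placeholder for user mentions
URL_MASK = "HTTPURL"  # assumed placeholder for URLs


def preprocess_tweet(text: str) -> str:
    """Mask mentions and URLs, spell out emojis, and unescape HTML entities."""
    text = html.unescape(text)                      # "&amp;" -> "&"
    text = re.sub(r"https?://\S+", URL_MASK, text)  # mask URLs first
    text = re.sub(r"@\w+", USER_MASK, text)         # then mask user mentions

    pieces = []
    for ch in text:
        # Category "So" (Symbol, other) covers emojis, but also signs like
        # the copyright sign, so this is broader than a real emoji lookup.
        name = unicodedata.name(ch, "") if unicodedata.category(ch) == "So" else ""
        pieces.append(f" {name.lower()} " if name else ch)

    # Collapse the extra whitespace introduced around emoji names.
    return re.sub(r"\s+", " ", "".join(pieces)).strip()


example = "@john see https://t.co/x1 pills &amp; pain \U0001F600"
cleaned = preprocess_tweet(example)
print(cleaned)
```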

In [None]:
!python /content/smm4h_classification/preprocessing/preprocess_tweet_texts.py  --input_dir /content/smm4h_classification/data/raw/ru/ --lang ru --output_dir /content/smm4h_classification/data/preprocessed/ru/

In [None]:
!python /content/smm4h_classification/preprocessing/preprocess_tweet_texts.py  --input_dir /content/smm4h_classification/data/raw/en/ --lang en --output_dir /content/smm4h_classification/data/preprocessed/en/

## Multilabel classification pretraining

In [None]:
%cd /content/bert/
!python multilabel_pretraining.py \
--corpus_dir=/content/data/combined_psytar_rudrec \
--bert_vocab=/content/bert_models/multilingual_eng_rus_reviews_finetuned/vocab.txt \
--bert_checkpoint=/content/bert_models/multilingual_eng_rus_reviews_finetuned/bert_model.ckpt \
--bert_config=/content/bert_models/multilingual_eng_rus_reviews_finetuned/bert_config.json \
--batch_size=32 \
--epochs=5 \
--warmup_proportion=0.1 \
--max_seq_length=128 \
--learning_rate=2e-5 \
--save_summary_steps=500 \
--text_column=sentences \
--num_labels=5 \
--threshold=0.5 \
--output_dir=/content/pretrained_model/ \
--prediction_filename=predicted_labels.csv

/content/bert






INFO:tensorflow:***** Running training *****
INFO:tensorflow:  Num examples = 8088
INFO:tensorflow:  Batch size = 32
INFO:tensorflow:  Num steps = 1263

Beginning Training!
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 252 or save_checkpoints_secs None.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on fun

## Training

In [None]:
!ls /content/pretrained_model/
!mkdir /content/classification_results

checkpoint
eval
eval_results.txt
eval.tf_record
events.out.tfevents.1607010281.51ffb976f9f6
graph.pbtxt
model.ckpt-1263.data-00000-of-00001
model.ckpt-1263.index
model.ckpt-1263.meta
predicted_labels.csv
train.tf_record


In [None]:
%cd /content
!python smm4h_classification/preprocessing/merge_tweets_sets.py \
--input_files smm4h_classification/data/preprocessed/ru/train.tsv smm4h_classification/data/preprocessed/en/train.tsv \
--output_path smm4h_classification/data/preprocessed/ruen/train.tsv
!cp smm4h_classification/data/preprocessed/ru/dev.tsv smm4h_classification/data/preprocessed/ruen/dev.tsv
!cp smm4h_classification/data/preprocessed/ru/test.tsv smm4h_classification/data/preprocessed/ruen/test.tsv

/content
smm4h_classification/data/preprocessed/ru/train.tsv
(6090, 2)
smm4h_classification/data/preprocessed/en/train.tsv
(20544, 2)
Result shape (26634, 2)


In [None]:
%cd /content/bert
!python run_binary_classification.py \
--task_name=cola \
--do_lower_case=false \
--do_predict=true \
--do_train=true \
--do_eval=true \
--data_dir=/content/smm4h_classification/data/preprocessed/ruen \
--vocab_file=/content/bert_models/multilingual_eng_rus_reviews_finetuned/vocab.txt \
--bert_config_file=/content/bert_models/multilingual_eng_rus_reviews_finetuned/bert_config.json \
--init_checkpoint=/content/pretrained_model/model.ckpt-1263 \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=3e-5 \
--num_train_epochs=5 \
--save_checkpoints_steps=1000 \
--output_dir=/content/classification_results/1

Streaming output truncated to the last 5000 lines.
INFO:tensorflow:global_step/sec: 1.06272
I1203 18:22:52.560968 140409631090560 tpu_estimator.py:2307] global_step/sec: 1.06272
INFO:tensorflow:examples/sec: 34.0071
I1203 18:22:52.561199 140409631090560 tpu_estimator.py:2308] examples/sec: 34.0071
INFO:tensorflow:global_step/sec: 1.06325
I1203 18:22:53.501524 140409631090560 tpu_estimator.py:2307] global_step/sec: 1.06325
INFO:tensorflow:examples/sec: 34.024
I1203 18:22:53.501772 140409631090560 tpu_estimator.py:2308] examples/sec: 34.024
INFO:tensorflow:global_step/sec: 1.06642
I1203 18:22:54.439234 140409631090560 tpu_estimator.py:2307] global_step/sec: 1.06642
INFO:tensorflow:examples/sec: 34.1256
I1203 18:22:54.439508 140409631090560 tpu_estimator.py:2308] examples/sec: 34.1256
INFO:tensorflow:global_step/sec: 1.06753
I1203 18:22:55.375972 140409631090560 tpu_estimator.py:2307] global_step/sec: 1.06753
INFO:tensorflow:examples/sec: 34.1608
I1203 18:22:55.376255 140409

In [None]:
%cd /content
!mkdir -p /content/classification_results/dev_probs
!mv /content/classification_results/1/eval_results.txt  /content/classification_results/dev_probs/eval_results_1.txt

/content


## Evaluation

The parameters of the script below are:

**labels_path** is the path where the predicted labels are saved. Labels are derived from the predicted probabilities using the **threshold** parameter.

**output_path** is the path to the file with the quality assessment results. It contains the precision, recall, and F1 scores, which match those printed in the output cell.
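
As an illustration of how a threshold turns probabilities into labels and how the three scores follow from them, consider the sketch below. The function names and the toy data are hypothetical, and treating a probability equal to the threshold as positive (`>=`) is an assumption; the evaluation script may differ.

```python
from typing import List, Tuple


def apply_threshold(probs: List[float], threshold: float = 0.5) -> List[int]:
    """Label an example 1 when its probability reaches the threshold (assumed >=)."""
    return [1 if p >= threshold else 0 for p in probs]


def precision_recall_f1(true: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data, not taken from the notebook.
probs = [0.91, 0.40, 0.65, 0.10, 0.55]
true_labels = [1, 1, 0, 0, 1]

pred_labels = apply_threshold(probs, threshold=0.5)
precision, recall, f1 = precision_recall_f1(true_labels, pred_labels)
print(pred_labels, precision, recall, f1)
```

For reference, the scores printed in the output cell are consistent with 70 true positives, 94 false positives, and 63 false negatives for class 1 (70/164 ≈ 0.427 and 70/133 ≈ 0.526). These counts are inferred from the printed values and the support of 133; the script does not report them.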

In [None]:
%cd /content/smm4h_classification/
!python evaluation/evaluate_classification.py \
--true_labels_path=data/preprocessed/ru/dev.tsv \
--predicted_labels_path=/content/classification_results/dev_probs/eval_results_1.txt \
--threshold=0.5 \
--labels_path=/content/classification_results/labels/dev_labels.tsv \
--output_path=/content/classification_results/classification_statistics/eval_results_1.txt

/content/smm4h_classification
              precision    recall  f1-score   support

           0       0.95      0.93      0.94      1389
           1       0.43      0.53      0.47       133

    accuracy                           0.90      1522
   macro avg       0.69      0.73      0.71      1522
weighted avg       0.91      0.90      0.90      1522

Precision 0.4268292682926829
Recall 0.5263157894736842
F-score 0.47138047138047134


## Ensembling
Ensembling requires at least two trained models. To train them, repeatedly run all the cells from the Training section up to, but not including, this section. Run this section only after all models have been trained.

Note: To train a new model, you need to change the output paths in the training section. For example, you can make the following changes to the script parameters:

content/classification_results/0 -> content/classification_results/1

eval_results_0.txt -> eval_results_1.txt
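
The renaming pattern above can be generated per run instead of edited by hand. The sketch below builds the paths for a given run index; `run_paths` is a hypothetical helper, and the directory layout simply mirrors the paths used in the training cells of this notebook.

```python
from typing import Dict


def run_paths(run_id: int, root: str = "/content/classification_results") -> Dict[str, str]:
    """Return the per-run paths to pass to the training and `mv` cells."""
    return {
        # Value for --output_dir of run_binary_classification.py
        "output_dir": f"{root}/{run_id}",
        # Where the training script is expected to leave its results
        "eval_results_src": f"{root}/{run_id}/eval_results.txt",
        # Where the results are collected for ensembling
        "eval_results_dst": f"{root}/dev_probs/eval_results_{run_id}.txt",
    }


for run_id in (1, 2):
    print(run_paths(run_id))
```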

**predicted_probs_dir** is the directory of predicted probabilities.

**data_tsv** is the path to the original dataset, to which the final labels will be added. The dataset may lack true labels; in that case, omit the **calculate_metrics** flag.
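
One common way to combine several models is to average their predicted probabilities per example and then apply the threshold. The sketch below shows that scheme only as an illustration: whether "ensemble_results.py" averages probabilities or combines them differently is an assumption, and `ensemble_labels` and the toy values are hypothetical.

```python
from typing import List


def ensemble_labels(model_probs: List[List[float]], threshold: float = 0.5) -> List[int]:
    """Average per-example probabilities over models, then apply the threshold.

    Assumed scheme (mean of probabilities); not confirmed by the notebook.
    """
    if not model_probs:
        raise ValueError("at least one model is required")
    n_examples = len(model_probs[0])
    if any(len(probs) != n_examples for probs in model_probs):
        raise ValueError("all models must score the same examples")

    labels = []
    for example_probs in zip(*model_probs):
        mean_prob = sum(example_probs) / len(example_probs)
        labels.append(1 if mean_prob >= threshold else 0)
    return labels


# Toy probabilities for three examples from two runs (not real outputs).
run_1 = [0.9, 0.4, 0.2]
run_2 = [0.7, 0.7, 0.1]
labels = ensemble_labels([run_1, run_2], threshold=0.5)
print(labels)
```

In this toy case the second example is labeled positive only because the second run outweighs the first, which is the kind of disagreement ensembling is meant to smooth out.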

In [None]:
%cd /content/smm4h_classification/
!git pull
!python evaluation/ensemble_results.py \
--predicted_probs_dir=/content/classification_results/dev_probs \
--data_tsv=data/preprocessed/ru/dev.tsv \
--calculate_metrics \
--threshold=0.5 \
--output_path=/content/classification_results/labels/dev_labels_ensemble.tsv

/content/smm4h_classification
Already up to date.
              precision    recall  f1-score   support

           0       0.96      0.92      0.94      1389
           1       0.40      0.58      0.47       133

    accuracy                           0.89      1522
   macro avg       0.68      0.75      0.70      1522
weighted avg       0.91      0.89      0.90      1522

Precision 0.39896373056994816
Recall 0.5789473684210527
F-score 0.4723926380368098
