# Fine-Tuning SciBERT for Named Entity Extraction

This notebook was exported from Google Colab and needs to be run in Colab with the GPU enabled. Instructions on how to do this can be found in the markdown. The model's full directory and files can be found in this Google Drive folder: https://drive.google.com/drive/folders/1R_9Z34b96PVv_xqsbuAj3iZYh6ASRZDJ?usp=sharing

___The script below walks through the steps taken to train the BERT models:___
- download the dependencies
- convert the annotated data into binary spaCy files
- activate parallel processing on the GPU (which enables faster computation)
- download the spaCy library, PyTorch, and the transformer model (here, SciBERT)
- train the model using the annotated discussions as training data and the annotated conclusions as test data
- evaluate the model on an unseen abstract from a psychedelic research paper

### Dependencies: Install spaCy

In [45]:
pip install -U spacy



In [3]:
cd /content/drive/MyDrive/Psychedelic_KGs/

/content/drive/MyDrive/Psychedelic_KGs


### Convert tsv files into json files 
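
The `-c iob` converter used in this section expects token-per-line input. As a rough illustration of that layout, the sketch below groups a few tab-separated token/tag lines into entity spans in plain Python. The sample tokens and the `parse_iob_lines` helper are hypothetical and are not taken from the annotated data; the labels reuse names that appear in the model output later in this notebook.

```python
from typing import List, Optional, Tuple

# Hypothetical sample in token-per-line IOB format: one "token<TAB>tag" pair per line.
SAMPLE_TSV = (
    "psilocybin\tB-DRUG\n"
    "reduced\tO\n"
    "treatment\tB-HEALTH\n"
    "resistant\tI-HEALTH\n"
    "depression\tI-HEALTH\n"
    ".\tO\n"
)


def parse_iob_lines(tsv_text: str) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label: Optional[str] = None

    def flush() -> None:
        # Close the entity currently being built, if there is one.
        nonlocal tokens, label
        if tokens and label is not None:
            entities.append((" ".join(tokens), label))
        tokens, label = [], None

    for line in tsv_text.splitlines():
        if not line.strip():
            flush()  # a blank line ends any open entity
            continue
        token, tag = line.split("\t")
        if tag.startswith("B-"):
            flush()
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == label:
            tokens.append(token)
        else:
            flush()  # "O" tag or an inconsistent I- tag
    flush()
    return entities


print(parse_iob_lines(SAMPLE_TSV))
```

spaCy's converter does considerably more than this (tokenisation metadata, document and sentence boundaries); the sketch only shows how the B-/I-/O tags map onto labelled spans.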

In [30]:
!python -m spacy convert Data/Discussions.tsv ./Data/ -t json -n 10 -c iob

ℹ Auto-detected token-per-line NER format
⚠ Document delimiters found, automatic document segmentation with `-n`
disabled.
✔ Generated output file (1 documents): Data/Discussions.json


In [31]:
!python -m spacy convert Data/Conclusions.tsv ./Data/ -t json -n 10 -c iob

ℹ Auto-detected token-per-line NER format
⚠ Document delimiters found, automatic document segmentation with `-n`
disabled.
⚠ No sentence boundaries found. Use `-s` to automatically segment
sentences.
✔ Generated output file (1 documents): Data/Conclusions.json


### Convert json files into binary spacy files

In [32]:
!python -m spacy convert Data/Discussions.json ./Data/ -t spacy

✔ Generated output file (34 documents): Data/Discussions.spacy


In [33]:
!python -m spacy convert Data/Conclusions.json ./Data/ -t spacy

✔ Generated output file (30 documents): Data/Conclusions.spacy


## Activate parallel processing using cuda

In [None]:
!wget https://developer.nvidia.com/compute/cuda/9.2/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64 -O cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!apt-key add /var/cuda-repo-9-2-local/7fa2af80.pub
!apt-get update
!apt-get install cuda-9.2

In [15]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Wed_Apr_11_23:16:29_CDT_2018
Cuda compilation tools, release 9.2, V9.2.88
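
Because the PyTorch and CuPy packages installed in the following cells are pinned to CUDA 9.2, it can help to check the reported toolkit release programmatically rather than by eye. A minimal sketch, assuming the `nvcc --version` output has the shape shown above; `parse_cuda_release` is a hypothetical helper name.

```python
import re
from typing import Optional


def parse_cuda_release(nvcc_output: str) -> Optional[str]:
    """Return the CUDA release (e.g. '9.2') found in `nvcc --version` text, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


# Output copied from the cell above.
NVCC_OUTPUT = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Wed_Apr_11_23:16:29_CDT_2018
Cuda compilation tools, release 9.2, V9.2.88
"""

release = parse_cuda_release(NVCC_OUTPUT)
print(release)
```

In Colab, the text could instead come from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`, so that a mismatch with the `cu92` wheels is caught before installing them.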


### Dependencies: Download the spaCy pipeline, PyTorch, and the transformer models

In [None]:
!python -m spacy download en_core_web_trf

In [17]:
!pip install torch==1.7.1+cu92 torchvision==0.8.2+cu92 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.7.1+cu92
  Downloading https://download.pytorch.org/whl/cu92/torch-1.7.1%2Bcu92-cp37-cp37m-linux_x86_64.whl (577.3 MB)
Collecting torchvision==0.8.2+cu92
  Downloading https://download.pytorch.org/whl/cu92/torchvision-0.8.2%2Bcu92-cp37-cp37m-linux_x86_64.whl (12.5 MB)
Collecting torchaudio==0.7.2
  Downloading torchaudio-0.7.2-cp37-cp37m-manylinux1_x86_64.whl (7.6 MB)
Installing collected packages: torch, torchvision, torchaudio
  Attempting uninstall: torch
    Found existing installation: torch 1.10.0+cu111
    Uninstalling torch-1.10.0+cu111:
      Successfully uninstalled torch-1.10.0+cu111
  Attempting uninstall: torchvision
    Found existing installation: torchvision 0.11.1+cu111
    Uninstalling to

### Dependencies: Install cuda

In [18]:
!pip install -U spacy[cuda92,transformers]
# Each `!` line runs in its own subshell, so `!export` would not persist.
# Set the variables on the notebook process so later `!` commands inherit them.
import os
os.environ["CUDA_PATH"] = "/usr/local/cuda-9.2"
os.environ["LD_LIBRARY_PATH"] = os.environ["CUDA_PATH"] + "/lib64:" + os.environ.get("LD_LIBRARY_PATH", "")
!pip install cupy

Collecting cupy-cuda92<10.0.0,>=5.0.0b4
  Downloading cupy_cuda92-9.6.0-cp37-cp37m-manylinux1_x86_64.whl (55.0 MB)
Installing collected packages: cupy-cuda92
Successfully installed cupy-cuda92-9.6.0
Collecting cupy
  Downloading cupy-10.0.0.tar.gz (1.7 MB)
  Downloading cupy-9.6.0.tar.gz (1.6 MB)
Building wheels for collected packages: cupy
  Building wheel for cupy (setup.py) ... done
  Created wheel for cupy: filename=cupy-9.6.0-cp37-cp37m-linux_x86_64.whl size=53780857 sha256=886ca73d6af63ea05ca61bdd29a2e4211814bef3107954eb7a9f797d81d00754
  Stored in directory: /root/.cache/pip/wheels/57/44/0b/5c540f032d681b9c7bcafad447177f7e356cba004e48df1d2a
Successfully b

## Merge the model Parameters for SciBERT 

Here, we define the parameters we want to train our model with, using spaCy's training config file. This lets us set the paths to the training and development datasets, select the Hugging Face model we want to use, and override any of spaCy's default parameters. The resulting config file can be found in the Data folder under the name 'config_spacy.cfg'.
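
For orientation, the settings mentioned above look roughly like the excerpt embedded in the sketch below. This is an illustration, not the project's actual `base_config.cfg`: the section and key names follow spaCy's transformer config layout, the model name is the one reported in the training log, and the two paths are assumed from the `.spacy` files generated earlier. Because the format is INI-like, the standard-library `configparser` is enough to read simple string values for a quick check.

```python
import configparser

# Hypothetical excerpt of a spaCy training config (values are assumptions, see above).
CONFIG_EXCERPT = """
[paths]
train = "Data/Discussions.spacy"
dev = "Data/Conclusions.spacy"

[nlp]
lang = "en"
pipeline = ["transformer","ner"]

[components.transformer.model]
name = "allenai/scibert_scivocab_uncased"
"""

# Disable interpolation so spaCy-style "${...}" references would not be touched.
parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
parser.read_string(CONFIG_EXCERPT)


def setting(section: str, key: str) -> str:
    """Return a config value with its surrounding double quotes removed."""
    return parser[section][key].strip('"')


print(setting("components.transformer.model", "name"))
print(setting("paths", "train"), setting("paths", "dev"))
```

For anything beyond a quick look, spaCy's own `spacy.util.load_config` is the safer choice, since it understands the full config format, including variable interpolation.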

In [36]:
!python -m spacy init fill-config /content/drive/MyDrive/Psychedelic_KGs/Data/base_config.cfg ./Data/config_spacy.cfg

✔ Auto-filled config with all values
✔ Saved config
Data/config_spacy.cfg
You can now add your data and train your pipeline:
python -m spacy train config_spacy.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


___Debug the config file before training___

In [37]:
!python -m spacy debug data Data/config_spacy.cfg

Some weights of the model checkpoint at allenai/scibert_scivocab_uncased were not used when initializing BertModel: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
✔ Pipeline can be initialized with data
✔ Corpus is loadable
Language: e

The `debug data` report warns that we are low on training data, which we already knew given the limited amount of annotated text.

## Train the Model

Finally, with the files in the correct format, all the dependencies installed, and the config file ready, we can train the model on the GPU. This takes around 4.5 hours with parallel processing activated.

In [41]:
!python -m spacy train Data/config_spacy.cfg --output ./ --gpu-id 0

ℹ Saving to output directory: .
ℹ Using GPU: 0
[2022-01-02 11:20:35,832] [INFO] Set up nlp object from config
[2022-01-02 11:20:35,847] [INFO] Pipeline: ['transformer', 'ner']
[2022-01-02 11:20:35,853] [INFO] Created vocabulary
[2022-01-02 11:20:35,855] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at allenai/scibert_scivocab_uncased were not used when initializing BertModel: ['cls.predictions.decoder.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This I

## Evaluate the model qualitatively 

We know that the model achieves an F1 score of around 64%, but we also want to know what it is missing in the text. To see it in action, we test it on the abstract of a highly cited research paper, which is data the model has not seen before.

The tested abstract is from 'The Entropic Brain Revisited' doi = 10.1016/j.neuropharm.2018.03.010 

#### Load the best model saved in the Google Drive

In [51]:
import spacy 
nlp = spacy.load('/content/drive/MyDrive/Psychedelic_KGs/model-best')

In [54]:
# Test on text scraped from 'The Entropic Brain Revisited' doi = 10.1016/j.neuropharm.2018.03.010
text = ['Research into the basic effects and therapeutic applications of psychedelic drugs has grown considerably in recent years. Yet, pressing questions remain regarding the substances’ lasting effects. Although individual studies have begun monitoring sustained changes, no study to-date has synthesized this information. Therefore, this systematic review aims to fill this important gap in the literature by synthesizing results from 34 contemporary experimental studies which included classic psychedelics, human subjects, and follow-up latencies of at least two weeks. The bulk of this work was published in the last five years, with psilocybin being the most frequently administered drug. Enduring changes in personality/attitudes, depression, spirituality, anxiety, wellbeing, substance misuse, meditative practices, and mindfulness were documented. Mystical experiences, connectedness, emotional breakthrough, and increased neural entropy were related to these long-term changes in psychological functioning. Finally, with proper screening, preparation, supervision, and integration, limited aversive side effects were noted by study participants. Future researchers should focus on including larger and more diverse samples, lengthier longitudinal designs, stronger control conditions, and standardized dosages.']

#### Run the text through the model's pipeline

We run the text through the model's pipeline and ask it to extract the entity word and entity label assigned to it. 

In [55]:
for doc in nlp.pipe(text, disable=["tagger", "parser"]):
    print([(ent.text, ent.label_) for ent in doc.ents])

[('basic effects', 'OUTCOME'), ('therapeutic applications', 'OUTCOME'), ('psychedelic drugs', 'DRUG'), ('lasting effects', 'OUTCOME'), ('psychedelics', 'DRUG'), ('psilocybin', 'DRUG'), ('Enduring changes in personality/attitudes', 'OUTCOME'), ('depression', 'HEALTH'), ('spirituality', 'SUBJECTIVE'), ('anxiety', 'HEALTH'), ('mindfulness', 'SUBJECTIVE'), ('Mystical experiences', 'SUBJECTIVE'), ('connectedness', 'SUBJECTIVE'), ('emotional breakthrough', 'OUTCOME'), ('increased neural entropy', 'OUTCOME'), ('long-term changes in psychological functioning', 'OUTCOME'), ('limited aversive side effects', 'OUTCOME')]


The model seems capable of capturing many of our entities. Eyeballing the results, we see that it missed 'sustained changes', and list items such as 'wellbeing' and 'substance misuse' are also absent from the output. This is interesting because it suggests that the model has generalised reasonably well but lacks data, as terms like these would likely be picked up with more training examples. Nevertheless, from this one paragraph the model captured 17 entities without misclassifying any.
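
To make the eyeball check a little more systematic, the extracted entities can be tallied per label and compared against a checklist of phrases from the abstract. The sketch below does this in plain Python on the printed output above. The `candidates` list is a hypothetical checklist chosen for illustration; whether each phrase should count as an entity depends on the annotation scheme.

```python
from collections import Counter

# Entities copied from the printed model output above.
predicted = [
    ("basic effects", "OUTCOME"), ("therapeutic applications", "OUTCOME"),
    ("psychedelic drugs", "DRUG"), ("lasting effects", "OUTCOME"),
    ("psychedelics", "DRUG"), ("psilocybin", "DRUG"),
    ("Enduring changes in personality/attitudes", "OUTCOME"),
    ("depression", "HEALTH"), ("spirituality", "SUBJECTIVE"),
    ("anxiety", "HEALTH"), ("mindfulness", "SUBJECTIVE"),
    ("Mystical experiences", "SUBJECTIVE"), ("connectedness", "SUBJECTIVE"),
    ("emotional breakthrough", "OUTCOME"), ("increased neural entropy", "OUTCOME"),
    ("long-term changes in psychological functioning", "OUTCOME"),
    ("limited aversive side effects", "OUTCOME"),
]

# How many entities were extracted in total and per label.
label_counts = Counter(label for _, label in predicted)
print(len(predicted), dict(label_counts))

# Hypothetical checklist of phrases that occur in the abstract (an assumption).
candidates = ["sustained changes", "wellbeing", "substance misuse", "meditative practices"]
captured = {text.lower() for text, _ in predicted}
missed = [phrase for phrase in candidates if phrase.lower() not in captured]
print(missed)
```

The comparison is an exact, case-insensitive string match, so a phrase captured only as part of a longer span would still be reported as missed.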


## Results

The best model achieved an F1 score of 64.5%, with a precision of 67.4% and a recall of 61.8%. It performed best on the 'Drug' category, followed by 'Health', which is unsurprising as these are quite clearly defined categories. It did not capture antidepressant drugs at all (F1 = 0) and struggled especially with neural correlates, which is also unsurprising, as these are often abbreviated or joined in complex ways.
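
As a quick sanity check on these figures, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. A minimal sketch; `f1_score` here is a small helper written for illustration, not a library call.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the best model.
best_f1 = f1_score(0.674, 0.618)
print(f"{best_f1:.1%}")  # prints 64.5%
```

The zero guard covers the case where a category has no correct predictions at all, as with the antidepressant entities.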

For more details on the model's performance, see the meta.json file in the 'data' folder.