### Notebook created on 2 August for biobert-pytorch NER model results on the HunFlair corpora
### Created by Salma

#### Working directory
The working directory is on "Alvis" 


/cephyr/NOBACKUP/groups/snic2021-23-312/Salma-files/NLP_project/NER_200/biobert-pytorch/named-entity-recognition


#### Repositories

The code comes from this repository:

https://github.com/dmis-lab/biobert-pytorch


The model comes from:

https://huggingface.co/dmis-lab/biobert-large-cased-v1.1/tree/main







#### bash script used for fine-tuning the model


In [None]:
##########################################################################################

#!/usr/bin/env bash

#SBATCH -A SNIC2021-7-54 -p alvis
#SBATCH -N 1 --gpus-per-node=V100:2  # Launch 1 node with 2 Nvidia V100 GPUs
#SBATCH -t 40:10:00


#cd named-entity-recognition


## Choose dataset and run 
export DATA_DIR=./ner_inputs
export ENTITY=HunFlair_NER_gene/gene_all_combined

python run_ner.py \
    --data_dir ${DATA_DIR}/${ENTITY} \
    --labels ${DATA_DIR}/${ENTITY}/labels.txt \
    --model_name_or_path dmis-lab/biobert-large-cased-v1.1 \
    --output_dir output/${ENTITY} \
    --max_seq_length 128 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 32 \
    --save_steps 1000 \
    --seed 1 \
    --do_train \
    --do_eval \
    --do_predict 
 
##############################################################################################

#### Size of corpora 

In [5]:
import pandas as pd

data = {'train_dev': [676622,3462521,648468,3360188,1044897],

        'devel': [98357,487065,89269,474491,150534],

        'test': [279826,14964407,270441,1407706,393821]}
df = pd.DataFrame(data,columns=['train_dev',  'devel',  'test'],index = ['cell_line','chemical','disease','gene','species'])

df

| Dataset | train_dev | devel | test |
|---|---|---|---|
| cell_line | 676622 | 98357 | 279826 |
| chemical | 3462521 | 487065 | 14964407 |
| disease | 648468 | 89269 | 270441 |
| gene | 3360188 | 474491 | 1407706 |
| species | 1044897 | 150534 | 393821 |
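
As a quick sanity check on the sizes above, the ratio of test size to train_dev size can be computed per dataset. The following is a minimal sketch in plain Python that reuses the numbers from the cell above. The helper name `ratio_test_to_train` is an illustrative choice, and because the unit of the sizes is not stated in this notebook, the ratios are only meaningful relative to each other.

```python
# Corpus sizes copied from the cell above (the unit is not stated in the notebook).
sizes = {
    "cell_line": {"train_dev": 676622, "devel": 98357, "test": 279826},
    "chemical": {"train_dev": 3462521, "devel": 487065, "test": 14964407},
    "disease": {"train_dev": 648468, "devel": 89269, "test": 270441},
    "gene": {"train_dev": 3360188, "devel": 474491, "test": 1407706},
    "species": {"train_dev": 1044897, "devel": 150534, "test": 393821},
}


def ratio_test_to_train(split_sizes):
    """Return the test size divided by the train_dev size for one dataset."""
    return split_sizes["test"] / split_sizes["train_dev"]


ratios = {name: ratio_test_to_train(s) for name, s in sizes.items()}

for name, ratio in ratios.items():
    print(f"{name:10s} test/train_dev = {ratio:.2f}")
```

With the numbers as given, chemical is the only dataset whose test value is larger than its train_dev value, which may be worth double-checking against the source files.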


#### Datasets
There are five datasets: cell_line, chemical, disease, gene, and species.


The datasets are in CoNLL-2003 (IOB) format and are stored in the directory
/cephyr/NOBACKUP/groups/snic2021-23-312/Salma-files/NLP_project/NER_200/biobert-pytorch/ner_inputs
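
For readers unfamiliar with the layout, here is a minimal sketch of reading a CoNLL-style IOB file into sentences. It assumes one token per line with the label in the last whitespace-separated column and a blank line between sentences. The exact column layout and label names of the files in ner_inputs are not shown in this notebook, so the sample text and the plain `B`/`I`/`O` labels below are invented for illustration, as is the function name `read_conll`.

```python
def read_conll(lines):
    """Yield (tokens, labels) pairs, one pair per sentence."""
    tokens, labels = [], []
    for raw in lines:
        line = raw.strip()
        # A blank line (or a -DOCSTART- marker) closes the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if tokens:
                yield tokens, labels
                tokens, labels = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # first column: the token
        labels.append(parts[-1])  # last column: the IOB label
    if tokens:  # the file may not end with a blank line
        yield tokens, labels


# Invented sample, not taken from the HunFlair corpora.
sample = """\
Expression O
of O
tumor B
necrosis I
factor I
. O

No O
entity O
here O
"""

sentences = list(read_conll(sample.splitlines()))
for tokens, labels in sentences:
    print(list(zip(tokens, labels)))
```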

The datasets were taken from 
https://github.com/Aitslab/BioNLP/tree/master/Salma/HUNER_DATASET

ner_inputs.

Marcus later created new datasets. The conversion code is also in the ner_inputs directory, and the previous data was moved to the previous_data directory.




#### Environment

To run the code on GPU, I had to create the conda environment as follows.


ml Anaconda3/5.3.0

conda create -n myenv

conda activate myenv




The following modules must be loaded (ml) for GPU use.

We load only modules 1, 7, and 8 explicitly. The other modules are loaded as dependencies.

  1) Anaconda3/5.3.0   3) zlib/1.2.11     5) GCC/10.2.0            7) cuDNN/8.2.1.32-CUDA-11.3.1
  
  2) GCCcore/10.2.0    4) binutils/2.35   6) iccifort/2020.4.304   8) CUDA/10.1.243


pip3 install torch==1.7.0 torchvision==0.8.1 -f https://download.pytorch.org/whl/cu102/torch_stable.html
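
After the installation, it can help to confirm from Python that PyTorch was built with CUDA support and can see the GPUs before submitting a long job. The following is a minimal sketch. The function name `describe_gpu_setup` is illustrative, and the values it reports depend entirely on the node where it is run.

```python
def describe_gpu_setup():
    """Collect basic facts about the PyTorch/CUDA setup in a dictionary."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


info = describe_gpu_setup()
for key, value in info.items():
    print(f"{key}: {value}")
```

On a login node without GPUs, `cuda_available` is expected to be `False` even when the environment is set up correctly, so the check is most informative inside an allocated job.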

#### Problems

With the large model, the prediction results for three of the datasets (gene, cell_line, and species) were unusable: the loss values were far too high and all scores were zero.

Marcus confirmed this problem, and it will be reported on the GitHub page of the BioBERT model.






#### Results

For the three affected datasets (gene, cell_line, and species), I changed the initial model to the large model already fine-tuned on chemical and on disease.
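
A sketch of how such a run could be set up: the command below mirrors the fine-tuning script shown earlier and changes only the `--model_name_or_path` value. It assumes that `run_ner.py` accepts a local checkpoint directory for that flag, as Hugging Face `from_pretrained` does. The checkpoint path is hypothetical, and the notebook does not record the exact command that was used.

```python
import shlex


def build_run_ner_command(data_dir, entity, model_name_or_path):
    """Build the run_ner.py argument list used in the fine-tuning script."""
    entity_dir = f"{data_dir}/{entity}"
    return [
        "python", "run_ner.py",
        "--data_dir", entity_dir,
        "--labels", f"{entity_dir}/labels.txt",
        "--model_name_or_path", model_name_or_path,
        "--output_dir", f"output/{entity}",
        "--max_seq_length", "128",
        "--num_train_epochs", "5",
        "--per_device_train_batch_size", "32",
        "--save_steps", "1000",
        "--seed", "1",
        "--do_train", "--do_eval", "--do_predict",
    ]


# Hypothetical location of the model fine-tuned on chemical; adjust to the real path.
chemical_checkpoint = "output/HunFlair_NER_chemical/chemical_all_combined"

command = build_run_ner_command(
    "./ner_inputs", "HunFlair_NER_gene/gene_all_combined", chemical_checkpoint
)
print(shlex.join(command))
```

The printed line can be pasted into the batch script, or the list can be passed to `subprocess.run(command, check=True)`.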



In [3]:
import pandas as pd


'''
cell_line fine-tuned on:

chemical
eval_loss = 0.03225823636723142
eval_precision = 0.5979628520071899
eval_recall = 0.6348600508905853
eval_f1 = 0.6158593026843566

disease
eval_loss = 0.04601614683546491
eval_precision = 0.618562874251497
eval_recall = 0.6571246819338422
eval_f1 = 0.6372609500308453


base model
eval_loss = 0.0564503005160991
eval_precision = 0.6189078097475044
eval_recall = 0.6704834605597965
eval_f1 = 0.6436641221374046
    




large model from Huggingface:
chemical
eval_loss = 0.08976771759454548
eval_precision = 0.8740117716713683
eval_recall = 0.8842982487050284
eval_f1 = 0.87912492117985

disease
eval_loss = 0.1167252083387655
eval_precision = 0.7965159773431656
eval_recall = 0.8377922661870504
eval_f1 = 0.8166328822659289

gene fine-tuned on:
chemical
eval_loss = 0.07465716853512087
eval_precision = 0.7478534919897372
eval_recall = 0.7851613040066177
eval_f1 = 0.7660534321130725

disease
eval_loss = 0.09484741590608804
eval_precision = 0.7523252992559042
eval_recall = 0.7890528905289053
eval_f1 = 0.7702515267570644

base_model
eval_loss = 0.06820519475748847
eval_precision = 0.7431238307526967
eval_recall = 0.7919697965936326
eval_f1 = 0.7667696857063648


species fine-tuned on:
chemical
eval_loss = 0.02506276776638715
eval_precision = 0.7932584269662921
eval_recall = 0.7665580890336591
eval_f1 = 0.7796797349530645
disease
eval_loss = 0.03272246095173788
eval_precision = 0.787292817679558
eval_recall = 0.7736156351791531
eval_f1 = 0.78039430449069

'''







# Values rounded from the raw evaluation output listed in the string above.
data = {
    'Loss':      [0.032, 0.046, 0.056, 0.090, 0.117, 0.075, 0.095, 0.068, 0.025, 0.033],
    'Precision': [0.59796, 0.61856, 0.61891, 0.87401, 0.79652, 0.74785, 0.75233, 0.74312, 0.79326, 0.78729],
    'Recall':    [0.63486, 0.65712, 0.67048, 0.884298, 0.83779, 0.78516, 0.78905, 0.79197, 0.766558, 0.77362],
    'F1-score':  [0.615859, 0.63726, 0.64366, 0.87912, 0.81663, 0.76605, 0.77025, 0.76677, 0.77968, 0.78039],
    'Initial_model': ['Large model/FT on chemical', 'Large model/FT on disease', 'base model/Huggingface',
                      'Large model/Huggingface', 'Large model/Huggingface',
                      'Large model/FT on chemical', 'Large model/FT on disease', 'base model/Huggingface',
                      'Large model/FT on chemical', 'Large model/FT on disease'],
}

# One row per (dataset, initial model) combination.
index = ['cell_line', 'cell_line', 'cell_line', 'chemical', 'disease',
         'gene', 'gene', 'gene', 'species', 'species']
df = pd.DataFrame(data, columns=['Loss', 'Precision', 'Recall', 'F1-score', 'Initial_model'], index=index)

df

| Dataset | Loss | Precision | Recall | F1-score | Initial_model |
|---|---|---|---|---|---|
| cell_line | 0.032 | 0.59796 | 0.63486 | 0.615859 | Large model/FT on chemical |
| cell_line | 0.046 | 0.61856 | 0.65712 | 0.63726 | Large model/FT on disease |
| cell_line | 0.056 | 0.61891 | 0.67048 | 0.64366 | base model/Huggingface |
| chemical | 0.090 | 0.87401 | 0.884298 | 0.87912 | Large model/Huggingface |
| disease | 0.117 | 0.79652 | 0.83779 | 0.81663 | Large model/Huggingface |
| gene | 0.075 | 0.74785 | 0.78516 | 0.76605 | Large model/FT on chemical |
| gene | 0.095 | 0.75233 | 0.78905 | 0.77025 | Large model/FT on disease |
| gene | 0.068 | 0.74312 | 0.79197 | 0.76677 | base model/Huggingface |
| species | 0.025 | 0.79326 | 0.766558 | 0.77968 | Large model/FT on chemical |
| species | 0.033 | 0.78729 | 0.77362 | 0.78039 | Large model/FT on disease |
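
The F1 values in the results can be cross-checked against precision and recall, since F1 is their harmonic mean, $F_1 = 2PR / (P + R)$. The sketch below recomputes it for three of the raw evaluation outputs listed earlier in this notebook. The function name `f1_from_precision_recall` and the dictionary keys are illustrative.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the raw evaluation output.
reported = {
    "cell_line / FT on chemical": (0.5979628520071899, 0.6348600508905853, 0.6158593026843566),
    "chemical / large model": (0.8740117716713683, 0.8842982487050284, 0.87912492117985),
    "species / FT on disease": (0.787292817679558, 0.7736156351791531, 0.78039430449069),
}

for name, (precision, recall, f1) in reported.items():
    recomputed = f1_from_precision_recall(precision, recall)
    print(f"{name}: reported {f1:.5f}, recomputed {recomputed:.5f}")
```

The zero-division guard matters for the failed runs described under Problems: when precision and recall are both zero, the function returns 0.0 instead of raising an error.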
