In [3]:
import fastai
from fastai.text import *
import pandas as pd
import numpy as np
# import fastbook
from fastai.vision.core import *
from fastai.vision.data import *
from fastai.data.block import DataBlock, RandomSplitter, ColReader
from fastai.text.all import *
from fastai.text.data import TextBlock
from fastai.vision import *
from fastai.text.learner import text_classifier_learner, language_model_learner
from fastai.metrics import accuracy



In [4]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


# Universal Language Model Fine-tuning for Text Classification (ULMFiT)

Universal Language Model Fine-tuning for Text Classification (ULMFiT) is a transfer learning method that can be applied to any NLP task.
It was created by Jeremy Howard and Sebastian Ruder and published in 2018 in the article [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/pdf/1801.06146v5.pdf).

## Main elements

The method consists of three main elements, each with its own advantages:
1. General-domain Language Model pretraining - the model is pre-trained on Wikitext-103 (28,595 preprocessed Wikipedia articles and 103 million words). This lets the model generalize and give good results even on small target datasets.

2. Target task Language Model fine-tuning - the model adapts quickly and easily to the specifics of the target data, thanks to discriminative fine-tuning and slanted triangular learning rates.


> *   Discriminative fine-tuning - since different layers capture different types of information, they should be fine-tuned to different extents. Therefore, each layer is tuned with its own learning rate. The regular stochastic gradient descent (SGD) update of a model's parameters $θ$ at time step $t$ is $θ_t = θ_{t−1} − η · ∇_θJ(θ)$.
> The parameters $θ$ are split here into $\{θ^1, \ldots, θ^L\}$, where $θ^l$ contains the parameters of the model at the $l$-th layer and $L$ is the number of layers of the model. The learning rates are defined in the same way as $\{η^1, \ldots, η^L\}$, where $η^l$ is the learning rate of the $l$-th layer.
> The SGD update with discriminative fine-tuning is then: $θ_t^l = θ_{t−1}^l − η^l · ∇_{θ^l}J(θ)$.
> The authors report that they empirically found it to work well to first choose the learning rate $η^L$ of the last layer by fine-tuning only the last layer, and then to use $η^{l−1} = η^l/2.6$ as the learning rate for lower layers.



> * Slanted triangular learning rates - the learning rate first increases linearly and quickly, and then decays linearly and slowly.


3. Target task classifier fine-tuning - the pretrained language model is extended with two additional linear blocks. Each block uses batch normalization and dropout, with ReLU activations for the intermediate layer and a softmax activation at the last layer that outputs a probability distribution over the target classes. The parameters in these task-specific classifier layers are the only ones that are learned from scratch. The first linear layer takes the pooled last hidden layer states as its input.
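
The slanted triangular schedule can be sketched in a few lines of plain Python. This is only an illustration of the formula given in the ULMFiT paper, not fastai's implementation; the function name `slanted_triangular_lr` is hypothetical, and the defaults (`cut_frac=0.1`, `ratio=32`, `lr_max=0.01`) are the values the paper suggests.

```python
import math

def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at training step t (0-based) of a slanted triangular schedule."""
    # Step at which the schedule switches from increasing to decreasing.
    cut = max(1, math.floor(total_steps * cut_frac))
    if t < cut:
        p = t / cut  # short, steep linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # long, slow linear decay
    # ratio controls how much smaller the lowest rate is than lr_max.
    return lr_max * (1 + p * (ratio - 1)) / ratio

total_steps = 100
schedule = [slanted_triangular_lr(t, total_steps) for t in range(total_steps)]
peak_step = max(range(total_steps), key=lambda t: schedule[t])
print(f"peak at step {peak_step}, lr there = {schedule[peak_step]:.4f}")
```

With 100 steps the rate climbs during the first 10% of training and decays over the remaining 90%, which is the "slanted" shape described above.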

## Other advantages

Concat pooling: The signal in text classification tasks is often contained in just a few words, which may occur anywhere in the document. As input documents can consist of hundreds of words, information may get lost if only the last hidden state of the model is considered. For this reason, the hidden state at the last time step is concatenated with both the max-pooled and the mean-pooled representation of the hidden states.
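
As a rough illustration of concat pooling, the sketch below works on plain Python lists instead of tensors. The function name `concat_pool` and the toy hidden states are made up for this example.

```python
def concat_pool(hidden_states):
    """Concatenate the last hidden state with max- and mean-pooled states.

    hidden_states: list of per-time-step vectors (lists of floats, equal length).
    """
    last = hidden_states[-1]
    dims = range(len(last))
    n_steps = len(hidden_states)
    # Pool over time, separately for every hidden dimension.
    max_pool = [max(h[d] for h in hidden_states) for d in dims]
    mean_pool = [sum(h[d] for h in hidden_states) / n_steps for d in dims]
    return last + max_pool + mean_pool

# Three time steps, hidden size 2 (toy values).
states = [[0.1, -0.5], [0.9, 0.0], [0.2, 0.4]]
pooled = concat_pool(states)
print(len(pooled))  # three times the hidden size
```

The strong activation `0.9` from the middle time step survives in the max-pooled part even though it is absent from the last hidden state.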


Gradual unfreezing: The authors recommend gradually unfreezing the model starting from the last layer, as this contains the least general knowledge. First the last layer is unfrozen and all unfrozen layers are fine-tuned for one epoch; then the next lower frozen layer is unfrozen, and so on, until all layers are fine-tuned until convergence at the last iteration.
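
The unfreezing order can be written down as a small schedule. This is a sketch under the assumption that the model is split into `n_groups` layer groups indexed from 0 (closest to the input) to `n_groups - 1` (the last layer); the function name is hypothetical.

```python
def gradual_unfreezing(n_groups):
    """Yield, stage by stage, the indices of the layer groups that are trainable."""
    for stage in range(1, n_groups + 1):
        # Each stage unfreezes one more group, counting from the end of the model.
        yield list(range(n_groups - stage, n_groups))

stages = list(gradual_unfreezing(4))
for epoch, trainable in enumerate(stages):
    print(f"epoch {epoch}: trainable groups {trainable}")
```

Only in the final stage are all groups trainable, which is where training would continue until convergence.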

Backpropagation through time (BPTT) - it is used to make the model practical for large documents (long input sequences). Each document is divided into fixed-length batches. At the beginning of each batch, the model is initialized with the final state of the previous batch, and the hidden states needed for mean- and max-pooling are tracked as well. Gradients are back-propagated to the batches whose hidden states contributed to the final prediction.
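
The forward part of this idea (fixed-length chunks with the state carried over) can be sketched as follows. `toy_rnn_step` is a hypothetical stand-in for a real recurrent cell, and the sketch does not compute gradients; it only shows that chunking does not change the states that are later pooled.

```python
def split_into_chunks(tokens, bptt):
    """Split a token sequence into consecutive chunks of at most bptt tokens."""
    return [tokens[i:i + bptt] for i in range(0, len(tokens), bptt)]

def toy_rnn_step(state, token_id):
    # Hypothetical recurrent update, used only for illustration.
    return 0.5 * state + token_id

def encode_document(tokens, bptt):
    """Run the toy RNN chunk by chunk, keeping the state between chunks."""
    state = 0.0
    all_states = []
    for chunk in split_into_chunks(tokens, bptt):
        # Each chunk starts from the final state of the previous chunk.
        for tok in chunk:
            state = toy_rnn_step(state, tok)
            all_states.append(state)  # tracked for max- and mean-pooling
    return state, max(all_states), sum(all_states) / len(all_states)

doc = [1, 2, 3, 4, 5]
chunks = split_into_chunks(doc, bptt=2)
chunked = encode_document(doc, bptt=2)
whole = encode_document(doc, bptt=len(doc))
print(chunks, chunked == whole)
```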


## Results

The authors' experiments on the IMDb dataset (sentiment analysis), the TREC dataset (question classification), and the AG News and DBpedia ontology datasets (topic classification) show that ULMFiT outperforms models that lack some or all of these components.

## Implementation

In order to test ULMFiT we are going to use a dataset of 31 abstracts or conclusions from [PubMed](https://pubmed.ncbi.nlm.nih.gov/) scientific articles.

The code in this work is based on:

[Fastai Documentation](https://docs.fast.ai/)

[Predicting Medical Specialities from Transcripts: A Complete Walkthrough using ULMFiT](https://towardsdatascience.com/predicting-medical-specialities-from-transcripts-a-complete-walkthrough-using-ulmfit-b8a075777723)

[Using FastAI’s ULMFiT to make a state-of-the-art multi-class text classifier](https://medium.com/technonerds/using-fastais-ulmfit-to-make-a-state-of-the-art-multi-label-text-classifier-bf54e2943e83)

In [5]:
texts = pd.read_csv('abstract_or_conclusions.csv')


In [6]:
texts.head(5)

Unnamed: 0,text,title,score,illness,health_substance,url
0,"Conclusion: While there is conflicting evidence about the benefits of omega-3 fatty acid supplementation on SLE disease activity, specific measures have demonstrated benefits. Current data show that there is a potential benefit on disease activity as demonstrated by SLAM-R, Systemic Lupus Erythematosus Disease Activity Index (SLEDAI), and British Isles Lupus Assessment Group (BILAG) scores and plasma membrane arachidonic acid composition and urinary 8-isoprostane levels, with minimal adverse events. Keywords: Systemic lupus erythematosus; fish oil; omega-3 fatty acids.",The effect of Omega-3 fatty acid supplementation in systemic lupus erythematosus patients: A systematic review,1,"lupus erythematosus, sle",omega-3,https://pubmed.ncbi.nlm.nih.gov/35023407/
1,Conclusion: Resveratrol possesses protective effects in pristane-induced lupus mice and may represent a novel approach for the management of SLE.,Resveratrol possesses protective effects in a pristane-induced lupus mouse model,2,"lupus erythematosus, sle",resveratrol,https://pubmed.ncbi.nlm.nih.gov/25501752/
2,"Abstract Nettle root is recommended for complaints associated with benign prostatic hyperplasia (BPH). We therefore conducted a comprehensive review of the literature to summarise the pharmacological and clinical effects of this plant material. Only a few components of the active principle have been identified and the mechanism of action is still unclear. It seems likely that sex hormone binding globulin (SHBG), aromatase, epidermal growth factor and prostate steroid membrane receptors are involved in the anti-prostatic effect, but less likely that 5alpha-reductase or androgen receptors ar...",A comprehensive review on the stinging nettle effect and efficacy profiles. Part II: urticae radix,2,"benign prostatic hyperplasia, bph",stinging nettle,https://pubmed.ncbi.nlm.nih.gov/17509841/
3,"Abstract For decades, the focus of managing autoimmune hypothyroidism has been on thyroxine replacement. Correcting lab parameters such as thyroid stimulating hormone (TSH) has been a primary goal. The increasing prevalence of Hashimoto’s thyroiditis (HT) continues to impact the quality of life in patients. We believe a holistic approach to this disease entity, considering its underlying complex etiopathogenesis, would benefit patients. Nutraceuticals are combinations of essential nutrients and are becoming a part of novel medical treatments despite the lack of regulation. This review aims...",Minerals: An Untapped Remedy for Autoimmune Hypothyroidism?,1,"hashimoto’s thyroiditis, ht, hypothyroidism","minerals, zinc, selenium, magnesium, iron",https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7574993/
4,"Abstract Background: Hashimoto's thyroiditis (HT) is a common autoimmune disease characterized by high levels of thyroid peroxidase antibody (TPOAb) and thyroid globulin antibody (TgAb) as well as infiltration of lymphocytes in thyroid. In recent years, metformin has been proven to be effective in a variety of autoimmune diseases, such as systemic lupus erythematosus, rheumatoid arthritis and multiple sclerosis.\n\nMethods: This study systematically explored the therapeutic effect of metformin on HT and its underlying mechanism by comprehensively utilizing methods including animal model, i...",Metformin Reverses Hashimoto’s Thyroiditis by Regulating Key Immune Events,2,"hashimoto’s thyroiditis, ht, hypothyroidism",metformin,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8193849/


### First model

The parameters of the first model are the same as those in Daniel Ching's article (with some extra lines of code that were missing there).

In [7]:

dls_ok = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True),
    get_x=ColReader('text'),
    splitter=RandomSplitter(0.1)
).dataloaders(texts, bs=11, seq_len=2045)

First, we create a language_model_learner with the AWD_LSTM architecture.

In [17]:
learn = language_model_learner(dls_ok, AWD_LSTM, drop_mult=0.3, metrics=accuracy)

In [18]:
learn.save('res_search')

Path('models/res_search.pth')

In [19]:
learner_oks = learn.load('res_search')

In [20]:
learner_oks.unfreeze()

Training on the texts without the scores and without the task-specific final layers. This stage is general language-model training, not yet classification.

In [21]:
learner_oks.fit_one_cycle(15, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,4.054708,3.794576,0.299465,00:01
1,3.886221,4.166476,0.272727,00:00
2,3.979374,6.05119,0.286096,00:00
3,4.661844,5.668819,0.122995,00:00
4,4.908409,4.817206,0.243316,00:00
5,4.911751,4.628443,0.237968,00:00
6,4.885194,4.455484,0.208556,00:00
7,4.834794,4.392136,0.272727,00:00
8,4.784651,4.153449,0.294118,00:00
9,4.71768,4.09422,0.299465,00:00


Saving the language model as an encoder.


In [22]:
learner_oks.save_encoder("readymade_encoder")

Creating the classification DataBlock, which now includes get_y (our labels - the text scores).

In [23]:
class_dls_new = DataBlock(
    blocks=(TextBlock.from_df('text', vocab=dls_ok.vocab), CategoryBlock),
    get_x=ColReader('text'),
    get_y=ColReader('score'),
    splitter=RandomSplitter()
).dataloaders(texts, bs=11)

In [24]:
lrner = learner_oks.load_encoder("readymade_encoder")

A test with partial freezing. We train once, then call freeze_to(-2), which freezes the earlier layer groups (0 and 1) and leaves the last two trainable.

In [25]:
lrner.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,3.45308,3.823498,0.304813,00:00


In [26]:
lrner.freeze_to(-2)
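
To make the effect of a negative `freeze_to` argument concrete, here is a plain-Python sketch of the index arithmetic as we understand it: a negative `n` counts layer groups from the end, and everything before that point is frozen. The helper name `split_frozen` and the group count of 4 are assumptions for illustration; the real number of groups depends on how the architecture is split.

```python
def split_frozen(n, n_groups):
    """Return (frozen, trainable) group indices for a freeze_to-style call."""
    # A negative n is counted from the end, like a Python list index.
    first_trainable = n if n >= 0 else n_groups + n
    first_trainable = max(0, min(first_trainable, n_groups))
    frozen = list(range(first_trainable))
    trainable = list(range(first_trainable, n_groups))
    return frozen, trainable

# Assumption: the learner has 4 layer groups.
frozen, trainable = split_frozen(-2, n_groups=4)
print("frozen:", frozen, "trainable:", trainable)
```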

In [27]:
lrner.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,3.448787,3.821321,0.307487,00:00
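
The `slice(lr/(2.6**4), lr)` argument is how the discriminative fine-tuning rule from the theory section reaches the optimizer: the lowest and highest learning rates are given, and the rates in between are spread multiplicatively across the layer groups. The sketch below reproduces that spacing in plain Python. The helper name is hypothetical, and the assumption of 5 layer groups is ours; with a different number of groups the step between neighbouring rates is no longer exactly 2.6.

```python
def discriminative_lrs(lr_low, lr_high, n_groups):
    """Spread learning rates geometrically from lr_low (first group) to lr_high (last)."""
    if n_groups == 1:
        return [lr_high]
    step = (lr_high / lr_low) ** (1 / (n_groups - 1))
    return [lr_low * step ** i for i in range(n_groups)]

lr_high = 1e-2
# Assumption: 5 layer groups, so the factor between neighbours is 2.6.
lrs = discriminative_lrs(lr_high / 2.6 ** 4, lr_high, n_groups=5)
for group, lr in enumerate(lrs):
    print(f"group {group}: lr = {lr:.6f}")
```

Each group then trains with a rate 2.6 times smaller than the group after it, matching $η^{l−1} = η^l/2.6$.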


Now, let's classify.

In [28]:
classifier_lrner = text_classifier_learner(class_dls_new, AWD_LSTM, drop_mult=0.5, metrics=accuracy) 

In [29]:
classifier_lrner.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,1.243818,0.766715,0.833333,00:00


This time we call freeze_to(-3), which leaves only the last three layer groups trainable. After some tests of our own, this turned out to work better.

In [30]:
classifier_lrner.freeze_to(-3)

In [31]:
classifier_lrner.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.695794,0.735861,0.833333,00:01


Because we see a better result, we are going to export the model so that we can use it later.

In [32]:
classifier_lrner.export('new_classifier_lrner_833_735.pth')

In [33]:
classifier_lrner.unfreeze()

In [34]:
classifier_lrner.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.580617,0.728007,0.833333,00:01
1,0.473089,0.741843,0.833333,00:01


We see that valid_loss starts increasing. The accuracy will probably start decreasing if we continue.

In [35]:
classifier_lrner.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.388291,0.755657,0.833333,00:01
1,0.392638,0.768982,0.833333,00:01


In [36]:
classifier_lrner.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.359383,0.756378,0.833333,00:01
1,0.304097,0.767251,0.833333,00:01


In [37]:
classifier_lrner.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.292897,0.775017,0.833333,00:01
1,0.301331,0.783734,0.833333,00:01


In [38]:
classifier_lrner.fit_one_cycle(10, slice(1e-3/(2.6**4), 1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.265858,0.793795,0.833333,00:01
1,0.255346,0.801724,0.666667,00:01
2,0.248132,0.814386,0.666667,00:01
3,0.229448,0.797779,0.666667,00:01
4,0.210497,0.752668,0.666667,00:01
5,0.201893,0.736991,0.666667,00:01
6,0.191534,0.734301,0.666667,00:01
7,0.187231,0.715206,0.666667,00:01
8,0.180925,0.71908,0.666667,00:01
9,0.179831,0.716069,0.666667,00:01


As we expected, the accuracy decreased. It seems that it dropped once valid_loss reached around 0.8.
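
A note on reading the accuracy column: the values move in large jumps because the validation set is tiny. The arithmetic below is a sketch under two assumptions: that the table has 31 rows, and that `RandomSplitter()` uses its default validation share of 20%, which would give 6 validation samples.

```python
# Assumptions: 31 rows and a default 20% validation split.
n_rows = 31
n_valid = int(0.2 * n_rows)

# Classifier accuracies reported in this notebook.
observed_accuracies = [0.833333, 0.666667, 0.5, 0.333333]

# Map every reported accuracy to the number of correctly classified samples.
correct_counts = {acc: round(acc * n_valid) for acc in observed_accuracies}
for acc, correct in correct_counts.items():
    print(f"accuracy {acc:.6f} = {correct}/{n_valid} correct")

step = 1 / n_valid  # change in accuracy caused by a single sample
print(f"one sample changes the accuracy by {step:.3f}")
```

Under these assumptions, the drop from 0.833333 to 0.666667 corresponds to a single additional misclassified text.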

### Another model

Let's perform our own test with a different learning rate and two un-freezings.

In [179]:
new_classifier_learner_1 = text_classifier_learner(class_dls_new, AWD_LSTM, drop_mult=0.5, metrics=accuracy) 

In [180]:
new_classifier_learner_1.fit_one_cycle(5, 2e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.284495,1.05308,0.666667,00:01
1,1.114591,1.020147,0.666667,00:00
2,0.923566,1.014729,0.666667,00:00
3,0.818531,1.031657,0.5,00:00
4,0.748069,1.051837,0.333333,00:00




In [181]:
new_classifier_learner_1.freeze_to(-2)

In [182]:
new_classifier_learner_1.fit_one_cycle(1, slice(1e-3/(2.6**4), 1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.448674,1.060746,0.666667,00:01


In [183]:
new_classifier_learner_1.freeze_to(-3)

In [184]:
new_classifier_learner_1.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.498019,1.025493,0.833333,00:01


Because the accuracy is quite high (although valid_loss is above 1), we are going to save the model before we proceed.

In [190]:
new_classifier_learner_1.export('new_classifier_learner_1_best.pth')

In [185]:
new_classifier_learner_1.save('new_classifier_learner_1')

Path('models/new_classifier_learner_1.pth')

We are not going to proceed further, because the model starts performing worse. After some trial and error, Daniel Ching's parameters turned out to be close to ideal for our case as well, but we saw that we can achieve the same results with some changes.

### Classification tests

We are now going to look at several texts and the scores we would give them. A zero is given mainly when the specific health substance or practice is not directly connected to the disease. Most of the texts have a score of 1, because the alternative treatment could have some benefits but more research is needed. A score of 2 goes to those rare papers that claim the substance or practice is quite effective and can be used as a treatment for the specific disease.

So, let's look at one easy text - it is obvious that its score is 1. [Source](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5963652/)

'Conclusion
current study shows certain positive effects of Nettle in the management of allergic rhinitis on controlling the symptoms based on the SNOT-22 and also similar effects was demonstrated by placebo. Hence, the exact efficacy of Urtica dioica in this respect could not be determined in this study. We believe that our limitations underscore the need for larger, longer term studies of different pharmaceutical dosage forms of Nettle for the treatment of allergic rhinitis.'

In [191]:
new_classifier_learner_1.predict('current study shows certain positive effects of Nettle in the management of allergic rhinitis on controlling the symptoms based on the SNOT-22 and also similar effects was demonstrated by placebo. Hence, the exact efficacy of Urtica dioica in this respect could not be determined in this study. We believe that our limitations underscore the need for larger, longer term studies of different pharmaceutical dosage forms of Nettle for the treatment of allergic rhinitis.')

('1', tensor(1), tensor([0.3225, 0.3436, 0.3339]))

In [40]:
classifier_lrner.predict('current study shows certain positive effects of Nettle in the management of allergic rhinitis on controlling the symptoms based on the SNOT-22 and also similar effects was demonstrated by placebo. Hence, the exact efficacy of Urtica dioica in this respect could not be determined in this study. We believe that our limitations underscore the need for larger, longer term studies of different pharmaceutical dosage forms of Nettle for the treatment of allergic rhinitis.')

('1', tensor(1), tensor([0.1384, 0.7218, 0.1397]))

We see that the model with the smaller valid_loss is more certain of its prediction.
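
One simple way to put a number on this "certainty" is the margin between the two highest class probabilities. The sketch below uses the probabilities printed by the two `predict` calls above; the helper name `top_label_and_margin` and the choice of the margin as a measure are our own, not something fastai provides.

```python
def top_label_and_margin(probs, labels=("0", "1", "2")):
    """Return the most probable label and its lead over the runner-up."""
    ranked = sorted(zip(probs, labels), reverse=True)
    (best_p, best_label), (second_p, _) = ranked[0], ranked[1]
    return best_label, best_p - second_p

# Probabilities copied from the two predict outputs above.
higher_loss_model = [0.3225, 0.3436, 0.3339]
lower_loss_model = [0.1384, 0.7218, 0.1397]

label_a, margin_a = top_label_and_margin(higher_loss_model)
label_b, margin_b = top_label_and_margin(lower_loss_model)
print(label_a, round(margin_a, 4))
print(label_b, round(margin_b, 4))
```

Both models pick the same label, but the margin of the first is under 0.01 while the margin of the second is above 0.5.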

The next passage is ambiguous. The article covers several diseases, but they are not mentioned in the conclusion. So, even though a human might decide that this part of the article deserves a 1 or, more likely, a 2, the models score it with 0. In this way they show pretty well that they 'understand' the logic behind our zeros - no direct connection between a disease and a substance/practice. [Source](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8739926/)

'Conclusions
The nutritional content of the plant is significant, and it has incredible therapeutic potential. The findings of this study are needed to investigate the therapeutic potential, as it may be a promising option for drug development.'


In [192]:
new_classifier_learner_1.predict('The nutritional content of the plant is significant, and it has incredible therapeutic potential. The findings of this study are needed to investigate the therapeutic potential, as it may be a promising option for drug development.')

('0', tensor(0), tensor([0.4742, 0.2239, 0.3019]))

In [42]:
classifier_lrner.predict('The nutritional content of the plant is significant, and it has incredible therapeutic potential. The findings of this study are needed to investigate the therapeutic potential, as it may be a promising option for drug development.')

('0', tensor(0), tensor([0.4922, 0.3003, 0.2075]))

The next one is quite interesting. We would score it with 1 (the tone is quite positive, so it could be mistaken for a 2), but the models' decision is 0 (even though, for the first model, the probability of 2 is almost equal to that of 0). Here they make a big mistake. [Source](https://pubmed.ncbi.nlm.nih.gov/34407441/)

'Conclusions: NSO supplementation was associated with faster recovery of symptoms than usual care alone for patients with mild COVID-19 infection. These potential therapeutic benefits require further exploration with placebo-controlled, double-blinded studies.'



In [193]:
new_classifier_learner_1.predict('Conclusions: NSO supplementation was associated with faster recovery of symptoms than usual care alone for patients with mild COVID-19 infection. These potential therapeutic benefits require further exploration with placebo-controlled, double-blinded studies.')

('0', tensor(0), tensor([0.3760, 0.2622, 0.3617]))

In [43]:
classifier_lrner.predict('Conclusions: NSO supplementation was associated with faster recovery of symptoms than usual care alone for patients with mild COVID-19 infection. These potential therapeutic benefits require further exploration with placebo-controlled, double-blinded studies.')

('0', tensor(0), tensor([0.6711, 0.1434, 0.1854]))

The second model (classifier_lrner) is more certain in the decisions that both models make, even when they are wrong.

It seems that our models are quite conservative. The text below should be scored with 2. Maybe because it compares different kinds of propolis and their effectiveness, our models' result is 1. [Source](https://pubmed.ncbi.nlm.nih.gov/31146392/)

'Abstract
Researchers are continuing to discover all the properties of propolis due to its complex composition and associated broad spectrum of activities. This review aims to characterize the latest scientific reports in the field of antibacterial activity of this substance. The results of studies on the influence of propolis on more than 600 bacterial strains were analyzed. The greater activity of propolis against Gram-positive bacteria than Gram-negative was confirmed. Moreover, the antimicrobial activity of propolis from different regions of the world was compared. As a result, high activity of propolis from the Middle East was found in relation to both, Gram-positive (Staphylococcus aureus) and Gram-negative (Escherichia coli) strains. Simultaneously, the lowest activity was demonstrated for propolis samples from Germany, Ireland and Korea.

Keywords: Escherichia coli; Staphylococcus aureus; antibacterial; bee product; polyphenols; propolis; terpenoids.'

In [196]:
new_classifier_learner_1.predict('''Abstract
Researchers are continuing to discover all the properties of propolis due to its complex composition and associated broad spectrum of activities. This review aims to characterize the latest scientific reports in the field of antibacterial activity of this substance. The results of studies on the influence of propolis on more than 600 bacterial strains were analyzed. The greater activity of propolis against Gram-positive bacteria than Gram-negative was confirmed. Moreover, the antimicrobial activity of propolis from different regions of the world was compared. As a result, high activity of propolis from the Middle East was found in relation to both, Gram-positive (Staphylococcus aureus) and Gram-negative (Escherichia coli) strains. Simultaneously, the lowest activity was demonstrated for propolis samples from Germany, Ireland and Korea.

Keywords: Escherichia coli; Staphylococcus aureus; antibacterial; bee product; polyphenols; propolis; terpenoids.''')

('1', tensor(1), tensor([0.3280, 0.4311, 0.2409]))

The conclusions from this article could be marked with 2 (or, possibly, 1 due to their reserved tone). Let's see what our models will show. [Source](https://pubmed.ncbi.nlm.nih.gov/15291903/)

'Conclusions: The results of this study suggest that traditional Chinese therapy may be an efficacious and safe treatment option for patients with seasonal AR.'

In [12]:
# Our session ended, so we load the saved model again
best_model_learner = load_learner('new_classifier_learner_1.pth')

In [13]:
best_model_learner.predict('Conclusions: The results of this study suggest that traditional Chinese therapy may be an efficacious and safe treatment option for patients with seasonal AR')

('0', tensor(0), tensor([0.3859, 0.2572, 0.3569]))

In [45]:
classifier_lrner.predict('Conclusions: The results of this study suggest that traditional Chinese therapy may be an efficacious and safe treatment option for patients with seasonal AR')

('0', tensor(0), tensor([0.8010, 0.0558, 0.1432]))

The second model is badly wrong. The first is wrong as well, but at least its probability for 2 was close to its probability for 0.

It seems that ULMFiT tends to underestimate texts that should be scored 2. Such texts are rare in our data (8 out of 32 articles), which is probably why the accuracy stays high despite these mistakes.

From the comparison of the two models above we can clearly see that a smaller valid_loss doesn't necessarily make the model better.