In [30]:
!pip install --quiet "tensorflow-text==2.8.*"

In [31]:
pip install -q tf-models-official==2.7.0

In [56]:
import os
import re
import random
import seaborn as sns
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # Imports TF ops for preprocessing.
from sklearn.metrics import pairwise
from official.nlp import optimization

In [32]:
random.seed(42)

In [33]:
#@title Configure the model { run: "auto" }
BERT_MODEL = "https://tfhub.dev/google/experts/bert/pubmed/2" # @param {type: "string"} ["https://tfhub.dev/google/experts/bert/wiki_books/2", "https://tfhub.dev/google/experts/bert/wiki_books/mnli/2", "https://tfhub.dev/google/experts/bert/wiki_books/qnli/2", "https://tfhub.dev/google/experts/bert/wiki_books/qqp/2", "https://tfhub.dev/google/experts/bert/wiki_books/squad2/2", "https://tfhub.dev/google/experts/bert/wiki_books/sst2/2",  "https://tfhub.dev/google/experts/bert/pubmed/2", "https://tfhub.dev/google/experts/bert/pubmed/squad2/2"]
# Preprocessing must match the model, but all the above use the same.
PREPROCESS_MODEL = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"

In [34]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [90]:
from google.colab import files
dataset_file_dict = files.upload()

Saving only_abstracts.csv to only_abstracts (1).csv


# BERT PubMed Expert

BERT (Bidirectional Encoder Representations from Transformers) is a deep learning model in which every output element is connected to every input element. The weights between them are dynamically calculated based on their connection. It is designed to read in both directions at once.

Because the algorithm is based on transformers (that generally don't require any fixed order of sequences (like RNNs)), it can process large amounts of data. This allows BERT to be pretrained. Mainly the English Wikipedia and the Brown Corpus are used for this goal.

Nowadays though there are several versions of BERT. We are going to test the BERT PubMed Expert - a model pre-trained on the MEDLINE/PubMed corpus of biomedical and life sciences literature abstracts. It is intended to be used on medical or scientific text NLP tasks. Exactly what our problem is! 

Sources for our code:

[TensorFlow Documentation](https://www.tensorflow.org/text/tutorials/classify_text_with_bert)

[TensorFlowHub experts/bert/pubmed Colab Notebook](https://tfhub.dev/google/experts/bert/pubmed/2)

[Unconventional Sentiment Analysis: BERT vs. Catboost](https://towardsdatascience.com/unconventional-sentiment-analysis-bert-vs-catboost-90645f2437a9)



First, let's open our dataset with PubMed texts.

In [63]:
texts_table = pd.read_csv("only_abstracts.csv")

In [64]:
texts = texts_table[['text', 'score']]

In [65]:
texts.head()

Unnamed: 0,text,score
0,Conclusion: While there is conflicting evidenc...,1
1,Conclusion: Resveratrol possesses protective e...,2
2,Abstract Nettle root is recommended for compla...,2
3,"Abstract For decades, the focus of managing au...",1
4,Abstract Background: Hashimoto's thyroiditis (...,2


## Preprocessing data

Let's split the data according to the labels. 

We are going to take 25% for testing. Let's take into account the distribution of the labels (about 50% for 1-scored artilcles, 25% for 0 and 2-scored texts), then this mean four articles with '1' score, two with '0' and two with '2' score for testing. More information about the distribution of the texts is available in the main ipynb file of this project.

In [66]:
two_scored_testing = texts[texts['score'] == 2][:2]
two_scored_testing_indices = two_scored_testing.index

In [67]:
texts[texts['score'] == 1][:4]

Unnamed: 0,text,score
0,Conclusion: While there is conflicting evidenc...,1
3,"Abstract For decades, the focus of managing au...",1
5,"Conclusion:\nFrom this study, it can be theori...",1
7,"Abstract Thyroid dysfunction, affecting people...",1


We see that three of the four texts concern Hashimoto. Let's try another combination.

In [68]:
texts[texts['score'] == 1]

Unnamed: 0,text,score
0,Conclusion: While there is conflicting evidenc...,1
3,"Abstract For decades, the focus of managing au...",1
5,"Conclusion:\nFrom this study, it can be theori...",1
7,"Abstract Thyroid dysfunction, affecting people...",1
12,Abstract The objective of this study was to de...,1
13,Conclusion: This review showed promising resul...,1
14,Abstract Systemic lupus erythematosus (SLE) is...,1
16,"Abstract Endometriosis, a gynecological diseas...",1
17,"Abstract Quercetin (3,3',4',5,7-pentahydroxyfl...",1
19,"Abstract Epigallocatechin-3-gallate (EGCG), a ...",1


In [69]:
texts.loc[[20, 28, 3, 19]]

Unnamed: 0,text,score
20,Abstract\r\nIdiopathic normal pressure hydroce...,1
28,Abstract Since the global outbreak of severe a...,1
3,"Abstract For decades, the focus of managing au...",1
19,"Abstract Epigallocatechin-3-gallate (EGCG), a ...",1


We see now that the data is quite different.

In [70]:
one_scored_testing_indices = [20, 28, 3, 19]

In [71]:
texts[texts['score'] == 0]

Unnamed: 0,text,score
8,Conclusions: This study demonstrates that wome...,0
9,Conclusions: A limited body of knowledge exist...,0
10,Abstract Awareness of physical activity (PA) c...,0
11,Conclusions: MetS and insulin resistance are a...,0
15,Conclusion\nThe effect of PA and exercise as t...,0
21,Conclusion\nShunt surgery improved short-dista...,0
23,Abstract Thyroid dysfunction can compromise ph...,0


The last ones are different. But they both are connected to activities, and this could mislead the model. Let's take one, connected to a practice, and one to a substance.

In [72]:
zero_scored_testing_indices = [11, 23]

In [73]:
two_scored_testing_indices

Int64Index([1, 2], dtype='int64')

In [74]:
testing_indices = []
testing_indices.extend([ind for ind in two_scored_testing_indices])
testing_indices.extend([ind for ind in one_scored_testing_indices])
testing_indices.extend([ind for ind in zero_scored_testing_indices])

In [75]:
testing_indices

[1, 2, 20, 28, 3, 19, 11, 23]

We are going to shuffle them.

In [76]:
random.shuffle(testing_indices)

In [77]:
print(testing_indices)

[28, 3, 11, 23, 20, 19, 1, 2]


Now, since we have the row indices of the testing set, let's split the whole dataset into two groups - train and test group.

In [78]:
test_group = texts.loc[testing_indices]

In [79]:
train_group = texts.drop(index=testing_indices)

In [80]:
train_group

Unnamed: 0,text,score
0,Conclusion: While there is conflicting evidenc...,1
4,Abstract Background: Hashimoto's thyroiditis (...,2
5,"Conclusion:\nFrom this study, it can be theori...",1
6,Abstract Hypothyroidism is one of the most com...,2
7,"Abstract Thyroid dysfunction, affecting people...",1
8,Conclusions: This study demonstrates that wome...,0
9,Conclusions: A limited body of knowledge exist...,0
10,Abstract Awareness of physical activity (PA) c...,0
12,Abstract The objective of this study was to de...,1
13,Conclusion: This review showed promising resul...,1


In [81]:
train_group.shape, test_group.shape

((24, 2), (8, 2))

## PubMed BERT Model

We are preparing the model.

In [82]:
#For the variables, see 6-th cell "Configure the models"
preprocess = hub.load(PREPROCESS_MODEL)
bert = hub.load(BERT_MODEL)

In [83]:
tfhub_handle_encoder = bert
tfhub_handle_preprocess = preprocess

In [None]:
def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(3, activation=None, name='classifier')(net)
  

  return tf.keras.Model(text_input, net)

In [None]:
classifier_model_docs_ = build_classifier_model()

In [None]:
classifier_model_docs_ = build_classifier_model()

#Let's test it if it works
bert_raw_result = classifier_model_docs_(tf.constant(['what an amazing place!']))
print(tf.sigmoid(bert_raw_result))

tf.Tensor([[0.40643921 0.33709714 0.81453496]], shape=(1, 3), dtype=float32)


In [None]:
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = tf.metrics.SparseCategoricalAccuracy('accuracy')

In [None]:
epochs = 5
steps_per_epoch = train_group.shape[0]
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1*num_train_steps)

init_lr = 3e-5
optimizer = optimization.create_optimizer(init_lr=init_lr,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')

In [None]:
classifier_model_docs_.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=metrics)

In [None]:
history = classifier_model_docs_.fit(
    x=train_group['text'].values,
    y=train_group['score'],
    validation_data=(test_group['text'].values, test_group['score']),
    epochs=35)

Epoch 1/35
Epoch 2/35
Epoch 3/35
Epoch 4/35
Epoch 5/35
Epoch 6/35
Epoch 7/35
Epoch 8/35
Epoch 9/35
Epoch 10/35
Epoch 11/35
Epoch 12/35
Epoch 13/35
Epoch 14/35
Epoch 15/35
Epoch 16/35
Epoch 17/35
Epoch 18/35
Epoch 19/35
Epoch 20/35
Epoch 21/35
Epoch 22/35
Epoch 23/35
Epoch 24/35
Epoch 25/35
Epoch 26/35
Epoch 27/35
Epoch 28/35
Epoch 29/35
Epoch 30/35
Epoch 31/35
Epoch 32/35
Epoch 33/35
Epoch 34/35
Epoch 35/35


We see the familiar behaviour of our main model (our LSTM model in the main ipynb file of the project). With little above 0.80 val_loss, the val_accuracy is at its peak. After that the model starts overfitting too much.

Unfortunately, this model didn't imrove LSTM's results.

We are going to test similar architecture, but with CategoricalCrossentropy this time.

### BERT PubMed Experts with CategoricalCrossentropy

After some trial and error we managed to reach 0.875 accuracy and save the model (with different learning rate and softmax activation though).

In [None]:
tf.keras.backend.clear_session()

In [84]:
def build_classifier_model_3():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(3, activation='softmax', name='classifier')(net)
  

  model = tf.keras.Model(text_input, net)

  loss = tf.keras.losses.CategoricalCrossentropy()
  metric = tf.metrics.CategoricalAccuracy('accuracy')
  optimizer = tf.keras.optimizers.Adam(
    learning_rate=5e-05, epsilon=1e-08, decay=0.01, clipnorm=1.0)
  model.compile(
    optimizer=optimizer, loss=loss, metrics=metric)
  model.summary()
  return model

In [86]:
y_train = tf.keras.utils.to_categorical(
    train_group['score'].astype('category').cat.codes.values, num_classes=3)
y_test = tf.keras.utils.to_categorical(
    test_group['score'].astype('category').cat.codes.values, num_classes=3)

In [None]:
classifier_model_ar = build_classifier_model_3()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text (InputLayer)              [(None,)]            0           []                               
                                                                                                  
 preprocessing (KerasLayer)     {'input_mask': (Non  0           ['text[0][0]']                   
                                e, 128),                                                          
                                 'input_type_ids':                                                
                                (None, 128),                                                      
                                 'input_word_ids':                                                
                                (None, 128)}                                                  

In [None]:
checkpoint_path = "training_3/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 monitor='val_accuracy',
                                                  mode='max',
                                                  save_best_only=True,
                                                 verbose=1)

In [None]:
history = classifier_model_ar.fit(
    x=train_group['text'].values,
    y=y_train,
    validation_data=(test_group['text'].values, y_test),
    epochs=20,
    callbacks=[cp_callback])

Epoch 1/20
Epoch 1: val_accuracy improved from -inf to 0.62500, saving model to training_3/cp.ckpt
Epoch 2/20
Epoch 2: val_accuracy did not improve from 0.62500
Epoch 3/20
Epoch 3: val_accuracy improved from 0.62500 to 0.87500, saving model to training_3/cp.ckpt
Epoch 4/20
Epoch 4: val_accuracy did not improve from 0.87500
Epoch 5/20
Epoch 5: val_accuracy did not improve from 0.87500
Epoch 6/20
Epoch 6: val_accuracy did not improve from 0.87500
Epoch 7/20
Epoch 7: val_accuracy did not improve from 0.87500
Epoch 8/20
Epoch 8: val_accuracy did not improve from 0.87500
Epoch 9/20
Epoch 9: val_accuracy did not improve from 0.87500
Epoch 10/20
Epoch 10: val_accuracy did not improve from 0.87500
Epoch 11/20
Epoch 11: val_accuracy did not improve from 0.87500
Epoch 12/20
Epoch 12: val_accuracy did not improve from 0.87500
Epoch 13/20
Epoch 13: val_accuracy did not improve from 0.87500
Epoch 14/20
Epoch 14: val_accuracy did not improve from 0.87500
Epoch 15/20
Epoch 15: val_accuracy did not im

We are going to reload the model from its Checkpoint by creating an empty model with the same architecture, checking its accuracy and after that loading the weights of the best performing model, and checking its accuracy again.



In [None]:
model_empty = build_classifier_model_3()

# Evaluate the model
loss, acc = model_empty.evaluate(test_group['text'].values, y_test, verbose=2)
print("Untrained model, accuracy: {:5.2f}%".format(100 * acc))

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text (InputLayer)              [(None,)]            0           []                               
                                                                                                  
 preprocessing (KerasLayer)     {'input_mask': (Non  0           ['text[0][0]']                   
                                e, 128),                                                          
                                 'input_type_ids':                                                
                                (None, 128),                                                      
                                 'input_word_ids':                                                
                                (None, 128)}                                                

The empty model initially has 0.5 accuracy.

In [None]:
# Loads the weights
model_empty.load_weights(checkpoint_path)

# Re-evaluate the model
loss, acc = model_empty.evaluate(test_group['text'].values, y_test, verbose=2)
print("Restored model, accuracy: {:5.2f}%".format(100 * acc))

1/1 - 0s - loss: 0.6654 - accuracy: 0.8750 - 148ms/epoch - 148ms/step
Restored model, accuracy: 87.50%


We see the expected results - 0.875 accuracy.

## Some more testing

We are going to use two abstracts. We used them for our 
tests with ULMFiT (see ULMFiT.ipynb). For both texts, ULMFiT gave 0. We expect label=2 for the first text (honey is proven to work against the mentioned bacteria, even though not all honey is that effective) and label=1 for the second (NSO supplementation seems effective but more research is needed). The first part of the task is difficult because 2 is a rare score and the conclusions are ambiguous. Generally, the first text is not easy for categorization.

 First text:
 
 [Source](https://pubmed.ncbi.nlm.nih.gov/31146392/)

'Abstract Researchers are continuing to discover all the properties of propolis due to its complex composition and associated broad spectrum of activities. This review aims to characterize the latest scientific reports in the field of antibacterial activity of this substance. The results of studies on the influence of propolis on more than 600 bacterial strains were analyzed. The greater activity of propolis against Gram-positive bacteria than Gram-negative was confirmed. Moreover, the antimicrobial activity of propolis from different regions of the world was compared. As a result, high activity of propolis from the Middle East was found in relation to both, Gram-positive (Staphylococcus aureus) and Gram-negative (Escherichia coli) strains. Simultaneously, the lowest activity was demonstrated for propolis samples from Germany, Ireland and Korea.

Keywords: Escherichia coli; Staphylococcus aureus; antibacterial; bee product; polyphenols; propolis; terpenoids.'

Second text:

[Source](https://pubmed.ncbi.nlm.nih.gov/34407441/)

'Conclusions: NSO supplementation was associated with faster recovery of symptoms than usual care alone for patients with mild COVID-19 infection. These potential therapeutic benefits require further exploration with placebo-controlled, double-blinded studies.'

In [None]:
pred_texts = ['Abstract Researchers are continuing to discover all the properties of propolis due to its complex composition and associated broad spectrum of activities. This review aims to characterize the latest scientific reports in the field of antibacterial activity of this substance. The results of studies on the influence of propolis on more than 600 bacterial strains were analyzed. The greater activity of propolis against Gram-positive bacteria than Gram-negative was confirmed. Moreover, the antimicrobial activity of propolis from different regions of the world was compared. As a result, high activity of propolis from the Middle East was found in relation to both, Gram-positive (Staphylococcus aureus) and Gram-negative (Escherichia coli) strains. Simultaneously, the lowest activity was demonstrated for propolis samples from Germany, Ireland and Korea. Keywords: Escherichia coli; Staphylococcus aureus; antibacterial; bee product; polyphenols; propolis; terpenoids.',
                  'Conclusions: NSO supplementation was associated with faster recovery of symptoms than usual care alone for patients with mild COVID-19 infection. These potential therapeutic benefits require further exploration with placebo-controlled, double-blinded studies.']


Let's see what the results are.

In [None]:
model_empty.predict(pred_texts)

array([[0.07313772, 0.80489194, 0.1219703 ],
       [0.17718089, 0.32072923, 0.50208986]], dtype=float32)

Our best model gave 1 to the first text and 2 to the second (we expect 2 to the first and 1 to the second) . It didn't manage to categorize them correctly. It seems promising, though. 

## Scraped Data Tests

Now, let's test the winning algorithm with the expected data. Because we cannot analyze many texts here, we allowed the scraper to search for "dermatitis" in combination of a small part (around 400) of the keywords. Now, we are going to import 5 randomly selected texts and see what our model will achieve.

In [43]:
bot_data = pd.read_csv('pubmed_23-23_06_February_2023.csv')

In [44]:
bot_data

Unnamed: 0,title,article_id,text,query
0,Contact allergy to acrylates/methacrylates in ...,17577353,in a recent study we showed that all our denta...,"1,4-Butanediol"
1,The sensitizing capacity of multifunctional ac...,6499426,the multifunctional acrylates used in ultravio...,"1,4-Butanediol"
2,An attempt to improve diagnostics of contact a...,15606651,epoxy resin systems (erss) are a frequent caus...,"1,4-Butanediol"
3,Contact allergy to reactive diluents and relat...,25711534,diglycidyl ether of bisphenol a resin (dgeba-r...,"1,4-Butanediol"
4,Sensitization to reactive diluents and hardene...,26537833,"beside the basic resins, reactive diluents and...","1,4-Butanediol"
...,...,...,...,...
4580,Evaluating Upadacitinib in the Treatment of Mo...,35747444,upadacitinib is a selective small molecule tha...,Creatine
4581,"A phase 3 randomized, multicenter, double-blin...",34988493,systemic atopic dermatitis treatments that hav...,Creatine
4582,Safety of Janus kinase (JAK) inhibitors in the...,34423443,atopic dermatitis (ad) is a chronic heterogene...,Creatine
4583,Safety and efficacy of upadacitinib in combina...,34023009,systemic therapies are typically combined with...,Creatine


In [39]:
random_indices = np.random.randint(0, bot_data.shape[0], size=5)

In [18]:
random_indices

array([3766,  792, 3043, 3861,  302])

In [24]:
txt_list = []
for a_text_ind in random_indices:
  txt_list.append(bot_data.loc[a_text_ind, 'text'])

In [100]:
def depure_data(data):
    # Remove new line characters
    data = re.sub('\s+', ' ', data)

    # Remove distracting single quotes
    data = re.sub("\'", "", data)

    return data

In [60]:
cleaner_txt_list = []
for txt in txt_list:
  cleaner_txt_list.append(depure_data(txt))

cleaner_txt_list


['oxidized zirconium (oxinium), titanium nitride (tin) or titanium niobium nitride (tinbn) coated implants became in recent years available for an increasing amount of total knee arthroplasty (tka) systems. the hypothesis of this study was that the use of tinbn-coated components would not lead to inferior results compared to conventional implants and that none of the metal hypersensitivity patients receiving tinbn-coated implants would require revision for metal allergy. this retrospective study compared 53 titanium niobium nitride coated tka with 103 conventional chrome cobalt implants of the same design. patients were evaluated at a minimal follow-up of 3 years. no differences in clinical, radiological or patient-reported outcome measurements were observed between these groups. a survivorship of 96% without differences in revision rates was observed at medium-term follow-up of 6.5 years. metal allergy leading to contact or generalized dermatitis after tka is very rare and usually lin

**Results**

The first three texts should be 0-scored. The fourth has '2' score, the fifth should be 0-scored as well. Why so many zeros? Let's see what keywords they have.

In [None]:
bot_data.loc[[3766,  792, 3043, 3861,  302], 'query']

3766      Cobalt
792      Arsenic
3043    Chromium
3861    Collagen
302         Agar
Name: query, dtype: object

We included metals in the list of natural remedies, because they could lead to defficiencies (and can be remedies). But they can cause poisoning as well. Arsenic is also used as medicine. It is included in this [list of natural medicines](https://naturalmedicines.therapeuticresearch.com/databases/food,-herbs-supplements.aspx?letter=A). 

So, we have an the intresting case of the old saying "The dose makes the poison".

We found Agar in the last text, in the context: "S. aureus isolates were cultivated from swab samples on selective MSSA and MRSA chromogenic agar(...)". So, it has no connection to the health problem.

**But will our best model recognize the lack of connection we saw above?**

Let's load it first.

In [87]:
model_new = build_classifier_model_3()

# Check whether it works as an empty architecture firts
loss, acc = model_new.evaluate(test_group['text'].values, y_test, verbose=2)
print("Untrained model, accuracy: {:5.2f}%".format(100 * acc))

Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text (InputLayer)              [(None,)]            0           []                               
                                                                                                  
 preprocessing (KerasLayer)     {'input_mask': (Non  0           ['text[0][0]']                   
                                e, 128),                                                          
                                 'input_word_ids':                                                
                                (None, 128),                                                      
                                 'input_type_ids':                                                
                                (None, 128)}                                                

The empty model initially has 0.375 accuracy.

In [93]:
%cd /content/gdrive/MyDrive

/content/gdrive/MyDrive


In [95]:
# Loads the weights
# model_new.load_weights('training_3/cp.ckpt')
model_new.load_weights('training_3/cp.ckpt')

# Re-evaluate the model
loss, acc = model_new.evaluate(test_group['text'].values, y_test, verbose=2)
print("Restored model, accuracy: {:5.2f}%".format(100 * acc))

1/1 - 0s - loss: 0.6654 - accuracy: 0.8750 - 164ms/epoch - 164ms/step
Restored model, accuracy: 87.50%


The expected result! Our model works. So, let's see how the model will score the texts. 

In [96]:
model_new.predict(cleaner_txt_list)

array([[0.49589503, 0.26751927, 0.23658568],
       [0.0986938 , 0.62093425, 0.280372  ],
       [0.2009237 , 0.71399945, 0.08507685],
       [0.31470627, 0.48864746, 0.19664626],
       [0.4999839 , 0.29321054, 0.20680557]], dtype=float32)

We have 0, 1, 1, 1, 0 (the right answer: 0, 0, 0, 2, 0). The model has 2 correct scores out of 5.

BUT!

We noticed something! While constructing the spider, we didn't scrape the h elements (the subtitles). That means - we don't have the word 'conclusion/s' in our texts. 

We will remind you that we used the conclusions of the abstracts (and the keywords after them) during the training of the models. Only if there was no conclusion part, we used the whole abstract.

So, we have a solution. The different parts of texts, scraped by the spider, are separated by 5 times '\n'. We are going to split them and use the last fragment or if in the last fragment we find ';' (which probably means that the text has keywords at the end) we are going to take the last two fragments.

Let's check out the results.

In [115]:
#We are taking out the last part/s
lst = []
for txt in txt_list:
  if '\n     \n      \n         \n      \n' in txt:
    txt = txt.split('\n     \n      \n         \n      \n')
    if ';' in txt[-1]:
      # lst.append(txt[-2:])
      txt = txt[-2:]
      tx = ', '.join(txt)
      tx = depure_data(tx)
      tx = tx.strip()
      lst.append(tx)
    else:
      tx = txt[-1][0]
      tx = depure_data(tx)
      tx = tx.strip()
      lst.append(tx)
  else:
    lst.append(txt)
  # for ls in lst:
  # if ls == '':
  #   lst.remove(ls)
print(len(lst))

5


In [116]:
model_new.predict(lst)

array([[0.9757751 , 0.0117159 , 0.01250899],
       [0.67351687, 0.1346261 , 0.19185708],
       [0.5496017 , 0.36461586, 0.08578251],
       [0.31470627, 0.48864746, 0.19664626],
       [0.98735785, 0.00805778, 0.00458442]], dtype=float32)

So the results now are:
0, 0, 0, 1, 0 (the right answer: 0, 0, 0, 2, 0). We have 4 out of 5 correct predictions. The model works very well. 

It is more difficult for the model to distinguish the 2-scored texts. We already saw the same flaw in our LSTM.

How can we improve it? We should feed it with more 2-scored texts for sure.  

