In [4]:
#Transformers installation
!pip install transformers


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.0-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.0 tokenizers-0.13.2 transformers-4.24.0


#Transformers library features: pre-trained models such as BERT, GPT-2, XLM, and RoBERTa
1. Text classification - sentiment analysis (positive or negative), detection of spam, scams, and fake news
2. Text generation - better video games, better AI assistants
3. Text summarization - for products, webpages
4. Feature extraction - returns a tensor representation of the text


#The default pipeline for this analysis uses a DistilBERT model fine-tuned on SST-2.
Check the link below to learn more about the model and its hyperparameters.
https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english?text=I+like+you.+I+love+you

In [5]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



A pretrained sentiment-analysis model and its tokenizer are downloaded and cached. The tokenizer preprocesses the text for the model, which in turn makes the predictions. The pipeline groups all of this together and post-processes the predictions to make them readable. Here is an example:
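The post-processing step can be pictured with a small sketch. This is only an illustration under assumptions, not the pipeline's actual code: the function name `postprocess` and the example logits are made up, and only the two label names match the model used here.

```python
import math

def postprocess(logits, id2label):
    """Turn raw model scores (logits) into a readable label and probability."""
    # Softmax: subtract the max for numerical stability, then normalize.
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    probs = [e / sum(exps) for e in exps]
    # Keep only the most likely class, as the pipeline output does.
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for one sentence (index 0 = NEGATIVE, 1 = POSITIVE).
print(postprocess([-4.2, 4.5], {0: "NEGATIVE", 1: "POSITIVE"}))
```

The returned dictionary has the same `label` and `score` keys as the classifier output in the next cell.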

In [6]:
classifier('I am happy doing NLP using hugging face.')

[{'label': 'POSITIVE', 'score': 0.9998164772987366}]

#Let's try to confuse it with a negated sentence and see the results.

In [7]:
results = classifier(["I liked it.", " I hope you did not hate it."])
for result in results:
  print(f"label:{result['label']}, with score:{round(result['score'], 4)}")

label:POSITIVE, with score:0.9999
label:POSITIVE, with score:0.6869


#Different models are available for text classification, including ones for other languages. Let's try one.

In [8]:
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment" )


In [9]:
classifier("C'était une mauvaise suggestion.")      #English translation-It was a bad suggestion

[{'label': '1 star', 'score': 0.5506829023361206}]

#This model rates text from 1 to 5 stars. It gives this statement 1 star.
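If a numeric rating is easier to work with than the text label, it can be parsed out of the label string. A minimal sketch; the helper names and the negative/neutral/positive cut-offs are assumptions for illustration, not part of the model.

```python
def stars_to_rating(label):
    """Extract the integer rating from a label such as '1 star' or '4 stars'."""
    return int(label.split()[0])

def rating_to_sentiment(rating):
    # Assumed cut-offs: 1-2 negative, 3 neutral, 4-5 positive.
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"

prediction = {'label': '1 star', 'score': 0.5506829023361206}  # output shown above
rating = stars_to_rating(prediction['label'])
print(rating, rating_to_sentiment(rating))
```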

#1. The tokenizer (which converts text into the token IDs expected by the model we picked) and the transformer model are called together. Tokenization is one of the most important steps in NLP.
#2. The model we are calling is for text classification. The model class changes with the task you are working on (e.g. text summarization).

In [10]:
#Load the sequence classification model and its tokenizer explicitly. This is the same process the pipeline runs for you, useful when you need to control each step yourself.
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [11]:
classifier("This was a good experience")        #classifier is still the multilingual model from In [8]: 4 stars, a positive response

[{'label': '4 stars', 'score': 0.48383447527885437}]

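#Let's see the token IDs (integers) that the tokenizer assigns to the sentence. 101 and 102 are the special [CLS] and [SEP] tokens added at the start and end.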

In [12]:
inputs = tokenizer("This was a good experience")

In [13]:
print(inputs)

{'input_ids': [101, 2023, 2001, 1037, 2204, 3325, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
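To make the mapping concrete, here is a toy lookup that reproduces the IDs above. It is a sketch under assumptions: the word-to-ID pairs are read off the output in order, and splitting on whitespace stands in for the real tokenizer, which uses WordPiece subwords and a vocabulary of about 30,000 entries.

```python
# Toy vocabulary read off the output above (pairing words and IDs by position).
vocab = {"[CLS]": 101, "[SEP]": 102, "this": 2023, "was": 2001,
         "a": 1037, "good": 2204, "experience": 3325}

def encode(text, vocab):
    """Lowercase, split on whitespace, look up IDs, and wrap in [CLS] ... [SEP]."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    return [vocab[t] for t in tokens]

ids = encode("This was a good experience", vocab)
print(ids)
print([1] * len(ids))  # attention mask: every position holds a real token
```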


#Pass a list of sentences to the tokenizer as a batch: pad them all to the same length, truncate them to the maximum length the model can accept, and get tensors back. These options are specified in the tokenizer call.

In [14]:
tf_batch = tokenizer(["I am so happy to work on NLP.", "I hope you do too"], padding = True, truncation = True, max_length = 512, return_tensors= "tf")

#Let's see the token values


In [15]:
for key, value in tf_batch.items():
  print(f"{key}: {value.numpy().tolist()}")

input_ids: [[101, 1045, 2572, 2061, 3407, 2000, 2147, 2006, 17953, 2361, 1012, 102], [101, 1045, 3246, 2017, 2079, 2205, 102, 0, 0, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]]
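The padding and attention mask in this output can be reproduced with plain lists. A minimal sketch, assuming right-padding with ID 0 as seen above; `pad_batch` is a hypothetical helper, and its truncation simply cuts the list, whereas the real tokenizer also keeps the closing [SEP] token.

```python
def pad_batch(sequences, pad_id=0, max_length=512):
    """Truncate to max_length, then right-pad every sequence to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token IDs of the two sentences, taken from the output above without the padding.
batch = pad_batch([
    [101, 1045, 2572, 2061, 3407, 2000, 2147, 2006, 17953, 2361, 1012, 102],
    [101, 1045, 3246, 2017, 2079, 2205, 102],
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```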


#The Simple Transformers library is used for the question-answering application.

In [16]:
!pip install simpletransformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting simpletransformers
  Downloading simpletransformers-0.63.9-py3-none-any.whl (250 kB)
Collecting wandb>=0.10.32
  Downloading wandb-0.13.5-py2.py3-none-any.whl (1.9 MB)
Collecting streamlit
  Downloading streamlit-1.15.0-py2.py3-none-any.whl (9.2 MB)
Collecting datasets
  Downloading datasets-2.7.0-py3-none-any.whl (451 kB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
C

#For the task at hand, a dataset needs to be created. Simple Transformers defines the input structure of the dataset. It can be supplied in two ways:
1. JSON file (containing a list of dictionaries)
2. Python list of dictionaries
    Each dictionary has two attributes: context and qas. qas holds the questions about the context, and each question records whether it can be answered or not (is_impossible).
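A small helper can build one entry in this layout and compute the answer offset instead of counting characters by hand. This is a sketch: `make_example` is a hypothetical helper, not part of Simple Transformers, and it assumes the answer appears verbatim in the context.

```python
import json

def make_example(context, question, qid, answer_text=None):
    """Build one context/qas entry; answer_text=None marks an unanswerable question."""
    if answer_text is None:
        answers, impossible = [], True
    else:
        start = context.find(answer_text)
        if start == -1:
            raise ValueError(f"answer {answer_text!r} not found in context")
        answers, impossible = [{"text": answer_text, "answer_start": start}], False
    return {"context": context,
            "qas": [{"id": qid, "is_impossible": impossible,
                     "question": question, "answers": answers}]}

example = make_example(
    "Mistborn is a series of epic fantasy novels written by American author Brandon Sanderson.",
    "Who is the author of the Mistborn series?", "00001", "Brandon Sanderson")
print(json.dumps([example], indent=2))
```

Raising an error when the answer is missing from the context catches typos before they reach training.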


#Reading the dataset. Check the formatting of the input structure in the dataset.

In [19]:
import json
with open(r"train.json", "r") as read_file:
    train = json.load(read_file)

In [20]:
train

[{'context': 'Mistborn is a series of epic fantasy novels written by American author Brandon Sanderson.',
  'qas': [{'id': '00001',
    'is_impossible': False,
    'question': 'Who is the author of the Mistborn series?',
    'answers': [{'text': 'Brandon Sanderson', 'answer_start': 71}]}]},
 {'context': 'The first series, published between 2006 and 2008, consists of The Final Empire,The Well of Ascension, and The Hero of Ages.',
  'qas': [{'id': '00002',
    'is_impossible': False,
    'question': 'When was the series published?',
    'answers': [{'text': 'between 2006 and 2008', 'answer_start': 28}]},
   {'id': '00003',
    'is_impossible': False,
    'question': 'What are the three books in the series?',
    'answers': [{'text': 'The Final Empire, The Well of Ascension, and The Hero of Ages',
      'answer_start': 63}]},
   {'id': '00004',
    'is_impossible': True,
    'question': 'Who is the main character in the series?',
    'answers': []}]}]
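In the output above, the answer for question 00003 has a space after "The Final Empire," that the context lacks, so the answer text cannot be found at its answer_start. A quick consistency check catches this kind of mismatch before training. The sketch below uses a hypothetical helper `find_bad_spans` and a small stand-in dataset; with the notebook's data you would pass `train` instead.

```python
def find_bad_spans(dataset):
    """Return (id, answer text) pairs whose answer_start does not match the context."""
    bad = []
    for entry in dataset:
        context = entry["context"]
        for qa in entry["qas"]:
            for ans in qa["answers"]:
                start = ans["answer_start"]
                # The slice at answer_start must equal the answer text exactly.
                if context[start:start + len(ans["text"])] != ans["text"]:
                    bad.append((qa["id"], ans["text"]))
    return bad

# Stand-in data: answer "b" has an extra space that the context does not contain.
sample = [{"context": "The Final Empire,The Well of Ascension",
           "qas": [{"id": "a", "is_impossible": False, "question": "First book?",
                    "answers": [{"text": "The Final Empire", "answer_start": 0}]},
                   {"id": "b", "is_impossible": False, "question": "Both books?",
                    "answers": [{"text": "The Final Empire, The Well of Ascension",
                                 "answer_start": 0}]}]}]
print(find_bad_spans(sample))
```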

In [21]:
import json
with open(r"test.json", "r") as read_file:
    test = json.load(read_file)

In [22]:
test

[{'context': 'The series primarily takes place in a region called the Final Empire on a world called Scadrial, where the sun and sky are red, vegetation is brown, and the ground is constantly being covered under black volcanic ashfalls.',
  'qas': [{'id': '00001',
    'is_impossible': False,
    'question': 'Where does the series take place?',
    'answers': [{'text': 'region called the Final Empire', 'answer_start': 38},
     {'text': 'world called Scadrial', 'answer_start': 74}]}]},
 {'context': '"Mistings" have only one of the many Allomantic powers, while "Mistborns" have all the powers.',
  'qas': [{'id': '00002',
    'is_impossible': False,
    'question': 'How many powers does a Misting possess?',
    'answers': [{'text': 'one', 'answer_start': 21}]},
   {'id': '00003',
    'is_impossible': True,
    'question': 'What are Allomantic powers?',
    'answers': []}]}]

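#Implementation. Two classes are imported from the library:
1. QuestionAnsweringModel - handles training, evaluation, and prediction
2. QuestionAnsweringArgs - holds the configuration options

We need to specify the model type and model name used for the QA model.
Model names from the Hugging Face hub can be used for the supported model types.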

In [23]:
import logging
from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs

#I used the model type bert with the model name bert-base-cased.

In [24]:
model_type = "bert"
model_name = "bert-base-cased"

#Configure the model if it's needed.

In [25]:
model_args = QuestionAnsweringArgs()    #You can see the number of parameters that can be selected
model_args.train_batch_size = 32
model_args.evaluate_during_training = True
model_args.n_best_size = 4
model_args.num_train_epochs=5

#Parameters can also be set with a plain dictionary, as below; this dictionary is the configuration passed to the model in In [43]. It specifies where training output and the best model are saved. To see the training visualization, the wandb project needs to be defined.

In [42]:
### Advanced Methodology
train_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "use_cached_eval_features": True,
    "output_dir": f"outputs/{model_type}",
    "best_model_dir": f"outputs/{model_type}/best_model",
    "evaluate_during_training": True,
    "max_seq_length": 128,
    "num_train_epochs": 30,
    "evaluate_during_training_steps": 1000,
    "wandb_project": "Question Answer Application",
    "wandb_kwargs": {"name": model_name},
    "save_model_every_epoch": False,
    "save_eval_checkpoints": False,
    "n_best_size":3,
    "use_early_stopping": True,             #Can select early stopping too
    #"early_stopping_metric": "mcc",
    "train_batch_size": 32,
    "eval_batch_size": 64,
    # "config": {
    #     "output_hidden_states": True
    # }
}
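For intuition about what early stopping does, here is a generic patience-based check. It is a sketch of the general technique, not Simple Transformers' implementation: the names `should_stop`, `patience`, and `min_delta` are illustrative, and the library's own option names and defaults should be taken from its documentation.

```python
def should_stop(eval_losses, patience=3, min_delta=0.0):
    """Return True once the loss fails to improve for `patience` evaluations in a row."""
    best = float("inf")
    bad_rounds = 0
    for loss in eval_losses:
        if loss < best - min_delta:
            # New best value: reset the counter.
            best = loss
            bad_rounds = 0
        else:
            bad_rounds += 1
            if bad_rounds >= patience:
                return True
    return False

print(should_stop([1.0, 0.8, 0.7, 0.6]))         # loss keeps improving
print(should_stop([1.0, 0.8, 0.9, 0.85, 0.81]))  # three evaluations without a new best
```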

#Initializing the model

In [43]:
model = QuestionAnsweringModel(
    model_type,model_name, args=train_args
)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and a

#Simple Transformers works just like this: prepare the dataset and start the training.

In [44]:
model.train_model(train, eval_data=test)

convert squad examples to features:   0%|          | 0/4 [00:00<?, ?it/s]Could not find answer: 'The Final Empire,The Well of Ascension, and The Hero of Ages.' vs. 'The Final Empire, The Well of Ascension, and The Hero of Ages'
convert squad examples to features: 100%|██████████| 4/4 [00:00<00:00, 573.15it/s]
add example index and unique id: 100%|██████████| 4/4 [00:00<00:00, 13210.41it/s]


Epoch:   0%|          | 0/30 [00:00<?, ?it/s]


0,1
correct,▁
eval_loss,▁
global_step,▁
incorrect,▁
similar,▁
train_loss,▁

0,1
correct,1.0
eval_loss,0.19519
global_step,1.0
incorrect,2.0
similar,0.0
train_loss,4.89844


Running Epoch 0 of 30:   0%|          | 0/1 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

...

Running Epoch 29 of 30:   0%|          | 0/1 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

(30,
 {'global_step': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
  'correct': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'similar': [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
  'incorrect': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'train_loss': [4.9381513595581055, 4.920572757720947, 4.6673173904418945, 4.0491533279418945, 3.4817709922790527, 2.9677734375, 2.2874350547790527, 1.875244140625, 1.45833325

#You need a Weights & Biases (wandb) account to see the visualization of the training.

In [45]:
result, text = model.eval_model(test)

Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

In [46]:
result

{'correct': 0, 'similar': 3, 'incorrect': 0, 'eval_loss': -2.330078125}

Because the dataset is so small, 'correct' is 0: none of the three predictions matches its answer exactly, and all of them are counted as 'similar'. A much larger dataset is needed to train a usable QA model.
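The counts in the result are easier to compare across runs as fractions. A minimal sketch; `summarize` is a hypothetical helper, and the dictionary is copied from the output above.

```python
def summarize(result):
    """Convert the correct/similar/incorrect counts into fractions of all questions."""
    counts = {k: result[k] for k in ("correct", "similar", "incorrect")}
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}

result = {'correct': 0, 'similar': 3, 'incorrect': 0, 'eval_loss': -2.330078125}
print(summarize(result))
```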

In [47]:
text

{'correct_text': {},
 'similar_text': {'00001': {'truth': 'region called the Final Empire',
   'predicted': '',
   'question': 'Where does the series take place?'},
  '00002': {'truth': 'one',
   'predicted': '"Mistings" have only one of the many Allomantic powers, while "Mistborns" have all the powers',
   'question': 'How many powers does a Misting possess?'},
  '00003': {'truth': '',
   'predicted': '"Mistings" have only one of the many Allomantic powers, while "Mistborns" have all the powers',
   'question': 'What are Allomantic powers?'}},
 'incorrect_text': {}}
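One rule that would reproduce the grouping above is a substring comparison: an exact match is correct, one string contained in the other is similar, and anything else is incorrect. That this is how the library assigns the categories is an assumption; the sketch below only shows that the rule agrees with the three entries in this output, including the empty prediction for 00001 (an empty string is contained in every string).

```python
def categorize(truth, predicted):
    """Exact match -> correct; one string contained in the other -> similar; else incorrect."""
    truth, predicted = truth.strip(), predicted.strip()
    if truth == predicted:
        return "correct"
    if truth in predicted or predicted in truth:
        return "similar"
    return "incorrect"

# (truth, predicted) pairs copied from the output above.
sentence = '"Mistings" have only one of the many Allomantic powers, while "Mistborns" have all the powers'
pairs = {
    "00001": ("region called the Final Empire", ""),
    "00002": ("one", sentence),
    "00003": ("", sentence),
}
for qid, (truth, predicted) in pairs.items():
    print(qid, categorize(truth, predicted))
```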

# Make predictions with the model

In [48]:
to_predict = [
    {
        "context": "Vin is a Mistborn of great power and skill.",
        "qas": [
            {
                "question": "What is Vin's speciality?",
                "id": "0",
            }
        ],
    }
]

In [49]:
answers, probabilities = model.predict(to_predict)

print(answers)


convert squad examples to features: 100%|██████████| 1/1 [00:00<00:00, 3218.96it/s]
add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 3412.78it/s]


Running Prediction:   0%|          | 0/1 [00:00<?, ?it/s]

[{'id': '0', 'answer': ['great power and skill', 'great power', 'and skill']}]
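Each prediction carries a list of candidate answers. Assuming the candidates are ordered from most to least likely, the first one can be taken as the final answer. `top_answers` is a hypothetical helper; the input is copied from the output above.

```python
answers = [{'id': '0', 'answer': ['great power and skill', 'great power', 'and skill']}]

def top_answers(answers):
    """Map each question id to its first candidate, or None if there are no candidates."""
    return {a["id"]: (a["answer"][0] if a["answer"] else None) for a in answers}

print(top_answers(answers))
```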


#Because it's a small dataset, I had to train for many epochs to get a correct prediction for this question. The same approach can be applied to a larger dataset.