<a id = 'top'></a>

#  A quick-start guide to BERT-like models with Hugging Face Transformers
  * A. [What is BERT?](#introBERT) 
  * B. [Pipelines](#pipelines)    
      * 1. [Sentiment Classification](#sentimentClass)
      * 2. [Token Classification (Named Entity Recognition, Part-of-Speech tagging)](#tokenClass)
      * 3. [Question-Answering](#questionAnswer)
      * 4. [Masked Language Modeling](#MLModel)
      * 5. [Translation](#translation)

Hugging Face is a company that offers a library of "transformers" as well as pre-trained models geared toward a variety of tasks.  We are going to explore several ways of working with these models at a very high level.  In later classes, when we have covered how a transformer works, we'll come back and look at them at a deeper level.  This tutorial is designed to look at the Hugging Face library at the same level as Keras rather than at the lower level of TensorFlow.  We'll take full advantage of a number of abstract classes they've created to facilitate using their models.

Note that Hugging Face supports TensorFlow and an alternative called PyTorch.  The default framework for Hugging Face is PyTorch.  They have recently begun porting many of their models to TensorFlow.  When using Hugging Face, just pay attention to which version you're using.  When the model you're using is TensorFlow, the model name often begins with TF, as in TFBert or TFDistilBert.  If it doesn't have a TF at the beginning of the model name, it is using PyTorch.


---

This directory demonstrates three different uses of the Hugging Face library separately, because the classes and abstractions involved are incompatible with each other.

[Return to Top](#top)
 <a id = 'introBERT'></a>
# What is BERT?
This notebook leverages one of a variety of BERT models.  BERT models can be classified in terms of three parts.  The first part is a component named a transformer.  These can grow to be quite large.  The second part is the training it already has on language.  The third part is the tasks it is geared toward performing.  Different models will use different size transformers and may be optimized for different languages and different tasks.  For example, CamemBERT is trained on French text and SciBERT is trained on scientific journal articles.  You'll want to make sure you use a model appropriate to your language and task.

---

The [Hugging Face website](https://huggingface.co/transformers) offers an interesting set of resources.  Their [model documentation](https://huggingface.co/transformers/model_summary.html) provides an excellent explanation of transformers as well as the growing variety of models they offer (see the left-hand navigation column).  In addition, their collection of [notebooks](https://huggingface.co/transformers/notebooks.html) is a valuable set of examples.

---

One word of caution:  this is a rapidly evolving resource and, as a result, you can often run into bugs.  They will get fixed eventually, but things may be buggy for a while.

In [None]:
!pip install -q transformers
#!pip install transformers


[Return to Top](#top)
 <a id = 'pipelines'></a>
# Pipelines

In [1]:
from __future__ import print_function
import ipywidgets as widgets
from transformers import pipeline

The pipeline interface provides a very abstract and simple API that allows you to experiment with several different NLP tasks without having to do any training at all.  These can be useful if you have a limited set of tests or experiments you want to try.  Some of the supported tasks include:


*   Sentiment Classification
*   Token Classification (Named Entity Recognition, Part-of-Speech tagging)
*   Question-Answering
*   Masked Language Modeling
*   Translation


In its simplest and most abstract form we will use two commands.  First, instantiate a pipeline object and specify the task.  Second, feed the pipeline the appropriate input and get an answer.

[Return to Top](#top)
 <a id = 'sentimentClass'></a>
### Sentiment Analysis

Sentiment analysis takes sentences as input and classifies each one into either two categories -- positive and negative -- or three categories -- positive, negative, and neutral -- depending on the sentiment expressed in the sentence.

In [7]:
nlp_sentence_classif = pipeline('sentiment-analysis')
nlp_sentence_classif('This NLP stuff is very cool !') #a very positive statement should yield a high positive score

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are 

[{'label': 'POSITIVE', 'score': 0.9998658895492554}]

These two lines of code hide a lot of what's happening under the hood.  The input text is converted into tokens that the underlying model can understand.  The model is called with that set of tokens.  The result is converted back into a label or text that can be understood by a user.
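
The three hidden steps can be sketched in plain Python.  This is a toy stand-in, not the real DistilBERT model: the vocabulary `VOCAB`, the weights `WEIGHTS`, and the function names are all hypothetical values invented for illustration.  Only the overall shape (text to token ids, token ids to raw scores, raw scores to a label and score) mirrors what the pipeline does.

```python
import math

# Hypothetical toy vocabulary and per-token weights (illustration only).
VOCAB = {"[UNK]": 0, "this": 1, "is": 2, "very": 3, "cool": 4, "bad": 5}
# One (negative, positive) weight pair per token id.
WEIGHTS = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (-2.0, 2.0), (2.0, -2.0)]
LABELS = ["NEGATIVE", "POSITIVE"]


def tokenize(text):
    """Step 1: convert raw text into integer token ids."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]


def toy_model(token_ids):
    """Step 2: map token ids to one raw score (logit) per label."""
    negative = sum(WEIGHTS[i][0] for i in token_ids)
    positive = sum(WEIGHTS[i][1] for i in token_ids)
    return [negative, positive]


def postprocess(logits):
    """Step 3: softmax the logits and report the most probable label."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, returning a list like the real pipeline does."""
    return [postprocess(toy_model(tokenize(text)))]


result = toy_pipeline("This is very cool")
print(result)
```

The real pipeline replaces the toy tokenizer with a word-piece tokenizer and the toy weights with a trained transformer, but the final softmax-and-pick-a-label step is conceptually the same.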


[Return to Top](#top)
<a id = 'tokenClass'></a>
### Token Classification 

Token classification is a task where each token is assigned a label (i.e., classified).  For example, part-of-speech tagging might assign a label of article, noun, adjective, preposition, verb, or other to each token.  Named entity recognition assigns a label to each token in the token stream to identify a variety of different "entities" mentioned in the text.

In [None]:
nlp_token_class = pipeline('ner')
nlp_token_class('The iSchool is a part of UC Berkeley in the state of California.')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


[{'end': 5,
  'entity': 'I-ORG',
  'index': 2,
  'score': 0.98272264,
  'start': 4,
  'word': 'i'},
 {'end': 6,
  'entity': 'I-ORG',
  'index': 3,
  'score': 0.9763639,
  'start': 5,
  'word': '##S'},
 {'end': 9,
  'entity': 'I-ORG',
  'index': 4,
  'score': 0.9488824,
  'start': 6,
  'word': '##cho'},
 {'end': 11,
  'entity': 'I-ORG',
  'index': 5,
  'score': 0.9534884,
  'start': 9,
  'word': '##ol'},
 {'end': 27,
  'entity': 'I-ORG',
  'index': 10,
  'score': 0.9975151,
  'start': 25,
  'word': 'UC'},
 {'end': 36,
  'entity': 'I-ORG',
  'index': 11,
  'score': 0.9888736,
  'start': 28,
  'word': 'Berkeley'},
 {'end': 63,
  'entity': 'I-LOC',
  'index': 16,
  'score': 0.99658054,
  'start': 53,
  'word': 'California'}]
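
Notice that the output labels word pieces ("i", "##S", "##cho", "##ol") rather than whole words.  A minimal sketch of one way to merge them, assuming that pieces with consecutive `index` values and the same entity type belong together, is shown below.  The `tokens` list is a trimmed copy of the output above (scores omitted) and `merge_entities` is a hypothetical helper, not part of the library.  Recent versions of the library also accept an `aggregation_strategy` argument to the pipeline for this purpose, if your installed version supports it.

```python
text = "The iSchool is a part of UC Berkeley in the state of California."

# Trimmed copy of the pipeline output shown above (scores omitted).
tokens = [
    {"entity": "I-ORG", "index": 2, "start": 4, "end": 5},
    {"entity": "I-ORG", "index": 3, "start": 5, "end": 6},
    {"entity": "I-ORG", "index": 4, "start": 6, "end": 9},
    {"entity": "I-ORG", "index": 5, "start": 9, "end": 11},
    {"entity": "I-ORG", "index": 10, "start": 25, "end": 27},
    {"entity": "I-ORG", "index": 11, "start": 28, "end": 36},
    {"entity": "I-LOC", "index": 16, "start": 53, "end": 63},
]


def merge_entities(text, tokens):
    """Merge adjacent same-type pieces and slice the entity text out of `text`."""
    merged = []
    for tok in sorted(tokens, key=lambda t: t["index"]):
        label = tok["entity"].split("-")[-1]  # "I-ORG" -> "ORG"
        last = merged[-1] if merged else None
        if last and last["label"] == label and tok["index"] == last["last_index"] + 1:
            # Same entity continues: extend the character span.
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
        else:
            merged.append({"label": label, "start": tok["start"],
                           "end": tok["end"], "last_index": tok["index"]})
    return [(text[m["start"]:m["end"]], m["label"]) for m in merged]


entities = merge_entities(text, tokens)
print(entities)
```

The `start` and `end` values are character offsets into the input string, which is why slicing `text` recovers the original spelling without the `##` markers.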

[Return to Top](#top)
<a id = 'questionAnswer'></a>
### Question Answering 

The question answering task tries to identify the answer to a question within a context paragraph that is fed into the system along with the question.  One formulation seeks to do this by tagging the tokens in the context paragraph as being outside the answer span or inside the answer span.  As noted, the question answering task requires two inputs:


*   The context paragraph
*   The question to be answered



In [None]:
nlp_question_answer = pipeline('question-answering')
nlp_question_answer(context='The iSchool is a part of UC Berkeley in the state of California.', question='In which state is the iSchool located ?')

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'answer': 'California', 'end': 63, 'score': 0.9833863973617554, 'start': 53}
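
The `start` and `end` values in the result are character offsets into the context string.  The sketch below checks that and then derives the inside/outside view described earlier.  As a simplifying assumption it tags whitespace-separated words, whereas the real model works on word-piece tokens; `tag_words` is a hypothetical helper written for this illustration.

```python
context = "The iSchool is a part of UC Berkeley in the state of California."
# Copy of the pipeline result shown above (score omitted).
answer = {"answer": "California", "start": 53, "end": 63}

# Slicing the context with the offsets recovers the answer text.
span = context[answer["start"]:answer["end"]]


def tag_words(context, start, end):
    """Tag each whitespace-separated word 'I' (inside the span) or 'O' (outside)."""
    tags = []
    pos = 0
    for word in context.split():
        begin = context.index(word, pos)  # search from the end of the previous word
        finish = begin + len(word)
        overlaps = begin < end and finish > start
        tags.append((word, "I" if overlaps else "O"))
        pos = finish
    return tags


tags = tag_words(context, answer["start"], answer["end"])
inside = [word for word, tag in tags if tag == "I"]
print(span)
print(inside)
```

The inside word keeps its trailing period because this sketch splits only on whitespace; a word-piece tokenizer would treat the period as its own token.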

[Return to Top](#top)
<a id = 'MLModel'></a>
### Masked Language Modeling 

The masked language modeling task is like a fill-in-the-blank test.  You provide a sentence in which one word is "masked".  The model then provides a set of candidate answers -- words that could fill in the blank, arranged in order from highest to lowest probability.

In [None]:
nlp_mlm = pipeline('fill-mask')
nlp_mlm('UC Berkeley is located in ' + nlp_mlm.tokenizer.mask_token)


No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


[{'score': 0.38975071907043457,
  'sequence': 'UC Berkeley is located in Berkeley',
  'token': 10817,
  'token_str': ' Berkeley'},
 {'score': 0.10624547302722931,
  'sequence': 'UC Berkeley is located in Oakland',
  'token': 5147,
  'token_str': ' Oakland'},
 {'score': 0.08518155664205551,
  'sequence': 'UC Berkeley is located in California',
  'token': 886,
  'token_str': ' California'},
 {'score': 0.03185349702835083,
  'sequence': 'UC Berkeley is located in Irvine',
  'token': 20738,
  'token_str': ' Irvine'},
 {'score': 0.026199573650956154,
  'sequence': 'UC Berkeley is located in Sacramento',
  'token': 7759,
  'token_str': ' Sacramento'}]
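
The result is an ordinary list of dictionaries, so it can be inspected with plain Python.  The sketch below uses a trimmed copy of the output above to confirm the ordering, pick the top candidate, and add up how much probability the five candidates cover.  The variable names are illustrative only.

```python
# Trimmed copy of the fill-mask output shown above.
candidates = [
    {"score": 0.38975071907043457, "token_str": " Berkeley"},
    {"score": 0.10624547302722931, "token_str": " Oakland"},
    {"score": 0.08518155664205551, "token_str": " California"},
    {"score": 0.03185349702835083, "token_str": " Irvine"},
    {"score": 0.026199573650956154, "token_str": " Sacramento"},
]

scores = [c["score"] for c in candidates]

# Candidates arrive sorted from highest to lowest probability.
is_sorted = scores == sorted(scores, reverse=True)

# token_str carries a leading space from the tokenizer, so strip it.
best_word = max(candidates, key=lambda c: c["score"])["token_str"].strip()

# The five scores sum to less than 1: the remaining probability
# is spread over the rest of the vocabulary.
covered = sum(scores)

print(is_sorted, best_word, round(covered, 3))
```

Because only the top candidates are returned, a low total here signals that the model is unsure about the blank, not that something went wrong.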

[Return to Top](#top)
<a id = 'translation'></a>
### Translation 

The translation task supported by Hugging Face Pipelines takes as input a sentence in English and emits a translation in the specified language -- in this case French.  The default pipeline model supports only a very limited set of translation language pairs.  If you want to translate between other languages, then you need either a different pre-trained model that covers those languages or to train a model yourself.

In [None]:
translator = pipeline('translation_en_to_fr')
translator("I love studying NLP in the MIDS program .")
#translator("J'aime bien etudier la NLP dans le programme MIDS .")

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)


[{'translation_text': "J'aime étudier la NLP dans le programme MIDS ."}]

In [11]:
trans_ch = pipeline('translation_en_to_fr')
trans_ch("hello")

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


[{'translation_text': 'Bonjour'}]