# AAI612: Deep Learning & its Applications

*Notebook 3.3: Practice with HuggingFace*

<a href="https://colab.research.google.com/github/OmarMlaeb/AAI612_Malaeb/blob/master/Week%203/Notebook3.3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Experiment with Hugging Face Transformers

In [11]:
text = """Having served on the COVID Vaccine Development committee at Moderna, USA, \
    Dr. Nader was involved in the fight against the pandemic of the century. As \
    the race was on to develop a vaccine – the ultimate defense against a virus \
    of which little was known – what helped to expedite the process at the \
    pharmaceutical and biotechnology company was the availability of the \
    technology – messenger RNA – which had been 10 years in the making.\
    The development of vaccines in record time encapsulates the prerequisites \
    for discovery: research, technology, anticipation and inquiring minds, skills \
    that should be fostered in education."""

### Text Completion

After you run the code below, look at the `score` field in the output. The higher the score, the higher the probability the model assigns to that completion.

In [12]:
from transformers import pipeline

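# Build a fill-mask pipeline backed by BERT
bert_unmasker = pipeline('fill-mask', model="bert-base-uncased")
# Note: this reassigns `text`, so the later cells that use `text` operate on
# this sentence rather than on the paragraph defined at the top of the notebook.
text = "I have to wake up in the morning and [MASK] a doctor"
result = bert_unmasker(text)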
for r in result:
    print(r)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


{'score': 0.6457526087760925, 'token': 2156, 'token_str': 'see', 'sequence': 'i have to wake up in the morning and see a doctor'}
{'score': 0.17833702266216278, 'token': 2655, 'token_str': 'call', 'sequence': 'i have to wake up in the morning and call a doctor'}
{'score': 0.07508129626512527, 'token': 2424, 'token_str': 'find', 'sequence': 'i have to wake up in the morning and find a doctor'}
{'score': 0.05682728812098503, 'token': 2131, 'token_str': 'get', 'sequence': 'i have to wake up in the morning and get a doctor'}
{'score': 0.006895779632031918, 'token': 2022, 'token_str': 'be', 'sequence': 'i have to wake up in the morning and be a doctor'}
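
Because each `score` is a probability, the candidates can be ranked and compared directly. A minimal sketch, using the scores printed above copied by hand and rounded (the variable names are illustrative and not part of the pipeline API):

```python
# Scores copied from the fill-mask output above (rounded); names are illustrative.
candidates = [
    {"token_str": "see", "score": 0.6458},
    {"token_str": "call", "score": 0.1783},
    {"token_str": "find", "score": 0.0751},
    {"token_str": "get", "score": 0.0568},
    {"token_str": "be", "score": 0.0069},
]

# The most likely completion is the candidate with the highest score.
best = max(candidates, key=lambda c: c["score"])

# Only the top five tokens are listed, so their scores sum to less than 1;
# the remainder is spread over the rest of the vocabulary.
covered = sum(c["score"] for c in candidates)

print(best["token_str"], round(covered, 4))
```

The gap between `covered` and 1 is the probability mass the model gives to every token that did not make the top five.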


### Text Classification

The code below classifies the sentiment of the text above as positive or negative. Can you change the input so that the predicted label changes?

In [13]:
#hide_output
from transformers import pipeline

classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

Device set to use cpu


In [14]:
import pandas as pd

outputs = classifier(text)
pd.DataFrame(outputs)

| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.989723 |
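
For a two-label model such as this one, the reported score is typically the softmax of the model's raw outputs (logits), so it reads as a probability. The sketch below shows that step in plain Python. The logits are hypothetical values chosen for illustration, not numbers taken from the model, and `softmax` is a hand-written helper rather than a library call:

```python
import math

def softmax(logits):
    """Turn raw model outputs (logits) into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence; real values come from the model.
labels = ["NEGATIVE", "POSITIVE"]
logits = [2.5, -2.0]

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
print({"label": labels[best], "score": round(probs[best], 4)})
```

A wider gap between the two logits pushes the winning score closer to 1, which is why confident predictions show scores such as 0.99.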


### Named Entity Recognition

Named entity recognition (NER) detects and categorizes spans of text known as named entities. Named entities are the key subjects of a piece of text, such as people, locations, companies, events and products, as well as themes, topics, times, monetary values and percentages.
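
A token-level tagger labels each token separately, and the per-token labels are then merged into entity spans. The sketch below illustrates that merging step in plain Python. It is a simplified illustration, not the library's implementation; `group_entities` is a hypothetical helper, and the tags are made up for the example rather than real model predictions:

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens that share an entity type into one span.

    A simplified illustration of entity aggregation, not the library's code.
    """
    entities = []
    current = None
    for word, tag in tagged_tokens:
        if tag == "O":              # "O" marks tokens outside any entity
            current = None
            continue
        label = tag.split("-")[-1]  # "B-PER" / "I-PER" -> "PER"
        continues = (
            current is not None
            and current["entity_group"] == label
            and not tag.startswith("B-")  # "B-" always starts a new entity
        )
        if continues:
            current["word"] += " " + word
        else:
            current = {"entity_group": label, "word": word}
            entities.append(current)
    return entities

# Hypothetical per-token tags for a fragment of the text above.
tagged = [
    ("COVID", "I-MISC"), ("Vaccine", "I-MISC"), ("Development", "I-MISC"),
    ("committee", "O"), ("at", "O"), ("Moderna", "I-ORG"), (",", "O"),
    ("USA", "I-LOC"),
]
for entity in group_entities(tagged):
    print(entity)
```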

In [15]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu




### Question Answering

In [16]:
reader = pipeline("question-answering")
question = "What was Dr. Nader involved in?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.576397 | 44 | 52 | a doctor |
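
The `start` and `end` values are character offsets into the context, so the answer can be recovered by slicing. A short check, assuming the context is the one-sentence string that `text` held when this cell ran (the values are copied by hand from the output above):

```python
# The sentence that `text` held when the question-answering cell ran.
context = "I have to wake up in the morning and [MASK] a doctor"

# Values copied from the output above.
output = {"score": 0.576397, "start": 44, "end": 52, "answer": "a doctor"}

# `start` is inclusive and `end` is exclusive, matching Python slicing.
span = context[output["start"]:output["end"]]
print(span)
```

The offsets fall inside the short masked sentence, which shows that the reader answered from that sentence and not from the paragraph about Dr. Nader.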


### Summarization

In [17]:
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
Your min_length=56 must be inferior than your max_length=45.
Your max_length is set to 45, but your input_length is only 17. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=8)


 "I have to wake up in the morning and [MASK] a doctor," she says. She says she has to be a doctor every day to get up and go to the doctor. "I'm
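
The warnings above report that the default `min_length` (56) exceeds the requested `max_length` (45), and that the input is only 17 tokens long. One way to avoid such conflicts is to derive both limits from the input size. The helper below is a hypothetical sketch: `pick_summary_lengths` and its default ratios are assumptions made for illustration, not part of the Transformers API:

```python
def pick_summary_lengths(input_tokens, max_ratio=0.5, min_ratio=0.2):
    """Hypothetical helper: choose min/max summary lengths from the input size.

    Keeps 1 <= min_length < max_length < input_tokens whenever the input
    has at least three tokens.
    """
    if input_tokens < 3:
        raise ValueError("input is too short to summarize")
    max_length = max(2, min(input_tokens - 1, int(input_tokens * max_ratio)))
    min_length = max(1, min(max_length - 1, int(input_tokens * min_ratio)))
    return {"min_length": min_length, "max_length": max_length}

lengths = pick_summary_lengths(17)  # 17 is the input_length reported in the warning
print(lengths)
```

The result could then be passed along as `summarizer(text, **lengths)`. Whether these ratios produce good summaries has not been tested here.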


### Translation

The commented-out code below uses a German translation model. Can you change it to French? Google will be your best friend in this task :-)

In [18]:
# translator = pipeline("translation_en_to_de",
#                       model="Helsinki-NLP/opus-mt-en-de")
# outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
# print(outputs[0]['translation_text'])

translator_fr = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
outputs_fr = translator_fr(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs_fr[0]['translation_text'])

Device set to use cpu


Je dois me réveiller le matin et [MASK] un médecin. Je dois me réveiller le matin et [MASK] un médecin. Je dois me réveiller le matin. Je dois me réveiller le matin et [MASK] un médecin. Je dois me réveiller le matin. Je dois me réveiller le matin. Je dois me réveiller le matin. Je dois me réveiller le matin. Je dois me réveiller le matin. Je dois me réveiller le matin. Je dois me réveiller le matin. Je dois me réveiller le matin.
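
Switching from German to French in the cell above only required changing the language codes in the task name and in the checkpoint name. The sketch below builds both strings from a pair of codes. `opus_mt_names` is a hypothetical helper, and it assumes the `opus-mt-<source>-<target>` naming pattern seen in this notebook; not every language pair has a published checkpoint, so check the model hub before relying on a generated name:

```python
def opus_mt_names(source, target):
    """Build the task and checkpoint names for a Helsinki-NLP OPUS-MT model.

    Assumes the `opus-mt-<source>-<target>` naming pattern seen in this
    notebook; not every language pair has a published checkpoint.
    """
    task = f"translation_{source}_to_{target}"
    model = f"Helsinki-NLP/opus-mt-{source}-{target}"
    return task, model

for target in ("de", "fr"):
    print(opus_mt_names("en", target))
```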


### Text Generation

In [19]:
#hide
from transformers import set_seed
set_seed(42) # Set the seed to get reproducible results
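
`set_seed` fixes the seeds of the random number generators that sampling draws from, so repeated runs produce the same text. The idea can be illustrated with a toy sampler built on Python's `random` module. `sample_tokens` and the vocabulary are made up for illustration and are unrelated to how the generation model actually samples:

```python
import random

def sample_tokens(seed, vocabulary, n=5):
    """Draw n tokens at random after fixing the seed (a toy stand-in for sampling)."""
    rng = random.Random(seed)
    return [rng.choice(vocabulary) for _ in range(n)]

vocabulary = ["vaccine", "research", "doctor", "virus", "science"]  # toy vocabulary

first = sample_tokens(42, vocabulary)
second = sample_tokens(42, vocabulary)
print(first == second)  # True: same seed, same draws
```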

In [20]:
generator = pipeline("text-generation")
response = "Dear Dr. Nader, Thank you for working on the vaccine."
prompt = text + "\n\nResponse to the story:\n" + response
outputs = generator(prompt, max_length=500)
print(outputs[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I have to wake up in the morning and [MASK] a doctor

Response to the story:
Dear Dr. Nader, Thank you for working on the vaccine. It has been very frustrating being a member of a medical system that lacks critical patient care -- I am so happy to hear your support for such a difficult vaccine. Sincerely,


Michael Doss,

Associate Professor of Medicine and Science of Medicine and Department of Medicine in the Department of Pediatrics and Neurology at the University of Illinois, College Park, Illinois

The first thing we're saying is that Dr. Nader needs to see an entire review of this. We just discovered it on Sunday. I am writing to you because the FDA will hold hearings on Monday. While you are writing what I can say, I want to share the facts to show you that you have serious risks posed by the vaccine with the CDC. Please see our video.

First, I am concerned about your concerns because as a pediatrician at Johns Hopkins, I am exposed to the same viruses that are most seriously as