# Author: Pooja Bhojwani and Dhanush Dharmaretnam
"""
This notebook gives a basic introduction to using attention-based models such as BERT and GPT-2 for various NLP tasks. We will mostly use the libraries below:
1. spaCy
2. Hugging Face Transformers (TensorFlow and PyTorch)
3. Scikit-Learn

PART A

Basic NLP tasks we focus on

1. Named Entity Recognition
2. Question Answering
3. Sentiment Analysis
4. Text Summarization
5. Text Generation
6. Machine Translation
7. Missing Word Prediction
8. Conversations


PART B

Model fine-tuning and transfer learning examples

1. NER with data labelling using Doccano
2. Sentiment Classification
"""

**PART A , TASK 1 : NAMED ENTITY RECOGNITION**

*So what is named entity recognition (NER)?*

---




> ![](https://drive.google.com/uc?export=view&id=1h94GYBi9Er1qZ4hyE7ABNTHr79SXBOQ3)






* Named entity extraction consists of extracting words or strings of interest from sentences and then categorizing them. In the example above, we extract categories such as person, date, device and product.

* NER plays an important role in extracting knowledge from large text corpora such as news articles, emails, legal documents and medical transcripts. The extracted entities help machines understand the text.


Focus:

*   Named Entity Recognition using Transformers.
*   Named Entity Recognition using Spacy.



Ref:
1. https://www.innoplexus.com/blog/what-is-named-entity-recognition/
2. https://huggingface.co/
3. https://github.com/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb

In [None]:
# Let's start with a pretrained spaCy model.
# The core spaCy library does not use transformer models, but its pipeline architecture can support them. spaCy is also developing spacy-transformers, but the Hugging Face library supports more models and NLP tasks.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import spacy
import spacy.cli
from spacy import displacy
import gc
 
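# Download the pretrained English spaCy model. Models for other languages are available as well.
# The full package name is used because the "en" shortcut is not available in spaCy 3 and later.
spacy.cli.download("en_core_web_sm")

# perform NER
nlp = spacy.load("en_core_web_sm")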

sequence = "President Donald Trump's efforts to deny the outcome of the 2020 election cannot change an undeniable reality: Joe Biden won decisively, and his lead nationally and in key states has grown over time as more votes have been counted. \
       President-elect Biden is likely to end up over 5 million votes ahead of president Trump in the popular vote when all the counting is done. He'll get about or above 80 million votes -- by far the most of any presidential candidate in history. \
       In the electoral college, Biden looks to be on his way to earning 306 electoral votes. That's about 57% of all the electoral votes available and will be good enough for a 74 electoral vote margin over the sitting President. And let's be clear, \
       the chance of a recount overturning the results in 2020 is basically nothing. Fairvote has looked at statewide recounts since 2000. The average shift in votes has been a mere 430 votes and 0.02 points. \
       The largest shift in votes was a little less than 2,600 and 0.11 points."
doc = nlp(sequence)
for ent in doc.ents:
    print(ent.text, ent.label_)
displacy.render(doc, style='ent', jupyter=True)

del nlp
gc.collect()

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')
Donald Trump PERSON
2020 DATE
Joe Biden PERSON
Biden PERSON
up over 5 million CARDINAL
Trump PERSON
above 80 million CARDINAL
Biden PERSON
306 CARDINAL
about 57% PERCENT
74 CARDINAL
2020 DATE
2000 DATE
430 CARDINAL
0.02 CARDINAL
0.11 CARDINAL


2599

# Now let's move on to the Hugging Face library
* Many large language models, such as DistilBERT, BERT, RoBERTa and ALBERT, are available already fine-tuned in the Hugging Face library.

* We can select a specific model along with its tokenizer, or use pipelines directly.

* The list of all models can be found here: https://huggingface.co/models
* NER models are usually fine-tuned on the tagged CoNLL-2003 dataset.

> https://medium.com/analytics-vidhya/fine-tuning-bert-for-ner-on-conll-2003-dataset-with-tf-2-2-0-2f242ca2ce06



Here is an example of using pipelines for named entity recognition, specifically to classify each token as belonging to one of 9 classes:

O, Outside of a named entity

B-MISC, Beginning of a miscellaneous entity right after another miscellaneous entity

I-MISC, Miscellaneous entity

B-PER, Beginning of a person’s name right after another person’s name

I-PER, Person’s name

B-ORG, Beginning of an organisation right after another organisation

I-ORG, Organisation

B-LOC, Beginning of a location right after another location

I-LOC, Location
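
To make the tagging scheme concrete, here is a minimal sketch in plain Python of how per-token tags can be grouped into entity spans. The helper name `group_entities` and the sample tokens and tags are illustrative assumptions, not part of the transformers API; the pipeline's `grouped_entities=True` option performs a comparable grouping for you.

```python
def group_entities(tokens, tags):
    """Group consecutive tokens that share an entity type into spans.

    A span ends when the tag is "O", when the entity type changes,
    or when a "B-" tag marks the start of a new entity of the same type.
    """
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, ent_type = tag.partition("-")
        starts_new = tag == "O" or prefix == "B" or ent_type != current_type
        if starts_new and current_tokens:
            # Close the span that was being built
            spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if tag != "O":
            current_tokens.append(token)
            current_type = ent_type
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Made-up sample input, for illustration only
tokens = ["Joe", "Biden", "beat", "Donald", "Trump", "in", "Georgia"]
tags = ["I-PER", "I-PER", "O", "I-PER", "I-PER", "O", "I-LOC"]
print(group_entities(tokens, tags))
# [('Joe Biden', 'PER'), ('Donald Trump', 'PER'), ('Georgia', 'LOC')]
```

The `B-` prefix only matters when two entities of the same type sit next to each other; without it, the two would be merged into one span.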





In [2]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/3a/83/e74092e7f24a08d751aa59b37a9fc572b2e4af3918cb66f7766c3affb1b4/transformers-3.5.1-py3-none-any.whl (1.3MB)
Collecting tokenizers==0.9.3
  Downloading https://files.pythonhosted.org/packages/4c/34/b39eb9994bc3c999270b69c9eea40ecc6f0e97991dba28282b9fd32d44ee/tokenizers-0.9.3-cp36-cp36m-manylinux1_x86_64.whl (2.9MB)
Collecting sentencepiece==0.1.91
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)

In [None]:
from transformers import pipeline

nlp = pipeline("ner", grouped_entities=True)
output = nlp(sequence)

for word in output:
  print (word['word'], word['entity_group'])

Donald Trump PER
Joe Biden PER
Biden PER
Trump PER
Biden PER
Fairvote ORG


In [None]:
nlp.model

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1e-1

In [None]:
# If you want to know what's happening behind the pipelines
# Ref: https://stackoverflow.com/questions/60937617/how-to-reconstruct-text-entities-with-hugging-faces-transformers-pipelines-with

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import gc

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

label_list = [
    "O",       # Outside of a named entity
    "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    "I-MISC",  # Miscellaneous entity
    "B-PER",   # Beginning of a person's name right after another person's name
    "I-PER",   # Person's name
    "B-ORG",   # Beginning of an organisation right after another organisation
    "I-ORG",   # Organisation
    "B-LOC",   # Beginning of a location right after another location
    "I-LOC"    # Location
]

# Bit of a hack to get the tokens together with the special tokens ([CLS], [SEP])
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

# Inference only, so gradients are not needed
with torch.no_grad():
    outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

# Print every token that was not tagged "O" (outside of a named entity)
for token, prediction in zip(tokens, predictions[0].tolist()):
    if label_list[prediction] == 'O':
        continue
    print(token, label_list[prediction])

del model, tokenizer
gc.collect()
# The results below differ from the pipeline output mainly because BERT splits the text into WordPiece tokens.

Donald I-PER
Trump I-PER
Joe I-PER
B I-PER
##iden I-PER
B I-PER
##iden I-PER
Trump I-PER
B I-PER
##iden I-PER
Fair I-ORG
##vo I-ORG
##te I-ORG


3863
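
The `##` prefix in the output above marks a WordPiece continuation: `B` and `##iden` are two pieces of the single word `Biden`. Below is a minimal sketch of gluing the pieces back together before reading the labels. `merge_wordpieces` is a hypothetical helper name, and keeping the label of the first piece is one simple convention assumed here, not necessarily what the pipeline does internally.

```python
def merge_wordpieces(tokens, labels):
    """Merge WordPiece continuation tokens ("##...") into whole words.

    Each merged word keeps the label of its first piece (an assumed convention).
    """
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            # Continuation piece: append it to the previous word without the "##"
            words[-1] += token[2:]
        else:
            words.append(token)
            word_labels.append(label)
    return list(zip(words, word_labels))


# Pieces and labels copied from the output above
pieces = ["Joe", "B", "##iden", "Fair", "##vo", "##te"]
labels = ["I-PER", "I-PER", "I-PER", "I-ORG", "I-ORG", "I-ORG"]
print(merge_wordpieces(pieces, labels))
# [('Joe', 'I-PER'), ('Biden', 'I-PER'), ('Fairvote', 'I-ORG')]
```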

In [None]:
# How to use a specific model from the model list in your pipeline
nlp = pipeline("ner", grouped_entities=True, model="dbmdz/bert-large-cased-finetuned-conll03-english")
output = nlp(sequence)

for word in output:
  print (word['word'], word['entity_group'])


del nlp
gc.collect()


Donald Trump PER
Joe Biden PER
Biden PER
Trump PER
Biden PER
Fairvote ORG


In [None]:
# What about a different language?

In [None]:
# Use a Swedish NER model from the model list in the pipeline
nlp = pipeline("ner", grouped_entities=True, model="KB/bert-base-swedish-cased-ner")
swedish_sequence = '”President Donald Trumps ansträngningar att förneka resultatet av valet 2020 kan inte förändra en obestridlig verklighet: Joe Biden vann avgörande, och hans ledning nationellt och i nyckelstater har ökat med tiden när fler röster har räknats. \
       Den utvalda presidenten Biden kommer sannolikt att hamna över 5 miljoner röster före president Trump i den populära omröstningen när all räkning är klar. Han får ungefär 80 miljoner röster - överlägset mest av alla presidentkandidater i historien. \
       I valkollegiet ser Biden ut att vara på väg att tjäna 306 rösträtter. Det är ungefär 57% av alla tillgängliga röster och kommer att vara tillräckligt bra för 74 röstmarginal över den sittande presidenten. Och låt oss vara tydliga, \
       chansen att en omräkning förvandlar resultaten 2020 är i princip ingenting. Fairvote har tittat på beräkningar över hela landet sedan 2000. Det genomsnittliga skiftet i röster har varit bara 430 röster och 0,02 poäng.'
output = nlp(swedish_sequence)

for word in output:
  print (word['word'], word['entity_group'])


del nlp
gc.collect()


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Donald Trumps PER
2020 TME
Joe Biden PER
med tiden TME
Biden PER
Trump PER
Biden PER
ungefär 57 % MSR
Fairvote PER
sedan 2000 TME


421

# **PART A, TASK 2 : Question Answering**




---

In [None]:
text = 'Coronavirus disease 2019 (COVID-19) is a contagious respiratory and vascular disease[9] caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). First identified in Wuhan, China, it has caused an ongoing pandemic.\
  Common symptoms include fever, cough, fatigue, breathing difficulties, and loss of smell and taste.[6] Symptoms begin one to fourteen days after exposure to the virus.[10] While most people have mild symptoms, \
  some people develop acute respiratory distress syndrome (ARDS), which can be precipitated by cytokine storms,[11] multi-organ failure, septic shock, and blood clots. Longer-term damage to organs (in particular, \
  the lungs and heart) has been observed, and there is concern about a significant number of patients who have recovered from the acute phase of the disease but continue to experience a range of effects—known \
  as long COVID—for months afterwards, including severe fatigue, memory loss and other cognitive issues, low grade fever, muscle weakness, and breathlessness."COVID-19 mainly spreads through the air when people \
  are near each other long enough,[a] primarily via small droplets or aerosols, as an infected person breathes, coughs, sneezes, sings, or speaks. Transmission via fomites (contaminated surfaces) has not been \
  conclusively demonstrated.[19] It can spread as early as two days before infected persons show symptoms (presymptomatic), and from asymptomatic (no symptoms) individuals. People remain infectious for up \
  to ten days in moderate cases, and two weeks in severe cases. The standard diagnosis method is by real-time reverse transcription polymerase chain reaction (rRT-PCR) from a nasopharyngeal swab. Preventive measures \
  include social distancing, quarantining, ventilation of indoor spaces, covering coughs and sneezes, hand washing, and keeping unwashed hands away from the face. The use of face masks or coverings has been recommended \
  in public settings to minimise the risk of transmissions. There are no proven vaccines or specific treatments for COVID-19 yet, though several are in development. Management involves the treatment of symptoms, \
  supportive care, isolation,b and experimental measures.'

#ref: https://en.wikipedia.org/wiki/Coronavirus_disease_2019

In [None]:
questions = ["where was covid first found ?",
             "How long does it take for the symptoms to show ?",
             "What are the symptoms of covid",
             "how does it spread",
             "what are the preventive measures"
]

In [None]:
#ref: https://towardsdatascience.com/simple-and-fast-question-answering-system-using-huggingface-distilbert-single-batch-inference-bcf5a5749571
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-distilled-squad',return_token_type_ids = True)
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad')
question_context_for_batch = []

# Pre-processing: pair every question with the same context
for question in questions:
    question_context_for_batch.append((question, text))

# padding=True replaces the deprecated pad_to_max_length=True
encoding = tokenizer.batch_encode_plus(question_context_for_batch, padding=True, return_tensors="pt")
input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]
# Index the output so this works whether the model returns a tuple or a ModelOutput object
outputs = model(input_ids, attention_mask=attention_mask)
start_scores, end_scores = outputs[0], outputs[1]

# Take the argmax of the start and end scores to locate the answer span
for index, (start_score, end_score) in enumerate(zip(start_scores, end_scores)):
    max_startscore = torch.argmax(start_score)
    max_endscore = torch.argmax(end_score)
    ans_tokens = input_ids[index][max_startscore: max_endscore + 1]
    answer_tokens = tokenizer.convert_ids_to_tokens(ans_tokens, skip_special_tokens=True)
    answer_tokens_to_string = tokenizer.convert_tokens_to_string(answer_tokens)
    print("\nQuestion: ", questions[index])
    print("Answer: ", answer_tokens_to_string)





Question:  where was covid first found ?
Answer:  wuhan , china

Question:  How long does it take for the symptoms to show ?
Answer:  one to fourteen days

Question:  What are the symptoms of covid
Answer:  fever , cough , fatigue , breathing difficulties , and loss of smell and taste

Question:  how does it spread
Answer:  through the air when people are near each other long enough

Question:  what are the preventive measures
Answer:  
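
The last answer above is empty. One possible cause, which this notebook does not confirm, is that the start and end positions are picked independently with `argmax`, so the end index can land before the start index and the slice `input_ids[start: end + 1]` is then empty. Below is a minimal sketch of a joint search over valid spans. `best_span` and `max_answer_len` are illustrative names, and the scores are made-up numbers.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) pair that maximizes
    start_scores[start] + end_scores[end],
    subject to start <= end < start + max_answer_len."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(len(end_scores), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Made-up scores where the independent argmax gives end < start
start = [0.1, 0.2, 0.3, 5.0, 0.1]
end = [0.2, 6.0, 0.1, 0.3, 4.0]
print(start.index(max(start)), end.index(max(end)))  # 3 1 (an invalid span)
print(best_span(start, end))                         # (3, 4)
```

With PyTorch tensors, the per-example scores can be converted with `.tolist()` before calling the function.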


# Remember that long paragraphs may be truncated and/or cause the model to lose context, so why not divide them into smaller chunks?
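
A minimal sketch of that idea: split the context into overlapping windows of words, ask the question against each window, and keep the highest-scoring answer. `chunk_words`, `answer_over_chunks`, `window` and `stride` are illustrative names. The `qa` argument stands for any callable that returns a dict with `score` and `answer` keys, which is the shape of the question-answering pipeline output shown later in this notebook; `toy_qa` is only a stand-in so the control flow can be run without a model.

```python
def chunk_words(text, window=100, stride=50):
    """Split text into overlapping chunks of at most `window` words."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def answer_over_chunks(qa, question, text, window=100, stride=50):
    """Run `qa` on every chunk and return the highest-scoring result."""
    results = [qa(question=question, context=chunk)
               for chunk in chunk_words(text, window, stride)]
    return max(results, key=lambda r: r["score"])


# Tiny stand-in for a real QA model, used only to show the control flow
def toy_qa(question, context):
    hit = "masks" in context
    return {"score": 1.0 if hit else 0.0, "answer": "masks" if hit else ""}


sample = " ".join(["filler"] * 120 + ["wear", "masks"] + ["filler"] * 30)
print(len(chunk_words(sample)))                                       # 3
print(answer_over_chunks(toy_qa, "what helps?", sample)["answer"])    # masks
```

The overlap (`stride` smaller than `window`) reduces the chance that an answer is cut in half at a chunk boundary. The cell below takes the simpler route of selecting the relevant part of the text by hand.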

In [None]:
  
text1= '  Transmission via fomites (contaminated surfaces) has not been \
  conclusively demonstrated.[19] It can spread as early as two days before infected persons show symptoms (presymptomatic), and from asymptomatic (no symptoms) individuals. People remain infectious for up \
  to ten days in moderate cases, and two weeks in severe cases. The standard diagnosis method is by real-time reverse transcription polymerase chain reaction (rRT-PCR) from a nasopharyngeal swab. Preventive measures \
  include social distancing, quarantining, ventilation of indoor spaces, covering coughs and sneezes, hand washing, and keeping unwashed hands away from the face. The use of face masks or coverings has been recommended \
  in public settings to minimise the risk of transmissions. There are no proven vaccines or specific treatments for COVID-19 yet, though several are in development. Management involves the treatment of symptoms, \
  supportive care, isolation,b and experimental measures.'

questions1 = [
    "what are the preventive measures"
]

question_context_for_batch = []

for question in questions1 :
    question_context_for_batch.append((question, text1))

# padding=True replaces the deprecated pad_to_max_length=True
encoding = tokenizer.batch_encode_plus(question_context_for_batch, padding=True, return_tensors="pt")
input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]
# Index the output so this works whether the model returns a tuple or a ModelOutput object
outputs = model(input_ids, attention_mask=attention_mask)
start_scores, end_scores = outputs[0], outputs[1]

for index, (start_score, end_score) in enumerate(zip(start_scores, end_scores)):
    max_startscore = torch.argmax(start_score)
    max_endscore = torch.argmax(end_score)
    ans_tokens = input_ids[index][max_startscore: max_endscore + 1]
    answer_tokens = tokenizer.convert_ids_to_tokens(ans_tokens, skip_special_tokens=True)
    answer_tokens_to_string = tokenizer.convert_tokens_to_string(answer_tokens)
    print("\nQuestion: ", questions1[index])
    print("Answer: ", answer_tokens_to_string)





Question:  what are the preventive measures
Answer:  social distancing , quarantining , ventilation of indoor spaces , covering coughs and sneezes , hand washing , and keeping unwashed hands away from the face


In [None]:

del model, tokenizer
out = gc.collect()

In [None]:
# Why not use the pipeline?
from transformers import pipeline

nlp = pipeline("question-answering")

text = 'Coronavirus disease 2019 (COVID-19) is a contagious respiratory and vascular disease[9] caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). First identified in Wuhan, China, it has caused an ongoing pandemic.\
  Common symptoms include fever, cough, fatigue, breathing difficulties, and loss of smell and taste.[6] Symptoms begin one to fourteen days after exposure to the virus.[10] While most people have mild symptoms, \
  some people develop acute respiratory distress syndrome (ARDS), which can be precipitated by cytokine storms,[11] multi-organ failure, septic shock, and blood clots. Longer-term damage to organs (in particular, \
  the lungs and heart) has been observed, and there is concern about a significant number of patients who have recovered from the acute phase of the disease but continue to experience a range of effects—known \
  as long COVID—for months afterwards, including severe fatigue, memory loss and other cognitive issues, low grade fever, muscle weakness, and breathlessness."COVID-19 mainly spreads through the air when people \
  are near each other long enough,[a] primarily via small droplets or aerosols, as an infected person breathes, coughs, sneezes, sings, or speaks. Transmission via fomites (contaminated surfaces) has not been \
  conclusively demonstrated.[19] It can spread as early as two days before infected persons show symptoms (presymptomatic), and from asymptomatic (no symptoms) individuals. People remain infectious for up \
  to ten days in moderate cases, and two weeks in severe cases. The standard diagnosis method is by real-time reverse transcription polymerase chain reaction (rRT-PCR) from a nasopharyngeal swab. Preventive measures \
  include social distancing, quarantining, ventilation of indoor spaces, covering coughs and sneezes, hand washing, and keeping unwashed hands away from the face. The use of face masks or coverings has been recommended \
  in public settings to minimise the risk of transmissions. There are no proven vaccines or specific treatments for COVID-19 yet, though several are in development. Management involves the treatment of symptoms, \
  supportive care, isolation,b and experimental measures.'


questions = ["where was covid first found ?",
             "How long does it take for the symptoms to show ?",
             "What are the symptoms of covid",
             "how does it spread",
             "what are the preventive measures"
]


for question in questions:
  print ("----------------------------------------")
  out = nlp(question=question, context=text)
  print (out)
  print (question)
  print(out['answer'])


----------------------------------------
{'score': 0.9737944006919861, 'start': 180, 'end': 193, 'answer': 'Wuhan, China,'}
where was covid first found ?
Wuhan, China,
----------------------------------------
{'score': 0.42967769503593445, 'start': 1329, 'end': 1349, 'answer': 'as early as two days'}
How long does it take for the symptoms to show ?
as early as two days
----------------------------------------
{'score': 0.7823688387870789, 'start': 254, 'end': 332, 'answer': 'fever, cough, fatigue, breathing difficulties, and loss of smell and taste.[6]'}
What are the symptoms of covid
fever, cough, fatigue, breathing difficulties, and loss of smell and taste.[6]
----------------------------------------
{'score': 0.1666080206632614, 'start': 1045, 'end': 1060, 'answer': 'through the air'}
how does it spread
through the air
----------------------------------------
{'score': 0.11085353791713715, 'start': 1713, 'end': 1775, 'answer': 'social distancing, quarantining, ventilation of indoor 

In [None]:
nlp.model

DistilBertForQuestionAnswering(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            

In [None]:
del nlp
gc.collect()

335

# Let's focus only on pipelines for now

# **PART A, TASK 3 : Sentiment Analysis**

In [None]:
from transformers import pipeline

nlp = pipeline("sentiment-analysis")

text ="@AmericanAir just landed - 3hours Late Flight - and now we need to wait TWENTY MORE MINUTES for a gate! I have patience but none for incompetence."
print(nlp(text))


[{'label': 'NEGATIVE', 'score': 0.9982924461364746}]


In [None]:
# Samsung Art TV reviews on Amazon
# https://www.amazon.ca/Samsung-Frame-QN49LS03RAFXZC-Canada-Version/dp/B07NV84MWD/ref=sr_1_3_sspa?dchild=1&keywords=tv&qid=1605134674&sr=8-3-spons&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExSE8wMDlYVThWRFYwJmVuY3J5cHRlZElkPUEwNzg4NTU5MVlJRExJMjZDRUs4SCZlbmNyeXB0ZWRBZElkPUEwNTE2NzcySEcxOFNEMERROU5EJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ&th=1

text ="Bought this for a basement reno that I was doing because I wanted something that would blend in better to the design of the design of the room than a standard TV. \
The minimalist style of the TV combined with the zero-gap wall-mount make quite an impression when installed. Hang some pictures with it on the wall, and put it in Art Mode \
to add to the 'Wow!' factor. The picture quality is AMAZING! I first set it up out of the box, connected it to my wifi, and ran some demo 4K videos off of Youtube. \
Totally blown away with the picture clarity, especially since it was 4K from the internet, and over wifi. Don't know what voodoo is going on behind the scenes to produce those results (I firmly believe a chicken or two had to be sacrificed), \
but it was amazing. Sound quality was decent, but since I plan on using this TV mainly for movie watching, I ran the TV audio through a stereo amp and subwoofer."

print(nlp(text))


[{'label': 'POSITIVE', 'score': 0.9068683385848999}]


In [None]:
text = "The 4K image is sharp and the overall aesthetic of the thin, flat-to-the-wall bezel is excellent, but software that allows you to display artwork is buggy and very poorly designed. \
There have been several updates in the 10 weeks that I have owned this TV and Samsung still can't seem to get it right. The software allows you to select your favourite art works from a broad selection and then display these favourites in rotation \
(you select the duration to display each; from 10 minutes to 7 days for each image). But the software constantly stops the rotation \
(e.g., if a software update is automatically downloaded, or if there is a momentary interruption in Internet connectivity, or if you make a change to your list of favourites by adding or deleting \
an artwork, etc., etc.) And every time the rotation stops and you manually restart it, it goes back to displaying the first picture in the list. There is no option to change or randomize the order. \
I've never once managed to see my entire list of favourites without the software restating the list from the beginning. The result? no matter how many favourites you choose (and there are hundreds of good choices), \
you'll probably never see more than the first dozen or so.This is only one of the problems with the software. There is also a white line that appears at the top of an artwork image while it is being downloaded, \
but then sometimes this line will annoyingly persist after the download is complete and remain at the top of the image for as long as that piece of artwork is displayed. Online and phone support from Samsung has been courteous but unhelpful."

print(nlp(text))

[{'label': 'NEGATIVE', 'score': 0.9993626475334167}]


In [None]:
del nlp
gc.collect()

738

# **PART A, TASK 4 : Text Summarization**

In [None]:
text = "U.S. President Donald Trump is refusing to accept his loss to president-elect Joe Biden and is \
instead floating baseless claims of widespread voter fraud and 'illegal' votes despite a lack of proof.All major news networks declared Biden \
the president-elect on Saturday after ballot-counting tallies in the remaining battleground states showed he had an insurmountable lead and would secure enough electoral college votes. But \
some Trump supporters insist that the election results remain undeclared and could eventually swing in the president’s favour, either through a series of legal challenges, statewide recounts or if yet-to-be-counted ballots, \
such as military ballots, tilt overwhelmingly in Trump’s favour. The reality, political experts say, is that none of those possibilities is viable. Even if Trump somehow won every undeclared state -- \
Georgia, North Carolina, Alaska and, by some news organizations’ counts, Arizona -- he’d still be 11 electoral college votes shy of the 270 needed to win."

In [None]:
from transformers import pipeline
summarizer = pipeline('summarization')
summarizer(text)

[{'summary_text': ' President-elect Joe Biden declared president-elect on Saturday after ballot-counting tallies in remaining battleground states showed he had an insurmountable lead and would secure enough electoral college votes . But some Trump supporters insist that the election results remain undeclared and could swing in the president’s favour . Political experts say none of those possibilities is viable .'}]

In [None]:

text = """ 
New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York. 
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband. 
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other. 
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage. 
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the 
2010 marriage license application, according to court documents. 
Prosecutors said the marriages were part of an immigration scam. 
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further. 
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective 
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002. 
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say. 
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages. 
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted. 
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s 
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali. 
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force. 
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

summarizer(text)

[{'summary_text': ' Liana Barrientos pleaded not guilty to two counts of "offering a false instrument for filing in the first degree" She has been married to 10 men, nine of them between 1999 and 2002 . She is believed to still be married to four men, and at one time, she was married to eight men at once .'}]

In [None]:
text = """

Votes are still being counted across the U.S., and states have until Dec. 8 to settle any outstanding disputes. 
After that, members of the electoral college meet on Dec. 14 to formally cast their votes, which will officially declare the winner.
In any given U.S. election it can take weeks for states to finish tabulating ballots. 
This year, the unprecedented surge in mail-in voting has meant that the process is taking even longer. 
Several battleground states, including Pennsylvania, weren’t allowed to open mail-in ballots until election day.
Wayne Petrozzi, a professor emeritus of politics from Ryerson University, said the added labour of removing mail-in ballots from envelopes, 
then removing them from a second secrecy envelope, and then checking the ballot by hand is painstaking work.
“You had some counties where they had two staff. That was it. And they had to follow meticulous protocols,” Petrozzi told CTVNews.ca on Tuesday.

The Associated Press, which called the election for Biden after five days of ballot counting, said the surge in mail-in voting is to blame for the slower-than-normal count.
“The election, in many ways a referendum on Trump’s poor management of the virus, led to widespread use of mail voting for the first time in many states,” AP said in its explanation of how it called the election.
In some of the more competitive states, such as Georgia, Trump’s supporters remain optimistic that an influx of military ballots could provide a sudden, unexpected boost to lift Trump over Biden.

The reason news organizations felt comfortable projecting victory for Biden was because he was ahead by such a margin that these outstanding votes, 
including military ballots, simply wouldn’t be enough to close the gap. Data analysts could also look at where in each batch of outstanding votes was coming from 
to better understand how the missing votes might lean, based on voter trends.
"""
summarizer(text)

[{'summary_text': ' The Associated Press called the election for Biden after five days of ballot counting . The election, in many ways a referendum on Trump’s poor management of the virus, led to widespread use of mail voting for the first time in many states, the AP said in its explanation of how it called it .'}]

In [None]:
del summarizer
gc.collect()

4092

# **PART A, TASK 5 : Text Generation**

In [None]:
from transformers import pipeline

# The model here is GPT-2. GPT-3 is available only through an API for now, since OpenAI decided not to open-source its larger model :(
# More here: https://huggingface.co/transformers/main_classes/pipelines.html#textgenerationpipeline
text_generator = pipeline("text-generation", model='gpt2')
print(" ")

Some weights of GPT2Model were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


 


In [None]:

out=text_generator("As far as I am concerned, I will", max_length=300)
print("\n\n\n")
print ("-"* 20)
print (out[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.






--------------------
As far as I am concerned, I will continue to push for a federal health care system where all citizens have the right to health insurance," she said. "As I said I will work to create a universal health care system — which, once elected, will be the cornerstone of a comprehensive health care system."


In [None]:
out=text_generator("Donald Trump", max_length=300)
print("\n\n\n")
print ("-"* 20)
print (out[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.






--------------------
Donald Trump and the rise of the alt-right," he told the Daily Beast. "A lot of people are really interested in it."

A video of Trump's speech has gone viral since it was posted to a New York Times website at 9:39 a.m., and, according to the Times, it was originally taken at 9:53 a.m. on April 26 because:

"A Trump supporter is driving by and calling out a bunch of pro-Trump memes because there is one. "Look at all of the people in that video on Facebook. He has done these pro-Trump videos in the past, and what does he think of all of the people posting this? He thinks that they're racists."

Trump's former campaign manager Paul Manafort, who worked extensively with Trump during the campaign, later denied that the billionaire was part of any conspiracy at all.

"The real question with Trump and his followers is which side are they on. He says it, and then we're talking about a different narrative," said Steve Bannon, a White House campaign adviser and the form

In [None]:
# Same prompt as the previous cell, run again: the generated text differs from run to run
out=text_generator("Donald Trump", max_length=300)
print("\n\n\n")
print ("-"* 20)
print (out[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.






--------------------
Donald Trump, who was defeated on Tuesday in New Hampshire.

The Trump campaign said the event was intended for two purposes: to encourage people to vote in a more democratic presidential election.

It also said the Republican front-runner said he was in good spirits and praised "our great state."

Clinton called for an all-out effort at the convention.

He told his colleagues that he wanted delegates to "pay closer attention to what's going on in the North."

Trump, a favorite of supporters in the Democratic field, appeared to be winning the day Tuesday. Clinton was followed by Bernie Sanders, who defeated a rival's candidacy in New Hampshire and narrowly lost in the state.

"It's the day of reckoning," Trump said, referring to Wednesday's primary election, when Clinton was defeated by Sen. Elizabeth Warren — a popular Democratic senator — by 8 points.

"You are not going to forget it in November. It is going to end soon."

___

Follow Stephen Clark on Twitter
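
The two "Donald Trump" cells above return different text for the same prompt. A likely reason is that the pipeline samples each next token from a probability distribution instead of always taking the most likely one. Below is a minimal sketch of that idea in plain Python. The scores in `toy_logits` are made up for illustration and are not GPT-2 outputs, and `sample_next_token` and `generate` are hypothetical helper names, not part of the transformers API.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample one token from a dict of token -> score using a softmax with temperature."""
    rng = rng or random.Random()
    tokens = list(logits)
    scaled = [logits[t] / temperature for t in tokens]
    # Subtract the maximum before exponentiating for numerical stability
    max_scaled = max(scaled)
    weights = [math.exp(s - max_scaled) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(prompt, logits, steps=5, seed=None):
    """Append `steps` sampled tokens to the prompt; a fixed seed makes the run repeatable."""
    rng = random.Random(seed)
    words = [prompt]
    for _ in range(steps):
        words.append(sample_next_token(logits, rng=rng))
    return " ".join(words)


# Toy, made-up scores (assumption for illustration only)
toy_logits = {"and": 2.0, "said": 1.5, "will": 1.0, "campaign": 0.5}

print(generate("Donald Trump", toy_logits, seed=0))
print(generate("Donald Trump", toy_logits, seed=0))  # same seed, same text
print(generate("Donald Trump", toy_logits, seed=1))  # different seed, may differ
```

With the real pipeline, calling `transformers.set_seed(...)` before generating is one way to make a run repeatable.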

In [None]:
out=text_generator("Can Skywalker kill the Sith load", max_length=1000)
print("\n\n\n")
print ("-"* 20)
print (out[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.






--------------------
Can Skywalker kill the Sith load out and watch his younger self die... if not with his face in harm's way as the Force has gone awry... he can at least pretend that we were just at the end of the galaxy to show him we were there as well... Or maybe his own memories, if this wasn't mind-altering, would go something like this:

"It was a few days ago, when he was with his friend, Darth Skywalker... after an old-timey moment when a kid in a wheelchair suddenly got hurt by another. They didn't know it yet, but they saw a man fall apart and try to wake up at the bottom of the mountain... and there was blood everywhere. So they just took him. You know, there is no way to explain what he got, but to show you what he was like in his first years, so that you could see the blood of his son on his father's face."

-- A number of flashbacks from A Nightmare on Elm Street

...which are part of what happened with her. She was so young, and this was no accident. She had been 

# **PART A, TASK 6 : Machine Translation**

In [None]:


# The model here is T5 https://github.com/google-research/text-to-text-transfer-transformer
# https://huggingface.co/transformers/model_doc/t5.html More on T5

# Another model, from Facebook Research: https://huggingface.co/transformers/model_doc/bart.html

from transformers import pipeline

# Task names follow the format translation_xx_to_yy. Only a few language pairs are supported; explore the model library for other options.

translator = pipeline("translation_en_to_de") # English to German
out = translator("So how are you all liking this crash course on NLP so far ?", max_length=40)

print("\n\n\n")
print ("-"* 20)
print (out[0]['translation_text'])

Some weights of T5Model were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.






--------------------
Wie gefällt Ihnen dieser Crash-Kurs auf NLP so weit ?


In [None]:
translator = pipeline("translation_en_to_fr") # English to French
out = translator("So how are you all liking this crash course on NLP so far ?", max_length=40)

print("\n\n\n")
print ("-"* 20)
print (out[0]['translation_text'])

Some weights of T5Model were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.






--------------------
Alors comment aimez-vous tous ce cours d'initiation à la NLP jusqu'à présent ?


In [None]:
# For support of over 100 languages, use the models here: https://huggingface.co/Helsinki-NLP
# https://huggingface.co/transformers/model_doc/marian.html

In [None]:
from transformers import MarianMTModel, MarianTokenizer

# Each sentence starts with a >>xx<< token that selects the target language.
# Note: '>>de<<' is not among the supported codes printed below (German is not a Romance language),
# so the second output is not German.
src_text = [
    '>>fr<< So how are you all liking this crash course on NLP so far ?',
    '>>de<< So how are you all liking this crash course on NLP so far ?',
    '>>es<< So how are you all liking this crash course on NLP so far ?'
]
model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
tokenizer = MarianTokenizer.from_pretrained(model_name)
print ("These are the supported language codes")
print(tokenizer.supported_language_codes)
model = MarianMTModel.from_pretrained(model_name)
# Tokenize the batch into padded PyTorch tensors, then generate the translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]

for item in tgt_text:
  print (item)

These are the supported language codes
['>>fr<<', '>>es<<', '>>it<<', '>>pt<<', '>>pt_br<<', '>>ro<<', '>>ca<<', '>>gl<<', '>>pt_BR<<', '>>la<<', '>>wa<<', '>>fur<<', '>>oc<<', '>>fr_CA<<', '>>sc<<', '>>es_ES<<', '>>es_MX<<', '>>es_AR<<', '>>es_PR<<', '>>es_UY<<', '>>es_CL<<', '>>es_CO<<', '>>es_CR<<', '>>es_GT<<', '>>es_HN<<', '>>es_NI<<', '>>es_PA<<', '>>es_PE<<', '>>es_VE<<', '>>es_DO<<', '>>es_EC<<', '>>es_SV<<', '>>an<<', '>>pt_PT<<', '>>frp<<', '>>lad<<', '>>vec<<', '>>fr_FR<<', '>>co<<', '>>it_IT<<', '>>lld<<', '>>lij<<', '>>lmo<<', '>>nap<<', '>>rm<<', '>>scn<<', '>>mwl<<']
Alors, qu'est-ce que vous aimez tous ce cours sur NLP jusqu'à présent?
Allora, come vi piace a toti questo curso di crash on NLP finora?
Así que, ¿cómo les está gustando este curso de choque en NLP hasta ahora?
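
The Marian example above selects the target language with a `>>xx<<` prefix on each sentence. Here is a small sketch of a helper that builds those prefixes and rejects codes the model does not list. `add_target_prefix` is a hypothetical name, not part of the transformers API, and `supported` is a hand-copied subset of the list printed above; in practice you would pass `tokenizer.supported_language_codes`.

```python
def add_target_prefix(sentences, lang_code, supported_codes):
    """Prefix each sentence with the >>xx<< token, failing early on unsupported codes."""
    token = f">>{lang_code}<<"
    if token not in supported_codes:
        raise ValueError(f"{token} is not supported by this model")
    return [f"{token} {sentence}" for sentence in sentences]


# Subset of the supported codes printed by the cell above (copied by hand)
supported = [">>fr<<", ">>es<<", ">>it<<", ">>pt<<", ">>ro<<"]
sentence = "So how are you all liking this crash course on NLP so far ?"

print(add_target_prefix([sentence], "fr", supported))

try:
    add_target_prefix([sentence], "de", supported)
except ValueError as err:
    print(err)
```

Checking the code up front avoids silently getting a translation in an unexpected language.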


# **PART A, TASK 7 : Predict Missing word or Mask Prediction**

In [None]:
from transformers import pipeline

nlp = pipeline("fill-mask")

# A long-term care operator says 29 residents have died in a COVID-19 outbreak at an east Toronto facility that began last month.
out = nlp(f"A long-term care operator says 29 residents have {nlp.tokenizer.mask_token} in a COVID-19 outbreak at an east Toronto facility that began last month.")
print("\n\n\n")
print ("-"* 20)
for item in out[0:2]:
  print (item['sequence'])

Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.






--------------------
<s>A long-term care operator says 29 residents have died in a COVID-19 outbreak at an east Toronto facility that began last month.</s>
<s>A long-term care operator says 29 residents have perished in a COVID-19 outbreak at an east Toronto facility that began last month.</s>


In [None]:
# The team at Kennedy Lodge offers its most sincere condolences 
out = nlp(f"The team at Kennedy Lodge offers its most sincere  {nlp.tokenizer.mask_token} to the families and friends of the residents who passed away during the pandemic,” the company says in a statement.")
print("\n\n\n")
print ("-"* 20)
for item in out[0:2]:
  print (item['sequence'])





--------------------
<s>The team at Kennedy Lodge offers its most sincere condolences to the families and friends of the residents who passed away during the pandemic,” the company says in a statement.</s>
<s>The team at Kennedy Lodge offers its most sincere apologies to the families and friends of the residents who passed away during the pandemic,” the company says in a statement.</s>
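
The sequences printed above still carry the `<s>` and `</s>` markers that the tokenizer adds at the start and end of the text. A minimal sketch for cleaning them up before display follows. `strip_special_tokens` is a hypothetical helper, and the default marker pair is an assumption based on the output shown above.

```python
def strip_special_tokens(sequence, special_tokens=("<s>", "</s>")):
    """Remove start/end markers from a predicted sequence and trim whitespace."""
    for token in special_tokens:
        sequence = sequence.replace(token, "")
    return sequence.strip()


# Shaped like the items returned by the fill-mask pipeline above (only 'sequence' is used)
predictions = [
    {"sequence": "<s>A long-term care operator says 29 residents have died in a COVID-19 outbreak at an east Toronto facility that began last month.</s>"},
    {"sequence": "<s>A long-term care operator says 29 residents have perished in a COVID-19 outbreak at an east Toronto facility that began last month.</s>"},
]

for item in predictions:
    print(strip_special_tokens(item["sequence"]))
```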


In [None]:
del nlp
gc.collect()

7283

In [None]:

import numpy as np

# Feature extraction returns one hidden-state vector per token of the input text
nlp_features = pipeline('feature-extraction')
output = nlp_features('How to use Bert for long text classification?')
np.array(output).shape   # (samples, tokens, vector size)

(1, 11, 768)
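
The shape above means one sample, 11 tokens, and a 768-dimensional vector per token. To use these features in a classifier, a common option is to average the token vectors into one fixed-size vector per text (mean pooling). This is a sketch in plain Python on a tiny made-up input. `mean_pool` is a hypothetical helper, and mean pooling is only one of several possible choices, not something the notebook itself prescribes.

```python
def mean_pool(features):
    """Average token vectors: (samples, tokens, vector_size) -> (samples, vector_size)."""
    pooled = []
    for token_vectors in features:
        n_tokens = len(token_vectors)
        dim = len(token_vectors[0])
        pooled.append(
            [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]
        )
    return pooled


# Toy input: 1 sample, 2 tokens, 3 dimensions (made-up numbers)
toy = [[[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]]
print(mean_pool(toy))  # [[2.0, 3.0, 4.0]]
```

The nested-list input matches what the feature-extraction pipeline returns, so `mean_pool(output)` would give one 768-dimensional vector for the sentence above.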

# **PART A, TASK 8 : Conversations**

In [None]:
# Models
# https://huggingface.co/transformers/model_doc/dialogpt.html?highlight=conversation
# https://huggingface.co/transformers/model_doc/blenderbot.html?highlight=conversation
from transformers import pipeline, Conversation

# Assumption: the pipeline is created with the default conversational model;
# pass model=... to pipeline("conversational") to use a different one
conversational_pipeline = pipeline("conversational")

conversation_1 = Conversation("Recommend a movie to me ")
conversational_pipeline([conversation_1])
conversation_1.add_user_input("did you see it before")
conversational_pipeline([conversation_1])
conversation_1.add_user_input("Is it scary")
conversational_pipeline([conversation_1])
conversation_1.add_user_input("Is it a long movie?")
conversational_pipeline([conversation_1])

# Not a great chatbot :( But it could be fine-tuned on your own dataset or conversations for better results

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: 6354ecbe-e3db-49b1-bb03-cbddbd76fbfe 
user >> Recommend a movie to me  
bot >> The Big Lebowski 
user >> did you see it before 
bot >> I did, but I'm not sure if it's worth it. 
user >> Is it scary 
bot >> It's not scary, but it's definitely not scary. 
user >> Is it a long movie? 
bot >> It's not a long movie, but it's definitely not a long movie. 