# **Introduction to Natural Language Processing with Hugging Face Transformers**

# **Installing Required Libraries**

In [None]:
# The following required libraries are pre-installed in the Skills Network Labs environment.
# However, if you run this notebook's commands in a different Jupyter environment (e.g. Watson Studio or Anaconda),
# you will need to install these libraries by removing the # sign before !mamba in the code cell below.

In [None]:
# All libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented out.
# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1
# Note: If your environment doesn't support "!mamba install", use "!pip install"

In [None]:
!pip install --upgrade torch



In [None]:
!pip install -q transformers

In [None]:
!pip install datasets evaluate transformers[sentencepiece]



In [None]:
!pip install sacremoses



# **Importing Required Libraries**

In [None]:
# Importing Required Libraries
import warnings
warnings.filterwarnings('ignore')

In [None]:
from transformers import pipeline
from transformers import AutoTokenizer
from transformers import AutoModel

# **1. SENTIMENT ANALYSIS**

In [None]:
# load a text-classification pipeline with pipeline(), using the "sentiment-analysis" task identifier.
# use the default model for this task, "distilbert-base-uncased-finetuned-sst-2-english".
# pass a sentence (e.g. a product review) to the classifier.

classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
classifier("Having three long haired, heavy shedding dogs at home, I was pretty skeptical that this could hold up to all the hair and dirt they trek in, but this wonderful piece of tech has been nothing short of a godsend for me! ")

Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9982457160949707}]

In [None]:
# result : the sentiment is classified as POSITIVE with a confidence score of about 99.8% (a model score, not an accuracy measure)
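
As a minimal sketch of where such a score comes from, assuming the pipeline converts two raw class scores (logits) into probabilities with a softmax, the calculation can be reproduced in plain Python. The logit values below are hypothetical, chosen only for illustration, and are not taken from the model.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the two SST-2 classes (illustrative values only).
hypothetical_logits = {"NEGATIVE": -3.0, "POSITIVE": 3.3}

probs = softmax(list(hypothetical_logits.values()))
scores = dict(zip(hypothetical_logits, probs))

# The reported label is the class with the highest probability.
best_label = max(scores, key=scores.get)
print(best_label, round(scores[best_label], 4))
```

The score therefore describes how strongly the model prefers one label for this single input; it says nothing about how often the model is right across a dataset.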

> **Exercise 1 :**

> 1. use "cardiffnlp/twitter-roberta-base-sentiment" model pre-trained on tweets data, to analyze any tweet of choice. (**NOTE** : output labels for this model are: 0 -> Negative; 1 -> Neutral; 2 -> Positive.)

> 2. use the default model (used in Example 1) on the same tweet, to see if the result will change.



In [None]:
classifier = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")
classifier("This drinks sucks. We’ve spent hundreds of dollars at many locations and we weren’t allowed to use the restroom without buying something.")

Device set to use cpu


[{'label': 'LABEL_0', 'score': 0.9493259787559509}]
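
The generic `LABEL_0` can be turned into a readable name with the mapping given in the exercise note. A small sketch in plain Python, applied to the output recorded above (the `id2name` dictionary is a helper introduced here, not part of the model):

```python
# Mapping stated in the exercise note: 0 -> Negative, 1 -> Neutral, 2 -> Positive.
id2name = {"LABEL_0": "Negative", "LABEL_1": "Neutral", "LABEL_2": "Positive"}

# Output recorded above for the tweet.
result = [{"label": "LABEL_0", "score": 0.9493259787559509}]

# Replace each generic label with its readable name, keeping the score.
readable = [{"label": id2name[r["label"]], "score": r["score"]} for r in result]
print(readable)
```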

In [None]:
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
classifier("This drinks sucks. We’ve spent hundreds of dollars at many locations and we weren’t allowed to use the restroom without buying something.")

Device set to use cpu


[{'label': 'NEGATIVE', 'score': 0.9933196306228638}]

# **2. TOPIC CLASSIFICATION**

In [None]:
# load a pipeline with the "zero-shot-classification" task
# pass the sequence you want to classify and a list of candidate labels.

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
classifier(
    "Exploratory Data Analysis is the first course in Machine Learning Program that introduces learners to the broad range of Machine Learning concepts, applications, challenges, and solutions, while utilizing interesting real-life datasets",
    candidate_labels=["art", "natural science", "data analysis"],
)

Device set to use cpu


{'sequence': 'Exploratory Data Analysis is the first course in Machine Learning Program that introduces learners to the broad range of Machine Learning concepts, applications, challenges, and solutions, while utilizing interesting real-life datasets',
 'labels': ['data analysis', 'art', 'natural science'],
 'scores': [0.995779275894165, 0.0026982570998370647, 0.0015224807430058718]}

In [None]:
# result

# the model assigns a score to each candidate label for the input.
# 'data analysis' is the best match for the topic of this input, with a score of about 99.6%
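
The returned dictionary lists labels and scores in matching order, highest score first. A short sketch in plain Python, using the values recorded above, pairs them up and checks that the scores in this output add up to roughly 1 (variable names are illustrative):

```python
# Labels and scores copied from the output recorded above.
result = {
    "labels": ["data analysis", "art", "natural science"],
    "scores": [0.995779275894165, 0.0026982570998370647, 0.0015224807430058718],
}

# Pair each label with its score; the pipeline already returns them sorted.
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = ranked[0]

# In this output the candidate scores behave like one probability distribution.
total = sum(result["scores"])

for label, score in ranked:
    print(f"{label:>16}: {score:.2%}")
```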

> **Exercise 2 :**

> classify any sentence of choice under any topics of choice. Use the "zero-shot-classification" task and specify model="facebook/bart-large-mnli".



In [None]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
classifier(
    "The bond that links your true family is not one of blood, but of respect and joy in each other's life.",
    candidate_labels=["life", "family", "sunset"],
)

Device set to use cpu


{'sequence': "The bond that links your true family is not one of blood, but of respect and joy in each other's life.",
 'labels': ['family', 'life', 'sunset'],
 'scores': [0.8444904685020447, 0.10644853115081787, 0.04906097799539566]}

# **3. TEXT GENERATION MODELS**

In [None]:
#  load a pipeline with the default "text-generation" model, "gpt2"
generator = pipeline("text-generation", model="gpt2")
generator("This course will teach you")

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'This course will teach you how to understand Google Docs, Google Search, and Google Cloud Calendar. You will also learn how to run Google Apps on Android as well as on iOS for basic troubleshooting. From there, you will learn how to build'}]

In [None]:
# result : the model continued the prompt "This course will teach you" with generated text

In [None]:
# Alternative

# load a pipeline with the "distilgpt2" model and pass parameters such as the maximum length and the number of sequences to return
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "This course will teach you",
    max_length=30,
    num_return_sequences=2,
)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'This course will teach you how to become more agile about building software. This course is free to download via a computer or another program. Free textbooks will'},
 {'generated_text': 'This course will teach you how long the process will take. We are very excited about the end result here and hope to have you come home the next'}]

In [None]:
# result : the model returns 2 different continuations of "This course will teach you", each capped at 30 tokens including the prompt (max_length counts tokens, not words).
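
One way to see that the limit is not a word count is to count whitespace-separated words in the two outputs recorded above. This sketch uses plain Python only; it does not run the tokenizer, so it shows word counts rather than exact token counts.

```python
# The two generated texts recorded above.
generated = [
    "This course will teach you how to become more agile about building software. "
    "This course is free to download via a computer or another program. Free textbooks will",
    "This course will teach you how long the process will take. We are very excited "
    "about the end result here and hope to have you come home the next",
]

# Count whitespace-separated words in each output.
word_counts = [len(text.split()) for text in generated]
print(word_counts)
```

Both counts come out below 30, which is consistent with the tokenizer spending part of the budget on punctuation and sub-word pieces.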

In [None]:
# request the 4 most likely replacements (top_k=4) for the masked word in the input sentence
unmasker = pipeline("fill-mask", "distilroberta-base")
unmasker("This course will teach you all about <mask> models.", top_k=4)

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.19198468327522278,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.042092032730579376,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'},
 {'score': 0.03602461889386177,
  'token': 27930,
  'token_str': ' predictive',
  'sequence': 'This course will teach you all about predictive models.'},
 {'score': 0.02978115901350975,
  'token': 745,
  'token_str': ' building',
  'sequence': 'This course will teach you all about building models.'}]

In [None]:
# result : the model replaces <mask> with 4 different words, ranked by score
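
The scores in this output can be compared directly. A brief sketch in plain Python, using the values recorded above (only the `score` and `token_str` fields are copied), picks the top candidate and shows how much of the total the four suggestions account for:

```python
# Scores and tokens copied from the fill-mask output recorded above.
candidates = [
    {"score": 0.19198468327522278, "token_str": " mathematical"},
    {"score": 0.042092032730579376, "token_str": " computational"},
    {"score": 0.03602461889386177, "token_str": " predictive"},
    {"score": 0.02978115901350975, "token_str": " building"},
]

# The best candidate is the one with the highest score.
best = max(candidates, key=lambda c: c["score"])
best_word = best["token_str"].strip()  # token strings carry a leading space

# Combined score of the four returned candidates.
covered = sum(c["score"] for c in candidates)
print(best_word, round(covered, 3))
```

The four candidates together account for only about 0.30, so the model spreads most of its score over words that were not returned.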



> **Exercise 3 :**

> use the "text-generation" task and the "gpt2" model to complete any sentence. Choose any number of returned sequences.



In [None]:
# text generation
generator = pipeline("text-generation", model="gpt2")
generator(
    "Today, the recipe that I will learn is",
    max_length=15,
    num_return_sequences=3,
)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Today, the recipe that I will learn is "Echo" from the'},
 {'generated_text': "Today, the recipe that I will learn is that I'm going to be"},
 {'generated_text': 'Today, the recipe that I will learn is how to make my own cinnamon'}]

# **4. NAMED ENTITY RECOGNITION**

In [None]:
# load a pipeline with the "ner" task and a model
# pass a sentence as input

ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", grouped_entities=True)
ner("My name is Roberta and I work with IBM Skills Network in Toronto")

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'entity_group': 'PER',
  'score': np.float32(0.9993105),
  'word': 'Roberta',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': np.float32(0.9976597),
  'word': 'IBM Skills Network',
  'start': 35,
  'end': 53},
 {'entity_group': 'LOC',
  'score': np.float32(0.99702173),
  'word': 'Toronto',
  'start': 57,
  'end': 64}]

In [None]:
# result : the model identifies every entity in the sentence (PER, ORG, LOC), each with a confidence score above 0.99
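
The `start` and `end` fields are character offsets into the input sentence. A minimal sketch in plain Python, using the offsets recorded above (scores are omitted for brevity), recovers each entity by slicing the original string:

```python
sentence = "My name is Roberta and I work with IBM Skills Network in Toronto"

# Entity groups, words and offsets copied from the output recorded above.
entities = [
    {"entity_group": "PER", "word": "Roberta", "start": 11, "end": 18},
    {"entity_group": "ORG", "word": "IBM Skills Network", "start": 35, "end": 53},
    {"entity_group": "LOC", "word": "Toronto", "start": 57, "end": 64},
]

# Slice the sentence with each pair of offsets (end is exclusive).
spans = [sentence[e["start"]:e["end"]] for e in entities]

for entity, span in zip(entities, spans):
    print(entity["entity_group"], "->", span)
```

Slicing by offset is useful when the exact original text is needed, for example to highlight entities in place.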

In [None]:
del ner



> **Exercise 4** :

> use any sentence of choice to extract entities (person, location and organization) with the Named Entity Recognition task; specify the model as "Jean-Baptiste/camembert-ner".



In [None]:
ner = pipeline("ner", model="Jean-Baptiste/camembert-ner", grouped_entities=True)
ner("Her name Amelia and she work under Toyota Company near Mount Fuji")

Device set to use cpu


[{'entity_group': 'PER',
  'score': np.float32(0.9357212),
  'word': 'Amelia',
  'start': 8,
  'end': 15},
 {'entity_group': 'ORG',
  'score': np.float32(0.9905671),
  'word': 'Toyota Company',
  'start': 34,
  'end': 49},
 {'entity_group': 'LOC',
  'score': np.float32(0.9984614),
  'word': 'Mount Fuji',
  'start': 54,
  'end': 65}]

# **5. QUESTION ANSWERING**

In [None]:
# load the pipeline() with "question-answering" identifier and model
# provide a question and a context
# apply model to the input

qa_model = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
question = "Which name is also used to describe the Amazon rainforest in English?"
context = "The Amazon rainforest, also known in English as Amazonia or the Amazon Jungle."
qa_model(question = question, context = context)

Device set to use cpu


{'score': 0.8247056603431702, 'start': 48, 'end': 56, 'answer': 'Amazonia'}

In [None]:
# result : the correct answer has been extracted with 82% confidence score.
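
This model answers by selecting a span of the context rather than writing new text. A short sketch in plain Python, using the `start` and `end` values recorded above, shows that the answer is exactly that slice of the context:

```python
context = "The Amazon rainforest, also known in English as Amazonia or the Amazon Jungle."

# Values copied from the output recorded above.
result = {"score": 0.8247056603431702, "start": 48, "end": 56, "answer": "Amazonia"}

# The answer is the substring between the two character offsets (end is exclusive).
extracted = context[result["start"]:result["end"]]
print(extracted)
```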



> **Exercise 5 :**

> use any sentence and a question of choice to extract some information, using "distilbert-base-cased-distilled-squad" model.



In [None]:
qa_model = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
question = "What is the capital city of Malaysia?"
context = "Malaysia[d] is a country in Southeast Asia. A federal constitutional monarchy, it consists of 13 states and three federal territories. Kuala Lumpur is the national capital, the country's largest city, and the seat of the legislative branch of the federal government."
qa_model(question = question, context = context)

Device set to use cpu


{'score': 0.9962024688720703,
 'start': 135,
 'end': 147,
 'answer': 'Kuala Lumpur'}

# **6. TEXT SUMMARIZATION**

In [None]:
# load the "summarization" pipeline with a model
# pass the text you want to summarize
# check the output

In [None]:
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
summarizer(
    """
Exploratory Data Analysis is the first course in Machine Learning Program that introduces learners to the broad range of Machine Learning concepts, applications, challenges, and solutions, while utilizing interesting real-life datasets. So, what is EDA and why is it important to perform it before we dive into any analysis?
EDA is a visual and statistical process that allows us to take a glimpse into the data before the analysis. It lets us test the assumptions that we might have about the data, proving or disproving our prior believes and biases. It lays foundation for the analysis, so our results go along with our expectations. In a way, it’s a quality check for our predictions.
As any data scientist would agree, the most challenging part in any data analysis is to obtain a good quality data to work with. Nothing is served to us on a silver plate, data comes in different shapes and formats. It can be structured and unstructured, it may contain errors or be biased, it may have missing fields, it can have different formats than what an untrained eye would perceive. For example, when we import some data, very often it would contain a time stamp. To a human it is understandable format that can interpreted. But to a machine, it is not interpretable, so it needs to be told what that means, the data needs to be transformed into simple numbers first. There are also different date-time conventions depending on a country (i.e., Canadian versus USA), metric versus imperial systems, and many other data features that need to be recognized before we start doing the analysis. Therefore, the first step before performing any analysis – is get really aquatinted with your data!
This course will teach you to ‘see’ and to ‘feel’ the data as well as to transform it into analysis-ready format. It is introductory level course, so no prior knowledge is required, and it is a good starting point if you are interested in getting into the world of Machine Learning. The only thing that is needed is some computer with internet, your curiosity and eagerness to learn and to apply acquired knowledge.  If you live in Canada, you might be interested about gasoline prices in different cities or if you are an insurance actuary you need to analyze the financial risks that you will take based on your clients information. Whatever is the case, you will be able to do your own analysis, and confirm or disprove some of the existing information.
The course contains videos and reading materials, as well as well as a lot of interactive practice labs that learners can explore and apply the skills learned. It will allow you to use Python language in Jupyter Notebook, a cloud-based skills network environment that is pre-set for you with all available to be downloaded packages and libraries. It will introduce you to the most common visualization libraries such as Pandas, Seaborn, and Matplotlib to demonstrate various EDA techniques with some real-life datasets.

"""
)

Device set to use cpu


[{'summary_text': ' Exploratory Data Analysis is the first course in Machine Learning Program that introduces learners to the broad range of Machine Learning concepts, applications, challenges, and solutions . EDA is a visual and statistical process that allows us to take a glimpse into the data before the analysis . It lays foundation for the analysis so our results go along with our expectations .'}]

In [None]:
# result : a short summary of the input text

In [None]:
del summarizer



> **Exercise 6 :**

> Use any document/paragraph of choice and summarize it, using "sshleifer/distilbart-cnn-12-6" model.



In [None]:
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
summarizer(
    """
    Mental health can affect a person’s day-to-day life, relationships, and physical health. External factors in people’s lives and relationships can also contribute to their mental well-being.
    Looking after one’s mental health can help a person maintain their ability to enjoy life. This involves balancing their activities, responsibilities, and efforts to achieve psychological resilience.
    Stress, depression, and anxiety can affect mental health and may disrupt a person’s routine.
    Although healthcare professionals often use the term “mental health,” doctors recognize that many mental health conditions have physical roots.
    Everyone is at some risk of developing a mental health disorder, regardless of age, sex, income, or ethnicity. In the U.S. and much of the developed world, depression is one of the leading causesTrusted Source of disability.
    Social and financial circumstances, adverse childhood experiences, biological factors, and underlying medical conditions can allTrusted Source shape a person’s mental well-being.
    Many people with a mental health disorder have more than oneTrusted Source condition at the same time.
    It is important to note that mental well-being depends on a balance of factors, and several elements may contribute to the development of a mental health disorder.

"""
)

Device set to use cpu


[{'summary_text': " Everyone is at some risk of developing a mental health disorder, regardless of age, sex, income, or ethnicity . Social and financial circumstances, adverse childhood experiences, biological factors, and underlying medical conditions can all shape a person's mental well-being . Stress, depression, and anxiety can affect mental health and disrupt a person’s routine ."}]

# **7. TRANSLATION**

In [None]:
# monolingual

In [None]:
# use the task identifier "translation_en_to_fr" to translate English to French

en_fr_translator = pipeline("translation_en_to_fr", model="t5-small")
en_fr_translator("How old are you?")

Device set to use cpu


[{'translation_text': 'Quel est votre âge ?'}]

In [None]:
# result : the sentence is translated from English to French

In [None]:
# Alternative

# load the "translation" pipeline
# use a model trained for one specific language pair (e.g. French to English)
# pass the text you want to translate
# check the output

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("La science des données est la meilleure.")

Device set to use cpu


[{'translation_text': 'Data science is the best.'}]



> **Exercise 7 :**

> Use any sentence of choice to translate English to German. The task identifier you can use is "translation_en_to_de".





In [None]:
en_de_translator = pipeline("translation_en_to_de", model="t5-small")
en_de_translator("Hello! How are you?")

Device set to use cpu


[{'translation_text': 'Hallo, wie sind Sie?'}]