### *Applied AI Foundations: An Introduction to Your AI Journey*

<hr>

# NLP With OpenSource Language Models
### Sumudu Tennakoon, PhD

*www.datasciencefoundations.com*
<hr>

In this notebook we will explore basic NLP tasks with open-source language models in Python, for those who have prior programming experience.

To learn more about Python, refer to the following website:

- Python : https://www.python.org

To learn more about the Python packages we explore in this notebook, refer to the following website:

- HuggingFace : https://huggingface.co


# Getting Started with HuggingFace

* Run the code cell below to install the required libraries before you continue. Skip it if you have already installed them.

In [None]:
# !pip install transformers sentencepiece

# Pipelines

* HuggingFace pipelines provide a streamlined interface for common NLP tasks, such as sentiment analysis, text classification, named entity recognition, text generation, and speech recognition.
* You can choose from many different models and tasks on the HuggingFace website.
* Pipelines make it easy to use models without writing a lot of code.

https://huggingface.co/docs/transformers/main_classes/pipelines

In [1]:
from transformers import pipeline

## Sentiment Analysis

In [2]:
from transformers import pipeline

classifier = pipeline('sentiment-analysis')
classifier('I enojoy watching this movie!')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9986220598220825}]

In [3]:
classifier = pipeline('sentiment-analysis')
classifier('This movie was not good or bad')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'NEGATIVE', 'score': 0.997802197933197}]

In [4]:
classifier = pipeline('sentiment-analysis', model="finiteautomata/bertweet-base-sentiment-analysis")
classifier('This movie was the worst in the series!')

emoji is not installed, thus not converting emoticons or emojis into text. Install emoji: pip3 install emoji==0.6.0
Device set to use cuda:0


[{'label': 'NEG', 'score': 0.9834358096122742}]

In [5]:
classifier('This movie was not good or bad')

[{'label': 'NEG', 'score': 0.9730438590049744}]
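The two models above report their results with different label names (`POSITIVE`/`NEGATIVE` versus `NEG`). A minimal sketch of how the outputs could be mapped to one shared vocabulary is shown below. The `normalize` helper, the `LABEL_MAP` table, and the `min_score` threshold are hypothetical names introduced for illustration, and the `POS`/`NEU` entries are an assumption about the second model's full label set, since only `NEG` appears in the outputs above.

```python
# Hypothetical mapping from model-specific labels to a shared vocabulary.
# 'POS' and 'NEU' are assumed labels; only 'NEG' is seen in this notebook.
LABEL_MAP = {
    "POSITIVE": "positive",
    "NEGATIVE": "negative",
    "POS": "positive",
    "NEG": "negative",
    "NEU": "neutral",
}


def normalize(prediction, min_score=0.5):
    """Return a shared label, or 'uncertain' when the score is below min_score."""
    if prediction["score"] < min_score:
        return "uncertain"
    return LABEL_MAP.get(prediction["label"], "unknown")


# Outputs copied from the cells above for 'This movie was not good or bad'
default_model_output = [{'label': 'NEGATIVE', 'score': 0.997802197933197}]
bertweet_output = [{'label': 'NEG', 'score': 0.9730438590049744}]

print(normalize(default_model_output[0]))
print(normalize(bertweet_output[0]))
```

Both models label the neutral-sounding sentence as negative with a high score, so a score threshold alone would not flag it as uncertain.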

## Question Answering

In [6]:
from transformers import pipeline

nlp = pipeline("question-answering")

context = """ Marie Curie, née Maria Sklodowska, was born in Warsaw on November 7, \
1867, the daughter of a secondary-school teacher. She received a general education \
in local schools and some scientific training from her father. She became involved \
in a students’ revolutionary organization and found it prudent to leave Warsaw, then \
in the part of Poland dominated by Russia, for Cracow, which at that time was under \
Austrian rule. In 1891, she went to Paris to continue her studies at the Sorbonne \
where she obtained Licenciateships in Physics and the Mathematical Sciences. She met \
Pierre Curie, Professor in the School of Physics in 1894 and in the following year \
they were married. She succeeded her husband as Head of the Physics Laboratory at \
the Sorbonne, gained her Doctor of Science degree in 1903, and following the tragic \
death of Pierre Curie in 1906, she took his place as Professor of General Physics in \
the Faculty of Sciences, the first time a woman had held this position. She was also \
appointed Director of the Curie Laboratory in the Radium Institute of the University \
of Paris, founded in 1914.
"""

nlp(question="When did Marie Curie Born?", context=context)


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


{'score': 0.9644646048545837,
 'start': 58,
 'end': 74,
 'answer': 'November 7, 1867'}

In [7]:
nlp(question="What are the positions Marie Curie held at University of Paris?", context=context)

{'score': 0.4384743869304657,
 'start': 1001,
 'end': 1033,
 'answer': 'Director of the Curie Laboratory'}

In [8]:
nlp(question="When did the Curie Laboratory founded ?", context=context)

{'score': 0.8869678378105164, 'start': 1097, 'end': 1101, 'answer': '1914'}
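The `start` and `end` values in each result are character offsets into the `context` string, so the answer can be recovered by slicing. The sketch below checks this on the opening sentence of the context, using the result of the first question. The `highlight` helper and its `window` parameter are hypothetical and only serve to show the answer with some surrounding text.

```python
# Opening of the context string used above (the leading space is part of it).
context_start = (
    " Marie Curie, n\u00e9e Maria Sklodowska, was born in Warsaw on November 7, "
    "1867, the daughter of a secondary-school teacher."
)

# Result copied from the first question-answering cell
result = {'score': 0.9644646048545837, 'start': 58, 'end': 74,
          'answer': 'November 7, 1867'}

# Slicing the context with the offsets gives back the answer text
span = context_start[result['start']:result['end']]
print(span == result['answer'])


def highlight(context, result, window=20):
    """Hypothetical helper: show the answer in brackets with nearby text."""
    start, end = result["start"], result["end"]
    before = context[max(0, start - window):start]
    after = context[end:end + window]
    return f"{before}[{context[start:end]}]{after}"


print(highlight(context_start, result))
```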

#### Specify Model and Reuse Pipeline for Multiple Questions

In [9]:
from transformers import pipeline

nlp = pipeline("question-answering", model='deepset/roberta-base-squad2')

question = "When did Marie Curie Born?"
response = nlp(question=question, context=context)
print({"question":question, "response": response})

question = "What are the positions Marie Curie held at University of Paris?"
response = nlp(question=question, context=context)
print({"question":question, "response": response})

Device set to use cuda:0


{'question': 'When did Marie Curie Born?', 'response': {'score': 0.9604946221224964, 'start': 58, 'end': 74, 'answer': 'November 7, 1867'}}
{'question': 'What are the positions Marie Curie held at University of Paris?', 'response': {'score': 0.19194774329662323, 'start': 1001, 'end': 1033, 'answer': 'Director of the Curie Laboratory'}}
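By default the question-answering pipeline returns one span per question, so a question that asks for several positions still gets a single answer, here with a low score. One possible way to handle this is to accept an answer only above a chosen score. The sketch below assumes a hypothetical threshold `MIN_SCORE` and a hypothetical helper `accept`. The value 0.5 is arbitrary and would need tuning for a real application.

```python
# Responses copied from the output above
responses = [
    {'question': 'When did Marie Curie Born?',
     'response': {'score': 0.9604946221224964, 'start': 58, 'end': 74,
                  'answer': 'November 7, 1867'}},
    {'question': 'What are the positions Marie Curie held at University of Paris?',
     'response': {'score': 0.19194774329662323, 'start': 1001, 'end': 1033,
                  'answer': 'Director of the Curie Laboratory'}},
]

MIN_SCORE = 0.5  # hypothetical threshold, chosen only for illustration


def accept(response, min_score=MIN_SCORE):
    """Return the answer text if the score is high enough, otherwise None."""
    if response["score"] >= min_score:
        return response["answer"]
    return None


for item in responses:
    print(item["question"], "->", accept(item["response"]))
```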


## Text Generation

In [10]:
from transformers import pipeline

text_generator = pipeline("text-generation")
text_generator("An apple fell from the", max_new_tokens=1, do_sample=True)

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'An apple fell from the sky'}]

In [15]:
# Extend the length of the generated sequence

text_generator("An apple fell from the tree", max_new_tokens=100, do_sample=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'An apple fell from the tree. The grass was just growing and grew back, and the tree had a little bit of root, because the apple was growing. And then the branches of the tree started to grow back, and the tree started to grow back, and the apple started to grow back. And the apple was growing back. And the apple was growing back. And the apple fell and the tree was just like, "What do you want me to do?" So, you know, I was just kind of like,'}]
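With `do_sample=True` the next token is drawn at random from the model's predicted distribution, so running the same cell again can give a different continuation. The toy sketch below contrasts that with always picking the most likely token. The probabilities in `next_token_probs` are invented for illustration and do not come from GPT-2, and the `greedy` and `sample` helpers are hypothetical.

```python
import random

# Invented next-token probabilities after "An apple fell from the"
next_token_probs = {"sky": 0.40, "tree": 0.35, "ground": 0.15, "table": 0.10}


def greedy(probs):
    """Pick the single most likely token (no randomness)."""
    return max(probs, key=probs.get)


def sample(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the draws are repeatable
print("greedy :", greedy(next_token_probs))
print("sampled:", [sample(next_token_probs, rng) for _ in range(5)])
```

Greedy selection returns the same token every time, while sampling can return any token with non-zero probability.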

## Translation

In [16]:
from transformers import pipeline

text = "Hello. How are you?"
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

translator(text)

Device set to use cuda:0


[{'translation_text': 'Bonjour, comment allez-vous ?'}]

In [17]:
from transformers import pipeline

text = "Hola cómo estás?"
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

translator(text)

Device set to use cuda:0


[{'translation_text': 'Hi, how are you?'}]
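The two model names used above follow the pattern `Helsinki-NLP/opus-mt-<source>-<target>`. The sketch below builds such a name from language codes and pulls the translated text out of the pipeline output. Both helpers are hypothetical, and a name built this way is only a guess until it is confirmed on the HuggingFace website, since not every language pair has a model.

```python
def opus_mt_model(source_lang, target_lang):
    """Build a model name following the pattern used in the cells above."""
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"


def translation_texts(pipeline_output):
    """Extract the translated strings from a translation pipeline result."""
    return [item["translation_text"] for item in pipeline_output]


print(opus_mt_model("en", "fr"))
print(opus_mt_model("es", "en"))

# Output copied from the Spanish-to-English cell above
output = [{'translation_text': 'Hi, how are you?'}]
print(translation_texts(output))
```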

## Summarization

In [18]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = """ Marie Curie, née Maria Sklodowska, was born in Warsaw on November 7, \
1867, the daughter of a secondary-school teacher. She received a general education \
in local schools and some scientific training from her father. She became involved \
in a students’ revolutionary organization and found it prudent to leave Warsaw, then \
in the part of Poland dominated by Russia, for Cracow, which at that time was under \
Austrian rule. In 1891, she went to Paris to continue her studies at the Sorbonne \
where she obtained Licenciateships in Physics and the Mathematical Sciences. She met \
Pierre Curie, Professor in the School of Physics in 1894 and in the following year \
they were married. She succeeded her husband as Head of the Physics Laboratory at \
the Sorbonne, gained her Doctor of Science degree in 1903, and following the tragic \
death of Pierre Curie in 1906, she took his place as Professor of General Physics in \
the Faculty of Sciences, the first time a woman had held this position. She was also \
appointed Director of the Curie Laboratory in the Radium Institute of the University \
of Paris, founded in 1914.
"""

summarizer(text)

Device set to use cuda:0


[{'summary_text': 'Marie Curie, née Maria Sklodowska, was born in Warsaw on November 7, 1867. She received a general education in local schools and some scientific training from her father. In 1891, she went to Paris to continue her studies at the Sorbonne where she obtained Licenciateships in Physics and the Mathematical Sciences.'}]

## Zero-Shot Classification

In [19]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sequence_to_classify = "Today I am going to prepare a dinner for my friends"

candidate_labels = ['travel', 'cooking', 'playing', 'learning']

classifier(sequence_to_classify, candidate_labels)

Device set to use cuda:0


{'sequence': 'Today I am going to prepare a dinner for my friends',
 'labels': ['cooking', 'learning', 'playing', 'travel'],
 'scores': [0.957097053527832,
  0.03565386310219765,
  0.005505927838385105,
  0.0017431129235774279]}

In [20]:
sequence_to_classify = "I am going to visit Paris next year"

candidate_labels = ['travel', 'cooking', 'playing', 'learning']

classifier(sequence_to_classify, candidate_labels)

{'sequence': 'I am going to visit Paris next year',
 'labels': ['travel', 'learning', 'playing', 'cooking'],
 'scores': [0.7273197174072266,
  0.17637644708156586,
  0.0905284509062767,
  0.005775335244834423]}

In [21]:
sequence_to_classify = "I scored 75 runs in the cricket match yesterday"

candidate_labels = ['travel', 'cooking', 'playing', 'learning']

classifier(sequence_to_classify, candidate_labels)

{'sequence': 'I scored 75 runs in the cricket match yesterday',
 'labels': ['playing', 'learning', 'travel', 'cooking'],
 'scores': [0.8112397193908691,
  0.14230629801750183,
  0.030837280675768852,
  0.015616616234183311]}
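In each result above, `labels` is sorted from the highest score to the lowest and the scores add up to about 1, so the first entry is the predicted label. A small sketch of reading the result, using the output of the first classification cell (the variable names are illustrative):

```python
# Result copied from the first zero-shot classification cell
result = {
    'sequence': 'Today I am going to prepare a dinner for my friends',
    'labels': ['cooking', 'learning', 'playing', 'travel'],
    'scores': [0.957097053527832,
               0.03565386310219765,
               0.005505927838385105,
               0.0017431129235774279],
}

# Pair each label with its score; the order is already highest first
ranked = list(zip(result['labels'], result['scores']))
top_label, top_score = ranked[0]

print("Predicted label:", top_label)
for label, score in ranked:
    print(f"{label:10s} {score:.3f}")

# The scores behave like probabilities over the candidate labels
print("Sum of scores:", round(sum(result['scores']), 6))
```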

# Conversation

In [22]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [23]:
MODEL = "microsoft/GODEL-v1_1-large-seq2seq"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

### No Context

In [24]:
instruction = f'Instruction: given a dialog context, you need to response professionally. Limit answer to one sentence'

question = 'Why the sky is blue?'

knowledge = ''

prompt = f"{instruction}\n[CONTEXT] {question}\n[KNOWLEDGE] {knowledge}"

input_ids = tokenizer(f"{prompt}", return_tensors="pt").input_ids

outputs = model.generate(input_ids, max_length=30, min_length=8, top_p=1.0, do_sample=True)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(input_ids)
print(outputs)
print(prompt)
print(response)

tensor([[21035,    10,   787,     3,     9, 13463,  2625,     6,    25,   174,
            12,  1773, 13931,     5, 18185,  1525,    12,    80,  7142,   784,
         17752,  3463,     4,   382,   908,  1615,     8,  5796,    19,  1692,
            58,   784,   439, 12038, 17717,  5042,   908,     1]])
tensor([[    0,  2070,    66,     8,  4811,    33, 23215,     5,    37,  1997,
            31,     7,  1692,     6,    11,     8,  4345,     7,    33,  1131,
             6,  1692,     6,  1692,    11,  1692,     5,     1]])
Instruction: given a dialog context, you need to response professionally. Limit answer to one sentence
[CONTEXT] Why the sky is blue?
[KNOWLEDGE] 
Because all the stars are shining. The sun's blue, and the planets are red, blue, blue and blue.


### With Scientific Context

In [25]:
instruction = f'Instruction: given a dialog context, you need to response professionally. Limit answer to one sentence'

knowledge = """
A portion of the beam of light coming from the sun scatters off molecules of gas and other \
small particles in the atmosphere. Here, Rayleigh scattering primarily occurs through \
sunlight's interaction with randomly located air molecules. It is this scattered light that \
gives the surrounding sky its brightness and its color. As previously stated, Rayleigh \
scattering is inversely proportional to the fourth power of wavelength, so that shorter \
wavelength violet and blue light will scatter more than the longer wavelengths (yellow and \
especially red light).
"""

question = 'Why the sky is blue?'

prompt = f"{instruction}\n[CONTEXT] {question}\n[KNOWLEDGE] {knowledge}"

input_ids = tokenizer(f"{prompt}", return_tensors="pt").input_ids

outputs = model.generate(input_ids, max_length=30, min_length=8, top_p=1.0, do_sample=False)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(prompt)
print(response)

Instruction: given a dialog context, you need to response professionally. Limit answer to one sentence
[CONTEXT] Why the sky is blue?
[KNOWLEDGE] 
A portion of the beam of light coming from the sun scatters off molecules of gas and other small particles in the atmosphere. Here, Rayleigh scattering primarily occurs through sunlight's interaction with randomly located air molecules. It is this scattered light that gives the surrounding sky its brightness and its color. As previously stated, Rayleigh scattering is inversely proportional to the fourth power of wavelength, so that shorter wavelength violet and blue light will scatter more than the longer wavelengths (yellow and especially red light).

Rayleigh scattering is a process where light is scattered off molecules of gas and other small particles in the atmosphere.


### With Greek Mythology as Context

In [26]:
instruction = f'Instruction: given a dialog context, you need to response professionally. Limit answer to one sentence'
knowledge = """
The story goes that one day Zeus, the Greek god of the sky, asked his daughter Athena \
to make a wish. The blue-eyed Athena, wrapped up in herself, wished that the world \
could see her beauty every single day. Zeus granted Athena’s wish by turning the sky \
in blue, the color of her beautiful eyes.
"""

question = 'Why the sky is blue?'

prompt = f"{instruction}\n[CONTEXT] {question}\n[KNOWLEDGE] {knowledge}"

input_ids = tokenizer(f"{prompt}", return_tensors="pt").input_ids

outputs = model.generate(input_ids, max_length=30, min_length=8, top_p=1.0, do_sample=True)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(prompt)
print(response)

Instruction: given a dialog context, you need to response professionally. Limit answer to one sentence
[CONTEXT] Why the sky is blue?
[KNOWLEDGE] 
The story goes that one day Zeus, the Greek god of the sky, asked his daughter Athena to make a wish. The blue-eyed Athena, wrapped up in herself, wished that the world could see her beauty every single day. Zeus granted Athena’s wish by turning the sky in blue, the color of her beautiful eyes.

Because Zeus, the Greek god of the sky, granted Zeus's daughter Athenes wish that she could see her beautiful eyes every
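The three cells above build the prompt with the same f-string each time. A minimal sketch of wrapping that in a function is shown below. `build_prompt` is a hypothetical helper, and joining several dialog turns with `' EOS '` is an assumption based on the convention described for GODEL models, which this notebook does not use itself.

```python
def build_prompt(instruction, dialog_turns, knowledge=""):
    """Assemble a prompt in the layout used in the cells above.

    dialog_turns is a list of utterances. Joining them with ' EOS ' is an
    assumption about the model's expected format for multi-turn dialogs.
    """
    dialog = " EOS ".join(dialog_turns)
    return f"{instruction}\n[CONTEXT] {dialog}\n[KNOWLEDGE] {knowledge}"


instruction = ('Instruction: given a dialog context, you need to response '
               'professionally. Limit answer to one sentence')

# Single question, no knowledge: same prompt as in the "No Context" cell
prompt = build_prompt(instruction, ['Why the sky is blue?'])
print(prompt)
```

The returned string can be passed to `tokenizer(...)` in place of the `prompt` variable used above.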


<hr/>
First Upload 2023-07-04 | Last update 2025-12-12 by Sumudu Tennakoon

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.