<a href="https://colab.research.google.com/github/Smita-Pr/Using_Transformers_NLP_Tasks/blob/main/Intro_Transformers_NLP_Using_HuggingFace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformer Models for NLP Tasks Using the Hugging Face Transformers Library
🤗
[Course on Hugging Face](https://huggingface.co/learn/nlp-course/chapter1/3?fw=pt)

Transformer models are used for a wide variety of NLP tasks. The Hugging Face Transformers library makes these versatile models easy to use, and the Hugging Face Model Hub hosts thousands of models that can be downloaded and used. ⬇

In [1]:
!pip install transformers



In [2]:
import transformers

In [3]:
!pip install transformers[sentencepiece]



# Pipeline Object ✈

The `pipeline` function returns the most basic object in the Transformers library. It handles the necessary preprocessing and postprocessing steps, so you can pass text to a model and get an answer back. You can specify which model to use in the `pipeline` call; otherwise a default model for the task is picked.

In [1]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598048329353333}]
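
To make the postprocessing step more concrete, here is a minimal sketch of how raw model scores (logits) could be turned into a label and a score like the one above. The logits are made up for illustration, and `softmax` and `postprocess` are hypothetical helpers, not functions from the Transformers library; the real pipeline also handles tokenization and batching.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Pick the most probable class and report it in pipeline-style form."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Assumed label mapping for a two-class sentiment model.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

# Hypothetical logits, not taken from a real model run.
example_logits = [-1.5, 1.7]

result = postprocess(example_logits, ID2LABEL)
print(result)
```

The score is simply the probability assigned to the winning label, which is why it always lies between 0 and 1.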

### Zero-Shot Classification

In [5]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445989489555359, 0.11197412759065628, 0.04342695698142052]}
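
The result is a plain dictionary, so it can be inspected with ordinary Python. The sketch below copies the output above into a variable (no model is run) and pairs each label with its score. With the default settings the scores are shared across the candidate labels, so they add up to approximately 1.

```python
# Output copied from the zero-shot cell above; no model is run here.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.8445989489555359, 0.11197412759065628, 0.04342695698142052],
}

# Pair every label with its score.
ranked = list(zip(result["labels"], result["scores"]))

# Pick the best label explicitly instead of relying on the list order.
top_label, top_score = max(ranked, key=lambda pair: pair[1])

for label, score in ranked:
    print(f"{label:<10} {score:.3f}")
print("Best label:", top_label)
```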

### Text Generation and Specifying a Model in Pipeline

In [6]:
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
generator("Sustainability is important because")



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Sustainability is important because of the fact that it gives our planet a long term energy future.'}]

In [None]:
# Awaiting access approval: meta-llama is a gated repository and requires authentication
from transformers import AutoModel

access_token = "*****"

model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", token=access_token)


In [7]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this data analytics course, we will teach you how to",
    max_length=100,
    num_return_sequences=2,
)



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this data analytics course, we will teach you how to use any of our core services, including REST, React and more. If you follow the course, be sure to share with your fellow webers as well, and check out the tutorials of various different services that also help you to set up your own private cloud platform.\n\n\n\n\nBefore this course, you will learn about how to use any of our core services, including REST, React and more. If you follow the course'},
 {'generated_text': 'In this data analytics course, we will teach you how to use our API (the AWS SDK). Learn how to implement our API, and help you design or prototype the API, such as a quick deployment, or with more advanced features.'}]

In [None]:
# this model used up the available RAM
from transformers import pipeline

generator = pipeline("text-generation", model="stabilityai/StableBeluga-7B")
generator(
    "In this data analytics course, we will teach you how to",
    max_length=100,
    num_return_sequences=2,
)

Downloading (…)lve/main/config.json:   0%|          | 0.00/583 [00:00<?, ?B/s]

Downloading (…)fetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

### Mask Filling 😷

Mask filling means filling in the blanks in a sentence. The `top_k` parameter controls how many candidates are returned.
Different models use different mask tokens: the default model below uses `<mask>`, while the BERT model in the second example uses `[MASK]`.

In [2]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.19619806110858917,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052723944187164,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

In [5]:
from transformers import pipeline

pipe = pipeline("fill-mask", model="bert-base-cased")
pipe("[MASK] is the capital of France", top_k=3)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.43705087900161743,
  'token': 2123,
  'token_str': 'Paris',
  'sequence': 'Paris is the capital of France'},
 {'score': 0.09661243110895157,
  'token': 17851,
  'token_str': 'Marseille',
  'sequence': 'Marseille is the capital of France'},
 {'score': 0.0600145161151886,
  'token': 10067,
  'token_str': 'Lyon',
  'sequence': 'Lyon is the capital of France'}]
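
Conceptually, `top_k` keeps only the k highest-scoring candidates for the masked position. The sketch below illustrates that selection on a tiny dictionary of rounded scores taken from the output above. The helper `top_k_candidates` is hypothetical and not part of the library, and the real pipeline ranks the model's whole vocabulary rather than three words.

```python
import heapq


def top_k_candidates(probabilities, k):
    """Return the k (token, score) pairs with the highest scores."""
    return heapq.nlargest(k, probabilities.items(), key=lambda item: item[1])


# Rounded scores from the BERT output above, deliberately out of order.
candidate_probs = {"Lyon": 0.0600, "Paris": 0.4371, "Marseille": 0.0966}

for token, score in top_k_candidates(candidate_probs, k=2):
    print(f"{token}: {score:.4f}")
```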

### Named Entity Recognition

Named Entity Recognition (NER) is an NLP task in which entities such as persons, organizations, and locations are identified in the input text. Success depends on using context to identify every entity correctly, including names of people and multi-word entities such as names of organizations or cities. Passing `grouped_entities=True` tells the pipeline to regroup the parts of the sentence that belong to the same entity, for example "United Kingdom".

In [9]:
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("My name is Smita and I work at xyz in the United Kingdom.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER',
  'score': 0.9943884,
  'word': 'Smita',
  'start': 11,
  'end': 16},
 {'entity_group': 'ORG',
  'score': 0.82636,
  'word': 'xyz',
  'start': 31,
  'end': 34},
 {'entity_group': 'LOC',
  'score': 0.9997419,
  'word': 'United Kingdom',
  'start': 42,
  'end': 56}]
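
The `start` and `end` values are character offsets into the input string, with `end` exclusive, so each entity can be recovered by slicing. The sketch below copies the input and the relevant fields of the output above into variables (no model is run) and checks that the slices match the reported words.

```python
text = "My name is Smita and I work at xyz in the United Kingdom."

# Fields copied from the NER output above (scores omitted for brevity).
entities = [
    {"entity_group": "PER", "word": "Smita", "start": 11, "end": 16},
    {"entity_group": "ORG", "word": "xyz", "start": 31, "end": 34},
    {"entity_group": "LOC", "word": "United Kingdom", "start": 42, "end": 56},
]

for entity in entities:
    # Slice the original text with the reported character offsets.
    span = text[entity["start"]:entity["end"]]
    print(f"{entity['entity_group']}: {span!r}")
```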

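### Question Answering

The question-answering pipeline answers a question by extracting the answer from the context you provide; it does not generate new text.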

In [11]:
from transformers import pipeline

question_answer = pipeline("question-answering")
question_answer(
    question="Where do I live?",
    context="My name is Smita and I work at xyz.France is a beautiful country.London is also a vibrant city where I live and work.",
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9639270901679993, 'start': 65, 'end': 71, 'answer': 'London'}

### Summarization

Summarization condenses a text while keeping the key points of the original input.
As with text generation, you can specify a `max_length` or a `min_length` for the result.



In [12]:
# No model is specified, so the default (sshleifer/distilbart-cnn-12-6) is used.
from transformers import pipeline

summarizer = pipeline("summarization")

summarizer("""
           Before diving into the technical aspects of LLM development, let’s do some back-of-the-napkin math to get a sense of the financial costs here.

Meta’s Llama 2 models required about 180,000 GPU hours to train its 7b parameter model and 1,700,000 GPU hours to train the 70b model [2]. Taking orders of magnitude here means that a ~10b parameter model can take 100,000 GPU hours to train, and a ~100b parameter takes 1,000,000 GPU hours.

Translating this into commercial cloud computing costs, an Invidia A100 GPU (i.e. what was used to train Llama 2 models) costs around $1–2 per GPU per hour. That means a ~10b parameter model costs about $150,000 to train, and a ~100b parameter model costs ~$1,500,000.

Alternatively, you can buy the GPUs if you don’t want to rent them. The cost of training will then include the price of the A100 GPUs and the marginal energy costs for model training. An A100 is about $10,000 multiplied by 1000 GPUs to form a cluster. The hardware cost is then on the order of $10,000,000. Next, supposing the energy cost to be about $100 per megawatt hour and it requiring about 1,000 megawatt hours to train a 100b parameter model [3]. That comes to a marginal energy cost of about $100,000 per 100b parameter model.

These costs do not include funding a team of ML engineers, data engineers, data scientists, and others needed for model development, which can easily get to $1,000,000 (to get people who know what they are doing).

Needless to say, training an LLM from scratch is a massive investment (at least for now). Accordingly, there must be a significant potential upside that is not achievable via prompt engineering or fine-tuning existing models to justify the cost for non-research applications.
""")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' Meta Meta’s Llama 2 models required 180,000 GPU hours to train its 7b model and 1,700,000 to train the 70b model . An Invidia A100 GPU costs around $1–2 per GPU per hour . That means a 10b parameter model costs about $150k to train, and a 100b model costs $1,500k . An A100 is about $10,000 multiplied by 1000 GPUs to form a cluster .'}]

In [13]:
# Use facebook/bart-large-cnn instead of the default model
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

summarizer("""
           Before diving into the technical aspects of LLM development, let’s do some back-of-the-napkin math to get a sense of the financial costs here.

Meta’s Llama 2 models required about 180,000 GPU hours to train its 7b parameter model and 1,700,000 GPU hours to train the 70b model [2]. Taking orders of magnitude here means that a ~10b parameter model can take 100,000 GPU hours to train, and a ~100b parameter takes 1,000,000 GPU hours.

Translating this into commercial cloud computing costs, an Invidia A100 GPU (i.e. what was used to train Llama 2 models) costs around $1–2 per GPU per hour. That means a ~10b parameter model costs about $150,000 to train, and a ~100b parameter model costs ~$1,500,000.

Alternatively, you can buy the GPUs if you don’t want to rent them. The cost of training will then include the price of the A100 GPUs and the marginal energy costs for model training. An A100 is about $10,000 multiplied by 1000 GPUs to form a cluster. The hardware cost is then on the order of $10,000,000. Next, supposing the energy cost to be about $100 per megawatt hour and it requiring about 1,000 megawatt hours to train a 100b parameter model [3]. That comes to a marginal energy cost of about $100,000 per 100b parameter model.

These costs do not include funding a team of ML engineers, data engineers, data scientists, and others needed for model development, which can easily get to $1,000,000 (to get people who know what they are doing).

Needless to say, training an LLM from scratch is a massive investment (at least for now). Accordingly, there must be a significant potential upside that is not achievable via prompt engineering or fine-tuning existing models to justify the cost for non-research applications.
""")

[{'summary_text': 'Training an LLM from scratch is a massive investment (at least for now) There must be a significant potential upside that is not achievable via prompt engineering or fine-tuning existing models to justify the cost for non-research applications. Meta’s Llama 2 models required about 180,000 GPU hours to train its 7b parameter model.'}]
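
Neither summarization cell sets a length limit. A minimal sketch of passing `min_length` and `max_length` (both counted in tokens) is shown below. The helper `summarize` is hypothetical and only wraps the call so that the limits are checked first; the default values and `do_sample=False` are illustrative choices, not settings used in this notebook.

```python
def summarize(summarizer, text, min_length=30, max_length=130):
    """Summarize `text` with explicit length limits (in tokens).

    `summarizer` is expected to be a summarization pipeline, i.e. a callable
    that returns a list of dicts with a 'summary_text' key.
    """
    if min_length > max_length:
        raise ValueError("min_length must not exceed max_length")
    outputs = summarizer(
        text,
        min_length=min_length,
        max_length=max_length,
        do_sample=False,  # deterministic decoding
    )
    return outputs[0]["summary_text"]
```

With the `summarizer` object created above, a call could look like `summarize(summarizer, long_text, min_length=20, max_length=60)`, where `long_text` stands for the article passed in the cells above.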

### Translation

Translation converts the input text into the desired language.

In [4]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]

In [18]:
# t5-small translates English to German here.
from transformers import pipeline

translator = pipeline("translation", model="t5-small")
translator("this course is produced by Hugging Face.")

[{'translation_text': 'Dieser Kurs wird von Hugging Face produziert.'}]