<a href="https://colab.research.google.com/github/AmmarNasirDS/NLP-with-Transformers/blob/main/course/en/chapter1/section3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This Google Colab notebook is a hands-on introduction to the Hugging Face Transformers library. It applies pre-trained models to common Natural Language Processing (NLP) tasks: sentiment analysis, zero-shot classification, text generation, mask filling, named entity recognition, question answering, summarization, and translation.**

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

1. **Installing Libraries**

Explanation: This first cell sets up our environment. It's like adding tools to your toolbox. We install three libraries: datasets for easy access to data, evaluate for measuring how well models perform, and transformers, the main library we'll use for working with pre-trained models. The sentencepiece extra adds a tokenizer dependency that some models need.



In [1]:
!pip install datasets evaluate transformers[sentencepiece]

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6
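
To confirm that the installation succeeded before running the remaining cells, a small check like the following may help. It is a sketch that uses only the standard library; the helper name `installed_version` is an illustrative assumption, not part of any of the installed packages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the status of each library this notebook relies on.
for name in ("datasets", "evaluate", "transformers", "sentencepiece"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If any line reports "not installed", rerun the install cell before continuing.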


2. **Basic Sentiment Analysis**

Explanation: This code shows how to detect the sentiment of a sentence. We use the pipeline function from the transformers library and request the "sentiment-analysis" task. Because we do not name a model, the pipeline falls back to a default one fine-tuned for English sentiment. For each sentence it returns a label (POSITIVE or NEGATIVE for this default model) along with a confidence score.



In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f.
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


[{'label': 'POSITIVE', 'score': 0.9598046541213989}]
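
The warning in the output above recommends naming the model and revision explicitly. A minimal sketch of doing so follows, assuming the model name and revision printed in that warning; the function name `build_classifier` is illustrative only.

```python
# Model name and revision copied from the warning printed by the default pipeline.
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "714eb0f"


def build_classifier():
    """Create a sentiment-analysis pipeline pinned to an explicit model and revision."""
    # Imported here so the constants above can be reused without loading the library.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_ID, revision=REVISION)
```

Calling `build_classifier()` returns a pipeline that can be used the same way as `classifier` above, and pinning the revision should keep results stable if the default model changes later.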

3. **Sentiment Analysis on Multiple Sentences**

Explanation: Building on the previous example, this cell shows that the same classifier can process several sentences in one call. We pass a list of sentences, and the pipeline returns one result per sentence, in the same order as the inputs.

In [3]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598046541213989},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
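
Because the results come back in input order, they can be paired with their sentences using plain Python. The sketch below reuses the values printed above; the helper `pair_with_inputs` is a hypothetical name introduced for illustration.

```python
sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
# Results copied from the output above; the pipeline returns one dict per input, in order.
results = [
    {"label": "POSITIVE", "score": 0.9598046541213989},
    {"label": "NEGATIVE", "score": 0.9994558691978455},
]


def pair_with_inputs(texts, predictions):
    """Attach each prediction to the sentence it belongs to."""
    if len(texts) != len(predictions):
        raise ValueError("expected one prediction per input sentence")
    return [
        {"text": text, "label": pred["label"], "score": round(pred["score"], 4)}
        for text, pred in zip(texts, predictions)
    ]


paired = pair_with_inputs(sentences, results)
for row in paired:
    print(f"{row['label']:<8} {row['score']:.4f}  {row['text']}")
```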

4. **Zero-Shot Classification**

Explanation: This is a very cool feature! "Zero-shot" means we can classify text into categories without training the model specifically on those categories. We give it a sentence and a list of candidate labels (like "education," "politics," "business"), and the model scores how well each label fits. The result lists the labels from most to least likely, with a score for each.

In [4]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1.
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.844595193862915, 0.11197695881128311, 0.04342786595225334]}
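
The `labels` and `scores` lists in the result are parallel, so the best label can be read off with a few lines of Python. This is a sketch on the values printed above; the `min_score` threshold and the helper name `top_label` are assumptions added for illustration.

```python
# Result copied from the output above.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.844595193862915, 0.11197695881128311, 0.04342786595225334],
}


def top_label(zero_shot_result, min_score=0.5):
    """Return the best label, or None when the model is not confident enough."""
    # labels and scores are parallel lists; pick the pair with the highest score.
    label, score = max(
        zip(zero_shot_result["labels"], zero_shot_result["scores"]),
        key=lambda pair: pair[1],
    )
    return label if score >= min_score else None


print(top_label(result))
# In this output the scores over the candidate labels add up to roughly 1.
print(round(sum(result["scores"]), 6))
```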

5. **Text Generation (Default Model)**

Explanation: This cell demonstrates how to make a model write text. We use the text-generation pipeline. Given a starting phrase (a "prompt"), the model continues writing, trying to complete the thought or story. Since we didn't specify a model, the pipeline falls back to a default, which in this run is openai-community/gpt2. Your own output may differ from the one shown here.


In [5]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d.
Using a pipeline without specifying a model name and revision in production is not recommended.


GPT2LMHeadModel LOAD REPORT from: openai-community/gpt2
Key                  | Status     |  | 
---------------------+------------+--+-
h.{0...11}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'In this course, we will teach you how to use your own skills as a writer to make your own stories. We will show you a variety of ways to create a compelling story with your own words, but will introduce you to the most important things: the importance of humor, the importance of making your own stories, the importance of making your own stories as a creative endeavor, and the importance of making your own stories.\n\nThe course will focus on the following subjects:\n\n- writing\n\n- writing about\n\n- writing about writing about writing about\n\n- writing about writing about writing about\n\n- writing about writing about writing about\n\n- writing about writing about writing about\n\n- writing about writing about writing about\n\n- writing about writing about writing about\n\n- writing about writing about writing about\n\n- writing about writing about writing about\n\n- writing about writing about writing about\n\n- writing about writing about writing about\n\n- wr

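6. **Text Generation (Specific Model and Parameters)**

Explanation: Here, we refine our text generation. We tell the pipeline to use a smaller, faster model called distilgpt2. We also pass max_length, which caps the total length in tokens (prompt included), and num_return_sequences, which sets how many alternative continuations are returned. Note the warning in the output: in this run a default of max_new_tokens=256 is also set and takes precedence over max_length=30, which is why the second continuation runs well past 30 tokens. Passing max_new_tokens instead is one way to cap the output length.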




In [6]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

GPT2LMHeadModel LOAD REPORT from: distilgpt2
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
transformer.h.{0, 1, 2, 3, 4, 5}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Passing `generation_config` together with generation-related arguments=({'max_length', 'num_return_sequences'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'In this course, we will teach you how to use the techniques of the same technique and how to apply it to other situations in the future.\n\n\n\nHere are several articles in the course:'},
 {'generated_text': 'In this course, we will teach you how to use the tools to control multiple tasks in the same language to control multiple tasks in different languages. In this course, we will show you how to use the tools to control multiple tasks in different languages.\n\n\n\nThis course is free. We reserve the right to create and delete an entire series of text files, so that you can use the tools to control multiple tasks in different languages. This course is free. We reserve the right to create and delete an entire series of text files, so that you can use the tools to control multiple tasks in different languages. This course is free.\nFor more information, please visit www.nasa.gov/sites/default/files/nasa.gov/sites/default/files/nasa.gov/sites/default/files/nasa.gov/
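
The interaction between `max_length` and `max_new_tokens` reported in the warnings above can be summarized with a small function. This is a simplified model of the rule, not the library's implementation: it assumes that `max_length` counts the prompt tokens as well as the generated ones, and that `max_new_tokens` wins when both are set. The function name and the prompt length are hypothetical.

```python
def new_token_budget(prompt_tokens, max_length=None, max_new_tokens=None):
    """Simplified model of how many tokens may be generated after the prompt."""
    if max_new_tokens is not None:
        # Matches the warning above: max_new_tokens takes precedence when both are set.
        return max_new_tokens
    if max_length is not None:
        # max_length counts the prompt tokens as well as the generated ones.
        return max(max_length - prompt_tokens, 0)
    raise ValueError("set max_length or max_new_tokens")


# Hypothetical prompt length of 10 tokens, for illustration only.
print(new_token_budget(10, max_length=30))
print(new_token_budget(10, max_length=30, max_new_tokens=256))
```

Under these assumptions, `max_length=30` alone would leave room for 20 new tokens, while the combination seen in this run allows up to 256.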

7. **Fill-Mask**

Explanation: This code lets the model fill in a missing word in a sentence. We put a <mask> token where the word is missing, and the fill-mask pipeline predicts the most likely words for that position. top_k=2 means it returns the two highest-scoring guesses. The mask token differs between models, so check which one your model expects.

In [7]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8.
Using a pipeline without specifying a model name and revision in production is not recommended.


RobertaForMaskedLM LOAD REPORT from: distilbert/distilroberta-base
Key                         | Status     |  | 
----------------------------+------------+--+-
roberta.pooler.dense.bias   | UNEXPECTED |  | 
roberta.pooler.dense.weight | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


[{'score': 0.19619743525981903,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052695631980896,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]
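
To make the role of `top_k` concrete, the sketch below selects the highest-scoring candidates from a small score table and substitutes them into the sentence. It is a toy illustration in plain Python, not how the pipeline is implemented: the first two scores are copied from the output above, and the other two entries, along with the helper name `top_k_fills`, are made up for the example.

```python
import heapq

# The first two probabilities are copied from the output above;
# the remaining entries are made-up values for illustration.
candidate_scores = {
    " mathematical": 0.19619743525981903,
    " computational": 0.04052695631980896,
    " statistical": 0.03,  # hypothetical
    " predictive": 0.02,  # hypothetical
}


def top_k_fills(template, scores, k=2, mask_token="<mask>"):
    """Return the k highest-scoring fills, each substituted into the template."""
    best = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    return [
        {
            "token_str": token,
            "score": score,
            # Tokens carry a leading space, so replace the mask together with
            # the space in front of it.
            "sequence": template.replace(" " + mask_token, token),
        }
        for token, score in best
    ]


fills = top_k_fills("This course will teach you all about <mask> models.", candidate_scores)
for fill in fills:
    print(f"{fill['score']:.4f}  {fill['sequence']}")
```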

8. **Named Entity Recognition (NER)**

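Explanation: Named Entity Recognition is about finding and classifying specific pieces of information in text. This code identifies names of people (PER), organizations (ORG), and locations (LOC) within a sentence, like finding "Sylvain" as a person, "Hugging Face" as an organization, and "Brooklyn" as a location. Passing aggregation_strategy="simple" tells the pipeline to group the tokens that belong to the same entity, so "Hugging Face" is returned as one organization rather than as separate pieces.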




In [8]:
from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496.
Using a pipeline without specifying a model name and revision in production is not recommended.


BertForTokenClassification LOAD REPORT from: dbmdz/bert-large-cased-finetuned-conll03-english
Key                      | Status     |  | 
-------------------------+------------+--+-
bert.pooler.dense.bias   | UNEXPECTED |  | 
bert.pooler.dense.weight | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


[{'entity_group': 'PER',
  'score': np.float32(0.9981694),
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': np.float32(0.9796019),
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': np.float32(0.9932106),
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
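
The `start` and `end` values in each entity are character offsets into the input sentence, which makes it easy to collect the matched spans by type. The following sketch works on the output shown above, with the scores written as plain floats; `group_by_type` is a hypothetical helper name.

```python
from collections import defaultdict

text = "My name is Sylvain and I work at Hugging Face in Brooklyn."
# Entities copied from the output above, with scores written as plain floats.
entities = [
    {"entity_group": "PER", "score": 0.9981694, "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "score": 0.9796019, "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "score": 0.9932106, "word": "Brooklyn", "start": 49, "end": 57},
]


def group_by_type(source_text, found):
    """Map each entity type to the exact text spans it covers."""
    grouped = defaultdict(list)
    for entity in found:
        # start and end are character offsets into the original text.
        span = source_text[entity["start"]:entity["end"]]
        grouped[entity["entity_group"]].append(span)
    return dict(grouped)


by_type = group_by_type(text, entities)
print(by_type)
```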

9. **Question Answering**

Explanation: This cell shows how a model can answer questions based on a provided text. We give the question-answering pipeline a question and a piece of text (the "context"). The model does not write a new answer; it extracts the span of the context that best answers the question and returns it with a score and its start and end character positions.




In [9]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5.
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6949763894081116, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
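
Since the answer is a span of the context, the `start` and `end` offsets should reproduce it exactly. The sketch below checks this on the result shown above and adds a confidence threshold; the threshold value and the helper name `answer_or_none` are assumptions made for illustration.

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"
# Result copied from the output above.
result = {"score": 0.6949763894081116, "start": 33, "end": 45, "answer": "Hugging Face"}


def answer_or_none(qa_result, source_context, min_score=0.5):
    """Return the answer span, or None if the score is below the threshold."""
    span = source_context[qa_result["start"]:qa_result["end"]]
    # For an extractive model the offsets should reproduce the answer text.
    if span != qa_result["answer"]:
        raise ValueError("offsets do not match the answer text")
    return span if qa_result["score"] >= min_score else None


print(answer_or_none(result, context))
```

A threshold like this is one simple way to avoid showing low-confidence answers; a suitable value depends on the model and the application.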

10. **Text Summarization**

Explanation: This section demonstrates text summarization. The summarization pipeline did not work in this environment, so we load a specific summarization model (sshleifer/distilbart-cnn-12-6) and its tokenizer directly. The tokenizer converts the text into token IDs the model understands, the model generates a shorter version (the summary) under parameters such as max_length and num_beams, and the tokenizer then decodes the generated IDs back into readable text.




In [15]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "sshleifer/distilbart-cnn-12-6"

# Load tokenizer and model explicitly
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

input_text = """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""

# Prepare input for the model
inputs = tokenizer(input_text, return_tensors="pt", max_length=1024, truncation=True)

# Generate summary (unpacking inputs also passes the attention mask)
summary_ids = model.generate(**inputs, num_beams=4, max_length=150, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(summary)

 The number of engineering graduates in the United States has declined in recent years . China and India graduate six and eight times as many traditional engineers as does the U.S. Other industrial countries at minimum maintain their output, while America suffers an increasingly serious decline in the number of engineers .
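
The decoded summary above starts with a space and has spaces before some periods. A small post-processing step can tidy this up; the sketch below is one possible cleanup using the standard library, and `tidy_summary` is a hypothetical helper, not part of the model or tokenizer.

```python
import re


def tidy_summary(text):
    """Strip outer whitespace and remove spaces that precede punctuation."""
    return re.sub(r"\s+([.,;:!?])", r"\1", text.strip())


# First sentence of the summary printed above, including its stray spaces.
raw = " The number of engineering graduates in the United States has declined in recent years ."
print(tidy_summary(raw))
```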


11. **Text Translation**

Explanation: Finally, this code shows text translation. Similar to summarization, we directly load a translation model (Helsinki-NLP/opus-mt-fr-en for French to English) and its tokenizer. We input a French sentence, the tokenizer prepares it, the model translates it into English, and then the tokenizer converts the translated output back into a readable English sentence.




In [18]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "Helsinki-NLP/opus-mt-fr-en"

# Load tokenizer and model explicitly
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

input_text = "Ce cours est produit par Hugging Face."

# Prepare input for the model
inputs = tokenizer(input_text, return_tensors="pt")

# Generate translation (unpacking inputs also passes the attention mask)
translated_ids = model.generate(**inputs)
translated_text = tokenizer.decode(translated_ids[0], skip_special_tokens=True)

print(translated_text)

This course is produced by Hugging Face.
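
The same tokenize, generate, decode steps extend to several sentences at once. The following is a sketch under the assumption that the model and tokenizer are loaded as in the cell above; the function name `translate_batch` is illustrative, and loading the model inside the function on every call is a simplification.

```python
def translate_batch(sentences, model_id="Helsinki-NLP/opus-mt-fr-en"):
    """Translate a list of sentences with one tokenizer call and one generate call."""
    # Imported here so defining the function does not load the library.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    # padding=True pads shorter sentences so the batch forms a rectangular tensor.
    inputs = tokenizer(sentences, return_tensors="pt", padding=True)
    # Unpacking passes attention_mask too, so padded positions are ignored.
    output_ids = model.generate(**inputs)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

Calling `translate_batch(["Ce cours est produit par Hugging Face."])` should return a list with one English string per input sentence.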
