What is Hugging Face: 🤗 https://huggingface.co/          
Hugging Face is a hub for AI models and datasets. You can grab ready-to-use models for tasks ranging from NLP to image captioning, along with datasets to train them on. No need to build from scratch: just pick and use!

*🤗*: Transformers: https://huggingface.co/docs/transformers/index


Provides APIs and tools to easily download and train state-of-the-art pretrained models.                       

Transformers supports framework interoperability between PyTorch, TensorFlow, and JAX. This gives you the flexibility to use a different framework at each stage of a model’s life: train a model in three lines of code in one framework, and load it for inference in another.

**How to Use Transformers and Access Pre-trained models Using Pipelines**

**O_O!: What is pipeline()**


With the `pipeline()` function, users can easily load models and tokenizers, and perform inference without requiring expertise in the underlying models. Users can specify an inference task, provide input data, and the pipeline will automatically handle pre-processing, model inference, and post-processing.

**Pipeline** supports multiple modalities including text, audio, and vision, and allows users to easily switch between different models for improved results.

Install Transformers

In [None]:
!pip install transformers



Import pipeline

In [None]:
from transformers import pipeline

There are three main steps involved when you pass some text to a pipeline:


*   The text is preprocessed into a format the model can understand.

*   The preprocessed inputs are passed to the model.



*   The predictions of the model are post-processed, so you can make sense of them
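The three steps above can be pictured with a toy example. This is only a sketch and not how Transformers implements `pipeline()`: the vocabulary, the scoring rule, and the function names (`preprocess`, `toy_model`, `postprocess`) are all made up for illustration.

```python
# Toy illustration of the three pipeline stages. The vocabulary, scoring
# rule and labels are hypothetical; a real pipeline uses a trained
# tokenizer and a neural network instead.

TOY_VOCAB = {"good": 1, "great": 2, "bad": 3, "awful": 4}
UNKNOWN_ID = 0


def preprocess(text):
    """Step 1: turn raw text into a list of integer IDs."""
    return [TOY_VOCAB.get(word, UNKNOWN_ID) for word in text.lower().split()]


def toy_model(input_ids):
    """Step 2: map the IDs to raw scores, one per label."""
    positive = sum(1 for i in input_ids if i in (1, 2))
    negative = sum(1 for i in input_ids if i in (3, 4))
    return [negative, positive]


def postprocess(scores):
    """Step 3: turn raw scores into a readable prediction."""
    labels = ["NEGATIVE", "POSITIVE"]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return {"label": labels[best], "scores": scores}


def toy_pipeline(text):
    """Chain the three steps, as a pipeline does behind one call."""
    return postprocess(toy_model(preprocess(text)))


result = toy_pipeline("A great and good movie")
print(result)  # {'label': 'POSITIVE', 'scores': [0, 2]}
```

The point of the sketch is the shape of the flow: text goes in, numbers pass through the model, and a readable result comes out.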

When no model is specified, the `pipeline()` function **selects a default pretrained model that has been fine-tuned for the given task**.

Let's look at an example. Here the pipeline will **select a pretrained model for the text-generation task**:

In [None]:
generator = pipeline("text-generation")
generator("Hugging face is a AI community")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Hugging face is a AI community trait that can be passed down to their peers, making their reputation stronger and more important for their communities than for their self-improvement. AI community traits are very difficult to learn for some reasons - they can be'}]

model selected: openai-community/gpt2

---------------------------------------

# **Accessing Pre-trained Models Using Pipeline**



1. Go to the **Hugging Face** website.
2. Click on the **Models** page (located at the top right corner).
3. On the left side of the page, you'll see a list of **tasks**.
4. Select a task that interests you.
5. Choose a model that fits your task.
6. Copy the model's name for later use.



---



**Task: Text Generation**

Picked: https://huggingface.co/openai-community/gpt2

Name: gpt2


1. After selecting a model, go to the **model card** page for the chosen model.
2. On the **right corner**, you’ll find a button called **"Use this model"**.
3. Click on it, and a tab called **"How to use from Transformers"** will appear.
4. Copy the code provided and paste it into your cell.



Some models require additional packages to be installed, so check the model card.

In [None]:
from transformers import pipeline
pipe = pipeline("text-generation", model="gpt2")

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cpu


Passing prompt

In [1]:
pipe("Hugging face is a emoji")

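NameError: name 'pipe' is not defined

This error means the cell ran before `pipe` was created (note the fresh execution count `In [1]`). Run the cell that defines `pipe` first, then re-run this one.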

You can also control the output, for example its maximum length and the number of sequences returned.


In [None]:
pipe('Spooky Halloween', max_length=30, num_return_sequences=4)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Spooky Halloween Story. So, what's the point of going out there and trying to be funny?\n\nROB: As a comedian I"},
 {'generated_text': 'Spooky Halloween Stories\n\nThe most interesting and well-known creepy story is that of the great witch, who had only known she was female before'},
 {'generated_text': 'Spooky Halloween Show"\n\nhttp://www.youtube.com/watch?v=8_FeZn7b_Ow'},
 {'generated_text': 'Spooky Halloween Fairy Tale'}]
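The result is a list of dictionaries, one per returned sequence, and each `generated_text` starts with the prompt. A minimal sketch of post-processing that list follows; the sample data is copied from two of the sequences above, and the helper name `continuations` is made up for this example.

```python
# Two of the four sequences from the output above, copied verbatim.
prompt = "Spooky Halloween"
outputs = [
    {"generated_text": 'Spooky Halloween Show"\n\nhttp://www.youtube.com/watch?v=8_FeZn7b_Ow'},
    {"generated_text": "Spooky Halloween Fairy Tale"},
]


def continuations(prompt, outputs):
    """Return only the newly generated text for each sequence."""
    texts = []
    for item in outputs:
        text = item["generated_text"]
        # Drop the prompt when the sequence starts with it.
        if text.startswith(prompt):
            text = text[len(prompt):]
        texts.append(text.strip())
    return texts


for number, text in enumerate(continuations(prompt, outputs), start=1):
    print(f"{number}: {text!r}")
```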



---



**Task: Summarization**

In [None]:
summarizer = pipeline("summarization")

summarizer([
    """AGI (Artificial General Intelligence) is a type of AI that
    can perform any intellectual task
that a human can do. Unlike narrow AI, which is designed for specific tasks
(like language translation or playing chess),
AGI would be able to learn and apply knowledge across a wide range of subjects,
understand complex concepts,
reason, and adapt to new situations, just like a human brain."""
])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Device set to use cpu
Your max_length is set to 142, but your input_length is only 85. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=42)


[{'summary_text': ' AGI (Artificial General Intelligence) is a type of AI that can perform any intellectual task that a human can do . AGI would be able to learn and apply knowledge across a wide range of subjects, understand complex concepts, reason, and adapt to new situations, just like a human brain .'}]

model selected: sshleifer/distilbart-cnn-12-6
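The warning above suggests lowering `max_length` when the input is short. One possible way to pick a value is sketched below. The helper `suggest_lengths` and its halving rule are assumptions made for this example, not something the library provides; the halving simply reproduces the `max_length=42` hint shown for an input length of 85.

```python
def suggest_lengths(input_length, ratio=0.5, floor=5):
    """Hypothetical heuristic: cap the summary at a fraction of the input.

    input_length is the token count reported in the warning.
    """
    max_length = max(floor, int(input_length * ratio))
    min_length = max(1, max_length // 4)
    return {"max_length": max_length, "min_length": min_length}


lengths = suggest_lengths(85)
print(lengths)  # {'max_length': 42, 'min_length': 10}
```

The resulting values could then be passed along with the text, for example `summarizer(text, **lengths)`.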



---



**Task: Question Answering**

In [None]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="What did amanda bake",
    context="Amanda: I baked cookies. Do you want some? Jerry: Sure! Amanda: I'll bring you tomorrow :-)",
)
# question: the query you want the model to answer.
# context: the supporting text that contains the information needed to answer the question.


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'score': 0.6992829442024231, 'start': 16, 'end': 23, 'answer': 'cookies'}

model selected: distilbert/distilbert-base-cased-distilled-squad
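In the result, `start` and `end` are character offsets into the context, so slicing the context with them gives back the answer. The sketch below checks this using the values from the output above. The helper `answer_or_none` and its `threshold` of 0.5 are assumptions added for illustration.

```python
context = "Amanda: I baked cookies. Do you want some? Jerry: Sure! Amanda: I'll bring you tomorrow :-)"

# Result copied from the cell output above.
result = {"score": 0.6992829442024231, "start": 16, "end": 23, "answer": "cookies"}

# The answer is the slice of the context between the two offsets.
span = context[result["start"]:result["end"]]
print(span)  # cookies


def answer_or_none(result, threshold=0.5):
    """Hypothetical helper: keep the answer only if the score is high enough."""
    return result["answer"] if result["score"] >= threshold else None


print(answer_or_none(result))
print(answer_or_none(result, threshold=0.9))
```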



---



Go ahead and explore these pipelines:
* feature-extraction (get the vector representation of a text)
* fill-mask
* ner (named entity recognition)
* sentiment-analysis
* translation
* zero-shot-classification



---



# **Tokenizers**

The **AutoTokenizer** is a class that **automatically loads the appropriate tokenizer for a given model**.


In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model: facebook/bart-large-cnn

In [None]:
raw_inputs = ["Hugging face is an emoji", "Ya but it is also a AI community"]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, max_length=10)
print(inputs)

{'input_ids': [[0, 40710, 3923, 652, 16, 41, 21554, 2, 1, 1], [0, 975, 102, 53, 24, 16, 67, 10, 4687, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


Notice the token IDs:

* [0, 40710, 3923, 652, 16, 41, 21554, 2, 1, 1]
* [0, 975, 102, 53, 24, 16, 67, 10, 4687, 2]

Both have a length equal to max_length, which is 10.
* [0, 40710, 3923, 652, 16, 41, 21554, 2, 1, 1] =>
The first sequence has been **padded with 1** (the padding token ID) at the end so that its length is 10.

Additionally, **the attention mask** [1, 1, 1, 1, 1, 1, 1, 1, 0, 0] indicates **which tokens the model should attend to**. Since the **last two values (1, 1)** are padding, the **model can ignore them**, so the **attention mask for these positions is set to zero**.
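Padding and the attention mask can be reproduced by hand. The sketch below is a plain-Python illustration, not the tokenizer's actual code. The function name `pad_batch` is made up, and the padding ID of 1 is taken from the output above.

```python
PAD_ID = 1  # padding ID seen in the output above


def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the matching masks."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# The two sequences from the output above, without the padding.
batch = [
    [0, 40710, 3923, 652, 16, 41, 21554, 2],
    [0, 975, 102, 53, 24, 16, 67, 10, 4687, 2],
]
padded = pad_batch(batch)
print(padded["input_ids"][0])       # [0, 40710, 3923, 652, 16, 41, 21554, 2, 1, 1]
print(padded["attention_mask"][0])  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```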


In [None]:
raw_inputs = ["Ya but it is also a AI community"]
inputs = tokenizer(raw_inputs, padding=True, truncation=True)
print(inputs)

{'input_ids': [[0, 975, 102, 53, 24, 16, 67, 10, 4687, 435, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


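Notice that the **token IDs are different** from the previous cell's output: there are now 11 IDs, with 435 appearing before the final 2.

Why? Because **we did not pass max_length**, the tokenizer falls back to the model's maximum input length, so **no truncation was performed** on this short input.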
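The effect of `max_length` on this sentence can be mimicked in plain Python. This is a simplified sketch, not the tokenizer's implementation. It assumes each sequence starts with ID 0 and ends with ID 2, as in the outputs above, and the function name `truncate` is made up.

```python
BOS_ID, EOS_ID = 0, 2  # first and last IDs in the outputs above


def truncate(ids, max_length=None):
    """Trim the content tokens so the result, including the start and
    end IDs, fits in max_length. With no max_length, return ids as is."""
    if max_length is None or len(ids) <= max_length:
        return ids
    content = ids[1:-1]
    return [BOS_ID] + content[:max_length - 2] + [EOS_ID]


full = [0, 975, 102, 53, 24, 16, 67, 10, 4687, 435, 2]

print(truncate(full, max_length=10))  # [0, 975, 102, 53, 24, 16, 67, 10, 4687, 2]
print(truncate(full))                 # unchanged, all 11 IDs
```

With `max_length=10` the ID 435 is dropped, which matches the difference between the two cell outputs.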


Let's look at the tokens generated for a given input

In [None]:
tokens = tokenizer.tokenize('Google colab also has a gpu version')
tokens

['Google', 'Ġcol', 'ab', 'Ġalso', 'Ġhas', 'Ġa', 'Ġg', 'pu', 'Ġversion']

Notice that some of the tokens begin with an additional 'Ġ', which indicates that the token is preceded by a space.
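To see what the `Ġ` marker does, the tokens can be glued back together by hand. This is a simplified sketch that works for plain ASCII text like this example; the real tokenizer handles more cases. The function name `join_tokens` is made up.

```python
# Tokens copied from the output above.
tokens = ['Google', 'Ġcol', 'ab', 'Ġalso', 'Ġhas', 'Ġa', 'Ġg', 'pu', 'Ġversion']


def join_tokens(tokens, space_marker="Ġ"):
    """Concatenate the tokens, turning each marker back into a space."""
    return "".join(tokens).replace(space_marker, " ")


print(join_tokens(tokens))  # Google colab also has a gpu version
```

Tokens without the marker, such as `ab` and `pu`, continue the previous token, which is why `colab` and `gpu` come back as single words.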

Extract token IDs

In [None]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids

[20441, 11311, 873, 67, 34, 10, 821, 30738, 1732]

Decode these token IDs

In [None]:
decoded_tokens = tokenizer.decode(token_ids)
decoded_tokens

'Google colab also has a gpu version'
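Converting between tokens and IDs is essentially a vocabulary lookup in both directions. The sketch below builds a tiny vocabulary from the nine tokens and IDs shown above. The real vocabulary is far larger, and the names `token_to_id`, `id_to_token`, `to_ids`, and `to_tokens` are made up for this example.

```python
# Tokens and IDs copied from the cell outputs above.
tokens = ['Google', 'Ġcol', 'ab', 'Ġalso', 'Ġhas', 'Ġa', 'Ġg', 'pu', 'Ġversion']
token_ids = [20441, 11311, 873, 67, 34, 10, 821, 30738, 1732]

# A miniature vocabulary and its inverse.
token_to_id = dict(zip(tokens, token_ids))
id_to_token = {i: t for t, i in token_to_id.items()}


def to_ids(tokens):
    """Look up the ID of each token."""
    return [token_to_id[t] for t in tokens]


def to_tokens(ids):
    """Look up the token behind each ID."""
    return [id_to_token[i] for i in ids]


print(to_ids(['Ġalso', 'Ġa']))       # [67, 10]
print(to_tokens(to_ids(tokens)) == tokens)  # True
```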

# **Implementing Named Entity Recognition (NER) with NLTK**

Installation and Import Statements

In [None]:
!pip install nltk



In [None]:
import nltk
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers

True

Tokenization

In [None]:
from nltk.tokenize import word_tokenize
text = "PES University, BMS College est during early 2000's"
tokens = word_tokenize(text)
print(tokens)

['PES', 'University', ',', 'BMS', 'College', 'est', 'during', 'early', '2000', "'s"]


POS

In [None]:
# Part-of-speech tagging: assigns a grammatical tag (noun, verb, etc.) to each word

from nltk import pos_tag
tagged_words = pos_tag(tokens)
print(tagged_words)

[('PES', 'NNP'), ('University', 'NNP'), (',', ','), ('BMS', 'NNP'), ('College', 'NNP'), ('est', 'JJS'), ('during', 'IN'), ('early', 'JJ'), ('2000', 'CD'), ("'s", 'POS')]
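The tags are Penn Treebank abbreviations. The sketch below spells out the ones that appear in the output above and filters the proper nouns. The dictionary `TAG_MEANINGS` and the helper `describe` are made up for this example and cover only these tags. Note that "est" is presumably short for "established" here, so its superlative-adjective tag is likely a tagger mistake.

```python
# Meanings of the Penn Treebank tags seen in the output above.
TAG_MEANINGS = {
    "NNP": "proper noun, singular",
    "JJS": "adjective, superlative",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "CD": "cardinal number",
    "POS": "possessive ending",
}

# Tagged words copied from the cell output above.
tagged_words = [('PES', 'NNP'), ('University', 'NNP'), (',', ','),
                ('BMS', 'NNP'), ('College', 'NNP'), ('est', 'JJS'),
                ('during', 'IN'), ('early', 'JJ'), ('2000', 'CD'),
                ("'s", 'POS')]


def describe(tag):
    """Return a readable meaning, or a fallback for tags not listed."""
    return TAG_MEANINGS.get(tag, "punctuation or other")


for word, tag in tagged_words:
    print(f"{word:12} {tag:4} {describe(tag)}")

proper_nouns = [word for word, tag in tagged_words if tag == "NNP"]
print(proper_nouns)  # ['PES', 'University', 'BMS', 'College']
```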


NER

In [None]:
from nltk import ne_chunk
named_entities = ne_chunk(tagged_words)
print(named_entities)

# The output is a tree in which named entities are grouped together as chunks, while non-entity words remain as individual tokens.

(S
  (ORGANIZATION PES/NNP)
  (GPE University/NNP)
  ,/,
  (ORGANIZATION BMS/NNP College/NNP)
  est/JJS
  during/IN
  early/JJ
  2000/CD
  's/POS)
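To pull just the entities out of this result, one option is to read them from the printed tree. The sketch below does that with a regular expression over the text shown above. The function name `entities_from_printed_tree` is made up, and the approach assumes the flat, one-level chunk layout seen here.

```python
import re

# The printed tree, copied from the cell output above.
printed_tree = """(S
  (ORGANIZATION PES/NNP)
  (GPE University/NNP)
  ,/,
  (ORGANIZATION BMS/NNP College/NNP)
  est/JJS
  during/IN
  early/JJ
  2000/CD
  's/POS)"""

# Matches one chunk such as "(ORGANIZATION BMS/NNP College/NNP)".
CHUNK = re.compile(r"\((\w+) ([^()]+)\)")


def entities_from_printed_tree(text):
    """Return (label, entity text) pairs for every chunk in the text."""
    entities = []
    for label, body in CHUNK.findall(text):
        # Each item looks like "word/TAG"; keep only the word.
        words = [item.rsplit("/", 1)[0] for item in body.split()]
        entities.append((label, " ".join(words)))
    return entities


for label, entity in entities_from_printed_tree(printed_tree):
    print(f"{label:14} {entity}")
```

With the live `named_entities` object, the same information should be available without parsing text, by iterating over its subtrees and reading each one's `label()` and `leaves()`. The output also shows a limitation of the chunker: "PES University" was split into two separate entities.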


Read from a txt file

In [None]:
file_name = 'your_file.txt'
# Open the file and read its content
with open(file_name, 'r') as file:
    content = file.read()

Tokens+POS+NER

In [None]:
text = content
tokens = word_tokenize(text)
print(f'Word Tokens: {tokens}')
tagged_words = pos_tag(tokens)
print(f'Tagged Words: {tagged_words}')
named_entities = ne_chunk(tagged_words)
print(f'Named Entities: {named_entities}')

# **Assignment**