In [33]:
!pip install transformers



In [32]:
import warnings

# Suppress specific warning
warnings.filterwarnings("ignore", message="Using a pipeline without specifying a model name and revision in production is not recommended.")
warnings.filterwarnings("ignore", message="`resume_download` is deprecated and will be removed in version 1.0.0.")
warnings.filterwarnings("ignore", message="The secret `HF_TOKEN` does not exist in your Colab secrets.")

# Text Classification

In [4]:
from transformers import pipeline
classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

In [7]:
positive_example = """Subject: Great Job on the Project!
Body: I was initially skeptical about your approach, and there were moments I thought it might not work.
However, despite some minor issues, the outcome was impressive."""

negative_example = """Subject: Concerns about the Recent Update
Body: While the update had some good elements, it failed to address several critical issues.
I'm not entirely dissatisfied, but we need to discuss the unresolved problems soon.
"""

In [8]:
classifier(positive_example)

[{'label': 'POSITIVE', 'score': 0.9997162222862244}]

In [10]:
classifier(negative_example)

[{'label': 'NEGATIVE', 'score': 0.9994484782218933}]

# Named Entity Recognition (NER)

In [11]:
import pandas as pd
from transformers import pipeline
ner = pipeline("ner", aggregation_strategy="simple")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/998 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.33G [00:00<?, ?B/s]

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/60.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

In [12]:
example_1 = "Shahbaz Sharif is the Prime minister of Pakistan"
example_2 = "Sarah is 27 years old, and she study in harvard university"

In [13]:
print(pd.DataFrame(ner(example_1)))

  entity_group     score            word  start  end
0          PER  0.999271  Shahbaz Sharif      0   14
1          LOC  0.999003        Pakistan     40   48
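
The `start` and `end` columns are character offsets into the input string (with `end` exclusive), so each entity can be recovered by slicing the text. A small sketch of that idea, using the rows printed above as hard-coded data instead of calling the model again; the helper name `extract_spans` and the `min_score` threshold are assumptions made for illustration:

```python
text = "Shahbaz Sharif is the Prime minister of Pakistan"

# Rows copied from the DataFrame printed above (scores as displayed).
entities = [
    {"entity_group": "PER", "score": 0.999271, "word": "Shahbaz Sharif", "start": 0, "end": 14},
    {"entity_group": "LOC", "score": 0.999003, "word": "Pakistan", "start": 40, "end": 48},
]


def extract_spans(text, entities, min_score=0.9):
    """Return (label, surface_text) pairs for entities scoring at least min_score."""
    return [
        (ent["entity_group"], text[ent["start"]:ent["end"]])
        for ent in entities
        if ent["score"] >= min_score
    ]


print(extract_spans(text, entities))
```

Slicing by offsets rather than trusting the `word` column can be useful when the exact original surface form, including casing and spacing, is needed.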


In [14]:
print(pd.DataFrame(ner(example_2)))

  entity_group     score   word  start  end
0          PER  0.997724  Sarah      0    5


# Question Answering

In [19]:
from transformers import pipeline
question_answering = pipeline("question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [17]:
question = "which country invent the currency note first?"

text = """The first known banknote was first developed in China during the Tang and Song dynasties,
starting in the 7th century. Its roots were in merchant receipts of deposit during the Tang dynasty (618–907),
as merchants and wholesalers desired to avoid the heavy bulk of copper coinage in large commercial transactions.
"""

In [21]:
outputs = question_answering(question=question, context=text)
pd.DataFrame([outputs])

      score  start  end answer
0  0.994988     48   53  China


# Summarization

In [22]:
from transformers import pipeline
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

In [23]:
example_text = """
Once upon a time in a small village nestled between two mountains, there lived a young girl named Ella.
Ella loved exploring the forests surrounding her village, often discovering hidden streams and meadows
filled with wildflowers. One sunny morning, while wandering deeper into the woods than ever before, she
stumbled upon a peculiar-looking tree with golden leaves.

Intrigued, Ella touched one of the leaves, and to her amazement, the tree began to speak.
It introduced itself as an ancient tree with magical powers, capable of granting a single wish to those pure of heart.
Without hesitation, Ella wished for her village to prosper, as it had been struggling with poor harvests and a harsh winter.

The tree's golden leaves shimmered and fell to the ground, transforming into a shower of gold dust that spread
across the village. Miraculously, the fields began to flourish, and the villagers rejoiced at the bountiful harvest
that followed. Ella became a hero in her village, and the story of the magical tree was passed down through generations,
reminding everyone of the power of kindness and selflessness.
"""

In [24]:
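# Note: max_length=30 is below the default min_length (56) reported in the warning below,
# so the output is cut off at max_length tokens; pass a smaller min_length as well to silence the warning.
summary = summarizer(example_text, clean_up_tokenization_spaces=True, max_length=30)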
print(summary[0]['summary_text'])

Your min_length=56 must be inferior than your max_length=30.


 The story of a tree with magical powers has been passed down through generations of generations of people. The tree's golden leaves fell to the


# Sentiment Analysis

In [29]:
from transformers import pipeline
sentiment = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [27]:
sentiment_1 = "The service at the restaurant was terrible. I had to wait for over an hour for my food."
sentiment_2 = "While I appreciate the effort, the final product didn't meet my expectations."

In [30]:
results = sentiment(sentiment_1)
print(results)

[{'label': 'NEGATIVE', 'score': 0.9996429681777954}]


In [31]:
results = sentiment(sentiment_2)
print(results)

[{'label': 'NEGATIVE', 'score': 0.9996115565299988}]


# **Key Points to Remember**

**Default models used by Hugging Face for different use cases**

<table>
  <tr>
    <th>S.No</th>
    <th>Problems</th>
    <th>Default Models</th>
  </tr>
  <tr>
    <td>1</td>
    <td>Text Classification</td>
    <td>distilbert/distilbert-base-uncased-finetuned-sst-2-english</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Named Entity Recognition</td>
    <td>dbmdz/bert-large-cased-finetuned-conll03-english</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Question Answering</td>
    <td>distilbert/distilbert-base-cased-distilled-squad</td>
  </tr>
  <tr>
    <td>4</td>
    <td>Text Summarization</td>
    <td>sshleifer/distilbart-cnn-12-6</td>
  </tr>
  <tr>
    <td>5</td>
    <td>Sentiment Analysis</td>
    <td>distilbert/distilbert-base-uncased-finetuned-sst-2-english</td>
  </tr>

</table>
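
The log messages in the cells above recommend naming the model and revision explicitly rather than relying on these defaults. A minimal sketch of doing so, using the model names and revision hashes printed in those logs; the helper names `pinned_kwargs` and `build_pipeline` are made up for illustration, and `transformers` is imported lazily so the lookup table can be inspected without downloading anything:

```python
# Default checkpoints and revisions, as reported in the pipeline log messages above.
SST2 = ("distilbert/distilbert-base-uncased-finetuned-sst-2-english", "af0f99b")
DEFAULT_MODELS = {
    "text-classification": SST2,
    "ner": ("dbmdz/bert-large-cased-finetuned-conll03-english", "f2482bf"),
    "question-answering": ("distilbert/distilbert-base-cased-distilled-squad", "626af31"),
    "summarization": ("sshleifer/distilbart-cnn-12-6", "a4f8f3e"),
    "sentiment-analysis": SST2,
}


def pinned_kwargs(task):
    """Return keyword arguments that pin the model and revision for a task."""
    model, revision = DEFAULT_MODELS[task]
    return {"task": task, "model": model, "revision": revision}


def build_pipeline(task, **extra):
    """Create a pipeline with an explicit model and revision (hypothetical helper)."""
    from transformers import pipeline  # imported here so the table works without the library

    return pipeline(**pinned_kwargs(task), **extra)


print(pinned_kwargs("summarization"))
```

With this in place, a call such as `build_pipeline("ner", aggregation_strategy="simple")` should behave like the NER cell above while making the chosen checkpoint explicit.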

**config.json:**

This file contains the model's configuration: the architecture and the hyperparameters that define it, such as hidden size, number of layers, and dropout probabilities. It provides the information needed to instantiate the model in code. Tokenizer settings are stored separately, in tokenizer_config.json. An illustrative excerpt for a BERT-style model:

```
{
    "attention_probs_dropout_prob": 0.1,
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 30522
}
```

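- **attention_probs_dropout_prob**: Dropout probability applied to the attention weights.
- **hidden_dropout_prob**: Dropout probability applied to the fully connected layers in the embeddings and encoder.
- **hidden_size**: Dimensionality of the token representations passed between layers (768 here). The width of the feed-forward sublayer is a separate parameter, `intermediate_size`.
- **num_attention_heads**: Number of attention heads in each attention layer.
- **num_hidden_layers**: Number of stacked transformer layers. BERT is encoder-only, so these are all encoder layers; each consists of sublayers such as self-attention, feed-forward, and layer normalization.
- **type_vocab_size**: Number of distinct segment (token type) IDs, used to tell one input sentence from another.
- **vocab_size**: Number of distinct tokens in the model's vocabulary.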
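
To make the relationship between these values concrete, here is a small sketch that parses the example configuration and derives two quantities from it. It assumes the standard BERT layout, in which `hidden_size` is split evenly across attention heads; the function names are made up for illustration, and the JSON is the excerpt above rather than a file read from a real checkpoint:

```python
import json

# Illustrative config text copied from the excerpt above.
config_text = """
{
    "attention_probs_dropout_prob": 0.1,
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 30522
}
"""
config = json.loads(config_text)


def head_dim(cfg):
    """Size of each attention head: hidden_size divided evenly among the heads."""
    hidden, heads = cfg["hidden_size"], cfg["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return hidden // heads


def token_embedding_params(cfg):
    """Number of weights in the token embedding matrix (vocab_size x hidden_size)."""
    return cfg["vocab_size"] * cfg["hidden_size"]


print(head_dim(config))                # 64
print(token_embedding_params(config))  # 23440896
```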

<br>
<hr>
<br>

**model.safetensors**


safetensors is a safe and fast file format for storing and loading tensors. Typically, PyTorch model weights are saved or pickled into a .bin file with Python’s pickle utility. However, pickle is not secure and pickled files may contain malicious code that can be executed. safetensors is a secure alternative to pickle, making it ideal for sharing model weights.<br><br>


The JSON example below illustrates serialization and deserialization in general; it does not use safetensors itself.

          import json

          # Shopping cart data
          shopping_cart = {
              "items": [
                  {"name": "Laptop", "quantity": 1, "price": 999.99},
                  {"name": "Headphones", "quantity": 2, "price": 49.99},
                  {"name": "Mouse", "quantity": 1, "price": 19.99}
              ],
              "total": 1119.96
          }

          # Serialize the shopping cart to JSON
          serialized_cart = json.dumps(shopping_cart)

          # Save the serialized data to a file
          with open("shopping_cart.json", "w") as file:
              file.write(serialized_cart)

          # Read the serialized data from the file
          with open("shopping_cart.json", "r") as file:
              serialized_cart = file.read()

          # Deserialize the JSON data into a Python dictionary
          restored_cart = json.loads(serialized_cart)

          print(restored_cart)
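
The security concern with pickle mentioned above comes from the fact that unpickling can call arbitrary callables chosen by whoever created the file. A deliberately harmless sketch of the mechanism: the `Payload` class is hypothetical, and `eval` on a constant arithmetic expression stands in for what could be any command in a malicious file.

```python
import pickle


class Payload:
    """Hypothetical object whose pickled form asks the loader to call eval()."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" instead of the object's data.
        # A malicious file could name a far more dangerous callable here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading the bytes runs the stored call; no Payload instance comes back.
result = pickle.loads(blob)
print(result)  # 42
```

A format that stores only tensor data and metadata, as safetensors is described above, has no equivalent hook for running code at load time.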

<br>
<hr>
<br>


**pytorch_model.bin:**

This file contains the model's weights and other learned parameters, produced by pre-training or fine-tuning and saved in PyTorch's pickle-based format.
Example content: serialized tensor data representing the weights of each layer, including the attention and feed-forward matrices.

- **Serialization**: The process of converting data structures, such as tensors, into a byte stream or string that can be stored, transmitted, and later reconstructed with the original structure and content intact. The example below serializes a plain dictionary with Python's pickle module.<br><br>

        import pickle

        # Example data to serialize
        data = {'name': 'John', 'age': 30, 'city': 'New York'}

        # Serialize data using pickle and save to a file
        with open('data.pkl', 'wb') as file:
            pickle.dump(data, file)

        # Deserialize data from the file
        with open('data.pkl', 'rb') as file:
            loaded_data = pickle.load(file)

        print(loaded_data)

<br>
<hr>
<br>


**tokenizer_config.json:**

This file contains the configuration parameters of the tokenizer, such as the tokenizer type, special tokens, and maximum sequence length.
Illustrative content (the exact keys vary by model and library version):

```
{
    "max_len": 512,
    "model_type": "bert",
    "pad_token_id": 0,
    "vocab_size": 30522
}
```

- **max_len**: Maximum sequence length, in tokens, that the tokenizer will produce.
    1. Sequences longer than this length are truncated when truncation is enabled.
    2. Sequences shorter than this length are padded up to it when padding to the maximum length is requested.
- **pad_token_id**: Token ID used for padding. Shorter sequences are filled with this ID until they reach the target length.
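
The truncation and padding behaviour described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the tokenizer's actual implementation: the function name `pad_or_truncate` is hypothetical, the token IDs are arbitrary, and real tokenizers also handle special tokens and other details when truncating.

```python
def pad_or_truncate(token_ids, max_len, pad_token_id=0):
    """Return a copy of token_ids that is exactly max_len long."""
    ids = list(token_ids[:max_len])                  # truncate if too long
    ids += [pad_token_id] * (max_len - len(ids))     # pad if too short
    return ids


short_sequence = [5, 6, 7, 8]        # arbitrary example IDs
long_sequence = list(range(1, 10))   # nine arbitrary IDs

print(pad_or_truncate(short_sequence, max_len=6))  # [5, 6, 7, 8, 0, 0]
print(pad_or_truncate(long_sequence, max_len=6))   # [1, 2, 3, 4, 5, 6]
```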