In [1]:
!pip install transformers



In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from transformers import pipeline
import tensorflow as tf
import math

# Pipeline Function

In [6]:
classifier=pipeline("sentiment-analysis")
classifier(["i have been working here more the time without paid",
          "i hate everything and love everything"])

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9956315755844116},
 {'label': 'POSITIVE', 'score': 0.999284565448761}]

In [8]:
classifier=pipeline("zero-shot-classification")
classifier("the education system is bad in the  world due to politics",
          candidate_labels=['education','politics','environments'])

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'the education system is bad in the  world due to politics',
 'labels': ['education', 'politics', 'environments'],
 'scores': [0.5001472234725952, 0.4958876967430115, 0.0039651351980865]}

In [11]:
generator=pipeline("text-generation")
generator("Nepal is a country having ")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Nepal is a country having \xa0a strong ethnic and religious community. But is Tibet an independent country?\xa0 Is Nepal one of the most beautiful, or is India its favourite country?\xa0 Has India made peace with Tibet? There is a'}]

In [14]:
generator=pipeline("text-generation",model="gpt2")
generator("Nepal is a country having ",max_length=30,num_return_sequences=3)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Nepal is a country having \xa0its own identity crisis, and that crisis has been exacerbated by a rising tide of ethnic tensions. The '},
 {'generated_text': "Nepal is a country having vernacular words and words of English used by its people in a way that the West doesn't understand. For"},
 {'generated_text': 'Nepal is a country having _____ for all intents and purposes," says Vintner. And here is where your mileage crosses the line'}]

# Fill-mask for missing words


In [20]:
unmasker=pipeline("fill-mask")
unmasker("this world is best for <mask> and is not best for <mask>",top_k=3)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[[{'score': 0.06011602282524109,
   'token': 961,
   'token_str': ' everyone',
   'sequence': '<s>this world is best for everyone and is not best for<mask></s>'},
  {'score': 0.04559367895126343,
   'token': 30515,
   'token_str': ' mankind',
   'sequence': '<s>this world is best for mankind and is not best for<mask></s>'},
  {'score': 0.04126153886318207,
   'token': 9187,
   'token_str': ' humanity',
   'sequence': '<s>this world is best for humanity and is not best for<mask></s>'}],
 [{'score': 0.05913162976503372,
   'token': 961,
   'token_str': ' everyone',
   'sequence': '<s>this world is best for<mask> and is not best for everyone</s>'},
  {'score': 0.05457741394639015,
   'token': 9187,
   'token_str': ' humanity',
   'sequence': '<s>this world is best for<mask> and is not best for humanity</s>'},
  {'score': 0.05069468170404434,
   'token': 30515,
   'token_str': ' mankind',
   'sequence': '<s>this world is best for<mask> and is not best for mankind</s>'}]]

# Named Entity recognition

In [30]:
ner=pipeline("ner",grouped_entities=True)
ner("My name is Hero and i am from bangladesh i am an student of arts. I work at a huggingFace.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER',
  'score': 0.99065137,
  'word': 'Hero',
  'start': 11,
  'end': 15},
 {'entity_group': 'LOC',
  'score': 0.79891044,
  'word': 'bangladesh',
  'start': 30,
  'end': 40},
 {'entity_group': 'ORG',
  'score': 0.6582433,
  'word': '##Face',
  'start': 85,
  'end': 89}]

# Question-answering

In [31]:
ana=pipeline("question-answering")
ana(
question="where does the peoples work",
context="peoples work at the far region"
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/473 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/261M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

{'score': 0.44286495447158813,
 'start': 16,
 'end': 30,
 'answer': 'the far region'}

In [32]:
summary=pipeline("summarization")
summary(""" 
In the enchanting realm of artificial intelligence, where algorithms dance with data, innovation unfolds at a breathtaking pace. From the realms of machine learning to natural language processing, the landscape of AI continually evolves, weaving intricate patterns of intellect. As silicon minds tirelessly crunch numbers, they give rise to technologies that reshape our world.

At the core of this digital odyssey lies the marvel of deep learning, an intricate web of interconnected neurons emulating the human brain. Neural networks, inspired by their biological counterparts, unravel complex problems with an unprecedented finesse. They dissect images, translate languages, and even compose symphonies, all driven by the relentless pursuit of efficiency.

In the ever-expanding universe of AI applications, natural language processing emerges as a luminary. Machines decipher the nuances of human communication, conversing fluently and comprehending context. Chatbots engage in dialogues, virtual assistants anticipate needs, and language models like GPT-3 unravel the beauty of language in profound ways. Yet, ethical considerations and responsible AI stewardship accompany these advancements, urging a delicate balance between progress and prudence.

As algorithms traverse the vast expanse of uncharted territories, the transformative power of AI extends beyond computation. Autonomous vehicles navigate bustling streets, healthcare welcomes the precision of predictive analytics, and industries embrace automation to enhance productivity. The tapestry of possibilities woven by artificial intelligence becomes a testament to human ingenuity, propelling society into an era where the boundaries of innovation are seemingly boundless.

In this tapestry, challenges intertwine with triumphs, as the quest for ethical AI, unbiased algorithms, and equitable access unfolds. The future of artificial intelligence is an unwritten symphony, where each line of code adds a note to the melody of progress. As the journey continues, humanity stands at the crossroads of discovery, holding the keys to unlock the vast potential embedded in the digital age.
""")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

  return self.fget.__get__(instance, owner)()


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

[{'summary_text': ' The landscape of AI continually evolves, weaving intricate patterns of intellect . As silicon minds crunch numbers, they give rise to technologies that reshape our world . The future of artificial intelligence is an unwritten symphony, where each line of code adds a note to the melody of progress .'}]

In [34]:
translator=pipeline("translation",model="google-t5/t5-large")
translator("the world is beautiful")

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.95G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]



[{'translation_text': 'die Welt ist schön'}]

# What happens inside a Pipeline Function?


In [4]:
from transformers import AutoTokenizer
checkpoint="distilbert-base-uncased-finetuned-sst-2-english"
tokenizer=AutoTokenizer.from_pretrained(checkpoint)
raw_inputs=["i have been drinking from last 2 days",'shut the fuck up']
inputs=tokenizer(raw_inputs,padding=True,truncation=True,return_tensors="tf")

In [9]:
inputs['attention_mask']

<tf.Tensor: shape=(2, 10), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]], dtype=int32)>

In [25]:
from transformers import TFAutoModelForSequenceClassification
checkpoint="distilbert-base-uncased-finetuned-sst-2-english"
model=TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
output=model(**inputs)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [30]:
output.logits

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[ 1.7742105, -1.5290266],
       [ 3.1991885, -2.5411508]], dtype=float32)>

In [27]:
predictions=tf.math.softmax(output.logits,axis=-1)
print(predictions)

tf.Tensor(
[[0.96453965 0.03546031]
 [0.99679667 0.00320338]], shape=(2, 2), dtype=float32)


In [35]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [32]:
# Assuming model is an instance of DistilBertConfig
print(model.config.__dict__)



{'vocab_size': 30522, 'max_position_embeddings': 512, 'sinusoidal_pos_embds': False, 'n_layers': 6, 'n_heads': 12, 'dim': 768, 'hidden_dim': 3072, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation': 'gelu', 'initializer_range': 0.02, 'qa_dropout': 0.1, 'seq_classif_dropout': 0.2, 'return_dict': True, 'output_hidden_states': False, 'output_attentions': False, 'torchscript': False, 'torch_dtype': None, 'use_bfloat16': False, 'tf_legacy_loss': False, 'pruned_heads': {}, 'tie_word_embeddings': True, 'chunk_size_feed_forward': 0, 'is_encoder_decoder': False, 'is_decoder': False, 'cross_attention_hidden_size': None, 'add_cross_attention': False, 'tie_encoder_decoder': False, 'max_length': 20, 'min_length': 0, 'do_sample': False, 'early_stopping': False, 'num_beams': 1, 'num_beam_groups': 1, 'diversity_penalty': 0.0, 'temperature': 1.0, 'top_k': 50, 'top_p': 1.0, 'typical_p': 1.0, 'repetition_penalty': 1.0, 'length_penalty': 1.0, 'no_repeat_ngram_size': 0, 'encoder_no_repeat_ngram_size': 

# Instantiate a transformers Model

In [41]:
from transformers import TFAutoModel
bert_model=TFAutoModel.from_pretrained("bert-base-cased")
print(type(bert_model))
gpt_model=TFAutoModel.from_pretrained("gpt2")
print(type(gpt_model))
bart_model=TFAutoModel.from_pretrained("facebook/bart-base")
print(type(bart_model))

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

<class 'transformers.models.bert.modeling_tf_bert.TFBertModel'>


All PyTorch model weights were used when initializing TFGPT2Model.

All the weights of TFGPT2Model were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2Model for predictions without further training.


<class 'transformers.models.gpt2.modeling_tf_gpt2.TFGPT2Model'>


All PyTorch model weights were used when initializing TFBartModel.

All the weights of TFBartModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBartModel for predictions without further training.


<class 'transformers.models.bart.modeling_tf_bart.TFBartModel'>


In [46]:
from transformers import AutoConfig

bert_config=AutoConfig.from_pretrained("bert-base-cased")
print(type(bert_config))

gpt_config=AutoConfig.from_pretrained("gpt2")
print(type(gpt_config))
bart_config=AutoConfig.from_pretrained("facebook/bart-base")
print(type(bart_config))


<class 'transformers.models.bert.configuration_bert.BertConfig'>
<class 'transformers.models.gpt2.configuration_gpt2.GPT2Config'>
<class 'transformers.models.bart.configuration_bart.BartConfig'>


In [47]:
bert_config

BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.37.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

# You can instantiate a given model with random weights from this config

In [49]:
from transformers import BertConfig,TFBertModel

In [55]:
bert_config=BertConfig.from_pretrained('bert-base-cased',num_hidden_layers=10)
bert_model=TFBertModel(bert_config)
bert_model.save_pretrained("my_bert_model")

ValueError: Weights for model 'tf_bert_model_5' have not yet been created. Weights are created when the model is first called on inputs or `build()` is called with an `input_shape`.

ValueError: Weights for model 'tf_bert_model_4' have not yet been created. Weights are created when the model is first called on inputs or `build()` is called with an `input_shape`.

In [57]:
import json

# Specify the path to the JSON file
file_path = '/kaggle/working/my_bert_model/config.json'

# Read the contents of the file
with open(file_path, 'r') as file:
    json_content = file.read()

# Parse the JSON content
ro = json.loads(json_content)


In [58]:
ro

{'architectures': ['BertModel'],
 'attention_probs_dropout_prob': 0.1,
 'classifier_dropout': None,
 'gradient_checkpointing': False,
 'hidden_act': 'gelu',
 'hidden_dropout_prob': 0.1,
 'hidden_size': 768,
 'initializer_range': 0.02,
 'intermediate_size': 3072,
 'layer_norm_eps': 1e-12,
 'max_position_embeddings': 512,
 'model_type': 'bert',
 'num_attention_heads': 12,
 'num_hidden_layers': 10,
 'pad_token_id': 0,
 'position_embedding_type': 'absolute',
 'transformers_version': '4.37.0',
 'type_vocab_size': 2,
 'use_cache': True,
 'vocab_size': 28996}

# wordbased character-based,subword_based

# The Tokenization Pipeline

In [15]:
import tensorflow as tf

In [2]:
from transformers import AutoTokenizer
tokenizer=AutoTokenizer.from_pretrained('bert-base-uncased')
tokens=tokenizer.tokenize("let's try by tokenizing")

print(tokens)

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

['let', "'", 's', 'try', 'by', 'token', '##izing']


In [7]:
tokenizer=AutoTokenizer.from_pretrained('bert-base-uncased')
tokens=tokenizer.tokenize("let's try by tokenizing")
input_ids=tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)

[2292, 1005, 1055, 3046, 2011, 19204, 6026]


In [10]:
tokenizer=AutoTokenizer.from_pretrained('bert-base-uncased')
tokens=tokenizer.tokenize("let's try by tokenizing")
input_ids=tokenizer.convert_tokens_to_ids(tokens)
final_output=tokenizer.prepare_for_model(input_ids)
print(final_output)

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': [101, 2292, 1005, 1055, 3046, 2011, 19204, 6026, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


### use .decode to decode the

In [9]:
from transformers import AutoTokenizer
tokenizer=AutoTokenizer.from_pretrained("bert-base-uncased")
inputs=tokenizer("let's try to tokenize and decoding the exa larger words by tokenization")
print(tokenizer.decode(inputs["input_ids"]))

[CLS] let's try to tokenize and decoding the exa larger words by tokenization [SEP]


# Batching input Together

In [13]:
from transformers import AutoTokenizer
checkpoint="distilbert-base-uncased-finetuned-sst-2-english"
tokenizer=AutoTokenizer.from_pretrained(checkpoint)
sentences=[
    "I've been holidayy in a intense holiday for a couple of months",
    "I hate the holiday"
]
tokens=[tokenizer.tokenize(sent) for sent in sentences]
ids=[tokenizer.convert_tokens_to_ids(tok) for tok in tokens]
ids

[[1045,
  1005,
  2310,
  2042,
  6209,
  2100,
  1999,
  1037,
  6387,
  6209,
  2005,
  1037,
  3232,
  1997,
  2706],
 [1045, 5223, 1996, 6209]]

In [46]:
ids0=tf.constant(ids[0])
ids1=tf.constant([1045, 5223, 1996, 6209,0,0,0,0,0,0,0,0,0,0,0])
print(ids1)

tf.Tensor(
[1045 5223 1996 6209    0    0    0    0    0    0    0    0    0    0
    0], shape=(15,), dtype=int32)


In [48]:
print(len(ids1))
print(len(ids0))

15
15


In [50]:
from transformers import TFAutoModelForSequenceClassification
model=TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
output1=model(ids0)
output2=model(ids1)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [52]:
print(output1.logits)
print(output2.logits)
o

tf.Tensor([[-1.4933891  1.5724165]], shape=(1, 2), dtype=float32)
tf.Tensor([[ 1.7307717 -1.567487 ]], shape=(1, 2), dtype=float32)


# Hugging Face Datasets overview

In [54]:
from datasets import load_dataset
raw_datasets=load_dataset("glue",'mrpc')
raw_datasets

Downloading builder script:   0%|          | 0.00/7.78k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/4.47k [00:00<?, ?B/s]

Downloading and preparing dataset glue/mrpc (download: 1.43 MiB, generated: 1.43 MiB, post-processed: Unknown size, total: 2.85 MiB) to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data: 0.00B [00:00, ?B/s]

Downloading data: 0.00B [00:00, ?B/s]

Downloading data: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/3668 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/408 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1725 [00:00<?, ? examples/s]

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [55]:
raw_datasets.keys()

dict_keys(['train', 'validation', 'test'])

In [62]:
df=raw_datasets['train']

In [63]:
import pandas as pd
pd.DataFrame(df)

Unnamed: 0,sentence1,sentence2,label,idx
0,"Amrozi accused his brother , whom he called "" ...","Referring to him as only "" the witness "" , Amr...",1,0
1,Yucaipa owned Dominick 's before selling the c...,Yucaipa bought Dominick 's in 1995 for $ 693 m...,0,1
2,They had published an advertisement on the Int...,"On June 10 , the ship 's owners had published ...",1,2
3,"Around 0335 GMT , Tab shares were up 19 cents ...","Tab shares jumped 20 cents , or 4.6 % , to set...",0,3
4,"The stock rose $ 2.11 , or about 11 percent , ...",PG & E Corp. shares jumped $ 1.63 or 8 percent...,1,4
...,...,...,...,...
3663,""" At this point , Mr. Brando announced : ' Som...","Brando said that "" somebody ought to put a bul...",1,4071
3664,"Martin , 58 , will be freed today after servin...",Martin served two thirds of a five-year senten...,0,4072
3665,""" We have concluded that the outlook for price...","In a statement , the ECB said the outlook for ...",1,4073
3666,The notification was first reported Friday by ...,MSNBC.com first reported the CIA request on Fr...,1,4074


In [64]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [74]:
from transformers import AutoTokenizer
checkpoint="bert-base-cased"
model=AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example['sentence1'],example['sentence2'],padding="max_length",truncation=True,max_length=128)
tokenized_datasets=raw_datasets.map(tokenize_function,batched=True)


  0%|          | 0/4 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [1]:
tokenized_datasets.column_names

NameError: name 'tokenized_datasets' is not defined

In [76]:
tokenized_datasets=tokenized_datasets.remove_columns(["idx","sentence1","sentence2"])
tokenized_datasets=tokenized_datasets.rename_column("label","labels")
tokenize_datasets=tokenized_datasets.with_format("tensorflow")
tokenized_datasets["train"]

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 3668
})

In [81]:
op=pd.DataFrame(tokenized_datasets['train'])
op

Unnamed: 0,labels,input_ids,attention_mask
0,1,"[101, 2572, 3217, 5831, 5496, 2010, 2567, 1010...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,0,"[101, 9805, 3540, 11514, 2050, 3079, 11282, 22...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,1,"[101, 2027, 2018, 2405, 2019, 15147, 2006, 199...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
3,0,"[101, 2105, 6021, 19481, 13938, 2102, 1010, 21...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,1,"[101, 1996, 4518, 3123, 1002, 1016, 1012, 2340...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
...,...,...,...
3663,1,"[101, 1000, 2012, 2023, 2391, 1010, 2720, 1012...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
3664,0,"[101, 3235, 1010, 5388, 1010, 2097, 2022, 1065...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
3665,1,"[101, 1000, 2057, 2031, 5531, 2008, 1996, 1768...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
3666,1,"[101, 1996, 26828, 2001, 2034, 2988, 5958, 201...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."


In [14]:
import tensorflow as tf
from transformers import AutoTokenizer
checkpoint="bert-base-uncased"
tokenizer=AutoTokenizer.from_pretrained(checkpoint)
sequence=[
    "Ive been working harder for more than fucking 2 years ",
    "Help the world to make a better world"
]
batch=tokenizer(sequence,padding=True,truncation=True,return_tensors="tf")

In [7]:
batch

{'input_ids': <tf.Tensor: shape=(2, 13), dtype=int32, numpy=
array([[ 101, 4921, 2063, 2042, 2551, 6211, 2005, 2062, 2084, 8239, 1016,
        2086,  102],
       [ 101, 2393, 1996, 2088, 2000, 2191, 1037, 2488, 2088,  102,    0,
           0,    0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(2, 13), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 13), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]], dtype=int32)>}

In [11]:
from datasets import load_dataset
from transformers import AutoTokenizer
raw_datasets=load_dataset("glue","mrpc")
checkpoint="bert-base-cased"
tokenizer=AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(examples):
    return tokenizer(
        examples['sentence1'],example['sentence2'],padding="max_length",truncation=True,max_length=128
    )
tokenized_datasets=raw_datasets.remove_columns(["idx","sentence1","sentence2"])
tokenized_datasets=tokenized_datasets.rename_column("label","labels")
tokenize_datasets=tokenized_datasets.with_format("tensorflow")


  0%|          | 0/3 [00:00<?, ?it/s]

(3668, 1)

In [12]:
tokenized_datasets["train"].shape

(3668, 1)

In [22]:
from tensorflow.keras.utils import DataLoader


ImportError: cannot import name 'DataLoader' from 'tensorflow.keras.utils' (/opt/conda/lib/python3.10/site-packages/keras/api/_v2/keras/utils/__init__.py)

In [23]:
import tensorflow as tf

## The Trainer API

In [4]:
from datasets import load_dataset
from transformers import AutoTokenizer ,DataCollatorWithPadding
raw_datasets=load_dataset("glue","mrpc")
checkpoint="bert-base-cased"
tokenizer=AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(examples):
    return tokenizer(examples['sentence1'],examples['sentence2'],truncation=True)

tokenized_datasets=raw_datasets.map(tokenize_function,batched=True)
datacollator=DataCollatorWithPadding(tokenizer)


  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/4 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [24]:
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer
from datasets import load_metric
metric=load_metric("glue","mrpc")
def compute_metrices(eval_preds):
    logits,labels=eval_preds 
    preds=np.argmax(logits,axis=-1)
    return metric.compute(predictions=preds,reference=predictions.labels)
model=AutoModelForSequenceClassification.from_pretrained('bert-base-uncased',num_labels=2,max_length=64, gradient_checkpointing=True)
training_args=TrainingArguments("test-trainer",
                               per_device_train_batch_size=4,
                               per_device_eval_batch_size=4,
                               num_train_epochs=2,
                               learning_rate=2e-5,
                               weight_decay=0.01,
                                )

trainer=Trainer(
model,
training_args,
train_dataset=tokenized_datasets['train'],
eval_dataset=tokenized_datasets['validation'],
data_collator=datacollator,
tokenizer=tokenizer,
compute_metrics=compute_metrices)

trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
500,0.6241
1000,0.5604
1500,0.5341


Checkpoint destination directory test-trainer/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Non-default generation parameters: {'max_length': 64}
Checkpoint destination directory test-trainer/checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Non-default generation parameters: {'max_length': 64}
Checkpoint destination directory test-trainer/checkpoint-1500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Non-default generation parameters: {'max_length': 64}


TrainOutput(global_step=1834, training_loss=0.5667965971136562, metrics={'train_runtime': 145.104, 'train_samples_per_second': 50.557, 'train_steps_per_second': 12.639, 'total_flos': 259374056151840.0, 'train_loss': 0.5667965971136562, 'epoch': 2.0})

NameError: name 'your_predictions_array' is not defined

In [15]:
from datasets import load_metric
metric=load_metric("glue","mrpc")
preds=np.argmax(predictions.predictions,axis=-1)
metric.compute(predictions=preds,reference=predictions.label_ids)

TypeError: 'NoneType' object is not iterable

In [13]:
preds

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,