
In [10]:
import transformers
# transformers is the Hugging Face library

**1) the simplest way to work with the transformers package:**

In [9]:
from transformers import pipeline

# device = 'cuda' runs the model on the GPU; this line fails on a machine without one.
classifier = pipeline("sentiment-analysis", device = 'cuda')
classifier("i'm neither happy nor bored to work in the gc space. i'm feeling somehow good.")
# this text is just an example; you can edit it.

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'POSITIVE', 'score': 0.9994269609451294}]
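
The hard-coded `device = 'cuda'` above only works when a GPU is present. A minimal sketch of choosing the device at runtime follows; the helper name `pick_device` is hypothetical (it is not part of transformers), and it assumes PyTorch is the backend.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    # pick_device is a hypothetical helper, not a transformers function.
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed, so there is nothing to query
    import torch
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The result can then be passed on, for example `pipeline("sentiment-analysis", device = device)`, so the same cell runs on both GPU and CPU machines.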

**2) we can use the tokenizer and the model separately:**

In [16]:
from transformers import AutoTokenizer, AutoModel
# TFAutoModel is the TensorFlow counterpart, if TensorFlow is the backend of transformers.

# bert is here to encode texts. it receives a text and gives us one 768-dimensional vector per token
# (512 is the maximum number of tokens bert accepts, not the size of the vector).
# this process is called "embedding" and we are going to use it in RAG and vector DBs.

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
# be careful: the tokenizer and the model must come from the same checkpoint,
# because the model was pre-trained with that tokenizer's vocabulary.

text = "this is sina"

inputs = tokenizer(text, return_tensors = "pt")
# the name "inputs" avoids shadowing python's built-in input().
# return_tensors = "pt" asks for pytorch tensors.
# we can use "tf" instead of "pt" if we're using tensorflow as the backend.

output = model(**inputs)
print(output)
# last_hidden_state holds one high-dimensional vector per token;
# pooler_output holds one vector for the whole input.

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.0735,  0.1171, -0.1137,  ..., -0.1849,  0.1637,  0.1906],
         [-0.5274, -0.0381,  0.1222,  ..., -0.5427,  0.6652,  0.5235],
         [-0.3611, -0.2636,  0.3259,  ..., -0.0176,  0.3163,  1.0792],
         [ 0.3684, -0.7143,  0.2972,  ..., -0.0452,  0.1361, -0.1764],
         [-0.3120, -0.4086, -0.1155,  ...,  0.2673,  0.3390,  0.0815],
         [ 0.7924,  0.1487, -0.3462,  ..., -0.1642, -0.9591, -0.2393]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-8.1575e-01, -1.9694e-01,  4.8349e-01,  6.0623e-01, -3.2398e-01,
         -5.4389e-02,  8.3786e-01,  2.0135e-01,  2.1983e-01, -9.9948e-01,
          2.3199e-01,  7.7051e-02,  9.6494e-01, -2.3865e-01,  9.0038e-01,
         -4.4461e-01,  1.7089e-02, -4.9665e-01,  3.0457e-01, -6.8125e-01,
          4.6492e-01,  7.5874e-01,  6.4273e-01,  1.9611e-01,  2.8585e-01,
          1.5133e-01, -4.3322e-01,  8.8309e-01,  9.2586e-01,  5.9573e-01,
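
The output above has one row per token in `last_hidden_state`, so a text still needs to be reduced to a single vector before it can go into a vector DB. One common option is masked mean pooling. This is an assumption for illustration, not something this notebook does, and the numbers below are toy values rather than real BERT outputs. The function name `mean_pool` is hypothetical.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention mask is 1 (padding is ignored)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# toy example: 3 tokens with 2 dimensions each; the last token is padding
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

sentence_vector = mean_pool(hidden, mask)
print(sentence_vector)
```

With real model outputs the same idea applies to the rows of `last_hidden_state`, using the `attention_mask` returned by the tokenizer.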
       

**3) we can pass a model explicitly, instead of relying on the transformers default:**


In [22]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

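# we name the model explicitly. this checkpoint is the one the pipeline picked by default in step 1;
# any other sequence-classification checkpoint name can go here.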
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# same sentiment-analysis pipeline as in step 1, now built from our own model and tokenizer.
classifier = pipeline("sentiment-analysis", model = model, tokenizer = tokenizer)

results = classifier(["i'm somehow feel great yesterday but now is different",
                     "there is nothing as good as this moment."])
# we pass 2 sentences this time; the pipeline returns one label per sentence.

for result in results:
  print(result)



Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


{'label': 'POSITIVE', 'score': 0.9966962337493896}
{'label': 'NEGATIVE', 'score': 0.9997376799583435}
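
The printed results do not show which sentence each label belongs to. A small sketch of pairing them follows. The sentences and the results are copied from the cell above, so nothing is downloaded here; the variable name `report` is just for illustration.

```python
# sentences and results copied from the cell above
texts = ["i'm somehow feel great yesterday but now is different",
         "there is nothing as good as this moment."]
results = [{'label': 'POSITIVE', 'score': 0.9966962337493896},
           {'label': 'NEGATIVE', 'score': 0.9997376799583435}]

# the pipeline returns results in input order, so zip pairs each sentence with its result
report = [f"{result['label']:<8} {result['score']:.4f}  {text}"
          for text, result in zip(texts, results)]

for line in report:
    print(line)
```

Reading the labels next to the sentences makes surprising predictions, like the two above, easier to spot.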


**4) we can tokenize in some different ways:**

In [28]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

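# now we are going to tokenize in three ways
# 1) text -> subword tokens
tokens = tokenizer.tokenize("hi this is sina moradi from tehran and i'm learning transformers.")
# 2) tokens -> vocabulary ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)
# 3) both steps in one call, plus the special tokens [CLS] (101) and [SEP] (102) and an attention mask
input_ids = tokenizer("hi this is sina moradi from tehran and i'm learning transformers.")

# now we are going to print them separately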
print(f' tokens: {tokens}')
print("##########################################")
print(f' token_IDs: {token_ids}')
print("##########################################")
print(f' input_IDs: {input_ids}')


# we can also tokenize 2 sentences of different lengths in one call.
# padding fills the empty positions of the shorter sentence with "0".
# max_length sets the point where a longer sentence is cut (with truncation = True).
x_train = ["i'm so happy to learn this module its so amazing",
           "i hope to continue in the next hours"]

# in the training phase, we put the texts in a batch instead of passing them separately.
# batches are matrices with a fixed number of rows and columns.

batch = tokenizer(x_train, padding = True,
                  truncation = True,
                  max_length = 512,
                  return_tensors = "pt")
print("##########################################")
print(batch)


 tokens: ['hi', 'this', 'is', 'sin', '##a', 'mora', '##di', 'from', 'tehran', 'and', 'i', "'", 'm', 'learning', 'transformers', '.']
##########################################
 token_IDs: [7632, 2023, 2003, 8254, 2050, 26821, 4305, 2013, 13503, 1998, 1045, 1005, 1049, 4083, 19081, 1012]
##########################################
 input_IDs: {'input_ids': [101, 7632, 2023, 2003, 8254, 2050, 26821, 4305, 2013, 13503, 1998, 1045, 1005, 1049, 4083, 19081, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
##########################################
{'input_ids': tensor([[  101,  1045,  1005,  1049,  2061,  3407,  2000,  4553,  2023, 11336,
          2049,  2061,  6429,   102],
        [  101,  1045,  3246,  2000,  3613,  1999,  1996,  2279,  2847,   102,
             0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}
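
To see what the tokenizer call did to the batch above, here is a plain-python sketch that rebuilds it by hand: add the special tokens, then pad to the longest sentence and mark the padding in the attention mask. The ids are copied from the output above; the function names `add_special_tokens` and `pad_batch` are hypothetical helpers, not tokenizer methods.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids seen in the output above for this checkpoint


def add_special_tokens(token_ids):
    """Wrap the token ids in [CLS] ... [SEP]."""
    return [CLS_ID] + list(token_ids) + [SEP_ID]


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the matching attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# token ids of the two x_train sentences, without special tokens (taken from the output above)
first = [1045, 1005, 1049, 2061, 3407, 2000, 4553, 2023, 11336, 2049, 2061, 6429]
second = [1045, 3246, 2000, 3613, 1999, 1996, 2279, 2847]

manual_batch = pad_batch([add_special_tokens(first), add_special_tokens(second)])
print(manual_batch)
```

The attention mask is what tells the model to ignore the padded positions, which is why it should always be passed along with `input_ids`.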


**5) we can work directly with pytorch for more control:**

In [30]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

x_train = ["i'm so happy to learn this module its so amazing",
           "i hope to continue in the next hours"]
batch = tokenizer(x_train, padding = True,
                  truncation = True,
                  max_length = 512,
                  return_tensors = "pt")
print(batch)
print("##########################################")

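# inference only, so we skip gradient tracking
with torch.no_grad():
  outputs = model(**batch)
  print(outputs)
  print("##########################################")
  # logits -> probabilities over the classes of each sentence
  predictions = F.softmax(outputs.logits, dim = 1)
  print(predictions)
  print("##########################################")
  # index of the most likely class per sentence
  labels = torch.argmax(predictions, dim = 1)
  print(labels)
  print("##########################################")
  # class index -> label name, using the mapping stored in the model config
  labels = [model.config.id2label[label_id] for label_id in labels.tolist()]
  print(labels)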


{'input_ids': tensor([[  101,  1045,  1005,  1049,  2061,  3407,  2000,  4553,  2023, 11336,
          2049,  2061,  6429,   102],
        [  101,  1045,  3246,  2000,  3613,  1999,  1996,  2279,  2847,   102,
             0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}
##########################################
SequenceClassifierOutput(loss=None, logits=tensor([[-4.3352,  4.6670],
        [-2.1270,  2.1375]]), hidden_states=None, attentions=None)
##########################################
tensor([[1.2311e-04, 9.9988e-01],
        [1.3863e-02, 9.8614e-01]])
##########################################
tensor([1, 1])
##########################################
['POSITIVE', 'POSITIVE']
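
The three steps after the model call (softmax, argmax, label lookup) can be checked by hand. The sketch below repeats them in plain python on the logits printed above. The `id2label` dictionary is hard-coded here and assumed to mirror `model.config.id2label` for this checkpoint.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# logits copied from the SequenceClassifierOutput printed above
logits = [[-4.3352, 4.6670],
          [-2.1270, 2.1375]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed to mirror model.config.id2label

probabilities = [softmax(row) for row in logits]
label_ids = [row.index(max(row)) for row in probabilities]  # argmax per sentence
manual_labels = [id2label[label_id] for label_id in label_ids]

print(probabilities)
print(manual_labels)
```

Because the printed logits are rounded to four decimals, the probabilities match the tensor above only approximately.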
