
In [38]:
!pip install -q datasets
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

import torch
import torch.nn.functional as F
import datasets

# Pipelines

In [51]:
summarizer = pipeline('summarization', model="sshleifer/distilbart-cnn-12-6")

text = """The rapid advancement of artificial intelligence (AI) has brought significant changes to various industries, including healthcare, education, and transportation. In healthcare, AI-powered tools assist in diagnosing diseases, analyzing medical images, and even predicting patient outcomes. In education, adaptive learning systems personalize content for students, making the learning process more efficient. Transportation has seen the rise of autonomous vehicles, which promise to reduce accidents and improve traffic flow.

However, alongside these benefits, there are concerns about the ethical implications of AI. Issues such as data privacy, algorithmic bias, and the potential loss of jobs due to automation remain at the forefront of discussions. Addressing these challenges requires collaboration between governments, businesses, and researchers to ensure AI technologies are used responsibly and inclusively."""

summarizer(text, max_length=128, do_sample=False)


Device set to use cuda:0


[{'summary_text': ' The rapid advancement of artificial intelligence (AI) has brought significant changes to various industries . In healthcare, AI-powered tools assist in diagnosing diseases, analyzing medical images, and predicting patient outcomes . In education, adaptive learning systems personalize content for students, making the learning process more efficient . In transportation, autonomous vehicles promise to reduce accidents and improve traffic flow .'}]

In [5]:
classifier = pipeline('sentiment-analysis')

res = classifier(" I've been waiting for a HuggingFace course my whole life.")

res

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9598046541213989}]


In [12]:
generator = pipeline('text-generation', model='distilgpt2')

res = generator(
    'I am suitable for an internship because',
    max_length=32,
    num_return_sequences=2
)

res

Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I am suitable for an internship because it is so much easier for me to complete without having to be trained on certain aspects. It allows me to work with different'},
 {'generated_text': 'I am suitable for an internship because of the work involved in this project, and I will not hesitate to accept an internship if requested. I also have a great'}]

In [20]:
classifier = pipeline('zero-shot-classification')

res = classifier(
    "This is a proof of Fermat's theorem",
    candidate_labels=['education', 'politics', 'business', 'science'],
)

res

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


{'sequence': "This is a proof of Fermat's theorem",
 'labels': ['science', 'business', 'education', 'politics'],
 'scores': [0.728356122970581,
  0.11407586187124252,
  0.0872567817568779,
  0.07031118869781494]}
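
The result is a plain dictionary whose `labels` and `scores` lists are aligned and ordered from most to least likely. A small sketch of reading it, using the values printed above; `result` is just a copy of that output, and the note that the scores sum to roughly 1 assumes the pipeline's default single-label mode:

```python
# Copy of the zero-shot output shown above.
result = {
    "sequence": "This is a proof of Fermat's theorem",
    "labels": ["science", "business", "education", "politics"],
    "scores": [0.728356122970581, 0.11407586187124252,
               0.0872567817568779, 0.07031118869781494],
}

# Pair each label with its score; the first pair is the top prediction.
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = ranked[0]
total = sum(result["scores"])  # close to 1 in single-label mode

print(top_label, round(top_score, 3))
print(round(total, 3))
```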

# AutoTokenizer, AutoModelForSequenceClassification

In [27]:
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [28]:
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

res = classifier(" I've been waiting for a HuggingFace course my whole life.")
res

Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9598046541213989}]

In [29]:
sequence = "using a transformers network in simple"

res = tokenizer(sequence)
res

{'input_ids': [101, 2478, 1037, 19081, 2897, 1999, 3722, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [31]:
tokens = tokenizer.tokenize(sequence)
tokens

['using', 'a', 'transformers', 'network', 'in', 'simple']

In [35]:
ids = tokenizer.convert_tokens_to_ids(tokens)
ids

[2478, 1037, 19081, 2897, 1999, 3722]

In [36]:
tokenizer.decode(ids)

'using a transformers network in simple'
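
The cells above fit together as follows: calling the tokenizer directly is roughly `tokenize` followed by `convert_tokens_to_ids`, with the special tokens `[CLS]` (ID 101) and `[SEP]` (ID 102) wrapped around the result. A minimal pure-Python sketch of that relationship, using a toy vocabulary built only from the IDs printed above. `toy_vocab` and `encode` are illustrative names, not part of the `transformers` API, and the whitespace split stands in for the real WordPiece algorithm:

```python
# Toy vocabulary copied from the outputs above; the real vocabulary is far larger.
toy_vocab = {
    "using": 2478, "a": 1037, "transformers": 19081,
    "network": 2897, "in": 1999, "simple": 3722,
}
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the BERT uncased vocabulary


def encode(text):
    """Hypothetical stand-in for tokenizer(text)."""
    tokens = text.lower().split()          # simplification: real tokenizers use WordPiece
    ids = [toy_vocab[t] for t in tokens]   # same role as convert_tokens_to_ids
    input_ids = [CLS_ID] + ids + [SEP_ID]  # special tokens wrap the sequence
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


encoded = encode("using a transformers network in simple")
print(encoded)
```

This is why `tokenizer(sequence)` returned eight IDs while `convert_tokens_to_ids(tokens)` returned six.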

# PyTorch with Transformers

In [41]:
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

Device set to use cuda:0


In [42]:
X_train = [
    "I'm feeling so happy and excited about the future!",
    "This is the worst day of my life, and I don't know how to handle it.",
    "I can't stop laughing; this joke is absolutely hilarious!",
    "I'm worried and anxious about the upcoming exam next week.",
    "Spending time with my family brings me so much peace and joy."
]

res = classifier(X_train)
res

[{'label': 'POSITIVE', 'score': 0.999881386756897},
 {'label': 'NEGATIVE', 'score': 0.9997773766517639},
 {'label': 'POSITIVE', 'score': 0.9998645782470703},
 {'label': 'NEGATIVE', 'score': 0.9966844916343689},
 {'label': 'POSITIVE', 'score': 0.9998894929885864}]

In [44]:
batch = tokenizer(X_train, padding=True, truncation=True, max_length=512, return_tensors='pt')
batch

{'input_ids': tensor([[  101,  1045,  1005,  1049,  3110,  2061,  3407,  1998,  7568,  2055,
          1996,  2925,   999,   102,     0,     0,     0,     0,     0,     0,
             0,     0],
        [  101,  2023,  2003,  1996,  5409,  2154,  1997,  2026,  2166,  1010,
          1998,  1045,  2123,  1005,  1056,  2113,  2129,  2000,  5047,  2009,
          1012,   102],
        [  101,  1045,  2064,  1005,  1056,  2644,  5870,  1025,  2023,  8257,
          2003,  7078, 26316,   999,   102,     0,     0,     0,     0,     0,
             0,     0],
        [  101,  1045,  1005,  1049,  5191,  1998, 11480,  2055,  1996,  9046,
         11360,  2279,  2733,  1012,   102,     0,     0,     0,     0,     0,
             0,     0],
        [  101,  5938,  2051,  2007,  2026,  2155,  7545,  2033,  2061,  2172,
          3521,  1998,  6569,  1012,   102,     0,     0,     0,     0,     0,
             0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
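
With `padding=True`, shorter sequences are filled with the pad ID (0 here) up to the length of the longest sequence in the batch, and the attention mask marks real tokens with 1 and padding with 0. A rough pure-Python sketch of that step. `pad_batch` is a hypothetical helper, and the two ID lists are the first two rows printed above with their padding stripped:

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# First two rows of the batch above, without padding (14 and 22 tokens).
row_a = [101, 1045, 1005, 1049, 3110, 2061, 3407, 1998, 7568, 2055,
         1996, 2925, 999, 102]
row_b = [101, 2023, 2003, 1996, 5409, 2154, 1997, 2026, 2166, 1010,
         1998, 1045, 2123, 1005, 1056, 2113, 2129, 2000, 5047, 2009,
         1012, 102]

padded = pad_batch([row_a, row_b])
print(padded["input_ids"][0])
print(padded["attention_mask"][0])
```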

In [46]:
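# Creating the pipeline above already moved the model to the GPU,
# so send the batch to whatever device the model is on.
batch = batch.to(model.device)
with torch.no_grad():
    outputs = model(**batch)
    print(outputs)
    # Softmax over the class dimension turns logits into probabilities.
    predictions = F.softmax(outputs.logits, dim=1)
    print(predictions)
    # Index of the most probable class for each sentence.
    labels = torch.argmax(predictions, dim=1)
    print(labels)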


SequenceClassifierOutput(loss=None, logits=tensor([[-4.3366,  4.7028],
        [ 4.6829, -3.7270],
        [-4.2810,  4.6265],
        [ 3.1132, -2.5926],
        [-4.3895,  4.7213]], device='cuda:0'), hidden_states=None, attentions=None)
tensor([[1.1863e-04, 9.9988e-01],
        [9.9978e-01, 2.2260e-04],
        [1.3536e-04, 9.9986e-01],
        [9.9668e-01, 3.3155e-03],
        [1.1045e-04, 9.9989e-01]], device='cuda:0')
tensor([1, 0, 1, 0, 1], device='cuda:0')
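
The probabilities printed above follow from the logits by a softmax over each row, and the argmax index maps to a label name. A small pure-Python check on the first and fourth rows. The `id2label` mapping is an assumption inferred from the pipeline output earlier (index 0 as NEGATIVE, index 1 as POSITIVE), and `softmax` is a local helper rather than `torch.nn.functional.softmax`:

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed mapping, inferred from the pipeline results shown earlier.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Logits for the first and fourth sentences, copied from the output above.
logits = [[-4.3366, 4.7028], [3.1132, -2.5926]]

results = []
for row in logits:
    probs = softmax(row)
    label_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
    results.append((id2label[label_id], probs[label_id]))

print(results)
```

The resulting scores agree with the pipeline output for the same sentences to about four decimal places, which suggests the pipeline performs the same softmax and argmax internally.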


# Save model

In [47]:
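save_directory = "saved"
# Write the tokenizer files and model weights to the same folder.
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

# Reload both from disk instead of the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained(save_directory)
mod = AutoModelForSequenceClassification.from_pretrained(save_directory)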