<a href="https://colab.research.google.com/github/CodeSolid/colab-testing/blob/main/TestNotebookForGithubIntegration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ensure we have what we need.

In [7]:
!pip install -q numpy # Should be here, but make sure
!pip install -q torch transformers tensorflow tf_keras

In [8]:
import transformers
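
As a quick sanity check after the install cell, the sketch below reports which of the packages named there can be found in the current environment. It uses only the standard library. The helper name `check_packages` is hypothetical, and the module names are assumed to match the pip package names.

```python
from importlib.util import find_spec

def check_packages(names):
    """Return {module_name: bool}, True when the top-level module can be located."""
    return {name: find_spec(name) is not None for name in names}

# Module names assumed to match the packages installed above.
required = ["numpy", "torch", "transformers", "tensorflow", "tf_keras"]
status = check_packages(required)
for name, ok in status.items():
    print(f"{name:<14}{'found' if ok else 'MISSING'}")
```

Nothing is imported here, so the check is fast even for heavy packages like `torch` and `tensorflow`.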

Next, we run through some basic Transformers functionality, largely following the discussion [here](https://huggingface.co/learn/llm-course/chapter1/3?fw=pt).

# Some Sentiment Analysis Examples

In [57]:
# A brief Transformers demo cf. https://huggingface.co/learn/llm-course/chapter1/3?fw=pt

from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
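# The device is chosen when the pipeline is built (see "Device set to use mps:0" in the output),
# so the call itself needs no device argument.
results = classifier("We are incredibly excited show you the Transformers library.")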
print(results)
print(type(results), type(results[0]))


Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9996805191040039}]
<class 'list'> <class 'dict'>


In [10]:
sentence_list = ["HuggingFace is awesome!", "I can't stand this soup!  It's disgusting", "I am curious about chimps"]

results = classifier(sentence_list)
for result in results:
    print(result)

{'label': 'POSITIVE', 'score': 0.9998612403869629}
{'label': 'NEGATIVE', 'score': 0.9994088411331177}
{'label': 'POSITIVE', 'score': 0.9980136156082153}
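
The classifier returns one dict per sentence, so the results are easy to post-process with plain Python. The sketch below works on the three results copied from the output above, so it runs without loading a model. The `threshold` value is an arbitrary assumption for illustration. Note that this SST-2 model only knows `POSITIVE` and `NEGATIVE`, which is why the neutral-sounding chimps sentence still gets a confident label.

```python
from collections import Counter

sentences = [
    "HuggingFace is awesome!",
    "I can't stand this soup!  It's disgusting",
    "I am curious about chimps",
]
# Results copied from the output above.
results = [
    {"label": "POSITIVE", "score": 0.9998612403869629},
    {"label": "NEGATIVE", "score": 0.9994088411331177},
    {"label": "POSITIVE", "score": 0.9980136156082153},
]

# Count how many sentences received each label.
label_counts = Counter(r["label"] for r in results)
print(label_counts)

# Flag predictions whose score falls below a hypothetical confidence threshold.
threshold = 0.999
low_confidence = [s for s, r in zip(sentences, results) if r["score"] < threshold]
print(low_confidence)
```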


# Embeddings

Extracting embeddings.

In [14]:
# Try our hand at some embedding:
import numpy as np

embedder = pipeline("feature-extraction", model="openai-community/gpt2")
results = embedder(sentence_list)
for result in results:
    print("----")
    data = np.array(result)  # shape: (1, num_tokens, hidden_size)
    print(data.shape)

Device set to use mps:0


----
(1, 6, 768)
----
(1, 11, 768)
----
(1, 6, 768)
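
Each result has shape `(1, num_tokens, 768)`: one 768-dimensional vector per token, not one per sentence. One common way to get a single fixed-size vector per sentence is to average over the token axis (mean pooling). The sketch below illustrates that idea on random stand-in arrays with the shapes printed above. It is not something the pipeline does for you, and mean pooling is just one of several possible choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the pipeline output: one (1, num_tokens, 768) array per sentence.
token_counts = [6, 11, 6]
fake_outputs = [rng.normal(size=(1, n, 768)) for n in token_counts]

def mean_pool(token_embeddings):
    """Average the per-token vectors of a single sentence into one vector."""
    arr = np.asarray(token_embeddings)   # (1, num_tokens, hidden_size)
    return arr[0].mean(axis=0)           # (hidden_size,)

sentence_vectors = np.stack([mean_pool(out) for out in fake_outputs])
print(sentence_vectors.shape)  # (3, 768)
```

After pooling, sentences with different token counts all end up with the same shape, so they can be stacked into one matrix and compared.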


In [15]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
# NOTE: despite its name, this list holds token IDs (integers), not embedding vectors.
sentence_embeddings = []
for sentence in sentence_list:
    token_ids = tokenizer(sentence).input_ids
    sentence_embeddings.append(token_ids)
print(sentence_embeddings)

[[48098, 2667, 32388, 318, 7427, 0], [40, 460, 470, 1302, 428, 17141, 0, 220, 632, 338, 23374], [40, 716, 11040, 546, 18205, 862]]
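
The three ID lists have different lengths (6, 11, and 6), so they cannot be stacked into one rectangular batch as they are. The sketch below pads them by hand and builds a matching attention mask. Reusing GPT-2's end-of-text ID (50256) as the pad value is an assumption made for illustration, and `pad_batch` is a hypothetical helper. ID 0 would be a poor pad value here because the output above shows it already stands for `!`.

```python
# Token IDs copied from the output above.
token_id_lists = [
    [48098, 2667, 32388, 318, 7427, 0],
    [40, 460, 470, 1302, 428, 17141, 0, 220, 632, 338, 23374],
    [40, 716, 11040, 546, 18205, 862],
]

PAD_ID = 50256  # assumption: reuse GPT-2's end-of-text ID as padding

def pad_batch(seqs, pad_id):
    """Right-pad every sequence to the longest length and build attention masks."""
    max_len = max(len(s) for s in seqs)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return input_ids, attention_mask

input_ids, attention_mask = pad_batch(token_id_lists, PAD_ID)
for ids, mask in zip(input_ids, attention_mask):
    print(len(ids), sum(mask))
```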


In [58]:
from transformers import pipeline

# aggregation_strategy="simple" replaces the deprecated grouped_entities=True.
ner = pipeline("ner", aggregation_strategy="simple")
ner("My name is John and I'm married to Jenniffer, and we live in Charlotte.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use mps:0


[{'entity_group': 'PER',
  'score': np.float32(0.99875915),
  'word': 'John',
  'start': 11,
  'end': 15},
 {'entity_group': 'PER',
  'score': np.float32(0.99015516),
  'word': 'Jenniffer',
  'start': 35,
  'end': 44},
 {'entity_group': 'LOC',
  'score': np.float32(0.99624443),
  'word': 'Charlotte',
  'start': 61,
  'end': 70}]
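
The `start` and `end` values are character offsets into the input string, so each entity can be sliced straight out of the original text. The sketch below does that with the entities copied from the output above (scores omitted) and groups the spans by entity type. `group_by_type` is a hypothetical helper, not part of the library.

```python
text = "My name is John and I'm married to Jenniffer, and we live in Charlotte."

# Entities copied from the output above (scores omitted for brevity).
entities = [
    {"entity_group": "PER", "word": "John", "start": 11, "end": 15},
    {"entity_group": "PER", "word": "Jenniffer", "start": 35, "end": 44},
    {"entity_group": "LOC", "word": "Charlotte", "start": 61, "end": 70},
]

def group_by_type(text, entities):
    """Slice each entity out of the text by its offsets and group by entity type."""
    grouped = {}
    for ent in entities:
        span = text[ent["start"]:ent["end"]]
        grouped.setdefault(ent["entity_group"], []).append(span)
    return grouped

grouped = group_by_type(text, entities)
print(grouped)
```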

## Decoding

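Here we decode the token IDs one at a time. Printing an extra space after each decoded token shows that tokens are not the same as words: "HuggingFace" splits into three tokens and "chimps" into two.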

In [26]:
# sentence_embeddings holds the token IDs produced earlier; decode them one at a time.
for s in sentence_embeddings:
    for tok in s:
        print(tokenizer.decode(tok), end=" ")
    print()

Hug ging Face  is  awesome ! 
I  can 't  stand  this  soup !    It 's  disgusting 
I  am  curious  about  chim ps 
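
The doubled spaces in the output come from the tokens themselves: most GPT-2 tokens carry their own leading space. The sketch below writes the decoded tokens out by hand, as read from the output above, and checks that joining them with no separator reproduces the original sentence. It also compares token counts with whitespace-separated word counts.

```python
# Decoded tokens transcribed from the output above (leading spaces are part of the token).
examples = {
    "HuggingFace is awesome!": ["Hug", "ging", "Face", " is", " awesome", "!"],
    "I am curious about chimps": ["I", " am", " curious", " about", " chim", "ps"],
}

counts = {}
for sentence, tokens in examples.items():
    rebuilt = "".join(tokens)           # no separator needed: spaces live inside tokens
    n_words = len(sentence.split())
    counts[sentence] = (len(tokens), n_words, rebuilt == sentence)
    print(f"{len(tokens)} tokens, {n_words} words, round-trips: {rebuilt == sentence}")
```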


In [23]:
# Speaking of tokens, how many does the GPT-2 vocabulary contain?
tokenizer.vocab_size

50257
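
As a back-of-the-envelope check, the vocabulary size can be combined with the 768-wide vectors seen in the embedding output to estimate the size of the token-embedding table. This assumes the table has one row per vocabulary entry and that its width equals the hidden size. The split of 50257 into byte tokens, learned merges, and one end-of-text token follows the usual description of GPT-2's byte-level BPE vocabulary.

```python
vocab_size = 50257   # tokenizer.vocab_size from the cell above
hidden_size = 768    # last dimension of the embedding arrays printed earlier

# One row per vocabulary entry, one column per hidden dimension (assumed layout).
embedding_entries = vocab_size * hidden_size
print(f"{embedding_entries:,}")  # 38,597,376

# Usual breakdown of the GPT-2 vocabulary: raw bytes + BPE merges + end-of-text token.
byte_tokens, merges, special = 256, 50_000, 1
print(byte_tokens + merges + special == vocab_size)  # True
```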

## Some random text generation.

In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "The day began",
    max_length=30,
    num_return_sequences=2,
)

## And fill in the blanks

In [54]:
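import warnings

# This silences Python warnings only. The "Some weights ... were not used" message in the output
# comes from the Transformers logger; transformers.logging.set_verbosity_error() would hide it.
warnings.filterwarnings('ignore')
unmasker = pipeline("fill-mask", model="distilbert/distilroberta-base")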
unmasker("I love to <mask> in my spare time.", top_k=3)

Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


[{'score': 0.1919156312942505,
  'token': 3116,
  'token_str': ' write',
  'sequence': 'I love to write in my spare time.'},
 {'score': 0.055325016379356384,
  'token': 1166,
  'token_str': ' read',
  'sequence': 'I love to read in my spare time.'},
 {'score': 0.035565681755542755,
  'token': 7142,
  'token_str': ' cook',
  'sequence': 'I love to cook in my spare time.'}]

In [66]:
from transformers import pipeline

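# Two token-classification checkpoints to choose from; only the NER one is used here.
pos_tagging = "TweebankNLP/bertweet-tb2-pos-tagging"
ner_tagging = "dslim/bert-base-NER"
model_name = ner_tagging

# aggregation_strategy="simple" replaces the deprecated grouped_entities=True.
ner = pipeline("ner", model=model_name, aggregation_strategy="simple")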
# My wife's name is actually "Jenniffer", but that confuses the model :(
ner("My name is John, and I live in Charlotte with my wife, Jennifer.")

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


[{'entity_group': 'PER',
  'score': np.float32(0.9986285),
  'word': 'John',
  'start': 11,
  'end': 15},
 {'entity_group': 'LOC',
  'score': np.float32(0.9949582),
  'word': 'Charlotte',
  'start': 31,
  'end': 40},
 {'entity_group': 'PER',
  'score': np.float32(0.9989145),
  'word': 'Jennifer',
  'start': 55,
  'end': 63}]

In [70]:
from transformers import pipeline

filler = pipeline("fill-mask", model="bert-base-cased")
result = filler("This [MASK] has been waiting for you.")
result

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


[{'score': 0.09050454199314117,
  'token': 1299,
  'token_str': 'man',
  'sequence': 'This man has been waiting for you.'},
 {'score': 0.07195105403661728,
  'token': 1282,
  'token_str': 'place',
  'sequence': 'This place has been waiting for you.'},
 {'score': 0.055770039558410645,
  'token': 1362,
  'token_str': 'world',
  'sequence': 'This world has been waiting for you.'},
 {'score': 0.04573364928364754,
  'token': 1141,
  'token_str': 'one',
  'sequence': 'This one has been waiting for you.'},
 {'score': 0.03509213402867317,
  'token': 1590,
  'token_str': 'woman',
  'sequence': 'This woman has been waiting for you.'}]
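
The scores are low because, as far as I understand the fill-mask pipeline, each one is a probability over the model's entire vocabulary rather than over the five candidates shown. The sketch below adds up the scores copied from the output above to see how much probability mass the top five cover. It needs no model.

```python
# Candidates copied from the output above.
candidates = [
    {"token_str": "man", "score": 0.09050454199314117},
    {"token_str": "place", "score": 0.07195105403661728},
    {"token_str": "world", "score": 0.055770039558410645},
    {"token_str": "one", "score": 0.04573364928364754},
    {"token_str": "woman", "score": 0.03509213402867317},
]

top_mass = sum(c["score"] for c in candidates)
remaining_mass = 1.0 - top_mass          # spread over every other vocabulary token
best = max(candidates, key=lambda c: c["score"])

print(f"top-5 mass: {top_mass:.3f}, everything else: {remaining_mass:.3f}")
print("best guess:", best["token_str"])
```

Here the top five candidates cover only about 30% of the probability mass, so the model is far from certain about this blank.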

In [76]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier("This is a course about the Transformers library", candidate_labels=["education", "food"])
# or
# result = classifier("This is a course about the Transformers library", ["education", "food"])
result

Device set to use mps:0


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'food'],
 'scores': [0.9308226108551025, 0.06917736679315567]}
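
The zero-shot result keeps `labels` and `scores` in two parallel lists, so pairing them up makes the result easier to use. The sketch below works on the dict copied from the output above. In this run the two scores sum to roughly 1, so they can be read as a probability split between the candidate labels.

```python
# Result copied from the output above.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "food"],
    "scores": [0.9308226108551025, 0.06917736679315567],
}

# Pair each label with its score, then pick the highest-scoring label.
label_scores = dict(zip(result["labels"], result["scores"]))
best_label = max(label_scores, key=label_scores.get)
total = sum(result["scores"])

print(best_label, round(label_scores[best_label], 3))
print(round(total, 6))
```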