# LAB 2 Notebook

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

### 1. 
Preprocessing with a tokenizer. 

Preprocess the two sentences below and display the tokenizer output that will be fed to the model. The tokenizer has already been loaded for you.

* "The universe is vast and full of mysteries.",
* "Exploring the depths of the ocean reveals hidden wonders."

Set the following parameters:

- `padding=True`: This parameter ensures that the sequences are padded to equal lengths if their lengths vary.
- `truncation=True`: If any sentence exceeds the maximum length that the model can handle, truncation cuts it to fit within that limit.
- `return_tensors="pt"`: This specifies that the tokenizer should return PyTorch tensors.

In [2]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [3]:
### Write your code here

# Different set of raw inputs
different_raw_inputs = [
    "The universe is vast and full of mysteries.",
    "Exploring the depths of the ocean reveals hidden wonders.",
]

# Tokenize the new set of inputs
inputs = tokenizer(different_raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)


{'input_ids': tensor([[  101,  1996,  5304,  2003,  6565,  1998,  2440,  1997, 15572,  1012,
           102,     0],
        [  101, 11131,  1996, 11143,  1997,  1996,  4153,  7657,  5023, 16278,
          1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
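
In the output above, the first sentence is one token shorter than the second, so it receives one padding ID (0) and a matching 0 in its attention mask. The following is a minimal pure-Python sketch of that padding step. `pad_batch` is a hypothetical helper written for illustration, not part of the Transformers API, and the pad ID of 0 is read off the output above.

```python
def pad_batch(batch, pad_id=0):
    """Pad every list of token IDs to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Unpadded IDs copied from the tokenizer output above
first = [101, 1996, 5304, 2003, 6565, 1998, 2440, 1997, 15572, 1012, 102]
second = [101, 11131, 1996, 11143, 1997, 1996, 4153, 7657, 5023, 16278, 1012, 102]

padded = pad_batch([first, second])
print(padded["input_ids"][0])
print(padded["attention_mask"][0])
```

The real tokenizer also wraps the result in PyTorch tensors when `return_tensors="pt"` is set; the sketch keeps plain lists to stay dependency-free.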


### 2. 
Going through the model

Feed the preprocessed text into the model and display the vector output of the transformer. The model has already been loaded for you.

What are the batch size, sequence length, and hidden size of the output tensor?

In [5]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

In [8]:
### Write your code here

outputs = model(**inputs)
shape = outputs.last_hidden_state.shape
print(f"The batch size is {shape[0]}")
print(f"The sequence length is {shape[1]}")
print(f"The hidden size is {shape[2]}")

The batch size is 2
The sequence length is 12
The hidden size is 768


### 3. 
Making sense out of the numbers

Instead of just extracting features from the input text, we want to actually categorize it as positive or negative. To do this, we need a specialized model head called a "sequence classification head" that takes the extracted features and predicts a category for the sentence.

While the `AutoModel` class can extract features, it's not designed for classification. Therefore, we'll use `AutoModelForSequenceClassification` instead, which comes with this specialized head built-in.

The model has been loaded for you. Pass in the preprocessed text and display the logits and their shape.

In [11]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [13]:
# Write your code here

outputs = model(**inputs)
print(outputs.logits.shape)
print(outputs.logits)

torch.Size([2, 2])
tensor([[-3.8570,  4.1280],
        [-4.2169,  4.5045]], grad_fn=<AddmmBackward0>)


### 4.
The outputs of the model are not probabilities but logits: unnormalized scores produced by the model's last layer.
The softmax function converts these logits into probabilities. It takes an array of numbers (logits) as input and returns an array of non-negative values that sum to 1. Convert the outputs to probabilities and display them together with their labels.

In [14]:
# Write your code here

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=1)
print(predictions)
print(model.config.id2label)

tensor([[3.4041e-04, 9.9966e-01],
        [1.6303e-04, 9.9984e-01]], grad_fn=<SoftmaxBackward0>)
{0: 'NEGATIVE', 1: 'POSITIVE'}
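
The probabilities above can be reproduced by hand. The following is a minimal pure-Python sketch of softmax applied to the logits printed in section 3; it is an illustration of the formula, not how PyTorch implements it internally.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output in section 3
logits = [[-3.8570, 4.1280], [-4.2169, 4.5045]]
probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print(row)
```

Softmax is applied to each row separately, which is what `dim=1` selects in the `torch.nn.functional.softmax` call.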


### 5. 
#### Write your conclusions from the model prediction below:

The model classifies both sentences as POSITIVE with very high confidence:

First sentence: NEGATIVE: 0.00034041, POSITIVE: 0.99966
Second sentence: NEGATIVE: 0.00016303, POSITIVE: 0.99984
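
Reading off the predicted label amounts to taking the index of the largest probability in each row and looking it up in `id2label`. The following is a small sketch using the values printed in section 4; the variable names are chosen for illustration only.

```python
# Label mapping and probabilities copied from the output in section 4
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [
    [3.4041e-04, 9.9966e-01],
    [1.6303e-04, 9.9984e-01],
]

predicted_labels = []
for number, row in enumerate(probabilities, start=1):
    # Index of the highest probability in this row (argmax)
    best = max(range(len(row)), key=lambda i: row[i])
    predicted_labels.append(id2label[best])
    print(f"Sentence {number}: {id2label[best]} ({row[best]:.5f})")
```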

### 6.
 Replicate Tokenization and Conversion to Input IDs

Replicate the tokenization and conversion to input IDs for the following input sentences:
1. "HuggingFace provides amazing NLP resources.",
2. "I love learning about natural language processing!",


Using the appropriate tokenizer, tokenize these sentences and convert them into input IDs.

In [17]:
# Write your code here

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = [
    "HuggingFace provides amazing NLP resources.",
    "I love learning about natural language processing!",
]

tokens = tokenizer.tokenize(sequence)

print(tokens)

ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

['Hu', '##gging', '##F', '##ace', 'provides', 'amazing', 'NL', '##P', 'resources', '.', 'I', 'love', 'learning', 'about', 'natural', 'language', 'processing', '!']
[20164, 10932, 2271, 7954, 2790, 6929, 21239, 2101, 3979, 119, 146, 1567, 3776, 1164, 2379, 1846, 6165, 106]
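
The `##` prefixes in the output mark pieces that continue the previous word. BERT tokenizers split unknown words with a greedy longest-match-first search over the vocabulary (WordPiece). The following is a minimal sketch of that idea, assuming a toy vocabulary built from a few token/ID pairs in the output above; `wordpiece` is a hypothetical helper, and the real tokenizer also handles normalization, punctuation splitting, and a much larger vocabulary.

```python
# Toy vocabulary: token -> ID pairs copied from the output above
toy_vocab = {"Hu": 20164, "##gging": 10932, "##F": 2271, "##ace": 7954, "provides": 2790}

def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

tokens = wordpiece("HuggingFace", toy_vocab)
ids = [toy_vocab[token] for token in tokens]  # the lookup convert_tokens_to_ids performs
print(tokens)
print(ids)
```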


### 7.
Decoding

Convert the vocabulary indices obtained above back into a human-readable text string.

In [18]:
# write your code here

# `ids` holds the input IDs computed in the previous cell
decoded_string = tokenizer.decode(ids)
print(decoded_string)

HuggingFace provides amazing NLP resources. I love learning about natural language processing!
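
Decoding does more than map IDs back to tokens: pieces that start with `##` are glued onto the previous piece to rebuild whole words. The following is a rough sketch of that merging step, using the tokens from section 6; `merge_wordpieces` is a hypothetical helper, not a Transformers function.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into a string, attaching '##' pieces to the previous word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the '##' marker and append
        else:
            words.append(token)
    return " ".join(words)

# Tokens of the first sentence, copied from the output in section 6
tokens = ["Hu", "##gging", "##F", "##ace", "provides", "amazing", "NL", "##P", "resources", "."]
text = merge_wordpieces(tokens)
print(text)
```

The sketch leaves a space before the final period. The decoded string printed above has no such space because the tokenizer's `decode` also cleans up spacing around punctuation.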


### 8.

What are the tokenized representations of the following sequences: 
1. 'The mountains are calling and I must go.'
2. 'Nature always wears the colors of the spirit.' 
These sequences are tokenized using the 'distilbert-base-uncased-finetuned-sst-2-english' checkpoint.

In [19]:
# Write your code here

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

sequence = [
    'The mountains are calling and I must go.',
    'Nature always wears the colors of the spirit.' 
]

tokens = tokenizer.tokenize(sequence)

print(tokens)

['the', 'mountains', 'are', 'calling', 'and', 'i', 'must', 'go', '.', 'nature', 'always', 'wears', 'the', 'colors', 'of', 'the', 'spirit', '.']


### 9.

Apply different padding and truncation strategies to the following sequences and compare the outputs:

1. 'The road to success is always under construction.'
2. 'A journey of a thousand miles begins with a single step.'

Tokenize these sequences using the 'distilbert-base-uncased-finetuned-sst-2-english' checkpoint with each of the following methods:

- padding to the longest sequence
- padding to the model's maximum sequence length
- padding to a specified maximum length (10)
- truncation to the model's maximum sequence length
- truncation to a specified maximum length (10)
- returning PyTorch tensors with padding enabled
- returning NumPy arrays with padding enabled
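
The padding and truncation strategies listed above differ only in which target length they use and whether sequences are lengthened or shortened. The following is a minimal pure-Python sketch of both operations. It assumes the special-token IDs 101 (`[CLS]`), 102 (`[SEP]`), and 0 (`[PAD]`) seen in this notebook's outputs, assumes the specified maximum length counts the special tokens, and uses the hypothetical helpers `pad` and `truncate` rather than any Transformers function.

```python
CLS, SEP, PAD = 101, 102, 0

def pad(ids, target_length):
    """Append PAD until the sequence reaches target_length; longer sequences are unchanged."""
    return ids + [PAD] * max(0, target_length - len(ids))

def truncate(ids, max_length):
    """Shorten a [CLS] ... [SEP] sequence to max_length while keeping the final [SEP]."""
    if len(ids) <= max_length:
        return list(ids)
    return ids[: max_length - 1] + [SEP]

# IDs of "The road to success is always under construction." (11 tokens with specials)
road = [CLS, 1996, 2346, 2000, 3112, 2003, 2467, 2104, 2810, 1012, SEP]

padded_to_longest = pad(road, 14)   # the other sentence has 14 tokens
padded_to_ten = pad(road, 10)       # already longer than 10, so nothing changes
truncated_to_ten = truncate(road, 10)
print(padded_to_longest)
print(padded_to_ten)
print(truncated_to_ten)
```

Padding alone never shortens a sequence, so combining `padding="max_length"` with `truncation=True` is the usual way to force every sequence to exactly the same length.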


In [20]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Different context for the sequences
different_sequences = [
    "The road to success is always under construction.",
    "A journey of a thousand miles begins with a single step."
]

# Padding to the longest sequence in the batch
model_inputs = tokenizer(different_sequences, padding="longest")
print("Padding to longest sequence:")
print(model_inputs)

# Padding to the model's max sequence length
model_inputs = tokenizer(different_sequences, padding="max_length")
print("\nPadding to model's max length:")
print(model_inputs)

# Padding to a specified max length (truncation is off, so sequences longer than 10 tokens stay unchanged)
model_inputs = tokenizer(different_sequences, padding="max_length", max_length=10)
print("\nPadding to specified max length (10):")
print(model_inputs)

# Truncation to the model's max sequence length
model_inputs = tokenizer(different_sequences, truncation=True)
print("\nTruncation to model's max length:")
print(model_inputs)

# Truncation to a specified max length
model_inputs = tokenizer(different_sequences, max_length=10, truncation=True)
print("\nTruncation to specified max length (10):")
print(model_inputs)

# Tokenizing sequences and returning PyTorch tensors
model_inputs = tokenizer(different_sequences, padding=True, return_tensors="pt")
print("\nReturning PyTorch tensors:")
print(model_inputs)

# Tokenizing sequences and returning NumPy arrays
model_inputs = tokenizer(different_sequences, padding=True, return_tensors="np")
print("\nReturning NumPy arrays:")
print(model_inputs)


Padding to longest sequence:
{'input_ids': [[101, 1996, 2346, 2000, 3112, 2003, 2467, 2104, 2810, 1012, 102, 0, 0, 0], [101, 1037, 4990, 1997, 1037, 4595, 2661, 4269, 2007, 1037, 2309, 3357, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

Padding to model's max length:
{'input_ids': [[101, 1996, 2346, 2000, 3112, 2003, 2467, 2104, 2810, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 