# <center> NLP Sequence Classification

## Load cleaned data

In [1]:
from datasets import load_dataset

twitter_data = load_dataset("cayjobla/twitter-sentiment-classification")


In [2]:
twitter_data = twitter_data.rename_column(new_column_name="label", original_column_name="sentiment")

In [3]:
# View data example
twitter_data["train"][0]

{'tweet_id': 1753253621,
 'label': 8,
 'content': '@aminorjourney - We owe you a LOT.'}

## Tokenize the dataset

In [5]:
from transformers import AutoTokenizer

# Load the pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("cayjobla/distilbert-base-uncased-finetuned-twitter")


In [6]:
test_ids = tokenizer(twitter_data["train"][0]["content"])['input_ids']
tokenizer.decode(test_ids)


'[CLS] @ aminorjourney - we owe you a lot. [SEP]'

In [24]:
tokenizer.save_pretrained("distilbert-base-uncased-finetuned-twitter-classification")

('distilbert-base-uncased-finetuned-twitter-classification/tokenizer_config.json',
 'distilbert-base-uncased-finetuned-twitter-classification/special_tokens_map.json',
 'distilbert-base-uncased-finetuned-twitter-classification/vocab.txt',
 'distilbert-base-uncased-finetuned-twitter-classification/added_tokens.json',
 'distilbert-base-uncased-finetuned-twitter-classification/tokenizer.json')

In [7]:
def tokenize_function(examples):
    result = tokenizer(examples["content"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

tokenized_datasets = twitter_data.map(
    tokenize_function, batched=True, remove_columns=["tweet_id", "content"]
)
tokenized_datasets


DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 32000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 8000
    })
})

## Collate Data

In [8]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors='tf')

In [9]:
# Pad a few samples to a common length to check the collator.
# word_ids is dropped first because the collator does not know how to pad it.
samples = [tokenized_datasets["train"][i] for i in range(3)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")



'>>> [CLS] @ aminorjourney - we owe you a lot. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD]'

'>>> [CLS] chilling feeling really nice.. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'

'>>> [CLS] i'm soooo sleepy but i'm not a home just yet [SEP]'
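
The padded output above can be reproduced in spirit with a few lines of plain Python. This is only a minimal sketch of dynamic padding, not the `DataCollatorWithPadding` implementation: `pad_batch` is a hypothetical helper, the token ids are toy values, and a pad id of 0 is assumed (which matches the `[PAD]` token of the uncased BERT vocabulary).

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token ids (101 = [CLS], 102 = [SEP]); the other ids are made up
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps short tweets from being padded to the length of the longest tweet in the whole dataset.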


## Load and Fine-tune the model

In [10]:
from huggingface_hub import notebook_login

notebook_login()



In [11]:
num_labels = len(twitter_data["train"].features["label"].names)
num_labels

13
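
A model loaded with only `num_labels` reports generic names such as `LABEL_12` at inference time. One possible way to get readable names is to build `id2label` and `label2id` mappings from the dataset's label names and pass them when loading the model. The sketch below uses three hypothetical label names as stand-ins; the real list would come from `twitter_data["train"].features["label"].names`.

```python
# Hypothetical label names, used here only for illustration.
# The real ones come from twitter_data["train"].features["label"].names
label_names = ["anger", "happiness", "sadness"]

# Map class index -> name and name -> class index
id2label = {i: name for i, name in enumerate(label_names)}
label2id = {name: i for i, name in id2label.items()}

print(id2label)
print(label2id)

# These dicts could then be passed when loading the model, e.g.
# TFAutoModelForSequenceClassification.from_pretrained(
#     model_checkpoint,
#     num_labels=len(label_names),
#     id2label=id2label,
#     label2id=label2id,
# )
```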

In [12]:
from transformers import TFAutoModelForSequenceClassification

model_checkpoint = "distilbert-base-uncased"
model = TFAutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
model.summary()

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'vocab_transform', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'dropout_19', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  9997      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66,963,469
Trainable params: 66,963,469
Non-trainable params: 0
_________________________________________________________________
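
As a quick sanity check, the parameter counts in the summary can be reproduced by hand. The sketch below assumes a hidden size of 768 (the DistilBERT base dimension) and that both head layers are ordinary dense layers with a bias; `dense_params` is a helper introduced here for illustration.

```python
hidden_size = 768  # assumed DistilBERT base hidden dimension
num_labels = 13    # number of sentiment classes in this dataset


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


pre_classifier = dense_params(hidden_size, hidden_size)
classifier = dense_params(hidden_size, num_labels)
backbone = 66_362_880  # 'distilbert' row of the summary above

total = backbone + pre_classifier + classifier
print(pre_classifier, classifier, total)
```

Only the 9,997 classifier parameters depend on the number of labels; the rest of the network has the same size for any classification task.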


In [13]:
batch_size = 16

tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=['attention_mask', 'input_ids', 'label'],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)

tf_eval_dataset = tokenized_datasets["test"].to_tf_dataset(
    columns=['attention_mask', 'input_ids', 'label'],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=batch_size,
)

In [15]:
from transformers import create_optimizer
from transformers.keras_callbacks import PushToHubCallback
import tensorflow as tf

num_epochs = 1
# tf_train_dataset is already batched, so len() returns the number of batches
batches_per_epoch = len(tf_train_dataset)
num_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)

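# Train in mixed-precision float16.
# Note: Keras layers read the global policy when they are created, so setting it
# here (after the model is built) probably only affects models loaded later.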
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model_name = model_checkpoint.split("/")[-1]
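
The number of training steps controls how quickly the learning rate decays, so it is worth checking by hand. With 32,000 training rows and a batch size of 16, one epoch is 2,000 batches. The sketch below also models the schedule as a linear decay from `init_lr` to zero with no warmup, which is my understanding of what `create_optimizer` sets up with these arguments; `linear_decay_lr` is a hypothetical helper, not a library function.

```python
import math

num_examples = 32_000  # rows in the train split
batch_size = 16
num_epochs = 1

# One optimizer step per batch
steps_per_epoch = math.ceil(num_examples / batch_size)
num_train_steps = steps_per_epoch * num_epochs


def linear_decay_lr(step, init_lr=2e-5, total_steps=num_train_steps):
    """Learning rate after `step` updates under linear decay to zero."""
    step = min(step, total_steps)
    return init_lr * (1 - step / total_steps)


print(num_train_steps)
for step in (0, num_train_steps // 2, num_train_steps):
    print(step, linear_decay_lr(step))
```

If `num_train_steps` is set too small, the learning rate reaches zero early and the remaining batches no longer update the model.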

In [16]:
model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=num_epochs)



<keras.callbacks.History at 0x7fce7df6dc10>
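
A useful baseline when reading the training loss: a model that assigns equal probability to all 13 classes has a cross-entropy of ln(13), roughly 2.56. The sketch below is a plain-Python version of sparse categorical cross-entropy computed from logits for a single example; it illustrates the formula and is not the Keras implementation.

```python
import math


def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy for one example, given raw logits and an integer label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


# Equal logits mean a uniform prediction over the 13 classes
uniform_loss = sparse_categorical_crossentropy([0.0] * 13, label=8)
print(uniform_loss, math.log(13))
```

A validation loss well below this value suggests the model has learned something beyond guessing uniformly.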

In [19]:
model.save_pretrained("distilbert-base-uncased-finetuned-twitter-classification")

## Pipeline for our model

In [27]:
from transformers import pipeline

text_classifier = pipeline(
    "text-classification", model="distilbert-base-uncased-finetuned-twitter-classification"
)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-twitter-classification were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-twitter-classification and are newly initialized: ['dropout_79']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and infer

In [29]:
text = "I am upset"
preds = text_classifier(text)

for pred in preds:
    print(f">>> {pred}")

>>> {'label': 'LABEL_12', 'score': 0.20263671875}
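
The `score` above is, as far as I understand the text-classification pipeline, the softmax probability of the highest-scoring class. With 13 classes, a uniform guess would score 1/13 (about 0.077), so 0.20 is a fairly unconfident prediction. The sketch below shows the computation on hypothetical logits; the values are made up and are not the model's actual outputs.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for 13 classes, with the last class slightly favored
logits = [0.0] * 12 + [1.2]
probs = softmax(logits)

# Pick the most probable class, as the pipeline does by default
best = max(range(len(probs)), key=probs.__getitem__)
print({"label": f"LABEL_{best}", "score": probs[best]})
```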
