<a href="https://colab.research.google.com/github/Satwikram/NLP-Implementations/blob/main/BERT/Visualize%20BERT%20Word%20Embeddings%20in%20Tensorboard.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Author: Satwik Ram K

### Connecting to Kaggle

In [2]:
from google.colab import files

files.upload()


! mkdir -p ~/.kaggle


! cp kaggle.json ~/.kaggle/

! chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json


### Downloading the dataset

In [3]:
!kaggle datasets download -d rmisra/news-headlines-dataset-for-sarcasm-detection

Downloading news-headlines-dataset-for-sarcasm-detection.zip to /content
  0% 0.00/3.30M [00:00<?, ?B/s]
100% 3.30M/3.30M [00:00<00:00, 109MB/s]


In [4]:
!unzip /content/news-headlines-dataset-for-sarcasm-detection.zip

Archive:  /content/news-headlines-dataset-for-sarcasm-detection.zip
  inflating: Sarcasm_Headlines_Dataset.json  
  inflating: Sarcasm_Headlines_Dataset_v2.json  


In [5]:
!pip install transformers

### Importing Dependencies

In [60]:
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow.keras.layers import Input, Flatten, Dense
from tensorflow.keras.models import Model

from tensorboard.plugins import projector

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

### Loading dataset

In [2]:
df = pd.read_json("/content/Sarcasm_Headlines_Dataset_v2.json", lines=True)

In [3]:
df.head()

| | is_sarcastic | headline | article_link |
|---|---|---|---|
| 0 | 1 | thirtysomething scientists unveil doomsday clo... | https://www.theonion.com/thirtysomething-scien... |
| 1 | 0 | dem rep. totally nails why congress is falling... | https://www.huffingtonpost.com/entry/donna-edw... |
| 2 | 0 | eat your veggies: 9 deliciously different recipes | https://www.huffingtonpost.com/entry/eat-your-... |
| 3 | 1 | inclement weather prevents liar from getting t... | https://local.theonion.com/inclement-weather-p... |
| 4 | 1 | mother comes pretty close to using word 'strea... | https://www.theonion.com/mother-comes-pretty-c... |


### Taking X and y

In [4]:
X = df["headline"]
y = df["is_sarcastic"].values

### Tokenization

In [5]:
checkpoint = "bert-base-uncased"
sequence_length = 64

In [6]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

tokens = tokenizer(
    X.tolist(),
    max_length=sequence_length,
    truncation=True,
    padding="max_length",
    add_special_tokens=True,
    return_tensors="np",
)
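
To make the effect of `truncation=True` and `padding="max_length"` concrete, here is a minimal sketch of the same shaping logic in plain Python. The helper `pad_or_truncate` and the token ids are illustrative assumptions, not part of the `transformers` API. The real tokenizer also adds `[CLS]` (id 101) and `[SEP]` (id 102) and counts them toward `max_length`, which this sketch does not model. The pad id 0 corresponds to `[PAD]` in the `bert-base-uncased` vocabulary.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each with exactly max_length entries.

    Hypothetical helper for illustration only.
    """
    ids = list(ids)[:max_length]            # truncation: drop ids past max_length
    n_pad = max_length - len(ids)           # padding: how many slots are left
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence is padded on the right.
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)

# A long sequence is cut down to max_length.
long_ids, long_mask = pad_or_truncate(range(1, 11), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed to the model alongside `input_ids`.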

### Building the model

In [7]:
def get_model():

    m = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
    return m

def build_model():

    bert = get_model()

    input_ids = Input(shape=(sequence_length,), name="input_ids", dtype="int32")
    mask = Input(shape=(sequence_length,), name="attention_mask", dtype="int32")

    # Index 1 of the output is the pooled [CLS] representation, shape (batch, hidden_size).
    x1 = bert.bert(input_ids, attention_mask=mask)[1]
    
    x1 = Flatten()(x1)

    y = Dense(1, activation = "sigmoid", name = "output")(x1)

    model = Model(inputs=[input_ids, mask], outputs=y)

    # summary() prints the table itself and returns None, so it is not wrapped in print().
    model.summary()
    return model

model = build_model()

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 64)]         0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 64)]         0           []                               
                                                                                                  
 bert (TFBertMainLayer)         TFBaseModelOutputWi  109482240   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 64,                                            
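
The `[1]` index in `build_model` selects the pooled output of the BERT layer rather than the per-token hidden states. As a rough NumPy sketch of what that pooling does (take the first token's vector, then apply a dense layer with `tanh`), using random stand-in values: the arrays `W` and `b` below are hypothetical placeholders for the learned pooler parameters, and the shapes assume `bert-base-uncased` (hidden size 768) with `sequence_length = 64`.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 64, 768

# Stand-in for output index 0 (last_hidden_state): one vector per token.
last_hidden_state = rng.normal(size=(batch, seq_len, hidden)).astype(np.float32)

# Hypothetical pooler parameters; in the real model these are learned.
W = rng.normal(scale=0.02, size=(hidden, hidden)).astype(np.float32)
b = np.zeros(hidden, dtype=np.float32)

# BERT-style pooling: first ([CLS]) token, then dense + tanh.
cls_vector = last_hidden_state[:, 0, :]
pooled_output = np.tanh(cls_vector @ W + b)

print(pooled_output.shape)
```

Because the pooled output is already two-dimensional, the `Flatten` layer in `build_model` leaves its shape unchanged.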

### Compiling the model

In [8]:
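# Adam's default learning rate (1e-3) is usually too high for fine-tuning BERT;
# a small value such as 2e-5 is a common starting point.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)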

### Training the model

In [9]:
history = model.fit(
    [tokens["input_ids"], tokens["attention_mask"]],
    y,
    batch_size=8,
    epochs=1,
)



### TensorBoard Projector

In [11]:
import os

os.makedirs("logs", exist_ok=True)

### Saving the Vocabulary

In [34]:
tokenizer.save_vocabulary("logs", filename_prefix="metadata.tsv")

('logs/metadata.tsv-vocab.txt',)
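
The projector pairs each line of a single-column metadata file with one row of the embedding matrix, so the file should have no header and exactly as many lines as there are embedding rows. A small sanity check along those lines is sketched below on a toy vocabulary file. The helper `count_labels`, the toy tokens, and the temporary path are assumptions for illustration; for `bert-base-uncased` the expected count would be 30522.

```python
import os
import tempfile

def count_labels(metadata_path):
    """Count the labels (lines) in a single-column metadata file."""
    with open(metadata_path, encoding="utf-8") as f:
        return sum(1 for _ in f)

# Toy stand-in for logs/metadata.tsv-vocab.txt: one token per line, no header.
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "hello"]
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "metadata.tsv-vocab.txt")
with open(toy_path, "w", encoding="utf-8") as f:
    f.write("\n".join(toy_vocab) + "\n")

num_embedding_rows = 5  # hypothetical; compare against weights.shape[0] in practice
n_labels = count_labels(toy_path)
labels_match = n_labels == num_embedding_rows

print(n_labels, labels_match)
```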

### Saving Weights

In [57]:
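# layers[2] is the BERT main layer; its first weight is the word-embedding
# matrix of shape (vocab_size, hidden_size).
weights = tf.Variable(model.layers[2].get_weights()[0])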

### Creating Checkpoint

In [59]:
# Use a new name so the `checkpoint` string ("bert-base-uncased") defined earlier is not overwritten.
ckpt = tf.train.Checkpoint(embedding=weights)
ckpt.save(os.path.join("logs", "embedding.ckpt"))

'logs/embedding.ckpt-1'

### Visualizing the Embeddings in TensorBoard

In [62]:
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
# The name of the tensor will be suffixed by `/.ATTRIBUTES/VARIABLE_VALUE`.
embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
embedding.metadata_path = 'metadata.tsv-vocab.txt'
projector.visualize_embeddings("logs", config)

In [None]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

%load_ext tensorboard

%tensorboard --logdir /content/logs
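
Once the projector is open, selecting a token lists its nearest neighbours in embedding space, ranked by cosine or Euclidean distance. The following sketch shows the cosine variant on a toy matrix so the ranking can be checked by hand. The function `nearest_neighbors`, the labels, and the vectors are hypothetical; with the real data, the embedding matrix and the vocabulary list would take their place.

```python
import numpy as np

def nearest_neighbors(embeddings, labels, query, k=2):
    """Return the k labels most similar to `query` by cosine similarity."""
    # Normalise rows so a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = labels.index(query)
    sims = unit @ unit[q]
    order = np.argsort(-sims)  # most similar first
    return [labels[i] for i in order if i != q][:k]

toy_labels = ["cat", "dog", "car", "truck"]
toy_embeddings = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
])

neighbors = nearest_neighbors(toy_embeddings, toy_labels, "cat", k=1)
print(neighbors)
```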