<a href="https://colab.research.google.com/github/Satwikram/NLP-Implementations/blob/main/Fine%20Tune%20BERT%20with%20Custom%20dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Author: Satwik Ram K

### Connecting to Kaggle

In [2]:
from google.colab import files

files.upload()


! mkdir -p ~/.kaggle

! cp kaggle.json ~/.kaggle/

! chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json


### Downloading the dataset

In [3]:
!kaggle datasets download -d rmisra/news-headlines-dataset-for-sarcasm-detection

Downloading news-headlines-dataset-for-sarcasm-detection.zip to /content
  0% 0.00/3.30M [00:00<?, ?B/s]
100% 3.30M/3.30M [00:00<00:00, 109MB/s]


In [4]:
!unzip /content/news-headlines-dataset-for-sarcasm-detection.zip

Archive:  /content/news-headlines-dataset-for-sarcasm-detection.zip
  inflating: Sarcasm_Headlines_Dataset.json  
  inflating: Sarcasm_Headlines_Dataset_v2.json  


### Importing the dependencies

In [5]:
import pandas as pd
import numpy as np
import tensorflow as tf

### Loading Dataset

In [6]:
dataset = pd.read_json("/content/Sarcasm_Headlines_Dataset_v2.json", lines = True)
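
The `lines = True` argument tells pandas that the file holds one JSON object per line (the JSON Lines layout) rather than a single JSON array. As a rough illustration of that layout, the sketch below parses two made-up records with the standard library only; `read_json_lines` and the sample records are hypothetical and are not part of the notebook or the dataset.

```python
import io
import json

# Two made-up records in JSON Lines form: one JSON object per line.
sample = io.StringIO(
    '{"is_sarcastic": 1, "headline": "example headline one", "article_link": "https://example.com/a"}\n'
    '{"is_sarcastic": 0, "headline": "example headline two", "article_link": "https://example.com/b"}\n'
)


def read_json_lines(handle):
    """Parse one JSON object per non-empty line (hypothetical helper)."""
    return [json.loads(line) for line in handle if line.strip()]


records = read_json_lines(sample)
print(len(records), records[0]["headline"])
```

Each parsed record corresponds to one row of the DataFrame, with the keys becoming column names.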

In [7]:
dataset.head()

| | is_sarcastic | headline | article_link |
|---|---|---|---|
| 0 | 1 | thirtysomething scientists unveil doomsday clo... | https://www.theonion.com/thirtysomething-scien... |
| 1 | 0 | dem rep. totally nails why congress is falling... | https://www.huffingtonpost.com/entry/donna-edw... |
| 2 | 0 | eat your veggies: 9 deliciously different recipes | https://www.huffingtonpost.com/entry/eat-your-... |
| 3 | 1 | inclement weather prevents liar from getting t... | https://local.theonion.com/inclement-weather-p... |
| 4 | 1 | mother comes pretty close to using word 'strea... | https://www.theonion.com/mother-comes-pretty-c... |


In [8]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28619 entries, 0 to 28618
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   is_sarcastic  28619 non-null  int64 
 1   headline      28619 non-null  object
 2   article_link  28619 non-null  object
dtypes: int64(1), object(2)
memory usage: 670.9+ KB


### Taking X and Y

In [9]:
X = dataset["headline"]

In [10]:
y = dataset["is_sarcastic"]

### Splitting Data into Train and Test

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
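
Conceptually, the split shuffles the rows with a seeded random generator and sets aside 20% of them for testing. The sketch below shows that idea with the standard library only. `shuffled_split` is a hypothetical helper, the headlines are made up, rounding the test share up is an assumption, and the result will not match the exact rows scikit-learn picks for `random_state = 42`.

```python
import math
import random


def shuffled_split(items, test_size=0.2, seed=42):
    """Shuffle indices with a seeded RNG, then cut off the test share."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    # Assumption: the test share is rounded up to a whole number of rows.
    n_test = math.ceil(test_size * len(items))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]


headlines = [f"headline {i}" for i in range(10)]  # made-up stand-ins
train, test = shuffled_split(headlines)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which is the role `random_state` plays in the cell above.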

### Tokenization

In [None]:
!pip install transformers

In [14]:
from transformers import AutoTokenizer

In [15]:
tokenizer = AutoTokenizer.from_pretrained('bert-large-uncased')

In [16]:
tokenizer

PreTrainedTokenizerFast(name_or_path='bert-large-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

#### Defining Tokenization function

In [17]:
def tokenize_function(examples):
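    # Headlines are short, so cap the padded length instead of padding to the 512-token model maximum.
    # max_length = 64 is an assumed value; adjust it to the longest headline you need to keep.
    return tokenizer(examples, padding = "max_length", truncation = True, max_length = 64)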

In [18]:
X_train = tokenize_function(X_train.to_list())

In [19]:
X_test = tokenize_function(X_test.to_list())
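
With `padding = "max_length"` and `truncation = True`, every headline ends up as a sequence of the same length: long inputs are cut, short ones are filled with the `[PAD]` id, and an attention mask marks which positions hold real tokens. The sketch below imitates that behavior in plain Python. `encode_fixed_length` is a hypothetical helper, not the tokenizer's implementation; the content ids are illustrative, and `token_type_ids` are left out for brevity.

```python
# Special-token ids as found in the BERT uncased vocabularies.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def encode_fixed_length(content_ids, max_length):
    """Wrap content in [CLS]/[SEP], truncate if needed, then pad to max_length."""
    # Leave room for the two special tokens when truncating.
    content = content_ids[: max_length - 2]
    input_ids = [CLS_ID] + content + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


short = encode_fixed_length([7592, 2088], max_length=8)
long = encode_fixed_length(list(range(1000, 1020)), max_length=8)
print(short["input_ids"], short["attention_mask"])
print(long["input_ids"], long["attention_mask"])
```

The padded length directly sets the memory cost per example, so a smaller fixed length is usually worth considering for short texts such as headlines.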

### Preparing the dataset

In [20]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(X_train),
    y_train
))

In [21]:
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(X_test),
    y_test
))
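
`from_tensor_slices` takes column-wise data (a dict of equally long arrays plus a label array) and yields one `(features, label)` pair per row. The pure-Python sketch below mimics that slicing on toy values; `slice_examples` is a hypothetical stand-in for illustration and does not use TensorFlow.

```python
def slice_examples(features, labels):
    """Yield (feature_dict, label) pairs, one per row, from column-wise data."""
    for i in range(len(labels)):
        yield {name: column[i] for name, column in features.items()}, labels[i]


# Toy column-wise inputs shaped like the tokenizer output (values are made up).
toy_features = {
    "input_ids": [[101, 7592, 102, 0], [101, 2088, 2088, 102]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1]],
}
toy_labels = [1, 0]

examples = list(slice_examples(toy_features, toy_labels))
print(examples[0])
```

Because rows are matched purely by position, the labels must be in the same order as the tokenized headlines.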

### Fine-Tuning the model

In [22]:
from transformers import TFAutoModelForSequenceClassification

In [23]:
model = TFAutoModelForSequenceClassification.from_pretrained('bert-large-uncased')

Downloading:   0%|          | 0.00/1.37G [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [26]:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)

In [27]:
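# Explicit loss on raw logits; this avoids model.compute_loss, which later transformers releases deprecated.
model.compile(optimizer = optimizer, loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True))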
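
The classification head returns raw logits (one score per class), and the loss for a model like this is typically a softmax cross-entropy computed on those logits against an integer label. The sketch below works through that computation for a single example in plain Python. `sparse_ce_from_logits` is a hypothetical helper written for illustration, not the library's code.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy of one example: log-sum-exp of the logits minus the true-class logit."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


confident_right = sparse_ce_from_logits([3.0, -1.0], label=0)
confident_wrong = sparse_ce_from_logits([3.0, -1.0], label=1)
undecided = sparse_ce_from_logits([0.0, 0.0], label=0)
print(confident_right, confident_wrong, undecided)
```

The loss is small when the true class has the largest logit and grows as the model puts its confidence on the wrong class; equal logits over two classes give `log(2)`.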

In [None]:
# The dataset is already batched with .batch(16), so batch_size is not passed to fit().
model.fit(train_dataset.shuffle(1000).batch(16), epochs = 3)
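
Once training finishes, the model's outputs on the test set are logits, which still have to be turned into class labels and compared with `y_test`. The sketch below shows that last step in plain Python on made-up logits. `predict_labels` and `accuracy` are hypothetical helpers, and the numbers are not results from this notebook.

```python
def predict_labels(batch_logits):
    """Pick the index of the largest logit in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in batch_logits]


def accuracy(predicted, expected):
    """Fraction of positions where the predicted label matches the expected one."""
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


# Made-up logits for three headlines: column 0 = not sarcastic, column 1 = sarcastic.
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]]
labels = [0, 1, 1]

predicted = predict_labels(logits)
score = accuracy(predicted, labels)
print(predicted, score)
```

Taking the argmax of the logits gives the same label as taking the argmax of the softmax probabilities, so the softmax step can be skipped when only the predicted class is needed.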