<a href="https://colab.research.google.com/github/CastHash532/kaggle-automl/blob/main/Kaggle_news_NLP_Transformers_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

In [1]:
!pip install fast-bert

Collecting fast-bert
  Downloading fast_bert-1.9.9-py3-none-any.whl (99 kB)
[?25l[K     |███▎                            | 10 kB 10.5 MB/s eta 0:00:01[K     |██████▋                         | 20 kB 7.6 MB/s eta 0:00:01[K     |█████████▉                      | 30 kB 5.6 MB/s eta 0:00:01[K     |█████████████▏                  | 40 kB 3.4 MB/s eta 0:00:01[K     |████████████████▌               | 51 kB 3.7 MB/s eta 0:00:01[K     |███████████████████▊            | 61 kB 4.2 MB/s eta 0:00:01[K     |███████████████████████         | 71 kB 4.1 MB/s eta 0:00:01[K     |██████████████████████████▎     | 81 kB 4.7 MB/s eta 0:00:01[K     |█████████████████████████████▋  | 92 kB 3.7 MB/s eta 0:00:01[K     |████████████████████████████████| 99 kB 2.9 MB/s 
Collecting transformers==3.0.2
  Downloading transformers-3.0.2-py3-none-any.whl (769 kB)
Collecting tokenizers==0.8.1.rc1
  Downloading tokenizers-0.8.1rc1-cp37-cp37

## Authenticating with Kaggle using kaggle.json

Navigate to https://www.kaggle.com. Then go to the [Account tab of your user profile](https://www.kaggle.com/me/account) and select Create API Token. This will trigger the download of kaggle.json, a file containing your API credentials.

Then run the cell below to upload kaggle.json to your Colab runtime.

In [2]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  
# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json
User uploaded file "kaggle.json" with length 70 bytes
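
As an optional sanity check after the upload, the sketch below looks for the two fields a Kaggle API token file normally carries (`username` and `key`) and confirms that the file permissions match the `chmod 600` step above. The helper name `check_kaggle_credentials` is hypothetical and not part of the Kaggle API. The demo uses placeholder credentials in a temporary directory, and the permission check assumes a POSIX system such as the Colab runtime.

```python
import json
import os
import stat
import tempfile


def check_kaggle_credentials(path):
    """Return a list of problems with a kaggle.json file (empty list if none found)."""
    if not os.path.isfile(path):
        return [f"{path} does not exist"]
    try:
        with open(path) as fh:
            creds = json.load(fh)
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(creds, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    # A Kaggle API token file is expected to hold these two fields.
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing field: {field}")
    # chmod 600 leaves no permission bits set for group or others.
    if stat.S_IMODE(os.stat(path).st_mode) & 0o077:
        problems.append("file is accessible to other users; run chmod 600")
    return problems


# Demo on a throwaway file with placeholder credentials (not real ones).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "kaggle.json")
    with open(demo_path, "w") as fh:
        json.dump({"username": "example-user", "key": "0" * 32}, fh)
    os.chmod(demo_path, 0o600)
    demo_problems = check_kaggle_credentials(demo_path)
    print(demo_problems)
```

In the notebook, the same helper could be pointed at the real file with `check_kaggle_credentials(os.path.expanduser("~/.kaggle/kaggle.json"))`.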


## Load data and preprocess



In [3]:
!kaggle datasets download -d rmisra/news-category-dataset

Downloading news-category-dataset.zip to /content
 35% 9.00M/25.4M [00:00<00:00, 19.5MB/s]
100% 25.4M/25.4M [00:00<00:00, 43.4MB/s]


In [4]:
!unzip -u news-category-dataset.zip

Archive:  news-category-dataset.zip
  inflating: News_Category_Dataset_v2.json  


In [5]:
import pandas as pd
import numpy as np

dataset = pd.read_json('/content/News_Category_Dataset_v2.json', lines=True)

In [6]:
# The dataset lists WORLDPOST and THE WORLDPOST as two separate categories; merge THE WORLDPOST into WORLDPOST
dataset['category'] = dataset['category'].replace('THE WORLDPOST', 'WORLDPOST')
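
To see the effect of merging the two category names, here is a minimal sketch on a toy frame. The rows are made up for illustration and `toy` is a hypothetical name. Counting values afterwards confirms that only one spelling remains.

```python
import pandas as pd

# Toy data standing in for the real dataset's category column.
toy = pd.DataFrame({"category": ["WORLDPOST", "THE WORLDPOST", "POLITICS"]})

# Rewrite the duplicate spelling to the canonical one.
toy["category"] = toy["category"].replace("THE WORLDPOST", "WORLDPOST")

# Count rows per category after the merge.
counts = toy["category"].value_counts()
print(counts)
```

Running `dataset['category'].value_counts()` on the real data is a quick way to spot other near-duplicate category names.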

In [7]:
# Concatenate each headline and its short description into a single text column

dataset['text'] = dataset['headline'] + '. ' + dataset['short_description']
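
One thing to watch when concatenating string columns in pandas: if either side is missing, the combined value is missing too. The sketch below shows this on made-up rows (`toy`, `naive`, and `safe` are hypothetical names) and one possible workaround using `fillna("")`. Whether the real dataset has missing descriptions is not checked here.

```python
import pandas as pd

# Toy rows; the second one has no short description.
toy = pd.DataFrame({
    "headline": ["Markets rally", "Storm hits coast"],
    "short_description": ["Stocks rose on Monday.", None],
})

# Plain concatenation: a missing description makes the whole text missing.
naive = toy["headline"] + ". " + toy["short_description"]

# Replacing missing values with empty strings keeps the headline.
safe = toy["headline"].fillna("") + ". " + toy["short_description"].fillna("")

print(naive.isna().tolist())
print(safe.tolist())
```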

In [8]:
from sklearn.model_selection import train_test_split

ds_train, ds_test = train_test_split(dataset, test_size=0.2, random_state=42)

In [9]:
text = 'text'
target = 'category'

In [10]:
# Save a subset of the news dataset (8,000 training rows, 2,000 test rows) to CSV files
ds_train[[text, target]][:8000].to_csv('/content/train.csv')
ds_test[[text, target]][:2000].to_csv('/content/test.csv')
# Write one label per line; header=False keeps the column name "0" out of the label file
pd.DataFrame(ds_train[target].unique()).to_csv('/content/labels.csv', index=False, header=False)
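
A detail worth checking when writing a labels file with pandas: `to_csv(index=False)` drops the index but still emits a header row, which for an unnamed column is the column name `0`. The sketch below compares the output with and without `header=False` on made-up labels, writing to a string instead of a file so it can be run anywhere.

```python
import pandas as pd

# Toy labels standing in for ds_train[target].unique().
labels = pd.Series(["POLITICS", "WORLDPOST", "SPORTS", "POLITICS"]).unique()

# With no path argument, to_csv returns the CSV text as a string.
with_header = pd.DataFrame(labels).to_csv(index=False)
without_header = pd.DataFrame(labels).to_csv(index=False, header=False)

print(with_header.splitlines())
print(without_header.splitlines())
```

If the labels file is read line by line, the extra `0` line would show up as a spurious label, so it is safer to write the file without a header.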

In [11]:
ds_train['text'].str.len().max()

1488

## Fine Tuning

In [12]:
from fast_bert.data_cls import BertDataBunch
import torch

DATA_PATH = LABEL_PATH = OUTPUT_PATH = '/content'

databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer='bert-base-uncased',
                          train_file='train.csv',
                          val_file='test.csv',
                          label_file='labels.csv',
                          text_col=text,
                          label_col=target,
                          batch_size_per_gpu=64,
                          max_seq_length=512,
                          multi_gpu=torch.cuda.device_count() > 1,
                          multi_label=True,
                          model_type='bert')


In [13]:

from fast_bert.learner_cls import BertLearner
from fast_bert.metrics import accuracy
import logging


logger = logging.getLogger()
device_cuda = torch.device("cuda")
metrics = [{'name': 'accuracy', 'function': accuracy}]

learner = BertLearner.from_pretrained_model(
    databunch,
    pretrained_path='bert-base-uncased',
    metrics=metrics,
    device=device_cuda,
    logger=logger,
    output_dir=OUTPUT_PATH,
    finetuned_wgts_path=None,
    warmup_steps=500,
    multi_gpu=torch.cuda.device_count() > 1,
    is_fp16=True,
    multi_label=True,
    logging_steps=50,
    freeze_transformer_layers=True,
)



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMultiLabelSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForMultiLabelSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMultiLabelSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMultiLabelSequenceClassification were not 

In [None]:
learner.lr_find(start_lr=1e-5,optimizer_type='lamb')


	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:1025.)
  exp_avg.mul_(beta1).add_(1 - beta1, grad)


In [None]:
# Fine-tune the model
learner.fit(epochs=3,
            lr=1e-5,
            validate=True,  # Evaluate the model after each epoch
            schedule_type="warmup_cosine",
            optimizer_type="lamb")


In [None]:
learner.validate()

In [None]:
learner.save_model()