<a href="https://colab.research.google.com/github/Sudhandar/Intent-Classification-with-BERT/blob/master/notebooks/atis_bert_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intent Classification with BERT

This notebook deliberately uses a small (5871 rows) and highly imbalanced dataset to show how transfer learning can achieve a high score (metric: F1 score) on multiclass text classification, a task that would otherwise be computationally expensive and require a large amount of training data.


In [1]:
import torch
import pandas as pd
from tqdm.notebook import tqdm

### Uploading the dataset
The dataset can be downloaded from the GitHub repo ([/input](https://github.com/Sudhandar/Intent-Classification-with-BERT/tree/master/input)) and uploaded by running the following cell. (Refer to the file named atis_dataset.csv.)

In [2]:
from google.colab import files
uploaded = files.upload()

Saving atis_dataset.csv to atis_dataset.csv


### Exploring the dataset 
The original dataset was in pickle format ([Data Source](https://github.com/Sudhandar/Intent-Classification-with-BERT/tree/master/src)). It has been converted to CSV, and the data preprocessing script can be found in the GitHub repo ([/src](https://github.com/Sudhandar/Intent-Classification-with-BERT/tree/master/src)). Now, let's do some basic analysis of the dataset.


In [27]:
df = pd.read_csv('atis_dataset.csv')

In [28]:
df.head(5)

| | query | intent |
|---|---|---|
| 0 | i want to fly from boston at 838 am and arrive... | flight |
| 1 | what flights are available from pittsburgh to ... | flight |
| 2 | what is the arrival time in san francisco for ... | flight_time |
| 3 | cheapest airfare from tacoma to orlando | airfare |
| 4 | round trip fares from pittsburgh to philadelph... | airfare |


In [29]:
print(f'Dataset shape: {df.shape}')

Dataset shape: (5871, 2)


In [30]:
df.intent.value_counts()

flight                        4298
airfare                        471
ground_service                 291
airline                        195
abbreviation                   180
aircraft                        90
flight_time                     55
quantity                        54
airport                         38
capacity                        37
flight+airfare                  33
distance                        30
city                            25
ground_fare                     25
flight_no                       20
meal                            12
restriction                      6
airline+flight_no                2
day_name                         2
flight+airline                   1
ground_service+ground_fare       1
aircraft+flight+flight_no        1
flight_no+airline                1
airfare+flight_time              1
airfare+flight                   1
cheapest                         1
Name: intent, dtype: int64

It is a **highly imbalanced dataset**. We remove the combined intents (those containing a '+' sign) and the intents with fewer than 5 queries.


In [31]:
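# Drop combined intents; a literal match on '+' avoids the invalid '\+' escape sequence
df = df[~df['intent'].str.contains('+', regex=False)]
# Drop the remaining intents with fewer than 5 queries
df = df[~df['intent'].isin(['day_name', 'cheapest'])]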

In [32]:
df.intent.value_counts()

flight            4298
airfare            471
ground_service     291
airline            195
abbreviation       180
aircraft            90
flight_time         55
quantity            54
airport             38
capacity            37
distance            30
city                25
ground_fare         25
flight_no           20
meal                12
restriction          6
Name: intent, dtype: int64
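
The filtering rule described above (no '+' in the label, at least 5 queries) can also be expressed as a threshold instead of naming each rare intent by hand. The following is a minimal sketch in plain Python; the function name `keep_frequent_single_intents` and the toy counts are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def keep_frequent_single_intents(intents, min_count=5):
    """Return the labels that contain no '+' and occur at least min_count times."""
    counts = Counter(intents)
    return {
        intent
        for intent, count in counts.items()
        if "+" not in intent and count >= min_count
    }

# Toy data with hypothetical counts (not the real ATIS distribution).
toy_intents = (
    ["flight"] * 6
    + ["airfare"] * 5
    + ["cheapest"]             # too rare: dropped
    + ["flight+airfare"] * 7   # combined intent: dropped
)
kept = keep_frequent_single_intents(toy_intents)
print(sorted(kept))
```

With a DataFrame, the resulting set could be applied as `df[df['intent'].isin(kept)]`.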

This is a multiclass classification problem with 16 target classes. Now let's encode the target classes as integers.

In [33]:
possible_intents = df.intent.unique()

In [34]:
intent_dict ={}
for index, possible_intent in enumerate(possible_intents):
  intent_dict[possible_intent] = index

In [35]:
intent_dict

{'abbreviation': 8,
 'aircraft': 3,
 'airfare': 2,
 'airline': 6,
 'airport': 5,
 'capacity': 13,
 'city': 11,
 'distance': 7,
 'flight': 0,
 'flight_no': 12,
 'flight_time': 1,
 'ground_fare': 9,
 'ground_service': 4,
 'meal': 14,
 'quantity': 10,
 'restriction': 15}
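
The integer assigned to each intent depends only on the order in which the intents first appear in the data, because `unique()` preserves order of appearance. A small plain-Python sketch of the same idea, which also builds the inverse mapping needed later to turn predictions back into intent names, is shown here. The helper name `build_label_maps` is an assumption made for illustration.

```python
def build_label_maps(intents):
    """Map each distinct intent to an integer in order of first appearance."""
    intent_to_id = {}
    for intent in intents:
        if intent not in intent_to_id:
            intent_to_id[intent] = len(intent_to_id)
    # Inverse mapping: integer label back to intent name.
    id_to_intent = {idx: intent for intent, idx in intent_to_id.items()}
    return intent_to_id, id_to_intent

# Intents of the first five rows shown by df.head(5).
toy = ["flight", "flight", "flight_time", "airfare", "airfare"]
intent_to_id, id_to_intent = build_label_maps(toy)
print(intent_to_id)
```

Because the mapping is order dependent, it should be saved alongside the model so that the same integers are used at inference time.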

In [36]:
df['label'] = df['intent'].apply(lambda x: intent_dict[x])
df.label.value_counts()

0     4298
2      471
4      291
6      195
8      180
3       90
1       55
10      54
5       38
13      37
7       30
11      25
9       25
12      20
14      12
15       6
Name: label, dtype: int64

In [37]:
n_rows = int(df.shape[0])
# Note: the shuffled copy is not assigned back, so df keeps its original order.
df.sample(frac =1).reset_index(drop =True)
df.head()

| | query | intent | label |
|---|---|---|---|
| 0 | i want to fly from boston at 838 am and arrive... | flight | 0 |
| 1 | what flights are available from pittsburgh to ... | flight | 0 |
| 2 | what is the arrival time in san francisco for ... | flight_time | 1 |
| 3 | cheapest airfare from tacoma to orlando | airfare | 2 |
| 4 | round trip fares from pittsburgh to philadelph... | airfare | 2 |


## Feature Engineering

We first set aside 600 randomly sampled rows as a holdout test set, then split the remaining data into train and test sets using an 85%-15% split. Since this is a highly imbalanced dataset, we use a **stratified split** (the `stratify` argument of `train_test_split`), which ensures that the train and test data retain the same percentage of each class.

In [38]:
holdout_testset = df.sample(n = 600, random_state= 12)
df = df[~df['query'].isin(holdout_testset['query'])]

In [39]:
print(f"Holdout Testset Shape:{holdout_testset.shape}")
print(f"Dataset New Shape:{df.shape}")

Holdout Testset Shape:(600, 3)
Dataset New Shape:(5130, 3)
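
The class counts printed earlier sum to 5827 rows, so removing 600 rows by index would leave 5227, yet the shape printed above is 5130. A likely explanation is that filtering with `isin` on the query text also removes every other row whose text is identical to a holdout query. The toy sketch below (hypothetical rows, plain Python) shows the difference between the two filters.

```python
# Each row is (index, query); row 2 duplicates the text of row 0.
rows = [
    (0, "show me flights to boston"),
    (1, "cheapest airfare to denver"),
    (2, "show me flights to boston"),
    (3, "what is fare code q"),
]
holdout_ids = {0}
holdout_queries = {query for idx, query in rows if idx in holdout_ids}

# Filtering by text drops the holdout row and its duplicate.
remaining_by_text = [(i, q) for i, q in rows if q not in holdout_queries]
# Filtering by index drops only the holdout row itself.
remaining_by_index = [(i, q) for i, q in rows if i not in holdout_ids]

print(len(remaining_by_text), len(remaining_by_index))
```

Filtering by text keeps duplicated queries out of the training data, which avoids leakage into the holdout set at the cost of a few training rows. With pandas, the index-based variant would be `df.drop(holdout_testset.index)`.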


In [40]:
from sklearn.model_selection import train_test_split

In [41]:
x_train, x_test, y_train, y_test = train_test_split(
    df.index.values,
    df.label.values,
    test_size = 0.15,
    random_state = 26,
    stratify = df.label.values
)

In [42]:
df.loc[x_train,'data_type'] = 'train'
df.loc[x_test, 'data_type'] = 'test'

In [43]:
df.groupby(['intent','label','data_type']).count()

| intent | label | data_type | query |
|---|---|---|---|
| abbreviation | 8 | test | 21 |
| abbreviation | 8 | train | 121 |
| aircraft | 3 | test | 11 |
| aircraft | 3 | train | 65 |
| airfare | 2 | test | 63 |
| airfare | 2 | train | 358 |
| airline | 6 | test | 24 |
| airline | 6 | train | 135 |
| airport | 5 | test | 5 |
| airport | 5 | train | 27 |
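
To illustrate what stratification does, the sketch below splits each class separately so that every class contributes roughly 15% of its rows to the test set. It is a simplified plain-Python illustration, not scikit-learn's actual algorithm, and the name `stratified_split` is an assumption.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.15, seed=26):
    """Split row positions so each class keeps about test_size of its rows for testing."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # At least one test row per class, otherwise the rounded share.
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: an imbalanced two-class problem (hypothetical data).
toy_labels = [0] * 100 + [1] * 20
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

Without stratification, a rare class such as `restriction` (6 rows) could easily end up entirely in one of the two splits.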


In [44]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)

In [45]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

**BERT** has a specific encoding method. We can use the pretrained tokenizer from BERT and transform our dataframes into tensor datasets that can be consumed by our **BERT** model.

In [46]:
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_lower_case = True
)

In [48]:
encoded_data_train = tokenizer.batch_encode_plus(df[df.data_type =='train']['query'].values,
                        add_special_tokens = True,
                        return_attention_mask = True,
                        pad_to_max_length = True,
                        max_length = 256,
                        truncation = True,
                        return_tensors = 'pt')

encoded_data_test = tokenizer.batch_encode_plus(df[df.data_type =='test']['query'].values,
                        add_special_tokens = True,
                        return_attention_mask = True,
                        pad_to_max_length = True,
                        max_length = 256,
                        truncation = True,
                        return_tensors = 'pt')

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type == 'train']['label'].values)

input_ids_test = encoded_data_test['input_ids']
attention_masks_test = encoded_data_test['attention_mask']
labels_test = torch.tensor(df[df.data_type == 'test']['label'].values)

In [49]:
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_test = TensorDataset(input_ids_test, attention_masks_test, labels_test)
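
To make the encoding step more concrete, the sketch below mimics what padding to a fixed `max_length` and the attention mask look like for a single query. It is a toy illustration in plain Python: the special-token ids are those of the `bert-base-uncased` vocabulary, while the word-piece ids and the helper name `pad_and_mask` are hypothetical.

```python
# Special-token ids in the bert-base-uncased vocabulary.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102

def pad_and_mask(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, then pad and build the attention mask."""
    ids = [CLS_ID] + token_ids[: max_length - 2] + [SEP_ID]
    attention_mask = [1] * len(ids)          # 1 marks real tokens
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, attention_mask + [0] * padding  # 0 marks padding

toy_token_ids = [2054, 7599, 2024]  # hypothetical word-piece ids
input_ids, attention_mask = pad_and_mask(toy_token_ids, max_length=8)
print(input_ids)
print(attention_mask)
```

The attention mask tells the model which positions are padding so that they are ignored by self-attention. Since ATIS queries are short, a `max_length` well below 256 would probably suffice and reduce memory use, although that is not verified here.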

We can use the **BERT for Sequence Classification** model for our multiclass classification problem.

In [50]:
from transformers import BertForSequenceClassification

In [51]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                      num_labels = len(intent_dict),
                                                      output_attentions = False,
                                                      output_hidden_states = False)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [52]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

## Model Training and Evaluation

We fine-tune the pretrained BERT model on our dataset for 10 epochs with a batch size of 32.

In [53]:
batch_size = 32
dataloader_train = DataLoader(
    dataset_train,
    sampler = RandomSampler(dataset_train),
    batch_size = batch_size)
dataloader_test = DataLoader(
    dataset_test,
    sampler = SequentialSampler(dataset_test),  # evaluation does not need shuffling
    batch_size = batch_size)

In [54]:
from transformers import AdamW, get_linear_schedule_with_warmup

In [55]:
optimizer = AdamW(
    model.parameters(),
    lr = 1e-5,
    eps = 1e-8)

In [56]:
epochs = 10
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps= len(dataloader_train)*epochs)
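
With `num_warmup_steps=0`, the scheduler simply decays the learning rate linearly from its initial value to zero over all training steps. The sketch below reproduces that multiplier in plain Python as an illustration of the documented behaviour; the function name `linear_schedule_factor` is hypothetical, and the figure of 137 batches per epoch is an assumption derived from roughly 4360 training rows at batch size 32.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
total_steps = 137 * 10  # assumed batches per epoch times epochs

for step in (0, total_steps // 2, total_steps):
    lr = base_lr * linear_schedule_factor(step, 0, total_steps)
    print(step, lr)
```

Under these assumptions the learning rate is halved midway through epoch 5 and reaches zero after the last batch of epoch 10.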

Since this is a highly imbalanced dataset, plain accuracy would be dominated by the `flight` class, so we use the weighted F1 score as our evaluation metric.

In [57]:
import numpy as np
from sklearn.metrics import f1_score

In [58]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'weighted')
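
For readers unfamiliar with `average='weighted'`, the sketch below computes the same kind of score by hand: a per-class F1 that is then averaged with each class weighted by its number of true samples. It is a plain-Python illustration on toy labels, with hypothetical helper names, and is not a replacement for scikit-learn's `f1_score`.

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for every class."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def weighted_f1(y_true, y_pred):
    """Average of per-class F1, weighted by the number of true samples per class."""
    support = Counter(y_true)
    scores = f1_per_class(y_true, y_pred)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted average of per-class F1: every class counts equally."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)

y_true = [0, 0, 0, 1]  # toy labels
y_pred = [0, 0, 1, 1]
print(weighted_f1(y_true, y_pred), macro_f1(y_true, y_pred))
```

Note that the weighted average is still driven mostly by the largest class. The macro average treats rare intents as equally important and is therefore a stricter check on an imbalanced dataset.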

In [59]:
def accuracy_per_class(preds, labels):
    intent_dict_inverse = {v: k for k, v in intent_dict.items()}
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat == label]
        y_true = labels_flat[labels_flat == label]
        print(f'Class: {intent_dict_inverse[label]}')
        print(f'accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}')

In [60]:
import random
seed_val = 26
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

We move the model to the GPU provided by Google Colab, falling back to the CPU if no GPU is available.

In [61]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [62]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [63]:
for epoch in tqdm(range(1, epochs+1)):
    model.train()
    loss_train_total = 0
    progress_bar = tqdm(dataloader_train, desc = 'Epoch {:1d}'.format(epoch), leave =False, disable =False)
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
            
        }
        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(),1.0)
        optimizer.step()
        scheduler.step()
        # The loss is already averaged over the batch; len(batch) is 3 (the tuple length), not the batch size.
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
    torch.save(model.state_dict(),f'BERT_ft_epoch{epoch}.model')
    tqdm.write(f'\nEpoch {epoch}')
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    val_loss, predictions, true_vals = evaluate(dataloader_test)
    val_f1 = f1_score_func(predictions,true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (weighted): {val_f1}')

Epoch 1
Training loss: 1.178565883614721
Validation loss: 0.5526434063911438
F1 Score (weighted): 0.8680094109125074


Epoch 2
Training loss: 0.3787773750232954
Validation loss: 0.23448099166154862
F1 Score (weighted): 0.9412941728588587


Epoch 3
Training loss: 0.19194109442840962
Validation loss: 0.14295994894579053
F1 Score (weighted): 0.9748826919311301


Epoch 4
Training loss: 0.1198993518397919
Validation loss: 0.1077176421135664
F1 Score (weighted): 0.9775485524614516


Epoch 5
Training loss: 0.08605793539951317
Validation loss: 0.09616075534373522
F1 Score (weighted): 0.9788288686819643


Epoch 6
Training loss: 0.06500531119518797
Validation loss: 0.07689532656222582
F1 Score (weighted): 0.9805024672335481


Epoch 7
Training loss: 0.05100673072067273
Validation loss: 0.0722674657497555
F1 Score (weighted): 0.9844288346452287


Epoch 8
Training loss: 0.043027464511131285
Validation loss: 0.06803967367857694
F1 Score (weighted): 0.9844288346452287


Epoch 9
Training loss: 0.04000260879754694
Validation loss: 0.06737244879826904
F1 Score (weighted): 0.9844288346452287


Epoch 10
Training loss: 0.03569762617771099
Validation loss: 0.07002760646864772
F1 Score (weighted): 0.9844288346452287



As we can see, our model does well: the weighted **F1 score** on the test split plateaus at about **0.98** from epoch 7 onward.
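
Because a checkpoint is saved after every epoch, one option is to keep the epoch with the lowest validation loss rather than the last one. The sketch below picks that epoch from the validation losses logged during training; choosing by validation loss is an assumption made here for illustration, not something the original notebook does.

```python
# Validation losses copied from the training log.
val_loss_by_epoch = {
    1: 0.5526434063911438,
    2: 0.23448099166154862,
    3: 0.14295994894579053,
    4: 0.1077176421135664,
    5: 0.09616075534373522,
    6: 0.07689532656222582,
    7: 0.0722674657497555,
    8: 0.06803967367857694,
    9: 0.06737244879826904,
    10: 0.07002760646864772,
}

# Epoch whose validation loss is smallest.
best_epoch = min(val_loss_by_epoch, key=val_loss_by_epoch.get)
# File name follows the pattern used by torch.save in the training loop.
best_checkpoint = f"BERT_ft_epoch{best_epoch}.model"
print(best_epoch, best_checkpoint)
```

The selected weights could then be restored with `model.load_state_dict(torch.load(best_checkpoint))` before evaluating on the holdout set. The slight rise in validation loss at epoch 10 while the training loss keeps falling hints at the onset of overfitting.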

### Testing our BERT model on the holdout testset

We can check the number of correct predictions for each class to see how our model has performed.

In [64]:
encoded_data_holdout_test = tokenizer.batch_encode_plus(holdout_testset['query'].values,
                        add_special_tokens = True,
                        return_attention_mask = True,
                        pad_to_max_length = True,
                        max_length = 256,
                        truncation = True,
                        return_tensors = 'pt')

input_ids_holdout_testset = encoded_data_holdout_test['input_ids']
attention_masks_holdout_testset = encoded_data_holdout_test['attention_mask']
labels_holdout_testset = torch.tensor(holdout_testset['label'].values)

dataset_holdout_testset = TensorDataset(input_ids_holdout_testset, attention_masks_holdout_testset, labels_holdout_testset)

dataloader_holdout_testset = DataLoader(
    dataset_holdout_testset,
    sampler = SequentialSampler(dataset_holdout_testset),  # evaluation does not need shuffling
    batch_size = batch_size)

In [65]:
_ , predictions, true_vals = evaluate(dataloader_holdout_testset)

In [67]:
f1_score_func(predictions, true_vals)

0.9901523845752106

In [68]:
accuracy_per_class(predictions, true_vals)

Class: flight
accuracy: 435/436
Class: flight_time
accuracy: 2/2
Class: airfare
accuracy: 47/47
Class: aircraft
accuracy: 12/13
Class: ground_service
accuracy: 25/27
Class: airport
accuracy: 4/4
Class: airline
accuracy: 23/23
Class: distance
accuracy: 4/4
Class: abbreviation
accuracy: 18/19
Class: ground_fare
accuracy: 5/5
Class: quantity
accuracy: 11/11
Class: city
accuracy: 2/2
Class: flight_no
accuracy: 4/5
Class: meal
accuracy: 2/2


The weighted **F1 score** on our holdout test set is **0.99**. It is safer, however, to quote the score of about 0.985 obtained on the validation (test) split, because the holdout score can fluctuate slightly depending on which rows happen to be sampled.
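
To get a rough feel for the size of such fluctuations, the sketch below bootstraps the holdout result. It resamples per-query correctness (594 correct out of 600, the totals of the per-class counts above) and reports a 95% percentile interval. Note that it uses plain accuracy rather than F1 to keep the example self-contained, so it is only an approximation of the variability of the F1 score.

```python
import random

correct, total = 594, 600  # sums of the per-class counts printed above
outcomes = [1] * correct + [0] * (total - correct)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
scores = []
for _ in range(2000):
    # Draw a holdout set of the same size with replacement.
    resample = rng.choices(outcomes, k=total)
    scores.append(sum(resample) / total)

scores.sort()
lo = scores[int(0.025 * len(scores))]       # 2.5th percentile
hi = scores[int(0.975 * len(scores)) - 1]   # 97.5th percentile
print(f"accuracy = {correct / total:.3f}, 95% interval = [{lo:.3f}, {hi:.3f}]")
```

An interval of this kind makes explicit that a difference of a few thousandths between the validation and holdout scores is within sampling noise for a test set of 600 queries.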