<h1>Multiclass text classification of questions/description into subjects. Transformers.</h1>

The goal of this project is to classify texts into 4 subjects: math, physics, chemistry, and biology.

<h2>A note from the publisher of the dataset</h2><br>
Here I am posting the link to the dataset. This dataset contains 3 columns. The goal is to classify the given texts into 4 subjects, i.e. Maths, Physics, Chemistry, and Biology.<br>
Challenges: data cleaning (the texts contain random special characters, symbols, and expressions that might carry class-dependent information, as well as LaTeX formulas, mathematical functions, etc.), class imbalance, customized NLP techniques (lemmatization, stemming, and stop-word removal must be carried out carefully so that the classes stay distinguishable), and overfitting.<br>
Have a nice experience playing around with the data, community🤞.<br>
https://www.kaggle.com/mrutyunjaybiswal/iitjee-neet-aims-students-questions-data


<b>Context</b><br>
In India, every year lacs of students sit for competitive examinations like JEE Advanced, JEE Mains, NEET, etc. These exams are said to be the gateway to admission into India's premier institutes such as IITs, NITs, AIIMS, etc. Because the competition is tough, with lacs of students appearing for these examinations, the Ed Tech industry in India has grown enormously, supporting the dreams of lacs of aspirants by providing online as well as offline coaching, mentoring, etc. This particular dataset consists of questions/doubts raised by students preparing for such examinations.

<b>Content</b><br>
As of version 1, the dataset contains the file Students-questions.csv.<br>
Inside the CSV file, we have two columns:<br>
eng: The full question or description of the question<br>
Subject: The subject the question belongs to. It has 4 classes: Physics, Chemistry, Biology, and Mathematics.<br>
So it is basically an NLP problem: given the question description, we need to find out which subject the question belongs to.

<b>Results</b><br>
With 10 neighbors, train accuracy 0.8841 and test accuracy 0.8650.<br>
Math texts are easily confused with physics texts.<br>
Confusion between physics and chemistry texts is also high.<br>
The least confusion occurs between biology texts and math or physics texts.

<h2>Importing and splitting the data</h2>

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.22.2-py3-none-any.whl (4.9 MB)
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.10.0-py3-none-any.whl (163 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.0 tokenizers-0.12.1 transformers-4.22.2


In [2]:
# importing dependencies
from platform import python_version
import warnings

# for working with arrays and dataframes
import numpy as np
import pandas as pd

# for plotting
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from tqdm.notebook import tqdm

# for validation and evaluation
import random
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score

# for working with text
import re
import nltk
from nltk.corpus import stopwords

import torch
from transformers import BertTokenizer
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# for modelling
import transformers
from transformers import BertForSequenceClassification
from transformers import AdamW, get_linear_schedule_with_warmup

In [3]:
# showing versions
print('Python version:',python_version())
print('NumPy version:',np.__version__)
print('Pandas version:',pd.__version__)
print('NLTK version:',nltk.__version__)
print('Sklearn version:',sklearn.__version__)
print('Torch version:',torch.__version__)
print('Transformers version:',transformers.__version__)

Python version: 3.7.14
NumPy version: 1.21.6
Pandas version: 1.3.5
NLTK version: 3.7
Sklearn version: 1.0.2


In [4]:
# primary settings
warnings.filterwarnings("ignore")
pd.set_option('display.max_rows', None, 'display.max_columns', None)

In [5]:
# defining GPU
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))
cuda0 = torch.device('cuda:0')

1
Tesla T4


In [6]:
# importing the dataset
df = pd.read_csv("subjects-questions.csv")
print(f"{df.shape[1]} columns, {df.shape[0]} rows")
# checking duplicated rows
print(f"{df.duplicated().sum()} duplicated rows")
print("Removing duplicated rows...")
df.drop_duplicates(inplace=True)
print(f"{df.duplicated().sum()} duplicated rows")
print(f"{df.shape[1]} columns, {df.shape[0]} rows")
df.head()

2 columns, 122519 rows
811 duplicated rows
Removing duplicated rows...
0 duplicated rows
2 columns, 121708 rows


   eng                                                Subject
0  An anti-forest measure is\nA. Afforestation\nB...  Biology
1  Among the following organic acids, the acid pr...  Chemistry
2  If the area of two similar triangles are equal...  Maths
3  In recent year, there has been a growing\nconc...  Biology
4  Which of the following statement\nregarding tr...  Physics


In [7]:
# renaming the columns
df.rename(columns={'eng': 'text', 'Subject': 'subject'}, inplace=True)
df = df[['subject', 'text']]
df.head()

   subject    text
0  Biology    An anti-forest measure is\nA. Afforestation\nB...
1  Chemistry  Among the following organic acids, the acid pr...
2  Maths      If the area of two similar triangles are equal...
3  Biology    In recent year, there has been a growing\nconc...
4  Physics    Which of the following statement\nregarding tr...


In [8]:
# seeing the target
df.subject.value_counts(dropna=False) 

Physics      38128
Chemistry    37612
Maths        32874
Biology      13094
Name: subject, dtype: int64
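
The class imbalance mentioned in the publisher's note can be read directly from these counts. The following is a minimal sketch that uses only the numbers printed above; the names `counts`, `shares`, and `imbalance_ratio` are illustrative and are not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {"Physics": 38128, "Chemistry": 37612, "Maths": 32874, "Biology": 13094}

total = sum(counts.values())  # number of rows left after dropping duplicates
shares = {name: n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name:<10}{share:.1%}")

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"largest/smallest: {imbalance_ratio:.2f}")
```

Biology is the clear minority class here, which is one reason the split further down is stratified by subject.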

In [9]:
# checking null values
print(df.isnull().values.any(axis=1).sum())

0


In [10]:
# printing data types
print(df.dtypes)

subject    object
text       object
dtype: object


In [11]:
# casting as string
df = df.astype(str)

In [12]:
# stripping
df['text'] = df['text'].str.strip()
df['subject'] = df['subject'].str.strip()

In [13]:
# encoding the subject
possible_labels = df.subject.unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
label_dict

{'Biology': 0, 'Chemistry': 1, 'Maths': 2, 'Physics': 3}

In [14]:
df['label'] = df.subject.replace(label_dict)
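
The encoding step above maps each subject name to an integer in order of first appearance. A minimal pure-Python sketch of the same idea follows; it assumes a toy list `subjects` in place of the DataFrame column, and `label_dict_inverse` is the reverse mapping that the metrics code later rebuilds.

```python
# Toy stand-in for df.subject (hypothetical values, same four classes).
subjects = ["Biology", "Chemistry", "Maths", "Biology", "Physics"]

# dict.fromkeys keeps first-appearance order, which is how unique() orders values.
label_dict = {name: idx for idx, name in enumerate(dict.fromkeys(subjects))}

# Encode every row, and build the inverse mapping for decoding predictions.
labels = [label_dict[s] for s in subjects]
label_dict_inverse = {idx: name for name, idx in label_dict.items()}

print(label_dict)
print(labels)
```

Because the mapping depends on row order, it is worth saving `label_dict` together with any trained checkpoint.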

In [15]:
# splitting into train and validation sets
X_train, X_val, y_train, y_val = \
    train_test_split(df.index.values, df.subject.values, test_size=0.15, 
                     random_state=42, stratify=df.subject.values)

df['data_type'] = ['not_set']*df.shape[0]

df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

print(df.groupby(['subject', 'label', 'data_type']).count())

                            text
subject   label data_type       
Biology   0     train      11130
                val         1964
Chemistry 1     train      31970
                val         5642
Maths     2     train      27943
                val         4931
Physics   3     train      32408
                val         5720
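
Since the split is stratified, each subject should contribute close to 15% of its rows to the validation set. The sketch below checks this using only the counts printed above; `split_counts` and `val_fraction` are illustrative names.

```python
# (train, val) counts per subject, copied from the groupby output above.
split_counts = {
    "Biology": (11130, 1964),
    "Chemistry": (31970, 5642),
    "Maths": (27943, 4931),
    "Physics": (32408, 5720),
}

# Fraction of each class that landed in the validation set.
val_fraction = {
    name: val / (train + val) for name, (train, val) in split_counts.items()
}

train_total = sum(train for train, _ in split_counts.values())
val_total = sum(val for _, val in split_counts.values())

for name, frac in val_fraction.items():
    print(f"{name:<10}{frac:.4f}")
print(f"train rows: {train_total}, validation rows: {val_total}")
```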


<h2>BertTokenizer and Encoding the Data</h2>

In [16]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case=True)
                                          
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=256, 
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=256, 
    return_tensors='pt'
)


input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)

dataset_train = \
    TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
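
The warning above appears because `max_length` is set without an explicit truncation setting. In recent versions of transformers the same intent is usually spelled out with `padding='max_length'` and `truncation=True`, and `pad_to_max_length` is deprecated. What padding and truncation to a fixed length do can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are toy values, and the real tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop everything past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)        # padding: fill the remainder
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

# A short sequence is padded, a long one is cut.
short_ids, short_mask = pad_or_truncate([101, 2054, 102], max_length=5)
long_ids, long_mask = pad_or_truncate(list(range(1, 9)), max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_length=256`, any question longer than 256 tokens loses its tail, so it may be worth checking how many texts are affected.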


<h2>BERT Pre-trained Model</h2>

In [17]:
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(label_dict), output_attentions=False,
    output_hidden_states=False)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

<h2>Data Loaders</h2>

In [18]:
batch_size = 3

dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train), 
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val, 
                                   sampler=SequentialSampler(dataset_val), 
                                   batch_size=batch_size)

In [19]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


<h2>Optimizer & Scheduler</h2>

In [20]:
optimizer = AdamW(model.parameters(),
                  lr=1e-5, 
                  eps=1e-8)
                  
epochs = 5

scheduler = get_linear_schedule_with_warmup(optimizer, 
    num_warmup_steps=0, num_training_steps=len(dataloader_train)*epochs)
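
The number of training steps passed to the scheduler follows from the split sizes: 103451 training rows in batches of 3 give 34484 batches per epoch, which matches the progress bar in the training loop. The sketch below reproduces that arithmetic and the shape of a linear schedule with warmup. `linear_lr_factor` is a hypothetical helper written to mirror the documented behaviour (linear ramp-up during warmup, then linear decay to zero); it is not the library function.

```python
import math

train_size = 103451  # training rows, from the split output above
batch_size = 3
epochs = 5

steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs

def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)  # linear warmup
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_lr_factor(step, 0, total_steps))
```

With `num_warmup_steps=0` the learning rate starts at its full value and decays linearly. Stopping training early therefore leaves the model at a higher learning rate than the schedule was planned to end on.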

<h2>Performance Metrics</h2>

In [21]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
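
The weighted F1 score used above is the per-class F1 averaged with weights proportional to each class's number of true examples. A minimal pure-Python sketch of that definition follows, on toy labels. `weighted_f1` is an illustrative helper and is not meant to replace `sklearn.metrics.f1_score`.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score

# Toy example with three classes (hypothetical labels).
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]
print(weighted_f1(y_true, y_pred))
```

Because the weights follow class frequency, the small Biology class influences this score less than it would influence a macro average.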

<h2>Training Loop</h2>

In [22]:
seed_val = 42
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals
    
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch),
                        leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }       

        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
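        # the model already returns the mean loss over the batch, and
        # len(batch) is the number of tensors in the tuple, not the batch size
        progress_bar.set_postfix(
            {'training_loss': '{:.3f}'.format(loss.item())})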
         
        
    torch.save(model.state_dict(),
               f'finetuned_BERT_epoch_{epoch}.model')
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')

  0%|          | 0/5 [00:00<?, ?it/s]

Epoch 1:   0%|          | 0/34484 [00:00<?, ?it/s]


Epoch 1
Training loss: 0.30160854127065667
Validation loss: 0.2410527761893115
F1 Score (Weighted): 0.9494629640028709


Epoch 2:   0%|          | 0/34484 [00:00<?, ?it/s]

KeyboardInterrupt: ignored

In [23]:
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(label_dict),
    output_attentions=False, output_hidden_states=False)

model.to(device)

model.load_state_dict(torch.load('finetuned_BERT_epoch_5.model',
                                 map_location=torch.device('cpu')))

_, predictions, true_vals = evaluate(dataloader_validation)
accuracy_per_class(predictions, true_vals)

Class: Biology
Accuracy: 1867/1964

Class: Chemistry
Accuracy: 5295/5642

Class: Maths
Accuracy: 4880/4931

Class: Physics
Accuracy: 5294/5720
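
The per-class counts above are easier to compare as rates. The sketch below converts them, using only the numbers printed by `accuracy_per_class`; `results`, `per_class`, and `overall` are illustrative names.

```python
# (correct, total) per class, copied from the output above.
results = {
    "Biology": (1867, 1964),
    "Chemistry": (5295, 5642),
    "Maths": (4880, 4931),
    "Physics": (5294, 5720),
}

per_class = {name: correct / n for name, (correct, n) in results.items()}

total_correct = sum(correct for correct, _ in results.values())
total_rows = sum(n for _, n in results.values())
overall = total_correct / total_rows

for name, acc in per_class.items():
    print(f"{name:<10}{acc:.4f}")
print(f"overall   {overall:.4f}")
```

On these counts Maths is recognised most reliably and Physics least reliably.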

