<a href="https://colab.research.google.com/github/EnesGokceDS/Score_Prediction/blob/main/Best_egg_NLP_Task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!git clone https://github.com/MarletteFunding/marlette-ds-challenge2

Cloning into 'marlette-ds-challenge2'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (3/3), done.
remote: Total 7 (delta 0), reused 3 (delta 0), pack-reused 4
Unpacking objects: 100% (7/7), done.


In [2]:
import pandas as pd
df_train = pd.read_csv('marlette-ds-challenge2/NLP_task_train.csv.zip',header=0,index_col=0,compression='infer')
df_validate = pd.read_csv('marlette-ds-challenge2/NLP_task_validate.csv.zip',header=0,index_col=0,compression='infer')

In [3]:
df_train.head()

Unnamed: 0,DOCUMENT_ID,SENTENCE_ID,SENTENCE,SENTENCE_START_POS,SENTENCE_END_POS,SCORE
0,583306034,1888104,The whole process went smooth and I am thankfu...,109,165,10.0
1,583306034,1888102,It was quick and easy to apply and got the app...,0,60,10.0
2,584193040,1909902,service,0,7,10.0
3,584203035,1910310,Thanks!,198,205,10.0
4,584200037,1910006,Thank you!,58,68,10.0


In [4]:
df_train.SCORE.value_counts()

10.0    43843
9.0      6214
8.0      3632
7.0      1297
5.0       516
6.0       434
0.0       227
4.0       192
3.0       127
1.0        83
2.0        79
Name: SCORE, dtype: int64

In [5]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56644 entries, 0 to 70662
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   DOCUMENT_ID         56644 non-null  int64  
 1   SENTENCE_ID         56644 non-null  int64  
 2   SENTENCE            56623 non-null  object 
 3   SENTENCE_START_POS  56644 non-null  int64  
 4   SENTENCE_END_POS    56644 non-null  int64  
 5   SCORE               56644 non-null  float64
dtypes: float64(1), int64(4), object(1)
memory usage: 3.0+ MB


In [6]:
df_validate.head()

Unnamed: 0,DOCUMENT_ID,SENTENCE_ID,SENTENCE,SENTENCE_START_POS,SENTENCE_END_POS,SCORE
7,584207033,1910601,The loan process was super easy,0,31,10.0
10,584199033,1909801,The process was quick and easy!,0,31,9.0
13,591377035,2116637,were in my bank within just a few days from th...,178,252,10.0
14,591377035,2116635,"I do not know what you mean by ""my score"", but...",0,131,10.0
25,34211545,359794,Very expensive dental work and debt consolidat...,0,50,10.0


In [7]:
df_validate.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14019 entries, 7 to 70658
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   DOCUMENT_ID         14019 non-null  int64  
 1   SENTENCE_ID         14019 non-null  int64  
 2   SENTENCE            14015 non-null  object 
 3   SENTENCE_START_POS  14019 non-null  int64  
 4   SENTENCE_END_POS    14019 non-null  int64  
 5   SCORE               14018 non-null  float64
dtypes: float64(1), int64(4), object(1)
memory usage: 766.7+ KB


# **Text Cleaning Processes**

We will apply some basic text cleaning that is compatible with BERT. With BERT, we should avoid some cleaning steps that are common in classical (non-transformer) models, such as stemming, lemmatization, and expanding contractions.

**1) Make all text lower-case characters**

Both cased and uncased models are reasonable choices. According to the official BERT GitHub guide, keeping upper-case characters does not improve performance in text classification, so all text can be lower-cased. In tasks like NER, upper-case characters can be important and need to be kept.

In [8]:
df_train.SENTENCE = df_train.SENTENCE.astype('str') # some sentences aren't read as str; note that this also turns missing values (NaN) into the string 'nan'
df_train['clean_text'] = df_train.SENTENCE.str.lower()

**2) Remove punctuation**

According to some studies, keeping punctuation does not improve BERT's performance. The reason is that in many text datasets punctuation does not change the meaning drastically enough to affect how BERT reads the text. If a dataset comes along in which punctuation has a critical impact on meaning, it can be kept.

Removing punctuation will decrease the token length. Keep in mind that BERT can read at most 512 tokens. A shorter token length also makes computation cheaper. That's why removing punctuation is a justifiable step.

> Based on a similar justification, we don't expand contractions. The official Google Research GitHub page also makes this point and suggests not expanding contractions.

In [9]:
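# Replace runs of punctuation (and any spaces that follow) with a single space
import re
# the hyphen goes last in the class; written as "$-+" it would define a range that also matches % ' ( ) *
df_train['clean_text'] = df_train['clean_text'].apply(lambda x: re.sub(r"[,.;@#?!&$+-]+\ *", " ", x))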
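
Hyphen placement inside a regex character class matters for this cleaning step: between two characters a hyphen defines a range, so `$-+` matches every character from `$` to `+`, which includes the apostrophe used in contractions. The following is a minimal sketch using only the standard `re` module; the sample sentence and the two pattern names are made up for illustration.

```python
import re

# "$-+" in the middle of the class is a range covering $ % & ' ( ) * +
RANGE_PATTERN = r"[,.;@#?!&$-+]+\ *"
# with the hyphen last, it is a literal "-" and the apostrophe is untouched
LITERAL_PATTERN = r"[,.;@#?!&$+-]+\ *"

sample = "i don't know, but it was easy!"  # made-up sentence

with_range = re.sub(RANGE_PATTERN, " ", sample)
with_literal = re.sub(LITERAL_PATTERN, " ", sample)

print(repr(with_range))    # the apostrophe is replaced by a space
print(repr(with_literal))  # the contraction survives
```

Keeping the apostrophe is in line with the decision above not to expand or otherwise alter contractions.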

#### **Check if there is any null sentence in train and validation datasets**

In [10]:
df_train.SENTENCE.isnull().value_counts()

False    56644
Name: SENTENCE, dtype: int64

In [11]:
df_train.SCORE.isnull().value_counts()

False    56644
Name: SCORE, dtype: int64

In [12]:
df_validate.SENTENCE.isnull().value_counts()

False    14015
True         4
Name: SENTENCE, dtype: int64

In [13]:
df_validate.SCORE.isnull().value_counts()

False    14018
True         1
Name: SCORE, dtype: int64

In [14]:
# We can remove the null SENTENCES and SCORE from validation dataset
df_validate = df_validate.dropna(subset=['SCORE', 'SENTENCE'])

**Train The BERT MODEL**


In [15]:
pip install transformers

Collecting transformers
  Downloading transformers-4.14.1-py3-none-any.whl (3.4 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers

In [16]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.metrics import classification_report
import transformers
from transformers import AutoModel, BertTokenizerFast

# use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [17]:
bert = AutoModel.from_pretrained('bert-base-uncased')

# Load the BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [None]:
# DistilBert model can also be tried with a few changes for further testing

# from transformers import DistilBertTokenizer, DistilBertModel
# bert = DistilBertModel.from_pretrained('distilbert-base-uncased')
# tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

In [18]:
train_labels = df_train.SCORE.astype('int')
train_text = df_train.SENTENCE  # note: the raw SENTENCE column is used here, not the clean_text column built above

y_validation = df_validate.SCORE.astype('int')
X_validation = df_validate.SENTENCE

In [19]:
# split the validation dataset into validation and test halves (stratified by score)
from sklearn.model_selection import train_test_split


X_validation, test_text, y_validation, test_labels = train_test_split(X_validation, y_validation, 
                                                                random_state=2, 
                                                                test_size=0.5, 
                                                                stratify=y_validation)
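
To illustrate what a stratified 50/50 split is meant to preserve, here is a minimal pure-Python sketch. It is not how `train_test_split` is implemented; the function name `stratified_half_split` and the toy label counts are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_half_split(labels, seed=2):
    """Split indices into two halves while keeping each label's share similar."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    first, second = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)           # shuffle within each label group
        half = len(indices) // 2
        first.extend(indices[:half])   # one half of this label goes to each side
        second.extend(indices[half:])
    return first, second

# toy labels with the same kind of imbalance as the SCORE column (made-up counts)
toy_labels = [10] * 80 + [9] * 12 + [0] * 8
val_idx, test_idx = stratified_half_split(toy_labels)

print(Counter(toy_labels[i] for i in val_idx))
print(Counter(toy_labels[i] for i in test_idx))
```

Both halves end up with the same label proportions, which matters here because the rare low scores would otherwise be easy to lose from one of the two sets.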

**Fine-Tuning BERT for Text Classification**

In [20]:
# For faster computing, max_length can be decreased, but I believe a P100 can handle this, so I am keeping it at 32.
# padding='max_length' replaces the deprecated pad_to_max_length=True

# tokenize and encode sequences in the training set
tokens_train = tokenizer.batch_encode_plus(
    train_text.tolist(),
    max_length = 32,
    padding='max_length',
    truncation=True
)

# tokenize and encode sequences in the validation set
tokens_val = tokenizer.batch_encode_plus(
    X_validation.tolist(),
    max_length = 32,
    padding='max_length',
    truncation=True
)

# tokenize and encode sequences in the test set
tokens_test = tokenizer.batch_encode_plus(
    test_text.tolist(),
    max_length = 32,
    padding='max_length',
    truncation=True
)
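
As a rough illustration of what padding and truncating to a fixed `max_length` produce, the sketch below builds `input_ids` and an `attention_mask` by hand. It is a simplification and not the tokenizer's implementation: the helper `pad_or_truncate` is hypothetical, the token ids are made up, and the handling of special tokens during truncation is ignored.

```python
def pad_or_truncate(token_ids, max_length=32, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# hypothetical token ids, not real tokenizer output
ids, mask = pad_or_truncate([101, 2000, 2001, 102], max_length=6)
print(ids)   # [101, 2000, 2001, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Because every row has the same length, the lists can be stacked directly into tensors, and the mask tells the model which positions to ignore.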



In [21]:
# Next, we will convert the integer sequences to tensors.

## convert lists to tensors

train_seq = torch.tensor(tokens_train['input_ids'])
train_mask = torch.tensor(tokens_train['attention_mask'])
train_y = torch.tensor(train_labels.tolist())

val_seq = torch.tensor(tokens_val['input_ids'])
val_mask = torch.tensor(tokens_val['attention_mask'])
val_y = torch.tensor(y_validation.tolist())

test_seq = torch.tensor(tokens_test['input_ids'])
test_mask = torch.tensor(tokens_test['attention_mask'])
test_y = torch.tensor(test_labels.tolist())

Now we will create dataloaders for both the train and validation sets. These dataloaders will pass batches of train and validation data as input to the model during the training phase.

In [22]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

#define a batch size
batch_size = 4 # with better computers and higher RAM, this can be increased to 32 or 64

# wrap tensors
train_data = TensorDataset(train_seq, train_mask, train_y)

# sampler for sampling the data during training
train_sampler = RandomSampler(train_data)

# dataLoader for train set
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# wrap tensors
val_data = TensorDataset(val_seq, val_mask, val_y)

# sampler for iterating over the validation data in order
val_sampler = SequentialSampler(val_data)

# dataLoader for validation set
val_dataloader = DataLoader(val_data, sampler = val_sampler, batch_size=batch_size)
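
The batch size directly sets the number of optimizer steps per epoch. A small sketch of that arithmetic, assuming the last partial batch is kept (the `DataLoader` default); the helper name `batches_per_epoch` is made up for this example:

```python
import math

def batches_per_epoch(n_samples, batch_size):
    # the last batch may be smaller, hence the ceiling
    return math.ceil(n_samples / batch_size)

n_train = 56644  # rows reported by df_train.info()
for size in (4, 32, 64):
    print(size, batches_per_epoch(n_train, size))
```

With `batch_size = 4` this gives 14,161 batches per epoch, which is the figure shown in the training log; 32 or 64 would cut that to 1,771 or 886 steps.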

# **Define Model Architecture**

In [23]:
# define the model architecture
class BERT_Arch(nn.Module):

    def __init__(self, bert):
      
      super(BERT_Arch, self).__init__()

      self.bert = bert 
      
      # dropout layer
      self.dropout = nn.Dropout(0.1)
      
      # relu activation function
      self.relu =  nn.ReLU()

      # dense layer 1
      self.fc1 = nn.Linear(768,512)
      
      # dense layer 2 (Output layer)
      self.fc2 = nn.Linear(512,11)

      # log-softmax activation function (pairs with nn.NLLLoss)
      self.softmax = nn.LogSoftmax(dim=1)

    #define the forward pass
    def forward(self, sent_id, mask):

      #pass the inputs to the model  
      _, cls_hs = self.bert(sent_id, attention_mask=mask,  return_dict=False)  
      
      x = self.fc1(cls_hs)

      x = self.relu(x)

      x = self.dropout(x)

      # output layer
      x = self.fc2(x)
      
      # apply log-softmax activation
      x = self.softmax(x)

      return x

In [24]:
# pass the pre-trained BERT to our defined architecture
model = BERT_Arch(bert)

# push the model to GPU
model = model.to(device)

In [25]:
# optimizer from hugging face transformers
from transformers import AdamW

# define the optimizer
optimizer = AdamW(model.parameters(),
                  lr = 1e-4)         # learning rate

There is a class imbalance in our dataset. The majority of the SCOREs are 9 and 10. So, we will first compute class weights for the labels in the train set and then pass these weights to the loss function so that it takes care of the class imbalance.

In [26]:
from sklearn.utils.class_weight import compute_class_weight

#compute the class weights
class_weights = compute_class_weight(
                                        class_weight = "balanced",
                                        classes = np.unique(train_labels),
                                        y = train_labels                                                    
                                    )
class_weights_dic = dict(zip(np.unique(train_labels), class_weights))
class_weights_dic

{0: 22.68482178614337,
 1: 62.04162102957284,
 2: 65.18296892980437,
 3: 40.54688618468146,
 4: 26.820075757575758,
 5: 9.979563072586329,
 6: 11.865102639296188,
 7: 3.9702810681993412,
 8: 1.4178013616339606,
 9: 0.8286859583930714,
 10: 0.11745214847192358}

In [27]:
class_weights

array([22.68482179, 62.04162103, 65.18296893, 40.54688618, 26.82007576,
        9.97956307, 11.86510264,  3.97028107,  1.41780136,  0.82868596,
        0.11745215])
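
The printed weights follow the "balanced" heuristic, `n_samples / (n_classes * count)`. As a quick sanity check, the sketch below recomputes them in plain Python from the SCORE counts shown earlier; the variable names are chosen for this example only.

```python
# SCORE counts copied from df_train.SCORE.value_counts() above
counts = {10: 43843, 9: 6214, 8: 3632, 7: 1297, 5: 516, 6: 434,
          0: 227, 4: 192, 3: 127, 1: 83, 2: 79}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count of that class)
balanced = {label: n_samples / (n_classes * counts[label]) for label in sorted(counts)}

for label, weight in balanced.items():
    print(label, round(weight, 4))
```

The rarest class (SCORE 2, 79 sentences) gets a weight of about 65, roughly 555 times the weight of SCORE 10, which shows how strongly the loss is rescaled.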

In [28]:
# converting list of class weights to a tensor
weights= torch.tensor(class_weights,dtype=torch.float)

# push to GPU
weights = weights.to(device)

# define the loss function
cross_entropy  = nn.NLLLoss(weight=weights) 

# number of training epochs
epochs = 5
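
To make the loss concrete, here is a small pure-Python sketch of log-softmax followed by a class-weighted negative log-likelihood, using the weighted-mean reduction (sum of weighted losses divided by the sum of the weights of the target classes). The logits, targets, and weights are made-up toy values, and the function names are hypothetical.

```python
import math

def log_softmax(logits):
    # subtract the max for numerical stability
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def weighted_nll(log_probs_batch, targets, class_weights):
    # weighted mean: sum(w[y] * -log p[y]) / sum(w[y])
    num = sum(-class_weights[y] * lp[y] for lp, y in zip(log_probs_batch, targets))
    den = sum(class_weights[y] for y in targets)
    return num / den

# toy 3-class example with made-up logits and weights
logits_batch = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]
targets = [1, 2]
toy_weights = [1.0, 5.0, 0.2]

log_probs = [log_softmax(row) for row in logits_batch]
print(weighted_nll(log_probs, targets, toy_weights))
```

With a large weight on class 1, the batch loss is dominated by the sample whose target is class 1, which is the intended effect for the rare scores.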

# **Fine-Tune BERT**

So far we have defined the model architecture, specified the optimizer and the loss function, and prepared the dataloaders. Now we define two functions to train (fine-tune) and evaluate the model, respectively.

In [29]:
# function to train the model
def train():
  
  model.train()

  total_loss, total_accuracy = 0, 0
  
  # empty list to save model predictions
  total_preds=[]
  
  # iterate over batches
  for step,batch in enumerate(train_dataloader):
    
    # progress update after every 50 batches.
    if step % 50 == 0 and not step == 0:
      print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(train_dataloader)))

    # push the batch to gpu
    batch = [r.to(device) for r in batch]
 
    sent_id, mask, labels = batch

    # clear previously calculated gradients 
    model.zero_grad()        

    # get model predictions for the current batch
    preds = model(sent_id, mask)

    # compute the loss between actual and predicted values
    loss = cross_entropy(preds, labels)

    # add on to the total loss
    total_loss = total_loss + loss.item()

    # backward pass to calculate the gradients
    loss.backward()

    # clip the gradients to a norm of 1.0. It helps prevent the exploding gradient problem
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    # update parameters
    optimizer.step()

    # model predictions are stored on GPU. So, push it to CPU
    preds=preds.detach().cpu().numpy()

    # append the model predictions
    total_preds.append(preds)

  # compute the training loss of the epoch
  avg_loss = total_loss / len(train_dataloader)
  
  # predictions are in the form of (no. of batches, size of batch, no. of classes).
  # reshape the predictions in form of (number of samples, no. of classes)
  total_preds  = np.concatenate(total_preds, axis=0)

  #returns the loss and predictions
  return avg_loss, total_preds
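
The gradient clipping step inside `train()` rescales all gradients by one common factor when their joint L2 norm exceeds the limit. A minimal pure-Python sketch of that idea follows; it works on a flat list of made-up numbers rather than on model parameters, and the helper name `clip_by_global_norm` is an assumption for this example.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# made-up gradient values
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)  # 5.0
print(clipped)      # roughly [0.6, 0.8]
```

Gradients whose norm is already below the limit pass through unchanged, so clipping only intervenes on unusually large updates.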

##### We will use the following function to evaluate the model. It will use the validation set data.

In [30]:
from babel.dates import format_time
from datetime import date, datetime, time

In [31]:
# function for evaluating the model
def evaluate():
  
  print("\nEvaluating...")
  
  # deactivate dropout layers
  model.eval()

  total_loss, total_accuracy = 0, 0
  
  # empty list to save the model predictions
  total_preds = []

  # iterate over batches
  for step,batch in enumerate(val_dataloader):
    
    # Progress update every 50 batches.
    # if step % 50 == 0 and not step == 0:
      
    #   # Calculate elapsed time in minutes.
    #   elapsed = format_time(time.time() - t0)
            
    #   # Report progress.
    #   print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(val_dataloader)))

    # push the batch to gpu
    batch = [t.to(device) for t in batch]

    sent_id, mask, labels = batch

    # deactivate autograd
    with torch.no_grad():
      
      # model predictions
      preds = model(sent_id, mask)

      # compute the validation loss between actual and predicted values
      loss = cross_entropy(preds,labels)

      total_loss = total_loss + loss.item()

      preds = preds.detach().cpu().numpy()

      total_preds.append(preds)

  # compute the validation loss of the epoch
  avg_loss = total_loss / len(val_dataloader) 

  # reshape the predictions in form of (number of samples, no. of classes)
  total_preds  = np.concatenate(total_preds, axis=0)

  return avg_loss, total_preds

#### **Now we will finally start fine-tuning of the model.**

In [32]:
# set initial loss to infinite
best_valid_loss = float('inf')

# empty lists to store training and validation loss of each epoch
train_losses=[]
valid_losses=[]

#for each epoch
for epoch in range(epochs):
     
    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))
    
    #train model
    train_loss, _ = train()
    
    #evaluate model
    valid_loss, _ = evaluate()
    
    #save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
    
    # append training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)
    
    print(f'\nTraining Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')


 Epoch 1 / 5
  Batch    50  of  14,161.
  Batch   100  of  14,161.
  Batch   150  of  14,161.
  Batch   200  of  14,161.
  Batch   250  of  14,161.
  Batch   300  of  14,161.
  Batch   350  of  14,161.
  Batch   400  of  14,161.
  Batch   450  of  14,161.
  Batch   500  of  14,161.
  Batch   550  of  14,161.
  Batch   600  of  14,161.
  Batch   650  of  14,161.
  Batch   700  of  14,161.
  Batch   750  of  14,161.
  Batch   800  of  14,161.
  Batch   850  of  14,161.
  Batch   900  of  14,161.
  Batch   950  of  14,161.
  Batch 1,000  of  14,161.
  Batch 1,050  of  14,161.
  Batch 1,100  of  14,161.
  Batch 1,150  of  14,161.
  Batch 1,200  of  14,161.
  Batch 1,250  of  14,161.
  Batch 1,300  of  14,161.
  Batch 1,350  of  14,161.
  Batch 1,400  of  14,161.
  Batch 1,450  of  14,161.
  Batch 1,500  of  14,161.
  Batch 1,550  of  14,161.
  Batch 1,600  of  14,161.
  Batch 1,650  of  14,161.
  Batch 1,700  of  14,161.
  Batch 1,750  of  14,161.
  Batch 1,800  of  14,161.
  Batch 1,850 

In [33]:
#load weights of best model
path = 'saved_weights.pt'
model.load_state_dict(torch.load(path))

<All keys matched successfully>

In [37]:
import torch
torch.cuda.empty_cache()

In [38]:
# get predictions for test data
with torch.no_grad():
  preds = model(test_seq.to(device), test_mask.to(device))
  preds = preds.detach().cpu().numpy()

In [39]:
preds = np.argmax(preds, axis = 1)
print(classification_report(test_y, preds))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        18
           1       0.00      0.00      0.00         9
           2       0.00      0.00      0.00        10
           3       0.00      0.00      0.00        13
           4       0.00      0.00      0.00        23
           5       0.00      0.00      0.00        55
           6       0.00      0.00      0.00        48
           7       0.00      0.00      0.00       166
           8       0.00      0.00      0.00       452
           9       0.00      0.00      0.00       781
          10       0.78      1.00      0.87      5432

    accuracy                           0.78      7007
   macro avg       0.07      0.09      0.08      7007
weighted avg       0.60      0.78      0.68      7007





Explanation for metrics:

> **Accuracy**: The fraction of samples whose predicted label exactly matches the true label. However, our data is unbalanced, so accuracy alone is not enough in this case.

> **Precision**: Intuitively, the ability of the classifier not to label as positive a sample that is negative. Precision is the estimated probability that a randomly selected retrieved document is relevant. For unbalanced data, precision helps us understand how successful the model actually is.

> **Recall**: Intuitively, the ability of the classifier to find all the positive samples. Recall is the estimated probability that a randomly selected relevant document is retrieved in a search. For this study, in addition to accuracy and precision, recall is another meaningful metric.

> **F1 Score**: The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. F1 provides an additional perspective for reading precision and recall.
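
The averaged figures in the report can be reproduced by hand. The sketch below assumes, as the per-class rows suggest, that every test sentence was predicted as 10, and recomputes accuracy, macro F1, and weighted F1 from the support column in plain Python; `safe_div` is a helper introduced for this example.

```python
# per-class support copied from the classification report above
support = {0: 18, 1: 9, 2: 10, 3: 13, 4: 23, 5: 55, 6: 48,
           7: 166, 8: 452, 9: 781, 10: 5432}
total = sum(support.values())

# assumption: every sentence is predicted as 10
predicted = {label: 0 for label in support}
predicted[10] = total
true_pos = {label: 0 for label in support}
true_pos[10] = support[10]

def safe_div(a, b):
    # treat 0/0 as 0, as the report does for classes that are never predicted
    return a / b if b else 0.0

f1 = {}
for label in support:
    p = safe_div(true_pos[label], predicted[label])   # precision
    r = safe_div(true_pos[label], support[label])     # recall
    f1[label] = safe_div(2 * p * r, p + r)            # harmonic mean

accuracy = sum(true_pos.values()) / total
macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[label] * support[label] for label in support) / total

print(round(accuracy, 2), round(macro_f1, 2), round(weighted_f1, 2))  # 0.78 0.08 0.68
```

The gap between the weighted F1 (0.68) and the macro F1 (0.08) shows how much the weighted average is carried by the single majority class.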


# **Interpretation**

* Basically, this model labels every SENTENCE as SCORE 10. This is a pretty bad prediction. (We don't even need an AI to label every SENTENCE as 10.)

* We can see why the model has such a bias toward labeling SENTENCEs as SCORE 10: in total, 78% of the SENTENCEs have SCORE 10.
* The weighted-average F1 score is 68%. The weighted F1 score calculates the F1 score for each class independently and, when averaging them, weights each class by its number of true labels. This reflects the overall performance of the model, which is very weak.
* Currently, I don't have a strong explanation for this undesirable result. But, within the scope of the take-home assignment, I will stop at this point.
* If we desperately want better model performance quickly, we can try a Random Forest classifier, Multinomial Naive Bayes, and a few other multi-class classification algorithms. However, they are all classical (non-transformer) models. I would never suggest using them in production. I strongly advise spending time on building a better transformer-based language model.

# **Ideas for further improvement**

* This prediction model can be trained with stronger/heavier Transformer models such as RoBERTa, T5-XL, etc. We can expect performance improvement with different models.

* I set ***batch_size = 4***. This can be raised to 32 or higher to run the training faster if more RAM is available.
* Training the model with better-tuned hyperparameters has the potential to give better performance.

* I wonder how the prediction performance would be if we used a Zero-Shot Classification approach. It is worth a try if we have enough time.

* Of course, with larger training and validation datasets, the model can be trained better and its performance can be improved.

* As a data scientist, I can question the validity of the data and try to understand potential biases in the dataset. Is there any threat to validity in the data gathering process? This point is part of a separate discussion but can still affect the prediction performance.
* All these suggestions need to be evaluated against business needs, upcoming deadlines, and stakeholders' priorities.