# Text classification using BERT

In this notebook, we use a pre-trained deep learning model to classify text. The text is a collection of sentences extracted from movie reviews, and the goal is to determine whether each sentence conveys a positive or negative sentiment towards its subject.

#### Objective

Our objective is to develop a model that takes a sentence and outputs 1 if the sentence expresses a positive sentiment, or 0 if it expresses a negative sentiment.

The model comprises two components: [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html) and a basic [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model from scikit-learn.

* DistilBERT processes the input sentence and passes on relevant information to the Logistic Regression model for sentiment classification. It is a lighter and faster version of BERT that performs comparably well.

* The data shared between the two models is a vector of size 768. DistilBERT represents each input sentence as a sequence of vectors, one per token, each of size 768. Only the vector at the first position, which corresponds to the special `[CLS]` token, is passed to the Logistic Regression model as the representation of the whole sentence.
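
The hand-off between the two components can be sketched with NumPy. This is an illustration only: `sentence_vector` stands in for the 768-dimensional vector DistilBERT would produce, and the weights `w` and bias `b` are random placeholders rather than trained values.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN_SIZE = 768  # size of each DistilBERT output vector

# Placeholder for the sentence vector DistilBERT would produce (random here).
sentence_vector = rng.normal(size=HIDDEN_SIZE)

# Placeholder logistic regression parameters (random, NOT trained).
w = rng.normal(scale=0.01, size=HIDDEN_SIZE)
b = 0.0


def predict_sentiment(x, w, b):
    """Return (probability of the positive class, predicted label 0 or 1)."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # sigmoid of the linear score
    return p, int(p >= 0.5)


prob, label = predict_sentiment(sentence_vector, w, b)
print(f"P(positive) = {prob:.3f}, predicted label = {label}")
```

In the notebook, scikit-learn learns `w` and `b` from the labeled sentences; the decision rule it applies afterwards has this form.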

#### Dataset - SST2

The SST2 dataset is a widely used benchmark for sentiment analysis and text classification. It is derived from the Stanford Sentiment Treebank, which consists of 11,855 sentences extracted from Rotten Tomatoes movie reviews, 2,210 of which form the standard test split. Each sentence is parsed into a binary parse tree to capture its grammatical structure, and in the binary SST2 task each sentence is labeled as positive or negative. The dataset has been used to evaluate the performance of various natural language processing models, including BERT and its variants. You can find the dataset [here](https://nlp.stanford.edu/sentiment/index.html).

In [3]:
!pip install transformers | grep -v "already"


In [4]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
import torch
import transformers


### Import the dataset

In [15]:
url = 'https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv'
df = pd.read_csv(url, delimiter='\t', header=None, nrows=2500)
df.head()

| | 0 | 1 |
|---|---|---|
| 0 | a stirring , funny and finally transporting re... | 1 |
| 1 | apparently reassembled from the cutting room f... | 0 |
| 2 | they presume their audience wo n't sit still f... | 0 |
| 3 | this is a visually stunning rumination on love... | 1 |
| 4 | jonathan parker 's bartleby should have been t... | 1 |


### Load Pretrained model

In [18]:
# DistilBERT
model_class, tokenizer_class, pretrained_weights = (transformers.DistilBertModel,
                                                   transformers.DistilBertTokenizer,
                                                   'distilbert-base-uncased')
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)


The code above demonstrates how to load a pre-trained DistilBERT model and tokenizer from the Transformers library by Hugging Face, which can be used for various natural language processing tasks.

First, the `model_class`, `tokenizer_class`, and `pretrained_weights` variables are defined to hold the appropriate classes and weights required for the **DistilBERT** model.

The `DistilBertTokenizer` class is used to tokenize raw text data and prepare it for input to the DistilBERT model. The `DistilBertModel` class is the implementation of the DistilBERT model itself. The `pretrained_weights` variable is set to `distilbert-base-uncased`, which indicates the specific pre-trained DistilBERT model to be used.

Next, the `tokenizer` variable is initialized using the `from_pretrained()` method, which loads the pre-trained tokenizer for the specified DistilBERT model. This allows the raw text data to be tokenized and encoded in a way that can be understood by the model.

Finally, the `model` variable is initialized using the `from_pretrained()` method, which loads the DistilBERT architecture with the specified pre-trained weights. The model can then be used as a feature extractor for NLP tasks such as sentiment analysis or text classification.

In [21]:
df[0]

0       a stirring , funny and finally transporting re...
1       apparently reassembled from the cutting room f...
2       they presume their audience wo n't sit still f...
3       this is a visually stunning rumination on love...
4       jonathan parker 's bartleby should have been t...
                              ...                        
2495    allegiance to chekhov , which director michael...
2496    not only a coming of age story and cautionary ...
2497    sparkling , often hilarious romantic jealousy ...
2498     a harrowing account of a psychological breakdown
2499    a mature , deeply felt fantasy of a director '...
Name: 0, Length: 2500, dtype: object

In [22]:
# Tokenize all the reviews in column 0 of the dataframe "df"
tokenized = df[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

In [23]:
tokenized

0       [101, 1037, 18385, 1010, 6057, 1998, 2633, 182...
1       [101, 4593, 2128, 27241, 23931, 2013, 1996, 62...
2       [101, 2027, 3653, 23545, 2037, 4378, 24185, 10...
3       [101, 2023, 2003, 1037, 17453, 14726, 19379, 1...
4       [101, 5655, 6262, 1005, 1055, 12075, 2571, 376...
                              ...                        
2495    [101, 14588, 2000, 18178, 25495, 1010, 2029, 2...
2496    [101, 2025, 2069, 1037, 2746, 1997, 2287, 2466...
2497    [101, 16619, 1010, 2411, 26316, 6298, 14225, 4...
2498    [101, 1037, 24560, 2075, 4070, 1997, 1037, 831...
2499    [101, 1037, 9677, 1010, 6171, 2371, 5913, 1997...
Name: 0, Length: 2500, dtype: object

The code above tokenizes the reviews in column 0 of the DataFrame using the pre-trained DistilBERT tokenizer loaded earlier.

The `tokenizer.encode()` method converts each review into a sequence of integer token IDs that can be fed into the DistilBERT model. Passing `add_special_tokens=True` adds the special tokens **[CLS]** (ID 101) at the beginning and **[SEP]** at the end of each encoded review.

The `apply()` method calls `tokenizer.encode()` on every entry of the column, and the resulting lists of token IDs are stored in a new Pandas Series called `tokenized`.
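
To make the effect of `add_special_tokens=True` concrete, here is a minimal sketch that mimics the encoding step with a hypothetical mini-vocabulary. The word IDs are copied from the notebook output in this section, and 102 is the ID of `[SEP]` in the BERT uncased vocabulary. `toy_vocab` and `toy_encode` are illustrative names, not part of the Transformers API, and the real tokenizer uses WordPiece sub-word splitting rather than a whitespace split.

```python
# Hypothetical mini-vocabulary (a tiny subset of the real one).
toy_vocab = {
    "[CLS]": 101,
    "[SEP]": 102,
    "a": 1037,
    "stirring": 18385,
    ",": 1010,
    "funny": 6057,
    "and": 1998,
}


def toy_encode(text, add_special_tokens=True):
    """Map whitespace-separated words to IDs, optionally wrapping with [CLS]/[SEP]."""
    ids = [toy_vocab[word] for word in text.split()]
    if add_special_tokens:
        ids = [toy_vocab["[CLS]"]] + ids + [toy_vocab["[SEP]"]]
    return ids


print(toy_encode("a stirring , funny"))
# [101, 1037, 18385, 1010, 6057, 102]
```

The first five IDs match the start of the first row of `tokenized` shown above.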

In [24]:
df.iloc[0,0].split(" ")

['a',
 'stirring',
 ',',
 'funny',
 'and',
 'finally',
 'transporting',
 're',
 'imagining',
 'of',
 'beauty',
 'and',
 'the',
 'beast',
 'and',
 '1930s',
 'horror',
 'films']

In [31]:
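def visualized_sentence_embedding(df: pd.DataFrame, tokenized: pd.Series) -> pd.DataFrame:
    """
    Show the tokens of the first review in df next to their token IDs.

    Note: the "Embeddings" column holds vocabulary IDs, not embedding vectors.
    """
    # Whitespace splitting matches the tokenizer only when no word is split
    # into sub-words; the assert below guards that assumption for this review.
    tokens = df.iloc[0,0].split(" ")
    tokens.insert(0, "CLS")
    tokens.append("SEP")
    assert len(tokens) == len(tokenized[0])
    # Pair every token with the ID the tokenizer assigned to it
    token_embeddings = list(zip(tokens, tokenized[0]))
    df_token_embeddings = pd.DataFrame(token_embeddings, columns=["Tokens", "Embeddings"])
    return df_token_embeddings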

In [32]:
df_token_embeddings = visualized_sentence_embedding(df, tokenized)
df_token_embeddings

| | Tokens | Embeddings |
|---|---|---|
| 0 | CLS | 101 |
| 1 | a | 1037 |
| 2 | stirring | 18385 |
| 3 | , | 1010 |
| 4 | funny | 6057 |
| 5 | and | 1998 |
| 6 | finally | 2633 |
| 7 | transporting | 18276 |
| 8 | re | 2128 |
| 9 | imagining | 16603 |


### Padding
Once the reviews are tokenized, they are stored in `tokenized` (a `pd.Series`), where each sentence is represented as a list of token IDs. To process all examples in one batch with DistilBERT, every list must be padded to the same length. The input can then be represented as a single 2-dimensional array rather than a list of variable-length lists, which allows the whole batch to be processed in one forward pass.

In [42]:
max_len = 0
max_len = max([len(i) for i in tokenized.values])
padded_token_embeddings = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])
print(padded_token_embeddings.shape)

(2500, 65)


The above code performs the following steps:

1. Initializes `max_len` to zero. This line is redundant, since the next line overwrites the value.
2. Computes the length of every tokenized review with a list comprehension and assigns the largest one to `max_len`.
3. Pads each tokenized review with zeros until its length equals `max_len`, again using a list comprehension, and converts the resulting list of lists to a NumPy array. Despite its name, `padded_token_embeddings` holds padded token IDs, not embeddings.

The result is a single array of shape (2500, 65) in which every row has the same length, as required for feeding the reviews to the model in one batch.
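
The same padding logic is easier to follow on a toy input. The sketch below uses three short, made-up ID lists (`toy_tokenized` is a hypothetical stand-in for `tokenized.values`) and pads with 0, which is the ID of `[PAD]` in the BERT uncased vocabulary.

```python
import numpy as np

# Toy token-ID lists of different lengths (values are illustrative).
toy_tokenized = [
    [101, 1037, 18385, 102],
    [101, 6057, 102],
    [101, 1037, 1010, 6057, 1998, 102],
]

# Length of the longest sequence in the batch.
max_len = max(len(ids) for ids in toy_tokenized)

# Right-pad every list with zeros so that all rows have max_len entries.
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in toy_tokenized])

print(padded.shape)  # (3, 6)
print(padded)
```

Without padding, the rows have different lengths and cannot be stacked into a regular 2-dimensional array.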

### Masking

To keep the padding from confusing DistilBERT, we create a separate variable called `attention_mask`. It indicates which tokens the model should attend to and which it should ignore (mask) during processing. By setting the mask to 1 for real tokens and 0 for padding tokens, we tell the model to ignore the padding. Without the mask, the padding tokens would take part in self-attention and distort the sentence representations.

In [47]:
attention_mask = np.where(padded_token_embeddings != 0, 1, 0)
assert attention_mask.shape == padded_token_embeddings.shape
print(attention_mask[:2], attention_mask.shape)

[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]] (2500, 65)


### Model inputs

We are now ready to run the pre-trained **DistilBERT** model that we loaded earlier. No training happens at this stage: the model is used only to extract features. First, we convert the padded token IDs and the attention mask into PyTorch tensors.

We can then pass the `input_ids` and `attention_mask` tensors to the DistilBERT model by calling `model()`. The call is wrapped in `torch.no_grad()` because no gradients are needed for inference. The output, `last_hidden_states`, contains the contextualized embeddings for each token in our input sentences.

In [49]:
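# Token IDs must be an integer (long) tensor; the mask becomes a float tensor
input_ids = torch.LongTensor(padded_token_embeddings)
attention_mask = torch.Tensor(attention_mask)

# Disable gradient tracking: the model is used for inference only
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)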

In [52]:
# extract the [CLS] vector of every sentence as its feature vector
features = last_hidden_states[0][:,0,:].numpy()

In [53]:
features

array([[-0.21593435, -0.14028911,  0.00831076, ..., -0.13694832,
         0.5867005 ,  0.20112693],
       [-0.17262718, -0.14476153,  0.00223438, ..., -0.1744257 ,
         0.21386446,  0.37197465],
       [-0.05063373,  0.07203954, -0.02959727, ..., -0.0714895 ,
         0.7185238 ,  0.2622547 ],
       ...,
       [-0.03103156,  0.06106165, -0.0742958 , ..., -0.12653553,
         0.55033255,  0.45576078],
       [-0.37851074, -0.04516774, -0.18900727, ..., -0.14968894,
         0.2870552 ,  0.2913559 ],
       [-0.27609336, -0.02547334, -0.1110792 , ..., -0.26346034,
         0.5565323 ,  0.42118883]], dtype=float32)

**Explanation for feature extraction from `last_hidden_states`:**

We have a batch of 2500 input sentences, each tokenized and padded to a length of 65, so the padded array has shape (2500, 65).

We pass this padded array to DistilBERT by calling `model()`. The first element of the returned output, `last_hidden_states[0]`, is a tensor of shape (2500, 65, 768). Here, 2500 is the batch size, 65 is the length of the padded sentence, and 768 is the size of the DistilBERT output vector for each token.

To get a fixed-length representation of each sentence, we take the output vector at the first position of each sentence, which corresponds to the `[CLS]` token. It is located at index 0 in the second dimension of `last_hidden_states[0]`.

To get these vectors for every sentence in the batch, we use the slicing operation `[:,0,:]`. This selects all elements along the first dimension (the batch), the first element along the second dimension (the `[CLS]` position), and all elements along the third dimension (the embedding size). The result is a tensor of shape (2500, 768), where each row is the embedding of a single sentence.

Finally, we convert this tensor to a NumPy array using `.numpy()`, which gives us the 2D array `features` of shape (2500, 768), where each row is the fixed-length representation of a sentence.
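
The slicing step can be checked on a small random array. The sketch below uses NumPy instead of PyTorch and toy sizes (4, 6, 8) in place of (2500, 65, 768); `toy_hidden` is a hypothetical stand-in for `last_hidden_states[0]`.

```python
import numpy as np

# Toy sizes; in the notebook these are 2500, 65 and 768.
batch_size, seq_len, hidden_size = 4, 6, 8

rng = np.random.default_rng(0)
toy_hidden = rng.normal(size=(batch_size, seq_len, hidden_size))

# Keep every sentence, only position 0 ([CLS]), and every hidden dimension.
cls_vectors = toy_hidden[:, 0, :]

print(toy_hidden.shape)   # (4, 6, 8)
print(cls_vectors.shape)  # (4, 8)

# Row i of the result is exactly the first-position vector of sentence i.
print(np.array_equal(cls_vectors[2], toy_hidden[2, 0]))  # True
```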

In [56]:
labels = df[1]
assert len(features) == len(labels)
labels

0       1
1       0
2       0
3       1
4       1
       ..
2495    0
2496    1
2497    1
2498    1
2499    1
Name: 1, Length: 2500, dtype: int64

### Split data into training and testing sets

In [57]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
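
The call above relies on scikit-learn's defaults: 25% of the rows are held out for testing, and the shuffle is not seeded, so the split (and the final score) can differ from run to run. A reproducible variant might look like the sketch below. The synthetic `toy_features` and `toy_labels`, the seed value 42, and the use of `stratify` are assumptions for illustration, not choices made in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for `features` and `labels` (100 rows, 8 dimensions).
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(100, 8))
toy_labels = np.array([0, 1] * 50)

# Fixed seed for reproducibility; stratify keeps the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    toy_features,
    toy_labels,
    test_size=0.25,
    random_state=42,
    stratify=toy_labels,
)

print(X_train.shape, X_test.shape)  # (75, 8) (25, 8)
```

With the notebook's 2500 rows, the default 25% split corresponds to 1875 training and 625 test sentences.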

### Logistic Regression

In [60]:
lr_clf = LogisticRegression(C=5, max_iter=1000)
lr_clf.fit(train_features, train_labels)

In [62]:
# see how our trained LR model performs on the test set
lr_clf.score(test_features, test_labels)

0.848