
---
#### Text classification using BERT

In this notebook, we will utilize a pre-trained deep learning model to analyze some text. The model's output will be used to categorize the text, which is a collection of sentences extracted from movie reviews. Our goal is to determine whether each sentence conveys a positive or negative sentiment towards the subject.


---

#### Objective

Our objective is to build a model that takes a sentence and outputs 1 if it expresses a positive sentiment and 0 if it expresses a negative sentiment.

The model comprises two components: [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html) and a basic [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model from scikit-learn.

* DistilBERT processes the input sentence and passes on relevant information to the Logistic Regression model for sentiment classification. It is a lighter and faster version of BERT that performs comparably well.

* The data passed between the two models is a single vector of size 768 per sentence. DistilBERT produces one 768-dimensional vector for every token in the input sentence; we keep only the vector at the first position (the `[CLS]` token) and feed it to the Logistic Regression model as the sentence's feature vector.

#### Dataset - SST2

The SST2 dataset is a widely used benchmark for sentiment analysis and text classification. It consists of sentences from Rotten Tomatoes movie reviews, each labeled as positive or negative. The underlying Stanford Sentiment Treebank contains 11,855 sentences in total, 2,210 of which form its test split, and each sentence is parsed into a binary tree to capture its grammatical structure. The dataset has been used to evaluate many natural language processing models, including BERT and its variants. This notebook loads only the first 2,500 rows of the training file. You can find the dataset [here](https://nlp.stanford.edu/sentiment/index.html).

In [15]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
import torch
import transformers

### Import the dataset

In [16]:
url = 'https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv'
df = pd.read_csv(url, delimiter='\t', header=None, nrows=2500)

df.head()

| | 0 | 1 |
|---|---|---|
| 0 | a stirring , funny and finally transporting re... | 1 |
| 1 | apparently reassembled from the cutting room f... | 0 |
| 2 | they presume their audience wo n't sit still f... | 0 |
| 3 | this is a visually stunning rumination on love... | 1 |
| 4 | jonathan parker 's bartleby should have been t... | 1 |



---
### Load Pretrained model


The code below demonstrates how to load a pre-trained DistilBERT model and tokenizer from the Transformers library by Hugging Face, which can be used for various natural language processing tasks.

First, the `model_class`, `tokenizer_class`, and `pretrained_weights` variables are defined to hold the appropriate classes and weights required for the **DistilBERT** model.

The `DistilBertTokenizer` class is used to tokenize raw text data and prepare it for input to the DistilBERT model. The `DistilBertModel` class is the implementation of the DistilBERT model itself. The `pretrained_weights` variable is set to `distilbert-base-uncased`, which indicates the specific pre-trained DistilBERT model to be used.

Next, the `tokenizer` variable is initialized using the `from_pretrained()` method, which loads the pre-trained tokenizer for the specified DistilBERT model. This allows the raw text data to be tokenized and encoded in a way that can be understood by the model.

Finally, the `model` variable is initialized using the `from_pretrained()` method, which loads the pre-trained DistilBERT model with the specified weights. The loaded model can then be used for NLP tasks such as sentiment analysis or text classification.

In [18]:
"""
Purpose:
    Initializes pre-trained DistilBERT model and tokenizer components for text classification
    tasks. Sets up the foundation for extracting contextualized embeddings from text input
    using the lightweight DistilBERT architecture for downstream sentiment analysis.

Parameters:
    None - Uses predefined DistilBERT model configuration and weights

Process Flow:
    1. Define model class (DistilBertModel), tokenizer class (DistilBertTokenizer), and model name
    2. Load pre-trained tokenizer from HuggingFace Hub using specified weights
    3. Load pre-trained DistilBERT model with 'distilbert-base-uncased' configuration
    4. Store initialized components for text processing and feature extraction pipeline

Outputs:
    tokenizer: DistilBertTokenizer instance for text preprocessing and encoding
    model: DistilBertModel instance for generating contextualized embeddings (768-dim vectors)
    Both components ready for inference-mode feature extraction

Example:
    >>> tokenizer = tokenizer_class.from_pretrained('distilbert-base-uncased')
    >>> model = model_class.from_pretrained('distilbert-base-uncased')
    >>> tokens = tokenizer.encode("This movie is great!", add_special_tokens=True)
    >>> embeddings = model(torch.tensor([tokens]))
"""

# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (transformers.DistilBertModel,
                                                    transformers.DistilBertTokenizer,
                                                    'distilbert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)



---

The code below tokenizes the review column of the DataFrame with the DistilBERT tokenizer loaded earlier.

`tokenizer.encode()` converts each review into a sequence of integer token IDs that can be fed into the `DistilBERT` model. Passing `add_special_tokens=True` adds the special tokens **[CLS]** at the beginning and **[SEP]** at the end of each encoded review.

`apply()` runs this encoding on every row of column 0, and the results are stored in a new Pandas Series called `tokenized`, in which each element is a list of token IDs.

In [4]:
# tokenize all the reviews in column 0 of the dataframe "df"
tokenized = df[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
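
To make the effect of `add_special_tokens=True` concrete, here is a minimal sketch of what encoding does conceptually. It assumes a tiny hand-written vocabulary (`toy_vocab`, a hypothetical name) whose IDs are copied from the first review shown in this notebook, and it splits on whitespace instead of running the real WordPiece algorithm. It only illustrates the idea and is not how `tokenizer.encode()` is implemented.

```python
# 101 and 102 are the [CLS] and [SEP] IDs in the distilbert-base-uncased vocabulary.
CLS_ID, SEP_ID = 101, 102

# Hypothetical miniature vocabulary; the real one has about 30,000 entries.
toy_vocab = {"a": 1037, "stirring": 18385, ",": 1010, "funny": 6057, "and": 1998}


def toy_encode(text, add_special_tokens=True):
    """Map whitespace-separated tokens to IDs, optionally wrapping them in [CLS] ... [SEP]."""
    ids = [toy_vocab[token] for token in text.split()]
    if add_special_tokens:
        ids = [CLS_ID] + ids + [SEP_ID]
    return ids


encoded = toy_encode("a stirring , funny and")
print(encoded)
```

With the special tokens enabled, the encoded list is two elements longer than the number of words, which is why every tokenized review starts with 101 and ends with 102.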

In [20]:
def visualized_sentence_embedding(df: pd.DataFrame, tokenized: pd.Series) -> pd.DataFrame:
    """
    Purpose:
        Creates a visualization DataFrame showing the mapping between original text tokens
        and their corresponding numerical embeddings from DistilBERT tokenization. Helps
        understand how text is converted to token IDs for model input.

    Parameters:
        df (pd.DataFrame): Original DataFrame containing text reviews in column 0
        tokenized (pd.Series): Series of tokenized sequences with numerical token IDs

    Process Flow:
        1. Extract first review text and split into individual word tokens
        2. Add special tokens [CLS] at beginning and [SEP] at end to match BERT format
        3. Validate token count matches tokenized sequence length
        4. Create token-to-embedding pairs using zip operation
        5. Convert pairs to DataFrame with descriptive column names

    Outputs:
        pd.DataFrame: Two-column DataFrame with 'Tokens' (text) and 'Embeddings' (token IDs)
                     Shows direct mapping for debugging and educational visualization

    Example:
        >>> df_viz = visualized_sentence_embedding(df, tokenized)
        >>> print(df_viz.head(3))
           Tokens  Embeddings
        0     CLS         101
        1       a        1037
        2  stirring     18385
    """
    tokens = df.iloc[0,0].split(" ")
    tokens.insert(0, "CLS")
    tokens.append("SEP")
    assert len(tokens) == len(tokenized[0])

    token_embeddings = list(zip(tokens, tokenized[0]))
    df_token_embeddings = pd.DataFrame(token_embeddings, columns=["Tokens", "Embeddings"])
    
    return df_token_embeddings

In [21]:
df_token_embeddings = visualized_sentence_embedding(df, tokenized)
df_token_embeddings.head(10)

| | Tokens | Embeddings |
|---|---|---|
| 0 | CLS | 101 |
| 1 | a | 1037 |
| 2 | stirring | 18385 |
| 3 | , | 1010 |
| 4 | funny | 6057 |
| 5 | and | 1998 |
| 6 | finally | 2633 |
| 7 | transporting | 18276 |
| 8 | re | 2128 |
| 9 | imagining | 16603 |
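
Note that the second column above holds token IDs rather than embeddings, and that splitting the text on spaces only lines up with those IDs when no word is broken into sub-word pieces. A possible, more robust variant asks the tokenizer itself for the token strings through its `convert_ids_to_tokens()` method. The sketch below is an illustration under assumptions: `token_id_table` is a hypothetical helper name, and `ToyTokenizer` is a hypothetical stand-in that implements only that one method so the example runs without downloading the real tokenizer.

```python
import pandas as pd


def token_id_table(tokenizer, token_ids):
    """Pair each token ID with the token string the tokenizer assigns to it."""
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    return pd.DataFrame({"Token": tokens, "Token ID": token_ids})


class ToyTokenizer:
    """Hypothetical stand-in exposing only convert_ids_to_tokens."""

    _id_to_token = {101: "[CLS]", 1037: "a", 18385: "stirring", 102: "[SEP]"}

    def convert_ids_to_tokens(self, ids):
        return [self._id_to_token[i] for i in ids]


table = token_id_table(ToyTokenizer(), [101, 1037, 18385, 102])
print(table)
```

In the notebook, the same helper could be called as `token_id_table(tokenizer, tokenized[0])`, which needs no length assertion because the tokens are derived from the IDs themselves.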



---

##### Padding

After tokenization, `tokenized` is a `pd.Series` in which each sentence is a list of token IDs, and these lists have different lengths. To process all examples in one batch with BERT, we pad every list to the same length so that the input can be represented as a single 2-dimensional array rather than a list of variable-length lists. Batching the input this way is much faster than processing the sentences one at a time.

The code below performs the following steps:

1. Initializes `max_len` to zero.
2. Computes the maximum length of the tokenized reviews with a list comprehension over their lengths and assigns it to `max_len`. Because `max_len` is still zero when the comprehension runs, the `len(i) > max_len` filter only skips empty sequences.
3. Pads each tokenized review on the right with zeros until its length equals `max_len`, then converts the list of padded reviews to a NumPy array named `padded_token_embeddings`. Despite the name, this array holds token IDs, not embeddings.

The result is a rectangular array in which every row has the same length, which is what the model needs for batched input.

In [22]:
"""
Purpose:
    Implements sequence padding to standardize tokenized text lengths for batch processing
    in neural networks. Ensures all tokenized sequences have uniform dimensions by padding
    shorter sequences with zeros to match the maximum sequence length.

Parameters:
    tokenized.values: Collection of variable-length tokenized sequences (token ID lists)

Process Flow:
    1. Initialize max_len counter to zero
    2. Compute maximum sequence length across all tokenized reviews using list comprehension
    3. Pad each sequence with zeros to reach max_len using right-padding strategy
    4. Convert padded sequences to NumPy array for efficient tensor operations
    5. Display final array shape for verification (samples, max_sequence_length)

Outputs:
    padded_token_embeddings (np.ndarray): 2D array of shape (n_samples, max_len)
                                         Zero-padded sequences ready for model input
    Printed shape: Tuple showing (number_of_samples, padded_sequence_length)

Example:
    >>> # Input: [[101, 1037], [101, 1037, 2633, 102]]  # Variable lengths
    >>> # Output: [[101, 1037, 0, 0], [101, 1037, 2633, 102]]  # Uniform length
    >>> print(padded_token_embeddings.shape)
    (2500, 65)
"""

max_len = 0
max_len = max([len(i) for i in tokenized.values if len(i) > max_len])
padded_token_embeddings = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

print(padded_token_embeddings.shape)

(2500, 65)
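
The padding step can be checked on a toy input. The following is a minimal sketch that assumes two short, made-up ID sequences; `pad_token_ids` and `toy_sequences` are hypothetical names introduced only for this illustration.

```python
import numpy as np


def pad_token_ids(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return np.array([seq + [pad_id] * (max_len - len(seq)) for seq in sequences])


toy_sequences = [[101, 1037, 102], [101, 1037, 2633, 18385, 102]]
padded = pad_token_ids(toy_sequences)
print(padded.shape)
print(padded)
```

The shorter sequence gains two trailing zeros, and the longest sequence is left unchanged.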



---
### Masking

To keep BERT from treating the padding as real input, we create a separate array called `attention_mask`. It marks which positions the model should attend to and which it should ignore: 1 for real tokens and 0 for padding. The comparison `padded_token_embeddings != 0` works here because padding uses ID 0, which is reserved for the padding token and is not produced for ordinary text.

In [8]:
attention_mask = np.where(padded_token_embeddings != 0, 1, 0)
assert attention_mask.shape == padded_token_embeddings.shape

print(attention_mask.shape)

(2500, 65)
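
As a small illustration of the mask, the sketch below builds it for a hand-written padded array (made-up values, not taken from the dataset) and sums each row to recover the number of real tokens per sentence.

```python
import numpy as np

# Two padded toy sequences; zeros on the right are padding.
padded = np.array([[101, 1037, 102, 0, 0],
                   [101, 1037, 2633, 18385, 102]])

# 1 where there is a real token, 0 where there is padding.
attention_mask = np.where(padded != 0, 1, 0)

# Summing a mask row gives the unpadded length of that sentence.
real_token_counts = attention_mask.sum(axis=1)
print(attention_mask)
print(real_token_counts)
```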



---
##### Model inputs

We are now ready to run the pre-trained **DistilBERT** model in PyTorch. No training happens in this step: the model is used only to extract features. First, we prepare the inputs by converting the padded token IDs and the attention mask into PyTorch tensors using the `torch.tensor()` function.

We can then pass the `input_ids` and `attention_mask` tensors to the DistilBERT model by calling `model()`. The output, `last_hidden_states`, holds the contextualized embeddings for each token in our input sentences in its first element.

In [23]:
"""
Purpose:
    Performs forward pass through DistilBERT model to extract contextualized embeddings
    from padded token sequences. Uses inference mode to generate feature representations
    without gradient computation for efficient feature extraction pipeline.

Parameters:
    padded_token_embeddings: NumPy array of zero-padded token ID sequences
    attention_mask: Binary mask indicating real tokens (1) vs padding tokens (0)

Process Flow:
    1. Convert padded token embeddings to PyTorch tensor format (input_ids)
    2. Convert attention mask to PyTorch tensor for proper masking
    3. Disable gradient computation using torch.no_grad() context for inference
    4. Pass input_ids and attention_mask through DistilBERT model
    5. Extract last_hidden_states containing contextualized embeddings for all tokens

Outputs:
    last_hidden_states: Tuple containing tensor of shape (batch_size, seq_len, hidden_size)
                       Contains 768-dimensional embeddings for each token position
                       Ready for [CLS] token extraction and downstream classification

Example:
    >>> input_ids.shape  # (2500, 65)
    >>> attention_mask.shape  # (2500, 65)
    >>> last_hidden_states[0].shape  # (2500, 65, 768)
    >>> cls_embeddings = last_hidden_states[0][:,0,:]  # Extract [CLS] tokens
"""

input_ids = torch.tensor(padded_token_embeddings)
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)
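
Passing all 2,500 sentences through the model in a single call can need a lot of memory. One possible alternative, sketched here under assumptions, is to run the model over mini-batches and keep only the vector at position 0 of each sentence. `extract_cls_in_batches`, `batch_size`, and `ToyEncoder` are hypothetical names; `ToyEncoder` is a stand-in embedding layer that mimics only the call signature and the indexable output of the real model so the sketch runs without downloading weights.

```python
import torch


def extract_cls_in_batches(model, input_ids, attention_mask, batch_size=256):
    """Run the model over mini-batches and collect the position-0 ([CLS]) vectors."""
    model.eval()
    chunks = []
    with torch.no_grad():
        for start in range(0, input_ids.shape[0], batch_size):
            stop = start + batch_size
            outputs = model(input_ids[start:stop],
                            attention_mask=attention_mask[start:stop])
            chunks.append(outputs[0][:, 0, :])
    return torch.cat(chunks).numpy()


class ToyEncoder(torch.nn.Module):
    """Hypothetical stand-in: returns one vector per token, wrapped in a tuple."""

    def __init__(self, vocab_size=200, hidden_size=8):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        return (self.embed(input_ids),)


torch.manual_seed(0)
toy_model = ToyEncoder()
toy_ids = torch.randint(1, 200, (10, 5))
toy_mask = torch.ones_like(toy_ids)
toy_features = extract_cls_in_batches(toy_model, toy_ids, toy_mask, batch_size=4)
print(toy_features.shape)
```

In the notebook, the same function could be called with `model`, `input_ids`, and `attention_mask`; the result would match `last_hidden_states[0][:, 0, :].numpy()` while holding only one mini-batch of token embeddings in memory at a time.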




---
**Explanation for feature extraction from `last_hidden_states`:**

Suppose we have a batch of 2500 input sentences, where each sentence is tokenized and padded to a length of 65. So, the shape of our padded array would be (2500, 65).

We pass this padded array to BERT by calling `model()`. The first element of the returned object, `last_hidden_states[0]`, is a tensor of shape (2500, 65, 768). Here, 2500 is the batch size, 65 is the length of the padded sentence, and 768 is the size of the BERT embedding for each token.

To get a fixed-length representation of each sentence, we use the embedding of its first token, the `[CLS]` token, which is located at index 0 in the second dimension of `last_hidden_states[0]`.

To get these embeddings for each sentence in the batch, we use the slicing operation `[:,0,:]`. This selects all elements along the first dimension (which corresponds to the batch size), the first element along the second dimension (which corresponds to the `[CLS]` token), and all elements along the third dimension (which corresponds to the embedding size). This returns a tensor of shape (2500, 768), where each row corresponds to the embedding of a single sentence.

Finally, we convert this tensor to a NumPy array using `.numpy()`, which gives us a 2D array `features` of shape (2500, 768), where each row is the fixed-length representation of one sentence.
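
The slicing described above can be tried on a small array. This is a minimal sketch that uses a NumPy array of consecutive integers as a stand-in for `last_hidden_states[0]`; the sizes (2 sentences, 3 token positions, 4 dimensions) are made up to keep the output readable.

```python
import numpy as np

# Stand-in for last_hidden_states[0]: 2 sentences, 3 token positions, 4 dimensions.
hidden = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Keep every sentence, only token position 0, and every embedding dimension.
cls_vectors = hidden[:, 0, :]
print(cls_vectors.shape)
print(cls_vectors)
```

The middle axis disappears, leaving one row per sentence, exactly as (2500, 65, 768) becomes (2500, 768) in the notebook.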

In [None]:
"""
Purpose:
    Extracts fixed-length sentence representations from DistilBERT's contextualized embeddings
    by selecting the [CLS] token embeddings. Converts PyTorch tensors to NumPy arrays for
    compatibility with scikit-learn classifiers in the hybrid ML pipeline.

Parameters:
    last_hidden_states[0]: PyTorch tensor of shape (batch_size, seq_len, hidden_size)
                          Contains contextualized embeddings for all token positions

Process Flow:
    1. Access first element of last_hidden_states tuple (main embedding tensor)
    2. Select [CLS] token embeddings using [:,0,:] slicing operation
    3. Extract all samples (:), first token position (0), all embedding dimensions (:)
    4. Convert PyTorch tensor to NumPy array using .numpy() method
    5. Result: Fixed-length feature vectors ready for classical ML algorithms

Outputs:
    features (np.ndarray): 2D array of shape (batch_size, 768) containing sentence embeddings
                          Each row represents one sentence's [CLS] token embedding
                          Ready for logistic regression or other scikit-learn classifiers


    Additional Technical Detail - Tensor Slicing Breakdown
    ––––––––––––––––––––––––––––––––––––––––––––––––––––––

    The [:,0,:] operation performs 3-dimensional array slicing on last_hidden_states[0]:

    Tensor Shape Analysis:
        last_hidden_states[0] = (2500, 65, 768)
        - Axis 0: 2500 sentences in batch
        - Axis 1: 65 token positions (padded sequence length)  
        - Axis 2: 768 embedding dimensions per token

    Slicing Operation Breakdown:
        [:, 0, :] means:
        - First ':' → Select ALL rows (all 2500 sentences)
        - '0' → Select ONLY column 0 (first token position = [CLS] token)
        - Last ':' → Select ALL depths (all 768 embedding dimensions)

    Mathematical Transformation:
        Input:  (2500, 65, 768) → 3D tensor with all token embeddings
        Slice:  [:,0,:]         → Extract only [CLS] token from each sentence
        Output: (2500, 768)     → 2D matrix with sentence representations

    Why [CLS] Token (Position 0)?
        - BERT adds [CLS] at beginning of every sequence during tokenization
        - [CLS] is trained to aggregate entire sentence meaning
        - Perfect for sentence-level classification tasks
        - Contains contextualized information from all other tokens via self-attention

    Result: Each of 2500 sentences becomes a single 768-dimensional vector

Example:
    >>> last_hidden_states[0].shape  # (2500, 65, 768)
    >>> features.shape  # (2500, 768)
    >>> # Each row is a 768-dim sentence representation from [CLS] token
"""

# extract the [CLS] embedding of every sentence as its feature vector
features = last_hidden_states[0][:,0,:].numpy()

In [24]:
"""
Purpose:
    Extracts the ground truth sentiment labels from the DataFrame for training and evaluating
    the classification model. These labels represent the target variable (positive or negative
    sentiment) that the model aims to predict based on DistilBERT features.

Parameters:
    df: pandas DataFrame containing the SST2 dataset
        - Column 0: Text reviews (sentences)
        - Column 1: Binary sentiment labels (target variable)

Process Flow:
    1. Access column 1 of the DataFrame (df[1]) which contains the sentiment labels
    2. Store these values in the 'labels' variable as a pandas Series
    3. Labels are binary: 1 for positive sentiment, 0 for negative sentiment
    4. Used as the target variable for logistic regression training and evaluation

Outputs:
    labels (pd.Series): Series of binary integers (0 or 1) representing sentiment
                       - 0: Negative sentiment review
                       - 1: Positive sentiment review
                       Matches length of features array for supervised learning

Example:
    >>> labels.head()
    0    1    # Positive sentiment for first review
    1    0    # Negative sentiment for second review
    2    0
    3    1
    4    1
    >>> len(labels)  # 2500 (matches number of samples)
"""
labels = df[1]

assert len(features) == len(labels)

### Split data into training and testing sets

In [12]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
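
`train_test_split` is called here with its defaults, which shuffle the rows and hold out 25% of them for testing, so the 2,500 samples become 1,875 training and 625 test samples, and the split changes from run to run. A possible variation, sketched below on synthetic stand-in data, fixes the shuffle and keeps the class balance equal in both parts. The `random_state=42` value, the use of `stratify`, and the `toy_` arrays are assumptions made for this illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
toy_features = rng.normal(size=(2500, 8))    # synthetic stand-in for `features`
toy_labels = rng.integers(0, 2, size=2500)   # synthetic stand-in for `labels`

# Same 75/25 proportions as the default, but reproducible and stratified by label.
X_train, X_test, y_train, y_test = train_test_split(
    toy_features, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)
print(X_train.shape, X_test.shape)
```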

### Logistic Regression

In [30]:
"""
Purpose:
    Initializes and trains a Logistic Regression classifier using scikit-learn on top of
    DistilBERT-extracted features. Serves as the final classification layer in the hybrid
    deep learning + classical ML pipeline for binary sentiment prediction.

Parameters:
    C (float): Inverse of regularization strength, set to 5 for moderate regularization
              Smaller values increase regularization to prevent overfitting
    max_iter (int): Maximum number of iterations for solver convergence, set to 1000
                   Ensures model training completes even with complex data
    train_features (np.ndarray): Training data of shape (n_samples, 768)
                                DistilBERT [CLS] token embeddings as features
    train_labels (pd.Series): Training target values, binary labels (0=negative, 1=positive)

Process Flow:
    1. Create LogisticRegression instance with specified hyperparameters
    2. Fit the model to training data using .fit() method
    3. Optimize model weights to minimize binary cross-entropy loss
    4. Use L2 regularization (default) with strength controlled by C parameter
    5. Iterate up to max_iter times or until convergence criteria met

Outputs:
    lr_clf: Trained LogisticRegression model instance
           Ready for prediction on test set and performance evaluation
           Capable of classifying new DistilBERT features into binary sentiment

Example:
    >>> lr_clf = LogisticRegression(C=5, max_iter=1000)
    >>> lr_clf.fit(train_features, train_labels)
    >>> test_accuracy = lr_clf.score(test_features, test_labels)
    >>> print(f"Test accuracy: {test_accuracy:.3f}")  # e.g., Test accuracy: 0.821
"""

lr_clf = LogisticRegression(C=5, max_iter=1000)
lr_clf.fit(train_features, train_labels)

In [31]:
# see how our trained LR model performs on the test set
lr_clf.score(test_features, test_labels)

0.8208


---
### Further improvements
- Fine-tune DistilBERT.
- Use GridSearchCV to find the best hyperparameters for the LogisticRegression model.
- Try other classifiers, build a neural network for classification, or use another pre-trained neural network for classification.
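
As a rough sketch of the GridSearchCV suggestion, the example below searches over a few values of `C` with 5-fold cross-validation. It runs on synthetic data from `make_classification` so that it is self-contained; the grid values and the synthetic data are assumptions for illustration, and in the notebook `X` and `y` would be replaced by `train_features` and `train_labels`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the DistilBERT features and sentiment labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Candidate values for the inverse regularization strength (illustrative choice).
param_grid = {"C": [0.01, 0.1, 1, 5, 10]}

grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid_search.fit(X, y)

print("best parameters:", grid_search.best_params_)
print("best cross-validated accuracy:", grid_search.best_score_)
```

The selected `C` can then be passed to a fresh `LogisticRegression`, or the refitted `grid_search.best_estimator_` can be scored on the held-out test set.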

---
