## Generating Embeddings with BERT (Bidirectional Encoder Representations from Transformers) 

### What is BERT?
BERT is a large neural network language model that has been pre-trained on a large corpus of text (millions of sentences). Its applications include, but are not limited to:

- Sentiment analysis
- Text classification
- Question answering systems
 
In this notebook, we walk through how BERT generates fixed-length embeddings (features) from a sentence. You can think of these embeddings as an alternative feature extraction technique to bag of words. The BERT pipeline has two main components, a tokenizer and a model, each described in its own section below.



## 1. Tokenizer (Converting sentences into series of numerical tokens):

The tokenizer in BERT is like a translator that converts sentences into a series of numerical tokens that the BERT model can understand. Specifically, it does the following:

- Splits Text: It breaks down sentences into smaller pieces called tokens. These tokens can be as short as one character or as long as one word. For example, the word "chatting" might be split into "chat" and "##ting".

- Converts Tokens to IDs: Each token has a unique integer ID in BERT's vocabulary, and the tokenizer maps every token to its corresponding ID. This is like looking up the word's entry number in BERT's dictionary; the ID is only an index and carries no meaning by itself.

- Adds Special Tokens: BERT requires certain special tokens for its tasks, like [CLS] at the beginning of a sentence and [SEP] at the end or between two sentences. The tokenizer adds these in.
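
The three steps above can be sketched with a toy tokenizer. This is a minimal illustration, not BERT's actual implementation: the vocabulary `TOY_VOCAB`, its ID values, and the helper names `wordpiece` and `encode` are all hypothetical, and the splitting rule is a simplified greedy longest-match-first search in the style of WordPiece.

```python
# Toy vocabulary (hypothetical). Real BERT vocabularies hold tens of thousands of entries.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "we": 4, "are": 5, "chat": 6, "##ting": 7, ".": 8}

def wordpiece(word, vocab):
    """Split one word into sub-word tokens by greedy longest-match-first search."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def encode(sentence, vocab):
    """Split text, add special tokens, and map every token to its ID."""
    tokens = ["[CLS]"]
    for word in sentence.lower().replace(".", " .").split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    ids = [vocab[t] for t in tokens]
    return tokens, ids

tokens, ids = encode("We are chatting.", TOY_VOCAB)
print(tokens)  # ['[CLS]', 'we', 'are', 'chat', '##ting', '.', '[SEP]']
print(ids)     # [2, 4, 5, 6, 7, 8, 3]
```

The real tokenizer handles punctuation, casing, and unknown characters more carefully, but the flow (split, look up IDs, wrap in special tokens) is the same.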


### Example usage of the tokenizer

In the cell below, we see how BERT tokenizes 3 sentences and decodes them back.

We'll use the following example sentences:

1. "The sky is blue."
2. "Sky is clear today."
3. "Look at the clear blue sky."


In [1]:
# Import required libraries
from transformers import AutoTokenizer

# Load the pre-trained BERT tokenizer and encode the example sentences
sentences = ["The sky is blue.", "Sky is clear today.", "Look at the clear blue sky."]
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_text = tokenizer(sentences, padding=True,
                         max_length=10,
                         truncation=True)['input_ids']

print('----------------------------------------------')
print('Examples of tokenizing the sentences with BERT')
print('----------------------------------------------')
for jj, txt in enumerate(sentences):
    print('%s is encoded as : %s'%(txt, encoded_text[jj]))

print('----------------------------------------------')
print('Examples of decoding the tokens back to English')
print('----------------------------------------------')
for enc in encoded_text:
    decoded_text = tokenizer.decode(enc)
    print("Decoded tokens back into text: ", decoded_text)


----------------------------------------------
Examples of tokenizing the sentences with BERT
----------------------------------------------
The sky is blue. is encoded as : [101, 1109, 3901, 1110, 2221, 119, 102, 0, 0]
Sky is clear today. is encoded as : [101, 5751, 1110, 2330, 2052, 119, 102, 0, 0]
Look at the clear blue sky. is encoded as : [101, 4785, 1120, 1103, 2330, 2221, 3901, 119, 102]
----------------------------------------------
Examples of decoding the tokens back to English
----------------------------------------------
Decoded tokens back into text:  [CLS] The sky is blue. [SEP] [PAD] [PAD]
Decoded tokens back into text:  [CLS] Sky is clear today. [SEP] [PAD] [PAD]
Decoded tokens back into text:  [CLS] Look at the clear blue sky. [SEP]
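
In the output above, the two shorter sentences are padded with ID 0 (`[PAD]`) so that all three rows have the same length. The tokenizer also returns an `attention_mask` that marks which positions are real tokens; the cell above discards it by keeping only `input_ids`. As a rough sketch, an equivalent mask can be rebuilt by hand from the printed IDs (the names `PAD_ID` and `real_token_counts` are illustrative, not part of the library):

```python
import numpy as np

PAD_ID = 0  # [PAD] decodes from ID 0 in the output above

# Token IDs copied from the tokenizer output above
input_ids = np.array([
    [101, 1109, 3901, 1110, 2221, 119, 102, 0, 0],
    [101, 5751, 1110, 2330, 2052, 119, 102, 0, 0],
    [101, 4785, 1120, 1103, 2330, 2221, 3901, 119, 102],
])

# 1 marks a real token, 0 marks padding
attention_mask = (input_ids != PAD_ID).astype(int)
real_token_counts = attention_mask.sum(axis=1)

print(attention_mask)
print(real_token_counts)  # [7 7 9]
```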


## 2. Model (Extracting meaningful feature representations from the sentences):

Once the text is tokenized and converted into the necessary format, it's fed into the BERT model. 
The **model** processes these inputs to generate contextual embeddings or representations for each token. These representations can then be utilized for various downstream tasks like classification, entity recognition, and more.
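
Because the model returns one vector per token, a sentence-level feature vector requires collapsing those vectors into one. A common way to do this is pooling. The sketch below illustrates three pooling strategies on a small made-up array; the values, the hidden size of 3, and the helper name `pool` are assumptions chosen for readability (BERT-base produces 768 values per token).

```python
import numpy as np

# Hypothetical per-token representations: 4 tokens, hidden size 3.
# The first row stands in for the [CLS] token.
token_embeddings = np.array([
    [0.5, -1.0,  2.0],
    [1.0,  0.0,  0.0],
    [3.0,  2.0, -2.0],
    [-0.5, 1.0,  4.0],
])

def pool(token_embeddings, strategy="cls"):
    """Collapse a (num_tokens, hidden_size) array into one (hidden_size,) vector."""
    if strategy == "cls":
        return token_embeddings[0]            # vector of the first token
    if strategy == "mean":
        return token_embeddings.mean(axis=0)  # average over tokens
    if strategy == "max":
        return token_embeddings.max(axis=0)   # element-wise maximum over tokens
    raise ValueError(f"Unknown pooling strategy: {strategy}")

print(pool(token_embeddings, "cls"))   # [ 0.5 -1.   2. ]
print(pool(token_embeddings, "mean"))  # [1.  0.5 1. ]
print(pool(token_embeddings, "max"))   # [3. 2. 4.]
```

Whatever the sentence length, each strategy yields a vector whose size equals the hidden size, which is what makes the embeddings fixed-length.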

In [2]:
import torch
from transformers import BertTokenizer, BertModel
import numpy as np
import pandas as pd
import os
# Initialize BERT tokenizer and model
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

def get_bert_embedding(sentence_list, pooling_strategy='cls'):
    embedding_list = []
    for nn, sentence in enumerate(sentence_list):
        if nn % 100 == 0 and nn > 0:
            print('Done with %d sentences'%nn)
        
        # Tokenize the sentence and get the output from BERT
        # (truncation=True guards against inputs longer than the model's maximum length)
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
        # Per-token embeddings from the last hidden layer, shape (num_tokens, 768);
        # index [0] drops the batch dimension, since we process one sentence at a time
        last_hidden_states = outputs.last_hidden_state[0]
        
        # Pooling strategies
        if pooling_strategy == "cls":
            sentence_embedding = last_hidden_states[0]
        elif pooling_strategy == "mean":
            sentence_embedding = torch.mean(last_hidden_states, dim=0)
        elif pooling_strategy == "max":
            sentence_embedding, _ = torch.max(last_hidden_states, dim=0)
        else:
            raise ValueError(f"Unknown pooling strategy: {pooling_strategy}")
        
        embedding_list.append(sentence_embedding)
    return torch.stack(embedding_list)

sentence = [sentences[0]]
embedding = get_bert_embedding(sentence)

np.set_printoptions(precision=3, suppress=True)
print('-----------------------------------------------------------------------------------------------------------')
print('The sentence "%s" has been converted to a feature representation of shape %s'%(sentence[0], embedding.numpy().shape))
print('-----------------------------------------------------------------------------------------------------------')
print(embedding.numpy()[0])



-----------------------------------------------------------------------------------------------------------
The sentence "The sky is blue." has been converted to a feature representation of shape (1, 768)
-----------------------------------------------------------------------------------------------------------
[ 0.229  0.012 -0.139 -0.237 -0.437 -0.553  0.21   0.847  0.17  -0.736
  0.019 -0.11   0.38   0.424  0.418  0.06  -0.137  0.701  0.392 -0.189
  0.124 -0.139 -0.308  0.088  0.13  -0.334 -0.078 -0.147  0.238  0.005
 -0.099  0.484 -0.337 -0.415  0.414 -0.065  0.479 -0.037 -0.126  0.332
 -0.148 -0.099  0.347 -0.211  0.294 -0.679 -2.25  -0.413 -0.084 -0.049
  0.376 -0.534  0.055  0.561 -0.002  0.842 -0.419  0.756  0.367  0.318
  0.211  0.056  0.023 -0.19   0.364  0.389 -0.157  0.268 -0.03   0.233
 -0.702 -0.283  0.695 -0.329 -0.286  0.069 -0.524  0.339 -0.061 -0.455
  0.245  0.815  0.2    0.419  0.149  0.381 -0.742 -0.676  0.282  0.293
 -0.382  0.198 -0.349  0.774  0.174 -0.151 -0.05
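
The function above encodes one sentence per forward pass, so no padding is involved and a plain mean over tokens is safe. If sentences were instead batched with padding, as in the tokenizer example, mean pooling would need to skip the `[PAD]` positions. The following is a minimal NumPy sketch of that idea under assumed shapes; `masked_mean` and the toy arrays are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np

def masked_mean(hidden, mask):
    """Average token vectors per sentence, ignoring padded positions.

    hidden: (batch, num_tokens, hidden_size) array of token representations
    mask:   (batch, num_tokens) array with 1 for real tokens and 0 for padding
    """
    mask = mask[:, :, None].astype(hidden.dtype)  # (batch, num_tokens, 1)
    summed = (hidden * mask).sum(axis=1)          # padded rows contribute zero
    counts = mask.sum(axis=1)                     # number of real tokens per sentence
    return summed / np.maximum(counts, 1.0)       # avoid division by zero

hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = masked_mean(hidden, mask)
print(pooled)  # first row is [2. 3.]: the padded vector is ignored
```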

In [3]:
print('Loading data...')
x_train_df = pd.read_csv('../data_reviews/x_train.csv')
x_test_df = pd.read_csv('../data_reviews/x_test.csv')

tr_text_list = x_train_df['text'].values.tolist()
te_text_list = x_test_df['text'].values.tolist()

Loading data...


In [4]:
print('Generating embeddings for train sequences...')
tr_embedding = get_bert_embedding(tr_text_list)

print('Generating embeddings for test sequences...')
te_embedding = get_bert_embedding(te_text_list)


Generating embeddings for train sequences...
Done with 100 sentences
Done with 200 sentences
Done with 300 sentences
Done with 400 sentences
Done with 500 sentences
Done with 600 sentences
Done with 700 sentences
Done with 800 sentences
Done with 900 sentences
Done with 1000 sentences
Done with 1100 sentences
Done with 1200 sentences
Done with 1300 sentences
Done with 1400 sentences
Done with 1500 sentences
Done with 1600 sentences
Done with 1700 sentences
Done with 1800 sentences
Done with 1900 sentences
Done with 2000 sentences
Done with 2100 sentences
Done with 2200 sentences
Done with 2300 sentences
Generating embeddings for test sequences...
Done with 100 sentences
Done with 200 sentences
Done with 300 sentences
Done with 400 sentences
Done with 500 sentences


In [5]:
tr_embeddings_ND = tr_embedding.numpy()
te_embeddings_ND = te_embedding.numpy()

save_dir = os.path.abspath('../data_reviews/')
print('Saving the train and test embeddings to %s'%save_dir)

np.save(os.path.join(save_dir, 'x_train_BERT_embeddings.npy'), tr_embeddings_ND)
np.save(os.path.join(save_dir, 'x_test_BERT_embeddings.npy'), te_embeddings_ND)

Saving the train and test embeddings to /Users/manuelpena/Documents/Tufts/fifth_semester/cs135-24f-assignments/projectA/data_reviews
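
As a quick sanity check on the save step, the sketch below writes a small random array with `np.save` and reads it back with `np.load`, confirming that shape, dtype, and values survive the round trip. The array, the file name `toy_embeddings.npy`, and the temporary directory are stand-ins for illustration, not the notebook's real data.

```python
import os
import tempfile

import numpy as np

# Stand-in for a small embedding matrix: 5 sentences, 768 features each
toy_embeddings = np.random.default_rng(0).normal(size=(5, 768)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_embeddings.npy")
    np.save(path, toy_embeddings)  # writes NumPy's binary .npy format
    loaded = np.load(path)         # fully read into memory before the directory is removed

print(loaded.shape, loaded.dtype)
print(np.array_equal(toy_embeddings, loaded))
```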


In [31]:
import numpy as np
import os
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV



In [34]:
# Load the precomputed BERT embeddings
train_embeddings = np.load(os.path.join(save_dir, 'x_train_BERT_embeddings.npy'))
test_embeddings = np.load(os.path.join(save_dir, 'x_test_BERT_embeddings.npy'))

# Load the labels (we only have y_train, so we'll still fit on it)
y_train_df = pd.read_csv('../data_reviews/y_train.csv')
y_train = y_train_df['is_positive_sentiment'].values

# Standardize the embeddings before training SVC
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(train_embeddings)
X_test_scaled = scaler.transform(test_embeddings)
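
The cell above fits the scaler on the training embeddings only and then reuses those statistics for the test embeddings. The sketch below reproduces that idea by hand with NumPy on synthetic data. All `toy_` names and the random data are assumptions for illustration; the population standard deviation (`ddof=0`) is used because that is what `StandardScaler` computes.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
toy_test = rng.normal(loc=5.0, scale=2.0, size=(50, 4))

# Statistics come from the training set only
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)  # population std (ddof=0)

toy_train_scaled = (toy_train - train_mean) / train_std
toy_test_scaled = (toy_test - train_mean) / train_std  # reuse the train statistics

print(toy_train_scaled.mean(axis=0).round(6))  # zero in every column, up to rounding
print(toy_train_scaled.std(axis=0).round(6))   # one in every column
print(toy_test_scaled.mean(axis=0).round(3))   # close to zero, but not exactly
```

The test columns end up only approximately centred, which is expected: the test set is treated as unseen data, so none of its statistics are allowed to influence the transformation.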

## Show similarity between reviews using the embeddings
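
One common way to compare two embeddings is cosine similarity, which measures the angle between the vectors and ignores their lengths. The sketch below is a minimal NumPy version on made-up 3-dimensional vectors; the function name `cosine_similarity_matrix` and the toy data are assumptions, not part of the original notebook.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Return the (n, n) matrix of cosine similarities between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return X_unit @ X_unit.T

toy_vectors = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],   # same direction as row 0
    [0.0, 1.0, 0.0],   # orthogonal to row 0
    [-1.0, 0.0, 0.0],  # opposite direction to row 0
])

sim = cosine_similarity_matrix(toy_vectors)
print(sim[0])  # [ 1.  1.  0. -1.]

# Most similar other row for row 0 (mask out the self-similarity first)
sim_no_self = sim.copy()
np.fill_diagonal(sim_no_self, -np.inf)
nearest_to_first = int(np.argmax(sim_no_self[0]))
print(nearest_to_first)  # 1
```

Applied to the saved review embeddings (for example `tr_embeddings_ND`), the same function would give a review-by-review similarity matrix, and the largest off-diagonal entry in a row would point to that review's closest neighbour.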

In [35]:
# Hyperparameter grid for the SVC search (this grid took about 9 min to run)
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto'],
    'kernel': ['linear', 'rbf']
}

# Alternative grid tried earlier (about 16 min); 'degree' is only used by the 'poly' kernel
# param_grid = {
#     'C': [0.1, 1],
#     'gamma': ['scale', 'auto'],
#     'kernel': ['linear', 'rbf', 'poly'],
#     'degree': [2, 3],
# }
# Results with that grid:
#   Best parameters found:  {'C': 1, 'degree': 2, 'gamma': 'auto', 'kernel': 'rbf'}
#   Best cross-validation AUC:  0.96721875
#   Training AUC: 0.9967


# Initialize SVC
svc = SVC(class_weight='balanced', probability=True)

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=svc, 
    param_grid=param_grid, 
    scoring='roc_auc', 
    cv=5
)


# Fit GridSearchCV
grid_search.fit(X_train_scaled, y_train)
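
To get a feel for why the search takes minutes, it helps to count the model fits. The sketch below enumerates the same grid with `itertools.product`; the variable names are illustrative. It also counts how many candidates are actually distinct, based on the fact that `gamma` has no effect when the kernel is linear.

```python
from itertools import product

grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", "auto"],
    "kernel": ["linear", "rbf"],
}
n_folds = 5

# Every combination of values, one dict per candidate
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
n_fits = len(candidates) * n_folds
print(len(candidates), n_fits)  # 16 80

# gamma is ignored by the linear kernel, so those candidates are duplicates
distinct = {
    (c["C"], c["kernel"], c["gamma"] if c["kernel"] != "linear" else None)
    for c in candidates
}
print(len(distinct), len(distinct) * n_folds)  # 12 60
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set. `GridSearchCV` also accepts a list of dictionaries as `param_grid`, so the linear and RBF kernels could each get their own sub-grid and the redundant linear fits would be avoided.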

In [37]:
# Retrieve the best parameters and best score
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation AUC: ", grid_search.best_score_)

# Use the best estimator from GridSearchCV to make predictions
best_svc = grid_search.best_estimator_

Best parameters found:  {'C': 1, 'gamma': 'auto', 'kernel': 'rbf'}
Best cross-validation AUC:  0.96721875
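
The AUC reported above can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on four made-up predictions. The helper `pairwise_auc` and the toy data are assumptions for illustration; in practice `roc_auc_score` does this work.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive score minus every negative score
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```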


In [38]:
# Predict positive-class probabilities for the training set
y_pred_prob_train = best_svc.predict_proba(X_train_scaled)[:, 1]

# Calculate AUC on the training set (optimistic, since the model was fit on this data)
auc_train = roc_auc_score(y_train, y_pred_prob_train)
print(f'Training AUC: {auc_train:.4f}')

# Predict positive-class probabilities for the test set
y_pred_prob_test = best_svc.predict_proba(X_test_scaled)[:, 1]

Training AUC: 0.9967


In [39]:
# Save predictions to a text file (yproba1_test.txt)
output_file = os.path.join(save_dir, 'yproba1_test.txt')
np.savetxt(output_file, y_pred_prob_test, fmt='%.6f')

print(f'Predicted probabilities saved to {output_file}')

Predicted probabilities saved to /Users/manuelpena/Documents/Tufts/fifth_semester/cs135-24f-assignments/projectA/data_reviews/yproba1_test.txt
