# 🤖 BERT for Word Embeddings


This notebook is a **simplified, beginner-friendly** guide to using **BERT for sentiment analysis**. We'll:
- Use a small set of sentences labeled as Positive (1) or Negative (0)
- Tokenize them using the BERT tokenizer
- Get BERT embeddings
- (Optionally) Use these embeddings as input to a classifier

The core steps use the `transformers` and `torch` libraries; `pandas` and `scikit-learn` handle the data and the optional classifier.


In [None]:
!pip install transformers torch --quiet

In [None]:
import torch
from transformers import BertTokenizer, BertModel
import pandas as pd

### 🧾 Step 1: Create Sample Data

In [None]:

data = {
    'text': [
        "I love this product!",
        "This is the worst experience I've ever had.",
        "Absolutely fantastic!",
        "Not good, very disappointing.",
        "I'm happy with the service."
    ],
    'label': [1, 0, 1, 0, 1]  # 1 = Positive, 0 = Negative
}

df = pd.DataFrame(data)
df

|   | text                                        | label |
| - | ------------------------------------------- | ----- |
| 0 | I love this product!                        | 1     |
| 1 | This is the worst experience I've ever had. | 0     |
| 2 | Absolutely fantastic!                       | 1     |
| 3 | Not good, very disappointing.               | 0     |
| 4 | I'm happy with the service.                 | 1     |


### 🧠 Step 2: Tokenize Text using BERT Tokenizer

* The cell below loads a pre-trained BERT tokenizer.
* ``'bert-base-uncased'`` means:
    * Base BERT model with 12 layers
    * Trained on lowercased text (uncased: "Hello" and "hello" are treated the same)


The tokenizer splits each sentence into WordPiece tokens.

The call below tokenizes a **list of texts** from the DataFrame.

* **list(df['text']):** Extracts a list of raw strings from the DataFrame's 'text' column.

* **padding=True:** Pads shorter sequences with [PAD] tokens, appended after the real tokens, so that every sequence in the batch has the same length.

* **truncation=True:** Cuts off sentences that exceed the maximum sequence length BERT supports (512 tokens).

* **return_tensors="pt":** Returns PyTorch tensors instead of Python lists or NumPy arrays.
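The effect of `padding=True` and `truncation=True` can be sketched with plain Python lists, without loading the tokenizer. This is only an illustration: `pad_and_truncate` is a hypothetical helper, not part of `transformers`, the IDs are illustrative, and the way it keeps a closing `[SEP]` after cutting is a simplified imitation of what the real tokenizer does.

```python
PAD_ID = 0    # [PAD]
SEP_ID = 102  # [SEP]

def pad_and_truncate(batch, max_length=512):
    """Hypothetical helper: trim long ID lists, then right-pad to the longest one."""
    trimmed = []
    for ids in batch:
        if len(ids) > max_length:
            # Cut the sequence but keep it terminated by [SEP].
            ids = ids[:max_length - 1] + [SEP_ID]
        trimmed.append(list(ids))
    longest = max(len(ids) for ids in trimmed)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in trimmed]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in trimmed]
    return input_ids, attention_mask

batch = [
    [101, 1045, 2293, 2023, 102],  # illustrative IDs, 5 tokens
    [101, 2023, 102],              # 3 tokens
]
input_ids, attention_mask = pad_and_truncate(batch, max_length=4)
print(input_ids)       # [[101, 1045, 2293, 102], [101, 2023, 102, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The attention mask is what lets the model ignore the `[PAD]` positions, which is why the tokenizer returns it alongside the IDs.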

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# The tokenizer call:
#   1. Splits text into WordPiece tokens (subwords); rarer words are broken up,
#      e.g. "embeddings" → ["em", "##bed", "##ding", "##s"].
#   2. Maps tokens to integer IDs using BERT's vocabulary (30,522 entries for bert-base-uncased).
#   3. Adds special tokens: [CLS] (classification) at the start, [SEP] (separator) at the end.
inputs = tokenizer(
    list(df['text']),
    padding=True,       # pad shorter sentences to the longest one in the batch
    truncation=True,    # cut any sentence longer than BERT's 512-token limit
    return_tensors="pt"
)

inputs['input_ids'].shape  # (batch_size, max_seq_len)

torch.Size([5, 13])
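The second dimension of that shape is the token count of the longest sentence in the batch. A rough way to see where 13 comes from is to split words from punctuation and add the two special tokens. This is only an estimate under the assumption that every word maps to a single WordPiece; `rough_token_count` is a hypothetical helper and will undercount whenever the real tokenizer splits a word into several pieces.

```python
import re

sentences = [
    "I love this product!",
    "This is the worst experience I've ever had.",
    "Absolutely fantastic!",
    "Not good, very disappointing.",
    "I'm happy with the service.",
]

def rough_token_count(text):
    """Rough estimate: lowercase, split words from punctuation, add [CLS] and [SEP]."""
    pieces = re.findall(r"\w+|[^\w\s]", text.lower())
    return len(pieces) + 2

lengths = [rough_token_count(s) for s in sentences]
print(lengths)       # per-sentence estimates
print(max(lengths))  # 13
```

The longest estimate belongs to the second sentence (11 word and punctuation pieces plus `[CLS]` and `[SEP]`), and every other sentence is padded up to that length.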

In [None]:
# Illustration of padding (IDs are illustrative; punctuation tokens are omitted for brevity)
# Sentence 1: "I love this product!"
# Sentence 2: "This is bad."
# Sentence 3: "Absolutely fantastic!"

| Sentence | Token IDs                          | Note                  |
| -------- | ---------------------------------- | --------------------- |
| S1       | [101, 1045, 2293, 2023, 4033, 102] | longest, no padding   |
| S2       | [101, 2023, 2003, 2919, 102, 0]    | padded to same length |
| S3       | [101, 6934, 1033, 102, 0, 0]       | padded                |

101 = [CLS], 102 = [SEP], 0 = [PAD]

In [None]:
sample_text = "I Love AI!"  # avoid naming this `input`, which would shadow the Python built-in

tokenizer.tokenize(sample_text)

['i', 'love', 'ai', '!']
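All four tokens above are whole words because they appear in the vocabulary as-is. When a word is missing, WordPiece falls back to a greedy longest-match-first split. The sketch below illustrates that idea on a tiny made-up vocabulary; `wordpiece` and `toy_vocab` are hypothetical, and the real `bert-base-uncased` vocabulary may split the same words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"love", "token", "##ize", "##r", "##s"}
print(wordpiece("love", toy_vocab))        # ['love']
print(wordpiece("tokenizers", toy_vocab))  # ['token', '##ize', '##r', '##s']
print(wordpiece("xyz", toy_vocab))         # ['[UNK]']
```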

### 📥 Step 3: Get Sentence Embeddings using BERT

In [None]:
model = BertModel.from_pretrained("bert-base-uncased")

# `inputs` is a dict-like object:
# {
#     'input_ids': tensor(...),       # token IDs of the sentences
#     'attention_mask': tensor(...),  # 1 for real tokens, 0 for padding
#     'token_type_ids': tensor(...),  # segment IDs, used for sentence pairs
# }
# `**inputs` unpacks it into keyword arguments, so the call below is equivalent to:
# outputs = model(
#     input_ids=inputs['input_ids'],
#     attention_mask=inputs['attention_mask'],
#     token_type_ids=inputs.get('token_type_ids', None)
# )
with torch.no_grad():  # no gradients are needed for feature extraction
    outputs = model(**inputs)

# outputs.last_hidden_state holds a contextual embedding for every token in every sentence.
# Shape: (batch, seq_len, hidden_size)
# We use the embedding of the [CLS] token as the sentence representation:
#   :  → all sentences in the batch (dimension 0)
#   0  → first token of each sequence, i.e. [CLS] (dimension 1)
#   :  → all hidden dimensions, 0 to 767 (dimension 2)
sentence_embeddings = outputs.last_hidden_state[:, 0, :]  # shape: (batch, hidden_size)

# The [CLS] vector attends to every token in the sentence, so it acts as a summary of the whole sentence for classification.

sentence_embeddings.shape



torch.Size([5, 768])
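The slice `[:, 0, :]` keeps one vector per sentence, which is why the sequence dimension disappears from the shape. The sketch below reproduces that indexing on a tiny made-up tensor written as nested Python lists, and contrasts it with masked mean pooling, a different and commonly used way to get a sentence vector. Mean pooling is not what this notebook does; it is shown only for comparison, and all values and the `masked_mean` helper are hypothetical.

```python
# Toy "last_hidden_state": 2 sentences x 3 tokens x 2 hidden dims (made-up values).
last_hidden_state = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.5, 0.5], [1.5, 2.5], [9.0, 9.0]],  # last position is padding
]
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

# Equivalent of last_hidden_state[:, 0, :]: the first token ([CLS]) of every sentence.
cls_embeddings = [sentence[0] for sentence in last_hidden_state]

def masked_mean(sentence, mask):
    """Average the token vectors, skipping positions where the mask is 0."""
    kept = [vec for vec, m in zip(sentence, mask) if m == 1]
    hidden = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(hidden)]

mean_embeddings = [masked_mean(s, m) for s, m in zip(last_hidden_state, attention_mask)]
print(cls_embeddings)   # [[1.0, 2.0], [0.5, 0.5]]
print(mean_embeddings)  # [[3.0, 4.0], [1.0, 1.5]]
```

Either way the result has one row per sentence and one column per hidden dimension, matching the `(5, 768)` shape above.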

### 🔍 Step 4: View Embeddings

In [None]:
sentence_embeddings  # These are 768-dimensional vectors for each sentence

tensor([[ 0.1877,  0.2440,  0.1015,  ..., -0.3153,  0.1309,  0.1728],
        [-0.1591,  0.4611, -0.1535,  ..., -0.2904,  0.2858,  0.4118],
        [-0.0552,  0.1704,  0.0682,  ..., -0.1736,  0.1650,  0.1812],
        [-0.4362,  0.0155, -0.1871,  ..., -0.1107,  0.3316,  0.4704],
        [ 0.1945,  0.2416, -0.0405,  ...,  0.1529,  0.3023,  0.4088]])
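Raw 768-dimensional vectors are hard to read by eye. One common way to inspect them is to compare two vectors with cosine similarity. The function below is a minimal pure-Python sketch with made-up two-dimensional inputs; `cosine_similarity` is a helper written for this illustration, not something the notebook defines.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated directions
print(round(same_direction, 6))  # 1.0
print(orthogonal)                # 0.0
```

With the notebook's tensor, the same function could be called as `cosine_similarity(sentence_embeddings[0].tolist(), sentence_embeddings[4].tolist())`.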

### ✅ Summary


- We used `BertTokenizer` to tokenize the text
- Passed the tokenized input to `BertModel` to get embeddings
- Used the [CLS] token's embedding as a sentence representation

These embeddings can now be used as input features for a classifier (e.g., logistic regression or a neural net).


In [None]:
# Add the Embeddings Back to DataFrame

df['embedding'] = sentence_embeddings.tolist()
df

|   | text                                        | label | embedding                                          |
| - | ------------------------------------------- | ----- | -------------------------------------------------- |
| 0 | I love this product!                        | 1     | [0.1876906454563141, 0.2440350502729416, 0.101...  |
| 1 | This is the worst experience I've ever had. | 0     | [-0.15914830565452576, 0.4611494541168213, -0....  |
| 2 | Absolutely fantastic!                       | 1     | [-0.05521172285079956, 0.1704026609659195, 0.0...  |
| 3 | Not good, very disappointing.               | 0     | [-0.43618062138557434, 0.015502016060054302, -...  |
| 4 | I'm happy with the service.                 | 1     | [0.1945129781961441, 0.2415696382522583, -0.04...  |


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Convert embeddings and labels to numpy arrays
X = sentence_embeddings.numpy()
y = df['label'].values

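# Split data into training and testing sets.
# With only 5 sentences, test_size=0.2 holds out a single example,
# so the test accuracy can only be 0.00 or 1.00 and says little about real performance.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)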

# Initialize and train the Logistic Regression model
log_reg = LogisticRegression(solver='liblinear') # 'liblinear' is a good choice for small datasets
log_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = log_reg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the test set: {accuracy:.2f}")

# You can also predict on the training set to see how well it fits the training data
y_train_pred = log_reg.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)
print(f"Accuracy on the training set: {train_accuracy:.2f}")

# To predict the sentiment of a new sentence:
# 1. Tokenize the new sentence
# 2. Get its BERT embedding
# 3. Use the trained Logistic Regression model to predict the label

def predict_sentiment(text, model, tokenizer, classifier):
    inputs = tokenizer(
        [text],
        padding=True,
        truncation=True,
        return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(**inputs)
    sentence_embedding = outputs.last_hidden_state[:, 0, :].numpy()
    prediction = classifier.predict(sentence_embedding)
    return "Positive" if prediction[0] == 1 else "Negative"

# Example prediction on a new sentence
new_sentence = "This is an amazing day!"
predicted_sentiment = predict_sentiment(new_sentence, model, tokenizer, log_reg)
print(f"The sentiment of '{new_sentence}' is: {predicted_sentiment}")

new_sentence_2 = "I hate this situation."
predicted_sentiment_2 = predict_sentiment(new_sentence_2, model, tokenizer, log_reg)
print(f"The sentiment of '{new_sentence_2}' is: {predicted_sentiment_2}")


Accuracy on the test set: 0.00
Accuracy on the training set: 1.00
The sentiment of 'This is an amazing day!' is: Positive
The sentiment of 'I hate this situation.' is: Positive
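A test accuracy of 0.00 next to a training accuracy of 1.00 looks alarming, but with five sentences it mostly reflects how small the held-out set is. The arithmetic can be sketched as follows; `possible_test_accuracies` is a hypothetical helper, and it relies on scikit-learn rounding a fractional `test_size` up to a whole number of samples.

```python
import math

def possible_test_accuracies(n_samples, test_size):
    """Return the held-out set size and every accuracy value it can produce."""
    # A fractional test_size is rounded up to a whole number of samples.
    n_test = math.ceil(test_size * n_samples)
    return n_test, [correct / n_test for correct in range(n_test + 1)]

n_test, accuracies = possible_test_accuracies(n_samples=5, test_size=0.2)
print(n_test)      # 1
print(accuracies)  # [0.0, 1.0]
```

One misclassified sentence is enough to produce 0.00, and four training sentences are too few for the classifier to generalize, which is consistent with "I hate this situation." being labeled Positive.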


## Sentiment Analysis on IMDB Dataset using BERT Embeddings

In [None]:
# The IMDB dataset is available through TensorFlow Datasets (tfds) and as a Keras built-in dataset.
# We use the Keras built-in version for simplicity.

!pip install tensorflow --quiet

import tensorflow as tf
import pandas as pd

# Load the IMDB dataset
# num_words=10000 means we will only consider the top 10,000 most frequent words
(train_data, train_labels), (test_data, test_labels) = tf.keras.datasets.imdb.load_data(num_words=10000)

# The dataset consists of sequences of integers, where each integer represents a word.
# We need to convert these integers back to words to create a readable text column for the DataFrame.
# Get the word index mapping
word_index = tf.keras.datasets.imdb.get_word_index()

# train_data[0]  # first review in integer format
# # Output (sample):
# # [1, 14, 22, 16, 43, ...]
# train_labels[0]
# # Output:
# # 1  (positive review)
# Each number represents a specific word in the dataset.

# word_index['movie']  # might output 123; This will allow us to convert integers back to readable words

# Reverse the word index to map integers back to words
reverse_word_index = {value: key for key, value in word_index.items()}

# reverse_word_index[123]  # 'movie' (if word_index['movie'] is 123)

# Helper function to decode a review from integers to words
def decode_review(encoded_review):
    # Keras shifts every word index by 3 to make room for the reserved tokens:
    #   0 → <PAD>; 1 → <START>; 2 → <UNK>
    # So an encoded value of 123 corresponds to word index 123 - 3 = 120.
    # Reserved tokens and anything else missing from the index decode to '?',
    # e.g. decode_review([1, 14, 22, 16]) returns a string that starts with "? ".
    return ' '.join([reverse_word_index.get(i - 3, '?') for i in encoded_review])

# Decode the training and testing data
decoded_train_reviews = [decode_review(review) for review in train_data]
decoded_test_reviews = [decode_review(review) for review in test_data]
# Each integer sequence is now readable text, so we have lists of strings instead of numbers.
# Example: decoded_train_reviews[0]
# Output (start): "? this film was just brilliant casting ..."

# Create DataFrames
df_train = pd.DataFrame({'text': decoded_train_reviews, 'label': train_labels})
df_test = pd.DataFrame({'text': decoded_test_reviews, 'label': test_labels})

# Concatenate train and test dataframes for a single dataset
df_imdb = pd.concat([df_train, df_test], ignore_index=True)
# This gives all 50,000 reviews (25k train + 25k test).
# Each review uses only the top 10,000 words; rarer words are encoded as <UNK> (index 2) and decode to '?'.

print("IMDB DataFrame created successfully:")
print(df_imdb.head())
print("\nDataFrame Info:")
df_imdb.info()
print("\nValue counts for labels:")
print(df_imdb['label'].value_counts())

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
IMDB DataFrame created successfully:
                                                text  label
0  ? this film was just brilliant casting locatio...      1
1  ? big hair big boobs bad music and a giant saf...      0
2  ? this has to be one of the worst films of the...      0
3  ? the ? ? at storytelling the traditional sort...      1
4  ? worst mistake of my life br br i picked this...      0

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    50000 non-null  object
 1   lab
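Every decoded review starts with `?` because the reserved IDs (padding, start, unknown) have no entry in the word index. The index shift can be sketched on a toy vocabulary as below. `toy_word_index`, `RESERVED`, and `decode` are hypothetical names for illustration, and the reserved-token labels are just readable placeholders; the offset of 3 is the Keras default assumed by the notebook's `decode_review`.

```python
# Toy word index shaped like imdb.get_word_index(): word -> rank (1 = most frequent).
toy_word_index = {"the": 1, "film": 2, "great": 3}

INDEX_FROM = 3  # real words start after the reserved IDs
RESERVED = {0: "<PAD>", 1: "<START>", 2: "<UNK>"}

# Shift every rank by the offset so it matches the encoded reviews.
reverse_index = {rank + INDEX_FROM: word for word, rank in toy_word_index.items()}

def decode(encoded):
    """Map encoded IDs back to text, naming reserved IDs instead of printing '?'."""
    return " ".join(RESERVED.get(i) or reverse_index.get(i, "?") for i in encoded)

print(decode([1, 4, 5, 2, 6]))  # <START> the film <UNK> great
```

Since BERT's tokenizer would treat `?` as ordinary punctuation, naming or dropping the reserved IDs before embedding is a design choice worth considering; the notebook keeps the simpler `?` form.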

In [None]:
# Embedding all 50,000 IMDB reviews (25,000 train + 25,000 test) at once might exceed Colab's memory limit,
# so we work with a smaller subset and compute the embeddings in batches.

# For demonstration, sample 1,000 reviews from train and 1,000 from test.
df_train_subset = df_train.sample(n=1000, random_state=42).copy() # Or use .head(1000)
df_test_subset = df_test.sample(n=1000, random_state=42).copy()   # Or use .head(1000)

# Concatenate for easier processing (if you need embeddings for the whole subset)
df_subset = pd.concat([df_train_subset, df_test_subset], ignore_index=True)

print(f"Using a subset of {len(df_subset)} reviews.")

import numpy as np

# Function to get BERT embeddings in batches
def get_bert_embeddings_batch(texts, tokenizer, model, batch_size=32):
    # Pick the device once and move the model to it (GPU if available)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()  # Set model to evaluation mode

    embeddings = []  # One (batch, 768) array per batch
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i + batch_size]
            inputs = tokenizer(
                batch_texts,
                padding=True,
                truncation=True,
                max_length=512,  # BERT's max sequence length
                return_tensors="pt"
            )
            # inputs looks like:
            # {'input_ids': tensor([[101, 2023, 2003, ...], [...]]),  # token IDs
            #  'attention_mask': tensor([[1, 1, 1, ...], [...]])}     # 1 = real token, 0 = padding
            # Each value is a tensor; .to(device) moves it to the same device as the model
            inputs = {k: v.to(device) for k, v in inputs.items()}

            outputs = model(**inputs)  # contextual embeddings for every token
            # Use the [CLS] embedding as the sentence representation;
            # move it back to the CPU before converting to NumPy
            batch_embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()
            embeddings.append(batch_embeddings)

    # Stack the per-batch arrays into one (num_samples, 768) array, e.g. two (2, 3) batches
    # [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]] and [[0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
    # become a single (4, 3) array.
    return np.vstack(embeddings)

# Usage example:
# texts = ["I love this movie", "This movie was terrible"]
# embeddings = get_bert_embeddings_batch(texts, tokenizer, model, batch_size=1)
# embeddings.shape  # (2, 768)

# Get embeddings for the subset (adjust batch_size based on memory)
# Lower batch_size if you encounter CUDA out of memory errors
batch_size = 64
print(f"Generating BERT embeddings in batches of {batch_size}...")
subset_embeddings = get_bert_embeddings_batch(df_subset['text'].tolist(), tokenizer, model, batch_size=batch_size)

print("Embeddings generated. Shape:", subset_embeddings.shape)

# Add embeddings back to the dataframe subset
df_subset['embedding'] = list(subset_embeddings) # Store as list of numpy arrays or convert to list of lists

# You can now split this subset back into train and test based on the original indices
# Or directly use df_train_subset and df_test_subset to get embeddings separately

# Example: Get embeddings for the train and test subsets separately
print("Generating embeddings for original train and test subsets...")
train_subset_embeddings = get_bert_embeddings_batch(df_train_subset['text'].tolist(), tokenizer, model, batch_size=batch_size)
test_subset_embeddings = get_bert_embeddings_batch(df_test_subset['text'].tolist(), tokenizer, model, batch_size=batch_size)

df_train_subset['embedding'] = list(train_subset_embeddings)
df_test_subset['embedding'] = list(test_subset_embeddings)

print("Embeddings added to subset dataframes.")
print("df_train_subset with embeddings:")
print(df_train_subset.head())
print("\ndf_test_subset with embeddings:")
print(df_test_subset.head())

# These embeddings (train_subset_embeddings and test_subset_embeddings)
# can now be used as features for training a classifier.
# For example, using Logistic Regression again:

X_train_subset = np.vstack(df_train_subset['embedding'].values) # Stack embeddings back to a single array
y_train_subset = df_train_subset['label'].values

X_test_subset = np.vstack(df_test_subset['embedding'].values)
y_test_subset = df_test_subset['label'].values

# Initialize and train the Logistic Regression model on the subset data
log_reg_subset = LogisticRegression(solver='liblinear', max_iter=200) # Increased max_iter just in case
print("\nTraining Logistic Regression on BERT embeddings...")
log_reg_subset.fit(X_train_subset, y_train_subset)

# Make predictions on the test subset
y_pred_subset = log_reg_subset.predict(X_test_subset)

# Evaluate the model
accuracy_subset = accuracy_score(y_test_subset, y_pred_subset)
print(f"Accuracy on the IMDB subset test set (using BERT embeddings): {accuracy_subset:.2f}")

# To use the full dataset (50,000 reviews), you would need to iterate through it
# in batches and generate embeddings batch by batch, potentially saving them
# to disk or processing them sequentially for training.

# Example (Conceptual - for processing the full dataset iteratively)
# all_embeddings = []
# for i in range(0, len(df_imdb), batch_size):
#     batch_texts = df_imdb['text'][i:i+batch_size].tolist()
#     batch_embeddings = get_bert_embeddings_batch(batch_texts, tokenizer, model, batch_size=batch_size)
#     # Process or store batch_embeddings (e.g., train a classifier incrementally or save)
#     # Be mindful of memory if trying to collect all 50,000 embeddings into a single list/array

# If you plan to train a deep learning model using these embeddings,
# you would typically pass batches of (embeddings, labels) to your training loop.

Using a subset of 2000 reviews.
Generating BERT embeddings in batches of 64...
Embeddings generated. Shape: (2000, 768)
Generating embeddings for original train and test subsets...
Embeddings added to subset dataframes.
df_train_subset with embeddings:
                                                    text  label  \
6868   ? there's a major difference between releasing...      0   
24016  ? when a small ? named ? ? ? a magic ring from...      1   
9668   ? the characters are cliched and predictable w...      0   
13640  ? soylent green is a really good movie actuall...      1   
14018  ? the us appear to run the uk police who all r...      0   

                                               embedding  
6868   [0.18785325, -0.11231941, 0.15215641, -0.17572...  
24016  [-0.059974123, -0.0121877985, 0.39474112, -0.2...  
9668   [0.0031689848, -0.26403102, 0.29375637, -0.168...  
13640  [0.10213518, -0.14593977, 0.020417262, 0.32758...  
14018  [-0.057021294, 0.1970391, 0.27187058, -0.4
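The batching loop inside `get_bert_embeddings_batch` can be sketched on its own to see how many forward passes a given `batch_size` implies. `iter_batches` is a hypothetical helper and the review strings are stand-ins for `df_subset['text'].tolist()`.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

reviews = [f"review {i}" for i in range(2000)]  # stand-in for the 2,000 subset reviews
batch_sizes = [len(batch) for batch in iter_batches(reviews, batch_size=64)]
print(len(batch_sizes))  # 32 forward passes
print(batch_sizes[-1])   # 16 reviews in the final, partial batch
```

The same pattern extends to the full 50,000 reviews; only the current batch needs to be in memory if each batch of embeddings is saved or consumed before the next one is computed.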