# BERT as an Autoencoder

This demo shows how BERT can be used as an autoencoder: the input to the model is text and the output is an embedding vector, which can serve as features for downstream ML models and algorithms, be used to calculate similarities, and so on.

**NOTE**

If you are using an AWS notebook instance, select the appropriate kernel (conda_pytorch).

Install requirements

In [2]:
%pip install torch --quiet
%pip install transformers --quiet
%pip install -U scikit-learn --quiet
%pip install s3fs --quiet

Note: you may need to restart the kernel to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
boto3 1.28.85 requires botocore<1.32.0,>=1.31.85, but you have botocore 1.31.64 which is incompatible.


In [3]:
import torch
import transformers

print("done")

done


Load pre-trained BERT model and tokenizer

In [4]:
model_name = 'bert-base-uncased'

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModel.from_pretrained(model_name)

In [5]:
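def get_bert_vector(text):
    """Return a 768-dimensional embedding: the mean of BERT's last hidden states."""
    # Debug aid: print very short inputs so they can be inspected
    if len(text) < 5:
        print(text)

    # Tokenize the text, adding [CLS]/[SEP] and truncating to BERT's 512-token limit
    token_ids = tokenizer.encode(
        text, add_special_tokens=True, truncation=True, max_length=512
    )
    input_ids = torch.tensor(token_ids).unsqueeze(0)  # shape: (1, seq_len)

    # Get BERT embeddings without tracking gradients
    with torch.no_grad():
        outputs = model(input_ids)
        last_hidden_states = outputs[0]  # shape: (1, seq_len, 768)
        mean_last_hidden_states = torch.mean(last_hidden_states, dim=1)

    # Return the mean of the last hidden states as a NumPy vector of length 768
    return mean_last_hidden_states.numpy()[0]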
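
The pooling step in `get_bert_vector` reduces a `(1, seq_len, 768)` tensor to a single 768-dimensional vector by averaging over the token axis. The sketch below reproduces that arithmetic on a toy example in plain Python (no `torch` needed). `mean_pool` and its optional `mask` argument are hypothetical names introduced for illustration; the mask only matters if several texts are padded to a common length and embedded as a batch, which the function above does not do.

```python
def mean_pool(token_vectors, mask=None):
    """Average per-token vectors into one vector.

    token_vectors: list of equal-length lists, one per token.
    mask: optional list of 0/1 flags; tokens flagged 0 (padding) are skipped.
    """
    if mask is None:
        mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("no tokens left to average")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three "tokens" with 2-dimensional hidden states (BERT-base would have 768).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
pooled = mean_pool(tokens)                         # plain mean over all tokens
pooled_masked = mean_pool(tokens, mask=[1, 1, 0])  # last token treated as padding
print(pooled)
print(pooled_masked)
```

Mean pooling is one of several reasonable choices; using the hidden state of the `[CLS]` token is another common option.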

In [6]:
bertvector = get_bert_vector("Hello auto encoder. This is a string of text. I want a vector back.")
print(len(bertvector))
bertvector

768


array([ 6.86645880e-02, -1.46198228e-01,  3.27494562e-01, -1.11544654e-01,
       -5.77578181e-03,  2.07549185e-01,  2.26298943e-01,  4.75044757e-01,
       -9.00262594e-02, -2.00914428e-01, -1.82463482e-01, -2.51878500e-01,
       -2.99872011e-01,  3.26931775e-01, -2.11426437e-01,  1.59557194e-01,
       -2.39347100e-01,  1.83264837e-01, -2.36361623e-01,  1.32732868e-01,
        5.52591383e-02,  7.64939785e-02, -3.87540579e-01, -1.04481421e-01,
        7.45240808e-01, -5.53761721e-02, -4.20232505e-01, -1.02724880e-01,
       -7.88896084e-01, -1.33740902e-01,  3.34353708e-02,  2.44405568e-01,
        3.62050943e-02,  1.36475325e-01, -1.88355207e-01, -1.69674858e-01,
        1.46547109e-01, -1.43513024e-01, -3.38513017e-01,  3.24504554e-01,
       -5.68893731e-01, -4.36611891e-01, -5.05800322e-02, -1.72781378e-01,
        1.98797181e-01, -3.77498567e-01, -2.21412092e-01,  2.05047920e-01,
        1.33984342e-01,  8.64517093e-02, -7.78364658e-01,  3.28900397e-01,
       -3.08279395e-01,  

## Calculate Cosine Similarity between two sentences

In [None]:
# Embed two paraphrases; reshape to (1, 768) because cosine_similarity
# expects 2D arrays of shape (n_samples, n_features)
bertvector_1 = get_bert_vector("She eagerly accepted the job offer.").reshape(1, -1)
bertvector_2 = get_bert_vector("With great enthusiasm, she embraced the job opportunity.").reshape(1, -1)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
cosine_similarity(bertvector_1, bertvector_2)

array([[0.8592123]], dtype=float32)
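
For reference, cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms, so it depends only on direction, not magnitude. Below is a minimal plain-Python sketch of that formula. The helper name `cosine` is hypothetical; `cosine_similarity` from `sklearn` computes the same quantity for every pair of rows in its inputs.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])                # perpendicular vectors
print(same_direction, orthogonal)
```

Parallel vectors score 1 (up to floating-point rounding) and perpendicular vectors score 0, so a value of about 0.86 means the two sentence vectors point in broadly similar directions.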

# ML Model using BERT Vectors

In [None]:
import pandas as pd
#df = pd.read_csv("s3://webage-genai-data/sentiment_data_for_exercise/sentiment_data.csv")
df = pd.read_csv("s3://btcampdata/sentiment_data.csv")
df.tail()

| Unnamed: 0 | text | sentiment |
|---|---|---|
| 95 | OPS sorry Queen Mom | negative |
| 96 | I overslept headache | negative |
| 97 | just got home from work.... and is chugging do... | neutral |
| 98 | Im trying to move and get up but it just hurts... | negative |
| 99 | I can`t wait to see UP! How dare have a 'real... | neutral |


In [None]:
df.describe()

In [None]:
df['sentiment'].value_counts()

positive    38
neutral     35
negative    27
Name: sentiment, dtype: int64

In [None]:
df['vectors'] = df['text'].apply(lambda x: list(get_bert_vector(x)))

This step takes a little while, because every row of text is passed through BERT.

In [None]:
df

Encode the "sentiment" column

In [None]:
from sklearn.preprocessing import LabelEncoder

# Create an instance of LabelEncoder
label_encoder = LabelEncoder()

# Fit the encoder on the sentiment labels and replace
# the string labels with their integer codes
df['sentiment'] = label_encoder.fit_transform(df['sentiment'])
df
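
`LabelEncoder` assigns integer codes to the sorted unique labels, so with the three classes in this dataset the mapping should be `negative` → 0, `neutral` → 1, `positive` → 2. The plain-Python sketch below mirrors that behaviour on a few sample labels; the variable names are illustrative only.

```python
labels = ["positive", "neutral", "negative", "neutral", "positive"]

# Sorted unique labels define the integer codes, as LabelEncoder does.
classes = sorted(set(labels))
to_code = {label: code for code, label in enumerate(classes)}

encoded = [to_code[label] for label in labels]
decoded = [classes[code] for code in encoded]  # inverse mapping back to strings
print(classes)
print(encoded)
```

On the fitted encoder itself, `label_encoder.classes_` lists the classes in code order and `label_encoder.inverse_transform` maps codes back to the original strings.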

Prepare vectors for training

In [None]:
X = df['vectors'].apply(lambda x: pd.Series(x))
Y = df['sentiment']

In [None]:
X

In [None]:
Y

Split data into train/test datasets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.25, random_state=125
)

Fit the model

In [None]:
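from sklearn.naive_bayes import GaussianNB

# Build a Gaussian Naive Bayes classifier. It gets its own name so it does not
# overwrite `model`, the BERT encoder that get_bert_vector still relies on.
classifier = GaussianNB()

# Model training
classifier.fit(X_train, y_train)

Evaluate

In [None]:
from sklearn.metrics import accuracy_score, f1_score

y_pred = classifier.predict(X_test)

# scikit-learn metrics take the true labels first, then the predictions
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")

print("Accuracy:", accuracy)
print("F1 score:", f1)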
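
With `average="weighted"`, `f1_score` computes F1 separately for each class and averages the results weighted by the number of true samples in that class (its support), which is why the true labels should be passed first. Below is a plain-Python sketch of that calculation on made-up labels. The function name and the toy data are hypothetical and are not results from this notebook.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += f1 * n_true
    return total / len(y_true)


y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]
accuracy_toy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
f1_toy = weighted_f1(y_true_toy, y_pred_toy)
print(accuracy_toy, f1_toy)
```

Accuracy treats every sample equally, while weighted F1 also reflects how precision and recall vary from class to class, which is useful when the classes are not perfectly balanced.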