In [1]:
import numpy as np
import pandas as pd
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

## Importing the dataset
We'll use pandas to read the dataset and load it into a dataframe.

In [2]:
#from google.colab import drive
#drive.mount('/content/drive')


In [2]:
#with open('/content/drive/My Drive/Colab Notebooks/data_processed.csv', 'r') as f:
#  f.open()
df = pd.read_csv('data_proc3.csv')
#df = pd.read_csv("/content/drive/My Drive/Colab Notebooks/data_prep_join.csv")

In [3]:
df.shape
df[df["target"].isna()].shape

(3263, 7)

In [4]:
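test_size = 3263                  # rows with a missing target (count from the previous cell)
train_size = len(df) - test_size  # len(df) is the number of rows
# This split assumes the unlabeled rows sit at the end of the file.
test_df = df.tail(test_size)
train_df = df.head(train_size)
train_df.shape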

(7613, 7)

## Loading the Pre-trained BERT model
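Let's now load a pre-trained BERT model. The checkpoints considered are listed below; the `tag` list in the next cell abbreviates them in the same order.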

  * bert-base-cased-finetuned-mrpc
  * bert-large-uncased
  * distilbert-base-multilingual-cased
  * distilbert-base-uncased


In [5]:
tag = ["bbfm", "blu", "dbmc", "dbu"]

In [7]:
# For DistilBERT (base model: the forward pass returns hidden states, not a loss or logits):
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

# Prepare labels: only the first 7613 (training) rows have a target.
# They are meant for the downstream classifier, not for BERT feature extraction.
labels = torch.tensor(df["target"].iloc[:7613].values, dtype=torch.long)

Right now, the variable `model` holds a pretrained distilBERT model -- a version of BERT that is smaller, much faster, and far less memory-hungry.

## Model #1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the sentences -- break them up into words and subwords in the format BERT expects.

In [8]:
tokenized = df["text"].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
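
To make "words and subwords" concrete, here is a toy sketch of greedy longest-match subword splitting in the style of WordPiece. It is not the tokenizer loaded above: `toy_vocab`, `wordpiece` and `toy_tokenize` are hypothetical names, and the vocabulary is made up for illustration. A real BERT vocabulary has tens of thousands of entries and maps each piece to an integer id.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(sentence, vocab):
    """Lowercase, split on whitespace, and wrap with [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Made-up vocabulary, for illustration only.
toy_vocab = {"[CLS]", "[SEP]", "[UNK]", "fire", "flood", "##ing", "##s",
             "in", "the", "city"}

toy_tokens = toy_tokenize("Flooding in the city", toy_vocab)
print(toy_tokens)
```

The `[CLS]` and `[SEP]` markers are what `add_special_tokens=True` adds in the real call.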

In [9]:
# Length of the longest tokenized sentence
max_len = max(len(i) for i in tokenized.values)

# Right-pad every sentence with 0 up to max_len
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

Our dataset is now in the `padded` variable: a 2-D array with one row per sentence and `max_len` columns.
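
As a small illustration of the padding step, the sketch below pads three short id lists to a common length and reports the resulting dimensions. The helper `pad_to_longest` and the token ids are hypothetical and chosen only for the example.

```python
def pad_to_longest(sequences, pad_id=0):
    """Right-pad every list with pad_id up to the longest list's length."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]


# Illustrative token ids, not taken from the dataset.
toy_ids = [[101, 7592, 102], [101, 2023, 2003, 1037, 102], [101, 102]]

toy_padded = pad_to_longest(toy_ids)
toy_shape = (len(toy_padded), len(toy_padded[0]))
print(toy_shape)  # (number of sentences, padded length)
```

Every row now has the same length, which is what lets the whole batch be stored as one rectangular array.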

### Masking
If we sent `padded` to BERT directly, the model would attend to the padding as if it were real input. We need another array that tells it to ignore (mask) the padding we've added. That's what `attention_mask` is:

In [10]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(10840, 116)
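
A toy version of the same masking expression, assuming 0 is used only as the padding id. The array `toy_padded` is made up for illustration; summing each mask row recovers the unpadded sentence lengths.

```python
import numpy as np

# Two illustrative padded sentences; 0 marks padding.
toy_padded = np.array([[101, 7592, 102, 0, 0],
                       [101, 2023, 2003, 1037, 102]])

# 1 where there is a real token, 0 where there is padding
toy_mask = np.where(toy_padded != 0, 1, 0)

# Number of real tokens per sentence
real_lengths = toy_mask.sum(axis=1)
print(toy_mask)
print(real_lengths)
```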

## Model #1: And Now, Deep Learning!
Now that we have our model and inputs ready, let's run our model!

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tutorial-sentence-embedding.png" />

Calling `model()` runs our sentences through BERT. The results are returned in `last_hidden_states`.

In [11]:
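input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

# Inference only: no_grad() skips gradient tracking and saves memory.
# This assumes `model` is a base (Distil)BERT model, which takes no labels;
# its first output is the last hidden state, shaped (sentences, tokens, hidden size).
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)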


Let's slice out only the part of the output that we need: the output corresponding to the first token of each sentence. For sentence classification, BERT adds a token called `[CLS]` (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" />

We'll save those in the `features` variable, as they'll serve as the features for our logistic regression model.

In [12]:
features = last_hidden_states[0][:,0,:].numpy()


In [13]:
features.shape

(10840, 768)
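
To see what the `[:, 0, :]` slice does, here is a sketch on a small random array standing in for the model output. The sizes 4, 6 and 8 are arbitrary stand-ins for the number of sentences, the padded length and the hidden size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the hidden states: (sentences, tokens, hidden size)
toy_hidden = rng.normal(size=(4, 6, 8))

# Keep all sentences, only token position 0 ([CLS]), all hidden units
toy_features = toy_hidden[:, 0, :]
print(toy_features.shape)  # one vector per sentence
```

The token axis disappears, leaving a 2-D matrix with one row per sentence, which is the shape a classical classifier expects.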

In [14]:
pd.DataFrame(features).to_csv('features_'+tag[3]+'.csv')

