<a href="https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [30]:

!pip install transformers




In [0]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

## Importing the dataset
We'll use pandas to read the dataset and load it into a dataframe.

In [32]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
#with open('/content/drive/My Drive/Colab Notebooks/data_processed.csv', 'r') as f:
#  f.open()
#df = pd.read_csv('data_processed.csv')
df = pd.read_csv("/content/drive/My Drive/Colab Notebooks/data_processed.csv", sep="#")

In [34]:
df.shape

(10876, 9)

## Loading the Pre-trained BERT model
Let's now load a pre-trained BERT model. 

In [0]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Right now, the variable `model` holds a pretrained distilBERT model -- a version of BERT that is smaller, but much faster and requiring a lot less memory.

## Model #1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to so some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the sentences -- break them up into word and subwords in the format BERT is comfortable with.

In [0]:
tags_simple = pd.read_csv("/content/drive/My Drive/Colab Notebooks/tags_simple.csv")
#print(tags_simple.isna())
tags_simple.fillna("", inplace=True)
#print(tags_simple.head())
tags_tokenized = tags_simple["tags"].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

In [86]:
!pip install emoji
print(tokenizer.tokenize('I ❤️ you'))
s = u'\U0001f600'
from emoji.unicode_codes import UNICODE_EMOJI

print(UNICODE_EMOJI[s])

Collecting emoji
[?25l  Downloading https://files.pythonhosted.org/packages/40/8d/521be7f0091fe0f2ae690cc044faf43e3445e0ff33c574eae752dd7e39fa/emoji-0.5.4.tar.gz (43kB)
[K     |███████▌                        | 10kB 19.2MB/s eta 0:00:01[K     |███████████████                 | 20kB 2.2MB/s eta 0:00:01[K     |██████████████████████▋         | 30kB 2.8MB/s eta 0:00:01[K     |██████████████████████████████▏ | 40kB 3.1MB/s eta 0:00:01[K     |████████████████████████████████| 51kB 2.0MB/s 
[?25hBuilding wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... [?25l[?25hdone
  Created wheel for emoji: filename=emoji-0.5.4-cp36-none-any.whl size=42176 sha256=df7e24d81e7af2afa2794a93b39560974906bfbb2c1778e85294fed751280293
  Stored in directory: /root/.cache/pip/wheels/2a/a9/0a/4f8e8cce8074232aba240caca3fade315bb49fac68808d1a9c
Successfully built emoji
Installing collected packages: emoji
Successfully installed emoji-0.5.4
['i', '[UNK]', 'you']
:grinning_fac

In [0]:
tokenized = df["text"].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tokenization-2-token-ids.png" />

### Padding
After tokenization, `tokenized` is a list of sentences -- each sentences is represented as a list of tokens. We want BERT to process our examples all at once (as one batch). It's just faster that way. For that reason, we need to pad all lists to the same size, so we can represent the input as one 2-d array, rather than a list of lists (of different lengths).

In [0]:
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

In [0]:
max_len = 0
for i in tags_tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

tags_padded = np.array([i + [0]*(max_len-len(i)) for i in tags_tokenized.values])

Our dataset is now in the `padded` variable, we can view its dimensions below:

In [40]:
np.array(tags_padded).shape

(10876, 34)

### Masking
If we directly send `padded` to BERT, that would slightly confuse it. We need to create another variable to tell it to ignore (mask) the padding we've added when it's processing its input. That's what attention_mask is:

In [41]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(10876, 54)

In [42]:
tags_attention_mask = np.where(tags_padded != 0, 1, 0)
tags_attention_mask.shape

(10876, 34)

## Model #1: And Now, Deep Learning!
Now that we have our model and inputs ready, let's run our model!

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tutorial-sentence-embedding.png" />

The `model()` function runs our sentences through BERT. The results of the processing will be returned into `last_hidden_states`.

In [0]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

In [0]:
tags_input_ids = torch.tensor(tags_padded)  
tags_attention_mask = torch.tensor(tags_attention_mask)

with torch.no_grad():
    tags_last_hidden_states = model(tags_input_ids, attention_mask=tags_attention_mask)

Let's slice only the part of the output that we need. That is the output corresponding the first token of each sentence. The way BERT does sentence classification, is that it adds a token called `[CLS]` (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" />

We'll save those in the `features` variable, as they'll serve as the features to our logitics regression model.

In [0]:
features = last_hidden_states[0][:,0,:].numpy()
tags_features = tags_last_hidden_states[0][:,0,:].numpy()

In [0]:
pd.DataFrame(features).to_csv('/content/drive/My Drive/Colab Notebooks/features.csv')
pd.DataFrame(tags_features).to_csv('/content/drive/My Drive/Colab Notebooks/tags_features.csv')

In [0]:
#cat = pd.read_csv("/content/drive/My Drive/Colab Notebooks/cat.csv")
#tags = pd.read_csv("/content/drive/My Drive/Colab Notebooks/tags.csv")
# добавление тегов в виде разреженной матрицы изрядно уменьшило результат. 
#ff = np.hstack((features, tags))

In [0]:
ff = np.hstack((features, tags_features))

The labels indicating which sentence is positive and negative now go into the `labels` variable

In [49]:
ff.shape

(10876, 1536)

In [0]:
labels = df["target"]

In [73]:
#с учетом тегов
#train_f = ff[:7613]
#test_f = ff[7613:]
# без учета тегов
train_f = features[:7613]
test_f = features[7613:]

train_l = labels[:7613]
test_l = labels[7613:]

print(len(train_l))

7613


## Model #2: Train/Test Split
Let's now split our datset into a training set and testing set (even though we're using 2,000 sentences from the SST2 training set).

In [0]:
train_features, test_features, train_labels, test_labels = train_test_split(train_f, train_l, shuffle="False")

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-train-test-split-sentence-embedding.png" />

### [Bonus] Grid Search for Parameters
We can dive into Logistic regression directly with the Scikit Learn default parameters, but sometimes it's worth searching for the best value of the C parameter, which determines regularization strength.

In [75]:
parameters = {'C': np.linspace(0.0001, 100, 20)}
grid_search = GridSearchCV(LogisticRegression(), parameters)
grid_search.fit(train_features, train_labels)

print('best parameters: ', grid_search.best_params_)
print('best scrores: ', grid_search.best_score_)

best parameters:  {'C': 5.263252631578947}
best scrores:  0.8008397402346239


We now train the LogisticRegression model. If you've chosen to do the gridsearch, you can plug the value of C into the model declaration (e.g. `LogisticRegression(C=5.2)`).

In [76]:
lr_clf = LogisticRegression( C = grid_search.best_params_['C'])
lr_clf.fit(train_features, train_labels)

LogisticRegression(C=5.263252631578947, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

<img src="https://jalammar.github.io/images/distilBERT/bert-training-logistic-regression.png" />

## Evaluating Model #2
So how well does our model do in classifying sentences? One way is to check the accuracy against the testing dataset:

In [77]:
lr_clf.score(test_features, test_labels)

0.8182773109243697

In [78]:
Y_pred = lr_clf.predict(test_f)
len(Y_pred)

3263

In [0]:
import time
submission = pd.DataFrame({"id": df["id"][7613:], "target": Y_pred }, dtype = "int")
fname = '/content/drive/My Drive/Colab Notebooks/' + time.ctime() + 'submission.csv'
submission.to_csv(fname, index=False, header="id,target")

How good is this score? What can we compare it against? Let's first look at a dummy classifier:

In [80]:
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()

scores = cross_val_score(clf, train_features, train_labels)
print("BNB classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

BNB classifier score: 0.762 (+/- 0.02)


In [81]:
np.array(test_labels).reshape(-1, 1).shape

(1904, 1)

In [82]:
from sklearn.svm import LinearSVC
clf = LinearSVC( dual=False, tol=1e-3)
clf.fit(train_features, np.array(train_labels).reshape(-1, 1))
print(test_features.shape, test_labels.shape)
scores = clf.score(test_features, np.array(test_labels).reshape(-1, 1))
print("LinearSVC classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

(1904, 768) (1904,)
LinearSVC classifier score: 0.810 (+/- 0.00)


So our model clearly does better than a dummy classifier. But how does it compare against the best models?

## Proper SST2 scores
For reference, the [highest accuracy score](http://nlpprogress.com/english/sentiment_analysis.html) for this dataset is currently **96.8**. DistilBERT can be trained to improve its score on this task – a process called **fine-tuning** which updates BERT’s weights to make it achieve a better performance in this sentence classification task (which we can call the downstream task). The fine-tuned DistilBERT turns out to achieve an accuracy score of **90.7**. The full size BERT model achieves **94.9**.



And that’s it! That’s a good first contact with BERT. The next step would be to head over to the documentation and try your hand at [fine-tuning](https://huggingface.co/transformers/examples.html#glue). You can also go back and switch from distilBERT to BERT and see how that works.