# 1. Setup
## 1.1. Using Colab GPU for Training



In [1]:
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

SystemError: GPU device not found

In [2]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: GeForce GT 745M


    Found GPU0 GeForce GT 745M which is of cuda capability 3.0.
    PyTorch no longer supports this GPU because it is too old.
    The minimum cuda capability that we support is 3.5.
    


## 1.2. Installing the Hugging Face Library


In [3]:
!pip install transformers


# 2. Loading CoLA Dataset
## 2.1. Download & Extract

In [4]:
# !pip install wget

In [6]:
import wget
import os

print('Downloading dataset...')

# The URL for dataset zip file.
url = 'https://nyu-mll.github.io/CoLA/cola_public_1.1.zip'

# Download the file (if we haven't already)
if not os.path.exists('./cola_public_1.1.zip'):
    wget.download(url, './cola_public_1.1.zip')

Downloading dataset...


In [7]:
# Unzip the dataset (if we haven't already)
if not os.path.exists('./cola_public/'):
    !unzip cola_public_1.1.zip

Archive:  cola_public_1.1.zip
   creating: cola_public/
  inflating: cola_public/README      
   creating: cola_public/tokenized/
  inflating: cola_public/tokenized/in_domain_dev.tsv  
  inflating: cola_public/tokenized/in_domain_train.tsv  
  inflating: cola_public/tokenized/out_of_domain_dev.tsv  
   creating: cola_public/raw/
  inflating: cola_public/raw/in_domain_dev.tsv  
  inflating: cola_public/raw/in_domain_train.tsv  
  inflating: cola_public/raw/out_of_domain_dev.tsv  


## 2.2. Parse



In [8]:
import pandas as pd

# Load the dataset into a pandas dataframe.
df = pd.read_csv("./cola_public/raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df.shape[0]))

# Display 10 random rows from the data.
df.sample(10)

Number of training sentences: 8,551



| | sentence_source | label | label_notes | sentence |
|---|---|---|---|---|
| 3524 | ks08 | 1 | | The birds are singing because the weather is l... |
| 2896 | l-93 | 1 | | I broke the twig off the branch. |
| 1015 | bc01 | 1 | | Some tourists visited all the museums. |
| 6318 | c_13 | 1 | | Chris wants himself to win. |
| 3865 | ks08 | 1 | | Frank threw himself into the sofa. |
| 4572 | ks08 | 0 | * | John seems fond of ice cream, and Bill seems, ... |
| 7964 | ad03 | 1 | | He might no could have done it |
| 1951 | r-67 | 1 | | John and Mary met in Vienna. |
| 7133 | sks13 | 1 | | This girl in the red coat will put a picture o... |
| 7724 | ad03 | 1 | | I might have eaten some seaweed. |


The two properties we actually care about are the sentence and its label, which is referred to as the "acceptability judgment" (0=unacceptable, 1=acceptable).


Here are five sentences which are labeled as not grammatically acceptable. Note how much more difficult this task is than something like sentiment analysis!


In [9]:
df.loc[df.label == 0].sample(5)[['sentence', 'label']]

| | sentence | label |
|---|---|---|
| 1407 | Light was made of her indiscretions. | 0 |
| 171 | The more obnoxious Fred, the less attention yo... | 0 |
| 248 | He gets angry, the longer John has to wait. | 0 |
| 583 | The cup filled of water. | 0 |
| 1279 | The tall nurse who Tony has a Fiat and yearns ... | 0 |


In [11]:
# Get the lists of sentences and their labels.
sentences = df.sentence.values
labels = df.label.values

print(sentences)
print(labels)


["Our friends won't buy this analysis, let alone the next one we propose."
 "One more pseudo generalization and I'm giving up."
 "One more pseudo generalization or I'm giving up." ...
 'It is easy to slay the Gorgon.'
 'I had the strangest feeling that I knew you.'
 'What all did you get for Christmas?']
[1 1 1 ... 1 1 1]


# 3. Tokenization & Input Formatting

## 3.1. BERT Tokenizer

In [12]:
from transformers import BertTokenizer

# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Loading BERT tokenizer...


In [13]:
# Print the original sentence.
print(' Original: ', sentences[0])

# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(sentences[0]))

# Print the sentence mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[0])))

 Original:  Our friends won't buy this analysis, let alone the next one we propose.
Tokenized:  ['our', 'friends', 'won', "'", 't', 'buy', 'this', 'analysis', ',', 'let', 'alone', 'the', 'next', 'one', 'we', 'propose', '.']
Token IDs:  [2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012]


## 3.2. Required Formatting

The code above left out a few required formatting steps, which we'll look at here.

Side Note: The input format to BERT seems "over-specified" to me... We are required to give it several pieces of information that seem redundant, or that could easily be inferred from the data without us explicitly providing them. But it is what it is, and I suspect it will make more sense once I have a deeper understanding of the BERT internals.

We are required to:

1. Add special tokens to the start and end of each sentence.
2. Pad & truncate all sentences to a single constant length.
3. Explicitly differentiate real tokens from padding tokens with the "attention mask".
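The three steps above can be sketched in plain Python, without the tokenizer. This is only an illustration under stated assumptions: `format_for_bert` and `max_len` are hypothetical names, and the IDs 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` entries of the `bert-base-uncased` vocabulary. The token IDs are the ones printed for the first sentence in section 3.1.

```python
# Special-token IDs in the bert-base-uncased vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def format_for_bert(token_ids, max_len):
    """Hypothetical helper: add special tokens, pad/truncate, build the mask.

    Assumes max_len >= 2 so that [CLS] and [SEP] always fit.
    """
    # Step 2 (truncate): leave room for the two special tokens.
    body = token_ids[:max_len - 2]

    # Step 1: [CLS] at the start, [SEP] at the end.
    input_ids = [CLS_ID] + body + [SEP_ID]

    # Step 3: 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(input_ids)

    # Step 2 (pad): fill up to the constant length with [PAD].
    n_pad = max_len - len(input_ids)
    input_ids = input_ids + [PAD_ID] * n_pad
    attention_mask = attention_mask + [0] * n_pad
    return input_ids, attention_mask


# Token IDs of the first training sentence (from section 3.1).
token_ids = [2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010,
             2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012]

input_ids, attention_mask = format_for_bert(token_ids, max_len=24)
print(input_ids)
print(attention_mask)
```

With `max_len=24`, the 17 word-piece IDs plus the two special tokens give 19 real positions, followed by 5 padding positions that the mask sets to 0.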


### Special Tokens

[SEP]

At the end of every sentence, we need to append the special [SEP] token.

This token is an artifact of two-sentence tasks, where BERT is given two separate sentences and asked to determine something (e.g., can the answer to the question in sentence A be found in sentence B?).

I am not certain yet why the token is still required when we have only single-sentence input, but it is!
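To make the two-sentence origin of [SEP] concrete, here is a minimal sketch of how a sentence pair and a single sentence are laid out. `build_pair` and `build_single` are hypothetical helpers, not library functions. The IDs 101 and 102 are `[CLS]` and `[SEP]` in the `bert-base-uncased` vocabulary, and the word IDs are borrowed from the tokenizer output in section 3.1 purely as sample values. The segment IDs (0 for the first sentence, 1 for the second) follow the layout described in the BERT paper.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in bert-base-uncased


def build_pair(ids_a, ids_b):
    """Hypothetical helper: [CLS] A [SEP] B [SEP], plus segment IDs."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sentence A and its [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids


def build_single(ids_a):
    """Hypothetical helper: [CLS] A [SEP] for single-sentence input."""
    return [CLS_ID] + ids_a + [SEP_ID]


# Sample values only: 'our friends' and 'buy this analysis'.
ids_a = [2256, 2814]
ids_b = [4965, 2023, 4106]

pair_ids, pair_segments = build_pair(ids_a, ids_b)
single_ids = build_single(ids_a)

print(pair_ids)
print(pair_segments)
print(single_ids)
```

The single-sentence layout is simply the pair layout with the second sentence dropped, which is why the trailing [SEP] is still there.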

[CLS]

For classification tasks, we must prepend the special [CLS] token to the beginning of every sentence.


This token has special significance. BERT-base consists of 12 Transformer layers. Each layer takes in a list of token embeddings and produces the same number of embeddings on the output (but with the feature values changed, of course!).

![Illustration of CLS token purpose](http://www.mccormickml.com/assets/BERT/CLS_token_500x606.png)





On the output of the final (12th) transformer, *only the first embedding (corresponding to the [CLS] token) is used by the classifier*.

>  "The first token of every sequence is always a special classification token (`[CLS]`). The final hidden state
corresponding to this token is used as the aggregate sequence representation for classification
tasks." (from the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf))

You might think to try some pooling strategy over the final embeddings, but this isn't necessary. Because BERT is trained to only use this [CLS] token for classification, we know that the model has been motivated to encode everything it needs for the classification step into that single 768-value embedding vector. It's already done the pooling for us!
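As a rough illustration of "only the first embedding is used", the sketch below slices the [CLS] position out of a stand-in for the final layer's output and contrasts it with a simple mean-pooling alternative. Everything here is an assumption for demonstration: the array is filled with random numbers rather than real BERT activations, the names `last_hidden_state`, `cls_embedding`, and `mean_pooled` are hypothetical, and the batch and sequence sizes are arbitrary. Only the hidden size of 768 comes from the text above.

```python
import numpy as np

batch_size, seq_len, hidden_size = 2, 8, 768

# Stand-in for the final layer's output: one 768-value embedding per token.
# Random values, NOT real BERT activations.
rng = np.random.default_rng(0)
last_hidden_state = rng.standard_normal((batch_size, seq_len, hidden_size))

# What the classifier consumes: the embedding at position 0, i.e. [CLS].
cls_embedding = last_hidden_state[:, 0, :]

# The pooling alternative mentioned above: average over all token positions.
# (A real implementation would also need to exclude padding positions.)
mean_pooled = last_hidden_state.mean(axis=1)

print(cls_embedding.shape)
print(mean_pooled.shape)
```

Both results have one 768-value vector per sentence. The difference is that the [CLS] vector is the one BERT was trained to fill with classification-relevant information.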