
# Junk

In [None]:
# SUMMARIZATION AND NER
!pip install transformers
# https://chriskhanhtran.github.io/posts/named-entity-recognition-with-transformers/

input text -> load model with pretrained weights -> tokenize to byte seq: Bert tokenizer -> cls, sep token -> padding: max_length -> masking tokens

-> tokens to bert vocab ->

## Subword Tokenization
Take the words cat and cats: a sub-tokenization of the word cats would be [cat, ##s], where the prefix "##" indicates a subtoken that continues the initial input. Training algorithms of this kind might extract sub-tokens such as "##ing" and "##ed" from an English corpus.

Building words out of such "pieces" reduces the size of the vocabulary you have to carry to train a Machine Learning model. On the other hand, as one token might be split into multiple subtokens, the input of your model might grow and become an issue for models whose complexity is non-linear in the input sequence's length.
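To make the [cat, ##s] example concrete, here is a minimal sketch of greedy longest-match-first splitting, which is how WordPiece-style tokenizers are commonly described. The function name `wordpiece_tokenize` and the tiny vocabulary are made up for this illustration; a real vocabulary is learned from a corpus.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only for this illustration.
TOY_VOCAB = {"cat", "walk", "play", "##s", "##ing", "##ed"}

for w in ["cat", "cats", "walking", "played", "dog"]:
    print(w, "->", wordpiece_tokenize(w, TOY_VOCAB))
```

Six vocabulary entries cover seven surface forms here (cat, cats, walk, walks, walking, walked, and so on), but "cats" now costs two positions in the input sequence instead of one.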

### huggingface/tokenizers library
A blazing fast tokenization library able to train, tokenize and decode dozens of Gb/s of text on a common multi-core machine.
The library is designed so that it provides all the required blocks to create end-to-end tokenizers in an interchangeable way. In that sense, we provide these various components:

* Normalizer: Executes all the initial transformations over the initial input string. For example, when you need to lowercase some text, strip it, or apply one of the common Unicode normalization forms, you add a Normalizer.
* PreTokenizer: In charge of splitting the initial input string. This is the component that decides where and how to pre-segment the original string. The simplest example would be to split on spaces.
* Model: Handles all the sub-token discovery and generation. This part is trainable and really dependent on your input data.
* Post-Processor: Provides advanced construction features to be compatible with some of the Transformers-based SoTA models. For instance, for BERT it would wrap the tokenized sentence in [CLS] and [SEP] tokens.
* Decoder: In charge of mapping a tokenized input back to the original string. The decoder is usually chosen according to the PreTokenizer used previously.
* Trainer: Provides training capabilities to each model.

For each of the components above we provide multiple implementations:

* Normalizer: Lowercase, Unicode (NFD, NFKD, NFC, NFKC), Bert, Strip, ...
* PreTokenizer: ByteLevel, WhitespaceSplit, CharDelimiterSplit, Metaspace, ...
* Model: WordLevel, BPE, WordPiece
* Post-Processor: BertProcessor, ...
* Decoder: WordLevel, BPE, WordPiece, ...
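The way these components chain together can be sketched in plain Python. This is only a toy stand-in, not the tokenizers API: every function name below is hypothetical, the model is a simple word-level lookup, and the vocabulary is invented for the example.

```python
import unicodedata

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def normalize(text):
    """Normalizer: NFD-decompose, drop accent marks, lowercase, strip."""
    decomposed = unicodedata.normalize("NFD", text)
    no_accents = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return no_accents.lower().strip()


def pre_tokenize(text):
    """PreTokenizer: split on whitespace."""
    return text.split()


def word_level_model(words, vocab, unk_token="[UNK]"):
    """Model: keep words found in the vocabulary, replace the rest."""
    return [w if w in vocab else unk_token for w in words]


def post_process(tokens):
    """Post-Processor: wrap the sequence the way BERT expects."""
    return ["[CLS]"] + tokens + ["[SEP]"]


def decode(tokens):
    """Decoder: drop special tokens and join on spaces (mirrors pre_tokenize)."""
    return " ".join(t for t in tokens if t not in SPECIAL_TOKENS)


def encode(text, vocab):
    """Run the stages in order: normalize, pre-tokenize, model, post-process."""
    return post_process(word_level_model(pre_tokenize(normalize(text)), vocab))


# Hypothetical vocabulary for the illustration.
TOY_VOCAB = {"hugging", "face", "is", "a", "cafe"}

tokens = encode("  Hugging Face is a Café ", TOY_VOCAB)
print(tokens)
print(decode(tokens))
```

Because each stage only depends on the output of the previous one, swapping the whitespace pre-tokenizer or the word-level model for another implementation leaves the rest untouched, which is the interchangeability the list above describes.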

In [None]:
import pandas as pd 
import transformers

In [None]:
from tokenizers import Tokenizer 

## Transformer library
The transformers library allows you to benefit from large, pretrained language models without requiring a huge and costly computational infrastructure. Most of the State-of-the-Art models are provided directly by their authors and made available in the library in PyTorch and TensorFlow in a transparent and interchangeable way.

In [None]:

import torch
from transformers import AutoModel, AutoTokenizer, BertTokenizer

torch.set_grad_enabled(False)

# Store the model we want to use
MODEL_NAME = "bert-base-cased"

# We need to create the model and tokenizer
model = AutoModel.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Tokens come from a process that splits the input into sub-entities with interesting linguistic properties.
tokens = tokenizer.tokenize("This is an input example")
print("Tokens: {}".format(tokens))

# This is not sufficient for the model, as it requires integers as input, 
# not a problem, let's convert tokens to ids.
tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Tokens id: {}".format(tokens_ids))

# Add the required special tokens
tokens_ids = tokenizer.build_inputs_with_special_tokens(tokens_ids)

# We need to convert to a Deep Learning framework specific format, let's use PyTorch for now.
tokens_pt = torch.tensor([tokens_ids])
print("Tokens PyTorch: {}".format(tokens_pt))

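# Now we're ready to go through BERT with our input.
# Indexing with [:2] keeps the unpacking valid whether the model returns a tuple or a ModelOutput.
outputs, pooled = model(tokens_pt)[:2]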
print("Token wise output: {}, Pooled output: {}".format(outputs.shape, pooled.shape))

Tokens: ['This', 'is', 'an', 'input', 'example']
Tokens id: [1188, 1110, 1126, 7758, 1859]
Tokens PyTorch: tensor([[ 101, 1188, 1110, 1126, 7758, 1859,  102]])
Token wise output: torch.Size([1, 7, 768]), Pooled output: torch.Size([1, 768])


As you can see, BERT outputs two tensors:

* One with the generated representation for every token in the input (1, NB_TOKENS, REPRESENTATION_SIZE)
* One with an aggregated representation for the whole input (1, REPRESENTATION_SIZE)

The first, token-based, representation can be leveraged if your task requires keeping the sequence representation and you want to operate at the token level. This is particularly useful for Named Entity Recognition and Question-Answering.

The second, aggregated, representation is especially useful if you need to extract the overall context of the sequence and don't require fine-grained, token-level detail. This is the case for Sentiment Analysis of the sequence or Information Retrieval.
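To see how the two shapes relate, here is a small pure-Python sketch. It assumes the usual description of BERT's pooler, namely that the aggregated vector is a dense layer followed by tanh applied to the first ([CLS]) token's representation. The numbers, the identity weight matrix and the function name `pool_first_token` are all invented for the illustration.

```python
import math


def pool_first_token(hidden_states, weight, bias):
    """Return tanh(W @ h_first + b), where h_first is the first token's vector."""
    first = hidden_states[0]
    return [
        math.tanh(sum(w * x for w, x in zip(row, first)) + b)
        for row, b in zip(weight, bias)
    ]


# Toy token-wise output: NB_TOKENS = 3, REPRESENTATION_SIZE = 2.
hidden_states = [[0.5, -1.0], [0.1, 0.2], [0.3, 0.4]]

# Made-up pooler parameters (identity weights, zero bias).
weight = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

pooled = pool_first_token(hidden_states, weight, bias)

print("Token-wise shape:", (len(hidden_states), len(hidden_states[0])))
print("Pooled shape:", (len(pooled),))
```

A token-level task would read one row of `hidden_states` per token, while a sequence-level task would only consume `pooled`.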

In [None]:
# high level 
# tokens = tokenizer.tokenize("This is an input example")
# tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
# tokens_pt = torch.tensor([tokens_ids])

# This code can be factored into one line as follows
tokens_pt2 = tokenizer.encode_plus("This is an input example", return_tensors="pt")

for key, value in tokens_pt2.items():
    print("{}:\n\t{}".format(key, value))

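# Indexing with [:2] keeps the unpacking valid whether the model returns a tuple or a ModelOutput.
outputs2, pooled2 = model(**tokens_pt2)[:2]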
print("Difference with previous code: ({}, {})".format((outputs2 - outputs).sum(), (pooled2 - pooled).sum()))

input_ids:
	tensor([[ 101, 1188, 1110, 1126, 7758, 1859,  102]])
token_type_ids:
	tensor([[0, 0, 0, 0, 0, 0, 0]])
attention_mask:
	tensor([[1, 1, 1, 1, 1, 1, 1]])
Difference with previous code: (0.0, 0.0)



As you can see above, the method encode_plus provides a convenient way to generate all the required parameters that will go through the model.

Moreover, you might have noticed it generated some additional tensors:

* token_type_ids: This tensor maps every token to its corresponding segment (0 for the first sentence, 1 for the second when a sentence pair is encoded).
* attention_mask: This tensor is used to "mask" padded values in a batch of sequences with different lengths (see below).
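The sketch below shows, in plain Python, how these three lists line up for a sentence pair padded to a fixed length. It is an illustration rather than the tokenizer's own code: `build_pair_inputs` is a hypothetical helper, the ids 101 and 102 are the [CLS] and [SEP] ids visible in the output above, and 0 is assumed to be the padding id.

```python
def build_pair_inputs(ids_a, ids_b, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Assemble [CLS] A [SEP] B [SEP] and pad it to max_length."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], A and the first [SEP];
    # segment 1 covers B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # Real tokens get 1; padding added below gets 0.
    attention_mask = [1] * len(input_ids)

    n_pad = max_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("max_length is too small for this pair")

    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


# Token ids taken from the example above: "This is" / "an input example".
pair = build_pair_inputs([1188, 1110], [1126, 7758, 1859], max_length=10)
for key, value in pair.items():
    print("{}:\n\t{}".format(key, value))
```

The last two positions are padding, so their attention_mask entries are 0 and the model is told to ignore them.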

In [None]:
# Padding highlight
tokens = tokenizer.batch_encode_plus(
    ["This is a sample", "This is another longer sample text"], 
    pad_to_max_length=True  # First sentence will have some PADDED tokens to match second sequence length
)

for i in range(2):
    print("Tokens (int)      : {}".format(tokens['input_ids'][i]))
    print("Tokens (str)      : {}".format([tokenizer.convert_ids_to_tokens(s) for s in tokens['input_ids'][i]]))
    print("Tokens (attn_mask): {}".format(tokens['attention_mask'][i]))
    print()

In [None]:
# DISTIL BERT

from transformers import DistilBertModel

bert_distil = DistilBertModel.from_pretrained('distilbert-base-cased')
input_pt = tokenizer.encode_plus(
    'This is a sample input to demonstrate performance of distilled models especially inference time', 
    return_tensors="pt"
)


%time _ = bert_distil(input_pt['input_ids'])
# Uncomment to time the full BERT model loaded earlier for comparison:
# %time _ = model(input_pt['input_ids'])

## HOW TO USE PIPELINES

Newly introduced in transformers v2.3.0, pipelines provide a high-level, easy-to-use API for running inference over a variety of downstream tasks, including:

* Sentence Classification (Sentiment Analysis): Indicate whether the overall sentence is positive or negative, i.e. a binary classification (logistic regression) task.
* Token Classification (Named Entity Recognition, Part-of-Speech tagging): For each sub-entity (token) in the input, assign a label, i.e. a classification task.
* Question-Answering: Given a tuple (question, context), the model should find the span of text in the context that answers the question.
* Mask-Filling: Suggests possible word(s) to fill the masked input with respect to the provided context.
* Summarization: Summarizes the input article to a shorter article.
* Translation: Translates the input from a language to another language.
* Feature Extraction: Maps the input to a higher, multi-dimensional space learned from the data.


Pipelines encapsulate the overall process of every NLP task:

* Tokenization: Split the initial input into multiple sub-entities with ... properties (i.e. tokens).
* Inference: Map every token into a more meaningful representation.
* Decoding: Use the above representation to generate and/or extract the final output for the underlying task.
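As a rough illustration of the decoding step for token classification, the sketch below merges per-token labels back into whole entities. The (token, label) pairs are written by hand for the example and are not actual model output; `group_entities` is a hypothetical helper, not part of transformers.

```python
def group_entities(tagged_tokens, outside_label="O"):
    """Merge (token, label) pairs into (text, label) entity spans."""
    entities = []  # each entry is a mutable [text, label] pair
    previous_label = outside_label
    for token, label in tagged_tokens:
        if label != outside_label:
            if label == previous_label:
                # Same entity continues: glue "##" pieces directly,
                # join whole words with a space.
                if token.startswith("##"):
                    entities[-1][0] += token[2:]
                else:
                    entities[-1][0] += " " + token
            else:
                entities.append([token, label])
        previous_label = label
    return [tuple(entity) for entity in entities]


# Hand-written tagging, for illustration only.
tagged = [
    ("Hu", "I-ORG"), ("##gging", "I-ORG"), ("Face", "I-ORG"),
    ("is", "O"), ("a", "O"), ("French", "I-MISC"), ("company", "O"),
    ("based", "O"), ("in", "O"), ("New", "I-LOC"), ("York", "I-LOC"),
]

print(group_entities(tagged))
```

This simple rule merges any run of identical labels, so two distinct entities of the same type that sit next to each other would be fused; tagging schemes with explicit begin markers exist to avoid that.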

In [None]:
from transformers import pipeline

nlp_token_class = pipeline('ner')
nlp_token_class('Hugging Face is a French company based in New-York.')

TEXT_TO_SUMMARIZE = """ 
New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York. 
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband. 
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometime"""

summarizer = pipeline('summarization')
summarizer(TEXT_TO_SUMMARIZE)

In [None]:
df = pd.read_csv('article_20.csv')

# name the columns
df.columns = ['rowno','date','heading','full_content','link','empty']

# cut down columns
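# .copy() gives an independent frame, so the column assignments below do not trigger SettingWithCopyWarning
articles = df[['heading','full_content']].copy()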
articles['word_count'] = articles['full_content'].apply(lambda x: len(str(x).split(" ")))
articles['content'] = articles['full_content'].str.slice(0,1024)
#articles['content4k'] = articles['full_content'].str.slice(0,4096)

articles.head()

In [None]:
from transformers import DistilBertTokenizer

# Define the DistilBERT tokenizer
distil_bert = 'distilbert-base-uncased' # Pick any desired pre-trained model
roberta = 'roberta-base'  # RoBERTa has no "uncased" checkpoint; it would also need its own tokenizer class

# Change the name here to change the tokenizer
tokenizer = DistilBertTokenizer.from_pretrained(distil_bert, do_lower_case=True, add_special_tokens=True,
                                                max_length=512, pad_to_max_length=True)

In [None]:
import numpy as np
from tqdm import tqdm

# Tokenize each sentence and collect input ids, attention masks and segment ids
def tokenize(sentences, tokenizer):
    input_ids, input_masks, input_segments = [],[],[]
    for sentence in tqdm(sentences):
        # Every sentence is padded to 128 tokens so the results stack into rectangular arrays
        inputs = tokenizer.encode_plus(sentence, add_special_tokens=True, max_length=128, pad_to_max_length=True, 
                                             return_attention_mask=True, return_token_type_ids=True)
        input_ids.append(inputs['input_ids'])
        input_masks.append(inputs['attention_mask'])
        input_segments.append(inputs['token_type_ids'])

    return np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32')