# Task 4

# Tutorial (Not Graded)

In this section, we will look at the open-source HuggingFace library. This library makes it quite easy to do natural language processing in Python.

## What is Natural Language Processing?

### Common Tasks in NLP
NLP covers a range of common tasks, which can be roughly grouped by difficulty as easy, medium, or hard.

Spell checking, keyword search, and finding synonyms are relatively easy. Parsing information from websites, documents, and other sources is of moderate difficulty. The harder tasks in NLP include:
* Translation between languages
* Semantic analysis to find the real meaning of text
* Finding what pronouns refer to, a task called coreference resolution
* Answering questions about a text

The ability to work with natural language also opens up other useful applications, such as spam detection and medical report analysis.

### Language Models and Text Generation
Modeling a language means assigning a probability to a sequence of words occurring one after another to form a phrase. For example, given a sequence of words, which word is likely to come next to complete the sentence?

In other words, a language model predicts the probability of the next word given an observed sequence of words. With this probability, we can even generate a sentence by repeatedly choosing one of the most probable next words.

It is worth mentioning that language modeling and text generation can operate at different levels, e.g. words, characters, or sub-words. For instance, a language model can be trained on characters only and, given the characters seen so far, predict which character is likely to come next.

Language models are defined so that they try to predict the next word, which is essentially what a text generator does. Using language models, we can generate text, and we can also evaluate the generated text to score the language model.
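
To make the idea of next-word probabilities concrete, here is a minimal sketch of a word-level bigram model in plain Python. It is not how transformer models work internally; the toy corpus and the helper names `next_word_probs` and `generate` are made up for illustration.

```python
import random
from collections import Counter, defaultdict

# Toy corpus; the sentences are invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

# Count how often each word follows each other word (a bigram model).
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1


def next_word_probs(prev):
    """Estimate P(next word | previous word) from the bigram counts."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def generate(max_words=10, seed=0):
    """Sample a sentence one word at a time from the bigram probabilities."""
    rng = random.Random(seed)
    word, out = "<s>", []
    for _ in range(max_words):
        probs = next_word_probs(word)
        word = rng.choices(list(probs), weights=list(probs.values()))[0]
        if word == "</s>":  # end-of-sentence marker
            break
        out.append(word)
    return " ".join(out)


print(next_word_probs("cat"))  # "sat" and "ate" each get probability 0.5
print(generate())
```

The same counting idea applies at the character or sub-word level by changing how each sentence is split into units.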

### What is a Transformer?
Transformers are a family of architectures that started with the 2017 paper [Attention is All You Need](https://arxiv.org/abs/1706.03762). The idea is to rely only on [attention](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html) in deep learning networks, without any convolution (ConvNet) or recurrence (RNN or LSTM). This approach has proven quite effective not only in language but in other areas as well. One main advantage is the ability to build larger models without the loss of gradient information seen in RNNs, and without the slowdowns caused by their sequential, recurrent nature.

<p align="center">
  <img width="200"  src="https://drive.google.com/uc?export=view&id=1MdYq1BpCp_Hvyb0hu3XyhKXxMrhXJNf_">
</p>
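
As a rough illustration of the attention idea, the sketch below computes scaled dot-product attention, softmax(QKᵀ/√d_k)V, in plain Python. It is a simplified single-head version under stated assumptions: the token vectors are made up, and the learned projection matrices, multiple heads, and masking used in real Transformers are omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of numbers."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three made-up token vectors. In self-attention, Q, K and V all come from
# the same sequence (here used directly, without learned projections).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and says how much one token attends to every token in the sequence. All rows can be computed independently, which is part of why this design avoids the step-by-step processing of RNNs.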

Since the original paper, many transformer models have been introduced. An extensive list can be found here: https://huggingface.co/models.

The following figure shows some of the prominent ones and compares them in terms of their size (Hint: the last one is GPT-3):
<p align="center">
  <img width="500" src="https://cdn.substack.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c9d108-0e5c-4a7f-83ed-beed232b0e65_1384x1264.png">
  <font size="1"><center><a href="https://samcash.substack.com/p/-laymans-guide-to-language-models">Image Credit</a> </center></font>
</p>


## Installation

In [None]:
!pip install transformers==4.3.3


## Exploring
In this part, we do some high-level exploration of the HuggingFace transformers library. The goal is to try out some sample tasks that can be done using NLP. We use pipelines for these experiments: https://huggingface.co/transformers/main_classes/pipelines.html.

In [None]:
from transformers import pipeline

Let's do some sentiment classification:

In [None]:
# Sentiment classification
classifier = pipeline('sentiment-analysis')

In [None]:
classifier('University of Alberta is a great university.')

[{'label': 'POSITIVE', 'score': 0.9998518824577332}]

In [None]:
classifier('Why is Edmonton so cold in the winter?!')

[{'label': 'NEGATIVE', 'score': 0.9975214004516602}]
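
The scores above lie between 0 and 1 and behave like class probabilities. As a rough sketch of the general technique (an assumption about how such a score can be derived, not the pipeline's actual code), a classifier's raw outputs, called logits, can be turned into a label and a score with a softmax. The logit values below are invented.

```python
import math

# Hypothetical raw model outputs (logits) for one sentence; a real
# classifier would produce these numbers, here they are invented.
logits = [-3.1, 4.2]
labels = ["NEGATIVE", "POSITIVE"]

# Softmax turns the logits into probabilities that sum to 1.
shifted = [x - max(logits) for x in logits]
exps = [math.exp(x) for x in shifted]
probs = [e / sum(exps) for e in exps]

# Report the most probable label together with its probability.
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": labels[best], "score": probs[best]}
print(result)
```

A score close to 1 means the model puts almost all of its probability on the reported label. It measures the model's confidence and does not guarantee that the label is correct.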

Another interesting NLP application is a conversational bot. We can build a simple one using a pipeline called 'conversational'.

In [None]:
conversational_pipeline = pipeline("conversational")

Some weights of GPT2Model were not initialized from the model checkpoint at microsoft/DialoGPT-medium and are newly initialized: ['transformer.h.0.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.2.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.8.attn.masked_bias', 'transformer.h.9.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.11.attn.masked_bias', 'transformer.h.12.attn.masked_bias', 'transformer.h.13.attn.masked_bias', 'transformer.h.14.attn.masked_bias', 'transformer.h.15.attn.masked_bias', 'transformer.h.16.attn.masked_bias', 'transformer.h.17.attn.masked_bias', 'transformer.h.18.attn.masked_bias', 'transformer.h.19.attn.masked_bias', 'transformer.h.20.attn.masked_bias', 'transformer.h.21.attn.masked_bias', 'transformer.h.22.attn.masked_bias', 'transformer.h.23.attn.masked

In [None]:
from transformers import Conversation

conversation_1 = Conversation("Where is a good place to hang out at the University of Alberta?")

conversational_pipeline([conversation_1])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: a214cd6f-f811-460c-8578-e6a044434a5b 
user >> Where is a good place to hang out at the University of Alberta? 
bot >> The student center. 

In [None]:
conversation_1 = Conversation("Is the student center called the SUB?")

conversational_pipeline([conversation_1])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: 0bc95965-d317-4150-bd65-cfb1b9b013cd 
user >> Is the student center called the SUB? 
bot >> Yes, it's the SUB. 

In [None]:
conversation_1 = Conversation("Which one is better Edmonton or Toronto?")

conversational_pipeline([conversation_1])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Conversation id: e1db64ac-5786-47f0-bc4b-ba8a7b847655 
user >> Which one is better Edmonton or Toronto? 
bot >> Edmonton. Toronto is a joke. 

Well, you have to trust the AI on this! :-)

Much more can be done with NLP techniques, and especially with the HuggingFace library, which is built around Transformer models. You can learn more about Transformers here: https://arxiv.org/abs/1706.03762.

More recently, a much larger transformer-based language model called GPT-3 has created a lot of buzz, and many interesting projects have been built with it. You can see some of these fun projects for yourself under this link: https://gpt3demo.com/. GPT-3 was developed by [OpenAI](https://openai.com/) and has 175 billion parameters.

## Hints


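Dataset format conversion can be done with code similar to this:

In [None]:
# Assumes `import tensorflow as tf`, a tokenized `train_dataset`, and `max_length` are already defined
train_dataset.set_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])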
features = {x: train_dataset[x].to_tensor(default_value=0, shape=[None, max_length]) for x in ['input_ids', 'token_type_ids', 'attention_mask']}
tf_train_dataset = tf.data.Dataset.from_tensor_slices((features, train_dataset["label"])).batch(32)

# Sentiment Classification (Graded)

We first need to install the transformers and datasets libraries. These are part of the HuggingFace open-source ecosystem for natural language processing.

In [None]:
!pip install datasets
!pip install transformers==4.3.3

## IMDB dataset
Next, we need to load the IMDB dataset and process it. A list of all datasets, with instructions, is available under this link: https://huggingface.co/datasets

### Downloading the Dataset

In [None]:
from datasets import load_dataset
dataset = load_dataset('''TODO: Enter the name of the IMDB dataset. ''')

In [None]:
# Viewing the details of the dataset
dataset

In the next code box, please visualize a few samples of the dataset from both the training and test splits.

In [None]:
''' TODO: Visualize a few samples of the data '''

### Model and Tokenizer
As explained in the handout, we need to tokenize the text so that the model can process it. Here we use the BERT base model. You can see all the available models here: https://huggingface.co/models

In [None]:
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer
model = TFAutoModelForSequenceClassification.from_pretrained('''TODO: Find and place the appropriate model name from the list of available HuggingFace models.''')
tokenizer = AutoTokenizer.from_pretrained('''TODO: Find and place the appropriate model name from the list of available HuggingFace models.''')

### Tokenization
Complete the following code so that it can tokenize the text.

In [None]:
def encode(examples):
  return tokenizer(examples['text'], 
                   truncation=True, 
                   padding='max_length', 
                   max_length='''TODO: Set the maximum length.'''
                  )
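
To make the effect of `truncation=True` and `padding='max_length'` more concrete, here is a minimal pure-Python sketch of what happens to a single sequence of token ids. The helper `pad_or_truncate`, the token ids, and the padding id `0` are assumptions for illustration. The real tokenizer also takes care of special tokens and produces `token_type_ids`, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Mimic truncation plus padding to a fixed length for one sequence."""
    ids = token_ids[:max_length]                    # truncation
    attention_mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)
    ids = ids + [pad_id] * n_pad                    # padding
    attention_mask = attention_mask + [0] * n_pad   # 0 marks padding
    return {"input_ids": ids, "attention_mask": attention_mask}


# Made-up token ids; a real tokenizer would produce them from text.
short = pad_or_truncate([101, 2023, 2003, 102], max_length=6)
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_length=6)
print(short)
print(long)
```

Every sequence ends up with the same length, which is what allows the examples to be stacked into fixed-shape tensors later. The attention mask tells the model which positions are padding and should be ignored.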

In [None]:
# Encoding the training dataset
train_dataset = dataset['train'].map(encode, batched=True)

In [None]:
# Showing the available keys
train_dataset[0].keys()

In [None]:
''' TODO: Encode the testing part of the dataset and call it test_dataset. '''

You can convert the training dataset format into TensorFlow format using the following code:

In [None]:
# Converting the dataset to the TensorFlow/Keras format
import tensorflow as tf
max_length = 128  # should match the max_length passed to the tokenizer in encode()
train_dataset.set_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
features = {x: train_dataset[x].to_tensor(default_value=0, shape=[None, max_length]) for x in ['input_ids', 'token_type_ids', 'attention_mask']}
tf_train_dataset = tf.data.Dataset.from_tensor_slices((features, train_dataset["label"])).batch(32)

In [None]:
# Peeking at how the result looks
next(iter(tf_train_dataset))

In [None]:
''' TODO: Do the same conversion for the testing part of the dataset '''

## Training and Validation
Using the following code, you can train the model and validate the results.

In [None]:
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE, from_logits=True)
opt = tf.keras.optimizers.Adam(learning_rate=3e-5)
model.compile(optimizer=opt, loss=loss_fn, metrics=["accuracy"])
model.fit(tf_train_dataset, validation_data=tf_test_dataset, epochs=3)

**TODO** Explain what the training and validation code does, line by line. Explain whether you would have done anything differently and why. Feel free to update the code (optional).
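
As background for the loss used above, the sketch below shows in plain Python what a sparse categorical cross-entropy on logits computes for a single example: the negative log of the softmax probability assigned to the true class. The function name and the logit values are made up, and the mapping 0 = negative, 1 = positive is an assumption for illustration. The Keras loss additionally handles batching and reduction.

```python
import math


def sparse_categorical_crossentropy_from_logits(label, logits):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)
    # log-sum-exp, shifted by the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


# Made-up logits for a two-class problem (assumed: 0 = negative, 1 = positive).
confident_right = sparse_categorical_crossentropy_from_logits(1, [-2.0, 3.0])
confident_wrong = sparse_categorical_crossentropy_from_logits(0, [-2.0, 3.0])
print(round(confident_right, 4), round(confident_wrong, 4))
```

The loss is small when the model puts most of its probability on the true class and grows quickly when it is confidently wrong. "Sparse" means the label is given as an integer class index rather than a one-hot vector.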

## Testing in the Wild
Please write a few movie reviews yourself or find some on the web. Then use your model to predict the sentiment of these reviews.

In [None]:
''' TODO: Place your code here '''