<a href="https://colab.research.google.com/github/RichardXiao13/Google_Code_In/blob/master/Interactive_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Using DistilBERT**
In this notebook, we will fine-tune the DistilBERT model on a natural language processing task of your choice and then try it out interactively. To begin, we install the `transformers` library from Hugging Face.

In [1]:
! pip install transformers==2.2.0

Collecting transformers==2.2.0
  Downloading https://files.pythonhosted.org/packages/ec/e7/0a1babead1b79afabb654fbec0a052e0d833ba4205a6dfd98b1aeda9c82e/transformers-2.2.0-py3-none-any.whl (360kB)
     |████████████████████████████████| 368kB 3.5MB/s 
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/a6/b4/7a41d630547a4afd58143597d5a49e07bfd4c42914d8335b2a5657efc14b/sacremoses-0.0.38.tar.gz (860kB)
     |████████████████████████████████| 870kB 49.3MB/s 
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/74/f4/2d5214cbf13d06e7cb2c20d84115ca25b53ea76fa1f0ade0e3c9749de214/sentencepiece-0.1.85-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
     |████████████████████████████████| 1.0MB 34.2MB/s 
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.38-cp36-none-any.whl size=884629 sha256=27155ed9ab6e051

We will use `tensorflow`, `tensorflow_datasets`, and the `transformers` library that we just installed.

In [2]:
%tensorflow_version 2.x
import tensorflow as tf
import tensorflow_datasets as tfds
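from transformers import (DistilBertForSequenceClassification, DistilBertTokenizer,
                          TFDistilBertForSequenceClassification, glue_convert_examples_to_features)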

TensorFlow 2.x selected.


# **Choose a task!**
Below is a list of natural language processing tasks from the GLUE benchmark that we can use to fine-tune our DistilBERT model. Here is a quick summary of each.


*   **CoLA** - The Corpus of Linguistic Acceptability. This task is designed to train models to analyze the grammatical acceptability of sentences.
*   **SST-2** - The Stanford Sentiment Treebank. This task measures how well a model is able to distinguish between negative and positive sentences.
*   **MRPC** - Microsoft Research Paraphrase Corpus. This task consists of sentence pairs, and the model learns to identify whether the two sentences are paraphrases of each other.
*   **STS-B** - Semantic Textual Similarity Benchmark. This task is similar to MRPC, but instead of a yes/no label, each sentence pair is annotated with a graded similarity score that the model learns to predict.
*   **QQP** - Quora Question Pairs. Like MRPC, this task asks whether two sentences mean the same thing. The difference is that the inputs are pairs of questions.
*   **MNLI** - MultiNLI. This task trains a model to identify the relationship between two sentences: entailment (the first sentence logically implies the second), contradiction, or neutral.
*   **QNLI** - Question NLI. Given a question and a sentence from a relevant text, the model decides whether the sentence contains the answer to the question.
*   **RTE** - Recognizing Textual Entailment. This task is similar to MNLI, but with only two labels. Both ask whether one sentence can be inferred from another.
*   **WNLI** - Winograd NLI. Each example pairs a sentence containing an ambiguous pronoun with a second sentence in which the pronoun has been replaced by a candidate referent. The model decides whether the second sentence follows from the first, which often requires world knowledge.
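
As a rough summary of the list above, the sketch below records whether each task takes one sentence or a pair, and how many output classes it has. The `GLUE_TASKS` table and the `describe` helper are illustrative names, not part of `transformers`; the class counts reflect the standard GLUE task definitions, with STS-B marked separately because it is a regression task.

```python
# Hypothetical summary table: task key -> (number of input sentences, number of classes).
# None marks STS-B, which predicts a real-valued similarity score instead of a class.
GLUE_TASKS = {
    "cola": (1, 2),
    "sst2": (1, 2),
    "mrpc": (2, 2),
    "stsb": (2, None),
    "qqp": (2, 2),
    "mnli": (2, 3),
    "qnli": (2, 2),
    "rte": (2, 2),
    "wnli": (2, 2),
}

def describe(task_key):
    """Return a one-line description of a task's input and output shape."""
    n_sentences, n_classes = GLUE_TASKS[task_key]
    inputs = "single sentence" if n_sentences == 1 else "sentence pair"
    outputs = "regression score" if n_classes is None else f"{n_classes} classes"
    return f"{task_key}: {inputs} -> {outputs}"

for key in GLUE_TASKS:
    print(describe(key))
```

Tasks with two classes match the default classification head; MNLI and STS-B would need a differently sized head and, for STS-B, a regression loss.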

Once we have chosen our task, we will load in the DistilBERT model and its corresponding vocabulary.

In [3]:
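# GLUE task names as used by tensorflow_datasets
tasks = ["cola", "sst2", "mrpc", "qqp", "stsb", "mnli", "qnli", "rte", "wnli"]
display_txt = "Choose a task from the following: " + str(tasks) + "\n\n"

# Keep asking until the user enters one of the task names above
while True:
  chosen_task = input(display_txt).strip().lower()
  if chosen_task in tasks:
    break
  print("Please enter a valid task. Do not put '' or \"\" around the selected task.\n")

task = "glue/" + chosen_task
# Load the pretrained tokenizer and model. The classification head defaults to
# two labels, which does not fit MNLI (three classes) or STS-B (regression).
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
# Download (on first use) and load every split of the chosen GLUE task
data = tfds.load(task)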

Choose a task from the following: ['cola', 'sst2', 'mrpc', 'qqp', 'stsb', 'mnli', 'qnli', 'rte', 'wnli']

mrpc


100%|██████████| 231508/231508 [00:00<00:00, 2710242.75B/s]
100%|██████████| 492/492 [00:00<00:00, 198537.38B/s]
100%|██████████| 363423424/363423424 [00:07<00:00, 46428977.50B/s]
INFO:absl:Load pre-computed datasetinfo (eg: splits) from bucket.
INFO:absl:Loading info from GCS for glue/mrpc/0.0.2
INFO:absl:Field info.description from disk and from code do not match. Keeping the one from code.
INFO:absl:Field info.location from disk and from code do not match. Keeping the one from code.
INFO:absl:Generating dataset glue (/root/tensorflow_datasets/glue/mrpc/0.0.2)

Downloading and preparing dataset glue (1.43 MiB) to /root/tensorflow_datasets/glue/mrpc/0.0.2...

INFO:absl:Downloading https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2Fmrpc_dev_ids.tsv?alt=media&token=ec5c0836-31d5-48f4-b431-7480817f1adc into /root/tensorflow_datasets/downloads/fire.goog.com_v0_b_mtl-sent-repr.apps.com_o_2FjSIMlCiqs1QSmIykr4IRPnEHjPuGwAz5i40v8K9U0Z8.tsvalt=media&token=ec5c0836-31d5-48f4-b431-7480817f1adc.tmp.039620f3f9e642b2a0f2a25091e84e1a...
INFO:absl:Downloading https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt into /root/tensorflow_datasets/downloads/dl.fbaip.com_sente_sente_msr_parap_trainfGxPZuQWGBti4Tbd1YNOwQr-OqxPejJ7gcp0Al6mlSk.txt.tmp.d24b7ceb10f44f768ccf9bed4811654b...
INFO:absl:Downloading https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt into /root/tensorflow_datasets/downloads/dl.fbaip.com_sente_sente_msr_parap_test0PdekMcyqYR-w4Rx_d7OTryq0J3RlYRn4rAMajy9Mak.txt.tmp.6887cf3f5f144f888d0e98d9f14e8cc3...
INFO:absl:Generating split train
INFO:absl:Writin

Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`


INFO:absl:Generating split validation
INFO:absl:Writing TFRecords

INFO:absl:Generating split test
INFO:absl:Writing TFRecords

INFO:absl:Skipping computing stats for mode ComputeStatsMode.AUTO.
INFO:absl:Constructing tf.data.Dataset for split None, from /root/tensorflow_datasets/glue/mrpc/0.0.2

Dataset glue downloaded and prepared to /root/tensorflow_datasets/glue/mrpc/0.0.2. Subsequent calls will reuse this data.


# **Convert to Features**
Above, we loaded the data for the task you chose. Now we will use `transformers` to convert the raw examples into tokenized features that our DistilBERT model can use.
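
To give a feel for what this conversion produces, here is a toy sketch that does not use `transformers` at all. The whitespace tokenizer, the tiny `VOCAB`, and the `to_features` helper are assumptions made for illustration; the real tokenizer uses WordPiece and a much larger vocabulary. The idea is the same: wrap the sentence pair in special tokens, map tokens to ids, pad to a fixed length, and record which positions hold real tokens in an attention mask.

```python
# Toy vocabulary; the ids are arbitrary and only for illustration.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "the": 4, "cat": 5, "sat": 6, "a": 7, "was": 8, "sitting": 9}

def to_features(sentence1, sentence2, max_length=12):
    """Encode a sentence pair as padded ids plus an attention mask."""
    tokens = ["[CLS]"] + sentence1.lower().split() + ["[SEP]"]
    tokens += sentence2.lower().split() + ["[SEP]"]
    tokens = tokens[:max_length]  # naive truncation, kept simple for brevity
    input_ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]
    attention_mask = [1] * len(input_ids)  # 1 marks a real token
    padding = max_length - len(input_ids)
    input_ids += [VOCAB["[PAD]"]] * padding
    attention_mask += [0] * padding        # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

features = to_features("The cat sat", "A cat was sitting")
print(features["input_ids"])
print(features["attention_mask"])
```

Padding every example to the same length is what lets the examples be stacked into fixed-shape batches.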

In [0]:
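# transformers names two GLUE tasks differently from tensorflow_datasets
glue_task = {"sst2": "sst-2", "stsb": "sts-b"}.get(chosen_task, chosen_task)
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task=glue_task)
valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, max_length=128, task=glue_task)
# Shuffle, batch, and repeat the training data so that it covers two epochs
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
valid_dataset = valid_dataset.batch(64)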

# **Compile and Train our Model**

In [0]:
# Adam with a small learning rate and gradient-norm clipping
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
# The model outputs raw logits, so the loss applies the softmax itself
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

In [6]:
history = model.fit(train_dataset, epochs=2, steps_per_epoch=115,
                    validation_data=valid_dataset, validation_steps=7)

Train for 115 steps, validate for 7 steps
Epoch 1/2
Epoch 2/2
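
The step counts passed to `fit` follow from the split sizes and the batch sizes. The short check below assumes the MRPC split sizes of 3668 training and 408 validation examples, together with the batch sizes used earlier (32 and 64); for another task the numbers would differ. The `steps_for` helper is an illustrative name.

```python
import math

def steps_for(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)

train_steps = steps_for(3668, 32)  # MRPC training split, batch size 32
valid_steps = steps_for(408, 64)   # MRPC validation split, batch size 64
print(train_steps, valid_steps)
```

The last batch of each split is smaller than the others, which is why the division is rounded up.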


# **Save our Model**
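Below, you may specify the name of the folder to which you want to save your model. The saved TensorFlow weights are then reloaded as a PyTorch model, which the last section uses to make predictions.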

In [8]:
import os

# Ask for a directory name until the user confirms it
while True:
  save_dir = input("Input the name of the directory to save your model: \n")
  proceed = input("Your directory name is " + "\"" + save_dir + "\". " + "Do you wish to change it? y/n: \n")
  # startswith avoids an IndexError when the user just presses Enter
  if proceed.strip().lower().startswith("n"):
    break

save_dir = "/content/" + save_dir
os.makedirs(save_dir, exist_ok=True)

# Save the TensorFlow weights, then reload them as a PyTorch model
model.save_pretrained(save_dir)
pytorch_model = DistilBertForSequenceClassification.from_pretrained(save_dir, from_tf=True)

Input the name of the directory to save your model: 
save
Your directory name is "save". Do you wish to change it? y/n: 
n


# **Download our Model**
If you want to download your model to your computer, run this cell. Otherwise, you may skip ahead.

In [18]:
from google.colab import files

# Ask once; the original loop would prompt again after every download
proceed = input("Do you wish to proceed and download your model? y/n \n")
if proceed.strip().lower().startswith("y"):
  for file_name in os.listdir(save_dir):
    files.download(os.path.join(save_dir, file_name))

Do you wish to proceed and download your model? y/n 
n


# **Use the Model!**
Running this cell allows you to use your model for the given task you specified. Now we can see how well our DistilBERT model does!

In [11]:
task_by_key = {
    "cola": "Acceptability",
    "sst2": "Sentiment",
    "mrpc": "Similarity",
    "stsb": "Similarity",
    "qqp": "Similarity",
    "mnli": "Entailment",
    "qnli": "Entailment",
    "rte": "Entailment",
    "wnli": "Entailment"
}

def accept_or_sentiment(task):
  sentence = input("Type in a sentence for " + task + ".\n")
  return sentence

def similarity():
  sentence1 = input("Type in a sentence: \n")
  sentence2 = input("Paraphrase or don't paraphrase the sentence: \n")
  return sentence1, sentence2

def entailment():
  sentence1 = input("Type in a sentence: \n")
  sentence2 = input("Type in a sentence that follows, contradicts, or has a neutral relationship to the previous sentence: \n")
  return sentence1, sentence2

def make_pred(sentence1, sentence2=None):
  # encode_plus accepts an optional second sentence, so one code path covers both cases
  inputs = tokenizer.encode_plus(sentence1, sentence2, add_special_tokens=True, return_tensors="pt")
  # The first output holds the logits; the index of the largest one is the predicted class
  return pytorch_model(inputs["input_ids"])[0].argmax().item()

while True:
  proceed = input("Do you wish to continue? y/n: \n")
  # startswith avoids an IndexError when the user just presses Enter
  if not proceed.strip().lower().startswith("y"):
    break
  task_key = input("What was the task you chose? " + str(tasks) + "\n").strip().lower()
  if task_key not in tasks:
    print("Restart. Please enter a valid task. Make sure to not include '' or \"\" in your response.")
    break
  task = task_by_key[task_key]
  if task == "Sentiment":
    sentence1 = accept_or_sentiment(task)
    pred1 = make_pred(sentence1)
    print("Your sentence is", "positive." if pred1 else "negative.")
  elif task == "Acceptability":
    sentence1 = accept_or_sentiment(task)
    pred1 = make_pred(sentence1)
    print("Your sentence is", "acceptable." if pred1 else "not acceptable.")
  elif task == "Similarity":
    sentence1, sentence2 = similarity()
    pred1 = make_pred(sentence1, sentence2)
    print("Your second sentence is", "a paraphrase" if pred1 else "not a paraphrase", "of your input.")
  else:
    sentence1, sentence2 = entailment()
    pred1 = make_pred(sentence1, sentence2)
    # NOTE: which label index means "entailment" depends on the task's label order;
    # check it against the dataset's label names before relying on this message.
    print("Your second sentence is", "an entailment." if pred1 == 1 else "a contradiction.")

Do you wish to continue? y/n: 
y
What was the task you chose? ['cola', 'sst2', 'mrpc', 'qqp', 'stsb', 'mnli', 'qnli', 'rte', 'wnli']
mrpc
Type in a sentence: 
I want to find a research opportunity at a university this summer.
Paraphrase or don't paraphrase the sentence: 
I am looking to conduct research at a university this coming summer.
Your second sentence is a paraphrase of your input.
Do you wish to continue? y/n: 
n
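
`make_pred` reduces the model's output to a class index with `argmax`. The sketch below shows that step on made-up logits in plain Python, and also applies a softmax to turn the logits into probabilities. The `softmax` and `predict` helpers and the example numbers are illustrative assumptions, not output from the trained model.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return (index of the largest logit, probability of that class)."""
    probs = softmax(logits)
    best = max(range(len(logits)), key=lambda i: logits[i])
    return best, probs[best]

label, confidence = predict([-1.2, 2.3])  # made-up logits for a two-class task
print(label, round(confidence, 3))
```

Reporting the probability alongside the label could make it easier to tell a confident prediction from a borderline one.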
