
#Preamble

This notebook is based on the [HuggingFace documentation](https://huggingface.co/transformers/v3.0.2/task_summary.html), which demonstrates common use cases of the library.

In completing the tasks below, you might find the PyTorch examples shown in the HuggingFace documentation useful.

#Installation of Transformers and useful datasets

In [1]:
! pip install transformers datasets
! pip3 install emoji

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting datasets
  Downloading datasets-2.9.0-py3-none-any.whl (462 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)

#Sequence Classification: Sentiment Analysis

Try some examples such as:
*   Sarcastic use of "thanks": "Thanks for making me miss the Brit Awards." or "Thanks to the lockdown, I haven't seen my cousin for more than 2 years."
*   Genuine use of "thanks": "Thanks for making me come to the exam." or "Thanks to the lockdown, I discovered knitting."
*   Traditional use of "sick": "The contaminated water made them sick."
*   Slang use of "sick": "That dance routine is sick."





##Pipeline

The code below creates a `pipeline` using a model fine-tuned to perform sentiment analysis. A HuggingFace pipeline works much like a high-level API: it wraps tokenisation, model inference and post-processing in a single call, so the user gets model results without having to write that code.

In [2]:
from transformers import pipeline

classifier = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")



Below, try your own examples (see the suggestions above) and observe how the probability scores change. Note that `POS`, `NEG` and `NEU` stand for positive, negative and neutral sentiment, respectively.

In [3]:
result = classifier("We invite you to the upcoming workshop on Data Visualisation.")[0]
print(f"label: {result['label']} with score: {int(round(result['score'] * 100))}%")

label: NEU with score: 65%


In [4]:
result = classifier("Thanks to the lockdown, I haven't seen my cousin for more than 2 years.")[0]
print(f"label: {result['label']} with score: {int(round(result['score'] * 100))}%")

label: NEG with score: 92%


Check out the default pipelines available at [HuggingFace](https://huggingface.co/docs/transformers/v4.21.2/en/main_classes/pipelines). For these pipelines, you do not need to specify a model.

##Direct model use

Here, we demonstrate direct model use, an alternative to using a `pipeline` that gives the user lower-level access to a model's outputs. The code below loads:
*   a tokeniser which segments raw text into tokens
*   a model fine-tuned to perform sentiment analysis

The sentiment analysis model comes with its own tokeniser, so we load the tokeniser and the model under the same name.

In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("finiteautomata/bertweet-base-sentiment-analysis")
model = AutoModelForSequenceClassification.from_pretrained("finiteautomata/bertweet-base-sentiment-analysis")

classes = ["NEG", "NEU", "POS"]


Below, try your own examples (see some suggestions above) and observe how the probability scores change. Note that POS, NEG and NEU stand for positive, negative and neutral sentiment, respectively.

In [6]:
sequence = "We invite you to the upcoming workshop on Data Visualisation."

# PyTorch (pt) tensors will be returned by the tokeniser
tokens = tokenizer(sequence, return_tensors="pt")

# Logits (raw, non-normalised predictions) are returned by the model upon being given the token tensors as input
classification_logits = model(**tokens).logits

# Softmax is used to generate normalised probabilities based on the logits
results = torch.softmax(classification_logits, dim=1).tolist()[0]

for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(results[i] * 100))}%")


NEG: 0%
NEU: 65%
POS: 35%
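
To make the softmax step above concrete, here is a minimal pure-Python sketch of the same normalisation. The logits in `toy_logits` are hypothetical values chosen for illustration; they are not actual model output, and `softmax` here is a hand-written stand-in for `torch.softmax`.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # without changing the resulting probabilities
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["NEG", "NEU", "POS"]

# Hypothetical logits, for illustration only (not real model output)
toy_logits = [-2.0, 1.0, 0.5]

probabilities = softmax(toy_logits)
for label, p in zip(classes, probabilities):
    print(f"{label}: {round(p * 100)}%")
```

Because softmax preserves the ordering of the logits, the class with the largest logit always receives the largest probability.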


#Pairwise Sequence Classification: Paraphrase Identification

##Direct model use

The code below loads a model that has been fine-tuned to perform paraphrase identification. Specifically, it was fine-tuned on the Microsoft Research Paraphrase Corpus ([MRPC](https://deepai.org/dataset/mrpc)). Since this model comes with its own tokeniser, we use that tokeniser for tokenisation.

As in the previous case, you can download the model and tokeniser from HuggingFace. In this case, use: `bert-base-cased-finetuned-mrpc`


In [8]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]


Feel free to try your own examples below.

In [9]:
sequence_0 = "David and Victoria Beckham have been married for more than 20 years."
sequence_1 = "Victoria Beckham is rarely seen smiling in photos."
sequence_2 = "Posh Spice has been Beckham's wife since 1999."

pair1 = tokenizer(sequence_0, sequence_2, return_tensors="pt")
pair2 = tokenizer(sequence_0, sequence_1, return_tensors="pt")

pair1_classification_logits = model(**pair1).logits
pair1_results = torch.softmax(pair1_classification_logits, dim=1).tolist()[0]

for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(pair1_results[i] * 100))}%")

not paraphrase: 46%
is paraphrase: 54%


Similarly, get the results for `pair2`:

In [11]:
pair2_classification_logits = model(**pair2).logits
pair2_results = torch.softmax(pair2_classification_logits, dim=1).tolist()[0]

for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(pair2_results[i] * 100))}%")

not paraphrase: 96%
is paraphrase: 4%
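
When a BERT tokeniser is called with two sequences, as in `tokenizer(sequence_0, sequence_2, ...)` above, it packs them into a single input. The sketch below illustrates that packing in plain Python. The function `pack_pair` is a hypothetical helper, and whitespace splitting stands in for the real subword tokeniser, so treat this as a simplified picture and not as the library's implementation. In the real tokeniser output, the segment IDs appear under the key `token_type_ids`.

```python
def pack_pair(tokens_a, tokens_b):
    """Pack two token lists into one BERT-style input with segment IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], the first sequence and its [SEP];
    # segment 1 covers the second sequence and the final [SEP]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# Whitespace splitting is a simplification of real subword tokenisation
tokens_a = "Posh Spice married Beckham".split()
tokens_b = "Beckham 's wife is Posh Spice".split()

tokens, segment_ids = pack_pair(tokens_a, tokens_b)
for token, segment in zip(tokens, segment_ids):
    print(f"{segment}  {token}")
```

The segment IDs are what let the model tell the two sentences apart, which is why the same model call works for single sequences and for pairs.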


#Span-based Identification: Extractive Question Answering 

##Pipeline

The code below creates a pipeline using a model that has been fine-tuned to perform extractive question answering. If no model is specified, the HuggingFace `pipeline` simply loads the default model for question answering, which was fine-tuned on the Stanford Question Answering Dataset ([SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)).

In the cell immediately below, you can load the `question-answering` pipeline (i.e., no need to specify the name of any model). 

In [15]:
from transformers import pipeline

question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



Feel free to try your own context/passage and questions, in the cells below.

In [16]:
context = r"""
Manchester City Football Club is an English football club based in Manchester 
that competes in the Premier League, the top flight of English football. Founded in 1880 
as St. Mark's (West Gorton), it became Ardwick Association Football Club in 1887 and 
Manchester City in 1894. The club's home ground is the Etihad Stadium in east Manchester, 
to which it moved in 2003, having played at Maine Road since 1923. The club adopted their sky blue 
home shirts in 1894 in the first season of the club's current iteration, that have been used 
ever since.
"""

In [17]:
result = question_answerer(question="Where did Manchester City use to play?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)

Answer: 'Maine Road', score: 0.6756, start: 391, end: 401


In [18]:
result = question_answerer(question="What was the former name of Manchester City?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)

Answer: 'Manchester City Football Club', score: 0.9115, start: 1, end: 30


In [19]:
result = question_answerer(question="What were the former names of Manchester City?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)

Answer: 'Ardwick Association Football Club in 1887 and 
Manchester City in 1894', score: 0.3135, start: 209, end: 279


In [20]:
result = question_answerer(question="What is the colour of the home kit of Manchester City?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)

Answer: 'sky blue', score: 0.9686, start: 437, end: 445


Try it out yourself!

In [21]:
result = question_answerer(question="What country is Manchester City from?", context=context)

print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)

Answer: 'English', score: 0.9837, start: 37, end: 44
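
The `start` and `end` values in the results above are character offsets into the context (for example, 401 - 391 = 10, the length of 'Maine Road'). The sketch below shows how such offsets relate to the answer text. It does not call the model: the `result` dictionary is built by hand to mimic the keys returned by the pipeline, and the shortened context is an assumption made for illustration.

```python
context = (
    "The club moved to the Etihad Stadium in 2003, "
    "having played at Maine Road since 1923."
)

# Hypothetical result dictionary, built by hand to mimic the pipeline's keys
answer = "Maine Road"
start = context.find(answer)
result = {"answer": answer, "start": start, "end": start + len(answer)}

# Slicing the context with start/end recovers the answer text,
# because 'end' is exclusive, just like a Python slice end
extracted = context[result["start"]:result["end"]]
print(extracted)
```

This is what makes the task extractive: the answer is always a span copied from the context, never newly generated text.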


##Direct model use

The code below loads a model that has been fine-tuned to perform extractive question answering. It is the same model as above, which was trained on the Stanford Question Answering Dataset ([SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)). Since this model comes with its own tokeniser, we use that tokeniser for tokenisation.


The model (and tokeniser) can be downloaded from HuggingFace and has the following name:
`distilbert-base-cased-distilled-squad`

In [24]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")

Feel free to try your own context/passage and questions, in the cells below.

In [32]:
text = r"""
Manchester United Football Club is a professional football club based in Old Trafford, 
Greater Manchester, England, that competes in the Premier League, the top flight of English football. 
Nicknamed "the Red Devils", the club was founded as Newton Heath LYR Football Club in 1878, 
but changed its name to Manchester United in 1902. The club moved from Newton Heath 
to its current stadium, Old Trafford, in 1910.
"""

questions = [
    "Where was Manchester United based before 1910?",
    "Manchester United used to be known by which name?",
    "When did Manchester United change its name?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Use the argmax of the scores to obtain the most likely beginning of the answer
    answer_start = torch.argmax(answer_start_scores).tolist()

    # Use the argmax of the scores to obtain the most likely end of the answer
    answer_end = torch.argmax(answer_end_scores).tolist()

    # Add 1 so that the slice below includes the end token (slice ends are exclusive)
    answer_end = answer_end + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )

    print(f"Question: {question}")
    print(f"Answer: {answer}")

Question: Where was Manchester United based before 1910?
Answer: Old Trafford, Greater Manchester, England
Question: Manchester United used to be known by which name?
Answer: the Red Devils
Question: When did Manchester United change its name?
Answer: 1902
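
The loop above picks the start and end tokens independently, so nothing prevents the predicted end from falling before the predicted start, in which case the extracted answer is empty. One possible refinement, sketched below in plain Python, is to search for the best-scoring span with `start <= end`. This is not what the cell above does: `best_span` and its `max_len` parameter are hypothetical, and the scores are made up for illustration.

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) maximising start score + end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within max_len tokens
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


tokens = ["in", "1878", ",", "but", "changed", "its", "name", "in", "1902", "."]

# Hypothetical scores: the two independent argmaxes disagree on purpose
start_scores = [0.1, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 3.0, 0.0]
end_scores = [0.0, 3.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0]

naive_start = start_scores.index(max(start_scores))
naive_end = end_scores.index(max(end_scores))
start, end = best_span(start_scores, end_scores)

print("independent argmax:", tokens[naive_start:naive_end + 1])
print("best valid span:   ", tokens[start:end + 1])
```

With these toy scores the independent argmaxes give an empty slice, while the joint search returns a valid one-token span.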


#Sequence Labelling: Named Entity Recognition

In the following sections, we demonstrate a named entity recognition (NER) model that can recognise the following types of named entities:
*   `PER` (PERSON)
*   `LOC` (LOCATION)
*   `ORG` (ORGANISATION)
*   `MISC` (MISCELLANEOUS)





##Pipeline

The code below creates a pipeline using a model that has been fine-tuned to perform named entity recognition (NER). If no model is specified, the HuggingFace `pipeline` simply loads the default model for named entity recognition (NER), which was fine-tuned on the [CoNLL-2003 corpus](https://paperswithcode.com/dataset/conll-2003).

In the cell below, you can load the `ner` pipeline (i.e., no need to specify the name of any model).



In [33]:
from transformers import pipeline

ner_pipe = pipeline("ner")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Feel free to try your own text/sequence below.

In [34]:
sequence = """
New York City is composed of five boroughs. The five boroughs—Brooklyn (Kings County), 
Queens (Queens County), Manhattan (New York County), the Bronx (Bronx County), and 
Staten Island (Richmond County)—were created when local governments were consolidated 
into a single municipal entity in 1898.
"""

The code below prints out only tokens belonging to recognised entities, and none of the O (outside) tokens.

In [35]:
for entity in ner_pipe(sequence):
    print(entity)

{'entity': 'I-LOC', 'score': 0.9978085, 'index': 1, 'word': 'New', 'start': 1, 'end': 4}
{'entity': 'I-LOC', 'score': 0.998159, 'index': 2, 'word': 'York', 'start': 5, 'end': 9}
{'entity': 'I-LOC', 'score': 0.99789494, 'index': 3, 'word': 'City', 'start': 10, 'end': 14}
{'entity': 'I-LOC', 'score': 0.98852634, 'index': 14, 'word': 'Brooklyn', 'start': 63, 'end': 71}
{'entity': 'I-LOC', 'score': 0.8375301, 'index': 16, 'word': 'Kings', 'start': 73, 'end': 78}
{'entity': 'I-LOC', 'score': 0.54350525, 'index': 17, 'word': 'County', 'start': 79, 'end': 85}
{'entity': 'I-LOC', 'score': 0.99087346, 'index': 20, 'word': 'Queens', 'start': 89, 'end': 95}
{'entity': 'I-LOC', 'score': 0.67543304, 'index': 22, 'word': 'Queens', 'start': 97, 'end': 103}
{'entity': 'I-LOC', 'score': 0.49837163, 'index': 23, 'word': 'County', 'start': 104, 'end': 110}
{'entity': 'I-LOC', 'score': 0.9940435, 'index': 26, 'word': 'Manhattan', 'start': 113, 'end': 122}
{'entity': 'I-LOC', 'score': 0.9716285, 'index': 2
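
The pipeline output above lists one dictionary per token, so a multi-word name such as 'New York City' appears as three separate entries. The sketch below shows one way to merge adjacent tokens of the same type into a single entity. `group_entities` is a hypothetical helper written for illustration; it joins words with spaces and ignores subword pieces, so it is a simplification. The `ner` pipeline also accepts an `aggregation_strategy` argument (e.g., `"simple"`) that performs a similar grouping for you.

```python
def group_entities(token_entities):
    """Merge runs of adjacent tokens that share an entity type into single spans."""
    groups = []
    for tok in token_entities:
        label = tok["entity"].split("-")[-1]  # "I-LOC" -> "LOC"
        last = groups[-1] if groups else None
        adjacent = last is not None and tok["index"] == last["last_index"] + 1
        if adjacent and last["label"] == label:
            # Extend the current entity with this token
            last["word"] += " " + tok["word"]
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
        else:
            # Start a new entity
            groups.append({
                "label": label,
                "word": tok["word"],
                "start": tok["start"],
                "end": tok["end"],
                "last_index": tok["index"],
            })
    return groups


# First few entries copied from the output above (scores omitted)
token_entities = [
    {"entity": "I-LOC", "index": 1, "word": "New", "start": 1, "end": 4},
    {"entity": "I-LOC", "index": 2, "word": "York", "start": 5, "end": 9},
    {"entity": "I-LOC", "index": 3, "word": "City", "start": 10, "end": 14},
    {"entity": "I-LOC", "index": 14, "word": "Brooklyn", "start": 63, "end": 71},
    {"entity": "I-LOC", "index": 16, "word": "Kings", "start": 73, "end": 78},
    {"entity": "I-LOC", "index": 17, "word": "County", "start": 79, "end": 85},
]

grouped = group_entities(token_entities)
for g in grouped:
    print(g["label"], g["word"], g["start"], g["end"])
```

Adjacency is decided from the token `index`, so 'Brooklyn' (index 14) and 'Kings' (index 16) stay separate because the bracket between them was not labelled as an entity.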

##Direct model use

The code below loads a model that has been fine-tuned to perform NER. It is the same model as above, which was trained on the [CoNLL-2003 corpus](https://paperswithcode.com/dataset/conll-2003). Since this model comes with its own tokeniser, we use that tokeniser for tokenisation.

The model (and tokeniser) can be downloaded from HuggingFace and has the following name: `dbmdz/bert-large-cased-finetuned-conll03-english`

    


In [36]:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

Feel free to try your own text/sequence below.

In [37]:
sequence = """
New York City is composed of five boroughs. The five boroughs—Brooklyn (Kings County), 
Queens (Queens County), Manhattan (New York County), the Bronx (Bronx County), and 
Staten Island (Richmond County)—were created when local governments were consolidated 
into a single municipal entity in 1898.
"""

inputs = tokenizer(sequence, return_tensors="pt")
tokens = inputs.tokens()

outputs = model(**inputs).logits

# Use softmax to normalise logits
normalised = torch.softmax(outputs, dim=2)

# Use argmax to obtain the most likely class for each token
predictions = torch.argmax(normalised, dim=2)

The code below prints out all tokens in the sequence, including those which were not labelled as part of any named entity. Note that the tokeniser automatically adds the tokens `[CLS]` (short for classification) and `[SEP]` (short for separator), as these tokens are expected by the NER model. In fact, they are expected by any BERT-based model, but we will discuss this in later weeks.

When you try your own text, you are also likely to find that the tokeniser segments some words into *subwords*, i.e., character *n*-grams which are not really words but were commonly encountered by the language model (which is what pre-trained BERT models are; again, more on this in later weeks).

In [38]:
for token, prediction, normalised_scores in zip(tokens, predictions[0].numpy(), normalised[0].detach().numpy()):
    print((token, model.config.id2label[prediction], normalised_scores[prediction]))

('[CLS]', 'O', 0.99968016)
('New', 'I-LOC', 0.9978085)
('York', 'I-LOC', 0.9981589)
('City', 'I-LOC', 0.99789494)
('is', 'O', 0.99994946)
('composed', 'O', 0.99994886)
('of', 'O', 0.9999398)
('five', 'O', 0.99989486)
('boroughs', 'O', 0.9992812)
('.', 'O', 0.99968016)
('The', 'O', 0.9999176)
('five', 'O', 0.9999231)
('boroughs', 'O', 0.99961746)
('—', 'O', 0.9972481)
('Brooklyn', 'I-LOC', 0.98852634)
('(', 'O', 0.9999411)
('Kings', 'I-LOC', 0.8375301)
('County', 'I-LOC', 0.54350525)
(')', 'O', 0.9999553)
(',', 'O', 0.9993923)
('Queens', 'I-LOC', 0.99087346)
('(', 'O', 0.9999474)
('Queens', 'I-LOC', 0.67543304)
('County', 'I-LOC', 0.49837163)
(')', 'O', 0.9999473)
(',', 'O', 0.9994536)
('Manhattan', 'I-LOC', 0.9940435)
('(', 'O', 0.999941)
('New', 'I-LOC', 0.9716285)
('York', 'I-LOC', 0.768913)
('County', 'I-LOC', 0.6042502)
(')', 'O', 0.999944)
(',', 'O', 0.9994462)
('the', 'O', 0.99984515)
('Bronx', 'I-LOC', 0.99308103)
('(', 'O', 0.9999442)
('Bronx', 'I-LOC', 0.66743886)
('County', '
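
As a small illustration of the subword segmentation mentioned above: BERT's WordPiece tokeniser marks a piece that continues the previous word with a `##` prefix. The sketch below rejoins such pieces into whole words. `merge_wordpieces` is a hypothetical helper, and the segmentation in `pieces` is made up for illustration; the real split depends on the model's vocabulary.

```python
def merge_wordpieces(tokens):
    """Rejoin WordPiece tokens: pieces starting with '##' continue the previous word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Strip the '##' marker and glue the piece onto the previous word
            words[-1] += token[2:]
        else:
            words.append(token)
    return words


# Hypothetical segmentation, for illustration only
pieces = ["The", "token", "##iser", "splits", "un", "##common", "words"]

merged = merge_wordpieces(pieces)
print(merged)
```

Since the model predicts one label per piece, a step like this (together with a rule for combining the labels of the pieces) is needed whenever word-level entities are required.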

# Trying out other readily available fine-tuned models

There are many other publicly available models which have been fine-tuned to perform the above tasks. In fact, there are so many that there is a [repository](https://huggingface.co/models). You can use the filters available (in their web interface) to find models which have been fine-tuned on specific tasks, languages or datasets. As a guide, you can use certain tags under `Tasks` as a filter:
*  If you are interested in sequence classification tasks such as sentiment analysis, use the `Text Classification` tag
*  If you are interested in paraphrase identification and other pairwise sequence classification tasks (e.g., natural language inference), use the `Sentence Similarity` tag
*  If you are interested in question answering, use the `Question Answering` tag
*  If you are interested in NER or other sequence labelling tasks, use the `Token Classification` tag

