# NLP 303 - Natural Language Processing
## By: Michael Cuffe
## Assessment 1
## Due: 2021-10-15 23:59

# Instructions

### Install Transformers

In [32]:
!pip install transformers





### Install TensorFlow

In [33]:
!pip install tensorflow





### Fix a Compatibility Error

In [34]:
!pip install tf-keras





# Testing The Installation of Transformers

In [35]:
from transformers import pipeline

## Check that transformers is functional.

#### A model is named explicitly in every pipeline declaration; for translation this is "google-t5/t5-base".
#### Because I am running this locally, I also set the device to 0 to enable GPU usage.
#### Each pipeline uses the recommended default model for its task.

In [36]:
translator = pipeline("translation_en_to_de", model="google-t5/t5-base", device=0)
print(translator("The magic of transformers lies in pre-trained models"))

[{'translation_text': 'Die Magie der Transformatoren liegt in vorgeschulten Modellen'}]
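
The `device=0` argument assumes a CUDA GPU is present on the machine. A minimal sketch of a CPU fallback follows, assuming PyTorch is the backend being probed; the helper name `pick_device` is hypothetical and not part of transformers.

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) if PyTorch reports one, otherwise -1 (CPU)."""
    try:
        import torch  # assumption: PyTorch backend; if missing, fall back to CPU
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print(device)
```

The returned value could then be passed as `device=device` in each `pipeline(...)` call instead of the hard-coded `0`.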


# NER Pipeline

In [37]:
# Initialize the NER pipeline
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner_pipeline = pipeline("ner", aggregation_strategy="simple",
                        model="dbmdz/bert-large-cased-finetuned-conll03-english", device=0)

# Specify a sequence of text
text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."

# Perform NER
ner_results = ner_pipeline(text)
print(ner_results)

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER', 'score': 0.9990641, 'word': 'Barack Obama', 'start': 0, 'end': 12}, {'entity_group': 'LOC', 'score': 0.9994017, 'word': 'Hawaii', 'start': 25, 'end': 31}, {'entity_group': 'LOC', 'score': 0.9952782, 'word': 'United States', 'start': 66, 'end': 79}]
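
The `start` and `end` values in the output are character offsets into the input string. The sketch below groups the entities by label and recovers each span from those offsets. The entity list is copied from the printed output (scores rounded); the helper `group_by_label` and the `min_score` threshold of 0.9 are illustrative assumptions.

```python
text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."

# Entities copied from the printed NER output (scores rounded).
ner_results = [
    {"entity_group": "PER", "score": 0.9991, "word": "Barack Obama", "start": 0, "end": 12},
    {"entity_group": "LOC", "score": 0.9994, "word": "Hawaii", "start": 25, "end": 31},
    {"entity_group": "LOC", "score": 0.9953, "word": "United States", "start": 66, "end": 79},
]


def group_by_label(entities, source, min_score=0.9):
    """Map each entity label to the text spans found for it."""
    grouped = {}
    for ent in entities:
        if ent["score"] < min_score:
            continue  # skip low-confidence entities (threshold is an assumption)
        span = source[ent["start"]:ent["end"]]  # slice by character offsets
        grouped.setdefault(ent["entity_group"], []).append(span)
    return grouped


by_label = group_by_label(ner_results, text)
print(by_label)  # {'PER': ['Barack Obama'], 'LOC': ['Hawaii', 'United States']}
```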


# Sentiment Analysis

In [38]:
# Initialize the sentiment analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device=0)

# Give the strings to be analysed.
text1 = "I love the new design of the website!"
text2 = "The service at the restaurant was terrible."

# Perform sentiment analysis
sentiment_results1 = sentiment_pipeline(text1)
sentiment_results2 = sentiment_pipeline(text2)

# Print the results
print(sentiment_results1)
print(sentiment_results2)

[{'label': 'POSITIVE', 'score': 0.9998728036880493}]
[{'label': 'NEGATIVE', 'score': 0.9995396137237549}]
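
Each result pairs a label with a confidence score. One way to compare results on a single scale is to fold the label into the sign of the score. This is a sketch using the two printed results; `signed_score` is a hypothetical helper, not a transformers function.

```python
# Results copied from the printed sentiment output.
results = [
    {"label": "POSITIVE", "score": 0.9998728036880493},
    {"label": "NEGATIVE", "score": 0.9995396137237549},
]


def signed_score(result):
    """Return the score as a value in [-1, 1]: positive labels > 0, negative < 0."""
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]


scores = [signed_score(r) for r in results]
print(scores)
```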


# Summarization Pipeline

In [46]:
# Initialize the summarization pipeline
summarization_pipeline = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", device=0)

# Specify a sequence of text (350-500 words)
long_text = """
The young fox went along the river until he met a large old toad. Upon meeting the toad the fox was baffled.
The toad was not afraid of the fox and the fox was not afraid of the toad. They both sat down and without a word conversed.
The fox tilted his head and the toad blinked. The fox blinked and the toad tilted his head.
The toad with her wide eyes spotted a fly on the fox and the fox with his keen eyes spotted a mouse under the toad. 
They both looked at each-other for a moment as if to request permission for what each wanted permission neither of them knew.
They did however feel their stomachs grumble and their mouths water. They both still sat and without a sound conversed.
The fly and the mouse were both aware of the fox and the toad. Neither dared to move in terror of their impending doom.
When the fox and the toad made eye contact once more they both knew what the other wanted.
At the very same moment the fox lunged at the toad and the toad lunged at the fox. The fly and the mouse at this point had also had their own conversation wordlessly. With the time allowed to them the fly hurried toward the fox and the mouse dashed toward the toad.
The fox and the toad collided with a large resounding crash, while the fly and the mouse made their escape in the confusion.
The fox and the toad both lay on the ground dazed and confused. They both looked at each-other and without a sound conversed.
Until both their bellies grumbled loudly and they both knew what the other wanted. They both stood up and walked away in shame of their poor communication skills. 
If only they were not so shy and had shown their cards sooner they both would have had a meal.
"""

# Perform summarization
summary = summarization_pipeline(long_text, max_length=100, min_length=30, do_sample=False)
print(summary)

[{'summary_text': ' The young fox went along the river until he met a large old toad . Upon meeting the toad the fox was baffled . They both sat down and without a word conversed. They both looked at each-other for a moment as if to request permission for what each wanted neither of them knew .'}]
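
The cell comment asks for an input of 350 to 500 words, while `max_length` and `min_length` limit the summary in tokens rather than words. A rough check of the word requirement could be sketched as below; whitespace splitting is only an approximation of a word count, and both helper names are hypothetical.

```python
def word_count(text: str) -> int:
    """Approximate the word count by splitting on whitespace."""
    return len(text.split())


def in_word_range(text: str, low: int = 350, high: int = 500) -> bool:
    """Check whether the text falls inside the requested word range."""
    return low <= word_count(text) <= high


sample = "The young fox went along the river until he met a large old toad."
print(word_count(sample), in_word_range(sample))  # 14 False
```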


# Text Generation Pipeline

In [40]:
# Initialize the text generation pipeline
text_generation_pipeline = pipeline("text-generation", model="gpt2", device=0)

# Specify a starting sequence of text
starting_text = "Once upon a time in a land far, far away,"

# Generate text
generated_text = text_generation_pipeline(starting_text, max_length=500)
print(generated_text[0]['generated_text'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Once upon a time in a land far, far away, and in a world far away, it is like you are traveling back in time. To make things all so beautiful and real, let us first examine the "perfect" time before and after the birth of the Bible, the day after the birth of Jesus Christ and before every possible date in the Old Testament and before every possible date in the New Testament.

When we begin by recalling that this God-head is our Lord, then we understand that the day after the birth of our Lord and Savior, our God is the Father, and the Son is the Holy Ghost, and the Holy Spirit represents our Lord's being here in Jesus Christ; and that He is the Lord even where He is not. The first is here at Holy Resurrection (the true one). After this, the second will occur and will be this: This, this man is Lord, the Lord also, and the Savior, who is one, and the Christ, who was crucified and rose again, came up in righteousness as the righteous have risen.

This man is Lord of all that exists and o
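
The printed generation stops mid-word. A small post-processing step could trim the text back to the last complete sentence. The sketch below uses a plain string search, so it will also cut at abbreviations such as "e.g."; the helper name is hypothetical.

```python
def trim_to_last_sentence(text: str) -> str:
    """Drop any trailing fragment after the last '.', '!' or '?'."""
    cut = max(text.rfind(ch) for ch in ".!?")
    return text if cut == -1 else text[:cut + 1]


# Tail of the generated text shown above.
tail = ("came up in righteousness as the righteous have risen.\n\n"
        "This man is Lord of all that exists and o")
print(trim_to_last_sentence(tail))  # came up in righteousness as the righteous have risen.
```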

# Summarization Pipeline (Default Model)

In [41]:
# Initialize the summarization pipeline
summarization_pipeline = pipeline("summarization")

# Specify a sequence of text (350-500 words)
long_text = """
Your long text goes here. Make sure it is between 350 to 500 words.
"""

# Perform summarization
summary = summarization_pipeline(long_text, max_length=100, min_length=30, do_sample=False)
print(summary)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Your max_length is set to 100, but your input_length is only 21. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=10)


[{'summary_text': " Your long text goes here. Make sure it's between 350 to 500 words . Make sure your long text is between 350 and 500 words long ."}]
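
The warning above appears because `max_length` (100) exceeds the input length (21 tokens). One way to avoid it is to derive the bounds from the input length. The halving heuristic below is an assumption that happens to agree with the `max_length=10` example in the warning; `summary_lengths` is a hypothetical helper, and `input_length` must be a token count obtained elsewhere.

```python
def summary_lengths(input_length: int, max_length: int = 100, min_length: int = 30):
    """Return (min_length, max_length) shrunk so the summary is shorter than the input."""
    new_max = min(max_length, max(1, input_length // 2))  # at most half the input
    new_min = min(min_length, max(1, new_max // 2))       # keep min below max
    return new_min, new_max


print(summary_lengths(21))   # (5, 10)
print(summary_lengths(400))  # (30, 100)
```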


# Question Answering Pipeline

In [42]:
# Initialize the question answering pipeline
qa_pipeline = pipeline("question-answering", model="distilbert/distilbert-base-cased-distilled-squad", device=0)

# Specify a context and a question
context = """
Your context text goes here.
"""
question = "What is the main topic of the text?"

# Perform question answering
qa_result = qa_pipeline(question=question, context=context)
print(qa_result)

{'score': 0.4472459852695465, 'start': 1, 'end': 13, 'answer': 'Your context'}
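
The result includes a confidence score and the character offsets of the answer within the context. A caller might reject low-confidence answers and confirm that the offsets really point at the answer text. This is a sketch using the printed result; `accept_answer` and the 0.5 threshold are illustrative assumptions.

```python
# Context and result copied from the cell above.
context = "\nYour context text goes here.\n"
qa_result = {"score": 0.4472459852695465, "start": 1, "end": 13, "answer": "Your context"}


def accept_answer(result, source, min_score=0.5):
    """Accept an answer only if its offsets match the context and its score is high enough."""
    span_ok = source[result["start"]:result["end"]] == result["answer"]
    return span_ok and result["score"] >= min_score


print(accept_answer(qa_result, context))  # False, since 0.447 is below the 0.5 threshold
```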


In [43]:
print('End of Program')

End of Program
