# NLP 303 - Natural Language Processing
## Task 1
### By: Michael Cuffe
### Assessment 1
### Due: 20/10/2024 23:59

### Install Necessary Libraries
The library installation messages are suppressed by appending > NUL 2>&1 to each pip install command, which keeps the notebook output readable. NUL is the Windows null device, so this redirect only works as intended on Windows.<br> 
The tf-keras package fixes a compatibility error with the transformers library.

In [13]:
!pip install transformers > NUL 2>&1
!pip install tensorflow > NUL 2>&1
!pip install tf-keras > NUL 2>&1
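
The > NUL 2>&1 redirect is specific to Windows; on Linux or macOS it would write the output to a file named NUL instead. One portable alternative, sketched here, is to call pip through `subprocess` and send its output to `subprocess.DEVNULL`. The helper names `quiet_pip_command` and `install_quietly` are hypothetical and not part of the original notebook.

```python
import subprocess
import sys


def quiet_pip_command(package):
    """Build a pip install command that uses the current interpreter."""
    # "-q" asks pip itself to print less; the redirect below hides the rest.
    return [sys.executable, "-m", "pip", "install", "-q", package]


def install_quietly(package):
    """Run pip with all output discarded; return True if pip exited with 0."""
    completed = subprocess.run(
        quiet_pip_command(package),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return completed.returncode == 0


# Packages taken from the install cell; nothing is installed until
# install_quietly() is actually called.
PACKAGES = ["transformers", "tensorflow", "tf-keras"]
```

Calling `install_quietly(name)` for each entry in `PACKAGES` would mirror the three commands in the install cell on any operating system.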

# Testing The Installation of Transformers
This code block tests the installation of the transformers library. It is a simple test to ensure that the library is installed correctly and is functioning as intended.

This block also imports the pipeline function once, so that it is available to every later code block in the notebook.

In [4]:
from transformers import pipeline
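
Because the install output is suppressed, a failed install would go unnoticed until the import fails. A minimal sketch of an explicit check, using the standard library's `importlib.metadata`, is given here. The helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report on the three packages named in the install cell.
for name in ("transformers", "tensorflow", "tf-keras"):
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```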

## Check that transformers is functional.

Because I am running this locally, I also had to set device=0 to enable GPU usage.
Each pipeline explicitly names the recommended default model for its task.
clean_up_tokenization_spaces=True is only there to suppress a warning about a future update.

In [15]:
translator = pipeline("translation_en_to_de", clean_up_tokenization_spaces = True, model="google-t5/t5-base", device=0)
print(translator("The magic of transformers lies in pre-trained models"))

[{'translation_text': 'Die Magie der Transformatoren liegt in vorgeschulten Modellen'}]
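
Hard-coding device=0 fails on a machine without a GPU. The sketch below picks the device at run time, assuming the pipelines run on the PyTorch backend; the pipeline's `device` argument accepts -1 for CPU and a non-negative integer for a GPU index. The helper name `pick_device` is hypothetical.

```python
def pick_device():
    """Return 0 (first GPU) if PyTorch can see a GPU, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1  # PyTorch is not installed, so fall back to CPU.
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print(f"Using device index {device}")
```

The result could then be passed as `device=device` wherever the notebook currently passes `device=0`.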


# NER Pipeline
### Named Entity Recognition

This code block is broken down into 4 parts:
1. The initialization of the pipeline.
2. The specification of a sequence of text.
3. The NER of the text.
4. The results are printed in a for loop to make them more readable.

ORG = Organizations <br>
LOC = Locations <br>
PER = Persons <br>
MISC = Miscellaneous <br>

This cell displays a warning about unused model weights. The warning can safely be ignored: it is expected for this checkpoint, and the code works as intended.


In [11]:
ner_pipeline = pipeline("ner", aggregation_strategy="simple", model="dbmdz/bert-large-cased-finetuned-conll03-english",
                        device=0)

text = "Steve Irwin was born in Australia. He was a leading nature conservationist and led and curated Taronga Zoo with the help of the NSW Government."

ner_results = ner_pipeline(text)

for entity in ner_results:
    print(f"Entity: {entity['word']}, Type: {entity['entity_group']}, Score: {entity['score']:.4f}")

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Entity: Steve Irwin, Type: PER, Score: 0.9997
Entity: Australia, Type: LOC, Score: 0.9995
Entity: Taronga Zoo, Type: LOC, Score: 0.8920
Entity: NSW Government, Type: ORG, Score: 0.8671
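
With aggregation_strategy="simple", each result is a dictionary that includes the keys `word`, `entity_group` and `score`. The sketch below groups entities by type and optionally drops low-confidence ones. The data is a hard-coded copy of the printed results (scores rounded to 4 decimals) so that it runs without loading the model; the helper name `group_entities` and the 0.9 threshold are illustrative assumptions.

```python
from collections import defaultdict

# Hard-coded copy of the printed results (scores rounded to 4 decimals).
sample_results = [
    {"word": "Steve Irwin", "entity_group": "PER", "score": 0.9997},
    {"word": "Australia", "entity_group": "LOC", "score": 0.9995},
    {"word": "Taronga Zoo", "entity_group": "LOC", "score": 0.8920},
    {"word": "NSW Government", "entity_group": "ORG", "score": 0.8671},
]


def group_entities(results, min_score=0.0):
    """Map each entity type to the words found, skipping low scores."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


print(group_entities(sample_results))
print(group_entities(sample_results, min_score=0.9))
```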



# Sentiment Analysis
This code block is broken down into 3 parts:
1. The initialization of the pipeline.
2. The specification of a sequence of text.
3. The sentiment analysis of the text.

In [7]:
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device=0)

sequence1 = "I love the coding in python!"
sequence2 = "Using 200 nested if statements is terrible."

sentiment_results1 = sentiment_pipeline(sequence1)
sentiment_results2 = sentiment_pipeline(sequence2)

print(sentiment_results1)
print(sentiment_results2)

[{'label': 'POSITIVE', 'score': 0.9996840953826904}]
[{'label': 'NEGATIVE', 'score': 0.9977462887763977}]
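
The sentiment pipeline also accepts a list of strings and returns one result dictionary per string, so both sequences could be classified in a single call. The sketch below shows that pattern. `label_texts` is a hypothetical helper, and `fake_classifier` is a hypothetical stand-in with made-up scores so that the example runs without downloading a model; in the notebook, `sentiment_pipeline` would be passed instead.

```python
def label_texts(classifier, texts):
    """Classify all texts in one call and pair each text with its result."""
    texts = list(texts)
    results = classifier(texts)  # one result dict per input text
    return [
        (text, result["label"], round(result["score"], 4))
        for text, result in zip(texts, results)
    ]


def fake_classifier(texts):
    """Hypothetical stand-in for sentiment_pipeline; not a real model."""
    return [
        {"label": "POSITIVE" if "love" in text else "NEGATIVE", "score": 1.0}
        for text in texts
    ]


rows = label_texts(
    fake_classifier,
    ["I love the coding in python!", "Using 200 nested if statements is terrible."],
)
for text, label, score in rows:
    print(f"{label} ({score}): {text}")
```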


# Summarization Pipeline
This code block looks long, but only because of the input text. It is broken down into 3 parts:
1. The initialization of the pipeline.
2. The specification of a sequence of text.
3. The summarization of the text.

In [8]:
summarization_pipeline = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", device=0)

long_text = """
The young fox went along the river until he met a large old toad. Upon meeting the toad the fox was baffled.
The toad was not afraid of the fox and the fox was not afraid of the toad. They both sat down and without a word conversed.
The fox tilted his head and the toad blinked. The fox blinked and the toad tilted his head.
The toad with her wide eyes spotted a fly on the fox and the fox with his keen eyes spotted a mouse under the toad. 
They both looked at each-other for a moment as if to request permission for what each wanted permission neither of them knew.
They did however feel their stomachs grumble and their mouths water. They both still sat and without a sound conversed.
The fly and the mouse were both aware of the fox and the toad. Neither dared to move in terror of their impending doom.
When the fox and the toad made eye contact once more they both knew what the other wanted.
At the very same moment the fox lunged at the toad and the toad lunged at the fox. The fly and the mouse at this point had also had their own conversation wordlessly. With the time allowed to them the fly hurried toward the fox and the mouse dashed toward the toad.
The fox and the toad collided with a large resounding crash, while the fly and the mouth made their escape in the confusion.
The fox and the toad both lay on the ground dazed and confused. They both looked at each-other and without a sound conversed.
Until both their bellies grumbled loudly and they both knew what the other wanted. They both stood up and walked away in shame of their poor communication skills. 
If only they were not so shy and had shown their cards sooner they both would have had a meal.
"""

summary = summarization_pipeline(long_text, max_length=100, min_length=30, do_sample=False)
print(summary)

[{'summary_text': ' The young fox went along the river until he met a large old toad . Upon meeting the toad the fox was baffled . The toad was not afraid of the fox . They both sat down and without a word conversed. They both looked at each-other for a moment as if to request permission for what each wanted .'}]
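
The summary text contains a leading space and spaces before some full stops. A small post-processing step, sketched below with the standard `re` module, can tidy this. The helper name `tidy_summary` is hypothetical, and the sample string is an excerpt of the printed summary.

```python
import re


def tidy_summary(text):
    """Remove spaces before punctuation and collapse repeated whitespace."""
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    return re.sub(r"\s{2,}", " ", text).strip()


raw = (
    " The young fox went along the river until he met a large old toad . "
    "Upon meeting the toad the fox was baffled ."
)
print(tidy_summary(raw))
```

In the notebook this would be applied as `tidy_summary(summary[0]['summary_text'])`.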


# Text Generation
This code block is broken down into 3 parts:
1. The initialization of the pipeline.
2. The specification of a starting sequence of text.
3. The generation and then printing of text.

In [9]:
text_generation_pipeline = pipeline("text-generation", model="gpt2", device=0, truncation=True)

starting_text = "Upon the spire, above the clouds rested a scintillating dragon,"

generated_text = text_generation_pipeline(starting_text, max_length=500, pad_token_id=50256)
print(generated_text[0]['generated_text'])

Upon the spire, above the clouds rested a scintillating dragon, whose form appeared in several places, bearing the likeness of a dragon, and who seemed to have passed by an abyss, and to have traveled far and near.

Some of these were of a species much affected with hatred, others of a different kind, and even a great loss of life, but all of these were of the same type, and the whole was destroyed, and nothing in its place fell from it. That those who have no knowledge of such things are those of greater age than they belong to may be judged by one or two specimens. Now we shall not say that the destruction of this creature, that is, that which was lost from one place to another, without being destroyed in one location, but we must say, that the destruction of that which is lost in one place to another is a common act of justice. It was said, When a man is killed this does not belong to him, but when death came out he died again. So it was said, If a man dies of this, and is buried th
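
The generated text stops mid-word ("buried th"), most likely because generation ended at the max_length limit, which is measured in tokens and includes the prompt. One simple way to tidy the output is to cut it back to the last complete sentence, as sketched below. The helper name `trim_to_last_sentence` is hypothetical, and the sample string is an excerpt of the printed output.

```python
def trim_to_last_sentence(text):
    """Cut text after its last '.', '!' or '?'; return it unchanged if none."""
    cut = max(text.rfind(mark) for mark in ".!?")
    return text if cut == -1 else text[: cut + 1]


tail = (
    "when death came out he died again. "
    "So it was said, If a man dies of this, and is buried th"
)
print(trim_to_last_sentence(tail))
```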

# Question Answering
This code block has much the same structure as the others. It is broken down into 3 parts:
1. The initialization of the pipeline.
2. The specification of a context and a question.
3. The question answering.

In [10]:
qa_pipeline = pipeline("question-answering", model="distilbert/distilbert-base-cased-distilled-squad", device=0)

context = """
Python is a programming language. It is used for web development, data analysis, artificial intelligence, and scientific computing.
Python is easy to learn and has a large community. Python is an interpreted language, which means that it is executed line by line.
It is known for its readability and simplicity. Python is an object-oriented language, which means that it can model real-world entities.
Python doesnt require a compiler, which makes it easier to debug. Python is a high-level language, which means that it is closer to human language.
Python has its own memory management, which means that it automatically allocates and deallocates memory.
"""
question = "What is the main topic of the text?"

qa_result = qa_pipeline(question=question, context=context)
print(qa_result)

{'score': 0.06401262432336807, 'start': 50, 'end': 105, 'answer': 'web development, data analysis, artificial intelligence'}
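
The score of about 0.064 is low. This model is extractive: it returns a span copied from the context, located by the `start` and `end` character offsets, so an open question such as "main topic" has no single span that answers it well. The sketch below checks that the offsets match the answer and rejects low-confidence results. The helper name `accept_answer` and the 0.5 threshold are assumptions, and the context is shortened to its first sentences.

```python
# First sentences of the notebook's context, including the leading newline.
context = (
    "\nPython is a programming language. It is used for web development, "
    "data analysis, artificial intelligence, and scientific computing.\n"
)
# Copy of the printed result.
qa_result = {
    "score": 0.06401262432336807,
    "start": 50,
    "end": 105,
    "answer": "web development, data analysis, artificial intelligence",
}


def accept_answer(result, context, min_score=0.5):
    """Return the answer if it matches its span and clears min_score."""
    span = context[result["start"]:result["end"]]
    if span != result["answer"]:
        return None  # offsets do not point at the reported answer
    if result["score"] < min_score:
        return None  # the model itself is not confident
    return result["answer"]


print(accept_answer(qa_result, context))
print(accept_answer(qa_result, context, min_score=0.05))
```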


<br>
<br>
<br>
End of File