# NLP 303 - Natural Language Processing
## Task 2
### By: Michael Cuffe
### Assessment 1
### Due: 20/10/2024 23:59

### Install Necessary Packages
The library installation messages are suppressed by adding > NUL 2>&1 to the end of the pip install command. This is done to prevent the output from being displayed in the notebook for readability. 

In [39]:
!pip install transformers > NUL 2>&1
!pip install tensorflow > NUL 2>&1
!pip install tf-keras > NUL 2>&1
!pip install -q transformers librosa > NUL 2>&1

# Testing The Installation of Transformers
This code block tests the installation of the transformers library. It is a simple test to ensure that the library is installed correctly and is functioning as intended.

This block also doubles as a global declaration of the transformers pipeline. This is done to ensure that the pipeline is available to all code blocks in the notebook.

In [22]:
from transformers import pipeline

## Checking that transformers is functional.

This code block tests the functionality of the transformers library.

In [23]:
translator = pipeline("translation_en_to_de", model="google-t5/t5-base", device=0, clean_up_tokenization_spaces=True)
print(translator("The magic of transformers lies in pre-trained models"))

[{'translation_text': 'Die Magie der Transformatoren liegt in vorgeschulten Modellen'}]


# Masked Language Modeling with DistilBERT
This code block uses the following steps:<br>
1. Initialize the masked language modeling pipeline using distilbert-base-uncased.<br>
2. Provide an example sentence with a masked token.<br>
3. Generate text options to fill the masked input.<br>
4. Print the results in a readable format.<br>

In [24]:
from transformers import pipeline

mlm_pipeline = pipeline("fill-mask", model="distilbert-base-uncased", device=0)

sentence = "The quick brown fox leaps over the person on a [MASK]."

mlm_results = mlm_pipeline(sentence)

for result in mlm_results:
    print(f"Option: {result['sequence']}, Score: {result['score']:.4f}")

Option: the quick brown fox leaps over the person on a leash., Score: 0.1125
Option: the quick brown fox leaps over the person on a ledge., Score: 0.0519
Option: the quick brown fox leaps over the person on a ladder., Score: 0.0403
Option: the quick brown fox leaps over the person on a leap., Score: 0.0364
Option: the quick brown fox leaps over the person on a limb., Score: 0.0353


# Sentiment Analysis with ProsusAI/finbert

This code block uses the following steps:<br>
1. Initialize the sentiment analysis pipeline.<br>
2. Provide a list of stock market headlines.<br>
3. Classify the sentiment of the financial content.<br>
4. Print the results in a readable format.<br>
5. Repeat the process for each headline.<br>

In [25]:
finbert_pipeline = pipeline("sentiment-analysis", model="ProsusAI/finbert", device=0)

headlines = [
    "Cats rally as tech shares rebound",
    "Cat Predicts Market crash amid economic uncertainty",
    "Investors optimistic about new feline policies"
    "Cat elected as mayor in 30 states of America",
    "Stocks hit all-time low after poor quarterly results"
]

for headline in headlines:
    sentiment = finbert_pipeline(headline)
    print(f"Headline: {headline}, Sentiment: {sentiment[0]['label']}, Score: {sentiment[0]['score']:.4f}")

Headline: Cats rally as tech shares rebound, Sentiment: positive, Score: 0.5534
Headline: Cat Predicts Market crash amid economic uncertainty, Sentiment: negative, Score: 0.9009
Headline: Investors optimistic about new feline policiesCat elected as mayor in 30 states of America, Sentiment: positive, Score: 0.7586
Headline: Stocks hit all-time low after poor quarterly results, Sentiment: negative, Score: 0.9664


  attn_output = torch.nn.functional.scaled_dot_product_attention(


# Dialogue Generation with Microsoft DialoGPT-large

Here we use the provided code snippet to initialize the model. <br>

The code block uses the following steps:<br>
1. Initialize the DialoGPT model and tokenizer.<br>
2. Set the Chat range for 5 lines or more.<br>
3. Ask for User input.<br>
4. Append the new user input tokens to the chat history.<br>
5. Generate a response to the text with it.<br>
6. Decode the response & print the response.<br>

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")

chat_history_ids = None
for step in range(5):
    user_input = input(">> User: ")
    new_user_input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors='pt')

    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if chat_history_ids is not None else new_user_input_ids

    attention_mask = torch.cat([torch.ones_like(chat_history_ids), torch.ones_like(new_user_input_ids)], dim=-1) if chat_history_ids is not None else torch.ones_like(new_user_input_ids)
    chat_history_ids = model.generate(
        bot_input_ids, 
        max_length=1000, 
        pad_token_id=tokenizer.eos_token_id,
        attention_mask=attention_mask
    )

    response = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
    print(f"DialoGPT says '{response}'")

# Speech Recognition with Facebook Wav2Vec2
This code block review the facebook/wav2vec2-base-960h model. <br>
For this i used .wav audio file of Winston Churchills famous speech "Their Finest Hour". <br>
Sadly my microphone is not working so i could not record my own audio. <br>
Use the provided code to transcribe the audio. <br>


This code block uses the following steps:<br>
1. Load the model and tokenizer.<br>
2. Load the audio file.<br>
3. Tokenize the inputs.<br>
4. Store the logits (non-normalized predictions).<br>
5. Store the predicted ids.<br>
6. Decode the audio to generate text.<br>
7. Print the transcriptions.<br>

In [36]:
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h", device=0)

speech, rate = librosa.load("their_finest_hour.wav", sr=16000)

input_values = tokenizer(speech, return_tensors='pt').input_values

logits = model(input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)

transcriptions = tokenizer.decode(predicted_ids[0])
print(transcriptions)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


WHAT CANERAL BAGON AN CALLD THE BATTLE OF BRANZE IS OVER THE BATTLE OF BRITADS IS ABOUT TO BEGIN UPON THIS BATTLE DEPENDS THE SURBIBAL OF CHRISTIAN CIVILIZAG UPON IT DEPENDS OUR OWN BRITISH LIFE AND THE LONG CONTINUITY OF OUR INSTITUTION AND OUR EMPIN O FURY AND MIGHT HAVE THE ENEMY MUST VERY SOON BE TURNED ON US IT ERN OLD THAT HE WILL HAVE TO BREAK US IN THIS ISLAND OR LOVE THE WAR WE CAN STAND UP TO HIM OR EUROPE MAY BE FREE AND THE LIFE OF THE WORLD MAY MOVE FORWARD INTO BROAD UND I UPLAND BUT IF WE FAIL THEN THE WHOLE WORLD INCLUDING THE UNITED STAGE INCLUDING ALL THAT WE HAVE KNOWN AND CAD FORWILL SINK INTO THE ABYSS OF A NEW DARK AGE MADE MORE SINISTER AN THE HEAT MORE PROTRECTIVE BY THE LIGHT OF THE VIRTI SIEND LET US THEREFORE BREAK OURSELVES TO OUR DUTY SO BEAR OURSELVES THAT IF THE BRITISH EMPIRE AND ITS COMMONWEALTH LAST FOR AFOUND THEER MEN WILL STILL SAY LESH WAT THEY ARE FINDT OUR


<br>
<br>
End of File