# NLP 303 - Natural Language Processing
## Task 2
### By: Michael Cuffe
### Assessment 1
### Due: 20/10/2024 23:59

### Install Necessary Packages

In [1]:
# > NUL 2>&1 is used to suppress the output of the installation as they are very long.
# Install the transformers library.
!pip install transformers > NUL 2>&1
# Install the TensorFlow as it's a dependecy for the transformers library.
!pip install tensorflow > NUL 2>&1
# Install the TensorFlow Keras API as a fix for an issue with the transformers library.
!pip install tf-keras > NUL 2>&1
# Install the librosa library for audio processing.
!pip install -q transformers librosa > NUL 2>&1


# Testing The Installation of Transformers

In [2]:
from transformers import pipeline

## Check that transformers is functional.

#### Added default models to all pipeline declarations in this case "google/t5-base" was added to the list of known models.
#### As im running this locally i also had to define the device as 0 to enable GPU usage.
#### All pipelines use the recommended default models for the task.

In [3]:
translator = pipeline("translation_en_to_de", model="google-t5/t5-base", device=0,clean_up_tokenization_spaces=True)
print(translator("The magic of transformers lies in pre-trained models"))






[{'translation_text': 'Die Magie der Transformatoren liegt in vorgeschulten Modellen'}]


# Masked Language Modeling with DistilBERT
Initialize the masked language modeling pipeline using distilbert-base-uncased.<br>
Provide an example sentence with a masked token.<br>
Generate text options to fill the masked input.<br>

In [4]:
# Initialize the masked language modeling pipeline
mlm_pipeline = pipeline("fill-mask", model="distilbert-base-uncased", device=0)

# Example sentence with a masked token
sentence = "The quick brown [MASK] leaps over the [MASK] person on a [MASK]."

# Generate text options to fill the masked input
mlm_results = mlm_pipeline(sentence)
for result in mlm_results:
    print(f"Option: {result['sequence']}, Score: {result['score']:.4f}")

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Option: the quick brown fox jumps over the shaggy dog., Score: 0.0777
Option: the quick brown fox jumps over the barking dog., Score: 0.0541
Option: the quick brown fox jumps over the little dog., Score: 0.0191
Option: the quick brown fox jumps over the stray dog., Score: 0.0135
Option: the quick brown fox jumps over the startled dog., Score: 0.0131


# Sentiment Analysis with ProsusAI/finbert
Locate and download the ProsusAI/finbert model. <br>
Select 3 to 5 stock market headlines.<br>
Classify the sentiment of the financial content.<br>

In [5]:
# Initialize the sentiment analysis pipeline
finbert_pipeline = pipeline("sentiment-analysis", model="ProsusAI/finbert", device=0)

# Stock market headlines
headlines = [
    "Cats rally as tech shares rebound",
    "Cat Predicts Market crash amid economic uncertainty",
    "Investors optimistic about new feline policies"
    "Cat elected as mayor in 30 states of America",
    "Stocks hit all-time low after poor quarterly results"
]

# Classify the sentiment of the financial content
for headline in headlines:
    sentiment = finbert_pipeline(headline)
    print(f"Headline: {headline}, Sentiment: {sentiment[0]['label']}, Score: {sentiment[0]['score']:.4f}")

config.json:   0%|          | 0.00/758 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/252 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Headline: Stocks rally as tech shares rebound, Sentiment: negative, Score: 0.6301
Headline: Market crashes amid economic uncertainty, Sentiment: negative, Score: 0.8717
Headline: Investors optimistic about new fiscal policies, Sentiment: positive, Score: 0.6833


# Dialogue Generation with Microsoft DialoGPT-large
Download the microsoft/DialoGPT-large model.<br>
Use the provided code snippet to initialize the model. <br>
Chat for 5 lines or more. <br>

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Initialize the DialoGPT model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")

# Set the Chat range for 5 lines or more
chat_history_ids = None
for step in range(5):

#Ask for User input
    user_input = input(">> User: ")
    new_user_input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors='pt')

# Append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if chat_history_ids is not None else new_user_input_ids

# Generate a response to the text with it 
    chat_history_ids = model.generate(
        bot_input_ids, 
        max_length=1000, 
        pad_token_id=tokenizer.eos_token_id)

    # Decode the response
    response = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
    print(f"DialoGPT: {response}")

tokenizer_config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/642 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.75G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


DialoGPT: No, it was The Big Wedding.


# Speech Recognition with Facebook Wav2Vec2
Review the facebook/wav2vec2-base-960h model. <br>
Create a .wav audio file and save it. <br>
Use the provided code to transcribe the audio. <br>

In [None]:
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

# Load model and tokenizer
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load the audio file
speech, rate = librosa.load("/audio.wav", sr=16000)

# Tokenize the inputs
input_values = tokenizer(speech, return_tensors='pt').input_values

# Store logits (non-normalized predictions)
logits = model(input_values).logits

# Store predicted ids
predicted_ids = torch.argmax(logits, dim=-1)

# Decode the audio to generate text
transcriptions = tokenizer.decode(predicted_ids[0])
print(transcriptions)

<br>
<br>
End of File