<a href="https://colab.research.google.com/github/PLuisa/NLP-Project/blob/main/NLP1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Activity 1: Programming Tasks for Frequent NLP Use Cases with Hugging Face Transformers**

Code based on the Hugging Face notebook: https://github.com/huggingface/notebooks/blob/main/course/en/chapter1/section3.ipynb

In [None]:
#Installing Hugging Face Transformers
!pip install transformers


In [None]:

# importing the pipeline to use the pre-trained models.
from transformers import pipeline


**1. Named Entity Recognition (NER):**


> Source of Inspiration: Hugging Face Tutorial - https://huggingface.co/docs/transformers/main/en/tasks/token_classification



> Model used: "dslim/bert-base-NER" - https://huggingface.co/dslim/bert-base-NER



In [None]:

# Create an NER pipeline

ner = pipeline("ner", model="dslim/bert-base-NER")

# Specify your sequence of text
text_for_ner = "Steven Paul Jobs was an American inventor, businessman and tycoon in the computer sector. He became famous as co-founder, president and executive director of Apple Inc."

# Perform NER on the text
ner_results = ner(text_for_ner)

# Print the NER results
for entity in ner_results:
    print(f"Entity: {entity['word']}, Label: {entity['entity']}")

# Keep only the first token (B- tag) of each person, organization, or location entity
filtered_entities = [entity for entity in ner_results if entity["entity"] in ["B-PER", "B-ORG", "B-LOC"]]

# Print the filtered entities
print("Filtered Entities:")
print(filtered_entities)


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Entity: Steven, Label: B-PER
Entity: Paul, Label: I-PER
Entity: Job, Label: I-PER
Entity: ##s, Label: I-PER
Entity: American, Label: B-MISC
Entity: Apple, Label: B-ORG
Entity: Inc, Label: I-ORG
Filtered Entities:
[{'entity': 'B-PER', 'score': 0.99969745, 'index': 1, 'word': 'Steven', 'start': 0, 'end': 6}, {'entity': 'B-ORG', 'score': 0.99955696, 'index': 33, 'word': 'Apple', 'start': 158, 'end': 163}]
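
The filtered list above keeps only the first token of each entity, so "Steven" and "Apple" appear without "Paul Jobs" or "Inc". A minimal sketch of one way to stitch the token-level tags back into whole entities is given below. It works on the `word` and `entity` fields copied from the printed output; `merge_entities` is a hypothetical helper, not part of Transformers, and it assumes that consecutive tokens with the same label are adjacent in the text.

```python
# Token-level predictions copied from the printed NER output above
# (only the 'word' and 'entity' fields are reproduced here).
tokens = [
    {"word": "Steven", "entity": "B-PER"},
    {"word": "Paul", "entity": "I-PER"},
    {"word": "Job", "entity": "I-PER"},
    {"word": "##s", "entity": "I-PER"},
    {"word": "American", "entity": "B-MISC"},
    {"word": "Apple", "entity": "B-ORG"},
    {"word": "Inc", "entity": "I-ORG"},
]


def merge_entities(tokens):
    """Hypothetical helper: group B-/I- tagged tokens into whole entities."""
    merged = []
    for tok in tokens:
        prefix, label = tok["entity"].split("-", 1)
        word = tok["word"]
        if word.startswith("##") and merged:
            # WordPiece continuation: glue onto the previous piece without a space
            merged[-1]["text"] += word[2:]
        elif prefix == "I" and merged and merged[-1]["label"] == label:
            # Inside tag with the same label: extend the current entity
            merged[-1]["text"] += " " + word
        else:
            # Begin tag (or a stray inside tag): start a new entity
            merged.append({"text": word, "label": label})
    return merged


entities = merge_entities(tokens)

# Keep whole entities that are persons, organizations, or locations
wanted = [e for e in entities if e["label"] in {"PER", "ORG", "LOC"}]
for e in wanted:
    print(e["text"], e["label"])
```

The `pipeline` function also accepts an `aggregation_strategy` argument for token classification (for example `"simple"`), which performs a comparable grouping inside the library and reports an `entity_group` field instead of `entity`.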


**2. Sentiment Analysis:**


> Source of Inspiration: Hugging Face Tutorial - https://huggingface.co/docs/transformers/main/en/tasks/sequence_classification



> Model used: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english




In [None]:
# Create a sentiment-analysis pipeline
sentiment_analysis = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# Run sentiment analysis on two passages from Alice's Adventures in Wonderland
result1 = sentiment_analysis("Alice had got so much into the way of expecting nothing but out-of-the-way things to happen, that it seemed quite dull and stupid for life to go on in the common way")
result2 = sentiment_analysis("it sounded an excellent plan, no doubt, and very neatly and simply arranged; the only difficulty was, that she had not the smallest idea how to set about it; and while she was peering about anxiously among the trees, a little sharp bark just over her head made her look up in a great hurry.")

# Print the results (each call returns a list with one dict per input text)
print("Sentiment analysis for text 1:")
for sentiment in result1:
    print(f"Label: {sentiment['label']}, Score: {sentiment['score']}")

print("\nSentiment analysis for text 2:")
for sentiment in result2:
    print(f"Label: {sentiment['label']}, Score: {sentiment['score']}")

Sentiment analysis for text 1:
Label: NEGATIVE, Score: 0.9997755885124207

Sentiment analysis for text 2:
Label: NEGATIVE, Score: 0.9810794591903687
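
For a two-class classifier such as this one, the reported score is a softmax probability computed from the model's raw output logits. The sketch below illustrates that step in plain Python. The logit values and the label order are assumptions chosen for illustration; they are not taken from the model run above.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NEGATIVE", "POSITIVE"]  # assumed label order
logits = [4.2, -4.2]               # hypothetical logits, not real model output

probs = softmax(logits)

# The pipeline reports the label with the highest probability and that probability
best = max(range(len(probs)), key=probs.__getitem__)
print(f"Label: {labels[best]}, Score: {probs[best]:.4f}")
```

A large gap between the two logits produces a score close to 1, which is why both passages above are reported with very high confidence.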


**3. Text Summarization:**

> Source of Inspiration: Hugging Face Tutorial - https://huggingface.co/docs/transformers/main/en/tasks/summarization



> Model used: https://huggingface.co/sshleifer/distilbart-cnn-12-6





In [None]:
# Create a summarization pipeline
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

# Specify a passage from Alice's Adventures in Wonderland
text_to_summarize = "Come, there’s no use in crying like that! said Alice to herself, rather sharply; I advise you to leave off this minute! She generally gave herself very good advice, (though she very seldom followed it), and sometimes she scolded herself so severely as to bring tears into her eyes; and once she remembered trying to box her own ears for having cheated herself in a game of croquet she was playing against herself, for this curious child was very fond of pretending to be two people. But it’s no use now, thought poor Alice, to pretend to be two people! Why, there’s hardly enough of me left to make one respectable person! Soon her eye fell on a little glass box that was lying under the table: she opened it, and found in it a very small cake, on which the words “EAT ME” were beautifully marked in currants. “Well, I’ll eat it,” said Alice, “and if it makes me grow larger, I can reach the key; and if it makes me grow smaller, I can creep under the door; so either way I’ll get into the garden, and I don’t care which happens!”She ate a little bit, and said anxiously to herself, “Which way? Which way?”, holding her hand on the top of her head to feel which way it was growing, and she was quite surprised to find that she remained the same size: to be sure, this generally happens when one eats cake, but Alice had got so much into the way of expecting nothing but out-of-the-way things to happen, that it seemed quite dull and stupid for life to go on in the common way.So she set to work, and very soon finished off the cake."

# Summarize the text in at most 100 tokens (max_length counts tokens, not words)
summary = summarizer(text_to_summarize, max_length=100, min_length=30, do_sample=False)

# Print the summary
print(summary[0]["summary_text"])


 Alice was very fond of pretending to be two people . She generally gave herself very good advice, (though she very seldom followed it), and sometimes scolded herself so severely as to bring tears into her eyes . She ate a little bit of cake, and said anxiously to herself, “Which way? Which way?”


**4. Text Generation:**



> Source of Inspiration: Hugging Face - https://huggingface.co/docs/transformers/main/en/tasks/language_modeling


> Model Used: https://huggingface.co/gpt2




In [None]:
# Create a text-generation pipeline with GPT-2
text_generator = pipeline("text-generation", model='gpt2')


# Specify your starting prompt
starting_prompt = "Once upon a time, in a small city..."

# Generate up to 500 tokens in total, prompt included (max_length counts tokens, not words)
generated_text = text_generator(starting_prompt, max_length=500, do_sample=True)

# Print the generated text
print(generated_text[0]["generated_text"])


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time, in a small city...

That night all the children of Israel were with me....

They went to the Temple. The whole city was gone.

But they did not see a soul as they did now.

No one saw a soul as they did now.

And the last of all that was lost was the one in the land of Egypt.


THE BOUGHT OF VISION:

It is a very beautiful view of the sky as if it were seen with their eyes turned, looking straight from the horizon.

We may imagine how it must have looked to our eyes, just as the last shadow of a dead person will rise.

At the far end to the south, it has not a day on which to walk.

For the moon has never been visible to anyone in the distance, and at night the sun can never be seen.

A bright sun will never be above the horizon.

When the last of those who were not killed in the battle are gone, there will be a day on which everyone will be standing in peace, all that was lost.

This is not what the world will ever see again.

One day the world will not notice the wa

**5. Question Answering:**




> Source of Inspiration: Hugging Face Tutorial - https://huggingface.co/docs/transformers/main/en/tasks/question_answering


> Model used: https://huggingface.co/distilbert-base-cased-distilled-squad





In [None]:

# Create a question-answering pipeline
question_answering = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

# Specify the question and context text
question = "What happened to Alice?"
context = "Just then Alice head struck against the roof of the hall: in fact she was now more than nine feet high, and she at once took up the little golden key and hurried off to the garden door."

# Get the answer from the context
answer = question_answering(question=question, context=context)

# Print the answer and confidence score
print(f"Answer: '{answer['answer']}', score: {round(answer['score'], 4)}, start: {answer['start']}, end: {answer['end']}")

Answer: 'head struck against the roof of the hall', score: 0.0532, start: 16, end: 56


**6. Translation:**

> Source of Inspiration: Hugging Face - https://huggingface.co/docs/transformers/main/en/tasks/translation

> Model used: https://huggingface.co/t5-base





In [None]:
# Create an English-to-French translation pipeline
translator = pipeline("translation_en_to_fr", model="t5-base")

# Specify the text to translate
text_to_translate = "“You are not attending!” said the Mouse to Alice severely. “What are you thinking of?”"

# Translate the text
translation = translator(text_to_translate)

# Print the translation
print(translation[0]["translation_text"])


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


« Vous n'êtes pas là ! » a dit la Souris à Alice sévèrement. « Que pensez-vous? »


## **Activity 2: Programming Task for NLP Transformer Solutions**

**1. Masked Language Modeling with DistilBERT**


> Source of Inspiration: Hugging Face - https://huggingface.co/docs/transformers/main/en/tasks/masked_language_modeling



> Model used: https://huggingface.co/distilbert-base-uncased






In [None]:

# Create a pipeline for masked language modeling
masked_lm = pipeline("fill-mask", model="distilbert-base-uncased")

# Define a sentence with a masked word
sentence = "I want to [MASK] this summer."

# Generate text options to fill the masked word
results = masked_lm(sentence)

# Print the generated options
for result in results:
    print(result["sequence"])


i want to celebrate this summer.
i want to go this summer.
i want to start this summer.
i want to fly this summer.
i want to dance this summer.


**Sentiment Analysis for Stock Market Headlines**

> Source of Inspiration: https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline



> Model used: https://huggingface.co/ProsusAI/finbert





In [None]:
# Create a sentiment analysis pipeline
sentiment_classifier = pipeline("sentiment-analysis", model="ProsusAI/finbert")

# Define stock market headlines
headlines = [
    "Oil prices jump nearly 6% amid geopolitical tensions",
    "Asia fintech MoneyHero slides on the first day of trading after a merger with Peter Thiel-backed SPAC",
    "Regional banks are in focus in the week ahead as the third-quarter earnings season ramps up",
]

# Classify the sentiment of the headlines
sentiments = sentiment_classifier(headlines)

# Print the sentiment classification results
for sentiment in sentiments:
    print(sentiment)



{'label': 'negative', 'score': 0.8534157872200012}
{'label': 'negative', 'score': 0.9664749503135681}
{'label': 'positive', 'score': 0.4299863576889038}
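
The third headline is labeled positive with a score of only about 0.43, so the prediction is far from certain. One possible way to flag such cases is sketched below, using the results printed above. The cutoff `THRESHOLD` is a hypothetical value chosen for illustration, not a recommendation from the model authors.

```python
# Headlines and results copied from the cells above (the pipeline keeps input order)
headlines = [
    "Oil prices jump nearly 6% amid geopolitical tensions",
    "Asia fintech MoneyHero slides on the first day of trading after a merger with Peter Thiel-backed SPAC",
    "Regional banks are in focus in the week ahead as the third-quarter earnings season ramps up",
]
results = [
    {"label": "negative", "score": 0.8534157872200012},
    {"label": "negative", "score": 0.9664749503135681},
    {"label": "positive", "score": 0.4299863576889038},
]

THRESHOLD = 0.6  # hypothetical cutoff for "confident enough"

# Collect headlines whose top label falls below the cutoff
uncertain = [h for h, r in zip(headlines, results) if r["score"] < THRESHOLD]

for headline, result in zip(headlines, results):
    flag = " (low confidence)" if result["score"] < THRESHOLD else ""
    print(f"{result['label']:>8}  {result['score']:.3f}{flag}  {headline}")
```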




**Chat with Microsoft DialoGPT-large**

> Source of Inspiration: https://huggingface.co/microsoft/DialoGPT-large?text=Hey+my+name+is+Julien%21+How+are+you%3F

> Model used: https://huggingface.co/microsoft/DialoGPT-large?text=Hey+my+name+is+Julien%21+How+are+you%3F





In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")

# Chat for up to 5 turns
for step in range(5):
    # Encode the new user input, append the eos_token, and return a PyTorch tensor
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')

    # Append the new user input tokens to the chat history (there is no history on the first turn)
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # Generate a response; max_length caps the whole sequence (history plus reply) at 1000 tokens
    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)

    # Print only the newly generated tokens, i.e. everything after the input
    print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))


>> User:How are you?


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


DialoGPT: I'm good, you?
>> User:I am good, what is the time right now?



DialoGPT: I don't know, I'm not sure.
>> User:Do you like history?



DialoGPT: I do, but I don't really like it.
>> User:How was your day?



DialoGPT: It was alright.


KeyboardInterrupt: ignored
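
The loop above feeds the full chat history back into the model on every turn, so the input keeps growing; `max_length=1000` caps the total sequence but does not trim old turns. A minimal sketch of one way to keep the history bounded is shown below, using plain Python lists of token ids in place of tensors. `truncate_history` and the example ids are hypothetical and only illustrate the idea.

```python
def truncate_history(history_ids, new_ids, max_tokens):
    """Hypothetical helper: append new token ids and keep only the most recent ones."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    combined = history_ids + new_ids
    # Negative slicing keeps the last max_tokens ids and drops the oldest
    return combined[-max_tokens:]


history = [11, 12, 13, 14]   # made-up ids standing in for earlier turns
new_turn = [21, 22, 23]      # made-up ids standing in for the new user input

bounded = truncate_history(history, new_turn, max_tokens=5)
print(bounded)
```

With tensors the same idea could be applied by slicing the last dimension before calling `generate`, leaving room within the limit for the reply.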

**Transcribing Audio to Text with Facebook Wav2Vec2**

> Code Provided in the Assessment.



In [None]:
! pip install -q transformers


In [None]:
#Importing Libraries
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

In [None]:
#load model and tokenizer
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.


Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_v', 'wav2vec2.encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You sho

In [None]:
# Load an audio file, resampled to 16 kHz (the sampling rate wav2vec2-base-960h expects)
speech, rate = librosa.load('/content/drive/MyDrive/Colab Notebooks/Kingsway (online-audio-converter.com).wav',sr=16000)

In [None]:
# Convert the raw waveform into model input values
input_values = tokenizer(speech, return_tensors='pt').input_values

# Store logits (non-normalized predictions); no gradients are needed for inference
with torch.no_grad():
    logits = model(input_values).logits

# Store the most likely token id for each audio frame
predicted_ids = torch.argmax(logits, dim=-1)

# Decode the predicted ids into text
transcription = tokenizer.decode(predicted_ids[0])

# Print the transcription
print(transcription)


HELLO MY NAME'S IS RESA AND I AM A THE SOFT RANGIN ER ISITUTED
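
The `argmax` followed by `decode` in the cell above amounts to greedy CTC decoding: take the most likely token per audio frame, collapse consecutive repeats, drop the blank token, and turn the word delimiter into a space. The sketch below reproduces that logic in plain Python. The vocabulary and frame ids are hypothetical stand-ins, not the real wav2vec2 vocabulary.

```python
# Hypothetical toy vocabulary: id 0 is the CTC blank, "|" separates words
VOCAB = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}
BLANK_ID = 0


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated ids, remove blanks, and map ids to characters."""
    collapsed = []
    previous = None
    for idx in frame_ids:
        if idx != previous:  # keep only the first id of each run of repeats
            collapsed.append(idx)
        previous = idx
    chars = [vocab[i] for i in collapsed if i != blank_id]
    return "".join(chars).replace("|", " ").strip()


# One predicted id per audio frame; the blank between the two L runs keeps the double letter
frame_ids = [2, 2, 0, 3, 0, 4, 4, 0, 4, 5, 5, 1, 0]
text = ctc_greedy_decode(frame_ids)
print(text)
```

Because each frame is decoded independently and no language model is involved, misrecognized frames pass straight into the text, which is consistent with the errors visible in the transcription above.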
