In [2]:
!pip install transformers
!pip install pydub
!pip install ddsp


In [3]:
from transformers import pipeline

# **Activity 1: Programming Tasks for Frequent NLP Use Cases with Hugging Face Transformers**

**1.   Specify a sequence of text and, using the Transformers pipeline for Named Entity Recognition (NER), identify a list of words belonging to at least one of three classes, e.g., a person, an organisation, or a location.**


Named Entity Recognition is the task of classifying each token of a text sequence into a class such as person, organisation, or location. To implement it, we pass the task identifier "ner" when initialising the pipeline. The resulting object is then called with a single text sequence.

In [None]:
sequence = r"""
John is a friend of mine who lives in America. He is a Data Scientist and works at Amazon.
"""
nlp_ner = pipeline("ner")
for entity in nlp_ner(sequence):
    print(f"{entity['word'], entity['entity']}")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


('John', 'I-PER')
('America', 'I-LOC')
('Amazon', 'I-ORG')


**2.   Again, using your own text and the Transformers pipeline, classify at least two sequences of 10-25 words as having positive or negative sentiment.**

Identifying positive or negative sentiment is a text classification task: a given text is assigned to one class from a fixed set of classes. Sentiment analysis is the most common text classification problem.
The sentiment-analysis pipeline decides whether a sequence is positive or negative using a model fine-tuned on SST-2, which is a GLUE task. It returns a label ("POSITIVE" or "NEGATIVE") alongside a score, as follows:

In [None]:
seq_class = pipeline("sentiment-analysis")

seq1 = "I like reading books"
seq2 = "I do not like horror movies"

R1 = seq_class(seq1)[0]
R2 = seq_class(seq2)[0]

print(f"label: {R1['label']}, with score: {round(R1['score'], 4)}")
print(f"label: {R2['label']}, with score: {round(R2['score'], 4)}")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


label: POSITIVE, with score: 0.996
label: NEGATIVE, with score: 0.9822
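
The score reported next to each label is presumably a softmax probability computed from the two raw outputs (logits) of the classifier. A minimal sketch of that step in plain Python follows; the `logits` values are made up for illustration, and the label order (index 0 for NEGATIVE, index 1 for POSITIVE) is an assumption about the default SST-2 checkpoint.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NEGATIVE", "POSITIVE"]  # assumed label order
logits = [-2.1, 3.4]               # hypothetical classifier outputs

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"label: {labels[best]}, with score: {round(probs[best], 4)}")
```

A score close to 1 therefore means the model strongly prefers one class over the other, not that the prediction is guaranteed to be correct.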


**3.  Use Transformers pipeline to summarise an article or sequence of text comprising 350 to 500 words or so into 100 words or less.**


In [4]:
article = r"""
Apple on Monday unveiled in an hourlong virtual event new MacBook computers powered by Apple-made processors and an updated model of its popular 
AirPods.Last year, Apple introduced a new line of computers with processors made by Apple with the assistance of a manufacturing partner, breaking its reliance 
on the chip maker Intel. The company said at the time that the new chip, called the M1, would make Apple devices faster and more power efficient. 
The new processors, the M1 Pro and the M1 Max, power a new MacBook Pro that comes in a 14-inch and 16-inch model, starting at $2,000 and $2,500. 
The upgraded computers will have faster processing speeds, better graphics, improved audio quality and a better camera, the company said. 
"""
nlp_summarizer = pipeline('summarization')
nlp_summarizer(article, max_length=100)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


[{'summary_text': ' Apple unveils new MacBook computers powered by Apple-made processors . The new processors power a new MacBook Pro that comes in a 14-inch and 16-inch model . The upgraded computers will have faster processing speeds, better graphics, better audio quality and a better camera .'}]
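
Note that `max_length` limits the number of generated tokens, not words, so the 100-word requirement is worth checking separately. A small sketch of that check, using the summary text returned above; the helper name `word_count` is hypothetical.

```python
def word_count(text):
    """Count whitespace-separated words in a string."""
    return len(text.split())


# Summary text copied from the pipeline output above
summary = (
    " Apple unveils new MacBook computers powered by Apple-made processors . "
    "The new processors power a new MacBook Pro that comes in a 14-inch and "
    "16-inch model . The upgraded computers will have faster processing speeds, "
    "better graphics, better audio quality and a better camera ."
)

n_words = word_count(summary)
print(f"Summary length: {n_words} words, within limit: {n_words <= 100}")
```

In the notebook itself the same check could be applied to the `summary_text` field of the first element of the pipeline result.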

**4.   Illustrate text generation using the text generation pipeline and auto-complete 500 words from your starting point of just a few sentences (~12 to 25 words).**



In [None]:
nlp_txt_generator = pipeline('text-generation')
nlp_txt_generator("If someone says that you can become a data scientist without learning to code. Do not trust them! Definitely not yet. Learning to code is absolutely necessary to become a data scientist. Anyone can learn to code, the only important factor is, you need to learn it the right way.", max_length = 500)

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'If someone says that you can become a data scientist without learning to code. Do not trust them! Definitely not yet. Learning to code is absolutely necessary to become a data scientist. Anyone can learn to code, the only important factor is, you need to learn it the right way.\n\nWhat this all boils down to is: your skills are limited to learning coding and not just coding. Don\'t fall into them.\n\nIf you don\'t do these things then you won\'t be doing great work.\n\nIf you do these things but don\'t do good, you risk losing the work experience you want to have.\n\nWe all learn at an early age. In my experience, when starting out, we were all "the little guys": we were young, we didn\'t know or be familiar with coding, and we didn\'t know what would eventually become of us.\n\nOur initial exposure to coding was early on, but the learning curve was very, very small. What we now learned in school or training was by doing things on our own. Learning to code was some

**5.   Extract an answer from your text given a question using the Transformers question-answering pipeline. Show the answer extracted from the text together with its confidence score and the positions of the extracted answer in the text.**




In [None]:
from transformers import pipeline
txt = r"""
Apple on Monday unveiled in an hourlong virtual event new MacBook computers powered by Apple-made processors and an updated model of its popular 
AirPods.Last year, Apple introduced a new line of computers with processors made by Apple with the assistance of a manufacturing partner, breaking its reliance 
on the chip maker Intel. The company said at the time that the new chip, called the M1, would make Apple devices faster and more power efficient. 
The new processors, the M1 Pro and the M1 Max, power a new MacBook Pro that comes in a 14-inch and 16-inch model, starting at $2,000 and $2,500. 
The upgraded computers will have faster processing speeds, better graphics, improved audio quality and a better camera, the company said. 
"""

nlp_question_answerer = pipeline("question-answering")
result = nlp_question_answerer(question="what would M1 chip do?", context=txt)
print(f"Answer: '{result['answer']}' , score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")


No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


Answer: 'make Apple devices faster and more power efficient' , score: 0.7599, start: 402, end: 452
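
The `start` and `end` values are character offsets into the context, with `end` exclusive: 452 - 402 = 50, which is the length of the extracted answer. Slicing the context with them should therefore recover the answer. The sketch below shows the idea on a short, hypothetical context so that it runs without downloading a model.

```python
# Hypothetical context and answer, shortened from the article used above
context = ("The company said that the new chip, called the M1, "
           "would make Apple devices faster and more power efficient.")
answer = "make Apple devices faster and more power efficient"

# Locate the answer span by character position (end is exclusive)
start = context.find(answer)
end = start + len(answer)

extracted = context[start:end]
print(f"start: {start}, end: {end}, extracted: '{extracted}'")
```

In the notebook, the equivalent check would be comparing `txt[result['start']:result['end']]` with `result['answer']`.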


**6.   Translate text of 3 to 5 sentences from English to French using the translation pipeline.**



In [None]:
nlp_translator = pipeline('translation_en_to_fr')
nlp_translator("If someone says that you can become a data scientist without learning to code. Do not trust them! Definitely not yet. Learning to code is absolutely necessary to become a data scientist. Anyone can learn to code, the only important factor is, you need to learn it the right way.")

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)


[{'translation_text': "Si quelqu'un vous dit que vous pouvez devenir un chercheur en données sans apprendre à coder, ne faites pas confiance à eux! Certainement pas encore. L'apprentissage à coder est absolument nécessaire pour devenir un chercheur en données. Tout le monde peut apprendre à coder, le seul facteur important est, vous devez l'apprendre de la bonne façon."}]

# **Activity 2: Programming Task for NLP Transformer Solutions**

**1. Use the distilbert-base-uncased model with a pipeline and your own example to illustrate masked language modelling. Here, the model generates candidate tokens to fill the masked input, taking into account a context of 5 to 10 words.**

In [None]:
unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
unmasker("Hello my name is John. I work in Artifical Intelligence field and my role is  [MASK] Engineeer")

[{'score': 0.032554108649492264,
  'sequence': 'hello my name is john. i work in artifical intelligence field and my role is computer engineeer',
  'token': 3274,
  'token_str': 'computer'},
 {'score': 0.02899409644305706,
  'sequence': 'hello my name is john. i work in artifical intelligence field and my role is search engineeer',
  'token': 3945,
  'token_str': 'search'},
 {'score': 0.0235581211745739,
  'sequence': 'hello my name is john. i work in artifical intelligence field and my role is an engineeer',
  'token': 2019,
  'token_str': 'an'},
 {'score': 0.018571509048342705,
  'sequence': 'hello my name is john. i work in artifical intelligence field and my role is brain engineeer',
  'token': 4167,
  'token_str': 'brain'},
 {'score': 0.0183964055031538,
  'sequence': 'hello my name is john. i work in artifical intelligence field and my role is the engineeer',
  'token': 1996,
  'token_str': 'the'}]
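
Each candidate in the output pairs a token with a probability, and the `sequence` field is the input with `[MASK]` replaced by that token. A rough sketch of the ranking and substitution step is shown below. The scores are rounded from the output above, the shortened `template` is hypothetical, and the candidates are deliberately listed out of order to show the sort.

```python
# Shortened, hypothetical version of the masked sentence
template = "my role is [MASK] engineer"

# (token_str, score) pairs, rounded from the pipeline output above
candidates = [("an", 0.0236), ("computer", 0.0326), ("search", 0.0290)]

# Rank candidates from most to least probable
ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)

for token_str, score in ranked:
    # Fill the mask with the candidate token
    print(round(score, 4), template.replace("[MASK]", token_str))
```

The low top score (about 0.03) suggests the probability mass is spread over many plausible tokens, so none of the suggestions should be read as a confident prediction.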

**2. Locate and download the ProsusAI/finbert model from the Hugging Face Hub and select 3 to 5 stock market headlines to classify the sentiment of the financial content.**

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification
import numpy as np

# Note: this cell loads yiyanghkust/finbert-tone, not the ProsusAI/finbert
# checkpoint named in the task; the label mapping below is the one for finbert-tone.
finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone', num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')

sentences = ["Recent research shows that declining interest rates are helping to fuel companies’ higher valuation",
             "lower interest rates help the biggest companies gain market share",
             "There is a shortage of capital, and company need extra financing",
             "the bout of energy inflation is more intense than it would have been in the past",
             ]

# Pad to a common length so all headlines fit in one batch tensor
inputs = tokenizer(sentences, return_tensors="pt", padding=True)
outputs = finbert(**inputs)[0]  # logits, one row of 3 scores per headline

labels = {0: 'neutral', 1: 'positive', 2: 'negative'}
# Pick the highest-scoring class for each headline
predictions = np.argmax(outputs.detach().numpy(), axis=1)
for sent, pred in zip(sentences, predictions):
    print(sent, '----', labels[int(pred)])
    


Recent research shows that declining interest rates are helping to fuel companies’ higher valuation ---- positive
lower interest rates help the biggest companies gain market share ---- positive
There is a shortage of capital, and company need extra financing ---- negative
the bout of energy inflation is more intense than it would have been in the past ---- negative
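
The `padding=True` argument is needed because the headlines tokenize to different lengths, while a batch tensor requires rows of equal length. A minimal sketch of what padding produces is given below, assuming right-padding with pad ID 0 (the usual choice for BERT vocabularies); the helper name `pad_batch` and the token IDs are hypothetical.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-ID lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical token IDs for two headlines of different length
batch = [[101, 2896, 3037, 6165, 102], [101, 2045, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The attention mask is what lets the model ignore the padded positions, so padding should not change the predicted label of the shorter headlines.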


**3. Download the Microsoft DialoGPT-large model, a “large scale pretrained dialogue response generation model for multiturn conversations” (Zhang et al., 2020). The model has been trained on 147M multi-turn dialogues from Reddit. Copy the code snippet provided on the Hugging Face Hub into your notebook, make the necessary changes/additions to the code, and try chatting. Chat for 5 lines or more.**

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")

# Let's chat for 5 lines
for step in range(5):
    # Encode the new user input, append the eos_token, and return a PyTorch tensor
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')

    # Append the new user input tokens to the chat history (no history exists on the first turn)
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # Generate a response while limiting the total chat history to 1000 tokens
    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)

    # Pretty-print only the newly generated tokens, i.e. the bot's latest reply
    print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

>> User:Can I be a data scientist with out learning coding
DialoGPT: You can be a data scientist with learning programming.
>> User:But that seems difficult to be a professtional
DialoGPT: It's not difficult, it's just not as fun.
>> User:That right! How can I make it fun?
DialoGPT: You can't.
>> User:Then How am I suppose to be data scientist
DialoGPT: You can be a data scientist without learning programming.
>> User:that's not possible dude
DialoGPT: I know, I know.
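
The chat loop relies on the fact that `generate` returns the prompt followed by the new tokens, so the reply is whatever comes after the prompt length. The sketch below mimics that bookkeeping with plain Python lists instead of tensors. All token IDs are made up except the end-of-sequence ID 50256, and `fake_generate` is a hypothetical stand-in for the model.

```python
EOS = 50256  # GPT-2 style end-of-sequence token ID


def fake_generate(input_ids, reply_ids):
    """Stand-in for model.generate: returns the prompt plus a reply."""
    return input_ids + reply_ids + [EOS]


history = []
turns = [([101, 102], [201]), ([103], [202, 203])]  # (user tokens, bot tokens)
replies = []

for user_ids, bot_ids in turns:
    # Append the new user turn (terminated by EOS) to the running history
    bot_input = history + user_ids + [EOS]
    history = fake_generate(bot_input, bot_ids)
    # Everything after the prompt length is the bot's new reply
    replies.append(history[len(bot_input):])

print(replies)
print(history)
```

Because the full history is fed back in on every turn, its length grows with the conversation, which is why the notebook caps generation with `max_length=1000`.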


**4. Review the Facebook Wav2Vec2 model (Wav2Vec2-Base-960h). This is a speech recognition model that learns the structure of speech from raw audio. Create a .wav audio file using the Hugging Face Hub interface to record your voice directly from the browser. When you are satisfied with a recording of 10 words or so after playback through the same interface, save it as “audio only.wav” or whatever name you prefer. Load the transcription code below into your notebook and use it to convert your audio to text (speech recognition).**

In [None]:
import numpy as np
import os
import ddsp
import ddsp.training
from ddsp.colab import colab_utils
from ddsp.colab.colab_utils import (
    auto_tune, get_tuning_factor, download, play, record, 
    specplot, upload, DEFAULT_SAMPLE_RATE)


In [None]:
from scipy.io.wavfile import write  # provides write() used below

sample_rate = 16000  # Wav2Vec2-Base-960h expects 16 kHz audio

record_or_upload = "Record"

record_seconds = 10

if record_or_upload == "Record":
  audio = record(seconds=record_seconds)

write('output.wav', sample_rate, audio)  # Save as WAV file

audio = audio[np.newaxis, :]
print('\nExtracting audio features...')

# Plot.
#specplot(audio)
play(audio)

 

Starting recording for 10 seconds...


<IPython.core.display.Javascript object>

Finished recording!

Extracting audio features...


<function ddsp.colab.colab_utils.play>

In [None]:
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

# Load model and tokenizer
# Note: Wav2Vec2Tokenizer is a deprecated class (Wav2Vec2Processor is its
# replacement); loading it from this checkpoint triggers the tokenizer-class
# mismatch warning.
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load the recorded audio file, resampled to 16 kHz
speech, rate = librosa.load("output.wav", sr=16000)

input_values = tokenizer(speech, return_tensors='pt').input_values
# Store logits (non-normalized predictions)
logits = model(input_values).logits

# Store predicted IDs (most likely token at each time step)
predicted_ids = torch.argmax(logits, dim=-1)

# Decode the predicted IDs to generate text
#print(predicted_ids)

transcriptions = tokenizer.decode(predicted_ids[0])
print(transcriptions)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


JOHN IS MY FRIEND WE WORK TOGETHER
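
Taking the argmax of the logits yields one token ID per audio frame, so the same character usually appears many times in a row. Decoding a CTC model such as Wav2Vec2ForCTC then collapses repeats and drops the blank token. A minimal sketch of that greedy decoding step is shown below. The tiny `vocab`, the frame IDs, and the helper name `ctc_greedy_decode` are hypothetical; using ID 0 as the blank and "|" as the word delimiter is an assumption modelled on the Wav2Vec2 vocabulary.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then drop CTC blanks."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical vocabulary: 0 is the CTC blank, "|" separates words
vocab = {0: "<pad>", 1: "|", 2: "J", 3: "O", 4: "H", 5: "N", 6: "I", 7: "S"}

# One predicted ID per audio frame (made-up values)
frame_ids = [2, 2, 0, 3, 3, 4, 0, 0, 5, 1, 1, 6, 6, 7, 0]

text = ctc_greedy_decode(frame_ids, vocab).replace("|", " ").strip()
print(text)
```

The blank token is what allows genuine double letters: two identical characters separated by a blank survive the collapse, while an uninterrupted run is treated as a single character.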
