<a href="https://colab.research.google.com/github/Jikhan-Jeong/Text_Data_Mining_HG/blob/main/05_22_2022_NLP_Tasks_with_HG_pipeline_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


---
# 05-22-2022-NLP_Tasks_with_HG_pipeline
---
* name: JJ
* env: Google Colab without GPU
* ref: https://huggingface.co/course/chapter1/3?fw=pt (modified)
* ref: https://huggingface.co/docs/transformers/quicktour (quick tour)
----
* The purpose of this chapter is to learn NLP tasks in the simplest way.
* The Hugging Face `pipeline` returns results for each task with only a few lines of code.
* Available tasks with pipelines:

----
* feature-extraction (get the vector representation of a text)
* fill-mask
* ner (named entity recognition)
* question-answering
* sentiment-analysis
* summarization
* text-generation
* translation (not covered in this tutorial)
* zero-shot-classification
---

In [None]:
# download the package
!pip install transformers

Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.6.0-py3-none-any.whl (84 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
Successfully installed huggingface-hub-0

In [None]:
from transformers import pipeline

---
# Sentiment Analysis
* default: distilbert-base-uncased-finetuned-sst-2-english
---

In [None]:
# simply use
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)



In [None]:
classifier("JJ looks like sleep but he is happy to see his advisor on Tueseday.")

[{'label': 'POSITIVE', 'score': 0.9992756247520447}]

In [None]:
classifier(
    ["JJ looks like sleep but he is happy to see his advisor on Tueseday.",
     "I have tried several times to publish more paper but some papers were rejected. Alas"]
)

[{'label': 'POSITIVE', 'score': 0.9992756247520447},
 {'label': 'NEGATIVE', 'score': 0.9987794756889343}]
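
The list returned above has one dictionary per input sentence, in the same order as the inputs. A minimal sketch of pairing each sentence with its prediction follows. The results are copied from the output above rather than recomputed, and `pair_predictions` and its `min_score` threshold are hypothetical names introduced only for illustration.

```python
# Inputs and outputs copied from the cells above (no model is loaded here).
sentences = [
    "JJ looks like sleep but he is happy to see his advisor on Tueseday.",
    "I have tried several times to publish more paper but some papers were rejected. Alas",
]
results = [
    {"label": "POSITIVE", "score": 0.9992756247520447},
    {"label": "NEGATIVE", "score": 0.9987794756889343},
]


def pair_predictions(sentences, results, min_score=0.0):
    """Pair each sentence with its label and score.

    `min_score` is a hypothetical confidence threshold, not a pipeline parameter.
    """
    return [
        (sentence, result["label"], result["score"])
        for sentence, result in zip(sentences, results)
        if result["score"] >= min_score
    ]


pairs = pair_predictions(sentences, results, min_score=0.9)
for sentence, label, score in pairs:
    print(f"{label:8s} {score:.4f}  {sentence}")
```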

---
# Zero-shot classification
* default: facebook/bart-large-mnli (Facebook's BART model fine-tuned on MNLI)
---

In [None]:
# load the default pre-trained model for zero-shot classification
zero_shot_classifier = pipeline("zero-shot-classification")
zero_shot_classifier(
    "JJ looks like sleep but he is happy to see his advisor on Tueseday.",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)


{'labels': ['education', 'business', 'politics'],
 'scores': [0.4480249881744385, 0.42391538619995117, 0.12805970013141632],
 'sequence': 'JJ looks like sleep but he is happy to see his advisor on Tueseday.'}
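
In the result above, the labels are listed from highest to lowest score, and the three scores add up to roughly 1. A small sketch of reading off the winning label is given below. The dictionary is copied from the output above, and `top_label` is a hypothetical helper, not part of the library.

```python
# Result copied from the cell above (no model is loaded here).
result = {
    "labels": ["education", "business", "politics"],
    "scores": [0.4480249881744385, 0.42391538619995117, 0.12805970013141632],
    "sequence": "JJ looks like sleep but he is happy to see his advisor on Tueseday.",
}


def top_label(result):
    """Return the (label, score) pair with the highest score."""
    return max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])


label, score = top_label(result)
total = sum(result["scores"])
print(f"top label: {label} ({score:.3f}); scores sum to {total:.3f}")
```

Note that "education" (0.448) beats "business" (0.424) by a narrow margin, so this particular prediction should be read as uncertain.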

---
# Text generation
* default: gpt2
---

In [None]:
from transformers import pipeline
generator = pipeline("text-generation")  # text generation with the default pre-trained model
generator("In this course, JJ will teach you about data and text mining, in particular, you will learn how to")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, JJ will teach you about data and text mining, in particular, you will learn how to create and use a database of your own, so be sure to let your friends know for free.\n\nWhat is SQL Server?\n'}]

In [None]:
generator("In this data and text mining course, we will learn how to",
    max_length=30,
    num_return_sequences=2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this data and text mining course, we will learn how to create a scalable solution for this problem with multiple cryptocurrencies. We will use an XDA'},
 {'generated_text': 'In this data and text mining course, we will learn how to use this technology through the online marketplace. In this course we will use data mining to'}]
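
Each `generated_text` above begins with the prompt itself, followed by the model's continuation. The sketch below separates the two. The outputs are copied from the cell above, and `continuation` is a hypothetical helper introduced for illustration.

```python
# Prompt and outputs copied from the cell above (no model is loaded here).
prompt = "In this data and text mining course, we will learn how to"
outputs = [
    {"generated_text": "In this data and text mining course, we will learn how to create a scalable solution for this problem with multiple cryptocurrencies. We will use an XDA"},
    {"generated_text": "In this data and text mining course, we will learn how to use this technology through the online marketplace. In this course we will use data mining to"},
]


def continuation(prompt, generated_text):
    """Return only the newly generated part when the text starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text


continuations = [continuation(prompt, output["generated_text"]) for output in outputs]
for text in continuations:
    print(text)
```

Both continuations stop mid-sentence. As far as the generation settings are documented, `max_length=30` caps the total length in tokens with the prompt included, which would explain the abrupt endings.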

----
### Change the model for text generation
* the default is *gpt2*; here we switch to the smaller *distilgpt2*
----

In [None]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this data and text mining course, we will learn how to",
    max_length=30,
    num_return_sequences=2,
)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this data and text mining course, we will learn how to use the Python engine to efficiently mine (and possibly perform) different data types from the'},
 {'generated_text': 'In this data and text mining course, we will learn how to make sure you share your experiences and insights to the next generation of computing platforms including Google'}]

---
# Mask filling
* default model: distilroberta-base
* fills in the `<mask>` token in a sentence
---

In [None]:
from transformers import pipeline

fill_mark_word = pipeline("fill-mask")
fill_mark_word("JJ is the most <mask> guy in Lakeland FL.", top_k=2)  # top_k = number of suggestions to return

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)



[{'score': 0.08936254680156708,
  'sequence': 'JJ is the most popular guy in Lakeland FL.',
  'token': 1406,
  'token_str': ' popular'},
 {'score': 0.040373194962739944,
  'sequence': 'JJ is the most famous guy in Lakeland FL.',
  'token': 3395,
  'token_str': ' famous'}]

In [None]:
fill_mark_word("JJ want to buy some gift for his advisor; therefore, it would be better to buy <mask> as a gift for his advisor.", top_k=2)

[{'score': 0.09839325398206711,
  'sequence': 'JJ want to buy some gift for his advisor; therefore, it would be better to buy it as a gift for his advisor.',
  'token': 24,
  'token_str': ' it'},
 {'score': 0.0870167687535286,
  'sequence': 'JJ want to buy some gift for his advisor; therefore, it would be better to buy something as a gift for his advisor.',
  'token': 402,
  'token_str': ' something'}]
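
Each suggestion above carries a `token_str` with a leading space (for example `' it'`), because the tokenizer treats the preceding space as part of the token. A minimal sketch of turning the suggestions into clean (word, score) pairs is shown below. The values are copied from the output above with the `sequence` field left out for brevity, and `clean_candidates` is a hypothetical helper.

```python
# Suggestions copied from the cell above; the 'sequence' field is omitted for brevity.
suggestions = [
    {"score": 0.09839325398206711, "token": 24, "token_str": " it"},
    {"score": 0.0870167687535286, "token": 402, "token_str": " something"},
]


def clean_candidates(suggestions):
    """Strip the leading space from each token and round its score."""
    return [(s["token_str"].strip(), round(s["score"], 3)) for s in suggestions]


candidates = clean_candidates(suggestions)
for word, score in candidates:
    print(f"{word}: {score}")
```

Both scores are below 0.1, so the model is far from certain about either filler for this sentence.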

---
# Named entity recognition
* default model: dbmdz/bert-large-cased-finetuned-conll03-english
* The named entity recognition (NER) task maps spans of the input text to entity types such as persons (PER), locations (LOC), or organizations (ORG).
---

In [None]:
from transformers import pipeline

# aggregation_strategy="simple" merges sub-word tokens into whole entities;
# it replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple")
ner("My name is JJ and I work at Florida Polytechnic University in Lakeland.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


[{'end': 13,
  'entity_group': 'PER',
  'score': 0.98099864,
  'start': 11,
  'word': 'JJ'},
 {'end': 58,
  'entity_group': 'ORG',
  'score': 0.9982908,
  'start': 28,
  'word': 'Florida Polytechnic University'},
 {'end': 70,
  'entity_group': 'LOC',
  'score': 0.9908936,
  'start': 62,
  'word': 'Lakeland'}]
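
The output above is a flat list with one dictionary per entity. A short sketch of regrouping it by entity type follows. The entities are copied from the output above (scores omitted), and `group_by_type` is a hypothetical helper introduced for illustration.

```python
# Entities copied from the cell above (no model is loaded here); scores omitted.
entities = [
    {"entity_group": "PER", "word": "JJ", "start": 11, "end": 13},
    {"entity_group": "ORG", "word": "Florida Polytechnic University", "start": 28, "end": 58},
    {"entity_group": "LOC", "word": "Lakeland", "start": 62, "end": 70},
]


def group_by_type(entities):
    """Collect the recognized words under their entity type."""
    grouped = {}
    for entity in entities:
        grouped.setdefault(entity["entity_group"], []).append(entity["word"])
    return grouped


grouped = group_by_type(entities)
for entity_type, words in grouped.items():
    print(entity_type, words)
```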

---
# Question answering
* default model: distilbert-base-cased-distilled-squad
* input is a question paired with a context passage that contains the answer.
* output is the answer extracted as a span of the context, together with its score and character offsets.
---


In [None]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is JJ and I work at Florida Polytechnic University in Lakeland.",
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)



{'answer': 'Florida Polytechnic University',
 'end': 58,
 'score': 0.5749695897102356,
 'start': 28}
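
The `start` and `end` values above are character offsets into the context, so slicing the context with them gives back the answer. The sketch below checks this and marks the span. The result is copied from the output above, and `highlight_answer` is a hypothetical helper.

```python
# Context and result copied from the cell above (no model is loaded here).
context = "My name is JJ and I work at Florida Polytechnic University in Lakeland."
result = {
    "answer": "Florida Polytechnic University",
    "end": 58,
    "score": 0.5749695897102356,
    "start": 28,
}


def highlight_answer(context, result):
    """Wrap the answer span in square brackets using the character offsets."""
    start, end = result["start"], result["end"]
    return context[:start] + "[" + context[start:end] + "]" + context[end:]


highlighted = highlight_answer(context, result)
print(highlighted)
```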

---
# Summarization
* default: sshleifer/distilbart-cnn-12-6
* news source: https://finance.yahoo.com/news/florida-poly-launches-two-masters-142300308.html
---

In [None]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)




In [None]:
text = """ New Master of Science degrees in data science and engineering management will be offered at Florida Polytechnic University this fall.

LAKELAND, Fla., May 17, 2022 /PRNewswire/ -- Florida Polytechnic University is introducing two new Master of Science degrees this fall in data science and engineering management. The two leading-edge graduate programs were created in response to growing University enrollment and high industry demand.

The University's Board of Trustees approved the degrees in February and they recently were approved by the State University System's Board of Governors for inclusion in the state's inventory of degree programs.

As the University grows its campus and student body, we continue growing our academic offerings as well with degrees that meet the needs of our students and of industry, said Dr. Terry Parker, Florida Poly's provost.

The data science degree is an evolution of an existing degree track within the computer science degree program, while the engineering management degree is an evolution of an existing track within the engineering degree program. Their independent nature will now allow them to be more responsive to changes in their disciplines and industries.

We're looking forward to more students taking advantage of these opportunities, said Dr. Shahram Taj, chair of Florida Poly's data science and business analytics department, where the degrees will reside. Many data science students are securing high-paying offers even before they complete their degree.

The data science master's degree will be offered on three pathways: a 10-month course-only option, a 16-month program with an individual or small group project, and a two-year thesis-based option.

There is big demand in many disciplines for high-quality training programs in data science, said Dr. Rei Sanchez-Arias, assistant chair of the Department of Data Science and Business Analytics. The impact of data science is broad-based and there is incredible demand from a variety of disciplines in today's market.

The 10-course engineering management degree can be completed in as few as 10 months, and the program is distinctive because it incorporates emerging analytics and technological innovation principles into the traditional program of study.

Engineering managers must be equipped with data analytics skills and tools, given the impact data-driven decisions have in organizations aiming for sustainable competitive advantage and technological innovation, Sanchez-Arias said. """


In [None]:
summarizer(text)

[{'summary_text': " Florida Polytechnic University introduces two new Master of Science degrees in data science and engineering management . The two leading-edge graduate programs were created in response to growing University enrollment and high industry demand . The data science master's degree will be offered on three pathways: a 10-month course-only option, a 16-month program with an individual or small group project ."}]

In [None]:
summarizer(text, max_length=77)

[{'summary_text': " Florida Polytechnic University introduces two new Master of Science degrees in data science and engineering management . The two leading-edge graduate programs were created in response to growing University enrollment and high industry demand . The data science master's degree will be offered on three pathways: a 10-month course-only option, a 16-month program with an individual or small group project,"}]
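
Comparing the two summaries above suggests that, in this run, `max_length=77` did not produce a more compressed summary but cut the same text off slightly earlier. The sketch below makes the comparison explicit. Both strings are copied from the outputs above, and the variable names are hypothetical.

```python
import os

# Summaries copied from the two cells above (no model is loaded here).
default_summary = (
    " Florida Polytechnic University introduces two new Master of Science degrees"
    " in data science and engineering management . The two leading-edge graduate"
    " programs were created in response to growing University enrollment and high"
    " industry demand . The data science master's degree will be offered on three"
    " pathways: a 10-month course-only option, a 16-month program with an"
    " individual or small group project ."
)
capped_summary = (
    " Florida Polytechnic University introduces two new Master of Science degrees"
    " in data science and engineering management . The two leading-edge graduate"
    " programs were created in response to growing University enrollment and high"
    " industry demand . The data science master's degree will be offered on three"
    " pathways: a 10-month course-only option, a 16-month program with an"
    " individual or small group project,"
)

# Longest common prefix of the two summaries, character by character.
shared = os.path.commonprefix([default_summary, capped_summary])
default_tail = default_summary[len(shared):]
capped_tail = capped_summary[len(shared):]
print(repr(default_tail), repr(capped_tail))
```

The capped summary ends on a comma, which is consistent with generation stopping at the length limit rather than at a natural sentence end.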