<a href="https://colab.research.google.com/github/MST47/Open-Source-NLP-Toolkit/blob/main/3_Universal_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 3_Universal Models


## Getting Started with Transformers

### Install the Transformers library & dependencies

In [1]:
!pip install transformers~=4.31.0  # The Transformers library from Hugging Face
!pip install sentencepiece==0.1.96  # optional tokeniser, required for some models. e.g. machine translation
!pip install wikipedia==1.4.0  # to download any text from wikipedia
# running large models with accelerate https://huggingface.co/blog/accelerate-large-models
# NOTE: we need to restart the runtime after installing accelerate
!pip install accelerate~=0.21.0

Collecting transformers~=4.31.0
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.4/7.4 MB 35.6 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers~=4.31.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 25.5 MB/s eta 0:00:00
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.19.1
    Uninstalling tokenizers-0.19.1:
      Successfully uninstalled tokenizers-0.19.1
  Attempting uninstall: transformers
    Found existing installation: transformers 4.41.2
    Uninstalling transformers-4.41.2:
      Successfully uninstalled transformers-4.41.2
Successfully installed tokenizers-0.13.3 transformers-4.31.0
Collecting sentencepiece==0.1.

#### The Hugging Face Pipeline

In [2]:
from transformers import pipeline
import pandas as pd
import numpy as np
from pprint import pprint

## Introduction

The models above are each tailored to one specific task and trained on one specific dataset. Their main advantage is that they perform very well on that task and that dataset. In reality, however, the problems you encounter will often require a slightly different task, with different category definitions or different types of text. <br>

Universal models can partly address this issue. They also learn only one task, but this task is so general (universal) that many other tasks can be reformulated as this universal task. Two examples of universal tasks are:<br>

**1. Natural Language Inference (NLI)**: A task that can solve any classification task. <br>
**2. Token generation**: An even more universal task that can solve any text-related task.
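
To make the reformulation idea concrete, here is a minimal sketch of how a classification input can be rewritten as NLI input. Each candidate class becomes a hypothesis, and the model is asked whether the text (the premise) entails it. The helper name `build_nli_pairs` and the template wording are assumptions made for illustration, not part of any library.

```python
def build_nli_pairs(text, classes, template="This example is about {}."):
    """Turn one classification input into (premise, hypothesis) pairs, one per class."""
    return [(text, template.format(label)) for label in classes]


# Illustrative input: one customer message and two candidate classes
pairs = build_nli_pairs(
    "I have not received my reimbursement yet.",
    ["payment issues", "travel advice"],
)

for premise, hypothesis in pairs:
    print(f"premise: {premise!r} | hypothesis: {hypothesis!r}")
```

An NLI model then scores each pair, and the class whose hypothesis is most strongly entailed is taken as the prediction.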


## 1. Natural Language Inference
   **i. Zero-shot Classification**: <br>  
      Assume you receive a text from a customer and define a set of candidate classes. The output is the probability of each class, sorted in descending order. Unlike with the previous models, you can change or add classes at prediction time without retraining.

In [3]:
pipeline_zeroshot_classification = pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.07k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/558M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.26k [00:00<?, ?B/s]

spm.model:   0%|          | 0.00/4.31M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/16.3M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/23.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/286 [00:00<?, ?B/s]

In [4]:
text = "Customer: I have not received my reimbursement yet. What the hell is going on?"
classes = ['payment issues', 'travel advice', 'bug report']  # "account opening", "customer complaint"

#text = "I do not think the government is trustworthy anymore. We need to mobilize and resist!"
#classes = ["civil disobedience", "praise of the government", "travel advice"]  # "collective action"

output = pipeline_zeroshot_classification(text, classes, multi_label=True)

pd.DataFrame(data=[output["labels"], output["scores"]], index=["class", "probability"]).T


|   | class | probability |
|---|---|---|
| 0 | payment issues | 0.991132 |
| 1 | bug report | 0.076115 |
| 2 | travel advice | 0.018696 |
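
The probabilities in the output above add up to more than 1. This is expected with `multi_label=True`, because each class is scored on its own. The following is a rough sketch of the two scoring modes as I understand them, written in plain Python rather than taken from the pipeline's internals. The logits are made up for illustration, and the function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def score_multi_label(logits_per_class):
    """Score each class independently: entailment vs. contradiction."""
    return [softmax([contra, entail])[1] for contra, entail in logits_per_class]


def score_single_label(logits_per_class):
    """Score classes against each other: softmax over entailment logits."""
    return softmax([entail for _, entail in logits_per_class])


# Hypothetical (contradiction_logit, entailment_logit) pairs for three classes
logits = [(-3.0, 2.0), (1.5, -1.0), (2.5, -1.5)]

multi = score_multi_label(logits)
single = score_single_label(logits)

print("multi_label=True :", [round(p, 3) for p in multi], "sum =", round(sum(multi), 3))
print("multi_label=False:", [round(p, 3) for p in single], "sum =", round(sum(single), 3))
```

With independent scoring, several classes can receive a high probability at once, which is useful when a text can belong to more than one category.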


#### (a) Zero-shot learning with large generative models and prompts (LLMs)

In [5]:
# info on GPU
!nvidia-smi
# info on total RAM of the runtime
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('\n\nYour runtime has {:.1f} gigabytes of total RAM\n'.format(ram_gb))

Sun Jul  7 10:28:10 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8               9W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [9]:
# connecting to my google drive to load a large model from disk instead of downloading it
from google.colab import drive
import os
drive.mount('/content/drive', force_remount=False)

print(os.getcwd())
os.chdir("/content/drive/My Drive/generative-models")
print(os.getcwd())

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content
/content/drive/My Drive/generative-models


In [11]:
# if this doesn't work, try restarting the runtime
from transformers import pipeline
import torch

model_name = "google/flan-t5-xl" #"flan-t5-xl"  # use "google/flan-t5-xl" to download the model

pipeline_zeroshot_prompting = pipeline(
    "text2text-generation",  # "text2text-generation", "text-generation"
    model=model_name,  device_map="auto",  #device=device_id,
    torch_dtype=torch.bfloat16,  #load_in_8bit=True,
)

config.json:   0%|          | 0.00/1.44k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/53.0k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.45G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]
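
The `torch_dtype=torch.bfloat16` argument matters on a GPU with about 15 GB of memory. As a back-of-the-envelope sketch, you can estimate the memory needed for the weights from the shard sizes in the download log above. This assumes the checkpoint is stored in 32-bit floats (not confirmed here) and ignores activations and other overhead, so treat the numbers as rough guidance only. The helper names are hypothetical.

```python
# Bytes needed to store one parameter in each data type
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def estimate_params_from_checkpoint(size_gb, dtype="float32"):
    """Estimate the parameter count from the checkpoint size on disk."""
    return size_gb * 1e9 / BYTES_PER_PARAM[dtype]


def estimate_weight_memory_gb(n_params, dtype):
    """Estimate the memory needed for the weights alone, in gigabytes."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


# Shard sizes reported in the download log (9.45 GB + 1.95 GB)
checkpoint_gb = 9.45 + 1.95

# Assumption: the checkpoint stores float32 weights
n_params = estimate_params_from_checkpoint(checkpoint_gb, dtype="float32")
print(f"Estimated parameters: {n_params / 1e9:.2f} billion")

for dtype in ("float32", "bfloat16", "int8"):
    print(f"{dtype:>9}: ~{estimate_weight_memory_gb(n_params, dtype):.1f} GB for weights")
```

Under these assumptions, halving the precision roughly halves the memory for the weights, which leaves more headroom on the GPU for generation.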

In [12]:
## text classification (framed as multiple choice)
text = '''
Here is a quote:
"I do not think the government is trustworthy anymore. We need to mobilize and resist!".
Is this quote about either (a) "civil disobedience", or (b) "praise of the government", or (c) "government funded mobility"?
Only choose one of the three options.'''

output = pipeline_zeroshot_prompting(text)
output

[{'generated_text': '(a)'}]
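
If you want to reuse this multiple-choice framing for other texts and classes, the prompt can be built programmatically and the generated letter mapped back to a class name. This is a minimal sketch; both helper names are hypothetical, and the parser simply assumes the model answers with a letter in parentheses, as in the output above.

```python
import re
import string


def build_multiple_choice_prompt(quote, classes):
    """Build a multiple-choice prompt with options (a), (b), (c), ..."""
    letters = string.ascii_lowercase
    options = ", or ".join(
        f'({letters[i]}) "{label}"' for i, label in enumerate(classes)
    )
    return (
        f'Here is a quote:\n"{quote}".\n'
        f"Is this quote about either {options}?\n"
        f"Only choose one of the {len(classes)} options."
    )


def parse_choice(generated_text, classes):
    """Map an answer like '(a)' back to its class name, or None if unparseable."""
    match = re.search(r"\(([a-z])\)", generated_text.lower())
    if match is None:
        return None
    index = ord(match.group(1)) - ord("a")
    return classes[index] if index < len(classes) else None


classes = ["civil disobedience", "praise of the government", "government funded mobility"]
prompt = build_multiple_choice_prompt(
    "I do not think the government is trustworthy anymore. We need to mobilize and resist!",
    classes,
)
print(prompt)
print(parse_choice("(a)", classes))
```

Returning `None` for unparseable answers makes it easy to spot cases where the model did not follow the instructions.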

In [13]:
## question answering
text = '''Here is a news article from Thursday 22.12.2022: "
European Parliament website hit by cyberattack after Russian terrorism vote
One official blamed pro-Russian hacking group Killnet for the DDoS attack.
The European Parliament website on Wednesday faced a "sophisticated" cyberattack disrupting its services moments after members voted to declare Russia a state sponsor of terrorism.
"I confirm that the Parliament has been subject to an external cyber attack, but the Parliamentary services are doing well to defend the Parliament," Dita Charanzová, Czech MEP and Parliament vice president responsible for cybersecurity, said in a statement.
Another senior Parliament official, requesting not to be named, said “it might be the most sophisticated attack that the Parliament has known so far.”
The attack is what's known as a distributed denial-of-service (DDoS) attack, in which massive amounts of traffic are sent to servers in an attempt to block internet users from accessing websites, Marcel Kolaja, European Parliament member for the Czech Pirate party, confirmed.
DDoS attacks are used by hacking groups to disrupt and cause chaos. It emerged as a favorite instrument of Russian hacking groups like Killnet, notably as a way to protest against political decisions in European countries to support Ukraine in the war.
The attack on the European Parliament website comes after the chamber voted on Wednesday to adopt a resolution declaring Russia a state sponsor of terrorism because of Moscow’s strikes on civilian targets in Ukraine.
"We have a strong indication that it is from Killnet, the hackers with links to Russia indeed. This is my information, but it is under control. It only cut the external access to the Parliament's website ... Unless there is extra attacks we expect it to be back and accessible very soon," said Eva Kaili, Greek member and vice president of the European Parliament.
"This morning Russia was still designated as a terrorist state in an official resolution. This afternoon the entire network collapses in [the European Parliament]," Alexandra Geese, German Greens' MEP, tweeted.
".

'''

prompt_lst = [
    "Was there a cyber attack? Yes or no.",
    "What is the name of the attacker?",
    "Which country does the attacker come from?",
    "What is the name of the victim of the cyber attack?",
    "Which country does the victim of the cyber attack come from?",
    "If there was a cyber attack, what type of cyber attack was it?",
    "What was the date of the cyber attack?",
    "What or who is the source of information on the cyber attack?",
    "What damages were caused by the cyber attack?",
    "What was the political response to the cyber attack?",
    'How certain is it that there was a cyber attack? "Very certain", "moderately certain", or "not certain"? Chose one of these options.'
]

# chain-of-thought tests https://arxiv.org/pdf/2210.11416.pdf
instructions_begin = ""  #"Answer the following question by reasoning step-by-step: "
instructions_end = ""  #" Explain the answer with step-by-step reasoning"
other_category = ' Answer "unknown" if the correct answer is not explicitly mentioned in the article.'

input_lst = [text + instructions_begin + prompt + other_category + instructions_end for prompt in prompt_lst]

output_lst = []
for input_text, prompt in zip(input_lst, prompt_lst):  # avoid shadowing the built-in input()
    output = pipeline_zeroshot_prompting(input_text)
    output_lst.append(output)
    print(f'{prompt:90}{output[0]["generated_text"]}')


Was there a cyber attack? Yes or no.                                                      Yes
What is the name of the attacker?                                                         Killnet
Which country does the attacker come from?                                                Russia
What is the name of the victim of the cyber attack?                                       European Parliament
Which country does the victim of the cyber attack come from?                              unknown
If there was a cyber attack, what type of cyber attack was it?                            distributed denial-of-service (DDoS) attack
What was the date of the cyber attack?                                                    Wednesday
What or who is the source of information on the cyber attack?                             Eva Kaili
What damages were caused by the cyber attack?                                             unanswerable
What was the political response to the cyber attack?                                      unanswerable
How certain is it that there was a cyber attack? "Very certain", "moderately certain", or "not certain"? Chose one of these options.Very certain
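
Note that the model sometimes answers "unanswerable" even though the prompt asked for "unknown". If you want to process the answers further, it can help to normalise such abstentions to a single value. This is a small sketch with a hypothetical helper, using a few of the answers printed above as sample data.

```python
# Strings the model used above to signal that it could not answer
ABSTAIN_MARKERS = {"unknown", "unanswerable"}


def normalise_answer(generated_text):
    """Return None for abstentions, otherwise the stripped answer text."""
    cleaned = generated_text.strip()
    return None if cleaned.lower() in ABSTAIN_MARKERS else cleaned


# A few question/answer pairs copied from the output above
raw_answers = {
    "What is the name of the attacker?": "Killnet",
    "Which country does the victim of the cyber attack come from?": "unknown",
    "What damages were caused by the cyber attack?": "unanswerable",
}

clean_answers = {question: normalise_answer(answer) for question, answer in raw_answers.items()}
for question, answer in clean_answers.items():
    print(f"{question:65}{answer}")
```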


In [None]:
## text summarisation
text = """
"Donald John Trump (born June 14, 1946) is an American politician, media personality, and businessman who served as the 45th president of the United States from 2017 to 2021.
Trump graduated from the Wharton School of the University of Pennsylvania with a bachelor's degree in 1968. He became president of his father's real estate business in 1971 and renamed it The Trump Organization.
He expanded the company's operations to building and renovating skyscrapers, hotels, casinos, and golf courses. He later started side ventures, mostly by licensing his name. From 2004 to 2015, he co-produced
and hosted the reality television series The Apprentice. Trump and his businesses have been involved in more than 4,000 state and federal legal actions, including six bankruptcies. Trump's political positions
have been described as populist, protectionist, isolationist, and nationalist. He won the 2016 United States presidential election as the Republican nominee against Democratic nominee Hillary Clinton despite
losing the national popular vote. He became the first U.S. president with no prior military or government service. His election and policies sparked numerous protests. The 2017–2019 special counsel investigation
led by Robert Mueller established that Russia interfered in the 2016 election to favor the election of Trump. Trump promoted conspiracy theories and made many false and misleading statements during his campaigns
and presidency, to a degree unprecedented in American politics. Many of his comments and actions have been characterized as racially charged or racist, and many as misogynistic. Trump ordered a travel ban
on citizens from several Muslim-majority countries, diverted military funding towards building a wall on the U.S.–Mexico border, and implemented a policy of family separations for apprehended migrants.
He rolled back more than 100 environmental policies and regulations in an aggressive attempt to weaken environmental protections."
Please summarize this text by providing the key information about Donald Trump. Summary:
"""

output = pipeline_zeroshot_prompting(text)
output

[{'generated_text': '"Donald Trump (born June 14, 1946) is an American politician, media personality,'}]
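
The summary above stops mid-sentence. A likely cause (an assumption, not verified here) is the default limit on the number of generated tokens, which is short unless you override it. Generation arguments such as `max_new_tokens` can be passed when calling the pipeline. The sketch below wraps that call in a hypothetical helper and uses a stand-in generator so that it runs without downloading a model.

```python
def summarise(generator, prompt, max_new_tokens=128):
    """Call a text2text-generation pipeline with a larger output budget."""
    outputs = generator(prompt, max_new_tokens=max_new_tokens)
    return outputs[0]["generated_text"]


# Hypothetical stand-in that mimics the pipeline's output format.
# It counts words instead of tokens, purely for demonstration.
# In the notebook, pass pipeline_zeroshot_prompting instead.
def fake_generator(prompt, max_new_tokens=20):
    words = prompt.split()
    return [{"generated_text": " ".join(words[:max_new_tokens])}]


print(summarise(fake_generator, "one two three four five", max_new_tokens=3))
```

In the notebook itself, the equivalent call would be `summarise(pipeline_zeroshot_prompting, text)`, and you can raise or lower `max_new_tokens` depending on how long the summary should be.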