# Covid19 Tweet Truth Analysis
## Fine-tuned BERT inference and attention visualization

References:

Workshop 4.3, zero-shot classification: https://colab.research.google.com/drive/1Lie9UJyJKONwR7uEdscOp2Qw4vmVlpvx#scrollTo=wEd0QskfGDVR

Workshop 4.4: https://colab.research.google.com/drive/1B_zidLpksK_pnctPNcnw5dJh4oxgQPfa#scrollTo=H2Zb9jchL-PP

BiLSTM: https://colab.research.google.com/drive/1bDY5cg3dpCLDTqiDr0xkoAao5EWs3mKN#scrollTo=tI2_gt9Dtb0X

Attention map: https://colab.research.google.com/drive/1I4ykvsKqwb78hEDRSzzugdDTpWUIDA4e#scrollTo=EbQpyYs13jF_

Attention extraction: https://github.com/hila-chefer/Transformer-Explainability




In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Sat Apr  9 14:28:47 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [34]:
# setup CUDA
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla P100-PCIE-16GB


# Inference example

In [3]:
!pip install transformers
!pip install azureml-core

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: 

In [35]:
from google.colab import drive
import os

drive.mount('/content/gdrive')
path = "/content/gdrive/My Drive/PLP_sharing/project/fake_news/"
os.chdir(path)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [5]:
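from transformers import BertForSequenceClassification  # the class must be importable to unpickle the model
import torch

# Load the fine-tuned model object (the whole module was pickled, not just a state dict).
saved_model = torch.load('./lqq_transformer1/model_attention')
saved_model = saved_model.eval()  # switch off dropout for inference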


In [36]:
from transformers import BertTokenizer
import numpy as np

test_text_fake = 'Alfalfa is the only cure for COVID-19.'
test_text_real = '#IndiaFightsCorona India has one of the lowest #COVID19 mortality globally with less than 2% Case Fatality Rate. As a result of supervised home isolation &amp; effective clinical treatment many States/UTs have CFR lower than the national average. https://t.co/QLiK8YPP7E'

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

test_text = tokenizer([test_text_fake],
                      max_length=128,              # Pad & truncate all sentences.
                      padding='max_length',
                      truncation=True,
                      return_attention_mask=True,  # Construct attention masks.
                      return_tensors='pt')         # Return PyTorch tensors.

# The tokenizer already returns tensors, so move them to the device directly.
# Wrapping them in torch.tensor() again only triggers a copy-construct warning.
test_seq = test_text['input_ids'].to(device)
test_mask = test_text['attention_mask'].to(device)

with torch.no_grad():
  # reference: https://www.kaggle.com/akshat0007/bert-for-sequence-classification
  outputs = saved_model(test_seq, attention_mask=test_mask)
  pred_proba = outputs[0].detach().cpu().numpy()  # raw logits, not probabilities

preds = np.argmax(pred_proba, axis=1)

print([preds.tolist(), pred_proba.tolist()])

[[0], [[1.415144920349121, -0.7875666618347168]]]
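
The second list in the output above holds raw logits, not probabilities. The sketch below shows one way to turn them into probabilities with a softmax, in plain Python so it runs without the model. The `softmax` helper and the `label_names` mapping are illustrative; in particular, treating index 0 as "fake" is an assumption that matches the predictions in this notebook but is not stated anywhere in it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits printed above for 'Alfalfa is the only cure for COVID-19.'
logits = [1.415144920349121, -0.7875666618347168]
probs = softmax(logits)

# Assumption: index 0 = fake, index 1 = real (not confirmed by the notebook).
label_names = ['fake', 'real']
predicted = label_names[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

With a tensor of logits, `torch.softmax(logits, dim=1)` performs the same conversion for a whole batch.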




# Attention map

https://shap.readthedocs.io/en/latest/example_notebooks/text_examples/sentiment_analysis/Positive%20vs.%20Negative%20Sentiment%20Classification.html#

https://github.com/hila-chefer/Transformer-Explainability/blob/main/BERT_explainability.ipynb

https://medium.com/analytics-vidhya/a-gentle-introduction-to-implementing-bert-using-hugging-face-35eb480cff3

In [37]:
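# outputs['attentions'] is a tuple with one tensor per layer; [0] selects the first layer.
# Each tensor has shape (batch, num_heads, seq_len, seq_len).
attentions = outputs['attentions'][0]
print(attentions.shape)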

re_input_id_list = test_text['input_ids'][0].tolist() # Batch index 0
re_tokens = tokenizer.convert_ids_to_tokens(re_input_id_list) 
print(re_tokens)

re_input_id = test_text['input_ids']
print(re_input_id.shape)

torch.Size([1, 12, 128, 128])
['[CLS]', 'alfa', '##lf', '##a', 'is', 'the', 'only', 'cure', 'for', 'co', '##vid', '-', '19', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]
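
The shape `[1, 12, 128, 128]` reads as (batch, heads, query position, key position), and every row over the key axis is a softmax distribution. The toy example below uses nested lists with invented numbers (2 heads, 3 tokens) to show that layout and one possible way to combine heads by averaging. This is only an illustration: the notebook itself looks at a single head, and the values here do not come from the model.

```python
# Toy attention for one sentence: 2 heads x 3 query positions x 3 key positions.
# The numbers are made up for illustration; real values come from the model.
attention = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],  # head 0
    [[0.5, 0.4, 0.1], [0.3, 0.6, 0.1], [0.1, 0.1, 0.8]],  # head 1
]

num_heads = len(attention)
seq_len = len(attention[0])

# Average over the head axis to get a single (seq_len x seq_len) map.
head_mean = [
    [sum(attention[h][q][k] for h in range(num_heads)) / num_heads
     for k in range(seq_len)]
    for q in range(seq_len)
]

for row in head_mean:
    print([round(v, 2) for v in row])
```

Because each head's rows sum to 1, the averaged rows also sum to 1 and can still be read as attention distributions.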

In [38]:
import torch
from IPython.display import display, HTML

def print_attention(input_ids_all, attentions_all, tokenizer):
    """Build one HTML string per sentence, highlighting the tokens that are
    most often the arg-max attention target in head 0 of the given layer."""
    html = []
    for input_ids, attention in zip(input_ids_all, attentions_all):
        one_html = []
        tokens = tokenizer.convert_ids_to_tokens(input_ids)
        first_head = attention[0]  # (seq_len, seq_len): head 0 of this layer
        count_dict = dict()
        for token, attention_row in zip(tokens, first_head):
            if token == '[PAD]':
                break
            # Find the token this query position attends to most strongly.
            attention_row = attention_row.tolist()
            attention_index = attention_row.index(max(attention_row))
            candidate_token = tokens[attention_index]
            count_dict[candidate_token] = count_dict.get(candidate_token, 0) + 1

        # Count how often a non-special token received the most attention.
        count_sum = sum(value for key, value in count_dict.items()
                        if key not in ('[CLS]', '[SEP]'))

        for token in tokens:
            if token == '[PAD]':
                break
            if token in ('[CLS]', '[SEP]'):
                continue
            # Guard against division by zero when only special tokens were picked.
            weight = count_dict.get(token, 0) / count_sum if count_sum else 0
            # rgba() carries the alpha channel; clamp it to the valid [0, 1] range.
            one_html.append('<span style="background-color: rgba(255,255,0,{0})">{1}</span>'.format(min(weight * 2, 1), token))

        html.append(" ".join(one_html))

    return html


html_arr = print_attention(test_text['input_ids'], attentions, tokenizer)

print(len(html_arr))

for html in html_arr:
  display(HTML(html))

1
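
The counting heuristic inside `print_attention` can be traced by hand on a small example. The sketch below repeats the same steps in plain Python: for each query row, pick the key token with the largest attention, count how often each token is picked, then normalise over the non-special tokens. The token list and attention values are invented for illustration and are not model output.

```python
from collections import Counter

tokens = ['[CLS]', 'alfa', 'is', 'cure', '[SEP]']
# Made-up attention rows (one per query token); each row sums to 1.
head = [
    [0.1, 0.2, 0.1, 0.5, 0.1],  # [CLS] attends most to 'cure'
    [0.1, 0.6, 0.1, 0.1, 0.1],  # 'alfa' attends most to 'alfa'
    [0.1, 0.1, 0.1, 0.6, 0.1],  # 'is' attends most to 'cure'
    [0.6, 0.1, 0.1, 0.1, 0.1],  # 'cure' attends most to [CLS]
    [0.1, 0.1, 0.1, 0.1, 0.6],  # [SEP] attends most to [SEP]
]

special = {'[CLS]', '[SEP]'}

# How often each token is the arg-max attention target.
wins = Counter(tokens[row.index(max(row))] for row in head)

# Normalise over non-special tokens only, as print_attention does.
total = sum(n for tok, n in wins.items() if tok not in special)
weights = {tok: wins.get(tok, 0) / total for tok in tokens if tok not in special}
print(weights)
```

The function then doubles each weight to get the highlight opacity, so any token with a weight of 0.5 or more is drawn at full intensity.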


# Attention on fake tweets from the validation set

In [39]:
import numpy as np # linear algebra
import pandas as pd

twValid = pd.read_csv("./covid19-fake-news-dataset-nlp-unzip/Constraint_Val.csv") #Load the tweet (tw) validation set
twValid.head() #Take a peek at the data

   id                                              tweet label
0   1  Chinese converting to Islam after realising th...  fake
1   2  11 out of 13 people (from the Diamond Princess...  fake
2   3  COVID-19 Is Caused By A Bacterium, Not Virus A...  fake
3   4  Mike Pence in RNC speech praises Donald Trump’...  fake
4   5  6/10 Sky's @EdConwaySky explains the latest #C...  real


In [47]:
# Keep only the tweets labelled 'fake' and take the first 20 of them.
fake_df = twValid[twValid['label'] == 'fake']
fake_df = fake_df['tweet']  # a pandas Series of tweet texts
fake_df = fake_df[:20]
print(fake_df.head())
print(fake_df.shape)

0    Chinese converting to Islam after realising th...
1    11 out of 13 people (from the Diamond Princess...
2    COVID-19 Is Caused By A Bacterium, Not Virus A...
3    Mike Pence in RNC speech praises Donald Trump’...
8    News and media outlet ABP Majha on the basis o...
Name: tweet, dtype: object
(20,)


In [48]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

test_text = tokenizer(list(fake_df),
                      max_length=128,              # Pad & truncate all sentences.
                      padding='max_length',
                      truncation=True,
                      return_attention_mask=True,  # Construct attention masks.
                      return_tensors='pt')         # Return PyTorch tensors.

# The tokenizer already returns tensors, so move them to the device directly.
# Wrapping them in torch.tensor() again only triggers a copy-construct warning.
test_seq = test_text['input_ids'].to(device)
test_mask = test_text['attention_mask'].to(device)

with torch.no_grad():
  # reference: https://www.kaggle.com/akshat0007/bert-for-sequence-classification
  outputs = saved_model(test_seq, attention_mask=test_mask)
  pred_proba = outputs[0].detach().cpu().numpy()  # raw logits, not probabilities

preds = np.argmax(pred_proba, axis=1)

print([preds.tolist(), pred_proba.tolist()])

[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [[2.352252960205078, -2.272630214691162], [0.5723761916160583, -0.0726807489991188], [1.170621395111084, -0.8536534309387207], [2.232856035232544, -2.149444103240967], [1.8059808015823364, -1.5207321643829346], [1.1373025178909302, -0.6935441493988037], [0.5714331269264221, 0.06818801909685135], [1.5092942714691162, -1.0155099630355835], [1.2945369482040405, -1.103743553161621], [1.1388574838638306, -0.7375275492668152], [1.6128101348876953, -1.4411450624465942], [2.024237632751465, -1.9482214450836182], [2.638136148452759, -2.545273780822754], [1.2486101388931274, -0.7757182121276855], [2.2879486083984375, -2.415994644165039], [2.5636188983917236, -2.2910337448120117], [1.3931437730789185, -0.9466797113418579], [1.1714082956314087, -0.5218426585197449], [1.3578715324401855, -1.064857006072998], [1.2534674406051636, -0.5612179040908813]]]
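
All 20 tweets are assigned class 0, but the logits show that the model is not equally sure about each one. A rough way to see this is the margin between the two logits per tweet. The sketch below uses the values printed above, rounded to four decimals; the margin as a confidence proxy is an illustrative choice, not something the notebook computes.

```python
# Logits printed above, rounded to four decimals.
logits = [
    (2.3523, -2.2726), (0.5724, -0.0727), (1.1706, -0.8537), (2.2329, -2.1494),
    (1.8060, -1.5207), (1.1373, -0.6935), (0.5714, 0.0682), (1.5093, -1.0155),
    (1.2945, -1.1037), (1.1389, -0.7375), (1.6128, -1.4411), (2.0242, -1.9482),
    (2.6381, -2.5453), (1.2486, -0.7757), (2.2879, -2.4160), (2.5636, -2.2910),
    (1.3931, -0.9467), (1.1714, -0.5218), (1.3579, -1.0649), (1.2535, -0.5612),
]

# Margin between the class-0 and class-1 logit; a small margin means low confidence.
margins = [a - b for a, b in logits]

# Indices of the tweets, least confident first.
ranked = sorted(range(len(margins)), key=lambda i: margins[i])
for i in ranked[:3]:
    print(i, round(margins[i], 4))
```

The smallest margins belong to tweets 6 and 1, so those are the predictions closest to flipping to class 1.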




In [49]:
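# outputs['attentions'] is a tuple with one tensor per layer; [0] selects the first layer.
# Each tensor has shape (batch, num_heads, seq_len, seq_len).
attentions = outputs['attentions'][0]
print(attentions.shape)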

torch.Size([20, 12, 128, 128])


In [57]:
html_arr = print_attention(test_text['input_ids'], attentions, tokenizer)
fake_df = fake_df.reset_index(drop=True)

print(len(html_arr))
for index in range(len(html_arr)):
  print('[{}]'.format(index))
  print(fake_df[index])
  display(HTML(html_arr[index]))

20
[0]
Chinese converting to Islam after realising that no muslim was affected by #Coronavirus #COVD19 in the country


[1]
11 out of 13 people (from the Diamond Princess Cruise ship) who had intially tested negative in tests in Japan were later confirmed to be positive in the United States.


[2]
COVID-19 Is Caused By A Bacterium, Not Virus And Can Be Treated With Aspirin


[3]
Mike Pence in RNC speech praises Donald Trump’s COVID-19 “seamless” partnership with governors and leaves out the president's state feuds: https://t.co/qJ6hSewtgB #RNC2020 https://t.co/OFoeRZDfyY


[4]
News and media outlet ABP Majha on the basis of an internal memo of South Central Railway reported that a special train has been announced to take the stranded migrant workers home.


[5]
???Church services can???t resume until we???re all vaccinated, says Bill Gates.??�


[6]
India records yet another single-day rise of over 28000 new cases while more than 5.5 lakh individuals have recovered from COVID-19. Kerala government sets up its first plasma bank in the state following in the steps of Delhi and West Bengal. #COVID19 #CoronavirusFacts https://t.co/JhSQUqMvta


[7]
A conspiracy theory audio about #COVID19 testing in #India circulating on @WhatsApp allegedly from MLA Geeta Jain ( @connectGEETA ). We do a quick #FactCheck on this to find that the minister has already clarified on same. https://t.co/SBhSTSr1MH


[8]
Gov. Andrew Cuomo “was simply saying if we can share 20 percent of your excess your non-used ventilators to help people in other parts of the state on a voluntary basis that would be great. Of course there was a reaction to that which was not positive."


[9]
???The Democrats are pushing for an implanted microchip in humans, and everyone to be vaccinated.??�


[10]
Mike Pence introduces program to cure coronavirus carriers with conversion therapy https://t.co/A36KAO2NWa https://t.co/bp0SDO25F0


[11]
Advisory issued by the state police of Telangana (India) instructing people to be vigilant about the possibility of increase in thefts due to the COVID-19 crisis in the country.


[12]
Parent Makes Impassioned Plea To Coronavirus https://t.co/E8AHhknapG #kids #facebook #coronavirus


[13]
Police has "free entry in houses and buildings" in Malaga to identify possible coronavirus infections.


[14]
Photo purportedly showing Tom Hanks holding a volleyball claiming the hospital staff in Australia gave it to him as a tribute and to cheer him up while in quarantine.


[15]
Video shows celebrations in Tunisian hospitals because Tunisia is free from COVID19.


[16]
Is Coranavirus a biological Weapon developed by the Chinese called Wuhan -400? This book was published in 1981. Do read the excerpt.


[17]
It’s been over six months since the first confirmed case of COVID-19 in the United States, and President Trump still doesn't have an effective plan to contain its spread. It's an unjustifiable failure of leadership that costs lives every day.


[18]
In our laboratory, we found trace amount of the virus on the skin of fruits and vegetables after 12 hours of being touched by another customer who was infected. We recommend our staff to avoid salads. Do not eat the fruits within 48 hours of purchase, or pour some boiling water over the fruit before cutting. Berries, apples, cucumbers and tomatoes are the worst because some people eat the skin. This explains why the virus is spreading faster in the west than asia. Most Asians do not eat salad and very few people eat the skin of any fruit. We have to assume anything that comes from outside our home within 48 hours is infected. Shoes, clothes, our hair, all food. Hope that helps, pls take extreme care. Kind Regards.


[19]
@JATayler Costco’s own brand Kirkland Tuna is the best. So don’t throw that! Save it for the next covid lockdown


In [59]:
# Hand-assigned score per tweet: 0 = completely incorrect and inaccurate, 1 = correct but with missing or over-extracted parts, 2 = totally correct
score_standard = ['completely incorrect and inaccurate', 'correct with missing or over extract', 'totally correct']
score_arr = [1, 1, 2, 1, 2, 0, 1, 1, 1, 2, 0, 1, 2, 1, 0, 1, 2, 1, 2, 0]

for index in range(len(html_arr)):
  print('[{}]'.format(index), score_standard[score_arr[index]])
  print(fake_df[index])
  display(HTML(html_arr[index]))
  print('\n')


[0] correct with missing or over extract
Chinese converting to Islam after realising that no muslim was affected by #Coronavirus #COVD19 in the country




[1] correct with missing or over extract
11 out of 13 people (from the Diamond Princess Cruise ship) who had intially tested negative in tests in Japan were later confirmed to be positive in the United States.




[2] totally correct
COVID-19 Is Caused By A Bacterium, Not Virus And Can Be Treated With Aspirin




[3] correct with missing or over extract
Mike Pence in RNC speech praises Donald Trump’s COVID-19 “seamless” partnership with governors and leaves out the president's state feuds: https://t.co/qJ6hSewtgB #RNC2020 https://t.co/OFoeRZDfyY




[4] totally correct
News and media outlet ABP Majha on the basis of an internal memo of South Central Railway reported that a special train has been announced to take the stranded migrant workers home.




[5] completely incorrect and inaccurate
???Church services can???t resume until we???re all vaccinated, says Bill Gates.??�




[6] correct with missing or over extract
India records yet another single-day rise of over 28000 new cases while more than 5.5 lakh individuals have recovered from COVID-19. Kerala government sets up its first plasma bank in the state following in the steps of Delhi and West Bengal. #COVID19 #CoronavirusFacts https://t.co/JhSQUqMvta




[7] correct with missing or over extract
A conspiracy theory audio about #COVID19 testing in #India circulating on @WhatsApp allegedly from MLA Geeta Jain ( @connectGEETA ). We do a quick #FactCheck on this to find that the minister has already clarified on same. https://t.co/SBhSTSr1MH




[8] correct with missing or over extract
Gov. Andrew Cuomo “was simply saying if we can share 20 percent of your excess your non-used ventilators to help people in other parts of the state on a voluntary basis that would be great. Of course there was a reaction to that which was not positive."




[9] totally correct
???The Democrats are pushing for an implanted microchip in humans, and everyone to be vaccinated.??�




[10] completely incorrect and inaccurate
Mike Pence introduces program to cure coronavirus carriers with conversion therapy https://t.co/A36KAO2NWa https://t.co/bp0SDO25F0




[11] correct with missing or over extract
Advisory issued by the state police of Telangana (India) instructing people to be vigilant about the possibility of increase in thefts due to the COVID-19 crisis in the country.




[12] totally correct
Parent Makes Impassioned Plea To Coronavirus https://t.co/E8AHhknapG #kids #facebook #coronavirus




[13] correct with missing or over extract
Police has "free entry in houses and buildings" in Malaga to identify possible coronavirus infections.




[14] completely incorrect and inaccurate
Photo purportedly showing Tom Hanks holding a volleyball claiming the hospital staff in Australia gave it to him as a tribute and to cheer him up while in quarantine.




[15] correct with missing or over extract
Video shows celebrations in Tunisian hospitals because Tunisia is free from COVID19.




[16] totally correct
Is Coranavirus a biological Weapon developed by the Chinese called Wuhan -400? This book was published in 1981. Do read the excerpt.




[17] correct with missing or over extract
It’s been over six months since the first confirmed case of COVID-19 in the United States, and President Trump still doesn't have an effective plan to contain its spread. It's an unjustifiable failure of leadership that costs lives every day.




[18] totally correct
In our laboratory, we found trace amount of the virus on the skin of fruits and vegetables after 12 hours of being touched by another customer who was infected. We recommend our staff to avoid salads. Do not eat the fruits within 48 hours of purchase, or pour some boiling water over the fruit before cutting. Berries, apples, cucumbers and tomatoes are the worst because some people eat the skin. This explains why the virus is spreading faster in the west than asia. Most Asians do not eat salad and very few people eat the skin of any fruit. We have to assume anything that comes from outside our home within 48 hours is infected. Shoes, clothes, our hair, all food. Hope that helps, pls take extreme care. Kind Regards.




[19] completely incorrect and inaccurate
@JATayler Costco’s own brand Kirkland Tuna is the best. So don’t throw that! Save it for the next covid lockdown
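
The 20 hand-assigned scores are easier to compare as a tally. The sketch below counts how many tweets fall into each category and computes the mean score, reusing `score_standard` and `score_arr` exactly as defined above. The summary itself is an addition for illustration and is not part of the original evaluation.

```python
from collections import Counter

score_standard = ['completely incorrect and inaccurate',
                  'correct with missing or over extract',
                  'totally correct']
score_arr = [1, 1, 2, 1, 2, 0, 1, 1, 1, 2, 0, 1, 2, 1, 0, 1, 2, 1, 2, 0]

# Number of tweets per score category.
counts = Counter(score_arr)
for score, label in enumerate(score_standard):
    print('{} ({}): {} of {}'.format(score, label, counts[score], len(score_arr)))

# Average score on the 0-2 scale.
mean_score = sum(score_arr) / len(score_arr)
print('mean score: {:.2f}'.format(mean_score))
```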




