<a href="https://colab.research.google.com/github/Lexian-6/Sentiment-Analysis-towards-COVID-19-on-Twitter/blob/Juno%2FBertModel/Models_Juno.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧱 Models_Juno.ipynb

This notebook applies several models to sentiment classification of COVID-19 tweets from the COVIDSenti dataset. It includes the following contents:

1. **Pre-trained DistilBERT Model**:
    - Utilizes a pre-trained DistilBERT model for initial text processing and analysis.
    - Demonstrates the usage and performance of the pre-trained model on a sample dataset.

2. **Customized DistilBERT Model**:
    - Implements a customized version of the DistilBERT model.
    - Fine-tunes the model on a specific dataset to improve performance for targeted NLP tasks.
    - Provides insights into the customization process and the resulting enhancements.

3. **Large Language Model (Using Gemini-Pro API)**:
    - Integrates the Gemini-Pro API to leverage a large language model for advanced text processing.
    - Relabels the entire dataset using the capabilities of the Gemini-Pro model.
    - Compares the performance and accuracy of the relabeling process with previous models.

Together, these sections show how BERT-based models and a large language model can be applied to the same sentiment classification task.


## 🤖Installing useful Packages

- ### transformers

  #### We use the package for:
  - **Pre-trained Models**: Provides a wide range of state-of-the-art pre-trained models for NLP tasks such as BERT, GPT, T5, and many more.
  - **Model Training**: Facilitates fine-tuning of pre-trained models on custom datasets for various NLP tasks.
  - **Tokenization**: Includes tokenizers for converting text to input format suitable for models.


- ### accelerate

  #### We use the package for:
  - **Multi-GPU Training**: Simplifies the process of scaling model training across multiple GPUs.
  - **Distributed Training**: Provides tools for distributed training across multiple devices or nodes.
  - **Optimization**: Includes various optimizations to improve training efficiency.


- ### datasets

  #### We use the package for:
  - **Dataset Processing**: Tools for preprocessing, transforming, and manipulating datasets.
  - **Integration with transformers**: Seamlessly integrates with the transformers library for easy dataset loading and preparation for model training.
  - **Dataset Creation**: Allows us to create and share custom datasets.


In [165]:
!pip install transformers[torch]
!pip install accelerate -U
!pip install datasets



In [164]:
# @title ❓Understanding dataset size and structure
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

url = 'https://raw.githubusercontent.com/usmaann/COVIDSenti/main/COVIDSenti.csv'
url_A = 'https://raw.githubusercontent.com/usmaann/COVIDSenti/main/COVIDSenti-A.csv'
url_B = 'https://raw.githubusercontent.com/usmaann/COVIDSenti/main/COVIDSenti-B.csv'
url_C = 'https://raw.githubusercontent.com/usmaann/COVIDSenti/main/COVIDSenti-C.csv'

original_df = pd.read_csv(url)
original_df_A = pd.read_csv(url_A)
original_df_B = pd.read_csv(url_B)
original_df_C = pd.read_csv(url_C)
# Working copies for the preprocessed dataset; the originals stay untouched
df = original_df.copy()
df_A = original_df_A.copy()
df_B = original_df_B.copy()
df_C = original_df_C.copy()

df['tweet'] = original_df['tweet'].str.lower()
df_A['tweet'] = original_df_A['tweet'].str.lower()
df_B['tweet'] = original_df_B['tweet'].str.lower()
df_C['tweet'] = original_df_C['tweet'].str.lower()

def remove_urls(text):
    # Match URLs starting with "http://", "https://", or "www.".
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

df['tweet'] = df['tweet'].apply(lambda x: remove_urls(x))
df_A['tweet'] = df_A['tweet'].apply(lambda x: remove_urls(x))
df_B['tweet'] = df_B['tweet'].apply(lambda x: remove_urls(x))
df_C['tweet'] = df_C['tweet'].apply(lambda x: remove_urls(x))

def remove_mentions_hashtags(text):
    mention_pattern = re.compile(r'@\w+')
    hashtag_pattern = re.compile(r'#\w+')
    text = mention_pattern.sub(r'', text)
    text = hashtag_pattern.sub(r'', text)
    return text

df['tweet'] = df['tweet'].apply(lambda x: remove_mentions_hashtags(x))
df_A['tweet'] = df_A['tweet'].apply(lambda x: remove_mentions_hashtags(x))
df_B['tweet'] = df_B['tweet'].apply(lambda x: remove_mentions_hashtags(x))
df_C['tweet'] = df_C['tweet'].apply(lambda x: remove_mentions_hashtags(x))

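def remove_special_characters(text):
    # Remove the (already lowercased) mojibake left by a mis-decoded curly apostrophe
    text = re.sub("‚äô", "", text)
    # Replace everything except letters, digits, whitespace, and apostrophes with a space
    special_char_pattern = re.compile(r"[^a-zA-Z0-9\s']")
    text = special_char_pattern.sub(r' ', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()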

df['tweet'] = df['tweet'].apply(lambda x: remove_special_characters(x))
df_A['tweet'] = df_A['tweet'].apply(lambda x: remove_special_characters(x))
df_B['tweet'] = df_B['tweet'].apply(lambda x: remove_special_characters(x))
df_C['tweet'] = df_C['tweet'].apply(lambda x: remove_special_characters(x))

label_mapping = {
  'neg': 0,
  'neu': 1,
  'pos': 2
}

df['label'] = df['label'].map(label_mapping)
df_A['label'] = df_A['label'].map(label_mapping)
df_B['label'] = df_B['label'].map(label_mapping)
df_C['label'] = df_C['label'].map(label_mapping)

from sklearn.model_selection import train_test_split
train_df, remaining_df = train_test_split(df, test_size=0.20, random_state=42)
validation_df, test_df = train_test_split(remaining_df, test_size=0.50, random_state=42)
train_df.head()

Unnamed: 0,tweet,label
51004,coronavirus no new case in nigeria health mini...,0
11453,live coronavirus total confirmed cases around ...,1
9691,who declares global health emergency as wuhan ...,1
51992,trumps budget director responds to misinformat...,1
23531,chinese foreign minister attends munich securi...,0
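
To make the effect of the cleaning steps concrete, the sketch below chains the same regular expressions on a single string. The sample tweet and the `clean_tweet` helper name are made up for illustration, and the mojibake-removal step is left out for brevity.

```python
import re

def clean_tweet(text: str) -> str:
    """Apply the notebook's cleaning steps, in order, to one string."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop URLs
    text = re.sub(r"@\w+", "", text)                   # drop @mentions
    text = re.sub(r"#\w+", "", text)                   # drop #hashtags, word included
    text = re.sub(r"[^a-zA-Z0-9\s']", " ", text)       # other symbols become spaces
    return re.sub(r"\s+", " ", text).strip()           # collapse whitespace

# Made-up tweet, not taken from the dataset
sample = "BREAKING: @WHO says #coronavirus cases rise! https://t.co/abc123"
print(clean_tweet(sample))  # breaking says cases rise
```

Note that removing hashtags drops the tagged word itself, so a tweet that only mentions the virus through `#coronavirus` loses that token.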


## 🔍Relabeling Dataset with Large Language Model API

In this section, we improve the quality of our dataset by relabeling it with a large language model through the Gemini-Pro API. On inspection, we found that the original dataset contains a significant number of incorrect labels. Such labels can distort both training and evaluation, leading to misleading results.

To address this issue, we use the Gemini-Pro API to relabel the entire dataset, aiming for more accurate and reliable labels on which to train and evaluate our models. Relabeling should also make results more convincing and comparable across the different models used by our team, including LSTM, CLIP encode, and SRN.

### Steps Involved:

1. **Further Dataset Analysis**: ❗❗❗***[TODO]***
    - Analyzed the original dataset to identify the extent of labeling errors.
    - Determined the need for relabeling to improve data quality.

2. **Using Gemini-Pro API**:
    - Integrated the Gemini-Pro API to process and relabel the entire dataset.
    - Leveraged the advanced capabilities of the large language model to assign accurate labels.

3. **Comparison of Results**:
    - Compared the performance of models trained on the original dataset with those trained on the relabeled dataset.
    - Highlighted the improvements in model accuracy and reliability due to the corrected labels.

This approach ensures that our models are evaluated on a consistent and correctly labeled dataset, leading to more trustworthy and meaningful comparisons. By enhancing the label accuracy, we provide a stronger basis for model assessment and subsequent development.
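
Relabeling the entire dataset means sending it to the model in batches. A minimal sketch of that loop follows, assuming a batch size of 20 and a `generate` callable that maps a prompt string to a JSON reply; `PROMPT_TEMPLATE`, `iter_batches`, `relabel`, and `fake_generate` are hypothetical names, and the stub stands in for the real API call.

```python
import json
from typing import Callable, Iterator

# Hypothetical, shortened prompt; the payload always sits on the last line.
PROMPT_TEMPLATE = (
    "Classify each tweet as Positive (label=2), Neutral (label=1), or Negative (label=0).\n"
    "Return only the JSON, with 'pred_label' filled in.\n"
    "{json_data}"
)

def iter_batches(records: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

def relabel(records: list, generate: Callable[[str], str], batch_size: int = 20) -> list:
    """Send records batch by batch and collect one integer label per record."""
    labels = []
    for batch in iter_batches(records, batch_size):
        payload = json.dumps([{"tweet": r["tweet"], "pred_label": ""} for r in batch])
        parsed = json.loads(generate(PROMPT_TEMPLATE.format(json_data=payload)))
        if len(parsed) != len(batch):
            raise ValueError("model returned a different number of records")
        labels.extend(int(item["pred_label"]) for item in parsed)
    return labels

def fake_generate(prompt: str) -> str:
    """Stand-in for the model call: labels every tweet Neutral (1)."""
    items = json.loads(prompt.rsplit("\n", 1)[1])
    return json.dumps([{**item, "pred_label": 1} for item in items])

tweets = [{"tweet": f"tweet {i}"} for i in range(45)]
labels = relabel(tweets, fake_generate, batch_size=20)
print(len(labels))  # 45
```

In the notebook's setting, `generate` could wrap `gemini_model.generate_content(prompt).text`, with any surrounding backticks stripped before the text is returned.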


## Setting up Gemini API

In [3]:
!pip install -q -U google-generativeai


In [4]:
# Necessary packages
import pathlib
import textwrap

import google.generativeai as genai

from IPython.display import display
from IPython.display import Markdown

def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

# Used to securely store your API key
from google.colab import userdata

In [5]:
# Or use `os.getenv('GOOGLE_API_KEY')` to fetch an environment variable.
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')

genai.configure(api_key=GOOGLE_API_KEY)

In [6]:
for m in genai.list_models():
  if 'generateContent' in m.supported_generation_methods:
    print(m.name)

models/gemini-1.0-pro
models/gemini-1.0-pro-001
models/gemini-1.0-pro-latest
models/gemini-1.0-pro-vision-latest
models/gemini-1.5-flash
models/gemini-1.5-flash-001
models/gemini-1.5-flash-latest
models/gemini-1.5-pro
models/gemini-1.5-pro-001
models/gemini-1.5-pro-latest
models/gemini-pro
models/gemini-pro-vision


In [136]:
gemini_model = genai.GenerativeModel('gemini-pro')

## Testing the API with a neutral prompt

In [169]:
test_prompt ="What's the sentiment of the following sentence: It's lucky no confirmed coronavirus infection happens in Anhui. Don't know how long it well end. Give one word answer from 'neg' 'neu' or 'pos'"

In [170]:
%%time
response = gemini_model.generate_content(test_prompt)

to_markdown(response.text)

CPU times: user 41 ms, sys: 6.36 ms, total: 47.4 ms
Wall time: 2.92 s


> neu

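
A one-word reply like the one above still has to be turned into the integer labels used elsewhere in the notebook. The sketch below shows one way to do that, assuming the reply may carry stray whitespace, quotes, or a trailing period; `parse_one_word_label` is a hypothetical helper name.

```python
label_mapping = {"neg": 0, "neu": 1, "pos": 2}

def parse_one_word_label(reply: str):
    """Return the integer label for a one-word reply, or None if unrecognised."""
    word = reply.strip().strip(".'\"`").lower()
    return label_mapping.get(word)

print(parse_one_word_label("neu"))      # 1
print(parse_one_word_label(" Pos.\n"))  # 2
print(parse_one_word_label("unsure"))   # None
```

Returning `None` for anything outside the three expected words makes unexpected replies easy to spot instead of silently mislabeling them.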
In [154]:
df_sample = df.iloc[420:440]

df_sample['pred_label'] = ''

print(df_sample)

                                                 tweet  label pred_label
420  idk who needs this just in case youre interest...      1           
421  us's first case of novel coronavirus confirmed...      1           
422  coronavirus is spreading rapidly in china 9 de...      1           
423  i got food poisoning im going to die i got the...      1           
424  really lots of people order bats soup and bats...      1           
425        coronavirus is that when youve had too many      2           
426          maybe theyll give you chinese coronavirus      1           
427  us puts up coronavirus shield all passengers f...      1           
428          the wuhan coronavirus outbreak hits korea      1           
429           i have discovered the cure to lime wedge      1           
430  death toll from the new outbreak in increases ...      1           
431  17 die from china's over 470 affected amid glo...      1           
432  here's what nh health officials and doctors ar

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sample['pred_label'] = ''
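
The warning above appears because `df_sample` is a slice of `df`, so pandas cannot tell whether the new column is meant for the slice or for the original frame. A minimal sketch of the usual remedy, taking an explicit copy, on toy data (the four rows below are made up):

```python
import pandas as pd

df = pd.DataFrame({"tweet": ["a", "b", "c", "d"], "label": [1, 0, 2, 1]})  # toy data

# .copy() makes the slice an independent DataFrame, so adding a column
# is unambiguous and leaves the original frame unchanged.
df_sample = df.iloc[1:3].copy()
df_sample["pred_label"] = ""

print(list(df_sample.columns))  # ['tweet', 'label', 'pred_label']
print(list(df.columns))         # ['tweet', 'label']
```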


In [155]:
# Convert the DataFrame to JSON using the to_json() method

json_data = df_sample[['tweet','pred_label']].to_json(orient='records')

# See what's in the JSON data
json_data

'[{"tweet":"idk who needs this just in case youre interested in the coronavirus news here is the link to the live update on the","pred_label":""},{"tweet":"us\'s first case of novel coronavirus confirmed in snohomish county washington stay informed by following the cdc","pred_label":""},{"tweet":"coronavirus is spreading rapidly in china 9 deaths so far 440 people infected","pred_label":""},{"tweet":"i got food poisoning im going to die i got the corona virus","pred_label":""},{"tweet":"really lots of people order bats soup and bats are spreaders of coronavirus","pred_label":""},{"tweet":"coronavirus is that when youve had too many","pred_label":""},{"tweet":"maybe theyll give you chinese coronavirus","pred_label":""},{"tweet":"us puts up coronavirus shield all passengers from wuhan to be \'funneled\' to screening system at five major airport","pred_label":""},{"tweet":"the wuhan coronavirus outbreak hits korea","pred_label":""},{"tweet":"i have discovered the cure to lime wedge","pred

In [156]:
# Prompt Engineering
prompt = f"""
You are an expert linguist, who is good at classifying sentiments of tweets during COVID-19 pandemic into Positive /Neutral /Negative labels.
Help me classify customer reviews into: Positive (label=2), Neutral (label=1), and Negative (label=0).
Tweets during COVID-19 pandemic are provided between three back ticks.
In your output, only return the Json code back as output - which is provided between three backticks.
Your task is to update predicted labels under 'pred_label' in the Json code.
Don't make any changes to Json code format,please.

```
{json_data}
```
"""

print(prompt)


You are an expert linguist, who is good at classifying sentiments of tweets during COVID-19 pandemic into Positive /Neutral /Negative labels.
Help me classify customer reviews into: Positive (label=2), Neutral (label=1), and Negative (label=0).
Tweets during COVID-19 pandemic are provided between three back ticks.
In your output, only return the Json code back as output - which is provided between three backticks.
Your task is to update predicted labels under 'pred_label' in the Json code.
Don't make any changes to Json code format,please.

```
[{"tweet":"idk who needs this just in case youre interested in the coronavirus news here is the link to the live update on the","pred_label":""},{"tweet":"us's first case of novel coronavirus confirmed in snohomish county washington stay informed by following the cdc","pred_label":""},{"tweet":"coronavirus is spreading rapidly in china 9 deaths so far 440 people infected","pred_label":""},{"tweet":"i got food poisoning im going to die i got the

In [157]:
%%time
response = gemini_model.generate_content(prompt)
response.text

CPU times: user 112 ms, sys: 18.3 ms, total: 130 ms
Wall time: 9.81 s


'```\n[{"tweet":"idk who needs this just in case youre interested in the coronavirus news here is the link to the live update on the","pred_label":1},{"tweet":"us\'s first case of novel coronavirus confirmed in snohomish county washington stay informed by following the cdc","pred_label":2},{"tweet":"coronavirus is spreading rapidly in china 9 deaths so far 440 people infected","pred_label":0},{"tweet":"i got food poisoning im going to die i got the corona virus","pred_label":0},{"tweet":"really lots of people order bats soup and bats are spreaders of coronavirus","pred_label":0},{"tweet":"coronavirus is that when youve had too many","pred_label":1},{"tweet":"maybe theyll give you chinese coronavirus","pred_label":0},{"tweet":"us puts up coronavirus shield all passengers from wuhan to be \'funneled\' to screening system at five major airport","pred_label":2},{"tweet":"the wuhan coronavirus outbreak hits korea","pred_label":0},{"tweet":"i have discovered the cure to lime wedge","pred_lab

In [161]:
import json

# Clean the data by stripping the backticks
json_data = response.text.strip("`").strip()

# Load the cleaned data and convert to DataFrame
data = json.loads(json_data)
df_sample = pd.DataFrame(data)

print(df_sample)

                                                tweet  pred_label
0   idk who needs this just in case youre interest...           1
1   us's first case of novel coronavirus confirmed...           2
2   coronavirus is spreading rapidly in china 9 de...           0
3   i got food poisoning im going to die i got the...           0
4   really lots of people order bats soup and bats...           0
5         coronavirus is that when youve had too many           1
6           maybe theyll give you chinese coronavirus           0
7   us puts up coronavirus shield all passengers f...           2
8           the wuhan coronavirus outbreak hits korea           0
9            i have discovered the cure to lime wedge           2
10  death toll from the new outbreak in increases ...           0
11  17 die from china's over 470 affected amid glo...           0
12  here's what nh health officials and doctors ar...           2
13              new coronavirus airports on alert afp           2
14  its pr
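
Stripping backticks works for the reply shown above, but it would leave a language tag behind if the model answered with a fence such as "```json". A somewhat more tolerant sketch, assuming the reply contains exactly one JSON array, is shown below; `extract_json_array` is a hypothetical helper, not part of the Gemini SDK.

```python
import json
import re

def extract_json_array(reply: str) -> list:
    """Parse the JSON array found in a model reply, ignoring code fences."""
    # Greedy match from the first '[' to the last ']' across newlines
    match = re.search(r"\[.*\]", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON array found in reply")
    return json.loads(match.group(0))

reply = '```json\n[{"tweet": "example tweet", "pred_label": 1}]\n```'
print(extract_json_array(reply))  # [{'tweet': 'example tweet', 'pred_label': 1}]
```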

In [162]:
train_df.head(5)

Unnamed: 0,tweet,label
51004,coronavirus no new case in nigeria health mini...,0
11453,live coronavirus total confirmed cases around ...,1
9691,who declares global health emergency as wuhan ...,1
51992,trumps budget director responds to misinformat...,1
23531,chinese foreign minister attends munich securi...,0


In [163]:
sentiment_list = ['neg', 'neu', 'pos']
for i in range(20):
  # df_sample now holds only Gemini's labels, so read the dataset label from df (rows 420-439)
  print(f"{i}：{df_sample.iloc[i, 0]}, dataset label: {sentiment_list[int(df['label'].iloc[420 + i])]}， gemini label: {sentiment_list[int(df_sample.iloc[i, 1])]}")

0：idk who needs this just in case youre interested in the coronavirus news here is the link to the live update on the, dataset label: neu， gemini label: neu
1：us's first case of novel coronavirus confirmed in snohomish county washington stay informed by following the cdc, dataset label: neu， gemini label: pos
2：coronavirus is spreading rapidly in china 9 deaths so far 440 people infected, dataset label: neu， gemini label: neg
3：i got food poisoning im going to die i got the corona virus, dataset label: neu， gemini label: neg
4：really lots of people order bats soup and bats are spreaders of coronavirus, dataset label: neu， gemini label: neg
5：coronavirus is that when youve had too many, dataset label: pos， gemini label: neu
6：maybe theyll give you chinese coronavirus, dataset label: neu， gemini label: neg
7：us puts up coronavirus shield all passengers from wuhan to be 'funneled' to screening system at five major airport, dataset label: neu， gemini label: pos
8：the wuhan coronavirus outb
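
Beyond reading the rows one by one, the two label sets can be summarised with an agreement rate and a count of each (dataset, Gemini) pair. The sketch below does this for the first 12 rows of the 20-tweet sample, with both label lists copied by hand from the two `df_sample` printouts earlier in this section; it is an illustration on a tiny sample, not an evaluation of the relabeling.

```python
from collections import Counter

sentiment_list = ["neg", "neu", "pos"]

# Rows 420-431: original dataset labels and Gemini's predictions
dataset_labels = [1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1]
gemini_labels = [1, 2, 0, 0, 0, 1, 0, 2, 0, 2, 0, 0]

pairs = list(zip(dataset_labels, gemini_labels))
agreement = sum(d == g for d, g in pairs) / len(pairs)
print(f"agreement: {agreement:.2%}")  # agreement: 8.33%

# Count how often each (dataset, gemini) combination occurs
pair_counts = Counter(pairs)
for (d, g), n in sorted(pair_counts.items()):
    print(f"dataset={sentiment_list[d]} gemini={sentiment_list[g]}: {n}")
```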

In [37]:
df_sample = df.head(90)

df_sample['pred_label'] = ''

df_sample

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sample['pred_label'] = ''


Unnamed: 0,tweet,label,pred_label
0,coronavirus human coronavirus types cdc,1,
1,thats true corona virus swine flue bird flu in...,1,
2,tldr not sars possibly new coronavirus difficu...,0,
3,disease outbreak news from the who middle east...,1,
4,china media wsj says sources tell them mystery...,1,
...,...,...,...
85,pneumonia viral atualiza es neste post fonte t...,1,
86,human to human transmission of new coronavirus...,1,
87,china confirms human to human transmission of ...,1,
88,china confirms human to human transmission of ...,1,


In [38]:
# Convert the DataFrame to JSON using the to_json() method

json_data = df_sample[['tweet','pred_label']].to_json(orient='records')

# See what's in the JSON data
json_data



In [39]:
# Prompt Engineering
prompt = f"""
You are an expert linguist, who is good at classifying sentiments of tweets during COVID-19 pandemic into Positive /Neutral /Negative labels.
Help me classify customer reviews into: Positive (label=2), Neutral (label=1), and Negative (label=0).
Tweets during COVID-19 pandemic are provided between three back ticks.
In your output, only return the Json code back as output - which is provided between three backticks.
Your task is to update predicted labels under 'pred_label' in the Json code.
Don't make any changes to Json code format, please.
Error handling instruction: In case a Customer Review violates API policy, please assign it default sentiment as Negative (label=0).

```
{json_data}
```
"""

print(prompt)


You are an expert linguist, who is good at classifying sentiments of tweets during COVID-19 pandemic into Positive /Neutral /Negative labels.
Help me classify customer reviews into: Positive (label=2), Neutral (label=1), and Negative (label=0).
Tweets during COVID-19 pandemic are provided between three back ticks.
In your output, only return the Json code back as output - which is provided between three backticks.
Your task is to update predicted labels under 'pred_label' in the Json code.
Don't make any changes to Json code format, please.
Error handling instruction: In case a Customer Review violates API policy, please assign it default sentiment as Negative (label=0).

```
```



In [40]:
%%time
response_90 = gemini_model.generate_content(prompt)
response_90.text

CPU times: user 374 ms, sys: 47.2 ms, total: 421 ms
Wall time: 34.7 s




In [33]:
import json

# Clean the data by stripping the backticks
json_data = response_90.text.strip("`").strip()

json_data

                                                tweet  pred_label
0   regarding the corona virus quarantine a crowd ...           0
1     we can all have a little coronavirus as a treat           2
2   central bankers have no power against the econ...           0
3   cnn's and democrats ok with country getting co...           0
4   everyone in a high risk group for severe sympt...           1
5         very interesting take on the covid 19 virus           1
6   listening to coronavirus can not be removed wi...           0
7   chinas internal reports on coronavirus respons...           0
8   wuhan china coronavirus is spreading everywher...           0
9   say no to meat say no to corona virus allah di...           2
10  un biodiversity summit could move from china d...           1
11  surely the loo roll panic is down to self isol...           1
12  why government of india not putting traveling ...           0
13  so i think its time to take the corona virus s...           1
14  oh wel