<a href="https://colab.research.google.com/github/IshaSarangi/Edureka_Notes/blob/main/Edureka_Sentiment_Analysis_using_Transformer_and_ABSA_07_Sep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://colab.research.google.com/drive/1j9x1T0LbvjqQVrHErjVJq97Wlj2M7rBk?usp=sharing

### Why Transformers for Sentiment Analysis?



*   Deep context understanding: captures complex semantics (the meaning of words in context)
*   Pretrained models: leverage knowledge from massive datasets and transfer it to our task
*   Flexible and adaptable: can be fine-tuned for a specific domain, language, or subtask
*   Multilingual: some models support multiple languages
*   Better handling of complex language features: sarcasm, emojis, idioms, etc.
*   State-of-the-art performance and accuracy

### Variants of BERT (Bidirectional Encoder Representations from Transformers):
*   DistilBERT: smaller, faster version of BERT
*   RoBERTa: robustly optimized BERT by Facebook AI
*   mBERT: Multilingual BERT
*   Twitter RoBERTa: fine-tuned on Twitter data

In [12]:
#Sample reviews/posts
social_media_posts = [
    "I love this product! It's amazing!",
    "HPV is one of main reason for cancer.",
    "smoking Tobacco lead to cancer.",
    "I'm so happy with the results!",
    "Absolutely terrible, would not recommend.",
    "Me encanta este producto! Es increíble!",
    "C'est le pire service que j'ai jamais eu.",
    "この商品が大好きです！素晴らしい！",
    "이 제품을 정말 좋아해요! 놀라워요!",
    "இந்த தயாரிப்பு அருமை! மிகுந்த மகிழ்ச்சி!",
    "यह उत्पाद बहुत अच्छा है! यह अद्भुत है!",
    "Das humane Papillomavirus (HPV) kann Krebs verursachen.",
    "Le tabac est l'une des principales causes du cancer.",
    "La dieta equilibrata aiuta a prevenire molte malattie.",
    "Trop de soleil peut causer le cancer de la peau.",
    "Rauchen erhöht das Krebsrisiko erheblich.",
    "Una corretta alimentazione è essenziale per la salute.",
    "L'esposizione al sole senza protection peut être dangereuse.",
    "Il virus HPV è una delle principali cause di cancro cervicale.",
    "HPV ist eine ernsthafte Bedrohung für die Gesundheit von Frauen.",
    "Fumer nuit gravement à la santé et peut provoquer un cancer."
]

In [13]:
from transformers import pipeline
import numpy as np

In [14]:
!pip install langdetect



In [15]:
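#Model 1: Basic BERT ('bert-base-uncased') - English only
#Note: Basic BERT is not fine-tuned for sentiment analysis tasks

from langdetect import detect, DetectorFactory

#langdetect is non-deterministic on short texts; a fixed seed makes results repeatable
DetectorFactory.seed = 0

#Keep only the posts detected as English
english_posts = [post for post in social_media_posts if detect(post) == 'en']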

print(english_posts)

bert_analyzer = pipeline('sentiment-analysis', model = 'bert-base-uncased')
bert_result = bert_analyzer(english_posts)

bert_confidence = [result['score'] for result in bert_result]
print("BERT Confidence Score: ", bert_confidence)

#print labelled results with confidence
for post, result in zip(english_posts, bert_result):
    label = result['label']
    score = result['score']
    print(f"Post: {post}\n Sentiment Label: {label}\nConfidence Score: {score:.2f}\n")
    print("-"*60)

["I love this product! It's amazing!", 'HPV is one of main reason for cancer.', 'smoking Tobacco lead to cancer.', "I'm so happy with the results!", 'Absolutely terrible, would not recommend.']


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


BERT Confidence Score:  [0.5308711528778076, 0.5371174812316895, 0.5428591370582581, 0.5396923422813416, 0.500452995300293]
Post: I love this product! It's amazing!
 Sentiment Label: LABEL_1
Confidence Score: 0.53

------------------------------------------------------------
Post: HPV is one of main reason for cancer.
 Sentiment Label: LABEL_0
Confidence Score: 0.54

------------------------------------------------------------
Post: smoking Tobacco lead to cancer.
 Sentiment Label: LABEL_0
Confidence Score: 0.54

------------------------------------------------------------
Post: I'm so happy with the results!
 Sentiment Label: LABEL_1
Confidence Score: 0.54

------------------------------------------------------------
Post: Absolutely terrible, would not recommend.
 Sentiment Label: LABEL_0
Confidence Score: 0.50

------------------------------------------------------------


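Note: Basic BERT is not fine-tuned for sentiment classification. Its classification head is newly initialized (see the warning above), so the labels are generic (LABEL_0, LABEL_1) and the confidence scores stay close to 0.5.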
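
The pattern in the output above (generic labels plus scores near chance) can be checked programmatically. The sketch below is a minimal illustration, not part of the transformers library: `looks_untrained`, `num_labels`, and `tolerance` are hypothetical names, and the 0.1 tolerance is an arbitrary assumption. The input uses the labels and rounded scores printed by the bert-base-uncased run.

```python
import re

# Hypothetical helper (not a transformers API): flags pipeline output that
# looks like it came from a newly initialized classification head.
GENERIC_LABEL = re.compile(r"^LABEL_\d+$")

def looks_untrained(results, num_labels=2, tolerance=0.1):
    """Return True if every label is generic and every score is near chance."""
    chance = 1.0 / num_labels
    generic = all(GENERIC_LABEL.match(r["label"]) for r in results)
    near_chance = all(abs(r["score"] - chance) <= tolerance for r in results)
    return generic and near_chance

# Labels and rounded scores copied from the bert-base-uncased output above
bert_output = [
    {"label": "LABEL_1", "score": 0.53},
    {"label": "LABEL_0", "score": 0.54},
    {"label": "LABEL_0", "score": 0.54},
    {"label": "LABEL_1", "score": 0.54},
    {"label": "LABEL_0", "score": 0.50},
]
print(looks_untrained(bert_output))
```

A check like this is only a heuristic: a fine-tuned model can also keep generic label names, so the "newly initialized" warning printed when the model loads remains the more reliable signal.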

Model 2: DistilBERT fine-tuned for sentiment analysis (on the English SST-2 dataset)

Model Name: 'distilbert-base-uncased-finetuned-sst-2-english'

In [16]:
#Load the sentiment analyzer model pipeline
sent_analyzer = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
#Analyze sentiment for each post
sent_result = sent_analyzer(social_media_posts)
#Store and show confidence score and sentiment label
for post, result in zip(social_media_posts, sent_result):
    print(f"Post:{post}\n Sentiment Label:{result['label']}\nConfidence Score:{result['score']:.2f}\n")
    print('-'*50)


Device set to use cuda:0


Post:I love this product! It's amazing!
 Sentiment Label:POSITIVE
Confidence Score:1.00

--------------------------------------------------
Post:HPV is one of main reason for cancer.
 Sentiment Label:NEGATIVE
Confidence Score:0.98

--------------------------------------------------
Post:smoking Tobacco lead to cancer.
 Sentiment Label:NEGATIVE
Confidence Score:1.00

--------------------------------------------------
Post:I'm so happy with the results!
 Sentiment Label:POSITIVE
Confidence Score:1.00

--------------------------------------------------
Post:Absolutely terrible, would not recommend.
 Sentiment Label:NEGATIVE
Confidence Score:1.00

--------------------------------------------------
Post:Me encanta este producto! Es increíble!
 Sentiment Label:NEGATIVE
Confidence Score:0.87

--------------------------------------------------
Post:C'est le pire service que j'ai jamais eu.
 Sentiment Label:NEGATIVE
Confidence Score:0.72

--------------------------------------------------
Post:

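*Note:* The model performs well on English but poorly on other languages, because it was fine-tuned only on English data (SST-2). For example, the positive Spanish review above is labeled NEGATIVE.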
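
One way to avoid sending non-English text to an English-only model is to group the posts by detected language first. The sketch below assumes a detector is passed in as a function; `group_by_language` and `toy_detect` are hypothetical names used only for illustration. In this notebook, `langdetect.detect` could be passed instead of the toy detector.

```python
from collections import defaultdict

def group_by_language(posts, detect):
    """Group posts by the language code returned by `detect`."""
    groups = defaultdict(list)
    for post in posts:
        try:
            lang = detect(post)
        except Exception:
            # Detectors may fail on empty or symbol-only text
            lang = "unknown"
        groups[lang].append(post)
    return dict(groups)

# Stand-in detector for illustration only: treats pure-ASCII text as English
def toy_detect(text):
    return "en" if text.isascii() else "other"

posts = [
    "I love this product! It's amazing!",
    "Me encanta este producto! Es increíble!",
]
groups = group_by_language(posts, toy_detect)
english_only = groups.get("en", [])
print(english_only)
```

Only `english_only` would then be passed to the English model, while the remaining groups could go to a multilingual model.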

In [17]:
#Model 3: RoBERTa - optimized BERT with improved pretraining
#Model name: 'roberta-base'
#Note: 'roberta-base' is the general pretrained checkpoint, not fine-tuned for sentiment analysis

#Load the sentiment analyzer model pipeline
sent_analyzer = pipeline('sentiment-analysis', model = 'roberta-base')
#Analyze sentiment for each post
sent_result = sent_analyzer(social_media_posts)
#Store and show confidence score and sentiment label

#The classification head is newly initialized, so this mapping is for readability only
label_mapping = {'LABEL_0': 'Negative', 'LABEL_1': 'Positive'}

for post, result in zip(social_media_posts, sent_result):
    label_id = result['label']
    label_name = label_mapping.get(label_id, label_id)
    print(f"Post: {post}\n Sentiment Label: {label_name}\n Confidence Score: {result['score']:.2f}\n")
    print('-'*50)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


Post: I love this product! It's amazing!
 Sentiment Label: Positive
 Confidence Score: 0.55

--------------------------------------------------
Post: HPV is one of main reason for cancer.
 Sentiment Label: Positive
 Confidence Score: 0.56

--------------------------------------------------
Post: smoking Tobacco lead to cancer.
 Sentiment Label: Positive
 Confidence Score: 0.56

--------------------------------------------------
Post: I'm so happy with the results!
 Sentiment Label: Positive
 Confidence Score: 0.55

--------------------------------------------------
Post: Absolutely terrible, would not recommend.
 Sentiment Label: Positive
 Confidence Score: 0.56

--------------------------------------------------
Post: Me encanta este producto! Es increíble!
 Sentiment Label: Positive
 Confidence Score: 0.55

--------------------------------------------------
Post: C'est le pire service que j'ai jamais eu.
 Sentiment Label: Positive
 Confidence Score: 0.56

----------------------------
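
The scores around 0.55 above are what a softmax produces when the two class logits are almost equal, which is expected from a newly initialized head. The worked example below uses made-up logits (an assumption chosen only to illustrate the arithmetic, not values read from the model).

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    top = max(logits)
    # Subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: a gap of only 0.2 between the two classes
probs = softmax([0.0, 0.2])
print([round(p, 2) for p in probs])  # [0.45, 0.55]
```

A fine-tuned model typically produces a much larger gap between logits, which is why its reported scores are close to 1.00.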

In [18]:
#Model 4: SiEBERT - RoBERTa-large fine-tuned for English binary sentiment analysis
#Model Name: 'siebert/sentiment-roberta-large-english'

#Load the sentiment analyzer model pipeline
sent_analyzer = pipeline('sentiment-analysis', model = 'siebert/sentiment-roberta-large-english')
#Analyze sentiment for each post
sent_result = sent_analyzer(social_media_posts)
#Store and show confidence score and sentiment label

#This model already returns readable labels ('POSITIVE'/'NEGATIVE'), so no mapping is needed
for post, result in zip(social_media_posts, sent_result):
    label_name = result['label']
    print(f"Post: {post}\n Sentiment Label: {label_name}\n Confidence Score: {result['score']: .2f}\n")
    print('-'*50)

Device set to use cuda:0


Post: I love this product! It's amazing!
 Sentiment Label: POSITIVE
 Confidence Score:  1.00

--------------------------------------------------
Post: HPV is one of main reason for cancer.
 Sentiment Label: POSITIVE
 Confidence Score:  0.98

--------------------------------------------------
Post: smoking Tobacco lead to cancer.
 Sentiment Label: NEGATIVE
 Confidence Score:  0.99

--------------------------------------------------
Post: I'm so happy with the results!
 Sentiment Label: POSITIVE
 Confidence Score:  1.00

--------------------------------------------------
Post: Absolutely terrible, would not recommend.
 Sentiment Label: NEGATIVE
 Confidence Score:  1.00

--------------------------------------------------
Post: Me encanta este producto! Es increíble!
 Sentiment Label: POSITIVE
 Confidence Score:  1.00

--------------------------------------------------
Post: C'est le pire service que j'ai jamais eu.
 Sentiment Label: NEGATIVE
 Confidence Score:  0.99

---------------------

In [19]:
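#Model 5: mBERT - Multilingual BERT fine-tuned for sentiment analysis on product reviews
#Model Name: 'nlptown/bert-base-multilingual-uncased-sentiment'
#Note: this model returns star ratings (1 to 5) instead of sentiment labels
#Note: it was fine-tuned on reviews in English, Dutch, German, French, Spanish, and Italian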

#Load the sentiment analyzer model pipeline
sent_analyzer = pipeline('sentiment-analysis', model = 'nlptown/bert-base-multilingual-uncased-sentiment')
#Analyze sentiment for each post
sent_result = sent_analyzer(social_media_posts)
#Store and show confidence score and sentiment label

#Map stars to sentiment labels
label_mapping = {
    '1 star': 'Very Negative',
    '2 stars': 'Negative',
    '3 stars': 'Neutral',
    '4 stars': 'Positive',
    '5 stars': 'Very Positive'
}

for post, result in zip(social_media_posts, sent_result):
    label_id = result['label']
    label_name = label_mapping.get(label_id, label_id)
    print(f"Post: {post}\n Sentiment Label: {label_name}\n Confidence Score: {result['score']: .2f}\n")
    print('-'*50)

Device set to use cuda:0


Post: I love this product! It's amazing!
 Sentiment Label: Very Positive
 Confidence Score:  0.95

--------------------------------------------------
Post: HPV is one of main reason for cancer.
 Sentiment Label: Negative
 Confidence Score:  0.30

--------------------------------------------------
Post: smoking Tobacco lead to cancer.
 Sentiment Label: Very Negative
 Confidence Score:  0.49

--------------------------------------------------
Post: I'm so happy with the results!
 Sentiment Label: Very Positive
 Confidence Score:  0.76

--------------------------------------------------
Post: Absolutely terrible, would not recommend.
 Sentiment Label: Very Negative
 Confidence Score:  0.95

--------------------------------------------------
Post: Me encanta este producto! Es increíble!
 Sentiment Label: Very Positive
 Confidence Score:  0.92

--------------------------------------------------
Post: C'est le pire service que j'ai jamais eu.
 Sentiment Label: Very Negative
 Confidence Score
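
To compare these star ratings with the binary models above, the five classes can be collapsed into three polarity classes. The sketch below is one possible convention, assumed here for illustration: `stars_to_polarity` is a hypothetical helper, and treating 3 stars as neutral is a design choice rather than something the model defines.

```python
def stars_to_polarity(label):
    """Map a label such as '4 stars' to 'Negative', 'Neutral', or 'Positive'."""
    # Labels look like '1 star' or '5 stars'; the first token is the rating
    stars = int(label.split()[0])
    if stars <= 2:
        return "Negative"
    if stars == 3:
        return "Neutral"
    return "Positive"

for label in ["1 star", "3 stars", "5 stars"]:
    print(label, "->", stars_to_polarity(label))
```

Applied to the pipeline output, this would be called as `stars_to_polarity(result['label'])` inside the loop.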

In [20]:
#Model 6: DistilRoBERTa - a distilled version of RoBERTa, faster and smaller
#Model Name: 'distilroberta-base'
#Note: This is the general pretrained checkpoint, not fine-tuned for sentiment analysis

#Load the sentiment analyzer model pipeline
#The classification head is newly initialized here, so the predictions are not meaningful
sent_analyzer = pipeline('sentiment-analysis', model = 'distilroberta-base')
#Analyze sentiment for each post
sent_result = sent_analyzer(social_media_posts)
#Store and show confidence score and sentiment label

#An untrained head returns generic labels (LABEL_0, LABEL_1)
#This mapping only improves readability; it carries no real sentiment meaning
label_mapping = {'LABEL_0': 'Negative', 'LABEL_1': 'Positive'}

for post, result in zip(social_media_posts, sent_result):
    label_id = result['label']
    label_name = label_mapping.get(label_id, label_id) # Use .get() to handle potential missing labels
    print(f"Post: {post}\n Sentiment Label: {label_name}\n Confidence Score: {result['score']: .2f}\n")
    print('-'*50)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


Post: I love this product! It's amazing!
 Sentiment Label: Positive
 Confidence Score:  0.53

--------------------------------------------------
Post: HPV is one of main reason for cancer.
 Sentiment Label: Positive
 Confidence Score:  0.53

--------------------------------------------------
Post: smoking Tobacco lead to cancer.
 Sentiment Label: Positive
 Confidence Score:  0.53

--------------------------------------------------
Post: I'm so happy with the results!
 Sentiment Label: Positive
 Confidence Score:  0.53

--------------------------------------------------
Post: Absolutely terrible, would not recommend.
 Sentiment Label: Positive
 Confidence Score:  0.53

--------------------------------------------------
Post: Me encanta este producto! Es increíble!
 Sentiment Label: Positive
 Confidence Score:  0.53

--------------------------------------------------
Post: C'est le pire service que j'ai jamais eu.
 Sentiment Label: Positive
 Confidence Score:  0.53

---------------------