Main resources:
- https://brighteshun.medium.com/sentiment-analysis-part-1-finetuning-and-hosting-a-text-classification-model-on-huggingface-9d6da6fd856b

In [19]:
#Import Libraries
import pandas as pd
from scipy.special import softmax
from sklearn.metrics import roc_auc_score, confusion_matrix, accuracy_score, classification_report

# Tokenizer and model classes used to load the fine-tuned checkpoint
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [2]:
test_data = pd.read_csv("test.csv")
test_data

Unnamed: 0,Cleaned_Review,is_negative_sentiment
0,Bangkok to Pune via Kolkata. As faultless fli...,0
1,"First, when trying to manage our booking, it d...",1
2,My flight from Bandung to Surabaya was on time...,0
3,Their on time performance is best in India and...,0
4,Flew Melbourne to Bangkok in Business class. W...,0
...,...,...
819,JQ 29 MEL-BKK on B787. First trip on the 'Drea...,1
820,We flew international from Sydney to Nadi (Fij...,0
821,General Santos to Manila. Cebu Pacific has one...,0
822,Singapore to Ho Chi Minh City. Appreciate the ...,0


In [4]:
sample = test_data.iloc[5]
sample_txt, sample_label = sample['Cleaned_Review'], sample['is_negative_sentiment']
print(sample_txt)
print(sample_label)

My trip to Singapore is only for 6 days and Scoot delayed the flight 1.5 days. Cant contact by phone from 11pm despite there is 24 hours contact. Cant get any information and can change the flight sooner because your manage is not updated. So angry because they delay this by 1.5 days without any repay, just apologies and we don't need their apologies. Worst airline ever.
1


In [5]:
def load_model(model_path):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    return model, tokenizer

In [6]:
output_dir = "finetune_sentiments_analysis_distilbert"
model, tokenizer = load_model(output_dir)

In [10]:
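# Process the input text and return the predicted probability of the "Negative" class
def is_negative_sentiment_score(text):
    # text = preprocess(text)
    # Tokenize; truncation=True clips reviews longer than the model's maximum input length
    encoded_input = tokenizer(text, truncation=True, return_tensors="pt")  # for PyTorch-based models
    output = model(**encoded_input)
    # Logits of the single input sequence, one value per class
    scores_ = output[0][0].detach().numpy()
    # Convert logits to probabilities that sum to 1
    scores_ = softmax(scores_)

    # Format output dictionary of scores
    # NOTE: assumes logit index 0 is the negative class; check model.config.id2label
    labels = ["Negative", "Positive"]
    scores = {l: float(s) for (l, s) in zip(labels, scores_)}
    return scores.get("Negative", 0.0)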
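
The scoring function relies on softmax to turn the model's two logits into probabilities. A minimal sketch of that step in plain Python follows; the logit values are made up for illustration and are not taken from the model.

```python
import math

def softmax_list(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                      # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one review, ordered [Negative, Positive]
example_logits = [4.0, -3.5]
example_probs = softmax_list(example_logits)
print(example_probs)
```

With two classes, the first probability equals a sigmoid applied to the difference of the two logits, which is why a large gap between them pushes the score close to 0 or 1.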

In [11]:
print(sample_txt)
print(sample_label)
is_negative_sentiment_score(sample_txt)

My trip to Singapore is only for 6 days and Scoot delayed the flight 1.5 days. Cant contact by phone from 11pm despite there is 24 hours contact. Cant get any information and can change the flight sooner because your manage is not updated. So angry because they delay this by 1.5 days without any repay, just apologies and we don't need their apologies. Worst airline ever.
1


0.9994150400161743

In [12]:
# Score every review in the 'Cleaned_Review' column of test_data
test_data['negative_sentiment_score'] = test_data['Cleaned_Review'].apply(is_negative_sentiment_score)
test_data

Unnamed: 0,Cleaned_Review,is_negative_sentiment,negative_sentiment_score
0,Bangkok to Pune via Kolkata. As faultless fli...,0,0.999379
1,"First, when trying to manage our booking, it d...",1,0.999231
2,My flight from Bandung to Surabaya was on time...,0,0.945872
3,Their on time performance is best in India and...,0,0.004759
4,Flew Melbourne to Bangkok in Business class. W...,0,0.033919
...,...,...,...
819,JQ 29 MEL-BKK on B787. First trip on the 'Drea...,1,0.999590
820,We flew international from Sydney to Nadi (Fij...,0,0.003954
821,General Santos to Manila. Cebu Pacific has one...,0,0.001756
822,Singapore to Ho Chi Minh City. Appreciate the ...,0,0.110545


In [21]:
true_labels = test_data['is_negative_sentiment']
pred_vals = test_data['negative_sentiment_score']

# Calculate the ROC AUC Score
roc_auc = roc_auc_score(true_labels, pred_vals)
print(f"ROC AUC Score: {roc_auc:.4f}")

ROC AUC Score: 0.9576
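
ROC AUC can be read as the probability that a randomly chosen negative-sentiment review (label 1) receives a higher score than a randomly chosen review with label 0. The sketch below computes that pairwise quantity directly on a small made-up sample; `pairwise_auc` and the toy data are illustrative names, not part of scikit-learn or of this notebook.

```python
def pairwise_auc(labels, scores):
    """Fraction of (label-1, label-0) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy data (not from test.csv)
toy_labels = [1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.1, 0.7, 0.3]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(f"Pairwise AUC: {toy_auc:.4f}")
```

Because the metric depends only on how the scores are ordered, it does not change with the decision threshold.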


In [22]:
# Decision threshold: predict negative sentiment (1) when the score is at least TH
TH = 0.5
test_data['sentiment_acc'] = test_data['negative_sentiment_score'].apply(lambda score: 1 if score >= TH else 0)
test_data

Unnamed: 0,Cleaned_Review,is_negative_sentiment,negative_sentiment_score,sentiment_acc
0,Bangkok to Pune via Kolkata. As faultless fli...,0,0.999379,1
1,"First, when trying to manage our booking, it d...",1,0.999231,1
2,My flight from Bandung to Surabaya was on time...,0,0.945872,1
3,Their on time performance is best in India and...,0,0.004759,0
4,Flew Melbourne to Bangkok in Business class. W...,0,0.033919,0
...,...,...,...,...
819,JQ 29 MEL-BKK on B787. First trip on the 'Drea...,1,0.999590,1
820,We flew international from Sydney to Nadi (Fij...,0,0.003954,0
821,General Santos to Manila. Cebu Pacific has one...,0,0.001756,0
822,Singapore to Ho Chi Minh City. Appreciate the ...,0,0.110545,0
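
The threshold of 0.5 is one choice among many. Raising it tends to trade recall for precision on the negative-sentiment class. The sketch below shows the effect on a small made-up sample; `precision_recall_at` and the toy data are hypothetical and only illustrate the mechanism.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall for class 1 when predicting 1 if score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical toy data (not from test.csv)
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.95, 0.80, 0.70, 0.40, 0.30, 0.05]

for th in (0.25, 0.5, 0.75):
    p, r = precision_recall_at(toy_labels, toy_scores, th)
    print(f"TH={th:.2f}  precision={p:.2f}  recall={r:.2f}")
```

The same loop could be run over the real `is_negative_sentiment` and `negative_sentiment_score` columns to pick a threshold that suits the application.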


In [23]:
# 'sentiment_acc' holds predicted labels (0 or 1); 'is_negative_sentiment' holds true labels (0 or 1)
pred_labels = test_data['sentiment_acc']
# scikit-learn expects (y_true, y_pred): rows are true labels, columns are predicted labels
confusion_matrix_result = confusion_matrix(true_labels, pred_labels)
print(confusion_matrix_result)

[[239  99]
 [  9 477]]


In [25]:
print(classification_report(true_labels, pred_labels))

              precision    recall  f1-score   support

           0       0.96      0.71      0.82       338
           1       0.83      0.98      0.90       486

    accuracy                           0.87       824
   macro avg       0.90      0.84      0.86       824
weighted avg       0.88      0.87      0.86       824



In [31]:
accuracy = accuracy_score(true_labels, pred_labels)
accuracy

0.8689320388349514
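
The precision, recall, and accuracy figures can be reproduced by hand from the four confusion-matrix counts. The sketch below treats label 1 (negative sentiment) as the positive class and reads the counts as TP = 477, FN = 9, FP = 99, TN = 239, which is the reading consistent with the supports of 338 and 486 in the classification report. The variable names are illustrative.

```python
# Counts taken from the confusion matrix, with label 1 as the positive class
tn, fp, fn, tp = 239, 99, 9, 477

precision_1 = tp / (tp + fp)   # of reviews predicted 1, how many are truly 1
recall_1 = tp / (tp + fn)      # of reviews truly 1, how many were predicted 1
precision_0 = tn / (tn + fn)   # of reviews predicted 0, how many are truly 0
recall_0 = tn / (tn + fp)      # of reviews truly 0, how many were predicted 0
acc = (tp + tn) / (tp + tn + fp + fn)

print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f}")
print(f"accuracy={acc:.4f}")
```

The low recall on class 0 comes from the 99 reviews with label 0 that scored at or above the threshold.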