<a href="https://colab.research.google.com/github/CDAC-lab/BUS5001-Resources/blob/main/Notebooks/TextAnalysis/Workshop_Analysis_TextBlob_HF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis & Comparison: TextBlob vs Hugging Face
This notebook loads text reviews from a CSV file and runs sentiment analysis with two tools:
- **TextBlob** (rule-based sentiment analysis)
- **Hugging Face Transformers** (deep learning-based classifier)

It then compares the two outputs and optionally runs zero-shot text classification.

## 1. Load Data
Ensure your CSV file contains a column named `'review'` with one review per row.

## Reuse Instructions
This notebook is designed to be reusable for any kind of text-based review dataset.

### How to Adapt It
1. Ensure your CSV file has a column with the text you want to analyze (e.g., `review`, `comment`, `feedback`).
2. Update the file name and column name in the notebook where the data is loaded:
   ```python
   df = pd.read_csv('your_file.csv')
   df = df.dropna(subset=['your_column_name'])
   df.rename(columns={'your_column_name': 'review'}, inplace=True)
   ```
3. You can now reuse the rest of the notebook without changes.
4. For very large datasets, consider applying the Hugging Face classifier on a sample (e.g., `.head(100)`) to avoid timeouts.
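The adaptation steps above can be wrapped in one helper. This is a minimal sketch, not part of the original notebook: the function name `prepare_reviews`, its parameters, and the in-memory CSV are hypothetical and only illustrate the load, drop-missing, rename, and optional sampling steps.

```python
import io

import pandas as pd


def prepare_reviews(source, text_column, sample_size=None, seed=0):
    """Load a CSV, drop rows with missing text, and expose the text as 'review'.

    `source` is anything pandas.read_csv accepts (a path or a file-like object).
    If `sample_size` is given and smaller than the data, a random sample is kept.
    """
    df = pd.read_csv(source)
    df = df.dropna(subset=[text_column])
    df = df.rename(columns={text_column: "review"})
    if sample_size is not None and sample_size < len(df):
        # A fixed seed keeps the sample reproducible between runs
        df = df.sample(n=sample_size, random_state=seed)
    return df.reset_index(drop=True)


# Hypothetical in-memory CSV standing in for 'your_file.csv';
# the row with id 2 has no feedback text and is dropped.
csv_text = (
    "id,feedback\n"
    "1,Great value.\n"
    "2,\n"
    "3,Arrived late.\n"
    "4,Works as described.\n"
)
sample_df = prepare_reviews(io.StringIO(csv_text), "feedback", sample_size=2)
print(sample_df)
```

Sampling with `df.sample` rather than `.head(100)` avoids bias when the file is sorted (for example by date or rating), at the cost of a different row order.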

### Optional: Replace Zero-Shot Labels
In the final section, change the candidate labels to suit your context:
```python
labels = ['sustainability', 'product design', 'usability', 'warranty']
```

In [1]:
# Install dependencies
!pip install pandas textblob transformers --quiet

In [2]:
import pandas as pd
from textblob import TextBlob
from transformers import pipeline

In [4]:
# Load the CSV file (upload it to the session or give a full path).
# The file needs a 'review' column holding the review text.
df = pd.read_csv('reviews.csv')
# Drop rows where the review text is missing
df = df.dropna(subset=['review'])
df.head()

Unnamed: 0,review
0,The product quality exceeded my expectations.
1,Customer support was unhelpful and rude.
2,Delivery was fast and efficient.
3,The price is too high for the value offered.
4,I'm extremely satisfied with the purchase!


## 2. Sentiment with TextBlob

In [5]:
# Map TextBlob's polarity score to a sentiment label
def analyze_textblob(text):
    # Polarity ranges from -1 (negative) to +1 (positive)
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.1:
        return "POSITIVE"
    elif polarity < -0.1:
        return "NEGATIVE"
    else:
        return "NEUTRAL"

df['sentiment_textblob'] = df['review'].apply(analyze_textblob)
df[['review', 'sentiment_textblob']].head()

Unnamed: 0,review,sentiment_textblob
0,The product quality exceeded my expectations.,NEUTRAL
1,Customer support was unhelpful and rude.,NEGATIVE
2,Delivery was fast and efficient.,POSITIVE
3,The price is too high for the value offered.,POSITIVE
4,I'm extremely satisfied with the purchase!,POSITIVE
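The labels above depend on the ±0.1 cutoff applied to the polarity score. As a sketch, the thresholding step can be separated from TextBlob so the cutoff is easy to tune and test. The function name `label_from_polarity` and the `threshold` parameter are illustrative assumptions, and the scores in the loop are made-up inputs, not TextBlob output.

```python
def label_from_polarity(polarity, threshold=0.1):
    """Convert a polarity score in [-1, 1] to a sentiment label.

    Scores strictly above +threshold are POSITIVE, strictly below
    -threshold are NEGATIVE, and everything in between is NEUTRAL.
    """
    if polarity > threshold:
        return "POSITIVE"
    if polarity < -threshold:
        return "NEGATIVE"
    return "NEUTRAL"


# Made-up polarity values, chosen to show the boundary behaviour
for score in (-0.6, -0.1, 0.0, 0.1, 0.35):
    print(f"{score:+.2f} -> {label_from_polarity(score)}")
```

With this split, `analyze_textblob` could call `label_from_polarity(TextBlob(text).sentiment.polarity)`, and a narrower or wider neutral band can be tried by changing `threshold` alone.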


## 3. Sentiment with Hugging Face Transformers

In [6]:
# Load the pre-trained Hugging Face pipeline for sentiment classification
hf_sentiment = pipeline("sentiment-analysis")

# Classify each review one at a time (may take a while on CPU)
# and store the label in uppercase
df['sentiment_huggingface'] = df['review'].apply(lambda x: hf_sentiment(x)[0]['label'].upper())
df[['review', 'sentiment_textblob', 'sentiment_huggingface']].head()

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


Unnamed: 0,review,sentiment_textblob,sentiment_huggingface
0,The product quality exceeded my expectations.,NEUTRAL,POSITIVE
1,Customer support was unhelpful and rude.,NEGATIVE,NEGATIVE
2,Delivery was fast and efficient.,POSITIVE,POSITIVE
3,The price is too high for the value offered.,POSITIVE,NEGATIVE
4,I'm extremely satisfied with the purchase!,POSITIVE,POSITIVE
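Calling the pipeline once per row works for a handful of reviews but is slow on larger files. A text-classification pipeline also accepts a list of strings and returns one result dict per input, so the reviews can be sent in chunks. The sketch below assumes that behaviour. `classify_in_batches` and `fake_classifier` are hypothetical names, and the fake classifier is a keyword stand-in so the example runs without downloading a model.

```python
def classify_in_batches(texts, classifier, batch_size=32):
    """Label texts by calling `classifier` on chunks of `batch_size` items.

    `classifier` is any callable that takes a list of strings and returns a
    list of dicts with a 'label' key, which is the shape a Hugging Face
    sentiment pipeline returns for list input.
    """
    labels = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        results = classifier(batch)
        labels.extend(result["label"].upper() for result in results)
    return labels


def fake_classifier(batch):
    """Stand-in for the real pipeline (NOT a sentiment model)."""
    return [
        {"label": "negative" if "not" in text.lower() else "positive", "score": 1.0}
        for text in batch
    ]


reviews = ["Excellent value for money.", "Would not recommend.", "Fast delivery."]
print(classify_in_batches(reviews, fake_classifier, batch_size=2))
```

In the notebook, the call would look like `df['sentiment_huggingface'] = classify_in_batches(df['review'].tolist(), hf_sentiment)`. To address the warning in the log, the pipeline can also be created with an explicit model, for example `pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")`, which is the default named in the output above.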


## 4. Compare the Results

In [8]:
# Flag the rows where both methods assign the same label
df['agree'] = df['sentiment_textblob'] == df['sentiment_huggingface']
agreement_rate = df['agree'].mean()
# Report the overall agreement rate and show the first 10 rows side by side
print(f"TextBlob and Hugging Face agree on {agreement_rate:.2%} of the reviews")
df[['review', 'sentiment_textblob', 'sentiment_huggingface', 'agree']].head(10)

TextBlob and Hugging Face agree on 80.00% of the reviews


Unnamed: 0,review,sentiment_textblob,sentiment_huggingface,agree
0,The product quality exceeded my expectations.,NEUTRAL,POSITIVE,False
1,Customer support was unhelpful and rude.,NEGATIVE,NEGATIVE,True
2,Delivery was fast and efficient.,POSITIVE,POSITIVE,True
3,The price is too high for the value offered.,POSITIVE,NEGATIVE,False
4,I'm extremely satisfied with the purchase!,POSITIVE,POSITIVE,True
5,It arrived late and the packaging was damaged.,NEGATIVE,NEGATIVE,True
6,This is the best phone I've ever owned.,POSITIVE,POSITIVE,True
7,The instructions were unclear and confusing.,NEGATIVE,NEGATIVE,True
8,Excellent value for money.,POSITIVE,POSITIVE,True
9,Terrible service. Would not recommend.,NEGATIVE,NEGATIVE,True
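A single agreement rate hides where the two methods differ. The sketch below counts each disagreeing label pair, using the ten rows shown in the table above as input. The helper name `agreement_breakdown` is an assumption for illustration. Because the default Hugging Face model only outputs POSITIVE or NEGATIVE, every NEUTRAL from TextBlob counts as a disagreement.

```python
from collections import Counter


def agreement_breakdown(labels_a, labels_b):
    """Return (agreement rate, Counter of disagreeing (a, b) label pairs)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both label lists must have the same length")
    pairs = list(zip(labels_a, labels_b))
    matches = sum(a == b for a, b in pairs)
    rate = matches / len(pairs) if pairs else 0.0
    disagreements = Counter(pair for pair in pairs if pair[0] != pair[1])
    return rate, disagreements


# Labels copied from the ten rows displayed above
textblob_labels = ["NEUTRAL", "NEGATIVE", "POSITIVE", "POSITIVE", "POSITIVE",
                   "NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]
hf_labels = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE",
             "NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]

rate, disagreements = agreement_breakdown(textblob_labels, hf_labels)
print(f"Agreement: {rate:.2%}")
for (tb_label, hf_label), count in disagreements.items():
    print(f"TextBlob={tb_label}, Hugging Face={hf_label}: {count}")
```

On the full DataFrame, `pd.crosstab(df['sentiment_textblob'], df['sentiment_huggingface'])` produces the same information as a table, including the agreeing cells.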


## 5. Text Classification (e.g. topic prediction)

In [9]:
# Load the zero-shot classification pipeline, which scores each review
# against the candidate labels without any task-specific training
classifier = pipeline("zero-shot-classification")
# Customize these labels depending on the topic of your reviews
labels = ["product quality", "customer service", "delivery", "pricing"]

# Apply to the first 5 reviews as a demo
for review in df['review'].head(5):
    # Labels come back sorted by score, so index 0 is the top prediction
    result = classifier(review, candidate_labels=labels)
    print(f"\nReview: {review}\nTop Label: {result['labels'][0]} with score {result['scores'][0]:.2f}")

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu



Review: The product quality exceeded my expectations.
Top Label: product quality with score 0.98

Review: Customer support was unhelpful and rude.
Top Label: customer service with score 0.98

Review: Delivery was fast and efficient.
Top Label: delivery with score 0.88

Review: The price is too high for the value offered.
Top Label: pricing with score 0.91

Review: I'm extremely satisfied with the purchase!
Top Label: product quality with score 0.85
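The top label is always one of the candidates, even when none of them fits the review well. One possible guard is a minimum score below which the review is assigned a fallback label. The sketch below assumes the result dict has the `labels` and `scores` keys used in the loop above, sorted from highest to lowest score. The names `top_label`, `min_score`, and `fallback`, and the example scores, are hypothetical.

```python
def top_label(result, min_score=0.5, fallback="other"):
    """Return (label, score) for the best candidate, or the fallback label
    when the best score is below `min_score`."""
    label = result["labels"][0]
    score = result["scores"][0]
    if score < min_score:
        return fallback, score
    return label, score


# Hypothetical results shaped like the zero-shot pipeline output
confident = {
    "sequence": "Delivery was fast and efficient.",
    "labels": ["delivery", "product quality", "customer service", "pricing"],
    "scores": [0.88, 0.06, 0.04, 0.02],
}
uncertain = {
    "sequence": "It is fine, I guess.",
    "labels": ["product quality", "pricing", "delivery", "customer service"],
    "scores": [0.31, 0.27, 0.22, 0.20],
}

print(top_label(confident))
print(top_label(uncertain))
```

If a review can belong to several topics at once, the pipeline also accepts `multi_label=True`, which scores each candidate label independently instead of making the scores sum to 1.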
