<a href="https://colab.research.google.com/github/CDAC-lab/BUS5001-Resources/blob/main/Notebooks/Using_A_HuggingFace_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, we'll install and import our required libraries. The transformers
library provides pre-trained models, datasets gives us access to the IMDB
reviews, and pandas helps us handle the data as a table.

In [None]:
!pip install transformers
!pip install datasets


In [None]:
# Import our required libraries
from transformers import pipeline  # For accessing pre-trained models
from datasets import load_dataset  # For loading the IMDB dataset
import pandas as pd               # For data manipulation

Understanding the Pipeline
-----------------------------------
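The pipeline abstraction from transformers wraps tokenization, model inference,
and post-processing in a single call, which makes pre-trained models easy to use.
We're using the 'zero-shot-classification' task, which assigns text to labels we
supply at call time, without any task-specific training.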

In [None]:
# Initialize our classifier
classifier = pipeline(
    task="zero-shot-classification",
    model="facebook/bart-large-mnli"  # BART fine-tuned on the MNLI inference dataset
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Data Preparation
-------------------------
Let's load a small sample of the IMDB test split and convert it to a pandas
DataFrame. We're keeping the sample tiny so the workshop runs quickly.

KEY CONCEPT: The .select() method takes a sequence of row indices and returns
only those rows, so range(5) gives us the first five reviews in the split

In [None]:
# Load and prepare our dataset
imdb_df = pd.DataFrame(
    load_dataset("imdb", split="test")
    .select(range(5))  # Taking 5 reviews for demonstration
)
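
Because .select() works on positions, range(5) always returns the first five
rows in stored order. If a split is stored sorted by label (an assumption worth
checking against the label column, not something this notebook verifies), those
rows would all share one class. The sketch below illustrates the difference with
a plain Python list standing in for the dataset; the toy labels are made up.

```python
import random

# Toy stand-in for a label-sorted split: 0 = negative, 1 = positive
labels = [0] * 10 + [1] * 10

# Equivalent in spirit to .select(range(5)): the first five positions
first_five = list(range(5))

# Seeded random positions, so the sample is reproducible between runs
shuffled_five = random.Random(42).sample(range(len(labels)), 5)

print([labels[i] for i in first_five])     # all drawn from the first block
print([labels[i] for i in shuffled_five])  # drawn from across the whole list
```

With the datasets library, the comparable pattern would be to call
.shuffle(seed=42) before .select(range(5)).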


Setting Up Classification Categories
--------------------------------------------
In zero-shot classification, the model has no fixed label set, so we supply the
candidate categories at call time. Adapting to a different classification need
is then a matter of changing this list, with no retraining.
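
To see where the scores come from: the zero-shot pipeline turns each candidate
label into a hypothesis sentence (by default something like "This example is
positive.") and asks the NLI model how strongly the text entails it. In
single-label mode the entailment scores are then normalized with a softmax
across the labels. The sketch below reproduces only that last step. The function
name zero_shot_scores and the logit values are made up for illustration and are
not real model output.

```python
import math

def zero_shot_scores(entailment_logits):
    """Softmax per-label entailment logits and rank labels by score.

    `entailment_logits` maps each candidate label to a logit. Here the
    values are invented; in the real pipeline they come from the NLI model.
    """
    peak = max(entailment_logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - peak) for label, v in entailment_logits.items()}
    total = sum(exps.values())
    ranked = sorted(exps.items(), key=lambda kv: kv[1], reverse=True)
    return {
        "labels": [label for label, _ in ranked],          # best label first
        "scores": [value / total for _, value in ranked],  # sums to 1
    }

# Hypothetical logits for one review
example = zero_shot_scores({"positive": 0.2, "negative": 2.9, "neutral": -0.5})
print(example["labels"][0], round(example["scores"][0], 2))
```

The returned dictionary mirrors the pipeline's output shape: labels and scores
are parallel lists sorted from most to least likely, which is why index 0 holds
the predicted sentiment.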

Creating a Sentiment Analysis Function
----------------------------------------------
We'll create a function that processes a single review. This is a common
Python pattern - creating small, focused functions that do one thing well.

KEY CONCEPT: Using dictionaries to return multiple values is a clean pattern


In [None]:
# Define our possible sentiment categories
candidate_labels = ["positive", "negative", "neutral"]

def analyze_sentiment(text):
    """
    Analyze the sentiment of a piece of text.

    Args:
        text (str): The text to analyze

    Returns:
        dict: Contains sentiment label and confidence score
    """
    # Process the text through our classifier
    result = classifier(text, candidate_labels)

    # Return a dictionary with our results
    return {
        'sentiment': result['labels'][0],      # The most likely sentiment
        'confidence': round(result['scores'][0], 2)  # How confident is the model
    }

Applying Our Analysis
------------------------------
Now we'll use pandas' apply function to process all our reviews.
This is more concise than writing an explicit loop.

KEY CONCEPT: Series.apply calls our function once per row. It is a tidy way to
express the loop, but it is not vectorized; the model call dominates the runtime
either way

Extracting Results
---------------------------
We'll use lambda functions and apply to extract our results into separate
columns. This demonstrates clean data transformation patterns.

In [None]:
# Process all reviews using pandas apply
results = imdb_df['text'].apply(analyze_sentiment)

# Extract results into separate columns
imdb_df['sentiment'] = results.apply(lambda x: x['sentiment'])
imdb_df['confidence'] = results.apply(lambda x: x['confidence'])
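
Series.apply hands the reviews to the classifier one at a time. The transformers
pipeline also accepts a list of texts and returns one result dictionary per
text, which allows a single batched call. The sketch below shows that shape
using stub_classifier, a hypothetical stand-in invented here so the example runs
without downloading the model. Its keyword rule is a toy, not a sentiment model.

```python
def stub_classifier(texts, candidate_labels):
    """Hypothetical stand-in for the pipeline: list of texts in, list of dicts out."""
    outputs = []
    for text in texts:
        # Toy rule, not a real model: texts containing "bad" rank "negative" first
        top = "negative" if "bad" in text.lower() else "positive"
        rest = [label for label in candidate_labels if label != top]
        outputs.append({
            "sequence": text,
            "labels": [top] + rest,                        # best label first
            "scores": [0.8] + [0.2 / len(rest)] * len(rest),
        })
    return outputs

candidate_labels = ["positive", "negative", "neutral"]
texts = ["A bad, boring film.", "Great fun from start to finish."]

# One call for the whole batch instead of one call per row
results = stub_classifier(texts, candidate_labels)

# Pull the top label and its rounded score out of each result
sentiments = [r["labels"][0] for r in results]
confidences = [round(r["scores"][0], 2) for r in results]
print(sentiments, confidences)
```

In this notebook the corresponding call would presumably be
classifier(imdb_df['text'].tolist(), candidate_labels), with the two lists
assigned directly to the sentiment and confidence columns.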


Displaying Results
---------------------------
Let's create a clean, formatted output of our results.
This demonstrates string formatting and iteration patterns.

In [None]:
print("\nSentiment Analysis Results:")
print("-" * 50)
for idx, row in imdb_df.iterrows():
    print(f"\nReview #{idx + 1}")
    # Using string slicing to show preview of long texts
    print(f"Text: {row['text'][:200]}...")
    print(f"Sentiment: {row['sentiment']}")
    print(f"Confidence: {row['confidence']}")
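
The slice [:200] appends "..." even when a review is shorter than 200
characters, and raw IMDB text contains literal <br /> tags (Review #4 in the
output shows one). A small helper can handle both. The name preview and its
limit parameter are hypothetical, introduced only for this sketch.

```python
import re

def preview(text, limit=200):
    """Return a one-line preview, adding '...' only if the text was cut."""
    cleaned = re.sub(r"<br\s*/?>", " ", text)        # replace HTML line breaks with spaces
    cleaned = re.sub(r"\s+", " ", cleaned).strip()   # collapse runs of whitespace
    if len(cleaned) <= limit:
        return cleaned
    return cleaned[:limit].rstrip() + "..."

print(preview("Great<br /><br />film"))
print(preview("a" * 10, limit=5))
```

Inside the loop, preview(row['text']) would take the place of the slice
expression.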


Sentiment Analysis Results:
--------------------------------------------------

Review #1
Text: I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Bab...
Sentiment: negative
Confidence: 0.93

Review #2
Text: Worth the entertainment value of a rental, especially if you like action movies. This one features the usual car chases, fights with the great Van Damme kick style, shooting battles with the 40 shell ...
Sentiment: positive
Confidence: 0.74

Review #3
Text: its a totally average film with a few semi-alright action sequences that make the plot seem a little better and remind the viewer of the classic van dam films. parts of the plot don't make sense and s...
Sentiment: negative
Confidence: 0.88

Review #4
Text: STAR RATING: ***** Saturday Night **** Friday Night *** Friday Morning ** Sunday Night * Monday Morning <br /><br />Former New Or