# 📘 Notebook Summary: Sentiment Classification of IMDB Reviews using GPT-3.5-Turbo (Few-Shot Prompting)

This notebook performs **sentiment classification** of IMDB movie reviews using **few-shot prompting** with the `gpt-3.5-turbo` model via the **OpenAI API**. The dataset is sourced from **Hugging Face's `imdb` corpus**.

### 🔍 Key Components:
- **Dataset**: IMDB reviews with binary sentiment labels (`positive`, `negative`).
- **Few-shot Prompting**: The notebook constructs prompts with example reviews and their sentiments to guide the model.
- **OpenAI API Integration**: Uses `client.chat.completions.create` to classify test reviews.
- **Evaluation**: Computes the **micro F1 score** to evaluate GPT's performance on unseen reviews.
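
For single-label classification, micro-averaged F1 pools true positives, false positives, and false negatives across classes, which makes it numerically equal to accuracy. A minimal pure-Python sketch of that relationship follows; the helper name `micro_f1` and the toy labels are illustrative and not part of the notebook:

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label and truth != label:
                fp += 1
            elif pred != label and truth == label:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy labels, made up for illustration only.
y_true = ["positive", "negative", "positive", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "negative", "positive"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = micro_f1(y_true, y_pred)
```

Each misclassified review counts once as a false negative for its true class and once as a false positive for the predicted class, so the pooled score reduces to the fraction of correct predictions.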

### ⚙️ Techniques Used:
- Environment configuration via `.env` and `dotenv`.
- Prompt engineering for `gpt-3.5-turbo` in chat format.
- Use of `scikit-learn` for data splitting and F1 scoring.
- Sequential inference over the gold examples via the OpenAI API (`tqdm` is imported for progress monitoring).

This notebook can serve as a template for evaluating **LLM few-shot classification capabilities** on real-world, sentiment-labeled text data.


## Import Libraries and API Key

In [6]:
# Import the Python packages required to access the OpenAI API.
# Import additional packages required to load the dataset and create examples.

import json
import random
import tiktoken
import session_info

import pandas as pd
import numpy as np

import openai

from datasets import load_dataset
from collections import Counter
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score


In [7]:
import os
from dotenv import load_dotenv

load_dotenv()

# Create the client with the API key loaded from the .env file
client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

In [8]:
CHATGPT_MODEL = "gpt-3.5-turbo"
deployment_name = CHATGPT_MODEL

## Load and Process Dataset

In [9]:
imdb_reviews_corpus = load_dataset("imdb")

In [10]:
imdb_reviews_corpus

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [11]:
imdb_reviews_train_df = imdb_reviews_corpus['train'].to_pandas()

In [12]:
imdb_reviews_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    25000 non-null  object
 1   label   25000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 390.8+ KB


In [13]:
imdb_reviews_train_df.label.value_counts()

label
0    12500
1    12500
Name: count, dtype: int64

In [14]:
imdb_reviews_train_df.head()

| | text | label |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


In [15]:
imdb_reviews_train_df['sentiment'] = np.where(imdb_reviews_train_df.label == 1, "positive", "negative")

In [16]:
imdb_reviews_train_df.sample(5)

| | text | label | sentiment |
|---|---|---|---|
| 6671 | Herbie, the Volkswagen that thinks like a man,... | 0 | negative |
| 23913 | "Kolchak: the Night Stalker" is a hugely enter... | 1 | positive |
| 21140 | I have to start off by apologizing because I t... | 1 | positive |
| 19652 | Although Cinderella isn't the obvious choice f... | 1 | positive |
| 4281 | Rather than move linearly from beginning to en... | 0 | negative |


In [17]:
imdb_reviews_train_df.sentiment.value_counts()

sentiment
negative    12500
positive    12500
Name: count, dtype: int64

## Split Dataset to Train and Test

In [18]:
imdb_reviews_examples_df, imdb_reviews_gold_examples_df = train_test_split(
    imdb_reviews_train_df, #<- the IMDB training split (25,000 labeled reviews)
    test_size=0.2, #<- 20% random sample selected for gold examples
    random_state=42 #<- ensures that the splits are the same for every session
)

In [19]:
(imdb_reviews_examples_df.shape, imdb_reviews_gold_examples_df.shape)

((20000, 3), (5000, 3))

In [20]:
columns_to_select = ['text', 'sentiment']

In [21]:
gold_examples = (
        imdb_reviews_gold_examples_df.loc[:, columns_to_select]
                                     .sample(50, random_state=42) #<- ensures that gold examples are the same for every session
                                     .to_json(orient='records')
)

In [22]:
json.loads(gold_examples)[0]

{'text': 'Like I said at the top, four stars just aren\'t enough. It\'s one of the best films I\'ve ever seen in my almost 17 years of life. For the people that don\'t really like it or understand it, you must not have a real appreciation for art or you might have a short attention span.<br /><br />Even if I haven\'t seen all his films yet, I\'d have to say that this is Spielberg at his peak. It\'s pretty sad to see that movies as great as "The Color Purple" don\'t come along too often \'cause I think all of us are in desperate need of first-class motion picture entertainment in these hard times.<br /><br />Movies like this are more than just movies; they\'re pieces of art that need to be appreciated more.<br /><br />The idea that it was nominated for 11 Oscars (even Best Picture of the Year) and didn\'t get one trophy is a sign of how blind and stupid Hollywood can be sometimes. Spielberg wasn\'t even nominated for Best Director! It should have swept the Oscars that year.<br /><br />T

## Initialize System Message and User Message Template

In [23]:
user_message_template = """```{movie_review}```"""

In [24]:
cot_system_message = """
Classify the sentiment of movie reviews presented in the input as 'positive' or 'negative'.
Movie reviews will be delimited by triple backticks in the input.
Answer only 'positive' or 'negative'. Do not explain your answer.

Instructions:
1. Carefully read the text of the review and think through the options for sentiment provided
2. Consider the overall sentiment of the review and estimate the probability of the review being positive

To reiterate, your answer should strictly only contain the label: positive or negative.
"""

## Extract Positive and Negative Examples from Dataset

In [27]:
positive_reviews = (imdb_reviews_examples_df.sentiment == 'positive')
negative_reviews = (imdb_reviews_examples_df.sentiment == 'negative')

In [28]:
(positive_reviews.shape, negative_reviews.shape)

((20000,), (20000,))

In [29]:
def create_examples(dataset, n=4):

    """
    Return a JSON list of randomized examples of size 2n with two classes.
    Create subsets of each class, choose random samples from the subsets,
    merge and randomize the order of samples in the merged list.
    Each run of this function creates a different random sample of examples
    chosen from the training data.

    Args:
        dataset (DataFrame): A DataFrame with examples (text + label)
        n (int): number of examples of each class to be selected

    Output:
        randomized_examples (JSON): A JSON with examples in random order
    """

    positive_reviews = (dataset.sentiment == 'positive')
    negative_reviews = (dataset.sentiment == 'negative')
    columns_to_select = ['text', 'sentiment']

    positive_examples = dataset.loc[positive_reviews, columns_to_select].sample(n)
    negative_examples = dataset.loc[negative_reviews, columns_to_select].sample(n)

    examples = pd.concat([positive_examples, negative_examples])
    # sampling without replacement is equivalent to random shuffling
    randomized_examples = examples.sample(2*n, replace=False)

    return randomized_examples.to_json(orient='records')
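
Because `create_examples` draws a fresh random sample on each call, few-shot results can vary between runs. One way to make the selection repeatable is to pass a seed. The sketch below illustrates the idea on plain Python records rather than the DataFrame; `sample_balanced` and its `seed` parameter are hypothetical names, not part of the notebook:

```python
import random


def sample_balanced(records, n, seed=None):
    """Pick n records per sentiment class and shuffle them, reproducibly if seed is set."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    positives = [r for r in records if r["sentiment"] == "positive"]
    negatives = [r for r in records if r["sentiment"] == "negative"]
    chosen = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(chosen)
    return chosen


# Toy records, made up for illustration only.
toy_records = [
    {"text": f"review {i}", "sentiment": "positive" if i % 2 else "negative"}
    for i in range(20)
]

first = sample_balanced(toy_records, n=2, seed=42)
second = sample_balanced(toy_records, n=2, seed=42)
```

With pandas, a similar effect should be achievable by passing `random_state` to `DataFrame.sample`, as the notebook already does when drawing the gold examples.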

In [30]:
examples = create_examples(imdb_reviews_examples_df, 2)

In [31]:
json.loads(examples)

[{'text': "I'm shocked that there were people who liked this movie..I saw it at Tribeca and most of the audience laughed through it at scenes that were not meant to be funny. I felt bad because the lead actress was in the audience, but honestly the plot to this movie needed MAJOR revision..it didn't even make sense, one second the characters question what exactly it is that they're snorting..the next scene they're hopelessly addicted and figure out how to make it?? Also the ending just took the cake..I'm not going to spoil the magnificent conclusion..but it pretty much blended right in with the rest of the horrible plot/script...see this movie for comedy if you must..",
  'sentiment': 'negative'},
 {'text': 'This is easily a 9. Michel Serrault, known more for comic roles in the earlier part of his acting career, does a stunning, even worryingly stunning job of impersonating Dr Petiot, a legendary French serial killer.<br /><br />He is just so believable at every and any moment in the f

## Create Few Shot Prompt with Examples

In [32]:
def create_prompt(system_message, examples, user_message_template):

    """
    Return a prompt message in the format expected by the OpenAI API.
    Loop through the examples and parse them as user message and assistant
    message.

    Args:
        system_message (str): system message with instructions for sentiment analysis
        examples (str): JSON string with list of examples
        user_message_template (str): string with a placeholder for movie reviews

    Output:
        final_prompt (List): A list of dictionaries in the OpenAI prompt format
    """

    final_prompt = [{'role':'system', 'content': system_message}]

    for example in json.loads(examples):
        example_review = example['text']
        example_sentiment = example['sentiment']

        final_prompt.append(
            {
                'role': 'user',
                'content': user_message_template.format(
                    movie_review=example_review
                )
            }
        )

        final_prompt.append(
            {'role': 'assistant', 'content': f"{example_sentiment}"}
        )

    return final_prompt

In [33]:
cot_few_shot_prompt = create_prompt(
    cot_system_message,
    examples,
    user_message_template
)
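
The chat format produced by `create_prompt` is a system message followed by alternating user/assistant pairs, one pair per example. A small sanity check along these lines can catch a malformed prompt before any API call is made; `check_few_shot_prompt` is a hypothetical helper and the toy prompt is invented for illustration:

```python
def check_few_shot_prompt(messages):
    """Return the number of few-shot examples, or raise ValueError if the layout is off."""
    if not messages or messages[0]["role"] != "system":
        raise ValueError("prompt must start with a system message")
    turns = messages[1:]
    if len(turns) % 2 != 0:
        raise ValueError("each example needs a user message and an assistant reply")
    for i in range(0, len(turns), 2):
        user, assistant = turns[i], turns[i + 1]
        if user["role"] != "user" or assistant["role"] != "assistant":
            raise ValueError(f"example {i // 2 + 1} is not a user/assistant pair")
        if assistant["content"] not in {"positive", "negative"}:
            raise ValueError(f"example {i // 2 + 1} has an unexpected label")
    return len(turns) // 2


# Toy prompt (backtick delimiters around the reviews are omitted for brevity).
toy_prompt = [
    {"role": "system", "content": "Classify the sentiment as 'positive' or 'negative'."},
    {"role": "user", "content": "Loved every minute of it."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "A dull, lifeless mess."},
    {"role": "assistant", "content": "negative"},
]

n_examples = check_few_shot_prompt(toy_prompt)
```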

In [71]:
cot_few_shot_prompt

[{'role': 'system',
  'content': "\nClassify the sentiment of movie reviews presented in the input as 'positive' or 'negative'.\nMovie reviews will be delimited by triple backticks in the input.\nAnswer only 'positive' or 'negative'. Do not explain your answer.\n\nInstructions:\n1. Carefully read the text of the review and think through the options for sentiment provided\n2. Consider the overall sentiment of the review and estimate the probability of the review being positive\n\nTo reiterate, your answer should strictly only contain the label: positive or negative.\n"},
 {'role': 'user',
  'content': "```I didn't expect much from this film, but oh brother, what a stinker.<br /><br />I found this gem in a giant crate of awful $5 DVD's at Walmart (where else)? As cheap as this disc was, I feel ripped off. The special effects had a high school look to them, the camera work marred by wobbly tripods and sketchy lighting and the acting was a perfect example of the 'Christian School'. One can

## Token Calculation

In [None]:
def count_tokens_by_message(messages, model=CHATGPT_MODEL):
    encoding = tiktoken.encoding_for_model(model)

    tokens_per_message = 3  # approximate fixed overhead per chat message
    tokens_per_name = 1     # extra token when a message carries an optional 'name' field

    total_tokens = 0
    for idx, message in enumerate(messages):
        message_tokens = tokens_per_message  # per-message overhead
        for key, value in message.items():
            message_tokens += len(encoding.encode(value))  # role and content are both encoded
            if key == "name":
                message_tokens += tokens_per_name
        print(f"Message {idx+1} [{message['role']}]: {message_tokens} tokens")
        total_tokens += message_tokens

    total_tokens += 3  # priming for reply (<|start|>assistant)
    print("Reply priming: 3 tokens")
    print(f"\nTotal prompt tokens: {total_tokens}")
    return total_tokens


In [34]:
messages = cot_few_shot_prompt

In [35]:
count_tokens_by_message(messages)

Message 1 [system]: 109 tokens
Message 2 [user]: 147 tokens
Message 3 [assistant]: 5 tokens
Message 4 [user]: 149 tokens
Message 5 [assistant]: 5 tokens
Message 6 [user]: 180 tokens
Message 7 [assistant]: 5 tokens
Message 8 [user]: 520 tokens
Message 9 [assistant]: 5 tokens
Reply priming: 3 tokens

Total prompt tokens: 1128


1128
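
The token count matters because the few-shot prompt, the review being classified, and the reply must all fit within the model's context window. A rough budget check is sketched below; the window size is an assumption chosen for illustration, since the real limit depends on the model version and should be taken from the provider's documentation:

```python
def remaining_budget(prompt_tokens, max_completion_tokens, context_window):
    """Tokens left for the review to classify after the fixed prompt and the reply."""
    return context_window - prompt_tokens - max_completion_tokens


ASSUMED_CONTEXT_WINDOW = 4096  # assumption for illustration, not taken from the notebook

budget = remaining_budget(
    prompt_tokens=1128,        # total reported by count_tokens_by_message above
    max_completion_tokens=2,   # matches max_tokens in the prediction call
    context_window=ASSUMED_CONTEXT_WINDOW,
)
```

The remaining budget is only an estimate: the review's own message adds a few tokens of per-message overhead on top of its text, and a different random draw of few-shot examples changes the prompt size.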

## Predict and Evaluate Model by Micro F1 Score

In [36]:
def predict_and_compute_performance(prompt, gold_examples, user_message_template):

    """
    Return the micro-F1 score for predictions on gold examples.
    For each example, we make a prediction using the prompt. Gold labels and
    model predictions are aggregated into lists and compared to compute the
    F1 score.

    Args:
        prompt (List): list of messages in the OpenAI prompt format
        gold_examples (str): JSON string with list of gold examples
        user_message_template (str): string with a placeholder for movie reviews

    Output:
        micro_f1_score (float): Micro-F1 score computed by comparing model predictions
                                with ground truth
    """

    model_predictions, ground_truths = [], []

    for example in json.loads(gold_examples):
        gold_input = example['text']
        user_input = [
            {
                'role':'user',
                'content': user_message_template.format(movie_review=gold_input)
            }
        ]

        try:
            response = client.chat.completions.create(
                model=deployment_name,
                messages=prompt+user_input,
                temperature=0, # <- Note the low temperature
                max_tokens=2 # <- Note how we restrict the output to not more than 2 tokens
            )

            prediction = response.choices[0].message.content

            model_predictions.append(prediction.strip().lower()) # <- removes extraneous white space and lowercases output
            ground_truths.append(example['sentiment'])

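        except Exception as e:
            print(f"Skipping example due to error: {e}") # <- report failures instead of dropping them silently
            continue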

    micro_f1_score = f1_score(ground_truths, model_predictions, average="micro")

    return micro_f1_score
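
Because the reply is capped at two tokens and the model is not guaranteed to follow the instructions, a prediction may occasionally be something other than the two expected labels, which `f1_score` would then treat as an additional class. One option is to normalize replies before scoring. A sketch, with `normalize_label` and the `unknown` fallback as hypothetical choices rather than part of the notebook:

```python
VALID_LABELS = ("positive", "negative")


def normalize_label(raw_reply, fallback="unknown"):
    """Map a raw model reply to 'positive', 'negative', or a fallback value."""
    # Trim whitespace, lowercase, then drop surrounding quotes and periods.
    cleaned = raw_reply.strip().lower().strip("'\".")
    for label in VALID_LABELS:
        if cleaned.startswith(label):
            return label
    return fallback


examples_of_replies = [" Positive\n", "'negative'.", "I think"]
normalized = [normalize_label(reply) for reply in examples_of_replies]
```

Counting how many replies fall back to `unknown` also gives a quick signal of how often the model deviates from the requested output format.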

In [37]:
predict_and_compute_performance(cot_few_shot_prompt, gold_examples, user_message_template)

0.98
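
With only 50 gold examples, a score of 0.98 carries noticeable sampling uncertainty. Since micro-F1 coincides with accuracy in this single-label setting, a binomial confidence interval gives a rough sense of the spread. The sketch below computes a Wilson score interval, assuming all 50 examples were scored (none skipped by the exception handler); `wilson_interval` is an illustrative helper, not part of the notebook:

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 0.98 on 50 examples corresponds to 49 correct predictions, if none were skipped.
lower, upper = wilson_interval(successes=49, n=50)
```

Under these assumptions the lower bound lands near 0.90, so evaluating on a larger gold sample would be needed to pin the score down more tightly.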